Reusing a cohort for a new purpose

This use case illustrates the problem of reusing personal data for a new purpose that was not covered by the consent given at the time of collection.

Context

To be processed lawfully, personal data must be collected on one of the legal bases defined in Article 6 of the GDPR (most often consent). As a result, any further processing not covered by the initial purpose requires new consent to be lawful, which in practice is difficult to obtain (deceased individuals, changed contact details, etc.). Many datasets with significant information potential are thus blocked because the new processing purposes could not be anticipated at the time of collection.

Solution

With its Avatar anonymization method, Octopize makes it possible to create a synthetic dataset that protects the individuals from whom the data originates, while preserving the statistical potential and granularity of the original data. Because the Avatar method is certified as a true anonymization solution within the meaning of the GDPR, the resulting synthetic dataset is no longer considered personal data and can be reused for any other purpose without constraint.

Example

Let's take the example of a pharmaceutical company that has set up a cohort to evaluate the influence of a dietary supplement on body fat percentage. A start-up wants to use this data to develop a diagnostic application that predicts individuals' fat mass more accurately than current measurement tools. However, transferring the personal data to the start-up is not compatible with the initial purpose of the collection. In addition, many individuals have changed their contact information since the cohort was created, making it difficult to obtain new consent.

Data type

The personal health data associated with this cohort is quantitative only and measures the percentage of body fat of patients based on other physiological measures.
  • 252 individuals
  • 15 variables

This dataset is anonymized with the Avatar method, which produces a new dataset with the same structure as the original (same number of individuals, same number of variables, same format).

Objectives of anonymization

In this use case, two main objectives can be identified.
  1. Make it impossible to re-identify individuals in the dataset.

  2. Preserve the predictive capacity of the data.

How does Octopize ensure the anonymity of individuals?

Every transformation of data by the Avatar method is accompanied by an evaluation of the security of the generated synthetic data, using dedicated metrics. These metrics were developed to verify compliance with the three criteria identified by the European Data Protection Board (EDPB, formerly the Article 29 Working Party, WP29) for qualifying data as anonymous under the GDPR, namely:

  • Singling Out
  • Linkability
  • Inference

From our example we obtain the following results:

  • Hidden rate: 89.71%
  • Local cloaking: 6
  • Correlation protection rate: 98.81% (reference variables: age, weight, height)
  • Inference rate: 26.78% (reference variables: age, weight, height; target: siri)

The results obtained indicate that it is impossible in practice for an attacker to re-identify the individuals in the cohort.
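For intuition, distance-based metrics of this kind can be sketched as follows. The definitions used here are assumptions, since this page does not define them: the hidden rate is taken to be the share of individuals whose closest synthetic record is not their own avatar, and local cloaking is taken to be the median number of synthetic records that lie closer to an individual than its own avatar. The function name `privacy_metrics` and the toy coordinates are hypothetical, and the sketch is not Octopize's implementation.

```python
import math
import statistics


def privacy_metrics(originals, avatars):
    """Return (hidden_rate_percent, local_cloaking) under the assumed definitions.

    originals[i] and avatars[i] are numeric tuples of equal length, and
    avatars[i] is assumed to be the synthetic record generated from originals[i].
    In practice the variables would be put on a comparable scale first.
    """
    cloaking = []
    for i, original in enumerate(originals):
        own_distance = math.dist(original, avatars[i])
        # Count the other synthetic records that are strictly closer to this
        # individual than the avatar that was generated from it.
        closer = sum(
            1
            for j, avatar in enumerate(avatars)
            if j != i and math.dist(original, avatar) < own_distance
        )
        cloaking.append(closer)

    # An individual is "hidden" when at least one other avatar is closer,
    # so a nearest-neighbour link would point to the wrong record.
    hidden_rate = 100 * sum(1 for c in cloaking if c > 0) / len(cloaking)
    local_cloaking = statistics.median(cloaking)
    return hidden_rate, local_cloaking


# Toy two-dimensional example (made-up values, not the cohort data).
toy_originals = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
toy_avatars = [(0.9, 0.1), (0.1, 0.1), (5.2, 4.9)]
hidden_rate, local_cloaking = privacy_metrics(toy_originals, toy_avatars)
print(f"Hidden rate: {hidden_rate:.2f}%, local cloaking: {local_cloaking}")
```

In this toy example the first two individuals each have another avatar closer than their own, while the third does not, so two out of three individuals are hidden. Under these assumed definitions, a higher hidden rate and a higher local cloaking both mean that a nearest-neighbour linkage attack is more likely to fail.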

Do the synthetic data generated provide the same results as the original data?

We want to verify whether the dataset anonymized by the Avatar method, and transmitted by the pharmaceutical company to the start-up, predicts the percentage of fat mass as well as the original dataset does.

The evaluation protocol consists of training two identical machine learning models (here, RandomForest), one on the original data and the other on its avatarized equivalent. Both models are then tested on the remaining original data.
The graph on the right shows the prediction quality for each training dataset. A model trained on avatars can therefore predict real-life values with the same performance as a model trained on the original data.
The second graph shows the importance of the variables in the regression models generated. Comparing the two models, trained on real-life data and on avatars respectively, shows that they are similarly interpretable: the variables that dominate the prediction of the target value are broadly the same.

Conclusion

Transforming data into avatars makes it possible to unlock the dormant potential of certain datasets: it enables or facilitates their transfer while respecting the privacy of individuals and largely preserving the statistical qualities of the original data.

Other use cases

Data transfer outside the EU

This use case illustrates the problem of transferring personal data to a partner outside the European Union.

Risk limitation in internal use

This use case deals with precautionary measures in the use and governance of personal data: how can the risks linked to internal use of this data be limited?

Storage of spatio-temporal data

The "New York Taxi" use case presents a context of anonymization of spatio-temporal data. The difficulty lies in the particular nature of this data, where the combination of spatial and temporal dimensions accentuates the risk of re-identification.
© Octopize 2022