
Risk limitation in internal use

This use case covers precautionary measures in the use and governance of personal data: how can the risks of using such data internally be limited?


Between the collection of personal data and its use for a defined purpose, many risks related to infrastructure or user practices can lead to the leakage of personal information. Because of the volume of data they generate, healthcare institutions are prime targets for attackers, whether through the theft of hard drives or the recovery of email content or of data from unsecured workstations. Beyond the invasion of individual privacy, this type of incident seriously damages the reputation of these institutions and the trust placed in them. Yet the use of health data is essential to the daily work of many units (research, training, and so on).


Octopize's Avatar anonymization method creates synthetic datasets that protect the individuals behind the data while preserving its statistical value and original granularity. Avatar data is no longer considered personal data and can be shared safely, for example within a research unit. If this anonymized data is leaked, whether maliciously or accidentally, re-identifying patients is practically impossible.


An oncology research unit at a hospital wishes to improve its practices in the use of personal data while allowing its doctoral students to understand clinical health data in order to set up analyses.

The data come from a cohort of women with breast cancer whose tumor severity is determined from measurements taken on biopsies. The objective is to share this health data with PhD students so that they can understand the pathology without holding personal data on their workstations.

Data type

In this example, the personal data are a sample of 683 patients who underwent a biopsy. For each biopsy, 9 measurements were taken, together with a diagnosis of the tumor as benign or malignant.
  • 683 patients
  • 9 measurements / patient

Objectives of anonymization

  1. Make it impossible to re-identify individuals in the dataset.
  2. Retain the predictive capacity of the data for the PhD students.

How does Octopize ensure the anonymity of individuals?

The transformation of data by the Avatar solution is systematically accompanied by an evaluation of the privacy of the generated synthetic data using dedicated metrics. These metrics were developed to verify compliance with the 3 criteria identified by the European Data Protection Board (EDPB, formerly the Article 29 Working Party, or G29) for qualifying data as anonymous under the GDPR, namely:

  • Singling Out
  • Linkability
  • Inference.

From our example we obtain the following results:

  • Hidden rate: 86.38%
  • Local cloaking: 6
  • Correlation protection rate: not applicable; the information in the dataset (biopsy measurements) is unlikely to be available in an external database
  • Inference rate: not applicable; the information in the dataset (biopsy measurements) is unlikely to be available in an external database
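
To illustrate how metrics of this kind can be computed, here is a minimal Python sketch. It rests on assumptions, since the definitions are not given on this page: the hidden rate is taken to be the share of individuals whose nearest synthetic record is not the one generated from them, and local cloaking the median number of synthetic records that are closer to an individual than their own. The function name `privacy_metrics`, the Euclidean distance and the toy coordinates are illustrative choices, not Octopize's implementation.

```python
import math
from statistics import median


def privacy_metrics(originals, avatars):
    """Return (hidden rate in %, local cloaking) under the assumed definitions.

    Assumption: avatars[i] is the synthetic record generated from originals[i].
    """
    closer_counts = []
    for i, original in enumerate(originals):
        distances = [math.dist(original, avatar) for avatar in avatars]
        own_distance = distances[i]
        # Number of synthetic records that sit closer than the individual's own one.
        closer_counts.append(sum(1 for d in distances if d < own_distance))
    hidden = sum(1 for count in closer_counts if count > 0)
    hidden_rate = 100.0 * hidden / len(originals)
    return hidden_rate, median(closer_counts)


# Toy data (hypothetical): four individuals described by two measurements.
originals = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
avatars = [(1.0, 0.2), (0.1, 0.1), (5.2, 5.0), (9.0, 8.5)]

hidden_rate, local_cloaking = privacy_metrics(originals, avatars)
print(f"Hidden rate: {hidden_rate:.2f}%")
print(f"Local cloaking: {local_cloaking}")
```

On this toy data, the first two individuals are each closer to the other's synthetic record than to their own, so two of the four individuals are hidden.
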

The results obtained indicate that it is impossible in practice for an attacker to re-identify the individuals in the cohort.

Do the synthetic data generated provide the same results as the original data?

The aim is to verify whether the dataset anonymized by the Avatar method has retained its educational value and can be used by doctoral students to carry out analyses while respecting patient privacy.

Comparing the projections of the two datasets, each obtained through an independent dimension reduction step, shows that the data structure is preserved: individuals can be distinguished by tumor class in the same way in both datasets. This result is an indicator of the realism of the generated data, and suggests that the synthetic data offer the PhD students utility similar to that of the original data.
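
The projection comparison can be sketched as follows. This is a toy illustration only: the data are randomly generated (two well-separated classes, 9 features), the "synthetic" set is a noisy copy used as a stand-in and is not produced by the Avatar method, and PCA is assumed as the dimension reduction step (the page does not name one). The helpers `make_data`, `pca_2d` and `class_gap_on_pc1` are hypothetical names. The code uses only the standard library to stay self-contained; in practice a library such as scikit-learn would be the usual choice.

```python
import math
import random

rng = random.Random(0)


def make_data(n_per_class=100, n_features=9):
    """Toy cohort: class 0 centered on 2, class 1 centered on 8."""
    rows, labels = [], []
    for label, center in ((0, 2.0), (1, 8.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(center, 1.0) for _ in range(n_features)])
            labels.append(label)
    return rows, labels


def pca_2d(rows, iters=300):
    """Project rows on their first two principal axes (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflation: remove the axis just found before looking for the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


def class_gap_on_pc1(projected, labels):
    """Distance between the two class means along the first axis."""
    by_class = {0: [], 1: []}
    for point, label in zip(projected, labels):
        by_class[label].append(point[0])
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    return abs(mean0 - mean1)


original, labels = make_data()
# Stand-in for anonymized data (assumption): original values plus Gaussian noise.
synthetic = [[x + rng.gauss(0.0, 0.5) for x in row] for row in original]

# Each dataset gets its own, independent dimension reduction.
gap_original = class_gap_on_pc1(pca_2d(original), labels)
gap_synthetic = class_gap_on_pc1(pca_2d(synthetic), labels)
print(f"class gap on first axis, original:  {gap_original:.2f}")
print(f"class gap on first axis, synthetic: {gap_synthetic:.2f}")
```

If the two gaps are of the same order, the classes separate in the same way in both projections, which is the property described above. A scatter plot of the two projections, colored by class, would show the same thing visually.
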

To ensure that the predictive potential of the data is maintained, we compare the performance of several classification models using the following protocol: one model is trained on the original data and an identical model is trained on the avatar data. Both models are then tested on original data that was not used for training. This protocol is repeated several times for each model and for several different models. The models trained on synthetic data perform comparably to those trained on the original data, which shows that the predictive potential of the data is maintained after anonymization.
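
The evaluation protocol above can be sketched as follows. Everything here is an illustrative assumption: the cohort is randomly generated toy data, `perturb` is a noisy-copy stand-in rather than the Avatar method, and a single nearest-centroid classifier replaces the several models mentioned in the text.

```python
import math
import random

rng = random.Random(0)


def make_cohort(n_per_class, n_features=9):
    """Toy labeled cohort: class 0 centered on 2, class 1 centered on 8."""
    rows = []
    for label, center in ((0, 2.0), (1, 8.0)):
        for _ in range(n_per_class):
            rows.append(([rng.gauss(center, 1.0) for _ in range(n_features)], label))
    rng.shuffle(rows)
    return rows


def perturb(rows, scale=0.5):
    """Hypothetical stand-in for anonymization: NOT the Avatar method."""
    return [([x + rng.gauss(0.0, scale) for x in feats], label)
            for feats, label in rows]


def fit_centroids(rows):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for feats, label in rows:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, x in enumerate(feats):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def accuracy(centroids, rows):
    hits = 0
    for feats, label in rows:
        predicted = min(centroids, key=lambda c: math.dist(feats, centroids[c]))
        hits += predicted == label
    return hits / len(rows)


def run_once():
    cohort = make_cohort(150)
    split = int(0.7 * len(cohort))
    train, held_out = cohort[:split], cohort[split:]
    synthetic_train = perturb(train)
    # Both models are scored on original records never seen during training.
    acc_original = accuracy(fit_centroids(train), held_out)
    acc_synthetic = accuracy(fit_centroids(synthetic_train), held_out)
    return acc_original, acc_synthetic


# Repeat the protocol several times and average the scores.
results = [run_once() for _ in range(5)]
mean_original = sum(a for a, _ in results) / len(results)
mean_synthetic = sum(b for _, b in results) / len(results)
print(f"trained on original data:      {mean_original:.3f}")
print(f"trained on synthetic stand-in: {mean_synthetic:.3f}")
```

The key design point is that the held-out test set always consists of original records, so the two scores are directly comparable. Extending the sketch to several model families would mean wrapping `fit_centroids` and `accuracy` behind a common interface and looping over the models.
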


Transforming data into avatars makes the internal use of data both safer and easier. The data in circulation is no longer personal data, which avoids the risks associated with a malicious or accidental leak. After transformation, the data nonetheless remains useful for the purposes initially planned.

Other use cases

Data transfer outside the EU

This use case illustrates the problem of transferring personal data to a partner outside the European Union.

Repurposing a cohort for a new use

This use case illustrates the problem of reusing personal data for a new purpose that was not covered by the original consent.

Storage of spatio-temporal data

The "New York Taxi" use case presents a context of anonymization of spatio-temporal data. The difficulty lies in the particular nature of this data, where the combination of spatial and temporal dimensions accentuates the risk of re-identification.
© Octopize 2022