The notion of anonymous data is surrounded by so many misunderstandings and misconceptions that the term "anonymous" means different things depending on who uses it.
To re-establish a shared understanding, the Octopize team wanted to discuss the differences between pseudonymization and anonymization, two notions that are often confused.
At first glance, the term "anonymization" evokes the notion of a mask, of concealment. One might then imagine that anonymization amounts to masking the directly identifying attributes of an individual (last name, first name, social security number). This shortcut is precisely the trap to avoid: masking these attributes is, in fact, pseudonymization.
The two concepts look similar, but there are major differences between them, from both a legal and a security point of view.
According to the CNIL, pseudonymization is "the processing of personal data in such a way that it is no longer possible to attribute the data to a natural person without additional information". It is one of the measures recommended by the GDPR to limit the risks related to the processing of personal data.
But pseudonymization is not a method of anonymization. It simply reduces the linkability of a data set with the original identity of a data subject, and is therefore a useful but not absolute security measure. Pseudonymization consists in replacing the directly identifying data in a data set (last name, first name, etc.) with indirectly identifying data (alias, sequence number, etc.), thus preventing the direct re-identification of individuals.
However, pseudonymization is not an infallible protection, because the identity of an individual can also be deduced from a combination of several pieces of information called quasi-identifiers. In practice, pseudonymized data can therefore still be re-identified indirectly by cross-referencing information: an individual's identity can be betrayed by his or her indirectly identifying characteristics. The transformation is thus reversible, which is why pseudonymized data are still considered personal data. To date, the most widely used pseudonymization techniques rely on secret-key cryptographic systems, hash functions, deterministic encryption, and tokenization.
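To make this concrete, here is a minimal Python sketch of pseudonymization with a keyed hash function (HMAC-SHA256), one of the families of techniques mentioned above. The field names, the sample record, and the secret key are all hypothetical and chosen only for illustration; this is not a description of any particular product's implementation.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"example-secret-key"

# Hypothetical names for the directly identifying fields.
DIRECT_IDENTIFIERS = ("name", "first_name", "social_security_number")


def pseudonymize(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Replace direct identifiers with a keyed-hash alias; keep everything else."""
    identity = "|".join(str(record[field]) for field in DIRECT_IDENTIFIERS)
    alias = hmac.new(key, identity.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    remaining = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    return {"alias": alias, **remaining}


# Fictional record, for illustration only.
patient = {
    "name": "Martin",
    "first_name": "Alice",
    "social_security_number": "0000000000000",
    "age": 34,
    "gender": "F",
    "zip_code": "44000",
    "diagnosis": "asthma",
}

pseudonymized = pseudonymize(patient)
print(pseudonymized)
```

The name and social security number disappear, but age, gender, and zip code pass through untouched. Those quasi-identifiers are exactly what makes indirect re-identification possible.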
The "AOL (America Online) case" is a typical example of the confusion between pseudonymization and anonymization. In 2006, a database containing 20 million keywords from the searches of more than 650,000 users over a period of three months was made public, with no privacy-preserving measure other than the replacement of the AOL user ID with a numerical attribute (pseudonymization).
Despite this treatment, the identity and location of some users were made public. Queries sent to a search engine, especially when they can be combined with other attributes such as IP addresses or other configuration parameters, have a very high potential for identification.
This incident is just one of many examples showing that a pseudonymized dataset is not anonymous: simply replacing the identity does not prevent an individual from being re-identified from quasi-identifying information (age, gender, zip code). In many cases, it can be as easy to identify an individual in a pseudonymized dataset as in the original data (as in the game "Guess Who?").
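The re-identification described here can be sketched as a simple join on quasi-identifiers. The example below is illustrative only: both data sets are fictional, and the "public register" stands in for any auxiliary information an attacker might hold.

```python
QUASI_IDENTIFIERS = ("age", "gender", "zip_code")

# Pseudonymized data set: direct identifiers replaced by an alias (fictional data).
pseudonymized_records = [
    {"alias": "u-001", "age": 34, "gender": "F", "zip_code": "44000", "diagnosis": "asthma"},
    {"alias": "u-002", "age": 51, "gender": "M", "zip_code": "75011", "diagnosis": "diabetes"},
    {"alias": "u-003", "age": 51, "gender": "M", "zip_code": "75011", "diagnosis": "migraine"},
]

# Auxiliary data an attacker might hold (fictional public register).
public_register = [
    {"full_name": "Alice Martin", "age": 34, "gender": "F", "zip_code": "44000"},
    {"full_name": "Bruno Petit", "age": 51, "gender": "M", "zip_code": "75011"},
]


def quasi_key(row: dict) -> tuple:
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(pseudonymized: list, register: list) -> dict:
    """Return {full_name: record} for people matching exactly one pseudonymized record."""
    by_key = {}
    for rec in pseudonymized:
        by_key.setdefault(quasi_key(rec), []).append(rec)
    reidentified = {}
    for person in register:
        matches = by_key.get(quasi_key(person), [])
        if len(matches) == 1:  # a unique match means the alias no longer protects anyone
            reidentified[person["full_name"]] = matches[0]
    return reidentified


reidentified = link(pseudonymized_records, public_register)
for name, rec in reidentified.items():
    print(f"{name} -> alias {rec['alias']}, diagnosis: {rec['diagnosis']}")
```

In this toy data, the first person is the only one with her combination of age, gender, and zip code, so her alias and diagnosis are recovered. The second person matches two records and stays ambiguous, which shows that the risk depends on how rare a combination of quasi-identifiers is.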
Anonymization consists in using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized data originate. This treatment is irreversible and means that the anonymized data are no longer considered personal data, and thus fall outside the scope of the GDPR. To characterize anonymization, the European Data Protection Board (EDPB, formerly the Article 29 Working Party, or WP29) relies on the three criteria set out in Opinion 05/2014 (source at foot of page):
- Individualization (singling out): anonymized data must not make it possible to single out an individual. Even with all the quasi-identifying information about an individual, it must be impossible to distinguish him or her in the database once it has been anonymized.
- Correlation (linkability): anonymized data must not be re-identifiable by cross-referencing it with other data sets. It must be impossible to link two data sets from different sources that concern the same individual. Once anonymized, an individual's health data should not be linkable to his or her banking data on the basis of common information.
- Inference: the data must not make it possible to reasonably infer additional information about an individual. For example, it must not be possible to determine with certainty the health status of an individual from anonymous data.
Only when these three criteria are met is data considered anonymous in the strict sense. Its legal status then changes: it is no longer considered personal data and falls outside the scope of the GDPR.
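As a rough illustration of what checking such criteria can look like, the sketch below groups records by their quasi-identifiers and flags two situations: groups of size one (an individual can be singled out) and groups whose members all share the same sensitive value (that value can be inferred with certainty). These are simplified proxies in the spirit of k-anonymity and l-diversity, not the evaluation procedure of the EDPB or the CNIL, and the column names and data are hypothetical. Linkability is not covered here.

```python
from collections import defaultdict

# Hypothetical column names for the quasi-identifiers and the sensitive attribute.
QUASI_IDENTIFIERS = ("age_range", "gender", "region")
SENSITIVE = "diagnosis"

# Fictional, already generalized records.
records = [
    {"age_range": "30-39", "gender": "F", "region": "West", "diagnosis": "asthma"},
    {"age_range": "30-39", "gender": "F", "region": "West", "diagnosis": "migraine"},
    {"age_range": "50-59", "gender": "M", "region": "Paris", "diagnosis": "diabetes"},
    {"age_range": "50-59", "gender": "M", "region": "Paris", "diagnosis": "diabetes"},
    {"age_range": "60-69", "gender": "F", "region": "South", "diagnosis": "asthma"},
]


def group_by_quasi_identifiers(rows: list) -> dict:
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in QUASI_IDENTIFIERS)].append(row)
    return groups


def singled_out(rows: list) -> list:
    """Quasi-identifier combinations held by exactly one record."""
    return [key for key, grp in group_by_quasi_identifiers(rows).items() if len(grp) == 1]


def inferable(rows: list) -> list:
    """Groups of two or more records that all share the same sensitive value."""
    return [
        key
        for key, grp in group_by_quasi_identifiers(rows).items()
        if len(grp) > 1 and len({r[SENSITIVE] for r in grp}) == 1
    ]


print("Singled out:", singled_out(records))
print("Inferable:", inferable(records))
```

Here the single woman aged 60-69 in the South can be singled out, and anyone known to be a man aged 50-59 in Paris can be inferred to have diabetes even though he cannot be told apart from the other member of his group.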
There are currently several families of anonymization methods, which we will detail in our next article. Most of these methods provide protection by degrading the quality, structure, or granularity of the original data, thus limiting its informational value after processing. The real challenge is to resolve the paradox between the legitimate protection of everyone's data and its exploitation for the benefit of all.
The Avatar anonymization method, developed by Octopize, is a unique anonymization method. It resolves the paradox between protecting patients' personal data and sharing this data for its informative value. The Avatar solution, which has been successfully evaluated by the CNIL, relies on synthetic data to ensure, on the one hand, the confidentiality of the original data (so that it can be shared without risk) and, on the other hand, the preservation of its informative value.