What are the different anonymization techniques?
Having distinguished the concepts of anonymization and pseudonymization in a previous article, the Octopize team now takes stock of the different existing techniques for anonymizing personal data.
Before discussing data anonymization, it should be noted that pseudonymization must come first, in order to remove any directly identifying attributes from the dataset: this is an essential first security step. Anonymization techniques then handle the quasi-identifying attributes. Combined with a prior pseudonymization step, they ensure that direct identifiers are dealt with and that all personal information relating to an individual is protected.
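As a purely illustrative sketch of this first step (the column names and the salted-hash scheme below are assumptions, not a prescription), direct identifiers can be replaced by pseudonyms before any anonymization work begins:

```python
import hashlib

import pandas as pd

# Hypothetical patient records containing a direct identifier (name)
records = pd.DataFrame({
    "name": ["Alice Martin", "Bob Durand", "Chloe Petit"],
    "birth_year": [1964, 1971, 1989],
    "diagnosis": ["heart attack", "flu", "asthma"],
})

# The key must be stored securely, outside the dataset itself
SECRET_SALT = "replace-with-a-secret-kept-outside-the-dataset"

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a keyed hash (pseudonym)."""
    return hashlib.sha256((SECRET_SALT + name).encode()).hexdigest()[:12]

records["pseudonym"] = records["name"].map(pseudonymize)
records = records.drop(columns=["name"])  # remove the direct identifier
print(records)
```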
As a reminder, anonymization consists of using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. Anonymization is irreversible, which means that anonymized data are no longer considered personal data and therefore fall outside the scope of the GDPR.
To characterize anonymization, the EDPB (European Data Protection Board), formerly the Article 29 Working Party (G29), has set out 3 criteria to be respected, namely: individualization (it must not be possible to single out an individual in the dataset), correlation (it must not be possible to link separate records concerning the same individual), and inference (it must not be possible to deduce, with near certainty, new information about an individual).
The EDPB then defines two main families of anonymization techniques, namely randomization and generalization.
RANDOMIZATION
Randomization involves altering the attributes of a dataset so that they become less precise, while maintaining the overall distribution. This technique protects the dataset against the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy. Example of randomization: permuting the dates of birth of individuals so as to alter the veracity of the information contained in a database.
GENERALIZATION
Generalization involves changing the scale of the dataset's attributes, or their order of magnitude, so that they are shared by a set of people. This technique prevents an individual from being singled out in a dataset. It also limits the possible correlations between the dataset and other datasets. Examples of generalization techniques include aggregation, k-anonymity, l-diversity and t-closeness. Example of generalization: in a file containing individuals' dates of birth, replacing this information with the year of birth only.
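As a concrete illustration of the two examples above, the short sketch below (with invented data and column names) applies a random permutation to a birth-date column and, separately, generalizes the same column to the year of birth only:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "birth_date": pd.to_datetime(
        ["1964-05-02", "1971-11-30", "1989-07-14", "1995-01-09"]
    ),
})

# Randomization: permute the birth dates across records, breaking the link
# between each individual and their true value while keeping the overall
# distribution intact.
randomized = df.copy()
randomized["birth_date"] = (
    randomized["birth_date"].sample(frac=1, random_state=0).to_numpy()
)

# Generalization: keep only the year of birth, a coarser value shared by
# many individuals.
generalized = df.copy()
generalized["birth_year"] = generalized["birth_date"].dt.year
generalized = generalized.drop(columns=["birth_date"])
```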
Each of these techniques addresses certain issues and comes with its own advantages and disadvantages. We will therefore detail how these different methods work and illustrate, through concrete examples, the limits to which they are subject.
Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the privacy rights of the data subjects.
1- Noise addition:
Principle: modification of the attributes of the dataset to make them less accurate. Example: after anonymization by noise addition, the age of each patient is modified by plus or minus 5 years.
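A minimal sketch of this principle, assuming uniform noise of plus or minus 5 years as in the example above (a real deployment would calibrate the noise distribution to the data and the acceptable risk):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

ages = pd.Series([23, 34, 45, 56, 67], name="age")

# Add uniform integer noise in [-5, +5] to each age.
noise = rng.integers(-5, 6, size=len(ages))
noisy_ages = (ages + noise).clip(lower=0)
print(pd.DataFrame({"age": ages, "noisy_age": noisy_ages}))
```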
Strengths:
Weaknesses:
Common mistakes:
Failure case:
Netflix case:
In the Netflix case, the database had been released publicly as "anonymised" in accordance with the company's internal privacy policy: all identifying information about users had been removed, except their ratings and the dates of those ratings.
It proved possible to re-identify Netflix users by cross-referencing this release with an external database of public film reviews. Using 8 ratings and their dates (with a margin of error of 14 days) as selection criteria, 99% of the records could be uniquely identified; for 68% of users, just two ratings and their dates were sufficient.
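The mechanics of such a linkage attack can be sketched in a few lines: join the "anonymised" release with an auxiliary source on quasi-identifiers, here a movie, a rating and an approximate date. The data below is entirely invented and only illustrates the principle:

```python
import pandas as pd

# "Anonymised" release: user IDs removed, ratings and dates kept.
released = pd.DataFrame({
    "record": ["r1", "r2"],
    "movie": ["Movie A", "Movie A"],
    "rating": [5, 2],
    "date": pd.to_datetime(["2006-03-01", "2006-03-20"]),
})

# Auxiliary source (e.g. public reviews posted under real names).
auxiliary = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "movie": ["Movie A", "Movie A"],
    "rating": [5, 2],
    "date": pd.to_datetime(["2006-03-05", "2006-03-22"]),
})

# Link records that share the same movie and rating, with dates falling
# within a 14-day window of each other.
candidates = released.merge(auxiliary, on=["movie", "rating"], suffixes=("_rel", "_aux"))
candidates = candidates[
    (candidates["date_rel"] - candidates["date_aux"]).abs() <= pd.Timedelta(days=14)
]
print(candidates[["record", "name"]])  # each released record re-linked to a name
```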
2- Permutation:
Principle: consists of mixing attribute values in a table so that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the dataset simply by swapping them from one record to another. Example: after anonymization by permutation, the age of patient A has been replaced by that of patient J.
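A minimal sketch of permutation on an age column (invented data): the set of values is preserved, but each value is now attached to a different patient.

```python
import pandas as pd

df = pd.DataFrame({
    "patient": ["A", "B", "C", "J"],
    "age": [34, 51, 28, 62],
})

# Swap ages between records: the distribution of the column is unchanged,
# but the link between a patient and their true age is broken.
df["age"] = df["age"].sample(frac=1, random_state=7).to_numpy()
print(df)
```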
Strengths:
Weakness:
- Does not preserve the correlations between values and individuals, making it impossible to perform advanced statistical analyses (regression, machine learning, etc.).
Common mistakes:
Failure case: permutation of correlated attributes
In the example below (Table 1), one can intuitively try to re-link salaries with occupations according to the correlations that seem logical.
Thus, the random permutation of attributes does not offer guarantees of confidentiality when there are logical links between different attributes.
Table 1. Example of ineffective anonymization by permutation of correlated attributes
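A sketch of why this fails: if only the salary column is permuted while the occupation column is left in place, an attacker can often undo the permutation simply by matching salary ranks to the expected pay ordering of occupations (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["CEO", "intern", "engineer"],
    "salary": [250_000, 18_000, 60_000],
})

# Permute salaries independently of occupations.
permuted = df.copy()
permuted["salary"] = permuted["salary"].sample(frac=1, random_state=1).to_numpy()

# An attacker who knows the typical pay ordering (CEO > engineer > intern)
# can re-attribute the shuffled salaries by rank alone.
expected_order = ["CEO", "engineer", "intern"]  # attacker's background knowledge
recovered = pd.DataFrame({
    "occupation": expected_order,
    "recovered_salary": sorted(permuted["salary"], reverse=True),
})
print(recovered)  # matches the original salary-occupation pairs exactly
```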
3- Differential privacy:
Principle: differential privacy consists of producing anonymized views of a dataset while retaining a copy of the original data.
The anonymized view is generated in response to a query made by a third party on the database, and noise is added to the result. For a mechanism to be considered "differentially private", the presence or absence of any particular individual in the dataset must not significantly change the result of the query.
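As a hedged illustration, the sketch below implements the textbook Laplace mechanism for a counting query: since adding or removing one individual changes a count by at most 1, noise drawn from a Laplace distribution with scale 1/epsilon is enough to make the query differentially private.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count: true count + Laplace(sensitivity / epsilon).

    For a counting query, adding or removing one individual changes the
    result by at most 1, so the sensitivity is 1.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 34, 45, 56, 67, 29, 41]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))  # noisy answer to "how many are over 40?"
```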
Strength:
Weaknesses:
Common mistakes:
- Not injecting enough noise: noise must be added to prevent the responses from being linked to background knowledge about individuals. The challenge, from a data protection perspective, is to calibrate the level of noise added to the true responses so as to protect individuals' privacy without undermining the utility of the data.
- Not allocating a privacy budget: it is necessary to keep a record of the queries already made and to allocate a privacy budget, so that the amount of noise added increases, or further queries are refused, when a query is repeated (see the budget-tracking sketch after this list).
Failure cases:
- Treating each query independently: without keeping a history of the queries and adapting the noise level accordingly, repeating the same query, or combining several queries, could lead to the disclosure of personal information. An attacker could in fact run several queries that together isolate an individual and reveal one of his or her characteristics. It should also be borne in mind that differential privacy only answers one question at a time, so the original data must be retained for as long as the intended use lasts.
- Re-identification of individuals: differential privacy does not guarantee non-disclosure of personal information. An attacker can re-identify individuals and reveal their characteristics using another data source or by inference. For example, in this paper (source: https://arxiv.org/abs/1807.09173), researchers from the Georgia Institute of Technology (Atlanta) developed an algorithm, known as a "membership inference attack", which re-identifies training data (and therefore sensitive data) from a model trained with differential privacy. The researchers conclude that further research is needed to find a differential privacy mechanism that is both stable and viable against membership inference attacks. Differential privacy therefore does not appear to be a totally secure protection.
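To make the budget point above concrete, here is a naive sketch of a privacy-budget accountant that tracks the epsilon spent across queries and refuses to answer once the budget is exhausted (the class name and interface are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng()

class PrivacyBudget:
    """Naive epsilon accountant: refuse queries once the budget is spent."""

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def count(self, values, predicate, epsilon: float) -> float:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for v in values if predicate(v))
        return true_count + rng.laplace(scale=1.0 / epsilon)

budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 34, 45, 56, 67]
print(budget.count(ages, lambda a: a > 40, epsilon=0.5))
print(budget.count(ages, lambda a: a > 40, epsilon=0.5))
# A third identical query would raise an error, preventing an attacker
# from averaging away the noise simply by repeating the question.
```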
1- Aggregation and k-anonymity:
Principle: generalization of attribute values to the point where a whole group of individuals shares the same value. These two techniques aim to prevent a data subject from being singled out by grouping him or her with at least k other individuals. Example: so that at least 20 individuals share the same value, the age of every patient aged between 20 and 25 is set to 23 years.
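A minimal k-anonymity sketch, with invented data and an illustrative choice of 10-year age bands: exact ages are generalized into bands, then each resulting group (equivalence class) is checked to contain at least k records.

```python
import pandas as pd

K = 3  # a deliberately small k, just for the sketch

df = pd.DataFrame({
    "age": [21, 22, 23, 24, 25, 47, 48, 49],
    "diagnosis": ["flu", "asthma", "flu", "flu", "asthma", "flu", "asthma", "flu"],
})

# Generalize exact ages into 10-year bands.
df["age_band"] = pd.cut(df["age"], bins=range(0, 101, 10))

# k-anonymity check: every equivalence class on the quasi-identifier
# must contain at least K records.
class_sizes = df.groupby("age_band", observed=True).size()
print(class_sizes)
assert (class_sizes >= K).all(), "dataset is not k-anonymous for this k"
```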
Strength:
Weaknesses:
Common mistakes:
- Neglecting certain quasi-identifiers: the choice of k is the key parameter of the k-anonymity technique. The higher the value of k, the stronger the confidentiality guarantees. A common mistake, however, is to increase this parameter without considering all of the variables: sometimes a single overlooked variable is enough to re-identify a large number of individuals and render the generalization applied to the other quasi-identifiers useless.
- Choosing a low value of k: if k is too small, the weight of any one individual within a group is too large and inference attacks are more likely to succeed. For example, with k=2 the probability that the two individuals in a group share the same property is much greater than with k>10.
- Not grouping individuals of similar weight: the parameter k must be adapted when the distribution of a variable's values is unbalanced.
Failure case:
The main problem with k-anonymity is that it does not prevent inference attacks. In the following example, if the attacker knows that an individual is in the dataset and was born in 1964, he also knows that this individual had a heart attack. Furthermore, if it is known that this dataset was obtained from a French organisation, it can be inferred that each of the individuals resides in Paris, since the postal codes all begin with 750.
Table 2. An example of poorly engineered k-anonymization
To overcome the shortcomings of k-anonymity, other aggregation techniques have been developed, notably l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring, respectively, that each equivalence class contains at least l different values of the sensitive attribute (l-diversity) and that the distribution of the sensitive attribute within each class remains close to its distribution in the original data (t-closeness).
Note that, despite these improvements, they do not fully address the main weaknesses of k-anonymity presented above.
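As an illustration of the l-diversity check, the sketch below counts the number of distinct sensitive values in each equivalence class; a class with fewer than l distinct values corresponds exactly to the 1964 example above, where membership in the class reveals the diagnosis (data and column names are invented):

```python
import pandas as pd

L = 2  # required diversity of the sensitive attribute per class

df = pd.DataFrame({
    "birth_decade": ["1960s", "1960s", "1960s", "1980s", "1980s", "1980s"],
    "diagnosis": ["heart attack", "heart attack", "heart attack",
                  "flu", "asthma", "flu"],
})

# Number of distinct sensitive values within each equivalence class.
diversity = df.groupby("birth_decade")["diagnosis"].nunique()
print(diversity)

# Classes below L are vulnerable to the homogeneity attack: knowing that
# someone belongs to the class reveals their diagnosis (here, everyone
# born in the 1960s).
print(diversity[diversity < L])
```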
Thus, these different generalization and randomization techniques each have security advantages but do not always fully meet the three criteria set out by the EDPB (formerly G29), as shown in Table 3, "Strengths and weaknesses of the techniques considered", produced by the CNIL.
Table 3. Strengths and weaknesses of the techniques considered
Drawing on more recent anonymization research, synthetic data are now emerging as a better anonymization solution.
Recent years of research have seen the emergence of solutions that generate synthetic records while retaining a high degree of statistical relevance and facilitating the reproducibility of scientific results. They are based on models that capture and reproduce the global structure of the original data. A distinction is made between generative adversarial networks (GANs) and methods based on conditional distributions.
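As a toy illustration of the general idea, and not of any particular product, the sketch below fits a very simple parametric model (a multivariate Gaussian) to numeric data and samples new synthetic records from it; real generators based on GANs or conditional distributions are far more sophisticated, but the principle of learning the global structure of the data and sampling from it is the same.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Original numeric data: two correlated attributes (say, age and weight).
original = rng.multivariate_normal(mean=[45, 75], cov=[[100, 30], [30, 60]], size=500)

# Fit a simple model of the global structure: here, a multivariate
# Gaussian estimated from the original data.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Sample brand-new synthetic records from the fitted model. No synthetic
# row corresponds to a real individual, but the overall statistics
# (means, variances, correlation) are preserved.
synthetic = rng.multivariate_normal(mean, cov, size=500)
print(np.corrcoef(original, rowvar=False)[0, 1],
      np.corrcoef(synthetic, rowvar=False)[0, 1])
```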
Strength:
Weakness:
The Avatar anonymization solution, developed by Octopize, uses a unique patient-centric design approach, allowing the creation of relevant and protected synthetic data while providing proof of protection. Its compliance with the 3 EDPB criteria has been demonstrated by the CNIL. Click here to learn more about avatars.
Finally, the CNIL (the French data protection authority) reminds us that, since anonymization and re-identification techniques are bound to evolve regularly, it is essential for any data controller concerned to carry out regular monitoring in order to preserve the anonymous nature of the data produced over time. This monitoring must take into account the technical means available and the other sources of data that could make it possible to lift the anonymity of the information.
The CNIL stresses that research into anonymization techniques is ongoing and shows definitively that no technique is in itself infallible.