July 28, 2021

Which anonymization techniques can protect your personal data?

What are the different anonymization techniques?

After distinguishing the concepts of anonymization and pseudonymization in a previous article, the Octopize team would like to take stock of the different existing techniques for anonymizing personal data.

Anonymization techniques

Before talking about data anonymization, it should be noted that pseudonymization must come first in order to remove any directly identifying attributes from the dataset: this is an essential first security step. Anonymization techniques then handle the quasi-identifying attributes. By combining them with a prior pseudonymization step, it is ensured that direct identifiers are taken care of and that all personal information relating to an individual is protected.

Secondly, as a reminder, anonymization consists of using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. Anonymization is irreversible, which implies that anonymized data are no longer considered personal data and therefore fall outside the scope of the GDPR.

To characterize anonymization, the EDPB (European Data Protection Board), formerly the Article 29 Working Party (G29), has set out three criteria to be met:

  • Individualization: is it still possible to isolate an individual?
  • Correlation: is it still possible to link records relating to the same individual?
  • Inference: can information about an individual be inferred?

The EDPB then defines two main families of anonymization techniques: randomization and generalization.

RANDOMIZATION

Randomization involves changing the attributes in a dataset so that they are less precise, while maintaining the overall distribution.

This technique protects the dataset from the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy.

Randomization example: permuting the dates of birth of individuals so as to alter the veracity of the information contained in a database.

GENERALIZATION

Generalization involves changing the scale of dataset attributes, or their order of magnitude, to ensure that they are common to a set of people.

This technique prevents the individualization of records in a dataset. It also limits the possible correlations of the dataset with others. Examples of generalization techniques include aggregation, k-anonymity, l-diversity and t-closeness.

Generalization example: in a file containing the date of birth of individuals, replacing this information with the year of birth only.

These different techniques address particular issues, each with its own advantages and disadvantages. We will detail the operating principle of each of these methods and illustrate, through concrete examples, the limits to which they are subject.

Which technique to use and why?

Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the privacy rights of the data subjects.

The randomization family:

1- Adding noise:

Principle: Modification of the attributes of the dataset to make them less accurate. Example: Following anonymization by adding noise, the age of patients is modified by plus or minus 5 years.
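As a rough illustration, here is a minimal Python sketch of noise addition on an age attribute (a toy example with made-up field names, not Octopize's implementation); the ±5-year amplitude matches the example above.

    import random

    def add_noise_to_age(records, amplitude=5, seed=None):
        """Return a copy of the records with uniform noise of +/- `amplitude`
        years added to the 'age' attribute. The overall distribution is roughly
        preserved, but individual values are no longer exact."""
        rng = random.Random(seed)
        noisy = []
        for record in records:
            perturbed = dict(record)
            perturbed["age"] = max(0, record["age"] + rng.randint(-amplitude, amplitude))
            noisy.append(perturbed)
        return noisy

    patients = [{"id": "A", "age": 34}, {"id": "B", "age": 52}, {"id": "C", "age": 47}]
    print(add_noise_to_age(patients, amplitude=5, seed=42))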

Strengths:

  • If noise addition is applied effectively, a third party will not be able to identify an individual nor will they be able to restore the data or otherwise discern how the data has been altered.
  • Relevant where attributes may have a significant negative effect on individuals.
  • Retains the general distribution.

Weaknesses:

  • The noise introduced alters the quality of the data, so the analyses performed on the dataset are less relevant.
  • The level of noise depends on the level of information required and the impact that the disclosure of attributes would have on the privacy of individuals.

Common mistakes:

  • Inconsistent noise addition: if the noise is not semantically viable (i.e. it is disproportionate or does not respect the logic between attributes in a set), or if the dataset is too sparse, the noise will not be enough to mask individuals.
  • Assuming that adding noise is sufficient: adding noise is a complementary measure that makes it more difficult for an attacker to recover the data; it should not be assumed to be a self-sufficient anonymization solution.

Example of failure:

Netflix case:

In the Netflix case, the initial database had been published as "anonymized" in accordance with the company's internal privacy policy, with all identifying information about users removed except ratings and dates.

In this case, it was possible to re-identify 68% of Netflix users by cross-referencing with a database external to Netflix. Users were uniquely identified in the dataset using 8 ratings and their dates, with a margin of error of 14 days, as selection criteria.
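To make the cross-referencing attack concrete, here is a hedged Python sketch of the general idea (a toy reconstruction with invented records, not the researchers' actual method): a pseudonymized ratings table is linked to an external source using only movie, rating and date, with the 14-day tolerance mentioned above.

    from datetime import date

    # "Anonymized" ratings: pseudonymous user, movie, rating, date.
    anonymized = [
        ("user_42", "Movie X", 5, date(2006, 3, 1)),
        ("user_42", "Movie Y", 3, date(2006, 3, 10)),
    ]

    # External source (e.g. public reviews) with real identities.
    external = [
        ("Alice", "Movie X", 5, date(2006, 3, 4)),
        ("Alice", "Movie Y", 3, date(2006, 3, 12)),
    ]

    def matches(anon_row, ext_row, tolerance_days=14):
        """Two ratings match if movie and score are equal and dates are close."""
        _, movie_a, rating_a, date_a = anon_row
        _, movie_e, rating_e, date_e = ext_row
        return (movie_a == movie_e and rating_a == rating_e
                and abs((date_a - date_e).days) <= tolerance_days)

    # Link each pseudonym to the external identity that matches all of its ratings.
    for pseudo in {row[0] for row in anonymized}:
        rows = [r for r in anonymized if r[0] == pseudo]
        for name in {row[0] for row in external}:
            hits = sum(any(matches(a, e) for e in external if e[0] == name) for a in rows)
            if hits == len(rows):
                print(f"{pseudo} is probably {name}")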

 

2- Permutation:

Principle:

Consists of mixing attribute values in a table in such a way that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the dataset by simply swapping them from one record to another. Example: As a result of permutation anonymization, the age of patient A has been replaced by that of patient J.
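The following minimal Python sketch (an illustrative toy with assumed field names) swaps the "age" attribute between records at random, which keeps the exact distribution of the column while breaking its link to each individual.

    import random

    def permute_attribute(records, attribute, seed=None):
        """Return a copy of the records in which the values of `attribute`
        are randomly reassigned between individuals. The column keeps exactly
        the same values and distribution, but the link to each record is lost."""
        rng = random.Random(seed)
        values = [record[attribute] for record in records]
        rng.shuffle(values)
        return [dict(record, **{attribute: value}) for record, value in zip(records, values)]

    patients = [{"id": "A", "age": 34}, {"id": "J", "age": 81}, {"id": "K", "age": 60}]
    print(permute_attribute(patients, "age", seed=7))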

Strengths:

  • Useful when it is important to keep the exact distribution of each attribute in the dataset.
  • Guarantees that the range and distribution of values will remain the same.

Weakness:

- Doesn’t allow for the preservation of correlations between values and individuals, thus making it impossible to perform advanced statistical analyses (regression, machine learning, etc.).

Common mistakes:

  • Selecting the wrong attribute: swapping non-sensitive or non-risky attributes does not provide a significant gain in terms of personal data protection, and if the sensitive attributes remain associated with their original records, an attacker will still be able to extract them.
  • Random swapping of attributes: If two attributes are highly correlated, randomly swapping the attributes will not provide strong guarantees.

Example of failure: the permutation of correlated attributes

In the example of Table 1, one intuitively tries to link salaries with occupations according to the correlations that seem logical.

Thus, the random permutation of attributes does not offer guarantees of confidentiality when there are logical links between different attributes.


Table 1. Example of inefficient anonymization by permutation of correlated attributes

 

3- Differential privacy:

Principle: Differential privacy consists of producing anonymized views of a dataset while retaining a copy of the original data.

The anonymized view is generated in response to a query from a third party, and noise is added to the result. To be considered "differentially private", the presence or absence of a particular individual in the dataset must not significantly change the result of the query.
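A classical way to obtain this property is the Laplace mechanism. The Python sketch below applies it to a simple counting query (a textbook illustration under an assumed privacy parameter epsilon, not a production implementation).

    import random

    def laplace_noise(scale, rng):
        """Laplace(0, scale) noise, drawn as the difference of two exponential draws."""
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    def private_count(dataset, predicate, epsilon, seed=None):
        """Answer 'how many records satisfy predicate?' with Laplace noise.
        A counting query has sensitivity 1: adding or removing one individual
        changes the true answer by at most 1, so the noise scale is 1/epsilon."""
        rng = random.Random(seed)
        true_count = sum(1 for record in dataset if predicate(record))
        return true_count + laplace_noise(1.0 / epsilon, rng)

    patients = [{"age": 34}, {"age": 52}, {"age": 47}, {"age": 61}]
    print(private_count(patients, lambda r: r["age"] > 50, epsilon=0.5, seed=1))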

Strength:

  • Adaptability: in contrast to the practice of sharing a dataset as a whole, the results of differential privacy queries can be provided on a case-by-case basis, depending on the requests and the authorized third parties, which facilitates data governance.

Weaknesses:

  • Doesn’t allow the dataset to be shared in its initial structure, thus limiting the range of analyses that can be performed.
  • Monitoring must be continuous (at least for each new query) to identify any possibility of identifying an individual in the query result set.
  • Does not directly modify the data as it is an after-the-fact addition of noise related to a query. The original data is therefore still present. As such, the results can also be considered as personal data.
  • To limit inference and correlation attacks, it is necessary to keep track of the queries submitted by an entity and to monitor the information obtained about the individuals involved. Databases queried under differential privacy should therefore not be deployed on open search engines that do not allow control over the identity of the searcher and the nature of their queries; this need for continuous control is a weakness of the method.

Common mistakes:

- Not injecting enough noise: in order to prevent links from being made with contextual knowledge, noise must be added. The challenge from a data protection perspective is to generate the appropriate level of noise to add to the real answers, so as to protect the privacy of individuals without undermining the utility of the data.

- Not allocating a privacy budget: it is necessary to keep a record of the queries made and to allocate a privacy budget that increases the amount of noise added if a query is repeated (see the sketch below).
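As a hedged illustration of this second point, the sketch below tracks a global privacy budget in Python and refuses to answer once the accumulated cost of the queries exceeds it; this is a deliberately simplified accounting scheme, not a complete differential privacy implementation.

    class PrivacyBudget:
        """Keep track of the privacy cost spent across queries and refuse
        to answer once the total budget is exhausted (simplified accounting)."""

        def __init__(self, total_epsilon):
            self.total_epsilon = total_epsilon
            self.spent = 0.0
            self.history = []  # record of (query description, epsilon) pairs

        def charge(self, query_description, epsilon):
            if self.spent + epsilon > self.total_epsilon:
                raise RuntimeError("Privacy budget exhausted: query refused")
            self.spent += epsilon
            self.history.append((query_description, epsilon))

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge("count of patients over 50", epsilon=0.5)
    budget.charge("count of patients over 50", epsilon=0.5)  # a repeated query still consumes budget
    # A third query of the same cost would now raise an error.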

Examples of failure:

- Independent processing of each query: without keeping a history of queries and adapting the noise level, the results of repeating the same query, or of combining several queries, could lead to the disclosure of personal information. An attacker could in fact carry out several queries that allow an individual to be isolated and one of their characteristics to be revealed. It should also be taken into account that differential privacy only answers one question at a time, so the original data must be kept throughout the defined use.

- Re-identification of individuals: differential privacy doesn't guarantee the non-disclosure of personal information. An attacker can re-identify individuals and reveal their characteristics using another data source or by inference. For example, in this paper (source: https://arxiv.org/abs/1807.09173), researchers from the Georgia Institute of Technology (Atlanta) developed an attack, called "membership inference", which re-identifies training (and therefore sensitive) data from a model trained with differential privacy. The researchers conclude that further research is needed to find a differential privacy mechanism that is both stable and viable against membership inference attacks. Differential privacy therefore doesn't appear to be a totally secure protection.

The generalization family:

1- Aggregation and k-anonymity:

Principle: Generalization of attribute values to such an extent that all the individuals in a group share the same value. These two techniques aim to prevent a data subject from being singled out by grouping them with at least k other individuals. Example: in order to have at least 20 individuals sharing the same value, the age of all patients aged between 20 and 25 is set to 23 years.
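As a minimal sketch of this idea (a toy with assumed attribute names, not a formal definition), the Python code below generalizes ages into brackets and truncates postal codes, then checks that every resulting group contains at least k individuals.

    from collections import Counter

    def generalize_age(age, bin_size=10):
        """Replace an exact age by its 10-year bracket, e.g. 34 -> '30-39'."""
        low = (age // bin_size) * bin_size
        return f"{low}-{low + bin_size - 1}"

    def is_k_anonymous(records, quasi_identifiers, k):
        """Check that every combination of quasi-identifier values is shared
        by at least k records."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    patients = [
        {"age": 34, "zip": "75001", "diagnosis": "flu"},
        {"age": 37, "zip": "75002", "diagnosis": "asthma"},
        {"age": 31, "zip": "75011", "diagnosis": "flu"},
    ]
    generalized = [dict(p, age=generalize_age(p["age"]), zip=p["zip"][:2] + "***") for p in patients]
    print(is_k_anonymous(generalized, ["age", "zip"], k=3))  # True: all share '30-39' and '75***'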

Strength:

  • Individualization: Once the same attributes are shared by k users, it should no longer be possible to isolate an individual within a group of k users.

Weaknesses:

  • Inference: k-anonymity doesn't prevent all types of inference attack. Indeed, if all the individuals in a group share the same value of an attribute, it is enough to know which group an individual belongs to in order to obtain the value of this attribute.
  • Loss of granularity: The data resulting from generalization processing necessarily loses finesse and sometimes consistency.

Common mistakes:

- Neglecting certain quasi-identifiers: the choice of the parameter k is key to the k-anonymity technique. The higher the value of k, the stronger the confidentiality guarantees. A common mistake, however, is to increase this parameter without considering all the variables: sometimes a single variable is enough to re-identify a large number of individuals and to make the generalization applied to the other quasi-identifiers useless.

- Low value of k: if k is too small, the weight of an individual within a group is too large and inference attacks are more likely to succeed. For example, if k = 2, the probability that the two individuals share the same property is greater than when k > 10.

- Not grouping individuals with similar weights: the parameter k must be adapted when variables are unbalanced in the distribution of their values.

Example of failure:

The main problem with k-anonymity is that it doesn't prevent inference attacks. In the following example, if the attacker knows that an individual is in the dataset and was born in 1964, he also knows that this individual had a heart attack. Furthermore, if it is known that this dataset was obtained from a French organization, it can be inferred that each of the individuals resides in Paris, since the first three digits of the postal codes are 750*.


Table 2. An example of poorly engineered k-anonymization

To overcome the shortcomings of k-anonymity, other aggregation techniques have been developed, notably l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring that each class contains at least l different values of the sensitive attribute (l-diversity) and that the classes created resemble the initial distribution of the data (t-closeness).
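A hedged sketch of the l-diversity check, in the same toy setting as above: for every group of records sharing the same quasi-identifier values, we count the distinct values of the sensitive attribute and require at least l of them.

    from collections import defaultdict

    def is_l_diverse(records, quasi_identifiers, sensitive_attribute, l):
        """Check that within each group of identical quasi-identifier values,
        the sensitive attribute takes at least l distinct values."""
        sensitive_values = defaultdict(set)
        for record in records:
            key = tuple(record[q] for q in quasi_identifiers)
            sensitive_values[key].add(record[sensitive_attribute])
        return all(len(values) >= l for values in sensitive_values.values())

    records = [
        {"birth_year": "1960s", "zip": "75***", "diagnosis": "heart attack"},
        {"birth_year": "1960s", "zip": "75***", "diagnosis": "heart attack"},
        {"birth_year": "1970s", "zip": "75***", "diagnosis": "flu"},
        {"birth_year": "1970s", "zip": "75***", "diagnosis": "asthma"},
    ]
    # The 1960s group is 2-anonymous but not 2-diverse: everyone had a heart attack.
    print(is_l_diverse(records, ["birth_year", "zip"], "diagnosis", l=2))  # False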

Note that despite these improvements, these techniques do not address the main weaknesses of k-anonymity presented above.

Thus, these different generalization and randomization techniques each have security advantages but do not always fully meet the three criteria set out by the EDPB, formerly G29, as shown in Table 3, "Strengths and weaknesses of the techniques considered", produced by the CNIL.


Table 3. Strengths and weaknesses of the techniques considered

Based on more recent anonymization techniques, synthetic data are now emerging as better anonymization solutions.

The case of synthetic data

Recent years of research have seen the emergence of solutions that generate synthetic records while retaining a high level of statistical relevance and facilitating the reproducibility of scientific results. They are based on the creation of models that capture and reproduce the global structure of the original data. A distinction is made between generative adversarial networks (GANs) and methods based on conditional distributions.
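As a rough illustration of the second family (methods based on conditional distributions), the Python sketch below fits a very simple model, the empirical distribution of age brackets and the distribution of diagnosis conditional on the bracket, and samples new synthetic records from it. This is a deliberately naive toy with assumed column names, not Octopize's avatar method.

    import random
    from collections import Counter, defaultdict

    def fit_conditional_model(records, key_attr, dependent_attr):
        """Estimate P(key_attr) and P(dependent_attr | key_attr) by counting."""
        key_counts = Counter(r[key_attr] for r in records)
        conditional = defaultdict(Counter)
        for r in records:
            conditional[r[key_attr]][r[dependent_attr]] += 1
        return key_counts, conditional

    def sample_synthetic(key_counts, conditional, n, key_attr, dependent_attr, seed=None):
        """Draw n synthetic records from the fitted distributions."""
        rng = random.Random(seed)
        keys, key_weights = zip(*key_counts.items())
        synthetic = []
        for _ in range(n):
            key = rng.choices(keys, weights=key_weights)[0]
            values, weights = zip(*conditional[key].items())
            synthetic.append({key_attr: key, dependent_attr: rng.choices(values, weights=weights)[0]})
        return synthetic

    originals = [
        {"age_bracket": "30-39", "diagnosis": "flu"},
        {"age_bracket": "30-39", "diagnosis": "asthma"},
        {"age_bracket": "60-69", "diagnosis": "heart attack"},
    ]
    key_counts, conditional = fit_conditional_model(originals, "age_bracket", "diagnosis")
    print(sample_synthetic(key_counts, conditional, n=5, key_attr="age_bracket",
                           dependent_attr="diagnosis", seed=3))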

Strength:

  • High level of guarantee in terms of preservation of the structure, finesse and statistical relevance of the data generated.

Weakness:

  • The models can generate synthetic data that are very close to, or even identical to, the original records. If an attacker links such synthetic data to an individual, the only defence is to argue that the attacker cannot prove this link, which can lead to a loss of trust from the people whose data it is.

The Avatar anonymization solution, developed by Octopize, uses a unique patient-centric design approach, allowing the creation of relevant and protected synthetic data while providing proof of protection. Its compliance has been demonstrated by the CNIL on the 3 EDPB criteria. Click here to learn more about avatars.

Rapid evolution of techniques

Finally, the CNIL (the French National Data Processing and Liberties Commission) reminds us that since anonymization and re-identification techniques are bound to evolve regularly, it is essential for any data controller concerned to keep a regular watch to preserve the anonymous nature of the data produced over time. This monitoring must take into account the technical means available and other sources of data that may make it possible to remove the anonymity of information.

The CNIL stresses that research into anonymization techniques is ongoing and clearly shows that no technique is, in itself, infallible.

Sources:
https://edpb.europa.eu/edpb_fr
https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
Membership inference attacks: https://arxiv.org/pdf/1807.09173.pdf
Netflix: https://arxiv.org/PS_cache/cs/pdf/0610/0610105v2.pdf
