What are the criteria for considering data to be truly anonymous?

How to measure the anonymity of a database?

In the age of Big Data, personal data is an essential raw material for research and for the operation of many companies. Despite its great value, the use of this type of data necessarily carries a risk of re-identification and of leakage of sensitive information, even after prior pseudonymisation (see Article 1). For personal data, and sensitive data in particular, re-identification can be seen as a betrayal of the trust of the individuals from whom the data originate, especially when the data are used without clear and informed consent.

The implementation of the General Data Protection Regulation (GDPR) in 2018, and the French Data Protection Act before it, attempted to address this issue by initiating a change in the practices of collecting, processing and storing personal data. An independent European advisory body specialising in data protection has also been set up: the European Data Protection Board (EDPB), formerly the Article 29 Working Party (G29). Its published guidance (see our article on the G29) now serves as a reference for national supervisory authorities (the CNIL in France) in applying the GDPR.

The EDPB thus recognises the potential of anonymisation to extract value from personal data while limiting the risks for the individuals from whom they originate. As a reminder, data are considered anonymous if re-identification of the original individuals is impossible: anonymisation is therefore an irreversible process. However, the anonymisation methods developed to meet this need are not infallible, and their effectiveness often depends on many parameters (see Article 2). To use these methods optimally, the nature of anonymised data must be clarified further. In its Opinion 05/2014 on anonymisation techniques, the Article 29 Working Party (now the EDPB) identifies three criteria for determining whether re-identification is impossible, namely:

1. Individualisation: Is it still possible to isolate an individual?

The individualisation criterion corresponds to the most favourable scenario for an attacker, i.e. a person, malicious or not, seeking to re-identify an individual in a dataset. To be considered anonymous, a dataset must not allow an attacker to isolate a target individual. In a pseudonymised dataset, i.e. one stripped of its direct identifiers, the remaining quasi-identifying attributes, taken together, act like a barcode of an individual's identity. The more prior information an attacker has about the individual they are trying to identify, the more precise a query they can build to isolate that individual, and the higher the probability of re-identification. An example of an individualisation attack is shown in Figure 1.


Figure 1: Re-identification of a patient by individualisation in a dataset based on two attributes (Age, Gender)

A notable feature of this type of attack is the increased exposure of individuals with unusual characteristics. With only gender and height information, it will be easier for an attacker to isolate a woman who is 2 metres tall than a man who is 1.75 metres tall.
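As an illustration, here is a minimal sketch in Python (pandas) of such an individualisation query on a hypothetical pseudonymised table; the column names and values are invented for the example. If the filter returns exactly one record, the target is singled out and their sensitive attribute is exposed.

```python
import pandas as pd

# Hypothetical pseudonymised dataset: no names, only quasi-identifiers
df = pd.DataFrame({
    "pseudo_id": ["p01", "p02", "p03", "p04"],
    "gender":    ["F",   "M",   "M",   "F"],
    "height_cm": [200,   175,   176,   168],
    "diagnosis": ["asthma", "diabetes", "asthma", "flu"],
})

# Attacker's prior knowledge: the target is a woman about 2 metres tall
candidates = df[(df["gender"] == "F") & (df["height_cm"] >= 195)]

if len(candidates) == 1:
    # Exactly one match: the individual is isolated and the diagnosis disclosed
    print("Target singled out:", candidates.iloc[0].to_dict())
```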

 

2. Correlation: Is it still possible to link records relating to an individual?

Correlation attacks are the most common scenario, so meeting the correlation criterion is essential for data to be considered anonymous. Between the democratisation of Open Data and the numerous incidents linked to personal data leaks, the amount of data available has never been so large. These databases containing personal, sometimes directly identifying, information give attackers opportunities to attempt re-identification by cross-referencing. In practice, a correlation attack cross-references the attacked database with a directly identifying database that contains similar information, as illustrated in Figure 2.


Figure 2: Illustration of a correlation attack. The directly identifying external database (top) is used to re-identify individuals in the attacked database (bottom). The correlation is done on the basis of common variables.

In the case of the tables illustrated in Figure 2, the attacker succeeds in re-identifying the 5 individuals in the pseudonymised database thanks to the two attributes common to both databases. Re-identification also lets them infer new sensitive information about the patients, namely the pathology affecting them. In this context, the more information the databases have in common, the higher the probability of re-identifying an individual by correlation.
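A minimal sketch of such a correlation (linkage) attack, assuming a hypothetical external register and a pseudonymised table that happen to share the age and postcode columns:

```python
import pandas as pd

# Hypothetical directly identifying external source (e.g. a public register)
external = pd.DataFrame({
    "name":     ["Alice Martin", "Bob Durand"],
    "age":      [34, 51],
    "zip_code": ["75011", "44000"],
})

# Pseudonymised table under attack: same quasi-identifiers plus a sensitive column
attacked = pd.DataFrame({
    "pseudo_id": ["x17", "x42"],
    "age":       [34, 51],
    "zip_code":  ["75011", "44000"],
    "pathology": ["cancer", "hepatitis"],
})

# Joining on the shared variables re-attaches identities to the sensitive data
linked = external.merge(attacked, on=["age", "zip_code"])
print(linked[["name", "pathology"]])
```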

 

3. Inference: Can information about an individual be inferred?

The third and last criterion identified by the EDPB is probably the most complex to assess: the criterion of inference. For data to be considered anonymous, it must be impossible to infer, with a high degree of certainty, new information about an individual. For example, if a dataset contains information on the health status of participants in a clinical study and all the men over 65 in this cohort have lung cancer, then the health status of certain participants can be inferred: knowing that a man over 65 took part in the study is enough to conclude that he has lung cancer.

Inference attacks are particularly effective against groups of individuals who all share the same value for a sensitive attribute. If the inference succeeds, the disclosure of the sensitive attribute affects the entire identified group.
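The lung-cancer example above can be checked mechanically: if an identifiable group contains a single value of the sensitive attribute, the inference succeeds for every member of the group. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical clinical dataset
df = pd.DataFrame({
    "gender":    ["M", "M", "F", "M"],
    "age":       [67, 71, 70, 45],
    "diagnosis": ["lung cancer", "lung cancer", "asthma", "flu"],
})

# All men over 65: if they share a single diagnosis, it can be inferred
group = df[(df["gender"] == "M") & (df["age"] > 65)]
if group["diagnosis"].nunique() == 1:
    print("Inferred diagnosis for any man over 65 in the study:",
          group["diagnosis"].iloc[0])
```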

These three criteria identified by the EDPB cover the main threats to data after processing intended to protect it. If all three criteria are met, then the processing can be considered anonymisation in the true sense of the word.

Can current techniques satisfy all three criteria?

Randomisation and generalisation techniques each have advantages and disadvantages with respect to each criterion (see Article 2). Figure 3, taken from the Opinion on anonymisation techniques published by the former G29, summarises how well several anonymisation techniques meet the criteria.


Figure 3: Strengths and weaknesses of the techniques considered

 

It is clear that none of these techniques meets all three criteria simultaneously, so each should be used with caution in its most appropriate context. Beyond the methods evaluated, synthetic data appear to be a promising alternative for meeting all three criteria. However, methodologies for producing synthetic data face the challenge of proving this protection. At present, all synthetic data generation solutions rely on the principle of plausible deniability: if a synthetic record happens to resemble an original record, the defence is that it is impossible to prove that the synthetic record is related to that original record. At Octopize, we have developed a unique methodology for producing synthetic data while quantifying and proving the protection provided. This evaluation relies on metrics developed specifically to measure how well the criteria of individualisation, correlation and inference are met. We will cover metrics for assessing the quality and security of synthetic data in more detail in another article.

What anonymization techniques to protect your personal data?

What are the different anonymization techniques?

Having distinguished the concepts of anonymization and pseudonymization in a previous article, the Octopize team would now like to take stock of the different existing techniques for anonymizing personal data.

Anonymization techniques

Before discussing data anonymization, note that pseudonymization must come first, to remove any directly identifying elements from the dataset: it is an essential first security step. Anonymization techniques then deal with the quasi-identifying attributes. By combining them with a prior pseudonymization step, we ensure that direct identifiers are taken care of and that all personal information relating to an individual is protected.

Secondly, as a reminder, anonymization consists in using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. Anonymization is irreversible, which implies that anonymized data are no longer considered personal data and thus fall outside the scope of the GDPR.

To characterise anonymization, the EDPB (European Data Protection Board), formerly the G29 Working Party, has set out 3 criteria to be met, namely individualisation, correlation and inference.

The EDPB then defines two main families of anonymization techniques, namely randomization and generalization.

RANDOMIZATION

Randomization involves altering the attributes in a dataset so that they are less precise, while maintaining the overall distribution.

This technique protects the dataset from the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy.

Randomization in practice: permuting the dates of birth of individuals so as to alter the veracity of the information contained in a database.

GENERALIZATION

Generalization involves changing the scale of dataset attributes, or their order of magnitude, so that they are shared by a set of people.

This technique prevents records in a dataset from being individualised. It also limits the possible correlations between the dataset and others. Examples of generalisation techniques include aggregation, k-anonymity, l-diversity and t-closeness.

Generalization in practice: in a file containing the dates of birth of individuals, replacing this information with the year of birth only.

Each of these techniques addresses certain needs and comes with its own advantages and disadvantages. We will detail how these different methods work and illustrate, through concrete examples, the limits to which they are subject.

Which technique to use and why?

Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the privacy rights of the data subjects.

The randomization family:

1- Adding noise:

Principle: Modification of the attributes of the dataset to make them less accurate. Example: Following anonymization by adding noise, the age of patients is modified by plus or minus 5 years.
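A minimal sketch of this principle in Python, assuming uniform noise of plus or minus 5 years added to a hypothetical age column (the distribution and amplitude of the noise are design choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series([23, 47, 65, 31], name="age")

# Add uniform integer noise in [-5, +5] years to each age
noisy_ages = ages + rng.integers(-5, 6, size=len(ages))
print(pd.DataFrame({"original": ages, "noisy": noisy_ages}))
```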

Strengths:

Weaknesses:

Common mistakes:

Failure to use:

Netflix case:

In the Netflix case, the initial database had been made public after being "anonymised" in accordance with the company's internal privacy policy: all identifying information about users had been removed, except their ratings and the dates of those ratings.

Researchers were nevertheless able to re-identify Netflix users by cross-referencing the dataset with an external database (publicly available IMDb ratings). Using 8 ratings and rating dates known to within 14 days as selection criteria, 99% of users could be uniquely identified in the dataset; even 2 ratings and dates known to within 3 days were enough to identify 68% of users.

 

2- Permutation:

Principle:

Consists of mixing attribute values in a table in such a way that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the dataset by simply swapping them from one record to another. Example: As a result of permutation anonymization, the age of patient A has been replaced by that of patient J.
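A minimal sketch of permutation on a hypothetical age column: the set of values (and therefore the overall distribution) is preserved, but each value is reattached to a different record.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "patient": ["A", "B", "C", "J"],
    "age":     [34, 59, 41, 72],
})

# Shuffle the ages across records: the values survive, the links to individuals do not
df["age"] = rng.permutation(df["age"].to_numpy())
print(df)
```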

Strengths:

Weakness:

- Doesn’t allow for the preservation of correlations between values and individuals, thus making it impossible to perform advanced statistical analyses (regression, machine learning, etc.).

Common mistakes:

Failure to use: the permutation of correlated attributes

In the following example, an attacker will intuitively try to link salaries back to occupations according to the correlations that seem most plausible (see Table 1).

Thus, the random permutation of attributes does not offer guarantees of confidentiality when there are logical links between different attributes.


Table 1. Example of inefficient anonymization by permutation of correlated attributes

 

3- Differential Privacy:

Principle: Differential privacy consists in producing anonymized views of a dataset while a copy of the original data is retained.

The anonymized view is generated in response to a query sent to the database by a third party, and noise is added to the result. To be considered "differentially private", the result of a query must not change appreciably depending on whether or not a particular individual is present in the data.
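As an illustration, here is a minimal sketch of the classic Laplace mechanism applied to a counting query (not the mechanism of any particular product): a count changes by at most 1 when one individual is added or removed, so Laplace noise scaled by 1/epsilon makes the answer differentially private for that query, with epsilon acting as the privacy budget.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise (sensitivity = 1)."""
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(scale=1.0 / epsilon)

patients = [{"age": 67, "diagnosis": "lung cancer"},
            {"age": 45, "diagnosis": "flu"}]
print(dp_count(patients, lambda r: r["diagnosis"] == "lung cancer", epsilon=0.5))
```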

Strength:

Weaknesses:

Common mistakes:

- Not injecting enough noise: to prevent links being made with background knowledge, sufficient noise must be added. The challenge, from a data protection perspective, is to generate the appropriate level of noise to add to the true responses, so as to protect the privacy of individuals without undermining the utility of the data.

- Not allocating a privacy budget: it is necessary to keep track of the queries already made and to allocate a privacy budget that increases the amount of noise added if a query is repeated.

Usability failures:

- Independent processing of each query: without keeping a history of queries and adapting the noise level, the results of repeating the same query, or of combining several queries, can lead to the disclosure of personal information. An attacker could run several queries that together isolate an individual and reveal one of their characteristics (see the sketch after this list). It should also be borne in mind that differential privacy only answers one question at a time, so the original data must be kept for the entire duration of the intended use.

- Re-identification of individuals: differential privacy does not guarantee that personal information will not be disclosed. An attacker can re-identify individuals and reveal their characteristics using another data source or by inference. For example, in this paper (source: https://arxiv.org/abs/1807.09173), researchers from the Georgia Institute of Technology (Atlanta) applied "membership inference attacks" to models trained with differential privacy and showed that they can still re-identify training (and therefore sensitive) data. The researchers conclude that further research is needed to find a differential privacy mechanism that is both viable and robust against membership inference attacks. Differential privacy therefore does not appear to be a totally secure protection.
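The sketch below (self-contained, with an invented count) illustrates the first pitfall above: if the same query is answered independently many times, averaging the noisy answers converges to the true value; tracking a privacy budget and refusing queries once it is spent is one naive safeguard.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count, epsilon):
    # Laplace mechanism for a counting query (sensitivity = 1)
    return true_count + rng.laplace(scale=1.0 / epsilon)

true_count = 42

# Answering the same query independently 10,000 times lets the noise average out
answers = [dp_count(true_count, epsilon=0.5) for _ in range(10_000)]
print("averaged answer:", round(np.mean(answers), 2))  # converges towards 42

# Naive safeguard: track the privacy budget spent and refuse queries beyond it
total_budget, spent = 1.0, 0.0
for _ in range(5):
    epsilon = 0.5
    if spent + epsilon > total_budget:
        print("privacy budget exhausted, query refused")
        break
    spent += epsilon
    print("noisy answer:", round(dp_count(true_count, epsilon), 2))
```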

The generalization family:

1- Aggregation and k-anonymity:

Principle: Generalize attribute values to such an extent that several individuals share the same value. These two techniques aim to prevent a data subject from being isolated by grouping them with at least k other individuals. Example: In order to have at least 20 individuals sharing the same value, the age of all patients aged between 20 and 25 is recoded as 23 years.
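A minimal sketch of a k-anonymity check on a hypothetical generalised table: every combination of quasi-identifier values must be shared by at least k records.

```python
import pandas as pd

def is_k_anonymous(table, quasi_identifiers, k):
    # Every combination of quasi-identifier values must appear at least k times
    return bool(table.groupby(quasi_identifiers).size().min() >= k)

df = pd.DataFrame({
    "age_band":   ["20-25", "20-25", "20-25", "60-65"],
    "zip_prefix": ["750",   "750",   "750",   "440"],
    "diagnosis":  ["flu", "asthma", "flu", "heart attack"],
})

# False: the (60-65, 440) group contains a single record
print(is_k_anonymous(df, ["age_band", "zip_prefix"], k=3))
```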

Strength:

Weaknesses:

Common mistakes:

- Neglecting certain quasi-identifiers: the choice of the parameter k is the key parameter of the k-anonymity technique. The higher the value of k, the stronger the confidentiality guarantees. A common mistake, however, is to increase this parameter without considering all the variables: sometimes a single overlooked variable is enough to re-identify a large number of individuals and to render the generalization applied to the other quasi-identifiers useless.

- Low value of k: if k is too small, the weight of any single individual within a group is too large and inference attacks are more likely to succeed. For example, if k = 2, the probability that the two individuals share the same property is much greater than when k > 10.

- Not grouping individuals with similar weights: the parameter k must be adapted when the values of a variable are very unevenly distributed.

Failure to use:

The main problem with k-anonymity is that it does not prevent inference attacks. In the following example, if the attacker knows that an individual is in the dataset and was born in 1964, they also know that this individual had a heart attack. Furthermore, if it is known that this dataset was obtained from a French organisation, it can be inferred that each of the individuals resides in Paris (the first three digits of the postal codes are 750*).


Table 2. An example of poorly engineered k-anonymization

To overcome the shortcomings of k-anonymity, other aggregation techniques have been developed, notably l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring that each equivalence class contains at least L different values of the sensitive attribute (l-diversity) and that the classes created resemble the initial distribution of the data (t-closeness).

Note that, despite these improvements, they do not address all of the weaknesses of k-anonymity presented above.
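A minimal sketch of the corresponding l-diversity check, using invented data that mirrors the 1964 example above: a table can be k-anonymous while a whole equivalence class still shares a single sensitive value, which is exactly the inference weakness.

```python
import pandas as pd

def is_l_diverse(table, quasi_identifiers, sensitive, l):
    # Each group sharing the same quasi-identifiers must contain
    # at least l distinct values of the sensitive attribute
    return bool(table.groupby(quasi_identifiers)[sensitive].nunique().min() >= l)

df = pd.DataFrame({
    "birth_year": [1964, 1964, 1964, 1971, 1971],
    "zip_prefix": ["750", "750", "750", "750", "750"],
    "condition":  ["heart attack", "heart attack", "heart attack", "flu", "asthma"],
})

# The 1964 class is 3-anonymous but only 1-diverse: inference remains possible
print(is_l_diverse(df, ["birth_year", "zip_prefix"], "condition", l=2))  # False
```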

Thus, these different generalization and randomization techniques each have security advantages, but they do not always fully meet the three criteria set out by the EDPB, formerly G29, as shown in Table 3 "Strengths and weaknesses of the techniques considered" produced by the CNIL.


Table 3. Strengths and weaknesses of the techniques considered

Building on more recent techniques, synthetic data are now emerging as a stronger anonymization solution.

The case of synthetic data

Recent years of research have seen the emergence of solutions for generating synthetic records that retain a high degree of statistical relevance and facilitate the reproducibility of scientific results. They are based on models that capture and reproduce the global structure of the original data. A distinction is made between generative adversarial networks (GANs) and methods based on conditional distributions.
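To make the conditional-distribution idea concrete, here is a deliberately simplified sketch (this is not the Avatar method, and real generators model many variables jointly): fit a per-group distribution for one attribute and sample entirely new records from it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical original data
original = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=200),
    "age":    rng.normal(50, 12, size=200).round(),
})

# Fit a very simple model (age ~ Normal, conditioned on gender),
# then sample brand-new synthetic records from it
synthetic_rows = []
for gender, group in original.groupby("gender"):
    mean, std = group["age"].mean(), group["age"].std()
    for _ in range(len(group)):
        synthetic_rows.append({"gender": gender, "age": round(rng.normal(mean, std))})

synthetic = pd.DataFrame(synthetic_rows)
print(synthetic.groupby("gender")["age"].describe())
```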

Strength:

Weakness:

The Avatar anonymization solution, developed by Octopize, uses a unique patient-centric design approach, allowing the creation of relevant and protected synthetic data while providing proof of protection. Its compliance with the 3 EDPB criteria has been successfully evaluated by the CNIL. Click here to learn more about avatars.

Rapid evolution of techniques

Finally, the CNIL (the French National Data Processing and Liberties Commission) reminds us that since anonymization and re-identification techniques are bound to evolve regularly, it is essential for any data controller concerned to keep a regular watch to preserve the anonymous nature of the data produced over time. This monitoring must take into account the technical means available and other sources of data that may make it possible to remove the anonymity of information.

The CNIL stresses that research into anonymisation techniques is ongoing and consistently shows that no technique is, in itself, infallible.

Sources:
EDPB: https://edpb.europa.eu/edpb_fr
Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216): https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
Membership inference attacks: https://arxiv.org/pdf/1807.09173.pdf
Netflix: https://arxiv.org/abs/cs/0610105

Is your data pseudonymized or anonymized?

What is the difference between anonymization and pseudonymization?

The notion of anonymous data crystallizes many misunderstandings and misconceptions, to the point that the term "anonymous" does not mean the same thing depending on who uses it.
To re-establish a consensus, the Octopize team wanted to discuss the differences between pseudonymization and anonymization, two notions that are often confused.
At first glance, the term "anonymization" evokes the idea of a mask, of concealment. One might then imagine that anonymization amounts to masking the directly identifying attributes of an individual (surname, first name, social security number). This shortcut is precisely the trap to avoid: masking these attributes is in fact pseudonymization.
The two concepts may look similar at first sight, but there are major differences between them, from both a legal and a security point of view.

What is pseudonymization?

According to the CNIL, pseudonymization is "the processing of personal data in such a way that it is no longer possible to attribute the data to a natural person without additional information". It is one of the measures recommended by the GDPR to limit the risks related to the processing of personal data.

But pseudonymization is not a method of anonymization. Pseudonymization simply reduces the correlation of a dataset with the original identity of a data subject and is therefore a useful but not absolute security measure. It consists in replacing the directly identifying data (surname, first name, etc.) in a dataset with indirectly identifying data (alias, sequential number, etc.), thus preventing the direct re-identification of individuals.

However, pseudonymization is not an infallible protection, because the identity of an individual can also be deduced from a combination of several pieces of information known as quasi-identifiers. In practice, pseudonymized data therefore remain indirectly re-identifiable by cross-referencing information: the identity of an individual can be betrayed by one of their indirectly identifying characteristics. The transformation is reversible, which is why pseudonymized data are still considered personal data. To date, the most widely used pseudonymization techniques are based on secret-key cryptographic systems, hash functions, deterministic encryption and tokenization.
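A minimal sketch of one of these techniques, keyed-hash pseudonymisation (the key name and truncation length are illustrative): whoever holds the secret key can recompute the pseudonym of any known person and link records back, which is why the result remains personal data.

```python
import hmac
import hashlib

SECRET_KEY = b"keep-this-key-away-from-the-dataset"  # hypothetical secret key

def pseudonymise(identifier: str) -> str:
    # Replace a direct identifier with a keyed hash (HMAC-SHA256).
    # Anyone holding the key can recompute pseudonyms from known identities,
    # so the pseudonymised records are still personal data.
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

print(pseudonymise("Alice Martin"))  # a 12-character pseudonym instead of the name
```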

The "AOL (America On Line) case" is a typical example of the misunderstanding that exists between pseudonymization and anonymization. In 2006, a database containing 20 million keywords from the searches of more than 650,000 users over a period of three months was made public, with no other measure to preserve privacy than the replacement of the AOL user ID by a numerical attribute (pseudonymization).
Despite this treatment, the identity and location of some users were made public. Indeed, queries sent to a search engine, especially if they can be coupled with other attributes, such as IP addresses or other configuration parameters, have a very high potential for identification.

This incident is just one example of the many pitfalls showing that a pseudonymized dataset is not anonymous; simply masking the identity does not prevent an individual from being re-identified from quasi-identifying information (age, gender, zip code). In many cases, identifying an individual in a pseudonymized dataset can be almost as easy as doing so from the original data (a game of "Guess Who?").

What is the difference with anonymization?

Anonymization consists in using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. This processing is irreversible and implies that the anonymized data are no longer considered personal data, thus falling outside the scope of the GDPR. To characterise anonymization, the European Data Protection Board (formerly WP29) relies on the 3 criteria set out in its Opinion 05/2014 (source at the foot of the page):

- Individualization: anonymized data must not make it possible to single out an individual. Even with all the quasi-identifying information about an individual, it must be impossible to distinguish them in the database once it has been anonymized.

- Correlation: anonymized data must not be re-identifiable by cross-referencing them with other datasets. It must be impossible to link two datasets from different sources concerning the same individual. Once anonymized, an individual's health data must not be linkable to their banking data on the basis of shared information.

- Inference: the data must not make it possible to reasonably infer additional information about an individual. For example, it must not be possible to determine with certainty the health status of an individual from anonymized data.

It is when these three criteria are met that data are considered anonymous in the strict sense. Their legal status then changes: they are no longer considered personal data and fall outside the scope of the GDPR.

Our solution: Avatar

There are currently several families of anonymization methods, which we will detail in our next article. For the most part, these methods provide protection by degrading the quality, structure or granularity of the original data, thus limiting its informational value after processing. The real challenge is to resolve the paradox between the legitimate protection of everyone's data and its exploitation for the benefit of all.

The Avatar anonymization method, developed by Octopize, is a unique anonymization method that resolves the paradox between protecting patients' personal data and sharing this data for its informative value. The Avatar solution, which has been successfully evaluated by the CNIL, uses synthetic data to ensure both the confidentiality of the original data (and thus its risk-free sharing) and the preservation of its informative value.

Click here to learn more.

Sources:
Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216): https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf