How do you measure the anonymity of a database?
In the age of Big Data, personal data is an essential raw material for research and for the operation of many companies. However, despite its great value, using this type of data necessarily carries a risk of re-identification and of leaking sensitive information, even after prior pseudonymisation (see article 1). For personal data, and especially sensitive data, re-identification can be considered a betrayal of the trust of the individuals from whom the data originated, particularly when the data is used without clear and informed consent.
The entry into application of the General Data Protection Regulation (GDPR) in 2018, and the Data Protection Act before it, attempted to address this issue by changing how personal data is collected, processed and stored. An independent European body specialising in data protection has also been set up. Known as the European Data Protection Board (EDPB), and formerly as the Article 29 Working Party (G29), this body has published work (see Article G29) which now serves as a reference for the European national authorities (the CNIL in France) in applying the GDPR.
The EDPB thus recognises the potential of anonymisation to enhance the value of personal data while limiting the risks for the individuals from whom it originates. As a reminder, data is considered anonymous if re-identification of the original individuals is impossible. Anonymisation is therefore an irreversible process. However, the anonymisation methods developed to meet this need are not infallible, and their effectiveness often depends on many parameters (see article 2). To use these methods well, it is necessary to specify what anonymised data actually is. In its Opinion 05/2014 on anonymisation techniques, the former G29 identifies three criteria for determining whether re-identification is impossible, namely:
1. Individualisation: is it still possible to isolate an individual?
The individualisation criterion corresponds to the most favourable scenario for an attacker, i.e. a person, malicious or not, seeking to re-identify an individual in a dataset. To be considered anonymous, a dataset must not allow an attacker to isolate a target individual. In practice, the more information an attacker has about the individual they wish to isolate, the higher the probability of re-identification. Indeed, in a pseudonymised dataset, i.e. one that has been stripped of its direct identifiers, the remaining quasi-identifying attributes, taken together, act like a barcode of an individual's identity. The more prior information the attacker has about their target, the more precise a query they can write to isolate that individual. An example of an individualisation attack is shown in Figure 1.
Figure 1: Re-identification of a patient by individualisation in a dataset based on two attributes (Age, Gender)
A characteristic of this type of attack is that individuals with unusual characteristics are more exposed to it. With only gender and height information, it will be easier for an attacker to isolate a woman who is 2 metres tall than a man who is 1.75 metres tall.
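The barcode effect described above can be made concrete by counting how many records share each combination of quasi-identifiers. The following is a minimal sketch rather than a reference implementation: the records, the column names and the helper functions `singled_out` and `individualisation_rate` are hypothetical and are not taken from Figure 1.

```python
from collections import Counter

# Hypothetical pseudonymised records (direct identifiers already removed).
records = [
    {"age": 34, "gender": "F", "diagnosis": "asthma"},
    {"age": 34, "gender": "F", "diagnosis": "diabetes"},
    {"age": 52, "gender": "M", "diagnosis": "influenza"},
    {"age": 52, "gender": "M", "diagnosis": "asthma"},
    {"age": 71, "gender": "F", "diagnosis": "lung cancer"},
]


def singled_out(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination is unique."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


def individualisation_rate(records, quasi_identifiers):
    """Share of records that an attacker knowing these attributes can isolate."""
    if not records:
        return 0.0
    return len(singled_out(records, quasi_identifiers)) / len(records)


isolated = singled_out(records, ("age", "gender"))
print(isolated)  # only the 71-year-old woman is unique on (age, gender)
print(individualisation_rate(records, ("age", "gender")))  # 0.2
```

In this toy dataset, knowing the gender alone isolates nobody, while knowing age and gender isolates one record out of five. Each extra attribute the attacker knows can only keep or increase that share.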
2. Correlation: is it still possible to link records about an individual?
Correlation attacks are the most common scenario. To be considered anonymous, data must therefore meet the correlation criterion. With the spread of Open Data and the many incidents involving leaks of personal data, the amount of data available has never been so large. These databases, which contain personal and sometimes directly identifying information, give attackers opportunities to attempt re-identification by cross-referencing. In practice, a correlation attack uses a directly identifying database that shares information with the database under attack, as illustrated in Figure 2.
Figure 2: Illustration of a correlation attack. The directly identifying external database (top) is used to re-identify individuals in the attacked database (bottom). The correlation is done on the basis of common variables.
In the case of the tables illustrated in Figure 2, the attacker would succeed in re-identifying the 5 individuals in the pseudonymised database thanks to the two attributes common to both databases. Moreover, the re-identification would allow them to learn new sensitive information about the patients, namely the pathology that affects them. In this context, the more information the databases have in common, the higher the probability of re-identifying an individual by correlation.
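A correlation attack of this kind amounts to a join on the variables the two databases share. The sketch below illustrates the idea under stated assumptions: both tables, their column names and the `link_records` helper are hypothetical and do not reproduce the contents of Figure 2. A link is only counted when the combination of common attributes is unique in both tables.

```python
from collections import defaultdict

# Hypothetical directly identifying external database.
external = [
    {"name": "Alice", "age": 34, "gender": "F"},
    {"name": "Bob", "age": 52, "gender": "M"},
    {"name": "Charles", "age": 52, "gender": "M"},
    {"name": "Diane", "age": 71, "gender": "F"},
]

# Hypothetical pseudonymised database under attack.
pseudonymised = [
    {"age": 34, "gender": "F", "pathology": "asthma"},
    {"age": 52, "gender": "M", "pathology": "diabetes"},
    {"age": 71, "gender": "F", "pathology": "lung cancer"},
]


def link_records(external, target, common):
    """Map each name to a pathology when the common attributes match uniquely."""
    def key(row):
        return tuple(row[c] for c in common)

    ext_index = defaultdict(list)
    for row in external:
        ext_index[key(row)].append(row)

    tgt_index = defaultdict(list)
    for row in target:
        tgt_index[key(row)].append(row)

    links = {}
    for k, tgt_rows in tgt_index.items():
        ext_rows = ext_index.get(k, [])
        # Unambiguous only if exactly one row on each side has this key.
        if len(tgt_rows) == 1 and len(ext_rows) == 1:
            links[ext_rows[0]["name"]] = tgt_rows[0]["pathology"]
    return links


links = link_records(external, pseudonymised, ("age", "gender"))
print(links)  # {'Alice': 'asthma', 'Diane': 'lung cancer'}
```

Bob and Charles share the same age and gender in the external table, so the 52-year-old male record cannot be attributed to either of them with certainty. Adding a third common attribute that separates them would remove that ambiguity.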
3. Inference: can information about an individual be inferred?
The third and last criterion identified in the Opinion, inference, is probably the most complex to assess. To be considered anonymous, data must not allow new information about an individual to be deduced with a high degree of certainty. For example, if a dataset contains information on the health status of individuals who have participated in a clinical study, and all the men over 65 in this cohort have lung cancer, then it is possible to infer the health status of certain participants. Indeed, knowing that a man over 65 took part in this study is enough to conclude that he has lung cancer.
An inference attack is particularly effective on groups of individuals who all share the same value of a sensitive attribute. If the inference succeeds, the sensitive attribute is disclosed for the whole group of individuals identified.
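One simple way to look for such groups is to check, for each combination of known attributes, how dominant the most frequent sensitive value is. This is a hedged sketch only: the cohort, the column names, the `inference_risk` helper and its `threshold` parameter are hypothetical, and real assessments would also need to account for group size and the attacker's background knowledge.

```python
from collections import Counter, defaultdict

# Hypothetical clinical-study cohort mirroring the example above.
cohort = [
    {"age_band": "65+", "gender": "M", "pathology": "lung cancer"},
    {"age_band": "65+", "gender": "M", "pathology": "lung cancer"},
    {"age_band": "65+", "gender": "M", "pathology": "lung cancer"},
    {"age_band": "65+", "gender": "F", "pathology": "lung cancer"},
    {"age_band": "65+", "gender": "F", "pathology": "asthma"},
    {"age_band": "<65", "gender": "M", "pathology": "asthma"},
    {"age_band": "<65", "gender": "M", "pathology": "diabetes"},
]


def inference_risk(records, group_by, sensitive, threshold=1.0):
    """Return groups whose most frequent sensitive value reaches the threshold.

    The result maps each risky group to (dominant value, share of the group).
    """
    groups = defaultdict(Counter)
    for record in records:
        groups[tuple(record[g] for g in group_by)][record[sensitive]] += 1

    risky = {}
    for group, counts in groups.items():
        value, n = counts.most_common(1)[0]
        share = n / sum(counts.values())
        if share >= threshold:
            risky[group] = (value, share)
    return risky


risky = inference_risk(cohort, ("age_band", "gender"), "pathology")
print(risky)  # {('65+', 'M'): ('lung cancer', 1.0)}
```

With the default threshold of 1.0, only fully homogeneous groups are reported. Lowering the threshold flags groups where the sensitive value can be guessed with high, but not total, certainty.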
These three criteria identified in the Opinion cover most of the threats that remain once data has been processed to protect it. If all three criteria are met, the processing can be considered anonymisation in the true sense of the word.
Can current techniques satisfy all three criteria?
Randomisation and generalisation techniques each have advantages and disadvantages with respect to each criterion (see Article 2). Figure 3, taken from the Opinion published by the former G29 on anonymisation techniques, summarises how well several anonymisation techniques meet each criterion.
Figure 3: Strengths and weaknesses of the techniques considered
It is clear that none of these techniques can meet all three criteria simultaneously. They should therefore be used with caution, each in its most appropriate context. Beyond the methods evaluated, synthetic data seems to be a promising alternative for meeting all three criteria. However, methodologies for producing synthetic data face the challenge of proving this protection. At present, all synthetic data generation solutions rely on the principle of plausible deniability to prove the protection associated with a record. In other words, if a synthetic record happened to resemble an original record, the defence would be that, in such circumstances, it is impossible to prove that the synthetic record is related to an original one. At Octopize, we have developed a unique methodology to produce synthetic data while quantifying and proving the protection provided. This evaluation is carried out through metrics specifically developed to measure how well the three criteria, namely individualisation, correlation and inference, are satisfied. We will cover metrics for assessing the quality and security of synthetic data in more detail in another article.