One of the key points to tackle before diving into the privacy of a dataset is the distinction between pseudonymization and anonymization. These terms are often used interchangeably, but they differ greatly in the level of protection they offer individuals.
Note that pseudonymization is a required step before anonymization, as direct identifiers do not bring any value to a dataset.
To be considered anonymous, a dataset must satisfy the three criteria identified by the European Data Protection Board (EDPB, formerly the Article 29 Working Party, or WP29). To measure compliance with these criteria, you always compare the original dataset to its treated version, where the treatment is any technique that aims to improve the privacy of the dataset (noise addition, generative models, Avatars).
Before diving into specific metrics and how they are measured, we have to clarify what we are actually trying to prevent.
We will take the official criteria from the EDPB and add some examples to highlight the key differences between the three.
These are singling out, linkability, and inference:
Singling out. Example: you work at an insurance company and have a dataset of your clients and their vehicles. You simply remove the direct personal identifier, i.e. their name. But given that the combination of the remaining values (vehicle type, brand, age of the vehicle, color) is unique, you can still identify each and every one of your clients, even without their name being present (see the sketch after these examples).
Linkability. Example: in a recruiting agency's dataset, clients and their salary, along with related information, are listed. In a separate, publicly available database (e.g. LinkedIn), you collect information such as job title, city, and company. Given these, you are able to link each individual from one dataset to the other, which lets you learn new information, e.g. their salary.
Inference. Example: a pharmaceutical company owns a dataset of people who participated in a clinical trial. If you know that a particular individual is a man, and every man in the dataset is overweight, you can infer that this individual is overweight without actually singling him out.
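To make the singling-out example concrete, here is a minimal sketch in Python with pandas. The dataset, values, and column names are purely hypothetical; it simply checks how many records remain unique on their quasi-identifiers once the direct identifier has been dropped.

```python
import pandas as pd

# Hypothetical insurance dataset; all values and column names are illustrative.
clients = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "vehicle_type": ["SUV", "sedan", "SUV"],
    "brand": ["Toyota", "Renault", "BMW"],
    "vehicle_age": [3, 7, 1],
    "color": ["red", "black", "blue"],
})

# Pseudonymization: drop the direct identifier only.
pseudonymized = clients.drop(columns=["name"])

# Count how many records share each combination of quasi-identifiers.
quasi_identifiers = ["vehicle_type", "brand", "vehicle_age", "color"]
group_sizes = pseudonymized.groupby(quasi_identifiers).size()

# Records whose combination is unique can still be singled out.
unique_share = (group_sizes == 1).sum() / len(pseudonymized)
print(f"{unique_share:.0%} of records can be singled out by their quasi-identifiers")
```

Here every combination is unique, so dropping the name alone does not prevent singling out.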
The first family of metrics we will now introduce aims to evaluate the protection of a dataset against singling-out attacks. Such attacks can take different forms, so several complementary metrics are required. Some singling-out metrics are model-agnostic and can be used on any pair of original and treated datasets; others require temporarily keeping a link between original and treated individuals.
We now present two straightforward metrics that can be used on datasets treated by any technique. These metrics are particularly useful when it comes to comparing the results of different approaches.
A dataset with a high DTC and a high CDR ensures that the treatment applied to the data has changed the characteristics of the individuals. However, even if treated individuals are distant from the originals, there remains a risk that original individuals can be associated with their most similar treated counterpart.
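The exact definitions of DTC and CDR are not spelled out here, so the following Python sketch rests on assumptions: it takes DTC to be the distance from each treated record to its closest original record, and CDR to be the ratio between the closest and second-closest distances, a common formulation for this kind of metric. Data are assumed to be numeric and already scaled.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dtc_and_cdr(original: np.ndarray, treated: np.ndarray):
    """Illustrative distance-based protection metrics (assumed definitions)."""
    # For each treated record, find its two closest original records.
    neighbors = NearestNeighbors(n_neighbors=2).fit(original)
    distances, _ = neighbors.kneighbors(treated)

    dtc = distances[:, 0]                               # distance to closest original
    cdr = distances[:, 0] / (distances[:, 1] + 1e-12)   # closest / second-closest

    return dtc.mean(), cdr.mean()

# Toy usage: plain noise addition stands in for any treatment.
rng = np.random.default_rng(0)
original = rng.normal(size=(200, 4))
treated = original + rng.normal(scale=0.5, size=(200, 4))
print(dtc_and_cdr(original, treated))
```

The higher both values are, the further each treated record sits from any single original, which is exactly what the paragraph above describes.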
At Octopize, our treatment generates synthetic anonymized data. We have developed additional metrics, placing ourselves in the worst-case scenario where an attacker has access to both the original and the anonymized data. Although unlikely in practice, this is the approach recommended by the EDPB. The hidden rate and the local cloaking measure the protection of the data against distance-based singling-out attacks. Both metrics require that the link between each individual and its synthetic version is available.
To illustrate these metrics, let us look at a simplified example where a cohort of animals (why not!?) is anonymized (with our Avatar solution, for example).
With individual-centric anonymization solutions, a synthetic individual is generated from an original. The link between originals and synthetic individuals can be used to measure the level of protection against distance-based attacks. In our example, we see that the ginger cat was anonymized as a cheetah while the synthetic record created from the tiger is a black cat.
A distance-based attack assumes that singling-out can be done by associating an original with its most similar synthetic individual. In our example, a distance-based linkage would associate the ginger cat with the black cat, the tiger with the cheetah and so on.
The hidden rate measures the probability that an attacker makes a mistake when linking an individual with its most similar synthetic individual. In this illustration, we see that most distance-based matches are incorrect, so the hidden rate is high, indicating good protection against distance-based singling-out attacks.
In this figure, we illustrate how the local cloaking is computed for a single original individual, here the ginger cat. Thanks to the link we are keeping temporarily, we know that the actual synthetic individual generated from the ginger cat is the cheetah. Its local cloaking is the number of synthetic records between itself and the cheetah. In this example, there is one such synthetic record: the black cat, meaning that the local cloaking of the ginger cat is 1. The same calculation is done for all originals.
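As a rough sketch (not Octopize's actual implementation) of how these two metrics could be computed, assuming numeric, already-encoded data and an array links giving, for each original record, the index of the synthetic record generated from it:

```python
import numpy as np
from scipy.spatial.distance import cdist

def hidden_rate_and_local_cloaking(originals, synthetics, links):
    """Distance-based singling-out metrics (illustrative sketch).

    originals, synthetics: numeric arrays of shape (n, d)
    links[i]: index of the synthetic record generated from originals[i]
    """
    distances = cdist(originals, synthetics)  # pairwise original-to-synthetic distances
    links = np.asarray(links)

    # Hidden rate: share of originals whose closest synthetic record is NOT
    # the one actually generated from them (the attacker's guess is wrong).
    closest = distances.argmin(axis=1)
    hidden_rate = (closest != links).mean()

    # Local cloaking: for each original, the number of synthetic records that
    # are strictly closer than its true synthetic counterpart.
    dist_to_counterpart = distances[np.arange(len(originals)), links]
    local_cloaking = (distances < dist_to_counterpart[:, None]).sum(axis=1)

    return hidden_rate, local_cloaking
```

In the animal example, the black cat is the only synthetic record closer to the ginger cat than the cheetah, so the ginger cat's local cloaking is 1.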
The four metrics we have just seen provide a good coverage of the protection against singling-out attacks but as we have seen at the start of this post, there are other types of attacks against which personal data should be protected.
Metrics that meet the linkability criterion respond to a more common and more likely attack scenario.
The attacker has a treated dataset and an external identifying database (e.g. a voter's register) with information in common with the treated data (e.g. age, gender, zip code). The more information there is in common between the two databases, the more effective the attack will be.
The Correlation Protection Rate evaluates the percentage of individuals that would not be successfully linked to their synthetic counterpart by an attacker using an external data source. The variables selected as being common to both databases must be likely to be found in an external data source (e.g. age should be considered, whereas insulin_concentration_D2 should not). To cover the worst-case scenario, we assume that the same individuals are present in both databases. In practice, some individuals in the anonymized database are not present in the external data source and vice versa. This metric also relies on the original-to-synthetic link being kept temporarily; this link is used to measure how many of the pairings are incorrect.
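As an illustration only (the nearest-neighbour linkage, the pandas layout, and the assumption that shared columns are numerically encoded are ours, not Octopize's exact implementation), simulating this attack and the resulting protection rate could look like the following sketch:

```python
import pandas as pd
from scipy.spatial.distance import cdist

def correlation_protection_rate(external: pd.DataFrame,
                                treated: pd.DataFrame,
                                links,
                                common_columns):
    """Illustrative linkage-attack simulation.

    links[i]: index of the treated record derived from the i-th individual
    of the external source (the temporarily kept link described above).
    common_columns: variables shared by both databases, e.g. age, gender,
    zip code, assumed to be numerically encoded.
    """
    # The attacker only sees the shared columns and pairs each external
    # individual with the most similar treated record.
    distances = cdist(external[common_columns].to_numpy(dtype=float),
                      treated[common_columns].to_numpy(dtype=float))
    guessed = distances.argmin(axis=1)

    # Protection rate: share of individuals the attack fails to re-link.
    return float((guessed != links).mean())
```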
Metrics that meet the Inference criterion respond to another type of attack where the attacker seeks to infer additional information about an individual from the available anonymized data.
Our solution, Avatar, computes all of the above metrics and more. We take it as our mission to generate anonymized datasets with a fully explainable model and concrete privacy metrics that allow us to measure the degree of protection.
There are many things to take into consideration, and rendering a dataset anonymous should not be taken lightly: there are many pitfalls one can run into and accidentally leak information. That's why, in addition to the metrics and the associated guarantee of privacy, we generate a report that clearly outlines all the different metrics and the evaluation criteria they aim to measure, similar to what we have laid out above. It explains all the metrics in layman's terms and additionally reports statistics about the datasets, before and after anonymization.
In practice, anonymizing a dataset is always a tradeoff between guaranteeing privacy and preserving utility. A fully random dataset is private, but serves no purpose.
We’ll examine how to measure the utility of a dataset, before and after anonymization, in a future post.
Interested in our solution? Contact us!
Writing: Tom Crasset & Olivier Regnier-Coudert