January 31, 2023

Evaluating the privacy of a dataset

One of the key points to tackle before diving into the privacy of a dataset is the distinction between pseudonymization and anonymization. These terms are often used interchangeably, but they differ considerably in the level of protection they offer individuals.

  • Pseudonymization replaces direct identifiers such as first name and last name with new identifiers, through techniques such as hashing, tokenization or encryption. Pseudonymized data is still considered personal data and remains subject to the GDPR (General Data Protection Regulation).
  • Anonymization consists of applying techniques that make it impossible in practice to re-identify an individual in a dataset. This treatment is irreversible and implies that the anonymized data is no longer considered personal data. Such data thus falls outside the scope of the GDPR. A variety of anonymization methods exist; the choice depends on the degree of risk and the intended use of the data.
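As a concrete illustration, the identifier-replacement step of pseudonymization can be sketched in a few lines of Python. This is a minimal, hypothetical example using salted SHA-256 hashing (the function name and record fields are our own); note that a record treated this way is still only pseudonymized, not anonymized.

```python
import hashlib

def pseudonymize(record: dict, direct_identifiers: list, salt: str) -> dict:
    """Replace direct identifiers with truncated, salted SHA-256 hashes.

    This is pseudonymization, not anonymization: the remaining
    attributes may still allow re-identification.
    """
    out = dict(record)
    for field in direct_identifiers:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]  # truncated hash as the new identifier
    return out

# Hypothetical insurance record: only the name is a direct identifier.
client = {"name": "Alice Martin", "vehicle": "sedan", "brand": "Renault", "age": 4}
pseudo = pseudonymize(client, ["name"], salt="s3cret")
```

The same salt must be reused to keep the new identifiers consistent across tables; anyone holding the salt can re-link records, which is exactly why pseudonymized data remains personal data.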

Note that pseudonymization is a required step before anonymization, as direct identifiers do not bring any value to a dataset.

To be considered anonymous, a dataset must satisfy the three criteria identified by the European Data Protection Board (EDPB, successor to the Article 29 Working Party, WP29). To measure compliance with these criteria, you always compare the original dataset to its treated version, where the treatment is any technique that aims to improve the privacy of the dataset (noise addition, generative models, Avatars).

Privacy according to EDPB

Before diving into specific metrics and how they are measured, we have to clarify what we are actually trying to prevent.

We will take the official criteria from the EDPB and add some examples to highlight the key differences between the three.

These are:

  • Singling out is the risk of isolating an individual in a dataset, thereby identifying them.

Example: you work at an insurance company and have a dataset of your clients and their vehicles. You simply remove the direct identifiers, i.e. their names. But if the combination of the remaining values (vehicle type, brand, age of the vehicle, color) is unique, you are still able to identify each and every one of your clients, even without their names being present.

  • Linkability is the ability to link individuals with an external data source through common features.

Example: a recruiting agency’s dataset lists clients and their salary, along with related information. In a separate, publicly available database (e.g. LinkedIn), you collect information such as job title, city, and company. Given these, you are able to link each individual from one dataset to the other, which lets you learn new information, e.g. salary.

  • Inference is the possibility of deducing, with significant probability, information about individuals from the anonymized dataset.

Example: a pharmaceutical company owns a dataset of people having participated in a clinical trial. If you know that a particular individual is a man, and every man in the dataset is overweight, you can infer that the specific individual is overweight, without actually singling him out.


Evaluating Singling-out

The first family of metrics we will now introduce aims to evaluate the protection of a dataset against singling-out attacks. Such attacks can take different forms and so different complementary metrics are required. Some singling-out metrics are model-agnostic and so can be used on any pair of original and treated datasets. Other metrics require temporarily keeping a link between original and treated individuals.

Model-agnostic metrics

We now present two straightforward metrics that can be used on datasets treated by any technique. These metrics are particularly useful when it comes to comparing the results of different approaches.

  • Distance to Closest (DTC). To compute DTC, the distance between each treated record and its closest original record is measured. The median of these distances is kept in order to have a single representative value for this metric. The rationale behind DTC is that if every treated record is close to an original, the dataset could present a singling-out risk. However, a low DTC does not necessarily mean that the dataset is at risk, so the Closest Distances Ratio should be measured to complement it.
  • Closest Distances Ratio (CDR). Similarly to DTC, CDR is computed by measuring the distance between a treated record and its closest original record, divided by the distance to its second closest original record. If the ratio is high (close to 1), the two closest originals are at nearly the same distance, so it is practically impossible to distinguish between them with certainty. From the ratios computed for every treated individual, the median is kept to provide a single CDR value. There is a singling-out risk when both DTC and CDR are low.
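The two model-agnostic metrics above can be sketched with NumPy as follows. This is a minimal illustration assuming numeric data and Euclidean distances; the function name and distance choice are our own assumptions, not necessarily what any particular anonymization tool implements.

```python
import numpy as np

def dtc_and_cdr(original: np.ndarray, treated: np.ndarray):
    """Median Distance to Closest (DTC) and Closest Distances Ratio (CDR).

    For each treated record, find its two nearest original records:
      DTC = median distance to the closest original;
      CDR = median ratio (closest / second-closest distance),
            where a ratio near 1 means the two candidate originals
            are practically indistinguishable.
    """
    # Pairwise Euclidean distances, shape (n_treated, n_original).
    dists = np.linalg.norm(treated[:, None, :] - original[None, :, :], axis=2)
    two_nearest = np.sort(dists, axis=1)[:, :2]
    dtc = float(np.median(two_nearest[:, 0]))
    eps = 1e-12  # guard against division by zero
    cdr = float(np.median(two_nearest[:, 0] / (two_nearest[:, 1] + eps)))
    return dtc, cdr
```

In practice, numeric variables would first be scaled and categorical ones encoded so that no single variable dominates the distance.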

Going further: our metrics

High DTC and CDR values indicate that the treatment applied to the data has changed the characteristics of the individuals. However, even if treated individuals are distant from the originals, there remains a risk that original individuals can be associated with their most similar treated counterpart.

At Octopize, our treatment generates synthetic anonymized data. We have developed additional metrics, placing ourselves in the worst-case scenario where an attacker has both the original and the anonymized data. Although unlikely in practice, this approach is recommended by the EDPB. The hidden rate and local cloaking are metrics designed to measure the protection of the data against distance-based singling-out attacks. Both require that the link between each individual and its synthetic version is temporarily available.

To illustrate these metrics, let us look at a simplified example where a cohort of animals (why not!?) would be anonymized (with our Avatar solution, for example).

With individual-centric anonymization solutions, a synthetic individual is generated from an original. The link between originals and synthetic individuals can be used to measure the level of protection against distance-based attacks. In our example, we see that the ginger cat was anonymized as a cheetah while the synthetic record created from the tiger is a black cat.

A distance-based attack assumes that singling-out can be done by associating an original with its most similar synthetic individual. In our example, a distance-based linkage would associate the ginger cat with the black cat, the tiger with the cheetah and so on.

  • Current Hidden Rate. The current hidden rate is the probability that an attacker makes a mistake when linking an individual with its most similar synthetic individual. This is where the original-to-synthetic link that was kept temporarily comes into use.

In our example, most distance-based matches are incorrect, so the hidden rate is high, illustrating good protection against distance-based singling-out attacks.
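A minimal sketch of the hidden-rate computation, assuming row i of the synthetic array is the avatar of row i of the original array (the temporarily kept link), numeric data, and Euclidean distances; the function name is our own:

```python
import numpy as np

def hidden_rate(original: np.ndarray, synthetic: np.ndarray) -> float:
    """Fraction of originals whose nearest synthetic record is NOT
    the one actually generated from them.

    Rows are assumed aligned: synthetic[i] is the avatar of original[i].
    """
    # Pairwise Euclidean distances, shape (n_original, n_synthetic).
    dists = np.linalg.norm(original[:, None, :] - synthetic[None, :, :], axis=2)
    guessed = np.argmin(dists, axis=1)      # attacker's distance-based guess
    true_link = np.arange(len(original))    # temporarily kept link
    return float(np.mean(guessed != true_link))
```

A value near 1 means the distance-based attack almost always picks the wrong synthetic record, i.e. good protection.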

  • Local Cloaking. The local cloaking of an original individual is the number of synthetic individuals that look more like it than the synthetic individual generated from it. The higher this number, the better the individual is protected. The median local cloaking over all individuals is used to evaluate a dataset.

In this figure, we illustrate how the local cloaking is computed for a single original individual, here the ginger cat. Thanks to the link we keep temporarily, we know that the synthetic individual actually generated from the ginger cat is the cheetah. Its local cloaking is the number of synthetic records that are closer to the ginger cat than the cheetah is. In this example, there is one such synthetic record: the black cat, meaning that the local cloaking of the ginger cat is 1. The same calculation is done for all originals.
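The local cloaking computation can be sketched the same way, again assuming aligned rows (synthetic[i] generated from original[i]), numeric data, and Euclidean distances; the function name is our own:

```python
import numpy as np

def local_cloaking(original: np.ndarray, synthetic: np.ndarray) -> float:
    """Median, over all originals, of the number of synthetic records
    strictly closer to an original than its own synthetic counterpart.

    Rows are assumed aligned: synthetic[i] is the avatar of original[i].
    """
    # Pairwise Euclidean distances, shape (n_original, n_synthetic).
    dists = np.linalg.norm(original[:, None, :] - synthetic[None, :, :], axis=2)
    idx = np.arange(len(original))
    own = dists[idx, idx]                           # distance to own avatar
    cloaking = np.sum(dists < own[:, None], axis=1)  # per-original count
    return float(np.median(cloaking))
```

In the animal example, one synthetic record (the black cat) sits between the ginger cat and its cheetah avatar, so the ginger cat's individual cloaking is 1.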

The four metrics we have just seen provide a good coverage of the protection against singling-out attacks but as we have seen at the start of this post, there are other types of attacks against which personal data should be protected.


Evaluating linkability

Metrics that meet the linkability criterion respond to a more common and more likely attack scenario.

The attacker has a treated dataset and an external identifying database (e.g. a voter's register) with information in common with the treated data (e.g. age, gender, zip code). The more information there is in common between the two databases, the more effective the attack will be.

Correlation protection rate

The Correlation Protection Rate evaluates the percentage of individuals that would not be successfully linked to their synthetic counterpart by an attacker using an external data source. The variables selected as common to both databases must be likely to be found in an external data source (e.g. age should be considered, whereas insulin_concentration_D2 should not). To cover the worst-case scenario, we assume that the same individuals are present in both databases; in practice, some individuals in the anonymized database are not present in the external data source and vice versa. This metric also relies on the original-to-synthetic link being kept temporarily. This link is used to measure how many of the pairings are incorrect.
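Restricted to the common variables, this metric reduces to a nearest-neighbour linkage followed by a count of incorrect pairings. A minimal sketch, assuming the rows of both arrays describe the same individuals in the same order (the temporarily kept link), contain only the shared variables, and are compared with Euclidean distance; the function name is our own:

```python
import numpy as np

def correlation_protection_rate(external: np.ndarray,
                                synthetic: np.ndarray) -> float:
    """Share of individuals NOT correctly re-linked through the
    variables common to the external source and the synthetic data.

    Rows are assumed aligned: external[i] and synthetic[i] describe
    the same individual, restricted to plausibly public variables
    (age, gender, zip code...).
    """
    # Pairwise Euclidean distances, shape (n_external, n_synthetic).
    dists = np.linalg.norm(external[:, None, :] - synthetic[None, :, :], axis=2)
    guessed = np.argmin(dists, axis=1)  # attacker's best linkage guess
    return float(np.mean(guessed != np.arange(len(external))))
```

The more common variables the attacker holds, the tighter the nearest-neighbour match becomes, which is why this rate should be evaluated with a realistic, generous set of shared variables.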


Evaluating inference

Metrics that meet the Inference criterion respond to another type of attack where the attacker seeks to infer additional information about an individual from the available anonymized data.

  • Inference metrics. The inference metric evaluates the possibility of deducing, with significant probability, the original value of a target variable from the values of the other treated variables. It can be used on numeric and categorical targets. When the target is numeric, we speak of a regression inference metric and evaluate the protection as the mean absolute difference between the value predicted by the attacker and the original numeric value. When the target is categorical, we speak of a classification inference metric, and the level of protection is represented by the prediction accuracy.
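As an illustration, such an inference attack can be simulated with a simple attacker model trained on the treated data and evaluated on the original records. The 1-nearest-neighbour attacker and the function name below are our own assumptions, chosen for brevity; a real evaluation would try stronger models.

```python
import numpy as np

def inference_risk(treated_X, treated_y, original_X, original_y):
    """Simulate an inference attack with a 1-nearest-neighbour attacker
    fitted on the treated dataset and applied to the original records.

    Numeric target     -> mean absolute error (lower = riskier).
    Categorical target -> prediction accuracy (higher = riskier).
    """
    treated_y = np.asarray(treated_y)
    original_y = np.asarray(original_y)
    # For each original record, predict the target of its nearest
    # treated neighbour (Euclidean distance).
    dists = np.linalg.norm(original_X[:, None, :] - treated_X[None, :, :], axis=2)
    pred = treated_y[np.argmin(dists, axis=1)]
    if np.issubdtype(original_y.dtype, np.number):
        return float(np.mean(np.abs(pred - original_y)))  # regression: MAE
    return float(np.mean(pred == original_y))             # classification: accuracy
```
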


How does it work in practice?

Our solution, Avatar, computes all of the above metrics and more. We take it as our mission to generate anonymized datasets with a fully explainable model and concrete privacy metrics that allow us to measure the degree of protection.

There are many things to take into consideration, and rendering a dataset anonymous should not be taken lightly: there are many pitfalls one can encounter that accidentally leak information. That’s why, in addition to the metrics and the associated guarantee of privacy, we generate a report that clearly outlines all the different metrics and the evaluation criteria they aim to measure, similar to what we have laid out above. It explains all the metrics in layman’s terms and additionally prints out statistics about the datasets, before and after anonymization.

In practice, anonymizing a dataset is always a tradeoff between guaranteeing privacy, and preserving utility. A fully random dataset is private, but serves no purpose.

We’ll examine how to measure the utility of a dataset, before and after anonymization, in a future post.

Interested in our solution? Contact us!

Writing : Tom Crasset & Olivier Regnier-Coudert
© Octopize 2022