March 14, 2023

Synthetic vs anonymous data

When personal data is reused ethically for a secondary purpose, different from the one for which it was originally collected, the terms "anonymous data" and "synthetic data" are often used interchangeably. However, these are two types of data with their own characteristics, and they should not be confused.

Definitions

Anonymous data: The General Data Protection Regulation (GDPR) defines anonymous data as:

"information that does not relate to an identified or identifiable
 natural person or that has been irreversibly anonymized."

In other words, anonymous data cannot be used to identify an individual, even when combined with external data sources (a register of voters, for instance). This type of data is not subject to the GDPR's data protection rules, as it is not considered personal data. When data is anonymous, the individuals from whom it was collected are protected from re-identification. This property makes anonymous data suitable for a variety of secondary uses, such as research, statistical analysis, and marketing, because its use does not require consent from the individuals concerned. However, it is important to note that the anonymization process must be carried out in accordance with the GDPR's strict guidelines to ensure the protection of personal data. These guidelines are illustrated by the three criteria identified by the European Data Protection Board (EDPB, formerly the Article 29 Working Party, WP29):

  • singling out
  • linkability
  • inference

See more details in this article.
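
To make the linkability criterion more concrete, here is a minimal Python sketch of a linkage attack against a table from which only the names were removed. The records, the column names, and the helper link_records are all invented for this illustration; they do not come from the EDPB guidelines or from any real dataset.

```python
# Hypothetical toy data: a "de-identified" health table (names removed)
# and a public voter register that shares the same quasi-identifiers.
health_records = [
    {"zip": "44000", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "44000", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "44100", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
    {"zip": "44100", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]
voter_register = [
    {"name": "Alice", "zip": "44000", "birth_year": 1980, "sex": "F"},
    {"name": "Bob", "zip": "44000", "birth_year": 1980, "sex": "M"},
    {"name": "Carol", "zip": "44100", "birth_year": 1975, "sex": "F"},
]

# Assumed set of attributes an attacker can find in both sources.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(released, register, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for register entries matching exactly one released row."""
    matches = {}
    for person in register:
        candidates = [
            row for row in released
            if all(row[k] == person[k] for k in keys)
        ]
        if len(candidates) == 1:  # unique match: the person is re-identified
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link_records(health_records, voter_register)
print(reidentified)
```

In this toy case, Alice and Bob each match a single row, so removing names alone did not make the table anonymous. Carol matches two rows and cannot be linked to one of them with certainty.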

Synthetic data: Artificially generated data that mimics the characteristics of real-world data. It is created using computer algorithms and statistical models to simulate data that resembles real-world data without containing any actual personal information.
Synthetic data is used for a variety of purposes, including training machine learning models, testing software applications, or testing a production environment. One of its main advantages is that it can be generated at scale, making it ideal for scenarios where real-world data is either expensive or tricky to obtain.
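
As a rough illustration of the idea of "statistical models that simulate data", the Python sketch below fits an independent Gaussian to each numeric column of a tiny table and samples new rows from it. This is a deliberately simplistic assumption chosen for brevity: it ignores correlations between columns, it is not the Avatar method, and the sample records and function names are hypothetical.

```python
import random
import statistics

# Hypothetical "real" records: (age, yearly income in thousands of euros).
real_data = [(34, 41.0), (45, 52.5), (29, 38.2), (51, 60.1), (38, 47.3)]


def fit_gaussian_columns(rows):
    """Estimate a (mean, standard deviation) pair for each column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


params = fit_gaussian_columns(real_data)
# Generation "at scale": many more rows than the five we started from.
synthetic_data = sample_synthetic(params, n=1000)
print(synthetic_data[:3])
```

The sampled rows follow the per-column averages and spreads of the original table, but nothing in this sketch measures or guarantees privacy.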

Anonymous data vs synthetic data

Because synthetic data is artificially generated, one might assume that it is anonymous by default. The ability to share the generation method rather than the data itself seems to be an additional guarantee of privacy and a paradigm shift in data use.

However, generative models can fail to protect the privacy of their training data. They can memorize specific details of the training data, including personal information about specific individuals, and reproduce this information in the generated synthetic data. One attack that exploits this weakness is the membership inference attack, in which an attacker attempts to determine whether a specific individual's data was used to train a machine learning model. It can lead to serious privacy violations, especially in sensitive domains.
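
The memorization risk described above can be sketched with a toy example. The Python code below assumes a deliberately bad, hypothetical generator that copies training rows with tiny noise, and an attacker who uses a simple distance-to-closest-record rule with an arbitrary threshold. Real membership inference attacks are more sophisticated; the data, names, and threshold here are illustrative assumptions only.

```python
import math
import random

rng = random.Random(42)

# Hypothetical training records (age, income in thousands of euros)
# and records the generator has never seen.
training = [(34, 41.0), (45, 52.5), (29, 38.2), (51, 60.1)]
outsiders = [(62, 30.0), (23, 75.0)]


def memorizing_generator(train, n):
    """A deliberately bad generator: copies training rows and adds tiny noise."""
    return [
        tuple(value + rng.gauss(0, 0.01) for value in rng.choice(train))
        for _ in range(n)
    ]


def distance_to_closest(record, synthetic):
    """Euclidean distance from a record to its nearest synthetic row."""
    return min(math.dist(record, row) for row in synthetic)


def guess_membership(record, synthetic, threshold=0.5):
    """Attacker's rule of thumb: a very close synthetic row suggests membership."""
    return distance_to_closest(record, synthetic) < threshold


synthetic = memorizing_generator(training, n=200)
for record in training + outsiders:
    print(record, "-> guessed member:", guess_membership(record, synthetic))
```

With this generator, every training record has a near-copy in the synthetic output and is flagged as a member, while the unseen records are not. The data is "synthetic", yet it leaks who was in the training set.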

Besides, anonymous data is not always synthetic. For instance, some anonymization methods are based on aggregating real-world data. k-anonymity is probably the best known of these aggregation methods, with l-diversity and t-closeness as its refinements. These methods rely solely on aggregation and cannot be considered synthetic, as they only generalize the content of the data. We thus have an example of data that is anonymous but not synthetic.

Nevertheless, keep in mind that aggregated data is not always anonymous either. Imagine a dataset containing the ages of individuals. Aggregating naively into classes such as 0-49, 50-99, and 100-149 would probably leave very few people in the third class, making them (too) easy to identify.
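
The age example can be checked in a few lines of Python. The list of ages and the threshold k=2 are assumptions made for this sketch; the point is only to count how many individuals end up in each class after naive binning.

```python
from collections import Counter

# Hypothetical ages; a single centenarian is enough to show the problem.
ages = [23, 31, 35, 47, 52, 58, 64, 71, 78, 85, 90, 101]


def age_class(age, width=50):
    """Map an age to a class label such as '0-49' or '100-149'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def class_sizes(values, width=50):
    """Count how many individuals fall into each class."""
    return Counter(age_class(v, width) for v in values)


def risky_classes(sizes, k=2):
    """Classes with fewer than k individuals do not hide anyone in a crowd."""
    return [label for label, count in sizes.items() if count < k]


sizes = class_sizes(ages)
print(sizes)
print("classes below k=2:", risky_classes(sizes))
```

Here the 100-149 class contains a single person, so publishing the aggregated table still singles that person out. Checking the size of every class is the intuition behind k-anonymity.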

Trying to explain the confusion

One explanation for why synthetic data is often confused with anonymous data may be that most, if not all, anonymization methods that do not rely on creating synthetic data have too many drawbacks to be effective. The shortcoming can be a lack of privacy, a lack of utility, or both.

For instance, an aggregation method not only loses some utility but also changes the data structure. Its output therefore cannot replace the sensitive data in an existing pipeline. We recommend this article if you want to dig further into the subject of existing anonymization methods.

This explains why, nowadays, someone wishing to anonymize data will probably use a synthetic data generation method.

At Octopize, with our Avatar method, we create avatars that look like the original data but are fake. We use metrics to check that the EDPB criteria are met while retaining as much of the data's utility as possible.

To sum up, privacy cannot be taken for granted when dealing with synthetic data. Generating private synthetic data is a topic that requires cutting-edge expertise, and naive approaches tend to expose sensitive information. However, when used with caution, synthesizing anonymized data is nowadays the most efficient way to retain maximum utility while preserving privacy.

Interested in synthesized anonymized data? Please contact us: contact@octopize.io!

 

Writing: Gaël Russeil & Morgan Guillaudeux
© Octopize 2022