How to evaluate the utility of synthetic data?

Synthetically generated data are an increasingly popular tool for data analysis and machine learning. By generating new records that mimic the statistical properties of the original data without replicating them, synthetic data make it possible to exploit the potential of the data without compromising individuals' privacy.

However, to ensure that synthetic data are actually fit for purpose, their utility must be evaluated. In this article, we explore how to evaluate the utility of synthetic data and ensure that they can be used effectively for analysis and modeling.

To evaluate the level of information retained in synthetic data, we use utility metrics which assess two aspects: consistency at the individual level and consistency at the population level.

Consistency at the individual level means that logical rules should be respected. This criterion is dataset-dependent, so it will not be developed further in this article.

Consistency at the population level means statistical similarity between the original and synthetic data. We assess this similarity at three levels: univariate distributions, bivariate dependencies, and the overall multidimensional structure of the dataset.

In this article, we will describe how to evaluate the retention of statistical information at the population level. This analysis is global and not specific to the use case. For specific use cases, it is recommended to compare original and synthetic data based on the target analysis.

There are as many ways to evaluate utility as there are possible analyses. Here, we focus on a representative sample of utility-retention metrics.

 

Comparing variable distributions


For each variable of the dataset, we compare its distribution in the original data (in grey) and in the synthetic data (in green). The Hellinger distance can be computed between the two distributions. It yields a score between 0 and 1: 0 means the two distributions are identical, while 1 means they share no common bins.

In the figure below, we can see small Hellinger distances, which reveal that the Avatar data distributions are similar to the original distributions.

 

[Figure: Hellinger distances between original and Avatar distributions]

In other cases, we can also use statistical tests such as the Kolmogorov-Smirnov test (for continuous variables) or the Chi-square test (for categorical variables) to assess whether the original and Avatar samples are drawn from the same underlying distribution.
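As a rough illustration, here is a minimal Python sketch of how the Hellinger distance and the Kolmogorov-Smirnov test could be computed for a single numeric column. It is not the Avatar implementation; the column data and the bin count are made up.

```python
import numpy as np
from scipy.stats import ks_2samp

def hellinger_distance(original, synthetic, bins=20):
    """Hellinger distance between two samples, binned on a shared grid."""
    edges = np.histogram_bin_edges(np.concatenate([original, synthetic]), bins=bins)
    p, _ = np.histogram(original, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    # 0 -> identical distributions, 1 -> no overlapping bins
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)
original_age = rng.normal(50, 10, 1000)   # stand-in for an original column
avatar_age = rng.normal(50, 10, 1000)     # stand-in for the Avatar column

print("Hellinger distance:", hellinger_distance(original_age, avatar_age))
print("Kolmogorov-Smirnov test:", ks_2samp(original_age, avatar_age))
```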

 

Comparing dependencies across the variables

Evaluating variable distributions alone is not sufficient. If we generated synthetic data by drawing each variable independently, the distributions would be preserved but the correlations across variables would be destroyed. Such synthetic data might be useless for analyses or modeling tasks that depend on those correlations. Therefore, in addition to distribution comparisons, it is also important to compare variable dependencies, or correlations. This is usually done with the Pearson correlation coefficient, which evaluates the linear relationships between numerical variables.

Here, we see that the Avatar data preserve the correlation matrix of the original data.

[Figure: correlation matrices of the original and Avatar data]

With this analysis, we see that the Avatar method preserves variable dependencies (bivariate analysis). Weak correlations stay weak through the anonymization, while the strongest stay strong. Other metrics, such as mutual information, can be computed to evaluate bivariate utility retention for categorical data.
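A minimal sketch of such a bivariate comparison is shown below, assuming two DataFrames with the same columns; the data, the column names and the noisy stand-in used as the "Avatar" dataset are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=500)
original = pd.DataFrame({
    "age": 50 + 10 * x,
    "weight": 70 + 5 * x + rng.normal(size=500),
    "smoker": rng.choice(["yes", "no"], size=500),
})
avatars = original.copy()
avatars[["age", "weight"]] = original[["age", "weight"]].to_numpy() + rng.normal(scale=1.0, size=(500, 2))

# Pearson correlations: compare the two matrices cell by cell.
num_cols = ["age", "weight"]
corr_diff = (original[num_cols].corr() - avatars[num_cols].corr()).abs()
print("Absolute differences between correlation matrices:\n", corr_diff)

# For categorical variables, mutual information plays a similar role.
mi_orig = mutual_info_score(original["smoker"], pd.cut(original["age"], 5).astype(str))
mi_avat = mutual_info_score(avatars["smoker"], pd.cut(avatars["age"], 5).astype(str))
print("Mutual information smoker/age:", round(mi_orig, 3), "vs", round(mi_avat, 3))
```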

 

Comparing the general information of the data

[Figure: multidimensional projection of the original (grey) and Avatar (green) data]

Preserving the general information contained in a dataset is a main concern of anonymization. To evaluate multidimensional utility, we can use factor analysis methods (FAMD, PCA and MCA), which let us study the relationships between many variables and the individuals of the dataset at once.

The visualization illustrates the similarity between the original data (in grey) and the Avatar data (in green). Links between variables and clusters of the dataset are maintained in the Avatar dataset.
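A simplified sketch of this multidimensional comparison, using PCA only (the Avatar evaluation uses FAMD, PCA or MCA depending on the variable types); the data are made up and the avatars are simulated as a noisy copy of the originals.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
original = rng.normal(size=(300, 5))                        # stand-in numeric dataset
avatars = original + rng.normal(scale=0.3, size=(300, 5))   # stand-in Avatar dataset

scaler = StandardScaler().fit(original)
pca = PCA(n_components=2).fit(scaler.transform(original))   # axes fitted on the original data only

orig_2d = pca.transform(scaler.transform(original))
avat_2d = pca.transform(scaler.transform(avatars))          # avatars projected onto the same axes

plt.scatter(orig_2d[:, 0], orig_2d[:, 1], c="grey", alpha=0.4, label="original")
plt.scatter(avat_2d[:, 0], avat_2d[:, 1], c="green", alpha=0.4, label="avatar")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```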

 

In the figure below, we can see that the information carried by the preanti variable is maintained through the anonymization.

[Figure: projection colored by the preanti variable]

 

In summary, it is important to verify that synthetic data preserve the useful information of the original data. This evaluation is done with utility metrics. By ensuring that synthetic data are consistent at both the individual and the population level, we can be confident that they can effectively replace original data for analysis and modeling purposes.

Have a look at our technical documentation to see an example of an anonymization report which evaluates the privacy and utility of the Avatar data. 

Want to understand more? Read our scientific article published in Nature npj digital medicine that demonstrates the utility conservation and privacy protection of the Avatar method in two medical use cases.

 

Writing: Julien Petot & Alban-Félix Barreteau

Synthetic vs. anonymous data

When it comes to using personal data for a secondary, ethical purpose beyond the original purpose of collection, the terms anonymous data and synthetic data are often used interchangeably. However, these are two types of data with their own characteristics, and they should not be confused.

Definitions

Anonymous data: The General Data Protection Regulation (GDPR) defines anonymous data as:

"information that does not relate to an identified or identifiable
 natural person or that has been irreversibly anonymized."

In other words, anonymous data is data that cannot be used to identify an individual, even when combined with other external data sources (a register of voters, for instance). This type of data is not subject to the GDPR's data protection rules, as it is not considered personal data. Once data is anonymous, the individuals from whom it was collected are protected from re-identification. This property makes anonymous data suitable for a variety of secondary uses, such as research, statistical analysis and marketing, since the use of anonymous data does not require consent from the individuals concerned. However, it is important to note that the anonymization process must be carried out in accordance with the GDPR's strict guidelines to ensure the protection of personal data. These guidelines are illustrated by the three criteria identified by the European Data Protection Board (EDPB, formerly WP29): singling out, linkability and inference.

See more details in this article.

Synthetic data: Artificially generated data that mimic the characteristics of real-world data. They are created using computer algorithms and statistical models to simulate data that resemble real-world data without containing any actual personal information.
Synthetic data are used for a variety of purposes, including training machine learning models, testing software applications, or testing a production environment. One of the main advantages of synthetic data is that they can be generated at scale, making them ideal for scenarios where real-world data are either expensive or tricky to obtain.

Anonymous data vs synthetic data

The fact that synthetic data are artificially generated might suggest that they are anonymous by default. The possibility of sharing the generation method rather than the data itself seems to be an additional privacy guarantee and a paradigm shift in data use.

However, generative models can also fail to protect the privacy of their training data. Generative models can memorize specific details of the training data, including the presence of specific individuals or personal information, and incorporate this information into the generated synthetic data. One attack that exploits this is the membership inference attack, where an attacker attempts to determine whether a specific individual's data was used to train a machine learning model. It can lead to serious privacy violations, especially in sensitive domains.

Besides, anonymous data is not always synthetic. For instance, some anonymization methods are based on aggregation of real-world data. K-anonymity is probably the best known of these aggregation methods, with l-diversity and t-closeness as its refinements. These methods rely solely on aggregation and cannot be considered synthetic, as they only generalize the content of the data. This gives us an example of data that is anonymous but not synthetic.

Nevertheless, do keep in mind that an aggregation is not always anonymous either. Let's imagine a dataset containing the age of individuals. Aggregating naively into classes like 0-49, 50-99 and 100-149 would probably leave very few people in the third class, making identification (too) easy.
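A quick sketch on simulated ages makes the problem visible (the data and class boundaries are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 105, size=10_000), name="age")

classes = pd.cut(ages, bins=[0, 49, 99, 149],
                 labels=["0-49", "50-99", "100-149"], include_lowest=True)
print(classes.value_counts())
# The "100-149" class holds only a handful of people: anyone known to be over
# 100 is trivially singled out, so this aggregation alone is not anonymous.
```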

Trying to explain the confusion

One explanation of why synthetic data is often confused with anonymous data might be that most, if not all, anonymization methods that do not rely on creating synthetic data have too many drawbacks to be effective. The shortcoming can be a lack of privacy, a lack of utility, or both.

For instance, an aggregation method will not only lose some utility but will also change the data structure. Thus, this method cannot replace sensitive data in a pipeline. We recommend this article if you want to dig further into the subject of existing anonymization methods.

This explains why, nowadays, someone wishing to anonymize data will most likely use a synthetic data generation method.

At Octopize, with our Avatar method, we create avatars that look like the original data but are fake. We use metrics to ensure that the EDPB criteria are respected while retaining as much utility as possible from the data.

To sum up, privacy should not be taken for granted when working with synthetic data. Generating private synthetic data requires cutting-edge expertise, and naive approaches tend to expose sensitive information. However, when done with caution, synthesizing anonymized data is today the most efficient way to keep a maximum of utility while preserving privacy.

Interested in synthesized anonymized data? Please contact us: contact@octopize.io!

 

Writing: Gaël Russeil & Morgan Guillaudeux

Evaluating the privacy of a dataset

One of the key points to tackle before diving into the privacy of a dataset is the notion of pseudonymization versus anonymization. These terms are often used interchangeably, but they are actually quite different in terms of the protection they offer individuals.

Note that pseudonymization is a required step before anonymization, as direct identifiers do not bring any value to a dataset.

To be considered anonymous, a dataset must satisfy the three criteria identified by the European Data Protection Board (EDPB, formerly known as WP29). To measure compliance with these criteria, you always compare the original dataset to its treated version, where the treatment is any technique that aims to improve the privacy of the dataset (noise addition, generative models, Avatars).

Privacy according to EDPB

Before diving into specific metrics and how they are measured, we have to clarify what we are actually trying to prevent.

We will take the official criteria from the EDPB and add some examples to highlight the key differences between the three.

These are singling out, linkability and inference.

Singling out example: you work at an insurance company and have a dataset of your clients and their vehicles. You simply remove the personal identifiers, i.e. their names. But given that the combination of the other values (vehicle type, brand, age of the vehicle, color) is unique, you are able to directly identify each and every one of your clients, even without their names being present.
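A sketch of how this can be checked in practice is to count how many rows are unique on their quasi-identifier combination; the toy table and column names below are made up.

```python
import pandas as pd

clients = pd.DataFrame({
    "vehicle_type": ["SUV", "sedan", "SUV", "coupe"],
    "brand":        ["A",   "B",     "A",   "C"],
    "vehicle_age":  [3,     7,       2,     5],
    "color":        ["red", "blue",  "red", "black"],
})

quasi_identifiers = ["vehicle_type", "brand", "vehicle_age", "color"]
group_sizes = clients.groupby(quasi_identifiers).size()
unique_combinations = int((group_sizes == 1).sum())
print(f"{unique_combinations} of {len(clients)} rows have a unique quasi-identifier combination")
# Each unique combination singles out one client even though names were removed.
```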

Linkability example: in a recruiting agency's dataset, clients and their salary, along with related information, are listed. In a separate, publicly available database (e.g. LinkedIn), you collect information such as job title, city and company. Given these, you are able to link each individual from one dataset to the other, and this enables you to learn new information, e.g. their salary.

Inference example: a pharmaceutical company owns a dataset of people who participated in a clinical trial. If you know that a particular individual is a man, and every man in the dataset is overweight, you can infer that this specific individual is overweight, without actually singling him out.

 

Evaluating Singling-out

The first family of metrics we will now introduce aims to evaluate the protection of a dataset against singling-out attacks. Such attacks can take different forms and so different complementary metrics are required. Some singling-out metrics are model-agnostic and so can be used on any pair of original and treated datasets. Other metrics require temporarily keeping a link between original and treated individuals.

Model-agnostic metrics

We now present two straightforward metrics that can be used on datasets treated by any technique. These metrics are particularly useful when it comes to comparing the results of different approaches.

Going further: our metrics

A dataset with high DTC and high CDR will ensure that the treatment that was applied to the data has changed the characteristics of the individuals. However, even if treated individuals are distant from the originals, there remains a risk that original individuals can be associated with their most similar treated counterpart.

At Octopize, our treatment generates synthetic anonymized data. We have developed additional metrics, placing ourselves in the worst-case scenario where an attacker has both the original and the anonymized data. Although unlikely in practice, this approach is recommended by the EDPB. The hidden rate and the local cloaking are metrics designed to measure the protection of the data against distance-based singling-out attacks. Both metrics require that the link between each individual and its synthetic version is available.

To illustrate these metrics, let us look at a simplified example where a cohort of animals (why not?) is anonymized (with our Avatar solution, for example).

With individual-centric anonymization solutions, a synthetic individual is generated from an original. The link between originals and synthetic individuals can be used to measure the level of protection against distance-based attacks. In our example, we see that the ginger cat was anonymized as a cheetah while the synthetic record created from the tiger is a black cat.

A distance-based attack assumes that singling-out can be done by associating an original with its most similar synthetic individual. In our example, a distance-based linkage would associate the ginger cat with the black cat, the tiger with the cheetah and so on.

The hidden rate measures the probability that an attacker makes a mistake when linking an individual with its most similar synthetic individual. In this illustration, we see that most distance-based matches are not correct, so the hidden rate is high, illustrating a good protection against distance-based singling-out attacks.

In this figure, we illustrate how the local cloaking is computed for a single original individual, here the ginger cat. Thanks to the link we are keeping temporarily, we know that the actual synthetic individual generated from the ginger cat is the cheetah. Its local cloaking is the number of synthetic records between itself and the cheetah. In this example, there is one such synthetic record: the black cat, meaning that the local cloaking of the ginger cat is 1. The same calculation is done for all originals.
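Below is a minimal, distance-based sketch of both metrics. It is not Octopize's exact implementation; it assumes purely numeric data where row i of `avatars` was generated from row i of `originals`.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
originals = rng.normal(size=(200, 4))
avatars = originals + rng.normal(scale=1.0, size=(200, 4))  # stand-in, row-aligned synthetic data

d = cdist(originals, avatars)        # d[i, j] = distance between original i and avatar j
idx = np.arange(len(originals))

# Hidden rate: share of originals whose nearest avatar is NOT their own avatar,
# i.e. the probability that a distance-based linkage is wrong.
hidden_rate = np.mean(d.argmin(axis=1) != idx)

# Local cloaking of original i: number of avatars strictly closer to i than its own avatar.
own_distance = d[idx, idx]
local_cloaking = (d < own_distance[:, None]).sum(axis=1)

print(f"hidden rate: {hidden_rate:.2f}, median local cloaking: {np.median(local_cloaking):.0f}")
```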

The four metrics we have just seen provide a good coverage of the protection against singling-out attacks but as we have seen at the start of this post, there are other types of attacks against which personal data should be protected.

 

Evaluating linkability

Metrics that meet the linkability criterion respond to a more common and more likely attack scenario.

The attacker has a treated dataset and an external identifying database (e.g. a voter's register) with information in common with the treated data (e.g. age, gender, zip code). The more information there is in common between the two databases, the more effective the attack will be.

Correlation protection rate

The Correlation Protection Rate evaluates the percentage of individuals that would not be successfully linked to their synthetic counterpart by an attacker using an external data source. The variables selected as being common to both databases must be likely to be found in an external data source (e.g. age should be considered, whereas insulin_concentration_D2 should not). To cover the worst-case scenario, we assume that the same individuals are present in both databases. In practice, some individuals in the anonymized database are not present in the external data source and vice versa. This metric also relies on the original-to-synthetic link being kept temporarily. This link is used to measure how many of the pairings are incorrect.
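The sketch below illustrates the underlying linkage attack and the share of individuals it fails to re-identify. It only captures the spirit of the Correlation Protection Rate, with made-up, purely numeric quasi-identifiers.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
external = rng.normal(size=(300, 3))                         # shared quasi-identifiers, e.g. encoded age, gender, zip code
synthetic = external + rng.normal(scale=1.0, size=(300, 3))  # stand-in: row i is individual i's synthetic record

# The attacker links each individual of the external source to the closest synthetic row.
linked = cdist(external, synthetic).argmin(axis=1)
correlation_protection_rate = np.mean(linked != np.arange(len(external)))
print(f"share of individuals NOT correctly linked: {correlation_protection_rate:.2f}")
```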

 

Evaluating inference

Metrics that meet the Inference criterion respond to another type of attack where the attacker seeks to infer additional information about an individual from the available anonymized data.

 

How does it work in practice?

Our solution, Avatar, computes all of the above metrics and more. We take it as our mission to generate anonymized datasets with a fully explainable model and concrete privacy metrics that allow us to measure the degree of protection.

There are many things to take into consideration: rendering a dataset anonymous should not be taken lightly, and there are many pitfalls through which one can accidentally leak information. That's why, in addition to the metrics and the associated privacy guarantees, we generate a report that clearly outlines all the different metrics and the evaluation criteria they aim to measure, similar to what we have laid out above. It explains all the metrics in layman's terms and additionally reports statistics about the datasets, before and after anonymization.

In practice, anonymizing a dataset is always a tradeoff between guaranteeing privacy, and preserving utility. A fully random dataset is private, but serves no purpose.

We’ll examine how to measure the utility of a dataset, before and after anonymization, in a future post.

Interested in our solution? Contact us !

Writing: Tom Crasset & Olivier Regnier-Coudert

Octopize - Mimethik Data present at the AI for Health 2022 Summit!

Octopize - Mimethik Data will be present at the AI for Health 2022 Summit!

Since 2018, AI for Health has been promoting and fostering the best innovations, use cases and collaborations in the health and AI ecosystem. Their "Summit" event brings together startups, public institutions, patient associations, healthcare professionals, tech, medtech and pharmaceutical companies...

At Octopize - Mimethik Data, we have developed and patented a unique method for anonymizing personal data, Avatar, which was successfully assessed by the CNIL in June 2020. Our method is marketed as a software or service allowing new uses in an ethical way and is already recognized in the health sector and in other verticals.

We will be delighted to meet you at this 5th edition 2022:

PS: we are planning to make our platform available for live testing of anonymization by Octopize, we will tell you more soon!

Make a 15 min appointment with us!

New motion design video!

How to optimize the use of your personal data with avatars?

Today, personal data is a risk factor and an opportunity that is poorly controlled by organizations.

Discover how the Avatar personal data anonymization solution, developed by Octopize, protects confidentiality while freeing up secondary uses of data: sharing outside the EU, enhancement, retention, research...

Avatars, the hidden revolution behind digital twins

Spearheading Industry 4.0, digital twins are now spreading to the healthcare sector. Boosted by the Covid-19 epidemic, their market is exploding, as are the risks weighing on the privacy of the individuals who provide the data. How can we unleash the potential of digital twins without compromising on ethics? We have the solution: avatars, a unique data anonymization method that has been successfully evaluated by the CNIL. Because re-identification is impossible in practice, avatarized data fall outside the scope of the GDPR. They become usable, shareable, even outside the European Union, and retainable without limits, while guaranteeing the quality of the initial data set. How do we differ from the competition? We prove all these points with our metrics. A real revolution in the current Health Data Hub context. What if avatars became the norm tomorrow?

 

"Houston, we've had a problem", said the Apollo 13 crew on April 17, 1970.

On the way to the moon, an explosion has just occurred on board the spacecraft. Hundreds of thousands of kilometers away, on Earth, NASA teams diagnose and solve the problem remotely thanks to several simulators, a kind of "digital double", synchronized thanks to the data flowing from the spacecraft. The crew returns safely. The ancestors of digital twins are born. NASA was the first to develop them, but it was not until 30 years later that the concept of the "digital twin" emerged.

 

What is a "digital twin"?

In 2002, Michael Grieves was a PLM (Product Lifecycle Management) researcher at the University of Michigan. During the presentation of a center dedicated to product lifecycle management, he explained for the first time to the industry representatives present the notion of a "digital twin": a digital replica of a physical object or system. It is not a fixed model but a dynamic one, reproducing the original's needs, its behavior and its evolution over time. As with Apollo 13, there is a deep connection between the physical entity and its digital twin: the flow of data from one to the other.

Since then, the concept of the digital twin has evolved little. It involves replicating an object (a piston or a car engine), a system (a nuclear power plant or a city) or an abstract process (a production schedule). The concept also applies to living things: a molecule, a cell, an organ or a patient, as well as a drug, a virus, a disease or an epidemic, can have their digital twin.

 

Digital twins are an evolution, more than a revolution, combining mathematical modelling and digital simulation.

 

The result of the growth of new technologies (IoT, big data, AI, cloud, etc.) and computing power, digital twins are an evolution, more than a revolution, combining mathematical modelling and digital simulation. Incoming data, wherever it comes from - real, synthetic, collected in real time using sensors or via pre-existing databases - feeds a mathematical model to fine-tune it. The model can then be transformed into a digital guinea pig, on which to test different scenarios via simulations, in order to predict the evolution of the real system.

Product design and life cycle, automotive and aeronautics, energy production and distribution, transport, smart building and urban planning, digital twins are now one of the pillars of Industry 4.0. They have recently spread to other sectors, such as logistics and, above all, healthcare. According to a study by MarketsandMarkets, the digital twins market could grow from $3.1 billion in 2020 to $48.2 billion in 2026, a spectacular 58% growth, partly due to the Covid-19 epidemic.

 

The promise of digital twins in healthcare, myth or reality?

Last January, at the CES (Consumer electronics show) in Las Vegas, Dassault Systèmes presented its latest feat, the digital twin of a human heart, the result of 7 years of development. Powered by data collected from hundreds of doctors, researchers and industrialists around the world, it replicates not only the anatomy of the heart, but also its functioning: the flow of electrical current along the nerves, the behaviour of muscle fibres, the reaction to various drugs, etc. Thanks to advances in medical imaging, this digital twin is easily customisable. It takes less than a day to replicate the morphology and pathologies of a patient's heart. 

Dassault Systèmes and its competitors are already working on other organs, including the lungs, liver and of course the brain, but exact replication is currently out of reach. And for good reason! Neurobiologists have yet to unravel all its mysteries. The perfect clone of the human body - modelling anatomy, genetics, metabolism, bodily functions and pathologies - is therefore not yet within reach. However, there is no need to wait for complete digital twins to make great strides. Digital twins, even partial ones, of certain organs, diseases or patient/drug combinations - such as those developed by the start-up ExactCure - are already sufficient to address specific problems.

 

If digital twins live up to their promise, they will ultimately signal the advent of personalised medicine.

 

Simulating the anatomy and functioning of our body at the molecular, cellular, tissue and organic levels; modelling tailor-made implants; simulating ageing or a disease; testing a drug or a vaccine on a virtual patient or cohort; rehearsing and assisting complex surgical procedures; monitoring patient flows in hospitals to rationalise human and technical resources: if digital twins fulfil all their promises, they will ultimately signal the advent of personalised medicine.

A study published in July 2021 in the journal Life Sciences, Society and Policy reviews the socio-ethical benefits of digital twins in health services. On the podium are the prevention and treatment of disease, followed by cost savings for some healthcare institutions, and finally, increased autonomy for patients - better informed, they are better able to make informed decisions about their care.  

Risks commensurate with the hopes raised

Nevertheless, there are still many hurdles to overcome before we reach this public health Eldorado. The fundamental problem lies in the crux of the digital twins' war: health data. This highly sensitive personal data contains genetic, biological, physical and lifestyle information. The same study warns of the number one socio-ethical risk of digital twins, mentioned by all participants: the violation of privacy. 

 

The fundamental problem is the crux of the digital twins' war: health data. This highly sensitive personal data contains genetic, biological, physical and lifestyle information.

 

If the digital twins are owned or hosted by private organisations, this information can be used without the knowledge of the patients, or even turned against them. The simplest example: a bank or insurance company with access to it could deny a loan or increase premiums to a sick person.

Add to this the security holes. If digital twins multiply, the risk of losing or having data stolen increases with them. But once the data has been leaked, it is too late. It can be used by anyone, in any way. This is a disaster scenario that is becoming increasingly common in France, where cyber attacks on healthcare organisations doubled in 2021. The theft of health insurance data from half a million French people in early 2022 is a striking example.

 

All the benefits of digital twins are therefore conditioned by the availability and quality of health data.

 


Then there is another risk: the low quality of the data. Indeed, AI algorithms are trained on available biomedical data. However, this data is often heterogeneous, incomplete and not always reliable. This is due to several reasons: lack of standardisation, pressure to publish, bias, tradition of not publishing failures, etc. Bad data means bad models and bad simulations. 

All the benefits of digital twins therefore depend on the availability and quality of health data. However, it is extremely difficult for researchers to retrieve and use this data, particularly in France, where its use is strictly limited by the GDPR (General Data Protection Regulation) and the Loi Informatique et Libertés. In particular, their transfer outside the European Union is prohibited, a particularly sensitive issue in the current public debate. The cases follow one another at a frantic pace, from Google Analytics to Meta. The government has even preferred to postpone its request for authorisation from the CNIL for the Health Data Hub, while this health data centralisation project undergoes a transformation.

 

Avatars to unlock the growth potential of digital twins

To unleash the growth potential of digital twins, there is already a solution proposed by Octopize - Mimethik Data, our deeptech start-up. We have developed a unique and patented method of data anonymisation: avatars. Data anonymisation is not new and the methods are multiplying all the time. However, most of them do not provide proof that it is impossible to re-identify patients, far from it. Our disruptive innovation, based on a new Artificial Intelligence technique, allows personal data to be exploited and shared in full respect of privacy. Unlike our competitors, we can prove through our metrics the effectiveness of our avatars in terms of both privacy and data quality. Our secret? An AI algorithm focused on each patient, not on the whole dataset.

For each patient (i.e. each row in the database), we use a KNN algorithm to identify a number of neighbouring records. From these neighbours we build our model. At this stage, the real patient and his data have "disappeared": it is impossible to know whether they are in the model or not, only his nearest neighbours are. We then generate an avatar using a local pseudo-stochastic model, i.e. we introduce random, and therefore non-reversible, noise for each attribute (i.e. each column in the database). It is impossible to go backwards: each time we run the model again for the same patient, we create a different avatar. This ensures anonymisation while preserving the granularity of the dataset, the correlations between variables and the distribution of each variable. Same Gaussian curves, same means and same standard deviations, to within an epsilon.
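For intuition only, here is a very rough sketch of that loop (nearest neighbours, then local, non-reversible noise) on numeric data. It is not the patented Avatar algorithm: the data, the choice of k and the local noise model are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
patients = rng.normal(size=(500, 4))   # one row per patient, numeric attributes only (illustrative)

k = 10
_, neighbour_idx = NearestNeighbors(n_neighbors=k).fit(patients).kneighbors(patients)

avatars = np.empty_like(patients)
for i, idx in enumerate(neighbour_idx):
    neighbours = patients[idx]
    # Local pseudo-stochastic step (illustrative): centre of the neighbourhood
    # plus random noise scaled, per attribute, by the neighbourhood's spread.
    avatars[i] = neighbours.mean(axis=0) + rng.normal(scale=neighbours.std(axis=0))
```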

 

The data, once avatarized, becomes synthetic data, without any risk of re-identification for the patients. It then falls outside the scope of the GDPR and its exploitation becomes unlimited.

 

The data, once avatarised, become synthetic data, without risk of re-identification for the patients. They then fall outside the scope of the GDPR and their exploitation becomes unlimited. They can be stored, exploited, shared and reused without geographical or temporal constraints. Moreover, the CNIL successfully evaluated our method in 2020, attesting to its compliance with the three anonymisation criteria described in the G29 opinion. Thanks to avatars, the privacy risk inherent in digital twins is eliminated.

Avatars are also easily deployable and scalable. They can be configured to suit all needs, from internal use to open data. Another advantage is that avatars also solve the problems of availability and bias of health data. From a real dataset, we can generate synthetic datasets that are larger than the initial database, as each individual can give rise to several avatars. In this way we can amplify a cohort. In the end, we propose labelled and "clean" health datasets, ready for use, ready for all uses.

 

Beyond digital twins, avatars are in themselves a revolution, and not only in the health field.

 

By addressing issues of privacy, data availability and data quality, avatarisation is therefore a great opportunity to unleash the growth potential of digital twins. But beyond that, avatars are a revolution in themselves, and not just in the health sector. Banking, insurance, telecoms, industry, energy: all sectors handling sensitive data now have a turnkey solution. With its avatars, Octopize - Mimethik Data defends an ethical approach in the service of value creation. We are firmly convinced that data avatarization, a disruptive innovation today, will be the new European standard tomorrow.

 

15/05/2022 © Octopize

 

Octopize, laureate of the i-Nov competition for its unique solution for anonymizing personal data: avatars

The Prime Minister has decided to award a contribution of approximately half a million euros from the Programme d'investissements d'avenir (P.I.A.) to the company Octopize as part of the 8th wave of the i-Nov innovation competition. Octopize was competing in the Digital Deep Tech theme. Its project concerns the deployment of its disruptive method of anonymising personal data: avatars.

Co-piloted by the Ministry of the Economy, Finance and Industrial and Digital Sovereignty and the Ministry of Ecological Transition and Territorial Cohesion, and operated by Bpifrance and ADEME, this competition rewards start-ups and SMEs with innovation projects that have great potential for the French economy. The government wishes to accelerate the development of innovative companies with a high technological content and at the cutting edge of research. The i-Nov competition favours companies that are leaders in their field and have the potential to become world-class. Octopize, a Nantes-based start-up with the Deeptech label, meets this dual objective of technological innovation and European ambition.

Avatars, a revolution for the personal data market

Indeed, Octopize aims to become the European leader in personal data anonymisation, thanks to a unique and patented method: avatars. This disruptive innovation, based on a new Artificial Intelligence technique, allows personal data to be used and shared in full respect of privacy. In 2020, the French Data Protection Authority (CNIL) successfully audited this method and certified the compliance of the solution with the three criteria on anonymisation described in the G29 opinion.

Avatars transform personal data into anonymous and statistically relevant synthetic data. Because the quality and structure of the original data are maintained, results remain easily reproducible. In addition, avatars fall outside the General Data Protection Regulation (GDPR). They are therefore usable, shareable (even outside the European Union) and retainable without time limit. What sets them apart from competing solutions? Thanks to its metrics, Octopize quantifies and proves the effectiveness of its avatars both in terms of privacy and data quality. Avatars become multi-use, multi-user data with no expiry date, no longer putting the individuals behind the data at risk.

In the age of big data, avatars are therefore a revolution for the personal data market. Indeed, while the exponential growth in the collection of personal data offers an immeasurable source of value, both for economic players and public services, it is accompanied by serious risks, weighing on the protection of the privacy of the individuals concerned. This is evidenced by the accumulation of cases linked to the hosting or processing of European personal data by American operators: Google Analytics, Meta... Avatars are the solution to exploit and share personal data in an ethical manner.

Avatars, already used in the health sector

Octopize's clients have already taken note. Avatars are already being marketed in a sector that collects highly sensitive data: health. Tabular data and time series are anonymised via the software or the service. The Data Clinic, for example, attached to the Nantes University Hospital, uses patient data with the agreement of the CNIL thanks to avatars. The same applies to the AP-HP, the CHU of Angers, Inserm, SOS Médecins, the European HAP2 project and the Health Data Hub, as well as start-ups such as Epidemium, EchOpen or Samdoc, and pharmaceutical laboratories, as recently illustrated with Roche.

Avatars are thus opening the way to the revaluation of health data. They unleash medical research and facilitate open science. The Octopize start-up is proud to contribute to this vital public health issue, exacerbated by the health crisis.

What if tomorrow, avatars became the norm in the European Union?

With the i-Nov funding, Octopize will accelerate R&D to extend the use of avatars to complex data (textual, spatial, etc.) and improve the industrialisation of the method, in order to become the European leader in data anonymisation. The strength of the Octopize method lies in its flexibility, which allows it to adapt to all needs, from internal use to open data, and in its robustness, which opens the way to a wide variety of uses. The next stage of Octopize is already underway and aims to conquer new French and European markets: banking, marketing, insurance, finance, mobility, communities, etc.

Octopize advocates a radical change in the use of data for the benefit of all and respectful of everyone: let's reserve personal data for personal use and use avatars for all other uses. What if tomorrow, avatars became the norm in the European Union?

With Octopize, let's harness the value of data for the benefit of all, while respecting everyone.

 

About Octopize

Octopize - Mimethik Data is a deeptech startup from Nantes that aims to become the European leader in anonymization. It has developed and patented a unique method for anonymizing personal data, which was successfully assessed by the CNIL in June 2020: avatars. The method is marketed in the form of software or a service allowing new uses in an ethical manner. It is already recognised in the health sector and in other verticals. The startup employs about ten people. In September 2021, Octopize raised €1.5 million from several venture capitalists, Bpifrance and business angels. Winner of the 2022 i-Nov competition, by decision of the Prime Minister, it is opening up to other economic sectors and confirming its ambition.
For more information: https://octopize-md.com/
Founder: Olivier BREILLACQ - linkedin.com/in/olivier-breillacq
Press contact: contact@octopize.io

 

About the i-Nov Competition

Launched in 2017 and co-piloted by the Ministry of Ecological Transition and the Ministry of the Economy, Finance and Recovery, the i-Nov competition already has more than 400 winners. It is part of the "Innovation Contest" continuum, which has three complementary components: i-PhD, i-Lab and i-Nov. The innovation contest marks a commitment by the State through financing, labelling and enhanced communication, making it possible to support the development of highly innovative and technological companies. Upstream, the i-PhD and i-Lab competitions aim to encourage the emergence and creation of Deeptech start-ups, born of advances in French cutting-edge research. Downstream, the i-Nov competition supports innovative development projects carried out by start-ups and SMEs. This competition is financed by the State via the Programme d'investissements d'avenir (P.I.A.) as part of France 2030. It mobilises up to 80 million euros per year around themes such as the digital revolution, the ecological and energy transition, health and security. It is operated by Bpifrance and ADEME. For the winners, it is an opportunity to obtain co-financing for their research, development and innovation project, the total costs of which are between €600 000 and €5 million. The prize is financial support of up to 45% of the project cost in the form of grants and recoverable advances.
For more information: https://www.gouvernement.fr/investissements-d-avenir-lancement-de-la-8eme-vague-du-volet-i-nov-du-concours-d-innovation

 

About the Programme d'Investissements d'Avenir (P.I.A.)

The P.I.A. has been running for 10 years and is managed by the Prime Minister's General Secretariat for Investment. It finances innovative projects that contribute to the country's transformation, sustainable growth and the creation of tomorrow's jobs. From the emergence of an idea to the dissemination of a new product or service, the P.I.A. supports the entire life cycle of innovation, between the public and private sectors, alongside economic, academic, regional and European partners. These investments are based on a demanding doctrine, open selective procedures, and principles of co-financing or return on investment for the State. The fourth P.I.A. (P.I.A.4) is endowed with 20 billion euros of commitments over the period 2021-2025, of which 11 billion euros will contribute to supporting innovative projects within the framework of the France Relance plan.
For more information: https://www.gouvernement.fr/le-programme-d-investissements-d-avenir

 

About Bpifrance

Bpifrance finances companies at every stage of their development through loans, guarantees and equity. Bpifrance supports them in their innovation and international projects. Bpifrance now also ensures their export activity through a wide range of products. Advice, university, networking and acceleration programmes for start-ups, SMEs and ETIs are also part of the offer to entrepreneurs. Thanks to Bpifrance and its 50 regional offices, entrepreneurs benefit from a close, single and efficient contact to support them and meet their challenges.
For more information: https://www.bpifrance.fr

What are the criteria for considering data to be truly anonymous?

How to measure the anonymity of a database?

In the age of Big Data, personal data is an essential raw material for the development of research and the operation of many companies. However, despite their great value, the use of this type of data necessarily implies a risk of re-identification and leakage of sensitive information, even after prior pseudonymisation treatment (see article 1). In the case of personal data, especially sensitive data, the risk of re-identification can be considered a betrayal of the trust of the individuals from whom the data originated, especially when they are used without clear and informed consent.

The implementation of the General Data Protection Regulation (GDPR) in 2018, and the Data Protection Act before it, offered an attempt to address this issue by initiating a change in the practices of collecting, processing and storing personal data. An independent advisory body specialising in privacy issues has also been set up: the European Data Protection Board (EDPB), formerly the Article 29 Working Party (G29). This body has published opinions (see the G29 opinion) which now serve as a reference for the European national authorities (the CNIL in France) in applying the GDPR.

The EDPB thus recognises the potential of anonymisation to enhance the value of personal data while limiting the risks for the individuals from whom they originate. As a reminder, data are considered anonymous if the re-identification of the original individuals is impossible; anonymisation is therefore an irreversible process. However, the anonymisation methods developed to meet this need are not infallible, and their effectiveness often depends on many parameters (see article 2). In order to use these methods optimally, further clarification is needed on the nature of anonymised data. The EDPB, in its Opinion 05/2014 on anonymisation techniques, identifies three criteria for determining the impossibility of re-identification, namely:

 

  1. Individualisation: is it still possible to isolate an individual?

The individualisation criterion corresponds to the most favourable scenario for an attacker, i.e. a person, malicious or not, seeking to re-identify an individual in a dataset. To be considered anonymous, a dataset must not allow an attacker to isolate a target individual. In practice, the more information an attacker has about the individual they wish to isolate in a database, the higher the probability of re-identification. Indeed, in a pseudonymised dataset, i.e. one that has been stripped of its direct identifiers, the remaining quasi-identifying information acts like a barcode of an individual's identity when considered together. Thus, the more prior information the attacker has about the individual he is trying to identify, the more precise a query he can make to try to isolate that individual. An example of an individualisation attack is shown in Figure 1.


Figure 1: Re-identification of a patient by individualisation in a dataset based on two attributes (Age, Gender)

One characteristic of this type of attack is the increased exposure of individuals with unusual characteristics. With only gender and height information, it will be easier for an attacker to isolate a woman who is 2 metres tall than a man who is 1.75 metres tall.

 

2. Correlation: is it still possible to link records about an individual?

Correlation attacks are the most common scenario. Therefore, in order to consider data as anonymous, it is essential that it meets the correlation criterion. Between the democratisation of Open Data and the numerous incidents linked to personal data leaks, the amount of data available has never been so large. These databases containing personal information, sometimes directly identifying, are opportunities for attackers to carry out re-identification attempts by cross-referencing. In practice, correlation attacks use directly identifying databases with information similar to the database to be attacked, as illustrated in Figure 2.


Figure 2: Illustration of a correlation attack. The directly identifying external database (top) is used to re-identify individuals in the attacked database (bottom). The correlation is done on the basis of common variables.

In the case of the tables illustrated in Figure 2, the attacker would have succeeded in re-identifying the 5 individuals in the pseudonymised database thanks to the two attributes common to both databases. Moreover, the re-identification would have allowed him to infer new sensitive information about the patients, namely the pathology that affects them. In this context, the more information the databases have in common, the higher the probability of re-identifying an individual by correlation.

 

3. Inference: can information about an individual be inferred?

The third and last criterion identified by the EDPB is probably the most complex to assess: the criterion of inference. In order to consider data as anonymous, it must be impossible to infer, with a high degree of certainty, new information about an individual. For example, if a dataset contains information on the health status of individuals who have participated in a clinical study and all the men over 65 in this cohort have lung cancer, then it is possible to infer the health status of certain participants. Indeed, knowing that a man over 65 is in this study is enough to say that he has lung cancer.
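A small sketch of how such an inference risk can be spotted directly in a table (the dataset and column names are made up):

```python
import pandas as pd

trial = pd.DataFrame({
    "gender":    ["M", "M", "F", "M", "F"],
    "age":       [70,  68,  72,  66,  40],
    "diagnosis": ["lung cancer", "lung cancer", "healthy", "lung cancer", "healthy"],
})

men_over_65 = trial[(trial["gender"] == "M") & (trial["age"] > 65)]
if men_over_65["diagnosis"].nunique() == 1:
    print("All men over 65 share the same diagnosis: it can be inferred for any of them.")
```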

The inference attack is particularly effective on groups of individuals sharing the same value for an attribute. If the inference succeeds, the disclosure of the sensitive attribute concerns the whole group of individuals identified.

These three criteria identified by the EDPB cover the majority of threats to data after it has been processed to preserve its security. If these three criteria are met, then the processing can be considered as anonymisation in the true sense of the word.

 

Can current techniques satisfy all three criteria?

Randomisation and generalisation techniques each have advantages and disadvantages with respect to each criterion (see Article 2). The assessment of the performance in meeting the criteria for several anonymisation techniques is shown in Figure 3, taken from the Opinion published by the former G29 on anonymisation techniques.


Figure 3: Strengths and weaknesses of the techniques considered

 

It is clear that none of these techniques can meet all three criteria simultaneously. They should therefore be used with caution in their most appropriate context. In addition to the methods evaluated, synthetic data seems to be a promising alternative for meeting all three criteria. However, methodologies for producing synthetic data face the challenge of proving this protection. At present, all synthetic data generation solutions rely on the principle of plausible deniability to prove the protection associated with a data item. In other words, if a piece of synthetic data were to happen to resemble an original piece of data, the defence would be that in such circumstances, it is impossible to prove that the synthetic data is related to an original piece of data. At Octopize, we have developed a unique methodology to produce synthetic data while quantifying and proving the protection provided. This evaluation is carried out through metrics specifically developed to measure the satisfaction of the criteria, namely individualisation, correlation and inference. We will develop the subject of metrics for assessing the quality and security of synthetic data in more detail in another article.

What anonymization techniques to protect your personal data?

What are the different anonymization techniques?

After having differentiated the concepts of anonymization and pseudonymization in a previous article, it is important for the Octopize team to take stock of the different existing techniques for anonymizing personal data.

Anonymization techniques

Before talking about anonymization of data, it should be noted that pseudonymization is necessary first to remove any directly identifying character from the dataset: this is an essential first security step. Anonymization techniques allow for the handling of quasi-identifying attributes. By combining them with a prior pseudonymization step, it is ensured that direct identifiers are taken care of and that all personal information related to an individual is protected.

Secondly, as a reminder, anonymization consists of using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. This technique has an irreversible character which implies that anonymized data are no longer considered as personal data, thus falling outside the scope of the GDPR.

To characterise anonymization, the EDPB (European Data Protection Board), formerly the G29 Working Party, has set out three criteria to be respected, namely individualisation, correlation and inference.

The EDPB then defines two main families of anonymization techniques, namely randomization and generalization.

RANDOMIZATION

Randomization involves changing the attributes in a dataset so that they are less precise, while maintaining the overall distribution.

This technique protects the dataset from the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy.

Randomization in practice: permuting data on the date of birth of individuals so as to alter the veracity of the information contained in a database.

GENERALIZATION

Generalization involves changing the scale of dataset attributes, or their order of magnitude, to ensure that they are common to a set of people.

This technique prevents individuals in a dataset from being singled out. It also limits the possible correlations between the dataset and others. Examples of generalization techniques include aggregation, k-anonymity, l-diversity and t-closeness.

Generalization in practice: in a file containing the date of birth of individuals, replacing this information with the year of birth only.

These different techniques make it possible to address certain issues, each with its own advantages and disadvantages. We will detail the operating principle of these different methods and illustrate, through concrete examples, the limits to which they are subject.

Which technique to use and why?

Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the privacy rights of the data subjects.

The randomization family:

1- Adding noise:

Principle: Modification of the attributes of the dataset to make them less accurate. Example: Following anonymization by noise addition, the age of patients is modified by plus or minus 5 years.
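A minimal sketch of this ±5 years example on a made-up age column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
patients = pd.DataFrame({"age": rng.integers(20, 90, size=1000)})

# Add uniform noise of up to ±5 years, then keep ages non-negative.
patients["age_noisy"] = (patients["age"] + rng.integers(-5, 6, size=len(patients))).clip(lower=0)
# The overall distribution is roughly preserved, but each individual value is no longer exact.
```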

Strengths:

Weaknesses:

Common errors:

Failure to use:

Netflix case:

In the Netflix case, the initial database had been made publicly "anonymised" in accordance with the company's internal privacy policy (removing all identifying information about users except ratings and dates).

In this case, researchers showed that users could be re-identified by cross-referencing with a database external to Netflix: 99% of users were uniquely identified using 8 ratings and dates with a margin of error of 14 days, and 68% using only 2 ratings and dates with a 3-day margin of error.

 

2- Permutation:

Principle:

Consists of mixing attribute values in a table in such a way that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the dataset by simply swapping them from one record to another. Example: As a result of permutation anonymization, the age of patient A has been replaced by that of patient J.
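A minimal sketch of permutation on a made-up age column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
patients = pd.DataFrame({"patient": list("ABCDEFGHIJ"),
                         "age": rng.integers(20, 90, size=10)})

patients["age_permuted"] = rng.permutation(patients["age"].to_numpy())
# The marginal distribution of age is untouched, but its link to each patient
# (and to every other column) is broken.
```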

Strengths:

Weakness:

- Doesn’t allow for the preservation of correlations between values and individuals, thus making it impossible to perform advanced statistical analyses (regression, machine learning, etc.).

Common mistakes:

Failure to use: the permutation of correlated attributes

In the following example, we can see that intuitively, we will try to link salaries with occupations according to the correlations that seem logical to us (see arrow).

Thus, the random permutation of attributes does not offer guarantees of confidentiality when there are logical links between different attributes.


Table 1. Example of inefficient anonymization by permutation of correlated attributes

 

3- Differential Privacy:

Principle: Differential Privacy is the production of anonymized views of a dataset while retaining a copy of the original data.

The anonymized view is generated in response to a third-party query on the database, and noise is added to the result. To be considered "differentially private", the presence or absence of a particular individual must not significantly change the outcome of the query.
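As an illustration, here is the textbook Laplace-mechanism sketch of a differentially private count query; it is not any particular product's implementation, and the data and epsilon are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(20, 90, size=10_000)   # illustrative raw data kept by the data holder

def dp_count(mask, epsilon):
    """Noisy answer to 'how many rows satisfy mask?'; a counting query has sensitivity 1."""
    return mask.sum() + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print("People over 65 (noisy count):", round(dp_count(ages > 65, epsilon=0.5)))
# Each answer consumes privacy budget: repeating the query and averaging the
# answers would cancel the noise, which is why a budget must be tracked.
```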

Strength:

Weaknesses:

Common mistakes:

- Not injecting enough noise: In order to prevent links from being made to knowledge from context, noise must be added. The challenge from a data protection perspective is to generate the appropriate level of noise to add to the actual responses, so as to protect the privacy of individuals without undermining the utility of the data.

- Not allocating a privacy budget: it is necessary to keep track of the queries made and to allocate a privacy budget so that the amount of noise added increases if a query is repeated.

Usability failures:

- Independent processing of each query: Without keeping a history of queries and adapting the noise level, the results of repeating the same query, or of a combination of queries, could lead to the disclosure of personal information. An attacker could in fact carry out several queries which would allow an individual to be isolated and one of his characteristics to emerge. It should also be taken into account that Differential Privacy only allows one question to be answered at a time. Thus, the original data must be maintained throughout the defined use.

- Re-identification of individuals: Differential privacy doesn't guarantee non-disclosure of personal information. An attacker can re-identify individuals and reveal their characteristics using another data source or by inference. For example, in this paper (source: https://arxiv.org/abs/1807.09173), researchers from the Georgia Institute of Technology (Atlanta) developed membership inference attacks that re-identify training (and therefore sensitive) data from a differentially private model. The researchers conclude that further research is needed to find a stable and viable differential privacy mechanism against membership inference attacks. Thus, differential privacy doesn't appear to be a totally secure protection.

The generalization family:

1- Aggregation and k-anonymity:

Principle: Generalization of attribute values to such an extent that several individuals share the same value. These two techniques aim to prevent a data subject from being isolated by grouping him or her with at least k other individuals. Example: so that at least 20 individuals share the same value, the age of all patients between 20 and 25 is set to 23 years.
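A small sketch of generalization followed by a k-anonymity check on made-up data (the age bands and the value of k are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(20, 36, size=200),
                   "zip": rng.choice(["75001", "75002"], size=200)})

# Generalize the exact age into bands, then check the size of every group
# defined by the generalized quasi-identifiers.
df["age_band"] = pd.cut(df["age"], bins=[19, 25, 30, 35], labels=["20-25", "26-30", "31-35"])
group_sizes = df.groupby(["age_band", "zip"], observed=True).size()

k = 20
print("Smallest group:", int(group_sizes.min()),
      "| k-anonymous for k =", k, ":", bool((group_sizes >= k).all()))
```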

Strength:

Weaknesses:

Common mistakes:

- Neglecting certain quasi-identifiers: The choice of k is the key parameter of the k-anonymity technique. The higher the value of k, the stronger the confidentiality guarantees. A common mistake is to increase this parameter without considering all the variables; yet sometimes a single variable is enough to re-identify a large number of individuals and make the generalization applied to the other quasi-identifiers useless.

- Low value of k: if k is too small, the weight of any single individual within a group is too large and inference attacks are more likely to succeed. For example, if k = 2, the probability that both individuals in a group share the same sensitive property is much higher than when k > 10.

- Not grouping individuals with similar weights: the parameter k must be adapted when the values of a variable are very unevenly distributed.

Failures in use:

The main problem with k-anonymity is that it does not prevent inference attacks. In the following example, if the attacker knows that an individual is in the dataset and was born in 1964, he or she also knows that this individual had a heart attack. Furthermore, if it is known that this dataset was obtained from a French organisation, it can be inferred that every individual resides in Paris, since the first three digits of the postal codes are 750.


Table 2. An example of poorly engineered k-anonymization

To overcome the shortcomings of k-anonymity, other aggregation techniques have been developed, notably l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring that each equivalence class contains at least l different values of the sensitive attribute (l-diversity) and that the distribution of the sensitive attribute within each class remains close to its distribution in the whole dataset (t-closeness).

Note that despite these improvements, these techniques do not fully address the main weaknesses of k-anonymity presented above.
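For illustration, a minimal sketch (with hypothetical column names and values) of how the l-diversity condition can be checked: count the distinct sensitive values within each equivalence class.

```python
import pandas as pd

# Hypothetical k-anonymized dataset: equivalence classes and a sensitive attribute.
df = pd.DataFrame({
    "age_bracket": ["20-30", "20-30", "20-30", "60-70", "60-70", "60-70"],
    "diagnosis": ["flu", "asthma", "flu", "heart attack", "heart attack", "heart attack"],
})

l_value = 2
distinct_values = df.groupby("age_bracket")["diagnosis"].nunique()
print(distinct_values)
# The 60-70 class contains a single diagnosis, so l-diversity (l = 2) fails:
print("l-diverse:", bool((distinct_values >= l_value).all()))
```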

Thus, these different generalization and randomization techniques each have security advantages but do not always fully meet the three criteria set out by the EDPB (formerly WP29/G29), as shown in Table 3, "Strengths and weaknesses of the techniques considered", produced by the CNIL.


Table 3. Strengths and weaknesses of the techniques considered

Based on more recent anonymization techniques, synthetic data are now emerging as better anonymization solutions.

The case of synthetic data

Recent years of research have seen the emergence of solutions that generate synthetic records while retaining a high degree of statistical relevance and facilitating the reproducibility of scientific results. They are based on models that capture and reproduce the global structure of the original data. A distinction is usually made between generative adversarial networks (GANs) and methods based on conditional distributions.

Strength:

Weakness:

The Avatar anonymization solution, developed by Octopize, uses a unique patient-centric design approach, allowing the creation of relevant and protected synthetic data while providing proof of protection. Its compliance with the 3 EDPB criteria has been demonstrated by the CNIL. Click here to learn more about avatars.

Rapid evolution of techniques

Finally, the CNIL (the French data protection authority) reminds us that, since anonymization and re-identification techniques are bound to evolve regularly, it is essential for any data controller concerned to monitor developments regularly in order to preserve the anonymous nature of the data produced over time. This monitoring must take into account the technical means available and the other data sources that may make it possible to lift the anonymity of the information.

The CNIL stresses that research into anonymization techniques is ongoing and shows that no technique is, in itself, infallible.

Sources:
EDPB: https://edpb.europa.eu/edpb_fr
WP29 Opinion 05/2014 on Anonymisation Techniques: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
Membership inference attacks: https://arxiv.org/pdf/1807.09173.pdf
Netflix: https://arxiv.org/PS_cache/cs/pdf/0610/0610105v2.pdf

Is your data pseudonymized or anonymized?

What is the difference between anonymization and pseudonymization?

The notion of anonymous data crystallizes many misunderstandings and misconceptions, to the point that the term "anonymous" means different things depending on who uses it.
In order to re-establish a consensus, the Octopize team wanted to discuss the differences between pseudonymization and anonymization, two notions that are often confused.
At first glance, the term "anonymization" evokes the notion of a mask, of concealment. One might then imagine that anonymization simply amounts to masking the directly identifying attributes of an individual (last name, first name, social security number). This shortcut is precisely the trap to avoid: masking these attributes is in fact pseudonymization.
Although the two concepts look similar at first sight, there are major differences between them, both from a legal and a security point of view.

What is pseudonymization?

According to the CNIL, pseudonymization is "the processing of personal data in such a way that it is no longer possible to attribute the data to a natural person without additional information". It is one of the measures recommended by the GDPR to limit the risks related to the processing of personal data.

But pseudonymization is not a method of anonymization. Pseudonymization merely reduces the linkability of a dataset with the original identity of a data subject and is therefore a useful, but not absolute, security measure. In practice, pseudonymization consists in replacing directly identifying data (last name, first name, etc.) in a dataset with indirectly identifying data (alias, sequential number, etc.), thus preventing the direct re-identification of individuals.

However, pseudonymization is not an infallible protection, because the identity of an individual can also be deduced from a combination of several pieces of information called quasi-identifiers. In practice, pseudonymized data therefore remains indirectly re-identifiable by cross-referencing information: the identity of an individual can be betrayed by one of his or her indirectly identifying characteristics. The transformation is reversible, which is why pseudonymized data is still considered personal data. To date, the most widely used pseudonymization techniques are based on secret-key cryptographic systems, hash functions, deterministic encryption and tokenization.
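For illustration, a minimal sketch of one such technique: deterministic keyed hashing with Python's standard hmac module. The secret key, field names and values are assumptions made for the example.

```python
import hashlib
import hmac

SECRET_KEY = b"keep-this-key-separate-from-the-shared-dataset"  # illustrative only

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a deterministic keyed hash (a pseudonym)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice Martin", "zip_code": "75011", "year_of_birth": 1964}
record["name"] = pseudonymize(record["name"])
print(record)
# The direct identifier is masked, but the quasi-identifiers (zip code,
# year of birth) remain and may still allow indirect re-identification.
```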

The "AOL (America On Line) case" is a typical example of the misunderstanding that exists between pseudonymization and anonymization. In 2006, a database containing 20 million keywords from the searches of more than 650,000 users over a period of three months was made public, with no other measure to preserve privacy than the replacement of the AOL user ID by a numerical attribute (pseudonymization).
Despite this treatment, the identity and location of some users were made public. Indeed, queries sent to a search engine, especially if they can be coupled with other attributes, such as IP addresses or other configuration parameters, have a very high potential for identification.

This incident is just one example among many showing that a pseudonymized dataset is not anonymous: simply replacing the identity does not prevent an individual from being re-identified from quasi-identifying information (age, gender, zip code). In many cases, identifying an individual in a pseudonymized dataset can be as easy as identifying him or her in the original data (much like a game of "Guess Who?").

What is the difference with anonymization?

Anonymization consists in using techniques that make it impossible, in practice, to re-identify the individuals who provided the anonymized personal data. This treatment is irreversible and implies that the anonymized data is no longer considered personal data, thus falling outside the scope of the GDPR. To characterise anonymization, the European Data Protection Board (formerly WP29) relies on the 3 criteria set out in Opinion 05/2014 (source at foot of page):

- Individualization (singling out): anonymized data must not make it possible to single out an individual. Even with all the quasi-identifying information about an individual, it must be impossible to distinguish him or her in the database once anonymized (a small illustration of this criterion is given after this list).

- Correlation: anonymized data must not be re-identifiable by crossing it with other data sets. Thus, it must be impossible to link two data sets from different sources concerning the same individual. Once anonymized, an individual's health data should not be linkable to his or her banking data based on common information.

- Inference: the data must not make it possible to infer, with significant probability, additional information about an individual. For example, it must not be possible to determine with certainty the health status of an individual from the anonymous data.
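As announced above, here is a rough, hedged sketch of how the individualization criterion can be checked empirically on a released table (the column names and values are assumptions): count how many records remain unique on a combination of quasi-identifiers; strictly anonymized data should leave no such singletons.

```python
import pandas as pd

# Hypothetical released dataset restricted to its quasi-identifiers.
df = pd.DataFrame({
    "year_of_birth": [1964, 1964, 1985, 1985, 1990],
    "zip_code": ["75011", "75011", "75015", "75015", "75020"],
    "gender": ["F", "M", "F", "F", "M"],
})

quasi_identifiers = ["year_of_birth", "zip_code", "gender"]
group_sizes = df.groupby(quasi_identifiers).size()
singletons = int((group_sizes == 1).sum())
print(f"{singletons} record(s) can be singled out by their quasi-identifiers.")
```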

It is only when these three criteria are met that data is considered anonymous in the strict sense. It then changes legal status: it is no longer considered personal data and falls outside the scope of the GDPR.

Our solution: Avatar

There are currently several families of anonymization methods, which we will detail in our next article. For the most part, these methods provide protection by degrading the quality, structure or granularity of the original data, thus limiting its informational value after processing. The real challenge is to resolve the paradox between the legitimate protection of everyone's data and its exploitation for the benefit of all.

The Avatar anonymization method, developed by Octopize, is unique in that it resolves the paradox between protecting patients' personal data and sharing this data for its informative value. The Avatar solution, which has been successfully evaluated by the CNIL, uses synthetic data to ensure, on the one hand, the confidentiality of the original data (and thus its risk-free sharing) and, on the other hand, the preservation of its informative value.

Click here to learn more.

Sources: