Avatars, the hidden revolution behind digital twins

Spearheading Industry 4.0, digital twins are now spreading to the healthcare sector. Boosted by the Covid-19 epidemic, their market is exploding, as are the risks to the privacy of the individuals who provide the data. How can we unleash the potential of digital twins without compromising on ethics? We have the solution: avatars, a unique data anonymization method that has been successfully evaluated by the CNIL. Because re-identification is impossible in practice, avatarized data falls outside the scope of the GDPR. It becomes usable, shareable (even outside the European Union) and retainable without time limit, while preserving the quality of the initial dataset. How do we differ from the competition? We prove all of these points with our metrics. A real revolution in the current Health Data Hub context. What if avatars became the norm tomorrow?

 

"Houston, we've had a problem", said the Apollo 13 crew on April 17, 1970.

On its way to the Moon, hundreds of thousands of kilometres from Earth, an explosion has just occurred on board the spacecraft. Back on Earth, NASA teams diagnose and solve the problem remotely thanks to several simulators, a kind of "digital double" kept synchronized by the flow of data coming from the spacecraft. The crew returns safely. The ancestors of digital twins are born. NASA was the first to develop them, but it was not until some 30 years later that the concept of the "digital twin" emerged.

 

What is a "digital twin"?

In 2002, Michael Grieves was a PLM (Product Lifecycle Management) researcher at the University of Michigan. During the presentation of a center dedicated to product lifecycle management, he explained for the first time to the industry representatives present the notion of a "digital twin": a digital replica of a physical object or system. It is not a fixed model but a dynamic one, reproducing the needs, behaviour and evolution of its physical counterpart over time. As with Apollo 13, a deep connection links the physical entity and its digital twin: the flow of data from one to the other.

Since then, the concept of the digital twin has evolved little. It involves replicating an object (a piston or a car engine), a system (a nuclear power plant or a city) or an abstract process (a production schedule). The concept also applies to living things: a molecule, a cell, an organ or a patient can have a digital twin, as can a drug, a virus, a disease or an epidemic.

 

Digital twins are an evolution, more than a revolution, combining mathematical modelling and digital simulation.

 

The result of the growth of new technologies (IoT, big data, AI, cloud, etc.) and computing power, digital twins are an evolution, more than a revolution, combining mathematical modelling and digital simulation. Incoming data, wherever it comes from - real, synthetic, collected in real time using sensors or via pre-existing databases - feeds a mathematical model to fine-tune it. The model can then be transformed into a digital guinea pig, on which to test different scenarios via simulations, in order to predict the evolution of the real system.

Product design and life cycle, automotive and aeronautics, energy production and distribution, transport, smart building and urban planning: digital twins are now one of the pillars of Industry 4.0. They have recently spread to other sectors, such as logistics and, above all, healthcare. According to a study by MarketsandMarkets, the digital twins market could grow from $3.1 billion in 2020 to $48.2 billion in 2026, a spectacular compound annual growth rate of 58%, partly driven by the Covid-19 epidemic.

 

The promise of digital twins in healthcare, myth or reality?

Last January, at CES (the Consumer Electronics Show) in Las Vegas, Dassault Systèmes presented its latest feat, the digital twin of a human heart, the result of 7 years of development. Powered by data collected from hundreds of doctors, researchers and industry partners around the world, it replicates not only the anatomy of the heart but also its functioning: the propagation of electrical impulses, the behaviour of muscle fibres, the reaction to various drugs, etc. Thanks to advances in medical imaging, this digital twin is easily customisable: it takes less than a day to replicate the morphology and pathologies of a patient's heart.

Dassault Systèmes and its competitors are already working on other organs, including the lungs, liver and of course the brain, but exact replication of the brain is currently out of reach. And for good reason: neurobiologists have yet to unravel all its mysteries. The perfect clone of the human body - modelling anatomy, genetics, metabolism, bodily functions and pathologies - is therefore not yet within reach. However, there is no need to wait for complete digital twins to make great strides. Digital twins, even partial ones, of certain organs, diseases or patient/drug combinations - such as those developed by the start-up ExactCure - are already sufficient to address specific problems.

 

If digital twins live up to their promise, they will ultimately signal the advent of personalised medicine.

 

Simulating the anatomy and functioning of our body at the molecular, cellular, tissue and organ levels; modelling tailor-made implants; simulating ageing or a disease; testing a drug or a vaccine on a virtual patient or cohort; rehearsing and assisting complex surgical procedures; monitoring patient flows in hospitals to rationalise human and technical resources: if digital twins fulfil all their promises, they will ultimately signal the advent of personalised medicine.

A study published in July 2021 in the journal Life Sciences, Society and Policy reviews the socio-ethical benefits of digital twins in health services. On the podium are the prevention and treatment of disease, followed by cost savings for some healthcare institutions, and finally, increased autonomy for patients: better informed, they are better able to make decisions about their care.

Risks commensurate with the hopes raised

Nevertheless, there are still many hurdles to overcome before we reach this public health Eldorado. The fundamental problem lies with the lifeblood of digital twins: health data. This highly sensitive personal data contains genetic, biological, physical and lifestyle information. The same study warns of the number one socio-ethical risk of digital twins, mentioned by all participants: the violation of privacy.

 

The fundamental problem lies with the lifeblood of digital twins: health data. This highly sensitive personal data contains genetic, biological, physical and lifestyle information.

 

If digital twins are owned or hosted by private organisations, this information can be used without the patients' knowledge, or even turned against them. The simplest example: a bank or insurance company with access to it could deny a sick person a loan or increase their premiums.

Add to this the security holes. As digital twins multiply, the risk of data loss or theft increases with them. And once the data has leaked, it is too late: it can be used by anyone, in any way. This disaster scenario is becoming increasingly common in France, where cyber attacks on healthcare organisations doubled in 2021. The theft of health insurance data from half a million French people in early 2022 is a striking example.

 

All the benefits of digital twins are therefore conditioned by the availability and quality of health data.

 


Then there is another risk: the low quality of the data. Indeed, AI algorithms are trained on available biomedical data. However, this data is often heterogeneous, incomplete and not always reliable. This is due to several reasons: lack of standardisation, pressure to publish, bias, tradition of not publishing failures, etc. Bad data means bad models and bad simulations. 

All the benefits of digital twins therefore depend on the availability and quality of health data. However, it is extremely difficult for researchers to retrieve and use this data, particularly in France, where its use is strictly limited by the GDPR (General Data Protection Regulation) and the Loi Informatique et Libertés. In particular, its transfer outside the European Union is prohibited, a particularly sensitive issue in the current public debate. The cases follow one another at a frantic pace, from Google Analytics to Meta. The government has even preferred to postpone its request for authorisation from the CNIL for the Health Data Hub while this health data centralisation project undergoes a transformation.

 

Avatars to unlock the growth potential of digital twins

To unleash the growth potential of digital twins, there is already a solution proposed by Octopize - Mimethik Data, our deeptech start-up. We have developed a unique and patented method of data anonymisation: avatars. Data anonymisation is not new and the methods are multiplying all the time. However, most of them do not provide proof that it is impossible to re-identify patients, far from it. Our disruptive innovation, based on a new Artificial Intelligence technique, allows personal data to be exploited and shared in full respect of privacy. Unlike our competitors, we can prove through our metrics the effectiveness of our avatars in terms of both privacy and data quality. Our secret? An AI algorithm focused on each patient, not on the whole dataset.

For each patient (i.e. each row in the database), we use a KNN algorithm to identify a number of neighbouring records. From these neighbours we build our model. At this stage, the real patient and their data have "disappeared": it is impossible to know whether they are in the model or not; only their nearest neighbours are. We then generate an avatar using a local pseudo-stochastic model, i.e. we introduce random, and therefore non-reversible, noise for each attribute (i.e. each column in the database). It is impossible to go backwards: each time we run the model again for the same patient, we create a different avatar. This ensures anonymisation while preserving the granularity of the dataset, the correlations between variables and the distribution of each variable. Same Gaussian curves, same means and same standard deviations, to within an epsilon.
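To make the mechanism more concrete, here is a minimal sketch in Python of a KNN-plus-local-noise generator in the spirit described above. It is an illustrative toy rather than our patented algorithm: the column names, the choice of k, the Dirichlet weights and the Gaussian noise model are assumptions made for the example.

# Illustrative sketch only: a toy KNN-based "avatar" generator, not the patented Octopize method.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def generate_avatars(df: pd.DataFrame, k: int = 5, seed=None) -> pd.DataFrame:
    """For each row, build a local model from its k nearest neighbours
    and draw a synthetic record from it (local, non-reversible noise)."""
    rng = np.random.default_rng(seed)
    scaler = StandardScaler()
    X = scaler.fit_transform(df.to_numpy(dtype=float))  # numeric columns only
    # k + 1 neighbours because the closest neighbour of a point is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    avatars = np.empty_like(X)
    for i, neigh in enumerate(idx):
        neigh = neigh[1:]                       # drop the original individual
        w = rng.dirichlet(np.ones(k))           # random weights over the neighbourhood
        local_mean = w @ X[neigh]               # the original row no longer appears here
        local_std = X[neigh].std(axis=0)
        avatars[i] = local_mean + rng.normal(0.0, local_std)  # per-attribute noise
    return pd.DataFrame(scaler.inverse_transform(avatars), columns=df.columns)

# Example use on a purely numeric, hypothetical table:
# synthetic = generate_avatars(patients[["age", "weight", "systolic_bp"]], k=10)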

 

The data, once avatarized, becomes synthetic data, without any risk of re-identification for the patients. It then falls outside the scope of the GDPR and its exploitation becomes unlimited.

 

The data, once avatarised, become synthetic data, with no risk of re-identification for the patients. They then fall outside the scope of the GDPR and their exploitation becomes unlimited. They can be stored, exploited, shared and reused without geographical or temporal constraints. The CNIL has taken note: it successfully evaluated our method in 2020, attesting to its compliance with the three anonymisation criteria described in the G29 opinion. Thanks to avatars, the privacy risk inherent in digital twins is eliminated.

Avatars are also easily deployable and scalable. They can be configured to suit all needs, from internal use to open data. Another advantage: avatars also solve the problems of availability and bias in health data. From a real dataset, we can generate synthetic datasets that are larger than the initial database, as each individual can give rise to several avatars; in this way we can amplify a cohort. In the end, we offer labelled, "clean" health datasets, ready for any use.

 

Beyond digital twins, avatars are in themselves a revolution, and not only in the health field.

 

By addressing issues of privacy, data availability and data quality, avatarisation is therefore a great opportunity to unleash the growth potential of digital twins. But beyond that, avatars are a revolution in themselves, and not just in the health sector. Banking, insurance, telecom, industry, energy, all sectors handling sensitive data now have a turnkey solution. Octopize - Mimethik Data defends with its avatars an ethical point of view at the service of value creation. We are firmly convinced that data avatarization, a disruptive innovation today, will be the new European standard tomorrow.

 

15/05/2022 © Octopize - Cynthia Laboureau

 

Octopize, laureate of the i-Nov competition for its unique solution for anonymizing personal data: avatars

The Prime Minister has decided to award a contribution of approximately half a million euros from the Programme d'investissements d'avenir (P.I.A.) to the company Octopize as part of the 8th wave of the i-Nov innovation competition. Octopize was competing in the Digital Deep Tech theme. Its project concerns the deployment of its disruptive method of anonymising personal data: avatars.

Co-piloted by the Ministry of the Economy, Finance and Industrial and Digital Sovereignty and the Ministry of Ecological Transition and Territorial Cohesion, and operated by Bpifrance and ADEME, this competition rewards start-ups and SMEs with innovation projects that have great potential for the French economy. The government wishes to accelerate the development of innovative companies with a high technological content and at the cutting edge of research. The i-Nov competition favours companies that are leaders in their field and have the potential to become world-class. Octopize, a Nantes-based start-up with the Deeptech label, meets this dual objective of technological innovation and European ambition.

Avatars, a revolution for the personal data market

Indeed, Octopize aims to become the European leader in personal data anonymisation, thanks to a unique and patented method: avatars. This disruptive innovation, based on a new Artificial Intelligence technique, allows personal data to be used and shared in full respect of privacy. In 2020, the French Data Protection Authority (CNIL) successfully audited this method and certified the compliance of the solution with the three criteria on anonymisation described in the G29 opinion.

Avatars transform personal data into anonymous and statistically relevant synthetic data. By maintaining the quality and structure of the original data, the results are easily reproducible. Moreover, avatars fall outside the General Data Protection Regulation (GDPR). They are therefore usable, shareable (even outside the European Union) and retainable without time limit. What sets them apart from competing solutions? Thanks to its metrics, Octopize quantifies and proves the effectiveness of its avatars in terms of both privacy and data quality. Avatars become multi-use, multi-user data with no expiry date, no longer putting the individuals behind the data at risk.

In the age of big data, avatars are therefore a revolution for the personal data market. Indeed, while the exponential growth in the collection of personal data offers an immeasurable source of value, both for economic players and public services, it is accompanied by serious risks, weighing on the protection of the privacy of the individuals concerned. This is evidenced by the accumulation of cases linked to the hosting or processing of European personal data by American operators: Google Analytics, Meta... Avatars are the solution to exploit and share personal data in an ethical manner.

Avatars, already used in the health sector

Moreover, Octopize's clients have already taken note. Avatars are already being marketed in a sector that collects highly sensitive data: health. Tabular data and time series are anonymised via software or as a service. The Data Clinic, for example, attached to the Nantes University Hospital, uses patient data with the agreement of the CNIL thanks to avatars. The same applies to the AP-HP, the CHU of Angers, Inserm, SOS Médecins, the European HAP2 project and the Health Data Hub, as well as start-ups such as Epidemium, EchOpen or Samdoc, and pharmaceutical laboratories, as recently illustrated with Roche.

Avatars are thus opening the way to unlocking the value of health data. They unleash medical research and facilitate open science. The Octopize start-up is proud to contribute to this vital public health issue, made all the more pressing by the health crisis.

What if tomorrow, avatars became the norm in the European Union?

With the i-Nov funding, Octopize will accelerate R&D to extend the use of avatars to complex data (textual, spatial, etc.) and improve the industrialisation of the method, in order to become the European leader in data anonymisation. The strength of the Octopize method lies in its flexibility, which allows it to adapt to all needs, from internal use to open data, and in its robustness, which opens the way to a wide variety of uses. The next stage of Octopize is already underway and aims to conquer new French and European markets: banking, marketing, insurance, finance, mobility, communities, etc.

Octopize advocates a radical change in the use of data for the benefit of all and respectful of everyone: let's reserve personal data for personal use and use avatars for all other uses. What if tomorrow, avatars became the norm in the European Union?

With Octopize, let's harness the value of data for the benefit of all, while respecting everyone.

 

About Octopize

Octopize - Mimethik Data is a deeptech startup from Nantes that aims to become the European leader in anonymization. It has developed and patented a unique method for anonymizing personal data, successfully evaluated by the CNIL in June 2020: avatars. The method is marketed as software or as a service, enabling new uses of data in an ethical manner. It is already recognised in the health sector and in other verticals. The startup employs about ten people. In September 2021, Octopize raised €1.5 million from several venture capitalists, Bpifrance and business angels. Winner of the 2022 i-Nov competition, by decision of the Prime Minister, it is opening up to other economic sectors and confirming its ambition.
For more information: https://octopize-md.com/
Founder: Olivier BREILLACQ - linkedin.com/in/olivier-breillacq
Press contact: contact@octopize.io

 

About the i-Nov Competition

Launched in 2017 and co-piloted by the Ministry of Ecological Transition and the Ministry of the Economy, Finance and Recovery, the i-Nov competition already has more than 400 winners. It is part of the "Innovation Contest" continuum, which has three complementary components: i-PhD, i-Lab and i-Nov. The innovation contest marks a commitment by the State through financing, labelling and enhanced communication, making it possible to support the development of highly innovative and technological companies. Upstream, the i-PhD and i-Lab competitions aim to encourage the emergence and creation of Deeptech start-ups, born of advances in French cutting-edge research. Downstream, the i-Nov competition supports innovative development projects carried out by start-ups and SMEs. This competition is financed by the State via the Programme d'investissements d'avenir (P.I.A.) as part of France 2030. It mobilises up to 80 million euros per year around themes such as the digital revolution, the ecological and energy transition, health and security. It is operated by Bpifrance and ADEME. For the winners, it is an opportunity to obtain co-financing for their research, development and innovation project, the total costs of which are between €600 000 and €5 million. The prize is financial support of up to 45% of the project cost in the form of grants and recoverable advances.
For more information: https://www.gouvernement.fr/investissements-d-avenir-lancement-de-la-8eme-vague-du-volet-i-nov-du-concours-d-innovation

 

About the Programme d'Investissements d'Avenir (P.I.A.)

The P.I.A. has been running for 10 years and is managed by the Prime Minister's General Secretariat for Investment. It finances innovative projects that contribute to the country's transformation, sustainable growth and the creation of tomorrow's jobs. From the emergence of an idea to the dissemination of a new product or service, the P.I.A. supports the entire life cycle of innovation, between the public and private sectors, alongside economic, academic, regional and European partners. These investments are based on a demanding doctrine, open selective procedures, and principles of co-financing or return on investment for the State. The fourth P.I.A. (P.I.A.4) is endowed with 20 billion euros of commitments over the period 2021-2025, of which 11 billion euros will contribute to supporting innovative projects within the framework of the France Relance plan.
For more information: https://www.gouvernement.fr/le-programme-d-investissements-d-avenir

 

About Bpifrance

Bpifrance finances companies at every stage of their development through loans, guarantees and equity. Bpifrance supports them in their innovation and international projects. Bpifrance now also insures their export activity through a wide range of products. Advisory services, training, networking and acceleration programmes for start-ups, SMEs and mid-caps are also part of the offer to entrepreneurs. Thanks to Bpifrance and its 50 regional offices, entrepreneurs benefit from a single, close and efficient contact to support them and meet their challenges.
For more information: https://www.bpifrance.fr

What are the criteria for considering data to be truly anonymous?

How to measure the anonymity of a database?

In the age of Big Data, personal data is an essential raw material for the development of research and the operation of many companies. However, despite their great value, the use of this type of data necessarily implies a risk of re-identification and leakage of sensitive information, even after prior pseudonymisation treatment (see article 1). In the case of personal data, especially sensitive data, the risk of re-identification can be considered a betrayal of the trust of the individuals from whom the data originated, especially when they are used without clear and informed consent.

The implementation of the General Data Protection Regulation (GDPR) in 2018, and the Data Protection Act before it, attempted to address this issue by initiating a change in the practices of collecting, processing and storing personal data. An independent European body specialising in privacy issues has also been set up: the European Data Protection Board (EDPB), successor to the Article 29 Working Party (G29). This consultative body has published work (see the G29 article) which now serves as a reference for the European national authorities (the CNIL in France) in applying the GDPR.

The EDPB thus recognises the potential of anonymisation to enhance the value of personal data while limiting the risks for the individuals from whom they originate. As a reminder, data are considered anonymous if re-identification of the original individuals is impossible; anonymisation is therefore an irreversible process. However, the anonymisation methods developed to meet this need are not infallible and their effectiveness often depends on many parameters (see article 2). In order to use these methods optimally, further clarification is needed on what qualifies as anonymised data. In its Opinion 05/2014 on anonymisation techniques, the G29 (now the EDPB) identified three criteria for determining whether re-identification is impossible, namely:

 

1. Individualisation: is it still possible to isolate an individual?

The individualisation criterion corresponds to the most favourable scenario for an attacker, i.e. a person, malicious or not, seeking to re-identify an individual in a dataset. To be considered anonymous, a dataset must not allow an attacker to isolate a target individual. In practice, the more information an attacker has about the individual they wish to isolate in a database, the higher the probability of re-identification. Indeed, in a pseudonymised dataset, i.e. one that has been stripped of its direct identifiers, the remaining quasi-identifying information acts like a barcode of an individual's identity when considered together. Thus, the more prior information the attacker has about the individual he is trying to identify, the more precise a query he can make to try to isolate that individual. An example of an individualisation attack is shown in Figure 1.


Figure 1: Re-identification of a patient by individualisation in a dataset based on two attributes (Age, Gender)

A feature of this type of attack is the increased vulnerability of individuals with unusual characteristics. With only gender and height information, it will be easier for an attacker to isolate a woman who is 2 metres tall than a man who is 1.75 metres tall.
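To make this concrete, the sketch below reproduces an individualisation attack in a few lines of pandas; the table, pseudonyms and attribute values are invented for the example.

# Toy illustration of an individualisation attack on a pseudonymised table.
import pandas as pd

pseudonymised = pd.DataFrame({
    "id":        ["p1", "p2", "p3", "p4"],       # pseudonyms, not real names
    "age":       [34, 72, 34, 29],
    "gender":    ["F", "M", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "flu", "asthma"],
})

# The attacker only knows that the target is a 72-year-old man.
match = pseudonymised.query("age == 72 and gender == 'M'")
print(match)   # a single row: the target is isolated and the sensitive 'diagnosis' is revealed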

 

2. Correlation: is it still possible to link records relating to an individual?

Correlation attacks are the most common scenario. Therefore, in order to consider data as anonymous, it is essential that it meets the correlation criterion. Between the democratisation of Open Data and the numerous incidents linked to personal data leaks, the amount of data available has never been so large. These databases containing personal information, sometimes directly identifying, are opportunities for attackers to carry out re-identification attempts by cross-referencing. In practice, correlation attacks use directly identifying databases with information similar to the database to be attacked, as illustrated in Figure 2.


Figure 2: Illustration of a correlation attack. The directly identifying external database (top) is used to re-identify individuals in the attacked database (bottom). The correlation is done on the basis of common variables.

In the case of the tables illustrated in Figure 2, the attacker would have succeeded in re-identifying the 5 individuals in the pseudonymised database thanks to the two attributes common to both databases. Moreover, the re-identification would have allowed him to infer new sensitive information about the patients, namely the pathology that affects them. In this context, the more information the databases have in common, the higher the probability of re-identifying an individual by correlation.
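Such a linkage boils down to a join on the shared quasi-identifiers. The sketch below illustrates it in pandas on two made-up tables; the names, postcodes and pathologies are invented.

# Toy illustration of a correlation (linkage) attack: joining an identifying
# external database with a pseudonymised one on shared quasi-identifiers.
import pandas as pd

external = pd.DataFrame({
    "name":       ["Alice Martin", "Paul Durand"],
    "zip_code":   ["44000", "75011"],
    "birth_year": [1981, 1964],
})

pseudonymised = pd.DataFrame({
    "zip_code":   ["44000", "75011"],
    "birth_year": [1981, 1964],
    "pathology":  ["asthma", "heart attack"],
})

# Cross-referencing on the common attributes re-attaches names to pathologies.
re_identified = external.merge(pseudonymised, on=["zip_code", "birth_year"])
print(re_identified[["name", "pathology"]])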

 

3. Inference: can information about an individual be inferred?

The third and last criterion identified by the EDPS is probably the most complex to assess. This is the criterion of inference. In order to consider data as anonymous, it must be impossible to identify by inference, with a high degree of certainty, new information about an individual. For example, if a dataset contains information on the health status of individuals who have participated in a clinical study and all the men over 65 in this cohort have lung cancer, then it will be possible to infer the health status of certain participants. Indeed, knowing a man over 65 in this study is enough to say that he has lung cancer.

The inference attack is particularly effective on groups of individuals who all share the same value for an attribute. If the inference succeeds, the disclosure of the sensitive attribute concerns the whole group of identified individuals.
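A toy check of this scenario might look like the following; the cohort and diagnoses are invented for the example.

# Toy illustration of an inference attack: if every man over 65 in the cohort
# shares the same diagnosis, knowing age and gender is enough to infer it.
import pandas as pd

cohort = pd.DataFrame({
    "age":       [67, 71, 80, 45, 52],
    "gender":    ["M", "M", "M", "F", "M"],
    "diagnosis": ["lung cancer", "lung cancer", "lung cancer", "asthma", "flu"],
})

group = cohort[(cohort["age"] > 65) & (cohort["gender"] == "M")]
if group["diagnosis"].nunique() == 1:
    print("Inference succeeds:", group["diagnosis"].iloc[0],
          "can be attributed to every man over 65 in the dataset")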

These three criteria identified by the EDPB cover the majority of threats to data after it has been processed to protect privacy. If these three criteria are met, then the processing can be considered anonymisation in the true sense of the word.

 

Can current techniques satisfy all three criteria?

Randomisation and generalisation techniques each have advantages and disadvantages with respect to each criterion (see Article 2). The assessment of the performance in meeting the criteria for several anonymisation techniques is shown in Figure 3, taken from the Opinion published by the former G29 on anonymisation techniques.


Figure 3: Strengths and weaknesses of the techniques considered

 

It is clear that none of these techniques can meet all three criteria simultaneously. They should therefore be used with caution in their most appropriate context. In addition to the methods evaluated, synthetic data seems to be a promising alternative for meeting all three criteria. However, methodologies for producing synthetic data face the challenge of proving this protection. At present, all synthetic data generation solutions rely on the principle of plausible deniability to prove the protection associated with a data item. In other words, if a piece of synthetic data were to happen to resemble an original piece of data, the defence would be that in such circumstances, it is impossible to prove that the synthetic data is related to an original piece of data. At Octopize, we have developed a unique methodology to produce synthetic data while quantifying and proving the protection provided. This evaluation is carried out through metrics specifically developed to measure the satisfaction of the criteria, namely individualisation, correlation and inference. We will develop the subject of metrics for assessing the quality and security of synthetic data in more detail in another article.

What anonymization techniques to protect your personal data?

What are the different anonymization techniques?

After having differentiated the concepts of anonymization and pseudonymization in a previous article, it is important for the Octopize team to take stock of the different existing techniques for anonymizing personal data.

Anonymization techniques

Before talking about data anonymization, it should be noted that pseudonymization is needed first, to remove any directly identifying attributes from the dataset: this is an essential first security step. Anonymization techniques then handle the quasi-identifying attributes. By combining them with a prior pseudonymization step, we ensure that direct identifiers are taken care of and that all personal information relating to an individual is protected.

Secondly, as a reminder, anonymization consists of using techniques that make it impossible, in practice, to re-identify the individuals from whom the anonymized personal data originated. This processing is irreversible, which means that anonymized data are no longer considered personal data and thus fall outside the scope of the GDPR.

To characterise anonymization, the EDPB (European Data Protection Board), formerly the G29 Working Party, has set out three criteria to be respected, namely individualisation, correlation and inference (detailed in the previous article).

The EDPB then defines two main families of anonymization techniques, namely randomization and generalization.

RANDOMIZATION

Randomization involves changing the attributes in a dataset so that they are less precise, while maintaining the overall distribution.

This technique protects the dataset from the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy.

Randomization situation: permuting data on the date of birth of individuals so as to alter the veracity of the information contained in a database.

GENERALIZATION

Generalization involves changing the scale of dataset attributes, or their order of magnitude, to ensure that they are common to a set of people.

This technique avoids the individualisation of a dataset. It also limits the possible correlations of the dataset with others. Examples of generalisation techniques include aggregation, k-anonymity, l-diversity and t-closeness.

Generalization situation: in a file containing the date of birth of individuals, replacing this information by the year of birth only.

These different techniques make it possible to respond to certain issues, each with its own advantages and disadvantages. We will detail the operating principle of these different methods and illustrate, through factual examples, the limits to which they are subject.

Which technique to use and why?

Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the privacy rights of the data subjects.

The randomization family:

1- Adding noise:

Principle: Modification of the attributes of the dataset to make them less accurate. Example: Following anonymization by adding noise, the age of patients is modified by plus or minus 5 years.
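As a toy illustration of the principle, using the plus-or-minus 5-year example above (the data and column name are invented):

# Toy illustration of noise addition: perturbing ages by up to +/- 5 years
# while keeping the overall distribution roughly unchanged.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": [23, 35, 41, 58, 67, 72]})

df["age_noisy"] = df["age"] + rng.integers(-5, 6, size=len(df))  # uniform noise in [-5, 5]
print(df)
print("mean before:", df["age"].mean(), "mean after:", df["age_noisy"].mean())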

Strengths:

Weaknesses:

Common errors:

Failure to use:

Netflix case:

In the Netflix case, the initial database had been made public after being "anonymised" in accordance with the company's internal privacy policy (removing all identifying information about users except ratings and dates).

In this case, it was possible to re-identify 68% of Netflix users through a database external to Netflix, by cross-referencing. Users were uniquely identified in the dataset using 8 ratings and dates with a margin of error of 14 days as selection criteria.

 

2- Permutation:

Principle:

This consists of mixing attribute values in a table in such a way that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the dataset by simply swapping them from one record to another. Example: as a result of permutation, the age of patient A has been replaced by that of patient J.
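As a toy illustration (the patients and ages are invented):

# Toy illustration of permutation: the 'age' values are shuffled across rows,
# so each value stays in the dataset but is attached to a different individual.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"patient": ["A", "B", "C", "J"], "age": [34, 57, 29, 71]})

df["age_permuted"] = rng.permutation(df["age"].to_numpy())
print(df)   # the marginal distribution of ages is preserved, the links to patients are not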

Strengths:

Weakness:

- Doesn’t allow for the preservation of correlations between values and individuals, thus making it impossible to perform advanced statistical analyses (regression, machine learning, etc.).

Common mistakes:

Failure to use: the permutation of correlated attributes

In the following example (Table 1), we can see that, intuitively, we will try to link salaries with occupations according to the correlations that seem logical to us.

Thus, the random permutation of attributes does not offer guarantees of confidentiality when there are logical links between different attributes.


Table 1. Example of inefficient anonymization by permutation of correlated attributes

 

3- Differential Privacy:

Principle: Differential Privacy is the production of anonymized views of a dataset while retaining a copy of the original data.

The anonymized view is generated in response to a third-party query to the database, with noise added to the result. To be considered "differentially private", the presence or absence of any particular individual in the dataset must not significantly change the outcome of the query.
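As a hedged sketch of the underlying idea, the classic Laplace mechanism adds noise calibrated to the query's sensitivity and to a privacy parameter epsilon. This is a toy for a single counting query, not a production implementation.

# Toy Laplace mechanism for a counting query (sensitivity 1).
# Smaller epsilon = more noise = stronger privacy, weaker utility.
import numpy as np

rng = np.random.default_rng()

def private_count(values, predicate, epsilon=1.0):
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
    return true_count + noise

ages = [34, 72, 29, 65, 41, 80]
print(private_count(ages, lambda a: a > 60, epsilon=0.5))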

Strength:

Weaknesses:

Common mistakes:

- Not injecting enough noise: In order to prevent links from being made to knowledge from context, noise must be added. The challenge from a data protection perspective is to generate the appropriate level of noise to add to the actual responses, so as to protect the privacy of individuals without undermining the utility of the data.

- Not allocating a privacy budget: it is necessary to keep track of the queries made and to allocate a privacy budget, increasing the amount of noise added if a query is repeated.

Usability failures:

- Independent processing of each query: without keeping a history of queries and adapting the noise level, the results of repeating the same query, or of combining several queries, could lead to the disclosure of personal information. An attacker could in fact issue several queries that would allow an individual to be isolated and one of their characteristics to emerge. It should also be taken into account that differential privacy only allows one question to be answered at a time. The original data must therefore be maintained throughout the defined use.

- Re-identification of individuals: Differential privacy doesn’t guarantee non-disclosure of personal information. An attacker can re-identify individuals and reveal their characteristics using another data source or by inference. For example, in this paper (source: https://arxiv.org/abs/1807.09173) researchers from the Georgia Institute of Technology (Atlanta) have developed an algorithm, called "membership inference attacks", which re-identifies training (and therefore sensitive) data from a differential privacy model. The researchers conclude that further research is needed to find a stable and viable differential privacy mechanism against membership inference attacks. Thus, differential privacy doesn’t appear to be a totally secure protection.

The generalization family:

1- Aggregation and k-anonymity:

Principle: Generalization of attribute values to the point where several individuals share the same value. These two techniques aim to prevent a data subject from being isolated by grouping them with at least k other individuals. Example: so that at least 20 individuals share the same value, the age of all patients between 20 and 25 is set to 23 years.
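As a toy illustration of generalisation followed by a k-anonymity check (the data, the 10-year age bands and the value of k are invented for the example):

# Toy k-anonymity check: generalise ages into 10-year bands, then verify that
# every combination of quasi-identifiers is shared by at least k individuals.
import pandas as pd

df = pd.DataFrame({
    "age": [21, 23, 24, 52, 55, 58],
    "zip": ["44000", "44000", "44000", "75011", "75011", "75011"],
})

low = df["age"] // 10 * 10
df["age_band"] = low.astype(str) + "-" + (low + 9).astype(str)

k = 3
group_sizes = df.groupby(["age_band", "zip"]).size()
print(group_sizes)
print("k-anonymous for k =", k, ":", bool((group_sizes >= k).all()))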

Strength:

Weaknesses:

Common mistakes:

- Neglecting certain quasi-identifiers: The choice of the parameter k is the key parameter of the k-anonymity technique. The higher the value of k, the stronger the confidentiality guarantee. A common mistake, though, is to increase this parameter without considering all the variables: sometimes a single overlooked quasi-identifier is enough to re-identify a large number of individuals and render the generalization applied to the other quasi-identifiers useless.

- Low value of k: If k is too small, the weighting of an individual within a group is too large and attacks by inference are more likely to succeed. For example, if k=2 the probability that both individuals share the same property is greater than in the case where k >10.

- Failing to group individuals with similar weights: the parameter k must be adapted when the values of a variable are unevenly distributed.

Failure to use:

The main problem with k-anonymity is that it doesn't prevent inference attacks. In the following example, if the attacker knows that an individual is in the dataset and was born in 1964, he also knows that this individual had a heart attack. Furthermore, if it is known that this dataset was obtained from a French organisation, it can be inferred that each of the individuals resides in Paris, since the first three digits of the postal codes are 750.


Table 2. An example of poorly engineered k-anonymization

To overcome the shortcomings of k-anonymity, other aggregation techniques have been developed, notably l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring that each equivalence class contains at least L distinct values of the sensitive attribute (l-diversity) and that the distribution of sensitive values within each class resembles their distribution in the initial data (t-closeness).
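A toy l-diversity check on invented data might look like the sketch below; it flags exactly the kind of single-diagnosis class described in the failure example above.

# Toy l-diversity check: each equivalence class (same quasi-identifiers) must
# contain at least l distinct values of the sensitive attribute.
import pandas as pd

df = pd.DataFrame({
    "age_band":  ["60-69", "60-69", "60-69", "40-49", "40-49"],
    "zip":       ["750**", "750**", "750**", "750**", "750**"],
    "diagnosis": ["heart attack", "heart attack", "heart attack", "asthma", "flu"],
})

l = 2
diversity = df.groupby(["age_band", "zip"])["diagnosis"].nunique()
print(diversity)
print("l-diverse for l =", l, ":", bool((diversity >= l).all()))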

Note that, despite these improvements, these refinements do not address all of the main weaknesses of k-anonymity presented above.

Thus, these different generalization and randomization techniques each have security advantages, but they do not always fully meet the three criteria set out by the EDPB, formerly the G29, as shown in Table 3, "Strengths and weaknesses of the techniques considered", produced by the CNIL.


Table 3. Strengths and weaknesses of the techniques considered

Building on more recent anonymization techniques, synthetic data is now emerging as a better anonymization solution.

The case of synthetic data

Recent years of research have seen the emergence of solutions for generating synthetic records that retain a high degree of statistical relevance and facilitate the reproducibility of scientific results. They are based on models that learn and reproduce the global structure of the original data. A distinction is made between generative adversarial networks (GANs) and methods based on conditional distributions.
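As a very rough sketch of this family of approaches, here is a toy Gaussian model fitted to numeric columns; real generators (GANs, copulas, sequential conditional models) are far more expressive, and the column names are invented.

# Toy synthetic data generator: fit a multivariate Gaussian to the original
# numeric data and sample new records from it. This only illustrates the idea
# of sampling records from a learned global structure, not a real product.
import numpy as np
import pandas as pd

def naive_synthetic(df: pd.DataFrame, n: int, seed=None) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    X = df.to_numpy(dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    samples = rng.multivariate_normal(mean, cov, size=n)
    return pd.DataFrame(samples, columns=df.columns)

# Example on a hypothetical numeric table:
# synthetic = naive_synthetic(patients[["age", "weight", "systolic_bp"]], n=1000)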

Strength:

Weakness:

The Avatar anonymization solution, developed by Octopize, uses a unique patient-centric design approach, allowing the creation of relevant and protected synthetic data while providing proof of protection. Its compliance with the three EDPB criteria has been successfully evaluated by the CNIL. Click here to learn more about avatars.

Rapid evolution of techniques

Finally, the CNIL (the French National Data Processing and Liberties Commission) reminds us that since anonymization and re-identification techniques are bound to evolve regularly, it is essential for any data controller concerned to keep a regular watch to preserve the anonymous nature of the data produced over time. This monitoring must take into account the technical means available and other sources of data that may make it possible to remove the anonymity of information.

The CNIL stresses that research into anonymisation techniques is ongoing and consistently shows that no technique is infallible in itself.

Sources:
https://edpb.europa.eu/edpb_fr
https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
Membership Inference Attacks : https://arxiv.org/pdf/1807.09173.pdf
Netflix : https://arxiv.org/PS_cache/cs/pdf/0610/0610105v2.pdf

Is your data pseudonymized or anonymized?

What is the difference between anonymization and pseudonymization?

The notion of anonymous data crystallizes a lot of misunderstandings and misconceptions to the point that the term "anonymous" does not have the same meaning depending on the person who uses it.
In order to re-establish a consensus, the Octopize team wanted to discuss the differences between pseudonymization and anonymization, two notions that are often confused.
At first glance, the term "anonymization" evokes the notion of a mask, of concealment. We then imagine that the principle of anonymization amounts to masking the directly identifying attributes of an individual (name, first name, social security number). This shortcut is precisely the trap to avoid. Indeed, the masking of these parameters constitutes rather a pseudonymization.
At first glance, these two concepts are similar, but there are major differences between them, both from a legal and a security point of view.

What is pseudonymization?

According to the CNIL, pseudonymization is "the processing of personal data in such a way that it is no longer possible to attribute the data to a natural person without additional information". It is one of the measures recommended by the GDPR to limit the risks related to the processing of personal data.

But pseudonymization is not a method of anonymization. Pseudonymization merely reduces the linkability of a dataset to the original identity of a data subject; it is therefore a useful but not absolute security measure. In practice, pseudonymization consists of replacing the directly identifying data in a dataset (surname, first name, etc.) with indirectly identifying data (alias, sequential number, etc.), thus preventing the direct re-identification of individuals.

However, pseudonymization is not an infallible protection, because the identity of an individual can also be deduced from a combination of several pieces of information called quasi-identifiers. In practice, pseudonymized data therefore remains potentially re-identifiable indirectly, by cross-referencing information: the identity of an individual can be betrayed by one of their indirectly identifying characteristics. The transformation is reversible, which is why pseudonymized data are still considered personal data. To date, the most widely used pseudonymization techniques are based on secret-key cryptographic systems, hash functions, deterministic encryption and tokenization.
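To illustrate the difference, here is a minimal pseudonymisation sketch using a keyed hash as the token; the field names and key handling are simplified assumptions made for the example.

# Toy pseudonymisation: replace the direct identifier with a keyed hash (token).
# Quasi-identifiers (age, zip code) remain, so re-identification by
# cross-referencing is still possible: this is NOT anonymisation.
import hashlib
import hmac
import pandas as pd

SECRET_KEY = b"keep-this-key-out-of-the-dataset"

def pseudonymise(identifier: str) -> str:
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:12]

df = pd.DataFrame({
    "name": ["Alice Martin", "Paul Durand"],
    "age":  [34, 58],
    "zip":  ["44000", "75011"],
})

df["pseudonym"] = df["name"].map(pseudonymise)
df = df.drop(columns=["name"])
print(df)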

The "AOL (America On Line) case" is a typical example of the misunderstanding that exists between pseudonymization and anonymization. In 2006, a database containing 20 million keywords from the searches of more than 650,000 users over a period of three months was made public, with no other measure to preserve privacy than the replacement of the AOL user ID by a numerical attribute (pseudonymization).
Despite this treatment, the identity and location of some users were made public. Indeed, queries sent to a search engine, especially if they can be coupled with other attributes, such as IP addresses or other configuration parameters, have a very high potential for identification.

This incident is just one example of the many pitfalls that show that a pseudonymized dataset is not anonymous; simply changing the identity does not prevent an individual from being re-identified based on quasi-identifying information (age, gender, zip code). In many cases, it can be as easy to identify an individual in a pseudonymized dataset as it is from the original data (the "Who's That?" game).

What is the difference with anonymization?

Anonymization consists of using techniques that make it impossible, in practice, to re-identify the individuals who provided the anonymized personal data. This processing is irreversible and implies that the anonymized data are no longer considered personal data, thus falling outside the scope of the GDPR. To characterise anonymization, the European Data Protection Board (formerly WP29) relies on the three criteria set out in Opinion 05/2014 (source at foot of page):

- Individualization: anonymized data must not make it possible to single out an individual. Even with all the quasi-identifying information about an individual, it must be impossible to distinguish them in the database once anonymized.

- Correlation: anonymized data must not be re-identifiable by cross-referencing it with other datasets. It must be impossible to link two datasets from different sources concerning the same individual; once anonymized, an individual's health data must not be linkable to their banking data on the basis of common information.

- Inference: it must not be possible to reasonably infer additional information about an individual from the data. For example, it must not be possible to determine with certainty the health status of an individual from anonymous data.

It is when these three criteria are met that data is considered anonymous in the strict sense. It then changes legal status: it is no longer considered personal data and falls outside the scope of the GDPR.

Our solution: Avatar

There are currently several families of anonymization methods, which we will detail in our next article. For the most part, these methods provide protection by degrading the quality, structure or granularity of the original data, thus limiting its informational value after processing. The real challenge is to resolve the paradox between the legitimate protection of everyone's data and its exploitation for the benefit of all.

The Avatar anonymization method, developed by Octopize, is a unique anonymization method. It resolves the paradox between the protection of patients' personal data and the sharing of this data for its informative value. The Avatar solution, which has been successfully evaluated by the CNIL, uses synthetic data to ensure both the confidentiality of the original data (and thus its risk-free sharing) and the preservation of its informative value.

Click here to learn more.

Sources:
Opinion 05/2014 on anonymisation techniques: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf