Heart Disease

This use case is a classification exercise. The goal is to predict a binary value, here the presence of cardiac disease in a patient, from the other variables in the dataset.

Data type

The health data contains both qualitative and quantitative variables and records the presence or absence of cardiac pathology for each patient. The data is sensitive, since re-identification could disclose a patient's health status.
  • 303 individuals
  • 14 variables

Objectives of anonymization

In this use case, we identify two objectives.

  1. First, make it impossible to re-identify individuals in the dataset (the personal data protection objective).
  2. Second, preserve the predictive capacity of the data, measured by the performance of several common machine learning models.
In data analysis, reducing the dimension of the data and projecting it into a Euclidean space is a common way to identify the axes of greatest variance. We repurpose this technique here: after a factor analysis of mixed data (FAMD), we project the avatarized data into the space determined by the original data. The structure of the data is preserved by the avatar transformation. The stability of the confidence ellipses for each health status indicates how well the signal is conserved after the data is transformed into avatars.
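The idea of fitting a projection on the original data and reusing it for the avatars can be sketched as follows. This is only an illustration under stated assumptions: it swaps FAMD for a standardized PCA on numeric columns, the data is synthetic, and the "avatars" are a noise-perturbed placeholder rather than the output of the avatar method. The names `fit_projection` and `project` are hypothetical.

```python
import numpy as np

def fit_projection(original, n_components=2):
    """Fit a standardized PCA on the original data only."""
    mean = original.mean(axis=0)
    std = original.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    z = (original - mean) / std
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, std, vt[:n_components]

def project(data, mean, std, axes):
    """Project any dataset with the same columns into the fitted space."""
    return ((data - mean) / std) @ axes.T

rng = np.random.default_rng(0)
original = rng.normal(size=(303, 5))  # synthetic stand-in, numeric columns only
# Placeholder for avatarized data: small noise, NOT the avatar method.
avatars = original + rng.normal(scale=0.1, size=original.shape)

# The space is determined by the original data alone...
mean, std, axes = fit_projection(original)
# ...and both datasets are projected into that same space.
orig_2d = project(original, mean, std, axes)
avat_2d = project(avatars, mean, std, axes)
```

Because the axes are computed from the original data alone, any shift of the avatar points in this plane reflects a change introduced by the transformation rather than a change of coordinate system.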
The training protocol consists of training pairs of identical machine learning models: one on the original data and the other on its avatarized equivalent. Both models are then tested on the remaining original data.
The overall accuracy (the percentage of correct predictions) of the models is then compared. The models trained on avatars perform similarly to the models trained on the original data, regardless of the model used.
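A minimal sketch of this protocol is shown below, using only the Python standard library. Everything in it is an assumption made for illustration: the data is synthetic, the "avatars" are a noise-perturbed placeholder rather than real avatar output, and `NearestCentroid` and `compare_training_sets` are hypothetical names, not part of any Octopize tooling.

```python
import random

class NearestCentroid:
    """Tiny stand-in classifier exposing fit/predict."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def closest(row):
            return min(
                self.centroids_,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, self.centroids_[c])),
            )
        return [closest(row) for row in X]

def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_training_sets(make_model, X_orig, y_orig, X_avat, y_avat, X_test, y_test):
    """Train two identical models, one per training set; test both on original data."""
    training_sets = {"original": (X_orig, y_orig), "avatar": (X_avat, y_avat)}
    scores = {}
    for name, (X, y) in training_sets.items():
        model = make_model().fit(X, y)
        scores[name] = accuracy(y_test, model.predict(X_test))
    return scores

rng = random.Random(0)

def sample(n, center, label):
    return [([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label) for _ in range(n)]

# Synthetic two-class data standing in for the original dataset.
data = sample(150, 0.0, 0) + sample(150, 3.0, 1)
rng.shuffle(data)
train, test = data[:210], data[210:]
X_train, y_train = [x for x, _ in train], [y for _, y in train]
X_test, y_test = [x for x, _ in test], [y for _, y in test]

# Placeholder for avatarized training data: small noise, NOT the avatar method.
X_avatar = [[v + rng.gauss(0.0, 0.1) for v in row] for row in X_train]

scores = compare_training_sets(
    NearestCentroid, X_train, y_train, X_avatar, y_train, X_test, y_test
)
print(scores)
```

The harness only relies on `fit(X, y)` returning the fitted model and on `predict(X)`, the convention scikit-learn estimators follow, so the same comparison could be repeated for each model family under study.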

Other use cases

New York Taxi

The "New York Taxi" use case covers the anonymization of spatio-temporal data. The difficulty lies in the nature of this data: combining spatial and temporal dimensions increases the risk of re-identification.

Body Fat

This use case illustrates a supervised learning problem: the prediction of a continuous value, here the percentage of fat mass, from the other variables in the dataset.

Systolic Blood Pressure

This use case covers the specific context of time series anonymization.
© Octopize 2021