
Body Fat

This use case illustrates a supervised learning problem: predicting a continuous value, here the body fat percentage, from the other variables in the dataset.

Data type

The health data are all quantitative. They record each individual's body fat percentage along with other physiological variables. The dataset is representative of a clinical trial cohort of a few hundred individuals. The research context is characterized by moderate sensitivity and low risk.
  • 252 individuals
  • 15 variables

Objectives of anonymization

In this use case, several objectives can be identified.
  1. First, the goal is to make it impossible to re-identify individuals in the dataset (the personal data protection objective).

  2. Second, the goal is to preserve the predictive capacity of the data, in terms of both model performance and model explainability.
The training protocol consists of training two identical machine learning models (here RandomForest), one on the original data and the other on its avatarized equivalent. Both models are then tested on the remaining original data.
The graph on the right shows the prediction quality for each training set. The model trained on avatars predicts real-life values as well as the model trained on the original data.
This graph shows the importance of the variables in the resulting regression models. Comparing the model trained on real-life data with the model trained on avatars shows that they are interpreted in a similar way: the variables that dominate the prediction of the target value are broadly the same.
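
The protocol described above can be sketched as follows. This is only an illustrative outline, assuming scikit-learn and NumPy are available. The data is synthetic (252 rows, with an assumed split of 14 predictors and one target) and is not the Body Fat dataset. The "avatars" here are a hypothetical placeholder, a lightly perturbed copy of the training rows, and do not reproduce the avatar anonymization method. All variable and function names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 252 rows, 14 predictors, 1 continuous target.
n_rows, n_features = 252, 14
X = rng.normal(size=(n_rows, n_features))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows)

# Hold out part of the ORIGINAL data; both models are evaluated on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Placeholder "avatars" (assumption): a noisy copy of the training rows.
# A real pipeline would substitute the output of the anonymization step here.
X_avatar = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_avatar = y_train + rng.normal(scale=0.1, size=y_train.shape)


def fit_forest(features, target):
    """Train a random forest with fixed settings so both models are identical."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    return model.fit(features, target)


model_original = fit_forest(X_train, y_train)
model_avatar = fit_forest(X_avatar, y_avatar)

# Performance: score both models on the same held-out original rows.
r2_original = r2_score(y_test, model_original.predict(X_test))
r2_avatar = r2_score(y_test, model_avatar.predict(X_test))

# Explainability: compare which variables each model ranks as most important.
top_original = np.argsort(model_original.feature_importances_)[::-1][:3]
top_avatar = np.argsort(model_avatar.feature_importances_)[::-1][:3]

print(f"R2 (trained on original): {r2_original:.3f}")
print(f"R2 (trained on avatars):  {r2_avatar:.3f}")
print("Top variables (original):", top_original.tolist())
print("Top variables (avatars): ", top_avatar.tolist())
```

Using the same hyperparameters and the same held-out original rows for both models keeps the comparison fair: any difference in score or in variable ranking then comes from the training data alone.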

Other use cases

New York Taxi

The "New York Taxi" use case addresses the anonymization of spatio-temporal data. The difficulty lies in the particular nature of this data, where the combination of spatial and temporal dimensions increases the risk of re-identification.

Heart Disease

This use case is a classification exercise. The goal is to predict a binary value, here the presence of cardiac disease in the patient, from the other variables in the dataset.

Systolic Blood Pressure

This use case covers the specific context of time series anonymization.
© Octopize 2021