The health data in this use case are entirely quantitative: they record each individual's body fat percentage along with other physiological variables. The dataset is representative of a clinical trial cohort, with a few hundred individuals in the database. The research context is one of moderate sensitivity and low risk.
- 252 individuals
- 15 variables
Objectives of anonymization
This use case has two objectives:
- First, to make it impossible to re-identify individuals in the dataset (personal data protection).
- Second, to preserve the predictive capacity of the data, in terms of both model performance and explainability.
The protocol trains two identical machine learning models (here, RandomForest): one on the original data, the other on its avatarized equivalent. Both models are then tested on the remaining original data, which neither model saw during training.
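As a rough illustration of this protocol, here is a minimal Python sketch using scikit-learn. Almost everything specific in it is an assumption rather than part of the study: the data is synthetic (`make_regression`, sized to 252 rows with 14 predictors and one continuous target), the 70/30 split and the hyperparameters are arbitrary, and the "avatarized" training set is only a noisy copy used as a placeholder, since the avatar method itself is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cohort (assumption): 252 rows, 14 predictors + 1 target.
X, y = make_regression(
    n_samples=252, n_features=14, n_informative=5, noise=10.0, random_state=0
)

# Hold out part of the ORIGINAL data; both models are scored on this same set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Placeholder for the avatarized training data (assumption): a noisy copy.
# This is NOT the avatar method, only something with the same shape.
rng = np.random.default_rng(0)
X_avatar = X_train + rng.normal(scale=0.1 * X_train.std(axis=0), size=X_train.shape)
y_avatar = y_train + rng.normal(scale=0.1 * y_train.std(), size=y_train.shape)


def fit_and_score(X_fit, y_fit):
    """Train a RandomForest with fixed settings and score it on the original test set."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_fit, y_fit)
    return model, float(r2_score(y_test, model.predict(X_test)))


model_original, r2_original = fit_and_score(X_train, y_train)
model_avatar, r2_avatar = fit_and_score(X_avatar, y_avatar)

print(f"R2, trained on original data: {r2_original:.3f}")
print(f"R2, trained on placeholder avatars: {r2_avatar:.3f}")
```

The point of the design is that the two models share the same class, hyperparameters, and test set, so any gap between the two scores can be attributed to the training data alone.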
The graph on the right shows prediction quality for each model, according to the data it was trained on. A model trained on avatars can therefore predict real-life values with the same performance as a model trained on the original data.
This graph shows the importance of each variable in the two regression models. Comparing the model trained on real-life data with the one trained on avatars reveals similar interpretability: the variables that dominate the prediction of the target value are broadly the same.
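One way to put a number on "broadly the same" is to compare the two importance rankings directly. The sketch below is an assumption about how this could be done, not the method used in the study: it computes the share of top-k variables common to both models and a Spearman rank correlation (in its simple form, which assumes no tied importances). The variable names and importance values are made-up placeholders, not results from the dataset.

```python
def ranks(importances):
    """Map each variable name to its rank (1 = most important)."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def top_k_overlap(imp_a, imp_b, k):
    """Fraction of the k most important variables shared by both models."""
    top_a = set(sorted(imp_a, key=imp_a.get, reverse=True)[:k])
    top_b = set(sorted(imp_b, key=imp_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


def spearman(imp_a, imp_b):
    """Spearman rank correlation between two rankings (assumes no ties)."""
    rank_a, rank_b = ranks(imp_a), ranks(imp_b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * squared_diffs / (n * (n**2 - 1))


# Placeholder importances (hypothetical values, for illustration only).
imp_original = {"var_a": 0.50, "var_b": 0.25, "var_c": 0.15, "var_d": 0.10}
imp_avatar = {"var_a": 0.45, "var_b": 0.20, "var_c": 0.10, "var_d": 0.25}

print("top-2 overlap:", top_k_overlap(imp_original, imp_avatar, k=2))
print("Spearman correlation:", round(spearman(imp_original, imp_avatar), 3))
```

With scikit-learn models, the input dictionaries could be built from each fitted model's `feature_importances_` attribute paired with the column names.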