The health data combines qualitative and quantitative variables and indicates, for each patient, the presence or absence of cardiac pathology. It is sensitive data, since re-identifying a patient would disclose their health status.
In this use case, we identify two objectives.
- First, it must be impossible to re-identify individuals in the dataset: this is the personal data protection objective.
- Second, the data must keep its predictive capacity, measured by the performance of several common machine learning models.
In data analysis, reducing the dimension of the data and projecting them into a Euclidean space is a common way to determine the axes of greatest variance. We repurpose this technique here: following a factor analysis of mixed data (FAMD), we project the avatarized data into the space determined by the original data. The structure of the data is preserved by the avatar transformation process. The stability of the confidence ellipses associated with each health status indicates how well the signal is conserved once the data has been transformed into avatars.
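The projection step can be illustrated with a minimal, standard-library sketch. It is not the implementation used for this use case: the column names (`age`, `chol`, `sex`), the synthetic patients and the helper functions are all hypothetical, the "avatars" are simply a noisy copy of the original rows standing in for real avatar output, and only the first factorial axis is computed (by power iteration). What it does show is the key idea: the scaling parameters and the axis are learned on the original data only, and the avatars are then projected with those same parameters.

```python
import math
import random


def fit_encoder(rows, numeric_keys, categorical_keys):
    """Learn FAMD-style scaling parameters from the ORIGINAL rows only."""
    n = len(rows)
    numeric = {}
    for key in numeric_keys:
        values = [row[key] for row in rows]
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n) or 1.0
        numeric[key] = (mean, std)
    categorical = {}
    for key in categorical_keys:
        for level in sorted({row[key] for row in rows}):
            # Proportion of rows taking this level.
            categorical[(key, level)] = sum(row[key] == level for row in rows) / n
    return numeric, categorical


def encode(rows, encoder):
    """Standardize numeric columns; centre indicators and divide by sqrt(proportion)."""
    numeric, categorical = encoder
    encoded = []
    for row in rows:
        vector = [(row[k] - mean) / std for k, (mean, std) in numeric.items()]
        vector += [((row[k] == level) - p) / math.sqrt(p)
                   for (k, level), p in categorical.items()]
        encoded.append(vector)
    return encoded


def project(matrix, axis):
    """Coordinate of every row along one axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in matrix]


def leading_axis(matrix, iterations=200, seed=0):
    """First principal axis of the encoded data, found by power iteration."""
    rng = random.Random(seed)
    dim = len(matrix[0])
    axis = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iterations):
        scores = project(matrix, axis)
        axis = [sum(s * row[j] for s, row in zip(scores, matrix)) for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in axis))
        axis = [v / norm for v in axis]
    return axis


# Hypothetical synthetic patients (not the real dataset).
rng = random.Random(1)
original = [{"age": rng.gauss(55, 9), "chol": rng.gauss(240, 40),
             "sex": rng.choice(["F", "M"])} for _ in range(200)]
# Stand-in for avatars: a noisy copy, NOT the avatar method itself.
avatars = [{"age": r["age"] + rng.gauss(0, 2), "chol": r["chol"] + rng.gauss(0, 8),
            "sex": r["sex"]} for r in original]

encoder = fit_encoder(original, ["age", "chol"], ["sex"])  # fitted on originals only
axis = leading_axis(encode(original, encoder))             # axis from originals only
original_scores = project(encode(original, encoder), axis)
avatar_scores = project(encode(avatars, encoder), axis)    # avatars in the same space
```

In a fuller analysis one would keep at least two axes, plot `original_scores` and `avatar_scores` side by side, and draw one confidence ellipse per health status to compare them visually.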
The training protocol consists of training machine learning models in identical pairs: one on the original data and the other on the avatarized equivalent. Both models of each pair are then tested on the remaining original data.
The overall accuracy (the percentage of correct predictions) of the different models is then compared. The models trained on avatars perform similarly to the models trained on the original data, regardless of the model used.
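The protocol described above can be sketched as follows, using only the standard library. Everything here is an assumption made for illustration: the two-feature synthetic patients, the `NearestCentroid` toy classifier (a stand-in for the usual machine learning models) and the noisy copy of the training set that plays the role of the avatars. The point is the shape of the comparison: two identical models, one trained on original data and one on its avatarized counterpart, both scored on the same held-out original data.

```python
import random


def make_patient(rng, label):
    """Hypothetical patient with two features; label 1 shifts both upward."""
    shift = 2.0 if label == 1 else 0.0
    return [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)], label


class NearestCentroid:
    """Toy classifier: predict the class whose mean vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            members = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
                for x in X]


def accuracy(y_true, y_pred):
    """Share of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(42)
data = [make_patient(rng, i % 2) for i in range(400)]
rng.shuffle(data)
train, test = data[:300], data[300:]  # the test set stays original

X_train = [x for x, _ in train]
y_train = [y for _, y in train]
X_test = [x for x, _ in test]
y_test = [y for _, y in test]

# Stand-in for the avatarized training set: a noisy copy, NOT the avatar method.
X_avatar = [[v + rng.gauss(0, 0.3) for v in x] for x in X_train]

# Two identical models, differing only in their training data.
model_original = NearestCentroid().fit(X_train, y_train)
model_avatar = NearestCentroid().fit(X_avatar, y_train)

# Both are evaluated on the same held-out original data.
acc_original = accuracy(y_test, model_original.predict(X_test))
acc_avatar = accuracy(y_test, model_avatar.predict(X_test))
print(f"trained on original: {acc_original:.2f}, trained on avatars: {acc_avatar:.2f}")
```

Repeating this pairwise comparison for each model family gives the accuracy comparison described above.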