
New York Taxi

The "New York Taxi" use case covers the anonymization of spatio-temporal data. The difficulty lies in the nature of this data: combining spatial and temporal dimensions increases the risk of re-identification.

Data type

In this use case, the personal data is a sample of 1,451,721 cab trips made in New York City in 2016.
The dataset, initially pseudonymized, carries a high risk of re-identification because it combines spatial information (GPS coordinates of departure and arrival) with temporal information (departure and arrival times). For example, an attacker could infer an individual's place of residence from the information at their disposal.

  • 1,451,721 individuals (one record per trip)
  • 9 variables

Objectives of anonymization

This use case has two objectives.

  1. First, the goal is to make it impossible to re-identify individuals who have used the cab service.
  2. Second, the data must remain useful to the city of New York, especially for mobility and planning projects:
     • Identification of congested areas
     • Traffic density as a function of time
     • Identification of users' preferred routes

This information must be preserved while respecting the topographical plausibility of the original data. In particular, the avatars must not be placed at impossible GPS coordinates, such as in the East River or inside Central Park.
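
The plausibility constraint above can also be checked numerically, as a complement to a visual inspection. The sketch below is only an illustration and not the procedure used in this use case: it counts the share of points that fall inside a forbidden zone using a standard ray-casting point-in-polygon test. The function names and the `CENTRAL_PARK_APPROX` polygon (four rough corner coordinates) are assumptions made for the example; a real check would use an accurate map layer.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Assumption: very rough corners of Central Park, for illustration only.
CENTRAL_PARK_APPROX = [
    (-73.9730, 40.7644),
    (-73.9819, 40.7681),
    (-73.9582, 40.8006),
    (-73.9493, 40.7969),
]


def implausible_share(points, forbidden_zones):
    """Fraction of (lon, lat) points that fall inside any forbidden zone."""
    if not points:
        return 0.0
    hits = sum(
        any(point_in_polygon(lon, lat, zone) for zone in forbidden_zones)
        for lon, lat in points
    )
    return hits / len(points)
```

Running `implausible_share` on both the original and the avatar coordinates would give two comparable figures; a noticeably higher share for the avatars would point to a plausibility problem.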

At the end of the avatarization step, the original GPS coordinates are compared with those generated by the Avatar method. Plotting both sets on a basemap confirms the topographic plausibility of the avatars: the urban grid is preserved, as is the geographical density, with the majority of trips concentrated in the Manhattan area.
From a statistical point of view, studying the average speed over different time scales (hour, day, month) gives an overall view of all the parameters, because speed depends on both the coordinates and the timestamps of each trip. Comparing the results from the original data and the avatars shows that the information is very well preserved.
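
As a rough illustration of this comparison, the sketch below computes the mean speed per time bucket for a list of trips and the largest gap between two such summaries. It is a minimal example under stated assumptions: the field names (`pickup_time`, `dropoff_lat`, and so on) are hypothetical, and the distance is the straight-line (haversine) distance, which underestimates the distance actually driven.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def mean_speed_by(trips, bucket):
    """Mean speed (km/h) per bucket, e.g. bucket=lambda t: t.hour.

    Each trip is a dict with hypothetical keys: pickup_time, dropoff_time
    (datetime objects) and pickup_lat, pickup_lon, dropoff_lat, dropoff_lon.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for trip in trips:
        hours = (trip["dropoff_time"] - trip["pickup_time"]).total_seconds() / 3600
        if hours <= 0:
            continue  # skip records with a non-positive duration
        km = haversine_km(
            trip["pickup_lat"], trip["pickup_lon"],
            trip["dropoff_lat"], trip["dropoff_lon"],
        )
        key = bucket(trip["pickup_time"])
        totals[key][0] += km / hours
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def max_abs_gap(original_means, avatar_means):
    """Largest absolute difference over the buckets present in both summaries."""
    shared = original_means.keys() & avatar_means.keys()
    return max((abs(original_means[k] - avatar_means[k]) for k in shared), default=0.0)
```

Passing `lambda t: t.hour`, `lambda t: t.date()` or `lambda t: t.month` as `bucket` gives the hourly, daily and monthly views; `max_abs_gap` then summarizes how far the avatar curve departs from the original one.
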
Visualizing user traffic over the course of 2016 verifies that the avatar data can be combined with external sources. The sharp drop in activity recorded around January 26, which the avatar data preserves, corresponds to a snowstorm that hit New York. Avatars can therefore be used to enrich external data sources (here, meteorological data) without introducing bias.
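
A check of this kind can be sketched as follows: count trips per day, locate the sharpest day-over-day drop, and verify that the original and avatar data agree on its date. This is a minimal sketch, not the analysis behind the results above; the function names are hypothetical and the inputs are assumed to be lists of pickup `datetime` objects.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_counts(pickups):
    """Number of trips per calendar day, from a list of pickup datetimes."""
    return Counter(p.date() for p in pickups)


def sharpest_drop(counts):
    """Return (day, change) for the largest fall versus the previous calendar day.

    Returns (None, 0) when no day-over-day decrease is found.
    """
    best_day, best_change = None, 0
    for day in sorted(counts):
        previous = day - timedelta(days=1)
        if previous in counts:
            change = counts[day] - counts[previous]
            if change < best_change:
                best_day, best_change = day, change
    return best_day, best_change


def same_drop_day(original_pickups, avatar_pickups):
    """True if both datasets place their sharpest drop on the same day."""
    original_day, _ = sharpest_drop(daily_counts(original_pickups))
    avatar_day, _ = sharpest_drop(daily_counts(avatar_pickups))
    return original_day is not None and original_day == avatar_day
```

The date returned by `sharpest_drop` can then be looked up in an external source, such as a daily weather record, to see whether an event explains the drop.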

Other use cases

Body Fat

This use case illustrates a supervised learning problem: predicting a continuous value, here the percentage of body fat, from the other variables in the dataset.

Data transfer outside the EU

This use case illustrates the problem of transferring personal data to a partner outside the European Union.

Systolic Blood Pressure

This use case covers the specific context of time series anonymization.
© Octopize 2021