This repository contains the code for our work investigating the rate at which unicity decreases for large-scale populations in mobility data. It accompanies the paper "The risk of re-identification remains high even in country-scale location datasets" by Farzanehfar A., Houssiau F., and de Montjoye Y.-A., published in Patterns (Cell Press). We publish this code for transparency and reproducibility, and to help curious readers understand the details of the research.
This document briefly outlines what each code file contains so that the reader can navigate the repository.
The folders in the repository contain the following:
- library: Contains all the code used for the project, including the code for the model and the code used to generate the plots.
- results: Contains the results that can be used to recreate the plots in the paper.
- inputs: Contains, as far as possible, the inputs used in the model.
Here you can find a description of what each code file does.
- model_source.py: Contains the code used to generate trajectories based on the unicity model.
- dataformat_utils.py: Provides a series of helper functions to load, unload, and manipulate the data.
- unicity_utils.py: Contains the code used to compute unicity.
- geoloc_utils.py: Contains the code to construct the Delaunay tessellation from a set of coordinates, along with related helper functions.
- extract_time.py: Extracts the mean circadian distribution from data.
- extract_time_dp.py: Extracts the mean circadian distribution from data, with differential privacy (ϵ=1).
- extract_activity.py: Extracts the activity distribution from data.
- extract_frequency.py: Extracts the mean frequency distribution from data.
- generate_gridsearch_params.py: Computes the range of parameters for the beta and power law functions according to the earth mover's distance (EMD) between the resulting distributions and the empirical distribution.
- 60M_run.py: Generates unicity estimates for populations ranging from 1M to 60M.
- learning_curve.py: Computes the data showing that the unicity model using distributions extracted from small samples of the data converges to the unicity model using distributions extracted from all 1M observed trajectories.
- gridsearch.py: Runs the sensitivity analysis by running the unicity model many times to generate unicity curves for different input distributions.
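For readers unfamiliar with the quantity these scripts estimate, here is a minimal sketch of unicity: the fraction of users whose trajectory is the only one consistent with p points drawn at random from it. This is an illustration only and not the implementation in unicity_utils.py; the function name `estimate_unicity`, the representation of a trajectory as a set of (antenna, hour) pairs, and the toy data are all assumptions made for the example.

```python
import random


def estimate_unicity(trajectories, p, seed=0):
    """Fraction of users uniquely identified by p random points of their own trajectory.

    trajectories: dict mapping a user id to a set of (antenna_id, hour) points.
    This representation is hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for points in trajectories.values():
        if len(points) < p:
            continue  # skip users with fewer than p points
        eligible += 1
        # Draw p known points from this user's trajectory.
        known = set(rng.sample(sorted(points), p))
        # Count every trajectory (including the user's own) containing them.
        matches = sum(1 for other in trajectories.values() if known <= other)
        if matches == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Toy data: users "a" and "b" share two points, "c" shares none.
trajectories = {
    "a": {(1, 8), (2, 12), (3, 18)},
    "b": {(1, 8), (2, 12), (4, 18)},
    "c": {(5, 9), (6, 13), (7, 20)},
}
unicity_all_points = estimate_unicity(trajectories, p=3)
unicity_one_point = estimate_unicity(trajectories, p=1)
```

This brute-force comparison is quadratic in the number of users, so it only suits small examples; it is not meant to scale to the population sizes studied in the paper.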
Here you can find CSV files that replicate all the main results of the work.
- learning_curve: Each CSV file in this folder contains unicity estimates computed using the sample size indicated in the file name. The folder also contains a numpy array listing the sample sizes.
- 60M_model.csv: Contains the unicity estimates of the model up to 60M trajectories.
- 1M_observed.csv: Contains the observed unicity values of the 1M trajectories in this study.
- 1M_model.csv: Contains the unicity values of 1M trajectories generated by the model.
- 1M_model_example.csv: Contains the unicity values of 1M trajectories generated by the model, using the dummy antenna distribution provided in this repository. This is not used in the paper, but can be used for replicability.
This folder contains all the inputs that we can share publicly. Unfortunately, the frequency and activity distributions cannot be shared publicly; however, the fits to these distributions are given in the text of the main work. The inputs provided are:
- circadian.npy: The circadian distribution of activity. For privacy reasons, this is computed from the real data with differential privacy (ϵ=1).
- location_grid.txt: Contains the list and location of antennas, with each line giving an antenna's number, latitude, and longitude. Since the data controller has requested that we not share this information, the file is populated with dummy antennas over the area of the United Kingdom, obtained from publicly available data.
- activity.npy: For privacy reasons, we generate this array using the gamma distribution fit from the publication (α = 1.72 and β = 14.7).
- frequency.npy: Similarly, we generate this array using the power-law distribution fit from the publication (γ = 1.43).
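As an illustration of the location_grid.txt layout described above (antenna number, latitude, longitude on each line), the following sketch parses such lines using only the standard library. The field delimiter is an assumption: the parser accepts whitespace or commas, and the actual file may differ. The names `parse_antenna_line` and `load_antennas` and the sample coordinates are hypothetical and are not taken from the repository.

```python
import re


def parse_antenna_line(line):
    """Parse one 'number latitude longitude' line into (int, float, float).

    Assumes fields are separated by whitespace and/or commas.
    """
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    antenna_id, lat, lon = fields
    return int(antenna_id), float(lat), float(lon)


def load_antennas(path):
    """Read every non-empty line of an antenna file into a list of tuples."""
    with open(path) as handle:
        return [parse_antenna_line(line) for line in handle if line.strip()]


# Made-up sample lines, not taken from the repository's file.
sample = ["0 51.5074 -0.1278", "1, 53.4808, -2.2426"]
antennas = [parse_antenna_line(line) for line in sample]
```

With the real file, `load_antennas("inputs/location_grid.txt")` would return the same kind of list, assuming the file sits at that path and follows one of the two delimiters handled here.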
@article{farzanehfar2021risk,
  title={The risk of re-identification remains high even in country-scale location datasets},
  author={Farzanehfar, Ali and Houssiau, Florimond and de Montjoye, Yves-Alexandre},
  journal={Patterns},
  year={2021},
  publisher={Elsevier}
}