How Anonymous Datasets Turn Out to Be Not So Anonymous

This article (‘Anonymous data too easy to identify’, in French), which refers to the scientific paper ‘Estimating the success of re-identifications in incomplete datasets using generative models’ published in Nature Communications, exposes how easy it is to identify people even within anonymous datasets.

In fact, the paper states that almost any individual can be identified from only 15 demographic attributes, whereas common marketing databases can hold 5,000 attributes per person (“15 demographic attributes would render 99.98% of people in Massachusetts unique”).
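To get a feel for why a handful of attributes is enough, here is a minimal sketch that counts how many records become unique as more attributes are combined. Everything in it is an assumption made for illustration: the population is synthetic, the attribute names (`attr_0`, `attr_1`, ...) and the helpers `fraction_unique` and `make_population` are hypothetical, and the attributes are drawn independently and uniformly, which real demographic data is not. It is not the method used in the paper.

```python
import random
from collections import Counter


def fraction_unique(records, attributes):
    """Share of records whose values on `attributes` match no other record."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


def make_population(n, n_attributes, n_values, seed=0):
    """Synthetic people: each attribute takes one of `n_values` values at random."""
    rng = random.Random(seed)
    return [
        {f"attr_{i}": rng.randrange(n_values) for i in range(n_attributes)}
        for _ in range(n)
    ]


# Hypothetical population: 10,000 people, 15 attributes with 4 possible values each.
population = make_population(n=10_000, n_attributes=15, n_values=4)
names = [f"attr_{i}" for i in range(15)]

# Uniqueness when only the first k attributes are known to the attacker.
uniqueness = {k: fraction_unique(population, names[:k]) for k in (1, 3, 5, 10, 15)}

for k, share in uniqueness.items():
    print(f"{k:2d} attributes: {share:.1%} of records are unique")
```

With one attribute nobody stands out, but each extra attribute splits the groups further, so the share of unique records can only go up. Real attributes are correlated and unevenly distributed, so the exact numbers would differ, but the trend is the same.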

Therefore, while “de-identification, the process of anonymizing datasets before sharing them, has been the main paradigm used in research and elsewhere to share data while preserving people’s privacy”, it does not seem to work so well. Several methods are often used to improve anonymity, for example releasing only a subset of the dataset so that nobody can be really sure that a match is a correct identification. However, the paper concludes that “the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”
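The subset argument can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's generative model: the population is synthetic, the helper `match_correctness` and all its parameters are hypothetical, and the full population is known here, which is exactly what a real attacker lacks and what the paper's model is designed to estimate. It releases a small sample and asks how often a record that looks unique in the sample is also unique in the whole population.

```python
import random
from collections import Counter


def match_correctness(pop_size, sample_size, n_attributes, n_values, seed=1):
    """Among records unique in the released sample, the share that are
    also unique in the full population (so a match is certainly correct)."""
    rng = random.Random(seed)
    # Each person is a tuple of attribute values drawn uniformly at random.
    population = [
        tuple(rng.randrange(n_values) for _ in range(n_attributes))
        for _ in range(pop_size)
    ]
    sample = rng.sample(population, sample_size)  # the "incomplete" release

    pop_counts = Counter(population)
    sample_counts = Counter(sample)

    sample_unique = [r for r in sample if sample_counts[r] == 1]
    also_pop_unique = [r for r in sample_unique if pop_counts[r] == 1]
    return len(also_pop_unique) / len(sample_unique)


# Hypothetical setting: 20,000 people, a 5% release, attributes with 5 values each.
few = match_correctness(20_000, 1_000, n_attributes=6, n_values=5)
many = match_correctness(20_000, 1_000, n_attributes=15, n_values=5)

print(f"6 attributes:  {few:.0%} of sample-unique records are truly unique")
print(f"15 attributes: {many:.0%} of sample-unique records are truly unique")
```

With only a few attributes, a unique-looking record in the sample often has look-alikes outside it, so a match may well be wrong. With many attributes that doubt largely disappears, which is why releasing only a subset offers little protection on its own.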

Some frightening examples are quoted, such as: “In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences”.

There is thus still some way to go before we have truly anonymized databases for data research, which shows once again that privacy is now quite virtual! We certainly need to be aware of it.
