Software developer manages to de-anonymize the data in less than two hours

Jun 24, 2014 08:32 GMT

Details of more than 173 million taxi trips have been fully de-anonymized by a software developer in less than an hour.

Following a public records request, New York City officials provided Chris Whong with the complete trip and fare logs for the city's taxis.

Not all of the information could be released in plain text, so personally identifying details, such as the driver’s taxi license number and the car’s identification number (medallion), had been obscured with one-way MD5 cryptographic hashes.

After analyzing the 20GB CSV file, software developer Vijay Pandurangan noticed that the obscured data followed predictable patterns and that the hash function used was MD5.

From this point, he realized that the hashing could be reversed by brute force, exposing the original values behind the anonymized data.

Because plenty of information was available about the format of the identification numbers (license number and medallion), the developer proceeded to calculate how many taxi license numbers and medallion numbers were possible, and to enumerate them.

The totals came to about 22 million possible medallion numbers and about two million possible license numbers. Together, that makes a set of roughly 24 million candidate values, and Pandurangan managed to compute the MD5 hashes of all of them in less than two minutes.
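The approach described above amounts to a lookup table: hash every possible identifier once, then match the hashes found in the file against that table. The following is a minimal Python sketch of the idea, not Pandurangan's actual code. The four-digit identifier format and the names `md5_hex`, `candidate_ids` and `build_lookup` are assumptions made for illustration; the real license and medallion formats are larger but still small enough to enumerate.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the hexadecimal MD5 digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def candidate_ids(digits: int = 4):
    """Yield every zero-padded decimal ID of the given length.

    Hypothetical format, chosen only to keep the example small.
    """
    return (f"{n:0{digits}d}" for n in range(10 ** digits))


def build_lookup(candidates):
    """Map each candidate's MD5 digest back to the candidate itself."""
    return {md5_hex(c): c for c in candidates}


# Hash the whole (small) input space once.
lookup = build_lookup(candidate_ids(4))

# An "anonymized" value as it might appear in a released file.
hashed = md5_hex("0427")

# Reversing the hash is now a single dictionary lookup.
recovered = lookup.get(hashed)
```

Because the set of valid inputs is tiny compared with what MD5 can process per second, the one-way property of the hash offers no protection here.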

Using specific tools, the de-anonymization procedure took less than an hour to complete. The total work time to figure out the pattern and the algorithm used and to reveal the information was about two hours.

“Security researchers have been warning for a while that simply using hash functions is an ineffective way to anonymize data. In this case, it’s substantially worse because of the structured format of the input data. This anonymization is so poor that anyone could, with less then 2 hours work, figure which driver drove every single trip in this entire dataset. It would be even be easy to calculate drivers’ gross income, or infer where they live,” Pandurangan said in a post.

He also suggested that a better way to keep sensitive information secret in such cases would be to replace each license and medallion number with a random number and to use that replacement consistently throughout the dump file. Another way would be to encrypt each value individually, using a secret AES key.
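The first suggestion, random replacement values used consistently across the file, can be sketched in a few lines of Python. This is an illustrative sketch only: the class name `Pseudonymizer`, the 16-character token length and the sample identifiers are assumptions, not details from Pandurangan's post.

```python
import secrets


class Pseudonymizer:
    """Assign each real identifier a random token, reused on every occurrence."""

    def __init__(self):
        self._mapping = {}   # real identifier -> random token
        self._used = set()   # tokens already handed out

    def pseudonym(self, real_id: str) -> str:
        if real_id not in self._mapping:
            token = secrets.token_hex(8)  # 16 hex characters of randomness
            while token in self._used:    # extremely unlikely, but keep tokens unique
                token = secrets.token_hex(8)
            self._used.add(token)
            self._mapping[real_id] = token
        return self._mapping[real_id]


anonymizer = Pseudonymizer()
first = anonymizer.pseudonym("medallion-A")   # hypothetical identifier
again = anonymizer.pseudonym("medallion-A")   # same input, same token
other = anonymizer.pseudonym("medallion-B")   # different input, different token
```

Since the tokens are drawn at random and not derived from the identifiers, enumerating the input space reveals nothing. The mapping itself must be kept secret or discarded after the file is produced.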

The city officials failed to choose an appropriate anonymization method. As Pandurangan says, “The cat is already out of the bag in this case, but hopefully in the future, agencies will think carefully about the method they use to anonymize data before releasing it to the public.”