
Not hidden in the crowd

Cast your mind back a few months to the discussions about using Call Data Records (CDRs) in big data sets to model the spread of Ebola and help combat it. One of the concerns raised was privacy, but it was routinely dismissed: after all, the data can be scrubbed to remove all personally identifiable information. Right?

Wrong. A paper by Yves-Alexandre de Montjoye in Science (30 January 2015) shows that individuals can be identified in big data sets that have been anonymised. He studied three months' worth of credit card records for a set of over 1 million individuals, from which all personally identifiable information had been removed.

"...four spatiotemporal points are enough to uniquely reidentify 90% of individuals."

It's not every day you come across spatiotemporal points. In this case a point is the location and time of a transaction, and it took only four transactions to identify 90% of the individuals in the data set. With some extra work, cross-referencing the data with other public sources, it is possible to work out that person's name.
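
To make the mechanism concrete, here is a minimal sketch in Python of the kind of "unicity" measurement the paper describes. The (user, place, day) record layout and the function itself are assumptions for illustration, not the paper's actual code:

    import random
    from collections import defaultdict

    def unicity(records, k=4, trials=1000):
        """Estimate the fraction of users uniquely pinned down by k
        randomly chosen spatiotemporal points from their own trace."""
        traces = defaultdict(set)
        for user, place, day in records:
            traces[user].add((place, day))

        # Only test users with at least k known points.
        eligible = [u for u, pts in traces.items() if len(pts) >= k]
        unique = 0
        for _ in range(trials):
            target = random.choice(eligible)
            points = random.sample(sorted(traces[target]), k)
            # How many users' traces contain all k chosen points?
            matches = [u for u, pts in traces.items()
                       if all(p in pts for p in points)]
            if matches == [target]:
                unique += 1
        return unique / trials

On the credit card data the paper studied, this fraction comes out at roughly 0.9 for k = 4, which is the 90% figure quoted above.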

This is not the first time supposedly anonymised data has been used to identify individuals. The same author did a study on mobile phone call data back in October 2012 and came to a similar conclusion: "...four spatio-temporal points are enough to uniquely identify 95% of the individuals." And back in March last year, Latanya Sweeney, a Harvard computer science professor, bought an anonymised set of hospital records and was able to match a name to over 40% of the records.

The data is not really anonymous: removing personally identifiable information such as name, address, and phone number merely de-identifies the data, and it is often possible to re-identify the individuals. The problem lies with the models being used to de-identify these large data sets.
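
A toy illustration of the distinction, with invented field names: stripping the direct identifiers leaves the spatiotemporal quasi-identifiers untouched, and those are exactly what re-identification exploits.

    # Direct identifiers that naive de-identification removes.
    DIRECT_IDENTIFIERS = {"name", "address", "phone", "card_number"}

    def deidentify(record):
        """Drop direct identifiers only -- this is de-identification,
        not anonymisation."""
        return {k: v for k, v in record.items()
                if k not in DIRECT_IDENTIFIERS}

    record = {
        "name": "A. Smith",
        "card_number": "4111111111111111",
        "shop": "Bakery, Camden",
        "timestamp": "2015-01-30T13:05",
    }
    print(deidentify(record))
    # {'shop': 'Bakery, Camden', 'timestamp': '2015-01-30T13:05'}
    # The (shop, timestamp) pair survives -- and four such pairs are
    # enough to single out most individuals.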

It is routine to take data from operational systems and remove personally identifiable information so that it can be used in testing. Using unaltered production data improves test coverage and so should find more bugs, but data privacy laws and regulations usually make that a non-starter. Creating data specifically for testing takes time and effort, and it will inevitably not cover all the quirks and variations of production data. So the tendency is to de-identify production data to meet the privacy requirements. Should we be concerned that the individuals in this test data can be re-identified?
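
As a hedged sketch of what de-identifying production data for test use often involves in practice (the field names and coarsening choices are illustrative, not a standard): pseudonymise the direct identifiers with a keyed hash, and coarsen the spatiotemporal fields that, as shown above, enable re-identification.

    import hmac
    import hashlib

    # Assumption: the key lives in the production environment only and
    # is never shipped alongside the test data.
    SECRET_KEY = b"rotate-me-and-keep-me-out-of-test"

    def pseudonym(value):
        # Keyed hash: stable (same input -> same token), so joins and
        # referential integrity in the test data still work, but not
        # reversible without the key.
        return hmac.new(SECRET_KEY, value.encode(),
                        hashlib.sha256).hexdigest()[:12]

    def prepare_for_test(record):
        return {
            "customer": pseudonym(record["customer_id"]),
            "region": record["shop"].split(",")[-1].strip(),  # coarsen location
            "date": record["timestamp"][:10],                 # coarsen time to the day
            "amount": record["amount"],
        }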

The short answer is yes, but a qualified yes. We should be concerned if:

  • the test data is not fit for purpose: it is inadequate, or it contains excessive or irrelevant data;
  • we don't use authentication mechanisms, access controls and data security measures to protect the data;
  • we don't believe that the test data users are trustworthy and educated on data privacy;
  • we don't keep an audit trail of who accessed the data, so that we can re-create the data usage and confirm it was used for its intended purpose (see the sketch after this list).
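
On that last point, a minimal sketch of such an audit trail, with invented names and fields (a real system would also need to protect the log itself from tampering):

    import json
    import time

    def log_access(logfile, user, dataset, purpose):
        """Append one record per access so data usage can be
        reconstructed and checked against its intended purpose."""
        entry = {"ts": time.time(), "user": user,
                 "dataset": dataset, "purpose": purpose}
        with open(logfile, "a") as f:
            f.write(json.dumps(entry) + "\n")

    log_access("testdata_audit.log", "jbloggs",
               "cards_2015q1_deidentified", "regression run 4512")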

But if your testing security measures cover these concerns, then please go ahead and use desensitised operational data. Your testing will benefit.
