Authors:
Joel Reardon, Serge Egelman, Kenneth A. Bamberger & Laurel E. McGrane
Preview:
While legal scholars have cited decades of computer science research demonstrating why anonymity is hard (and why datasets should not be cavalierly labelled “anonymous”), industry and legal practitioners have not heeded those warnings: many organizations trafficking in consumer data continue to assert to customers, courts, and regulators that their data is anonymous or “deidentified.” We acquired datasets from multiple data brokers to demonstrate empirically why this is false. Using publicly available email addresses found in data breaches posted on the Internet, we trivially reidentified 88% of the hashed email addresses that we obtained; using modern password-cracking techniques, we were able to reidentify 97% of the 6 million email addresses that we collected. Reidentifying hashed email addresses need not rely on illicit data or specialized hardware: by constructing rainbow tables with synthetic data representative of typical email addresses, we reidentified most of the hashed email addresses. In all cases, the hashed email addresses were linked to other device-based identifiers (e.g., mobile device advertising IDs and IP addresses), demonstrating why device-based identifiers have long been considered personally identifiable information. Relatedly, organizations trafficking in this data make a second assertion: that the data was collected from consumers with their consent. To evaluate this claim, we conducted a survey (n=369), recruiting participants by emailing a subset of the reidentified individuals in our datasets. The survey asked participants about their recollections of having provided consent (99% had no recollection) and their feelings about the sale of their information (94% were opposed, while 77% said they planned to submit deletion requests).
Overall, our study shows that hashed email addresses and device identifiers do not come close to meeting commonly understood definitions of “anonymous” or “deidentified” data, and that any notion of “consent” must also involve a similarly tortured definition. We argue that this industry and its defenders are not simply misinformed or indifferent to the veracity of their statements, but that this is an example of Plato’s “noble lie”: their entire social order relies on these demonstrably untrue statements being believed by courts, regulators, policymakers, and the public.
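The reidentification approach described in the preview can be illustrated with a minimal sketch. It is not the authors' method: it assumes the hashes are unsalted SHA-256 digests of lowercased, trimmed addresses (the preview does not name the hash function or normalization), and it uses a plain precomputed dictionary instead of a true rainbow table. The helper names (`hash_email`, `synthetic_candidates`, `reidentify`) and all sample addresses are hypothetical.

```python
import hashlib
from itertools import product


def hash_email(email: str) -> str:
    """Hash an email address (assumption: unsalted SHA-256 of the lowercased, trimmed address)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def synthetic_candidates(first_names, last_names, domains):
    """Yield plausible addresses built from common name patterns (hypothetical patterns)."""
    for first, last, domain in product(first_names, last_names, domains):
        yield f"{first}.{last}@{domain}"
        yield f"{first}{last}@{domain}"
        yield f"{first[0]}{last}@{domain}"


def reidentify(hashed_emails, candidates):
    """Map each hash to its plaintext address when the address appears among the candidates."""
    lookup = {hash_email(c): c for c in candidates}
    return {h: lookup[h] for h in hashed_emails if h in lookup}


# Hypothetical "deidentified" records, as a data recipient might see them.
broker_hashes = [
    hash_email("Alice.Smith@example.com"),
    hash_email("bsmith@example.com"),
    hash_email("zed@example.org"),
]

candidates = synthetic_candidates(["alice", "bob"], ["smith"], ["example.com"])
recovered = reidentify(broker_hashes, candidates)
rate = len(recovered) / len(broker_hashes)
print(f"reidentified {len(recovered)} of {len(broker_hashes)} hashes ({rate:.0%})")
```

Because the hash is deterministic and unkeyed, anyone holding a list of candidate addresses can perform the same lookup. The candidate list could equally come from addresses exposed in breaches, as the preview describes, rather than from synthetic patterns.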