Carnegie Mellon, Microsoft Research Automate Privacy Compliance for Big Data Systems

Web services companies, such as Facebook, Google and Microsoft, all make promises about how they will use personal information they gather. But ensuring that millions of lines of code in their systems operate in ways consistent with privacy promises is labor-intensive and difficult. A team from Carnegie Mellon University and Microsoft Research, however, has shown these compliance checks can be automated.
The researchers developed a prototype automated system that is now running on the data analytics pipeline of Bing, Microsoft’s search engine. According to Saikat Guha, researcher at Microsoft, it’s the first time automated privacy compliance analysis has been applied to the production code of an Internet-scale system. 
Employing a new, lawyer-friendly language to specify privacy policies and using a data inventory to annotate existing programs, the researchers showed that a team of just five people could manage a daily compliance check on millions of lines of code written by several thousand developers.
“Tens of millions of lines of code are already in the pipeline,” noted Shayak Sen, a Ph.D. student in computer science who interned at Microsoft Research India and was the lead student author on the study. “And during our implementation on Bing, we found that more than 20 percent of the code was changing on a daily basis.” At these large scales, automated methods offer the best hope of verifying compliance.
The researchers developed a language – Legalease – that could be easily learned and used by privacy advocates. It employs allow-deny rules with exceptions, a structure that is found in many privacy policies and laws and is expressive enough to capture the real policies of a system such as Bing.
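To make the allow-deny structure concrete, here is a minimal Python sketch of how a rule with nested exceptions might be evaluated. It is not Legalease syntax or the researchers' implementation: the Clause class, the attribute names (data_type, purpose) and the sample policy are all hypothetical, chosen only to illustrate the idea that a matching exception overrides the clause it is attached to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Clause:
    """One allow/deny rule (hypothetical model, not real Legalease)."""
    effect: str                      # "ALLOW" or "DENY"
    attrs: Dict[str, str]            # attribute values the rule applies to
    exceptions: List["Clause"] = field(default_factory=list)


def matches(clause: Clause, request: Dict[str, str]) -> bool:
    # A clause applies when every attribute it names has the same value
    # in the request.
    return all(request.get(k) == v for k, v in clause.attrs.items())


def evaluate(clause: Clause, request: Dict[str, str]) -> Optional[str]:
    """Return "ALLOW"/"DENY", or None if the clause does not apply."""
    if not matches(clause, request):
        return None
    # The first exception that applies overrides this clause's effect.
    for exc in clause.exceptions:
        verdict = evaluate(exc, request)
        if verdict is not None:
            return verdict
    return clause.effect


# Hypothetical policy: deny any use of IP addresses,
# except allow their use for abuse detection.
policy = Clause(
    effect="DENY",
    attrs={"data_type": "IPAddress"},
    exceptions=[Clause(effect="ALLOW", attrs={"purpose": "AbuseDetection"})],
)

print(evaluate(policy, {"data_type": "IPAddress", "purpose": "Advertising"}))
print(evaluate(policy, {"data_type": "IPAddress", "purpose": "AbuseDetection"}))
```

In this sketch the first request is denied and the second is allowed, because the exception for abuse detection takes precedence over the enclosing deny rule. A request about a data type the policy does not mention returns None, leaving the decision to whatever default the surrounding system applies.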
The researchers paired Legalease with Grok – a data inventory that annotates existing programs written in languages typically employed by MapReduce-like systems.
