The promise of big data lies in the ability to glean useful trends out of massive collections of individual examples. Whether those examples are shopping lists, taxi routes or faces, algorithms can pick out commonalities and associations that no human could detect.
But as those data sets have grown and the examples in them have become more personal, data scientists have had to address new questions about privacy.
One of the newer responses is the idea of “synthetic data.” A synthetic data set of human faces, for example, might start with real-life photographs, which are then manipulated by a series of algorithms. The resulting set would be full of realistic faces, except they would all belong to people who don’t actually exist.
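The generation process described above can be sketched in miniature. Real systems use deep generative networks trained on images or text; this hypothetical example uses a toy tabular data set and a simple Gaussian model instead, just to show the core idea: fit a statistical model to real records, then sample brand-new records that resemble the originals without being copies of them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" data: 1,000 records of (age, income), standing in for the
# sensitive examples a real pipeline would start from.
real = np.column_stack([
    rng.normal(40, 12, 1000),          # age
    rng.normal(55_000, 15_000, 1000),  # income
])

# Fit a simple model of the data -- here, a multivariate Gaussian.
# (Production systems fit far richer models, e.g. GANs.)
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records: rows that are statistically similar to the
# real ones, but none of which is an actual record from the real set.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
```

Note that nothing in this sketch prevents the model from memorizing and re-emitting real records; that gap is exactly the concern Roth raises below.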
At MIT Technology Review, reporter Karen Hao spoke with Aaron Roth, professor in the Department of Computer and Information Science, about this new trend. Roth is one of the leading figures in the field of differential privacy, a technique that allows algorithms to analyze data sets drawn from real people while mathematically guaranteeing that no single individual’s data can be teased out of the aggregate results.
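The guarantee differential privacy makes can be illustrated with its simplest instantiation, the Laplace mechanism: bound how much any one person can influence a statistic, then add calibrated random noise so that the released number is nearly indistinguishable whether or not that person’s record is in the data. This is a minimal sketch, not Roth’s implementation; the function name and parameters are illustrative.

```python
import numpy as np

def private_mean(values, epsilon, lo, hi):
    """Differentially private mean via the Laplace mechanism.

    Clipping every value to [lo, hi] caps one record's influence on the
    mean (its "sensitivity"); adding Laplace noise with scale
    sensitivity / epsilon then hides any individual's contribution.
    Smaller epsilon means stronger privacy but noisier answers.
    """
    clipped = np.clip(values, lo, hi)
    # Changing one record can move the clipped mean by at most this much.
    sensitivity = (hi - lo) / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise
```

Averaged over many queries the noise cancels out, which is what makes population-level trends recoverable even though any single individual’s value stays hidden.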
While synthetic data may seem to offer similar protection, it carries no such formal guarantee, Roth warns:
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
Read “These creepy fake humans herald a new age in AI” at MIT Technology Review.