When Algorithms Infer Pregnancy or Other Sensitive Information About People

Written by Eric Siegel, PhD (@predictanalytic). He is the founder of Predictive Analytics World and the instructor of Coursera’s Machine Learning for Everyone.

Originally published in Harvard Business Review.

Machine learning can ascertain a lot about you — including some of your most sensitive information. For instance, it can predict your sexual orientation, whether you’re pregnant, whether you’ll quit your job, and whether you’re likely to die soon. Researchers can predict race based on Facebook likes, and officials in China use facial recognition to identify and track the Uighurs, a minority ethnic group.

Now, do the machines actually “know” these things about you, or are they only making informed guesses? And, if they’re making an inference about you, just the same as any human you know might do, is there really anything wrong with them being so astute?

Let’s look at a few cases:

In the U.S., the story of Target predicting who’s pregnant is probably the most famous example of an algorithm making sensitive inferences about people. In 2012, a New York Times story about how companies can leverage their data included an anecdote about a father learning that his teenage daughter was pregnant because Target sent her coupons for baby items in an apparent act of premonition. The story about the teenager may be apocryphal. Even if it did happen, coincidence rather than predictive analytics was most likely responsible for the coupons, according to Target’s process as detailed in The New York Times story. Nonetheless, this predictive project poses a real risk to privacy. After all, if a company’s marketing department predicts who’s pregnant, it has ascertained medically sensitive, unvolunteered data that only healthcare staff are normally trained to appropriately handle and safeguard.

Mismanaged access to this kind of information can have huge implications for someone’s life. As one concerned citizen posted online, imagine that a pregnant woman’s “job is shaky, and [her] state disability isn’t set up right yet…to have disclosure could risk the retail cost of a birth (approximately $20,000), disability payments during time off (approximately $10,000 to $50,000), and even her job.”

This isn’t a case of mishandling, leaking, or stealing data. Rather, it is the generation of new data — the indirect discovery of unvolunteered truths about people. Organizations can predict these powerful insights from existing innocuous data, as if creating them out of thin air.

So are we ironically facing a downside when predictive models perform too well? We know there’s a cost when models predict incorrectly, but is there also a cost when they predict correctly?

Even if a model isn’t highly accurate overall, it may still be confident in its predictions for a certain group of individuals. Let’s say that 2% of the female customers between ages 18 and 40 are pregnant. If the model identifies a group of customers who are, say, three times more likely than average to be pregnant, only 6% of those identified will actually be pregnant. That’s a lift of three. But if you look at a much smaller, focused group, say the top 0.1% most likely to be pregnant, you may have a much higher lift of, say, 46, which would make women in that group 92% likely to be pregnant (2% × 46). In that case, the system would be capable of revealing those women as very likely to be pregnant.
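
The lift arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `precision_from_lift` is made up for this example, and the 2% base rate and the lifts of 3 and 46 are the hypothetical figures from the paragraph above, not measurements from any real model.

```python
def precision_from_lift(base_rate: float, lift: float) -> float:
    """Share of a flagged group that is truly positive, given base rate and lift.

    Lift is the ratio of the positive rate inside the flagged group to the
    positive rate in the whole population, so precision = base_rate * lift.
    """
    precision = base_rate * lift
    if not 0.0 <= precision <= 1.0:
        raise ValueError("lift cannot exceed 1 / base_rate")
    return precision


BASE_RATE = 0.02  # hypothetical: 2% of the customer segment is pregnant

broad_group = precision_from_lift(BASE_RATE, lift=3)     # broad group, lift of 3
focused_group = precision_from_lift(BASE_RATE, lift=46)  # top 0.1%, lift of 46
max_lift = 1 / BASE_RATE  # lift at which every flagged person is a true positive

print(f"lift of 3:  {broad_group:.0%} of the flagged group is pregnant")
print(f"lift of 46: {focused_group:.0%} of the flagged group is pregnant")
print(f"highest possible lift at this base rate: {max_lift:.0f}")
```

With a 2% base rate, the highest possible lift is 50, so a lift of 46 means the model is close to certain about that small group even if it says little about everyone else.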

The same concept applies when predicting sexual orientation, race, health status, location, and your intentions to leave your job. Even if a model isn’t highly accurate in general, it can still reveal with high confidence — for a limited group — things like sexual orientation, race, or ethnicity. This is because, typically, there is a small portion of the population for whom it is easier to predict. Now, it may only be able to predict confidently for a relatively small group, but even just the top 0.1% of a population of a million would mean 1,000 individuals have been confidently identified.
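
To make the scale concrete, here is a rough sketch of the head count involved. The population of one million and the top 0.1% come from the paragraph above. The 92% precision is an assumption borrowed from the hypothetical pregnancy example, and the helper name `flagged_counts` is invented for illustration.

```python
def flagged_counts(population: int, top_fraction: float, precision: float) -> tuple[int, int]:
    """Return (people flagged, people expected to be correctly identified)."""
    flagged = round(population * top_fraction)
    expected_correct = round(flagged * precision)
    return flagged, expected_correct


# Population and fraction are from the text; the precision is an assumed value.
flagged, expected_correct = flagged_counts(
    population=1_000_000,
    top_fraction=0.001,  # the top 0.1% of the population
    precision=0.92,      # assumed, reusing the earlier hypothetical
)

print(f"{flagged:,} people flagged, about {expected_correct:,} of them correctly")
```

Under these assumptions, a model that looks unimpressive on average would still confidently, and mostly correctly, single out roughly a thousand people.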

It’s easy to think of reasons why people wouldn’t want someone to know these things. As of 2013, Hewlett-Packard was predictively scoring its more than 300,000 workers with the probability of whether they’d quit their job — HP called this the Flight Risk score, and it was delivered to managers. If you’re planning to leave, your boss would probably be the last person you’d want to find out before it’s official.

As another example, facial recognition technologies can serve as a way to track location, diminishing the fundamental freedom to move about without disclosure, since publicly positioned security cameras can identify people at specific times and places. I certainly don’t sweepingly condemn face recognition, but know that the CEOs of both Microsoft and Google have come down on it for this reason.

In yet another example, a consulting firm was modeling employee loss for an HR department, and noticed that they could actually model employee deaths, since that’s one way you lose an employee. The HR folks responded with, “Don’t show us!” They didn’t want the liability of potentially knowing which employees were at risk of dying soon.

Research has shown that predictive models can also discern other personal attributes — such as race and ethnicity — based on, for example, Facebook likes. A concern here is the ways in which marketers may be making use of these sorts of predictions. As Harvard professor of government and technology Latanya Sweeney put it, “At the end of the day, online advertising is about discrimination. You don’t want mothers with newborns getting ads for fishing rods, and you don’t want fishermen getting ads for diapers. The question is when does that discrimination cross the line from targeting customers to negatively impacting an entire group of people?” Indeed, a study by Sweeney showed that Google searches for “black-sounding” names were 25% more likely to show an ad suggesting that the person had an arrest record, even if the advertiser had nobody with that name in their database of arrest records.

“If you make a technology that can classify people by an ethnicity, someone will use it to repress that ethnicity,” says Clare Garvie, senior associate at the Center on Privacy and Technology at Georgetown Law.

Which brings us to China, where the government applies facial recognition to identify and track Uighurs, members of an ethnic group systematically oppressed by the government. This is the first known case of a government using machine learning to profile by ethnicity. This flagging of individuals by ethnic group is designed specifically to be used as a factor in discriminatory decisions (that is, decisions based at least in part on a protected class). In this case, members of this group, once identified, will be treated or considered differently on the basis of their ethnicity. One Chinese start-up valued at more than $1 billion said its software could recognize “sensitive groups of people.” Its website said, “If originally one Uighur lives in a neighborhood, and within 20 days six Uighurs appear, it immediately sends alarms” to law enforcement.

Implementing the differential treatment of an ethnic group based on predictive technology takes the risks to a whole new level. Jonathan Frankle, a deep learning researcher at MIT, warns that this potential extends beyond China. “I don’t think it’s overblown to treat this as an existential threat to democracy. Once a country adopts a model in this heavy authoritarian mode, it’s using data to enforce thought and rules in a much more deep-seated fashion… To that extent, this is an urgent crisis we are slowly sleepwalking our way into.”

It’s a real challenge to draw the line as to which predictive objectives pursued with machine learning are unethical, let alone which should be legislated against, if any. But, at the very least, it’s important to stay vigilant for when machine learning serves to empower a preexisting unethical practice, and also for when it generates data that must be handled with care.

Follow Eric on Twitter at @predictanalytic.