When Big Data goes bad
November 5, 2013: 1:00 PM ET
How the models underlying today's supercomputing prowess are costing us its success.
By Joshua Klein
FORTUNE -- Big Data and the cloud are putting supercomputer capabilities into everyone's hands. But what's getting lost in the mix is that the tools we use to interpret and apply this tidal wave of information often have a fatal flaw. Much of the data analysis we do rests on erroneous models, meaning mistakes are inevitable. And when our outsized expectations exceed our capacity, the consequences can be dire.
This wouldn't be such a problem if Big Data weren't so very, very big. But the amount of data that we have access to is enabling us to use even flawed models to produce what are often useful results. The trouble is that we're frequently mistaking those results for omniscience. We're falling in love with our own technology, and when the models fail it can be pretty ugly, especially when the mistakes all that data produces are concomitantly large.
Part of the issue is oversimplification of the models computer programs are based on, rather than actual errors in their programming. For example, in early April 2011, Peter Lawrence's The Making of a Fly, a classic work in developmental biology that many biologists consult regularly, was listed on Amazon.com (AMZN) as having 17 copies for sale: 15 used from $35.54, and two new from $23,698,655.93 (plus $3.99 shipping).
The book, last published in 1992, is now out of print, but that doesn't quite explain the multimillion-dollar price tag. What had happened was that two automated programs, one run by seller "bordeebook" and one by seller "profnath," were engaged in an iterative and incremental bidding war. Once a day profnath would raise their price to 0.9983 times bordeebook's listed price. Several hours later, bordeebook would increase their price to 1.270589 times profnath's latest amount.
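The arithmetic behind that spiral is easy to reproduce. The sketch below is only an illustration, not either seller's actual code: the two multipliers come from the figures above, while the function name, the starting price (borrowed from the cheapest used copy purely as a seed), and the one-round-per-day loop are assumptions. Because 0.9983 times 1.270589 is about 1.2684, each daily round raises both prices by roughly 27 percent, and the growth compounds.

```python
# Multipliers reported for the two sellers' repricing rules.
PROFNATH_FACTOR = 0.9983      # profnath prices just under bordeebook
BORDEEBOOK_FACTOR = 1.270589  # bordeebook prices well above profnath


def simulate_price_spiral(start_price, ceiling, max_days=10_000):
    """Run the two repricing rules against each other once per day.

    start_price is a hypothetical seed for bordeebook's listing.
    Returns a list of (day, profnath_price, bordeebook_price) tuples,
    stopping once bordeebook's price reaches the ceiling.
    """
    bordeebook = start_price
    history = []
    for day in range(1, max_days + 1):
        profnath = PROFNATH_FACTOR * bordeebook    # undercut slightly
        bordeebook = BORDEEBOOK_FACTOR * profnath  # then mark up again
        history.append((day, profnath, bordeebook))
        if bordeebook >= ceiling:
            break
    return history


if __name__ == "__main__":
    # Assumed seed: $35.54; ceiling: the listed multimillion-dollar price.
    rounds = simulate_price_spiral(35.54, 23_698_655.93)
    day, profnath, bordeebook = rounds[-1]
    print(f"day {day}: profnath ${profnath:,.2f}, bordeebook ${bordeebook:,.2f}")
```

Neither rule is unreasonable on its own. The runaway only appears when the two are run against each other with no upper bound and no human checking the result.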
It's a classic example of how unanticipated factors can foil even the best-prepared computer models, and it's not an isolated incident.
Does any of this sound like the subprime mortgage crisis? Before 2008, the best minds with the best technology running the most advanced hypothetical scenarios completely missed the looming crisis and then failed to understand its severity. The more broadly a model is scoped, the more possibilities for error it includes. It sounds obvious, but we often miss the fact that those models are not, and will never be, as accurate as reality itself.
Here's another example. One T-shirt seller on Amazon.co.uk put up a shirt for sale emblazoned with the statement, "Keep Calm and Rape a Lot." One might wonder who thought such a shirt would be a good idea. But Solid Gold Bomb, the company that made the shirt, wasn't necessarily aware that it was even selling it. The company apologized publicly and copiously, but in its defense the only mistake it made was a small coding error. That's because the shirt wasn't designed by anyone. Nor were the shirts even necessarily ever printed. Solid Gold Bomb's business isn't in artfully designing T-shirts. Instead, it writes code that takes libraries of words that slot into popular phrases (such as "Keep Calm and Carry On," which enjoyed a brief memetic popularity online) to make derivations that get dropped onto a template of a T-shirt and automatically get posted as an Amazon item for sale. Its mistake was overlooking a single word in a list of 4,000 or so others (the company was lucky no other offensive words or phrases made it onto the site). The problem was context.
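To see how little code such a slogan generator needs, and how small the missing safeguard is, consider the sketch below. It is not Solid Gold Bomb's program: the word lists, the blocklist, and the function name are all hypothetical, and a mild stand-in word takes the place of anything genuinely offensive.

```python
from itertools import product

# Hypothetical word lists; the real one reportedly held about 4,000 words.
VERBS = ["carry", "dance", "punch"]
ENDINGS = ["on", "a lot"]

# Hypothetical blocklist. The words on it are a human judgment about
# context that the template itself knows nothing about.
BLOCKED_WORDS = {"punch"}


def generate_slogans(verbs, endings, blocked=frozenset()):
    """Fill the 'Keep Calm and ...' template with every verb/ending pair,
    skipping any verb that appears on the blocklist."""
    slogans = []
    for verb, ending in product(verbs, endings):
        if verb.lower() in blocked:
            continue  # the one-line check that a single overlooked word slips past
        slogans.append(f"KEEP CALM AND {verb.upper()} {ending.upper()}")
    return slogans


if __name__ == "__main__":
    for slogan in generate_slogans(VERBS, ENDINGS, BLOCKED_WORDS):
        print(slogan)
```

The filter is only as good as the list behind it. With thousands of candidate words, one that nobody thought to block is enough to put an unreviewed slogan on sale.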
Again, a simple model, with serious social consequences. The program that made the Solid Gold Bomb T-shirt isn't aware of how its intended audience perceives the concept of rape, let alone how the business process that rendered the T-shirt works. And yet that context turned a one-word oversight into a massively damaging event.
In both of these instances, an inability to anticipate how the program would interact with other programs, or the broader context in which it would operate, caused significant harm. Those are just two ways in which a model on which code is based can be flawed.
Big Data still has big issues. For example, the information we're gathering is often not being properly normalized (put into a format where all data is apples-to-apples), the models we're making aren't often peer tested or reviewed (witness the problems with the ranking tool Klout as a standard for social media influence), and, most crucially, the information itself is usually siloed inside of large corporations instead of being democratically available and verifiable.
This isn't to say our technology is doomed. Most of the applications we use every day work tremendously well, and in some cases really do produce amazing capabilities that improve our lives in countless ways. But it behooves us to examine the models that underpin them. Because someday, somehow, they will fail.
Joshua Klein is a hacker, consultant, television host, and author of Reputation Economics: Why Who You Know is Worth More than What You Have (Palgrave Macmillan), from which this essay is adapted.