Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different.
Republished from the 06.23.08 edition of Wired magazine. View original.
[There are 8 essays; I promote 5 here.]
“Flight Patterns” shows 141,000 aircraft paths over a 24-hour period. Image: Aaron Koblin
In 2001, Oren Etzioni was on a plane chatting up his seat mates when he realized they had all paid less for their tickets than he had. “I thought, ‘Don’t get mad, get even,’” he says. So he came home to his computer lab at the University of Washington, got his hands on some fare data, and plugged it into a few basic prediction algorithms. He wanted to see if they could reliably foresee changes in ticket prices. It worked: Not only did the algorithms accurately anticipate when fares would go up or down, they gave reasonable estimates of what the new prices would be. Read more.
The Iowa agriculture landscape: Green areas are more productive for soy, corn, and wheat; red are least.
Farmer’s Almanac is finally obsolete. Last October, agricultural consultancy Lanworth not only correctly projected that the US Department of Agriculture had overestimated the nation’s corn crop, it nailed the margin: roughly 200 million bushels. That’s just 1.5 percent fewer kernels but still a significant shortfall for tight markets, causing a 13 percent price hike and jitters in the emerging ethanol industry. When the USDA downgraded expectations a month after Lanworth’s prediction, the little Illinois-based company was hailed as a new oracle among soft-commodity traders — who now pay the firm more than $100,000 a year for a timely heads-up on fluctuations in wheat, corn, and soybean supplies. Read more.
Want to know exactly how many Democratic-leaning Asian Americans making more than $30,000 live in the Austin, Texas, television market? Catalist, the Washington, DC, political data-mining shop, knows the answer. CTO Vijay Ravindran says his company has compiled nearly 15 terabytes of data for this election year — orders of magnitude larger than the databases available just four years ago. (In 2004, Howard Dean’s formidable campaign database clocked in at less than 100 GB, meaning that in one election cycle the average data set has grown 150-fold.) In the next election cycle, we should be measuring voter data in petabytes.
Large-scale data mining and micro-targeting were pioneered by the 2004 Bush-Cheney campaign, but Democrats, aided by privately financed Catalist, are catching up. They’re documenting the political activity of every American 18 and older: where they registered to vote, how strongly they identify with a given party, what issues cause them to sign petitions or make donations. (Catalist is matched by the Republican National Committee’s Voter Vault and Aristotle Inc.’s immense private bipartisan trove of voter information.) Read more.
A visualization of thousands of Wikipedia edits that were made by a single software bot. Each color corresponds to a different page. Photo credit: Fernanda B. Viégas, Martin Wattenberg, and Kate Hollenbach.
The biggest challenge of the Petabyte Age won’t be storing all that data; it’ll be figuring out how to make sense of it. Martin Wattenberg, a mathematician and computer scientist at IBM’s Watson Research Center in Cambridge, Massachusetts, is a pioneer in the art of visually representing and analyzing complex data sets. He and his partner at IBM, Fernanda Viégas, created Many Eyes, a collaborative site where users can share their own dynamic, interactive representations of big data. He spoke with Wired’s Mark Horowitz: Read more.
Used to be that if you wanted to wrest usable information from a big mess of data, you needed two things: First, a meticulously maintained database, tagged and sorted and categorized. And second, a giant computer to sift through that data using a detailed query.
But when data sets get to the petabyte scale, the old way simply isn’t feasible. Maintenance — tag, sort, categorize, repeat — would gobble up all your time. And a single computer, no matter how large, can’t crunch that many numbers.
Google’s solution for working with colossal data sets is an elegant approach called MapReduce. It eliminates the need for a traditional database and automatically splits the work across a server farm of PCs. For those not inside the Googleplex, there’s an open source version of the software library called Hadoop.
MapReduce can handle almost any type of information you throw at it, from photos to phone numbers. In the example below, we count the frequency of specific words in Google Books. Read more.
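The word-count idea can be sketched in a few lines of Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not Google's MapReduce or the Hadoop API; the function names, the two sample "books," and the target words are all hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict

def map_phase(text, targets):
    """Map step: emit a (word, 1) pair for each target word found in one document."""
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if word in targets:
            yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key (here, by word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, values):
    """Reduce step: collapse the list of 1s for a word into a single total."""
    return word, sum(values)

def word_count(documents, targets):
    # In a real cluster each document (or chunk) would be mapped on a
    # different machine; here the loop simply runs them one after another.
    pairs = []
    for text in documents.values():
        pairs.extend(map_phase(text, targets))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, values) for word, values in grouped.items())

# Hypothetical sample data standing in for scanned books.
documents = {
    "book-1": "The whale surfaced. The sea was calm.",
    "book-2": "A whale, a ship, and the open sea.",
}
counts = word_count(documents, {"whale", "sea", "ship"})
print(counts)
```

Because each map call looks at one document in isolation and each reduce call looks at one word in isolation, both steps could in principle be spread across many machines, which is the property that lets the approach scale to a server farm.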