Posts Tagged ‘wired’

The Petabyte Age: Because More Isn’t Just More — More Is Different (Wired mag)

Sunday, July 6th, 2008

Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different.

Republished from the 06.23.08 edition of Wired magazine. View original.

[There are 8 essays; I promote 5 here.]

Tracking Air Fares: Elaborate Algorithms Predict Ticket Prices

“Flight Patterns” shows 141,000 aircraft paths over a 24-hour period. Image: Aaron Koblin

In 2001, Oren Etzioni was on a plane chatting up his seat mates when he realized they had all paid less for their tickets than he did. “I thought, ‘Don’t get mad, get even,’” he says. So he came home to his computer lab at the University of Washington, got his hands on some fare data, and plugged it into a few basic prediction algorithms. He wanted to see if they could reliably foresee changes in ticket prices. It worked: Not only did the algorithms accurately anticipate when fares would go up or down, they gave reasonable estimates of what the new prices would be. Read more.

Feeding the Masses

The Iowa agriculture landscape: Green areas are more productive for soy, corn, and wheat; red are least.
Image: Firstborn

Farmer’s Almanac is finally obsolete. Last October, agricultural consultancy Lanworth not only correctly projected that the US Department of Agriculture had overestimated the nation’s corn crop, it nailed the margin: roughly 200 million bushels. That’s just 1.5 percent fewer kernels but still a significant shortfall for tight markets, causing a 13 percent price hike and jitters in the emerging ethanol industry. When the USDA downgraded expectations a month after Lanworth’s prediction, the little Illinois-based company was hailed as a new oracle among soft-commodity traders — who now pay the firm more than $100,000 a year for a timely heads-up on fluctuations in wheat, corn, and soybean supplies. Read more.

Predicting the Vote: Pollsters Identify Tiny Voting Blocs

Infographic: Build

Want to know exactly how many Democratic-leaning Asian Americans making more than $30,000 live in the Austin, Texas, television market? Catalist, the Washington, DC, political data-mining shop, knows the answer. CTO Vijay Ravindran says his company has compiled nearly 15 terabytes of data for this election year — orders of magnitude larger than the databases available just four years ago. (In 2004, Howard Dean’s formidable campaign database clocked in at less than 100 GB, meaning that in one election cycle the average data set has grown 150-fold.) In the next election cycle, we should be measuring voter data in petabytes.

Large-scale data-mining and micro-targeting were pioneered by the 2004 Bush-Cheney campaign, but Democrats, aided by privately financed Catalist, are catching up. They’re documenting the political activity of every American 18 and older: where they registered to vote, how strongly they identify with a given party, what issues cause them to sign petitions or make donations. (Catalist is matched by the Republican National Committee’s Voter Vault and Aristotle Inc.’s immense private bipartisan trove of voter information.) Read more.

Visualizing Big Data: Bar Charts for Words

A visualization of thousands of Wikipedia edits that were made by a single software bot. Each color corresponds to a different page. Photo credit: Fernanda B. Viégas, Martin Wattenberg, and Kate Hollenbach.

The biggest challenge of the Petabyte Age won’t be storing all that data, it’ll be figuring out how to make sense of it. Martin Wattenberg, a mathematician and computer scientist at IBM’s Watson Research Center in Cambridge, Massachusetts, is a pioneer in the art of visually representing and analyzing complex data sets. He and his partner at IBM, Fernanda Viégas, created Many Eyes, a collaborative site where users can share their own dynamic, interactive representations of big data. He spoke with Wired’s Mark Horowitz: Read more.

Sorting the World: Google Invents New Way to Manage Data

Used to be that if you wanted to wrest usable information from a big mess of data, you needed two things: First, a meticulously maintained database, tagged and sorted and categorized. And second, a giant computer to sift through that data using a detailed query.

But when data sets get to the petabyte scale, the old way simply isn’t feasible. Maintenance — tag, sort, categorize, repeat — would gobble up all your time. And a single computer, no matter how large, can’t crunch that many numbers.

Google’s solution for working with colossal data sets is an elegant approach called MapReduce. It eliminates the need for a traditional database and automatically splits the work across a server farm of PCs. For those not inside the Googleplex, there’s an open source version of the software library called Hadoop.

MapReduce can handle almost any type of information you throw at it, from photos to phone numbers. In the example below, we count the frequency of specific words in Google Books. Read more.
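The Wired example itself is not reproduced here, so as a rough illustration of the map-then-reduce idea, here is a minimal single-machine sketch in Python. It is not Google's or Hadoop's implementation: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample "pages" are made up for this post, and a real system would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: collapse the list of values for one key into a total."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical stand-ins for scanned book pages (illustration only).
SAMPLE_PAGES = [
    "more is different",
    "More data more answers",
]

RESULT = word_count(SAMPLE_PAGES)
```

Each call to map_phase looks at only one document, and each call to reduce_phase looks at only one word, so in principle both steps can be handed out to as many machines as are available. That independence is what lets the work be split across a server farm.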

GIS Routing Topology – This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize (Wired)

Monday, March 24th, 2008

To me the following sounds like the “cost routing” topology problems associated with road travel and travel times from my GIS classes in university. My local movie rental outlet closed this weekend and I’m back to Netflix land. Here’s hoping they can improve their “you’d like this” recommendations.

By Jordan Ellenberg, 02.25.08 | 6:00 PM

(From Wired) At first, it seemed some geeked-out supercoder was going to make an easy million. 

In October 2006, Netflix announced it would give a cool seven figures to whoever created a movie-recommending algorithm 10 percent better than its own. Within two weeks, the DVD rental company had received 169 submissions, including three that were slightly superior to Cinematch, Netflix’s recommendation software. After a month, more than a thousand programs had been entered, and the top scorers were almost halfway to the goal.

But what started out looking simple suddenly got hard. The rate of improvement began to slow. The same three or four teams clogged the top of the leaderboard, inching forward decimal by agonizing decimal. There was BellKor, a research group from AT&T. There was Dinosaur Planet, a team of Princeton alums. And there were others from the usual math powerhouses like the University of Toronto. After a year, AT&T’s team was in first place, but its engine was only 8.43 percent better than Cinematch. Progress was almost imperceptible, and people began to say a 10 percent improvement might not be possible.

Then, in November 2007, a new entrant suddenly appeared in the top 10: a mystery competitor who went by the name “Just a guy in a garage.” His first entry was 7.15 percent better than Cinematch; BellKor had taken seven months to achieve the same score. On December 20, he passed the team from the University of Toronto. On January 9, with a score 8.00 percent higher than Cinematch, he passed Dinosaur Planet. 

Read more at . . . 

Frame That Spam! Data-Crunching Artists Transform the World of Information (Wired)

Monday, March 10th, 2008

Tim McKeough posted an interactive piece in the Feb 29th edition of Wired magazine showcasing artists who muse on new media (from his intro):

Blog posts, traffic patterns, government reports, digital video, email—a new crop of data-crunching artists are using data in much the same way Picasso applied paint to transform the world of information into mesmerizing abstractions.

Their tools are programs like Processing, an open-source electronic sketchbook (flickr pool), and VVVV, which can merge audio, video, and 3-D models (flickr pool).

The results are sweet, but they’re not just eye candy: They deliver a fresh perspective on the digital detritus we humans shed–or acquire–as we inhabit the virtual world.

Read more at…

You might find some cool desktop pictures in the flickr pools linked above. Here are a couple to get you started: Image 1. Image 2. Image 3. Image 4. Thanks Laris.

Image below: Jason Salavon for the US Census Bureau, “US Population by County, 1790-2000”
