On the Wisdom of Crowds and The Good Judgment Project

This is a very interesting NPR piece on The Good Judgment Project, in which 3,000 ordinary citizens are making remarkably accurate predictions about world events and current affairs.

In fact, Tetlock and his team have even engineered ways to significantly improve the wisdom of the crowd — all of which greatly surprised Jason Matheny, one of the people in the intelligence community who got the experiment started.

“They’ve shown that you can significantly improve the accuracy of geopolitical forecasts, compared to methods that had been the state of the art before this project started,” he said.

What’s so challenging about all of this is the idea that you can get very accurate predictions about geopolitical events without access to secret information. In addition, access to classified information doesn’t automatically and necessarily give you an edge over a smart group of average citizens doing Google searches from their kitchen tables.

The story focuses on Elaine Rich, a so-called superforecaster:

In fact, she’s so good she’s been put on a special team with other superforecasters whose predictions are reportedly 30 percent better than intelligence officers with access to actual classified information.

Rich and her teammates are that good even though all the information they use to make their predictions is available to anyone with access to the Internet.

When I asked if she goes to obscure Internet sources, she shook her head no.

“Usually I just do a Google search,” she said.

Google FTW.

Data Science of the Facebook World

The ever-insightful Stephen Wolfram has another graph-heavy post, this time analyzing data from Facebook:

More than a million people have now used our Wolfram|Alpha Personal Analytics for Facebook. And as part of our latest update, in addition to collecting some anonymized statistics, we launched a Data Donor program that allows people to contribute detailed data to us for research purposes.

A few weeks ago we decided to start analyzing all this data. And I have to say that if nothing else it’s been a terrific example of the power of Mathematica and the Wolfram Language for doing data science. (It’ll also be good fodder for the Data Science course I’m starting to create.)

We’d always planned to use the data we collect to enhance our Personal Analytics system. But I couldn’t resist also trying to do some basic science with it.

I’ve always been interested in people and the trajectories of their lives. But I’ve never been able to combine that with my interest in science. Until now. And it’s been quite a thrill over the past few weeks to see the results we’ve been able to get. Sometimes confirming impressions I’ve had; sometimes showing things I never would have guessed. And all along reminding me of phenomena I’ve studied scientifically in A New Kind of Science.

So what does the data look like? Here are the social networks of a few Data Donors—with clusters of friends given different colors. (Anyone can find their own network using Wolfram|Alpha—or the SocialMediaData function in Mathematica.)

It’s a pretty fascinating read.

My favorite graph was this one, showing the distribution of your Facebook friends’ ages versus your own age:

The age of your Facebook friends versus your age.

It’s also quite interesting how the marriage statistics from Facebook line up with the official Census data:

Facebook marriage age vs. Census data.

For a lot more analysis, read Stephen Wolfram’s entire post.

Your E-Book Is Reading You

With the proliferation of e-books, publishers are using data analytics to determine what people are reading on their e-book devices, and how. The Wall Street Journal provides some detail:

Barnes & Noble, which accounts for 25% to 30% of the e-book market through its Nook e-reader, has recently started studying customers’ digital reading behavior. Data collected from Nooks reveals, for example, how far readers get in particular books, how quickly they read and how readers of particular genres engage with books. Jim Hilt, the company’s vice president of e-books, says the company is starting to share their insights with publishers to help them create books that better hold people’s attention.

Some details on which books tend to get dropped by readers:

Barnes & Noble has determined, through analyzing Nook data, that nonfiction books tend to be read in fits and starts, while novels are generally read straight through, and that nonfiction books, particularly long ones, tend to get dropped earlier. Science-fiction, romance and crime-fiction fans often read more books more quickly than readers of literary fiction do, and finish most of the books they start. Readers of literary fiction quit books more often and tend to skip around between books.

Those insights are already shaping the types of books that Barnes & Noble sells on its Nook. Mr. Hilt says that when the data showed that Nook readers routinely quit long works of nonfiction, the company began looking for ways to engage readers in nonfiction and long-form journalism. They decided to launch “Nook Snaps,” short works on topics ranging from weight loss and religion to the Occupy Wall Street movement.

Not very surprising, I suppose. I’d be interested in finding out what the criteria for a drop are: is it starting to read another book? No change in page numbers in a week? Longer?
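
To make the question concrete, here is a minimal sketch of the two candidate criteria mentioned above: moving on to another book, and no page change within some idle window. Everything in it is hypothetical, including the `ReadingState` fields, the `is_dropped` name, and the one-week default. The article does not say how Barnes & Noble actually defines a drop.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ReadingState:
    """Hypothetical snapshot of one reader's progress in one book."""
    last_page: int
    total_pages: int
    last_page_change: datetime  # last time the page number changed
    started_other_book_at: Optional[datetime] = None  # None if no other book was opened


def is_dropped(
    state: ReadingState,
    now: datetime,
    idle_limit: timedelta = timedelta(days=7),  # assumed threshold, not from the article
) -> bool:
    """Apply two guessed 'drop' criteria; neither is confirmed by the source."""
    if state.last_page >= state.total_pages:
        return False  # a finished book is not a drop

    # Criterion 1: the reader opened another book after last making progress here.
    moved_on = (
        state.started_other_book_at is not None
        and state.started_other_book_at > state.last_page_change
    )

    # Criterion 2: the page number has not changed for longer than idle_limit.
    stalled = now - state.last_page_change > idle_limit

    return moved_on or stalled
```

Even this toy version shows why the definition matters: lengthening `idle_limit`, or ignoring the "started another book" signal, would change which books count as dropped, and literary-fiction readers who skip between books would look very different under each rule.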

Another thing to consider: giving readers what they want based on analytics can backfire. Think of the sense of accomplishment a reader feels after finishing a longer book than they would otherwise have attempted, and weigh that against a publisher telling authors to limit what they put on the page and how. As one astute publisher noted: “We’re not going to shorten War and Peace because someone didn’t finish it.”

For Factual, The World Is One Big Data Problem

This is a very interesting article about Gil Elbaz, a Caltech graduate, and Factual, the company he founded:

Geared to both big companies and smaller software developers, it includes available government data, terabytes of corporate data and information on 60 million places in 50 countries, each described by 17 to 40 attributes. Factual knows more than 800,000 restaurants in 30 different ways, including location, ownership and ratings by diners and health boards. It also contains information on half a billion Web pages, a list of America’s high schools and data on the offices, specialties and insurance preferences of 1.8 million United States health care professionals. There are also listings of 14,000 wine grape varietals, of military aircraft accidents from 1950 to 1974, and of body masses of major celebrities. Odd facts matter too, Mr. Elbaz notes.

He keeps 500 terabytes of storage near Factual’s headquarters. That’s about twice the amount needed to hold the entire Library of Congress. He has more data stored inside Amazon’s giant cloud of computers. His statisticians have cleaned and corrected data to account for things like how different health departments score sanitation, whether the term “middle school” means two years or three in a particular town, and whether there were revisions between an original piece of data and its duplicate.

A quote from Mr. Elbaz: “Having money is overrated when you are brought up not to believe you are entitled to it… You can make enough money to not need things, or you can just not need things.”

###

Related: Stephen Wolfram on Personal Data Analytics