On Learning Data Science

I’ve been learning more about data science in the last couple of months and recently stumbled upon a very good blog post from Dataquest on how to learn data science.

First, it’s important to have some inherent motivation to learn data science:

Nobody ever talks about motivation in learning. Data science is a broad and fuzzy field, which makes it hard to learn. Really hard. Without motivation, you’ll end up stopping halfway through and believing you can’t do it, when the fault isn’t with you – it’s with the teaching.

You need something that will motivate you to keep learning, even when it’s midnight, formulas are starting to look blurry, and you’re wondering if this will be the night that neural networks finally make sense.

You need something that will make you find the linkages between statistics, linear algebra, and neural networks. Something that will prevent you from struggling with the “what do I learn next?” question.

My entry point to data science was predicting the stock market, although I didn’t know it at the time. Some of the first programs I coded to predict the stock market involved almost no statistics. But I knew they weren’t performing well, so I worked day and night to make them better.

There are good links throughout, including 100 data sets for statistics.

I like the suggestions on communicating your findings and/or your learning process:

Part of communicating insights is understanding the topic and theory well. Another part is understanding how to clearly organize your results. The final piece is being able to explain your analysis clearly.

It’s hard to get good at communicating complex concepts effectively, but here are some things you should try:

Start a blog. Post the results of your data analysis.

Try to teach your less tech-savvy friends and family about data science concepts. It’s amazing how much teaching can help you understand concepts…

More resources and links here.

On the Wisdom of Crowds and The Good Judgment Project

This is a very interesting NPR piece on The Good Judgment Project, in which 3,000 ordinary citizens are making remarkably accurate predictions about world events and current affairs.

In fact, Tetlock and his team have even engineered ways to significantly improve the wisdom of the crowd — all of which greatly surprised Jason Matheny, one of the people in the intelligence community who got the experiment started.

“They’ve shown that you can significantly improve the accuracy of geopolitical forecasts, compared to methods that had been the state of the art before this project started,” he said.

What’s so striking about all of this is that you can get very accurate predictions about geopolitical events without access to secret information, and that access to classified information doesn’t necessarily give you an edge over a smart group of average citizens doing Google searches from their kitchen tables.
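As a toy illustration of why aggregation helps at all (my own sketch, nothing to do with the project’s actual methods): if a few thousand forecasters make independent, noisy guesses centered on the truth, the crowd’s average lands far closer to it than the typical individual does.

```python
# Toy wisdom-of-crowds simulation; all numbers are hypothetical, not the
# Good Judgment Project's. 3,000 forecasters each guess a probability with
# independent noise around the true value.
import random

random.seed(1)
truth = 0.62  # hypothetical "true" probability of some geopolitical event
guesses = [min(max(random.gauss(truth, 0.15), 0.0), 1.0) for _ in range(3000)]

crowd_estimate = sum(guesses) / len(guesses)
avg_individual_error = sum(abs(g - truth) for g in guesses) / len(guesses)

print(f"crowd error:              {abs(crowd_estimate - truth):.4f}")
print(f"average individual error: {avg_individual_error:.4f}")
```

The crowd’s error comes out far smaller than the typical individual’s, and that’s before any of the engineering Tetlock’s team layered on top.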

The story focuses on Elaine Rich, a so-called superforecaster:

In fact, she’s so good she’s been put on a special team with other superforecasters whose predictions are reportedly 30 percent better than intelligence officers with access to actual classified information.

Rich and her teammates are that good even though all the information they use to make their predictions is available to anyone with access to the Internet.

When I asked if she goes to obscure Internet sources, she shook her head no.

“Usually I just do a Google search,” she said.

Google FTW.

Statistical Stylometry: Quantifying Elements of Writing Style that Differentiate Successful Fiction

Can good writing be differentiated from bad writing through some kind of algorithm? Many have tried to answer this research question. The latest news in this realm comes from Stony Brook University, where a group of researchers:

…[T]ook 1000 sentences from the beginning of each book. They performed systematic analyses based on lexical and syntactic features that have been proven effective in Natural Language Processing (NLP) tasks such as authorship attribution, genre detection, gender identification, and native language detection.

“To the best of our knowledge, our work is the first that provides quantitative insights into the connection between the writing style and the success of literary works,” Choi says. “Previous work has attempted to gain insights into the ‘secret recipe’ of successful books. But most of these studies were qualitative, based on a dozen books, and focused primarily on high-level content—the personalities of protagonists and antagonists and the plots. Our work examines a considerably larger collection—800 books—over multiple genres, providing insights into lexical, syntactic, and discourse patterns that characterize the writing styles commonly shared among the successful literature.”

I had no idea there was a name for this kind of research. Statistical stylometry is the statistical analysis of variations in literary style between one writer or genre and another. This study reports, for the first time, that the discipline can be effective in distinguishing highly successful literature from its less successful counterpart, achieving accuracy rates as high as 84%.

The best book on writing that I’ve read is Stephen King’s On Writing, and his advice about description lines up with one of the researchers’ findings:

[T]he less successful books also rely on verbs that explicitly describe actions and emotions (“wanted”, “took”, “promised”, “cried”, “cheered”), while more successful books favor verbs that describe thought-processing (“recognized”, “remembered”) and verbs that simply serve the purpose of quotes (“say”).
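Out of curiosity, here’s a tiny sketch of what that kind of lexical classification could look like in code. To be clear, this is not the Stony Brook pipeline, just TF-IDF word features feeding a linear classifier, with made-up placeholder texts and labels standing in for the 800 books:

```python
# A minimal sketch of lexical-feature classification, NOT the authors' actual
# pipeline. The texts and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

openings = [
    "She remembered the long winter and recognized the shape of the road ...",
    "He wanted to run, took the bag, and cried out as the crowd cheered ...",
]                      # stand-ins for the first ~1000 sentences of each book
labels = [1, 0]        # 1 = successful, 0 = less successful (hypothetical)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams/bigrams, lexical features only
    LogisticRegression(max_iter=1000),    # simple linear classifier
)
model.fit(openings, labels)

print(model.predict(["He promised he would remember the house by the sea ..."]))
```

The real study goes well beyond this and adds syntactic and discourse features, but the basic shape is the same: text in, features out, classifier on top.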

Nate Silver on Learning, Intuition, Boredom, and Changing Jobs

The Harvard Business Review recently sat down with Nate Silver, everyone’s favorite stat nerd and author of The Signal and the Noise, for an interview. The whole thing is worth the read, but I enjoyed the following two exchanges.

On learning and intuition:

HBR: What about if I’ve read your book and I’m just starting college or a little younger and I’m trying to think actually maybe this statistician/data scientist role is something that I’m interested in? What do I study? How much education do I need? What’s that base for plugging into some of these jobs?

Silver:  Again, I think the applied experience is a lot more important than the academic experience. It probably can’t hurt to take a stats class in college.

But it really is something that requires a lot of different parts of your brain. I mean the thing that’s toughest to teach is the intuition for what are big questions to ask. That intellectual curiosity. That bullshit detector for lack of a better term, where you see a data set and you have at least a first approach on how much signal there is there. That can help to make you a lot more efficient.

That stuff is kind of hard to teach through book learning. So it’s by experience. I would be an advocate if you’re going to have an education, then have it be a pretty diverse education so you’re flexing lots of different muscles.

You can learn the technical skills later on, and you’ll be more motivated to learn more of the technical skills when you have some problem you’re trying to solve or some financial incentive to do so. So, I think not specializing too early is important.

On being listened to, being bored at work, and changing jobs:

HBR: You’ve had obviously some very public experience with the fact that even when the data is good and the model is good, people can push back a lot for various reasons, legitimate and otherwise. Any advice for once you’re in that position, you have a seat at the table, but the other people around the table are really just not buying what you’re selling?

Silver: If you can’t present your ideas to at least a modestly larger audience, then it’s not going to do you very much good. Einstein supposedly said that I don’t trust any physics theory that can’t be explained to a 10-year-old. A lot of times the intuitions behind things aren’t really all that complicated. In Moneyball that on-base percentage is better than batting average looks like ‘OK, well, the goal is to score runs. The first step in scoring runs is getting on base, so let’s have a statistic that measures getting on base instead of just one type of getting on base.’ Not that hard a battle to fight.

Now, if you feel like you’re expressing yourself and getting the gist of something and you’re still not being listened to, then maybe it’s time to change careers. It is the case [that] people who have analytic talent are very much in demand right now across a lot of fields so people can afford to be picky to an extent.

Don’t take a job where you feel bored. If it’s challenging, you feel like you’re growing, you have good internal debates, that’s fine. Some friction can be healthy. But if you feel like you’re not being listened to, then you’re going to just want to slit your wrists after too much longer. It’s time to move on.

Excellent advice.

Nassim Taleb on Big Data

This is a strange article from Nassim Taleb, in which he cautions us about big data:

[B]ig data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.

I had to re-read that sentence a few times, and it still doesn’t make sense to me when I think of “big data”: as the sample size increases, large variations due to chance actually decrease. Here’s a good comment on the article that captures my thoughts:

This article is misleading. When the media/public talk about big data, they almost always mean big N data. Taleb is talking about data where P is “big” (i.e., many many columns but relatively few rows, like genetic microarray data where you observe P = millions of genes for about N = 100 people), but he makes it sound like the issues he discusses apply to big N data as well. Big N data has the OPPOSITE properties of big P data—spurious correlations due to random noise are LESS likely with big N. Of course, the more important issue of causation versus correlation is an important problem when analyzing big data, but one that was not discussed in this article.

So I think Nassim Taleb should explain what he means by BIG DATA.
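To make the commenter’s big-N versus big-P distinction concrete, here’s a minimal simulation of my own (nothing from Taleb’s article): draw pure noise, then report the largest correlation you can find between a target column and a pile of feature columns, once with many columns and few rows, and once with many rows and few columns.

```python
# Spurious correlations in pure noise: "big P" (many columns, few rows)
# versus "big N" (many rows, few columns). All numbers are made up.
import numpy as np

rng = np.random.default_rng(42)

def max_spurious_correlation(n_rows, n_cols):
    """Largest absolute correlation between a noise target and noise features."""
    target = rng.standard_normal(n_rows)
    features = rng.standard_normal((n_rows, n_cols))
    corrs = [np.corrcoef(target, features[:, j])[0, 1] for j in range(n_cols)]
    return max(abs(c) for c in corrs)

print("big P, small N:", max_spurious_correlation(n_rows=100, n_cols=10_000))
print("big N, small P:", max_spurious_correlation(n_rows=100_000, n_cols=10))
```

On a typical run the big-P case coughs up a “correlation” of roughly 0.4 from nothing but noise, while the big-N case stays near zero. So the warning is real, but it’s a warning about wide data, not about large samples.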

Kevin Kelly: The Impossible is the New Normal

One of the best things I’ve read this week is Kevin Kelly’s take on “the impossible is the new normal”:

Every minute a new impossible thing is uploaded to the internet and that improbable event becomes just one of hundreds of extraordinary events that we’ll see or hear about today. The internet is like a lens which focuses the extraordinary into a beam, and that beam has become our illumination. It compresses the unlikely into a small viewable band of everyday-ness. As long as we are online – which is almost all day many days — we are illuminated by this compressed extraordinariness. It is the new normal.

That light of super-ness changes us. We no longer want mere presentations, we want the best, greatest, the most extraordinary presenters alive, as in TED. We don’t want to watch people playing games, we want to watch the highlights of the highlights, the most amazing moves, catches, runs, shots, and kicks, each one more remarkable and improbable than the other.

We are also exposed to the greatest range of human experience, the heaviest person, shortest midgets, longest mustache — the entire universe of superlatives! Superlatives were once rare — by definition — but now we see multiple videos of superlatives all day long, and they seem normal. Humans have always treasured drawings and photos of the weird extremes of humanity (early National Geographics), but there is an intimacy about watching these extremities on video on our phones while we wait at the dentist. They are now much realer, and they fill our heads.

My only lament is that Mr. Kelly chose to illustrate the extraordinary with a poor statistical anecdote:

To the uninformed, the increased prevalence of improbable events will make it easier to believe in impossible things. A steady diet of coincidences makes it easy to believe they are more than just coincidences, right? But to the informed, a slew of improbable events make it clear that the unlikely sequence, the outlier, the black swan event, must be part of the story. After all, in 100 flips of the penny you are just as likely to get 100 heads in a row as any other sequence. But in both cases, when improbable events dominate our view — when we see an internet river streaming nothing but 100 heads in a row — it makes the improbable more intimate, nearer.

Sure. But it would have made more sense to compare the probability of getting 100 heads in a row with the probability of a more ordinary outcome (for example, the probability of getting between 45 and 55 heads in 100 tosses of a fair coin).
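For what it’s worth, the two numbers are easy to check. Here’s a quick sketch (my own arithmetic, not Kelly’s) comparing the probability of 100 straight heads with the probability of landing in that boring middle band:

```python
# 100 heads in a row versus an ordinary-looking result (45-55 heads)
# in 100 tosses of a fair coin.
from math import comb

n = 100
p_all_heads = 0.5 ** n
p_45_to_55 = sum(comb(n, k) for k in range(45, 56)) * 0.5 ** n

print(f"P(100 heads in a row): {p_all_heads:.2e}")  # on the order of 1e-30
print(f"P(45 to 55 heads):     {p_45_to_55:.2f}")   # roughly 0.73
```

Any one specific sequence is indeed as unlikely as 100 straight heads, but the class of unremarkable outcomes dwarfs it, which is the contrast the anecdote needed.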

KK is the author of What Technology Wants, which I recommend reading (I read it near the end of 2010).