On Learning Data Science

I’ve been learning more about data science in the last couple of months and recently stumbled upon a very good blog post from Dataquest on how to learn data science.

First, it’s important that there is some inherent motivation to learn data science:

Nobody ever talks about motivation in learning. Data science is a broad and fuzzy field, which makes it hard to learn. Really hard. Without motivation, you’ll end up stopping halfway through and believing you can’t do it, when the fault isn’t with you – it’s with the teaching.

You need something that will motivate you to keep learning, even when it’s midnight, formulas are starting to look blurry, and you’re wondering if this will be the night that neural networks finally make sense.

You need something that will make you find the linkages between statistics, linear algebra, and neural networks. Something that will prevent you from struggling with the “what do I learn next?” question.

My entry point to data science was predicting the stock market, although I didn’t know it at the time. Some of the first programs I coded to predict the stock market involved almost no statistics. But I knew they weren’t performing well, so I worked day and night to make them better.

There are good links throughout, including 100 data sets for statistics.

I like the suggestions on communicating your findings and/or your learning process:

Part of communicating insights is understanding the topic and theory well. Another part is understanding how to clearly organize your results. The final piece is being able to explain your analysis clearly.

It’s hard to get good at communicating complex concepts effectively, but here are some things you should try:

Start a blog. Post the results of your data analysis.

Try to teach your less tech-savvy friends and family about data science concepts. It’s amazing how much teaching can help you understand concepts…

More resources and links here.

How Do Rumors Spread on Facebook?

How do rumors propagate on Facebook? And what propels them to go viral? One component seems to be whether people try to stop false rumors by linking to a Snopes.com article that debunks them. The Facebook Data Science team’s blog post and paper, titled “Rumor Cascades,” explain:

Tracking rumors on Facebook requires two types of information: a corpus of known rumors, and a sample of reshare cascades circulating on Facebook which can be matched to the corpus. The website Snopes.com has diligently documented thousands of rumors, and provides the starting point for our analysis. To match known rumors to this anonymized set of reshare cascades, we identify uploads and reshares that have been snoped — someone linked to a Snopes.com article in a comment. Those comments are posted by people to either warn their friends that something they posted is inaccurate or to the contrary, to validate that a rumor, though hard to believe, is in fact true. 

We gathered 250K comments, posted during July and August 2013 on 17K individual cascades, containing 62 million shares…
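
The “snoped” matching step described above can be sketched in a few lines of Python. This is only my rough illustration, not the Facebook team’s actual pipeline: the `(cascade_id, text)` comment format, the `SNOPES_LINK` pattern, and the `snoped_cascades` helper are all assumptions I made up for the example, and the URL pattern is deliberately loose.

```python
import re
from collections import defaultdict

# Assumed pattern: any http(s) URL whose host is snopes.com or a subdomain of it.
SNOPES_LINK = re.compile(r"https?://(?:[\w-]+\.)*snopes\.com\b[^\s]*", re.IGNORECASE)

def snoped_cascades(comments):
    """Map each cascade id to the Snopes URLs found in its comments.

    `comments` is an iterable of (cascade_id, comment_text) pairs,
    a hypothetical stand-in for the anonymized comment data.
    """
    hits = defaultdict(list)
    for cascade_id, text in comments:
        for url in SNOPES_LINK.findall(text):
            hits[cascade_id].append(url)
    return dict(hits)

# Toy data, invented for illustration only.
comments = [
    ("c1", "This is false, see http://www.snopes.com/some-rumor"),
    ("c1", "wow, crazy"),
    ("c2", "lol"),
    ("c3", "Actually true: https://snopes.com/another-rumor"),
]
result = snoped_cascades(comments)
```

Only cascades with at least one Snopes link end up in `result`, which is the subset the quoted passage calls “snoped.”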

A summary from the abstract:

We find that receiving such a comment increases the likelihood that a reshare of a rumor will be deleted. Furthermore, large cascades are able to accumulate hundreds of Snopes comments while continuing to propagate. 

Hinge: A Dating App Developed by a Military Contractor

The Verge reports on one John Kleint, a former military contractor who’s now switched gears and is helping develop a dating app called Hinge:

When Kleint first started working at Hinge, in a DC office not far from his old defense gig, the first challenge was understanding his new data set — tens of thousands of completely harmless Facebook users. On a good day at his old job, nobody got hurt, and now, a good day is when Hinge receives an email from two soul mates who found each other using the service. Hinge doesn’t ask the usual array of questions like “Do you believe in God?” from its users, and instead relies on pre-existing signals to make assumptions about you. Solely by examining your friends and interests, the service can predict your political leaning, your age, your sexual orientation, and your race. Kleint works on the algorithms and machine learning techniques to make it all work.

“There are certain factors that go into a stable long-term relationship, and you can infer some of those factors from your friends,” he says. “There’s no explicit equation. There’s no guessing that likes should have 20 percent weight and attraction should be 30 percent.” Picking matches is especially hard since different people have different tastes. Hinge takes the opposite approach to some dating sites like OkCupid with overt “hot or not” meters and percentage odds of being a match. And unlike dating services that simply pair you with somebody who’s also obsessed with Jay and Silent Bob Strike Back, Hinge uses that data to learn other things about you. Kleint won’t expose Hinge’s secret sauce, but points to a study by researchers at Cambridge University who created an algorithm that correctly predicts male sexuality 88 percent of the time, and is 95 percent accurate at distinguishing between African Americans and Caucasian Americans, without ever having seen a photo.

So far the app is in limited release, primarily in Washington, D.C. and New York City.