This is a strange article from Nassim Taleb, in which he cautions us about big data:
[B]ig data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.
I had to re-read that sentence a few times, and it still doesn’t make sense to me when I think of “big data.” As the sample size increases, large deviations due to chance actually become less likely. A comment on the article captures my thoughts well:
This article is misleading. When the media/public talk about big data, they almost always mean big N data. Taleb is talking about data where P is “big” (i.e., many many columns but relatively few rows, like genetic microarray data where you observe P = millions of genes for about N = 100 people), but he makes it sound like the issues he discusses apply to big N data as well. Big N data has the OPPOSITE properties of big P data—spurious correlations due to random noise are LESS likely with big N. Of course, the more important issue of causation versus correlation is an important problem when analyzing big data, but one that was not discussed in this article.
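The big-N versus big-P distinction is easy to demonstrate with a quick simulation (my own sketch, not from the article or the comment): generate pure noise, compute the correlation of each column with a random target, and look at the largest one found. With many columns and few rows, cherry-picking the maximum produces an impressively large spurious correlation; with many rows and few columns, it shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_spurious_corr(n_rows, n_cols):
    """Largest |correlation| between a random target and n_cols
    independent noise columns -- pure chance, no real signal."""
    X = rng.standard_normal((n_rows, n_cols))
    y = rng.standard_normal(n_rows)
    # Pearson correlation of each column with y, computed by hand
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corrs = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.abs(corrs).max()

# "Big P": few rows, many columns -> a large spurious correlation
# is almost guaranteed (roughly 0.5 here).
print(max_spurious_corr(n_rows=100, n_cols=100_000))

# "Big N": many rows, few columns -> even the best spurious
# correlation is tiny (roughly 0.01 here).
print(max_spurious_corr(n_rows=100_000, n_cols=100))
```

The function names and parameters are mine, chosen for illustration; the point is only that the maximum chance correlation grows with the number of columns searched and shrinks with the number of rows, which is exactly the big-P/big-N asymmetry the comment describes.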
So I think Nassim Taleb should explain what exactly he means by BIG DATA.