5 Statistical Traps Data Scientists Should Avoid

Authors
Matthew Mayo

(This article originally appeared at KDNuggets.com. For more, visit https://www.kdnuggets.com/)

Here are five statistical fallacies — data traps — which data scientists should be aware of and definitely avoid.

Fallacies are what we call the results of faulty reasoning. Statistical fallacies, a form of misuse of statistics, are instances of poor statistical reasoning; you may have started off with sound data, but your use or interpretation of it, regardless of your possible purity of intent, has gone awry. Whatever decisions you base on these wrong moves will rest on a faulty foundation and can easily turn out to be incorrect.

There are countless ways to reason incorrectly from data, some of which are much more obvious than others. Given that people have been making these mistakes for so long, many statistical fallacies have been identified and can be explained. The good thing is that once they are identified and studied, they can be avoided. Let's have a look at a few of the more common fallacies and see how we can avoid them.

Out of interest, when misuse of statistics is not intentional, the process bears a resemblance to cognitive biases, which Wikipedia defines as "tendencies to think in certain ways that can lead to systematic deviations from a standard of rationality or good judgment." The former builds incorrect reasoning on top of data and its explicit and active analysis, while the latter reaches a similar outcome much more implicitly and passively. That's not hard and fast, as there is definitely overlap between these two phenomena. The end result is the same, however: plain ol' wrong.

Here are five statistical fallacies, or traps, which data scientists should be aware of and definitely avoid. Failing to do so can be damaging both to data outcomes and to a data scientist's credibility.

1. Cherry Picking

In an attempt to demonstrate just how obvious and simple statistical fallacies can be, let's start off with the classic which everyone should already know: cherry picking. We can put this in the category of other easily recognizable fallacies, such as the Gambler's Fallacy, False Causality, biased sampling, overgeneralization, and many others.

The idea of cherry picking is a simple one, and something you have definitely done before: the intentional selection of data points which help support your hypothesis, at the expense of other data points which either do not support your hypothesis or actively oppose it. Have you ever heard a politician talk? Then you've heard cherry picking. Also, if you are a living, breathing human being, you have cherry picked data at some point in your life. You know you have. It's often tempting, a piece of low-hanging fruit which can win over or confound an opponent in a debate, or help push your agenda at the expense of an opposing view.

Why is it bad? Because it's dishonest, that's why. If data is truth, and analysis of data using statistical tools is supposed to help unearth truth, then cherry picking is the antithesis of truth-seeking. Don't do it.
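To make the effect concrete, here is a minimal sketch in Python. The monthly figures are made up purely for illustration, and the names monthly_change and cherry_picked are hypothetical. It compares a summary of all of the data with a summary of only the points that support a "the metric is growing" story.

```python
from statistics import mean

# Hypothetical monthly changes in some metric (percent); made up for illustration.
monthly_change = [2.1, -3.4, 0.5, -1.8, 4.0, -2.6, 1.2, -0.9, 3.3, -4.1]

# Honest summary: use every observation.
full_mean = mean(monthly_change)

# Cherry-picked summary: keep only the months that support the desired story.
cherry_picked = [x for x in monthly_change if x > 0]
cherry_mean = mean(cherry_picked)

print(f"all {len(monthly_change)} months:  {full_mean:+.2f}")
print(f"best {len(cherry_picked)} months: {cherry_mean:+.2f}")
```

With these numbers the full average is slightly negative, while the hand-selected subset looks clearly positive. Same data, opposite conclusions.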

2. McNamara Fallacy

The McNamara Fallacy is named after former US Secretary of Defense Robert McNamara, who, during the Vietnam War, based his related decisions on quantitative metrics which were easily obtainable while ignoring others. This led to his treatment of body counts (an easily obtainable metric) as the sole indicator of success, at the expense of all other measures, including those which could not be easily quantified.

Without expending much mental power, it should be relatively straightforward to see how a simple body count comparison could lead you astray when evaluating your performance on the battlefield. As one simple example, perhaps the enemy is pushing into your territory with disproportionate numbers of fighters and taking control as it does, but is losing slightly more fighters than you are in the process. As another, perhaps the enemy is taking your fighters prisoner at a much higher rate than you are killing theirs. And so on.

Putting the statistical blinders on and placing all of your trust in a single, simple metric wasn't good enough to paint a full picture of what was happening in Vietnam, and it's not going to paint a full picture of whatever it is that you are doing.

3. Cobra Effect

The Cobra Effect is an unintended consequence of what was thought to be a solution to a problem, but which instead makes the problem worse. The name comes from a specific instance of the phenomenon which took place in India under British colonial rule and which involved, you guessed it, cobras.

The Wikipedia page has a few examples of the Cobra Effect, my favorite being the attempt to reduce pollutants in Mexico City in the late 1980s. The government intended to lower emissions from vehicles by restricting by 20% the number of vehicles which could drive on a given weekday, based on the last digit of each license plate. To circumvent this policy, residents of the city purchased additional vehicles with different license plates, in hopes of having alternate permissible means of driving on the days their primary cars were banned. This led to a flood of often cheaper cars into the city, and ultimately made the pollution problem worse.

This is a much trickier issue than cherry picking, given the latent and often difficult to predict nature of unintended consequences. Team approaches to data science, and the additional thought processes these extra individuals bring, are a good way to combat Cobra Effect creep.

4. Simpson's Paradox

This paradox, named after British statistician Edward H. Simpson (though it had previously been identified by other individuals), refers to a trend observed in subgroups of a dataset which disappears or reverses once those subgroups are combined. In this sense, it can be thought of as unintentional cherry picking. An example from baseball can help to illustrate.

If we compare the batting averages of a pair of professional ballplayers year by year, we may find years in which player A had a higher batting average than player B, perhaps even significantly higher. It is entirely possible, however, that their batting averages over the entirety of their careers show that player B actually had the higher average, perhaps even significantly higher. This can happen even if player A leads in every individual year, because the number of at-bats behind each average differs from year to year.

If you knew this ahead of time and selectively chose years X, Y, and Z as evidence that player A was a better player, that would be cherry picking. If you were not aware of the aggregate statistic, but chanced upon those individual isolated years and took them as representative of their entire careers — but (hopefully) found out otherwise once looking at the full statistical picture — that would be an example of Simpson's Paradox.

Both scenarios lead to incorrect outcomes, with one being a more innocent way of arriving at the misinterpretation. It's still wrong, though, and should be guarded against. Full statistical analysis should be part of a data scientist's regimen, and is a robust approach to ensuring you don't succumb to this phenomenon.
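The baseball scenario can be sketched with a few lines of Python. The hits and at-bats below are illustrative numbers chosen to trigger the reversal, not statistics quoted in this article, and the names player_a, player_b, and career_average are hypothetical.

```python
# Illustrative (hits, at_bats) per season for two players.
seasons = {
    "player_a": {"year_1": (104, 411), "year_2": (45, 140)},
    "player_b": {"year_1": (12, 48), "year_2": (183, 582)},
}


def batting_average(hits, at_bats):
    return hits / at_bats


def career_average(player_seasons):
    # Pool hits and at-bats first; averaging the yearly averages would be wrong.
    total_hits = sum(h for h, _ in player_seasons.values())
    total_at_bats = sum(ab for _, ab in player_seasons.values())
    return batting_average(total_hits, total_at_bats)


for year in ("year_1", "year_2"):
    a = batting_average(*seasons["player_a"][year])
    b = batting_average(*seasons["player_b"][year])
    print(f"{year}: A={a:.3f}  B={b:.3f}")

career_a = career_average(seasons["player_a"])
career_b = career_average(seasons["player_b"])
print(f"career: A={career_a:.3f}  B={career_b:.3f}")
```

Player A comes out ahead in each individual year, yet player B has the higher career average. The reversal comes from the uneven at-bat counts: most of B's at-bats fall in the year when both players hit well, while most of A's fall in the year when both hit poorly.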

5. Data Dredging

Data dredging, known by other more ominous names such as p-hacking, is the "misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect." This amounts to performing a wide range of statistical tests on the data and cherry picking significant results in order to advance a narrative (meta cherry picking?). While statistical analysis should move from hypothesis to testing, data dredging involves allowing the results of statistical testing to dictate a conforming hypothesis. It amounts to the difference between "I think this is the case, now I will test if I am correct" and "Let's see what I can make the data say with testing, and then come up with an idea that it helps support."

But why is it wrong? Why are we concerned with forming hypotheses first and then testing them, instead of just letting the data dictate what might be a finding we have not thought to look for? With enough data and enough variables to test for correlations, it doesn't take long to have enough individual combinations for which something may appear to be significant. At a 5% significance level, roughly 1 in 20 tests run on pure noise will look significant. If we disregard all of the contrary evidence and focus on these conforming test results, it can appear that there is something there, when in reality there is no there there; it appears so due to chance. Capitalizing on, and justifying, chance is clearly not what science should be about.
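A small simulation makes the point. The sketch below uses only the Python standard library and rests on assumptions of my own choosing: 1,000 unrelated variables, each consisting of 50 draws of pure standard normal noise, and a simple two-sided z-test of "the mean is zero" with known unit variance. No variable has a real effect, so every "significant" result is a false positive.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # arbitrary seed, only so that reruns give the same count

N_TESTS = 1000    # number of unrelated noise variables we test
N_SAMPLES = 50    # observations per variable
ALPHA = 0.05      # conventional significance level


def p_value_mean_is_zero(sample):
    """Two-sided z-test of H0: mean == 0, assuming known unit variance."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Every variable is pure noise, so the null hypothesis is true for all of them.
p_values = [
    p_value_mean_is_zero([random.gauss(0, 1) for _ in range(N_SAMPLES)])
    for _ in range(N_TESTS)
]

false_positives = sum(p < ALPHA for p in p_values)
print(f"{false_positives} of {N_TESTS} noise variables look 'significant'")
```

The count should land somewhere around 5% of the tests. Reporting only those hits, and building a hypothesis around them after the fact, is exactly what data dredging amounts to.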

For a related concept, and an approach to determining where the "chance determination line" can be drawn, have a look at the Bonferroni correction.
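As a rough sketch of how the Bonferroni correction works: when m tests are run, each p-value is compared against alpha / m instead of alpha. The function name bonferroni_significant and the p-values below are hypothetical, chosen only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant after a Bonferroni correction."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # stricter cutoff as the test count grows
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four tests run on the same dataset.
p_values = [0.001, 0.02, 0.04, 0.30]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)

print("uncorrected:", uncorrected)
print("corrected:  ", corrected)
```

Three of the four results pass the usual 0.05 cutoff, but only the first survives the corrected threshold of 0.0125. The correction is conservative, so it guards against chance findings at the cost of sometimes missing real ones.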
