-up to my earlier article: The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines. That first article centered on how visualizations can be used to mislead, diving right into a form of data presentation widely used in public matters.
In this article, I’m going a bit deeper, looking at how a misunderstanding of statistical ideas is a breeding ground for being deceived by data. Specifically, I’ll walk through how correlation, base rates, summary statistics, and misinterpretation of uncertainty can lead people astray.
Let’s get right into it.
Correlation ≠ Causation
Let’s start with a classic to get in the right frame of mind for some more complex ideas. From the earliest statistics classes in grade school, we’re all told that correlation is not equal to causation.
If you do a bit of Googling or reading, you’ll find “statistics” that show a high correlation between cigarette consumption and average life expectancy [1]. Interesting. Well, does that mean we should all start smoking to live longer?
Of course not. We’re missing a confounding factor: buying cigarettes requires money, and countries with higher wealth understandably have higher life expectancies. There is no causal link between cigarette consumption and living longer. I like this example because it’s so blatantly misleading and highlights the point well. In general, it’s important to be wary of any data that only shows a correlational link.
From a scientific standpoint, a correlation can be identified through observation, but the only way to claim causation is to actually conduct a randomized trial that controls for potential confounding factors, which is a fairly involved process.
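To make the confounding idea concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the variable names, the noise levels, and the assumption that wealth is the only thing driving both quantities. It is not the data behind the cigarette example. It shows two variables that never influence each other, yet correlate strongly until the shared driver is accounted for.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its best-fit linear dependence on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical model: wealth drives both variables; neither affects the other.
wealth = [rng.gauss(0, 1) for _ in range(n)]
cigarettes = [w + rng.gauss(0, 0.5) for w in wealth]
life_expectancy = [w + rng.gauss(0, 0.5) for w in wealth]

raw = pearson(cigarettes, life_expectancy)
adjusted = pearson(residuals(cigarettes, wealth),
                   residuals(life_expectancy, wealth))

print(f"raw correlation: {raw:.2f}")
print(f"correlation after removing wealth: {adjusted:.2f}")
```

In this toy setup the raw correlation comes out strong, while the correlation left over after removing the linear effect of wealth sits close to zero. Real data is rarely this tidy, and the confounders are rarely this easy to name.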
I chose to start here because, while introductory, this concept also highlights a key idea that underpins understanding data effectively: The data only shows what it shows, and nothing else.
Keep that in mind as we move forward.
Remember Base Rates
In 1978, Dr. Stephen Casscells and his team famously asked a group of 60 physicians, residents, and students at Harvard Medical School the following question:
“If a test to detect a disease whose prevalence is 1 in 1,000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?”
Though presented in medical terms, this question is really about statistics. Accordingly, it also has connections to data science. Take a moment to think about your own answer to this question before reading further.
The answer is (approximately) 2%. Now, if you looked through this quickly (and aren’t up to speed with your statistics), you may have guessed significantly higher.
This was certainly the case with the medical school folks. Only 11/60 people correctly answered the question, with 27/60 going as high as 95% in their response (presumably just subtracting the false positive rate from 100).
It’s easy to assume that the actual value should be high because of the positive test result, but this assumption contains a crucial reasoning error: It fails to account for the extremely low prevalence of the disease in the population.
Said another way, if only 1 in every 1,000 people has the disease, this needs to be taken into account when calculating the probability of a random person having the disease. The probability does not depend solely on the positive test result. As soon as the test accuracy falls below 100%, the influence of the base rate comes into play quite significantly.
Formally, this reasoning error is known as the base rate fallacy.
To see this more clearly, imagine that only 1 in every 1,000,000 people had the disease, but the test still has a false positive rate of 5%. Would you still assume that a positive test result immediately indicates a 95% chance of having the disease? What if it was 1 in a billion?
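For readers who want to check the arithmetic, the calculation is a direct application of Bayes’ rule. The sketch below makes one assumption the question never states: that the test catches every true case (a sensitivity of 100%). The function and parameter names are my own, chosen for illustration.

```python
def probability_of_disease_given_positive(prevalence, false_positive_rate,
                                          sensitivity=1.0):
    """Bayes' rule: P(disease | positive test).

    sensitivity=1.0 is an assumption; the original question does not give it.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

scenarios = [
    ("1 in 1,000", 1e-3),
    ("1 in 1,000,000", 1e-6),
    ("1 in 1,000,000,000", 1e-9),
]

for label, prevalence in scenarios:
    p = probability_of_disease_given_positive(prevalence,
                                              false_positive_rate=0.05)
    print(f"prevalence {label}: P(disease | positive) = {p:.6%}")
```

With a prevalence of 1 in 1,000, the result lands just under 2%, matching the answer above. As the disease gets rarer, the probability shrinks roughly in proportion, even though the test itself has not changed.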
Base rates are extremely important. Remember that.
Statistical Measures Are NOT Equal to the Data
Let’s take a look at the following quantitative data sets (13 of them, to be precise), each of which is visualized as a scatter plot. One is even in the shape of a dinosaur.

Do you see anything interesting about these data sets?
I’ll point you in the right direction. Here is a set of summary statistics for the data:
X-Mean | 54.26 |
Y-Mean | 47.83 |
X-SD (Standard Deviation) | 16.76 |
Y-SD | 26.93 |
Correlation | -0.06 |
If you’re wondering why there is only one set of statistics, it’s because they’re all the same. Every single one of the 13 charts above has the same mean, standard deviation, and correlation between variables.
This famous set of 13 data sets is known as the Datasaurus Dozen [5], and was published some years ago as a stark example of why summary statistics cannot always be trusted. It also highlights the value of visualization as a tool for data exploration. In the words of renowned statistician John Tukey,
“The greatest value of a picture is when it forces us to notice what we never expected to see.”
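The same effect is easy to reproduce at a small scale. The sketch below does not use the actual Datasaurus Dozen data. It builds two synthetic shapes of my own choosing (a circle and an X) and rescales both to the means and standard deviations from the table above. Their correlation comes out as 0 rather than the -0.06 in the table, so treat it as an illustration of the idea, not a reconstruction.

```python
import math
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rescale(values, target_mean, target_sd):
    """Shift and stretch values so they have the requested mean and SD."""
    m, s = mean(values), pstdev(values)
    return [target_mean + target_sd * (v - m) / s for v in values]

def make_dataset(xs, ys):
    """Give a shape the X/Y means and SDs from the summary table."""
    return rescale(xs, 54.26, 16.76), rescale(ys, 47.83, 26.93)

def summarize(xs, ys):
    """(x mean, y mean, x SD, y SD, correlation), rounded to 2 decimals."""
    stats = (mean(xs), mean(ys), pstdev(xs), pstdev(ys), pearson(xs, ys))
    # Adding 0.0 turns a rounded -0.0 into a plain 0.0.
    return tuple(round(v, 2) + 0.0 for v in stats)

# Shape 1: 100 points evenly spaced on a circle.
angles = [2 * math.pi * i / 100 for i in range(100)]
circle = make_dataset([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles])

# Shape 2: 100 points on two crossing diagonals (an X).
steps = [-1 + 2 * i / 49 for i in range(50)]
cross = make_dataset(steps + steps, steps + [-s for s in steps])

print("circle:", summarize(*circle))
print("cross: ", summarize(*cross))
```

Both shapes report the same five summary numbers, yet a scatter plot would show a ring in one case and an X in the other. Only plotting the points reveals the difference.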
Understanding Uncertainty
To conclude, I want to talk about a slight variation of deceptive data, but one that is equally important: mistrusting data that is actually correct. In other words, false deception.
The following chart is taken from a study analyzing the sentiments of headlines taken from left-leaning, right-leaning, and centrist news outlets [6]:

There is quite a bit going on in the chart above, but there is one particular aspect I want to draw your attention to: the vertical lines extending from each plotted point. You may have seen these before. Formally, these are called error bars, and they are one way that scientists often depict uncertainty in the data.
Let me say that again. In statistics and data science, “error” is synonymous with “uncertainty.” Crucially, it does not mean something is wrong or incorrect about what is being shown. When a chart depicts uncertainty, it depicts a carefully calculated measure of the range of a value and the level of confidence at various points within that range. Unfortunately, many people just take it to mean that whoever made the chart is essentially guessing.
This is a serious error in reasoning, for the damage is twofold: Not only does the data at hand get misinterpreted, but the presence of this misconception also contributes to the dangerous societal belief that science is not to be trusted. Being upfront about the limitations of data should actually increase our confidence in a claim’s reliability, but mistaking that limitation for an admission of foul play leads to the opposite effect.
Learning how to interpret uncertainty is challenging but incredibly important. At the minimum, a good place to start is realizing what the so-called “error” is actually trying to convey.
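As a rough illustration of where an error bar can come from, here is a minimal sketch that computes a mean and an approximate 95% confidence interval. The scores are invented for the example, and I am not claiming this is how the study above built its error bars. Charts differ in whether the bars show a standard deviation, a standard error, or a confidence interval, so the caption is always worth checking.

```python
import math
from statistics import mean, stdev

def mean_with_error_bar(values, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Uses the normal approximation (z = 1.96). For small samples, a
    t-distribution multiplier would give a somewhat wider interval.
    """
    center = mean(values)
    standard_error = stdev(values) / math.sqrt(len(values))
    return center, center - z * standard_error, center + z * standard_error

# Hypothetical sentiment scores for ten headlines (invented for illustration).
scores = [0.12, -0.05, 0.30, 0.08, -0.10, 0.22, 0.15, 0.02, 0.18, -0.02]

center, low, high = mean_with_error_bar(scores)
print(f"mean = {center:.3f}, error bar from {low:.3f} to {high:.3f}")
```

The interval is a computed quantity that depends on how spread out the measurements are and how many of them there are. More data or less scatter gives a shorter bar. Nothing about it signals a mistake.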
Recap and Final Thoughts
Here’s a cheat sheet for being wary of deceptive data:
- Correlation ≠ causation. Look for the confounding factor.
- Remember base rates. The probability of a phenomenon is highly influenced by its prevalence in the population, no matter how accurate your test is (except for 100% accuracy, which is rare).
- Beware summary statistics. Means and medians will only take you so far; you need to explore your data.
- Don’t misunderstand uncertainty. It isn’t an error; it’s a carefully considered description of confidence levels.
Remember these, and you’ll be well positioned to tackle the next data science problem that makes its way to you.
Until next time.
References
[1] How Charts Lie, Alberto Cairo
[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC4955674
[4] https://visualizing.jp/the-datasaurus-dozen
[6] https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276367