P-Hacking: Crash Course Statistics #30

CrashCourse
5 Sept 2018 · 11:02
Education · Learning
32 Likes · 10 Comments

TL;DR: The video discusses the problem of p-hacking, where researchers manipulate data or analyses to artificially obtain significant p-values, which can lead to false findings being published. The video explains how running multiple statistical tests increases the chance of getting spuriously significant results, using the jelly bean example and the Cornell buffet pricing study as illustrations. It stresses the importance of pre-defining analyses, correcting for multiple comparisons, and understanding Family Wise Error rates to ensure statistically valid, ethical research.

Takeaways
  • 😕 P-hacking is manipulating data or analyses to artificially get significant p-values
  • 😠 P-hacking can have serious consequences, like contributing to incorrect medical studies
  • 😳 With enough tests, fluke statistically significant results are likely even when there is no real effect
  • 😉 Ideally, analyses are chosen before seeing any data
  • 🤔 Statistical significance can be misleading without the context of the other tests that were run
  • 😎 Bonferroni corrections adjust the significance threshold to account for multiple tests
  • 😞 Publishing only statistically significant results biases science
  • 😡 Unethical data practices like p-hacking erode public trust
  • 🤓 Spotting questionable science protects people from bad decisions
  • 😀 The "green jelly beans cause acne" result is the video's running joke about false positives
Q & A
  • What is p-hacking?

    -P-hacking is manipulating data or analyses to artificially get significant p-values.

  • Why might a researcher be motivated to p-hack?

    -Researchers are incentivized to find significant results in order to publish their work and advance their careers; non-significant results are less likely to be published.

  • How can p-hacking undermine the integrity of scientific research?

    -P-hacking can lead to the publication of incorrect or misleading results. This can have consequences ranging from people making poor health choices to serious issues like the anti-vaccination movement.

  • What was the original hypothesis in the Cornell buffet study example?

    -The original hypothesis was that there is an effect of buffet price on the amount that people eat.

  • Why is running multiple statistical tests on the same data problematic?

    -Running multiple tests inflates the probability of getting at least one significant result purely by chance, even when there is no real effect. Reporting only the significant results is misleading.

  • How does the jelly bean example illustrate issues with multiple comparisons?

    -Testing 20 jelly bean colors makes it very likely that at least one significant result occurs purely by chance (roughly a 64% chance at the 0.05 level, assuming independent tests), so significant findings may be false positives.

  • What is the Family Wise Error rate?

    -The inflated Type I error rate that occurs when running multiple related statistical tests is called the Family Wise Error rate.

  • How can researchers adjust for the Family Wise Error rate?

    -Applying a Bonferroni correction adjusts the significance level to account for the number of tests being conducted (a short sketch follows this Q&A).

  • Why should the general public care about issues like p-hacking?

    -Questionable research practices can lead to poor policy decisions, health recommendations, and more that impact people's everyday lives.

  • What might be some non-malicious reasons behind p-hacking?

    -P-hacking could come from gaps in statistical knowledge, belief in a theory leading to confirmation bias, or honest mistakes.
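
As a minimal sketch tying the last few answers together, the snippet below computes the Family Wise Error rate and the Bonferroni-adjusted threshold. The numbers (alpha = 0.05, 20 tests to mirror the jelly bean colors) are illustrative assumptions, not figures taken from the video.

```python
# Family Wise Error rate for m independent tests at significance level alpha,
# plus the Bonferroni-adjusted threshold that controls it.
alpha = 0.05   # per-test significance level
n_tests = 20   # number of related tests, e.g. one per jelly bean color

# Probability of at least one false positive when every null hypothesis is true:
fwer = 1 - (1 - alpha) ** n_tests
print(f"FWER across {n_tests} tests: {fwer:.2f}")               # ~0.64

# Bonferroni correction: divide the usual threshold by the number of tests.
print(f"Bonferroni-adjusted threshold: {alpha / n_tests:.4f}")  # 0.0025
```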

Outlines
00:00
📊 Understanding p-hacking and its implications

This paragraph introduces p-hacking, which involves manipulating data or analyses to artificially get significant p-values. It states that researchers are incentivized to find significant results, but sometimes things can go wrong through p-hacking. Examples of p-hacking are provided, including choosing analyses based on what makes the p-value significant rather than having a predetermined analysis plan.
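
One concrete p-hack the paragraph alludes to is "peeking": re-running a test as data come in and stopping as soon as p < 0.05. The simulation below is my own illustration of that effect, not something from the video; all parameters (sample sizes, number of experiments) are arbitrary assumptions.

```python
# Simulate "optional stopping": peek at the p-value after every 10 observations
# and stop as soon as it dips below 0.05. The data are pure noise (the null
# hypothesis is true), so every "significant" stop is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    data = rng.normal(0, 1, 100)      # no real effect: true mean is 0
    for n in range(10, 101, 10):      # peek after 10, 20, ..., 100 points
        if stats.ttest_1samp(data[:n], 0).pvalue < 0.05:
            false_positives += 1      # stop early and "publish"
            break

# Peeking inflates the error rate well above the nominal 5%.
print(f"False positive rate with peeking: {false_positives / n_experiments:.2%}")
```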

05:02
😖 The high likelihood of errors when doing multiple statistical tests

This paragraph explains how doing multiple statistical tests, such as testing different colors of jelly beans, greatly increases the chance of getting a significant result just by chance. The probability of at least one false positive across the whole family of tests is called the Family Wise Error rate: even when there is no real effect, the more tests are run, the more likely spurious significant results become. A quick Monte Carlo check follows.
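
Here is an empirical version of the jelly bean setup; the group sizes and simulation counts are illustrative assumptions, not figures from the video.

```python
# Monte Carlo check of the jelly bean problem: 20 tests on pure noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_colors, n_per_group = 5000, 20, 30
hits = 0

for _ in range(n_sims):
    # For each color, compare an "acne" group to a control group; both are noise.
    p_values = [
        stats.ttest_ind(rng.normal(0, 1, n_per_group),
                        rng.normal(0, 1, n_per_group)).pvalue
        for _ in range(n_colors)
    ]
    if min(p_values) < 0.05:          # at least one spurious "discovery"
        hits += 1

# Should land near the analytic value 1 - 0.95**20, about 0.64.
print(f"Chance of at least one false positive: {hits / n_sims:.2f}")
```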

10:03
💡 Recommendations for accountable and ethical statistical analyses

This paragraph suggests ways for researchers to do accountable, ethical statistical analyses: determine hypotheses and analyses before looking at the data, correct for the inflated Family Wise Error rate (for example with a Bonferroni correction when doing multiple tests), and recognize that limiting false research results matters. A practical example follows.
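
In practice, the correction can be applied with standard library helpers; for example, statsmodels ships a multipletests function. The p-values below are made up for illustration.

```python
# Apply a Bonferroni correction to a batch of p-values with statsmodels.
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.021, 0.049, 0.320, 0.760]  # illustrative, not real data

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method="bonferroni")
for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")
# Only p = 0.003 survives, since 0.003 * 5 = 0.015 < 0.05.
```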

Keywords
💡p-value
A p-value is a statistical measure that indicates the probability of getting results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true. P-values are central to null hypothesis significance testing, which is used to determine whether study results are statistically significant. The video discusses how p-values can be manipulated or "hacked" through questionable research practices.
💡p-hacking
P-hacking refers to intentionally manipulating data or statistical analyses to achieve a desired p-value, usually one that is statistically significant (less than 0.05). The video explains how p-hacking can lead to false research results and incorrect conclusions. Examples in the video include testing multiple hypotheses or running multiple statistical tests and reporting only the significant ones.
💡null hypothesis
The null hypothesis assumes there is no effect or no difference. Researchers try to reject the null hypothesis through statistical testing to show evidence of an effect. The video discusses the incentives in academia to reject the null and obtain statistically significant results.
💡Type I error
A Type I error occurs when the null hypothesis is true but is incorrectly rejected. The conventional Type I error rate in research is 5%, but the video explains how practices like p-hacking can inflate this rate.
💡multiple comparisons
When doing multiple statistical tests, the chance of incorrectly rejecting at least one true null hypothesis goes up; this probability across the whole family of tests is the Family Wise Error rate. The video uses the example of testing multiple jelly bean colors to show how likely it is to get a significant result just by chance.
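The relationship described above can be written explicitly; for m independent tests at a per-test significance level alpha:

```latex
\mathrm{FWER} = 1 - (1 - \alpha)^{m},
\qquad \text{e.g.} \quad 1 - (1 - 0.05)^{20} \approx 0.64 .
```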
💡retraction
The video discusses a research study on buffet pricing that was retracted over accusations of p-hacking and questionable research practices. Retraction is the withdrawal of a published paper because its findings can no longer be trusted.
💡false positives
A false positive occurs when a test incorrectly rejects the null hypothesis, identifying an effect that is not real. The more statistical tests are done, the more likely false positives are to occur just by chance, even if there is no real effect.
💡Bonferroni correction
A method to control the Family Wise Error rate when doing multiple comparisons. It adjusts the threshold p-value needed to declare significance by dividing it by the number of tests, making it harder to obtain false positives.
💡transparency
The video argues that researchers need to be transparent about all analyses conducted, not just the significant findings. This allows readers to better judge the validity of the results.
💡reproducibility
While not directly stated, the issues covered in the video relate closely to the replication crisis in science and the struggle to reproduce many published research findings. Questionable practices like p-hacking are seen as contributing factors.
Highlights

P-hacking is manipulating data or analyses to artificially get significant p-values.

Academic journals rarely want to publish null results, i.e. findings of no evidence for an effect.

Being able to publish results is key for job stability, salary, and prestige in science.

P-hacking is choosing analyses based on what makes the p-value significant, not the best analysis plan.

P-hacked analyses can mislead and contribute to incorrect studies with serious ramifications.

Ideally, choose analyses before seeing the data, and accept that some false positives will still occur by chance.

With multiple related tests, Family Wise Error rates increase, inflating false positives.

Reporting only significant results of many tests is misleading without full context.

By 14 tests, it is more likely than not (1 - 0.95^14 ≈ 0.51) that at least one false positive turns up, even when there is no real effect.

Bonferroni correction: divide the usual p-value threshold by the number of tests to get the new threshold.

Putting out false research matters: it can affect laws, food and water regulations, and more.

Spotting questionable science means not having to avoid those green jelly beans.
