What Is a P-Value? Definition, Calculation, Use, Interpretation

Glossary of Key Terms

The following terms are used throughout this article. Refer back to this section whenever an unfamiliar concept appears.

 

Term | Definition
P-value | The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.
Null hypothesis (H0) | A statement asserting no effect, no difference, or no relationship between variables.
Alternative hypothesis (H1) | A statement asserting that an effect, difference, or relationship does exist.
Alpha (significance level) | The threshold probability, set before the experiment, below which the null hypothesis is rejected.
Type I error | Falsely rejecting a true null hypothesis; its maximum probability equals alpha.
Type II error | Failing to reject a false null hypothesis; its probability is denoted beta.
Statistical power | The probability of correctly rejecting a false null hypothesis; calculated as 1 minus beta.
Effect size | A quantitative measure of the magnitude of a difference or relationship, independent of sample size.
Confidence interval (CI) | A range of values produced by a procedure that, over repeated sampling, captures the true population parameter at a specified rate (the confidence level, typically 95%).
Test statistic | A numerical value calculated from sample data that is compared against a reference distribution.
Degrees of freedom (df) | The number of independent values in a calculation that are free to vary.
Bayes factor | A ratio comparing the probability of data under two competing hypotheses, used in Bayesian inference.
P-hacking | The practice of manipulating analyses or selectively reporting results to achieve a desired p-value.
Multiple testing | Conducting several hypothesis tests simultaneously, which inflates the risk of false positives.
False discovery rate (FDR) | The expected proportion of rejected null hypotheses that are actually true.

 

Key Takeaways

  • A p-value measures the probability of obtaining results at least as extreme as those observed if the null hypothesis were true; it does not measure the probability that the null hypothesis is true.
  • P-values must be interpreted alongside effect sizes, confidence intervals, and sample sizes to be meaningful.
  • The conventional alpha threshold of 0.05 is arbitrary; always justify the chosen threshold in your methods section.
  • Type I error (false positive) is controlled by alpha; Type II error (false negative) is controlled by sample size and study power.
  • Running multiple tests inflates the false-positive rate; corrections such as the Bonferroni adjustment or the Benjamini-Hochberg procedure should be applied.
  • P-hacking and selective reporting are ethically problematic and undermine scientific reproducibility.
  • Confidence intervals convey whether a result is statistically significant and, in addition, the magnitude and precision of the estimate; they are increasingly preferred by journal editors.
  • Bayesian alternatives such as Bayes factors provide a complementary framework that directly addresses the probability of a hypothesis given the data.

 

Introduction

The p-value is one of the most widely used, most frequently misunderstood, and most actively debated concepts in quantitative research. From clinical trials and psychology experiments to quality control and machine learning, researchers in virtually every discipline encounter p-values as part of the process of deciding whether observed results are likely to reflect real phenomena or mere chance.

 

This article explains what a p-value is, how to calculate it using common statistical tests, how to interpret it correctly in the context of Type I and Type II errors, statistical power, confidence intervals, and effect sizes, and how to report it in a research paper. It also addresses important limitations and modern debates around p-value use, including p-hacking, multiple testing, and Bayesian alternatives.

 

What Is a P-Value?

A p-value (probability value) is the probability of obtaining a test result as extreme as, or more extreme than, the one actually observed, assuming the null hypothesis is true. In practical terms, it answers the question: if there were truly no effect or no difference in the population, how likely would it be to observe results like these simply by chance?

 

A small p-value indicates that the observed data are unlikely to have occurred by chance alone under the null hypothesis, providing evidence against the null. A large p-value indicates that the observed data are consistent with the null hypothesis.

 

Key clarifications about what a p-value is and is not:

 

What a p-value IS | What a p-value is NOT
The probability of the data (or more extreme data) given H0 is true | The probability that H0 is true
A measure of evidence against H0 | A measure of the size or importance of an effect
One input into a broader statistical decision | A definitive proof of anything
Dependent on sample size, effect size, and variability | A fixed property of a phenomenon
A continuous value from 0 to 1 | A binary pass/fail judgment on its own

 

What Is the Null Hypothesis, and Why Does It Matter?

The null hypothesis is assumed true by default, and the p-value only makes sense relative to it. It is the foundation of every hypothesis test. The null hypothesis (H0) states that there is no effect, no difference, or no relationship. The alternative hypothesis (H1) states that an effect, difference, or relationship does exist.

 

Consider a clinical trial comparing two pain medications. The hypotheses might be stated as follows:

 

  • H0: The mean duration of pain relief for Drug A equals the mean duration for Drug B.
  • H1: The mean duration of pain relief for Drug A is greater than the mean duration for Drug B.

 

The p-value quantifies how strongly the data contradict H0. Unless the p-value falls below the pre-specified alpha threshold, H0 is not rejected. Importantly, failing to reject H0 does not prove it is true; it only means the data are insufficient to rule it out.

 

A helpful analogy: the null hypothesis works like a presumption of innocence in a legal trial. The defendant is assumed innocent (H0 is true) until sufficient evidence (a sufficiently small p-value) is produced to conclude otherwise.

How Is a P-Value Calculated?

P-values are derived from a test statistic, a number calculated from your sample data that measures how far the observed result deviates from what would be expected under H0. The specific formula depends on the test you choose. The p-value is then the area in the tail or tails of the relevant probability distribution, beyond the observed test statistic.

 

The general steps are the same for all parametric tests:

 

  • Step 1: State H0 and H1.
  • Step 2: Choose the appropriate statistical test for your data and research question.
  • Step 3: Calculate the test statistic from your sample data.
  • Step 4: Identify the relevant probability distribution (t, z, chi-squared, or F) and degrees of freedom.
  • Step 5: Find the area under the distribution curve beyond the test statistic. This area is the p-value.
  • Step 6: Compare the p-value to the pre-set alpha level and make a decision about H0.
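
The six steps can be traced in a few lines of Python. The sketch below is only an illustration under stated assumptions: it runs a two-tailed one-sample z-test using the standard library, and every number in it (sample mean, reference mean, known population standard deviation, sample size) is invented for the example rather than taken from a real study.

```python
import math
from statistics import NormalDist

# Step 1: H0: the population mean equals mu0; H1: it differs (two-tailed).
# All inputs below are hypothetical values chosen for illustration.
sample_mean = 123.1   # mean systolic blood pressure in the sample (mmHg)
mu0 = 120.0           # reference mean under H0 (mmHg)
sigma = 15.0          # population standard deviation, assumed known
n = 200               # sample size
alpha = 0.05          # significance level, fixed in advance

# Steps 2-3: a z-test fits here (large n, known sigma); compute the statistic.
standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu0) / standard_error

# Steps 4-5: the reference distribution is the standard normal;
# the p-value is the area in both tails beyond |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 6: compare the p-value with alpha.
reject_h0 = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

With a t-test, chi-squared test, or F-test the logic is identical; only the formula in Step 3 and the reference distribution in Step 4 change.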

 

Which Statistical Test Should You Use?

Choosing the correct test is essential. The wrong test produces an unreliable p-value. Use the table below as a guide.

 

Test | Use When | Example Research Question
Z-test | Large sample (n > 30) and population variance is known | Does the average systolic blood pressure in a sample of 200 patients differ from the national average?
One-sample t-test | Small sample and population variance is unknown; one group vs. a fixed value | Does the average response time of our software differ from the benchmark of 200 ms?
Independent samples t-test | Comparing means of two unrelated groups | Is there a difference in exam scores between students taught by Method A vs. Method B?
Paired t-test | Comparing means within the same group measured twice | Do patients show lower cholesterol after 12 weeks of treatment than before?
Chi-squared test | Categorical variables; testing independence or goodness of fit | Is smoking status independent of disease outcome?
F-test / ANOVA | Comparing means across three or more groups | Do three different fertilizers produce significantly different crop yields?
Correlation test (Pearson or Spearman) | Testing whether a linear or monotonic relationship exists between two variables | Is there a significant correlation between hours studied and exam score?

 

Worked Example: Two-Sample T-Test

Suppose a researcher compares the mean height of male and female students at a university.

 

Parameter | Group 1 (Males) | Group 2 (Females)
Sample size (n) | 30 | 35
Sample mean | 175 cm | 168 cm
Standard deviation | 5 cm | 6 cm

 

H0: There is no difference in mean height between males and females.

H1: There is a difference in mean height between males and females.

 

The two-sample t-statistic formula is:

 

t = (x1 – x2) / sqrt( (s1^2 / n1) + (s2^2 / n2) )

 

Substituting the values:

 

t = (175 – 168) / sqrt( (25/30) + (36/35) ) = 7 / sqrt(0.833 + 1.029) = 7 / 1.364 = 5.13

 

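Degrees of freedom: df = (30 + 35) – 2 = 63. Strictly, this formula belongs to the pooled-variance t-test, whereas the statistic above uses separate variances (Welch's form); the matching Welch-Satterthwaite approximation gives df ≈ 62.96 for these data, so the result is unchanged after rounding.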

 

Using a t-distribution table or statistical software with df = 63 and t = 5.13, the two-tailed p-value is approximately 0.000003. Because this is far below the conventional alpha of 0.05, H0 is rejected. The evidence strongly suggests a real difference in mean height between the two groups.
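
The arithmetic of this worked example can be checked without a statistics package. The following is a minimal sketch, not a replacement for validated software: it computes the unpooled (Welch) t-statistic and the Welch-Satterthwaite degrees of freedom from the summary values above, then obtains the two-tailed p-value by numerically integrating the t density with Simpson's rule. The function names are hypothetical and exist only for this illustration.

```python
import math

def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Return (t, df) for the unpooled two-sample t-test from summary statistics."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t|."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3     # P(0 < T < |t|)
    return 1.0 - 2.0 * area_zero_to_t

t_stat, df = welch_t_from_summary(175, 5, 30, 168, 6, 35)
p_value = two_tailed_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.2f}, p = {p_value:.2e}")
```

The p-value lands in the low millionths, in line with the approximate figure quoted above; in real analyses a library routine is preferable to hand-rolled integration.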

 

How to Calculate a P-Value Using Software

In practice, researchers use statistical software rather than tables. Examples are shown below.

 

Software / Language | Example Code or Method
Python (scipy) | from scipy import stats; t_stat, p = stats.ttest_ind(group1, group2, equal_var=False)  # Welch's test; the default equal_var=True pools the variances
R | t.test(group1, group2)  # Welch's test by default; outputs t-statistic and p-value directly
SPSS | Analyze > Compare Means > Independent-Samples T Test
SAS | PROC TTEST; CLASS group; VAR score; RUN;
Excel | Use the T.TEST() function with appropriate tail and type arguments

 

One-Tailed vs. Two-Tailed Tests: Which Should You Use?

Use a two-tailed test unless you have a strong, pre-specified reason to expect a difference in only one direction. A two-tailed test is almost always the safer and more defensible choice.

 

Test Type | When to Use | Implication for P-Value
Two-tailed | You predict a difference but not a specific direction (most common) | P-value is the combined area in both tails; for a symmetric distribution it is twice the smaller one-tailed value, so significance is harder to reach
Upper-tailed (right) | You predict the new value will be greater than the reference | P-value is area in the right tail only
Lower-tailed (left) | You predict the new value will be less than the reference | P-value is area in the left tail only

 

A one-sided test is only appropriate when a large change in the unexpected direction would have absolutely no relevance to the study. If there is any doubt, use a two-tailed test.
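
To make the difference concrete, the short sketch below computes all three p-values for the same test statistic. It assumes a z statistic with a standard normal reference distribution, and the value z = 1.8 is hypothetical, picked because it sits between the one-tailed and two-tailed cutoffs at alpha = 0.05.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return (upper-tailed, lower-tailed, two-tailed) p-values for a z statistic."""
    cdf = NormalDist().cdf(z)
    upper = 1 - cdf                 # area to the right of z
    lower = cdf                     # area to the left of z
    two_tailed = 2 * min(upper, lower)
    return upper, lower, two_tailed

upper, lower, two_tailed = tail_p_values(1.8)  # hypothetical test statistic
print(f"upper-tailed p = {upper:.4f}")
print(f"lower-tailed p = {lower:.4f}")
print(f"two-tailed p   = {two_tailed:.4f}")
```

The same data are "significant" under the upper-tailed test and "not significant" under the two-tailed test, which is why the choice has to be fixed before the data are seen.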

 

P-Values and Statistical Significance

A result is described as statistically significant when the p-value falls below the pre-set alpha level, indicating that the observed data would be unlikely if H0 were true. Statistical significance does not imply practical or scientific importance.

 

The standard p-value interpretation table is shown below.

 

P-Value Range | Interpretation | Decision on H0
p > 0.10 | Not significant; results are consistent with H0 | Do not reject H0
0.05 < p <= 0.10 | Marginally significant; weak evidence against H0 | Do not reject H0; interpret with caution
0.01 < p <= 0.05 | Statistically significant; reasonable evidence against H0 | Reject H0 at alpha = 0.05
0.001 < p <= 0.01 | Highly statistically significant; strong evidence against H0 | Reject H0 at alpha = 0.01
p <= 0.001 | Very highly significant; very strong evidence against H0 | Reject H0 at alpha = 0.001

 

The Asterisk Rating System for Journal Reporting

Many journals use an asterisk system alongside exact p-values to signal significance levels at a glance. Always report the exact p-value in addition to any asterisk notation, because asterisks alone do not allow the reader to assess the strength of evidence.

 

Asterisk Notation | Meaning
* | p < 0.05 (statistically significant)
** | p < 0.01 (highly statistically significant)
*** | p < 0.001 (very highly statistically significant)
ns | Not statistically significant (p ≥ 0.05)

 

Type I and Type II Errors: What Can Go Wrong?

Every hypothesis test carries a risk of two types of incorrect conclusions. Understanding these errors is critical to designing studies with appropriate rigor.

 

True state | Reject H0 | Fail to Reject H0
H0 is TRUE | Type I Error (alpha) | Correct Decision (1 – alpha)
H0 is FALSE | Correct Decision (Power: 1 – beta) | Type II Error (beta)

 

Type I error (false positive):

  • Occurs when H0 is true but is incorrectly rejected.
  • Its maximum probability is alpha (the significance level), which you set before the study.
  • It is not affected by sample size; it is entirely controlled by the chosen alpha.
  • Example: concluding a drug works when it actually has no effect.

 

Type II error (false negative):

  • Occurs when H0 is false but is not rejected.
  • Its probability is beta, which depends on sample size, alpha, and the true effect size.
  • As sample size increases, beta decreases, making it less likely you will miss a real effect.
  • Example: concluding a drug has no effect when it actually works.

 

The relationship between alpha and beta is a trade-off. Lowering alpha (making the test stricter about false positives) tends to increase beta (making it harder to detect real effects) unless the sample size is also increased.
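
The claim that alpha sets the Type I error rate can be checked by simulation. The sketch below is an illustration under simplifying assumptions (normally distributed data, a z-test with known standard deviation of 1, and a fixed random seed): it repeatedly draws samples for which H0 is true and counts how often the test rejects anyway.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(alpha, n=30, sims=5000, seed=1):
    """Share of two-tailed z-tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is 0
        z = (sum(sample) / n) * math.sqrt(n)               # sigma known = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

rate_05 = false_positive_rate(0.05)
rate_01 = false_positive_rate(0.01)
print(f"alpha = 0.05 -> observed false-positive rate {rate_05:.3f}")
print(f"alpha = 0.01 -> observed false-positive rate {rate_01:.3f}")
```

The observed rates hover near the chosen alpha whatever the sample size n, which is the sense in which Type I error is controlled by alpha alone.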

What Is Statistical Power, and Why Does It Matter for Your Study?

Statistical power is the probability of correctly detecting a real effect when one exists. It equals 1 minus beta. A study with low power risks missing real effects even when they are present, producing false negatives.

 

Power is determined by four interrelated factors:

 

Factor | Effect on Power
Sample size (n) | Larger samples increase power; smaller samples reduce it
Effect size | Larger effects are easier to detect; power increases with effect size
Alpha level | A higher alpha (e.g., 0.10 vs. 0.05) increases power but also increases Type I error risk
Data variability | Lower variability in the data increases power

 

As a general standard in behavioral and biomedical research, a minimum power of 0.80 (80%) is recommended. This means the study has at least an 80% chance of detecting a real effect of the expected size at the chosen alpha level.

 

Power analysis should always be performed before data collection to determine the required sample size. Post-hoc power calculations (performed after the study) are generally considered uninformative.
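
A rough a priori sample-size calculation can be sketched as follows. It uses the common normal-approximation formula for comparing two independent means, n per group = 2 × ((z for 1 – alpha/2 + z for power) / d)², where d is the expected standardized effect size. This is an approximation offered for illustration; exact t-based calculations in dedicated power software typically ask for one or two more participants per group.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-tailed, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.2, 0.5, 0.8):  # Cohen's small, medium, and large benchmarks
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The required sample grows with the inverse square of the effect size, so halving the expected effect roughly quadruples the number of participants needed.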

 

What Influences the P-Value?

Understanding what drives p-values helps researchers design better studies and interpret results more accurately.

 

Factor | Effect on P-Value
Larger sample size | Tends to produce smaller p-values, even for trivial effects
Larger effect size | Produces smaller p-values; stronger evidence against H0
Greater data variability | Produces larger p-values; harder to detect significance
Lower alpha threshold | Raises the bar for significance; does not change the p-value itself
Choice of statistical test | Different tests may yield different p-values for identical data
Violated test assumptions | Distorts p-values; tests may become unreliable
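
The first row of the table is easy to demonstrate numerically. The sketch below holds a trivially small standardized difference fixed (d = 0.05, a hypothetical value) and shows how the two-tailed p-value changes as the per-group sample size grows. It assumes two equal groups and a normal approximation, so that z = d × sqrt(n / 2).

```python
import math

def two_tailed_p_for(d, n_per_group):
    """Two-tailed p-value when the observed standardized difference equals d."""
    z = d * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate in the far tail.
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.05  # hypothetical, practically negligible effect
p_by_n = {n: two_tailed_p_for(d, n) for n in (100, 1_000, 10_000, 100_000)}
for n, p in p_by_n.items():
    print(f"n per group = {n:>7}: p = {p:.3g}")
```

The effect never changes, yet the result moves from clearly non-significant to overwhelmingly significant purely because the sample grew.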

 

Confidence Intervals: The Companion to P-Values

A confidence interval (CI) is a range of values constructed so that, across repeated samples, a stated proportion of such intervals (the confidence level, typically 95%) would contain the true population parameter. Confidence intervals and p-values convey related and complementary information: a 95% CI that does not include zero (for a difference) or one (for a ratio) corresponds to a two-tailed p < 0.05.

 

Statistical referees at major scientific journals increasingly expect confidence intervals to be reported with at least as much prominence as p-values. Confidence intervals are preferred because they communicate both statistical significance and practical magnitude simultaneously.

 

What the Metric Tells You | P-Value | Confidence Interval
Is the result statistically significant? | Yes (compare to alpha) | Yes (does the CI exclude the null value?)
How large is the effect? | No | Partially (the width and position of the CI indicate magnitude)
How precisely is the effect estimated? | No | Yes (narrower CI = more precision)
What values are plausible for the true parameter? | No | Yes

 

Best practice: always report p-values alongside the corresponding effect estimate and its confidence interval. Do not report p-values in isolation.
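
As an illustration, the sketch below computes an approximate 95% confidence interval for the difference in mean height from the worked t-test example (males: mean 175 cm, SD 5, n = 30; females: mean 168 cm, SD 6, n = 35). It assumes a normal critical value (about 1.96) for simplicity; a t critical value with roughly 63 degrees of freedom is slightly larger and would widen the interval a little.

```python
import math
from statistics import NormalDist

def diff_ci(m1, s1, n1, m2, s2, n2, level=0.95):
    """Approximate confidence interval for the difference between two means."""
    diff = m1 - m2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)    # unpooled standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal critical value
    return diff - z * se, diff + z * se

low, high = diff_ci(175, 5, 30, 168, 6, 35)
print(f"difference = 7 cm, approximate 95% CI [{low:.2f}, {high:.2f}] cm")
```

The interval excludes zero, which agrees with the significant p-value, and it also shows the plausible size of the difference, something the p-value alone cannot do.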

 

Effect Size: Why Statistical Significance Is Not Enough

Effect size measures the practical magnitude of a finding, independent of sample size. A result can be statistically significant yet trivially small in practice, especially in large samples.

 

Effect Size Measure | Used With | Interpretation Benchmarks (Cohen)
Cohen’s d | Difference between two means (t-test) | Small: 0.2; Medium: 0.5; Large: 0.8
Eta-squared (eta^2) | ANOVA | Small: 0.01; Medium: 0.06; Large: 0.14
Pearson’s r | Correlation | Small: 0.1; Medium: 0.3; Large: 0.5
Odds ratio (OR) | Binary outcomes; logistic regression | OR = 1 means no effect; OR > 1 or < 1 indicates direction and magnitude
Hedges’ g | Difference between two means, especially with small samples | Similar benchmarks to Cohen’s d; adjusts for small-sample bias

 

Example: A study with 10,000 participants (two groups of 5,000) finds that a new teaching method improves test scores by 0.5 points on a 100-point scale, with p = 0.001. The result is statistically significant, but the effect size (Cohen’s d of roughly 0.07, the value implied by that p-value and sample size) suggests the improvement is negligible in practice. Reporting the effect size alongside the p-value prevents this misinterpretation.

 

Always report at least one effect size metric alongside the p-value. Refer to the journal’s author guidelines for the preferred measure.
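
For two-group comparisons, Cohen’s d and Hedges’ g can be computed directly from summary statistics. The sketch below applies the standard pooled-standard-deviation formula to the height data from the worked t-test example; the function names are hypothetical, and the small-sample correction factor used for g is the usual approximation 1 – 3 / (4(n1 + n2) – 9).

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def hedges_g(d, n1, n2):
    """Cohen's d multiplied by the approximate small-sample bias correction."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

d = cohens_d(175, 5, 30, 168, 6, 35)
g = hedges_g(d, 30, 35)
print(f"Cohen's d = {d:.2f}, Hedges' g = {g:.2f}")
```

Both values sit well above Cohen's benchmark of 0.8 for a large effect, so in that example the tiny p-value and the effect size point in the same direction.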

 

Multiple Testing and the Inflation of False Positives

Each hypothesis test carried out at alpha = 0.05 has a 5% chance of producing a false positive when H0 is true. When multiple tests are performed, those individual error probabilities compound, sharply inflating the overall chance of at least one false positive.

 

Illustration: if 20 independent tests are each run at alpha = 0.05 and H0 is true for all of them, the probability of obtaining at least one false positive is approximately 64%, not 5%.

 

Number of Tests | Alpha per Test | Probability of at Least One False Positive
1 | 0.05 | 5%
5 | 0.05 | 23%
10 | 0.05 | 40%
20 | 0.05 | 64%
50 | 0.05 | 92%
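
The figures in this table follow from a single formula: for m independent tests with H0 true in every one, the probability of at least one false positive is 1 – (1 – alpha)^m. A minimal sketch that reproduces the table:

```python
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20, 50):
    print(f"{m:>2} tests: {familywise_error_rate(m):.0%}")
```

Independence is an assumption here; with correlated tests the inflation is usually smaller, but it does not disappear.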

 

Corrections for Multiple Testing

Several methods exist to control the false-positive rate when conducting multiple tests.

 

Method | How It Works | Best Used When
Bonferroni correction | Divide alpha by the number of tests; use the result as the new per-test threshold | Small number of pre-planned comparisons
Benjamini-Hochberg procedure | Controls the false discovery rate (FDR) rather than the per-test error rate | Large numbers of tests (e.g., genomics, imaging)
Holm-Bonferroni method | A stepwise version of Bonferroni; less conservative while still controlling familywise error | Multiple comparisons with varying importance
Dunn’s test | Post-hoc pairwise comparisons after a significant Kruskal-Wallis test (the rank-based counterpart of one-way ANOVA); adjusts alpha for each comparison | Pairwise follow-up tests when ANOVA assumptions are not met

 

Whenever multiple outcomes, subgroups, or time points are analyzed in the same study, the multiple comparisons strategy must be specified before data collection (for example, in the protocol or pre-registration) and reported in the methods section.
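
The Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg rules are simple enough to sketch in plain Python. The implementation below is an illustration, not a substitute for a tested library routine, and the eight p-values are invented so that the three methods give visibly different answers.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: compare the k-th smallest p with alpha / (m - k + 1); stop at first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank starts at 0
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up: find the largest k with p_(k) <= (k / m) * alpha; reject the k smallest."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Hypothetical p-values from eight tests in one study.
pvals = [0.041, 0.001, 0.205, 0.015, 0.06, 0.007, 0.074, 0.042]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(f"{name}: {sum(method(pvals))} of {len(pvals)} hypotheses rejected")
```

Without any correction, five of these eight p-values fall below 0.05; the corrected procedures reject fewer, with Bonferroni the strictest and Benjamini-Hochberg the most permissive.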

What Is P-Hacking, and Why Is It a Problem?

P-hacking is the manipulation of analyses, data subsets, or variables to obtain a statistically significant p-value. It is unethical and produces results that do not replicate. P-hacking inflates the literature with false positives and is a primary driver of the reproducibility crisis in science.

 

Common forms of p-hacking:

 

  • Collecting data until the p-value crosses 0.05, then stopping.
  • Running multiple tests and reporting only the significant one.
  • Removing outliers selectively to achieve significance.
  • Switching between one-tailed and two-tailed tests after seeing the data.
  • Adding or dropping covariates in regression models until a significant result emerges.
  • Splitting the sample into subgroups and testing each until one shows significance.
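
The damage done by the first item on this list, collecting data until the p-value crosses 0.05, can be shown by simulation. The sketch below rests on simplifying assumptions (normal data with known standard deviation 1, a two-tailed z-test, a fixed seed, and hypothetical batch sizes): H0 is true in every simulated study, yet the researcher tests after every batch of 10 observations and stops at the first "significant" result.

```python
import math
import random

def stopping_false_positive_rate(max_n=100, batch=10, sims=4000, seed=7):
    """False-positive rate when a z-test is run after every batch of observations."""
    rng = random.Random(seed)
    z_crit = 1.959964                 # two-tailed critical value for alpha = 0.05
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        while n < max_n:
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))  # H0 is true
            n += batch
            if abs(total / math.sqrt(n)) > z_crit:  # z statistic with sigma = 1
                hits += 1
                break                                # stop at first "significance"
    return hits / sims

peeking_rate = stopping_false_positive_rate(batch=10)   # ten looks at the data
single_rate = stopping_false_positive_rate(batch=100)   # one look, at n = 100
print(f"one planned analysis:   {single_rate:.3f}")
print(f"peeking after every 10: {peeking_rate:.3f}")
```

A single planned analysis keeps the false-positive rate near 5%, while repeated peeking pushes it several times higher even though nothing real is being detected.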

 

Safeguards against p-hacking:

 

  • Pre-register the study: specify H0, H1, primary outcomes, sample size, and statistical tests before data collection.
  • Report all analyses conducted, not just significant ones.
  • Apply multiple-testing corrections when conducting more than one test.
  • Share data and analysis scripts openly to allow independent verification.

 

Publication bias, the tendency for journals to favor statistically significant results, compounds the problem by selectively publishing p-hacked findings. Funnel plot asymmetry in meta-analyses and the prevalence of p-values just below 0.05 in the literature are two markers of this phenomenon.

 

Bayesian Alternatives to the P-Value

The frequentist p-value answers the question: given that H0 is true, how surprising are these data? Bayesian inference inverts this logic and asks: given these data, how probable is each hypothesis? These are fundamentally different questions, and the answer to one does not answer the other.

 

Feature | Frequentist P-Value | Bayesian (Bayes Factor)
What is quantified? | Probability of data at least as extreme as observed, given H0 is true | How much more likely the observed data are under H1 than under H0
Does it require a prior belief? | No | Yes (prior distribution must be specified)
Can it confirm H0? | No (only fail to reject) | Yes (BF < 1 supports H0)
Interpretation of a significant result | Evidence against H0 | Degree of support for H1 over H0
Widely required by journals? | Yes (most fields) | Increasingly accepted; required in some Bayesian journals

 

Bayes Factor (BF) values and what they indicate:

 

Bayes Factor (BF) | Interpretation
BF > 100 | Decisive evidence for H1
BF 30-100 | Very strong evidence for H1
BF 10-30 | Strong evidence for H1
BF 3-10 | Moderate evidence for H1
BF 1-3 | Anecdotal evidence for H1
BF = 1 | No evidence either way
BF < 1 | Evidence in favor of H0 (stronger as value decreases toward zero)

 

Bayes factors are particularly useful when a researcher wants to quantify evidence for the null hypothesis, or when sequential data collection is planned and stopping rules need to be justified. Software packages such as JASP (free, open-source) calculate Bayes factors using built-in default priors, so no manual prior specification is required; the prior that was used should still be reported.
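
A Bayes factor can be computed by hand in one simple textbook case, which helps show what the number means. The sketch below assumes binomial data (k successes in n trials), a point null H0: theta = 0.5, and, as an illustrative choice rather than a recommendation, a uniform prior on theta under H1. Under that prior the marginal likelihood of the data is exactly 1 / (n + 1). The counts used are hypothetical.

```python
import math

def bf10_binomial(k, n):
    """Bayes factor for H1: theta ~ Uniform(0, 1) against H0: theta = 0.5."""
    likelihood_h0 = math.comb(n, k) * 0.5 ** n  # P(data | theta = 0.5)
    marginal_h1 = 1 / (n + 1)                   # P(data | H1), averaged over the prior
    return marginal_h1 / likelihood_h0

for k in (60, 65):  # hypothetical numbers of successes out of 100 trials
    print(f"{k}/100 successes: BF10 = {bf10_binomial(k, 100):.2f}")
```

With 60 successes the Bayes factor is close to 1, so the data barely discriminate between the hypotheses even though a two-tailed p-value for the same data would sit near 0.05; with 65 successes the Bayes factor exceeds 10. A different prior under H1 would give different values, which is why the prior must always be reported.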

 

The P-Value Reform Debate: Should 0.05 Still Be the Universal Threshold?

The 0.05 threshold has been the dominant cutoff in science for decades, but leading statistical organizations have challenged its use as a universal, binary decision rule. The American Statistical Association issued a formal statement warning against such use.

 

Key positions in the debate:

 

  • The 0.05 threshold was historically arbitrary, not derived from fundamental statistical theory.
  • Using a single threshold encourages binary thinking (significant vs. not significant) and discourages nuanced interpretation.
  • Some researchers propose moving to a threshold of 0.005 for novel claims, reserving 0.05 for exploratory work.
  • Others argue the threshold should vary by field: stricter in particle physics (5-sigma, or p < 0.0000003), more flexible in exploratory psychology.
  • The most radical position advocates retiring the p-value entirely as a decision criterion in favor of reporting effect sizes, confidence intervals, and, where possible, Bayes factors.

 

The growing consensus is not that p-values are wrong, but that they have been over-relied upon and frequently misunderstood. Best practice combines p-values with effect sizes, confidence intervals, and clearly stated pre-registered hypotheses.

 

Where Are P-Values Used? Applications Across Research Domains

 

Domain | Typical Application
Clinical medicine and pharmacology | Testing whether a drug or intervention produces a significantly different outcome compared to a placebo or standard treatment.
Psychology and behavioral science | Determining whether a manipulation (e.g., a priming task) significantly affects a measured behavior or cognitive response.
Public health and epidemiology | Assessing whether an exposure (e.g., smoking) is significantly associated with a disease outcome in a cohort or case-control study.
Biology and genetics | Identifying whether gene expression differences between conditions exceed what would be expected by chance; GWAS studies test millions of variants simultaneously.
Social sciences | Evaluating whether observed group differences in attitudes, behaviors, or outcomes are statistically significant.
Business and economics | Testing whether an intervention (e.g., a marketing campaign) significantly changes a key performance metric.
Manufacturing and quality control | Using hypothesis tests (e.g., t-tests on process measurements) to determine whether a production process is operating within specification.
Machine learning and data science | Feature selection pipelines use p-values to identify variables that significantly predict an outcome; also used in model comparison tests.

 

How to Report P-Values in a Research Paper

P-values should always appear in the results section, and the significance threshold (alpha) should always appear in the methods section. The following formatting conventions apply in most major journals using APA style.

 

Formatting Rules for APA Style

 

  • Italicize the symbol p; do not set it in upright (roman) type.
  • Do not use a leading zero: write p = .042, not p = 0.042.
  • Include spaces on both sides of the equals sign: p = .042, not p=.042.
  • For very small values, write p < .001 rather than listing many decimal places.
  • Never write p = .000; write p < .001 instead.
  • Do not describe non-significant results as ‘insignificant’; use ‘not statistically significant’ or ‘ns’.
  • Report the exact p-value wherever possible, not just whether it crossed a threshold.
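
Several of these rules can be applied automatically when results are exported from analysis scripts. The helper below is a hypothetical convenience function, not part of any style guide or library: it drops the leading zero, keeps three decimals, puts spaces around the operator, and reports p < .001 for very small values. Italics cannot be expressed in a plain string and still have to be applied in the manuscript.

```python
def format_p(p):
    """Format a p-value following the APA-style rules listed above."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < .001"          # never p = .000
    text = f"{p:.3f}"
    if text.startswith("0"):
        text = text[1:]            # drop the leading zero
    return f"p = {text}"

for value in (0.042, 0.000003, 0.2):
    print(format_p(value))
```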

 

What Else to Report Alongside the P-Value

 

Statistic | Where to Report It | Why It Matters
Test statistic (t, F, chi-squared, z) | Results section | Allows readers to verify calculations and assess the direction of an effect
Degrees of freedom | Results section (in parentheses after test statistic) | Required to locate the result in the reference distribution
Effect size (Cohen’s d, eta-squared, r, OR) | Results section | Indicates practical significance independent of sample size
95% confidence interval | Results section | Shows the range of plausible values for the true population parameter
Sample size (n) | Methods section; sometimes results | Critical context for interpreting the sensitivity of the test
Alpha threshold used | Methods section | Required so readers know the decision rule applied

 

Example Reporting Language

The new medication produced a significantly longer duration of pain relief than the standard treatment, t(58) = 3.24, p = .002 (two-tailed), d = 0.83, 95% CI [0.41, 1.25].

 

Can P-Values Be Used to Compare Two Different Studies?

Comparing p-values across different studies or experiments is generally not valid. A p-value reflects how compatible one particular study's data are with its null hypothesis, and it depends on that study's sample size, design, and variability; it is not a standardized measure that can be ranked or compared across contexts.

 

Why direct p-value comparison fails:

 

  • P-values are sensitive to sample size: a large study may produce p = .01 for a trivial effect, while a small but well-designed study may produce p = .04 for a substantial effect.
  • P-values depend on variability in the data, which differs across studies and populations.
  • The choice of test, the study design, and the measurement instruments all influence the p-value.

 

If direct comparison across studies is needed, use a formal meta-analysis that pools effect sizes rather than p-values.

 

Common Pitfalls and Cautions When Using P-Values

 

Pitfall | Correct Approach
Setting alpha after seeing the data (data dredging) | Always set the significance threshold before collecting data.
Treating p < 0.05 as proof of a meaningful effect | Interpret alongside effect size and confidence interval.
Treating p > 0.05 as proof the null is true | A non-significant result only means the data do not provide sufficient evidence to reject H0.
Reporting only significant results | Report all pre-specified analyses; use multiple-testing corrections.
Using p-values to compare results across studies | Use meta-analysis with standardized effect sizes.
Ignoring test assumptions (normality, independence, homoscedasticity) | Check assumptions before running any test; use non-parametric alternatives if violated.
Conflating statistical significance with clinical or practical significance | Always pair p-values with effect size and domain knowledge.
Using one-tailed tests without prior justification | Default to two-tailed tests; justify one-tailed tests in the pre-registration or protocol.

 

Frequently Asked Questions

 

What exactly does a p-value of 0.05 mean?

A p-value of 0.05 means that, if the null hypothesis were true, there would be a 5% probability of observing results at least as extreme as those actually obtained. It does not mean there is a 95% probability that the alternative hypothesis is true, and it does not prove that the null hypothesis is false.

 

Why is 0.05 the standard threshold?

The 0.05 threshold originated with the statistician Ronald Fisher in the 1920s as a convenient rough benchmark. It was never intended to be a universal decision rule. Many researchers and statistical organizations now advocate for more flexible, context-appropriate thresholds or for abandoning strict thresholds altogether in favor of reporting exact p-values alongside effect sizes.

 

Does a larger sample size always produce a smaller p-value?

Not always, but in general, larger samples increase the sensitivity of a test to detect even small true effects. A larger sample reduces the standard error, which increases the test statistic for any given effect, which in turn reduces the p-value. This is why a statistically significant result from a very large study can correspond to a trivially small effect size.

 

What is the difference between a p-value and a confidence interval?

A p-value tells you whether the observed effect is unlikely under the null hypothesis. A confidence interval tells you the range of plausible values for the true population parameter and how precisely it has been estimated. They carry related information: a 95% CI that excludes the null value corresponds to p < 0.05, but the CI additionally conveys the magnitude and direction of the effect.

 

Is a non-significant p-value (p > 0.05) evidence that there is no effect?

No. A non-significant result means only that the study did not find sufficient evidence to reject H0. It does not prove that no effect exists. The study may have lacked sufficient power to detect a real but small effect. The correct phrasing is ‘the result was not statistically significant’ rather than ‘there was no effect’.

 

What should I do if I am testing many hypotheses at once?

Apply a multiple-testing correction. For a small number of pre-planned comparisons, the Bonferroni correction (dividing alpha by the number of tests) is simple and widely accepted. For large-scale testing such as genomics or neuroimaging, the Benjamini-Hochberg procedure for controlling the false discovery rate is more appropriate and less conservative.

 

When should I use a Bayes factor instead of a p-value?

Consider Bayes factors when you want to quantify evidence for H0 (p-values cannot do this), when you are conducting sequential testing and need a stopping rule that does not inflate Type I error, or when your field or journal increasingly expects Bayesian reporting. Bayes factors require specifying a prior distribution, so they involve additional assumptions that must be justified and reported.

 

How do I report a p-value that is essentially zero in my output?

Never report p = .000 or p = 0. When software returns a p-value smaller than its reporting precision, write p < .001. If the software provides a more precise value (such as p = 2.99 x 10^-6), you may report it in scientific notation or simply as p < .001 depending on the journal’s preferences.


This article was originally published on February 9, 2023, and updated on June 8, 2026.
