Glossary of Key Terms
The following terms are used throughout this article. Refer back to this section whenever an unfamiliar concept appears.
| Term | Definition |
| P-value | The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true. |
| Null hypothesis (H0) | A statement asserting no effect, no difference, or no relationship between variables. |
| Alternative hypothesis (H1) | A statement asserting that an effect, difference, or relationship does exist. |
| Alpha (significance level) | The threshold probability, set before the experiment, below which the null hypothesis is rejected. |
| Type I error | Falsely rejecting a true null hypothesis; its maximum probability equals alpha. |
| Type II error | Failing to reject a false null hypothesis; its probability is denoted beta. |
| Statistical power | The probability of correctly rejecting a false null hypothesis; calculated as 1 minus beta. |
| Effect size | A quantitative measure of the magnitude of a difference or relationship, independent of sample size. |
| Confidence interval (CI) | A range of values produced by a procedure that, at a specified confidence level (e.g., 95%), captures the true population parameter in that proportion of repeated samples. |
| Test statistic | A numerical value calculated from sample data that is compared against a reference distribution. |
| Degrees of freedom (df) | The number of independent values in a calculation that are free to vary. |
| Bayes factor | A ratio comparing the probability of data under two competing hypotheses, used in Bayesian inference. |
| P-hacking | The practice of manipulating analyses or selectively reporting results to achieve a desired p-value. |
| Multiple testing | Conducting several hypothesis tests simultaneously, which inflates the risk of false positives. |
| False discovery rate (FDR) | The expected proportion of rejected null hypotheses that are actually true. |
Key Takeaways
- A p-value measures the probability of obtaining results at least as extreme as those observed if the null hypothesis were true; it does not measure the probability that the null hypothesis is true.
- P-values must be interpreted alongside effect sizes, confidence intervals, and sample sizes to be meaningful.
- The conventional alpha threshold of 0.05 is arbitrary; always justify the chosen threshold in your methods section.
- Type I error (false positive) is controlled by alpha; Type II error (false negative) is reduced by increasing statistical power, most directly through a larger sample size.
- Running multiple tests inflates the false-positive rate; corrections such as the Bonferroni adjustment or the Benjamini-Hochberg procedure should be applied.
- P-hacking and selective reporting are ethically problematic and undermine scientific reproducibility.
- Confidence intervals convey the same significance decision as a p-value, plus the magnitude and precision of the estimate, and are increasingly preferred by journal editors.
- Bayesian alternatives such as Bayes factors provide a complementary framework that directly addresses the probability of a hypothesis given the data.
Introduction
The p-value is one of the most widely used, most frequently misunderstood, and most actively debated concepts in quantitative research. From clinical trials and psychology experiments to quality control and machine learning, researchers in virtually every discipline encounter p-values as part of the process of deciding whether observed results are likely to reflect real phenomena or mere chance.
This article explains what a p-value is, how to calculate it using common statistical tests, how to interpret it correctly in the context of Type I and Type II errors, statistical power, confidence intervals, and effect sizes, and how to report it in a research paper. It also addresses important limitations and modern debates around p-value use, including p-hacking, multiple testing, and Bayesian alternatives.
What Is a P-Value?
A p-value (probability value) is the probability of obtaining a test result as extreme as, or more extreme than, the one actually observed, assuming the null hypothesis is true. In practical terms, it answers the question: if there were truly no effect or no difference in the population, how likely would it be to observe results like these simply by chance?
A small p-value indicates that the observed data are unlikely to have occurred by chance alone under the null hypothesis, providing evidence against the null. A large p-value indicates that the observed data are consistent with the null hypothesis.
Key clarifications about what a p-value is and is not:
| What a p-value IS | What a p-value is NOT |
| The probability of the data (or more extreme data) given H0 is true | The probability that H0 is true |
| A measure of evidence against H0 | A measure of the size or importance of an effect |
| One input into a broader statistical decision | A definitive proof of anything |
| Dependent on sample size, effect size, and variability | A fixed property of a phenomenon |
| A continuous value from 0 to 1 | A binary pass/fail judgment on its own |
What Is the Null Hypothesis, and Why Does It Matter?
The null hypothesis is assumed true by default, and the p-value only makes sense relative to it. It is the foundation of every hypothesis test. The null hypothesis (H0) states that there is no effect, no difference, or no relationship. The alternative hypothesis (H1) states that an effect, difference, or relationship does exist.
Consider a clinical trial comparing two pain medications. The hypotheses might be stated as follows:
- H0: The mean duration of pain relief for Drug A equals the mean duration for Drug B.
- H1: The mean duration of pain relief for Drug A is greater than the mean duration for Drug B.
The p-value quantifies how strongly the data contradict H0. Unless the p-value falls below the pre-specified alpha threshold, H0 is not rejected. Importantly, failing to reject H0 does not prove it is true; it only means the data are insufficient to rule it out.
A helpful analogy: the null hypothesis works like a presumption of innocence in a legal trial. The defendant is assumed innocent (H0 is true) until sufficient evidence (a sufficiently small p-value) is produced to conclude otherwise.
How Is a P-Value Calculated?
P-values are derived from a test statistic, a number calculated from your sample data that measures how far the observed result deviates from what would be expected under H0. The specific formula depends on the test you choose. The p-value is then the area in the tail or tails of the relevant probability distribution, beyond the observed test statistic.
The general steps are the same for all parametric tests:
- Step 1: State H0 and H1.
- Step 2: Choose the appropriate statistical test for your data and research question.
- Step 3: Calculate the test statistic from your sample data.
- Step 4: Identify the relevant probability distribution (t, z, chi-squared, or F) and degrees of freedom.
- Step 5: Find the area under the distribution curve beyond the test statistic. This area is the p-value.
- Step 6: Compare the p-value to the pre-set alpha level and make a decision about H0.
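The steps above can be sketched for the simplest case, a two-tailed z-test. This is a minimal illustration rather than a definitive implementation: the sample mean, null mean, population standard deviation, and sample size are hypothetical numbers chosen for the example, and the helper name two_sided_p_from_z is likewise an assumption.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-tailed p-value for a standard-normal test statistic (Step 5)."""
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs, not taken from the article.
sample_mean = 122.0   # observed mean
null_mean = 120.0     # value stated by H0 (Step 1)
sigma = 15.0          # population standard deviation, assumed known (z-test, Step 2)
n = 200               # sample size
alpha = 0.05          # significance level, fixed in advance

# Step 3: test statistic. Step 4: reference distribution is the standard normal.
z = (sample_mean - null_mean) / (sigma / math.sqrt(n))

# Step 5: tail area beyond the observed statistic, in both tails.
p_value = two_sided_p_from_z(z)

# Step 6: decision about H0.
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

For t, chi-squared, or F statistics the logic is the same; only the reference distribution (and its degrees of freedom) changes.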
Which Statistical Test Should You Use?
Choosing the correct test is essential. The wrong test produces an unreliable p-value. Use the table below as a guide.
| Test | Use When | Example Research Question |
| Z-test | Large sample (n > 30) and population variance is known | Does the average systolic blood pressure in a sample of 200 patients differ from the national average? |
| One-sample t-test | Small sample and population variance is unknown; one group vs. a fixed value | Does the average response time of our software differ from the benchmark of 200 ms? |
| Independent samples t-test | Comparing means of two unrelated groups | Is there a difference in exam scores between students taught by Method A vs. Method B? |
| Paired t-test | Comparing means within the same group measured twice | Do patients show lower cholesterol after 12 weeks of treatment than before? |
| Chi-squared test | Categorical variables; testing independence or goodness of fit | Is smoking status independent of disease outcome? |
| F-test / ANOVA | Comparing means across three or more groups | Do three different fertilizers produce significantly different crop yields? |
| Correlation test (Pearson or Spearman) | Testing whether a linear or monotonic relationship exists between two variables | Is there a significant correlation between hours studied and exam score? |
Worked Example: Two-Sample T-Test
Suppose a researcher compares the mean height of male and female students at a university.
| Parameter | Group 1 (Males) | Group 2 (Females) |
| Sample size (n) | 30 | 35 |
| Sample mean | 175 cm | 168 cm |
| Standard deviation | 5 cm | 6 cm |
H0: There is no difference in mean height between males and females.
H1: There is a difference in mean height between males and females.
The two-sample t-statistic formula is:
t = (x1 – x2) / sqrt( (s1^2 / n1) + (s2^2 / n2) )
Substituting the values:
t = (175 – 168) / sqrt( (25/30) + (36/35) ) = 7 / sqrt(0.833 + 1.029) = 7 / 1.364 = 5.13
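Degrees of freedom: df = (30 + 35) – 2 = 63. (The formula above does not pool the two variances, so strictly the Welch-Satterthwaite approximation applies; for these data it gives a df of about 62.96, which rounds to the same value.)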
Using a t-distribution table or statistical software with df = 63 and t = 5.13, the two-tailed p-value is approximately 0.000003. Because this is far below the conventional alpha of 0.05, H0 is rejected. The evidence strongly suggests a real difference in mean height between the two groups.
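The worked example can be checked from the summary statistics alone. The sketch below uses scipy's ttest_ind_from_stats and assumes the reported standard deviations are sample standard deviations; equal_var=False is passed because the formula above does not pool the variances (Welch's test). With the default equal_var=True, scipy would pool the variances and return a slightly different t.

```python
from scipy import stats

# Summary statistics from the worked example above.
result = stats.ttest_ind_from_stats(
    mean1=175.0, std1=5.0, nobs1=30,   # Group 1 (males)
    mean2=168.0, std2=6.0, nobs2=35,   # Group 2 (females)
    equal_var=False,                   # Welch's t-test: variances not pooled
)

t_stat = result.statistic
p_value = result.pvalue                # two-tailed by default

print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.2e}")
```

The printed t should agree with the hand calculation (about 5.13), and the p-value should be on the order of 10^-6.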
How to Calculate a P-Value Using Software
In practice, researchers use statistical software rather than tables. Examples are shown below.
| Software / Language | Example Code or Method |
| Python (scipy) | from scipy import stats; t_stat, p = stats.ttest_ind(group1, group2) |
| R | t.test(group1, group2) # outputs t-statistic and p-value directly |
| SPSS | Analyze > Compare Means > Independent-Samples T Test |
| SAS | PROC TTEST; CLASS group; VAR score; RUN; |
| Excel | Use the T.TEST() function with appropriate tail and type arguments |
One-Tailed vs. Two-Tailed Tests: Which Should You Use?
Use a two-tailed test unless you have a strong, pre-specified reason to expect a difference in only one direction. A two-tailed test is almost always the safer and more defensible choice.
| Test Type | When to Use | Implication for P-Value |
| Two-tailed | You predict a difference but not a specific direction (most common) | P-value reflects probability in both tails; threshold is stricter |
| Upper-tailed (right) | You predict the new value will be greater than the reference | P-value is area in the right tail only |
| Lower-tailed (left) | You predict the new value will be less than the reference | P-value is area in the left tail only |
A one-sided test is only appropriate when a large change in the unexpected direction would have absolutely no relevance to the study. If there is any doubt, use a two-tailed test.
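To make the difference between the tails concrete, the sketch below computes all three p-values for the same test statistic. The values t = 1.80 and df = 30 are hypothetical, chosen only because they show a result that passes a one-tailed test at alpha = 0.05 but not a two-tailed one.

```python
from scipy import stats

# Hypothetical test statistic and degrees of freedom (not from the article).
t_stat = 1.80
df = 30

# Upper-tailed: area to the right of the observed statistic.
p_upper = stats.t.sf(t_stat, df)

# Lower-tailed: area to the left of the observed statistic.
p_lower = stats.t.cdf(t_stat, df)

# Two-tailed: area beyond |t| in both tails of the symmetric t-distribution.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

print(f"upper-tailed p = {p_upper:.3f}")
print(f"lower-tailed p = {p_lower:.3f}")
print(f"two-tailed p   = {p_two_sided:.3f}")
```

Because the one-tailed p-value in the predicted direction is half the two-tailed value, choosing the tail after seeing the data can turn a non-significant result into a significant one, which is why the direction must be fixed in advance.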
P-Values and Statistical Significance
A result is described as statistically significant when the p-value falls below the pre-set alpha level, indicating that the observed data would be unlikely if H0 were true. Statistical significance does not imply practical or scientific importance.
The standard p-value interpretation table is shown below.
| P-Value Range | Interpretation | Decision on H0 |
| p > 0.10 | Not significant; results are consistent with H0 | Do not reject H0 |
| 0.05 < p <= 0.10 | Marginally significant; weak evidence against H0 | Do not reject H0; interpret with caution |
| 0.01 < p <= 0.05 | Statistically significant; reasonable evidence against H0 | Reject H0 at alpha = 0.05 |
| 0.001 < p <= 0.01 | Highly statistically significant; strong evidence against H0 | Reject H0 at alpha = 0.01 |
| p <= 0.001 | Very highly significant; very strong evidence against H0 | Reject H0 at alpha = 0.001 |
The Asterisk Rating System for Journal Reporting
Many journals use an asterisk system alongside exact p-values to signal significance levels at a glance. Always report the exact p-value in addition to any asterisk notation, because asterisks alone do not allow the reader to assess the strength of evidence.
| Asterisk Notation | Meaning |
| * | p < 0.05 (statistically significant) |
| ** | p < 0.01 (highly statistically significant) |
| *** | p < 0.001 (very highly statistically significant) |
| ns | Not statistically significant (p >= 0.05) |
Type I and Type II Errors: What Can Go Wrong?
Every hypothesis test carries a risk of two types of incorrect conclusions. Understanding these errors is critical to designing studies with appropriate rigor.
| | Reject H0 | Fail to Reject H0 |
| H0 is TRUE | Type I Error (alpha) | Correct Decision (1 – alpha) |
| H0 is FALSE | Correct Decision (Power: 1 – beta) | Type II Error (beta) |
Type I error (false positive):
- Occurs when H0 is true but is incorrectly rejected.
- Its maximum probability is alpha (the significance level), which you set before the study.
- It is not affected by sample size; it is entirely controlled by the chosen alpha.
- Example: concluding a drug works when it actually has no effect.
Type II error (false negative):
- Occurs when H0 is false but is not rejected.
- Its probability is beta, which depends on sample size, alpha, and the true effect size.
- As sample size increases, beta decreases, making it less likely you will miss a real effect.
- Example: concluding a drug has no effect when it actually works.
The relationship between alpha and beta is a trade-off. Lowering alpha (making the test stricter about false positives) tends to increase beta (making it harder to detect real effects) unless the sample size is also increased.
What Is Statistical Power, and Why Does It Matter for Your Study?
Statistical power is the probability of correctly detecting a real effect when one exists. It equals 1 minus beta. A study with low power risks missing real effects even when they are present, producing false negatives.
Power is determined by four interrelated factors:
| Factor | Effect on Power |
| Sample size (n) | Larger samples increase power; smaller samples reduce it |
| Effect size | Larger effects are easier to detect; power increases with effect size |
| Alpha level | A higher alpha (e.g., 0.10 vs. 0.05) increases power but also increases Type I error risk |
| Data variability | Lower variability in the data increases power |
As a general standard in behavioral and biomedical research, a minimum power of 0.80 (80%) is recommended. This means the study has at least an 80% chance of detecting a real effect of the expected size at the chosen alpha level.
Power analysis should always be performed before data collection to determine the required sample size. Post-hoc power calculations (performed after the study) are generally considered uninformative.
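An a priori power calculation for a two-sided, two-sample t-test with equal group sizes might be sketched as follows. It uses the noncentral t-distribution from scipy; the function name two_sample_power and the example inputs (Cohen's d = 0.5, alpha = 0.05) are assumptions for illustration, and dedicated tools such as G*Power or statsmodels offer the same calculation.

```python
import math

from scipy import stats


def two_sample_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided, equal-variance two-sample t-test.

    d is the assumed true standardized effect size (Cohen's d).
    """
    df = 2 * n_per_group - 2
    # Noncentrality parameter of the t statistic under the alternative.
    ncp = d * math.sqrt(n_per_group / 2)
    # Critical value of the central t-distribution under H0.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |t| exceeds the critical value when the effect is real.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


# Power for a medium effect at several candidate sample sizes.
for n in (20, 40, 64, 100):
    print(f"n per group = {n:>3}: power = {two_sample_power(0.5, n):.3f}")
```

Under these assumptions, roughly 64 participants per group are needed to reach the conventional 0.80 power for a medium effect; smaller effects require substantially more.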
What Influences the P-Value?
Understanding what drives p-values helps researchers design better studies and interpret results more accurately.
| Factor | Effect on P-Value |
| Larger sample size | Tends to produce smaller p-values, even for trivial effects |
| Larger effect size | Produces smaller p-values; stronger evidence against H0 |
| Greater data variability | Produces larger p-values; harder to detect significance |
| Lower alpha threshold | Raises the bar for significance; does not change the p-value itself |
| Choice of statistical test | Different tests may yield different p-values for identical data |
| Violated test assumptions | Distorts p-values; tests may become unreliable |
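The first row of the table can be illustrated numerically. The sketch below holds a small standardized effect fixed (d = 0.1, a hypothetical value) and varies only the sample size, using a normal approximation to a two-group comparison with equal group sizes. The function name is an assumption.

```python
import math


def approx_two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-tailed p-value for a two-group comparison.

    Uses the normal approximation z = d * sqrt(n / 2), which assumes equal
    group sizes and treats the standard deviation as known.
    """
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Same small effect, increasingly large samples.
for n in (50, 500, 5000):
    print(f"n per group = {n:>4}: p = {approx_two_sided_p(0.1, n):.2g}")
```

The effect never changes, yet the p-value moves from clearly non-significant to far below 0.001 as the sample grows, which is why a p-value alone says nothing about the size of an effect.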
Confidence Intervals: The Companion to P-Values
A confidence interval (CI) is a range of values constructed so that, at a stated confidence level (typically 95%), intervals built in this way would contain the true population parameter in that proportion of repeated samples. Confidence intervals and p-values convey related but complementary information: a 95% CI that does not include zero (for a difference) or one (for a ratio) corresponds to p < 0.05.
Statistical referees at major scientific journals increasingly expect confidence intervals to be reported with at least as much prominence as p-values. Confidence intervals are preferred because they communicate both statistical significance and practical magnitude simultaneously.
| What the Metric Tells You | P-Value | Confidence Interval |
| Is the result statistically significant? | Yes (compare to alpha) | Yes (does the CI exclude the null value?) |
| How large is the effect? | No | Partially (the width and position of the CI indicate magnitude) |
| How precisely is the effect estimated? | No | Yes (narrower CI = more precision) |
| What values are plausible for the true parameter? | No | Yes |
Best practice: always report p-values alongside the corresponding effect estimate and its confidence interval. Do not report p-values in isolation.
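As an illustration, a 95% confidence interval for the difference in mean height from the worked two-sample t-test earlier in this article could be computed as follows. This is a sketch that assumes the reported standard deviations are sample standard deviations and uses the Welch (unpooled) standard error and degrees of freedom.

```python
import math

from scipy import stats

# Summary statistics from the height example.
mean1, sd1, n1 = 175.0, 5.0, 30   # males
mean2, sd2, n2 = 168.0, 6.0, 35   # females

# Per-group variance of the sample mean.
v1 = sd1 ** 2 / n1
v2 = sd2 ** 2 / n2

diff = mean1 - mean2
se = math.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom.
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Two-sided 95% interval: estimate +/- critical t times standard error.
t_crit = stats.t.ppf(0.975, df)
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se

print(f"difference = {diff:.1f} cm, 95% CI [{ci_low:.2f}, {ci_high:.2f}], df = {df:.1f}")
```

The interval excludes zero, which matches the significant p-value, and it also shows the range of plausible differences in centimeters, something the p-value alone cannot convey.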
Effect Size: Why Statistical Significance Is Not Enough
Effect size measures the practical magnitude of a finding, independent of sample size. A result can be statistically significant yet trivially small in practice, especially in large samples.
| Effect Size Measure | Used With | Interpretation Benchmarks (Cohen) |
| Cohen’s d | Difference between two means (t-test) | Small: 0.2; Medium: 0.5; Large: 0.8 |
| Eta-squared (eta^2) | ANOVA | Small: 0.01; Medium: 0.06; Large: 0.14 |
| Pearson’s r | Correlation | Small: 0.1; Medium: 0.3; Large: 0.5 |
| Odds ratio (OR) | Binary outcomes; logistic regression | OR = 1 means no effect; OR > 1 or < 1 indicates direction and magnitude |
| Hedges’ g | Difference between means when sample sizes differ | Similar benchmarks to Cohen’s d; adjusts for small-sample bias |
Example: A study with 10,000 participants finds that a new teaching method improves test scores by 0.5 points on a 100-point scale, with p = 0.001. The result is statistically significant, but the effect size (Cohen’s d below 0.1, well under the 0.2 benchmark for a small effect) suggests the improvement is negligible in practice. Reporting the effect size alongside the p-value prevents this misinterpretation.
Always report at least one effect size metric alongside the p-value. Refer to the journal’s author guidelines for the preferred measure.
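For a difference between two means, Cohen's d and Hedges' g can be computed directly from summary statistics. The sketch below applies them to the height example from the worked two-sample t-test earlier in this article; the small-sample correction factor shown is the common approximation 1 - 3 / (4 * df - 1), and the variable names are assumptions.

```python
import math

# Summary statistics from the height example.
mean1, sd1, n1 = 175.0, 5.0, 30   # males
mean2, sd2, n2 = 168.0, 6.0, 35   # females

# Pooled standard deviation, weighting each variance by its degrees of freedom.
df = n1 + n2 - 2
pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)

# Cohen's d: mean difference in units of the pooled standard deviation.
cohens_d = (mean1 - mean2) / pooled_sd

# Hedges' g: Cohen's d multiplied by an approximate small-sample correction.
correction = 1 - 3 / (4 * df - 1)
hedges_g = cohens_d * correction

print(f"pooled SD = {pooled_sd:.2f} cm, d = {cohens_d:.2f}, g = {hedges_g:.2f}")
```

Both values are well above Cohen's 0.8 benchmark, so in that example the difference is large in practical terms as well as statistically significant.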
Multiple Testing and the Inflation of False Positives
Each hypothesis test carried out at alpha = 0.05 has a 5% chance of producing a false positive when H0 is true. When multiple tests are performed simultaneously, those individual error probabilities compound, sharply inflating the overall chance of at least one false positive.
Illustration: if 20 independent tests are each run at alpha = 0.05 and H0 is true for all of them, the probability of obtaining at least one false positive is approximately 64%, not 5%.
| Number of Tests | Alpha per Test | Probability of at Least One False Positive |
| 1 | 0.05 | 5% |
| 5 | 0.05 | 23% |
| 10 | 0.05 | 40% |
| 20 | 0.05 | 64% |
| 50 | 0.05 | 92% |
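The figures in the table follow from a one-line formula: for m independent tests with H0 true in every case, the probability of at least one false positive is 1 - (1 - alpha)^m. A minimal sketch (the function name is an assumption):

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests,
    assuming H0 is true for every test."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 10, 20, 50):
    print(f"{m:>2} tests: {familywise_error_rate(m):.0%}")
```

Running this should reproduce the percentages in the table above. The independence assumption matters: correlated tests inflate the error rate more slowly.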
Corrections for Multiple Testing
Several methods exist to control the false-positive rate when conducting multiple tests.
| Method | How It Works | Best Used When |
| Bonferroni correction | Divide alpha by the number of tests; use the result as the new per-test threshold | Small number of pre-planned comparisons |
| Benjamini-Hochberg procedure | Controls the false discovery rate (FDR) rather than the per-test error rate | Large numbers of tests (e.g., genomics, imaging) |
| Holm-Bonferroni method | A stepwise version of Bonferroni; less conservative while still controlling familywise error | Multiple comparisons with varying importance |
| Dunn’s test | Post-hoc pairwise comparisons after a significant Kruskal-Wallis test (the rank-based analogue of one-way ANOVA); adjusts for the number of comparisons | Post-hoc pairwise tests when ANOVA assumptions are not met |
Whenever multiple outcomes, subgroups, or time points are analyzed in the same study, the multiple comparisons strategy must be specified before data collection (for example, in the protocol or pre-registration) and reported in the methods section.
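The difference between a familywise correction and a false-discovery-rate procedure is easiest to see side by side. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg step-up procedure in plain Python; the five p-values are hypothetical, and the function names are assumptions. In practice, a tested library routine (for example, multipletests in statsmodels) is preferable.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Finds the largest rank k such that the k-th smallest p-value is at most
    (k / m) * q, then rejects the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from five tests in one study.
p_values = [0.001, 0.012, 0.014, 0.030, 0.200]

print("Bonferroni:        ", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

With these inputs Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg rejects the four smallest, which illustrates why the latter is described as less conservative.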
What Is P-Hacking, and Why Is It a Problem?
P-hacking is the manipulation of analyses, data subsets, or variables to obtain a statistically significant p-value. It is unethical and produces results that do not replicate. P-hacking inflates the literature with false positives and is a primary driver of the reproducibility crisis in science.
Common forms of p-hacking:
- Collecting data until the p-value crosses 0.05, then stopping.
- Running multiple tests and reporting only the significant one.
- Removing outliers selectively to achieve significance.
- Switching between one-tailed and two-tailed tests after seeing the data.
- Adding or dropping covariates in regression models until a significant result emerges.
- Splitting the sample into subgroups and testing each until one shows significance.
Safeguards against p-hacking:
- Pre-register the study: specify H0, H1, primary outcomes, sample size, and statistical tests before data collection.
- Report all analyses conducted, not just significant ones.
- Apply multiple-testing corrections when conducting more than one test.
- Share data and analysis scripts openly to allow independent verification.
Publication bias, the tendency for journals to favor statistically significant results, compounds the problem by selectively publishing p-hacked findings. Funnel plot asymmetry in meta-analyses and the prevalence of p-values just below 0.05 in the literature are two markers of this phenomenon.
Bayesian Alternatives to the P-Value
The frequentist p-value answers the question: given that H0 is true, how surprising are these data? Bayesian inference inverts this logic and asks: given these data, how probable is each hypothesis? These are fundamentally different questions, and the answer to one does not answer the other.
| Feature | Frequentist P-Value | Bayesian (Bayes Factor) |
| What is quantified? | Probability of data at least as extreme as observed, given H0 is true | How much more probable the data are under H1 than under H0 |
| Does it require a prior belief? | No | Yes (prior distribution must be specified) |
| Can it confirm H0? | No (only fail to reject) | Yes (BF < 1 supports H0) |
| Interpretation of a significant result | Evidence against H0 | Degree of support for H1 over H0 |
| Widely required by journals? | Yes (most fields) | Increasingly accepted; required in some Bayesian journals |
Bayes Factor (BF) values and what they indicate:
| Bayes Factor (BF) | Interpretation |
| BF > 100 | Decisive evidence for H1 |
| BF 30-100 | Very strong evidence for H1 |
| BF 10-30 | Strong evidence for H1 |
| BF 3-10 | Moderate evidence for H1 |
| BF 1-3 | Anecdotal evidence for H1 |
| BF = 1 | No evidence either way |
| BF < 1 | Evidence in favor of H0 (stronger as value decreases toward zero) |
Bayes factors are particularly useful when a researcher wants to quantify evidence for the null hypothesis, or when sequential data collection is planned and stopping rules need to be justified. Software packages such as JASP (free, open-source) calculate Bayes factors using default prior distributions, so a prior need not be specified manually, although the default that was used should still be reported and justified.
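A Bayes factor can be computed by hand in simple cases. The sketch below is an illustration only: it tests whether a binomial success probability equals 0.5 (H0) against an alternative H1 that places a uniform prior on the probability. The uniform prior, the data (successes out of 20 trials), and the function name are all assumptions made for the example; a different prior would give a different Bayes factor.

```python
import math


def binomial_bf10(successes: int, trials: int) -> float:
    """Bayes factor for H1 (uniform prior on theta) versus H0 (theta = 0.5)."""
    # Probability of the data under H0, where theta is fixed at 0.5.
    likelihood_h0 = math.comb(trials, successes) * 0.5 ** trials
    # Probability of the data under H1, averaging the binomial likelihood over
    # a uniform prior on theta; this integral equals 1 / (trials + 1).
    marginal_h1 = 1.0 / (trials + 1)
    return marginal_h1 / likelihood_h0


# Hypothetical data sets.
print(f"15 of 20 successes: BF10 = {binomial_bf10(15, 20):.2f}")
print(f"10 of 20 successes: BF10 = {binomial_bf10(10, 20):.2f}")
```

Under these assumptions, 15 successes give a Bayes factor a little above 3 (moderate evidence for H1 on the scale above), while 10 successes give a value below 1, which counts as evidence for H0, something a p-value cannot express.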
The P-Value Reform Debate: Should 0.05 Still Be the Universal Threshold?
The 0.05 threshold has been the dominant cutoff in science for decades, but leading statistical organizations have challenged its use as a universal, binary decision rule. The American Statistical Association issued a formal statement in 2016 warning against such use.
Key positions in the debate:
- The 0.05 threshold was historically arbitrary, not derived from fundamental statistical theory.
- Using a single threshold encourages binary thinking (significant vs. not significant) and discourages nuanced interpretation.
- Some researchers propose moving to a threshold of 0.005 for novel claims, reserving 0.05 for exploratory work.
- Others argue the threshold should vary by field: stricter in particle physics (5-sigma, or p < 0.0000003), more flexible in exploratory psychology.
- The most radical position advocates retiring the p-value entirely as a decision criterion in favor of reporting effect sizes, confidence intervals, and, where possible, Bayes factors.
The growing consensus is not that p-values are wrong, but that they have been over-relied upon and frequently misunderstood. Best practice combines p-values with effect sizes, confidence intervals, and clearly stated pre-registered hypotheses.
Where Are P-Values Used? Applications Across Research Domains
| Domain | Typical Application |
| Clinical medicine and pharmacology | Testing whether a drug or intervention produces a significantly different outcome compared to a placebo or standard treatment. |
| Psychology and behavioral science | Determining whether a manipulation (e.g., a priming task) significantly affects a measured behavior or cognitive response. |
| Public health and epidemiology | Assessing whether an exposure (e.g., smoking) is significantly associated with a disease outcome in a cohort or case-control study. |
| Biology and genetics | Identifying whether gene expression differences between conditions exceed what would be expected by chance; GWAS studies test millions of variants simultaneously. |
| Social sciences | Evaluating whether observed group differences in attitudes, behaviors, or outcomes are statistically significant. |
| Business and economics | Testing whether an intervention (e.g., a marketing campaign) significantly changes a key performance metric. |
| Manufacturing and quality control | Using hypothesis tests (e.g., t-tests on process measurements) to determine whether a production process is operating within specification. |
| Machine learning and data science | Feature selection pipelines use p-values to identify variables that significantly predict an outcome; also used in model comparison tests. |
How to Report P-Values in a Research Paper
P-values should always appear in the results section, and the significance threshold (alpha) should always appear in the methods section. The following formatting conventions apply in most major journals using APA style.
Formatting Rules for APA Style
- Italicize p: set the symbol in italic type, not roman (upright) type.
- Do not use a leading zero: write p = .042, not p = 0.042.
- Include spaces on both sides of the equals sign: p = .042, not p=.042.
- For very small values, write p < .001 rather than listing many decimal places.
- Never write p = .000; write p < .001 instead.
- Do not describe non-significant results as ‘insignificant’; use ‘not statistically significant’ or ‘ns’.
- Report the exact p-value wherever possible, not just whether it crossed a threshold.
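Several of these rules are mechanical and can be applied by a small helper when generating results text. The sketch below is a hypothetical formatter, not an official APA tool: it covers the leading zero, the spacing around the operator, and the p < .001 floor, and it reports three decimals throughout for simplicity. Italicizing the symbol is left to the word processor or typesetting system.

```python
def format_p_apa(p: float) -> str:
    """Format a p-value following the APA-style rules listed above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    # Very small values are reported as an inequality, never as p = .000.
    if p < 0.001:
        return "p < .001"
    text = f"{p:.3f}"
    # Drop the leading zero, because a p-value cannot exceed 1.
    if text.startswith("0"):
        text = text[1:]
    return f"p = {text}"


for value in (0.042, 0.0021, 0.000003, 0.2):
    print(format_p_apa(value))
```

Feeding the helper raw software output avoids the common mistake of copying p = .000 directly from a results table.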
What Else to Report Alongside the P-Value
| Statistic | Where to Report It | Why It Matters |
| Test statistic (t, F, chi-squared, z) | Results section | Allows readers to verify calculations and assess the direction of an effect |
| Degrees of freedom | Results section (in parentheses after test statistic) | Required to locate the result in the reference distribution |
| Effect size (Cohen’s d, eta-squared, r, OR) | Results section | Indicates practical significance independent of sample size |
| 95% confidence interval | Results section | Shows the range of plausible values for the true population parameter |
| Sample size (n) | Methods section; sometimes results | Critical context for interpreting the sensitivity of the test |
| Alpha threshold used | Methods section | Required so readers know the decision rule applied |
Example Reporting Language
The new medication produced a significantly longer duration of pain relief than the standard treatment, t(58) = 3.24, p = .002 (two-tailed), d = 0.83, 95% CI [0.41, 1.25].
Can P-Values Be Used to Compare Two Different Studies?
Comparing p-values across different studies or experiments is generally not valid. A p-value reflects how surprising the data from one particular study would be under that study's null hypothesis; it is not a standardized measure that can be ranked or compared across contexts.
Why direct p-value comparison fails:
- P-values are sensitive to sample size: a large study may produce p = .01 for a trivial effect, while a small but well-designed study may produce p = .04 for a substantial effect.
- P-values depend on variability in the data, which differs across studies and populations.
- The choice of test, the study design, and the measurement instruments all influence the p-value.
If direct comparison across studies is needed, use a formal meta-analysis that pools effect sizes rather than p-values.
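As a rough illustration of pooling effect sizes rather than p-values, the sketch below combines three studies with a fixed-effect, inverse-variance weighted average. The effect sizes and standard errors are hypothetical, the function name is an assumption, and a real meta-analysis would also assess heterogeneity and usually consider a random-effects model.

```python
import math


def fixed_effect_pool(effects, standard_errors):
    """Inverse-variance weighted mean effect and its standard error."""
    # More precise studies (smaller standard error) receive more weight.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical standardized effect sizes and standard errors from three studies.
effects = [0.30, 0.10, 0.25]
standard_errors = [0.15, 0.05, 0.10]

pooled_effect, pooled_se = fixed_effect_pool(effects, standard_errors)
ci_low = pooled_effect - 1.96 * pooled_se
ci_high = pooled_effect + 1.96 * pooled_se

print(f"pooled effect = {pooled_effect:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The pooled estimate is pulled toward the most precise study, which is the behavior that ranking studies by p-value cannot reproduce.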
Common Pitfalls and Cautions When Using P-Values
| Pitfall | Correct Approach |
| Setting alpha after seeing the data (data dredging) | Always set the significance threshold before collecting data. |
| Treating p < 0.05 as proof of a meaningful effect | Interpret alongside effect size and confidence interval. |
| Treating p > 0.05 as proof the null is true | A non-significant result only means the data do not provide sufficient evidence to reject H0. |
| Reporting only significant results | Report all pre-specified analyses; use multiple-testing corrections. |
| Using p-values to compare results across studies | Use meta-analysis with standardized effect sizes. |
| Ignoring test assumptions (normality, independence, homoscedasticity) | Check assumptions before running any test; use non-parametric alternatives if violated. |
| Conflating statistical significance with clinical or practical significance | Always pair p-values with effect size and domain knowledge. |
| Using one-tailed tests without prior justification | Default to two-tailed tests; justify one-tailed tests in the pre-registration or protocol. |
Frequently Asked Questions
What exactly does a p-value of 0.05 mean?
A p-value of 0.05 means that, if the null hypothesis were true, there is a 5% probability of observing results at least as extreme as those actually obtained purely by chance. It does not mean there is a 95% probability that the alternative hypothesis is true, and it does not prove that the null hypothesis is false.
Why is 0.05 the standard threshold?
The 0.05 threshold originated with the statistician Ronald Fisher in the 1920s as a convenient rough benchmark. It was never intended to be a universal decision rule. Many researchers and statistical organizations now advocate for more flexible, context-appropriate thresholds or for abandoning strict thresholds altogether in favor of reporting exact p-values alongside effect sizes.
Does a larger sample size always produce a smaller p-value?
Not always, but in general, larger samples increase the sensitivity of a test to detect even small true effects. A larger sample reduces the standard error, which increases the test statistic for any given effect, which in turn reduces the p-value. This is why a statistically significant result from a very large study can correspond to a trivially small effect size.
What is the difference between a p-value and a confidence interval?
A p-value tells you whether the observed effect is unlikely under the null hypothesis. A confidence interval tells you the range of plausible values for the true population parameter and how precisely it has been estimated. They carry related information: a 95% CI that excludes the null value corresponds to p < 0.05, but the CI additionally conveys the magnitude and direction of the effect.
Is a non-significant p-value (p > 0.05) evidence that there is no effect?
No. A non-significant result means only that the study did not find sufficient evidence to reject H0. It does not prove that no effect exists. The study may have lacked sufficient power to detect a real but small effect. The correct phrasing is ‘the result was not statistically significant’ rather than ‘there was no effect’.
What should I do if I am testing many hypotheses at once?
Apply a multiple-testing correction. For a small number of pre-planned comparisons, the Bonferroni correction (dividing alpha by the number of tests) is simple and widely accepted. For large-scale testing such as genomics or neuroimaging, the Benjamini-Hochberg procedure for controlling the false discovery rate is more appropriate and less conservative.
When should I use a Bayes factor instead of a p-value?
Consider Bayes factors when you want to quantify evidence for H0 (p-values cannot do this), when you are conducting sequential testing and need a stopping rule that does not inflate Type I error, or when your field or journal increasingly expects Bayesian reporting. Bayes factors require specifying a prior distribution, so they involve additional assumptions that must be justified and reported.
How do I report a p-value that is essentially zero in my output?
Never report p = .000 or p = 0. When software returns a p-value smaller than its reporting precision, write p < .001. If the software provides a more precise value (such as p = 2.99 x 10^-6), you may report it in scientific notation or simply as p < .001 depending on the journal’s preferences.
This article was originally published on February 9, 2023, and updated on June 8, 2026.



