What is p-value: How to calculate it and statistical significance


“What is a p-value?” is a question often asked by early career researchers and sometimes even by more experienced ones. The p-value is an important and frequently used concept in quantitative research. It can also be confusing and easily misused. In this article, we explain what a p-value is, how to calculate it, and how it relates to statistical significance.

What is a p-value

The p-value, or probability value, is the probability of obtaining results at least as extreme as those you observed, assuming that the null hypothesis is true. P-values are used in hypothesis testing to find evidence that differences in values or groups exist. They are determined by calculating the test statistic for the test you are using and are based on the assumed or known probability distribution.

For example, you are researching a new pain medicine that is designed to last longer than the current commonly prescribed drug. Please note that this is an extremely simplified example, intended only to demonstrate the concepts. From previous research, you know that the underlying probability distribution for both medicines is the normal distribution, which is shown in the figure below.

[Figure: normal distribution curve with a green shaded area. Caption: P-values are used in hypothesis testing to find evidence that differences in values or groups exist.]

You are planning a clinical trial for your drug. If your results show that the average length of time patients are pain-free is longer for the new drug than that for the standard medicine, how will you know that this is not just a random outcome? If this result falls within the green shaded area of the graph, you may have evidence that your drug has a longer effect. But how can we determine this scientifically? We do this through hypothesis testing.

What is a null hypothesis

Stating your null and alternative hypotheses is the first step in conducting a hypothesis test. The null hypothesis (H0) is what you’re trying to disprove, usually a statement that there is no relationship between two variables or no difference between two groups. The alternative hypothesis (Ha) states that a relationship exists or that there is a difference between two groups. It represents what you’re trying to find evidence to support.

Before we conduct the clinical trial, we create the following hypotheses:

H0: the mean longevity of the new drug is equal to that of the standard drug

Ha: the mean longevity of the new drug is greater than that of the standard drug

Note that the null hypothesis states that there is no difference in the mean values for the two drugs. Because Ha includes “greater than,” this is an upper-tailed test. We are not interested in the area under the lower side of the curve.

Next, we need to determine our criterion for deciding whether or not the null hypothesis can be rejected. This is where the critical p-value comes in. If we assume the null hypothesis is true, how much longer does the new drug have to last before we conclude that the difference is unlikely to be due to chance?

Let’s say your results show that the new drug lasts twice as long as the standard drug. In theory, this could still be a random outcome, due to chance, even if the null hypothesis were true. However, at some point, you must consider that the new drug may simply last longer. The researcher will typically set that point, which is the probability of rejecting the null hypothesis given that it is true, prior to conducting the trial. This is the critical p-value, also called the significance level (α). Typically, this value is set at .05, although, depending on the circumstances, it could be set at another value, such as .10 or .01.

Another way to consider the null hypothesis that might make the concept clearer is to compare it to the adage “innocent until proven guilty.” It is assumed that the null hypothesis is true unless enough strong evidence can be found to disprove it. Statistically significant p-value results can provide some of that evidence, which makes it important to know how to calculate p-values.

How to calculate p-values

The p-value that is determined from your results is based on the test statistic, which depends on the type of hypothesis test you are using. That is because the p-value is a probability, and both its value and its calculation method depend on the underlying probability distribution. The p-value also depends in part on whether you are conducting a lower-tailed test, upper-tailed test, or two-tailed test.

The actual p-value is calculated by integrating the probability density function to find the relevant areas under the curve. This process can be quite complicated. Fortunately, p-values are usually determined by using tables, which use the test statistic and degrees of freedom, or statistical software, such as SPSS, SAS, or R.
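
As a rough illustration of that integration, the sketch below approximates the upper-tail area under a standard normal curve with the trapezoidal rule and compares it with the value from Python's standard library. The test statistic of 1.78 is a made-up value, and the standard normal distribution is assumed here only to keep the example simple.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_tail_area(z: float, upper: float = 10.0, steps: int = 100_000) -> float:
    """Approximate the area under the density from z to `upper` (trapezoidal rule).

    The area beyond `upper` = 10 is negligible for a standard normal curve.
    """
    width = (upper - z) / steps
    total = 0.5 * (normal_pdf(z) + normal_pdf(upper))
    for i in range(1, steps):
        total += normal_pdf(z + i * width)
    return total * width


z_observed = 1.78  # hypothetical test statistic, for illustration only

p_numeric = upper_tail_area(z_observed)        # integrate the curve by hand
p_exact = 1.0 - NormalDist().cdf(z_observed)   # let the library do it

print(f"numerical integration: {p_numeric:.4f}")
print(f"library CDF:           {p_exact:.4f}")
```

Both routes give the same upper-tailed p-value, which is why tables and software can stand in for the calculus.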

For example, with the simplified clinical test we are performing, we assumed the underlying probability distribution is normal; therefore, we decide to conduct a t-test to test the null hypothesis. The resulting t statistic will indicate where along the x-axis our result is located under the t-distribution curve, which closely resembles the normal curve when the sample is large. The p-value will then be, in our case, the area under the curve to the right of the test statistic.

Many factors affect the hypothesis test you use and therefore the test statistic. Always make sure to use the test that best fits your data and the relationship you’re testing. The sample size and number of independent variables you use will also impact the p-value.

P-Value and statistical significance

You have completed your clinical trial and have determined the p-value. What’s next? How can the result be interpreted? What does a statistically significant result mean?

A statistically significant result means that the p-value you obtained is small enough that a result this extreme would be unlikely to occur by chance alone if the null hypothesis were true. P-values range from 0 to 1, and the smaller the p-value, the stronger the evidence against the null hypothesis and the greater the indication that it can be rejected. The critical p-value, or the point at which a result can be considered to be statistically significant, is set prior to the experiment.

In our simplified clinical trial example, we set the critical p-value at .05. If the p-value obtained from the trial was found to be p = .0375, we can say that the results were statistically significant, and we have evidence for rejecting the null hypothesis. However, this does not mean that we can be absolutely certain that the null hypothesis is false. The results of the test only indicate that data like ours would be unlikely if the null hypothesis were true.

P-value table

So, how can we interpret the p-value results of an experiment or trial? A p-value table, prepared prior to the experiment, can sometimes be helpful. This table lists possible p-values and their interpretations.

| P-value range | Interpretation |
|---|---|
| p > 0.05 | Results are not statistically significant; do not reject the null hypothesis |
| p ≤ 0.05 | Results are statistically significant; in general, reject the null hypothesis |
| p < 0.01 | Results are highly statistically significant; reject the null hypothesis |

How to report p-values in research

P-values, like all experimental outcomes, are usually reported in the results section, and sometimes in the abstract, of a research paper. Enough information also needs to be provided so that the readers can place the p-values into context. For our example, the test statistic and effect size should also be included in the results.

To enable readers to clearly understand your results, the significance threshold you used (the critical p-value) should be reported in the methods section of your paper. For our example, we might state that “In this study, the statistical threshold was set at p = .05.” The sample sizes (https://www.editage.com/insights/an-introduction-to-sample-size-effect-size-and-statistical-power-for-biomedical-researchers) and assumptions should also be discussed there, as they will greatly impact the p-value.

How can one use p-values to compare two different results of a hypothesis test?

What if we conduct two experiments using the same null and alternative hypotheses? Or what if we conduct the same clinical trial twice with different drugs? Can we use the resulting p-values to compare them?

In general, it is not a good idea to compare results using only p-values. A p-value only reflects how likely results at least as extreme as those specific results would be if the null hypothesis were true; it is not related to any other results and does not indicate the size of an effect. So, just because you obtained a p-value of .04 with one drug and a value of .025 with a second drug does not necessarily mean that the second drug is better.

Using p-values to compare two different results may be more feasible if the experiments are exactly the same and all other conditions are controlled except for the one being studied. However, so many different factors impact the p-value that it would be difficult to control them all.

Why p-values alone are not enough when comparing two groups

P-values can indicate whether or not the null hypothesis should be rejected; however, p-values alone are not enough to show the relative size differences between groups. Therefore, both the statistical significance and the effect size should be reported when discussing the results of a study.

For example, suppose the sample size in our clinical trials was very large, maybe 1,000, and we found the p-value to be .035. The difference between the two drugs is statistically significant because the p-value was less than .05. However, if we looked at the difference in the actual times the drugs were effective, we might find that the new drug lasted only 2 minutes longer than the standard drug. Large sample sizes generally show even very small differences to be significant. We would need this information to make any recommendations based on the results of the trial.
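
To make this concrete, here is a minimal sketch that holds a 2-minute difference fixed and changes only the sample size. It assumes an upper-tailed z-test with a known standard deviation of 20 minutes and equal group sizes; the standard deviation, the group sizes, and the helper name `upper_tailed_p` are illustrative assumptions, not values from the trial.

```python
import math
from statistics import NormalDist


def upper_tailed_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Upper-tailed p-value for a difference in means (z-test, known SD)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 1.0 - NormalDist().cdf(z)


diff_minutes = 2.0   # the small difference from the example above
assumed_sd = 20.0    # hypothetical spread of pain-free times, in minutes

p_small = upper_tailed_p(diff_minutes, assumed_sd, n_per_group=50)
p_large = upper_tailed_p(diff_minutes, assumed_sd, n_per_group=1000)

print(f"n = 50 per group:   p = {p_small:.3f}")
print(f"n = 1000 per group: p = {p_large:.3f}")
```

The effect is identical in both runs; only the larger sample pushes the p-value below .05, which is why the size of the difference has to be reported separately.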

Statistical significance, or p-values, are dependent on both sample size and effect size. Therefore, they all need to be reported for readers to clearly understand the results.

Things to consider while using p-values

P-values are very useful tools for researchers. However, much care must be taken to avoid treating them as black and white indicators of a study’s results or misusing them. Here are a few other things to consider when using p-values:

  • When using p-values in your research report, it’s a good idea to pay attention to your target journal’s guidelines on formatting. Typically, p-values are written without a leading zero. For example, write p = .01 instead of p = 0.01. Also, p-values, like all other variables, are usually italicized, and spaces are included on both sides of the equal sign.
  • The significance threshold needs to be set prior to the experiment being conducted. Setting the significance level after looking at the data to ensure a positive result is considered unethical.
  • P-values have nothing to say about the alternative hypothesis. If your results indicate that the null hypothesis should be rejected, it does not mean that you accept the alternative hypothesis.
  • P-values never prove anything. All they can do is provide evidence to support rejecting or not rejecting the null hypothesis. Statistics are extremely non-committal.
  • “Nonsignificant” is the opposite of significant. Never report that the results were “insignificant.”

One-Tailed vs Two-Tailed Tests

When conducting a hypothesis test, you must decide whether to use a one-tailed or two-tailed test before you collect your data. This decision affects how your p-value is calculated and interpreted.

A two-tailed test checks for a difference in either direction: your alternative hypothesis states that the two groups are simply different, without specifying which is larger. The p-value is calculated from both tails of the probability distribution. This is the most common choice when you have no prior reason to expect a result in one particular direction.

A one-tailed test (also called a directional test) checks for a difference in only one direction. Your alternative hypothesis specifies that one group is either greater than or less than the other. The p-value is calculated from only one tail of the distribution, which means a one-tailed test has more power to detect an effect in the predicted direction, but it cannot detect an effect in the opposite direction at all.
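
The difference between the tails can be sketched in a few lines. The example assumes a test statistic that follows the standard normal distribution, and the value 1.78 is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, assumed for simplicity


def p_upper_tailed(z: float) -> float:
    """Area to the right of z (alternative: new > standard)."""
    return 1.0 - std_normal.cdf(z)


def p_lower_tailed(z: float) -> float:
    """Area to the left of z (alternative: new < standard)."""
    return std_normal.cdf(z)


def p_two_tailed(z: float) -> float:
    """Area in both tails beyond |z| (alternative: new != standard)."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


z = 1.78  # hypothetical test statistic

p_upper = p_upper_tailed(z)
p_lower = p_lower_tailed(z)
p_both = p_two_tailed(z)

print(f"upper-tailed: {p_upper:.4f}")
print(f"lower-tailed: {p_lower:.4f}")
print(f"two-tailed:   {p_both:.4f}")
```

For a symmetric distribution the two-tailed p-value is twice the one-tailed value in the observed direction, so the same statistic can fall below .05 in a one-tailed test and above it in a two-tailed test.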

Choosing between them

| | Two-tailed | One-tailed |
|---|---|---|
| Alternative hypothesis | A ≠ B | A > B or A < B |
| Direction predicted in advance? | No | Yes |
| More conservative? | Yes | No |
| Risk of missing a reverse effect? | No | Yes |

In the clinical trial example used throughout this article, because we predicted that the new drug would last longer (not just differently), an upper-tailed (one-tailed) test is appropriate. If we had simply asked whether the two drugs differed, a two-tailed test would be the safer choice.

As a rule of thumb, use a two-tailed test unless you have a strong theoretical reason to predict the direction of the effect in advance. Using a one-tailed test specifically to achieve a smaller p-value after looking at the data is considered p-hacking and is not acceptable in research.

 

Confidence Intervals

A p-value tells you whether a result is statistically significant, but it does not tell you how large the effect is or how precisely it has been estimated. Confidence intervals provide that missing information, which is why many journals now recommend or require them alongside p-values.

What is a confidence interval?

A confidence interval (CI) is a range of values that is likely to contain the true population parameter.

What does a 95% confidence interval mean?

A 95% confidence interval means that if you repeated the study 100 times and calculated an interval each time, approximately 95 of those intervals would contain the true value. The width of the interval reflects the precision of your estimate: a narrow interval means a more precise estimate; a wide one reflects more uncertainty, often due to a small sample size or high variability.
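
This repeated-study interpretation can be checked with a small simulation. The sketch below assumes normally distributed data with a known standard deviation, so a simple z-interval applies; the true mean, spread, sample size, and number of repetitions are all arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN = 10.0   # hypothetical population mean
SIGMA = 2.0        # population SD, assumed known
N = 30             # observations per simulated study
STUDIES = 2000     # number of repeated studies

z_crit = NormalDist().inv_cdf(0.975)          # about 1.96 for a 95% interval
half_width = z_crit * SIGMA / math.sqrt(N)

covered = 0
for _ in range(STUDIES):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    sample_mean = sum(sample) / N
    # Does this study's interval contain the true mean?
    if sample_mean - half_width <= TRUE_MEAN <= sample_mean + half_width:
        covered += 1

coverage = covered / STUDIES
print(f"{coverage:.1%} of the simulated intervals contain the true mean")
```

The share of intervals that capture the true mean should land close to 95%, although it will vary a little from run to run with a different seed.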

How confidence intervals relate to p-values

There is a direct relationship between confidence intervals and p-values. For a two-tailed test at a significance level of α = 0.05:

  • If a 95% confidence interval does not include zero (for a difference) or one (for a ratio), the result is statistically significant at p < 0.05.
  • If the interval does include zero (or one, for a ratio), the result is not statistically significant.

Returning to the clinical trial example: suppose the new drug was found to last an average of 3 hours longer than the standard drug, with a 95% CI of [0.8, 5.2] hours. Because the interval does not include zero, this confirms the statistically significant finding. The interval also tells us something the p-value cannot: the true benefit is plausibly anywhere between just under one hour and over five hours longer.
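
The link between the interval and the p-value can be sketched by working backwards from the numbers in this example. The calculation assumes a symmetric interval built from a normal approximation, which is a simplification; a real analysis would use the distribution of the actual test.

```python
from statistics import NormalDist

estimate = 3.0                 # mean difference in hours, from the example
ci_low, ci_high = 0.8, 5.2     # 95% CI from the example

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)   # about 1.96 for a 95% interval

# Assumption: the interval is estimate +/- z_crit * standard error.
standard_error = (ci_high - ci_low) / (2.0 * z_crit)
z_stat = estimate / standard_error
p_two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z_stat)))

excludes_zero = not (ci_low <= 0.0 <= ci_high)

print(f"approximate standard error: {standard_error:.2f} hours")
print(f"approximate two-tailed p:   {p_two_tailed:.4f}")
print(f"interval excludes zero:     {excludes_zero}")
```

Under these assumptions, an interval that excludes zero always corresponds to a two-tailed p-value below .05, and the further the interval sits from zero, the smaller that p-value becomes.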

Why confidence intervals matter

A p-value can be very small simply because the sample was very large, even when the true effect is trivial. A confidence interval anchors the result to a real-world scale. Reporting the interval allows readers to judge whether a statistically significant result is also practically significant. This is something the p-value alone cannot answer. The American Psychological Association (APA) and most major journals now strongly recommend or require the reporting of confidence intervals alongside p-values.

 

Type I and Type II Errors

Every hypothesis test carries the risk of reaching the wrong conclusion. There are two distinct ways this can happen, and understanding them is essential for interpreting p-values correctly.

What is a Type I error?

A Type I error (false positive) occurs when you reject the null hypothesis even though it is actually true. In other words, you conclude that an effect exists when in reality it does not. The probability of making a Type I error when the null hypothesis is true is equal to your significance level, α. If you set α = 0.05, there is a 5% chance of a false positive whenever the null hypothesis is true.
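
A short simulation can show this false positive rate in action. In the sketch below both groups are drawn from the same distribution, so the null hypothesis is true by construction; the group size, the number of simulated trials, and the use of an upper-tailed z-test with a known standard deviation of 1 are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

ALPHA = 0.05
N_PER_GROUP = 25
TRIALS = 4000

z_crit = NormalDist().inv_cdf(1.0 - ALPHA)        # upper-tailed cutoff
standard_error = math.sqrt(2.0 / N_PER_GROUP)     # SD assumed known and equal to 1

false_positives = 0
for _ in range(TRIALS):
    # Both drugs have the same true mean, so any "effect" is pure chance.
    standard = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    new_drug = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    diff = sum(new_drug) / N_PER_GROUP - sum(standard) / N_PER_GROUP
    if diff / standard_error > z_crit:
        false_positives += 1

false_positive_rate = false_positives / TRIALS
print(f"false positive rate: {false_positive_rate:.3f} (alpha = {ALPHA})")
```

The observed rate should hover around the chosen α, which is the sense in which α is the Type I error rate.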

What is a Type II error?

A Type II error (false negative) occurs when you fail to reject the null hypothesis even though it is actually false. You conclude there is no effect when in reality one does exist. The probability of a Type II error is denoted β. It is influenced by sample size, effect size, and the chosen significance level.

| | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Reject null hypothesis | Type I error (false positive) | Correct decision ✓ |
| Fail to reject null hypothesis | Correct decision ✓ | Type II error (false negative) |

The trade-off between error types

Reducing the risk of one type of error increases the risk of the other, when sample size is held constant. If you lower the significance threshold from 0.05 to 0.01 to reduce false positives, you simultaneously make it harder to detect real effects, raising the rate of false negatives.

The appropriate balance depends on the consequences of each error type in your field. In clinical research, for example, a false positive (approving an ineffective drug) may be less dangerous than a false negative (failing to detect a life-saving one), or vice versa depending on the intervention.

Practical implications

When a study reports a non-significant result (p > 0.05), this does not mean the null hypothesis is proven true. It may simply mean the study lacked the power to detect the effect; if a real effect exists and the study missed it, a Type II error has occurred. This is why sample size planning and statistical power analysis are critical steps before any study begins.

 

Statistical Power

Statistical power is the probability that your hypothesis test will correctly detect a true effect when one exists. In other words, it is the probability of avoiding a Type II error. Power is expressed as 1 − β, where β is the Type II error rate.

A study with 80% power has an 80% chance of detecting a real effect and a 20% chance of missing it. Most researchers aim for a minimum power of 0.80 (80%), though fields with higher stakes, such as clinical trials, often target 0.90 or higher.

What determines statistical power?

Four factors interact to determine power:

  • Sample size: The most controllable factor. Larger samples reduce variability and increase the precision of estimates, making it easier to detect true effects.
  • Effect size: Larger, more substantial effects are easier to detect than small ones. Power calculations require a pre-specified expected effect size, usually drawn from prior literature or a pilot study.
  • Significance level (α): A stricter threshold (e.g., α = 0.01 instead of 0.05) reduces Type I errors but also reduces power.
  • Variability in the data: Greater spread in the data means more noise, which makes it harder to detect a signal.
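
The way these four factors interact can be sketched with a simple power formula. The example assumes an upper-tailed two-sample z-test with a known, common standard deviation and equal group sizes; the function name `power_upper_tailed` and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def power_upper_tailed(effect: float, sd: float, n_per_group: int,
                       alpha: float = 0.05) -> float:
    """Approximate power of an upper-tailed two-sample z-test (known SD)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    # Probability that the test statistic exceeds the cutoff when the
    # true difference equals `effect`.
    return 1.0 - std_normal.cdf(z_crit - effect / standard_error)


baseline = power_upper_tailed(effect=1.0, sd=2.0, n_per_group=50)
bigger_sample = power_upper_tailed(effect=1.0, sd=2.0, n_per_group=100)
bigger_effect = power_upper_tailed(effect=1.5, sd=2.0, n_per_group=50)
stricter_alpha = power_upper_tailed(effect=1.0, sd=2.0, n_per_group=50, alpha=0.01)
noisier_data = power_upper_tailed(effect=1.0, sd=3.0, n_per_group=50)

print(f"baseline:        {baseline:.2f}")
print(f"larger sample:   {bigger_sample:.2f}")
print(f"larger effect:   {bigger_effect:.2f}")
print(f"stricter alpha:  {stricter_alpha:.2f}")
print(f"noisier data:    {noisier_data:.2f}")
```

Changing one input at a time shows the direction of each factor: a larger sample or a larger effect raises power, while a stricter α or noisier data lowers it.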

Why power matters

An underpowered study is problematic in two ways. First, it may miss a real effect, wasting resources and potentially delaying beneficial treatments or interventions.

Second, and less intuitively, an underpowered study that does achieve significance is more likely to have overestimated the true effect size, because only unusually large effects clear the significance threshold by chance in a small sample. This is known as the “winner’s curse” and contributes to the replication problems discussed below.
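
A small simulation can illustrate this overestimation. The sketch draws the observed difference for each study directly from its sampling distribution, assuming a small true effect, a known standard deviation, and an upper-tailed z-test; every number here is an arbitrary illustrative choice.

```python
import math
import random
from statistics import NormalDist

random.seed(7)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.2    # hypothetical true mean difference
SD = 1.0             # SD assumed known
N_PER_GROUP = 20     # deliberately small, so the study is underpowered
STUDIES = 5000

standard_error = SD * math.sqrt(2.0 / N_PER_GROUP)
z_crit = NormalDist().inv_cdf(0.95)   # upper-tailed test at alpha = 0.05

significant_estimates = []
for _ in range(STUDIES):
    # Observed difference in one study, drawn from its sampling distribution.
    observed = random.gauss(TRUE_EFFECT, standard_error)
    if observed / standard_error > z_crit:
        significant_estimates.append(observed)

share_significant = len(significant_estimates) / STUDIES
mean_significant = sum(significant_estimates) / len(significant_estimates)

print(f"true effect:                      {TRUE_EFFECT:.2f}")
print(f"share of studies significant:     {share_significant:.1%}")
print(f"mean effect among significant:    {mean_significant:.2f}")
```

Because only estimates larger than the significance cutoff survive, the average effect among the significant studies is well above the true effect that generated the data.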

Power analysis

Power analysis should be conducted before data collection to determine the minimum sample size needed to detect your expected effect at a given significance level and desired power. Many statistical software packages (G*Power, R, SPSS) include power analysis tools. Reporting the results of a pre-study power calculation is now standard practice in clinical and psychological research.
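
As a rough sketch of what such a calculation involves, the function below solves for the sample size per group in an upper-tailed two-sample z-test with a known standard deviation. Dedicated tools handle t-tests and other designs more accurately; the function name `n_per_group` and the effect size and standard deviation used here are hypothetical.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def n_per_group(effect: float, sd: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Smallest group size giving the requested power (upper-tailed z-test)."""
    z_alpha = std_normal.inv_cdf(1.0 - alpha)   # cutoff for the significance level
    z_power = std_normal.inv_cdf(power)         # quantile for the desired power
    n = 2.0 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)  # round up so the target power is reached


# Hypothetical planning values: expected difference of 1 unit, SD of 2 units.
n_80 = n_per_group(effect=1.0, sd=2.0, power=0.80)
n_90 = n_per_group(effect=1.0, sd=2.0, power=0.90)

print(f"per-group n for 80% power: {n_80}")
print(f"per-group n for 90% power: {n_90}")
```

Raising the target power from 80% to 90% increases the required sample size noticeably, which is one reason the choice of power is worth justifying in the methods section.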

 

The Multiple Comparisons Problem

When you perform multiple hypothesis tests within the same study, the probability of making at least one Type I error (false positive) increases substantially, even if each individual test is run at α = 0.05. This is known as the multiple comparisons problem, also called the problem of multiplicity.

Why multiple tests inflate the false positive rate

If you run one test at α = 0.05, there is a 5% chance of a false positive. But if you run 20 independent tests, the probability of getting at least one false positive (even if no true effects exist) rises to approximately 64%. The overall error rate across a family of tests is called the familywise error rate (FWER).
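
The figure above follows from a one-line formula: for independent tests, the chance of at least one false positive is 1 − (1 − α) raised to the number of tests. A minimal sketch, assuming the tests are independent and every null hypothesis is true:

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for m in (1, 5, 10, 20):
    print(f"{m:>2} tests: FWER = {familywise_error_rate(0.05, m):.2f}")

fwer_20 = familywise_error_rate(0.05, 20)
```

With correlated tests the exact rate differs, but the inflation is still present.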

This problem arises in several common situations:

  • Comparing more than two groups pairwise (e.g., testing Drug A vs Drug B, Drug A vs Drug C, and Drug B vs Drug C separately)
  • Testing multiple outcome variables in the same study
  • Running subgroup analyses that were not planned in advance
  • Repeatedly testing accumulating data and stopping when p < 0.05

ANOVA vs multiple t-tests

A common example of the multiple comparisons problem occurs when researchers compare three or more groups. Running separate t-tests for each pair of groups inflates the false positive rate. The appropriate solution is a one-way ANOVA (Analysis of Variance), which tests all groups simultaneously and produces a single p-value for the overall difference. If the ANOVA is significant, post-hoc tests (such as Tukey’s HSD or Bonferroni-corrected pairwise comparisons) can then be used to identify which specific groups differ, while controlling the familywise error rate.

Common corrections

Several methods exist to adjust p-values or significance thresholds when multiple comparisons are made:

  • Bonferroni correction: Divide α by the number of tests (e.g., for 10 tests, use α = 0.005). Conservative but simple.
  • False Discovery Rate (FDR) / Benjamini–Hochberg procedure: Controls the expected proportion of false positives among significant results. Less conservative than Bonferroni, commonly used in genomics and neuroimaging.
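
Both corrections can be sketched in plain Python. The p-values below are made up for illustration, and the function names are hypothetical; in practice you would normally rely on your statistical software's built-in routines.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


p_values = [0.001, 0.008, 0.012, 0.041, 0.20]  # hypothetical results of 5 tests

bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)

for p, b, f in zip(p_values, bonferroni, bh):
    print(f"p = {p:<6} Bonferroni reject: {b!s:<5} BH reject: {f}")
```

On these made-up values the Benjamini–Hochberg procedure rejects one more hypothesis than Bonferroni, which reflects its less conservative nature.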

When reading research, always check whether multiple comparisons were made and whether appropriate corrections were applied. Uncorrected multiple comparisons are one of the most common sources of irreproducible findings in the literature.

 

Publication Bias and the Replication Crisis

Understanding what p-values cannot do is just as important as knowing what they can. Over the past two decades, a growing body of evidence has revealed that an overreliance on p < 0.05 as a binary decision criterion has contributed to a serious replication crisis across the social sciences, medicine, and psychology.

Publication bias

Publication bias refers to the tendency for scientific journals to publish studies with statistically significant results (p < 0.05) while rejecting or ignoring studies that fail to reach significance. This creates a distorted picture of the evidence base: the published literature overrepresents positive findings, even though null results are equally valid and informative.

The consequences are significant. When multiple research groups independently study the same question, only those that happen to find p < 0.05 are likely to appear in journals. The unpublished null results remain invisible: a phenomenon sometimes called the “file drawer problem.” Meta-analyses that rely on published studies alone therefore tend to overestimate the true size of effects.

The replication crisis

Beginning around 2011, large-scale replication projects began systematically re-running published studies. The results were sobering. The Reproducibility Project: Psychology, which attempted to replicate 100 studies published in leading psychology journals, found that only about 36–39% reproduced the original significant finding. Similar replication failures have been documented in cancer biology, economics, and medicine.

The causes are multiple and interrelated: publication bias, underpowered studies, p-hacking, hypothesising after results are known (HARKing), and the flexibility researchers have in data collection and analysis. Small p-values in underpowered studies are particularly fragile, because (as discussed above) underpowered studies that achieve significance tend to overestimate effect sizes.

What this means for interpreting p-values

A single study reporting p < 0.05 is not proof. P-values need to be interpreted in context: the prior plausibility of the hypothesis, the statistical power of the study, whether the analysis was pre-registered, and whether the result has been independently replicated. Many statisticians and scientific organisations now recommend supplementing or replacing p-value thresholds with effect sizes, confidence intervals, Bayesian approaches, or pre-registered replication as the gold standard of evidence.

In 2019, over 800 scientists signed a call in Nature to abandon the use of “statistical significance” as a binary label, arguing it has been systematically misused and misunderstood. While the debate continues, the consensus is clear: p < 0.05 is a starting point for evidence, not a finish line.

Reporting non-significant results

Non-significant results should be reported fully, not omitted or described vaguely. Write, for example: “There was no significant difference in test scores between the two groups, t(58) = 1.14, p = .259, d = 0.29, 95% CI [−0.52, 1.90].” Reporting the effect size and confidence interval is especially important in null results, as they indicate whether the study had sufficient precision to detect a meaningful effect if one existed.

Pre-registration and the methods section

The significance threshold you used (e.g., α = .05) should be declared in the methods section, not chosen after viewing the results. A sentence such as “The significance threshold was set at α = .05 prior to data collection” is standard practice and signals to reviewers that the analysis was not adjusted post hoc.

 

 

Reporting P-Values in APA Format

When writing up research for publication, p-values must be reported clearly and consistently. The Publication Manual of the American Psychological Association (APA, 7th edition) provides the most widely adopted guidelines for reporting statistical results, and most peer-reviewed journals in the social and health sciences require or recommend following them.

General formatting rules

  • The letter p is always italicised.
  • P-values are reported without a leading zero: write p = .032, not p = 0.032.
  • Spaces appear on both sides of the equals sign: p = .05, not p=.05.
  • Very small p-values are reported as p < .001 rather than as an exact value (e.g., p = .0000032).
  • Exact p-values are preferred over inequalities wherever software reports them: write p = .043, not p < .05.
  • Round to two or three decimal places; never report more decimal places than are meaningful for your data.
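
These rules are simple enough to automate. The helper below is a hypothetical sketch, not part of any APA tooling; it applies the leading-zero, spacing, and p < .001 rules to a raw number. Italicising the letter p depends on your word processor or typesetting system, so it is left out.

```python
def format_p_apa(p: float, decimals: int = 3) -> str:
    """Format a p-value following the APA-style rules listed above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < .001"               # very small values are not reported exactly
    rounded = f"{p:.{decimals}f}"       # e.g. "0.032"
    if rounded.startswith("0"):
        rounded = rounded[1:]           # drop the leading zero: ".032"
    return f"p = {rounded}"             # spaces on both sides of the equals sign


for value in (0.032, 0.0000032, 0.259):
    print(format_p_apa(value))
```

A helper like this keeps formatting consistent across a manuscript, but the target journal's own guidelines should still take priority.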

What to report alongside the p-value

A p-value reported in isolation gives readers very little information. APA guidelines require reporting the test statistic, degrees of freedom, and effect size alongside the p-value. Confidence intervals are strongly recommended.

Worked example: independent samples t-test

Suppose a study compared anxiety scores between two groups: one receiving a new intervention (n = 45) and a control group (n = 42). After conducting an independent samples t-test, the results might be reported as follows:

Participants in the intervention group (M = 14.2, SD = 3.8) reported significantly lower anxiety than those in the control group (M = 17.6, SD = 4.1), t(85) = −4.01, p < .001, d = 0.86, 95% CI [−5.08, −1.72].

Breaking this down:

| Element | Meaning |
|---|---|
| M = 14.2, SD = 3.8 | Group mean and standard deviation |
| t(85) = −4.01 | t-statistic with degrees of freedom in parentheses |
| p < .001 | P-value (below .001, so reported as an inequality rather than an exact value) |
| d = 0.86 | Cohen’s d effect size (large effect) |
| 95% CI [−5.08, −1.72] | Confidence interval for the mean difference |

 

Frequently Asked Questions (FAQs) on p-value 

Q: What influences p-value?  

The primary factors that affect the p-value include the size of the observed effect, the sample size, and the variability within the data. A larger effect size, a larger sample size, and lower variability can all contribute to a lower p-value, indicating stronger evidence against the null hypothesis. The chosen significance level (alpha) does not change the p-value itself; it only determines whether a given p-value is declared statistically significant.

Q: What does p-value of 0.05 mean?  

The value 0.05 is a commonly used threshold in statistical hypothesis testing. Used as a threshold, it is the level of significance, typically denoted as alpha, which is the probability of rejecting the null hypothesis when it is true. An observed p-value of 0.05 means that, if the null hypothesis were true, results at least as extreme as yours would be expected about 5% of the time. If the p-value is less than or equal to 0.05, the observed results are statistically significant at the 5% level.

Q: What is the p-value significance of 0.15? 

The significance of a p-value depends on the chosen threshold, typically called the significance level or alpha. If the significance level is set at 0.05, a p-value of 0.15 would not be considered statistically significant. In this case, there is insufficient evidence to reject the null hypothesis. However, it is important to note that significance levels can vary depending on the specific field or study design. 

Q: Which p-value to use in T-Test?  

When performing a t-test, the p-value obtained indicates the probability of observing data at least as extreme as yours if the null hypothesis is true. Whether you use a one-tailed or two-tailed p-value depends on your alternative hypothesis, and the result is then compared with the chosen significance level (alpha). Generally, a p-value less than or equal to alpha indicates statistical significance, supporting the rejection of the null hypothesis in favour of the alternative hypothesis.

Q: Are p-values affected by sample size?  

Yes, sample size can influence p-values. Larger sample sizes tend to yield more precise estimates and narrower confidence intervals. This increased precision makes it easier to detect smaller effects or subtle differences between groups or variables, which can lead to smaller p-values. However, sample size is not the sole determinant of statistical significance. Consider it along with the effect size and the variability of the data, which also shape the p-value, and with the chosen significance level (alpha), which determines whether that p-value counts as significant.


This article was originally published on February 9, 2023, and updated on June 8, 2026.
