What is Criterion Validity? Definition, Types and Examples

Key Takeaways:

Criterion validity measures how well a test’s results align with a trusted, established benchmark for the same construct.
There are 3 types: concurrent, predictive, and retrospective validity, distinguished by when the criterion is measured relative to the test.
Criterion validity is calculated using correlation coefficients such as Pearson r or point-biserial r, typically run in Excel, R, SPSS, or Python.
A test is only as valid as its criterion. A flawed or biased gold standard undermines the entire validity assessment.

Glossary of Key Terms

Term	Definition
Criterion validity	The extent to which a test’s scores correlate with a trusted, established measure of the same construct.
Construct	An abstract concept, such as intelligence, anxiety, or job performance, that cannot be measured directly.
Gold standard	A widely accepted benchmark measure used as the criterion for comparison.
Concurrent validity	Criterion validity assessed when the test and the criterion are measured at the same time.
Predictive validity	Criterion validity assessed when the criterion is measured at a later point in time.
Retrospective validity	Criterion validity assessed when the criterion was measured in the past, before the test was given.
Correlation coefficient (r)	A statistic ranging from -1.0 to 1.0 that shows the strength and direction of the relationship between two variables.
Point-biserial correlation	A correlation method used when one variable is continuous and the other is dichotomous, such as pass or fail.

What Is Criterion Validity?

Criterion validity is the degree to which a test’s scores correlate with a trusted, established measure, called a criterion, of the same construct. It shows whether a new tool actually reflects the real-world outcome it claims to measure.

Many research constructs, such as depression, intelligence, or stress, cannot be observed directly. Researchers instead build tests and questionnaires to approximate them. Criterion validity checks whether those approximations hold up against a recognized external benchmark, often called a gold standard.

For example, suppose a psychologist develops a new anxiety questionnaire. To check that it works, they compare its scores against a well-established clinical anxiety assessment. If scores on the 2 tools correlate strongly, that is evidence of criterion validity for the new questionnaire.

Why Does Criterion Validity Matter in Research?

Criterion validity matters because researchers, clinicians, and employers rely on test scores to make real decisions about people. Without it, there is little evidence that test scores track the real-world outcomes those decisions depend on.

Builds trust in new or shorter instruments before they replace established, lengthier ones.
Supports high-stakes decisions, including university admissions, hiring, and clinical diagnosis.
Allows researchers to justify using a convenient tool in place of a more expensive or time-consuming one.
Strengthens the credibility of published research by demonstrating the measures used are sound.

Types of Criterion Validity

Criterion validity has 3 recognized sub-types. Each differs based on when the criterion is measured relative to the test being validated.

Concurrent Validity

Concurrent validity compares a test’s results to a criterion measured at the same time. It is used when a new or shorter tool needs to be checked against an existing, established one.

Example: A researcher administers a new depression screening tool alongside a validated clinical interview on the same day. A strong correlation between the two scores supports concurrent validity.

Predictive Validity

Predictive validity assesses whether a test can forecast a future outcome. The criterion is measured after the test, often weeks, months, or years later.

Example: A university admission test is compared to students’ grade point average at the end of their first year. If higher test scores align with higher grades, the test shows predictive validity.

Retrospective Validity

Retrospective validity compares a current test to a criterion that was measured in the past, typically drawn from existing records. It is less common but useful when reliable historical records exist.

Example: A new screening tool for a medical condition is compared against past diagnostic records to see whether current scores align with outcomes documented years earlier.

How Do You Choose a Good Criterion?

Choosing a good criterion is often the hardest step, because a flawed or biased criterion compromises the entire validity assessment, no matter how well-designed the new test is.

A strong criterion should meet 4 basic conditions:

Established: it is already recognized and widely used in the field.
Relevant: it measures the same construct as the new test, not a related but different one.
Unbiased: it is not affected by the same errors or assumptions built into the new test.
Feasible: it can realistically be collected within the study’s time and resource constraints.

If no accepted criterion exists for a construct, researchers cannot assess criterion validity directly. In that case, they typically turn to construct validity instead, which examines relationships with other related and unrelated variables.

How Is Criterion Validity Calculated?

Criterion validity is calculated by running a correlation between the test scores and the criterion scores. The specific method depends on whether the variables are continuous or categorical.

Method	When to Use	Value Range	Example
Pearson r	Both variables are continuous	-1.0 to 1.0	Test score vs. GPA
Point-biserial r	One variable is continuous, the other dichotomous	-1.0 to 1.0	Test score vs. pass or fail
Phi coefficient	Both variables are dichotomous	-1.0 to 1.0	Screening result vs. diagnosis

The resulting r-value indicates 2 things: strength, how closely the variables move together, and direction, whether the relationship is positive or negative. These calculations are typically run in Excel, R, SPSS, or Python rather than by hand.
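As a rough illustration of the calculation, the sketch below computes Pearson r in plain Python. It also shows that point-biserial r can be obtained from the same formula once the dichotomous variable is coded as 0 and 1. The function name pearson_r and all of the data are hypothetical, made up for this example only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: admission test scores and first-year GPA (both continuous).
test_scores = [52, 61, 68, 74, 80, 85, 91, 95]
gpa = [2.1, 2.6, 2.4, 3.0, 3.1, 3.4, 3.6, 3.8]

# Hypothetical dichotomous criterion, coded 1 = pass, 0 = fail.
passed = [0, 0, 0, 1, 0, 1, 1, 1]

r_continuous = pearson_r(test_scores, gpa)         # Pearson r
r_point_biserial = pearson_r(test_scores, passed)  # point-biserial r via 0/1 coding

print(f"Pearson r: {r_continuous:.3f}")
print(f"Point-biserial r: {r_point_biserial:.3f}")
```

In practice a statistics package would normally be used instead, since it also reports a p-value. The hand-rolled version is only meant to make the arithmetic visible.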

What Counts as a Good Criterion Validity Score?

There is no single universal cutoff for criterion validity. The acceptable r-value depends on the field, the construct being measured, and how the results will be used.

0.10 to 0.29: weak relationship
0.30 to 0.49: moderate relationship
0.50 to 1.0: strong relationship

High-stakes fields, such as medical diagnosis or pilot selection, generally require stronger correlations than exploratory academic research. A moderate correlation may be perfectly acceptable in an early-stage study.
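The rule-of-thumb bands above can be expressed as a small helper. This is a sketch only: the function name describe_strength is hypothetical, applying the bands to the absolute value of r (so that negative correlations are labelled by size) is an assumption, and so is the "negligible" label for values below 0.10. As noted, these cutoffs are conventions, not fixed rules.

```python
def describe_strength(r):
    """Label a correlation using the rule-of-thumb bands from the text."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1.0 and 1.0")
    size = abs(r)  # assumption: the sign gives direction, the size gives strength
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "weak"
    return "negligible"  # assumption: the text gives no label below 0.10

for value in (0.62, -0.35, 0.12, 0.04):
    direction = "positive" if value > 0 else "negative"
    print(f"r = {value:+.2f}: {describe_strength(value)}, {direction}")
```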

How Do You Measure Criterion Validity Step by Step?

Measuring criterion validity follows a consistent process, whether the study involves a psychological questionnaire, an employment test, or a medical screening tool.

Define the construct the test is intended to measure.
Select and justify a reliable, relevant criterion.
Recruit a sample that represents the population of interest.
Administer the test and obtain the criterion measure at the same time, at a later date, or from existing past records.
Calculate the appropriate correlation coefficient.
Interpret the strength, direction, and practical significance of the result.
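For the final interpretation step, it can help to see how much uncertainty surrounds a single r-value. One common approach, offered here as an illustration and not as part of the process above, is an approximate 95% confidence interval based on Fisher's z transformation. The function name fisher_confidence_interval and the example values (r = 0.45, n = 60) are hypothetical.

```python
import math

def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z.

    z_crit=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        raise ValueError("need more than 3 paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    z = math.atanh(r)                     # Fisher z transformation of r
    std_err = 1.0 / math.sqrt(n - 3)      # standard error of z
    lower = math.tanh(z - z_crit * std_err)
    upper = math.tanh(z + z_crit * std_err)
    return lower, upper

# Hypothetical result: r = 0.45 from a sample of 60 participants.
low, high = fisher_confidence_interval(r=0.45, n=60)
print(f"r = 0.45, approximate 95% CI: {low:.2f} to {high:.2f}")
```

A wide interval suggests the estimate may not hold up in a different sample, which is one reason a single correlation is rarely treated as conclusive.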

Limitations of Criterion Validity

A test is only as valid as its criterion. A biased or flawed gold standard undermines the entire result.
No accepted criterion exists for some constructs, making direct assessment impossible.
A strong correlation does not guarantee the test measures the underlying construct correctly; construct validity is still needed.
Results can be sample-specific and may not generalize to other populations or settings.

Criterion Validity vs Other Types of Validity

Criterion validity is 1 of 4 validity types commonly discussed in research methodology. Each answers a different question about how well a test measures what it claims to.

Validity Type	What It Checks	Compared Against	Example
Criterion	Correlation with a real-world outcome	An established gold-standard measure	Admission test vs. course grades
Content	Coverage of the full construct	Expert judgment or theory	Math test covering all required topics
Construct	Whether the test reflects the theoretical concept	Related and unrelated constructs	Anxiety test correlates with stress, not memory
Face	Whether the test appears valid on the surface	Subjective impression	Test visibly looks like it measures anxiety

Criterion Validity in Different Fields

Criterion validity is applied wherever tests are used to make consequential decisions about people. Its relevance and stakes shift depending on the field.

Human Resources and Hiring

Employers use criterion validity to confirm that pre-employment tests predict actual job performance. For example, scores on a cognitive ability test are correlated with supervisor performance ratings collected 6 to 12 months later.

Education and Admissions

Standardized admission tests are validated against outcomes such as first-year grade point average or graduation rates. A test with strong predictive validity helps institutions make fairer, evidence-based admission decisions.

Clinical and Medical Diagnosis

New screening tools are compared against confirmed diagnostic results, such as biopsy outcomes or established clinical interviews. This ensures a faster or cheaper screening method still identifies cases accurately.
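When both the screening result and the confirmed diagnosis are dichotomous, the phi coefficient can be computed directly from the 4 cells of a 2-by-2 table. The sketch below assumes the cells are laid out as true positives, false positives, false negatives, and true negatives. The function name phi_coefficient and the counts are hypothetical.

```python
import math

def phi_coefficient(true_pos, false_pos, false_neg, true_neg):
    """Phi coefficient for a 2x2 table of screening result vs. diagnosis."""
    a, b, c, d = true_pos, false_pos, false_neg, true_neg
    # Product of the two row totals and the two column totals.
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator

# Hypothetical counts: a new screening tool compared against confirmed diagnoses.
phi = phi_coefficient(true_pos=40, false_pos=10, false_neg=5, true_neg=45)
print(f"phi = {phi:.3f}")
```

Phi summarizes overall agreement in a single number. Clinical studies often report sensitivity and specificity alongside it, because those separate the 2 kinds of error.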

Worked Example: Measuring Happiness

A social psychologist studies how happiness relates to relationship longevity. They design a new, more engaging online questionnaire to measure happiness and want to confirm it works before using it in their study.

Concurrent validity: they administer the new questionnaire alongside the established Subjective Happiness Scale to the same group at the same time, then correlate the 2 sets of scores.
Predictive validity: they re-contact participants 1 year later to see whether early happiness scores correlate with whether their relationships lasted.

Strong correlations in both cases would support the new questionnaire as both an accurate current measure and a useful predictive tool.

Frequently Asked Questions

What is the difference between criterion validity and construct validity?

Criterion validity compares a test to a single established benchmark. Construct validity is broader, examining how a test relates to multiple related and unrelated variables to confirm it measures the intended theoretical concept.

Is criterion validity the same as predictive validity?

No. Predictive validity is 1 of 3 sub-types of criterion validity. The others are concurrent validity and retrospective validity, which differ in when the criterion is measured relative to the test.

What is a good correlation coefficient for criterion validity?

There is no fixed rule, but many researchers treat 0.30 to 0.49 as moderate and 0.50 or above as strong. High-stakes fields, such as medical diagnosis, typically require stronger correlations than exploratory research.

How do you calculate criterion validity in SPSS or R?

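In SPSS, run a bivariate correlation (Analyze, then Correlate, then Bivariate) between the test and criterion scores. In R, use the cor() or cor.test() function with the default Pearson method. A point-biserial correlation is mathematically a Pearson correlation in which the dichotomous variable is coded 0 and 1, so the same functions cover both cases.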

What is an example of criterion validity in psychology?

A common example is comparing a new anxiety questionnaire to an established clinical anxiety assessment administered at the same time. A strong correlation between the 2 confirms concurrent criterion validity.

Why is criterion validity important in employee selection tests?

It confirms that a pre-employment test genuinely predicts job performance rather than measuring an unrelated trait. This protects employers from making costly, unfair, or legally risky hiring decisions.

What is the difference between concurrent and predictive validity?

Concurrent validity measures the test and the criterion at the same time. Predictive validity measures the criterion at a later date, testing whether the current test forecasts a future outcome.

Can a test have criterion validity but not be truly valid overall?

Yes. A test can correlate well with a flawed or biased criterion, producing misleading results. Researchers typically pair criterion validity with construct and content validity for a complete picture of a test’s overall quality.

This article was originally published on March 03, 2025, and updated on July 31, 2026.