Why Talk About Test Validity?
Many parts of personality -- e.g., extraversion, intelligence, emotionality -- are measured with psychological tests. Psychological tests, in fact, provide an important infrastructure for research in personality psychology. That is, they support the field's central activities by helping scientists and practitioners measure, and therefore study, a person's attributes. Perhaps the most important quality of a psychological test is whether it possesses validity. Most generally speaking, test validity concerns whether a test measures what it is supposed to measure. Test validity is crucial because it provides the test-taker with an assurance that the test's information about him or her is accurate.
The Definition(s) of Test Validity
The concept of test validity has evolved since it was first introduced at the beginning of the 20th century. One classic definition of validity is:
A test is valid when the available evidence indicates that the test and its subscales measure the attributes they are intended to measure.
That is, if a test measures what it claims to, it is considered a valid test. If it doesn't, it is invalid.
The rules for determining test validity are developed in journal articles and book chapters (and, perhaps, books) about test validity. These articles and chapters present new opinions, research, and theories about validity.
A set of official standards for validity also exists, in a book called the Standards for Educational and Psychological Testing. The Standards are produced by a joint committee of the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurement in Education (NCME).
The 1999 Standards have their own definition of validity, which is somewhat different from the classic definition provided above. The definition in the Standards reads, in part:
"The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (AERA/APA/NCME, 1999).
The new Standards emphasize test scores and their interpretation; the classic definition wants to know (sorry for the repetition) whether a test measures what it is supposed to. For now, let's interpret these two approaches as similar and aimed at the same question -- do you trust the test and the test results?
Assessing a Test's Validity
Okay, so does the test measure what it should? Deciding this requires integrating information from several sources. The new Standards discuss four areas of evidence for validity. These include:
Evidence from Response Processes (or Response Process Validity) -- When a person takes a test, the person goes through mental processes so as to provide an answer. The test maker assumes that the person's processes reflect what the test is trying to measure. For example, if the test is measuring ability at mathematical problem solving, then the individual's mental processes should reflect problem solving rather than, say, remembering an answer that they already had memorized. Studies that examine think-aloud protocols provide one way of determining whether response process validity is present.
Evidence from Test Content (or Content Validity) -- To assess this sort of validity, we ask, "Are the items used in collecting data truly representative of the specified domain?" For example: will an in-class test measure the curriculum that the student has been taught?
Content validity can be assessed informally (e.g., "Is the right content there?"), or more formal methods can be employed. One example of a more formal method is the logical (item) sampling approach. The method uses a multi-step process that begins with a careful definition of the domain of behaviors to be sampled from. Then, the test is designed logically to cover all aspects of that domain.
For example, a teacher might be covering 20th Century World History, from 1900 to 1929. She might map out her coverage as follows:
| | African history | Asian history | Latin-American history | American history | European history |
| --- | --- | --- | --- | --- | --- |
| Dates | | | | | |
| Economic changes | | | | | |
| Social changes | | | | | |
| Political changes | | | | | |
Next, she might fill in how many test questions from each area she will ask -- weighting areas according to their importance. A 10-item quiz might take this form...
| | African history | Asian history | Latin-American history | American history | European history |
| --- | --- | --- | --- | --- | --- |
| Dates | | 1 | | 1 | |
| Economic changes | 1 | | 1 | | 1 |
| Social changes | | 1 | | 1 | |
| Political changes | 1 | | 1 | | 1 |
Notice that the 10 items are spread carefully across the intersections of the topics: two from each region, and two or three for each major area of focus. This test would have content validity.
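The blueprint above can also be written down and checked mechanically. The following is only a minimal sketch: the names `REGIONS`, `BLUEPRINT`, and the three helper functions are hypothetical, chosen for illustration, and the counts are copied from the 10-item quiz described above.

```python
# Hypothetical encoding of the teacher's 10-item quiz blueprint.
REGIONS = ["African", "Asian", "Latin-American", "American", "European"]

# Each row is an area of focus; each list holds the number of items
# planned for each region, in the same order as REGIONS.
BLUEPRINT = {
    "Dates":             [0, 1, 0, 1, 0],
    "Economic changes":  [1, 0, 1, 0, 1],
    "Social changes":    [0, 1, 0, 1, 0],
    "Political changes": [1, 0, 1, 0, 1],
}


def total_items(blueprint):
    """Total number of items on the test."""
    return sum(sum(counts) for counts in blueprint.values())


def items_per_area(blueprint):
    """Number of items for each area of focus (row totals)."""
    return {area: sum(counts) for area, counts in blueprint.items()}


def items_per_region(blueprint, regions):
    """Number of items for each region (column totals)."""
    column_totals = [sum(column) for column in zip(*blueprint.values())]
    return dict(zip(regions, column_totals))


if __name__ == "__main__":
    print("Total items:", total_items(BLUEPRINT))
    print("Per area:", items_per_area(BLUEPRINT))
    print("Per region:", items_per_region(BLUEPRINT, REGIONS))
```

A check like this only confirms that the planned coverage is balanced; whether the individual items really represent each cell of the domain still calls for expert judgment.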
Evidence from Test Structure (or Structural Validity) -- Structural validity asks the question, "How many things does this test measure?"
"Test structure" here is short for "the structure of test covariance." In (somewhat) plainer English, it is the degree to which all the test items rise and fall together or, by contrast, the degree to which one set of test items rises and falls together in one pattern while another set rises and falls in a different pattern. Test structure is generally studied with advanced mathematical techniques such as structural equation modeling, factor analysis, or multidimensional scaling. These techniques all aim to discover how many things a test measures.
Basically, if a test-maker claims a test measures one thing -- general intelligence, say -- then the mathematical evidence concerning the test should indicate that it measures one thing. If the test-maker claims it measures two things, then the math should indicate the test measures two things, and so on.
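To make the idea of items "rising and falling together" concrete, here is a rough sketch. It is not factor analysis or any of the techniques named above; it is a deliberately crude stand-in that groups items whose pairwise correlation exceeds a cutoff. The response data, the item names, the `group_items` helper, and the 0.7 threshold are all invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up responses of eight people to four items (hypothetical data).
RESPONSES = {
    "q1": [1, 2, 3, 4, 1, 2, 3, 4],
    "q2": [1, 2, 4, 4, 1, 2, 4, 4],
    "q3": [1, 1, 1, 1, 4, 4, 4, 4],
    "q4": [1, 2, 1, 2, 4, 5, 4, 5],
}


def group_items(responses, threshold=0.7):
    """Group items that correlate at or above the threshold.

    An item joins (and merges) every existing group that contains at
    least one item it correlates with strongly; otherwise it starts a
    new group. The number of groups is a crude guess at how many
    things the test measures.
    """
    groups = []
    for name, scores in responses.items():
        linked = [
            group for group in groups
            if any(pearson(scores, responses[other]) >= threshold
                   for other in group)
        ]
        merged = []
        for group in linked:
            merged.extend(group)
            groups.remove(group)
        merged.append(name)
        groups.append(merged)
    return groups


if __name__ == "__main__":
    for group in group_items(RESPONSES):
        print(group)
```

With this invented data, q1 and q2 move together and q3 and q4 move together, so the sketch reports two groups. If the test-maker had claimed the test measures one thing, a pattern like this would count against that claim.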
Evidence Based on Relations to Other Variables (or Criterion Validity) -- Perhaps the final word, here, involves whether the test predicts as it is supposed to. This involves the test scores' relation to other variables. There are several matters to examine here, including convergent validity, discriminant validity, and criterion validity.
Convergent validity -- The test should converge (correlate highly) with other tests of the same concept; for example, if the test measures extraversion, it should correlate highly with other tests of extraversion.
Discriminant validity -- The test should not correlate highly with measures of different concepts or constructs. For example, a test of extraversion should not correlate highly with a test of anxiety (unless there is a known relationship between the two variables).
Criterion validity -- Finally, test scores ought to be related to a criterion of importance, such as:
- graduating from school
- performing successfully in a position
- staying healthy
Relations among variables usually are expressed as a correlation coefficient.
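As a small illustration of how such correlations are read, the sketch below computes Pearson's r between a new test and three other variables. All of the scores and variable names are hypothetical, made up so that the pattern is easy to see; real validity studies use far larger samples.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for six people (invented for illustration).
new_extraversion_test = [10, 12, 14, 16, 18, 20]
established_extraversion_test = [11, 12, 15, 15, 19, 20]  # same concept
unrelated_measure = [5, 7, 6, 6, 7, 5]                    # different concept
criterion = [2, 3, 3, 5, 6, 6]                            # outcome of interest

if __name__ == "__main__":
    # Convergent evidence: should be high.
    print("convergent:", pearson(new_extraversion_test,
                                 established_extraversion_test))
    # Discriminant evidence: should be near zero.
    print("discriminant:", pearson(new_extraversion_test,
                                   unrelated_measure))
    # Criterion evidence: should be meaningfully above zero.
    print("criterion:", pearson(new_extraversion_test, criterion))
```

In this made-up example the new test correlates strongly with the established test and with the criterion, and essentially not at all with the unrelated measure, which is the pattern one hopes to see.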
There are three subtypes of criterion relationships:
- Predictive -- where the test score predicts a future outcome
- Concurrent -- where the test score and criterion are measured at the same time
- Post-dictive -- where the test score predicts backward to the individual's history
References
Joint Committee (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.