ATD Blog

Write Test Questions That Actually Measure Something

Wed Aug 01 2018

Writing good test questions is both an art and a science. Having taken tests doesn’t make someone a good test-item writer. Yet writing good test items is a critical skill for L&D professionals. In fact, their personal credibility and their ability to prove the value of a training program depend on it.

Unfortunately, writing good test questions is not something many L&D professionals do well. Test items frequently contain obvious clues to the correct answer, or they are overly difficult and discourage test takers from getting the correct answer. In either case, the result is an invalid test: one that doesn’t measure what it is supposed to and is unfair to either the test taker or the test taker’s organization.

What’s more, invalid tests put L&D professionals at risk by creating two distinct situations:

It appears that learning took place when it actually didn’t (the test items contained obvious clues as to the correct answer).
It appears that learning didn’t take place when it did (the test items were tricky or overly difficult and discouraged test takers from getting the correct answer).

In the first situation, business executives may question why participant job behaviors didn’t change (Level 3) or business results didn’t improve (Level 4) if learning improved. In the second situation, business executives may question why time and money were wasted on training where participants didn’t learn anything. In either case, your reputation and credibility as an L&D professional are on the line.

Here’s the good news: both of these situations can be avoided by conducting a test-item analysis to determine whether each of your questions is “good.”

In Criterion-Referenced Test Development, instructional design professors Sharon Shrock and William C. Coscarelli outline three test-item statistics you can use to evaluate the quality of your test items: 1) difficulty index, 2) p-values, and 3) point-biserial correlation. To apply these statistics, you first need to create your test and then administer it to a group of at least 25 to 30 program participants. Let’s take a closer look:

Difficulty Index

The difficulty index, as the name implies, measures the proportion of program participants who answer a particular test question correctly. The statistic is expressed as a proportion or a percentage; for example, “.60, or 60 percent, of program participants answered the question correctly.” The index ranges from .00 (no one answered the item correctly) to 1.00 (everyone answered the item correctly), and effective test questions typically score between .30 and .70. Test items that fall outside the .30 to .70 range should be considered either too difficult (below .30) or too easy (above .70). One exception is a mastery test, in which case you would look for difficulty index scores in the .90, or 90 percent, range.
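The calculation can be sketched in a few lines of Python. This is only an illustration: the function names, the 0/1 scoring of each response, and the sample of 30 test takers are assumptions made for the example, while the .30 and .70 cutoffs come from the range described above.

```python
def difficulty_index(item_scores):
    """Proportion of test takers who answered one item correctly.

    item_scores: list of 1 (correct) or 0 (incorrect), one entry per test taker.
    """
    if not item_scores:
        raise ValueError("need at least one response")
    return sum(item_scores) / len(item_scores)


def classify_difficulty(index, low=0.30, high=0.70):
    """Label an item using the .30 to .70 guideline (not a mastery test)."""
    if index < low:
        return "too difficult"
    if index > high:
        return "too easy"
    return "acceptable"


# Hypothetical data: 30 test takers, 18 of whom answered the item correctly.
scores = [1] * 18 + [0] * 12
index = difficulty_index(scores)
print(round(index, 2), classify_difficulty(index))
```

For a mastery test, you might pass different cutoffs to `classify_difficulty` rather than rely on the defaults.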

P-Value

The p-value is similar to the difficulty index, but it indicates what percentage of test takers chose each of the incorrect response options rather than the percentage who answered the item correctly. For instance, in a multiple-choice test question with four response options, each of the three incorrect responses would have its own p-value. As an example, imagine a multiple-choice test question with a difficulty index of .60 and the following p-values for the incorrect responses: .10, .15, and .15. (Note that the p-values plus the difficulty index sum to 1.00, or 100 percent of the test takers.)

The p-value data enables you to conduct a response-option analysis on multiple-choice test questions to see whether any of the responses are being over-selected or under-selected. An over-selected correct answer indicates either that the question is too easy or that none of the other response options is seen as plausible. An over-selected incorrect response option indicates that the question is misleading or that the response option needs to be reworded so it is less attractive (less like the correct response). Under-selected response options also increase the odds of a test taker guessing the correct answer, because an option that almost no one chooses is easy to rule out. Case in point: the chance of guessing the correct answer to a multiple-choice test question with four response choices goes from 25 percent to 33 percent with one under-selected response option and to 50 percent with two under-selected response options.
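A response-option analysis like this one could be sketched in Python as follows. The function names, the sample responses, and the .05 cutoff for calling an option “under-selected” are all assumptions made for illustration; the article does not prescribe a specific threshold.

```python
from collections import Counter


def option_proportions(responses, options):
    """Proportion of test takers who chose each response option."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts.get(opt, 0) / total for opt in options}


def under_selected(proportions, key, threshold=0.05):
    """Incorrect options chosen by fewer than `threshold` of test takers.

    The .05 default is a hypothetical cutoff, not a published standard.
    """
    return [opt for opt, p in proportions.items() if opt != key and p < threshold]


def guess_probability(n_options, n_under_selected):
    """Chance of a blind guess once under-selected options are ruled out."""
    return 1 / (n_options - n_under_selected)


# Hypothetical item: 20 test takers, correct answer "B".
responses = ["B"] * 12 + ["A"] * 2 + ["C"] * 3 + ["D"] * 3
props = option_proportions(responses, "ABCD")
print(props)                        # B is the difficulty index; A, C, D are p-values
print(under_selected(props, "B"))   # options that may need rewording
for ruled_out in range(3):
    print(ruled_out, round(guess_probability(4, ruled_out), 2))
```

The sample data reproduces the example above: a difficulty index of .60 with p-values of .10, .15, and .15 for the incorrect options.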

Point-Biserial Correlation

The point-biserial correlation is regarded by most test-creation experts as the single most useful test-item analysis statistic. This statistic correlates test-takers’ performance on a single test item with their overall test scores. In short, the statistic shows if test-takers who scored high on the test overall also answered the particular test question correctly. Each test item will have a point-biserial correlation score ranging from +1.00 to -1.00. Of particular concern are negative point-biserial scores, because they indicate that test-takers who generally scored high on the test overall missed the item, while test-takers who generally did poorly on the test overall got the item right. Any test items with a negative point-biserial score should be investigated immediately to determine the source of the problem and then rewritten.
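As a rough sketch, the point-biserial correlation for one item can be computed as an ordinary Pearson correlation between the item scores (1 for correct, 0 for incorrect) and the overall test scores. The function name and the sample data below are assumptions for illustration, as is the choice to include the item itself in each total score.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and overall test scores.

    Computed as a Pearson correlation, which equals the point-biserial
    coefficient when one of the two variables is dichotomous.
    """
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((i - mean_item) * (t - mean_total)
              for i, t in zip(item_scores, total_scores))
    ss_item = sum((i - mean_item) ** 2 for i in item_scores)
    ss_total = sum((t - mean_total) ** 2 for t in total_scores)
    if ss_item == 0 or ss_total == 0:
        # Everyone got the item right (or wrong), or all totals are identical.
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / sqrt(ss_item * ss_total)


# Hypothetical data: five test takers and their overall scores.
totals = [9, 8, 7, 4, 3]
good_item = [1, 1, 1, 0, 0]     # high scorers answered correctly
suspect_item = [0, 0, 0, 1, 1]  # high scorers missed the item

print(point_biserial(good_item, totals) > 0)     # positive: item behaves as expected
print(point_biserial(suspect_item, totals) < 0)  # negative: investigate this item
```

With a real test, you would run this once per item and flag any item whose result is negative.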

Bottom line: If you’re creating a Level 2 knowledge test, writing test items is only half your job. The other half is ensuring you write test questions that actually measure something. By using these three test-item statistics to analyze your test items, you can safeguard success.

For a deeper dive into measuring training, join me October 11-12 in New Orleans for the ATD Core 4 Conference.
