{\huge Multiple-Credit Tests}
Can chess help design multiple-choice exams? \vspace{.5in}
Bruce Pandolfini is one of very few living people who have been played by an Oscar-winning actor. He was the real-life chess teacher of junior chess player Joshua Waitzkin, who went on to become the world champion---in T'ai Chi. Their story is told in the movie ``Searching For Bobby Fischer,'' which is still iconic after 20-plus years. Pandolfini is still doing what he loves as an active chess teacher in New York City. For much of this time he has also written a popular feature called ``Solitaire Chess'' for Chess Life magazine, which is published by the United States Chess Federation.
Today Dick and I wish to compare styles of multiple-choice exams, with reference to ``Solitaire Chess,'' and have some fun as well.
Most multiple-choice questions are designed to have a unique correct answer, with all other answers receiving 0 points or even a minus. This is like a chess problem of ``find the winning move'' type. Mate-in-2, mate-in-3, and endgame problems generally have unique answers---a ``dual'' solution is an esthetic blemish. There are several popular websites devoted to this kind of chess puzzle, which is great for honing one's tactical ability.
``Solitaire Chess'' is different, with more emphasis on strategy. The reader takes the winning side of a notable game Bruce has prepared, and chooses his/her move before revealing the answer and the opponent's next move. It simulates the feeling of playing a master game.
Incidentally, Bruce recently attended the wedding of another master player featured in the movie, the real-life Asa Hoffman, to the former Virginia LoPresto; both remember me from New York tournaments in the 1970's. Among the children he has coached in their preteen years is the world's current 5th-ranked player, Fabiano Caruana of Brooklyn and Italy.
Solitaire Chess
The difference we emphasize is that the game positions often give partial or even full credit for alternative choices. For example, here is the position at move 22 in the March 2014 Chess Life column, of a game that was played in 1934 by Fred Reinfeld, an earlier master teacher who wrote many great books until the 1960's:
\includegraphics[width=2.4in]{ReinfeldPosition.png}
The top score of 5 points goes to the capture move 22.axb4, but the alternatives 22.Nc6 and 22.Ne6+ are deemed almost as good, worth 4 points each, while the non-capture move 22.a4 still gets 2 points. Several other game turns have 3-point partial credits. At the end is a chart connecting your total score over all the moves to the standard chess rating scale devised by Arpad Elo, for instance 81--94 points is deemed the range of a 2200--2399 master player such as myself, while 36--50 is for a ``good club player'' with 1600--1799 rating.
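The assessment chart works as a simple interval lookup from total score to rating band. Here is a minimal sketch in Python; only the 81--94 = 2200--2399 and 36--50 = 1600--1799 bands come from the column itself, and the other cut-offs are illustrative placeholders we made up:

```python
# Sketch of a "Solitaire Chess"-style score-to-rating chart.
# Only the two bands marked "from the column" are from Pandolfini's
# actual chart; the remaining cut-offs are illustrative placeholders.
BANDS = [
    (95, 110, "2400+"),         # placeholder
    (81, 94, "2200-2399"),      # master (from the column)
    (66, 80, "2000-2199"),      # placeholder
    (51, 65, "1800-1999"),      # placeholder
    (36, 50, "1600-1799"),      # good club player (from the column)
    (0, 35, "below 1600"),      # placeholder
]

def score_to_rating(total):
    """Map a total Solitaire Chess score to an Elo-style rating band."""
    for lo, hi, label in BANDS:
        if lo <= total <= hi:
            return label
    return "out of range"
```

For instance, `score_to_rating(85)` lands in the master band, while `score_to_rating(40)` gives the good-club-player band.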
Pandolfini's judgment goes into setting both the partial credits and the overall assessment scale. Although chess positions often have 30--50 legal moves or even more, there are typically at most 3--5 moves worth considering, so this is like a standard multiple-choice test in that way. The partial credits, however, are more typical of ranking applications such as judging the value of search-engine hits, where there are 10, 20, 30, or hundreds or thousands of choices to consider. Our topic is about having the best of both kinds of application, and how to do the assessment scientifically.
But Let's Have Fun
Well, we guess you didn't come to a blog to take an exam, so we'll try to make at least the first part fun, before we introduce more ``strategic'' questions with partial credits. You are on your honor not to Google the answers---we can tell, of course; we won't tell you how we know, but our heartbleeds for you.
\newpage
Multi-Choice
OK, more serious now. Start your engines. Actually in chess, ``start your engines'' would mean either you are cheating, or you are playing in the InfinityChess Freestyle tournament, which finishes tomorrow.
Getting Help With Judgment Calls
We think each of our latter six ``multi-choice'' questions has a clear best answer, but our judgment comes from perspectives in our field. For instance, ``structural'' complexity came with a specific meaning apart from algorithmic and practical considerations. Even granting that meaning, arguments can be made for several answers to the last question---all except the one that is false on current knowledge. For example, random-oracle results used to be considered stronger evidence than is commonly ascribed to them now.
We could have included catch-all ``some of the above'' answers as in our first set. However, this would miss our feeling of a pecking order even among the non-optimal answers. Again with reference to the last question, random oracles and complete languages are ``structural'' while the history of classifying problems is not, and between the first two, lacking a completeness level is not generally evidence of being tractable. Hence we see the possibility of better assessment by giving different partial credits to these answers.
An even more quantitative option is that we could ask the test-taker to rate each statement on (say) a 0--5 scale. This would be just like asking the takers to estimate the partial credits themselves. We could then score according to distributional similarity to our own assignment, weighting closeness on the best answers the most. Of course this style of grading is most appropriate to judging search engines, based on an expert reference assessment of the importance of the various ``hits'' returned. And it is also like simulating the creation of ``Solitaire Chess'' itself---more than just looking for the best move, which is what we do when we actually play chess. Thus the teacher has a harder task than the player.
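One concrete way to realize ``distributional similarity, weighting closeness on the best answers the most'' is to penalize deviation from the expert's reference ratings in proportion to those ratings. The exact weighting below is our illustrative assumption, not a fixed scheme:

```python
# Hypothetical scoring rule for the "rate each option 0--5" format:
# penalize the absolute deviation from an expert reference rating,
# weighted by the reference rating itself, so that agreement on the
# best answers counts the most. The weighting is our own assumption.
def similarity_score(taker, reference, max_rating=5):
    """Return a score in [0, 1]; 1.0 means perfect agreement."""
    assert len(taker) == len(reference)
    penalty = sum(r * abs(t - r) for t, r in zip(taker, reference))
    # Worst case: each rating is as far from the reference as possible.
    worst = sum(r * max(r, max_rating - r) for r in reference)
    return 1.0 - penalty / worst

# Reference credits like those at Reinfeld's move 22: 5, 4, 4, 2, 0.
reference = [5, 4, 4, 2, 0]
```

A taker who reproduces the reference exactly scores 1.0; one who rates only the worst option highly scores near 0.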
The most ambitious goal is to turn the process around by making backwards inferences about the values of questions from the aggregated selection of many well-informed takers. In chess this would be like judging the value of a move based on the proportion of strong players who choose it. Nowadays this is regarded as overruled by the judgments of strong computer programs, notwithstanding the issue that players' ``book knowledge'' of past games makes their choices less independent than among test takers. However, the ability in chess to correlate players' judgments with computer values of moves, and map the distributions, may help us make inferences about ``objective value'' from the distributions of the test-takers. This plays into quantifying the wisdom of crowds along lines discussed toward the end of the Distinguished Speaker lecture given last week by Lance Fortnow on his visit to Buffalo. At least this is our motive for making tests more like strategic chess.
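As a toy model of this backwards inference, one could weight each taker's choice by some measure of how well-informed they are (say, their score on the other questions) and rescale the weighted vote shares onto the 0--5 credit scale. Everything in this sketch---the weighting rule and the rescaling---is our illustrative assumption:

```python
# Toy model of inferring partial credits from aggregated choices:
# weight each taker's vote by a per-taker "informedness" weight, then
# rescale the weighted vote shares to a 0--5 credit scale. Both the
# weighting and the rescaling are illustrative assumptions.
from collections import defaultdict

def infer_credits(responses, weights, top_credit=5):
    """responses: chosen option per taker; weights: one weight per taker."""
    votes = defaultdict(float)
    for choice, w in zip(responses, weights):
        votes[choice] += w
    peak = max(votes.values())  # most popular choice gets full credit
    return {opt: round(top_credit * v / peak, 1) for opt, v in votes.items()}
```

For example, with four equally weighted takers choosing `["axb4", "axb4", "Nc6", "a4"]`, the move `axb4` is inferred at 5 points and the other two at 2.5 each---crude, but it shows the direction of the inference.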
Open Problems
What partial-credit values would you assign to our complexity questions?
Should multiple-choice tests be more like ``Solitaire Chess''? Does one obtain deeper and better assessment that way? Is the difference important enough to matter for massive online courses?
Here are the answers to our April Fool's anagram quiz, besides ``Pearl Gates'' = Peter Sagal and ``Slack Laser'' = Carl Kasell: