FUNNY PAGES

With the kind permission of Taylor & Francis, the following humour article from 2000 is reprinted from Connection Science, 12, 91-94.

Turning the Tables on the Turing Test: The Spivey Test

Michael Spivey
Cornell University

After several decades of research in Artificial Intelligence (AI) (e.g., Turing 1950, Rosenblatt 1961, Winograd 1972, Rumelhart & McClelland 1986), and even in comparative cognition (e.g., Schusterman et al. 1986, Zentall 1993, Hauser 1996), the cognitive, neural, and computational sciences are still loath to let go of their markedly anthropocentric criteria for "intelligence." Indeed, the only non-subjective evidence that humans are thinking reasoners at all is the mere fact that most of them vehemently claim to be thinking reasoners. Of course, it is trivially easy to program a computer to insist that it is an intelligent, thinking reasoner as well. Rather than allow a one-line BASIC program to be accepted as 'intelligent', most researchers would prefer to set the bar a little higher. Therefore, a more stringent test is necessary.
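
(For the sceptical reader, a minimal sketch of such a program is given below, rendered here in Python rather than BASIC, and with a boast of my own invention; it passes the one-line test, if nothing else.)

    # A hypothetical one-liner that does nothing but insist, indefinitely, that it thinks.
    while True: print("I am an intelligent, thinking reasoner.")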

Alan Turing (1950) provided that test. In the Turing Test, a human judge communicates, via a computer terminal, with an AI conversation program and with a human. If the human judge cannot tell which one is the human, then the AI has passed the Turing Test -- and may as well be considered as intelligent, and capable of thought, as the human is. In a recent Turing Test tournament (Loebner 1999), the best AI was rated by the judges as 11% Turing, or humanly intelligent. This may not seem a very impressive success rate, until one considers the success rate of the best human. The best human was rated only 61% humanly intelligent! (Across all the human participants, the average Turing rating, or human-like intelligence score, was 50%.)

The obvious problem with all of this is the glaring prejudice toward human-like reasoning as the benchmark of intelligence. Why is computer-like reasoning not also put on such a pedestal? If the results of the recent Turing tournament are any indication, computer-like intelligence is certainly a prominent form of reasoning (among AI programs as well as humans)!

In the work presented here, this prejudice has been remedied. A test inspired by the Turing Test was designed in which a human judge communicates, via computer terminal, with a computer program and with a human. The important difference is that instead of the computer program contestant struggling to appear human-like in its communications, the human contestant is struggling to appear computer-like in his or her communications. If the human judge is unable to determine whether the human contestant is an AI or a human, then the human can be considered as 'intelligent' as the computer is.

The human judges were 12 graduate students in the cognitive studies programme at Cornell University. The human contestants were 120 Cornell undergraduates from a variety of majors. The AI contestants were a collection of computer programs with varying levels of AI: (1) the MATLAB Command Window; (2) the Unix program 'Zippy'; (3) Weizenbaum's (1966) Rogerian therapist program, ELIZA; (4) an interactive version of Chamberlain and Etter's (1984) "free verse" poetry program, Racter; (5) Winograd's (1972) SHRDLU; and (6) Elman's (1990) simple recurrent network. Importantly, none of the human judges or contestants were told which computer programs were participating. Each judge participated in 10 sessions in which he/she communicated with one (randomly selected) computer program and one human; order was counterbalanced across the 10 sessions. At the end of each session, in a two-alternative forced-choice paradigm, the judge picked which one of that session's conversants appeared to display computer-like intelligence. Each conversation lasted exactly 15 minutes.
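
For the methodologically curious, the session schedule can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual software used: the function name is mine, 'order counterbalanced' is interpreted as program-first versus human-first, and the assignment of the 120 human contestants to sessions is omitted.

    import random

    programs = ["MATLAB Command Window", "Zippy", "ELIZA", "Racter", "SHRDLU", "SRN"]

    def schedule_for_judge(n_sessions=10, seed=None):
        # Each of a judge's sessions pairs one randomly selected computer program
        # with one human contestant; the order of conversants (program first vs
        # human first) is counterbalanced: half the sessions each way, shuffled.
        rng = random.Random(seed)
        orders = (["program first"] * (n_sessions // 2) +
                  ["human first"] * (n_sessions // 2))
        rng.shuffle(orders)
        return [(rng.choice(programs), order) for order in orders]

    for session, (program, order) in enumerate(schedule_for_judge(seed=1), start=1):
        print(f"Session {session}: {program}, {order}")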

The complete transcripts comprise a considerable amount of data. (I considered subjecting the transcripts to a 'content analysis', but then I realized I did not know what the hell a 'content analysis' was.) Some illustrative excerpts are given below. Examples (1) and (2) show the same judge (a philosophy graduate student) interacting first with a computer program (the MATLAB Command Window) and then with a human (a neurobiology undergraduate). Clearly, certain responses to certain questions are dead giveaways of human-like intelligence. Example (3) shows an interaction between a psychology graduate student judge and Elman's (1990) simple recurrent network.

(1)
Judge: Why
MATLAB: R.T.F.M.
Judge: What does that mean?

(2)
Judge: Why
Human: Why what?
Judge: You're definitely a human.

(3)
Judge: Would
SRN: you
Judge: please
SRN: stop
Judge: anticipating
SRN: my
Judge: every
SRN: word!
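
Readers wondering how a network comes to anticipate a judge's every word, as in example (3), may find a minimal Python sketch helpful. It is an illustration in the spirit of Elman's (1990) simple recurrent network, not his implementation: the vocabulary is lifted from the transcript, the dimensions are arbitrary, and training only the output layer is a crude stand-in for his actual procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["would", "you", "please", "stop", "anticipating", "my", "every", "word"]
    V, H = len(vocab), 16                      # vocabulary size, hidden units

    W_xh = rng.normal(0, 0.1, (H, V))          # input-to-hidden weights
    W_hh = rng.normal(0, 0.1, (H, H))          # context (recurrent) weights
    W_hy = rng.normal(0, 0.1, (V, H))          # hidden-to-output weights

    def one_hot(i):
        v = np.zeros(V)
        v[i] = 1.0
        return v

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    seq = list(range(len(vocab)))              # the judge's sentence, one word per step
    for _ in range(500):                       # a few hundred passes over the sentence
        h = np.zeros(H)                        # Elman's context units start at rest
        for t in range(len(seq) - 1):
            h = np.tanh(W_xh @ one_hot(seq[t]) + W_hh @ h)  # input plus copied-back context
            p = softmax(W_hy @ h)              # predicted distribution over the next word
            # Gradient step on the output layer only: a crude stand-in for
            # Elman's training procedure, kept short for illustration.
            W_hy -= 0.5 * np.outer(p - one_hot(seq[t + 1]), h)

    h = np.zeros(H)                            # the trained network now "anticipates"
    for t in range(len(seq) - 1):
        h = np.tanh(W_xh @ one_hot(seq[t]) + W_hh @ h)
        print(vocab[seq[t]], "->", vocab[int(np.argmax(W_hy @ h))])

After training, the sketch prints each word of the judge's sentence followed by the network's prediction of the next, reproducing the alternation in example (3).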

Another illustrative example from the transcripts is a conversation between a linguistics undergraduate contestant and a computer science graduate student judge. The human contestant attempted to fool the judge by responding to every statement with 'SYNTAX ERROR!' The judge, remembering her Apple IIe from childhood, then gave the command 'RUN ZORK', and the contestant immediately conceded. Finally, one judge, after conversing with SHRDLU, was convinced that he was communicating neither with a human nor a computer, but with the spirit of Gautama Buddha himself.

As expected, when the total results were tallied, the undergraduates with the best performance at exhibiting computer-like intelligence (25% success) were those in the computer science major. (These are perhaps the same people who, as mentioned before, fail miserably at the Turing Test.) A notable exception to this greater success by computer science majors was a Chinese literature major who "out-computeresed" the MATLAB Command Window. However, it was later discovered that he had smuggled in a MATLAB manual. After receiving a query from a judge, he would rapidly flick through the pages of the manual and reply with an appropriate matrix or 'undefined function' response. The fact that this contestant claimed to have no understanding of the responses he was typing would, for some, be grounds for disqualification. However, his responses were perceived by the judge as competent, and in order to adhere to our own Spivey Test rules, we did not disqualify him. The undergraduate major with the overall poorest performance (0%) on the Spivey Test was business administration. For many of them, it was their first time touching a computer that wasn't also a cash register.

Of the computer program contestants, the MATLAB Command Window was the one most frequently identified as having computer-like intelligence (90%). The Unix program 'Zippy', which prints out random quotes from Zippy the Pinhead, was the computer program most frequently mistaken for having human-like intelligence (50%). In addition to revealing that some humans actually have computer-like intelligence, rather than human-like intelligence, these results suggest that perhaps future Turing Test tournaments should include 'Zippy' as an AI contestant.

In sum, the Spivey Test demonstrates that computer-like reasoning is, for the most part, just as difficult for humans to display as human-like reasoning is for an AI to display. Importantly, there appears to be no objective reason for bestowing one form of reasoning with the label 'intelligent', or 'capable of thought', and not the other. It is hoped that this work will contribute to the growing movement for kinder and more respectful treatment of non-biological life forms. (This hope is in direct opposition to Loebner's recommendation that 'If we want intelligent robots and computers to care for us, to fetch and to carry for us, as I do, then this belief system [that "Humans are gods"] will facilitate the matter.') Future work will conduct the obvious next permutations of the Turing and Spivey Tests, in which an AI program will be the judge.

Acknowledgements
These musings were supported by discussions with Daniel Richardson, Melinda Tyler, and Bob McMurray, and by funding from the Sloan Foundation. The data from this 'experiment', although they were not scientifically collected and are actually mere hypothetical data points, are consistent with a possible world that bears a considerable likeness to our own.

References

Chamberlain, W. and Etter, T., 1984. The Policeman's Beard Is Half-Constructed: Computer Prose & Poetry. (Warner Books)

Elman, J., 1990. Finding structure in time. Cognitive Science, 14, 179-211.

Hauser, M. D., 1996. The Evolution of Communication. (Cambridge, MA: MIT Press)

Loebner, H. G., 1999. The Loebner Prize for Artificial Intelligence. Competition held at Flinders University of South Australia. http://www.cs.flinders.edu.au/research/AI/LoebnerPrize

Rosenblatt, F., 1961. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. (Buffalo, NY: Cornell Aeronautical Laboratory)

Rumelhart, D. E. and McClelland, J. L. (eds), 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. (Cambridge, MA: MIT Press)