- Describing the Turing Tournament
- Comparing the Turing Test and the Turing Tournament
- Devising new rules, and non-linguistic competitors
- But is it intelligent?
MH: Are you a computer?
MH: You’d be surprised how many fall for that one.
Dell: Not me.
MH: What’s fifty-six times thirty-three?
Dell: One thousand eight hundred forty-eight.
MH: You’re pretty fast!
Dell: Those are my favorite numbers.
The Turing Test was designed to be an operational test of whether a machine can think. In Stuart Shieber’s words:
“How do you test if something is a meter long? You compare it with an object postulated to be a meter long. If the two are indistinguishable with regard to the pertinent property, their length, then you can conclude that the tested object is the given length. Now, how do you tell if something is intelligent? You compare it with an entity postulated to be intelligent. If the two are indistinguishable with regard to the pertinent properties, then you can conclude that the tested entity is intelligent (pg 1).”
For a machine to be deemed intelligent according to the Turing Test, human judges must be unable to reliably distinguish it from a human after some lengthy text-only conversation. I don’t think a machine is going to pass it any time soon, and when one does, it’ll be pretty self-evident that we’re dealing with a machine that can think.
Anyone who disagrees that a full and proper Turing Test is a stringent enough test of intelligence should read Robert French’s excellent discussion of the kinds of very human and culturally rooted subcognitive processes that would have to be going on in the machine in order for it to pass. His criticism is that the Turing Test “provides a guarantee not of intelligence but of culturally-oriented human intelligence”, i.e. that it sets the bar too high, or too narrowly. This is a subtler variant of the obvious point that human beings who don’t speak English would fail a Turing Test with English-speaking judges. In other words, the Turing Test is a sufficient but not necessary test of intelligence, because you would have to have a certain subcognitive makeup in order to pass it, on top of being intelligent.
The beautiful thing about the Turing Test is that there’s nothing about it that’s specific to machines. Indeed, Turing’s original idea for the Imitation Game, as he termed it, was based on a parlour game where the judge attempted to distinguish male from female players. This essay is an attempt to broaden the scope of the Turing Test from being a binary and culturally-rooted test of human intelligence to something vaguer and less unidimensional.
Let’s make this idea somewhat more concrete, and considerably more vivid. Imagine that a small, slimy green-headed alien lands on your lawn right now, travelling in a spaceship the size of a Buick. Assume that the alien demonstrates its extraterrestrial credentials to your satisfaction by whisking you to its home planet and back before breakfast. It bats away the impact of a few .357 rounds with its forcefield and patiently replicates household objects for your amusement. It would seem niggardly to refuse a being that has mastered faster-than-light travel the ascription of intelligence when most humans can’t tie their shoelaces in the morning without a dose of caffeine. So we might be moved to patch the Turing Test in some ad hoc manner to read:
“Any entity that cannot be reliably distinguished from a human after a lengthy text-only conversation, OR that has mastered faster-than-light travel and can withstand a .357 round at close range, can be considered to be intelligent.”
It’s clear that this lacks the pithy generality of Turing’s original formulation, and we’d have to do quite a lot more work to restrict the scope of the above to exclude asteroids. Perhaps over time, our super-intelligent alien will learn to speak English with a flawless cockney accent, and will pass the standard Turing Test, rendering this discussion moot. But in the meantime, before it has learned to speak a human language, we are faced with a manifestly intelligent being that fails our gold standard test for intelligence. The background aim of this whole essay will be to consider a new version of the Turing Test that overcomes the inherent human- and language-specific parochialism of the original. That way, our intelligent alien might pass, without having to learn to speak English with a cockney accent.
Along the way, it may be that our reformulated test provides a more constructive goal and yardstick by which to direct and evaluate progress in AI research than the standard Turing Test. Perhaps the primary limitation of the standard test is that it’s difficult to restrict its difficulty or scope without losing everything that’s interesting about it. And since even our current best efforts are a long way from success, the gradient of improvement is almost flat in every direction, making it difficult to discern when progress is being made in the right direction. This makes it difficult for machines to bootstrap themselves by training against each other, requiring lots of labour-intensive profiling against humans. Finally, the current test is very language-orientated, and undesirably emphasizes domain knowledge.
Describing the Turing Tournament
I’ll term this new version of the Turing Test the ‘Turing Tournament’, to reflect its competitive round robin form. Unlike the original Turing Test, the Turing Tournament will not yield a definitive, objective yes/no answer, but rather a ranking of the entrants, where the human players provide a benchmark. A lot of the details I’m proposing will probably need considerable refinement. Here are the organizing principles of a Turing Tournament:
- The organizers of each tournament decide what the domain of play will be, e.g. a chessboard, text, a paint program, a 3D virtual reality environment, binary numbers, or some multidimensional analogue stimuli.
- Every ‘player’ (within which I’m subsuming both human and machine variants) is competing in a round robin competition, and will play every other player twice, once as the ‘teacher’ and once as the ‘student’.
- Every bout will have two players, a teacher and a student. Play proceeds in turns, with the teacher going first. Play terminates when the allotted time has been exceeded, or when some terminating criterion specified by the teacher has been satisfied. Neither player will know the identity of the other player.
- Before the bouts begin, every player is given access to the domain of play so that they can construct their own set of rules that will operate when they are the teachers in a bout.
- The organizers of each tournament determine the scoring for bouts that terminate relative to bouts whose time elapses. We will consider some possible scoring systems later.
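The round robin structure in the principles above can be sketched in a few lines of Python. This is only an illustration: the player names and the `schedule_bouts` helper are hypothetical, and the only thing taken from the proposal is that every player meets every other player twice, once in each role.

```python
from itertools import permutations

def schedule_bouts(players):
    """Return every (teacher, student) pairing for a round robin.

    Each ordered pair appears exactly once, so any two players meet
    twice: once with each of them in the teacher role.
    """
    return list(permutations(players, 2))

# Hypothetical entrants: two humans and two machines.
players = ["human_1", "human_2", "machine_1", "machine_2"]
bouts = schedule_bouts(players)

for teacher, student in bouts:
    print(f"teacher={teacher} student={student}")
```

With n players this yields n × (n − 1) bouts, so the number of bouts grows quadratically with the number of entrants.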
These sound like strange rules. What kind of games could be played? Why does each teacher get to set their own rules? Do teachers get rewarded or punished if a student is able to reach criterion for their bout?
I think the easiest way to illustrate what I have in mind is with a concrete example. Imagine the following scenario:
- A big room with lots of people sitting at computers. The people are the human players. The machine players are sitting inside a big server at the back of the room.
- The domain for this competition is a Go board, a 19×19 checkerboard with black and white pieces. Although all bouts in this tournament will take place on a Go board, the rules and goals of each bout will be up to the teacher of that bout.
- Let us peer over the shoulder of a human player, currently in the role of student, trying to determine what the rules of the bout are, and play so that the bout terminates before running out of time. Neither we nor they know whether the other player is human or machine.
- The board is blank initially.
- As always, the teacher makes the first move. They place a horizontal line of 19 black pieces in the bottom row of the board.
- Now it is the student’s turn. They have no idea how the bout is scored, what the aim is, what constitutes a legal move, how many moves there will be or whether there will be multiple sub-bouts. All of that is up to the teacher.
- Working on the assumption that the teacher wants the student to play white, the student lays down a single white piece in the top left corner.
- The teacher removes the white piece, and replaces it with a horizontal row of white pieces just above the existing horizontal black line, and another horizontal row of black pieces above that. So now there are three rows of pieces filling up from the bottom of the board: black, white and then black.
- The student decides that the removal of their white piece in the corner was a signal that their future moves should consist of placing an entire row of pieces on the board at a time. The student tries placing an entire row of white pieces in the top row of the board.
- The teacher again removes all the student’s pieces, and replaces them with another row of white pieces and another row of black pieces. The bottom of the board consists of black, white, black, white and black stripes.
- The student reasons that their next move should be to place a row of white pieces above the most recent row of black pieces to continue the stripy pattern.
- Gratifyingly, the teacher leaves the row of white pieces in place, and adds a black row above it, as expected.
- The two players continue to take turns until all but the top row has been filled with alternating black and white rows.
- Now, it is once more the teacher’s turn, and the student wonders whether the last row will be filled in. Instead, the board blanks again, and the teacher places a vertical column of white pieces on the right hand side.
- The student tries tentatively to place an adjacent column of black pieces, deciding that this sub-bout involves creating black and white vertical stripes, with the black and white players reversed.
- As it turns out, this assumption appears to be correct, since the teacher does not remove the student’s pieces, and together they quickly build up an alternating vertical stripe that moves leftwards.
- When only the last column remains to be filled in by the teacher, the bout has reached criterion, and the student moves on to the next bout, with a different player.
- Upon inspecting the scores later, our human player (the ‘student’ in the bout just described) finds that they had scored highly on that bout, but not as highly as some. Some of the machine players had failed to see a pattern at all, and had been putting pieces down more or less at random. These players did the worst, since the scoring for this tournament is a function of the total number of turns taken to finish the bout, as well as the number of errors made by the student.
Like our hero, the best players at this bout had also quickly deduced that the pattern involved stripes. Their extra insight came after a few turns, where they tried placing multiple stripes down at once. As it turns out, there was nothing in the rules set by the teacher prohibiting this, and so they finished more quickly, earning a higher score.
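To make the scoring idea in this example a little more tangible, here is a minimal sketch of a student score that depends on turns taken and errors made. The formula, the `max_turns` budget, the `error_penalty` weight and the example numbers are all assumptions made up for illustration; the essay only says that the score is some function of those two quantities.

```python
def student_score(turns_taken, errors, max_turns=40, error_penalty=2.0):
    """Hypothetical student score: fewer turns and fewer errors is better.

    turns_taken is None when the bout ran out of time without reaching
    the teacher's terminating criterion; such a bout scores zero.
    """
    if turns_taken is None:
        return 0.0
    raw = (max_turns - turns_taken) - error_penalty * errors
    return max(raw, 0.0)

# Illustrative players (numbers invented for the example):
careful = student_score(turns_taken=20, errors=2)         # one stripe per turn
bold = student_score(turns_taken=8, errors=2)             # several stripes per turn
random_player = student_score(turns_taken=None, errors=30)  # never saw the pattern

print(careful, bold, random_player)
```

Under these assumed weights, the player who lays several stripes per turn outscores the one-stripe-at-a-time player, and both outscore a player who never reaches criterion.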
It seems reasonable to imagine that most humans would quickly figure out the stripy pattern, and some would eventually think to lay down multiple stripes at a time. Might a machine? Perhaps soon.
This is intended as a toy example. The rules of the bout are pretty simple, but I think they would discriminate somewhat between intelligent and not-so-intelligent players. The key point to note is that each player would play twice against every other player, once as the teacher and once as the student playing within the teacher’s rules. Perhaps some bouts are too hard, and some are too easy. But en masse, the rankings should discriminate quite finely between players, even between human players. The exact details of the scoring, especially how teachers are scored, and how teachers pre-specify their rules, are clearly going to be crucial. It will suffice for now to say that students should probably get points for satisfying the criterion of a bout quickly, and teachers should be rewarded for devising discriminative games, that is, games that only intelligent players can solve. I will defer further discussion of these topics until later.
Comparing the Turing Test and the Turing Tournament
In discussing the idea of an Inverted Turing Test (more below), Robert French states that:
“All variations of the original Turing Test, at least all of those of which I am currently aware, that attempt to make it more powerful, more subtle, or more sensitive can, in fact, be done within the framework of the original Turing Test.”
Is the same true of the Turing Tournament? I think the answer is both yes and no. In fact, you could think of a Turing Tournament as a kind of generalization of the Turing Test. That is, the original Turing Test could be treated (more or less) as a Turing Tournament where the domain of play is restricted to text, and the bouts terminate if the teachers/judges are satisfied they are talking to a human. It wouldn’t be quite the same, since here the players would double up as judges and the judges would double up as players. In other words, the machines would also themselves be making judgements about the humanness of both each other and the humans – an ‘Inverted Turing Test’. In its current formulation, where every player plays every other player as both teacher and student (i.e. judge and player), a Turing Tournament would really be a strange hybrid of both the Inverted and standard Turing Tests.
The idea of an Inverted Turing Test has been proposed before:
“Instead of evaluating a system’s ability to deceive people, we should test to see if a system ascribes intelligence to others in the same way that people do … by building a test that puts the system in the role of the observer … [A] system passes [this Inverted Turing Test] if it is itself unable to distinguish between two humans, or between a human and a machine that can pass the normal Turing Test, but which can discriminate between a human and a machine that can be told apart by a normal Turing Test with a human observer.”
French ingeniously showed that this Inverted Turing Test could be simulated within a standard (if somewhat convoluted) Turing Test. In contrast, it seems clear that an unrestricted Turing Tournament could not be fully simulated by a Turing Test because the potential domains of play are limitless. So although one might imagine instantiating the Go domain by communicating using grid references within a standard Turing Test, it seems clear that there would be no way to run a domain of play such as a 3D virtual reality environment within a standard Turing Test using text alone. The principal advantage of widening the domain of play from text-only in this way is to allow players to pass some kinds of Turing Tournaments without speaking English, or any language at all. As a result, it seems reasonable to think of the Turing Tournament as (more or less) a superset of the Turing Test, or if the reader prefers, at least a redescription of it with unrestricted domains of play. I find this Ouroborean quality quite pleasing. Either way, we can agree that most of the original Test’s merits and stringency should still be present in the Tournament version, depending on the way a particular Tournament’s domain of play and restrictions are set up. This does raise the important question of whether a Tournament victory would be as convincing a demonstration of intelligence as a victory in a standard Turing Test – I will return to this below.
French also shows that the Inverted Turing Test could be passed by a simple and mindless program that would take advantage of the very subcognitive demands that make the original test so parochial and difficult to pass. In short, the machine could have a pre-prepared list of questions that have been shown to weed out machines in the past, such as
“On a scale of 0 (completely implausible) to 10 (completely plausible), please rate:
- ‘Flugblogs’ as a name Kellogg’s would give to a new breakfast cereal.
- ‘Flugblogs’ as the name of a new computer company.
- ‘Flugblogs’ as the name of big, air-filled bags worn on the feet and used to walk on water.
- ‘Flugly’ as the name a child might give its favorite teddy bear.
- ‘Flugly’ as the surname of a bank accountant in a W.C. Fields movie.
- ‘Flugly’ as the surname of a glamorous female movie star.”
By pre-testing lots of humans and machines to figure out what kinds of things people say, and machines fail to say, a simple but well-prepared machine could draw up a ‘Human Subcognitive Profile’. By comparing this to the responses of players, it would be an extremely effective judge in an Inverted Turing Test. There are two reasons why this strategy would not work in a Turing Tournament:
a) In the above specification, none of the players know which domain they will be playing in until the competition begins officially (after which the designer is barred from tweaking his machine). As a result, it would be impossible for the designer to create Human Subcognitive Profiles for every possible domain that the machine might find itself playing in a Tournament.
This same effect could perhaps be wrought in a standard Test by restricting the domain of conversation, but not telling the players before the competition begins what that domain will be.
b) In order to be successful, players have to be good as both teachers and students. As mentioned above, this is akin to holding both a standard and Inverted Turing Test. Even if the domain was known in advance, and even if it was possible to draw up a Human Subcognitive Profile for that domain somehow, such a machine would be exposed as a student.
Lastly, French asks whether the standard Turing Test might be modified to forbid the kind of subcognitive questions that underlie its cultural and species-specific parochialism. He concludes that the kinds of questions that probe “intelligence in general … are the very questions that will allow us, unfailingly, to unmask the computer”.
He may well be right. However, it may be that moving out of the text domain will dramatically reduce the scope of possible subcognitive shibboleths that human teachers could employ. Having said that, there will still be many possibilities for rules that would place human student-players at a big advantage. For instance, in the case of the Go domain, a cunning human teacher could choose to play by the rules of Connect4, which other humans might be much quicker to fathom. In the case of some kind of Photoshop canvas domain, humans could spell out words cursively, outwitting even the most seasoned OCR software. If there’s any kind of free-text involved, any of the subcognitive tricks designed for the standard Turing Test might be employed. In the case of a 3D virtual environment, human student-players will have a huge edge, though perhaps 2D or high-D worlds would level the playing field. One might hope that imaginative specification of domains could minimize such advantages, and that after 10 years of such competitions, machine programmers will almost certainly know to build in pre-loaded expert knowledge of Connect4, for instance, but the problem will clearly still remain.
[N.B. In order to ensure that the scales aren't conversely weighted too heavily against human players, it seems reasonable to allow all human players the use of a laptop throughout the Tournament.]
Maybe instead we should accept the possibility of subcognitive shibboleths, and embrace their utility instead as a means of cataloguing different kinds of conceptual schemes. There is a presumption inherent in the standard Turing Test that smartness can be measured on a one-dimensional continuum ranging from rocks to rocket scientists. In the case of the aliens that have travelled 4 million light years in a space ship built out of genetically-engineered quantum nanobits and powered by fermented mango juice, we could be pretty sure they’re intelligent, even if they were never to get the hang of English. It’s just that their conceptual schemes are different. In this case, we may find that there are cases where they think more like machines than like humans. Or possibly more like dolphins, or African grey parrots, or white mice or marmosets. If we’re able to set up a domain in a Tournament that everyone can play in, then we can expect that human student-players may not necessarily come out on top in all respects, even within the animal kingdom. We will return to this idea when we discuss Turing Tournaments between groups of individuals.
Devising new rules, and non-linguistic competitors
Besides extending the domain of play beyond text, the principal innovation of the Turing Tournament is in casting every player as both student and teacher.
It is clear enough what is required of the student player. When the bout begins, they have some idea of the kinds of interactions, puzzles and patterns that the domain presents. By interacting with the teacher player, they have to somehow fathom what the aim (i.e. terminating criterion) of the current bout is, and attempt to satisfy that. It might involve placing pieces on the board in some complex pattern, learning the structure of a maze, guessing at the next number in a sequence or optimizing some noisy function. Depending on the tournament, they may or may not be given feedback after each move:
- If they’re given a running score, they can attempt to learn how to maximise that reward.
- If no reward is given, but the teacher corrects incorrect moves, then the learning by imitation can be seen as a kind of supervised mapping or reconstructive learning problem.
- There may even be cases where no feedback is given whatsoever, such as when the bout requires the student to guess the next number in some sequence.
It is the teacher’s job to come up with new and inventive rules for bouts that challenge the student-players, and also to perhaps lead the student in the right direction. For the Tournament to work as intended, teachers should be intending to come up with the most discriminative bout rules they can.
Getting the incentive structure for the teachers right is therefore key. I expect that early scoring structures will contain loopholes that ingenious machine designers can exploit, but that over time, scoring structures that serve their purpose in a robust way will emerge. If our goal is to discriminate humans from machines, then this simple scoring system may work:
In effect, we’re rewarding players that seem human, and can devise rules that discriminate whether other players are human. This Tournament setup is the combined standard/Inverted Turing Test described above, which would not differ all that wildly in principle from the standard Turing Test if played in a text domain. Such Tournaments would encourage the kinds of subcognitive or culturally-rooted human-parochialism that we’re trying to avoid.
Perhaps this more general scheme will work instead:
What’s the point of all this complexity? If you’re a teacher, then you do best if you can design your rules such that only above-average players (whether human or machine) win in your bouts. There’s actually a penalty if you make your rules so easy that everyone can figure them out, and you’ll get zero points if no one can figure them out at all. When you’re a student, you want to be as smart as you can, and when you’re a teacher, you want to be as discriminative as you can. En masse, the community of competitors are striving to do as well as they can and to evaluate each other as well as they can.
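One way to picture the teacher incentives just described is the following sketch. It is not the scoring scheme itself, only a hypothetical function with the three properties named in the paragraph above: the teacher does best when only above-average players solve the bout, is penalized when everyone solves it, and gets zero when no one does. The `teacher_score` name, the `easy_penalty` weight and the use of a precomputed "overall skill" number per student are all assumptions.

```python
def teacher_score(results, easy_penalty=1.5):
    """Hypothetical teacher score for one rule set.

    results maps a student name to (overall_skill, solved_this_bout).
    overall_skill is assumed to come from elsewhere, e.g. the student's
    average performance across all the other teachers' bouts.
    """
    if not results:
        return 0.0
    mean_skill = sum(skill for skill, _ in results.values()) / len(results)
    score = 0.0
    for skill, solved in results.values():
        if not solved:
            continue  # unsolved bouts neither reward nor penalize the teacher
        if skill > mean_skill:
            score += 1.0           # an above-average player solved it: good
        else:
            score -= easy_penalty  # a below-average player solved it: too easy
    return score

# Four hypothetical students with overall skills 9, 7, 3 and 1 (mean 5).
discriminative = teacher_score({"a": (9, True), "b": (7, True), "c": (3, False), "d": (1, False)})
too_easy = teacher_score({"a": (9, True), "b": (7, True), "c": (3, True), "d": (1, True)})
impossible = teacher_score({"a": (9, False), "b": (7, False), "c": (3, False), "d": (1, False)})

print(discriminative, too_easy, impossible)
```

Because the penalty for a below-average solver is assumed to outweigh the reward for an above-average one, a rule set that everyone cracks ends up with a negative score, while an unsolvable one scores exactly zero.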
Inventively devising rules to favour intelligent over non-intelligent participants requires sufficient representational power to understand, let alone manipulate, your own rules, a rich theory of mind, as well as a generative good taste. Consider a Tournament played in the simple domain consisting solely of letterstring analogy problems, where the student is faced with problems such as:
“I change abc into abd. Can you ‘do the same thing’ to ijk?”
or in non-linguistic terms:
abc —> abd; ijk —> ?
Reasonable responses include ijl, ijd, or even abd.
Let us imagine that a player as cunning as Douglas Hofstadter has devised the following problem:
abc —> abd; mrrjjj —> ?
Peer at this for a moment – you won’t appreciate that this is somewhat fiendish unless you try it for a while yourself. Any ideas?
There’s no obvious pattern to the letters chosen on the right hand side, so mrrkkk seems kind of lame, and abd always feels lame. Well, how about if you try this one first:
abc —> abd; abbccc —> ?
Though your first thought may have been abbddd, doesn’t abbdddd seem so much nicer? It’s as though the successorship sequence of letters needs to be reflected in the increasing length of the letter groups (to use the FARG’s terminology). Now, let us turn back to:
abc —> abd; mrrjjj —> ?
Doesn’t mrrjjjj seem like a nice, reasonable solution now? Would you have considered it so nice before the previous example? Probably almost as nice. Would you have thought of it on your own, without the previous example? Probably, given some head-scratching.
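The competing readings of ‘do the same thing’ in these letterstring problems can be written down as a few tiny Python functions. This is only a sketch of three hand-picked candidate rules, with hypothetical function names; it does not choose between them, which is precisely the hard part, and it ignores edge cases such as the successor of ‘z’.

```python
from itertools import groupby

def letter_groups(s):
    """Split a string into runs of identical letters.

    'abbccc' -> [('a', 1), ('b', 2), ('c', 3)]
    """
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def render(groups):
    """Turn a list of (letter, length) groups back into a string."""
    return "".join(ch * n for ch, n in groups)

def successor_of_letter(s):
    """Replace the last group's letter with its alphabetic successor."""
    groups = letter_groups(s)
    ch, n = groups[-1]
    groups[-1] = (chr(ord(ch) + 1), n)  # no wrap-around handling for 'z'
    return render(groups)

def successor_of_length(s):
    """Keep the last group's letter but make the group one letter longer."""
    groups = letter_groups(s)
    ch, n = groups[-1]
    groups[-1] = (ch, n + 1)
    return render(groups)

def successor_of_both(s):
    """Advance both the letter and the length of the last group."""
    groups = letter_groups(s)
    ch, n = groups[-1]
    groups[-1] = (chr(ord(ch) + 1), n + 1)
    return render(groups)

for problem in ["ijk", "abbccc", "mrrjjj"]:
    print(problem,
          successor_of_letter(problem),
          successor_of_length(problem),
          successor_of_both(problem))
```

The literal rule gives ijl, abbddd and mrrkkk; advancing both letter and length gives abbdddd; and advancing only the length gives mrrjjjj. A teacher who poses abbccc before mrrjjj is, in effect, steering the student from the first rule towards the others.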
The point of this digression is to point out how an imaginative teacher can guide, plant ideas, manipulate, prime, coax and lead the student by example in such a way that an intelligent player would almost certainly get the right answer, but there are almost no extant machines that would stand a chance. Besides having the sheer representational flexibility to deal with even barebones analogies such as the one above, a really intelligent player would be using the first few turns to gauge the teacher, get a sense of what kinds of solutions are admissible, and would probably be relying on Gricean maxims wherever possible.
What if your alien doesn’t know anything about Gricean maxims? Or doesn’t understand concepts like tournaments, rules, intelligence, machines, scores or games? We’ve finished outlining how a Tournament might be run that might require less domain knowledge and linguistic ability than the standard Turing Test. But one striking pragmatic problem remains, which becomes apparent when we consider our newly-arrived green visitor. If the alien doesn’t speak English, how are we going to explain the idea of the Turing Tournament to him so that he can participate?
Following Minsky, I think that we will be able to converse with aliens to some degree, provided they are motivated to cooperate, because we’ll both think in similar ways in spite of our different origins. Every evolving intelligence operates within spatial and temporal constraints, suffers from a scarcity of resources (and presumably, competition), must develop symbols and rules, and must have thought about computation and machine learning in order to be able to build spaceships. Perhaps notions of games, intelligence, scores and tournaments are only relevant in a society of individual entities that compete with each other for resources, and a hive mind, single monolithic being or other unimaginable entity might not need such concepts. In that case, you wouldn’t have any more luck using the standard Turing Test on such a being.
Will we have much more luck with machines? Not unless we start small. At the moment, the state of the art in artificial intelligence wouldn’t do very well in most of the domains we’ve discussed, and would struggle especially when trying to generate new rules of its own. Sadly, very few researchers have focused on generative heuristics for curiously discovering things that are interesting simply for their own sake; Lenat’s Automated Mathematician, which sought interesting mathematical concepts, is a rare exception. In order to stand a chance in a Turing Tournament, much work needs to be done on curiously discovering interesting things that could serve as the basis for a rule set in a new domain. Good, that is, discriminative, rule sets for a Turing Tournament bout might consist of a difficult but ultimately guessable sequence of numbers based on a funny arithmetical pattern, or the kind of letterstring analogy problem that elicits an ‘aha’. Better still, teacher players that can lead an intelligent student player down a suggestive road towards the terminating criterion through tutorial or warm-up sub-bout problems will be at a tremendous advantage, since half the problem for the student consists of figuring out what their goal is supposed to be.
But is it intelligent?
Let us recall Shieber’s pithy test for intelligence:
“Now, how do you tell if something is intelligent? You compare it with an entity postulated to be intelligent.”
We’ve replaced that with an intellectual obstacle course. In asking teachers to devise rules for their bouts, we are effectively asking players to define their own micro-test of intelligence (since being able to do this is surely a sign of good taste?). They must then be able to convey the parameters of that test such that other intelligent student players can figure out how to pass it, perhaps by creating lead-up sub-bouts, internalizing what the student player is probably thinking and so guiding the student players’ intuitions in the right direction. Finally, as students, the players must demonstrate in turn that they can flexibly assimilate what their goal should be, and then be able to get to it.
So although we might imagine some narrow machines that could best humans in certain kinds of puzzles or computation, it seems less likely that a brute force machine player would also do well on Bongard problems or letterstring analogies, or be able to devise ingenious, fun and discriminative rules for bouts. This new generative aspect is intended to tap into the kind of creative, generative, playful, inventive or aesthetic faculty that humans display, as well as the ability to form a rich internal model of the student player’s state of confusion, and guide them towards a solution. In this respect, it borrows the idea of a dialog or gentle interrogation from the original Test, but allows for the translation of that dialog to new domains.
Bringing this back to Turing’s original question, we can finally ask, ‘if a machine were to score higher than some of the humans in a Turing Tournament, would we definitely be willing to call it intelligent?’ The answer could depend on a few factors:
- Let us assume that the Tournament is well-planned, that the human competitors are well-chosen, that no independent experts can find any scoring loopholes or weaknesses in the organization of the Tournament, and that the result is replicable. If any of these conditions are not met, we will not consider the Tournament to be well-run.
- If the domain is too restrictive, then there may be a dearth of interesting rule sets that can be devised. In this case, a good player won’t do much better than a poor player, and this wouldn’t be an interesting result.
- Even if the domain is a rich one, such as letterstring analogy problems, it could be that a highly specialized program like Copycat could outperform many humans. Unless the success is relatively domain-general, you’ve shown what you probably knew already, i.e. the machine is exhibiting some domain-specific proto-intelligence.
- At that point, we would probably want to analyze the machine’s performance. Did it do better as a teacher or student? Was it simply very good at certain kinds of problems? Was there some simple trick to its way of devising problems which, once exposed, would clue in future human players in a rerun of the Tournament?
- Could it pass a standard Turing Test?
Let us imagine that a machine is designed which is a poor teacher player, but a good student player, particularly in a couple of abstract limited-interaction domains like letter strings, number sequences, Go boards and cryptography, but that it can’t pass the Turing Test. Is it intelligent? Somewhat? We’ve forfeited the no-frills and no-free-parameters yes/no answer that you get from a Turing Test, but we now have a much richer set of data with which to try and place this machine in the space of all possible minds. We have a more finely-graded multi-dimensional scale. Our machines can bootstrap themselves by competing amongst themselves without human intervention – specialist teacher machines that are good at discovering generative heuristics can be used to train specialist student machines that are good at problem solving, and vice versa. So in forfeiting our neat yes/no answer, we’ve gained a great deal.
Perhaps most importantly for the field of AI, we can now attempt to scale the enormous subcognitive iceberg of the mind incrementally, using ever more complex Turing Tournaments as yardsticks. In time, perhaps this will lead back towards the Turing Test as the final summit.