Assortative mixing for peer review

[this is an idea proposed by Adrian de Froment]

Let’s say you are a great scientific reviewer. You respond in a timely fashion, provide detailed and insightful comments, and your judgments about which papers should be published tend to match the judgments of other reviewers and of journal editors.

Being a good reviewer is like wetting yourself in a dark suit – you get a warm feeling, but no one ever notices. When it comes to having your own papers reviewed, your good karma is worth nothing. You’re just as likely to be assigned a slow, careless reviewer as everyone else, despite all your contributions to the community.

In evolutionary biology and network theory, the term ‘assortative mixing’ refers to a bias in favor of connections between network nodes with similar characteristics. Applied to peer review, that principle says good reviewers should be reviewed by other good reviewers. We should build karma directly into the system.

Note that I’m not in any way suggesting that papers written by good reviewers should be more likely to be published. But perhaps they should be more likely to be well-reviewed – that is, rapidly, fairly and carefully. Likewise, if you perennially submit your reviews late, you would be more likely to be reviewed by someone similarly tardy. What better cure for your anti-social behavior than a dose of your own medicine?

I’ve used speed of response here as the dimension that determines who gets assigned whom as a reviewer, but the objective function determining what constitutes a ‘good’ reviewer could be anything.

If this assortative mixing of reviewers were adopted, it would reward good reviewers, punish bad reviewers, and almost certainly lead to an overall improvement in the standard of reviewing.

Picking scientific reviewers

It’s hard to find good reviewers for scientific papers. Because it’s all anonymized (though that may be slowly changing), there’s no easy way to tell who’s a good reviewer and who’s a bad one. It’s easier to define what makes a bad reviewer than a good one:
  • tardy in responding with their comments [not relevant for the points I make below]
  • don’t read the paper carefully enough, and make fatuous criticisms
  • don’t read the paper carefully enough, and miss gigantic flaws
  • misjudge the import of a paper, and reject it for being less interesting/novel than it actually is
To some degree, these are unchanging realities and limitations of human nature. But we all respond and improve with feedback. Wouldn’t it be great if every reviewer had a public scorecard, summarizing their efficacy as a reviewer? (ideally aggregated across all the journals for which they’ve reviewed)
We can imagine various objective functions for what makes a good reviewer.
We could bin all a reviewer’s reviews into a 2×2 boolean matrix: your review recommendation × whether the paper was published. In the parlance of signal detection theory, a good reviewer will have many ‘hits’, where they recommended publication and the paper was indeed published, and many ‘correct rejections’, where they recommended against publication and the paper was indeed rejected. In other words, a good reviewer’s recommendations will be predictive of whether a paper went on to be published in that journal or not. A bad reviewer’s recommendations will often be at odds with the eventual fate of the paper.
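As a minimal sketch of how this 2×2 matrix might be tallied up (the review data below are invented purely for illustration):

```python
# A minimal sketch of the 2x2 matrix described above. The data are
# invented for illustration: True = recommended / published.
recommendations = [True, True, False, False, True, False]
published =       [True, False, False, False, True, True]

pairs = list(zip(recommendations, published))
hits = sum(r and p for r, p in pairs)                 # recommended, published
false_alarms = sum(r and not p for r, p in pairs)     # recommended, rejected
misses = sum(not r and p for r, p in pairs)           # rejected it, published anyway
correct_rejections = sum(not r and not p for r, p in pairs)

# A simple summary statistic: how often the recommendation matched the outcome.
accuracy = (hits + correct_rejections) / len(pairs)
print(f"hits={hits}, correct rejections={correct_rejections}, accuracy={accuracy:.2f}")
```

In practice a journal would aggregate this over hundreds of reviews, and would probably want a measure that corrects for base rates (most submissions are rejected), but the simple accuracy above conveys the idea.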
We could swap in a variety of other dependent variables, besides just the boolean ‘published in this journal or not’, such as:
  • how many times the paper was cited
  • correlation of the reviewer’s numerical ratings with those of other reviewers
  • how many rounds of revisions were necessary for the paper to be published
Of course, there are problems and limitations to this approach. For instance, it says nothing of the usefulness of the reviewer’s comments. And just because a paper eventually got published or was cited many times doesn’t necessarily mean that it’s genuinely good.
But on the plus side, it provides a measurement that reviewers can try to improve, which if optimized would be broadly good for the system. It’s hard to game. It would provide journal editors with more information when deciding which reviewers to pay attention to. And if it were made public, it would provide a way to incentivize and recognize good reviewing.

Running a psychology experiment

Designing an elegant psychology experiment is a fiendish business. Even after you’ve done that, running and implementing it correctly requires considerable attention to detail. I’ve attempted here to catalogue all the checks I (should) run before collecting data in a brand new experiment. Where possible, I’ve tried to think of extra points to worry about when collecting fMRI data.

The most important advice I can give is to run yourself in your own experiment at least a couple of times before you run anyone else. This is a pain – running in the same experiment over and over again is tiring, but it’s less of a pain than collecting 20 subjects’ data only to find that the data are worthless.

N.B. Make sure you run these tests on the testing computer that you’re going to use to actually collect the data – each machine is a complex ecosystem, and you can’t generalize success from one to another.

Obvious bugs

Firstly, running yourself will help you spot relatively obvious bugs, such as:

– Stimuli that should only show up once occurring more than once

– Stimuli that should be randomized appearing in alphabetical order, or in the same order each time

– Bugs that only occur when you actually interact with the experiment, rather than just running it passively

– Display glitches

– Input bugs. If you can, add a little test program at the beginning of each experiment that requires the subject to press the appropriate buttons, speak into the microphone you’re recording from etc. This confirms that they understand which buttons are which, and that everything is plugged in.
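Such a pre-experiment input check might look something like this minimal sketch; the key names (‘f’, ‘j’) and console prompts are assumptions, and a real experiment would read from your presentation software’s event queue rather than standard input:

```python
# A hypothetical pre-experiment input check. Before the real trials start,
# the subject presses each response key once, confirming both that they
# know which buttons are which and that everything is plugged in.
EXPECTED_KEYS = ["f", "j"]  # assumed response keys for this experiment

def check_inputs(get_key=input):
    """Prompt for each expected key; return True only if all were pressed."""
    for key in EXPECTED_KEYS:
        pressed = get_key(f"Please press the '{key}' key: ").strip().lower()
        if pressed != key:
            print(f"Expected '{key}' but got '{pressed}' - check the key mapping.")
            return False
    return True
```

Passing `get_key` in as a parameter also makes the check itself testable, which is in the spirit of this whole section.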

Secondly, running in your own experiment will give you a sense of what it feels like to be a subject in the experiment, and help you notice more subtle bugs:

– Is it impossibly long and tiring? My rule of thumb is that half an hour is usually too short – you might as well add some more trials to increase your power. More than 45 minutes starts to feel unbearable though. But this varies from experiment to experiment.

– Can you feel your brain working in the way you hope it should? Do the hard things feel hard in the right way?

– Can you sense that there’s some strategy that you want to use, but that would distort your data? If there’s an easy shortcut to doing your experiment, subjects will find and exploit it. In this case, you can directly instruct them to avoid it, but you’d be better off modifying the design so that they can’t. For instance, if you don’t want subjects to rehearse in between trials, add in some kind of distractor task to keep them busy and engaged.

– Is everything counterbalanced? Might there be lurking order effects (where one type of trial always occurs at the beginning or end of a phase, or always precedes/follows another type of trial)?
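One way to check for lurking order effects is to count trial-type transitions; in a well-counterbalanced sequence, every (previous type, next type) pair should occur about equally often. A minimal sketch, assuming trial types are simple labels:

```python
# A minimal sketch of an order-effects check: count how often each trial
# type follows each other type. In a well-counterbalanced sequence, the
# transition counts should come out roughly equal.
from collections import Counter

def transition_counts(trial_sequence):
    """Count (previous_trial_type, next_trial_type) transitions."""
    return Counter(zip(trial_sequence, trial_sequence[1:]))

# An invented trial order with a lurking bias: 'A' precedes 'B' too often.
trials = ["A", "B", "A", "B", "B", "A", "A", "B"]
counts = transition_counts(trials)
print(counts)
```

The same counting trick works for checking that no trial type clusters at the beginning or end of a phase, by binning trial positions instead of transitions.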

Timing glitches

It’s very hard to make sure that every piece of your experiment starts when it should and lasts for as long as it should. Before you do anything else, sit down with a pen and paper and calculate exactly how long each piece of your experiment is supposed to take, and store these as variables somewhere. Better still, they should be calculated automatically from your parameters.

Now, add code to your experiment that automatically times how long each piece lasts and when things are being displayed, making sure this all gets logged. Compare how long things are taking with how long you’ve calculated they should take. If you don’t trust your timing code, use a stopwatch to make sure it’s at least approximately right.
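A minimal sketch of this intended-vs-actual comparison, with `time.sleep` standing in for stimulus presentation and the durations chosen arbitrarily; the point is that the intended durations are calculated from the design parameters, and the actual durations are measured and logged on every trial:

```python
# Compare intended durations (computed from parameters) against actual
# measured durations, logging the drift on every trial.
import time

STIM_MS = 50    # assumed stimulus duration
ISI_MS = 100    # assumed inter-stimulus interval
N_TRIALS = 3
intended_trial_ms = STIM_MS + ISI_MS  # calculated from parameters, not hard-coded

durations_ms = []
for trial in range(N_TRIALS):
    start = time.perf_counter()
    time.sleep(STIM_MS / 1000)   # stand-in: present the stimulus
    time.sleep(ISI_MS / 1000)    # stand-in: inter-stimulus interval
    elapsed_ms = (time.perf_counter() - start) * 1000
    durations_ms.append(elapsed_ms)
    print(f"trial {trial}: {elapsed_ms:.1f} ms "
          f"(drift {elapsed_ms - intended_trial_ms:+.1f} ms)")
```

You should expect the measured durations to run slightly over the intended ones; what matters is that the overshoot is small, stable, and logged so you can see it.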

Modern computers have hundreds of background processes running (virus checkers, software updaters, email checking, self-refreshing webpages etc). Write a list of everything running, and make sure that as much as possible gets turned off before you run your experiment. Ideally, as little of this should be installed on your testing room computer as possible, but it’s hard to avoid on Windows.

Some experiment presentation programs (e.g. the Matlab Psych Toolbox, PyEPL) can self-calibrate their internal timing for each computer. Others have separate timing modes that are optimized for accuracy in duration vs onset (e.g. EPrime, PyEPL).

Avoiding disaster

There are a variety of ways in which things can be brought to a crashing halt:

– can subjects quit the experiment easily/accidentally? If possible, disable standard keyboard shortcuts like Alt-Tab, Alt-F4

– turn off the screensaver

– make sure that email notifications, software update warnings and the like won’t pop up in the corner of the screen, distracting the subject

– if your experiment were to crash for some reason (e.g. a power cut), can you resume where you left off?


Log everything. If you’re lucky, your experimental presentation software will do much of the work for you (e.g. PyEPL, EPrime). Either way, you should attempt to log enough data that you could reconstruct the exact stimuli, and all of the subject’s interactions with the experiment. This might seem like overkill, but it’s valuable for a number of reasons:

– You never know which analyses you might want to run in the future. You might suddenly care about reaction times, or the exact placement of the randomly moving dots on the screen – who knows? If you haven’t logged all the data you need, you’ll be out of luck.

– You may be worried that there’s a bug somewhere. Being able to cross-check one log against another is key to determining if/where there’s a problem.

– You might be logging the same information in multiple ways – that’s fine. Depending on the analysis, it might be much easier to process in one form or another.

– Don’t just log the low-level details. Logging every keypress and pixel color will certainly capture all the information you could ever need, but it will create an enormous amount of work to make sense of it all afterwards. If you have the high-level variables available in your experiment code, you might as well record them too to make your life easier later.
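A minimal sketch of a logger that records both low-level events (every keypress, timestamped) and high-level variables (trial number, condition), one JSON line per event so the session can be reconstructed later; the event names here are hypothetical:

```python
# Log both low-level events and high-level variables, timestamped,
# one JSON line per event.
import json
import time

class TrialLogger:
    def __init__(self, path):
        self.path = path
        self.events = []

    def log(self, **event):
        event["t"] = time.time()  # timestamp every event
        self.events.append(event)

    def save(self):
        with open(self.path, "w") as f:
            for event in self.events:
                f.write(json.dumps(event) + "\n")

logger = TrialLogger("session.log")
logger.log(kind="trial_start", trial=1, condition="stripes")  # high-level
logger.log(kind="keypress", key="space")                      # low-level
logger.save()
```

One JSON object per line keeps the log both human-readable and trivially parseable, and cross-checking it against your presentation software’s own log is a good way to catch bugs.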


The list of extra things to check for when running an fMRI experiment is pretty bewildering, but here’s a handy subset:

– make sure you view things through the projector, not just on the monitor in the control room. Who knows what devilry the projector might wreak as a result of display resolution interpolation, longer video cables, dying bulbs and the like?

– check your button boxes carefully. Some of them number in ascending order, some in descending order, some of them are re-programmable…

– our scanner emits a ‘!’ every time it starts to collect a new image. Some programs see this as a ‘LEFT SHIFT’ plus ‘1’, others as ‘!’. Make sure your experiment knows how to start each run in sync with the trigger, and can’t be set off by an inadvertent button box press.

– timing is critical with fMRI. If each of your stimuli take a few milliseconds longer than you intend, you could easily be out of sync by an entire image by the end of a long run, which would be enough to scupper all your analyses.

– it’s difficult to see anything in the bottom half of the screen in our head-only scanner. Make sure subjects will be able to see what’s going on during your experiment.
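The timing point above is worth a back-of-the-envelope calculation. With assumed numbers – a 5 ms overshoot on each of 400 stimuli, and one image collected every 2 seconds – the accumulated drift amounts to a full image by the end of the run:

```python
# Assumed numbers for illustration: a 5 ms overshoot on each of 400
# stimuli, with one image (TR) collected every 2 seconds.
overshoot_ms = 5
n_stimuli = 400
tr_ms = 2000

total_drift_ms = overshoot_ms * n_stimuli      # 2000 ms by the end of the run
images_out_of_sync = total_drift_ms / tr_ms    # a full image out of sync
print(images_out_of_sync)
```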

Analyze your data early

Once you’ve run yourself at least once, try analyzing your data. Since you aren’t a naive subject, your data probably won’t be publishable, but this is still a critical step for at least the following reasons:

– If you can run the analyses, then you know that you’ve got everything logged that you need.

– You can check that the most obvious, basic, uncontroversial effects are there. If they’re not, then that’s a serious problem. You can also confirm that subjects’ performance isn’t too far towards floor or ceiling.

– You might be able to check whether any of your stimuli are noticeably poorly-normed (i.e. they stick out when they shouldn’t)

– Sometimes, grievous logical errors slip through the design phase. Running an analysis is a really good way to pick up on such confusions. To be honest, running an analysis on fake (i.e. synthetic) data would probably work just as well, but it can sometimes be more work to generate good fake data than to collect a small bucket of real data.

Ask your friends

Finally, once you’re pretty sure that things are working, try running a few of your friends or colleagues as subjects before anyone else. You can be sure that they’ll pay attention to your instructions, try hard, and you may be able to get useful feedback about how it feels to run in your experiment as a naive subject. That way, if the data from your first few subjects aren’t the way you’d hoped, you can be more confident that it’s not just because indolent or surly Psych 101 students were chatting amiably on the phone while doing the experiment.

Version control

If you’re not using a version control system to keep track of your experiment scripts, you’re making life harder for yourself in a dozen ways. Here are the key benefits:

– you don’t need to keep saving your files as experiment1.m, experiment2.m, experiment3.m… The version control system will keep track of all the different versions, so you can see what you’ve changed at every point

– if you’re working on multiple computers, or there are multiple people all changing things, you can keep things synchronized across all these computers. No more carting around USB keys.

– your experiment is always backed up

To be honest, if you’re not using version control for almost everything that you write or program on your computer, then you fall into the same category as people who want to be a world-class chef but refuse to abandon their tried-and-tested approach of an open fire and a flint axe.

It’s not worth putting the experimental data into the version control system, since the data won’t be frequently changed and updated. Instead, just backing these up to an external hard drive is probably sufficient.


Mostly, ‘experience’ results from having done things enough to make all the common mistakes. You could then say that ‘competence’ is when you institute procedures that make those mistakes unlikely to happen again in the future. My aim in writing this was to help you find ways to make the common mistakes unlikely without having to make them all first yourself, so that you can be competent without being experienced.

If you can think of anything I left out, please point it out in the comments.

The Pittsburgh EBC competition

Try and picture the scene: you’re in a narrow tube in almost complete darkness, there’s a loud thumping noise surrounding you and you’re watching episodes of the 90s sitcom, ‘Home Improvement’, with Tim The Tool Man Taylor and his family. There’s a panic button in case you feel claustrophobic, but it’s all over in less than an hour. It sounds a little surreal, but that’s what it would have been like to be a subject whose functional magnetic resonance imaging (fMRI) brain data was used in last year’s Pittsburgh Brain Analysis Competition.

After you’ve watched three episodes, kindly folk in glasses and white coats would take you out of the scanner bore, give you a glass of water and then over the next few days, they’d ask you to watch those same three episodes again over and over. On the second viewing, they’d ask you ‘How amused are you?’ every couple of seconds. On the third viewing, they’d keep wanting to know how aroused you are on a moment-by-moment basis. Then, ‘Can you see anyone’s face on the screen?’, ‘Is there music playing?’, ‘Are people speaking?’ and so on, until you’ve watched every moment of every episode thirteen times, each time being asked something different about your experience.

Our job, as a team entering the competition, was to try and understand the mapping between your brain data and the subjective experiences you reported. For two of the episodes, we were given your brain data along with the thirteen numbers for every corresponding moment that described your arousal, amusement, whether there were faces on the screen, music playing, people speaking etc. Our team, comprising psychologists, neuroscientists, physicists and engineers, put together a pipeline of algorithms and techniques to whittle down your brain to just the areas we needed and smooth away as much of the noise and complexity as possible. Think of these first two episodes as the ‘training’ data. Then, we were given only the brain data for the third episode, the ‘test’ episode, from which we had to predict the reported experience ratings.

Our predictions were then correlated with the subjects’ actual reports, and we were given a score. We ended up coming second in the whole competition, and we’re hoping for the top spot in 2007. Much of this effort has had a direct payoff for our day-to-day research. We now routinely incorporate a lot of these machine learning techniques when trying to understand the representations used by different neural systems, and how they relate to behavior.

Members of the team: David Blei, Eugene Brevdo, Ronald Bryan, Melissa Carroll, Denis Chigirev, Greg Detre, Andrew Engell, Shannon Hughes, Christopher Moore, Ehren Newman, Ken Norman, Vaidehi Natu, Susan Robison, Greg Stephens, Matt Weber, and David Weiss

The Turing tournament – a proposal for a reformulation of the Turing Test

  1. Introduction
  2. Describing the Turing Tournament
  3. Comparing the Turing Test and the Turing Tournament
  4. Devising new rules, and non-linguistic competitors
  5. But is it intelligent?

MH: Are you a computer?

Dell: Nope.

MH: You’d be surprised how many fall for that one.

Dell: Not me.


MH: What’s fifty-six times thirty-three?

Dell: One thousand eight hundred forty-eight.

MH: You’re pretty fast!

Dell: Those are my favorite numbers.

— from


The Turing Test was designed to be an operational test of whether a machine can think. In Stuart Shieber’s words:

“How do you test if something is a meter long? You compare it with an object postulated to be a meter long. If the two are indistinguishable with regard to the pertinent property, their length, then you can conclude that the tested object is the given length. Now, how do you tell if something is intelligent? You compare it with an entity postulated to be intelligent. If the two are indistinguishable with regard to the pertinent properties, then you can conclude that the tested entity is intelligent (pg 1).”

In order for a machine to be deemed intelligent according to the Turing Test, we would determine whether human judges could reliably distinguish a human from the machine after some lengthy text-only conversation. I don’t think a machine is going to pass it any time soon, and when it does, it’ll be pretty self-evident that we’re dealing with a machine that can think.

Anyone who disagrees that a full and proper Turing Test is a stringent enough test of intelligence should read Robert French’s excellent discussion of the kinds of very human and culturally rooted subcognitive processes that would have to be going on in the machine in order for it to pass. His criticism is that the Turing Test “provides a guarantee not of intelligence but of culturally-oriented human intelligence”, i.e. that it sets the bar too high, or too narrowly. This is a subtler variant of the obvious point that human beings who don’t speak English would fail a Turing Test with English-speaking judges. In other words, the Turing Test is a sufficient but not a necessary test of intelligence: since you would have to have a certain subcognitive makeup in order to pass it, on top of being intelligent, an entity could be intelligent and still fail.

The beautiful thing about the Turing Test is that there’s nothing about it that’s specific to machines. Indeed, Turing’s original idea for the Imitation Game, as he termed it, was based on a parlour game where the judge attempted to distinguish male from female players. This essay is an attempt to broaden the scope of the Turing Test from being a binary and culturally-rooted test of human intelligence to something vaguer and less unidimensional.

Let’s make this idea somewhat more concrete, and considerably more vivid. Imagine that a small, slimy green-headed alien lands on your lawn right now, travelling in a spaceship the size of a Buick. Assume that the alien demonstrates its extraterrestrial credentials to your satisfaction by whisking you to its home planet and back before breakfast. It bats away the impact of a few .357 rounds with its forcefield and patiently replicates household objects for your amusement. It would seem niggardly to refuse a being that has mastered faster-than-light travel the ascription of intelligence when most humans can’t tie their shoelaces in the morning without a dose of caffeine. So we might be moved to patch the Turing Test in some ad hoc manner to read:

“Any entity that cannot be reliably distinguished from a human after a lengthy text-only conversation, OR that has mastered faster-than-light travel and can withstand a .357 round at close up, can be considered to be intelligent.”

It’s clear that this lacks the pithy generality of Turing’s original formulation, and we’d have to do quite a lot more work to restrict the scope of the above to exclude asteroids. Perhaps over time, our super-intelligent alien will learn to speak English with a flawless cockney accent, and will pass the standard Turing Test, rendering this discussion moot. But in the meantime, before it has learned to speak a human language, we are faced with a manifestly intelligent being that fails our gold standard test for intelligence. The background aim of this whole essay will be to consider a new version of the Turing Test that overcomes the inherent human- and language-specific parochialism of the original. That way, our intelligent alien might pass, without having to learn to speak English with a cockney accent.

Along the way, it may be that our reformulated test provides a more constructive goal and yardstick by which to direct and evaluate progress in AI research than the standard Turing Test. Perhaps the standard Test’s primary limitation is that it’s difficult to restrict its difficulty or scope without losing everything that’s interesting about it. And since even our current best efforts are a long way from success, the gradient of improvement is almost flat in every direction, making it difficult to discern when progress is being made in the right direction. This makes it difficult for machines to bootstrap themselves by training against each other, requiring lots of labour-intensive profiling against humans. Finally, the current test is very language-orientated, and undesirably emphasizes domain knowledge.

Describing the Turing Tournament

I’ll term this new version of the Turing Test the ‘Turing Tournament’, to reflect its competitive round robin form. Like the original Turing Test, the Turing Tournament will not yield a definitive, objective yes/no answer, but rather a ranking of the entrants, where the human players provide a benchmark. A lot of the details I’m proposing will probably need considerable refinement. Here are the organizing principles of a Turing Tournament:

  • The organizers of each tournament decide what the domain of play will be, e.g. a chessboard, text, a paint program, a 3D virtual reality environment, binary numbers, or some multidimensional analogue stimuli.
  • Every ‘player’ (within which I’m subsuming both human and machine variants) is competing in a round robin competition, and will play every other player twice, once as the ‘teacher’ and once as the ‘student’.
  • Every bout will have two players, a teacher and a student. Play proceeds in turns, with the teacher going first. Play terminates when the allotted time has been exceeded, or when some terminating criterion specified by the teacher has been satisfied. Neither player will know the identity of the other player.
  • Before the bouts begin, every player is given access to the domain of play so that they can construct their own set of rules that will operate when they are the teachers in a bout.
  • The organizers of each tournament determine the scoring for bouts that terminate relative to bouts whose time elapses. We will consider some possible scoring systems later.

These sound like strange rules. What kind of games could be played? Why does each teacher get to set their own rules? Do teachers get rewarded or punished if a student is able to reach criterion for their bout?

I think the easiest way to illustrate what I have in mind is with a concrete example. Imagine the following scenario:

  • A big room with lots of people sitting at computers. The people are the human players. The machine players are sitting inside a big server at the back of the room.
  • The domain for this competition is a Go board, a 19×19 checkerboard with black and white pieces. Although all bouts in this tournament will take place on a Go board, the rules and goals of each bout will be up to the teacher of that bout.
  • Let us peer over the shoulder of a human player, currently in the role of student, trying to determine what the rules of the bout are, and play so that the bout terminates before running out of time. Neither we nor they know whether the other player is human or machine.
  • The board is blank initially.
  • As always, the teacher makes the first move. They place a horizontal line of 19 black pieces in the bottom row of the board.
  • Now it is the student’s turn. They have no idea how the bout is scored, what the aim is, what constitutes a legal move, how many moves there will be or whether there will be multiple sub-bouts. All of that is up to the teacher.

    Working on the assumption that the teacher wants the student to play white, the student lays down a single white piece in the top left corner.

  • The teacher removes the white piece, and replaces it with a horizontal row of white pieces just above the existing horizontal black line, and another horizontal row of black pieces above that. So now there are three rows of pieces filling up from the bottom of the board: black, white and then black.
  • The student decides that the removal of their white piece in the corner was a signal that its future moves should consist of placing an entire row of pieces on the board at a time. The student tries placing an entire row of white pieces in the top row of the board.
  • The teacher again removes all the student’s pieces, and replaces them with another row of white pieces and another row of black pieces. The bottom of the board consists of black, white, black, white and black stripes.
  • The student reasons that its next move should be to place a row of white pieces above the most recent row of black pieces to continue the stripy pattern.
  • Gratifyingly, the teacher leaves the row of white pieces in place, and adds a black row above it, as expected.
  • The two players continue to take turns until all but the top row has been filled with alternating black and white rows.

    Now, it is once more the teacher’s turn, and the student wonders whether the last row will be filled in. Instead, the board blanks again, and the teacher places a vertical column of white pieces on the right hand side.

  • The student tries tentatively to place an adjacent column of black pieces, deciding that this sub-bout involves creating black and white vertical stripes, with the black and white players reversed.
  • As it turns out, this assumption appears to be correct, since the teacher does not remove the student’s pieces, and together they quickly build up an alternating vertical stripe that moves leftwards.
  • When only the last column remains to be filled in by the teacher, the bout has reached criterion, and the student moves on to the next bout, with a different player.
  • Upon inspecting the scores later, our human player (the ‘student’ in the bout just described) finds that they had scored highly on that bout, but not as highly as some. Some of the machine players had failed to see a pattern at all, and had been putting pieces down more or less at random. These players did the worst, since the scoring for this tournament rewards finishing the bout in as few turns as possible, and penalizes each error the student makes along the way.

    Like our hero, the best players at this bout had also quickly deduced that the pattern involved stripes. Their extra insight came after a few turns, where they tried placing multiple stripes down at once. As it turns out, there was nothing in the rules set by the teacher prohibiting this, and so they finished more quickly, earning a higher score.

    It seems reasonable to imagine that most humans would quickly figure out the stripy pattern, and some would eventually think to lay down multiple stripes at a time. Might a machine? Perhaps soon.

This is intended as a toy example. The rules of the bout are pretty simple, but I think they would discriminate somewhat between intelligent and not-so-intelligent players. The key point to note is that each player would play twice against every other player, once as the teacher and once as the student playing within the teacher’s rules. Perhaps some bouts are too hard, and some are too easy. But en masse, the rankings should discriminate quite finely between players, even between human players. The exact details of the scoring, especially how teachers are scored, and how teachers pre-specify their rules, are clearly going to be crucial. It will suffice for now to say that students should probably get points for satisfying the criterion of a bout quickly, and teachers should be rewarded for devising discriminative games, that is, games that only intelligent players can solve. I will defer further discussion of these topics until later.
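To make the scoring idea slightly more concrete, here is one hypothetical and deliberately over-simple rule along those lines: students earn more for reaching the teacher’s criterion in fewer turns with fewer errors, and teachers earn more when their game spreads the students’ scores widely apart (i.e. discriminates between players):

```python
# A hypothetical scoring rule, purely to make the idea concrete.
import statistics

def student_score(reached_criterion, turns_taken, errors, max_turns=50):
    """Reward reaching criterion quickly, with a penalty per error."""
    if not reached_criterion:
        return 0.0
    speed_bonus = (max_turns - turns_taken) / max_turns
    return max(0.0, speed_bonus - 0.05 * errors)

def teacher_score(student_scores):
    """Spread of student scores as a crude measure of discriminativeness."""
    return statistics.pstdev(student_scores) if len(student_scores) > 1 else 0.0
```

Under a rule like this, a game that everyone solves instantly, or that no one solves at all, earns the teacher nothing; a game that cleanly separates strong from weak players earns the most.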

Comparing the Turing Test and the Turing Tournament

In discussing the idea of an Inverted Turing Test (more below) Robert French states that:

“All variations of the original Turing Test, at least all of those of which I am currently aware, that attempt to make it more powerful, more subtle, or more sensitive can, in fact, be done within the framework of the original Turing Test.”

Is the same true of the Turing Tournament? I think the answer is both yes and no. In fact, you could think of a Turing Tournament as a kind of generalization of the Turing Test. That is, the original Turing Test could be treated (more or less) as a Turing Tournament where the domain of play is restricted to text, and the bouts terminate if the teachers/judges are satisfied they are talking to a human. It wouldn’t be quite the same, since here the players would double up as judges and the judges would double up as players. In other words, the machines would also themselves be making judgements about the humanness of both each other and the humans – an ‘Inverted Turing Test’. In its current formulation, where every player plays every other player as both teacher and student (i.e. judge and player), a Turing Tournament would really be a strange hybrid of both the Inverted and standard Turing Tests.

The idea of an Inverted Turing Test has been proposed before:

“Instead of evaluating a system’s ability to deceive people, we should test to see if a system ascribes intelligence to others in the same way that people do … by building a test that puts the system in the role of the observer … [A] system passes [this Inverted Turing Test] if it is itself unable to distinguish between two humans, or between a human and a machine that can pass the normal Turing Test, but which can discriminate between a human and a machine that can be told apart by a normal Turing Test with a human observer.”

French ingeniously showed that this Inverted Turing Test could be simulated within a standard (if somewhat convoluted) Turing Test. In contrast, it seems clear that an unrestricted Turing Tournament could not be fully simulated by a Turing Test, because the potential domains of play are limitless. So although one might imagine instantiating the Go domain by communicating using grid references within a standard Turing Test, there would be no way to run a domain of play such as a 3D virtual reality environment within a standard Turing Test using text alone. The principal advantage of widening the domain of play beyond text is to allow players to pass some kinds of Turing Tournaments without speaking English, or any language at all. As a result, it seems reasonable to think of the Turing Tournament as (more or less) a superset of the Turing Test, or if the reader prefers, at least a redescription of it with unrestricted domains of play. I find this Ouroborean quality quite pleasing. Either way, we can agree that most of the original Test’s merits and stringency should still be present in the Tournament version, depending on the way a particular Tournament’s domain of play and restrictions are set up. This does raise the important question of whether a Tournament victory would be as convincing a demonstration of intelligence as a victory in a standard Turing Test – I will return to this below.

French also shows that the Inverted Turing Test could be passed by a simple and mindless program that would take advantage of the very subcognitive demands that make the original test so parochial and difficult to pass. In short, the machine could have a pre-prepared list of questions that have been shown to weed out machines in the past, such as

“On a scale of 0 (completely implausible) to 10 (completely plausible), please rate:

  • ‘Flugblogs’ as a name Kellogg’s would give to a new breakfast cereal.
  • ‘Flugblogs’ as the name of a new computer company.
  • ‘Flugblogs’ as the name of big, air-filled bags worn on the feet and used to walk on water.
  • ‘Flugly’ as the name a student might give its favorite teddy bear.
  • ‘Flugly’ as the surname of a bank accountant in a W.C. Fields movie.
  • ‘Flugly’ as the surname of a glamorous female movie star.”

By pre-testing lots of humans and machines to figure out what kinds of things people say, and machines fail to say, a simple but well-prepared machine could draw up a ‘Human Subcognitive Profile’. By comparing this to the responses of players, it would be an extremely effective judge in an Inverted Turing Test. There are two reasons why this strategy would not work in a Turing Tournament:

a) In the above specification, none of the players know which domain they will be playing in until the competition begins officially (after which the designer is barred from tweaking his machine). As a result, it would be impossible for the designer to create Human Subcognitive Profiles for every possible domain that the machine might find itself playing in a Tournament.

This same effect could perhaps be wrought in a standard Test by restricting the domain of conversation, but not telling the players before the competition begins what that domain will be.

b) In order to be successful, players have to be good as both teachers and students. As mentioned above, this is akin to holding both a standard and Inverted Turing Test. Even if the domain was known in advance, and even if it was possible to draw up a Human Subcognitive Profile for that domain somehow, such a machine would be exposed as a student.
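To make French’s profile-matching strategy concrete, a judge of the kind described above might look like the following toy sketch. All names and numbers here are hypothetical stand-ins for the pre-testing French describes:

```python
import math

# Hypothetical mean plausibility ratings (0-10) collected in advance from
# humans and machines on a battery of subcognitive probes like the
# 'Flugblogs'/'Flugly' items above.
HUMAN_PROFILE = [7.5, 4.0, 8.0, 8.5, 6.0, 1.5]
MACHINE_PROFILE = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]  # noncommittal mid-scale answers

def distance(a, b):
    """Euclidean distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def judge(ratings):
    """Label a player by whichever pre-built profile its ratings sit nearer to."""
    if distance(ratings, HUMAN_PROFILE) < distance(ratings, MACHINE_PROFILE):
        return "human"
    return "machine"

print(judge([7.0, 4.5, 8.5, 9.0, 5.5, 2.0]))  # prints "human"
```

Simple as it is, nearest-profile matching of this sort is exactly why point (a) bites: the designer would need a separate profile for every possible domain.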

Lastly, French asks whether the standard Turing Test might be modified to forbid the kind of subcognitive questions that underlie its cultural and species-specific parochialism. He concludes that the kinds of questions that probe “intelligence in general … are the very questions that will allow us, unfailingly, to unmask the computer”.

He may well be right. However, it may be that moving out of the text domain will dramatically reduce the scope of possible subcognitive shibboleths that human teachers could employ. Having said that, there will still be many possibilities for rules that would place human student-players at a big advantage. For instance, in the case of the Go domain, a cunning human teacher could choose to play by the rules of Connect4, which other humans might be much quicker to fathom. In the case of some kind of Photoshop canvas domain, humans could spell out words cursively, outwitting even the most seasoned OCR software. If there’s any kind of free text involved, any of the subcognitive tricks designed for the standard Turing Test might be employed. In the case of a 3D virtual environment, human student-players will have a huge edge, though perhaps 2D or high-D worlds would level the playing field. One might hope that imaginative specification of domains could minimize such advantages; after 10 years of such competitions, machine programmers will almost certainly know to build in pre-loaded expert knowledge of Connect4, for instance, but the problem will clearly still remain.

[N.B. In order to ensure that the scales aren’t conversely weighted too heavily against human players, it seems reasonable to allow all human players the use of a laptop throughout the Tournament.]

Maybe instead we should accept the possibility of subcognitive shibboleths, and embrace their utility instead as a means of cataloguing different kinds of conceptual schemes. There is a presumption inherent in the standard Turing Test that smartness can be measured on a one-dimensional continuum ranging from rocks to rocket scientists. In the case of the aliens that have travelled 4 million light years in a space ship built out of genetically-engineered quantum nanobits and powered by fermented mango juice, we could be pretty sure they’re intelligent, even if they were never to get the hang of English. It’s just that their conceptual schemes are different. In this case, we may find that there are cases where they think more like machines than like humans. Or possibly more like dolphins, or African grey parrots, or white mice or marmosets. If we’re able to set up a domain in a Tournament that everyone can play in, then we can expect that human student-players may not necessarily come out on top in all respects, even within the animal kingdom. We will return to this idea when we discuss Turing Tournaments between groups of individuals.

Devising new rules, and non-linguistic competitors

Besides extending the domain of play beyond text, the principal innovation of the Turing Tournament is in casting every player as both student and teacher.

It is clear enough what is required of the student player. When the bout begins, they have some idea of the kinds of interactions, puzzles and patterns that the domain presents. By interacting with the teacher player, they have to somehow fathom what the aim (i.e. terminating criterion) of the current bout is, and attempt to satisfy that. It might involve placing pieces on the board in some complex pattern, learning the structure of a maze, guessing at the next number in a sequence or optimizing some noisy function. Depending on the tournament, they may or may not be given feedback after each move:

  • If they’re given a running score, they can attempt to learn how to maximise that reward.
  • If no reward is given, but the teacher corrects incorrect moves, then learning by imitation can be seen as a kind of supervised mapping or reconstructive learning problem.
  • There may even be cases where no feedback is given whatsoever, such as when the bout requires the student to guess the next number in some sequence.
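The three feedback regimes above could be captured in a single bout interface. This is only a sketch with hypothetical names (`Feedback`, `classify_regime`), not a real Tournament protocol:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    """What a teacher may return after each student move (all fields optional)."""
    score: Optional[float] = None     # running reward, if the bout provides one
    correction: Optional[str] = None  # the move the teacher would have made instead
    # If both fields are None, the student gets no feedback at all.

def classify_regime(fb: Feedback) -> str:
    """Name the learning problem the student faces, per the three cases above."""
    if fb.score is not None:
        return "reinforcement"  # maximise a running reward
    if fb.correction is not None:
        return "supervised"     # imitate corrected moves
    return "unsupervised"       # e.g. guess the next number with no feedback

print(classify_regime(Feedback(score=1.0)))  # prints "reinforcement"
```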

It is the teacher’s job to come up with new and inventive rules for bouts that challenge the student-players, and perhaps also to lead the student in the right direction. For the Tournament to work as intended, teachers should aim to come up with the most discriminative bout rules they can.

Getting the incentive structure for the teachers right is therefore key. I expect that early scoring structures will contain loopholes that ingenious machine designers can exploit, but that over time, scoring structures that serve their purpose in a robust way will emerge. If our goal is to discriminate humans from machines, then this simple scoring system may work:

  • If the student ‘wins’ (i.e. satisfies the terminating criterion) a bout, whether human or machine, then they get a point, otherwise they get nothing.
  • If a human student wins a bout, then the teacher gets a point, otherwise they get nothing.
  • If a machine student wins a bout, then the teacher loses a point, otherwise they get nothing.
  • Total player score: the sum of the player’s scores as a student and their scores as a teacher.

    [There may need to be some weighting/normalization if the number of human and machine players is unequal.]

In effect, we’re rewarding players that seem human and that can devise rules that discriminate whether other players are human. This Tournament setup is the combined standard/Inverted Turing Test described above, which would not differ wildly in principle from the standard Turing Test if played in a text domain. Such Tournaments would encourage the kinds of subcognitive or culturally-rooted human-parochialism that we’re trying to avoid.

Perhaps this more general scheme will work instead:

  • If the student wins, whether human or machine, then they get a point, otherwise they get nothing.
  • To calculate the student’s score: at the end of all the bouts, count the number of bouts that each player won as a student. Calculate the mean number of bouts won. For each student, subtract this mean value from the number of bouts they won. This will mean that a very average player will have zero student points, a good player will have a positive number of points, and a poor player will actually have negative student points.

    In other words:

    c_m = sigma_{n=1}^{N}( W_nm ) – sigma_{n=1}^{N}( sigma_{m'=1}^{N}( W_nm' ) ) / N


    c_m = the overall student score for player m

    N = the number of players

    W_nm = 1 if student m won in their bout with teacher n

    W_nm = 0 if student m lost in their bout with teacher n

  • To calculate the teacher’s score: if the student wins a bout, then add the student’s student score (which may be negative) to the teacher’s teacher score. If the student loses the bout, then the teacher gets nothing.

    In other words:

    p_n = sigma_{m=1}^{N}( W_nm * c_m )


    p_n = the overall teacher score for player n

    N = the number of players

    W_nm = 1 if student m won in their bout with teacher n

    W_nm = 0 if student m lost in their bout with teacher n

    [It may be that W_nm should be -1 if the student lost]

  • Total player score: the sum of the player’s teacher score and student score

    [There may need be some normalization to ensure that the teacher and student scores are weighted equally.]
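This scheme is fiddly to compute by hand, so here is a minimal sketch in Python of the student and teacher scores as described above. The three-player win matrix is invented purely for illustration, and no normalization between the two scores is attempted:

```python
def student_scores(W):
    """c_m: bouts player m won as a student, minus the mean wins per player.
    W[n][m] == 1 if student m won their bout with teacher n, else 0."""
    N = len(W)
    wins = [sum(W[n][m] for n in range(N)) for m in range(N)]
    mean = sum(wins) / N
    return [w - mean for w in wins]

def teacher_scores(W, c):
    """p_n: sum of the (possibly negative) student scores of every student
    who won their bout with teacher n."""
    N = len(W)
    return [sum(W[n][m] * c[m] for m in range(N)) for n in range(N)]

def total_scores(W):
    """Total score per player: student score plus teacher score."""
    c = student_scores(W)
    p = teacher_scores(W, c)
    return [ci + pi for ci, pi in zip(c, p)]

# Invented three-player win matrix (diagonal zero: no one teaches themselves).
W = [[0, 1, 1],   # teacher 0's bouts: students 1 and 2 won
     [1, 0, 0],   # teacher 1's bouts: only student 0 won
     [1, 1, 0]]   # teacher 2's bouts: students 0 and 1 won
print(total_scores(W))
```

In this toy run, player 1 comes out ahead: it won as many bouts as player 0 as a student, but its bouts as teacher were won only by the strong student, which is exactly the behaviour the teacher score is meant to reward.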

What’s the point of all this complexity? If you’re a teacher, then you do best if you can design your rules such that only above-average players (whether human or machine) win in your bouts. There’s actually a penalty if you make your rules so easy that everyone can figure them out, and you’ll get zero points if no one can figure them out at all. When you’re a student, you want to be as smart as you can, and when you’re a teacher, you want to be as discriminative as you can. En masse, the community of competitors are striving to do as well as they can and to evaluate each other as well as they can.

Inventively devising rules to favour intelligent over non-intelligent participants requires sufficient representational power to understand, let alone manipulate, your own rules, a rich theory of mind, and generative good taste. Consider a Tournament played in the simple domain consisting solely of letterstring analogy problems, where the student is faced with problems such as:

“I change abc into abd. Can you ‘do the same thing’ to ijk?”

or in non-linguistic terms:

abc —> abd; ijk —> ?

Reasonable responses include ijl, ijd, or even abd.
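For contrast, the most literal machine strategy – incrementing the rightmost letter – can be written in a few lines (a toy rule, not Copycat). It reproduces the popular ijl answer, but it is blind to any deeper structure in a letterstring:

```python
import string

ALPHABET = string.ascii_lowercase

def naive_analogy(target):
    """Literal reading of abc -> abd: replace the rightmost letter with its
    alphabetic successor, wrapping z around to a."""
    last = target[-1]
    successor = ALPHABET[(ALPHABET.index(last) + 1) % 26]
    return target[:-1] + successor

print(naive_analogy("ijk"))     # prints "ijl", the answer most humans give
print(naive_analogy("mrrjjj"))  # prints "mrrjjk", ignoring the letter groups
```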

Let us imagine that a player as cunning as Douglas Hofstadter has devised the following problem:

abc —> abd; mrrjjj —> ?

Peer at this for a moment – you won’t appreciate that this is somewhat fiendish unless you try it for a while yourself. Any ideas?

There’s no obvious pattern to the letters chosen on the right hand side, so mrrkkk seems kind of lame, and abd always feels lame. Well, how about if you try this one first:

abc —> abd; abbccc —> ?

Though your first thought may have been abbddd, doesn’t abbdddd seem so much nicer? It’s as though the successorship sequence of letters needs to be reflected in the increasing length of the letter groups (to use the FARG’s terminology). Now, let us turn back to:

abc —> abd; mrrjjj —> ?

Doesn’t mrrjjjj seem like a nice, reasonable solution now? Would you have considered it so nice before the previous example? Probably almost as nice. Would you have thought of it on your own, without the previous example? Probably, given some head-scratching.

The point of this digression is to show how an imaginative teacher can guide, plant ideas, manipulate, prime, coax and lead the student by example in such a way that an intelligent player would almost certainly get the right answer, but almost no extant machine would stand a chance. Besides having the sheer representational flexibility to deal with even barebones analogies such as the one above, a really intelligent player would be using the first few turns to gauge the teacher, get a sense of what kinds of solutions are admissible, and would probably be relying on Gricean maxims wherever possible.

What if your alien doesn’t know anything about Gricean maxims? Or doesn’t understand concepts like tournaments, rules, intelligence, machines, scores or games? We’ve finished outlining how a Tournament might be run that requires less domain knowledge and linguistic ability than the standard Turing Test. But one striking pragmatic problem remains, which becomes apparent when we consider our newly-arrived green visitor. If the alien doesn’t speak English, how are we going to explain the idea of the Turing Tournament to him so that he can participate?

Following Minsky, I think that we will be able to converse with aliens to some degree, provided they are motivated to cooperate, because we’ll both think in similar ways in spite of our different origins. Every evolving intelligence operates within spatial and temporal constraints, suffers from a scarcity of resources (and presumably, competition), must develop symbols and rules, and must have thought about computation and machine learning in order to be able to build spaceships. Perhaps notions of games, intelligence, scores and tournaments are only relevant in a society of individual entities that compete with each other for resources, and a hive mind, single monolithic being or other unimaginable entity might not need such concepts. In that case, you wouldn’t have any more luck using the standard Turing Test on such a being.

Will we have much more luck with machines? Not unless we start small. At the moment, the state of the art in artificial intelligence wouldn’t do very well in most of the domains we’ve discussed, and would struggle especially when trying to generate new rules of its own. Sadly, very few researchers have focused on generative heuristics to curiously discover things that are interesting simply for their own sake, such as Lenat’s Automated Mathematician that sought interesting mathematical concepts. In order to stand a chance in a Turing Tournament, much work needs to be done on curiously discovering interesting things that could serve as the basis for a rule set in a new domain. Good (that is, discriminative) rule sets for a Turing Tournament bout might consist of a difficult but ultimately guessable sequence of numbers based on a funny arithmetical pattern, or the kind of letterstring analogy problem that elicits an ‘aha’. Better still, teacher players that can lead an intelligent student player down a suggestive road towards the terminating criterion through tutorial or warm-up sub-bout problems will be at a tremendous advantage, where half the problem for the student consists of figuring out what their goal is supposed to be.

But is it intelligent?

Let us recall Shieber’s pithy test for intelligence:

“Now, how do you tell if something is intelligent? You compare it with an entity postulated to be intelligent.”

We’ve replaced that with an intellectual obstacle course. As teachers devising rules for their bouts, we are effectively asking players to define their own micro-test of intelligence (since being able to do this is surely a sign of good taste?). They must then be able to convey the parameters of that test such that other intelligent student players can figure out how to pass it, perhaps by creating lead-up sub-bouts, internalizing what the student player is probably thinking and so guiding the student players’ intuitions in the right direction. Finally, as students, the players must demonstrate in turn that they can flexibly assimilate what their goal should be, and then be able to get to it.

So although we might imagine some narrow machines that could best humans in certain kinds of puzzles or computation, it seems less likely that a brute-force machine player would also do well on Bongard problems or letterstring analogies, or be able to devise ingenious, fun and discriminative rules for bouts. This new generative aspect is intended to tap into the kind of creative, generative, playful, inventive or aesthetic faculty that humans display, as well as the ability to form a rich internal model of the student player’s state of confusion, and guide them towards a solution. In this respect, it borrows the idea of a dialog or gentle interrogation from the original Test, but allows for the translation of that dialog to new domains.

Bringing this back to Turing’s original question, we can finally ask, ‘if a machine were to score higher than some of the humans in a Turing Tournament, would we definitely be willing to call it intelligent?’ The answer could depend on a few factors:

  • Let us assume that the Tournament is well-planned, that the human competitors are well-chosen, that no independent experts can find any scoring loopholes or weaknesses in the organization of the Tournament, and that the result is replicable. If any of these conditions are not met, we will not consider the Tournament to be well-run.
  • If the domain is too restrictive, then there may be a dearth of interesting rule sets that can be devised. In this case, a good player won’t do much better than a poor player, and this wouldn’t be an interesting result.
  • Even if the domain is a rich one, such as letterstring analogy problems, it could be that a highly specialized program like Copycat could outperform many humans. Unless the success is relatively domain-general, you’ve only shown what you probably knew already, i.e. that the machine exhibits some domain-specific proto-intelligence.
  • At that point, we would probably want to analyze the machine’s performance. Did it do better as a teacher or student? Was it simply very good at certain kinds of problems? Was there some simple trick to its way of devising problems which, once exposed, would clue in future human players in a rerun of the Tournament?
  • Could it pass a standard Turing Test?

Let us imagine that a machine is designed which is a poor teacher player, but a good student player, particularly in a couple of abstract limited-interaction domains like letter strings, number sequences, Go boards and cryptography, but that it can’t pass the Turing Test. Is it intelligent? Somewhat? We’ve forfeited the no-frills and no-free-parameters yes/no answer that you get from a Turing Test, but we now have a much richer set of data with which to try and place this machine in the space of all possible minds. We have a more finely-graded multi-dimensional scale. Our machines can bootstrap themselves by competing amongst themselves without human intervention – specialist teacher machines that are good at discovering generative heuristics can be used to train specialist student machines that are good at problem solving, and vice versa. So in forfeiting our neat yes/no answer, we’ve gained a great deal.

Perhaps most importantly for the field of AI, we can now attempt to scale the enormous subcognitive iceberg of the mind incrementally, using ever more complex Turing Tournaments as yardsticks. In time, perhaps this will lead back towards the Turing Test as the final summit.

see also: AlienIntelligenceLinks

Intrinsicality, symbols, self-organisation and gradual transformation

In his thesis, he discusses the use of hierarchical grouping self-organising maps to get symbols to self-organise. I can’t decide whether I feel as though self-organization is intrinsic to intrinsicality or not, but it definitely feels as though the two are somewhat intertwined, especially in the brain.

The second thing is that such self-organising symbols are gradually transformable – to illustrate what is meant by this, consider C++ source code. If you imagine that source code is a genotype (though this point applies to learning in general and not just to genetic algorithms), and you mutate it slightly or combine the first half with another piece of source code and see how well it performs at your chosen task, then chances are that it will be completely broken. C++ source code is not gradually transformable. Genotypes and neural networks, on the other hand, are gradually transformable. This, I think, is what allows them to learn by self-organizing, and is somehow key to the whole intrinsicality business. Because self-organizing systems can morph gradually as a function of changes or experiences in their environment, they are inextricably tied to it, and form intrinsic representations of it.

– from a message to Stephen Larson

P.S. Is it a coincidence that both types of systems represent things in the universal language of vectors???
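The contrast drawn in the message above can be sketched directly: perturbing a weight vector nudges behaviour slightly, while flipping a single character of source code usually breaks it outright. This is only a toy illustration, not a claim about any particular genetic-programming setup:

```python
import random

random.seed(0)  # deterministic for illustration

def mutate_weights(w, scale=0.01):
    """Perturb a parameter vector slightly -- behaviour degrades gracefully."""
    return [x + random.gauss(0, scale) for x in w]

def mutate_source(src):
    """Flip one random character of source code -- usually fatal to the program."""
    i = random.randrange(len(src))
    return src[:i] + chr(random.randrange(33, 127)) + src[i + 1:]

# A tiny 'genotype' in each representation.
weights = [0.5, -1.2, 3.0]
source = "def f(x):\n    return 2 * x + 1\n"

print(mutate_weights(weights))  # still roughly the same function

try:
    compile(mutate_source(source), "<mutated>", "exec")
    print("mutated source still compiles (got lucky)")
except SyntaxError:
    print("mutated source is broken: SyntaxError")
```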

The Cog team

“Three conceptual errors commonly made by classical AI researchers are presuming the presence of monolithic internal models, monolithic control, and the existence of general purpose processing. These and other errors primarily derive from naive models based on subjective observation and introspection, and biases from common computational metaphors (mathematical logic, Von Neumann architectures, etc.). A modern understanding of cognitive science and neuroscience refutes these assumptions.”
– Alternative Essences of Intelligence

Trophic levels, trade and the Terminator

As an AI aficionado, I’ve had my fair share of debates about the Terminator scenario. Perhaps blindly, I’m optimistic that we won’t end up enslaved or eradicated by robot overlords. Here’s one possible response.

Why would the machines be so obsessed with the idea of dominating or eradicating us? They’ll occupy a completely different ecological niche. They’ll almost certainly have entirely different energy/resource requirements, be free from human claustrophobia and may not even be embodied at all. They won’t care if the earth runs out of resources because they’ll just photosynthesise or transduce faeces. So, even if it turns out that aggression is a necessary component of intelligence, it just doesn’t make sense that they’ll want to wipe us off the face of the earth any more than we’re determined to wipe bacteria off the face of the earth.

Unfortunately though, that’s a somewhat spurious parallel. We aren’t in competition with bacteria – in fact, we depend on them. In contrast, there’s less reason to suppose that intelligent machines will depend on us. In that case, the better parallel might be the relationship we have with chimpanzees. That would be more cause for concern.

So the best case scenario would be mutual dependence and integration with the machines. It’s no coincidence that trading partners rarely go to war with one another. However, we’re going to have to find something we’re better than machines at in order to have something to trade. Perhaps we could elevate captchas to a form of poetry?