As flexible natural language conversation systems get more and more attention, it is important that we establish a common, independent benchmark for measuring the quality of these systems and the methods that underlie them. Progress in computer vision and speech recognition has mainly been driven by competitions and benchmarks like ImageNet or Switchboard. For modern conversational AI, we don't really have robust benchmarks like these. There are a number of challenges, such as the Dialogue State Tracking Challenge, the Spoken Dialogue Challenges or the Loebner Prize, but these lack the scale and generality we need to make quick progress in the space.
What makes setting up a benchmark for dialogues particularly hard is the interaction with another party. Because the progression of the dialogue depends on the system's answers, we can't evaluate the system's full performance offline. Ideally, we need a live challenge where people interact with each system in real time. And we need to run these evaluations at a sufficiently large scale that the results of each round are statistically significant.
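To make "sufficiently large scale" concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from any specific challenge) of how many human judgments a head-to-head comparison needs before a small difference in win rate becomes detectable. The win rates, significance level, and power are illustrative assumptions.

```python
from scipy.stats import norm

def judgments_needed(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate number of rated conversations per system needed to
    detect a difference between two assumed win rates p_a and p_b,
    using a standard two-proportion z-test sample-size formula."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ((z_alpha + z_beta) ** 2 * variance) / (p_a - p_b) ** 2

# Telling a 52% win rate apart from 48% already takes on the order of
# 2,500 rated conversations per system.
print(round(judgments_needed(0.52, 0.48)))
```

The point of the sketch is just that close systems need thousands of live interactions per comparison, which is why small ad-hoc evaluations don't move the field forward.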
As conversational interfaces become more accurate and commonplace, we also have to make sure that we track human-level performance on the same conversational tasks. We take it for granted that humans can pass the Turing test, but that's because most people today expect to be talking to another person, so relatively little effort is enough to demonstrate you're not a robot. But just as CAPTCHAs are getting more and more challenging (or maybe I'm getting older), we may need to put more effort into convincing other humans that we are not robots.