Submitted by Nicholas Colas of DataTrek Research
Peter Thiel’s most famous one-liner may well be, “We wanted flying cars, instead we got 140 characters”. As a send-up of tech entrepreneurship’s focus on the mundane rather than the magnificent, it’s pretty good. And Twitter may be 280 characters now, but we still don’t have flying cars.
We thought of that line today when we came across an article in the MIT Technology Review titled “The way we train AI is fundamentally flawed” by Will Heaven, the publication’s senior editor for AI, who holds a Ph.D. in computer science. The piece reviews a recent paper by 40 Google researchers across 7 different teams. There are links to both the MIT article and the Google paper below (the latter is heavy reading), but here is a brief summary:
Artificial intelligence training is a two-step process. You start by showing an algorithm a dataset. As it goes through that data, it “learns” to identify images, voices, or whatever you’re trying to teach it by gradually adjusting the weights it gives to the features it evaluates. Once that’s done, you test it on data it hasn’t seen before. When you get a satisfactory result on that test, you’re done.
The problem is that if you train 50 AI algos on a given dataset and then see them all pass their tests, they can still perform very differently in the real world. One might be great at identifying images in low light conditions, but not when presented with high contrast, for example. Another AI algo trained on the same data and validated by the same test may have the opposite problem. Or it may do both well…
Put succinctly, the MIT Technology Review article says, “the process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t”.
The problem is called “underspecification”: the training and testing process does not pin down a single model, so many models that differ in ways that matter can all pass the same test. As the article notes, “Google researchers ended up looking at a range of different AI applications, from image recognition to natural language processing (NLP) to disease prediction. They found that underspecification was to blame for poor performance in all of them.”
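For readers who want to see the effect in miniature, here is a rough Python sketch. It is our own toy illustration, not code from the Google paper, and everything in it is an assumption made for the demo: the two made-up features `x1` and `x2`, the helper names, the simple logistic regression, and the seeds. In the training and test data `x2` is an exact copy of `x1`, so nothing tells a model which feature to lean on. Fifty models that differ only in their random starting weights all pass the test, yet they score quite differently once `x2` stops tracking `x1`.

```python
import math
import random


def make_rows(n, seed, shortcut_holds):
    """Toy data: the label depends only on x1.

    When shortcut_holds is True, x2 is an exact copy of x1, so the data
    cannot tell a model which of the two features it should rely on.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = x1 if shortcut_holds else rng.uniform(-1.0, 1.0)
        rows.append((x1, x2, 1 if x1 > 0 else 0))
    return rows


def train(rows, seed, epochs=20, lr=0.5):
    """Logistic regression fitted by plain SGD; only the random start varies."""
    rng = random.Random(seed)
    w1, w2 = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
    for _ in range(epochs):
        for x1, x2, y in rows:
            p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2)))
            w1 += lr * (y - p) * x1
            w2 += lr * (y - p) * x2
    return w1, w2


def accuracy(weights, rows):
    """Share of rows where the sign of the weighted sum matches the label."""
    w1, w2 = weights
    hits = sum(1 for x1, x2, y in rows if (w1 * x1 + w2 * x2 > 0) == (y == 1))
    return hits / len(rows)


train_rows = make_rows(200, seed=1001, shortcut_holds=True)
test_rows = make_rows(500, seed=1002, shortcut_holds=True)       # the standard held-out test
shifted_rows = make_rows(2000, seed=1003, shortcut_holds=False)  # stand-in for the "real world"

models = [train(train_rows, seed) for seed in range(50)]
test_scores = [accuracy(m, test_rows) for m in models]
shifted_scores = [accuracy(m, shifted_rows) for m in models]

print("held-out test accuracy: min %.3f, max %.3f" % (min(test_scores), max(test_scores)))
print("shifted-data accuracy:  min %.3f, max %.3f" % (min(shifted_scores), max(shifted_scores)))
```

Because the two features are identical during training, every update moves both weights by the same amount, and the gap between them stays wherever the random start put it. The held-out test cannot see that gap; the shifted data can.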
There are ways around underspecification, ranging from more testing (essentially “up-specifying” the algo so it does the work correctly) to releasing multiple versions of the same algo and letting users pick which ones work best for their application.
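Both workarounds can be sketched in a few lines of Python. Again, this is only an illustration under our own assumptions: the three candidate models and their weights are invented, the two "degraded" conditions are stand-ins for something like low light versus high contrast, and the function names are hypothetical. The idea is to score every candidate on extra stress suites beyond the standard test, then either keep the one with the best worst case or let each application pick the candidate that does best on its own data.

```python
import random


def predict(weights, x1, x2):
    """Linear classifier: 1 if the weighted sum is positive, else 0."""
    w1, w2 = weights
    return 1 if w1 * x1 + w2 * x2 > 0 else 0


def accuracy(weights, rows):
    return sum(predict(weights, x1, x2) == y for x1, x2, y in rows) / len(rows)


def make_suite(n, seed, noise1, noise2):
    """Hypothetical test suite: both features carry the true signal z,
    and each can be degraded by its own amount of noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.uniform(-1.0, 1.0)
        x1 = z + noise1 * rng.uniform(-1.0, 1.0)
        x2 = z + noise2 * rng.uniform(-1.0, 1.0)
        rows.append((x1, x2, 1 if z > 0 else 0))
    return rows


# Invented candidates, assumed to have come out of the same training run setup.
candidates = {
    "model_a": (9.0, 1.0),  # leans on feature 1
    "model_b": (5.0, 5.0),  # splits the difference
    "model_c": (1.0, 9.0),  # leans on feature 2
}

standard_test = make_suite(2000, seed=0, noise1=0.0, noise2=0.0)
stress_suites = {
    "x1_degraded": make_suite(2000, seed=1, noise1=2.0, noise2=0.0),
    "x2_degraded": make_suite(2000, seed=2, noise1=0.0, noise2=2.0),
}


def stress_report(candidates, suites):
    """Accuracy of every candidate on every stress suite."""
    return {
        name: {suite: accuracy(weights, rows) for suite, rows in suites.items()}
        for name, weights in candidates.items()
    }


def most_robust(report):
    """Candidate whose worst stress-suite score is highest."""
    return max(report, key=lambda name: min(report[name].values()))


def pick_for_application(candidates, app_rows):
    """Candidate that scores best on one application's own data."""
    return max(candidates, key=lambda name: accuracy(candidates[name], app_rows))


report = stress_report(candidates, stress_suites)
for name, scores in report.items():
    print(name, "standard: %.3f" % accuracy(candidates[name], standard_test), scores)
print("most robust overall:", most_robust(report))
```

All three candidates look identical on the standard test, which is the whole problem. The stress suites are what separate them, and which one counts as "best" depends on the conditions the user actually faces.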
Our take: while it is surprising that AI researchers have only recently figured out what’s inherently wrong with AI development (and the fact that it’s Google fessing up seems important), at least they are one step closer to a better testing/validation regime. We won’t get flying cars – or even just self-driving terrestrial ones – until this process improves.