An artificial-intelligence software program capable of seeing and reading has for the first time answered geometry questions from the SAT at the level of an average 11th-grader.
The achievement, in which the program answered math questions it had not previously seen, was reported in a paper presented by computer scientists from the Allen Institute for Artificial Intelligence in Seattle and the University of Washington at a scientific conference in Lisbon, Portugal, on Sunday.
The software had to combine machine vision, to understand diagrams, with the ability to read and understand complete sentences; its success represents a breakthrough in artificial intelligence.
Despite the advance, the researchers acknowledge that the program’s abilities underscore how far scientists have to go to create software capable of mimicking human intelligence.
For example, Ali Farhadi, a University of Washington artificial-intelligence researcher and a designer of the test-taking program, noted that even a simple task for children, like understanding the meaning of an arrow in the context of a test diagram, was not yet something the most advanced AI programs could do reliably.
“A lot of my colleagues have said machine vision is a solved problem,” Farhadi said. “My answer is, ‘Call me when you’ve solved this.’ ”
In 1950, at the dawn of the computing age, mathematician Alan Turing proposed a simple test to determine whether a machine could “think” in a human sense. If a person communicating via keyboard with a computer could not tell whether it was a machine or a human, Turing reasoned, the question would be resolved.
Since then, a debate has emerged about whether the Turing test signifies anything about machine intelligence or is merely an indication of human gullibility.
The argument came to a head in June 2014, when a British computer scientist claimed that a chatbot named Eugene Goostman, created by Russian and Ukrainian programmers and presented as a 13-year-old Ukrainian boy, had passed the Turing test.
The demonstration was widely criticized, and in January, Gary Marcus, a cognitive scientist at New York University, organized the first of two scientific workshops intended to develop more accurate methods than the Turing test for measuring the capabilities of artificial-intelligence programs.
Researchers in the field are now developing a wide range of gauges to measure intelligence — including the Allen Institute’s standardized-test approach and a task that Marcus proposed, which he called the “Ikea construction challenge.” That test would provide an AI program with a bag of parts and an instruction sheet, and require it to assemble a piece of furniture.
Another approach, the Winograd Schema Challenge, has been made an official competition, sponsored by Nuance Communications, a company that develops speech-recognition technology. The challenge — named after pioneering AI researcher Terry Winograd, a professor of computer science at Stanford University — will be a test of common-sense reasoning.
First proposed in 2011 by Hector Levesque, a University of Toronto computer scientist, the Winograd Schema Challenge would pose questions that require real-world logic to AI programs. A question might be: “The trophy would not fit in the brown suitcase because it was too big. What was too big, A: the trophy or B: the suitcase?” Answering this question would require a program to reason spatially and have specific knowledge about the size of objects.
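What makes such a question hard for software is that swapping a single word flips the correct referent, defeating surface-level statistics. A minimal sketch of that structure, with an illustrative schema represented as plain data (the field names and the toy schema are this sketch's own, not from Levesque's or Nuance's materials):

```python
# A Winograd schema pairs a sentence template with a "special word" whose
# substitution flips which candidate the pronoun refers to.
schema = {
    "template": "The trophy would not fit in the brown suitcase "
                "because it was too {word}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    # The correct referent depends on the special word chosen.
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

def instantiate(schema, word):
    """Fill in the special word; return the question text and correct answer."""
    text = schema["template"].format(word=word)
    return text, schema["answers"][word]

text_big, answer_big = instantiate(schema, "big")
text_small, answer_small = instantiate(schema, "small")
# One-word swap, same sentence shape, opposite answer: resolving the
# pronoun requires knowing that things fit inside bigger containers.
```

Because the two variants are lexically almost identical, a program cannot answer by word association alone; it needs the real-world size reasoning the challenge is designed to probe.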
Within the AI community, discussions about software programs that can reason in a humanlike way are significant because recent progress in the field has focused more on improving perception than on reasoning.
For example, progress has been made in applications used to recognize objects or human speech. To improve machine vision and speech recognition, researchers have exploited the ability to compile vast libraries of examples to train programs known as neural networks, which can do sophisticated pattern recognition.
The Allen Institute researchers said these techniques fell short in developing technologies to match human skills such as abstract and common-sense reasoning.
The Allen Institute’s program, which is known as GeoSolver, or GeoS, was described at the Conference on Empirical Methods in Natural Language Processing in Lisbon over the weekend. It operates by separately generating a series of logical equations, which serve as components of possible answers, from the text and the diagram in the question. It then weighs the accuracy of the equations and tries to discern whether its interpretation of the diagram and text is strong enough to select one of the multiple-choice answers.
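The pipeline just described can be sketched in miniature: extract candidate logical relations from the text and the diagram separately, each with a confidence score, merge the interpretations, and commit to a multiple-choice answer only when the merged interpretation is strong enough. Every relation name, score, and number below is invented for illustration; GeoS's actual logical language and scoring method are far richer.

```python
import math

# Hypothetical relations read from the question text, with confidences.
text_relations = {
    ("equals", "AB", 5): 0.9,                # "AB = 5" in the question text
    ("perpendicular", "AB", "BC"): 0.8,      # "AB is perpendicular to BC"
}
# Hypothetical relations read from the diagram by the vision component.
diagram_relations = {
    ("perpendicular", "AB", "BC"): 0.7,      # right-angle mark in the figure
    ("equals", "BC", 12): 0.6,               # measured segment length
}

def merge(text_rels, diagram_rels, threshold=0.5):
    """Combine both interpretations, keeping relations whose best
    confidence from either source clears the threshold."""
    merged = {}
    for rel in set(text_rels) | set(diagram_rels):
        score = max(text_rels.get(rel, 0.0), diagram_rels.get(rel, 0.0))
        if score >= threshold:
            merged[rel] = score
    return merged

def solve(relations, choices):
    """Toy solver: if two perpendicular sides have known lengths, compute
    the hypotenuse and pick the closest multiple-choice answer."""
    lengths = {seg: val for (kind, seg, val) in relations if kind == "equals"}
    right_angle = any(kind == "perpendicular" for (kind, *_) in relations)
    if right_angle and {"AB", "BC"} <= lengths.keys():
        ac = math.hypot(lengths["AB"], lengths["BC"])
        return min(choices, key=lambda c: abs(choices[c] - ac))
    return None  # interpretation too weak to commit to an answer

choices = {"A": 11, "B": 12, "C": 13, "D": 17}
answer = solve(merge(text_relations, diagram_relations), choices)
```

The point of the sketch is the shape of the computation, not its content: text and diagram are interpreted independently, the interpretations are reconciled with confidence scores, and the program abstains when the evidence does not single out an answer.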
The Allen Institute approach has more in common with an earlier generation of artificial intelligence research that relied on logic and reasoning.
Moreover, the Allen Institute researchers said, machine-learning techniques have continued to fall short in areas where humans excel, such as problem solving.
“This is not pattern matching,” said Oren Etzioni, a computer scientist and the chief executive of the Allen Institute.
While neural networks have made progress based on the availability of huge amounts of data online, the Allen Institute approach works with relatively sparse data (even all the standardized-test questions do not make up a big data set). The data are too sparse, in fact, to be broadly useful in solving school test questions in subjects that require reasoning, such as algebra and science disciplines, Etzioni said.
Ultimately, Marcus said, he believed that progress in artificial intelligence would require multiple tests, just as multiple tests are used to assess human performance.
“There is no one measure of human intelligence,” he said. “Why should there be just one AI test?”
One open question is whether the incremental progress that is evident in the Allen Institute geometry-solving program is a significant step forward or whether it has more in common with a series of earlier proclamations in the field of “thinking machines” that ended in blind alleys.
In the 1960s, Hubert Dreyfus, a philosophy professor at the University of California, Berkeley, expressed this skepticism most clearly when he wrote, “Believing that writing these types of programs will bring us closer to real artificial intelligence is like believing that someone climbing a tree is making progress toward reaching the moon.”