A Giant Leap for Machine Kind: When Robots Can See

Mar 16, 2016

Advancements in artificial intelligence are moving at a faster pace than ever. It’s already commonplace to give computers voice commands and have them respond in natural language, and even identify images they have been programmed to recognize. But the next frontier in A.I. is teaching computers to ‘see,’ that is, to recognize unique objects the way humans do. Something that comes naturally to most people is a very complex problem for robots. Scientists at Virginia Tech are leading an effort in the A.I. community to get computers past a hurdle that has seemed almost insurmountable, until now. Robbie Harris reports.

Scientists have been working on artificial intelligence since at least the 1960s, and they’ve passed many milestones: computers winning at chess and now the Chinese game Go, and apps like Siri on iPhones answering all manner of questions. But as smart as all that sounds, it’s nothing compared to the kind of nuanced and intuitive thinking humans do all the time, like, for instance, looking at a picture and telling you what they see.

Devi Parikh and Dhruv Batra are assistant professors of electrical and computer engineering at Virginia Tech. Like parents of a precocious child, they’re teaching a computer to see and interpret a complex world.

Parikh speaks into the computer’s microphone: “What is the man doing?”

Because you’re listening to this story on the radio, or reading it on the web, you can’t see what the man is doing, but the computer can tell you.

VQA answers: “Surfing.”

This would be a small step for a child, but it’s a giant leap for a machine. It’s a relatively new field in artificial intelligence called Visual Question Answering, or VQA. Not only could it be of help to the visually impaired, but also to people who are situationally impaired.

“Whether you’re driving and you can no longer look at your phone to see which picture someone sent you, or you’re an analyst who has thousands of hours of video in front of you and you cannot possibly spend thousands of hours looking through that video footage. You can just query the A-I agent and say, did a man in a red shirt cross the street between 3 and 4 p.m. on any Tuesday?”

Right now VQA is still in its infancy.

Devi Parikh asks, “What is on the plate?”

VQA responds: “No.”

It’s a baby with a very grown-up voice, if not yet 100 percent accuracy.

Parikh says, “We often find that computer vision is, in some ways, an unfortunate area to be working in, because it comes so naturally to us that when our systems make mistakes, people are like, ‘How could it get that wrong? That’s so obvious.’ It’s a challenge to convince people how hard it is to figure these things out.”

Parikh says the computer gets the right answer about 60 percent of the time. That’s because, while computers are good at quickly retrieving data, they are not as good at recognizing unique or everyday objects, something as simple as your basic chair.

“If you think about what chairs look like, the color is not consistent. The shape need not be consistent. The texture is not consistent. So visually, it’s not clear what makes a chair a chair. It’s just that when you look at the object, you can imagine yourself sitting in it, and that’s what makes a chair a chair. So visually, that’s a much more challenging problem, to be able to recognize any chair that you may not have seen before.”

And even though Watson, the computer, beat the humans on Jeopardy, Batra says, “Interestingly, for everyone at home who watched Watson win, there was not a single image question, and that was by design. There are estimates that a significantly large percentage of the human brain is devoted to performing visual recognition, visual processing. And that’s something that we’re trying to connect: the bridge between understanding language, understanding visual data, whether it’s images or videos, and then producing an answer.”

So Batra and Parikh had to build a deep, rich database. Computers are good at handling a lot of data, but it takes humans to frame the questions and scenarios that might arise.

“We asked them to come up with questions to stump a smart robot.”

Larry Zitnick is with Facebook’s Artificial Intelligence Research Lab and a partner on the Visual Question Answering project.

“One of the biggest challenges right now in A-I is how to measure the intelligence of the machines. And that’s really the main emphasis of the visual question and answering data set, to get a better handle on how far along we are.”

Last fall, the team made its data set public. Now teams at universities and research institutes all over the country are working with it to see which team’s computers return the most accurate answers to these crowdsourced questions, a kind of S-A-T for the robot set. They’ll announce the winners at a conference on computer vision and pattern recognition in Las Vegas in June.

Also contributing to the Visual Question Answering Project are:  Meg Mitchell of Microsoft Research, and Virginia Tech students: Aishwarya Agrawal, Stan Antol and Jiasen Lu.