It categorizes images in 100 milliseconds

Apr 3, 2007 08:19 GMT

The human eye, during normal everyday vision, moves around a scene, giving the brain time to focus attention on relevant items, such as a snake curled in the path. Evolutionarily speaking, survival often depends on extracting vital information in one glance, so that we jump out of danger's way before we even realize what we've seen.

Computers usually compute far faster than the human brain, but some tasks that the brain performs easily, such as visual object recognition, remain very challenging for machines. The brain's visual processing system is more sophisticated and swifter than even the most advanced artificial vision system, giving us an uncanny ability to extract useful information from a glimpse presumably too brief for conscious thought.

To explore this phenomenon, neuroscientists have long used rapid categorization tasks, in which subjects indicate whether or not an object from a specific class (such as an animal) is present in a briefly presented image.
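To make the paradigm concrete, here is a minimal, purely illustrative sketch of how responses from such a yes/no categorization task might be scored. The trial data and variable names are hypothetical, not taken from the study:

```python
# Hypothetical scoring for a rapid "animal present?" task: each trial
# flashes an image briefly, then records the subject's yes/no answer.
# The data below is made up for illustration.
trials = [
    # (animal actually present, subject answered "animal")
    (True, True), (True, True), (False, True),
    (True, False), (False, False), (False, False),
]

hits = sum(1 for present, said_yes in trials if present and said_yes)
false_alarms = sum(1 for present, said_yes in trials if not present and said_yes)
n_present = sum(1 for present, _ in trials if present)
n_absent = len(trials) - n_present

print(f"hit rate:         {hits / n_present:.0%}")        # correct "animal" answers
print(f"false-alarm rate: {false_alarms / n_absent:.0%}")  # "animal" on empty scenes
```

Comparing hit and false-alarm rates, rather than raw accuracy alone, separates genuine detection from a bias toward answering "animal" on every trial.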

Now, in a new MIT study, a computer model designed to mimic the way the brain processes visual information performs as well as humans do on rapid categorization tasks. The model even tends to make the same kinds of errors humans make, possibly because it so closely follows the organization of the brain's visual system.
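The article does not include the model's code, but architectures of this kind, in the spirit of the HMAX family of models from Poggio's lab, alternate "simple cell" template matching with "complex cell" max pooling in a single feed-forward pass. The sketch below is a heavily simplified, hypothetical illustration of that idea; the filter parameters, pooling size, and function names are assumptions, not the study's actual implementation:

```python
# Simplified feedforward "simple/complex cell" hierarchy, loosely in the
# spirit of HMAX-style models (illustrative only, not the MIT group's code).
import numpy as np
from scipy.signal import convolve2d

def gabor(size, theta, wavelength=4.0, sigma=2.0):
    """Gabor patch: a crude stand-in for a V1 simple-cell template."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def s_layer(image, filters):
    """Simple cells: template matching (convolution) at every position."""
    return [np.abs(convolve2d(image, f, mode="valid")) for f in filters]

def c_layer(maps, pool=8):
    """Complex cells: max pooling over local neighborhoods, which buys
    tolerance to small shifts within a single feedforward sweep."""
    feats = []
    for m in maps:
        h, w = m.shape[0] // pool, m.shape[1] // pool
        trimmed = m[:h * pool, :w * pool].reshape(h, pool, w, pool)
        feats.append(trimmed.max(axis=(1, 3)))
    return np.concatenate([f.ravel() for f in feats])

# One feedforward sweep: image -> S layer -> C layer feature vector.
filters = [gabor(11, theta) for theta in np.linspace(0.0, np.pi, 4, endpoint=False)]
image = np.random.rand(128, 128)  # stand-in for a test photograph
features = c_layer(s_layer(image, filters))
print(features.shape)  # one fixed-length vector per image
```

A simple classifier trained on such feature vectors would then answer "animal" or "no animal" in one pass, with no feedback loop, which is precisely the property the study pits against human performance.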

"This is a very hard task for any artificial vision system. Animals are extremely varied in shape and size. Snakes, butterflies, and elephants have little in common, and the animals in the image may be lying, standing, flying, or leaping," explained Tomaso Poggio of the MIT.

The team organized the images into subcategories, ranging from close-up views of an animal's head to distant views, and used both single animals and groups of animals. As preliminary model simulations had predicted, the task became harder as the relative size of the animal decreased and the amount of background increased.

The results showed no significant difference between humans and the model. Both displayed the same pattern of performance: accuracy well above 90% for the close views, dropping to 74% for the distant views. This roughly 16-point drop for distant views reflects a limitation of the single feed-forward sweep in dealing with background clutter. Given more time for cognitive feedback, people would outperform the model, because they could focus attention on the target and ignore the background.

"We have not solved vision yet," Poggio cautioned, "but this model of immediate recognition may provide the skeleton of a theory of vision. The huge task in front of us is to incorporate into the model the effects of attention and top-down beliefs."

This is a big step forward in understanding how humans see, and toward applying that understanding in human-like robots.