Computers can now do more than merely recognize individual objects in a photograph; they can provide simple, sentence-length descriptions of the whole scene, such as "a dog standing on the grass." New research into automatically captioning complex images is yielding intriguing results.
The breakthrough behind today's automatic speech recognition, image recognition, and translation came in 2006–2007, with research by Geoffrey Hinton and Simon Osindero on training deep belief networks, and by Ranzato et al. and Bengio et al. on stacked auto-encoders. During pre-training, the lower layers of a deep learning network automatically extract potentially useful features of an image. Each auto-encoder learns features of its input data from which that input can be reconstructed; those features then become the input to the next layer, and so on, with the representations becoming more refined at each level. The goal is complete automation that produces accurate, natural, and useful results, eventually yielding descriptive, paragraph-length auto-captions.
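In code, the layer-by-layer idea can be sketched in a few lines. The snippet below is a minimal, hypothetical PyTorch illustration, not any of the systems mentioned above: each auto-encoder in a stack is trained to reconstruct the output of the layer beneath it, and its learned features become the next layer's input. All layer sizes, data, and training settings are invented for the example.

```python
import torch
import torch.nn as nn

# Minimal sketch of greedy layer-wise pre-training with auto-encoders.
# Sizes and training settings here are illustrative assumptions.

def pretrain_stack(data, layer_sizes, epochs=5, lr=1e-3):
    """Greedily train one auto-encoder per layer; return the encoders."""
    encoders = []
    inputs = data
    in_dim = data.shape[1]
    for out_dim in layer_sizes:
        encoder = nn.Linear(in_dim, out_dim)   # learns the features
        decoder = nn.Linear(out_dim, in_dim)   # tries to reconstruct the input
        opt = torch.optim.Adam(
            list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            codes = torch.sigmoid(encoder(inputs))
            recon = decoder(codes)
            loss = loss_fn(recon, inputs)      # reconstruction error drives learning
            loss.backward()
            opt.step()
        encoders.append(encoder)
        # The learned features become the input to the next layer.
        inputs = torch.sigmoid(encoder(inputs)).detach()
        in_dim = out_dim
    return encoders

# Toy usage: 256-dimensional "images", two feature layers of 128 and 64 units.
fake_images = torch.rand(100, 256)
stack = pretrain_stack(fake_images, layer_sizes=[128, 64])
```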
Google has taken its research into text translation and applied similar algorithms to images. Late in 2014, Oriol Vinyals, a research scientist on the Google Brain project, and his team from Google Research (Alexander Toshev, Samy Bengio, and Dumitru Erhan) announced new technology that could auto-caption complex images, translating pictures into words. To accomplish this, they connected a convolutional neural network (CNN) to the front of a recurrent neural network (RNN) to produce a combined network that can recognize and describe what it is looking at; the CNN learned image features, and the RNN learned phrases associated with those features. The process is one of machine translation, translating pixels into English much as one language is translated into another. Likely features within the photograph are detected, the neural network maps those features to words, and a language model generates descriptive sentences. The system was pre-trained on an image dataset with suitable descriptive captions, and about half the time the network generates entirely new descriptive phrases, so the potential for new learning is enormous; the system produced captions, for example, for photos that were not in its training set.
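In rough outline, a pixels-to-English captioner can be sketched as below. This is a hypothetical PyTorch example rather than Google's implementation: a CNN turns the image into a feature vector, which conditions an LSTM that scores the next word of the caption. The choice of network, the vocabulary size, and the dimensions are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hypothetical sketch of a CNN-encoder / RNN-decoder captioner.
# Dimensions and vocabulary are illustrative assumptions.

class CaptionNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                      # in practice, pretrained on ImageNet
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The CNN turns each image into a single feature vector ...
        feats = self.cnn(images).unsqueeze(1)        # (batch, 1, embed_dim)
        # ... which is prepended to the embedded caption words, so the
        # LSTM's language model is conditioned on the image.
        words = self.embed(captions)                 # (batch, seq, embed_dim)
        seq = torch.cat([feats, words], dim=1)
        out, _ = self.rnn(seq)
        return self.to_vocab(out)                    # scores over the vocabulary

# Toy usage with random data: a batch of 2 images and 5-word captions.
model = CaptionNet(vocab_size=1000)
images = torch.rand(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 5))
scores = model(images, captions)                     # shape (2, 6, 1000)
```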
Microsoft researcher John Platt is also working on a system that automatically generates descriptive captions for an image as well as a human would. Fei-Fei Li and Andrej Karpathy of the Stanford Artificial Intelligence Laboratory have created auto-captioning software that tells the story behind an image by looking for patterns and scenes. Li has also been involved in creating the ImageNet training database of objects. Tamara L. Berg of the University of North Carolina at Chapel Hill is also training a neural network to recognize complex images and produce natural language describing them.
It is a big step from identifying meaningful features in a single photograph to identifying them in the fast-moving frames of a video. A group of researchers from the Vision Group at the International Computer Science Institute at the University of California, Berkeley, along with colleagues at several other institutions, is using long-term RNN models that map variable-length inputs, such as sequences of video frames, onto variable-length outputs, such as natural language text.
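A minimal sketch of that variable-length-to-variable-length mapping, again a hypothetical example rather than the Berkeley group's code, might pair two LSTMs: one reads a sequence of per-frame feature vectors, and the other, starting from the first one's final state, emits a sequence of words. All sizes and data below are invented.

```python
import torch
import torch.nn as nn

# Hypothetical encoder-decoder sketch for video-to-text: a variable number
# of frame features in, a variable number of words out.

class Video2Text(nn.Module):
    def __init__(self, frame_dim=512, hidden_dim=512, vocab_size=1000, embed_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, words):
        # Summarize the whole clip in the encoder's final hidden state.
        _, state = self.encoder(frame_feats)
        # Generate the sentence conditioned on that summary.
        out, _ = self.decoder(self.embed(words), state)
        return self.to_vocab(out)

# Toy usage: a 30-frame clip and an 8-word caption, with fabricated features.
model = Video2Text()
clip = torch.rand(1, 30, 512)        # per-frame features, e.g. from a CNN
caption = torch.randint(0, 1000, (1, 8))
scores = model(clip, caption)        # shape (1, 8, 1000)
```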
A further development is the question and answer system. Boris Katz of the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory has created the START natural language system and developed a patented method of natural language annotations to facilitate access to multimedia information in response to questions expressed in everyday language.
Google's Vinyals sees the main challenge as the fact that computers are not like humans, who can bring a large amount of acquired knowledge to bear on what they see (for example, the differences between breeds of dog). In practice, such acquired knowledge cannot simply be handed to a deep learning system, because the sheer scale of such an undertaking is impracticable; the dataset would require many different images of all breeds and sizes of dog, across a full range of photos, drawings, cartoons, and so on. However, the ImageNet project is working toward compiling a comprehensive database (it already includes, for example, a synset of cocker spaniel images).
Another challenge Vinyals mentions is that the current metrics used to compare computer-generated captions to those humans create, such as the Meteor Automatic Machine Translation Evaluation System and the BLEU (Bilingual Evaluation Understudy) algorithm, do not perfectly measure the real goal: generating a useful description. As Vinyals puts it, "It is thus important to come up with a meaningful metric which all scientists agree on to make real progress."
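BLEU, for instance, scores a machine-generated caption by how many of its short word sequences (n-grams) also appear in human-written reference captions. The snippet below is a small illustration using NLTK's implementation; the example captions are made up.

```python
# Small illustration of scoring a machine caption against human references
# with BLEU, using NLTK's implementation; the captions are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog standing on the grass".split(),
    "a brown dog stands in a grassy field".split(),
]
candidate = "a dog stands on the grass".split()

score = sentence_bleu(
    references, candidate,
    weights=(0.5, 0.5),                     # count unigram and bigram overlap only
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.2f}")                 # high overlap, but not a perfect match
```

A caption can score well on such overlap measures while still being an unhelpful description, which is the gap Vinyals is pointing to.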
The rewards of auto-captioning could be huge, and the potential applications are numerous.
Image search is already entering the mainstream. A free visual search and translation app called Whatzit recognizes objects in photos you take, and then translates the name of the object from English to French, Spanish, or German. With the free grocery shopping app Shopper, you can snap a photo of an item and add it to a shopping list, buy it online, and get recipes. This technology could be applicable to numerous other activities.
Logan Kugler is a freelance technology writer based in Tampa, FL. He has written for over 60 major publications.