Researchers at the Massachusetts Institute of Technology (MIT), Columbia University, and IBM have trained a hybrid language-vision machine learning model to recognize abstract concepts in video.
The researchers used the WordNet word-meaning database to map how each action-class label in MIT's Multi-Moments in Time and DeepMind's Kinetics datasets relates to the other labels in both datasets.
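This label-mapping step can be pictured with a small sketch using NLTK's interface to WordNet; it is illustrative only, not the researchers' code, and the action labels below are hypothetical stand-ins for dataset classes:

```python
# A minimal sketch, not the researchers' code: relating two hypothetical action
# labels through WordNet's verb hierarchy via their lowest common hypernyms.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # fetch the WordNet data if not already present

def shared_hypernyms(label_a: str, label_b: str):
    """Return the lowest common hypernyms of the first verb sense of each label."""
    senses_a = wn.synsets(label_a, pos=wn.VERB)
    senses_b = wn.synsets(label_b, pos=wn.VERB)
    if not senses_a or not senses_b:
        return []
    return senses_a[0].lowest_common_hypernyms(senses_b[0])

# Hypothetical labels standing in for action classes from the two datasets.
print(shared_hypernyms("jog", "swim"))
```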
The model was trained on this graph of abstraction classes to generate a numerical representation for each video that aligns with word representations of the depicted actions, and then to combine those representations into a new set that identifies abstractions common to all the videos.
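The alignment step can be pictured with a toy projection of video features into a word-embedding space; the dimensions, placeholder features, and mean-pooling below are illustrative assumptions, not the published architecture:

```python
# A minimal sketch, not the published model: project stand-in video features into a
# word-embedding space, align them to an action word with a cosine objective, and
# mean-pool the set to score an abstraction shared by all the videos.
import torch
import torch.nn.functional as F

video_feats = torch.randn(4, 512)    # placeholder features for 4 related videos
word_vec = torch.randn(300)          # placeholder embedding of an abstract action word

project = torch.nn.Linear(512, 300)  # learned projection into the word-embedding space

aligned = project(video_feats)                                   # (4, 300) video vectors
align_loss = 1 - F.cosine_similarity(aligned, word_vec.expand_as(aligned)).mean()

set_repr = aligned.mean(dim=0)                                   # one vector for the whole set
shared_score = F.cosine_similarity(set_repr, word_vec, dim=0)    # how well the set matches the word
```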
When compared with humans performing the same visual reasoning tasks online, the model matched human performance in many situations.
MIT's Aude Oliva said, "A model that can recognize abstract events will give more accurate, logical predictions and be more useful for decision-making."
From MIT News