Massachusetts Institute of Technology researchers were part of a team that developed a technique that uses computer-generated data to help vision and language models better understand concepts.
The researchers used an annotated synthetic dataset to fine-tune popular vision and language models, increasing their accuracy in concept understanding by up to 10%.
They produced nearly 800,000 photorealistic images from computer-generated synthetic videos of diverse three-dimensional environments and objects, adding human avatars that interact with those objects.
A detailed caption was added to each image, covering object attributes, positional relationships, and human-object interactions.
Synthetic data allowed the researchers to create more diverse images at a lower cost than collecting real-world data, while the use of avatars preserved privacy.
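The captioning step described above — one detailed caption per image covering object attributes, positional relationships, and human-object interactions — could be sketched as follows. This is a minimal illustration, not the team's actual pipeline; the annotation schema (`objects`, `relations`, `interactions`) and the function name are hypothetical.

```python
def compose_caption(scene):
    """Compose a detailed caption from a structured scene annotation.

    `scene` is a hypothetical annotation record with the three caption
    components mentioned in the article: object attributes, positional
    relations, and human-object interactions.
    """
    # Describe each object with its attributes, e.g. "a red ceramic mug".
    parts = [f"a {' '.join(obj['attributes'])} {obj['name']}"
             for obj in scene["objects"]]
    caption = "A scene with " + " and ".join(parts) + "."

    # Add positional relationships between objects.
    for rel in scene.get("relations", []):
        caption += f" The {rel['subject']} is {rel['predicate']} the {rel['object']}."

    # Add human-object interactions performed by the avatar.
    for act in scene.get("interactions", []):
        caption += f" A person is {act['verb']} the {act['object']}."
    return caption


scene = {
    "objects": [
        {"name": "mug", "attributes": ["red", "ceramic"]},
        {"name": "table", "attributes": ["wooden"]},
    ],
    "relations": [{"subject": "mug", "predicate": "on top of", "object": "table"}],
    "interactions": [{"verb": "holding", "object": "mug"}],
}
print(compose_caption(scene))
```

In a real pipeline, captions like these would be paired with the rendered images to fine-tune a vision-language model on image-text pairs.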
From MIT News
Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA