The Feature Fields for Robotic Manipulation (F3RM) method designed by Massachusetts Institute of Technology researchers helps robots identify and grasp nearby objects by combining two-dimensional (2D) images with vision foundation models to form three-dimensional (3D) scene representations.
F3RM can be applied to real-world settings containing thousands of objects by interpreting open-ended natural-language prompts from humans.
A camera mounted on a selfie stick captures 50 2D images from different poses, which are used to build a neural radiance field that renders a 360-degree "digital twin" of the environment.
F3RM uses the Contrastive Language-Image Pre-training (CLIP) vision foundation model to enrich the geometry with semantics, distilling the 2D CLIP features of the captured images into a 3D feature field.
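The lifting of 2D image features into 3D can be sketched with a toy multi-view average: for each 3D point, project it into every camera, look up the 2D feature at that pixel, and average over the views that see it. (This is a simplified stand-in; F3RM itself distills the features into a neural field optimized alongside the radiance field. The function name and argument layout below are illustrative, not the authors' API.)

```python
import numpy as np

def lift_features_to_3d(points, world_to_cam, K, feature_maps):
    """Average 2D per-pixel features over all views that see each 3D point.

    points:        (N, 3) 3D sample points in the scene
    world_to_cam:  list of (4, 4) extrinsic matrices, one per view
    K:             (3, 3) shared pinhole intrinsics
    feature_maps:  list of (H, W, D) per-pixel 2D feature maps (e.g. CLIP features)
    Returns an (N, D) array of averaged 3D features.
    """
    n, d = len(points), feature_maps[0].shape[-1]
    accum = np.zeros((n, d))
    counts = np.zeros((n, 1))
    homog = np.hstack([points, np.ones((n, 1))])          # homogeneous coords
    for T, fmap in zip(world_to_cam, feature_maps):
        cam = (T @ homog.T).T[:, :3]                      # points in camera frame
        in_front = cam[:, 2] > 1e-6                       # only points before the camera
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-6)     # perspective divide
        h, w, _ = fmap.shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        accum[valid] += fmap[v[valid], u[valid]]          # gather 2D features
        counts[valid] += 1
    return accum / np.maximum(counts, 1)                  # per-point mean
```

Each 3D point thus inherits a semantic descriptor from whichever images observed it, which is the core idea behind turning posed 2D feature maps into a 3D feature field.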
Following a few demonstrations, the robot, when prompted, grasps previously unseen objects by applying its geometric and semantic knowledge and choosing the highest-scoring grasp candidate.
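The "highest-scoring option" step can be illustrated as ranking grasp candidates by cosine similarity between the text prompt's embedding and each candidate's pooled 3D feature. (A toy stand-in under assumed inputs; F3RM's actual scoring also incorporates the demonstrated grasp poses.)

```python
import numpy as np

def pick_best_grasp(grasp_features, text_embedding):
    """Rank candidate grasps by cosine similarity between a language query's
    embedding and each grasp's pooled 3D feature; return the winning index.

    grasp_features: (G, D) one pooled feature vector per candidate grasp
    text_embedding: (D,) embedding of the natural-language prompt
    """
    g = grasp_features / np.linalg.norm(grasp_features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = g @ t                      # cosine similarity per candidate
    return int(np.argmax(scores)), scores
```

Because CLIP embeds images and text in a shared space, the candidate whose local features best match the prompt (e.g., "the blue mug") scores highest and is executed.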
From MIT News
Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA