Meta releases SAM, the first foundation model for image segmentation

Meta AI has just released the Segment Anything Model (SAM), the first foundation model for image segmentation.

SAM can segment any object in a photo or video with a single click, and it can transfer to other tasks zero-shot.

Overall, SAM follows the foundation-model playbook:

1. A simple yet scalable architecture that can handle multimodal prompts: text, key points, and bounding boxes (a sketch follows this list).

2. An intuitive labeling workflow that is tightly coupled with the model design.

3. A data flywheel that lets the model bootstrap itself onto vast numbers of unlabeled images.
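To make the prompt interface concrete, here is a minimal sketch of one-click (point-prompt) segmentation using the segment-anything package Meta released alongside the model. The checkpoint and image paths are placeholders and the click coordinates are arbitrary; note that the public code exposes point and box prompts, while text prompts are only explored in the paper.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the released ViT-H SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once; every subsequent prompt is then nearly instant.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click at pixel (x=500, y=375) is the whole prompt.
point_coords = np.array([[500, 375]])
point_labels = np.array([1])  # 1 = foreground, 0 = background

# multimask_output=True returns several candidate masks, since one click
# is ambiguous (part vs. whole object); keep the best-scoring mask.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[scores.argmax()]
```

Because the image embedding is computed once up front, swapping in a new prompt is cheap, which is what makes the interactive, "dynamic" use described below possible.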

Moreover, it is no exaggeration to say that SAM has learned a general notion of what an "object" is, and this holds even for unknown objects, unfamiliar scenes (such as underwater or under a microscope), and ambiguous cases.

In addition, SAM generalizes to new tasks and new domains without practitioners having to fine-tune the model themselves.



Commenting on the release, Tencent AI algorithm expert Jin Tian said: "The prompt paradigm from NLP has begun to extend into CV. This time, it may completely change CV's traditional approach to prediction. Now you can really use one model to segment arbitrary objects, and it's dynamic!"

Nvidia AI scientist Jim Fan went so far as to declare: we have reached the "GPT-3 moment" of computer vision!

So, does CV really no longer exist?


SAM: One-click "cutout" of all objects in any image

Segment Anything is the first foundation model dedicated to image segmentation.

Segmentation, which refers to identifying which image pixels belong to an object, has always been a core task in computer vision.

However, creating an accurate segmentation model for a specific task usually requires highly specialized work by experts. The process demands AI training infrastructure and large amounts of carefully labeled in-domain data, so the barrier to entry is extremely high.

To solve this problem, Meta proposes SAM, a foundation model for image segmentation. This promptable model, trained on diverse data, can adapt to a variety of tasks, much the way prompting is used in NLP models.

The SAM model has grasped the concept of "what an object is" and can generate masks for any object in any image or video, even for objects it never saw during training.

SAM is general enough to cover a wide variety of use cases and can be used out of the box on new imaging domains, from underwater photos to cell microscopy, without additional training. In other words, SAM already has zero-shot transfer capability.

Meta writes enthusiastically in its blog post that, in the future, any application that needs to find and segment objects in images will have a place for SAM.

SAM could become part of a larger AI system that builds a more general multimodal understanding of the world, for example, understanding both the visual and textual content of a web page.

In AR/VR, SAM could select an object based on the user's gaze and then "lift" it into 3D.

For content creators, SAM can extract image regions for collages or for video editing.

SAM can also locate and track animals or objects in videos, which is useful for research in the natural sciences and astronomy.

General Segmentation Method

Previously, there were two approaches to the segmentation problem.

The first is interactive segmentation, which can segment objects of any class but requires a human to iteratively refine the mask.

The second is automatic segmentation, which can segment specific, predefined object categories, but training requires large numbers of manually labeled masks (to segment cats, for instance, thousands of examples are needed).

In short, neither approach offers a general, fully automatic segmentation method.

SAM can be seen as a generalization of these two approaches: it handles both interactive and automatic segmentation with ease.

Through the model's promptable interface, a wide range of segmentation tasks can be accomplished simply by designing the right prompt (clicks, boxes, text, and so on) for the model.
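As a sketch of the fully automatic mode, the released code also includes an automatic mask generator, which internally prompts the model with a regular grid of points and filters the candidate masks. The checkpoint and image paths below are again placeholders.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Same released backbone as in the earlier sketch (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# One record per detected object: a binary mask ("segmentation") plus
# metadata such as the bounding box, area, and a predicted quality score.
masks = mask_generator.generate(image)
print(f"{len(masks)} objects segmented")
```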
Additionally, SAM was trained on a diverse, high-quality dataset of more than 1 billion masks, which lets it generalize to new kinds of objects and images beyond what it saw during training. As a result, practitioners no longer need to collect their own segmentation data to fine-tune a model for their use case.
This flexibility to generalize to new tasks and domains is a first for image segmentation.

