Haotian-Zhang / ml-ferret

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Alt text for the image Ferret: Refer and Ground Anything Anywhere at Any Granularity

An end-to-end MLLM that can accept any-form referring and ground anything in response.*

Haoxuan You*, Haotian Zhang*, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang [*: equal contribution]

Overview


Diagram of Ferret Model.

Key Contributions:

  • Ferret Model - Hybrid Region Representation + Spatial-aware Visual Sampler enable fine-grained and open-vocabulary referring and grounding in MLLM.
  • GRIT Dataset (~1.1M) - A Large-scale, Hierarchical, Robust ground-and-refer instruction tuning dataset.
  • Ferret-Bench - A multimodal evaluation benchmark that jointly requires Referring/Grounding, Semantics, Knowledge, and Reasoning.

About

License:Other