Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Model Summary

Welcome to the Florence-2 repository! This repository contains a Hugging Face transformers implementation of the Florence-2 model, developed by Microsoft.

Florence-2 is a vision foundation model that handles a diverse array of vision and vision-language tasks through a prompt-based approach. It interprets simple text prompts to perform tasks such as captioning, object detection, and segmentation. For multi-task learning it draws on the FLD-5B dataset, which includes 5.4 billion annotations across 126 million images. Florence-2's sequence-to-sequence architecture lets it perform well in both zero-shot and fine-tuned settings, making it a competitive and versatile vision foundation model.

For more details, you can read the technical paper.

Features

Prompt-based Approach: Handles a wide range of vision tasks with simple text prompts.
Multi-task Learning: Leverages the extensive FLD-5B dataset to master multiple tasks.
Sequence-to-Sequence Architecture: Excels in zero-shot and fine-tuned settings.
Vision and Vision-Language Tasks: Capable of captioning, object detection, segmentation, and more.
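
The prompt-based approach described above can be sketched as follows. This is a minimal illustration, not this repository's own code. It assumes the publicly released microsoft/Florence-2-base checkpoint and the task tokens listed on its Hugging Face model card (<CAPTION>, <OD>, <REFERRING_EXPRESSION_SEGMENTATION>). The names TASK_PROMPTS, build_prompt, and run_task are hypothetical helpers introduced only for this example.

```python
# Task tokens as listed on the public Florence-2 model card.
# Assumption: this repository uses the same checkpoints and prompt vocabulary.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text_input=""):
    """Return the text prompt for a task: the task token plus optional extra text."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PROMPTS[task] + text_input


def run_task(image, task, text_input="", model_id="microsoft/Florence-2-base"):
    """Run one task on a PIL image and return the parsed result.

    Hypothetical helper: it loads the model on every call for simplicity.
    """
    # Imported here so build_prompt works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # The model emits its answer as a token sequence (sequence-to-sequence).
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]

    # Convert the raw text into task-specific output (caption, boxes, polygons).
    return processor.post_process_generation(
        generated_text,
        task=TASK_PROMPTS[task],
        image_size=(image.width, image.height),
    )
```

With Pillow installed, a call such as run_task(Image.open("photo.jpg").convert("RGB"), "object_detection") would run detection on a local image, where photo.jpg is a placeholder file name. Only the prompt changes between tasks; the model and the generation call stay the same.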

