saprrow / moondream

tiny vision language model

Home Page:https://moondream.ai

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Website | Hugging Face | Demo

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model VQAv2 GQA TextVQA TallyQA (simple) TallyQA (full)
moondream1 74.7 57.9 35.6 - -
moondream2 (latest) 76.8 60.6 46.4 79.6 73.3

Examples

Image Example
What is the girl doing?
The girl is eating a hamburger.

What color is the girl's hair?
The girl's hair is white.
What is this?
This is a computer server rack, specifically designed for holding multiple computer processors and other components. The rack has multiple shelves or tiers, each holding several processors, and it is placed on a carpeted floor. The rack is filled with various computer parts, including processors, wires, and other electronic devices.

What is behind the stand?
There is a brick wall behind the stand.

Usage

Using transformers (recommended)

pip install transformers timm einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-13"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")

Batch inference is also supported.

answers = moondream.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a CLI interface for running the model. When the --prompt argument is not provided, the script will allow you to ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]

Use gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface for the model that uses your webcam as input and performs inference in real-time.

python webcam_gradio_demo.py

Limitations

  • The model may generate inaccurate statements, and struggle to understand intricate or nuanced instructions.
  • The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
  • The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.

About

tiny vision language model

https://moondream.ai

License:Apache License 2.0


Languages

Language:Python 100.0%