🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model	VQAv2	GQA	TextVQA	TallyQA (simple)	TallyQA (full)
moondream1	74.7	57.9	35.6	-	-
moondream2 (latest)	76.8	60.6	46.4	79.6	73.3

Examples

What is the girl doing?
The girl is eating a hamburger.

What color is the girl's hair?
The girl's hair is white.

What is this?
This is a computer server rack, specifically designed for holding multiple computer processors and other components. The rack has multiple shelves or tiers, each holding several processors, and it is placed on a carpeted floor. The rack is filled with various computer parts, including processors, wires, and other electronic devices.

What is behind the stand?
There is a brick wall behind the stand.

Usage

Using transformers (recommended)

pip install transformers timm einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-13"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.
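
The snippet above encodes the image once and then passes that encoding to answer_question, so several questions about the same image can share one encoding. A minimal sketch of that pattern follows. ask_many is a hypothetical helper, not part of the model's API, and it relies only on the encode_image and answer_question calls shown above.

```python
from typing import Any, Dict, Iterable


def ask_many(model: Any, tokenizer: Any, image: Any, questions: Iterable[str]) -> Dict[str, str]:
    """Answer several questions about one image (hypothetical helper)."""
    # Encode the image a single time, using the same call as the snippet above.
    enc_image = model.encode_image(image)
    # Reuse that one encoding for every question.
    return {q: model.answer_question(enc_image, q, tokenizer) for q in questions}
```

Calling ask_many(model, tokenizer, image, ["What is this?", "What is behind the stand?"]) would return a dictionary keyed by question.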

To enable Flash Attention on the text model, pass attn_implementation="flash_attention_2" when instantiating the model.

import torch

# Flash Attention 2 needs half precision and a GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
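
On machines without a GPU or without the flash-attn package, the Flash Attention option cannot be used, so one possible approach is to add the argument only when both are present. The sketch below is an assumption about how you might structure that check. flash_attention_kwargs and flash_attn_installed are hypothetical helper names, and the caller is expected to supply the CUDA availability flag (for example from torch.cuda.is_available()).

```python
import importlib.util
from typing import Dict


def flash_attn_installed() -> bool:
    """Report whether the flash_attn package can be imported."""
    return importlib.util.find_spec("flash_attn") is not None


def flash_attention_kwargs(cuda_available: bool, flash_installed: bool) -> Dict[str, str]:
    """Return extra from_pretrained keyword arguments (hypothetical helper)."""
    # Request Flash Attention only when a GPU and the package are both available.
    if cuda_available and flash_installed:
        return {"attn_implementation": "flash_attention_2"}
    # Otherwise add nothing and let the library use its default attention.
    return {}


# Possible usage, assuming torch is installed:
#   import torch
#   extra = flash_attention_kwargs(torch.cuda.is_available(), flash_attn_installed())
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, trust_remote_code=True, revision=revision, **extra
#   )
```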

Batch inference is also supported.

# `model` and `tokenizer` are the objects loaded above.
answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
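
When there are more image and prompt pairs than fit comfortably in memory, one option is to feed them to batch_answer in fixed-size groups. The sketch below shows only the grouping step. chunk_pairs is a hypothetical helper, and the usage comment assumes that batch_answer returns a list of answers in the same order as the prompts, which the text above does not state.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chunk_pairs(
    images: Sequence[T], prompts: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[T], List[str]]]:
    """Yield aligned (images, prompts) groups of at most batch_size items."""
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(images), batch_size):
        end = start + batch_size
        yield list(images[start:end]), list(prompts[start:end])


# Possible usage (assumes answers come back as a list, in prompt order):
#   answers = []
#   for imgs, qs in chunk_pairs(all_images, all_prompts, batch_size=4):
#       answers.extend(model.batch_answer(images=imgs, prompts=qs, tokenizer=tokenizer))
```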

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a command-line interface for running the model. When the --prompt argument is not provided, the script lets you ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
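
The behavior described above, a single answer when --prompt is given and an interactive session otherwise, can be illustrated with argparse. This is not the source of sample.py. It is a sketch of the described behavior, build_parser and is_interactive are hypothetical names, and making --image required is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags named above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Ask questions about an image.")
    # Assumption: the image path is mandatory.
    parser.add_argument("--image", required=True, help="path to the input image")
    # Leaving --prompt out selects interactive mode.
    parser.add_argument("--prompt", default=None, help="question to ask once")
    return parser


def is_interactive(argv: Optional[List[str]] = None) -> bool:
    """Return True when no --prompt was supplied on the command line."""
    args = build_parser().parse_args(argv)
    return args.prompt is None
```

A script built this way would answer once and exit when is_interactive returns False, and otherwise loop over questions read from standard input.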

Use the gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface for the model that uses your webcam as input and performs inference in real time.

python webcam_gradio_demo.py

Limitations

The model may generate inaccurate statements and may struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.
