🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model	VQAv2	GQA	TextVQA	POPE	TallyQA
moondream1	74.7	57.9	35.6	-	-
moondream2 (latest)	74.2	58.5	36.4	(coming soon)	(coming soon)

Examples

Image	Example
	What is the girl doing? The girl is sitting at a table, eating a burger. What color is the girl's hair? White
	What is this? A metal stand is positioned in the center of the image, with CPUs and wires visible. The background features a wall, and a black object is situated in the top left corner. What is behind the stand? A wall made of red bricks is visible behind the stand, which holds several electronic devices and wires.

Usage

Using transformers (recommended)

pip install transformers timm einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
# trust_remote_code=True is needed because the model code ships with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision="2024-03-04"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="2024-03-04")

# Encode the image once, then ask a question about the encoded image.
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.
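
Because encode_image and answer_question are separate calls, one image encoding can be reused for several questions. The following is a minimal sketch, assuming the two methods behave as in the snippet above; ask_many is a hypothetical helper and is not part of this repository.

```python
from typing import Any, Dict, Iterable


def ask_many(model: Any, tokenizer: Any, image: Any,
             questions: Iterable[str]) -> Dict[str, str]:
    """Encode the image once, then answer each question against it.

    Hypothetical helper: assumes `model` exposes encode_image() and
    answer_question() with the call shapes used in the snippet above.
    """
    enc_image = model.encode_image(image)  # single encoding pass
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }
```

With the model, tokenizer, and image loaded as above, ask_many(model, tokenizer, image, ["What is this?", "What is behind the stand?"]) returns a dict that maps each question to its answer.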

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a command-line interface for running the model. If the --prompt argument is omitted, the script lets you ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
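
The one-shot versus interactive behaviour can be sketched with argparse. This is an illustration only, not the actual contents of sample.py: the function names, the answer callback, and the choice to stop on an empty line are all assumptions.

```python
import argparse
from typing import Callable, List


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the command above; everything else is assumed.
    parser = argparse.ArgumentParser(description="Ask questions about an image.")
    parser.add_argument("--image", required=True, help="path to the input image")
    parser.add_argument("--prompt", default=None,
                        help="question to ask; omit for interactive mode")
    return parser


def run(argv: List[str], answer: Callable[[str], str],
        read_line: Callable[[str], str] = input) -> List[str]:
    """Answer one prompt, or loop interactively when --prompt is absent."""
    args = build_parser().parse_args(argv)
    answers: List[str] = []
    if args.prompt is not None:
        answers.append(answer(args.prompt))
    else:
        while True:
            try:
                question = read_line("> ")
            except EOFError:
                break
            if not question.strip():  # assumed: an empty line ends the session
                break
            answers.append(answer(question))
    for text in answers:
        print(text)
    return answers
```

Here answer stands in for a call such as model.answer_question on an already encoded image, so the control flow can be exercised without loading the model.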

Use the gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface that uses your webcam as input and performs inference in real time.

python webcam_gradio_demo.py

Limitations

The model may generate inaccurate statements and may struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.
