moondream

a tiny vision language model

project goals

Build a high-quality, low-hallucination vision language model small enough to run on an edge device without a GPU.

moondream0

Initial prototype built using SigLIP, Phi-1.5, and the LLaVA training dataset. The model is for research purposes only and is subject to the Phi and LLaVA license restrictions.

Examples

Image	Examples
	Q: What is the title of this book? A: The Little Book of Deep Learning Author: Francois Flaubert Q: What can you tell me about this book? A: The book in question is a small paperback titled "The Little Book of Deep Learning." It appears to be a guide or reference for those interested in deep learning, as the title suggests. The book is likely to be helpful for individuals who are new to the field or looking to expand their knowledge in this area. The image shows the book being held by a person, possibly a reader or a professional in the field. Q: What is the person holding? A: The person is holding a book, which is titled "The Little Book of Deep Learning."
	Q: What type of food is the girl holding? A: The girl is holding a sandwich. Q: What color is the woman's hair? A: The woman's hair is white. Q: What is the girl doing in the image? A: The girl in the image is eating a sandwich, specifically a hamburger, while sitting at a table.
	Q: What color is the train? A: The train is red. Q: What is the bus doing in the image? A: The bus in the image is traveling down a city street, passing by tall buildings and driving under a street light. Q: What kind of public transportation is in the image? A: The image features a red light-rail train, which is a type of public transportation.
	Q: What is the girl looking at? A: The girl is looking at her reflection in a mirror. Q: What color is the woman's hair? A: The woman has black hair. Q: What is the woman doing in the mirror? A: The woman is looking at herself in the mirror, possibly adjusting her hair or makeup.
	Q: What might be the reason for the dog's aggressive behavior? A: The dog's aggressive behavior could be attributed to the fact that it is being bitten by a person, which is a form of animal abuse. The dog's eyes are red, and it appears to be growling, which are signs of fear or pain. The aggressive behavior might be a result of the dog's attempt to protect itself or its owner from the perceived threat. It is essential to address such behavior in dogs to ensure their well-being and prevent potential harm to both the dog and the person involved. Q: What kind of dog is in the picture? A: The dog in the picture is a Chihuahua. Q: What color is the dog? A: The dog is black and white.
	Q: What is the man wearing? A: The man is wearing a large, inflatable suit or a giant inflatable costume. Q: Is he indoors or outdoors? A: The man is outdoors, standing in front of a building. Q: What is in the background? A: In the background, there is a building visible behind the person dressed up in a large amount of plastic wrap.

Usage

Clone this repository and install the dependencies:

pip install -r requirements.txt

Use the sample.py script to run the model on CPU:

python sample.py --image [IMAGE_PATH] [--interactive]

When the --interactive flag is not set, the script generates three questions about the image and tries to answer them.
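The two modes described above can be illustrated with a minimal sketch of how a script might parse these flags. This is not the contents of sample.py: the function names build_parser and select_mode, the mode labels "interactive" and "auto", the choice to make --image required, and the photo.jpg path are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage line."""
    parser = argparse.ArgumentParser(description="Run the model on one image.")
    # Assumption: the image path is mandatory.
    parser.add_argument("--image", required=True, help="Path to the input image.")
    # A boolean switch: absent means False, present means True.
    parser.add_argument(
        "--interactive",
        action="store_true",
        help="Ask for questions on the command line instead of generating them.",
    )
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Return 'interactive' when the flag is set, otherwise 'auto'."""
    return "interactive" if args.interactive else "auto"


# Example with a placeholder path; no model is loaded here.
args = build_parser().parse_args(["--image", "photo.jpg"])
print(select_mode(args))  # auto
```

In "auto" mode a real script would go on to generate and answer its own questions; in "interactive" mode it would read questions from the user. Both steps are left out of the sketch because they depend on the model code.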

Limitations

The model may generate inaccurate statements.
It may struggle to adhere to intricate or nuanced instructions.
It is primarily designed to understand English. Informal English, slang, and non-English languages may not work well.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.
