eisneim / nanoVLM

A simple multi-modal vision-language model that describes an image using only keywords.

nanoVLM

A simple multi-modal vision-language model that describes an image using only keywords.

!! Currently a WORK IN PROGRESS

Roadmap

  • image dataset preparation ☑
  • text dataset preparation ◻︎
  • nano language model ◻︎
  • OpenCLIP ViT-B/32 projection layer ◻︎
  • supervised vs. instruction fine-tuning ◻︎
  • usage examples ◻︎
  • export to ONNX ◻︎
  • add WASM for JavaScript support ◻︎

About

License: Apache License 2.0


Languages

Language: Python 100.0%