llama.onnx

I'm here to release

  • llama 7B onnx models
  • and a 400-line Python script, with no torch dependency, to run them

So you can quantize the model partially and optimize the kernels step by step.
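
Since everything ships as plain ONNX graphs, partial quantization can be done with onnxruntime's own tooling. A minimal sketch, using a placeholder file name decoder.onnx (substitute the real file names from the download below):

# Partial INT8 weight quantization with onnxruntime's dynamic quantizer.
# "decoder.onnx" is a placeholder; use the downloaded file names instead.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="decoder.onnx",        # original graph
    model_output="decoder-int8.onnx",  # quantized graph
    weight_type=QuantType.QInt8,       # store weights as int8
    nodes_to_exclude=[],               # list node names here to keep sensitive parts in fp32
)

Quantizing one graph at a time, and excluding nodes that hurt accuracy, is what makes the step-by-step optimization above practical.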

How to use

Please download the models here.

These models were converted from alpaca huggingface; here is the graph for calling them:

Try the onnxruntime demo; no torch is required, and the precision has been checked.

$ python3 -m pip install -r requirements.txt
$ python3 demo-single.py ${ONNX_DIR} "bonjour"
..
Bonjour.
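
If you want to drive the models yourself, a single forward step with onnxruntime and numpy looks roughly like the sketch below. The file name decoder.onnx and the example token ids are assumptions for illustration; query the real input names with get_inputs().

# A rough sketch of one decoding step with onnxruntime only (no torch).
# "decoder.onnx" and the token ids are placeholders, not the repo's real names.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])          # discover the real input names

input_ids = np.array([[1, 20170]], dtype=np.int64)  # ids produced by tokenizer.model
logits = sess.run(None, {sess.get_inputs()[0].name: input_ids})[0]
next_token = int(np.argmax(logits[0, -1]))          # greedy pick, matching the notes below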

Updates

2023/04/?? add memory plan, add temperature warp

2023/04/07 add onnxruntime demo and tokenizer.model (don't forget to download it)

2023/04/05 init project

Notes

  1. No logits_warper, logits_processor, or BeamSearch is implemented yet, so the results may be poor. Please wait for the next version! (A sketch of the missing warpers follows this list.)
  2. I have compared the output values of onnxruntime-cpu and torch-cuda; the maximum error is 0.002, which is not bad.
  3. The current behavior is equivalent to the following generation configuration:
temperature=1.0
total_tokens=2000
top_p=1.0
top_k=None
repetition_penalty=1.0
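
For reference, the temperature and top-p warpers missing per note 1 are a standard technique; a minimal numpy sketch (not the repo's eventual implementation) looks like this:

# Standard temperature + nucleus (top-p) sampling over a 1-D logits vector.
import numpy as np

def warp_and_sample(logits, temperature=1.0, top_p=1.0):
    logits = logits / max(temperature, 1e-6)         # temperature warp
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                      # softmax
    order = np.argsort(-probs)                       # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    keep = order[cumulative - probs[order] < top_p]  # smallest set covering top_p mass
    kept = np.zeros_like(probs)
    kept[keep] = probs[keep]
    kept = kept / kept.sum()                         # renormalize the kept tokens
    return int(np.random.choice(len(probs), p=kept))

With temperature=1.0 and top_p=1.0 this reduces to plain sampling from the softmax, consistent with the configuration in note 3.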

Acknowledgements

License

GPLv3 and why
