meta-llama / codellama

Inference code for CodeLlama models

Where are the docs for the `llama` API?

antonkratz opened this issue · comments

I have difficulty finding docs for the llama API. For example, where is the meaning of the parameters defined (some, like temperature and top_p, I can guess, but others...), and how is generator meant to be used, etc.

Hi @antonkratz, please take a look at example_completion.py, example_infilling.py, and example_instructions.py for some usage examples. Are there specific parameters you're wondering about?

I want to use CodeLlama-7b-Instruct interactively, i.e. I want to be able to have a back-and-forth style conversation about the generated code.

I already managed to install Code Llama and run it on my own infrastructure, i.e. I can run:
torchrun --nproc_per_node 1 example_instructions.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4

But how do I get from here to an interaction? Does each call to generator.chat_completion "reset" the state? Or how can I write my script so that the next prompt "stays in the same conversation"? My terminology may be a bit off here.

A related issue is that, for this, it is critical to be able to work interactively, but I do not know how to achieve that because I apparently must run everything via torchrun?!

P.S.: Also, how can I query generator so that I get the tokens back one after another, for interactive display, i.e. so I can display output while it is still generating?

P.P.S.: @jgehring I just realized you are one of the authors of the Code Llama paper, congratulations on this wonderful work!

For including CodeLlama in real applications, I would recommend building on top of other open-source inference engines. This repo serves as a reference implementation, whereas projects such as transformers or ollama offer more in terms of bells and whistles and/or inference speed. I suggest you check out a few inference engines for Llama models; I'm sure you'll find something that fits your requirements.

For completeness, let me briefly answer your concrete questions, though:

  • If you want to perform an ongoing dialog, you'd have to keep track of the individual turns and provide chat_completion() with the full history on each call (see the first sketch after this list).
  • torchrun doesn't prevent you from interacting with stdin/stdout; it's a thin wrapper that helps set up the torch distributed machinery. I think for the 7B model you could also get away without torchrun, but the bigger models need multiple GPUs (and hence multiple processes), and writing interactive applications then gets a bit unwieldy.
  • Streaming generation is not implemented in this repo, but the token-by-token loop in llama/generation.py could be repurposed to yield individual tokens (see the second sketch below).
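
For the multi-turn case, here is a minimal sketch. It assumes the interface shown in example_instructions.py, where chat_completion() takes a batch of dialogs made of {"role", "content"} messages and returns the reply under result["generation"]; the sampling values are just placeholders. Since chat_completion() itself is stateless, the full history has to be re-sent on every turn:

```python
# Sketch of an interactive chat loop -- not a drop-in script.
# Assumes the Llama interface used in example_instructions.py; parameter
# names (temperature, top_p, max_gen_len) follow those examples.
from llama import Llama

generator = Llama.build(
    ckpt_dir="CodeLlama-7b-Instruct/",
    tokenizer_path="CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=2048,   # the accumulated history must fit within this limit
    max_batch_size=1,
)

dialog = []  # accumulated conversation: alternating user/assistant messages

while True:
    user_input = input("You: ")
    dialog.append({"role": "user", "content": user_input})

    # chat_completion is stateless: pass the *entire* history each turn.
    results = generator.chat_completion(
        [dialog],        # batch of one dialog
        max_gen_len=None,
        temperature=0.2,
        top_p=0.95,
    )
    reply = results[0]["generation"]
    print(f"Assistant: {reply['content']}")

    # Append the model's answer so the next turn continues the conversation.
    dialog.append({"role": reply["role"], "content": reply["content"]})
```

You would still launch this via torchrun (or set up the distributed environment yourself); the input()/print() interaction works fine inside a single-process torchrun job.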
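
And for streaming, a rough sketch of how that loop could be turned into a Python generator that yields decoded text as it is produced. This is illustrative only: it assumes the Llama object exposes .model and .tokenizer as in llama/generation.py, ignores the chat prompt format, and uses plain greedy decoding instead of the repo's top-p sampling.

```python
import torch

def stream_tokens(generator, prompt: str, max_gen_len: int = 256):
    """Yield the decoded completion incrementally (greedy decoding, batch of 1)."""
    tokenizer, model = generator.tokenizer, generator.model

    prompt_tokens = tokenizer.encode(prompt, bos=True, eos=False)
    tokens = torch.tensor([prompt_tokens], dtype=torch.long, device="cuda")

    prev_pos = 0
    for _ in range(max_gen_len):
        # Forward only the positions the model has not seen yet; the KV cache
        # inside the Transformer keeps track of everything before prev_pos.
        logits = model.forward(tokens[:, prev_pos:], prev_pos)
        next_token = torch.argmax(logits[:, -1], dim=-1)  # greedy pick

        if next_token.item() == tokenizer.eos_id:
            break

        prev_pos = tokens.shape[1]
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)

        # Re-decode the whole generated suffix each step; decoding one token
        # at a time can split multi-byte characters mid-way.
        yield tokenizer.decode(tokens[0, len(prompt_tokens):].tolist())
```

Used e.g. as `for text in stream_tokens(generator, "def fib(n):"): ...`, printing each partial string over the previous one to get a live display.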