FauxPilot - an open-source alternative to GitHub Copilot server

CodeGen2 compatibility

gilsdav opened this issue

A new version of CodeGen has been released: https://github.com/salesforce/CodeGen2

Can't tell if the model architecture has changed – any idea?

I don't know if this is exactly what you want to know, but as far as I can see you can use it the same way (causal), and they added an infill mode. It also seems to support a lot more languages.

For infill sampling, we introduce three new special token types:

<mask_N>: N-th span to be masked. In practice, use <mask_1> where you want to sample the infill.

<sep>: Separator token between the suffix and the infilled sample. See below.

<eom>: "End-Of-Mask" token that the model will output at the end of infilling. You may use this token to truncate the output.
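
For reference, here is a minimal sketch of how these tokens fit together for infill sampling, roughly following the format shown in the CodeGen2 README (the model id Salesforce/codegen2-1B and the exact prompt layout are taken from their docs; treat the details as approximate rather than authoritative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Put <mask_1> where the infill should go, then append the suffix,
# <|endoftext|>, <sep>, and <mask_1> again to ask the model for that span.
prefix = "def hello_world():\n    "
suffix = "\n    return name"
prompt = prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(inputs.input_ids, max_new_tokens=64)
text = tokenizer.decode(output[0], skip_special_tokens=False)

# The infilled span follows the prompt and is terminated by <eom>, so truncate there.
infill = text[len(prompt):].split("<eom>")[0]
print(infill)
```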

It can probably be supported with a single modification to setup.sh.

@moyix how did you come up with the calculations & code in codegen_gptj_convert.py? It seems like this conversion from CodeGen to GPT-J is the most difficult part of supporting a new model type.

I was able to modify the python backend to support bigcode/starcoder. It's obviously really slow because we are just loading the model via the transformers library in the python backend (are we sure that is the right way to do the python backend?). I got fairly far along with the FasterTransformer conversion but stopped when I saw the math / calculations going on in codegen_gptj_convert. I haven't tried just removing it to see whether the conversion from GPTBigCode -> GPT-J is simpler than CodeGen -> GPT-J.

@moyix Let me bring some light into the dark.
After comparing which configuration CodeGen2 (trust_remote_code=True) uses versus CodeGen1, I found one obvious hyperparameter difference: mp_num = 8 instead of mp_num = 4.
After some hours of debugging, I reverse-engineered the following tweak to the permutation order, which should do the job.

Below are the explanations for the Salesforce/codegen2-1B sizes.

For the 1B model, qkv_proj has shape [6144, other]. The 6144 rows hold, across the mp_num = 8 shards, all the vectors for Q, K, and V, which now need to go into an [8, 256] shape per projection.

Toy example: if qkv_proj[:, 0] were just np.arange(0, 6144), it would go through the following transformation.

For the 1B model, qw has shape [1, 2, 8, 256]:

qw = tensor([[   0.,  768., 1536., 2304., 3072., 3840., 4608., 5376.],
             [...254 rows missing],
             [ 255., 1023., 1791., 2559., 3327., 4095., 4863., 5631.]])

The value weights get:

qv = tensor([[ 256., 1024., 1792., 2560., 3328., 4096., 4864., 5632.],
             [...254 rows missing],
             [ 511., 1279., 2047., 2815., 3583., 4351., 5119., 5887.]])

The rest goes to the key weights.

The generalized permutation vector is therefore:
```python
mp_num = 8  # CodeGen2 uses mp_num = 8 (CodeGen1 uses mp_num = 4)
base_permutation = np.arange(0, mp_num * 3).reshape(-1, 3).T.flatten().tolist()
base_permutation == [0, 3, 6, 9, 12, 15, 18, 21,
                     1, 4, 7, 10, 13, 16, 19, 22,
                     2, 5, 8, 11, 14, 17, 20, 23]
```

All you need to do is to make the permutation configurable.
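
A minimal sketch of what that could look like, assuming mp_num is the only relevant difference between CodeGen1 (mp_num = 4) and CodeGen2 (mp_num = 8); the function name and the local_dim derivation below are my own illustration, not taken from the actual codegen_gptj_convert.py:

```python
import numpy as np
import torch

def qkv_permutation(embed_dim: int, mp_num: int) -> torch.Tensor:
    """Row permutation that gathers CodeGen's fused qkv_proj weight,
    interleaved across mp_num model-parallel shards, into a contiguous
    per-projection layout (query block, then value, then key, as in the
    toy example above)."""
    # mp_num = 4 -> [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]        (CodeGen1)
    # mp_num = 8 -> [0, 3, 6, ..., 21, 1, 4, ..., 22, 2, ..., 23]  (CodeGen2)
    base_permutation = np.arange(mp_num * 3).reshape(-1, 3).T.flatten().tolist()
    local_dim = embed_dim // mp_num  # 2048 // 8 = 256 for codegen2-1B
    return torch.cat([torch.arange(i * local_dim, (i + 1) * local_dim)
                      for i in base_permutation])

# usage: reorder the fused weight's rows before splitting it into q/v/k
# qkv_proj = qkv_proj[qkv_permutation(embed_dim=2048, mp_num=8), :]
```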

Anyhow, the Triton server is not really performant compared to CTranslate2 (https://github.com/OpenNMT/CTranslate2). CTranslate2 can also do batching, and there is no need to pad inputs to certain shapes in the FastAPI proxy. (CTranslate2-codegen2 on int8 CPU is around 4.1x faster and takes ~4x less memory than huggingface-codegen2.)

I'll try to add models for CodeGen1 and CodeGen2 in all sizes for the CTranslate2 framework, stay tuned.
https://github.com/OpenNMT/CTranslate2/pull/1230/files

Oops, really sorry that I didn't see this before you figured it out on your own. I wrote up an article explaining how the permutation was derived here:

https://gist.github.com/moyix/7896575befbe1b99162ccfec8d135566

I'll look into Ctranslate2 – are there gains over FT when using GPUs for inference?

Not sure about FT. On my GPU:
Task: 16 input tokens -> generate exactly 64 tokens, ten times.
Timings:

  • ct2 codegen2-7B on float16 = 9.55 seconds (67 tokens/s, 1x GPU, 7 GB VRAM)
  • huggingface codegen2-7B on int8 = 17.06 seconds (37.5 tokens/s, 1x GPU, 7 GB VRAM)

For the smaller models (2B etc.) it should be more like a 3x speedup; for larger models the tensor sizes benefit less from the C++ implementation, so for 16B it is more like 1.5x.
Most importantly, ct2 takes only half the memory. I am not sure about the speeds of FT (I think you wrote ~2x speedup at some point).
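
For anyone wanting to reproduce the huggingface side of this comparison, here is a rough sketch of the timing loop described above; the prompt, dtype, and generation settings are placeholders, so exact numbers will vary:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

# short prompt standing in for the 16-token input used above
input_ids = tokenizer("def quicksort(arr):", return_tensors="pt").input_ids.to("cuda")

start = time.time()
for _ in range(10):
    model.generate(
        input_ids,
        min_new_tokens=64,  # force exactly 64 generated tokens per run
        max_new_tokens=64,
        do_sample=False,
    )
elapsed = time.time() - start
print(f"{10 * 64 / elapsed:.1f} tokens/s over {elapsed:.2f} s")
```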

Edit: I found another of your markdown posts, which helped me to derive the codegen2 conversion.

Here are some benchmarks for codegen2 on FasterTransformer. This is with A6000s.

codegen2-1B   on 4 GPUs generated 16+64 tokens in  0.18s ~349.40 tokens/sec
codegen2-1B   on 2 GPUs generated 16+64 tokens in  0.19s ~337.68 tokens/sec
codegen2-1B   on 1 GPU  generated 16+64 tokens in  0.25s ~253.55 tokens/sec
codegen2-3_7B on 4 GPUs generated 16+64 tokens in  0.29s ~220.00 tokens/sec
codegen2-3_7B on 2 GPUs generated 16+64 tokens in  0.43s ~148.69 tokens/sec
codegen2-3_7B on 1 GPU  generated 16+64 tokens in  0.73s ~ 87.47 tokens/sec
codegen2-7B   on 4 GPUs generated 16+64 tokens in  0.51s ~125.93 tokens/sec
codegen2-7B   on 2 GPUs generated 16+64 tokens in  0.80s ~ 80.26 tokens/sec
codegen2-7B   on 1 GPU  generated 16+64 tokens in  1.38s ~ 46.26 tokens/sec
codegen2-16B  on 4 GPUs generated 16+64 tokens in  0.99s ~ 64.97 tokens/sec
codegen2-16B  on 2 GPUs generated 16+64 tokens in  1.68s ~ 38.13 tokens/sec
codegen2-16B  on 1 GPU  generated 16+64 tokens in  3.10s ~ 20.61 tokens/sec

Do you have a comparison against the transformers float16 or bitsandbytes int8 versions? I can't benchmark these myself.

While you are at it, you can pull the CTranslate2 model from here; it should just take 2-3 minutes to install plus a download, see:
https://huggingface.co/michaelfeil/ct2fast-codegen2-7B
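
In case it is useful, here is a minimal sketch of pulling and running that converted model with plain ctranslate2 (I have not verified this exact snippet against the repo above; the tokenizer id, device, and generation parameters are assumptions):

```python
import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# download the converted weights and load them with CTranslate2
model_path = snapshot_download("michaelfeil/ct2fast-codegen2-7B")
generator = ctranslate2.Generator(model_path, device="cuda", compute_type="float16")
# for the CPU/int8 numbers discussed earlier: device="cpu", compute_type="int8"

# assuming the original Salesforce tokenizer matches the converted model
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def fibonacci(n):"))

# generate_batch takes ragged token lists directly, so no padding is needed
results = generator.generate_batch([prompt_tokens], max_length=64, sampling_topk=1)
print(tokenizer.decode(results[0].sequences_ids[0]))
```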

Do we have any updates on this?