karpathy / llama2.c

Inference Llama 2 in one file of pure C

Significant Quality Degradation with q8 Quantization in Small Models

tzipproth opened this issue · comments

I have observed a significant degradation in the quality of generated text when applying q8 quantization.
The models were trained in float16.
During training, I saved the q8-quantized model alongside the original as follows:

torch.save(checkpoint, os.path.join(out_dir, "ckpt.pt"))                 # full training checkpoint
model_export(raw_model, os.path.join(out_dir, "model.bin"), version=0)   # float export (legacy format)
model_export(raw_model, os.path.join(out_dir, "modelq8.bin"), version=2) # int8 group-quantized (q8) export

The hyperparameters used were as follows:
dtype = "float16", max_seq_len = 512, dim = 256, n_layers = 8, n_heads = 8, n_kv_heads = 4, multiple_of = 4.
The model sizes after export were 18 MB for model.bin and 4.7 MB for modelq8.bin.

The non-quantized model produced reasonable text for its size. However, the q8-quantized model's output contained significantly more non-existent words. To quantify this, I conducted a test using a reduced TinyStories dataset, which contained only 8,232 distinct words but was nevertheless 3.1 GB in size.

From this dataset, I generated 50,000 tiny stories using both model.bin and modelq8.bin.

The results were as follows:

For model.bin generated text:

Total words: 7,166,263
Total occurrences of words not in the training set: 1,901
Unique words: 8,182
Unique words not in the training set: 1,486

For modelq8.bin generated text:

Total words: 7,262,807
Total occurrences of words not in the training set: 179,535
Unique words: 55,211
Unique words not in the training set: 48,494

In summary:

model.bin: Approximately 1 out of 3,769 generated words is not in the training set.
modelq8.bin: Approximately 1 out of 40 generated words is not in the training set.

This sharp increase in the frequency of non-existent words with the q8 model indicates a substantial degradation in quality. It suggests that q8 quantization may be too aggressive for small models, or that there is an underlying issue with the quantization process.
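For reference, a minimal sketch of how such an out-of-vocabulary rate could be measured (this is not the script actually used above; the simple regex tokenization is an assumption):

import re

# Hypothetical sketch: count generated words missing from the training vocabulary.
def tokenize(text):
    # Lowercase and split into word tokens; a deliberately simple rule.
    return re.findall(r"[a-z']+", text.lower())

def oov_stats(training_text, generated_text):
    vocab = set(tokenize(training_text))         # training-set vocabulary
    words = tokenize(generated_text)             # all generated word occurrences
    oov = [w for w in words if w not in vocab]   # occurrences outside the vocabulary
    return {
        "total_words": len(words),
        "oov_occurrences": len(oov),
        "unique_words": len(set(words)),
        "unique_oov": len(set(oov)),
    }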

I haven't tried it myself, but perhaps reducing the group size, and thereby increasing the accuracy, would help? It currently defaults to 64:

def version2_export(model, filepath, group_size=64):

A group size of 32 or 16 might be better. It would be interesting to see if this has an effect on the quality.
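For intuition, here is a minimal sketch of per-group symmetric int8 quantization in the spirit of version2_export (illustrative only, not the exact export code): each group of weights shares one float32 scale, so smaller groups let a single outlier weight distort fewer neighbours, at the cost of storing more scales.

import numpy as np

# Illustrative per-group int8 quantization; not the exact version2_export code.
def quantize_q8(w, group_size=64):
    w = w.reshape(-1, group_size)                # split the flat weights into groups
    scale = np.abs(w).max(axis=1) / 127.0        # one float32 scale per group
    q = np.round(w / scale[:, None]).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q8(q, scale):
    return (q.astype(np.float32) * scale[:, None]).reshape(-1)

w = np.random.randn(1 << 14).astype(np.float32)
for gs in (64, 32, 16, 8, 4):
    q, s = quantize_q8(w, group_size=gs)
    err = np.abs(dequantize_q8(q, s) - w).mean()
    print(f"group_size={gs:2d}: mean abs reconstruction error {err:.5f}")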

Thanks for that hint, it helped, but so far (group_size=16) it is not enough.
I tried: def version2_export(model, filepath, group_size=16)

The resulting quantized model size grew from 4.7 MB to 5.6 MB.

Results for model.bin (18 MB), 50,000 generated stories:

Words: 7,166,263
Words not in the training set: 1,901
Different words: 8,182
Different words not in the training set: 1,486

Results for modelq8.bin (4.7 MB), 50,000 generated stories:

Words: 7,262,807
Words not in the training set: 179,535
Different words: 55,211
Different words not in the training set: 48,494

Results for modelq8_16.bin (5.6 MB), 50,000 generated stories:

Words: 7,300,327
Words not in the training set: 76,884
Different words: 32,143
Different words not in the training set: 25,284

In summary:

model.bin (18 MB): 1 out of 3,769 generated words is not in the training set.
modelq8.bin (4.7 MB): 1 out of 40 generated words is not in the training set.
modelq8_16.bin (5.6 MB): 1 out of 95 generated words is not in the training set.

I will try group_size 8 and 4 next and see what happens.

The final results for variations of group_size in version2_export(model, filepath, group_size=...):

Model            Size      Generated words not in the training set

model.bin        18.0 MB   1/3,769
modelq8_64.bin    4.7 MB   1/40
modelq8_16.bin    5.6 MB   1/95
modelq8_8.bin     6.7 MB   garbage output
modelq8_4.bin     9.0 MB   1/1,811

This means group_size = 4 is the only setting that is accurate enough to be usable for the quantization.
In that case, the quantized model is exactly half the size of the non-quantized model.
Not sure how this result should be interpreted.
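One plausible reading of the "exactly half" observation: assuming the quantized export stores one int8 byte per weight plus one shared float32 scale per group, while model.bin stores float32 weights, the expected size ratio works out as follows (a back-of-the-envelope estimate that ignores headers and any tensors kept in float):

# Rough size estimate, assuming 1 int8 byte per weight plus one shared 4-byte
# float32 scale per group, versus 4 bytes per weight in model.bin.
def bytes_per_weight(group_size):
    return 1 + 4 / group_size

for gs in (64, 16, 8, 4):
    bpw = bytes_per_weight(gs)
    print(f"group_size={gs:2d}: {bpw:.2f} bytes/weight, {bpw / 4:.0%} of float32")

# group_size=4 gives 2 bytes/weight, i.e. exactly half of float32, which matches
# the observed 9 MB vs. 18 MB; the larger group sizes also roughly match the
# observed 4.7, 5.6 and 6.7 MB file sizes.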

@tzipproth I am looking into useful metrics for evaluating the baby llama2 model. I see you did something like a perplexity analysis. Can you share the code you used for it?