Which dataset should I use?

Question

ccccj opened this issue 9 months ago · comments

Hello, I have a question. I have a LLaMA-series model that has been fine-tuned on my own dataset. If I want to quantize it with SpQR, should I still use data/red_pajama_n=1024.pth as the dataset parameter, or should I use the dataset I used for fine-tuning?
Looking forward to your response!

Poedator · Answer 1 · Thu Nov 23 2023 17:44:26 GMT+0800 (China Standard Time)

Hello @ccccj,
If you are focused on the best performance in a specific domain (presumably this is the reason for having your own dataset), then you may get slightly better results by using your own dataset for SpQR quantization. Just take a subset comparable in size to data/red_pajama_n=1024.pth.
red_pajama should also give decent results. If you can try both, please write back here with your quality measurements.
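
As an illustration of "take a subset comparable in size", here is a minimal sketch of drawing a fixed number of fixed-length token windows from an already tokenized fine-tuning set. Everything in it is an assumption rather than part of the SpQR code: the function name `sample_calibration_windows` is hypothetical, `n_samples=1024` is only read off the `n=1024` in the file name, `seq_len=2048` is a placeholder, and the tokenization step and the on-disk format SpQR expects are not covered, so check the repository's data-loading code before relying on it.

```python
import random


def sample_calibration_windows(token_docs, n_samples=1024, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len token IDs.

    token_docs is a list of documents, each a list of integer token IDs
    (hypothetical input; produce it with the tokenizer of your model).
    """
    rng = random.Random(seed)
    # Only documents with at least seq_len tokens can supply a full window.
    eligible = [doc for doc in token_docs if len(doc) >= seq_len]
    if not eligible:
        raise ValueError("no document has at least seq_len tokens")
    windows = []
    for _ in range(n_samples):
        doc = rng.choice(eligible)
        # randint is inclusive on both ends, so the window always fits.
        start = rng.randint(0, len(doc) - seq_len)
        windows.append(doc[start:start + seq_len])
    return windows


if __name__ == "__main__":
    # Toy stand-in for a tokenized fine-tuning set.
    docs = [list(range(100)), list(range(1000, 1050)), list(range(5))]
    calib = sample_calibration_windows(docs, n_samples=8, seq_len=16)
    print(len(calib), len(calib[0]))
```

Fixing the seed keeps the subset reproducible, which makes a comparison against the red_pajama run easier to repeat.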