mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention

Home Page: https://arxiv.org/abs/2004.11886

Model Compression

kalyangvs opened this issue · comments

Hi, can you please provide the code used to compress the model by 18.2× with pruning and quantization?
Thanks.

@Michaelvll Does the quantized-plus-pruned model also include other data such as last_optimizer_state, optimizer_history, etc.?

Thank you for asking! We are still cleaning up the compression code. We quantized the model parameters to 8 bits and applied sensitivity pruning to the model with NervanaSystems/distiller. We only counted the model size, since the optimizer states are not used during inference.
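While the official compression code is being cleaned up, here is a minimal sketch of what that reply describes; the checkpoint path, the symmetric per-tensor 8-bit scheme, and the size accounting are assumptions for illustration, not the authors' exact pipeline.

```python
import torch

# Hypothetical checkpoint path; fairseq checkpoints are dicts that also carry
# entries such as 'last_optimizer_state' and 'optimizer_history', which are
# ignored here because they are not needed for inference.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
state_dict = ckpt["model"]

def fake_quantize_int8(t):
    """Simulate symmetric per-tensor 8-bit quantization of a float tensor."""
    scale = t.abs().max() / 127.0
    if scale == 0:
        return t
    return (t / scale).round().clamp(-128, 127) * scale

quantized = {name: fake_quantize_int8(w) for name, w in state_dict.items()}

# Model size as reported for inference: parameters only, no optimizer state.
fp32_bytes = sum(w.numel() * 4 for w in state_dict.values())  # 32-bit floats
int8_bytes = sum(w.numel() * 1 for w in state_dict.values())  # 8-bit weights
print(f"fp32 parameters:  {fp32_bytes / 1e6:.1f} MB")
print(f"8-bit parameters: {int8_bytes / 1e6:.1f} MB")
```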

@Michaelvll Can you please provide the distiller config that was used?

Also, do you prune individual weights or whole channels/filters/heads?

For simplicity, we use sensitivity pruning for our model, which is fine-grained pruning, i.e., pruning individual weights. You can try it on the configuration for the WMT En-Fr model with 527M #Mult-Adds.
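As a rough illustration of what fine-grained (per-weight) sensitivity pruning does, the sketch below zeroes out individual weights whose magnitude falls below a per-layer threshold expressed as a multiple of that layer's weight standard deviation. The layer names and sensitivity values are placeholders, not the configuration used for the paper, and this is not a substitute for the actual distiller schedule.

```python
import torch

def sensitivity_prune(state_dict, sensitivities):
    """Zero out individual weights with |w| < s * std(w) for each listed layer.

    This is element-wise (fine-grained) pruning: tensor shapes are kept and
    only individual entries are set to zero, unlike channel/filter pruning.
    """
    pruned = {}
    for name, w in state_dict.items():
        s = sensitivities.get(name)
        if s is None or not torch.is_floating_point(w):
            pruned[name] = w
            continue
        threshold = s * w.std()
        mask = w.abs() >= threshold
        pruned[name] = w * mask
    return pruned

# Placeholder layer names and sensitivities, purely for illustration.
sensitivities = {
    "encoder.layers.0.fc1.weight": 0.3,
    "encoder.layers.0.fc2.weight": 0.3,
}
# pruned_sd = sensitivity_prune(model.state_dict(), sensitivities)
```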

Could you share some more information on how you quantize the model? Did you use NervanaSystems/distiller for quantization?