mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention

Home Page: https://arxiv.org/abs/2004.11886

Model Compression

kalyangvs opened this issue · comments

Hi, can you please provide the code used to compress the model by 18.2× with pruning and quantization?
Thanks.

@Michaelvll Does the quantized-plus-pruned model also include other data such as last_optimizer_state, optimizer_history, etc.?

Thank you for asking! We are still cleaning up the compression code. We quantized the model parameters to 8 bits and applied sensitivity pruning to the model with NervanaSystems/distiller. We only counted the model size, since the optimizer states are not used during inference.
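While the official compression code is being cleaned up, here is a minimal sketch of what that reply describes; the checkpoint path, the symmetric per-tensor 8-bit scheme, and the size accounting are assumptions for illustration, not the authors' exact pipeline.

```python
import torch

# Hypothetical checkpoint path; fairseq checkpoints are dicts that also carry
# entries such as 'last_optimizer_state' and 'optimizer_history', which are
# ignored here because they are not needed for inference.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
state_dict = ckpt["model"]

def fake_quantize_int8(t):
    """Simulate symmetric per-tensor 8-bit quantization of a float tensor."""
    scale = t.abs().max() / 127.0
    if scale == 0:
        return t
    return (t / scale).round().clamp(-128, 127) * scale

quantized = {name: fake_quantize_int8(w) for name, w in state_dict.items()}

# Model size as reported for inference: parameters only, no optimizer state.
fp32_bytes = sum(w.numel() * 4 for w in state_dict.values())  # 32-bit floats
int8_bytes = sum(w.numel() * 1 for w in state_dict.values())  # 8-bit weights
print(f"fp32 parameters:  {fp32_bytes / 1e6:.1f} MB")
print(f"8-bit parameters: {int8_bytes / 1e6:.1f} MB")
```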

@Michaelvll Can you please provide the distiller config that was used?

Also, do you prune individual weights or whole channels/filters/heads?

For simplicity, we use sensitivity pruning for our model, which is fine-grained pruning, i.e., pruning individual weights. You can try it on the configuration for the WMT En-Fr model with 527M #Mult-Adds.
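As a rough illustration of what fine-grained (per-weight) sensitivity pruning does, the sketch below zeroes out individual weights whose magnitude falls below a per-layer threshold expressed as a multiple of that layer's weight standard deviation. The layer names and sensitivity values are placeholders, not the configuration used for the paper, and this is not a substitute for the actual distiller schedule.

```python
import torch

def sensitivity_prune(state_dict, sensitivities):
    """Zero out individual weights with |w| < s * std(w) for each listed layer.

    This is element-wise (fine-grained) pruning: tensor shapes are kept and
    only individual entries are set to zero, unlike channel/filter pruning.
    """
    pruned = {}
    for name, w in state_dict.items():
        s = sensitivities.get(name)
        if s is None or not torch.is_floating_point(w):
            pruned[name] = w
            continue
        threshold = s * w.std()
        mask = w.abs() >= threshold
        pruned[name] = w * mask
    return pruned

# Placeholder layer names and sensitivities, purely for illustration.
sensitivities = {
    "encoder.layers.0.fc1.weight": 0.3,
    "encoder.layers.0.fc2.weight": 0.3,
}
# pruned_sd = sensitivity_prune(model.state_dict(), sensitivities)
```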

Could you share some more information on how you quantize the model? Did you use NervanaSystems/distiller for quantization?