Qualcomm-AI-research / transformer-quantization

A question on the nonlinear layers

Kevinpsk opened this issue · comments

Hi there,

Thanks a lot for releasing the code.
I have a question about the nonlinear layers such as GELU, softmax, and even LayerNorm (since it contains an RSQRT). If I understand your code correctly, you use the floating-point implementations of these operations in the QAT model. Does this mean we are not accurately simulating the quantized behaviour of these layers during QAT? Perhaps these layers are implemented as look-up tables or have full-integer implementations on hardware devices, so that not simulating them in QAT has minimal impact on quantized model performance? Could you clarify this a bit more? A rough sketch of what I mean by a hardware-style kernel follows below.
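To make the question concrete, here is a rough sketch (my own illustrative code, not anything from this repo) of what an integer look-up-table GELU kernel might look like on a hardware backend; the function names and the int8 scheme are assumptions for illustration only:

import torch
import torch.nn.functional as F

def build_int8_gelu_lut(input_scale, output_scale, output_zero_point=0):
    # Hypothetical int8 look-up table for GELU, of the kind a hardware kernel
    # might precompute: one output code per possible int8 input code.
    x_int = torch.arange(-128, 128, dtype=torch.float32)
    x_float = x_int * input_scale              # dequantize the 256 grid points
    y_float = F.gelu(x_float)                  # exact GELU at those points
    y_int = torch.round(y_float / output_scale) + output_zero_point
    return torch.clamp(y_int, -128, 127).to(torch.int8)

def int8_gelu(x_int8, lut):
    # Pure integer-domain GELU: index the precomputed table with the input codes.
    return lut[x_int8.to(torch.int64) + 128]

# Example usage (scales are made-up values):
# lut = build_int8_gelu_lut(input_scale=0.05, output_scale=0.02)
# y_int8 = int8_gelu(torch.randint(-128, 128, (4, 16), dtype=torch.int8), lut)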

Thanks a lot.

I am also wondering about this. Could one of the authors or developers clarify?

Yes, it looks like QDQ (quantize-dequantize, i.e. "fake" quantization) is used here: the nonlinearities themselves appear to run in floating point, with quantization applied only to their outputs. For example:

attention_probs = nn.Softmax(dim=-1)(attention_scores)

class QuantLayerNorm(QuantizationHijacker, nn.LayerNorm):

return quantize_model(nn.Sequential(m_dense, m_act), **quant_params)
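For concreteness, here is a minimal sketch of the QDQ / fake-quantization pattern these snippets suggest: the nonlinearity runs in FP32 and only its output is quantized and immediately dequantized. The fake_quantize helper and FakeQuantSoftmax wrapper are illustrative names of mine, not classes from this repo, and the asymmetric per-tensor scheme is just one possible choice:

import torch
import torch.nn as nn

def fake_quantize(x, num_bits=8):
    # Asymmetric uniform quantize-dequantize ("QDQ"): round the tensor to an
    # integer grid, then map it straight back to float, so downstream ops
    # still run in floating point but see quantization error.
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (x_q - zero_point) * scale

class FakeQuantSoftmax(nn.Module):
    # FP32 softmax followed by fake quantization of its output, mirroring the
    # pattern in the snippets above (nonlinearity in float, QDQ on the result).
    def __init__(self, dim=-1, num_bits=8):
        super().__init__()
        self.dim = dim
        self.num_bits = num_bits

    def forward(self, x):
        return fake_quantize(torch.softmax(x, dim=self.dim), self.num_bits)

# Example usage:
# attention_probs = FakeQuantSoftmax(dim=-1)(attention_scores)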