Intel® NPU Acceleration Library

Does linear support input_tensor with dtype int8?

Septend-fun opened this issue

Hi, experts. It seems that matmul with both the weight and the input tensor in int8 is not supported, right? I have to convert the weight to fp16 when using the matmul op.

The relevant code is in src/bindings.cpp:

intel_npu_acceleration_library_DLL_API ov::op::Op* linear(intel_npu_acceleration_library::ModelFactory* factory,
                                                          ov::op::Op* in0, size_t dim0, size_t dim1, bool bias,
                                                          char* act_dtype, char* wt_dtype) {
    ov::element::Type_t act_ov_dtype = intel_npu_acceleration_library::dtype_from_string(std::string(act_dtype));
    ov::element::Type_t wt_ov_dtype = intel_npu_acceleration_library::dtype_from_string(std::string(wt_dtype));

    bool quantized = wt_ov_dtype == ov::element::Type_t::i8 || wt_ov_dtype == ov::element::Type_t::i4;

    auto weights = factory->parameter({dim0, dim1}, wt_ov_dtype);
    if (quantized) {
        weights = factory->convert_to(weights, act_ov_dtype);
    }

    auto mm = factory->matmul(in0, weights);

    if (quantized) {
        auto scale = factory->parameter({1, dim0}, act_ov_dtype);
        mm = factory->eltwise_mul(mm, scale);
    }

    if (bias) {
        auto bias = factory->parameter({1, dim0}, act_ov_dtype);
        return factory->eltwise_add(mm, bias);
    }
    return mm;
}
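For reference, here is a minimal NumPy sketch of what the quantized branch above builds: the int8 weight parameter is converted to the activation dtype (the convert_to call), the matmul runs in fp16, and the result is rescaled by a per-output-channel scale parameter. The shapes, random data, and symmetric quantization scheme are assumptions for illustration only.

import numpy as np

# Illustrative shapes only (dim0 = output channels, dim1 = input channels)
batch, inC, outC = 32, 4096, 11008

x = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)        # fp16 activations
w_int8 = np.random.randint(-128, 128, (outC, inC), dtype=np.int8)    # int8 weight parameter
scale = np.random.uniform(1e-3, 1e-2, (1, outC)).astype(np.float16)  # assumed per-channel scale

w_fp16 = w_int8.astype(np.float16)   # convert_to(weights, act_dtype)
y = x @ w_fp16.T                     # matmul(in0, weights)
y = y * scale                        # eltwise_mul(mm, scale)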

If I set act_dtype to int8, I get this error:
Matmul op #0 must be ranked tensor of 16 bit float or 32 bit float or 32 bit int , but got tensor<1x16x16xsi8>
It is probably caused by OpenVINO, but the NPU does support int8 × int8 ops, right?
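As the binding above suggests, the path wired up today is weight-only quantization: int8 (or int4) weights with fp16 activations, dequantized via convert_to before an fp16 matmul. At the Python level that corresponds roughly to the snippet below; the compile API and dtype argument reflect my reading of the project README, so treat them as assumptions.

import torch
from torch import nn
import intel_npu_acceleration_library

# Toy model with a Linear layer; weight-only int8 quantization,
# activations stay in fp16 on the NPU
model = nn.Sequential(nn.Linear(4096, 11008), nn.GELU())
optimized = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

x = torch.rand(32, 4096, dtype=torch.float16)
with torch.no_grad():
    y = optimized(x)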

Hi, I have another question, about NPU latency. I got the following results when I tested the matmul op:

If batch=32, inC=4096, outC=11008, the latency is 16.58 ms;
If batch=32, inC=11008, outC=4096, the latency is 2.3 ms;

I think these two cases have similar FLOPs and I/O. Why is the difference so large?
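For context, a standalone benchmark for these two cases might look like the sketch below, using the library's low-level MatMul backend; the class, constructor argument order, and run signature are my recollection of the project's examples and should be treated as assumptions.

import time
import numpy as np
from intel_npu_acceleration_library.backend import MatMul

def bench(batch, inC, outC, iters=100):
    # fp16 activations and weights
    X = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
    W = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
    mm = MatMul(inC, outC, batch)   # assumed argument order: inC, outC, batch
    mm.run(X, W)                    # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        mm.run(X, W)
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

# Both shapes do 2 * 32 * 4096 * 11008 ≈ 2.9 GFLOP per call,
# so one would expect comparable latency
print(bench(32, 4096, 11008))   # inC=4096,  outC=11008
print(bench(32, 11008, 4096))   # inC=11008, outC=4096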

Sorry, I cannot reproduce this behavior.

Also, op support is ongoing, so stay tuned for new operations and dtypes to come.

Thanks for your reply. So in your test, you got similar latency for the two cases, right? It may be caused by my environment; I'll check it.

Any update? I'm happy to help if you need it; otherwise I'll close the issue.