intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

Why is the matrix multiplication slower after I installed the NPU driver?

BingoZha opened this issue · comments

Describe the bug
Why is the matrix multiplication slower after I installed the NPU driver?
Here is my Python code, modified from the example:

from intel_npu_acceleration_library.backend import MatMul
import numpy as np
from datetime import datetime

def run_matmul(inC, outC, batch):
    mm = MatMul(inC, outC, batch, profile=False)
    npu_delta_time = 0.0
    common_delta_time = 0.0
    for _ in range(100):
        # Create both inputs
        X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
        X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
        
        start_time = datetime.now()
        for _ in range(100):
            npu_result = mm.run(X1, X2)
        end_time = datetime.now()
        time_difference = end_time - start_time
        npu_delta_time = npu_delta_time + time_difference.total_seconds()
        #print(time_difference.total_seconds())
        #print(npu_result)
        start_time = datetime.now()
        for _ in range(100):
            common_result = np.dot(X1, X2.T)
        end_time = datetime.now()
        time_difference = end_time - start_time
        common_delta_time = common_delta_time + time_difference.total_seconds()
        #print(time_difference.total_seconds())
        #print(common_result)
    
    print(npu_delta_time/100)
    print(common_delta_time/100)
    


if __name__ == "__main__":
    result = run_matmul(128, 128, 32)

result:
[screenshot: 2024-04-09 224646]

As the image shows, the average time of mm.run() has slowed down after installing the NPU driver.

Desktop (please complete the following information):

  • OS: Windows 11
  • CPU: Intel(R) Core(TM) Ultra 5 125H, 3.60 GHz

I found that the reason for the slowness is these two lines of code; removing them makes the speeds almost the same:

for _ in range(100):
    common_result = np.dot(X1, X2.T)

why?

@BingoZha what driver version are you using?

@BingoZha what driver version are you using?

@alessandropalla npu_win_32.0.100.2267

Also, try to use the time.perf_counter() function, as it is more reliable than datetime for profiling.

Also, try to use the time.perf_counter() function, as it is more reliable than datetime for profiling.

@alessandropalla The difference in results is not significant.

Here is the modified code:

from intel_npu_acceleration_library.backend import MatMul
import numpy as np
import time


def run_matmul(inC, outC, batch):

    mm = MatMul(inC, outC, batch, profile=False)
    npu_delta_time = 0.0
    common_delta_time = 0.0
    for _ in range(100):
        # Create both inputs
        X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
        X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
        
        start_time = time.perf_counter()
        for _ in range(100):
            npu_result = mm.run(X1, X2)
        end_time = time.perf_counter()
        time_difference = end_time - start_time
        print(time_difference)
        npu_delta_time = npu_delta_time + time_difference
        for _ in range(100):
            common_result = np.dot(X1, X2.T)
        
    print(npu_delta_time/100)
    


if __name__ == "__main__":
    result = run_matmul(128, 128, 32)

result:

C:\Users\DSD\AppData\Local\Programs\Python\Python310\lib\site-packages\intel_npu_acceleration_library\nn\autograd.py:6: UserWarning: NPU is not available in your system. Library will fallback to AUTO device selection mode
  from intel_npu_acceleration_library.backend import run_matmul
0.04765880000013567
0.10254319999967265
0.05188799999996263
0.046970600000349805
0.04976079999960348
0.04896469999948749
0.04999989999942045
0.04749850000007427
0.049022400000467314
0.04979179999918415
0.04704169999968144
0.047730599999340484
0.05120950000036828
0.048201999999946565
0.047818699999879755
0.050647099999878264
0.046193700000003446
0.04915100000016537
0.0486003999994864
0.050607099999979255
0.047450999999455235
0.04915090000031341
0.05021160000069358
0.048043800000414194
0.04818599999998696
0.05125870000028954
0.04633929999999964
0.049118799999632756
0.049735199999304314
0.0491029000004346
0.04909919999954582
0.046879599999556376
0.04859600000054343
0.04854509999950096
0.04857770000035089
0.049387900000510854
0.048158600000533625
0.0475655000000188
0.046616999999969266
0.049579099999391474
0.04803550000087853
0.04740659999970376
0.04787690000011935
0.04669830000057118
0.05284369999935734
0.04919140000038169
0.049543399999492976
0.047205099999700906
0.04701120000027004
0.04793729999983043
0.05938539999988279
0.047786899999664456
0.04845709999972314
0.07805519999965327
0.04935310000018944
0.04962549999982002
0.047137300000031246
0.049221700000089186
0.04938479999964329
0.0505726000001232
0.04910699999982171
0.05535589999999502
0.05002890000014304
0.050125499999921885
0.04889709999952174
0.0479617999999391
0.05411710000043968
0.05511419999947975
0.0508096999992631
0.049849999999423744
0.05275349999919854
0.05028119999951741
0.05036079999990761
0.05515250000007654
0.04871689999981754
0.07981579999977839
0.0492776999999478
0.050627300000087416
0.046894500000234984
0.048262500000419095
0.04710070000055566
0.04997430000003078
0.05063040000004548
0.04894799999965471
0.05330159999994066
0.04804929999954766
0.05567640000026586
0.05023410000012518
0.04906920000030368
0.049790699999903154
0.04698980000011943
0.0497482000000673
0.04990599999928236
0.051089800000227115
0.04973599999993894
0.05356989999927464
0.047150999999757914
0.050318200000219804
0.04954520000046614
0.04882629999974597
0.050587983999930655

C:\code\intel\intel-npu-acceleration-library\examples>python "matmul.py"
0.08091940000031173
0.10518949999914184
0.05905519999942044
0.06505140000081155
0.08273579999968206
0.08095940000021074
0.08096199999999953
0.08017040000049747
0.07704349999949045
0.07456430000002001
0.07914720000007947
0.07826899999963643
0.08221719999983179
0.07464790000085486
0.07656289999977162
0.07466090000070835
0.08318709999912244
0.09415309999985766
0.07725259999915579
0.08149460000004183
0.07746670000051381
0.08184660000006261
0.08808690000023489
0.08402230000046984
0.0802560000001904
0.07623280000007071
0.0826561000003494
0.07742200000029698
0.0762667999997575
0.07674880000013218
0.08466990000033547
0.08417869999993854
0.08660820000022795
0.077901099999508
0.07373809999990044
0.07591349999984232
0.07338610000078916
0.086707600000409
0.08171990000028018
0.07630199999948672
0.08329109999976936
0.0783051999997042
0.08670989999973244
0.07781900000009045
0.0791652999996586
0.08005020000018703
0.07841960000041581
0.0749258000005284
0.07846859999972366
0.07941659999960393
0.0773471000002246
0.07923779999964609
0.07675939999990078
0.08374779999940074
0.08001440000043658
0.0804400999995778
0.07835660000000644
0.08211370000026363
0.07755559999986872
0.08111049999934039
0.08203239999966172
0.07882109999991371
0.07731419999981881
0.08383229999981268
0.07447210000009363
0.0785451000001558
0.08218669999951089
0.0811279999998078
0.07834290000027977
0.0820451999998113
0.07854600000064238
0.07885389999955805
0.0808340000003227
0.07542519999969954
0.07826360000035493
0.08222590000059427
0.07744459999958053
0.0781211000003168
0.085073299999749
0.07956819999981235
0.08035289999952511
0.08433969999987312
0.07941999999911786
0.08261419999962527
0.0805184000000736
0.07999849999941944
0.0812738000004174
0.07898529999965831
0.0842542999998841
0.07721769999989192
0.0775616999999329
0.07917769999949087
0.07916219999970053
0.0824066999994102
0.07616330000018934
0.07731640000019979
0.07446770000024117
0.07966329999999289
0.07940090000010969
0.07677080000030401
0.07967769099996076

Does the numpy call affect the speed of the NPU calculation?

Also, in your case you might benefit from the Linear layer, which has an algorithm to optimally allocate the weights. In this case:

from intel_npu_acceleration_library.backend import Linear
from tqdm import tqdm
import numpy as np
import time

def run_matmul(inC, outC, batch, samples=100, iterations=100):
    mm = Linear(inC, outC, batch, profile=False)
    npu_delta_time = 0.0
    common_delta_time = 0.0
    for _ in tqdm(range(samples)):
        # Create both inputs
        X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
        X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
        
        start_time = time.perf_counter()
        for _ in range(iterations):
            npu_result = mm.run(X1, X2, "0")
        end_time = time.perf_counter()
        time_difference = end_time - start_time
        npu_delta_time = npu_delta_time + time_difference


        start_time = time.perf_counter()
        for _ in range(iterations):
            common_result = np.dot(X1, X2.T)
        end_time = time.perf_counter()
        time_difference = end_time - start_time
        common_delta_time = common_delta_time + time_difference
        

    print(f"NPU average time {npu_delta_time/(iterations * samples) * 1000:.3f} ms")
    print(f"Numpy average time {common_delta_time/(iterations * samples) * 1000:.3f} ms")



if __name__ == "__main__":
    result = run_matmul(128, 128, 32)

It is worth noting that when you do not have the driver installed you are using the OpenVINO AUTO device, which falls back to CPU/GPU (system dependent, but in general it offloads matmuls to the CPU). Because of memory security limitations in the Microsoft OS, there is a higher latency in NPU inference even if the actual throughput is higher.
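
If you want to double-check which devices OpenVINO actually sees on your system, a quick check like the following may help (a minimal sketch, assuming the openvino runtime package that this library depends on is importable; the device lists in the comment are just examples):

# List the OpenVINO devices visible on this machine.
# Without the NPU driver you would typically see only CPU/GPU; with it, 'NPU' should appear.
from openvino.runtime import Core

core = Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'] or ['CPU', 'GPU', 'NPU']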

NPU time = initial latency (higher than CPU, but almost fixed and independent of the model you run) + model inference (generally faster than CPU and roughly proportional to the TOPs of the model you run)

So if you are interested in performance alone, it is beneficial to run big workloads on the NPU, like larger matmuls or entire models.
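
To make that latency model concrete, here is a back-of-the-envelope sketch; the fixed latency and throughput numbers are illustrative assumptions, not measurements for this hardware:

# Toy model of the formula above: npu_time = fixed_latency + macs / npu_throughput,
# while cpu_time is roughly macs / cpu_throughput with no fixed offset.
# All constants below are assumptions for illustration only.

FIXED_LATENCY_MS = 0.5    # assumed per-call NPU dispatch/context-switch cost
NPU_MACS_PER_MS = 2.5e8   # assumed NPU throughput (multiply-accumulates per ms)
CPU_MACS_PER_MS = 2.5e7   # assumed CPU throughput

# Break-even: fixed + macs/NPU = macs/CPU  =>  macs = fixed / (1/CPU - 1/NPU)
break_even_macs = FIXED_LATENCY_MS / (1 / CPU_MACS_PER_MS - 1 / NPU_MACS_PER_MS)
print(f"Break-even size: ~{break_even_macs / 1e6:.1f}M MACs")

for inC, outC, batch in [(128, 128, 32), (1024, 1024, 128)]:
    macs = inC * outC * batch  # multiply-accumulates in one matmul call
    winner = "NPU" if macs > break_even_macs else "CPU"
    print(f"{inC}x{outC}, batch {batch}: {macs / 1e6:.2f}M MACs -> {winner} expected to win")

With these assumed numbers the 128x128, batch-32 matmul sits well below the break-even point, while the 1024x1024, batch-128 case sits well above it, which matches the qualitative behaviour described here.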

For example, with 1k channels and a batch of 128 I got the results below (I suggest you try with and without the driver, but remove the numpy part as it is too slow):

Using backend.MatMul (lower is better):

NPU average time 0.611 ms 
CPU average time 5.759 ms 

Using backend.Linear (lower is better):

NPU average time 0.354 ms
CPU average time 5.419 ms

In your case you were running a small matmul that was mostly memory bound. A model with more TOPs is where the NPU shines. If you are interested in performance per watt, the NPU is in general a better choice, as it is a low-power inference design.
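
To see where the crossover happens on your own machine, you could sweep a few sizes with the run_matmul helper from the Linear example above (the shapes and sample counts below are just examples):

# Sweep a few matmul shapes to see where the NPU overtakes numpy on this machine.
# Uses the run_matmul(inC, outC, batch, samples, iterations) helper defined above.
for size, batch in [(128, 32), (512, 64), (1024, 128)]:
    print(f"--- inC=outC={size}, batch={batch} ---")
    run_matmul(size, size, batch, samples=10, iterations=100)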

Hope this clarifies the expectations :)

OK, the last question:
"NPU time = initial latency (higher than CPU, but almost fixed and independent of the model you run) + model inference (generally faster than CPU and roughly proportional to the TOPs of the model you run)"

Does this code npu_result = mm.run(X1, X2) include the initial latency? I think mm = MatMul(inC, outC, batch, profile=False) looks like initialization code, so I didn't include it in the measured time.

Does this code npu_result = mm.run(X1, X2) include the initial latency?

It does. When you run mm = MatMul(inC, outC, batch, profile=False) there is a small latency while the driver compiles and loads the model onto the NPU, but that is initialization, and it is also outside of your profiling loop.

When you run mm.run(X1, X2) there is a small latency because of the context switch in the OS, which is not present when you do CPU inference. That gets exacerbated when the models you run are small, so the latency dominates: that 128x128 matmul with batch 32 takes only about half a million multiply-accumulate operations to execute.
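
If you want to see how much of that is one-time initialization versus steady-state per-call latency, a rough way is to time the constructor, the first run, and later runs separately (a sketch based on the MatMul example above; exact numbers will vary):

# Separate one-time initialization cost from steady-state per-call latency.
from intel_npu_acceleration_library.backend import MatMul
import numpy as np
import time

inC, outC, batch = 128, 128, 32
X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)

t0 = time.perf_counter()
mm = MatMul(inC, outC, batch, profile=False)  # compile and load the model (initialization)
t1 = time.perf_counter()
mm.run(X1, X2)                                # first call, may include extra warm-up cost
t2 = time.perf_counter()

runs = []
for _ in range(100):
    t = time.perf_counter()
    mm.run(X1, X2)
    runs.append(time.perf_counter() - t)

print(f"init (compile/load): {(t1 - t0) * 1000:.3f} ms")
print(f"first run:           {(t2 - t1) * 1000:.3f} ms")
print(f"steady-state run:    {np.mean(runs) * 1000:.3f} ms")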

As shown above, already with a medium-size operation the NPU is much faster, and when you run entire networks the gap widens.

OK, thank you for your patience :)