intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that makes it easy to obtain performance gains on Intel platforms

Memory usage is low

QiXuanWang opened this issue

Describe the issue

After two weeks of struggle, I finally got my A770 working on Fedora 38.
But training seems barely faster than on my 24-core CPU machine.
I tried increasing the batch size, but memory consumption stayed at 1.7 GB. Is that expected?
What can I do to improve training performance and increase the memory usage?

What do you mean by "barely faster"? And is the memory consumption you measured on the host or on the device? It would be odd if that were device memory. Can you share your training script?
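
One quick way to tell the two apart from inside the script (a minimal sketch; it assumes the xpu backend is loaded via intel_extension_for_pytorch and that psutil is installed):

"
import os
import psutil
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Device-side memory tracked by the PyTorch XPU allocator
print(f"device allocated: {torch.xpu.memory_allocated() / 1024**2:.1f} MB")
print(f"device reserved:  {torch.xpu.memory_reserved() / 1024**2:.1f} MB")

# Host-side resident memory of this Python process
rss = psutil.Process(os.getpid()).memory_info().rss
print(f"host RSS: {rss / 1024**2:.1f} MB")
"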

Sorry for the confusion.
I'm using the A770 GPU to test the ML code. I can't install xpumanager on Fedora for now due to lots of issues, so I use intel_gpu_top and zello_sysman to check resource usage.
I installed the latest intel_extension_for_pytorch-2.1.30+xpu. My Python version is 3.11.
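
For reference, a quick sanity check that the xpu build is actually being picked up (a minimal sketch along the lines of the verification snippet in the IPEX installation docs):

"
import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)             # the matching PyTorch build
print(ipex.__version__)              # e.g. 2.1.30+xpu
print(torch.xpu.is_available())      # should be True
print(torch.xpu.get_device_name(0))  # should report the Arc A770
"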

My code uses a very simple transformer encoder layer and some linear layers.

"
if device == "xpu":
model = model.to("xpu")
loss_fn = loss_fn.to("xpu")
model,optimizer = ipex.optimize(model, optimizer=optimizer)
X_train = X_train.to("xpu")
y_train = y_train.to("xpu")
X_val = X_val.to("xpu")
y_val = y_val.to("xpu")
for epoch in range(100):
for i,data in enumerate(X_train):
x_data, y_data = get_batch(X_train, y_train, i, batch_size)
if x_data is None:
break
y_pred = model(x_data)
loss = loss_fn(y_pred, y_data)
optimizer.zero_grad()
loss.backward()
optimizer.step()
"

I originally trained it on a Xeon 6146 with 24 cores, but that takes a very long time. I was hoping the A770 GPU would be much faster (5x+), but it turns out it is not faster at all.

As for resource usage, intel_gpu_top shows the "Blitter" engine usage of the python3 process at around 10%, which I feel is not correct.
So the question is: how can I fully utilize the GPU? And is a 5x speedup on the GPU a reasonable expectation?
I may install an Nvidia GPU later on to test too, but in case my usage or code is not correct...
Any information is helpful.
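
If the device really is underutilized, two common levers are a larger batch size and lower-precision compute. A hedged sketch of the latter, using the bfloat16 path that the IPEX docs show for xpu (whether it helps depends on the model):

"
import torch
import intel_extension_for_pytorch as ipex

model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

for i, data in enumerate(X_train):
    x_data, y_data = get_batch(X_train, y_train, i, batch_size)
    if x_data is None:
        break
    # Run forward + loss in bf16 autocast; backward/step stay outside
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        y_pred = model(x_data)
        loss = loss_fn(y_pred, y_data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
"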

May I ask why you have to use Fedora 38?

I don't have to use Fedora 38, but I installed it since it's compatible with RHEL. I had to pick one among RHEL, SUSE, and CentOS.
I think the performance issue is mainly in the GPU's internal compute mechanism, no?
BTW, what Linux distribution do you recommend, besides Ubuntu?

In case it's helpful, here is the memory summary:

|===========================================================================|
|                  PyTorch XPU memory summary, device ID 0                  |
|---------------------------------------------------------------------------|
|            XPU OOMs: 0             |         xpuMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    812 KB  |  28210 KB  |   3407 GB  |   3407 GB  |
|       from large pool |      0 KB  |  16128 KB  |   1974 GB  |   1974 GB  |
|       from small pool |    812 KB  |  12468 KB  |   1433 GB  |   1433 GB  |
|---------------------------------------------------------------------------|
| Active memory         |    812 KB  |  28210 KB  |   3407 GB  |   3407 GB  |
|       from large pool |      0 KB  |  16128 KB  |   1974 GB  |   1974 GB  |
|       from small pool |    812 KB  |  12468 KB  |   1433 GB  |   1433 GB  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  34816 KB  |  34816 KB  |  34816 KB  |      0 B   |
|       from large pool |  20480 KB  |  20480 KB  |  20480 KB  |      0 B   |
|       from small pool |  14336 KB  |  14336 KB  |  14336 KB  |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |   7379 KB  |  28019 KB  |   4321 GB  |   4321 GB  |
|       from large pool |      0 KB  |  18688 KB  |   2756 GB  |   2756 GB  |
|       from small pool |   7379 KB  |  11449 KB  |   1565 GB  |   1565 GB  |
|---------------------------------------------------------------------------|
| Allocations           |     186    |     274    |  17274 K   |  17274 K   |
|       from large pool |       0    |       9    |   1155 K   |   1155 K   |
|       from small pool |     186    |     266    |  16119 K   |  16119 K   |
|---------------------------------------------------------------------------|
| Active allocs         |     186    |     274    |  17274 K   |  17274 K   |
|       from large pool |       0    |       9    |   1155 K   |   1155 K   |
|       from small pool |     186    |     266    |  16119 K   |  16119 K   |
|---------------------------------------------------------------------------|
| GPU reserved segments |       8    |       8    |       8    |       0    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       7    |       7    |       7    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      15    |      30    |   6580 K   |   6580 K   |
|       from large pool |       0    |       2    |    625 K   |    625 K   |
|       from small pool |      15    |      29    |   5954 K   |   5954 K   |
|===========================================================================|
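
For reference, a summary like the one above can be printed directly from the training script (assuming the XPU allocator stats mirror the torch.cuda API, as the table header suggests):

"
import torch
import intel_extension_for_pytorch as ipex

print(torch.xpu.memory_summary())
"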

You could try running your code in Intel VTune to see the CPU/GPU compute+memory usage and find possible bottlenecks.
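
If VTune is hard to set up, IPEX also exposes the legacy autograd profiler with XPU support. A sketch based on the IPEX profiling docs (model, x_data, and loss_fn are the objects from the script above; whether use_xpu is available may depend on the build):

"
import torch
import intel_extension_for_pytorch as ipex

with torch.autograd.profiler_legacy.profile(use_xpu=True) as prof:
    y_pred = model(x_data)
    loss = loss_fn(y_pred, y_data)
    loss.backward()

# Rank ops by time spent in XPU kernels
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
"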

Thanks, I'll try.
BTW, my CPU version runs fine: CPU usage is full, memory usage is reasonable, and the training speed is acceptable. But with the GPU, the power and memory consumption show that it is not working as expected, the training speed is extremely slow, and the results are a little bit weird too.
Not sure if it's a known issue or whether the installation is somehow broken.

I tried aliyun os and it's OK to run oneAPI and IPEX. I think aliyun os is CentOS-based. See "Deploying the GPT-2 Large Language Model on an ECS Intel Instance", Cloud Lab, Alibaba Cloud Developer Community (aliyun.com).

Oh, I use my own machine for this task. My problem is not that it doesn't run, but that the results are suspicious.

We have not verified on Fedora 38.
First, we should check the correctness of the results.
I suggest using CentOS to check, if you like.
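
A quick way to check correctness is to run the same forward pass on CPU and XPU and compare the outputs (a minimal sketch; model and x_data are the objects from the script above, and the tolerances are a judgment call since the two backends won't match bit-for-bit):

"
import copy
import torch

model_cpu = copy.deepcopy(model).to("cpu")
model.eval()      # disable dropout so the two runs are comparable
model_cpu.eval()

with torch.no_grad():
    out_xpu = model(x_data).to("cpu")
    out_cpu = model_cpu(x_data.to("cpu"))

print(torch.allclose(out_cpu, out_xpu, rtol=1e-3, atol=1e-4))
"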