RuntimeError: CUDA out of memory with 6M nodes, 8M edges on A100 GPU
chi2liu opened this issue
633WHU commented
🐛 Bug
|-------------------------------------------------------------------------------------------------------|
*** Running (`tmp_data.pt`, `unsup_graphsage`, `node_classification_dw`, `unsup_graphsage_mw`)
|-------------------------------------------------------------------------------------------------------|
Model Parameters: 1568
0%| | 0/500 [00:00<?, ?it/s]OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
0%| | 0/500 [00:47<?, ?it/s]
Traceback (most recent call last):
File "generate_emb.py", line 12, in <module>
outputs = generator(edge_index, x=x)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/pipelines.py", line 204, in __call__
model = train(self.args)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/experiments.py", line 216, in train
result = trainer.run(model_wrapper, dataset_wrapper)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 188, in run
self.train(self.devices[0], model_w, dataset_w)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 334, in train
training_loss = self.train_step(model_w, train_loader, optimizers, lr_schedulers, rank, scaler)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 468, in train_step
loss = model_w.on_train_step(batch)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/base_model_wrapper.py", line 73, in on_train_step
return self.train_step(*args, **kwargs)
File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/node_classification/unsup_graphsage_mw.py", line 43, in train_step
neg_loss = -torch.log(torch.sigmoid(-torch.sum(x.unsqueeze(1).repeat(1, self.num_negative_samples, 1) * x[self.negative_samples], dim=-1))).mean()
RuntimeError: CUDA out of memory. Tried to allocate 11.02 GiB (GPU 0; 39.45 GiB total capacity; 29.23 GiB already allocated; 8.01 GiB free; 30.03 GiB reserved in total by PyTorch)
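For reference, the failed allocation is consistent with the repeat(...) in the loss line above materializing a (num_nodes, num_negative_samples, hidden_size) tensor. A back-of-the-envelope estimate (the number of negative samples is not shown in the report, so 30 and float32 activations are assumptions here):

# Rough size of the tensor built by x.unsqueeze(1).repeat(1, num_neg, 1) in the failing line.
num_nodes = 6_000_000   # from the issue
hidden    = 16          # hidden_size passed to the pipeline
num_neg   = 30          # assumed; not shown in the report
bytes_per = 4           # assuming float32 activations
print(num_nodes * num_neg * hidden * bytes_per / 2**30)  # ~10.7 GiB, close to the 11.02 GiB allocation

The gathered x[self.negative_samples] and the element-wise product each need another tensor of the same size, so the loss line alone can push the run past the memory already reserved by the model and optimizer.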
To Reproduce
Steps to reproduce the behavior:
from cogdl import pipeline
import numpy as np
import pandas as pd

# Build a pipeline for generating embeddings with an unsupervised GNN;
# pass the model name, num_features and its hyper-parameters to this API.
graph = pd.read_csv("G1.weighted.edgelist", header=None, sep=' ')
edge_index = graph[[0, 1]].to_numpy()
edge_weight = graph[[2]].to_numpy(dtype=np.float16)

e = pd.read_csv("vertex_embeddings.csv", header=None, sep=' ')
x = e.iloc[:, :32].to_numpy(dtype=np.float16)

generator = pipeline("generate-emb", model="unsup_graphsage", no_test=True, num_features=32, hidden_size=16, walk_length=2, sample_size=[4, 2], is_large=True)
outputs = generator(edge_index, x=x)
pd.DataFrame(outputs).to_csv("embeddings.csv", header=False, index=False)
The graph has 6M nodes and 8M edges; the GPU is an A100 with 40 GB of memory.
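Below is a minimal sketch of a more memory-friendly formulation of the same negative-sampling loss, assuming x has shape (N, hidden) and negative_samples has shape (N, K) as in the wrapper above; this is only an illustration of the idea, not CogDL's implementation:

import torch
import torch.nn.functional as F

def negative_loss(x, negative_samples):
    # x: (N, H) node embeddings; negative_samples: (N, K) indices of K negatives per node.
    # einsum computes the per-pair dot products directly, so neither the repeated copy
    # of x nor the explicit (N, K, H) element-wise product is materialized.
    scores = torch.einsum("nh,nkh->nk", x, x[negative_samples])
    # -log(sigmoid(-s)) == -logsigmoid(-s); logsigmoid is the numerically stabler form.
    return -F.logsigmoid(-scores).mean()

The gathered x[negative_samples] is still an (N, K, H) tensor, so for a 6M-node graph this only removes roughly two of the three large intermediates; computing the loss over node chunks, or lowering num_negative_samples / hidden_size, would be needed to reduce the peak further.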
Expected behavior
The pipeline should finish training and write embeddings for all 6M nodes to embeddings.csv without running out of GPU memory.
Environment
- CogDL version: 0.5.3
- OS (e.g., Linux): Ubuntu
- Python version: 3.7
- PyTorch version: 1.9.1.post3
- CUDA/cuDNN version (if applicable): 11.7
- Any other relevant information: