Error in samgraph/common/common.cc.
swordfate opened this issue
After successfully installing all packages and running data_preprocessing according to README.md, an error occurred when I ran the scripts in the README:
Command:
python samgraph/multi_gpu/train_gcn.py --dataset papers100M --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
Output:
Using backend: pytorch
config:eval_tsp="2022-05-08 08:19:49"
config:arch=arch5
config:num_train_worker=1
config:num_sample_worker=1
config:sample_type=khop2
config:root_path=/graph-learning/samgraph/
config:dataset=papers100M
config:pipeline=True
config:cache_policy=pre_sample
config:cache_percentage=0.1
config:num_epoch=11
config:batch_size=8000
config:num_hidden=256
config:max_sampling_jobs=10
config:max_copying_jobs=1
config:barriered_epoch=0
config:presample_epoch=1
config:omp_thread_num=40
config:fanout=[5, 10, 15]
config:lr=0.003
config:dropout=0.5
config:weight_decay=0.0005
config:single_gpu=False
config:validate_configs=False
config:dataset_path=/graph-learning/samgraph/papers100M
config:train_workers=['cuda:0']
config:sample_workers=['cuda:1']
config:num_fanout=3
config:num_layer=3
config:_run_mode=RunMode.FGNN
config:_log_level=error
config:_profile_level=0
config:_empty_feat=0
config:_arch=5
config:_sample_type=5
config:_cache_policy=2
ERROR: /root/gitclone/fgnn-artifacts/samgraph/common/common.cc:100] Check failed: (data) != ((void *)-1)
Aborted (core dumped)
After consulting the common.cc code, I know that this error is caused by the mmap() of the file /graph-learning/samgraph/papers100M/indptr.bin, but I don't know why.
Can you give me a hint on how to solve this problem, please? Thank you very much.
First check if the file path is correct. If so, it may be caused by the memlock limit. Please refer to link.
You may print errno right after the failed mmap.
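The failing check means mmap returned (void *)-1, i.e. MAP_FAILED. For example, a minimal standalone probe like the one below (not code from the repo, just an illustration) maps a file with MAP_LOCKED in the same way and reports errno on failure:

// probe_mmap.cc: mmap a file with MAP_LOCKED and report errno on failure.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "/graph-learning/samgraph/papers100M/indptr.bin";
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_LOCKED, fd, 0);
  if (data == MAP_FAILED) {
    // EAGAIN or ENOMEM here usually means RLIMIT_MEMLOCK (max locked memory) is too low.
    fprintf(stderr, "mmap failed: %s (errno=%d)\n", strerror(errno), errno);
    return 1;
  }
  printf("mmap of %zu bytes succeeded\n", (size_t)st.st_size);
  munmap(data, st.st_size);
  close(fd);
  return 0;
}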
It is also recommended to move to this new repo. This repo is a snapshot version for the EuroSys'22 artifact evaluation, while future maintenance will happen in the new repo. You are welcome to open a new issue in that repo, or continue our discussion in this thread if you like.
Thank you for your quick reply. When I type ulimit -a in the terminal, I see max locked memory (kbytes, -l) 65536, and I fail when trying to raise it to 200000000. Is there any way to solve this problem? (ulimit -n 65535 works fine.)
There is a difference between the hard limit and the soft limit. Editing /etc/security/limits.conf should solve this issue, I suppose? How did you fail when editing this file?
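For reference, the memlock entries in that file usually look something like the following (use your own user name instead of *, and a value large enough for the dataset):

*    soft    memlock    200000000
*    hard    memlock    200000000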
The server I am using is a commercial online service platform. I cannot execute reboot -f (maybe because I am in a container set up by the commercial platform), so I cannot modify ulimit by editing /etc/security/limits.conf and rebooting. I can only execute ulimit -l 200000000 from the terminal, but it fails with bash: ulimit: max locked memory: cannot modify limit: Operation not permitted.
What's the output of ulimit -aH?
If you cannot reboot, you can try this command:
sudo sh -c "ulimit -l 200000000 && exec su <your user name>"
Not sure if this can bypass the hard limit. Please notify me if this solves your issue or not.
Oh, the README is not accurate. After updating /etc/security/limits.conf, a reboot should not be required. Simply open a new session and the limit should be updated. Not sure if this works in your container environment. Anyway, the solution is to find a way to increase the soft/hard limit in your environment.
If none of these work, you can try to remove the lock flag from all calls to mmap in the codebase. This may resolve the issue, but the performance will be degraded.
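Concretely, that change amounts to dropping MAP_LOCKED from the mmap flags, roughly like this (a sketch of the pattern with a hypothetical helper, not the exact code in common.cc):

// Sketch: mapping a dataset file without pinning it in RAM.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

void* MapFileReadOnly(const char* path, size_t size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return MAP_FAILED; }
  // Original flags: MAP_PRIVATE | MAP_LOCKED, which pins pages and counts against RLIMIT_MEMLOCK.
  // Workaround: use MAP_PRIVATE only; pages are then demand-paged and may fault in later,
  // which is why performance can degrade.
  void* data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (data == MAP_FAILED) { perror("mmap"); }
  close(fd);
  return data;
}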
Below is the result of ulimit -aH:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3089868
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Below is the result of sudo sh -c "ulimit -l 200000000 && exec su root":
sh: 1: ulimit: error setting limit (Operation not permitted)
Hello! If you use a container environment, you can add some options when starting the container. For example, you can add the option "--ulimit memlock=-1" to the run command, i.e. "docker run --ulimit memlock=-1 ...", to remove the memlock limit in a Docker environment.
Unfortunately, the above solutions cannot solve my problem, but the method of removing the lock flag does indeed work. Anyway, thank you very much for your detailed and quick replies. :)
I can't set the flag when starting the container because it is controlled by the commercial platform. Anyway, thanks a lot for providing another method. I will continue to explore whether there are other ways to solve this problem in my environment :)
OK.
Perhaps you can ask your commercial platform provider about how to change the Linux ulimit values.
Or you can remove the "MAP_LOCKED" flag from the mmap calls and remove the mlock calls in our code, but this method will hurt performance (not recommended).
If our answer is OK, you can close this issue. Feel free to reopen it if you have any questions later.