several breakages due to recent `datasets`
stas00 opened this issue
It seems that `datasets==2.16.0` and higher break `evaluate`:
```
$ cat test-evaluate.py
from evaluate import load
import os
import torch.distributed as dist

dist.init_process_group("nccl")
rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

metric = load(
    "accuracy",
    experiment_id="test4",
    num_process=world_size,
    process_id=rank,
)
metric.add_batch(predictions=[], references=[])
```
Problem 1. `umask` isn't being respected when creating lock files
As we are in a group setting, we use `umask 000`, but this script creates lock files with missing permissions:

```
-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```

which is invalid, since `umask 000` should have led to:

```
-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```
The same problem applies to all the other locks created during such a run; there are a few more `.lock` files in that directory.
This is the same issue that was reported and dealt with multiple times in `datasets`.
If I downgrade to `datasets==2.15.0`, the files are created correctly with `-rw-rw-rw-`.
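For reference, below is a minimal sketch of the kind of chmod-based fix that was applied for this problem in `datasets` in the past. This is not actual library code, and the lock path is hypothetical; it just shows how to re-apply the permissions that the process umask implies after the lock file has been created:

```python
# Sketch of a umask-respecting workaround for lock-file permissions
# (not actual datasets/evaluate code; the path is hypothetical).
import os
from filelock import FileLock

lock_path = "/tmp/example-rdv.lock"  # hypothetical path for illustration
with FileLock(lock_path):
    umask = os.umask(0)  # read the current umask...
    os.umask(umask)      # ...and restore it immediately
    # 0o666 & ~umask is what a plain open() would have produced; with
    # `umask 000` this gives 0o666 (-rw-rw-rw-) rather than the
    # 0o644 (-rw-r--r--) the lock file was created with.
    os.chmod(lock_path, 0o666 & ~umask)
```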
Problem 2. `Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.`
```
$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000 --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.
```
The files are there:
```
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
```
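For context, the failing check works roughly like the sketch below (a simplified reconstruction, not `evaluate`'s actual code): every rank is expected to be holding a lock on its own `{experiment_id}-{num_process}-{process_id}.arrow.lock` file, and readiness is probed by trying to acquire each lock with `timeout=0`, where a `Timeout` means "that rank is holding its lock, i.e. it is ready":

```python
# Simplified reconstruction of the rendezvous check (not evaluate's code).
import os
from filelock import FileLock, Timeout

def check_all_processes_locks(cache_dir: str, experiment_id: str, num_process: int):
    for process_id in range(num_process):
        lock_path = os.path.join(
            cache_dir, f"{experiment_id}-{num_process}-{process_id}.arrow.lock"
        )
        lock = FileLock(lock_path)
        try:
            # timeout=0: fail immediately if another process holds the lock.
            lock.acquire(timeout=0)
        except Timeout:
            continue  # held by the owning rank, as expected
        else:
            lock.release()
            raise ValueError(
                f"Expected to find locked file {lock_path} from process "
                f"{process_id} but it doesn't exist."
            )
```

So the error doesn't necessarily mean the file is missing on disk; it means the probe managed to acquire the lock, i.e. it concluded the owning rank wasn't holding it, which is consistent with the files being present in the listing above.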
If I downgrade to `datasets==2.15.0`, the above code works again.
With `datasets<2.16` it works; with `datasets>=2.16` it breaks.
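So until this is resolved, pinning the last known-good version works around both problems:

```
pip install "datasets==2.15.0"
```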
Using `evaluate==0.4.1`.
Thank you!
Thanks to @williamberrios, who reported this.
@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.
It seems to be an issue with recent versions of `filelock`? I was able to reproduce it using the latest version, 3.13.1.

Can you try using an older version? E.g. I use 3.9.0, which seems to work fine:

```
pip install "filelock==3.9.0"
```
I just opened huggingface/datasets#6631 in `datasets` to fix this. Can you try it out? Once I have your green light I can make a new release.
Thanks a lot, @lhoestq!
@williamberrios - could you please test this ASAP? If it all works, they can make a new release - thank you!
Hi @lhoestq, `filelock==3.9.0` fixed my issue with distributed evaluation. Thanks a lot ❤️
Thank you for confirming it solved your problem, William!
Problem 2 is affecting me too. Downgrading fixed it, but it frustrates me that I have to downgrade `filelock` on every machine I want to use multi-node `evaluate` on; is there another workaround? Can we get this fixed, @stas00?
Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.
sorry :)