several breakages due to recent `datasets`
stas00 opened this issue
It seems that `datasets==2.16.0` and higher break `evaluate`:
```
$ cat test-evaluate.py
from evaluate import load
import os
import torch.distributed as dist

dist.init_process_group("nccl")
rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

metric = load(
    "accuracy",
    experiment_id="test4",
    num_process=world_size,
    process_id=rank,
)
metric.add_batch(predictions=[], references=[])
```
Problem 1. `umask` isn't being respected when creating lock files
As we are in a group setting, we use `umask 000`, but this script creates lock files with missing permissions:

```
-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```

which is invalid, since `umask 000` should have led to:

```
-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```
The same problem applies to all the other locks created during such a run; there are a few more `.lock` files in that directory.
This is the same issue that was reported and dealt with multiple times in `datasets`.
If I downgrade to `datasets==2.15.0`, the files are created correctly with `-rw-rw-rw-`.
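For reference, below is a minimal sketch of the kind of chmod-based fix that was applied for this problem in `datasets` in the past. This is not actual library code, and the lock path is hypothetical; it just shows how to re-apply the permissions that the process umask implies after the lock file has been created:

```python
# Sketch of a umask-respecting workaround for lock-file permissions
# (not actual datasets/evaluate code; the path is hypothetical).
import os
from filelock import FileLock

lock_path = "/tmp/example-rdv.lock"  # hypothetical path for illustration
with FileLock(lock_path):
    umask = os.umask(0)  # read the current umask...
    os.umask(umask)      # ...and restore it immediately
    # 0o666 & ~umask is what a plain open() would have produced; with
    # `umask 000` this gives 0o666 (-rw-rw-rw-) rather than the
    # 0o644 (-rw-r--r--) the lock file was created with.
    os.chmod(lock_path, 0o666 & ~umask)
```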
Problem 2. `Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.`
```
$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000 --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.
```
The files are there:
```
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
```
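For context, the failing check works roughly like the sketch below (a simplified reconstruction, not `evaluate`'s actual code): every rank is expected to be holding a lock on its own `{experiment_id}-{num_process}-{process_id}.arrow.lock` file, and readiness is probed by trying to acquire each lock with `timeout=0`, where a `Timeout` means "that rank is holding its lock, i.e. it is ready":

```python
# Simplified reconstruction of the rendezvous check (not evaluate's code).
import os
from filelock import FileLock, Timeout

def check_all_processes_locks(cache_dir: str, experiment_id: str, num_process: int):
    for process_id in range(num_process):
        lock_path = os.path.join(
            cache_dir, f"{experiment_id}-{num_process}-{process_id}.arrow.lock"
        )
        lock = FileLock(lock_path)
        try:
            # timeout=0: fail immediately if another process holds the lock.
            lock.acquire(timeout=0)
        except Timeout:
            continue  # held by the owning rank, as expected
        else:
            lock.release()
            raise ValueError(
                f"Expected to find locked file {lock_path} from process "
                f"{process_id} but it doesn't exist."
            )
```

So the error doesn't necessarily mean the file is missing on disk; it means the probe managed to acquire the lock, i.e. it concluded the owning rank wasn't holding it, which is consistent with the files being present in the listing above.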
If I downgrade to `datasets==2.15.0`, the above code works again.
With `datasets<2.16` it works; with `datasets>=2.16` it breaks.
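So until this is resolved, pinning the last known-good version works around both problems:

```
pip install "datasets==2.15.0"
```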
Using `evaluate==0.4.1`.
Thank you!
Thanks to @williamberrios, who reported this.
@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.
It seems to be an issue with recent versions of `filelock`? I was able to reproduce it using the latest version, 3.13.1.

Can you try using an older version? E.g. I use 3.9.0, which seems to work fine:

```
pip install "filelock==3.9.0"
```
I just opened huggingface/datasets#6631 in `datasets` to fix this. Can you try it out? Once I have your green light I can make a new release.
Thanks a lot, @lhoestq!
@williamberrios - could you please test this ASAP? If it all works, they can make a new release - thank you!
Hi @lhoestq, `filelock==3.9.0` fixed my issue with distributed evaluation. Thanks a lot ❤️
Thank you for confirming it solved your problem, William!
Problem 2 is affecting me too. Downgrading fixed it, but it frustrates me that I have to downgrade `filelock` on every machine I want to use multi-node `evaluate` on; is there another workaround? Can we get this fixed, @stas00?
Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.
sorry :)