Cerebras / modelzoo

[transformers/slimpajama] where is lm_dataformat?

tbarton16 opened this issue · comments

I am trying to deduplicate some data. When I run main.py, I get the error ModuleNotFoundError: No module named 'lm_dataformat'. Looking around the module, lm_dataformat is not included in the public version. I would appreciate help with this.

Hi @tbarton16, thanks for catching this issue. We'll update our requirements.txt. Meanwhile, did you try pip install lm_dataformat in your environment?

I installed lm_dataformat, and python -c 'import lm_dataformat' runs without error. Running main.py still produces ModuleNotFoundError: No module named 'lm_dataformat.lm_dataformat'.
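For anyone hitting the same pair of errors, a small diagnostic along these lines may help show which import paths actually resolve in the current environment. It is only a sketch: the helper name is_importable is made up for illustration, and the two module names are simply the ones quoted in the error messages above.

```python
import importlib.util


def is_importable(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located on the current import path.

    Note: for a dotted name, find_spec imports the parent package(s) in order
    to search them, but it does not import the target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # ImportError (incl. ModuleNotFoundError): a parent is missing or is
        # not a package. ValueError: a loaded module has no usable __spec__.
        return False


# The two names taken from the error messages in this thread.
for name in ("lm_dataformat", "lm_dataformat.lm_dataformat"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If the first name is found and the second is missing, the package is installed but does not expose a nested lm_dataformat submodule, which would match the second error reported here.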

Unless you have a different version, the line should be from lm_dataformat import Reader, not the import currently at https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/slimpajama/preprocessing/filter.py#L14
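If you would rather not edit the import by hand for each layout, one possible workaround is to try both module paths and take whichever provides the class. This is a sketch, not what the modelzoo code does: import_attr is a hypothetical helper, and the two candidate paths are just the ones mentioned in this thread.

```python
import importlib


def import_attr(candidates, attr):
    """Return `attr` from the first module in `candidates` that provides it.

    `candidates` is an ordered list of dotted module names to try.
    Raises ImportError listing every failure if none of them works.
    """
    errors = []
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Hypothetical usage in place of the import in filter.py (not executed here):
# Reader = import_attr(["lm_dataformat.lm_dataformat", "lm_dataformat"], "Reader")
```

The order of the candidate list decides which layout wins when both are present, so put the one you actually intend to use first.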

Hi @tbarton16, can you run git clone git@github.com:leogao2/lm_dataformat.git inside the directory https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama? It should resolve your issue.
Thanks.