EleutherAI / the-pile

pass2_shuffle_holdout.py - ModuleNotFoundError: No module named 'parse'

dboggs95 opened this issue

commented

Goal

I'm trying to replicate a subset of the Pile that works with the GPT-NeoX trainer. I have pretty good hardware, but nothing like the 90 Tesla rack that made the 20-billion-parameter GPT-NeoX-20B model. So, I'm trying to keep everything simple while I learn how this process works.

Environment and Setup

I am using Python 3.8 + pip on WSL2 + Ubuntu 22.04. Since the project requires Python 3.6 or above, I figured that should be fine.

I have run the setup.py and installed the project, so all the requirements declared by the project are there.

I also commented out every dataset except for Gutenberg because, like I said, I'm trying to keep everything simple.

Problem

I downloaded the full Gutenberg dataset and ran the first pass shuffle on it via pile.py.

For the second pass, I ran pass2_shuffle_holdout.py, but it fails with the error below. The README mentions that some of the scripts may be obsolete, but doesn't say which ones. This doesn't look like an obsolete script, and several other scripts import the parse module. If I read everything correctly, I need two shuffle passes to do this properly, so I'm 99% sure this script is supposed to work.

Error Message

Traceback (most recent call last):
  File "./processing_scripts/pass2_shuffle_holdout.py", line 8, in <module>
    import parse
ModuleNotFoundError: No module named 'parse'

Research

I looked in the repo for a parse module and did not find it. I looked online for a parse module for Python 3, and I couldn't find any evidence one exists.

I searched the web, Stack Overflow, and the GitHub issues for a solution.

commented

I just figured out the answer to my own question.

There is a Python library called parse: https://pypi.org/project/parse/

I just had to run this command to install it:

pip install parse
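For anyone hitting a similar error, here is a minimal sketch of how to check up front which top-level imports are missing from the current environment. It uses only the standard library; `missing_modules` is a hypothetical helper I made up for illustration, not something in the-pile repo, and the list of names passed to it is just an example.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment.

    Only top-level names (no dots) are supported in this sketch.
    """
    missing = []
    for name in names:
        # find_spec returns None when no importable module has this name
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'parse' is the module from the traceback; 'json' is stdlib and should be found
    print(missing_modules(["json", "parse"]))
```

Note that the name you import and the name you pip install are not always the same. In this case they happen to match.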

Right after that, I noticed it couldn't create the directories it needed to run, so I manually added pile_output and pile_holdout folders to the project root.
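The manual folder step can also be scripted. This is a rough sketch, assuming the script only needs the two folders named above to exist under the project root. `ensure_dirs` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path


def ensure_dirs(root, names=("pile_output", "pile_holdout")):
    """Create each named folder under root if it does not already exist."""
    created = []
    for name in names:
        path = Path(root) / name
        # exist_ok=True makes this safe to run more than once
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Example (run from the project root):
# ensure_dirs(".")
```

Running it again is harmless, since folders that already exist are left alone.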