This project is a clone of the WebText dataset used to train GPT-2, as outlined in the OpenAI paper. It is still very much a work in progress.
Requirements: Pipenv and Python 3.
To install Python dependencies:
pipenv install
System libraries required by Newspaper:
On Ubuntu:
sudo apt-get install libxml2-dev libxslt-dev
On OS X:
brew install libxml2 libxslt
- Get list of URLs from reddit:
pipenv run python get_urls.py
- Download data from URLs:
pipenv run python download.py
Resulting files will be deposited in data/ with the format {domain}-{sha256 hash of url}.txt
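To illustrate the naming scheme, here is a minimal sketch of how such a filename could be derived from a URL. It is not taken from download.py: the helper name output_filename is hypothetical, and it assumes that "domain" means the URL's host part (lowercased) and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL string. The actual script may differ on either point.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a '{domain}-{sha256 hash of url}.txt' name for a URL.

    Hypothetical helper for illustration only, not part of this project.
    """
    # Assumption: "domain" is the URL's network location (host), lowercased.
    # Note that netloc also includes a port or credentials if the URL has them.
    domain = urlparse(url).netloc.lower()
    # Assumption: the hash is the hex SHA-256 digest of the UTF-8 encoded URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    # Prints a name of the form example.com-<64 hex characters>.txt
    print(output_filename("https://example.com/some/article"))
```

Hashing the full URL keeps names unique per page while the domain prefix keeps files from the same site grouped together when the directory is sorted.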
Enjoy!