OpenWebText

This project is a clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. This project is still heavily WIP.

Dependencies

Pipenv, Python 3,

To install python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

pipenv run python get_urls.py

pipenv run python download.py

Resulting files will be deposited in data/ with format {domain}-{sha256 hash of url}.txt.

Enjoy!

An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.

Language:Python 100.0%