Skylion007 / openwebtext

An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.

OpenWebText

This project is an open clone of the GPT-2 WebText dataset, as outlined in the OpenAI paper. It is still very much a work in progress.

Dependencies

Pipenv, Python 3

To install the Python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

  1. Get a list of URLs from Reddit:
pipenv run python get_urls.py
  2. Download the data from those URLs:
pipenv run python download.py

The resulting files are written to data/ and named {domain}-{sha256 hash of url}.txt.
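As a rough illustration of that naming scheme, here is a minimal sketch of how such a filename could be derived from a URL. The function name output_filename is hypothetical, and the details are assumptions rather than a description of what download.py actually does: the domain is taken from the URL's network location, and the hash is the hex-encoded SHA-256 of the UTF-8 URL string.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a {domain}-{sha256 hash of url}.txt style filename.

    Hypothetical helper: the domain extraction and hash encoding
    below are assumptions, not taken from the project's code.
    """
    # Assumption: "domain" means the network location part of the URL.
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex digest of the UTF-8 encoded URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    print(output_filename("https://example.com/some/page"))
```

Hashing the full URL keeps filenames unique per page even when many pages share a domain, while the domain prefix keeps the files easy to group by source.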

Enjoy!

Languages

Language: Python 100.0%