nlngh / ct_warc_to_doc

Source code to extract content from the Common Crawl news corpus and upload it to S3.

Common Crawl News-Specific WARC File Parser

Aim of the project
Extracts documents from the news-specific WARC files published by Common Crawl.

Requirements

AWS
EC2


How to run

python main.py --month_id 01 --year_id 2020 --month_half first
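
The repository's `main.py` is not shown here, so the following is only a minimal sketch of how the three flags above could be parsed. The flag names come from the command above. The accepted values (`first`/`second`, zero-padded months), the helper names `build_parser` and `cc_news_prefix`, and the use of the public `crawl-data/CC-NEWS/<year>/<month>/` key layout are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the run command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract documents from Common Crawl news WARC files."
    )
    # Assumption: months are passed zero-padded, as in the example ("01").
    parser.add_argument(
        "--month_id",
        required=True,
        choices=[f"{m:02d}" for m in range(1, 13)],
        help="Two-digit month, e.g. 01",
    )
    parser.add_argument("--year_id", required=True, help="Four-digit year, e.g. 2020")
    # Assumption: the only valid halves are "first" and "second".
    parser.add_argument(
        "--month_half",
        required=True,
        choices=["first", "second"],
        help="Which half of the month to process",
    )
    return parser


def cc_news_prefix(year_id: str, month_id: str) -> str:
    """Return the S3 key prefix for one month of news WARC files.

    Assumption: the project reads from the public Common Crawl bucket,
    where news files are grouped by year and month.
    """
    return f"crawl-data/CC-NEWS/{year_id}/{month_id}/"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(cc_news_prefix(args.year_id, args.month_id), args.month_half)
```

Restricting `--month_id` and `--month_half` with `choices` makes a mistyped value fail at startup rather than partway through a long parsing job.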

ToDo

  • Improve the overall WARC file parsing workflow
    • The workflow should be more robust
  • Remove the command-line parameters so that parsing runs without arguments
    • The month and year values could be stored somewhere else
  • Run on AWS spot instances
  • Add autoscaling so that misbehaving instances are terminated and replacement instances are launched
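
One possible way to approach the parameterless item is to derive the period from the current date instead of storing it anywhere. The sketch below is not part of the project: the function name `latest_complete_half` is hypothetical, and it assumes that "first" means days 1 to 15 and "second" means the rest of the month, which the README does not confirm.

```python
from datetime import date
from typing import Optional, Tuple


def latest_complete_half(today: Optional[date] = None) -> Tuple[str, str, str]:
    """Return (year_id, month_id, month_half) for the most recent finished half-month.

    Assumption: the month is split after day 15.
    """
    today = today or date.today()
    if today.day > 15:
        # The first half of the current month has already ended.
        return f"{today.year:04d}", f"{today.month:02d}", "first"
    if today.month == 1:
        # Early January: fall back to the second half of last December.
        return f"{today.year - 1:04d}", "12", "second"
    # Otherwise use the second half of the previous month.
    return f"{today.year:04d}", f"{today.month - 1:02d}", "second"
```

For example, calling `latest_complete_half(date(2020, 1, 20))` yields the same values as the run command shown earlier: year `2020`, month `01`, half `first`.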

Languages

Python 98.0%, Shell 2.0%