timtadh / warc-extractor

Extract a random sample of HTML files from WARC files.

warc-extractor

by Junqi Ma (jxm844@case.edu) and Tim Henderson (tim.tadh@gmail.com)

Passing "-n 30000" generates about 700 files, each larger than 300 KB.

Example

./WarcExtractor -n 30000 --file crawl-file.warc.gz -o result-dir
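
The README does not describe how the random sample is drawn. As a rough illustration only, one common way to pick a fixed number of items from a stream of unknown length (such as the records of a WARC file) is reservoir sampling. The sketch below assumes that "-n" is the sample size, which the README does not state explicitly; the class ReservoirSample and its sample method are hypothetical names and are not taken from the warc-extractor source.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of warc-extractor.
final class ReservoirSample {

    // Returns up to n items chosen uniformly at random from the stream,
    // reading it once and keeping at most n items in memory.
    static <T> List<T> sample(Iterator<T> stream, int n, Random rng) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < n) {
                // Fill the reservoir with the first n items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the new item is kept only if the
                // slot falls inside the reservoir (probability n / seen).
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < n) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

In a tool like this one, the stream would be the HTML records read from the archive, and a size filter (for example, keeping only records above 300 KB) could be applied either before or after sampling, depending on whether "-n" is meant to count records considered or files written.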

TODO: add a command-line option to specify the HTML file size threshold

License: Other

Languages: Java 100.0%