paper reads pdf
PDF processing pipeline for paper.mindynode.com
Install
pip install -r requirements.txt
Run
python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2
# skip crawling
python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2 --skip_crawl
# use original file names
python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2 --keep_filename
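For orientation, here is a minimal sketch of how the four flags above could be parsed with `argparse`. It is an illustration, not the actual contents of `main.py`: the function name `build_parser`, the help texts, and the default depth of 1 are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the commands above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="PDF pipeline (illustrative sketch)")
    # Start page to crawl for PDF links.
    parser.add_argument("--page_uri", required=True, help="start page to crawl")
    # How many link levels to follow; the default of 1 is an assumption.
    parser.add_argument("--depth", type=int, default=1, help="crawl depth")
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--skip_crawl", action="store_true", help="skip the crawl step")
    parser.add_argument("--keep_filename", action="store_true", help="keep original file names")
    return parser


# Parse the same arguments as the second command above.
args = build_parser().parse_args(
    ["--page_uri=https://www.hackernewspapers.com/", "--depth=2", "--skip_crawl"]
)
print(args.page_uri, args.depth, args.skip_crawl, args.keep_filename)
```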
The pipeline includes 5 steps:
- crawl PDFs
- convert PDFs to text
- screenshot thumbnails
- compute the TF-IDF matrix
- extract metadata using GROBID
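To make the TF-IDF step concrete, here is a small standard-library sketch that builds a TF-IDF matrix from already-extracted text. It is not the repository's implementation: the names `tokenize` and `tfidf_matrix`, the simple tokenizer, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Uses raw term counts as TF and a smoothed IDF (an assumed choice):
    idf(t) = ln((1 + N) / (1 + df(t))) + 1
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


# Two toy "papers" standing in for the text produced by the PDF-to-text step.
vocab, rows = tfidf_matrix(["graph neural networks", "neural machine translation"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in every document (here `neural`) get the lowest weight, while terms unique to one document are weighted higher.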
the code is based on: