josherich / paper-reads-pdf

download and parse pdf

paper reads pdf

pdf pipeline for paper.mindynode.com

Install

pip install -r requirements.txt

Run

python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2

# skip crawling
python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2 --skip_crawl

# use original file names
python main.py --page_uri=https://www.hackernewspapers.com/ --depth=2 --keep_filename
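The flags used above could be parsed roughly as follows. This is a minimal sketch using the standard-library argparse module, not the actual contents of main.py: the function name build_parser, the help strings, and the default depth are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the four flags shown in the Run section.

    Hypothetical sketch: defaults and help text are assumptions,
    not taken from the real main.py.
    """
    parser = argparse.ArgumentParser(description="download and parse pdf")
    # Page where crawling starts.
    parser.add_argument("--page_uri", required=True,
                        help="URL of the page to start crawling from")
    # How many link levels to follow (default of 1 is an assumption).
    parser.add_argument("--depth", type=int, default=1,
                        help="crawl depth")
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--skip_crawl", action="store_true",
                        help="skip the crawl step and reuse downloaded files")
    parser.add_argument("--keep_filename", action="store_true",
                        help="keep the original PDF file names")
    return parser


# Parse an explicit argument list, mirroring the second command above.
args = build_parser().parse_args(
    ["--page_uri=https://www.hackernewspapers.com/", "--depth=2", "--skip_crawl"]
)
print(args.page_uri, args.depth, args.skip_crawl, args.keep_filename)
```

Using store_true for --skip_crawl and --keep_filename matches how the commands above pass them without a value.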

The pipeline includes 5 steps:

  1. crawl pdfs
  2. pdf to text
  3. screenshot thumbs
  4. compute the TF-IDF matrix
  5. extract meta using grobid
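To illustrate step 4, here is a small standard-library sketch of what a TF-IDF matrix over the extracted text might look like. It is not the repository's code, which may well rely on a library vectorizer instead; the function names tokenize, tfidf_matrix, and cosine, the tokenizer regex, and the smoothed IDF formula are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocab, rows), where each row is an L2-normalized TF-IDF vector.

    Hypothetical helper: uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # Normalize so the dot product of two rows is their cosine similarity.
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for the text produced by the pdf-to-text step.
docs = [
    "graph neural networks",
    "neural networks for graphs",
    "database query optimization",
]
vocab, matrix = tfidf_matrix(docs)
print(cosine(matrix[0], matrix[1]) > cosine(matrix[0], matrix[2]))
```

With normalized rows, papers that share terms score higher against each other than papers with no terms in common, which is what makes the matrix useful for finding similar papers.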

the code is based on:

  1. arxiv-sanity-preserver
  2. pdf-crawler
  3. grobid-client-python

License: GNU General Public License v3.0


Languages

Python 99.6%, TSQL 0.4%