arxiv sanity preserver

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It's super hacky and was written in 4 hours. I may keep polishing it a bit over time, but it already serves its purpose for me. The code uses the Arxiv API to fetch the most recent papers (as many as you want; I used the last 1100 papers over the last 3 months), then downloads all the papers, extracts their text, and creates tfidf vectors for each paper. Lastly, there is a flask interface for searching through and filtering similar papers using the vectors.
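
As a rough illustration of the first stage, here is a minimal sketch of building a query URL for the public arXiv API. The `search_query`, `sortBy`, `sortOrder`, `start` and `max_results` parameters are part of that API, but the helper name `build_query_url` and the specific values are assumptions for illustration, not necessarily what scrape.py does.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_query_url(categories, start=0, max_results=100):
    """Return an arXiv API URL for the most recently updated papers.

    Hypothetical helper: the category list and page size are illustrative.
    """
    # Combine categories with OR, e.g. "cat:cs.CV OR cat:cs.CL".
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "sortOrder": "descending",
        "start": start,              # offset, used for paging through results
        "max_results": max_results,  # page size
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(["cs.CV", "cs.CL", "cs.LG"], start=0, max_results=100)
print(url)
```

Paging through more papers would then be a matter of increasing `start` in steps of `max_results` and saving each response.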

The main functionality is a search feature. Best of all, you can click "sort by tfidf similarity to this", which ranks all papers by their similarity to that one in terms of tfidf bigrams. I find this quite useful.
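
To make the "sort by tfidf similarity" idea concrete, here is a small standard-library sketch. It is a stand-in for illustration only: the project itself relies on scikit-learn's tfidf vectorizer, and the function names, the smoothed idf variant and the toy documents below are all assumptions.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into lowercase word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(docs):
    """Map each document to an L2-normalised {bigram: tf-idf weight} dict."""
    counts = [Counter(bigrams(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each bigram occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed idf (one common variant), so no weight is ever zero.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def rank_by_similarity(vectors, query_index):
    """Return document indices sorted by cosine similarity to the query."""
    q = vectors[query_index]

    def score(i):
        # Vectors are unit length, so the dot product is the cosine similarity.
        return sum(w * vectors[i].get(t, 0.0) for t, w in q.items())

    return sorted(range(len(vectors)), key=score, reverse=True)


# Toy "papers", purely illustrative.
docs = [
    "convolutional neural networks for image classification",
    "deep convolutional neural networks for object detection",
    "recurrent language models for machine translation",
    "image classification with convolutional neural networks",
]
vectors = tfidf_vectors(docs)
ranking = rank_by_similarity(vectors, 0)
print(ranking)
```

The query paper always ranks first (similarity 1 to itself), and papers sharing no bigrams with it end up at the bottom.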

See it in action

This code is currently running live at https://karpathy23-5000.terminal.com, serving 10400 arxiv papers from cs.[CV|CL|LG] over the last ~3 years. Clearly, this is not the final home and I would like to move it to a more permanent location soon.

Dependencies

You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), flask (for serving the results), and tornado (if you want to run the flask server in production), as well as dateutil and scipy. Most of these are easy to get through pip, e.g.:

$ virtualenv env                # optional: use virtualenv
$ source env/bin/activate       # optional: use virtualenv
$ pip install feedparser        # only if you want to scrape arxiv
$ pip install numpy
$ pip install scipy             # needed for sparse arrays
$ pip install scikit-learn      # for the tfidf vectorizer
$ pip install python-dateutil   # only in serve.py for some date utils
$ pip install flask             # only in serve.py
$ pip install tornado           # only in serve.py
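
After installing, a quick check that everything is importable can save some confusion later. The following is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, not part of the repository. Note that two packages import under a different name than the one used with pip.

```python
import importlib.util

# pip name -> import name. scikit-learn imports as `sklearn`,
# python-dateutil as `dateutil`; the others use the same name.
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "feedparser": "feedparser",
    "python-dateutil": "dateutil",
    "flask": "flask",
    "tornado": "tornado",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


print("missing:", missing_packages() or "none")
```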

Ugly "I don't have time" processing pipeline

Requires reading code and getting hands dirty. Magic numbers throughout code.

1. Run scrape.py, which queries the most recent papers in Arxiv and dumps xml into folder raw.
2. Run parse_raw.py, which reads all xml files in raw and creates a pickle with all critical information called db.p.
3. Run download_pdf.py, which iterates over all papers in the parsed pickle and downloads the papers into folder pdf.
4. Run parse_pdf_to_text.py to export all text from pdfs to files in txt.
5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p pickle file.
6. Run thumb_pdf.py to export thumbnails of all pdfs to thumb.
7. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers.
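
To give a feel for the "xml to pickle" stage of the pipeline above, here is a self-contained sketch that parses an Atom feed and pickles the extracted fields. It uses the standard library's ElementTree rather than feedparser so it runs without extra installs, and the sample entry, the `parse_feed` helper and the dictionary layout are all placeholders, not the actual structure of db.p.

```python
import pickle
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace

# Placeholder feed with one made-up entry (not a real paper).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <updated>2015-01-01T00:00:00Z</updated>
    <title>A Placeholder
      Title</title>
    <summary>Placeholder abstract text.</summary>
    <author><name>A. Author</name></author>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Return {paper_id: metadata dict} for every entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    db = {}
    for entry in root.findall("atom:entry", ATOM):
        url = entry.findtext("atom:id", namespaces=ATOM)
        # Simplification: assumes IDs like "0000.00000v1" after the last slash.
        paper_id, _, version = url.rsplit("/", 1)[-1].rpartition("v")
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        db[paper_id] = {
            "version": int(version),
            "title": " ".join(title.split()),  # collapse line breaks
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM).strip(),
            "updated": entry.findtext("atom:updated", namespaces=ATOM),
            "authors": [
                a.findtext("atom:name", namespaces=ATOM)
                for a in entry.findall("atom:author", ATOM)
            ],
        }
    return db


db = parse_feed(SAMPLE_FEED)
blob = pickle.dumps(db)        # what would be written to a pickle file
restored = pickle.loads(blob)  # what a later stage would read back
print(restored)
```

Keying the dictionary by paper ID without the version suffix makes it easy to overwrite an older version when a newer one shows up in a later scrape.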

Prebuilt database

If you'd like to browse the 10400 arxiv papers currently served by the demo, you can download the prebuilt database. This means you can skip steps 1-6 above and simply run the server (step 7). Here is the download link. Unzip it in the root folder and fire up flask with serve.py.

Running online

If you'd like to run this flask server online (e.g. AWS/Terminal), run it as python serve.py --prod.
