makestuff / tides-ocr

Extract machine-readable tidal information from the PLA PDF

The Port of London Authority publish tidal predictions in advance for the whole year. They publish them as a PDF, which is great for humans but not so good for machines. This tool downloads the PDF and OCRs it to extract the raw data, so you can run your own analytics on it.

pdf2json.py: Download, extract and analyze the tide-table PDFs, producing a JSON file for each year.
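
As a rough illustration of the "extract" step, here is a minimal sketch of turning OCR'd text into JSON records. It is not the code in pdf2json.py. The input layout (a 24-hour HHMM time followed by a height in metres), the `parse_ocr_text` helper and the `time` / `height_m` field names are all assumptions made for this example; the real PDF layout and JSON schema may differ.

```python
import json
import re

# Hypothetical pattern: a 24-hour HHMM time followed by a height in metres,
# e.g. "0412 6.3". The real OCR output may be laid out differently.
ENTRY = re.compile(r"\b([01]\d|2[0-3])([0-5]\d)\s+(\d{1,2}\.\d)\b")


def parse_ocr_text(text, year, month, day):
    """Turn the OCR'd text for one day into a list of tide records."""
    records = []
    for hh, mm, height in ENTRY.findall(text):
        records.append({
            # ISO-style timestamp built from the date and the HHMM token
            "time": f"{year:04d}-{month:02d}-{day:02d}T{hh}:{mm}",
            # Field name is an assumption, not the project's schema
            "height_m": float(height),
        })
    return records


# Made-up sample text standing in for one day's OCR output
sample = "0412 6.3\n1038 0.7\n1642 6.5\n2301 0.9"
records = parse_ocr_text(sample, 2022, 1, 1)
print(json.dumps(records, indent=2))
```

Validating the hour and minute ranges in the pattern is one cheap way to reject some OCR misreads, though it will not catch every one.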

dev_ocr.ipynb: Jupyter notebook for a more interactive experience.

To use the Jupyter notebook, you must first have downloaded and extracted the data for at least 2022. You can do this with:

./pdf2json.py 2022 1,2 0-6 48

If you have access to GitHub Codespaces, you can open the repo there. Alternatively, install Docker Desktop and VSCode, then clone the repo and open it in VSCode, which should be smart enough to figure out how to start it running. The Jupyter notebook makes it very easy to analyse the data.
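
To give a flavour of the kind of analysis this enables, here is a small sketch that splits readings into high and low waters and computes the mean tidal range. The records and the `time` / `height_m` field names are invented for the example and are not taken from the JSON that pdf2json.py produces, so adjust them to the real schema.

```python
import json
from statistics import mean

# Hypothetical records; the real JSON files may use other field names.
raw = """[
  {"time": "2022-01-01T04:12", "height_m": 6.3},
  {"time": "2022-01-01T10:38", "height_m": 0.7},
  {"time": "2022-01-01T16:42", "height_m": 6.5},
  {"time": "2022-01-01T23:01", "height_m": 0.9}
]"""
records = json.loads(raw)

# Split readings into high and low waters around the midpoint of the
# observed heights. This is a crude heuristic, chosen for brevity.
heights = [r["height_m"] for r in records]
midpoint = (max(heights) + min(heights)) / 2
highs = [r for r in records if r["height_m"] > midpoint]
lows = [r for r in records if r["height_m"] <= midpoint]

# The single highest reading, and the mean high water minus mean low water
highest = max(records, key=lambda r: r["height_m"])
mean_range = mean(r["height_m"] for r in highs) - mean(r["height_m"] for r in lows)

print(f"Highest tide: {highest['height_m']} m at {highest['time']}")
print(f"Mean tidal range: {mean_range:.2f} m")
```

In the notebook you would load the records from the JSON file for the year instead of the inline string.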

About

License: GNU General Public License v3.0


Languages

Jupyter Notebook 98.9%, Python 1.0%, Dockerfile 0.1%