anisotropi4 / nesa

Extracts data from the Network Rail (NR) National Electronic Sectional Appendix data

nesa

This project consists of a series of scripts that extract data from the Network Rail National Electronic Sectional Appendix (NESA) into a series of Route Clearance reports using PDF text extraction.

The downloadable NESA data is available here and consists of a set of route PDF files with spreadsheets and embedded TIFF image files.

Extracted data download links

Unformatted text | Per page Route Clearance TSV | Route Clearance XLSX Report
Anglia Route | Anglia Route | Anglia Route
Kent, Sussex and Wessex | Kent, Sussex and Wessex | Kent, Sussex and Wessex
London North-Eastern | London North-Eastern | London North-Eastern
London North-Western North | London North-Western North | London North-Western North
London North-Western South | London North-Western South | London North-Western South
Scotland | Scotland | Scotland
Western | Western | Western
  • Notes: South Wales data is now in Western, and North Wales data is in London North-Western North. Kent, Sussex and Wessex data is now back in the Kent-Sussex-Wessex directory.

Data Source

The PDF files for these seven routes are available here

Prerequisites

  • jq is a lightweight and flexible command-line JSON processor. On a Debian or similar apt-based Linux system:

    $ sudo apt install jq

  • poppler-utils package to decompress, extract text from and render PDF files, based on the xpdf-3.0 code base

    $ sudo apt install poppler-utils

  • ghostscript package to interpret and manipulate PostScript and PDF files

    $ sudo apt install ghostscript
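
Before running anything it can help to confirm that the command-line tools above are actually on the PATH. The following is a minimal sketch and not part of the project scripts; the executable names are assumptions (on Debian-style systems poppler-utils provides pdftotext and ghostscript provides gs), and missing_tools is a hypothetical helper name.

```python
import shutil

# Executable names are assumptions: pdftotext ships with poppler-utils
# and gs with ghostscript on Debian-style systems.
REQUIRED_TOOLS = ("jq", "pdftotext", "gs")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command-line tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

An empty list means every tool was found; anything listed can be installed with the apt commands above.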

python dependencies

  • Python 3.9 to run the scripts. Tested against Python 3.7, 3.8 and 3.9
  • Python pandas data processing library
  • Python pdfplumber PDF table extraction and visual debugging library
  • Python pdfminer.six PDF information extraction library
  • Python openpyxl library to write Excel 2010 xlsx files

python virtualenv package

For ease of use, manage the Python package dependencies in a local virtual environment, venv, created with the Python virtualenv package:

$ sudo apt install virtualenv
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Creating the Route Clearance reports

The reports for the routes are created as follows:

Download the data

Download the seven route Sectional Appendix PDF files into the download directory from here

Process the PDF files

To extract the data execute the run.sh script:

$ ./run.sh

This executes a series of scripts to segment, extract and output the data, creating a series of TSV files and Excel spreadsheets in the seven route directories.
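
As an illustration of the table-extraction step, the sketch below writes the first table that pdfplumber finds on each page of a PDF to a TSV file, one row per line, prefixed with the page number. It is not the code that run.sh executes; the function names clean_cell and tables_to_tsv and the output layout are assumptions made for this example.

```python
def clean_cell(value):
    """Collapse whitespace so a cell cannot break the TSV layout."""
    return " ".join(str(value or "").split())


def tables_to_tsv(pdf_path, tsv_path):
    """Write the first table found on each page of pdf_path to tsv_path."""
    # Third-party dependency, imported lazily so that clean_cell can be
    # used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf, \
            open(tsv_path, "w", encoding="utf-8") as out:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_table() returns None when no table is detected
            for row in page.extract_table() or []:
                cells = [str(number)] + [clean_cell(cell) for cell in row]
                out.write("\t".join(cells) + "\n")
```

Empty cells come back from pdfplumber as None, which is why clean_cell maps them to an empty string before joining.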

How it works

The data is extracted from the PDF text-object elements. However, formatting issues and the use of a grey-scale background in a number of the key route-clearance tables break pdfplumber and pdfminer formatted text extraction.

To overcome this, the PDF files are converted to an uncompressed CMYK PDF/A format, and the grey background is removed by deleting the call and graphic state for the embedded grey background image. Although this seems to work, it is in no way a recommended approach.

It creates broken PDF files, as the internal PDF checksums no longer match. It assumes the background grey colour is encoded as 0.8081 g or 1 1 0 rg and rendered using the call to f*. Were the PDF rendering software used by Network Rail, Ghostscript, or qpdf to change, this would just break. YMMV.
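
To make the idea concrete, here is a rough sketch of how such a fill could be stripped from an uncompressed content stream. It is not the project's implementation: it assumes each grey fill appears as a colour operator (0.8081 g or 1 1 0 rg) followed only by lower-case path operators and numbers, ending at the next f*, and strip_grey_fill is a hypothetical name.

```python
import re

# Assumption: a grey fill looks like "<colour op> <path ops> f*".
# The middle part only allows digits, dots, minus signs, whitespace and
# lower-case operators, so a match cannot run across text operators
# such as BT, Tf or Tj.
GREY_FILL = re.compile(
    rb"(?<![\d.])(?:0\.8081 g|1 1 0 rg)\b[0-9.\-\sa-z]*?f\*"
)


def strip_grey_fill(stream: bytes) -> bytes:
    """Remove grey background fills from an uncompressed content stream."""
    return GREY_FILL.sub(b"", stream)


if __name__ == "__main__":
    sample = b"q\n0.8081 g\n0 0 100 20 re\nf*\nQ\nBT /F1 9 Tf (W6A) Tj ET\n"
    print(strip_grey_fill(sample))
```

Editing the bytes this way changes the stream length without updating the rest of the file, which is consistent with the broken PDF files described above.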

License

Network Rail are the copyright holder and retain all intellectual property rights related to the data and derived data contained within the National Electronic Sectional Appendix, as set out here

The scripts and other material are provided under the terms set out in the LICENSE

Acknowledgement

The authors would like to thank Network Rail for providing this data, and all the contributors to the tools and libraries used.

License:MIT License