pageproc

Recognize page elements by page bitmap analysis.

Usage

To segment a PDF and write the results to a directory, run:

pdfproc --segment file.pdf --output dir/to/data

The output directory will contain:

one directory per page, named page_1 to page_N, each holding the files 1.png to K.png, one image per segment
page_i/data.json, which holds metadata about the segments of page i

Metadata file

It contains the following fields:

segments - a list of dictionaries, each with the following keys:
- text - text inside the segment, if any
- type_guess - one of text, dummy, table or image
- x - segment x-coordinate on the page
- y - segment y-coordinate on the page
- width - segment width
- height - segment height
- id - segment ID
- parent_id - ID of the segment containing this segment, if any, null otherwise
- img_path - path to the segment image, relative to this file
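As a minimal sketch of how the metadata could be consumed, the snippet below reads one page's data.json using only the keys listed above. The helper names (load_segments, top_level_segments, image_file) are hypothetical and not part of pageproc, and the sketch assumes that segments is a top-level key of the JSON object.

```python
import json
from pathlib import Path


def load_segments(page_dir):
    """Return the list of segment dictionaries stored in page_dir/data.json."""
    with open(Path(page_dir) / "data.json", encoding="utf-8") as fh:
        metadata = json.load(fh)
    # Assumption: "segments" is a top-level key, as the field list suggests.
    return metadata["segments"]


def top_level_segments(segments):
    """Keep only segments that are not contained in another segment."""
    return [seg for seg in segments if seg.get("parent_id") is None]


def image_file(page_dir, segment):
    """Resolve img_path, which is relative to the metadata file."""
    return Path(page_dir) / segment["img_path"]
```

For example, top_level_segments(load_segments("dir/to/data/page_1")) would list the outermost segments of the first page.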

Using an ML model to classify segments

To train model:

segment_classify train --data-dir <path/to/output/of/pdfproc> --output <model-output-file.json>

To classify segments with ML model run:

segment_classify classify --data-dir <path/to/output/of/pdfproc> --model <path/to/model/file.json>

NOTE: The classification command modifies the data and inserts a type field for every segment.
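After classification, the inserted type field can be inspected with a few lines of Python. The sketch below tallies the assigned types across all pages. The function name count_types and the "unclassified" fallback label are illustrative assumptions, as is the page_*/data.json layout taken from the output description above.

```python
import json
from collections import Counter
from pathlib import Path


def count_types(data_dir):
    """Count the classifier-assigned `type` of every segment under data_dir."""
    counts = Counter()
    for metadata_file in sorted(Path(data_dir).glob("page_*/data.json")):
        with open(metadata_file, encoding="utf-8") as fh:
            segments = json.load(fh)["segments"]
        # Segments without a `type` field have not been classified yet.
        counts.update(seg.get("type", "unclassified") for seg in segments)
    return counts
```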

Changing behaviour of tools

The tools support options that change:

the segmentation behaviour and thresholds of pdfproc
the ML model trained and used by segment_classify, along with model-specific parameters given as PREPROCESS_* and MODEL_* environment variables

Segmentation

Segmentation uses the XY-cut algorithm, which is documented in this paper: https://www.haralick.org/conferences/71280952.pdf
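To illustrate the idea, here is a minimal, generic recursive XY-cut on a binary page (a list of rows where 1 marks ink). It is a sketch of the general technique, not the pdfproc implementation. The function names and the min_gap threshold are assumptions made for this example. Each region is trimmed to its ink, split at the widest empty band of rows or columns, and reported as a segment once no band of at least min_gap remains.

```python
def _widest_gap(profile, min_gap):
    """Return (start, end) of the widest run of zeros in profile, or None."""
    best, start = None, None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            length = i - start
            if length >= min_gap and (best is None or length > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def _cut(page, x0, y0, x1, y1, min_gap, boxes):
    # Projection profiles: amount of ink in each row and column of the region.
    rows = [sum(page[y][x0:x1]) for y in range(y0, y1)]
    cols = [sum(page[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    if not any(rows):
        return  # empty region, nothing to report
    # Shrink the region to the bounding box of its ink.
    top = next(i for i, v in enumerate(rows) if v)
    bottom = len(rows) - next(i for i, v in enumerate(reversed(rows)) if v)
    left = next(i for i, v in enumerate(cols) if v)
    right = len(cols) - next(i for i, v in enumerate(reversed(cols)) if v)
    rows, cols = rows[top:bottom], cols[left:right]
    x0, y0, x1, y1 = x0 + left, y0 + top, x0 + right, y0 + bottom
    row_gap = _widest_gap(rows, min_gap)
    col_gap = _widest_gap(cols, min_gap)
    if row_gap is None and col_gap is None:
        boxes.append((x0, y0, x1 - x0, y1 - y0))  # leaf segment
        return
    row_len = row_gap[1] - row_gap[0] if row_gap else 0
    col_len = col_gap[1] - col_gap[0] if col_gap else 0
    if row_len >= col_len:
        # Cut along the empty band of rows: upper part, then lower part.
        _cut(page, x0, y0, x1, y0 + row_gap[0], min_gap, boxes)
        _cut(page, x0, y0 + row_gap[1], x1, y1, min_gap, boxes)
    else:
        # Cut along the empty band of columns: left part, then right part.
        _cut(page, x0, y0, x0 + col_gap[0], y1, min_gap, boxes)
        _cut(page, x0 + col_gap[1], y0, x1, y1, min_gap, boxes)


def xy_cut(page, min_gap=2):
    """Split a binary page into (x, y, width, height) boxes."""
    boxes = []
    _cut(page, 0, 0, len(page[0]), len(page), min_gap, boxes)
    return boxes
```

A larger min_gap merges nearby elements into one segment, while a smaller value splits the page more finely.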

Preprocessing and ML models

Users can choose among different preprocessors and ML models. Every segment is transformed by the specified preprocessor function during both training and prediction. The following preprocessors are available:

Preprocessor name	Description
`simple`	Does nothing to the segment. All data from the data file is used as input to the model.
`img_attrs`	Every segment is transformed into the vector [image area, width / area, height / width, mean grayscale value of the pixels].
`pixels`	The image is resized to a predefined size, converted to grayscale and flattened. The resulting values are the input to the model.
`histogram`	The image is converted to grayscale and resized to a predefined size, then pixels are summed for each column and each row, giving the vector (sum(c1), sum(c2), ..., sum(ck), sum(r1), ..., sum(rk)) as input to the model.
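The two feature-vector preprocessors can be sketched in a few lines. The functions below are illustrative only and are not the project's code: they take a grayscale image as a list of pixel rows, skip the grayscale conversion and resizing steps, and use hypothetical function names that happen to mirror the preprocessor names.

```python
def img_attrs(pixels):
    """Return [area, width / area, height / width, mean grayscale value]."""
    height, width = len(pixels), len(pixels[0])
    area = width * height
    mean = sum(sum(row) for row in pixels) / area
    return [area, width / area, height / width, mean]


def histogram(pixels):
    """Return the column sums followed by the row sums."""
    col_sums = [sum(col) for col in zip(*pixels)]
    row_sums = [sum(row) for row in pixels]
    return col_sums + row_sums
```

Because resizing is skipped here, the histogram length depends on the image size. A fixed-length model input would require resizing every segment to the same k x k shape first, as the description of the preprocessor implies.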

The preprocessed segments are fed into one of the following ML models:

Model	Description
1-rule	One-rule model
Neural network	Neural network model
RF	Random forest

Development and testing of models

To set up the development environment and run the model tests, first create a virtualenv:

virtualenv env
. env/bin/activate
pip install -r requirements.txt

Testing uses a single paper, for which pre-labeled data is available to score the accuracy of a model.

Please download the paper J. Olsen, "Realtime procedural terrain generation" (2004), and save it on your local machine.

After saving it, export an environment variable pointing to its path:

export SAMPLE_PDF=/path/to/paper/pdf.pdf

Run all tests and generate reports:

./bin/testAll.sh sample_configurations/ reports

Results

The following table shows the accuracy for each combination of preprocessor and model:

	preprocessor	model	accuracy (%)
1	`histogram`	RF	71.792
2	`img_attrs`	RF	72.034
3	`img_attrs`	one_rule	63.153
4	`pixels`	NN	38.451
5	`simple`	NN	42.345
6	`simple`	RF	70.422
7	`simple`	one_rule	63.216

The results show that random forest, using just the image attributes as input, achieves the best accuracy, about 72%.
