The Graph Converter is a tool for creating a graph representation of the content of PDFs. The graph representation can serve as the basis for further document processing steps. It encodes the geometric relationships between text elements, from which the document structure can be recovered.
The tool works independently of the document layout. The graph construction can be controlled via the parameter settings described below. In addition, layout-based optimization without manual parameter tuning is supported: graph parameters are estimated by a regression on document layout characteristics.
PDF documents are processed using the PDFContentConverter library.
- Pass the path of the PDF file to be converted to GraphConverter.
- Call the function convert(). The document graph representations are returned page-wise as a list of networkx graphs.
- Media boxes of a PDF can be accessed using get_media_boxes(), the page count using get_page_count().
Example call:
converter = GraphConverter(pdf)
result = converter.convert()
The file path is the only mandatory parameter for graph construction.
Besides the graph conversion, the media boxes of a document can be accessed using get_media_boxes() and the page count using get_page_count().
General document layout characteristics are stored in a converter.meta object.
A more detailed usage example is given in Tester.py.
The following image shows a document graph representation produced by the GraphConverter.
TODO
General parameters:
- file: file name
- merge_boxes: indicates whether PDF text boxes should be graph nodes, based on visual rectangles present in the document.
- regress_parameters: indicates whether graph parameters are estimated by regression or the a priori optimized defaults are used.
Edge restrictions:
- use_font: differing font size
- use_width: differing width
- use_rect: nodes contained in differing visual structures
- use_horizontal_overlap: indicates whether horizontal edges should be built on overlap. If not, default deltas are used.
- use_vertical_overlap: indicates whether vertical edges should be built on overlap. If not, default deltas are used.
Edge thresholds:
- page_ratio_x: maximum relative horizontal distance between two nodes at which an edge can be created
- page_ratio_y: maximum relative vertical distance between two nodes at which an edge can be created
- x_eps: alignment epsilon for vertical edges in points if use_horizontal_overlap is not enabled
- y_eps: alignment epsilon for horizontal edges in points if use_vertical_overlap is not enabled
- font_eps_h: how much the font sizes of two nodes may differ for a horizontal edge to be built when use_font is enabled
- font_eps_v: how much the font sizes of two nodes may differ for a vertical edge to be built when use_font is enabled
- width_pct_eps: relative width difference of nodes as a condition for vertical edges if use_width is enabled
- width_page_eps: maximum node width up to which the width acts as an edge condition if use_width is enabled
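To make the interplay of restrictions and thresholds more concrete, the following is a minimal sketch of how a vertical edge candidate could be tested. It is not the GraphConverter implementation: the function name, the default values, the choice of comparing left edges (x_0) for alignment, and the use of center coordinates for the distance are all assumptions made for illustration.

```python
def may_link_vertically(a, b, page_height,
                        page_ratio_y=0.2, x_eps=2.0,
                        use_horizontal_overlap=False,
                        use_font=True, font_eps_v=1.0):
    """Hypothetical check whether nodes a and b qualify for a vertical edge.

    a and b are dicts with the node attributes x_0, x_1, pos_y and font_size.
    All default values are placeholders, not the library's defaults.
    """
    # Distance threshold: vertical gap relative to the page height.
    if abs(a["pos_y"] - b["pos_y"]) / page_height > page_ratio_y:
        return False

    if use_horizontal_overlap:
        # Overlap mode: the horizontal extents of both nodes must intersect.
        aligned = a["x_0"] < b["x_1"] and b["x_0"] < a["x_1"]
    else:
        # Epsilon mode (assumption): left edges must align within x_eps points.
        aligned = abs(a["x_0"] - b["x_0"]) <= x_eps
    if not aligned:
        return False

    # Font restriction: font sizes may differ by at most font_eps_v.
    if use_font and abs(a["font_size"] - b["font_size"]) > font_eps_v:
        return False
    return True


upper = {"x_0": 50, "x_1": 150, "pos_y": 100, "font_size": 10}
lower = {"x_0": 51, "x_1": 120, "pos_y": 130, "font_size": 10}
shifted = {"x_0": 80, "x_1": 200, "pos_y": 130, "font_size": 10}

print(may_link_vertically(upper, lower, page_height=800))
print(may_link_vertically(upper, shifted, page_height=800))
print(may_link_vertically(upper, shifted, page_height=800,
                          use_horizontal_overlap=True))
```

In this sketch, switching to overlap mode accepts the shifted node because its horizontal extent intersects that of the upper node, even though the left edges are 30 points apart.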
- GraphConverter.py: contains the GraphConverter class for converting documents into graphs.
- util:
  - constants
  - StorageUtil: store/load functionalities
- Tester.py: Python script for testing the GraphConverter
- pdf: example PDF input files for tests
As a result, a list of networkx graphs is returned.
Each graph encapsulates a structured representation of a single page.
Edges are attributed with the following features:
- direction: shows the direction of an edge.
  - v: Vertical edge
  - h: Horizontal edge
  - l: Rectangular loop. This represents a novel concept encapsulating structural characteristics of document segments by observing whether two different paths end up in the same node.
- length: Scaled length of an edge
- lengthx_phys: Horizontal edge length
- lengthy_phys: Vertical edge length
- weight: Scaled total length
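The edge attributes above can be used to split a page graph by edge type. The sketch below groups edges by their direction attribute. It works on plain (u, v, attributes) triples, which is the shape that networkx yields from graph.edges(data=True); the helper name and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_edges_by_direction(edges):
    """Group (u, v, attrs) triples by the 'direction' edge attribute."""
    groups = defaultdict(list)
    for u, v, attrs in edges:
        groups[attrs.get("direction")].append((u, v))
    return dict(groups)


# Hand-written sample edges; with a real page graph one would pass
# graph.edges(data=True) instead.
sample_edges = [
    (0, 1, {"direction": "h", "length": 0.12}),
    (0, 2, {"direction": "v", "length": 0.05}),
    (1, 3, {"direction": "v", "length": 0.05}),
    (2, 3, {"direction": "l", "length": 0.12}),
]

grouped = group_edges_by_direction(sample_edges)
print(grouped["v"])  # the two vertical edges: [(0, 2), (1, 3)]
```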
All nodes contain the following content attributes:
- id: unique identifier of the PDF element
- page: page number, starting with 0
- text: text of the PDF element
- x_0: left x coordinate
- x_1: right x coordinate
- y_0: top y coordinate
- y_1: bottom y coordinate
- pos_x: center x coordinate
- pos_y: center y coordinate
- abs_pos: tuple containing a page-independent representation of the (pos_x, pos_y) coordinates
- original_font: font as extracted by pdfminer
- font_name: name of the font extracted from original_font
- code: font code as provided by pdfminer
- bold: 1 if the text is bold, 0 otherwise
- italic: 1 if the text is italic, 0 otherwise
- font_size: size of the text in points
- masked: text with numeric content replaced by #
- frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing the percentages of textual, numerical, text symbolic and other symbols
- len_text: number of characters
- n_tokens: number of words
- tag: tag for key-value pair extraction, indicating keys or values based on simple heuristics
- box: box extracted by the pdfminer layout analysis
- in_element_ids: contains the IDs of surrounding visual elements such as rectangles or lists. They are stored as a list [left, right, top, bottom]. -1 indicates that there is no adjacent visual element.
- in_element: indicates, based on in_element_ids, whether an element is contained in a visual rectangle (stored as "rectangle") or not (stored as "none").
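Several of these attributes are derived from the text and the bounding box of an element. The following sketch shows one plausible way to compute them. It is an assumption-based illustration, not the library code: in particular, masking digit by digit, treating ASCII punctuation as "text symbolic", and storing the histogram as fractions between 0 and 1 are guesses.

```python
import string


def mask_numbers(text):
    """Replace every digit with '#' (assumption: masking is per digit)."""
    return "".join("#" if ch.isdigit() else ch for ch in text)


def frequency_hist(text):
    """Return fractions of (textual, numerical, text symbolic, other) characters."""
    if not text:
        return (0.0, 0.0, 0.0, 0.0)
    counts = [0, 0, 0, 0]
    for ch in text:
        if ch.isalpha():
            counts[0] += 1
        elif ch.isdigit():
            counts[1] += 1
        elif ch in string.punctuation:  # assumption: ASCII punctuation only
            counts[2] += 1
        else:
            counts[3] += 1
    return tuple(count / len(text) for count in counts)


def derive_attributes(text, x_0, x_1, y_0, y_1):
    """Build a dict with a subset of the node attributes listed above."""
    return {
        "text": text,
        "pos_x": (x_0 + x_1) / 2,  # center of the bounding box
        "pos_y": (y_0 + y_1) / 2,
        "masked": mask_numbers(text),
        "frequency_hist": frequency_hist(text),
        "len_text": len(text),
        "n_tokens": len(text.split()),
    }


node = derive_attributes("Total: 42 EUR", x_0=50, x_1=150, y_0=100, y_1=112)
print(node["masked"])  # Total: ## EUR
```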
The media boxes contain the following dictionary entries:
- x0: Left x page crop box coordinate
- x1: Right x page crop box coordinate
- y0: Top y page crop box coordinate
- y1: Bottom y page crop box coordinate
- x0page: Left x page coordinate
- x1page: Right x page coordinate
- y0page: Top y page coordinate
- y1page: Bottom y page coordinate
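As a small illustration of how these entries can be used, the sketch below derives the page size and the crop box size from one media box dictionary. The helper names and the sample values (an A4 page in points with a 20 point crop margin) are hypothetical; the keys are the ones listed above.

```python
def page_size(media_box):
    """Return (width, height) of the full page."""
    width = abs(media_box["x1page"] - media_box["x0page"])
    height = abs(media_box["y1page"] - media_box["y0page"])
    return width, height


def crop_size(media_box):
    """Return (width, height) of the crop box."""
    width = abs(media_box["x1"] - media_box["x0"])
    height = abs(media_box["y1"] - media_box["y0"])
    return width, height


# Hypothetical values for a single page.
sample_box = {
    "x0": 20, "x1": 575, "y0": 20, "y1": 822,
    "x0page": 0, "x1page": 595, "y0page": 0, "y1page": 842,
}

print(page_size(sample_box))  # (595, 842)
print(crop_size(sample_box))  # (555, 802)
```

The absolute value keeps the result positive regardless of whether the y axis of the stored coordinates grows upwards or downwards.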
- The GraphConverter will be extended with OCR processing for images in order to support unstructured document types beyond PDFs.
- Example PDFs are obtained from the ICDAR Table Recognition Challenge 2013 https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/.
- Michael Benedikt Aigner
- Florian Preis