UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Home Page: https://github.com/UglyToad/PdfPig/wiki

Document layout analysis - superscript / subscript

jamesanastasi opened this issue

I'm having a bit of difficulty with this particular use case:

When a line contains superscript, the line extraction tends to extract the superscript word as a new line. This is a problem because the word ends up in the wrong place in the raw text.
Example, from the attached PDF:

TestPDF5.pdf

Integer egestas tristique aliquet. Sed consequat massa non vehicula finibus

is interpreted as:
B1: aliquet. Sed consequat massa non vehicula finibus
B2: Integer egestas tristique

So the raw text is:

aliquet. Sed consequat massa non vehicula finibus Integer egestas tristique

I have adjusted the DocstrumBoundingBoxes parameters, setting BetweenLineMultiplier to 0.75, and I now get the words in the right order:
B1: Integer egestas tristique
B2: aliquet. Sed consequat massa non vehicula finibus
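
For reference, the parameter change described above might look roughly like this in code. This is only a sketch: it assumes a PdfPig version in which `DocstrumBoundingBoxes` accepts a `DocstrumBoundingBoxesOptions` object, that the words come from `NearestNeighbourWordExtractor`, and that the text of interest is on page 1 of TestPDF5.pdf. None of these details are stated in the post.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Assumption: this PdfPig release exposes Docstrum settings through an options object.
var options = new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions
{
    // Value taken from the post. Judging by the results reported there,
    // a lower value makes neighbouring lines less likely to join one block.
    BetweenLineMultiplier = 0.75
};

using var document = PdfDocument.Open("TestPDF5.pdf");
var page = document.GetPage(1); // assumption: the sample text is on page 1

// Assumption: words are built with the nearest-neighbour extractor.
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = new DocstrumBoundingBoxes(options).GetBlocks(words);

// Print the blocks in the order the segmenter returned them.
for (var i = 0; i < blocks.Count; i++)
{
    Console.WriteLine($"B{i + 1}: {blocks[i].Text}");
}
```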

But this creates a new problem:

There are two blocks at the end, each of which has two lines:

Sed a felis fringilla,                           Praesent elementum in enim
maximus libero sit amet.                  id sagittis.

After changing the parameters to handle the superscript, they are split up into separate blocks (and therefore lose their order):

B1: Sed a felis fringilla,
B2: Praesent elementum in enim
B3: maximus libero sit amet.
B4: id sagittis.
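
One possible way to recover the two-column grouping from the four split blocks above is to re-merge vertically stacked blocks after segmentation. The sketch below is not part of PdfPig: `Block`, `BlockMerger`, `MergeStacked` and the `maxGap` threshold are hypothetical names introduced for illustration, and the coordinates are assumed to be PDF-style (origin at the bottom left, y growing upward).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a segmented block; this is NOT PdfPig's TextBlock.
public record Block(double Left, double Bottom, double Right, double Top, string Text);

public static class BlockMerger
{
    // True when 'lower' shares part of its x-range with 'upper' and its top edge
    // is at most maxGap below the bottom edge of 'upper'.
    private static bool IsStackedBelow(Block upper, Block lower, double maxGap)
    {
        double overlap = Math.Min(upper.Right, lower.Right) - Math.Max(upper.Left, lower.Left);
        double gap = upper.Bottom - lower.Top;
        return overlap > 0 && gap <= maxGap;
    }

    // Walks the blocks from top to bottom and folds each one into the first
    // already-collected block it is stacked under; otherwise keeps it as a new block.
    public static List<Block> MergeStacked(IEnumerable<Block> blocks, double maxGap)
    {
        var merged = new List<Block>();
        foreach (var b in blocks.OrderByDescending(x => x.Top))
        {
            int i = merged.FindIndex(m => IsStackedBelow(m, b, maxGap));
            if (i < 0)
            {
                merged.Add(b);
                continue;
            }

            var m = merged[i];
            merged[i] = new Block(
                Math.Min(m.Left, b.Left),
                Math.Min(m.Bottom, b.Bottom),
                Math.Max(m.Right, b.Right),
                m.Top,
                m.Text + " " + b.Text);
        }

        // Final order: top to bottom, then left to right.
        return merged.OrderByDescending(m => m.Top).ThenBy(m => m.Left).ToList();
    }
}
```

Calling `BlockMerger.MergeStacked(blocks, maxGap)` with a `maxGap` somewhat smaller than one line height would be expected to fold B3 into B1 and B4 into B2. The threshold is an assumption and would need tuning per document; too large a value could undo the separation that the lower BetweenLineMultiplier achieved.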

I've tried different variations of recursive XY cut and played with the ordering of the blocks, but I can't seem to find the sweet spot where I get both the right blocks and the right order.

Any suggestions or ideas would be appreciated.