Document layout analysis - superscript / subscript

Question

jamesanastasi opened this issue 4 months ago · comments

I'm having some difficulty with the following use case:

When a line contains superscript, the line extraction tends to extract the superscript word as a new line. This is a problem because the word ends up in the wrong place in the raw text.
Example, from the example PDF:

TestPDF5.pdf

Integer egestas tristique ^aliquet. Sed consequat massa non vehicula _finibus

is interpreted as:
B1 : aliquet. Sed consequat massa non vehicula finibus
B2 : Integer egestas tristique

So the raw text is:

aliquet. Sed consequat massa non vehicula finibus Integer egestas tristique

I have adjusted the DocstrumBoundingBoxes parameters: with BetweenLineMultiplier set to .75, I get the words in the right order:
B1 : Integer egestas tristique
B2 : aliquet. Sed consequat massa non vehicula finibus

But this creates a new problem.

There are two blocks at the end, each of which has two lines:

Sed a felis fringilla, Praesent elementum in enim
maximus libero sit amet. id sagittis.

After changing the parameters to handle the superscript, they are split up into separate blocks (and therefore lose their order):

B1 : Sed a felis fringilla,
B2 : Praesent elementum in enim
B3 : maximus libero sit amet.
B4 : id sagittis.
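
To make the trade-off concrete, here is a toy model in Python (not PdfPig's actual Docstrum implementation, just a rough sketch of the idea): lines in a single column are merged into one block whenever the gap to the previous line is at most `multiplier * typical_spacing`. The function `group_lines`, the baseline values and the spacing of 12 are all made up for illustration.

```python
# Toy model only: NOT PdfPig's Docstrum code. All names and numbers are hypothetical.

def group_lines(lines, multiplier, typical_spacing):
    """Group lines (text, baseline_y), listed top to bottom, into blocks.

    A line joins the current block when the vertical gap to the previous
    line is at most multiplier * typical_spacing; otherwise it starts a
    new block.
    """
    if not lines:
        return []
    limit = multiplier * typical_spacing
    blocks = [[lines[0][0]]]
    for (_, prev_y), (text, y) in zip(lines, lines[1:]):
        gap = prev_y - y  # PDF y grows upwards, so lower lines have smaller y
        if gap <= limit:
            blocks[-1].append(text)
        else:
            blocks.append([text])
    return blocks


# One column with two 2-line paragraphs: 12 units between lines of the same
# paragraph, 18 units between the paragraphs (invented values).
LINES = [
    ("Sed a felis fringilla,", 100.0),
    ("maximus libero sit amet.", 88.0),
    ("second paragraph, line 1", 70.0),
    ("second paragraph, line 2", 58.0),
]

loose = group_lines(LINES, multiplier=1.4, typical_spacing=12.0)
tight = group_lines(LINES, multiplier=0.75, typical_spacing=12.0)
```

In this sketch a looser value such as 1.4 keeps each two-line paragraph together (`loose` has two blocks), while .75 puts every line in its own block (`tight` has four), which mirrors the splitting shown above.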

I've tried different variations of recursive XYCut and played with the ordered blocks, but I can't seem to find the sweet spot where I get both the right blocks and the right order.

Any suggestions or ideas would be appreciated.