Document layout analysis - superscript / subscript
jamesanastasi opened this issue · comments
I'm having a bit of difficulty with this particular use case:
When a line contains superscript, the line extraction tends to extract the superscript word as a new line. This is bothersome because the word ends up in the wrong place in the raw text.
Example, from the example PDF:
Integer egestas tristique aliquet. Sed consequat massa non vehicula finibus
is interpreted as:
B1 : aliquet. Sed consequat massa non vehicula finibus
B2: Integer egestas tristique
So the raw text is:
aliquet. Sed consequat massa non vehicula finibus Integer egestas tristique
I have adjusted the DocstrumBoundingBoxes parameters, setting BetweenLineMultiplier to 0.75, and I now get the words in the right order:
B1 : Integer egestas tristique
B2 : aliquet. Sed consequat massa non vehicula finibus
But this creates a new problem with the two blocks at the end, where each block has two lines:
Sed a felis fringilla, Praesent elementum in enim
maximus libero sit amet. id sagittis.
After changing the parameters to make the superscript case work, they are split up into separate blocks (and therefore lose their order):
B1 : Sed a felis fringilla,
B2 : Praesent elementum in enim
B3 : maximus libero sit amet.
B4 : id sagittis.
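To make the trade-off concrete, here is a toy sketch in Python. It is not PdfPig code and not the actual Docstrum implementation. It only assumes that consecutive lines are grouped into one block when their vertical gap is at most a multiplier times some typical line spacing. The names `group_lines` and `typical_gap` and the coordinates are hypothetical, chosen for illustration only.

```python
from typing import List


def group_lines(baselines: List[float], typical_gap: float,
                multiplier: float) -> List[List[float]]:
    """Group text lines into blocks using a simple distance threshold.

    Toy rule (an assumption, not the library's algorithm): two consecutive
    lines belong to the same block when the gap between their baselines is
    at most multiplier * typical_gap.
    """
    if not baselines:
        return []
    # PDF y coordinates grow upward, so the top line has the largest y.
    ys = sorted(baselines, reverse=True)
    limit = multiplier * typical_gap
    blocks = [[ys[0]]]
    for previous, current in zip(ys, ys[1:]):
        if previous - current <= limit:
            blocks[-1].append(current)  # close enough: same block
        else:
            blocks.append([current])    # too far apart: start a new block
    return blocks


# A two-line block whose lines are 12 units apart, in a document where the
# typical line spacing is assumed to be 14 units.
two_line_block = [100.0, 88.0]
merged = group_lines(two_line_block, typical_gap=14.0, multiplier=1.3)
split = group_lines(two_line_block, typical_gap=14.0, multiplier=0.75)
```

Under these assumed numbers, the larger multiplier keeps both lines in one block, while 0.75 pushes the threshold below the actual gap and splits them, which matches the B1 to B4 result above.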
I've tried different variations of recursive XYCut and played with the ordered blocks, but I can't seem to find the sweet spot where I get both the right blocks and the right order.
Any suggestions or ideas would be appreciated.