pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

Home Page: https://pdfminersix.readthedocs.io

How to prevent pdfminer.six from executing layout algorithm to create textboxes?

yeus opened this issue

Is there a way to prevent pdfminer.six from executing the layout algorithm, so that one only gets a list of lines, graphics, image elements, etc.? I have several PDFs where the layout algorithm takes a very long time, simply because there are so many tables and textboxes distributed all over them. Also see this issue: euske/pdfminer#61

Since I don't need the layout algorithm, it would be sufficient for me to simply iterate over the page without textboxes.

How would I do that? Right now I am using extract_pages: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-pages

Is it somehow possible with the composable API? https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html

It seems your problem could be solved by passing None to the boxes_flow parameter of the LAParams object, i.e. LAParams(boxes_flow=None). This conclusion is drawn from #411.

Hi,

Just wanted to reply that this seems to work quite well. I haven't removed all the heavy lifting yet, but to give you a better idea, I did some timings:

The following was done with "vanilla" LAParams():

Time taken for page 10: 0.368527889251709 seconds, elements:602
Time taken for page 16: 23.002509593963623 seconds, elements:31153
Time taken for page 17: 14.153702735900879 seconds, elements:31262
Time taken for page 18: 0.3653285503387451 seconds, elements:576
Time taken for page 19: 0.36687445640563965 seconds, elements:596

With LAParams(detect_vertical=False):

Time taken for page 10: 0.3694620132446289 seconds, elements:602
Time taken for page 16: 22.085140705108643 seconds, elements:31153
Time taken for page 17: 13.33229398727417 seconds, elements:31262
Time taken for page 18: 0.36687588691711426 seconds, elements:576
Time taken for page 19: 0.359846830368042 seconds, elements:596

Then with LAParams(boxes_flow=None):

Time taken for page 10: 0.3904867172241211 seconds, elements:602
Time taken for page 16: 3.1319127082824707 seconds, elements:31153
Time taken for page 17: 3.055600643157959 seconds, elements:31262
Time taken for page 18: 0.248185396194458 seconds, elements:576
Time taken for page 19: 0.24678397178649902 seconds, elements:596

With LAParams(detect_vertical=False, boxes_flow=None):

Time taken for page 10: 0.3619980812072754 seconds, elements:602
Time taken for page 16: 3.106546640396118 seconds, elements:31153
Time taken for page 17: 3.0158188343048096 seconds, elements:31262
Time taken for page 18: 0.24485182762145996 seconds, elements:576
Time taken for page 19: 0.2494044303894043 seconds, elements:596

So the improvement is quite big: pages 16 and 17 drop from roughly 23 and 14 seconds to about 3 seconds each. detect_vertical=False seems to have minimal influence on the run time.
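For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a per-page timer. It is not the script used for the numbers above. The helper name time_pages is made up, and counting the direct children of each page is only an assumption about how "elements" was counted:

```python
import time


def time_pages(pages, start=1):
    """Yield (page_number, seconds, element_count) for each page.

    pages is any lazy iterable of pages, e.g. the generator returned by
    pdfminer's extract_pages(), so the cost of producing each page is
    paid inside next() and can be timed there.
    """
    iterator = iter(pages)
    page_number = start
    while True:
        t0 = time.perf_counter()
        try:
            page = next(iterator)  # parsing and layout work happens here
        except StopIteration:
            return
        elapsed = time.perf_counter() - t0
        # Assumption: "elements" means the direct children of the page.
        yield page_number, elapsed, len(list(page))
        page_number += 1
```

It could be used like this, where extract_pages and laparams come from pdfminer.six and "some.pdf" is a placeholder: `for n, secs, count in time_pages(extract_pages("some.pdf", laparams=laparams)): print(f"Time taken for page {n}: {secs} seconds, elements:{count}")`.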

Any other ideas how we could speed this up :)?