tabulapdf / tabula-extractor

Extract tables from PDF files

Spreadsheet extractor: extract cell text all at once

jeremybmerrill opened this issue · comments

I found a slow!

In Spreadsheet#fill_in_cells we call get_cell_text on every single cell. Page#get_cell_text just wraps a call to Page#get_text -- which runs a select over all the text elements on the page.

Obviously this is O(n^2) and it doesn't need to be. We (by which I mean, "I", but not right this minute) can write a method to use group_by to get all the cell text at once.

I think this'll bring significant performance gains: in a recent 90 sec script run, 28.59 seconds was spent on 17018 get_cell_text calls.
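For readers following along, here is a minimal sketch of the group_by idea, not the code that was actually pushed. The `Box` struct and `group_text_by_cell` are hypothetical stand-ins for Tabula's own cell and text-element classes. Each text element is assigned to the first cell containing its midpoint, so the text elements are walked once instead of once per cell. The inner `find` is still linear in the number of cells, which is why this remains a naive approach.

```ruby
# Hypothetical stand-in for Tabula's cell and text-element classes.
Box = Struct.new(:left, :top, :width, :height, :text) do
  def right
    left + width
  end

  def bottom
    top + height
  end

  # True when the midpoint of `other` falls inside this box.
  def contains_center?(other)
    cx = other.left + other.width / 2.0
    cy = other.top + other.height / 2.0
    cx >= left && cx < right && cy >= top && cy < bottom
  end
end

# One pass over the text elements: each element is grouped under the first
# cell that contains its midpoint. Elements outside every cell are dropped.
def group_text_by_cell(cells, text_elements)
  grouped = text_elements.group_by do |te|
    cells.find { |cell| cell.contains_center?(te) }
  end
  grouped.delete(nil)
  grouped
end

cells = [Box.new(0, 0, 50, 20), Box.new(50, 0, 50, 20)]
texts = [Box.new(5, 5, 5, 10, "a"), Box.new(12, 5, 5, 10, "b"),
         Box.new(60, 5, 5, 10, "c"), Box.new(200, 5, 5, 10, "x")]

by_cell = group_text_by_cell(cells, texts)
by_cell.each do |cell, tes|
  puts "#{cell.left},#{cell.top}: #{tes.map(&:text).join}"
end
```

The result is a hash keyed by cell, so filling in a spreadsheet becomes one lookup per cell rather than one full scan per cell.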

Great catch. Please do!

@jazzido, do you happen to know off the top of your head what sort of fancy-tree data structure might be good for this? Like quadtrees or R-trees or something?

A naive group_by is still inefficient, but significantly better than the O(n^2) it was before. Will push soonish.

This came up a few weeks ago in #45. Experimenting with a spatial index structure is worth a try. I'll start a branch right away and try to integrate JSI (the Java Spatial Index R-tree library).
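To illustrate what a spatial index buys here, below is a small sketch using a uniform grid. This is deliberately not an R-tree and not JSI's API; it is a simpler structure with the same goal, and the `GridIndex` class and its method names are made up for this example. Boxes are bucketed by the grid squares they overlap, so a cell lookup only inspects nearby text instead of every element on the page.

```ruby
# A uniform-grid spatial index (a simpler stand-in for an R-tree).
class GridIndex
  def initialize(square_size)
    @square_size = square_size
    @buckets = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register `item` under every grid square its bounding box overlaps.
  def insert(item, left, top, right, bottom)
    each_square(left, top, right, bottom) do |key|
      @buckets[key] << [item, left, top, right, bottom]
    end
  end

  # Return the items whose bounding boxes intersect the query rectangle.
  def query(left, top, right, bottom)
    seen = {}.compare_by_identity # an item may sit in several squares
    each_square(left, top, right, bottom) do |key|
      @buckets.fetch(key, []).each do |item, l, t, r, b|
        seen[item] = true if l < right && r > left && t < bottom && b > top
      end
    end
    seen.keys
  end

  private

  # Yield the [column, row] key of every grid square the rectangle touches.
  def each_square(left, top, right, bottom)
    ((left / @square_size).floor..(right / @square_size).floor).each do |gx|
      ((top / @square_size).floor..(bottom / @square_size).floor).each do |gy|
        yield [gx, gy]
      end
    end
  end
end

index = GridIndex.new(50)
index.insert("a", 5, 5, 10, 15)
index.insert("b", 60, 5, 65, 15)
index.insert("c", 300, 400, 305, 410)

hits = index.query(0, 0, 100, 20)
puts hits.inspect
```

A grid works well when elements are roughly the same size; an R-tree copes better with very uneven layouts, at the cost of a more involved implementation or an extra dependency.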

3af2a54 contains a first try at this.

  • The entire test suite runs ~10 seconds faster than master
  • The character merging stage breaks, still not sure why.

I may know why; I think I fixed it locally, will double check in a few
minutes, at gym now

Jeremy B. Merrill
Sent from my mobile device

It might have to do with the sort that needs to be applied after getting the TextElements from the R-Tree.
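A quick sketch of why that sort matters: a spatial index returns matches in whatever order its internal structure dictates, while merging characters into words needs them in reading order. The `Glyph` struct and `reading_order` helper below are hypothetical and only show the idea of re-sorting query results by position (top first, then left).

```ruby
# Hypothetical minimal text element: a character and its position.
Glyph = Struct.new(:text, :left, :top)

# Restore reading order: top-to-bottom, then left-to-right within a line.
def reading_order(glyphs)
  glyphs.sort_by { |g| [g.top, g.left] }
end

# Pretend this is the unordered result of a spatial index query.
unordered = [Glyph.new("c", 20, 10), Glyph.new("a", 0, 10),
             Glyph.new("d", 0, 30), Glyph.new("b", 10, 10)]

ordered_text = reading_order(unordered).map(&:text).join
puts ordered_text
```

Real PDFs would likely need some tolerance on `top`, since characters on the same line rarely share an identical baseline coordinate.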

I think so. It was not the problem that I was having.

I pushed in 25cf3e8 a change (with significant performance gains on shitty PDFs, i.e. ~99 sec -> ~75 sec) that does only a single pass over the text_elements on the page.

Issue "almost" solved in bb1dad6 (need to look into a missing space in a test case).

Test run time for the spatial_index branch (cmd line: ruby -X+C test/tests.rb):

Finished in 36.227000s, 0.7729 runs/s, 1.4354 assertions/s.

Time for master (same cmd line, includes your last fixes):

Finished in 35.629000s, 0.7859 runs/s, 1.4876 assertions/s.

Conclusion: master is marginally faster and has a shorter startup time (fewer deps). Let's just use yours :)

I wanna check my script (Derek tweeted the pdf Friday); it's on a pdf with
a giant number of cells. Spatial indexing may help significantly in that
situation. So don't delete the branch! :-)

Yes, that one. No script posted, but it's nothing special. Just like this
one, I think: https://gist.github.com/jeremybmerrill/8486499

Pages 1 thru 5 took about 75 seconds on my machine with current master.

On Jan 19, 2014 5:15 PM, "Manuel Aristarán" notifications@github.com
wrote:

This? http://www.sos.state.nm.us/uploads/files/Bernalillo2012Gen.pdf

Have you posted the script somewhere?



Script:

require_relative './lib/tabula'

pdf_file_path = "/Users/manuel/Downloads/Bernalillo2012Gen.pdf"
outfilename = "czechmaybe.csv"

# Extract pages 1 through 5 of the PDF.
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, [1, 2, 3, 4, 5])

# The block form of File.open closes the file even if extraction raises.
File.open(outfilename, 'w') do |out|
  extractor.extract.each do |pdf_page|
    # Write every spreadsheet detected on the page as CSV, separated by blank lines.
    pdf_page.spreadsheets.each do |spreadsheet|
      out << spreadsheet.to_csv
      out << "\n\n"
    end
  end
end

$ time ruby derek.rb
ruby derek.rb  71.56s user 0.81s system 136% cpu 53.003 total

Damn that rotated text :)

Ah yes, that was the one modification I made: gsubbing out all the letters.
Still not perfect; not sure why the rotated text shows up in the wrong
place. Maybe pdfbox?

But looks like some performance gains there, unless your machine is just
faster than mine.

Correct results, now that I fixed a faulty merge (I merged your changes from master and the spatial index was being populated but not used :))

✗ time ruby  derek.rb
ruby -X-C derek.rb  53.35s user 0.65s system 135% cpu 39.908 total

As measured by the JRuby profiler, I'm getting decently faster results with the spatial index -- 65 - 75 seconds with JSI and 75 to 80 seconds without it. For that whole 180ish page PDF, that's a difference that'll be measured in minutes. I think it's worth keeping JSI around.

I'm not seeing those differences when running the tests. Both with and without JSI I'm getting around 32 seconds for the tests. Even if JSI does make us marginally slower (1 second over 30-some tests), I think it's worth it for the significant gain on PDFs like Derek's.

Maybe we could avoid the cost for simple PDFs with yet another heuristic (we've already got plenty of those) -- but I'm not sure that's really needed.
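If such a heuristic were ever wanted, it could be as small as the sketch below. Everything in it is an assumption: the `use_spatial_index?` name, the idea of estimating work as cells times text elements, and especially the threshold value, which would have to come from profiling real PDFs rather than from this example.

```ruby
# Hypothetical cutoff; a real value would have to come from profiling.
SPATIAL_INDEX_THRESHOLD = 50_000

# Estimate the brute-force work as cells * text elements and only pay the
# cost of building an index when that estimate is large.
def use_spatial_index?(cell_count, text_element_count,
                       threshold = SPATIAL_INDEX_THRESHOLD)
  cell_count * text_element_count > threshold
end

puts use_spatial_index?(12, 300)    # small table
puts use_spatial_index?(400, 5_000) # page with a giant number of cells
```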

Sounds reasonable, merging spatial_index to master.