Spreadsheet extractor: extract cell text all at once

Question

jeremybmerrill opened this issue 11 years ago · comments

Jeremy B. Merrill commented 11 years ago

I found a slow!

In Spreadsheet#fill_in_cells we call get_cell_text on every single cell. Page#get_cell_text just wraps a call to Page#get_text -- which runs a select over all the text elements on the page.

Obviously this is O(n²) and it doesn't need to be. We (by which I mean, "I", but not right this minute) can write a method to use group_by to get all the cell text at once.

I think this'll bring significant performance gains: in a recent 90 sec script run, 28.59 seconds was spent on 17018 get_cell_text calls.
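A minimal sketch of the single-pass idea, not tabula's actual code: `Box`, `contains_point?`, and `text_by_cell` are hypothetical names, and assigning a text element to the cell that contains its center point is an assumption made for illustration. Each text element is visited once and filed under its cell with `group_by`, instead of re-scanning every text element for every cell. The per-element cell lookup is still a linear scan, so this is an improvement rather than an optimum.

```ruby
# Hypothetical stand-in for both cells and text elements (assumed shape).
Box = Struct.new(:left, :top, :width, :height, :text) do
  def right;  left + width;  end
  def bottom; top + height;  end

  # True when the point (x, y) falls inside this box.
  def contains_point?(x, y)
    x >= left && x < right && y >= top && y < bottom
  end
end

# Group text elements by the cell containing their center point.
# Elements that fall in no cell end up under the nil key.
def text_by_cell(cells, text_elements)
  text_elements.group_by do |te|
    cx = te.left + te.width / 2.0
    cy = te.top + te.height / 2.0
    cells.find { |cell| cell.contains_point?(cx, cy) }
  end
end

cells = [
  Box.new(0, 0, 50, 20, nil),
  Box.new(50, 0, 50, 20, nil)
]
text_elements = [
  Box.new(5, 5, 6, 10, "a"),
  Box.new(12, 5, 6, 10, "b"),
  Box.new(55, 5, 6, 10, "c")
]

grouped = text_by_cell(cells, text_elements)
cells.each do |cell|
  cell_text = (grouped[cell] || []).map(&:text).join
  puts cell_text
end
```
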

Manuel Aristarán · Answer 1 · Sun Jan 19 2014 13:25:25 GMT+0800 (China Standard Time)

Great catch. Please do!

Jeremy B. Merrill · Answer 2 · Mon Jan 20 2014 01:05:08 GMT+0800 (China Standard Time)

@jazzido, do you happen to know off the top of your head what sort of fancy-tree data structure might be good for this? Like quadtrees or R-trees or something?

A naive group_by is still inefficient, but significantly better than the O(n²) it was before. Will push soonish.

Manuel Aristarán · Answer 3 · Mon Jan 20 2014 02:14:39 GMT+0800 (China Standard Time)

This came up a few weeks ago in #45. Experimenting with a spatial index structure is worth a try. I'll start a branch right away and try using JSI.
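To show what a spatial index buys here, the following is a rough sketch in plain Ruby. It is not JSI and not an R-tree: it swaps in a much simpler uniform grid, and `Rect`, `GridIndex`, and the bucket size of 50 are all hypothetical. The point is only that a query for one cell looks at nearby buckets instead of every text element on the page.

```ruby
# Hypothetical rectangle type standing in for cells and text elements.
Rect = Struct.new(:left, :top, :width, :height, :text) do
  def right;  left + width;  end
  def bottom; top + height;  end

  def intersects?(other)
    left < other.right && other.left < right &&
      top < other.bottom && other.top < bottom
  end
end

# Uniform-grid spatial index: a simpler stand-in for an R-tree.
class GridIndex
  def initialize(bucket_size = 50)
    @bucket_size = bucket_size
    @buckets = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a rectangle in every grid bucket it overlaps.
  def add(rect)
    each_bucket_key(rect) { |key| @buckets[key] << rect }
    self
  end

  # Return the rectangles intersecting `query`, examining only the
  # buckets that `query` overlaps. Result order is not guaranteed.
  def intersecting(query)
    candidates = []
    each_bucket_key(query) do |key|
      candidates.concat(@buckets[key]) if @buckets.key?(key)
    end
    candidates.uniq(&:object_id).select { |rect| rect.intersects?(query) }
  end

  private

  def each_bucket_key(rect)
    x_range = (rect.left / @bucket_size).floor..(rect.right / @bucket_size).floor
    y_range = (rect.top / @bucket_size).floor..(rect.bottom / @bucket_size).floor
    x_range.each { |bx| y_range.each { |by| yield [bx, by] } }
  end
end

index = GridIndex.new
index.add(Rect.new(5, 5, 6, 10, "a"))
index.add(Rect.new(300, 400, 6, 10, "z"))

cell = Rect.new(0, 0, 100, 20, nil)
puts index.intersecting(cell).map(&:text).inspect
```

A real R-tree adapts to uneven layouts better than a fixed grid, which is presumably why a dedicated library was the first thing tried.
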

Manuel Aristarán · Answer 4 · Mon Jan 20 2014 03:40:17 GMT+0800 (China Standard Time)

3af2a54 contains a first try at this.

The entire test suite runs ~10 seconds faster than master
The character merging stage breaks, still not sure why.

Jeremy B. Merrill · Answer 5 · Mon Jan 20 2014 03:44:54 GMT+0800 (China Standard Time)

I may know why; I think I fixed it locally, will double check in a few
minutes, at gym now

Manuel Aristarán · Answer 6 · Mon Jan 20 2014 04:45:58 GMT+0800 (China Standard Time)

It might have to do with the sort that needs to be applied after getting the TextElements from the R-Tree.
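A small sketch of why the sort matters, under assumed names: a spatial index query generally makes no promise about result order, so characters may need to be put back into reading order before any merging step. `Glyph` and `reading_order` are hypothetical, and rounding `top` to group characters on the same line is a crude tolerance chosen for illustration (it can split a line whose tops straddle a .5 boundary).

```ruby
# Hypothetical minimal text element: position plus the character itself.
Glyph = Struct.new(:left, :top, :text)

# Restore reading order: top-to-bottom, then left-to-right within a line.
# Tops are rounded so glyphs with slight vertical jitter sort as one line.
def reading_order(glyphs)
  glyphs.sort_by { |g| [g.top.round, g.left] }
end

# Results as they might come back from an index query, in arbitrary order.
unordered = [
  Glyph.new(30.0, 10.2, "c"),
  Glyph.new(10.0, 10.0, "a"),
  Glyph.new(10.0, 25.0, "d"),
  Glyph.new(20.0, 9.9, "b")
]

puts reading_order(unordered).map(&:text).join
```
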

Jeremy B. Merrill · Answer 7 · Mon Jan 20 2014 05:01:59 GMT+0800 (China Standard Time)

I think so. It was not the problem that I was having.

I pushed in 25cf3e8 a change (with significant performance gains on shitty PDFs, i.e. ~99 sec -> ~75 sec) that does only a single pass over the text_elements on the page.

Manuel Aristarán · Answer 8 · Mon Jan 20 2014 05:46:25 GMT+0800 (China Standard Time)

Issue almost solved in bb1dad6 (I still need to look into a missing space in one test case).

Test run time for the spatial_index branch (cmd line: ruby -X+C test/tests.rb):

Finished in 36.227000s, 0.7729 runs/s, 1.4354 assertions/s.

Time for master (same cmd line, includes your last fixes):

Finished in 35.629000s, 0.7859 runs/s, 1.4876 assertions/s.

Conclusion: master is marginally faster and has a shorter startup time (fewer deps). Let's just use yours :)

Jeremy B. Merrill · Answer 9 · Mon Jan 20 2014 06:09:08 GMT+0800 (China Standard Time)

I wanna check my script (Derek tweeted the PDF Friday); it's on a PDF with a giant number of cells. Spatial indexing may help significantly in that situation. So don't delete the branch! :-)

Manuel Aristarán · Answer 10 · Mon Jan 20 2014 06:15:43 GMT+0800 (China Standard Time)

This? http://www.sos.state.nm.us/uploads/files/Bernalillo2012Gen.pdf

Have you posted the script somewhere?

Jeremy B. Merrill · Answer 11 · Mon Jan 20 2014 06:22:26 GMT+0800 (China Standard Time)

Yes, that one. No script posted, but it's nothing special. Just like this
one, I think: https://gist.github.com/jeremybmerrill/8486499

Pages 1 thru 5 took about 75 seconds on my machine with current master.

Manuel Aristarán · Answer 12 · Mon Jan 20 2014 06:32:14 GMT+0800 (China Standard Time)

Script:

require_relative './lib/tabula'

pdf_file_path = "/Users/manuel/Downloads/Bernalillo2012Gen.pdf"
outfilename = "czechmaybe.csv"

out = open(outfilename, 'w')

extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, [1,2,3,4,5] ) 
extractor.extract.each do |pdf_page|
  pdf_page.spreadsheets.each do |spreadsheet|
    out << spreadsheet.to_csv
    out << "\n\n"
  end
end
out.close

$ time ruby derek.rb
ruby derek.rb  71.56s user 0.81s system 136% cpu 53.003 total

Damn that rotated text :)

Jeremy B. Merrill · Answer 13 · Mon Jan 20 2014 06:34:34 GMT+0800 (China Standard Time)

Ah yes, that was the one modification I made: gsubbing out all the letters. Still not perfect; not sure why the rotated text shows up in the wrong place. Maybe pdfbox?

Jeremy B. Merrill · Answer 14 · Mon Jan 20 2014 06:35:14 GMT+0800 (China Standard Time)

But looks like some performance gains there, unless your machine is just
faster than mine.

Manuel Aristarán · Answer 15 · Mon Jan 20 2014 11:55:30 GMT+0800 (China Standard Time)

Correct results, now that I fixed a faulty merge (I merged your changes from master and the spatial index was being populated but not used :))

✗ time ruby  derek.rb
ruby -X-C derek.rb  53.35s user 0.65s system 135% cpu 39.908 total

Jeremy B. Merrill · Answer 16 · Tue Jan 21 2014 00:46:21 GMT+0800 (China Standard Time)

As measured by the JRuby profiler, I'm getting decently faster results with the spatial index -- 65 to 75 seconds with JSI and 75 to 80 seconds without it. For that whole 180ish page PDF, that's a difference that'll be measured in minutes. I think it's worth keeping JSI around.

I'm not seeing that difference when running the tests. Both with and without JSI I'm getting around 32 seconds for the tests. Even if JSI does make us marginally slower (1 second over 30-some tests), I think it's worth it for the significant gain on PDFs like Derek's.

Maybe we could avoid the cost for simple PDFs with yet another heuristic (we've already got plenty of those) -- but I'm not sure that's really needed.
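One possible shape for such a heuristic, purely as a sketch: `use_spatial_index?` and `INDEX_THRESHOLD` are hypothetical, and the cutoff value is illustrative rather than measured. The idea is to estimate how many cell-versus-text comparisons a plain scan would perform and only pay the index setup cost when that estimate is large.

```ruby
# Hypothetical cutoff: the thread gives no measured value for when a spatial
# index starts to pay off, so this number is purely illustrative.
INDEX_THRESHOLD = 1_000_000

# Decide whether building a spatial index is likely worth its setup cost,
# based on the number of comparisons a plain scan would perform.
def use_spatial_index?(cell_count, text_element_count)
  cell_count * text_element_count > INDEX_THRESHOLD
end

puts use_spatial_index?(20, 400)        # small, simple table
puts use_spatial_index?(2_000, 17_000)  # page with a giant number of cells
```
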

Manuel Aristarán · Answer 17 · Tue Jan 21 2014 00:48:59 GMT+0800 (China Standard Time)

Sounds reasonable, merging spatial_index to master.