Table data extractor

samikrc opened this issue 8 years ago

Hi,
Do you have code for extracting data from tables, like you have for extracting data from forms? Because of the rowspan and colspan attributes, it is difficult to parse a table from the raw HTML. Is there an easy way to do this from the browser's in-memory rendering?
Regards.

Rui Gonçalves · Answer 1 · Sun Sep 18 2016 07:56:24 GMT+0800 (China Standard Time)

Hi! Currently there is no specialized extractor for HTML tables. It would be a nice addition to have one, but it does come with its challenges. For example, what data structure did you have in mind? As you mentioned, cells can cover an arbitrary number of rows and columns, which can make the organization rather messy...
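
To make the data-structure question concrete, here is a minimal sketch of one possible answer: expand every spanning cell into a plain rectangular grid, repeating its text in each position it covers. This is an illustration only. It is not the code attached later in this thread and not the extractor that was eventually added to scala-scraper. `Cell`, `TableGrid` and `expand` are hypothetical names, the input is assumed to be already parsed into rows of cells with spans of at least 1, and overlapping spans are not handled.

```scala
import scala.collection.mutable

// Hypothetical cell model for illustration; not part of scala-scraper.
final case class Cell(text: String, rowspan: Int = 1, colspan: Int = 1)

object TableGrid {
  /** Expands rows of spanning cells into a rectangular grid.
    * Each position covered by a cell holds that cell's text;
    * positions that no cell covers hold the empty string.
    */
  def expand(rows: Seq[Seq[Cell]]): Vector[Vector[String]] = {
    // (row, column) -> text of the cell covering that position
    val filled = mutable.Map.empty[(Int, Int), String]

    for ((row, r) <- rows.zipWithIndex) {
      var c = 0
      for (cell <- row) {
        // Skip columns already claimed by a rowspan from an earlier row.
        while (filled.contains((r, c))) c += 1
        // Claim every position this cell spans.
        for (dr <- 0 until cell.rowspan; dc <- 0 until cell.colspan)
          filled((r + dr, c + dc)) = cell.text
        c += cell.colspan
      }
    }

    if (filled.isEmpty) Vector.empty
    else {
      val height = filled.keys.map(_._1).max + 1
      val width = filled.keys.map(_._2).max + 1
      Vector.tabulate(height, width)((r, c) => filled.getOrElse((r, c), ""))
    }
  }
}
```

For example, a header row `Seq(Cell("Name", rowspan = 2), Cell("Score", colspan = 2))` followed by `Seq(Cell("Math"), Cell("Art"))` would expand to the two rows `Name, Score, Score` and `Name, Math, Art`. Repeating the text keeps lookups by row and column trivial, at the cost of losing the information that several positions belong to one cell; a variant could store a reference to the original cell instead.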

Samik R · Answer 2 · Sun Sep 18 2016 14:32:28 GMT+0800 (China Standard Time)

I have something in mind - let me see if I can open a pull request in a few weeks. If I do end up coding something, where do you think it would show up in the codebase? In scalascraper/scraper/HtmlExtractor.scala?

Rui Gonçalves · Answer 3 · Mon Sep 19 2016 05:21:56 GMT+0800 (China Standard Time)

Yeah, I would expect it to be a new extractor in ContentExtractors. Looking forward to your pull request then!

Trinadh Gupta · Answer 4 · Mon May 15 2017 21:01:41 GMT+0800 (China Standard Time)

Hi, any update on this?

Samik R · Answer 5 · Mon May 15 2017 21:16:19 GMT+0800 (China Standard Time)

I worked on this a bit, but was unable to directly extend the library to include this feature (mostly because of my level of Scala skills). What I have is a standalone piece of code which does this. I can publish that code, and you can either use it as is, or try integrating this code in the library. Let me know if there is any interest. Thanks.

Trinadh Gupta · Answer 6 · Mon May 15 2017 21:35:32 GMT+0800 (China Standard Time)

Hello samikrc,
Thanks a lot for the quick response, and yes, I would be happy to check it. Please do publish.

Rui Gonçalves · Answer 7 · Tue May 16 2017 06:28:54 GMT+0800 (China Standard Time)

It would be awesome if you shared your approach here. Even if there were no interest now (which it seems there is), it would be a good resource for anyone dealing with this issue :) If the algorithm and data structure you chose to parse the table are general enough, I can surely help you implement it in scala-scraper.

Samik R · Answer 8 · Fri May 19 2017 13:52:18 GMT+0800 (China Standard Time)

Guys,

Sorry for the delay. Attaching two files, one containing the source code and the other containing some test code. Note that the test code is not automated - just some prints for manually checking if things look OK.

@ruippeixotog Saw your other email about the exciting features in the next version, including the "Content Extractors". This is probably too late to get included in that, but that is probably where this stuff can be integrated.

Ready to answer questions :-)

Thanks.
-Samik

TableExtractor.scala.txt
TableExtractorTester.scala.txt

Samik R · Answer 9 · Fri May 19 2017 13:56:46 GMT+0800 (China Standard Time)

Also note that some of the methods are just stubs, but are easy to implement. Important methods are already implemented.

Samik R · Answer 10 · Sat Aug 26 2017 14:21:35 GMT+0800 (China Standard Time)

Hi, any update on this? Did the code get used somewhere?

Rui Gonçalves · Answer 11 · Sun Aug 27 2017 02:24:11 GMT+0800 (China Standard Time)

Hi @samikrc, I ended up not using it anywhere for now, mostly due to my lack of time lately. I took a look at your code before and it seemed like a good implementation; it just needs to be converted to a more idiomatic extractor, like the regex extractors. I'll try to work on it in the next two weeks :)

Rui Gonçalves · Answer 12 · Mon Sep 18 2017 07:43:53 GMT+0800 (China Standard Time)

I have just added a new table content extractor to scala-scraper (e7d3fe6). I ended up writing the extractor from scratch, as it seemed easier for me to integrate it with the style of the other extractors this way.

Closing this now. If you find any bug with the implementation feel free to open another issue!