ruippeixotog / scala-scraper

A Scala library for scraping content from HTML pages

Table data extractor

samikrc opened this issue

Hi,
Do you have code for extracting data from tables, like you have for extracting data from forms? Because of the rowspan and colspan attributes, it is difficult to parse a table from the raw HTML. Is there an easy way to do this from the browser's in-memory rendering?
Regards.

Hi! Currently there is no specialized extractor for HTML tables. It would be a nice addition to have one, but it does come with its challenges. For example, what data structure did you have in mind? As you mentioned, the ability of cells to cover an arbitrary number of rows and columns can make the organization rather messy...
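
One possible answer to the data-structure question, as a minimal sketch and not anything taken from scala-scraper itself: copy each cell into every grid slot it covers, which gives a dense matrix. `Cell` and `TableGrid.expand` are hypothetical names used only for illustration. The sketch assumes the rowspan and colspan attributes have already been parsed into integers, and it does not handle the special `rowspan="0"` value.

```scala
import scala.collection.mutable

// Hypothetical model of a parsed <td>/<th>: its text plus span attributes.
final case class Cell(text: String, rowspan: Int = 1, colspan: Int = 1)

object TableGrid {
  /** Expands rows of spanning cells into a dense grid.
    * Every slot covered by a cell repeats that cell's text;
    * slots no cell covers are None.
    */
  def expand(rows: Seq[Seq[Cell]]): Vector[Vector[Option[String]]] = {
    // (row, column) -> text for every slot claimed so far.
    val filled = mutable.Map.empty[(Int, Int), String]

    for ((row, r) <- rows.zipWithIndex) {
      var c = 0
      for (cell <- row) {
        // Skip slots already claimed by a rowspan from an earlier row.
        while (filled.contains((r, c))) c += 1

        // Treat missing or invalid spans as 1 (an assumption of this sketch).
        val rs = math.max(1, cell.rowspan)
        val cs = math.max(1, cell.colspan)
        for (dr <- 0 until rs; dc <- 0 until cs)
          filled((r + dr, c + dc)) = cell.text

        c += cs
      }
    }

    if (filled.isEmpty) Vector.empty
    else {
      val height = filled.keys.map(_._1).max + 1
      val width = filled.keys.map(_._2).max + 1
      Vector.tabulate(height, width)((r, c) => filled.get((r, c)))
    }
  }
}
```

For example, a first row of `Cell("A", rowspan = 2), Cell("B", colspan = 2)` followed by a second row of `Cell("C"), Cell("D")` expands to a 2x3 grid: `A, B, B` over `A, C, D`. Repeating the text in every covered slot keeps lookups by row and column simple, at the cost of losing the information about which slots belonged to one merged cell.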

I have something in mind - let me see if I can open a pull request in a few weeks. If I do end up coding something, where do you think it would show up in the codebase? In scalascraper/scraper/HtmlExtractor.scala?

Yeah, I would expect it to be a new extractor in ContentExtractors. Looking forward to your pull request then!

Hi, any update on this?

Hello samikrc,
Thanks a lot for the quick response. Yes, I would be happy to check it; please do publish.

It would be awesome if you shared your approach here. Even if there were no interest right now (and it seems there is), it would be a good resource for anyone dealing with this issue :) If the algorithm and data structure you chose for parsing the table are general enough, I can surely help you implement them in scala-scraper.

Guys,

Sorry for the delay. Attaching two files, one containing the source code and the other containing some test code. Note that the test code is not automated - just some prints for manually checking if things look OK.

@ruippeixotog Saw your other email about the exciting features in the next version, including the "Content Extractors". This is probably too late to get included in that, but that is probably where this stuff can be integrated.

Ready to answer questions :-)

Thanks.
-Samik

TableExtractor.scala.txt
TableExtractorTester.scala.txt

Also note that some of the methods are just stubs, but are easy to implement. Important methods are already implemented.

Hi, any update on this? Did the code get used somewhere?

Hi @samikrc, I ended up not using it anywhere for now, mostly due to my lack of time lately. I took a look at your code earlier and it seemed like a good implementation; it just needs to be converted to a more idiomatic extractor, like the regex extractors. I'll try to work on it in the next two weeks :)

I have just added a new table content extractor to scala-scraper (e7d3fe6). I ended up writing the extractor from scratch, as it seemed easier for me to integrate it with the style of the other extractors this way.

Closing this now. If you find any bugs in the implementation, feel free to open another issue!