ruippeixotog / scala-scraper

A Scala library for scraping content from HTML pages

Make query results Iterable

maxpsq opened this issue

I've tried scala-scraper and noticed that when a query matches more than one element in the document, I get a String that is the concatenation of all matching elements.

Is there a way to get an iterable object out of the query results?

There's more to my point: there are some extractions I cannot express with the "jQuery-like" and CSS selector support offered by jsoup.

For example, in jsoup, picking a single occurrence out of all matches is only possible by accessing a specific element in the collection returned by the Java API (e.g. results.first()), whereas XPath lets you do it in the query definition itself (//span[18] -> takes the 18th match only).

This feature would be very useful when queries need to be defined in a configuration file.

If you are not planning to add support for XPath selectors, an alternative could be to enhance the extractor definitions so that they can access a specific match in a result list.

Hi @maxpsq, thanks for using this library.

If you're using the new 2.0-RC2 (2.0 will be released soon), just doing doc >> "span" will provide you an ElementQuery, which implements Iterable. If you're using 1.x, you can achieve the same with doc >> elements("span"). In both versions, if you prefer a List you can do doc >> elementList("span"). Take a look at the Content Extractors section of the README for more examples.
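
For reference, here is a minimal sketch of the elementList approach described above, assuming scala-scraper 2.x is on the classpath. The HTML snippet and the object name IterateMatches are made up for illustration and do not come from the issue.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object IterateMatches {
  // Illustrative HTML (an assumption, not taken from the issue)
  private val html =
    "<html><body><span>one</span><span>two</span><span>three</span></body></html>"

  private val browser = JsoupBrowser()
  private val doc = browser.parseString(html)

  // elementList yields one Element per match instead of a single concatenated String
  val spanTexts: List[String] = (doc >> elementList("span")).map(_.text)

  def main(args: Array[String]): Unit = spanTexts.foreach(println)
}
```

Each match stays a separate element, so the usual collection operations (map, filter, foreach) apply to the result.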

Indeed, scala-scraper does not support XPath. It would be a good feature for a future version, and your configuration file example is a good motivation for it. For now, you can create custom extractors from existing ones or manipulate the results between extractions. #48 has some examples of that, as does the Other DSL Features section of the README.
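
As a rough sketch of "manipulating the results between extractions", the snippet below picks the n-th match in document order from an elementList result. It assumes scala-scraper 2.x; nthText and the NthMatch object are hypothetical helpers written for this example, not part of the library.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Document

object NthMatch {
  // Hypothetical helper: text of the n-th match (1-based, as in XPath),
  // or None when the selector has fewer than n matches.
  def nthText(doc: Document, selector: String, n: Int): Option[String] =
    (doc >> elementList(selector)).lift(n - 1).map(_.text)

  // Illustrative HTML (an assumption, not taken from the issue)
  private val html =
    "<html><body><span>a</span><span>b</span><span>c</span></body></html>"

  val doc: Document = JsoupBrowser().parseString(html)

  def main(args: Array[String]): Unit =
    println(nthText(doc, "span", 2))
}
```

Because the selector and the index are plain values, both could be read from a configuration file, which is the use case raised above. Returning an Option avoids an exception when the index is out of range.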

Hi @ruippeixotog. Do you have an estimate for the release of 2.0?

Hi @lu4nm3, I'll release 2.0 very soon - between tomorrow and Sunday. I released 2.0-RC2 so that any obvious bugs could be reported before the final release - with RC2 available since May, I'm more confident in releasing this now.

@lu4nm3 I have just released v2.0.0, it should be available in Maven in a few minutes.