Make query results Iterable
maxpsq opened this issue
I tried scala-scraper and noticed that when a query matches more than one element in the document, I get a String that is the concatenation of all matching elements.
Is there a way to get an iterable object out of the query results?
There's more related to my point: there are some extractions I cannot express with the "jQuery-like" and CSS selector support offered by jsoup.
For example, in jsoup the only way to get a single occurrence out of all matches is to access a specific element of the collection returned by the Java API (e.g. results.first()), whereas XPath lets you do it in the query definition itself (e.g. (//span)[18] takes only the 18th match).
This feature would be very useful when queries need to be defined in a configuration file.
If you are not planning to add support for XPath selectors, an alternative would be to enhance the extractor definitions so that they can access a specific match in a result list.
Hi @maxpsq, thanks for using this library.
If you're using the new 2.0-RC2 (2.0 will be released soon), just doing doc >> "span" will provide you an ElementQuery, which implements Iterable. If you're using 1.x, you can achieve the same with doc >> elements("span"). In both versions, if you prefer a List you can do doc >> elementList("span"). Take a look at the Content Extractors section of the README for more examples.
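As a rough illustration of the extractors mentioned above, here is a minimal sketch assuming scala-scraper 2.x is on the classpath. The HTML snippet and the object name IterableResults are made up for the example, and the import paths are the ones I recall from the README, so check them against the version you use.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object IterableResults {
  // Hypothetical document, used only for illustration.
  private val html = "<div><span>a</span><span>b</span><span>c</span></div>"

  private val browser = JsoupBrowser()
  private val doc = browser.parseString(html)

  // One element per match, as a List, instead of a single concatenated String.
  val spanList = doc >> elementList("span")

  // The text of each matching element, collected into a List[String].
  val spanTexts: List[String] = (doc >> texts("span")).toList

  def main(args: Array[String]): Unit =
    spanList.foreach(span => println(span.text))
}
```

Both results are ordinary Scala collections, so they can be mapped, filtered or indexed like any other.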
Indeed, scala-scraper does not support XPath, and it would be a good feature for a future version - your configuration example is a good motivation for that. For now, you can create custom extractors from existing ones or manipulate the results between extractions. #48 has some examples of that, as does the Other DSL Features section of the README.
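To make "manipulating the results between extractions" concrete, here is a minimal sketch of picking a specific match by position, again assuming scala-scraper 2.x. IndexedQuery and nth are hypothetical names introduced for this example, not part of the library; they stand in for a selector plus index pair that could be read from a configuration file.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object NthMatch {
  // Hypothetical config entry: a CSS selector plus a zero-based match index.
  final case class IndexedQuery(selector: String, index: Int)

  // Hypothetical document, used only for illustration.
  private val html = "<div><span>a</span><span>b</span><span>c</span></div>"

  private val browser = JsoupBrowser()
  private val doc = browser.parseString(html)

  // Extract all matches as a List, then pick one by position.
  // `lift` returns None instead of throwing when the index is out of range.
  def nth(query: IndexedQuery): Option[String] =
    (doc >> elementList(query.selector)).lift(query.index).map(_.text)

  def main(args: Array[String]): Unit =
    println(nth(IndexedQuery("span", 2)))
}
```

Note that the index here is zero-based, while XPath positions are one-based, so the 18th match corresponds to index 17.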
Hi @ruippeixotog. Do you have an estimate for the release of 2.0?
Hi @lu4nm3, I'll release 2.0 very soon - between tomorrow and Sunday. I released 2.0-RC2 so that any obvious bugs could be reported before the final release - with RC2 available since May, I'm more confident in releasing this now.
@lu4nm3 I have just released v2.0.0, it should be available in Maven in a few minutes.