elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

Home Page:https://hexdocs.pm/crawly

Custom parser callback sample

ziyouchutuwenwu opened this issue · comments

Hi, is there any sample that shows how to use a custom parser callback instead of the default parse_item?
I read the docs here, but I don't know how to use it.

Thanks for your help.

@Ziinc probably can give more info here.

But could you please describe the use case? Why can't you use parse_item?

Here is my usage scenario:

For the site demo.com, I need to get some info, such as title and category, from the main page, and collect the sub-page URLs from some of its links.
Once I have a sub-page URL, I send a request and parse the response. There I need detail info such as author and price.

The parser for the sub-pages has to be different from the one for the main page, and I don't know how to do that with Crawly.

Many thanks.

In Python, my demo code looks something like this

So... do you have different items on different pages? Or the same data, just structured differently?

Yes, basically I have different data structures on different pages, but from the sample code I can't work out how to write this.
I would appreciate some examples that could help me.

Sorry, I still don't understand which of these two it is:

  1. Same item which can be extracted with other selectors
  2. Two different items

Sorry @ziyouchutuwenwu, I only just saw this; I must have missed the ping.

Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, with the result of each parser being passed to the next. The third positional argument, opts, lets you provide spider-specific configuration to your parser.

For example, on site 1, you want to extract all links with an h1 tag, filter them with a site-specific filter function, and build requests from the links that remain:

# spider 1
parsers: [
  {MyCustomRequestParser, [selector: ".h1", filter: &my_filter_function/1]}
]

Then, in spider 2, which crawls site 2, we only want h2 tags and no filtering:

# spider 2
parsers: [
  {MyCustomRequestParser, [selector: ".h2"]}
]

Then your MyCustomRequestParser.run/3 contains the logic required to select the links and build the requests.
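
To make that more concrete, here is a rough sketch of what such a run/3 might look like. It is not taken from the Crawly source, and several things in it are assumptions: that the state map carries the fetched response under :response, that the response has body and request_url fields, that the value passed between parsers is a map with a :requests list, and that plain %{url: ...} maps are good enough as stand-ins for real Crawly requests. The sample HTML and the "/books/" filter are made up for the demo. Floki is used for the HTML parsing and is pulled in with Mix.install only so the file can run on its own.

```elixir
# Standalone sketch: Mix.install/1 fetches Floki so this runs as `elixir sketch.exs`.
# Inside a Crawly project, Floki would be listed in mix.exs instead.
Mix.install([{:floki, "~> 0.36"}])

defmodule MyCustomRequestParser do
  # In a Crawly project this module would also declare `@behaviour Crawly.Pipeline`.
  #
  # Assumed shapes (not confirmed in this thread):
  #   parsed - map that accumulates :requests as it moves from parser to parser
  #   state  - map holding the fetched response under :response
  #   opts   - the keyword list written next to the module name in `parsers:`
  def run(parsed, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a")
    # When no :filter is configured, every extracted URL is kept.
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    requests =
      response.body
      |> Floki.parse_document!()
      |> Floki.find(selector)
      |> Floki.attribute("href")
      # Resolve relative hrefs against the URL of the page they came from.
      |> Enum.map(fn href -> response.request_url |> URI.merge(href) |> URI.to_string() end)
      |> Enum.filter(filter)
      |> Enum.uniq()
      # Plain maps stand in for Crawly request structs in this sketch.
      |> Enum.map(fn url -> %{url: url} end)

    # Append to whatever earlier parsers already collected, then pass both along.
    {Map.update(parsed, :requests, requests, &(&1 ++ requests)), state}
  end
end

# Hypothetical page and filter, only to exercise the sketch.
html = """
<html><body>
  <a class="h1" href="/books/1">Book</a>
  <a class="h1" href="/ads/2">Ad</a>
  <a href="/about">About</a>
</body></html>
"""

state = %{response: %{body: html, request_url: "https://demo.com/"}}

{parsed, _state} =
  MyCustomRequestParser.run(%{}, state,
    selector: ".h1",
    filter: &String.contains?(&1, "/books/")
  )

IO.inspect(parsed.requests, label: "requests")
```

The same module then serves both spiders: spider 1 passes a selector and a filter, spider 2 passes only a selector and falls back to the keep-everything default.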