Custom parser callback sample

Question


ziyouchutuwenwu opened this issue 3 years ago

Hi, is there any sample that shows how to use a custom parser callback instead of the default parse_item?
I read the doc from here, but I don't know how to use it.

thanks for your help

oltarasenko · Answer 1 · Fri Jul 30 2021 17:45:22 GMT+0800 (China Standard Time)

@Ziinc probably can give more info here.

But could you please describe the use case? Why can't you use parse_item?

ziyouchutuwenwu · Answer 2 · Fri Jul 30 2021 18:26:46 GMT+0800 (China Standard Time)

Here is my usage scenario:

For the site demo.com, I need to get some info, such as the title and category, from the main page,
and collect the sub-page URLs from some of its links.
Once I have a sub-page URL, I send a request and parse the response; there I need detail info such as the author, price, etc.

The parser for the sub-pages has to be different from the one for the main page, and I don't know how to do that with Crawly.

Many thanks.

ziyouchutuwenwu · Answer 3 · Fri Jul 30 2021 18:31:18 GMT+0800 (China Standard Time)

For the Python part, my demo code looks like this:

oltarasenko · Answer 4 · Sat Jul 31 2021 00:15:45 GMT+0800 (China Standard Time)

So... do you have different items on different pages? Or the same data, just structured differently?

ziyouchutuwenwu · Answer 5 · Sat Jul 31 2021 08:23:52 GMT+0800 (China Standard Time)

Yes, basically I have different data structures on different pages, but going by the sample code, I don't know how to write this.
I would appreciate any examples that could help me.

oltarasenko · Answer 6 · Sun Aug 01 2021 23:32:09 GMT+0800 (China Standard Time)

Sorry, I still don't understand which of these two it is:

Same item which can be extracted with other selectors
Two different items

Ziinc · Answer 7 · Wed Sep 08 2021 01:50:11 GMT+0800 (China Standard Time)

sorry @ziyouchutuwenwu I only just saw this, must have missed the ping.

Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, and the result of each parser is passed to the next one. The third positional argument, opts, lets you provide spider-specific configuration to your parser.

For example, on site 1, you want to extract all links with an h1 tag, filter them with some site-specific filter function, and build requests from the links that remain:

# spider 1
parsers: [
  {MyCustomRequestParser, [selector: ".h1", filter: &my_filter_function/1]}
]

Then, in spider 2 that is crawling site 2, we only want h2 tags, but without using any filtering:

# spider 2
parsers: [
  {MyCustomRequestParser, [selector: ".h2"]}
]

Then your MyCustomRequestParser.run/3 contains the logic required to select the links and build the requests.
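
To make the shape of such a parser concrete, here is a rough, dependency-free sketch of what MyCustomRequestParser.run/3 could look like. Everything beyond the module name and the selector/filter options is an assumption for illustration: that the state carries the fetched response under :response with a body field, that the parser returns {parsed, state} in the usual pipeline style, that requests are plain %{url: ...} maps, and the regex-based extract_links/2 helper, which is only a naive stand-in for a real HTML parser such as Floki. In a real spider you would build proper Crawly request structs instead.

```elixir
defmodule MyCustomRequestParser do
  # Hypothetical sketch; it does not depend on Crawly so it can be run as-is.

  # parsed: the result accumulated by the parsers that ran earlier
  # state:  ASSUMED to hold the fetched response under :response
  # opts:   the spider-specific keyword list from the `parsers:` setting
  def run(parsed, state, opts) do
    selector = Keyword.fetch!(opts, :selector)
    # When no :filter option is given, keep every extracted link
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    requests =
      state.response.body
      |> extract_links(selector)
      |> Enum.filter(filter)
      # ASSUMPTION: plain maps stand in for real request structs
      |> Enum.map(fn url -> %{url: url} end)

    existing = Map.get(parsed, :requests, [])
    # Hand the enriched result on to the next parser in the list
    {Map.put(parsed, :requests, existing ++ requests), state}
  end

  # Naive stand-in for an HTML parser: only understands ".class" selectors
  # on <a> tags whose class attribute comes before href.
  defp extract_links(body, "." <> class) do
    ~r/<a[^>]*class="#{Regex.escape(class)}"[^>]*href="([^"]+)"/
    |> Regex.scan(body, capture: :all_but_first)
    |> List.flatten()
  end
end

# Usage: same parser, different opts per spider
body = ~s(<a class="h1" href="/a">A</a><a class="h2" href="/b">B</a>)
state = %{response: %{body: body}}

{spider1, _} = MyCustomRequestParser.run(%{}, state, selector: ".h1", filter: &(&1 != "/skip"))
{spider2, _} = MyCustomRequestParser.run(%{}, state, selector: ".h2")
IO.inspect({spider1.requests, spider2.requests})
```

The point of the design is that the selection and request-building logic lives in one module, while each spider only varies the options it passes in.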