rmax / scrapy-redis

Redis-based components for Scrapy.

Home Page: http://scrapy-redis.readthedocs.io

[RFC] A new journey

whg517 opened this issue

fix #226

Hi, scrapy-redis is one of the most commonly used tools for working with Scrapy, but it seems to me that this project has not been maintained for a long time, and some of its state is no longer kept up to date.

Given the recent updates to Python and Scrapy, I would like to make some feature contributions to the project. If that is acceptable, I will arrange the follow-up work.

Tasks:

It would also be super useful to add the ability to feed more context to the spiders: not just a list of start_urls, but a list of JSON objects like so:

{
    "start_urls": [
        {
            "start_url": "https://example.com/",
            "sku": 1234
        }
    ]
}

This was already proposed a while back in #156.

Hello @Sm4o, I wrote an example based on your description. Does this achieve what you need?

from __future__ import annotations  # allow "str | None" and "list[str]" annotations on Python < 3.10

import json

from scrapy import Request, Spider
from scrapy.http import Response

from scrapy_redis.spiders import RedisSpider


class SpiderError(Exception):
    """Raised when the spider cannot handle the given input."""


class BaseParser:
    name = None

    def __init__(self, spider: Spider):
        # use log: self.spider.logger
        self.spider = spider

    def parse(
        self,
        *,
        response: Response,
        **kwargs
    ) -> list[str]:
        raise NotImplementedError('`parse()` must be implemented.')


class HtmlParser(BaseParser):
    name = 'html'

    def parse(
        self,
        *,
        response: Response,
        rows_rule: str | None = '//tr',
        row_start: int | None = 0,
        row_end: int | None = -1,
        cells_rule: str | None = 'td',
        field_rule: str | None = 'text()',
    ) -> list[str]:
        """"""
        raise NotImplementedError('`parse()` must be implemented.')


def parser_factory(name: str, spider: Spider) -> BaseParser:
    if name == 'html':
        return HtmlParser(spider)
    else:
        raise SpiderError(f'Can not find parser name of "{name}"')


class MySpider(RedisSpider):
    name = 'my_spider'

    def make_request_from_data(self, data):
        text = data.decode(encoding=self.redis_encoding)
        params = json.loads(text)
        return Request(
            params.get('url'),
            dont_filter=True,
            meta={
                'parser_name': params.get('parser_name'),
                'parser_params': {
                    'rows_rule': params.get('rows_rule'),  # rows_xpath = '//tbody/tr'
                    'row_start': params.get('row_start'),  # row_start = 1
                    'row_end': params.get('row_end'),  # row_end = -1
                    'cells_rule': params.get('cells_rule'),  # cells_rule = 'td'
                    'field_rule': params.get('field_rule'),  # field_rule = 'text()'
                }
            }
        )

    def parse(self, response: Response, **kwargs):
        name = response.meta.get('parser_name')
        params = response.meta.get('parser_params')
        parser = parser_factory(name, self)
        items = parser.parse(response=response, **params)
        for item in items:
            yield item
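
For completeness, a spider like this is fed by pushing JSON payloads onto its Redis key (with the default settings that key is '<spider name>:start_urls', i.e. 'my_spider:start_urls' here). A minimal sketch using redis-py, with hypothetical payload values matching what make_request_from_data() above reads:

import json

import redis

# Assumes a Redis instance on localhost and the default start-URLs key format.
client = redis.Redis(host='localhost', port=6379)

payload = {
    'url': 'https://example.com/table-page',  # hypothetical target page
    'parser_name': 'html',
    'rows_rule': '//tbody/tr',
    'row_start': 1,
    'row_end': -1,
    'cells_rule': 'td',
    'field_rule': 'text()',
}

# RedisSpider pops entries from this list and hands each one to make_request_from_data().
client.lpush('my_spider:start_urls', json.dumps(payload))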

@rmax
Looks good to me. What do you think?
@Sm4o
@rmax has been a little busy recently, so if you don't mind, feel free to work on it!

Sounds perfect. Please take the lead!

@LuckyPigeon has been given permissions to the repo.

That's exactly what I needed. Thanks a lot!

I am working on it; it's in progress...

I'm trying to reach 1500 requests/min, but it seems like a single spider might not be the best approach. I noticed that scrapy-redis reads URLs from Redis in batches equal to the CONCURRENT_REQUESTS setting, so if I set CONCURRENT_REQUESTS=1000, scrapy-redis waits until all of those requests are done before fetching another batch of 1000 from Redis. I feel like I'm using this tool wrong, so any tips or suggestions would be greatly appreciated.
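
For reference, the fetch batch size can be decoupled from concurrency with the REDIS_START_URLS_BATCH_SIZE setting, which falls back to CONCURRENT_REQUESTS when unset. A minimal settings sketch, assuming your installed scrapy-redis version supports that setting:

# settings.py (sketch; REDIS_START_URLS_BATCH_SIZE assumed available in your scrapy-redis version)
CONCURRENT_REQUESTS = 64            # how many requests Scrapy keeps in flight at once
REDIS_START_URLS_BATCH_SIZE = 16    # how many entries are popped from Redis per fetch,
                                    # instead of the default of CONCURRENT_REQUESTS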

@whg517
Please go ahead!
@Sm4o
What feature are you working on?

So far, I have done:

  • Support Python 3.7-3.9 and Scrapy 2.0-2.5, with all tests passing.
  • Add isort and flake8 checks for the code.
  • Add PEP 517 support.
  • Add a GitHub Actions workflow.

Now I'm having some problems with the documentation. I am Chinese and my written English is not very strong, so I would like someone to take over the documentation work.

I think the current documentation is too simplistic. Perhaps we need to rearrange its structure and content.

@whg517
Thanks for your contribution! Please file a PR for each feature, and I will review them.
Chinese documentation is also welcome; we can rearrange the structure and the content in the Chinese version first.
And I can do the translation work.

Hello everyone, I will reorganize the features later and try to create a PR for each new feature. As the New Year begins I still have many plans, and I will arrange them as soon as possible.

@whg517 thanks for the initiative. Could you also include the pros and cons of moving the project to the scrapy-plugins org?

@whg517 any progress?