[RFC] A new journey
whg517 opened this issue
fix #226
Hi, scrapy-redis is one of the most commonly used companion tools for Scrapy, but it seems to me that this project has not been maintained for a long time, and parts of its status have not been kept up to date.
Given the recent updates to Python and Scrapy, I would like to make some feature contributions to the project. If you accept, I will arrange the follow-up work.
Tasks:
- Add a GitHub Actions workflow
- Add PEP 517 support (a build-system-independent format for source trees)
- Add support for Python 3.7 through 3.10, and drop Python 2 support
- #283
- Add pytest to test the project
- Use a modern Sphinx theme (furo), like the pip documentation
It would be super useful to also add the ability to feed more context to the spiders: not just a list of start_urls, but a list of JSON objects like so:
{
    "start_urls": [
        {
            "start_url": "https://example.com/",
            "sku": 1234
        }
    ]
}
This was already proposed a while back #156
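For illustration, each such task could be pushed to Redis as one JSON object per entry. A minimal sketch using redis-py, assuming scrapy-redis's default start-URLs key pattern '%(name)s:start_urls' and a local Redis; the field names here are examples, not a settled interface:

import json

import redis

r = redis.Redis()  # assumes Redis on localhost:6379

# One JSON object per queue entry; 'sku' is the extra context to carry along.
task = {'start_url': 'https://example.com/', 'sku': 1234}
r.lpush('myspider:start_urls', json.dumps(task))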
Hello @Sm4o, I wrote an example based on your description. Does this achieve your purpose?
import json

from scrapy import Request, Spider
from scrapy.http import Response

from scrapy_redis.spiders import RedisSpider


class SpiderError(Exception):
    """Raised when the spider cannot find a suitable parser."""


class BaseParser:
    name = None

    def __init__(self, spider: Spider):
        # Keep a reference to the spider, e.g. for logging via self.spider.logger.
        self.spider = spider

    def parse(
        self,
        *,
        response: Response,
        **kwargs,
    ) -> list[str]:
        raise NotImplementedError('`parse()` must be implemented.')


class HtmlParser(BaseParser):
    name = 'html'

    def parse(
        self,
        *,
        response: Response,
        rows_rule: str | None = '//tr',
        row_start: int | None = 0,
        row_end: int | None = -1,
        cells_rule: str | None = 'td',
        field_rule: str | None = 'text()',
    ) -> list[str]:
        """Extract a field from each cell of the selected rows (illustrative)."""
        items = []
        for row in response.xpath(rows_rule)[row_start:row_end]:
            items.extend(row.xpath(cells_rule).xpath(field_rule).getall())
        return items


def parser_factory(name: str, spider: Spider) -> BaseParser:
    if name == 'html':
        return HtmlParser(spider)
    else:
        raise SpiderError(f'Cannot find parser with name "{name}"')


class MySpider(RedisSpider):
    name = 'my_spider'

    def make_request_from_data(self, data):
        # Each Redis entry is a JSON document instead of a bare URL.
        text = data.decode(encoding=self.redis_encoding)
        params = json.loads(text)
        return Request(
            params.get('url'),
            dont_filter=True,
            meta={
                'parser_name': params.get('parser_name'),
                'parser_params': {
                    'rows_rule': params.get('rows_rule'),    # e.g. '//tbody/tr'
                    'row_start': params.get('row_start'),    # e.g. 1
                    'row_end': params.get('row_end'),        # e.g. -1
                    'cells_rule': params.get('cells_rule'),  # e.g. 'td'
                    'field_rule': params.get('field_rule'),  # e.g. 'text()'
                },
            },
        )

    def parse(self, response: Response, **kwargs):
        # Route the response to the parser named in the request metadata.
        name = response.meta.get('parser_name')
        params = response.meta.get('parser_params')
        parser = parser_factory(name, self)
        items = parser.parse(response=response, **params)
        for item in items:
            yield item
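As a usage note, make_request_from_data can be smoke-tested without running a full crawl. A hypothetical pytest-style snippet (it sets redis_encoding by hand, which setup_redis() would normally populate from the REDIS_ENCODING setting):

import json

def test_make_request_from_data():
    spider = MySpider()
    spider.redis_encoding = 'utf-8'  # normally set by setup_redis()
    data = json.dumps({
        'url': 'https://example.com/',
        'parser_name': 'html',
        'rows_rule': '//tbody/tr',
    }).encode('utf-8')
    request = spider.make_request_from_data(data)
    assert request.url == 'https://example.com/'
    assert request.meta['parser_name'] == 'html'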
Sounds perfect. Please take the lead!
@LuckyPigeon has been given permissions to the repo.
That's exactly what I needed. Thanks a lot!
I'm working on it...
I'm trying to reach 1500 requests/min, but a single spider might not be the best way to get there. I noticed that scrapy-redis reads URLs from Redis in batches equal to the CONCURRENT_REQUESTS setting, so if I set CONCURRENT_REQUESTS=1000, scrapy-redis waits until the whole batch is done before requesting another 1000 from Redis. I feel like I'm using this tool wrong, so any tips or suggestions would be greatly appreciated.
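If I'm reading the scrapy-redis source right, the next batch is only fetched when the spider goes idle, and the batch size falls back to CONCURRENT_REQUESTS when REDIS_START_URLS_BATCH_SIZE is unset. A settings.py sketch that decouples the two (values are illustrative):

# settings.py (sketch)
CONCURRENT_REQUESTS = 64

# scrapy-redis: fetch smaller batches from Redis more often, instead of
# letting the batch size default to CONCURRENT_REQUESTS.
REDIS_START_URLS_BATCH_SIZE = 16

Running several spider processes against the same Redis key is another way to raise total throughput.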
So far, I have:
- Added support for Python 3.7-3.9 and Scrapy 2.0-2.5, with all tests passing
- Added isort and flake8 to check the code
- Added PEP 517 support
- Added a GitHub Actions workflow
Now I'm having some problems with the documentation. I am Chinese and my written English is not strong, so I would like someone to take over the documentation work.
I think the current documentation is too simplistic; perhaps we need to rearrange its structure and content.
@whg517
Thanks for your contribution! Please file a PR for each feature, and I will review them.
Chinese documentation is also welcome; we can rearrange the structure and the content in the Chinese version first.
And I can do the translation work.
Hello everyone, I will reorganize the features later and try to create a PR for each new feature. As the new year begins I still have many plans, and I will arrange them as soon as possible.
@whg517 thanks for the initiative. Could you also include the pros and cons of moving the project to the scrapy-plugins org?
@whg517 any progress?