binux / pyspider

A Powerful Spider(Web Crawler) System in Python.

Home Page: http://docs.pyspider.org/

「Not Bug」Another problem with "git clone"

TheMasterOfMagic opened this issue

Last time I had trouble getting the fetcher to run git clone, and the reply in #931 solved my problem. Thanks again! I appreciate it very much.

Recently I ran into another problem: the "git clone" step does not run asynchronously. Cloning a repo sometimes takes a long time, which lets the queue from the scheduler to the fetchers grow past 100 items, and then the fetcher crashes.

I tried to fix this myself, so I carefully read the source of tornado_fetcher.py, especially the http_fetch method, which runs asynchronously without any trouble. It turns out that this method relies on the tornado library internally, and I have no idea how to make that library work for my git clone step. Is this the problem, or am I misunderstanding something?

I considered adding more fetchers, but that does not solve the problem. It only hides it, which does not meet my needs.

Is there any other information I should add here?

Could you tell me where the problem is?

Thanks a lot!

ps:

  • Following #931, I added my git protocol to the async_fetch method and wrote a git_fetch method modeled on the original http_fetch method. However, I did not understand this step from the reply:

    "After that, you can specify fetcher option --fetcher-cls to your implementation when start."

    So I made all of these changes directly in the original tornado_fetcher.py to get it working. Does this make any difference?
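
For the blocking git clone described above, one possible direction is to move the clone onto a worker thread so that the I/O loop is not held up while the repo downloads. The following is only a minimal sketch using the Python standard library; it is not pyspider's own code. The names `_clone_pool`, `run_blocking`, `submit_command` and `git_clone_async`, the pool size and the timeout are all hypothetical, and it assumes Python 3.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pool; max_workers caps how many commands run at once.
_clone_pool = ThreadPoolExecutor(max_workers=4)


def run_blocking(cmd, timeout=600):
    """Run cmd to completion; return (returncode, stdout, stderr) as text."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return (
        proc.returncode,
        proc.stdout.decode(errors="replace"),
        proc.stderr.decode(errors="replace"),
    )


def submit_command(cmd, timeout=600):
    """Schedule cmd on the pool and return a concurrent.futures.Future at once."""
    return _clone_pool.submit(run_blocking, cmd, timeout)


def git_clone_async(repo_url, dest_dir, timeout=600):
    """Hypothetical helper: start `git clone` in the background."""
    return submit_command(["git", "clone", repo_url, dest_dir], timeout)


if __name__ == "__main__":
    # Demo with a harmless command instead of a real clone.
    future = submit_command([sys.executable, "-c", "print('ok')"])
    code, out, err = future.result(timeout=60)
    print(code, out.strip())
```

Inside a Tornado coroutine such as a custom git_fetch, the returned future could then be awaited with something like `code, out, err = yield git_clone_async(url, path)`. This relies on Tornado coroutines accepting `concurrent.futures.Future` objects, which is worth checking against the Tornado version that pyspider is installed with. Note that the pool only keeps the loop responsive; if clones are slower than the rate at which tasks arrive, the queue can still fill up.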