EdmundMartin / Scrapio

Asyncio web crawling framework. Work in progress.

Proxies

0xJatto opened this issue

Greetings,

I'm interested to see how you manage proxies. I want to keep things simple and avoid maintaining a database that tracks the status of each proxy.

I want to place proxies on a cooldown list if they trigger a failed response because the targeted site has flagged them for abuse. I was thinking of using a Python dict that records each proxy's state: either ready to be used, or skipped and timestamped so that it can return to the active pool after 15-25 minutes.

However, with more than one worker, I'm not sure how to share this information among them within an asyncio framework.

Hello,

Sorry for the slow reply.

I would probably look to write a simple class to handle this sort of logic.

import asyncio
import time
from typing import List, Dict

from async_timeout import timeout


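class ProxyManager:
    """Tracks proxy state in memory; broken proxies rejoin the pool after back_off seconds."""

    def __init__(self, back_off: int):
        # Maps proxy -> {'time': last status change, 'status': 'working' or 'broken'}
        self._proxies: Dict[str, dict] = dict()
        self._back_off = back_off

    def load_proxies(self, proxies: List[str]) -> None:
        for p in proxies:
            self._proxies[p] = {'time': time.time(), 'status': 'working'}

    def broken_proxy(self, proxy: str) -> None:
        # Mark the proxy as broken and record when its cooldown started.
        self._proxies[proxy] = {'time': time.time(), 'status': 'broken'}

    async def get_working_proxy(self, wait: int) -> str:
        # The timeout wraps the whole loop, so asyncio.TimeoutError is raised
        # if no proxy becomes available within `wait` seconds.
        async with timeout(wait):
            while True:
                for k, v in self._proxies.items():
                    if v['status'] == 'working':
                        return k
                    # Cooldown has elapsed: return the broken proxy to the pool.
                    if v['time'] < time.time() - self._back_off:
                        self._proxies[k] = {'time': time.time(), 'status': 'working'}
                        return k
                # Yield to the event loop before checking again.
                await asyncio.sleep(0.001)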

The class above does roughly what you want and can be customized fairly easily for your needs. I put it together rather quickly, so let me know what you think.
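
As one possible customization, here is a minimal sketch of the randomized 15-25 minute cooldown described in the question. The class name `CooldownPool`, its methods, and the injectable `clock` parameter are all hypothetical and not part of Scrapio. It uses only the standard library and stores, for each proxy, the time at which it may be used again.

```python
import random
import time
from typing import Callable, Dict, List, Optional


class CooldownPool:
    """Hypothetical sketch: each proxy maps to the time it becomes usable again."""

    def __init__(self, proxies: List[str],
                 min_cooldown: float = 15 * 60,
                 max_cooldown: float = 25 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._min = min_cooldown
        self._max = max_cooldown
        # -inf means "usable immediately".
        self._ready_at: Dict[str, float] = {p: float("-inf") for p in proxies}

    def mark_broken(self, proxy: str) -> None:
        # Pick a random cooldown in [min, max] so proxies do not all return at once.
        self._ready_at[proxy] = self._clock() + random.uniform(self._min, self._max)

    def get_ready(self) -> Optional[str]:
        # Return the first proxy whose cooldown has elapsed, or None if all are cooling down.
        now = self._clock()
        for proxy, ready_at in self._ready_at.items():
            if ready_at <= now:
                return proxy
        return None
```

Storing a "ready at" time instead of a status string means there is no separate step to move a proxy back into the active pool: it becomes eligible again as soon as the clock passes its timestamp.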

There are different ways to share this information between workers. For instance, you could create a single instance of the class and pass that instance into each async worker. Alternatively, you could hold the instance as an attribute of your own crawler class. The best approach really depends on how you have structured your program.
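
The first option, passing one shared instance into every worker, might look roughly like the sketch below. `SharedPool` is a hypothetical, stripped-down stand-in for the proxy class, and `fake_fetch` is a placeholder for a real HTTP request; neither comes from Scrapio.

```python
import asyncio
from typing import Dict, List


class SharedPool:
    """Hypothetical stand-in for a proxy manager, reduced to the essentials."""

    def __init__(self, proxies: List[str]):
        self._status: Dict[str, str] = {p: "working" for p in proxies}

    def get(self) -> str:
        for proxy, status in self._status.items():
            if status == "working":
                return proxy
        raise RuntimeError("no working proxies left")

    def mark_broken(self, proxy: str) -> None:
        self._status[proxy] = "broken"


async def fake_fetch(url: str, proxy: str) -> bool:
    # Placeholder for a real request; pretends that "bad-proxy" is blocked.
    await asyncio.sleep(0)
    return proxy != "bad-proxy"


async def worker(pool: SharedPool, urls: List[str], results: Dict[str, str]) -> None:
    for url in urls:
        while True:
            proxy = pool.get()
            if await fake_fetch(url, proxy):
                results[url] = proxy
                break
            # Every worker holds the same pool object, so this update is
            # visible to all of them on their next call to pool.get().
            pool.mark_broken(proxy)


async def main() -> Dict[str, str]:
    pool = SharedPool(["bad-proxy", "good-proxy"])
    results: Dict[str, str] = {}
    # Both workers receive the same pool instance.
    await asyncio.gather(
        worker(pool, ["u1", "u2"], results),
        worker(pool, ["u3", "u4"], results),
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because asyncio tasks run on a single thread and only switch at an `await`, the plain dict inside the pool needs no lock, provided there is no `await` between reading a proxy's state and updating it.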