igizm0 / gain

Web crawling framework based on asyncio for everyone.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Build Python Version License

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

  • Python3.5+

Installation

pip install gain

pip install uvloop (Only linux)

Usage

  1. Write spider.py:
from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser('https://blog.scrapinghub.com/page/\d+/'),
               Parser('https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
  1. Run python spider.py

  2. Result:

Example

The examples are in the /example/ directory.

Contribution

  • Pull request.
  • Open issue.

About

Web crawling framework based on asyncio for everyone.

License:GNU General Public License v3.0


Languages

Language:Python 100.0%