denisgomes / arachpyd

Arachpyd - Your friendly neighborhood spider and web crawler.

Home Page:https://arachpyd.readthedocs.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

**arackpy**
===========

**arackpy** is a simple but powerful web crawler and scraper. Although it is
good natured and respectful at heart, it can be used to do evil. Remember with
great power comes great responsibility.


Requirements
------------

**arackpy** currently supports Python 2.7 to 3.6+ out of the box. Depending on
how you want to extract data, several other dependencies from the list below is
required to be installed to support the various backends:

* lxml - for html parsing and url extraction
* requests - for downloading html pages
* pysocks - for making tor based connections
* fake_useragent - for browser spoofing
* selenium (coming soon!)


Installation
------------

For a vanilla **arackpy** install with no other dependencies:

    pip install arackpy

For proxy and tor support:

    pip install lxml, requests, fake_useragent, pysocks


Quickstart
----------

Open up your favorite python text editor and type the following:

    # hello_spider.py

    from __future__ import print_function   # python 2 support

    from arackpy.spider import Spider


    class HelloSpider(Spider):
        """A simple spider in just ten lines of working code"""

        start_urls = ["https://www.python.org"]

        def parse(self, url, html):
            """Extract data from the raw html"""
            print("Crawling url, %s" % url)


    if __name__ == "__main__":
        print("Press Ctrl-c to stop crawling")
        spider = HelloSpider()
        spider.crawl()

Run the program using:

    $ python hello_spider.py

Note

Press Ctrl-c to terminate crawling. To use proxies or tor, change the backend
accordingly.


Documentation
-------------

To learn more, go read the docs at https://arackpy.readthedocs.org. The
**arackpy** logo was taken from https://clipart-library.com.

About

Arachpyd - Your friendly neighborhood spider and web crawler.

https://arachpyd.readthedocs.org

License:BSD 3-Clause "New" or "Revised" License


Languages

Language:Python 100.0%