soedr/diffbot-python-client

#Python Diffbot API Client

##Preface Identify and extract the important parts of any web page in Python! This client currently supports calls to Diffbot's Automatic APIs and Crawlbot.

Installation To install activate a new virtual environment and run the following command:

$ pip install -r requirements.txt

##Configuration

To run the example, you must first configure a working API token in config.py:

$ cp config.py.example config.py; vim config.py;

Then replace the string "SOME_TOKEN" with your API token. Finally, to run the example:

$ python example.py

##Usage

###Article API An example call to the Article API:

diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://shichuan.github.io/javascript-patterns/"
api = "article"
response = diffbot.request(url, token, api, version=2)

###Product API An example call to the Product API:

diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.overstock.com/Home-Garden/iRobot-650-Roomba-Vacuuming-Robot/7886009/product.html"
api = "product"
response = diffbot.request(url, token, api, version=version)

###Image API An example call to the Image API:

diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.google.com/"
api = "image"
response = diffbot.request(url, token, api, version=version)

###Analyze API An example call to the Analyze API:

diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.twitter.com/"
api = "analyze"
response = diffbot.request(url, token, api, version=version)

###Crawlbot API To start a new crawl, specify a crawl name, seed URLs, and the API via which URLs should be processed. An example call to the Crawlbot API:

token = "SOME_TOKEN"
name = "sampleCrawlName"
seeds = "http://www.twitter.com/"
api = "analyze"
sampleCrawl = DiffbotCrawl(token,name,seeds=seeds,api=api)

Omit "seeds" and "api" to load an existing crawl, or create a crawl as a placeholder.

To check the status of a crawl:

sampleCrawl.status()

To update a crawl:

maxToCrawl = 100
upp = "diffbot"
sampleCrawl.update(maxToCrawl=maxToCrawl,urlProcessPattern=upp)

To delete or restart a crawl:

sampleCrawl.delete()
sampleCrawl.restart()

To download crawl data:

sampleCrawl.download() # returns JSON by default
sampleCrawl.download(data_format="csv")

To pass additional arguments to a crawl:

sampleCrawl = DiffbotCrawl(token,name,seeds,apiUrl,maxToCrawl=100,maxToProcess=50,notifyEmail="support@diffbot.com")

##Testing

First install the test requirements with the following command:

$ pip install -r test_requirements.txt

Currently there are some simple unit tests that mock the API calls and return data from fixtures in the filesystem. From the project directory, simply run:

$ nosetests

soedr / diffbot-python-client

About

Languages