A simple app for mining Common Crawl data
Akshay Uday Bhat (www.akshaybhat.com)
Requirements
- Boto (latest)
- Fabric (1.8.1)
- Flask (optional)
- Common Crawl library from https://github.com/AKSHAYUBHAT/CommonCrawlLibrary
A quick demo and documentation can be viewed on the IPython Notebook viewer: http://nbviewer.ipython.org/github/AKSHAYUBHAT/CommonCrawl/blob/master/Monitor.ipynb
This repo contains code for accessing Common Crawl crawls (2013 & later) and for launching EC2 spot instances to analyze the crawl data. The code follows common best practices, such as:
- An SQS queue is used to track the progress of the job (a sketch of the worker's queue/S3 flow follows this list).
- Output is stored in an S3 bucket with reduced redundancy storage to lower costs.
- Permissions are passed to EC2 instances via IAM roles and instance profiles; only the required services (S3 & SQS) are authorized.
- Code is stored in an S3 bucket and is downloaded by each spot instance, once it is allocated, via the user_data script.
- Fabric is used to run tasks that get information, execute code and terminate instances.
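The worker loop that ties the queue and the output bucket together looks roughly like the following. This is a minimal sketch using the boto 2 API; the region, queue name, bucket name and the `process` helper are placeholders, and the real values come from config.py and worker.py:

```python
import boto.sqs
import boto.s3
from boto.s3.key import Key

def process(file_key):
    # Placeholder for the per-file analysis implemented in worker.py.
    return '{"file": "%s"}' % file_key

# Credentials are picked up from the IAM instance profile on EC2
# (or from /etc/boto.cfg when running locally).
sqs = boto.sqs.connect_to_region("us-east-1")
s3 = boto.s3.connect_to_region("us-east-1")
queue = sqs.get_queue("cc_job_queue")          # placeholder for JOB_QUEUE
bucket = s3.get_bucket("cc-output-bucket")     # placeholder for OUTPUT_S3_BUCKET

while True:
    messages = queue.get_messages(num_messages=1, visibility_timeout=1800)
    if not messages:
        break                                  # queue drained, job done
    for message in messages:
        file_key = message.get_body()          # each message names one crawl file
        result = process(file_key)
        out = Key(bucket, file_key + ".json")
        # Reduced redundancy storage lowers the per-GB cost of the output.
        out.set_contents_from_string(result, reduced_redundancy=True)
        queue.delete_message(message)          # delete only after the output is stored
```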
The current worker.py implements a simple function that stores, for each file, the count of URLs and the domains with at least 10 URLs. The function and configuration can easily be modified to support more complex analysis.
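For reference, a rough sketch of what such a per-file function might look like. The function name and the `urls` iterable are assumptions; the real worker.py extracts the URLs from the crawl records via the Common Crawl library:

```python
from collections import Counter
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def count_domains(urls, min_urls=10):
    """Count URLs per domain and keep only domains with at least `min_urls` URLs.

    `urls` is assumed to be an iterable of URL strings taken from a single
    crawl file; worker.py obtains them from the Common Crawl records.
    """
    per_domain = Counter(urlparse(url).netloc for url in urls)
    return {
        "total_urls": sum(per_domain.values()),
        "domains": {domain: n for domain, n in per_domain.items() if n >= min_urls},
    }
```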
Access & Security
- Put your boto configuration in /etc/boto.cfg on your local machine; note that this information is never sent to the EC2 machines.
- key_filename = path to your private key (.pem) file
- IAM_ROLE = "ccSpot_role" # role name, no need to change
- IAM_PROFILE = "ccSpot_profile" # profile name, no need to change
- IAM_POLICY_NAME = "ccSpt_policy" # policy name, no need to change
- IAM_POLICY = # policy document, no need to change unless you are accessing other services such as SNS etc. (a sample policy is sketched after this list)
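As an illustration, this part of config.py might look as follows. The key path is a placeholder and the policy document is only a sketch of a policy restricted to S3 and SQS; the exact policy shipped in config.py may differ:

```python
# Access & Security portion of config.py (sketch; key path is a placeholder)
key_filename = "/path/to/your-key.pem"   # private key matching key_name below
IAM_ROLE = "ccSpot_role"
IAM_PROFILE = "ccSpot_profile"
IAM_POLICY_NAME = "ccSpt_policy"
# Policy granting only the services the workers need: S3 and SQS.
IAM_POLICY = """{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:*"],  "Resource": "*"},
    {"Effect": "Allow", "Action": ["sqs:*"], "Resource": "*"}
  ]
}"""
```

Locally, /etc/boto.cfg provides the credentials (a standard [Credentials] section with aws_access_key_id and aws_secret_access_key); on EC2 the same permissions come from the instance profile instead, so the keys never leave your machine.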
Instance Configuration (example values are sketched after this list)
- price = price in dollars for a spot instance
- instance_type = EC2 instance type
- image_id = # Amazon Machine Image (AMI) ID
- key_name = name of your configured key pair; must correspond to the .pem file above
- NUM_WORKERS = number of worker processes per machine; depends on the instance type & memory footprint
- VISIBILITY_TIMEOUT = seconds during which a worker process has to handle a message; set this to the maximum time a worker will take to process a single message
- MAX_TIME_MINS = 230 # maximum amount of time the instance should run (60 * 3 + 50 mins = 230 minutes); this limits the cost in case you forget to terminate the instance
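A possible set of values, shown only to illustrate the shape of this part of config.py; every value here is a placeholder except MAX_TIME_MINS:

```python
# Instance configuration portion of config.py (placeholder values)
price = 0.10                    # maximum spot bid in dollars per hour
instance_type = "m1.xlarge"     # any type with enough memory for NUM_WORKERS processes
image_id = "ami-xxxxxxxx"       # Amazon Machine Image (AMI) ID for your region
key_name = "my-keypair"         # must match the .pem file referenced by key_filename
NUM_WORKERS = 4                 # worker processes per machine
VISIBILITY_TIMEOUT = 1800       # seconds a worker has to finish a single message
MAX_TIME_MINS = 230             # hard cap on runtime (60 * 3 + 50 minutes)
```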
Job Configuration (example values are sketched after this list)
- EC2_Tag = "cc_wat_13_2"
- JOB_QUEUE = SQS queue name
- OUTPUT_S3_BUCKET = S3 output bucket name
- CODE_BUCKET = bucket used to store code & configuration; make sure this is different from the output bucket above
- CODE_KEY = key for storing the code, which is downloaded by the user-data script
- FILE_TYPE = "wat" # type of files you wish to process; choose from {"wat","wet","text","warc"}
- CRAWL_ID = crawl id; choose from {'2013_1','2013_2','2014_1',"ALL"}
- USER_DATA = script run by the spot instance the first time it is booted up
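The job-specific part might look as follows; the bucket, queue and key names are placeholders, while FILE_TYPE and CRAWL_ID take the values listed above:

```python
# Job configuration portion of config.py (placeholder names)
EC2_Tag = "cc_wat_13_2"                  # tag identifying instances belonging to this job
JOB_QUEUE = "cc_job_queue"               # SQS queue holding one message per crawl file
OUTPUT_S3_BUCKET = "cc-output-bucket"    # results are written here with reduced redundancy
CODE_BUCKET = "cc-code-bucket"           # holds the code; must differ from the output bucket
CODE_KEY = "code.zip"                    # key the user-data script downloads on first boot
FILE_TYPE = "wat"                        # one of {"wat", "wet", "text", "warc"}
CRAWL_ID = "2013_2"                      # one of {'2013_1', '2013_2', '2014_1', "ALL"}
USER_DATA = """#!/bin/bash
# First-boot script: fetch the code from CODE_BUCKET and start the workers.
"""
```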
Steps
- AWS credentials should be stored in /etc/boto.cfg; the credentials themselves are never transferred to the instances.
- Install the Common Crawl library from https://github.com/AKSHAYUBHAT/CommonCrawlLibrary
- To set up the job, run "fab setup_job"; this creates the IAM roles, the S3 output bucket and the SQS queue (the full command sequence is shown after this list).
- To test the worker script, run "fab test_worker".
- To save the code to S3, run "fab push_code".
- To request spot instances, run "fab request_spot_instance"; once allocated, the spot instances start running the code automatically.
- To list the current spot instances, run "fab ls_instances".
- To terminate all instances, run "fab terminate_instances" (NOTE: it is important that you manually terminate all instances).
- Use "fab ls_bucket" to check the status of the output bucket and to download one randomly selected key to temp.json.
- Use "fab rm_bucket:bucket_name" to delete a bucket and all keys inside it.
Files
- config.py: configuration for launching the job (identifiers for the bucket, queue, etc.)
- worker.py: code executed on each file in the crawl
- fabfile.py: tasks for setting up, running, monitoring and terminating jobs
- spotinstance.py: a small class to keep track of spot instance requests
- filequeue.py: a small class to keep track of files in the SQS queue
- example.json: example of the output stored in the bucket for one file, using the current worker.py
- worker.go: a worker written in Go for better performance