A simple app for mining Common Crawl data
Akshay Uday Bhat (www.akshaybhat.com)
Requirements
- Boto (latest)
- Fabric (1.8.1)
- Flask (optional)
- Common Crawl library from https://github.com/AKSHAYUBHAT/CommonCrawlLibrary
A quick demo and documentation can be viewed on the IPython Notebook viewer: http://nbviewer.ipython.org/github/AKSHAYUBHAT/CommonCrawl/blob/master/Monitor.ipynb
This repo contains code for accessing Common Crawl crawls (2013 & later) and for launching EC2 spot instances to analyze the crawl data. The code follows common best practices, such as:
- An SQS queue is used to track the progress of the job (a sketch of the worker's queue/S3 flow follows this list).
- Output is stored in an S3 bucket with reduced redundancy storage to lower costs.
- Permissions are passed to EC2 instances via IAM roles and instance profiles; only the required services (S3 & SQS) are authorized.
- Code is stored in an S3 bucket and is downloaded by each spot instance, once it is allocated, via the user_data script.
- Fabric is used to run tasks that get information, execute code and terminate instances.
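The worker loop that ties the queue and the output bucket together looks roughly like the following. This is a minimal sketch using the boto 2 API; the region, queue name, bucket name and the `process` helper are placeholders, and the real values come from config.py and worker.py:

```python
import boto.sqs
import boto.s3
from boto.s3.key import Key

def process(file_key):
    # Placeholder for the per-file analysis implemented in worker.py.
    return '{"file": "%s"}' % file_key

# Credentials are picked up from the IAM instance profile on EC2
# (or from /etc/boto.cfg when running locally).
sqs = boto.sqs.connect_to_region("us-east-1")
s3 = boto.s3.connect_to_region("us-east-1")
queue = sqs.get_queue("cc_job_queue")          # placeholder for JOB_QUEUE
bucket = s3.get_bucket("cc-output-bucket")     # placeholder for OUTPUT_S3_BUCKET

while True:
    messages = queue.get_messages(num_messages=1, visibility_timeout=1800)
    if not messages:
        break                                  # queue drained, job done
    for message in messages:
        file_key = message.get_body()          # each message names one crawl file
        result = process(file_key)
        out = Key(bucket, file_key + ".json")
        # Reduced redundancy storage lowers the per-GB cost of the output.
        out.set_contents_from_string(result, reduced_redundancy=True)
        queue.delete_message(message)          # delete only after the output is stored
```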
The current worker.py implements a simple function that stores, for each file, the count of URLs and the domains with at least 10 URLs. The function and configuration can easily be modified to support more complex analysis.
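For reference, a rough sketch of what such a per-file function might look like. The function name and the `urls` iterable are assumptions; the real worker.py extracts the URLs from the crawl records via the Common Crawl library:

```python
from collections import Counter
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def count_domains(urls, min_urls=10):
    """Count URLs per domain and keep only domains with at least `min_urls` URLs.

    `urls` is assumed to be an iterable of URL strings taken from a single
    crawl file; worker.py obtains them from the Common Crawl records.
    """
    per_domain = Counter(urlparse(url).netloc for url in urls)
    return {
        "total_urls": sum(per_domain.values()),
        "domains": {domain: n for domain, n in per_domain.items() if n >= min_urls},
    }
```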
Access & Security
- Put your boto configuration in /etc/boto.cfg on your local machine; note that this information is never sent to the EC2 machines.
- key_filename = path to your private key (.pem) file
- IAM_ROLE = "ccSpot_role" # role name, no need to change
- IAM_PROFILE = "ccSpot_profile" # profile name, no need to change
- IAM_POLICY_NAME = "ccSpt_policy" # policy name, no need to change
- IAM_POLICY = # policy document, no need to change unless you are accessing other services such as SNS etc. (a sample policy is sketched after this list)
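As an illustration, this part of config.py might look as follows. The key path is a placeholder and the policy document is only a sketch of a policy restricted to S3 and SQS; the exact policy shipped in config.py may differ:

```python
# Access & Security portion of config.py (sketch; key path is a placeholder)
key_filename = "/path/to/your-key.pem"   # private key matching key_name below
IAM_ROLE = "ccSpot_role"
IAM_PROFILE = "ccSpot_profile"
IAM_POLICY_NAME = "ccSpt_policy"
# Policy granting only the services the workers need: S3 and SQS.
IAM_POLICY = """{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:*"],  "Resource": "*"},
    {"Effect": "Allow", "Action": ["sqs:*"], "Resource": "*"}
  ]
}"""
```

Locally, /etc/boto.cfg provides the credentials (a standard [Credentials] section with aws_access_key_id and aws_secret_access_key); on EC2 the same permissions come from the instance profile instead, so the keys never leave your machine.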
Instance Configuration (example values are sketched after this list)
- price = price in dollars for a spot instance
- instance_type = EC2 instance type
- image_id = # Amazon Machine Image (AMI) ID
- key_name = name of your configured key pair; must correspond to the .pem file above
- NUM_WORKERS = number of worker processes per machine; depends on the instance type & memory footprint
- VISIBILITY_TIMEOUT = seconds during which a worker process has to handle a message; set this to the maximum time a worker will take to process a single message
- MAX_TIME_MINS = 230 # maximum amount of time the instance should run (60 * 3 + 50 mins = 230 minutes); this limits the cost in case you forget to terminate the instance
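A possible set of values, shown only to illustrate the shape of this part of config.py; every value here is a placeholder except MAX_TIME_MINS:

```python
# Instance configuration portion of config.py (placeholder values)
price = 0.10                    # maximum spot bid in dollars per hour
instance_type = "m1.xlarge"     # any type with enough memory for NUM_WORKERS processes
image_id = "ami-xxxxxxxx"       # Amazon Machine Image (AMI) ID for your region
key_name = "my-keypair"         # must match the .pem file referenced by key_filename
NUM_WORKERS = 4                 # worker processes per machine
VISIBILITY_TIMEOUT = 1800       # seconds a worker has to finish a single message
MAX_TIME_MINS = 230             # hard cap on runtime (60 * 3 + 50 minutes)
```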
Job Configuration (example values are sketched after this list)
- EC2_Tag = "cc_wat_13_2"
- JOB_QUEUE = SQS queue name
- OUTPUT_S3_BUCKET = S3 output bucket name
- CODE_BUCKET = bucket used to store code & configuration; make sure this is different from the output bucket above
- CODE_KEY = key for storing the code, which is downloaded by the user-data script
- FILE_TYPE = "wat" # type of files you wish to process; choose from {"wat","wet","text","warc"}
- CRAWL_ID = crawl id; choose from {'2013_1','2013_2','2014_1',"ALL"}
- USER_DATA = script run by the spot instance the first time it is booted up
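The job-specific part might look as follows; the bucket, queue and key names are placeholders, while FILE_TYPE and CRAWL_ID take the values listed above:

```python
# Job configuration portion of config.py (placeholder names)
EC2_Tag = "cc_wat_13_2"                  # tag identifying instances belonging to this job
JOB_QUEUE = "cc_job_queue"               # SQS queue holding one message per crawl file
OUTPUT_S3_BUCKET = "cc-output-bucket"    # results are written here with reduced redundancy
CODE_BUCKET = "cc-code-bucket"           # holds the code; must differ from the output bucket
CODE_KEY = "code.zip"                    # key the user-data script downloads on first boot
FILE_TYPE = "wat"                        # one of {"wat", "wet", "text", "warc"}
CRAWL_ID = "2013_2"                      # one of {'2013_1', '2013_2', '2014_1', "ALL"}
USER_DATA = """#!/bin/bash
# First-boot script: fetch the code from CODE_BUCKET and start the workers.
"""
```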
Steps
- AWS credentials should be stored in /etc/boto.cfg; the credentials themselves are never transferred to the instances.
- Install the Common Crawl library from https://github.com/AKSHAYUBHAT/CommonCrawlLibrary
- To set up the job, run "fab setup_job"; this creates the IAM roles, the S3 output bucket and the SQS queue (the full command sequence is shown after this list).
- To test the worker script, run "fab test_worker".
- To save the code to S3, run "fab push_code".
- To request spot instances, run "fab request_spot_instance"; once allocated, the spot instances start running the code automatically.
- To list the current spot instances, run "fab ls_instances".
- To terminate all instances, run "fab terminate_instances" (NOTE: it is important that you manually terminate all instances).
- Use "fab ls_bucket" to check the status of the output bucket and to download one randomly selected key to temp.json.
- Use "fab rm_bucket:bucket_name" to delete a bucket and all keys inside it.
Files
- config.py: configuration for launching the job (identifiers for the bucket, queue, etc.)
- worker.py: code executed on each file in the crawl
- fabfile.py: tasks for setting up, running, monitoring and terminating jobs
- spotinstance.py: a small class to keep track of spot instance requests
- filequeue.py: a small class to keep track of files in the SQS queue
- example.json: example of the output stored in the bucket for one file, using the current worker.py
- worker.go: a worker written in Go for better performance