

vietnamworks_crawler

A VietnamWorks crawler and RSS feed generator.

It uses the [Scrapy](http://scrapy.org/) engine.

Deploy on OpenShift

To deploy this script on OpenShift, you need a Python cartridge with a cron cartridge added, and the source code cloned to ~/app-root/runtime/repo.
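If you would rather do the clone step by hand, a rough sketch follows, assuming the rhc client tools are installed; APPNAME is a placeholder, and the target directory may need to be emptied first:

```
rhc ssh APPNAME                  # SSH into the gear (APPNAME is a placeholder)
cd ~/app-root/runtime
git clone https://github.com/trananhtuan/vietnamworks_crawler repo
```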

OpenShift can do most of these steps automatically for you. A step-by-step guide follows:

  1. Go to the [OpenShift application console](https://openshift.redhat.com/app/console/applications) and select Add Application to create a new gear
  2. Select Python (2.7 recommended) on the cartridge selection page
  3. Enter a public domain for your app (or use the default) and select an appropriate gear size
  4. In the Source Code field, enter https://github.com/trananhtuan/vietnamworks_crawler and leave Branch empty (or enter the branch you want to use). Click Create Application and wait for OpenShift to create your gear

  5. Go back to the [OpenShift application console](https://openshift.redhat.com/app/console/applications) and select your newly created gear
  6. On the application management page, select "see the list of cartridges you can add" and add the Cron cartridge to your gear
  7. Go to your application domain to check that the deployment succeeded

To deploy to an existing Python gear (Python 2.7 recommended), follow these steps:

  1. Go to the OpenShift application console and add the Cron cartridge to your gear
  2. Bare-clone and mirror-push the source code to your gear:

```
git clone --bare https://github.com/trananhtuan/vietnamworks_crawler
cd vietnamworks_crawler.git
git push --mirror ssh://APP_USERNAME@APPNAME-DOMAIN.rhcloud.com/~/git/tmp.git/
cd ..
rm -rf vietnamworks_crawler.git
```

  3. Go to your application domain to check that the deployment succeeded

The RSS feed will be ready after the cron job runs for the first time. By default, a cron job runs hourly to crawl new pages.
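The README doesn't show the cron script itself. As a rough sketch, following the OpenShift v2 convention of placing scripts under .openshift/cron/hourly/, the hourly job might look like this (the file name and the fallback path are assumptions):

```
#!/bin/bash
# Hypothetical .openshift/cron/hourly/crawl.sh: runs the spider once per hour.
# OPENSHIFT_REPO_DIR points at the deployed source tree on OpenShift v2 gears;
# fall back to the clone location mentioned above.
cd "${OPENSHIFT_REPO_DIR:-$HOME/app-root/runtime/repo}" || exit 1
# JOBDIR persists crawl state so already-seen pages are skipped between runs.
scrapy crawl vietnamworks -s JOBDIR=cache
```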

Run manually

With Scrapy installed, clone the project, go to its home folder, and run the spider:

```
git clone https://github.com/trananhtuan/vietnamworks_crawler
cd vietnamworks_crawler
scrapy crawl vietnamworks
```

To avoid re-crawling pages between runs (e.g. under cron), append `-s JOBDIR=cache` to the last command:

```
scrapy crawl vietnamworks -s JOBDIR=cache
```

By default, scraped items are saved in `jobs.sqlite`.
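For a quick look at what was scraped, the sqlite3 command-line tool works; the table name below is an assumption, so list the actual tables first:

```
sqlite3 jobs.sqlite '.tables'                      # discover the real table name
sqlite3 jobs.sqlite 'SELECT * FROM jobs LIMIT 5;'  # 'jobs' is an assumed table name
```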

Feel free to modify the code and submit your changes.

About

A Scrapy crawler for VietnamWorks.com

License: GNU General Public License v2.0


Languages

Python 99.4%, Shell 0.6%