ashwani-pandey/webspider

A JavaScript-enabled spider that downloads all the PDF files accessible from the following link :
http://trai.gov.in/Content/PressReleases.aspx

NOTE : I have made the spider generic so that it can work for other sites as well.
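
To illustrate what "generic" link collection can look like, here is a minimal sketch. It is not the project's actual code. It gathers links ending in .pdf from a page's HTML using only the Python standard library. The names PdfLinkCollector and extract_pdf_links are hypothetical, and the sketch does not execute JavaScript, so it only sees links present in the static HTML.

```python
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.7
    from urlparse import urljoin


class PdfLinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute URLs of <a href> targets ending in .pdf."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:           # skip duplicates
                self.links.append(url)


def extract_pdf_links(html, base_url):
    """Return the de-duplicated PDF links found in an HTML string."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Because the base URL is passed in rather than hard-coded, the same function can be pointed at any page, which is the sense of "generic" assumed here.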

MOZILLA FIREFOX must be installed to run this program!

NOTE : The code has been tested on Ubuntu 14.04 LTS, and the procedure given below should work
on that system.

Also, make sure that Python 2.7 and pip are already installed in your environment.
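
The prerequisites above can be checked with a short script before starting the setup. This is only a sketch, not part of the package. The executable names "pip" and "firefox" are assumptions about how these tools appear on the PATH of a typical Ubuntu system.

```python
import os
import sys


def find_on_path(program):
    """Return the full path of an executable found on PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_prerequisites():
    """Map each prerequisite named in this README to True or False."""
    return {
        # True only when this script itself is run with Python 2.7.
        "python 2.7": sys.version_info[:2] == (2, 7),
        "pip": find_on_path("pip") is not None,          # assumed executable name
        "firefox": find_on_path("firefox") is not None,  # assumed executable name
    }


if __name__ == "__main__":
    for name, ok in sorted(check_prerequisites().items()):
        print("%-10s %s" % (name, "found" if ok else "MISSING"))
```

Run it with the interpreter you intend to use for the spider, since the Python version check reports on the interpreter executing the script.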

******************************************************************************************************************

AUTOMATIC SETUP to run the program ( if it fails for some reason, follow the manual installation guidelines given below ) :

1. Extract files from webspider-1.0.tar.gz

2. cd into webspider-1.0 and run the following command :

sudo python setup.py install

#################################
If an error occurs while lxml is being downloaded or installed, run the following command :

sudo apt-get install libxml2-dev libxslt1-dev python-dev

and then run the command :

sudo python setup.py install

#################################

3. Type the following in your terminal to start downloading the pdfs :

trai_pdfs

And hurray, the download of the pdfs will start automatically.

The logs will be saved in the logs folder, and you can also see the logs being generated on the terminal itself.

The downloaded pdfs will be present in the downloaded_pdfs folder.
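
Logging to a file and to the terminal at the same time, as described above, can be set up with Python's standard logging module. The following is a minimal sketch under stated assumptions and is not the project's actual configuration. The function name make_logger, the logger name and the timestamp format of the file name are all hypothetical.

```python
import logging
import os
import time


def make_logger(log_dir="logs"):
    """Log to the terminal and to a timestamp-named file (name format is assumed)."""
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    log_path = os.path.join(log_dir, time.strftime("%Y-%m-%d_%H-%M-%S") + ".log")

    logger = logging.getLogger("webspider_sketch")
    logger.setLevel(logging.INFO)
    del logger.handlers[:]  # avoid duplicate output if called more than once

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # One handler writes to the log file, the other to the terminal (stderr).
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_path


if __name__ == "__main__":
    logger, path = make_logger()
    logger.info("log file created at %s", path)
```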

********************************************************************************************************************

MANUAL INSTALLATION GUIDELINES to run the program ( in case the automated setup fails ) :

1. cd into the webspider folder where the requirements.txt file is present.

2. Make sure that virtualenv is already installed. If not, run the following command

pip install virtualenv

3. Create a virtual environment for the project by running the following command

virtualenv -p /usr/bin/python2.7 venv

4. Activate the virtual environment by running the following command

source venv/bin/activate

5. Run the following command to install all the requirements:

pip install -r requirements.txt

6. cd into the webspider folder again, where the 3 python modules are present :
   6.1. main.py
   6.2. generic_scraper.py
   6.3. trai_parser.py

7. Run the following command to start downloading all the pdf files from the given link

python main.py

In the webspider folder :

logs folder - contains the logs generated every time you run the main.py module; each log is named with its timestamp

downloaded_pdfs folder - contains all the downloaded pdfs
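
After a run, the two folders described above can be inspected with a few lines of Python. This is a sketch only. The function name summarize_run is hypothetical, and the latest log is found by modification time so that no assumption is made about the exact timestamp format of the log names.

```python
import os


def summarize_run(base_dir="."):
    """Return (number of downloaded PDFs, path of the newest log or None)."""
    pdf_dir = os.path.join(base_dir, "downloaded_pdfs")
    log_dir = os.path.join(base_dir, "logs")

    pdfs = []
    if os.path.isdir(pdf_dir):
        pdfs = [f for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]

    logs = []
    if os.path.isdir(log_dir):
        logs = [os.path.join(log_dir, f) for f in os.listdir(log_dir)]
        logs = [p for p in logs if os.path.isfile(p)]

    # Pick the most recently modified log file, if there is one.
    latest = max(logs, key=os.path.getmtime) if logs else None
    return len(pdfs), latest


if __name__ == "__main__":
    count, latest_log = summarize_run()
    print("PDFs downloaded: %d" % count)
    print("Latest log: %s" % latest_log)
```

Run it from inside the webspider folder, or pass that folder's path as base_dir.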
