
spacetime-crawler

Crawler apps using the spacetime library

Contributors

Contributor information such as IDs and emails can be found in crawler_frame.py, crawler.py, GlobalAnalytics.py, and logs/analytics.txt, as well as on the repository's Contributors page.

  • Joshua Pascascio
  • Miguel Tolosa
  • Chang Shin Lee

Installation

  • Use Python 2.7
  • Make sure pip is installed. Then, in a terminal, clone the repository with git clone https://github.com/Mondego/spacetime-crawler and install the following packages in this order:
  • python -m pip install --user flask
  • python -m pip install --user flask_restful
  • python -m pip install --user requests
  • python -m pip install --user pcc
  • Finally, run pip install spacetime
  • When running the crawler, make sure you are connected to the UCI VPN; otherwise you will get a ConnectionRefusedError.

These installations should allow all the files in spacetime-crawler to run in the Python shell without throwing any errors.

Additional Libraries

  • In crawler_frame.py, the BeautifulSoup4, lxml, urllib2, urlparse, and htmllib libraries are used.
  • To get BeautifulSoup, navigate to the main Python directory in cmd and type: python -m pip install --user beautifulsoup4
  • To get lxml, you can use a simple .exe installer.
  • The other libraries are included with the Python 2.7 installation.
  • Other helpful libraries might be cssselect (CSSSelector) and html5lib (pure Python, but can be slower in performance); a short parser-selection example follows this list.
  • Installation:
  python -m pip install --user cssselect
  python -m pip install --user html5lib
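
As a quick illustration (not code from this repo), BeautifulSoup lets you choose which parser backend to use: lxml is the fast C-based option, while html5lib is pure Python and more forgiving but slower.

  from bs4 import BeautifulSoup

  html = "<html><body><a href='/page.html'>link</a></body></html>"
  fast_soup = BeautifulSoup(html, "lxml")          ## fast parser; requires the lxml package
  lenient_soup = BeautifulSoup(html, "html5lib")   ## pure-Python parser; requires html5lib
  print(fast_soup.a["href"])                       ## -> /page.html
  print(lenient_soup.a["href"])                    ## -> /page.html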

Execute a sample crawl

To run a simulated web crawl of up to 3,000 URLs on Windows, run the following from the Python folder:

python spacetime-crawler/CopyCrawler.py -a amazon.ics.uci.edu -p 9050

Things to note when executing

  • It may take a while to reach 3,000 URLs because they have to be successful URL downloads, and a lot of the time you get timeouts, so be patient. Before submission, the crawler will have to be run in the background to reach that number; in a sample run-through I got 743 URLs in 37 minutes (2221.07 seconds in the log).
  • If you want to demo the logging or other features, just use main_crawler.py or a keyboard interrupt (Ctrl-Z); I had it print out the stats and do the logging once termination occurred, whether normal or through an interrupt.
  • A sample log file is in /log/analytics.txt.
  • NOTE: HTTP codes retrieved from a server may come in as bytes or strings unless a special library is used, so make sure to use int(data.http_code) when checking validity (see the sketch below).
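
For example, here is a minimal sketch of that status check (the helper name is ours, not something from the repo):

  def is_successful(http_code):
      """Return True for a 2xx status; http_code may arrive as a str, bytes, or int."""
      try:
          code = int(http_code)          ## normalize to int before comparing
      except (TypeError, ValueError):
          return False
      return 200 <= code < 300

  ## e.g. is_successful("200") -> True, is_successful("404") -> False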

References

  • htmllib: HTML Document Parser
  • urlparse: Breaks down URL strings. The most helpful functions are likely urlparse, urljoin, and urlsplit.
  • lxml library: needed for other HTML parsing and required for BeautifulSoup4's "lxml" parser to run. Helpful modules within lxml: html, etree.
  • BeautifulSoup4. To initialize the parser properly, use:
  from bs4 import BeautifulSoup
  from urllib2 import urlopen
  urlString = "https://website.com/index.html"
  urlContent = urlopen(urlString)  ## attempts an HTTP request; if successful, returns the HTML document
  soupParser = BeautifulSoup(urlContent, "lxml")
  ## soupParser can now be used to parse the contents of the 'index.html' file!
  • To test, open, and request URLs, use urllib2. The most useful methods for this kind of program are urlopen and Request.
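
Putting these references together, here is a minimal Python 2.7 sketch (not taken from the repo; the URL is only a placeholder) that fetches a page with urllib2, parses it with BeautifulSoup, and resolves relative links with urljoin:

  from bs4 import BeautifulSoup
  from urllib2 import urlopen
  from urlparse import urljoin

  base = "https://www.ics.uci.edu/index.html"      ## placeholder URL
  soup = BeautifulSoup(urlopen(base), "lxml")
  links = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]
  for link in links:
      print(link)                                  ## each link resolved to an absolute URL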

Using The Test File mainCrawler.py

  1. Make sure you have the full repository cloned with git onto your local system. I would recommend placing it in the Python directory where all the standard libraries are stored; for me it is in:
  • C:\Python27\
  • Although for others, on Windows, it might be:
    C:\Users\Username\AppData\Local\Programs\Python27\

  2. Keep the file in the root of the spacetime-crawler directory on your system; storing it somewhere else can complicate the way import statements are handled, so try something like:
  • C:\Users\Username\AppData\Local\Programs\Python27\spacetime-crawler
  3. Once that is done, run the file in IDLE:
  4. Open the file -> Right Click -> Edit With IDLE
  5. Click Run -> Run Module, or press F5
  6. A (cmd) prompt will then open asking for commands.
  7. To run the crawler, just type download followed by a URL you would like to check out and crawl. (Note: only .ics.uci.edu domains will retrieve valid results; others result in timeout or connection errors but will not halt the program.)
  8. To run it unattended, just double-click the file or execute it on the command line with no arguments and it will automatically crawl the subdomain ics.uci.edu. This will run indefinitely but can be stopped with Ctrl^Z.
  9. The crawler will display which URLs it has downloaded and which it failed to download, showing a timeout message in red. (Successful downloads are usually in blue.)
  10. All successful URL downloads or checks are written to a file called successfulurls.txt once finished. All are valid URLs, though some are relative to certain paths, so in other parts of the program they may need to be joined with the urljoin method (see the sketch after this list).
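
A minimal sketch of that last post-processing step (the base URL here is only an assumption):

  from urlparse import urljoin

  base = "https://www.ics.uci.edu/"                ## assumed base for any relative entries
  with open("successfulurls.txt") as f:
      absolute_urls = [urljoin(base, line.strip()) for line in f if line.strip()]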

Functions Added

  • Located in spacetime-crawler/applications/search/crawler_frame.py, near bottom
  • The added/modified functions are: get_url_content, extract_next_links, and is_valid. The file should have their documentation; an illustrative sketch of the kind of check is_valid performs appears below.
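
As an illustration only (not the repository's actual code), here is a minimal sketch of the kind of domain check is_valid might perform, assuming it accepts only .ics.uci.edu pages as noted above:

  from urlparse import urlparse

  def is_valid_sketch(url):
      """Illustrative only: accept http(s) URLs within the .ics.uci.edu subdomain."""
      parsed = urlparse(url)
      if parsed.scheme not in ("http", "https"):
          return False
      host = parsed.hostname
      return host is not None and (host == "ics.uci.edu" or host.endswith(".ics.uci.edu"))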
