
spacetime-crawler

Crawler apps using the spacetime library

Contributors

Contributor information such as IDs and emails can be found in crawler_frame.py, crawler.py, GlobalAnalytics.py, and logs/analytics.txt, as well as on the repository's Contributors page.

  • Joshua Pascascio
  • Miguel Tolosa
  • Chang Shin Lee

Installation

  • Use Python 2.7
  • Make sure pip is installed. Then, in a terminal, clone the repository with git clone https://github.com/Mondego/spacetime-crawler and install the following packages in this order:
  • python -m pip install --user flask
  • python -m pip install --user flask_restful
  • python -m pip install --user requests
  • python -m pip install --user pcc
  • Finally, run pip install spacetime
  • When running the crawler, make sure you are connected to the UCI VPN; otherwise you will get a ConnectionRefusedError.

These installations should allow all the files in spacetime-crawler to run in the Python shell without throwing any errors.

Additional Libraries

  • In crawler_frame.py, the BeautifulSoup4, lxml, urllib2, urlparse, and htmllib libraries are used.
  • To get BeautifulSoup, navigate to the main Python directory in cmd and type: python -m pip install --user beautifulsoup4
  • To get lxml, you can use a simple .exe installer.
  • The other libraries are included with the Python 2.7 installation.
  • Other helpful libraries might be cssselect (CSSSelector) and html5lib (pure Python, but can be slower in performance); a short parser-selection example follows this list.
  • Installation:
  python -m pip install --user cssselect
  python -m pip install --user html5lib
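
As a quick illustration (not code from this repo), BeautifulSoup lets you choose which parser backend to use: lxml is the fast C-based option, while html5lib is pure Python and more forgiving but slower.

  from bs4 import BeautifulSoup

  html = "<html><body><a href='/page.html'>link</a></body></html>"
  fast_soup = BeautifulSoup(html, "lxml")          ## fast parser; requires the lxml package
  lenient_soup = BeautifulSoup(html, "html5lib")   ## pure-Python parser; requires html5lib
  print(fast_soup.a["href"])                       ## -> /page.html
  print(lenient_soup.a["href"])                    ## -> /page.html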

Execute a sample crawl

To run a simulated web crawl of up to 3,000 URLs on Windows, run the following from the Python folder:

python spacetime-crawler/CopyCrawler.py -a amazon.ics.uci.edu -p 9050

Things to note when executing

  • It may take a while to reach 3,000 URLs because they have to be successful URL downloads, and a lot of the time you get timeouts, so be patient. Before submission, the crawler will have to be run in the background to reach that number; in a sample run-through I got 743 URLs in 37 minutes (2221.07 seconds in the log).
  • If you want to demo the logging or other features, just use main_crawler.py or a keyboard interrupt (Ctrl-Z); I had it print out the stats and do the logging once termination occurred, whether normal or through an interrupt.
  • A sample log file is in /log/analytics.txt.
  • NOTE: HTTP codes retrieved from a server may come in as bytes or strings unless a special library is used, so make sure to use int(data.http_code) when checking validity (see the sketch below).
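
For example, here is a minimal sketch of that status check (the helper name is ours, not something from the repo):

  def is_successful(http_code):
      """Return True for a 2xx status; http_code may arrive as a str, bytes, or int."""
      try:
          code = int(http_code)          ## normalize to int before comparing
      except (TypeError, ValueError):
          return False
      return 200 <= code < 300

  ## e.g. is_successful("200") -> True, is_successful("404") -> False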

References

  • htmllib: HTML Document Parser
  • urlparse: Breaks down URL strings. The most helpful functions are likely urlparse, urljoin, and urlsplit.
  • lxml library: needed for other HTML parsing and required for BeautifulSoup4's "lxml" parser to run. Helpful modules within lxml: html, etree.
  • BeautifulSoup4. To initialize the parser properly, use:
  from bs4 import BeautifulSoup
  from urllib2 import urlopen
  urlString = "https://website.com/index.html"
  urlContent = urlopen(urlString)  ## attempts an HTTP request; if successful, returns the HTML document
  soupParser = BeautifulSoup(urlContent, "lxml")
  ## soupParser can now be used to parse the contents of the 'index.html' file!
  • To test, open, and request URLs, use urllib2. The most useful methods for this kind of program are urlopen and Request.
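
Putting these references together, here is a minimal Python 2.7 sketch (not taken from the repo; the URL is only a placeholder) that fetches a page with urllib2, parses it with BeautifulSoup, and resolves relative links with urljoin:

  from bs4 import BeautifulSoup
  from urllib2 import urlopen
  from urlparse import urljoin

  base = "https://www.ics.uci.edu/index.html"      ## placeholder URL
  soup = BeautifulSoup(urlopen(base), "lxml")
  links = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]
  for link in links:
      print(link)                                  ## each link resolved to an absolute URL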

Using The Test File mainCrawler.py

  1. Make sure you have the full repository cloned with git onto your local system. I would recommend placing it in the Python directory where all the standard libraries are stored; for me it is in:
  • C:\Python27\
  • Although for others, on Windows, it might be:
    C:\Users\Username\AppData\Local\Programs\Python27\

  2. Keep the file in the root of the spacetime-crawler directory on your system; storing it somewhere else can complicate the way import statements are handled, so try something like:
  • C:\Users\Username\AppData\Local\Programs\Python27\spacetime-crawler
  3. Once that is done, run the file in IDLE:
  4. Open the file -> Right Click -> Edit With IDLE
  5. Click Run -> Run Module, or press F5
  6. A (cmd) prompt will then open asking for commands.
  7. To run the crawler, just type download followed by a URL you would like to check out and crawl. (Note: only .ics.uci.edu domains will retrieve valid results; others result in timeout or connection errors but will not halt the program.)
  8. To run it unattended, just double-click the file or execute it on the command line with no arguments and it will automatically crawl the subdomain ics.uci.edu. This will run indefinitely but can be stopped with Ctrl^Z.
  9. The crawler will display which URLs it has downloaded and which it failed to download, showing a timeout message in red. (Successful downloads are usually in blue.)
  10. All successful URL downloads or checks are written to a file called successfulurls.txt once finished. All are valid URLs, though some are relative to certain paths, so in other parts of the program they may need to be joined with the urljoin method (see the sketch after this list).
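
A minimal sketch of that last post-processing step (the base URL here is only an assumption):

  from urlparse import urljoin

  base = "https://www.ics.uci.edu/"                ## assumed base for any relative entries
  with open("successfulurls.txt") as f:
      absolute_urls = [urljoin(base, line.strip()) for line in f if line.strip()]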

Functions Added

  • Located in spacetime-crawler/applications/search/crawler_frame.py, near bottom
  • The added/modified functions are: get_url_content, extract_next_links, and is_valid. The file should have their documentation; an illustrative sketch of the kind of check is_valid performs appears below.
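
As an illustration only (not the repository's actual code), here is a minimal sketch of the kind of domain check is_valid might perform, assuming it accepts only .ics.uci.edu pages as noted above:

  from urlparse import urlparse

  def is_valid_sketch(url):
      """Illustrative only: accept http(s) URLs within the .ics.uci.edu subdomain."""
      parsed = urlparse(url)
      if parsed.scheme not in ("http", "https"):
          return False
      host = parsed.hostname
      return host is not None and (host == "ics.uci.edu" or host.endswith(".ics.uci.edu"))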
