persinammon / parallel-web-crawler

Uses Java design patterns, Jackson, Guice, and the Streams API to add parallelism to a web crawler.

Adding Multi-Threading to a Single-Threaded Web Crawler

I was given a Java implementation of a single-threaded web crawler, along with unit tests, and implemented a multi-threaded version of the crawler. Credit for the original, very well-planned and dense project goes here.
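To make the idea of a multi-threaded crawl concrete, here is a minimal sketch using the JDK's fork/join framework. It is an illustration under stated assumptions, not this repository's code: the class `ParallelCrawlSketch`, its nested `CrawlTask`, and the in-memory `links` map (a stand-in for downloading and parsing pages) are all hypothetical names introduced for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.stream.Collectors;

/** Hypothetical sketch of a parallel crawl; not the repository's actual classes. */
final class ParallelCrawlSketch {
  // Assumption: a map from URL to outgoing links replaces real page fetching.
  private final Map<String, List<String>> links;
  // Thread-safe set so concurrent tasks can record visited URLs.
  private final Set<String> visited = ConcurrentHashMap.newKeySet();
  private final Instant deadline;

  ParallelCrawlSketch(Map<String, List<String>> links, Duration timeout) {
    this.links = links;
    this.deadline = Instant.now().plus(timeout);
  }

  /** Crawls from the start pages with the given depth limit and thread count. */
  Set<String> crawl(List<String> startPages, int maxDepth, int parallelism) {
    List<CrawlTask> roots = startPages.stream()
        .map(url -> new CrawlTask(url, maxDepth))
        .collect(Collectors.toList());
    ForkJoinPool pool = new ForkJoinPool(parallelism);
    try {
      // Run the root tasks inside the pool and wait for all of them.
      pool.invoke(new RecursiveAction() {
        @Override
        protected void compute() {
          invokeAll(roots);
        }
      });
    } finally {
      pool.shutdown();
    }
    return Set.copyOf(visited);
  }

  /** One unit of work: visit a URL, then fork a subtask per outgoing link. */
  private final class CrawlTask extends RecursiveAction {
    private final String url;
    private final int depth;

    CrawlTask(String url, int depth) {
      this.url = url;
      this.depth = depth;
    }

    @Override
    protected void compute() {
      if (depth == 0 || Instant.now().isAfter(deadline)) {
        return; // depth budget used up, or the crawl timed out
      }
      if (!visited.add(url)) {
        return; // add() is atomic, so only one task processes each URL
      }
      List<CrawlTask> subtasks = links.getOrDefault(url, List.of()).stream()
          .map(next -> new CrawlTask(next, depth - 1))
          .collect(Collectors.toList());
      invokeAll(subtasks);
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> web = Map.of(
        "http://example.com", List.of("http://example.com/foo", "http://example.com/bar"),
        "http://example.com/foo", List.of("http://example.com/baz"));
    ParallelCrawlSketch crawler = new ParallelCrawlSketch(web, Duration.ofSeconds(7));
    System.out.println(new TreeSet<>(crawler.crawl(List.of("http://example.com"), 2, 4)));
  }
}
```

The key design point in this sketch is the thread-safe visited set: checking and marking a URL in a single atomic `add` call avoids two threads crawling the same page. Note that when a page is reachable by paths of different lengths, which task claims it first can depend on scheduling.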

Creational Patterns and Libraries Used, Bugs Squashed

How to Run

Clone the repository, then build and run the crawler:

mvn package
java -classpath target/udacity-webcrawler-1.0.jar com.udacity.webcrawler.main.WebCrawlerMain src/main/config/sample_config.json

Configuration File

This is a sample configuration JSON given to the web crawler.

{
  "startPages": ["http://example.com", "http://example.com/foo"],
  "ignoredUrls": ["http://example\\.com/.*"], 
  "ignoredWords": ["^.{1,3}$"], 
  "parallelism": 4, 
  "implementationOverride": "com.udacity.webcrawler.SequentialWebCrawler", 
  "maxDepth": 10, 
  "timeoutSeconds": 7, 
  "popularWordCount": 3, 
  "profileOutputPath": "profileData.txt",
  "resultPath": "crawlResults.json" 
}


/**
 * Notes:
 * ignoredUrls and ignoredWords are regular expressions; in Java, each is compiled into an instance of the Pattern class.
 * parallelism is the desired number of threads. If unset, it defaults to the number of available CPU cores.
 * implementationOverride takes precedence over parallelism (which otherwise selects the parallel web crawler if > 1).
 * It can be either SequentialWebCrawler or ParallelWebCrawler.
 * maxDepth is the maximum depth of the search tree; the crawler does not follow links beyond this depth.
 * profileOutputPath and resultPath are where to write performance data and the results. If unset, these are printed to standard output.
 */
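The notes above can be sketched in plain Java. This is a hedged illustration, not the project's API: the class `ConfigNotesSketch` and all of its method names are hypothetical, and it assumes that a URL or word is ignored only when the whole string matches a pattern (`Matcher.matches()`), that a non-positive `parallelism` means "unset", and that an empty override means "not specified".

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helpers illustrating the configuration notes; not the project's API. */
final class ConfigNotesSketch {

  /** Assumption: a URL is ignored when it fully matches any ignoredUrls pattern. */
  static boolean isIgnoredUrl(List<Pattern> ignoredUrls, String url) {
    return ignoredUrls.stream().anyMatch(p -> p.matcher(url).matches());
  }

  /** Keeps only the words that match none of the ignoredWords patterns. */
  static List<String> keepWords(List<Pattern> ignoredWords, List<String> words) {
    return words.stream()
        .filter(w -> ignoredWords.stream().noneMatch(p -> p.matcher(w).matches()))
        .collect(Collectors.toList());
  }

  /** Uses the configured thread count if positive, else the number of CPU cores. */
  static int effectiveParallelism(int configured) {
    return configured > 0 ? configured : Runtime.getRuntime().availableProcessors();
  }

  /** The override wins when present; otherwise parallelism picks the crawler. */
  static String chooseImplementation(String override, int parallelism) {
    if (override != null && !override.isEmpty()) {
      return override;
    }
    return parallelism > 1 ? "ParallelWebCrawler" : "SequentialWebCrawler";
  }

  public static void main(String[] args) {
    // The JSON string "http://example\\.com/.*" decodes to the regex http://example\.com/.*
    List<Pattern> ignoredUrls = List.of(Pattern.compile("http://example\\.com/.*"));
    List<Pattern> ignoredWords = List.of(Pattern.compile("^.{1,3}$"));

    System.out.println(isIgnoredUrl(ignoredUrls, "http://example.com/foo"));
    System.out.println(keepWords(ignoredWords, List.of("the", "parallel", "web", "crawler")));
    System.out.println(chooseImplementation("", effectiveParallelism(4)));
  }
}
```

Under these assumptions, the sample pattern `^.{1,3}$` drops every word of one to three characters, and the sample `ignoredUrls` pattern matches any URL under `http://example.com/` but not `http://example.com` itself, since that string has no trailing slash.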

Open-Source Third Party Java Libraries

  • jsoup
  • Jackson Project
  • Guice
  • Maven
  • JUnit 5
  • Truth

Takeaway

Overall, this was a fun use case for practicing more complex Java patterns and doing some debugging.


Languages

Java 99.3%, HTML 0.7%