rccoe / crawler

CoeCrawler

This application uses the Play Framework for Java, version 1.2.7, to crawl a single website and display the connections between its links.

Models

Each website is stored individually, with all of its links mapped to it. Link interconnections are also stored, in a join table of links -> other links.
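
As a rough illustration of that shape, here is a minimal in-memory sketch in plain Java. It is not the application's actual model code: `SiteModel`, `LinkNode`, and their methods are hypothetical names, and the `outgoing` set only stands in for the links -> other links join table.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a link entity: one URL plus the links it points to.
class LinkNode {
    final String url;
    // Plays the role of the links -> other links join table.
    final Set<LinkNode> outgoing = new LinkedHashSet<>();

    LinkNode(String url) { this.url = url; }
}

// Hypothetical stand-in for a website entity that owns all of its links.
class SiteModel {
    final String rootUrl;
    private final Map<String, LinkNode> links = new LinkedHashMap<>();

    SiteModel(String rootUrl) { this.rootUrl = rootUrl; }

    // Returns the existing link for this URL, or registers a new one.
    LinkNode link(String url) {
        return links.computeIfAbsent(url, LinkNode::new);
    }

    // Records a directed connection; both ends are registered on this site.
    void connect(String fromUrl, String toUrl) {
        link(fromUrl).outgoing.add(link(toUrl));
    }

    int linkCount() { return links.size(); }
}
```

Because each URL maps to exactly one `LinkNode`, two pages that link to each other simply hold references to one another, which is the cyclic case the join table has to allow.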

Crawling

I decided to use crawler4j as my crawling library, as it has a good amount of documentation and an apparent user base. Unfortunately, this had several drawbacks. The concurrency built into the application isn't great and relies on lots of 'sleep' calls to avoid deadlocks. Crawling is also synchronous, so the application currently freezes up while crawling a website. A future task is to implement asynchronous crawling: crawling should be kicked off by a Promise from the Application controller, and the results saved to the database either while crawling or once it has finished.
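
The asynchronous flow described above might look roughly like the sketch below. It uses the JDK's `CompletableFuture` as a stand-in for a Play Promise so that it runs without the framework. `AsyncCrawlSketch`, `start`, and the `crawl` and `save` hooks are hypothetical names, and this variant saves results only once the crawl has finished.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

class AsyncCrawlSketch {
    // Single background thread, so the caller (e.g. a controller action) is not blocked.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Starts the crawl off the calling thread and saves the results once it finishes.
    // `crawl` and `save` are hypothetical hooks for the crawler and the database layer.
    CompletableFuture<Integer> start(String rootUrl,
                                     Function<String, List<String>> crawl,
                                     Consumer<List<String>> save) {
        return CompletableFuture
                .supplyAsync(() -> crawl.apply(rootUrl), worker)
                .thenApply(found -> {
                    save.accept(found);
                    return found.size(); // number of links found and saved
                });
    }

    // Lets the background thread exit once queued work is done.
    void shutdown() { worker.shutdown(); }
}
```

The caller gets the future back immediately and can render a "crawl in progress" page, then read the link count when the future completes. Saving while the crawl is still running would instead mean calling `save` from inside the crawler's per-page callback.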

Visualization

The d3.js hierarchical edge bundling example was an appropriate one to use for this type of visualization: a directed graph (digraph) with the possibility of cyclic links. A regular sitemap or tree structure would make interlinking very difficult to represent.
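
To make the data side of this concrete, the sketch below serializes a link graph as a flat list of nodes, each with a `name` and an `imports` array. This is the shape the published D3 hierarchical edge bundling example reads; whether this application produces exactly that shape is an assumption. `EdgeBundleJson` and its methods are hypothetical, and the escaping is deliberately minimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EdgeBundleJson {
    // Minimal escaping: handles only backslashes and double quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Emits one {"name": ..., "imports": [...]} object per page.
    static String toJson(Map<String, List<String>> outgoing) {
        List<String> nodes = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : outgoing.entrySet()) {
            List<String> targets = new ArrayList<>();
            for (String t : e.getValue()) {
                targets.add(quote(t));
            }
            nodes.add("{\"name\":" + quote(e.getKey())
                    + ",\"imports\":[" + String.join(",", targets) + "]}");
        }
        return "[" + String.join(",", nodes) + "]";
    }
}
```

A flat list like this can hold cycles without trouble: if `/` links to `/a` and `/a` links back to `/`, each page just names the other in its `imports`, which a nested tree could not express.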


Languages

Java 86.6%, CSS 13.4%