duytd / blackspider

A lightweight Scala web crawler and news classifier

# Blackspider

A lightweight crawler and news classifier

## Blackspider components

### Crawler

Gets links (nodes), builds the edges between them, and downloads web documents.
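As a rough illustration of the link/edge step, the sketch below pulls absolute links out of already-downloaded HTML and turns them into directed edges. It is not taken from the Blackspider source: `LinkGraphSketch`, `extractLinks`, and `edges` are hypothetical names, and the regex is a simplification that a real crawler would likely replace with an HTML parser.

```scala
import scala.util.matching.Regex

object LinkGraphSketch {
  // Assumption: only absolute http(s) links in quoted href attributes are of interest.
  private val Href: Regex = """href=["'](https?://[^"'#\s]+)""".r

  /** Collects the distinct absolute link targets found in one HTML document. */
  def extractLinks(html: String): Set[String] =
    Href.findAllMatchIn(html).map(_.group(1)).toSet

  /** Builds directed edges (source URL -> target URL) from page URL -> HTML pairs. */
  def edges(pages: Map[String, String]): Set[(String, String)] =
    pages.toSeq.flatMap { case (url, html) =>
      extractLinks(html).map(target => url -> target)
    }.toSet
}
```

Downloading is left out on purpose, so the sketch stays free of network access and can be exercised on plain strings.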

### Indexer

Indexes web documents to speed up search queries.
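One common way to speed up search is an inverted index, which maps each term to the documents that contain it. The sketch below assumes that approach; the README does not say which index structure Blackspider actually uses, and all names here (`InvertedIndexSketch`, `build`, `search`) are illustrative.

```scala
object InvertedIndexSketch {
  /** Lowercases the text and splits it on anything that is not a letter or digit. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  /** Maps every term to the set of document ids in which it occurs. */
  def build(docs: Map[Int, String]): Map[String, Set[Int]] =
    docs.toSeq
      .flatMap { case (id, text) => tokenize(text).distinct.map(term => term -> id) }
      .groupBy(_._1)
      .map { case (term, postings) => term -> postings.map(_._2).toSet }

  /** Returns the ids of documents containing every query term (AND semantics). */
  def search(index: Map[String, Set[Int]], query: String): Set[Int] = {
    val postings = tokenize(query).map(t => index.getOrElse(t, Set.empty[Int]))
    if (postings.isEmpty) Set.empty[Int] else postings.reduce(_ intersect _)
  }
}
```

A query then only touches the posting sets of its own terms instead of scanning every document.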

### Ranker

Ranks documents using the PageRank algorithm.
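For readers unfamiliar with PageRank, here is a minimal power-iteration sketch over a link graph. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are conventional choices assumed for this example, not settings confirmed by the Blackspider source; `PageRankSketch` is a hypothetical name.

```scala
import scala.collection.mutable

object PageRankSketch {
  /** links: page -> set of pages it links to. Returns page -> rank (ranks sum to 1). */
  def pageRank(
      links: Map[String, Set[String]],
      damping: Double = 0.85,
      iterations: Int = 50
  ): Map[String, Double] = {
    val nodes = (links.keySet ++ links.values.flatten).toSeq
    val n = nodes.size
    if (n == 0) Map.empty[String, Double]
    else {
      var ranks: Map[String, Double] = nodes.map(v => v -> 1.0 / n).toMap
      for (_ <- 1 to iterations) {
        // Rank held by pages with no out-links is spread evenly over all pages.
        val dangling = nodes
          .filter(v => links.getOrElse(v, Set.empty[String]).isEmpty)
          .map(ranks)
          .sum
        // Each page passes its rank in equal shares to the pages it links to.
        val incoming = mutable.Map.empty[String, Double].withDefaultValue(0.0)
        for ((src, outs) <- links; if outs.nonEmpty; dst <- outs)
          incoming(dst) += ranks(src) / outs.size
        ranks = nodes.map { v =>
          v -> ((1 - damping) / n + damping * (incoming(v) + dangling / n))
        }.toMap
      }
      ranks
    }
  }
}
```

The edges produced by a crawler can be grouped by source URL to obtain the `links` map this function expects.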

### News Monitor

Monitors the news source and picks up the latest news, either by re-crawling or through RSS.

### Tokenizer

Extracts features (tokens) from web documents for classification.
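A tokenizer of this kind might look roughly like the sketch below: strip markup, lowercase, split on non-alphanumeric characters, drop stop words, and count what remains. The stop-word list, the regex-based tag stripping, and the names `TokenizerSketch`, `tokens`, and `termFrequencies` are assumptions made for illustration only.

```scala
object TokenizerSketch {
  // Assumption: a tiny English stop-word list; a real list would be much longer.
  private val StopWords = Set("a", "an", "and", "the", "of", "in", "to", "is")

  /** Strips tags, lowercases, splits on non-alphanumerics, and drops stop words. */
  def tokens(html: String): Seq[String] =
    html
      .replaceAll("<[^>]*>", " ")
      .toLowerCase
      .split("[^a-z0-9]+")
      .filter(t => t.nonEmpty && !StopWords.contains(t))
      .toSeq

  /** Counts how often each token occurs; these counts can serve as features. */
  def termFrequencies(html: String): Map[String, Int] =
    tokens(html).groupBy(identity).map { case (t, occ) => t -> occ.size }
}
```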

### Classifier

Classifies newly crawled web pages using the Naïve Bayes algorithm.
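The following is a minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing, working on token sequences such as those a tokenizer would produce. The README does not state which Naïve Bayes variant or smoothing Blackspider uses, so both are assumptions, as are the names `NaiveBayesSketch`, `Model`, `train`, and `classify`.

```scala
object NaiveBayesSketch {
  /** Per-label document counts, per-label word counts, and the overall vocabulary. */
  final case class Model(
      docCounts: Map[String, Int],
      wordCounts: Map[String, Map[String, Int]],
      vocabulary: Set[String]
  )

  /** samples: (label, tokens) pairs, one per training document. */
  def train(samples: Seq[(String, Seq[String])]): Model = {
    val byLabel = samples.groupBy(_._1)
    val docCounts = byLabel.map { case (label, docs) => label -> docs.size }
    val wordCounts = byLabel.map { case (label, docs) =>
      label -> docs.flatMap(_._2).groupBy(identity).map { case (w, occ) => w -> occ.size }
    }
    Model(docCounts, wordCounts, samples.flatMap(_._2).toSet)
  }

  /** Picks the label with the highest log posterior. Requires a non-empty model. */
  def classify(model: Model, tokens: Seq[String]): String = {
    val totalDocs = model.docCounts.values.sum.toDouble
    model.docCounts.keys.maxBy { label =>
      val counts = model.wordCounts(label)
      val totalWords = counts.values.sum
      val logPrior = math.log(model.docCounts(label) / totalDocs)
      // Add-one smoothing keeps unseen words from zeroing out a label.
      val logLikelihood = tokens.map { t =>
        math.log((counts.getOrElse(t, 0) + 1.0) / (totalWords + model.vocabulary.size))
      }.sum
      logPrior + logLikelihood
    }
  }
}
```

Working in log space avoids numeric underflow when a page contains many tokens.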

## Blackspider architecture

[Figure: Overall Architecture]

License: GNU General Public License v3.0


Languages: Scala 94.1%, Java 5.9%