# Wikipedia crawler

A basic search engine for Wikipedia, though it could probably be adapted to other sites as well.
## Introduction

This is a small project that aims to create a little search engine. It began as a homework project.
## How it works

There are four main parts to this project:

- **Crawler**: goes over all the pages of the specified site.
- **Indexer**: creates a list of all the words that appear in the crawled pages.
- **Pageranker**: looks over the link hierarchy of the downloaded pages, sorts them by popularity (meaning the most referenced pages are the most popular ones), and saves that list.
- **Search engine**: loads the data produced by the indexer and pageranker and lets you search for a word (in some distant future it may even be a phrase!), spitting out a list of articles where that word appears, sorted by the popularity computed by the pageranker.
## How to build it

To build this project, you need a POSIX system (well, there are some bugs that could make it crash on a non-Linux machine), with `wget`, `make`, and some C++11 compiler installed.

By default the compiler is set to `clang++`, but you can change it to `g++` or some other compiler in the first line of the `Makefile`.

When these requirements are met, you can simply issue the `make` command in the root of this project and have everything built for you. If you want to remove everything that was generated by `make`, run `make clean`.
## How to run it

After you have built the project using `make`, there will be four new files: `crawler`, `indexer`, `pageranker`, and `find`. These are the programs that you need to run.

Currently, all the paths are hardcoded in the four main files: `src/main.cpp`, `src/IndexerUtility.cpp`, `src/PagerRankerUtility.cpp`, and `src/SearchUtility.cpp`.
## Some random facts

- By default the crawler will download http://simple.wikipedia.org, which contains 150 000 pages. That takes around 3 hours on my not-fast-at-all internet connection.
- Crawling and indexing are done multithreadedly (I am not sure that there is such a word...) using pthreads.
- To download a page I am currently using a `system("wget ...")` call (that's gross, I know).
- Search results currently do not take into account how often the word being searched for appears in the page.
- HTML pages are processed using a simple HTML parser that I've written. You can see how it works in `src/HTMLContent.h` and `src/HTMLTag.h`.
## Future features

- Automated file updating and reindexing.
- Improved search that will also look at the contents of the file.
- `system("wget ...")` replaced with some library like curl.