# Wikipedia crawler

A basic search engine for Wikipedia, though it could probably be adapted to other sites as well.
## Introduction

This is a small project that aims to create a little search engine. It began as a homework project.
## How it works

There are four main parts to this project:

- **Crawler**: goes over all the pages of the specified site.
- **Indexer**: creates a list of all the words that appear in the crawled pages.
- **Pageranker**: looks over the link hierarchy of the downloaded pages, sorts them by popularity (meaning the most referenced pages are the most popular ones), and saves that list.
- **Search engine**: loads the data produced by the indexer and pageranker and lets you search for a word (in some distant future it may even be a phrase!), spitting out a list of articles where that word appears, sorted by the popularity computed by the pageranker.
## How to build it

To build this project, you need a POSIX system (well, there are some bugs that could make it crash on a non-Linux machine), with `wget`, `make`, and some C++11 compiler installed.

By default the compiler is set to `clang++`, but you can change it to `g++` or some other compiler in the first line of the `Makefile`.

When these requirements are met, you can simply issue the `make` command in the root of this project and have everything built for you. If you want to remove everything that was generated by `make`, run `make clean`.
## How to run it

After you have built the project using `make`, there will be four new files: `crawler`, `indexer`, `pageranker`, and `find`. These are the programs that you need to run.

Currently, all the paths are hardcoded in the four main files: `src/main.cpp`, `src/IndexerUtility.cpp`, `src/PagerRankerUtility.cpp`, and `src/SearchUtility.cpp`.
## Some random facts

- By default the crawler will download http://simple.wikipedia.org, which contains 150 000 pages. That takes around 3 hours on my not-fast-at-all internet connection.
- Crawling and indexing are done multithreadedly (I am not sure that there is such a word...) using pthreads.
- To download a page I am currently using a `system("wget ...")` call (that's gross, I know).
- Search results currently do not take into account how often the word being searched for appears in the page.
- HTML pages are processed using a simple HTML parser that I've written. You can see how it works in `src/HTMLContent.h` and `src/HTMLTag.h`.
## Future features

- Automated file updating and reindexing.
- Improved search that will also look at the contents of the file.
- `system("wget ...")` replaced with some library like curl.