sanyamss99 / wiki_search_engine

Wikipedia search engine. A search engine built on inverted indexes created with k-way merge sort, with relevance ranking by tf-idf scores.

wiki_search_engine

Wikipedia search engine

Sample files: https://drive.google.com/drive/folders/1gsfHWcmdHu4mQ7SxZYK3hpI-AkTfpbHx?usp=sharing

Wikipedia dump file link: https://dumps.wikimedia.org/enwiki/latest/

It is a search engine that builds its index with k-way merge sort and ranks results by relevance using tf-idf scores.
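As a rough illustration of the tf-idf ranking mentioned above, here is a minimal in-memory sketch. The function names (`build_index`, `rank`), the whitespace tokenizer, and the log-scaled tf weighting are assumptions made for the example; the README does not say which tf-idf variant the repository uses.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}. Hypothetical helper."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Return doc ids ordered by summed tf-idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the index, contributes nothing
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            # Assumed weighting: log-scaled term frequency times idf.
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: "sachin tendulkar indian cricketer",
    2: "indian premier league cricket",
    3: "python programming language",
}
index = build_index(docs)
ranked = rank("indian cricketer", index, len(docs))
print(ranked)
```

Document 1 matches both query terms, and "cricketer" is rarer than "indian", so it ranks ahead of document 2, which matches only the more common term.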

Challenges

-> The Wikipedia data dump is 75+ GB, which makes it difficult to process.
-> The words and their posting lists cannot all be held in main memory, so k-way merge sort is used to build the index.
-> The full final index cannot be loaded into main memory, so a secondary index is built on top of the primary index (posting lists).
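The last two points can be sketched roughly as follows. This is only an illustration under stated assumptions: the run-file format (`term doc1,doc2` per line, sorted by term), the function names `merge_runs` and `lookup`, and the `every` sampling parameter are all hypothetical and not taken from the repository.

```python
import bisect
import heapq
import itertools
import os
import tempfile


def term_of(line):
    """First whitespace-separated field of an index line."""
    return line.split(" ", 1)[0]


def merge_runs(run_paths, index_path, every=2):
    """K-way merge sorted run files into one primary index.

    Returns a sparse secondary index: (term, byte offset) for every
    `every`-th term written to the primary index.
    """
    files = [open(p, encoding="utf-8") for p in run_paths]
    secondary = []
    try:
        # heapq.merge reads lazily, holding one line per run in memory.
        merged = heapq.merge(*files, key=term_of)
        with open(index_path, "wb") as out:
            groups = itertools.groupby(merged, key=term_of)
            for written, (term, group) in enumerate(groups):
                postings = ",".join(l.split(" ", 1)[1].strip() for l in group)
                if written % every == 0:
                    secondary.append((term, out.tell()))
                out.write(f"{term} {postings}\n".encode("utf-8"))
    finally:
        for f in files:
            f.close()
    return secondary


def lookup(term, index_path, secondary):
    """Find a term by seeking to the nearest preceding secondary entry."""
    keys = [t for t, _ in secondary]
    i = bisect.bisect_right(keys, term) - 1
    if i < 0:
        return []
    with open(index_path, "rb") as f:
        f.seek(secondary[i][1])
        for raw in f:
            word, postings = raw.decode("utf-8").rstrip("\n").split(" ", 1)
            if word == term:
                return postings.split(",")
            if word > term:
                break  # passed the place where the term would be
    return []


def write_run(directory, name, lines):
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    runs = [
        write_run(tmp, "run0.txt", ["apple 1,4", "cricket 2", "india 1"]),
        write_run(tmp, "run1.txt", ["cricket 7", "india 9,12", "zebra 8"]),
    ]
    index_path = os.path.join(tmp, "index.txt")
    secondary = merge_runs(runs, index_path, every=2)
    india_postings = lookup("india", index_path, secondary)
    cricket_postings = lookup("cricket", index_path, secondary)
    missing = lookup("banana", index_path, secondary)

print(secondary, india_postings, cricket_postings, missing)
```

Only the small secondary list has to stay in memory at query time; each lookup seeks into the primary index file and scans at most `every` lines.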

Phase 1

-> Uses the title, infobox and category of each page to build the indexes.
-> Returns the titles of the matching Wikipedia pages.
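One way the three phase 1 fields could be pulled out of a page's wikitext is sketched below. The helper names (`extract_infobox`, `extract_fields`) and the sample page are made up for illustration, and the parsing is deliberately simplified; it is not the repository's actual extraction code.

```python
import re

# Category links look like [[Category:Name]] or [[Category:Name|sort key]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")


def extract_infobox(wikitext):
    """Return the body of the first {{Infobox ...}} template, if any.

    Counts brace pairs so that nested templates stay inside the infobox.
    """
    start = wikitext.find("{{Infobox")
    if start == -1:
        return ""
    depth, i = 0, start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return wikitext[start + 2:]  # unbalanced braces: keep the remainder


def extract_fields(title, wikitext):
    """Collect the fields that phase 1 indexes: title, infobox, category."""
    return {
        "title": title,
        "infobox": extract_infobox(wikitext),
        "category": CATEGORY_RE.findall(wikitext),
    }


sample = """{{Infobox cricketer
| name = Sachin Tendulkar
| birth_place = {{nowrap|Mumbai, India}}
}}
'''Sachin Tendulkar''' is a former cricketer.
[[Category:Indian cricketers]]
[[Category:1973 births|Tendulkar]]
"""
fields = extract_fields("Sachin Tendulkar", sample)
print(fields["category"])
```

Keeping the fields separate leaves room to weight a title match differently from an infobox or category match, although the README does not say whether the project does this.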

TODO: Phase 2

-> Use the body (text) of the Wikipedia pages as well to improve relevance.

Phase 1 sample result

[Image: Screenshot (73)]

Languages

Python 100.0%