Wikipedia Search Engine
Sample files: https://drive.google.com/drive/folders/1gsfHWcmdHu4mQ7SxZYK3hpI-AkTfpbHx?usp=sharing
Wikipedia dump file link: https://dumps.wikimedia.org/enwiki/latest/
This is a search engine that builds its index with k-way merge sort and ranks results by relevance using TF-IDF scores.
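A minimal sketch of the TF-IDF scoring idea (function and variable names here are illustrative, not taken from the project's code):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, postings, total_docs):
    """Rank documents for a query by summed TF-IDF weight per term.

    `postings` maps a term to {doc_id: term_frequency}; this structure is
    an assumption standing in for the project's on-disk posting lists.
    """
    scores = Counter()
    for term in query_terms:
        docs = postings.get(term, {})
        if not docs:
            continue
        # Inverse document frequency: rarer terms weigh more
        idf = math.log(total_docs / len(docs))
        for doc_id, tf in docs.items():
            # Log-scaled term frequency times IDF
            scores[doc_id] += (1 + math.log(tf)) * idf
    return scores.most_common()  # highest-scoring documents first

postings = {
    "wikipedia": {1: 3, 2: 1},
    "search": {1: 1, 3: 2},
}
print(tf_idf_scores(["wikipedia", "search"], postings, total_docs=3))
```

Document 1 ranks first here because it matches both query terms, which is the behavior TF-IDF ranking is meant to produce.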
-> Processing such a huge data dump (75+ GB) is difficult.
-> The words and their posting lists cannot all be held in main memory, so k-way merge sort is used: sorted partial indexes are written to disk and then merged into the final index.
-> The full final index also cannot be loaded into main memory, so a secondary index is built on top of the primary index (the posting lists).
-> The title, infobox, and category fields are used to build the indexes.
-> Queries return the title of the matching Wikipedia page.
-> Future work: also index the body text of Wikipedia pages to improve relevance.
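The k-way merge step above can be sketched with Python's `heapq.merge`, which lazily interleaves sorted streams without loading them all into memory (the record format and names are assumptions, not the project's actual layout):

```python
import heapq

def merge_partial_indexes(runs):
    """K-way merge of sorted (term, postings) streams into one index.

    Each run is an iterable of (term, {doc_id: tf}) pairs sorted by term,
    standing in for a sorted on-disk partial index. Equal terms from
    different runs have their postings combined.
    """
    merged = []
    current_term, current_postings = None, {}
    # heapq.merge interleaves the runs by term, reading them lazily
    for term, postings in heapq.merge(*runs, key=lambda pair: pair[0]):
        if term != current_term:
            if current_term is not None:
                merged.append((current_term, current_postings))
            current_term, current_postings = term, {}
        current_postings.update(postings)
    if current_term is not None:
        merged.append((current_term, current_postings))
    return merged

run1 = [("apple", {1: 2}), ("cat", {1: 1})]
run2 = [("apple", {2: 5}), ("dog", {2: 3})]
print(merge_partial_indexes([run1, run2]))
# [('apple', {1: 2, 2: 5}), ('cat', {1: 1}), ('dog', {2: 3})]
```

In the real system the runs would be file iterators and the merged output would be streamed back to disk, so memory use stays bounded by the number of runs, not the index size.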
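The secondary-index idea can be sketched as a small in-memory table holding the first term of each on-disk block of the primary index; a binary search over that table picks the one block to read (the block structure here is an assumption for illustration):

```python
import bisect

# Secondary index: first term of each primary-index block, small enough
# to keep in memory. The dicts below stand in for on-disk blocks.
block_first_terms = ["apple", "house", "orange"]
blocks = [
    {"apple": [1, 4], "cat": [2]},
    {"house": [3], "moon": [1, 2]},
    {"orange": [4], "zebra": [5]},
]

def lookup(term):
    """Locate the block that could contain `term`, then search only it."""
    # Last block whose first term is <= `term`
    i = bisect.bisect_right(block_first_terms, term) - 1
    if i < 0:
        return None  # term sorts before every block
    return blocks[i].get(term)

print(lookup("moon"))   # [1, 2]
print(lookup("apple"))  # [1, 4]
```

Only the small table of first terms stays resident; each query touches a single primary-index block on disk instead of the whole index.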