aminhbl / news-search-engine

Search engine to retrieve news articles with positional indexing and vector space query processing.


News Search Engine

Search engine for retrieving Persian news articles
Project for an Information Retrieval course

Overview

Searching a large corpus for news related to a specific phrase or set of words can be a headache. This project applies Information Retrieval concepts to build a search engine that uses both a Positional Indexing model and a Vector Space model over a corpus of 12k Persian news articles. It retrieves documents for phrase queries and must_not queries.

The Positional Indexing model includes the following components:

  • Preprocessing module for news articles
  • Positional indexing module
  • Query processing module
  • Graphs for Zipf's law and Heaps' law
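
As a rough illustration of how positional indexing supports phrase and must_not queries, here is a minimal sketch. It is not the repository's code: the function names (`build_positional_index`, `phrase_query`, `search`) and the toy English tokens are assumptions made for this example, and it presumes documents have already been preprocessed into token lists.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: list of tokens}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return ids of documents in which the phrase terms occur consecutively."""
    if not phrase:
        return set()
    first, rest = phrase[0], phrase[1:]
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            # Every following term must appear at the next consecutive position.
            if all(
                start + offset + 1 in index.get(term, {}).get(doc_id, ())
                for offset, term in enumerate(rest)
            ):
                hits.add(doc_id)
                break
    return hits


def search(index, phrase, must_not=()):
    """Phrase query minus every document that contains a must_not term."""
    excluded = set()
    for term in must_not:
        excluded |= set(index.get(term, {}))
    return phrase_query(index, phrase) - excluded


# Hypothetical, already-tokenized documents (illustrative only).
docs = {
    1: ["oil", "price", "rises"],
    2: ["price", "of", "oil"],
    3: ["oil", "price", "falls", "sharply"],
}
index = build_positional_index(docs)
matches = search(index, ["oil", "price"], must_not=["falls"])
```

Document 2 contains both words but not as an adjacent phrase, so it is not returned; document 3 matches the phrase but is dropped by the must_not term.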

The Vector Space model includes the following components:

  • TF-IDF weights for each document, calculated from the positional index built earlier
  • Similarity module that computes the cosine similarity between each document vector and the query vector
  • Implementation of Index Elimination and Champion Lists to reduce query processing time
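
The components above can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the project's implementation: it assumes logarithmic term frequency with base-10 IDF, treats index elimination as simply restricting candidates to documents that appear in the champion lists of the query terms, and uses hypothetical names (`tfidf_vectors`, `champion_lists`, `rank`) and a hypothetical list size `r`.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: {doc_id: tokens}. Returns ({doc_id: {term: weight}}, idf)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log10(n / count) for term, count in df.items()}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def champion_lists(vectors, r=2):
    """For each term, keep only the r documents with the highest weight."""
    postings = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            postings[term].append((weight, doc_id))
    return {
        term: [doc_id for _, doc_id in sorted(entries, reverse=True)[:r]]
        for term, entries in postings.items()
    }


def rank(query_tokens, vectors, idf, champions, k=3):
    """Score only champion-list candidates and return the top k doc ids."""
    q_tf = Counter(t for t in query_tokens if t in idf)
    q_vec = {t: (1 + math.log10(c)) * idf[t] for t, c in q_tf.items()}
    # Index elimination (simplified): skip documents outside the champion lists.
    candidates = {d for t in q_vec for d in champions.get(t, [])}
    scored = sorted(((cosine(q_vec, vectors[d]), d) for d in candidates), reverse=True)
    return [doc_id for score, doc_id in scored[:k] if score > 0]


# Hypothetical, already-tokenized documents (illustrative only).
docs = {
    1: ["oil", "price", "rises"],
    2: ["football", "match", "tonight"],
    3: ["oil", "exports", "grow"],
}
vectors, idf = tfidf_vectors(docs)
champions = champion_lists(vectors, r=2)
top = rank(["oil", "price"], vectors, idf, champions)
```

Because only documents in the champion lists of the query terms are scored, the cost of a query depends on `r` and the number of query terms rather than on the size of the corpus. The trade-off is that a relevant document outside those lists can be missed.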

Elasticsearch

To see the implementation of this project using Elasticsearch you can check here.


Languages

Language:Python 100.0%