anjaleeps / female_singers_sinhala_search_engine

Female Singers Sinhala Search Engine

This repository contains the code for a Sinhala-language search engine for female singers. The system is built with Python, Elasticsearch, Flask, and JavaScript.

Overview

The system accepts Sinhala-language search queries through a web interface and queries the female singer data indexed in Elasticsearch to retrieve and return the singer data relevant to each query. Each indexed document stores the following 10 fields.

  • Name
  • Personal information
  • Career information
  • Discography
  • Awards
  • Summary
  • Birthday
  • Active period
  • Genre
  • URL of the related Wikipedia page

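As a rough illustration of how these 10 fields could be declared in Elasticsearch, the sketch below builds an index body with one mapping entry per field. The field names, the choice of `text` and `keyword` types, and the helper name `build_index_body` are assumptions made for this example; the repository's actual mapping may differ.

```python
# Hypothetical field names for the 10 fields listed above.
TEXT_FIELDS = [
    "name",
    "personal_info",
    "career_info",
    "discography",
    "awards",
    "summary",
    "birthday",
    "active_period",
    "genre",
]


def build_index_body():
    """Return an index body that declares every field of a singer document."""
    # Assumption: translated Sinhala content is stored as analyzed full text.
    properties = {field: {"type": "text"} for field in TEXT_FIELDS}
    # Assumption: the Wikipedia URL is stored verbatim and never tokenized.
    properties["url"] = {"type": "keyword"}
    return {"mappings": {"properties": properties}}


index_body = build_index_body()
```

The resulting dictionary has the shape expected in the body of an Elasticsearch create-index request.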
Scraping data

The singer data were scraped from Wikipedia using the Wikipedia API and an npm module named wtf_wikipedia. The latter retrieves the text under each section and strips unnecessary components such as links and references. Once scraped, the text was translated into Sinhala using the Google Translate library provided by Google Cloud. The data_collector.js file contains the scraper code; data_original.csv and data_preprocessed.json contain the original and preprocessed data, respectively.

Processing search query

Sinhala-language search queries submitted through the web interface are processed by query_processor.py to retrieve the matching results from the Elasticsearch index. A query first goes through an initial preprocessing step that removes stop words and punctuation. The next step is intent classification, which assigns every query to one of two types:

  • Exact phrase search queries, e.g. ග්‍රැමී දිනා ඇත ("has won a Grammy")
  • Multi match search queries, e.g. එමිලි බාර්කර්ගේ ප්‍රසිද්ධ ගීත ("Emily Barker's famous songs"), ඇඩෙල්ගේ දරුවාගේ නම කුමක්ද? ("What is the name of Adele's child?")
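The initial preprocessing step (stop word and punctuation removal) might look like the minimal sketch below. The stop word set and the function name `preprocess` are illustrative assumptions, not taken from query_processor.py, and only ASCII punctuation is handled here.

```python
import string

# Hypothetical stop words; the real list used by the project is not shown here.
STOP_WORDS = {"කුමක්ද", "සහ"}

# Map every ASCII punctuation character to a space so tokens stay separated.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(query, stop_words=STOP_WORDS):
    """Strip punctuation, split on whitespace, and drop stop words."""
    cleaned = query.translate(_PUNCT_TABLE)
    return [token for token in cleaned.split() if token not in stop_words]


tokens = preprocess("ඇඩෙල්ගේ දරුවාගේ නම කුමක්ද?")
```

With the assumed stop word set, the question word and the trailing question mark are removed and the remaining three tokens are passed on to the next step.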

Finally, an Elasticsearch DSL query matching the classified query type is executed to retrieve the relevant documents from the index.
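To make the last step concrete, here is a hedged sketch of how the two query types could be turned into Elasticsearch DSL bodies. It uses the standard `multi_match` query with `type` set to `phrase` for exact phrase searches and `best_fields` otherwise. The intent labels, the field names, and the function `build_query` are assumptions for illustration; the queries in query_processor.py may be structured differently.

```python
# Hypothetical names of the full-text fields to search.
SEARCH_FIELDS = [
    "name",
    "personal_info",
    "career_info",
    "discography",
    "awards",
    "summary",
]


def build_query(text, intent, fields=SEARCH_FIELDS):
    """Build a search body for one of the two (assumed) intent labels."""
    if intent == "exact_phrase":
        # All terms must appear together, in order, within a single field.
        match_type = "phrase"
    elif intent == "multi_match":
        # Terms may match independently; the best-scoring field wins.
        match_type = "best_fields"
    else:
        raise ValueError(f"unknown intent: {intent!r}")
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "type": match_type,
            }
        }
    }


phrase_body = build_query("ග්‍රැමී දිනා ඇත", "exact_phrase")
```

The returned dictionary can be sent as the request body to the index's `_search` endpoint.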

Features of the search engine

  • Query preprocessing (stopword and punctuation removal)
  • Intent classification
  • Support for synonyms: a similarity matrix, calculated between the query words and a provided set of keywords, lets the system treat synonymous queries such as ඇඩෙල්ගේ මුල් ජීවිතය ("Adele's early life") and ඇඩෙල්ගේ ළමා කාලය ("Adele's childhood") alike
  • Tolerance of spelling errors: the same similarity matching lets the system return good results even when the query contains misspellings
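The similarity-based matching behind the last two features could be sketched as follows. This is an assumption-laden illustration, not the project's implementation: it uses character-level similarity from Python's standard `difflib` module, and the keyword sets, field names, threshold, and function names are all hypothetical. The source does not state which similarity measure the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical keyword sets; each field lists words that should lead to it.
FIELD_KEYWORDS = {
    "personal_info": ["මුල්", "ජීවිතය", "ළමා", "කාලය"],
    "awards": ["සම්මාන"],
}


def similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(tokens, keywords):
    """Rows are query tokens, columns are keywords."""
    return [[similarity(t, k) for k in keywords] for t in tokens]


def best_field(tokens, field_keywords=FIELD_KEYWORDS, threshold=0.7):
    """Return the field whose keywords best match any token, or None."""
    best_name, best_score = None, 0.0
    for name, keywords in field_keywords.items():
        matrix = similarity_matrix(tokens, keywords)
        score = max((max(row) for row in matrix if row), default=0.0)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Because different words ("early life", "childhood") are listed under the same field, synonymous queries resolve to the same target, and because the comparison is approximate rather than exact, a token with a small spelling error can still clear the threshold.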

Languages

  • Python 42.5%
  • JavaScript 40.6%
  • HTML 16.9%