orlovol / key

Search Engine for autosuggesting geography locations

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

key

Key Search engine

KEY is a python3.7 implementation of basic search engine, suitable for using as the autosuggest component. Based on nested dict Trie data structure, and aims to keep low memory footprint:

memory

Features:

  • Loads denormalized geodata records from csv.

  • Optimizes space usage by internally normalizing data via parsing and linking parent entities.

  • Creates parent entities if they are missing in dataset. This allows to quickly fill parent data from the most detailed child entities.

  • Supports flexible lookup by parsing record names into current and old name (specified in parentheses).

  • Supports partial lookup by adding records to DictTrie (HashMap Trie) including all word suffixes.

  • Provides import/export to normalized csv with parent ids.

  • Handles lookups in wrong keyboard layout (e.g. key -> лун).

  • Provides CLI tool for import/export, and interactive query mode.

  • Provides simple Flask backend with search endpoint, and React frontend for fullstack experience.

Live version

Things to improve

  • Explore Suffix automaton (MA-FSA, DAWG), which could be used to compress tree branches, but it's not suitable for data that's changed often. However, a balanced combination could be found for items with less frequent updates.

  • Explore CTrie data structure, which relies on atomic CAS operation. Python can be expanded with atomos, which aims to provide support of said operations.

  • Multiprocess data import/export, grouped by geo type.

  • Serialize and export Trie to speed up initialization process.

  • Optimize Address names and keep them in separate index. Most often they have same name for all languages {street_id : address_numbers}

  • Search results ranking should be added. Results matched by prefix should be shown first, and then all results matched in the middle of the word. One idea is to track what is searched more often and rank those items higher. Second idea is to sort results by geo type, increasing/decreasing area.

  • More intelligent lookups can be added, to account for typos and mixed/wrong cyrillic layouts, transliteration.

  • Database could handle some work, but for simplicity sake all indexes are in memory to simulate some kind of cached storage (which is not used again for simplicity).

  • Pandas for csv could be used, but standard csv module with few tricks is more than enough.

Notable mentions:

About

Search Engine for autosuggesting geography locations


Languages

Language:Python 76.3%Language:JavaScript 12.2%Language:CSS 8.3%Language:HTML 3.0%Language:Shell 0.2%