elyase / locasticsearch

Serverless full text search in Python

Locasticsearch

⚠️ dormant status: 🚧 🚧

Locasticsearch provides serverless full text search powered by SQLite's full text search capabilities, while aiming to be compatible with (a subset of) the Elasticsearch API.

That way you can comfortably develop your text search application without needing to set up services, and smoothly transition to Elasticsearch for scale or more features without changing your code.

That said, if you are only doing basic search operations within the subset supported by this library, and don't have so many documents (around a million) that a cluster deployment would be justified, Locasticsearch can be a faster alternative to Elasticsearch.
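To give a feel for what "powered by SQLite full text search" means, here is a minimal sketch that uses the standard library `sqlite3` module with an FTS5 virtual table directly. It is not Locasticsearch's internal code: the table name `tweets` and the helper `search` are made up for illustration, and it assumes your SQLite build has FTS5 enabled.

```python
import sqlite3

# In-memory database; assumes the local SQLite build includes FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tweets USING fts5(author, text)")

docs = [
    ("kimchy", "Elasticsearch: cool. bonsai cool."),
    ("alice", "SQLite ships with a full text search extension."),
]
conn.executemany("INSERT INTO tweets (author, text) VALUES (?, ?)", docs)


def search(query):
    """Return (author, text) rows matching an FTS5 query, best match first."""
    cursor = conn.execute(
        "SELECT author, text FROM tweets WHERE tweets MATCH ? ORDER BY rank",
        (query,),  # bound parameter, never string formatting
    )
    return cursor.fetchall()


print(search("bonsai"))
```

Everything lives in a single SQLite database, which is why no separate search service has to run.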

Getting started

from locasticsearch import Locasticsearch
from datetime import datetime

es = Locasticsearch()

doc = {
    "author": "kimchy",
    "text": "Elasticsearch: cool. bonsai cool.",
    "timestamp": datetime(2010, 10, 10, 10, 10, 10),
}
res = es.index(index="test-index", doc_type="tweet", id=1, body=doc)

res = es.get(index="test-index", doc_type="tweet", id=1)
print(res["_source"])

# this will get ignored in Locasticsearch
es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res["hits"]["total"]["value"])
for hit in res["hits"]["hits"]:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])

We are also adding a simplified API that can be converted to Elasticsearch.

Features

  • 💯% local, no server management
  • ✨ Lightweight pure Python, no external dependencies
  • ⚡ Super fast searches thanks to SQLite full text search capabilities
  • 🔗 No lock in. Thanks to the API compatibility with the official client, you can smoothly transition to Elasticsearch for scale or more features without changing your code.
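One way to take advantage of the API compatibility is to hide the choice of client behind a small factory, so the rest of the application only ever sees an `es` object. This is only a sketch: the function `make_client`, the `SEARCH_BACKEND` environment variable, the backend names and the `http://localhost:9200` URL are all assumptions for illustration, not part of either library.

```python
import os


def make_client(backend=None):
    """Return a search client for the given (hypothetical) backend name."""
    backend = backend or os.environ.get("SEARCH_BACKEND", "local")
    if backend == "local":
        # Imported lazily so the other backend does not need to be installed.
        from locasticsearch import Locasticsearch

        return Locasticsearch()
    if backend == "elasticsearch":
        from elasticsearch import Elasticsearch

        # Assumed address of a locally running Elasticsearch node.
        return Elasticsearch("http://localhost:9200")
    raise ValueError(f"unknown search backend: {backend!r}")
```

With this in place, `es = make_client()` would be the only line that changes behaviour between local development and a cluster, as long as the calls you make stay within the subset both clients support.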

Install

pip install locasticsearch

To use or not to use

You should NOT use Locasticsearch if:

  • You are deploying a security-sensitive application. Locasticsearch code is very prone to SQL injection attacks. This should improve in future releases.
  • Your searches are more complicated than what you would find in a 5 min Elasticsearch tutorial. Elasticsearch has a huge API and it is very unlikely that we can support even a sizable portion of it.
  • You hate buggy libraries. Locasticsearch is a very young project, so bugs are guaranteed. You can check the tests to see if your needs are covered.
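For readers unfamiliar with the SQL injection risk mentioned above, the following generic sketch contrasts a query built by string formatting with one that uses bound parameters. It uses plain `sqlite3` with a made-up `docs` table and helper names; it does not reproduce Locasticsearch's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT)")
conn.executemany("INSERT INTO docs (author) VALUES (?)", [("kimchy",), ("alice",)])


def find_unsafe(author):
    # Vulnerable: user input becomes part of the SQL text itself.
    sql = f"SELECT author FROM docs WHERE author = '{author}'"
    return conn.execute(sql).fetchall()


def find_safe(author):
    # Bound parameter: the input is only ever treated as data.
    return conn.execute(
        "SELECT author FROM docs WHERE author = ?", (author,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_unsafe(payload)))  # the injected OR clause matches every row
print(len(find_safe(payload)))    # no author has this literal name
```

Until the library is hardened, it is safest to pass it only input you trust.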

You should use Locasticsearch if:

  • You don't want a Docker container or an Elasticsearch service using precious resources on your laptop
  • You only need basic text search and Elasticsearch would be overkill
  • You want very easy deployments that only involve pip installs
  • Using Java from a Python program makes you feel dirty

Next steps

  • Add real query DSL parsing
  • Bulk indexing / scan
  • Add a simplified, non ES compatible interface for easy JSON ingestion and querying
  • Document supported vs unsupported query types

Comparison to similar libraries

Some quick thoughts about existing tools, feel free to add/comment:

Whoosh is the most full featured pure Python text search library by far:

  • 👍 Supports highlight, analyzers, query expansion, several ranking functions, ...
  • 👎 Unmaintained for a long time, though it might see a revival at https://github.com/whoosh-community/whoosh
  • 👍 Pure Python, so it doesn't scale as well (still fast enough for small/medium datasets)

Elasticsearch is the big champion of full text search. This is what you should be using in production:

  • 👍 Lots of features to accommodate any use case
  • 👍 Battle tested, scalable, performant
  • 👎 Not Python native: more complex to deploy/integrate with a Python project for easy use cases

tantivy is a good recommendation for local full text search if you don't care about Elasticsearch API compatibility:

  • 👍 Simple to set up and use: pip install tantivy
  • 👍 Fast Rust based engine
  • 👎 DSL/library lock in, no Elasticsearch API

Though not pure Python, pyserini is a good compromise if you want something local and scalable:

  • 👍 Access to Lucene from within Python (via the pyjnius Java bridge)
  • 👍 Serverless / local deployment
  • 👎 DSL/library lock in
  • 👎 Extra Java runtime

Django Haystack provides a unified API that allows you to plug in different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.) without having to modify your code:

  • 👍 Many features: boosting, highlight, autocomplete (some are backend dependent, though)
  • 👍 Possibility to switch backends
  • 👎 DSL/library lock in
  • 👎 Despite supporting several backends, Whoosh is the only one that is Python native.

Xapian:

  • 👍 Very fast and full featured (C++)
  • 👎 Not pip installable (needs system level compilation)
  • 👎 The Python bindings and the documentation are not that user friendly

While gensim focuses on topic modeling, you can use TfidfModel and SparseMatrixSimilarity for text search. That said, this doesn't use an inverted index (it is a linear search), so it has limited scalability.

  • 👍 Unique features such as approximate search
  • 👎 Focus is on topic modeling, so no intuitive APIs for full text ingestion/search
  • 👎 Doesn't support inverted index search (mostly full scan and approximate)
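To make the difference between a linear scan and an inverted index concrete, here is a toy sketch in plain Python. It is a simplified illustration of the two lookup strategies on exact terms, not of how gensim or SQLite are implemented; the sample documents and function names are invented for the example.

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is cool",
    2: "sqlite full text search",
    3: "bonsai is cool",
}


def linear_search(term):
    """Check every document: the cost grows with the size of the collection."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


# Inverted index: map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)


def indexed_search(term):
    """Look the term up directly instead of scanning every document."""
    return sorted(index.get(term, set()))


print(linear_search("cool"), indexed_search("cool"))
```

Both functions return the same documents; the index pays a cost at ingestion time so that each query avoids touching documents that cannot match.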

Peewee is actually a more general ORM, but it offers abstractions to use full text search on SQLite:

  • 👍 Support for full text search using several SQL backends (no Elasticsearch though)
  • 👍 Custom ranking and analyzer functions
  • 👎 No Elasticsearch compatible API

About

License: MIT License


Languages

Python 79.8%, Makefile 12.4%, Shell 7.8%