alephdata / aleph

Search and browse documents and data; find the people and companies you look for.

Home Page: http://docs.aleph.occrp.org

FEATURE: Allow/improve partial search

andresmrm opened this issue · comments

Is your feature request related to a problem? Please describe.
Sometimes the search term appears with no space separating it from an adjacent word (for example nº123 instead of nº 123). In that case a query for 123 finds nothing; I have to search for nº123.

Describe the solution you'd like
I would like to search for 123 and find nº123.

Describe alternatives you've considered
Sometimes using ??123 can help, but not if the number of preceding characters varies.

As discussed in Slack, I managed to query Elasticsearch directly with regex queries. But they were too slow (~3s each) and I needed to query a huge list of terms. So I ended up running regular queries for the most common patterns (~30ms each). For example, in my case the terms generally appear like 0123456789 or 012.345.678-9, so I queried both versions of each term (2 x 30ms = 60ms << 3s). But I gave up on less common cases, like nº123.
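The workaround described above (one cheap exact query per known spelling instead of one slow regex) could be sketched roughly as follows. The function names `term_variants` and `build_queries` are hypothetical, and the 3-3-3-1 digit grouping is only inferred from the 012.345.678-9 example; how the resulting strings are sent to the search backend is left out.

```python
def term_variants(digits: str) -> list[str]:
    """Return the plain and the punctuated spelling of a 10-digit identifier.

    Assumption: identifiers are exactly 10 digits and the punctuated form
    groups them as 3-3-3-1, as in 012.345.678-9.
    """
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected exactly 10 digits")
    punctuated = f"{digits[0:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9]}"
    return [digits, punctuated]


def build_queries(terms: list[str]) -> list[str]:
    """Build one quoted (exact phrase) query per variant of every term."""
    return [f'"{variant}"' for term in terms for variant in term_variants(term)]


if __name__ == "__main__":
    for query in build_queries(["0123456789"]):
        print(query)
```

Each term costs two fast queries this way, but spellings outside the listed patterns, such as nº123, are still missed.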

It may be good to allow regex queries, even if slow, for when you only need to search for a few terms. And, if possible, make regex faster or offer another type of partial match.

Just for context, you can use wildcard and regex queries in Aleph using the Elasticsearch query string syntax.
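As a rough illustration of that syntax, the sketch below builds two Elasticsearch `query_string` request bodies for the 123 example: one with a leading wildcard and one with a regular expression (regex patterns are wrapped in forward slashes and must match the whole indexed term). The helper name `query_string_body` is hypothetical, and whether Aleph passes the expression through unchanged, or whether leading wildcards are enabled on a given cluster, is an assumption.

```python
import json


def query_string_body(expression: str) -> dict:
    """Wrap a query string expression in an Elasticsearch request body."""
    return {"query": {"query_string": {"query": expression}}}


# Wildcard: "*" stands for any run of characters, so this matches terms ending in 123.
wildcard_body = query_string_body("*123")

# Regex: the pattern between the slashes has to match the entire term.
regex_body = query_string_body("/.*123/")

if __name__ == "__main__":
    print(json.dumps(wildcard_body))
    print(json.dumps(regex_body))
```

Unlike ??123, both expressions match regardless of how many characters precede the digits.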

As you already noticed, both wildcard and regex queries are computationally expensive at search time, which makes them slow. While there are options to speed up such queries, these require indexing contents differently (e.g. using n-grams), which usually comes at a significantly higher cost for ingesting and storing the data. This makes it a difficult trade-off.
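To make the trade-off concrete, here is a toy sketch of character n-gram generation in plain Python. It is not Aleph's or Elasticsearch's actual analyzer configuration; the function name `char_ngrams` and the 2 to 4 character range are assumptions chosen for illustration.

```python
def char_ngrams(token: str, min_n: int, max_n: int) -> list[str]:
    """Return every substring of token whose length is between min_n and max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


grams = char_ngrams("nº123", 2, 4)

if __name__ == "__main__":
    # "123" is now one of the indexed pieces, so a plain lookup can find it.
    print("123" in grams)
    # One original token turned into this many index entries.
    print(len(grams))
```

With n-grams indexed, a search for 123 becomes an ordinary term lookup rather than a scan over all terms, but every token is stored as many fragments, which is where the extra ingest and storage cost comes from.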