Casecommons / pg_search

pg_search builds ActiveRecord named scopes that take advantage of PostgreSQL’s full text search

Home Page:http://www.casebook.net

How to search for a substring in a word

jsmartt opened this issue:

Is it possible to configure pg_search to find a substring of a word? For example, I have a column named fqdn, and I want to search for a subdomain.

Here's what I have in my model right now:

pg_search_scope :search_for, against: column_names, using: { tsearch: { prefix: true, negation: true } }

Let's say a record exists with a fqdn of host1.site1.example.com...

  • Model.search_for('host1') returns the result properly
  • Model.search_for('site1') returns nothing. I'd like it to return that record.

I haven't been able to come up with a configuration that works yet. Any help here would be much appreciated. Thanks!

This seems to be one of the most basic search requirements and I couldn't find a way to do it either so far.

negation: true seems useless for this purpose:

If you want to exclude certain words, you can set :negation to true. Then any term that begins with an exclamation point ! will be excluded from the results

prefix: true does not seem to promise what you're striving to achieve either:

full text search matches on whole words by default. If you want to search for partial words, however, you can set :prefix to true

It will match partial words, but only words that begin with your search term.

tsearch's capabilities seem to be limited in this regard, so you'll probably have to use trigram-based search instead. See https://stackoverflow.com/questions/2513501/postgresql-full-text-search-how-to-search-partial-words
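
For reference, here is a minimal sketch of what a trigram-based scope might look like. It assumes the pg_trgm extension is enabled, a pg_search version that supports the word_similarity option, and a hypothetical Host model with an fqdn column. word_similarity is included on the assumption that a short term like site1 could otherwise score below the default similarity threshold when compared against the whole hostname, so check it against your own data.

```ruby
# Hypothetical model; assumes `CREATE EXTENSION pg_trgm;` has been run
# and a pg_search version that supports the word_similarity option.
class Host < ApplicationRecord
  include PgSearch::Model

  pg_search_scope :search_for,
                  against: :fqdn,
                  using: {
                    # Score the query against the best-matching extent of fqdn
                    # rather than against the whole string.
                    trigram: { word_similarity: true }
                  }
end

# Intended usage (not verified here):
# Host.search_for("site1")
```

I haven't run this against the original model, so treat it as a starting point rather than a confirmed fix.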

Yes, tsearch is a bit limited for searching in the middle of a string.

You can use the ts_debug SQL function if you want to figure out how PostgreSQL is parsing your text.

# SELECT * FROM ts_debug('simple', 'test.example.com');
 alias | description |      token       | dictionaries | dictionary |      lexemes
-------+-------------+------------------+--------------+------------+--------------------
 host  | Host        | test.example.com | {simple}     | simple     | {test.example.com}
(1 row)

As the output shows, the default parser, which the built-in simple and english configurations both use, seems to recognize an entire hostname as a single lexeme. I believe this means you would need to match the entire string (or a prefix of it with prefix: true) to get a match.

It may be possible to implement your own parser. You may also want to pre-process your text before indexing it. For example, you could split the hostname on . and store the parts in a separate column.
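
A rough sketch of that pre-processing step in plain Ruby. The helper name fqdn_search_terms is made up for illustration; the idea is just to turn the hostname into space-separated labels so that each label is indexed as its own token.

```ruby
# Hypothetical helper: splits a hostname on "." and joins the labels with
# spaces, so that full text search sees each label as a separate word.
def fqdn_search_terms(fqdn)
  fqdn.to_s.downcase.split(".").reject(&:empty?).join(" ")
end

puts fqdn_search_terms("host1.site1.example.com")
# prints "host1 site1 example com"
```

In a Rails model you could store the result in a separate column (say, a hypothetical fqdn_terms column filled in by a before_save callback) and list that column in against: so that tsearch can match site1 on its own.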

See https://www.postgresql.org/docs/current/textsearch-debugging.html for more details. Essentially we are limited by what the database provides.