typesense / typesense

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

Home Page:https://typesense.org

[Feature Request] Support strict exact highlighting on array elements

DNCHOW1 opened this issue

Description

Hello team! I'm running into an issue with exact search on a string[] field: elements later in the array are also matched on individual tokens of the query. One workaround would be to create a separate document for each array element, but the parent document has useful fields that I'd rather not duplicate. I also think that strict exact highlighting on array elements would be extremely helpful, since it would reduce the amount of data returned over the network.

I'm not sure whether any existing setting already supports this; I've tried a good number of combinations of "num_typos", "prefix", and "tokens_thresholds". If I've missed something, please let me know.

Steps to reproduce

  1. Create a collection with a string[] field
  2. Add a document with any data (e.g. ["this is a test", "b d c a g"])
  3. Perform an exact search ("this is a test")
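
For anyone trying to reproduce this, the three steps could be expressed as the following request payloads. This is only a sketch: the collection name and document id are hypothetical, and wrapping the query in double quotes is an assumption about how the exact search was issued. The client calls in the comments assume the official typesense Python client and a reachable server.

```python
# Hypothetical collection name, chosen only for this reproduction sketch.
COLLECTION = "highlight_repro"

# Step 1: a collection with a single string[] field.
schema = {
    "name": COLLECTION,
    "fields": [{"name": "arraystring", "type": "string[]"}],
}

# Step 2: one document using the sample data from the issue.
document = {"id": "1", "arraystring": ["this is a test", "b d c a g"]}

# Step 3: the query. Double quotes around the phrase are an assumption
# about how the "exact search" was performed.
search_params = {"q": '"this is a test"', "query_by": "arraystring"}

# With a configured typesense.Client instance named `client`, the calls
# would look roughly like this (not executed here):
#   client.collections.create(schema)
#   client.collections[COLLECTION].documents.create(document)
#   client.collections[COLLECTION].documents.search(search_params)
```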

"highlight": {
    "arraystring": [
      {
        "matched_tokens": [
          "this",
          "is",
          "a",
          "test"
        ],
        "snippet": "<mark>this</mark> <mark>is</mark> <mark>a</mark> <mark>test</mark>",
        "value": "<mark>this</mark> <mark>is</mark> <mark>a</mark> <mark>test</mark>"
      },
      {
        "matched_tokens": [
          "a"
        ],
        "snippet": "b d c <mark>a</mark> g",
        "value": "b d c <mark>a</mark> g"
      }
    ]
  }

Expected Behavior

Only the first array element should be matched; the second should not be matched on the "a" token alone.

Actual Behavior

Both elements are returned and highlighted despite the exact-match query.

Metadata

Typesense v26.0
Typesense Cloud

When I add another array element, it is also returned in the highlight even though it has no matched tokens. An alternative solution could be an option to exclude the "highlight" field, as the deprecated "highlights" field actually has a smaller payload.

"highlight": {
    "arraystring": [
      {
        "matched_tokens": [
          "this",
          "is",
          "a",
          "test"
        ],
        "snippet": "<mark>this</mark> <mark>is</mark> <mark>a</mark> <mark>test</mark>",
        "value": "<mark>this</mark> <mark>is</mark> <mark>a</mark> <mark>test</mark>"
      },
      {
        "matched_tokens": [
          "a"
        ],
        "snippet": "b d c <mark>a</mark> g",
        "value": "b d c <mark>a</mark> g"
      },
      {
        "matched_tokens": [],
        "snippet": "testing testing",
        "value": "testing testing"
      }
    ]
  }

If any array element matches the query string, the entire document is returned. Within the matched document, we then highlight any words that are present in the query. So the search happens only at the document level.

We have another issue (#962) open for an option to not return highlights that don't match any word in the query at all. Closing this in favor of that.
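
The behaviour described in this reply can be illustrated with a small toy model. This is not Typesense's implementation, only a sketch of the two stages as described: the document is matched if any array element contains the phrase, and then every element is highlighted word by word against the query. All function names are hypothetical, and tokenization is plain whitespace splitting.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def element_matches_phrase(element, query):
    """True if the element contains the query as a contiguous run of words."""
    words, target = tokenize(element), tokenize(query)
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


def highlight_element(element, query):
    """Mark every word of the element that also appears in the query."""
    query_words = set(tokenize(query))
    matched, pieces = [], []
    for word in element.split():
        if word.lower() in query_words:
            matched.append(word)
            pieces.append(f"<mark>{word}</mark>")
        else:
            pieces.append(word)
    text = " ".join(pieces)
    return {"matched_tokens": matched, "snippet": text, "value": text}


def search(elements, query, field="arraystring"):
    """Stage 1: match at the document level. Stage 2: highlight every element."""
    if not any(element_matches_phrase(el, query) for el in elements):
        return None  # the document is not a hit at all
    return {field: [highlight_element(el, query) for el in elements]}


highlight = search(
    ["this is a test", "b d c a g", "testing testing"], "this is a test"
)
```

Because the second stage runs over every element of a matched document, "b d c a g" picks up a highlight on "a" and "testing testing" is returned with an empty "matched_tokens" list, which mirrors the output reported above.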