meilisearch / meilisearch

A lightning-fast search API that fits effortlessly into your apps, websites, and workflow

Home Page:https://www.meilisearch.com

Increase limit: number of positions (~ words) per attribute

curquiza opened this issue · comments

Related to this tiny spec: meilisearch/specifications#80

The number of positions per attribute is currently limited to 1000.

See the docs page

This limit will be increased to 65 535.

@meilisearch/docs-team: the explanation itself remains unchanged; only the value 1000 should be replaced by 65 535.


TODO:

⚠️ @meilisearch/docs-team
A warning must be added to the changelogs (release and article) saying that the size of the DB (data.ms) can increase between v0.23.0 and v0.24.0 because of this change!
This change is considered impactful for users since it can affect disk usage.

Milli 0.18.0 is out containing this change 🎉
https://github.com/meilisearch/milli/releases/tag/v0.18.0

I was a bit surprised to find this in my evaluation of MeiliSearch. 1000 words is a three-page document. That is a very low limit, and I was wondering what kind of use case MeiliSearch was targeting (conversations?).

65535 seems much more reasonable so I am looking forward to this release!

Hello @remram44!
Thanks for asking this question!
One of the core team developers (@ManyTheFish) already answered it, but in our Slack, so not publicly.
I will copy/paste the Slack thread here:


Question

What sort of effort, code-wise, would it be to remove the 1,000 word field limit? It's the primary reason I stay on Typesense. I know there are workarounds such as splitting into multiple fields, but I'd just like to understand a bit more about the decision to limit it and what sort of architectural changes would be needed to remove it. Thanks.

Answer

I have several answers to this, depending on the view point.

Relevancy

Meilisearch is a search engine: the goal is to return the most relevant documents for a given search request, so we want to keep the most relevant words in each document. The premise is: "the deeper a word is in an attribute, the less relevant it is".

The current version considers that any word positioned after position 1000 is not relevant enough to be taken into account in the search. Because more words mean more noise, raising this limit could lead to a loss of relevancy.

Performance & Memory

Meilisearch has to respond as fast as possible, so we pre-compute a lot of things while indexing documents. Raising this limit will lead to higher disk usage and a longer indexing time. Moreover, because there is more data, the search time could be impacted.

Technical limit

Does Meilisearch have a technical limit?
Yes, but it is not 1000: the real technical limit should be 65535 (the maximum value of a 16-bit unsigned integer). So we can technically raise this limit to 65535 positions per attribute. \o/
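
The 65535 figure is simply the largest value an unsigned 16-bit integer can hold. A minimal Python sketch of that bound follows. It assumes nothing about how Meilisearch actually stores positions, and `fits_in_u16` is a hypothetical helper name used only for illustration.

```python
import struct

# Largest value an unsigned 16-bit integer can hold.
MAX_U16 = 2**16 - 1


def fits_in_u16(position: int) -> bool:
    """Return True if a word position can be stored in 16 unsigned bits."""
    return 0 <= position <= MAX_U16


# struct's "H" format is an unsigned 16-bit integer:
# MAX_U16 packs into two bytes, MAX_U16 + 1 is rejected.
packed = struct.pack("<H", MAX_U16)
try:
    struct.pack("<H", MAX_U16 + 1)
    overflowed = False
except struct.error:
    overflowed = True
```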

The arbitrary limit of 1000

Why do we have this limit of 1000 positions per attribute?
To be honest, we don't have any proof that 1000 is the optimal limit.
That's why we reconsidered it and are raising this limit.

"the deeper a word is in an attribute, the less relevant it is"

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

MeiliSearch seems optimized for a use case that I do not understand (if it even exists). What kind of content has progressively decreasing relevance?

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

If you think this is not the relevancy you expect, you can remove "attribute" from the ranking rules.

This is something that some users need. For example with the following dataset:

[
  { "id": 1, "title": "Harry Potter and the Half-Blood Prince", "description": "A story about a wizard" },
  { "id": 2, "title": "Fantastic Beasts and Where to Find Them", "description": "A movie in the universe of Harry Potter" }
]

If you type harry potter, we consider the first document more relevant than the second document.
But again, it depends on the relevancy you need, and you can customize it by defining your own ranking rules.
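
As a rough illustration of customizing the ranking rules, the sketch below builds a rule list without "attribute". The default order shown is an assumption based on the documented defaults around v0.24 and should be checked against your version. Sending the payload to the ranking-rules settings route is deliberately left out, since the HTTP method and path depend on the Meilisearch version. `without_rule` is a hypothetical helper.

```python
import json

# Assumption: default ranking rules order around v0.24; verify for your version.
DEFAULT_RANKING_RULES = ["words", "typo", "proximity", "attribute", "sort", "exactness"]


def without_rule(rules, rule):
    """Return a copy of the rule list with one rule removed, order preserved."""
    return [r for r in rules if r != rule]


# JSON body that would be sent as the new ranking rules setting.
payload = json.dumps(without_rule(DEFAULT_RANKING_RULES, "attribute"))
```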

This seems unrelated: it's about favoring some attributes over other attributes according to an order, not about favoring some words over other words in a single attribute.

The depth considered by MeiliSearch applies within a single attribute as well as between attributes.

With

[
  { "id": 1, "description": "Harry Potter and his friends live a lot of adventures." },
  { "id": 2, "description": "A movie in the universe of Harry Potter" }
]

Doc 1 is considered more relevant than doc 2. My example is really trivial, but it can be useful when you have attributes with a lot of words. Again, it depends on your own use case, so if it's something you don't want, you can remove "attribute" from the ranking rules.

You mean removing "words" from the ranking rules?

No, "attribute". I just realized the documentation is not up to date; I've opened a PR with a patch: meilisearch/documentation#1222

Sorry for that!

Why do you have a "words" ranking rule if the ranking by words is controlled by the "attribute" rule?

The more I dig the more MeiliSearch is inscrutable. I would have liked a portable, memory-safe solution but I am staying with TypeSense, nothing in here makes sense to me.

Hello @remram44,

the "words" criterion targets the words of the user query and will remove/ignore the last query words one by one to fill the response. For instance, take a query "t-shirt covfefe" requesting 20 documents. Because our imaginary store only has 1 t-shirt with a covfefe print and we want to return 20 documents, the criterion will remove the word covfefe from the query and rerun a search, returning other documents matching the word t-shirt.
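
The word-dropping behaviour described above can be sketched as a toy model in Python. This is not Meilisearch's implementation: matching here is exact, on whitespace-separated lowercase words, and `search_with_word_dropping` and the `store` data are made up for illustration.

```python
def search_with_word_dropping(docs, query, limit):
    """Toy model: ignore the last query word until `limit` results are found."""
    terms = query.lower().split()
    results = []
    seen = set()
    while terms and len(results) < limit:
        for index, text in enumerate(docs):
            words = text.lower().split()
            # A document matches only if it contains every remaining query word.
            if index not in seen and all(term in words for term in terms):
                seen.add(index)
                results.append(text)
        # Drop the last query word; the loop ends once enough results exist.
        terms = terms[:-1]
    return results[:limit]


store = ["t-shirt covfefe print", "t-shirt blue", "t-shirt red", "mug covfefe"]
results = search_with_word_dropping(store, "t-shirt covfefe", limit=3)
```

Documents matching every query word come first, and the remaining slots are filled by documents matching the shortened query.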

The "attribute" criterion will rank documents depending on the position of the matching word in the document:

  • if a match is in a higher attribute, the document is boosted
  • if a match is at the beginning of an attribute, the document is boosted

We could rename this criterion wordPosition, but it would be confusing because of the words criterion. Moreover, the attribute position carries more weight than the word position within the attribute.
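
The two boosts, and the fact that attribute order outweighs word position, can be sketched as a lexicographic sort key. This is a toy model under simplifying assumptions (single query word, exact lowercase matching, whitespace tokenization), not the engine's actual scoring; `first_match` and `rank_by_attribute` are hypothetical names.

```python
from typing import List, Optional, Tuple


def first_match(doc: dict, attributes: List[str], word: str) -> Optional[Tuple[int, int]]:
    """Return (attribute index, word position) of the best match, or None."""
    for attr_index, attr in enumerate(attributes):
        words = str(doc.get(attr, "")).lower().split()
        if word in words:
            return (attr_index, words.index(word))
    return None


def rank_by_attribute(docs: List[dict], attributes: List[str], word: str) -> List[dict]:
    """Sort matching docs by attribute order first, then by word position."""
    scored = [(first_match(doc, attributes, word), doc) for doc in docs]
    matched = [(key, doc) for key, doc in scored if key is not None]
    return [doc for key, doc in sorted(matched, key=lambda pair: pair[0])]


movies = [
    {"id": 2, "title": "Fantastic Beasts and Where to Find Them",
     "description": "A movie in the universe of Harry Potter"},
    {"id": 1, "title": "Harry Potter and the Half-Blood Prince",
     "description": "A story about a wizard"},
]
ranked_ids = [doc["id"] for doc in rank_by_attribute(movies, ["title", "description"], "potter")]
```

Because the key is a tuple, a match in an earlier attribute always wins over a match at an earlier position in a later attribute.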

I hope my explanations were clear, and I encourage you to try meilisearch and see if it can fit your needs despite our weaknesses.

Anyway, Thanks a lot for your feedback!
Don't hesitate to ask more questions. 👍

Very excited for this enhancement! It allows me to come back to MeiliSearch (I've been using typesense since I encountered the 1,000 word limit)

One suggestion:
I think it's important for MeiliSearch to notify the user whenever it drops data. Silently dropping data is bad :)

I spent half a day debugging my search queries to try and figure out why I couldn't find a document. It turned out it was because MeiliSearch silently dropped all the data beyond the 1,000th word.

When getting the update status, it would be great if the response contained something to show that data was ignored.
Perhaps:

{
  "warnings": [
    { "id": "<document_id>", "truncatedAttributes": [ "<attribute_id>" ] }
  ]
}
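
Until something like this exists server-side, a client could approximate the check before sending documents. The sketch below produces the response shape suggested above. It is an assumption-laden approximation: it counts whitespace-separated words, while the engine's tokenizer may count positions differently, and `find_truncated_attributes` is a hypothetical helper, not part of any Meilisearch SDK.

```python
MAX_POSITIONS = 65535  # per-attribute limit discussed in this thread


def find_truncated_attributes(documents, id_field="id", max_positions=MAX_POSITIONS):
    """Report string attributes whose word count exceeds the position limit."""
    warnings = []
    for doc in documents:
        truncated = [
            attr
            for attr, value in doc.items()
            # Approximation: one whitespace-separated word per position.
            if isinstance(value, str) and len(value.split()) > max_positions
        ]
        if truncated:
            warnings.append({"id": doc[id_field], "truncatedAttributes": truncated})
    return {"warnings": warnings}


sample = [{"id": 1, "body": "word " * 5}, {"id": 2, "body": "short"}]
report = find_truncated_attributes(sample, max_positions=3)
```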

Hello @Sembiance!
Welcome back then 😄
Have you tried v0.24.0rc2? Does it work for you?

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

Have you tried v0.24.0rc2? Does it work for you?

Not yet. Currently working on another project, might be a month or so before I can circle back to MeiliSearch.

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

Thanks :)

@ManyTheFish Thanks a lot for the explanation. I'm not sure if this is useful to anyone else, but this brings up another issue due to "attribute" technically handling two different cases which should be separated into two different rules.

To me it seems like there should be an "attribute" ranking, which is based on the match being in the higher attribute to boost the result, and then another possible "matchPosition" or "wordPosition" within the attribute, which boosts depending on position closer to the beginning of an attribute.

This has become an issue because I can't separate the two rules: I'm indexing long documents as multiple docs, where the "matchPosition" doesn't matter to me. I would want to leave it out as a rule (ignoring it), but still sort by importance of attribute (such as the title being more important than the description of an article).

Maybe this has already been discussed, but I was unable to find any mention of it in the product discussions. It seems like a useful addition/improvement for those of us indexing long documents into multiple MeiliSearch documents referring to the same piece of content. When indexing these long documents, the position of the match within the attribute is not useful at all, and should be ignored.
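
For readers unfamiliar with this approach, splitting one long piece of content into several documents might look like the sketch below. The field names (`parent_id`, `body`), the id scheme, and the chunk size are all assumptions for illustration, not a convention from Meilisearch.

```python
def split_into_documents(doc_id, title, body, max_words=1000):
    """Split one long text into several smaller documents sharing a title."""
    words = body.split()
    # Keep at least one (empty) chunk so an empty body still yields a document.
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)] or [[]]
    return [
        {
            "id": f"{doc_id}-{n}",   # unique id per chunk
            "parent_id": doc_id,     # lets results be grouped back by content
            "title": title,
            "body": " ".join(chunk),
        }
        for n, chunk in enumerate(chunks)
    ]


parts = split_into_documents(42, "Some article", "a b c d e", max_words=2)
```

With chunks like these, the position of a match inside `body` mostly reflects where the chunk boundary fell, which is why the within-attribute position carries little meaning for this use case.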

Hey @mikerogerz! Thanks for your complete response!
I think your point of view is really interesting, and it may be discussed (poke @gmourier).

I think nothing will change before 2022, but we will at least give a "public response" to "why did we choose or not to split the attribute ranking rule in 2?".

Thanks a lot for your feedback! 👍

Hello @mikerogerz, I just opened a ticket in the product repository to ensure we take your feedback into consideration -> meilisearch/product#329

Hey @curquiza I really appreciate it. I'll follow that discussion to see how it progresses.