squidfunk / mkdocs-material

Documentation that simply works

Home Page:https://squidfunk.github.io/mkdocs-material/


Using only spaces as search tokenizer fails to process words with '-' character

galthaus opened this issue

Context

Our docs use a lot of hyphenated words. Search performance is very slow when it finds them, and because it tokenizes each part, the system biases the results toward the single words rather than the complete phrases. We then tried a space-only tokenizer, and now nothing matches at all.

Bug description

With a space-only tokenizer, the search doesn't find words that contain hyphens.
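
For reference, the space-only tokenizer described above corresponds to a setting along these lines (a sketch, assuming the standard `separator` option of the built-in search plugin; the exact value in the attached reproduction may differ):

```yaml
# mkdocs.yml (sketch): tokenize on whitespace only
plugins:
  - search:
      separator: '\s+'   # hyphenated terms like universal-image-deploy stay whole,
                         # but with this setting the query returns no results
```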

Related links

Reproduction

9.5.15-search-no-hypen-fails.zip

Steps to reproduce

  1. Serve the docs
  2. Search for universal-image-deploy and find nothing.
  3. Change the search separator back to the "original" setting
  4. See that universal-image-deploy is found but not as the highest item. On our system, the performance is 20-30 seconds to actually find the pages.

Browser

Chrome, Edge

Thanks for reporting. I've run your reproduction and can confirm that when the hyphen is not part of the search separator, nothing is found. This is very, very likely related to #6885 (reply in thread) (item 2.) and not fixable at the moment for the reasons stated in that comment. However, you might have noticed that we're working on #6307, which will fix this issue as well. I've also run our latest search preview (#6372), and it fixes the issue, allowing you to search with or without -:

(Screenshot 2024-03-24 at 11:49:40)

If I, as you mentioned, switch to what you call the "original" separator in our current implementation, I can confirm that search works, and I do not observe the item being rendered as the second result:

(Screenshot 2024-03-24 at 11:45:10)

Note that we're working heavily on improving search result ranking as well, which should also be better in #6372. Until then, we're considering this issue as resolvable with a configuration (separator) change. You can follow #6307 for updates on the new search implementation, which should fix many, many shortcomings of the current implementation.
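
For reference, the configuration (separator) change amounts to keeping - in the separator, e.g. the built-in plugin's usual default (a sketch; the regex can be extended further, and your exact value may differ):

```yaml
# mkdocs.yml (sketch): split on whitespace and hyphens
plugins:
  - search:
      separator: '[\s\-]+'   # universal-image-deploy is indexed as universal, image, deploy
                             # and can therefore be found again
```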

On another note:

On our system, the performance is 20-30 seconds to actually find the pages.

Is the performance the same if you use the search preview (#6372)? How many pages does your documentation consist of? How long does the build take? Searching should not take 20-30s but 20-30ms, and you can help us try to understand where this comes from by providing more information and, ideally, a test case. Are your docs public?

Alternatively, if you could share the search/search_index.json file that is located in your site directory after building – that would be a tremendous help. It is public anyway if you deploy your site to GitHub Pages. You can just post the link here, as it would help me better understand what the problem is. If you could also provide some searches that lead to suboptimal results on that dataset, that'd be absolutely amazing and of great help ☺️

@squidfunk - Our docs span 200+ pages. We've split the site into two, but it is still a lot. Building both sites can take 45 minutes. The problem with the "original" search delimiter is not that things aren't found, but that the current mechanism biases results so that items matching the "whole" string get pushed down. So, universal-image-deploy finds universal and image and deploy and image deploy before universal-image-deploy, and that is really annoying. The ordering problem becomes more apparent with lots of pages.

docs.rackn.io is our current site. https://docs.rackn.io/stable/ search universal-hardware - it takes about 3 seconds for the preview window to stabilize. I think the longer times are on slower links and maybe first search.

We are using an older version because I need to figure out how to get the latest to work. I have hacked our docs to make it work for the reproduction case. The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them. I'll see if I can make a case for that.

Thanks for your feedback. I'll see if I can try the preview.

Our docs span 200+ pages. We've split the site into two, but it is still a lot. Building both sites can take 45 minutes.

45 minutes is definitely unexpected. Material for MkDocs' own documentation has more than 90 pages and takes 4 seconds to build. It may be caused by a third-party plugin or extension you're using. It'd definitely be worth debugging what causes this. A good approach is to disable plugins and extensions one by one and see which one is responsible.
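
If it helps, a simple way to bisect is to comment out plugins one at a time in mkdocs.yml and compare build times (a sketch with placeholder plugin names for illustration; your actual plugin list will differ):

```yaml
# mkdocs.yml (sketch): disable plugins one by one, then compare `mkdocs build` times
plugins:
  - search
  # - tags                      # re-enable once the slow plugin is identified
  # - some-third-party-plugin   # placeholder name for illustration
```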

The problem with the "original" search delimiter is not that things aren't found, but that the current mechanism biases results so that items matching the "whole" string get pushed down. So, universal-image-deploy finds universal and image and deploy and image deploy before universal-image-deploy, and that is really annoying. The ordering problem becomes more apparent with lots of pages.

Yes, ranking is currently not optimal. The existing implementation is based on BM25, which is not ideal for typeahead. The search preview uses a variant of BM25 giving more weight to consecutive matches, so it might already improve the situation. We're working hard on a new ranking method that does not suffer from the problems of BM25.
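
For context, the textbook BM25 score is a per-term sum, which is why it has no built-in notion of query terms occurring consecutively as a phrase (a sketch of the standard formula, not the exact implementation used here):

$$
\operatorname{score}(d, q) = \sum_{t \in q} \operatorname{IDF}(t)\,\frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$

where f(t, d) is the frequency of term t in document d, |d| the document length, avgdl the average document length, and k1, b are free parameters.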

docs.rackn.io is our current site. https://docs.rackn.io/stable/ search universal-hardware - it takes about 3 seconds for the preview window to stabilize. I think the longer times are on slower links and maybe first search.

The search feels reasonably snappy to me. Yes, it could be even faster (and the search preview actually should be), but I don't observe that opening the search modal or searching takes 3 seconds. I'll download your search index and check if I somehow run into pathological cases.

We are using an older version because I need to figure out how to get the latest to work. I have hacked our docs to make it work for the reproduction case. The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them. I'll see if I can make a case for that.

Yup, 9.2.3 is a little old, but there have not been many changes to search, so don't expect too much from upgrading. However, as mentioned, following #6307 is a good idea, as it will improve the situation. Regardless, it's always a good idea to try and stay up to date, since we're iterating fast while trying to keep things as stable as possible.

The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them.

The tags plugin in Insiders got a complete makeover, as discussed in #6517. If you can narrow the problem down and create a reproduction, we'd be happy if you could open a new bug report so we can fix it ☺️

Sorry. We limit the dev scope to 600 pages to keep build times down. The 600 pages build in about 21.68 seconds. The full scope of generated docs is 6000 pages; that takes a while to build, 1165.73 seconds. It appears that mkdocs is faster with the latest builds. Still not fun, but getting better. I'll play with plugins and get you a repro on the tags issue. I opened an Insiders ticket for the tag build issue.

Here is the slower site. It has 6000 pages. https://refs.rackn.io/stable - search using the preview for universal-hardware or universal-discover. It appears to take 20 seconds to stabilize. The latest tree (but not the search rewrite) is faster, but still takes 10 seconds or so to stabilize. It flashes through sequences. My guess is that it is threaded and is processing the keystrokes and debounce events. The latest tree does sort better (well, a little). It depends on the search term.

Here is the slower site. It has 6000 pages.

6,000 pages is a whole other level, so it sounds legit that this takes longer. Just as an idea to cut down on build time: you might try to enable navigation pruning, which, depending on how you structured the site, might help in cutting down the size and time of the build, because the navigation plays a large role. Also see #1887 for reference.
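
For reference, navigation pruning is a theme feature flag that only renders the visible part of the navigation on each page (a sketch of the documented setting):

```yaml
# mkdocs.yml (sketch): enable navigation pruning
theme:
  name: material
  features:
    - navigation.prune
```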

Thanks for opening the ticket, we'll look into it.

I'm not surprised. Your search index is 40 MB, so you have pretty much reached the limits of client-side search, as you're shipping this index to every user. We haven't announced this yet, but we'll likely be offering the ability to provide server-side search and fully integrate it with the search interface in the near future. Additionally, we'll be exploring alternative methods of breaking the index down in order to ship smaller chunks to the user rather than the entire thing. A site of this size is just no longer suited to full client-side search.

To sum up: we are very aware of the problem that search degrades as a site grows, and we will actively address this after we've shipped the first iterations of the new search interface. Our vision is to provide an awesome experience from 1 to 10,000 pages. Please note that this is a pretty big fish to fry, but we're working hard on it.

(Screenshot 2024-03-26 at 08:44:35)

Based on this search index, could you share some searches plus the results you would expect and how they should be sorted? That would allow us to test it better.

Thanks again for sharing your site. It helps a lot in gravitating towards a better search implementation ☺️

When I run my current prototype on the 40 MB search index of your site, indexing takes around 2-3 seconds and searching takes less than 100ms on average, which includes searching, ranking (please ignore score = 0 in the video below), ordering, highlighting and pagination. It looks very promising and feels quite snappy, given that there are 6,000 documents, each with multiple sections, leading to a total of 16,000 items in the search index.

Ohne.Titel.mp4

When entering only a few characters, many, many results are returned, which might bury what you're actually searching for among many similar results. In this case, a scoped search might be a better idea, in order to prune the number of potential results prior to searching, using a categorical system like tags or site subsections (Blog, Reference, etc.).

All of this is currently in flux, and I'll be testing against your search index regularly. Please note that a search index with 16,000 items is far, far beyond anything we've observed in a site so far, so it might take some time to get this right, but I can assure you that it is on our agenda.

Edit: the prior comment said 26,000, but it's 16,000 items and 26,000 distinct terms. Sorry about that. It is, however, still the biggest search index we've seen so far.

Glad it could help. I'll look at navigation pruning.