Consider pros and cons of having project search ignore all non-alphanumeric characters

Question

srjfoo opened this issue 7 months ago · comments

I suspect that punctuation mismatches are often frustrating for everyone using project search, and they may contribute to PMs missing obvious duplicates when they run duplicate searches. The straw that finally motivated this issue was a curly apostrophe in a project title that is automatically converted to a straight apostrophe in the PG posted notice. 😁

Other problems caused in project search by non-alphanumerics:

presence or absence of punctuation between title and subtitle
use of em-dash vs. double-hyphen in project titles
spacey vs non-spacey punctuation between title and subtitle

I'm sure there are others, but these are the ones that spring to mind. (This obviously won't fix all problems that cause non-matches in project search, but should help with the punctuation-related ones.)

Edit:
After a bit of discussion with other squirrels, I realize this is not one issue. There are two basic issues I've identified:

Curly apostrophes within words, and single or double curly quotes that are part of a title (possibly, rarely, in author fields).
Punctuation between words, usually to separate parts of the title. Could be commas, semicolons, colons, em-dashes (either the character or the double-hyphen version).

For the first case I'd recommend treating them all the same, whether straight or curly. For the second, I think ignoring them for search purposes might be best, but I would like to hear discussion on both.
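To make the two cases concrete, here is a minimal Python sketch of what such normalization might look like. It is only an illustration, not the site's actual search code: the helper name `normalize_title`, the exact set of characters handled, and the choice to lowercase are all assumptions.

```python
import re

# Case 1: map curly quotes and apostrophes to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single curly quote
    "\u2019": "'",  # right single curly quote / curly apostrophe
    "\u201c": '"',  # left double curly quote
    "\u201d": '"',  # right double curly quote
})

# Case 2: separator punctuation between words (commas, semicolons, colons,
# the em-dash character, and runs of two or more hyphens such as "--").
SEPARATORS = re.compile(r"[,;:\u2014]|-{2,}")


def normalize_title(text: str) -> str:
    """Hypothetical helper: fold quotes, drop separators, collapse spaces."""
    text = text.translate(QUOTE_MAP)
    text = SEPARATORS.sub(" ", text)
    # Collapse runs of whitespace so spaced and unspaced punctuation compare equal.
    return " ".join(text.split()).casefold()
```

Applying the same function to both the stored title and the search query before comparing them would make "Title: Subtitle" and "Title--Subtitle" equivalent, while a single hyphen inside a word is left alone.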

Chris Miceli · Answer 1 · Tue Apr 02 2024 10:23:08 GMT+0800 (China Standard Time)

Would it make sense to tokenize both the title and the search query by splitting the input on punctuation (quotes, hyphens, commas, etc.), then rank results based on token matching? This is more in line with what search-oriented solutions do. An example would be:

The user enters the search query "James Copeland".

Let's say there is a project with the title "Life and bloody career of the executed criminal, James Copeland, the great Southern land pirate., [2d ed.]". We turn that into a set of tokens:

Life
and
bloody
career
of
the
executed
criminal
James
Copeland
the
great
Southern
land
pirate
2d
ed

Then rank it against other projects by comparing to the tokenized input "James", "Copeland".

This project would rank highly since 2 tokens match exactly. We could get fancier, for example by weighting common tokens like "the" less than other tokens or by ranking partial matches, but I wonder how far we could get with just this. This solution would require a well-defined rule for where to split the project title into tokens.
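A rough Python sketch of the tokenize-and-rank idea follows. The splitting rule (any run of non-alphanumeric characters), the stop-word list, and the weights are assumptions chosen for illustration; `tokenize`, `score`, and `rank` are hypothetical names, not existing functions.

```python
import re

# Assumed list of common words that should count for less when matching.
STOPWORDS = {"the", "and", "of", "a", "an"}


def tokenize(text):
    """Split on any run of characters that are not letters or digits."""
    return [t.casefold() for t in re.split(r"[\W_]+", text) if t]


def score(query, title):
    """Sum a weight for each distinct query token found in the title."""
    title_tokens = set(tokenize(title))
    total = 0.0
    for token in set(tokenize(query)):
        if token in title_tokens:
            # Assumed weights: stop words count for much less than other tokens.
            total += 0.1 if token in STOPWORDS else 1.0
    return total


def rank(query, titles):
    """Return the titles with a non-zero score, best match first."""
    scored = [(score(query, title), title) for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for points, title in scored if points > 0]
```

Because straight and curly apostrophes are both non-alphanumeric, they split tokens the same way under this rule, so a title containing either form yields the same tokens.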

A simpler initial solution could be to just remove those token splitters (commas, apostrophes, etc.) and then do a string search on the result. Just throwing ideas out there.

Sharon Joiner · Answer 2 · Tue Apr 02 2024 10:39:01 GMT+0800 (China Standard Time)

Any thoughts on which would be easier on the database?

Chris Miceli · Answer 3 · Tue Apr 02 2024 11:10:47 GMT+0800 (China Standard Time)

I'd have to defer to someone with more database experience than me. I'm not even sure the full tokenization approach is possible in the database. I think string replacement could be done in SQL to "normalize" our project titles during the search.
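As a sketch of what SQL-side string replacement could look like, the Python snippet below builds a nested `REPLACE()` expression and runs it against a database connection. It uses SQLite only so the example can be run as written; the `projects` table, the `title` column, and the punctuation list are assumptions, and nothing here says anything about how the real database would perform.

```python
import sqlite3

# Assumed set of characters to ignore. Spaces are stripped as well, so that
# "Title: Subtitle" and "Title--Subtitle" compare equal.
PUNCTUATION = [",", ";", ":", ".", "'", "\u2019", "\u2014", "--", " "]


def normalized_sql(expr):
    """Wrap a SQL expression in one REPLACE() call per ignored character."""
    for _ in PUNCTUATION:
        expr = f"REPLACE({expr}, ?, '')"
    return f"LOWER({expr})"


def search(conn, query):
    """Return titles whose normalized form contains the normalized query."""
    needle = query.lower()
    for mark in PUNCTUATION:
        needle = needle.replace(mark, "")
    sql = (
        "SELECT title FROM projects "
        f"WHERE INSTR({normalized_sql('title')}, ?) > 0"
    )
    # One bound parameter per REPLACE() call, then the normalized query.
    return [row[0] for row in conn.execute(sql, [*PUNCTUATION, needle])]
```

One trade-off of stripping spaces is that a query can match across word boundaries. Normalizing every title on each search also prevents the use of an ordinary index, so storing a pre-normalized copy of the title might be worth considering.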

Casey Peel · Answer 4 · Tue Apr 02 2024 12:52:08 GMT+0800 (China Standard Time)

We might be able to use the full-text search feature of MySQL. That's what the forums use for their search too. I don't know exactly how it handles punctuation.