DistributedProofreaders / dproofreaders

Distributed Proofreaders is a web application intended to ease the process of converting public domain books into e-texts.

Home Page: https://www.pgdp.net

Consider pros and cons of having project search ignore all non-alphanumeric characters

srjfoo opened this issue · comments

I suspect that punctuation mismatches are often frustrating for everyone using project search (and may actually contribute to PMs missing obvious duplicates when they run duplicate searches), but the straw that finally motivated this issue was a curly apostrophe in a project title that is automatically converted to a straight apostrophe in the PG posted notice. 😁

Other problems caused in project search by non-alphanumerics:

  • presence or absence of punctuation between title and subtitle
  • use of em-dash vs. double-hyphen in project titles
  • spacey vs. non-spacey punctuation between title and subtitle

I'm sure there are others, but these are the ones that spring to mind. (This obviously won't fix all problems that cause non-matches in project search, but should help with the punctuation-related ones.)

Edit:
After a bit of discussion with other squirrels, I realize this is not one issue. There are two basic issues I've identified:

  1. Curly apostrophes within words, and single or double curly quotes that are part of a title (possibly, rarely, in author fields).
  2. Punctuation between words, usually to separate parts of the title. Could be commas, semicolons, colons, em-dashes (either the character or the double-hyphen version).

For the first case I'd recommend treating them all the same, whether straight or curly. For the second, I think ignoring them for search purposes might be best, but I would like to hear discussion on both.
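As a rough illustration of the two cases, here is a minimal Python sketch (not the project's actual code, which is PHP). The name `normalize_title` and the exact character lists are assumptions made for the example. Separator punctuation is replaced with a space rather than deleted, so that "Title--Subtitle" does not collapse into one word.

```python
import re

# Hypothetical helper, not part of the dproofreaders code base.
# Case 1: fold curly apostrophes and quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / curly apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})

# Case 2: separator punctuation between words (comma, semicolon, colon,
# em-dash character, or a run of two or more hyphens), together with any
# whitespace around it. A single hyphen inside a word is left alone.
SEPARATORS = re.compile(r"\s*(?:[,;:\u2014]|-{2,})\s*")


def normalize_title(text: str) -> str:
    """Return a comparison key for a project title or a search query."""
    text = text.translate(QUOTE_MAP)
    text = SEPARATORS.sub(" ", text)
    # Collapse remaining whitespace runs and ignore case.
    return " ".join(text.split()).lower()
```

With this sketch, a title and a query match if their keys match, so `normalize_title("Title\u2014A Subtitle")`, `normalize_title("Title -- A Subtitle")`, and `normalize_title("Title: A Subtitle")` all produce the same key.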

Would it make sense to do a sort of tokenization of the title and search query that splits input on punctuation (quotes, hyphens, commas, etc.), then rank results based on token matching? This is more in line with what search-oriented solutions do. An example would be:

The user enters the search query "James Copeland".

Let's say there is a project with the title "Life and bloody career of the executed criminal, James Copeland, the great Southern land pirate., [2d ed.]". We turn that into a set of tokens:

  1. Life
  2. and
  3. bloody
  4. career
  5. of
  6. the
  7. executed
  8. criminal
  9. James
  10. Copeland
  11. the
  12. great
  13. Southern
  14. land
  15. pirate
  16. 2d
  17. ed

Then rank it against other projects by comparing to the tokenized input "James", "Copeland".

This project would rank highly since 2 tokens match exactly. We could get fancier by weighting tokens like "the" less than other tokens, or by ranking partial matches, but I wonder how far we could get with just this. This approach would require a well-defined rule for where to split the project title into tokens.
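To make the idea concrete, here is a minimal Python sketch of tokenize-and-rank. It is only an illustration under stated assumptions: a token is taken to be any run of letters or digits, and the stop-word list and its weight are made-up values that would need tuning. The function names are hypothetical.

```python
import re

# Assumption: a token is a maximal run of letters or digits, so every
# punctuation mark and every space acts as a splitter.
TOKEN_RE = re.compile(r"[^\W_]+")

# Hypothetical stop-word list and weight; real values would need tuning.
STOP_WORDS = {"the", "and", "of", "a", "an"}
STOP_WORD_WEIGHT = 0.1


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def score(query: str, title: str) -> float:
    """Sum the weights of the distinct query tokens found in the title."""
    title_tokens = set(tokenize(title))
    total = 0.0
    for token in set(tokenize(query)):
        if token in title_tokens:
            total += STOP_WORD_WEIGHT if token in STOP_WORDS else 1.0
    return total


def rank(query: str, titles: list[str]) -> list[str]:
    """Return the titles with a non-zero score, best match first."""
    scored = [(score(query, title), title) for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for value, title in scored if value > 0]
```

For the query "James Copeland" and the example title, both query tokens are found, so `score` returns 2.0. One side effect of this splitting rule is that an apostrophe also splits a word ("Miller's" becomes "miller" and "s"), whether it is curly or straight.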

A simpler initial solution could be to just remove those token splitters (commas, apostrophes, etc.) and then do a string search on the resulting string. Just throwing ideas out there.

Any thoughts on which would be easier on the database?

I'd have to defer to someone with more database experience than me. I'm not even sure the full tokenization approach is possible in the database. I think string replacement could be done in SQL to "normalize" our project titles during the search.
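Here is a minimal sketch of what that SQL-side replacement could look like, using Python's built-in `sqlite3` module as a stand-in so the example runs anywhere. The `projects` table, its `title` column, and the list of stripped characters are assumptions for illustration, not the real schema. MySQL also has `REPLACE()` and `LOWER()`, but there the pattern would be built with `CONCAT()`, since `||` is not string concatenation in MySQL by default.

```python
import sqlite3

# Characters stripped before comparing: curly apostrophe, straight
# apostrophe, comma, colon, semicolon, em-dash. Illustrative, not exhaustive.
STRIP_CHARS = ["\u2019", "'", ",", ":", ";", "\u2014"]


def normalized(expr: str) -> str:
    """Wrap a SQL expression in one REPLACE() per stripped character."""
    for _ in STRIP_CHARS:
        expr = f"REPLACE({expr}, ?, '')"
    return f"LOWER({expr})"


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Substring search comparing normalized titles to a normalized query."""
    # Hypothetical table and column names (projects.title).
    sql = (
        "SELECT title FROM projects "
        f"WHERE {normalized('title')} LIKE '%' || {normalized('?')} || '%' "
        "ORDER BY title"
    )
    # Placeholders appear left to right: the column's strip characters,
    # then the query itself, then the query's strip characters.
    params = STRIP_CHARS + [query] + STRIP_CHARS
    return [row[0] for row in conn.execute(sql, params)]
```

Because the functions are applied to the column on every search, a query like this generally cannot use an ordinary index on `title`. Storing a pre-normalized copy of the title would be one way around that, at the cost of an extra column to keep in sync. Note also that `%` and `_` in the user's query act as `LIKE` wildcards in this sketch.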

We might be able to use the full-text search feature of MySQL. That's what the forums use for their search too. I don't know exactly how it handles punctuation.