internetarchive / openlibrary

One webpage for every book ever published!

Home Page: https://openlibrary.org

Make author name matching case insensitive

scottbarnes opened this issue

Related: #9003, internetarchive/infogami#221

Problem

On import, author name matching should be case insensitive.

Additional Context

internetarchive/infogami#217 changed ~ to use ILIKE rather than LIKE, and the Open Library code in #9003 relied upon this to perform case insensitive author name matching on import.

However, the Infogami ILIKE change caused performance issues and is slated to be reverted in internetarchive/infogami#221, with ~ doing a LIKE operation and ~i doing an ILIKE operation.

Once internetarchive/infogami#221 is merged, author name resolution will be case sensitive again. However, we can't simply update the Open Library code in openlibrary/catalog/add_book/load_book.py to use ~i, because of the performance issues associated with the ILIKE query. We'll need to investigate further (perhaps EXPLAIN can tell us more about the query).
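As a starting point for that investigation, here is a minimal sketch of building the two EXPLAIN statements to compare. The table name `author_str` and its columns are placeholders, not the confirmed Infogami schema, and `%s` assumes a psycopg-style parameter placeholder. Note that `EXPLAIN (ANALYZE, ...)` actually executes the query, so it is best run against a replica or a copy of the data.

```python
def explain_query(operator: str) -> str:
    """Build an EXPLAIN statement for a name lookup using LIKE or ILIKE."""
    if operator not in {"LIKE", "ILIKE"}:
        raise ValueError(f"unsupported operator: {operator}")
    # author_str, thing_id and value are hypothetical names for illustration.
    return (
        "EXPLAIN (ANALYZE, BUFFERS) "
        f"SELECT thing_id FROM author_str WHERE value {operator} %s"
    )


# Print both statements so their plans can be compared side by side.
for op in ("LIKE", "ILIKE"):
    print(explain_query(op))
```

If the LIKE plan shows an index scan and the ILIKE plan shows a sequential scan, that would point at a missing case-insensitive index rather than at ILIKE itself.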

Proposal & Constraints

None yet. This will take more investigation to figure out why ILIKE caused such significant performance issues.

Doesn't SOLR already do this? Is there more context available about why this needs to be done in PostgreSQL in this particular use case?

A few general comments:

  • name matching should be done on normalized names which are not only case folded, but also diacritic folded, and Unicode composition normalized
  • some of these operations are, ideally, locale specific
  • if you have to do it in PostgreSQL, a trigram index may help performance https://stackoverflow.com/questions/20336665/lower-like-vs-ilike
  • but pre-computing a separate column with a normalized version of the name might be better
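To make the first and last points concrete, here is a rough sketch of a normalization function plus a precomputed, indexed column. It is locale-independent, so it does not address the locale-specific cases. SQLite stands in for PostgreSQL only so the example runs on its own, and the `author` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3
import unicodedata


def normalize_name(name: str) -> str:
    """Return a comparison key: Unicode-normalized, diacritic-folded, case-folded."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping the combining marks is the diacritic folding step.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower() intended for caseless matching.
    folded = stripped.casefold()
    # Recompose and collapse runs of whitespace.
    return " ".join(unicodedata.normalize("NFKC", folded).split())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, name_norm TEXT)"
)
# A plain index on the precomputed column is enough for equality lookups.
conn.execute("CREATE INDEX author_name_norm_idx ON author (name_norm)")


def add_author(name: str) -> int:
    """Store the display name alongside its normalized key."""
    cur = conn.execute(
        "INSERT INTO author (name, name_norm) VALUES (?, ?)",
        (name, normalize_name(name)),
    )
    return cur.lastrowid


def find_author(name: str):
    """Look up an author id by normalized name, or return None."""
    row = conn.execute(
        "SELECT id FROM author WHERE name_norm = ?", (normalize_name(name),)
    ).fetchone()
    return row[0] if row else None
```

With this layout the lookup is an exact match on `name_norm`, so it avoids ILIKE entirely; the cost moves to write time, where the key has to be kept in sync with the name.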

Solr might be what we have to do, considering the performance issues with ILIKE. Note that Solr has the caveat of being 1 minute behind live edits. In the past, when Solr was used to dedupe imports, this lag led to edge cases where related books imported in quick succession produced dupes, so we'd always need a Postgres backup check of some sort. The Postgres ILIKE was hence a mandatory and simple change that would greatly reduce the number of unnecessary new authors being created. The plan was to add the Solr checking as an improvement at some point in the future, but we might have to re-evaluate that strategy, as mentioned above.
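A rough sketch of that two-step check, just to illustrate the shape of it. Both lookup callables are hypothetical stand-ins, not existing Open Library functions; each takes a name and returns an author key or None.

```python
from typing import Callable, Optional

# A lookup takes an author name and returns an existing author key, or None.
Lookup = Callable[[str], Optional[str]]


def resolve_author(
    name: str, search_solr: Lookup, search_db: Lookup
) -> Optional[str]:
    """Return an existing author key, or None if a new author is needed."""
    key = search_solr(name)
    if key is not None:
        return key
    # Solr can lag behind live edits, so a miss there is not conclusive.
    # Check the database before deciding the author is new.
    return search_db(name)


# Stand-in data: the second author was created moments ago, so only the
# database knows about it yet.
solr_index = {"mark twain": "/authors/OL1A"}
database = {"mark twain": "/authors/OL1A", "jane austen": "/authors/OL2A"}

print(resolve_author("jane austen", solr_index.get, database.get))
```

Here the second lookup catches the author that Solr has not indexed yet, which is the quick-succession import case described above.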

Oh sweet thanks for that trigram index find! When we investigate we'll see what it's currently using.