WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

Home Page: https://openverse.org

Move "by" tag contains filter to tag exact match filter

AetherUnbound opened this issue

Description

Looking over our transformations, it seems we have two separate tag filtering steps that happen on the MediaStore class: full string matching and substring matching. The latter, however, has the value "by" as a substring match option. Whereas all of the other options feel fairly unique and pointed, "by" could be part of so many valid tags that it seems like it should be part of the full string matching instead. Examples include "by the sea", "colby jack", "byte array", etc.

We should move this value from TAG_CONTAINS_DENYLIST to TAG_DENYLIST in this file:

TAG_DENYLIST = {
    "no person",
    "squareformat",
    "undefined",
}
# Filter out tags that contain the following terms. All entries should be lowercase.
TAG_CONTAINS_DENYLIST = {
    ":",
    "=",
    "by",
    "by-nc",
    "by-nc-nd",
    "by-nc-sa",
    "by-nd",
    "by-sa",
    "cc0",
    "creative commons",
    "flickriosapp",
    "pdm",
    "public domain",
    "uploaded",
}
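
To make the difference concrete, here is a minimal sketch comparing the current substring behavior with the proposed move. It is not the actual MediaStore code: only the two denylist names and their contents come from the snippet above, while `is_denied`, `kept_tags`, `PROPOSED_EXACT`, `PROPOSED_CONTAINS`, and `SAMPLE_TAGS` are hypothetical names used for illustration, and the lowercase/strip normalization is an assumption.

```python
# Denylists copied from the snippet above.
TAG_DENYLIST = {"no person", "squareformat", "undefined"}
TAG_CONTAINS_DENYLIST = {
    ":", "=", "by", "by-nc", "by-nc-nd", "by-nc-sa", "by-nd", "by-sa",
    "cc0", "creative commons", "flickriosapp", "pdm", "public domain",
    "uploaded",
}

# Proposed change: "by" becomes a full-string match instead of a substring match.
PROPOSED_EXACT = TAG_DENYLIST | {"by"}
PROPOSED_CONTAINS = TAG_CONTAINS_DENYLIST - {"by"}


def is_denied(tag, exact, contains):
    """Hypothetical helper: True if the tag would be filtered out."""
    normalized = tag.strip().lower()  # assumed normalization
    if normalized in exact:
        return True
    return any(term in normalized for term in contains)


def kept_tags(tags, exact, contains):
    """Hypothetical helper: return the tags that survive both filters."""
    return [tag for tag in tags if not is_denied(tag, exact, contains)]


SAMPLE_TAGS = ["by the sea", "colby jack", "byte array", "by", "cc by-sa", "sunset"]

if __name__ == "__main__":
    print("current: ", kept_tags(SAMPLE_TAGS, TAG_DENYLIST, TAG_CONTAINS_DENYLIST))
    print("proposed:", kept_tags(SAMPLE_TAGS, PROPOSED_EXACT, PROPOSED_CONTAINS))
```

With the current sets, every sample tag containing the letters "by" is dropped, including the three valid examples. With "by" moved to the exact-match set, only the bare "by" tag and the license-like "cc by-sa" (still caught by the "by-sa" substring) are dropped.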

👍 to "by" being a full-tag match, and maybe "cc by" (and likewise for the other CC license terms) as well.

In line with the discussion of #4417 and into #4456, if we will soon have the ability to filter tags later in our data lifecycle, it'd be great to not do any kind of wholesale exclusion of tags at this point.

My understanding of some of the requirements of #4456, together with the requirement in #4417 to keep certain tags in our data without including them in search, leads me to think that excluding these tags at all goes against that trend. I suppose I'm making an assumption about how those discussions will play out.

If we move this later on, though, we'll have greater flexibility to play with what tags we use for search, without precluding the possibility of tags not directly used in search still being useful. Perhaps the existence of some of these tags, even if they aren't part of the searched text, is a useful indicator of some quality of the work that would be relevant for some subset of searches (if not all searches).
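
As a rough illustration of that idea, the sketch below keeps every tag and only decides which ones feed search. It is purely hypothetical: `split_tags_for_search` and its parameters are invented for this example and do not reflect how #4417 or #4456 will actually be implemented.

```python
def split_tags_for_search(tags, exact_denylist):
    """Hypothetical helper: partition tags without discarding any.

    Returns (searchable, retained_only). Tags matching the denylist are
    kept in the data but left out of the searchable list.
    """
    searchable = []
    retained_only = []
    for tag in tags:
        if tag.strip().lower() in exact_denylist:
            retained_only.append(tag)
        else:
            searchable.append(tag)
    return searchable, retained_only


if __name__ == "__main__":
    searchable, retained_only = split_tags_for_search(
        ["sunset", "by", "Squareformat"], {"by", "squareformat"}
    )
    print(searchable, retained_only)
```

Because nothing is thrown away, the denylist can be changed later without re-ingesting the original tags.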

If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).

> If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).

I was about to ask exactly this because, from sync discussions, I had understood that we wanted to preserve tags with low accuracy and move forward with cleaning deny-listed tags in the catalog since we're already excluding them in new records. Hence, I created #4453 (blocked while this is clarified). If the desire is to keep everything, then #4453 would not be necessary.

We are moving towards that. I assume Madison opened this issue as something we can resolve as a separate problem, rather than also solving it in that issue (or waiting to solve it after that issue).

As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?

> We are moving towards that...

Perfect, that is all the confirmation I needed. Thanks!

> As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?

I wasn't referring to blocking this issue (#4464) but #4453, which is unnecessary, so it's closed now. Sorry for the misunderstanding. I agree with moving the "by" tag 👍

I'd like to work on this

Thanks @sarayourfriend and @krysal for chiming in. Yes, I agree that we should stop filtering tags at ingestion in the future once we move the filtering out of the ingestion server (see #4524). I'll make an issue for follow-up!

Edit: issue made - #4542