Move "by" tag contains filter to tag exact match filter
AetherUnbound opened this issue · comments
Description
Looking over our transformations, it seems we have two separate tag filtering steps that happen on the MediaStore
class: full string matching and substring matching. The latter case, however, has the value by
as a substring match option. Whereas all of the other options feel fairly unique and pointed, the value by
could be part of so many valid tag possibilities that it seems like it should be part of the full string matching instead. Examples being: "by the sea", "colby jack", "byte array", etc.
We should move this value from TAG_CONTAINS_DENYLIST
to TAG_DENYLIST
in this file:
openverse/catalog/dags/common/storage/media.py
Lines 15 to 37 in d500b77
👍 to by
should be a full-tag match, maybe cc by
(etc for the other CC license terms) as well.
In line with the discussion of #4417 and into #4456, if we will soon have the ability to filter tags later in our data lifecycle, it'd be great to not do any kind of wholesale exclusion of tags at this point.
My understanding of some of the requirements of #4456, in line with the requirements of #4417 not to include certain tags in search but keep them in our data, would lead me to think that excluding these tags at all is against that trend? I suppose I'm making an assumption about how those discussions will play out.
If we move this later on, though, we'll have greater flexibility to play with what tags we use for search, without precluding the possibility of tags not directly used in search still being useful. Perhaps the existence of some of these tags, even if they aren't part of the searched text, is a useful indicator of some quality of the work that would be relevant for some subset of searches (if not all searches).
If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).
If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).
I was about to ask exactly this because, from sync discussions, I had understood that we wanted to preserve tags with low accuracy and move forward with cleaning deny-listed tags in the catalog since we're already excluding them in new records. Hence, I created #4453 (blocked while this is clarified). If the desire is to keep everything, then #4453 would not be necessary.
We are moving towards that, I assume Madison opened this issue as something we can resolve as a separate problem, rather than also solving it in that issue (or waiting to solve it after that issue).
As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?
We are moving towards that...
Perfect, that is all the confirmation I needed. Thanks!
As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?
I wasn't referring to blocking this issue (#4464) but #4453, which is unnecessary, so it's closed now. Sorry for the misunderstanding. I agree with moving the by
tag 👍
I'd like to work on this
Thanks @sarayourfriend and @krysal for piping in - yes, I agree that we should stop filtering tags at ingestion in the future once we move the filtering out of the ingestion server (see #4524). I'll make an issue for follow-up!
Edit: issue made - #4542