WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

Home Page: https://openverse.org

Update ingestion server removal IP to include plan for filtering tags

AetherUnbound opened this issue

Description

Recent discussions in data-related projects have uncovered a need to separate the tag filtering step from the other cleanup operations that occur during the data refresh:

```python
@staticmethod
def cleanup_tags(tags):
    """
    Delete denylisted and low-accuracy tags.
    :return: an SQL fragment if an update is needed, ``None`` otherwise
    """
    update_required = False
    tag_output = []
    if not tags:
        return None

    for tag in tags:
        below_threshold = False
        if "accuracy" in tag and float(tag["accuracy"]) < TAG_MIN_CONFIDENCE:
            below_threshold = True
        if "name" in tag and isinstance(tag["name"], str):
            lower_tag = tag["name"].lower()
            should_filter = _tag_denylisted(lower_tag) or below_threshold
        else:
            log.warning(f'Filtering malformed tag "{tag}" in "{tags}"')
            should_filter = True
        if should_filter:
            update_required = True
        else:
            tag_output.append(tag)

    if update_required:
        fragment = Json(tag_output)
        return fragment
    else:
        return None
```

Per the discussion in #4417 (comment), we want to keep the tags that fail the current filtering criteria (denylist membership and tag accuracy) in the catalog database rather than remove them. Since the intent of the data normalization project is to remove the cleanup steps from the ingestion server, it's imperative that we move this filtering operation into the new data refresh process. As such, we'll need to modify the implementation plan for that project to specify where this filtering will take place.
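For illustration, a minimal sketch of what the extracted step could look like once it no longer mutates catalog data: a pure function that leaves the stored tag list untouched and only produces the subset used for indexing. The function name and the `TAG_DENYLIST` contents are hypothetical; `TAG_MIN_CONFIDENCE` mirrors the constant used in the excerpt above.

```python
TAG_MIN_CONFIDENCE = 0.9  # assumed value, mirroring the existing constant
TAG_DENYLIST = {"watermark", "cc0"}  # hypothetical denylist contents


def filter_index_tags(tags: list[dict] | None) -> list[dict]:
    """
    Return only the tags that should be searchable.

    Unlike ``cleanup_tags``, this never rewrites the catalog record;
    the full tag list stays in the upstream table.
    """
    if not tags:
        return []
    filtered = []
    for tag in tags:
        name = tag.get("name")
        if not isinstance(name, str):
            # Malformed tags are excluded from search but preserved upstream.
            continue
        accuracy = tag.get("accuracy")
        if accuracy is not None and float(accuracy) < TAG_MIN_CONFIDENCE:
            continue
        if name.lower() in TAG_DENYLIST:
            continue
        filtered.append(tag)
    return filtered
```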

It may be possible to add this as another step in the DAG, but this can be an intensive (and long-running) filtering operation. We also need to be clear about which step the filtering happens at: are we filtering the data within the temporary table prior to indexing and promoting it, or are we filtering the values as they go into Elasticsearch and leaving them in the API database? The former is the current approach, but the latter provides more flexibility for rebuilding the index without having to copy the upstream table again. If we filter at the ES level, we could potentially leave the full list of tags in the API's response body for each record. This would mean we'd be returning the full list of tags (including denylisted and low-confidence tags) while only performing searches against the filtered tags.
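As a rough sketch of the second option (assuming a `filter_index_tags` helper like the one above, and with illustrative field names), filtering would happen only while the Elasticsearch document is being built, so the API database keeps the full tag list:

```python
def build_search_document(record: dict) -> dict:
    """
    Build the Elasticsearch document for a single record.

    The API database row is never modified; only the document sent to
    Elasticsearch carries the filtered tag list, so searches run against
    filtered tags while the API can still return the full list.
    """
    return {
        "identifier": record["identifier"],
        "title": record.get("title"),
        # Only tags that pass the denylist/confidence checks are indexed.
        "tags": filter_index_tags(record.get("tags")),
    }
```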

We also need to consider where in the new data refresh DAG this would run. It could be run as a set of mapped tasks directly in the DAG, or (since we already have the indexer workers available for this kind of chunked operation) we could add it as a second task handled by the indexer workers within the DAG. The modifications to the IP will need to make this distinction explicit.
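Under the first option (mapped tasks directly in the DAG), the step might look roughly like the sketch below. All DAG, task, and helper names here are hypothetical, and this only shows the shape of the dynamic task mapping; each mapped task could equally be replaced by a request dispatched to an indexer worker.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def data_refresh_tag_filtering():
    @task
    def get_id_ranges() -> list[dict]:
        # Hypothetical helper: split the temporary table into ID chunks.
        return [{"start": 0, "end": 100_000}, {"start": 100_000, "end": 200_000}]

    @task
    def filter_tags_for_range(id_range: dict) -> None:
        # Filter tags for one chunk of records; in the indexer-worker
        # variant this body would instead call out to a worker.
        ...

    # One mapped task instance per chunk.
    filter_tags_for_range.expand(id_range=get_id_ranges())


data_refresh_tag_filtering()
```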

Defining this step now is necessary going forward, since we may want to filter out entire tag providers (e.g. Clarifai) or adjust the confidence threshold down the line, and we want to preserve this data in the catalog.
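Since the providers, denylist, and threshold may all change over time, the filtering criteria could be expressed as configuration rather than hard-coded, roughly along these lines (all names and default values here are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagFilterConfig:
    """Hypothetical, adjustable knobs for the tag filtering step."""

    min_confidence: float = 0.9
    denylisted_tags: frozenset[str] = frozenset({"watermark"})
    denylisted_providers: frozenset[str] = frozenset({"clarifai"})


def should_index_tag(tag: dict, config: TagFilterConfig) -> bool:
    """Return True if the tag should be searchable under this configuration."""
    name = tag.get("name")
    if not isinstance(name, str) or name.lower() in config.denylisted_tags:
        return False
    if tag.get("provider", "").lower() in config.denylisted_providers:
        return False
    accuracy = tag.get("accuracy")
    if accuracy is not None and float(accuracy) < config.min_confidence:
        return False
    return True
```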

Additional context

See: #430, #431, and #3925