WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

Home Page: https://openverse.org

Update ingestion server removal IP to include plan for filtering tags

AetherUnbound opened this issue

Description

Recent discussions in data-related projects have uncovered a need to separate the tag filtering step from the other cleanup operations that occur during the data refresh:

```python
@staticmethod
def cleanup_tags(tags):
    """
    Delete denylisted and low-accuracy tags.
    :return: an SQL fragment if an update is needed, ``None`` otherwise
    """
    update_required = False
    tag_output = []
    if not tags:
        return None

    for tag in tags:
        below_threshold = False
        if "accuracy" in tag and float(tag["accuracy"]) < TAG_MIN_CONFIDENCE:
            below_threshold = True
        if "name" in tag and isinstance(tag["name"], str):
            lower_tag = tag["name"].lower()
            should_filter = _tag_denylisted(lower_tag) or below_threshold
        else:
            log.warning(f'Filtering malformed tag "{tag}" in "{tags}"')
            should_filter = True
        if should_filter:
            update_required = True
        else:
            tag_output.append(tag)

    if update_required:
        fragment = Json(tag_output)
        return fragment
    else:
        return None
```

Per the discussion in #4417 (comment), we want to keep the tags that fail the current filtering criteria (denylist membership and tag accuracy) in the catalog database rather than remove them. Since the intent of the data normalization project is to remove the cleanup steps from the ingestion server, it's imperative that we move this filtering operation into the new data refresh process. As such, we'll need to modify the implementation plan for that project to specify where this filtering will take place.
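For illustration, a minimal sketch of what the extracted step could look like once it no longer mutates catalog data: a pure function that leaves the stored tag list untouched and only produces the subset used for indexing. The function name and the `TAG_DENYLIST` contents are hypothetical; `TAG_MIN_CONFIDENCE` mirrors the constant used in the excerpt above.

```python
TAG_MIN_CONFIDENCE = 0.9  # assumed value, mirroring the existing constant
TAG_DENYLIST = {"watermark", "cc0"}  # hypothetical denylist contents


def filter_index_tags(tags: list[dict] | None) -> list[dict]:
    """
    Return only the tags that should be searchable.

    Unlike ``cleanup_tags``, this never rewrites the catalog record;
    the full tag list stays in the upstream table.
    """
    if not tags:
        return []
    filtered = []
    for tag in tags:
        name = tag.get("name")
        if not isinstance(name, str):
            # Malformed tags are excluded from search but preserved upstream.
            continue
        accuracy = tag.get("accuracy")
        if accuracy is not None and float(accuracy) < TAG_MIN_CONFIDENCE:
            continue
        if name.lower() in TAG_DENYLIST:
            continue
        filtered.append(tag)
    return filtered
```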

It may be possible to add this as another step in the DAG, but this can be an intensive (and long-running) filtering operation. We also need to be clear about which step the filtering happens at: are we filtering the data within the temporary table prior to indexing and promoting it, or are we filtering the values as they go into Elasticsearch and leaving them in the API database? The former is the current approach, but the latter provides more flexibility for rebuilding the index without having to copy the upstream table again. If we filter at the ES level, we could potentially leave the full list of tags in the API's response body for each record. This would mean we'd be returning the full list of tags (including denylisted and low-confidence tags) while only performing searches against the filtered tags.
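As a rough sketch of the second option (assuming a `filter_index_tags` helper like the one above, and with illustrative field names), filtering would happen only while the Elasticsearch document is being built, so the API database keeps the full tag list:

```python
def build_search_document(record: dict) -> dict:
    """
    Build the Elasticsearch document for a single record.

    The API database row is never modified; only the document sent to
    Elasticsearch carries the filtered tag list, so searches run against
    filtered tags while the API can still return the full list.
    """
    return {
        "identifier": record["identifier"],
        "title": record.get("title"),
        # Only tags that pass the denylist/confidence checks are indexed.
        "tags": filter_index_tags(record.get("tags")),
    }
```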

We also need to consider where in the new data refresh DAG this would run. It could be run as a set of mapped tasks directly in the DAG, or (since we already have the indexer workers available for this kind of chunked operation) we could add it as a second task handled by the indexer workers within the DAG. The modifications to the IP will need to make this distinction explicit.
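Under the first option (mapped tasks directly in the DAG), the step might look roughly like the sketch below. All DAG, task, and helper names here are hypothetical, and this only shows the shape of the dynamic task mapping; each mapped task could equally be replaced by a request dispatched to an indexer worker.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def data_refresh_tag_filtering():
    @task
    def get_id_ranges() -> list[dict]:
        # Hypothetical helper: split the temporary table into ID chunks.
        return [{"start": 0, "end": 100_000}, {"start": 100_000, "end": 200_000}]

    @task
    def filter_tags_for_range(id_range: dict) -> None:
        # Filter tags for one chunk of records; in the indexer-worker
        # variant this body would instead call out to a worker.
        ...

    # One mapped task instance per chunk.
    filter_tags_for_range.expand(id_range=get_id_ranges())


data_refresh_tag_filtering()
```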

Defining this step now is necessary going forward, since we may want to filter out entire tag providers (e.g. Clarifai) or adjust the confidence threshold down the line, and we want to preserve this data in the catalog.
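Since the providers, denylist, and threshold may all change over time, the filtering criteria could be expressed as configuration rather than hard-coded, roughly along these lines (all names and default values here are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagFilterConfig:
    """Hypothetical, adjustable knobs for the tag filtering step."""

    min_confidence: float = 0.9
    denylisted_tags: frozenset[str] = frozenset({"watermark"})
    denylisted_providers: frozenset[str] = frozenset({"clarifai"})


def should_index_tag(tag: dict, config: TagFilterConfig) -> bool:
    """Return True if the tag should be searchable under this configuration."""
    name = tag.get("name")
    if not isinstance(name, str) or name.lower() in config.denylisted_tags:
        return False
    if tag.get("provider", "").lower() in config.denylisted_providers:
        return False
    accuracy = tag.get("accuracy")
    if accuracy is not None and float(accuracy) < config.min_confidence:
        return False
    return True
```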

Additional context

See: #430, #431, and #3925