Results considered dead if SSL fails during dead link check, even though they might not actually be dead
sarayourfriend opened this issue
Description
Some results are considered "dead" even though they are actually available over cleartext HTTP (and, in this case, the cleartext request is then redirected).
"event": "Failed to validate image! Reason: Cannot connect to host collection.mobiliernational.culture.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1006)')]",
collection.mobiliernational.culture.gouv.fr indeed has an expired certificate, but if you visit the page in cleartext, it redirects to a URL with a valid certificate, https://collection.mobilier-national.fr/recherche
There are other examples of this (and the ones I saw all looked to be French government agency related, but that could be just a coincidence from a cluster of results in particular queries).
@WordPress/openverse-catalog I'm not sure whether this needs a fix in the API or whether it's something we should address during the data refresh. It would be nice if the API could follow these redirects. Maybe it's safe to retry requests with HTTP when HTTPS fails on an SSL error? What do y'all think, @WordPress/openverse-api and catalogue folks?
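To make the retry idea concrete, here is a minimal sketch of an "HTTPS first, cleartext on SSL failure" check. It uses only the standard library rather than the API's actual HTTP client, and the names `to_cleartext`, `head`, and `check_link` are hypothetical, not existing Openverse functions. Whether downgrading to HTTP is acceptable from a security standpoint is still an open question.

```python
import ssl
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_cleartext(url):
    """Return the URL with an https scheme swapped for http; other URLs pass through."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return url
    return urlunsplit(parts._replace(scheme="http"))


def head(url, timeout=5.0):
    """Request the URL and return (status, final_url).

    urllib's default handlers follow redirects, so final_url may differ
    from the URL that was requested.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are still answers from the server.
        return exc.code, exc.url
    except urllib.error.URLError as exc:
        # urllib wraps TLS failures in URLError; surface the SSL error itself.
        if isinstance(exc.reason, ssl.SSLError):
            raise exc.reason from exc
        raise


def check_link(url, fetch=head):
    """Try the URL as given; on an SSL error, retry once over cleartext HTTP."""
    try:
        return fetch(url)
    except ssl.SSLError:
        fallback = to_cleartext(url)
        if fallback == url:
            # Nothing to downgrade, so the SSL error stands.
            raise
        return fetch(fallback)
```

The `fetch` parameter exists only so the fallback logic can be exercised without network access. Only SSL errors trigger the retry, so timeouts and DNS failures are still treated the way they are today.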
Reproduction
The logs unfortunately do not show a specific URL. I'm adding the URL to this particular log line in #4333, but for now I don't know the exact works that are failing with this. We can pull them from Elasticsearch by querying the image URLs for this pattern.
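As a rough sketch of that Elasticsearch lookup, a wildcard query on the host would look something like the body below. The field name `url` is an assumption and may not match the actual index mapping, and `build_host_query` is a hypothetical helper.

```python
def build_host_query(host, field="url"):
    """Build an Elasticsearch query body matching documents whose URL contains the host.

    The field name is an assumption; adjust it to the real index mapping.
    Leading wildcards are slow on large indices, so this suits a one-off lookup.
    """
    return {
        "query": {
            "wildcard": {
                field: {"value": f"*{host}*"},
            }
        }
    }


query = build_host_query("collection.mobiliernational.culture.gouv.fr")
```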
Since we're moving away from adding cleaning steps as part of the data refresh, I think we'd like to avoid adding new steps there. I think having the API follow those redirects makes sense! We could then record the updated links in Redis and (when the time comes) follow a procedure like #3585 for reincorporating that data back into the catalog. That feels like a good balance between checking all of the catalog URLs and updating links as users come across them.
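For the "report the updated links in Redis" part, a minimal sketch could look like the following. The key name `updated_links` and the helper `record_updated_link` are hypothetical. The store only needs an `hset(name, key, value)` method, which a redis-py client provides; the in-memory class is a stand-in so the sketch runs without a Redis server.

```python
UPDATED_LINKS_KEY = "updated_links"  # hypothetical Redis hash name


def record_updated_link(store, original_url, final_url):
    """Record original_url -> final_url when a link check ended somewhere else.

    Returns True if a mapping was written, False if the URL was unchanged.
    """
    if final_url == original_url:
        return False
    store.hset(UPDATED_LINKS_KEY, original_url, final_url)
    return True


class InMemoryStore:
    """Stand-in for a Redis client, implementing only hset."""

    def __init__(self):
        self.hashes = {}

    def hset(self, name, key, value):
        bucket = self.hashes.setdefault(name, {})
        is_new = key not in bucket
        bucket[key] = value
        return int(is_new)


store = InMemoryStore()
record_updated_link(
    store,
    "https://collection.mobiliernational.culture.gouv.fr",
    "https://collection.mobilier-national.fr/recherche",
)
```

A hash keyed by the original URL keeps one entry per link, so a later reincorporation job could read the whole mapping in one pass.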