WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

Home Page: https://openverse.org

Results considered dead if SSL fails during dead link check, even though they might not actually be dead

sarayourfriend opened this issue

Description

Some results are considered "dead" even though they are actually available over cleartext HTTP (and in this case, the cleartext request is followed by a redirect).

    "event": "Failed to validate image! Reason: Cannot connect to host collection.mobiliernational.culture.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1006)')]",

collection.mobiliernational.culture.gouv.fr does indeed have an expired certificate, but if you visit the page over cleartext HTTP, it redirects to a URL with a valid certificate: https://collection.mobilier-national.fr/recherche

There are other examples of this (and the ones I saw all looked to be French government agency related, but that could be just a coincidence from a cluster of results in particular queries).

@WordPress/openverse-catalog I'm not sure whether this needs a fix in the API or if it's something that we should address during data refresh. It would be nice if the API could follow these redirects. Maybe it's safe to retry requests with HTTP when HTTPS fails on an SSL error? What do y'all think, @WordPress/openverse-api and catalog folks?
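To make the retry idea concrete, here is a minimal sketch of "retry over HTTP only when HTTPS fails on an SSL error". It is not the API's actual dead link check: `check_link`, `downgrade_to_http`, and the `fetch` callable are hypothetical names, and `fetch` is assumed to follow redirects, return the final status code, and raise `ssl.SSLError` (or a subclass such as `ssl.SSLCertVerificationError`) on certificate failures.

```python
import ssl
from typing import Callable, Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical fetcher: takes a URL, follows redirects, returns the final
# HTTP status code, and raises ssl.SSLError on certificate failures.
Fetcher = Callable[[str], int]


def downgrade_to_http(url: str) -> Optional[str]:
    """Return the cleartext equivalent of an https:// URL, or None otherwise."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return None
    return urlunsplit(parts._replace(scheme="http"))


def check_link(url: str, fetch: Fetcher) -> int:
    """Fetch `url`; on an SSL error only, retry once over cleartext HTTP."""
    try:
        return fetch(url)
    except ssl.SSLError:
        fallback = downgrade_to_http(url)
        if fallback is None:
            # The URL was not HTTPS, so there is nothing to downgrade to.
            raise
        return fetch(fallback)
```

Other failures (timeouts, DNS errors, 4xx/5xx responses) are deliberately left alone, so only the expired-certificate case changes behavior. The sketch does not handle URLs with an explicit `:443` port, which would need the port dropped before retrying.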

Reproduction

The logs unfortunately do not show a specific URL. I'm adding the URL to this particular log line in #4333, but for now I don't know the exact works that are failing this way. We can pull them from Elasticsearch by querying the image URLs for this pattern.
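As a rough illustration of that lookup, the query body could be built like this. The field name `url` and the helper `build_host_query` are assumptions, not taken from the Openverse index mapping, and a `wildcard` query only behaves this way if the field is indexed as a keyword.

```python
def build_host_query(host: str, field: str = "url") -> dict:
    """Build an Elasticsearch wildcard query matching URLs that contain `host`.

    `field` is an assumed field name; adjust it to the real index mapping.
    """
    return {"query": {"wildcard": {field: {"value": f"*{host}*"}}}}


body = build_host_query("collection.mobiliernational.culture.gouv.fr")
```

Leading wildcards are expensive in Elasticsearch, so this is better suited to a one-off investigation than to anything run regularly.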

Since we're moving away from adding cleaning steps as part of the data refresh, I think we'd like to avoid adding new steps there. I think having the API follow those redirects makes sense! We could then record the updated links in Redis, and (when the time comes) follow a procedure like #3585 for reincorporating that data back into the catalog. That feels like a good balance between checking all of the catalog URLs and updating links as users come across them.
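A minimal sketch of the Redis reporting step, assuming a single hash that maps each original URL to the URL it finally resolved to. The key name `updated_links` and the function `report_updated_link` are hypothetical; `store` can be any client exposing `hset(name, key, value)`, which redis-py's `Redis` client does.

```python
from typing import Protocol


class HashStore(Protocol):
    """The subset of a Redis client this sketch relies on."""

    def hset(self, name: str, key: str, value: str) -> int: ...


UPDATED_LINKS_KEY = "updated_links"  # hypothetical key name


def report_updated_link(store: HashStore, original_url: str, final_url: str) -> bool:
    """Record original_url -> final_url; return False if the URL did not change."""
    if original_url == final_url:
        return False
    store.hset(UPDATED_LINKS_KEY, original_url, final_url)
    return True
```

Keeping the mapping in one hash would let a later catalog update read all pending corrections in a single pass, though the real storage layout is a design decision for whoever implements this.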