sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

jp2-create failing to retrieve TIFFs from preservation

andrewjbtw opened this issue

Describe the bug
We've now set up jp2-create to retrieve TIFFs/JPGs/PNGs from preservation in order to generate or regenerate JP2 derivatives when an item is updated using the targeted file updates process in Preassembly. This has been working in stage, but items in prod are hitting errors when trying to retrieve TIFF files:

jp2-create : Unable to reach Preservation Catalog - failed with Faraday::ConnectionFailed: Failed to open TCP connection to preservation-catalog-prod.stanford.edu:443 (execution expired)

There are four items with this issue currently: https://argo.stanford.edu/catalog?f%5Bnonhydrus_apo_title_ssim%5D%5B%5D=Kogu+me+lugu&f%5Bwf_wps_ssim%5D%5B%5D=assemblyWF%3Ajp2-create%3Aerror

I've retried a couple of times, and either the jp2-create step continues to hit the same error or it retrieves empty, zero-byte files. In the zero-byte case, a file with the correct filename is retrieved but the file is empty. This causes the next step in the assemblyWF to fail, because that step validates checksums against the Cocina, and the Cocina stores the correct checksum for the file in preservation.
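To illustrate the failure mode, here is a minimal sketch of the kind of check that would catch a zero-byte retrieval before the workflow moves on. It is not the robot's actual code: `check_retrieved_file` is a hypothetical helper, and it assumes the expected checksum from the Cocina is an MD5 hex digest.

```ruby
require 'digest'

# Hypothetical helper (not part of common-accessioning): classify a file
# retrieved from preservation before handing it to the next workflow step.
# expected_md5 is assumed to be the MD5 hex digest recorded in the Cocina.
def check_retrieved_file(path, expected_md5)
  return :missing unless File.exist?(path)
  # A zero-byte file has the right name but no content.
  return :empty if File.zero?(path)

  actual_md5 = Digest::MD5.file(path).hexdigest
  actual_md5 == expected_md5 ? :ok : :checksum_mismatch
end
```

A caller could treat anything other than `:ok` as a retrieval failure and leave the step in error, rather than letting the empty file reach checksum validation.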

User Impact
This ends up blocking accessioning on certain items where JP2s should be (re)generated when the item is updated.

To Reproduce
Steps to reproduce the behavior:

  1. Retry jp2-create on one of the four items that have an error.
  2. Either the step will fail again or it will succeed but then the very next step checksum-compute will hit an error because the file retrieved is actually invalid.

Expected behavior
The image files should be retrieved from preservation.

Additional context
These files are probably larger than my test files in stage. That could be a factor. But these files are only a few MB, not huge.

Note that there has been some previous work on this issue. Relevant Slack thread.

I think where things stand is:

  • there was a problem with assembly robots trying to get the file through the prescat load balancer
  • that was switched to get the file from a prescat node
  • now a file is copied into /dor/assembly but it's not a TIFF. Instead it's a text file that contains a 302 redirect message
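The last point above (a text file arriving in place of the TIFF) could be detected by checking the file's leading bytes. This is only a sketch under assumptions: `tiff_file?` is a hypothetical helper, not something in the robots today, and it only recognizes the two classic TIFF byte-order signatures (not BigTIFF).

```ruby
# Classic TIFF files start with "II*\0" (little-endian) or "MM\0*" (big-endian).
TIFF_SIGNATURES = ["II*\x00".b, "MM\x00*".b].freeze

# Hypothetical helper: true if the file at path begins with a TIFF signature.
# A saved redirect message or an empty file returns false.
def tiff_file?(path)
  header = File.binread(path, 4)
  TIFF_SIGNATURES.include?(header)
end
```

A check like this would flag the 302 response body as soon as it is written to /dor/assembly, instead of surfacing later as a checksum error.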

I didn't see this problem in stage when testing before release so it could be that prod is configured in a different way.

Based on my testing, this works in QA but not in stage or prod. The problem in stage is different from the one in prod. It seems that the Apache configuration is inconsistent between environments.

https://github.com/sul-dlss/operations-tasks/issues/3522 requests that Ops fix the issue.

We believe this has been fixed by the Ops change plus some networking changes, so I'm closing.