sul-dlss / was-registrar-app

Rails app to organize downloaded web archiving data and trigger preassembly/accessioning when appropriate

1 of the 3 seeds in a collection registered using WAS does not appear in SWAP

peterchanws opened this issue · comments

I think the different components in stage may be out of sync. In stage, I don't see seed items for any of these URLs:

http://sensesplaces.org/
http://weatheringin.blogspot.pt/
http://reaisjogosvirtuais.blogspot.pt/

The only items under the Seed APO currently in stage are in this search:
https://argo-stage.stanford.edu/catalog?f%5Bnonhydrus_apo_title_ssim%5D%5B%5D=Web+Archive+Seed+Object+APO&f%5BobjectType_ssim%5D%5B%5D=item

8 of those 9 items show errors. The 9th item is fv957rc7483 and I think that's a crawl object that should be switched back to the crawl APO. I don't see metadata in any of those items that connect them to the URLs that are in swap-stage.

My guess is that swap-stage is showing data that's left over from previous testing.

I don't know if this will get http://senseplaces.org/ fully working but I think these are the remaining steps to go through:

  1. Switch the crawl object fv957rc7483 back to the public crawl APO. I think that object was created directly by WAS Registrar in stage.
  2. Register the seed object for http://senseplaces.org/ following the steps on Consul for seed registration. This looks like it has to be done in Argo, not in WAS Registrar: https://consul.stanford.edu/display/WARC/Initiating+Seed+Object+Accessioning
  3. You may need to also change the rights on the Senses Places collection to "world". Currently it's set to "dark": https://argo-stage.stanford.edu/view/druid:nm187fx5259

I have registered the 3 seed objects. All of them failed with an error such as:
thumbnail-generator : Thumbnail for druid druid:zx403wd7216 and http://weatheringin.blogspot.pt/ can't be generated. #FAIL# Unable to load the address! with HTTP status: 200, HTTP message: OK
druid:zx403wd7216
druid:sj048rz4005
druid:rw393cb5731

I will leave steps 1 and 2 for now.

@andrewjbtw
What do you mean by step 1, "Switch the crawl object fv957rc7483 back to the public crawl APO"?
When I tried to change the rights on the Senses Places collection (https://argo-stage.stanford.edu/view/druid:nm187fx5259) from "dark" to "world", I got "We're sorry, but something went wrong."

I have uploaded the metadata spreadsheet to the Web Archive Seed Object APO. Since there are still errors, Jessica told me I won't be able to see the metadata.

I was able to set the Senses Places collection to "world". There's now an error at "transfer-object" but that's not a problem with the web archive object. It's an issue that's being worked on that affects all objects in the test environment.

I've been looking through web archiving documentation and I think that the thumbnail-generator works by creating thumbnails from crawl data that has already been deposited.

So the high-level overview of the steps seems to be:

  1. Accession crawl data
  2. Register seed
  3. System generates thumbnail for seed by making use of crawl data

If the thumbnail generator is failing, that could mean:

  1. The seed URL is not represented in the accessioned crawl data
  2. The logic that connects the seed URL with crawl data isn't matching up
  3. Something else is wrong
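
To tell the first two cases apart, one option is to compare the seed URL against the list of URLs captured in the crawl, first exactly and then after light normalization. The sketch below is only an illustration: the function names, the normalization rules (lowercased host, dropped "www." prefix, dropped trailing slash), and the example URLs are all assumptions, and this is not how thumbnail-generator itself matches seeds to crawl data.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Hypothetical comparison key: lowercased host without 'www.', plus path without trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def diagnose(seed_url: str, captured_urls: list[str]) -> str:
    """Classify how (or whether) a seed URL is represented in a list of captured URLs."""
    if seed_url in captured_urls:
        # Seed is present as-is; a failure would point to "something else".
        return "exact match"
    key = normalize(seed_url)
    if any(normalize(u) == key for u in captured_urls):
        # Seed is in the crawl, but only a looser comparison finds it (case 2).
        return "match only after normalization"
    # Seed does not appear in the crawl data at all (case 1).
    return "not in crawl data"


# Illustrative data only; these are not URLs from the actual crawl.
captured = ["http://www.example.org/", "http://example.org/about"]
for seed in ["http://example.org/about", "http://example.org", "http://example.org/missing"]:
    print(seed, "->", diagnose(seed, captured))
```

A "match only after normalization" result would suggest the matching logic is the problem, while "not in crawl data" would suggest the crawl needs to be accessioned or checked first.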

I think that recent changes to thumbnail generation may help to resolve this issue. Since the item seems to show up correctly in SearchWorks and can link to archived content in SWAP, can we close this and reopen when specific issues come up again?

https://searchworks.stanford.edu/view/ts331yh1329

@peterchanws @andrewjbtw is this OK to close based on thumbnail improvements?

Thanks, Ed. Installing pywb solves the issue.