Archiving website at Internet Archive
jywarren opened this issue
Due to lack of funds, SpectralWorkbench.org will be going offline at the end of August. We're working to ensure this collaboratively contributed database of spectra remains accessible with the help of the Internet Archive.
Here we're submitting all the URLs we can to the Wayback Machine's Google Sheets service for archiving:
Google Sheets limitations:
- A file imported from your computer into Google Sheets can have at most 40,000 rows and 100 MB.
- Google Sheets itself has no row limit, but a sheet may contain up to 5 million cells.

Internet Archive limitations:
- A Patron can process just one Google Sheet at a time.
- A Patron can make 100,000 captures per day.
- The number of concurrent captures a Patron runs in the background when processing a Google Sheet is limited to 4, to avoid system overload.
- Save Page Now can make 100,000 captures from the same host per day.

Email reporting:
- A Patron receives email notifications at the start and end of processing a Google Sheet.
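Taken together, those caps set a floor on the schedule. A quick back-of-the-envelope sketch (the 251,374 total is the highest spectrum ID from the list below; it assumes every capture succeeds on the first pass, which the rest of this thread shows it won't):

```python
import math

total_urls = 251_374        # spectrum pages alone; sets/contributors add more
captures_per_day = 100_000  # Save Page Now per-host daily cap quoted above

# Minimum calendar days to capture every spectrum page, best case
min_days = math.ceil(total_urls / captures_per_day)
print(min_days)  # 3
```

In practice failures and retries stretch this out considerably, as the batch logs below show.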
We'll need to archive the following URL ranges (from https://spectralworkbench.org/stats):
- generate spreadsheet for https://spectralworkbench.org/sets?page=115 (oldest, so page=1..115)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/1GkC0eaiIP2k11jYZC3cDV0L21ZUWT5pqZRjWrgy7A-4/edit#gid=652829455
- submit to wayback machine
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/sets/3706 (latest, so /1..3706)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/1wsFFeAxvfyycEYsrAiO7rwpz9QlJO3eQW6wahBujr-k/edit#gid=704157721
- submit to wayback machine (done by @Mr0grog)
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/spectrums/251374 (latest, so /1..251374)
- first 79999: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=0 (need to split in 3)
- submit to wayback machine: https://archive.org/services/wayback-gsheets/check?job_id=0a5906ae-dc53-4546-a614-ab1f4e20dfc6&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI%2Fedit%3Fusp%3Dsharing
- recover failed URLs and sort (only 8k succeeded, 72k failed)
- resubmit
- recover failed URLs and sort
- resubmit 2nd time (appx 10k succeeded before capped)
- resubmit 3rd time (from sheet)
- resubmit remaining 38k 4th time (sheet)
- resubmit remaining 24k as 6th attempt (tracking)
- only 600 left, adding to next batch
- 2nd batch (spectrums_080000-159998): https://docs.google.com/spreadsheets/d/1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc/edit?usp=sharing
- submit to wayback machine: @Mr0grog https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522
- recover failed URLs and sort
- resubmit https://archive.org/services/wayback-gsheets/check?job_id=713ddd55-36bd-4c3a-8624-52af82558e0d&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%3Fusp%3Dsharing
- recover failed URLs and sort, merge with leftovers from batch 1; submit, sheet
- recover failed and sort (last batch failed with a Python error; process, sheet)
- resubmit Oct 1: sheet, process (63k remaining of ~80k; hit the host limit on the last 2 runs, but most were already archived on Aug 30)
- We're now no longer rerunning "host limit + Already captured + 200" rows, which may mean we miss some "job failed" captures in that batch, but we can return to this sheet later
- Moving remaining 7814 spectra to 3rd batch
- 3rd batch (spectrums_159999-239997): https://docs.google.com/spreadsheets/d/1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE/edit?usp=sharing
- submit to wayback machine: @bensheldon https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669
- recover failed URLs and sort: sheet, process, 83,244 spectra (incl. leftover 8k from 2nd batch) - may be past the host limit too... not sure, because of timezones
- recover and resubmit after sigterm worker failure sheet process (about 10k had run)
- 3rd submission of 3rd batch with 4th batch combined: sheet process
- 4th submission of last 7495 spectra! sheet process
- 5th Oct 8 submission sheet process
- 4th batch (spectrums_239998-251374): https://docs.google.com/spreadsheets/d/1PP9CQ3-acLkH2jVMMxsYYk0h6wHkNkkXveh6C9ELrtQ/edit?usp=sharing (merged with the 3rd batch above!)
- https://spectralworkbench.org/contributors?page=847 (oldest, so page=1..847)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/12E5yW9QWaXWWSv6YCDBeH8tESJ96HYqAiqugtVvSq60/edit#gid=1424008368
- submit to wayback machine
- @Mr0grog did this outside of sheets, did any fail?
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/profile/USERNAME (needs db dump of usernames BUT they should be gotten if we get all spectra)
- submit to wayback machine
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/tags/TAGNAME (needs db dump of tagnames BUT they should be gotten if we get all spectra)
- submit to wayback machine
- recover failed URLs and sort
- resubmit
How to help
We'll need to split up the URL lists into separate sheets, due to the per-user limits. We could use help creating spreadsheets, especially for the ranges above, where one sheet contains, for example, every URL from https://spectralworkbench.org/sets?page=1 to https://spectralworkbench.org/sets?page=115.
If you can make such a spreadsheet (even generating it locally with a script and importing it as a CSV to GSheets), please 1) share it with jywarren@gmail.com and 2) link to it from a comment below so we can track above.
If you submit the sheet, please still share it with me and we'll track it in the list above, including finding any failed URLs and re-submitting if needed.
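A sheet like that can be generated locally in a few lines of Python; a minimal sketch, assuming the archiver simply reads one URL per row from the first column:

```python
import csv
from io import StringIO

def page_urls(base, last_page):
    """Yield one URL per results page, page=1..last_page."""
    for page in range(1, last_page + 1):
        yield f"{base}?page={page}"

def to_csv(urls):
    """One URL per row, ready to import into Google Sheets."""
    buf = StringIO()
    writer = csv.writer(buf)
    for url in urls:
        writer.writerow([url])
    return buf.getvalue()

urls = list(page_urls("https://spectralworkbench.org/sets", 115))
print(urls[0])   # https://spectralworkbench.org/sets?page=1
print(urls[-1])  # https://spectralworkbench.org/sets?page=115
```

Write `to_csv(urls)` to a file and use File → Import in Google Sheets.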
Thank you!!!!
Happy to help if this is still needed. As a side note, I noticed https://spectralworkbench.org/sets?page=115 seems to get me a 500 error, so it looks like there’s something broken that prevents the last page of results from rendering.
Here are some Wayback-compatible sheets for the 4 types that don’t need a DB dump. I broke up the spectrums into batches of 10,000, but did it with a quick script, so I can re-do them easily if another batch size seems reasonable.
https://drive.google.com/drive/folders/1q3p6k5Q5fy0KqxFy-Tav20_LZSVAIUx7?usp=sharing
I see - does 114 work?
Oh excellent. I just made a script to generate the spectrums one - adding the spreadsheet above. Some spectra will have been deleted and won't exist, and I can try running a query to find exactly which do exist... but this is a good backup approach if we don't finish in time. And I should try doing that one in 100k batches i think... or maybe 3x 80k batches.
I see the contributors and sets too, will add links above. HUGE APPRECIATION THANK U!
Ah, can you allow anyone to be an editor? Apparently the script needs that. Or, possibly, if you share GSheets privileges with the Archive service, it can read the sheets once you begin the process; if you do, please share the in-progress status link, which I believe should be world-readable. Maybe try the sets/# collection first?
If you're able to re-batch the spectrums into 79,999-long batches, I think that's the ideal size? The limits are listed differently in different places, but I believe 80k is the shortest limit I've seen.
Thanks again, this is super helpful!
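The 79,999-long split is mechanical; a small sketch (a hypothetical helper, not anyone's actual script) that reproduces the batch boundaries used in this thread:

```python
def batches(first_id, last_id, size=79_999):
    """Split an inclusive ID range into consecutive batches of at most `size` IDs."""
    start = first_id
    while start <= last_id:
        end = min(start + size - 1, last_id)
        yield start, end
        start = end + 1

for lo, hi in batches(1, 251_374):
    print(f"spectrums_{lo:06d}-{hi:06d}")
# spectrums_000001-079999
# spectrums_080000-159998
# spectrums_159999-239997
# spectrums_239998-251374
```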
I'll begin submitting my own initial 79,999 list of spectra starting at https://spectralworkbench.org/spectrums/1 up to https://spectralworkbench.org/spectrums/79999, since I have that ready to submit.
Lol fwiw I see this on the first 80k records: Velocity: ~7 rows/min. ETA: ~212 hours.
That's about 9 days!
So, if you're able to submit the second set of 80k, we'll need all the time we can get! Although, if it takes that long, it means we can likely submit more than one batch at a time, assuming the SpectralWorkbench.org server can handle all the requests.
does 114 work?
Yep, works fine.
can you allow anyone to be editor?
Done. Didn’t think of this before, sorry!
If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size?
Also done! In the same folder:
- spectrums_000001-079999 (You're already doing this one, though)
- spectrums_080000-159998 (Just started this on my archive.org account)
- spectrums_159999-239997
- spectrums_239998-251374
I also ran the sheet for all the /sets/<n> URLs last night.
And I did all the /contributors?page=<n> URLs yesterday, too, just not via sheets — I had an old batch script I wrote when the Save Page Now 2 API was in beta, and pulled it out for this to see if it worked any better. It's kind of six of one, half a dozen of the other.
Update: tracking URL for the spectrums_080000-159998 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522
Update: tracking URL for the spectrums_159999-239997 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669
Started a second round on the rows that failed from spectrums_080000-159998. There are 493 that were 5xx errors or that failed because of something in SPN. I did not include a lot of rows that had a 502 error accessing favicon.ico, since it doesn't seem critical and was surely captured at a recent time already.
Thanks all! I found that in spectrums 0-80000, about 8k succeeded and 72k failed. But I also learned that Archive Team has been independently trying, so it's possible they started theirs and we hit the server too hard by doubling up? A lot of mine have 502s, which is the server responding too slowly. You can see the results here, which I sorted by success/fail in tab 2: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=1660752805
On the upside, it took only 17 hours, not 9 days.
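The success/fail sort can be scripted as well; a minimal sketch, assuming the results export has url and status columns (the actual sheet's column names may differ):

```python
import csv
from io import StringIO

def split_results(rows, status_field="status"):
    """Partition result rows into (succeeded, failed) by HTTP status.
    Treats any 2xx status as success; everything else (502s, blanks) as failed."""
    ok, failed = [], []
    for row in rows:
        status = row.get(status_field, "").strip()
        (ok if status.startswith("2") else failed).append(row)
    return ok, failed

# Sample export standing in for the real sheet's data
raw = ("url,status\n"
       "https://spectralworkbench.org/spectrums/1,200\n"
       "https://spectralworkbench.org/spectrums/2,502\n")
ok, failed = split_results(list(csv.DictReader(StringIO(raw))))
print(len(ok), len(failed))  # 1 1
```

The `failed` list is what gets written to a new sheet for resubmission.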
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
Thank you @bensheldon !!!! Appreciate it!
Re the favicon, is it clear that the rest of the page archived fine? Thanks!!
I'm marking each comment with 👍 if I've added it to the list above. Let me know if I miss anything!
Also marked items in the list at top which will be captured via outlinks of spectra, which is one good reason to focus on just getting all the spectra, since every other page is within one outlink hop of those.
Thanks everyone!!! ❤️
Ah, I'm now getting this:
This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.
I will email them. You may get similar errors in your batches, if this is a system-wide limit...
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
I got slightly better results (there were still ~64k that were already captured). I wasn’t worrying about these since, if they were already captured a zillion times in the same day, the job was done.
Re the favicon, is it clear that the rest of the page archived fine?
I’m not totally sure. IIRC from talking with Vangelis (who did the SPN2 rewrite), the capture would still get saved in this situation, but I may be misremembering or things may have changed.
Ah, i'm now getting this: This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.
My second run was entirely this. Given how high that number is (!) and the fact that things sped up, it definitely sounds like other folks at the archive or elsewhere are on this.
I’d consider switching gears to taking the lists of all the pages you want and using the CDX index to make sure they’ve been archived. Then just make a smaller list of what’s missing (if anything).
You can use the wayback Python package to do this more easily (disclosure: I'm the maintainer):
```python
import wayback
from datetime import date

client = wayback.WaybackClient()

# List the time, status, and URL captured for everything since 8/10
for record in client.search('https://spectralworkbench.org/*', from_date=date(2022, 8, 10), limit=10_000):
    print(f"{record.timestamp}: ({record.status_code}) {record.url}")

# Outputs:
# 2022-08-18 00:00:20+00:00: (200) https://spectralworkbench.org/
# 2022-08-18 00:00:39+00:00: (503) https://spectralworkbench.org/
# 2022-08-17 19:14:39+00:00: (301) http://spectralworkbench.org/analyze/spectrum/4474
# 2022-08-16 23:27:13+00:00: (None) https://spectralworkbench.org/assets/adapterjs/publish/adapter.min-0c17431f9d1a50badfff11e14667aeda1023bfebbccfc27893d88cb46cbc9687.js
# 2022-08-17 07:26:48+00:00: (200) https://spectralworkbench.org/assets/analyze-ddc787ced325eab2b23f319d4886faa8dbb53581999f65967b30fe0d93fc3527.js
# etc...
```
And probably check the ones with status codes ≥200 and <300, or ≥400 and <500, to make sure they cover every URL you're concerned about.
(Note the limit=10_000 doesn't limit the number of results, just the number of results per page; beyond this, it'll automatically iterate through every page, so you don't need to worry about anything other than setting it. You do have to set it to something if you want to get all the results in a large set. It's definitely a design problem that needs fixing, and comes from some funky behavior I didn't understand originally in Wayback's APIs. 😞)
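From there, the "smaller list of what's missing" is just a set difference; a minimal sketch with inline sample records standing in for real CDX results (the acceptable-status rule mirrors the 2xx/4xx suggestion above):

```python
def missing_urls(wanted, records):
    """Return wanted URLs with no acceptable capture: 2xx (archived fine)
    or 4xx (the page genuinely doesn't exist, so nothing more to capture)."""
    def acceptable(status):
        return status is not None and (200 <= status < 300 or 400 <= status < 500)
    covered = {r["url"] for r in records if acceptable(r["status_code"])}
    return [u for u in wanted if u not in covered]

wanted = [f"https://spectralworkbench.org/spectrums/{i}" for i in (1, 2, 3)]
# Sample records; in practice, collect (record.url, record.status_code)
# pairs from the client.search() loop above
records = [
    {"url": "https://spectralworkbench.org/spectrums/1", "status_code": 200},
    {"url": "https://spectralworkbench.org/spectrums/2", "status_code": 503},
]
print(missing_urls(wanted, records))
# ['https://spectralworkbench.org/spectrums/2', 'https://spectralworkbench.org/spectrums/3']
```

Only the returned URLs would need resubmitting, which avoids hammering the origin server again.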
Hi all, just checking in to say:
- I asked folks at the Archive if they could lift the limit; it seems they maybe can...
- I haven't heard back from Archive Team re: coordination yet! I reached out on Twitter too.
- I haven't gotten clarification yet on whether the favicon error means the page was/wasn't archived. I'll try asking that next.
Thanks all!
Hi all, updating -- MapKnitter is essentially done, just about 50 maps left to double-check. Circling back here today and tomorrow. I just resubmitted the 0-79k for a 3rd pass, since the second had hit the 100k per host limit. Updated above. Hopefully we get a better sense of the yield on this run.
Only 10k succeeded in that last run on 0-79k (3rd pass), leaving 53k from the first batch. But we hit the cap with ~35k of the 62.9k total requests. That still means that only about 1/3 of the remainder succeeded :-/
If I can get the 1st and 2nd batch down below 40k I'll start combining them. I can also try to narrow the required ones by checking which have been archived successfully; that won't hit our server as hard.
I sent the second batch's latest submission in for a check, to see what had already been done, perhaps by the Archive Team:
However, it's not clear to me that it will create a new tab showing what it completed so we can read/sort it... let's see.
Just an update that I have about 80k left. I'm skipping some that went past the daily host limit but showed 200 Success, even though occasionally those may not be a complete backup; I sampled a number of them and they were OK, though.
We are under some pressure to wrap up asap so I am moving as fast as I can. Thanks!
Only 28k remaining, running now.
Only 7400 left now!
1284 left, re-running. Very close.
OK, I think we got all but 8; unfortunately these seem not to have worked for some reason:
| URL | Result | | Error |
| -- | -- | -- | -- |
| https://spectralworkbench.org/spectrums/80 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/80 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/123 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/123 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/143 | New capture | | Internal Server Error for https://spectralworkbench.org/spectrums/143 (HTTP status=500). |
| https://spectralworkbench.org/spectrums/1208 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1208 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/1205 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1205 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/113 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/113 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/114 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/114 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/22 | Already captured | - | Internal Server Error for https://spectralworkbench.org/spectrums/22 (HTTP status=500). |
Going to shut things down as soon as we can now!
Thanks everyone for your help!!