Archiving website at Internet Archive
jywarren opened this issue
Due to lack of funds, SpectralWorkbench.org will be going offline at the end of August. We're working to ensure this collaboratively contributed database of spectra remains accessible with the help of the Internet Archive.
Here we're submitting all the URLs we can to the Wayback Machine's Google Sheets service for archiving:
Google Sheets limitations:
- A file imported from your computer into Google Sheets can have at most 40,000 rows and 100 MB.
- Google Sheets itself has no row limit, but a sheet may contain up to 5 million cells.

Internet Archive limitations:
- A Patron can process just one Google Sheet at a time.
- A Patron can make 100,000 captures per day.
- The number of concurrent captures a Patron runs in the background when processing a Google Sheet is limited to 4, to avoid system overload.
- Save Page Now can make 100,000 captures from the same host per day.

Email reporting:
- A Patron receives email notifications at the start and end of processing a Google Sheet.
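Taken together, those caps set a floor on the schedule. A quick back-of-the-envelope sketch (the 251,374 total is the highest spectrum ID from the list below; it assumes every capture succeeds on the first pass, which the rest of this thread shows it won't):

```python
import math

total_urls = 251_374        # spectrum pages alone; sets/contributors add more
captures_per_day = 100_000  # Save Page Now per-host daily cap quoted above

# Minimum calendar days to capture every spectrum page, best case
min_days = math.ceil(total_urls / captures_per_day)
print(min_days)  # 3
```

In practice failures and retries stretch this out considerably, as the batch logs below show.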
We'll need to archive the following URL ranges (from https://spectralworkbench.org/stats):
- generate spreadsheet for https://spectralworkbench.org/sets?page=115 (oldest, so page=1..115)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/1GkC0eaiIP2k11jYZC3cDV0L21ZUWT5pqZRjWrgy7A-4/edit#gid=652829455
- submit to wayback machine
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/sets/3706 (latest, so /1..3706)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/1wsFFeAxvfyycEYsrAiO7rwpz9QlJO3eQW6wahBujr-k/edit#gid=704157721
- submit to wayback machine (done by @Mr0grog)
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/spectrums/251374 (latest, so /1..251374)
- first 79999: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=0 (need to split in 3)
- submit to wayback machine: https://archive.org/services/wayback-gsheets/check?job_id=0a5906ae-dc53-4546-a614-ab1f4e20dfc6&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI%2Fedit%3Fusp%3Dsharing
- recover failed URLs and sort (only 8k succeeded, 72k failed)
- resubmit
- recover failed URLs and sort
- resubmit 2nd time (appx 10k succeeded before capped)
- resubmit 3rd time (from sheet)
- resubmit remaining 38k 4th time (sheet)
- resubmit remaining 24k as 6th attempt (tracking)
- only 600 left, adding to next batch
- 2nd batch (spectrums_080000-159998): https://docs.google.com/spreadsheets/d/1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc/edit?usp=sharing
- submit to wayback machine: @Mr0grog https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522
- recover failed URLs and sort
- resubmit https://archive.org/services/wayback-gsheets/check?job_id=713ddd55-36bd-4c3a-8624-52af82558e0d&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%3Fusp%3Dsharing
- recover failed URLs and sort, merge with leftovers from batch 1; submit, sheet
- recover failed and sort (last batch failed with a Python error; process, sheet)
- resubmit Oct 1: sheet, process (63k remaining of ~80k; hit the host limit on the last 2 runs, but most were already archived on Aug 30)
- We're now no longer rerunning "host limit + Already captured + 200" rows, which may mean we miss some "job failed" captures in that batch, but we can return to this sheet later
- Moving remaining 7814 spectra to 3rd batch
- 3rd batch (spectrums_159999-239997): https://docs.google.com/spreadsheets/d/1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE/edit?usp=sharing
- submit to wayback machine: @bensheldon https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669
- recover failed URLs and sort: sheet, process, 83,244 spectra (incl. leftover 8k from 2nd batch) - may be past the host limit too... not sure, because of timezones
- recover and resubmit after sigterm worker failure sheet process (about 10k had run)
- 3rd submission of 3rd batch with 4th batch combined: sheet process
- 4th submission of last 7495 spectra! sheet process
- 5th Oct 8 submission sheet process
- 4th batch (spectrums_239998-251374): https://docs.google.com/spreadsheets/d/1PP9CQ3-acLkH2jVMMxsYYk0h6wHkNkkXveh6C9ELrtQ/edit?usp=sharing (merged with the 3rd batch above!)
- https://spectralworkbench.org/contributors?page=847 (oldest, so page=1..847)
- initial spreadsheet (tx @Mr0grog!): https://docs.google.com/spreadsheets/d/12E5yW9QWaXWWSv6YCDBeH8tESJ96HYqAiqugtVvSq60/edit#gid=1424008368
- submit to wayback machine
- @Mr0grog did this outside of sheets, did any fail?
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/profile/USERNAME (needs db dump of usernames BUT they should be gotten if we get all spectra)
- submit to wayback machine
- recover failed URLs and sort
- resubmit
- https://spectralworkbench.org/tags/TAGNAME (needs db dump of tagnames BUT they should be gotten if we get all spectra)
- submit to wayback machine
- recover failed URLs and sort
- resubmit
How to help
We'll need to split up the URL lists into separate sheets, due to the per-user limits. We could use help creating spreadsheets, especially for the ranges above, where one sheet contains, for example, every URL from https://spectralworkbench.org/sets?page=1 to https://spectralworkbench.org/sets?page=115.
If you can make such a spreadsheet (even generating it locally with a script and importing it as a CSV to GSheets), please 1) share it with jywarren@gmail.com and 2) link to it from a comment below so we can track above.
If you submit the sheet, please still share it with me and we'll track it in the list above, including finding any failed URLs and re-submitting if needed.
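A sheet like that can be generated locally in a few lines of Python; a minimal sketch, assuming the archiver simply reads one URL per row from the first column:

```python
import csv
from io import StringIO

def page_urls(base, last_page):
    """Yield one URL per results page, page=1..last_page."""
    for page in range(1, last_page + 1):
        yield f"{base}?page={page}"

def to_csv(urls):
    """One URL per row, ready to import into Google Sheets."""
    buf = StringIO()
    writer = csv.writer(buf)
    for url in urls:
        writer.writerow([url])
    return buf.getvalue()

urls = list(page_urls("https://spectralworkbench.org/sets", 115))
print(urls[0])   # https://spectralworkbench.org/sets?page=1
print(urls[-1])  # https://spectralworkbench.org/sets?page=115
```

Write `to_csv(urls)` to a file and use File → Import in Google Sheets.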
Thank you!!!!
Happy to help if this is still needed. As a side note, I noticed https://spectralworkbench.org/sets?page=115 seems to get me a 500 error, so it looks like there’s something broken that prevents the last page of results from rendering.
Here are some Wayback-compatible sheets for the 4 types that don’t need a DB dump. I broke up the spectrums into batches of 10,000, but did it with a quick script, so I can re-do them easily if another batch size seems reasonable.
https://drive.google.com/drive/folders/1q3p6k5Q5fy0KqxFy-Tav20_LZSVAIUx7?usp=sharing
I see - does 114 work?
Oh excellent. I just made a script to generate the spectrums one - adding the spreadsheet above. Some spectra will have been deleted and won't exist, and I can try running a query to find exactly which do exist... but this is a good backup approach if we don't finish in time. And I should try doing that one in 100k batches i think... or maybe 3x 80k batches.
I see the contributors and sets too, will add links above. HUGE APPRECIATION THANK U!
Ah, can you allow anyone to be an editor? Apparently the script needs that. Or, possibly, if you share GSheets privileges with the Archive service, it can read the sheets once you begin the process; if you do, please share the in-progress status link, which I believe should be world-readable. Maybe try the sets/# collection first?
If you're able to re-batch the spectrums into 79,999-long batches, I think that's the ideal size? The limits are listed differently in different places, but I believe 80k is the shortest limit I've seen.
Thanks again, this is super helpful!
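The 79,999-long split is mechanical; a small sketch (a hypothetical helper, not anyone's actual script) that reproduces the batch boundaries used in this thread:

```python
def batches(first_id, last_id, size=79_999):
    """Split an inclusive ID range into consecutive batches of at most `size` IDs."""
    start = first_id
    while start <= last_id:
        end = min(start + size - 1, last_id)
        yield start, end
        start = end + 1

for lo, hi in batches(1, 251_374):
    print(f"spectrums_{lo:06d}-{hi:06d}")
# spectrums_000001-079999
# spectrums_080000-159998
# spectrums_159999-239997
# spectrums_239998-251374
```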
I'll begin submitting my own initial 79,999 list of spectra starting at https://spectralworkbench.org/spectrums/1 up to https://spectralworkbench.org/spectrums/79999, since I have that ready to submit.
Lol fwiw I see this on the first 80k records: Velocity: ~7 rows/min. ETA: ~212 hours.
That's about 9 days!
So, if you're able to submit the second set of 80k, we'll need all the time we can get! Although, if it takes that long, it means we can likely submit more than one batch at a time, assuming the SpectralWorkbench.org server can handle all the requests.
does 114 work?
Yep, works fine.
can you allow anyone to be editor?
Done. Didn’t think of this before, sorry!
If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size?
Also done! In the same folder:
- spectrums_000001-079999 (You're already doing this one, though)
- spectrums_080000-159998 (Just started this on my archive.org account)
- spectrums_159999-239997
- spectrums_239998-251374
I also ran the sheet for all the /sets/<n> URLs last night.
And I did all the /contributors?page=<n> URLs yesterday, too, just not via sheets — I had an old batch script I wrote when the Save Page Now 2 API was in beta, and pulled it out for this to see if it worked any better. It's kind of six of one, half a dozen of the other.
Update: tracking URL for the spectrums_080000-159998 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522
Update: tracking URL for the spectrums_159999-239997 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669
Started a second round on the rows that failed from spectrums_080000-159998. There are 493 that were 5xx errors or that failed because of something in SPN. I did not include a lot of rows that had a 502 error accessing favicon.ico, since it doesn't seem critical and was surely captured at a recent time already.
Thanks all! I found that in spectrums 0-80000, about 8k succeeded and 72k failed. But I also learned that Archive Team has been independently trying, so it's possible they started theirs and we hit the server too hard by doubling up? A lot of mine have 502s, which is the server responding too slowly. You can see the results here, which I sorted by success/fail in tab 2: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=1660752805
On the upside, it took only 17 hours, not 9 days.
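The success/fail sort can be scripted as well; a minimal sketch, assuming the results export has url and status columns (the actual sheet's column names may differ):

```python
import csv
from io import StringIO

def split_results(rows, status_field="status"):
    """Partition result rows into (succeeded, failed) by HTTP status.
    Treats any 2xx status as success; everything else (502s, blanks) as failed."""
    ok, failed = [], []
    for row in rows:
        status = row.get(status_field, "").strip()
        (ok if status.startswith("2") else failed).append(row)
    return ok, failed

# Sample export standing in for the real sheet's data
raw = ("url,status\n"
       "https://spectralworkbench.org/spectrums/1,200\n"
       "https://spectralworkbench.org/spectrums/2,502\n")
ok, failed = split_results(list(csv.DictReader(StringIO(raw))))
print(len(ok), len(failed))  # 1 1
```

The `failed` list is what gets written to a new sheet for resubmission.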
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
Thank you @bensheldon !!!! Appreciate it!
Re the favicon, is it clear that the rest of the page archived fine? Thanks!!
I'm marking each comment with 👍 if I've added it to the list above. Let me know if I miss anything!
Also marked items in the list at top which will be captured via outlinks of spectra, which is one good reason to focus on just getting all the spectra, since every other page is within one outlink hop of those.
Thanks everyone!!! ❤️
Ah, I'm now getting this:
This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.
I will email them. You may get similar errors in your batches, if this is a system-wide limit...
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
I got slightly better results (there were still ~64k that were already captured). I wasn’t worrying about these since, if they were already captured a zillion times in the same day, the job was done.
Re the favicon, is it clear that the rest of the page archived fine?
I’m not totally sure. IIRC from talking with Vangelis (who did the SPN2 rewrite), the capture would still get saved in this situation, but I may be misremembering or things may have changed.
Ah, i'm now getting this: This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.
My second run was entirely this. Given how high that number is (!) and the fact that things sped up, it definitely sounds like other folks at the archive or elsewhere are on this.
I’d consider switching gears to taking the lists of all the pages you want and using the CDX index to make sure they’ve been archived. Then just make a smaller list of what’s missing (if anything).
You can use the wayback Python package to do this more easily (disclosure: I'm the maintainer):
```python
import wayback
from datetime import date

client = wayback.WaybackClient()

# List the time, status, and URL captured for everything since 8/10
for record in client.search('https://spectralworkbench.org/*', from_date=date(2022, 8, 10), limit=10_000):
    print(f"{record.timestamp}: ({record.status_code}) {record.url}")

# Outputs:
# 2022-08-18 00:00:20+00:00: (200) https://spectralworkbench.org/
# 2022-08-18 00:00:39+00:00: (503) https://spectralworkbench.org/
# 2022-08-17 19:14:39+00:00: (301) http://spectralworkbench.org/analyze/spectrum/4474
# 2022-08-16 23:27:13+00:00: (None) https://spectralworkbench.org/assets/adapterjs/publish/adapter.min-0c17431f9d1a50badfff11e14667aeda1023bfebbccfc27893d88cb46cbc9687.js
# 2022-08-17 07:26:48+00:00: (200) https://spectralworkbench.org/assets/analyze-ddc787ced325eab2b23f319d4886faa8dbb53581999f65967b30fe0d93fc3527.js
# etc...
```
And probably check the ones with status codes ≥200 and <300, or ≥400 and <500, to make sure they cover every URL you're concerned about.
(Note the limit=10_000 doesn't limit the number of results, just the number of results per page; beyond this, it'll automatically iterate through every page, so you don't need to worry about anything other than setting it. You do have to set it to something if you want to get all the results in a large set. It's definitely a design problem that needs fixing, and comes from some funky behavior I didn't understand originally in Wayback's APIs. 😞)
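From there, the "smaller list of what's missing" is just a set difference; a minimal sketch with inline sample records standing in for real CDX results (the acceptable-status rule mirrors the 2xx/4xx suggestion above):

```python
def missing_urls(wanted, records):
    """Return wanted URLs with no acceptable capture: 2xx (archived fine)
    or 4xx (the page genuinely doesn't exist, so nothing more to capture)."""
    def acceptable(status):
        return status is not None and (200 <= status < 300 or 400 <= status < 500)
    covered = {r["url"] for r in records if acceptable(r["status_code"])}
    return [u for u in wanted if u not in covered]

wanted = [f"https://spectralworkbench.org/spectrums/{i}" for i in (1, 2, 3)]
# Sample records; in practice, collect (record.url, record.status_code)
# pairs from the client.search() loop above
records = [
    {"url": "https://spectralworkbench.org/spectrums/1", "status_code": 200},
    {"url": "https://spectralworkbench.org/spectrums/2", "status_code": 503},
]
print(missing_urls(wanted, records))
# ['https://spectralworkbench.org/spectrums/2', 'https://spectralworkbench.org/spectrums/3']
```

Only the returned URLs would need resubmitting, which avoids hammering the origin server again.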
Hi all, just checking in to say:
- I asked folks at the Archive if they could lift the limit; it seems they maybe can...
- I haven't heard back from Archive Team re: coordination yet! I reached out on Twitter too.
- I haven't gotten clarification yet on whether the favicon error means the page was/wasn't archived. I'll try asking that next.
Thanks all!
Hi all, updating -- MapKnitter is essentially done, just about 50 maps left to double-check. Circling back here today and tomorrow. I just resubmitted the 0-79k for a 3rd pass, since the second had hit the 100k per host limit. Updated above. Hopefully we get a better sense of the yield on this run.
Only 10k succeeded in that last run on 0-79k (3rd pass), leaving 53k from the first batch. But we hit the cap with ~35k of the 62.9k total requests. That still means that only about 1/3 of the remainder succeeded :-/
If I can get the 1st and 2nd batch down below 40k I'll start combining them. I can also try to narrow the required ones by checking which have been archived successfully; that won't hit our server as hard.
I sent the second batch's latest submission in for a check, to see what had already been done, perhaps by the Archive Team:
However, it's not clear to me that it will create a new tab showing what it completed so we can read/sort it... let's see.
Just an update that I have about 80k left. I'm skipping some that went past the daily host limit but showed 200 Success, even though occasionally those may not be a complete backup; I sampled a number of them and they were OK, though.
We are under some pressure to wrap up asap so I am moving as fast as I can. Thanks!
Only 28k remaining, running now.
Only 7400 left now!
1284 left, re-running. Very close.
OK, I think we got all but 8; unfortunately these seem not to have worked for some reason:
| URL | Result | | Error |
| -- | -- | -- | -- |
| https://spectralworkbench.org/spectrums/80 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/80 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/123 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/123 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/143 | New capture | | Internal Server Error for https://spectralworkbench.org/spectrums/143 (HTTP status=500). |
| https://spectralworkbench.org/spectrums/1208 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1208 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/1205 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1205 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/113 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/113 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/114 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/114 (HTTP status=503). |
| https://spectralworkbench.org/spectrums/22 | Already captured | - | Internal Server Error for https://spectralworkbench.org/spectrums/22 (HTTP status=500). |
Going to shut things down as soon as we can now!
Thanks everyone for your help!!