publiclab / spectral-workbench

Web-based tools for collecting, analyzing, and sharing data from a DIY spectrometer

Home Page: http://spectralworkbench.org


Archiving website at Internet Archive

jywarren opened this issue

Due to lack of funds, SpectralWorkbench.org will be going offline at the end of August. We're working to ensure this collaboratively contributed database of spectra remains accessible with the help of the Internet Archive.

Here we're submitting all the URLs we can to the Internet Archive's Save Page Now Google Sheets service (wayback-gsheets) for archiving. The relevant limits:

Google Sheets limitations:

  • When you import a file from your computer to Google Sheets, it can have a maximum of 40,000 rows and 100MB.
  • Google Sheets does not have a row limit, but a sheet may have up to 5 million cells.

Internet Archive limitations:

  • A Patron can process just one Google Sheet at a time.
  • A Patron can make 100,000 captures per day.
  • The number of concurrent captures a Patron runs in the background when processing a Google Sheet is limited to 4, to avoid system overload.
  • Save Page Now can make 100,000 captures from the same host per day.

Email reporting:

  • A Patron receives email notifications at the start and at the end of processing a Google Sheet.

We'll need to archive the following URL ranges (from https://spectralworkbench.org/stats):

How to help

We'll need to split up the URL lists into separate sheets due to the per-user limits. We could especially use help creating spreadsheets for the ranges above, where one sheet contains, for example, every URL from https://spectralworkbench.org/sets?page=1 to https://spectralworkbench.org/sets?page=115.

If you can make such a spreadsheet (even generating it locally with a script and importing it as a CSV to GSheets), please 1) share it with jywarren@gmail.com and 2) link to it from a comment below so we can track it above.
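For example, a minimal script along those lines (the filename and page range are just the /sets example above; adjust for other ranges):

import csv

# Write one URL per row; import the resulting CSV into Google Sheets.
with open("sets_pages.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for page in range(1, 116):  # pages 1 through 115
        writer.writerow([f"https://spectralworkbench.org/sets?page={page}"])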

If you submit the sheet yourself, please still share it with me and we'll track it in the list above, including finding any failed URLs and re-submitting if needed.

Thank you!!!!

Happy to help if this is still needed. As a side note, I noticed https://spectralworkbench.org/sets?page=115 seems to get me a 500 error, so it looks like there’s something broken that prevents the last page of results from rendering.

Here are some Wayback-compatible sheets for the 4 types that don’t need a DB dump. I broke up the spectrums into batches of 10,000, but did it with a quick script, so I can re-do them easily if another batch size seems reasonable.

https://drive.google.com/drive/folders/1q3p6k5Q5fy0KqxFy-Tav20_LZSVAIUx7?usp=sharing

I see - does 114 work?

Oh excellent. I just made a script to generate the spectrums sheet - adding the spreadsheet above. Some spectra will have been deleted and won't exist, and I can try running a query to find exactly which do exist... but this is a good backup approach if we don't finish in time. And I should try doing that one in 100k batches, I think... or maybe 3 batches of 80k.

I see the contributors and sets too, will add links above. HUGE APPRECIATION THANK U!

Ah, can you allow anyone to be an editor? Apparently the script needs that. Alternatively, if you share GSheets access with the Archive service, it may be able to read the sheets when you start the process; if you do that, please share the in-progress status link, which I believe should be world-readable. Maybe try the sets/# collection first?

If you're able to re-batch the spectrums into 79,999-long batches, I think that's the ideal size? The limits are listed differently in different places, but I believe 80k is the lowest limit I've seen.

Thanks again, this is super helpful!

I'll begin submitting my own initial 79,999 list of spectra starting at https://spectralworkbench.org/spectrums/1 up to https://spectralworkbench.org/spectrums/79999, since I have that ready to submit.
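For reference, a quick sketch of how those ID-range batches can be generated (the max ID below is a placeholder; substitute the actual highest spectrum ID):

import csv

BATCH_SIZE = 79_999
MAX_ID = 239_997  # placeholder: use the real highest spectrum ID

for start in range(1, MAX_ID + 1, BATCH_SIZE):
    end = min(start + BATCH_SIZE - 1, MAX_ID)
    # One CSV per batch, e.g. spectrums_000001-079999.csv
    with open(f"spectrums_{start:06d}-{end:06d}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for spectrum_id in range(start, end + 1):
            writer.writerow([f"https://spectralworkbench.org/spectrums/{spectrum_id}"])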

Lol fwiw I see this on the first 80k records: Velocity: ~7 rows/min. ETA: ~212 hours. That's about 9 days!

So, if you're able to submit the second set of 80k, we'll need all the time we can get! Although, if it takes that long, it means we can likely submit more than one batch at a time, assuming the SpectralWorkbench.org server can handle all the requests.

does 114 work?

Yep, works fine.

can you allow anyone to be editor?

Done. Didn’t think of this before, sorry!

If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size?

Also done! In the same folder:

I also ran the sheet for all the /sets/<n> URLs last night.

And I did all the /contributors?page=<n> URLs yesterday too, just not via sheets — I had an old batch script I wrote when the Save Page Now 2 API was in beta, and pulled it out for this to see if it worked any better. It's kind of six of one, half a dozen of the other.
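For anyone curious, the core of such a script is just a POST per URL to the public SPN2 endpoint. A rough sketch (the API keys are placeholders; you'd use your own archive.org S3-style keys):

import requests

ACCESS_KEY = "YOUR_ACCESS_KEY"  # placeholder: S3-style keys from your archive.org account
SECRET_KEY = "YOUR_SECRET_KEY"

response = requests.post(
    "https://web.archive.org/save",
    headers={
        "Accept": "application/json",
        "Authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
    },
    data={"url": "https://spectralworkbench.org/spectrums/1"},
)
# The JSON response includes a job_id you can poll at
# https://web.archive.org/save/status/<job_id>
print(response.json())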

Started a second round on the rows that failed from spectrums_080000-159998. There are 493 that failed with 5xx errors or because of something in SPN. I did not include the many rows that had a 502 error accessing favicon.ico, since that doesn't seem critical and the favicon was surely captured recently anyway.

Tracking URL: https://archive.org/services/wayback-gsheets/check?job_id=713ddd55-36bd-4c3a-8624-52af82558e0d&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%3Fusp%3Dsharing
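(Roughly how the retry rows can be pulled out of an exported results CSV; the column names here are assumptions, so adjust them to match the actual sheet:)

import csv

# Read the exported results sheet and keep only genuinely failed rows.
with open("spectrums_080000-159998.csv", newline="") as f_in, \
        open("retry.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row in csv.DictReader(f_in):
        message = row.get("result", "")  # assumed name of the status-message column
        # Keep 5xx failures, but skip the non-critical favicon.ico 502s.
        if "HTTP status=5" in message and "favicon.ico" not in message:
            writer.writerow([row["url"]])  # assumed name of the URL column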

Thanks all! I found that in spectrums 0-80000, about 8k succeeded and 72k failed. But I also learned that Archive Team has been independently trying, so it's possible they started theirs and we hit the server too hard by doubling up? A lot of mine have 502s, which indicates the server was responding too slowly. You can see the results here, sorted by success/fail in tab 2: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=1660752805

On the upside, it took only 17 hours, not 9 days.

Am I understanding correctly, @Mr0grog, that you're seeing much better success yields for your submissions?

Thank you @bensheldon !!!! Appreciate it!

Re the favicon, is it clear that the rest of the page archived fine? Thanks!!

I'm marking each comment with 👍 if I've added it to the list above. Let me know if I miss anything!

I also marked items in the list at top that will be captured via outlinks of spectra; that's one good reason to focus on just getting all the spectra, since all other pages are within one outlink hop of those.

Thanks everyone!!! ❤️

Ah, i'm now getting this:

This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.

I will email them. You may get similar errors in your batches, if this is a system-wide limit...

Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?

I got slightly better results (there were still ~64k that were already captured). I wasn’t worrying about these since, if they were already captured a zillion times in the same day, the job was done.

Re the favicon, is it clear that the rest of the page archived fine?

I’m not totally sure. IIRC from talking with Vangelis (who did the SPN2 rewrite), the capture would still get saved in this situation, but I may be misremembering or things may have changed.

Ah, i'm now getting this: This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more.

My second run was entirely this. Given how high that number is (!) and the fact that things sped up, it definitely sounds like other folks at the archive or elsewhere are on this.

I’d consider switching gears to taking the lists of all the pages you want and using the CDX index to make sure they’ve been archived. Then just make a smaller list of what’s missing (if anything).

You can use the wayback Python package to do this more easily (disclosure: I’m the maintainer):

import wayback
from datetime import date

client = wayback.WaybackClient()

# List the time, status, and URL captured for everything since 8/10
for record in client.search('https://spectralworkbench.org/*', from_date=date(2022, 8, 10), limit=10_000):
    print(f"{record.timestamp}: ({record.status_code}) {record.url}")

# Outputs:
# 2022-08-18 00:00:20+00:00: (200) https://spectralworkbench.org/
# 2022-08-18 00:00:39+00:00: (503) https://spectralworkbench.org/
# 2022-08-17 19:14:39+00:00: (301) http://spectralworkbench.org/analyze/spectrum/4474
# 2022-08-16 23:27:13+00:00: (None) https://spectralworkbench.org/assets/adapterjs/publish/adapter.min-0c17431f9d1a50badfff11e14667aeda1023bfebbccfc27893d88cb46cbc9687.js
# 2022-08-17 07:26:48+00:00: (200) https://spectralworkbench.org/assets/analyze-ddc787ced325eab2b23f319d4886faa8dbb53581999f65967b30fe0d93fc3527.js
# etc...

And probably check the ones with status codes >= 200 and < 300, or >= 400 and < 500, to make sure they cover every URL you are concerned about.

(Note that limit=10_000 doesn't limit the total number of results, just the number of results per page; beyond that, it automatically iterates through every page, so you don't need to worry about anything other than setting it. You do have to set it to something if you want to get all the results in a large set. It's definitely a design problem that needs fixing, and it comes from some funky behavior I didn't understand originally in Wayback's APIs. 😞)
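Putting that together, a minimal sketch of the "what's missing" check (the wanted-URLs file is hypothetical, one URL per line; the 2xx filter is the simple version of the status check above):

import wayback
from datetime import date

# URLs we want archived, one per line (hypothetical filename).
with open("wanted_urls.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

# Collect every URL captured successfully since 8/10.
client = wayback.WaybackClient()
archived = set()
for record in client.search('https://spectralworkbench.org/*',
                            from_date=date(2022, 8, 10), limit=10_000):
    if record.status_code and 200 <= record.status_code < 300:
        archived.add(record.url)

missing = wanted - archived
print(f"{len(missing)} URLs still need capturing")
for url in sorted(missing):
    print(url)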

Hi all, just checking in to say:

  1. I asked folks at the Archive if they could lift the limit; it seems they may be able to...
  2. I haven't heard back from Archive Team re: coordination yet! I reached out on Twitter too.
  3. I haven't gotten clarification yet on whether the favicon error means the page was/wasn't archived. I'll try asking that next.

Thanks all!

Hi all, updating -- MapKnitter is essentially done, with just about 50 maps left to double-check. Circling back here today and tomorrow. I just resubmitted the 0-79k batch for a 3rd pass, since the second had hit the 100k-per-host limit. Updated above. Hopefully we get a better sense of the yield on this run.

Only 10k succeeded in that last run on 0-79k (3rd pass), leaving 53k from the first batch. But we hit the cap with ~35k of the 62.9k total requests. That still means that only about 1/3 of the remainder succeeded :-/

If I can get the 1st and 2nd batch down below 40k I'll start combining them. I can also try to narrow the required ones by checking which have been archived successfully; that won't hit our server as hard.

I sent the latest submission of the second batch for a check to see what had already been done, perhaps by the Archive Team:

https://archive.org/services/wayback-gsheets/check?job_id=d885c81f-5849-4083-bdd8-70bec5ac2528&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%23gid%3D452795522

https://docs.google.com/spreadsheets/d/1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac/edit#gid=452795522

However, it's not clear to me whether it will create a new tab showing what it completed so we can read/sort it... let's see.

Just an update that I have about 80k left. I'm skipping some that went past the daily host limit but showed 200 Success; occasionally those may not be complete captures, but I sampled a number of them and they were OK.

We are under some pressure to wrap up asap so I am moving as fast as I can. Thanks!

Only 28k remaining, running now.

Only 7400 left now!

1284 left, re-running. Very close.

OK, I think we got all but 8; unfortunately these seem not to have worked for some reason:

https://spectralworkbench.org/spectrums/80 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/123 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/143 | New capture | Internal Server Error (HTTP status=500)
https://spectralworkbench.org/spectrums/1208 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/1205 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/113 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/114 | New capture | Service Unavailable (HTTP status=503)
https://spectralworkbench.org/spectrums/22 | Already captured | Internal Server Error (HTTP status=500)

Going to shut things down as soon as we can now!

Thanks everyone for your help!!