alphapapa / org-web-tools

View, capture, and archive Web pages in Org-mode

heads up...tracking a problem with archive.today and also wget options

nrvale0 opened this issue · comments

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

Unfortunately, archive.today (or whatever its alias of the day may be) generally seems too unreliable to use as a backend. It's not intended to be used except through a browser, so don't be surprised if it doesn't work sometimes.

If a change has been made to it that requires a change in this code, we can do that.

For Wget, you'll have to be more specific than "it does not like the option." Obviously it works for me and always has.

(use-package org-web-tools)

produces a timeout in the archive.is function and then the following error in Messages from the wget function:

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

The following then fixes wget params:

(use-package org-web-tools
    :config
    (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off"))

My wget man page still lists both -e and --execute as valid options, but apparently wget doesn't accept this one.

$ wget --version | head -n1
GNU Wget 1.20.3 built on linux-gnu.

I tried a:

(setq org-web-tools-attach-archive-fn #'org-web-tools-archive--wget-tar)

to just skip the archive.is attempts completely, but it's still trying archive.is. I'm new to elisp, so I'm probably missing something important.

If those Wget options don't work on your Wget version, I don't know what to suggest other than to not use them. Hopefully you won't need them, but be aware of their purpose. Maybe there is a new, alternative option syntax in your Wget version?

I recommend using the customization system rather than setq for package options, i.e. M-x customize-group RET org-web-tools RET. use-package also has the :custom keyword.

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

I just recognized something: the option string Wget complains about includes both --execute and robots=off as a single string with a space in between. I think this may be a problem with argument parsing. I encountered a similar problem with Wget when experimenting with something recently, and IIRC I wasn't able to find any workaround.
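The distinction can be illustrated with Emacs's own process primitives (a sketch only; how org-web-tools actually invokes Wget may differ, and the URL here is a placeholder):

```elisp
;; Each string passed to `call-process' after the fourth argument
;; becomes exactly one argv entry for the child process.

;; ONE argv entry with an embedded space: Wget's option parser sees a
;; long option literally named "execute robots=off" and rejects it.
(call-process "wget" nil nil nil
              "--execute robots=off" "https://example.com")

;; TWO argv entries: Wget parses the option and its argument normally.
(call-process "wget" nil nil nil
              "--execute" "robots=off" "https://example.com")
```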

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option. (The customization UI makes it easier.) I don't know if that will work, but if it does, it's an easy fix or workaround.
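If you prefer setq, the earlier workaround from this thread can be adapted to split the string (a sketch; adjust to your init style):

```elisp
(with-eval-after-load 'org-web-tools
  ;; Replace the combined "--execute robots=off" entry with two
  ;; separate strings, so each becomes its own argv element.
  (setq org-web-tools-archive-wget-options
        (append (delete "--execute robots=off"
                        org-web-tools-archive-wget-options)
                '("--execute" "robots=off"))))
```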

But I can't explain why my Wget doesn't complain about that option.

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I'm still not able to get the archive.is-based archiving working, but the above is a suitable workaround.

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I think there may be a bug in Wget, because I recently noticed this problem when calling it from outside of Emacs. I guess we have to work around it in Emacs.

I'm still not able to get the archive.is-based archiving working, but the above is a suitable workaround.

archive.is doesn't seem to provide zip archives at all anymore. I can't even download them through a browser, and I couldn't find any explanation on its "blog" where people ask questions. In one case I tried to use Wget on the archive.is HTML view (because the page I was trying to save rendered most of its content with JavaScript, so Wget on the actual site was useless), but the downloaded page had about 90% of the content missing, even though it displayed correctly in a browser.

Archiving contemporary web pages is mostly a disaster. I guess if you are serious about it, you'd better look into WARC or WebRecorder tools, something like that, but those are much more complicated, and AFAIK they require specialized "playback" tools. Imagine what people are going to have to do a few decades from now, running ancient browsers in ancient VMs just to render a newspaper article of the day. Or, almost as bad, looking at image-based archives of newspapers, like microfilm from before the digital age. It seems like no one ever knows when to say, "Stop, that's complicated enough. Just because we could doesn't mean that we should."

Y, thanks for the confirmation. I feel ya' on future-state stuff.

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

@nrvale0 Were you able to solve the 1) archive.today and 2) wget params problems? I have the same in #52.

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option

@alphapapa

Does this look right?

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
   '(--execute
     robots=off)))


UPDATE: wget error solved with the following:

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
   '("--execute" "robots=off")))

but archive.today always fails...

This is still an issue on wget 1.21.4. "--execute" and "robots=off" must be separated.

@deadcombo Thanks for reminding me. I've pushed a fix to master.