alphapapa / org-web-tools

View, capture, and archive Web pages in Org-mode

heads up...tracking a problem with archive.today and also wget options

nrvale0 opened this issue · comments

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

Unfortunately, archive.today (or whatever its alias of the day may be) generally seems too unreliable to use as a backend. It's not intended to be used except through a browser, so don't be surprised if it doesn't work sometimes.

If a change has been made to it that requires a change in this code, we can do that.

For Wget, you'll have to be more specific than "it does not like the option." Obviously it works for me and always has.

(use-package org-web-tools)

produces a timeout in the archive.is function and then the following error in Messages from the wget function:

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

The following then fixes wget params:

(use-package org-web-tools
    :config
    (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off"))

My wget man page still lists both -e and --execute as valid options, but apparently wget doesn't accept this one.

$ wget --version | head -n1
GNU Wget 1.20.3 built on linux-gnu.

I tried a:

(setq org-web-tools-attach-archive-fn #'org-web-tools-archive--wget-tar)

to just skip the archive.is attempts completely, but it's still trying archive.is. I'm new to elisp, so I'm probably missing something important.

If those Wget options don't work on your Wget version, I don't know what to suggest other than to not use them. Hopefully you won't need them, but be aware of their purpose. Maybe there is a new, alternative option syntax in your Wget version?

I recommend using the customization system rather than setq for package options, i.e. M-x customize-group RET org-web-tools RET. use-package also has the :custom keyword.

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

I just recognized something: the option string Wget complains about includes both --execute and robots=off as a single string with a space in between. I think this may be a problem with argument parsing. I encountered a similar problem with Wget when experimenting with something recently, and IIRC I wasn't able to find any workaround.
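The distinction can be illustrated with Emacs's own process primitives (a sketch only; how org-web-tools actually invokes Wget may differ, and the URL here is a placeholder):

```elisp
;; Each string passed to `call-process' after the fourth argument
;; becomes exactly one argv entry for the child process.

;; ONE argv entry with an embedded space: Wget's option parser sees a
;; long option literally named "execute robots=off" and rejects it.
(call-process "wget" nil nil nil
              "--execute robots=off" "https://example.com")

;; TWO argv entries: Wget parses the option and its argument normally.
(call-process "wget" nil nil nil
              "--execute" "robots=off" "https://example.com")
```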

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option. (The customization UI makes it easier.) I don't know if that will work, but if it does, it's an easy fix or workaround.
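If you prefer setq, the earlier workaround from this thread can be adapted to split the string (a sketch; adjust to your init style):

```elisp
(with-eval-after-load 'org-web-tools
  ;; Replace the combined "--execute robots=off" entry with two
  ;; separate strings, so each becomes its own argv element.
  (setq org-web-tools-archive-wget-options
        (append (delete "--execute robots=off"
                        org-web-tools-archive-wget-options)
                '("--execute" "robots=off"))))
```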

But I can't explain why my Wget doesn't complain about that option.

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I'm still not able to get the archive.is-based archiving working, but the above is a suitable workaround.

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I think there may be a bug in Wget, because I recently noticed this problem when calling it from outside of Emacs. I guess we have to work around it in Emacs.

I'm still not able to get the archive.is-based archiving working, but the above is a suitable workaround.

archive.is doesn't seem to provide zip archives at all anymore. I can't even download them through a browser, and I couldn't find any explanation on its "blog" where people ask questions. In one case I tried to use Wget on the archive.is HTML view (because the page I was trying to save rendered most of its content with JavaScript, so Wget on the actual site was useless), but the downloaded page had about 90% of the content missing, even though it displayed correctly in a browser.

Archiving contemporary web pages is mostly a disaster. I guess if you are serious about it, you'd better look into WARC or WebRecorder tools, something like that, but those are much more complicated, and AFAIK they require specialized "playback" tools. Imagine what people are going to have to do a few decades from now, running ancient browsers in ancient VMs just to render a newspaper article of the day. Or, almost as bad, looking at image-based archives of newspapers, like microfilm from before the digital age. It seems like no one ever knows when to say, "Stop, that's complicated enough. Just because we could doesn't mean that we should."

Y, thanks for the confirmation. I feel ya' on future-state stuff.

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

@nrvale0 Were you able to solve the 1) archive.today and 2) wget params problems? I have the same in #52.

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option

@alphapapa

Does this look right?

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
   '(--execute
     robots=off)))


UPDATE: wget error solved with the following:

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
   '("--execute" "robots=off")))

but archive.today always fails...

This is still an issue on wget 1.21.4. "--execute" and "robots=off" must be separated.

@deadcombo Thanks for reminding me. I've pushed a fix to master.