CMB / edbrowse

A command-line editor and web browser.

Javascript doesn't work

Dingo64 opened this issue

I am trying to save a webpage to file:
./edbrowse https://example.com/ > out.htm

no ssl certificate file specified; secure connections cannot be verified
15848
Unable to exec edbrowse-js, javascript has been disabled.
1351

Of course edbrowse-js is in the same directory and has exec rights.

Thanks, I did export PATH and now this error is gone. But can I just use it like wget? Download a file and save the final output (after running JS) to file?

Thanks! Can I do this non-interactively? Like edbrowse http://google.com -w output.htm?

Unfortunately the unbrowse command ub might not quite do what you want, if you wanted to see the results of having the Javascript run. Unbrowse is like the View Source command in a graphical browser: it shows you the original page source. It does not show you the modified version of the DOM tree after the scripts have run. Yes, the formatted text shows the Javascript result, but the unbrowsed version just shows you the original source. If by "like wget" you meant "like wget plus Javascript DOM changes put back into the source", that's more complex.
And no, it won't work to inject an extra piece of Javascript into the page like document.body.innerHTML=document.all[0].outerHTML.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/\n/g,'<br>') (which is supposed to read the DOM back into source form and format it so that the formatted display will be the DOM markup); the reason is that edbrowse's DOM support is not complete enough. To clarify, the Javascript engine behind edbrowse is the same one that runs Firefox, but that's only the Javascript engine, not the DOM. The Mozilla SpiderMonkey engine provides the Javascript interpreter, but the DOM itself still has to be provided by edbrowse, and if we look into the edbrowse source at src/jseng-moz.cpp we can see the JS_DefineProperty call for innerHTML rigs up a "setter" but not a "getter". This means edbrowse has write-only support for the innerHTML property; attempting to read back an element's innerHTML will get an empty string. And the outerHTML property is not supported at all.
If I'm doing Web programming and I ever need to do a "quick hack" along the lines of x.innerHTML = x.innerHTML.replace(y,z) I always try to remember to enclose it in an if (x.innerHTML) to verify that we have both read and write support for innerHTML, because if any SpiderMonkey-derived browser has write-only support for this property then a read-modify-write would become a delete.
If you really want to inject Javascript into the current version of edbrowse to give you the DOM, you'll have to write a rather roundabout script to walk through the DOM nodes itself, using only the features that edbrowse already implements, building up the markup string as it goes. But if you're going to go to that much effort then you might almost as well do it in C and thereby contribute innerHTML read support to edbrowse. (By the way I'm not sure I'd be the best one to code this, because I did the exact same job for a commercial company 10 years ago and I don't want to raise questions about whether my contribution somehow taints the free code base. But I can still sit here and point it out.)
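To make the read-modify-write hazard concrete, here is a small Python model of it. It is only an illustration of the reasoning: WriteOnlyElement is a hypothetical stand-in for an element whose innerHTML has a setter but reads back as an empty string, and none of these names come from edbrowse.

```python
class WriteOnlyElement:
    """Hypothetical stand-in: innerHTML can be set, but reads back empty."""

    def __init__(self, markup):
        self._markup = markup  # what the engine really holds

    @property
    def innerHTML(self):
        return ""  # models the missing "getter"

    @innerHTML.setter
    def innerHTML(self, value):
        self._markup = value


def unguarded_replace(element, old, new):
    # Read-modify-write with no check: the empty read is written straight back.
    element.innerHTML = element.innerHTML.replace(old, new)


def guarded_replace(element, old, new):
    # Only write back if the read actually returned something.
    current = element.innerHTML
    if current:
        element.innerHTML = current.replace(old, new)


a = WriteOnlyElement("<p>colour</p>")
unguarded_replace(a, "colour", "color")   # content is wiped

b = WriteOnlyElement("<p>colour</p>")
guarded_replace(b, "colour", "color")     # content is left untouched
```

The guarded version does not perform the replacement either, but it fails safe: the page keeps its original content instead of losing it.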

In the meantime there is PhantomJS which has more complete DOM support but it is not as lightweight as edbrowse. For example in Python (adapted from Web Adjuster):

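from selenium import webdriver
import time

url = "https://example.com/"  # the page to fetch
# Needs Selenium 3.x: later releases dropped webdriver.PhantomJS and find_element_by_xpath
wd = webdriver.PhantomJS(service_args=['--ssl-protocol=any'])
wd.get(url)
time.sleep(2)  # wait for onTimeout events
# Python 3: print the markup of the root element after scripts have run
print(wd.find_element_by_xpath("//*").get_attribute("outerHTML"))
wd.quit()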

but none of these considerations apply if you merely wanted to view the formatted text of pages that don't need things like read access to innerHTML and you don't need to see a markup representation of the modified DOM but just want to read what the page says: in that case edbrowse should be just fine.

OK I'll see if I can sign up to that list at some point. Not today though as I am a bit overloaded at the moment. One more thing I should mention though is that edbrowse's support of default innerHTML is also limited by length, and if it is too long it will not be set at all. It's not immediately obvious from the code where this length limit is coming from. What it means is that scripts that try to do "search and replace" on the entire document by accessing a wrapper element's innerHTML will fail unless it is a short test document. It also means we cannot get a DOM tree out of current versions of edbrowse simply by adding a DIV element around the entire body, in case anyone was thinking of trying that.

Update: in edbrowse 3.7.4, document.body.innerHTML works (and you can access it via the new jdb command after loading a page, which essentially takes you into a Javascript console), but innerHTML does not reflect the DOM changes made by scripts as it does in graphical browsers (for example, if an inline script has called document.write("2+2="+(2+2)) then this will not cause 2+2=4 to appear in document.body.innerHTML), and outerHTML remains undefined. And it's difficult to implement your own in jdb: you can walk through firstChild and nextSibling looking at nodeType, nodeName and nodeValue but:

  • getAttributeNames() is not yet implemented, and the attributes property of nodes does not yet define length. You can use getAttribute() if you know the attribute name, but you cannot get a list of attribute names, so at best you'll end up having a DOM tree with all the attributes missing.
  • and document.write has been known to break nextSibling links, which can result in parts of your DOM tree falling off.
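For anyone wondering what such a walk would look like if the missing pieces were there, here is a rough sketch in Python. It uses the standard library's xml.dom.minidom, which exposes the same firstChild, nextSibling, nodeType, nodeName and nodeValue properties; it is not edbrowse or jdb code, and the sample markup is made up. The attribute loop is exactly the step that cannot be done in edbrowse 3.7.4 as described above.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr


def serialize(node):
    """Rebuild markup for a node by walking firstChild / nextSibling."""
    if node.nodeType == node.TEXT_NODE:
        return escape(node.nodeValue)
    if node.nodeType != node.ELEMENT_NODE:
        return ""  # comments and other node types are skipped in this sketch

    # Listing attribute names needs a length (or getAttributeNames());
    # without it the tree would come out with all attributes missing.
    attrs = ""
    for i in range(node.attributes.length):
        attr = node.attributes.item(i)
        attrs += " " + attr.name + "=" + quoteattr(attr.value)

    inner = ""
    child = node.firstChild
    while child is not None:
        inner += serialize(child)
        child = child.nextSibling  # a broken link here loses the rest of the tree

    return "<" + node.nodeName + attrs + ">" + inner + "</" + node.nodeName + ">"


# Made-up sample document, standing in for a page after its scripts have run.
doc = minidom.parseString(
    '<body><p id="sum">2+2=4</p><a href="http://www.example.com">hi</a></body>'
)
markup = serialize(doc.documentElement)
print(markup)
```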

So as per previous comments on this thread, you can write the original page source to a file, or you can write the final version of the rendered text to a file, but there is not yet a way to write out a with-markup version of the DOM after it has been changed by Javascript.

I have a question. I want to download a web page together with its Javascript using edbrowse, to make an offline copy. How can I achieve this? When I browse that site, the Javascript content is not loaded: no text or links appear.

I'm not sure what you are asking. If you want a local copy of a web page, with local javascript files and local css files, it is theoretically possible. We do this a lot when debugging, but it's not easy and has some caveats, and most users don't do that. Call up the debugging page in the edbrowse wiki, and look for the word snapshot. If you're just saying there's a web page wherein js isn't working properly, well, there are a lot of those; let us know which one and we'll add it to the list. Karl Dahlke

I am looking for a way to archive a website for offline backup, with its Javascript-built DOM fully loaded. edbrowse is the only text-based browser that supports Javascript. So what should the command be?

Thanks. What I want is to execute the external Javascript referenced by the src attributes of the script tags in the HTML, update the DOM accordingly, and then scrape the final updated HTML/DOM.

If you want to back-convert the final rendered DOM into HTML, so for example if the site says <script>var a=document.createElement("a");a.setAttribute("href","http"+":/"+"/"+"www.example.com");a.innerText="hi";document.body.appendChild(a)</script> and you want the output to be <a href="http://www.example.com">hi</a>, then I don't think Edbrowse can do this yet. In my Web Adjuster's Javascript execution options, I use Selenium with Headless Chrome or Firefox, but this is quite resource-hungry and slightly unreliable. Maybe one day we'll be able to use Edbrowse for this.

That's great. If anyone reading this gets undefined, note that you need at least version 3.7.5 of edbrowse (check edbrowse -v), which means if you've installed edbrowse from your distribution's package manager, you need to be running a new enough distribution:

  • Ubuntu 20.04 should work (it has edbrowse 3.7.6), but not Ubuntu 18.04 which has edbrowse 3.7.2;
  • Debian 11 "Bullseye" (and its Raspbian equivalent on the Raspberry Pi which has just been released) has edbrowse 3.7.7, but Debian 10 "Buster" has 3.7.4 which is too old;
  • FreeBSD should be fine: it has edbrowse 3.7.7;
  • Fedora 34's "RPM Fusion + RPM Sphere" add-on is still stuck on edbrowse 3.7.4 which is too old;
  • and MacPorts is still stuck on edbrowse 3.4.10.
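If you want to script that check, a minimal sketch in Python follows. It assumes the version is available as a plain dotted string such as "3.7.6" (I have not checked exactly what edbrowse -v prints on every build, so treat that as an assumption); the function names are just for illustration. Comparing the parts as integers matters, because a plain string comparison would misorder versions like 3.4.10.

```python
def parse_version(text):
    """Turn a dotted version string like '3.7.5' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


MINIMUM = parse_version("3.7.5")


def is_new_enough(version_text):
    """True if the given version is at least 3.7.5."""
    return parse_version(version_text) >= MINIMUM


# Versions taken from the list above.
packaged = {
    "Ubuntu 20.04": "3.7.6",
    "Ubuntu 18.04": "3.7.2",
    "Debian 11": "3.7.7",
    "Debian 10": "3.7.4",
    "MacPorts": "3.4.10",
}

too_old = sorted(name for name, version in packaged.items()
                 if not is_new_enough(version))
print(too_old)
```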

To compile a more recent edbrowse on the Mac:

  1. Get MacPorts to install PCRE, libcurl and Tidy, as well as the GNU versions of sed and make (since Edbrowse's Makefile is not fully compatible with the BSD versions of these tools): sudo port install pcre curl tidy gsed gmake
  2. Compile quickjs as described in Edbrowse's README.
  3. In Edbrowse's src/Makefile, change else sed -f to else gsed -f to ensure the GNU version of sed is called, and remove -latomic.
  4. In Edbrowse's src, run gmake CFLAGS="-I /opt/local/include"
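The Makefile edit in step 3 can be scripted. Here is a minimal sketch in Python that applies the two changes to the Makefile text; the sample lines are hypothetical and only there to show the effect, since the real contents of src/Makefile vary between edbrowse versions.

```python
def patch_makefile(text):
    """Apply the two step-3 edits to the text of Edbrowse's src/Makefile."""
    # Call GNU sed (installed by MacPorts as gsed) instead of the BSD sed.
    patched = text.replace("else sed -f", "else gsed -f")
    # Drop the -latomic linker flag, with or without a leading space.
    patched = patched.replace(" -latomic", "").replace("-latomic", "")
    return patched


# Hypothetical sample lines, not copied from the real Makefile.
sample = (
    "LDLIBS = -lpcre -lcurl -latomic\n"
    "\tif true ; then : ; else sed -f example.sed in > out ; fi\n"
)
result = patch_makefile(sample)
print(result)
```

To use it on the real file you would read src/Makefile, pass its contents through patch_makefile and write the result back, keeping a copy of the original in case the patterns do not match your version.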

And yes, some sites do depend on outerHTML, but more depend on innerHTML, and some of them expect innerHTML to be dynamic (like outerHTML currently is). They also expect both innerHTML and outerHTML to have setters which re-parse an HTML fragment and repopulate part of the DOM.

Yes I realised my comment was wrong (I'd forgotten I'd installed RPM Fusion on the box), so I edited my comment shortly after writing it. But GitHub still sent the wrong version to anyone subscribed to this thread by email. Sorry about that.

Just filed a ticket at MacPorts asking them to update, with a scripted version of the above instructions (they might say "oh that's not how we write our scripts at MacPorts" but hopefully they can adapt it)

  • and MacPorts is still stuck on edbrowse 3.4.10.

I've updated edbrowse in MacPorts to 3.8.2.1 and listed myself as the maintainer so I should notice any future versions becoming available and update the port in short order. If I fail to do so please file a MacPorts ticket or send a MacPorts pull request.