danny0838 / webscrapbook

A browser extension that captures web pages to a local device or backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.

Performance issue (eating RAM)

mbnoimi opened this issue

Hi,

I'm downloading a website to a depth of 3 within the same domain. My laptop has 16 GB of RAM.
Within less than 3 hours, the extension pushed RAM usage to 90%, which forced me to hard-restart my laptop.
This issue occurs with big websites only (my website is about 1.5 GB, mostly pure HTML).

Is there any workaround for enhancing the performance?

  • Linux Mint 21.3 Xfce
  • Firefox 126.0 (64-bit)
  • Save captured data to: Scrapbook folder
  • Save captured data as: Folder

There's probably not much you can do besides upgrading the hardware. Saving to the backend server may perform better in some cases, though.

> There's probably not much you can do besides upgrading the hardware. Saving to the backend server may perform better in some cases, though.

I use WebHTTrack, and it works pretty well, but for some reason my cookies don't work with it. That's why I use webscrapbook: it handles cookies behind the scenes.

> There's probably not much you can do besides upgrading the hardware

BTW, why does webscrapbook store all the scraped data in memory and then save it in the last step? Why doesn't it save pages one by one, like wget and httrack?

> BTW, why does webscrapbook store all the scraped data in memory and then save it in the last step? Why doesn't it save pages one by one, like wget and httrack?

This is not true. Intermediate data is mostly saved to browser storage, which ultimately resides on disk in some form.

The browser extension API is so limited that it cannot load files that have been downloaded to the local filesystem. When capturing multiple web pages, the saved pages need to be loaded and have all links to other downloaded pages rewritten, which is not possible before all pages have been downloaded. As a result, we have to save all downloaded pages in browser storage, rewrite them, and then save them to the local filesystem.
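
To illustrate the constraint described above, here is a minimal Python sketch of a two-phase capture. It is not WebScrapBook's actual code: the `capture_all` and `rewrite_links` functions, the `pageN.html` naming scheme, and the plain dict standing in for browser storage are all assumptions made for the example. The point it tries to show is that links can only be rewritten once the full set of captured pages is known.

```python
import re
from urllib.parse import urljoin

# Simplified link matcher for the sketch; real HTML needs a proper parser.
HREF_RE = re.compile(r'href="([^"]*)"')


def capture_all(pages: dict[str, str]) -> dict[str, str]:
    """Phase 1: stash every downloaded page in an intermediate store.

    `pages` (URL -> HTML) stands in for the network, and the returned dict
    stands in for browser storage. Both are hypothetical.
    """
    store = {}
    for url, html in pages.items():
        # No rewriting yet: we do not know which link targets will be captured.
        store[url] = html
    return store


def rewrite_links(store: dict[str, str]) -> dict[str, str]:
    """Phase 2: with all pages known, point links at the local copies."""
    # Hypothetical naming scheme: one local file name per captured URL.
    local_names = {url: f"page{i}.html" for i, url in enumerate(store)}

    output = {}
    for url, html in store.items():
        def replace(match: re.Match) -> str:
            target = urljoin(url, match.group(1))  # resolve relative links
            # Captured targets become local names; others stay absolute URLs.
            return f'href="{local_names.get(target, target)}"'

        output[local_names[url]] = HREF_RE.sub(replace, html)
    return output


pages = {
    "https://example.com/a.html": '<a href="b.html">B</a>',
    "https://example.com/b.html": '<a href="/a.html">A</a>',
}
files = rewrite_links(capture_all(pages))
print(files["page0.html"])  # <a href="page1.html">B</a>
```

A tool like wget can run the second phase directly against files on disk; in this sketch, as in the explanation above, the intermediate store has to hold every page until rewriting is done.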