ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Home Page: https://archivebox.io

Feature Request: One-Click Deploy to hosting providers

mAAdhaTTah opened this issue

DigitalOcean is launching a one-click deploy for its App Platform. This won't work for us yet because we would need to attach a Volume, which App Platform doesn't support, but the linked documentation suggests it will soon/eventually. Alternatively, we could look into configuring it for Heroku.

I'm happy to take the lead on this as well, but wanted to open an issue for visibility/discussion.

Type

  • General question or discussion
  • Propose a brand new feature
  • Request modification of existing behavior or design

What is the problem that your feature request solves

I think it would be helpful for new users to be able to spin up an ArchiveBox instance in the cloud w/ minimal work. Running it on Docker in the first place is really helpful, but would be nice to simplify it even further.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

It should be feasible for a new user

What hacks or alternative solutions have you tried to solve the problem?

I'm still considering how I'm going to host my archive. I initially spun it up on a home server, which works but doesn't help if I want to expose the in-progress REST API to my website. I then put it on a DO droplet, which I'm still fiddling with. I've also considered writing Ansible roles for this, although that's a bit more involved for the less technical.

The main issue with something like App Platform & Heroku is that you don't get CLI access, so everything needs to function via the UI. Downloading sites can take several minutes, which may time out if deployed on App Platform (I haven't tested it in that context, but it's definitely been happening on my droplet). Maybe worth considering how we can configure this as background tasks or something? Or maybe deploy to App Platform as a worker?
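
To make the background-task idea concrete, here is a minimal sketch of a queue-backed worker thread: the web request would only enqueue the URL and return, so a slow download cannot time out the request. The names `start_archive_worker` and `archive_url` are hypothetical and not part of ArchiveBox; a real deployment might use a separate worker process instead of a thread.

```python
import queue
import threading
from typing import Callable, Optional


def start_archive_worker(
    jobs: "queue.Queue[Optional[str]]",
    archive_url: Callable[[str], None],
) -> threading.Thread:
    """Run archive_url for each queued URL on a background thread.

    Queueing None acts as a sentinel that stops the worker.
    """

    def run() -> None:
        while True:
            url = jobs.get()
            if url is None:  # sentinel: shut the worker down
                break
            try:
                archive_url(url)  # assumed to be the slow, blocking part
            except Exception as exc:  # keep the worker alive if one URL fails
                print(f"archiving {url} failed: {exc}")

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

A request handler would then call `jobs.put(url)` and respond immediately, while the worker archives in the background.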

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually

  • I'm willing to contribute dev time / money to fix this issue
  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up

Some managed hosting options have popped up in the last few months, might be worth checking out if you're willing to pay $ for hosting:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#managed-archivebox-hosting

Heroku button support would be awesome indeed.
https://www.heroku.com/elements/buttons

@olimart The biggest issue with doing this is the filesystem. Heroku & DO's App Platform both provide ephemeral filesystems per deploy, so they're wiped on restart/redeploy. We'd need to either configure those platforms for block storage (something DO's AP doesn't support yet; not sure about Heroku) or provide a swappable implementation for the filesystem to save things to S3 or some other object storage (DO's Spaces, which is S3 compatible). I haven't dug into this much but it's definitely not a trivial effort.
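
As a rough illustration of what a "swappable implementation for the filesystem" could look like, here is a sketch of a small storage interface with two interchangeable backends. Everything here is an assumption for illustration: `Storage`, `LocalStorage`, `InMemoryObjectStorage`, and `save_snapshot` are hypothetical names, not ArchiveBox APIs, and the in-memory class only stands in for an S3-compatible bucket.

```python
from pathlib import Path
from typing import Dict, Protocol


class Storage(Protocol):
    """Hypothetical interface; ArchiveBox does not define this."""

    def write(self, key: str, data: bytes) -> None: ...

    def read(self, key: str) -> bytes: ...


class LocalStorage:
    """Stores each key as a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStorage:
    """Stand-in for an S3-compatible bucket; a real one would call an S3 client."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def read(self, key: str) -> bytes:
        return self.objects[key]


def save_snapshot(storage: Storage, timestamp: str, html: bytes) -> str:
    """Write one archived page through whichever backend was configured."""
    key = f"archive/{timestamp}/index.html"
    storage.write(key, html)
    return key
```

This only covers files that the application itself writes; it does nothing for a SQLite database, which expects a real local file.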

Thanks @mAAdhaTTah
Yep, would need to provide the ability to configure external storage (S3...)
I also noticed a reference to SQLite, which is not supported by Heroku either.
Web app on Heroku, storage on Dropbox 😄

Here's a WIP DigitalOcean "one-click" deploy template, but as @mAAdhaTTah mentioned it's broken because disk storage is not supported by DO apps yet: https://github.com/ArchiveBox/ArchiveBox/blob/digitalocean/.do/deploy.template.yaml

@pirate Yeah, and swapping out for S3 would be tough/impossible with the SQLite db (plus if the tools we use write their own files, that makes it even more difficult).

I think it's still feasible though: we can write to local disk / RAM disk and then sync it to S3 or other storage backends every few seconds. It'll have a second or two of lag, but I think that's an acceptable trade-off.
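
A minimal sketch of one pass of that sync, assuming changes are detected by file size and modification time. The `upload` callable is a placeholder for whatever client pushes bytes to S3 or another backend, and `sync_changed` is a hypothetical name, not an ArchiveBox function.

```python
import os
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Signature = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


def sync_changed(
    root: Path,
    seen: Dict[str, Signature],
    upload: Callable[[str, bytes], None],
) -> List[str]:
    """Upload every file under root whose size or mtime changed since the last pass.

    `seen` is updated in place so the next call only uploads newer changes.
    """
    uploaded = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = Path(dirpath) / name
            stat = path.stat()
            sig = (stat.st_size, stat.st_mtime_ns)
            key = path.relative_to(root).as_posix()
            if seen.get(key) != sig:
                upload(key, path.read_bytes())
                seen[key] = sig
                uploaded.append(key)
    return uploaded
```

Calling this in a loop with a few seconds of sleep between passes gives the "sync every few seconds" behavior. Note that a file caught mid-write would be uploaded in a partial state and then re-uploaded on a later pass.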

@pirate How would you handle the db in that instance? Sync it down on boot?

Nah just rsync it every few seconds like all the other files. I think S3 supports byte-range requests so you can just sync the diffs instead of the whole thing each time.

I would also want this feature

Re: how to handle the db: alternatively, use the DigitalOcean Postgres server. (Or is ArchiveBox sqlite3-only?)

Additionally, it might be possible to use s3fuse to treat the DO spaces as a local filesystem

This might be kinda gross since you have to overwrite the whole file each time; you can't modify or append to it. That could cause issues.

@turian The big issue, as I understand it, is the external binaries write files directly to disk.

Yeah, but @pirate's suggestion is just to rsync very frequently to S3.

On startup, you rsync back from S3. (I guess this can get expensive if you are not in AWS, since S3 downloads are costly.)

(BTW, DigitalOcean Spaces are S3-compatible.)

The only real issue I can think of is durability, like if the process breaks for some reason and you end up with a corrupted copy. Then you have to roll back the S3 copy, which could be a pain.
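
For the SQLite file specifically, one way to lower the corruption risk might be to upload a snapshot taken through SQLite's online backup API instead of copying the live file. A sketch, assuming Python 3.7+ (where `sqlite3.Connection.backup` is available); the function name and paths are hypothetical:

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(db_path: Path, snapshot_path: Path) -> None:
    """Copy a SQLite database to snapshot_path using SQLite's online backup API."""
    src = sqlite3.connect(str(db_path))
    dst = sqlite3.connect(str(snapshot_path))
    try:
        # The backup API copies a consistent view of the database,
        # unlike a plain file copy taken in the middle of a write.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

The sync job would then upload the snapshot file rather than the live database. This does not help with half-written files from the external binaries.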

rsync'ing back & forth seems rough for an archive of any serious size. I believe my archive is several GBs at this point and if I had to resync it down on startup and rsync up after archiving, that would be pretty slow.

@mAAdhaTTah So I don't know the internals of archivebox but:

  • rsync'ing it up should be relatively fast, since it only uploads the diff. i.e. whatever is new in the past 10 seconds or whatever.
  • I'm not sure you have to rsync down the entire archive. Probably just the sqlite3 and a few other small files that indicate what's left in the queue to be archived. I could be wrong though, I'm just guessing.

I believe rsyncing bidirectionally on startup can be made reasonably fast/efficient even for large archives as there are advanced rsync options that let you store a sync cache file for faster diffing.

@mAAdhaTTah Also, if you want a one-click deploy of ArchiveBox, you can get one on PikaPods. It costs a few bucks a month.

I think they are running 0.6.2. Unfortunately, this means you will still get crashes from the UTF-8 bug and the youtube-dl bugs, and archiving will stop. There are PRs for these, but they are not merged yet.

PikaPods builds all their one-click app stuff in-house (not open source), I think, so there's no way to customize.

Another option is YunoHost. Their apps are all open source, so in principle there could be a bleeding-edge ArchiveBox app in there too.

I'm going to close this for now because realistically the only two options I foresee for the future are:

  • I continue maintaining ArchiveBox as a non-profit side-project (in which case I have no personal capacity to support bespoke one-click solutions that deploy to paid hosting platforms beyond linking them in the README)
  • I turn ArchiveBox into a for-profit enterprise and offer paid ArchiveBox hosting (in which case I have no interest in supporting competing paid deployment solutions for free)

For what it's worth, I did a Railway deploy; this is a link to it. I think new users get $5 in credit, and once that is used you get $5 of credit for a $5 subscription. ArchiveBox uses about $1 of credit per month.

Edit: here it is deployed: https://box.boehs.org/archive/1714976395.796772/index.html