benbjohnson / litestream

Streaming replication for SQLite.

Home Page: https://litestream.io


Live read replicas

benbjohnson opened this issue · comments

Currently, Litestream replicates a database to cold storage (file or S3). It would be nice to provide a way to replicate streams of WAL frames to other live nodes to serve as read replicas.

This requires a network replication endpoint (probably HTTP) as well as figuring out how to apply WAL writes to a live database.

Would it be possible to sign frames with a private key? Think SQLite as a lightweight blockchain database. Read replicas could subscribe to a database by its public key identifier and then grab the latest frames over p2p transport of choice. There would be a ton of opportunities to use this for dapps.

You could sign WAL frames with a private key but I'm not sure I understand the use case. What's the benefit rather than streaming the WAL frames from a known source over HTTPS?

Signing the WAL before uploading could give a nice protection against the S3 bucket being compromised.

This should probably be a new issue though.

@philips That's a good point. I was thinking about adding encryption too so it would mitigate an S3 compromise.

You could sign WAL frames with a private key but I'm not sure I understand the use case. What's the benefit rather than streaming the WAL frames from a known source over HTTPS?

In the decentralized app world, people are working on how to write apps that are decentralized but still have a good user experience, without the severe limitations of blockchain apps. Now imagine if Hacker News or Reddit were open-source Go with a SQLite database, and every database change were signed and broadcast. Anyone could quickly stand up a near-realtime clone at myhackernews.com. What's powerful is that the entire community becomes forkable: if Hacker News does something bad, all it takes is for myhackernews.com to alter the code/data and essentially fork the entire site.

Or take it down another level: what if creating a subreddit actually created a new SQLite database with a keypair? Then it wouldn't matter if Reddit censored your subreddit. If users are represented by keypairs and (with this new Litestream feature) all of their collective action is moderated by a keypair, then the person running the app can build out any features they want (without the constraints of traditional blockchains), but it keeps them honest: baked into the software, anybody can "fork" the entire site, and other users can decide to follow the new keypair if they want.

Mastodon and other federated models fix some problems, but you are still beholden to the operator of that particular domain. Git/GitHub changed the game by making everything forkable. If entire website communities were forkable, nobody would want to use anything else, because it would keep operators honest and truly return ownership of shared data to the community.

I added a new issue for encryption and digital signatures for Litestream: #88

Thanks, a very much awaited feature indeed. I'm a long-time user of SQLite on ZFS, and a proponent of and contributor to the ZFS filesystem project, which offers streaming replication and snapshots. I would really love to see this executed at the database level rather than at the filesystem level.

This requires a network replication endpoint (probably HTTP) as well as figuring out how to apply WAL writes to a live database.

Would it be possible to support read replicas just using the existing S3 bucket mechanism?

I'd love to be able to have Litestream push changes up to an S3 bucket (as it does now) and then have a separate read-only Litestream that polls that bucket for changes and uses them to continuously update a copy of that SQLite database file elsewhere.

I guess the concern with this would be cost - but would polling S3 for changes once a minute really be a significant cost over time? I've not done the napkin maths.

@simonw Yeah, that's definitely something I want to add with read replicas. There are about 43,200 minutes in a 30-day month and each LIST request costs $0.000005, so that comes out to roughly $0.22/month/replica.

There's a separate issue though in that you'd need to check for new generations as well in case replication was interrupted for some reason. That's much less likely so maybe that could check every hour instead to keep costs lower.

Another fun opportunity for this is to take advantage of it on Fly.io volumes.

I did a bunch of work recently to get Datasette working with those, but it currently only makes sense if you have a single instance - so you don't get to take advantage of Fly's ability to load balance your app between multiple regions: https://simonwillison.net/2022/Feb/15/fly-volumes/

What I'd really love to do is build a SQLite emulation of the way Fly handles PostgreSQL replication as described in https://fly.io/blog/globally-distributed-postgres/ - so one of my Fly instances would have the SQLite database that accepts writes, but every other instance would maintain a replicated read-only copy via Litestream.

Any time someone tried to write to the DB, I would ensure that their HTTP request went to the location with the write database.

I think this may have been closed prematurely.

This has been merged into main but not released yet. I suppose I could close issues after release instead.

I just mean there's no documentation or anything like that, so it's not usable without reading the code.

Yeah, that’s fair. I’ll re-open it until that’s all fixed up. 👍

You could sign WAL frames with a private key but I'm not sure I understand the use case. What's the benefit rather than streaming the WAL frames from a known source over HTTPS?

In the decentralized app world, people are working on how to write apps that are decentralized but still have a good user experience, without the severe limitations of blockchain apps. Now imagine if Hacker News or Reddit were open-source Go with a SQLite database, and every database change were signed and broadcast. Anyone could quickly stand up a near-realtime clone at myhackernews.com. What's powerful is that the entire community becomes forkable: if Hacker News does something bad, all it takes is for myhackernews.com to alter the code/data and essentially fork the entire site.

Or take it down another level: what if creating a subreddit actually created a new SQLite database with a keypair? Then it wouldn't matter if Reddit censored your subreddit. If users are represented by keypairs and (with this new Litestream feature) all of their collective action is moderated by a keypair, then the person running the app can build out any features they want (without the constraints of traditional blockchains), but it keeps them honest: baked into the software, anybody can "fork" the entire site, and other users can decide to follow the new keypair if they want.

Mastodon and other federated models fix some problems, but you are still beholden to the operator of that particular domain. Git/GitHub changed the game by making everything forkable. If entire website communities were forkable, nobody would want to use anything else, because it would keep operators honest and truly return ownership of shared data to the community.

This has some merit. There are some DB systems where the database is versioned and forkable; under the hood they use SQL.

Here is one in golang: https://github.com/sqlitebrowser/dbhub.io

I cut a v0.4.0-beta.2 release, added a guide on the tip docs site, and there's also a litestream-read-replica-example repository if you want to try out the new live replication feature.

I did not expect the initial implementation to add direct HTTP syncing from another Litestream instance; I expected syncing from a backup location like an S3 replica. Is that intended, or is S3 syncing not possible (hence the direct HTTP sync instead)?

@anacrolix HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357

The fly.io setup is fantastic - thanks @benbjohnson

I have a small worry/question regarding reconciliation.

If the node on fly.io (or any cloud) is partitioned, is there a risk that the slaves will not catch up to the master once the network partition heals? I assume it's all fine, because the WAL is stored with the master DB, right?

The other question is to do with election of a new master. If the master node is network-partitioned, can a slave be elected master auto-magically? I have seen Go gossip setups that can do this with no other side system needed, but I'm wondering if it's possible or planned/wanted for Litestream.

For example, if I need to kill a master node (for whatever reason), I want to tell all clients (like the app tier, or perhaps GUI clients if I use client-side load balancing) to drain all existing connections/sessions and use the new master for all new connections. This ensures no interruptions for users. Maybe there is a smarter pattern than this, but for gossip-based failover (both automatic and forced) it seems like a logical way to do it.

I see on line 160 of https://github.com/benbjohnson/litestream-read-replica-example/blob/main/main.go that fly.io is responsible for tracking where the primary region is. I guess we are currently dependent on fly.io's database load balancing then, too? It would be interesting to be cloud-independent, which would mean running our own gossip (to know who is master) and our own built-in load balancing, such that a read goes to the nearest slave and a write goes to the master.

This is, funnily enough, very similar to how NATS failover works. In NATS, the clients are kept up to date with all node endpoints, who is nearest (for them), and who is master. That gives the clients enough knowledge to do their thing. If a server node dies halfway through a client interaction, it times out, and the client has the other endpoints to ask for the new topology (nearest read-only node / new master). It's sort of like how all p2p networks need a list of initial nodes and a DHT, but much simpler.

I also understand that fly.io's load balancer makes a client stick to the master node for about 1 to 5 seconds to allow the data to replicate to all slaves, thus ensuring consistency; otherwise, a client that wrote a mutation to the master and then immediately read from a slave would NOT see the mutation on the slave. I like how simple and effective a solution it is to a tricky race condition.

Thanks again, Ben, for sticking with this. It's really a liberating system, as well as easy to use.

@anacrolix HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357

I am wondering if both options are needed. You could have a nice 3-node setup between London, Sydney, and San Francisco, with London being the primary, but you also want a backup log going to S3 because sometimes stuff goes wrong.

With the S3 option, you don't have to turn it on. But it might be needed by some users because:

- They want an audit log separate from the main DB/app servers: something that is always there.

- They want to be able to do a restore in case they mess up a SQL migration, for example where users have created data mutations that you can't recover from.

- They want to maintain an off-cloud log of data to ensure they can bring up a new environment anywhere.

- They want their devs to easily bring up a local copy of the DB, independent of production, to test code changes against. This is also really useful for CI/regression testing.

Of course, all of the above can (to a degree) be achieved by tapping into a slave and pulling the data from it. But not everything, such as an audit trail.

I am curious to know how others feel about this.

My requirements might be overly paranoid, but the S3 functionality is already there, so it could be toggled on by developers if they want it when running on fly.io.

If the node on fly.io (or any cloud) is partitioned, is there a risk that the slaves will not catch up to the master once the network partition heals? I assume it's all fine, because the WAL is stored with the master DB, right?

@gedw99 There's no master election in Litestream, so replicas are always behind the primary. That makes it easy to catch up in the event of a network partition. If your primary fails catastrophically and you restore from backup, the new primary will have a different "generation", which causes replicas to re-snapshot.

The other question is to do with election of a new master. If the master node is network-partitioned, can a slave be elected master auto-magically? I have seen Go gossip setups that can do this with no other side system needed, but I'm wondering if it's possible or planned/wanted for Litestream.

There's no promotion planned for Litestream itself, however, I'm working on a different (but related) system for doing high availability with SQLite. That'll have proper failover. Litestream's goal is simplicity and adding leader election makes it significantly more complicated. You can probably put together a system for failover using Consul but you'll need to ensure you only have one primary at a time.

HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357
I am wondering if both options are needed. You could have a nice 3-node setup between London, Sydney, and San Francisco, with London being the primary, but you also want a backup log going to S3 because sometimes stuff goes wrong.

You can currently enable both S3 replication and live read replication. I agree that it's useful to have a backup to a higher durability system like S3 and that's how I typically run it. I didn't include that in the litestream-read-replica-example because it makes it more complicated.

I think what @anacrolix is looking for is to be able to sync the primary & replica through S3 instead of over HTTP. Setting up a direct connection between two servers isn't always feasible.

I'm completely lost; I'm new to this.

@mrEarls210 I know the feeling! It's a lot to take in & database replication is a difficult topic. There's a Getting Started page on the docs site that runs you through some of the basics. This issue is about a new feature around live replication to read-only replicas so it's quite a bit more advanced. Feel free to open another issue or discussion if you have questions.

I'm working on a different (but related) system for doing high availability with SQLite.

Something like rqlite or dqlite?

I'm working on a different (but related) system for doing high availability with SQLite.

Something like rqlite or dqlite?

@warmchang Not quite. Those both use Raft to replicate. The system I'm working on is using a two-node synchronous replication setup similar to what you'd find in Postgres or MySQL synchronous replication. I'll have more details in the near future once I get a little further along with it.

In case anyone else makes the same mistake I did: live read replicas are not supported in litestream 0.3.8 (the latest non-beta release as of 2022-06-26), but they are supported at 0.4.0-beta2 and at main.

hey folks, I wanted to give an update on live replication. The original implementation had some flaws so I reworked it, however, the new implementation still has some limitations & foot-guns that I'm not comfortable with. For example, you can only have one fixed primary. Also, you can corrupt your replicas by writing to them. Unfortunately, this is a consequence of Litestream running as a separate process so there's only so much it can do to manage those limitations.

As some of you may know, I've been working on a next generation tool for running SQLite on a distributed platform and, after some discussion, I've decided to keep Litestream as a pure disaster recovery tool and not include live replication. It's going to be difficult to maintain it long term—especially when the new tooling will work better for this use case.

The bones of the new tooling are in place and we're moving it to internal testing this week. We're focusing on releasing a small but stable feature set at first and we'll grow it over time. I'll announce more here soon once we finish initial testing.

Thank you to everyone who has given feedback and tested out live replication in Litestream. I'm excited about the future of SQLite tooling and I hope y'all are too!

Update: We’ve open sourced the new replication tool which is called LiteFS. You can find it here: https://github.com/superfly/litefs

Thanks for the work Ben. Looking forward to the next-gen tooling. Keep us updated!

I'm working on a different (but related) system for doing high availability with SQLite.

Something like rqlite or dqlite?

@warmchang Not quite. Those both use Raft to replicate. The system I'm working on is using a two-node synchronous replication setup similar to what you'd find in Postgres or MySQL synchronous replication. I'll have more details in the near future once I get a little further along with it.

Would that mean write replicas or still read only replicas?

Hey @benbjohnson

will the new version only work on fly.io? I ask because I figured the live replication would utilise the fly.io infrastructure just like it does for PostgreSQL.
I assume then that backup and restore will be independent of fly.io?

Sorry about the direct questions, but things are a bit of a blur and I'm just trying to plan roadmaps myself.

Congrats on the fly.io deal. Freaking awesome system they have. KISS all the way.

@CanRau There will still be a single primary node that accepts writes at any given time, but it will be able to hand off that responsibility to another node without downtime during deploys.

@gedw99 Good question. I can’t believe I forgot to mention that part in my original comment. The new tooling is not fly.io specific. It is Apache 2 licensed. It will be able to run on any “stateful” server platform like fly.io, Kubernetes, DigitalOcean, etc. We are also targeting pure serverless platforms like Vercel and Deno in the future.

I've decided to keep Litestream as a pure disaster recovery tool and not include live replication. It's going to be difficult to maintain it long term—especially when the new tooling will work better for this use case.

Good call! 👍 to KISS & the UNIX philosophy. This theme continues in shipit.show/61 & shipit.show/62 (not out yet).

New link added to the show notes.

What is the reasoning behind not supporting leader election? CRDB is doing it really well. Is it due to latency or just overall complexity?

@jamra Litestream runs as a separate process, so it has a lot less control over the database file. That makes certain things tricky, like preventing writes on replicas. Live replication was moved to LiteFS, which can tightly control what gets written and when.