ipfs / roadmap

IPFS Project && Working Group Roadmaps Repo


[2021 Theme Proposal] Scalability

atopal opened this issue · comments

Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!

Theme description

Make retrieving data from IPFS fast and reliable at the next level of data scale that IPFS will achieve. There is reason to expect the volume of data (files and data size) on IPFS to increase massively soon due to growing IPFS and Filecoin adoption, and we want to solve scaling problems before they hit.

Hypothesis

With the IPFS network growing quickly, especially due to the launch of Filecoin mainnet, the volume of data on IPFS will soon scale to the point where our current implementations will be difficult to use. Failure to meet this scale will greatly hamper the experience of both new and existing IPFS users, limiting the project’s ability to grow.

Vision statement

IPFS users do not have to worry about adding large amounts of data or huge numbers of individual files to the network, unlocking many use cases that web2 is currently better equipped to handle.

Why focus this year

The IPFS network is poised to grow a ton in 2021, so making progress on the scalability front is critical. Longer-term scalability solutions (e.g., the next version of content routing) will take a long time to build, so it’s important to start today.

Example workstreams

Go-bitswap, DHT scaling, resource consumption, reimagining what content routing looks like

Can confirm. Hosting an 8 TB dataset on IPFS is not fun. Anything over a TB is a massive pain in the ass.

YES to scalability for mass adoption. By mass adoption:

Transition from centralized servers to user-run nodes (can we handle it? is there adequate documentation/education?)

Migrate web2 to IPFS

I have written a slightly different but roughly similar proposal. I put more focus on the DHT, since I think that is the core point. But if this proposal were adopted instead, I would also be happy...

#76

Thanks @rklaehn - definitely a core part of it, helpful that you wrote it up

Specific initiative proposal from @lizelive I don’t want to lose track of:

One thing that limits IPFS usage for large datasets is its very small block size. The default 256 KiB is too small, and the maximum (at least for files) is 1 MiB. For better disk performance, a minimum of 4 MiB would be needed. This would also drastically improve performance when using cloud backings for data storage.
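To make the block-count pressure concrete, here is a rough back-of-the-envelope sketch (the 8 TB figure comes from the comment above; the arithmetic is purely illustrative, not a claim about any specific IPFS implementation):

```go
package main

import "fmt"

func main() {
	// Rough block-count arithmetic for an 8 TiB dataset (illustrative only).
	// Every block is a DHT provider record and a blockstore entry, so the
	// count drops by the same factor the block size grows.
	const dataset = 8 * 1024 * 1024 * 1024 * 1024 // 8 TiB in bytes

	for _, blockSize := range []int64{256 * 1024, 1024 * 1024, 4 * 1024 * 1024} {
		blocks := dataset / blockSize
		fmt.Printf("block size %4d KiB -> %d blocks\n", blockSize/1024, blocks)
	}
}
```

At 256 KiB blocks that is ~33.5 million blocks for a single dataset; moving to 4 MiB blocks cuts it 16x, which is the intuition behind the proposal.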


To pile a bit more onto this topic. The @textileio team is also concerned about scalability but from a slightly different perspective. Here I'm talking about blockstore scalability. We had written up a "standalone" theme, but perhaps the theme of scalability in general is a good umbrella for our ideas, so I'll include them here instead. But @momack2 let me know if you think this deserves its own theme, or belongs somewhere else?


IPFS is increasingly being used in production environments, where an embedded datastore is no longer scalable, stable, or in some cases even “safe”. We are just now arriving at a point in the IPFS community where storage integrity is becoming a serious concern. I'd propose this theme also focus on improving existing data/blockstore implementations, while also providing guidance and best practices for developers looking to run IPFS in various (production) scenarios.

Hypothesis

Large-scale adoption of IPFS for commercial/enterprise solutions is not going to be viable until a clear solution for scalable/stable backing storage configurations can be prescribed. A specific example is making go-ipfs more robust under low-disk-space scenarios: today, if the disk backing the FlatFS blockstore runs out of space, the datastore can become corrupted (see the Other content section). But the alternatives (the embedded badger datastore) aren’t always the right fit either.

More users and applications are relying on go-ipfs as their underlying data storage. As with any storage system, it should deal gracefully with one of the most common problems: remaining safe under edge-case conditions.

Vision statement

Users and applications should be able to rely heavily on go-ipfs nodes to save data, knowing that expected events under normal operation won't affect data safety/integrity. Additionally, it would be useful to consider “out of the box” support for remote blockstore backends: external databases, cloud services, etc. For example, does it make sense to support additional go-datastore backends out of the box, such as go-ds-mongo? What about alternatives to badger for embedded databases? FlatFS has known scalability and speed issues and has recently led to some serious data-corruption issues. Can we provide useful ways to avoid these problems?
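The pluggable-backend idea above can be sketched with a deliberately minimal key-value interface. This is a hypothetical simplification, not the real go-datastore API: the point is only that an embedded store and a remote store (MongoDB, S3, ...) can sit behind one interface, so node logic never cares which it got.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a minimal key-value interface, standing in for go-datastore's
// richer Datastore interface (hypothetical simplification for this sketch).
type Store interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// MapStore is an embedded, in-memory backend, playing the role that
// FlatFS or badger plays in a real node.
type MapStore struct{ m map[string][]byte }

func NewMapStore() *MapStore { return &MapStore{m: make(map[string][]byte)} }

func (s *MapStore) Put(key string, value []byte) error {
	s.m[key] = value
	return nil
}

func (s *MapStore) Get(key string) ([]byte, error) {
	v, ok := s.m[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

// A remote backend (e.g. one backed by MongoDB, as go-ds-mongo does for
// go-datastore) would satisfy the same interface and be swapped in by
// configuration rather than code changes.
func main() {
	var s Store = NewMapStore()
	s.Put("block-cid", []byte("raw block bytes"))
	v, _ := s.Get("block-cid")
	fmt.Println(string(v))
}
```

The design question in the comment is essentially which such backends deserve first-class, out-of-the-box support and testing.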

Why focus this year

go-ipfs is becoming more important in 2021 with the launch of Filecoin mainnet. Most probably, more people will be using IPFS for serious data storage applications, making this issue more apparent. We are only just now reaching the realm of big data on IPFS, and systems that have scaled reasonably well up to this point are starting to break down. See the Other content section for an example of a new go-ipfs user already running into this situation a few months after starting.

Example workstreams

A useful first pass would be a new feature that halts new block storage once the underlying blockstore reaches a defined limit (in GiB or as a % of disk space). That alone would avoid the corruption problem above and leave the machine hosting go-ipfs enough space to deal with the situation. The next steps would be testing alternative blockstore implementations under varying storage requirements and scenarios. What works best for write-heavy vs. read-heavy workloads? What about remote storage configurations? Is a shared storage backend for multiple peers a viable option? How might we more fully integrate IPFS-Cluster into the mix?
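The first workstream could be prototyped as a small free-space guard that refuses new blocks below a margin. A sketch assuming Linux (it uses `syscall.Statfs`, which is Unix-specific); the `minFreeBytes` threshold and function names are made up here, and a real node would make the limit configurable:

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// minFreeBytes is a hypothetical safety margin; a real implementation
// would make this configurable (absolute size or % of disk).
const minFreeBytes = 1 * 1024 * 1024 * 1024 // 1 GiB

// freeBytes reports the free space on the filesystem containing path.
// syscall.Statfs is Linux-specific; other platforms need their own call.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

// checkWritable refuses new block writes once free space drops below the
// margin, instead of letting the datastore corrupt itself at disk-full.
func checkWritable(path string) error {
	free, err := freeBytes(path)
	if err != nil {
		return err
	}
	if free < minFreeBytes {
		return errors.New("blockstore: low disk space, refusing new blocks")
	}
	return nil
}

func main() {
	if err := checkWritable("/tmp"); err != nil {
		fmt.Println("write refused:", err)
		return
	}
	fmt.Println("ok to write")
}
```

A production version would wrap the blockstore's Put path with a check like this (rechecked periodically rather than per write), so the failure mode is a clean error instead of corruption.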

Other content

Love your additional points here @carsonfarmer - both the mitigations and areas to explore seem very valuable to the increasing needs for IPFS as highly reliable production storage/indexing in a host of contexts.