ipfs / roadmap

IPFS Project && Working Group Roadmaps Repo

[2021 Theme Proposal] probabilistic tiering

RubenKelevra opened this issue · comments

Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!

Theme description

Please describe the objective of your proposed theme, what problem it solves, and what executing on it would mean for the IPFS Project.

Hypothesis

Currently, data availability is a binary state - a node either has the data or it doesn't. If a node has the data, it can provide it, and every node searching for that data assumes that all providers have exactly the same (free) network bandwidth, disk speed, processing power, etc.

While this makes sense to a degree, since we just issue more requests to nodes that deliver the data faster, this assumption has its limits.

Mobile devices, for example, might run on batteries, which makes requesting data from them a poor choice - one case where a plain round-robin approach doesn't make sense.

Attaching a tiering number to the data a node provides (stored in the DHT) makes it possible to select a server instead of a tablet or Raspberry Pi as a source. To avoid only the servers getting the requests, it makes sense to take a probabilistic approach to the data request:

The higher the tiering number, the higher the probability that the request will start there. This keeps small nodes from getting overloaded, because they no longer receive the same number of requests as a large server.

While the tiering number saved in the DHT makes sense as an average - for example, on each provide the node could save its 95th-percentile load over the providing period - short overloads might make it necessary to send an informational message that updates the tiering number for every node currently requesting data from it.

An example of how local data requests could still be preferred is to multiply the tiering number by the measured latency (in ms), favoring nearby nodes over nodes far away.
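
A minimal sketch of that probabilistic pick, assuming the convention used later in this thread where a lower tiering value means a faster node; the Provider type and the inverse weighting are illustrative only, nothing like this exists in go-ipfs today:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Provider is a hypothetical view of a DHT provider record plus a
// locally measured round-trip latency.
type Provider struct {
	ID      string
	Tier    float64 // tiering value from the DHT (lower = faster node)
	Latency float64 // measured latency to the node, in ms
}

// pickProvider samples one provider with probability inversely
// proportional to tier × latency: slower or more distant nodes still
// get requests, just less often.
func pickProvider(providers []Provider) Provider {
	weights := make([]float64, len(providers))
	var total float64
	for i, p := range providers {
		weights[i] = 1.0 / (p.Tier * p.Latency)
		total += weights[i]
	}
	r := rand.Float64() * total
	for i, w := range weights {
		if r < w {
			return providers[i]
		}
		r -= w
	}
	return providers[len(providers)-1]
}

func main() {
	providers := []Provider{
		{ID: "server", Tier: 50, Latency: 15},
		{ID: "phone", Tier: 1000, Latency: 25},
	}
	// With these numbers the server gets picked roughly 33× as often as the phone.
	fmt.Println(pickProvider(providers).ID)
}
```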

Vision statement

In the long run we probably want to create tiered data stores, which would make it possible to mark different data with different tiering, based on the storage media it is saved on.

This would enable us to use cheap cold storage, like Blu-ray discs or magnetic tapes with an automated loading mechanism, to store extremely large quantities of data over long periods where few accesses are necessary - since most of it won't get accessed at all as long as it is also stored somewhere else in the network.

Why focus this year

I've heard there's work to get the Internet Archive onto the network, and mobile clients are also a topic to tackle soon. Both would benefit from a way to differentiate between different tiers of nodes.

Example workstreams

  • Implement a measurement of free memory, processing speed and bandwidth to estimate how fast a node can provide data after startup.
  • Track how fast we deliver data over a long period to estimate the free bandwidth.
  • Update the value in the DHT on each reprovide.
  • Implement informational updates in the protocol to give nodes a more recent estimate, like a rolling 5-minute mean of the 95th percentile (see the sketch below).
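
As a rough illustration of that last workstream, a rolling 95th-percentile service-time estimate could look something like this - all names are hypothetical, nothing like this exists in go-ipfs today:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type sample struct {
	at time.Time
	ms float64
}

// LoadEstimator keeps per-block service times for a rolling window and
// reports the 95th percentile, which could then be combined with the
// static tier multiplier and published on the next reprovide.
type LoadEstimator struct {
	window  time.Duration
	samples []sample
}

// Record stores how long a block took to go "out the door", in ms,
// and drops samples that have fallen out of the window.
func (e *LoadEstimator) Record(ms float64) {
	now := time.Now()
	e.samples = append(e.samples, sample{now, ms})
	cut := 0
	for cut < len(e.samples) && now.Sub(e.samples[cut].at) > e.window {
		cut++
	}
	e.samples = e.samples[cut:]
}

// P95 returns the 95th-percentile service time over the window.
func (e *LoadEstimator) P95() float64 {
	if len(e.samples) == 0 {
		return 0
	}
	times := make([]float64, len(e.samples))
	for i, s := range e.samples {
		times[i] = s.ms
	}
	sort.Float64s(times)
	idx := int(float64(len(times)) * 0.95)
	if idx >= len(times) {
		idx = len(times) - 1
	}
	return times[idx]
}

func main() {
	est := &LoadEstimator{window: 5 * time.Minute}
	for _, ms := range []float64{12, 40, 9, 300, 25} {
		est.Record(ms)
	}
	fmt.Printf("p95 service time: %.0f ms\n", est.P95())
}
```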

This would be a great and very useful tool to have indeed! A lot of applications could benefit from a more refined query method that decides whom to ask based on several statistics. I'm researching decentralized dispatchers and scheduling right now (admittedly rather academic in nature), but I'm very much interested in seeing what you all come up with.
I'm not sure how well a one-dimensional metric would work in practice, but it's a good starting point!

Your proposal has a good point though, and would best be implemented as part of a Bitswap protocol extension.

If you're interested, I'm currently working on a more generic cooperative scheduling protocol in the context of the Ambients protocol. Bitswap requests could be expressed in ambient calculus terms and therefore be scheduled that way. (PS: the Ambients protocol is an awesome idea by @haadcode and friends from OrbitDB.) Non-interfering boxed ambients is a great read if you're into dry research papers ;)

@JonasKruckenberg wrote:

I'm not sure how well a one-dimensional metric would work in practice but it's a good starting point!

I think one dimension is best in this case, since there are a lot of caches, data store types, network bandwidth fluctuations, etc. I doubt we can fill more than one dimension with sensible data.

Also, using explicit flags like "this is a mobile node running on batteries" might lead to MORE accesses on them, if someone thinks it's funny to drain batteries.

So having one metric hides these details to a degree.

I'm a big fan of Pressure Stall Information (PSI), which was added to the Linux kernel. It's basically a percentage of how much time is wasted waiting for a specific resource; 100% means all active processes are waiting for, say, memory.
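
For reference, PSI is exposed under /proc/pressure/cpu, /proc/pressure/memory and /proc/pressure/io on Linux ≥ 4.20. A small Go sketch that reads the 10-second "some" average - the helper function is my own, not part of any library:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readPSIAvg10 parses a PSI file, which looks like:
//   some avg10=0.12 avg60=0.08 avg300=0.02 total=123456
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
// and returns the "some" avg10 percentage.
func readPSIAvg10(path string) (float64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if !strings.HasPrefix(line, "some ") {
			continue
		}
		for _, field := range strings.Fields(line) {
			if strings.HasPrefix(field, "avg10=") {
				return strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			}
		}
	}
	return 0, fmt.Errorf("no 'some' line in %s", path)
}

func main() {
	for _, res := range []string{"cpu", "memory", "io"} {
		v, err := readPSIAvg10("/proc/pressure/" + res)
		if err != nil {
			fmt.Println(res, "pressure unavailable:", err)
			continue
		}
		fmt.Printf("%s pressure (10s avg): %.2f%%\n", res, v)
	}
}
```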

While I think percentages don't make sense here, we could measure the milliseconds a requested block takes to be "out the door" - in TCP/QUIC terms, until the block is confirmed received on the other end. Sure, this method would add some ms depending on the distance to the endpoint, but it can catch all types of network latencies, not only local ones.

Not sure we can look that deep into the TCP/QUIC stack, though.

It obviously doesn't make sense to track every request, but we could track one request every 100ms or so.

Since the original proposal didn't make it clear: the idea is to have the daemon look at the hardware and the available memory and choose a sensible multiplier.

If the node is running on a mobile phone, it makes sense to set a high multiplier, like 5. If it runs on a server dedicated to just running ipfs, you may want to use 0.5.

Then there's a live measurement which, every 100 ms or so, picks a block request and tracks how long it takes to run through the node until it's confirmed to be delivered. This time is measured in ms.

Those milliseconds are then multiplied by the static multiplier from the settings.

So a mobile phone might give you a 200 ms answer for a small block, but multiplied it's now 1,000 ms - while a server gives you the same block in 100 ms, which gets multiplied by 0.5 and ends up as a tiering value of 50 ms.

The reasoning behind a multiplier lower than 1 is that a server might have larger caches and more readahead, so it can deliver large amounts of data faster than a regular desktop node, where the cache is being used for other things.

Now the node requesting data does its latency estimation for both connections, which gives 15 ms for the server and 25 ms for the mobile phone.

Now you end up with a final tiering value of 1,000 x 25 = 25,000 for the mobile phone and 50 x 15 = 750 for the server.
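
Just to restate that calculation as code - the 5 and 0.5 multipliers are the example settings from above, not defaults anywhere:

```go
package main

import "fmt"

// tieringValue = measured block time (ms) × static multiplier × measured latency (ms)
func tieringValue(blockTimeMs, multiplier, latencyMs float64) float64 {
	return blockTimeMs * multiplier * latencyMs
}

func main() {
	phone := tieringValue(200, 5, 25)    // 200 × 5 = 1,000, × 25 ms latency = 25,000
	server := tieringValue(100, 0.5, 15) // 100 × 0.5 = 50, × 15 ms latency = 750
	fmt.Println("phone:", phone, "server:", server)
}
```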

By default the round-robin should probably cut off slow nodes, say at <10; if none qualify, try <100, and if still none, <1000.

So the mobile phone would only be queried when there's a block that the server, or any other faster node, isn't providing. :)

Since everything except the latency to the node can be calculated without connecting to it, it makes sense to filter out very slow nodes first, before doing the latency-measuring step, after which the list is filtered again.
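
A rough sketch of that two-stage filtering, with purely illustrative thresholds and types:

```go
package main

import "fmt"

// Candidate is a hypothetical provider entry: the tier value is known
// from the DHT before connecting, the latency only after a probe.
type Candidate struct {
	ID        string
	Tier      float64 // published tiering value, known before connecting
	LatencyMs float64 // filled in after the latency probe
}

// preFilter keeps the candidates under the tightest threshold that
// still leaves at least one node.
func preFilter(cands []Candidate, thresholds []float64) []Candidate {
	for _, t := range thresholds {
		var kept []Candidate
		for _, c := range cands {
			if c.Tier < t {
				kept = append(kept, c)
			}
		}
		if len(kept) > 0 {
			return kept
		}
	}
	return cands // nothing passed; fall back to everyone
}

func main() {
	cands := []Candidate{
		{ID: "server", Tier: 50},
		{ID: "phone", Tier: 1000},
	}
	shortlist := preFilter(cands, []float64{10, 100, 1000})
	fmt.Println(shortlist) // only the server survives the <100 cut
	// ...then measure latency for the shortlist, multiply, and filter again.
}
```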

The latency step could include a query message to fetch a short-term load value, to get a better picture of the node's current load.

Forgive me if I'm being obtuse, but surely this could lead to exploitation by nodes that artificially inflate their multiplier, say by running a modified version of the daemon. The Bitswap protocol gives preferential treatment to nodes that aren't leeching and are providing blocks, but if the node requesting data sees that inflated multiplier, it wouldn't even try to request data - leading either to actually less capable nodes falling behind as they never get to serve blocks, or to those nefarious nodes getting away with not serving data, depending on the implementation. Perhaps a reasonable cap on the value would be appropriate, so that a node with no latency could never make itself appear so incapable of serving data that it successfully games the system?

Well I'm not sure that we're on the same page here:

The idea is to prefer nodes which are not running on batteries. So yes, this can be exploited if the daemon is led to believe that the device is running on batteries.

Sure, there may be some folks who exploit this metric, but in the grand scheme of things the impact of forged multipliers is negligible:

If a node is running on batteries, it's better to preserve battery life than to be pedantic about how "fair" the network load is:

Otherwise users may not use IPFS, because it drains too much battery.