mrhooray / crc-rs

Rust implementation of CRC(16, 32, 64) with support for various standards

Switching to slice16 by default increases already large resource footprint by 16x

cbiffle opened this issue

Hi! We use your crate in a variety of embedded applications. I noticed that there's a recently landed, yet-unreleased commit changing the default time-vs-space tradeoff of the crate to use significantly larger tables.

Instead of making slice16 the default, what about making NoTable or the current behavior the default? The tables generated by this crate are already too large for some of our applications, and with this change we'd need to move away from the crate entirely.

(I'm aware that we could go through and update our code and all consumers to use NoTable decorators, etc., but that seems hard to achieve in practice in the presence of third-party code.)

Thanks!

Slice16 is probably a better default for most applications, but minimizing impact to current users takes priority. So, I'm in favor of keeping the current behavior (Bytewise) as default. @KillingSpark wdyt?

I see the benefit for all users who do not care about the tradeoff: they just get the 6x speedup automatically, "for free", with nothing more than a rebuild. I came to this via the sctp implementation of the webrtc-rs project, just as an example. This would benefit most of the ecosystem without any effort from that part of the ecosystem, which is especially important for crates that are abandoned or not well maintained. I don't know how common this is for crates that depend on this crate, but let's just say CRCs are pretty common.

I think the people that do care about the tradeoff now have the tools to explicitly make a choice.
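To illustrate, here is a rough sketch of what making that choice explicit looks like with the new (still unreleased) implementation wrappers; the exact type names may shift before release, and the algorithm constant is just an example:

```rust
use crc::{Bytewise, Crc, NoTable, Slice16, CRC_32_ISO_HDLC};

fn main() {
    // Default: whatever implementation the crate picks for plain Crc<u32>.
    let default = Crc::<u32>::new(&CRC_32_ISO_HDLC);

    // Explicit choices, trading table size (0 B / 1 KiB / 16 KiB for a 32-bit
    // CRC) against throughput:
    let no_table = Crc::<NoTable<u32>>::new(&CRC_32_ISO_HDLC);
    let bytewise = Crc::<Bytewise<u32>>::new(&CRC_32_ISO_HDLC);
    let slice16 = Crc::<Slice16<u32>>::new(&CRC_32_ISO_HDLC);

    // All four compute the same checksum; only speed and table size differ.
    let data = b"123456789";
    assert_eq!(default.checksum(data), no_table.checksum(data));
    assert_eq!(bytewise.checksum(data), slice16.checksum(data));
}
```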

But on the other hand I see the benefit of not annoying the people that do actually care about this tradeoff.

In the end I don't really care, I'd just have to make a small PR to webrtc-rs to use the slice16 algorithm to reach my original goal. But I'd rather see everyone benefit from this effort.

I don't think using NoTable as the default is bad. People who care about performance will notice the drop in performance and update accordingly? But on the other hand, people who care about size would also notice the size difference and update accordingly, so... sorry, no, I am of no great help in this matter.

@cbiffle how would this affect third party code? Do you expose symbols of type Crc<uxx> in your API? Or is the worry that third party code could compile your code with the new defaults which would result in way larger tables?

Also: how is moving away from this crate easier than updating to using this crate with the explicit NoTable or Bytewise implementations?

@cbiffle, @KillingSpark I appreciate your feedback.

After thinking about this a bit more, I think we should stick with Bytewise as default for two reasons:

  • crc-rs has likely self-selected into more embedded-style applications due to its support for compile-time execution and custom algorithms.
  • If code changes are required for an application to continue functioning properly after a minor version update, this should be treated as a semver breaking change and require a major version bump.

I'll certainly track down as many dependent projects as I can, and suggest switching to Slice16 or NoTable where appropriate. But I think changing the defaults would be too risky.

@cbiffle how would this affect third party code? Do you expose symbols of type Crc<uxx> in your API? Or is the worry that third party code could compile your code with the new defaults which would result in way larger tables?

My concern is that any third party crate that uses crc would get a 16x resource usage multiplier, and would have to either agree with my preference for a default, or alter their API (transitively) to expose the choice to users. crc is not an uncommon dependency in the embedded space, so this seems like a real risk -- particularly since the cost is paid per polynomial used.

Also: how is moving away from this crate easier than updating to using this crate with the explicit NoTable or Bytewise implementations?

Well, it depends on this crate's policy on this sort of thing -- if crc is optimizing for performance on infinite-resource machines over generality, that's a valid choice, but one that makes using the crate in bare-metal embedded contexts difficult. In that case, moving away would be initially somewhat painful but may be the right decision in the long term. It sounds like this may not be necessary, though.

For reference, one of the pieces of firmware I'm shipping needs to fit in 4 kiB. For the entire image. In that case we already can't use the released crc even though we'd prefer to (though when a new release arrives with NoTable we will evaluate switching). I am using crc in larger deployed applications with between 32 and 128 kiB of Flash available, and in those cases it's worked well. Changing the default to slice16 (so a typical table size of 16 kiB for a CRC32) would rule it out for, as far as I can tell, all the places I use it.
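For reference, the arithmetic behind those table sizes (a back-of-the-envelope sketch, assuming 4-byte entries for a 32-bit CRC, which is how I'm estimating the footprint):

```rust
// Rough table footprint per CRC-32 polynomial, assuming 4-byte entries.
const BYTEWISE_TABLE_BYTES: usize = 256 * 4;      // 1 KiB per polynomial
const SLICE16_TABLE_BYTES: usize = 16 * 256 * 4;  // 16 KiB per polynomial

fn main() {
    println!("bytewise: {BYTEWISE_TABLE_BYTES} B, slice16: {SLICE16_TABLE_BYTES} B");
}
```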

the 6x speedup

Out of curiosity, what architecture and configuration are you measuring this on?

@akhilles -

If code changes are required for an application to continue functioning properly after a minor version update, this should be treated as a semver breaking change and require a major version bump.

I had assumed this was intended as a major version bump. If this change were headed for a minor version, it would seem significantly worse! So thanks for hearing me out.

Edit: Note that my opinion does not really have any weight, I do not speak for this project! I am just very curious :)

and would have to either agree with my preference for a default, or alter their API (transitively) to expose the choice to users.

I still don't quite see how this works. I get that the choice of default transitively affects all reverse dependencies by making the memory usage bigger, which would be (transitively) fixed by you updating your code to use the new Crc<Bytewise<uxx>>.

But I don't see why this would necessitate any API changes, unless you re-export the Crc<uxx> struct you use in a way that breaks when it is changed to Crc<Bytewise<uxx>>.

Could you give an example of this?

Well, it depends on this crate's policy on this sort of thing -- if crc is optimizing for performance on infinite-resource machines over generality, that's a valid choice

This is not what's happening here; with these changes there is even better support for low-resource machines. The proposed changes only affect the defaults, and I think the defaults should reflect the majority of use cases. Having only a few kiB of storage is probably a niche. I get that this must be annoying for you and other users in this niche, and you are right: this should probably be a major version update, even though technically it is a semver-compatible change.

I am just arguing that most people are compiling on and for architectures where even hello-world Rust binaries are on the order of hundreds of kiB. Adding 16 kiB more for a CRC lookup table seems fine to me in this context.

Out of curiosity, what architecture and configuration are you measuring this on?

As the readme says: an AMD Ryzen 7 3800X. I also made some of these PRs on a Mac Studio with a new M2 chip, which showed similar speedups. If you could test this on some embedded systems, that would be valuable information though!

Another datapoint: they achieved an 8x speedup in C++ (I think they used an older Intel CPU?): https://create.stephan-brumme.com/crc32/

For reference, one of the pieces of firmware I'm shipping needs to fit in 4 kiB. For the entire image. In that case we already can't use the released crc even though we'd prefer to (though when a new release arrives with NoTable we will evaluate switching). I am using crc in larger deployed applications with between 32 and 128 kiB of Flash available, and in those cases it's worked well. Changing the default to slice16 (so a typical table size of 16 kiB for a CRC32) would rule it out for, as far as I can tell, all the places I use it.

I really don't get the logic here :/
You say that it may be possible to switch to the new version of the crc crate for the 4 kiB images, but for the ones where you already use the crate, changing to the new version is somehow impossible and would rule out using this crate?

But I don't see why this would necessitate any API changes.

We're talking about different situations, I think. The scenario I'm concerned about is not code I control using crc, it's crates I depend on using crc internally. In that case, with the proposed change,

  1. They will immediately take 16x the flash per polynomial used,
  2. I will need to reach out to them to ask that they change their internal code to override the default, which they may not want to do because of the performance impact on large machines, and so
  3. I'd need to ask them to change their API to expose the choice of tabling strategy to me as a client (see the sketch below).
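To make that concrete, here is a hedged sketch of what such a dependency typically looks like inside (the crate and function are hypothetical; the crc calls are the ones from the current release):

```rust
// Inside a hypothetical third-party crate that I depend on but don't control.
// It hard-codes the default implementation, so every application that pulls
// this crate in pays for whatever table size the crate-level default implies.
use crc::{Crc, CRC_32_ISO_HDLC};

const CRC32: Crc<u32> = Crc::<u32>::new(&CRC_32_ISO_HDLC);

pub fn frame_checksum(frame: &[u8]) -> u32 {
    CRC32.checksum(frame)
}
```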

Having only a few kiB of storage is probably a niche.

It's definitely a niche, but a niche a lot of us work in -- the chips with less total RAM than your Ryzen has L1 cache on a single core. Rust is particularly well suited to this niche, but it does mean we need to stay vigilant when our dependencies don't test on machines with limited resources. Which is why I've appeared with this bug report. :-)

Another datapoint: they achieved an 8x speedup in C++ (I think they used an older Intel CPU?): https://create.stephan-brumme.com/crc32/

While the specific microcontroller used is pretty dated, the result there that is closest to the situation endured by most embedded developers is the Arduino test case; to quote your source,

All the branch-free tricks' arithmetic overhead severely decrease performance and in the end the original algorithm remains the best among all table-less.

The slightly higher arithmetic complexity of Slicing-by-8 causes a lower throughput compared to the overall winner Slicing-by-4, too. I didn't even try Slicing-by-16.

Considering the available Flash of the Arduino (only 32 kByte), I would prefer the Standard implementation.

One of the micros in our current system has cache, and many are larger by an order of magnitude, but the key points of this analysis hold: for machines without branch prediction, without micro-op fusion, without caches, without virtual memory that causes storage to act like it's zero-wait-state, the highest performing option is not the same as on your Ryzen or an "older Intel CPU" (unless you mean an 8051).

We're talking about different situations, I think. The scenario I'm concerned about is not code I control using crc, it's crates I depend on using crc internally. In that case, with the proposed change

I see! Yes, that is a situation I didn't explicitly consider. It could be dealt with by pinning the dependency on crc, but that's definitely a workaround, not a real solution.

  1. I will need to reach out to them to ask that they change their internal code to override the default, which they may not want to do because of the performance impact on large machines, and so
  2. I'd need to ask them to change their API to expose the choice of tabling strategy to me as a client.

So you are instead asking everyone using large machines to take a pretty big performance hit. That seems... less than ideal, especially since you agree that your use case is the niche. I'd favor considering the benefit to the total ecosystem. In the end, we are not talking about big changes in these crates; it should just be a check for a feature flag.

Which is why I've appeared with this bug report. :-)

And I am glad I get to learn something about this niche in the process, it's definitely something I have very little practical knowledge in!

to quote your source,

Goes to show I should have read this document more thoroughly; I could have learned even more. I was aware that these microcontrollers have widely different performance characteristics than modern super-scalar processors, but not of the magnitude of the difference.

Talking about feature flags though... @akhilles do you think it is viable to change the defaults to be behind feature flags? This should allow @cbiffle to change them even for indirect dependencies, AFAIK.

So make a feature flag "slice16defaults" or some such that is on by default. If it's not enabled, the default impl gets changed to the NoTable implementation? We could even provide an additional one for "bytewisedefaults".

I think this is ok with respect to the "feature flags should only add features" rule that cargo imposes, since these are all semver compatible and "just" increase/decrease the default lookup table sizes.

Edit: Apparently it is possible to control features for sub-dependencies: https://stackoverflow.com/questions/59994525/how-do-i-specify-features-for-a-sub-dependency-in-cargo. Disabling defaults should also be possible the same way.
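For concreteness, the trick from that answer looks roughly like this: even if crc only shows up as a transitive dependency, the application can list it directly to turn flags on, because Cargo unifies features across the whole graph (the feature name here is the illustrative one from above, not a real crc feature):

```toml
# Hypothetical downstream application's Cargo.toml; "bytewisedefaults" is an
# illustrative flag name, not an actual crc feature.
[dependencies]
crc = { version = "3", features = ["bytewisedefaults"] }
```

Note that turning a default feature off for a transitive dependency only takes effect if every crate depending on crc opts out, so additive flags are the more reliable direction.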

Edit2: If this is agreeable with @cbiffle I'd be willing to make a PR that introduces these feature flags

The feature flags proposal is interesting. In the spirit of "feature flags should only add features", I think feature flags should also only add hardware compatibility. So, if both the "no-table" and "slice-16" feature flags are set, then "no-table" should have priority as it maximizes the functionality of the resulting binary.

Another option is to detect the opt-level compiler setting. If it's set to "s" or "z", which optimize for binary size, then NoTable is used as the default implementation.
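A sketch of how that could be probed: Cargo exposes the active profile's opt-level to build scripts through the OPT_LEVEL environment variable (the emitted cfg name below is made up, not part of crc today):

```rust
// build.rs sketch (illustrative only): emit a cfg when the user optimizes for
// binary size, so the crate could fall back to the NoTable default.
fn main() {
    let opt_level = std::env::var("OPT_LEVEL").unwrap_or_default();
    if opt_level == "s" || opt_level == "z" {
        println!("cargo:rustc-cfg=crc_prefer_no_table");
    }
}
```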

I don't know about the opt-level. As a user I'd not expect that to change stuff this drastically.

I agree that the NoTable flag should take precedence.

@akhilles Hi, just wanted to make sure you saw the draft PR. I don't know if drafts cause a notification, that's why I am pinging here. If you just have other things to do, don't worry about it :)

Yup, it's on my TODO. Don't have much free time over the next 1-2 weeks unfortunately but will review soon after that.

@cbiffle I believe #101 addresses your use-cases, but please LMK if that's not the case.

  1. The default implementation will remain as Bytewise, so you should see no increase in resource usage from upgrading.
  2. You can use a feature flag to further reduce resource usage by making NoTable the default implementation (even across dependencies).

Closing this since #101 was merged. Please re-open if the resource footprint is still a concern.