rust-adventure / raytracing-in-one-weekend

The Ray Tracing in One Weekend series, completed in Rust


Question on size of enum `Material`

LollipopFt opened this issue · comments

Hi, I was working through Ray Tracing in One Weekend using this repo as a way to learn Rust. Having recently read this blog post, I decided to test out how much memory your enum-in-an-enum took. This is what I found: 280 bytes total for the Material enum, and 16 bytes total for a hypothetical Smallest enum which:

  • has more than one field with a non-zero size.
  • includes the smallest field in the Material enum, an f64.
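
Roughly the test I ran, as a sketch (Smallest is just that hypothetical enum, not anything from the repo):

use std::mem::size_of;

// a hypothetical enum: more than one non-zero-sized field,
// including the smallest field in Material, an f64
#[allow(dead_code)]
enum Smallest {
    A(f64),
    B(f64),
}

fn main() {
    // 16: an 8-byte-aligned discriminant plus the 8-byte payload
    dbg!(size_of::<Smallest>());
}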

Most of the space is taken up by the Perlin struct.

My questions are:
Do you think this would affect the speed of image generation due to the multitudes larger enum?
How would you go about reducing that extra 264 bytes of space?
Other solutions require Box<dyn ...>. How would this compare to the enum solution?
Thank you!

Heya!

Some research first:

The actual size of Texture is 272.

note: I use an example called examples/dump.rs for intermediate experiments, so when it says examples/dump.rs in the dbg output, that's where I'm putting this code.

Basic size exploration

use glam::DVec3;
use image::DynamicImage;
use noise::Perlin;
// note: HitRecord's module path is assumed here; adjust it to
// wherever HitRecord actually lives in the crate
use raytracer::{
    hittable::HitRecord, material::Material,
    shapes::Shapes, textures::Texture,
};

fn main() {
    dbg!(std::mem::size_of::<Texture>());
    dbg!(std::mem::size_of::<Material>());
    dbg!(std::mem::size_of::<&Material>());
    dbg!(std::mem::size_of::<Shapes>());
    dbg!(std::mem::size_of::<HitRecord>());

    dbg!(std::mem::size_of::<Perlin>());
    dbg!(std::mem::size_of::<&Perlin>());
    dbg!(std::mem::size_of::<DynamicImage>());
    dbg!(std::mem::size_of::<&DynamicImage>());
    dbg!(std::mem::size_of::<DVec3>());
    dbg!(std::mem::size_of::<&DVec3>());
    dbg!(std::mem::size_of::<f64>());
    dbg!(std::mem::size_of::<&f64>());
    dbg!(std::mem::size_of::<bool>());
    dbg!(std::mem::size_of::<&bool>());
}

Off the cuff, I would definitely expect the Perlin/Turbulence Textures to be the worst offenders when it comes to size. This rings true when we look at the output of the above dump.rs.

std::mem::size_of reports 272 for the Texture size. This is slightly bigger than your Rust playground example.

[examples/dump.rs:9] std::mem::size_of::<Texture>() = 272
[examples/dump.rs:10] std::mem::size_of::<Material>() = 280
[examples/dump.rs:12] std::mem::size_of::<&Material>() = 8
[examples/dump.rs:11] std::mem::size_of::<Shapes>() = 408
[examples/dump.rs:13] std::mem::size_of::<HitRecord>() = 360

[examples/dump.rs:16] std::mem::size_of::<Perlin>() = 260
[examples/dump.rs:17] std::mem::size_of::<&Perlin>() = 8
[examples/dump.rs:18] std::mem::size_of::<DynamicImage>() = 40
[examples/dump.rs:19] std::mem::size_of::<&DynamicImage>() = 8
[examples/dump.rs:20] std::mem::size_of::<DVec3>() = 24
[examples/dump.rs:21] std::mem::size_of::<&DVec3>() = 8
[examples/dump.rs:22] std::mem::size_of::<f64>() = 8
[examples/dump.rs:23] std::mem::size_of::<&f64>() = 8
[examples/dump.rs:24] std::mem::size_of::<bool>() = 1
[examples/dump.rs:25] std::mem::size_of::<&bool>() = 8

Notably, HitRecord, which we create a lot of while looping, is 360 bytes.

Doing some simple math:

  • 2 DVec3s == 2 * 24 == 48
  • 3 f64s == 3 * 8 == 24
  • 1 bool == 1
  • 1 Material == 280

That adds up to 48 + 24 + 1 + 280 == 353, which alignment padding rounds up to the 360 we measured.

Clones

Here's every .clone in the entire project.

❯ rg clone
src/shapes.rs
89:                let mut origin = ray.origin.clone();
90:                let mut direction = ray.direction.clone();

src/shapes/constant_medium.rs
89:            material: self.phase_function.clone(),

src/shapes/quad_box.rs
48:            material.clone(),
54:            material.clone(),
60:            material.clone(),
66:            material.clone(),
72:            material.clone(),
78:            material.clone(),

src/shapes/rounded_box.rs
75:                    self.material.clone(),
159:                self.material.clone(),
227://             self.material.clone(),

src/shapes/sphere.rs
118:            self.material.clone(),

src/shapes/cylinder.rs
16://     self.material.clone(),
58:                self.material.clone(),
74:                self.material.clone(),

src/shapes/quad.rs
97:            self.material.clone(),

src/shapes/a_box.rs
43:                    self.material.clone(),
55:                    self.material.clone(),
68:        //     self.material.clone(),

src/shapes.rs

Both clones are DVec3s, not textures or materials (and DVec3 is Copy, so these clones are trivially cheap). They are rotation-related and happen on every hit, so I would expect this to only affect shapes that are rotated. Looking at the code, we're supposed to be creating a new Ray there, but a good improvement would be to store that rotation instead of re-calculating it on every hit(), as sketched below.
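
A sketch of that improvement, assuming a rotate-style wrapper shape (the names are invented for illustration, not the project's actual types):

// precompute the rotation's sin/cos once at construction
// instead of re-deriving it per hit()
struct RotatedShape {
    sin_theta: f64,
    cos_theta: f64,
    // the real struct would also hold the wrapped shape, material, etc.
}

impl RotatedShape {
    fn new(angle_degrees: f64) -> Self {
        let radians = angle_degrees.to_radians();
        Self {
            sin_theta: radians.sin(),
            cos_theta: radians.cos(),
        }
    }
}

fn main() {
    let rotated = RotatedShape::new(45.0);
    // hit() would read these precomputed values directly
    dbg!(rotated.sin_theta, rotated.cos_theta);
}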

src/shapes/constant_medium.rs

Clones a Material::Isotropic on every hit(). This only affects "fog volumes".

src/shapes/quad_box.rs

The clones are all in the new() constructor: a one-time cost of cloning the material. Unlikely to affect runtime, since it happens only once, up front.

src/shapes/sphere.rs

Clones self.material on every hit().

src/shapes/quad.rs

Clones self.material on every hit().

Other clones

These three shapes contain clones, but I don't consider them "working" anyway; they're just things I was experimenting with.

  • src/shapes/rounded_box.rs
  • src/shapes/cylinder.rs
  • src/shapes/a_box.rs

clones conclusion

There are a couple of places where we're cloning materials on every hit. I'm pretty sure we don't need to be doing that, but I didn't realize it when I originally went through the series.

To me, this says that we create a lot of HitRecords, which is why we're cloning the Materials in the first place. Every ray that hits a sphere, a quad, or a fog volume clones a material.

More measuring stuff

Before we go further, let's actually measure something at least, even if it's not perfect.

all-materials-spheres original runtime

I ran the all-materials-spheres example three times in --release mode to evaluate how long it took. This was on an Apple M1 Max, which at this time means no SIMD, and that greatly impacts the runtime. The runtimes were 22-24s:

  • 22s
  • 23s
  • 24s

DHAT

DHAT is a heap profiler. It could give us some insight.
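
For context, wiring dhat into a program follows the standard pattern from the dhat crate docs, gated behind a cargo feature (this mirrors the --features dhat-heap flag in the run below):

// standard dhat-heap setup from the dhat crate docs
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // profiling starts here; the stats and dhat-heap.json are written
    // when _profiler is dropped at the end of main
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ... set up the scene and render ...
}
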
Here's a basic run and the commit I used.

cargo run --release --features dhat-heap --example all-materials-spheres
     Running `target/release/examples/all-materials-spheres`
dhat: Total:     60,708,575 bytes in 728,138 blocks
dhat: At t-gmax: 27,129,072 bytes in 364,589 blocks
dhat: At t-end:  81,424 bytes in 191 blocks
dhat: The data has been saved to dhat-heap.json, and is viewable with dhat/dh_view.html

You can view the dhat-heap.json file with the dh_view.html viewer mentioned in that output.

It doesn't give me much insight with respect to your questions, but there are some numbers to look at at least.

HitRecord

Do you think this would affect the speed of image generation due to the multitudes larger enum?

We know that Material is comparatively big (280 bytes), and &Material is comparatively small (8 bytes). Your question is what happens if the enums are smaller. Using references would make the data we're passing around smaller, so let's find out.

I replaced the clones with references in this commit in response to your question and re-ran our all-materials-spheres example.

  • 17s
  • 18s
  • 17s

So with references instead of full clones, we dropped 4-7 seconds from our very unscientific benchmark.
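
In sketch form, the change amounts to borrowing the material in HitRecord instead of owning a clone of it (the field names here are assumed from the size math above, not copied from the repo):

use glam::DVec3;
use raytracer::material::Material;

// field names assumed; the important change is the last field:
// Material (280 bytes, cloned per hit) becomes &'a Material (8 bytes, no clone)
pub struct HitRecord<'a> {
    pub point: DVec3,
    pub normal: DVec3,
    pub t: f64,
    pub u: f64,
    pub v: f64,
    pub front_face: bool,
    pub material: &'a Material,
}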

Heap usage didn't really change:

dhat: Total:     60,534,012 bytes in 726,414 blocks
dhat: At t-gmax: 27,129,536 bytes in 364,816 blocks
dhat: At t-end:  81,424 bytes in 191 blocks
dhat: The data has been saved to dhat-heap.json, and is viewable with dhat/dh_view.html

Conclusion

I don't really think the enum sizes are impacting much, since once we're passing references around we're never going to get below the size of the shared reference anyway.

And really, this matches the blog post, because we aren't doing the thing the blog post calls out as a pain point:

This presents real pain when collecting a large number of them into a Vec or HashMap

It might matter if we had an extremely large number of Shapes to set up our scene, but realistically we should be using acceleration structures in that case anyway, as in the chapter on BVHs.

I suppose we could hit the issue mentioned in the blog post if, when using acceleration structures, we constructed a really large scene, stored all of our Shapes in a Vec, and then used indexes into that Vec to power the acceleration structure (which would be a binary tree, as described in the Ray Tracing series).
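
Something like this hypothetical layout (the types are invented for illustration):

// all shapes in one flat Vec, with BVH nodes referring to them by index
struct FatShape([u8; 408]); // stand-in for the 408-byte Shapes enum

enum BvhNode {
    Leaf { shape: usize },                // index into Scene::shapes
    Branch { left: usize, right: usize }, // indices into Scene::bvh
}

struct Scene {
    shapes: Vec<FatShape>,
    bvh: Vec<BvhNode>,
}

fn main() {
    let scene = Scene {
        shapes: vec![FatShape([0; 408]), FatShape([0; 408])],
        bvh: vec![
            BvhNode::Branch { left: 1, right: 2 },
            BvhNode::Leaf { shape: 0 },
            BvhNode::Leaf { shape: 1 },
        ],
    };
    dbg!(scene.bvh.len());
    // every element is stored inline, so the 408-byte element size
    // is what dominates memory at large shape counts
    dbg!(scene.shapes.len() * std::mem::size_of::<FatShape>());
}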

4k resolution is 8.3 million pixels.

So if every pixel of a 4k screen showed a pixel of a different shape, we'd need 408 * 8,300,000 == 3,386,400,000 bytes, or about 3.39 gigabytes, to store the data with the current enum, assuming every object has a unique material/texture.

Of course, at that scale a single pixel isn't enough to recognize any of the shapes those objects represent, so I don't really consider the enum size to be a problem in the project right now. If it were, I think the first step I'd take is to switch to references for all of the materials and textures, requiring them to be constructed up front and only referenced in the program.

How would you go about reducing that extra 264 bytes of space?

As seen above, I'd start by using references since we really don't need to be cloning the materials.

After that, I'd look at swapping the Textures in Material variants for &Texture, or the Perlin in Texture variants for &Perlin. You could also remove Perlin from Texture entirely and force people to use a specific pre-constructed version of Perlin noise, but that gets into changing the behavior of the program.

That said, references aren't always a size win. While Perlin vs &Perlin is 260 vs 8, f64 vs &f64 is 8 vs 8, and bool vs &bool is even worse at 1 vs 8.

This is because shared references have the same size and alignment as usize.
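
Here's a minimal sketch of that size trade-off, using Box in place of a reference so the snippet stands alone (BigNoise is a stand-in for the ~260-byte Perlin struct):

use std::mem::size_of;

// stand-in for the ~260-byte Perlin struct
#[allow(dead_code)]
struct BigNoise([u8; 260]);

// sized to fit its largest variant
#[allow(dead_code)]
enum FatTexture {
    SolidColor(f64),
    Noise(BigNoise),
}

// every variant is now at most pointer-sized
#[allow(dead_code)]
enum SlimTexture {
    SolidColor(f64),
    Noise(Box<BigNoise>),
}

fn main() {
    dbg!(size_of::<FatTexture>()); // hundreds of bytes, sized for BigNoise
    dbg!(size_of::<SlimTexture>()); // 16: tag plus a pointer-sized payload
}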

Your other question is:

Other solutions require Box<dyn ...>. How would this compare to the enum solution?

Using traits for various things allows a user to extend the program. If materials were a trait, for example (like the Material trait in Bevy), you could implement it for any additional struct, making it possible for end-users to extend the behavior of the program. If this were a raytracing "engine" that people were supposed to install and extend, a trait could be a good choice for a number of the enums we currently have.
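
As a minimal sketch of what that extensibility looks like (the Scatter trait and its signature are invented here for illustration; they're not the project's actual API):

use glam::DVec3;

trait Scatter {
    fn scatter(&self, attenuation: DVec3) -> DVec3;
}

struct Lambertian {
    albedo: DVec3,
}

impl Scatter for Lambertian {
    fn scatter(&self, attenuation: DVec3) -> DVec3 {
        self.albedo * attenuation
    }
}

// an end-user can add a material without touching the library
struct MyCustomMaterial;

impl Scatter for MyCustomMaterial {
    fn scatter(&self, attenuation: DVec3) -> DVec3 {
        attenuation * 0.5
    }
}

fn main() {
    // each element is a 16-byte fat pointer (data + vtable);
    // the concrete structs live on the heap
    let materials: Vec<Box<dyn Scatter>> = vec![
        Box::new(Lambertian { albedo: DVec3::splat(0.8) }),
        Box::new(MyCustomMaterial),
    ];
    for material in &materials {
        dbg!(material.scatter(DVec3::ONE));
    }
}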

I chose enums specifically because they are concrete and easier to deal with, not because they have any inherent performance benefits.

Did that help?

Some more unscientific "benchmarks" for the change described above.

On my SIMD-enabled Windows PC, the runtime of the original program was approx 7s, and the new one runs in approx 4s. These were hand-run and wall-clock timed, but consistently showed a visible improvement. I think for any future perf work on the SIMD-enabled computer I'd have to write some criterion benchmarks or use something longer-running.

Woah, thank you for the informative answer! That's a really in-depth dive into the runtime differences between cloning and references. Thanks for also answering why some people may prefer to use a Box<dyn Trait> (for extensibility). I also did not know about the dhat-heap feature; I will be using it more often in future projects. Thanks!