rand performs unaligned memory access (and invalid raw ptr usage)
RalfJung opened this issue
The following program fails to run in Miri with an unaligned memory access:
```rust
use rand::Rng;

fn main() {
    let mut rng = rand::thread_rng();
    let _val = rng.gen::<i32>();
    let _val = rng.gen::<usize>(); // causes unaligned memory access
}
```
This is caused by this line of code:
Line 186 in 852988b
which casts a 4-aligned pointer to a `*const i64` and then uses that pointer for a read. `i64` reads must be 8-aligned. There is a comment in the source indicating that this "requires little-endian CPU supporting unaligned reads", but that is not sufficient: the alignment requirement is imposed by LLVM, not by the CPU. In Rust, all accesses must be aligned on all platforms, no matter what the CPU permits; any unaligned access causes UB (on all platforms). Unaligned accesses are only allowed via `read_unaligned` and `write_unaligned`.
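For illustration, here is a minimal sketch (not rand's actual code; the buffer and values are made up) of a read that is legal on every platform, using `read_unaligned`:

```rust
use std::ptr;

fn main() {
    // A u32 buffer is only guaranteed 4-byte alignment, so reading a u64
    // through a pointer into it may be an unaligned access.
    let buf: [u32; 4] = [0x1111_1111, 0x2222_2222, 0x3333_3333, 0x4444_4444];
    let p = buf.as_ptr() as *const u64;
    // `*p` or `ptr::read(p)` would be UB whenever p is not 8-aligned;
    // read_unaligned has no alignment requirement and is fine everywhere.
    let v = unsafe { ptr::read_unaligned(p) };
    if cfg!(target_endian = "little") {
        // the first two u32s, combined little-endian
        assert_eq!(v, 0x2222_2222_1111_1111);
    }
}
```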
There is another problem with this code as well: it violates pointer provenance rules. `&results[index] as *const u32` creates a raw pointer that may be used to access that particular element, but it is not legal to use this pointer to access neighboring elements that were never cast to a raw pointer. You could try to fix this with something like `&results[index..index+1] as *const [u32] as *const u32 as *const u64`.
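To make the provenance fix concrete, here is a hedged sketch (with hypothetical `results`/`index` values; note that the slice here is taken over *both* `u32` elements that the eight-byte read touches, so the resulting pointer's provenance covers all eight bytes):

```rust
use std::ptr;

fn main() {
    let results: [u32; 4] = [1, 2, 3, 4];
    let index = 0;
    // Cast through a slice covering both u32s that the u64 read will
    // touch, so the raw pointer may legally access all eight bytes.
    let p = &results[index..index + 2] as *const [u32] as *const u32 as *const u64;
    // Still need read_unaligned: the pointer is only 4-aligned.
    let v = unsafe { ptr::read_unaligned(p) };
    if cfg!(target_endian = "little") {
        assert_eq!(v, (2u64 << 32) | 1);
    }
}
```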
Is the performance of the "fallback code path" really so much worse that all these hacks are worth it?
`fill_bytes` and `try_fill_bytes` in the same file also seem to be based on the incorrect assumption that unaligned accesses are okay in Rust when compiling for x86:

```rust
// As an optimization we try to write directly into the output buffer.
// This is only enabled for little-endian platforms where unaligned writes
// are known to be safe and fast.
```
Thanks @RalfJung; some good questions here.
This particular example is around 13% faster with the HC128 generator on my system (the `gen_u64_hc128` benchmark), but the difference is much smaller for other generators (1–4%). In my opinion, that's enough of a difference to be worth taking if it's free, but not if it causes UB.
Following your suggestions, I get around the same performance with this version of the code:
```rust
if cfg!(any(target_arch = "x86", target_arch = "x86_64")) {
    // requires little-endian CPU supporting unaligned reads:
    let ptr: *const u64 = &results[index..index+1] as *const [u32] as *const u32 as *const u64;
    unsafe { read_unaligned(ptr) }
}
```
Is this acceptable in your opinion? Knowing where to draw the line between `unsafe` platform-specific optimisations and safe, portable code isn't always easy.
The same performance observations appear to apply to the `fill_bytes` code: HC128 is around 10% faster, while other generators aren't affected much. Of note is that HC128 is currently the algorithm behind `thread_rng`, but it will likely be replaced in the future.
> Is this acceptable in your opinion?

This looks good; I'll try to run it in Miri as well. Do you have this in a branch somewhere?

> Of note is that HC128 is currently the algorithm behind thread_rng, but will likely be replaced in the future.

That's in fact how I noticed, as you can see from my example above. I should probably try to run the entire rand test suite in Miri... will it exercise all these code paths irrespective of what `thread_rng` chooses?
I can confirm that RalfJung@8303309 makes my small tests pass in Miri.
Running the entire test suite should probably cover most of the `unsafe` code, but it would be worth testing all the sub-crates in the repo (e.g. `BlockRng64` is only used by `Isaac64Rng`, which is not a dependency of the main crate). Thanks again for stepping in to fix the `unsafe` pointer access; as you can tell, we could do with a bit of help here.
I'm afraid the `fill_bytes` problem is much harder to fix. This creates an unaligned reference at

Lines 234 to 237 in 9828cdf

and then passes that to `generate`. References must always be aligned, even if they are never used. So to fix this, `generate` would have to take a raw pointer instead. But that is a trait method, so the change isn't easy...
Here, `filled` should always be a multiple of four, but of course that's not enough, since the `dest` argument may not be 32-bit aligned.
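Since only the alignment of `dest` is in question, one option (a sketch, not rand's code) would be a runtime check via `align_offset`, taking the direct-write path only when the buffer happens to start 4-byte aligned:

```rust
fn main() {
    let dest = [0u8; 16];
    // align_offset returns how many bytes to skip before the pointer is
    // aligned for u32; 0 means `dest` already starts 4-byte aligned.
    let misalign = dest.as_ptr().align_offset(std::mem::align_of::<u32>());
    if misalign == 0 {
        // fast path: u32-wise writes into dest would be aligned here
    } else {
        // fallback: byte-wise copy
    }
    let _ = misalign;
}
```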
I'm not sure how important the direct copy is to performance (it's only encountered with large fill requests), so we could refactor this to generate into the `self.results` buffer or a temporary and then copy from there.
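One possible shape for that refactor, as a sketch under assumptions (`generate_into` and the buffer size are hypothetical stand-ins, not rand's API): generate into an aligned temporary, then copy bytes out, which is correct for any alignment of `dest` and needs no raw-pointer writes:

```rust
// Hypothetical stand-in for a block RNG's generate step.
fn generate_into(results: &mut [u32; 16]) {
    for (i, w) in results.iter_mut().enumerate() {
        *w = i as u32;
    }
}

// Fill `dest` via an aligned temporary buffer; no unaligned or
// raw-pointer writes are needed, whatever the alignment of `dest`.
fn fill_via_temp(dest: &mut [u8]) {
    let mut results = [0u32; 16];
    let mut filled = 0;
    while filled < dest.len() {
        generate_into(&mut results);
        let bytes: Vec<u8> = results.iter().flat_map(|w| w.to_le_bytes()).collect();
        let n = (dest.len() - filled).min(bytes.len());
        dest[filled..filled + n].copy_from_slice(&bytes[..n]);
        filled += n;
    }
}

fn main() {
    let mut dest = [0u8; 10];
    fill_via_temp(&mut dest);
    // first words 0, 1, 2 serialized little-endian, truncated to 10 bytes
    assert_eq!(dest, [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]);
}
```

The cost is one extra copy on the large-fill path, which may be acceptable given the benchmark numbers above.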
Refactoring the trait isn't completely out of the question (it doesn't have a massive number of users), but I don't wish to force block-RNG implementations to use ugly unsafe code, so I would prefer that `generate` not take a raw pointer.