rfjakob / gocryptfs

Encrypted overlay filesystem written in Go

Home Page: https://nuetzlich.net/gocryptfs/

Performance issues - multi core?

jkaberg opened this issue · comments

Just tried the 1.3 release and I'm seeing lower transfer rates (roughly 50-60 MB/s) on my HDD/ZFS pool - speeds are usually around 110 MB/s.

CPU supports AES-NI (24 cores)

grep 'model name' /proc/cpuinfo
model name      : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
<...>
grep aes /proc/cpuinfo | wc -l
24

gocryptfs speedtest

./gocryptfs -speed
AES-GCM-256-OpenSSL      226.92 MB/s
AES-GCM-256-Go           376.53 MB/s    (selected in auto mode)
AES-SIV-512-Go            87.52 MB/s

With the filesystem mounted and a larger transfer (10GB) running, I notice that one core is at full load, but no additional cores get used.

Is gocryptfs (or, more specifically, the encryption process) limited to one core? If so, consider this a feature request for multi-core encryption 😄

If not, any ideas what the bottleneck might be?

@rfjakob I have only tested 1.3 so far; the transfer was done with rsync -avP --progress source target. So it seems my single-core performance is too slow (at least not as fast as the HDDs).

Is it possible to do the encryption in parallel to utilize more cores?

@rfjakob

  1. yeah, 100%
  2. the underlying storage is a ZFS Raidz2 pool with 11 x 4TB SATA3 HGST drives.

Normal transfers (e.g. from the ZFS pool to the same ZFS pool) hit a steady 110-120MB/s with the same rsync command as above, just not to a gocryptfs mount point on the very same ZFS pool.

This sounds like poor random read/write performance due to raidz2 parity overhead on top of the encryption overhead. Quite possibly the non-fixed stripe sizes of raidz in combination with FUSE are also a factor.

Anything abnormal showing up on iotop?

@jkaberg A difference between plain rsync and rsync+gocryptfs is that gocryptfs writes the data in 128KB blocks, while rsync probably uses bigger blocks. This is a FUSE limitation - the kernel always splits the data into 128KB blocks.

What throughput do you get when you write to the ZFS pool in 128KB blocks? Like this:

dd if=/dev/zero of=YOURZFSMOUNT/zero bs=128k

Then, to find out why we are running at 100% CPU: Can you post a cpu profile of gocryptfs? Mount with this option:

gocryptfs -cpuprofile /tmp/cpu.prof

then run the rsync and unmount. Thanks, Jakob
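
PS: under the hood, a -cpuprofile option like this is usually just Go's standard runtime/pprof machinery. A minimal sketch of that pattern (gocryptfs's actual wiring may differ in the details):

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Hypothetical flag mirroring the -cpuprofile option above.
    cpuprofile := flag.String("cpuprofile", "", "write a CPU profile to this file")
    flag.Parse()

    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        // The profile is only flushed by StopCPUProfile, which is why the
        // filesystem must be unmounted gracefully for the file to be non-empty.
        defer pprof.StopCPUProfile()
    }

    // ... mount the filesystem and serve requests here ...
}

The resulting file can then be inspected with go tool pprof.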

This is what I meant by stripe sizes in combination with FUSE. The 128KB block size is probably the bottleneck. In this case, compiling a custom kernel with FUSE_MAX_PAGES_PER_REQ set higher than 32 may help alleviate the issue.

Yes, increasing FUSE_MAX_PAGES_PER_REQ should increase the throughput. However, this is not something I can ask of users.

So I think behaving like dd bs=128k is the best we can do. But being pegged at 100% CPU is probably keeping us from getting there. Let's see what the CPU profile says.

@rfjakob Here's the output (/media/xfiles is the ZFS mountpoint)

root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 0.859343 s, 1.5 GB/s
root@gunder:/media/xfiles# ./gocryptfs encrypted/ unencrypted/
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 9.93545 s, 132 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/
root@gunder:/media/xfiles# ./gocryptfs -cpuprofile /tmp/cpu.prof encrypted/ unencrypted/
Writing CPU profile to /tmp/cpu.prof
Note: You must unmount gracefully, otherwise the profile file(s) will stay empty!
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 10.0026 s, 131 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/

The cpu profile can be found here

The strange thing is that rsync (with the flags -avP) to the same unencrypted mount somehow tops out at around 60 MB/s.

The CPU profile (rendered as PDF: pprof001.svg.pdf) shows that we spend our time on:

36.8%    gcmAesEnc
14.2%    syscall.Pwrite
 6.9%    nonceGenerator.Get

I have already sped up nonceGenerator.Get quite a bit in 80516ed. We cannot do anything about the pwrite syscall. That leaves gcmAesEnc. My benchmarks suggest that we can get a big improvement by parallelizing the encryption: results.txt

On a 4-core, 8-thread machine (Xeon E31245) we get a superlinear (!!) improvement by switching from one to two threads:

Benchmark1_gogcm-8            	    5000	    282694 ns/op	 463.65 MB/s
Benchmark2_gogcm-8            	   20000	     99704 ns/op	1314.60 MB/s
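
The idea is simply that each 4 KiB file block is encrypted independently with its own nonce, so two goroutines can seal different blocks at the same time without any coordination. A rough sketch of that scheme (illustrative only; key handling, block size and on-disk layout here are assumptions, not gocryptfs's actual format):

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "log"
    "sync"
)

// encryptBlocks encrypts each plaintext block independently with
// AES-256-GCM and splits the work between two goroutines.
func encryptBlocks(key []byte, blocks [][]byte) ([][]byte, error) {
    out := make([][]byte, len(blocks))
    errs := make([]error, 2)
    var wg sync.WaitGroup

    worker := func(id, start, end int) {
        defer wg.Done()
        // Each goroutine gets its own cipher/AEAD instance so no state is shared.
        c, err := aes.NewCipher(key)
        if err != nil {
            errs[id] = err
            return
        }
        aead, err := cipher.NewGCM(c)
        if err != nil {
            errs[id] = err
            return
        }
        nonce := make([]byte, aead.NonceSize())
        for i := start; i < end; i++ {
            if _, err := rand.Read(nonce); err != nil {
                errs[id] = err
                return
            }
            // Store the per-block nonce in front of the ciphertext+tag.
            out[i] = aead.Seal(append([]byte(nil), nonce...), nonce, blocks[i], nil)
        }
    }

    mid := len(blocks) / 2
    wg.Add(2)
    go worker(0, 0, mid)
    go worker(1, mid, len(blocks))
    wg.Wait()

    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return out, nil
}

func main() {
    key := make([]byte, 32) // 32 bytes selects AES-256; use a real key in practice
    blocks := make([][]byte, 32)
    for i := range blocks {
        blocks[i] = make([]byte, 4096) // 4 KiB plaintext blocks
    }
    if _, err := encryptBlocks(key, blocks); err != nil {
        log.Fatal(err)
    }
    log.Println("encrypted", len(blocks), "blocks")
}

Because every block carries its own nonce, the workers never share state and the blocks can be sealed in any order on any core.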

Impressive numbers and work, @rfjakob. Would you mind publishing a build for me to test (Linux amd64)?

Also, good news from libfuse: https://github.com/libfuse/libfuse/releases/tag/fuse-3.0.2

"Internal: calculate request buffer size from page size and kernel page limit instead of using hardcoded 128 kB limit." (libfuse/libfuse@4f8f034)

This should help speed things up a bit 😄

The numbers I posted are from a synthetic benchmark ( https://github.com/rfjakob/gocryptfs-microbenchmarks ); I'm working on getting it into gocryptfs. I'll probably not get the same improvement in gocryptfs due to the FUSE overhead. Will keep you updated here!

The page size thing, unfortunately, only applies to architectures other than x86. I believe arm64 and powerpc have a bigger page size, so they would get much bigger blocks.

I have added two-way encryption parallelism. If you can test, here is the latest build:
gocryptfs_v1.3-70-gafc3a82_linux-static_amd64.tar.gz

@rfjakob Indeed, I'm seeing on average a 20MB/s increase (with rsync). Very nice! 😄

I did a CPU profile for you as well: https://cloud.eth0.im/s/jEwCnsLJElFz8E0

While the rsync job was running I noticed my CPU usage does not go above 130%. From the commit messages I reckon you limited it to two threads; do you think bumping up that value would make a difference?

Great, thanks! Rendered cpu profile: pprof002.svg.pdf

I saw about a 20% increase in my testing, and to be honest, I was a bit underwhelmed. It turns out that the encryption threads often get scheduled to the same core. This gets worse with more threads, which is why I have limited it to two-way parallelism for now.

@jkaberg If you want to try again, 3c6fe98 should give it another boost. This gets rid of most of the garbage collection overhead by re-using temporary buffers.
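
The usual Go pattern for that kind of buffer re-use is a sync.Pool of scratch buffers; a minimal sketch of the idea (the actual commit may structure things differently):

package main

import (
    "log"
    "sync"
)

// bufPool hands out reusable scratch buffers so that the write path does
// not allocate (and later garbage-collect) a fresh slice for every block.
// The capacity is illustrative: one 4 KiB plaintext block plus room for a
// nonce and a GCM tag.
var bufPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 4096+16+16)
    },
}

func main() {
    // Borrow a buffer, use it as scratch space, then hand it back.
    buf := bufPool.Get().([]byte)
    buf = append(buf[:0], "pretend this is an encrypted block"...)
    log.Println("used a pooled buffer with capacity", cap(buf))
    bufPool.Put(buf[:0])
}

The only catch is that nothing may hold on to a buffer after it has been put back into the pool.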

I think this can be closed - check out the performance history in performance.txt; the last commits gave us quite a boost.

@rfjakob Thanks. I'll have a go when I'm back from vacation 😄