rfjakob / gocryptfs

Encrypted overlay filesystem written in Go

Home Page: https://nuetzlich.net/gocryptfs/

Performance issues - multi core?

jkaberg opened this issue · comments

Just tried the 1.3 release and I'm seeing lower transfer rates (roughly 50-60 MB/s) on my HDD/ZFS pool - speeds are usually around 110 MB/s.

CPU supports AES-NI (24 cores)

grep 'model name' /proc/cpuinfo
model name      : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
<...>
grep aes /proc/cpuinfo | wc -l
24

gocryptfs speedtest

./gocryptfs -speed
AES-GCM-256-OpenSSL      226.92 MB/s
AES-GCM-256-Go           376.53 MB/s    (selected in auto mode)
AES-SIV-512-Go            87.52 MB/s

With the filesystem mounted and a larger transfer (10GB) running, I notice that one core is at full load, but no additional cores get used.

Is gocryptfs (or, more specifically, the encryption process) limited to one core? If so, consider this a feature request for multi-core encryption 😄

If not, any ideas what the bottleneck might be?

@rfjakob I have only tested 1.3 so far; the transfer was done with rsync -avP --progress source target. So it seems my single-core performance is too slow (at least not as fast as the HDDs).

Is it possible to do the encryption in parallel to utilize more cores?

@rfjakob

  1. yeah, 100%
  2. the underlying storage is a ZFS Raidz2 pool with 11 x 4TB SATA3 HGST drives.

Normal transfers (e.g. from the ZFS pool to the same ZFS pool) hit a steady 110-120MB/s with the same rsync command as above, just not to a gocryptfs mount point on the very same ZFS pool.

This sounds like poor random read/write performance due to raidz2 parity overhead on top of the encryption overhead. Quite possibly the non-fixed stripe sizes of raidz in combination with FUSE are also a factor.

Anything abnormal showing up on iotop?

@jkaberg A difference between plain rsync and rsync+gocryptfs is that gocryptfs writes the data in 128KB blocks, while rsync probably uses bigger blocks. This is a FUSE limitation - the kernel always splits the data into 128KB blocks.

What throughput do you get when you write to the ZFS pool in 128KB blocks? Like this:

dd if=/dev/zero of=YOURZFSMOUNT/zero bs=128k

Then, to find out why we are running at 100% CPU: Can you post a cpu profile of gocryptfs? Mount with this option:

gocryptfs -cpuprofile /tmp/cpu.prof

then run the rsync and unmount. Thanks, Jakob
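
PS: under the hood, a -cpuprofile option like this is usually just Go's standard runtime/pprof machinery. A minimal sketch of that pattern (gocryptfs's actual wiring may differ in the details):

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Hypothetical flag mirroring the -cpuprofile option above.
    cpuprofile := flag.String("cpuprofile", "", "write a CPU profile to this file")
    flag.Parse()

    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        // The profile is only flushed by StopCPUProfile, which is why the
        // filesystem must be unmounted gracefully for the file to be non-empty.
        defer pprof.StopCPUProfile()
    }

    // ... mount the filesystem and serve requests here ...
}

The resulting file can then be inspected with go tool pprof.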

This is what I meant by stripe sizes in combination with FUSE. The 128KB block size is probably the bottleneck. In this case, compiling a custom kernel with FUSE_MAX_PAGES_PER_REQ set higher than 32 may help alleviate the issue.

Yes, increasing FUSE_MAX_PAGES_PER_REQ should increase the throughput. However, this is not something I can ask of users.

So I think behaving like dd bs=128k is the best we can do. But being pegged at 100% CPU is probably keeping us from getting there. Let's see what the CPU profile says.

@rfjakob Here's the output (/media/xfiles is the ZFS mountpoint)

root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 0.859343 s, 1.5 GB/s
root@gunder:/media/xfiles# ./gocryptfs encrypted/ unencrypted/
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 9.93545 s, 132 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/
root@gunder:/media/xfiles# ./gocryptfs -cpuprofile /tmp/cpu.prof encrypted/ unencrypted/
Writing CPU profile to /tmp/cpu.prof
Note: You must unmount gracefully, otherwise the profile file(s) will stay empty!
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 10.0026 s, 131 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/

The cpu profile can be found here

The strange thing is that rsync (with the flags -avP) to the same unencrypted mount somehow tops out at around 60 MB/s.

The CPU profile (rendered as PDF: pprof001.svg.pdf) shows that we spend our time on:

36.8%    gcmAesEnc
14.2%    syscall.Pwrite
 6.9%    nonceGenerator.Get

I have already sped up nonceGenerator.Get quite a bit in 80516ed. We cannot do anything about the pwrite syscall. That leaves gcmAesEnc. My benchmarks suggest that we can get a big improvement by parallelizing the encryption: results.txt

On a 4-core, 8-thread machine (Xeon E31245) we get a superlinear (!!) improvement by switching from one to two threads:

Benchmark1_gogcm-8            	    5000	    282694 ns/op	 463.65 MB/s
Benchmark2_gogcm-8            	   20000	     99704 ns/op	1314.60 MB/s
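
The idea is simply that each 4 KiB file block is encrypted independently with its own nonce, so two goroutines can seal different blocks at the same time without any coordination. A rough sketch of that scheme (illustrative only; key handling, block size and on-disk layout here are assumptions, not gocryptfs's actual format):

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "log"
    "sync"
)

// encryptBlocks encrypts each plaintext block independently with
// AES-256-GCM and splits the work between two goroutines.
func encryptBlocks(key []byte, blocks [][]byte) ([][]byte, error) {
    out := make([][]byte, len(blocks))
    errs := make([]error, 2)
    var wg sync.WaitGroup

    worker := func(id, start, end int) {
        defer wg.Done()
        // Each goroutine gets its own cipher/AEAD instance so no state is shared.
        c, err := aes.NewCipher(key)
        if err != nil {
            errs[id] = err
            return
        }
        aead, err := cipher.NewGCM(c)
        if err != nil {
            errs[id] = err
            return
        }
        nonce := make([]byte, aead.NonceSize())
        for i := start; i < end; i++ {
            if _, err := rand.Read(nonce); err != nil {
                errs[id] = err
                return
            }
            // Store the per-block nonce in front of the ciphertext+tag.
            out[i] = aead.Seal(append([]byte(nil), nonce...), nonce, blocks[i], nil)
        }
    }

    mid := len(blocks) / 2
    wg.Add(2)
    go worker(0, 0, mid)
    go worker(1, mid, len(blocks))
    wg.Wait()

    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return out, nil
}

func main() {
    key := make([]byte, 32) // 32 bytes selects AES-256; use a real key in practice
    blocks := make([][]byte, 32)
    for i := range blocks {
        blocks[i] = make([]byte, 4096) // 4 KiB plaintext blocks
    }
    if _, err := encryptBlocks(key, blocks); err != nil {
        log.Fatal(err)
    }
    log.Println("encrypted", len(blocks), "blocks")
}

Because every block carries its own nonce, the workers never share state and the blocks can be sealed in any order on any core.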

Impressive numbers and work, @rfjakob. Would you mind publishing a build for me to test (Linux amd64)?

Also, good news from libfuse: https://github.com/libfuse/libfuse/releases/tag/fuse-3.0.2

"Internal: calculate request buffer size from page size and kernel page limit instead of using hardcoded 128 kB limit." (libfuse/libfuse@4f8f034)

This should help speed things up a bit 😄

The numbers I posted are from a synthetic benchmark ( https://github.com/rfjakob/gocryptfs-microbenchmarks ); I'm working on getting it into gocryptfs. I'll probably not get the same improvement in gocryptfs due to the FUSE overhead. Will keep you updated here!

The page size thing, unfortunately, only applies to architectures other than x86. I believe arm64 and powerpc have a bigger page size, so they would get much bigger blocks.

I have added two-way encryption parallelism. If you can test, here is the latest build:
gocryptfs_v1.3-70-gafc3a82_linux-static_amd64.tar.gz

@rfjakob Indeed, I'm seeing on average a 20MB/s increase (with rsync). Very nice! 😄

I did a CPU profile for you as well: https://cloud.eth0.im/s/jEwCnsLJElFz8E0

While the rsync job was running I noticed my CPU usage does not go above 130%. From the commit messages I reckon you limited it to two threads; do you think bumping up that value would make a difference?

Great, thanks! Rendered cpu profile: pprof002.svg.pdf

I saw about a 20% increase in my testing, and to be honest, I was a bit underwhelmed. It turns out that the encryption threads often get scheduled to the same core. This gets worse with more threads, which is why I have limited it to two-way parallelism for now.

@jkaberg If you want to try again, 3c6fe98 should give it another boost. This gets rid of most of the garbage collection overhead by re-using temporary buffers.
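
The usual Go pattern for that kind of buffer re-use is a sync.Pool of scratch buffers; a minimal sketch of the idea (the actual commit may structure things differently):

package main

import (
    "log"
    "sync"
)

// bufPool hands out reusable scratch buffers so that the write path does
// not allocate (and later garbage-collect) a fresh slice for every block.
// The capacity is illustrative: one 4 KiB plaintext block plus room for a
// nonce and a GCM tag.
var bufPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 4096+16+16)
    },
}

func main() {
    // Borrow a buffer, use it as scratch space, then hand it back.
    buf := bufPool.Get().([]byte)
    buf = append(buf[:0], "pretend this is an encrypted block"...)
    log.Println("used a pooled buffer with capacity", cap(buf))
    bufPool.Put(buf[:0])
}

The only catch is that nothing may hold on to a buffer after it has been put back into the pool.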

I think this can be closed - check out the performance history in performance.txt; the last commits gave us quite a boost.

@rfjakob Thanks. I'll have a go when I'm back from vacation 😄