rfjakob / gocryptfs

Encrypted overlay filesystem written in Go

Home Page: https://nuetzlich.net/gocryptfs/

Go crypto faster than OpenSSL on AES-NI systems

lxp opened this issue

On my system, Go crypto seems to be a lot faster than OpenSSL crypto.
I started investigating this with gocryptfs 0.9 and perf on Linux 4.4. Under heavy load (multiple rsyncs running), perf attributed 60% overhead to the Go runtime's native call checks (runtime.cgoCheckArg), which were caused by the OpenSSL calls.
I will provide proper benchmarks with gocryptfs 0.10-rc1 once my system is idle again.

$ cat /proc/cpuinfo 
[...]
model name  : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
[...]

That would definitely be interesting. You can run the built-in benchmark using

cd gocryptfs/internal/stupidgcm
go test -bench .

On my machine, I get this (StupidGCM = simple OpenSSL wrapper, GoGCM = built-in Go crypto):

Benchmark4kEncStupidGCM-2      50000         24774 ns/op     165.33 MB/s
Benchmark4kEncGoGCM-2          10000        120745 ns/op      33.92 MB/s

My CPU does not have AES-NI:

cat /proc/cpuinfo 
[...]
model name  : Intel(R) Pentium(R) CPU G630 @ 2.70GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
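For readers comparing the numbers in this thread, the MB/s column follows directly from the ns/op column. The sketch below redoes that arithmetic, assuming each operation processes one 4096-byte block (suggested by the "4k" in the benchmark names) and that 1 MB means 10^6 bytes, which is how Go's testing package reports throughput. The function name throughputMBps is made up for this illustration.

```go
package main

import "fmt"

// throughputMBps converts a per-operation time into MB/s (1 MB = 10^6 bytes)
// for a benchmark that processes blockBytes bytes per operation.
// The name is hypothetical; it is not part of gocryptfs.
func throughputMBps(blockBytes int, nsPerOp float64) float64 {
	// bytes per nanosecond * 1e9 = bytes per second; / 1e6 = MB per second.
	return float64(blockBytes) / nsPerOp * 1e9 / 1e6
}

func main() {
	// ns/op values taken from the benchmark output quoted above;
	// the 4096-byte block size is an assumption.
	fmt.Printf("StupidGCM: %.2f MB/s\n", throughputMBps(4096, 24774))
	fmt.Printf("GoGCM:     %.2f MB/s\n", throughputMBps(4096, 120745))
}
```

Because ns/op is printed as a rounded integer, recomputed values can differ from the reported MB/s in the last digits.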

My machine (i5-4690K) is still not fully idle, but I think the results are clear enough:

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          7123 ns/op     575.03 MB/s
Benchmark4kEncGoGCM-4         500000          2512 ns/op    1629.95 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.867s
$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6949 ns/op     589.37 MB/s
Benchmark4kEncGoGCM-4         500000          2480 ns/op    1651.41 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.803s
$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6985 ns/op     586.37 MB/s
Benchmark4kEncGoGCM-4         500000          2480 ns/op    1651.13 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 2.813s

Results from the old openssl_benchmark.bash from v0.9:

$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1493 ns/op    2743.30 MB/s
BenchmarkGoDec4K-4       1000000          1481 ns/op    2764.83 MB/s
BenchmarkOpensslEnc4K-4   200000          7624 ns/op     537.24 MB/s
BenchmarkOpensslDec4K-4   100000         20524 ns/op     199.56 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.878s
$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1497 ns/op    2734.83 MB/s
BenchmarkGoDec4K-4       1000000          1487 ns/op    2754.54 MB/s
BenchmarkOpensslEnc4K-4   200000          7648 ns/op     535.54 MB/s
BenchmarkOpensslDec4K-4   100000         20577 ns/op     199.05 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.901s
$ ./openssl_benchmark.bash 
+ go test -bench=.
Benchmarking AES-GCM-256 with 4kB block size
testing: warning: no tests to run
PASS
BenchmarkGoEnc4K-4       1000000          1500 ns/op    2729.13 MB/s
BenchmarkGoDec4K-4       1000000          1490 ns/op    2747.32 MB/s
BenchmarkOpensslEnc4K-4   200000          7690 ns/op     532.61 MB/s
BenchmarkOpensslDec4K-4   100000         20579 ns/op     199.03 MB/s
ok      github.com/rfjakob/gocryptfs/openssl_benchmark  6.941s

I am not sure what causes the difference in Go crypto performance (but I also didn't look into the code).
What I also find interesting in the old benchmark is that OpenSSL decryption is significantly slower than encryption.

The old benchmarks use a 12-byte IV, which is Go's default. Since v0.7, gocryptfs actually uses 16 bytes and the new benchmarks reflect that.
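To make the IV size difference concrete, here is a minimal sketch of how the nonce (IV) size is selected with Go's standard library. It is not the gocryptfs code; the helper name newGCM and the all-zero key are assumptions for illustration only. cipher.NewGCMWithNonceSize is available since Go 1.5.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// newGCM returns AES-GCM with the given nonce (IV) size in bytes.
// Hypothetical helper, not taken from gocryptfs.
func newGCM(key []byte, nonceSize int) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCMWithNonceSize(block, nonceSize)
}

func main() {
	key := make([]byte, 32) // all-zero AES-256 key, for illustration only
	plaintext := make([]byte, 4096)

	// 12 bytes is Go's default GCM nonce size; 16 bytes is what the
	// post above says gocryptfs uses since v0.7.
	for _, size := range []int{12, 16} {
		aead, err := newGCM(key, size)
		if err != nil {
			panic(err)
		}
		nonce := make([]byte, aead.NonceSize())
		if _, err := rand.Read(nonce); err != nil {
			panic(err)
		}
		ciphertext := aead.Seal(nil, nonce, plaintext, nil)
		fmt.Printf("nonce=%d bytes, ciphertext=%d bytes\n", aead.NonceSize(), len(ciphertext))
	}
}
```

A benchmark that wants to reflect the on-disk format would need to construct the AEAD with the same nonce size the filesystem uses.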

In any case, the performance difference between Go and OpenSSL is huge. I will add autodetection that switches to Go crypto if AES-NI is available.

Ah okay, that explains it.
For me, the current situation is no problem, as I just use -openssl=false during mounting.
Yeah, autodetection was exactly what I wanted to recommend :)
I think the Go crypto code already does it. I am just not sure if it is easily accessible from outside.

I am rather new to Go. Do you know if there is an easy way to compile the benchmark as a binary?
Then I could also test it on one of the first Intel processors supporting AES-NI (Xeon E5620).
I know it has worse AES-NI performance than newer processors, but it would be interesting to know if Go crypto is still faster.

Similar results here on an i5 core that has AES-NI instructions.

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-4     200000          8815 ns/op     464.65 MB/s
Benchmark4kEncGoGCM-4         300000          3796 ns/op    1078.98 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 3.147s

$ cat /proc/cpuinfo 
[...]
model name  : Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
[...]
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts

@lxp Run go test -c to get the stupidgcm.test binary. Benchmark is run using

./stupidgcm.test -test.bench .

Ugh. Looks like it is going to be more complicated than checking for the "aes" flag.

$ go test -bench .
PASS
Benchmark4kEncStupidGCM-2     200000         10611 ns/op     385.99 MB/s
Benchmark4kEncGoGCM-2          30000         44999 ns/op      91.02 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 4.429s

$ cat /proc/cpuinfo | grep -e "model name\|flags" | head -2
model name  : Intel Xeon E312xx (Sandy Bridge)
flags       : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm xsaveopt

$ go version
go version go1.5.1 linux/amd64

OK, here we go: Go seems to use the AES instructions starting with v1.6. This is on the same box as above.

$ ~/go/bin/go test -bench .
PASS
Benchmark4kEncStupidGCM-2     100000         16528 ns/op     247.81 MB/s
Benchmark4kEncGoGCM-2         300000          5014 ns/op     816.86 MB/s
ok      github.com/rfjakob/gocryptfs/internal/stupidgcm 3.603s

$ ~/go/bin/go version
go version go1.6.2 linux/amd64

Hi guys, if you are interested, I ran some benchmarks on my desktop machine with a fresh SSD, comparing plain, gocryptfs (openssl on/off), encfs, securefs, truecrypt & dm-crypt.
Keep in mind that Truecrypt & dm-crypt play in a different league since they are not file-based encryption tools.
https://gist.github.com/alphazo/09a2e523e22e7aa00d491ab67678dd80

@rfjakob Thank you, I didn't expect such a simple solution :)
I compiled a version with Go 1.6 and used the same binary on all machines.
I think the benchmarks draw a pretty clear picture.
AES-NI + Go 1.6+ -> Go Crypto
Otherwise -> OpenSSL
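The rule above could be sketched roughly as follows. This is not the code that ended up in gocryptfs; it assumes an x86 Linux system where /proc/cpuinfo has a "flags" line, and the names hasCPUFlag, goAtLeast and preferOpenSSL are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// hasCPUFlag reports whether any "flags" line in the given /proc/cpuinfo
// content lists flag as a whole word.
func hasCPUFlag(cpuinfo, flag string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		if !strings.HasPrefix(line, "flags") {
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		for _, f := range strings.Fields(parts[1]) {
			if f == flag {
				return true
			}
		}
	}
	return false
}

// goAtLeast parses a version string such as "go1.6.2" and reports whether
// it is at least major.minor. Unparseable versions count as "too old".
func goAtLeast(version string, major, minor int) bool {
	var maj, min int
	if _, err := fmt.Sscanf(version, "go%d.%d", &maj, &min); err != nil {
		return false
	}
	return maj > major || (maj == major && min >= minor)
}

// preferOpenSSL applies the rule from the post above:
// AES-NI and Go 1.6+ -> Go crypto, otherwise -> OpenSSL.
func preferOpenSSL(cpuinfo, goVersion string) bool {
	return !(hasCPUFlag(cpuinfo, "aes") && goAtLeast(goVersion, 1, 6))
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		// No cpuinfo available (e.g. non-Linux): this sketch falls back to OpenSSL.
		fmt.Println("openssl=true")
		return
	}
	fmt.Printf("openssl=%v\n", preferOpenSSL(string(data), runtime.Version()))
}
```

Checking the Go version at run time via runtime.Version() is one possible choice; the same decision could also be made at build time.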

$ go version
go version go1.6 linux/amd64

AES-NI

Skylake (Launch: Q3'15)

$ cat /proc/cpuinfo
model name  : Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000         10688 ns/op     383.22 MB/s
Benchmark4kEncGoGCM-4         300000          4073 ns/op    1005.57 MB/s

Haswell (Launch: Q2'14)

$ cat /proc/cpuinfo
model name  : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000          6710 ns/op     610.43 MB/s
Benchmark4kEncGoGCM-4         500000          2422 ns/op    1690.86 MB/s

Ivy Bridge (Launch: Q2'12)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     200000         14684 ns/op     278.94 MB/s
Benchmark4kEncGoGCM-4         300000          7792 ns/op     525.62 MB/s

Sandy Bridge (Launch: Q1'11)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-4     100000         19070 ns/op     214.78 MB/s
Benchmark4kEncGoGCM-4         200000         10981 ns/op     373.01 MB/s

Westmere (Launch: Q1'10)

$ cat /proc/cpuinfo 
model name  : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-16        100000             18297 ns/op         223.85 MB/s
Benchmark4kEncGoGCM-16            200000              9579 ns/op         427.58 MB/s

no AES-NI

Ivy Bridge (Launch: Q1'13)

$ cat /proc/cpuinfo 
model name  : Intel(R) Pentium(R) CPU G2130 @ 3.20GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-2     100000         22691 ns/op     180.51 MB/s
Benchmark4kEncGoGCM-2          20000         92810 ns/op      44.13 MB/s

Nehalem (Launch: Q3'09)

$ cat /proc/cpuinfo 
model name  : Intel(R) Xeon(R) CPU           X3460  @ 2.80GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-8      50000         35247 ns/op     116.21 MB/s
Benchmark4kEncGoGCM-8          20000         92230 ns/op      44.41 MB/s

Core (Launch: Q1'08)

$ cat /proc/cpuinfo 
model name  : Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
$ ./stupidgcm.test -test.bench .
PASS
Benchmark4kEncStupidGCM-2      30000         46697 ns/op      87.71 MB/s
Benchmark4kEncGoGCM-2          10000        194095 ns/op      21.10 MB/s

I may add two older AMD processors (without AES-NI) when I have time.

@rfjakob While most gocryptfs operations outperformed encfs (even in standard mode) in the quick benchmark I posted earlier, why is the rm operation a bit behind?

Hi @alphazo, I read your comparison with great interest, thank you! Yes, we are 15% behind EncFS for rm, hmm. To be honest, I'm not sure why. I'll have to profile this!

Autodetection has been added to master in 49b597f; the -openssl option now defaults to "auto". It can be overridden by passing true or false.

You can run "gocryptfs -debug -version" to see the result of the autodetection, I get

$ ./gocryptfs -debug -version
openssl=true
gocryptfs v0.10-rc2-7-g49b597f-dirty; on-disk format 2; go-fuse a01ba14

because my CPU does not support AES-NI.
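As an illustration of a three-way option like the "auto" / true / false setting described above, here is a minimal sketch. It is not the gocryptfs implementation; resolveOpenSSL is a hypothetical name, and the autodetected argument stands in for whatever CPU detection is used.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveOpenSSL turns the option value into a boolean.
// "auto" defers to the autodetection result; anything else must parse as
// a boolean (strconv.ParseBool accepts true/false, 1/0, t/f and similar).
// Hypothetical helper, not taken from gocryptfs.
func resolveOpenSSL(opt string, autodetected bool) (bool, error) {
	if opt == "auto" {
		return autodetected, nil
	}
	v, err := strconv.ParseBool(opt)
	if err != nil {
		return false, fmt.Errorf("invalid -openssl value %q: want auto, true or false", opt)
	}
	return v, nil
}

func main() {
	// Pretend autodetection found no AES-NI, so "auto" resolves to OpenSSL.
	for _, opt := range []string{"auto", "true", "false", "maybe"} {
		v, err := resolveOpenSSL(opt, true)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("-openssl=%s -> openssl=%v\n", opt, v)
	}
}
```

Keeping the explicit override means users can still force either backend when the autodetection guesses wrong.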

Great! Thank you for integrating it so fast 👍
I added a Skylake CPU to my benchmark post above.
It looks good. On 4 AES-NI CPUs I get the following (not sure when I will be able to test it on Skylake):

$ ./gocryptfs -debug -version
openssl=false
gocryptfs v0.10-rc1-16-g4ad9d4e; on-disk format 2; go-fuse ed84134

While on the 3 non-AES-NI CPUs I get:

$ ./gocryptfs -debug -version
openssl=true
gocryptfs v0.10-rc1-16-g4ad9d4e; on-disk format 2; go-fuse ed84134

I compiled again with Go 1.6 and all systems are running on amd64.

Great! Do you want to put the benchmarks into the wiki? Something like https://github.com/rfjakob/gocryptfs/wiki/CPU-Benchmarks ? I think it's valuable information and deserves some visibility.

Same thing for you, @alphazo ! Maybe https://github.com/rfjakob/gocryptfs/wiki/Performance-Comparison ?

Released as v0.10-rc3.