RC3: H2O performance anomaly on fourth iteration
steveblackburn opened this issue · comments
Performance results show that h2o systematically runs very slowly on its fourth iteration. This observation holds across heap sizes and across JVMs. It is most pronounced (about a 15x slowdown) at small heap sizes, but still noticeable at large heap sizes.
The problem is not evident when running on a single core.
Perhaps relatedly, the benchmark only appears to run to completion in 4 / 10 trials when using the Parallel GC.
Data from my aarch64 system shows that the third iteration is anomalous with "small h2o". It isn't 15x, but it is reproducible. I do not see the anomaly with "default h2o".
$ ../JAVA/jdk-17.0.8.1+1/bin/java -jar dacapo-23.9-RC3-chopin.jar -v -s small -n 7 h2o
Class name: org.dacapo.harness.H2O
Configurations:
short Open Source Fast Scalable Machine Learning Platform.
long H2O is an in-memory platform for distributed, scalable machine learning. The benchmark uses the 201908-citibike-tripdata dataset.
author null
license Apache License, Version 2.0
copyright Copyright (c) H2O.ai. All rights reserved
url https://github.com/h2oai/h2o-3
version h2o 3.42.0.2
sizes default large small
Using scaled threading model. 160 processors detected, 160 threads used to drive the workload, in a possible range of [1,1024]
Version: h2o 3.42.0.2 (use -p to print nominal benchmark stats)
===== DaCapo 23.9-RC3-chopin h2o starting warmup 1 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 1 in 6659 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 2 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 2 in 5958 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 3 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 3 in 8660 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 4 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 4 in 5063 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 5 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 5 in 5025 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 6 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 6 in 5074 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o PASSED in 4961 msec =====
https://gist.github.com/wenyuzhao/291db3f7a8764576f135c432d57147b9
Looks like an OpenJDK GC issue.
I'm going to close this, and declare it to be a JVM issue rather than a DaCapo issue.
Thanks. That's very interesting.
@steveblackburn Update to this issue:
This gist contains four GC logs:
Both G1's 4th iteration and Immix's 3rd iteration have longer execution times. The behaviors are also similar: the workload's live size suddenly increases a lot and forces both G1 and Immix to trigger many more GCs.
Interestingly, comparing the 1G and 350M heap GC logs, the max live size after GC is ~270M for the 350M heap and ~900M for the 1G heap. This probably means that h2o is inspecting the GC and heap size, and adjusting its live size accordingly.
Another weird thing: this only happens in one iteration, although the work for every iteration should be similar.
Looks like this is not only a JVM issue.
Very interesting. Thanks @wenyuzhao. Reopening the issue.
I have spent some time investigating this, with unsatisfying results.
There are four places in their codebase where they directly interact with the GC:
- They use the MemoryMXBean to monitor heap usage, and trigger a cleaner which might write fields to secondary storage or free them.
- Their allocator expects to trigger OOMs.
- It does the same here.
- They also expect to trigger OOMs here.
The first three of these relate to their own implementation of a memory manager which allocates backing data for the KV tables used by h2o.
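The MemoryMXBean pattern mentioned in the first bullet is a standard JMX idiom; here is a minimal sketch of polling heap usage and deciding whether a cleaner pass would run. The class name and the 80% pressure threshold are hypothetical, not h2o's actual values:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatcher {
    // Hypothetical threshold; h2o derives its own policy from heap size.
    static final double PRESSURE = 0.80;

    // True when used heap exceeds PRESSURE of max, i.e. when a cleaner
    // pass (spill cached values to secondary storage, or free them)
    // would be triggered. max may be -1 when undefined.
    static boolean underPressure(long used, long max) {
        return max > 0 && (double) used / max > PRESSURE;
    }

    public static void main(String[] args) {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        System.out.println(underPressure(heap.getUsed(), heap.getMax())
                ? "would run cleaner" : "heap ok");
    }
}
```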
In my investigation, I established empirically, by instrumenting their cleaner, that it never recovers freed or cleaned items for the default DaCapo workload (i.e. the cleaner, which is invoked hundreds or thousands of times, always yields cleaned == 0 and freed == 0).
I also established that fixing the DESIRED cache size (which by default is set dynamically based on heap size) led to no observable change in h2o's performance across a range of heap sizes.
In f8d2b32 I've done the following:
- Allow the DESIRED cache size to be set via a Java property, dacapo.h2o.target, which is set to 2MB by default.
- If dacapo.h2o.target is set to 0, h2o will revert to its prior behavior (reacting to GC heap resizing).
- Otherwise, the callback triggered by GCs is disabled. This means that the workload itself is no longer dependent on the choice of GC (and whether or not it implements the MemoryMXBean correctly).
- I also changed the time constant in the Cleaner to be 1 second, somewhat shorter than the hardcoded 5 seconds.
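A minimal sketch of how such a property might be read. Only the property name, the 2MB default, and the 0-means-prior-behavior convention come from the list above; the method names and the dynamic-fallback fraction are hypothetical:

```java
public class CacheTarget {
    // Hypothetical stand-in for the prior behavior: derive the cache
    // size dynamically from the heap size.
    static long dynamicCacheSize() {
        return Runtime.getRuntime().maxMemory() / 8; // illustrative fraction
    }

    // Read dacapo.h2o.target, defaulting to 2MB; 0 means "use the
    // prior dynamic behavior".
    static long desiredCacheBytes() {
        long target = Long.getLong("dacapo.h2o.target", 2L << 20);
        return target == 0 ? dynamicCacheSize() : target;
    }

    public static void main(String[] args) {
        System.setProperty("dacapo.h2o.target", "1048576");
        System.out.println(desiredCacheBytes()); // 1048576
    }
}
```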
Having made these changes, what I observe is that after a certain number of iterations (often 2), the workload starts triggering OOMs (often tens or hundreds). After a flurry of these, the workload stabilises and its performance improves. It seems that the core mechanism being used to manage memory is allocation throttling, which is achieved by blocking threads when OOMs occur. What is not clear is why there are no further OOMs after the initial major OOM storm (a large number, perhaps hundreds, which occur during the iteration that shows the massively slower allocation time). In all of the experiments I've run with 10 iterations of the workload, this happens exactly once, usually in the second or third iteration, and no subsequent iteration experiences OOMs.
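The throttling mechanism described above can be sketched as follows. This is an illustration of the block-on-OOM-and-retry pattern, not h2o's actual memory manager; all names here are hypothetical:

```java
public class ThrottledAlloc {
    static final Object LOCK = new Object();

    // Try to allocate `bytes`; on OOM, block the allocating thread
    // briefly so other threads (or a cleaner) can release memory,
    // then retry. Returns null after maxRetries failures.
    static byte[] allocate(int bytes, int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            try {
                return new byte[bytes];
            } catch (OutOfMemoryError oom) {
                synchronized (LOCK) {
                    try {
                        LOCK.wait(10); // throttle: park this thread briefly
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            }
        }
        return null;
    }
}
```

Throttling by blocking in the OOM handler explains why back-to-back GCs can occur: every blocked thread retries, each retry can trigger another collection, and no progress is made until something actually frees memory.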
At this stage I do not plan to investigate this issue further. Having isolated the elements that explicitly make assumptions about the garbage collector, I will ascribe the remaining pathologies to the h2o workload as an artefact of a large real-world application.
c4bbe26 adds steps mentioned in the h2o troubleshooting guide, namely explicitly calling h2o's GarbageCollect and RemoveAll APIs between iterations. This does not appear to improve the situation.
I think I found the issue.
First, run with -Xlog:gc=info. Normally, the log looks like this:
[14.753s][info][gc] GC(79) Pause Young (Normal) (G1 Evacuation Pause) 416M->68M(600M) 16.502ms
[14.833s][info][gc] GC(80) Pause Young (Normal) (G1 Evacuation Pause) 416M->72M(600M) 16.799ms
[14.922s][info][gc] GC(81) Pause Young (Normal) (G1 Evacuation Pause) 417M->73M(600M) 16.435ms
[15.006s][info][gc] GC(82) Pause Young (Normal) (G1 Evacuation Pause) 418M->76M(600M) 16.490ms
[15.088s][info][gc] GC(83) Pause Young (Normal) (G1 Evacuation Pause) 419M->86M(600M) 17.295ms
During the slow iteration, there are a lot of large object allocations:
[44.235s][info][gc] GC(637) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 82.926ms
[44.324s][info][gc] GC(638) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 84.817ms
[44.326s][info][gc] GC(639) Pause Young (Normal) (G1 Humongous Allocation) 479M->479M(600M) 0.532ms
[44.410s][info][gc] GC(640) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 82.307ms
[44.504s][info][gc] GC(641) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 89.594ms
And the live size doesn't decrease after GC.
Running jcmd $(pidof java) GC.heap_dump at these points, and summarizing the heap dumps with jol:
Good
INSTANCES SUM SIZE AVG SIZE CLASS
------------------------------------------------------------------------------------------------
116353 35115168 301 byte[]
130 4261920 32784 jsr166y.ForkJoinTask[]
77192 1852608 24 java.lang.String
3368 1510768 448 int[]
15862 1233296 77 java.lang.Object[]
22065 706080 32 java.util.HashMap$Node
34405 550480 16 javassist.bytecode.Utf8Info
Bad
INSTANCES SUM SIZE AVG SIZE CLASS
------------------------------------------------------------------------------------------------
4178 401557816 96112 int[]
105729 14448312 136 byte[]
99 3245616 32784 jsr166y.ForkJoinTask[]
2066 2237248 1082 double[]
73422 1762128 24 java.lang.String
15848 1095600 69 java.lang.Object[]
21767 696544 32 java.util.HashMap$Node
3931 597512 152 hex.tree.DHistogram
A lot more large int arrays.
Write a bpftrace script:
uprobe:/path/to/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm {
printf("LOS %s %d %d\n", comm, pid, arg1);
}
And it shows different application threads allocating objects of word size 5000002.
Dumping the stack for such allocations:
uprobe:/path/to/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm {
printf("LOS %s %d %d\n", comm, pid, arg1);
if (arg1 == 5000002) {
system("jstack %d", pid);
}
}
And the offending thread looks like this.
"Thread-55" #163 daemon prio=10 os_prio=0 cpu=66.16ms elapsed=0.94s tid=0x00007f42600b1000 nid=0x1d2dc1 runnable [0x00007f41360e1000]
java.lang.Thread.State: RUNNABLE
JavaThread state: _thread_blocked
Thread: 0x00007f42600b1000 [0x1d2dc1] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0
JavaThread state: _thread_blocked
at water.init.MemoryBandwidth.run_benchmark(MemoryBandwidth.java:56)
at water.init.MemoryBandwidth$1.run(MemoryBandwidth.java:27)
This seems to be caused by the built-in benchmarks of h2o. It seems like we should set heartbeat.benchmark.enabled to false.
Since the start of MemoryBandwidth depends on some hash code, where MemoryBandwidth falls in the invocation sequence can be random, or it might not appear at all when not running with a lot of iterations.
For heaps larger than 400M, MemoryBandwidth will allocate a total of 40M * 32 = 1280M on a 32-thread machine. For heaps < 400M, MemoryBandwidth will allocate a total of 3.2x the heap size.
This explains our previous observation that the live set seems to increase with heap size (until the heap size is made >> 1280M).
The first few allocations will succeed, but once the heap is full, subsequent allocations will fail. The OOM is caught by h2o's memory manager, so the allocation is retried, and G1 runs back-to-back GCs without yielding any free space, until some thread finishes its MemoryBandwidth benchmark.
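A quick sanity check on these numbers. Assuming each humongous allocation is a 10,000,000-element int[] (2 header words + 5,000,000 data words on a 64-bit JVM gives the 5,000,002-word size seen in the bpftrace output), and that the per-thread allocation below the 400M threshold is 3.2x / 32 = 10% of the heap (an inference from the totals above, not a value taken from the h2o source):

```java
public class BandwidthAlloc {
    // Size in words of an int[] with n elements on a 64-bit JVM:
    // 2-word object header plus 4 bytes per int (8 bytes per word).
    static long intArrayWords(long n) {
        return 2 + (n * 4) / 8;
    }

    // Total MB allocated across `threads` threads for a given heap size,
    // per the totals stated above.
    static long totalAllocMB(long heapMB, int threads) {
        long perThreadMB = heapMB >= 400 ? 40 : heapMB / 10; // 3.2x / 32
        return perThreadMB * threads;
    }

    public static void main(String[] args) {
        System.out.println(intArrayWords(10_000_000)); // 5000002, as in bpftrace
        System.out.println(totalAllocMB(800, 32));     // 1280
        System.out.println(totalAllocMB(320, 32));     // 1024 = 3.2 * 320
    }
}
```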
Confirmed and fixed. The one-line change in 6a525d9 turns off the heartbeat benchmark, resulting in the problem disappearing entirely.
Thank you @caizixian!
Shouldn't the property have a sys.ai.h2o prefix?
H2O.OptArgs.SYSTEM_PROP_PREFIX was in the code.
Ah yes. Unfortunately I seem to have tested with the property set at the command line :-(
Fixed in 53d17f9. See the following quick experiment:
400M 326f25cd
===== DaCapo evaluation-git-326f25cd h2o completed warmup 1 in 7308 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 2 in 22264 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 3 in 4897 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 4 in 4768 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 5 in 4765 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 6 in 4570 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 7 in 4787 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 8 in 4615 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 9 in 4717 msec =====
===== DaCapo evaluation-git-326f25cd h2o PASSED in 4723 msec =====
400M 21c4c2eb
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 1 in 6981 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 2 in 27500 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 3 in 4677 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 4 in 4751 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 5 in 5527 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 6 in 4658 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 7 in 4598 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 8 in 4902 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 9 in 4702 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o PASSED in 4720 msec =====
400M 53d17f9b
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 1 in 7319 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 2 in 5516 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 3 in 5428 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 4 in 4804 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 5 in 4665 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 6 in 4764 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 7 in 4807 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 8 in 4805 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 9 in 4729 msec =====
===== DaCapo evaluation-git-53d17f9b h2o PASSED in 4621 msec =====
800M 326f25cd
===== DaCapo evaluation-git-326f25cd h2o completed warmup 1 in 6844 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 2 in 12082 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 3 in 4463 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 4 in 4451 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 5 in 4552 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 6 in 4448 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 7 in 4456 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 8 in 4378 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 9 in 4490 msec =====
===== DaCapo evaluation-git-326f25cd h2o PASSED in 4405 msec =====
800M 21c4c2eb
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 1 in 6774 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 2 in 14428 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 3 in 4549 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 4 in 4554 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 5 in 4662 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 6 in 4560 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 7 in 4570 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 8 in 4478 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 9 in 4387 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o PASSED in 4508 msec =====
800M 53d17f9b
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 1 in 7088 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 2 in 5060 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 3 in 4661 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 4 in 4464 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 5 in 4546 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 6 in 4567 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 7 in 4664 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 8 in 4588 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 9 in 4613 msec =====
===== DaCapo evaluation-git-53d17f9b h2o PASSED in 4516 msec =====