dacapobench / dacapobench

The DaCapo benchmark suite

Home Page: https://www.dacapobench.org/

RC3: H2O performance anomaly on fourth iteration

steveblackburn opened this issue · comments

Performance results show that h2o systematically runs very slowly on its fourth iteration. This observation holds across heap sizes and across JVMs. It is most pronounced (about a 15x slowdown) at small heap sizes, but still noticeable at large heap sizes.

The problem is not evident when running on a single core.

Perhaps relatedly, the benchmark only appears to run to completion in 4 / 10 trials when using the Parallel GC.

Data from my aarch64 system shows that the third iteration is anomalous with "small h2o". It isn't 15x, but it is reproducible. I do not see the anomaly with "default h2o".

$ ../JAVA/jdk-17.0.8.1+1/bin/java -jar dacapo-23.9-RC3-chopin.jar -v -s small -n 7 h2o
Class name: org.dacapo.harness.H2O
Configurations:
short     Open Source Fast Scalable Machine Learning Platform.
long      H2O is an in-memory platform for distributed, scalable machine learning.  The benchmark uses the 201908-citibike-tripdata dataset.
author    null
license   Apache License, Version 2.0
copyright Copyright (c) H2O.ai. All rights reserved
url       https://github.com/h2oai/h2o-3
version   h2o 3.42.0.2
sizes     default large small
Using scaled threading model. 160 processors detected, 160 threads used to drive the workload, in a possible range of [1,1024]
Version: h2o 3.42.0.2 (use -p to print nominal benchmark stats)
===== DaCapo 23.9-RC3-chopin h2o starting warmup 1 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 1 in 6659 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 2 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 2 in 5958 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 3 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 3 in 8660 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 4 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 4 in 5063 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 5 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 5 in 5025 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting warmup 6 =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o completed warmup 6 in 5074 msec =====
===== DaCapo 23.9-RC3-chopin h2o starting =====
Importing file: citibiketripdata201908s.csv......
Parsing file: citibiketripdata201908s.csv......
Building model......
Model built successfully
Frames deleted
H2O finished
===== DaCapo 23.9-RC3-chopin h2o PASSED in 4961 msec =====

I'm going to close this, and declare it to be a JVM issue rather than a DaCapo issue.

Thanks. That's very interesting.

@steveblackburn Update to this issue:

This gist contains four GC logs:

Both G1's 4th iteration and Immix's 3rd iteration have longer execution times. The behaviors are also similar: the workload's live size suddenly increases a lot and forces both G1 and Immix to trigger many more GCs.

Interestingly, comparing the 1G and 350M heap GC logs, the max live size after GC is ~270M for the 350M heap and ~900M for the 1G heap. This probably means that h2o is inspecting the GC and heap size and adjusting its live size accordingly.

Another weird thing: this only happens in one iteration, although the work for every iteration should be similar.

Looks like this is not only a JVM issue.

Very interesting. Thanks @wenyuzhao. Reopening the issue.

I have spent some time investigating this, with unsatisfying results.

There are four places in their codebase where they directly interact with the GC:

  1. They use the MemoryMXBean to monitor heap usage and trigger a cleaner, which may write fields to secondary storage or free them (sketched below).
  2. Their allocator expects to trigger OOMs.
  3. It does the same here.
  4. They also expect to trigger OOMs here.

The first three of these relate to their own implementation of a memory manager which allocates backing data for the KV tables used by h2o.
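
For reference, here is a minimal sketch of the MemoryMXBean monitoring pattern described in point 1; the class and hook names are hypothetical and this is not h2o's actual code:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of MemoryMXBean-based heap monitoring driving a cleaner.
class HeapWatcher implements Runnable {
    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private final long thresholdBytes;

    HeapWatcher(long thresholdBytes) { this.thresholdBytes = thresholdBytes; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            if (heap.getUsed() > thresholdBytes) {
                // In h2o this would wake the cleaner, which may spill cached
                // values to secondary storage or free them outright.
                wakeCleaner();
            }
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }

    private void wakeCleaner() { /* hypothetical hook into the cleaner */ }
}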

In my investigation, I established empirically, by instrumenting their cleaner here, that it never recovers freed or cleaned items for the default DaCapo workload (i.e. the cleaner, which is invoked hundreds or thousands of times, always yields cleaned == 0 and freed == 0).

I also established that fixing the DESIRED cache size (which by default is set dynamically based on heap size) led to no observable change in h2o's performance across a range of heap sizes.

In f8d2b32 I've done the following:

  • Allow the DESIRED cache size to be set via a Java property, dacapo.h2o.target, which defaults to 2MB (see the sketch after this list).
  • If dacapo.h2o.target is set to 0, h2o will resort to its prior behavior (reacting to GC heap resizing).
  • Otherwise, the callback triggered by GCs is disabled. This means that the workload itself is no longer dependent on the choice of GC (and whether or not it implements the MemoryMXBean correctly).
  • I also changed the time constant in the Cleaner to be 1 second, somewhat shorter than the hardcoded 5 seconds.
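
A rough sketch of how dacapo.h2o.target might be consumed (hypothetical code; the actual change is in f8d2b32):

// Hypothetical sketch of reading dacapo.h2o.target; see f8d2b32 for the real change.
class CacheTarget {
    static final long DEFAULT_BYTES = 2L * 1024 * 1024; // 2MB default

    // Returns the fixed DESIRED cache size; 0 means "revert to h2o's prior
    // behaviour of reacting to GC heap resizing".
    static long desiredCacheBytes() {
        return Long.getLong("dacapo.h2o.target", DEFAULT_BYTES);
    }
}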

Having made these changes, what I observe is that after a certain number of iterations (often two), the workload starts triggering OOMs (often tens or hundreds). After a flurry of these, the workload stabilises and its performance improves. It seems that the core mechanism used to manage memory is allocation throttling, achieved by blocking threads when OOMs occur. What is not clear is why there are no further OOMs after the initial OOM storm (a large number, perhaps hundreds, which occur during the iteration that shows the massively slower allocation time). In all of the experiments I've run with 10 iterations of the workload, this happens exactly once, usually in the second or third iteration, and no subsequent iteration experiences OOMs.
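
For illustration, a minimal sketch of the allocation-throttling pattern described above; the names are hypothetical and this is not h2o's allocator:

// Hypothetical sketch: throttle allocation by blocking the thread on OOM and retrying.
class ThrottledAllocator {
    static byte[] allocate(int bytes) {
        while (true) {
            try {
                return new byte[bytes];
            } catch (OutOfMemoryError oom) {
                // Block briefly so the memory manager / GC can reclaim space, then retry.
                try {
                    Thread.sleep(10);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw oom;
                }
            }
        }
    }
}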

At this stage I do not plan to investigate this issue further. Having isolated the elements that explicitly make assumptions about the garbage collector, I will ascribe the remaining pathologies to the h2o workload as an artefact of a large real-world application.

c4bbe26 adds steps mentioned in the h2o troubleshooting guide, namely explicitly calling h2o's GarbageCollect and RemoveAll APIs between iterations. This does not appear to improve the situation.

I think I found the issue.

First, run with -Xlog:gc=info.
Normally, the log looks like this.

[14.753s][info][gc] GC(79) Pause Young (Normal) (G1 Evacuation Pause) 416M->68M(600M) 16.502ms
[14.833s][info][gc] GC(80) Pause Young (Normal) (G1 Evacuation Pause) 416M->72M(600M) 16.799ms
[14.922s][info][gc] GC(81) Pause Young (Normal) (G1 Evacuation Pause) 417M->73M(600M) 16.435ms
[15.006s][info][gc] GC(82) Pause Young (Normal) (G1 Evacuation Pause) 418M->76M(600M) 16.490ms
[15.088s][info][gc] GC(83) Pause Young (Normal) (G1 Evacuation Pause) 419M->86M(600M) 17.295ms

During the slow iteration, there are a lot of large object allocations.

[44.235s][info][gc] GC(637) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 82.926ms
[44.324s][info][gc] GC(638) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 84.817ms
[44.326s][info][gc] GC(639) Pause Young (Normal) (G1 Humongous Allocation) 479M->479M(600M) 0.532ms
[44.410s][info][gc] GC(640) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 82.307ms
[44.504s][info][gc] GC(641) Pause Full (G1 Humongous Allocation) 479M->479M(600M) 89.594ms

And the live size doesn't decrease after GC.

Running jcmd $(pidof java) GC.heap_dump at these points, and summarizing the heap dumps with jol:

Good

     INSTANCES      SUM SIZE      AVG SIZE   CLASS
------------------------------------------------------------------------------------------------
        116353      35115168           301   byte[]
           130       4261920         32784   jsr166y.ForkJoinTask[]
         77192       1852608            24   java.lang.String
          3368       1510768           448   int[]
         15862       1233296            77   java.lang.Object[]
         22065        706080            32   java.util.HashMap$Node
         34405        550480            16   javassist.bytecode.Utf8Info

Bad

     INSTANCES      SUM SIZE      AVG SIZE   CLASS
------------------------------------------------------------------------------------------------
          4178     401557816         96112   int[]
        105729      14448312           136   byte[]
            99       3245616         32784   jsr166y.ForkJoinTask[]
          2066       2237248          1082   double[]
         73422       1762128            24   java.lang.String
         15848       1095600            69   java.lang.Object[]
         21767        696544            32   java.util.HashMap$Node
          3931        597512           152   hex.tree.DHistogram

A lot more large int arrays.

Write a bpftrace script to trace humongous allocations:

uprobe:/path/to/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm {
    // arg0 is the G1CollectedHeap `this` pointer; arg1 is the requested size in words
    printf("LOS %s %d %d\n", comm, pid, arg1);
}

And it shows different application threads allocating objects of word size 5000002.

Dump the stack for such allocations:

uprobe:/path/to/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm {
    printf("LOS %s %d %d\n", comm, pid, arg1);
    if (arg1 == 5000002) {
        // 5,000,002 words is ~40MB; dump the Java stacks of the allocating process.
        // Note: system() requires running bpftrace with --unsafe.
        system("jstack %d", pid);
    }
}

And the offending thread looks like this.

"Thread-55" #163 daemon prio=10 os_prio=0 cpu=66.16ms elapsed=0.94s tid=0x00007f42600b1000 nid=0x1d2dc1 runnable  [0x00007f41360e1000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_blocked
Thread: 0x00007f42600b1000  [0x1d2dc1] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0
   JavaThread state: _thread_blocked
	at water.init.MemoryBandwidth.run_benchmark(MemoryBandwidth.java:56)
	at water.init.MemoryBandwidth$1.run(MemoryBandwidth.java:27)

This seems to be caused by h2o's built-in benchmarks.

https://github.com/h2oai/h2o-3/blob/f0a67ff31b95534beb3c40fe7cf6a291cc9b2aca/h2o-core/src/main/java/water/HeartBeatThread.java#L156-L164

It seems like we should set heartbeat.benchmark.enabled to false.

Since the start of MemoryBandwidth depends on some hashcode, where MemoryBandwidth falls within the invocation can be random, or it might not appear at all if not running with a lot of iterations.

https://github.com/h2oai/h2o-3/blob/f0a67ff31b95534beb3c40fe7cf6a291cc9b2aca/h2o-core/src/main/java/water/HeartBeatThread.java#L161

For heaps larger than 400M, MemoryBandwidth will allocate a total of 40M * 32 = 1280M on a 32-thread machine. For heaps smaller than 400M, MemoryBandwidth will allocate a total of 3.2x the heap size.
This explains our previous observation that the live set seems to increase with heap size (until the heap size is made >> 1280M).
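
The per-thread sizing implied by those numbers is roughly the minimum of 40M and one tenth of the heap; a hypothetical sketch of that arithmetic (an inference from the figures above, not h2o's code):

// Inferred from the figures above; not h2o's actual code.
class BandwidthFootprint {
    public static void main(String[] args) {
        long heapBytes = Runtime.getRuntime().maxMemory();
        long perThread = Math.min(40L << 20, heapBytes / 10); // 40M cap, else ~10% of heap
        long totalBytes = perThread * 32;                     // 32 threads => up to 1280M total
        System.out.println((totalBytes >> 20) + "M");
    }
}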

The first few allocations will succeed, but once the heap is full, subsequent allocations will fail. The OOM is caught by h2o's memory manager, so the allocation is retried, and G1 runs back-to-back GCs that yield no free space until some thread finishes its MemoryBandwidth benchmark.

Confirmed and fixed. The one-line change in 6a525d9 turns off the heartbeat benchmark, resulting in the problem disappearing entirely.

Thank you @caizixian!

Shouldn't the property have a sys.ai.h2o prefix?

H2O.OptArgs.SYSTEM_PROP_PREFIX was in the code.

Ah yes. Unfortunately I seem to have tested with the property set at the command line :-(
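
For the record, the fix presumably amounts to setting the prefixed property before H2O boots, roughly as follows (a sketch, not the literal diff in 53d17f9):

// Disable the heartbeat benchmark using the prefix constant mentioned above.
System.setProperty(H2O.OptArgs.SYSTEM_PROP_PREFIX + "heartbeat.benchmark.enabled", "false");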

Fixed in 53d17f9. See the following quick experiment:

400M 326f25cd
===== DaCapo evaluation-git-326f25cd h2o completed warmup 1 in 7308 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 2 in 22264 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 3 in 4897 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 4 in 4768 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 5 in 4765 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 6 in 4570 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 7 in 4787 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 8 in 4615 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 9 in 4717 msec =====
===== DaCapo evaluation-git-326f25cd h2o PASSED in 4723 msec =====

400M 21c4c2eb
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 1 in 6981 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 2 in 27500 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 3 in 4677 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 4 in 4751 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 5 in 5527 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 6 in 4658 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 7 in 4598 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 8 in 4902 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 9 in 4702 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o PASSED in 4720 msec =====

400M 53d17f9b
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 1 in 7319 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 2 in 5516 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 3 in 5428 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 4 in 4804 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 5 in 4665 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 6 in 4764 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 7 in 4807 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 8 in 4805 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 9 in 4729 msec =====
===== DaCapo evaluation-git-53d17f9b h2o PASSED in 4621 msec =====

800M 326f25cd
===== DaCapo evaluation-git-326f25cd h2o completed warmup 1 in 6844 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 2 in 12082 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 3 in 4463 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 4 in 4451 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 5 in 4552 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 6 in 4448 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 7 in 4456 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 8 in 4378 msec =====
===== DaCapo evaluation-git-326f25cd h2o completed warmup 9 in 4490 msec =====
===== DaCapo evaluation-git-326f25cd h2o PASSED in 4405 msec =====

800M 21c4c2eb
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 1 in 6774 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 2 in 14428 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 3 in 4549 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 4 in 4554 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 5 in 4662 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 6 in 4560 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 7 in 4570 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 8 in 4478 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o completed warmup 9 in 4387 msec =====
===== DaCapo evaluation-git-21c4c2eb h2o PASSED in 4508 msec =====

800M 53d17f9b
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 1 in 7088 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 2 in 5060 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 3 in 4661 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 4 in 4464 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 5 in 4546 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 6 in 4567 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 7 in 4664 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 8 in 4588 msec =====
===== DaCapo evaluation-git-53d17f9b h2o completed warmup 9 in 4613 msec =====
===== DaCapo evaluation-git-53d17f9b h2o PASSED in 4516 msec =====