Rally on aarch64 appears to leak memory
mikeh-elastic opened this issue
Rally version (get with `esrally --version`):
esrally 2.7.1
Invoked command:
~/.local/bin/esrally race --challenge logging-indexing-querying --track elastic/logs --target-hosts=${URL}:9200 --pipeline=benchmark-only --client-options="enable_cleanup_closed:true,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:$PASSWORD" --track-params="bulk_indexing_clients:48,number_of_shards:3,number_of_replicas:1,start_date:2022-12-22,end_date:2022-12-24,raw_data_volume_per_day:1024GB,data_generation_clients:16,throttle_indexing:true,query_min_date:2022-12-22,query_max_date:2022-12-24" --kill-running-processes
Configuration file (located in `~/.rally/rally.ini`):
JVM version:
N/A - running against remote cluster
OS version:
ubuntu@ip-192-168-6-238:$ uname -a
Linux ip-192-168-6-238 5.19.0-1025-aws #26~22.04.1-Ubuntu SMP Mon Apr 24 01:58:03 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
- Run rally on aarch64
- Wait
- Reboot server due to sshd being killed by oom killer
Provide logs (if relevant):
On aarch64 there appears to be a memory leak that I have not seen on x86_64.
Before the Rally run:
top - 18:09:52 up 1:46, 2 users, load average: 0.00, 0.00, 0.00
Tasks: 189 total, 1 running, 188 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31373.4 total, 30647.5 free, 273.9 used, 452.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30776.9 avail Mem
Run command
~/.local/bin/esrally race --challenge logging-indexing-querying --track elastic/logs --target-hosts=${URL}:9200 --pipeline=benchmark-only --client-options="enable_cleanup_closed:true,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:$PASSWORD" --track-params="bulk_indexing_clients:48,number_of_shards:3,number_of_replicas:1,start_date:2022-12-22,end_date:2022-12-24,raw_data_volume_per_day:1024GB,data_generation_clients:16,throttle_indexing:true,query_min_date:2022-12-22,query_max_date:2022-12-24" --kill-running-processes
Beginning of rally run:
top - 18:12:39 up 1:49, 1 user, load average: 0.06, 0.02, 0.00
Tasks: 212 total, 1 running, 210 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31373.4 total, 30293.7 free, 510.2 used, 569.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30537.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1593 ubuntu 20 0 74516 58440 5888 S 0.0 0.2 0:01.03 esrally
1623 ubuntu 20 0 148308 57920 5256 S 0.0 0.2 0:00.04 esrally
1621 ubuntu 20 0 148308 57892 5256 S 0.0 0.2 0:00.04 esrally
1626 ubuntu 20 0 148308 57872 5192 S 0.0 0.2 0:00.04 esrally
1622 ubuntu 20 0 148308 57860 5192 S 0.0 0.2 0:00.04 esrally
1624 ubuntu 20 0 148308 57852 5192 S 0.0 0.2 0:00.04 esrally
1627 ubuntu 20 0 148308 57848 5192 S 0.0 0.2 0:00.03 esrally
1625 ubuntu 20 0 148308 57844 5168 S 0.0 0.2 0:00.04 esrally
1628 ubuntu 20 0 148308 57844 5192 S 0.0 0.2 0:00.04 esrally
1592 ubuntu 20 0 65088 51332 7680 S 0.0 0.2 0:00.09 esrally
1571 ubuntu 20 0 63040 49304 7760 S 0.0 0.2 0:00.55 esrally
1591 ubuntu 20 0 63040 46140 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40048 4560 S 0.0 0.1 0:00.12 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 8 minutes in:
top - 18:20:18 up 1:57, 1 user, load average: 0.92, 0.69, 0.33
Tasks: 208 total, 1 running, 206 sleeping, 0 stopped, 1 zombie
%Cpu(s): 14.2 us, 0.8 sy, 0.0 ni, 84.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 31373.4 total, 28570.9 free, 1547.8 used, 1254.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29498.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1643 ubuntu 20 0 19.6g 5.4g 5.3g S 19.7 17.8 1:11.64 esrally
1644 ubuntu 20 0 19.7g 5.4g 5.3g S 21.0 17.7 1:11.48 esrally
1645 ubuntu 20 0 19.9g 5.4g 5.3g S 20.3 17.7 1:11.51 esrally
1648 ubuntu 20 0 17.1g 4.8g 4.6g S 18.3 15.6 1:02.97 esrally
1646 ubuntu 20 0 17.3g 4.8g 4.6g S 17.7 15.6 1:03.10 esrally
1647 ubuntu 20 0 17.2g 4.8g 4.6g S 19.0 15.6 1:02.86 esrally
1642 ubuntu 20 0 7889356 2.1g 2.0g S 6.7 6.8 0:28.65 esrally
1592 ubuntu 20 0 306452 219172 8628 S 0.3 0.7 0:01.93 esrally
1740 ubuntu 20 0 1874976 186640 54536 S 0.0 0.6 0:01.54 filebeat
1641 ubuntu 20 0 224168 73356 8916 S 0.0 0.2 0:02.63 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.0 0.1 0:00.57 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 17 minutes in:
top - 18:29:16 up 2:06, 1 user, load average: 1.10, 1.11, 0.74
Tasks: 209 total, 1 running, 207 sleeping, 0 stopped, 1 zombie
%Cpu(s): 15.5 us, 0.8 sy, 0.0 ni, 83.2 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 31373.4 total, 27199.6 free, 1914.3 used, 2259.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29131.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1643 ubuntu 20 0 19.6g 13.3g 13.1g S 23.7 43.3 2:54.94 esrally
1645 ubuntu 20 0 19.9g 13.3g 13.1g S 20.7 43.3 2:54.28 esrally
1644 ubuntu 20 0 19.7g 13.3g 13.1g S 21.3 43.3 2:54.21 esrally
1647 ubuntu 20 0 17.3g 11.6g 11.5g S 19.3 38.0 2:32.70 esrally
1646 ubuntu 20 0 17.3g 11.6g 11.5g S 18.3 37.9 2:33.39 esrally
1648 ubuntu 20 0 17.1g 11.6g 11.5g S 17.3 37.9 2:33.06 esrally
1642 ubuntu 20 0 7889356 5.0g 4.9g S 8.7 16.4 1:09.37 esrally
1592 ubuntu 20 0 540932 453596 8628 S 2.0 1.4 0:04.65 esrally
1740 ubuntu 20 0 1943316 185320 54660 S 1.0 0.6 0:09.85 filebeat
1641 ubuntu 20 0 224168 73856 8916 S 0.3 0.2 0:03.56 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.3 0.1 0:00.78 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 40 minutes in:
top - 18:52:09 up 2:28, 1 user, load average: 1.14, 1.07, 0.99
Tasks: 209 total, 1 running, 207 sleeping, 0 stopped, 1 zombie
%Cpu(s): 11.5 us, 0.6 sy, 0.0 ni, 87.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 31373.4 total, 25819.9 free, 2531.2 used, 3022.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 28513.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1644 ubuntu 20 0 19.7g 19.2g 19.1g S 16.9 62.8 7:09.65 esrally
1645 ubuntu 20 0 19.9g 19.2g 19.1g S 17.9 62.8 7:09.58 esrally
1643 ubuntu 20 0 19.6g 19.2g 19.1g S 16.6 62.8 7:10.97 esrally
1647 ubuntu 20 0 17.2g 16.8g 16.7g S 14.6 55.0 6:15.89 esrally
1648 ubuntu 20 0 17.1g 16.8g 16.7g S 14.3 55.0 6:18.13 esrally
1646 ubuntu 20 0 17.3g 16.8g 16.7g S 10.6 55.0 6:17.58 esrally
1642 ubuntu 20 0 7889356 7.2g 7.2g S 5.3 23.7 2:50.33 esrally
1592 ubuntu 20 0 1138272 1.0g 8628 S 0.3 3.3 0:11.87 esrally
1740 ubuntu 20 0 1943316 192416 55160 S 0.0 0.6 0:30.64 filebeat
1641 ubuntu 20 0 224168 74412 9196 S 0.0 0.2 0:05.20 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.0 0.1 0:01.06 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
It continues to consume system memory until the server OOMs.
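For reference, a simple loop like the following is one way to record that growth between snapshots (just a sketch; the output file name and interval are arbitrary):

```
# Record overall memory and the largest esrally processes once a minute
while true; do
  date
  free -m
  ps -o pid,rss,vsz,comm -C esrally --sort=-rss | head -n 20
  sleep 60
done >> rally-mem.log
```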
Can you please share the OOM killer output from `dmesg -T` (or similar)? The memory usage shown in RES makes sense at a glance, as Rally `mmap`s the various corpora files (hence you see a similar value for both VIRT and SHR), and you can see that the amount of free memory reported by the system is still ~25GB.
Generally the system should be able to reclaim pages as required to avoid invoking the OOMKiller, but physical memory can become fragmented in such a way that, despite there being enough total free memory, not enough of it is available in physically contiguous chunks, which can still trigger the OOMKiller.
The actual OOMKiller event log should include enough information to tell whether that is what is happening here.
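If it helps, something like this should pull the relevant report out of the kernel log (the amount of grep context is arbitrary):

```
# OOM killer report with surrounding context, human-readable timestamps
dmesg -T | grep -i -A 60 'invoked oom-killer'
# or, if the ring buffer has already rotated, from the journal:
journalctl -k | grep -i -A 60 'invoked oom-killer'
```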
I started a reproduction attempt using the same Rally parameters on as close a hardware profile as I could:
| Host Name | Cloud Provider | Availability Zone | CPU Core Count | Total RAM | Total Storage | Machine Type | Host Architecture | Operating System | Operating System Version | Kernel |
|---|---|---|---|---|---|---|---|---|---|---|
| rally-0 | aws | ap-southeast-2a | 16 | 30.8GB | 869.8GB | c6gd.4xlarge | aarch64 | Ubuntu | 18.04.6 LTS (Bionic Beaver) | 5.4.0-1083-aws |
I've noticed some strange behaviour related to how Rally handles sampling and the subsequent flushing to a remote metric store that could perhaps explain excess memory usage in some scenarios. Are you using a remote metrics store in this scenario?
Regardless, the OOMKiller output will still be invaluable.
Incidentally, I did not set up a metrics store for this run, and it's a long run given my track params, so perhaps it's as simple as that.
OOM info
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047305] filebeat invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047315] CPU: 4 PID: 142599 Comm: filebeat Not tainted 5.19.0-1022-aws #23~22.04.1-Ubuntu
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047319] Hardware name: Amazon EC2 m6gd.2xlarge/, BIOS 1.0 11/1/2018
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047320] Call trace:
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047321] dump_backtrace+0xd8/0x150
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047326] show_stack+0x20/0x70
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047328] dump_stack_lvl+0x68/0x98
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047332] dump_stack+0x18/0x40
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047334] dump_header+0x54/0x230
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047338] oom_kill_process+0x278/0x280
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047341] out_of_memory+0xe4/0x36c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047344] __alloc_pages_may_oom+0x130/0x200
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047347] __alloc_pages_slowpath.constprop.0+0x57c/0x914
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047350] __alloc_pages+0x298/0x34c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047353] alloc_pages+0xb4/0x1a4
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047356] folio_alloc+0x24/0x7c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047359] filemap_alloc_folio+0x104/0x130
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047362] __filemap_get_folio+0x134/0x46c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047364] filemap_fault+0x498/0x93c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047367] __do_fault+0x44/0x1ac
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047369] do_read_fault+0xec/0x1f0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047371] do_fault+0xbc/0x1dc
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047372] handle_pte_fault+0xdc/0x25c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047374] __handle_mm_fault+0x204/0x3a0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047376] handle_mm_fault+0xcc/0x280
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047378] do_page_fault+0x180/0x554
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047380] do_translation_fault+0xac/0xfc
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047382] do_mem_abort+0x4c/0xc0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047384] el0_ia+0xa0/0x234
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047386] el0t_64_sync_handler+0x154/0x160
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047388] el0t_64_sync+0x1a0/0x1a4
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047390] Mem-Info:
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] active_anon:248 inactive_anon:7829164 isolated_anon:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] active_file:39 inactive_file:632 isolated_file:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] unevictable:6548 dirty:0 writeback:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] slab_reclaimable:30950 slab_unreclaimable:21060
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] mapped:1878 shmem:247 pagetables:75781 bounce:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] kernel_misc_reclaimable:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] free:43116 free_pcp:0 free_cma:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047397] Node 0 active_anon:992kB inactive_anon:31316656kB active_file:156kB inactive_file:2528kB unevictable:26192kB isolated(anon):0kB isolated(file):0kB mapped:7512kB dirty:0kB writeba
ck:0kB shmem:988kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 4096kB writeback_tmp:0kB kernel_stack:5056kB pagetables:303124kB all_unreclaimable? no
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047401] Node 0 DMA free:123004kB boost:0kB min:1372kB low:2348kB high:3324kB reserved_highatomic:0KB active_anon:0kB inactive_anon:842944kB active_file:0kB inactive_file:0kB unevictable:
0kB writepending:0kB present:1048576kB managed:978032kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047407] lowmem_reserve[]: 0 0 30408 30408 30408
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047411] Node 0 Normal free:49460kB boost:0kB min:43680kB low:74816kB high:105952kB reserved_highatomic:6144KB active_anon:992kB inactive_anon:30473712kB active_file:156kB inactive_file:2
528kB unevictable:26192kB writepending:0kB present:31817728kB managed:31148160kB mlocked:26192kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047416] lowmem_reserve[]: 0 0 0 0 0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047420] Node 0 DMA: 255*4kB (U) 214*8kB (UM) 163*16kB (UME) 137*32kB (UME) 106*64kB (UME) 52*128kB (UME) 20*256kB (UME) 7*512kB (UE) 3*1024kB (U) 3*2048kB (UM) 2*4096kB (UE) 1*8192kB (E)
4*16384kB (UM) = 123004kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047438] Node 0 Normal: 321*4kB (UME) 117*8kB (UME) 70*16kB (ME) 54*32kB (ME) 20*64kB (ME) 1*128kB (M) 1*256kB (M) 0*512kB 1*1024kB (M) 21*2048kB (M) 0*4096kB 0*8192kB 0*16384kB = 50764kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047454] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047457] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047458] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047459] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047461] 2785 total pagecache pages
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047462] 0 pages in swap cache
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047463] Swap cache stats: add 0, delete 0, find 0/0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047464] Free swap = 0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047465] Total swap = 0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047466] 8216576 pages RAM
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047467] 0 pages HighMem/MovableOnly
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047468] 185028 pages reserved
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047469] 0 pages hwpoisoned
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047470] Tasks state (memory values in pages):
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047470] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047475] [ 245] 0 245 54086 745 409600 0 -250 systemd-journal
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047478] [ 287] 0 287 72416 6417 114688 0 -1000 multipathd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047481] [ 299] 0 299 2673 671 61440 0 -1000 systemd-udevd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047483] [ 496] 100 496 4108 769 77824 0 0 systemd-network
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047486] [ 498] 101 498 6241 1506 94208 0 0 systemd-resolve
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047488] [ 534] 0 534 1728 385 53248 0 0 cron
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047490] [ 535] 102 535 2239 536 57344 0 -900 dbus-daemon
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047493] [ 543] 0 543 20524 324 57344 0 0 irqbalance
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047495] [ 544] 0 544 8245 2694 106496 0 0 networkd-dispat
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047497] [ 548] 0 548 455527 1472 241664 0 0 amazon-ssm-agen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047499] [ 557] 0 557 3992 781 69632 0 0 systemd-logind
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047501] [ 559] 114 559 4652 484 61440 0 0 chronyd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047504] [ 563] 114 563 2555 131 61440 0 0 chronyd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047506] [ 615] 0 615 1409 121 45056 0 0 agetty
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047508] [ 623] 0 623 1398 136 49152 0 0 agetty
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047510] [ 644] 0 644 27488 2639 118784 0 0 unattended-upgr
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047512] [ 677] 0 677 58838 347 90112 0 0 polkitd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047514] [ 725] 0 725 3789 1000 65536 0 -1000 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047516] [ 912] 0 912 457799 1921 266240 0 0 ssm-agent-worke
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047519] [ 931] 1000 931 4359 971 73728 0 0 systemd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047521] [ 932] 1000 932 42693 892 98304 0 0 (sd-pam)
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047523] [ 1067] 1000 1067 2137 660 57344 0 0 dbus-daemon
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047525] [ 2236] 0 2236 74389 1334 176128 0 0 packagekitd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047528] [ 3426] 1000 3426 1907 434 45056 0 0 screen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047530] [ 3427] 1000 3427 2183 883 57344 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047532] [ 62793] 104 62793 55505 385 81920 0 0 rsyslogd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047535] [ 131498] 0 131498 4531 1065 81920 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047537] [ 131578] 1000 131578 4605 882 81920 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047540] [ 131581] 1000 131581 2185 852 57344 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047542] [ 142589] 1000 142589 468824 30280 692224 0 0 filebeat
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047544] [ 142605] 1000 142605 1868 468 45056 0 0 screen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047547] [ 142606] 1000 142606 2150 843 53248 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047549] [ 142744] 1000 142744 14326 9309 159744 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047551] [ 142753] 1000 142753 14326 8993 147456 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047553] [ 142754] 1000 142754 14326 9013 147456 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047555] [ 142755] 1000 142755 15797 10814 163840 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047557] [ 142779] 1000 142779 15797 10782 159744 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047559] [ 142780] 1000 142780 7565100 7511440 60559360 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047561] [ 143267] 1000 143267 56582 16844 233472 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047563] [ 143268] 1000 143268 1977178 27083 15511552 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047565] [ 143269] 1000 143269 5166637 42564 40820736 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047567] [ 143270] 1000 143270 5147172 41534 40792064 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047569] [ 143271] 1000 143271 5132143 44441 40808448 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047571] [ 143272] 1000 143272 4500035 38515 35725312 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047573] [ 143273] 1000 143273 4500627 38651 35737600 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047575] [ 143274] 1000 143274 4500444 39094 35713024 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047577] [ 143867] 0 143867 3756 801 73728 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047579] [ 143868] 0 143868 3756 908 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047581] [ 143872] 0 143872 3722 798 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047583] [ 143875] 0 143875 3722 811 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047585] [ 143877] 0 143877 3588 191 61440 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047587] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-574.scope,task=esrally,pid=142780,uid=1000
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047605] Out of memory: Killed process 142780 (esrally) total-vm:30260400kB, anon-rss:30044172kB, file-rss:1588kB, shmem-rss:0kB, UID:1000 pgtables:59140kB oom_score_adj:0
May 11 01:13:17 ip-192-168-6-238 kernel: [2025385.208870] oom_reaper: reaped process 142780 (esrally), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> Incidentally, I did not set up a metrics store for this run, and it's a long run given my track params, so perhaps it's as simple as that.
I think that actually could be it. The default in-memory metrics store keeps all samples collected during the execution of a particular task in a per-core (`Worker`) `Sampler` object before serialising and compressing them at the end of the task, and then we deserialise and uncompress all samples (percentiles etc.) at the end of the entire run.
The OOMKiller output pretty much confirms this for me: the RSS for `esrally` is indeed almost 100% of the available system memory (~30GB), while the file-mapped memory is pretty much empty (1.5MB).
In my reproduction I found that this particular benchmark and challenge (`logging-indexing-querying`) generates a lot of per-request samples, which were taking quite some time to flush to a remote metrics store and caused some excess memory pressure of their own. That alone doesn't lead to an OOM scenario, though, because flushing the remote metrics store removes the flushed samples from memory, whereas the in-memory metrics store retains them for the duration of the task execution; in this specific benchmark that means concurrent indexing and querying tasks with many clients.
The default `Sampler` does have a maximum queue size and will drop metrics once it is reached, but I think we set it too large to be effective at 2^21, or 2,097,152 samples per Worker/core. Exact per-`Sample` sizes vary with per-task metadata etc., but based on some rudimentary testing we can safely assume they are at least 4KB, meaning we store at least 2097152 * 4KB = 8GB of samples per core before dropping any.
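As a rough back-of-the-envelope check (the ~4KB per-sample figure is only an estimate from that rudimentary testing):

```
# Approximate worst-case in-memory sample volume per Worker/core, assuming ~4KB per sample
echo $(( 2097152 * 4 / 1024 / 1024 ))   # default queue size  -> 8 (GB)
echo $(( 1572864 * 4 / 1024 / 1024 ))   # reduced queue size  -> 6 (GB)
```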
You can see the full details in #1723 and #1724.
For now, there are two things you can do to work around this:
- Use a remote metrics store so that all metrics and results are kept there.
- Adjust the `sample.queue.size` setting in your `rally.ini` to something lower, which allows the benchmark to complete at the expense of losing samples once the queue is full. For example:
[reporting]
datastore.type = in-memory
sample.queue.size = 1572864