Rally on aarch64 appears to leak memory
mikeh-elastic opened this issue
Rally version (get with `esrally --version`):
esrally 2.7.1
Invoked command:
~/.local/bin/esrally race --challenge logging-indexing-querying --track elastic/logs --target-hosts=${URL}:9200 --pipeline=benchmark-only --client-options="enable_cleanup_closed:true,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:$PASSWORD" --track-params="bulk_indexing_clients:48,number_of_shards:3,number_of_replicas:1,start_date:2022-12-22,end_date:2022-12-24,raw_data_volume_per_day:1024GB,data_generation_clients:16,throttle_indexing:true,query_min_date:2022-12-22,query_max_date:2022-12-24" --kill-running-processes
Configuration file (located in `~/.rally/rally.ini`):
JVM version:
N/A - running against remote cluster
OS version:
ubuntu@ip-192-168-6-238:$ uname -a
Linux ip-192-168-6-238 5.19.0-1025-aws #26~22.04.1-Ubuntu SMP Mon Apr 24 01:58:03 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
- Run rally on aarch64
- Wait
- Reboot server due to sshd being killed by oom killer
Provide logs (if relevant):
On aarch64 there appears to be a memory leak that I have not seen on x86_64.
Before the Rally run:
top - 18:09:52 up 1:46, 2 users, load average: 0.00, 0.00, 0.00
Tasks: 189 total, 1 running, 188 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31373.4 total, 30647.5 free, 273.9 used, 452.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30776.9 avail Mem
Run command
~/.local/bin/esrally race --challenge logging-indexing-querying --track elastic/logs --target-hosts=${URL}:9200 --pipeline=benchmark-only --client-options="enable_cleanup_closed:true,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:$PASSWORD" --track-params="bulk_indexing_clients:48,number_of_shards:3,number_of_replicas:1,start_date:2022-12-22,end_date:2022-12-24,raw_data_volume_per_day:1024GB,data_generation_clients:16,throttle_indexing:true,query_min_date:2022-12-22,query_max_date:2022-12-24" --kill-running-processes
Beginning of rally run:
top - 18:12:39 up 1:49, 1 user, load average: 0.06, 0.02, 0.00
Tasks: 212 total, 1 running, 210 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31373.4 total, 30293.7 free, 510.2 used, 569.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30537.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1593 ubuntu 20 0 74516 58440 5888 S 0.0 0.2 0:01.03 esrally
1623 ubuntu 20 0 148308 57920 5256 S 0.0 0.2 0:00.04 esrally
1621 ubuntu 20 0 148308 57892 5256 S 0.0 0.2 0:00.04 esrally
1626 ubuntu 20 0 148308 57872 5192 S 0.0 0.2 0:00.04 esrally
1622 ubuntu 20 0 148308 57860 5192 S 0.0 0.2 0:00.04 esrally
1624 ubuntu 20 0 148308 57852 5192 S 0.0 0.2 0:00.04 esrally
1627 ubuntu 20 0 148308 57848 5192 S 0.0 0.2 0:00.03 esrally
1625 ubuntu 20 0 148308 57844 5168 S 0.0 0.2 0:00.04 esrally
1628 ubuntu 20 0 148308 57844 5192 S 0.0 0.2 0:00.04 esrally
1592 ubuntu 20 0 65088 51332 7680 S 0.0 0.2 0:00.09 esrally
1571 ubuntu 20 0 63040 49304 7760 S 0.0 0.2 0:00.55 esrally
1591 ubuntu 20 0 63040 46140 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40048 4560 S 0.0 0.1 0:00.12 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 8 minutes in:
top - 18:20:18 up 1:57, 1 user, load average: 0.92, 0.69, 0.33
Tasks: 208 total, 1 running, 206 sleeping, 0 stopped, 1 zombie
%Cpu(s): 14.2 us, 0.8 sy, 0.0 ni, 84.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 31373.4 total, 28570.9 free, 1547.8 used, 1254.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29498.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1643 ubuntu 20 0 19.6g 5.4g 5.3g S 19.7 17.8 1:11.64 esrally
1644 ubuntu 20 0 19.7g 5.4g 5.3g S 21.0 17.7 1:11.48 esrally
1645 ubuntu 20 0 19.9g 5.4g 5.3g S 20.3 17.7 1:11.51 esrally
1648 ubuntu 20 0 17.1g 4.8g 4.6g S 18.3 15.6 1:02.97 esrally
1646 ubuntu 20 0 17.3g 4.8g 4.6g S 17.7 15.6 1:03.10 esrally
1647 ubuntu 20 0 17.2g 4.8g 4.6g S 19.0 15.6 1:02.86 esrally
1642 ubuntu 20 0 7889356 2.1g 2.0g S 6.7 6.8 0:28.65 esrally
1592 ubuntu 20 0 306452 219172 8628 S 0.3 0.7 0:01.93 esrally
1740 ubuntu 20 0 1874976 186640 54536 S 0.0 0.6 0:01.54 filebeat
1641 ubuntu 20 0 224168 73356 8916 S 0.0 0.2 0:02.63 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.0 0.1 0:00.57 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 17 minutes in:
top - 18:29:16 up 2:06, 1 user, load average: 1.10, 1.11, 0.74
Tasks: 209 total, 1 running, 207 sleeping, 0 stopped, 1 zombie
%Cpu(s): 15.5 us, 0.8 sy, 0.0 ni, 83.2 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 31373.4 total, 27199.6 free, 1914.3 used, 2259.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29131.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1643 ubuntu 20 0 19.6g 13.3g 13.1g S 23.7 43.3 2:54.94 esrally
1645 ubuntu 20 0 19.9g 13.3g 13.1g S 20.7 43.3 2:54.28 esrally
1644 ubuntu 20 0 19.7g 13.3g 13.1g S 21.3 43.3 2:54.21 esrally
1647 ubuntu 20 0 17.3g 11.6g 11.5g S 19.3 38.0 2:32.70 esrally
1646 ubuntu 20 0 17.3g 11.6g 11.5g S 18.3 37.9 2:33.39 esrally
1648 ubuntu 20 0 17.1g 11.6g 11.5g S 17.3 37.9 2:33.06 esrally
1642 ubuntu 20 0 7889356 5.0g 4.9g S 8.7 16.4 1:09.37 esrally
1592 ubuntu 20 0 540932 453596 8628 S 2.0 1.4 0:04.65 esrally
1740 ubuntu 20 0 1943316 185320 54660 S 1.0 0.6 0:09.85 filebeat
1641 ubuntu 20 0 224168 73856 8916 S 0.3 0.2 0:03.56 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.3 0.1 0:00.78 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
About 40 minutes in:
top - 18:52:09 up 2:28, 1 user, load average: 1.14, 1.07, 0.99
Tasks: 209 total, 1 running, 207 sleeping, 0 stopped, 1 zombie
%Cpu(s): 11.5 us, 0.6 sy, 0.0 ni, 87.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 31373.4 total, 25819.9 free, 2531.2 used, 3022.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 28513.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1644 ubuntu 20 0 19.7g 19.2g 19.1g S 16.9 62.8 7:09.65 esrally
1645 ubuntu 20 0 19.9g 19.2g 19.1g S 17.9 62.8 7:09.58 esrally
1643 ubuntu 20 0 19.6g 19.2g 19.1g S 16.6 62.8 7:10.97 esrally
1647 ubuntu 20 0 17.2g 16.8g 16.7g S 14.6 55.0 6:15.89 esrally
1648 ubuntu 20 0 17.1g 16.8g 16.7g S 14.3 55.0 6:18.13 esrally
1646 ubuntu 20 0 17.3g 16.8g 16.7g S 10.6 55.0 6:17.58 esrally
1642 ubuntu 20 0 7889356 7.2g 7.2g S 5.3 23.7 2:50.33 esrally
1592 ubuntu 20 0 1138272 1.0g 8628 S 0.3 3.3 0:11.87 esrally
1740 ubuntu 20 0 1943316 192416 55160 S 0.0 0.6 0:30.64 filebeat
1641 ubuntu 20 0 224168 74412 9196 S 0.0 0.2 0:05.20 esrally
1571 ubuntu 20 0 63040 49404 7856 S 0.0 0.2 0:00.58 esrally
1591 ubuntu 20 0 63040 46144 4596 S 0.0 0.1 0:00.00 esrally
1559 ubuntu 20 0 57300 46072 10668 S 0.0 0.1 0:00.36 esrally
1570 ubuntu 20 0 57300 40160 4560 S 0.0 0.1 0:01.06 esrally
1569 ubuntu 20 0 57300 39964 4496 S 0.0 0.1 0:00.00 esrally
It continues to consume system memory until the server OOMs.
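For reference, a simple loop like the following is one way to record that growth between snapshots (just a sketch; the output file name and interval are arbitrary):

```
# Record overall memory and the largest esrally processes once a minute
while true; do
  date
  free -m
  ps -o pid,rss,vsz,comm -C esrally --sort=-rss | head -n 20
  sleep 60
done >> rally-mem.log
```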
Can you please share the OOM killer output from `dmesg -T` (or similar)? The memory usage shown in RES makes sense at a glance, as Rally `mmap`s the various corpora files (hence you see a similar value for both VIRT and SHR), and you can see that the amount of free memory reported by the system is still ~25GB.
Generally the system should be able to reclaim pages as required to avoid invoking the OOMKiller, but physical memory can become fragmented in such a way that, despite there being enough total free memory, not enough of it is available in physically contiguous chunks, which can still trigger the OOMKiller.
The actual OOMKiller event log should include enough information to tell whether that is what is happening here.
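If it helps, something like this should pull the relevant report out of the kernel log (the amount of grep context is arbitrary):

```
# OOM killer report with surrounding context, human-readable timestamps
dmesg -T | grep -i -A 60 'invoked oom-killer'
# or, if the ring buffer has already rotated, from the journal:
journalctl -k | grep -i -A 60 'invoked oom-killer'
```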
I started a reproduction attempt using the same Rally parameters on as close a hardware profile as I could:
| Host Name | Cloud Provider | Availability Zone | CPU Core Count | Total RAM | Total Storage | Machine Type | Host Architecture | Operating System | Operating System Version | Kernel |
|---|---|---|---|---|---|---|---|---|---|---|
| rally-0 | aws | ap-southeast-2a | 16 | 30.8GB | 869.8GB | c6gd.4xlarge | aarch64 | Ubuntu | 18.04.6 LTS (Bionic Beaver) | 5.4.0-1083-aws |
I've noticed some strange behaviour related to how Rally handles sampling and the subsequent flushing to a remote metric store that could perhaps explain excess memory usage in some scenarios. Are you using a remote metrics store in this scenario?
Regardless, the OOMKiller output will still be invaluable.
Incidentally, I did not set up a metrics store for this run, and it's a long run given my track params, so perhaps it's as simple as that.
OOM info
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047305] filebeat invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047315] CPU: 4 PID: 142599 Comm: filebeat Not tainted 5.19.0-1022-aws #23~22.04.1-Ubuntu
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047319] Hardware name: Amazon EC2 m6gd.2xlarge/, BIOS 1.0 11/1/2018
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047320] Call trace:
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047321] dump_backtrace+0xd8/0x150
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047326] show_stack+0x20/0x70
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047328] dump_stack_lvl+0x68/0x98
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047332] dump_stack+0x18/0x40
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047334] dump_header+0x54/0x230
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047338] oom_kill_process+0x278/0x280
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047341] out_of_memory+0xe4/0x36c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047344] __alloc_pages_may_oom+0x130/0x200
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047347] __alloc_pages_slowpath.constprop.0+0x57c/0x914
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047350] __alloc_pages+0x298/0x34c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047353] alloc_pages+0xb4/0x1a4
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047356] folio_alloc+0x24/0x7c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047359] filemap_alloc_folio+0x104/0x130
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047362] __filemap_get_folio+0x134/0x46c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047364] filemap_fault+0x498/0x93c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047367] __do_fault+0x44/0x1ac
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047369] do_read_fault+0xec/0x1f0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047371] do_fault+0xbc/0x1dc
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047372] handle_pte_fault+0xdc/0x25c
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047374] __handle_mm_fault+0x204/0x3a0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047376] handle_mm_fault+0xcc/0x280
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047378] do_page_fault+0x180/0x554
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047380] do_translation_fault+0xac/0xfc
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047382] do_mem_abort+0x4c/0xc0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047384] el0_ia+0xa0/0x234
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047386] el0t_64_sync_handler+0x154/0x160
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047388] el0t_64_sync+0x1a0/0x1a4
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047390] Mem-Info:
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] active_anon:248 inactive_anon:7829164 isolated_anon:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] active_file:39 inactive_file:632 isolated_file:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] unevictable:6548 dirty:0 writeback:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] slab_reclaimable:30950 slab_unreclaimable:21060
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] mapped:1878 shmem:247 pagetables:75781 bounce:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] kernel_misc_reclaimable:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047391] free:43116 free_pcp:0 free_cma:0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047397] Node 0 active_anon:992kB inactive_anon:31316656kB active_file:156kB inactive_file:2528kB unevictable:26192kB isolated(anon):0kB isolated(file):0kB mapped:7512kB dirty:0kB writeba
ck:0kB shmem:988kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 4096kB writeback_tmp:0kB kernel_stack:5056kB pagetables:303124kB all_unreclaimable? no
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047401] Node 0 DMA free:123004kB boost:0kB min:1372kB low:2348kB high:3324kB reserved_highatomic:0KB active_anon:0kB inactive_anon:842944kB active_file:0kB inactive_file:0kB unevictable:
0kB writepending:0kB present:1048576kB managed:978032kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047407] lowmem_reserve[]: 0 0 30408 30408 30408
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047411] Node 0 Normal free:49460kB boost:0kB min:43680kB low:74816kB high:105952kB reserved_highatomic:6144KB active_anon:992kB inactive_anon:30473712kB active_file:156kB inactive_file:2
528kB unevictable:26192kB writepending:0kB present:31817728kB managed:31148160kB mlocked:26192kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047416] lowmem_reserve[]: 0 0 0 0 0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047420] Node 0 DMA: 255*4kB (U) 214*8kB (UM) 163*16kB (UME) 137*32kB (UME) 106*64kB (UME) 52*128kB (UME) 20*256kB (UME) 7*512kB (UE) 3*1024kB (U) 3*2048kB (UM) 2*4096kB (UE) 1*8192kB (E)
4*16384kB (UM) = 123004kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047438] Node 0 Normal: 321*4kB (UME) 117*8kB (UME) 70*16kB (ME) 54*32kB (ME) 20*64kB (ME) 1*128kB (M) 1*256kB (M) 0*512kB 1*1024kB (M) 21*2048kB (M) 0*4096kB 0*8192kB 0*16384kB = 50764kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047454] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047457] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047458] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047459] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047461] 2785 total pagecache pages
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047462] 0 pages in swap cache
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047463] Swap cache stats: add 0, delete 0, find 0/0
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047464] Free swap = 0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047465] Total swap = 0kB
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047466] 8216576 pages RAM
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047467] 0 pages HighMem/MovableOnly
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047468] 185028 pages reserved
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047469] 0 pages hwpoisoned
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047470] Tasks state (memory values in pages):
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047470] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047475] [ 245] 0 245 54086 745 409600 0 -250 systemd-journal
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047478] [ 287] 0 287 72416 6417 114688 0 -1000 multipathd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047481] [ 299] 0 299 2673 671 61440 0 -1000 systemd-udevd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047483] [ 496] 100 496 4108 769 77824 0 0 systemd-network
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047486] [ 498] 101 498 6241 1506 94208 0 0 systemd-resolve
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047488] [ 534] 0 534 1728 385 53248 0 0 cron
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047490] [ 535] 102 535 2239 536 57344 0 -900 dbus-daemon
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047493] [ 543] 0 543 20524 324 57344 0 0 irqbalance
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047495] [ 544] 0 544 8245 2694 106496 0 0 networkd-dispat
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047497] [ 548] 0 548 455527 1472 241664 0 0 amazon-ssm-agen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047499] [ 557] 0 557 3992 781 69632 0 0 systemd-logind
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047501] [ 559] 114 559 4652 484 61440 0 0 chronyd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047504] [ 563] 114 563 2555 131 61440 0 0 chronyd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047506] [ 615] 0 615 1409 121 45056 0 0 agetty
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047508] [ 623] 0 623 1398 136 49152 0 0 agetty
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047510] [ 644] 0 644 27488 2639 118784 0 0 unattended-upgr
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047512] [ 677] 0 677 58838 347 90112 0 0 polkitd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047514] [ 725] 0 725 3789 1000 65536 0 -1000 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047516] [ 912] 0 912 457799 1921 266240 0 0 ssm-agent-worke
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047519] [ 931] 1000 931 4359 971 73728 0 0 systemd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047521] [ 932] 1000 932 42693 892 98304 0 0 (sd-pam)
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047523] [ 1067] 1000 1067 2137 660 57344 0 0 dbus-daemon
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047525] [ 2236] 0 2236 74389 1334 176128 0 0 packagekitd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047528] [ 3426] 1000 3426 1907 434 45056 0 0 screen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047530] [ 3427] 1000 3427 2183 883 57344 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047532] [ 62793] 104 62793 55505 385 81920 0 0 rsyslogd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047535] [ 131498] 0 131498 4531 1065 81920 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047537] [ 131578] 1000 131578 4605 882 81920 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047540] [ 131581] 1000 131581 2185 852 57344 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047542] [ 142589] 1000 142589 468824 30280 692224 0 0 filebeat
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047544] [ 142605] 1000 142605 1868 468 45056 0 0 screen
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047547] [ 142606] 1000 142606 2150 843 53248 0 0 bash
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047549] [ 142744] 1000 142744 14326 9309 159744 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047551] [ 142753] 1000 142753 14326 8993 147456 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047553] [ 142754] 1000 142754 14326 9013 147456 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047555] [ 142755] 1000 142755 15797 10814 163840 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047557] [ 142779] 1000 142779 15797 10782 159744 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047559] [ 142780] 1000 142780 7565100 7511440 60559360 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047561] [ 143267] 1000 143267 56582 16844 233472 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047563] [ 143268] 1000 143268 1977178 27083 15511552 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047565] [ 143269] 1000 143269 5166637 42564 40820736 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047567] [ 143270] 1000 143270 5147172 41534 40792064 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047569] [ 143271] 1000 143271 5132143 44441 40808448 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047571] [ 143272] 1000 143272 4500035 38515 35725312 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047573] [ 143273] 1000 143273 4500627 38651 35737600 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047575] [ 143274] 1000 143274 4500444 39094 35713024 0 0 esrally
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047577] [ 143867] 0 143867 3756 801 73728 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047579] [ 143868] 0 143868 3756 908 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047581] [ 143872] 0 143872 3722 798 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047583] [ 143875] 0 143875 3722 811 69632 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047585] [ 143877] 0 143877 3588 191 61440 0 0 sshd
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047587] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-574.scope,task=esrally,pid=142780,uid=1000
May 11 01:13:16 ip-192-168-6-238 kernel: [2025383.047605] Out of memory: Killed process 142780 (esrally) total-vm:30260400kB, anon-rss:30044172kB, file-rss:1588kB, shmem-rss:0kB, UID:1000 pgtables:59140kB oom_score_adj:0
May 11 01:13:17 ip-192-168-6-238 kernel: [2025385.208870] oom_reaper: reaped process 142780 (esrally), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> Incidentally, I did not set up a metrics store for this run, and it's a long run given my track params, so perhaps it's as simple as that.
I think that actually could be it. The default in-memory metrics store keeps all samples collected during the execution of a particular task in a per-core (`Worker`) `Sampler` object before serialising and compressing them at the end of the task, and then we deserialise and uncompress all samples (percentiles etc.) at the end of the entire run.
The OOMKiller output pretty much confirms this for me: the RSS for `esrally` is indeed almost 100% of the available system memory (~30GB), while the file-mapped memory is pretty much empty (1.5MB).
In my reproduction I found that this particular benchmark and challenge (`logging-indexing-querying`) generates a lot of per-request samples, which were taking quite some time to flush to a remote metrics store and caused some excess memory pressure of their own. That alone doesn't lead to an OOM scenario, though, because flushing the remote metrics store removes the flushed samples from memory, whereas the in-memory metrics store retains them for the duration of the task execution; in this specific benchmark that means concurrent indexing and querying tasks with many clients.
The default `Sampler` does have a maximum queue size and will drop metrics once it is reached, but I think we set it too large to be effective at 2^21, or 2,097,152 samples per Worker/core. Exact per-`Sample` sizes vary with per-task metadata etc., but based on some rudimentary testing we can safely assume they are at least 4KB, meaning we store at least 2097152 * 4KB = 8GB of samples per core before dropping any.
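As a rough back-of-the-envelope check (the ~4KB per-sample figure is only an estimate from that rudimentary testing):

```
# Approximate worst-case in-memory sample volume per Worker/core, assuming ~4KB per sample
echo $(( 2097152 * 4 / 1024 / 1024 ))   # default queue size  -> 8 (GB)
echo $(( 1572864 * 4 / 1024 / 1024 ))   # reduced queue size  -> 6 (GB)
```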
You can see the full details in #1723 and #1724.
For now, there are two things you can do to work around this:
- Use a remote metrics store so that all metrics and results are kept there.
- Adjust the `sample.queue.size` setting in your `rally.ini` to something lower, which allows the benchmark to complete at the expense of losing samples once the queue is full. For example:
[reporting]
datastore.type = in-memory
sample.queue.size = 1572864