dbuf_cache_shift is too small for systems with more than a terabyte of RAM
rottegift opened this issue
Nobody is likely to ever need or want a dbuf cache of dozens of gigabytes. Worse, it causes bad (and possibly unrecoverable) stalls when a system with 1.5 TiB of RAM does a sequential write of compressible data to a slow spinning-disk pool with Spotlight enabled; the stuck system has threads that look like the one below (edited for size!).
Although there is something here that needs more thought (why aren't we stopping at kmem? have we even reached 1/64 of RAM yet?), clearly 1/32 (of the half of RAM that ARC can grab, so 1/64 of system RAM) of 1.5 TiB is far too much dbuf cache.
I think we should cap the dbuf cache at something reasonable and allow users to tune it upwards if they really want a larger one.
I'll turn this into a PR in due course.
See the discussion around this comment: #721 (comment)
```
*1000 __zio_execute + 243 (zfs + 865241) [0xffffff7f8519c3d9]
*1000 zio_vdev_io_done + 130 (zfs + 879953) [0xffffff7f8519fd51]
*1000 vdev_queue_io_done + 188 (zfs + 540215) [0xffffff7f8514ce37]
*1000 zio_nowait + 307 (zfs + 862989) [0xffffff7f8519bb0d]
*1000 zio_vdev_io_start + 495 (zfs + 879749) [0xffffff7f8519fc85]
*1000 vdev_disk_io_start + 491 (zfs + 509072) [0xffffff7f85145490]
*1000 abd_borrow_buf_copy + 21 (zfs + 4425) [0xffffff7f850ca149]
*1000 abd_borrow_buf + 42 (zfs + 4327) [0xffffff7f850ca0e7]
*1000 spl_zio_kmem_cache_alloc + 214 (spl + 32925) [0xffffff7f83edb09d]
*1000 kmem_cache_alloc + 749 (spl + 12519) [0xffffff7f83ed60e7]
*1000 vmem_alloc + 174 (spl + 65534) [0xffffff7f83ee2ffe]
*1000 vmem_xalloc + 1354 (spl + 61567) [0xffffff7f83ee207f]
*1000 vmem_alloc + 174 (spl + 65534) [0xffffff7f83ee2ffe]
*1000 vmem_xalloc + 1354 (spl + 61567) [0xffffff7f83ee207f]
*1000 vmem_alloc + 174 (spl + 65534) [0xffffff7f83ee2ffe]
*1000 vmem_xalloc + 1354 (spl + 61567) [0xffffff7f83ee207f]
*1000 vmem_bucket_alloc + 1565 (spl + 77255) [0xffffff7f83ee5dc7]
*1000 vmem_alloc + 174 (spl + 65534) [0xffffff7f83ee2ffe]
*1000 vmem_xalloc + 1354 (spl + 61567) [0xffffff7f83ee207f]
*563 xnu_alloc_throttled + 507 (spl + 74382) [0xffffff7f83ee528e]
*451 spl_vmem_malloc_if_no_pressure + 95 (spl + 79936) [0xffffff7f83ee6840]
*451 osif_malloc + 53 (spl + 49719) [0xffffff7f83edf237]
*240 kernel_memory_allocate + 902 (kernel + 1890310) [0xffffff80003cd806]
*195 vm_map_find_space + 451 (kernel + 1935491) [0xffffff80003d8883] (running)
*14 vm_map_find_space + 427 (kernel + 1935467) [0xffffff80003d886b] (running)
*13 vm_map_find_space + 371 (kernel + 1935411) [0xffffff80003d8833] (running)
*6 vm_map_find_space + 404 (kernel + 1935444) [0xffffff80003d8854] (running)
*4 vm_map_find_space + 409 (kernel + 1935449) [0xffffff80003d8859] (running)
*3 vm_map_find_space + 441 (kernel + 1935481) [0xffffff80003d8879] (running)
*2 vm_map_find_space + 401 (kernel + 1935441) [0xffffff80003d8851] (running)
*1 vm_map_find_space + 423 (kernel + 1935463) [0xffffff80003d8867] (running)
*195 kernel_memory_allocate + 375 (kernel + 1889783) [0xffffff80003cd5f7]
*98 vm_page_grab_options + 1005 (kernel + 2138973) [0xffffff800040a35d] (running)
*12 vm_page_grab_options + 929 (kernel + 2138897) [0xffffff800040a311] (running)
*9 vm_page_grab_options + 96 (kernel + 2138064) [0xffffff8000409fd0] (running)
*6 vm_page_grab_options + 1405 (kernel + 2139373) [0xffffff800040a4ed] (running)
*5 vm_page_grab_options + 1337 (kernel + 2139305) [0xffffff800040a4a9]
*3 vm_pressure_response + 73 (kernel + 2057401) [0xffffff80003f64b9] (running)
*2 vm_pressure_response + 225 (kernel + 2057553) [0xffffff80003f6551] (running)
*4 vm_page_grab_options + 99 (kernel + 2138067) [0xffffff8000409fd3] (running)
*3 vm_page_grab_options + 1078 (kernel + 2139046) [0xffffff800040a3a6] (running)
*3 vm_page_grab_options + 1016 (kernel + 2138984) [0xffffff800040a368] (running)
*3 vm_page_grab_options + 1009 (kernel + 2138977) [0xffffff800040a361] (running)
*3 vm_page_grab_options + 51 (kernel + 2138019) [0xffffff8000409fa3] (running)
*3 vm_page_grab_options + 6 (kernel + 2137974) [0xffffff8000409f76] (running)
*2 vm_page_grab_options + 1429 (kernel + 2139397) [0xffffff800040a505] (running)
*2 vm_page_grab_options + 1063 (kernel + 2139031) [0xffffff800040a397] (running)
*2 vm_page_grab_options + 1040 (kernel + 2139008) [0xffffff800040a380] (running)
*2 vm_page_grab_options + 988 (kernel + 2138956) [0xffffff800040a34c] (running)
*112 spl_vmem_malloc_if_no_pressure + 22 (spl + 79863) [0xffffff7f83ee67f7]
*112 spl_mutex_enter + 38 (spl + 46425) [0xffffff7f83ede559]
*111 ??? (kernel + 2477019) [0xffffff800045cbdb]
*110 lck_mtx_lock_wait_x86 + 313 (kernel + 2478777) [0xffffff800045d2b9]
*110 thread_block_reason + 175 (kernel + 1425263) [0xffffff800035bf6f]
*437 xnu_alloc_throttled + 397 (spl + 74272) [0xffffff7f83ee5220]
*437 cv_timedwait_hires + 226 (spl + 6952) [0xffffff7f83ed4b28]
*437 msleep + 98 (kernel + 6937026) [0xffffff800089d9c2]
*437 ??? (kernel + 6935580) [0xffffff800089d41c]
*437 lck_mtx_sleep_deadline + 115 (kernel + 1363411) [0xffffff800034cdd3]
*437 thread_block_reason + 175 (kernel + 1425263) [0xffffff800035bf6f]
*437 ??? (kernel + 1431409) [0xffffff800035d771]
*437 machine_switch_context + 200 (kernel + 2490744) [0xffffff8000460178]
...
```
ZOL has changed how it sizes the dbuf cache, but I don't like their change for these particular symptoms (because arc_c can still be very large), so I won't link to it. :-) If you're curious, it's "Scale the dbuf cache with arc_c".