krakjoe / apcu

APCu - APC User Cache

Expunges happen without hitting limits

boppy opened this issue

Facts first:

  • APCu 5.1.23
  • PHP fpm 8.2.18 (inside a docker container based on php:8.2-fpm)
  • Memory-Type mmap

I observe that the cache expunges way too often, without any hint as to why.

Does anyone have an idea what's going on and how I can mitigate it?


I implemented logging by calling the following script every 30s and writing its results to a log file:

<?php
$ci = apcu_cache_info(true); // limited info: statistics only, no entry list
$si = apcu_sma_info(true);   // limited info: no per-segment block lists

$info = [
    'now' => date('Y-m-d H:i:s'),
    'num_slots' => $ci['num_slots'],
    'num_hits' => $ci['num_hits'],
    'num_misses' => $ci['num_misses'],
    'num_inserts' => $ci['num_inserts'],
    'num_entries' => $ci['num_entries'],
    'expunges' => $ci['expunges'],
    'start_time' => date('c', $ci['start_time']),
    'mem_size_MB' => $ci['mem_size'] / 1024 / 1024,
    'avail_mem_MB' => $si['avail_mem'] / 1024 / 1024,
    'max_mem_MB' => ($si['num_seg'] * $si['seg_size']) / 1024 / 1024
];

echo implode("\t", $info);
echo PHP_EOL;

I then analyze the output with the following awk one-liner, which prints the last line before and the first line after every change in column 8 (the expunge counter):

awk '$8 != acht { print lali; print; print "" } { acht=$8; lali=$0 }' apcu*log

My output is:

Date Time Slots Hits Misses Inserts Entries Expunges Start Time Mem Size (MB) Avail Mem (MB) Max Mem (MB)
2024-05-12 00:46:17 163841 6353095 1 3997021 115772 1 2024-05-11T16:58:57+02:00 995.14669799805 1000.0436172485 1999.9998855591
2024-05-12 00:46:47 163841 4936 2 3738 429 2 2024-05-12T00:46:20+02:00 15.228881835938 1983.5077209473 1999.9998855591
2024-05-12 07:40:01 163841 5641487 2 3533330 110189 2 2024-05-12T00:46:20+02:00 995.19388580322 1000.1700973511 1999.9998855591
2024-05-12 07:40:31 163841 7425 1 6298 919 3 2024-05-12T07:40:06+02:00 21.448593139648 1969.1580581665 1999.9998855591
2024-05-12 13:18:07 163841 5769070 1 3800976 109887 3 2024-05-12T07:40:06+02:00 995.13430023193 1000.2423477173 1999.9998855591
2024-05-12 13:18:37 163841 9731 0 8950 1105 4 2024-05-12T13:18:12+02:00 25.843650817871 1963.471206665 1999.9998855591
2024-05-12 19:01:42 163841 5365891 0 3428149 112283 4 2024-05-12T13:18:12+02:00 994.80582427979 1000.4887771606 1999.9998855591
2024-05-12 19:02:12 163841 1818 0 1948 405 5 2024-05-12T19:02:03+02:00 16.132225036621 1982.6050491333 1999.9998855591
2024-05-13 00:32:43 163841 4912471 0 3072161 109755 5 2024-05-12T19:02:03+02:00 995.37220001221 1000.0014648438 1999.9998855591
2024-05-13 00:33:13 163841 12136 1 10872 1528 6 2024-05-13T00:32:43+02:00 33.372436523438 1955.4794082642 1999.9998855591

So basically it always expunges as soon as available memory drops to 1000M, which is exactly half of the 2000M segment.

I don't get why it's doing so. My entries hint is 160k (shown as 163841 in the "Slots" column), while the cache only holds around 100-130k entries (the "Entries" column). At first I thought the problem was that I had only assigned 1000M to APCu, so I upped it to 2000M, but the expunges still happen just as often.

Any hints are highly appreciated!

Full Config
[APCu]
apc.enabled = 1
apc.enable_cli = 1
apc.shm_size = 2000M
apc.shm_segments = 1
apc.shm_strings_buffer = 64M
apc.gc_ttl = 30
apc.entries_hint = 160000
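
For completeness, a quick way to confirm the FPM workers actually picked these values up (plain ini_get calls; apc.ttl and apc.smart are listed too because their defaults matter below):

<?php
// Print the effective APCu settings as PHP sees them at runtime, since
// docker php images often load additional .ini files on top of this one.
$keys = ['apc.enabled', 'apc.enable_cli', 'apc.shm_size', 'apc.shm_segments',
         'apc.gc_ttl', 'apc.entries_hint', 'apc.ttl', 'apc.smart'];
foreach ($keys as $key) {
    printf("%-18s = %s\n", $key, var_export(ini_get($key), true));
}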

Additional findings:

apcu/apc_cache.c

Lines 755 to 768 in 1ba5a2d

suitable = (cache->smart > 0L) ? (size_t) (cache->smart * size) : (size_t) (cache->sma->size/2);

/* gc */
apc_cache_wlocked_gc(cache);

/* get available */
available = apc_sma_get_avail_mem(cache->sma);

/* perform expunge processing */
if (!cache->ttl) {
    /* check it is necessary to expunge */
    if (available < suitable) {
        apc_cache_wlocked_real_expunge(cache);
    }

If the smart flag is not set (it's not mentioned on the config page, so I never set it to begin with - see #504), suitable falls back to HALF the segment size, and any insert that finds less available memory than that triggers a full expunge. Since I'm not that deep into the code, I assume I'm missing something here; as this is switchable behaviour, there is surely some reasoning I'm just not getting ;)
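
To illustrate, a minimal PHP sketch of that threshold logic, plugged with the numbers from this setup (the 4K entry size is just an example value):

<?php
// Sketch of the expunge condition from apc_cache.c above, translated to PHP.
$segSize = 2000 * 1024 * 1024; // apc.shm_size
$smart   = 0;                  // apc.smart (default: unset)
$size    = 4 * 1024;           // size of the entry being inserted (example value)

// smart unset -> the threshold is half of the whole segment
$suitable = ($smart > 0) ? $smart * $size : intdiv($segSize, 2);

$available = 999 * 1024 * 1024; // avail_mem just under 1000M, as in the log
if ($available < $suitable) {
    echo "full expunge triggered (threshold: " . ($suitable / 1048576) . "M)\n";
}

With smart unset this matches the log exactly: every time avail_mem crosses below 1000M, the next write wipes the whole cache. It would also explain why raising shm_size doesn't help; it just moves the threshold.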


Update I: ~2 days later:

After setting the smart flag to 1, the purge no longer happens, but after reaching approx. 75% usage the FPM processes stop responding. I assume APCu can no longer find a free block big enough to store the data it needs. I don't see any hint of that in the log files, just a "max_children reached" once requests pile up.
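
Side note: since these failures were invisible in my logs, wrapping writes in something like the sketch below (apcu_store_logged is a made-up helper, not part of APCu) would at least make failed allocations show up in the error log:

<?php
// Hypothetical wrapper: log failed writes so a full or fragmented SMA
// is visible in the FPM error log instead of failing silently.
function apcu_store_logged(string $key, mixed $value, int $ttl = 0): bool
{
    if (apcu_store($key, $value, $ttl)) {
        return true;
    }
    $avail = apcu_sma_info(true)['avail_mem'] / 1048576;
    error_log(sprintf('apcu_store failed for "%s" (avail: %.1fM)', $key, $avail));
    return false;
}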


Update II: ~3 days later:

The problem seems to boil down to the massive fragmentation my implementation caused. After resetting smart to 0 and upping the segment size to 3000M, I no longer run into purges or crashes, because a chunk of storage is still free at the end of the segment. I'm currently at >54% usage without a purge, at 98% fragmentation. After another update that runs tonight, the fragmentation should drop further, because I'm no longer inserting data multiple times per second.
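
For reference, the fragmentation figure can be derived from the SMA free-block lists; a minimal sketch of one way to compute it (a simple largest-block metric; status tools like apc.php use a slightly different formula):

<?php
// Sketch: derive a fragmentation percentage from the SMA free-block lists.
// 0% = one contiguous free block; near 100% = free space shredded into
// many small blocks that large entries no longer fit into.
$si = apcu_sma_info(); // full info, including 'block_lists'

$totalFree = 0;
$largest   = 0;
foreach ($si['block_lists'] as $segmentBlocks) {
    foreach ($segmentBlocks as $block) {
        $totalFree += $block['size'];
        $largest    = max($largest, $block['size']);
    }
}

$frag = $totalFree > 0 ? (1 - $largest / $totalFree) * 100 : 0.0;
printf("fragmentation: %.1f%% (largest free block: %.1fM of %.1fM free)\n",
    $frag, $largest / 1048576, $totalFree / 1048576);

At 98% fragmentation, even 1000M of nominal free space may not contain a single block large enough for a big entry, which would match the stalls from Update I.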

Nevertheless, I think that at least the hard crashes are something to look into. The docs could also state how and why the purge takes place.