krakjoe / apcu

APCu - APC User Cache

Expunges happen without hitting limits

boppy opened this issue

Facts first:

  • APCu 5.1.23
  • PHP fpm 8.2.18 (inside a docker container based on php:8.2-fpm)
  • Memory-Type mmap

I observe that the cache expunges way too often, without any hint as to why.

Does anyone have an idea what's going on and how I can mitigate it?


I implemented logging by calling the following script every 30s and writing its results to a log file:

<?php
$ci = apcu_cache_info(true); // limited info: statistics only, no entry list
$si = apcu_sma_info(true);   // limited info: no per-segment block lists

$info = [
    'now' => date('Y-m-d H:i:s'),
    'num_slots' => $ci['num_slots'],
    'num_hits' => $ci['num_hits'],
    'num_misses' => $ci['num_misses'],
    'num_inserts' => $ci['num_inserts'],
    'num_entries' => $ci['num_entries'],
    'expunges' => $ci['expunges'],
    'start_time' => date('c', $ci['start_time']),
    'mem_size_MB' => $ci['mem_size'] / 1024 / 1024,
    'avail_mem_MB' => $si['avail_mem'] / 1024 / 1024,
    'max_mem_MB' => ($si['num_seg'] * $si['seg_size']) / 1024 / 1024
];

echo implode("\t", $info);
echo PHP_EOL;

I then analyze the output with the following awk one-liner, which prints the last line before and the first line after every change in column 8 (the expunge counter):

awk '$8 != acht { print lali; print; print "" } { acht=$8; lali=$0 }' apcu*log

My output is:

Date Time Slots Hits Misses Inserts Entries Expunges Start Time Mem Size (MB) Avail Mem (MB) Max Mem (MB)
2024-05-12 00:46:17 163841 6353095 1 3997021 115772 1 2024-05-11T16:58:57+02:00 995.14669799805 1000.0436172485 1999.9998855591
2024-05-12 00:46:47 163841 4936 2 3738 429 2 2024-05-12T00:46:20+02:00 15.228881835938 1983.5077209473 1999.9998855591
2024-05-12 07:40:01 163841 5641487 2 3533330 110189 2 2024-05-12T00:46:20+02:00 995.19388580322 1000.1700973511 1999.9998855591
2024-05-12 07:40:31 163841 7425 1 6298 919 3 2024-05-12T07:40:06+02:00 21.448593139648 1969.1580581665 1999.9998855591
2024-05-12 13:18:07 163841 5769070 1 3800976 109887 3 2024-05-12T07:40:06+02:00 995.13430023193 1000.2423477173 1999.9998855591
2024-05-12 13:18:37 163841 9731 0 8950 1105 4 2024-05-12T13:18:12+02:00 25.843650817871 1963.471206665 1999.9998855591
2024-05-12 19:01:42 163841 5365891 0 3428149 112283 4 2024-05-12T13:18:12+02:00 994.80582427979 1000.4887771606 1999.9998855591
2024-05-12 19:02:12 163841 1818 0 1948 405 5 2024-05-12T19:02:03+02:00 16.132225036621 1982.6050491333 1999.9998855591
2024-05-13 00:32:43 163841 4912471 0 3072161 109755 5 2024-05-12T19:02:03+02:00 995.37220001221 1000.0014648438 1999.9998855591
2024-05-13 00:33:13 163841 12136 1 10872 1528 6 2024-05-13T00:32:43+02:00 33.372436523438 1955.4794082642 1999.9998855591

So basically it always expunges as soon as available memory drops to 1000M, which is exactly half of the 2000M segment.

I don't get why it's doing so. My entries hint is 160k (shown as 163841 in the "Slots" column), while the cache only holds around 100-130k entries (the "Entries" column). At first I thought the problem was that I had only assigned 1000M to APCu, so I upped it to 2000M, but the expunges still happen just as often.

Any hints are highly appreciated!

Full Config
[APCu]
apc.enabled = 1
apc.enable_cli = 1
apc.shm_size = 2000M
apc.shm_segments = 1
apc.shm_strings_buffer = 64M
apc.gc_ttl = 30
apc.entries_hint = 160000
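
For completeness, a quick way to confirm the FPM workers actually picked these values up (plain ini_get calls; apc.ttl and apc.smart are listed too because their defaults matter below):

<?php
// Print the effective APCu settings as PHP sees them at runtime, since
// docker php images often load additional .ini files on top of this one.
$keys = ['apc.enabled', 'apc.enable_cli', 'apc.shm_size', 'apc.shm_segments',
         'apc.gc_ttl', 'apc.entries_hint', 'apc.ttl', 'apc.smart'];
foreach ($keys as $key) {
    printf("%-18s = %s\n", $key, var_export(ini_get($key), true));
}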

Additional findings:

apcu/apc_cache.c

Lines 755 to 768 in 1ba5a2d

suitable = (cache->smart > 0L) ? (size_t) (cache->smart * size) : (size_t) (cache->sma->size/2);

/* gc */
apc_cache_wlocked_gc(cache);

/* get available */
available = apc_sma_get_avail_mem(cache->sma);

/* perform expunge processing */
if (!cache->ttl) {
    /* check it is necessary to expunge */
    if (available < suitable) {
        apc_cache_wlocked_real_expunge(cache);
    }

If the smart flag is not set (it's not mentioned on the config page, so I never set it to begin with - see #504), suitable falls back to HALF the segment size, and any insert that finds less available memory than that triggers a full expunge. Since I'm not that deep into the code, I assume I'm missing something here; as this is switchable behaviour, there is surely some reasoning I'm just not getting ;)
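
To illustrate, a minimal PHP sketch of that threshold logic, plugged with the numbers from this setup (the 4K entry size is just an example value):

<?php
// Sketch of the expunge condition from apc_cache.c above, translated to PHP.
$segSize = 2000 * 1024 * 1024; // apc.shm_size
$smart   = 0;                  // apc.smart (default: unset)
$size    = 4 * 1024;           // size of the entry being inserted (example value)

// smart unset -> the threshold is half of the whole segment
$suitable = ($smart > 0) ? $smart * $size : intdiv($segSize, 2);

$available = 999 * 1024 * 1024; // avail_mem just under 1000M, as in the log
if ($available < $suitable) {
    echo "full expunge triggered (threshold: " . ($suitable / 1048576) . "M)\n";
}

With smart unset this matches the log exactly: every time avail_mem crosses below 1000M, the next write wipes the whole cache. It would also explain why raising shm_size doesn't help; it just moves the threshold.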


Update I: ~2 days later:

After setting the smart flag to 1, the purge no longer happens, but after reaching approx. 75% usage the FPM processes stop responding. I assume APCu can no longer find a free block big enough to store the data it needs. I don't see any hint of that in the log files, just a "max_children reached" once requests pile up.
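
Side note: since these failures were invisible in my logs, wrapping writes in something like the sketch below (apcu_store_logged is a made-up helper, not part of APCu) would at least make failed allocations show up in the error log:

<?php
// Hypothetical wrapper: log failed writes so a full or fragmented SMA
// is visible in the FPM error log instead of failing silently.
function apcu_store_logged(string $key, mixed $value, int $ttl = 0): bool
{
    if (apcu_store($key, $value, $ttl)) {
        return true;
    }
    $avail = apcu_sma_info(true)['avail_mem'] / 1048576;
    error_log(sprintf('apcu_store failed for "%s" (avail: %.1fM)', $key, $avail));
    return false;
}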


Update II: ~3 days later:

The problem seems to boil down to the massive fragmentation my implementation caused. After resetting smart to 0 and upping the segment size to 3000M, I no longer run into purges or crashes, because a chunk of storage is still free at the end of the segment. I'm currently at >54% usage without a purge, at 98% fragmentation. After another update that runs tonight, the fragmentation should drop further, because I'm no longer inserting data multiple times per second.
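
For reference, the fragmentation figure can be derived from the SMA free-block lists; a minimal sketch of one way to compute it (a simple largest-block metric; status tools like apc.php use a slightly different formula):

<?php
// Sketch: derive a fragmentation percentage from the SMA free-block lists.
// 0% = one contiguous free block; near 100% = free space shredded into
// many small blocks that large entries no longer fit into.
$si = apcu_sma_info(); // full info, including 'block_lists'

$totalFree = 0;
$largest   = 0;
foreach ($si['block_lists'] as $segmentBlocks) {
    foreach ($segmentBlocks as $block) {
        $totalFree += $block['size'];
        $largest    = max($largest, $block['size']);
    }
}

$frag = $totalFree > 0 ? (1 - $largest / $totalFree) * 100 : 0.0;
printf("fragmentation: %.1f%% (largest free block: %.1fM of %.1fM free)\n",
    $frag, $largest / 1048576, $totalFree / 1048576);

At 98% fragmentation, even 1000M of nominal free space may not contain a single block large enough for a big entry, which would match the stalls from Update I.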

Nevertheless, I think that at least the hard crashes are something to look into. The docs could also state how and why the purge takes place.