facebook / CacheLib

Pluggable in-process caching engine to build and scale high performance services

Home Page:https://www.cachelib.org


How can I better understand why allocations sometimes fail

Chokoabigail opened this issue · comments

This is my cache implementation:

int32_t PutKey(CacheKey key, const std::string& value)
{
	size_t valueSize = value.size();
	size_t numChunks = GetChunkNUmber(valueSize);

	// The parent item will hold all the pointers to the chained items
	size_t parentItemSize = numChunks * sizeof(void*);
	if (parentItemSize > gMaxChainedSize)
	{
		// For now - this is too big; later we will implement a chain of chains (currently we don't support > 2GB)
		if (gDbg)
		{
			cout << "[-][PutKey] Error parentItemSize:" << parentItemSize << " is bigger than gMaxChainedSize:" << gMaxChainedSize << endl;
		}

		return 2;
	}

	// Get parent
	auto parentItemHandle = gCache_->allocate(defaultPool_, key, parentItemSize);
	if (!parentItemHandle)
	{
		if (gDbg)
		{
			cout << "[-][PutKey] Error we didn't succeed to get the parentItemHandle" << endl;
		}

		return 3;
	}

	// Fill in the chunk number in the parent
	// CustomParentItem* parentItem = reinterpret_cast<CustomParentItem*>(parentItemHandle->getMemory());
	// parentItem->numChunks = numChunks;

	// Get char * representation of the data
	char* ourValue = const_cast<char*>(value.c_str());

	// Create the chunk - Now split user data into chunks and cache them
	for (size_t i = 0; i < numChunks; ++i) {
		size_t chunkSize = std::min((size_t)(gMaxChainedSize), valueSize);
		auto chainedItemHandle = gCache_->allocateChainedItem(parentItemHandle, chunkSize);
		if (!chainedItemHandle)
		{
			//We failed to allocate the chunk
			if (gDbg)
			{
				cout << "[-][PutKey] Error we didn't succeed to allocateChainedItem for size:" << chunkSize << endl;
			}
			return 4;
		}

		// Compute user data offset and copy data over
		char * dataOffset = ourValue + (gMaxChainedSize * i);
		std::memcpy(chainedItemHandle->getMemory(), dataOffset, chunkSize);

		// Add this chained item to the parent item
		gCache_->addChainedItem(parentItemHandle, std::move(chainedItemHandle));
		valueSize -= chunkSize;
	}

	// Now, make the parent item visible to others
	gCache_->insertOrReplace(parentItemHandle);

	return 1; // 1 == Success
}
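
For reference, the chunk-count and per-chunk-size arithmetic used above can be exercised in isolation. This is a minimal stand-alone sketch; `kMaxChainedSize`, `GetChunkNumber`, and `ChunkSizes` are illustrative stand-ins for the globals and helpers in the snippet, not CacheLib API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical chunk limit; the real gMaxChainedSize comes from the cache setup.
constexpr std::size_t kMaxChainedSize = 1024;

// Ceiling division of the value size by the chunk limit.
std::size_t GetChunkNumber(std::size_t valueSize) {
  return (valueSize + kMaxChainedSize - 1) / kMaxChainedSize;
}

// Mirrors the loop in PutKey: each chunk is min(limit, remaining bytes).
std::vector<std::size_t> ChunkSizes(std::size_t valueSize) {
  std::vector<std::size_t> sizes;
  for (std::size_t i = 0, n = GetChunkNumber(valueSize); i < n; ++i) {
    std::size_t chunk = std::min(kMaxChainedSize, valueSize);
    sizes.push_back(chunk);
    valueSize -= chunk;
  }
  return sizes;
}
```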

When I run this, I occasionally get 3 or 4 in response (i.e., allocate/allocateChainedItem failed).
A. How should I handle that? Should I retry allocate/allocateChainedItem before returning the error code?
B. How can I debug the root cause of this?

p.s.
My cache is a hybrid cache (with NVMe); in the tests I assign 100 MB to the RAM part.
These are my full settings:


// Set NVMe cache
nvmConfig.navyConfig.setBlockSize(4096); // Default is 4096 - device block size in bytes (minimum IO granularity)
nvmConfig.navyConfig.setSimpleFile(gCacheFilePath,
                                   858993459200, /*fileSize*/
                                   false         /*truncateFile*/);

// Set the block cache - Large Object Cache - caches objects that are larger than KBs in size
nvmConfig.navyConfig.blockCache().setRegionSize(33554432);

// Set admission policy
nvmConfig.navyConfig.enableRandomAdmPolicy().setAdmProbability(0.9); // acceptance probability; must be in the range [0, 1]
nvmConfig.enableFastNegativeLookups = true;

// Set the Small Object Cache - caches objects that are 100s of bytes
nvmConfig.navyConfig.bigHash()
    .setSizePctAndMaxItemSize(5, 4052) // bigHashSizePct (0-100): % of the file bigHash will use; bigHashSmallItemMaxSize must be smaller than the bucket size (4096 by default)
    .setBucketSize(4096)
    .setBucketBfSize(8); // bloom filter size per bucket, default is 8

// RAM config
config
    .setCacheSize(1073741824) // 1 GB = 1 * 1024 * 1024 * 1024
    .setCacheName("My Case")
    .setAccessConfig(40000000) // assuming ~40 million cached items = {25 /* bucket power */, 10 /* lock power */}
    .enableNvmCache(nvmConfig)
    .validate(); // will throw if the config is bad
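
For context on the admission-policy line above: setAdmProbability(0.9) means each write is admitted to the NVM cache with 90% probability. A minimal stand-alone model of that behavior (the class name, seed, and structure are illustrative, not CacheLib's implementation):

```cpp
#include <random>

// Toy model of a random admission policy: each item is admitted to the
// NVM cache with the configured probability. Illustration only.
class RandomAdmissionPolicy {
 public:
  explicit RandomAdmissionPolicy(double probability, unsigned seed = 42)
      : dist_(probability), rng_(seed) {}

  // Returns true if the item should be written to the NVM cache.
  bool accept() { return dist_(rng_); }

 private:
  std::bernoulli_distribution dist_;
  std::mt19937 rng_;
};
```

Over many writes, roughly 90% are admitted; the other 10% stay RAM-only until evicted.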


Hi @Chokoabigail,

allocate can fail if CacheAllocator cannot find items to evict at the end of the eviction queue. The maximum number of retries (say N) is determined by evictionSearchTries, and the eviction fails if none of the N items at the end of the eviction queue for the given allocation class are eligible to be evicted.

To debug this, you can check how many handles are outstanding by checking CacheAllocator::getNumActiveHandles.

If the number is much higher than expected, you need to take a look at the client code for any leaking handles.

Otherwise if the number is expected, I think you can retry after some delay.
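
If the outstanding-handle count looks normal and you choose to retry, a bounded retry with a short delay is one way to do it. A generic sketch, assuming a caller-supplied predicate; the helper name, try count, and delay are arbitrary choices, not CacheLib API:

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Retries `attempt` up to `maxTries` times, sleeping `delay` between tries.
// `attempt` should return true on success (e.g. a non-null item handle).
bool RetryWithDelay(const std::function<bool()>& attempt,
                    std::size_t maxTries = 3,
                    std::chrono::milliseconds delay = std::chrono::milliseconds(10)) {
  for (std::size_t i = 0; i < maxTries; ++i) {
    if (attempt()) {
      return true;
    }
    if (i + 1 < maxTries) {
      std::this_thread::sleep_for(delay);
    }
  }
  return false;
}
```

In PutKey this could wrap the gCache_->allocate and gCache_->allocateChainedItem calls, returning the existing error codes only once the retries are exhausted.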

Thanks.

The client code is actually very simple; the most complex function is PutKey, and in it there is a concern about memory leaks.

What should I do with this code:

auto chainedItemHandle = gCache_->allocateChainedItem(parentItemHandle, chunkSize);
if (!chainedItemHandle)
{
...
}

It can potentially fail after several successful chainedItemHandle allocations (or none). Should I call gCache_->remove(key); with the same key I used in auto parentItemHandle = gCache_->allocate(defaultPool_, key, parentItemSize);?
Should I call free on parentItemHandle? Should I free every successfully allocated chainedItemHandle before returning?

This is the rest of the client code:


std::string GetCacheStatus()
{
	std::string res;
	std::string separator = "[::]";

	res += "isNvmCacheEnabled: " + std::to_string(gCache_->isNvmCacheEnabled()) + separator;

	// Get the statistics map
	auto statsMap = gCache_->getNvmCacheStatsMap();
	for (const auto& kv : statsMap.getCounts()) {
		std::string k = "statsMap.getCounts()" + kv.first + ":";
		res += k + std::to_string(kv.second) + separator;
	}

	// Get the rate map
	for (const auto& kv : statsMap.getRates()) {
		std::string k = "statsMap.getRates()" + kv.first + ":";
		res += k + std::to_string(kv.second) + separator;
	}

	return res;
}



bool RemoveKey(CacheKey key)
{
	return gCache_->remove(key) == Cache::RemoveRes::kSuccess;
}

std::string GetKey(CacheKey key)
{
	std::string result;
	uint32_t fragmentNumber = 0;

	auto parent = gCache_->find(key);
	if (!parent)
	{
		return result;
	}

	auto iobuf_chainedItems = gCache_->convertToIOBuf(std::move(parent));

	for (const auto& item : iobuf_chainedItems)
	{
		// The first item is the parent item, so we skip it
		if (fragmentNumber == 0)
		{
			fragmentNumber++;
			continue;
		}

		// Reconstruct the value
		folly::StringPiece sp2{item};
		result += sp2;
	}

	return result;
}

@Chokoabigail As long as the handle is disposed, you are good. Those items have zero references, so they should be eligible for eviction.

I can see you allocated 100 MB for the RAM cache, meaning you would have only 24 slabs or so. Some allocation classes might not have enough slabs allocated, or even zero. I would suggest checking the distribution of slabs across allocation classes and which allocation class failed.
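
The slab estimate above follows from CacheLib slabs being 4 MB each: a 100 MB RAM cache yields at most 25 slabs before metadata overhead, and each allocation class draws whole slabs from that pool, so some classes can end up with very few or none. A quick arithmetic sketch (the helper is illustrative):

```cpp
#include <cstddef>

// CacheLib slabs are 4 MB each; part of the cache budget goes to metadata,
// so the usable slab count is slightly below cacheSize / slabSize.
constexpr std::size_t kSlabSize = 4 * 1024 * 1024;

constexpr std::size_t MaxSlabCount(std::size_t cacheSizeBytes) {
  return cacheSizeBytes / kSlabSize;
}
```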

How can I do that? (p.s. In production, we have the same client with 16 GB of RAM, and it also fails there from time to time (once a month or so))

@Chokoabigail You can refer to ACStats, which is part of the PoolStats returned by CacheAllocator::getPoolStats(...)

I noticed that there is no public support for printing those stats. I think you can refer to the cachebench implementation (the -report_ac_memory_usage_stat option)

Let me close this for now. Feel free to reopen or open a new issue if needed