mversiotech / fastcdc

An implementation of the FastCDC algorithm in Go

Home Page: https://codeberg.org/mhofmann/fastcdc

FastCDC

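This is a Go implementation of the FastCDC algorithm for content-defined chunking (CDC). CDC is a technique used in data deduplication and storage systems to split data into variable-sized chunks whose boundaries are derived from the content itself rather than from fixed block offsets. Because the boundaries follow the content, an insertion or deletion only shifts the chunk boundaries near the change, so unchanged data keeps producing identical chunks and deduplicates more effectively.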

Usage

go get -u codeberg.org/mhofmann/fastcdc

Evaluation

The implementation can be used with the parameters from the FastCDC paper or with user-provided minimum, average, and maximum chunk sizes. For comparison, the following table shows the number and size of chunks produced when the same set of test files is chunked with different parameters. The numbers in each chunker name give the parameters used: for example, "2k-8k-64k" is a chunker with a minSize of 2 KB, an avgSize of 8 KB, and a maxSize of 64 KB. The "reference" chunker uses the parameters from the paper (2k-8k-64k). The test corpus had a total uncompressed size of 8182081670 bytes (~7.6 GiB) and consisted of technical manuals and drawings in PDF format and tarballs containing the source code of 5 different versions of the Linux kernel.

Results

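| Chunker     | Num. of chunks | Avg. chunk size (bytes) | Deduplicated size (bytes) | Deduplication ratio |
|-------------|----------------|-------------------------|---------------------------|---------------------|
| reference   | 480942         | 9831                    | 4727992736                | 1.73                |
| 2k-16k-64k  | 271140         | 19136                   | 5188457451                | 1.58                |
| 2k-32k-64k  | 150041         | 37254                   | 5589600419                | 1.46                |
| 2k-64k-128k | 80946          | 73123                   | 5919028334                | 1.38                |
| 4k-8k-64k   | 471195         | 10107                   | 4762223233                | 1.72                |
| 4k-16k-64k  | 266596         | 19503                   | 5199438463                | 1.57                |
| 4k-32k-64k  | 148487         | 37669                   | 5593355619                | 1.46                |
| 4k-64k-128k | 80332          | 73701                   | 5920577574                | 1.38                |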

In terms of pure deduplication performance, the reference parameters (2k-8k-64k) yielded the best result on the test dataset, with a deduplication ratio of 1.73. For storage systems where chunks are stored compressed, El-Shimi et al. suggest that larger CDC chunk sizes can improve the performance of the compression algorithm and thereby reduce the effective storage size. Whether, and to what extent, this applies to the FastCDC algorithm has not yet been tested.

License

BSD-2-Clause. See LICENSE for details.
