fd0/split

Split large files into smaller ones using the same Content Defined Chunking algorithm the restic backup program uses.

Build (using Go >= 1.11):

$ go build

Sample usage:

$ ./split --verbose --output /tmp --input /tmp/data
next chunk offset 0, 814244 bytes written to /tmp/split-000
next chunk offset 814244, 1649886 bytes written to /tmp/split-001
next chunk offset 2464130, 3332485 bytes written to /tmp/split-002
next chunk offset 5796615, 1996103 bytes written to /tmp/split-003
[...]
next chunk offset 101940538, 700441 bytes written to /tmp/split-069
next chunk offset 102640979, 533829 bytes written to /tmp/split-070
next chunk offset 103174808, 537761 bytes written to /tmp/split-071
next chunk offset 103712569, 1145031 bytes written to /tmp/split-072
wrote 104857600 bytes to 73 files

Using cat, we can put the file back together (and use sha256sum to verify it's the same data):

$ cat /tmp/split-* | sha256sum
66b9d2c5de34170b93f387988a80fb600717da5e437dcc4da1025343fb9019a1  -

$ sha256sum /tmp/data
66b9d2c5de34170b93f387988a80fb600717da5e437dcc4da1025343fb9019a1  /tmp/data

Check out the help for other options:

$ ./split -h
Usage of split:
  -i, --input file     Read from file instead of stdin
  -u, --max-size n     Set maximal chunk size to n bytes (default 8388608)
  -l, --min-size n     Set minimal chunk size to n bytes (default 524288)
  -o, --output dir     Write files to directory dir instead of the current directory (default ".")
  -p, --polynomial p   Use polynomial p for splitting (hex notation, no prefix) (default "3DA3358B4DC173")
  -t, --template s     Use s as the (printf-style) template for output files (default "split-%03d")
  -v, --verbose        Be verbose

The library used for this program is https://github.com/restic/chunker

If you're interested in the mathematical foundation for Content Defined Chunking with Rabin Fingerprints, head over to the restic blog, which has an introductory article.
