OOZZY / tlo-file-similarity

Command-line programs that use fuzzy hashing to compare files for similarity and find files that are similar to each other.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

tlo-file-similarity

Command-line programs that use fuzzy hashing to compare files for similarity and find files that are similar to each other.

Build Requirements

  • CMake
  • C++17 development environment for which CMake can generate build files
  • SQLite 3

Clone, Build, and Run

Clone into tlo-file-similarity directory.

$ git clone --branch develop --recursive git@github.com:OOZZY/tlo-file-similarity.git

Build (out of source).

$ mkdir build
$ cd build
$ cmake -G 'Unix Makefiles' -DCMAKE_BUILD_TYPE=Debug ../tlo-file-similarity
$ make

Fuzzy hash all files in the samples directory and all its subdirectories. In the samples directory, Original.txt is a text file containing six paragraphs generated by https://loremipsum.io/. All the other text files were derived from Original.txt through some modifications. The file names describe what modifications were performed to derive the text file.

$ ./tlo-fuzzy-hash ../tlo-file-similarity/samples > hashes.txt

Compare hashes to find similar files.

$ ./tlo-find-similar-hashes hashes.txt
"../tlo-file-similarity/samples/Removed-1st-Half.txt" and "../tlo-file-similarity/samples/Moved-Some-Lines.txt" are about 52.4064% similar.
"../tlo-file-similarity/samples/Removed-1st-Half.txt" and "../tlo-file-similarity/samples/Moved-Some-Words.txt" are about 59.2593% similar.
"../tlo-file-similarity/samples/Removed-1st-Half.txt" and "../tlo-file-similarity/samples/Original.txt" are about 65.5914% similar.
"../tlo-file-similarity/samples/Removed-1st-Half.txt" and "../tlo-file-similarity/samples/Removed-Some-Lines.txt" are about 55.6962% similar.
"../tlo-file-similarity/samples/Removed-1st-Half.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 65.5914% similar.
"../tlo-file-similarity/samples/Removed-2nd-Half.txt" and "../tlo-file-similarity/samples/Moved-Some-Lines.txt" are about 54.2553% similar.
"../tlo-file-similarity/samples/Removed-2nd-Half.txt" and "../tlo-file-similarity/samples/Moved-Some-Words.txt" are about 60% similar.
"../tlo-file-similarity/samples/Removed-2nd-Half.txt" and "../tlo-file-similarity/samples/Original.txt" are about 67.3797% similar.
"../tlo-file-similarity/samples/Removed-2nd-Half.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 66.3102% similar.
"../tlo-file-similarity/samples/Moved-Some-Lines.txt" and "../tlo-file-similarity/samples/Moved-Some-Words.txt" are about 73.0159% similar.
"../tlo-file-similarity/samples/Moved-Some-Lines.txt" and "../tlo-file-similarity/samples/Original.txt" are about 80.3213% similar.
"../tlo-file-similarity/samples/Moved-Some-Lines.txt" and "../tlo-file-similarity/samples/Removed-Some-Lines.txt" are about 61.5385% similar.
"../tlo-file-similarity/samples/Moved-Some-Lines.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 69.0763% similar.
"../tlo-file-similarity/samples/Moved-Some-Words.txt" and "../tlo-file-similarity/samples/Original.txt" are about 89.243% similar.
"../tlo-file-similarity/samples/Moved-Some-Words.txt" and "../tlo-file-similarity/samples/Removed-Some-Lines.txt" are about 65.4709% similar.
"../tlo-file-similarity/samples/Moved-Some-Words.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 75.6972% similar.
"../tlo-file-similarity/samples/Original.txt" and "../tlo-file-similarity/samples/Removed-Some-Lines.txt" are about 74.5455% similar.
"../tlo-file-similarity/samples/Original.txt" and "../tlo-file-similarity/samples/Removed-Some-Words.txt" are about 50.4202% similar.
"../tlo-file-similarity/samples/Original.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 85.4839% similar.
"../tlo-file-similarity/samples/Removed-Some-Lines.txt" and "../tlo-file-similarity/samples/Swapped-3rd-And-4th-Paragraphs.txt" are about 63.6364% similar.

CMake Options

  • TLO_FS_COLORED_DIAGNOSTICS
    • Tell the compiler to use colors in diagnostics (GNU/Clang only)
    • On by default
  • TLO_FS_USE_LIBCPP
    • Use libc++ (Clang only)
    • Off by default
  • TLO_FS_LINK_FS
    • Link to filesystem library of older GNU and Clang (GNU/Clang only)
    • Prior to LLVM 9, using std::filesystem required linker option -lc++fs
    • Prior to GCC 9, using std::filesystem required linker option -lstdc++fs
    • Off by default
  • TLO_FS_SQLITE3_INCLUDE_DIRS and TLO_FS_SQLITE3_LIBRARIES
    • If both are specified (non-empty strings), will search for SQLite 3 headers in the directories specified by TLO_FS_SQLITE3_INCLUDE_DIRS and will link to the libraries specified by TLO_FS_SQLITE3_LIBRARIES
    • Otherwise, find_package(SQLite3 REQUIRED) will be used instead
    • Empty strings by default
  • TLO_FS_ENABLE_TESTS
    • Enable tests
    • On by default

Program Options

tlo-fuzzy-hash

$ ./tlo-fuzzy-hash
Usage: tlo-fuzzy-hash [options] <file or directory>...

Options:
  --database=value
    Store hashes in and get hashes from the database at the specified path (default: no database used).

  --num-threads=value
    Number of threads the program will use (default: 1).

  --verbose
    Allow program to print status updates to stderr (default: off).

tlo-find-similar-hashes

$ ./tlo-find-similar-hashes
Usage: tlo-find-similar-hashes [options] <text file with hashes>...

Options:
  --num-threads=value
    Number of threads the program will use (default: 1).

  --output-format=value
    Output format can be regular, csv (comma-separated values), or tsv (tab-separated values) (default: regular).

  --record-sources
    Record which input text file each hash came from (default: off).

  --similarity-threshold=value
    Display only the file pairs with a similarity score greater than or equal to this threshold (default: 50).

  --verbose
    Allow program to print status updates to stderr (default: off).

Relevant Papers and Projects

About

Command-line programs that use fuzzy hashing to compare files for similarity and find files that are similar to each other.


Languages

Language:C++ 92.7%Language:CMake 7.3%