reshinthadithyan / cpp-near-dedupe

cpp-near-dedupe

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

cpp-near-dedupe

dedupes arrow datasets ( IPC .arrow db's at the moment) using minhash / jaccard similarity scores. Uses multiple threads to speed things up

warning

this project was thrown together in a few days and fueled by large amounts of coffee, so things are getting cleaned / fixed / improved, and there are likely bugs and other oddities in the code

todo

  • contiguous blocks of hashed values to increase speed (memory access, and better parallelization of loops) of comparison checks using jaccard ( this is the long leg at the moment )
  • work stealing to speed up jaccard compare checks when other threads are less busy
  • better readme
  • more commandline arguments for more options
  • better error checking
  • handle various arrow formats
  • handle different CPU intrinsics for more hardware support
  • unit tests
  • check for file write permissions on output folder before its ready to write out at end of crunching
  • clearing output folder on run
  • allowing to operate inplace on a dataset
  • allow continuing from a partially crunched set of data
  • cuda support for even faster fastness

building

  • windows
visual studio 2022 community edition:
see apache arrow installation docs for installing arrow dependencies

open the sln (for sln based ) 
or 
open the folder with the cmakelists.txt file (for cmake based)
  • Linux ( tested on ubuntu WSL2 )
due to accessing windows mounts, i had to sudo every command to avoid errors.
This may not be neccessary for your particular setup, but better safe than sorry.

Cmake:
see apache arrow isntallation docs for installing arrow dependencies

sudo cmake .
sudo make release

running

  • windows/linux
executing the program with no cmd args will display the expected cmd args.

usage: 
CPPDeduper "\path\to\dirWith\arrowIPCdatasets\inSubfolders" "fileExtensionOfArrowIPCDatasets" "dataColumnName" dupeThreshold "outdir\where\nondupes\are\saved"


NOTE: there is no parameter for hash size or n-gram size. they are compiled in for optimization, and default to n-gram size of 5, and 256 finger print hashes
they can be modified in code on these lines:
static constexpr int HASH_LENGTH_SHINGLES = 5; //words used per hash
static constexpr int NUM_HASHES = 256; //number of hashes for comparison



sample windows commandline:

CPPDeduper "D:\\datasets\\folderWithManySubfolders" ".arrow" "text" 0.7 "d:\\dedupOut"

sample linux:
./CPPDeduper "/mnt/d/datasets/folderWithManySubfolders" ".arrow" "text" 0.7 "/mnt/d/carp/dedupOut"

bugs and issues

  • theres very little error handling at the moment, so it can be touchy
  • need to manually clear the output folder / make sure its empty / ensure you have permissions, otherwise std::filesystem:: throws an exception and it fails at the end of crunching.

About

cpp-near-dedupe


Languages

Language:C++ 64.7%Language:C 35.0%Language:CMake 0.3%