guillaumekln / simdoc

Similar document search using TF-IDF.


simdoc

simdoc associates each document in a dataset with its most similar documents, using the TF-IDF method.
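As a rough illustration of the TF-IDF idea, the sketch below weights each term by its frequency in a document and by the logarithm of its inverse document frequency, then compares two documents with cosine similarity. This is not simdoc's actual code: the whitespace tokenizer, the exact weighting formula, the choice of cosine similarity, and the names `tokenize`, `tfidf`, and `cosine` are all assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Sparse vector: term -> weight.
using SparseVector = std::map<std::string, double>;

// Split a document on whitespace (hypothetical tokenizer; simdoc's may differ).
std::vector<std::string> tokenize(const std::string& text) {
  std::istringstream stream(text);
  std::vector<std::string> tokens;
  std::string token;
  while (stream >> token) tokens.push_back(token);
  return tokens;
}

// Build one TF-IDF vector per document, assuming
// tf = count / document length and idf = log(N / document frequency).
std::vector<SparseVector> tfidf(const std::vector<std::string>& docs) {
  std::vector<SparseVector> vectors(docs.size());
  std::map<std::string, std::size_t> doc_freq;

  for (std::size_t i = 0; i < docs.size(); ++i) {
    const std::vector<std::string> tokens = tokenize(docs[i]);
    for (const std::string& t : tokens) vectors[i][t] += 1.0 / tokens.size();
    // Each distinct term of this document counts once toward its document frequency.
    for (const auto& entry : vectors[i]) ++doc_freq[entry.first];
  }

  for (SparseVector& vec : vectors)
    for (auto& entry : vec)
      entry.second *= std::log(static_cast<double>(docs.size()) / doc_freq[entry.first]);
  return vectors;
}

// Cosine similarity between two sparse vectors (0 if either has zero norm).
double cosine(const SparseVector& a, const SparseVector& b) {
  double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
  for (const auto& entry : a) {
    norm_a += entry.second * entry.second;
    auto it = b.find(entry.first);
    if (it != b.end()) dot += entry.second * it->second;
  }
  for (const auto& entry : b) norm_b += entry.second * entry.second;
  if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

With this weighting, a term that appears in every document gets a weight of zero, so two documents that share only such terms end up with a similarity of zero.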

Requirements

  • C++11
  • Boost
  • Intel TBB
  • CMake

Usage

./simdoc [options] input

where input is either a directory or a text file listing one file path per line.

See the --help option for the complete list of options.

Example

The command

./simdoc -t 8 -c 5 -r data/ > output.json

associates each document in the data directory and its sub-directories with its 5 most similar documents, using 8 threads.

See examples/output.json for an output example.
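Keeping only the few most similar documents per document, as in the example above, amounts to a top-k selection over similarity scores. A minimal sketch, assuming the scores for one query document are already available in a vector indexed by document: `top_k` and `Match` are hypothetical names, and excluding the query document from its own results is an assumption rather than confirmed simdoc behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (document index, similarity score)
using Match = std::pair<std::size_t, double>;

// Return the k highest-scoring documents for one query document,
// excluding the query itself. Hypothetical helper, not simdoc's code.
std::vector<Match> top_k(const std::vector<double>& scores, std::size_t self, std::size_t k) {
  std::vector<Match> matches;
  for (std::size_t i = 0; i < scores.size(); ++i)
    if (i != self) matches.push_back(Match(i, scores[i]));

  // Only the first k entries need to be ordered, so a partial sort is enough.
  k = std::min(k, matches.size());
  std::partial_sort(matches.begin(), matches.begin() + k, matches.end(),
                    [](const Match& a, const Match& b) { return a.second > b.second; });
  matches.resize(k);
  return matches;
}
```

A partial sort avoids fully ordering every candidate when only a handful of results are kept per document.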
