bgfeldm/Csv2Lucene

Csv2Lucene

Note: code is largely a proof of concept.

Description:
   Csv2Lucene's goal is to bulk index a large amount of huge CSV files quickly.
   The focus is on the record level and not the file level when creating threads. 
   Multi-Threading on record lines instead of files has advantages when it comes
   to speed as well as recovery.

Requirements:
   - Quickly index a large amount of huge CSV files.
   - Thread on record lines not on files.

Advantages of threading on the record level instead of by file:
   - Working on a single huge database dump file.
   - Faster to index until the very end, keeping all threads busy until the last few lines
   - Simpler recovery from abrupt application halts, since we are reading a smaller set of 
   files at a time. 

Note: More than one file can be read at a time, when reading the tail end of one file
the beginning of the next file is read, keeping a continuous flow of record lines 
until the last record of the last file is read.
bgfeldm / Csv2Lucene

About

Languages