bgfeldm / Csv2Lucene

Fast CSV file indexer for Lucene Index

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Csv2Lucene

Note: code is largely a proof of concept.

Description:
   Csv2Lucene's goal is to bulk index a large amount of huge CSV files quickly.
   The focus is on the record level and not the file level when creating threads. 
   Multi-Threading on record lines instead of files has advantages when it comes
   to speed as well as recovery.

Requirements:
   - Quickly index a large amount of huge CSV files.
   - Thread on record lines not on files.

Advantages of threading on the record level instead of by file:
   - Working on a single huge database dump file.
   - Faster to index until the very end, keeping all threads busy until the last few lines
   - Simpler recovery from abrupt application halts, since we are reading a smaller set of 
   files at a time. 

Note: More than one file can be read at a time, when reading the tail end of one file
the beginning of the next file is read, keeping a continuous flow of record lines 
until the last record of the last file is read.

About

Fast CSV file indexer for Lucene Index


Languages

Language:Java 99.9%Language:Shell 0.1%