CarlWright / csv

csv parser, optimized for performance

CSV-file parser
***************

Copyright (C) 2012, Dmitry Kolesnikov

   This file is free documentation; unlimited permission is given to copy,
distribute and modify the documentation.


   This library is free software; you can redistribute it and/or modify
it under the terms of the 3-clause BSD License (the "License");
as published by http://www.opensource.org/licenses/BSD-3-Clause.


Introduction
============

    A simple CSV-file parser based on an event model. The parser generates an
    event/callback each time a CSV line is parsed. The parser supports both
    sequential and parallel parsing. The major goal is the performance of the
    intake procedure, with a parsing target of 3 - 4 microseconds per line on
    the reference hardware.

                           Acc
                       +--------+
                       |        |
                       V        |
                  +---------+   |
    ----Input---->| Parser  |--------> AccN
          +       +---------+
         Acc0          |
                       V
                   Event Line 

    The parser takes as input a binary stream, an event handler function and
    an initial state/accumulator. The event function is evaluated against the
    current accumulator and the parsed line of the csv-file. Note: the
    accumulator allows application-specific state to be carried through
    successive calls of the event function.

Compile and build
=================

    For compilation use

    make

    If you would like to build the natively compiled version, use

    make BUILD=native

    If you previously built the normal version, don't forget to clean first,
    or use

    make clean compile BUILD=native

Interface
=========   
   Briefly, the sequence of operations for data parse/intake is as follows;
   see the src/csv.erl file for the detailed interface specification and/or
   the example parser at priv/csv_example.erl
   
   %% Define an event function that takes two arguments: the line value and
   %% the accumulator. The function shall return the new accumulator state.
   %% The structure of the accumulator is application specific; it might
   %% vary from an integer to a complex record. This example assumes a
   %% record such as -record(my_record, {count = 0}).
   Fun = fun({line, L}, #my_record{count = C} = Acc0) ->
      do_my_intake_to_somewhere(lists:reverse(L)),
      Acc0#my_record{count = C + 1};
            (_, Acc) -> Acc
   end
   
   %%
   %% A sequential parse: parses the whole data stream in the client process.
   csv:parse(CSV, Fun, #my_record{})
   
   %%
   %% A parallel parse: splits the CSV into multiple chunks, spawns
   %% multiple processes (one process per chunk), and aggregates the
   %% results in the client process.
   csv:parse(CSV, 20, Fun, #my_record{})
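
   The parallel mode (split into chunks, one process per chunk, collect in
   the client) can be sketched in a self-contained form. This is an
   illustration under stated assumptions, not the code in src/csv.erl: the
   module csv_par_sketch and the functions pfold/4, chunks/2 and fold/3 are
   hypothetical, chunks are cut on newline boundaries, and the sketch returns
   the per-chunk accumulators as a list and leaves combining them to the
   caller.

```erlang
-module(csv_par_sketch).
-export([pfold/4]).

%% Hypothetical sketch, not the library's implementation.
%% Splits Bin into at most N chunks on line boundaries, folds every chunk
%% in its own process starting from Acc0, and returns the per-chunk
%% accumulators in chunk order.
pfold(Bin, N, Fun, Acc0) when is_binary(Bin), is_integer(N), N > 0 ->
    Parent = self(),
    Chunks = chunks(Bin, max(1, byte_size(Bin) div N)),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, fold(Chunk, Fun, Acc0)} end),
                Ref
            end || Chunk <- Chunks],
    %% Selective receive on each reference keeps the results in chunk order.
    [receive {Ref, Acc} -> Acc end || Ref <- Refs].

%% Cut Bin into pieces of at least Size bytes, each ending just after a
%% newline, so that no line is split across two chunks.
chunks(<<>>, _Size) ->
    [];
chunks(Bin, Size) when byte_size(Bin) =< Size ->
    [Bin];
chunks(Bin, Size) ->
    Scope = {Size, byte_size(Bin) - Size},
    case binary:match(Bin, <<"\n">>, [{scope, Scope}]) of
        {Pos, 1} ->
            {Chunk, Rest} = split_binary(Bin, Pos + 1),
            [Chunk | chunks(Rest, Size)];
        nomatch ->
            [Bin]
    end.

%% Naive per-chunk fold: one {line, Fields} event per non-empty line.
fold(Chunk, Fun, Acc0) ->
    lists:foldl(
      fun(<<>>, Acc) -> Acc;
         (Line, Acc) ->
              Fun({line, binary:split(Line, <<",">>, [global])}, Acc)
      end,
      Acc0,
      binary:split(Chunk, <<"\n">>, [global])).
```

   With a counting event function, the caller combines the partial results
   itself, for instance with lists:sum/1. Because every chunk starts from the
   same Acc0, the accumulator in this sketch must be something that can be
   merged after the fact (a counter, a list, a per-chunk summary).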
   
Performance
===========

   Reference platform: 
     * MacMini, Lion Server
     * 1x Intel Core i7 (2 GHz), 4 cores
     * L2 Cache 256KB per core
     * L3 Cache 6MB
     * Memory 4GB 1333 MHz DDR3
     * Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
     * Erlang R15B + native build of the library
   
   The data set has the following pattern: key, date, time, float numbers and
   a zz suffix
     * key{1..300 000},2012-03-25,23:26:15.543,166.280,...,zz
   
   The number of keys is 300,000, and the number of float fields is 8, 24 or
   40 in the reference data. The reference data set is generated by the
   command

   make example

   or

   perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt
   
   
   version 0.0.1
   
   E/Parse         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     350.000         1.16
   300K, 24 flds     50.42      489.303     697.739         2.33
   300K, 40 flds     77.43      780.296     946.003         3.15
   
   
   ET/hash         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     384.598         1.28
   300K, 24 flds     50.42      489.303     761.414         2.54
   300K, 40 flds     77.43      780.296    1047.329         3.49
   
   
   ET/tuple         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     228.306         0.76
   300K, 24 flds     50.42      489.303     601.025         2.00
   300K, 40 flds     77.43      780.296     984.676         3.28
   
   ETL/ets          Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722    1489.543         4.50
   300K, 24 flds     50.42      489.303    2249.689         7.50
   300K, 40 flds     77.43      780.296    2519.401         8.39
   
   ETL/pts          Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     592.886         1.98
   300K, 24 flds     50.42      489.303    1190.745         3.97
   300K, 40 flds     77.43      780.296    1734.898         5.78
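
   The Per Line column appears to be the Handle time divided by the number
   of lines; for example 697.739 ms / 300 000 lines is about 2.33 us. A
   comparable figure can be obtained with timer:tc/1 from the standard
   library. The sketch below is an assumption about how such a measurement
   could be taken, not the benchmark code used for the tables; the module
   csv_timing_sketch and the function per_line_us/2 are hypothetical.

```erlang
-module(csv_timing_sketch).
-export([per_line_us/2]).

%% Hypothetical helper, not part of the csv library.
%% Runs Fun (a zero-arity fun that performs the parse) once and returns the
%% elapsed time divided by the number of lines, in microseconds per line.
per_line_us(Fun, Lines)
  when is_function(Fun, 0), is_integer(Lines), Lines > 0 ->
    {Micros, _Result} = timer:tc(Fun),   % timer:tc/1 reports microseconds
    Micros / Lines.
```

   A call could look like
   csv_timing_sketch:per_line_us(fun() -> csv:parse(CSV, Fun, Acc0) end, 300000),
   where CSV, Fun and Acc0 are the values described in the Interface section.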
