lighttransport / nanocsv

Multithreaded header-only C++11 CSV parser

NanoCSV, Faster C++11 multithreaded header-only CSV parser

NanoCSV is a fast, multithreaded, header-only C++11 CSV parser that depends only on the STL. It is designed for CSV data with numeric values.

Status

In development. NanoCSV is not yet recommended for production use.

Requirements

  • C++11 compiler (with thread support)

Usage

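// Define this in only **one** C++ file.
#define NANOCSV_IMPLEMENTATION
#include "nanocsv.h"

#include <cstdlib>   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>  // std::cout
#include <string>    // std::string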

int main(int argc, char **argv)
{
  if (argc < 2) {
    std::cout << "csv_parser_example input.csv (num_threads) (delimiter)\n";
  }

  std::string filename("./data/array-4-5.csv");
  int num_threads = -1; // -1 = use all system threads
  char delimiter = ' '; // delimiter character.

  if (argc > 1) {
    filename = argv[1];
  }

  if (argc > 2) {
    num_threads = std::atoi(argv[2]);
  }

  if (argc > 3) {
    delimiter = argv[3][0];
  }

  nanocsv::ParseOption<float> option;
  option.delimiter = delimiter;
  option.req_num_threads = num_threads;
  option.verbose = true; // verbose messages will be stored in `warn`.
  option.ignore_header = true; // Ignore the header (the first line). default = true.

  std::string warn;
  std::string err;

  nanocsv::CSV<float> csv;

  bool ret = nanocsv::ParseCSVFromFile(filename, option, &csv, &warn, &err);

  if (!warn.empty()) {
    std::cout << "WARN: " << warn << "\n";
  }


  if (!ret) {

    if (!err.empty()) {
      std::cout << "ERROR: " << err << "\n";
    }

    return EXIT_FAILURE;
  }

  std::cout << "num records(rows) = " << csv.num_records << "\n";
  std::cout << "num fields(columns) = " << csv.num_fields << "\n";

  // values are a 1D array of length [num_records * num_fields]
  // std::cout << csv.values[4 * csv.num_fields + 3] << "\n";

  // header string is stored in `csv.header`
  if (!option.ignore_header) {
    for (size_t i = 0; i < csv.header.size(); i++) {
      std::cout << csv.header[i] << "\n";
    }
  }


  return EXIT_SUCCESS;
}
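
The flat layout of `csv.values` can be illustrated without the library. The sketch below assumes the row-major ordering implied by the commented-out `4 * num_fields + 3` expression in the example above. `FlatTable` and `at` are hypothetical names used only for illustration; they are not part of nanocsv.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a parsed result: a flat array of
// num_records * num_fields values (assumed row-major).
struct FlatTable {
  std::size_t num_records;    // rows
  std::size_t num_fields;     // columns
  std::vector<float> values;  // length = num_records * num_fields
};

// Returns the value at (row, col), assuming row-major storage:
// all fields of row 0 come first, then all fields of row 1, and so on.
inline float at(const FlatTable &t, std::size_t row, std::size_t col) {
  return t.values[row * t.num_fields + col];
}
```

With a 2 x 3 table holding 1, 2, 3, 4, 5, 6, `at(t, 1, 2)` reads index `1 * 3 + 2 = 5`, which is the last element.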

NaN, Inf

nanocsv supports parsing

  • nan, -nan as NaN, -NaN
  • inf, -inf as Inf, -Inf
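
Once such values are in the output array, they can be detected with the standard `<cmath>` classification functions. The following is a minimal sketch that works on any `std::vector<float>`; `SpecialCounts` and `count_special` are hypothetical helper names, not nanocsv API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of the special values found in an array of floats.
struct SpecialCounts {
  std::size_t nan_count;
  std::size_t pos_inf_count;
  std::size_t neg_inf_count;
};

inline SpecialCounts count_special(const std::vector<float> &values) {
  SpecialCounts c = {0, 0, 0};
  for (std::size_t i = 0; i < values.size(); i++) {
    const float v = values[i];
    if (std::isnan(v)) {
      // NaN never compares equal to anything, so `v == v` style checks
      // are easy to get wrong; std::isnan is the explicit test.
      c.nan_count++;
    } else if (std::isinf(v)) {
      if (v > 0.0f) {
        c.pos_inf_count++;
      } else {
        c.neg_inf_count++;
      }
    }
  }
  return c;
}
```

The sign of a NaN is not distinguished here; `std::signbit` could be used if `-nan` needs to be told apart from `nan`.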

Support for N/A and null value

By default, missing values (e.g. N/A, including invalid numeric strings, and NaN) are replaced by nan, and null (empty) values (e.g. "") are also replaced by nan.

You can control this behavior with the following parameters in ParseOption.

  • replace_na : Whether to replace N/A and NaN values.
    • na_value : The value that replaces N/A and NaN values.
  • replace_null : Whether to replace null (empty) values.
    • null_value : The value that replaces null values.
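
The replacement rules described above can be sketched as a standalone function. This is not nanocsv's implementation: `FillPolicy` and `convert_field` are hypothetical names that mirror the four parameters, `std::strtof` stands in for the library's own number parser, and returning `false` when replacement is disabled is an assumption made only for this sketch.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical mirror of the four parameters listed above.
struct FillPolicy {
  bool replace_na;
  float na_value;
  bool replace_null;
  float null_value;
};

// Converts one field to float. Returns true if `*out` was written.
// Assumption for this sketch: when a replacement is disabled, the field
// is reported as not converted and `*out` is left untouched.
inline bool convert_field(const std::string &field, const FillPolicy &p,
                          float *out) {
  if (field.empty()) {  // null (empty) field
    if (!p.replace_null) return false;
    *out = p.null_value;
    return true;
  }

  char *end = nullptr;
  const float v = std::strtof(field.c_str(), &end);
  const bool fully_parsed = (end != field.c_str()) && (*end == '\0');

  if (!fully_parsed) {  // N/A or otherwise invalid numeric string
    if (!p.replace_na) return false;
    *out = p.na_value;
    return true;
  }

  if (std::isnan(v) && p.replace_na) {  // a literal NaN in the input
    *out = p.na_value;
    return true;
  }

  *out = v;
  return true;
}
```

For example, with `FillPolicy p = {true, -1.0f, true, 0.0f}`, the fields `"N/A"` and `"nan"` both yield `-1.0f`, an empty field yields `0.0f`, and `"1.5"` yields `1.5f`.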

Parse Text CSV

Parsing text CSV (each field is just a string) is also supported. (It uses a different API. See the source code for details.)

Compiler options

  • NANOCSV_NO_IO : Disable I/O (file access, stdio, mmap).
  • NANOCSV_WITH_RYU : Use the Ryu library (https://github.com/ulfjack/ryu) to parse floating-point strings. This gives precise handling of floating-point values.
    • NANOCSV_WITH_RYU_NOINCLUDE : Do not include Ryu header files in nanocsv.h. This is useful when you want to include Ryu header files outside of nanocsv.h.

TODO

Performance

The dataset is 8192 x 4096, 800 MB in file size (generated by tools/gencsv/gen.py).

  • Threadripper 1950X
  • DDR4 2666 64 GB memory

1 thread

total parsing time: 3833.33 ms
  line detection : 1264.99 ms
  alloc buf      : 0.016351 ms
  parse          : 2508.83 ms
  construct      : 55.726 ms

16 threads

total parsing time: 545.646 ms
  line detection : 159.078 ms
  alloc buf      : 0.077979 ms
  parse          : 337.207 ms
  construct      : 46.7815 ms

23 threads

23 threads are used here because they are faster than 32 threads on the 1950X.

total parsing time: 494.849 ms
  line detection : 127.176 ms
  alloc buf      : 0.050988 ms
  parse          : 314.287 ms
  construct      : 50.7568 ms

Roughly 7.7 times faster than single-thread parsing.
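
The speedup figure follows directly from the total parsing times listed above. A minimal sketch of the arithmetic, with hypothetical helper names `speedup` and `efficiency`:

```cpp
// Speedup = single-thread time / multi-thread time (same units).
inline double speedup(double single_ms, double multi_ms) {
  return single_ms / multi_ms;
}

// Parallel efficiency = speedup / number of threads (1.0 is ideal scaling).
inline double efficiency(double speedup_factor, int num_threads) {
  return speedup_factor / static_cast<double>(num_threads);
}
```

With the totals above, `speedup(3833.33, 494.849)` is about 7.75 for 23 threads and `speedup(3833.33, 545.646)` is about 7.0 for 16 threads, which corresponds to a parallel efficiency of roughly 0.34 and 0.44 respectively. Applying the same ratio per stage shows that the `construct` stage (55.726 ms vs. 50.7568 ms) barely changes with the thread count.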

Note on memory consumption

Not verified, but it should not exceed 3 * file size, so roughly 2.4 GB for the 800 MB dataset above.

In python

Loading the same data with numpy.loadtxt takes 23.4 seconds.

23-thread nanocsv parsing is roughly 40 times faster than numpy.loadtxt.

References

License

MIT License

Third-party license

  • stack_container : Copyright (c) 2006-2008 The Chromium Authors. BSD-style license.
  • acutest : MIT license. Used for unit tests.
  • ryu : Apache 2.0 or Boost 1.0 dual license.
