ir33k / billion

My attempt of solving The One Billion Row Challenge in C

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

One Billion Row Challenge
=========================

My attempt in solving The One Billion Row Challenge [1] without Hash
Map in C.  Also I'm doing this on very old ThinkPad X220 laptop [2].

[1] https://github.com/gunnarmorling/1brc/
[2] CPU: Intel i5-2520M (4) @ 3.200GHz 

Rules and limits that I follow, from original repo 1brc/README.md, that
are relevant to solutions not written in Java:

- No external library dependencies may be used.
- Implementations must be provided as a single source file.
- The computation must happen at application _runtime_, i.e. you cannot
  process the measurements file at _build time_ (for instance, when using
  GraalVM) and just bake the result into the binary.
- Input value ranges are as follows:
    - Station name: non null UTF-8 string of min length 1 character and
      max length 100 bytes, containing neither `;` nor `\n` characters.
    - Temperature value: non null double between -99.9 (inclusive) and
      99.9 (inclusive), always with one fractional digit.
- There is a maximum of 10,000 unique station names.
- Line endings in the file are `\n` characters on all platforms.
- Implementations must not rely on specifics of a given data set, e.g. any
  valid station name as per the constraints above and any data distribution
  (number of measurements per station) must be supported.
- The rounding of output values must be done using the semantics of IEEE
  754 rounding-direction "roundTowardPositive".

Build:

	$ ./build

Generate input files with varing number of lines:

	$ ./gen 1000000    > data1m.tmp
	$ ./gen 1000000000 > data1b.tmp

Run solution program with one of input files:

	$ ./solve < data1m.tmp
	$ ./solve < data1b.tmp


Devlog
======

2024-07-07 Sun 10:40
--------------------

I started by writing my own input data generator gen.c.  Just because I
don't want to insetall Java to run the data generators from original repo.
They also have an Python script [1] but the code is very slopy.  I don't want
to use it.  My version is mostly a clone of CreateMeasurements.java [2]
from original repo.

Generation of 1B lines took around 8 minutes and file weights 13 GB.

[1] src/main/python/create_measurements.py
[2] src/main/java/dev/morling/onebrc/CreateMeasurements.java

About

My attempt of solving The One Billion Row Challenge in C


Languages

Language:C 96.9%Language:Shell 3.1%