Trying my hand at the 1 Billion Row Challenge to see how best I can improve it.
There are multiple binaries named `1brc*`, numbered in ascending order as the code gets progressively optimized. The repo currently contains a 1 million row CSV which is used to run the benchmarks. To generate the 1 billion rows of data, run:

```sh
python3 assets/create_measurements.py 1_000_000_000
```
You can run any of the binaries as:

```sh
./1brc4 -file=<filename>
```

NB: `<filename>` should be replaced by the path to the 1 billion row file generated above. For example:

```sh
./1brc4 -file=measurements.txt
```
You can run the Go benchmarks with:

```sh
go test ./... -bench=.
```

These benchmarks use a copy of the 1 million rows of data. You can also use the `time` command on Linux to test how fast a binary runs:

```sh
time ./1brc4 -file=<filename>
```
These benchmarks were taken on a 2021 M1 Pro with 16GB of RAM.

The first naive implementation took 2m15s.
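For reference, here is a minimal sketch of what that naive version plausibly looked like — this is an assumption pieced together from the optimizations listed below (`bufio.Scanner`, `strings.Split`, `strconv.ParseFloat`), not the repo's exact code:

```go
package main

import (
    "bufio"
    "flag"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// stats accumulates per-station aggregates.
type stats struct {
    min, max, sum float64
    count         int
}

func main() {
    file := flag.String("file", "measurements.txt", "path to the measurements file")
    flag.Parse()

    f, err := os.Open(*file)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    results := map[string]*stats{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Each line looks like "station;temperature".
        parts := strings.Split(scanner.Text(), ";")
        temp, err := strconv.ParseFloat(parts[1], 64)
        if err != nil {
            panic(err)
        }
        s, ok := results[parts[0]]
        if !ok {
            s = &stats{min: temp, max: temp}
            results[parts[0]] = s
        }
        if temp < s.min {
            s.min = temp
        }
        if temp > s.max {
            s.max = temp
        }
        s.sum += temp
        s.count++
    }
    for name, s := range results {
        fmt.Printf("%s: min=%.1f mean=%.1f max=%.1f\n", name, s.min, s.sum/float64(s.count), s.max)
    }
}
```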
- Changed from `strings.Split` to `strings.Cut`. `strings.Split` walks through the whole line looking for every occurrence of the separator (and allocates a slice for the parts), while `strings.Cut` returns as soon as it finds the first one, which is more appropriate here. This reduced the time by circa 45s; the new time is 1m35s. A minimal before/after sketch follows.
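  The sketch below contrasts the two calls on a hypothetical input line:

  ```go
  package main

  import (
      "fmt"
      "strings"
  )

  func main() {
      line := "Hamburg;12.0" // hypothetical input line

      // Before: strings.Split scans the entire line for every ';'
      // and allocates a []string for the parts.
      parts := strings.Split(line, ";")
      fmt.Println(parts[0], parts[1])

      // After: strings.Cut stops at the first ';' and allocates nothing.
      station, rawTemp, _ := strings.Cut(line, ";")
      fmt.Println(station, rawTemp)
  }
  ```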
- Since the temperatures follow a fixed, predictable format, moved away from `strconv.ParseFloat` and wrote my own parser. Also moved from `scanner.Text` to `scanner.Bytes` so I can work with bytes directly. Reduced our time by circa 30s, putting us at around 1m06s now. A sketch of such a parser follows.
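  The repo's parser isn't reproduced here, but assuming the standard 1BRC measurement format (an optional `-`, one or two integer digits, a `.`, and exactly one fractional digit), a hand-rolled parser can skip all of `strconv.ParseFloat`'s generality. One possible version, returning tenths of a degree as an integer — the integer representation is my assumption; the actual parser may well return a float:

  ```go
  package main

  import "fmt"

  // parseTemp parses a measurement like "-12.3" or "4.5" into tenths
  // of a degree (-123, 45). It assumes the fixed 1BRC format: an
  // optional '-', one or two digits, '.', exactly one digit.
  func parseTemp(b []byte) int {
      neg := false
      if b[0] == '-' {
          neg = true
          b = b[1:]
      }
      n := 0
      for _, c := range b {
          if c == '.' {
              continue
          }
          n = n*10 + int(c-'0')
      }
      if neg {
          return -n
      }
      return n
  }

  func main() {
      fmt.Println(parseTemp([]byte("-12.3")), parseTemp([]byte("4.5"))) // -123 45
  }
  ```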
- Moved from `scanner` to `ReadSlice`, since the scanner was doing some extra checks which I don't need here. Also moved from `bytes.Cut` to a custom function `cut` that starts reading the line from the end, so we reach our delimiter faster (see the sketch below).
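  A sketch of what such a `cut` might look like (the real one lives in the repo): the temperature after the `;` is at most a few bytes, while station names can be much longer, so scanning backwards hits the delimiter almost immediately where `bytes.Cut` would walk forward through the whole name:

  ```go
  package main

  import "fmt"

  // cut splits a "station;temp" line by scanning from the end.
  // The temperature is at most 5 bytes (e.g. "-99.9"), so the ';'
  // is found within a few iterations.
  func cut(line []byte) (station, temp []byte) {
      for i := len(line) - 1; i >= 0; i-- {
          if line[i] == ';' {
              return line[:i], line[i+1:]
          }
      }
      return line, nil
  }

  func main() {
      station, temp := cut([]byte("San Francisco;14.2")) // hypothetical line
      fmt.Printf("%s | %s\n", station, temp)
  }
  ```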
- Moved from `ReadSlice` to calling `Read` directly on the file, using a buffer size of 1MB (see the sketch below).
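  Roughly, that loop looks like the sketch below — a simplification, not the repo's exact code. Each 1MB chunk is processed up to its last complete line, and the trailing partial line is carried over to the next read:

  ```go
  package main

  import (
      "bytes"
      "io"
      "os"
  )

  func main() {
      f, err := os.Open("measurements.txt") // hypothetical path
      if err != nil {
          panic(err)
      }
      defer f.Close()

      buf := make([]byte, 1<<20) // 1MB read buffer
      leftover := 0              // bytes of an incomplete trailing line
      for {
          n, err := f.Read(buf[leftover:])
          if n == 0 && err == io.EOF {
              break
          }
          if err != nil && err != io.EOF {
              panic(err)
          }
          chunk := buf[:leftover+n]
          // Only process up to the last complete line in this chunk.
          last := bytes.LastIndexByte(chunk, '\n')
          if last < 0 {
              leftover = len(chunk)
              continue
          }
          processLines(chunk[:last+1])
          // Move the trailing partial line to the front of the buffer.
          leftover = copy(buf, chunk[last+1:])
      }
  }

  // processLines would parse each "station;temp" line and update the
  // aggregates; elided here. The sketch assumes the file ends in '\n'.
  func processLines(data []byte) { _ = data }
  ```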
- Reduced re-allocations by creating slices and maps with their sizes known up front (see the sketch below).
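  For instance — the 10,000 figure is the challenge's stated cap on unique station names, and `stats` refers to the type from the naive sketch above:

  ```go
  // Sized up front so the map never rehashes while it grows.
  results := make(map[string]*stats, 10_000)

  // Later, when collecting keys for sorted output, the exact size is known.
  names := make([]string, 0, len(results))
  ```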