gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

Home Page: https://www.morling.dev/blog/one-billion-row-challenge/

Incorrect results with short city names

viliam-durina opened this issue

final long partialWord = word & ((mask >> 7) - 1);

@royvanrijn

This code in the (currently) best solution will give incorrect results if two semicolons happen to fall within a single 8-byte word. This is possible because a minimal record is 6 bytes including the newline, e.g.:

a;0.0
a;0.0
...
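To make the concern concrete, here is a minimal sketch of what the quoted expression does when an 8-byte word contains two semicolons. It assumes the word is loaded little-endian and that `mask` comes from the usual SWAR zero-byte trick, which sets the high bit of every byte equal to `;`. The class and the helpers `loadWord` and `delimiterMask` are hypothetical names for illustration, not code from the solution; only the `partialWord` expression is taken from the issue.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class PartialWordDemo {
    // Hypothetical helper: read exactly 8 ASCII bytes as a little-endian long.
    static long loadWord(String eightBytes) {
        byte[] bytes = eightBytes.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Assumed mask computation (standard SWAR trick): the high bit of each
    // byte is set where that byte equals ';' (0x3B).
    static long delimiterMask(long word) {
        long match = word ^ 0x3B3B3B3B3B3B3B3BL;
        return (match - 0x0101010101010101L) & ~match & 0x8080808080808080L;
    }

    // The expression quoted in the issue, unchanged.
    static long partialWord(long word) {
        long mask = delimiterMask(word);
        return word & ((mask >> 7) - 1);
    }

    public static void main(String[] args) {
        // One semicolon in the word: only the name byte 'a' (0x61) survives.
        System.out.printf("%016x%n", partialWord(loadWord("a;0.0\nab")));
        // Two semicolons in the word: bits from the last byte leak through.
        System.out.printf("%016x%n", partialWord(loadWord("a;0.0\na;")));
    }
}
```

Under these assumptions the same city name `a` yields two different `partialWord` values depending on what follows it in the file, so any hash derived from `partialWord` differs as well.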

I've added short test cases #277 which royvanrijn passes. Maybe you could provide a test case to show the problem?

This is also covered in an existing test case:

My fault, I didn't actually run the code at all. I only assumed that the hash can be calculated differently for the same city name, and that it must lead to incorrect results.

Today I debugged it, and indeed two entries can be created for the same city in MeasurementRepository. But this is resolved in the final combine step, where the String city field of MeasurementRepository.Entry is used as the key for the TreeMap. That field is calculated correctly, so the end results are correct.
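The effect of that combine step can be sketched as follows. This is an illustration only: the `Entry` record and its `min`, `max`, `sum`, and `count` fields are hypothetical stand-ins (the thread names only the `city` field), and `combine` is not the solution's actual method. It shows why keying a TreeMap by the city string folds duplicate entries back together.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombineDemo {
    // Hypothetical stand-in for MeasurementRepository.Entry; only `city`
    // is named in the thread, the statistics fields are assumptions.
    record Entry(String city, int min, int max, long sum, int count) {
        Entry merge(Entry other) {
            return new Entry(city,
                    Math.min(min, other.min),
                    Math.max(max, other.max),
                    sum + other.sum,
                    count + other.count);
        }
    }

    // Entries that ended up in different hash slots but carry the same
    // city string collapse into a single TreeMap value here.
    static Map<String, Entry> combine(List<Entry> entries) {
        Map<String, Entry> result = new TreeMap<>();
        for (Entry e : entries) {
            result.merge(e.city(), e, Entry::merge);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("a", 0, 0, 0, 1),   // "a" stored under one hash
                new Entry("a", 5, 5, 5, 1),   // "a" stored under another hash
                new Entry("b", 1, 1, 1, 1));
        System.out.println(combine(entries));
    }
}
```

Because the map key is the city string rather than the hash, a duplicated entry costs some extra work in the repository but does not change the aggregated output.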