elastic / rally

Macrobenchmarking framework for Elasticsearch

Allow indexing data in order with multiple indexing clients

jpountz opened this issue

When indexing data with multiple clients, Rally may not ingest data in the same order as the input file. This is probably fine when ingesting non-time-based datasets, as the order in which documents get ingested should not matter much. However, this is problematic for time-based data, as we would expect different performance characteristics depending on whether time-based data gets ingested in (mostly) increasing time order or not:

  • Index sorts that include @timestamp require more reordering at flush and merge time when data gets ingested in random order, possibly making ingestion slower than what users would observe in practice. This currently affects TSDB data streams, which sort data on _tsid then descending @timestamp, and would be relevant for some other datasets, including Logging datasets that we're contemplating updating to be sorted by descending @timestamp by default (see the sketch after this list).
  • Range filters on @timestamp are more costly when data gets ingested in random order. When data gets ingested in order, segments are more likely to either fully match a given range filter or not match it at all, which Lucene can handle more efficiently than a partially matching segment. On the other hand, if indexing happens in a somewhat random order, most segments are likely to partially match a range filter on @timestamp, which would make search performance worse than what users would observe in practice.
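
As a concrete illustration of the index-sort case, here is a minimal sketch (not taken from the TSDB track; the index name and connection URL are placeholders) of creating an index sorted by descending @timestamp with the 8.x Python client:

```python
from elasticsearch import Elasticsearch  # assumes elasticsearch-py 8.x

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# An index sorted by descending @timestamp, as contemplated for Logging datasets.
# TSDB data streams configure the equivalent sort (_tsid asc, @timestamp desc)
# automatically; this explicit form is only for illustration.
es.indices.create(
    index="logs-sorted",  # hypothetical index name
    settings={
        "index.sort.field": "@timestamp",
        "index.sort.order": "desc",
    },
    mappings={"properties": {"@timestamp": {"type": "date"}}},
)
```

With such a sort, segments written from out-of-order input need more reordering at flush and merge time, which is the cost described above.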

Having a way to make Rally honor the order in which documents occur in the input file would help create tracks that have more fidelity to the performance characteristics that users would observe in production.

Will review with @pquentin. Can you comment on the urgency/priority for this?

Can you comment on the urgency/priority for this?

Yes: this enhancement has a high priority for the TSDB team, as the ingestion slowdown that we are getting with TSDB is in the way of getting TSDB adopted. This slowdown has two main causes as far as we know: the way TSDB handles IDs, which the TSDB team is working on, and the index sort that TSDB configures (_tsid asc, then @timestamp desc). We think that the fact that Rally does not ingest data in order exacerbates the cost of index sorting, so addressing this would help get more trustworthy data regarding how much slower we can expect ingestion to get as we move APM and integrations to TSDB.

What would you want Rally to do here? The use cases you've laid out made sense, but I don't know how to support them with multiple indexing clients, which are also used to get better indexing speed.

The TSDB track has a single corpus with a single big 130 GB file with 123M lines, with timestamps ranging from 2021-04-28T19:45:28.222Z to 2021-04-29T17:29:03.348Z. With the 8 indexing clients (the default configured in the TSDB track that we also use in benchmarks), Rally does indeed split that big file into 8 parts, so that each indexing client only has its own ~15M-line chunk to index:

esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [0] with bulk size [5000].
esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [14579212] with bulk size [5000].
esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [43737637] with bulk size [5000].
esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [58316849] with bulk size [5000].
esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [87475274] with bulk size [5000].
esrally.track.params INFO Will read [14579213] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [29158424] with bulk size [5000].
esrally.track.params INFO Will read [14579213] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [72896061] with bulk size [5000].
esrally.track.params INFO Will read [14579212] lines from [.../.rally/benchmarks/data/tsdb/documents.json] starting from line [102054486] with bulk size [5000].

While this is a kind of worst-case scenario, by definition we can't index in order with 8 indexing clients that are indexing in parallel. I think the best possible outcome in theory would be that each indexing client retrieves data sequentially from the file, so that we always have 8 bulk operations of 5K documents in flight while still reading the file sequentially.

We can't do that in Rally today, and it's also not clear it's doable without introducing a client-side bottleneck. A quick-and-dirty approach that seems worth trying, however, is to have an alternate corpus with 8 files, each of them going from 2021-04-28T19:45:28.222Z to 2021-04-29T17:29:03.348Z but only containing 1 out of 8 docs, and to ask Rally to index those files in parallel. Assuming that all documents take the same time to index, this could approach the best possible outcome well enough.

I don't know much about the Rally internals, so apologies if my suggestion is naive, but could Rally leverage the offset table to precompute an ordered queue of the bulk requests that need to run? Each of the driver's workers would then pull from this queue to find the offset range of its next bulk request. I wouldn't expect this to introduce more client-side bottlenecks compared to today, while providing stronger ordering guarantees than independently ingesting multiple ordered files.

I wouldn't expect this to introduce more client-side bottlenecks compared to today, while providing stronger ordering guarantees than independently ingesting multiple ordered files.

Is there a benefit to using multiple clients if a single file is being ingested sequentially? Could a single client be used?

Is there a benefit to using multiple clients if a single file is being ingested sequentially?

Yes: Elasticsearch's threading model for ingestion consists of using one thread per target shard of the _bulk request. So e.g. in the case of a single node cluster with a single index that has a single shard, you would need a number of clients greater than or equal to the number of CPUs of the node to have a chance to use all CPUs.

Yes: Elasticsearch's threading model for ingestion consists of using one thread per target shard of the _bulk request. So e.g. in the case of a single node cluster with a single index that has a single shard, you would need a number of clients greater than or equal to the number of CPUs of the node to have a chance to use all CPUs.

While I understand why we need more clients than the number of cores, I'm not following this specific argument: if you have a single shard to index to, how can you use more than one CPU? Per your first sentence, only one thread could be active since we have a single shard in that cluster. On the other hand, with a single node cluster and a single index but N shards, I see how you could use N CPUs.

if you have a single shard to index to, how can you use more than one CPU? Per your first sentence, only one thread could be active

Sorry if my phrasing was unclear. Elasticsearch uses one thread per shard per bulk request. So you need concurrent bulk requests (e.g. via multiple clients) to use more than one CPU per shard.
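
To illustrate the point, here is a minimal asyncio sketch (not Rally code; the index name, URL, and generated documents are placeholders) of several clients sending concurrent bulk requests to the same single-shard index, using the elasticsearch-py async helpers:

```python
import asyncio
from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_bulk

INDEX = "tsdb-demo"  # hypothetical index name

def docs(client_id, n=5_000):
    # Placeholder documents; in a real benchmark these come from the corpus file.
    for i in range(n):
        yield {"_index": INDEX, "client": client_id, "value": i}

async def one_client(client_id):
    es = AsyncElasticsearch("http://localhost:9200")  # placeholder URL
    try:
        # Each call is one _bulk request, which Elasticsearch processes with one
        # thread per target shard. Concurrent requests from several clients are
        # therefore what keeps more than one CPU busy, even with a single shard.
        await async_bulk(es, docs(client_id))
    finally:
        await es.close()

async def main(num_clients=8):
    await asyncio.gather(*(one_client(i) for i in range(num_clients)))

asyncio.run(main())
```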

I don't know much about the Rally internals, so apologies if my suggestion is naive, but could Rally leverage the offset table to precompute an ordered queue of the bulk requests that need to run? Each of the driver's workers would then pull from this queue to find the offset range of its next bulk request.

So one core assumption of Rally is that all operations (including bulk indexing) behave the same at the top level:

  • Before starting any task, Rally will spawn as many processes as CPU cores, start an asyncio event loop and instantiate async clients there, multiple per loop if needed (so if you have 8 CPUs but 800 indexing clients, Rally will start 100 clients per process).
  • Then, Rally will ask each worker to work on its own part. For bulk indexing, the worker will compute, for each client id, the part of the file it needs to read (from client_id * split_length to (client_id + 1) * split_length; an illustrative sketch follows this list).
  • Finally, all clients will mmap the relevant version of the file starting at their offset and read sequentially from it.
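
As an illustration only (this is not Rally's actual code, and the real implementation spreads the remainder lines slightly differently), the per-client split described above boils down to something like:

```python
def client_offsets(total_lines: int, num_clients: int, client_id: int) -> tuple[int, int]:
    """Return the [start, end) line range that a given indexing client reads sequentially."""
    split_length = total_lines // num_clients
    start = client_id * split_length
    # Give any remainder lines to the last client (the real split differs slightly).
    end = total_lines if client_id == num_clients - 1 else start + split_length
    return start, end

# 116_633_698 lines split across 8 clients gives ~14.6M lines per client:
print(client_offsets(116_633_698, 8, 0))  # (0, 14579212)
```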

Reading your suggestion more closely, I realize that instead of splitting by the number of indexing clients (8 in my example) we could split by the number of bulk requests (116_633_698 lines / 10_000 bulk size, that is 11_664 offsets) and reuse the same mmap approach. Just as above, each client can work out independently which parts of the file it needs to read. Not only does this look possible without a big rearchitecture, it should also be fast enough. Thanks! I'll try to figure out how much time it would take to implement.
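
A sketch of what this could look like, assuming bulk-sized chunks get assigned to clients round-robin so that the parallel readers advance through the file together (illustrative only, not the eventual Rally implementation):

```python
import math

def bulk_offsets_for_client(total_lines: int, bulk_size: int,
                            num_clients: int, client_id: int):
    """Yield the [start, end) line ranges of the bulk requests owned by one client."""
    num_chunks = math.ceil(total_lines / bulk_size)  # e.g. 11_664 for the TSDB corpus
    # Round-robin assignment: clients still index in parallel, but the offsets
    # they read stay close together, so ingestion roughly follows file order.
    for chunk in range(client_id, num_chunks, num_clients):
        start = chunk * bulk_size
        yield start, min(start + bulk_size, total_lines)

# First two bulks for client 0 out of 8, with a bulk size of 10_000 lines:
gen = bulk_offsets_for_client(116_633_698, 10_000, 8, 0)
print(next(gen), next(gen))  # (0, 10000) (80000, 90000)
```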

A quick-and-dirty approach that seems worth trying, however, is to have an alternate corpus with 8 files, each of them going from 2021-04-28T19:45:28.222Z to 2021-04-29T17:29:03.348Z but only containing 1 out of 8 docs, and to ask Rally to index those files in parallel. Assuming that all documents take the same time to index, this could approach the best possible outcome well enough.

Given that the code-change approach is likely to take time to implement, we agreed offline to try the alternate corpus approach first, except that we're going to use 16 indexing clients since our benchmarks run on an m5d.4xlarge instance, which has 16 vCPUs. Here's the list of steps to add such a new TSDB corpus:

  1. download the existing rally-tracks.elastic.co/tsdb/documents.json.bz2 file and uncompress it
  2. write a script to turn that into 16 new files (documents-X.json) and 16 new test-mode files (documents-X-1k.json), with X going from 1 to 16 (a sketch of such a script follows this list)
  3. compress all 32 files and upload them to rally-tracks.elastic.co/tsdb/
  4. document a new parameter named "corpus" that can take the values "ordered" or "unordered" (the default). Based on that parameter, use a Jinja condition in https://github.com/elastic/rally-tracks/blob/5917b92299bfab8eee4ad40f4e8121030cc4d5e0/tsdb/track.json#L28-L40 to either keep the existing single file or reference the 16 files uploaded in step 3. You can look at http_logs for inspiration. The main difference is that we want the corpus to be called "tsdb" in both cases.
  5. You can now test indexing in order by running a race with the corpus parameter set to ordered.
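
For reference, a minimal sketch of the split script from step 2 (the real one now lives at _tools/split.py in the TSDB track; the round-robin chunking and the 1,000-document test-mode size are assumptions):

```python
# Split documents.json into 16 ordered files, each covering the full time range
# but containing only 1 out of 16 docs, plus small test-mode companions.
N = 16
full = [open(f"documents-{i + 1}.json", "w") for i in range(N)]
test = [open(f"documents-{i + 1}-1k.json", "w") for i in range(N)]

with open("documents.json") as src:
    for lineno, line in enumerate(src):
        i = lineno % N               # round-robin keeps each output file in time order
        full[i].write(line)
        if lineno // N < 1000:       # first 1,000 docs per file for test mode (assumed)
            test[i].write(line)

for f in full + test:
    f.close()
```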

After a false start, I was able to implement the hacky approach to rearrange the file and index roughly in order. On my laptop, which is much less stable than a true benchmarking environment, here's the difference in distribution at 13% indexing vs. 18%, with 16 clients:

   "2021-04-28T17:00:00.000Z": 2557494,
   "2021-04-28T18:00:00.000Z": 3742052,
   "2021-04-28T19:00:00.000Z": 3704665,
-  "2021-04-28T20:00:00.000Z": 1979124,
+  "2021-04-28T20:00:00.000Z": 3693782,
-  "2021-04-28T21:00:00.000Z": 1241240,
+  "2021-04-28T21:00:00.000Z": 2985247,
   "2021-04-28T22:00:00.000Z": 1233969,
-  "2021-04-28T23:00:00.000Z": 389828,
+  "2021-04-28T23:00:00.000Z": 1228967,
-  "2021-04-29T00:00:00.000Z": 0,
+  "2021-04-29T00:00:00.000Z": 1243619,
-  "2021-04-29T01:00:00.000Z": 0,
+  "2021-04-29T01:00:00.000Z": 1238765,
-  "2021-04-29T02:00:00.000Z": 0,
+  "2021-04-29T02:00:00.000Z": 169134,
   "2021-04-29T03:00:00.000Z": 0,
   "2021-04-29T04:00:00.000Z": 0,
   "2021-04-29T05:00:00.000Z": 0,
   "2021-04-29T06:00:00.000Z": 0,
   "2021-04-29T07:00:00.000Z": 0,
   "2021-04-29T08:00:00.000Z": 0,
   "2021-04-29T09:00:00.000Z": 0,
   "2021-04-29T10:00:00.000Z": 0,
   "2021-04-29T11:00:00.000Z": 0,
   "2021-04-29T12:00:00.000Z": 0,
   "2021-04-29T13:00:00.000Z": 0,
   "2021-04-29T14:00:00.000Z": 0,
   "2021-04-29T15:00:00.000Z": 0,
   "2021-04-29T16:00:00.000Z": 0,
   "2021-04-29T17:00:00.000Z": 9

(Yes, I should investigate the lone 9 at the end.)

This did improve indexing speed, but surprisingly it improved all configurations, not only the time-series benchmarks where index sorting is applied:

[chart: indexing speed improvement across all configurations]

@martijnvg will investigate more I believe.

In parallel, I've also worked towards implementing the proper change in Rally, but it requires touching the most complex part of Rally. So far I was able to open two pull requests towards that goal.

It's also OK if I don't finish this given the hacky approach solves the immediate problem.

Thanks a lot @pquentin. This work is increasing our confidence in the TSDB overhead that is reported in nightly benchmarks and has already helped highlight that the way Elasticsearch performs merging is not good at maintaining the original ordering of the data. This will help us move forward with changes that rely on ingestion order, like elastic/elasticsearch#92456 and elastic/elasticsearch#92684, and I'm sure many more in the future.

It's also OK if I don't finish this given the hacky approach solves the immediate problem.

This sounds right. If ingesting data in order proves too hard to do in Rally, I think we'll want to do a similar hack for the Logging track where the ingestion order could also affect ingestion rates and query performance.

This sounds right. If ingesting data in order proves too hard to do in Rally, I think we'll want to do a similar hack for the Logging track where the ingestion order could also affect ingestion rates and query performance.

This won't be needed as the elastic/logs (and elastic/security) tracks completely sidestep this logic by generating their own data with dynamic timestamps and thus already index in order.

Is it like the hack we did for TSDB where each client sends data in order but maybe data doesn't get ingested in global order in case a client goes a bit faster than another one for some reason, or do the Logging and Security tracks have stronger guarantees around the global order?

No, the guarantee is much stronger, as the timestamps are generated on demand just before indexing directly in the hot code path.

Despite the stronger guarantee, the design of those tracks is more or less considered a mistake now as:

  1. they replace large parts of Rally with custom code that is partly copy-pasted from Rally itself, which is difficult to maintain
  2. the code in the hot code path is highly optimized and quite unreadable, since it avoids things like parsing JSON so as not to introduce client-side bottlenecks (a hypothetical sketch of the technique follows this list)
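
A hypothetical sketch of that technique (not the elastic/logs code; the placeholder marker and format are assumptions): the corpus lines carry a fixed-width timestamp placeholder that gets substituted just before indexing, so the hot path never parses JSON.

```python
import time

# Assumed fixed-width marker baked into the corpus lines.
PLACEHOLDER = "0000-00-00T00:00:00.000Z"

def stamp(raw_line: str) -> str:
    # Generate the @timestamp on demand and splice it into the raw JSON line
    # with a plain string replacement instead of parsing and re-serializing.
    now = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()) + ".000Z"
    return raw_line.replace(PLACEHOLDER, now, 1)

print(stamp('{"@timestamp": "0000-00-00T00:00:00.000Z", "message": "hello"}'))
```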

If you need that stronger guarantee, we can look into doing this for the TSDB track too.

Thanks for the pointers!

If you need that stronger guarantee, we can look into doing this for the TSDB track too.

Your above comment states that this design is considered a mistake, so I'm unsure how you feel about this approach vs. the current hack. I don't know how good the current hack is at approximating global ordering, but in general, having stronger guarantees around it to better mimic what happens in the real world feels like a desirable feature.

Your above comment states that this design is considered a mistake, so I'm unsure how you feel about this approach vs. the current hack. I don't know how good the current hack is at approximating global ordering, but in general, having stronger guarantees around it to better mimic what happens in the real world feels like a desirable feature.

I've discussed this offline quickly with @dliappis and then with @martijnvg, and we agreed that the split16 corpus will do for now, as the elastic/logs way would have too many drawbacks. In elastic/rally-tracks#378 I fixed the remaining blocker and made sure we don't index a few documents from the end first. (Which was the issue mentioned in #1650 (comment).)

Initially I wanted to implement this approach in Rally directly, but was only able to merge #1659 before having to move on. This is still satisfying as it simplifies the existing code significantly and I understand this core and complicated part of Rally much better. (I also opened #1666 to document my findings.)

If we need this behavior for other tracks, we can always reuse the TSDB approach: https://github.com/elastic/rally-tracks/tree/master/tsdb. The README explains how to run _tools/split.py to generate the required files. If we find ourselves needing this in many places, it will be time to implement it in Rally itself, but for now there's nothing actionable left in this issue, so I'll close it.