A bad mapping can cause Rally can get into a bad state during bulk indexing

Question

A bad mapping can cause Rally can get into a bad state during bulk indexing

inqueue opened this issue a year ago · comments

elastic/rally-tracks@cffc864 led to a strange stream of daily failures in the nightly benchmarks for http_logs where jobs were taking a full 24 hours to fail¹. The simplified Rally race invocation:

rally race --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --runtime-jdk="bundled" --track-params="{\"source_enabled\": false, \"ingest_pipeline\": \"grok\"}" --pipeline="from-sources"

Use elastic/rally-tracks@cffc864 without --test-mode to reproduce. The issue does not reproduce in test mode. When left to run, Rally displays Running index-append-with-ingest-grok-pipeline [100% done], then remains in the state forever.

2023-03-30 20:38:16,376 ActorAddr-(T|:33597)/PID:8312 esrally.actor ERROR Error in driver
Traceback (most recent call last):

  File "/home/esbench/rally/esrally/actor.py", line 92, in guard
    return f(self, msg, sender)
           ^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 303, in receiveMsg_WakeupMessage
    self.coordinator.post_process_samples()

  File "/home/esbench/rally/esrally/driver/driver.py", line 955, in post_process_samples
    self.sample_post_processor(raw_samples)

  File "/home/esbench/rally/esrally/driver/driver.py", line 1068, in __call__
    self.metrics_store.flush(refresh=False)

  File "/home/esbench/rally/esrally/metrics.py", line 906, in flush
    self._client.bulk_index(index=self._index, items=self._docs)

  File "/home/esbench/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)

  File "/home/esbench/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)

elasticsearch.helpers.BulkIndexError: 118 document(s) failed to index.

2023-03-30 20:38:16,384 ActorAddr-(T|:33597)/PID:8312 esrally.actor ERROR Main driver received a fatal exception from a load generator. Shutting down.

2023-03-30 20:38:22,339 ActorAddr-(T|:33597)/PID:8312 esrally.driver.driver.DriverActor WARNING Actor esrally.driver.driver.DriverActor @ ActorAddr-(T|:33597) retryable exception on message <esrally.actor.BenchmarkFailure object at 0x7f083132c
6d0>
Traceback (most recent call last):

  File "/home/esbench/.local/lib/python3.11/site-packages/thespian/system/actorManager.py", line 163, in _handleOneMessage
    actor_result = self.actorInst.receiveMessage(msg, envelope.sender)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/thespian/actors.py", line 838, in receiveMessage
    r = getattr(klass, methodName)(self, message, sender)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 242, in receiveMsg_BenchmarkFailure
    self.coordinator.close()

  File "/home/esbench/rally/esrally/driver/driver.py", line 921, in close
    self.metrics_store.close()

  File "/home/esbench/rally/esrally/metrics.py", line 454, in close
    self.flush()

  File "/home/esbench/rally/esrally/metrics.py", line 906, in flush
    self._client.bulk_index(index=self._index, items=self._docs)

  File "/home/esbench/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)

  File "/home/esbench/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)

OS: "Ubuntu 18.04.6 LTS

Bad Mapping

The mapping issue was about the date format of @timestamp. The index mapping was set to epoch_second when it should have been strict_date_optional_time. Bulk requests would have encountered this response for every single document in the corpus:

{
  "took": 0,
  "ingest_took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "logs-181998",
        "_type": "_doc",
        "_id": "br7lM4cBM1F21Bvv-SHk",
        "status": 400,
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "failed to parse field [@timestamp] of type [date] in document with id 'br7lM4cBM1F21Bvv-SHk'. Preview of field's value: '1998-04-30T14:30:17-05:00'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "failed to parse date field [1998-04-30T14:30:17-05:00] with format [epoch_second]",
            "caused_by": {
              "type": "date_time_parse_exception",
              "reason": "Failed to parse with all enclosed parsers"
            }
          }
        }
      }
    }
...
    }
}

These result in a 200 status code response, even though the error status of the individual document was a 400. We should consider exiting if the bulk request responds with errors: true.

https://elasticsearch-ci.elastic.co/blue/organizations/jenkins/elastic%2Belasticsearch%2Bmain%2Bmacrobenchmark-periodic-group-2/detail/elastic%2Belasticsearch%2Bmain%2Bmacrobenchmark-periodic-group-2/382/pipeline ↩

A bad mapping can cause Rally can get into a bad state during bulk indexing

Bad Mapping

Footnotes