strapdata / elassandra

Elassandra = Elasticsearch + Apache Cassandra

Home Page: http://www.elassandra.io

Got tons of errors after performing a small insert.

didip opened this issue

Elassandra version: Latest

Plugins installed: []

JVM version (java -version): JDK 8

OS version (uname -a if on a Unix-like system): Ubuntu 18

Description of the problem including expected versus actual behavior:

After inserting a blob of data, the cluster is suddenly in a bad state and starts erroring out with:

at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

I don't have precise info yet on which blob breaks the cluster. I am just curious whether you have seen this error before.

A rolling restart fixes the cluster.

So please attach information in the issue (logs, version, etc.)!

A bit more information. The beginning of that stack trace:

2020-08-07 23:47:20,907 WARN  [elasticsearch[10.0.15.0][http_server_worker][T#8]] Netty4HttpServerTransport.java:575 exceptionCaught caught exception while handling client http traffic, closing connection [id: 0x49c6de7e, L:/10.0.15.0:9200 - R:/10.0.12.124:31658]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

Another one:

2020-08-07 23:45:12,956 ERROR [Thread-66661] CassandraDaemon.java:241 uncaughtException Exception in thread Thread[Thread-66661,5,main]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

The two errors always happen back to back: Elasticsearch first, then Cassandra.

Sorry for the late reply; the thread stack size is already set to -Xss1m.

OK, could you provide more information to reproduce this issue and the full stack trace from your logs?

    at java.lang.Thread.run(Thread.java:748)

2020-09-01 11:27:27,990 WARN [elasticsearch[172.16.2.30][http_server_worker][T#14]] org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:242) Unknown field [refresh] used in UpdateRequest which has no value and will not be accepted in future
2020-09-01 11:27:28,391 WARN [elasticsearch[172.16.2.30][masterService#updateTask][T#1]] org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:712) org.elasticsearch.cluster.service.MasterService$$Lambda$2393/2102567294@577db957
org.elasticsearch.ResourceAlreadyExistsException: index [log_icos_sit/IG7uGWGxRvudBgZoXwBslA] already exists
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:153)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:795)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:307)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:54)
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:704)
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:302)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:227)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:158)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-09-01 11:27:38,572 WARN [elasticsearch[172.16.2.30][http_server_worker][T#14]] org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:242) Unknown field [refresh] used in UpdateRequest which has no value and will not be accepted in future
2020-09-01 12:54:38,839 INFO [elasticsearch[172.16.2.30][generic][T#5]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [feature_1062466119_icos_sit][0] started
2020-09-01 12:55:22,880 INFO [elasticsearch[172.16.2.30][generic][T#11]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [twin_444743153_icos_sit][0] started
2020-09-01 12:55:24,885 INFO [elasticsearch[172.16.2.30][generic][T#8]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [feature_444743153_icos_sit][0] started
2020-09-01 12:55:26,754 ERROR [GossipStage:1] org.apache.cassandra.service.CassandraDaemon$2.uncaughtException(CassandraDaemon.java:234) Exception in thread Thread[GossipStage:1,5,main]
java.lang.StackOverflowError: null
at java.util.HashMap.hash(HashMap.java:339)
at java.util.HashMap.containsKey(HashMap.java:596)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)

Elassandra 6.8.4.2
Helm chart deployment: memory 26Gi, JvmMaxHeap 13Gi, JvmHeapNewSize 3Gi, Xss 1m

I want to chime in to report an occurrence similar to what didip and tengzhuofei reported.

We're on Elassandra 6.8.4.7 running as a service on RHEL with Java 8 (openjdk version "1.8.0_265"). 32GB RAM. Our cluster has nine Elasticsearch-enabled nodes in its search datacenter. Interesting JVM settings:

-Xms16G 
-Xmx16G
-Xss1m

Here's how that DC looks in nodetool status (note the 16 tokens with vnodes):

Datacenter: us-west-2-search
============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  ...      14.14 GiB  16           ?       f307e7ce-542f-4bee-822d-fd942f8577e7  us-west-2a
UN  ...      16.76 GiB  16           ?       36508031-a73e-4818-b66b-6b2c2b64da05  us-west-2b
UN  ...       11.76 GiB  16           ?       a17061a5-532b-4613-966d-bf9171b70446  us-west-2b
UN  ...       13.2 GiB   16           ?       d89b9142-21e1-4dca-b0bf-6fdad21f361a  us-west-2c
UN  ...       15.98 GiB  16           ?       48798cb9-3515-4867-a5d0-70e4c27674f9  us-west-2c
UN  ...       18.95 GiB  16           ?       6d2a6fde-c6c3-4be4-90ea-4a182359d0fd  us-west-2c
UN  ...      14.61 GiB  16           ?       be147c81-1a23-4b58-b9dd-03993ef13e76  us-west-2a
UN  ...       19.93 GiB  16           ?       ca4c1231-62b2-4aa6-be0d-249fb3ef61cd  us-west-2b
UN  ...      19.66 GiB  16           ?       5d823437-cd50-4e94-951c-b3755bcca02a  us-west-2a

We noticed that all nine of our Elasticsearch endpoints became unresponsive. All nine exhibited a StackOverflowError with thousands upon thousands of lines of:

	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)

Just megabytes of that line in the system.log file.

Excerpt:

2020-11-04 11:55:11,568 ERROR [elasticsearch[11.11.11.11][http_server_worker][T#3]] ExceptionsHelper.java:281 lambda$maybeDieOnAnotherThread$2 fatal error
	at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$2(ExceptionsHelper.java:280)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:270)
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:176)
	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
	at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:856)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	at java.lang.Thread.run(Thread.java:748)
2020-11-04 11:55:11,569 ERROR [Thread-61961] CassandraDaemon.java:241 uncaughtException Exception in thread Thread[Thread-61961,5,main]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)

Here's what else I noticed while the Elasticsearch endpoints were failing to respond to queries. Via the HTTP API on port 9200, each index definition would still show up, but requests with the extra _search element in the URI would get no response. The Cassandra nodes would still show up as up (UN) in nodetool status.

Restarting the nodes made the immediate problem go away. Of course, the question is how to prevent this.

In recent months we've upgraded from 5.5.0.24 to Elassandra 6.8.4.7. I don't recall seeing this error on 5.5.0.24. This particular keyspace was upgraded to a 6.8.4.7 cluster just a couple of weeks ago, and we hadn't seen the problem on its old 5.5.0.24 cluster.

Also, I don't know whether this is an infinite loop or a sign of an incredibly nested call stack where it's legitimately one UnmodifiableMap.get after another.

One colleague of mine who has seen this before believes it has to do with the definition of a nested mapping for an index. Example:

                "foo_bar": {
                    "type": "nested",
                    "cql_collection": "singleton",
                    "cql_udt_name": "foo_bar",
                    "properties": {
                        "transaction_id": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "status": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "transaction_date": {
                            "type": "date",
                            "cql_collection": "singleton",
                            "index": "false"
                        }
                    }
                }

And if we changed this definition from nested to object, the problem would go away. I wish I could give a more definitive basis for his assertion.
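For illustration, here is a sketch of what his change would look like, with only the type of foo_bar swapped from "nested" to "object" (I'm assuming the cql_* options stay exactly as they are):

                "foo_bar": {
                    "type": "object",
                    "cql_collection": "singleton",
                    "cql_udt_name": "foo_bar",
                    "properties": {
                        "transaction_id": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "status": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "transaction_date": {
                            "type": "date",
                            "cql_collection": "singleton",
                            "index": "false"
                        }
                    }
                }

Whether the Cassandra-side UDT mapping behaves identically for "object" and "nested" is something I can't confirm.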

Vincent, thanks for getting back.

I'll take your word that there's a possibility this is not an infinite loop. We'll double the stack size and see what happens. (Even then, my anecdotal observation is that if this is not an infinite loop but just an incredibly deep stack, something exceedingly inefficient is going on, whether on a write or a query.)

What's your take on the speculation that this has something to do with the "nested" mapping and that we should replace it with "object"? Beyond that, if you have any insight into when a field mapping should be defined as "nested" and when it should be defined as "object", I'd love to hear it.
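For context on where I'm coming from, my (possibly incomplete) understanding of the general Elasticsearch distinction is: an "object" mapping flattens an array of sub-objects into parallel arrays of field values, so a query can match values drawn from different sub-objects, while a "nested" mapping indexes each sub-object as a hidden separate document and has to be searched with a nested query. A hypothetical example, with comments/author/rating fields made up purely for illustration (they are not from our mapping):

    {
        "query": {
            "nested": {
                "path": "comments",
                "query": {
                    "bool": {
                        "must": [
                            { "match": { "comments.author": "alice" } },
                            { "match": { "comments.rating": 5 } }
                        ]
                    }
                }
            }
        }
    }

With an "object" mapping, the equivalent flat bool query could match a document where alice wrote one comment and a different author gave the rating of 5. If we don't need that per-sub-object precision, "object" seems like the cheaper choice, since "nested" adds hidden documents per entry. Whether any of that relates to the stack overflow here, I can't say.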

I adjusted Xss to 5m and still have this problem.

We found a possible cause for this issue, fixed in this commit.
Could you please upgrade to 6.8.4.13 and let us know if it fixes your issue?
Thanks in advance.