crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

Home Page: https://cratedb.com/product

max_shards_per_node not behaving as documented

hlcianfagna opened this issue · comments

CrateDB version

5.6.3

CrateDB setup information

3 nodes cluster deployed with https://github.com/crate/crate-terraform

Problem description

According to https://cratedb.com/docs/crate/reference/en/5.6/config/cluster.html#shard-limits

Any operations that would result in the creation of additional shard copies that would exceed this limit are rejected.

However, in a multi-node cluster (this could not be reproduced in single-node environments), it seems that under certain conditions the limit is not strictly enforced.

Steps to Reproduce

cr> select node['name'] ,closed,count(*) from sys.shards group by 1,2 limit 100;
+------------------+--------+----------+
| node['name']     | closed | count(*) |
+------------------+--------+----------+
| Hochschwung      | FALSE  |       26 |
| Schafberg        | FALSE  |       27 |
| Monte La Palazza | FALSE  |       27 |
+------------------+--------+----------+
SELECT 3 rows in set (0.053 sec)
cr> set GLOBAL PERSISTENT "cluster.max_shards_per_node"=30;
SET OK, 1 row affected (0.056 sec)
cr> create table hernan.shardstest (a int) clustered into 8 shards;
CREATE OK, 1 row affected (0.149 sec)
cr> create table hernan.shardstest2 (a int) clustered into 5 shards;
SQLParseException[Validation Failed: 1: this action would add [5] total shards, but this cluster currently has [96]/[90] maximum shards open;]
cr> select node['name'] ,closed,count(*) from sys.shards group by 1,2 limit 100;
+------------------+--------+----------+
| node['name']     | closed | count(*) |
+------------------+--------+----------+
| Hochschwung      | FALSE  |       32 |
| Schafberg        | FALSE  |       32 |
| Monte La Palazza | FALSE  |       32 |
+------------------+--------+----------+

Actual Result

The nodes end up with 32 shards each.

Expected Result

The limit of 30 shards is enforced, or the documentation is updated to explain how the limit works and when it may not be enforced.

I think it is caused by number_of_replicas, which by default is set to 0-1. A workaround could be to not use a range value, for example as sketched below.
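A minimal sketch of that workaround (table name is just an example): pin the replica count to a fixed value instead of the 0-1 range, either at creation time or on an existing table.

-- create the table with a fixed replica count instead of the '0-1' range
create table hernan.shardstest3 (a int)
  clustered into 8 shards
  with (number_of_replicas = 1);

-- or pin it on an existing table
alter table hernan.shardstest set (number_of_replicas = 1);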

commented
  1. Regarding the original issue - I think it's more of a documentation issue.
    See elastic/elasticsearch#51839

cluster.max_shards_per_node controls how many shards are allowed to exist in the cluster as a whole, and is checked at shard creation time, but does not pay attention to how many shards any individual node has

and 7.10 backport elastic/elasticsearch@e4054e4

NOTE: This setting does not limit shards for individual nodes.

I will port those docs and also mention in the docs that max_shards_per_node doesn't take closed shards into account.
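To make the cluster-wide interpretation concrete, a rough sketch of how to compare the numbers (assuming all 3 nodes hold data, so the budget is 3 * 30 = 90, which matches the [96]/[90] in the error above):

-- open shard copies currently counted against the limit (closed shards are ignored)
select count(*) from sys.shards where closed = false;
-- number of nodes; the cluster-wide budget is this multiplied by max_shards_per_node
select count(*) from sys.nodes;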

On auto-expanding replicas (which we have enabled by default, set to 0-1):
I found elastic/elasticsearch#2869 and elastic/elasticsearch@eb3d184.
The links above are talking about total_shards_per_node - I will check whether the same holds true for max_shards_per_node and update the docs if needed.

UPD: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/index-modules.html

Note that the auto-expanded number of replicas only takes allocation filtering rules into account, but ignores other allocation rules such as total shards per node,

Probably by design, but I will check why we cannot do it like @jeeminso proposed in #15805.
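For reference, a quick way to see the auto-expanded copies on the repro tables (a sketch, assuming the hernan.* tables from the report still exist): with 0-1 replicas on a multi-node cluster, each primary should show up with one replica copy.

-- counts primary and replica shard copies per table (illustration only)
select table_name, "primary", count(*) as shard_copies
from sys.shards
where schema_name = 'hernan'
group by 1, 2;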

  2. Regarding partitioned tables - it's a legitimate bug, but visible only for INSERT INTO ... VALUES (many values).
    Insert-from-subquery is not as badly exposed - see details in the fix.
commented

Hi @hlcianfagna, could you please post your initial create table statement(s) - how you got those 26/27 shards?

Also, I couldn't exactly reproduce this locally: do you get 32 shards per node after running the following?
create table hernan.shardstest2 (a int) clustered into 5 shards;

What I'm saying is that slightly overshooting the limit might be expected behavior, but reporting an error and still incrementing the number of shards would be a bug.

I suspect that the 32 shards were actually already there after
create table hernan.shardstest (a int) clustered into 8 shards;

how you got those 26/27 shards?

They were already there on a cluster that I no longer have at hand; however, I just reproduced this again successfully using the following on an empty cluster:

create table hernan.legacytables (a int) clustered into 40 shards;

I confirmed the 32 shards per node are there right after the command with clustered into 8 shards.
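For what it's worth, the numbers line up if the default 0-1 replicas auto-expand to 1 replica on the 3-node cluster (I have not verified the expansion explicitly, so treat this as a sketch):

-- 40-shard table: 40 primaries + 40 replicas = 80 copies -> the initial ~26/27 per node
-- + 8-shard table: 8 primaries + 8 replicas = 16 copies -> 96 copies, i.e. 32 per node
-- cluster-wide budget with max_shards_per_node = 30 and 3 nodes: 90
select node['name'], count(*) from sys.shards group by 1;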

commented

I confirmed the 32 shards per node are there right after the command with clustered into 8 shards

OK, then I will follow my original plan - improve the docs (this is basically expected behaviour for auto-expanding replicas).

As said, throwing the error this action would add ... total shards and then actually adding the shards would be a bug -> but that's not the case here.