crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

max_shards_per_node not behaving as documented

hlcianfagna opened this issue

CrateDB version


CrateDB setup information

3 nodes cluster deployed with

Problem description

According to

Any operations that would result in the creation of additional shard copies that would exceed this limit are rejected.

However, in a multi-node cluster (I could not reproduce this on single-node environments), it seems the limit is not strictly enforced under certain conditions.

Steps to Reproduce

cr> select node['name'] ,closed,count(*) from sys.shards group by 1,2 limit 100;
| node['name']     | closed | count(*) |
| Hochschwung      | FALSE  |       26 |
| Schafberg        | FALSE  |       27 |
| Monte La Palazza | FALSE  |       27 |
SELECT 3 rows in set (0.053 sec)
cr> set GLOBAL PERSISTENT "cluster.max_shards_per_node"=30;
SET OK, 1 row affected (0.056 sec)
cr> create table hernan.shardstest (a int) clustered into 8 shards;
CREATE OK, 1 row affected (0.149 sec)
cr> create table hernan.shardstest2 (a int) clustered into 5 shards;
SQLParseException[Validation Failed: 1: this action would add [5] total shards, but this cluster currently has [96]/[90] maximum shards open;]
cr> select node['name'] ,closed,count(*) from sys.shards group by 1,2 limit 100;
| node['name']     | closed | count(*) |
| Hochschwung      | FALSE  |       32 |
| Schafberg        | FALSE  |       32 |
| Monte La Palazza | FALSE  |       32 |

Actual Result

The nodes end up with 32 shards each.

Expected Result

The limit of 30 shards is enforced, or the documentation is updated explaining how the limit works/when it may not be enforced.

I think it is caused by number_of_replicas which is by default set to 0-1. A workaround could be to not use a range value.

  1. Regarding the original issue: I think it's more of a documentation issue.
    See elastic/elasticsearch#51839

cluster.max_shards_per_node controls how many shards are allowed to exist in the cluster as a whole, and is checked at shard creation time, but does not pay attention to how many shards any individual node has

and the 7.10 backport elastic/elasticsearch@e4054e4

NOTE: This setting does not limit shards for individual nodes.

I will port those docs and mention in the docs that max_shards_per_node doesn't take closed shards into account.
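
To illustrate the cluster-wide nature of the check, here is a minimal Python sketch of the arithmetic, not the actual CrateDB or Elasticsearch implementation. The function name `check_shard_limit` and its parameters are hypothetical; the numbers come from the report above (3 nodes, limit of 30, so a cluster-wide budget of 90).

```python
def check_shard_limit(open_shards, new_shards, max_shards_per_node, data_nodes):
    """Hypothetical model of the check: it compares cluster totals only.

    Per-node shard counts are never consulted, so a single node can hold
    more than max_shards_per_node shards as long as the total fits.
    """
    cluster_limit = max_shards_per_node * data_nodes
    if open_shards + new_shards > cluster_limit:
        raise ValueError(
            f"this action would add [{new_shards}] total shards, but this cluster "
            f"currently has [{open_shards}]/[{cluster_limit}] maximum shards open"
        )
    return cluster_limit


# 26 + 27 + 27 = 80 shards open; 8 more fit into the budget of 3 * 30 = 90.
check_shard_limit(open_shards=80, new_shards=8, max_shards_per_node=30, data_nodes=3)

# With 96 shards open, a further 5 are rejected, matching the error in the report.
try:
    check_shard_limit(open_shards=96, new_shards=5, max_shards_per_node=30, data_nodes=3)
except ValueError as err:
    print(err)
```

The point of the sketch is that the setting name suggests a per-node cap, while the comparison is made against `max_shards_per_node * data_nodes`.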

On auto-expanding replicas (enabled by default, with a range of 0-1):
I found elastic/elasticsearch#2869 and elastic/elasticsearch@eb3d184.
The links above discuss total_shards_per_node; I will check whether the same holds true for max_shards_per_node and update the docs if needed.


Note that the auto-expanded number of replicas only takes allocation filtering rules into account, but ignores other allocation rules such as total shards per node,

Probably by design, but I will check why we cannot do it like @jeeminso proposed in #15805
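
The numbers in the report are consistent with the following rough model, written as a Python sketch. It rests on an assumption that is not confirmed in this thread: that the limit check counts replicas at the lower bound of the 0-1 range, and that auto-expansion then raises the replica count afterwards without being re-checked. The function `shards_after_create` is hypothetical.

```python
def shards_after_create(open_shards, primaries, replica_range, data_nodes):
    """Return (shards counted by the limit check, shards open after auto-expansion).

    Assumption: the check uses the lower bound of the replica range, while
    auto-expansion later grows replicas up to the upper bound, capped at
    data_nodes - 1 because a copy cannot share a node with its primary.
    """
    low, high = replica_range
    counted = open_shards + primaries * (1 + low)
    replicas = min(high, data_nodes - 1)
    actual = open_shards + primaries * (1 + replicas)
    return counted, actual


limit = 30 * 3  # max_shards_per_node * number of nodes
counted, actual = shards_after_create(80, 8, (0, 1), data_nodes=3)
print(counted <= limit)  # True: 88 shards pass the check
print(actual)            # 96 shards end up open
print(actual // 3)       # 32 per node on a 3-node cluster
```

Under the same assumption, a table clustered into 40 shards on an empty 3-node cluster ends up with 80 shard copies, which matches the 26/27/27 distribution shown in the report.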

  1. Regarding partitioned tables: it's a legitimate bug, but it is visible only for INSERT INTO ... VALUES (many values).
    INSERT from a subquery is not as badly exposed; see the details in the fix.

Hi @hlcianfagna, could you please post your initial CREATE TABLE statement(s)? How did you get those 26/27 shards?

Also, I couldn't exactly reproduce this locally: do you get 32 shards per node after running
create table hernan.shardstest2 (a int) clustered into 5 shards;

What I'm saying is that slightly overshooting the limit might be expected behavior, but reporting an error and still incrementing the number of shards would be a bug.

I suspect that the 32 shards were actually already there after
create table hernan.shardstest (a int) clustered into 8 shards;

how you got those 26/27 shards?

They were already there on a cluster that I no longer have at hand; however, I just reproduced this successfully by first running this on an empty cluster:

create table hernan.legacytables (a int) clustered into 40 shards;

I confirmed that the 32 shards per node are there right after the command with clustered into 8 shards.


I confirmed the 32 shards per node are there right after the command with clustered into 8 shards

OK, then I will follow my original plan and improve the docs (this is basically expected behaviour for auto-expanding replicas).

As said, throwing the error "this action would add ... total shards" while actually adding the shards would be a bug, but that's not the case here.