Know Elasticsearch from the view of a decentralized storage system

abbshr opened this issue 7 years ago · comments

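This post does not cover Elasticsearch's higher-level concepts such as analysis or search, that is, the search-engine side of things. It introduces the basics of Elasticsearch purely from the perspective of a distributed system.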

concept

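Distributed coordination (consensus algorithm)

I had not looked into Elasticsearch's cluster coordination mechanism before. It turns out that it also uses a gossip protocol, as Cassandra and Serf do. This makes cluster self-management much easier, and it also suggests that an Elasticsearch cluster is a p2p (decentralized) storage system.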

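Node discovery

Elasticsearch builds its cluster with a gossip protocol. Think of it as a rumor-spreading model: nodes exchange whatever extra information each of them knows, then talk to other nodes, and the data converges in O(log N) time on average.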

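The benefit is that nodes can join and leave without manual intervention, i.e. the cluster manages itself. At bootstrap, however, a new node needs someone to exchange information with, so at least one seed node that knows about all nodes in the cluster is required.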
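To get a feel for the O(log N) figure, here is a minimal simulation of push-style rumor spreading. It is only a sketch of the general gossip model, not Elasticsearch's discovery code; the function name `gossip_rounds` and the rule "every informed node contacts one random peer per round" are assumptions made for illustration.

```python
import random

def gossip_rounds(n_nodes, seed=0):
    """Count push-gossip rounds until every node has heard the rumor."""
    rng = random.Random(seed)
    informed = {0}                      # node 0 starts with the new information
    rounds = 0
    while len(informed) < n_nodes:
        # every informed node pushes the rumor to one peer chosen at random
        targets = {rng.randrange(n_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

for n in (10, 100, 1000, 10000):
    print(n, gossip_rounds(n))
```

Growing the cluster by a factor of ten adds only a few rounds, which is the logarithmic behaviour described above.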

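Master election (why is a master node needed?)

In a distributed system with leader nodes, such as Kafka, the leader's role cannot be ignored: data must be written through the leader (partition), which then propagates it to the replica nodes (partitions), and the write is acknowledged only after all of that succeeds, which ensures strong consistency. Cassandra actually reads and writes data in a similar way but is more flexibly configurable: you can set how many nodes must respond before a read or write counts as successful.

Elasticsearch is not in the same category as Kafka, but it has a master node too. This master, however, is on an equal footing with the other nodes when it comes to data processing. It only maintains the cluster state, for example which shards live on which nodes, which nodes are master-eligible, and the cluster settings. Note: the seed nodes mentioned earlier are not necessarily master nodes.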

The master node is the only node in a cluster that can make changes to the cluster state.

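In other words, the master node only controls writes to metadata.

So why must writes be controlled by a single master node?

As an aside: if any node could accept a write, the system could not guarantee the order of writes, which would hurt the cluster's eventual consistency. Because the data is partitioned (sharded), though, a single node is very unlikely to become a bottleneck for the write performance of the whole system. This holds for decentralized and centralized distributed systems alike.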

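When the master is about to update the cluster state, it broadcasts the update to the other nodes, and each node that receives it replies with an ack. If the master does not receive acks from the required number of master-eligible nodes (normally a quorum, i.e. a majority: N/2 + 1 with integer division) within the configured time, the update is not applied. Once enough master-eligible nodes have responded, the master commits the update and publishes the state to the other nodes. This is also one of the core ideas of the Paxos and Raft consensus algorithms.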
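The ack-then-commit flow can be sketched in a few lines. This is a simplified model under stated assumptions, not Elasticsearch's implementation: `send` is a hypothetical transport callback, timeouts are folded into its return value, and the master's own vote is not treated specially.

```python
def quorum(master_eligible):
    """Majority of the master-eligible nodes: N/2 + 1 with integer division."""
    return master_eligible // 2 + 1

def publish(update, eligible_nodes, send):
    """Two-phase publish: collect acks first, commit only with a quorum.

    `send(node, update)` is a hypothetical transport callback that returns
    True when the node acknowledged in time and False otherwise.
    """
    acks = sum(1 for node in eligible_nodes if send(node, update))
    if acks < quorum(len(eligible_nodes)):
        return False          # not enough acks: the update is dropped
    return True               # enough acks: the master commits and publishes

reachable = {"n1", "n2"}
ok = publish({"version": 7}, ["n1", "n2", "n3"], lambda node, _: node in reachable)
print(ok)  # True: 2 acks out of 3 eligible nodes reach the quorum of 2
```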

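Handling split brain

A situation that systems without a central node easily run into is a network partition: if the network splits into two or more isolated networks, then after a while two masters may well be elected. Elasticsearch deals with this as follows:

Master election depends on a sufficient number of master-eligible nodes, set through discovery.zen.minimum_master_nodes. If the number of master-eligible nodes is below this value, no election takes place; if a master already exists, the state updates it broadcasts are not applied either.

So when configuring the cluster, it is best to set this value to the quorum size, i.e. N/2 + 1. When a network partition splits the cluster into two or more parts, at most one of them can elect a master. The others cannot, so no improper writes happen that would lead to inconsistent or even lost data.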
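A small worked example shows why the quorum value matters. The helper `partitions_able_to_elect` and the 5-node cluster split 3/2 are made up for illustration; only the comparison against minimum_master_nodes comes from the text above.

```python
def partitions_able_to_elect(partition_sizes, minimum_master_nodes):
    """Return the partitions that still see enough master-eligible nodes."""
    return [size for size in partition_sizes if size >= minimum_master_nodes]

total = 5                                   # master-eligible nodes in the cluster
split = [3, 2]                              # a network partition into two sides

safe = partitions_able_to_elect(split, total // 2 + 1)   # quorum = 3
risky = partitions_able_to_elect(split, 2)                # value set too low

print(safe)    # [3]
print(risky)   # [3, 2]
```

With the quorum setting only the 3-node side can elect a master; with the lower value both sides can, which is exactly the split-brain case.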

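Read and write path

As mentioned above, the master is on an equal footing with the other nodes for data processing; any node can handle index/doc read and write requests.
For a read request, since nodes hold redundant replicas, the node returns the result directly if it can find the data locally.
For a write request, the node first hashes the document's routing value (by default its _id), then takes that hash modulo the number of primary shards of the index to get the target shard. From the cluster state it learns which node holds that shard and sends the update there. The target node then forwards the update to the replica shards on other nodes, and once the operation completes the result is returned back up the chain, which is the same as writing to a partition in Kafka.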
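The routing step can be sketched as follows. Note the assumptions: `zlib.crc32` is only a stand-in for Elasticsearch's own hash function, and `shard_to_node` is a hypothetical excerpt of the cluster state, reusing the node names from the diagram later in this post.

```python
import zlib

def route_to_shard(routing_value, number_of_primary_shards):
    """Map a routing value (the document _id by default) to a primary shard."""
    digest = zlib.crc32(routing_value.encode("utf-8"))   # stand-in hash
    return digest % number_of_primary_shards

# hypothetical cluster state: primary shard number -> node holding it
shard_to_node = {0: "node_1", 1: "node_3", 2: "node_1"}

shard = route_to_shard("doc-42", len(shard_to_node))
print(shard, shard_to_node[shard])
```

Because the shard number is a remainder modulo the primary count, the same document would map to a different shard if that count changed.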

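Failure detection

Cluster health checking in Elasticsearch is done jointly by the master pinging the other nodes and the other nodes pinging the master.

This amounts to a black-or-white decision: after a ping to a node fails or times out and a certain number of retries also fail, the node is considered crashed. Only the master knows the state of the other nodes, and as soon as one of the other nodes decides the master is unreachable, it starts a round of master election.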
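A binary detector of this kind is easy to sketch. The callable `ping` and the retry count of 3 are illustrative assumptions, not values taken from Elasticsearch's configuration.

```python
def is_alive(ping, retries=3):
    """Binary failure detection: a node is dead once every attempt fails.

    `ping` is a hypothetical zero-argument callable that returns True on a
    timely reply and False on failure or timeout.
    """
    for _ in range(retries):
        if ping():
            return True
    return False

replies = iter([False, False, True])        # two lost pings, then a reply
print(is_alive(lambda: next(replies)))      # True

print(is_alive(lambda: False))              # False
```

The verdict is a plain yes/no with no notion of "probably slow", which is the contrast with accrual-style detectors.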

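(Cassandra, by contrast, uses an accrual failure detector built on gossip (in a single gossip exchange, any two nodes obtain the health state of all other nodes). It evaluates a target's health statistically from the detection history and from information reported by other nodes. This not only reduces false detections caused by a slow network but also gives a graded health level. I think this approach has a lower false-positive rate.)

NOTE: If you are interested in decentralized, eventually consistent coordination strategies and storage, I recommend the design documents and related papers of Cassandra / Dynamo / Serf. Elasticsearch is at its core about search; although it uses similar techniques, its documentation says very little about them. For strongly consistent systems with a central node, I recommend studying Kafka as a representative.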

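Distributed storage

Elasticsearch has two high-level terms, index and document, but when reading logs and changing configuration we run into shard and replica far more often.

Shard and replica are two basic concepts in distributed systems. Here is how they map onto Elasticsearch concretely: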

    index_1 -> primary shard_1 (node_1) => x replicas (node_2, node_3, node_4, ...)
           |-> primary shard_2 (node_3) => x replicas (node_1, node_n, node_5, ...)
           |-> primary shard_3 (node_1) => x replicas (node_3, node_4, node_2, ...)
           |-> ...
           |-> primary shard_n (node_2) => x replicas (node_n, ...)
                         ↓
              [doc_x, doc_y, doc_z, ...]

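That is: each index is split into multiple shards stored on different nodes, and docs are stored inside the shards. Each shard (called a primary shard) in turn has several replica shards, which are likewise spread across other nodes.

You can list the shards in the cluster with /_cat/shards?v: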

    GET /_cat/shards?v

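Shards

(Sharding is like division)

Splitting an index into shards makes it possible to access the data of one index directly through different nodes, which balances the read and write load and increases concurrency.

Replicas

(Replication is like multiplication)

The multiple replicas of each shard are essentially backups. They ensure data availability and system robustness, and they also raise query throughput (for the reason quoted below). It therefore makes no sense to place a replica shard on the same node as its primary shard.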

However, read requests—searches or document retrieval—can be handled by a primary or a replica shard, so the more copies of data that you have, the more search throughput you can handle.

Nodes

All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node.

Every node is implicitly a coordinating node, … … , which cannot be disabled. As a result, such a node needs to have enough memory and CPU in order to deal with the gather phase.

Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.

While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

Shard allocation

When does shard allocation (mapping shards to nodes) happen? The official documentation explains:

This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.

The number of primary shards of an index is fixed:

This setting cannot be changed after index creation.

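Given that replicas have to change together with their primaries, and that documents are routed by a hash modulo the number of primary shards, changing the number of primaries is not really practical.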

operation

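Adding nodes (scale up)

Add members to the cluster for more capacity, better performance, disaster tolerance, and so on.

Configuring node discovery

As the concept chapter said, Elasticsearch uses gossip as the foundation of cluster coordination, so adding a node to a cluster is very easy. It now offers unicast node discovery, which is the equivalent of Cassandra's seed nodes: when configuring a new node, you only need to provide a list of seed nodes: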

    discovery.zen.ping.unicast.hosts: ["192.168.5.68", "192.168.5.69"]

Starting or removing a node causes the cluster to settle again in O(log N) time on average, after which all nodes know each other's state.

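Choosing the node type

According to the official documentation, there are 5 types of Elasticsearch node:

master-eligible node
data node
ingest node
tribe node
coordinating node

As said earlier, though, coordinating is a built-in capability of every node. The node type can be configured through the following fields: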

    node.master: false|true
    node.data: false|true
    node.ingest: true|false

When all three options are set to false, the node is a coordinating-only node.
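The relationship between the three flags and the resulting role can be written down as a tiny helper. `describe_node` is a hypothetical function for illustration; it only mirrors the rule stated above and assumes all three flags are enabled unless turned off.

```python
def describe_node(master=True, data=True, ingest=True):
    """List the roles implied by node.master / node.data / node.ingest."""
    roles = []
    if master:
        roles.append("master-eligible")
    if data:
        roles.append("data")
    if ingest:
        roles.append("ingest")
    # every node can coordinate; with nothing else enabled that is all it does
    return roles or ["coordinating-only"]

print(describe_node())                                        # ['master-eligible', 'data', 'ingest']
print(describe_node(master=False, data=False, ingest=False))  # ['coordinating-only']
```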

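Increasing the number of replicas of an index

The operator should first confirm whether this step is really necessary, because too many shards can cause shard allocation to fail (described later).

    PUT /<index_name>/_settings
    {"number_of_replicas": <n>}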

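Removing nodes (scale down)

The following describes how to remove a data node safely.

Data migration

The Elasticsearch REST API supports migrating data without downtime by rebalancing shards & replicas. There are two levels of shard relocation:

index level
cluster level

For data migration we mostly use the latter, because most of the time we do not care which index has to be on which nodes; we only need to look at nodes and shards from the cluster's point of view. According to the official documentation: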

cluster-level shard allocation filtering allows you to allow or disallow the allocation of shards from any index to particular nodes.

The typical use case for cluster-wide shard allocation filtering is when you want to decommission a node, and you would like to move the shards from that node to other nodes in the cluster before shutting it down.

With this, data migration becomes simple:

    # The nodes to exclude can be matched by _ip, _name or _host; this example uses _ip
    PUT /_cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.exclude._ip": "<comma-separated list of IPs>"
      }
    }

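How do you check the migration progress? There are several ways; here are two:

    GET /_nodes/<node_name>/stats/indices?pretty

    # Migration is complete when the excluded node's document count is 0.


    GET /_cluster/health?pretty

    # Migration is complete when relocating_shards is 0.

You can now shut down the target node safely.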
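The decommission steps can be sketched as two small helpers: one builds the settings body, the other decides when the node is drained. Both function names are hypothetical, the IP is the example address from the discovery configuration earlier, and the HTTP calls themselves are left out so the sketch stays self-contained.

```python
import json

def exclude_by_ip(ips):
    """Body for PUT /_cluster/settings that drains shards off the given IPs."""
    return json.dumps({
        "transient": {
            "cluster.routing.allocation.exclude._ip": ",".join(ips)
        }
    })

def drained(node_doc_count, cluster_health):
    """True once the excluded node is empty and nothing is still relocating."""
    return node_doc_count == 0 and cluster_health["relocating_shards"] == 0

body = exclude_by_ip(["192.168.5.68"])
print(body)
print(drained(0, {"relocating_shards": 0}))   # True
```

In practice `node_doc_count` would come from the node stats response and `cluster_health` from the cluster health response, polled until `drained` returns True.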

Cluster status

Red / Yellow / Green

Red: some primary shards are not allocated yet.
Yellow: all primary shards are allocated, but some replica shards are not yet. If the cluster stays Yellow, some primary shards probably have no replica (usually because there is only one node or because shard allocation is restricted).
Green: all shards are ready.
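The three colours follow mechanically from the shard states, as this sketch shows. The function `cluster_status` and its input format are assumptions for illustration; a shard counts as active here when it is STARTED or RELOCATING.

```python
ACTIVE = {"STARTED", "RELOCATING"}

def cluster_status(shards):
    """Derive red / yellow / green from (kind, state) pairs.

    `kind` is "p" for a primary or "r" for a replica, matching the prirep
    column of /_cat/shards; `state` is e.g. "STARTED" or "UNASSIGNED".
    """
    inactive = {kind for kind, state in shards if state not in ACTIVE}
    if "p" in inactive:
        return "red"
    if "r" in inactive:
        return "yellow"
    return "green"

print(cluster_status([("p", "STARTED"), ("r", "UNASSIGNED")]))  # yellow
```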

Getting cluster information

    /_cat/*
    /_cluster/*

Handling unassigned shards (taken from the Datadog blog)

    curl -XGET '<host>:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

First, list the unassigned shards together with the reason.
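If you would rather process the listing in a script than with grep, the same filter is a few lines of Python. The sample rows below are made up; the column order follows the `h=` parameter in the command above.

```python
def unassigned_shards(cat_output):
    """Yield (index, shard, prirep, reason) for every UNASSIGNED row."""
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            reason = fields[4] if len(fields) > 4 else ""
            yield fields[0], fields[1], fields[2], reason

# made-up sample of /_cat/shards output, for illustration only
sample = """\
logs-1 0 p STARTED
logs-1 0 r UNASSIGNED NODE_LEFT
logs-2 1 r UNASSIGNED INDEX_CREATED
"""
for row in unassigned_shards(sample):
    print(row)
```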

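If these shards belong to an index that you thought was already deleted or that you no longer need, you can clean them up by deleting the index:

    curl -XDELETE '<host>:9200/<index_name>/'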

Too many shards, too few nodes

Sometimes, when there are too many shards and too few nodes, some shards also remain unassigned, because:

the master node will not assign a primary shard to the same node as its replica, nor will it assign two replicas of the same shard to the same node.

This means that the number of replicas of every index in the cluster must be smaller than the number of nodes:

    Nodes ≥ Replicas + 1

When we hit this problem, we either add nodes or lower the replica factor:

    curl -XPUT '<host>:9200/<index_name>/_settings' -H 'Content-Type: application/json' -d '{"number_of_replicas": <n>}'
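The inequality Nodes ≥ Replicas + 1 can be turned into a quick feasibility check. The helper names are hypothetical, and the sketch assumes that the one-copy-per-node rule is the only allocation constraint in play.

```python
def max_assignable_replicas(node_count):
    """Each copy of a shard needs its own node, so replicas <= nodes - 1."""
    return max(node_count - 1, 0)

def unassigned_copies(node_count, primaries, replicas):
    """Replica copies that cannot be placed for one index."""
    extra = max(replicas - max_assignable_replicas(node_count), 0)
    return primaries * extra

print(unassigned_copies(node_count=3, primaries=5, replicas=2))  # 0
print(unassigned_copies(node_count=2, primaries=5, replicas=2))  # 5
```

With 2 nodes and 2 replicas, one replica per primary has nowhere to go, so an index with 5 primaries leaves 5 copies unassigned.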

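Shard allocation is not enabled

Shard allocation has to be disabled during a rolling upgrade; remember to enable it again afterwards:

    curl -XPUT '<host>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    { "transient":
      { "cluster.routing.allocation.enable" : "all"
      }
    }'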

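The master has a record of the shards, but the shards cannot be found

This usually happens when the node holding the primary shards has gone offline, or when the shard data on it is corrupted.

Disk usage has reached the low watermark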

ES will not allocate new shards to nodes once they have more than 85% disk used

Disk usage can be checked with the following API:

    GET /_cat/allocation?v

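You can raise the watermark above the default 85%, for example to 90%:

    curl -XPUT '<host>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
        "transient": {
          "cluster.routing.allocation.disk.watermark.low": "90%"
        }
    }'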

Multiple Elasticsearch instances running different versions

the master node will not assign a primary shard’s replicas to any node running an older version

This is also why shard allocation is disabled before a rolling upgrade.
