leo-project / leofs

The LeoFS Storage System

Home Page: https://leo-project.net/leofs/


INCONSISTENT HASH | manager node and other nodes.

varuntanwar opened this issue · comments

Hi,

We have 3 Nodes running.
LEOFS version: 1.3.8-1

root@prd-leofs001:/home/varun.tanwar# leofs-adm status
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash |
                previous ring-hash |
-----------------------------------+----------

 [State of Node(s)]
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                    node                    |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@prd-leofs001                     | running      | e92344d1       | e92344d1       | 2019-09-16 16:05:13 +0550
  S    | storage_1@prd-leofs002                     | running      | e92344d1       | e92344d1       | 2019-09-16 15:33:49 +0550
  S    | storage_2@prd-leofs003                     | running      | e92344d1       | e92344d1       | 2019-09-16 15:34:27 +0550
  G    | gateway_http@prd-leofs003                  | running      | e92344d1       | e92344d1       | 2019-09-16 15:35:27 +0550
  G    | gateway_s3@prd-leofs001                    | running      | e92344d1       | e92344d1       | 2019-09-16 15:34:54 +0550
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------

I have tried restarting all services (stop gateway -> stop storage -> stop manager, then start in the reverse order: manager -> storage -> gateway).
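
For reference, the stop/start sequence looked like the following sketch. The systemd unit names leofs-gateway, leofs-storage and leofs-manager-master/-slave are assumptions based on the packaged services; adjust them to however your nodes launch LeoFS.

# stop: gateways first, then storage nodes, then the managers (sketch)
ssh prd-leofs003 'systemctl stop leofs-gateway'      # gateway_http
ssh prd-leofs001 'systemctl stop leofs-gateway'      # gateway_s3
for h in prd-leofs001 prd-leofs002 prd-leofs003; do
    ssh "$h" 'systemctl stop leofs-storage'
done
systemctl stop leofs-manager-master                  # on the manager host(s); also the slave, if any

# start in the reverse order: manager -> storage -> gateway
systemctl start leofs-manager-master
for h in prd-leofs001 prd-leofs002 prd-leofs003; do
    ssh "$h" 'systemctl start leofs-storage'
done
ssh prd-leofs001 'systemctl start leofs-gateway'
ssh prd-leofs003 'systemctl start leofs-gateway'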

When I run the whereis command, I get the following error:

leofs-adm whereis s3:///
[ERROR] Could not get ring
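
(Note: whereis normally takes a bucket/key path rather than an s3:// URL, but the argument does not matter here; while the manager cannot serve the RING it fails the same way. Hypothetical example:)

leofs-adm whereis mybucket/photos/1.jpg    # hypothetical bucket/key
[ERROR] Could not get ring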

When I configure s3cmd, I get the following error:

Test access with supplied credentials? [Y/n] y
Please wait, attempting to list all buckets...
ERROR: Test failed: 403 (AccessDenied): Access Denied
ERROR: Are you sure your keys have s3:ListAllMyBuckets permissions?

but when I upload a file via s3cmd, it succeeds.
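
In case it is relevant, the s3cmd side looks roughly like this (a sketch: the endpoint, port, bucket and key values are placeholders, and only the .s3cfg keys that point s3cmd at the LeoFS gateway are shown):

# .s3cfg entries (placeholders) pointing s3cmd at the gateway
#   host_base    = prd-leofs001:8080
#   host_bucket  = %(bucket)s.prd-leofs001:8080
#   use_https    = False
#   signature_v2 = True

# what the manager knows about keys and bucket ownership
leofs-adm get-users
leofs-adm get-buckets

# the listing that fails with 403 vs. the upload that succeeds
s3cmd ls
s3cmd put ./test.txt s3://mybucket/test.txt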

The RING may be broken, as shown below:

 Manager RING hash
-----------------------------------+----------
                 current ring-hash |
                previous ring-hash |
-----------------------------------+----------

You need to rebuild the RING. Refer to my comments below:
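
Before rebuilding, a non-destructive first step is to dump the RING each node currently holds and compare the dumps (a sketch; dump-ring writes its output locally on the target node, typically under that node's log/ring directory, and the manager node name below is a placeholder):

leofs-adm dump-ring storage_0@prd-leofs001
leofs-adm dump-ring storage_1@prd-leofs002
leofs-adm dump-ring storage_2@prd-leofs003
leofs-adm dump-ring manager_0@<manager-host>    # placeholder manager node name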

@yosukehara

On the same thread, one user mentioned that after following those steps, all users and buckets disappeared.

Should we go ahead? We can't afford any data loss.

I have another question:

When I configure s3cmd, I get the following error:

Test access with supplied credentials? [Y/n] y
Please wait, attempting to list all buckets...
ERROR: Test failed: 403 (AccessDenied): Access Denied
ERROR: Are you sure your keys have s3:ListAllMyBuckets permissions?

but when I upload/download a file via s3cmd, it succeeds.

Is this because of the broken RING?


I can only provide this solution. I recommend that you build a test environment and test it there first.

@yosukehara I got the same problem.
I tried to fix the managers' RING with the solution in #1185 (comment), but the old RING was not restored; a new RING was generated instead. So I restored Mnesia on the LeoManager nodes.
I found another solution in #1078 (comment) and it helps, but if I restart the LeoManager service the RING becomes broken again.

@yosukehara I can add the repeating error message from the LeoManager error log:
[W] manager_0@10.6.0.40 2020-02-29 20:55:44.439025 +0300 1582998944 leo_manager_api:brutal_synchronize_ring_1/2 2143 [{cause,invalid_db}]

@wulfgar93
I've recognized that it seems to be an error in your LeoFS settings, as shown below:

leo_manager_api:brutal_synchronize_ring_1/2 2143 [{cause,invalid_db}]

This error is raised at leo_redundant_manager/src/leo_cluster_tbl_member.erl#L190-L191. Could you review your LeoFS settings with reference to this page?

@yosukehara I did not find any strange settings in the manager's conf file. Could you please take a look at its settings?

sasl.sasl_error_log = /var/log/leofs/leo_manager_0/sasl/sasl-error.log
sasl.errlog_type = error
sasl.error_logger_mf_dir = /var/log/leofs/leo_manager_0/sasl
sasl.error_logger_mf_maxbytes = 10485760
sasl.error_logger_mf_maxfiles = 5

manager.partner = manager_1@10.6.0.41
console.bind_address = 10.10.0.40
console.port.cui  = 10010
console.port.json = 10020
console.acceptors.cui = 3
console.acceptors.json = 16
console.histories.num_of_display = 200

system.dc_id = dc_1
system.cluster_id = leofs_1
consistency.num_of_replicas = 3
consistency.write = 2
consistency.read = 1
consistency.delete = 2
consistency.rack_aware_replicas = 0

mnesia.dir = /var/lib/leofs/work/mnesia/10.6.0.40
mnesia.dump_log_write_threshold = 50000
mnesia.dc_dump_limit = 40

log.log_level = 2
log.erlang = /var/log/leofs/leo_manager_0/erlang
log.app = /var/log/leofs/leo_manager_0/app
log.member_dir = /var/log/leofs/leo_manager_0/ring
log.ring_dir = /var/log/leofs/leo_manager_0/ring

queue_dir = /var/lib/leofs/work/queue
snmp_agent = /etc/leofs/leo_manager_0/snmpa_manager_0/LEO-MANAGER

rpc.server.acceptors = 16
rpc.server.listen_port = 13075
rpc.server.listen_timeout = 5000
rpc.client.connection_pool_size = 16
rpc.client.connection_buffer_size = 16

nodename = manager_0@10.6.0.40
distributed_cookie = <censored>
erlang.kernel_poll = true
erlang.asyc_threads = 32
erlang.max_ports = 64000
erlang.crash_dump = /var/log/leofs/leo_manager_0/erl_crash.dump
erlang.max_ets_tables = 256000
erlang.smp = enable
process_limit = 1048576
snmp_conf = /etc/leofs/leo_manager_0/leo_manager_snmp.config
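
Since manager_0 declares manager_1@10.6.0.41 as its partner, a further check I can run is whether the two managers' configurations and Mnesia directories agree (a sketch using standard tools; the /etc/leofs/leo_manager_1 path and manager_1's Mnesia path are assumptions mirroring manager_0's layout):

# compare the two managers' configuration files
ssh 10.6.0.40 'cat /etc/leofs/leo_manager_0/*.conf' > /tmp/manager_0.conf
ssh 10.6.0.41 'cat /etc/leofs/leo_manager_1/*.conf' > /tmp/manager_1.conf
diff /tmp/manager_0.conf /tmp/manager_1.conf   # only node-specific values should differ

# confirm both Mnesia directories exist and are populated
ssh 10.6.0.40 'ls -l /var/lib/leofs/work/mnesia/10.6.0.40'
ssh 10.6.0.41 'ls -l /var/lib/leofs/work/mnesia/10.6.0.41'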

@wulfgar93 I'm sorry for the delay in replying. Please share your LeoStorage configuration (not the LeoManager configuration).

@yosukehara

# cat /etc/leofs/leo_storage/leo_storage.d/20-leo_storage.conf
sasl.sasl_error_log = /var/log/leofs/leo_storage/sasl/sasl-error.log
sasl.errlog_type = error
sasl.error_logger_mf_dir = /var/log/leofs/leo_storage/sasl
sasl.error_logger_mf_maxbytes = 10485760
sasl.error_logger_mf_maxfiles = 5

managers = [manager_0@10.6.0.40, manager_1@10.6.0.41]

obj_containers.path = [/mnt/storage/leofs/leo_storage]
obj_containers.num_of_containers = [128]

obj_containers.metadata_storage = leveldb

object_storage.is_strict_check = true

watchdog.cpu.threshold_cpu_load_avg = 8.0

mq.backend_db = leveldb
mq.num_of_mq_procs = 8

backend_db.eleveldb.write_buf_size = 100663296
backend_db.eleveldb.max_open_files = 1000
backend_db.eleveldb.sst_block_size = 4096

log.erlang = /var/log/leofs/leo_storage/erlang
log.app = /var/log/leofs/leo_storage/app
log.member_dir = /var/log/leofs/leo_storage/ring
log.ring_dir = /var/log/leofs/leo_storage/ring
queue_dir  = /var/lib/leofs/leo_storage/work/queue
leo_ordning_reda.temp_stacked_dir = "/mnt/storage/leofs/leo_storage/work/ord_reda/"

nodename = stor0snode0@10.6.0.60
distributed_cookie = <censored>

erlang.crash_dump = /var/log/leofs/leo_storage/erl_crash.dump

snmp_conf = /etc/leofs/leo_storage/leo_storage_snmp.config

autonomic_op.compaction.is_enabled = true
autonomic_op.compaction.warn_active_size_ratio = 92
autonomic_op.compaction.threshold_active_size_ratio = 90

I've checked the LeoManager and LeoStorage configuration files, but I could not find an issue.

So I recommend that you rebuild your LeoFS RING. I will share the procedure soon.
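
Whatever the exact procedure turns out to be, please take cold backups of the managers' Mnesia directories first, since they hold the RING as well as the user and bucket records. A rough sketch, with paths taken from the configurations above (adapt them per node):

# on each manager, with leo_manager stopped: back up the Mnesia directory
tar czf /root/leo_manager_0-mnesia-$(date +%F).tar.gz /var/lib/leofs/work/mnesia/10.6.0.40

# on each storage node: note where the object containers live (do not touch them)
grep obj_containers.path /etc/leofs/leo_storage/leo_storage.d/20-leo_storage.conf
# -> obj_containers.path = [/mnt/storage/leofs/leo_storage]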

I've just remembered that I recommended the same solution last year: #1195 (comment)