bigchaindb / bigchaindb

Meet BigchainDB. The blockchain database.

Home Page: https://www.bigchaindb.com/

./stack.sh script is not working

jogamod opened this issue · comments

Starting point: Run a BigchainDB network

Describe the bug
After downloading the ./stack.sh script and running it, I get an error that pip is not found. After a while, I figured out that this script uses Dockerfile-alpine, and inside that Dockerfile the base image is alpine:latest. Since this Dockerfile was last updated two years ago, it no longer works. I then looked at the other Dockerfiles in the repository, saw that they use alpine:3.9, and tried that as well. Now the previous error is gone and everything builds successfully with failed=0 tasks.
But then, the Tendermint container keeps exiting with the following error:

9ff5c391179cab3ad98ae8a4a2d437f8f4ac126a@tendermint1:26656,3441ad15eafbdbb2de3f69390772997c0518b94e@tendermint2:26656
I[30076-07-30|07:19:28.904] Starting multiAppConn                        module=proxy impl=multiAppConn
I[30076-07-30|07:19:28.904] Starting socketClient                        module=abci-client connection=query impl=socketClient
E[30076-07-30|07:19:28.954] abci.socketClient failed to connect to tcp://bigchaindb1:26658.  Retrying... module=abci-client connection=query err="dial tcp: lookup bigchaindb1 on 127.0.0.11:53: no such host"
E[30076-07-30|07:19:32.014] abci.socketClient failed to connect to tcp://bigchaindb1:26658.  Retrying... module=abci-client connection=query err="dial tcp: lookup bigchaindb1 on 127.0.0.11:53: no such host"
E[30076-07-30|07:19:36.038] abci.socketClient failed to connect to tcp://bigchaindb1:26658.  Retrying... module=abci-client connection=query err="dial tcp 172.22.0.6:26658: connect: connection refused"
I[30076-07-30|07:19:39.039] Starting socketClient                        module=abci-client connection=mempool impl=socketClient
I[30076-07-30|07:19:39.040] Starting socketClient                        module=abci-client connection=consensus impl=socketClient
E[30076-07-30|07:19:39.049] Stopping abci.socketClient for error: EOF    module=abci-client connection=query
I[30076-07-30|07:19:39.049] Stopping socketClient                        module=abci-client connection=query impl=socketClient
ERROR: Failed to create node: Error during handshake: Error calling Info: EOF

To Reproduce
Just try running ./stack.sh

Expected behavior
The ./stack.sh script should deploy a Docker-based BigchainDB network with STACK_SIZE nodes.

Desktop (please complete the following information):

  • Distribution: Ubuntu 16.04

Hello @jogamod! Thanks for the report.
Indeed, Dockerfile-alpine was outdated... You now have to install pip3 as a separate package in Alpine.
I've updated the bigchaindb Docker containers and tried to run stack.sh. The Ansible tasks completed without errors, but Tendermint is crashing with a segfault. I'm not yet sure what is causing it. The Tendermint container is built from cloned source during the Ansible run, so it might need some tweaking.
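
For context, here is a minimal check of the Alpine packaging change being described; the package names are the standard Alpine ones, not quoted from the repo's Dockerfile:

# On recent Alpine images the python3 package no longer bundles pip,
# so pip3 has to be installed as its own package (py3-pip):
docker run --rm alpine:latest sh -c "apk add --no-cache python3 py3-pip && pip3 --version"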
At the moment you can try changing

stack_branch=${STACK_BRANCH:="master"}

to

stack_branch=${STACK_BRANCH:="update-deps"}

in stack.sh. This branch contains recent changes, including those to Dockerfile-alpine. But I suspect you will just reproduce the segfault.
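
Alternatively, since the script uses the ${VAR:=default} expansion form, you should be able to pick the branch from the environment without editing stack.sh at all (a sketch based on that assumption):

# ${STACK_BRANCH:="master"} only assigns the default when the variable is unset,
# so exporting it beforehand selects the branch without touching the script:
export STACK_BRANCH=update-deps
./stack.sh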

Hi @zzappie! Thank you for the answer and the changes, but do you know how to fix the Tendermint crashing bug, or when it will be fixed?

I'll examine it soon. I'm planning to make the 2.2.2 release next week, fixing most of the issues with pkg and k8s, so it is on my radar.

There are some issues with the ./stack.sh script and Tendermint v0.31.5: the files for the tendermint containers are not parsed correctly. To be exact, the pub_key of each validator in the validators array of genesis.json is parsed as null. This is caused by a bug in pkg/configuration/rules/tendermint/tasks/start.yaml on line 35, where the public key of the validator is parsed as follows:

cat tendermint/config/priv_validator$i.json | jq ".pub_key" | jq ". as \$k | {pub_key: \$k, power: \"10\",
          name: \"{{ tendermint_docker_name }}$i\"}" > pub_validator$i.json;

but it should be parsed like this (notice the change in the jq command after the first pipe: | jq ".Key.pub_key"):

cat tendermint/config/priv_validator$i.json | jq ".Key.pub_key" | jq ". as \$k | {pub_key: \$k, power: \"10\",
          name: \"{{ tendermint_docker_name }}$i\"}" > pub_validator$i.json;

I looked back at Tendermint version 0.22.8, and the private keys were formatted differently. This is an example of a key from Tendermint 0.22.8:

{"address":"2108C5673494A8D001E14B5147C2F31BB25DEE7B","pub_key":{"type":"tendermint/PubKeyEd25519","value":"oxZBNMis7tBZ6cRBAPdAE9zE+Df9/zNR8UQAT3lrNKw="},"last_height":"0","last_round":"0","last_step":0,"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"qzyUi/uEnSqbeO597YBffM0b40PvWuOAGpJ5iC/SqS6jFkE0yKzu0FnpxEEA90AT3MT4N/3/M1HxRABPeWs0rA=="}}

and this is the key from Tendermint v0.31.5:

{"Key":{"address":"9A46764C95C13D2EC4047E1859DD64A3767388A2","pub_key":{"type":"tendermint/PubKeyEd25519","value":"vaFUCOtk3btoAmkmFkxYXQ7GQ7fWnOdeaQM7nKEyyaM="},"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"2nc1GFKPtNw9ehcPMEj019AitcU06WA8FzEyoalMAzS9oVQI62Tdu2gCaSYWTFhdDsZDt9ac515pAzucoTLJow=="}},"LastSignState":{"height":"0","round":"0","step":0}}

Unfortunately, this fix doesn't solve the segfault, so I suppose there are other issues related to differences between the Tendermint versions.

Tendermint 0.22.8 error

Using Tendermint 0.22.8 (tm_version=${TM_VERSION:="0.22.8"} in stack.sh), the parsing works, but I get the same error that @jogamod described in the first post.

But the error that @jogamod described is probably not caused by Tendermint, as BigchainDB gets the connection from Tendermint and, I suppose, sends the error back:

[2020-08-28 11:08:31 +0000] [39] [INFO] Booting worker with pid: 39
[2020-08-28 11:08:31 +0000] [40] [INFO] Booting worker with pid: 40
[2020-08-28 11:08:31 +0000] [41] [INFO] Booting worker with pid: 41
[2020-08-28 11:08:31 +0000] [42] [INFO] Booting worker with pid: 42
[2020-08-28 11:08:32 +0000] [43] [INFO] Booting worker with pid: 43
[2020-08-28 11:08:32 +0000] [12] [DEBUG] 25 workers
[2020-08-28 11:08:32] [INFO] (abci.app)  ... connection from Tendermint: 172.19.0.9:45794 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (abci.app)  ... connection from Tendermint: 172.19.0.9:45796 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (abci.app)  ... connection from Tendermint: 172.19.0.9:45798 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (bigchaindb.core) Tendermint version: 0.22.8-40d6dc2e (MainProcess - pid: 1)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib/python3.8/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
    return handle(*args_tuple)
  File "/usr/lib/python3.8/site-packages/abci/server.py", line 162, in __handle_connection
    for message in messages:
  File "/usr/lib/python3.8/site-packages/abci/encoding.py", line 59, in read_messages
    m.ParseFromString(data)
  File "/usr/lib/python3.8/site-packages/google/protobuf/message.py", line 185, in ParseFromString
    self.MergeFromString(serialized)
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1083, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 633, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 612, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 636, in DecodeField
    raise _DecodeError('Unexpected end-group tag.')
google.protobuf.message.DecodeError: Unexpected end-group tag.
2020-08-28T11:08:32Z <Greenlet at 0x7f74b4c549d0: _handle_and_close_when_done(<bound method ABCIServer.__handle_connection of <a, <bound method StreamServer.do_close of <StreamServ, (<gevent._socket3.socket [closed] at 0x7f74b4a9068)> failed with DecodeError

Hello @artus! It's very easy to spot that you're a teammate of @jogamod looking at your avatar. :)
Thank you for the detailed bug report. I've created the stack-unstack-fix-wip branch where I'll be investigating this issue. I haven't looked at what's causing Tendermint to segfault yet. I'm not an Ansible person, so PRs are welcome.
BTW, if you want to create a test network now, you may also try bigchaindb-node-ansible.

Hey @zzappie, I think you got me mixed up with @aostrun there 😄

oops :)

Hi @zzappie, I've managed to find some other parsing issues within Tendermint's start.yml script. I fixed them, and now the stack.sh script works fine. I've created a PR with the fixes so you can take a look at them.

Hi everyone, I was going through the same problems that @jogamod described in this issue. After looking at the last comment from @aostrun, I decided to use the update-deps branch to run the stack.sh script, but something strange happened: there seems to be a memory reference error in the Tendermint containers. They come up without problems, but after a few minutes they shut down. Here is an example of the logs inside one of the containers:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xaf817d]

Maybe this is a new issue?
I'm using Ubuntu 18.04 LTS, Tendermint 0.22.8, Docker 19.03.12, and docker-compose 1.26.2.

PS: Sorry for the bad English or the really long text.

Hello everyone. I think I've hunted down the cause of the problem. You could try running stack.sh again; the Docker-based network is supposed to work on the update-deps branch. Please report back if it doesn't work for you.

Running make run directly from the update-deps branch worked fine, but when running the stack.sh script, the Tendermint containers still insist on throwing an error, this time a different one. I'm starting to think the issue is in stack.sh itself, but I can't say for sure. Oh, another thing: does running it as sudo make any difference (forget it if that doesn't make any sense lol)?

cp: can't stat '/tendermint_config/priv_validator_key4.json': No such file or directory
cp: can't stat '/tendermint_config/node_key4.json': No such file or directory
starting node with persistent peers set to:
11844418098d31343251e6e90c29e12a572a9598@tendermint1:26656,@tendermint2:26656,@tendermint3:26656,@tendermint4:26656
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xaf817d]

Edit: it seems that the /tendermint_config path should be /tendermint/config, at least that's what it looks like from the auto-generated Tendermint folders.

Hey @EduardoThums!
Try removing the already-created containers (e.g., to remove ALL containers you could run docker rm $(docker stop $(docker ps -aq))).
Run stack.sh again, and if it doesn't work, paste here the output of docker logs tm_config_gen and docker logs tendermint1.

Running with sudo should probably only affect the bootstrap.sh script. It runs commands like apt and adds files to the /etc folder, which is kind of bad. But it doesn't do anything dangerous.

Concerning your edit: /tendermint_config is a directory created on your host, and /tendermint/config is where the configuration folder is mounted in the Docker container.
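
For illustration only, that relationship is an ordinary bind mount, roughly like the sketch below; the image tag and the exact invocation are assumptions, not copied from stack.sh:

# Hypothetical bind mount: the host directory /tendermint_config shows up inside
# the container at /tendermint/config, where Tendermint looks for its config files.
docker run -d --name tendermint1 \
  -v /tendermint_config:/tendermint/config \
  tendermint/tendermint:v0.31.5 node --proxy_app=tcp://bigchaindb1:26658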

Thank you @zzappie!
Everything is working fine right now. After deleting the old containers (and their volumes as well), the problem seems to be resolved. I presume something with my old Docker volumes messed things up. Well, thank you very much again for your help, you are a life saver!
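
For reference, a hedged sketch of one way to do that cleanup with standard Docker commands (not necessarily the exact ones used here):

# Stop and remove all containers together with their anonymous volumes...
docker rm -v $(docker stop $(docker ps -aq))
# ...then drop any named volumes that are no longer referenced by a container.
docker volume prune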

Thank you for the kind words, @EduardoThums. Closing this issue since the fixes are now in master.