./stack.sh script is not working
jogamod opened this issue
Starting point: Run a BigchainDB network
Describe the bug
After downloading the ./stack.sh script and running it, I get a "pip not found" error. After a while, I figured out that this script uses Dockerfile-alpine, and that this Dockerfile is based on the alpine:latest image. Since the Dockerfile was last updated 2 years ago, it no longer works. I then looked at the other Dockerfiles in the repository and saw that they use alpine:3.9, so I tried that as well. Now the previous error is gone and everything builds successfully, with failed=0 tasks.
But then, the tendermint container is constantly exiting with the error:
9ff5c391179cab3ad98ae8a4a2d437f8f4ac126a@tendermint1:26656,3441ad15eafbdbb2de3f69390772997c0518b94e@tendermint2:26656
I[30076-07-30|07:19:28.904] Starting multiAppConn module=proxy impl=multiAppConn
I[30076-07-30|07:19:28.904] Starting socketClient module=abci-client connection=query impl=socketClient
E[30076-07-30|07:19:28.954] abci.socketClient failed to connect to tcp://bigchaindb1:26658. Retrying... module=abci-client connection=query err="dial tcp: lookup bigchaindb1 on 127.0.0.11:53: no such host"
E[30076-07-30|07:19:32.014] abci.socketClient failed to connect to tcp://bigchaindb1:26658. Retrying... module=abci-client connection=query err="dial tcp: lookup bigchaindb1 on 127.0.0.11:53: no such host"
E[30076-07-30|07:19:36.038] abci.socketClient failed to connect to tcp://bigchaindb1:26658. Retrying... module=abci-client connection=query err="dial tcp 172.22.0.6:26658: connect: connection refused"
I[30076-07-30|07:19:39.039] Starting socketClient module=abci-client connection=mempool impl=socketClient
I[30076-07-30|07:19:39.040] Starting socketClient module=abci-client connection=consensus impl=socketClient
E[30076-07-30|07:19:39.049] Stopping abci.socketClient for error: EOF module=abci-client connection=query
I[30076-07-30|07:19:39.049] Stopping socketClient module=abci-client connection=query impl=socketClient
ERROR: Failed to create node: Error during handshake: Error calling Info: EOF
To Reproduce
Just try running ./stack.sh
Expected behavior
The ./stack.sh script deploys a Docker-based BigchainDB network with STACK_SIZE nodes.
Desktop (please complete the following information):
- Distribution: Ubuntu 16.04
Hello @jogamod! Thanks for the report.
Indeed, Dockerfile-alpine was outdated. You now have to install pip3 as a separate package on Alpine.
I've updated the bigchaindb Docker containers and tried to run stack.sh. The Ansible tasks ran without errors, but Tendermint is crashing with a segfault. I'm not yet sure what is causing it. The Tendermint container is built from cloned source during the Ansible run, so it might need some tweaking.
For now, you can try changing
stack_branch=${STACK_BRANCH:="master"}
to
stack_branch=${STACK_BRANCH:="update-deps"}
in stack.sh. This branch contains recent changes including those to Dockerfile-alpine. But I suspect you will just reproduce the segfault.
Hi, @zzappie! Thank you for the answer and the changes. Do you know how to fix the Tendermint crash, or when it will be fixed?
I'll look into it soon. I'm planning to make a 2.2.2 release next week that fixes most of the issues with pkg and k8s, so it is on my radar.
There are some issues with the ./stack.sh script and Tendermint v0.31.5: the files for the Tendermint containers are not parsed correctly. To be exact, the pub_key of each validator in the validators array of genesis.json is parsed as null. This is caused by a bug in pkg/configuration/rules/tendermint/tasks/start.yaml on line 35, where the public key of the validator is parsed as follows:
cat tendermint/config/priv_validator$i.json | jq ".pub_key" | jq ". as \$k | {pub_key: \$k, power: \"10\",
name: \"{{ tendermint_docker_name }}$i\"}" > pub_validator$i.json;
but it should be parsed like this (note the change in the jq command after the first pipe, which becomes jq ".Key.pub_key"):
cat tendermint/config/priv_validator$i.json | jq ".Key.pub_key" | jq ". as \$k | {pub_key: \$k, power: \"10\",
name: \"{{ tendermint_docker_name }}$i\"}" > pub_validator$i.json;
I looked back at Tendermint 0.22.8 and the private keys were formatted differently. This is an example of the keys from Tendermint 0.22.8:
{"address":"2108C5673494A8D001E14B5147C2F31BB25DEE7B","pub_key":{"type":"tendermint/PubKeyEd25519","value":"oxZBNMis7tBZ6cRBAPdAE9zE+Df9/zNR8UQAT3lrNKw="},"last_height":"0","last_round":"0","last_step":0,"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"qzyUi/uEnSqbeO597YBffM0b40PvWuOAGpJ5iC/SqS6jFkE0yKzu0FnpxEEA90AT3MT4N/3/M1HxRABPeWs0rA=="}}
and this is the key from Tendermint v0.31.5:
{"Key":{"address":"9A46764C95C13D2EC4047E1859DD64A3767388A2","pub_key":{"type":"tendermint/PubKeyEd25519","value":"vaFUCOtk3btoAmkmFkxYXQ7GQ7fWnOdeaQM7nKEyyaM="},"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"2nc1GFKPtNw9ehcPMEj019AitcU06WA8FzEyoalMAzS9oVQI62Tdu2gCaSYWTFhdDsZDt9ac515pAzucoTLJow=="}},"LastSignState":{"height":"0","round":"0","step":0}}
Unfortunately, this fix doesn't solve the segfault, so I suppose there are other issues related to differences between the Tendermint versions.
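For anyone comparing the two key layouts above, here is a minimal Python sketch of the same extraction that accepts either one. It is only an illustration of the difference, not what start.yaml does (that uses jq). The function name validator_entry and the name "tendermint1" are made up for the example, and the sample JSON is trimmed to the pub_key fields quoted in this comment.

```python
import json


def validator_entry(priv_validator_json, name, power="10"):
    """Build a genesis validator entry from either priv_validator layout."""
    data = json.loads(priv_validator_json)
    # Tendermint v0.31.5 nests the keys under "Key";
    # Tendermint 0.22.8 keeps pub_key at the top level.
    key_section = data.get("Key", data)
    pub_key = key_section.get("pub_key")
    if pub_key is None:
        # This is the situation that ended up as null in genesis.json.
        raise ValueError("pub_key not found in priv_validator data")
    return {"pub_key": pub_key, "power": power, "name": name}


# Trimmed samples: only the pub_key parts of the files quoted above.
old_layout = json.dumps({
    "pub_key": {
        "type": "tendermint/PubKeyEd25519",
        "value": "oxZBNMis7tBZ6cRBAPdAE9zE+Df9/zNR8UQAT3lrNKw=",
    }
})
new_layout = json.dumps({
    "Key": {
        "pub_key": {
            "type": "tendermint/PubKeyEd25519",
            "value": "vaFUCOtk3btoAmkmFkxYXQ7GQ7fWnOdeaQM7nKEyyaM=",
        }
    }
})

old_entry = validator_entry(old_layout, "tendermint1")
new_entry = validator_entry(new_layout, "tendermint1")
print(old_entry["pub_key"]["value"])
print(new_entry["pub_key"]["value"])
```

Raising an error when pub_key is missing, instead of silently writing null, would have made the original problem show up at the parsing step rather than later in the Tendermint container.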
Tendermint 0.22.8 error
Using Tendermint 0.22.8 (tm_version=${TM_VERSION:="0.22.8"} in stack.sh), the parsing works, but I get the same error that @jogamod described in the first post.
But the error that @jogamod described is probably not caused by Tendermint, as bigchaindb receives the connection from Tendermint and, I suppose, sends the error back:
[2020-08-28 11:08:31 +0000] [39] [INFO] Booting worker with pid: 39
[2020-08-28 11:08:31 +0000] [40] [INFO] Booting worker with pid: 40
[2020-08-28 11:08:31 +0000] [41] [INFO] Booting worker with pid: 41
[2020-08-28 11:08:31 +0000] [42] [INFO] Booting worker with pid: 42
[2020-08-28 11:08:32 +0000] [43] [INFO] Booting worker with pid: 43
[2020-08-28 11:08:32 +0000] [12] [DEBUG] 25 workers
[2020-08-28 11:08:32] [INFO] (abci.app) ... connection from Tendermint: 172.19.0.9:45794 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (abci.app) ... connection from Tendermint: 172.19.0.9:45796 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (abci.app) ... connection from Tendermint: 172.19.0.9:45798 ... (MainProcess - pid: 1)
[2020-08-28 11:08:32] [INFO] (bigchaindb.core) Tendermint version: 0.22.8-40d6dc2e (MainProcess - pid: 1)
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
File "/usr/lib/python3.8/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
return handle(*args_tuple)
File "/usr/lib/python3.8/site-packages/abci/server.py", line 162, in __handle_connection
for message in messages:
File "/usr/lib/python3.8/site-packages/abci/encoding.py", line 59, in read_messages
m.ParseFromString(data)
File "/usr/lib/python3.8/site-packages/google/protobuf/message.py", line 185, in ParseFromString
self.MergeFromString(serialized)
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1083, in MergeFromString
if self._InternalParse(serialized, 0, length) != length:
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
pos = field_decoder(buffer, new_pos, end, self, field_dict)
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 633, in DecodeField
if value._InternalParse(buffer, pos, new_pos) != new_pos:
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
pos = field_decoder(buffer, new_pos, end, self, field_dict)
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 612, in DecodeRepeatedField
if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1120, in InternalParse
pos = field_decoder(buffer, new_pos, end, self, field_dict)
File "/usr/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 636, in DecodeField
raise _DecodeError('Unexpected end-group tag.')
google.protobuf.message.DecodeError: Unexpected end-group tag.
2020-08-28T11:08:32Z <Greenlet at 0x7f74b4c549d0: _handle_and_close_when_done(<bound method ABCIServer.__handle_connection of <a, <bound method StreamServer.do_close of <StreamServ, (<gevent._socket3.socket [closed] at 0x7f74b4a9068)> failed with DecodeError
Hello @artus! It's very easy to spot that you're a teammate of @jogamod just by looking at your avatar. :)
Thank you for the detailed bug report. I've created the stack-unstack-fix-wip branch, where I'll be investigating this issue. I haven't looked at what is causing Tendermint to segfault yet. I'm not an Ansible person, so PRs are welcome.
BTW, if you want to create a test network now, you may also try bigchaindb-node-ansible.
Hey @zzappie , I think you got me mixed up with @aostrun there 😄
oops :)
Hi @zzappie, I've managed to find some other parsing issues in Tendermint's start.yml script. I fixed them and now the stack.sh script works fine. I've created a PR with the fixes so you can take a look at them.
Hi everyone, I was running into the same problems that @jogamod described in this issue. After looking at the last comment from @aostrun, I decided to use the update-deps branch to run the stack.sh script, but something strange happened: there seems to be a memory reference error in the Tendermint containers. They come up without problems, but after a few minutes they shut down. Here is an example of the logs from one of the containers:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xaf817d]
Maybe this is a new issue?
I'm using Ubuntu 18.04 LTS, Tendermint 0.22.8, docker 19.03.12 and docker-compose 1.26.2.
PS: Sorry for the bad English or the really long text.
Hello everyone. I think I've hunted down the cause of the problem. The Docker-based network started by stack.sh is supposed to work on the update-deps branch now, so you could try running it. Please report back if it doesn't work for you.
Running make run directly from the update-deps branch worked fine, but when running the stack.sh script, the Tendermint containers still throw an error, this time a different one. I'm starting to think the issue is in stack.sh itself, but I can't say for sure. Oh, another thing: does it make any difference to run it with sudo (forget it if this doesn't make any sense lol)?
cp: can't stat '/tendermint_config/priv_validator_key4.json': No such file or directory
cp: can't stat '/tendermint_config/node_key4.json': No such file or directory
starting node with persistent peers set to:
11844418098d31343251e6e90c29e12a572a9598@tendermint1:26656,@tendermint2:26656,@tendermint3:26656,@tendermint4:26656
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xaf817d]
Edit: it seems that the /tendermint_config path should be /tendermint/config, at least that's what it looks like from the auto-generated Tendermint folders.
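One thing visible in the log above is that three of the four persistent peers have an empty node ID (nothing before the @). A small Python sketch to spot such entries when checking logs follows; the function name peers_missing_node_id is made up for this example, and the peers string is copied from the log. Whether the empty IDs are what triggers the panic is only a guess.

```python
def peers_missing_node_id(persistent_peers):
    """Return the addresses of peers whose node ID (the part before '@') is empty."""
    missing = []
    for peer in persistent_peers.split(","):
        # partition splits on the first '@': an empty left side means no node ID.
        node_id, _, address = peer.strip().partition("@")
        if not node_id:
            missing.append(address)
    return missing


# Peers string copied from the container log above.
peers = (
    "11844418098d31343251e6e90c29e12a572a9598@tendermint1:26656,"
    "@tendermint2:26656,@tendermint3:26656,@tendermint4:26656"
)
missing_peers = peers_missing_node_id(peers)
print(missing_peers)
```

For the log above this lists tendermint2, tendermint3 and tendermint4, which lines up with the cp errors about missing key files.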
Hey @EduardoThums!
Try removing the already created containers. For example, to remove ALL containers you could run docker rm $(docker stop $(docker ps -aq)).
Run stack.sh again and, if it still doesn't work, paste here the output of docker logs tm_config_gen and docker logs tendermint1.
Running with sudo should probably only affect the bootstrap.sh script. It does run commands like apt and adds files in the /etc folder, which is kind of bad, but it doesn't do anything dangerous.
Concerning your edit: /tendermint_config is a directory created on your host, and /tendermint/config is where the configuration folder is mounted in the Docker container.
Thank you @zzappie!
Everything is working fine now. After deleting the old containers (and their volumes as well), the problem seems to be resolved. I presume that something in my old Docker volumes was messing things up. Well, thank you very much again for your help, you are a life saver!
Thank you for the kind words, @EduardoThums. Closing this issue since the fixes are now in master.