xuwang / kube-aws-terraform

KAT - Kubernetes cluster on AWS with Terraform


Challenge with Make Addons

agentbond007 opened this issue · comments

This section of the Makefile for Addons:

      @scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
	core@${KUBE_API_DNSNAME}:/var/lib/kubernetes/admin.pem \
	core@${KUBE_API_DNSNAME}:/var/lib/kubernetes/admin-key.pem \
	core@${KUBE_API_DNSNAME}:/var/lib/kubernetes/kube-apiserver-ca.pem \
	 ${KUBERNETES_WORKDIR}/

Produces error:
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
make[2]: *** [do-config] Error 1
make[1]: *** [kube-config] Error 2
make: *** [add-ons] Error 2
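For reference, a bare connectivity check with the same ssh options (taking scp out of the picture) would be something like this - KUBE_API_DNSNAME is the same variable the Makefile uses:

# verbose ssh to the API endpoint with the options the Makefile passes to scp;
# stopping at "ssh_exchange_identification" means the remote side closes the
# connection before key exchange (ELB not ready, security group, or sshd timing)
ssh -v -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    core@${KUBE_API_DNSNAME} true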

The directory does exist on both the local machine and kube-cluster-master:
/var/lib/kubernetes/admin.pem

The identity is added with no problem from vault
Identity added: /Users/admin/.ssh/kube-cluster-master.pem

Any ideas?

Thank you.

Thank you Xueshan. The issue appears to be with getting "kubectl" configured correctly.
I can download the cluster.pem, but the scp command isn't placing the
admin.pem
admin-key.pem
kube-apiserver-ca.pem

files in their proper place.

I have also added "kube-api.buildmodels.net" to my hosts file.

I tried scp -o core@kube-api.buildmodels.net:/var/lib/kubernetes/admin.pem
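A complete single-file copy with the same options the Makefile uses would be something like:

# copy one cert by hand into the current directory (options taken from the Makefile above)
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    core@kube-api.buildmodels.net:/var/lib/kubernetes/admin.pem .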

I don't see where you get admin.pem and admin-key.pem in your script.

Appreciate any assistance.

@agentbond007 Thank you for trying this repo!

I just built a cluster to make sure recent code changes didn't introduce a regression... and I didn't run into this problem. However, in the past I did occasionally hit a similar ssh problem, mostly because of timing; if I ran make add-on again, or cd resources/add-on; make kube-config, it would copy the certs over and generate ~/.kube/config.

All the pem files are provided by vault PKI and generated as part of the master's bootstrap. The code is under resources/master/artifacts/upload/setup.sh; setup.sh calls get-certs.sh in the same upload directory to generate the etcd, kube-apiserver, and admin certs.
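Conceptually, the cert generation is just a Vault PKI issue call. As a rough illustration only (the pki mount path and the kubernetes role name below are placeholders, not necessarily what get-certs.sh uses):

# issue a cert from a Vault PKI backend (placeholder mount and role names)
vault write -format=json pki/issue/kubernetes common_name=admin ttl=8760h > /tmp/admin.json
# the JSON response carries .data.certificate, .data.private_key, and .data.issuing_ca,
# which map onto files like admin.pem, admin-key.pem, and kube-apiserver-ca.pem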

The master has dependencies on etcd and vault. If retrying make kube-config doesn't work for you, here are some troubleshooting tips - you have full access to the cluster you built :)

  • First, log in to the master:
$ cd resources/master
$ make ssh
  • Check etcd status - need to be root (sudo su)
# source source /etc/profile.d/etcdctl.sh
etcdctl cluster-health
member 64c6ca31c84f1803 is healthy: got healthy result from https://10.240.2.12:2379
cluster is healthy

If it is not healthy, restart etcd:

# systemctl restart etcd2
  • Check vault status
core@ip-10-240-3-5-my-kube-master ~ $ vault status
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0
Unseal Nonce:
Version: 0.6.5
Cluster Name: vault-cluster-88c43ad7
Cluster ID: bec05d9b-6cac-46cf-e55c-321b952eadfb

High-Availability Enabled: false

  • Check certs
core@ip-10-240-3-5-my-kube-master /var/lib/kubernetes $ ls -lrt
total 56
-rw-------. 1 core root 2074 May 31 05:48 admin.pem
-rw-------. 1 core root 1675 May 31 05:48 admin-key.pem
-rw-r--r--. 1 root root 2098 May 31 05:48 kube-apiserver.pem
-rw-------. 1 root root 1679 May 31 05:48 kube-apiserver-key.pem
-rw-r--r--. 1 root root 1826 May 31 05:48 kube-apiserver-ca.pem
-rw-r--r--. 1 root root   83 May 31 05:48 token.csv
-rw-r--r--. 1 root root 3247 May 31 05:48 service-account-key.pem
  • Check master log:
# journalctl -f -u kube-apiserver.service
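Once the certs do copy over and ~/.kube/config is generated, a quick sanity check from your workstation could be:

kubectl get componentstatuses   # etcd, scheduler, and controller-manager should report Healthy
kubectl get nodes               # worker nodes should show up as Ready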

Oh, don't forget to update your /etc/hosts file with the kube-apiserver's IP if you did not delegate a domain name - ELB IPs can change over time.
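If you have not delegated the domain, one way to refresh that entry (the ELB DNS name and IP below are placeholders):

# resolve the API ELB's current address, then map it to the API hostname in /etc/hosts
dig +short my-kube-apiserver-1234567890.us-east-1.elb.amazonaws.com
sudo sh -c 'echo "203.0.113.10 kube-api.buildmodels.net" >> /etc/hosts'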

Let me know if any of the trouble-shooting shows obvious errors.

Hi Xu. Thank you for your comments. I'm continuing to investigate.
Still not working.

Please see below. My master doesn't seem to have /etc/profile.d/etcdctl.sh.

Also, there is no /var/lib/kubernetes.

The kube-config part isn't working, so I followed some of your steps.

I went to: cd resources/master

master git:(master) ✗ make ssh
Permitted 22 from 45.49.236.46/32 to master...
Last login: Tue Jun 6 00:22:01 UTC 2017 from 45.49.236.46 on pts/0
Container Linux by CoreOS stable (1353.8.0)
core@ip-10-240-3-9 ~ $ sudo su
ip-10-240-3-9 core # source source /etc/profile.d/etcdctl.sh
bash: source: No such file or directory
ip-10-240-3-9 core # systemctl restart etcd2
ip-10-240-3-9 core # source source /etc/profile.d/etcdctl.sh
bash: source: No such file or directory
ip-10-240-3-9 core # vault status
bash: vault: command not found
ip-10-240-3-9 core # /var/lib/kubernetes $ ls -lrt
bash: /var/lib/kubernetes: No such file or directory
ip-10-240-3-9 core # /var/lib/kubernetes
bash: /var/lib/kubernetes: No such file or directory
ip-10-240-3-9 core # cd var
bash: cd: var: No such file or directory

Any ideas? Thank you very, very much for any clues.

Also, vault doesn't seem to be installed on the master.

@agentbond007 how about the etcd cluster?

ssh-add ~/.ssh/<cluster-name>-etcd.pem
cd resources/etcd
make ssh

Can you run vault status and etcdctl cluster-health as the core user?

All the bootstrap files and logs are located under /root/bootstrap on each machine. If the etcd cluster status is okay, I'd just reboot the master if you haven't tried it - does /root/bootstrap/config/cloud-config.yaml have a size of zero? Either reboot, or just halt the machine and let the ASG replace the master...
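A quick way to check that file, as root on the master:

# a size of 0 means the bootstrap never managed to download the config
ls -l /root/bootstrap/config/cloud-config.yaml
head -5 /root/bootstrap/config/cloud-config.yaml   # a healthy cloud-config starts with "#cloud-config"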

Some of the vault troubleshooting docs are here: https://github.com/xuwang/kube-aws-terraform/blob/master/docs/02-vault-pki.md.

Thank you Xu. Very strange issues. I rebooted both the master and etcd. The etcd cluster has the same issue:
core@ip-10-240-2-7 ~ $ sudo su
ip-10-240-2-7 core # source source /etc/profile.d/etcdctl.sh
bash: source: No such file or directory

Also, Vault is not installed on either the master or etcd:
ip-10-240-2-7 / # vault status
bash: vault: command not found

Where in the original build do the machines install the Vault binary and set the PATH?

Also, in /root/bootstrap/config/, cloud-config.yaml is EMPTY.

Halted the MASTER. The ASG recreated it. Same problem. :(

Could it be export TF_VAR_vault_release=0.7.0 in envs.sh?

Thank you for any help.

@agentbond007 The default value of var.vault_release is 0.7.0, though it doesn't hurt to explicitly define it with TF_VAR_vault_release. The problem doesn't appear to be related to the vault version. A few more things to check, whenever you get a chance:

  1. Does the vault server status show that there are mount points, and is it running normally?
  2. Take a look at the S3 buckets. They are named with the cluster name as a prefix. In the --cloudinit folder, does each role (worker, etcd, etc.) have a cloud-config.yaml file, and does the content look like a normal YAML file? (See the sketch after this list.)
  3. If it does, then there are some download issues. You can try to curl the user-data and run it on a machine, for example:
curl 169.254.169.254/latest/user-data > /tmp/user-data
sh -x /tmp/user-data
  4. Which region are you in? We use signature v4 to download metadata. I wonder if your region supports it...
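For item 2, a hypothetical way to eyeball those objects from the CLI (the bucket and key names below are placeholders; yours are prefixed with your cluster name):

aws s3 ls s3://mycluster-cloudinit/ --recursive
aws s3 cp s3://mycluster-cloudinit/master/cloud-config.yaml - | head -5   # should be non-empty YAML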

BTW, I also had a typo in one of the troubleshooting steps - source source /etc/profile.d/etcdctl.sh should be just source /etc/profile.d/etcdctl.sh.

Thank you Xu. I re-cloned the project and set up a new envs.sh file. I ran make cluster | tee /tmp/build.log. Same problem. I looked in the S3 buckets: the cluster-config bucket has folders, but the Vault S3-backend bucket doesn't contain any objects.
I'm in region us-east-1. I have attached the build.log too.

build.txt

I downloaded it and masked the file name.

@agentbond007 Can you also send the journald log from vault?
Log in to vault and run journalctl > /tmp/vault.log.

It would also be helpful to share envs.sh (with specifics sanitized) so I can try to reproduce. I did test us-east-1 before; it supports the S3 v4 signature.

@agentbond007 I was able to reproduce this in us-east-1. Your build logs all look right, but the S3 download during system bootstrap fails. You can see it in /tmp/curlLog.log... I am looking into it. Sorry!

Yes Xu! :) Thank you. I'm excited about the KAT project and maybe someday I can help contribute. I'm good at "proofreading". Xièxie (thank you)!

@agentbond007 The S3 bucket download problem is fixed in xuwang/bash-s3#1.

If you don't want to rebuild the cluster, you can reboot the EC2 instances one by one in this order (you can do it from the console): vault, etcd, master, nodes. Before you reboot the other machines, make sure to log on to vault and run sudo su -; vault status, which should show an unsealed status, and vault mounts should show the PKI backend mounts.
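In terminal form, that pre-reboot check on the vault instance looks like:

$ sudo su -
# vault status     # "Sealed: false" means the server is unsealed
# vault mounts     # should list the pki backend mount(s) used for the cluster certs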

Hey, you are so very nice to help with the debugging and documentation! Thank you for sticking with KAT!

Success on us-east-1. I had to reboot the MASTER once at the beginning, but the cluster was healthy after that. Successfully ran "make ui"; the Dashboard was created.
Thanks Xu and team. Great work.

@agentbond007 Glad you have your cluster up! Thanks!