After creating an etcd instance on a node and joining it to the master, the etcd pod ended up in the CrashLoopBackOff state.
Debug steps:
step 1) On the master, check the status of the newly added etcd pod
[root@k8s-master kubernetes]# kubectl get pods --namespace=kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
coredns-77d4bbc998-4rmpg 1/1 Running 0 23h 10.244.0.5 k8s-master
etcd-huleib.eng.platformlab.ibm.com 0/1 CrashLoopBackOff 9 23m 9.111.252.241 huleib.eng.platformlab.ibm.com
etcd-k8s-master 1/1 Running 0 23h 9.111.252.196 k8s-master
kube-apiserver-k8s-master 1/1 Running 0 23h 9.111.252.196 k8s-master
kube-controller-manager-k8s-master 1/1 Running 0 23h 9.111.252.196 k8s-master
kube-flannel-ds-wrktv 1/1 Running 0 22h 9.111.252.241 huleib.eng.platformlab.ibm.com
kube-flannel-ds-xd52v 1/1 Running 0 22h 9.111.252.196 k8s-master
kube-proxy-2nwpw 1/1 Running 0 23h 9.111.252.241 huleib.eng.platformlab.ibm.com
kube-proxy-9kvb4 1/1 Running 0 23h 9.111.252.196 k8s-master
kube-scheduler-k8s-master 1/1 Running 0 23h 9.111.252.196 k8s-master
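With more pods in the namespace, the failing one is easy to miss in the listing above. A minimal sketch of a filter that keeps only pods whose STATUS is not Running; the function name unhealthy_pods is made up for illustration, and it assumes STATUS is the third column, as in the output shown here:

```shell
# Print "<name> <status>" for every pod whose STATUS column is not "Running".
# Reads the table printed by `kubectl get pods` on stdin.
# Assumption: STATUS is column 3 and the first line is the header.
unhealthy_pods() {
    awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Usage on the master:
#   kubectl get pods --namespace=kube-system -o wide | unhealthy_pods
```

For the listing above this should leave only the etcd-huleib.eng.platformlab.ibm.com line with CrashLoopBackOff.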
step 2) Look at the events to see where the pod gets stuck
[root@k8s-master kubernetes]# kubectl get events --namespace=kube-system -o wide
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
15m 15m 1 etcd-huleib.eng.platformlab.ibm.com.14f4bb15ddf09e42 Pod Normal SuccessfulMountVolume kubelet, huleib.eng.platformlab.ibm.com MountVolume.SetUp succeeded for volume "etcd"
15m 15m 1 etcd-huleib.eng.platformlab.ibm.com.14f4bb15e6a4bf79 Pod spec.containers{etcd} Normal Pulling kubelet, huleib.eng.platformlab.ibm.com pulling image "gcr.io/google_containers/etcd-amd64:3.0.17"
14m 14m 1 etcd-huleib.eng.platformlab.ibm.com.14f4bb2c84316077 Pod spec.containers{etcd} Normal Pulled kubelet, huleib.eng.platformlab.ibm.com Successfully pulled image "gcr.io/google_containers/etcd-amd64:3.0.17"
13m 14m 4 etcd-huleib.eng.platformlab.ibm.com.14f4bb2c84cd0412 Pod spec.containers{etcd} Normal Created kubelet, huleib.eng.platformlab.ibm.com Created container
13m 14m 3 etcd-huleib.eng.platformlab.ibm.com.14f4bb2c8a9c15ab Pod spec.containers{etcd} Normal Started kubelet, huleib.eng.platformlab.ibm.com Started container
13m 14m 3 etcd-huleib.eng.platformlab.ibm.com.14f4bb2cacf5e68d Pod spec.containers{etcd} Normal Pulled kubelet, huleib.eng.platformlab.ibm.com Container image "gcr.io/google_containers/etcd-amd64:3.0.17" already present on machine
13m 14m 6 etcd-huleib.eng.platformlab.ibm.com.14f4bb2ce912e87f Pod spec.containers{etcd} Warning BackOff kubelet, huleib.eng.platformlab.ibm.com Back-off restarting failed container
35s 14m 67 etcd-huleib.eng.platformlab.ibm.com.14f4bb2ce914473e Pod Warning FailedSync kubelet, huleib.eng.platformlab.ibm.com Error syncing pod
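Most of the events above are Normal; only the BackOff and FailedSync rows point at the problem. A small sketch that keeps the header plus the Warning rows; warning_events is a hypothetical helper name, and the match is a plain text match on " Warning ", so a Normal event whose message happens to contain that word would also slip through:

```shell
# Keep the header line and every event row whose TYPE is "Warning".
# Reads the table printed by `kubectl get events` on stdin.
# Assumption: matching the word surrounded by spaces is good enough,
# because the SUBOBJECT column may be empty and shifts the column count.
warning_events() {
    awk 'NR == 1 || / Warning /'
}

# Usage on the master:
#   kubectl get events --namespace=kube-system -o wide | warning_events
```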
step 3) Check whether it is a network problem: connectivity between the master and the node
[root@k8s-master kubernetes]# ping 9.111.252.241
PING 9.111.252.241 (9.111.252.241) 56(84) bytes of data.
64 bytes from 9.111.252.241: icmp_seq=1 ttl=64 time=0.268 ms
64 bytes from 9.111.252.241: icmp_seq=2 ttl=64 time=0.266 ms
64 bytes from 9.111.252.241: icmp_seq=3 ttl=64 time=0.239 ms
64 bytes from 9.111.252.241: icmp_seq=4 ttl=64 time=0.264 ms
64 bytes from 9.111.252.241: icmp_seq=5 ttl=64 time=0.291 ms
64 bytes from 9.111.252.241: icmp_seq=6 ttl=64 time=0.277 ms
^C
--- 9.111.252.241 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.239/0.267/0.291/0.022 ms
step 4) Check the pod log to locate the error
[root@k8s-master kubernetes]# kubectl logs etcd-huleib.eng.platformlab.ibm.com --namespace=kube-system
2017-11-07 07:18:56.817661 I | etcdmain: etcd Version: 3.0.17
2017-11-07 07:18:56.817706 I | etcdmain: Git SHA: cc198e2
2017-11-07 07:18:56.817710 I | etcdmain: Go Version: go1.6.4
2017-11-07 07:18:56.817713 I | etcdmain: Go OS/Arch: linux/amd64
2017-11-07 07:18:56.817717 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2017-11-07 07:18:56.817743 W | etcdmain: found invalid file/dir default.etcd under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
2017-11-07 07:18:56.817750 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-11-07 07:18:56.817818 I | etcdmain: listening for peers on http://9.111.252.241:2380
2017-11-07 07:18:56.817840 I | etcdmain: listening for client requests on 127.0.0.1:2379
2017-11-07 07:18:56.817858 I | etcdmain: listening for client requests on 9.111.252.241:2379
2017-11-07 07:18:56.823349 I | etcdmain: stopping listening for client requests on 9.111.252.241:2379
2017-11-07 07:18:56.823365 I | etcdmain: stopping listening for client requests on 9.111.252.241:2379
2017-11-07 07:18:56.823373 I | etcdmain: stopping listening for peers on http://9.111.252.241:2380
2017-11-07 07:18:56.823382 I | etcdmain: --initial-cluster must include etcd-huleib=http://9.111.252.241:2380 given --initial-advertise-peer-urls=http://9.111.252.241:2380
2017-11-07 07:18:56.823387 I | etcdmain: forgot to set --initial-cluster flag?
2017-11-07 07:18:56.823393 I | etcdmain: if you want to use discovery service, please set --discovery flag.
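The log says that --initial-cluster has to contain an entry of the form name=peer-url for this member itself, here etcd-huleib=http://9.111.252.241:2380. The sketch below mimics that check for a single name and a single peer URL. It is a simplified approximation and not etcd's own validation code; the function name initial_cluster_has_member is invented for this example:

```shell
# Return 0 if the --initial-cluster value lists the given member.
#   $1 = value of --initial-cluster (comma-separated name=url pairs)
#   $2 = member name
#   $3 = advertised peer URL (value of --initial-advertise-peer-urls)
# Assumption: exactly one peer URL per member, compared as a literal string.
initial_cluster_has_member() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -Fqx "$2=$3"
}

# Usage:
#   initial_cluster_has_member "etcd-huleib=http://9.111.252.241:2380" \
#       etcd-huleib http://9.111.252.241:2380 && echo "entry present"
```

An empty or unrelated --initial-cluster value makes the function return non-zero, which corresponds to the case etcd complains about in the log.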
Solution: on the node, add the missing --initial-cluster parameter to the etcd configuration file
[root@huleib ~]# grep initial-cluster /etc/kubernetes/manifests/etcd.yaml
- --initial-cluster="etcd-huleib=http://9.111.252.241:2380"
Final test:
Check the status on the node:
[root@huleib pki]# etcdctl cluster-health
member 3411c12055f5c575 is healthy: got healthy result from http://9.111.252.241:2379
Check the status on the master:
[root@k8s-master net.d]# etcdctl --endpoints "http://9.111.252.241:2379" cluster-health
member 3411c12055f5c575 is healthy: got healthy result from http://9.111.252.241:2379
cluster is healthy
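To repeat this check from a script, the cluster-health output can be scanned for the summary line. A minimal sketch, assuming the output format shown above; cluster_is_healthy is a hypothetical helper, not part of etcdctl:

```shell
# Read `etcdctl cluster-health` output on stdin, print how many members
# reported healthy, and return 0 only if the summary line
# "cluster is healthy" is present.
cluster_is_healthy() {
    awk '/^member .* is healthy:/ { n++ }
         /^cluster is healthy$/   { ok = 1 }
         END { printf "%d healthy member(s)\n", n; exit (ok ? 0 : 1) }'
}

# Usage on the master:
#   etcdctl --endpoints "http://9.111.252.241:2379" cluster-health | cluster_is_healthy
```

The function deliberately relies on the summary line, so output that stops after the member lines is treated as not confirmed healthy.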