Solution - Troubleshooting 'unable to connect to etcd' Error Message
Issue
- When reviewing the status of the Ondat daemonset pods named
storageos-node-xxxx
, the pods are fail to startup and are stuck in a CrashLoopBackOff
state loop.
# Get the pods in the "storageos" namespace
kubectl get pods --namespace storageos
storageos storageos-node-6v9x2 3/3 Running 2 (16s ago) 55s
storageos storageos-node-g7vhn 2/3 CrashLoopBackOff 1 (14s ago) 55s
storageos storageos-node-ph567 3/3 Running 2 (17s ago) 55s
storageos storageos-node-tw8fk 2/3 CrashLoopBackOff 1 (12s ago) 55s
storageos storageos-node-xtvgd 3/3 Running 2 (12s ago) 55s
- Upon reviewing the logs for one of the failing daemonset pods, there is an
"unable to connect to etcd"
error message and "failed to initialise store client"
error message that shows up before the pod shutsdown and restarts.
# Check the logs of a Ondat daemonset that is in a "CrashLoopBackOff" loop.
kubectl logs storageos-node-tw8fk --namespace storageos
{"endpoints":["http://10.73.16.8:2379"],"error":"context deadline exceeded","level":"error","msg":"unable to connect to etcd","store":"etcd","time":"2022-02-13T19:17:35.556128015Z"}
{"error":"failed to instantiate ETCD: context deadline exceeded","level":"error","msg":"failed to initialise store client","time":"2022-02-13T19:17:35.55625352Z"}
{"level":"info","msg":"shutting down","time":"2022-02-13T19:17:35.556274893Z"}
Root Cause
- This issue is cause by the Ondat daemonset pods not being able to connect to the
etcd
cluster. This is generally caused by etcd
being unreachable due to network partitioning or misconfiguration.
Resolution
- Ensure that your
etcd
cluster is healthy.
- Ensure that etcd is running.
- Ensure that the
etcd
URL that is set in the StorageOSCluster
custom resource is correct and it includes the port number.
- Ensure that the
etcd
peers are routable from the etcd
advertise addresses of each peer, and not only from a load balancer.
- Ensure that
etcd
is routable from the network where the worker nodes reside.