Solution - Troubleshooting etcd 'failed to join existing cluster' Error Message During Peer Discovery

Issue

You are experiencing an issue where nodes cannot successfully join the cluster. Upon investigation, you also notice the following error messages in the logs:

# Truncated output...
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
# Truncated output...

Root Cause

Ondat uses a gossip protocol to discover nodes in the cluster. When Ondat starts, one or more nodes can be referenced so new nodes can query existing nodes for the list of members.

  • The error demonstrated in the code snippet above indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.
  • If there are no active nodes, the bootstrap process will elect the first node in the JOIN variable as master, and the rest will try to discover from it. In case of that node not starting, the whole cluster will remain unable to bootstrap.
  • Deployments of Ondat use a Kubernetes DaemonSet, and by default do not schedule Ondat pods to master nodes, due to the presence of the » node-role.kubernetes.io/master:NoSchedule taint that is typically present. In such cases the JOIN variable must not contain master nodes or the Ondat cluster will not start.

Resolution

  • Check and ensure that the first node of the JOIN variable started properly.
# Describe the daemonset and grep for "JOIN".
kubectl --namespace storageos describe daemonset.apps/storageos-node | grep "JOIN"

    JOIN:          172.28.128.3,172.28.128.4,172.28.128.5

# Check for the pod with the "172.28.128.3" IP address.
kubectl --namespace storageos get pods --output wide | grep 172.28.128.3

storageos-node-8zqxl   1/1       Running   0          2m        172.28.128.3   node1
  • Make sure that the JOIN variable doesn’t specify the master nodes. In case you are using the discovery service, it is necessary to ensure that the Ondat daemonset won’t allocate pods on the master nodes. This can be achieved with taints, node selectors or labels.
  • For deployments with the Ondat operator you can specify which nodes to deploy Ondat on using nodeSelectors.
  • For more advanced deployments that are using compute-only and storage nodes, check the » storageos.com/computeonly=true label that can be added to the nodes through Kubernetes node labels has been configured correctly.

References