Solution - Troubleshooting etcd '503 Service Unavailable' Error Message During Peer Discovery


You have noticed that nodes cannot successfully join the cluster. Upon investigation, you also notice the following error messages in the logs after 1 minute of uptime:

# Truncated output...
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address= category=etcd host=node3 module=cp target=
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint=",,," error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
# Truncated output...
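
When the logs are long, it can help to filter them for the join failures shown above. The following is a minimal sketch, assuming the log lines keep the key=value layout seen in this article; the names parse_fields and join_failures are hypothetical helpers and are not part of any Ondat tooling.

```python
import re

# key=value pairs; the value is either double-quoted or a bare (possibly empty) token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_fields(line):
    """Parse one key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            raw = raw[1:-1]
        fields[key] = raw
    return fields


def join_failures(lines):
    """Return the parsed error-level entries that mention a 503."""
    hits = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") != "error":
            continue
        if fields.get("status_code") == "503" or "503" in fields.get("error", ""):
            hits.append(fields)
    return hits


# Two lines copied from the output above.
SAMPLE = [
    'time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503',
    'time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp',
]

for entry in join_failures(SAMPLE):
    print(entry["time"], entry["msg"])
```

Only the error-level line is reported; the retry notice is informational and is skipped.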

Root Cause

Ondat uses a gossip protocol to discover nodes in the cluster. When Ondat starts, one or more nodes can be referenced so new nodes can query existing nodes for the list of members.

  • The error shown in the log output above indicates that the node cannot connect to any of the nodes in the known list. The known list is defined in the JOIN variable.
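
To make the failure mode concrete, the sketch below tries a TCP connection to every entry in a JOIN value. It rests on assumptions that this article does not confirm: that JOIN is a comma-separated list of plain hostnames or IP addresses, and that port 5705 (the port used in the nc example in this article) is the one to probe. The function names are hypothetical.

```python
import socket

DEFAULT_PORT = 5705  # assumption: the port used in this article's nc example


def parse_join(join_value):
    """Split a comma-separated JOIN value into a list of non-empty hosts."""
    return [h.strip() for h in join_value.split(",") if h.strip()]


def can_connect(host, port=DEFAULT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_join_list(join_value, port=DEFAULT_PORT, timeout=3.0):
    """Map each host in the JOIN value to whether it is reachable."""
    return {host: can_connect(host, port, timeout) for host in parse_join(join_value)}
```

Calling check_join_list("node01,node02,node03") on the node that fails to join returns a dictionary of host to True or False. If every value is False, the node cannot reach any known member, which matches the symptom described above.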


  1. Check that the ports Ondat requires are not being blocked by a firewall.

    # Connect to, for example "node04" on port 5705 from "node06" using "nc".
    nc -zv node04 5705
    Ncat: Version 7.50 (  )
    Ncat: Connected to
    Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
  2. Ondat exposes network diagnostics through its API, which can be viewed from the Ondat CLI. The diagnostic results show information from all known cluster members. If all the ports are blocked during the first bootstrap of the cluster, the diagnostic results won’t show any data because the nodes could not register:

    💡 The command below is only available in Ondat v1 deployments. For Ondat v2 deployments, end users can generate diagnostic and support bundles to get more information about the connectivity in the cluster.

    # Check the network diagnostic results from the Ondat CLI (Ondat v1 deployments only).
    storageos cluster connectivity
    SOURCE  NAME            ADDRESS            LATENCY      STATUS  MESSAGE
    node4   node2.nats                         1.949275ms   OK
    node4   node3.api                          3.070574ms   OK
    node4   node3.nats                         2.989238ms   OK
    node4   node2.directfs                     2.925707ms   OK
    node4   node3.etcd                         2.854726ms   OK
    node4   node3.directfs                     2.833371ms   OK
    node4   node1.api                          2.714467ms   OK
    node4   node1.nats                         2.613752ms   OK
    node4   node1.etcd                         2.594159ms   OK
    node4   node1.directfs                     2.601834ms   OK
    node4   node2.api                          2.598236ms   OK
    node4   node2.etcd                         16.650625ms  OK
    node3   node4.nats                         1.304126ms   OK
    node3   node4.api                          1.515218ms   OK
    node3   node2.directfs                     1.359827ms   OK
    node3   node1.api                          1.185535ms   OK
    node3   node4.directfs                     1.379765ms   OK
    node3   node1.etcd                         1.221176ms   OK
    node3   node1.nats                         1.330122ms   OK
    node3   node2.api                          1.238541ms   OK
    node3   node1.directfs                     1.413574ms   OK
    node3   node2.etcd                         1.214273ms   OK
    node3   node2.nats                         1.321145ms   OK
    node1   node4.directfs                     1.140797ms   OK
    node1   node3.api                          1.089252ms   OK
    node1   node4.api                          1.178439ms   OK
    node1   node4.nats                         1.176648ms   OK
    node1   node2.directfs                     1.529612ms   OK
    node1   node2.etcd                         1.165681ms   OK
    node1   node2.api                          1.29602ms    OK
    node1   node2.nats                         1.267454ms   OK
    node1   node3.nats                         1.485657ms   OK
    node1   node3.etcd                         1.469429ms   OK
    node1   node3.directfs                     1.503015ms   OK
    node2   node4.directfs                     1.484ms      OK
    node2   node1.directfs                     1.275304ms   OK
    node2   node4.nats                         1.261422ms   OK
    node2   node4.api                          1.465532ms   OK
    node2   node3.api                          1.252768ms   OK
    node2   node3.nats                         1.212332ms   OK
    node2   node3.directfs                     1.192792ms   OK
    node2   node3.etcd                         1.270076ms   OK
    node2   node1.etcd                         1.218522ms   OK
    node2   node1.api                          1.363071ms   OK
    node2   node1.nats                         1.349383ms   OK
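
With many nodes the connectivity listing becomes long, so scanning it by eye is error-prone. The sketch below parses output of the shape shown above and returns rows whose status is not OK or whose latency exceeds a threshold. It assumes the whitespace-separated layout seen in this article, treats the ADDRESS column as optional, and uses a hypothetical 10 ms threshold chosen only for illustration; parse_connectivity and problems are illustrative names, not part of the Ondat CLI.

```python
import re

# Latency tokens such as "1.484ms" or "16.650625ms".
_LATENCY = re.compile(r"^(\d+(?:\.\d+)?)(ns|us|µs|ms|s)$")
_TO_MS = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1.0, "s": 1e3}


def parse_connectivity(text):
    """Parse connectivity output into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] == "SOURCE":
            continue  # skip blank lines, the command line, and the header
        # The latency token follows NAME, or ADDRESS when that column is present.
        for i in range(2, len(tokens) - 1):
            match = _LATENCY.match(tokens[i])
            if match:
                rows.append({
                    "source": tokens[0],
                    "name": tokens[1],
                    "latency_ms": float(match.group(1)) * _TO_MS[match.group(2)],
                    "status": tokens[i + 1],
                    "message": " ".join(tokens[i + 2:]),
                })
                break
    return rows


def problems(rows, max_latency_ms=10.0):
    """Return rows that are not OK or are slower than the threshold."""
    return [r for r in rows if r["status"] != "OK" or r["latency_ms"] > max_latency_ms]


# Three rows copied from the output above.
SAMPLE = """
node4   node2.api  2.598236ms   OK
node4   node2.etcd  16.650625ms  OK
node3   node4.nats  1.304126ms   OK
"""

for row in problems(parse_connectivity(SAMPLE)):
    print(row["source"], row["name"], row["latency_ms"], row["status"])
```

In the sample, only the node4 to node2.etcd row is flagged, because its latency is above the illustrative threshold while its status is still OK.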