Solution - Troubleshooting etcd '503 Service Unavailable' Error Message During Peer Discovery
Issue
You have noticed that nodes cannot successfully join the cluster. Upon investigation, you also notice the following error messages in the logs after 1 minute of uptime:
# Truncated output...
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
# Truncated output...
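To see which peers a node tried to contact, the endpoint list can be pulled out of the "failed to join existing cluster" entry. The following is a minimal sketch that assumes the key=value log layout shown above; `join_endpoints` is a hypothetical helper name, not part of any Ondat tooling.

```shell
#!/bin/sh
# join_endpoints: read log lines on stdin and print, one per line, every
# endpoint named in a "failed to join existing cluster" entry.
# Assumes the key=value layout of the log excerpt in this article.
join_endpoints() {
  grep 'msg="failed to join existing cluster"' \
    | sed -n 's/.*endpoint="\([^"]*\)".*/\1/p' \
    | tr ',' '\n' \
    | sort -u
}
```

Pipe a node's logs into the function, for example `join_endpoints < node3.log`, where `node3.log` is a hypothetical saved log file. How the logs are collected depends on the deployment.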
Root Cause
Ondat uses a gossip protocol to discover nodes in the cluster. When Ondat starts, one or more nodes can be referenced so new nodes can query existing nodes for the list of members.
- The error demonstrated in the code snippet above indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.
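Because every address in the known list is a candidate contact point, it can help to test each one individually from the node that fails to join. The sketch below splits a comma-separated list and probes one port per address with `nc`. It assumes the list is written as plain comma-separated addresses, like the endpoint list in the log output above; other JOIN formats are not handled. `probe_join_list` is a hypothetical helper name, and the default port 5705 is the API port used elsewhere in this article.

```shell
#!/bin/sh
# probe_join_list LIST [PORT]
# Try a TCP connection to PORT (default 5705) on every address in a
# comma-separated LIST and report the result per address.
# Assumption: LIST looks like "172.28.128.3,172.28.128.4".
# The body runs in a subshell so IFS and variables do not leak out.
probe_join_list() (
  list="$1"
  port="${2:-5705}"
  reachable=0
  set -f      # no filename expansion while splitting the list
  IFS=','     # split the list on commas
  for addr in $list; do
    if nc -z -w 3 "$addr" "$port" >/dev/null 2>&1; then
      echo "$addr:$port reachable"
      reachable=$((reachable + 1))
    else
      echo "$addr:$port unreachable"
    fi
  done
  # Succeed only if at least one listed node accepted the connection.
  [ "$reachable" -gt 0 ]
)
```

For example, `probe_join_list "$JOIN"` checks the API port on each listed node, and `probe_join_list "$JOIN" 5707` repeats the check against another port. A non-zero exit status matches the situation described above, where no node in the list can be reached.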
Resolution
- Check that the ports Ondat requires are not being blocked by a firewall.
# Connect to, for example "node04" on port 5705 from "node06" using "nc".
nc -zv node04 5705
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.1.166:5705.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
- Ondat exposes network diagnostics through its API, which can be viewed from the Ondat CLI. The diagnostic results show information from all known cluster members. If all the ports are blocked during the first bootstrap of the cluster, the diagnostic results won’t show any data because the nodes could not register:
💡 The command below is only available in Ondat v1 deployments. For Ondat v2 deployments, end users can generate diagnostic and support bundles to get more information about the connectivity in the cluster.
# Check the network diagnostic results from the Ondat CLI (Ondat v1 deployments only).
storageos cluster connectivity
SOURCE  NAME            ADDRESS            LATENCY      STATUS  MESSAGE
node4   node2.nats      172.28.128.4:5708  1.949275ms   OK
node4   node3.api       172.28.128.5:5705  3.070574ms   OK
node4   node3.nats      172.28.128.5:5708  2.989238ms   OK
node4   node2.directfs  172.28.128.4:5703  2.925707ms   OK
node4   node3.etcd      172.28.128.5:5707  2.854726ms   OK
node4   node3.directfs  172.28.128.5:5703  2.833371ms   OK
node4   node1.api       172.28.128.3:5705  2.714467ms   OK
node4   node1.nats      172.28.128.3:5708  2.613752ms   OK
node4   node1.etcd      172.28.128.3:5707  2.594159ms   OK
node4   node1.directfs  172.28.128.3:5703  2.601834ms   OK
node4   node2.api       172.28.128.4:5705  2.598236ms   OK
node4   node2.etcd      172.28.128.4:5707  16.650625ms  OK
node3   node4.nats      172.28.128.6:5708  1.304126ms   OK
node3   node4.api       172.28.128.6:5705  1.515218ms   OK
node3   node2.directfs  172.28.128.4:5703  1.359827ms   OK
node3   node1.api       172.28.128.3:5705  1.185535ms   OK
node3   node4.directfs  172.28.128.6:5703  1.379765ms   OK
node3   node1.etcd      172.28.128.3:5707  1.221176ms   OK
node3   node1.nats      172.28.128.3:5708  1.330122ms   OK
node3   node2.api       172.28.128.4:5705  1.238541ms   OK
node3   node1.directfs  172.28.128.3:5703  1.413574ms   OK
node3   node2.etcd      172.28.128.4:5707  1.214273ms   OK
node3   node2.nats      172.28.128.4:5708  1.321145ms   OK
node1   node4.directfs  172.28.128.6:5703  1.140797ms   OK
node1   node3.api       172.28.128.5:5705  1.089252ms   OK
node1   node4.api       172.28.128.6:5705  1.178439ms   OK
node1   node4.nats      172.28.128.6:5708  1.176648ms   OK
node1   node2.directfs  172.28.128.4:5703  1.529612ms   OK
node1   node2.etcd      172.28.128.4:5707  1.165681ms   OK
node1   node2.api       172.28.128.4:5705  1.29602ms    OK
node1   node2.nats      172.28.128.4:5708  1.267454ms   OK
node1   node3.nats      172.28.128.5:5708  1.485657ms   OK
node1   node3.etcd      172.28.128.5:5707  1.469429ms   OK
node1   node3.directfs  172.28.128.5:5703  1.503015ms   OK
node2   node4.directfs  172.28.128.6:5703  1.484ms      OK
node2   node1.directfs  172.28.128.3:5703  1.275304ms   OK
node2   node4.nats      172.28.128.6:5708  1.261422ms   OK
node2   node4.api       172.28.128.6:5705  1.465532ms   OK
node2   node3.api       172.28.128.5:5705  1.252768ms   OK
node2   node3.nats      172.28.128.5:5708  1.212332ms   OK
node2   node3.directfs  172.28.128.5:5703  1.192792ms   OK
node2   node3.etcd      172.28.128.5:5707  1.270076ms   OK
node2   node1.etcd      172.28.128.3:5707  1.218522ms   OK
node2   node1.api       172.28.128.3:5705  1.363071ms   OK
node2   node1.nats      172.28.128.3:5708  1.349383ms   OK
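On a cluster with many nodes the connectivity output grows quickly, so it can be easier to list only the checks that did not pass. The sketch below keeps every row that has no `OK` field. It assumes whitespace-separated columns with the header shown above; the exact layout of failing rows is not shown in this article, so the filter avoids relying on a fixed column position. `failed_checks` is a hypothetical helper name.

```shell
#!/bin/sh
# failed_checks: read connectivity output on stdin and print only the rows
# that do not contain an "OK" field. Header and blank lines are skipped.
# Assumption: a passing check always carries the literal status "OK".
failed_checks() {
  awk '
    $1 == "SOURCE" || NF == 0 { next }   # skip header and blank lines
    {
      ok = 0
      for (i = 1; i <= NF; i++) if ($i == "OK") ok = 1
      if (!ok) print
    }
  '
}
```

A typical invocation would be `storageos cluster connectivity | failed_checks`. Empty output means every reported check passed. It does not prove the cluster is healthy, since nodes that never registered do not appear in the diagnostics at all.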