NSX Controllers all show as disconnected

I was recently upgrading my home lab to the newest version of NSX. Since it’s my home lab, I didn’t back up or snapshot before the upgrade. Don’t try this at work. The upgrade of the NSX Manager went fine, but the controllers were all disconnected. I logged into all three of the NSX controllers (running 6.0) and found them all to be in this state:

As you can see, they are all showing waiting to join majority with no cluster ID. I attempted to force the first machine to join itself using

join control-cluster 192.168.10.29 force

This command rips out the previous cluster configuration and reconfigures the node. That node came back as normal and became the master. I then tried to force the other nodes to join. Once they finished, everyone was disconnected again. I then removed two controllers and tried to force the single remaining one into being the master. This seemed to work, but when I tried to add a controller it failed again. This left me with a few choices:

Wipe out NSX and start from scratch
Try something else
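
For context on the "waiting to join majority" state above, here is a minimal sketch of generic majority-quorum arithmetic. It assumes the controller cluster follows the usual strict-majority rule; it is not taken from NSX itself, and the function names are made up for illustration.

```python
def majority_needed(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_majority(agreeing_nodes: int, cluster_size: int) -> bool:
    """True when enough nodes agree on membership to form a majority.

    Assumption: the cluster uses a plain strict-majority rule.
    """
    return agreeing_nodes >= majority_needed(cluster_size)


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, "node cluster needs", majority_needed(size), "node(s) to agree")
```

Under that assumption, a forced single node is its own majority, while a two-node cluster needs both nodes to agree, which would fit with the second join being the point where things fell apart.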

I went for something else, with wiping out as the fallback. I figured that since the logical switches know their own config without the controllers, they would be OK as long as nothing changed. They were set to communicate updates via unicast mode. I switched them to multicast (yes, it works in my environment; you can switch it on the transport zone instead of on each switch) and then ripped out my last controller. I then deployed a new set of controllers one at a time. I configured the transport zone back into unicast and everything seemed OK. I also redeployed the edge gateways to complete the upgrade (I don’t think this was essential to the process). I hope this helps if you failed to back up before an upgrade gone bad.
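
The order of those steps matters, so here is a rough sketch that models the sequence as a small state machine. Everything in it (the `Lab` class, the step names, the count of three new controllers) is hypothetical and only for illustration; it does not call any NSX API. The one rule it encodes is my assumption that unicast mode should never be active while zero controllers exist.

```python
from dataclasses import dataclass


@dataclass
class Lab:
    # Starting point from the post: unicast mode, one controller left.
    replication_mode: str = "unicast"
    controllers: int = 1


def apply_step(lab: Lab, step: tuple) -> Lab:
    """Apply one (kind, value) step to the lab and check the safety rule."""
    kind, value = step
    if kind == "set_mode":
        lab.replication_mode = value
    elif kind == "remove_controller":
        lab.controllers -= 1
    elif kind == "deploy_controller":
        lab.controllers += 1
    else:
        raise ValueError(f"unknown step: {kind}")
    if lab.controllers < 0:
        raise ValueError("cannot remove a controller that does not exist")
    # Assumption: unicast mode depends on the controllers, multicast does not.
    if lab.replication_mode == "unicast" and lab.controllers == 0:
        raise RuntimeError("unicast mode with no controllers")
    return lab


# The sequence from the post; three new controllers is an assumed count.
RECOVERY_PLAN = [
    ("set_mode", "multicast"),
    ("remove_controller", None),
    ("deploy_controller", None),
    ("deploy_controller", None),
    ("deploy_controller", None),
    ("set_mode", "unicast"),
]


def run_plan(plan, lab=None) -> Lab:
    """Run every step in order, starting from a fresh Lab unless one is given."""
    lab = lab if lab is not None else Lab()
    for step in plan:
        apply_step(lab, step)
    return lab


if __name__ == "__main__":
    print(run_plan(RECOVERY_PLAN))
```

Running the plan in the order above passes the check, while removing the last controller before switching to multicast trips it.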

8 Replies to “NSX Controllers all show as disconnected”

laurenmalhoit says:

November 16, 2015 at 5:22 pm

You can also force them to rejoin one at a time using the CLI

1. Joseph Griffiths says:
  
  November 17, 2015 at 6:26 am
  
  Lauren,
  
  Thanks for reading. Which command should I have used to rejoin? I tried:
  
  join control-cluster IP force
  
  multiple times, and when I tried to join the second machine to the master (the first machine) it would fail. The logs on both machines showed that neither was willing to be the slave because there was no quorum.
  
  Thanks for reading. Let me know if there is a command I missed.
  
  Joseph
  
laurenmalhoit says:

November 17, 2015 at 1:13 pm

Here’s what I experienced: First controller installed and said it was connected. When I deployed the second controller the first one became disconnected and the second one wouldn’t deploy (time out error).
1. Tried rebooting NSX Manager – didn’t work
2. Tried rebooting NSX Controller – didn’t work

When I try to add another controller it says Controller controller-n creation failed – there is no active controller node for join.
3. Deleted the first controller and re-deployed – I was able to deploy the first controller again, status = normal

Tried to deploy the second controller and the first controller disconnected again.

http://roie9876.wordpress.com/2014/11/11/troubleshooting-nsx-v-controller/

4. Tried join control-cluster 172.31.217.187 (IP address of first controller) on the first controller – this seemed to fix the connectivity issue.

So, I guess my issue was just different.

1. Joseph Griffiths says:
  
  November 17, 2015 at 9:27 pm
  
  Lauren,
  
  Thanks for sharing the experience. I really wish the logs would output more information on the specific problem. I love Roie’s blog and tried his suggestions as well before going with the burn down suggestion. Thanks for sharing the method and reading.
  
  Thanks,
  Joseph
  
russ says:

October 30, 2016 at 4:27 am

Powered off two controllers and forced the remaining one to join the cluster (that is, itself).
Once it was finished, I ran show control-cluster startup-nodes and it was alone… hmmmm…

Removed the other controllers and tried to redeploy
In my case there was a timeout error due to overlapping IP assignments. Once that was fixed, the redeploy went fine.

The point is to check the startup-nodes after force rejoin

1. Joseph Griffiths says:
  
  November 3, 2016 at 4:17 pm
  
  Thanks for the update and comment.
  
justin says:

January 19, 2017 at 5:33 pm

We have gone through the same process. However, after doing all these tasks, our logical switches, even the universal switches, were showing alert status. Any idea how we can bring them back to normal status?

1. Joseph Griffiths says:
  
  January 21, 2017 at 10:48 am
  
  Is the alert status on your control cluster or inside vCenter on the logical switches? The issue described here is focused on the control cluster sync issues only. Let me know the error associated with your switches’ alert and I’ll see if I can help.
  