I was recently upgrading my home lab to the newest version of NSX. Since it’s my home lab, I didn’t back up or snapshot before I did the upgrade. Don’t try this at work. The upgrade of the NSX Manager went fine, but the controllers were all disconnected. I logged into all three of the NSX controllers (running 6.0) and found them all to be in this state:
As you can see, they were all waiting to join the majority, with no cluster ID. I attempted to force the first controller to join itself using:
join control-cluster 192.168.10.29 force
This command rips out the previous cluster configuration and reconfigures the node. That node came back as normal and became the master. I then tried to force the other nodes to join. Once they finished, every controller was disconnected again. I then removed two controllers and tried to force the single remaining one into being the master. This seemed to work, but when I tried to add a controller it failed again. This left me with a few choices:
- Wipe out NSX and start from scratch
- Try something else
I went for something else, with a wipe-out as the fallback. I figured that since the logical switches know their own configuration without the controllers, they would be OK as long as nothing changed. They were set to communicate updates via unicast mode. I switched them to multicast (yes, it works in my environment; you can switch it on the transport zone instead of on each switch) and then ripped out my last controller. I then deployed a new set of controllers one at a time. I configured the transport zone back to unicast and everything seemed OK. I also redeployed the edge gateways to complete the upgrade (I don’t think this was essential to the process). I hope this helps if you failed to back up before an upgrade gone bad.
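The order of those steps matters, so here is a minimal Python sketch of it. It makes no NSX calls at all: the `Lab` class, its method names, and the controller names are hypothetical and exist only to illustrate the sequence, on the assumption that unicast mode needs at least one working controller while multicast does not.

```python
from dataclasses import dataclass, field


@dataclass
class Lab:
    """Hypothetical stand-in for the lab state; performs no real NSX operations."""
    control_plane_mode: str = "UNICAST"  # mode set on the transport zone
    controllers: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def set_mode(self, mode):
        # Assumption: unicast relies on the controller cluster, multicast does not.
        if mode == "UNICAST" and not self.controllers:
            raise RuntimeError("unicast mode needs at least one controller")
        self.control_plane_mode = mode
        self.log.append(f"transport zone -> {mode}")

    def remove_controller(self, name):
        # Refuse to pull the last controller while the switches still depend on it.
        if len(self.controllers) == 1 and self.control_plane_mode != "MULTICAST":
            raise RuntimeError("switch to multicast before removing the last controller")
        self.controllers.remove(name)
        self.log.append(f"removed {name}")

    def deploy_controller(self, name):
        self.controllers.append(name)
        self.log.append(f"deployed {name}")


def recover(lab, new_names):
    """Replay the recovery order: multicast, remove old, deploy new, unicast."""
    lab.set_mode("MULTICAST")
    for name in list(lab.controllers):  # copy, because the list shrinks as we go
        lab.remove_controller(name)
    for name in new_names:  # one at a time, as in the post
        lab.deploy_controller(name)
    lab.set_mode("UNICAST")
    return lab


if __name__ == "__main__":
    lab = recover(Lab(controllers=["old-1"]), ["new-1", "new-2", "new-3"])
    for entry in lab.log:
        print(entry)
```

The two guards are the point of the sketch: the last controller only goes away once the transport zone is on multicast, and unicast only comes back once new controllers exist.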