vSphere 5.1 introduced one of my favorite new features: network health check. This feature is designed to identify problems with MTU and VLAN settings. It is easy enough to set up MTU and VLANs in ESXi, especially with a vDS. In most environments the vSphere admins don’t control the physical switches, which makes confirming the upstream configuration hard. The health check solves this problem. It is only available on a vDS and only via the web client. (I know, it is time to start using that web client. Your magical fat client is going away.) If you have an upstream issue with MTU you will get an alert in vCenter. You can find the health check by selecting the vDS and clicking on the Manage tab. In the middle pane you will see Health check, which you can edit and enable. But you came here because you want to know how it works.
MTU check is easy. Each system sends out a ping message to the other nodes. This ping message sets the "don’t fragment" flag in its IP header, which tells the network not to fragment (split) the packet. In addition it carries a padding payload (empty data) that makes the ping as large as the configured MTU. If the host gets a reply to the ping, it knows the MTU is correct. If the ping fails, we know the MTU is bad. Each node checks its MTU at an interval. You can manually check your MTU with vmkping, but the syntax has changed between 5.0, 5.1 and 5.5, so look up the latest syntax.
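If you want to run the same kind of test by hand, the one bit of arithmetic to get right is the payload size. This is a minimal sketch, assuming IPv4 with no IP options (a 20-byte IP header) plus the 8-byte ICMP header. The function names and the target address are made up for the example, and the `-d` (don’t fragment) and `-s` (payload size) flags are the ones I know from vmkping, so check them against your ESXi version before relying on them.

```python
# Sketch: work out the ICMP payload size for a don't-fragment ping.
# Assumes IPv4 without IP options; function names are hypothetical.

IPV4_HEADER_BYTES = 20  # IPv4 header with no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Return the largest ICMP payload that fits in one packet of size `mtu`."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def vmkping_command(mtu: int, target_ip: str) -> str:
    """Build a vmkping command line that tests `mtu` without fragmentation."""
    return f"vmkping -d -s {icmp_payload_for_mtu(mtu)} {target_ip}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address, replace it with a real vmkernel peer.
    print(vmkping_command(9000, "192.0.2.10"))
    print(vmkping_command(1500, "192.0.2.10"))
```

For a jumbo frame MTU of 9000 this works out to a payload of 8972 bytes, and for a standard MTU of 1500 it is 1472 bytes. If you ping with the full MTU as the payload size, the packet is too big and the test fails even on a healthy network.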
Checking the VLAN is a little more complex, because each VLAN has to be checked. One host on the vDS (I am not sure which one, but I am willing to bet it’s the master node) sends out a layer 2 broadcast packet on the VLAN. Then it waits for each node to reply to the broadcast with a layer 2 unicast packet. You can determine which hosts have VLAN issues based on who reports back. I assume that a host marked as bad then tries to broadcast itself, as a way to identify failed configuration or partitions. This test is repeated on each VLAN and at regular intervals. It only works when at least two peers can connect.
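To make the broadcast and reply idea concrete, here is a toy model in Python. It is not VMware’s implementation, just a sketch of the logic as I described it above: one host probes each VLAN, and any peer whose upstream port does not carry that VLAN stays silent. The host names, VLAN IDs and the `vlan_check` function are all invented for the example.

```python
# Toy model of the VLAN check: who fails to answer a broadcast on each VLAN?
# Not VMware's code; names and data are hypothetical.


def vlan_check(prober, trunked_vlans, vlans_to_test):
    """Return {vlan: [hosts that did not reply]} as seen from `prober`.

    trunked_vlans maps each host to the set of VLAN IDs its physical
    switch port actually carries.
    """
    peers = [host for host in trunked_vlans if host != prober]
    report = {}
    for vlan in vlans_to_test:
        if vlan not in trunked_vlans[prober]:
            # The broadcast never leaves the prober, so nobody can reply.
            report[vlan] = sorted(peers)
            continue
        # A peer replies by unicast only if the broadcast reached it.
        report[vlan] = sorted(
            host for host in peers if vlan not in trunked_vlans[host]
        )
    return report


if __name__ == "__main__":
    upstream = {
        "esx01": {10, 20, 30},
        "esx02": {10, 20},   # VLAN 30 missing on the physical trunk
        "esx03": {10, 30},   # VLAN 20 missing on the physical trunk
    }
    for vlan, silent in vlan_check("esx01", upstream, [10, 20, 30]).items():
        print(f"VLAN {vlan}: no reply from {silent or 'nobody'}")
```

Notice the case where the probing host itself is missing the VLAN: every peer looks bad, even though only one trunk is wrong. That is why I suspect a second host has to take a turn at broadcasting before anyone can say which side is broken.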
In ESXi 5.5 VMware added a check of the teaming policy against the physical switch configuration. This check identifies mismatches between IP hash teaming on the vDS and physical switch ports that are not configured for EtherChannel/LACP.
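The rule behind the teaming check is simple enough to write down. This is only a sketch of the comparison, with made-up policy names and a made-up function. Flagging the reverse case (a port channel on the switch without IP hash on the vDS) is my assumption, not something I have confirmed the health check does.

```python
# Sketch of the teaming mismatch rule. Policy names are hypothetical labels.


def teaming_mismatch(teaming_policy: str, port_channel_configured: bool) -> bool:
    """Return True when the vDS teaming policy and the switch ports disagree."""
    uses_ip_hash = teaming_policy == "ip_hash"
    # IP hash needs a port channel upstream; the reverse case is an assumption.
    return uses_ip_hash != port_channel_configured


if __name__ == "__main__":
    for policy, port_channel in [("ip_hash", False), ("ip_hash", True),
                                 ("port_id", False), ("port_id", True)]:
        state = "MISMATCH" if teaming_mismatch(policy, port_channel) else "ok"
        print(f"{policy:8} port_channel={port_channel!s:5} -> {state}")
```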
Negative Effects of Health Check
So why should I not use health check? Well, it does produce some traffic. It also requires you to use the web client to enable it and to see which VLANs are bad. Otherwise I cannot think of a reason not to use it. It is a simple and easy way to find issues.
Design Advice on Health Check
Health check is a proactive way to find upstream VLAN or MTU issues before you deploy production workloads to that VLAN. It saves a ton of time on troubleshooting and on fights between the networking and server teams. I really cannot see a reason not to use it. I have not tested the required bandwidth, but it cannot be huge. My two cents: turn it on if you have a vDS. If you don’t have a vDS, I hope you only have ten or fewer VLANs.
2 Replies to “Deep Dive: Network Health check”
Have you seen the IP hash mismatch error that comes up under the teaming and failover health status? You can refer to https://communities.vmware.com/message/2565347#2565347 for details.
Thanks for reading. I added comments to your post on the VMware forum and I hope they help.