Introduction

Last week I ran into something after upgrading one of our environments. Since the updates, ESXi hosts have been randomly raising an alarm that says “Status of other host hardware objects”. After a little troubleshooting it seemed that this wasn’t impacting production or causing us any trouble at all. But since it’s a red alert and not nice to have in the GUI, I opened a case with VMware to check up on this, just to be sure I wasn’t missing anything.

Environment specifics:

  • HPE ProLiant Gen10 rack servers
  • vSphere ESXi 6.7 Build 16316930 (ESXi 6.7 EP 15)
  • HPE System ROM UB30 v2.22
  • HPE iLO 5 v2.10

Troubleshooting

While diving into the Event log for this host I found the following log entry:

Status of other host hardware objects error

Alright, there seems to be an issue with a hardware sensor. The natural next step is to check the Hardware Health status under Host -> Monitor -> Hardware Health in the vSphere Client. Strangely, there are no warnings there, and this “I/O Module 2 ALOM_Link_P2” sensor isn’t even (always) present in our hardware sensor list.

Hardware Health, sensor isn’t present?

Another approach is to use the vSphere Flex Client. The Flex Client is still around on this patch, so why not try it? After logging into the Flex Client and opening the same Hardware Health tab, I was able to find the hardware sensor and verify that it was all green and healthy.

vSphere Flex Client Hardware Health Status

It seems a bit weird that the sensor is present in only one of the two clients. So I figured I’d check the sensors from the ESXi host directly through the CLI:
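The output from my session is not reproduced here. As a minimal sketch, assuming the sensor records are read with `esxcli hardware ipmi sdr list`, a presence check could look like the following. The helper name `sensor_present` and the search string "ALOM_Link" are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: prints the lines on stdin that contain the given
# sensor name fragment (case-insensitive) and succeeds if there is a match.
sensor_present() {
    grep -i -- "$1"
}

# Only runs where esxcli exists. Assumption: the IPMI sensor records are
# listed by "esxcli hardware ipmi sdr list" on your build.
if command -v esxcli >/dev/null 2>&1; then
    if esxcli hardware ipmi sdr list | sensor_present "ALOM_Link"; then
        echo "sensor found"
    else
        echo "sensor not in the list right now"
    fi
fi
```

Running the same check a few minutes apart is an easy way to see whether a sensor comes and goes.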

Alright, it’s also not available from the CLI. Even stranger, when I tried this again a couple of minutes later I couldn’t even find the “P1” sensor in the sensor list anymore. A couple of minutes after that it was available again. This leads me to believe that the issue is related to the iLO/ESXi software rather than to the hardware itself.

Next up I checked the HPE iLOs of the ESXi hosts. While doing so I noticed that most of the iLOs were experiencing connection issues, hangs and timeouts. At this point, however, there is no publicly documented bug for the iLO or ESXi versions we have installed. I did find two KB articles, from VMware and HPE, that mention this bug, but the versions and fixes they describe do not match what is installed on our systems.

The fix

According to my VMware case this issue is known within VMware, just not for these iLO/ESXi versions. There is one resolution and one workaround that we can work with here:

  1. The internal KB mentions that the issue should be resolved in ESXi 6.5 U3 and 6.7 U2. Since we are running ESXi 6.7 U3, it is clearly not resolved for us. If this is also the case for you, continue with step 2, the workaround.
  2. We need to “silence” the alarms. These hardware sensor related errors can be safely ignored (for now). Read below on how to implement this in your environment.

Connect through SSH to the ESXi host that is reporting the hardware health alarm. Next up we need to determine the Node-Sensor ID of the sensor that is giving us this error. You can do this by running the following command (I’ve filtered out the unnecessary parts):
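The exact command and output from my session are not reproduced here. A minimal sketch, assuming the listing comes from `esxcli hardware ipmi sdr list` and that its first line is the column header, could filter the list down to the sensor you are after while keeping that header. The helper name `filter_with_header` and the search string are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: keep the first line of stdin (assumed to be the column
# header) plus every line containing the given fragment, case-insensitively.
filter_with_header() {
    awk -v pat="$1" 'NR == 1 || index(tolower($0), tolower(pat)) > 0'
}

# Only runs where esxcli exists (assumption: same sensor listing as above).
if command -v esxcli >/dev/null 2>&1; then
    esxcli hardware ipmi sdr list | filter_with_header "ALOM_Link"
fi
```

Keeping the header row makes it easier to see which column holds the node and sensor values on your build.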

In our case the Node-Sensor ID is “0.2”. This Node-Sensor ID is what we need in our next step, configuring ESXi to ignore these messages:
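As a hedged sketch of this step: the setting is written with `esxcfg-advcfg -s <IDs> /UserVars/HardwareHealthIgnoredSensors`. The small helper below (the name `ignore_sensors_cmd` is my own, not part of ESXi) only builds and prints that command, so you can review it before running it on the host:

```shell
#!/bin/sh
# Hypothetical helper: join the given Node-Sensor IDs with commas and print
# the esxcfg-advcfg command that would set the ignore list. It only prints;
# nothing is changed on the host.
ignore_sensors_cmd() {
    ids=$(IFS=,; echo "$*")
    echo "esxcfg-advcfg -s $ids /UserVars/HardwareHealthIgnoredSensors"
}

# Use the ID exactly as reported for your sensor; 0.2 is the value from our case.
ignore_sensors_cmd 0.2
# prints: esxcfg-advcfg -s 0.2 /UserVars/HardwareHealthIgnoredSensors
```

Once the printed line looks right, run it in the SSH session on the affected host.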

If needed, you can also specify multiple Node-Sensor IDs at once. Just put a comma between the sensor IDs, like this: "esxcfg-advcfg -s 2,1 /UserVars/HardwareHealthIgnoredSensors".

Next up we need to double check our work and see what we’ve configured ESXi to ignore. We can do this by executing the following command:
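A minimal sketch of this check, using `esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors` to read the value back. The helper `id_in_list` is hypothetical, and the `sed` expression assumes the output ends in "is <value>", which may differ on your build:

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the ID in $1 is one of the entries in
# the comma-separated list in $2.
id_in_list() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only runs where esxcfg-advcfg exists.
if command -v esxcfg-advcfg >/dev/null 2>&1; then
    # Assumption: the output ends with "... is <value>"; adjust the sed
    # expression if your build prints it differently.
    current=$(esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors | sed 's/.* is //')
    if id_in_list "0.2" "$current"; then
        echo "0.2 is ignored"
    else
        echo "0.2 is not ignored"
    fi
fi
```

Matching on whole comma-separated entries avoids a false hit where, for example, "2" would otherwise match inside "12".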

In our case the output confirmed that we had successfully configured ESXi to ignore Node-Sensor ID 2.

If you want to reset the configuration you made above, you can do this through the vSphere Web Client. Follow these steps:

  1. Go to your ESXi host through the Host Client or through the vCenter vSphere Web Client.
  2. Go to Configure -> System -> Advanced System Settings
  3. Look for the “UserVars.HardwareHealthIgnoredSensors” property and empty the Value field.
  4. Press OK.
  5. If you now execute "esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors" again, the value will be empty.

UserVars.HardwareHealthIgnoredSensors value

A last, extreme measure would be to disable the Hardware Health checks completely by disabling the ESXi “WBEM” service. This effectively kills the monitoring for all hardware sensors. I do not recommend this step, but I included it in the blog post for those who do want it.
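For completeness, a hedged sketch of what that could look like. It assumes the service is toggled with `esxcli system wbem set --enable <true|false>`, so verify this against your ESXi build first. The helper name `wbem_cmd` is hypothetical, and it only prints the command so nothing is switched off by accident:

```shell
#!/bin/sh
# Hypothetical helper: print the esxcli command that would switch the WBEM
# service on or off. It only prints; nothing is changed on the host.
wbem_cmd() {
    case "$1" in
        enable)  echo "esxcli system wbem set --enable true" ;;
        disable) echo "esxcli system wbem set --enable false" ;;
        *) echo "usage: wbem_cmd enable|disable" >&2; return 2 ;;
    esac
}

wbem_cmd disable
# prints: esxcli system wbem set --enable false
```

If you do go down this road, keep the matching `enable` command at hand so the monitoring can be switched back on later.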

Conclusion

To bring everything together: after upgrading to ESXi 6.7 EP 15 and HPE iLO 5 v2.10 we have seen “Status of other host hardware objects” errors on hosts in the cluster. This issue is known at VMware, just not with the particular ESXi/iLO versions that we are running at this moment. Since there is both a fix and a workaround, the first thing to do is to check whether you are running ESXi 6.5 U3 or 6.7 U2, since these are the versions that should have this issue fixed.

If installing the patches doesn’t fix this issue in your environment, you can safely ignore these sensors (for now). Either use Acknowledge/Reset to Green on the alarms yourself, or configure ESXi to ignore the sensors for you, as explained in The Fix section of this blog post.

While writing this blog post I noticed that there is a newer iLO version available, v2.18. The release notes do not mention a fix for this behaviour, but I will make sure to update the environment in the coming weeks. Once I’ve done this I will update the blog post with my findings.

**I do want to say that silencing the hardware sensor alarms in your environment is not something I recommend, since the behaviour can change in newer software versions and you or your colleagues might forget you ever ignored them. So my tip is: if you are ignoring hardware sensors, please note it down so that you can remove the ignore entries later on.**
