Introduction
Last week I ran into something after upgrading one of our environments. Since the updates we’ve been randomly getting errors on ESXi hosts that say “Status of other host hardware objects”. After a little troubleshooting it seemed this wasn’t impacting production or causing us any trouble at all. But since it’s a red alert and not nice to have in the GUI, I opened a case with VMware to check up on this, just to be sure I am not missing anything.
Environment specifics:
- HPE ProLiant Gen10 rack servers
- vSphere ESXi 6.7 Build 16316930 (ESXi 6.7 EP 15)
- HPE System ROM UB30 v2.22
- HPE ILO 5 v2.10
Troubleshooting
While diving into the Event log for this host I found the following log entry:
Alright, there seems to be an issue with a hardware sensor. So naturally you go to check the Hardware Health status under the Host -> Monitor -> Hardware Health section inside the vSphere Client. Strangely, there are no warnings and this “I/O Module 2 ALOM_Link_P2” isn’t even (always) present in our hardware sensor list.
Another approach is using the vSphere Flex Client. The Flex Client is still around on this patch, so why not try and use it. After logging into the Flex Client and opening up the same Hardware Health tab, I was able to find the Hardware Sensor and verify it was all green and healthy.
It seems a bit weird that the sensor is present in only one of the two clients. So I figured I’d check the sensors from the ESXi host directly through the CLI:
[xxxxx\b.vaneeden@esx06:~] esxcli hardware ipmi sdr list | grep -i ALOM_Link
Node-Sensor Description Entity-Instance Computed Reading Base Unit Raw Reading Sensor Type Timestamp/Comment Raw Formatted-Raw
----------- ----------------------------------------- --------------- ----------------- ------------------- ----------- ---------------- ------------------- --- -------------
0.1 [Device] I/O Module 1 ALOM_Link_P1 44.97 Heartbeat sensor-discrete 2 LAN 2020-07-03T07:01:23
Alright, the “P2” sensor is also not available from the CLI. Even stranger, a couple of minutes later I tried this again and I couldn’t even find the “P1” sensor in the sensor list anymore. A couple of minutes after that it was available again. This leads me to believe that this issue is more ILO/ESXi software related than actually hardware related.
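To confirm that a sensor really comes and goes, rather than relying on a few manual checks, the listing can be polled at an interval. The following is a minimal sketch, assuming a POSIX shell on the host. The function names `sensor_state` and `watch_sensor`, the 60-second interval and the log path in the example are hypothetical and not part of ESXi.

```shell
#!/bin/sh
# Sketch: log whether a sensor name shows up in the IPMI SDR listing.

# sensor_state NAME: reads an SDR listing on stdin and prints
# "present" if any line contains NAME (case-insensitive), else "absent".
sensor_state() {
    if grep -qiF -e "$1"; then
        echo present
    else
        echo absent
    fi
}

# watch_sensor NAME [INTERVAL_SECONDS] [COUNT]: polls the listing COUNT times
# and prints one timestamped line per poll.
watch_sensor() {
    name=$1
    interval=${2:-60}
    count=${3:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        state=$(esxcli hardware ipmi sdr list | sensor_state "$name")
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $name $state"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example (hypothetical log path), run on the ESXi host:
#   watch_sensor ALOM_Link_P2 60 10 >> /tmp/sensor-watch.log
```

A log that alternates between present and absent for the same sensor would support the idea that the sensor list itself is unstable, rather than the hardware.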
Next up I checked the HPE ILOs on the ESXi hosts. I noticed that most of the ILOs were experiencing connection issues, hangs and timeouts, although at this point I am not aware of any publicly documented bugs for the ILO or ESXi versions that we have installed. I did find two KBs from VMware and HPE that mention this bug, but the versions and fixes they cover do not match what is installed on our systems.
The fix
So according to my VMware case this issue is known within VMware, but just not for these ILO/ESXi versions. There is one resolution and one workaround that we can work with here:
- The internal KB mentioned that the issue should be resolved in ESXi 6.5 U3 and 6.7 U2. Since we are running ESXi 6.7 U3 this is clearly not resolved. If this is also the case for you, please read step 2, the workaround.
- We need to “silence” the alarms. These hardware sensor related errors can be safely ignored (for now). Read below on how to implement this in your environment.
Connect to your ESXi host that is reporting the hardware health alarm through SSH. Next up we need to determine the Node-Sensor ID that is giving us this error. You can do this by running the following command (I’ve filtered out the unnecessary parts):
[xxxxx\b.vaneeden@esx06:~] esxcli hardware ipmi sdr list
Node-Sensor Description Entity-Instance Computed Reading Base Unit Raw Reading Sensor Type Timestamp/Comment Raw Formatted-Raw
----------- ----------------------------------------- --------------- ----------------- ------------------- ----------- ---------------- ------------------- --- -------------
0.1 [Device] I/O Module 1 ALOM_Link_P1 44.97 Heartbeat sensor-discrete 2 LAN 2020-07-22T19:04:16
0.2 [Device] I/O Module 2 ALOM_Link_P2 44.98 Heartbeat sensor-discrete 2 LAN 2020-07-22T19:04:16
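Instead of reading the ID off the full listing by eye, the Node-Sensor column can be pulled out with awk. This is a small sketch that assumes the listing looks like the output above, with the Node-Sensor value as the first whitespace-separated field. The function name `sensor_id` is hypothetical.

```shell
#!/bin/sh
# sensor_id NAME: reads an SDR listing on stdin and prints the first field
# (the Node-Sensor column) of every data row whose text contains NAME.
# Rows that do not start with a dotted number (header, separator) are skipped.
sensor_id() {
    awk -v name="$1" 'index($0, name) > 0 && $1 ~ /^[0-9]+\.[0-9]+/ { print $1 }'
}

# Example, run on the ESXi host:
#   esxcli hardware ipmi sdr list | sensor_id ALOM_Link_P2
```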
In our case the Node-Sensor ID is “0.2”. This Node-Sensor ID is what we need to use in our next step, configuring ESXi to ignore these messages:
esxcfg-advcfg -s 2 /UserVars/HardwareHealthIgnoredSensors
If you need to, you can also specify multiple Node-Sensor IDs at once; just put a comma between the sensor IDs, like so: "esxcfg-advcfg -s 2,1 /UserVars/HardwareHealthIgnoredSensors".
Next up we need to double check our work and see what we’ve configured ESXi to ignore. We can do this by executing the following command:
[xxxxx\b.vaneeden@esx07:~] esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors
Value of HardwareHealthIgnoredSensors is 2
As you can see in the above command output, we’ve successfully configured ESXi to ignore Node-Sensor ID 2.
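Since the setting holds one comma-separated value, adding another sensor later means writing the existing IDs together with the new one. The sketch below assumes that `esxcfg-advcfg -s` replaces the whole value and that the `-g` output has the “Value of HardwareHealthIgnoredSensors is …” form shown above. The helper names `current_ignored` and `merge_ids` are hypothetical.

```shell
#!/bin/sh
# current_ignored: reads `esxcfg-advcfg -g` output on stdin and prints only
# the configured value (empty if nothing is configured).
current_ignored() {
    sed -n 's/^Value of HardwareHealthIgnoredSensors is *//p'
}

# merge_ids CURRENT NEW: prints CURRENT with NEW appended after a comma,
# unless NEW is already in the list or CURRENT is empty.
merge_ids() {
    cur=$1
    new=$2
    case ",$cur," in
        *",$new,"*) echo "$cur" ;;
        ",,")       echo "$new" ;;
        *)          echo "$cur,$new" ;;
    esac
}

# Example, run on the ESXi host, adding ID 1 to whatever is already ignored:
#   cur=$(esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors | current_ignored)
#   esxcfg-advcfg -s "$(merge_ids "$cur" 1)" /UserVars/HardwareHealthIgnoredSensors
```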
If you want to reset the configuration you have made above, you can do this through the vSphere Webclient. Just follow the steps below:
- Go to your ESXi host through the Host Client or through the vCenter vSphere Webclient.
- Go to Configure -> System -> Advanced System Settings
- Look for the “UserVars.HardwareHealthIgnoredSensors” property and empty the Value field.
- Press OK.
- Now if you execute "esxcfg-advcfg -g /UserVars/HardwareHealthIgnoredSensors" again, the value will be empty.
The last extreme measure would be to completely disable the Hardware Health checks by disabling the ESXi “WBEM” service. This will effectively kill the hardware sensor monitoring for all hardware sensors. I do not recommend this step, but I included it in the blogpost for those that do want this.
Disable: esxcli system wbem set --enable false
Enable: esxcli system wbem set --enable true
Conclusion
So, bringing everything together: after upgrading to ESXi 6.7 EP 15 and HPE ILO 5 v2.10 we have seen “Status of other host hardware objects” errors on hosts in the cluster. This issue is known at VMware, but just not with the particular ESXi/ILO versions that we have running at this moment. Since there is a fix and a workaround, the first thing to do is to check if you are running ESXi 6.5 U3 or 6.7 U2, since these are the versions that should have this issue fixed.
If this issue isn’t fixed in your environment by installing the patches, you can safely ignore these sensors (for now) and either Acknowledge/Reset them to green yourself or configure ESXi to ignore the sensors for you, as explained in The Fix section of this blogpost.
While writing this blogpost I noticed that there is a newer ILO version available, v2.18. The release notes do not mention a fix for this behaviour, but I will make sure to update the environment in the coming weeks. Once I’ve done this I will update the blogpost with my findings.
**I do want to say that silencing the Hardware Sensor alarms in your environment is not something I recommend, since the behaviour can change in newer software versions and you or your colleagues might forget you ever ignored these. So my tip is: if you are ignoring your Hardware Sensors, please note it down so that you can remove them later on.**
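One low-effort way to note it down is to append a line to a shared file every time a sensor is ignored. This is only a sketch: the function name `ignore_note`, the tab-separated format and the file path in the example are hypothetical, so use whatever your team already keeps for this kind of record.

```shell
#!/bin/sh
# ignore_note HOST IDS REASON: prints one tab-separated line with today's
# date (UTC), the host name, the ignored sensor IDs and the reason.
ignore_note() {
    printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%d)" "$1" "$2" "$3"
}

# Example (hypothetical path):
#   ignore_note esx06 2 "ALOM_Link_P2 false alarm, VMware case" >> ignored-sensors.tsv
```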
6 Comments
Jan Engfeldt · November 18, 2020 at 3:59 pm
Hi,
How are the HPE Synergy alarms managed by vCenter?
I know HPE OneView is integrated as a plugin, but can we see HW alarms related to HPE in vCenter?
Ludovic · July 1, 2021 at 11:55 am
Hello Jan,
It seems the problem is corrected in ESXi 6.7 March 2021 Patch ESXi670-202103001 build 17700523 (maybe with a previous update too).
We don’t see the problem with this ESXi version and iLO 2.15 or 2.44.
Regards,
Claude Hayden · July 16, 2021 at 1:30 am
0.44.99.3 | [Device] I/O Module 3 NIC_Link_01P3 | Warning | 1 | 15 | Other | 07/15/2021, 8:07:37 PM
Still investigating.
Jan, was there anything that triggered this? I know you didn’t say so, but this appeared out of the blue. In my case there are a few new switches installed. As no solution has been offered I was grasping at straws, i.e. trying to think of what could have caused these errors to spring up.
Claude Hayden · July 16, 2021 at 1:31 am
I forgot to mention: interestingly, if I migrate all my VMs to host 3, 4 or 5, the error follows to that host and goes away on the host that no longer has VMs.
John W. · October 23, 2024 at 11:18 pm
Did the issue get fixed with a subsequent iLO update?
Bryan van Eeden · December 30, 2024 at 2:21 pm
I am unsure to be honest, it’s been a long time. There have been many many iLO updates so I would suppose it did.