Recently we have been seeing (extremely) slow boot times on a couple of hosts that we're using for a new customer. We didn't really notice this at first, until we added a host to the existing cluster and its reboot took over 3 hours!
The host seemed fine to me after checking the hardware and configuration; everything checked out, yet the host still appeared to be broken. Checking the console, it seemed to be stuck on the “vmw_vaaip_cx loaded successfully” message in the hypervisor boot screen. Shortly after investigating this I remembered that these hosts have LUNs in use as physical Raw Device Mappings (RDMs) for Windows MSCS clusters, and that was the problem! During the boot process, ESXi scans the so-called storage mid-layer and attempts to discover all devices presented to the host during the device claiming phase. This is a problem for RDMs used in MSCS clusters, because they have a permanent SCSI reservation allocated to them. As a result, the ESXi host cannot interrogate such a LUN during the boot process and has to wait for a device scan timeout, which takes a long time. In this KB VMware states that with 10 RDMs the boot process already takes around 30 minutes (excluding physical hardware boot times), which works out to roughly 3 minutes of timeout per reserved LUN. Our hosts have over 40 RDMs, so the multi-hour boot suddenly made sense.
Fortunately, this can be fixed by setting a local device parameter on the LUN on each ESXi host. The setting is called “perennially reserved” and simply tells ESXi that there is a permanent SCSI reservation on the LUN, so it can be skipped during the boot process; ESXi basically won't scan the device during boot anymore.
You can easily set this for a LUN on an ESXi host with the following commands:
esxcli storage core device setconfig -d naa.id --perennially-reserved=true
Check whether the command succeeded by running the following command and checking the “Is Perennially Reserved” setting:
esxcli storage core device list -d naa.id
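If you don't know the naa ID of an RDM, you can look it up with PowerCLI as well. A minimal sketch (assuming an active Connect-VIServer session; "CLUSTERNAME" is a placeholder) that lists every physical and virtual RDM in a cluster together with its owning VM and naa ID:
#List all (p/v)RDM disks in a cluster with their owning VM and naa ID (ScsiCanonicalName)
Get-Cluster -Name "CLUSTERNAME" | Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select-Object Parent,Name,ScsiCanonicalName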
This is easy if you have just a single host, but we have an environment with loads of hosts for this customer, and since I am not going to execute this by hand for each RDM on each host, I wrote a script for it. The script is pretty simple: it checks a given cluster for VMs with (p/v)RDMs and marks those (p/v)RDMs as perennially reserved on each host in the cluster. It also includes a check to see if a LUN has already been set to perennially reserved; if so, it skips that LUN.
#Start of script.
#########################################
#Author  = Bryan van Eeden              #
#Version = 1.0                          #
#Date    = 24/07/2019                   #
#########################################
#Set variables
$cluster = "CLUSTERNAME"
###########################################################################################
#Don't change anything beyond this point.
###########################################################################################
#Get (v/p)RDMs
$RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select ScsiCanonicalName -Unique
#Get VMhosts in cluster
$VMhosts = Get-Cluster -Name $cluster | Get-VMHost

foreach ($VMhost in $VMhosts) {
    $esxcli = Get-EsxCli -V2 -VMHost $VMhost
    foreach ($RDM in $RDMlist) {
        #Read the current Perennially Reserved flag into a variable
        $RDMPerenniallyReservedInfo = $esxcli.storage.core.device.list.Invoke(@{device = $RDM.ScsiCanonicalName}) | Select -ExpandProperty IsPerenniallyReserved
        if ($RDMPerenniallyReservedInfo -eq "false") {
            Write-Host "RDM with naa ID = $($RDM.ScsiCanonicalName) is not Perennially reserved"
            Write-Host "Setting Perennially Reserved Flag for RDM with naa ID = $($RDM.ScsiCanonicalName)"
            #Set the Perennially Reserved flag for the given device to true
            $esxcli.storage.core.device.setconfig.Invoke(@{device = $RDM.ScsiCanonicalName; perenniallyreserved = $true}) | Out-Null
            #Re-read the flag to confirm the change
            $RDMPerenniallyReservedInfo = $esxcli.storage.core.device.list.Invoke(@{device = $RDM.ScsiCanonicalName}) | Select -ExpandProperty IsPerenniallyReserved
            Write-Host "RDM with naa ID = $($RDM.ScsiCanonicalName) is now set to Perennially Reserved."
        }
        else {
            Write-Host "RDM with naa ID = $($RDM.ScsiCanonicalName) is already Perennially reserved. Skipping lun."
        }
    }
}
#End of script.
Output from the script should be something like this:
RDM with naa ID = naa.60060160087045004f16385dae5b20c9 is not Perennially reserved
Setting Perennially Reserved Flag for RDM with naa ID = naa.60060160087045004f16385dae5b20c9
Or:
RDM with naa ID = naa.60060160087045004589355df80c5270 is already Perennially reserved. Skipping lun.
I did find some example scripts on the web, but I couldn't actually find any with ESXCLI V2 commands that suited my situation, so I created my own. I am going to turn this into a function sometime this year so that it will be easier to use, but this will do for now.
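If you want to double-check the result afterwards, the following report-only sketch can be handy. It reuses the same cmdlets and esxcli calls as the script above and only prints the current flag for every (p/v)RDM on every host; nothing is changed:
#Report the Perennially Reserved state of every (p/v)RDM on every host in the cluster
$cluster = "CLUSTERNAME"
$RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select ScsiCanonicalName -Unique
foreach ($VMhost in (Get-Cluster -Name $cluster | Get-VMHost)) {
    $esxcli = Get-EsxCli -V2 -VMHost $VMhost
    foreach ($RDM in $RDMlist) {
        $state = $esxcli.storage.core.device.list.Invoke(@{device = $RDM.ScsiCanonicalName}) | Select -ExpandProperty IsPerenniallyReserved
        Write-Host "$($VMhost.Name) - $($RDM.ScsiCanonicalName): IsPerenniallyReserved = $state"
    }
}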
The “Perennially Reserved” setting can also be edited in a Host Profile since ESXi 5.1. You can do this by browsing to the host profile -> Edit Host Profile -> Storage Configuration -> Pluggable Storage Architecture (PSA) configuration -> PSA Device Setting and setting the flag “Device perennially reserved status” to “Enabled”. See the screenshot below for more information:
So there you have it! Are you experiencing extremely long boot times and using (p/v)RDMs? Then you will want to keep this script around to make your life easier!
21 Comments
Jim · February 1, 2020 at 7:20 am
Works great except I had to change else(){ to just else{
Paul · February 5, 2020 at 9:08 pm
Thanks for the script, this is best one I found for this task.
There is one typo on line 35: else(){
I had to change it to read: else{
Bryan van Eeden · February 6, 2020 at 2:43 pm
Hi Paul, thanks, you are correct. I missed that one during testing.
James Krolak · June 10, 2020 at 8:07 pm
Great script! This really saved me some time. It’s working for me, though I’m getting the following error when it runs:
Get-VM : 6/10/2020 7:05:24 PM Get-VM Exception has been thrown by the target of an invocation.
At D:\Scripts\Fix Slow ESXi Boot Times Due to RDMs\SetPerenniallyReserved.ps1:21 char:41
+ $RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskTy …
+ ~~~~~~
+ CategoryInfo : NotSpecified: (:) [Get-VM], VimException
+ FullyQualifiedErrorId : Core_BaseCmdlet_UnknownError,VMware.VimAutomation.ViCore.Cmdlets.Commands.GetVM
Any clue what it’s complaining about?
Bryan van Eeden · June 12, 2020 at 7:54 pm
Hi James,
Thank you! It’s weird that it runs for you but you still get the error? Did it actually apply the flag after running?
Because if I look at your Exception code it looks like it is already failing on the first line of code:
$RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select ScsiCanonicalName -Unique
In this part, all it does is get the VMs and list their p/vRDMs. I suggest you run this line on its own to see if it works at all for you. If it does and the flag is set, I am not sure why it's giving you an error. Let me know how it goes!
James Krolak · June 19, 2020 at 5:18 pm
It turns out, the script was apparently not actually setting perennially reserved properly ‘cuz my hosts are taking 45 minutes to boot. The rest of the script acted like it was running properly, though. I had added in a line so it would report when it would switch to the next host and it listed out all the correct RDM LUN names.
On the down side, the same day I ran this, we started experiencing problems with vMotion on our cluster. I could not put some hosts into maintenance mode because not all VMs would vMotion. They would sit at 18% for 30-40 minutes and then fail. In some cases, the host would stop responding to vCenter. And sometimes the VMs that would vMotion would get to 98 or 99% done and then sit there for 20 minutes or more before finishing. In the vmkernel logs we can see reports of file locks from other hosts in the cluster when we try to vMotion the ones that won’t move. I’m not sure whether this script had something to do with causing that ‘cuz every VM in the cluster was migrated to new hosts just 4 days earlier with no issue. VMware support is still trying to figure out what the problem is and how we fix it.
James Krolak · June 19, 2020 at 5:40 pm
I should mention that I’m now addressing the RDM issue with host profiles, instead. Though, that’s a bit weird ‘cuz as soon as you set the PSA Device Settings for even 1 drive, the host reports EVERY drive it can see as not being included in the profile–including the local boot disk, which has a different naa id, so you can’t really include them all in the profile. So, the host will always say it’s not compliant with the profile.
Bryan van Eeden · June 19, 2020 at 5:43 pm
Hi James,
We've actually used this script countless times on multiple clusters with a significant number of physical compute nodes that had pRDMs connected. Setting the perennially reserved flag on the device is also a supported move (https://kb.vmware.com/s/article/1016106).
If you can find “VMFS partition is marked perennially reserved” messages in the vmkernel.log, it could be that a VMFS volume was set to perennially reserved, which it shouldn't be. If that is the case, you should change it back. I hope your issue is fixed soon!
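For reference, clearing the flag again works exactly like setting it, just with the value flipped to false. A minimal sketch (hostname and naa ID are placeholders):
$esxcli = Get-EsxCli -V2 -VMHost "ESXIHOSTNAME"
#Set the Perennially Reserved flag back to false for the given (placeholder) device
$esxcli.storage.core.device.setconfig.Invoke(@{device = "naa.id"; perenniallyreserved = $false})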
Samuel · July 3, 2020 at 1:03 am
This was a life-saver! Thank you Bryan…
Chandra · July 17, 2020 at 9:07 am
Where do I see the output file of this script? Sorry for the dumb question, I'm not very familiar with scripts.
Bryan van Eeden · July 22, 2020 at 7:35 pm
Hi Chandra,
There isn't actually any output file. The output appears in the PowerShell window, in the following example format:
RDM with naa ID = naa.60060160087045004f16385dae5b20c9 is not Perennially reserved
Setting Perennially Reserved Flag for RDM with naa ID = naa.60060160087045004f16385dae5b20c9
or
RDM with naa ID = naa.60060160087045004589355df80c5270 is already Perennially reserved. Skipping lun.
I hope this helps.
Bjoern · September 14, 2020 at 3:08 pm
Works perfect for me! You made my day!
A small improvement: Also mention the hostname in the output.
Razvan · October 7, 2020 at 10:04 am
Thank you! It worked perfectly and it will save a lot of hours of waiting 🙂
Michael Farrell · November 25, 2020 at 10:08 pm
Bryan, Well done sir! This saved me quite a bit of time/headaches so just wanted to say thank you!
Tested successfully with the following:
VCSA 6.7.0.46000
ESXi hosts 6.7.0, 17167734
PSVersion 5.1.14409.1018
VMware.PowerCLI 10.1.0.8403314
Andrew · April 26, 2021 at 8:28 pm
I’m running the script, but I’m not seeing any Output so I have no idea what it’s doing. When I check the hosts after running the script, I can still see RDM’s set to false so I don’t think it’s working. I’m on ESXi 7.0 U2 if that matters.
Bryan van Eeden · May 1, 2021 at 3:17 pm
I haven't tested this on vSphere 7 to be honest. But if there is no output in the PowerShell window like I mentioned, the script is likely not working at all. I will try to test this soon in a vSphere 7 environment; I currently don't have any pRDMs anymore, so it's a bit hard to test.
Rob · June 10, 2021 at 3:10 am
I take my comment back… my issue was that in a multi-cluster environment, with your p/vRDMs in Cluster A but the hosts in Cluster B also able to see those RDMs, it won't set the flag in Cluster B.
A few tweaks to the script to account for that and it's working for 7.0 U2.
THANKS!
Lucas · December 6, 2021 at 1:13 pm
Awesome article. Saved me a lot of time!
Ycl · January 26, 2023 at 7:04 am
Thank you Bryan.
Soheil · April 17, 2023 at 11:11 am
Hi
We have 8 ESXi hosts which boot from SAN; the servers are HPE DL360 Gen10 running ESXi 7.0.3.
Everything on these ESXi hosts is very slow, and even when we restart them in maintenance mode we have to wait more than 30 minutes for ESXi to boot. We don't have any RDMs or local USB flash or HDD here; everything boots from SAN storage.
Most of the boot time is spent on “vmw_vaaip_cx loaded successfully”.
In our SAN monitoring we can also see that all ESXi boot LUNs have a high response time and the latency is very high too.
Please share any solution or similar experience with me.
Bryan van Eeden · April 19, 2023 at 8:37 pm
Are you sure you don't have any RDMs (pRDMs) configured in this environment? You can have a look at the vmkernel.log to maybe find out why this is taking so long, or what is going on in the first place.