Recently we ran into extremely slow boot times on a couple of hosts that we use for a new customer. We didn’t notice this at first, until we added hosts to the existing cluster and the reboot of one of them took over 3 hours!

The host seemed fine to me: I checked the hardware and the configuration, and everything checked out. But the host still appeared to be broken. On the console it seemed to be stuck on the “vmw_vaaip_cx loaded successfully” message in the hypervisor boot screen. Shortly after investigating this I remembered that these hosts have LUNs that are used as physical Raw Device Mappings (RDMs) on Windows MSCS clusters. And that was the problem!

During the boot process, ESXi scans the so-called storage mid-layer and attempts to discover all devices presented to the host during the device claiming phase. This is a problem for RDMs used in MSCS clusters, because they have a permanent SCSI reservation allocated to them. Because of this reservation the ESXi host cannot interrogate the LUN during boot and has to wait for a device scan timeout, which takes a long time. In its KB article on this issue, VMware says that with 10 RDMs the boot process already takes 30 minutes (excluding physical hardware boot times). Our hosts have over 40 RDMs, so that took way too long.

Fortunately, this can be fixed by setting an ESXi local device parameter on the LUN, called “perennially reserved”. The setting tells ESXi that there is a permanent SCSI reservation on the LUN, so ESXi no longer scans that device during the boot process.

You can easily set this for a LUN on an ESXi host with the following commands:
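As a minimal sketch (an assumption, not necessarily the exact commands used here), the flag is typically set from the ESXi shell with `esxcli storage core device setconfig`. The `ESXCLI` variable and the function name are hypothetical helpers, added only so the snippet can be dry-run away from a host; the naa ID in the comment is a placeholder.

```shell
# Sketch: mark one LUN as perennially reserved on the local ESXi host.
# ESXCLI is a hypothetical override; it defaults to the real esxcli binary.
ESXCLI="${ESXCLI:-esxcli}"

set_perennially_reserved() {
    # $1 = device identifier, e.g. naa.60060160087045004f16385dae5b20c9
    "$ESXCLI" storage core device setconfig -d "$1" --perennially-reserved=true
}
```

On a host this would be called as `set_perennially_reserved naa.60060160087045004f16385dae5b20c9`, once per RDM LUN.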

Check whether the command succeeded by running the following command and checking the “IsPerenniallyReserved” setting:
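A hedged sketch of such a check from the ESXi shell: `esxcli storage core device list -d <id>` prints the device details, and in that output the field appears as `Is Perennially Reserved: true` or `false`. The function name and the `ESXCLI` override are hypothetical, added for illustration.

```shell
# Sketch: report whether a device currently carries the perennially reserved flag.
# ESXCLI is a hypothetical override; it defaults to the real esxcli binary.
ESXCLI="${ESXCLI:-esxcli}"

is_perennially_reserved() {
    # $1 = device identifier (naa ID).
    # Exit status 0 when the device list output reports the flag as true.
    "$ESXCLI" storage core device list -d "$1" \
        | grep -q 'Is Perennially Reserved: true'
}
```

For example: `is_perennially_reserved naa.60060160087045004f16385dae5b20c9 && echo "flag is set"`.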

This is easy if you have just a single host, but this customer’s environment has loads of hosts, and I am not going to run this by hand for each RDM on each host, so I wrote a script. The script is pretty simple: it checks a given cluster for VMs with (p/v)RDMs and marks those (p/v)RDMs as perennially reserved on each host in the cluster. It also checks whether a LUN has already been set to perennially reserved and, if so, skips that LUN.
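The check-then-set logic described above can be sketched for a single host in the ESXi shell. This is not the PowerCLI script itself: it takes the naa IDs as arguments instead of discovering them from the cluster's VMs, and the function name and `ESXCLI` override are hypothetical. The messages follow the format the script prints.

```shell
# Sketch: single-host version of the "skip if already set" loop.
# ESXCLI is a hypothetical override; it defaults to the real esxcli binary.
ESXCLI="${ESXCLI:-esxcli}"

reserve_rdms() {
    # Arguments: one or more device identifiers (naa IDs) of RDM LUNs.
    for naa in "$@"; do
        if "$ESXCLI" storage core device list -d "$naa" \
            | grep -q 'Is Perennially Reserved: true'; then
            echo "RDM with naa ID = $naa is already Perennially reserved. Skipping lun."
        else
            echo "RDM with naa ID = $naa is not Perennially reserved"
            echo "Setting Perennially Reserved Flag for RDM with naa ID = $naa"
            "$ESXCLI" storage core device setconfig -d "$naa" --perennially-reserved=true
        fi
    done
}
```

Checking before setting keeps the run idempotent, so the same list of LUNs can be passed again after new RDMs are added.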

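Output from the script should look something like this:

    RDM with naa ID = naa.60060160087045004f16385dae5b20c9 is not Perennially reserved
    Setting Perennially Reserved Flag for RDM with naa ID = naa.60060160087045004f16385dae5b20c9

Or:

    RDM with naa ID = naa.60060160087045004589355df80c5270 is already Perennially reserved. Skipping lun.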

I did find some example scripts on the web, but none that used ESXCLI V2 commands and suited my situation, so I created my own. I plan to turn this into a function sometime this year so that it will be easier to use, but this will do for now.

The “Perennially Reserved” setting can also be edited in a Host Profile since ESXi 5.1. You can do this by browsing to the host profile -> Edit Host Profile -> Storage Configuration -> Pluggable Storage Architecture (PSA) configuration -> PSA Device Setting and set the flag “Device perennially reserved status” to “Enabled”. See the screenshot below for more information:

Host Profile Perennially Reserved flag setting

So there you have it! Are you experiencing extremely long boot times and using (p/v)RDMs? You will want to archive this script to make your life easier!

14 Comments

  1. Great script! This really saved me some time. It’s working for me, though I’m getting the following error when it runs:

    Get-VM : 6/10/2020 7:05:24 PM Get-VM Exception has been thrown by the target of an invocation.
    At D:\Scripts\Fix Slow ESXi Boot Times Due to RDMs\SetPerenniallyReserved.ps1:21 char:41
    + $RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskTy …
    + ~~~~~~
    + CategoryInfo : NotSpecified: (:) [Get-VM], VimException
    + FullyQualifiedErrorId : Core_BaseCmdlet_UnknownError,VMware.VimAutomation.ViCore.Cmdlets.Commands.GetVM

    Any clue what it’s complaining about?

    James Krolak
    1. Hi James,

      Thank you! It’s weird that it runs for you but you still get the error. Did it actually apply the flag after running?
      Looking at your exception, it seems to be failing already on the first line of code:

      $RDMlist = Get-Cluster -Name $cluster | Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select ScsiCanonicalName -Unique

      All this part does is get the VMs and list their p/vRDMs. I suggest you run this line on its own to see whether it works for you at all. If it does and the flag is set, I am not sure why it’s giving you an error. Let me know how it goes!

      1. It turns out, the script was apparently not actually setting perennially reserved properly ‘cuz my hosts are taking 45 minutes to boot. The rest of the script acted like it was running properly, though. I had added in a line so it would report when it would switch to the next host and it listed out all the correct RDM LUN names.

        On the down side, the same day I ran this, we started experiencing problems with vMotion on our cluster. I could not put some hosts into maintenance mode because not all VMs would vMotion. They would sit at 18% for 30-40 minutes and then fail. In some cases, the host would stop responding to vCenter. And sometimes the VMs that would vMotion would get to 98 or 99% done and then sit there for 20 minutes or more before finishing. In the vmkernel logs we can see reports of file locks from other hosts in the cluster when we try to vMotion the ones that won’t move. I’m not sure whether this script had something to do with causing that ‘cuz every VM in the cluster was migrated to new hosts just 4 days earlier with no issue. VMware support is still trying to figure out what the problem is and how we fix it.

        James Krolak
        1. I should mention that I’m now addressing the RDM issue with host profiles, instead. Though, that’s a bit weird ‘cuz as soon as you set the PSA Device Settings for even 1 drive, the host reports EVERY drive it can see as not being included in the profile–including the local boot disk, which has a different naa id, so you can’t really include them all in the profile. So, the host will always say it’s not compliant with the profile.

          James Krolak
        2. Hi James,

          We’ve actually used this script countless times on multiple clusters with a significant number of physical compute nodes that had pRDMs connected. Setting the perennially reserved flag on the device is also a supported move (https://kb.vmware.com/s/article/1016106).

          If you can find “VMFS partition is marked perennially reserved” messages in the vmkernel.log, it could be that a VMFS volume got set to perennially reserved, which it shouldn’t be. If that is the case, you should change it back. I hope your issue is fixed soon!

    1. Hi Chandra,

      There isn’t actually any output file. The output is in the Powershell window in the following example format:
      RDM with naa ID = naa.60060160087045004f16385dae5b20c9 is not Perennially reserved
      Setting Perennially Reserved Flag for RDM with naa ID = naa.60060160087045004f16385dae5b20c9

      or

      RDM with naa ID = naa.60060160087045004589355df80c5270 is already Perennially reserved. Skipping lun.

      I hope this helps.

  2. Bryan, Well done sir! This saved me quite a bit of time/headaches so just wanted to say thank you!

    Tested successfully with the following:

    VCSA 6.7.0.46000
    ESXi hosts 6.7.0, 17167734
    PSVersion 5.1.14409.1018
    VMware.PowerCLI 10.1.0.8403314

    Michael Farrell
