Introduction

Recently we upgraded one of our Workspace ONE Access (formerly known as VMware Identity Manager) environments to what was at that time the newest version, 21.08 (release notes here). The upgrade itself went fine, but afterwards we ran into a couple of strange issues in the environment that ultimately prevented us from provisioning/assigning new users and groups from the connected Active Directory to our Web Applications.

The issue

As mentioned earlier, we were not able to assign new users or groups to our Web Applications. The behaviour manifested itself as follows:

Workspace ONE Access unable to find the user

and

Workspace ONE Access unable to assign/find user to Web Application

During the troubleshooting of this issue we checked all of the basic items, which in my opinion are:

  • WS ONE Access cluster health
    • The WS ONE Access cluster health was 100%. You can check this by going to the Administration Console -> Dashboard -> System Diagnostics Dashboard and verifying the DB connection and all of the nodes. An important thing to look at here is the ElasticSearch health.
  • WS ONE Access Node health
    • The node health was 100%. You can check this by going to the Administration Console -> Dashboard -> System Diagnostics Dashboard, clicking on a node and examining all the fields. If needed, you can refresh the checks with the refresh button on the right side.
  • WS ONE Access Elasticsearch health

The ElasticSearch cluster health can be checked by running the following commands and examining the output:

curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "horizon",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 590,
  "active_shards" : 1180,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

In the above output you can see that the cluster status is green, there are three nodes connected and there are no ‘unassigned’ shards. The active shards percentage is also 100%. So far so good! You can also run the following command to see which WS ONE Access appliance is the master for the ElasticSearch application:

curl 'http://localhost:9200/_cat/nodes?v'
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
XXX.XX.XX.23           49          98   9    0.26    0.42     0.31 mdi       *      node-vidm01
XXX.XX.XX.25           40          98   7    0.12    0.27     0.23 mdi       -      node-vidm03
XXX.XX.XX.24           45          98   8    0.29    0.34     0.22 mdi       -      node-vidm02
root@vidm01 [ ~ ]#
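
Since disk space turns out to play a role later in this story, it can also be handy to check how much disk space each data node is using, straight from Elasticsearch. The standard _cat allocation API provides this as a read-only check (the exact columns may differ slightly between Elasticsearch versions):

# Shows the shard count plus used and available disk space per data node
curl 'http://localhost:9200/_cat/allocation?v'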
  • Back-end SQL cluster health.
    • The health for the SQL cluster was 100%. The user was found within the database in the table called: saas.Users.
  • WS ONE Access Connector (Directory Sync status) and health.
    • The health was 100% and we found the user “bvetest” being synced to the environment in the Directory Sync service log files. These can be found at: C:\Program Files\Workspace ONE Access\Directory Sync Service\logs

So as you can see, everything seemed to be fine and healthy. We were troubleshooting for a while before we decided to just examine a specific index/shard to see if we could spot anything on that end. An index is made up of one or more shards. Each shard is, as the official documentation states, a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data. Once data is written to a shard, it gets flushed to disk, after which it’s available for the search engine to use.
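
If you want a quick overview of how the indices in the cluster are split into shards, and on which node each shard currently lives, the _cat shards API gives a readable listing (again a read-only check):

# Lists every shard with its index, primary/replica role, state, document count and node
curl 'http://localhost:9200/_cat/shards?v'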

You can examine an index (and the settings that apply to its shards) by executing the following command (I’ve removed irrelevant data from the output):

curl -XGET 'http://localhost:9200/index_here?pretty'
...................
"settings" : {
      "index" : {
        "number_of_shards" : "5",
        "blocks" : {
          "read_only_allow_delete" : "true"
        },
        "provided_name" : "v4_2022-07-23",
        "creation_date" : "1658534587283",
        "number_of_replicas" : "1",
        "uuid" : "22cSMeF8Rl-sKE38_GBERw",
        "version" : {
          "created" : "5061599"
        }
      }
    }
...................

Now that is pretty interesting. We see some settings that are held in the index, but we also notice that the ‘read_only_allow_delete‘ value is ‘true‘. At this point we figured that at some point the environment dropped below the free disk space watermark that ElasticSearch (especially the older versions) uses to prevent damage to the cluster. If the threshold for free disk space is crossed, ElasticSearch can do two things: either lock up (read-only) the whole cluster, or put the indices in read-only mode. The latter happened here. If we have a look at the log file /opt/vmware/horizon/workspace/logs/analytics-service.log we can also see that there are errors in regard to pushing new data into ElasticSearch:

2022-10-05 18:17:45,629 GMT INFO  vidm01.:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;f527ffed-05df-4af2-95ec-dcf86a8e05b3;-] com.vmware.idm.analytics.resource.AuditResource - Received 2 search documents to delete
2022-10-05 18:17:45,637 GMT WARN  vidm01.:analytics (I/O dispatcher 4) [;;;;] com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler - Unable to bulk delete 4 out of 4 documents: One or more of the items in the Bulk request failed, check BulkResult.getItems() for more information.
User-5507 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5507 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5506 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5506 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
2022-10-05 18:17:45,637 GMT WARN  vidm01.:analytics (I/O dispatcher 4) [;;;;] com.vmware.vidm.common.vertx.exception.handler.DefaultExceptionHandler - Unexpected Exception:  com.vmware.vidm.common.vertx.exceptions.WebApplicationException: INTERNAL_SERVER_ERROR
        at com.vmware.vidm.common.vertx.Response$Failure.lambda$withMapping$0(Response.java:288)
        at com.vmware.vidm.common.vertx.Response$Failure.lambda$convert$5(Response.java:390)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
        at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
        at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
        at com.vmware.vidm.common.vertx.Response$Failure.convert(Response.java:391)
        at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:367)
        at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:343)
        at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:305)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler.completed(JestBulkDeleteResultHandler.java:41)
        at com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler.completed(JestBulkDeleteResultHandler.java:22)
        at io.searchbox.client.http.JestHttpClient$DefaultCallback.completed(JestHttpClient.java:267)
        at io.searchbox.client.http.JestHttpClient$DefaultCallback.completed(JestHttpClient.java:246)
        at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122)
        at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:437)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:327)
        at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.vmware.idm.analytics.exception.DeleteFailedException: error.deleteFailed
        ... 19 more

If you really wanted to, you could even relate the users mentioned in the above output to the IDs of the users in the database, to find out which specific users are having issues.
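
As a hypothetical example, assuming the User-5507/User-5506 entries in the log above are the Elasticsearch document IDs of the documents that failed to delete, you could look one of them up directly to see which user record it belongs to:

# Hypothetical lookup of one of the failing document IDs across all indices
curl 'http://localhost:9200/_search?q=_id:User-5507&pretty'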

If we execute the following command it will display the settings for all indices:

curl 'http://localhost:9200/_all/_settings' | grep -i read_only_allow_delete
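
The output of that command is not very readable, so a slightly friendlier variant (still only using curl and grep) is to pretty-print the settings and grep for both the index name and the flag, so you can see exactly which indices are affected. And because the root cause is disk pressure, it doesn't hurt to confirm that the appliances have enough free disk space again before re-enabling writes:

# Show per index whether the read-only flag is set (the index names show up as provided_name)
curl -s 'http://localhost:9200/_all/_settings?pretty' | grep -E '"provided_name"|"read_only_allow_delete"'

# Confirm there is enough free disk space again on the appliance
df -h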

We noticed that most, if not all, indices were in read-only mode. Thankfully the disk space issue had already been resolved, but ElasticSearch 6.x (which is running on the WS ONE Access 21.08 appliances) does not automatically revert the read-only indices once disk space is available again. So this is something we had to do manually by entering the following command (this applies to all indices):

curl -k -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": false}'
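
If you would rather only touch a single index instead of all of them at once, the same call also works against a specific index name; setting the value to null instead of false resets the setting to the cluster default. The index name below is just the example index shown earlier:

curl -k -XPUT -H "Content-Type: application/json" http://localhost:9200/v4_2022-07-23/_settings -d '{"index.blocks.read_only_allow_delete": null}'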

After this we can query the index again and see if the setting has changed:

curl -XGET 'http://localhost:9200/index_here?pretty'
...................
"settings" : {
      "index" : {
        "number_of_shards" : "5",
        "blocks" : {
          "read_only_allow_delete" : "false"
        },
        "provided_name" : "v4_2022-07-23",
        "creation_date" : "1658534587283",
        "number_of_replicas" : "1",
        "uuid" : "22cSMeF8Rl-sKE38_GBERw",
        "version" : {
          "created" : "5061599"
        }
      }
    }
...................

After doing this, we checked the environment again to see if we were now able to find the users and new groups within the Workspace ONE Access UI. It takes a couple of minutes for ElasticSearch to process everything, but once that was done, it worked!

Workspace ONE Access able to find the user

and

Workspace ONE Access able to assign user to Web Application

If we take a quick look at the log file at /opt/vmware/horizon/workspace/logs/analytics-service.log:

root@vidm01 [ ~ ]# tail -f /opt/vmware/horizon/workspace/logs/analytics-service.log
2022-10-07 14:15:13,268 GMT INFO  vidm01.loc:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;3309d8a9-f2a7-4c4b-xxxx-29a86f25b51f;-] com.vmware.idm.analytics.resource.AuditResource - Received 6 search documents to store
2022-10-07 14:15:13,293 GMT INFO  vidm01:analytics (I/O dispatcher 3) [;;;;] com.vmware.idm.analytics.elasticsearch.JestBulkAddResultHandler - Stored 12 documents
2022-10-07 14:15:13,314 GMT INFO  vidm01:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;05a1b6ad-6e10-46c0-xxxx-440388a37bf9;-] com.vmware.idm.analytics.resource.AuditResource - Received 12 audit documents to store

We can see that ElasticSearch is once again storing the documents and making them usable in the search engine.

There you have it: some insight into WS ONE Access and the ElasticSearch engine that runs on these appliances. WS ONE Access can sometimes be quite complex to work with, since there are a lot of moving parts required to keep it up and running. I hope this blog post helps you with troubleshooting your issues.


