Introduction
Recently we upgraded one of our Workspace ONE Access (formerly known as VMware Identity Manager) environments to what was then the newest version, 21.08 (release notes here). All in all the upgrade went fine, but afterwards we ran into a couple of weird issues in the environment that ultimately prevented us from provisioning/assigning new users and groups from the connected Active Directory to our Web Applications.
The issue
As mentioned earlier, we were not able to assign new users or groups to our Web Applications. The behaviour showed itself as follows:
After the upgrade, and during the troubleshooting of this issue, we checked all of the most basic items, which in my opinion are:
- WS ONE Access cluster health
- The WS ONE Access cluster health was 100%. You can check this by going to the Administration Console -> Dashboard -> System Diagnostics Dashboard, checking the DB connection and simply checking all of the nodes. An important thing to notice here is the ElasticSearch health.
- WS ONE Access Node health
- The node health was 100%. You can check this by going to the Administration Console -> Dashboard -> System Diagnostics Dashboard, clicking on a node and examining all the fields. If needed, you can refresh the checks with the refresh button on the right side.
- WS ONE Access Elasticsearch health
The ElasticSearch cluster health can be checked by running the following commands and examining the output:
curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "horizon",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 590,
  "active_shards" : 1180,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
In the above output you can see that the cluster status is green, there are three nodes connected and there are no unassigned shards. The active shards percentage is also 100%. So far so good! You can also fire the following command to see which WS ONE Access appliance is currently the master for the ElasticSearch application:
curl 'http://localhost:9200/_cat/nodes?v'
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
XXX.XX.XX.23  49           98          9   0.26    0.42    0.31     mdi       *      node-vidm01
XXX.XX.XX.25  40           98          7   0.12    0.27    0.23     mdi       -      node-vidm03
XXX.XX.XX.24  45           98          8   0.29    0.34    0.22     mdi       -      node-vidm02
root@vidm01 [ ~ ]#
- Back-end SQL cluster health.
- The health for the SQL cluster was 100%. The user was found within the database in the table called: saas.Users.
- WS ONE Access Connector (Directory Sync status) and health.
- The health was 100% and we found the user “bvetest” being synced to the environment in the Directory Sync service log files. These can be found at: C:\Program Files\Workspace ONE Access\Directory Sync Service\logs
So as you can see, everything seems to be fine and healthy. We were troubleshooting for a while before we decided to just examine a specific shard/index to see if we could spot anything on that end. An index is made up of one or more shards. Each shard is, as the official documentation states, a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data. Once data is written to a shard, it gets pushed to disk, after which it's available for the search engine to use.
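Before zooming in on a single index, it can help to get an overview of all indices in the cluster, including their health, shard counts and size on disk. A minimal sketch using the standard _cat API (the index names and numbers shown in the comment are purely illustrative):

curl 'http://localhost:9200/_cat/indices?v'
# health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
# green  open   v4_2022-07-23 22cSMeF8Rl-sKE38_GBERw 5   1   12345      0            1.2gb      600mb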
You can examine an index by executing the following command (I've removed irrelevant data from the output):
curl -XGET 'http://localhost:9200/index_here?pretty'
...................
"settings" : {
  "index" : {
    "number_of_shards" : "5",
    "blocks" : {
      "read_only_allow_delete" : "true"
    },
    "provided_name" : "v4_2022-07-23",
    "creation_date" : "1658534587283",
    "number_of_replicas" : "1",
    "uuid" : "22cSMeF8Rl-sKE38_GBERw",
    "version" : {
      "created" : "5061599"
    }
  }
}
...................
Now that is pretty interesting. We see some of the settings that are held in the index, but we can also notice that the 'read_only_allow_delete' value is 'true'. At this point we figured that at some point the environment dropped below the free disk space watermark that ElasticSearch (especially the older versions) uses to prevent damage to the cluster. If the threshold for free disk space is crossed, ElasticSearch can do two things: either lock up (read-only) the whole cluster, or put the individual indices in read-only mode. The latter happened here. If we have a look at the log file /opt/vmware/horizon/workspace/logs/analytics-service.log, we can also see that there are errors in regard to pushing new data into ElasticSearch:
2022-10-05 18:17:45,629 GMT INFO vidm01.:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;f527ffed-05df-4af2-95ec-dcf86a8e05b3;-] com.vmware.idm.analytics.resource.AuditResource - Received 2 search documents to delete
2022-10-05 18:17:45,637 GMT WARN vidm01.:analytics (I/O dispatcher 4) [;;;;] com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler - Unable to bulk delete 4 out of 4 documents: One or more of the items in the Bulk request failed, check BulkResult.getItems() for more information.
User-5507 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5507 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5506 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
User-5506 403 {"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
2022-10-05 18:17:45,637 GMT WARN vidm01.:analytics (I/O dispatcher 4) [;;;;] com.vmware.vidm.common.vertx.exception.handler.DefaultExceptionHandler - Unexpected Exception: com.vmware.vidm.common.vertx.exceptions.WebApplicationException: INTERNAL_SERVER_ERROR
    at com.vmware.vidm.common.vertx.Response$Failure.lambda$withMapping$0(Response.java:288)
    at com.vmware.vidm.common.vertx.Response$Failure.lambda$convert$5(Response.java:390)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
    at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
    at com.vmware.vidm.common.vertx.Response$Failure.convert(Response.java:391)
    at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:367)
    at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:343)
    at com.vmware.vidm.common.vertx.Response$Failure.fail(Response.java:305)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler.completed(JestBulkDeleteResultHandler.java:41)
    at com.vmware.idm.analytics.elasticsearch.JestBulkDeleteResultHandler.completed(JestBulkDeleteResultHandler.java:22)
    at io.searchbox.client.http.JestHttpClient$DefaultCallback.completed(JestHttpClient.java:267)
    at io.searchbox.client.http.JestHttpClient$DefaultCallback.completed(JestHttpClient.java:246)
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122)
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181)
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:437)
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:327)
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.vmware.idm.analytics.exception.DeleteFailedException: error.deleteFailed
    ... 19 more
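Before removing the block it is worth confirming that the disk pressure is actually gone, both on the appliance itself and from ElasticSearch's point of view. A quick sketch (the _cat/allocation endpoint shows per-node disk usage as the shard allocator sees it; in the 6.x line the read-only/allow-delete block is applied once a node crosses the flood-stage watermark, which defaults to 95% disk usage):

# Disk usage on the appliance itself
df -h
# Per-node disk usage as seen by the ElasticSearch shard allocator
curl 'http://localhost:9200/_cat/allocation?v'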
If you really wanted to, you could even relate the users mentioned in the above output to the user IDs in the database to find out which specific users are affected.
If we execute the following command, it will display the read-only setting for all indices:
curl 'http://localhost:9200/_all/_settings' | grep -i read_only_allow_delete
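In our case that grep lit up almost everywhere. If you want a slightly more targeted view, the sketch below first counts how many indices carry the block; the second command then prints only the names of the affected indices, assuming the jq utility is available (it may not be present on a stock appliance):

# Count how many indices currently have the read-only block set
curl -s 'http://localhost:9200/_all/_settings?pretty' | grep -c '"read_only_allow_delete" : "true"'

# List only the affected index names (jq is an assumption here)
curl -s 'http://localhost:9200/_all/_settings' \
  | jq -r 'to_entries[] | select(.value.settings.index.blocks.read_only_allow_delete == "true") | .key'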
We noticed that most, if not all, indices were in read-only mode. Thankfully the disk space issue had already been resolved, but ElasticSearch 6.x (which is what runs on the WS ONE Access 21.08 appliances) does not automatically revert the read-only indices once the disk space is back. So this is something we have to do manually by entering the following command (this targets all indices):
curl -k -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": false}'
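If you would rather not touch every index at once, the same call works against a single index as well. Setting the value to null instead of false removes the setting from the index entirely rather than pinning it to an explicit false. A sketch using the index name from the earlier output:

curl -k -XPUT -H "Content-Type: application/json" \
  'http://localhost:9200/v4_2022-07-23/_settings' \
  -d '{"index.blocks.read_only_allow_delete": null}'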
After this we can query the index again and see whether the setting has changed:
curl -XGET 'http://localhost:9200/index_here?pretty'
...................
"settings" : {
  "index" : {
    "number_of_shards" : "5",
    "blocks" : {
      "read_only_allow_delete" : "false"
    },
    "provided_name" : "v4_2022-07-23",
    "creation_date" : "1658534587283",
    "number_of_replicas" : "1",
    "uuid" : "22cSMeF8Rl-sKE38_GBERw",
    "version" : {
      "created" : "5061599"
    }
  }
}
...................
Now that we had done this, we checked the environment again to see whether we could find the users and new groups within the Workspace ONE Access UI. It takes a couple of minutes for ElasticSearch to process everything, but once that is done the result is that it worked!
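Besides checking the UI, you can also confirm from the command line that no index still carries the block; the same grep as before should now return no 'true' entries at all:

# After the fix this should no longer print any "true" values
curl -s 'http://localhost:9200/_all/_settings?pretty' | grep '"read_only_allow_delete" : "true"'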
If we take a quick look at the log file at /opt/vmware/horizon/workspace/logs/analytics-service.log:
root@vidm01 [ ~ ]# tail -f /opt/vmware/horizon/workspace/logs/analytics-service.log
2022-10-07 14:15:13,268 GMT INFO vidm01.loc:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;3309d8a9-f2a7-4c4b-xxxx-29a86f25b51f;-] com.vmware.idm.analytics.resource.AuditResource - Received 6 search documents to store
2022-10-07 14:15:13,293 GMT INFO vidm01:analytics (I/O dispatcher 3) [;;;;] com.vmware.idm.analytics.elasticsearch.JestBulkAddResultHandler - Stored 12 documents
2022-10-07 14:15:13,314 GMT INFO vidm01:analytics (ForkJoinPool-14-worker-0) [-;e294fcee-d216-4b57-bbf2-f1b26ae36311;127.0.0.1;05a1b6ad-6e10-46c0-xxxx-440388a37bf9;-] com.vmware.idm.analytics.resource.AuditResource - Received 12 audit documents to store
We can see that ElasticSearch is once again storing the documents and making them usable in the search engine.
There you have it: some insight into WS ONE Access and the ElasticSearch engine that runs on these appliances. WS ONE Access can sometimes be quite complex to work with since there are a lot of moving parts required to keep it up and running. I hope this blog post helps you with troubleshooting your own issues.