ThingWorx Cluster Down After MSSQL OS Updates

Question

We have ThingWorx 9.3.0 deployed in HA with three ThingWorx nodes, three ZooKeeper nodes, and three Ignite nodes. Our database is MSSQL 2019 with two nodes set up for failover.

Recently, we found our ThingWorx cluster down after scheduled OS updates were applied to the MSSQL server nodes. We're looking for help tuning our platform-settings, and whatever else is needed, so that the cluster can recover from temporary database drops caused by server maintenance.

Question: Do I just need to adjust "ClusteredModeSettings.ModelSyncTimeout" to the amount of time we expect our DB to take to become available again? Any other insights?

Investigation Findings:

This warning repeats many times in the Application log:

2023-02-27 00:33:13.594+0000 [L: WARN] [O: c.m.v.a.ThreadPoolAsynchronousRunner] [I: ] [U: ] [S: ] [P: thingworx2] [T: C3P0PooledConnectionPoolManager[identityToken->14wbh7taufjvcx41v83jrm|7e8297 7b]-AdminTaskTimer] com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@cb7a47d -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!

These errors seemed to define the beginning of cluster shutdown:

The timeout message about synchronization pointed me to the ClusteredModeSettings section of platform-settings, where I found ModelSyncTimeout set to 120000. The ClusteredModeSettings section for reference:

"ClusteredModeSettings": {
 "PlatformId": "thingworx1",
 "CoordinatorHosts": "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
 "ModelSyncPollInterval": 100,
 "ModelSyncWaitPeriod": 3000,
 "ModelSyncTimeout": 120000,
 "ModelSyncMaxDBUnavailableErrors": 10,
 "ModelSyncMaxCacheUnavailableErrors": 10,
 "CoordinatorMaxRetries": 3,
 "CoordinatorSessionTimeout": 90000,
 "CoordinatorConnectionTimeout": 10000,
 "MetricsCacheFrequency": 60000,
 "IgnoreInactiveInterfaces": true,
 "IgnoreVirtualInterfaces": true
 },

slangley · Accepted Answer

Hi @Ascherer17 and @nquirindongo.

Per the details of the case, the only solution found was to schedule a restart of the nodes in the ThingWorx cluster following any database maintenance. If you have further information to share, please feel free to do so.

Regards.

--Sharon

Ascherer17 · Answer

After posting this, I managed to find this article, CS347815 - Is it possible to increase ChangeWatcher database connection attempt in ThingWorx, which seems similar to our problem. It suggests increasing ModelSyncMaxDBUnavailableErrors (default value is 10). The HA platform settings page defines ModelSyncMaxDBUnavailableErrors as "The number of consecutive sync failures from lost database connectivity allowed before the server shuts down. The timeframe in milliseconds is approximately ModelSyncPollInterval * this value."
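To put a number on that definition, here is a rough Python sketch that applies the quoted approximation (ModelSyncPollInterval * ModelSyncMaxDBUnavailableErrors) to the two relevant values from our ClusteredModeSettings. The function name db_outage_tolerance_ms is just something I made up for illustration, and the result is only as accurate as the "approximately" in the documentation.

```python
import json

# The two relevant values, copied from the ClusteredModeSettings block in the question.
settings_json = """
{
  "ModelSyncPollInterval": 100,
  "ModelSyncMaxDBUnavailableErrors": 10
}
"""


def db_outage_tolerance_ms(settings: dict) -> int:
    """Approximate DB outage (ms) tolerated before the server shuts down.

    Uses the documented approximation:
    ModelSyncPollInterval * ModelSyncMaxDBUnavailableErrors.
    The function name is hypothetical, not part of ThingWorx.
    """
    return settings["ModelSyncPollInterval"] * settings["ModelSyncMaxDBUnavailableErrors"]


settings = json.loads(settings_json)
tolerance_ms = db_outage_tolerance_ms(settings)
print(f"Approximate tolerance: {tolerance_ms} ms ({tolerance_ms / 1000:.1f} s)")
```

With our current values this works out to roughly one second of lost database connectivity, which is far shorter than an OS update and reboot on the MSSQL nodes.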

If the default value for ModelSyncPollInterval is 100, this would mean if our time gap is ~10 minutes, we'd need to set ModelSyncMaxDBUnavailableErrors to 10min * 60sec * 1000ms / 100ms = 6000 tries.
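The same arithmetic as a small Python sketch, in case the maintenance window or poll interval changes. It rounds up so the computed count always covers the full outage. The function name required_max_db_errors is hypothetical, and this only restates the documented approximation; it says nothing about whether a value this high is safe to run with.

```python
def required_max_db_errors(outage_ms: int, poll_interval_ms: int) -> int:
    """Smallest ModelSyncMaxDBUnavailableErrors covering an outage of outage_ms.

    Inverts the documented approximation
    (timeframe ~= ModelSyncPollInterval * ModelSyncMaxDBUnavailableErrors)
    and rounds up using integer ceiling division.
    """
    if poll_interval_ms <= 0:
        raise ValueError("poll_interval_ms must be positive")
    return -(-outage_ms // poll_interval_ms)


# Expected DB downtime of ~10 minutes, with ModelSyncPollInterval = 100 ms.
outage_ms = 10 * 60 * 1000
print(required_max_db_errors(outage_ms, 100))  # prints 6000
```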

Am I understanding this correctly? What would be the repercussions of setting ModelSyncMaxDBUnavailableErrors this high?
