Thingworx Cluster Down After MSSQL OS Updates

Ascherer17
14-Alexandrite


We have ThingWorx 9.3.0 deployed in HA with three ThingWorx nodes, three ZooKeeper nodes, and three Ignite nodes.  Our database is MSSQL 2019 with two nodes set up for failover.

Recently, we found our ThingWorx cluster down after scheduled OS updates were applied to the MSSQL server nodes.  We're looking for help tuning our platform-settings (and anything else needed) to ensure the cluster can recover from temporary database outages caused by server maintenance.

Question: Do I just need to adjust "ClusteredModeSettings.ModelSyncTimeout" to the amount of time we expect it to take for our DB to return to availability?

Any other insights?

 

Investigation Findings:

This message repeats many times in the Application log.

 

2023-02-27 00:33:13.594+0000 [L: WARN] [O: c.m.v.a.ThreadPoolAsynchronousRunner] [I: ] [U: ] [S: ] [P: thingworx2] [T: C3P0PooledConnectionPoolManager[identityToken->14wbh7taufjvcx41v83jrm|7e82977b]-AdminTaskTimer] com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@cb7a47d -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!

 

These errors appear to mark the beginning of the cluster shutdown:

[Screenshot of the errors from the Application log: Ascherer17_0-1677691594348.png]

The timeout message about synchronization pointed me to the ClusteredModeSettings section of platform-settings, where I found ModelSyncTimeout set to 120000.  Here is the ClusteredModeSettings section for reference:

 

"ClusteredModeSettings": {
                       "PlatformId": "thingworx1",
                       "CoordinatorHosts": "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
                       "ModelSyncPollInterval": 100,
                       "ModelSyncWaitPeriod": 3000,
                       "ModelSyncTimeout": 120000,
                       "ModelSyncMaxDBUnavailableErrors": 10,
                       "ModelSyncMaxCacheUnavailableErrors": 10,
                       "CoordinatorMaxRetries": 3,
                       "CoordinatorSessionTimeout": 90000,
                       "CoordinatorConnectionTimeout": 10000,
                       "MetricsCacheFrequency": 60000,
                       "IgnoreInactiveInterfaces": true,
                       "IgnoreVirtualInterfaces": true
       },
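
For illustration, if we expect the database to be unavailable for roughly 10 minutes during patching, is the fix as simple as bumping that one value?  Something like the following, where 600000 is just my assumption of 10 minutes expressed in milliseconds, not a validated recommendation:

"ModelSyncTimeout": 600000,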

 

 

 

ACCEPTED SOLUTION
slangley
23-Emerald II
(To:Ascherer17)

Hi @Ascherer17 and @nquirindongo.

 

Per the details of the case, the only solution found was to schedule a restart of the nodes in the ThingWorx cluster following any database maintenance.  If you have further information to share, please feel free to do so.

 

Regards.

 

--Sharon


REPLIES
Ascherer17
14-Alexandrite
(To:Ascherer17)

After posting this, I managed to find this article, CS347815 - Is it possible to increase ChangeWatcher database connection attempt in ThingWorx, which seems similar to our problem.  It suggests increasing ModelSyncMaxDBUnavailableErrors (the default value is 10).  The HA platform settings page defines ModelSyncMaxDBUnavailableErrors as "The number of consecutive sync failures from lost database connectivity allowed before the server shuts down. The timeframe in milliseconds is approximately ModelSyncPollInterval * this value."

If the default value for ModelSyncPollInterval is 100, and our outage window is ~10 minutes, we'd need to set ModelSyncMaxDBUnavailableErrors to (10 min * 60 s * 1000 ms) / 100 ms = 6000 attempts.
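
If I'm reading the definition correctly, the settings to ride out a ~10-minute database outage would then look roughly like this (the 6000 is my own back-of-the-envelope calculation, not a value confirmed anywhere):

"ModelSyncPollInterval": 100,
"ModelSyncMaxDBUnavailableErrors": 6000,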

Am I understanding this correctly?  What would be the repercussions for setting the ModelSyncMaxDBUnavailableErrors this high?

slangley
23-Emerald II
(To:Ascherer17)

Hi @Ascherer17.

 

I think it would be a good idea to open a case for this so we can run it by R&D.  If you agree, I'll be happy to open the case on your behalf.

 

Regards.

 

--Sharon

Hi slangley, 

 

Aaron is on my team. Can you open the case on my behalf? What additional information would you need from us to open the case?

 

Hello,

 

Just an alternative thought -- in many cases you may be able to upgrade your database with minimal downtime. It depends on how it's hosted; for example, AWS RDS in a Multi-AZ configuration can apply minor version upgrades with under 60 seconds of downtime, which fits comfortably within ThingWorx's default timeout.

 

/ Constantine
