We have ThingWorx 9.3.0 deployed in HA with three ThingWorx nodes, three ZooKeeper nodes, and three Ignite nodes. Our database is MSSQL 2019 with two nodes set up for failover.
Recently, we found our ThingWorx cluster down after scheduled OS updates were applied to the MSSQL server nodes. We're looking for help tuning our platform-settings (and whatever else is needed) so the cluster can recover from temporary database drops during server maintenance.
Question: Do I just need to adjust "ClusteredModeSettings.ModelSyncTimeout" to the amount of time we expect it to take for our DB to return to availability?
Any other insights?
Investigation Findings:
This one repeats many times in the Application log.
2023-02-27 00:33:13.594+0000 [L: WARN] [O: c.m.v.a.ThreadPoolAsynchronousRunner] [I: ] [U: ] [S: ] [P: thingworx2] [T: C3P0PooledConnectionPoolManager[identityToken->14wbh7taufjvcx41v83jrm|7e82977b]-AdminTaskTimer] com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@cb7a47d -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
These errors seemed to mark the beginning of the cluster shutdown.
The timeout message about synchronization pointed me to the ClusteredModeSettings section of platform-settings, where I found ModelSyncTimeout set to 120000. Here is the ClusteredModeSettings section for reference:
"ClusteredModeSettings": {
"PlatformId": "thingworx1",
"CoordinatorHosts": "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
"ModelSyncPollInterval": 100,
"ModelSyncWaitPeriod": 3000,
"ModelSyncTimeout": 120000,
"ModelSyncMaxDBUnavailableErrors": 10,
"ModelSyncMaxCacheUnavailableErrors": 10,
"CoordinatorMaxRetries": 3,
"CoordinatorSessionTimeout": 90000,
"CoordinatorConnectionTimeout": 10000,
"MetricsCacheFrequency": 60000,
"IgnoreInactiveInterfaces": true,
"IgnoreVirtualInterfaces": true
},
After posting this, I managed to find this article, CS347815 - Is it possible to increase ChangeWatcher database connection attempts in ThingWorx, which seems similar to our problem. It suggests increasing ModelSyncMaxDBUnavailableErrors (default value is 10). The HA platform settings page defines ModelSyncMaxDBUnavailableErrors as "The number of consecutive sync failures from lost database connectivity allowed before the server shuts down. The timeframe in milliseconds is approximately ModelSyncPollInterval * this value."
With the default ModelSyncPollInterval of 100 ms, covering a gap of ~10 minutes would mean setting ModelSyncMaxDBUnavailableErrors to 10 min * 60 s * 1000 ms / 100 ms = 6000 attempts.
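If that reading is correct, the adjusted section would look something like the sketch below. Only ModelSyncMaxDBUnavailableErrors is changed from our current values, and 6000 is just an illustration sized for a ~10-minute maintenance window, not a value confirmed by PTC:
"ClusteredModeSettings": {
    "PlatformId": "thingworx1",
    "CoordinatorHosts": "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
    "ModelSyncPollInterval": 100,
    "ModelSyncWaitPeriod": 3000,
    "ModelSyncTimeout": 120000,
    "ModelSyncMaxDBUnavailableErrors": 6000,
    "ModelSyncMaxCacheUnavailableErrors": 10,
    "CoordinatorMaxRetries": 3,
    "CoordinatorSessionTimeout": 90000,
    "CoordinatorConnectionTimeout": 10000,
    "MetricsCacheFrequency": 60000,
    "IgnoreInactiveInterfaces": true,
    "IgnoreVirtualInterfaces": true
},
With the poll interval left at 100 ms, 6000 consecutive failures works out to roughly 600 seconds (~10 minutes) of tolerated database unavailability before a node shuts itself down, per the definition quoted above.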
Am I understanding this correctly? What would be the repercussions of setting ModelSyncMaxDBUnavailableErrors this high?
Hi @Ascherer17.
I think it would be a good idea to open a case for this so we can run it by R&D. If you agree, I'll be happy to open the case on your behalf.
Regards.
--Sharon
Hi @slangley,
Aaron is on my team. Can you open the case on my behalf? What additional information would you need from us to open the case?
Hello,
Just an alternative thought -- in many cases you might be able to upgrade your database with minimal downtime. It depends on how you host it; for example, AWS RDS in Multi-AZ can do minor version upgrades with <60 s of downtime, which fits comfortably within ThingWorx's default timeout.
/ Constantine
Hi @Ascherer17 and @nquirindongo.
Per the details of the case, the only solution found was to schedule a restart of the nodes in the ThingWorx cluster following any database maintenance. If you have further information to share, please feel free to do so.
Regards.
--Sharon