Skip to main content
avillanueva
23-Emerald I
June 20, 2023
Solved

BGMS Stalls - Quits processing queue tasks, publishing, anything

  • June 20, 2023
  • 3 replies
  • 3153 views

I've been tracking a vexing issue for a few weeks now. I would get a note from the users that they are not seeing emails, publishing has stopped or workflows are not working. Quick check on System monitor shows red lights on BGMS:

Threshold of average active method contexts exceeded; average active contexts: 135.0001

 It appears as if no queue jobs are transacting. This will persist until I whack the BGMS process and it auto-restarts. Queue jobs start processing normally and there is an email flood. What is strange is this often occurs at night or weekends. I am checking with IT to make sure this is not an external job that is running. I would think if it was DB related, it would settle back down once issue is resolved. I am still investigating issues on that front. Here is the status messages before the event:

2023-05-27 10:05:33,376 INFO [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary - ElapsedSeconds=600.001357667, CompletedContexts=2844, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=0.0010882164483122362, ContextSecondsMax=0.170925781, ContextCpuSecondsAverage=3.7317970253164556E-4, ContextUserSecondsAverage=3.023909985935302E-4, ContextJDBCConnWaitSecondsAverage=9.443747187060478E-6, ActiveContextsAverage=33.00515865339878, ActiveContextsMax=36, ActiveContextsStart=33, ActiveContextsEnd=33, AverageJDBCCallsPerContext=2.2988748241912798, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.015119549929676512, PercentageOfContextTimeInJDBCCalls=69.4537942697918, PercentageOfContextTimeInJDBCConnWait=0.8678188242520327, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.46653119479917626

 Here is the messages when it occurs:

2023-05-27 10:31:23,396 WARN [JMX Monitor ThreadGroup<main> Executor Pool [Thread-32910]] wt.jmx.notif.methodContextGauge - Time=2023-05-27 10:31:23.393 -0400, Name=MethodContextsGaugeNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=UserSpecified,monitorName=NumberOfMethodContexts, class=class javax.management.monitor.MonitorNotification, type=jmx.monitor.gauge.high, message=, JVM Name=734509@<hostname>, Database Connections=15, RecentStatistics=[activeContextsAverage=43.69575698079789,activeContextsEnd=61,activeContextsMax=61,activeContextsStart=33,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=2.088888888888889,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.022222222222222223,averageWaitedCountPerContext=0.0,completedContexts=45,contextCpuSecondsAverage=7.938385555555555E-4,contextJDBCConnWaitSecondsAverage=1.38434E-5,contextSecondsAverage=0.0035897418444444445,contextSecondsMax=432060.292405267,contextUserSecondsAverage=6.666666666666666E-4,contextsPerSecond=0.5998349031447896,elapsedSeconds=75.020642787,endTime=2023-05-27 10:31:23.395 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.3454552408663482E-5,percentageOfContextTimeInJDBCConnWait=2.0362712675106688E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=2.472260154022281E-8,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:30:08.374 -0400], BaselineStatistics=[activeContextsAverage=17.039334284186697,activeContextsEnd=61,activeContextsMax=79,activeContextsStart=3,averageBlockedCountPerContext=0.02292217978357173,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.403481477633508,averageJNDICallsPerContext=0.001559566688444229,averageRemoteCacheCallsPerContext=0.08460450215393109,averageWaitedCountPerContext=0.00505408810850307,completedContexts=2637271,contextCpuSecondsAverage=0.0014892850589446438,contextJDBCConnWaitSecondsAverage=1.1264928033144869E-4,contextSecondsAverage=0.046587539902467744,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.001307286964441652,contextsPerSecond=5.336972950282546,elapsedSeconds=494151.089872093,endTime=2023-05-27 10:31:23.395 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.863126897034316,percentageOfContextTimeInJDBCConnWait=0.2418013068886723,percentageOfContextTimeInJNDICalls=0.002312341661670789,percentageOfContextTimeInRemoteCacheCalls=0.04387963380984237,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]
2023-05-27 10:34:09,071 INFO [wt.jmx.core.SharedScheduledExecutor.worker] wt.summary.general - JVMName=734509@<hostname>, HeapMemoryUsage=637741752, NonHeapMemoryUsage=325782624, Uptime=494404081, ProcessCpuTime=7944010000000, JITCTime=270947, GarbageCollectors=[G1 Old Generation,G1 Young Generation], GCTime=[577956,10256], GCCount=[928,283], Classes=32564, Threads=182, TotalStartedThreadCount=138596, CodeCachePercentUsage=NaN
2023-05-27 10:35:33,381 WARN [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary - ElapsedSeconds=600.001357878, CompletedContexts=1793, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=9.379451115448968E-4, ContextSecondsMax=0.119802725, ContextCpuSecondsAverage=3.3825659453430004E-4, ContextUserSecondsAverage=2.732849972113776E-4, ContextJDBCConnWaitSecondsAverage=9.276237032905745E-6, ActiveContextsAverage=54.95284711645644, ActiveContextsMax=128, ActiveContextsStart=33, ActiveContextsEnd=128, AverageJDBCCallsPerContext=2.036252091466815, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.015616285554935862, PercentageOfContextTimeInJDBCCalls=68.06158638785061, PercentageOfContextTimeInJDBCConnWait=0.9889957225350618, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.4693038590843637
2023-05-27 10:35:33,382 WARN [wt.jmx.core.SharedScheduledExecutor.worker] wt.jmx.notif.methodContext - Time=2023-05-27 10:35:33.381 -0400, Name=MethodContextsNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=MethodContexts, class=class javax.management.Notification, type=wt.method.MethodContextMonitorMBean.maxAverageActiveContextsThreshold.exceeded, userData=54.95284711645644, message=Threshold of average active method contexts exceeded; average active contexts: 54.9528, JVM Name=734509@<hostname>, Database Connections=15, MaxAverageActiveContextsThreshold=37, RecentStatistics=[activeContextsAverage=119.98866890627697,activeContextsEnd=128,activeContextsMax=128,activeContextsStart=66,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=0.0,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.0,averageWaitedCountPerContext=0.0,completedContexts=17,contextCpuSecondsAverage=3.0073935294117646E-4,contextJDBCConnWaitSecondsAverage=0.0,contextSecondsAverage=5.642147647058823E-4,contextSecondsMax=432310.279012334,contextUserSecondsAverage=0.0,contextsPerSecond=0.20163700157121983,elapsedSeconds=84.309922621,endTime=2023-05-27 10:35:33.382 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.1907240740019307E-5,percentageOfContextTimeInJDBCConnWait=1.956330099324393E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=0.0,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:34:09.072 -0400], BaselineStatistics=[activeContextsAverage=17.072420586985455,activeContextsEnd=128,activeContextsMax=128,activeContextsStart=3,averageBlockedCountPerContext=0.022922005952320013,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.4034480836585725,averageJNDICallsPerContext=0.001559554861408923,averageRemoteCacheCallsPerContext=0.08460386055236226,averageWaitedCountPerContext=0.005054049780627166,completedContexts=2637291,contextCpuSecondsAverage=0.001489276805779112,contextJDBCConnWaitSecondsAverage=1.1264842605120177E-4,contextSecondsAverage=0.04658719249507581,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.001307277050579553,contextsPerSecond=5.334314841123434,elapsedSeconds=494401.076529741,endTime=2023-05-27 10:35:33.382 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.86312413239578,percentageOfContextTimeInJDBCConnWait=0.24180127631238674,percentageOfContextTimeInJNDICalls=0.002312341369270306,percentageOfContextTimeInRemoteCacheCalls=0.04387962826117007,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]
2023-05-27 10:35:33,382 INFO [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.util.IeContextMonitor.stats.summary - ElapsedSeconds=600.002, CompletedContexts=0, ErrorCount=0, ContextSecondsAverage=NaN, ContextSecondsMax=0.0, ContextCpuSecondsAverage=NaN, ContextUserSecondsAverage=NaN, ActiveContextsAverage=0.0, ActiveContextsMax=0, ActiveContextsStart=0, ActiveContextsEnd=0
2023-05-27 10:35:33,383 INFO [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.jndi.DirContextMonitor.stats.summary - ServiceName=com.utcaus.EnterpriseLdap, IntervalSeconds=600.002, TotalContexts=0, ErrorCount=0, ContextsPerSecond=0.0, AverageContextSeconds=0.0, LongestContextSeconds=0.0, AverageActiveContexts=0.0, MaxActiveContexts=0
2023-05-27 10:35:33,383 INFO [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.jndi.DirContextMonitor.stats.summary - ServiceName=com.utcaus.Ldap, IntervalSeconds=600.003, TotalContexts=0, ErrorCount=0, ContextsPerSecond=0.0, AverageContextSeconds=0.0, LongestContextSeconds=0.0, AverageActiveContexts=0.0, MaxActiveContexts=0

2023-05-27 10:44:09,781 INFO [wt.jmx.core.SharedScheduledExecutor.worker] wt.summary.general - JVMName=734509@<hostname>, HeapMemoryUsage=643160336, NonHeapMemoryUsage=325782496, Uptime=495004791, ProcessCpuTime=7949140000000, JITCTime=270956, GarbageCollectors=[G1 Old Generation,G1 Young Generation], GCTime=[578664,10256], GCCount=[929,283], Classes=32564, Threads=245, TotalStartedThreadCount=138771, CodeCachePercentUsage=NaN
2023-05-27 10:45:33,382 WARN [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary - ElapsedSeconds=600.001896003, CompletedContexts=8, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=0.00578046275, ContextSecondsMax=0.008206355, ContextCpuSecondsAverage=0.0023634275, ContextUserSecondsAverage=0.0025, ContextJDBCConnWaitSecondsAverage=0.0, ActiveContextsAverage=130.0123936803109, ActiveContextsMax=132, ActiveContextsStart=128, ActiveContextsEnd=132, AverageJDBCCallsPerContext=0.0, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.0, PercentageOfContextTimeInJDBCCalls=0.0, PercentageOfContextTimeInJDBCConnWait=0.0, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.0
2023-05-27 10:45:33,384 WARN [wt.jmx.core.SharedScheduledExecutor.worker] wt.jmx.notif.methodContext - Time=2023-05-27 10:45:33.383 -0400, Name=MethodContextsNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=MethodContexts, class=class javax.management.Notification, type=wt.method.MethodContextMonitorMBean.maxAverageActiveContextsThreshold.exceeded, userData=130.0123936803109, message=Threshold of average active method contexts exceeded; average active contexts: 130.0124, JVM Name=734509@<hostname>, Database Connections=15, MaxAverageActiveContextsThreshold=37, RecentStatistics=[activeContextsAverage=131.38045394969262,activeContextsEnd=132,activeContextsMax=132,activeContextsStart=131,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=0.0,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.0,averageWaitedCountPerContext=0.0,completedContexts=1,contextCpuSecondsAverage=0.002965126,contextJDBCConnWaitSecondsAverage=0.0,contextSecondsAverage=0.007209526,contextSecondsMax=432910.280733025,contextUserSecondsAverage=0.01,contextsPerSecond=0.011961477921314188,elapsedSeconds=83.601709302,endTime=2023-05-27 10:45:33.383 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.1796607632116384E-5,percentageOfContextTimeInJDBCConnWait=1.9381533542918624E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=0.0,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:44:09.782 -0400], BaselineStatistics=[activeContextsAverage=17.20931762561805,activeContextsEnd=132,activeContextsMax=132,activeContextsStart=3,averageBlockedCountPerContext=0.022921936420557548,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.403434726210414,averageJNDICallsPerContext=0.0015595501306450274,averageRemoteCacheCallsPerContext=0.08460360391445945,averageWaitedCountPerContext=0.0050540344496395745,completedContexts=2637299,contextCpuSecondsAverage=0.0014892794574335331,contextJDBCConnWaitSecondsAverage=1.1264808434273096E-4,contextSecondsAverage=0.04658706871167547,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.0013072806685931325,contextsPerSecond=5.327865162195883,elapsedSeconds=495001.078239194,endTime=2023-05-27 10:45:33.384 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.863115903528474,percentageOfContextTimeInJDBCConnWait=0.2418011853029498,percentageOfContextTimeInJNDICalls=0.002312340498948655,percentageOfContextTimeInRemoteCacheCalls=0.04387961174570612,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]

Any idea what might cause this behavior as described?

Best answer by avillanueva

Success? I did some analysis on which queue jobs were running and my attention was drawn to this job which runs daily:

NAME IDA2A2 SCHEDULETIME ENTRY_TARGETCLASS ENTRY_TARGETMETHOD SCHEDULEITEM TARGETCLASS TARGETMETHOD ITEMDESCRIPTION ITEMNAME
commonScheduleQueue 1630151366 08/16/2023 10:30 wt.scheduler.StandardSchedulingService executeScheduledItem 1630151365 com.ptc.windchill.enterprise.datamonitor.WBRDataMonitorHelper processFailedAsyncExecutions WBR_DATAMONITOR_PURGE_SCHEDULE WBR_DATAMONITOR_PURGE_SCHEDULE

Previously, I put in a call about some Jasper report error messages. This is documented in the CS articles below.

https://www.ptc.com/en/support/article/CS378753?source=search

https://www.ptc.com/en/support/article/CS380347?source=search

This matches us exactly as we are employing an alias hostname. I was going to fix this with a CPS update. 

I then came across this article which I think was the real cause:

https://www.ptc.com/en/support/article/CS358954?source=search

My theory is that there were some daily Jasper reports related to Change Object Summary reports that were failing to execute, likely due to the issues noted above. This job hangs and does not finish resulting in the step up we are seeing in the logs above for contexts. After a few days, things get maxed out and BGMS halts. Following the steps in CS358954, I found the entries in the WBRDataMonitor table and marked them as FAILED. When 10:30 rolled around, no step up was observed. I think I fixed the issue but I will wait for a few days to close this thread.

 

What I do not understand is that we do not use these reports but  I can see how a user might trigger them to run. How do I see what has been requested and clear them? If I do not, I can see this event occurring again until I can get the CPS updated.

avillanueva_0-1692203006744.png

 

 

 

 

3 replies

13-Aquamarine
June 21, 2023

No current idea, but many questions.

May I presume that you already have ran through Windchill tuning that is recommended in https://www.ptc.com/en/support/article/CS244172?

I think yes since you've been working on this for some time, but I'd rather ask then assume.

Which DB are you running on? If it'd be due to lack of resources you'd presume that this would happen during higher loads.

Do you have any scheduled jobs running at nighttime? Does this happen every night, or just on some occasions? 

If it is a regular event, have you considered increasing logging levels for short intervals around when you expect it to happen?

If this is caused by external issues, have you found any correlating events through the Windows Event Viewer? 

 

13-Aquamarine
June 21, 2023

Have you by the way considered to use the On-demand System Scan from PTC? Gather sufficient data and run it through a solid scan for suggestions on what to alter in your system and see what output you get.

HelesicPetr
22-Sapphire II
June 21, 2023

Hi @avillanueva 

I would check what is the average active method contexts and increase the value and check if it can help. 

 

Why this threshold is exceeded in the night? it is interesting, do you have any custom drill action to export data to another system?

 

This can be a case. 

 

PetrH

avillanueva
23-Emerald I
June 21, 2023

If there is a limiter, I can get how two many contexts getting spawned and they all get stuck in the exit door. I am not sure how widening the door would help if we do not know what is spawning them. This is all happening in the morning on a Saturday when the system is idle. No one is online doing things then. I am checking for external causes like something on the database but nothing obvious yet. There are some queue jobs like the Purge Report task that runs daily at 10:30am. We do have a few data monitors but the frequencies for clean up do not align. I have discounted this. This is the key data I think:

ActiveContextsAverage=130.0123936803109, ActiveContextsMax=132, ActiveContextsStart=128, ActiveContextsEnd=132

It spikes at this time and stays this way until restart. I think I will setup a monitoring period that morning and try to trap any DB calls and other activity to see what is the trigger.

avillanueva
23-Emerald I
June 27, 2023

Sharing more info. It obviously has something to do with an errant queue job since timing is spot on but I am seeing that I cannot correlate why its some days and not others. I was able to trap this from the DB.

Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE
.
Blocking Session : 616
Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE
.
Blocking Session : 617
Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE

So this tells me there is some sort of table lock which causes all other jobs to stack up and BGMS to stall.  Still looking for culprit.