cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
Showing results for 
Search instead for 
Did you mean: 

Community Tip - Want the oppurtunity to discuss enhancements to PTC products? Join a working group! X

Translate the entire conversation x

BGMS Stalls - Quits processing queue tasks, publishing, anything

avillanueva
23-Emerald I

BGMS Stalls - Quits processing queue tasks, publishing, anything

I've been tracking a vexing issue for a few weeks now. I would get a note from the users that they are not seeing emails, publishing has stopped or workflows are not working. Quick check on System monitor shows red lights on BGMS:

Threshold of average active method contexts exceeded; average active contexts: 135.0001

 It appears as if no queue jobs are transacting. This will persist until I whack the BGMS process and it auto-restarts. Queue jobs start processing normally and there is an email flood. What is strange is this often occurs at night or weekends. I am checking with IT to make sure this is not an external job that is running. I would think if it was DB related, it would settle back down once issue is resolved. I am still investigating issues on that front. Here is the status messages before the event:

2023-05-27 10:05:33,376 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary  - ElapsedSeconds=600.001357667, CompletedContexts=2844, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=0.0010882164483122362, ContextSecondsMax=0.170925781, ContextCpuSecondsAverage=3.7317970253164556E-4, ContextUserSecondsAverage=3.023909985935302E-4, ContextJDBCConnWaitSecondsAverage=9.443747187060478E-6, ActiveContextsAverage=33.00515865339878, ActiveContextsMax=36, ActiveContextsStart=33, ActiveContextsEnd=33, AverageJDBCCallsPerContext=2.2988748241912798, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.015119549929676512, PercentageOfContextTimeInJDBCCalls=69.4537942697918, PercentageOfContextTimeInJDBCConnWait=0.8678188242520327, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.46653119479917626

 Here is the messages when it occurs:

2023-05-27 10:31:23,396 WARN  [JMX Monitor ThreadGroup<main> Executor Pool [Thread-32910]] wt.jmx.notif.methodContextGauge  - Time=2023-05-27 10:31:23.393 -0400, Name=MethodContextsGaugeNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=UserSpecified,monitorName=NumberOfMethodContexts, class=class javax.management.monitor.MonitorNotification, type=jmx.monitor.gauge.high, message=, JVM Name=734509@<hostname>, Database Connections=15, RecentStatistics=[activeContextsAverage=43.69575698079789,activeContextsEnd=61,activeContextsMax=61,activeContextsStart=33,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=2.088888888888889,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.022222222222222223,averageWaitedCountPerContext=0.0,completedContexts=45,contextCpuSecondsAverage=7.938385555555555E-4,contextJDBCConnWaitSecondsAverage=1.38434E-5,contextSecondsAverage=0.0035897418444444445,contextSecondsMax=432060.292405267,contextUserSecondsAverage=6.666666666666666E-4,contextsPerSecond=0.5998349031447896,elapsedSeconds=75.020642787,endTime=2023-05-27 10:31:23.395 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.3454552408663482E-5,percentageOfContextTimeInJDBCConnWait=2.0362712675106688E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=2.472260154022281E-8,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:30:08.374 -0400], BaselineStatistics=[activeContextsAverage=17.039334284186697,activeContextsEnd=61,activeContextsMax=79,activeContextsStart=3,averageBlockedCountPerContext=0.02292217978357173,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.403481477633508,averageJNDICallsPerContext=0.001559566688444229,averageRemoteCacheCallsPerContext=0.08460450215393109,averageWaitedCountPerContext=0.00505408810850307,completedContexts=2637271,contextCpuSecondsAverage=0.0014892850589446438,contextJDBCConnWaitSecondsAverage=1.1264928033144869E-4,contextSecondsAverage=0.046587539902467744,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.001307286964441652,contextsPerSecond=5.336972950282546,elapsedSeconds=494151.089872093,endTime=2023-05-27 10:31:23.395 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.863126897034316,percentageOfContextTimeInJDBCConnWait=0.2418013068886723,percentageOfContextTimeInJNDICalls=0.002312341661670789,percentageOfContextTimeInRemoteCacheCalls=0.04387963380984237,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]
2023-05-27 10:34:09,071 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] wt.summary.general  - JVMName=734509@<hostname>, HeapMemoryUsage=637741752, NonHeapMemoryUsage=325782624, Uptime=494404081, ProcessCpuTime=7944010000000, JITCTime=270947, GarbageCollectors=[G1 Old Generation,G1 Young Generation], GCTime=[577956,10256], GCCount=[928,283], Classes=32564, Threads=182, TotalStartedThreadCount=138596, CodeCachePercentUsage=NaN
2023-05-27 10:35:33,381 WARN  [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary  - ElapsedSeconds=600.001357878, CompletedContexts=1793, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=9.379451115448968E-4, ContextSecondsMax=0.119802725, ContextCpuSecondsAverage=3.3825659453430004E-4, ContextUserSecondsAverage=2.732849972113776E-4, ContextJDBCConnWaitSecondsAverage=9.276237032905745E-6, ActiveContextsAverage=54.95284711645644, ActiveContextsMax=128, ActiveContextsStart=33, ActiveContextsEnd=128, AverageJDBCCallsPerContext=2.036252091466815, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.015616285554935862, PercentageOfContextTimeInJDBCCalls=68.06158638785061, PercentageOfContextTimeInJDBCConnWait=0.9889957225350618, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.4693038590843637
2023-05-27 10:35:33,382 WARN  [wt.jmx.core.SharedScheduledExecutor.worker] wt.jmx.notif.methodContext  - Time=2023-05-27 10:35:33.381 -0400, Name=MethodContextsNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=MethodContexts, class=class javax.management.Notification, type=wt.method.MethodContextMonitorMBean.maxAverageActiveContextsThreshold.exceeded, userData=54.95284711645644, message=Threshold of average active method contexts exceeded; average active contexts: 54.9528, JVM Name=734509@<hostname>, Database Connections=15, MaxAverageActiveContextsThreshold=37, RecentStatistics=[activeContextsAverage=119.98866890627697,activeContextsEnd=128,activeContextsMax=128,activeContextsStart=66,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=0.0,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.0,averageWaitedCountPerContext=0.0,completedContexts=17,contextCpuSecondsAverage=3.0073935294117646E-4,contextJDBCConnWaitSecondsAverage=0.0,contextSecondsAverage=5.642147647058823E-4,contextSecondsMax=432310.279012334,contextUserSecondsAverage=0.0,contextsPerSecond=0.20163700157121983,elapsedSeconds=84.309922621,endTime=2023-05-27 10:35:33.382 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.1907240740019307E-5,percentageOfContextTimeInJDBCConnWait=1.956330099324393E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=0.0,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:34:09.072 -0400], BaselineStatistics=[activeContextsAverage=17.072420586985455,activeContextsEnd=128,activeContextsMax=128,activeContextsStart=3,averageBlockedCountPerContext=0.022922005952320013,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.4034480836585725,averageJNDICallsPerContext=0.001559554861408923,averageRemoteCacheCallsPerContext=0.08460386055236226,averageWaitedCountPerContext=0.005054049780627166,completedContexts=2637291,contextCpuSecondsAverage=0.001489276805779112,contextJDBCConnWaitSecondsAverage=1.1264842605120177E-4,contextSecondsAverage=0.04658719249507581,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.001307277050579553,contextsPerSecond=5.334314841123434,elapsedSeconds=494401.076529741,endTime=2023-05-27 10:35:33.382 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.86312413239578,percentageOfContextTimeInJDBCConnWait=0.24180127631238674,percentageOfContextTimeInJNDICalls=0.002312341369270306,percentageOfContextTimeInRemoteCacheCalls=0.04387962826117007,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]
2023-05-27 10:35:33,382 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.util.IeContextMonitor.stats.summary  - ElapsedSeconds=600.002, CompletedContexts=0, ErrorCount=0, ContextSecondsAverage=NaN, ContextSecondsMax=0.0, ContextCpuSecondsAverage=NaN, ContextUserSecondsAverage=NaN, ActiveContextsAverage=0.0, ActiveContextsMax=0, ActiveContextsStart=0, ActiveContextsEnd=0
2023-05-27 10:35:33,383 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.jndi.DirContextMonitor.stats.summary  - ServiceName=com.utcaus.EnterpriseLdap, IntervalSeconds=600.002, TotalContexts=0, ErrorCount=0, ContextsPerSecond=0.0, AverageContextSeconds=0.0, LongestContextSeconds=0.0, AverageActiveContexts=0.0, MaxActiveContexts=0
2023-05-27 10:35:33,383 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] com.infoengine.jndi.DirContextMonitor.stats.summary  - ServiceName=com.utcaus.Ldap, IntervalSeconds=600.003, TotalContexts=0, ErrorCount=0, ContextsPerSecond=0.0, AverageContextSeconds=0.0, LongestContextSeconds=0.0, AverageActiveContexts=0.0, MaxActiveContexts=0

2023-05-27 10:44:09,781 INFO  [wt.jmx.core.SharedScheduledExecutor.worker] wt.summary.general  - JVMName=734509@<hostname>, HeapMemoryUsage=643160336, NonHeapMemoryUsage=325782496, Uptime=495004791, ProcessCpuTime=7949140000000, JITCTime=270956, GarbageCollectors=[G1 Old Generation,G1 Young Generation], GCTime=[578664,10256], GCCount=[929,283], Classes=32564, Threads=245, TotalStartedThreadCount=138771, CodeCachePercentUsage=NaN
2023-05-27 10:45:33,382 WARN  [wt.jmx.core.SharedScheduledExecutor.worker] wt.method.MethodContextMonitor.stats.summary  - ElapsedSeconds=600.001896003, CompletedContexts=8, ErrorCount=0, RedirectCount=0, ContextSecondsAverage=0.00578046275, ContextSecondsMax=0.008206355, ContextCpuSecondsAverage=0.0023634275, ContextUserSecondsAverage=0.0025, ContextJDBCConnWaitSecondsAverage=0.0, ActiveContextsAverage=130.0123936803109, ActiveContextsMax=132, ActiveContextsStart=128, ActiveContextsEnd=132, AverageJDBCCallsPerContext=0.0, AverageJNDICallsPerContext=0.0, AverageRemoteCacheCallsPerContext=0.0, PercentageOfContextTimeInJDBCCalls=0.0, PercentageOfContextTimeInJDBCConnWait=0.0, PercentageOfContextTimeInJNDICalls=0.0, PercentageOfContextTimeInRemoteCacheCalls=0.0
2023-05-27 10:45:33,384 WARN  [wt.jmx.core.SharedScheduledExecutor.worker] wt.jmx.notif.methodContext  - Time=2023-05-27 10:45:33.383 -0400, Name=MethodContextsNotifier, SourceObjectName=com.ptc:wt.subsystem=Monitors,wt.monitorType=MethodContexts, class=class javax.management.Notification, type=wt.method.MethodContextMonitorMBean.maxAverageActiveContextsThreshold.exceeded, userData=130.0123936803109, message=Threshold of average active method contexts exceeded; average active contexts: 130.0124, JVM Name=734509@<hostname>, Database Connections=15, MaxAverageActiveContextsThreshold=37, RecentStatistics=[activeContextsAverage=131.38045394969262,activeContextsEnd=132,activeContextsMax=132,activeContextsStart=131,averageBlockedCountPerContext=0.0,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=0.0,averageJNDICallsPerContext=0.0,averageRemoteCacheCallsPerContext=0.0,averageWaitedCountPerContext=0.0,completedContexts=1,contextCpuSecondsAverage=0.002965126,contextJDBCConnWaitSecondsAverage=0.0,contextSecondsAverage=0.007209526,contextSecondsMax=432910.280733025,contextUserSecondsAverage=0.01,contextsPerSecond=0.011961477921314188,elapsedSeconds=83.601709302,endTime=2023-05-27 10:45:33.383 -0400,errorCount=0,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=1.1796607632116384E-5,percentageOfContextTimeInJDBCConnWait=1.9381533542918624E-7,percentageOfContextTimeInJNDICalls=0.0,percentageOfContextTimeInRemoteCacheCalls=0.0,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-27 10:44:09.782 -0400], BaselineStatistics=[activeContextsAverage=17.20931762561805,activeContextsEnd=132,activeContextsMax=132,activeContextsStart=3,averageBlockedCountPerContext=0.022921936420557548,averageCORBACallsPerContext=0.0,averageJDBCCallsPerContext=4.403434726210414,averageJNDICallsPerContext=0.0015595501306450274,averageRemoteCacheCallsPerContext=0.08460360391445945,averageWaitedCountPerContext=0.0050540344496395745,completedContexts=2637299,contextCpuSecondsAverage=0.0014892794574335331,contextJDBCConnWaitSecondsAverage=1.1264808434273096E-4,contextSecondsAverage=0.04658706871167547,contextSecondsMax=1800.673565344,contextUserSecondsAverage=0.0013072806685931325,contextsPerSecond=5.327865162195883,elapsedSeconds=495001.078239194,endTime=2023-05-27 10:45:33.384 -0400,errorCount=12,percentageOfContextTimeBlocked=0.0,percentageOfContextTimeInCORBACalls=0.0,percentageOfContextTimeInJDBCCalls=21.863115903528474,percentageOfContextTimeInJDBCConnWait=0.2418011853029498,percentageOfContextTimeInJNDICalls=0.002312340498948655,percentageOfContextTimeInRemoteCacheCalls=0.04387961174570612,percentageOfContextTimeWaited=0.0,redirectCount=0,startTime=2023-05-21 17:15:32.305 -0400]

Any idea what might cause this behavior as described?

ACCEPTED SOLUTION

Accepted Solutions

Success? I did some analysis on which queue jobs were running and my attention was drawn to this job which runs daily:

NAME IDA2A2 SCHEDULETIME ENTRY_TARGETCLASS ENTRY_TARGETMETHOD SCHEDULEITEM TARGETCLASS TARGETMETHOD ITEMDESCRIPTION ITEMNAME
commonScheduleQueue 1630151366 08/16/2023 10:30 wt.scheduler.StandardSchedulingService executeScheduledItem 1630151365 com.ptc.windchill.enterprise.datamonitor.WBRDataMonitorHelper processFailedAsyncExecutions WBR_DATAMONITOR_PURGE_SCHEDULE WBR_DATAMONITOR_PURGE_SCHEDULE

Previously, I put in a call about some Jasper report error messages. This is documented in the CS articles below.

https://www.ptc.com/en/support/article/CS378753?source=search

https://www.ptc.com/en/support/article/CS380347?source=search

This matches us exactly as we are employing an alias hostname. I was going to fix this with a CPS update. 

I then came across this article which I think was the real cause:

https://www.ptc.com/en/support/article/CS358954?source=search

My theory is that there were some daily Jasper reports related to Change Object Summary reports that were failing to execute, likely due to the issues noted above. This job hangs and does not finish resulting in the step up we are seeing in the logs above for contexts. After a few days, things get maxed out and BGMS halts. Following the steps in CS358954, I found the entries in the WBRDataMonitor table and marked them as FAILED. When 10:30 rolled around, no step up was observed. I think I fixed the issue but I will wait for a few days to close this thread.

 

What I do not understand is that we do not use these reports but  I can see how a user might trigger them to run. How do I see what has been requested and clear them? If I do not, I can see this event occurring again until I can get the CPS updated.

avillanueva_0-1692203006744.png

 

 

 

 

View solution in original post

11 REPLIES 11

No current idea, but many questions.

May I presume that you already have ran through Windchill tuning that is recommended in https://www.ptc.com/en/support/article/CS244172?

I think yes since you've been working on this for some time, but I'd rather ask then assume.

Which DB are you running on? If it'd be due to lack of resources you'd presume that this would happen during higher loads.

Do you have any scheduled jobs running at nighttime? Does this happen every night, or just on some occasions? 

If it is a regular event, have you considered increasing logging levels for short intervals around when you expect it to happen?

If this is caused by external issues, have you found any correlating events through the Windows Event Viewer? 

 

Have you by the way considered to use the On-demand System Scan from PTC? Gather sufficient data and run it through a solid scan for suggestions on what to alter in your system and see what output you get.

HelesicPetr
22-Sapphire II
(To:avillanueva)

Hi @avillanueva 

I would check what is the average active method contexts and increase the value and check if it can help. 

 

Why this threshold is exceeded in the night? it is interesting, do you have any custom drill action to export data to another system?

 

This can be a case. 

 

PetrH

If there is a limiter, I can get how two many contexts getting spawned and they all get stuck in the exit door. I am not sure how widening the door would help if we do not know what is spawning them. This is all happening in the morning on a Saturday when the system is idle. No one is online doing things then. I am checking for external causes like something on the database but nothing obvious yet. There are some queue jobs like the Purge Report task that runs daily at 10:30am. We do have a few data monitors but the frequencies for clean up do not align. I have discounted this. This is the key data I think:

ActiveContextsAverage=130.0123936803109, ActiveContextsMax=132, ActiveContextsStart=128, ActiveContextsEnd=132

It spikes at this time and stays this way until restart. I think I will setup a monitoring period that morning and try to trap any DB calls and other activity to see what is the trigger.

Sharing more info. It obviously has something to do with an errant queue job since timing is spot on but I am seeing that I cannot correlate why its some days and not others. I was able to trap this from the DB.

Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE
.
Blocking Session : 616
Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE
.
Blocking Session : 617
Object (Owner/Name): <user>.QUEUELOCK
Object Type : TABLE

So this tells me there is some sort of table lock which causes all other jobs to stack up and BGMS to stall.  Still looking for culprit.

DmitryC
14-Alexandrite
(To:avillanueva)

Hi @avillanueva,

 

 

We've been battling a different issue that causes blocks in SQL transactions. One of the symptoms is that it stalls some of the queues, including workflows and publishing.

 

That's what we found so far during the investigation:

We´ve found that CPS 10 introduced changes to EDRLoader class, which signals that we might be closer to solving this issue, 

It's referencing CPS 12.0.2.10. 

 

It's a rare issue for us, happening 1-2 times a month.

avillanueva
23-Emerald I
(To:DmitryC)

I am on CPS 8 so no luck for me. Since the failure occurs at about same time of the day, just not any particular day, here are my current working theories:

  • A certain scheduled job runs and locks the tables but fails to release.
  • A different job locks the table and fails to release but issue is only problematic when another scheduled job which runs at that special time comes along and runs into the lock.

I believe the cause to be a traffic jam of active contexts spiraling up to the maximum. Essentially, they all get stuck in the doorway.

avillanueva_0-1688124186025.jpeg

 

More info. This appears on a regular cadence (just restarted) but I am still baffled. I am linking in CS244172. I will note that max DB connections is 15 and I've always had that value set. There is also this curious pattern of Method Server context that does not make any sense.

avillanueva_0-1692026087524.png

avillanueva_1-1692026095233.png

and then when it went haywire:

avillanueva_2-1692026115633.png

I do see a message at 10:32 am that there was a successful QueueLock for the commonProcessorQueue but that its. It never releases it but nothing ever reported that it also asked for it. 

 

Success? I did some analysis on which queue jobs were running and my attention was drawn to this job which runs daily:

NAME IDA2A2 SCHEDULETIME ENTRY_TARGETCLASS ENTRY_TARGETMETHOD SCHEDULEITEM TARGETCLASS TARGETMETHOD ITEMDESCRIPTION ITEMNAME
commonScheduleQueue 1630151366 08/16/2023 10:30 wt.scheduler.StandardSchedulingService executeScheduledItem 1630151365 com.ptc.windchill.enterprise.datamonitor.WBRDataMonitorHelper processFailedAsyncExecutions WBR_DATAMONITOR_PURGE_SCHEDULE WBR_DATAMONITOR_PURGE_SCHEDULE

Previously, I put in a call about some Jasper report error messages. This is documented in the CS articles below.

https://www.ptc.com/en/support/article/CS378753?source=search

https://www.ptc.com/en/support/article/CS380347?source=search

This matches us exactly as we are employing an alias hostname. I was going to fix this with a CPS update. 

I then came across this article which I think was the real cause:

https://www.ptc.com/en/support/article/CS358954?source=search

My theory is that there were some daily Jasper reports related to Change Object Summary reports that were failing to execute, likely due to the issues noted above. This job hangs and does not finish resulting in the step up we are seeing in the logs above for contexts. After a few days, things get maxed out and BGMS halts. Following the steps in CS358954, I found the entries in the WBRDataMonitor table and marked them as FAILED. When 10:30 rolled around, no step up was observed. I think I fixed the issue but I will wait for a few days to close this thread.

 

What I do not understand is that we do not use these reports but  I can see how a user might trigger them to run. How do I see what has been requested and clear them? If I do not, I can see this event occurring again until I can get the CPS updated.

avillanueva_0-1692203006744.png

 

 

 

 


@avillanueva wrote:

 

What I do not understand is that we do not use these reports but  I can see how a user might trigger them to run. How do I see what has been requested and clear them? If I do not, I can see this event occurring again until I can get the CPS updated.

avillanueva_0-1692203006744.png

 

 

 

 


Could it be some data monitor created by a user on one of the reports you highlighted?

 

When a Data Monitor is created or deleted in Windchill which queues are used?

 

 

Marco

I deleted any data monitors in use to test that. I do not think that was the cause.

Announcements

Top Tags