Community Tip - Your Friends List is a way to easily have access to the community members that you interact with the most! X
I am still investigating but twice this week we've had issues where our DB (Oracle 21C Standard) would freeze up. All Windchill threads would pause and stack up. After 10-20 mins, the system would come alive but freeze again shortly large. We saw this Monday and this lasted for a few hours. Things were fine until this afternoon when issue returned. We were not able to determine the root issue but we know its database related. We capture ORA-00742 error in alert.log (log read detects lost write in thread string sequence string block).
My DBA is chasing down any Oracle related bug or setting issue. I started looking into what was going on at that time. Very close to that time, a user checked in a massive 1200 component assembly new. It looks like it succeeded since I can see the jobs filled up the publishing queue. I can also see that they were working in a massive workspace on Monday as well, 2400+ objects, basically some garbage, crappy STEP import.
My questions is these: is it possible that the large transaction swamped the system, overloaded the database to a point where it is struggling to recover? Could that impact linger for hours after the event supposedly finished? Is this too large of a job for it to handle? I am sure that people out there have bigger assemblies. But could be a combination of our scale and possible tuning that choked it? How large is just too large? I am not talking about just the size but the fact I saw it all go in new all at once. Let me know your thoughts I and I will share a post-mortem when resolved.
Solved! Go to Solution.
At this point, the field has told me that the numbers I was describing should have been no issue. There is no limit which should have froze the database. I can see what @lhoogeveen is describing where that transaction would take much longer and possible slow down the system. That's not what we saw. It appears that after the check in was complete (since I saw publishing jobs) the anomaly occurred. That freeze was a complete lock up of the database, all method server transactions stacked up and paused. Preliminary information from the DBA was correct REDO header on one of the redo logs. We contacted Oracle and were able to clean the archive logs. Whether this was a symptom, cause, trigger or result is still under investigation but we did not see repeats of the 742 error after the fixes. System is back to operation (said same thing Monday too).
Aside from the single assembly of 2800 components, I see that this same user has a workspace of 11293 objects. Any issue with that bringing down the house?
Performance and stability vary based on how well the system is tuned, so it is possible. If you can't find a smoking gun in the logs (Windchill and DB), have you tried the OnDemand system scan?
Instructions:
https://www.ptc.com/en/support/article/CS299230
Upload site:
https://www.ptc.com/en/support/on-demand-system-scan/
I start here for general performance and stability issues. Then I log a call with the Windchill performance team if I am still having issues.
We have multiple users with 15k+ object workspaces with no issue (other than workspace actions are slower relative to a 1k workspace). I have also seen over the years checkins of over 1k objects that did not stall the system.
We are currently on Oracle 19c SE2.
Thanks, just being methodical. I had a conversation with the user and they were actually not online Monday during the event so its less likely it was a culprit. Now that I know we are no where near a ceiling, I can look elsewhere. I still have issues with huge assemblies like this for other reasons but system should not have behaved this way. Looking more like a DB related issue. Stay tuned.
@avillanueva wrote:
... I can see the jobs filled up the publishing queue.
I've never had our system completely lock up, but publish-on-change (on check-in) definitely slow things down. I need to test, but I think all the publish entries have to be generated before whatever action triggered them (check-in in this case) is considered complete. I've seen this when running the reassign lifecycle states command-line utility. When publish-on-change is turned on, the utility will process a couple thousand objects per hour. When it's turned off the utility will process hundreds of thousands of objects per hour. Might be worth seeing what impact having publish-on-change enabled during a large check-in is having.
You're on Windchill 12.x. True?
Is Oracle 21c supported?
When I view the matrix from 12.1.2.0.0 it lists for Oracle only 19c.
Plus, there's this article confirming it is not supported. Doesn't necessarily mean it won't work. Just means it's not supported.
My bad. Sry, It is 19C. Getting ahead of myself.
@avillanueva It's absolutely possible for a massive CAD check-in to lockup your Windchill system. We have the same problem with the scenario you described - checking in 1000's of parts from an imported STEP assembly. Check-in can be 3x slower if creating WTParts upon check-in.
Maybe this issue is specific to creating WTParts upon check-in of CAD?
Here's some related product ideas to vote:
Have you tried doing just an upload first followed by a checkin as a separate action?
Just thinking about isolating the problem.
At this point, the field has told me that the numbers I was describing should have been no issue. There is no limit which should have froze the database. I can see what @lhoogeveen is describing where that transaction would take much longer and possible slow down the system. That's not what we saw. It appears that after the check in was complete (since I saw publishing jobs) the anomaly occurred. That freeze was a complete lock up of the database, all method server transactions stacked up and paused. Preliminary information from the DBA was correct REDO header on one of the redo logs. We contacted Oracle and were able to clean the archive logs. Whether this was a symptom, cause, trigger or result is still under investigation but we did not see repeats of the 742 error after the fixes. System is back to operation (said same thing Monday too).