cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
Showing results for 
Search instead for 
Did you mean: 

Community Tip - Your Friends List is a way to easily have access to the community members that you interact with the most! X

Limits on large assemblies and workspaces, possible DB issue?

avillanueva
22-Sapphire II

Limits on large assemblies and workspaces, possible DB issue?

I am still investigating but twice this week we've had issues where our DB (Oracle 21C Standard) would freeze up. All Windchill threads would pause and stack up. After 10-20 mins, the system would come alive but freeze again shortly large. We saw this Monday and this lasted for a few hours. Things were fine until this afternoon when issue returned. We were not able to determine the root issue but we know its database related. We capture ORA-00742 error in alert.log (log read detects lost write in thread string sequence string block). 

My DBA is chasing down any Oracle related bug or setting issue. I started looking into what was going on at that time. Very close to that time, a user checked in a massive 1200 component assembly new. It looks like it succeeded since I can see the jobs filled up the publishing queue. I can also see that they were working in a massive workspace on Monday as well, 2400+ objects, basically some garbage, crappy STEP import. 

My questions is these: is it possible that the large transaction swamped the system, overloaded the database to a point where it is struggling to recover? Could that impact linger for hours after the event supposedly finished? Is this too large of a job for it to handle? I am sure that people out there have bigger assemblies. But could be a combination of our scale and possible tuning that choked it? How large is just too large? I am not talking about just the size but the fact I saw it all go in new all at once. Let me know your thoughts I and I will share a post-mortem when resolved. 

ACCEPTED SOLUTION

Accepted Solutions
avillanueva
22-Sapphire II
(To:d_graham)

At this point, the field has told me that the numbers I was describing should have been no issue. There is no limit which should have froze the database. I can see what @lhoogeveen is describing where that transaction would take much longer and possible slow down the system. That's not what we saw. It appears that after the check in was complete (since I saw publishing jobs) the anomaly occurred. That freeze was a complete lock up of the database, all method server transactions stacked up and paused. Preliminary information from the DBA was correct REDO header on one of the redo logs. We contacted Oracle and were able to clean the archive logs. Whether this was a symptom, cause, trigger or result is still under investigation but we did not see repeats of the 742 error after the fixes. System is back to operation (said same thing Monday too).

View solution in original post

10 REPLIES 10
avillanueva
22-Sapphire II
(To:avillanueva)

Aside from the single assembly of 2800 components, I see that this same user has a workspace of 11293 objects. Any issue with that bringing down the house?

Performance and stability vary based on how well the system is tuned, so it is possible.  If you can't find a smoking gun in the logs (Windchill and DB), have you tried the OnDemand system scan?

Instructions:
https://www.ptc.com/en/support/article/CS299230
Upload site:

https://www.ptc.com/en/support/on-demand-system-scan/

 

I start here for general performance and stability issues.  Then I log a call with the Windchill performance team if I am still having issues.

We have multiple users with 15k+ object workspaces with no issue (other than workspace actions are slower relative to a 1k workspace). I have also seen over the years checkins of over 1k objects that did not stall the system.

 

We are currently on Oracle 19c SE2.

avillanueva
22-Sapphire II
(To:RandyJones)

Thanks, just being methodical. I had a conversation with the user and they were actually not online Monday during the event so its less likely it was a culprit. Now that I know we are no where near a ceiling, I can look elsewhere. I still have issues with huge assemblies like this for other reasons but system should not have behaved this way. Looking more like a DB related issue. Stay tuned.

TomU
23-Emerald IV
(To:avillanueva)


@avillanueva wrote:

... I can see the jobs filled up the publishing queue.


I've never had our system completely lock up, but publish-on-change (on check-in) definitely slow things down.  I need to test, but I think all the publish entries have to be generated before whatever action triggered them (check-in in this case) is considered complete.  I've seen this when running the reassign lifecycle states command-line utility.  When publish-on-change is turned on, the utility will process a couple thousand objects per hour.  When it's turned off the utility will process hundreds of thousands of objects per hour.  Might be worth seeing what impact having publish-on-change enabled during a large check-in is having.

@avillanueva ,

You're on Windchill 12.x. True?

Is Oracle 21c supported?

When I view the matrix from 12.1.2.0.0 it lists for Oracle only 19c.

 

Plus, there's this article confirming it is not supported.  Doesn't necessarily mean it won't work.  Just means it's not supported.

Article - CS364968 - Unable to install Windchill 12.1 using Oracle 21c with Error Unable to validate target Oracle database version 19c (ptc.com)

avillanueva
22-Sapphire II
(To:d_graham)

My bad. Sry, It is 19C. Getting ahead of myself.

@avillanueva It's absolutely possible for a massive CAD check-in to lockup your Windchill system. We have the same problem with the scenario you described - checking in 1000's of parts from an imported STEP assembly. Check-in can be 3x slower if creating WTParts upon check-in.

 

Maybe this issue is specific to creating WTParts upon check-in of CAD?

 

Here's some related product ideas to vote:

@avillanueva 

Have you tried doing just an upload first followed by a checkin as a separate action?

 

Just thinking about isolating the problem.

avillanueva
22-Sapphire II
(To:d_graham)

At this point, the field has told me that the numbers I was describing should have been no issue. There is no limit which should have froze the database. I can see what @lhoogeveen is describing where that transaction would take much longer and possible slow down the system. That's not what we saw. It appears that after the check in was complete (since I saw publishing jobs) the anomaly occurred. That freeze was a complete lock up of the database, all method server transactions stacked up and paused. Preliminary information from the DBA was correct REDO header on one of the redo logs. We contacted Oracle and were able to clean the archive logs. Whether this was a symptom, cause, trigger or result is still under investigation but we did not see repeats of the 742 error after the fixes. System is back to operation (said same thing Monday too).

Announcements


Top Tags