We had a similar experience, but our problem was that large assemblies were taking longer than the timeout set in the recipe file. We didn't want to set the timeout too high because it made the jobs queued behind a slow one wait too long. The solution was to split the long-running jobs (large assemblies) out to their own queue set and have a dedicated worker process all of those jobs. Then the timeout in the dedicated worker's recipe file could be as high as it needed to be without blocking smaller jobs.
We also put the publishing queues on their own background method server. That way, when something went wrong (all jobs got stuck in the executing state), we could just kill that bgms and not hurt anything happening on the ootb bgms.
Let me know if you want more information on either solution.
After all these years, this issue still confounds me. Tech Support has been useless on this issue. No one has yet gotten to the root cause. What's frustrating is that there are so many links in the chain that it's tough to diagnose. The only thing I do know is that restarting the Background MS, as suggested, is the only thing that remedies the issue, but it does not solve it. Users are typically unaware of the restart.
I now run multiple PDMLink 10.2 and 10.1 installations and all have the issue to some degree. I cannot queue up a scheduled republish job, since it will never complete without someone babysitting the queue. I am OK with it skipping a job and moving on to the next one, but stalling altogether needs to be solved.
Here is what I know:
At this point, since I am seeing it all over, I am open to all options. I have thought about splitting publishing off to its own BGMS and creating a worker used exclusively for thumbnails. I have played with the timeout settings, but the issue seems to remain. Is anyone else still seeing this, or has anyone solved it?
On your worker, have you updated your local hosts file with your 3 aliases in windows\system32\drivers\etc\hosts? Sounds like you have it on the server side in /etc/hosts, but the worker needs to know about these too.
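One quick way to confirm that the worker machine actually resolves the aliases is a small script run on the worker itself. This is only a minimal sketch: the alias names below are hypothetical placeholders, and it checks name resolution only, not whether the worker daemon is reachable.

```python
import socket

# Hypothetical alias names; substitute the three from your own hosts file.
ALIASES = ["wc-alias-1", "wc-alias-2", "wc-alias-3"]


def unresolved_aliases(aliases, resolve=socket.gethostbyname):
    """Return the aliases that this machine cannot resolve to an address."""
    missing = []
    for name in aliases:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Calling `unresolved_aliases(ALIASES)` on the worker returns an empty list when every alias resolves. Any name it returns is one the worker does not know about.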
Did you manually add the correct alias to proeworker.bat as -DA <alias>? Keep in mind that the proe2pv GUI tool overwrites this.
For troubleshooting hung Creo processes, Resource Monitor is a great tool because you can filter on your different worker paths. Next time, try identifying the hung job and killing its xtop.exe process, then see whether the queue is able to continue.
Yes, the hosts files and setup are all correct. As I stated, it publishes correctly 99% of the time. This is a toughie since it's random and extremely time-consuming to debug. I also can't experiment in production, as when Tech Support suggests making prop file changes to see what happens. I can't exactly reboot in the middle of the day to try that.
Replying to myself: I had the issue occur on two systems in the same day. I fixed both by restarting the BGMS. Both failed in exactly the same place:
Sep 23, 2018 11:34:58 PM: Getting Mass Properties
Sep 23, 2018 11:34:58 PM: Converting Author States
Sep 23, 2018 11:34:58 PM: Generating Output
Sep 23, 2018 11:34:58 PM: Generating thumbnail
Sep 24, 2018 12:34:58 AM:Timeout exceeded waiting for a reply from the CadAgent - Time out 3,600 seconds
Sep 24, 2018 12:34:58 AM:Asm Processing Returned: $ERROR$ Timeout exceeded waiting for a reply from the CadAgent - Time out 3,600 seconds
Sep 24, 2018 12:34:58 AM:Attempting to delete temporary workspace publish7290478894723542735tmp
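To spot where a job sat idle, the timestamps in a log like the one above can be compared line by line. The following is a rough sketch that assumes every entry starts with a timestamp in the format shown above, followed by a colon. The function name and the 10-minute default threshold are my own choices, not anything from Windchill.

```python
import re
from datetime import datetime

# Matches e.g. "Sep 24, 2018 12:34:58 AM:Timeout exceeded ..."
LINE_RE = re.compile(
    r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M):\s*(.*)$"
)
TS_FORMAT = "%b %d, %Y %I:%M:%S %p"


def find_stalls(lines, threshold_seconds=600):
    """Return (last_message_before_gap, gap_in_seconds) for each long gap."""
    stalls = []
    previous = None  # (timestamp, message) of the last parsed entry
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not start with a timestamp
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        message = match.group(2)
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if gap >= threshold_seconds:
                stalls.append((previous[1], gap))
        previous = (stamp, message)
    return stalls
```

Run against the entries above, this reports a single gap of 3,600 seconds after "Generating thumbnail", which matches the CadAgent timeout value in the error message.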
Open to anything here. Should I direct the workers not to create thumbnails and just create a separate thumbnail worker? Or would that just increase processing time?
This has been a long-term issue for me as well, across two companies, two different CAD platforms, and three different versions of Windchill. One would think such a fundamental issue could be solved. Restarting is OK and gets you going, but what is the problem to begin with? I'm commenting primarily to follow this thread in the hopes of gleaning some insight. 🙂
Swisslog Healthcare (North America)
Windchill 11.0 M030 CPS09
WGM 11.0 M030
Inventor Pro 2017
I just had an admin republish a context with 3000 jobs. It may go a few hours before stalling. Given the daily load, it really is frustrating that current requests are getting hung up. I "think" this might be another workaround: the thumbnail step should not usually take a long time, so if you see a job spinning away at the publishing-thumbnail stage (whether it has already created one or not), you can go to Queue Manager and delete the executing job. The system might then not stall, but again, it takes periodic monitoring.
Speaking about resurrecting...
I came across your discussion today when one of our customers had this exact same issue with their CAD worker. The solution in their case was this article: https://www.ptc.com/en/support/article/CS332276
The issue at their end was caused by VMware Tools. I don't know whether this helps in your case, or whether you still have the issue, but if you do, maybe it can be of some help.