CAD Worker jobs failing

jstone-3 · ‎Feb 09, 2017

Due to previous sins, we have a number of files without viewables and so started a job to recreate virtually all of them. The problem is that it's prone to failing.

At times, it runs fine and then everything will start failing and the job logs indicate "Error, did not find available worker of type PROE". When this happens, nothing will get through. Then, after pulling my hair out for a day or so, it'll start running jobs again (from time to time I resubmit a couple to see if anything's changed).

If I go to the Worker Agent Administration window, it indicates "Fails to Start" for the PROE worker. I had luck one time just clicking on the green flag to start the job, but that only worked once. These jobs all run in PublisherQueueL. "Standard" jobs from check-ins, promotion requests, etc come through fine in either the L or the M queue while these resubmitted jobs are failing.

What would cause the PROE worker to work only when it wants to (and how do I convince it to)?

Windchill 10.2, M030.

MikeLockwood · ‎Feb 09, 2017

I've spent the last few months on a similar effort and finally understand how all the elements and interfaces related to publishing work. I had to read and re-read everything PTC has written on this a few times, then get a beer, and come back and read it again. I'm currently working on documenting it with a few diagrams. A good puzzle if there ever was one, and documented like a treasure map with only a few clues.

Key things to monitor live:

From Windchill browser

- Work Agent Admin window (red/green flags)

- WVS Job Monitor

Desktop of Worker Machine

- task manager: worker daemon, worker monitor, worker helper

- creoview_adapter setup folder: monitor logs

- creoview shared worker folder: temp folder used for the job

Database:

- derivedimage table

Multiple times I deleted all the published viewables from our production system and republished all - using various recipe and other settings. Finally have it really dialed in.

jstone-3 · ‎Feb 09, 2017

Like so many things in Windchill, there seems to be an abundance of windows and buttons to click to accomplish anything!

I do watch the Work Agent window, though as I said, clicking the green flag doesn't start anything. And the WVS Job Monitor is where I see failed jobs, resubmit them and check on Executing and Successful jobs. I also monitor by refreshing the Queue Management window.

Is that the standard Windows Task Manager where you're monitoring the worker daemon, worker monitor and worker helper? I don't see those processes in there.

areddy · ‎Jun 07, 2017

Hi Mike

Have you documented all of this? What errors are commonly encountered during publishing, and what exactly happens during publishing? Or could you briefly share what you know? It would be a great help to us.

Kind regards

Amarnath

BineshKumar1 · ‎Feb 09, 2017

This is a good read - PTC Windchill Visualization Services (WVS) and Windchill Queues

To answer your question, there are a couple of properties you can set in wvs.properties to restart the CAD agent's worker after it is marked as "Fails to start":

cadagent.startattemptretrytime=90
cadagent.maxstartattempts=5

If you want to set it specific to worker type you can do something like this

cadagent.startattemptretrytime.PROE=
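
To illustrate how a worker-type-specific entry could sit alongside the generic one, here is a minimal Python sketch that reads such properties text and looks up a setting for a given worker type. The fallback rule (type-specific value first, then the generic value) is an assumption made for illustration, not confirmed WVS behaviour, and the value 120 for the PROE entry is purely hypothetical since the post above leaves it blank. The helper names are invented for this example.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def worker_setting(props, name, worker_type):
    """Return the type-specific value if present and non-empty, else the generic one.

    ASSUMPTION: this precedence is illustrative only; check PTC's documentation
    for how WVS actually resolves the two forms.
    """
    specific = props.get(f"{name}.{worker_type}")
    if specific:
        return specific
    return props.get(name)


# Sample text; the PROE-specific value of 120 is a made-up placeholder.
sample = """
cadagent.startattemptretrytime=90
cadagent.maxstartattempts=5
cadagent.startattemptretrytime.PROE=120
"""

props = parse_properties(sample)
retry_proe = worker_setting(props, "cadagent.startattemptretrytime", "PROE")
attempts_proe = worker_setting(props, "cadagent.maxstartattempts", "PROE")
```

Here `retry_proe` picks up the PROE-specific entry, while `attempts_proe` falls back to the generic `cadagent.maxstartattempts` because no PROE-specific entry exists for it.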

jstone-3 · ‎Feb 09, 2017

Thanks, that is a good overview article. I've been reading what I can but I can't find anything about the PROE worker erratically not starting.

Before messing with something like wvs.properties, is it logical that these settings would make it work sometimes but not others? I hesitate to change such settings when things have worked fine in the past. Initially, we thought it was the sheer volume of resubmitting the entire database for viewables that was clogging it, and that by resubmitting them we'd eventually get through it all. But once it stops, it won't even do a single resubmit. While standard jobs go through fine...

BineshKumar1 · ‎Feb 09, 2017

A "Fails to start" condition occurs when the worker daemon fails to start the worker. The cause could be anything from an unresponsive GS daemon service to an unavailable CAD license or a network timeout. Once the CAD agent issues the startup command, it waits for the worker to respond within the startup time set in your worker configuration. It retries (three times, if I remember correctly), and if it still gets no response, the worker agent marks the worker as "Fails to start". That's it: the worker remains in that state until an administrator manually starts it, or until the BGMS running the CAD agent is restarted.

I think two properties were introduced in the 10.2 release to recover from the "Fails to start" state. cadagent.startattemptretrytime is the amount of time the CAD agent waits after a failure before attempting to start the worker again under the same conditions. Let's say you set cadagent.startattemptretrytime=60: the condition is cleared every 60 seconds, after which the agent will again try to use that worker. cadagent.maxstartattempts sets the number of attempts made to start the worker. If you set cadagent.maxstartattempts=5, the number of start tries is 5.

jstone-3 · ‎Feb 09, 2017

OK, so I restarted the CAD Worker machine and things are working again. In past attempts this seemed to work one time and not the other, and even the "successful" attempt seemed to be delayed so I wasn't sure that was even the fix.

I'll keep resubmitting in batches and see how it goes...

MikeLockwood · ‎Feb 09, 2017

Can generally just restart the GS Worker Daemon service (on the worker machine) instead of rebooting it.

Observe on the worker machine:

- folders where publishing occurs

- task manager processes

I'm going to make a document on this sometime soon - maybe will sell copies for $1 each.

jstone-3 · ‎Feb 09, 2017

If you can make this clear and logical, you could get a lot more than that! Maybe $2!

I expect this will fail again eventually and when it does, I'll try restarting the service rather than the whole machine.

jstone-3 · ‎Feb 10, 2017

Things conked out again an hour or so ago and the "did not find available worker of type PROE" error came back. The worker agent status was back to "Fails to start" and all jobs were failing.

Restarting the GS Worker Daemon Service, then manually stopping and starting the worker didn't work. So I restarted the machine it's on, manually stopped and started the worker (Stop all; Start all) and we're back in business.

I've been trying to weed out the jobs that either clog things up or just don't need to be there - template files, .STL, .STP, .CFG, .SEC, .MFG files, etc. I still get some that fail with "Failure to retrieve filename.prt name filename (err= -4)". Haven't figured that one out yet. It's usually an old interim iteration of a harness, or something else that I would consider non-standard, and for many of them I've just deleted the job rather than resubmitting it. I've started looking up the -4 error but haven't gotten too far yet.
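
As a rough illustration of that weeding-out step, the Python sketch below splits a list of file names into ones to skip and ones to resubmit, based only on the extensions mentioned above. The function names and the sample file names (apart from 21832_mfg.asm, which appears later in this thread) are hypothetical, and how the list of names would be exported from the WVS Job Monitor is not covered here. Template files cannot be recognised by extension alone, so they are not handled.

```python
import os

# Extensions named in the post above; compared case-insensitively.
SKIP_EXTENSIONS = {".stl", ".stp", ".cfg", ".sec", ".mfg"}


def should_skip(filename):
    """True if the file's extension is on the skip list."""
    extension = os.path.splitext(filename.lower())[1]
    return extension in SKIP_EXTENSIONS


def partition_jobs(filenames):
    """Split file names into (to_resubmit, to_skip), preserving order."""
    to_resubmit, to_skip = [], []
    for name in filenames:
        (to_skip if should_skip(name) else to_resubmit).append(name)
    return to_resubmit, to_skip


# Hypothetical sample of failed-job source files.
failed = ["bracket.prt", "fixture.STL", "21832_mfg.asm", "export.stp", "config.CFG"]
to_resubmit, to_skip = partition_jobs(failed)
```

Note that a name like 21832_mfg.asm ends in .asm, so it stays in the resubmit list; only files whose final extension is on the list are dropped.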

SergeyEfimov · ‎Feb 10, 2017

Can you show the Creo adapter log files?

jstone-3 · ‎Feb 10, 2017

Forgive me, Sergey, but where do I find those particular log files?

SergeyEfimov · ‎Feb 13, 2017

Usually a <creo_view_adapters>\proe_setup folder

jstone-3 · ‎Feb 14, 2017

Sergey,

There are various log files there, some of which I'm a bit familiar with - the worker, helper and monitor logs from the Worker Agent. I'd have to find logs from the specific jobs that failed. I'm mostly down to just the err=-4 failures, and even those have been resubmitted so I don't have specific log files to look at or share. I do have a handful that just time out and these are mostly older iterations that I'm not as concerned with.

But - I did find other ".crc" files that indicate specific issues with failed jobs, such as circular references in the models. I assume that these will continue to fail if resubmitted. So for those, I'll have to either get the model fixed or possibly decide that, for now, a missing viewable isn't a serious problem and just delete the job. I think this may also explain some of the timed-out jobs? Fixing the model now will be slightly problematic because I don't yet have a rework state off of the released state where I can do repairs and clerical corrections. That's another solution for another day, and one that's needed!

I'll keep monitoring the jobs and post again if I get another err=-4 issue. Or let me know if I'm straying from reality and/or what you wanted to see for log files.

jstone-3 · ‎Feb 14, 2017

OK, here's one that failed with the "Failure to retrieve..." and "err = -4" errors. The worker log doesn't say much except that it couldn't retrieve the file.

[2017-02-14 15:16:15] Recipe file: C:/ptc/creo_view_adapters/proe_setup/proe2pv.rcp

[2017-02-14 15:16:15] Source file: 21832_mfg.asm

[2017-02-14 15:16:15] Output file: c:/ptc/cad_temp/w1i1j1705/21832_mfg_asm.pvs

[2017-02-14 15:16:15] Registering Server : COMPLETE

[2017-02-14 15:16:17] Downloading 21832_mfg.asm : COMPLETE

[2017-02-14 15:16:17] Loading 21832_mfg.asm : [2017-02-14 15:16:18] proe2pv Error:56127: Failure to retrieve 21832_mfg.asm name 21832_mfg (err = -4)

Could this be because it's an mfg file? I've deleted a number of jobs for old .mfg files and some of these mfg.asm files seem to be acting similarly.
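
For sifting through many worker logs, a small Python sketch like the one below could pull out these "Failure to retrieve" entries. The regular expression is written against the single log line quoted above and the job-log wording quoted earlier in the thread, so the assumption that other worker logs use the same wording is untested; the function and pattern names are invented for this example.

```python
import re

# Pattern modelled on the one error line shown above; treat it as a guess
# at the general format, not a documented log specification.
RETRIEVE_ERROR = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] proe2pv Error:(?P<code>\d+): "
    r"Failure to retrieve (?P<file>\S+) name (?P<name>\S+) "
    r"\(err\s*=\s*(?P<err>-?\d+)\)"
)


def find_retrieve_failures(log_text):
    """Return one dict per 'Failure to retrieve' entry found in the log text."""
    return [match.groupdict() for match in RETRIEVE_ERROR.finditer(log_text)]


sample_log = (
    "[2017-02-14 15:16:17] Downloading 21832_mfg.asm : COMPLETE\n"
    "[2017-02-14 15:16:17] Loading 21832_mfg.asm : [2017-02-14 15:16:18] "
    "proe2pv Error:56127: Failure to retrieve 21832_mfg.asm name 21832_mfg (err = -4)\n"
)

failures = find_retrieve_failures(sample_log)
```

Each entry carries the timestamp, the proe2pv error number, the file name, and the err value as strings, which makes it easier to group failures by file or by error value across many logs.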

Here's the full job log -

SergeyEfimov · ‎Feb 15, 2017

Try opening the assembly in Creo Parametric on the server.

I recommend using C:/ptc/creo_view_adapters/proe_setup as the start directory for Creo Parametric.

It may need a license, or Creo Parametric may have stopped on a message.

byork · ‎Jun 07, 2017

So the -4 is a missing reference file of some type. Usually by opening the assembly in Creo and looking at the message log you can see what is missing.

On the fails to start issue, we were having the same problem so we implemented the recommendations of this article.

https://support.ptc.com/appserver/cs/view/solution.jsp?n=CS162756