Worker process 'Fails to Start'

cjohnstone · ‎May 10, 2012

All -

I saw some old posts on this question, but no responses. Our publish monitor shows all our ProductView jobs are failing. They started failing yesterday, so I reloaded/restarted the process and rebooted the worker. This morning, everything was running successfully for a while, and now they are failing again. I went to restart the process, and its status shows 'Fails to Start'. What does this mean (besides the obvious)? Any ideas what happened?

Thanks in advance -

Chrystal Johnstone

Red Dot Corporation

BrianGeary · ‎May 10, 2012

We see this from time to time -- it can be caused by many things, but
basically it means that the xtop.exe (Wildfire) isn't starting on the
worker. Many times when this happens, you'll end up with multiple
workermonitor.exe and workerhelper.exe processes on the worker machine.

To clear the "Fails to start" so that you can start your workers, hit the
"reload" icon on the Worker Agent Administration utility. You might also
want to restart the GS Worker Daemon service on the worker and make sure
all xtop.exe, workermonitor.exe and workerhelper.exe processes are closed
before starting the workers.
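
A quick way to confirm that none of the three processes named above is still running is to filter the output of the Windows `tasklist` command. The following is a minimal sketch, assuming Python is installed on the worker machine; the function names are made up for illustration and are not part of any PTC tooling.

```python
import csv
import io
import subprocess

# Image names mentioned in this thread; adjust if your worker uses others.
WORKER_IMAGES = {"xtop.exe", "workermonitor.exe", "workerhelper.exe"}


def find_stray_processes(tasklist_csv):
    """Return (image_name, pid) pairs for worker-related processes.

    `tasklist_csv` is the text produced by `tasklist /FO CSV /NH`,
    where the first two columns are the image name and the PID.
    """
    found = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[0].lower() in WORKER_IMAGES:
            found.append((row[0], int(row[1])))
    return found


def list_stray_processes():
    """Run tasklist on the local Windows machine and filter its output."""
    output = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_stray_processes(output)
```

With the workers stopped, anything `list_stray_processes()` still returns is a leftover and can be ended by hand, for example with `taskkill /PID <pid> /T /F`.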

If all else fails, maybe it's time to reboot the worker machine (assuming
it's Windows).

Brian Geary
Java Architect
Information Technology
hermanmiller.com
616 654 8993 OFFICE
616 796 5257 MOBILE
HermanMiller

BobLehman2 · ‎May 10, 2012

Chrystal,

Brian's comments and steps are good and might fix the short-term
problem, but I think you need to dig a little deeper into the worker
logs and figure out why it is dying. I would guess that there is
a bad Pro/E model hanging up or killing xtop. The most likely cause is
that someone has a model with a ghost part or something along those
lines. In the normal Pro/E user interface they get a warning and click
on the warning box to ignore it, but the worker has no way of doing this.
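
As a starting point for that digging, the sketch below pulls out lines that look like errors from the log files in a given directory. It assumes Python is available on the worker; the keyword list and the `*.log` pattern are guesses, not a documented log format, so tune them to what your logs actually contain.

```python
from pathlib import Path

# Assumed keywords; the real worker logs may use different wording.
KEYWORDS = ("error", "fail", "exception")


def suspicious_lines(log_dir, pattern="*.log"):
    """Yield (file name, line number, text) for lines containing a keyword."""
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(word in line.lower() for word in KEYWORDS):
                    yield path.name, number, line.rstrip()
```

Pointing it at the worker's start directory and looking at the last few hits before a failure often narrows things down to one model.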

--Bob

RonThellen · ‎May 11, 2012

Chrystal,

A few thoughts.

To help ensure that the worker doesn't have issues with models failing while they load in Pro/ENGINEER, add the following lines to a config.pro file (in either the <proe_loadpoint>\text folder or in the worker's start dir):

! The following config options are needed for all publishing queues.
! They ensure that drawings sent to the publishing queues display correctly.

plot_names no
display_thick_cables yes
display_axes no
display_coord_sys no
display_planes no
display_points no

! The following config options will help prevent retrieval failures.

dm_auto_conflict_resolution yes
dm_checkout_on_the_fly continue
freeze_failed_assy_comp yes
multiple_skeletons_allowed yes
web_browser_homepage about:blank
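
To check whether an existing config.pro already carries these settings, something like the following could be used. It is a minimal sketch, assuming Python is available and that config.pro lines are "option value" pairs with `!` starting a comment, as in the listing above; treating option names as case-insensitive is my assumption, and the function names are made up for illustration.

```python
# Options and values recommended in the post above.
RECOMMENDED = {
    "plot_names": "no",
    "display_thick_cables": "yes",
    "display_axes": "no",
    "display_coord_sys": "no",
    "display_planes": "no",
    "display_points": "no",
    "dm_auto_conflict_resolution": "yes",
    "dm_checkout_on_the_fly": "continue",
    "freeze_failed_assy_comp": "yes",
    "multiple_skeletons_allowed": "yes",
    "web_browser_homepage": "about:blank",
}


def parse_config_pro(text):
    """Parse config.pro text into {option: value}, skipping '!' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        parts = line.split(None, 1)
        # Assumption: option names are compared case-insensitively.
        name = parts[0].lower()
        options[name] = parts[1].strip() if len(parts) > 1 else ""
    return options


def missing_or_different(text):
    """Return {option: (current value or None, recommended value)}."""
    current = parse_config_pro(text)
    return {
        name: (current.get(name), wanted)
        for name, wanted in RECOMMENDED.items()
        if current.get(name, "").lower() != wanted.lower()
    }
```

An empty result from `missing_or_different(open(path).read())` means every option in the list is already set as recommended.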

Another thing I ran into just recently: I was cleaning up the worker's start dir on the graphics servers, purging files (trail files, log files, etc.) and removing unneeded files (.crc files, etc.). In my haste I inadvertently deleted the protk.dat file. Whenever I tried to start the workers, they reported "Fails to Start" and the xtop process would never start. Fortunately, there is a backup copy in a subfolder (pview.sav) under the worker's start dir. I copied the file to the start dir and things went back to normal.
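
That recovery step can be scripted so it is easy to repeat. The sketch below assumes Python is available on the worker and that the backup sits at `pview.sav\protk.dat` under the start dir, as described above; `restore_protk` is a made-up helper name.

```python
import shutil
from pathlib import Path


def restore_protk(start_dir, backup_subdir="pview.sav"):
    """Copy protk.dat back from the backup subfolder if it is missing.

    Returns True if a copy was made, False if the file was already there.
    Raises FileNotFoundError if no backup copy exists either.
    """
    start = Path(start_dir)
    target = start / "protk.dat"
    if target.exists():
        return False
    backup = start / backup_subdir / "protk.dat"
    if not backup.is_file():
        raise FileNotFoundError(f"no backup found at {backup}")
    shutil.copy2(backup, target)
    return True
```

It never overwrites an existing protk.dat, so it is safe to run before every worker start.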

One final thought, although I don't believe it would prevent the worker from starting. It is possible that the cache is corrupted in the worker's start dir. If you suspect that, you can simply delete the ".wf" folder and one will be recreated the next time the worker is restarted.

Ron Thellen
CAD Administrator
Engineering & Manufacturing Applications
Government Communications Systems Division
Harris Corporation
321-729-7502

DanielReid · ‎May 11, 2012

Something else that could be happening is cross-talk or corruption in
the worker temp spaces or workspaces on the publisher server.

I now have each Worker (we have 6) working within its own temp space.
They were originally set up with their own .wf and share folders with
the Windchill server. Each also now has its own startup
batch file in the ProE loadpoint bin folder. This helps identify
(Process Explorer helps here) which xtop belongs to which worker.

Each "proeworker.bat" script also now contains a set of commands to
empty the temp space and workspace (.wf) at each start.

So, all this has reduced the frequency of "Fails to Start". However,
as I think Bob said, some models are just too weird to handle, so
we still get hangs. The nice thing about the above setup is that
I can turn off just the problem worker, go kill its xtop process tree,
and then restart it without disturbing the other workers (that are often
busy with their own jobs).

Here's an excerpt of the changes to the proeworker.bat:

set PTC_WF_ROOT=C:\ptc\pubprod1\.wf

if "%HOMEDRIVE%"=="" goto set_user_home

if "%HOMEPATH%"=="" goto set_user_home

goto user_home_set

:set_user_home

set HOMEDRIVE=C:

set HOMEPATH=\ptc\pubprod1

set TEMP=%HOMEDRIVE%%HOMEPATH%\temp

set TMP=%HOMEDRIVE%%HOMEPATH%\temp

rd /s /q %HOMEDRIVE%%HOMEPATH%\temp

md %HOMEDRIVE%%HOMEPATH%\temp

rd /s /q %HOMEDRIVE%%HOMEPATH%\.wf

:user_home_set

set PVIEW_HOME=C:\ptc\productview_adapters

Here's what is in the individual worker ProE startup batch (in this case
proeprod1.bat):

@echo off

rd /s /q c:\ptc\worker\prod1

md c:\ptc\worker\prod1

C:\ptc\proewf\bin\proe.exe C:\ptc\proewf\bin\proeprod1.psf %*

So the folder for this worker in the shared space between Windchill and
Publisher is cleared at each start of ProE.

I'm afraid I haven't documented this yet except for my notes in my SVN
repository. Feel free to ask questions.

-Daniel

Sriram_Rammohan · ‎May 18, 2012

I would also suggest a few more things.

First, try telnetting from the master node to the CAD box.

Second, try to communicate via telnet and check whether it is reflected in statusLogs.

Third, check the alias names in the hosts file.

I suspect the worker configuration was not set up properly.
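
If telnet is not installed on the master node, the same reachability test can be done with a plain TCP connection attempt. This is a minimal sketch, assuming Python is available there; `can_connect` is a made-up helper, and the host name and port in the usage note are placeholders that must be replaced with the values from your own worker configuration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For example, `can_connect("cadworker01", 601)` (both values hypothetical) returning False points at a name resolution, firewall, or daemon problem rather than at Pro/E itself.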

Sriram Rammohan

v_mhala · ‎May 22, 2012

There is a point in Bob's suggestions. The logs will give you the root cause of the issue. Try uploading or studying the worker, helper and monitor .log files (found in the CAD Worker setup directory) to find a permanent fix.

I am guessing the configuration is correct, because if it were incorrect there would be no publishing at all. In your case, some jobs are publishing correctly.

Does Pro/E on the CAD worker have a dedicated license? Make sure you are not running short of licenses as well.

Regards,

~ Vaibhav
