Reducing the risk of Enterprise Down situations in Windchill
Have you ever had an issue in Windchill for which you had to raise an Enterprise Down (EDOWN) case with PTC Technical Support?
An EDOWN situation is the equivalent of an emergency room visit: the Windchill server is down or unresponsive, productivity is crippled and the user community is impatiently waiting for updates on what’s going on and when the server will be back up. It’s a pretty stressful environment.
Having a support contract and the possibility of raising EDOWN cases is good to fall back on, but it’s similar to having doctors and dentists around: it’s comforting to know that they are there, but we’d rather not need their services in the first place.
This is what this post is about.
General preventative measures
We started doing analyses of Enterprise Down cases on a quarterly basis to get a better understanding of the underlying causes and see if we could work out some common practices that specifically target these causes. Product improvements are one aspect that we are continuously working on (the Internet of Things opens up interesting possibilities), but there are actions that you can take right now to safeguard your server from some of the more common causes of EDOWN situations:
Take regular backups – Daily incrementals are what you would normally strive for. We do occasionally get cases where production server backups are old or non-existent and a catastrophic hardware failure has led to data loss. Needless to say, this is beyond repair, with possibly months or years of lost work as a consequence. Information on Windchill backup strategies can be found in the PTC Windchill Backup and Recovery Planning Technical Brief.
Configure mail notifications – Windchill itself and some of its third-party components have built-in monitoring that can send out e-mail alerts when server performance indicators start drifting outside of their comfort zones. However, some configuration is needed for the alerts to be sent out. The configuration steps are described here. If you get a monitoring alert that you are unsure about, search in our Knowledge Base for information on the alert and what actions might be required in response to it. If no information is found, open a case with Technical Support.
Set up a test server and use it – Any change to a Windchill server, no matter how small, can have unexpected side effects. If you apply the change to the test server first, there is a good chance that any adverse side effects will reveal themselves there, so that you know about them before applying the change to the production system. It might not seem worth the extra cost and hassle to maintain a test server, but it makes troubleshooting so much easier for everyone involved, so please reconsider if you don’t already have one. Other advantages include:
Troubleshooting which requires verbose logging and/or frequent restarts does not disrupt the operation of the production server
Reconfiguration for data capture does not disrupt the production system. For example, profiling with the Windchill Profiler is greatly simplified with a single Method Server, which can be easily configured on a test server.
Testing of potential fixes can be done without interfering with the production server.
Monitor your server – Use PSM if possible, or the out-of-the-box Site > Utilities > Server Status page. As a server administrator, keep the page (or an overview dashboard if using PSM) on a screen and check it regularly. This will make you familiar with the day-to-day load cycles on the server, including how user activity rises and falls over the day and when background activities usually kick off. Knowing the normal pattern makes it easier to spot, early on, unusual patterns that may indicate budding problems.
[Screenshot: Server Status page]
[Screenshot: System Health dashboard in PSM]
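As a small illustration of the backup advice above, here is a minimal sketch of a check that flags old or missing backups. It is not a PTC tool and uses no Windchill API: it only assumes that your backup job writes its files into a single directory. The directory path, the 26-hour threshold and the function names are all hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: Path,
                      now: Optional[datetime] = None) -> Optional[timedelta]:
    """Return the age of the most recently modified file in backup_dir.

    Returns None if the directory does not exist or contains no files.
    """
    now = now or datetime.now(timezone.utc)
    if not backup_dir.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    newest = datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
    return now - newest


def backup_is_stale(backup_dir: Path,
                    max_age: timedelta = timedelta(hours=26),
                    now: Optional[datetime] = None) -> bool:
    """Treat a missing backup, or one older than max_age, as stale.

    The 26-hour default is an assumption: a daily job plus some slack.
    """
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age


if __name__ == "__main__":
    # Hypothetical location; replace with wherever your backup job writes.
    backup_dir = Path("/backups/windchill")
    if backup_is_stale(backup_dir):
        print(f"WARNING: no recent backup found in {backup_dir}")
    else:
        print(f"OK: newest backup is {newest_backup_age(backup_dir)} old")
```

Run from a scheduler, a check like this only tells you that a recent file exists. It does not prove that the backup is complete or restorable, so it complements rather than replaces the practices in the Technical Brief.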
This was a brief overview of common measures that can be taken to avoid some EDOWNs. Some of this may seem basic and plain common sense to you; if so, excellent, hold on to that mindset. Nevertheless, we see a significant portion of EDOWN cases that could have been prevented with these measures, which is why they were covered here.
On a final note, there is a new set of articles that outline the most common technical areas where EDOWNs occur and contain information including:
Informative articles and resources
Links to articles for the most common EDOWN issues
The main article is CS202168 - How to avoid common Enterprise Down issues in Windchill; other related articles are linked from that one. These will be reviewed on a quarterly basis to ensure that they reflect the most recent EDOWN analysis results.
Thank you for your time and, as always, comments and feedback are greatly appreciated.