
Workflow and Queue Performance - Part 1

vaillan
5-Regular Member


This post is the first in a planned series on how to deal with performance problems related to queue processing, with an emphasis on workflow performance.

 

Even though the name 'Workflow' suggests doing something, workflow processes are more like containers that direct queue entries and hold the results when the entries are processed.  Queue entries are executed on one of two types of queues: 'processing' queues, which aim to execute an entry as quickly as possible, and 'schedule' queues, which execute an entry at a determined point in the future.

 

Processing queues are first in, first out (FIFO) queues.  They execute one entry at a time in the order received: the oldest "READY" entry is the next to process.  A processing queue also will not allow another entry to start until the currently executing entry completes.  This model of operating isn't a problem when a queue entry takes a fraction of a second to complete, but when an entry takes minutes or longer it blocks the waiting entries from executing, which can result in users not getting a task as expected.  Long-running expressions, or set states in a system with a large number of class-based sync robots, are common reasons why a processing queue may not process as quickly as it should.
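A rough way to see this ordering at the database level: assuming the queueEntry table carries a codec5 status column and the standard createStampA2 timestamp (column names may vary by release, so treat this as a sketch), the oldest READY entries per queue can be listed with:

```sql
-- Approximates the FIFO selection order: oldest READY entry first.
-- codec5 and the ida3a5 -> ida2a2 join are from the queries later in
-- this post; createStampA2 is an assumption based on typical schemas.
select pq.name, qe.ida2a2, qe.createStampA2
from queueEntry qe, processingQueue pq
where qe.ida3a5 = pq.ida2a2
  and qe.codec5 = 'READY'
order by pq.name, qe.createStampA2;
```

The first row returned for each queue is, roughly, what that queue's thread will pick up next.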

 

A schedule queue differs from a FIFO in that it doesn't act on a first in, first out basis; it works more like a calendar function, where at a specific time in the future the schedule entry needs to "fire" and do something.  Overdue notifications and timer robots are examples of workflow actions which would be processed in a schedule queue.  When editing a workflow template, any time a time to wait is entered you are creating an entry to be executed in the schedule queue.

 

A 'Queue' is made up of several components:

 

1) Its definition in either the processingQueue or ScheduleQueue database table, which defines how the queue operates.  A SQL statement to see some of the attributes for all of the processing queues in the system is:

 

set linesize 200

col NAME for a20

col EXECUTIONHOST for a20

col QUEUESTATE for a20

select name, ENABLED, EXCEPTIONRETRIES, QUEUESTATE, RUNNING, SUSPENDDURATION, EXECUTIONHOST from processingQueue;

 

For a list of the schedule queues, change "from processingQueue" in the above SQL to "from ScheduleQueue".
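Putting that substitution together, the schedule-queue version of the query looks like this (same caveat as above: attribute names are from the processing-queue query and are assumed to match):

```sql
set linesize 200
col NAME for a20
col QUEUESTATE for a20
select name, ENABLED, EXCEPTIONRETRIES, QUEUESTATE, RUNNING, SUSPENDDURATION
from ScheduleQueue;
```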

 

 

The same list of queue attributes can also be viewed in the queue manager.

 

The most important attribute from a performance perspective is "suspendDuration".  This parameter is the default number of seconds a queue will stop processing entries when a queue entry generates an error.  In some releases the default time can be set as high as 120 seconds.  According to the developer who set this default, the two-minute time was to give the administrator time to fix the problem.  I pointed out to the developer that detecting, diagnosing, and correcting a failed queue entry in two minutes wasn't going to happen, and that having a queue suspend itself, especially if a series of entries failed in a short period of time, would give the appearance of a hung queue.

Related to the suspend duration is the exception retries setting.  Some types of queue errors are resource contention problems: after waiting a couple of seconds, re-trying the entry will allow it to process.  Unfortunately, for other failed entries no number of retries will matter.  It's hard for an administrator to know which type of exception they experienced, but in general, deadlock and Oracle exceptions are most likely transient, while exceptions like ClassCast or "object no longer exists" probably aren't going to be fixed by waiting.  Setting exception retries and suspend duration to something low is generally a good idea.  Values between 2 and 5 may be reasonable; exception retries of 3 and a suspend duration of 5 are something I've recommended in the past.
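These values are normally changed through the Queue Manager UI, but as an illustration of what the change amounts to at the database level, the recommended settings map onto the EXCEPTIONRETRIES and SUSPENDDURATION columns shown in the earlier query.  This is a sketch only; editing Windchill tables directly is generally not supported, and the method servers would need a restart to pick up the change:

```sql
-- Illustration only: retries of 3 and a 5-second suspend duration
-- for every processing queue. Prefer the Queue Manager UI in practice.
update processingQueue
   set EXCEPTIONRETRIES = 3,
       SUSPENDDURATION  = 5;
commit;
```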

 

 

2) The second component of a queue is a Background Method Server (BGMS) thread to process entries.  On startup of the BGMS the queue definition is read into memory and defines the parameters of the Java thread created to process the entries for that queue.  Queues can run on multiple background method servers, but each queue name must be unique across all BGMSs.  For example, it's not possible to have a queue named WfProcessingQueue2 running on two different background method servers.
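To see which host each queue is assigned to, the EXECUTIONHOST attribute on the queue definition can be selected directly (assuming the column is present in your release, as in the column list earlier in this post):

```sql
set linesize 200
col NAME for a30
col EXECUTIONHOST for a20
select name, EXECUTIONHOST
from processingQueue
order by EXECUTIONHOST, name;
```

Grouping the output by host gives a quick picture of how queue work is distributed across the background method servers.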

 

3) Queue entries are the third and final component of a queue.  A workflow queue is notified when an entry is placed into it, and there is also a queue polling mechanism as a backup, which polls the queueEntry table every 60 seconds to make sure an entry hasn't been missed.  In my experience, though, the queue missing entries to process isn't something which happens.  A queueEntry can be generated either by a user action, like completing a task, or automatically by another step in the workflow.  One of the most common problems with workflows is a failure of a queue entry to process.  The SQL below shows an ordered summary of the failures which have built up over time.

 

select codec5, name, count(*) cnt from queueEntry, processingQueue where queueEntry.ida3a5 = processingQueue.ida2a2 group by codec5, name order by 3, 1;

 

Sometimes administrators think that failed or severe queue entries should be removed or deleted from the queue.  NO!  Failed entries are potential work which may or may not have been done.  If the related task or workflow has been closed, then it's OK to remove the entry.  Similarly, resetting a severe or failed entry to a READY state to process isn't a good idea either.  Queue entries fail for all sorts of reasons, but at its core a queue entry is a business task which needs to be completed, wrapped inside a queue processing jacket.  In some cases the workflow-related part can complete (like completing a task) but for some reason the queue processing part of the entry fails, leading to a severe queue entry.  Setting the queue entry to READY will re-execute the business task, which can result in earlier steps in the workflow "waking up".  Unless it is understood what the queue entry was supposed to do and what the current state of things is, neither delete the failed queue entry nor set it to READY.

 

There are two ways to deal with failed queue entries.  One is to use the QueueAnalyzer utility; this utility will generate a CSV file of all processing queue entries which can be imported into Excel for analysis.  There are a lot of instructions in the above link, but interpreting the output really comes down to this: if the workflow or task is complete, the entry is probably safe to remove.  Other conditions need to be evaluated on a case-by-case basis.  The second option is to use the Site > Utilities > Workflow Process Administration capabilities to drill into problem workflows/queue entries.

 

The next post will cover using the workflow reporting tool to spot and diagnose problems, along with details on how sync robots, the number one trouble spot with workflows and queues, impact the performance and stability of a BGMS.

 

Lastly, one problem which I see regularly, particularly in large systems or ones which have a large number of custom queues, is a lack of DB connections.  In most method servers a maximum of 15 or 20 connections is typical, but in a BGMS this few connections can lead to problems depending on the concurrency of work being done in the queues.  In some systems setting the max DB connections to 50 or more is necessary.  I'd use the wt.method.MethodContextMonitor.stats.summary statistics to see if the maximum number of active contexts is exceeding the available connections.  If the BGMS is configured with 4 GB+ of heap it might be worthwhile setting the max DB connections to 50 to prevent problems, too.
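For reference, the pool size is raised in the BGMS's properties file.  A minimal sketch, assuming wt.pom.maxDbConnections is still the governing property in your release (verify the name and an appropriate value against your installation before applying):

```properties
# db.properties fragment for the background method server.
# wt.pom.maxDbConnections is assumed to be the pool-size property;
# confirm the exact name for your Windchill release.
wt.pom.maxDbConnections=50
```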
