Serious Method Server stability Problem with Red H...

vaillan · ‎Jan 29, 2013

There have recently been several reports of serious intermittent stability problems resulting from long garbage collection times when method servers are running on Red Hat Linux. This problem is the result of a bug in the Linux operating system.

An example garbage collection line when this problem occurs is:

2013-01-24T16:10:26.934-0500: 6090.538: [GC6216.218: [SoftReference, 0 refs, 0.0000090 secs]6216.218: [WeakReference, 1918 refs, 0.0003090 secs]6216.219: [FinalReference, 1864 refs, 0.0035750 secs]6216.222: [PhantomReference, 3 refs, 0.0000080 secs]6216.222: [JNI Weak Reference, 0.0061630 secs] [PSYoungGen: 1039229K->134051K(1120960K)] 1693558K->838339K(3851648K), 125.6905170 secs] [Times: user=82.96 sys=1301.77, real=125.67 secs]

Observations about this line:

Garbage collection (GC) times were excessive; 125.67 seconds elapsed time in this case. Similar GC activity earlier in the file was taking fractions of a second.
- The Server Manager was unable to contact the MS during this time and it marked the MS as dead and tried to start a new one, but this method server was unable to respond to the stop command because GC was still occurring. On some machines where RAM is limited, having one extra method server running can cause swapping, further degrading performance.
There are no memory problems which would explain the long GC time.
- The heap was configured for a maximum of 8G, but the heap was only using about 4G (3851648K) which was the Xms setting.
- GC problems typically occur when more memory that the maximum configured is required. What usually happens is the system uses up to the maximum (8g in this case) and it still needs more memory to satisfy the request. The JVM then tries to perform GC collection and free more memory but with limited success. The system continues to try and free memory but with little memory actually being freed. Eventually almost all time is spent in GC and no other work is being done; the method server appears unresponsive when this happens. A common cause of this condition is not having the db.property wt.pom.queryLimit set, but there are other situations which this condition can result from also.
Memory reclaimed by the garbage collection activity was moderate; about 800M (1693558K->838339K). Based on earlier GC activity in the log file recovering this memory should have happened quickly. An earlier line from the same file which recovered similar but slightly less memory was:
- 2013-01-24T14:31:47.485-0500: 171.089: [GC171.137: [SoftReference, 0 refs, 0.0000110 secs]171.137: [WeakReference, 1154 refs, 0.0002600 secs]171.137: [FinalReference, 2078 refs, 0.0014780 secs]171.139: [PhantomReference, 0 refs, 0.0000030 secs]171.139: [JNI Weak Reference, 0.0022820 secs] [PSYoungGen: 601033K->44968K(972480K)] 1089070K->533005K(3703168K), 0.0520670 secs] [Times: user=0.60 sys=0.00, real=0.05 secs]
There was a disproportionate and excessive amount of time spent in "sys" time (kernel time) compared to 'user' time. Normal GC activity is largely recorded against user time and usually there is very little time is spent in "sys" time.
- In the 125.67 seconds of wall clock time, the CPU's spent a total of 1301.77 seconds in the kernel mode. It was a 16 cpu machine during, uring a 125 seconds there was 2000 seconds of CPU time available (125*16). This one GC was using about 70% of the available resources during this period((1300+125)/2000) very likely negatively impacting the performance of other method servers or other processes running on the same node.

See the article CS113478 for additional information and a workaround to prevent this problem.

Serious Method Server stability Problem with Red Had Linux & Observations about a Garbage Collecton Line

Serious Method Server stability Problem with Red Had Linux & Observations about a Garbage Collecton Line