Serious Method Server stability Problem with Red Had Linux & Observations about a Garbage Collecton Line
There have recently been several reports of serious intermittent stability problems resulting from long garbage collection times when method servers are running on Red Hat Linux. This problem is the result of a bug in the Linux operating system.
An example garbage collection line when this problem occurs is:
Garbage collection (GC) times were excessive; 125.67 seconds elapsed time in this case. Similar GC activity earlier in the file was taking fractions of a second.
The Server Manager was unable to contact the MS during this time and it marked the MS as dead and tried to start a new one, but this method server was unable to respond to the stop command because GC was still occurring. On some machines where RAM is limited, having one extra method server running can cause swapping, further degrading performance.
There are no memory problems which would explain the long GC time.
The heap was configured for a maximum of 8G, but the heap was only using about 4G (3851648K) which was the Xms setting.
GC problems typically occur when more memory that the maximum configured is required. What usually happens is the system uses up to the maximum (8g in this case) and it still needs more memory to satisfy the request. The JVM then tries to perform GC collection and free more memory but with limited success. The system continues to try and free memory but with little memory actually being freed. Eventually almost all time is spent in GC and no other work is being done; the method server appears unresponsive when this happens. A common cause of this condition is not having the db.property wt.pom.queryLimit set, but there are other situations which this condition can result from also.
Memory reclaimed by the garbage collection activity was moderate; about 800M (1693558K->838339K). Based on earlier GC activity in the log file recovering this memory should have happened quickly. An earlier line from the same file which recovered similar but slightly less memory was:
There was a disproportionate and excessive amount of time spent in "sys" time (kernel time) compared to 'user' time. Normal GC activity is largely recorded against user time and usually there is very little time is spent in "sys" time.
In the 125.67 seconds of wall clock time, the CPU's spent a total of 1301.77 seconds in the kernel mode. It was a 16 cpu machine during, uring a 125 seconds there was 2000 seconds of CPU time available (125*16). This one GC was using about 70% of the available resources during this period((1300+125)/2000) very likely negatively impacting the performance of other method servers or other processes running on the same node.
See the article CS113478 for additional information and a workaround to prevent this problem.