Short form: The CMS garbage collector appears to be failing to collect an ever-increasing amount of garbage; eventually, our JVM fills up, and the application becomes unresponsive. Forcing a GC via an external tool (JConsole or jmap -histo:live
) cleans it up once.
UPDATE: The problem appears to be related to the JTop plugin for JConsole; if we don't run JConsole, or run it without the JTop plugin, the behavior goes away.
(Technical notes: we're running Sun JDK 1.6.0_07, 32-bit, on a Linux 2.6.9 box. Upgrading the JDK version is not really an option, unless there's an unavoidable, major reason. Also, our system is not hooked up to an Internet-accessible machine, so screenshots of JConsole, etc aren't an option.)
We're currently running our JVM with the following flags:
-server -Xms3072m -Xmx3072m -XX:NewSize=512m -XX:MaxNewSize=512m
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=70
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+DisableExplicitGC
Observing the memory graph in JConsole, there's a full GC that runs every ~15 minutes or so during the first several hours of our application's lifespan; after each full GC, there's more and more memory still in use. After a few hours, the system hits a steady state where there's approximately 2GB of used memory in the CMS old gen.
Which sounds like a classic memory leak, except that if we use any tool that forces a full GC (hitting the "collect garbage" button in JConsole, or running jmap -histo:live
, etc), the old gen suddenly drops to ~500MB used, and our application becomes responsive again for the next several hours (during which time the same pattern continues - after each full GC, more and more of the old gen is full.)
One thing of note: in JConsole, the reported ConcurrentMarkSweep GC count will stay at 0 until we force a GC with jconsole/jmap/etc.
Using jmap -histo
and jmap -histo:live
in sequence, I am able to determine that the apparently uncollected objects consist of:
- several million
HashMap
s and arrays ofHashMap$Entry
(in a 1:1 ratio) - several million
Vector
s and Object arrays (1:1 ratio, and about the same as the number of HashMaps) - several million
HashSet
,Hashtable
, andcom.sun.jmx.remote.util.OrderClassLoader
s, as well as arrays ofHashtable$Entry
(about the same number of each; about half as many as the HashMaps and Vectors)
There are some excerpts from the GC output below; my interpretation of them appears to be that the CMS GC is getting aborted without failing over to the stop-the-world GC. Am I misinterpreting this output somehow? Is there something that would cause that?
During the normal runtime, the CMS GC output blocks look about like this:
36301.827: [GC [1 CMS-initial-mark: 1856321K(2621330K)] 1879456K(3093312K), 1.7634200 secs] [Times: user=0.17 sys=0.00, real=0.18 secs]
36303.638: [CMS-concurrent-mark-start]
36314.903: [CMS-concurrent-mark: 7.804/11.264 secs] [Times: user=2.13 sys=0.06, real=1.13 secs]
36314.903: [CMS-concurrent-preclean-start]
36314.963: [CMS-concurrent-preclean: 0.037/0.060 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
36314.963: [CMS-concurrent-abortable-preclean-start]
36315.195: [GC 36315.195: [ParNew: 428092K->40832K(471872K), 1.1705760 secs] 2284414K->1897153K(3093312K), 1.1710560 secs] [Times: user=0.13 sys=0.02, real=0.12 secs]
CMS: abort preclean due to time 36320.059: [CMS-concurrent-abortable-preclean: 0.844/5.095 secs] [Times: user=0.74 sys=0.05, real=0.51 secs]
36320.062: [GC[YG occupancy: 146166 K (471872 K)]36320.062: [Rescan (parallel), 1.54078550 secs]3632开发者_如何学Python1.603: [weak refs processing, 0.0042640 secs] [1 CMS-remark: 1856321K(2621440K)] 2002488K(3093312K), 1.5456150 secs] [Times: user=0.18 sys=0.03, real=0.15 secs]
36321.608: [CMS-concurrent-sweep-start]
36324.650: [CMS-concurrent-sweep: 2.686/3.042 secs] [Times: uesr=0.66 sys=0.02, real=0.30 secs]
36324.651: [CMS-concurrent-reset-start]
36324.700: [CMS-concurrent-reset: 0.050/0.050 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
and that's it; the next line will be the next ParNew GC.
When we force a GC using jmap -histo:live, we instead get:
48004.088: [CMS-concurrent-mark: 8.012/8.647 secs] [Times: user=1.15 sys=0.02, real=0.87 secs]
(concurrent mode interrupted)
followed by ~125 lines of the form below: (some GeneratedMethodAccessor, some GeneratedSerializationConstructorAccessor, some GeneratedConstructorAccessor, etc)
[Unloading class sun.reflect.GeneratedMethodAccessor3]
followed by:
: 1911295K->562232K(2621440K), 15.6886180 secs] 2366440K->562232K(3093312K), [CMS Perm: 52729K->51864K(65536K)], 15.6892270 secs] [Times: user=1.55 sys=0.01, real=1.57 secs]
Thanks in advance!
com.sun.jmx.remote.util.OrderClassLoader is used in the remote'ing layer for JMX and a quick review of the code suggests they're created as part of the unmarshalling process for remote requests inside of the JVM. The lifetime of those classloaders will be directly related to the lifetime of the thing that was unmarshalled such that once there are no longer any references to that thing the classloader could be released.
I wouldn't be surprised if in this case the presence of these instances was a direct result of you using JConsole to examine the goings on in the JVM. And it looks like they'd just be cleaned up by GC as part of normal operation.
I guess it's possible there's a bug in the JMX implementation (seems unlikely in a relatively up-to-date JVM) or perhaps you have some custom MBeans or are using some custom JMX tools that are causing the problem. But ultimately, I'm suspecting the OrderClassLoader is probably a red-herring and the issue lies elsewhere (broken GC or some other leak).
Technical notes: we're running Sun JDK 1.6.0_07, 32-bit, on a Linux 2.6.9 box. Upgrading the JDK version is not really an option, unless there's an unavoidable, major reason.
Several newer Java versions have had updates to the CMS garbage collector. Notably 6u12, 6u14, and 6u18.
I'm not an expert with GC stuff, but I'm guessing the preclean fixes in 6u14 may fix the issue you're seeing. Of course, I could say the same thing about 6u18's class unloading bugs. Like I said, I'm not an expert at GC stuff.
There are fixes for:
- 6u10: (affects 6u4+) CMS never clears referents when -XX:+ParallelRefProcEnabled
- 6u12: CMS: Incorrect encoding of overflown object arrays during concurrent precleaning
- 6u12: CMS: Incorrect overflow handling when using parallel concurrent marking
- 6u14: CMS: assertion failure "is_cms_thread == Thread::current()->is_ConcurrentGC_thread()"
- 6u14: CMS: Need CMSInitiatingPermOccupancyFraction for perm, divorcing from CMSInitiatingOccupancyFraction
- 6u14: CMS assert: _concurrent_iteration_safe_limit update missed
- 6u14: CMS: Incorrect overflow handling during precleaning of Reference lists
- 6u14: SIGSEGV or (!is_null(v),"oop value can never be zero") assertion when running with CMS and COOPs
- 6u14: CMS: Livelock in CompactibleFreeListSpace::block_size().
- 6u14: Make CMS work with compressed oops
- 6u18: CMS: core dump with -XX:+UseCompressedOops
- 6u18: CMS: bugs related to class unloading
- 6u18: CMS: ReduceInitialCardMarks unsafe in the presence of cms precleaning
- 6u18: [Regression] -XX:NewRatio with -XX:+UseConcMarkSweepGC causes fatal error
- 6u20: card marks can be deferred too long
In addition to all of the above, 6u14 also introduced the G1 garbage collector, although it is still in testing. G1 is intended to replace CMS in Java 7.
G1 can be used in Java 6u14 and newer with the following command-line switches:
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
I would start with something far simpler, like:
-server -Xms3072m -Xmx3072m -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
And see if this meets your needs.
It looks like you are building objects which point back to their owners ( A points to B points to A ). This results in the reference counts remaining greater than zero, so the garbage collector can't clean them up. You need to break the cycle when you release them. Nullifying the reference in either A or B will solve the problem. This works even in larger reference looks like ( A -> B -> C -> D -> A). Vectors and object arrays may be used by your HashMaps.
The presense of the remote loaders may indicate a failure to cleanup and close references to objects loaded via JNDI or other remote access method.
EDIT: I took a second look at your last line. You may want to increase the perm allocation.
精彩评论