Strange GC behaviour of Scala actors application

I have an application which uses rather a lot of actors: 25,000 to be precise. It uses Scala 2.7.7 and is running on jdk6_u18. It basically listens to and processes market data and has very little state.

It starts at 8.02am every day and within the hour it has crashed with an OutOfMemoryError. "Aha" you say, "you have a memory leak!" Except that when I restart it, it never, ever crashes again for the rest of the day! This despite ramping up both GC and CPU overhead when the US markets open at 2.30pm.

A few anecdotal findings:

  • it runs on Solaris. When I used to run it on Linux, it never crashed at all.
  • I have tried messing with generation heap sizing, allocating more memory, etc. I think it makes no difference.
  • The behaviour of the collector seems to differ when I have verbose:gc switched on (the exact switches are noted just after this list).
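For concreteness, "verbose:gc switched on" means JVM switches along these lines (a sketch; the timestamp flag is inferred from the leading timestamps in the logs below):

-verbose:gc -XX:+PrintGCTimeStamps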

A few questions present themselves:

  1. Why would the behaviour of this program be different between Linux and Solaris?
  2. Why would the behaviour be different between starting it at 8.02 and starting it at 8.42?
  3. I have heard that the actors library had some memory-leak issues. What were they, when were they fixed, and how would I discover if anything "similar" might be going on here? (Things to look for in jhat etc.; the dump commands are sketched just after this list.)
  4. Does anyone have an idea as to what might be going on?
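On the jhat point in question 3, the standard JDK 6 route to getting a dump to inspect is a sketch like this (the path and pid are placeholders):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

or, from a live process:

jmap -dump:format=b,file=heap.hprof <pid>
jhat heap.hprof

jhat then serves the dump on http://localhost:7000; per-class instance counts are a reasonable first thing to look at.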

I am now trying G1 to see if that makes any difference. I will update this question tomorrow!
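(On this JDK, G1 is still experimental and has to be unlocked explicitly; for anyone following along, the switches are:)

-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC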

Some output from G1 with verbose:gc on

I think I just caught it in the act:

600.290: [Full GC 255M->144M(256M), 1.5772616 secs]
602.084: [GC pause (young) 227M->145M(256M), 0.0556769 secs]
602.418: [Full GC 254M->144M(256M), 1.6415216 secs]
604.279: [GC pause (young) 227M->145M(256M), 0.0415157 secs]
604.602: [Full GC 255M->145M(256M), 1.6041762 secs]
606.422: [GC pause (young) 227M->145M(256M), 0.0237441 secs]
606.710: [Full GC 254M->145M(256M), 1.6022185 secs]

And then a bit later (you can see the full GC has taken longer and is reclaiming less)

849.084: [Full GC 254M->176M(256M), 1.9658754 secs]
851.191: [GC pause (young) 228M->176M(256M), 0.0218611 secs]
851.414: [Full GC 254M->176M(256M), 1.9352357 secs]
853.492: [GC pause (young) 228M->176M(256M), 0.0224688 secs]
853.716: [Full GC 254M->176M(256M), 1.9339705 secs]
855.793: [GC pause (young) 228M->176M(256M), 0.0215707 secs]
856.009: [Full GC 254M->176M(256M), 1.9805797 secs]
858.137: [GC pause (young) 228M->176M(256M), 0.0223224 secs]

Some output from G1 with verbose:gc off

It's fine again! *Sigh*

303.656: [GC pause (young) 225M->93M(256M), 0.1680767 secs]
308.060: [GC pause (young) 226M->94M(256M), 0.1793724 secs]
312.746: [GC pause (young) 227M->93M(256M), 0.1674851 secs]
316.162: [GC pause (young) 227M->95M(256M), 0.1826145 secs]
320.147: [GC pause (young) 226M->94M(256M), 0.1656664 secs]
325.978: [GC pause (young) 226M->93M(256M), 0.1475760 secs]
330.176: [GC pause (young) 226M->94M(256M), 0.1727795 secs]

and much, much later it is still OK!

25882.894: [GC pause (young) 224M->125M(256M), 0.2126515 secs]
25884.880: [GC pause (young) 224M->126M(256M), 0.2059802 secs]
25887.027: [GC pause (young) 224M->125M(256M), 0.1851359 secs]
25889.940: [GC pause (young) 223M->126M(256M), 0.2046496 secs]
25891.567: [GC pause (young) 224M->126M(256M), 0.1600574 secs]

and later still, a full GC

37180.191: [GC pause (young) 225M->154M(256M), 0.1716404 secs]
37182.163: [GC pause (young) (initial-mark) 225M->153M(256M)37182.326: [GC concurrent-mark-start], 0.1622246 secs]
37183.089: [GC concurrent-mark-end, 0.7635219 sec]
37183.090: [GC remark, 0.0032547 secs]
37183.093: [GC concurrent-count-start]
37183.297: [GC concurrent-count-end, 0.2043307]
37183.393: [GC cleanup 198M->198M(256M), 0.0068127 secs]
37183.400: [GC concurrent-cleanup-start]
37183.400: [GC concurrent-cleanup-end, 0.0000393]
37183.648: [GC pause (young) 222M->153M(256M), 0.1483041 secs]
37184.235: [GC pause (partial) 171M->91M(256M), 0.2520714 secs]
37187.223: [GC pause (young) 221M->92M(256M), 0.1721220 secs]

UPDATE

Well, since switching to the G1 garbage collector on jdk1.6.0_18, the application has behaved itself on three consecutive days. I suspect Erik is correct in his analysis: in high-throughput situations the VM gets itself into a death spiral once it has promoted short-lived objects into the tenured generation.


Do you have any reason to expect your heap to slowly grow in size? It looks like in both traces it is growing. One thing I've done on many occasions is reduce my heap to try to make the problem worse. However, 256M is about the lower limit for Scala.

One thing I've noticed before is that if you have short-lived objects that make it out of the eden generation due to too much pressure, they can gradually kill you. This can happen when there's a burst of activity (maybe you have a burst in the morning?) and your eden space isn't big enough. Scala in general and actors in particular make heavy use of small short-lived objects, and it seems there's a magic threshold: once you cross it, it's all downhill. So one run will be fine and the next crashes and burns.
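A sketch of switches for confirming this (the young-generation size is illustrative, not a recommendation):

-Xmn96m -XX:SurvivorRatio=6 -XX:+PrintTenuringDistribution

If the tenuring output shows objects being promoted after only a collection or two during the morning burst, eden and the survivor spaces are too small for your allocation rate.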

Another thing I've noticed before is that the GC settings that work well on OSX/x86 often do not work well on Sparc/Solaris and vice versa. If you're on CoolThreads, I suggest you configure the GC to have one GC thread per thread in your scheduler/pool.
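For reference, the knob for that (with N set to the scheduler's thread count):

-XX:ParallelGCThreads=N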

Which brings up another thing - make sure the scheduler isn't creating new threads willy-nilly. It will do that sometimes. I'd say you should almost always manually set a cap on the threads. I don't know how relevant it is, but an interesting factoid about the fork-join scheduler that actors use by default is that it is intended for short, CPU bound tasks. Doing IO in the threads it manages screws up its assumptions. Of course it should still work, but...
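A sketch of the cap, assuming Scala 2.7's default scheduler (which, if I remember right, reads these system properties when the actor subsystem starts, so they must be set before the first actor is created; 8 is an illustrative value):

// must run before any actor starts; or pass as -D flags on the command line
System.setProperty("actors.corePoolSize", "8")
System.setProperty("actors.maxPoolSize", "8")  // hard cap on pool growth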

Good luck! I've lost many, many days of my life to such problems.

Take a look at some of the options here: http://java.sun.com/performance/reference/whitepapers/tuning.html

It looks like you are using the concurrent-mark-sweep collector. Try setting:

-XX:+UseParallelGC


At this point, I wonder if it would be worth trying to replace your actors with Lift's or Akka's? Though I think the crash is unlikely to be a bug in the actors library itself, an alternative implementation might not tickle whatever it is that causes it.


From my point of view the heap layout is wrong. The old space varies from 93M at the beginning to 176M under high load, and from the logs your allocation rate is about 50MB/sec on average. So once the old space reaches 176M of a 256M heap, a ~2-second full GC leaves you almost no young space in which to create new objects. My advice (a flag sketch follows this list):

  • Check the survivor space settings: the old space growing to 176M shrinks the new space, and that is obviously not intentional.
  • Set NewSize explicitly to the desired value, say 128M.
  • Increase the overall heap size so objects can be promoted instead of triggering full GCs.
  • You have very long pauses (on the order of 50 ms for ~80MB young collections, ~150 ms for ~120MB). Try CMS instead; I believe it will be faster (but do benchmarks).
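A sketch of flags along those lines (sizes illustrative; benchmark before adopting):

-Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+UseConcMarkSweepGC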