rscarter rscarter - 4 years ago 170
Java Question

Ambiguities in IBM JRE gencon policy and verbose:gc output

I'm assisting a project in tuning their application server environments, and am seeing some rather confusing output in the verbose:gc logs from the IBM JRE, which I'm much less familiar with than Hotspot. In many cases, what is there appears to contradict what I've seen in documentation from the vendor. This is for an app running on Websphere, running on IBM's 1.7_64 JRE. This particular run was with the gencon GC policy, with a 4GB heap size. Here is a sample of the verbose:gc output for a global GC that is a source of numerous ambiguities which I'll detail below:

<exclusive-start id="8574" timestamp="2015-09-08T22:48:45.819" intervalms="2919.893">
<response-info timems="0.229" idlems="0.069" threads="43" lastid="0000000005FF6800" lastname="WebContainer : 37" />
<sys-start id="8575" timestamp="2015-09-08T22:48:45.820" intervalms="3028473.245" />
<cycle-start id="8576" type="global" contextid="0" timestamp="2015-09-08T22:48:45.820" intervalms="3028473.281" />
<gc-start id="8577" type="global" contextid="8576" timestamp="2015-09-08T22:48:45.820">
<mem-info id="8578" free="462493104" total="4155310080" percent="11">
<mem type="nursery" free="80236904" total="934084608" percent="8" />
<mem type="tenure" free="382256200" total="3221225472" percent="11">
<mem type="soa" free="221195336" total="3060164608" percent="7" />
<mem type="loa" free="161060864" total="161060864" percent="100" />
<remembered-set count="16780" />
<allocation-stats totalBytes="745771152" >
<allocated-bytes non-tlh="193377184" tlh="552393968" />
<largest-consumer threadName="WebContainer : 49" threadId="000000000659E600" bytes="156075608" />
<gc-op id="8579" type="mark" timems="609.025" contextid="8576" timestamp="2015-09-08T22:48:46.429">
<trace-info objectcount="9486509" scancount="7257956" scanbytes="246533016" />
<finalization candidates="1528" enqueued="411" />
<ownableSynchronizers candidates="32" cleared="0" />
<references type="soft" candidates="63260" cleared="0" enqueued="0" dynamicThreshold="30" maxThreshold="32" />
<references type="weak" candidates="27120" cleared="11" enqueued="0" />
<references type="phantom" candidates="17361" cleared="7885" enqueued="7885" />
<stringconstants candidates="153217" cleared="105" />
<gc-op id="8580" type="classunload" timems="1.114" contextid="8576" timestamp="2015-09-08T22:48:46.430">
<classunload-info classloadercandidates="4306" classloadersunloaded="26" classesunloaded="26" quiescems="0.000" setupms="0.955" scanms="0.108" postms="0.051" />
<gc-op id="8581" type="sweep" timems="18.520" contextid="8576" timestamp="2015-09-08T22:48:46.449" />
<gc-op id="8582" type="compact" timems="2209.475" contextid="8576" timestamp="2015-09-08T22:48:48.658">
<compact-info movecount="9482641" movebytes="539414400" reason="compact on aggressive collection" />
<gc-end id="8583" type="global" contextid="8576" durationms="2839.085" timestamp="2015-09-08T22:48:48.659">
<mem-info id="8584" free="3615704600" total="4155310080" percent="87">
<mem type="nursery" free="834618336" total="934084608" percent="89" />
<mem type="tenure" free="2781086264" total="3221225472" percent="86">
<mem type="soa" free="2620025400" total="3060164608" percent="85" />
<mem type="loa" free="161060864" total="161060864" percent="100" />
<pending-finalizers system="367" default="44" reference="7885" classloader="0" />
<remembered-set count="15785" />
<cycle-end id="8585" type="global" contextid="8576" timestamp="2015-09-08T22:48:48.659" />
<sys-end id="8586" timestamp="2015-09-08T22:48:48.659" />
<exclusive-end id="8587" timestamp="2015-09-08T22:48:48.659" durationms="2839.704" />

The issues include:

1) The timestamps for GC operations (marked gc-op) seem to be for when the operation finishes, not from when it starts. IBM doc implies that the timestamp is when that operation occurred ( However, when looking at the timestamps and durations for the operation (identified by timems attributes), the numbers don't add up. They do add up and make sense, however, if the timestamp is interpreted as when the operation finished, not when it started. To illustrate, in the example output above the following operations were performed, and these are their provided durations and timestamps (I've rounded the numbers):

operation - duration - timestamp
mark - 609ms - 48:46.429
classunload - 1ms - 48:46.430
sweep - 19ms - 48:46.449
compact - 2209ms - 48:48.658

As one can see, the timestamps don't make sense if they're interpreted as the start time. They do, however, if interpreted as the end time. For instance, the classunload timestamp value equals the makr timestamp plus the classunload duration (not plus the mark duration as it should be if the timestamps indicate when the operations started). Similarly, the sweep timestamp equals the classunload timestamp plus the sweep duration.

Can someone confirm if this is expected behavior, or am I simply just reading this incorrectly?

2) Another issue appears to be that the entirely of every operation in the log appears to be a full-stop-the world operation from start to finish. I expected that the scavenge GCs would all be full stops, however, all the documentation I've seen from IBM has indicated that at least part of the mark phase of the global GC should be done concurrently. However, that is also not what is occurring in the GC log entry above. In this example, it appears that everything is done as a full-stop, and that no part of the mark or either the sweep is done concurrently. Per (older) documentation at least part of the mark should be done concurrently as not all as part of the stop-the-world pause (
Is there a setting to make the global GCs in the tenured generation run part of the cycle concurrently? I know this is being done in the IBM JREs optavgpause collector as well as the Hotspot concurrent collector, neither of which are particularly new technology, so it's not clear why gencon is always forcing a full stop-the-world every time it has to do a global collection. This is important, as the 2.8 second pause indicated above is going to make us run afoul of our SLAs with the client.

3) In the global GC above it's apparent from the occupancy numbers of the nursery that it is also being collected during this operation (the nursery shows as only 8% free before the GC and 89% after). However, I'm not seeing the relevant information regarding a GC occuring in the nursery in the output. I would expect to see a scavange operation logged here. Does anyone know why this is missing? Is this because during a global GC the nursury is also collected via the same heap walking (mark-sweep-compact) approach as the tenured space is when a global GC occurs?

Answer Source

You're right in that you're seeing stop of the world collections indicated by the exclusive-start and exclusive-end covering the whole cycle. What I find interesting is that you're seeing a global gc at a time when the heap is not even fully allocated.

The reason I think is that you have sys-start and sys-end entries which seem to indicate that something in your application (or a remote GC) is triggering a System.gc() and you should get this resolved first and get a new set of gc logs.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download