JVM Tuning

These are some notes about JVM tuning, in particular memory and garbage collection settings for various services and clusters.

DRAFT, WIP

Overview

There are three garbage collectors to worry about. If you are running with defaults on Java 7 or 8, you are using the Parallel collector, which is optimized for throughput rather than latency. CMS is optimized for latency with smallish heap sizes, G1 for latency with large heap sizes (6GB or larger).

These notes are all about latency, presuming that's the big issue we all have. Latency is caused by the garbage collector pausing all application threads while it goes about various phases of its collection.

Start by tuning the gc you have. Set -XX:MaxGCPauseMillis (see Ergonomics), see what happens. If that fails, turn on logging (see Garbage collection logging).
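
For example, a first pass might look something like this (the jar name and pause goal are made up, purely to show the syntax):

  java -XX:MaxGCPauseMillis=200 -jar myservice.jar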

Why tune the garbage collector

Tuning the garbage collector can help reduce GC-related pauses or increase application throughput (the ratio between time doing actual work and time doing garbage collection).

Tuning the garbage collector will usually not help to improve badly behaving applications. In a lot of cases, when an application is freezing on GC, or suddenly spending too much time in GC, the garbage collection is a symptom of the issue, but not the cause. The best tuning of GC will not help if the application allocates heap at a higher rate than can reasonably be collected.

How Garbage collection works

Garbage collectors expect there to be lots of new objects allocated which can be collected soon, and relatively few long-lived objects which stick around. Objects are put in a new-objects area first, then eventually moved to the long-lived objects area. Minor GCs clear out the new-objects space; full GCs also clean up the long-lived objects space. G1 also has a mixed collection, where it collects both young and old spaces at the same time. A few objects are considered permanent (classes, methods) and go in their own space.
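
A toy sketch of what that looks like in code (invented class and values, just to illustrate the generational idea): the scratch buffer below dies almost immediately and is cheap to collect from the young space, while the cache entries survive many collections and end up promoted to the old space.

  import java.util.ArrayList;
  import java.util.List;

  public class GenerationsDemo {
      // Long-lived: survives many GC cycles, ends up in OldGen.
      static final List<byte[]> CACHE = new ArrayList<>();

      public static void main(String[] args) {
          for (int i = 0; i < 1_000_000; i++) {
              // Short-lived: garbage by the next iteration, collected from YoungGen.
              byte[] scratch = new byte[1024];
              if (i % 100_000 == 0) {
                  CACHE.add(scratch); // a few objects get promoted
              }
          }
          System.out.println("cached blocks: " + CACHE.size());
      }
  }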

Click "Expand" to read about any of the following more in depth:

Blindingly fast overview of Java garbage collection

Eden/YoungGen/OldGen/PermGen/MetaSpace, the soundbite version:

The way Java garbage collectors work is based on two assumptions:

  1. most objects are short-lived
  2. objects that have stuck around for awhile are likely to stick around even longer
  3. (three weapons! our three chief weapons are...) long-lived objects generally don't reference short-lived ones

So objects at creation time are plopped into an area for new objects (typically called "Eden"), then moved to a "survivor" area after the first round of GC. Those two areas comprise the "YoungGen" area. Objects are moved from the survivor area to 'OldGen' after a later round. Things that stick around for the lifetime of the app, such as class and method metadata, live in their own space; for openJDK 7 this is PermGen. Note that OpenJDK 8 does not have PermGen; it has Metaspace (not part of the heap, it's native memory!) for class metadata.

In general, the various garbage collectors are set up so that tossing objects out of the Eden/YoungGen area for newish objects is fast. Once objects wind up in OldGen, removing them is slow, but we don't expect to have unused objects in there very often. It is possible for the old objects space to simply fill up, leaving no more room for objects to be moved in; if that happens too often, some tuning is needed.

The process of moving objects from a 'younger objects' area to an 'older objects' one is called 'promotion'.

Parallel GC for minor gcs

ParallelGC (also known as ParallelScavenge) PAUSES all app threads while it runs. It is a 'mark-copy' algorithm; that is, it marks all live objects in the young objects ("Eden") area and the in-use "survivor" area; it then copies all these objects to the empty survivor area, moving objects to the OldGen space if there is no room left or if they have survived a certain number of garbage collection cycles and can be considered long-lived.

At the end of the collection, the formerly in use survivor area will be empty and the Eden area should be close to empty.

If there is not enough room in the OldGen space for promoted objects, a full GC will be initiated.

The 'parallel' in the name refers to the use of multiple threads for the collection, as opposed to the old 'SerialGC' which used one thread.

CMS for minor gcs

The CMS collector uses the same approach and algorithms as described above for the ParallelGC.

G1 for minor gcs

G1 (Garbage First) PAUSES all app threads during a YoungGen collection. It uses a variant of mark-copy. Almost all of its work is done by parallel threads. It uses many small regions, each of which is labelled as "Eden", "survivor", or "OldGen". G1 keeps a list of references into each region and will collect the regions with the least live data first, hence its name. First, references to all live objects in Eden or the used survivor spaces are marked, and information about their regions recorded; then live objects are moved from Eden regions to empty survivor regions. If not enough room is available, or an object has survived enough garbage collection cycles, it will be moved into an OldGen region. If not enough room is available for this, a full garbage collection will be invoked.

Parallel GC for full gcs

ParallelOldGC uses multiple threads for garbage collection, in contrast with the original SerialGC which ran with one thread. It is intended for high throughput apps.

Phases in a full gc (app threads are PAUSED for all phases):

  • Phase 1: divide OldGen into fixed sized regions
  • Phase 2: "mark": divide all live objects directly reachable from code, among gc threads
  • Phase 3: "mark" (continued): all live objects marked in parallel
  • Phase 4: "summary": determine a location in each region beyond which there's enough unused objects to make compaction worth it
  • Phase 5: "compaction": for each region, do compaction, leaving a contiguous free space (done in parallel)

CMS for full gcs

CMS improves on ParallelOldGC by running some stages concurrently with the application threads. It is intended for low latency apps with a smallish heap size.

Phases in a full gc:

  • Phase 1: "initial mark": PAUSE all app threads, mark live objects directly reachable from code (relatively fast)
  • Phase 2: "concurrent mark": mark all live objects transitively reachable from the above set, concurrently with the app threads (takes some time)
  • Phase 3: "remark": PAUSE all app threads, check all objects modified during phase 2 and re-mark as needed (faster than Phase 2 but slower than Phase 1)
  • Phase 4: reclaim all unmarked objects

The phases done with paused application threads are handled by one gc thread. The other phases may be done with more than one thread running concurrently with the application; the server needs to have capacity to handle this.

Note that no defragmentation of the freed areas is done; allocation is handled by walking through free lists. Adjacent freed objects are, however, combined into one freed area.

Full gcs are triggered by the collector before OldGen is full, early enough that it will not fill while the gc pass runs. If it does fill, this would require a pause until the gc completes, which we want to avoid.

G1 for full gcs

G1, slated to become the default for Java 9 and later, is intended for low latency apps that need a large heap.

Phases:

  • Phase 1: "initial mark": PAUSE all app threads, mark all live objects directly reachable from app code (done at same time as a YoungGen gc)
  • Phase 2: "root region scan": mark all references to OldGen objects from the objects above in the "survivor" regions
  • Phase 3: "concurrent mark": mark all references to live objects in entire heap
  • Phase 4: "remark": PAUSE all app threads, check any objects that changed during the above, check unvisited live objects, etc.
  • Phase 5: "cleanup": PAUSE all app threads, identify free regions, add to free list

G1 does away with the compaction (defrag) phase of the Parallel GC, since all of its data lives in small regions that are relabelled as data of different ages is moved around; no defragmentation of a large area is needed. If your heap is small this won't make much difference; if your heap is large, then avoiding defrag is noticeable in performance. Downside: an object which is over half the size of one of the many small regions in the heap will be hard to handle and will cause slowdowns. These are called "humongous objects" and you want to avoid them. If you have to, resize your regions, or restructure your code to avoid such objects.
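
For example (hypothetical heap and region sizes, purely to show the flags), if GC logs show frequent humongous allocations you could enlarge the regions so that typical large objects fit in less than half a region:

  java -XX:+UseG1GC -Xms8g -Xmx8g -XX:G1HeapRegionSize=16m -jar myservice.jar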

Some quick and dirty benchmarks illustrating the latency/throughput tradeoffs are in this blogpost, from late 2013.

First tuning options

Make sure you're using the right version of the JVM (see the -server option). If you see app latency issues, start by asking the gc to do most of the work in reducing latency; see ergonomics and the MaxGCPauseMillis option.

Click "Expand" to read about any of the following more in depth:

The -server option

If you're running anything but a short-lived desktop app, you want the -server option. Startup time is longer because more compile-time optimizations are done; executions are correspondingly faster, see [1]. HOWEVER you likely don't even need to bother to pass this flag; it's enabled by default on boxes with 2GB or more of RAM and 2 or more physical processors. If you want to check the defaults your java has on your version of openjdk, you can run java -version and look for something like OpenJDK 64-Bit Server VM in the output.

Ergonomics/MaxGCPauseMillis

These garbage collectors are all set up so that you can configure the heck out of them, or set one or two options only and let the JVM try to figure out the rest. This 'figure out the rest' process is called "Ergonomics" and here's how it works.

You can set either -XX:MaxGCPauseMillis or -XX:GCTimeRatio or both. The first option is the number of milliseconds you would like the GC to pause, at a maximum; this is called the "maximum pause time goal". The second option is the ratio (1:n, where n is the value you set) for garbage collection against application run time; this is called the "throughput goal". The garbage collector will then adjust heap and youngen and oldgen sizes and kick off full gcs as it sees fit, to try to match these goals, with priority the max pause time goal first, if specified.
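
As an illustration (the values are arbitrary), the following asks for pauses of at most 100ms and for no more than 1/(1+19) = 5% of total time to be spent in GC:

  java -XX:MaxGCPauseMillis=100 -XX:GCTimeRatio=19 -jar myservice.jar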

The above options are supported for the Parallel GCs, CMS and G1. (I looked at the openjdk 7 code.) Having said that, I don't know how useful the GCTimeRatio option is in the CMS or G1 collectors; these are intended to optimize latency at the expense of throughput, so setting those might not buy you very much.

Things that go wrong

See if there are too many or too few minor or full gcs, whether the space for new objects is too large or too small compared to the space for old objects, whether new objects are being promoted to old objects too slowly or too fast, and whether the space for permanent objects is filling up because of dynamic classes. Tune accordingly; if that fails, try CMS for small heaps or G1 for large ones, with just the -XX:MaxGCPauseMillis option again, only resorting to fine tuning if you have to.

If you have to change heap size settings, bear in mind the following.

Setting both minimum and maximum heap size to prevent the JVM from allocating and deallocating memory from the OS is still seen in documentation, but actually only made sense with JVM < 1.5. You want compressed ordinary object pointers (they save space), and they are the default for heap sizes under 32GB, so don't specify more than that. For determining the exact safe spot, see the docs.
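
One quick way to check whether compressed oops are still in effect for a given maximum heap size (the heap size below is just an example):

  java -Xmx28g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

If the flag shows as true, you are under the threshold.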

Click "Expand" to read about any of the following more in depth:

Too few/too many minor gc cycles

Too few:

In general, fewer is better. But if minor gc cycles start taking too long while the app threads are frozen, that's too few collections. Since minor collections happen when the YoungGen space is filled and new objects cannot be allocated, shrinking the YoungGen size will force minor collections more often, at the expense of throughput. (FIXME add some figures for 'too long'? add the option(s)?)

Too many:

If your minor garbage collection cycles are using up too much of the cpu, you can expand the YoungGen space to reduce the frequency of collections. This is a matter of trial and error until you find the right balance between frequency and app thread pause time for any one collection cycle. (FIXME add the option(s)?)

Note: my current understanding is that garbage collection has a mostly linear complexity (O(n)). This means that the amortized complexity is mostly independent of the region size.

Too few/too many full gc cycles

Too few:

As with minor gc cycles, fewer is generally better. But if your full gcs start taking too long, that will impact the responsiveness of your apps. You can shrink the OldGen area in order to force more frequent collections on fewer objects, thus taking less time. (FIXME add example and option here?)

Too many:

As with minor gc cycles, too many gcs means a loss of throughput, the cpu spending more time on collections than you would like. You can increase the OldGen space size, which will permit more long-lived objects to be stored for longer until a full garbage collection is needed. But see also below (YoungGen space too small, YoungGen objects promoted too fast).

YoungGen space too small/too large

Too small:

If there isn't enough space for new/young objects in the YoungGen given how the app allocates and frees them, they will be moved into OldGen space and collected from there during a full gc. This will cause OldGen churn which impacts both latency and throughput. You can tell if this is happening by looking at minor gc cycles to see if most of the space freed up from the YoungGen area is still in use by the heap; you'll need GC logging for that. You should expect to see frequent major gc cycles if this is the case.

Too large:

If the space for new/young objects is too large, the garbage collector will take too long to mark, copy or free up objects stored there on each cycle. This is not the only reason that minor gc cycles may take too long, so use caution (trial and error again) when starting to shrink the YoungGen space. See below (OldGen space too large) for other possibilities.

OldGen space too small/too large

Too small:

If the OldGen space is too small, there will not be enough room to hold long-lived objects, and so full gcs will run more often than is necessary, impacting the throughput of your app.

Too large:

If the OldGen space is too large, minor and/or full gcs may take too long to run. Minor gcs may take too long because all of the objects in the OldGen space will be scanned to see if they contain references to new/young objects; full gcs may take too long because finding objects to be freed in the OldGen space again requires a time-consuming scan.

PermGen space too small

If the PermGen space is too small, your app will OOM when there's no room there left for method or class metadata. You can see if that's the cause of OOMs by looking at the PermGen numbers in the GC logs. This shouldn't be an issue unless your app has a lot of dynamic classes going on, but hey, you never know.
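
If PermGen exhaustion really is the cause, raising the limit is the usual fix (openJDK 7 only; the sizes below are just placeholders):

  java -XX:PermSize=128m -XX:MaxPermSize=256m -jar myservice.jar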

Young objects promoted too fast/too slow

Adjusting promotion-related settings is very subtle and an ultra-fine-tuning measure. Adjusting YoungGen or OldGen (via heap) sizes is much more likely to be useful.

Too slow:

If objects are being promoted too slowly (objects that aren't freed in a very short period of time are likely to be long-lived), minor garbage collections will happen more frequently than is needed and not so much space will be reclaimed from the YoungGen area. Note that when there is no room for new objects in the YoungGen survivor space during a minor garbage collection, those objects will be promoted to the OldGen space regardless of the promotion settings.

If you use the CMS collector, you may want objects to be promoted less frequently in order to reduce fragmentation in the OldGen space. If you do this, you need to increase the YoungGen space size, so that there is enough room to hold objects longer before promotion.

Too fast:

If objects are being promoted too fast (objects that are short-lived wind up in the OldGen space), full garbage collections will happen more often than they should, as the space fills up with short-lived objects. You can check whether this is the case by comparing the space reclaimed from the YoungGen area during minor gcs, with the space reclaimed from the heap as a whole. If they are not commensurate, but the YoungGen space is never getting close to full, you can tweak the promotion parameters for the collector.
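
As a rough sketch (values invented for illustration), slowing promotion down means raising the tenuring threshold and giving YoungGen enough room to hold objects that long; adding -XX:+PrintTenuringDistribution will show whether objects are actually ageing as expected:

  java -XX:MaxTenuringThreshold=10 -XX:NewSize=2g -XX:MaxNewSize=2g -XX:+PrintTenuringDistribution -jar myservice.jar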

Garbage collector options table

  • -XX:MaxGCPauseMillis --maximum number of milliseconds the gc should pause, best effort only, for all gcs
  • -XX:GCTimeRatio --ratio (1:n, where n is the value you set) of gc to app run time, best effort only, for all gcs
  • -Xms - Set initial heap size, for all gcs
  • -Xmx - Set maximum heap size, for all gcs
  • -XX:NewSize - Set initial size of YoungGen space, for all gcs
  • -XX:MaxNewSize - Set maximum size of YoungGen space, for all gcs
  • -XX:PermSize - Set initial size of the space for permanently-kept objects (classes/methods), for openJDK 7; openJDK 8 has no PermGen for any collector, see Metaspace above
  • -XX:MaxPermSize - Set max size of the space for permanently-kept objects, for openJDK 7; openJDK 8 has no PermGen for any collector
  • -XX:InitialTenuringThreshold - Set initial number of times an object survives minor gc before being promoted to OldGen space (ParallelGC, CMS; G1?)
  • -XX:MaxTenuringThreshold - Set max number of times an object survives minor gc before being promoted to OldGen space (ParallelGC, CMS; G1?)
  • -XX:InitiatingHeapOccupancyPercent - how full the (entire) heap must be before a concurrent-type (not full!) garbage collection is kicked off (G1)
  • -XX:CMSInitiatingOccupancyFraction - what percentage of the old generation heap space must be used before the first concurrent GC is started (CMS)
  • -XX:+UseCMSInitiatingOccupancyOnly - use the CMSInitiatingOccupancyFraction value to decide when to start all CMS concurrent GC runs, not just the first one
  • -XX:+CMSClassUnloadingEnabled - Do GC on the PermGen (classes/methods) space. This is expensive (app threads freeze). (CMS)
  • -XX:G1HeapRegionSize - in G1, the heap is composed of many regions of small size, any of which may be marked for use by younger or older objects. This setting sets the size of those regions.

There are many other settings but those should cover the issues covered earlier.
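
To make the table concrete, here is what a (completely made up) combination of some of these flags might look like on a service command line:

  java -Xms2g -Xmx2g -XX:NewSize=512m -XX:MaxNewSize=512m -XX:MaxGCPauseMillis=200 -jar myservice.jar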

Our services relying on JVMs

As of this writing:

  • OpenJDK 8 is in use on restbase*, maps*, aqs*, cerium, praseodymium and xenon. The last three are Cassandra test hosts.
  • OpenJDK 7 is in use on analytics*, cobalt (gerrit), conf* (zookeeper), contint1001, druid*, elastic*, kafka*, logstash*, meitnerium (archiva), notebook*, relforge*, stat*, and wdqs* hosts.
  • OpenJDK 6 is no longer in use anywhere, thank goodness.

Garbage collectors in use for the various JVMs include CMS (Concurrent Mark Sweep), G1 (Garbage First), ParallelGC/ParallelOldGC (default for OpenJDK 7 and 8).

Which GC does my JVM use?

Each version of OpenJDK has different defaults; to find out which garbage collector your JVM is using by default for YoungGen and OldGen collections, you can run the command java -XX:+PrintFlagsFinal -version | grep Use | grep GC | grep true
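
On a stock OpenJDK 8 install the output will look roughly like this (exact columns and markers vary by version):

  bool UseParallelGC       := true        {product}
  bool UseParallelOldGC     = true        {product}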

  • On OpenJDK 7u111, the defaults are UseParallelGC for young object collection and UseParallelOldGC for old object collection.
  • On OpenJDK 8u102, the defaults are the same.
  • Expect different defaults for OpenJDK 9 (G1).

Garbage collection logging

These options should produce logs useful for determining the sorts of gc-related problems you may have.

  • -XX:+PrintGCDetails -- print out a detailed message after each garbage collection
  • -XX:+PrintGCDateStamps -- print the date and time that each garbage collection starts
  • -XX:+PrintHeapAtGC -- (which java versions is this good for?) show start and top of various parts of heap (CMS/ParallelGC only?), see next section for info
  • -XX:+PrintGCApplicationStoppedTime -- print time application spent inside safepoints (includes stops due to GC)
  • -XX:+PrintTenuringDistribution -- (is this good for all GCs?) print byte counts of objects in age groups of the survivor spaces for YoungGen (see next section for info)
  • -XX:+PrintPromotionFailure -- show information about failures to move objects from pools for younger objects to pools for older ones
  • -XX:PrintFLSStatistics=1 -- print free list stats after garbage collection (good for CMS GC, any others?)

Log rotation options (should be self-explanatory):

  • -XX:+UseGCLogFileRotation
  • -XX:NumberOfGCLogFiles
  • -XX:GCLogFileSize
  • -Xloggc:filename

Note that log rotation works by appending ".0" to the end of the current log file while all other log files are renumbered accordingly. Note also bug: T148655

Note that %p and %t can be used in the log filename as placeholder for process ID and timestamp respectively.
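
Putting several of these together, a logging setup for a service might look something like this (the log path and sizes are invented for illustration; %t expands to the JVM start timestamp):

  java -Xloggc:/var/log/myservice/gc.%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=20M -jar myservice.jar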

Click "Expand" below for more informtion on the related topic(s).

Logging output

This needs a lot more work!

Garbage collection log entries for the Parallel and CMS collectors have the same general format:

[name of gc or phase: size of region used before collection -> size of region used after collection(max size of region)] heap used before collection->heap used after collection(committed heap size), pause time for gc]

So in the below line, taken from Parallel GC logging:

[PSYoungGen: 8443392K->273948K(9106432K)] 10404765K->2235321K(12207104K), 0.0832030 secs]

the collector was in the phase "Parallel Scavenge YoungGen", 8.4GB of the younggen region was used before collection, 273MB afterwards, max used was 9.1GB, 10GB of heap was used before collection, 2.2GB afterwards, and committed size was 12.2GB. The collection phase took .083 seconds.

In this instance, pretty much all objects moved out of the younggen space were freed, rather than moved to the oldgen space. You can tell this because the space gained back from the younggen space is about the amount gained back from the heap as a whole. This is a good use of the younggen space: most objects in it don't live long and therefore are not moved into a space for long-lived objects.

Here's an example of a full gc cycle from the Parallel GC:

[Full GC (Ergonomics) [PSYoungGen: 263197K->0K(8707072K)] [ParOldGen: 2779152K->1958978K(3100672K)] 3042349K->1958978K(11807744K) [PSPermGen: 68900K->68899K(100864K)], 0.5237840 secs]

The Parallel Scavenge YoungGen phase took the younggen area use from 263MB to 0, max used was 8.7GB; the Parallel OldGen gc phase took the oldgen region use from 2.8GB to 2GB, max used 3.1GB; the Parallel Scavenge PermGen phase took the permgen region use from about 69MB to 69MB, max used was about 100MB; and the whole thing took .523 seconds.

You can see here that very little free space was gained in the PermGen area, as we expect, since objects there are presumed to live for the lifetime of the app. Rather a lot was cleaned up from OldGen; maybe that's a sign that the YoungGen area size should be increased and objects moved to OldGen somewhat later in their lifetimes.

Tuning heap size

Having a reasonable heap size is the first step to good JVM performance, usually bringing more benefits than tuning the details of the GC itself.

How to understand how much heap your application needs

Bad indicators of heap usage:

  • memory used by the JVM as reported by the OS: the JVM allocates memory from the OS, the application requests memory from the JVM. The JVM might have memory allocated from the OS that has not yet been given to the application.
  • maximum heap used: the heap usage reported by the JVM includes both live objects and objects that are ready to be collected. The main trigger for a full GC is allocation failure, which means the GC will mostly wait for the heap to be full before starting; so whatever the size of the available heap, it will eventually be filled.

Better indicator of heap usage:

  • heap used after full GC: after a full GC, only live objects are left in the heap, so this represents fairly well what the memory needs of the application are. If the application has spiky memory requirements, the actual minimum heap requirement might well be higher than that. A few other constraints also push the heap sizing above the minimum required by the application:
    • heap fragmentation
    • young / old spaces: to do its job correctly, the GC needs space to move objects from young to old space, so some overhead is required

Note: an oversized heap is usually not in itself a GC performance issue. It might still make sense not to oversize the heap, so that the memory can be used for disk cache instead.
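
One low-effort way to watch heap usage after full GCs on a running JVM (assuming the jstat tool that ships with the JDK) is to sample it periodically:

  jstat -gcutil <pid> 10s

The O column (old space utilisation, as a percentage) just after the FGC counter ticks up is a decent proxy for the live data the application actually needs.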

Examples

Too many full GC in a short time

This example is taken from one of our Gerrit issues. More details are available in task T148478.

Example graph of garbage collection issue: too many full garbage collections happening in a short time

This graph shows first a period where the GC seems to work correctly, with many hours without a full GC kicking in and a reasonable heap usage of around 2GB after full GC.

Between 10:00 and 11:00, the behaviour changes radically. We can see:

  • max heap usage after non-full GC climbs to 14GB
  • heap usage after full GC is still relatively low (between 2 and 5 GB)
  • the rate of full GC increases radically, with around 30 full GCs over 1 hour

Conjectures we can draw from these observations:

  • heap seems oversized during normal operations (from this small sample, on a larger sample, we might find times where the full heap is in use)
  • heap allocation rate increases dramatically between 10:00 and 11:00 (allocation rate is not the same as the total amount of heap used)
  • the GC mostly keeps up with the increased allocation rate (heap after GC is still going down significantly)
  • the application is most probably experiencing slow downs, not from the time it takes for a full GC but from the number of full GC
  • this is probably a case where tuning GC will not help much

Potential next steps:

  • a more precise measurement of allocation rate over time would give a better idea of whether this is a situation the GC can reasonably be expected to cope with. An allocation rate of more than 1 GB/s is suspicious; more than 1 GB/s per core is definitely not something I would expect the GC to be able to handle
  • the investigation should continue to see where this increased allocation rate comes from
  • periodic thread dumps during the issue might help understand what code is exercised

Setting max and min heap (-Xmx and -Xms)

This example is taken from some heap size tuning of our zookeeper cluster. More details are available in task T157968.

Example of the impact of setting -Xms and -Xmx on a JVM

This graph shows the change in heap usage following the setting of -Xmx and -Xms to 1g. Previously, -Xmx and -Xms were not set and relied on JVM defaults. Note that the graph is taken from our codfw cluster and that cluster does not see much traffic. We can still make observations. In particular, we can see:

  • The default max heap size (3.5G) was much larger than what is actually needed.
  • The default max heap on this configuration is 3.5G, my understanding of the documentation led me to expect 1G as a default. This underlines the importance of setting -Xmx and -Xms explicitly to avoid surprises
  • We see that heap used increases after the change, even if max heap was decreased by the change. Before the change, the JVM was only allocating enough heap to allow for the current needs of the application (see the HeapMemoryUsage_committed line). By setting -Xms we commit the full heap from the start of the application.
  • GC now occurs much less often, but collects more memory each time: longer pauses, but fewer of them (whether this is a good thing or not depends very much on the use case). We should probably reduce the heap to reduce pauses (note that in this case, all the GCs are actually young GCs and the GC time went from ~2.5ms to ~7.5ms, which is still very reasonable).

Additional reading

  • GC options for OpenJDK 7 [2] and 8 [3]
  • Tuning docs for OpenJDK 6 (has background not in other docs) [4], 8 [5]

Tools

Reading GC logs is tedious. Understanding how GC is behaving is extremely difficult without tooling and graphs. A few tools can help:

  • GCeasy: online tool, provides pointers to potential issues, nice graphs, can analyse multiple logs if uploaded in a tar.gz archive
  • jClarity Censum: desktop application, non-free license, free demo, provides pointers to potential issues, nice graphs (including graph of allocation rate)
  • IBM Pattern Modeling and Analysis Tool for Java Garbage Collector: desktop application, non-free license, free as in beer, provides fairly basic graphs and stats

TODO

  • Get folks with JVM experience to read, complain, fix.
  • Add references, or at least some doc that includes all the trouble cases I've listed