Donnerstag, 27. Oktober 2016

Generational disparity in garbage collection

For the last year, I have been helping the startup Instana to create a Java agent that traces executions within a Java application. This execution data is collected and jointed to generate traces of user requests as well as the resulting communication between services within the system owner’s hemisphere. This way, unstructured communication can be visualized what significantly simplifies the operation of a distributed system that is composed of multiple interacting services.

In order to generate these traces, the Java agent rewrites all code that reads an external request or initiates one. Obviously, these entries and exits into or out of a system need to be recorded and additionally, meta data is exchanged to identify a request uniquely across systems. For example, when tracing HTTP requests, the agent adds a header containing a unique id which is then recorded by the receiving server as a proof of a request’s origin. Broadly speaking, it is similar to what Zipkin is modeling, but without requiring users to change their code.

In the most simple scenario, such tracing is straightforward to implement. Thanks to my library Byte Buddy which does the heavy lifting, all injected code is written in plain old Java and then copied to the relevant methods at runtime using the Java instrumentation API. For example, when instrumenting a servlet, we know that an entry to a JVM is made whenever the service method is invoked. We also know that the entry is completed when this very same method exits. Therefore, it suffices to add some code to the beginning and the end of the method to record any such entry into a VM process. And it has been the majority of my job to plow through the many Java libraries and frameworks to add support for their ways of communication. From Akka to Zookeeper, over the last year I have hello-worlded my way through the entire Java ecosystem; I even got to write EJBs for all the servers! And I had to make sense of Sun’s CORBA implementation. (Spoiler: There is no sense.)

Things do however quickly become more difficult when tracing asynchronous executions. If a request is received by one thread but is answered from within another thread, it does no longer suffice to only trace entries and exits. Therefore, our agent needs to also track all context switches in concurrent systems made via thread pools, fork join tasks or custom concurrency frameworks. And the same way that debugging asynchronous execution is difficult, this is quite a bit of work for us too. I think that I spend as much time dealing with concurrency as I do recording entries and exits.

The impact on garbage collection


But how does all this impact garbage collection? When implementing a performance monitor, one is facing a trade-off between interpreting the work of a Virtual Machine and causing work for this machine by doing so. While the majority of processing is done in the monitor back-end to which the agent reports its data, we have to do a minimum within the Java process that we share with the monitored application. And you can already guess it: by allocating objects, we inevitably have an impact on the VM’s garbage collection. Fortunately, modern garbage collection algorithms are doing excellent work and by mostly avoiding object allocation and by adaptively sampling our tracing efforts, the effect of our code changes is negligible for the vast majority of users. Ideally, we only burn a few unused processor cycles to do our work. As a matter of fact, very few applications do use their full processing potential and we are happy with grabbing a small portion of this excess.

Writing a garbage collection-friendly application is typically not too difficult. It is obvious that the easiest way of avoiding garbage is to avoid object allocation altogether. However, object allocation in itself isn’t too bad either. Allocating memory is a rather cheap operation and as any processor owns its own allocation buffer - a so-called TLAB - we do not impose an unnecessary synchronization when allocating only a bit of memory from our threads. If an object only lives in the scope of a method, the JVM can even erase the object allocation altogether as if the fields of the objects were put onto the stack directly. But even without this escape-analysis, short-lived objects are captured by a special garbage collection circle called the young generation collection that is processed quite efficiently. To be honest, this is where most of my objects end up as I often value code readability over the small improvements that escape analysis offers. Currently, escape analysis quickly hits its boundary. Yet, I hope for future HotSpots to improve to get the best of both worlds even without changing my code. Fingers crossed!

When writing Java programs, I do not typically think about the impact on garbage collection but the above guidelines tend to manifest in my code. For the majority of our agent, this has been working very well. We are running a whole bunch of example applications and integration tests to assure a good behavior of our agent and I also keep an eye on the GC when running examples. In our modern times, using tools like flight recorder and JIT watch, performance analysis has become quite approachable.

The relativity of short-lived


With an early version of our agent, I one day noticed an application to trigger tenured collection cycles that it did not trigger without it. As a consequence, collection pauses increased by a multitude. The objects that ended up in the tenured collection were however only objects of the monitored application itself. But since our agent runs mostly isolated from the application threads and at first, this did at first not make sense to me.

When digging deeper, I found that our analysis of user objects triggered some additional escapes of objects but the impact was minimal. The application did already produce a fair amount of objects, mostly by using NIO and by using fork join pools. One thing that the latter frameworks have in common is that they rely on the allocation of many short lived objects. For example, a fork-join task often splits itself into multiple subtasks which repeat this procedure until each task’s payload is small enough to be computed directly. Every such task is represented by a single, stateful object. An active fork join pool can spawn millions of such objects every minute. But since the tasks compute fast, the representing object is eligible for collection quickly and therefore captured by the young collector.

So how did these objects end up in the tenured collection all of a sudden? At this time, I was prototyping a new stitching instrumentation to track context switches between such fork join tasks. Following the path of a fork join tasks is not trivial. Each worker thread of a fork join pool applies work stealing and might grab tasks out of the queue of any other task. Also, tasks might provide a feedback to their parent task on completion. As a consequence, tracing the expansion and interaction of tasks is a rather complex process, also because of the existence of so-called continuation threads where a single task might bounce jobs to hundreds of threads within only a few milliseconds. I came up with a rather elegant solution which relied on the allocation of many short-lived objects which were allocated in bursts whenever backtracking a task to its origin. It turned out that these bursts triggered quite a few young collections themselves.

And this is what I did not consider: each young generation collection increases the age of any object that is not eligible for garbage collection at this point. An object does not age by time but by the amount of young collections triggered. This is not true for all collection algorithms but for many of them such as for all default collectors of HotSpot. And by triggering so many collections, the agent threads “prematurely matured” objects of the monitored application despite those objects being unrelated to the agent’s objects. In a way, running the agent “prematurely matured” the target application’s object.

Getting around the problem


I did at first not know how to solve this. In the end, there is no way of telling a garbage collector to treat “your objects” separately. As long as the agent threads were allocating shorter-lived objects at a faster rate than the host process, it would spoil the original objects into the tenured collection causing an increase of garbage collection pauses. In order to avoid this, I therefore started to pool the objects I was using. By pooling, I quickly matured my own objects into the tenured collection and the garbage collection behavior returned to its normal state. Traditionally, pooling was used to avoid the costs of allocation which became cheap in our days. I rediscovered it to erase the impact of our “foreign process” onto garbage collection for the cost of a few kilobytes of memory.

Our tracer is already pooling objects in other places. For example, we represent entries and exits as thread local values that contain a bunch of primitive values that we mutate without allocating a single object. And while such mutable, often procedural and object pooling programming is no longer fashionable, it turns out to be very performance friendly. In the end, mutating bits is closer to what a processor is actually doing. And by using preallocated arrays of a fixed size instead of immutable collections, we save us quite a few round-trips to memory while also preserving our state to be contained in only a few cache lines.

 

Is this a “real world” problem?


You might think that this is a rather specific problem that most people do not need to worry about. But as a matter of fact, the problem that I describe applies to a large number of Java applications. For example, within application containers, we typically deploy multiple applications in a single Java process. Just as in the above case, the garbage collection algorithm does not group objects by application as it has no notion of this deployment model. Therefore, object allocations by two isolated applications that share a container do interfere with the anticipated collection patterns of one another. If each application relies on its objects to die young, the sharing of a heap causes a strong relativity on the duration of short-lived.

I am not an advocate for microservices. As a matter of fact, I think they are a bad idea for most applications. In my opinion, routines that can only exist in interaction should ideally be deployed together unless there are good technical reasons not to. And even if isolated applications ease development, you quickly pay the price in operations. I am just mentioning this to avoid a misinterpretation of the moral of the above experience.

What this experience taught me was that deploying several applications in a single Java process can be a bad idea if those applications are heterogeneous. For example, when running a batch process parallel to a web server, you should consider running each in its own process rather than deploying both of them in the same container. Typically, a batch process is allocating objects at a very different rate than a web server. Yet, many enterprise frameworks still advertise all-in-one solutions for tackling such problems which should not share a process to begin with. In 2016, the overhead of an additional process is not typically a problem and since memory is cheap, rather upgrade your server instead of sharing a heap. Otherwise, you might end up with collection patterns that you did not anticipate when developing, running and testing your applications in isolation.