Tuesday, 30 May 2017

Performance visualisation tools

Update


Amir has very kindly contributed a Vagrant box configuration to enable non-Linux users to work with the tools contained in the grav repository.

Thanks Amir!




In my last post, I looked at annotating Flamegraphs with contextual information in order to filter on an interesting subset of the data. One of the things that stuck with me was the idea of using SVGs to render data generated on a server running in headless mode.
Traditionally, I have recorded profiles and traces on remote servers, then pulled the data back to my workstation to filter, aggregate and plot. The scripts used to do this data-wrangling tend to be one-shot affairs, and I've probably lost many useful utilities over the years. I am increasingly coming around to the idea of building the rendering into the server-side script, as it forces me to think about how I want to interpret the data, and also gives the ability to deploy and serve such monitoring from a whole fleet of servers.
Partly to address this, and partly because experimentation is fun, I've been working on some new visualisations of the same ilk as Flamegraphs.
These tools are available in the grav repository.

 

Scheduler Profile

The scheduler profile tool can be used to indicate whether your application's threads are getting enough time on CPU, or whether there is resource contention at play.
In an ideal scenario, your application threads will only ever yield the CPU to one of the kernel's helper threads, such as ksoftirqd, and then only sparingly. Running the scheduling-profile script will record scheduler events in the kernel and will determine the state of your application threads at the point they are pre-empted and replaced with another process.
Threads will tend to be in one of three states when pre-empted by the scheduler:
  • Runnable - happily doing work, moved off the CPU due to scheduler quantum expiry (we want to minimise this)
  • Blocked on I/O - otherwise known as 'uninterruptible sleep'; still an interesting signal, but expected for threads handling I/O
  • Sleeping - voluntarily yielding the CPU, perhaps due to waiting on a lock
There are a number of other states, which will not be covered here.
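To give a feel for the raw data involved, here is a hand-rolled sketch using standard perf tooling - not the grav script itself; <pid> is a placeholder and the exact perf script field layout can vary between versions - which captures the scheduler's switch events and tallies the state of the outgoing thread:

# record context-switch events for the target process for 30 seconds
$ perf record -e sched:sched_switch -p <pid> -- sleep 30

# count how often each thread (prev_pid) was switched out in each state
# (R = runnable, D = uninterruptible sleep, S = sleeping)
$ perf script | \
    sed -n 's/.*prev_pid=\([0-9]*\).*prev_state=\([A-Za-z+|]*\).*/\1 \2/p' | \
    sort | uniq -c | sort -rn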
Once the profile is collected, these states are rendered as a bar chart for each thread in your application. The examples here are from a JVM-based service, but the approach will work just as well for other runtimes, albeit without the mapping of pid to thread name.
The bar chart shows the proportion of states encountered per thread as the OS scheduler swapped out the application thread. The sleeping state is marked green (the CPU was intentionally yielded), and the runnable state is marked red (the program thread was pre-empted while it still had useful work to do).
Let's take a look at an initial example running on my 4-core laptop:


Scheduling profile

This profile is taken from a simple Dropwizard application; the threads actually processing inbound requests are prefixed with 'dw-'. We can see that these request-processing threads were ready to yield the CPU (i.e. entering the sleep state) about 30% of the time, but they were mostly attempting to do useful work when they were moved off the CPU. This is a hint that the application is constrained on CPU.
This effect is magnified because I'm running a desktop OS, the application, and a load-generator all on the same laptop, but it will still be present on a larger system.
This can be a useful signal that these threads would benefit from their own dedicated pool of CPUs. Further work is needed to annotate the chart with those processes that were switched in by the scheduler - i.e. the processes that are contending with the application for CPU resource.
Using a combination of kernel tuning and thread-pinning, it should be possible to ensure that the application threads are only very rarely pre-empted by essential kernel threads. More details on how to go about achieving this can be found in previous posts.
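As a rough illustration of the sort of tuning meant here (a sketch only - the CPU numbers and thread id are illustrative, and the details are covered in the earlier posts):

# reserve CPUs 2 and 3 at boot so the scheduler keeps ordinary tasks off them
# (added to the kernel command line via the bootloader configuration)
isolcpus=2,3

# pin a critical application thread (illustrative tid 7745) onto isolated CPU 2
$ taskset -cp 2 7745

# confirm the new affinity
$ taskset -cp 7745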

 

CPU Tenancy

One of the operating system's responsibilities is to allocate resources to processes that require CPU. In modern multi-core systems, the scheduler must move runnable threads to otherwise idle CPUs to try to maximise system resource usage.
A good example of this is network packet handling. When a network packet is received, it is (by default) processed by the CPU that handles the network card's interrupt. The kernel may then decide to migrate any task that is waiting for data to arrive (e.g. a thread blocked on a socket read) to the receiving CPU, since the packet data is more likely to be available in the CPU's cache.
While we can generally rely on the OS to do a good job of this for us, we may wish to force this cache-locality by having a network-handling thread on an adjacent CPU to the interrupt handler. Such a set-up would mean that the network-handling thread would always be close to the data, without the overhead and jitter introduced by actually running on a CPU responsible for handling interrupts.
This is a common configuration in low-latency applications in the finance industry.
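A minimal sketch of that kind of set-up, assuming an interface called eth0, an IRQ number of 41 and a network-handling thread with tid 7750 (all illustrative values):

# find the IRQ(s) assigned to the network interface
$ grep eth0 /proc/interrupts

# steer that IRQ onto CPU 2 (requires root)
$ echo 2 > /proc/irq/41/smp_affinity_list

# pin the network-handling thread onto the adjacent CPU 3
$ taskset -cp 3 7750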
The perf-cpu-tenancy script can be used to build a picture showing how the scheduler allocates CPU to your application threads. In the example below, the threads named dw- are the message-processing threads, and it is clear that they are mostly executed on CPU 2. This correlates with the network card setup on the machine running the application - the IRQ of the network card is associated with CPU 2.


CPU tenancy by thread

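A rough way to gather similar per-thread CPU-tenancy data with standard tooling (a sketch rather than the perf-cpu-tenancy implementation; recording system-wide ensures the CPU number is present in each sample):

# sample all CPUs at 99Hz for 30 seconds
$ perf record -F 99 -a -- sleep 30

# count samples per (thread name, tid, cpu) for the dw- threads
$ perf script -F comm,tid,cpu | grep 'dw-' | sort | uniq -c | sort -rn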

 

Further work

To make the scheduling-profile tool more useful, I intend to annotate the runnable state portion of the bar chart with a further breakdown detailing the incoming processes that kicked application threads off-CPU.
This will provide enough information to direct system-tuning efforts to ensure an application has the best chance possible to get CPU-time when required.
If you've read this far, perhaps you're interested in contributing?

Monday, 27 March 2017

Named thread flamegraphs

Update


The utility described in this post has been moved to the grav repository; see "Performance visualisation tools" for more details.


After watching a great talk by Nitsan Wakart at this year's QCon London, I started playing around with flamegraphs a little more.
To get the gist of Nitsan's talk, you can read his blog post Java Flame Graphs Introduction: Fire For Everyone!.
The important thing to take away is that the collapsed stack files that are processed by Brendan Gregg's FlameGraph scripts are just text files, and so can be filtered, hacked, and modified using your favourite methods for such activities (your author is a fan of grep and awk).
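For example, each line of a collapsed stack file is a semicolon-separated list of frames followed by a sample count, so ordinary text tools can slice it however you like (file names below are illustrative, and the awk assumes frame names contain no spaces):

# keep only stacks that passed through logback, then re-render the flamegraph
$ grep 'logback' collapsed-stacks.txt | ./flamegraph.pl > logback-only.svg

# sum the samples attributed to each root frame
$ awk '{ split($1, frames, ";"); counts[frames[1]] += $NF }
       END { for (f in counts) print counts[f], f }' collapsed-stacks.txt | sort -rn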
The examples in this post are generated using a fork of the excellent perf-map-agent. I've added a couple of scripts to make these examples easier.
To follow the examples, clone this repository.

Aggregate view

One feature demonstrated in Nitsan's talk was the ability to collapse the stacks by Java thread. Usually, flamegraphs show all samples aggregated into one view. With thread-level detail, though, we can start to see where different threads are spending their time.
This can be useful information when exploring the performance of unfamiliar systems.
Let's see what difference this can make to an initial analysis. For this example, we're going to look at a very simple microservice built on Dropwizard. The service does virtually nothing except echo a response, so we wouldn't expect the business logic to show up in the trace. Here we are primarily interested in looking at what the framework is doing during request processing.
We'll take an initial recording without a thread breakdown:
$ source ./etc/options-without-threads.sh

$ ./bin/perf-java-flames 7731
You can view the rendered svg file here: flamegraph-aggregate-stacks.svg.


Aggregated stacks

From the aggregated view we can see that most of the time is spent in framework code (jetty, jersey, etc), a smaller proportion is in log-handling (logback), and the rest is spent on garbage collection.

Thread breakdown

Making the same recording, but this time with the stacks assigned to their respective threads, we see much more detail.
$ source ./etc/options-with-threads.sh

$ ./bin/perf-java-flames 7602
In flamegraph-thread-stacks.svg we can immediately see that we have five threads doing most of the work of handling the HTTP requests; they all have very similar profiles, so we can reason that these are probably threads belonging to a generic request-handler pool.


Thread stacks

We can also see another thread that spends most of its time writing log messages to disk. From this, we can reason that the logging framework has a single thread for draining log messages from a queue - something that was more difficult to see in the aggregated view.
Now, we have made some assumptions in the statements above, so is there anything we can do to validate them?

Annotating thread names

With the addition of a simple script to replace thread IDs with thread names, we can gain a better understanding of thread responsibilities within the application.
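The script itself lives in the repository, but the rough idea (a sketch, not the exact implementation) is that jstack prints each thread's name alongside its native thread id ('nid', in hex), which can be matched against the thread ids appearing in the collapsed stacks:

# build a "decimal-tid thread-name" mapping from a running JVM (illustrative pid 8513)
$ jstack 8513 | \
    sed -n 's/^"\([^"]*\)".*nid=\(0x[0-9a-f]*\).*/\2 \1/p' | \
    while read hex name; do printf '%d %s\n' "$hex" "$name"; done > tid-to-name.txt

The resulting map can then be used to substitute names for ids in the collapsed stack file before it is rendered.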
Let's capture another trace:
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-flames 8513
Although flamegraph-named-thread-stacks.svg looks identical when zoomed-out, it contains one more very useful piece of context.


Named stacks

Rolling the mouse over the base of the image shows that the five similar stacks are all from threads named "dw-XX", giving a little more evidence that these are Dropwizard handler threads.
There are a couple of narrower stacks named "dw-XX-acceptor"; these are the threads that manage incoming connections before they are handed off to the request processing thread pool.
Further along is a "metrics-csv-reporter" thread, whose responsibility is to write out performance metrics at regular intervals.
The logging framework thread is now more obvious when we can see its name is "AsyncAppender-Worker-async-console-appender".
With nothing more than an external script, we can now infer the following about our application:
  • this application has a request-handling thread-pool
  • fronted by an acceptor thread
  • logging is performed asynchronously
  • metrics reporting is enabled
This kind of overview of system architecture would be much harder to piece together by just reading the framework code.

Filtering

Now that we have this extra context in place, it is a simple matter to filter the flamegraphs down to a finer focus.
The perf-thread-grep script operates on the result of a previous call to perf-java-flames (as seen above).
Suppose we just wanted to look at what the JIT compiler threads were doing?
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-grep 8513 "Compiler"
flamegraph-compiler.svg


Compiler threads

or to focus in on threads that called into any logging functions?
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-grep 8513 "logback"
flamegraph-logback.svg


Logging functions

Summary

Annotating flamegraphs with java thread names can offer insight into how an application's processing resources are configured. We can use this extra context to easily zoom in on certain functionality.
This technique is particularly powerful when exploring unfamiliar applications with large numbers of threads, whose function may not be immediately obvious.
My learned correspondent Nitsan has suggested that I'm being lazy by using jstack to generate the thread-name mapping. His main complaint is that it causes a safe-point pause in the running application. To make these scripts a little more lightweight, I will explore retrieving the thread names via JVMTI or another low-level interface. But that's a blog post for another day.