Supporting Flame Graphs on production kernels


Perf is an amazing tool for observing system performance on Linux. Using perf on production kernels, however, is full of pitfalls, because new features are added to it at such a rapid pace. In my case, I support a production kernel team that expects every feature they read about on the web to work on their older production kernel. A good example of a downstream use of perf is Brendan Gregg’s very nice Flame Graphs tool, which visualizes the most frequently used code paths in a system.

Example mysql Flame Graph


Recording call frame information with perf

Generating Flame Graphs depends on perf capturing call frames. As documented with the Flame Graph tools, one records perf data on an x86-64 system by enabling DWARF call graph support with a command line like:

$ perf record -F 99 -a --call-graph dwarf -- sleep 60

That, of course, produces the raw perf.data file. The call frames we need are in there, but we still have to process the data with a reporting tool.

Problems generating Flame Graphs

This is where we start running into problems with our production kernel. In our case, we are on a 4.1 kernel. Users are happily running perf report and seeing complete call frame information for all the system components under observation. The interesting thing is that if we generate a Flame Graph from this same data, users lose visibility into the complete call tree: the Flame Graph simply shows time spent in a given library. So what’s wrong? Let’s take a look at how Flame Graphs are generated:

$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > out.svg
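
To make the middle step less of a black box, here is a minimal sketch of what "folding" does to perf script output. It is not the real fold script from the Flame Graph tools, which handles many more cases. The name fold_stacks is hypothetical, and the input layout (a header line containing a timestamp, then indented "address symbol (dso)" frame lines listed leaf first, then a blank line) is an assumption about typical perf script output.

```shell
# Sketch only: a tiny stand-in for the fold step, not the real tool.
# Assumed input: a header line containing a "seconds.micros:" timestamp,
# then indented "address symbol (dso)" frame lines (leaf first), then a blank line.
fold_stacks() {
    awk '
        # Count the finished sample under "comm;root;...;leaf".
        function emit() {
            if (comm != "") counts[comm stack]++
            comm = ""; stack = ""
        }
        # Header line: remember the command name (assumed to contain no spaces).
        /[0-9]+\.[0-9]+:/ { emit(); comm = $1; next }
        # Frame line: strip the address and the trailing "(dso)", then
        # prepend the symbol so the root frame ends up first.
        NF >= 2 && $1 ~ /^[0-9a-f]+$/ {
            sym = $0
            sub(/^[ \t]*[0-9a-f]+[ \t]+/, "", sym)
            sub(/[ \t]+\([^()]*\)[ \t]*$/, "", sym)
            stack = ";" sym stack
            next
        }
        # Blank line: the sample is complete.
        NF == 0 { emit() }
        END { emit(); for (k in counts) print k, counts[k] }
    ' "$@" | sort
}
```

Run it as `fold_stacks out.perf` or `perf script | fold_stacks`. If every folded line carries only a frame or two after the command name, the resulting graph has nothing to stack, which matches the flat picture described above.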

The key here is that we are no longer parsing the perf data with perf report. Instead, perf script does the heavy lifting and feeds its output into the Flame Graph generation tools. A bit of git detective work shows that perf report has been setting up the callchain record mode needed for DWARF unwinding all the way back since 3.18:

$ git describe --contains 0cdccac6fe4b1316f04f0dbfcc4efab51932014a
$ git log -1 -p 0cdccac6fe4b1316f04f0dbfcc4efab51932014a
commit 0cdccac6fe4b1316f04f0dbfcc4efab51932014a
Author: Namhyung Kim <[email protected]>
Date:   Mon Oct 6 09:45:59 2014 +0900

    perf report: Set callchain_param.record_mode for future use

    Normally the callchain_param.record_mode is used only for record path.
    But as it might need to prepare something for dwarf unwinding, setup
    this info for perf report too.

    Signed-off-by: Namhyung Kim <[email protected]>
    Acked-by: Jiri Olsa <[email protected]>
    Cc: David Ahern <[email protected]>
    Cc: Frederic Weisbecker <[email protected]>
    Cc: Ingo Molnar <[email protected]>
    Cc: Jean Pihet <[email protected]>
    Cc: Jiri Olsa <[email protected]>
    Cc: Namhyung Kim <[email protected]>
    Cc: Paul Mackerras <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Link:[email protected]
    Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 2cfc4b93..140a6cd 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -257,6 +257,13 @@ static int report__setup_sample_type(struct report *rep)

+       if (symbol_conf.use_callchain || symbol_conf.cumulate_callchain) {
+               if ((sample_type & PERF_SAMPLE_REGS_USER) &&
+                   (sample_type & PERF_SAMPLE_STACK_USER))
+                       callchain_param.record_mode = CALLCHAIN_DWARF;
+               else
+                       callchain_param.record_mode = CALLCHAIN_FP;
+       }
        return 0;

diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 96adb73..fc25e57 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -9,6 +9,7 @@
 #include "perf_regs.h"
 #include "map.h"
 #include "thread.h"
+#include "callchain.h"

 static int mmap_handler(struct perf_tool *tool __maybe_unused,
                        union perf_event *event,
@@ -120,6 +121,8 @@ int test__dwarf_unwind(void)
                return -1;

+       callchain_param.record_mode = CALLCHAIN_DWARF;
        if (init_live_machine(machine)) {
                pr_err("Could not init machine\n");
                goto out;
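
The same question can also be asked the other way around: does a given release tag already contain a given commit? The helper below is a small sketch of that check using git merge-base --is-ancestor. The function name tag_contains_commit is hypothetical, and it assumes it is run inside a clone of the kernel tree that has the release tags.

```shell
# Sketch: succeed (exit 0) if release tag $1 already contains commit $2.
# Must be run inside a git work tree that knows both the tag and the commit.
tag_contains_commit() {
    git merge-base --is-ancestor "$2" "$1"
}

# Example use inside a kernel tree (commit id taken from the log above):
#   tag_contains_commit v4.1 0cdccac6fe4b1316f04f0dbfcc4efab51932014a && echo "present in v4.1"
```

This gives a yes or no answer for the exact tag a production kernel is based on, which is sometimes easier to read than the name printed by git describe --contains.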

Making Flame Graphs work with our kernel

Knowing that this worked with newer versions of perf, at least as of the 4.6 kernel, we were able to track down that perf script did not gain this callchain support until 4.3. Notice that the added code is analogous to what was already in perf report:

$ git describe --contains 7322d6c98dd214252bd697f8dde64a3576977fab
$ git log -1 -p 7322d6c98dd214252bd697f8dde64a3576977fab
commit 7322d6c98dd214252bd697f8dde64a3576977fab
Author: Jiri Olsa <[email protected]>
Date:   Thu Aug 13 09:17:24 2015 +0200

    perf script: Initialize callchain_param.record_mode

    Milian Wolff reported non functional DWARF unwind under perf script. The
    reason is that perf script does not properly configure
    callchain_param.record_mode, which is needed by unwind code.

    Stealing the code from report and leaving the place for more
    initialization code in a hope we could merge it with
    report__setup_sample_type one day.

    Reported-by: Milian Wolff <[email protected]>
    Signed-off-by: Jiri Olsa <[email protected]>
    Tested-by: Milian Wolff <[email protected]>
    Cc: David Ahern <[email protected]>
    Cc: Namhyung Kim <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Link:[email protected]
    Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 7b376d2..105332e 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1561,6 +1561,22 @@ static int have_cmd(int argc, const char **argv)
        return 0;
 }

+static void script__setup_sample_type(struct perf_script *script)
+{
+       struct perf_session *session = script->session;
+       u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+       if (symbol_conf.use_callchain || symbol_conf.cumulate_callchain) {
+               if ((sample_type & PERF_SAMPLE_REGS_USER) &&
+                   (sample_type & PERF_SAMPLE_STACK_USER))
+                       callchain_param.record_mode = CALLCHAIN_DWARF;
+               else if (sample_type & PERF_SAMPLE_BRANCH_STACK)
+                       callchain_param.record_mode = CALLCHAIN_LBR;
+               else
+                       callchain_param.record_mode = CALLCHAIN_FP;
+       }
+}
+
 int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
 {
        bool show_full_info = false;
@@ -1849,6 +1865,7 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
                goto out_delete;

        script.session = session;
+       script__setup_sample_type(&script);

        session->itrace_synth_opts = &itrace_synth_opts;

By backporting this support from the 4.3 version of perf, we were able to support generation of Flame Graphs with our 4.1 production kernel tooling.
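
One rough way to confirm that a perf build is really unwinding under perf script is to look at how deep the recorded stacks are. The sketch below reports the deepest stack found in perf script output. The name max_stack_depth is hypothetical, and it assumes the usual layout of a header line containing a timestamp followed by indented frame lines that start with a hexadecimal address.

```shell
# Sketch: print the largest number of frames seen in any one sample.
# Assumed input layout: header lines contain a "seconds.micros:" timestamp,
# frame lines start (after indentation) with a hex address and a symbol.
max_stack_depth() {
    awk '
        # Header line starts a new sample.
        /[0-9]+\.[0-9]+:/ { depth = 0; next }
        # Frame line: one more frame in the current sample.
        NF >= 2 && $1 ~ /^[0-9a-f]+$/ { depth++; if (depth > max) max = depth }
        # max + 0 prints 0 rather than an empty line when no frames were seen.
        END { print max + 0 }
    ' "$@"
}
```

For example, `perf script | max_stack_depth` on data recorded with --call-graph dwarf should print a number well above 1 or 2 for most real workloads. A very small number suggests the callchains are not being unwound.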


The moral of the story is: don’t count on well-publicized perf features working on your older kernel. It is just as important to backport updates to the userspace perf tools as it is to backport updates to the production kernel itself.
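
As a first hint when a perf feature misbehaves, it can help to compare the version of the installed perf tool against the release where the feature landed. The sketch below flags perf builds older than 4.3, the release identified above for the perf script fix. The function names are hypothetical, the "perf version X.Y..." output format is assumed, and sort -V is a GNU coreutils extension. A vendor build with backports, like ours, will still report its old version, so treat the result as a hint and not as proof.

```shell
# Sketch: succeed (exit 0) if version $1 sorts strictly before version $2.
# Relies on GNU sort -V for version ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Sketch: $1 is the text printed by `perf --version`, assumed to look like
# "perf version 4.1.12". Succeeds if major.minor is older than 4.3.
perf_needs_backport() {
    ver=$(printf '%s\n' "$1" |
        sed -n 's/^perf version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
    [ -n "$ver" ] && version_lt "$ver" "4.3"
}

# Example use:
#   perf_needs_backport "$(perf --version)" && echo "perf script may lack DWARF callchains"
```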