| ============================== |
| Using the tracer for debugging |
| ============================== |
| |
| Copyright 2024 Google LLC. |
| |
| :Author: Steven Rostedt <rostedt@goodmis.org> |
| :License: The GNU Free Documentation License, Version 1.2 |
| (dual licensed under the GPL v2) |
| |
| - Written for: 6.12 |
| |
| Introduction |
| ------------ |
| The tracing infrastructure can be very useful for debugging the Linux |
| kernel. This document is a place to add various methods of using the tracer |
| for debugging. |
| |
| First, make sure that the tracefs file system is mounted:: |
| |
| $ sudo mount -t tracefs tracefs /sys/kernel/tracing |
| |
| |
| Using trace_printk() |
| -------------------- |
| |
| trace_printk() is a very lightweight utility that can be used in any context |
| inside the kernel, with the exception of "noinstr" sections. It can be used |
| in normal, softirq, interrupt and even NMI context. The trace data is |
| written to the tracing ring buffer in a lockless way. To make it even |
| lighter weight, when possible, it will only record the pointer to the format |
| string, and save the raw arguments into the buffer. The format and the |
| arguments will be post processed when the ring buffer is read. This way the |
| trace_printk() format conversions are not done during the hot path, where |
| the trace is being recorded. |
| |
| trace_printk() is meant only for debugging, and should never be added into |
| a subsystem of the kernel. If you need debugging traces, add trace events |
| instead. If a trace_printk() is found in the kernel, the following will |
| appear in the dmesg:: |
| |
| ********************************************************** |
| ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** |
| ** ** |
| ** trace_printk() being used. Allocating extra memory. ** |
| ** ** |
| ** This means that this is a DEBUG kernel and it is ** |
| ** unsafe for production use. ** |
| ** ** |
| ** If you see this message and you are not debugging ** |
| ** the kernel, report this immediately to your vendor! ** |
| ** ** |
| ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** |
| ********************************************************** |
| |
| Debugging kernel crashes |
| ------------------------ |
| There is various methods of acquiring the state of the system when a kernel |
| crash occurs. This could be from the oops message in printk, or one could |
| use kexec/kdump. But these just show what happened at the time of the crash. |
| It can be very useful in knowing what happened up to the point of the crash. |
| The tracing ring buffer, by default, is a circular buffer than will |
| overwrite older events with newer ones. When a crash happens, the content of |
| the ring buffer will be all the events that lead up to the crash. |
| |
| There are several kernel command line parameters that can be used to help in |
| this. The first is "ftrace_dump_on_oops". This will dump the tracing ring |
| buffer when a oops occurs to the console. This can be useful if the console |
| is being logged somewhere. If a serial console is used, it may be prudent to |
| make sure the ring buffer is relatively small, otherwise the dumping of the |
| ring buffer may take several minutes to hours to finish. Here's an example |
| of the kernel command line:: |
| |
| ftrace_dump_on_oops trace_buf_size=50K |
| |
| Note, the tracing buffer is made up of per CPU buffers where each of these |
| buffers is broken up into sub-buffers that are by default PAGE_SIZE. The |
| above trace_buf_size option above sets each of the per CPU buffers to 50K, |
| so, on a machine with 8 CPUs, that's actually 400K total. |
| |
| Persistent buffers across boots |
| ------------------------------- |
| If the system memory allows it, the tracing ring buffer can be specified at |
| a specific location in memory. If the location is the same across boots and |
| the memory is not modified, the tracing buffer can be retrieved from the |
| following boot. There's two ways to reserve memory for the use of the ring |
| buffer. |
| |
| The more reliable way (on x86) is to reserve memory with the "memmap" kernel |
| command line option and then use that memory for the trace_instance. This |
| requires a bit of knowledge of the physical memory layout of the system. The |
| advantage of using this method, is that the memory for the ring buffer will |
| always be the same:: |
| |
| memmap==12M$0x284500000 trace_instance=boot_map@0x284500000:12M |
| |
| The memmap above reserves 12 megabytes of memory at the physical memory |
| location 0x284500000. Then the trace_instance option will create a trace |
| instance "boot_map" at that same location with the same amount of memory |
| reserved. As the ring buffer is broke up into per CPU buffers, the 12 |
| megabytes will be broken up evenly between those CPUs. If you have 8 CPUs, |
| each per CPU ring buffer will be 1.5 megabytes in size. Note, that also |
| includes meta data, so the amount of memory actually used by the ring buffer |
| will be slightly smaller. |
| |
| Another more generic but less robust way to allocate a ring buffer mapping |
| at boot is with the "reserve_mem" option:: |
| |
| reserve_mem=12M:4096:trace trace_instance=boot_map@trace |
| |
| The reserve_mem option above will find 12 megabytes that are available at |
| boot up, and align it by 4096 bytes. It will label this memory as "trace" |
| that can be used by later command line options. |
| |
| The trace_instance option creates a "boot_map" instance and will use the |
| memory reserved by reserve_mem that was labeled as "trace". This method is |
| more generic but may not be as reliable. Due to KASLR, the memory reserved |
| by reserve_mem may not be located at the same location. If this happens, |
| then the ring buffer will not be from the previous boot and will be reset. |
| |
| Sometimes, by using a larger alignment, it can keep KASLR from moving things |
| around in such a way that it will move the location of the reserve_mem. By |
| using a larger alignment, you may find better that the buffer is more |
| consistent to where it is placed:: |
| |
| reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace |
| |
| On boot up, the memory reserved for the ring buffer is validated. It will go |
| through a series of tests to make sure that the ring buffer contains valid |
| data. If it is, it will then set it up to be available to read from the |
| instance. If it fails any of the tests, it will clear the entire ring buffer |
| and initialize it as new. |
| |
| The layout of this mapped memory may not be consistent from kernel to |
| kernel, so only the same kernel is guaranteed to work if the mapping is |
| preserved. Switching to a different kernel version may find a different |
| layout and mark the buffer as invalid. |
| |
| Using trace_printk() in the boot instance |
| ----------------------------------------- |
| By default, the content of trace_printk() goes into the top level tracing |
| instance. But this instance is never preserved across boots. To have the |
| trace_printk() content, and some other internal tracing go to the preserved |
| buffer (like dump stacks), either set the instance to be the trace_printk() |
| destination from the kernel command line, or set it after boot up via the |
| trace_printk_dest option. |
| |
| After boot up:: |
| |
| echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest |
| |
| From the kernel command line:: |
| |
| reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace |
| |
| If setting it from the kernel command line, it is recommended to also |
| disable tracing with the "traceoff" flag, and enable tracing after boot up. |
| Otherwise the trace from the most recent boot will be mixed with the trace |
| from the previous boot, and may make it confusing to read. |