| ================================================== |
| page owner: Tracking about who allocated each page |
| ================================================== |
| |
| Introduction |
| ============ |
| |
| page owner tracks who allocated each page. |
| It can be used to debug memory leaks or to find a memory hog. |
| When an allocation happens, information about it, such as the call stack |
| and the order of pages, is stored in per-page storage. |
| When we need to know the status of all pages, we can retrieve and analyze |
| this information. |
| |
| Although tracepoints for tracing page allocation/free already exist, |
| using them to analyze who allocated each page is rather complex. The |
| trace buffer must be enlarged so that it is not overwritten before the |
| userspace program is launched. The launched program must then continually |
| dump the trace buffer for later analysis, which is more likely to change |
| system behaviour than simply keeping the information in memory, and that |
| is bad for debugging. |
| |
| page owner can also be used for other purposes. For example, accurate |
| fragmentation statistics can be obtained from the gfp flag information of |
| each page. This is already implemented and is activated when page owner is |
| enabled. Other uses are more than welcome. |
| |
| It can also be used to show all the stacks and their current number of |
| allocated base pages, which gives us a quick overview of where the memory |
| is going without the need to screen through all the pages and match the |
| allocation and free operations. |
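
As a rough illustration of that overview, the sketch below totals the
``nr_base_pages`` lines of a saved ``show_stacks`` dump. It assumes the
output format shown in the Usage section, where every stack trace is
followed by one ``nr_base_pages: N`` line. The names ``stack_totals`` and
``sum_base_pages`` are hypothetical and are not part of the kernel or of
``tools/mm``.

```c
#include <stdio.h>

/* Hypothetical summary of a show_stacks dump. */
struct stack_totals {
	unsigned long stacks;		/* number of nr_base_pages lines seen */
	unsigned long total_pages;	/* sum of all nr_base_pages values */
	unsigned long max_pages;	/* largest single nr_base_pages value */
};

/*
 * Read show_stacks-style text from `in` and fill `t`.
 * Stack frame lines are ignored; only "nr_base_pages: N" lines count.
 * Returns 0 on success, -1 on a read error.
 */
int sum_base_pages(FILE *in, struct stack_totals *t)
{
	char line[1024];
	unsigned long n;

	t->stacks = 0;
	t->total_pages = 0;
	t->max_pages = 0;

	while (fgets(line, sizeof(line), in)) {
		/* Leading blank in the format skips any indentation. */
		if (sscanf(line, " nr_base_pages: %lu", &n) != 1)
			continue;
		t->stacks++;
		t->total_pages += n;
		if (n > t->max_pages)
			t->max_pages = n;
	}
	return ferror(in) ? -1 : 0;
}
```

A caller would open a file such as ``stacks.txt`` with ``fopen()`` and pass
the stream to ``sum_base_pages()``.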
| |
| page owner is disabled by default. To use it, add "page_owner=on" to your |
| boot cmdline. If the kernel is built with page owner but it is left |
| disabled at runtime because the boot option was not given, runtime |
| overhead is marginal. When disabled at runtime, it requires no memory to |
| store owner information, so there is no runtime memory overhead. page |
| owner inserts just two unlikely branches into the page allocator hotpath, |
| and if it is not enabled, allocation proceeds as in a kernel without page |
| owner. These two unlikely branches should not affect allocation |
| performance, especially if the static keys jump label patching |
| functionality is available. |
| |
| Although enabling page owner increases kernel size by several kilobytes, |
| most of this code is outside the page allocator and its hot path. Building |
| the kernel with page owner and turning it on when needed is a good |
| option for debugging kernel memory problems. |
| |
| There is one caveat caused by an implementation detail. page owner |
| stores information in memory from the struct page extension. On sparse |
| memory systems, this memory is initialized some time after the page |
| allocator starts, so until initialization many pages can be allocated |
| without owner information. To fix this up, these early allocated |
| pages are examined and marked as allocated in the initialization phase. |
| Although this does not mean that they have the right owner information, |
| we can at least tell more accurately whether a page is allocated or not. |
| On a 2GB memory x86-64 VM box, 13343 early allocated pages |
| are caught and marked, although they are mostly allocated by the struct |
| page extension feature. After that, no page is left in an untracked state. |
| |
| Usage |
| ===== |
| |
| 1) Build user-space helper:: |
| |
| cd tools/mm |
| make page_owner_sort |
| |
| 2) Enable page owner: add "page_owner=on" to boot cmdline. |
| |
| 3) Do the job that you want to debug. |
| |
| 4) Analyze information from page owner:: |
| |
| cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt |
| cat stacks.txt |
| post_alloc_hook+0x177/0x1a0 |
| get_page_from_freelist+0xd01/0xd80 |
| __alloc_pages+0x39e/0x7e0 |
| allocate_slab+0xbc/0x3f0 |
| ___slab_alloc+0x528/0x8a0 |
| kmem_cache_alloc+0x224/0x3b0 |
| sk_prot_alloc+0x58/0x1a0 |
| sk_alloc+0x32/0x4f0 |
| inet_create+0x427/0xb50 |
| __sock_create+0x2e4/0x650 |
| inet_ctl_sock_create+0x30/0x180 |
| igmp_net_init+0xc1/0x130 |
| ops_init+0x167/0x410 |
| setup_net+0x304/0xa60 |
| copy_net_ns+0x29b/0x4a0 |
| create_new_namespaces+0x4a1/0x820 |
| nr_base_pages: 16 |
| ... |
| ... |
| echo 7000 > /sys/kernel/debug/page_owner_stacks/count_threshold |
| cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks_7000.txt |
| cat stacks_7000.txt |
| post_alloc_hook+0x177/0x1a0 |
| get_page_from_freelist+0xd01/0xd80 |
| __alloc_pages+0x39e/0x7e0 |
| alloc_pages_mpol+0x22e/0x490 |
| folio_alloc+0xd5/0x110 |
| filemap_alloc_folio+0x78/0x230 |
| page_cache_ra_order+0x287/0x6f0 |
| filemap_get_pages+0x517/0x1160 |
| filemap_read+0x304/0x9f0 |
| xfs_file_buffered_read+0xe6/0x1d0 [xfs] |
| xfs_file_read_iter+0x1f0/0x380 [xfs] |
| __kernel_read+0x3b9/0x730 |
| kernel_read_file+0x309/0x4d0 |
| __do_sys_finit_module+0x381/0x730 |
| do_syscall_64+0x8d/0x150 |
| entry_SYSCALL_64_after_hwframe+0x62/0x6a |
| nr_base_pages: 20824 |
| ... |
| |
| cat /sys/kernel/debug/page_owner > page_owner_full.txt |
| ./page_owner_sort page_owner_full.txt sorted_page_owner.txt |
| |
| The general output of ``page_owner_full.txt`` is as follows:: |
| |
| Page allocated via order XXX, ... |
| PFN XXX ... |
| // Detailed stack |
| |
| Page allocated via order XXX, ... |
| PFN XXX ... |
| // Detailed stack |
| |
| By default, it does a full pfn dump. To start from a given pfn, |
| use fseek, which page_owner supports. |
| |
| FILE *fp = fopen("/sys/kernel/debug/page_owner", "r"); |
| fseek(fp, pfn_start, SEEK_SET); |
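
The two-line snippet above can be expanded into a small helper. The sketch
below is one possible variant under stated assumptions: it uses the POSIX
calls ``open()``, ``lseek()`` and ``read()`` in place of ``fopen()`` and
``fseek()``, so that the offset is handed to the kernel unchanged instead of
passing through stdio buffering. Following the text above, the offset is
taken to be the starting pfn. ``read_from_offset`` is a hypothetical name.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path`, seek to `start` and issue a single read() of up to `len`
 * bytes into `buf`. For /sys/kernel/debug/page_owner, `start` is assumed
 * to be the pfn to begin the dump at.
 * Returns the number of bytes read (0 at end of file), or -1 on error.
 */
ssize_t read_from_offset(const char *path, off_t start, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (lseek(fd, start, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}

	n = read(fd, buf, len);
	close(fd);
	return n;
}
```

Calling it in a loop, advancing ``start`` as needed, would be left to the
caller; the sketch only shows the seek followed by one read.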
| |
| The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows |
| in buf, uses regexp to extract the page order value, counts the times |
| (occurrences) and pages of each buf, and finally sorts them according to |
| the parameter(s). |
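
The first of those steps can be sketched as follows. This is not the code
of ``page_owner_sort``: it uses ``sscanf()`` instead of a regular
expression, it does not group identical blocks and it does not sort. It
only skips ``PFN`` rows and derives a page count of ``1 << order`` from
each ``Page allocated via order`` row. The names ``block_stats`` and
``tally_blocks`` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical totals over a page_owner dump. */
struct block_stats {
	unsigned long blocks;	/* "Page allocated via order" rows seen */
	unsigned long pages;	/* sum of (1 << order) over those rows */
	unsigned long pfn_rows;	/* PFN rows that were skipped */
};

/*
 * Read page_owner-style text from `in` and fill `st`.
 * Returns 0 on success, -1 on a read error.
 */
int tally_blocks(FILE *in, struct block_stats *st)
{
	char line[1024];
	unsigned int order;

	memset(st, 0, sizeof(*st));

	while (fgets(line, sizeof(line), in)) {
		/* PFN rows carry no information used for the totals. */
		if (strncmp(line, "PFN", 3) == 0) {
			st->pfn_rows++;
			continue;
		}
		/* Stack frame rows and blank rows fail this match. */
		if (sscanf(line, "Page allocated via order %u", &order) != 1)
			continue;
		/* Ignore orders that would overflow the shift. */
		if (order >= 8 * sizeof(unsigned long))
			continue;
		st->blocks++;
		st->pages += 1UL << order;
	}
	return ferror(in) ? -1 : 0;
}
```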
| |
| The result showing who allocated each page is written to |
| ``sorted_page_owner.txt``. General output:: |
| |
| XXX times, XXX pages: |
| Page allocated via order XXX, ... |
| // Detailed stack |
| |
| By default, ``page_owner_sort`` sorts by the times of buf. |
| To sort by the number of pages of buf, use the ``-m`` parameter. |
| The detailed parameters are: |
| |
| fundamental function:: |
| |
| Sort: |
| -a Sort by memory allocation time. |
| -m Sort by total memory. |
| -p Sort by pid. |
| -P Sort by tgid. |
| -n Sort by task command name. |
| -r Sort by memory release time. |
| -s Sort by stack trace. |
| -t Sort by times (default). |
| --sort <order> Specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]]. |
| Choose a key from the **STANDARD FORMAT SPECIFIERS** section. The "+" is |
| optional since the default direction is increasing numerical or lexicographic |
| order. Mixed use of abbreviated and complete forms of keys is allowed. |
| |
| Examples: |
| ./page_owner_sort <input> <output> --sort=n,+pid,-tgid |
| ./page_owner_sort <input> <output> --sort=at |
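
To make the ``--sort`` syntax concrete, the sketch below splits a
specification such as ``n,+pid,-tgid`` into keys and directions. It is an
illustration of the grammar ``[+|-]key[,[+|-]key[,...]]`` only and is not
taken from ``page_owner_sort``. It does not check key names against the
**STANDARD FORMAT SPECIFIERS** table. ``sort_key`` and ``parse_sort_spec``
are hypothetical names.

```c
#include <string.h>

/* One parsed sort key (hypothetical type). */
struct sort_key {
	char name[32];	/* key text, e.g. "pid" or "at" */
	int descending;	/* 1 if prefixed with '-', 0 for '+' or no prefix */
};

/*
 * Parse `spec` into at most `max_keys` entries of `keys`.
 * Returns the number of keys parsed, or -1 if the specification is
 * malformed (empty key, trailing comma, key too long, too many keys).
 */
int parse_sort_spec(const char *spec, struct sort_key *keys, int max_keys)
{
	int n = 0;

	while (*spec) {
		size_t len;
		int desc = 0;

		/* Optional direction prefix; '+' is the default. */
		if (*spec == '+' || *spec == '-') {
			desc = (*spec == '-');
			spec++;
		}

		len = strcspn(spec, ",");
		if (n >= max_keys || len == 0 || len >= sizeof(keys[n].name))
			return -1;

		memcpy(keys[n].name, spec, len);
		keys[n].name[len] = '\0';
		keys[n].descending = desc;
		n++;

		spec += len;
		if (*spec == ',') {
			spec++;
			if (*spec == '\0')
				return -1;	/* trailing comma */
		}
	}
	return n;
}
```

With the first example above, ``n,+pid,-tgid`` yields three keys: ``n`` and
``pid`` ascending, ``tgid`` descending.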
| |
| additional function:: |
| |
| Cull: |
| --cull <rules> |
| Specify culling rules. Culling syntax is key[,key[,...]]. Choose a |
| key from the **STANDARD FORMAT SPECIFIERS** section. |
| |
| <rules> is a single argument in the form of a comma-separated list of |
| keys k1,k2, ..., each of which specifies an individual culling rule. The |
| recognized keys are described in the **STANDARD FORMAT SPECIFIERS** |
| section below. Mixed use of abbreviated and complete forms of keys is |
| allowed. |
| |
| Examples: |
| ./page_owner_sort <input> <output> --cull=stacktrace |
| ./page_owner_sort <input> <output> --cull=st,pid,name |
| ./page_owner_sort <input> <output> --cull=n,f |
| |
| Filter: |
| -f Filter out the information of blocks whose memory has been released. |
| |
| Select: |
| --pid <pidlist> Select by pid. This selects the blocks whose process ID |
| numbers appear in <pidlist>. |
| --tgid <tgidlist> Select by tgid. This selects the blocks whose thread |
| group ID numbers appear in <tgidlist>. |
| --name <cmdlist> Select by task command name. This selects the blocks whose |
| task command name appears in <cmdlist>. |
| |
| <pidlist>, <tgidlist>, <cmdlist> are single arguments in the form of a comma-separated list, |
| which offers a way to specify individual selecting rules. |
| |
| |
| Examples: |
| ./page_owner_sort <input> <output> --pid=1 |
| ./page_owner_sort <input> <output> --tgid=1,2,3 |
| ./page_owner_sort <input> <output> --name name1,name2 |
| |
| STANDARD FORMAT SPECIFIERS |
| ========================== |
| :: |
| |
| For --sort option: |
| |
| KEY     LONG          DESCRIPTION |
| p       pid           process ID |
| tg      tgid          thread group ID |
| n       name          task command name |
| st      stacktrace    stack trace of the page allocation |
| T       txt           full text of block |
| ft      free_ts       timestamp of the page when it was released |
| at      alloc_ts      timestamp of the page when it was allocated |
| ator    allocator     memory allocator for pages |
| |
| For --cull option: |
| |
| KEY     LONG          DESCRIPTION |
| p       pid           process ID |
| tg      tgid          thread group ID |
| n       name          task command name |
| f       free          whether the page has been released or not |
| st      stacktrace    stack trace of the page allocation |
| ator    allocator     memory allocator for pages |