Table of contents
=================

Last updated: 20 December 2005

Contents
========

- Introduction
- Devices not appearing
- Finding patch that caused a bug
-- Finding using git-bisect
-- Finding it the old way
- Fixing the bug

Introduction
============

Always try the latest kernel from kernel.org and build from source. If you are
not confident in doing that, please report the bug to your distribution vendor
instead of to a kernel developer.

Finding bugs is not always easy. Have a go though. If you can't find it, don't
give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.

Before you submit a bug report, read REPORTING-BUGS.

Devices not appearing
=====================

Often this is caused by udev. Check that first before blaming it on the
kernel.

Finding patch that caused a bug
===============================

Finding using git-bisect
------------------------

Using the provided tools with git makes finding bugs easy, provided the bug is
reproducible.

Steps to do it:
- start using git for the kernel source
- read the man page for git-bisect
- have fun

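The steps above leave the mechanics to the git-bisect man page. As a rough
sketch only: after "git bisect start", "git bisect bad" and
"git bisect good <revision>", the command "git bisect run <script>" repeats the
build-and-test cycle for you and judges each revision by the script's exit
status (0 means good, 125 means skip this revision, any other value from 1 to
127 means bad). The script below is a hypothetical example of such a test
script; the plain "make" build and the reproducer command passed on the command
line are assumptions standing in for however you build and trigger your bug.

```python
#!/usr/bin/env python3
"""Hypothetical test script for "git bisect run" (a sketch, not a kernel tool)."""
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125    # exit codes understood by "git bisect run"

def verdict(build_ok, bug_seen):
    """Map the outcome of one build-and-test cycle to an exit code."""
    if not build_ok:
        return SKIP            # cannot test this revision, let git pick another
    return BAD if bug_seen else GOOD

def succeeds(cmd):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0

def main(reproducer):
    # Assumption: a plain "make" builds the tree being bisected.
    build_ok = succeeds(["make"])
    # Assumption: the reproducer exits 0 only when the bug shows up.
    bug_seen = build_ok and succeeds(reproducer)
    return verdict(build_ok, bug_seen)

# Only act when a reproducer command was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Saved as, say, bisect_test.py, it would be used as
"git bisect run ./bisect_test.py ./reproduce-bug.sh", where both file names are
made up for this example.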
Finding it the old way
----------------------

[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]

This is how to track down a bug if you know nothing about kernel hacking.
It's a brute force approach but it works pretty well.

You need:

        . A reproducible bug - it has to happen predictably (sorry)
        . All the kernel tar files from a revision that worked to the
          revision that doesn't

You will then do:

        . Rebuild a revision that you believe works, install, and verify that.
        . Do a binary search over the kernels to figure out which one
          introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but
          you know that 1.3.69 does. Pick a kernel in the middle and build
          that, like 1.3.50. Build & test; if it works, pick the mid point
          between .50 and .69, else the mid point between .28 and .50.
        . You'll narrow it down to the kernel that introduced the bug. You
          can probably do better than this but it gets tricky.

        . Narrow it down to a subdirectory

          - Copy kernel that works into "test". Let's say that 1.3.62 works,
            but 1.3.63 doesn't. So you diff -r those two kernels and come
            up with a list of directories that changed. For each of those
            directories:

                Copy the non-working directory next to the working directory
                as "dir.63".
                One directory at a time, try moving the working directory to
                "dir.62" and the non-working "dir.63" into its place:

                        mv dir dir.62
                        mv dir.63 dir
                        find dir -name '*.[oa]' -print | xargs rm -f

                And then rebuild and retest. Assuming that all related
                changes were contained in the sub directory, this should
                isolate the change to a directory.

                Problems: changes in header files may have occurred; I've
                found in my case that they were self explanatory - you may
                or may not want to give up when that happens.

        . Narrow it down to a file

          - You can apply the same technique to each file in the directory,
            hoping that the changes in that file are self contained.

        . Narrow it down to a routine

          - You can take the old file and the new file and manually create
            a merged file that has

                #ifdef VER62
                routine()
                {
                        ...
                }
                #else
                routine()
                {
                        ...
                }
                #endif

            And then walk through that file, one routine at a time and
            prefix it with

                #define VER62
                /* both routines here */
                #undef VER62

            Then recompile, retest, move the ifdefs until you find the one
            that makes the difference.

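The binary search over kernel revisions described in the steps above can be
sketched in a few lines. This is only an illustration: first_bad and the
has_bug callback are hypothetical names, the revision where the bug appears is
made up, and in practice has_bug means building, installing and testing that
revision by hand.

```python
def first_bad(revisions, has_bug):
    """Return the first revision for which has_bug() is true.

    Assumes revisions is ordered oldest to newest, that the first entry
    is known to work and that the last entry is known to fail.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if has_bug(revisions[mid]):
            bad = mid          # bug already present, look earlier
        else:
            good = mid         # still works, look later
    return revisions[bad]

# Made-up example: pretend the bug crept in with 1.3.57.
kernels = ["1.3.%d" % n for n in range(28, 70)]
tested = []

def has_bug(version):
    tested.append(version)     # each call stands for one build-and-test cycle
    return int(version.split(".")[-1]) >= 57

culprit = first_bad(kernels, has_bug)
```

For the 42 revisions from 1.3.28 to 1.3.69 this takes at most six
build-and-test cycles, which is why the brute force approach is bearable.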
Finally, you take all the info that you have, kernel revisions, bug
description, the extent to which you have narrowed it down, and pass
that off to whomever you believe is the maintainer of that section.
A post to linux.dev.kernel isn't such a bad idea if you've done some
work to narrow it down.

If you get it down to a routine, you'll probably get a fix in 24 hours.

My apologies to Linus and the other kernel hackers for describing this
brute force approach; it's hardly what a kernel hacker would do. However,
it does work and it lets non-hackers help fix bugs. And it is cool
because Linux snapshots will let you do this - something that you can't
do with vendor supplied releases.

Fixing the bug
==============

Nobody is going to tell you how to fix bugs. Seriously. You need to work it
out. But below are some hints on how to use the tools.

To debug a kernel, use objdump and look for the hex offset from the crash
output to find the valid line of code/assembler. Without debug symbols, you
will see the assembler code for the routine shown, but if your kernel has
debug symbols the C code will also be available. (Debug symbols can be enabled
in the kernel hacking menu of the menu configuration.) For example:

        objdump -r -S -l --disassemble net/dccp/ipv4.o

NB: you need to be at the top level of the kernel tree for this to pick up
your C files.

If you don't have access to the code, you can also debug from some crash
dumps, e.g. crash dump output as shown by Dave Miller.

>  EIP is at ip_queue_xmit+0x14/0x4c0
>   ...
>  Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
>  00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
>  <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85
>
>  Put the bytes into a "foo.s" file like this:
>
>         .text
>         .globl foo
>  foo:
>         .byte  .... /* bytes from Code: part of OOPS dump */
>
>  Compile it with "gcc -c -o foo.o foo.s" then look at the output of
>  "objdump --disassemble foo.o".
>
>  Output:
>
>  ip_queue_xmit:
>      push       %ebp
>      push       %edi
>      push       %esi
>      push       %ebx
>      sub        $0xbc, %esp
>      mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
>      mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
>      mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt

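Typing the bytes into "foo.s" by hand is tedious and error-prone. The
following sketch automates that one step. code_to_asm is a hypothetical helper,
not an existing kernel script, and it assumes the hex bytes of the "Code:"
line have already been joined into one string. The angle brackets around
<8b>, which mark the byte at the faulting EIP, are dropped.

```python
import re

def code_to_asm(code_line, label="foo"):
    """Turn the hex bytes of an Oops "Code:" line into assembler source."""
    text = code_line.replace("Code:", "")
    # Pick out two-digit hex bytes; "<8b>" (the byte at EIP) becomes "8b".
    byte_values = re.findall(r"\b[0-9a-fA-F]{2}\b", text)
    lines = ["\t.text", "\t.globl %s" % label, "%s:" % label]
    # Emit eight bytes per .byte directive to keep the lines short.
    for i in range(0, len(byte_values), 8):
        chunk = ", ".join("0x" + b for b in byte_values[i:i + 8])
        lines.append("\t.byte %s" % chunk)
    return "\n".join(lines) + "\n"

# Part of the dump quoted above, starting at the function prologue.
sample = ("Code: 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 "
          "8b 5d 08 <8b> 83 3c 01 00 00")
asm = code_to_asm(sample)
```

Write the result to foo.s and carry on with the gcc and objdump commands
quoted above. In this sample, 20 bytes (0x14) precede the marked byte, which
agrees with the "ip_queue_xmit+0x14" reported in the dump.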
In addition, you can use GDB to figure out the exact file and line
number of the OOPS from the vmlinux file. If you have
CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the
OOPS:

        EIP:    0060:[<c021e50e>]    Not tainted VLI

And use GDB to translate that to human-readable form:

        gdb vmlinux
        (gdb) l *0xc021e50e

If you don't have CONFIG_DEBUG_INFO enabled, you use the function
offset from the OOPS:

        EIP is at vt_ioctl+0xda8/0x1482

And recompile the kernel with CONFIG_DEBUG_INFO enabled:

        make vmlinux
        gdb vmlinux
        (gdb) p vt_ioctl
        (gdb) l *(0x<address of vt_ioctl> + 0xda8)

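The address handed to "l *" in the last step is simply the start of the
function plus the offset from the OOPS. As a small worked example, the sketch
below parses the "symbol+offset/size" form and does the addition.
parse_location and fault_address are hypothetical helpers, and the start
address 0xc021d766 is made up for the illustration; use whatever "p vt_ioctl"
prints for your own build.

```python
import re

def parse_location(text):
    """Split 'vt_ioctl+0xda8/0x1482' into (symbol, offset, function size)."""
    m = re.search(r"(\w+)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)", text)
    if m is None:
        raise ValueError("no symbol+offset/size found in %r" % text)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)

def fault_address(symbol_start, offset):
    """Address to hand to gdb's 'l *' command."""
    return symbol_start + offset

symbol, offset, size = parse_location("EIP is at vt_ioctl+0xda8/0x1482")
# Made-up start address; take the real one from "(gdb) p vt_ioctl".
VT_IOCTL_START = 0xc021d766
gdb_command = "l *0x%x" % fault_address(VT_IOCTL_START, offset)
```

The second number after the slash is the size of the function, so an offset
larger than it would mean the line was misread.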
Another very useful option of the Kernel Hacking section in menuconfig is
Debug memory allocations. This will help you see whether data has been
initialised or left unset before use, etc. To see the values that get assigned
with this, look at mm/slab.c and search for POISON_INUSE. When using this, an
Oops will often show the poisoned data instead of zero, which is the default.

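As a rough illustration of what to look for: in kernels of this era mm/slab.c
defines POISON_INUSE as 0x5a and POISON_FREE as 0x6b. Treat these values as an
assumption and check them against your own tree. A register or memory word in
an Oops that consists only of one of these bytes is a strong hint. The helper
below, describe_poison, is hypothetical and only matches whole words filled
with a single poison byte.

```python
# Poison fill bytes from mm/slab.c (assumption: verify them in your tree).
POISON = {
    0x5a: "POISON_INUSE: allocated but never initialised",
    0x6b: "POISON_FREE: object used after it was freed",
}

def describe_poison(word, width=4):
    """Return a description if every byte of word is the same poison byte."""
    data = word.to_bytes(width, "little")
    first = data[0]
    if first in POISON and all(b == first for b in data):
        return POISON[first]
    return None                # not a recognised poison pattern

hint = describe_poison(0x6b6b6b6b)   # e.g. a register value from an Oops
```

Here hint names POISON_FREE, which would point at a use-after-free rather than
at an uninitialised field.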
Once you have worked out a fix, please submit it upstream. After all, open
source is about sharing what you do, and don't you want to be recognised for
your genius?

Please do read Documentation/SubmittingPatches though to help your code get
accepted.