What do oops look like?
ibmcam.c: IBM USB camera found (model
3, rev. 0x0301)
ibmcam.c: IBM camera: using 320x240
usbvideo.c: ibmcam on /dev/video0:
canvas=320x240 videosize=320x240
Unable to handle kernel paging request
at virtual address 1ffffc0000d4771c
CPU 0 vidcat(871): Oops 0
pc = [<fffffc0000990554>]
ra = [<fffffffc0030eab8>] ps = 0007
v0 = 0000000000000000 t0 = 1ffffc0000d4771c
t1 = ffffffff7fffffff
t2 = fffffc0000d47720 t3 = ffffffffffffffff
t4 = 0000000000000000
t5 = 0000000000000000 t6 = fffffc0000d47720
t7 = fffffc001cf60000
s0 = fffffc0000abd950 s1 = fffffc000064a740
s2 = fffffc000064a750
s3 = 0000000000000007 s4 = 0000000000000000
s5 = fffffc001cf63d58
s6 = 000000000000001e
a0 = 0000000000000007 a1 = 0000000000000007
a2 = fffffc001cf63d58
a3 = 0000000000000001 a4 = 0000000000000001
a5 = 0000000000000080
t8 = fffffc0000d47700 t9 = fffffc0000d47700
t10= 0000000000000000
t11= 0000000000000080 pv = fffffc0000990320
at = 0000000000000000
gp = fffffc0000cd6928 sp = fffffc001cf63cb8
Code: 205f0001 lda t1,1(zero)
48410722 sll t1,t0,t1
20630020 lda t2,32(t2)
4880b681 srl t3,5,t0
47e20502 ornot zero,t1,t1
40230441 s4addq t0,t2,t0
*a8a10000 ldl_l t4,0(t0)
44a20005 and t4,t1,t4
Trace:fffffc00008627ec fffffc0000873508 fffffc0000860c14 fffffc0000860af0 fffffc0000861040 fffffc0000810c20
At first glance you'd be a bit overwhelmed by all the registers and dissassembly and stack trace, however, what you're looking at is an internal format of the state of play when the exception in the kernel occured. Whats NOT so obvious from this is that it's actually telling you what line number in the kernel code that all hell broke loose and unfortunately that takes a bit more effort to dig out.
Fortunately a lot of the work can be facilitated by ksymoops.
ie,
dmesg | ksymoops
yields
Unable to handle kernel paging request
at virtual address 1ffffc0000d4771c
CPU 0 vidcat(871): Oops 0
pc = [<fffffc0000990554>]
ra = [<fffffffc0030eab8>] ps = 0007
Using defaults from ksymoops -t elf64-alpha
-a alpha
v0 = 0000000000000000 t0 = 1ffffc0000d4771c
t1 = ffffffff7fffffff
t2 = fffffc0000d47720 t3 = ffffffffffffffff
t4 = 0000000000000000
t5 = 0000000000000000 t6 = fffffc0000d47720
t7 = fffffc001cf60000
s0 = fffffc0000abd950 s1 = fffffc000064a740
s2 = fffffc000064a750
s3 = 0000000000000007 s4 = 0000000000000000
s5 = fffffc001cf63d58
s6 = 000000000000001e
a0 = 0000000000000007 a1 = 0000000000000007
a2 = fffffc001cf63d58
a3 = 0000000000000001 a4 = 0000000000000001
a5 = 0000000000000080
t8 = fffffc0000d47700 t9 = fffffc0000d47700
t10= 0000000000000000
t11= 0000000000000080 pv = fffffc0000990320
at = 0000000000000000
gp = fffffc0000cd6928 sp = fffffc001cf63cb8
Code: 205f0001 lda t1,1(zero)
48410722 sll t1,t0,t1
20630020 lda t2,32(t2)
4880b681 srl t3,5,t0
47e20502 ornot zero,t1,t1
40230441 s4addq t0,t2,t0
44a20005 and t4,t1,t4
Trace:fffffc00008627ec fffffc0000873508
fffffc0000860c14 fffffc0000860af0 fffffc0000861040 fffffc0000810c20
Warning (Oops_read): Code line not
seen, dumping what data is available
>>PC; fffffc0000990554 <pci_pool_alloc+234/2d0>
<=====
Trace; fffffc00008627ec <chrdev_open+9c/f0>
Trace; fffffc0000873508 <get_write_access+18/80>
Trace; fffffc0000860c14 <dentry_open+104/210>
Trace; fffffc0000860af0 <filp_open+70/90>
Trace; fffffc0000861040 <sys_open+70/1a0>
Trace; fffffc0000810c20 <entSys+a8/c0>
The important differance here being that the "Trace:" has been matched up to the symbol table for the kernel to provide a map of the fuctions that were called/in play leading up to the kernel exception. The two numbers seen appended after the function are the offset of how far into the function the address is and the second number gives an indication of how large the fuction is (all in bytes). Unfortunately these numbers by themselves are not particulary useful.
In the original alpha oops you will notice that there was some disassembly
Code: 205f0001 lda t1,1(zero)Ksymoops unfortunately strips out the one line with the "*" in it which is the point that the exception occurs. Now then, believe it or not this assembly dump will provide you the actual code line that fell apart in the kernel. To do that, you'll have to have the kernel src built.
48410722 sll t1,t0,t1
20630020 lda t2,32(t2)
4880b681 srl t3,5,t0
47e20502 ornot zero,t1,t1
40230441 s4addq t0,t2,t0
*a8a10000 ldl_l t4,0(t0)
44a20005 and t4,t1,t4
Oh, a bit of prepreperation, it's a good idea to have a current TAGS
file around.
find . -type f -name "*[ch]" | etags -
The tags file can be used with emacs or vi to allow you to quickly navigate around the kernel src tree. In this case we use the tags file to find out that pci_pool_alloc is to be found in the file drivers/pci/pci.c
Now a bit of magic has to occur. We have to rebuild drivers/pci/pci.c but in such a way that it leaves the assembly file so that we can look at it.
In this case that gcc command would be
gcc -D__KERNEL__ -I/usr/src/linux-2.4.3/include -Wall -Wstrict-prototypes
-O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -Wno-unused -pipe
-mno-fp-regs -ffixed-8 -mcpu=ev5 -Wa,-mev6 -DEXPORT_SYMTAB
-gstabs+ -S pci.c
(you can get a feel for what would be required by changing directory to drivers/pci , rm -f pci.o and then type in make -C ../.. SUBDIRS=`pwd` and when the gcc compile line pops up, press crtl-c, and cut and paste the compile line, replaceing the -c option with -gstabs+ -S. This rebuilds the assembly file with stabs (basically a means for us to work out what line in the C file is being executed/compiled without making differing assembly from when the kernel was built)
In this example you will be left with pci.s which is the assembly file used to create the object code. This file is intresting because it also has burried in it the line numbers of the C statements being coded
Now, returning to the originally assembly,
Code: 205f0001 lda t1,1(zero)There is a slight problem here in that the code is being displayed in Tru64 assembly format as opposed to GNU assembly format. Don't worry, it's quite simple to translate.
48410722 sll t1,t0,t1
20630020 lda t2,32(t2)
4880b681 srl t3,5,t0
47e20502 ornot zero,t1,t1
40230441 s4addq t0,t2,t0
*a8a10000 ldl_l t4,0(t0)
44a20005 and t4,t1,t4
If you look at /usr/include/alpha/regdef.h you will notice that the registers (t0-7,s0-6 etc) are all represented as $N. These are the only bits that you need to translate. so our code fragment becomes
Code: 205f0001 lda $2,1(zero)Given that the exception was in ldl_l $5,0($1) we can look for this code fragment in pci.s
48410722 sll $2,$1,$2
20630020 lda $3,32($3)
4880b681 srl $4,5,$1
47e20502 ornot $31,$2,$2
40230441 s4addq $1,$3,$1
*a8a10000 ldl_l $5,0($1)
44a20005 and $5,$2,$5
.stabs "/usr/src/linux-2.4.3/include/asm/bitops.h",132,0,0,$Ltext134NOTE: It is IMPORTANT to get the preceeding .stabs line as that refers to the source code file being assembled.
$Ltext134:
$LM1157:
.stabn 68,0,88,$LM1157
addq $0,$22,$3
$LM1158:
.stabn 68,0,110,$LM1158
and $4,31,$1
lda $2,1
sll $2,$1,$2
$LM1159:
.stabn 68,0,88,$LM1159
lda $3,32($3)
$LM1160:
.stabn 68,0,93,$LM1160
srl $4,5,$1
$LM1161:
.stabn 68,0,110,$LM1161
ornot $31,$2,$2
$LM1162:
.stabn 68,0,93,$LM1162
s4addq $1,$3,$1
$LM1163:
.stabn 68,0,110,$LM1163
1: ldl_l $5,0($1)
and $5,$2,$5
stl_c $5,0($1)
beq $5,2f
The important line now becomes .stabn 68,0,110,$LM1163 as this has the all important line number encoded in it (the third number,.. 110). Don't get thrown by the way that the assembly seems to be compiling code out of order, (line 88, 93,110, 93,100..) this is simply the gcc optimizer at work.
If I now fire up emacs on /usr/src/linux-2.4.3/include/asm/bitops.h
and look at line 110 I can find that the above problem is in this routine
extern __inline__ void clear_bit(unsigned long nr, volatile void * addr)So, in this particular example, we've chased this oops down to a possibly faulty piece of hand assembler in the bitops.h header file. But regardless, this is the point that it all goes horribly wrong. It may not be that the routine is wrong, it may simply be that the routine is being passed bad variables.
{
#ifndef BITOPS_NO_BRANCH
unsigned long oldbit;
#endif
unsigned long temp;
unsigned int * m = ((unsigned int *) addr) + (nr >> 5); /*** This is line 93 ***/#ifndef BITOPS_NO_BRANCH
__asm__ __volatile__(
"1: ldl_l %0,%4\n"
" and %0,%3,%2\n"
" beq %2,2f\n"
" xor %0,%3,%0\n"
" stl_c %0,%1\n"
" beq %0,3f\n"
"2:\n"
".subsection 2\n"
"3: br 1b\n"
".previous"
:"=&r" (temp), "=m" (*m), "=&r" (oldbit)
:"Ir" (1UL << (nr & 31)), "m" (*m));
#else
__asm__ __volatile__( /*** This is line 110 ***/
"1: ldl_l %0,%3\n"
" and %0,%2,%0\n"
" stl_c %0,%1\n"
" beq %0,2f\n"
".subsection 2\n"
"2: br 1b\n"
".previous"
:"=&r" (temp), "=m" (*m)
:"Ir" (~(1UL << (nr & 31))), "m" (*m));
#endif
}
Importantly we can now tell what lines of pci_pool_alloc() may have failed because there are only two instances of the call in the routine... and there you have it.