U-Boot debug using BDI2000 on PowerPC 4xx

I'm trying to figure out what's going on while debugging a U-Boot port. I've got U-Boot loaded on my board and my BDI2000 set up for debug. As I step through start.S I keep running into this error:
(gdb) si
314 mtspr SPRN_SRR0,r0
(gdb) si
315 mtspr SPRN_SRR1,r0
(gdb) si
316 mtspr SPRN_CSRR0,r0
(gdb) si
317 mtspr SPRN_CSRR1,r0
(gdb) si
320 mtspr SPRN_MCSRR0,r0
(gdb) si
321 mtspr SPRN_MCSRR1,r0
(gdb) si
322 mfspr r1,SPRN_MCSR
(gdb) si
323 mtspr SPRN_MCSR,r1
(gdb) si
333 lis r1,0x0030 /* store gathering & broadcast disable */
(gdb) si
Cannot access memory at address 0x300000
(gdb) si
_start_440 () at start.S:334
334 ori r1,r1,0x6000 /* cache touch */
Cannot access memory at address 0xfffff03c
(gdb) bt
#0 _start_440 () at start.S:334
#1 0xfffff18c in rsttlb () at start.S:480
Backtrace stopped: frame did not save the PC
This is my first board bring up so any pointers you might have would be very helpful.
Thanks!

For some reason GDB only reads in the asm for the module being run. By stepping into other areas with the BDI I'm able to stepi from GDB without the "Cannot access memory" issues.
If you have questions feel free to send me a message.
Thx

This appears to be PowerPC code. My experience suggests that your memory address is not yet mapped. Start-up code by default runs from non-volatile memory (NVM, e.g. ROM, EEPROM, flash), and it is the start-up code's responsibility to set up or define where RAM is located. Generally this information is pulled from NVM and written into a memory-management device, or into registers within the PowerPC chip, in order to make the processor aware of RAM. Without seeing the entire code it is difficult to assess whether it is set up properly. The other possibility is that the BDI's config file does not describe what is at address 0x300000.

Related

QEMU how pcie_host converts physical address to pcie address

I am learning the implementation of QEMU and I have a question. As we know, on real hardware, when the CPU reads a virtual address that maps to a PCI device, the PCI host bridge is responsible for converting it to a PCI address. QEMU provides pcie_host.c to emulate the PCIe host. In this file pcie_mmcfg_data_write is implemented, but there is nothing about converting a physical address to a PCI address.
I did a test in QEMU using gdb:
First, I added the edu device, which is a very simple PCI device, to QEMU.
When I try to turn on Memory Space Enable (Mem- to Mem+) with setpci -s 00:02.0 04.b=2, QEMU stops in the function pcie_mmcfg_data_write.
static void pcie_mmcfg_data_write(void *opaque, hwaddr mmcfg_addr,
                                  uint64_t val, unsigned len)
{
    PCIExpressHost *e = opaque;
    PCIBus *s = e->pci.bus;
    /* the target device is identified from the bus/devfn bits of the MMCONFIG address */
    PCIDevice *pci_dev = pcie_dev_find_by_mmcfg_addr(s, mmcfg_addr);
    uint32_t addr;
    uint32_t limit;

    if (!pci_dev) {
        return;
    }

    /* the low bits of the MMCONFIG address are the offset into that device's config space */
    addr = PCIE_MMCFG_CONFOFFSET(mmcfg_addr);
    limit = pci_config_size(pci_dev);
    pci_host_config_write_common(pci_dev, addr, limit, val, len);
}
It is clear that the PCIe host uses this function to find the device and perform the config write.
Using bt I get:
#0 pcie_mmcfg_data_write
(opaque=0xaaaaac573f10, mmcfg_addr=65540, val=2, len=1)
at hw/pci/pcie_host.c:39
#1 0x0000aaaaaae4e8a8 in memory_region_write_accessor
(mr=0xaaaaac574520, addr=65540, value=0xffffe14703e8, size=1, shift=0, mask=255, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:483
#2 0x0000aaaaaae4eb14 in access_with_adjusted_size
(addr=65540, value=0xffffe14703e8, size=1, access_size_min=1, access_size_max=4, access_fn=
0xaaaaaae4e7c0 <memory_region_write_accessor>, mr=0xaaaaac574520, attrs=...) at /home/mrzleo/Desktop/qemu/memory.c:544
#3 0x0000aaaaaae51898 in memory_region_dispatch_write
(mr=0xaaaaac574520, addr=65540, data=2, op=MO_8, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:1465
#4 0x0000aaaaaae72410 in io_writex
(env=0xaaaaac6924e0, iotlbentry=0xffff000e9b00, mmu_idx=2, val=2,
addr=18446603336758132740, retaddr=281473269319356, op=MO_8)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1084
#5 0x0000aaaaaae74854 in store_helper
(env=0xaaaaac6924e0, addr=18446603336758132740, val=2, oi=2, retaddr=281473269319356, op=MO_8)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1954
#6 0x0000aaaaaae74d78 in helper_ret_stb_mmu
(env=0xaaaaac6924e0, addr=18446603336758132740, val=2 '\002', oi=2, retaddr=281473269319356)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:2056
#7 0x0000ffff9a3b47cc in code_gen_buffer ()
#8 0x0000aaaaaae8d484 in cpu_tb_exec
(cpu=0xaaaaac688c00, itb=0xffff945691c0 <code_gen_buffer+5673332>)
at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:172
#9 0x0000aaaaaae8e4ec in cpu_loop_exec_tb
(cpu=0xaaaaac688c00, tb=0xffff945691c0 <code_gen_buffer+5673332>,
last_tb=0xffffe1470b78, tb_exit=0xffffe1470b70)
at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:619
#10 0x0000aaaaaae8e830 in cpu_exec (cpu=0xaaaaac688c00)
at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:732
#11 0x0000aaaaaae3d43c in tcg_cpu_exec (cpu=0xaaaaac688c00)
at /home/mrzleo/Desktop/qemu/cpus.c:1405
#12 0x0000aaaaaae3dd4c in qemu_tcg_cpu_thread_fn (arg=0xaaaaac688c00)
at /home/mrzleo/Desktop/qemu/cpus.c:1713
#13 0x0000aaaaab722c70 in qemu_thread_start (args=0xaaaaac715be0)
at util/qemu-thread-posix.c:519
#14 0x0000fffff5af84fc in start_thread (arg=0xffffffffe3ff)
at pthread_create.c:477
#15 0x0000fffff5a5167c in thread_start ()
at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
Then I try to access the edu device's address with devmem 0x10000000.
QEMU stops in edu_mmio_read. Using bt:
(gdb) bt
#0 edu_mmio_read
(opaque=0xaaaaae71c560, addr=0, size=4)
at hw/misc/edu.c:187
#1 0x0000aaaaaae4e5b4 in memory_region_read_accessor
(mr=0xaaaaae71ce50, addr=0, value=0xffffe2472438, size=4, shift=0, mask=4294967295, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:434
#2 0x0000aaaaaae4eb14 in access_with_adjusted_size
(addr=0, value=0xffffe2472438, size=4, access_size_min=4, access_size_max=8, access_fn=
0xaaaaaae4e570 <memory_region_read_accessor>, mr=0xaaaaae71ce50, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:544
#3 0x0000aaaaaae51524 in memory_region_dispatch_read1
(mr=0xaaaaae71ce50, addr=0, pval=0xffffe2472438, size=4, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:1385
#4 0x0000aaaaaae51600 in memory_region_dispatch_read
(mr=0xaaaaae71ce50, addr=0, pval=0xffffe2472438, op=MO_32, attrs=...)
at /home/mrzleo/Desktop/qemu/memory.c:1413
#5 0x0000aaaaaae72218 in io_readx
(env=0xaaaaac6be0f0, iotlbentry=0xffff04282ec0, mmu_idx=0,
addr=281472901758976, retaddr=281473196263360, access_type=MMU_DATA_LOAD, op=MO_32)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1045
#6 0x0000aaaaaae738b0 in load_helper
(env=0xaaaaac6be0f0, addr=281472901758976, oi=32, retaddr=281473196263360,
op=MO_32, code_read=false, full_load=0xaaaaaae73c68 <full_le_ldul_mmu>)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1566
#7 0x0000aaaaaae73ca4 in full_le_ldul_mmu
(env=0xaaaaac6be0f0, addr=281472901758976, oi=32, retaddr=281473196263360)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1662
#8 0x0000aaaaaae73cd8 in helper_le_ldul_mmu
(env=0xaaaaac6be0f0, addr=281472901758976, oi=32, retaddr=281473196263360)
at /home/mrzleo/Desktop/qemu/accel/tcg/cputlb.c:1669
#9 0x0000ffff95e08824 in code_gen_buffer
()
#10 0x0000aaaaaae8d484 in cpu_tb_exec
(cpu=0xaaaaac6b4810, itb=0xffff95e086c0 <code_gen_buffer+31491700>)
at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:172
#11 0x0000aaaaaae8e4ec in cpu_loop_exec_tb
(cpu=0xaaaaac6b4810, tb=0xffff95e086c0 <code_gen_buffer+31491700>,
last_tb=0xffffe2472b78, tb_exit=0xffffe2472b70)
at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:619
#12 0x0000aaaaaae8e830 in cpu_exec
(cpu=0xaaaaac6b4810) at /home/mrzleo/Desktop/qemu/accel/tcg/cpu-exec.c:732
#13 0x0000aaaaaae3d43c in tcg_cpu_exec
(cpu=0xaaaaac6b4810) at /home/mrzleo/Desktop/qemu/cpus.c:1405
#14 0x0000aaaaaae3dd4c in qemu_tcg_cpu_thread_fn
(arg=0xaaaaac6b4810)
at /home/mrzleo/Desktop/qemu/cpus.c:1713
#15 0x0000aaaaab722c70 in qemu_thread_start (args=0xaaaaac541610) at util/qemu-thread-posix.c:519
#16 0x0000fffff5af84fc in start_thread (arg=0xffffffffe36f) at pthread_create.c:477
#17 0x0000fffff5a5167c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
It seems that QEMU just locates the edu device directly, and the PCIe host does nothing in this procedure. I wonder whether QEMU simply does not implement the conversion here and just uses MemoryRegion to achieve polymorphism. If not, what does QEMU's PCIe host do in this procedure?
QEMU uses a set of data structures called MemoryRegions to model the address space that a CPU sees (the detailed API is documented in part in the developer docs).
MemoryRegions can be built up into a tree, where at the "root" there is one 'container' MR which covers the whole 64-bit address space the guest CPU can see, and then MRs for blocks of RAM, devices, etc are placed into that root MR at appropriate offsets. Child MRs can also be containers which in turn contain further MRs. You can then find the MR corresponding to a given guest physical address by walking through the tree of MRs.
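For illustration, here is a minimal sketch of how a device MR gets placed into that tree. It is not taken from any real board model: the device state struct, the callbacks, the region name and the 0x10000000 offset are all invented, and it assumes it is compiled inside the QEMU source tree.
/* Minimal illustrative sketch, not a real QEMU device. */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/address-spaces.h"   /* get_system_memory() */

typedef struct MyDevState {
    MemoryRegion mmio;             /* MR modelling this device's registers */
} MyDevState;

static uint64_t my_dev_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;                      /* placeholder register read */
}

static void my_dev_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    /* placeholder register write */
}

static const MemoryRegionOps my_dev_ops = {
    .read = my_dev_read,
    .write = my_dev_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static void my_dev_map(MyDevState *s)
{
    /* create a 4 KiB MMIO region backed by the callbacks above... */
    memory_region_init_io(&s->mmio, NULL, &my_dev_ops, s, "my-dev-mmio", 0x1000);
    /* ...and place it into the root container of the guest physical address
     * space at a fixed offset, making it visible to the CPU */
    memory_region_add_subregion(get_system_memory(), 0x10000000, &s->mmio);
}
Once mapped, any guest access to 0x10000000-0x10000fff is routed to my_dev_read/my_dev_write.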
The tree of MemoryRegions is largely built up statically when QEMU starts (because most devices don't move around), but it can also be changed dynamically in response to guest software actions. In particular, PCI works this way. When the guest OS writes to a PCI device BAR (which is in PCI config space) this causes QEMU's PCI host controller emulation code to place the MR corresponding to the device's registers into the MemoryRegion hierarchy at the correct place and offset (depending on what address the guest wrote to the BAR, ie where it asked for it to be mapped). Once this is done, the MR for the PCI device is like any other in the tree, and the PCI host controller code doesn't need to be involved in guest accesses to it.
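As a concrete example of that flow, here is a sketch loosely modeled on hw/misc/edu.c (MyPCIDevState and the other names are invented). The realize function only registers the MR as BAR 0; the generic PCI code maps and unmaps it in the MemoryRegion tree as the guest programs the BAR.
#include "qemu/osdep.h"
#include "qemu/units.h"
#include "qapi/error.h"
#include "hw/pci/pci.h"

typedef struct MyPCIDevState {
    PCIDevice parent_obj;
    MemoryRegion mmio;
} MyPCIDevState;

/* my_mmio_ops would be a MemoryRegionOps with read/write callbacks,
 * like the sketch shown earlier */
extern const MemoryRegionOps my_mmio_ops;

static void my_pci_realize(PCIDevice *pdev, Error **errp)
{
    MyPCIDevState *s = (MyPCIDevState *)pdev;

    memory_region_init_io(&s->mmio, OBJECT(pdev), &my_mmio_ops, s,
                          "my-pci-mmio", 1 * MiB);
    /* Hand the MR to the generic PCI layer as BAR 0. When the guest later
     * writes this BAR in config space, QEMU's PCI code adds or removes the
     * MR from the MemoryRegion tree at the address the guest chose; the
     * host-bridge code is not involved in the data accesses themselves. */
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
}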
As a performance optimisation, QEMU doesn't actually walk down a tree of MRs for every access. Instead, we first "flatten" the tree into a data structure (a FlatView) that directly says "for this range of addresses, it will be this MR; for this range; this MR", and so on. Secondly, QEMU's TLB structure can directly cache mappings from "guest virtual address" to "specific memory region". On first access it will do an emulated guest MMU page table walk to get from the guest virtual address to the guest physical address, and then it will look that physical address up in the FlatView to find either the real host RAM or the MemoryRegion that is mapped there, and it will add the "guest VA -> this MR" mapping to the TLB cache. Future accesses will hit in the TLB and need not repeat the work of converting to a physaddr and then finding the MR in the flatmap. This is what is happening in your backtrace -- the io_readx() function is passed the guest virtual address and also the relevant part of the TLB data structure, and it can then directly find the target MR and the offset within it, so it can call memory_region_dispatch_read() to dispatch the read request to that MR's read callback function. (If this was the first access, the initial "MMU walk + FlatView lookup" work will have just been done in load_helper() before it calls io_readx().)
Obviously, all this caching also implies that QEMU tracks events which mean the cached data is no longer valid so we can throw it away (eg if the guest writes to the BAR again to unmap it or to map it somewhere else; or if the MMU settings or page tables are changed to alter the guest virtual-to-physical mapping).

How can I trace the cause of an invalid PC fault on Cortex M3?

I have an STM32 Cortex M3 that is experiencing an intermittent invalid PC (INVPC) fault. Unfortunately it takes a day or more to manifest and I don't know the cause.
I have the device paused in the debugger after the fault happened. The INVPC flag is set. The stacked registers are as follows:
0x08003555 xPSR
0x08006824 PC
0x08006824 LR
0x00000000 R12
0x08003341 R3
0x08006824 R2
0xFFFFFFFD R1
0x0000FFFF R0
Unfortunately the return address 0x08006824 is just past the end of the firmware image. The decompilation of that region is as follows:
Region$$Table$$Base
0x08006804: 08006824 $h.. DCD 134244388
0x08006808: 20000000 ... DCD 536870912
0x0800680c: 000000bc .... DCD 188
0x08006810: 08005b30 0[.. DCD 134241072
0x08006814: 080068e0 .h.. DCD 134244576
0x08006818: 200000bc ... DCD 536871100
0x0800681c: 00001a34 4... DCD 6708
0x08006820: 08005b40 #[.. DCD 134241088
Region$$Table$$Limit
** Section #2 'RW_IRAM1' (SHT_PROGBITS) [SHF_ALLOC + SHF_WRITE]
Size : 188 bytes (alignment 4)
Address: 0x20000000
I'm not sure this address is valid. The disassembly of that address in the debugger looks like nonsense, maybe data interpreted as code or something.
Is there any way I can trace this back to see where the exception happened? If necessary I can add some additional code to capture more information.
Not sure how it works on the Cortex M3, but on some other ARMs the PSR register holds processor mode bits that could tell you in which mode it happens (user mode, IRQ, FIQ, etc.). Each mode generally has its own stack.
For user mode, if you use an RTOS with multitasking, you probably have a separate stack for each task, but you could try to find out which task is the current one (the one that was running before the crash).
When you find the crashed task (or IRQ), you could look through its stack for the addresses of routines and work out what was called before the accident, provided the stack was not unrecoverably corrupted.
This is where I'd start the investigation. If you find the crashed task or even the function but still have no idea what happens, you could add something like a small circular history buffer where you write an event code at every step of your program, so you can see what it did last even if the stack was destroyed.
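A minimal sketch of such a history buffer (the size, the event codes and the names are arbitrary): record a small code at interesting points, and read trace_buf and trace_idx back in the debugger after the fault.
#include <stdint.h>

#define TRACE_LEN 64

/* circular event history; inspect trace_buf and trace_idx from the debugger */
static volatile uint16_t trace_buf[TRACE_LEN];
static volatile uint8_t  trace_idx;

static inline void trace(uint16_t code)
{
    trace_buf[trace_idx] = code;
    trace_idx = (uint8_t)((trace_idx + 1) % TRACE_LEN);
}

/* usage: sprinkle trace() calls at key points of the program */
void some_task_step(void)
{
    trace(0x10);    /* entered this step */
    /* ... real work ... */
    trace(0x11);    /* leaving this step */
}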

Can I rule out that SIGBUS is raised by a "minor page fault"? (Kernel log has no allocation failure)

Motivation
I am trying to improve my understanding of a SIGBUS error in Xwayland. This has been seen by several Fedora Linux users since around the 20th of February 2018, with Xwayland 1.19.6-5.fc27.x86_64 and Linux kernel 4.15.3-300.fc27.x86-64.
Sadly I do not have the kernel "segfault" log message (or equivalent for SIGBUS). Xwayland has some pointless code which traps the fatal signal. But I can see siginfo by debugging the coredump, and this seems to be nearly as good.
Definition
I understand that a "major page fault" occurs when a page of virtual memory is not available in RAM, and must be read from disk. I think I'm specifically interested in pages backed by a ext4 filesystem (e.g. no direct access to block devices) for this question.
Therefore a "minor page fault" is when no disk access is necessary. I assume the difference is fairly well-defined because Linux exposes counters for major and minor page faults.
My question
If the kernel sends a program SIGBUS, I wonder if I should generally expect that this would have been a major page fault.
According to the coredump and disassembly, the program is reading memory when it receives SIGBUS, not writing it. The fault address in siginfo->si_addr is within a mapped system executable, which is not writeable by the user, and the address seems within the bounds of the current file length. In fact when debugging the coredump, I have read very convincing values from the memory address. It seems the coredump generation process had no difficulty reading this address :-(.
I'm also confident in ruling out the "invalid address alignment" case (BUS_ADRALN), because siginfo->si_code is 2, i.e. BUS_ADRERR, "non-existent physical address". Also because I'm on x86, which permits unaligned accesses in most cases, and the trap isn't in any SSE extended instruction.
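For reference, this is the kind of information a process can also capture for itself. A minimal sketch (not the Xwayland code; the handler is for debugging only and ignores async-signal-safety) of reading si_code and si_addr with an SA_SIGINFO handler:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* debugging sketch only: report which kind of SIGBUS (BUS_ADRERR vs
 * BUS_ADRALN, ...) was raised and at which address */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    fprintf(stderr, "SIGBUS: si_code=%d si_addr=%p\n",
            info->si_code, info->si_addr);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
    /* ... run the code under test ... */
    return 0;
}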
I considered what the kernel is normally responsible for, when it handles a page fault which it determines is "minor". I suppose minor faults could fail to allocate memory, and hence raise SIGBUS. However, I believe I would have noticed such an allocation failure:
I have plenty of free swap to evict user pages to, and I did not notice the usual obvious slowdown that occurs when my system starts swapping. The crash happened a few seconds after waking a laptop from suspend to ram, which would not have been long enough to fill 8GB of swap even at ~100MB/s.
Nor did I see the dread Out Of Memory (OOM) killer appear in kernel logs, as I would expect if the kernel failed allocating a page frame or page table.
Is there some other possibility that a minor page fault could have failed and caused the SIGBUS? I.e. is there some cause which I would not have noticed, when looking for errors in the kernel log? And which could have a quick onset?
Again, multiple coredumps are showing this as a page fault triggered by reading from a mapped file on the filesystem.
Ulterior motivation
I would really like to have missed a case for minor page faults. Because the horrifying flipside of this is that I don't see how this SIGBUS could have been caused by the hard page fault side of things. Several of us users have very similar-looking errors, starting a few months ago. There is no IO error in my kernel logs. During normal operation, I have no IO errors when reading the indicated file. I have no errors when running rpm --verify --all, or when running an extended SMART test on the HDD. Unfortunately I seem to have very few suspects. The closest suspect I have is a kernel upgrade, which I would obviously prefer to rule out; the dates don't exactly prove it but it's not entirely ruled out. Next closest in the dates is this year's microcode update; this seems like it would be even harder to nail down.
Known causes of minor page faults
Logically, it sounds like minor page faults occur when implementing copy-on-write for MAP_PRIVATE mappings.
It should also include read faults on /dev/zero or MAP_ANONYMOUS, assuming a kernel did not implement them as reading a shared zero page and did not implement them to allocate pages for the entire mapping immediately.
But more generally, it could be any first access to a page. This is because it seems that the page tables for memory mappings are generally populated on-demand. (Which would be done by a page fault, and if the file page was already in cache, it would only be a minor page fault).
From the mmap(2) man page:
MAP_NONBLOCK (since Linux 2.5.46)
This flag is meaningful only in conjunction with MAP_POPULATE.
Don't perform read-ahead: create page tables entries only for
pages that are already present in RAM. Since Linux 2.6.23, this
flag causes MAP_POPULATE to do nothing. One day, the combination
of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.
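For comparison, the textbook way a plain read of a file-backed mapping produces SIGBUS is an access to a page that lies beyond the end of the file. A standalone sketch (unrelated to the Xwayland case; error checking omitted):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/sigbus-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, 1);                      /* file is 1 byte: only page 0 is backed */
    char *p = mmap(NULL, 2 * 4096, PROT_READ, MAP_SHARED, fd, 0);
    printf("%d\n", p[0]);                  /* fine: within the file */
    printf("%d\n", p[4096]);               /* SIGBUS (BUS_ADRERR): page is past EOF */
    return 0;
}
(In the case described above the faulting address is within the file's length, so this particular cause does not apply, but it shows the read-fault path can deliver SIGBUS without any hardware problem.)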
EDIT: Further excerpts detailing the above
A commenter asked for more concrete details, to clarify the faulting address and instruction. There are many excerpts in the initial link https://bugzilla.redhat.com/show_bug.cgi?id=1557682
The fault varies as described in the bug link. Here are fresh excerpts from a recent instance.
$ gdb 2018-03-21.core
...
Core was generated by `/usr/bin/Xwayland :0 -rootless -terminate -core -listen 4 -listen 5 -displayfd'.
Program terminated with signal SIGBUS, Bus error.
#0 _dl_fixup (l=0x7fc0be2e0130, reloc_arg=203) at ../elf/dl-runtime.c:73
73 const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
[Current thread is 1 (Thread 0x7fc0be29fa80 (LWP 1918))]
(gdb) p $_siginfo.si_signum
$1 = 7
(gdb) p $_siginfo.si_code
$2 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
$3 = (void *) 0x41bd80
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
0x00007fc0be0c8bd0 <+0>: push %rbx
0x00007fc0be0c8bd1 <+1>: mov %rdi,%r10
0x00007fc0be0c8bd4 <+4>: mov %esi,%esi
0x00007fc0be0c8bd6 <+6>: lea (%rsi,%rsi,2),%rdx
0x00007fc0be0c8bda <+10>: sub $0x10,%rsp
0x00007fc0be0c8bde <+14>: mov 0x68(%rdi),%rax
0x00007fc0be0c8be2 <+18>: mov 0x8(%rax),%rdi
0x00007fc0be0c8be6 <+22>: mov 0xf8(%r10),%rax
0x00007fc0be0c8bed <+29>: mov 0x8(%rax),%rax
0x00007fc0be0c8bf1 <+33>: lea (%rax,%rdx,8),%r8
0x00007fc0be0c8bf5 <+37>: mov 0x70(%r10),%rax
=> 0x00007fc0be0c8bf9 <+41>: mov 0x8(%r8),%rcx
(gdb) p/x $r8
$4 = 0x41bd78
(gdb) p/x $r8 + 8
$5 = 0x41bd80
Note this instruction is fetching the value reloc->r_info as per the highlighted source line.
(gdb) p reloc
$6 = (const Elf64_Rela * const) 0x41bd78
(gdb) p &reloc->r_info
$7 = (Elf64_Xword *) 0x41bd80
(gdb) p *reloc
$8 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}
The faulting address falls within the text mapping below (from maps file captured by abrtd):
00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508 /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508 /usr/bin/Xwayland
$ size -x /usr/bin/Xwayland
text data bss dec hex filename
0x209ffb 0xbe9d 0x1f3e0 2314872 235278 /usr/bin/Xwayland
I certainly have some bug in the kernel, unless it is a bug in the kernel selftests.
EDIT: hmm, actually it seems others have noticed the GS selftests failure recently as well, but it was already present in older kernels, and also appears on AMD cpus. There does not appear to be a conclusion regarding how to fix it at the moment. https://lkml.org/lkml/2018/1/26/436
So it's not this bug on its own, though I can't rule out that this GS bug causes more prominent breakage when PTI is enabled or something.
$ uname -r
4.15.10-300.fc27.x86_64
$ git describe --all
heads/4.15.10
$ cat ./Documentation/x86/pti.txt
...
2. Run several copies of all of the tools/testing/selftests/x86/ tests
(excluding MPX and protection_keys) in a loop on multiple CPUs for
several minutes. These tests frequently uncover corner cases in the
kernel entry code. In general, old kernels might cause these tests
themselves to crash, but they should never crash the kernel.
$ cd tools/testing/selftests/x86
$ make
...
In 4x terminals to match my 4x hardware threads:
sh -c ' while true; do for i in *; do if test -x $i; then ./$i || exit; fi ; done; done '
Failures quickly appear:
[RUN] ARCH_SET_GS(0x200000000), then schedule to 0x200000000
Before schedule, set selector to 0x3
other thread: ARCH_SET_GS(0x200000000) -- sel is 0x0
[FAIL] GS/BASE changed from 0x3/0x0 to 0x0/0x0
Also
[RUN] Executing 6-argument 32-bit syscall via VDSO
[WARN] Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN] Flags after=0000000000200682 id 0 00 d i s 0 0 1
[WARN] Flags change=0000000000000855 0 00 o z 0 a 0 p 0 c
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200682 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[OK] R8..R15 did not leak kernel data
[RUN] Running tests under ptrace
[RUN] Executing 6-argument 32-bit syscall via VDSO
[WARN] Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN] Flags after=0000000000200686 id 0 00 d i s 0 0 p 1
[WARN] Flags change=0000000000000851 0 00 o z 0 a 0 0 c
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200686 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[OK] R8..R15 did not leak kernel data
Warning: failed to find getcpu in vDSO
[RUN] Testing getcpu...
[OK] CPU 0: syscall: cpu 0, node 0
[OK] CPU 1: syscall: cpu 1, node 0
[OK] CPU 2: syscall: cpu 2, node 0
[OK] CPU 3: syscall: cpu 3, node 0
[RUN] Testing getcpu...
[OK] CPU 0: syscall: cpu 0, node 0 vdso: cpu 0, node 0 vsyscall: cpu 0, node 0
[OK] CPU 1: syscall: cpu 1, node 0 vdso: cpu 1, node 0 vsyscall: cpu 1, node 0
[OK] CPU 2: syscall: cpu 2, node 0 vdso: cpu 2, node 0 vsyscall: cpu 2, node 0
[OK] CPU 3: syscall: cpu 3, node 0 vdso: cpu 3, node 0 vsyscall: cpu 3, node 0
[NOTE] failed to find getcpu in vDSO
[RUN] test gettimeofday()
vDSO time offsets: 0.000006 0.000000
[OK] vDSO gettimeofday()'s timeval was okay
[RUN] test time()
[FAIL] vDSO returned the wrong time (1522063297 1522063296 1522063297)
Thanks everyone for your support. It was indeed a transient IO error. It seems the SIGBUS read-fault path doesn't necessarily log anything in the kernel log, unlike the cases I'm used to seeing for IO errors.
https://marc.info/?l=linux-ide&m=152232081917215&w=2
v4.15 intermittent errors on suspend/resume
To anyone waiting for the other shoe to drop on the SATA LPM work...
I've found something that's at least in the same area. It triggered a
fsck on my system 2 days ago. Evidence suggests it's occurred on many
other machines. I felt that was reason enough to give you a heads up
:).
I checked and I don't seem to have LPM enabled during runtime, even
when running on battery. My errors are all on suspend/resume, so
maybe that behaviour was changed at the same time?
It doesn't always show in kernel logs. What I first noticed was a
mysterious SIGBUS that kills Xwayland (and hence the entire Gnome
session) on resume from suspend. It surprised me to learn that this
SIGBUS can happen, without leaving anything like the read errors I'm
used to seeing in the kernel log!
My coredumps show the SIGBUS fault address is an instruction read
inside the program code of Xwayland. The backtraces vary along the
same call chain - the common factor is that they're always at the
first instruction of the function. I assume it varies according to
which page is not currently in-core, and hence triggers the failing
read request.
There are hundreds of backtraces along this same call chain from
other users, reported automatically to Fedora, that look the same.
At least so far we don't have any more plausible explanation for them. I admit
it's funny that Xwayland is so prominent, and I haven't been swamped
with SIGBUS in other processes, but I stand by this analysis.
These crashes started within 24 hours of Fedora upgrading to kernel
v4.15.
Fedora bug for the Xwayland SIGBUS:
https://bugzilla.redhat.com/show_bug.cgi?id=1553979
My duplicate bug I've been spamming with puzzled comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1557682
The earliest and biggest of the many crash report buckets:
[2018-02-17] https://retrace.fedoraproject.org/faf/reports/2049964/
[315 reports] https://retrace.fedoraproject.org/faf/reports/2055378/
EXT4 filesystem error:
Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit
...
Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write
(this marked the FS as needing fsck on next boot)
More frequently, it logs these swap errors:
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
My laptop LPM status, even after removing AC power:
$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance
==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance
My laptop is a Dell Latitude E5450. CPU is i5-5300U (a Broadwell).

Breakpoint not working in gdb with QEMU simulating cortex-a8

I am testing some simple code written for the ARM7TDMI. Since I haven't found an ARM7TDMI model in QEMU, I use the Cortex-A8 instead (I am not sure if this will lead to bugs; total newbie).
This is how I run QEMU:
qemu-system-arm -machine realview-pb-a8 -cpu cortex-a8 -nographic -monitor null -serial null -semihosting -kernel main.elf -gdb tcp::51234 -S
The code I want to test is quite simple. The functions LoadContext() and SaveContext() are written in ARM assembly for the IAR IDE, which targets the ARM7TDMI as its core. I compiled that assembly file into an object file with IAR and linked it with the code below using arm-none-eabi-gcc. Will this cause unpredictable errors? (I just want to use gcc and QEMU instead of IAR...)
int main(void)
{
    Running = &taskA;
    Running->PC = task1;
    Running->SP = &(Running->StackSeg[STACK_SIZE - 1]);
    LoadContext();
}

void task1(void)
{
    register int reg_var = 1;
    volatile int vol_var = 1;
    SaveContext();
    reg_var++;
    vol_var++;
    SaveContext();
    reg_var++;
    vol_var++;
    LoadContext();
}
So, when I set a breakpoint in gdb, it does not work; the program just seems to go into an endless loop. I checked the initialization process; it is:
(gdb)
0x000082f6 in __libc_init_array ()
(gdb)
0x000080e2 in _start ()
(gdb)
0x000080e4 in _start ()
(gdb)
0x000080e6 in _start ()
(gdb)
main () at src/context-demo.c:12
12 int main(void) {
(gdb)
0x000081ea 12 int main(void) {
(gdb)
0x00000008 in ?? ()
(gdb)
0x0000000c in ?? ()
(gdb)
0x00000010 in ?? ()
(gdb)
0x00000014 in ?? ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000004 in ?? ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00000004 in ?? ()
(gdb)
Does anybody have any ideas about what happened here? Any help is appreciated, thanks!
You'll find this much easier to debug if you tell gdb to tell you about the assembly instructions it is executing ("display /3i $pc" will print the next 3 instructions every time gdb stops), and do single step of individual instructions ("stepi").
Something is causing you to end up at a low address 0x8 unexpectedly, and you need to find out what that is. Either you're really jumping to 0x8, or you've taken an exception. Looking at execution at a per-machine instruction level will tell you which it is.
Some plausible possibilities here:
executable built to assume it has RAM where the realview-pb-a8 does not have RAM -- this often manifests as "writing to the stack (or global variables) silently does nothing and reading from the stack/globals returns 0", so if you have a function pointer in a global or you try to push a return address to the stack and then pop it you'll end up at 0
executable built to assume it's running under an OS that provides an SVC API -- in this case the code will execute an SVC instruction and your code will crash because there's nothing to handle it at the SVC exception vector
executable built for the wrong CPU type and executes an instruction that UNDEFs (this should result in execution going to address 0x4, the undef vector, but I have a feeling there's a qemu bug in its gdbstub that may mean that a step that executes an UNDEF insn will not stop until after executing the first insn at the UNDEF vector)
executable built to assume that the FPU is always enabled. When QEMU is executing a "bare metal" binary like this, the CPU is started in the state that hardware starts, which has the FPU disabled. So any instructions using the FPU will UNDEF unless the executable's startup code has explicitly turned on the FPU (a sketch of such startup code follows after this list).
I've listed those in rough order of probability for your case, but in any case single stepping by machine instruction should identify what's going on.
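For the last case, a rough sketch of the kind of startup code that would be needed: the standard ARMv7-A sequence that grants access to cp10/cp11 in CPACR and then sets FPEXC.EN. It is written as GCC inline assembly and assumes it runs early, before any floating-point instruction, and that the file is built with an -mfpu option that lets the assembler accept VMSR.
/* hypothetical bare-metal helper for an ARMv7-A core such as the Cortex-A8 */
static void enable_fpu(void)
{
    unsigned int cpacr;

    /* grant full access to coprocessors 10 and 11 (the VFP/NEON unit) */
    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 2" : "=r" (cpacr));
    cpacr |= (0xF << 20);
    __asm__ volatile ("mcr p15, 0, %0, c1, c0, 2" : : "r" (cpacr));
    __asm__ volatile ("isb");

    /* set FPEXC.EN (bit 30) to actually enable the FPU */
    __asm__ volatile ("vmsr fpexc, %0" : : "r" (1u << 30));
}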

How to ignore interrupts with arm gdb

I am trying to debug a program using arm-none-eabi-gdb and step through it. There is an interrupt handler, USB0_Handler, which I do not want to step into while stepping the program. To achieve this I tried to use skip, but it doesn't work, whether I skip the function or skip the entire file containing the handler. I am using openocd for remote debugging on the tm4c123gh6pm.
I have reached the point where I don't know whether I should define a gdb function myself or whether I am missing something. Here is the output of my terminal:
(gdb) info skip
Num Type Enb What
1 function y USB0_Handler
(gdb) c
Continuing.
Breakpoint 2, relayTask () at ./relay.c:191
191 nextTime = rtcGetTimeIn(DEFAULT_REFRESH_RATE);
(gdb) n
USB0_Handler () at ./UsbConfig.c:326
326 {
(gdb) n
332 ui32Status = MAP_USBIntStatusControl(USB0_BASE);
(gdb) n
337 USBDeviceIntHandlerInternal(0, ui32Status);
(gdb) n
338 }
(gdb) n #returning at the top of USB0_Handler
326 {
When an interrupt is triggered while stepping, GDB usually stops because the step ended in a place it didn't expect.
Interrupt handlers are generally hard to deal with from a debugger point of view because they are executed in a new context: the stack frames are changed and unless GDB recognizes a particular pattern in the frame it won't be able to compute a complete stack trace (i.e. the interrupt handler frames + your regular program stack trace before the interrupt.)
The simplest way to get out of the interrupt handler is to plant a breakpoint on the last line of the handler, resume, and then continue stepping. Someone suggested using the finish command, but it may fail depending, again, on the quality of the stack trace.
Thanks to GDB's scriptability (in Python, for instance) it may be possible to automate that: check the PC, and if the PC is at the ISR address from the IRQ vector, fetch the return address, plant a temporary breakpoint there, and resume.
