Code execution exploit Cortex M4 - arm

For testing the MPU and playing around with exploits, I want to execute code from a local buffer running on my STM32F4 dev board.
int main(void)
{
uint16_t func[] = { 0x0301f103, 0x0301f103, 0x0301f103 };
MPU->CTRL = 0;
unsigned int address = (void*)&func+1;
asm volatile(
"mov r4,%0\n"
"ldr pc, [r4]\n"
:
: "r"(address)
);
while(1);
}
In main, I first turn of the MPU. In func my instructions are stored. In the ASM part I load the address (0x2001ffe8 +1 for thumb) into the program counter register. When stepping through the code with GDB, in R4 the correct value is stored and then transfered to PC register. But then I will end up in the HardFault Handler.
Edit:
The stack looks like this:
0x2001ffe8: 0x0301f103 0x0301f103 0x0301f103 0x2001ffe9
The instructions are correct in the memory. Definitive Guide to Cortex says region 0x20000000–0x3FFFFFFF is the SRAM and "this region is executable,
so you can copy program code here and execute it".

You are assigning 32 bit values to a 16 bit array.
Your instructions dont terminate, they continue on to run into whatever is found in ram, so that will crash.
You are not loading the address to the array into the program counter you are loading the first item in the array into the program counter, this will crash, you created a level of indirection.
Look at the BX instruction for this rather than ldr pc
You did not declare the array as static, so the array can be optimized out as dead and unused, so this can cause it to crash.
The compiler should also complain that you are assigning a void* to an unsigned variable, so a typecast is wanted there.
As a habit I recommend address|=1 rather than +=1, in this case either will function.

Related

Inject a memory exception by accessing forbidden memory location

I want to test a exception handler function that I have written for an embedded system and want to write a test code that injects an access to memory that forbidden.
void Test_Mem_exception
{
__asm(
"LDR R0, =0xA0100000\n\t"
"MOV R1, 0x77777777\n\t"
"STR R1, [R0,#0]"
);
This is the code I want to write that access memory location at 0xA010000. Somehow this does not seem a generic test code to me.
Is there a standard way of writing such test codes in C or C++. By Generic I mean a code that is independent of the memory map of the system that it runs on.
I wouldn't use asm for this, simply use a pointer.
void Test_Mem_exception
{
/* volatile, to suppress optimizing/removing the read statement */
volatile int *ptr = 0xC0C0C0C0;
int value = *ptr;
}
This won't always result to an exception, because reading from address 0 can be valid on some systems.
The same applies to any other address, there doesn't exist any address that will fail on all systems.

How do I call hex data stored in an array with inline assembly?

I have an OS project that I am working on and I am trying to call data that I have read from the disk in C with inline assembly.
I have already tried reading the code and executing it with the assembly call instruction, using inline assembly.
void driveLoop() {
uint16_t sectors = 31;
uint16_t sector = 0;
uint16_t basesector = 40000;
uint32_t i = 40031;
uint16_t code[sectors][256];
int x = 0;
while(x==0) {
read(i);
for (int p=0; p < 256; p++) {
if (readOut[p] == 0) {
} else {
x = 1;
//kprint_int(i);
}
}
i++;
}
kprint("Found sector!\n");
kprint("Loading OS into memory...\n");
for (sector=0; sector<sectors; sector++) {
read(basesector+sector);
for (int p=0; p<256; p++) {
code[sector][p] = readOut[p];
}
}
kprint("Done loading.\n");
kprint("Attempting to call...\n");
asm volatile("call (%0)" : : "r" (&code));
When the inline assembly is called I expect it to run the code from the sectors I read from the "disk" (this is in a VM, because its a hobby OS). What it does instead is it just hangs.
I probably don't much understand how variables, arrays, and assembly work, so if you could fill me in, that would be nice.
EDIT: The data I am reading from the disk is a binary file that was added
to the disk image file with
cat kernel.bin >> disk.img
and the kernel.bin is compiled with
i686-elf-ld -o kernel.bin -Ttext 0x4C4B40 *insert .o files here* --oformat binary
What it does instead is it just hangs.
Run your OS inside BOCHS so you can use BOCHS's built-in debugger to see exactly where it's stuck.
Being able to debug lockups, including with interrupts disabled, is probably very useful...
asm volatile("call (%0)" : : "r" (&code)); is unsafe because of missing clobbers.
But even worse than that it will load a new EIP value from the first 4 bytes of the array, instead of setting EIP to that address. (Unless the data you're loading is an array of pointers, not actual machine code?)
You have the %0 in parentheses, so it's an addressing mode. The assembler will warn you about an indirect call without *, but will assemble it like call *(%eax), with EAX = the address of code[0][0]. You actually want a call *%eax or whatever register the compiler chooses, register-indirect not memory-indirect.
&code and code are both just a pointer to the start of the array; &code doesn't create an anonymous pointer object storing the address of another address. &code takes the address of the array as a whole. code in this context "decays" to a pointer to the first object.
https://gcc.gnu.org/wiki/DontUseInlineAsm (for this).
You can get the compiler to emit a call instruction by casting the pointer to a function pointer.
__builtin___clear_cache(&code[0][0], &code[30][255]); // don't optimize away stores into the buffer
void (*fptr)(void) = (void*)code; // casting to void* instead of the actual target type is simpler
fptr();
That will compile (with optimization enabled) to something like lea 16(%esp), %eax / call *%eax, for 32-bit x86, because your code[][] buffer is an array on the stack.
Or to have it emit a jmp instead, do it at the end of a void function, or return funcptr(); in a non-void function, so the compiler can optimize the call/ret into a jmp tailcall.
If it doesn't return, you can declare it with __attribute__((noreturn)).
Make sure the memory page / segment is executable. (Your uint16_t code[]; is a local, so gcc will allocate it on the stack. This might not be what you want. The size is a compile-time constant so you could make it static, but if you do that for other arrays in other sibling functions (not parent or child), then you lose out on the ability to reuse a big chunk of stack memory for different arrays.)
This is much better than your unsafe inline asm. (You forgot a "memory" clobber, so nothing tells the compiler that your asm actually reads the pointed-to memory). Also, you forgot to declare any register clobbers; presumably the block of code you loaded will have clobbered some registers if it returns, unless it's written to save/restore everything.
In GNU C you do need to use __builtin__clear_cache when casting a data pointer to a function pointer. On x86 it doesn't actually clear any cache, it's telling the compiler that the stores to that memory are not dead because it's going to be read by execution. See How does __builtin___clear_cache work?
Without that, gcc could optimize away the copying into uint16_t code[sectors][256]; because it looks like a dead store. (Just like with your current inline asm which only asks for the pointer in a register.)
As a bonus, this part of your OS becomes portable to other architectures, including ones like ARM without coherent instruction caches where that builtin expands to a actual instructions. (On x86 it purely affects the optimizer).
read(basesector+sector);
It would probably be a good idea for your read function to take a destination pointer to read into, so you don't need to bounce data through your readOut buffer.
Also, I don't see why you'd want to declare your code as a 2D array; sectors are an artifact of how you're doing your disk I/O, not relevant to using the code after it's loaded. The sector-at-a-time thing should only be in the code for the loop that loads the data, not visible in other parts of your program.
char code[sectors * 512]; would be good.

SPARC assembly jmp \boot

I'll explain the problem briefly. I have a Leon3 board (gr-ut-g99). Using GRMON2 I can load executables at the desired address in the board.
I have two programs. Let's call them A and B. I tried to load both in memory and individually they work.
What I would like to do now is to make the A program call the B program.
Both programs are written in C using a variant of the gcc compiler (the Gaisler Sparc GCC).
To do the jump I wrote a tiny inline assembler function in program A that jumps to a memory address where I loaded the program B.
below a snippet of the program A
unsigned int return_address;
unsigned int * const RAM_pointer = (unsigned int *) RAM_ADDRESS;
printf("RAM pointer set to: 0x%08x \n",(unsigned int)RAM_pointer);
printf("jumping...\n");
__asm__(" nop;" //clean the pipeline
"jmp %1;" // jmp to programB
:"=r" (return_address)
:"r" (RAM_pointer)
);
RAM_ADDRESS is a #define
#define RAM_ADDRESS 0x60000000
The program B is a simple hello world. The program B is loaded at the 0x60000000 address. If I try to run it, it works!
int main()
{
printf ("HELLO! I'M BOOTED! \n");
fflush(stdout);
return 0;
}
What I expect when I run the ProgramA, is to see the "jumping..." message on the console and then see the "HELLO! I'M BOOTED!" from the programB
What happens instead an IU exception.
Below I posted the messages show by grmon2 monitor. I also reported the "inst" report which should show the last operations performed before the exception.
grmon2> run
IU exception (tt = 0x07, mem address not aligned)
0x60004824: 9fc04000 call %g1
grmon2> inst
TIME ADDRESS INSTRUCTION RESULT SYMBOL
407085 600047FC mov %i3, %o2 [600063B8] -
407086 60004800 cmp %i4 [00000013] -
407089 60004804 be 0x60004970 [00000000] -
407090 60004808 mov %i0, %o0 [6000646C] -
407091 6000480C mov %i4, %o3 [00000013] -
407092 60004810 cmp %i4, %l0 [80000413] -
407108 60004814 bleu 0x60004820 [00000000] -
407144 60004818 ld [%i1 + 0x20], %o1 [FFFFFFFF] -
407179 60004820 ld [%i1 + 0x28], %g1 [FFFFFFFF] -
407186 60004824 call %g1 [ TRAP ] -
I also tried to substitute the "jmp" with a "jmpl" or a "call" but it does not worked.
I'm quite confused.
I do not know how to cope well with the problem and therefore I do not know what other information it is necessary to provide.
I can say that, the programB is loaded at 0x60000000 and the entry_point is, of course, 0x60000000. Running directly program B from that entry point it works good!
Thanks in advance for your help!
Looks to me like you did execute the jump, and it got to program B, as evidenced by the addresses of the instructions in the trace buffer. But where you crashed was in stdio trying to print stuff. Stdio makes extensive use of function pointers, and the sequence clearly shows a call instruction with the target address in a register, which indicates use of a function pointer.
I suggest putting fflush(stdout) in program A just before the jump, and this will allow you to see the messages before doing the jump. Then, in program B, instead of using printf, just put some known value in memory that you can examine later via the monitor to verify that it got there.
My guess is that the stdio library has some data or parameter that needs to be set up at the start of the program, and that's not being done or not done properly. Not sure about the platform you are running on, but do you have some sort of debugging or single stepping ability, like in a debugger? If so, just single step through the jump and follow where the program goes.

does read / write atomicity hold on multi-computer memory?

I'm working in a multi-computer environment where 2 computers will access the same memory over a 32bit PCI bus.
The first computer will only write to the 32bit int.
*int_pointer = number;
The second computer will only read from the 32bit int.
number = *int_pointer;
Both OS's / CPU's are 32bit architecture.
computer with PCI is Intel Based.
computer on the PCI card is power PC.
The case I am worried about is if the write only computer is changing the variable at the same time the read computer is reading it, causing invalid data for the read computer.
I'd like to know if the atomicity of reading / writing to the same position in memory is preserved over multiple computers.
If so, would the following prevent the race condition:
number = *int_pointer;
while(*int_pointer != number) {
number = *int_pointer;
}
I can guarantee that writes will occur every 16ms*, and reads will occur randomly.
*times will drift since both computers have different timers.
There isn't enough information to answer this definitively.
If either of the CPUs decide to cache the memory at *int_pointer, this will always fail and your extra code will not fix any race condition.
Assuming both CPUs know that memory at *int_pointer is uncacheable, AND that location *int_pointer is aligned on a 4-byte/32-bit boundary AND the CPU on the PCI card has a 32 bit memory interface, AND the pointer is declared as a pointer to volatile AND both your C compilers implement volatile correctly, THEN it is likely that the updates will be atomic.
If any of the above conditions are not met, the result will be unpredictable and your "race detection" code is not likely to work.
Edited to explain why volatile is needed:
Here is your race condition code compiled to MIPS assembler with -O4 and no volatile qualifier. (I used MIPS because the generated code is easier to read than x86 code):
int nonvol(int *ip) {
int number = *ip;
while (*ip != number) {
number = *ip;
}
return number;
}
Disassembly output:
00000000 <nonvol>:
0: 8c820000 lw v0,0(a0)
4: 03e00008 jr ra
The compiler has optimized the while loop away since it knows that *ip cannot change.
Here's what happens with volatile and the same compiler options:
int vol(volatile int *ip) {
int number = *ip;
while (*ip != number) {
number = *ip;
}
return number;
}
Disassembly output:
00000008 <vol>:
8: 8c820000 lw v0,0(a0)
c: 8c830000 lw v1,0(a0)
10: 1443fffd bne v0,v1,8 <vol>
14: 00000000 nop
18: 03e00008 jr ra
Now the while loop is not optimized away because using volatile has told the compiler that *ip can change at any time.
To answer my own question:
There is no race condition. The memory controller on the external PCI device will handle all memory read/write operations. Since all data carriers are at least 32bits wide there will not be a race condition within those 32bits.
Since the transfers use a DMA controller the interactions between memory and processors is well behaved.
See Direct Memory Access -PCI

Using GCC inline assembly with instructions that take immediate values

The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruciton requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) svc(char code)
{
asm volatile ("scv r0");
}
but that, of course, does not work as SVC requires a register argument.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work - There's something I'm doing wrong here; perhaps my instruction isn't properly aligned or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I have to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) svc(char code)
{
asm volatile ("scv %1"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?
You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constaints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There's also some good ARM-specific gcc docs here. A couple of pages down under the heading "Input and output operands" it provides a table of all the ARM constraints
What about using a macro:
#define SVC(i) asm volatile("svc #"#i)
As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, for thumb instructions at least.
My solution ("Generating the instruction encoding on the fly"):
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs (or 1 -> thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works and its much better than switch-case variant, which takes ~2kb ROM for 256 svc's. This func does not have to be placed in RAM section, FLASH is ok.
You can use it if svc_num should be a runtime variable.
As discussed in this question, the operand of SVC is fixed, that is it should be known to the preprocessor, and it is different from immediate Data-processing operands.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.
I wrote one handler recently for my own toy OS on Cortex-M. Works if tasks use PSP pointer.
Idea:
Get interrupted process's stack pointer, get process's stacked PC, it will have the instruction address of instruction after SVC, look up the immediate value in the instruction. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MSR R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the address of the current instruction. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
//Value is in R0. Function can return
}

Resources