Multiple Threads Accessing Same Register Value ARM Assembly - c

I'm working with some ARM code experimenting with multiple threads which need to access the same register. I'm using C with asm calls. However, I keep running into a bus error. Here's an example of what I mean:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
int someVar = 0;
void setup(){
__asm__("LDR R7, =someVar\n\t"); // load someVar into R7
}
void loadAction(){
__asm__("LDREX R1, [R7]\n\t");
}
int main(){
setup();
loadAction();
}
This works totally fine.
However, when I introduce threads like this, a bus error results:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
int someVar = 0;
void setup(){
__asm__("LDR R7, =someVar\n\t"); // load someVar into R7
}
void *loadAction(void *threadArg){
__asm__("LDREX R1, [R7]\n\t");
}
int main(){
pthread_t tid;
setup();
int i;
for (i = 0; i < 1; i++){
pthread_create(&tid, NULL, loadAction, (void *)&tid);
}
pthread_exit(NULL);
return 0;
}
My best guess for this issue is that the value in R7 is invalid because registers are not guaranteed to be preserved across subroutine calls. Perhaps in the first example, I'm just getting lucky and the value in R7 that is placed in setup() happens to remain, but the thread code causes the value in R7 to be clobbered.
If this is the case, is there any way that I could preserve R7? I could save and store to the stack, but multiple threads will be accessing it at once. Is there some sort of compilation flag I could pass in with gcc to ensure that the value of R7 loaded in setup() is accessed in loadAction()?
Thanks

Every thread has its own register values. That is actually what makes a thread a thread. If two threads shared their register values (especially PC and SP) they would be the same thread.
And yes, registers generally aren't preserved across subroutine calls. The compiler uses them to store every value your code uses - they are not some special unusual thing that you can only access with inline assembly code. Depending on the calling convention being used in the program, the compiler may be obligated to save the old value of certain registers that it decides to use, and restore them back before the subroutine returns.
According to the linked Wikipedia page, on 32-bit ARM, r7 is one of those registers the compiler has to save and restore.
In this case the compiler hasn't decided to use r7 in the setup function (because there is no actual C code in it to compile). If setup did contain a bunch of C code and the compiler decided to use r7, it would save the old value at the beginning, use the register in the middle, and restore the old value at the end. Your load to r7 would then overwrite whatever value the compiler thought was stored there, breaking the C code, and by the time loadAction ran on the same thread the old value would have been put back in r7.
There is a way to preserve a register in the C language and it's called a variable.
Instead of this:
// wrong code
void setup(){
__asm__("LDR R7, =someVar\n\t"); // load someVar into R7
}
void *loadAction(void *threadArg){
__asm__("LDREX R1, [R7]\n\t");
}
if you write it like this:
int *pSomeVar;
void setup(){
pSomeVar = &someVar; // load someVar into pSomeVar
}
void *loadAction(void *threadArg){
int value = *pSomeVar;
return NULL; // a function declared to return void* must return a value
}
then the compiler will do whatever it takes to make sure that value gets from setup to loadAction.
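The same idea extends to threads: hand each thread the address through its argument instead of through a register. The following is a minimal sketch, not the asker's code. The names some_var, load_action and run_threads are hypothetical, and it swaps the hand-written LDREX for C11 atomics, which a compiler for ARMv7 typically implements with an exclusive load/store pair.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int some_var = 0;   /* shared between all threads */

/* Thread entry point: the address arrives through the thread argument,
   not through a register that was set up on another thread. */
static void *load_action(void *thread_arg)
{
    atomic_int *p = thread_arg;
    atomic_fetch_add(p, 1);       /* atomic read-modify-write */
    return NULL;
}

/* Hypothetical helper: start n threads (at most 16), wait for all of
   them, and return the final value of the shared variable. */
int run_threads(int n)
{
    pthread_t tid[16];
    assert(n >= 0 && n <= 16);

    atomic_store(&some_var, 0);
    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&tid[i], NULL, load_action, &some_var);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    return atomic_load(&some_var);
}
```

Each thread increments the variable once, so run_threads(4) should return 4. Joining the threads before reading the result avoids relying on pthread_exit in main.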

Related

will gcc optimization remove for loop if it's only one iteration?

I'm writing a real-time DSP processing library.
My intention is to let the user define the block size of the input samples, while still getting the best possible performance for sample-by-sample processing, that is, a block size of a single sample.
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
This leads me to a question:
Will gcc compiler optimize this code
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo()
}
or
//.h
#define BLOCKSIZE 1
//.c
for (volatile int i=0; i<BLOCKSIZE; i++)
{
foo()
}
to be same as simply calling body of the loop:
foo()
?
Thx
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, i.e. volatile uint8_t*. There is no need to make the pointer itself volatile, i.e. uint8_t* volatile //wrong.
As things stand now, you force the compiler to create a variable i in memory and increment it, which will likely block loop unrolling optimizations.
Trying your code on gcc x86 with -O3, this is exactly what happens. No matter the size of BLOCKSIZE, it still generates the loop because of volatile. If I drop volatile, it completely unrolls the loop up to BLOCKSIZE == 7 and replaces it with a number of function calls. Beyond 8 it creates a loop (but keeps the iterator in a register instead of RAM).
x86 example:
for (int i=0; i<5; i++)
{
foo();
}
gives
call foo
call foo
call foo
call foo
call foo
But
for (volatile int i=0; i<5; i++)
{
foo();
}
gives the far less efficient
mov DWORD PTR [rsp+12], 0
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jg .L2
.L3:
call foo
mov eax, DWORD PTR [rsp+12]
add eax, 1
mov DWORD PTR [rsp+12], eax
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jle .L3
.L2:
For further study of the correct use of volatile in embedded systems, please see:
How to access a hardware register from firmware?
Using volatile in embedded C development
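The pointer-to-volatile declaration recommended in this answer can be sketched as follows. This is an illustration under assumptions: read_block and input_reg are hypothetical names, and the "register" is whatever address the caller passes in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads block_size samples from a (hypothetical) input data register.
   Only the pointed-to data is volatile; the pointer itself and the
   loop counter are ordinary variables the compiler may optimize. */
void read_block(volatile const uint8_t *input_reg, uint8_t *out,
                size_t block_size)
{
    assert(input_reg != NULL && out != NULL);
    for (size_t i = 0; i < block_size; i++) {
        out[i] = *input_reg;   /* every read of the register is performed */
    }
}
```

Because i is not volatile, the compiler is free to unroll the loop or drop it for a constant block size of 1, while each access through input_reg is still kept.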
Since the loop variable is volatile, the compiler shouldn't optimize it away. The compiler cannot know whether i will be 1 when the condition is evaluated, so it has to keep the loop.
From the compiler's point of view, the loop can run an indeterminate number of times until the condition is satisfied.
If you access hardware registers somewhere, then those should be declared volatile. That makes more sense to the reader and also allows the compiler to apply appropriate optimizations where possible.
The volatile keyword tells the compiler that the variable is prone to side effects, i.e. it can be changed by something that is not visible to the compiler.
Because of that, volatile variables have to be read before every use and written back to their permanent storage location after every modification.
In your example the loop cannot be optimized because the variable i can be changed during the loop (for example, some interrupt routine could change it to zero, so the loop would have to be executed again).
The answer to your question is: if the compiler can determine that every time you enter the loop it will execute only once, then it can eliminate the loop.
Normally, the optimization phase unrolls loops based on how the iterations relate to one another. This makes your (e.g. indefinite) loop several times bigger in exchange for fewer backward jumps (which normally cause a bubble in the pipeline, depending on the CPU type), but not so much bigger that cache hits are lost. So it is a bit complicated, but the gains are huge. If your loop always executes exactly once, it is normally because the test you wrote is always true (a tautology) or always false (an impossibility) and can be eliminated. That makes the backward jump unnecessary, so there is no loop anymore.
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo(); // you missed a semicolon here.
}
In your case, the variable is assigned a value that is never touched again, so the first thing the compiler is going to do is replace every use of your variable with the literal you assigned to it. (Lacking context, I assume blockSize is a local automatic variable that is not changed anywhere else.) Your code changes into:
for (volatile int i=0; i<1; i++)
{
foo();
}
The next step is that volatile is not necessary, as its scope is the block body of the loop, where it is not used, so the loop can be replaced by a sequence of code like the following:
do {
foo();
} while (0);
And this code, in turn, can be replaced by:
foo();
The compiler analyses the graph of dependencies between data and variables. When a variable is not needed anymore (it is not used later in the program, or it goes out of scope), assigning a value to it is unnecessary, so that code is eliminated. If you have your compiler compile a for loop that counts from 1 to 2^64 and then stops, and you enable optimization, you will see the loop being thrown away, and you may get the false idea that your processor is capable of counting from 1 to 2^64 in less than a second. That is not true: 2^64 is still far too big a number to count to in less than a second. And that is not a single fixed-pass loop like yours, but the calculations done in the program are of no use, so the compiler eliminates them.
Just test the following program (in this case it is not a test of a just one pass loop, but 2^64-1 executions):
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
uint64_t low = 0UL;
uint64_t high = ~0UL;
uint64_t data = 0; // this data is updated in the loop body.
printf("counting from %lu to %lu\n", low, high);
alarm(10); /* security break after 10 seconds */
for (uint64_t i = low; i < high; i++) {
#if 0
printf("data = %lu\n", data = i ); // either here...
#else
data = i; // or here...
#endif
}
return 0;
}
(You can change the #if 0 to #if 1 to see how the optimizer doesn't eliminate the loop when you need to print the results, but you see that the program is essentially the same, except for the call to printf with the result of the assignment)
Just compile it with/without optimization:
$ cc -O0 pru.c -o pru_noopt
$ cc -O2 pru.c -o pru_optim
and then run it under time:
$ time pru_noopt
counting from 0 to 18446744073709551615
Alarm clock
real 0m10,005s
user 0m9,848s
sys 0m0,000s
while running the optimized version gives:
$ time pru_optim
counting from 0 to 18446744073709551615
real 0m0,002s
user 0m0,002s
sys 0m0,002s
(Impossible: not even the best computer can count one by one up to that number in less than 2 milliseconds.) So the loop must have gone somewhere. You can check the assembler code. As the updated value of data is not used after the assignment, the loop body can be eliminated, so the 2^64-1 executions of it can also be eliminated.
Now add the following line after the loop:
printf("data = %lu\n", data);
You will then see that, even with the -O3 option, the loop is left untouched, because the value after all the assignments is used after the loop.
(I preferred not to show the assembler code, and remain in high level, but you can have a look at the assembler code and see the actual generated code)

Inject a memory exception by accessing forbidden memory location

I want to test an exception handler function that I have written for an embedded system, and I want to write test code that injects an access to forbidden memory.
void Test_Mem_exception(void)
{
__asm volatile(
"LDR R0, =0xA0100000\n\t"
"LDR R1, =0x77777777\n\t" // MOV cannot encode this 32-bit immediate
"STR R1, [R0,#0]"
: : : "r0", "r1", "memory" // tell the compiler what the asm changes
);
}
This is the code I want to write; it accesses the memory location at 0xA0100000. Somehow this does not seem like generic test code to me.
Is there a standard way of writing such test code in C or C++? By generic I mean code that is independent of the memory map of the system it runs on.
I wouldn't use asm for this; simply use a pointer.
void Test_Mem_exception(void)
{
/* volatile, to suppress optimizing/removing the read statement */
volatile int *ptr = (volatile int *)0xC0C0C0C0;
int value = *ptr;
(void)value; /* silence the unused-variable warning */
}
This won't always result in an exception, because reading from a given address (even address 0) can be valid on some systems.
The same applies to any other address: there is no address that will fail on all systems.

Code execution exploit Cortex M4

For testing the MPU and playing around with exploits, I want to execute code from a local buffer running on my STM32F4 dev board.
int main(void)
{
uint16_t func[] = { 0x0301f103, 0x0301f103, 0x0301f103 };
MPU->CTRL = 0;
unsigned int address = (void*)&func+1;
asm volatile(
"mov r4,%0\n"
"ldr pc, [r4]\n"
:
: "r"(address)
);
while(1);
}
In main, I first turn off the MPU. My instructions are stored in func. In the asm part I load the address (0x2001ffe8 + 1 for Thumb) into the program counter register. When stepping through the code with GDB, the correct value is stored in R4 and then transferred to the PC register. But then I end up in the HardFault handler.
Edit:
The stack looks like this:
0x2001ffe8: 0x0301f103 0x0301f103 0x0301f103 0x2001ffe9
The instructions are correct in memory. The Definitive Guide to Cortex says region 0x20000000–0x3FFFFFFF is the SRAM and "this region is executable, so you can copy program code here and execute it".
You are assigning 32-bit values to a 16-bit array.
Your instructions don't terminate; execution continues into whatever is found in RAM, so that will crash.
You are not loading the address of the array into the program counter; you are loading the first item in the array into the program counter. This will crash, because you created a level of indirection.
Look at the BX instruction for this rather than ldr pc.
You did not declare the array as static, so the array can be optimized out as dead and unused, which can cause a crash.
The compiler should also complain that you are assigning a void* to an unsigned variable, so a typecast is wanted there.
As a habit I recommend address |= 1 rather than += 1; in this case either will function.

How to fix a Hook in a C program (stack's restoration)

It's a kind of training task, because nowadays these methods (I guess) don't work anymore.
Win XP and the MinGW compiler are used. No special compiler options are involved (just gcc given a single source file).
First of all, we save the address used to exit from the program and jump to some Hook function:
// Our system uses 4 bytes for addresses.
typedef unsigned long int DWORD;
// To save an address of the exit from the program.
DWORD addr_ret;
// An entry point.
int main()
{
// To make a direct access to next instructions.
DWORD m[1];
// Saving an address of the exit from the program.
addr_ret = (DWORD) m[4];
// Replacing the exit from the program with a jump to some Hook function.
m[4] = (DWORD) Hook;
// Status code of the program's execution.
return 0;
}
The goal of this code is to gain access to the system's privilege level, because when we return (should return) to the system, we just redirect our program to one of our own functions. The code of this function:
// Label's declaration to make a jump.
jmp_buf label;
void Hook()
{
printf ("Test\n");
// Trying to restore the stack by jumping directly (without preparing the stack) into the function (we'll see it later).
longjmp(label, 1);
// Just to make sure that we won't return here after jump's (from above) finish, because we are not getting stuck in the infinite loop.
while(1) {}
}
And finally I'll state a function which (in my opinion) should fix the stack pointer - ESP register:
void FixStack()
{
// A label to make a jump to here.
setjmp(label);
// A replacement of the exit from this function with an exit from the whole program.
DWORD m[1];
m[2] = addr_ret;
}
Of course we should use these includes for the stated program:
#include <stdio.h>
#include <setjmp.h>
The whole logic of the program works correctly in my system, but I can not restore my stack (ESP), so the program returns an incorrect return code.
Before the solution described above, I didn't use jumps or the FixStack function. These lines were in the Hook function instead of the jump and the while loop:
DWORD m[1];
m[2] = addr_ret;
But with this variant I was getting an incorrect value in the ESP register before the exit from the program (it was 8 bytes larger than the register's value on entry to the program). So I decided to somehow add these 8 bytes (avoiding any ASM code inside the C program). That is the purpose of the jump into the FixStack function with an appropriate exit from it (to remove some values from the stack). But, as I stated, the program doesn't return a correct exit status, as checked with this command:
echo %ErrorLevel%
So my question is very broad: it ranges from asking for recommendations on debugging utilities (I have only used OllyDbg) to possible solutions for the described Hook implementation.
OK, I could finally make my program work as intended. Now we can launch the compiled program (I use MinGW on Win XP) without any errors and with the correct return code.
Maybe it will be helpful for someone:
#include <stdio.h>
#include <setjmp.h>
typedef unsigned long int DWORD;
DWORD addr_ret;
int FixStack()
{
DWORD m[1];
m[2] = addr_ret;
// This line is very necessary for correct running!
return 0;
}
void Hook()
{
printf("Test\n");
FixStack();
}
int main()
{
DWORD m[1];
addr_ret = (DWORD) m[4];
m[4] = (DWORD) Hook;
}
Of course it seems that you've realized that this will only work with a very specific build environment. It most definitely won't work on a 64-bit target (because the addresses aren't DWORD-ish).
Is there any reason why you don't want to use the facilities provided by the C standard library to do exactly this? (Or something very similar to this.)
#include <stdio.h>
#include <stdlib.h>
void Hook(void)
{
printf("Test\n");
}
int main()
{
atexit(Hook); // Hook runs automatically when the program exits normally
}

I'm writing my own JIT-interpreter. How do I execute generated instructions?

I intend to write my own JIT-interpreter as part of a course on VMs. I have a lot of knowledge about high-level languages, compilers and interpreters, but little or no knowledge about x86 assembly (or C for that matter).
Actually I don't know how a JIT works, but here is my take on it: Read in the program in some intermediate language. Compile that to x86 instructions. Ensure that last instruction returns to somewhere sane back in the VM code. Store the instructions some where in memory. Do an unconditional jump to the first instruction. Voila!
So, with that in mind, I have the following small C program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main() {
int *m = malloc(sizeof(int));
*m = 0x90; // NOP instruction code
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "eax");
return 42;
}
Okay, so my intention is for this program to store the NOP instruction somewhere in memory, jump to that location and then probably crash (because I haven't setup any way for the program to return back to main).
Question: Am I on the right path?
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
Question: Other issues I should beware of?
PS: My goal is to gain understanding, not necessarily do everything the right way.
Thanks for all the feedback. The following code seems to be the place to start and works on my Linux box:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
unsigned char *m;
int main() {
unsigned int pagesize = getpagesize();
printf("pagesize: %u\n", pagesize);
m = malloc(1023+pagesize+1);
if(m==NULL) return(1);
printf("%p\n", m);
m = (unsigned char *)(((long)m + pagesize-1) & ~(pagesize-1));
printf("%p\n", m);
if(mprotect(m, 1024, PROT_READ|PROT_EXEC|PROT_WRITE)) {
printf("mprotect fail...\n");
return 0;
}
m[0] = 0xc9; //leave
m[1] = 0xc3; //ret
m[2] = 0x90; //nop
printf("%p\n", m);
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "ebx");
return 21;
}
Question: Am I on the right path?
I would say yes.
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
I haven't got any code for you, but a better way to get to the generated code and back is to use a pair of call/ret instructions, as they will manage the return address automatically.
Question: Other issues I should beware of?
Yes - as a security measure, many operating systems would prevent you from executing code on the heap without making special arrangements. Those special arrangements typically amount to you having to mark the relevant memory page(s) as executable.
On Linux this is done using mprotect() with PROT_EXEC.
If your generated code follows the proper calling convention, then you can declare a pointer-to-function type and invoke the function this way:
typedef void (*generated_function)(void);
void *func = malloc(1024); // must also be made executable (see the mprotect() note above)
unsigned char *o = (unsigned char *)func;
generated_function func_exec = (generated_function)func; // a function pointer, not a pointer to one
*o++ = 0x90; // NOP
*o++ = 0xc3; // RET (near return; 0xcb would be a far return)
func_exec();
