I'm trying to write a really simple spinlock mutex in C, and for some reason I'm getting cases where two threads acquire the lock at the same time, which shouldn't be possible. It's running on a multiprocessor system, which may be why there's a problem. Any ideas why it's not working?
void mutexLock(mutex_t *mutexlock, pid_t owner)
{
    int failure = 1;
    while (mutexlock->mx_state == 0 || failure || mutexlock->mx_owner != owner)
    {
        failure = 1;
        if (mutexlock->mx_state == 0)
        {
            asm(
                "movl $0x01,%%eax\n\t" // move 1 to eax
                "xchg %%eax,%0\n\t"    // try to set the lock bit
                "mov %%eax,%1\n\t"     // export our result to a test var
                : "=r"(mutexlock->mx_state), "=r"(failure)
                : "r"(mutexlock->mx_state)
                : "%eax"
            );
        }
        if (failure == 0)
        {
            mutexlock->mx_owner = owner; // test to see if we got the lock bit
        }
    }
}
Well for a start you're testing an uninitialised variable (failure) the first time the while() condition is executed.
Your actual problem is that you're telling gcc to use a register for mx_state - which clearly won't work for a spinlock. Try:
asm volatile (
    "movl $0x01,%%eax\n\t" // move 1 to eax
    "xchg %%eax,%0\n\t"    // try to set the lock bit
    "mov %%eax,%1\n\t"     // export our result to a test var
    : "=m"(mutexlock->mx_state), "=r"(failure)
    : "m"(mutexlock->mx_state)
    : "%eax"
);
Note that asm volatile is also important here, to ensure that it doesn't get hoisted out of your while loop.
The problem is that you load mx_state into a register (the "r" constraint) and then do the exchange between registers, only writing the result back into mx_state at the end of the asm block. What you want is something more like:
asm(
    "movl $0x01,%%eax\n\t" // move 1 to eax
    "xchg %%eax,%1\n\t"    // try to set the lock bit
    "mov %%eax,%0\n\t"     // export our result to a test var
    : "=r"(failure)
    : "m"(mutexlock->mx_state)
    : "%eax"
);
Even this is somewhat dangerous, as in theory the compiler could load mx_state, spill it into a local temp stack slot, and do the xchg there. It's also somewhat inefficient, as it has extra hard-coded movs that may not be needed but can't be eliminated by the optimizer. You're better off using a simpler asm that expands to a single instruction, such as
failure = 1;
asm("xchg %0,0(%1)" : "=r" (failure) : "r" (&mutex->mx_state), "0" (failure));
Note how we force the use of mx_state in place, by using it's address rather than its value.
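If you can rely on a reasonably recent GCC, the __atomic builtins get you the same single-instruction exchange without any hand-written asm. A minimal sketch (GCC/Clang builtins; the type and function names here are illustrative, not from the original code):

// Sketch: the same spinlock using GCC's __atomic builtins (GCC 4.7+).
typedef struct { volatile char state; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    // __atomic_test_and_set returns the previous value: nonzero means
    // another thread already holds the lock, so keep spinning.
    while (__atomic_test_and_set(&l->state, __ATOMIC_ACQUIRE))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    __atomic_clear(&l->state, __ATOMIC_RELEASE);
}

On x86 this compiles down to an xchg-style atomic read-modify-write, and the compiler inserts whatever barriers the requested memory order needs.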
Background
I have a custom bare-metal mutex primitive written for the STM32F7 (Arm Cortex-M7) processor, per the Barrier and Litmus Test Cookbook from ARM, using the LDREX and STREX instructions. I use it to control critical sections in my code.
It seems to work well across multiple IRQs with differing priorities, and it handles inversion deadlock with a timeout loop. If the lock spins 100 times on acquisition, I assume it's held by a lower-priority context and break out with an error return flag, returning from the IRQ context. This means the critical section for that IRQ instance never runs, though.
Question
I'm wondering if I missed anything in the CMSIS HAL or STM32F7 reference/programming manual that would allow me (or help me) to easily pause execution in the blocked higher priority IRQ, switch context to the lower priority one, finish execution and free the lock, then return to the higher priority one?
Solutions I've considered/tried
Obviously, just switching to an RTOS. But that's not an option: it's an existing codebase that I don't entirely own.
I read the sections on "Exception entry and return", etc., and could maybe do it manually with the stack. It seems complex, though, and I'd have to keep track of which context actually holds the lock.
Ditch the timeout loop and use a separate timer, checking for priority inversion by examining the stack, and raising the lower-priority IRQ to allow it to complete and stop blocking the higher-priority IRQ. (Complex, and approaching the territory of just writing a scheduler.)
Using the WFE and SEV instructions to give up context. I'm not 100% sure, but I don't think these will work the way I'd want; they seem aimed more at multiprocessor systems?
Accept that this is as good as it gets without significantly more effort.
Mutex code and Usage
Compiled with gcc-arm-none-eabi using -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard -mthumb.
static inline int acquireLock(unsigned int *lock)
{
    unsigned int tempStore = 0;
    unsigned int lockFlag = 1;
    unsigned int timeout = 0;
    unsigned int result = 0;
    __asm__ volatile(
        "Loop1%=: \n\t"                       // label for main spinlock loop
        // Lock acquisition spin loop.
        "add %[tim], %[tim], #1 \n\t"         // add 1 to timeout counter
        "ldrex %[ts], %[lock] \n\t"           // read lock's current state
        "cmp %[ts], #0 \n\t"                  // check if 0 (lock is available)
        "it eq \n\t"                          // only try to store if lock is clear
        "strexeq %[ts], %[lf], %[lock] \n\t"  // try to grab lock if it is available
        // Loop exit logic block.
        "cmp %[ts], #0 \n\t"                  // check: did we get the lock?
        "beq Loop2%= \n\t"                    // if we got the lock, quit loop
        "cmp %[tim], #100 \n\t"               // else, check timeout counter
        "bgt Loop2%= \n\t"                    // quit loop if timeout > 100
        "b Loop1%= \n\t"                      // else go back to start of spin loop
        // Check and set return value (success) of lock acquisition.
        "Loop2%=: \n\t"                       // label for loop exit
        "cmp %[ts], #0 \n\t"                  // check if we got the lock (vs. timeout)
        "ite eq \n\t"                         // conditional store of return value
        "moveq %[res], #0 \n\t"               // return 0 if we got the lock
        "movne %[res], #1 \n\t"               // else return 1 if we timed out
        "dmb \n\t"                            // memory barrier for later RMWs
        : [lock] "+m"(*lock), [ts] "+l"(tempStore), [tim] "+l"(timeout),
          [res] "=l"(result)
        : [lf] "l"(lockFlag)
        : "memory");
    return result;
}
and an example of usage, in an IRQ context:
if (acquireLock(&lock) == 0)
{
    something_critical++;
    releaseLock(&lock);
}
else
{
    return;
}
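For reference, releaseLock() isn't shown in the post; on this single-core M7 a minimal sketch would presumably be a barrier followed by a plain store, since only the owner ever writes the lock back to zero:

static inline void releaseLock(unsigned int *lock)
{
    __asm__ volatile("dmb" ::: "memory"); // drain the critical section's accesses first
    *lock = 0;                            // plain store releases the lock
}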
Deadlock Example
If it helps, here's a backtrace of the deadlock when I disable the timeout counter in the spinlock loop. You can see TIM6 preempted execution of TIM7 while it was in the process of releasing the lock (but hadn't completed yet).
I am trying to check if the VMX hardware extensions are supported by the processor using inline assembly. I have tried the following two ways of doing it:
Method 1:
int vmx_support(void) {
    int get_vmx_support, vmx_bit;
    asm volatile ("mov $1, %eax");
    asm volatile ("mov $0, %ecx");
    asm volatile ("cpuid");
    asm volatile ("mov %%ecx, %0\n\t" : "=r" (get_vmx_support) : : "memory");
    vmx_bit = (get_vmx_support >> 5) & 1;
    if (vmx_bit == 1) {
        return 1;
    } else {
        return 0;
    }
}
Method 2:
int vmx_support(void) {
    unsigned int eax, ebx, ecx, edx;
    unsigned int vmx_bit;
    eax = 1;
    ecx = 0;
    asm volatile("cpuid"
                 : "=a" (eax),
                   "=b" (ebx),
                   "=c" (ecx),
                   "=d" (edx)
                 : "0" (eax), "2" (ecx)
                 : "memory");
    vmx_bit = (ecx >> 5) & 1;
    if (vmx_bit == 1) {
        return 1;
    } else {
        return 0;
    }
}
When I try to execute vmx_support() from Method 1 inside a kernel module, Ubuntu freezes completely when I do insmod vmx.ko and I have to restart it to get it back. When I try to execute vmx_support() from Method 2 inside the kernel module, it executes and shows [VMX] vmx is supported. on dmesg | tail.
Also, when I try to run vmx_support() from Method 1 as a userspace program, it executes and prints [VMX] vmx is supported. as output to the console.
Question: Why does the code from Method 1 freeze Ubuntu whereas the code from Method 2 does not? Also, is there a safer way to test and debug code that uses inline assembly (that is, a way to avoid freezes, for example)?
Links to Makefile, kernel module and userspace program can be found here:
Makefile
vmx.c (kernel module; the code from Method 2 is commented out inside. Uncomment it, and comment out the code from Method 1, to see how it works.)
vmx_sup.c (userspace program)
Method 1 has several problems, but the one causing the hang is undoubtedly that it changes ebx without telling the compiler: cpuid writes eax, ebx, ecx, and edx, yet nothing in Method 1 declares that. More generally, you can't split dependent instructions across separate asm statements and expect register values to survive between them; the compiler is free to insert its own code in between. In your user-mode program, ebx probably didn't happen to hold anything important, but in the kernel module it apparently contains something critical.
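If hand-rolled cpuid asm isn't a hard requirement, GCC ships <cpuid.h>, which declares the ebx clobber (and everything else) for you. A user-space sketch:

#include <cpuid.h>
#include <stdio.h>

// Check CPUID leaf 1, ECX bit 5 (VMX), via GCC's helper.
int vmx_support(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;            // CPUID leaf 1 unsupported
    return (ecx >> 5) & 1;   // ECX bit 5 = VMX
}

int main(void) {
    printf("[VMX] vmx is %ssupported.\n", vmx_support() ? "" : "not ");
    return 0;
}

Inside the kernel, the x86 headers provide their own helpers (cpuid_ecx(), for example), which are the safer route there.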
I want to call a system call (prctl) in inline assembly and retrieve its result, but I cannot make it work.
This is the code I am using:
int install_filter(void)
{
    long int res = -1;
    void *prg_ptr = NULL;
    struct sock_filter filter[] = {
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
        /* If a trap is not generated, the application is killed */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };
    prg_ptr = &prog;
    no_permis();
    __asm__ (
        "mov %1, %%rdx\n"
        "mov $0x2, %%rsi \n"
        "mov $0x16, %%rdi \n"
        "mov $0x9d, %%rax\n"
        "syscall\n"
        "mov %%rax, %0\n"
        : "=r"(res)
        : "r"(prg_ptr)
        : "%rdx", "%rsi", "%rdi", "%rax"
    );
    if (res < 0) {
        perror("prctl");
        exit(EXIT_FAILURE);
    }
    return 0;
}
The address of the filter should be the input (prg_ptr) and I want to save the result in res.
Can you help me?
For inline assembly, you don't use movs like this unless you have to, and even then you have to do ugly shuffling. That's because you have no idea which registers the arguments arrive in. Instead, you should use:
__asm__ __volatile__ ("syscall" : "=a"(res) : "d"(prg_ptr), "S"(0x2), "D"(0x16), "a"(0x9d) : "rcx", "r11", "memory");
I also added __volatile__, which you should use for any asm with side effects other than its outputs, and a "memory" clobber (a compiler memory barrier), which you should use for any asm that has side effects on memory or that must not be reordered with respect to memory accesses. The "rcx" and "r11" clobbers are there because the syscall instruction itself overwrites those two registers. It's good practice to always use __volatile__ and the memory clobber for syscalls unless you know you don't need them.
If you're still having problems, use strace to observe the syscall attempt and see what's going wrong.
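For what it's worth, the constants in that asm are 0x9d = 157 = __NR_prctl, 0x16 = 22 = PR_SET_SECCOMP, and 0x2 = SECCOMP_MODE_FILTER, so unless inline assembly is the point of the exercise, the libc wrapper does the same job with none of the constraint pitfalls:

#include <sys/prctl.h>
#include <linux/seccomp.h>

/* Drop-in replacement for the asm block inside install_filter(): */
res = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);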
The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
SVC{<c>}{<q>} {#}<imm8>
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruction requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
    switch (code) {
    case SVC_YIELD:
        svc_yield();
        break;
    // Other cases follow
    }
}
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) void svc(char code)
{
    asm volatile ("svc r0");
}
but that, of course, does not work: SVC takes an immediate, not a register argument.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
{
    switch (code) {
    case 0:
        asm volatile("svc #0");
        break;
    case 1:
        asm volatile("svc #1");
        break;
    /* 253 cases omitted */
    case 255:
        asm volatile("svc #255");
        break;
    }
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
    asm volatile (
        "orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
        "push {r1, r0}  \n\t" // Store the instruction to RAM (on the stack)
        "mov r0, sp     \n\t" // Copy the stack pointer to an ordinary register
        "add r0, #1     \n\t" // Add 1 to the address to specify THUMB mode
        "bx r0          \n\t" // Branch to newly created instruction
        "pop {r1, r0}   \n\t" // Restore the stack
        "bx lr          \n\t" // Return to caller
    );
}
but this just doesn't feel right either. Also, it doesn't work: there's something I'm doing wrong here; perhaps my instruction isn't properly aligned, or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I hate to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) void svc(char code)
{
    asm volatile ("svc %0"
        : /* No outputs */
        : "i" (code) // Imaginary directive specifying an immediate argument
                     // as opposed to the conventional "r"
    );
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?
You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constraints are; you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There are also some good ARM-specific gcc docs here. A couple of pages down, under the heading "Input and output operands", there's a table of all the ARM constraints.
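For example (my illustration of the macro above; note the caveat about the "I" constraint discussed further down), with a compile-time constant the operand folds straight into the instruction:

#define SVC_YIELD 0
#define SVC(code) asm volatile ("svc %0" : : "I" (code))

void yield(void)
{
    SVC(SVC_YIELD); // emits a single "svc #0"
}

Passing anything that isn't a compile-time constant fails to compile, which is arguably what you want for SVC numbers.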
What about using a macro:
#define SVC(i) asm volatile("svc #"#i)
As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris's answer above is the best, as it uses an immediate value, which is what's required, at least for Thumb instructions.
My solution ("Generating the instruction encoding on the fly"):
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)

void svc_call(uint32_t svc_num)
{
    uint16_t instrs[2];
    instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num); // svc #svc_num
    instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);         // bx lr
    // Branch to the buffer; setting bit 0 of the address selects Thumb mode.
    ((void (*)(void))((uint32_t)instrs | 1))();
}
It works and it's much better than the switch-case variant, which takes ~2 KB of ROM for 256 SVCs. This function does not have to be placed in a RAM section; flash is fine.
You can use it if svc_num needs to be a runtime variable.
As discussed in this question, the operand of SVC is fixed at translation time; that is, it has to be known to the assembler, and it is different from the immediate operands of data-processing instructions.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.
I wrote a handler for this recently, for my own toy OS on Cortex-M. It works if tasks use the PSP stack pointer.
Idea:
Get the interrupted process's stack pointer and its stacked PC, which holds the address of the instruction after the SVC; then read the immediate value out of the SVC instruction itself. It's not as hard as it sounds:
uint8_t __attribute__((naked)) get_svc_code(void)
{
    __asm volatile("MRS R0, PSP");    // Get Process Stack Pointer (we're in the SVC ISR, so MSP is currently in use)
    __asm volatile("ADD R0, #24");    // R0 now points to the stacked process PC
    __asm volatile("LDR R1, [R0]");   // Address of the instruction after the SVC is in R1
    __asm volatile("SUB R1, R1, #2"); // Subtract 2 bytes: R1 now holds the address of the SVC instruction itself
    __asm volatile("LDRB R0, [R1]");  // Load the lower byte of the 16-bit instruction into R0: the immediate value
    __asm volatile("BX LR");          // Naked function: return explicitly, with the value in R0
}
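To tie it together, a sketch of the dispatch (assuming a CMSIS-style vector table where SVC_Handler is the exception entry, and the handleSVC() dispatcher from the question):

void SVC_Handler(void)
{
    // Runs on MSP; get_svc_code() digs the immediate out of the PSP frame.
    handleSVC((char)get_svc_code());
}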
I would like to know whether it is possible to ensure that the marked line below is executed atomically, given that it could be executed by both the ISR and the main context. I'm working on an ARM9 (LPC313x) and using RealView 4 (armcc).
foo() {
..
stack_var = ++volatile_var; // line
..
}
I'm looking for any routine like _atomic_ for the C166, direct assembly code, etc. I would prefer not to have to disable interrupts.
Thank you very much.
No, I don't think you can ever expect ++volatile_var to be atomic, even without the assignment. Use a proper atomic primitive for that. If your compiler doesn't provide such an extension, you can easily find short inline assembler for it on the web. The assembler instructions are called ldrex and strex for atomic exchange on ARM, I think.
Edit: it seems that the specific processor type that is asked for in the question does not implement these instructions.
Edit: The following should work with gcc, for another compiler one probably has to adapt the __asm__ parts.
inline
size_t arm_ldrex(size_t volatile *ptr) {
    size_t ret;
    __asm__ volatile ("ldrex %0,[%1]\t# load exclusive\n"
                      : "=&r" (ret)
                      : "r" (ptr)
                      : "cc", "memory");
    return ret;
}

inline
_Bool arm_strex(size_t volatile *ptr, size_t val) {
    size_t error;
    __asm__ volatile ("strex %0,%1,[%2]\t# store exclusive\n"
                      : "=&r" (error)
                      : "r" (val), "r" (ptr)
                      : "cc", "memory");
    return !error;
}

inline
size_t atomic_add_fetch(size_t volatile *object, size_t operand) {
    for (;;) {
        size_t oldval = arm_ldrex(object);
        size_t newval = oldval + operand;
        if (arm_strex(object, newval)) return newval;
    }
}
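With these helpers, the line from the question becomes a one-liner (assuming volatile_var is declared as size_t volatile, and an LDREX-capable core, which as noted the ARM9 is not):

stack_var = atomic_add_fetch(&volatile_var, 1); /* atomic ++volatile_var */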
From a quick look, the C166 _atomic_ macro seems to utilize an instruction that effectively masks interrupts for the duration of a specified number of instructions.
There is nothing directly corresponding to that in the ARM architecture.
You could of course use the swp instruction (or the __swp intrinsic in the RealView toolchain) to implement a lock around the critical section. The ldrex/strex instructions mentioned in another answer do not exist in ARM architecture version 5, which is what the ARM9 processors implement.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/CJAHDCHB.html and http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Chdbbbai.html respectively.
A simplistic lock implementation around this (using the RealView toolchain) would be:
{
    /* Loop until lock acquired */
    while (__swp(LOCKED, &lockvar) == LOCKED);
    ..
    /* Critical section */
    ..
    lockvar = UNLOCKED;
}
However, this will lead to deadlock in the ISR context when the Main thread is holding the lock.
I think masking interrupts around the operation is likely to be the least hairy solution, although if your Main context is executing in User mode it will require a system call to implement.
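For completeness, a sketch of that masking approach with the armcc intrinsics; that __disable_irq() returns the previous mask state is taken from the ARM compiler documentation, so verify it against your toolchain version:

/* Sketch: atomic pre-increment by briefly masking IRQs (armcc intrinsics). */
unsigned int atomic_preincrement(volatile unsigned int *var)
{
    int was_masked = __disable_irq(); /* returns the prior IRQ mask state */
    unsigned int result = ++*var;     /* the read-modify-write, now uninterruptible */
    if (!was_masked)
        __enable_irq();               /* re-enable only if IRQs were on before */
    return result;
}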