I am trying to check if the VMX hardware extensions are supported by the processor using inline assembly. I have tried the following two ways of doing it:
Method 1:
int vmx_support(void) {
int get_vmx_support, vmx_bit;
asm volatile ("mov $1, %eax");
asm volatile ("mov $0, %ecx");
asm volatile ("cpuid");
asm volatile ("mov %%ecx, %0\n\t":"=r" (get_vmx_support): : "memory");
vmx_bit = (get_vmx_support >> 5) & 1;
if (vmx_bit == 1) {
return 1;
} else {
return 0;
}
}
Method 2:
int vmx_support(void) {
unsigned int eax, ebx, ecx, edx;
int vmx_bit;
eax = 1;
ecx = 0;
asm volatile("cpuid"
: "=a" (eax),
"=b" (ebx),
"=c" (ecx),
"=d" (edx)
: "0" (eax), "2" (ecx)
: "memory");
vmx_bit = (ecx >> 5) & 1;
if (vmx_bit == 1) {
return 1;
} else {
return 0;
}
}
When I try to execute vmx_support() from Method 1 inside a kernel module, Ubuntu freezes completely when I do insmod vmx.ko and I have to restart it to get it back. When I try to execute vmx_support() from Method 2 inside the kernel module, it executes and shows [VMX] vmx is supported. on dmesg | tail.
Also, when I try to run vmx_support() from Method 1 as a userspace program, it executes and prints [VMX] vmx is supported. as output to the console.
Question: Why does code from Method 1 freeze Ubuntu whereas code from Method 2 does not? Also, is there a safer way to test and debug code that uses inline assembly? (that is, avoid freezes for example)
Links to Makefile, kernel module and userspace program can be found here:
Makefile
vmx.c (kernel module. The code from method 2 is commented inside, uncomment it and comment the code from method 1 to see how it works)
vmx_sup.c (userspace program)
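Method 1 has several problems, but the one causing the hang is almost certainly that cpuid overwrites ebx (along with eax, ecx and edx) without the compiler being told. In your user-mode program, ebx probably doesn't happen to hold anything important, but in the kernel module it apparently holds something critical. Splitting the sequence across separate asm statements is also unsafe, because the compiler is free to use eax and ecx for its own purposes between them.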
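As a rough sketch of what "telling the compiler" can look like, assuming GCC or Clang on an x86 target: every register that cpuid writes is listed as an output of a single asm statement. The helper name cpuid_query is hypothetical and not part of the original code.

```c
#include <stdint.h>

/* Hypothetical helper: run CPUID for one leaf/subleaf.
 * All four registers written by CPUID are declared as outputs,
 * so the compiler never assumes their old values survive. */
static void cpuid_query(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
    __asm__ __volatile__("cpuid"
                         : "=a"(regs[0]), "=b"(regs[1]),
                           "=c"(regs[2]), "=d"(regs[3])
                         : "a"(leaf), "c"(subleaf));
}

/* Returns 1 if CPUID leaf 1 reports VMX (ECX bit 5), otherwise 0. */
static int vmx_support(void)
{
    uint32_t regs[4];

    cpuid_query(1, 0, regs);
    return (int)((regs[2] >> 5) & 1u);
}
```

Because the inputs and outputs live in one statement, the same function should behave identically in a user-space test program and in a kernel module, which makes it easier to debug outside the kernel first.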
In my program I need to insert NOP as inline assembly into a loop, and the number of NOPs can be controlled by an argument. Something like this:
char nop[] = "nop\nnop";
for(offset = 0; offset < CACHE_SIZE; offset += BLOCK_SIZE) {
asm volatile (nop
:
: "c" (buffer + offset)
: "rax");
}
Is there any way to tell compiler to convert the above inline assembly into the following?
asm volatile ("nop\n"
"nop"
:
: "c" (buffer + offset)
: "rax");
Well, there is this trick you can do:
#define NOPS(n) asm volatile (".fill %c0, 1, 0x90" :: "i"(n))
This macro inserts the desired number of nop instructions into the instruction stream. Note that n must be a compile time constant. You can use a switch statement to select different lengths:
switch (len) {
case 1: NOPS(1); break;
case 2: NOPS(2); break;
...
}
You can also do this for more code size economy:
if (len & 040) NOPS(040);
if (len & 020) NOPS(020);
if (len & 010) NOPS(010);
if (len & 004) NOPS(004);
if (len & 002) NOPS(002);
if (len & 001) NOPS(001);
Note that you should really consider using pause instructions instead of nop instructions for this sort of thing as pause is a semantic hint that you are just trying to pass time. This changes the definition of the macro to:
#define NOPS(n) asm volatile (".fill %c0, 2, 0x90f3" :: "i"(n))
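A hedged sketch of how the bit-test approach could be wrapped into one function, assuming GCC or Clang on x86. The name run_nops is hypothetical, it uses the one-byte nop variant of the macro, and it only handles lengths below 64.

```c
/* Emit n one-byte NOPs (0x90). n must be a compile-time constant. */
#define NOPS(n) __asm__ __volatile__(".fill %c0, 1, 0x90" :: "i"(n))

/* Hypothetical helper: executes `len` NOPs for 0 <= len < 64 by
 * combining power-of-two sized blocks, and returns how many ran. */
static unsigned run_nops(unsigned len)
{
    unsigned executed = 0;

    if (len & 040) { NOPS(040); executed += 040; }
    if (len & 020) { NOPS(020); executed += 020; }
    if (len & 010) { NOPS(010); executed += 010; }
    if (len & 004) { NOPS(004); executed += 004; }
    if (len & 002) { NOPS(002); executed += 002; }
    if (len & 001) { NOPS(001); executed += 001; }
    return executed;
}
```

Six blocks cover every length from 0 to 63, at the cost of a few conditional branches between the blocks.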
No, the inline asm template needs to be compile-time constant, so the assembler can assemble it to machine code.
If you want a flexible template that you modify at run-time, that's called JIT compiling or code generation. You normally generate machine-code directly, not assembler source text which you feed to an assembler.
For example, see the complete example in "Code golf: The repetitive byte counter", which generates a function composed of a variable number of dec eax instructions and then executes it.
BTW, dec eax runs at 1 per clock on all modern x86 CPUs, unlike NOP which runs at 4 per clock, or maybe 5 on Ryzen. See http://agner.org/optimize/.
A better choice for a tiny delay might be a pause instruction, or a dependency chain of some variable number of imul instructions, or maybe sqrtps, ending with an lfence to block out-of-order execution (at least on Intel CPUs). I haven't checked AMD's manuals to see if lfence is documented as being an execution barrier there, but Agner Fog reports it can run at 4 per clock on Ryzen.
But really, you probably don't need to JIT any code at all. For a one-off experiment that only has to work on one or a few systems, hack up a delay loop with something like
for (int i=0 ; i<delay_count ; i++) {
asm volatile("" : "+r" (i)); // defeat optimization
}
This forces the compiler to have the loop counter in a register on every iteration, so it can't optimize the loop away, or turn it into a multiply. You should get compiler-generated asm like delayloop: dec eax; jnz delayloop. You might want to put _mm_lfence() after the loop.
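A self-contained version of such a delay loop might look like the sketch below, assuming GCC or Clang. The function name delay_loop and the returned iteration count are illustrative additions.

```c
/* Spin for delay_count iterations without letting the optimizer
 * collapse the loop. Returns the number of iterations performed. */
static unsigned long delay_loop(unsigned long delay_count)
{
    unsigned long i;

    for (i = 0; i < delay_count; i++) {
        /* Empty asm that claims to read and write i: the compiler must
         * keep i in a register and run this statement every iteration. */
        __asm__ __volatile__("" : "+r"(i));
    }
    return i;
}
```

The asm statement emits no instructions of its own; it only constrains what the compiler may assume about i.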
I want to call a system call (prctl) in assembly inline and retrieve the result of the system call. But I cannot make it work.
This is the code I am using:
int install_filter(void)
{
long int res =-1;
void *prg_ptr = NULL;
struct sock_filter filter[] = {
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
/* If a trap is not generated, the application is killed */
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = {
.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
.filter = filter,
};
prg_ptr = &prog;
no_permis();
__asm__ (
"mov %1, %%rdx\n"
"mov $0x2, %%rsi \n"
"mov $0x16, %%rdi \n"
"mov $0x9d, %%rax\n"
"syscall\n"
"mov %%rax, %0\n"
: "=r"(res)
: "r"(prg_ptr)
: "%rdx", "%rsi", "%rdi", "%rax"
);
if ( res < 0 ){
perror("prctl");
exit(EXIT_FAILURE);
}
return 0;
}
The address of the filter should be the input (prg_ptr) and I want to save the result in res.
Can you help me?
For inline assembly, you don't use movs like this unless you have to, and even then you have to do ugly shuffling. That's because you have no idea which registers the operands arrive in. Instead, you should use:
__asm__ __volatile__ ("syscall" : "=a"(res) : "d"(prg_ptr), "S"(0x2), "D"(0x16), "a"(0x9d) : "rcx", "r11", "memory");
I also added __volatile__, which you should use for any asm with side-effects other than its output, a memory clobber (memory barrier), which you should use for any asm with side-effects on memory or for which reordering it with respect to memory accesses would be invalid, and clobbers for rcx and r11, which the syscall instruction itself overwrites. It's good practice to always use volatile and the memory clobber for syscalls unless you know you don't need them.
If you're still having problems, use strace to observe the syscall attempt and see what's going wrong.
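To check the constraint style in isolation, a minimal sketch for a zero-argument syscall is shown here, assuming x86-64 Linux with GCC or Clang. The names raw_syscall0 and raw_getpid_matches_libc are hypothetical.

```c
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* getpid */

/* Hypothetical helper: issue a syscall that takes no arguments.
 * rax carries the syscall number in and the result out; the syscall
 * instruction itself overwrites rcx and r11, so both are clobbers. */
static long raw_syscall0(long nr)
{
    long ret;

    __asm__ __volatile__("syscall"
                         : "=a"(ret)
                         : "a"(nr)
                         : "rcx", "r11", "memory");
    return ret;
}

/* Compare the raw result against the libc wrapper for the same call. */
static int raw_getpid_matches_libc(void)
{
    return raw_syscall0(SYS_getpid) == (long)getpid();
}
```

A harmless call such as getpid makes it easy to confirm the wrapper before moving on to prctl, where a mistake is harder to diagnose.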
The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruction requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) svc(char code)
{
asm volatile ("svc r0");
}
but that, of course, does not work, as SVC requires an immediate argument and does not accept a register.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
{
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work - There's something I'm doing wrong here; perhaps my instruction isn't properly aligned or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I have to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) svc(char code)
{
asm volatile ("svc %0"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?
You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constraints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There's also some good ARM-specific gcc docs here. A couple of pages down under the heading "Input and output operands" it provides a table of all the ARM constraints
What about using a macro:
#define SVC(i) asm volatile("svc #"#i)
As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, for thumb instructions at least.
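The difference between the one-level and two-level stringification can be checked on any host compiler. In this sketch, SVC_TEXT and SVC_TEXT_NAIVE are hypothetical names that build only the instruction string, without the asm statement around it.

```c
#define SVC_YIELD 0   /* same value as in the question */

#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)

/* Two-level version: the argument is macro-expanded before '#' applies. */
#define SVC_TEXT(i) "svc #" STRINGIFY(i)
/* One-level version: '#' sees the unexpanded macro name. */
#define SVC_TEXT_NAIVE(i) "svc #" #i

/* Yields the string "svc #0". */
static const char *svc_text_expanded(void)
{
    return SVC_TEXT(SVC_YIELD);
}

/* Yields the string "svc #SVC_YIELD", which the assembler would reject. */
static const char *svc_text_naive(void)
{
    return SVC_TEXT_NAIVE(SVC_YIELD);
}
```

An enum constant is not a preprocessor macro, so even the two-level version would produce its name rather than its value, which matches the limitation noted above.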
My solution ("Generating the instruction encoding on the fly"):
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs (or 1 -> thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works, and it's much better than the switch-case variant, which takes ~2 KB of ROM for 256 SVCs. This function does not have to be placed in a RAM section; FLASH is OK.
You can use it if svc_num should be a runtime variable.
As discussed in this question, the operand of SVC is fixed, that is it should be known to the preprocessor, and it is different from immediate Data-processing operands.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.
I wrote one handler recently for my own toy OS on Cortex-M. Works if tasks use PSP pointer.
Idea:
Get interrupted process's stack pointer, get process's stacked PC, it will have the instruction address of instruction after SVC, look up the immediate value in the instruction. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MRS R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the address of the current instruction. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
__asm volatile("BX LR"); //Value is in R0. A naked function has no compiler-generated epilogue, so return explicitly
}
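The lookup relies on the 16-bit Thumb SVC encoding, 0xDF00 with the immediate in the low byte. A host-side sketch of that decoding step, operating on a plain array instead of real stacked registers, could look like this; all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define THUMB_SVC_OPCODE 0xDF00u  /* 16-bit Thumb SVC, immediate in bits 7:0 */

/* Build the halfword for "svc #n". */
static uint16_t svc_encode(uint8_t n)
{
    return (uint16_t)(THUMB_SVC_OPCODE | n);
}

/* Return the SVC number, or -1 if instr is not an SVC instruction. */
static int svc_decode(uint16_t instr)
{
    if ((instr & 0xFF00u) != THUMB_SVC_OPCODE)
        return -1;
    return (int)(instr & 0x00FFu);
}

/* Simulates the handler's lookup: return_index plays the role of the
 * stacked PC, so the SVC sits one halfword (2 bytes) before it. */
static int svc_number_before(const uint16_t *code, size_t return_index)
{
    return svc_decode(code[return_index - 1]);
}
```

Checking the upper byte before trusting the immediate guards against decoding an unrelated instruction if the stacked PC is not what the handler expects.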
I'm trying to make a really simple spinlock mutex in C and for some reason I'm getting cases where two threads are getting the lock at the same time, which shouldn't be possible. It's running on a multiprocessor system which may be why there's a problem. Any ideas why it's not working?
void mutexLock(mutex_t *mutexlock, pid_t owner)
{
int failure = 1;
while(mutexlock->mx_state == 0 || failure || mutexlock->mx_owner != owner)
{
failure = 1;
if (mutexlock->mx_state == 0)
{
asm(
"movl $0x01,%%eax\n\t" // move 1 to eax
"xchg %%eax,%0\n\t" // try to set the lock bit
"mov %%eax,%1\n\t" // export our result to a test var
:"=r"(mutexlock->mx_state),"=r"(failure)
:"r"(mutexlock->mx_state)
:"%eax"
);
}
if (failure == 0)
{
mutexlock->mx_owner = owner; //test to see if we got the lock bit
}
}
}
Well for a start you're testing an uninitialised variable (failure) the first time the while() condition is executed.
Your actual problem is that you're telling gcc to use a register for mx_state - which clearly won't work for a spinlock. Try:
asm volatile (
"movl $0x01,%%eax\n\t" // move 1 to eax
"xchg %%eax,%0\n\t" // try to set the lock bit
"mov %%eax,%1\n\t" // export our result to a test var
:"=m"(mutexlock->mx_state),"=r"(failure)
:"m"(mutexlock->mx_state)
:"%eax"
);
Note that asm volatile is also important here, to ensure that it doesn't get hoisted out of your while loop.
The problem is that you load mx_state into a register (the 'r' constraint) and then do the exchange with the registers, only writing back the result into mx_state at the end of the asm code. What you want is something more like
asm(
"movl $0x01,%%eax\n\t" // move 1 to eax
"xchg %%eax,%1\n\t" // try to set the lock bit
"mov %%eax,%0\n\t" // export our result to a test var
:"=r"(failure)
:"m" (mutexlock->mx_state)
:"%eax"
);
Even this is somewhat dangerous, as in theory the compiler could load the mx_state, spill it into a local temp stack slot, and do the xchg there. It also is somewhat inefficient, as it has extra movs hardcoded that may not be needed but can't be eliminated by the optimizer. You're better off using a simpler asm that expands to a single instruction, such as
failure = 1;
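asm volatile("xchg %0,0(%1)" : "=r" (failure) : "r" (&mutexlock->mx_state), "0" (failure) : "memory");
Note how we force the use of mx_state in place, by using its address rather than its value. The "memory" clobber tells the compiler that the asm modifies memory it cannot see through the operands, and volatile keeps the statement from being moved out of the loop.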
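For comparison, a complete minimal spinlock built on a single xchg might look like the following sketch, assuming GCC or Clang on x86. The type spin_t and the function names are hypothetical, and the sketch leaves out the owner field from the question.

```c
typedef struct {
    volatile int state;   /* 0 = free, 1 = held */
} spin_t;

/* Atomically swap 1 into the lock word; returns 1 if we took the lock.
 * "+m" marks the lock word as read and written in place, and xchg with
 * a memory operand is implicitly locked on x86. */
static int spin_trylock(spin_t *l)
{
    int old = 1;

    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(old), "+m"(l->state)
                         :
                         : "memory");
    return old == 0;
}

/* Spin until the lock is acquired, hinting the CPU while waiting. */
static void spin_lock(spin_t *l)
{
    while (!spin_trylock(l))
        __asm__ __volatile__("pause" ::: "memory");
}

/* Release: the compiler barrier keeps earlier accesses before the store. */
static void spin_unlock(spin_t *l)
{
    __asm__ __volatile__("" ::: "memory");
    l->state = 0;
}
```

Here the exchanged value alone decides who owns the lock, so no separate state test is needed before the xchg.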
The code:
/* ctsw.c : context switcher
*/
#include <kernel.h>
static void *kstack;
extern int set_evec(int, long);
/* contextswitch - saves kernel context, switches to proc */
enum proc_req contextswitch(struct proc_ctrl_blk *proc) {
enum proc_req call;
kprintf("switching to %d\n", getpid(proc));
asm volatile("pushf\n" // save kernel flags
"pusha\n" // save kernel regs
"movl %%esp, %0\n" // save kernel %esp
"movl %1, %%esp\n" // load proc %esp
"popa\n" // load proc regs (from proc stack)
"iret" // switch to proc
: "=g" (kstack)
: "g" (proc->esp)
);
_entry_point:
asm volatile("pusha\n" // save proc regs
"movl %%esp, %0\n" // save proc %esp
"movl %2, %%esp\n" // restore kernel %esp
"movl %%eax, %1\n" // grabs syscall from process
"popa\n" // restore kernel regs (from kstack)
"popf" // restore kernel flags
: "=g" (proc->esp), "=g" (call)
: "g" (kstack)
);
kprintf("back to the kernel!\n");
return call;
}
void contextinit() {
set_evec(49, (long)&&_entry_point);
}
It's a context switcher for a small, cooperative, non-preemptive kernel. contextswitch() is called by dispatcher() with the stack pointer of the process to load. Once %esp and other general purpose registers have been loaded, iret is called and the user process starts running.
I need to setup an interrupt to return to the point in contextswitch() after the iret so I can restore the kernel context and return the value of the syscall to dispatcher().
How can I access the memory address of _entry_point from outside the function?
Switch the implementation of the function around: make it look like this:
Context switch from user to kernel;
Call into kernel routines;
Context switch back from kernel to user.
Then you can just set up the interrupt to run the function from the start. There will need to be a global pointer for "the current user process" - to switch between processes, the kernel code that's run by "Call into kernel routines" just changes that variable to point at a different process.
You will need one special case - for the initial switch from kernel to user mode, for the initial process that's running after boot. After that though, the above function should be able to handle it.
After a little while playing around with GCC, I've got an answer.
Dropping down to assembly silences GCC warnings about unused labels.
So,
_entry_point:
is replaced with
asm volatile("_entry_point:");
and
void contextinit() {
set_evec(49, (long)&&_entry_point);
}
is replaced with
void contextinit() {
long x;
asm("movl $_entry_point, %%eax\n"
"movl %%eax, %0": "=g" (x) : : "%eax");
set_evec(49, x);
}
Besides using inline assembly to access the _entry_point, you can also define it as a function, like:
asm volatile("_entry_point:");
void contextinit() {
extern void _entry_point();
set_evec(49, (long)&_entry_point);
}