Mixing C and assembly and its impact on registers

Mixing C and assembly and its impact on registers - c

Consider the following C and (ARM) assembly snippet, which is to be compiled with GCC:
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 q12, #0.0\n\t"
: [data_addr] "+r" (data_addr)
: : "q0", "q12");
for(int n=0; n<10; ++n){
__asm__ __volatile__ (
"vadd.f32 q12, q12, q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
:: "q0", "q12");
}
In this example, I am initialising some SIMD registers outside the loop and then having C handle the loop logic, with those initialised registers being used inside the loop.
This works in some test code, but I'm concerned of the risk of the compiler clobbering the registers between snippets. Is there any way of ensuring this doesn't happen? Can I infer any assurances about the type of registers that are going to be used in a snippet (in this case, that no SIMD registers will be clobbered)?

In general, there's not a way to do this in gcc; clobbers only guarantee that registers will be preserved around the asm call. If you need to ensure that the registers are saved between two asm sections, you will need to store them to memory in the first, and reload in the second.

Edit: After much fiddling around I've come to the conclusion this is much harder to solve in general using the strategy described below than I initially thought.
The problem is that, particularly when all the registers are used, there is nothing to stop the first register stash from overwriting another. Whether there is some trick to play with using direct memory writes that can be optimised away I don't know, but initial tests would suggest the compiler might still choose to clobber not-yet-stashed registers
For the time being and until I have more information, I'm unmarking this answer as correct and this answer should treated as probably wrong in the general case. My conclusion is this that such local protection of registers needs better support in the compiler to be useful
This absolutely is possible to do reliably. Drawing on the comments by #PeterCordes as well as the docs and a couple of useful bug reports (gcc 41538 and 37188) I came up with the following solution.
The point that makes it valid is the use of temporary variables to make sure the registers are maintained (logically, if the loop clobbers them, then they will be reloaded). In practice, the temporary variables are optimised away which is clear from inspection of the resultant asm.
// d0 and d1 map to the first and second values of q0, so we use
// q0 to reduce the number of tmp variables we pass around (instead
// of using one for each of d0 and d1).
register float32x4_t data __asm__ ("q0");
register float32x4_t output __asm__ ("q12");
float32x4_t tmp_data;
float32x4_t tmp_output;
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 %q[output], #0.0\n\t"
: [data_addr] "+r" (data_addr),
[output] "=&w" (output),
"=&w" (data) // we still need to constrain data (q0) as written to.
::);
// Stash the register values
tmp_data = data;
tmp_output = output;
for(int n=0; n<10; ++n){
// Make sure the registers are loaded correctly
output = tmp_output;
data = tmp_data;
__asm__ __volatile__ (
"vadd.f32 %[output], %[output], q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
[output] "+w" (output),
"+w" (data) // again, data (q0) was written to in the vldmia op.
::);
// Remember to stash the registers again before continuing
tmp_data = data;
tmp_output = output;
}
It's necessary to instruct the compiler that q0 is written to in the last line of each asm output constraint block, so it doesn't think it can reorder the stashing and reloading of the data register resulting in the asm block getting invalid values.

Related

How do I access a local variable using ARM assembly language?

I use the following piece of assembly code to enter the critical section in ARM Cortex-M4.
The question is, how can I access the local variable primeMask?
volatile uint8_t primaskValue;
asm(
"MRS R0, PRIMASK \n"
"CPSID i \n"
"STRB R0,%[output] \n"
:[output] "=r" (primaskValue)
);

The %[output] string in the asm code is replaced by the register used to hold the value specified by the constraint [output] "=r" (primaskValue) which will be a register used as the value of primaskValue afterwards. So within the block of the asm code, that register is the location of primaskValue.
If you instead used the constraint string "+r", that would be input and output (so the register would be initialized with the prior value of primaskValue as well as being used as the output value)
Since primaskValue had been declared as volatile this doesn't make a lot of sense -- it is not at all clear what (if anything) volatile means on a local variable.

Is there any way to generate inline assembly programmatically?

In my program I need to insert NOP as inline assembly into a loop, and the number of NOPs can be controlled by an argument. Something like this:
char nop[] = "nop\nnop";
for(offset = 0; offset < CACHE_SIZE; offset += BLOCK_SIZE) {
asm volatile (nop
:
: "c" (buffer + offset)
: "rax");
}
Is there any way to tell compiler to convert the above inline assembly into the following?
asm volatile ("nop\n"
"nop"
:
: "c" (buffer + offset)
: "rax");

Well, there is this trick you can do:
#define NOPS(n) asm volatile (".fill %c0, 1, 0x90" :: "i"(n))
This macro inserts the desired number of nop instructions into the instruction stream. Note that n must be a compile time constant. You can use a switch statement to select different lengths:
switch (len) {
case 1: NOPS(1); break;
case 2: NOPS(2); break;
...
}
You can also do this for more code size economy:
if (len & 040) NOPS(040);
if (len & 020) NOPS(020);
if (len & 010) NOPS(010);
if (len & 004) NOPS(004);
if (len & 002) NOPS(002);
if (len & 001) NOPS(001);
Note that you should really consider using pause instructions instead of nop instructions for this sort of thing as pause is a semantic hint that you are just trying to pass time. This changes the definition of the macro to:
#define NOPS(n) asm volatile (".fill %c0, 2, 0x90f3" :: "i"(n))

No, the inline asm template needs to be compile-time constant, so the assembler can assemble it to machine code.
If you want a flexible template that you modify at run-time, that's called JIT compiling or code generation. You normally generate machine-code directly, not assembler source text which you feed to an assembler.
For example, see this complete example which generates a function composed of a variable number of dec eax instructions and then executes it. Code golf: The repetitive byte counter
BTW, dec eax runs at 1 per clock on all modern x86 CPUs, unlike NOP which runs at 4 per clock, or maybe 5 on Ryzen. See http://agner.org/optimize/.
A better choice for a tiny delay might be a pause instruction, or a dependency chain of some variable number of imul instructions, or maybe sqrtps, ending with an lfence to block out-of-order execution (at least on Intel CPUs). I haven't checked AMD's manuals to see if lfence is documented as being an execution barrier there, but Agner Fog reports it can run at 4 per clock on Ryzen.
But really, you probably don't need to JIT any code at all. For a one-off experiment that only has to work on one or a few systems, hack up a delay loop with something like
for (int i=0 ; i<delay_count ; i++) {
asm volatile("" : "r" (i)); // defeat optimization
}
This forces the compiler to have the loop counter in a register on every iteration, so it can't optimize the loop away, or turn it into a multiply. You should get compiler-generated asm like delayloop: dec eax; jnz delayloop. You might want to put _mm_lfence() after the loop.

Impossible constraint in 'asm': asm volatile

I trying since a few days to write a very simple inline assembler code, but nothing worked. I have as IDE NetBeans and as compiler MinGW.
My latest code is:
uint16 readle_uint16(const uint8 * buffer, int offset) {
unsigned char x, y, z;
unsigned int PORTB;
__asm__ __volatile__("\n"
"addl r29,%0\n"
"addl r30,%1\n"
"addl r31,%2\n"
"lpm\n"
"out %3,r0\n"
: "=I" (PORTB)
: "r" (x), "r" (y), "r" (z)
);
return value;
}
But I get everytime the same message "error: impossible constraint in 'asm'".
I tried to write all in a single line or to use different asm introductions. I have no idea what I can do otherwise.

Notice that gcc's inline assembly syntax is
asm [volatile] ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
After the assembler instructions first come the output operands, then the inputs.
As #DavidWohlferd said, I is for "constant greater than −1, less than 64 "constants ("immediates").
While the out instruction in fact requires a constant value from that range, PORTB is not that constant value. (You can see that for yourself if you look into the corresponding avr/ioXXXX.h file for your controller, where you may find something like #define PORTB _SFR_IO8(0x05).)
Also, not all IO registers may be accessible via out/in; especially the bigger controllers have more than 64 IO registers, but only the first 64 can be accessed as such. However, all IO registers can be accessed at their memory-mapped address through lds/sts. So, depending on which register on which controller you want to access you may not be able to use out for that register at all, but you can always use sts instead. If you want your code to be portable you'll have to take that into account, like suggested here for example.
If you know that PORTB is one of the first 64 IO registers on your controller, you can use
"I" (_SFR_IO_ADDR( PORTB )) with out, else use
"m" ( PORTB ) with sts.
So this:
__asm__ __volatile__("\n"
"addl r29,%0\n"
"addl r30,%1\n"
"addl r31,%2\n"
"lpm\n"
"out %3,r0\n"
: /* No output operands here */
: "r" (x), "r" (y), "r" (z), "I" (_SFR_IO_ADDR( PORTB ))
);
should get you rid of that "impossible constraint" error. Although the code still does not make any sense, mostly because you're using "random", uninitialized data as input. You clobber registers r29-r31 without declaring them, and I'm totally not sure what your intention is with all the code before the lpm.

As EOF says, the I constraint is used for parameters that are constant (see the AVR section at https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html). By putting this parameter after the first colon (and by using an =), you are saying this is an output. Outputting to a constant makes no sense.
Also:
You list x, y, and z as inputs to the asm (by putting them after the second colon), but they never get assigned a value. An input that has never been assigned a value makes no sense.
You are (apparently) modifying registers 29-31, but you don't tell the compiler that you are doing so?
There's more, but I just can't follow what you think this code is supposed to do. You might want to take some time to look thru the gcc docs for asm to understand how this works.

Using GCC inline assembly with instructions that take immediate values

The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruciton requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) svc(char code)
{
asm volatile ("scv r0");
}
but that, of course, does not work as SVC requires a register argument.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work - There's something I'm doing wrong here; perhaps my instruction isn't properly aligned or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I have to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) svc(char code)
{
asm volatile ("scv %1"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?

You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constaints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There's also some good ARM-specific gcc docs here. A couple of pages down under the heading "Input and output operands" it provides a table of all the ARM constraints

What about using a macro:
#define SVC(i) asm volatile("svc #"#i)

As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, for thumb instructions at least.

My solution ("Generating the instruction encoding on the fly"):
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs (or 1 -> thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works and its much better than switch-case variant, which takes ~2kb ROM for 256 svc's. This func does not have to be placed in RAM section, FLASH is ok.
You can use it if svc_num should be a runtime variable.

As discussed in this question, the operand of SVC is fixed, that is it should be known to the preprocessor, and it is different from immediate Data-processing operands.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.

I wrote one handler recently for my own toy OS on Cortex-M. Works if tasks use PSP pointer.
Idea:
Get interrupted process's stack pointer, get process's stacked PC, it will have the instruction address of instruction after SVC, look up the immediate value in the instruction. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MSR R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the address of the current instruction. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
//Value is in R0. Function can return
}

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?

Okay, since the other answer wasn't helpful, I'll try to explain on more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).

I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do 64-bit math carries your self, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm, you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
# if __WORDSIZE == 64
asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
: "=A"(ret)
: /* no input */
: "%edx"
);
#else
asm ("rdtsc"
: "=A"(ret)
);
#endif
return ret;
}
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.

As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
Passed directly to the real hardware (PV guest)
Count cycles while the VM is executing on the host processor (Windows / etc)
Note, in #2 the while the VM is executing on the host processor. The same phenomenon would go for Xen, as well, if I recall correctly. In essence, you can expect that the code should work as expected on a paravirtualized guest. If emulated, its entirely unreasonable to expect hardware like consistency.

You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and #Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.

hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious if you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? ...just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 at the end already hints at this possibility... you might need to convert each sub-component hi and lo to a double (instead of a single one at the very end) and do the multiplication using the two doubles

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight