How to relace byte of 32bit variable in inline assembly? - c

I want to replace the highest byte of 32bit value with inline assembly, following code writes buffer to FRAM memory with spi interace:
#define _load_op_code(op_code, addr)\
__asm__ __volatile__ (\
" ldi %D0, %1" "\n\t"\
: "=d" ((uint32_t)addr)\
: "M" (op_code)\
)
#define SMEM_WREN 0x06
#define SMEM_WRITE 0x02
void fram_write(uint32_t addr, uint8_t *buf, uint16_t len) {
FRAM_SELECT();
spi_send_char(SMEM_WREN);
FRAM_DESELECT();
_load_op_code(SMEM_WRITE, addr);
FRAM_SELECT();
spi_send_32b(addr);
spi_send(buf, len);
FRAM_DESELECT();
}
after _load_op_code() inline assembly addr variable gets cluttered - compiler use registers allocated for addr as temporary registers for other operations and i lose original addr value. addr is in fact 24bit variable. Any idea whats wrong with this code?

The original value of addr is lost because it is overwritten by the asm statement with SMEM_WRITE and 3 undefined bytes. From the GCC manual:
The ordinary output operands must be write-only; GCC assumes that the values in these operands before the instruction are dead and need not be generated. Extended asm supports input-output or read-write operands. Use the constraint character ‘+’ to indicate such an operand and list it with the output operands.

Related

How to understand this GNU C inline assembly macro for PowerPC stwbrx

This is basically to perform swap for the buffers while transferring a message buffer. This statement left me puzzled (because of my unfamiliarity with the embedded assembly code in c). This is a power pc instruction
#define ASMSWAP32(dest_addr,data) __asm__ volatile ("stwbrx %0, 0, %1" : : "r" (data), "r" (dest_addr))
Besides being unsafe because of a bug, this macro is also less efficient than what the compiler will generate for you.
stwbrx = store word byte-reversed. The x stands for indexed.
You don't need inline asm for this in GNU C, where you can use __builtin_bswap32 and let the compiler emit this instruction for you.
void swapstore_asm(int a, int *p) {
ASMSWAP32(p, a);
}
void swapstore_c(int a, int *p) {
*p = __builtin_bswap32(a);
}
Compiled with gcc4.8.5 -O3 -mregnames, we get identical code from both functions (Godbolt compiler explorer):
swapstore:
stwbrx %r3, 0, %r4
blr
swapstore_c:
stwbrx %r3,0,%r4
blr
But with a more complicated address (storing to p[off], where off is an integer function arg), the compiler knows how to use both register inputs, while your macro forces the compiler to have the address in a single register:
void swapstore_offset(int a, int *p, int off) {
= __builtin_bswap32(a);
}
swapstore_offset:
slwi %r5,%r5,2 # *4 = sizeof(int)
stwbrx %r3,%r4,%r5 # use an indexed addressing mode, with both registers non-zero
blr
swapstore_offset_asm:
slwi %r5,%r5,2
add %r4,%r4,%r5 # extra instruction forced by using the macro
stwbrx %r3, 0, %r4
blr
BTW, if you're having trouble understanding GNU C inline asm templates, looking at the compiler's asm output can be a useful way to see what gets substituted in. See How to remove "noise" from GCC/clang assembly output? for more about reading compiler asm output.
Also note that this macro is buggy: it's missing a "memory" clobber for the store. And yes, you still need that with asm volatile. The compiler doesn't assume that *dest_addr is modified unless you tell it, so it could hoist a non-volatile load of *dest_addr ahead of this insn, or more likely to be a real problem, sink a store after it. (e.g. if you zeroed a buffer before storing to it with this, the compiler might actually zero after this instruction.)
Instead of a "memory" clobber (and also leaving out volatile), you could tell the compiler which memory location you modify with a =m" (*dest_addr) operand, either as a dummy operand or with a constraint on the addressing mode so you could use it as reg+reg. (IDK PPC well enough to know what "=m" usually expands to.)
In most cases this bug won't bite you, but it's still a bug. Upgrading your compiler version or using link-time optimization could maybe make your program buggy with no source-level changes.
This kind of thing is why https://gcc.gnu.org/wiki/DontUseInlineAsm
See also https://stackoverflow.com/tags/inline-assembly/info.
#define ASMSWAP32(dest_addr,data) ...
This part should be clear
__asm__ volatile ( ... : : "r" (data), "r" (dest_addr))
This is the actual inline assembly:
Two values are passed to the assmbly code; no value is returned from the assembly code (this is the colons after the actual assembly code).
Both parameters are passed in registers ("r"). The expression %0 will be replaced by the register that contains the value of data while the expression %1 will be replaced by the register that contains the value of dest_addr (which will be a pointer in this case).
The volatile here means that the assembly code has to be executed at this point and cannot be moved to somewhere else.
So if you use the following code in the C source:
ASMSWAP(&a, b);
... the following assembler code will be generated:
# write the address of a to register 5 (for example)
...
# write the value of b to register 6
...
stwbrx 6, 0, 5
So the first argument of the stwbrx instruction is the value of b and the last argument is the address of a.
stwbrx x, 0, y
This instruction writes the value in register x to the address stored in register y; however it writes the value in "reverse endian" (on a big-endian CPU it writes the value "little endian".
The following code:
uint32 a;
ASMSWAP32(&a, 0x12345678);
... should therefore result in a = 0x78563412.

What ensures reads/writes of operands occurs at desired timed with extended ASM?

According to GCC's Extended ASM and Assembler Template, to keep instructions consecutive, they must be in the same ASM block. I'm having trouble understanding what provides the scheduling or timings of reads and writes to the operands in a block with multiple statements.
As an example, EBX or RBX needs to be preserved when using CPUID because, according to the ABI, the caller owns it. There are some open questions with respect to the use of EBX and RBX, so we want to preserve it unconditionally (its a requirement). So three instructions need to be encoded into a single ASM block to ensure the consecutive-ness of the instructions (re: the assembler template discussed in the first paragraph):
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"pop %ebx"
: "=a"(__EAX), "=b"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
If the expression representing the operands is interpreted at the wrong point in time, then __EBX will be the saved EBX (and not the CPUID's EBX), which will likely be a pointer to the Global Offset Table (GOT) if PIC is enabled.
Where, exactly, does the expression specify that the store of CPUID's %EBX into __EBX should happen (1) after the PUSH %EBX; (2) after the CPUID; but (3) before the POP %EBX?
In your question you present some code that does a push and pop of ebx. The idea of saving ebx in the event that you compile with gcc using -fPIC (position independent code) is correct. It is up to our function not to clobber ebx upon return in that situation. Unfortunately the way you have defined the constraints you explicitly use ebx. Generally the compiler will warn you (error: inconsistent operand constraints in an 'asm') if you are using PIC code and you specify =b as an output constraint. Why it doesn't produce a warning for you is unusual.
To get around this problem you can let the assembler template choose a register for you. Instead of pushing and popping we simply exchange %ebx with an unused register chosen by the compiler and restore it by exchanging it back after. Since we don't wish to have the compiler clobber our input registers during the exchange we specify early clobber modifier, thus ending up with a constraint of =&r (instead of =b in the OPs code). More on modifiers can be found here. Your code (for 32 bit) would look something like:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
If you intend to compile for X86_64 (64 bit) you'll need to save the entire contents of %rbx. The code above will not quite work. You'd have to use something like:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
You could code this up using conditional compilation to deal with both X86_64 and i386:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
#if defined(__i386__)
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#elif defined(__x86_64__)
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#else
#error "Unknown architecture."
#endif
GCC has a __cpuid macro defined in cpuid.h. It defined the macro so that it only saves the ebx and rbx register when required. You can find the GCC 4.8.1 macro definition here to get an idea of how they handle cpuid in cpuid.h.
The astute reader may ask the question - what stops the compiler from choosing ebx or rbx as the scratch register to use for the exchange. The compiler knows about ebx and rbx in the context of PIC, and will not allow it to be used as a scratch register. This is based on my personal observations over the years and reviewing the assembler (.s) files generated from C code. I can't say for certain how more ancient versions of gcc handled it so it could be a problem.
I think you understand, but to be clear, the "consecutive" rule means that this:
asm ("a");
asm ("b");
asm ("c");
... might get other instructions interposed, so if that's not desirable then it must be rewritten like this:
asm ("a\n"
"b\n"
"c");
... and now it will be inserted as a whole.
As for the cpuid snippet, we have two problems:
The cpuid instruction will overwrite ebx, and hence clobber the data that PIC code must keep there.
We want to extract the value that cpuid places in ebx while never returning to compiled code with the "wrong" ebx value.
One possible solution would be this:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"mov %ebx, %ecx"
"pop %ebx"
: "=c"(__EBX)
: "a"(__FUNC), "c"(__SUBFUNC)
: "eax", "edx"
);
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"pop %ebx"
: "=a"(__EAX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
There's no need to mark ebx as clobbered as you're putting it back how you found it.
(I don't do much Intel programming, so I may have some of the assembler-specific details off there, but this is how asm works.)

Decoding this assembly inline code snippet on PowerPc

I have this below code snippet from kernel source for PowerPc
#define SPRN_IVOR32 0x210 /* Interrupt Vector Offset Register 32 */
unsigned long ivor[3];
ivor[0] = mfspr(SPRN_IVOR32);
#define __stringify_1(x) #x
#define __stringify(x) __stringify_1(x)
#define mfspr(rn) ({unsigned long rval; \
asm volatile("mfspr %0," __stringify(rn) \
: "=r" (rval)); rval; })
Also, it this above exercise is about emulating MSR register's bits in PowerPc?
Can anyone help me on what exactly we are doing here?
The mfspr macro generates an asm instruction mfspr which reads the given special purpose register into a register chosen by the compiler, which then gets assigned to rval hence becomes the return value of the expression.
As the comment says, SPRN_IVOR32 is the Interrupt Vector Offset Register 32, whose contents are thus fetched into ivor[0].

GCC: Prohibit use of some registers

This is a strange request but I have a feeling that it could be possible. What I would like is to insert some pragmas or directives into areas of my code (written in C) so that GCC's register allocator will not use them.
I understand that I can do something like this, which might set aside this register for this variable
register int var1 asm ("EBX") = 1984;
register int var2 asm ("r9") = 101;
The problem is that I'm inserting new instructions (for a hardware simulator) directly and GCC and GAS don't recognise these yet. My new instructions can use the existing general purpose registers and I want to make sure that I have some of them (i.e. r12->r15) reserved.
Right now, I'm working in a mockup environment and I want to do my experiments quickly. In the future I will append GAS and add intrinsics into GCC, but right now I'm looking for a quick fix.
Thanks!
When writing GCC inline assembler, you can specify a "clobber list" - a list of registers that may be overwritten by your inline assembler code. GCC will then do whatever is needed to save and restore data in those registers (or avoid their use in the first place) over the course of the inline asm segment. You can also bind input or output registers to C variables.
For example:
inline unsigned long addone(unsigned long v)
{
unsigned long rv;
asm("mov $1, %%eax;"
"mov %0, %%ebx;"
"add %%eax, %%ebx"
: /* outputs */ "b" (rv)
: /* inputs */ "g" (v) /* select unused general purpose reg into %0 */
: /* clobbers */ "eax"
);
}
For more information, see the GCC-Inline-Asm-HOWTO.
If you use global explicit register variables, these will be reserved throughout the compilation unit, and will not be used by the compiler for anything else (it may still be used by the system's libraries, so choose something that will be restored by those). local register variables do not guarantee that your value will be in the register at all times, but only when referenced by code or as an asm operand.
If you write an inline asm block for your new instructions, there are commands that inform GCC what registers are used by that block and how they are used. GCC will then avoid using those registers or will at least save and reload their contents.
Non-hardcoded scratch register in inline assembly
This is not a direct answer to the original question, but since and since I keep Googling this in that context and since https://stackoverflow.com/a/6683183/895245 was accepted, I'm going to try and provide a possible improvement to that answer.
The improvement is the following: you should avoid hard-coding your scratch registers when possible, to give the register allocator more freedom.
Therefore, as an educational example that is useless in practice (could be done in a single lea (%[in1], %[in2]), %[out];), the following hardcoded scratch register code:
bad.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
__asm__ (
"mov %[in2], %%rax;" /* scratch = in2 */
"add %[in1], %%rax;" /* scratch += in1 */
"mov %%rax, %[out];" /* out = scratch */
: [out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
: "rax"
);
assert(out == 0x100000000);
}
could compile to something more efficient if you instead use this non-hardcoded version:
good.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
uint64_t scratch;
__asm__ (
"mov %[in2], %[scratch];" /* scratch = in2 */
"add %[in1], %[scratch];" /* scratch += in1 */
"mov %[scratch], %[out];" /* out = scratch */
: [scratch] "=&r" (scratch),
[out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
:
);
assert(out == 0x100000000);
}
since the compiler is free to choose any register it wants instead of just rax,
Note that in this example we had to mark the scratch as an early clobber register with & to prevent it from being put into the same register as an input, I have explained that in more detail at: When to use earlyclobber constraint in extended GCC inline assembly? This example also happens to fail in the implementation I tested on without &.
Tested in Ubuntu 18.10 amd64, GCC 8.2.0, compile and run with:
gcc -O3 -std=c99 -ggdb3 -Wall -Werror -pedantic -o good.out good.c
./good.out
Non-hardcoded scratch registers are also mentioned in the GCC manual 6.45.2.6 "Clobbers and Scratch Registers", although their example is too much for mere mortals to take in at once:
Rather than allocating fixed registers via clobbers to provide scratch registers for an asm statement, an alternative is to define a variable and make it an early-clobber output as with a2 and a3 in the example below. This gives the compiler register allocator more freedom. You can also define a variable and make it an output tied to an input as with a0 and a1, tied respectively to ap and lda. Of course, with tied outputs your asm can’t use the input value after modifying the output register since they are one and the same register. What’s more, if you omit the early-clobber on the output, it is possible that GCC might allocate the same register to another of the inputs if GCC could prove they had the same value on entry to the asm. This is why a1 has an early-clobber. Its tied input, lda might conceivably be known to have the value 16 and without an early-clobber share the same register as %11. On the other hand, ap can’t be the same as any of the other inputs, so an early-clobber on a0 is not needed. It is also not desirable in this case. An early-clobber on a0 would cause GCC to allocate a separate register for the "m" ((const double ()[]) ap) input. Note that tying an input to an output is the way to set up an initialized temporary register modified by an asm statement. An input not tied to an output is assumed by GCC to be unchanged, for example "b" (16) below sets up %11 to 16, and GCC might use that register in following code if the value 16 happened to be needed. You can even use a normal asm output for a scratch if all inputs that might share the same register are consumed before the scratch is used. The VSX registers clobbered by the asm statement could have used this technique except for GCC’s limit on the number of asm parameters.
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
const double *x, double *y, double alpha)
{
double *a0;
double *a1;
double *a2;
double *a3;
__asm__
(
/* lots of asm here */
"#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
"#a0=%3 a1=%4 a2=%5 a3=%6"
:
"+m" (*(double (*)[n]) y),
"+&r" (n), // 1
"+b" (y), // 2
"=b" (a0), // 3
"=&b" (a1), // 4
"=&b" (a2), // 5
"=&b" (a3) // 6
:
"m" (*(const double (*)[n]) x),
"m" (*(const double (*)[]) ap),
"d" (alpha), // 9
"r" (x), // 10
"b" (16), // 11
"3" (ap), // 12
"4" (lda) // 13
:
"cr0",
"vs32","vs33","vs34","vs35","vs36","vs37",
"vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
);
}

What is this x86 inline assembly doing?

I came across this code and need to understand what it is doing. It just seems to be declaring two bytes and then doing nothing...
uint64_t x;
__asm__ __volatile__ (".byte 0x0f, 0x31" : "=A" (x));
Thanks!
This is generating two bytes (0F 31) directly into the code stream. This is an RDTSC instruction, which reads the time-stamp counter into EDX:EAX, which will then be copied to the variable 'x' by the output constraint "=A"(x)
0F 31 is the x86 opcode for the RDTSC (read time stamp counter) instruction; it places the value read into the EDX and EAX registers.
The _ _ asm__ directive isn't just declaring two bytes, it's placing inline assembly into the C code. Presumably, the program has a way of using the value in those registers immediately afterwards.
http://en.wikipedia.org/wiki/Time_Stamp_Counter
It's inserting an 0F 31 opcode, which according to this site is:
0F 31 P1+ f2 RDTSC EAX EDX IA32_T... Read Time-Stamp Counter
Then it is storing the result in the x variable
It's inline asm for rdtsc, with the machine-code encoding written out to support really old assemblers that don't know the mnemonic.
Unfortunately, it only works correctly in 32bit code because "=A" doesn't split 64bit operands in half in 64bit code. (The gcc manual even uses rdtsc an an example to illustrate this)
The safe way to write this, which compiles to optimal code with gcc -m32 or -m64, is:
#include <stdint.h>
uint64_t timestamp_safe(void)
{
unsigned long tsc_low, tsc_high; // not uint32_t: saves a zero-extend for -m64 (but not x32 :/)
asm volatile("rdtsc" : "=d"(tsc_high), "=a" (tsc_low));
return ((uint64_t)tsc_high << 32) | tsc_low;
}
In 32bit code, it's just rdtsc/ret, but in 64bit code it does the necessary shift/or to get both halves into rax for the return value.
See it on the Godbolt compiler explorer.

Resources