Atomic 16-byte operations on x86_64 - C

Are the following 16-byte atomic operations correctly implemented? Are there any better alternatives?
typedef struct {
    uintptr_t low;
    uintptr_t high;
} uint128_atomic;

uint128_atomic load_relaxed(uint128_atomic const *atomic)
{
    uint128_atomic ret;
    asm volatile("xor %%eax, %%eax\n"
                 "xor %%ebx, %%ebx\n"
                 "xor %%ecx, %%ecx\n"
                 "xor %%edx, %%edx\n"
                 "lock; cmpxchg16b %1"
                 : "=A"(ret)
                 : "m"(*atomic)
                 : "cc", "rbx", "rcx");
    return ret;
}
bool cmpexch_weak_relaxed(
    uint128_atomic *atomic,
    uint128_atomic *expected,
    uint128_atomic desired)
{
    bool matched;
    uint128_atomic e = *expected;
    asm volatile("lock; cmpxchg16b %1\n"
                 "setz %0"
                 : "=q"(matched), "+m"(*atomic)
                 : "a"(e.low), "d"(e.high), "b"(desired.low), "c"(desired.high)
                 : "cc");
    return matched;
}
void store_relaxed(uint128_atomic *atomic, uint128_atomic val)
{
    uint128_atomic old = *atomic;
    asm volatile("lock; cmpxchg16b %0"
                 : "+m"(*atomic)
                 : "a"(old.low), "d"(old.high), "b"(val.low), "c"(val.high)
                 : "cc");
}
For a full working example, check out: https://godbolt.org/g/CemfSg
An updated implementation can be found here: https://godbolt.org/g/vGNQG5

I came up with the following implementation after applying all the suggestions from @PeterCordes, @David Wohlferd, and @prl. Thanks a lot!
struct _uint128_atomic {
    volatile uint64_t low;
    volatile uint64_t high;
} __attribute__((aligned(16)));

typedef struct _uint128_atomic uint128_atomic;

bool
cmpexch_weak_relaxed(
    uint128_atomic *atomic,
    uint128_atomic *expected,
    uint128_atomic desired)
{
    bool matched;
    uint128_atomic e = *expected;
    asm volatile("lock cmpxchg16b %1"
                 : "=@ccz"(matched), "+m"(*atomic), "+a"(e.low), "+d"(e.high)
                 : "b"(desired.low), "c"(desired.high)
                 : "cc");
    if (!matched)
        *expected = e;
    return matched;
}
uint128_atomic
load_relaxed(uint128_atomic const *atomic)
{
    uint128_atomic ret = {0, 0};
    asm volatile("lock cmpxchg16b %1"
                 : "+A"(ret)
                 : "m"(*atomic), "b"(0), "c"(0)
                 : "cc");
    return ret;
}
void
store_relaxed(uint128_atomic *atomic, uint128_atomic val)
{
    uint128_atomic old = *atomic;
    while (!cmpexch_weak_relaxed(atomic, &old, val))
        ;
}
Please keep in mind that the implementation is GCC specific and will not work on clang. Clang's implementation of GCC-style inline assembly is suboptimal at best, and garbage at worst.
The GCC implementation can also be found on Godbolt's Compiler Explorer here.
A suboptimal, but working, clang implementation can be found here.
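As a quick illustration of how these primitives compose (a sketch of mine, not part of the original post), a 16-byte counter can be bumped with a retry loop; on failure, cmpexch_weak_relaxed refreshes the expected value, so the loop simply tries again:

/* Hypothetical helper built on the functions above: increment a 16-byte
 * counter, carrying from the low half into the high half. */
static void
increment128(uint128_atomic *counter)
{
    uint128_atomic old = load_relaxed(counter);
    uint128_atomic desired;

    do {
        desired.low  = old.low + 1;
        desired.high = old.high + (desired.low == 0); /* carry into the high half */
        /* on failure, 'old' has been refreshed with the current value */
    } while (!cmpexch_weak_relaxed(counter, &old, desired));
}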

Why don't you just use the C11 atomic intrinsics?
#include <stdatomic.h>

inline __uint128_t load_relaxed(_Atomic __uint128_t *obj)
{
    return atomic_load_explicit(obj, memory_order_relaxed);
}

inline _Bool cmpexch_weak_relaxed(_Atomic __uint128_t *obj,
                                  __uint128_t *expected,
                                  __uint128_t desired)
{
    return atomic_compare_exchange_weak_explicit(obj, expected, desired,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed);
}
This compiles to more-or-less the assembly you wrote, using clang 4.0.1 and -march=native. But, unlike what you wrote, the compiler actually understands what's going on, so code generation around these functions will be correct. There is, as far as I know, no way to annotate a GNU-style assembly insert to tell the compiler that it has the semantics of an atomic operation.
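For completeness, here is a small usage sketch of mine (not from the original answer), assuming the two wrappers above live in the same translation unit (or are made static inline). With clang and -mcx16 (implied by a suitable -march=native) the compare-exchange can be inlined as lock cmpxchg16b; GCC may route 16-byte atomics through libatomic instead, in which case link with -latomic.

#include <stdatomic.h>
#include <stdio.h>

static _Atomic __uint128_t counter;

int main(void)
{
    __uint128_t expected = load_relaxed(&counter);
    /* retry until the increment lands; 'expected' is refreshed on failure */
    while (!cmpexch_weak_relaxed(&counter, &expected, expected + 1))
        ;
    printf("low 64 bits: %llu\n",
           (unsigned long long)load_relaxed(&counter));
    return 0;
}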

No, you need "+a" and "+d" in cmpexch_weak_relaxed and store_relaxed.
Other than that, I don't see any problems. (I compared to my own implementations in working software.)
As far as improvements go, I suggest:
uint128_atomic load_relaxed(uint128_atomic const *atomic)
{
    uint128_atomic ret = { 0, 0 };
    asm volatile("lock; cmpxchg16b %1"
                 : "+A"(ret)
                 : "m"(*atomic), "b"(0), "c"(0)
                 : "cc");
    return ret;
}
(I see that David Wohlferd also made this suggestion in a comment.)
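Spelled out, the "+a"/"+d" fix looks like the sketch below (essentially what the updated implementation above already does). cmpxchg16b writes RDX:RAX with the current memory contents on failure, so those registers must be read-write operands, and the observed value should be handed back through *expected:

bool cmpexch_weak_relaxed(uint128_atomic *atomic,
                          uint128_atomic *expected,
                          uint128_atomic desired)
{
    bool matched;
    uint128_atomic e = *expected;
    asm volatile("lock; cmpxchg16b %1\n\t"
                 "setz %0"
                 : "=q"(matched), "+m"(*atomic), "+a"(e.low), "+d"(e.high)
                 : "b"(desired.low), "c"(desired.high)
                 : "cc");
    if (!matched)
        *expected = e;   /* report the value actually observed in memory */
    return matched;
}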

Related

Early-clobbers and named registers

I'm trying to understand the usage of "early-clobber outputs", but I stumbled upon a snippet that confuses me. Consider the following multiply-modulo function:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t d;
    uint64_t unused;
    asm ("mulq %3\n\t"
         "divq %4"
         : "=a"(unused), "=&d"(d)
         : "a"(a), "rm"(b), "rm"(n)
         : "cc");
    return d;
}
Why does RDX have the early-clobber flag (&)? Is it because mulq implicitly modifies RDX? Would the example work without the flag? (I tried it and it seems to. But would it also be correct?) On the other hand, isn't the fact that the function outputs RDX enough to tell the compiler that RDX was modified?
Also, why is there that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried and it seems to work.) I would have expected the correct way of marking RAX as modified to be adding "rax" to the clobbers, along with "cc". But that does not work.
While this doesn't answer the question (I think the comments have it covered), I would simplify this by letting the compiler choose registers vs. memory and allowing it to schedule mulq and divq as required. The problem is that div has register restrictions:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t ret, q, rh, rl;

    __asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
             : "%0,0" (a), "r,m" (b) : "cc");

    /* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient
     * is too large to store in `%rax`. */

    /* the "%0,0" notation implies that `(a)` and `(b)` are commutative.
     * the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel icc)
     * for x86-64 asm statements. */

    __asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
             : "0,0" (rl), "1,1" (rh), "r,m" (n) : "cc");

    return ret;
}
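As a quick sanity check of mine (example values assumed, and assuming the function above is in scope), note the precondition in the comment: divq raises a divide error if the high half of the product is not smaller than n.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* the high half of a*b is roughly 1,000,002 here, well below n, so divq is safe */
    uint64_t a = UINT64_MAX;
    uint64_t b = 1000003;
    uint64_t n = 998244353;

    printf("%llu\n", (unsigned long long)mulmod64(a, b, n));
    return 0;
}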

MUL gets runtime error when using GCC asm

I want to calculate x * y % 998244353 with GCC inline asm, so I wrote this asm code:
int Modmul(int x, int y)
{
    int t;
    __asm__ __volatile__ ("mull %%ebx\n\tdivl %%ecx\n\t"
                          : "=d"(t)
                          : "a"(x), "b"(y), "c"(998244353)
                          : "eax", "edx", "memory");
    return t;
}
However, it gives the compile error "'asm' operand has impossible constraints", and if I remove the clobber list : "eax", "edx", "memory", it compiles but produces the wrong answer. Why?
BTW, I also tried replacing ebx and ecx with generic "r" operands referenced as %0 and %1:
int Modmul(int x, int y)
{
    int t;
    __asm__ __volatile__ ("mull %0\n\tdivl %1\n\t"
                          : "=d"(t)
                          : "a"(x), "r"(y), "r"(998244353));
    return t;
}
It compiles, but it fails with a runtime error. Why?
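Two things stand out (my reading; this is not an answer from the original page). In the first version, EAX and EDX are tied to operands ("a"(x) and "=d"(t)) and also listed as clobbers, and a register used by an operand must not also appear in the clobber list, hence the impossible-constraints error; dropping the clobbers then hides from the compiler that EAX is overwritten, so surrounding code can misbehave. In the second version the operand numbers are off: %0 is the first output (t), not the "r"(y) input, so mull multiplies by EDX and divl divides by EAX, and the division can fault. A corrected sketch:

/* Sketch: tie EAX and EDX to operands instead of clobbering them.
 * mull writes EDX:EAX, and divl then divides EDX:EAX by the divisor,
 * leaving the quotient in EAX and the remainder in EDX. */
int Modmul(int x, int y)
{
    unsigned int q = x;   /* EAX: multiplicand in, quotient out (discarded) */
    unsigned int r;       /* EDX: remainder, i.e. x * y % 998244353 */

    __asm__ ("mull %[y]\n\t"
             "divl %[m]"
             : "+a"(q), "=&d"(r)   /* early clobber: EDX is written before %[m] is read */
             : [y] "r"(y), [m] "r"(998244353)
             : "cc");
    return (int)r;
}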

How can I determine (preferably at compile-time) whether gcc is using rbp-based offsets or rsp-based offsets?

I want to write something like this:
#include <stdint.h>

inline uint64_t with_rsp(uint64_t x, uint64_t y) {
    uint64_t z, w;
    uint64_t rsp;
    asm ("mov %%rsp, %[rsp]\t\n"
         "mov $0x13, %%rsp\t\n"
         "mov %[x], %%rdx\t\n"
         "mulx %[y], %[z], %[w]\t\n"
         "mov %[rsp], %%rsp\t\n"
         : [z] "=&r" (z), [w] "=&r" (w)
         : [x] "r" (x), [y] "r" (y), [rsp] "m" (rsp)
         : "rdx"
         );
    return z + w;
}

inline uint64_t with_rbp(uint64_t x, uint64_t y) {
    uint64_t z, w;
    uint64_t rbp;
    asm ("mov %%rbp, %[rbp]\t\n"
         "mov $0x13, %%rbp\t\n"
         "mov %[x], %%rdx\t\n"
         "mulx %[y], %[z], %[w]\t\n"
         "mov %[rbp], %%rbp\t\n"
         : [z] "=&r" (z), [w] "=&r" (w)
         : [x] "r" (x), [y] "r" (y), [rbp] "m" (rbp)
         : "rdx"
         );
    return z + w;
}

int main() {
    uint64_t x = 15, y = 3, zw;
    if (inline_asm_uses_rbp()) {
        zw = with_rsp(x, y);
    } else {
        zw = with_rbp(x, y);
    }
    return zw;
}
Ideally, the if statement should compile away at compile-time (but I don't think I can do this with preprocessor macros, because those get evaluated before the code is assembled). So I'm fine with needing some sort of jump to get it to work, though I'd prefer to not need that.
The reason I need this is that I have some inline assembly that needs to be able to use 15 registers, plus some memory locations on the stack, and gcc is choosing rsp-based offsets in some locations where the function is inlined, and it's choosing rbp-based offsets in other locations. (A separate assembly module isn't a good match for this because I'd like to avoid the overhead of a function call.)

RVCT to ARM GCC porting (__uadd8)

I am porting code from the armcc (RVCT) compiler to ARM GNU GCC. I have pretty much figured out everything, but I am stuck at this point:
The code uses something like this:
unsigned int add_bytes(unsigned int val1, unsigned int val2)
{
    unsigned int res;
    res = __uadd8(val1, val2); /* res[7:0]   = val1[7:0]   + val2[7:0]
                                  res[15:8]  = val1[15:8]  + val2[15:8]
                                  res[23:16] = val1[23:16] + val2[23:16]
                                  res[31:24] = val1[31:24] + val2[31:24] */
    return res;
}
__uadd8 is RVCT specific. Is there something equivalent provided by GCC, or how else can I achieve this?
GCC doesn't provide intrinsics for the ARMv6 SIMD instructions. However, you can define your own __UADD8 as below.
__attribute__((always_inline)) static __inline__ uint32_t __UADD8(uint32_t op1, uint32_t op2)
{
    uint32_t result;

    __asm__ volatile ("uadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2));
    return (result);
}
This is from one of the CMSIS header files. I didn't test it myself, but including that file might give you every other v6 intrinsic as well. At worst you may need to do some copy-pasting.
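If a target lacks the ARMv6 SIMD instructions entirely, a plain-C fallback is also possible (a sketch of mine, not from the answer). UADD8 performs four modulo-256 byte additions (UQADD8 is the saturating variant), which is exactly what the classic SWAR trick computes; the GE flags that UADD8 sets for use with SEL are not modelled here.

#include <stdint.h>

/* Add the four bytes of op1 and op2 independently, wrapping each byte
 * modulo 256, with no carries leaking between byte lanes. */
static inline uint32_t uadd8_c(uint32_t op1, uint32_t op2)
{
    uint32_t sum = (op1 & 0x7F7F7F7Fu) + (op2 & 0x7F7F7F7Fu); /* add the low 7 bits of each lane */
    return sum ^ ((op1 ^ op2) & 0x80808080u);                 /* fold the top bit of each lane back in */
}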

How to deal with this: selected processor does not support `qadd16 r1,r1,r0'

I am developing an Android application and I am working with the NDK. While compiling the files I got the error "selected processor does not support `qadd16 r1,r1,r0'". Can anyone explain why and where this error comes from, and how to deal with it? Here is a code snippet from my basic_op.h file:
static inline Word32 L_add(register Word32 ra, register Word32 rb)
{
    Word32 out;

    __asm__("qadd %0, %1, %2"
            : "=r"(out)
            : "r"(ra), "r"(rb));

    return (out);
}
Thanks in advance
This happens because the QADD instruction is not supported on your target architecture (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211h/Chddhfig.html). To compile this code you need to enable ARMv7 support in the NDK.
Add the line
APP_ABI := armeabi-v7a
to your Application.mk and this code will compile perfectly:
static inline unsigned int L_add(register unsigned int ra, register unsigned int rb)
{
    unsigned int out;

    __asm__("qadd %0, %1, %2"
            : "=r"(out)
            : "r"(ra), "r"(rb));

    return (out);
}
P.S. I am using Android NDK r8.
P.P.S. Why do you need this ugly assembly? The output assembly listing for:
static inline unsigned int L_add(register unsigned int ra, register unsigned int rb)
{
    return (ra > 0xFFFFFFFF - rb) ? 0xFFFFFFFF : ra + rb;
}
still looks reasonably efficient, and it is much more portable!
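One caveat (my addition, not from the answer): QADD performs a signed saturating add and sets the Q flag, while the snippet above saturates unsigned values, so as a drop-in replacement for the ETSI-style L_add the signed version is usually what is wanted. A portable sketch, assuming Word32 is a signed 32-bit type:

#include <stdint.h>

/* Signed saturating 32-bit add (what QADD computes; the Q flag is not modelled). */
static inline int32_t L_add_c(int32_t ra, int32_t rb)
{
    int64_t sum = (int64_t)ra + rb;   /* widen first to avoid signed-overflow UB */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}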
