I am porting code from the armcc compiler to ARM GNU GCC. I have pretty much figured everything out, but I am stuck at this point:
The code uses something like this:
unsigned int add_bytes(unsigned int val1, unsigned int val2)
{
unsigned int res;
res = __uadd8(val1,val2); /* res[7:0] = val1[7:0] + val2[7:0]
res[15:8] = val1[15:8] + val2[15:8]
res[23:16] = val1[23:16] + val2[23:16]
res[31:24] = val1[31:24] + val2[31:24]
*/
return res;
}
__uadd8 is RVCT specific. Is there something equivalent provided by GCC, or how else can I achieve this?
GCC doesn't provide intrinsics for the ARMv6 SIMD instructions. However, you can define your own __UADD8 like the one below.
__attribute__( ( always_inline ) ) static __inline__ uint32_t __UADD8(uint32_t op1, uint32_t op2)
{
uint32_t result;
__asm__ volatile ("uadd8 %0, %1, %2" : "=r" (result) : "r" (op1), "r" (op2) );
return(result);
}
This is from one of the CMSIS header files. I didn't test it myself, but including that file should give you all the other v6 intrinsics as well. At worst you may need to do some copy-pasting.
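If you also need to build for cores without the ARMv6 SIMD instructions, the same byte-wise addition can be written in portable C. The following is just an illustrative SWAR sketch (my own, not CMSIS code): it adds the four bytes in parallel while discarding the carry out of each byte, which is exactly what UADD8 does.

#include <stdint.h>

static inline uint32_t uadd8_portable(uint32_t op1, uint32_t op2)
{
    /* Add the low 7 bits of every byte lane, then fix up the top bits. */
    uint32_t sum = (op1 & 0x7F7F7F7Fu) + (op2 & 0x7F7F7F7Fu);
    /* Bit 7 of each lane is the XOR of the operands' top bits and the
       carry already sitting in bit 7 of the masked sum. */
    return sum ^ ((op1 ^ op2) & 0x80808080u);
}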
Are the following 16-byte atomic operations correctly implemented? Are there any better alternatives?
typedef struct {
uintptr_t low;
uintptr_t high;
} uint128_atomic;
uint128_atomic load_relaxed(uint128_atomic const *atomic)
{
uint128_atomic ret;
asm volatile("xor %%eax, %%eax\n"
"xor %%ebx, %%ebx\n"
"xor %%ecx, %%ecx\n"
"xor %%edx, %%edx\n"
"lock; cmpxchg16b %1"
: "=A"(ret)
: "m"(*atomic)
: "cc", "rbx", "rcx");
return ret;
}
bool cmpexch_weak_relaxed(
uint128_atomic *atomic,
uint128_atomic *expected,
uint128_atomic desired)
{
bool matched;
uint128_atomic e = *expected;
asm volatile("lock; cmpxchg16b %1\n"
"setz %0"
: "=q"(matched), "+m"(atomic->ui)
: "a"(e.low), "d"(e.high), "b"(desired.low), "c"(desired.high)
: "cc");
return matched;
}
void store_relaxed(uint128_atomic *atomic, uint128_atomic val)
{
uint128_atomic old = *atomic;
asm volatile("lock; cmpxchg16b %0"
: "+m"(*atomic)
: "a"(old.low), "d"(old.high), "b"(val.low), "c"(val.high)
: "cc");
}
For a full working example, check out:
https://godbolt.org/g/CemfSg
An updated implementation can be found here: https://godbolt.org/g/vGNQG5
I came up with the following implementation after applying all the suggestions from @PeterCordes, @David Wohlferd, and @prl. Thanks a lot!
struct _uint128_atomic {
volatile uint64_t low;
volatile uint64_t high;
} __attribute__((aligned(16)));
typedef struct _uint128_atomic uint128_atomic;
bool
cmpexch_weak_relaxed(
uint128_atomic *atomic,
uint128_atomic *expected,
uint128_atomic desired)
{
bool matched;
uint128_atomic e = *expected;
asm volatile("lock cmpxchg16b %1"
: "=#ccz"(matched), "+m"(*atomic), "+a"(e.low), "+d"(e.high)
: "b"(desired.low), "c"(desired.high)
: "cc");
if (!matched)
*expected = e;
return matched;
}
uint128_atomic
load_relaxed(uint128_atomic const *atomic)
{
uint128_atomic ret = {0, 0};
asm volatile("lock cmpxchg16b %1"
: "+A"(ret)
: "m"(*atomic), "b"(0), "c"(0)
: "cc");
return ret;
}
void
store_relaxed(uint128_atomic *atomic, uint128_atomic val)
{
uint128_atomic old = *atomic;
while (!cmpexch_weak_relaxed(atomic, &old, val))
;
}
Please keep in mind that the implementation is GCC-specific and will not work on clang. Clang's implementation of GCC-style inline assembly is suboptimal at best, and garbage at worst.
The GCC implementation can also be found on Godbolt's Compiler Explorer here.
A suboptimal, but working, clang implementation can be found here.
Why don't you just use the C11 atomic intrinsics?
#include <stdatomic.h>
inline __uint128_t load_relaxed(_Atomic __uint128_t *obj)
{
return atomic_load_explicit(obj, memory_order_relaxed);
}
inline _Bool cmpexch_weak_relaxed(_Atomic __uint128_t *obj,
__uint128_t *expected,
__uint128_t desired)
{
return atomic_compare_exchange_weak_explicit(obj, expected, desired,
memory_order_relaxed, memory_order_relaxed);
}
This compiles to more-or-less the assembly you wrote, using clang 4.0.1 and -march=native. But, unlike what you wrote, the compiler actually understands what's going on, so code generation around these functions will be correct. There is, as far as I know, no way to annotate a GNU-style assembly insert to tell the compiler that it has the semantics of an atomic operation.
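For completeness, a relaxed 16-byte store can be expressed through the same C11 API. This is a sketch in the same style; depending on the GCC version, you may additionally need -mcx16 and -latomic for 16-byte atomics:

#include <stdatomic.h>

inline void store_relaxed(_Atomic __uint128_t *obj, __uint128_t desired)
{
    /* The compiler emits a correct 16-byte atomic store sequence. */
    atomic_store_explicit(obj, desired, memory_order_relaxed);
}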
No, you need "+a" and "+d" in cmpexch_weak_relaxed and store_relaxed.
Other than that, I don't see any problems. (I compared to my own implementations in working software.)
As for improvements, I suggest:
uint128_atomic load_relaxed(uint128_atomic const *atomic)
{
uint128_atomic ret = { 0, 0 };
asm volatile("lock; cmpxchg16b %1"
: "+A"(ret)
: "m"(*atomic), "b"(0), "c"(0)
: "cc");
return ret;
}
(I see that David Wohlferd also made this suggestion in a comment.)
I'm working on an ARMv7 platform, and encountered a register-access problem.
The registers in the device module have a strict word (32-bit) access requirement:
typedef unsigned char u8;
struct reg {
u8 byte0; u8 byte1; u8 byte2; u8 byte3;
};
When I write C code like reg.byte0 = 0x3, GCC normally generates byte-access assembly (LDRB/STRB), and this byte operation leads to undefined behavior on my platform.
Is there an option so that GCC will produce a read-modify-write sequence, i.e. a word LDR r1, [r0], mask in byte0, then a word STR, rather than byte opcodes?
UPDATE: The destination I want to access is a device register on the SoC. It has 4 fields, and we use a struct to represent this register. Accessing the byte0 field like reg.byte0 = 3 normally generates byte-access assembly code. I want to know whether C code of this kind can be compiled to word-access (32-bit, LDR) code.
Really sorry for my poor English!
UPDATE: The example is just a simplification of the real-world code; volatile and memory barriers are also used in the Linux driver, I just forgot to add them to the examples. It's an ARM11 that I'm working on.
1) memcpy does not seem workable for me: different registers have different fields, so I cannot write an access inline function for every one of them.
2) Using a union seems promising; I'll post results when I complete the test.
UPDATE2: I just tested the union approach, and it still does not work on my platform.
I think the better way is to use explicit word access and not confuse the compiler.
UPDATE3: It seems someone else posted the exact same question, and it has been resolved: Force GCC to access structs with words
Thanks, guys!
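For reference, the explicit word access from the linked resolution looks roughly like this. This is a sketch that assumes a little-endian, memory-mapped 32-bit register:

#include <stdint.h>

static inline void write_byte0_word(volatile uint32_t *reg, uint8_t val)
{
    uint32_t tmp = *reg;          /* word read:  LDR */
    tmp = (tmp & ~0xFFu) | val;   /* replace byte0 (little-endian layout) */
    *reg = tmp;                   /* word write: STR */
}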
You could go with inline assembly:
static inline __attribute__((always_inline)) u8 read_reg_b0(const struct reg *rp) {
    struct reg r;
    u32 tmp; /* u32: a 32-bit unsigned typedef, like u8 above */
    /* The compiler cannot narrow an asm instruction, so this stays a word load. */
    __asm__("ldr %0, %1" : "=r" (tmp) : "m" (*rp));
    memcpy(&r, &tmp, 4); /* needs <string.h>; GCC optimizes the copy away */
    return r.byte0;
}
static inline __attribute__((always_inline)) void write_reg_b0(struct reg *rp, u8 b0) {
    struct reg r;
    u32 tmp;
    /* Word load, byte update in a temporary, then a word store back. */
    __asm__("ldr %0, %1" : "=r" (tmp) : "m" (*rp));
    memcpy(&r, &tmp, 4);
    r.byte0 = b0;
    memcpy(&tmp, &r, 4);
    __asm__("str %1, %0" : "=m" (*rp) : "r" (tmp));
}
GCC will optimize away the memcpy calls, but it cannot change the assembly instructions, so the accesses stay word-sized.
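Usage would look something like this; the register address below is a made-up placeholder:

#define DEV_REG ((struct reg *)0x40001000) /* placeholder address */

void example(void)
{
    u8 b = read_reg_b0(DEV_REG);  /* word LDR, then extract byte0 */
    write_reg_b0(DEV_REG, 0x3);   /* word LDR, modify byte0, word STR */
    (void)b;
}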
I'm trying to learn how to write gcc inline assembly.
The following code is supposed to perform an shl instruction and return the result.
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
int left = x;
__asm__ ("shl %1, %0"
:"=r"(left)
:"i"(b), "0"(left));
return left;
}
int main()
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Compilation fails with error: impossible constraint in asm
The problem is basically with "i"(b). I've tried "o", "n", and "m" among others, but it still doesn't work: it's either this error or an operand size mismatch.
What am I doing wrong?
As written, your code compiles correctly for me (I have optimization enabled). However, I believe you may find this to be a bit better:
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
__asm__ ("shl %b[shift], %[value]"
: [value] "+r"(x)
: [shift] "Jc"(b)
: "cc");
return x;
}
int main(int argc, char *argv[])
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Note that the 'J' is for 64-bit. If you are compiling 32-bit code, 'I' is the correct value.
Other things of note:
You are truncating your rotate value from uint64_t to int? Are you compiling for 32-bit code? I don't believe shl can do 64-bit shifts when compiled as 32-bit.
Allowing 'c' on the input constraint means you can use variable shift amounts (i.e. not hard-coded at compile time).
Since shl modifies the flags, use "cc" to let the compiler know.
Using the [name] form makes the asm easier to read (IMO).
The %b is a modifier. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#i386Operandmodifiers
If you want to really get smart about inline asm, check out the latest gcc docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
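As an aside, if what you ultimately want is a real rotate (as the function name suggests), plain C is often enough. This sketch is not part of the answer above, but GCC recognizes the idiom and typically compiles it to a single rol instruction:

#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, unsigned b)
{
    b &= 63;                                   /* keep the count in range */
    return (x << b) | (x >> ((64 - b) & 63));  /* well-defined for b == 0 */
}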
I want to use Intel's PCLMULQDQ instruction with inline assembly in my C Code for multiplying two polynomials, which are elements in GF(2^n). Compiler is GCC 4.8.1.
The polynomials are stored in arrays of uint32_t (6 fields big).
I have already searched the web for how to use the PCLMULQDQ instruction or the CLMUL instruction set properly, but didn't find any good documentation.
I would really appreciate a simple example in C and asm of how to multiply two simple polynomials with the instruction. Does anybody know how to do it?
Besides, are there any prerequisites (other than a capable processor), such as libraries to include, compiler options, etc.?
I have found a solution myself. So, for the record:
#include <string.h> /* memset */
#include <stdint.h>

/* Compile with -mpclmul (or -march=native on a CPU with CLMUL). */
void f2m_intel_mult(
    uint32_t t, // length of arrays A and B
    uint32_t *A,
    uint32_t *B,
    uint32_t *C // result array, 2*t words
    )
{
    memset(C, 0, 2*t*sizeof(uint32_t));
    /* The "i" constraint below needs a compile-time constant, so this
       only compiles with optimization enabled. */
    uint32_t offset = 0;
    union { uint64_t val; struct { uint32_t low; uint32_t high; } halfs; } prod;
    uint32_t i;
    uint32_t j;
    for (i = 0; i < t; i++) {
        for (j = 0; j < t; j++) {
            prod.halfs.low = A[i];
            prod.halfs.high = 0;
            asm ("pclmulqdq %2, %1, %0;"
                 : "+x"(prod.val)
                 : "x"(B[j]), "i"(offset)
                );
            C[i+j] = C[i+j] ^ prod.halfs.low;
            C[i+j+1] = C[i+j+1] ^ prod.halfs.high;
        }
    }
}
I think it is possible to use 64-bit operands with pclmulqdq, but I couldn't find out how to get this working with inline assembly. Does anybody know how?
Nevertheless, it is also possible to do the same with intrinsics; a sketch follows. (If you want the full code, just ask.)
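Here is a minimal sketch of the intrinsics route (my illustration, not the full code referred to above). It multiplies two 32-bit polynomials using _mm_clmulepi64_si128 from wmmintrin.h; compile with -mpclmul:

#include <stdint.h>
#include <wmmintrin.h> /* PCLMUL intrinsics */

static inline uint64_t clmul32(uint32_t a, uint32_t b)
{
    __m128i va = _mm_cvtsi32_si128((int)a); /* a in the low dword */
    __m128i vb = _mm_cvtsi32_si128((int)b);
    /* imm 0x00 selects the low qword of both operands. */
    __m128i prod = _mm_clmulepi64_si128(va, vb, 0x00);
    return (uint64_t)_mm_cvtsi128_si64(prod); /* product fits in 63 bits */
}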
Besides, it is possible to optimize the calculation further with Karatsuba multiplication if you know the size t of the arrays.
I am developing an Android application using the NDK. While compiling the files I got the error: selected processor does not support `qadd16 r1,r1,r0'. Can anyone explain why this error occurs and how to deal with it? Here is a code snippet from my basic_op.h file:
static inline Word32 L_add(register Word32 ra, register Word32 rb)
{
Word32 out;
__asm__("qadd %0, %1, %2"
: "=r"(out)
: "r"(ra), "r"(rb));
return (out);
}
Thanks in advance
This happens because the QADD instruction is not supported on your target architecture (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211h/Chddhfig.html). To compile this code you need to enable ARMv7 support in the NDK.
Add the line
APP_ABI := armeabi-v7a
to your Application.mk and this code will compile perfectly:
static inline unsigned int L_add(register unsigned int ra, register unsigned int rb)
{
unsigned int out;
__asm__("qadd %0, %1, %2"
: "=r"(out)
: "r"(ra), "r"(rb));
return (out);
}
P.S. I am using Android NDK r8.
P.P.S. Why do you need this ugly assembly? The generated assembly listing for:
static inline unsigned int L_add(register unsigned int ra, register unsigned int rb)
{
return (ra > 0xFFFFFFFF - rb) ? 0xFFFFFFFF : ra + rb;
}
still looks reasonably efficient, and it is much more portable!
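One caveat (mine, not part of the original answer): QADD saturates signed 32-bit values, while the snippet above saturates unsigned ones. A portable signed version could look like this:

#include <stdint.h>

static inline int32_t l_add_signed(int32_t ra, int32_t rb)
{
    if (rb > 0 && ra > INT32_MAX - rb)
        return INT32_MAX; /* positive overflow saturates */
    if (rb < 0 && ra < INT32_MIN - rb)
        return INT32_MIN; /* negative overflow saturates */
    return ra + rb;
}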