I'm trying to learn to code using intrinsics, and below is some code that does an addition.
Compiler used: icc
#include<stdio.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);
printf("%d\n",c[2]);
return 0;
}
I get the below error:
test.c(9): error: expression must have pointer-to-object type
printf("%d\n",c[2]);
How do I print the values in the variable c, which is of type __m128i?
Use this function to print them:
#include <stdint.h>
#include <string.h>
void print128_num(__m128i var)
{
uint16_t val[8];
memcpy(val, &var, sizeof(val));
printf("Numerical: %i %i %i %i %i %i %i %i \n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7]);
}
You split the 128 bits into 16-bit (or 32-bit) elements before printing them.
This is a way of 64-bit splitting and printing if you have 64-bit support available:
#include <inttypes.h>
void print128_num(__m128i var)
{
int64_t v64val[2];
memcpy(v64val, &var, sizeof(v64val));
printf("%.16" PRIx64 " %.16" PRIx64 "\n", (uint64_t)v64val[1], (uint64_t)v64val[0]); // PRIx64 matches the 64-bit type on every ABI
}
Note: casting &var directly to an int* or uint16_t* would also work with MSVC, but this violates strict aliasing and is undefined behaviour. Using memcpy is the standard-compliant way to do the same thing, and with minimal optimization the compiler will generate exactly the same binary code.
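For the 32-bit elements used in the question, the same memcpy approach could look like the sketch below. The helper names m128i_to_i32 and print128_i32 are made up for this illustration and are not part of any header.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the four 32-bit elements of var into out[0..3],
   lowest element first (memory order). */
void m128i_to_i32(__m128i var, int32_t out[4])
{
    memcpy(out, &var, 4 * sizeof(int32_t));
}

/* Hypothetical helper: print the four elements as signed decimals,
   lowest element first. */
void print128_i32(__m128i var)
{
    int32_t val[4];
    m128i_to_i32(var, val);
    printf("%d %d %d %d\n", (int)val[0], (int)val[1], (int)val[2], (int)val[3]);
}
```

With a and b from the question, print128_i32(c) prints 8 6 4 2, because _mm_set_epi32 takes the highest element first. The value the question tried to read as c[2] is val[2], which is 4.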
Portable across gcc/clang/ICC/MSVC, C and C++.
fully safe with all optimization levels: no strict-aliasing violation UB
print in hex as u8, u16, u32, or u64 elements (based on @AG1's answer)
Prints in memory order (least-significant element first, like _mm_setr_epiX). Reverse the array indices if you prefer printing in the same order Intel's manuals use, where the most significant element is on the left (like _mm_set_epiX). Related: Convention for displaying vector registers
Using a __m128i* to load from an array of int is safe because the __m128 types are defined to allow aliasing just like ISO C unsigned char*. (e.g. in gcc's headers, the definition includes __attribute__((may_alias)).)
The reverse isn't safe (pointing an int* onto part of a __m128i object). MSVC guarantees that's safe, but GCC/clang don't. (-fstrict-aliasing is on by default). It sometimes works with GCC/clang, but why risk it? It sometimes even interferes with optimization; see this Q&A. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
See GCC AVX _m256i cast to int array leads to wrong values for a real-world example of GCC breaking code which points an int* at a __m256i.
(uint32_t*) &my_vector violates the C and C++ aliasing rules, and is not guaranteed to work the way you'd expect. Storing to a local array and then accessing it is guaranteed to be safe. It even optimizes away with most compilers, so you get movq / pextrq directly from xmm to integer registers instead of an actual store/reload, for example.
Source + asm output on the Godbolt compiler explorer: proof it compiles with MSVC and so on.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
void p128_hex_u8(__m128i in) {
alignas(16) uint8_t v[16];
_mm_store_si128((__m128i*)v, in);
printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
void p128_hex_u16(__m128i in) {
alignas(16) uint16_t v[8];
_mm_store_si128((__m128i*)v, in);
printf("v8_u16: %x %x %x %x, %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
void p128_hex_u32(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}
void p128_hex_u64(__m128i in) {
alignas(16) unsigned long long v[2]; // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
_mm_store_si128((__m128i*)v, in);
printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
If you need portability to C99 or C++03 or earlier (i.e. without C11 / C++11), remove the alignas() and use storeu instead of store. Or use __attribute__((aligned(16))) or __declspec( align(16) ) instead.
(If you're writing code with intrinsics, you should be using a recent compiler version. Newer compilers usually make better asm than older compilers, including for SSE/AVX intrinsics. But maybe you want to use gcc-6.3 with -std=gnu++03 C++03 mode for a codebase that isn't ready for C++11 or something.)
Sample output from calling all 4 functions on
// source used:
__m128i vec = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16);
// output:
v2_u64: 0x807060504030201 0x100f0e0d0c0b0a09
v4_u32: 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d
v8_u16: 0x201 0x403 0x605 0x807 | 0xa09 0xc0b 0xe0d 0x100f
v16_u8: 0x1 0x2 0x3 0x4 | 0x5 0x6 0x7 0x8 | 0x9 0xa 0xb 0xc | 0xd 0xe 0xf 0x10
Adjust the format strings if you want to pad with leading zeros for consistent output width. See printf(3).
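As a sketch of two of the adjustments mentioned above (the unaligned store for pre-C11 code, and zero-padded fields), the u32 printer could be written like this. The name fmt128_hex_u32 and the choice to format into a caller-supplied buffer are assumptions made for this example, not part of the answer's code.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats the four 32-bit elements of `in` into buf,
   in memory order, each zero-padded to 8 hex digits.
   Returns whatever snprintf returns. */
int fmt128_hex_u32(char *buf, size_t bufsize, __m128i in)
{
    uint32_t v[4];                       /* no alignas here ... */
    _mm_storeu_si128((__m128i*)v, in);   /* ... so use the unaligned store */
    return snprintf(buf, bufsize, "%08x %08x %08x %08x",
                    (unsigned)v[0], (unsigned)v[1],
                    (unsigned)v[2], (unsigned)v[3]);
}
```

For the sample vector _mm_setr_epi8(1, ..., 16) used above, the buffer then holds 04030201 08070605 0c0b0a09 100f0e0d.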
I know this question is tagged C, but it was the best search result also when looking for a C++ solution to the same problem.
So, this could be a C++ implementation:
#include <string>
#include <cstring>
#include <sstream>
#if defined(__SSE2__)
template <typename T>
std::string __m128i_toString(const __m128i var) {
std::stringstream sstr;
T values[16/sizeof(T)];
std::memcpy(values,&var,sizeof(values)); //See discussion below
if (sizeof(T) == 1) {
for (unsigned int i = 0; i < sizeof(__m128i); i++) { //C++11: Range for also possible
sstr << (int) values[i] << " ";
}
} else {
for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) { //C++11: Range for also possible
sstr << values[i] << " ";
}
}
return sstr.str();
}
#endif
Usage:
#include <iostream>
[..]
__m128i x;
[..]
std::cout << __m128i_toString<uint8_t>(x) << std::endl;
std::cout << __m128i_toString<uint16_t>(x) << std::endl;
std::cout << __m128i_toString<uint32_t>(x) << std::endl;
std::cout << __m128i_toString<uint64_t>(x) << std::endl;
Result:
141 114 0 0 0 0 0 0 151 104 0 0 0 0 0 0
29325 0 0 0 26775 0 0 0
29325 0 26775 0
29325 26775
Note: there is a simple way to avoid the if (sizeof(T) == 1), see https://stackoverflow.com/a/28414758/2436175
#include<stdio.h>
#include<stdint.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
const int32_t* q;
//add a pointer
c = _mm_add_epi32(a,b);
q = (const int32_t*) &c;
printf("%d\n",q[2]);
//printf("%d\n",c[2]);
return 0;
}
Try this code.
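Note that the int32_t* cast in this snippet is the pattern the aliasing discussion above warns about with GCC/clang. A sketch of one way to read element 2 without going through a pointer at all, using only SSE2 intrinsics (the helper name lane2_i32 is made up for this example):

```c
#include <emmintrin.h>

/* Hypothetical helper: returns 32-bit element 2 of v
   (counting from the lowest element). */
int lane2_i32(__m128i v)
{
    /* Shift the whole register right by 8 bytes so element 2 lands in
       element 0, then move the low 32 bits to an integer register. */
    return _mm_cvtsi128_si32(_mm_srli_si128(v, 8));
}
```

With c from the question, printf("%d\n", lane2_i32(c)); prints 4.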
I am wondering if the following example is a Clang SA false positive, and if so, is there a way to suppress it?
The key here is that I am copying a structure containing bit-fields by casting it to a word instead of doing a field-by-field copy (or memcpy). Neither the field-by-field copy nor memcpy triggers a warning, but copying as a word (after casting) raises an "uninitialized access" warning. This is on an embedded system where only word access is possible and these types of word copies are commonplace.
Below is the example code:
#include <stdio.h>
#include <string.h>
struct my_fields_t {
unsigned int f0: 16;
unsigned int f1: 8;
unsigned int f2: 8;
};
int main(void) {
struct my_fields_t var1, var2;
// initialize all the fields in var1.
var1.f0 = 1;
var1.f1 = 2;
var1.f2 = 3;
// Method #1: copy var1 -> var2 as a word (sizeof(unsigned int) = 4).
unsigned int *src = (unsigned int *) &var1;
unsigned int *dest = (unsigned int *) &var2;
*dest = *src;
// Method #2: copy var1->var2 field-by-field [NO SA WARNINGS]
// var2.f0 = var1.f0;
// var2.f1 = var1.f1;
// var2.f2 = var1.f2;
// Method #3: use memcpy to copy var1 to var2 [NO SA WARNINGS]
// memcpy(&var2, &var1, sizeof(struct my_fields_t));
printf("%d, %d, %d\n", var1.f0, var1.f1, var1.f2);
printf("%d, %d, %d\n", var2.f0, var2.f1, var2.f2); // <--- Function call argument is an uninitialized value
printf("sizeof(unsigned int) = %ld\n", sizeof(unsigned int));
}
Here's the output:
$ clang --version
clang version 4.0.0 (tags/RELEASE_401/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
$ clang -Wall clang_sa.c
$ ./a.out
1, 2, 3
1, 2, 3
sizeof(unsigned int) = 4
$ scan-build clang clang_sa.c
scan-build: Using '<snipped>/clang-4.0' for static analysis
clang_sa.c:33:3: warning: Function call argument is an uninitialized value
printf("%d, %d, %d\n", var2.f0, var2.f1, var2.f2); // <--- Function call argument is an uninitialized value
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bug found.
In the above example, it is quite clear that all the fields in var2 will be initialized by the word copy. So, clang SA shouldn't complain about uninitialized access.
I appreciate any help/insight.
In terms of suppressing a specific warning, from the documentation:
Q: How can I suppress a specific analyzer warning?
There is currently no solid mechanism for suppressing an analyzer warning, although this is currently being investigated. ...
But on the next question, it shows you that you can mark a block of code to be skipped over during static analysis by surrounding the code with an #ifdef block:
Q: How can I selectively exclude code the analyzer examines?
When the static analyzer is using clang to parse source files, it implicitly defines the preprocessor macro __clang_analyzer__. One can use this macro to selectively exclude code the analyzer examines. ...
So, you could do it like this:
#ifdef __clang_analyzer__
#define COPY_STRUCT(DEST, SRC) (DEST) = (SRC)
#else
#define COPY_STRUCT(DEST, SRC) do { \
const unsigned int *src = (const void *)&(SRC); \
unsigned int *dest = (void *)&(DEST); \
*dest = *src; \
} while(0)
#endif
COPY_STRUCT(var2, var1);
I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
static __m128 const m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible.
Yet the compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); when I use a C compiler.
Yet I don't want the first 3 values to be recalculated in each function call.
One solution is to inline it (But sometimes the compilers reject that).
Is there a C style to achieve it in case the function isn't inlined?
Thank You.
Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure, without static this is what happens: (from godbolt)
FastExpSse(float __vector(4)):
movaps xmm1, XMMWORD PTR .LC0[rip]
cmpleps xmm1, xmm0
mulps xmm0, XMMWORD PTR .LC1[rip]
cvtps2dq xmm0, xmm0
paddd xmm0, XMMWORD PTR .LC2[rip]
andps xmm0, xmm1
ret
.LC0:
.long 3266183168
.long 3266183168
.long 3266183168
.long 3266183168
.LC1:
.long 1262004795
.long 1262004795
.long 1262004795
.long 1262004795
.LC2:
.long 1064866805
.long 1064866805
.long 1064866805
.long 1064866805
_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, which is what @benjarobin's answer shows.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.
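Applied to the function from the question, a C-compatible version along these lines might look like the sketch below. The scalar wrapper fast_exp_scalar is a hypothetical helper added only for trying the function out on single values.

```c
#include <emmintrin.h>

/* Same computation as in the question, with plain (non-static) const locals
   so that it also compiles as C. */
__m128 FastExpSse(__m128 x)
{
    const __m128  a   = _mm_set1_ps(12102203.2f);  /* (1 << 23) / ln(2) */
    const __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    const __m128  m87 = _mm_set1_ps(-87.0f);
    /* elements with x < -87 are forced to 0 by the mask */
    __m128  mask = _mm_cmpge_ps(x, m87);
    __m128i tmp  = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

/* Hypothetical helper: evaluate the approximation for one float by
   broadcasting it and reading back the lowest element. */
float fast_exp_scalar(float x)
{
    return _mm_cvtss_f32(FastExpSse(_mm_set1_ps(x)));
}
```

This is only an approximation: fast_exp_scalar(1.0f) lands within a few percent of e, and inputs below -87 return 0.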
This solution is clearly not portable, it's working with GCC 8 (only tested with this compiler):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
static void print128(const void *p)
{
unsigned char buf[16];
memcpy(buf, p, 16);
for (int i = 0; i < 16; ++i)
{
printf("%02X ", buf[i]);
}
printf("\n");
}
int main(void)
{
static __m128 const glob_a = INIT_M128(12102203.2f);
static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const glob_m87 = INIT_M128(-87.0f);
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
print128(&a);
print128(&glob_a);
print128(&b);
print128(&glob_b);
print128(&m87);
print128(&glob_m87);
return 0;
}
As explained in @harold's answer, in C the following code produces exactly the same machine code whether or not it is built with WITHSTATIC defined.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
static __m128 const a = INIT_M128(12102203.2f);
static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const m87 = INIT_M128(-87.0f);
#else
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
#endif
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So, in summary, it's better to remove the static and const keywords: the C++ code becomes simpler and better, and the C code stays portable, which my proposed hack is not.
I ran into an interesting issue when compiling some code with -O3 using clang on OSX High Sierra. The code is this:
#include <stdint.h>
#include <limits.h> /* for CHAR_BIT */
#include <stdio.h> /* for printf() */
#include <stddef.h> /* for size_t */
uint64_t get_morton_code(uint16_t x, uint16_t y, uint16_t z)
{
/* Returns the number formed by interleaving the bits in x, y, and z, also
* known as the morton code.
*
* See https://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious.
*/
size_t i;
uint64_t a = 0;
for (i = 0; i < sizeof(x)*CHAR_BIT; i++) {
a |= (x & 1U << i) << (2*i) | (y & 1U << i) << (2*i + 1) | (z & 1U << i)
<< (2*i + 2);
}
return a;
}
int main(int argc, char **argv)
{
printf("get_morton_code(99,159,46) = %llu\n", get_morton_code(99,159,46));
return 0;
}
When compiling this with cc -O1 -o test_morton_code test_morton_code.c I get the following output:
get_morton_code(99,159,46) = 4631995
which is correct. However, when compiling with cc -O3 -o test_morton_code test_morton_code.c:
get_morton_code(99,159,46) = 4294967295
which is wrong.
What is also odd is that this bug appears in my code when switching from -O2 to -O3 whereas in the minimal working example above it appears when going from -O1 to -O2.
Is this a bug in the compiler optimization or am I doing something stupid that's only appearing when the compiler is optimizing more aggressively?
I'm using the following version of clang:
snotdaqs-iMac:snoFitter snoperator$ cc --version
Apple LLVM version 9.1.0 (clang-902.0.39.1)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
UndefinedBehaviorSanitizer is really helpful in catching such mistakes:
$ clang -fsanitize=undefined -O3 o3.c
$ ./a.out
o3.c:19:2: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
get_morton_code(99,159,46) = 4294967295
A possible fix would be replacing the 1Us with 1ULL, an unsigned long long is at least 64 bit and can be shifted that far.
When i is 15 in the loop, 2*i+2 is 32, and you are shifting an unsigned int by the number of bits in an unsigned int, which is undefined.
You apparently intend to work in a 64-bit field, so cast the left side of the shift to uint64_t.
A proper printf format for uint64_t is get_morton_code(99,159,46) = %" PRIu64 "\n". PRIu64 is defined in the <inttypes.h> header.
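Putting these suggestions together, a corrected version could look like the following sketch. It widens every operand to uint64_t before shifting and prints with PRIu64. The helper print_morton_code is a hypothetical wrapper added for this example.

```c
#include <inttypes.h>
#include <limits.h>   /* for CHAR_BIT */
#include <stddef.h>   /* for size_t */
#include <stdint.h>
#include <stdio.h>

/* Interleave the bits of x, y and z; every shift is done in 64 bits. */
uint64_t get_morton_code(uint16_t x, uint16_t y, uint16_t z)
{
    uint64_t a = 0;
    size_t i;
    for (i = 0; i < sizeof(x) * CHAR_BIT; i++) {
        uint64_t bit = (uint64_t)1 << i;
        a |= ((uint64_t)x & bit) << (2*i)
           | ((uint64_t)y & bit) << (2*i + 1)
           | ((uint64_t)z & bit) << (2*i + 2);
    }
    return a;
}

/* Hypothetical helper: print the result with the matching format macro. */
void print_morton_code(uint16_t x, uint16_t y, uint16_t z)
{
    printf("get_morton_code(%u,%u,%u) = %" PRIu64 "\n",
           (unsigned)x, (unsigned)y, (unsigned)z, get_morton_code(x, y, z));
}
```

print_morton_code(99, 159, 46) then prints get_morton_code(99,159,46) = 4631995 regardless of the optimization level, since no shift reaches the width of its operand.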
For input 0xffffffff, the following C code works fine with no optimization, but produces wrong results when compiled with -O1. The other compilation options are -g -m32 -Wall. The code was tested with clang-900.0.39.2 on macOS 10.13.2.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
if (argc < 2) return 1;
char *endp;
int x = (int)strtoll(argv[1], &endp, 0);
int mask1 = 0x55555555;
int mask2 = 0x33333333;
int count = (x & mask1) + ((x >> 1) & mask1);
int v1 = count >> 2;
printf("v1 = %#010x\n", v1);
int v2 = v1 & mask2;
printf("v2 = %#010x\n", v2);
return 0;
}
Input: 0xffffffff
Outputs with -O0: (expected)
v1 = 0xeaaaaaaa
v2 = 0x22222222
Outputs with -O1: (wrong)
v1 = 0x2aaaaaaa
v2 = 0x02222222
Below are disassembled instructions for the line "int v1 = count >> 2;" with -O0 and -O1.
With -O0:
sarl $0x2, %esi
With -O1:
shrl $0x2, %esi
Below are disassembled instructions for the line "int v2 = v1 & mask2;" with -O0 and -O1.
With -O0:
andl -0x24(%ebp), %esi //-0x24(%ebp) stores 0x33333333
With -O1:
andl $0x13333333, %esi //why does the optimization changes 0x33333333 to 0x13333333?
In addition, if x is set to 0xffffffff locally instead of getting its value from arguments, the code will work as expected even with -O1.
P.S.: The code is an experimental piece based on my solution to the Data Lab from the CS:APP course @ CMU. The lab asks the student to implement a function that counts the number of 1 bits in an int variable without using any type other than int.
As several commenters have pointed out, right-shifting signed values is not well defined.
I changed the declaration and initialization of x to
unsigned int x = (unsigned int)strtoll(argv[1], &endp, 0);
and got consistent results under -O0 and -O1. (But before making that change, I was able to reproduce your result under clang under MacOS.)
As you have discovered, you invoke implementation-defined behavior when you try to store 0xffffffff (4294967295) in int x (where INT_MAX is 0x7fffffff, or 2147483647). See C11 Standard §6.3.1.3 (draft n1570), Signed and unsigned integers. Whenever you use strtoll (or strtoull; the versions with a single l would also be fine) and store the result in an int, you must check the result against INT_MAX before making the assignment with a cast (or, if using exact-width types, against INT32_MAX, or UINT32_MAX for unsigned).
Further, in circumstances such as this, where bit operations are involved, you can remove uncertainty and ensure portability by using the exact-width types provided in stdint.h and the associated format specifiers provided in inttypes.h. Here, there is no need for a signed int. It would make more sense to handle all values as unsigned (or uint32_t).
For example, the following provides a default value for the input to avoid the Undefined Behavior invoked if your code is executed without argument (you can also simply test argc), replaces the use of strtoll with strtoul, validates the input fits within the associated variable before assignment handling the error if it does not, and then makes use of the unambiguous exact types, e.g.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
int main (int argc, char *argv[]) {
uint64_t tmp = argc > 1 ? strtoul (argv[1], NULL, 0) : 0xffffffff;
if (tmp > UINT32_MAX) {
fprintf (stderr, "input exceeds UINT32_MAX.\n");
return 1;
}
uint32_t x = (uint32_t)tmp,
mask1 = 0x55555555,
mask2 = 0x33333333,
count = (x & mask1) + ((x >> 1) & mask1),
v1 = count >> 2,
v2 = v1 & mask2;
printf("v1 = 0x%" PRIx32 "\n", v1);
printf("v2 = 0x%" PRIx32 "\n", v2);
return 0;
}
Example Use/Output
$ ./bin/masktst
v1 = 0x2aaaaaaa
v2 = 0x22222222
Compiled with
$ gcc -Wall -Wextra -pedantic -std=gnu11 -Ofast -o bin/masktst masktst.c
Look things over and let me know if you have further questions.
this statement:
int x = (int)strtoll(argv[1], &endp, 0);
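performs an out-of-range conversion to int, which is implementation-defined rather than undefined. The undefined behavior comes later: with x equal to -1, (x & mask1) + ((x >> 1) & mask1) adds 0x55555555 to itself and overflows a signed int.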
(On my system, the result is: -1431655766.)
The resulting values tend to go downhill from there:
The variable: v1 receives: -357913942
The variable: v2 receives: 572662306
the %x format specifier only works correctly with unsigned variables
I'm trying to learn how to write gcc inline assembly.
The following code is supposed to perform an shl instruction and return the result.
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
int left = x;
__asm__ ("shl %1, %0"
:"=r"(left)
:"i"(b), "0"(left));
return left;
}
int main()
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Compilation fails with error: impossible constraint in asm
The problem is basically with "i"(b). I've tried "o", "n", "m" among others but it still doesn't work. Either its this error or operand size mismatch.
What am I doing wrong?
As written, your code compiles correctly for me (I have optimization enabled). However, I believe you may find this to be a bit better:
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
__asm__ ("shl %b[shift], %[value]"
: [value] "+r"(x)
: [shift] "Jc"(b)
: "cc");
return x;
}
int main(int argc, char *argv[])
{
uint64_t a = 1000000000;
uint64_t res = rotate(a, 10);
printf("%llu\n", res);
return 0;
}
Note that the 'J' is for 64bit. If you are using 32bit, 'I' is the correct value.
Other things of note:
You are truncating your rotate value from uint64_t to int? Are you compiling for 32bit code? I don't believe shl can do 64bit rotates when compiled as 32bit.
Allowing 'c' on the input constraint means you can use variable rotate amounts (ie not hard-coded at compile time).
Since shl modifies the flags, use "cc" to let the compiler know.
Using the [name] form makes the asm easier to read (IMO).
The %b is a modifier. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#i386Operandmodifiers
If you want to really get smart about inline asm, check out the latest gcc docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
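To illustrate the point about variable shift amounts, here is a minimal sketch that always passes the count in cl, so it works with a count that is only known at run time. The name shl64, the simplification to a plain "c" constraint, and the plain-C fallback for other compilers or targets are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: shift x left by a run-time count (0..63). */
uint64_t shl64(uint64_t x, unsigned char count)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* count is an 8-bit operand pinned to the c register, so it is
       emitted as %cl; x is both read and written ("+r"). */
    __asm__ ("shl %[shift], %[value]"
             : [value] "+r"(x)
             : [shift] "c"(count)
             : "cc");            /* shl modifies the flags */
    return x;
#else
    /* Fallback for other compilers/targets: same result in plain C. */
    return x << (count & 63);
#endif
}
```

shl64(1000000000, 10) returns 1024000000000, the same value the plain C expression 1000000000ULL << 10 gives.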