gcc target for AVX2 disabling SSE instruction set - c

We have a translation unit we want to compile with AVX2 (only that one):
We tell GCC up front, on the first line of the file:
#pragma GCC target "arch=core-avx2,tune=core-avx2"
This used to work with GCC 4.8 and 4.9 but from 6 onward (tried 7 and 8 too) we get this warning (that we treat as an error):
error: SSE instruction set disabled, using 387 arithmetics
It fires on the first function that returns a float. I have tried to re-enable SSE 4.2 (and AVX and AVX2) like so:
#pragma GCC target "sse4.2,arch=core-avx2,tune=core-avx2"
But that is not enough, the error persists.
EDIT:
Relevant compiler flags (we target AVX for most of the code):
-mfpmath=sse,387 -march=corei7-avx -mtune=corei7-avx
EDIT2: minimal sample:
#pragma GCC target "arch=core-avx2,tune=core-avx2"
#include <immintrin.h>
#include <math.h>
static inline float
lg1pf( float x ) {
    return log1pf(x)*1.44269504088896338700465f;
}

int main()
{
    log1pf(2.0f);
}
Compiled this way:
gcc -o test test.c -O2 -Wall -Werror -pedantic -std=c99 -mfpmath=sse,387 -march=corei7-avx -mtune=corei7-avx
In file included from /home/xxx/gcc-7.1.0/lib/gcc/x86_64-pc-linux-gnu/7.1.0/include/immintrin.h:45:0,
from test.c:3:
/home/xxx/gcc-7.1.0/lib/gcc/x86_64-pc-linux-gnu/7.1.0/include/avx512fintrin.h: In function ‘_mm_add_round_sd’:
/home/xxx/gcc-7.1.0/lib/gcc/x86_64-pc-linux-gnu/7.1.0/include/avx512fintrin.h:1412:1: error: SSE register return with SSE disabled
{
^
GCC details (I don't have the flags that were used to compile it though)
gcc --version
gcc (GCC) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Potential solution
#pragma GCC target "avx2"
Worked for me without other changes to the code.
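For reference, a sketch of the minimal sample with only the pragma changed (same compiler flags as above):
#pragma GCC target "avx2" /* plain feature names, no arch=/tune= */
#include <immintrin.h>
#include <math.h>

static inline float
lg1pf( float x ) {
    return log1pf(x)*1.44269504088896338700465f;
}

int main()
{
    log1pf(2.0f);
}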
Related problem: applying the target attribute to individual functions did not work either.
__attribute__((__target__("arch=broadwell"))) // does not compile
__m256 use_avx(__m256 a) { return _mm256_add_ps(a,a); }
__attribute__((__target__("avx2,arch=broadwell"))) // does not compile
__m256 use_avx(__m256 a) { return _mm256_add_ps(a,a); }
__attribute__((__target__("avx2"))) // compiles
__m256 use_avx(__m256 a) { return _mm256_add_ps(a,a); }

This looks like a bug. #pragma GCC target before #include <immintrin.h> breaks the header somehow; I don't know why. Even if AVX2 was enabled on the command line with -march=haswell, a #pragma seems to break inlining of any intrinsics defined after that.
You can use #pragma after the header, but then using intrinsics that weren't enabled on the command line fails.
Even a more modern target name like #pragma GCC target "arch=haswell" causes the error, so it's not that the old nebulous target names like corei7-avx are broken in general. They still work on the command line. If you want to enable something for a whole file, the standard way is to use compiler options and not pragmas.
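For example (hypothetical file name), a build can pass the target options for just that one translation unit:
gcc -O2 -march=haswell -c avx2_kernels.c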
GCC does claim to support target options on a per-function basis with pragmas or __attribute__, though. https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html.
This is as far as I've gotten playing around with this (Godbolt compiler explorer with gcc8.1). Clang is unaffected because it ignores #pragma GCC target. (So the #pragma is not very portable; you probably want your code to work with any GNU C compiler, not just gcc itself.)
// breaks gcc when before immintrin.h
// #pragma GCC target "arch=haswell"
#include <immintrin.h>
#include <math.h>

//#pragma GCC target "arch=core-avx2,tune=core-avx2"
#pragma GCC target "arch=haswell"

//static inline
float
lg1pf( float x ) {
    return log1pf(x)*1.44269504088896338700465f;
}

// can accept / return wide vectors
__m128 nop(__m128 a) { return a; }
__m256 require_avx(__m256 a) { return a; }

// but error on using intrinsics if #include happened without target options
//__m256 use_avx(__m256 a) { return _mm256_add_ps(a,a); }

// this works, though, because AVX is enabled at this point
// presumably so would __builtin_ia32_whatever
// Without `arch=haswell`, this breaks, so we know the pragma "worked"
__m256 use_native_vec(__m256 a) { return a+a; }

Related

Getting "call to '__bad_copy_to' declared with attribute error: copy destination size is too small" after disabling optimization

This is part of a linux device driver.
uint32_t val = 0;
static long axpu_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
...
copy_from_user(&val,(int32_t *)arg, sizeof(val));
...
}
I could compile this fine and was doing some debugging. To keep some of the code from being optimized away, I changed it like this:
#pragma GCC push_options
#pragma GCC optimize ("O0")
uint32_t val = 0;
static long axpu_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
...
copy_from_user(&val,(int32_t *)arg, sizeof(val)); // <-- line 364
...
}
#pragma GCC pop_options
This is a trick I've been using for a while and it has worked just fine (though before, I used it in Linux kernel code for boot debugging). This time I'm using it in a device driver, and now the compiler gives me this error:
In file included from ./arch/arm64/include/asm/preempt.h:5,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:51,
from ./include/linux/seqlock.h:36,
from ./include/linux/time.h:6,
from ./arch/arm64/include/asm/stat.h:12,
from ./include/linux/stat.h:6,
from ./include/linux/module.h:10,
from /home/ckim/testlin540/axpu_ldd_kc.c:8:
In function 'check_copy_size',
inlined from 'copy_from_user' at ./include/linux/uaccess.h:143:6,
inlined from 'axpu_ioctl' at /home/ckim/testlin540/axpu_ldd_kc.c:364:4:
./include/linux/thread_info.h:160:4: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
160 | __bad_copy_to();
| ^~~~~~~~~~~~~~~
I tried putting the #pragma block around #include <linux/uaccess.h>, to no avail.
Some say it's a gcc problem, but I'm not sure. This is linux-5.4.188 code, and this is the compiler version:
ckim#ckim-ubuntu:~/testlin540$ aarch64-none-linux-gnu-gcc --version
aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 10.2-2020.11 (arm-10.16)) 10.2.1 20201103
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
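The error comes from check_copy_size(): its compile-time size check relies on the optimizer folding the __bad_copy_*() calls away, and #pragma GCC optimize ("O0") prevents exactly that, so the error-attributed call survives and the diagnostic fires even though the copy itself is fine. One possible workaround (an untested sketch, not from the thread): leave optimization on and make the variable volatile instead, so the debug value can't be optimized away:
/* volatile: every read/write of val must happen, even at -O2 */
static volatile uint32_t val = 0;

static long axpu_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    if (copy_from_user((uint32_t *)&val, (uint32_t __user *)arg, sizeof(val)))
        return -EFAULT;
    /* ... */
    return 0;
}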

pragma omp for simd does not generate vector instructions in GCC

Short: Does the OpenMP directive #pragma omp for simd generate code that uses SIMD registers?
Longer:
As stated in the OpenMP documentation, "The worksharing-loop SIMD construct specifies that the iterations of one or more associated loops will be distributed across threads that already exist [..] using SIMD instructions". From this statement, I would expect the following code (simd.c) to use XMM, YMM or ZMM registers when compiled with gcc simd.c -o simd -fopenmp, but it does not.
#include <stdio.h>
#define N 100

int main() {
    int x[N];
    int y[N];
    int z[N];
    int i;
    int sum = 0; /* must be initialized: the reduction adds into the original value */

    for(i=0; i < N; i++) {
        x[i] = i;
        y[i] = i;
    }

    #pragma omp parallel
    {
        #pragma omp for simd
        for(i=0; i < N; i++) {
            z[i] = x[i] + y[i];
        }

        #pragma omp for simd reduction(+:sum)
        for(i=0; i < N; i++) {
            sum += x[i];
        }
    }

    printf("%d %d\n", z[N/2], sum);
    return 0;
}
When checking the assembler generated running gcc simd.c -S -fopenmp no SIMD register is used.
I can use SIMD registers without OpenMP using the option -O3, because according to the GCC documentation it includes the -ftree-vectorize flag.
XMM registers: gcc simd.c -o simd -O3
YMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512
ZMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512 -mprefer-vector-width=512
However, using the flags -march=skylake-avx512 -mprefer-vector-width=512 combined with -fopenmp does not generate SIMD instructions.
Therefore, I can easily vectorize my code with -O3 without the pragma omp for simd, but not the other way around.
At this point, my purpose is not to generate SIMD instructions but to understand how OpenMP SIMD directives work in GCC and how to generate SIMD instructions with OpenMP alone (without -O3).
Enable at least -O2 for -fopenmp to work, and for performance in general
gcc simd.c -S -fopenmp
GCC's default is -O0, anti-optimized for consistent debugging. It's never going to auto-vectorize at -O0, because that's pointless when every value of i from the C source has to exist in memory, and so on. (See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)
Auto-vectorization is also impossible when you have to be able to single-step source lines one at a time, and even modify i or memory contents at runtime with the debugger, and have the program keep running the way you'd expect the C abstract machine to.
Building without any optimization is utter garbage for performance; it's insane to even consider if you care about performance enough to be using OpenMP. (Except of course for actual debugging.) Often the speedup from anti-optimized to optimized scalar is more than what you could gain from vectorizing that scalar code, but both can be large factors so you definitely want optimizations beyond auto-vectorization.
I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation it includes the -ftree-vectorize flag.
Right, so do that. -O3 -march=native -flto is usually your best bet for code that will run on the compile host. Also -fno-trapping-math -fno-math-errno should be safe for everything and enable some better FP function inlining, even if you don't want -ffast-math. Also preferably -fprofile-generate / -fprofile-use profile-guided optimization (PGO), to unroll hot loops and choose branchy vs. branchless appropriately, etc.
#pragma omp parallel is still effective at -O3 -fopenmp - GCC doesn't enable autoparallelization by default.
Also, #pragma omp simd will use a different vectorization style sometimes. In your case, it seems to make GCC forget that it knows the arrays are 16-byte aligned, and use movdqu loads (when AVX isn't available for an unaligned memory source operand for paddd xmm0, [rax]). Compare https://godbolt.org/z/8q8Dqm - the main._omp_fn.0: helper function that main calls doesn't assume alignment. (Although maybe it can't after division by number of threads splits up the array into ranges, if GCC doesn't bother to do vector-sized chunks?)
Use -O2 -fopenmp to get what you were expecting
OpenMP will let gcc vectorize more easily or efficiently for loops where you didn't use restrict on pointer args to functions to let it know that arrays don't overlap, or for floating point to let it pretend that FP math is associative even if you didn't use -ffast-math.
Or if you enable some optimization but not full optimization (e.g. -O2 which doesn't include -ftree-vectorize), then #pragma omp will work the way you expected.
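As a sketch of both points (function names are mine; compile the first with -O3 or -O2 -ftree-vectorize, the second with just -O2 -fopenmp or -fopenmp-simd):
/* Vectorizable because restrict promises the arrays don't overlap,
 * so no runtime overlap check is needed. */
void add_restrict(int *restrict z, const int *restrict x,
                  const int *restrict y, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* Vectorized even at plain -O2 with -fopenmp: the pragma asserts the
 * iterations are independent, without needing restrict. */
void add_simd(int *z, const int *x, const int *y, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}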
Note that the x[i] = i; y[i] = i; init loop doesn't get auto-vectorized at -O2, but the #pragma loops are; and without -fopenmp, it's pure scalar. (Godbolt compiler explorer)
The serial -O3 code will run faster for this small N because thread-startup overhead is nowhere near worth it. But for large N, parallelization could help if a single core can't saturate memory bandwidth (e.g. on a Xeon, but most dual/quad-core desktop CPUs can almost saturate mem bandwidth with one core). Or if your arrays are hot in cache on different cores.
Unfortunately(?) even GCC -O3 doesn't manage to do constant-propagation through your whole code and just print the result. Or to fuse the z[i] = x[i]+y[i] loop with the sum(x[]) loop.

C preprocessing fails to stop immediately after an #error

My question today should not be very complicated, but I simply can't find a reason/solution. As a small, reproducible example, consider the following toy C code:
#define _state_ 0

#if _state_ == 1
int foo(void) {return 1;}
#else

/* check GCC flags first
   note that -mfma will automatically turn on -mavx, as shown by
   [gcc -mfma -dM -E - < /dev/null | egrep "SSE|AVX|FMA"]
   so it is sufficient to check for -mfma only */
#ifndef __FMA__
#error "Please turn on GCC flag: -mfma"
#endif

#include <immintrin.h> /* All OK, compile C code */

void foo (double *A, double *B) {
    __m256d A1_vec = _mm256_load_pd(A);
    __m256d B_vec = _mm256_broadcast_sd(B);
    __m256d C1_vec = A1_vec * B_vec;
}

#endif
I am going to compile this test.c file with
gcc -fpic -O2 -c test.c
Note I did not turn on the GCC flag -mfma, so the #error will be triggered. What I would expect is that compilation stops immediately after GCC sees this #error, but this is what I got with GCC 5.3:
test.c:14:2: error: #error "Please turn on GCC flag: -mfma"
#error "Please turn on GCC flag: -mfma"
^
test.c: In function ‘foo’:
test.c:22:11: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
__m256d A1_vec = _mm256_load_pd(A);
^
GCC does stop, but why does it also pick up a line after #error? Any explanation? Thanks.
For people who want to try: there are some hardware requirements. You need an x86-64 machine with the AVX and FMA instruction sets.
I have a draft copy of the C ISO spec, and §4/4 states
The implementation shall not successfully translate a preprocessing translation unit containing a #error preprocessing directive unless it is part of a group skipped by conditional inclusion.
Later on, in §6.10.5, where #error is formally defined, it says
A preprocessing directive of the form
# error pp-tokens_opt new-line
causes the implementation to produce a diagnostic message that includes the specified
sequence of preprocessing tokens.
In other words, the spec only requires that any code that has an #error just needs to fail to compile and report an error message along the way, not that the compilation immediately needs to terminate as soon as #error is reached.
Given that it's considered good practice to always check the top-level errors reported by a compiler before later ones, I'd imagine a competent programmer who saw a string of errors beginning with an #error directive would likely know what was going on.
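If you want nothing after the #error to be translated at all, you can exploit the "skipped by conditional inclusion" clause the spec mentions and guard the rest of the file yourself (a sketch based on the question's code):
#ifndef __FMA__
#error "Please turn on GCC flag: -mfma"
#else
/* Everything here sits in a group that is skipped when __FMA__ is
   undefined, so it produces no further diagnostics after the #error. */
#include <immintrin.h>
void foo (double *A, double *B) {
    __m256d A1_vec = _mm256_load_pd(A);
    __m256d B_vec = _mm256_broadcast_sd(B);
    __m256d C1_vec = A1_vec * B_vec;
    (void)C1_vec; /* silence the unused-variable warning */
}
#endif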

Can I use Intel syntax of x86 assembly with GCC?

I want to write a small low-level program. For some parts of it I will need to use assembly language, but the rest of the code will be written in C/C++.
So, if I use GCC to mix C/C++ with assembly code, do I need to use AT&T syntax, or can I use Intel syntax? Or how do you mix C/C++ and asm (Intel syntax) in some other way?
I realize that maybe I don't have a choice and must use AT&T syntax, but I want to be sure.
And if there turns out to be no choice, where can I find full/official documentation about the AT&T syntax?
Thanks!
If you are using separate assembly files, gas has a directive to support Intel syntax:
.intel_syntax noprefix # not recommended for inline asm
which uses Intel syntax and doesn't need the % prefix before register names.
(You can also run as with -msyntax=intel -mnaked-reg to have that as the default instead of att, in case you don't want to put .intel_syntax noprefix at the top of your files.)
Inline asm: compile with -masm=intel
For inline assembly, you can compile your C/C++ sources with gcc -masm=intel (See How to set gcc to use intel syntax permanently? for details.) The compiler's own asm output (which the inline asm is inserted into) will use Intel syntax, and it will substitute operands into asm template strings using Intel syntax like [rdi + 8] instead of 8(%rdi).
This works with GCC itself and ICC, but for clang only clang 14 and later.
(Not released yet, but the patch is in current trunk.)
Using .intel_syntax noprefix at the start of inline asm, and switching back with .att_syntax can work, but will break if you use any m constraints. The memory reference will still be generated in AT&T syntax. It happens to work for registers because GAS accepts %eax as a register name even in intel-noprefix mode.
Using .att_syntax at the end of an asm() statement will also break compilation with -masm=intel; in that case GCC's own asm after (and before) your template will be in Intel syntax. (Clang doesn't have that "problem"; each asm template string is local, unlike GCC where the template string truly becomes part of the text file that GCC sends to as to be assembled separately.)
Related:
GCC manual: asm dialect alternatives: writing an asm statement with {att | intel} in the template so it works when compiled with -masm=att or -masm=intel. See an example using lock cmpxchg, and the sketch after this list.
https://stackoverflow.com/tags/inline-assembly/info for more about inline assembly in general; it's important to make sure you're accurately describing your asm to the compiler, so it knows what registers and memory are read / written.
AT&T syntax: https://stackoverflow.com/tags/att/info
Intel syntax: https://stackoverflow.com/tags/intel-syntax/info
The x86 tag wiki has links to manuals, optimization guides, and tutorials.
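A minimal sketch of such a dialect-alternatives template (my own toy example, not the manual's lock cmpxchg one):
#include <stdint.h>

static inline uint64_t add42(uint64_t x)
{
    /* Expands to "addq $42, %0" under -masm=att and "add %0, 42"
       under -masm=intel, so the same source assembles either way. */
    asm("add{q $42, %0| %0, 42}" : "+r"(x));
    return x;
}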
You can use inline assembly with -masm=intel as ninjalj wrote, but it may cause errors when you include C/C++ headers that themselves contain AT&T-syntax inline assembly. This is code to reproduce the errors on Cygwin.
sample.cpp:
#include <cstdint>
#include <iostream>
#include <boost/thread/future.hpp>
int main(int argc, char* argv[]) {
    using Value = uint32_t;
    Value value = 0;
    asm volatile (
        "mov %0, 1\n\t"       // Intel syntax
        // "movl $1, %0\n\t"  // AT&T syntax
        :"=r"(value)::);
    auto expr = [](void) -> Value { return 20; };
    boost::unique_future<Value> func { boost::async(boost::launch::async, expr) };
    std::cout << (value + func.get());
    return 0;
}
When I built this code, I got the error messages below.
g++ -E -std=c++11 -Wall -o sample.s sample.cpp
g++ -std=c++11 -Wall -masm=intel -o sample sample.cpp -lboost_system -lboost_thread
/tmp/ccuw1Qz5.s: Assembler messages:
/tmp/ccuw1Qz5.s:1022: Error: operand size mismatch for `xadd'
/tmp/ccuw1Qz5.s:1049: Error: no such instruction: `incl DWORD PTR [rax]'
/tmp/ccuw1Qz5.s:1075: Error: no such instruction: `movl DWORD PTR [rcx],%eax'
/tmp/ccuw1Qz5.s:1079: Error: no such instruction: `movl %eax,edx'
/tmp/ccuw1Qz5.s:1080: Error: no such instruction: `incl edx'
/tmp/ccuw1Qz5.s:1082: Error: no such instruction: `cmpxchgl edx,DWORD PTR [rcx]'
To avoid these errors, separate the inline assembly (the upper half of the code) from the C/C++ code that needs boost::future and the like (the lower half). Pass -masm=intel only when compiling the .cpp files that contain Intel-syntax inline assembly, not the other .cpp files.
sample.hpp:
#include <cstdint>
using Value = uint32_t;
extern Value GetValue(void);
sample1.cpp: compile with -masm=intel
#include <iostream>
#include "sample.hpp"
int main(int argc, char* argv[]) {
    Value value = 0;
    asm volatile (
        "mov %0, 1\n\t" // Intel syntax
        :"=r"(value)::);
    std::cout << (value + GetValue());
    return 0;
}
sample2.cpp: compile without -masm=intel
#include <boost/thread/future.hpp>
#include "sample.hpp"
Value GetValue(void) {
    auto expr = [](void) -> Value { return 20; };
    boost::unique_future<Value> func { boost::async(boost::launch::async, expr) };
    return func.get();
}
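A plausible build recipe for this split (mirroring the earlier command line) would be:
g++ -std=c++11 -Wall -masm=intel -c sample1.cpp
g++ -std=c++11 -Wall -c sample2.cpp
g++ -o sample sample1.o sample2.o -lboost_system -lboost_thread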

Segmentation fault using OpenMP and SSE

I'm just getting started experimenting adding OpenMP to some SSE code.
My first test program SOMETIMES crashes in _mm_set_ps, but works when I change the if (1) to if (0).
It looks so simple I must be missing something obvious.
I'm compiling with gcc -fopenmp -g -march=core2 -pthreads
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main()
{
    #pragma omp parallel if (1)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                __m128 x1 = _mm_set_ps ( 1.1f, 2.1f, 3.1f, 4.1f );
            }
            #pragma omp section
            {
                __m128 x2 = _mm_set_ps ( 1.2f, 2.2f, 3.2f, 4.2f );
            }
        } // end omp sections
    } // end omp parallel
    return 0;
}
This is a bug in the OpenMP implementation. I was having the same problem in gcc on Windows (MinGW). The -mstackrealign command-line option solved my problem: it adds an instruction to the prologue of every function to realign the stack at the 16-byte boundary, and I didn't notice any performance penalty. You can also try adding __attribute__ ((force_align_arg_pointer)) to a function declaration, which should do the same, but only for that specific function. You might have to put the SSE code in a separate function that you then call from the function with #pragma omp, so that the stack has a chance to be realigned.
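A sketch of the per-function variant (helper name is mine):
#include <immintrin.h>

/* The attribute makes the prologue realign the stack to 16 bytes, so
   SSE spills/loads inside are safe even if the caller's stack was
   only 4-byte aligned. */
__attribute__((force_align_arg_pointer))
static void do_sse_section(void)
{
    __m128 x = _mm_set_ps(1.1f, 2.1f, 3.1f, 4.1f);
    (void)x;
}
You would call this from inside the #pragma omp section blocks instead of using _mm_set_ps there directly.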
I stopped having the problem when I moved to compiling for a 64-bit target (MinGW64, such as the TDM GCC build).
I am playing with AVX instructions, which require 32-byte alignment, but GCC doesn't support that for Windows at all. This forced me to fix the produced assembly code with a Python script, but it works.
I smell unaligned memory access. It's the only way code like that could explode (assuming that is the only code there). For that to happen, the XMM registers wouldn't be used but rather stack memory, which is only aligned to 4 bytes; my guess is the OpenMP code is messing up the alignment of the stack.
