Can I make #pragma unroll accept macros/expressions rather than plain numbers?

I am trying to tell my compiler to unroll a loop for me using #pragma unroll. However, the number of iterations is determined by a compile-time variable, so the loop needs to be unrolled that many times. Like this:
#define ITEMS 4
#pragma unroll (ITEMS + 1)
for (unsigned int ii = 0; ii <= ITEMS; ++ii)
    /* do something */;
The compiler doesn't like this, though, as it gives me the following warning: warning: extra characters in the unroll pragma (expected a single positive integer), ignoring pragma for this loop. I understand what this means, of course: it wants a single integer rather than an expression. Is there a way to do this, though, without changing the unroll parameter every time I change ITEMS?
The compiler I am using is CUDA's NVCC compiler.

You could do it the other way around:
Note: Daniel Fischer's comment, which I just noticed, suggests exactly the same thing before me.
#define ITEMS_PLUS_ONE 5
#define ITEMS (ITEMS_PLUS_ONE - 1)
The issue is that the preprocessor doesn't do math. It only does copy&paste.
When you write #define ITEMS_PLUS_ONE (ITEMS + 1), ITEMS_PLUS_ONE is replaced with (4 + 1), not with 5.
Once this reaches the compiler, it doesn't matter. Even without optimization, the calculation is done during compilation, and (4 + 1) is exactly the same as 5.
But in your compiler, #pragma unroll is processed before compilation proper, and it wants a plain number.
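Putting the reversed definitions together, a minimal sketch (this assumes NVCC macro-expands the pragma's argument down to a plain integer, which is exactly what this workaround relies on):

#define ITEMS_PLUS_ONE 5
#define ITEMS (ITEMS_PLUS_ONE - 1)

/* Expands to "#pragma unroll 5" -- a single positive integer. */
#pragma unroll ITEMS_PLUS_ONE
for (unsigned int ii = 0; ii <= ITEMS; ++ii)
    /* do something */;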

Related

Where is the source of imprecise calculation in the assembler code of gcc -Ofast compared with -O3? [closed]

The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":
int32_t i = -5;
const double sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1
Obviously, sqr_N_min_1 gets 25., and in the 3rd line (-5 * -5) / 25 should become 1., so that the overall result of the 3rd line is exactly 0. Indeed, this is true with the compiler options "gcc -O3 -march=skylake".
But with "-Ofast" the last line yields -2.081668e-17 instead of 0., and with values of i other than -5 (e.g. 6 or 7) it produces other very small positive or negative deviations from 0.
My question is: Where exactly is the source of this imprecision?
To investigate this, I wrote a small test program in C:
#include <stdint.h> /* int32_t */
#include <stdio.h>

#define MAX_SIZE 10

double W[MAX_SIZE];

int main( int argc, char *argv[] )
{
    volatile int32_t n = 6; /* try 6, 7, or argv[1][0]-'0' */
    double *w = W;
    int32_t i = 1 - n;
    const int32_t end = n - 1;
    const double sqr_N_min_1 = (double)i * i;

    /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
    do {
        *w++ = 1. - ((double)i * i) / sqr_N_min_1;
    } while ( (i+=2) <= end );

    /* Then, show the results (only the 1st and last output line matters): */
    w = W;
    i = 1 - n;
    do {
        fprintf( stderr, "%e\n", *w++ );
    } while ( (i+=2) <= end );

    return( 0 );
}
Godbolt shows me the assembly produced by "x86-64 gcc 9.3" with the options "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please inspect the five columns of the site (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of the 1st assembly, 5. output of the 2nd assembly):
Godbolt site with five columns
As you can see, the differences in the assemblies are obvious, but I can't figure out where exactly the imprecision comes from. So the question is: which assembler instruction(s) are responsible for this?
A follow-up question is: Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.
Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Not in general.
-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath
Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)
This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)
There's no way of formulating your source to avoid all possible problems.
Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.
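To illustrate, here is a sketch of that volatile-temporary idea using the names from the test program above (it pins the order and rounding of every step, at a real speed cost):

volatile double sq = (double)i * i;      /* forced to memory: nothing fuses */
volatile double quot = sq / sqr_N_min_1; /* the actual division must happen */
*w++ = 1. - quot;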
GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.
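A self-contained illustration of that effect; the exact behavior depends on the GCC version, but -ffinite-math-only licenses folding isnan() to 0:

#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;
    double x = zero / zero;  /* NaN at run time */
    /* Under -Ofast, GCC may fold isnan(x) to 0 at compile time,
       so this can print "not NaN" even though x is a NaN. */
    printf(isnan(x) ? "NaN\n" : "not NaN\n");
    return 0;
}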
From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:
Convert i to double and square it.
Divide that by sqr_N_min_1.
Subtract that from 1.
and, for -Ofast, computes it:
Prior to the loop, calculate the reciprocal of sqr_N_min_1.
Convert i to double and square it.
Compute the fused multiply-subtract of 1 minus the square times the reciprocal.
The latter improves speed because it calculates the division only once, and multiplication is much faster than division in the target processors. On top of that, the fused operation is faster than a separate multiplication and subtraction.
The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.
Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.
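In C terms, here is a sketch of what the -Ofast version effectively computes per element (the function and names are illustrative, not the literal compiler output; the point is the reciprocal-plus-FMA shape):

#include <math.h> /* fma(); link with -lm */

double ofast_style(double i_val, double sqr_N_min_1)
{
    double recip = 1.0 / sqr_N_min_1; /* hoisted: one division, rounded once */
    double sq = i_val * i_val;
    /* fma(-sq, recip, 1.0) computes 1 - sq*recip with a single rounding
       at the end, so the reciprocal's rounding error is preserved rather
       than cancelled by a second rounding. */
    return fma(-sq, recip, 1.0);
}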

Why can GCC only do loop interchange optimization when the int size is a compile-time constant?

When I compile this snippet (with -Ofast -floop-nest-optimize) gcc generates assembly which traverses the array in source order.
However, if I uncomment the line // n = 32767 and assign any number to n, it interchanges the index order to x[i * n + j]. Traversing memory in contiguous row-major order is much more cache-friendly than striding down columns.
float matrix_sum_column_major(float* x, int n) {
    // n = 32767;
    float sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * n + i];
    return sum;
}
On godbolt
Why can't GCC or clang do loop interchange with a runtime-variable int size? Real-world code won't usually have the size declared explicitly.
PS: I've tried this with different versions of gcc and with clang-9, and it seems to happen with both.
PS2: Even if I make x a local variable malloc'ed inside the function, it still happens.
Compilers generally focus their efforts (and should focus them) on constructs that programmers interested in efficiency are likely to use, and that can be replaced with other constructs easily proven equivalent in all cases that matter. If n is a constant, a compiler can determine the exact set of array indices the loop will use and then figure out how to process all of them. If n isn't constant, a compiler might be able to determine that when n is positive the code uses all indices from 0 to n*n-1, but that would likely require a lot more effort. The authors of clang and gcc might have been able to make such a determination in this case if they tried hard enough, but they likely thought the effort wasn't worthwhile.
Note that if code will use a few particular values of n far more than any others, having the code explicitly check for those values and use loops tailored to them may let a compiler generate far more efficient code for those loops than would be possible for loops that must handle an arbitrary n, as sketched below. Because many real-world problems would likely have some values of n that get used much more than others, it would not be unreasonable for a compiler writer to assume that programmers interested in performance would use such special-purpose loops, and that spending effort improving the arbitrary-n loop may offer less benefit than spending the same effort elsewhere.
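A hypothetical sketch of that special-purpose-loop idea (the hot size 1024 and the helper name are invented for illustration):

/* Dispatch a hot, compile-time-constant size to its own copy of the
   loop; with n fixed, the optimizer can prove the interchange safe. */
static float sum_fixed_1024(const float *x)
{
    float sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            sum += x[j * 1024 + i];
    return sum;
}

float matrix_sum_any(const float *x, int n)
{
    if (n == 1024) /* assumed hot value */
        return sum_fixed_1024(x);
    float sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * n + i];
    return sum;
}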

C compiler bug or program error?

I am using the IAR C compiler to build an application for an embedded micro (specifically a Renesas uPD78F0537). In this application I am using two nested for loops to initialize some data, as shown in the following MCVE:
#include <stdio.h>

#define NUM_OF_OBJS  254
#define MAX_OBJ_SIZE 4

unsigned char objs[NUM_OF_OBJS][MAX_OBJ_SIZE];
unsigned char srcData[NUM_OF_OBJS][MAX_OBJ_SIZE];

void main(void)
{
    srcData[161][2] = 10;

    int x, y;
    for (x = 0; x < NUM_OF_OBJS; x++)
    {
        for (y = 0; y < MAX_OBJ_SIZE; y++)
        {
            objs[x][y] = srcData[x][y];
        }
    }
    printf("%d\n", (int) objs[161][2]);
}
The output value is 0, not 10.
The compiler is generating the following code for the for loop:
     13          int x, y;
     14          for (x = 0; x < NUM_OF_OBJS; x++)
   \   0006 14....       MOVW      DE,#objs
   \   0009 16....       MOVW      HL,#srcData
     15          {
     16              for (y = 0; y < MAX_OBJ_SIZE; y++)
   \   000C A0F8         MOV       X,#248
     17              {
     18                  objs[x][y] = srcData[x][y];
   \   ??main_0:
   \   000E 87           MOV       A,[HL]
   \   000F 95           MOV       [DE],A
     19              }
   \   0010 86           INCW      HL
   \   0011 84           INCW      DE
   \   0012 50           DEC       X
   \   0013 BDF9         BNZ       ??main_0
     20          }
The above does not work: the compiler is apparently precalculating NUM_OF_OBJS x MAX_OBJ_SIZE = 1016 (0x3f8). This value is used as a counter; however, it is truncated to 8 bits (0xf8 == 248) and stored in the 8-bit register 'X'. As a result, only the first 248 bytes of data are initialised, instead of the full 1016 bytes.
I can work around this, however my question is: Is this a compiler bug? Or am I overlooking something?
Update
This microcontroller has 7KB of RAM, and sizeof(int) == 2
I am pretty much convinced that this is a compiler bug based on the following:
Copying the data over using pointers (e.g. something along the lines of while (len-- > 0) *dst++ = *src++;) works fine. Thus this does not look like a problem of RAM size, pointer size, etc.
More relevant perhaps: If I just replace one of the two constants (NUM_OF_OBJS or MAX_OBJ_SIZE) with a static variable (thus preventing the compiler from precalculating the total count) it works fine.
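For reference, a fleshed-out sketch of that pointer-based workaround (the len type and the use of sizeof are my assumptions; the arrays are the ones from the example above):

/* Copy all 1016 bytes through pointers; there is no combined loop
   bound for the compiler to fold into an 8-bit counter. With
   sizeof(int) == 2 on this target, len holds 1016 comfortably. */
unsigned char *src = &srcData[0][0];
unsigned char *dst = &objs[0][0];
unsigned int len = sizeof objs; /* NUM_OF_OBJS * MAX_OBJ_SIZE = 1016 */
while (len-- > 0)
    *dst++ = *src++;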
Unfortunately I contacted IAR (providing a link to this SO question) and this is their answer:
Sorry to read that you don't have a license/support-agreement (SUA).
An examination such as needed for this case can take time, and we
(post-sales-support) prioritize to put time and effort to users with
valid SUA.
Along with some generic comments which are not particularly useful.
So I guess I'll just assume this is a bug and work around it in my code (there are many possible workarounds, including the two described above).
Although technically one may assume that those arrays are in .bss and zeroed, compilers now issue warnings when you actually make that assumption and read something from .bss before writing it. The arrays are global, so an interrupt or the like could come in and modify them; is it really safe for a compiler to assume they are zeros? Gcc does not make this assumption: it copies the whole array over. Clang/llvm likewise copies the whole array and does not take a shortcut. A shortcut would have looked like writing the one value to both arrays and printing that value, instead of copying the whole array.
Interestingly, even if the arrays are declared static, or made local, gcc is not able to figure out the shortcut, which is strange for gcc; it isn't the best, but it usually figures out much more complicated dead-code loops.
What is the size of int for this compiler target? Maybe int is 8 bits and the compiler performed the action you asked for. If not, then perhaps yes, this is a compiler bug. Try making the 254 a little smaller: does the 0xF8 change by a proportional amount? Does it always look like the lower 8 bits are being carved off, or is 0xF8 some magic number with a relationship to 1016 or 254 or 161, etc.?
Given that this is the full code, the compiler isn't obliged to do anything, since the code is nonsense. The compiler is free to optimize away the whole loop since it does nothing.
A fully standard compliant compiler must initialize both objs and srcData to all-zeroes, since they have static storage duration.
Therefore the nested loop does nothing but shovelling zeroes from one array to another. If the compiler notices that this loop is pointless, it is free to remove the loop entirely.
Of course it doesn't make much sense to reduce the number of iterations in the loop as a way of optimization, so one may wonder what weird decisions the optimizer took to come up with that machine code. Odd and dumb as it may seem, it is fully standard compliant.
You could declare the loop iterators as volatile to enforce side-effects. In that case the loop has to be executed even though it is pointless, since reading/writing to a volatile variable is a side-effect, which the compiler isn't allowed to optimize away.
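A minimal sketch of that volatile-iterator workaround (the loop from the question, with only the declarations changed):

/* With volatile counters, every read and write of x and y is a side
   effect, so the compiler must execute all 254 * 4 iterations. */
volatile int x, y;
for (x = 0; x < NUM_OF_OBJS; x++)
{
    for (y = 0; y < MAX_OBJ_SIZE; y++)
    {
        objs[x][y] = srcData[x][y];
    }
}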
Please keep in mind that embedded systems compilers often have a non-standard, "minimal start-up" option where they skip the initialization of static storage duration variables to achieve faster system boot-up. If such a non-standard option is enabled, the variables will contain garbage values instead of zeroes.

optimizing a line of C code for 8 bit processor

I'm working on an 8-bit processor and have written code in a C compiler. More than 140 lines of code take just 1200 bytes, yet this single line takes more than 200 bytes of ROM space. eeprom_read() is a function; the problem seems to be the multiplications by 1000, 100, and 10.
romAddr = eeprom_read(146)*1000 + eeprom_read(147)*100 +
          eeprom_read(148)*10  + eeprom_read(149);
Processor is 8-bit and data type of romAddr is int. Is there any way to write this line in a more optimized way?
It's possible that the thing that uses the most space is the use of multiplication. If your processor lacks an instruction to do multiplication, the compiler is forced to use software to do it step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to somehow try to reduce inlining, so the code to multiply by 10 (which is used in all four terms) can be re-used.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.
Sometimes the multiplication can be compiled into a sequence of additions, yes. You can optimize it, say, by using the left-shift operator:
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing (parenthesized, since << binds less tightly than +):
(A<<9) + (A<<8) + (A<<7) + (A<<6) + (A<<5) + (A<<3)
This is still much longer than a single "multiply" instruction, but your processor apparently doesn't have one anyway, so this might be the next best thing.
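As a compilable sketch (the function name is mine, and it assumes the operand is small enough that the result fits in 16 bits):

#include <stdint.h>

/* Multiply by 1000 using only shifts and adds:
   1000 = 512 + 256 + 128 + 64 + 32 + 8. */
uint16_t mul1000(uint16_t a)
{
    return (uint16_t)((a << 9) + (a << 8) + (a << 7)
                    + (a << 6) + (a << 5) + (a << 3));
}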
You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be:
load integer constant into register (6 bytes)
push register (2 bytes)
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes)
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes)
Let's see: 6+2+6+4+6+4+6+4+6+2+6 = 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;

romAddr = 0;
for (i = 146; i < 150; i++) {
    romAddr += powerOf10[i - 146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get handy with the assembler language that the compiler generates.
It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146)*10 + eeprom_read(147))*10 +
            eeprom_read(148))*10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
    romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.

Better way to do predicate assignment in C?

What I'm trying to do is avoid the following:
if(*ptr > 128) {
    number = 5;
}
Such code performs poorly when there's no clear pattern as to which way the branch will go. What I came up with is this:
int arr[] = { number, 5 };
int cond = *ptr > 128;
number = arr[cond];
Based on my testing, that runs more than twice as fast as doing the conditional when the input is random. What I'm wondering is if there's a more clever way to do this, perhaps using bitwise operators.
A clever compiler should definitely compile this to a conditional move with the right optimization settings; check the disassembly to be sure.
There is this branchless solution:
int mask = -(*ptr > 128);
number = (5 & mask) | (number & ~mask);
The last line can also be
number = ((mask & (number ^ 5)) ^ number);
if you're looking to use one less operation. But, caveat emptor, the compiler won't be able to optimize either of these nearly as well. You are best leaving this particular optimization for the compiler to worry about, unless you specifically know that the compiler is unable to make the optimization (in that case, you may want to check your compiler version or flags).
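For reference, a self-contained version of the mask trick as a function (the wrapper is mine; it assumes two's complement, where -(int)1 is all ones):

/* Branchless equivalent of: if (v > 128) number = 5; */
static inline int assign5_if_gt128(int number, unsigned char v)
{
    int mask = -(v > 128);                /* all ones when v > 128 */
    return (5 & mask) | (number & ~mask); /* 5 if mask set, else number */
}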
