I'm trying to understand when and when not to use the restrict keyword in C and in what situations it provides a tangible benefit.
After reading "Demystifying The Restrict Keyword" (which provides some rules of thumb on usage), I get the impression that when a function is passed pointers, it has to account for the possibility that the data pointed to might overlap (alias) with any other arguments being passed into the function. Given a function:
void foo(int *a, int *b, int *c, int n) {
    for (int i = 0; i < n; ++i) {
        b[i] = b[i] + c[i];
        a[i] = a[i] + b[i] * c[i];
    }
}
the compiler has to reload c in the second expression, because maybe b and c point to the same location. It also has to wait for b to be stored before it can load a for the same reason. It then has to wait for a to be stored and must reload b and c at the beginning of the next loop. If you call the function like this:
int a[N];
foo(a, a, a, N);
then you can see why the compiler has to do this. Using restrict effectively tells the compiler that you will never do this, so that it can drop the redundant load of c and load a before b is stored.
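As a minimal sketch of what that promise looks like in source, here is the same loop with every pointer qualified with restrict. The name foo_restrict is made up for illustration, and whether any loads actually disappear depends on the compiler and the flags used.

```c
/* Hypothetical restrict-qualified variant of foo() from above.
 * The caller promises that a, b and c never overlap for the n elements
 * touched here; calling foo_restrict(x, x, x, n) would be undefined. */
void foo_restrict(int *restrict a, int *restrict b,
                  const int *restrict c, int n)
{
    for (int i = 0; i < n; ++i) {
        b[i] = b[i] + c[i];
        /* The store to b[i] cannot have changed c[i], so the compiler
         * is free to reuse the value of c[i] it already loaded. */
        a[i] = a[i] + b[i] * c[i];
    }
}
```

Calling it with three distinct arrays gives the same results as the unqualified foo; only the aliasing call foo(a, a, a, N) is ruled out.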
In a different SO post, Nils Pipenbrinck provides a working example of this scenario demonstrating the performance benefit.
So far I've gathered that it's a good idea to use restrict on pointers you pass into functions which won't be inlined. Apparently if the code is inlined the compiler can figure out that the pointers don't overlap.
Now here's where things start getting fuzzy for me.
In Ulrich Drepper's paper "What every programmer should know about memory", he states that "unless restrict is used, all pointer accesses are potential sources of aliasing," and he gives a specific code example of a submatrix matrix multiply where he uses restrict.
However, when I compile his example code either with or without restrict I get identical binaries in both cases. I'm using gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
The thing I can't figure out in the following code is whether it needs to be rewritten to make more extensive use of restrict, or if the alias analysis in GCC is just so good that it's able to figure out that none of the arguments alias each other. For purely educational purposes, how can I make using or not using restrict matter in this code - and why?
For restrict compiled with:
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) -DUSE_RESTRICT -Wextra -std=c99 -O3 matrixMul.c -o matrixMul
Just remove -DUSE_RESTRICT to not use restrict.
#include <stdlib.h>
#include <stdio.h>
#include <emmintrin.h>
#ifdef USE_RESTRICT
#else
#define restrict
#endif
#define N 1000
double _res[N][N] __attribute__ ((aligned (64)));
double _mul1[N][N] __attribute__ ((aligned (64)))
= { [0 ... (N-1)]
= { [0 ... (N-1)] = 1.1f }};
double _mul2[N][N] __attribute__ ((aligned (64)))
= { [0 ... (N-1)]
= { [0 ... (N-1)] = 2.2f }};
#define SM (CLS / sizeof (double))
void mm(double (* restrict res)[N], double (* restrict mul1)[N],
        double (* restrict mul2)[N]) __attribute__ ((noinline));
void mm(double (* restrict res)[N], double (* restrict mul1)[N],
        double (* restrict mul2)[N])
{
    int i, i2, j, j2, k, k2;
    double *restrict rres;
    double *restrict rmul1;
    double *restrict rmul2;
    for (i = 0; i < N; i += SM)
        for (j = 0; j < N; j += SM)
            for (k = 0; k < N; k += SM)
                for (i2 = 0, rres = &res[i][j],
                     rmul1 = &mul1[i][k]; i2 < SM;
                     ++i2, rres += N, rmul1 += N)
                    for (k2 = 0, rmul2 = &mul2[k][j];
                         k2 < SM; ++k2, rmul2 += N)
                        for (j2 = 0; j2 < SM; ++j2)
                            rres[j2] += rmul1[k2] * rmul2[j2];
}
int main (void)
{
mm(_res, _mul1, _mul2);
return 0;
}
It is a hint to the code optimizer. Using restrict assures the optimizer that it can keep a value read or written through the pointer in a CPU register, without having to flush each update to memory so that a possible alias sees it as well.
Whether or not it takes advantage of it depends heavily on implementation details of the optimizer and the CPU. Code optimizers already are heavily invested in detecting non-aliasing since it is such an important optimization. It should have no trouble detecting that in your code.
Also, GCC 4.0.0-4.4 has a regression bug that causes the restrict keyword to be ignored. This bug was reported as fixed in 4.5 (I lost the bug number though).
(I don't know if using this keyword gives you a significant advantage, actually. It's very easy for a programmer to err with this qualifier, as there is no enforcement, so an optimizer cannot be certain that the programmer doesn't "lie".)
When you know that a pointer A is the only pointer to some region of memory, that is, it doesn't have aliases (that is, any other pointer B will necessarily be unequal to A, B != A), you can tell this fact to the optimizer by qualifying the type of A with the "restrict" keyword.
I have written about this here: http://mathdev.org/node/23 and tried to show that some restricted pointers are in fact "linear" (as mentioned in that post).
It's worth noting that recent versions of clang are capable of generating code with a run-time check for aliasing and two code paths: one for cases where there is potential aliasing, and the other for cases where it is obvious there is no chance of it.
This clearly depends on the extents of data pointed to being conspicuous to the compiler - as they would be in the example above.
I believe the prime justification is for programs making heavy use of the STL, and particularly <algorithm>, where it is either difficult or impossible to introduce the __restrict qualifier.
Of course, this all comes at the expense of code size, but it removes a great deal of potential for obscure bugs that could result from pointers declared as __restrict not being quite as non-overlapping as the developer thought.
I would be surprised if GCC hadn't also got this optimisation.
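To illustrate the idea, here is a hand-written sketch of such a run-time check with two code paths. It is not what clang emits; the function names are hypothetical, and the overlap test assumes a flat address space in which converting pointers to uintptr_t gives comparable addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Fast path: the caller has established that dst and src do not overlap. */
static void add_noalias(double *restrict dst, const double *restrict src,
                        size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

/* Slow path: makes no assumption about overlap. */
static void add_may_alias(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

/* Hypothetical dispatcher: compare the two address ranges once, then
 * pick a path. Going through uintptr_t avoids a relational comparison
 * between pointers into unrelated objects. */
void add_arrays(double *dst, const double *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t bytes = (uintptr_t)(n * sizeof *dst);

    if (d + bytes <= s || s + bytes <= d)
        add_noalias(dst, src, n);
    else
        add_may_alias(dst, src, n);
}
```

Both paths compute the same thing; the point is only that the restrict promise is checked rather than assumed, at the cost of the duplicated loop.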
Maybe the optimisation done here doesn't rely on the pointers not being aliased? Unless you preload multiple mul2 elements before writing the result to res2, I don't see any aliasing problem.
In the first piece of code you show, it is quite clear what kind of aliasing problem can occur.
Here it is not so clear.
Rereading Drepper's article, he does not specifically say that restrict will solve anything. There is even this passage:
"In theory the restrict keyword introduced into the C language in the 1999 revision should solve the problem. Compilers have not caught up yet, though. The reason is mainly that too much incorrect code exists which would mislead the compiler and cause it to generate incorrect object code."
In this code, optimisation of memory access has already been done within the algorithm. The remaining optimisation seems to be done in the vectorized code presented in the appendix. So for the code presented here, I guess there is no difference, because no optimisation relying on restrict is done. Every pointer access is a potential source of aliasing, but not every optimisation relies on the absence of aliasing.
Premature optimization being the root of all evil, the use of the restrict keyword should be limited to the cases you are actively studying and optimizing, not applied wherever it could be used.
If there is a difference at all, moving mm to a separate DSO (such that gcc can no longer know everything about the calling code) will be the way to demonstrate it.
Are you running on 32 or 64-bit Ubuntu? If 32-bit, then you need to add -march=core2 -mfpmath=sse (or whatever your processor architecture is), otherwise it doesn't use SSE. Secondly, in order to enable vectorization with GCC 4.2, you need to add the -ftree-vectorize option (as of 4.3 or 4.4 this is included as default in -O3). It might also be necessary to add -ffast-math (or another option providing relaxed floating point semantics) in order to allow the compiler to reorder floating point operations.
Also, add the -ftree-vectorizer-verbose=1 option to see whether it manages to vectorize the loop or not; that's an easy way to check the effect of adding the restrict keyword.
The problem with your example code is that the compiler will just inline the call and see that there is no aliasing ever possible in your example. I suggest you remove the main() function and compile it using -c.
The following C99 code can show you that the output of the program depends on restrict :
__attribute__((noinline))
int process(const int * restrict const a, int * const b) {
*b /= (*a + 1) ;
return *a + *b ;
}
int main(void) {
int data[2] = {1, 2};
return process(&data[0], &data[0]);
}
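The program exits with code 1 when restrict is used and with code 0 without the restrict qualifier. (Passing the same address for both a and b breaks the promise made by restrict, which is exactly why the two builds diverge.)
It was compiled with gcc -std=c99 -Wall -pedantic -O3 main.c.
The flag -O1 does the job too.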
restrict is useful when, for example, it lets you tell the compiler that a loop condition remains unchanged even though data is written through another pointer inside the loop (because of restrict, that write cannot change the value the condition reads).
And so on.
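A small sketch of the loop-condition case, with made-up names: the bound is read through one pointer while the loop body stores through another pointer of the same type. Whether the load is really hoisted depends on the compiler.

```c
/* Hypothetical example: the loop bound lives behind a pointer. */
void clear_prefix(int *restrict dst, const int *restrict count)
{
    /* Because dst and count are restrict-qualified, a store to dst[i]
     * cannot modify *count, so the compiler may load *count once
     * before the loop instead of on every iteration. */
    for (int i = 0; i < *count; ++i)
        dst[i] = 0;
}
```

Without the qualifiers, dst and count are both pointers to int, so the compiler has to assume that dst[i] = 0 might overwrite the bound.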
I found that this code produces different results with "-fsanitize=undefined,address" and without it.
int printf(const char *, ...);
union {
long a;
short b;
int c;
} d;
int *e = &d.c;
int f, g;
long *h = &d.a;
int main() {
for (; f <= 0; f++) {
*h = g;
*e = 6;
}
printf("%d\n", d.b);
}
The command line is:
$ clang -O0 -fsanitize=undefined,address a.c -o out0
$ clang -O1 -fsanitize=undefined,address a.c -o out1
$ clang -O1 a.c -o out11
$ ./out0
6
$ ./out1
6
$ ./out11
0
The Clang version is:
$ clang -v
clang version 13.0.0 (/data/src/llvm-dev/llvm-project/clang 3eb2158f4fea90d56aeb200a5ca06f536c1df683)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /data/bin/llvm-dev/bin
Found candidate GCC installation: /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7
Selected GCC installation: /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7
Candidate multilib: .;#m64
Candidate multilib: 32;#m32
Selected multilib: .;#m64
Found CUDA installation: /usr/local/cuda, version 10.2
The OS and platform are:
CentOS Linux release 7.8.2003 (Core).0, x86_64 GNU/Linux
My questions:
Is there something wrong with my code? Is taking the address of multiple members of the union invalid in C?
If there is something wrong with my code, how do I get LLVM (or GCC) to warn me? I have used -Wall -Wextra but LLVM and GCC show no warning.
Is there something wrong with the code?
For practical purposes, yes.
I think this is the same underlying issue as "Is it undefined behaviour to call a function with pointers to different elements of a union as arguments?"
As Eric Postpischil points out, the C standard as read literally seems to permit your code, and require it to print out 6 (assuming that's consistent with how your implementation represents integer types and how it lays out unions). However, this literal reading would render the strict aliasing rule almost entirely impotent, so in my opinion it's not what the standard authors would have intended.
The spirit of the strict aliasing rule is that the same object may not be accessed through pointers to different types (with certain exceptions for character types, etc) and that the compiler may optimize on the assumption that this never happens. Although d.a and d.c are not strictly speaking "the same object", they do have overlapping storage, and I think compiler authors interpret the rule as also not allowing overlapping objects to be accessed through pointers to different types. Under that interpretation your code would have undefined behavior.
In Defect Report 236 the committee considered a similar example and stated that it has undefined behavior, because of its use of pointers that "have different types but designate the same region of storage". However, wording to clarify this does not seem to have ever made it into any subsequent version of the standard.
Anyhow, I think the practical upshot is that you cannot expect your code to work "correctly" under modern compilers that enforce their interpretations of the strict aliasing rule. Whether or not this is a clang bug is a matter of opinion, but even if you do think it is, then it's a bug that they are probably not ever going to fix.
Why does it behave this way?
If you use the -fno-strict-aliasing flag, then you get back to the 6 behavior. My guess is that the sanitizers happen to inhibit some of these optimizations, which is why you don't see the 0 behavior when using those options.
What seems to have happened under the hood with -O1 is the compiler assumed that the stores to *h and *e don't interact (because of their different types) and therefore can be freely reordered. So it hoisted *h = g outside the loop, since after all multiple stores to the same address, with no intervening load, are redundant and only the last one needs to be kept. It happened to put it after the loop, presumably because it can't prove that e doesn't point to g, so the value of g needs to be reloaded after the loop. So the final value of d.b is derived from *h = g which effectively does d.a = 0.
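To make that description concrete, here is a hand-written model of the store order described above. It is not compiler output, the names demo and reordered_result are made up, and it uses direct member access so that the sketch itself does not depend on the disputed pointer accesses.

```c
union demo {
    long a;
    short b;
    int c;
};

/* Hand-written model of the transformed loop: the store to the long
 * member has been moved out of the loop and placed after it. */
short reordered_result(void)
{
    union demo d = {0};
    int g = 0;

    for (int f = 0; f <= 0; f++) {
        d.c = 6;          /* the store through e stays inside the loop */
    }
    d.a = g;              /* the store through h now happens last */

    return d.b;           /* every byte of d.a is zero, so this is 0 */
}
```

With the store to the long member last, the 6 written through the int member is overwritten, which matches the 0 printed by the plain -O1 build.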
How to get a warning?
Unfortunately, compilers are not good at checking, either statically or at runtime, for violations of (their interpretation of) the strict aliasing rule. I'm not aware of any way to get a warning for such code. With clang you can use -Weverything to enable every warning option that it supports (many of which are useless or counterproductive), and even with that, it gives no relevant warnings about your program.
Another example
In case anyone is curious, here's another test case that doesn't rely on any type pun, reinterpretation, or other implementation-defined behavior.
#include <stdio.h>
short int zero = 0;
void a(int *pi, long *pl) {
for (int x = 0; x < 1000; x++) {
*pl = x;
*pi = zero;
}
}
int main(void) {
union { int i; long l; } u;
a(&u.i, &u.l);
printf("%d\n", u.i);
}
As read literally, this code would appear to print 0 on any implementation: the last assignment in a() was to u.i, so u.i should be the active member, and the printf should output the value 0 which was assigned to it. However, with clang -O2, the stores are reordered and the program outputs 999.
Just as a counterpoint, though, if you read the standard so as to make the above example UB, then this leads to the somewhat absurd conclusion that u.l = 0; u.i = 5; print(u.i); is well defined and prints 5, but that *&u.l = 0; *&u.i = 5; print(u.i); is UB. (Recall that the "cancellation rule" of & and * applies to &*p but not to *&x.)
The whole situation is rather unsatisfactory.
I will rewrite the code for ease of reading:
int printf(const char *, ...);
union
{
long l;
short s;
int i;
} u;
long *ul = &u.l;
int *ui = &u.i;
int counter, zero;
int main(void)
{
for (; counter <= 0; counter++)
{
*ul = zero;
*ui = 6;
}
printf("%d\n", u.s);
}
The only questionable code here is the use of u.s in the printf, when u.s is not the last member of the union that was stored. That is defined by C 2018 6.5.2.3, which says the value of u.s is that of the named member, and note 99 clarifies this means that, if s is not the member last used to store a value, the appropriate bytes are reinterpreted as a short. This is well established.
The other code is ordinary: *ul = zero; stores a value in a union member. There is no aliasing violation because ul points to a long and is used to access a long. *ui = 6; stores a value in another union member and is also not an aliasing violation.
The specific bytes used to represent 6 in an int are implementation-defined in regard to ordering and padding bits. However, whatever they are, they should be the same with or without Clang’s “sanitization” and the same in optimization levels 0 and 1. Therefore, the same result should be obtained in all compilations.
This is a compiler bug.
I agree with other comments and answer that this is likely a defect in the C standard, as it makes the aliasing rule largely useless. Nonetheless, the sample code conforms to the requirements of the C standard and ought to work as described.
I have the following ostensibly simple C program:
#include <stdint.h>
#include <stdio.h>
uint16_t
foo(uint16_t *arr)
{
unsigned int i;
uint16_t sum = 0;
for (i = 0; i < 4; i++) {
sum += *arr;
arr++;
}
return sum;
}
int main()
{
uint32_t arr[] = {5, 6, 7, 8};
printf("sum: %x\n", foo((uint16_t*)arr));
return 0;
}
The idea is that we iterate over an array and add up its 16-bit words, ignoring overflow. When compiling this code on x86-64 with gcc and no optimization, I get what would seem to be the correct result of 0xb (11), because it's summing the first four 16-bit words, which include 5 and 6:
$ gcc -O0 -o castit castit.c
$ ./castit
sum: b
$ ./castit
sum: b
$
With optimization on, it's another story:
$ gcc -O2 -o castit castit.c
$ ./castit
sum: 5577
$ ./castit
sum: c576
$ ./castit
sum: 1de6
The program generates indeterminate values for the sum.
I'm assuming for now that it's not a compiler bug, which would lead me to believe that there is some undefined behavior in the program; however, I can't point to a specific thing that would cause it.
Note that when the function foo is compiled to a separately linked module the issue is not seen.
You are breaking the strict aliasing rule, which is indeed UB. That's because you alias your array arr of uint32_t via a pointer of different type, i.e. uint16_t when passing it to foo(uint16_t*).
The only pointer types you can use to access objects of arbitrary other types are pointers to character types, such as char* or unsigned char*.
Some additional reading material on the subject: http://dbp-consulting.com/tutorials/StrictAliasing.html
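As a small sketch of the character-type exception (the helper byte_sum is made up for illustration), the bytes of any object may be read through an unsigned char pointer without violating the aliasing rule:

```c
#include <stddef.h>
#include <stdint.h>

/* Sums the individual bytes of any object. Reading through an
 * unsigned char pointer is permitted whatever the object's real type. */
unsigned int byte_sum(const void *obj, size_t size)
{
    const unsigned char *bytes = obj;
    unsigned int sum = 0;

    for (size_t i = 0; i < size; i++)
        sum += bytes[i];
    return sum;
}
```

For the array in the question, byte_sum(arr, sizeof arr) adds up the bytes of 5, 6, 7 and 8 in whatever order the platform stores them, so the total does not depend on endianness.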
The C language as defined by K & R was designed to facilitate easy code generation, but not necessarily the generation of efficient code. ANSI et al. added some rules with the intention of allowing compilers to generate more efficient code by assuming programmers wouldn't do certain things, but did so in a way which makes it impossible to implement certain kinds of algorithms cleanly and efficiently. A standards-conforming version of foo which can accept pointers to either uint16_t or uint32_t would look like:
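#include <stdint.h>
#include <string.h>   /* for memcpy */

uint16_t foo(uint16_t *arr) // Could perhaps use void*
{
    unsigned int i;
    uint16_t *src = arr;
    uint16_t temp;
    uint16_t sum = 0;
    for (i = 0; i < 4; i++) {
        /* Copy the bytes out instead of dereferencing src, so no
           uint32_t object is ever read through a uint16_t lvalue. */
        memcpy(&temp, src, sizeof temp);
        sum += temp;
        src++;
    }
    return sum;
}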
Some compilers will recognize that memcpy can be replaced with code that simply reads *src directly as a uint16_t, so the above code might run efficiently despite its use of memcpy. On some other compilers, however, the above code would run quite slowly. Further, while memcpy made it possible in C89 to do everything which would be possible in the absence of the pointer-usage rules, C99 added additional rules which make some kinds of semantics essentially unachievable. Fortunately, in your particular case, memcpy allows the algorithm to be expressed in a way from which some compilers will generate efficient code.
Note that many compilers include an option to compile a language which extends standard C by eliminating the aforementioned restrictions on pointer usage. On gcc, the option -fno-strict-aliasing may be used for that purpose. Use of that option may severely degrade the efficiency of some kinds of code unless the programmer takes steps to prevent such degradation (using restrict qualifiers when possible, and copying things that might be aliased to local variables), but in many cases code can be written so -fno-strict-aliasing won't hurt performance too badly.
Well, there is no guarantee by the standard that inline functions are actually inlined; one must use macros to have 100 % guarantee. The compiler always decides which function is or is not inlined based on its own rules irrespective of the inline keyword.
Then when will the inline keyword actually have some effect on what the compiler does when using modern compilers such as recent versions of GCC?
It has a semantic effect. To simplify, a function marked inline may be defined multiple times in one program (though all definitions must be equivalent to each other), so the presence of inline is required for correctness when including the function definition in headers (which, in turn, makes the definition visible so the compiler can inline it without LTO).
Other than that, for inlining-the-optimization, "never" is a perfectly safe approximation. It probably has some effect in some compilers, but nothing worth losing sleep over, especially not without actual hard data. For example, in the following code, using Clang 3.0 or GCC 4.7, main contains the same code whether work is marked inline or not. The only difference is whether work remains as stand-alone function for other translation units to link to, or is removed.
void work(double *a, double *b) {
if (*b > *a) *a = *b;
}
void maxArray(double* x, double* y) {
for (int i = 0; i < 65536; i++) {
//if (y[i] > x[i]) x[i] = y[i];
work(x+i, y+i);
}
}
If you want to control inlining, stick to whatever pragmas or attributes your compiler provides with which to control that behaviour. For example __attribute__((always_inline)) on GCC and similar compilers. As you've mentioned, the inline keyword is often ignored depending on optimization settings, etc.
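A minimal sketch of those attributes, reusing the work function from the earlier example. The attribute spellings are the GCC/Clang ones, and the names work_inlined, work_outlined and max_array are made up for illustration.

```c
/* GCC/Clang-specific attributes; other compilers spell these differently. */
static inline __attribute__((always_inline))
void work_inlined(double *a, const double *b)
{
    if (*b > *a) *a = *b;
}

/* The opposite request: keep this as a real out-of-line call. */
__attribute__((noinline))
void work_outlined(double *a, const double *b)
{
    if (*b > *a) *a = *b;
}

/* Hypothetical caller modelled on maxArray from the answer above. */
void max_array(double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++)
        work_inlined(x + i, y + i);
}
```

Unlike the plain inline keyword, these attributes are requests the compiler is documented to honour, although they tie the code to compilers that support them.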
I would like to create a C macro returning the scalar minimum for any type of static array in input. For example:
float A[100];
int B[10][10];
// [...]
float minA = MACRO_MIN(A);
int minB = MACRO_MIN(B);
How can I do so?
It can probably be done with GCC extensions, but not in standard C. Other compilers might have suitable extensions, too. It will of course make the code fantastically hard to port. I would advise against it: it's quite hard to achieve, it will be "unexpected", and it will probably act as a source of confusion (or, worse, bugs) down the line.
You're going to have to declare a temporary variable to hold the max/min seen "so far" when iterating over the array, and the type of that variable is hard to formulate without extensions.
Also returning the value of the temporary is hard, but possible with GCC extensions.
To make the above more concrete, here's a sketch of what I imagine. I did not test-compile this, so it's very likely to have errors in it:
#define ARRAY_MAX(a) ({ typeof((a)[0]) tmp = (a)[0];\
    for(size_t i = 1; i < sizeof (a) / sizeof tmp; ++i)\
    {\
        if((a)[i] > tmp)\
            tmp = (a)[i];\
    }\
    tmp;\
})
The above uses:
({ and }) is the GCC Statement Expressions extension, allowing the macro to have a local variable which is used as the "return value".
typeof is used to compute the proper type.
Note the assumption that the array is not of zero size. This should not be a very limiting assumption.
The use of sizeof is of course standard.
As I wrote the above, I realize there might be issues with multi-dimensional arrays that I hadn't realized until trying. I'm not going to polish it further, though. Note that it starts out with "probably".
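For the minimum that the question actually asks about, a sketch along the same lines could look like the following. It is limited to one-dimensional arrays, relies on the GCC/Clang extensions __typeof__ and statement expressions, and the name ARRAY_MIN and its internal variable names are made up.

```c
#include <stddef.h>

/* Sketch only: `a` must be a real one-dimensional array (not a pointer)
 * with at least one element, and the compiler must support GCC-style
 * statement expressions and __typeof__. */
#define ARRAY_MIN(a) ({                                            \
    __typeof__((a)[0]) min_ = (a)[0];                              \
    for (size_t i_ = 1; i_ < sizeof (a) / sizeof (a)[0]; ++i_) {   \
        if ((a)[i_] < min_)                                        \
            min_ = (a)[i_];                                        \
    }                                                              \
    min_;                                                          \
})
```

It does not handle int B[10][10], because there (a)[0] is itself an array and cannot be copied into min_; a multi-dimensional array would need a different approach.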
Today I was reading about pure functions and got confused about their use:
A function is said to be pure if it returns same set of values for same set of inputs and does not have any observable side effects.
e.g. strlen() is a pure function while rand() is an impure one.
#include <stdio.h>

__attribute__ ((pure)) int fun(int i)
{
    return i*i;
}
int main()
{
    int i=10;
    printf("%d",fun(i));//outputs 100
    return 0;
}
http://ideone.com/33XJU
The above program behaves in the same way as in the absence of pure declaration.
What are the benefits of declaring a function as pure (if there is no change in output)?
pure lets the compiler know that it can make certain optimisations about the function: imagine a bit of code like
for (int i = 0; i < 1000; i++)
{
printf("%d", fun(10));
}
With a pure function, the compiler can know that it needs to evaluate fun(10) once and once only, rather than 1000 times. For a complex function, that's a big win.
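Written out by hand, the transformation would look roughly like the sketch below. The loop accumulates a sum instead of printing so that the two versions are easy to compare; sum_naive and sum_hoisted are made-up names, and whether a given compiler actually performs the hoisting is not guaranteed.

```c
/* GCC/Clang attribute: the call has no side effects and its result
 * depends only on its argument (and possibly on memory it only reads). */
__attribute__((pure)) int fun(int i)
{
    return i * i;
}

/* As written in the source: one call per iteration. */
long sum_naive(int n)
{
    long total = 0;
    for (int k = 0; k < n; k++)
        total += fun(10);
    return total;
}

/* What the compiler is allowed to turn it into: one call, hoisted
 * out of the loop. */
long sum_hoisted(int n)
{
    long total = 0;
    int value = fun(10);
    for (int k = 0; k < n; k++)
        total += value;
    return total;
}
```

The two functions return the same value for every n, which is exactly what the pure attribute lets the compiler assume.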
When you say a function is 'pure' you are guaranteeing that it has no externally visible side-effects (and as a comment says, if you lie, bad things can happen). Knowing that a function is 'pure' has benefits for the compiler, which can use this knowledge to do certain optimizations.
Here is what the GCC documentation says about the pure attribute:
pure
Many functions have no effects except the return value and their return
value depends only on the parameters and/or global variables.
Such a function can be subject to common subexpression elimination and
loop optimization just as an arithmetic operator would be. These
functions should be declared with the attribute pure. For example,
int square (int) __attribute__ ((pure));
Philip's answer already shows how knowing a function is 'pure' can help with loop optimizations.
Here is one for common sub-expression elimination (given foo is pure):
a = foo (99) * x + y;
b = foo (99) * x + z;
Can become:
_tmp = foo (99) * x;
a = _tmp + y;
b = _tmp + z;
In addition to possible run-time benefits, a pure function is much easier to reason about when reading code. Furthermore, it's much easier to test a pure function since you know that the return value only depends on the values of the parameters.
A non-pure function
int foo(int x, int y) // possible side-effects
is like an extension of a pure function
int bar(int x, int y) // guaranteed no side-effects
in which you have, besides the explicit function arguments x, y,
the rest of the universe (or anything your computer can communicate with) as an implicit potential input. Likewise, besides the explicit integer return value, anything your computer can write to is implicitly part of the return value.
It should be clear why it is much easier to reason about a pure function than a non-pure one.
Just as an add-on, I would like to mention that C++11 codifies things somewhat using the constexpr keyword. Example:
#include <iostream>
#include <cstring>
constexpr unsigned static_strlen(const char * str, unsigned offset = 0) {
return (*str == '\0') ? offset : static_strlen(str + 1, offset + 1);
}
constexpr const char * str = "asdfjkl;";
constexpr unsigned len = static_strlen(str); //MUST be evaluated at compile time
//so, for example, this: int arr[len]; is legal, as len is a constant.
int main() {
std::cout << len << std::endl << std::strlen(str) << std::endl;
return 0;
}
The restrictions on the usage of constexpr make it so that the function is provably pure. This way, the compiler can more aggressively optimize (just make sure you use tail recursion, please!) and evaluate the function at compile time instead of run time.
So, to answer your question: if you're using C++ (I know you said C, but they are related), writing a pure function in the correct style allows the compiler to do all sorts of cool things with the function :-)
In general, pure functions have three advantages over impure functions that the compiler can take advantage of:
Caching
Let's say that you have a pure function f that is called 100000 times. Since it is deterministic and depends only on its parameters, the compiler can calculate its value once and reuse it whenever necessary.
Parallelism
Pure functions don't read or write to any shared memory, and therefore can run in separate threads without any unexpected consequences.
Passing By Reference
A function f(struct t) receives its argument t by value. If f is declared pure, the compiler can instead pass t by reference, which is cheaper, while still guaranteeing that the value of t will not change.
In addition to the compile-time considerations, pure functions can be tested fairly easily: just call them.
No need to construct objects or mock connections to DBs / file system.