I am getting a strange result using global variables. This question was inspired by another question. In the code below, if I change
int ncols = 4096;
to
static int ncols = 4096;
or
const int ncols = 4096;
the code runs much faster and the assembly is much simpler.
//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int nrows = 4096;
int ncols = 4096;
//static int ncols = 4096;
char* buff;
void func(char* pbuff, int * _nrows, int * _ncols) {
for (int i=0; i<*_nrows; i++) {
for (int j=0; j<*_ncols; j++) {
*pbuff += 1;
pbuff++;
}
}
}
int main(void) {
buff = calloc(ncols*nrows, sizeof*buff);
double dtime = -omp_get_wtime();
for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
dtime += omp_get_wtime();
printf("time %.16e\n", dtime/100);
return 0;
}
I also get the same result if char* buff is an automatic variable (i.e. not global or static). I mean:
//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int nrows = 4096;
int ncols = 4096;
void func(char* pbuff, int * _nrows, int * _ncols) {
for (int i=0; i<*_nrows; i++) {
for (int j=0; j<*_ncols; j++) {
*pbuff += 1;
pbuff++;
}
}
}
int main(void) {
char* buff = calloc(ncols*nrows, sizeof*buff);
double dtime = -omp_get_wtime();
for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
dtime += omp_get_wtime();
printf("time %.16e\n", dtime/100);
return 0;
}
If I change buff to be a short pointer then the performance is fast and does not depend on whether ncols is static or constant, or whether buff is automatic. However, when I make buff an int* pointer I observe the same effect as with char*.
I thought this may be due to pointer aliasing so I also tried
void func(int * restrict pbuff, int * restrict _nrows, int * restrict _ncols)
but it made no difference.
Here are my questions
When buff is either a global char* pointer or a global int* pointer, why is the code faster when ncols is static or constant?
Why does buff being an automatic variable instead of global or static make the code faster?
Why does it make no difference when buff is a short pointer?
If this is due to pointer aliasing why does restrict have no noticeable effect?
Note that I'm using omp_get_wtime() simply because it's convenient for timing.
As noted in other answers, some of these changes allow GCC to make different assumptions for optimization purposes; the optimization with the biggest impact here is most likely loop vectorization. Therefore,
Why is the code faster?
The code is faster because its hot part, the loops in func, has been optimized with auto-vectorization. Indeed, when ncols is qualified with static or const, GCC emits:
note: loop vectorized
note: loop peeled for vectorization to enhance alignment
which you can see by turning on -fopt-info-loop, -fopt-info-vec, or either of those combined with a further -optimized suffix (they have the same effect here).
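For example, extending the compile line from the question (assuming c99 drives GCC here, as in the original post):
c99 -O3 -Wall -fopenmp -fopt-info-vec-optimized foo.c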
Why does buff being an automatic variable instead of global or static make the code faster?
In this case, GCC is able to compute the number of iterations, which is intuitively necessary for vectorization. This again comes down to the storage of buff, which is external if not specified otherwise: with a global buff the whole vectorization analysis is skipped immediately, whereas with a local buff it carries on and succeeds.
Why does it make no difference when buff is a short pointer?
Why should it? In the slow cases func receives a char* (or an int*), either of which may alias the int that _ncols points to; a store through a short* may not, so nothing inhibits vectorization there.
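A sketch of the aliasing rules at play (simplified signatures, not the exact code from the question):
void f_short(short *p, int *n) {
    /* a store through short* cannot modify an int, so *n may be
       kept in a register and the loop vectorized */
    for (int j = 0; j < *n; j++)
        p[j] += 1;
}
void f_char(char *p, int *n) {
    /* a store through char* may alias anything, including *n,
       so the bound must be reloaded on every iteration */
    for (int j = 0; j < *n; j++)
        p[j] += 1;
}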
If this is due to pointer aliasing why does restrict have no noticeable effect?
I don't think so, because GCC can see for itself, at the point where func is invoked, that the pointers don't alias: restrict isn't needed.
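Incidentally, when neither static nor const is an option, a common workaround (my sketch, not from the original post) is to copy the bounds into locals before the loops:
void func(char* pbuff, int * _nrows, int * _ncols) {
    int nrows = *_nrows;  /* local copies: the stores through pbuff */
    int ncols = *_ncols;  /* can no longer clobber the loop bounds  */
    for (int i=0; i<nrows; i++) {
        for (int j=0; j<ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }
}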
A const will most likely always yield code as fast as, or faster than, a read/write variable, since the compiler knows that the value won't be changed, which in turn enables a whole lot of optimization options.
Declaring a file-scope variable as int or static int should not affect performance much, as it will still be allocated in the very same place: the .data section.
But as mentioned in comments, if the variable is global, the compiler might have to assume that some other file (translation unit) might modify it and therefore block some optimization. I suppose this is what's happening.
But this shouldn't be any concern anyhow, since there is never a reason to declare a global variable in C, period. Always declare them as static to prevent the variable from getting abused for spaghetti-coding purposes.
In general I'd also question your benchmarking results. On Windows you should be using QueryPerformanceCounter and similar.
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
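A minimal timing harness along those lines might look like this (Win32 API, error checking omitted):
#include <windows.h>
#include <stdio.h>
int main(void) {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);  /* ticks per second */
    QueryPerformanceCounter(&t0);
    /* ... code under test ... */
    QueryPerformanceCounter(&t1);
    printf("time %.9f s\n",
           (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    return 0;
}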
Related
In https://stackoverflow.com/a/57116260/946226 I learned how to verify that a function foo that operates on a buffer (given by a begin and end pointer) really only reads from it, by creating a representative main function that calls it:
#include <stddef.h>
#define N 100
char test[N];
extern char *foo(char *, char *);
int main() {
char* beg, *end;
beg = &test[0];
end = &test[0] + N;
foo(beg, end);
}
but this does not catch bugs that only appear when the buffer is very short.
I tried the following:
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include "__fc_builtin.h"
extern char *foo(char *, char *);
int main() {
int n = Frama_C_interval(0, 255);
uint8_t *test = malloc(n);
if (test != NULL) {
for (int i=0; i<n; i++) test[i]=Frama_C_interval(0, 255);
char* beg, *end;
beg = (char *)&test[0];
end = (char *)&test[0] + n;
foo(beg, end);
}
}
But that does not work:
[eva:alarm] frama-main.c:14: Warning:
out of bounds write. assert \valid(test + i);
Can I make it work?
As mentioned in anol's comment, none of the abstract domains available within Eva is capable of keeping track of the relation between n and the length of the memory block returned by malloc. Hence, for all practical purposes, it will not be possible to get rid of the warning in such circumstances in a real-life analysis. Generally speaking, it is important to prepare an initial state which leads to precise bounds for the buffers that are manipulated throughout the program (while their contents can stay much more abstract).
That said, for smaller experiments, and if you don't mind wasting (quite a lot of) CPU cycles, it is possible to cheat a little bit by basically instructing Eva to consider each possible length separately. This is done with a few annotations and command-line options (Frama-C 19.0 Potassium only):
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include "__fc_builtin.h"
extern char *foo(char *, char *);
int main() {
int n = Frama_C_interval(0, 255);
//# split n;
uint8_t *test = malloc(n);
if (test != NULL) {
//# loop unroll n;
for (int i=0; i<n; i++) {
Frama_C_show_each_test(n, i, test);
test[i]=Frama_C_interval(0, 255);
}
char* beg, *end;
beg = (char *)&test[0];
end = (char *)&test[0] + n;
foo(beg, end);
}
}
Launch Frama-C with
frama-c -eva file.c \
-eva-precision 7 \
-eva-split-limit 256 \
-eva-builtin malloc:Frama_C_malloc_fresh
In the code, //# split n indicates that Eva should consider separately each possible value of n at this point. It goes along with -eva-split-limit 256 (by default, Eva won't split if the expression can have more than 100 values). //# loop unroll n asks to unroll the loop n times instead of merging results for all steps.
For the other command-line options, -eva-precision 7 sets the various parameters controlling Eva's precision to sensible values. It goes from 0 (less precise than the default) up to 11 (maximal precision; don't try it on anything longer than a dozen lines). -eva-builtin malloc:Frama_C_malloc_fresh instructs Eva to create a fresh base address for each call to malloc it encounters. Otherwise, you'd get a single base for all lengths, defeating the purpose of splitting on n in the first place.
I have a function in C which calculates the mean of an array. Within the same loop, I am creating an array of t values. My current function returns the mean value. How can I modify this to return the t array also?
/* function returning the mean of an array */
double getMean(int arr[], int size) {
int i;
printf("\n");
float mean;
double sum = 0;
float t[size]; /* variable-length array (automatic, not static, allocation) */
for (i = 0; i < size; ++i) {
sum += arr[i];
t[i] = 10.5*(i) / (128.0 - 1.0);
//printf("%f\n",t[i]);
}
mean = sum/size;
return mean;
}
Thoughts:
Do I need to define a struct within the function? Does this work for scalar and array types? Is there a cleaner way of doing this?
You can return only one object from a C function. So, if you can't choose between the two, you'll have to make a structure to return your 2 values, something like:
typedef struct X{
double mean;
double *newArray;
} X;
BUT, in your case, you'll also need to allocate t dynamically using malloc; otherwise the returned array's storage is lost once the function returns, since it lives on the stack.
Another way would be to let the caller allocate the new array and pass it to you as a pointer; this way, you still return only the mean, and fill the given array with your computed values.
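Going back to the struct route, a minimal sketch (the name getMeanAndT is made up for illustration; note that the caller must free() the array):
#include <stdlib.h>
typedef struct X {
    double mean;
    double *newArray;
} X;
X getMeanAndT(const int *arr, int size) {
    X res = { 0.0, malloc(size * sizeof *res.newArray) };
    double sum = 0;
    for (int i = 0; i < size; ++i) {
        sum += arr[i];
        if (res.newArray)  /* fill only if the allocation succeeded */
            res.newArray[i] = 10.5 * i / (128.0 - 1.0);
    }
    res.mean = sum / size;
    return res;
}
The caller then uses r.mean and r.newArray after X r = getMeanAndT(data, n); and calls free(r.newArray) when done.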
The most common approach for something like this is letting the caller provide storage for the values you want to return. You could just make t another parameter to your function for that:
double getMean(double *t, const int *arr, size_t size) {
double sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += arr[i];
t[i] = 10.5*(i) / (128.0 - 1.0);
}
return sum/size;
}
This snippet also improves on some other aspects:
Don't use float, especially not when you intend to return a double. float has very poor precision
Use size_t for object sizes. While int often works, size_t is guaranteed to hold any possible object size and is the safe choice
Don't mix output in functions calculating something (just a stylistic advice)
Declare variables close to where they are used first (another stylistic advice)
This is somewhat opinionated, but I changed your signature to make it explicit that the function is passed pointers to array elements, not arrays. It's impossible to pass an array in C; a parameter with an array type is automatically adjusted to the corresponding pointer type anyway.
As you don't intend to modify what arr points to, make it explicit by adding a const. This helps for example the compiler to catch errors if you accidentally attempt to modify this array.
You would call this code e.g. like this:
int numbers[] = {1, 2, 3, 4, 5};
double foo[5];
double mean = getMean(foo, numbers, 5);
Instead of the magic number 5, you could write e.g. sizeof numbers / sizeof *numbers.
Another approach is to dynamically allocate the array with malloc() inside your function, but this requires the caller to free() it later. Which approach is more suitable depends on the rest of your program.
Following the advice suggested by @FelixPalmen is probably the best choice. But, if there is a maximum array size that can be expected, it is also possible to wrap arrays in a struct, without needing dynamic allocation. This allows code to create new structs without the need for deallocation.
A mean_array structure can be created in the get_mean() function, assigned the correct values, and returned to the calling function. The calling function only needs to provide a mean_array structure to receive the returned value.
#include <stdio.h>
#include <assert.h>
#define MAX_ARR 100
struct mean_array {
double mean;
double array[MAX_ARR];
size_t num_elems;
};
struct mean_array get_mean(int arr[], size_t arr_sz);
int main(void)
{
int my_arr[] = { 1, 2, 3, 4, 5 };
struct mean_array result = get_mean(my_arr, sizeof my_arr / sizeof *my_arr);
printf("mean: %f\n", result.mean);
for (size_t i = 0; i < result.num_elems; i++) {
printf("%8.5f", result.array[i]);
}
putchar('\n');
return 0;
}
struct mean_array get_mean(int arr[], size_t arr_sz)
{
assert(arr_sz <= MAX_ARR);
struct mean_array res = { .num_elems = arr_sz };
double sum = 0;
for (size_t i = 0; i < arr_sz; i++) {
sum += arr[i];
res.array[i] = 10.5 * i / (128.0 - 1.0);
}
res.mean = sum / arr_sz;
return res;
}
Program output:
mean: 3.000000
0.00000 0.08268 0.16535 0.24803 0.33071
In answer to a couple of questions asked by OP in the comments:
size_t is the correct type to use for array indices, since it is guaranteed to be able to hold any array index. You can often get away with int instead; be careful with this, though, since accessing, or even forming a pointer to, the location one before the first element of an array leads to undefined behavior. In general, array indices should be non-negative. Further, size_t may be a wider type than int in some implementations; size_t is guaranteed to hold any array index, but there is no such guarantee for int.
Concerning the for loop syntax used here, e.g., for (size_t i = 0; i < sz; i++) {}: here i is declared with loop scope. That is, the lifetime of i ends when the loop body is exited. This has been possible since C99. It is good practice to limit variable scopes when possible. I default to this so that I must actively choose to make loop variables available outside of loop bodies.
If the loop-scoped variables or size_t types are causing compilation errors, I suspect that you may be compiling in C89 mode. Both of these features were introduced in C99. If you are using gcc, older versions (for example, gcc 4.x, I believe) default to C89. You can compile with gcc -std=c99 or gcc -std=c11 to use a more recent language standard. I would recommend at least enabling warnings with gcc -std=c99 -Wall -Wextra to catch many problems at compilation time. If you are working on Windows, you may also have similar difficulties. As I understand it, MSVC is C89 compliant, but has limited support for later C language standards.
The example is taken from Wikipedia:
void updatePtrs(size_t *restrict ptrA, size_t *restrict ptrB, size_t *restrict val)
{
*ptrA += *val;
*ptrB += *val;
}
I call this function in the main():
int main(void)
{
size_t i = 10;
size_t j = 0;
updatePtrs(&i, &j, &i);
printf("i = %lu\n", i);
printf("j = %lu\n", j);
return 0;
}
According to Wikipedia's description, the val pointer need not be loaded twice, so the value of j should be 10, but in fact it's 20.
Is my comprehension about this keyword not correct? Should I utilize some specific options of gcc?
Thanks in advance.
Your code causes undefined behaviour. restrict is a promise from you to the compiler that all of the pointer parameters point to different memory areas.
You break this promise by putting &i for two of the arguments.
(In fact, with restrict it is allowed to pass overlapping pointers, but only if no writes are done through any of the overlapping pointers within the function. But typically you would not bother with restrict if there is no writing happening).
FWIW, on my system with gcc 4.9.2, output is j = 20 at -O0 and j = 10 at -O1 or higher, which suggests that the compiler is indeed taking note of the restrict. Of course, since it is undefined behaviour, your results may vary.
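For comparison, a variant of the same program that keeps the restrict promise (my sketch; values chosen arbitrarily) is well defined at every optimization level:
#include <stdio.h>
#include <stddef.h>
void updatePtrs(size_t *restrict ptrA, size_t *restrict ptrB, size_t *restrict val)
{
    *ptrA += *val;  /* *val may be loaded once here... */
    *ptrB += *val;  /* ...and the cached value reused here */
}
int main(void)
{
    size_t a = 10, b = 0, v = 10;  /* three distinct objects: no overlap */
    updatePtrs(&a, &b, &v);
    printf("a = %zu, b = %zu\n", a, b);  /* a = 20, b = 10 */
    return 0;
}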
In CSAPP2e, when demonstrating the "memory mountain", the author used the following piece of code:
double data[MAXELEMS];
/* $begin mountainfuns */
void test(int elems, int stride) /* The test function */
{
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i += stride)
result += data[i];
sink = result; /* So compiler doesn't optimize away the loop */
}
/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
double cycles;
int elems = size / sizeof(int);
test(elems, stride); /* warm up the cache */
cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */
return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
I am not quite clear on why we use volatile to avoid optimization in the function test(). I saw on Wikipedia that the volatile keyword indicates that a value may change between different accesses, even if it does not appear to be modified. However, I am not clear on the reason to use volatile in this example, and if we don't use volatile, what will happen?
According to the C Standard, writing to a volatile variable is observable behaviour. (Doesn't make a lot of sense to me in the case of a stack variable that is not used afterwards, but them's the rules).
Compiler optimization is not permitted to alter the program's observable behaviour, so this forces the compiler to work out the value of result in order to assign it to sink.
If you don't use volatile the compiler may transform the whole function to a no-op since it has no observable behaviour.
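To illustrate, a stripped-down sketch (names and types are mine, not the book's):
int data[4096];
void test_kept(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;  /* observable store: the loop must really run */
}
void test_gone(int elems, int stride)
{
    int i, result = 0;
    int sink;                        /* no volatile */
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;                   /* no observable effect: the compiler
                                        may remove the loop entirely */
    (void)sink;                      /* just to quiet unused warnings */
}
With optimization enabled, test_gone may legally be reduced to a plain return, and the benchmark in run() would then be measuring an empty function.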
I have a question about C compiler optimization and when/how loops in inline functions are unrolled.
I am developing a numerical code which does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and sending the function pointer to my_op()... whose job it is to modify the i-th double in each of the (arg->n) double arrays arg->dest[j].
typedef struct my_type {
int const n;
double *dest[16];
double const *src[16];
} my_type;
static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
int i;
for( i=0; i<N; ++i )
op( arg, i );
}
static inline void my_op( my_type *arg, int i ) {
int j;
int const n = arg->n;
for( j=0; j<n; ++j )
arg->dest[j][i] += arg->src[j][i];
}
void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
my_type Arg = {
.n = 2,
.dest = { dest0, dest1 },
.src = { src0, src1 }
};
my_for( &my_op, &Arg, N );
}
This works fine. The functions are inlining as they should and the code is (almost) as efficient as having written everything inline in a single function and unrolled the j loop, without any sort of my_type Arg.
Here’s the confusion: if I set int const n = 2; rather than int const n = arg->n; in my_op(), then the code becomes as fast as the unrolled single-function version. So, the question is: why? If everything is being inlined into my_func(), why doesn’t the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I explicitly make the bound on the j loop arg->n, which should look just like the speedier int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this const-ness to the compiler, but it just doesn't want to unroll the loop.
In my numerical code, this amounts to about a 15% performance hit. If it matters, there, n=4 and these j loops appear in a couple of conditional branches in an op().
I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):
-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic
Thanks!
Well, obviously the compiler isn't 'smart' enough to propagate the n constant and unroll the for loop. Actually it plays it safe since arg->n can change between instantiation and usage.
In order to have consistent performance across compiler generations and squeeze the maximum out of your code, do the unrolling by hand.
What people like myself do in these situations (performance is king) is rely on macros.
Macros will 'inline' in debug builds (useful) and can be templated (to a point) using macro parameters. Macro parameters which are compile time constants are guaranteed to remain this way.
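For instance, a sketch using the my_type from the question (MY_OP is a made-up name): the trip count is a macro parameter, so when the expansion site passes a literal constant, that constant survives to code generation, because macro expansion happens before the optimizer ever sees the loop.
/* N_ must be a compile-time constant at the expansion site for
   the compiler to see a constant trip count and fully unroll */
#define MY_OP(arg, i, N_) do {                         \
    for (int j_ = 0; j_ < (N_); ++j_)                  \
        (arg)->dest[j_][(i)] += (arg)->src[j_][(i)];   \
} while (0)
/* inside my_func, replacing the my_for call: */
/* for (int i = 0; i < N; ++i) MY_OP(&Arg, i, 2); */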
It's faster because your program does not need to allocate memory for the variable.
If you don't have to perform any operations on unknown values, they are treated as if they were #define constant 2, but with type checking: the values are simply substituted during compilation.
Could you please choose one of the two tags (I mean C or C++)? It's confusing, because the two languages treat const values differently: C treats them like normal variables whose value just can't be changed, while in C++ they do or don't have memory assigned depending on the context (if you need their address, or if they must be computed while the program is running, then memory is assigned).
Source: "Thinking in C++". No exact quote.