Cache optimization of C loop: Why doesn't this work?


I'm trying to duplicate the first piece of code from this article:
http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736
Namely:
static volatile int array[Size];
static void test_function(void)
{
for (int i = 0; i < Iterations; i++)
for (int x = 0; x < Size; x++)
array[x]++;
}
I'm running on OS X with an Ivy Bridge processor, and therefore have 64KiB of L1 cache. However, no matter how much I change around the array size, it takes the same amount of time. Here's my code:
#define ARRAY_SIZE 16 * 1024
#define NUM_ITERATIONS 200000
volatile int array[ARRAY_SIZE];
int main(int argc, const char * argv[])
{
for (int i = 0; i < NUM_ITERATIONS; i++)
for (int x = 0; x < ARRAY_SIZE; x++)
array[x]++;
return 0;
}
Now, according to the logic suggested by the article, array should be 64KiB and utilize all my L1 cache. However, I've tried this with many different combinations of ARRAY_SIZE (up to 160 * 1024), setting NUM_ITERATIONS accordingly, but every combination takes about the same amount of time.
I'm using gcc -o cachetest cachetest.c to compile, with no other options. Is there some kind of optimization going on that I don't know about, even though volatile is used? Or are there so many parallel processes and context switching that I can't even tell? What's going on here? I'm so confused.
Thanks SO!

There are two things to consider:
The compiler may apply some optimizations to your code by default.
Your code never uses array anywhere else; it only increments the values inside the loop. Without volatile, the compiler would be free to reduce your program to nothing but return 0, which would still be correct.
I recommend that you:
Add more code inside the loop so the compiler cannot eliminate it, for example: printf the array value, or add the array value to a sum variable and print the sum at the end of the loop.
Turn off all compiler optimization by compiling with the -O0 option.
Check the assembly generated by the compiler by using the -S option.
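The suggestion to fold the array into a sum can be sketched as below. This is only a minimal illustration: touch_array is a hypothetical helper name (not from the article or the question), and the caller is assumed to print or otherwise use the returned sum.

```c
#include <stddef.h>

/* Hypothetical helper: increments every element `iterations` times,
   then returns the sum of the elements so the work has a visible result. */
long touch_array(volatile int *array, size_t size, int iterations)
{
    for (int i = 0; i < iterations; i++)
        for (size_t x = 0; x < size; x++)
            array[x]++;

    long sum = 0;
    for (size_t x = 0; x < size; x++)
        sum += array[x];
    return sum;
}
```

Printing the returned value after the timed region gives the loop an observable effect even if volatile is later removed.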

Related

why do I have a runtime #2 failure in C when I have enough space and there isn't many data in the array

I'm writing this code in C for some offline games but when I run this code, it says "runtime failure #2" and "stack around the variable has corrupted". I searched the internet and saw some answers but I think there's nothing wrong with this.
#include <stdio.h>
int main(void) {
int a[16];
int player = 32;
for (int i = 0; i < sizeof(a); i++) {
if (player+1 == i) {
a[i] = 254;
}
else {
a[i] = 32;
}
}
printf("%d", a[15]);
return 0;
}

Your loop runs from 0 to sizeof(a), and sizeof(a) is the size in bytes of your array.
Each int is (typically) 4-bytes, and the total size of the array is 64-bytes. So variable i goes from 0 to 63.
But the valid indices of the array are only 0-15, because the array was declared [16].
The standard way to iterate over an array like this is:
#define count_of_array(x) (sizeof(x) / sizeof(*(x)))
for (int i = 0; i < count_of_array(a); i++) { ... }
The count_of_array macro calculates the number of elements in the array by taking the total size of the array, and dividing by the size of one element.
In your example, it would be (64 / 4) == 16.
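Here is a small sketch of the element-count idiom applied to a loop like the one in the question. COUNT_OF and fill_board are names I made up for illustration, and the macro only works on a real array, not on a pointer.

```c
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#define COUNT_OF(x) (sizeof(x) / sizeof((x)[0]))

/* Hypothetical helper mirroring the question's loop, with the
   element count passed in instead of the byte size. */
void fill_board(int *a, size_t count, size_t player)
{
    for (size_t i = 0; i < count; i++)
        a[i] = (i == player + 1) ? 254 : 32;
}
```

For int a[16], calling fill_board(a, COUNT_OF(a), player) touches only indices 0 to 15.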

sizeof(a) is not the number of elements in a, but rather the number of bytes a occupies.
a has 16 ints. The size of an int depends on the implementation. Many C implementations use 4-byte ints, but some use 2-byte ints, so sizeof(a) == 64 or sizeof(a) == 32. Either way, that's not what you want.
You declared int a[16];, so a has 16 elements.
So, change your for loop into:
for (int i = 0; i < 16; i++)

You're indexing far past the end of the array, touching memory that doesn't belong to it. sizeof(a) evaluates to 64 (depending on the C implementation, actually), which is the total number of bytes your int array takes up.
There are good reasons for trying not to statically declare the number of iterations in a loop when iterating over an array.
For example, you might realloc memory (if you've declared the array using malloc) in order to grow or shrink the array, thus making it harder to keep track of the size of the array at any given point. Or maybe the size of the array depends on user input. Or something else altogether.
There's no good reason to avoid writing for (int i = 0; i < 16; i++) in this particular case, though. What I would do is declare const int foo = 16; and then use foo instead of a literal number, both in the array declaration and in the for loop, so that if you ever need to change it, you only need to change it in one place. (In C, an array sized by a const int is a variable-length array, which needs C99 support; a #define or an enum constant avoids that.) Otherwise, if you really want to use sizeof (perhaps for one of the reasons above), divide sizeof(array) by the size of one element. For example:
#include <stdio.h>
const int ARRAY_SIZE = 30;
int main(void)
{
int a[ARRAY_SIZE];
for(int i = 0; i < sizeof(a) / sizeof(int); i++)
a[i] = 100;
// I'd use for(int i = 0; i < ARRAY_SIZE; i++) though
}

Why is the use of unrelated printf statement causing changes in my program output?

I'm stuck with a program where just having a printf statement is causing changes in the output.
I have an array of n elements. For the median of every d consecutive elements, if the (d+1)th element is greater or equals to twice of it (the median), I'm incrementing the value of notifications. The complete problem statement might be referred here.
This is my program:
#include <math.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#define RANGE 200
float find_median(int *freq, int *ar, int i, int d) {
int *count = (int *)calloc(sizeof(int), RANGE + 1);
for (int j = 0; j <= RANGE; j++) {
count[j] = freq[j];
}
for (int j = 1; j <= RANGE; j++) {
count[j] += count[j - 1];
}
int *arr = (int *)malloc(sizeof(int) * d);
float median;
for (int j = i; j < i + d; j++) {
int index = count[ar[j]] - 1;
arr[index] = ar[j];
count[ar[j]]--;
if (index == d / 2) {
if (d % 2 == 0) {
median = (float)(arr[index] + arr[index - 1]) / 2;
} else {
median = arr[index];
}
break;
}
}
free(count);
free(arr);
return median;
}
int main() {
int n, d;
scanf("%d %d", &n, &d);
int *arr = malloc(sizeof(int) * n);
for (int i = 0; i < n; i++) {
scanf("%i", &arr[i]);
}
int *freq = (int *)calloc(sizeof(int), RANGE + 1);
int notifications = 0;
if (d < n) {
for (int i = 0; i < d; i++)
freq[arr[i]]++;
for (int i = 0; i < n - d; i++) {
float median = find_median(freq, arr, i, d); /* Count sorts the arr elements in the range i to i+d-1 and returns the median */
if (arr[i + d] >= 2 * median) { /* If the (i+d)th element is greater or equals to twice the median, increments notifications*/
printf("X");
notifications++;
}
freq[arr[i]]--;
freq[arr[i + d]]++;
}
}
printf("%d", notifications);
return 0;
}
Now, For large inputs like this, the program outputs 936 as the value of notifications whereas when I just exclude the statement printf("X") the program outputs 1027 as the value of notifications.
I'm really not able to understand what is causing this behavior in my program, and what I'm missing/overseeing.

Your program has undefined behavior here:
for (int j = 0; j <= RANGE; j++) {
count[j] += count[j - 1];
}
You should start the loop at j = 1. As coded, you access memory before the beginning of the array count, which could cause a crash or produce an unpredictable value. Changing anything in the running environment can lead to a different behavior. As a matter of fact, even changing nothing could.
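A minimal sketch of the corrected accumulation, written as a standalone function (prefix_sums is a hypothetical name, not part of the original program):

```c
#include <stddef.h>

/* In-place running totals. The loop starts at 1, so count[j - 1]
   never refers to an element before the start of the array. */
void prefix_sums(int *count, size_t n)
{
    for (size_t j = 1; j < n; j++)
        count[j] += count[j - 1];
}
```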
The rest of the code is more difficult to follow at a quick glance, but given the computations on index values, there may be more problems there too.
For starters, you should add some consistency checks:
verify the return value of scanf() to ensure proper conversions.
verify the values read into arr, they must be in the range 0..RANGE
verify that int index = count[ar[j]] - 1; never produces a negative number.
same for count[ar[j]]--;
verify that median = (float)(arr[index] + arr[index - 1]) / 2; is never evaluated with index == 0.
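The first two checks could look roughly like this. It is a sketch under my own assumptions: parse_bounded is a hypothetical helper that reads from a string with sscanf rather than from stdin, and max_value stands in for RANGE.

```c
#include <stdio.h>

/* Hypothetical helper: parses one integer from text and accepts it only
   if it lies in 0..max_value. Returns 1 on success, 0 otherwise. */
int parse_bounded(const char *text, int max_value, int *out)
{
    int value;
    if (sscanf(text, "%d", &value) != 1)
        return 0;   /* conversion failed */
    if (value < 0 || value > max_value)
        return 0;   /* would index outside a table of max_value + 1 entries */
    *out = value;
    return 1;
}
```

The same pattern applies to scanf: compare its return value against the number of conversions requested before using the result as an index.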

Your program has undefined behavior (in several places). You really should be scared (and you are not scared enough).
I'm really not able to understand what is causing this behavior in my program
With UB, that question is pointless. You need to dive into implementation details (e.g. study the generated machine code of your program, and the code of your C compiler and standard library) to understand anything more. You probably don't want to do that (it could take years of work).
Please read, as soon as possible, Lattner's blog series What Every C Programmer Should Know About Undefined Behavior.
what I'm missing/overseeing.
You don't understand UB well enough. Be aware that a programming language is a specification (and code written against it), not a piece of software (e.g. your compiler). Program semantics are important.
As I said in comments:
compile with all warnings and debug info (gcc -Wall -Wextra -g with GCC)
improve your code to get no warnings; perhaps try also another compiler like Clang and work to also get no warnings from it (since different compilers give different warnings).
consider using some version control system like git to keep various variants of your code, and some build automation tool.
think more about your program and invariants inside it.
use the debugger (gdb), in particular with watchpoints, to understand the internal state of your process; and have several test cases to run under the debugger and without it.
use instrumentation facilities such as the address sanitizer -fsanitize=address of GCC and tools like valgrind.
use rubber duck debugging methodology
sometimes consider static source code analysis tools (e.g. Frama-C). They require expertise to be used, and/or give many false positives.
read more about programming (e.g. SICP) and about the C Programming Language. Download and study the C11 programming language specification n1570 (and be very careful about every mention of UB in it). Read carefully the documentation of every standard or external function you are using. Study also the documentation of your compiler and of other tools. Handle error and failure cases (e.g. calloc and scanf can fail).
Debugging is difficult (e.g. because of the Halting Problem, of Heisenbugs, etc...) - but sometimes fun and challenging. You can spend weeks on finding one single bug. And you often cannot understand the behavior of a buggy program without diving into implementation details (studying the machine code generated by the compiler, studying the code of the compiler).
PS. Your question shows the wrong mindset, which you should improve, and a misunderstanding of UB.

Effect of cache size on code

I want to study the effect of the cache size on code. For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.
How can I measure this?
I tried to run this C program:
#define L1_CACHE_SIZE 32 // Kbytes 8192 integers
#define L2_CACHE_SIZE 256 // Kbytes 65536 integers
#define L3_CACHE_SIZE 4096 // Kbytes
#define ARRAYSIZE 32000
#define ITERATIONS 250
int arr[ARRAYSIZE];
/*************** TIME MEASSUREMENTS ***************/
double microsecs() {
struct timeval t;
if (gettimeofday(&t, NULL) < 0 )
return 0.0;
return (t.tv_usec + t.tv_sec * 1000000.0);
}
void init_array() {
int i;
for (i = 0; i < ARRAYSIZE; i++) {
arr[i] = (rand() % 100);
}
}
int operation() {
int i, j;
int sum = 0;
for (j = 0; j < ITERATIONS; j++) {
for (i = 0; i < ARRAYSIZE; i++) {
sum =+ arr[i];
}
}
return sum;
}
void main() {
init_array();
double t1 = microsecs();
int result = operation();
double t2 = microsecs();
double t = t2 - t1;
printf("CPU time %f milliseconds\n", t/1000);
printf("Result: %d\n", result);
}
taking different values of ARRAYSIZE and ITERATIONS (keeping the product, and hence the number of instructions, constant) in order to check whether the program runs faster if the array fits in the cache, but I always get the same CPU time.
Can anyone say what I am doing wrong?

What you really want to do is build a "memory mountain." A memory mountain helps you visualize how memory accesses affect program performance. Specifically, it measures read throughput vs spatial locality and temporal locality. Good spatial locality means that consecutive memory accesses are near each other and good temporal locality means that a certain memory location is accessed multiple times in a short amount of program time. Here is a link that briefly mentions cache performance and memory mountains. The 3rd edition of the textbook mentioned in that link is a very good reference, specifically chapter 6, for learning about memory and cache performance. (In fact, I'm currently using that section as a reference as I answer this question.)
Another link shows a test function that you could use to measure cache performance, which I have copied here:
void test(int elems, int stride)
{
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i+=stride)
result += data[i];
sink = result;
}
Stride controls the spatial locality: how far apart consecutive memory accesses are.
The idea is to time this function and estimate the number of cycles it took to run. To get throughput, you'll want to take (size / stride) / (cycles / MHz), where size is the size of the array in bytes, cycles is the measured cycle count for the call, and MHz is the clock speed of your processor. You'd want to call the function once before you take any measurements to "warm up" your cache. Then, run the loop and take measurements.
I found a GitHub repository that you could use to build a 3D memory mountain on your own machine. I encourage you to try it on multiple machines with different processors and compare differences.
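For reference, here is a self-contained sketch of the two pieces described above. It is not the textbook's code: strided_sum takes the array as a parameter instead of using a global, read_throughput is a hypothetical name for the formula, and measuring the cycle count is left to the caller. stride is assumed to be at least 1.

```c
#include <stddef.h>

/* Sums every stride-th element; the caller times this call. */
long strided_sum(const int *data, size_t elems, size_t stride)
{
    long result = 0;
    for (size_t i = 0; i < elems; i += stride)
        result += data[i];
    return result;
}

/* (size / stride) / (cycles / MHz): bytes per microsecond, i.e. MB/s. */
double read_throughput(double size_bytes, double stride,
                       double cycles, double mhz)
{
    return (size_bytes / stride) / (cycles / mhz);
}
```

Returning the sum (and using it) plays the same role as the volatile sink in the original: it keeps the compiler from discarding the reads.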

There's a typo in your code. =+ instead of +=.
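To see the difference concretely, compare these two small functions (the names are mine, purely for illustration). The expression sum =+ x parses as sum = +x, so it assigns instead of accumulating.

```c
#include <stddef.h>

/* As in the question: each iteration overwrites sum with arr[i]. */
int sum_with_typo(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = +arr[i];   /* what "sum =+ arr[i]" actually means */
    return sum;
}

/* Intended behaviour: accumulate every element. */
int sum_fixed(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
```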

The arr array is linked into the BSS (uninitialized data) section. The default value for the variables in this section is zero. All pages in this section are initially mapped read-only to a single zero page. This is Linux/Unix-centric, but probably applies to most modern OSes.
So, regardless of the array size, you're only fetching from a single page, which will get cached, so that's why you get the same results.
You'll need to break the "zero page mapping" by writing something to all of arr before doing your tests. That is, do something like memset first. This will cause the OS to create a linear page mapping for arr using its COW (copy-on-write) mechanism.
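A minimal sketch of that pre-write step, assuming a helper called prefault (a made-up name) that is run once before the timed section:

```c
#include <stddef.h>
#include <string.h>

/* Writes a nonzero byte pattern over the whole array so that every
   page has been written to before any timing starts. */
void prefault(int *arr, size_t count)
{
    memset(arr, 1, count * sizeof *arr);
}
```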

Declared array of size [x][y] and another array with size [y-1]

I am using Code::Blocks 10.05, and the GNU GCC Compiler.
Basically, I ran into a really strange (and for me, inexplicable) issue that arises when trying to initialize an array outside its declared size. In words, it's this:
*There is a declared array of size [x][y].
*There is another declared array with size [y-1].
The issue comes up when trying to put values into this second, size [y-1] array, outside of the [y-1] size. When this is attempted, the first array [x][y] will no longer maintain all of its values. I simply don't understand why breaking (or attempting to break) one array would affect the contents of the other. Here is some sample code to see it happening (it is in the broken format; to see the issue vanish, simply change array2[4] to array2[5], thus eliminating what I have pinpointed to be the problem).
#include <stdio.h>
int main(void)
{
//Declare the array/indices
char array[10][5];
int array2[4]; //to see it work (and verify the issue), change 4 to 5
int i, j;
//Set up use of an input text file to fill the array
FILE *ifp;
ifp = fopen("input.txt", "r");
//Fill the array
for (i = 0; i <= 9; i++)
{
for (j = 0; j <= 5; j++)
{
fscanf(ifp, "%c", &array[i][j]);
//printf("[%d][%d] = %c\n", i, j, array[i][j]);
}
}
for (j = 4; j >= 0; j--)
{
for (i = 0; i <= 9; i++)
{
printf("[%d][%d] = %c\n", i, j, array[i][j]);
}
//PROBLEM LINE*************
array2[j] = 5;
}
fclose(ifp);
return 0;
}
So does anyone know how or why this happens?

Because when you write outside of an array bounds, C lets you. You're just writing to somewhere else in the program.
C is known as the lowest level high level language. To understand what "low level" means, remember that each of the variables you have created can be thought of as living in physical memory. An array of integers of length 16 might occupy 64 bytes if integers are size 4. Perhaps they occupy bytes 100-163 (unlikely, but I'm not going to make up realistic numbers; also, these are usually better thought of in hexadecimal). What occupies byte 164? Maybe another variable in your program. What happens if you write one past the end of your array of 16 integers? Well, it might write to that byte.
C lets you do this. Why? If you can't think of any answers, then maybe you should switch languages. I'm not being pedantic - if this doesn't benefit you then you might want to program in a language in which it is a little harder for you to make weird mistakes like this. But reasons include:
It's faster and smaller. Adding bounds checking takes time and space, so if you're writing code for a microprocessor, or writing a JIT compiler, speed and size really do matter a lot.
If you want to understand machine architecture and go into hardware, e.g. if you're a student, it's a good gateway from programming into OS/hardware/electrical engineering. And much of computer science.
Being close to machine code, it's standard in a way that many other languages and systems have to, or can easily, support some degree of compatibility with.
Other reasons that I would be able to give if I ever actually had to work this close to the machine code.
The moral is: In C, be very careful. You must check your own array bounds. You must clean up your own memory. If you don't, your program often won't crash but will start just doing really weird things without telling you where or why.
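Since the language won't do it for you, a bounds check has to be written by hand. One possible shape, with set_checked being a hypothetical helper of my own:

```c
#include <stddef.h>

/* Stores value at arr[index] only if index is within bounds.
   Returns 0 on success, -1 if the write was rejected. */
int set_checked(int *arr, size_t len, size_t index, int value)
{
    if (index >= len)
        return -1;
    arr[index] = value;
    return 0;
}
```

With an array of length 4, a call using index 4 is refused instead of silently writing into whatever happens to live next to the array.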

for (j = 0; j <= 5; j++)
should be
for (j = 0; j <= 4; j++)
and array2 max index is 3 so
array2[j] = 5;
is also going to be a problem when j == 4.
C array indexes start from 0, so the valid indexes of an [X] array run from 0 to X-1, giving X elements in total.
You should use the < operator instead of <=, so that the same number appears in both the array declaration [X] and the loop condition < X. For instance:
int array[10];
...
for (i=0 ; i < 10 ; ++i) ... // instead of `<= 9`
This is less error prone.

If you're outside the bounds of one array, there's always a possibility you'll be inside the bounds of the other.

array2[j] = 5; - This is your overflow problem.
for (j = 0; j <= 5; j++) - This is also an overflow. Here you are trying to access index 5, but only indexes 0 to 4 are valid.
In process memory, each function call creates an activation record that holds all the local variables of the function, plus some extra memory for bookkeeping such as the return address. Your function has four local variables: array, array2, i and j. They are laid out next to each other in some order. So when an overflow happens, it first overwrites the variable placed just above or below, depending on the architecture and compiler. If the overflow spans more bytes, it may corrupt the stack itself by overwriting data belonging to other stack frames. This may lead to a crash; sometimes it does not, and the program instead behaves unpredictably, as you are seeing now.

Efficient computation of kronecker products in C

I'm fairly new to C, not having had much need for anything faster than python for most of my research. However, it turns out that recent work I've been doing requires the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker Product of selected pairs of these vectors, and then sum these kronecker products.
The question is, how to do this efficiently? Is there anything wrong with the following code structure, using for loops, to obtain the effect?
The function kron described below passes vectors A and B of lengths vector_size, and computes their kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.

Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest first step would be to take advantage of several cores before thinking of MPI. OpenMP should do quite well here.
#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
for (int j = 0; j < vector_size; j++) {
C[i*vector_size + j] = A[i] * B[j];
}
}
This is supported by many compilers nowadays.
You could also try to hoist some common expressions out of the inner loop, but decent compilers, e.g. gcc, icc or clang, should do this quite well all by themselves:
#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
int const x = A[i];
int * vec = &C[i*vector_size];
for (int j = 0; j < vector_size; ++j) {
vec[j] = x * B[j];
}
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.

For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
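To make the "sum of outer products is a matrix product" remark concrete, here is a plain C sketch with no BLAS involved. The function name and the row-major layout are my assumptions: X and Y each store m vectors of length n as rows, and the result is the product of the transpose of X with Y, which is the kind of product DGEMM computes.

```c
#include <stddef.h>

/* X and Y each hold m row vectors of length n (row-major).
   C (n x n, row-major) receives the sum over k of outer(X[k], Y[k]),
   which equals transpose(X) * Y. */
void sum_outer_products(const double *X, const double *Y, double *C,
                        size_t m, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (size_t k = 0; k < m; k++)
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += X[k * n + i] * Y[k * n + j];
}
```

A tuned BLAS would replace the triple loop with a single call, but the arithmetic being done is the same.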

If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
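Put together, a restrict-qualified version might look like the sketch below. It assumes a C99 compiler; kron_restrict is a name I chose to keep it apart from the question's kron, and I have also switched the index type to size_t and added const to the inputs, which goes beyond what the signature above shows.

```c
#include <stddef.h>

/* C must not overlap A or B; restrict lets the compiler rely on that. */
void kron_restrict(const int *restrict A, const int *restrict B,
                   int *restrict C, size_t vector_size)
{
    for (size_t i = 0; i < vector_size; i++) {
        const int a = A[i];   /* loaded once per row */
        for (size_t j = 0; j < vector_size; j++)
            C[i * vector_size + j] = a * B[j];
    }
}
```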

Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding their products to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] += alpha * A[i] * B[j];
}
}
return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 1 and C being a zero-initialised matrix/vector of the correct dimensionality.
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!

This is a common enough problem in numerical computational circles, that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.

Another optimisation that would be easy to implement is that if you know that the inner dimension of your arrays will be divisible by n then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
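As a rough illustration of manual unrolling, here is the inner loop unrolled four at a time, with a cleanup loop so the length does not have to be a multiple of four. scale_unrolled is a hypothetical helper; in kron it would be called once per row with factor = A[i] and dst pointing at the start of row i of C.

```c
#include <stddef.h>

/* dst[j] = factor * src[j], four elements per iteration,
   followed by a cleanup loop for the leftover 0 to 3 elements. */
void scale_unrolled(int *dst, const int *src, int factor, size_t n)
{
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        dst[j]     = factor * src[j];
        dst[j + 1] = factor * src[j + 1];
        dst[j + 2] = factor * src[j + 2];
        dst[j + 3] = factor * src[j + 3];
    }
    for (; j < n; j++)
        dst[j] = factor * src[j];
}
```

Whether this beats what the compiler generates on its own at -O2 or -O3 is something to measure rather than assume.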
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
int d = *A++;
int *e = B;
for (j = 0; j < vector_size; j++) {
*C++ = *e++ * d;
}
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)

To solve your problem, I think you should try Eigen 3. It's a C++ library that provides a wide range of matrix functions!
If you have time, go to see its documentation! =)
Good luck !

uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
A[i]=i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
B[i]=i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
C[i]=0;
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
for (uint32_t j=0, allB=rB*cB; j<allB; j++)
C[((i/lda)*rB+j/ldb)*ldc
+ (i%lda)*cB+j%ldb ]=A[i]*B[j];
}
