I have a function as follows:
int doSomething(long numLoop, long arraySize) {
    int *buffer;
    buffer = (int *) malloc(arraySize * sizeof(int));
    long k;
    int i;
    for (i = 0; i < arraySize; i++)
        buffer[i] = 2; // write to make sure memory is allocated

    // start reading from cache
    for (k = 0; k < numLoop; k++) {
        int i;
        int temp;
        for (i = 0; i < arraySize; i++)
            temp = buffer[i];
    }
}
What it does is allocate an array and read it from beginning to end. The purpose is to see the effect of the cache.
What I expect to see: when I call doSomething(10000, 1000), arraySize is small, so the whole array fits in the cache. When I then call doSomething(100, 100000), the array is bigger than the cache, so the second call should take longer than the first: it has to go out to memory because the whole array cannot be stored in the cache.
However, the second call takes approximately the same time as the first. What's wrong here? I tried compiling with -O0 and it doesn't solve the problem.
Thank you.
Update 1: this is the code with random access, and it seems to work; the access time with the large array is ~15 s while the small array takes ~3 s.
int doSomething(long numLoop, int a, long arraySize) {
    int *buffer;
    buffer = (int *) malloc(arraySize * sizeof(int));
    long k;
    int i;
    for (i = 0; i < arraySize; i++)
        buffer[i] = 2; // write to make sure memory is allocated

    // start reading from cache
    for (k = 0; k < numLoop; k++) {
        int temp;
        for (i = 0; i < arraySize; i++) {
            long randnum = rand(); // max is 32767
            randnum = (randnum << 16) | rand();
            if (randnum < 0) randnum = -randnum;
            randnum %= arraySize;
            temp = buffer[randnum];
        }
    }
}
You are accessing the array in sequence,
for (i = 0; i < arraySize; i++)
    temp = buffer[i];
so the part you are accessing will always be in the cache, since that pattern is trivial to predict. To see a cache effect, you must access the array in a less predictable order, for example by generating (pseudo)random indices, so that you jump between the front and the back of the array.
In addition to the other answers: your code accesses the memory sequentially. Let's assume that a cache line is 32 bytes, so you probably get a cache miss only on every 8th access. So, when picking a random index, you should make it at least 32 bytes (one cache line) away from the previous one.
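A minimal sketch of that idea, reusing the variables from the question and assuming a 32-byte cache line and 4-byte int: pick a random cache line rather than a random element, so consecutive accesses never land on the same line.

long intsPerLine = 32 / sizeof(int);          /* 8 ints per line under the assumptions above */
long numLines    = arraySize / intsPerLine;

for (k = 0; k < numLoop; k++) {
    int temp;
    for (i = 0; i < numLines; i++) {
        long line = rand() % numLines;        /* pick a random cache line */
        temp = buffer[line * intsPerLine];    /* one access per chosen line */
    }
}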
In order to measure the effect across multiple calls, you must use the same buffer (with the expectation that the first time through you are loading the cache, and the next time you are using it). In your case, you are allocating a new buffer for every call. (Additionally, you are never freeing your allocation.)
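A rough sketch of what the calling code could look like under that change, assuming a hypothetical variant of doSomething() that takes the buffer as a parameter instead of allocating its own:

int *buffer = malloc(arraySize * sizeof(int));
if (buffer == NULL)
    return 1;

doSomething(buffer, numLoop, arraySize);  /* first call pulls the array into the cache */
doSomething(buffer, numLoop, arraySize);  /* second call can actually reuse the cached lines */

free(buffer);                             /* and the allocation is released */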
Related
I've written a piece of code that uses a static array of size 3000.
Ordinarily, I would just use a for loop to scan in 3000 values, but it appears that I can only ever scan in a maximum of 2048 numbers. To me that seems like an issue with memory allocation, but I'm not sure.
The problem arises because I do not want the user to state in advance how many numbers they intend to input. They should just enter however many numbers they want and terminate the input with 0, after which the program does its work. (Otherwise I would just use malloc.)
The code is a fairly simple number occurrence counter, found below:
#include <stdio.h>

int main(int argc, char **argv)
{
    int c;
    int d;
    int j = 0;
    int temp;
    int array[3000];
    int i;

    // scanning in elements to array (have just used 3000 because no explicit
    // value for the length of the sequence is included)
    for (i = 0; i < 3000; i++)
    {
        scanf("%d", &array[i]);
        if (array[i] == 0)
        {
            break;
        }
    }

    // sorting
    for (c = 0; c < i - 1; c++) {
        for (d = 0; d < i - c - 1; d++) {
            if (array[d] > array[d + 1]) {
                temp = array[d]; // swaps
                array[d] = array[d + 1];
                array[d + 1] = temp;
            }
        }
    }

    int arrayLength = i + 1; // saving current 'i' value to use as 'n' value before reset
    for (i = 0; i < arrayLength; i = j)
    {
        int numToCount = array[i];
        int occurrence = 1; // if a number has been found, the occurrence is at least 1
        for (j = i + 1; j < arrayLength; j++) // new loop starts at current position in array + 1 to check for duplicates
        {
            if (array[j] != numToCount) // prints immediately after finding out how many occurrences there are, else adds another
            {
                printf("%d: %d\n", numToCount, occurrence);
                break; // this break keeps 'j' at whatever value is NOT the numToCount, thus making the 'i = j' iterator restart the process at the right number
            } else {
                occurrence++;
            }
        }
    }
    return 0;
}
This code works perfectly for any number of inputs below 2048. An example of it not working would be inputting: 1000 1s, 1000 2s, and 1000 3s, after which the program would output:
1: 1000
2: 1000
3: 48
My question is whether there is any way to fix this so that the program will output the right number of occurrences.
To answer your title question: The size of an array in C is limited (in theory) only by the maximum value that can be represented by a size_t variable. This is typically a 32- or 64-bit unsigned integer, so you can have (for the 32-bit case) over 4 billion elements (or much, much more in 64-bit systems).
However, what you are probably encountering in your code is a limit on the memory available to the program. The line int array[3000]; declares an automatic variable, and space for these is generally allocated on the stack - a chunk of memory of limited size made available when the function (or main) is called. In your case (assuming 32-bit, 4-byte integers), you are taking 12,000 bytes from the stack, which may cause problems.
There are two (maybe more?) ways to fix the problem. First, you could declare the array static - this would make the compiler pre-allocate the memory, so it would not need to be taken from the stack at run-time:
static int array[3000];
A second, probably better, approach would be to call malloc to allocate memory for the array; this assigns memory from the heap - which has (on almost all systems) considerably more space than the stack. It is often limited only by the available virtual memory of the operating system (many gigabytes on most modern PCs):
int *array = malloc(3000 * sizeof(int));
Also, the advantage of using malloc is that if, for some reason, there isn't enough memory available, the function will return NULL, and you can test for this.
You can access the elements of the array in the same way, using array[i] for example. Of course, you should be sure to release the memory when you're done with it, at the end of your function:
free(array);
(This will be done automatically in your case, when the program exits, but it's good coding style to get used to doing it explicitly!)
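Putting the pieces together, a minimal sketch of the malloc-based version (the error message and overall structure are just illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *array = malloc(3000 * sizeof(int));
    if (array == NULL)               /* malloc tells you when there is no memory left */
    {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    /* ... fill and use array[i] exactly as before ... */

    free(array);                     /* release the memory when you're done with it */
    return 0;
}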
I have a program where I repeat a succession of methods to reproduce time evolution. One of the things I have to do is to write the same value to a long contiguous subset of elements of a very large array. Knowing which elements they are and which value I want, is there any way other than a loop that sets these values one by one?
EDIT: To be clear, I want to avoid this:
double arr[10000000];
int i;
for (i = 0; i < 100000; ++i)
    arr[i] = 1;
and do it with just one single call, if that is possible. Can you assign to part of an array the values from another array of the same size? Maybe I could keep in memory a second array arr2[1000000] with all elements set to 1 and then do something like copying the memory of arr2 onto the first 100,000 elements of arr?
I have a somewhat tongue-in-cheek and non-portable possibility for you to consider. If you tailored your buffer to a size that is a power of 2, you could seed the buffer with a single double, then use memcpy to copy successively larger chunks of the buffer until the buffer is full.
So first you copy the first 8 bytes over the next 8 bytes...(so now you have 2 doubles)
...then you copy the first 16 bytes over the next 16 bytes...(so now you have 4 doubles)
...then you copy the first 32 bytes over the next 32 bytes...(so now you have 8 doubles)
...and so on.
It's plain to see that we won't actually call memcpy all that many times, and if the implementation of memcpy is sufficiently faster than a simple loop we'll see a benefit.
Try building and running this and tell me how it performs on your machine. It's a very scrappy proof of concept...
#include <string.h>
#include <time.h>
#include <stdio.h>

void loop_buffer_init(double* buffer, int buflen, double val)
{
    for (int i = 0; i < buflen; i++)
    {
        buffer[i] = val;
    }
}

void memcpy_buffer_init(double* buffer, int buflen, double val)
{
    buffer[0] = val;
    int half_buf_size = buflen * sizeof(double) / 2;
    for (int i = sizeof(double); i <= half_buf_size; i += i)
    {
        memcpy((unsigned char *)buffer + i, buffer, i);
    }
}

void check_success(double* buffer, int buflen, double expected_val)
{
    for (int i = 0; i < buflen; i++)
    {
        if (buffer[i] != expected_val)
        {
            printf("But your whacky loop failed horribly.\n");
            break;
        }
    }
}

int main()
{
    const int TEST_REPS = 500;
    enum { BUFFER_SIZE = 16777216 };   // 2**24 doubles, 128MB; a constant expression so the static array below is valid C
    static double buffer[BUFFER_SIZE];

    time_t start_time;
    time(&start_time);
    printf("Normal loop starting...\n");
    for (int reps = 0; reps < TEST_REPS; reps++)
    {
        loop_buffer_init(buffer, BUFFER_SIZE, 1.0);
    }
    time_t end_time;
    time(&end_time);
    printf("Normal loop finishing after %.f seconds\n",
           difftime(end_time, start_time));

    time(&start_time);
    printf("Whacky loop starting...\n");
    for (int reps = 0; reps < TEST_REPS; reps++)
    {
        memcpy_buffer_init(buffer, BUFFER_SIZE, 2.5);
    }
    time(&end_time);
    printf("Whacky loop finishing after %.f seconds\n",
           difftime(end_time, start_time));
    check_success(buffer, BUFFER_SIZE, 2.5);
}
On my machine, the results were:
Normal loop starting...
Normal loop finishing after 21 seconds
Whacky loop starting...
Whacky loop finishing after 9 seconds
To work with a buffer that was less than a perfect power of 2 in size, just go as far as you can with the increasing powers of 2 and then fill out the remainder in one final memcpy.
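For instance, a hypothetical variant of memcpy_buffer_init() along those lines (not from the original answer, just a sketch of how the remainder could be handled):

void memcpy_buffer_init_any(double* buffer, int buflen, double val)
{
    size_t total  = (size_t)buflen * sizeof(double);
    size_t filled = sizeof(double);
    buffer[0] = val;
    while (filled * 2 <= total)           /* double the initialised region while it fits */
    {
        memcpy((unsigned char *)buffer + filled, buffer, filled);
        filled *= 2;
    }
    if (filled < total)                   /* fill out the remainder in one final memcpy */
        memcpy((unsigned char *)buffer + filled, buffer, total - filled);
}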
(Edit: before anyone mentions it, of course this is pointless with a static double (might as well initialize it at compile time) but it'll work just as well with a nice fresh stretch of memory requested at runtime.)
It looks like this solution is very sensitive to your cache size or other hardware optimizations. On my old (circa 2009) laptop the memcpy solution is as slow or slower than the simple loop, until the buffer size drops below 1MB. Below 1MB or so the memcpy solution returns to being twice as fast.
I have a program where I repeat a succession of methods to reproduce time evolution. One of the things I have to do is to write the same value to a long contiguous subset of elements of a very large array. Knowing which elements they are and which value I want, is there any way other than a loop that sets these values one by one?
In principle, you can initialize an array however you like without using a loop. If that array has static duration then that initialization might in fact be extremely efficient, as the initial value is stored in the executable image in one way or another.
Otherwise, you have a few options:
if the array elements are of a character type then you can use memset(). Very likely this involves a loop internally, but you won't have one literally in your own code.
if the representation of the value you want to set has all bytes equal, such as is the case for typical representations of 0 in any arithmetic type, then memset() is again a possibility.
as you suggested, if you have another array with suitable contents then you can copy some or all of it into the target array. For this you would use memcpy(), unless there is a chance that the source and destination could overlap, in which case you would want memmove().
more generally, you may be able to read in the data from some external source, such as a file (e.g. via fread()). Don't count on any I/O-based solution to be performant, however.
you can write an analog of memset() that is specific to the data type of the array (sketched below). Such a function would likely need to use a loop of some form internally, but you could avoid such a loop in the caller.
you can write a macro that expands to the needed loop. This can be type-generic, so you don't need different versions for different data types. It uses a loop, but the loop would not appear literally in your source code at the point of use.
If you know in advance how many elements you want to set, then in principle, you could write that many assignment statements without looping. But I cannot imagine why you would want so badly to avoid looping that you would resort to this for a large number of elements.
All of those except the last actually do loop, however -- they just avoid cluttering your code with a loop construct at the point where you want to set the array elements. Some of them may also be clearer and more immediately understandable to human readers.
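To make the helper-function and macro options concrete, here is a small sketch (the names dset and FILL are made up for illustration):

#include <stddef.h>

/* A memset()-style helper specific to double. The loop still exists,
   but only inside this one function. */
static void dset(double *dst, double val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = val;
}

/* A type-generic macro; the loop is hidden at the point of use. */
#define FILL(arr, count, value)                                          \
    do {                                                                 \
        for (size_t fill_i_ = 0; fill_i_ < (size_t)(count); fill_i_++)   \
            (arr)[fill_i_] = (value);                                    \
    } while (0)

/* Usage: dset(arr, 1.0, 100000);  or  FILL(arr, 100000, 1.0); */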
I've written an OpenCL program in C in order to take advantage of my GPU for parallel processing, and I've run into an issue where the display driver crashes under certain calling conditions when running one of my kernels. I've created a new stripped-down program that demonstrates the same behavior.
Essentially I allocate a linear array on the GPU and then launch a kernel, in which each thread will increment each value in a single nonoverlapping 'row' of the array of fixed size, according to its global thread ID.
I have a for loop wrapping this task which causes it to be repeated a number of times - however, each repetition, I reset the pointer to memory to the same starting value, so the inner loop should be performing exactly the same task each iteration of the outer loop.
The odd behavior is that the program runs with no apparent errors (and the output looks correct) when run with between 1 and 958 repetitions of the outer loop. However, if this number is increased to anything above 958, the display driver crashes and is recovered. Oddly, this doesn't result in an error returned by clEnqueueNDRangeKernel() or the subsequent clFinish().
Here's the kernel in question:
__kernel void testKernel(__global unsigned int* arr)
{
    // OVERRIDE ARGS
    unsigned int numReps = 958;
    unsigned int numRows = 1000;
    unsigned int rowLength = 676;

    // Make sure thread index is in-bounds
    if( get_global_id(0) < numRows )
    {
        __global unsigned int* arrPtr;
        __global unsigned int* arrInitPtr = arr + (get_global_id(0) * rowLength);
        unsigned int i, j;
        unsigned int tmp;

        for( i = 0; i < numReps; ++i )
        {
            // Reset the array pointer to the first element in this thread's row
            arrPtr = arrInitPtr;
            for( j = 0; j < rowLength; ++j )
            {
                // Increment value in the row
                tmp = *arrPtr;
                *arrPtr = tmp + 1;
                // Advance pointer to the next value
                ++arrPtr;
            }
        }
    }
}
I've hard-coded the number of rows and row length to avoid any possible mistakes in parameter-passing and simplify things further.
I allocate the buffer (passed in to the kernel as arr) and enqueue the kernel as follows:
size_t numThreads = 1000;
unsigned int rowLength = 676;
size_t arrLength = rowLength * numThreads;
cl_mem arr_d = clCreateBuffer(gpuContext, CL_MEM_READ_WRITE,
                              arrLength * sizeof(unsigned int), NULL, &clErr);
if( clErr != CL_SUCCESS )
{
    printf("Error: Failed to allocate buffer on device.\n");
    exit(2);
}
clSetKernelArg(testKernel, 0, sizeof(cl_mem), &arr_d);
clErr = clEnqueueNDRangeKernel(gpuCmdQueue, testKernel, 1, NULL, &numThreads, &numThreads, 0, NULL, NULL);
My first instinct is of course that arrPtr is being incremented beyond the boundaries of the array - however, I don't think this should be happening based on the for loop conditional and the fact that when I examine memory after copying the array back to the host, no values outside of the array appear to have been modified. For clarity, in my original program I initialize every value in the array to zero beforehand, but I left that out of this example program since it doesn't seem relevant to my problem.
I am positive that the memory access to arrPtr is out-of-bounds somehow - I don't see any other way for this to be crashing. However, my array is large enough, and I check the global thread ID before making any accesses, so even if my thread pool size were too large, that shouldn't be a problem.
I assume that the specific boundary of the failure (958 vs. 959) is fairly arbitrary, since it doesn't directly correspond to any of my parameters. The added repetitions must be exposing an underlying indexing problem. However, in that case it's odd that it's so repeatable with those values. I've also tried subtracting one from various parameters to look for off-by-one errors, to no avail.
For reference, I'm using nVidia's 64-bit implementation of OpenCL (CUDA 6.0 drivers) with a GeForce 770 under Windows 7 64-bit.
Thanks for any responses! I've tried to be specific but didn't want this to become too long - if you have any questions or want to see my full OpenCL setup code, please just let me know.
I know it's old, but whatever... From the comment:
Windows has a watchdog timer mechanism that restarts the display driver if it appears to become unresponsive. I find that if my kernel runs for more than a few seconds, the timer will trip and restart the display driver. The only solution I know of is to break up the kernel execution into segments of one or two seconds each and run them sequentially.
(I ran into more or less this same error myself, so this still seems to be true.)
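A rough sketch of how the host side might split the work, assuming the kernel is changed to take numReps as a second argument instead of hard-coding it (the names reuse those from the question; repsPerLaunch is a made-up tuning knob):

cl_uint totalReps = 958;              /* the total work you actually want done */
cl_uint repsPerLaunch = 100;          /* tune so one launch stays well under the watchdog limit */
cl_uint repsDone = 0;

while (repsDone < totalReps)
{
    cl_uint repsThisLaunch = (totalReps - repsDone < repsPerLaunch)
                                 ? (totalReps - repsDone) : repsPerLaunch;
    clSetKernelArg(testKernel, 1, sizeof(cl_uint), &repsThisLaunch);
    clErr = clEnqueueNDRangeKernel(gpuCmdQueue, testKernel, 1, NULL,
                                   &numThreads, &numThreads, 0, NULL, NULL);
    clFinish(gpuCmdQueue);            /* let each short launch complete before starting the next */
    repsDone += repsThisLaunch;
}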
I have a toy cipher program which is encountering a bus error when given a very long key (I'm using 961168601842738797 to reproduce it), which perplexes me. When I commented out sections to isolate the error, I found it was being caused by this innocent-looking for loop in my Sieve of Eratosthenes.
unsigned long i;
int candidatePrimes[CANDIDATE_PRIMES];
// CANDIDATE_PRIMES is a macro which sets the length of the array to
// two less than the upper bound of the sieve. (2 being the first prime
// and the lower bound.)

for (i = 0; i < CANDIDATE_PRIMES; i++)
{
    printf("i: %d\n", i); // does not print; bus error occurs first
    //candidatePrimes[i] = PRIME;
}
At times this has been a segmentation fault rather than a bus error.
Can anyone help me to understand what is happening and how I can fix it/avoid it in the future?
Thanks in advance!
PS
The full code is available here:
http://pastebin.com/GNEsg8eb
I would say your VLA is too large for your stack, leading to undefined behaviour.
Better to allocate the array dynamically:
int *candidatePrimes = malloc(CANDIDATE_PRIMES * sizeof(int));
And don't forget to free before returning.
If this is the Sieve of Eratosthenes, then the array really just holds flags. It's wasteful to use int if it's only going to hold 0 or 1. At least use char (for speed), or condense it to a bit array (for minimal storage).
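For the bit-array idea, a rough sketch (it reuses CANDIDATE_PRIMES from your macro; the SET/TEST macros are illustrative):

#include <limits.h>
#include <stdlib.h>

/* One bit per candidate: roughly CANDIDATE_PRIMES / 8 bytes instead of
   CANDIDATE_PRIMES * sizeof(int), and calloc starts every flag at 0. */
unsigned char *flags = calloc((CANDIDATE_PRIMES + CHAR_BIT - 1) / CHAR_BIT, 1);

#define SET_BIT(a, i)  ((a)[(i) / CHAR_BIT] |= (unsigned char)(1u << ((i) % CHAR_BIT)))
#define TEST_BIT(a, i) ((a)[(i) / CHAR_BIT] &  (1u << ((i) % CHAR_BIT)))

/* ... sieve using SET_BIT(flags, i) / TEST_BIT(flags, i), then free(flags); ... */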
The problem is that you're blowing the stack away.
unsigned long i;
int candidatePrimes[CANDIDATE_PRIMES];
If CANDIDATE_PRIMES is large, this declaration moves the stack pointer by a massive amount. It doesn't actually touch that memory; it just adjusts the stack pointer downwards by a very large amount.
for (i=0;i<CANDIDATE_PRIMES;i++)
{
This writes to "i", which is way back in the good area of the stack, setting it to zero. It checks that it's < CANDIDATE_PRIMES, which it is, and so performs the first iteration.
printf("i: %d\n", i); // does not print; bus error occurs first
This attempts to put the parameters for "printf" onto the bottom of the stack. BOOM. Invalid memory location.
What value does CANDIDATE_PRIMES have?
And do you actually want to store all the numbers you're testing, or only those that pass? What is the purpose of storing the values 0 through CANDIDATE_PRIMES sequentially in an array?
If you just want to store the primes, you should use a dynamic allocation and grow it as needed:
size_t g_numSlots = 0;
size_t g_numPrimes = 0;
unsigned long* g_primes = NULL;

void addPrime(unsigned long prime) {
    unsigned long* newPrimes;
    if (g_numPrimes >= g_numSlots) {
        g_numSlots += 256;
        newPrimes = realloc(g_primes, g_numSlots * sizeof(unsigned long));
        if (newPrimes == NULL) {
            die(gracefully);
        }
        g_primes = newPrimes;
    }
    g_primes[g_numPrimes++] = prime;
}
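A hypothetical use inside the sieve loop (isPrimeCandidate and upperBound are placeholders for whatever your sieve actually computes):

unsigned long n;
for (n = 2; n <= upperBound; n++)
{
    if (isPrimeCandidate(n))    /* whatever sieve/primality check you use */
        addPrime(n);
}
/* ... use g_primes[0..g_numPrimes-1], then free(g_primes); ... */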
I've noticed that a few of my classmates have actually tried asking questions about this same assignment on StackOverflow over the past few days so I'm going to shamelessly copy paste (only) the context of one question that was deleted (but still cached on Google with no answers) to save time. I apologize in advance for that.
Context
I am trying to write a C program that measures the data throughput (MBytes/sec) of the L2 cache of my system. To perform the measurement I have to write a program that copies an array A to an array B, repeated multiple times, and measure the throughput.
Consider at least two scenarios:
Both arrays fit in the L2 cache
The array size is significantly larger than the L2 cache size.
Use memcpy() from string.h to copy the arrays, initialize both arrays with some values (e.g. random numbers using rand()), and repeat at least 100 times; otherwise you will not see a difference.
The array size and number of repeats should be input parameters. One of the array sizes should be half of my L2 cache size.
Question
So based on that context of the assignment I have a good idea of what I need to do because it pretty much tells me straight out. The problem is that we were given some template code to work with and I'm having trouble deciphering parts of it. I would really appreciate it if someone would help me to just figure out what is going on.
The code is:
/* do not add other includes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <string.h>
double getTime(){
    struct timeval t;
    double sec, msec;
    while (gettimeofday(&t, NULL) != 0);
    sec = t.tv_sec;
    msec = t.tv_usec;
    sec = sec + msec/1000000.0;
    return sec;
}

/* for task 1 only */
void usage(void)
{
    fprintf(stderr, "bandwith [--no_iterations iterations] [--array_size size]\n");
    exit(1);
}

int main (int argc, char *argv[])
{
    double t1, t2;

    /* variables for task 1 */
    unsigned int size = 1024;
    unsigned int N = 100;
    unsigned int i;

    /* declare variables; examples, adjust for task */
    int *A;
    int *B;

    /* parameter parsing task 1 */
    for(i=1; i<(unsigned)argc; i++) {
        if (strcmp(argv[i], "--no_iterations") == 0) {
            i++;
            if (i < argc)
                sscanf(argv[i], "%u", &N);
            else
                usage();
        } else if (strcmp(argv[i], "--array_size") == 0) {
            i++;
            if (i < argc)
                sscanf(argv[i], "%u", &size);
            else
                usage();
        } else usage();
    }

    /* allocate memory for arrays; examples, adjust for task */
    A = malloc (size*size * sizeof (int));
    B = malloc (size*size * sizeof (int));

    /* initialise array elements */

    t1 = getTime();
    /* code to be measured goes here */
    t2 = getTime();

    /* output; examples, adjust for task */
    printf("time: %6.2f secs\n", t2 - t1);

    /* free memory; examples, adjust for task */
    free(B);
    free(A);

    return 0;
}
My questions are:
What could the purpose of the usage method be?
What is the parameter-parsing part supposed to be doing? As far as I can tell it will just always lead to usage() and won't actually read any parameters with the sscanf lines.
In this assignment we're meant to record array sizes in KB or MB. I know that malloc allocates in bytes, and with a size value of 1024 that would result in 1MB * sizeof(int) (I think, at least). In this case, would the array size I should record be 1MB or 1MB * sizeof(int)?
If parameter passing worked properly and we passed parameters to change the size variable value would the array size always be the size variable squared? Or would the array size be considered to be just the size variable? It seems very unintuitive to malloc size*size instead of just size unless there's something I'm missing about all this.
My understanding of measuring the throughput is that I should just multiply the array size by the number of iterations and then divide by the time taken. Can I get any confirmation that this is right?
These are the only hurdles in my understanding of this assignment. Any help would be much appreciated.
What could the purpose of the usage method be?
The usage function tells you what arguments are supposed to be passed to the program on the command-line.
What is the parameter passing part supposed to be doing because as far as I can tell it will just always lead to usage() and won't take any parameters with the sscanf lines?
It leads to calling the usage() function when an invalid argument is passed to the program.
Otherwise, it sets the variable N (the number of iterations) to the value of the --no_iterations argument (default 100), and the variable size (the array size) to the value of the --array_size argument (default 1024).
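For example, a hypothetical invocation (assuming the executable carries the name used in the usage string) would look like:

./bandwith --no_iterations 200 --array_size 2048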
In this assignment we're meant to record array sizes in KB or MB, and I know that malloc allocates size in bytes and with a size variable value of 1024 would result in 1MB * sizeof(int) (I think at least). In this case would the array size I should record be 1MB or 1MB * sizeof(int)?
If your size is supposed to be 1 MB, then that is probably what the size should be.
If you want to make sure the size is a multiple of the size of the data type, then you can do:
if (size % sizeof(int) != 0)
{
    size = ((int)(size / sizeof(int))) * sizeof(int);
}
If parameter passing worked properly and we passed parameters to change the size variable value would the array size always be the size variable squared? Or would the array size be considered to be just the size variable? It seems very unintuitive to malloc size*size instead of just size unless there's something I'm missing about all this.
You probably just want to allocate size bytes. Unless you are supposed to be working with matrices, rather than just arrays. In that case, it would be size * size bytes.
My understanding of measuring the throughput is that I should just multiply the array size by the number of iterations and then divide by the time taken. Can I get any confirmation that this is right?
I would say so: the total data copied is the array size in bytes multiplied by the number of repetitions; divide that by the elapsed time and convert to MBytes/sec.
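As a sketch of that calculation, to go where the template prints the time (this assumes size is the array size in bytes and N is the number of repetitions, as in the template):

double seconds = t2 - t1;
double mbytes  = ((double)size * N) / (1024.0 * 1024.0);  /* total MB copied */
printf("throughput: %6.2f MB/s\n", mbytes / seconds);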