Decipher assignment about measuring throughput of L2 cache - c

I've noticed that a few of my classmates have actually tried asking questions about this same assignment on StackOverflow over the past few days so I'm going to shamelessly copy paste (only) the context of one question that was deleted (but still cached on Google with no answers) to save time. I apologize in advance for that.
Context
I am trying to write a C program that measures the data throughput (MBytes/sec) of the L2 cache of my system. To perform the measurement I have to write a program that copies an array A to an array B, repeated multiple times, and measure the throughput.
Consider at least two scenarios:
Both arrays fit in the L2 cache
The array size is significantly larger than the L2 cache size.
Use memcpy() from string.h to copy the arrays, initialize both arrays with some values (e.g. random numbers using rand()), and repeat at least 100 times, otherwise you will not see a difference.
The array size and number of repeats should be input parameters. One of the array sizes should be half of my L2 cache size.
Question
So based on that context of the assignment I have a good idea of what I need to do because it pretty much tells me straight out. The problem is that we were given some template code to work with and I'm having trouble deciphering parts of it. I would really appreciate it if someone would help me to just figure out what is going on.
The code is:
/* do not add other includes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <string.h>
double getTime(){
struct timeval t;
double sec, msec;
while (gettimeofday(&t, NULL) != 0);
sec = t.tv_sec;
msec = t.tv_usec;
sec = sec + msec/1000000.0;
return sec;
}
/* for task 1 only */
void usage(void)
{
fprintf(stderr, "bandwith [--no_iterations iterations] [--array_size size]\n");
exit(1);
}
int main (int argc, char *argv[])
{
double t1, t2;
/* variables for task 1 */
unsigned int size = 1024;
unsigned int N = 100;
unsigned int i;
/* declare variables; examples, adjust for task */
int *A;
int *B;
/* parameter parsing task 1 */
for(i=1; i<(unsigned)argc; i++) {
if (strcmp(argv[i], "--no_iterations") == 0) {
i++;
if (i < argc)
sscanf(argv[i], "%u", &N);
else
usage();
} else if (strcmp(argv[i], "--array_size") == 0) {
i++;
if (i < argc)
sscanf(argv[i], "%u", &size);
else
usage();
} else usage();
}
/* allocate memory for arrays; examples, adjust for task */
A = malloc (size*size * sizeof (int));
B = malloc (size*size * sizeof (int));
/* initialise arrray elements */
t1 = getTime();
/* code to be measured goes here */
t2 = getTime();
/* output; examples, adjust for task */
printf("time: %6.2f secs\n",t2 - t1);
/* free memory; examples, adjust for task */
free(B);
free(A);
return 0;
}
My questions are:
What could the purpose of the usage method be?
What is the parameter passing part supposed to be doing because as far as I can tell it will just always lead to usage() and won't take any parameters with the sscanf lines?
In this assignment we're meant to record array sizes in KB or MB, and I know that malloc allocates size in bytes and with a size variable value of 1024 would result in 1MB * sizeof(int) (I think at least). In this case would the array size I should record be 1MB or 1MB * sizeof(int)?
If parameter passing worked properly and we passed parameters to change the size variable value would the array size always be the size variable squared? Or would the array size be considered to be just the size variable? It seems very unintuitive to malloc size*size instead of just size unless there's something I'm missing about all this.
My understanding of measuring the throughput is that I should just multiply the array size by the number of iterations and then divide by the time taken. Can I get any confirmation that this is right?
These are the only hurdles in my understanding of this assignment. Any help would be much appreciated.

What could the purpose of the usage method be?
The usage function tells you what arguments are supposed to be passed to the program on the command-line.
What is the parameter passing part supposed to be doing because as far as I can tell it will just always lead to usage() and won't take any parameters with the sscanf lines?
It calls the usage() function when an invalid argument is passed to the program.
Otherwise, it stores the value of the --no_iterations argument in the variable N (default value of 100), and it stores the value of the --array_size argument in the variable size (default value of 1024).
In this assignment we're meant to record array sizes in KB or MB, and I know that malloc allocates size in bytes and with a size variable value of 1024 would result in 1MB * sizeof(int) (I think at least). In this case would the array size I should record be 1MB or 1MB * sizeof(int)?
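If your size is supposed to be 1 MB, then that is probably what the size should be. Record the number of bytes you actually allocate and copy: the template's malloc(size*size * sizeof(int)) with size = 1024 allocates 1024 * 1024 * sizeof(int) bytes, which is 4 MB when int is 4 bytes.
If you want to make sure the size is a multiple of the size of the data type, then you can do: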
if (size % sizeof(int) != 0)
{
size = ((int)(size / sizeof(int))) * sizeof(int);
}
If parameter passing worked properly and we passed parameters to change the size variable value would the array size always be the size variable squared? Or would the array size be considered to be just the size variable? It seems very unintuitive to malloc size*size instead of just size unless there's something I'm missing about all this.
You probably just want to allocate size bytes, unless you are supposed to be working with matrices rather than plain arrays. In that case, it would be size * size bytes.
My understanding of measuring the throughput is that I should just multiply the array size by the number of iterations and then divide by the time taken. Can I get any confirmation that this is right?
I guess so.

Related

C function with array pointer

I'm trying to write a function that when given an array and a value, it checks if the value is in that array. If it is there then keep finding a new unique random value before adding it to the array. This is what I have done so far but I think the problem is my lack of understanding of pointers. Here is what I have so far:
#include <stdio.h>
#include <stdlib.h>
int getNewIndex(int index, int *visitedPixels, int *visitedPixelsIndex);
int main() {
int *visitedPixels = malloc(2 * sizeof(int));
int *visitedPixelsIndex = 0;
srand(1);
int randIndex = rand() % 16, i;
printf("Initial randIndex = %d\n", randIndex);
for(i = 0; i < 16; i++) {
randIndex = getNewIndex(randIndex, visitedPixels, visitedPixelsIndex);
printf("randIndex[%d] = %d\n", i, visitedPixels[i]);
}
return 0;
}
int getNewIndex(int index, int *visitedPixels, int *visitedPixelsIndex) {
int i = 0;
while (i < *visitedPixelsIndex) {
(index == visitedPixels[i]) ? index = rand() % 16, i = 0 : i++;
}
visitedPixels[*visitedPixelsIndex] = index;
(*visitedPixelsIndex)++;
//(*visitedPixels) = realloc(visitedPixels, (*visitedPixelsIndex+1) * sizeof(int));
return index;
}
Any help would be appreciated.
Okay, so. I'm going to try to explain with a metaphor. Hopefully it helps rather than confusing more.
Imagine memory is a long board you can write numbers on. It takes an inch of board to write a small number. Bigger numbers can be represented by writing across more slots.
An array, in our metaphor, is just a contiguous length of board you can write stuff into. If you want an array of 5 integers, and each integer takes 4 inches, you'll need 20 inches of board for it. If you wanted to pass all these integers to a function, instead of copying them all across, you would instead write down how many inches from the end of the board your array is. That's what a pointer is. It's a number telling where something is.
When you called malloc( 2 * sizeof( int ) ), you requested a segment of the board big enough for two integers, and you received how many inches from the end of the board that new segment is. So we've got 8 inches of board X inches from the end, with X being our pointer.
Incrementing a pointer says "increase this value to point at the next element of the underlying array". An int* will increase by 4 (assuming 4-byte ints), a pointer to a structure by the size of the structure, which includes any alignment padding the compiler has decided for it.
It does not increase the amount of storage.
If I have a pointer to 8 inches of board, write a 4 inch number, increment the pointer to point 4 inches more in, write another 4 inch number and increment again, my pointer is now right after the last element of the array. If I write here, all bets are off. What was on the board after the array? Who knows. It could be anything. Maybe it was a different array. Maybe it was information for keeping track of what parts of the board have been handed out to the program. Maybe it was the end of my board and I'll write off the end. Writing to memory you haven't received permission for from the operating system is where "segmentation violation" (SIGSEGV) program failures come from.
You need to request more space up front, or bigger arrays as you need them. There's also a realloc that will do this too. And for all of them, you have to check if the call failed and terminate or otherwise recover appropriately.
Hopefully this is more helpful than confusing. Good luck :)

Can I avoid a loop for writing the same value in a continuous subset of an array?

I have a program where I repeat a succession of methods to reproduce time evolution. One of the things I have to do is to write the same value to a long contiguous subset of elements of a very large array. Knowing which elements they are and which value I want, is there any other way rather than doing a loop for setting these values one by one?
EDIT: To be clear, I want to avoid this:
double arr[10000000];
int i;
for (i=0; i<100000; ++i)
arr[i] = 1;
with just one single call, if that is possible. Can you assign to a part of an array the values from another array of the same size? Maybe I could have in memory a second array arr2[1000000] with all elements 1 and then do something like copying the memory of arr2 to the first 100,000 elements of arr?
I have a somewhat tongue-in-cheek and non-portable possibility for you to consider. If you tailored your buffer to a size that is a power of 2, you could seed the buffer with a single double, then use memcpy to copy successively larger chunks of the buffer until the buffer is full.
So first you copy the first 8 bytes over the next 8 bytes...(so now you have 2 doubles)
...then you copy the first 16 bytes over the next 16 bytes...(so now you have 4 doubles)
...then you copy the first 32 bytes over the next 32 bytes...(so now you have 8 doubles)
...and so on.
It's plain to see that we won't actually call memcpy all that many times, and if the implementation of memcpy is sufficiently faster than a simple loop we'll see a benefit.
Try building and running this and tell me how it performs on your machine. It's a very scrappy proof of concept...
#include <string.h>
#include <time.h>
#include <stdio.h>
void loop_buffer_init(double* buffer, int buflen, double val)
{
for (int i = 0; i < buflen; i++)
{
buffer[i] = val;
}
}
void memcpy_buffer_init(double* buffer, int buflen, double val)
{
buffer[0] = val;
int half_buf_size = buflen * sizeof(double) / 2;
for (int i = sizeof(double); i <= half_buf_size; i += i)
{
memcpy((unsigned char *)buffer + i, buffer, i);
}
}
void check_success(double* buffer, int buflen, double expected_val)
{
for (int i = 0; i < buflen; i++)
{
if (buffer[i] != expected_val)
{
printf("But your whacky loop failed horribly.\n");
break;
}
}
}
int main()
{
/* enum constants, so that BUFFER_SIZE is a valid size for a static array in C */
enum { TEST_REPS = 500, BUFFER_SIZE = 16777216 };
static double buffer[BUFFER_SIZE]; // 2**24 doubles, 128MB
time_t start_time;
time(&start_time);
printf("Normal loop starting...\n");
for (int reps = 0; reps < TEST_REPS; reps++)
{
loop_buffer_init(buffer, BUFFER_SIZE, 1.0);
}
time_t end_time;
time(&end_time);
printf("Normal loop finishing after %.f seconds\n",
difftime(end_time, start_time));
time(&start_time);
printf("Whacky loop starting...\n");
for (int reps = 0; reps < TEST_REPS; reps++)
{
memcpy_buffer_init(buffer, BUFFER_SIZE, 2.5);
}
time(&end_time);
printf("Whacky loop finishing after %.f seconds\n",
difftime(end_time, start_time));
check_success(buffer, BUFFER_SIZE, 2.5);
}
On my machine, the results were:
Normal loop starting...
Normal loop finishing after 21 seconds
Whacky loop starting...
Whacky loop finishing after 9 seconds
To work with a buffer that was less than a perfect power of 2 in size, just go as far as you can with the increasing powers of 2 and then fill out the remainder in one final memcpy.
(Edit: before anyone mentions it, of course this is pointless with a static double (might as well initialize it at compile time) but it'll work just as well with a nice fresh stretch of memory requested at runtime.)
It looks like this solution is very sensitive to your cache size or other hardware optimizations. On my old (circa 2009) laptop the memcpy solution is as slow or slower than the simple loop, until the buffer size drops below 1MB. Below 1MB or so the memcpy solution returns to being twice as fast.
I have a program where I repeat a succession of methods to reproduce
time evolution. One of the things I have to do is to write the same
value to a long contiguous subset of elements of a very large array.
Knowing which elements they are and which value I want, is there any other
way rather than doing a loop for setting these values each by each?
In principle, you can initialize an array however you like without using a loop. If that array has static duration then that initialization might in fact be extremely efficient, as the initial value is stored in the executable image in one way or another.
Otherwise, you have a few options:
if the array elements are of a character type then you can use memset(). Very likely this involves a loop internally, but you won't have one literally in your own code.
if the representation of the value you want to set has all bytes equal, such as is the case for typical representations of 0 in any arithmetic type, then memset() is again a possibility.
as you suggested, if you have another array with suitable contents then you can copy some or all of it into the target array. For this you would use memcpy(), unless there is a chance that the source and destination could overlap, in which case you would want memmove().
more generally, you may be able to read in the data from some external source, such as a file (e.g. via fread()). Don't count on any I/O-based solution to be performant, however.
you can write an analog of memset() that is specific to the data type of the array. Such a function would likely need to use a loop of some form internally, but you could avoid such a loop in the caller.
you can write a macro that expands to the needed loop. This can be type-generic, so you don't need different versions for different data types. It uses a loop, but the loop would not appear literally in your source code at the point of use.
If you know in advance how many elements you want to set, then in principle, you could write that many assignment statements without looping. But I cannot imagine why you would want so badly to avoid looping that you would resort to this for a large number of elements.
All of those except the last actually do loop, however -- they just avoid cluttering your code with a loop construct at the point where you want to set the array elements. Some of them may also be clearer and more immediately understandable to human readers.

Bus Error in C for Loop

I have a toy cipher program which is encountering a bus error when given a very long key (I'm using 961168601842738797 to reproduce it), which perplexes me. When I commented out sections to isolate the error, I found it was being caused by this innocent-looking for loop in my Sieve of Eratosthenes.
unsigned long i;
int candidatePrimes[CANDIDATE_PRIMES];
// CANDIDATE_PRIMES is a macro which sets the length of the array to
// two less than the upper bound of the sieve. (2 being the first prime
// and the lower bound.)
for (i=0;i<CANDIDATE_PRIMES;i++)
{
printf("i: %d\n", i); // does not print; bus error occurs first
//candidatePrimes[i] = PRIME;
}
At times this has been a segmentation fault rather than a bus error.
Can anyone help me to understand what is happening and how I can fix it/avoid it in the future?
Thanks in advance!
PS
The full code is available here:
http://pastebin.com/GNEsg8eb
I would say your array is too large for your stack, leading to undefined behaviour.
Better to allocate the array dynamically:
int *candidatePrimes = malloc(CANDIDATE_PRIMES * sizeof(int));
And don't forget to free before returning.
If this is Eratosthenes Sieve, then the array is really just flags. It's wasteful to use int if it's just going to hold 0 or 1. At least use char (for speed), or condense to a bit array (for minimal storage).
The problem is that you're blowing the stack away.
unsigned long i;
int candidatePrimes[CANDIDATE_PRIMES];
If CANDIDATE_PRIMES is large, this moves the stack pointer by a massive amount. It doesn't touch the memory yet, it only adjusts the stack pointer.
for (i=0;i<CANDIDATE_PRIMES;i++)
{
This adjusts "i" which is way back in the good area of the stack, and sets it to zero. Checks that it's < CANDIDATE_PRIMES, which it is, and so performs the first iteration.
printf("i: %d\n", i); // does not print; bus error occurs first
This attempts to put the parameters for "printf" onto the bottom of the stack. BOOM. Invalid memory location.
What value does CANDIDATE_PRIMES have?
And do you actually want to store all the primes you're testing, or only those that pass? What is the purpose of storing the values 0 through CANDIDATE_PRIMES sequentially in an array?
If you just want to store the primes, you should use a dynamic allocation and grow it as needed.
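#include <stdio.h>
#include <stdlib.h>
size_t g_numSlots = 0;
size_t g_numPrimes = 0;
unsigned long* g_primes = NULL;
void addPrime(unsigned long prime) {
    unsigned long* newPrimes;
    if (g_numPrimes >= g_numSlots) {
        /* grow the array in blocks of 256 slots */
        g_numSlots += 256;
        newPrimes = realloc(g_primes, g_numSlots * sizeof(unsigned long));
        if (newPrimes == NULL) {
            /* stands in for the original die(gracefully) placeholder */
            fprintf(stderr, "out of memory\n");
            exit(EXIT_FAILURE);
        }
        g_primes = newPrimes;
    }
    g_primes[g_numPrimes++] = prime;
}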

Measuring cache size in C

I have a function as follows:
int doSomething(long numLoop,long arraySize){
int * buffer;
buffer = (int*) malloc (arraySize * sizeof(int));
long k;
int i;
for (i=0;i<arraySize;i++)
buffer[i]=2;//write to make sure memory is allocated
//start reading from cache
for(k=0;k<numLoop;k++){
int i;
int temp;
for (i=0;i<arraySize;i++)
temp = buffer[i];
}
}
What it does is allocate an array and read it from the beginning to the end. The purpose is to see the effect of the cache.
What I expect to see is: when I call doSomething(10000,1000), the arraySize is small, so the array is stored entirely in the cache. After that I call doSomething(100,100000), where the arraySize is bigger than the cache. As a result, the 2nd function call should take longer than the 1st one, because it involves some main memory accesses, as the whole array cannot be stored in the cache.
However, it seems that the 2nd operation takes approximately the same time as the 1st one. So what's wrong here? I tried to compile with -O0 and it doesn't solve the problem.
Thank you.
Update 1: this is the code with random access, and it seems to work. Access time with the large array is ~15s, while with the small array it is ~3s.
int doSomething(long numLoop,int a, long arraySize){
int * buffer;
buffer = (int*) malloc (arraySize * sizeof(int));
long k;
int i;
for (i=0;i<arraySize;i++)
buffer[i]=2;//write to make sure memory is allocated
//start reading from cache
for(k=0;k<numLoop;k++){
int temp;
for (i=0;i<arraySize;i++){
long randnum = rand();//max is 32767
randnum = (randnum <<16) | rand();
if (randnum < 0) randnum = -randnum;
randnum%=arraySize;
temp = buffer[randnum];
}
}
}
You are accessing the array in sequence,
for (i=0;i<arraySize;i++)
temp = buffer[i];
so the part you are accessing will almost always already be in the cache, since that pattern is trivial for the hardware to predict. To see a cache effect, you must access the array in a less predictable order, for example by generating (pseudo)random indices, so that you jump between the front and the back of the array.
In addition to the other answers: your code accesses the memory sequentially. Let's assume that the cache line is 32 bytes. With 4-byte ints, that means that you probably get a cache miss only on every 8th access. So when picking a random index, you should make it at least 32 bytes away from the previous one.
In order to measure the effect across multiple calls, you must use the same buffer (with the expectation that the first time through you are loading the cache, and the next time you are using it). In your case, you are allocating a new buffer for every call. (Additionally, you are never freeing your allocation.)

Strange behaviour of an elementary CUDA code.

I am having trouble understanding the output of the following simple CUDA code. All that the code does is allocate two integer arrays: one on the host and one on the device each of size 16. It then sets the device array elements to the integer value 3 and then copies these values into the host_array where all the elements are then printed out.
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
int num_elements = 16;
int num_bytes = num_elements * sizeof(int);
int *device_array = 0;
int *host_array = 0;
// malloc host memory
host_array = (int*)malloc(num_bytes);
// cudaMalloc device memory
cudaMalloc((void**)&device_array, num_bytes);
// Constant out the device array with cudaMemset
cudaMemset(device_array, 3, num_bytes);
// copy the contents of the device array to the host
cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
// print out the result element by element
for(int i = 0; i < num_elements; ++i)
printf("%i\n", *(host_array+i));
// use free to deallocate the host array
free(host_array);
// use cudaFree to deallocate the device array
cudaFree(device_array);
return 0;
}
The output of this program is 50529027 printed line by line 16 times.
50529027
50529027
50529027
..
..
..
50529027
50529027
Where did this number come from? When I replace 3 with 0 in the cudaMemset call then I get correct behaviour, i.e. 0 printed line by line 16 times.
I compiled the code with nvcc test.cu on Ubuntu 10.10 with CUDA 4.0
I'm no cuda expert but 50529027 is 0x03030303 in hex. This means cudaMemset sets each byte in the array to 3 and not each int. This is not surprising given the signature of cuda memset (to pass in the number of bytes to set) and the general semantics of memset operations.
Edit: As to your (I guess) implicit question of how to achieve what you intended I think you have to write a loop and initialize each array element.
As others have pointed out, cudaMemset works like the standard C memset: it sets byte values. From the CUDA documentation:
cudaError_t cudaMemset( void * devPtr, int value, size_t count)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
If you want to set word size values, the best solution is to use your own memset kernel, perhaps something like this:
template<typename T>
__global__ void myMemset(T * x, T value, size_t count )
{
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    size_t stride = blockDim.x * gridDim.x;
    // size_t index, so it cannot overflow for large allocations
    for(size_t i=tid; i<count; i+=stride) {
        x[i] = value;
    }
}
which could be launched with enough blocks to cover the number of multiprocessors (MPs) in your GPU, and each thread will do as many iterations as required to fill the memory allocation. Writes will be coalesced, so performance shouldn't be too bad. This could also be adapted to CUDA's vector types, if you so desired.
memset sets bytes, and an integer is 4 bytes, so what you get is 50529027 decimal, which is 0x03030303 in hex. In other words, you are using it wrong, and it has nothing to do with CUDA.
This is a classic memset shortcoming: it only works element by element on data types that are one byte wide, i.e. char. It writes the value 3 into every byte of the memory region. You can confirm this with a simple C++ program:
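#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main ()
{
    int x=16;
    size_t bytes = x*sizeof(int);
    int *M = (int*)malloc(bytes);
    memset(M,3,bytes); // sets every byte, not every int, to 3
    for (int i = 0; i < x; ++i) {
        printf("%d\n", M[i]);
    }
    free(M);
    return 0;
}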
memset only gives the expected result for wider data types when every byte of the desired value is the same, and the usual case is 0 (it sets every byte to 0 and hence every element to 0). If you change the data type to char, you'll see the desired output. cudaMemset behaves like memset, the only difference being that it takes a GPU pointer as input.
So memset or cudaMemset sets every byte of the memory region whose length is given by the third argument to the given value (in your case 3), regardless of the data type.
Tip:
Google: 50529027 in binary and you'll get the answer :)

Resources