for-loop optimization using pointer - c

I am trying to optimize code to run in under 7 seconds. I had it down to 8, and now I am trying to use pointers to speed up the code. But gcc gives an error when I try to compile:
.c:29: warning: assignment from incompatible pointer type
.c:29: warning: comparison of distinct pointer types lacks a cast
Here is what I had before trying to use pointers:
#include <stdio.h>
#include <stdlib.h>
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main (void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
double sum1 = 0;
for (i = 0; i < N_TIMES; i++) {
int j;
for (j = 0; j < ARRAY_SIZE; j += 20) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
}
}
sum += sum1;
return 0;
}
Here is what I have when I use pointers (this code generates the error):
int *j;
for (j = array; j < &array[ARRAY_SIZE]; j += 20) {
sum += *j + *(j+1) + *(j+2) + *(j+3) + *(j+4) + *(j+5) + *(j+6) + *(j+7) + *(j+8) + *(j+9);
sum1 += *(j+10) + *(j+11) + *(j+12) + *(j+13) + *(j+14) + *(j+15) + *(j+16) + *(j+17) + *(j+18) + *(j+19);
}
How do I fix this error? By the way, I don't want suggestions on alternative ways to optimize the code. This is a homework problem with constraints on what I'm allowed to do. I think once I get this pointer issue fixed it will run in under 7 seconds and I'll be good to go.

comparison of distinct pointer types lacks a cast
This means that you tried to compare a pointer of one type to a pointer of another type, and did so without a cast.
double *array = calloc(ARRAY_SIZE, sizeof(double));
int *j;
Pointers to double and pointers to int are not directly comparable, so you aren't allowed to compare j to array. Perhaps you meant to declare j as a pointer to double?

C is a statically typed language, and comparisons across pointer types will give you errors. There is some implicit casting in certain cases, like if you compare a double to an int, because comparing numbers is a common operation. Comparing pointers of different types isn't.
Further, when you increment a pointer over an array, the compiler uses the size of its pointed-to type to know how far to move in memory. Stepping an int pointer over an array of doubles will lead to issues.
A double pointer moves farther per increment than an int pointer, so an int pointer would take more iterations to cover the same array, reading garbage along the way.
You could explicitly cast things, but really you should be using a double * for an array of doubles.

I'd be greatly surprised if moving from an array representation to a pointer representation yielded much (if any) speedup, as both become memory addresses (and memory offsets) in the final output code. Remember, the array representation is really just a pointer representation in different clothing.
Instead, I'd look towards one of two techniques:
SIMD intrinsics (e.g. SSE2), to do multiple additions within the same register in the same instruction. Then you need one operation near the end to combine the high double with the low double.
Scatter/gather algorithms to spread the addition operation across multiple cores (nearly every CPU these days has 4 cores available, if not 16 pseudo-cores a la hyper-threading).
Beyond that, you can do a few attempts at cache analysis, and at storing intermediates in different registers. There seems to be a deep chain of additions in each of your computations. Breaking them up might yield the ability to spread the on-cpu storage across more registers.
Most such operations become memory bound. 20 is a really strange stride for loop unrolling. A double is typically 8 bytes, so 20 doubles is 160 bytes, which is unlikely to be a multiple of your cache line size (commonly 64 bytes). Try making sure that your unrolled loop's stride aligns cleanly with your architecture's level 1 cache line, and you might avoid extra misses as you read across line boundaries. Doing so will speed up your program by some amount (but who knows how much).

"When you increment a pointer over an array, it uses the size of its pointed-to type to know how far in memory to move. Moving with an int over an array of doubles will lead to issues."
To silence the warning you could cast:
for (j = (int *)array; j < (int *)&array[ARRAY_SIZE]; j += 20)
but note that this only hides the problem: j would then step in int-sized units and read ints out of double storage. The correct fix is to declare j as a double *.


C malloc segmentation fault using a 1-dimensional array

I use malloc to create an array in C, but I get a segmentation fault when I try to assign random values to the array in 2 nested loops.
There is no segmentation fault when I assign values to this array in 1 loop. The array size is large. Please see the code I attached.
Can anyone give me a hint about what is going on here? I am pretty new to C. Thanks a lot in advance.
int n=50000;
float *x = malloc(n*n*sizeof(float));
// there is segmentation fault:
int i, j;
for (i=0; i<n; i++){
for (j=0; j<n; j++){
x[i*n+j] = random() / (float)RAND_MAX;
}
}
// there is no segmentation fault:
int ii;
for (ii=0; ii<n*n; ii++){
x[ii] = random() / (float)RAND_MAX;
}
int overflow.
50000 * 50000 --> 2,500,000,000 --> more than INT_MAX --> undefined behavior (UB).
First, let us make certain the size calculation for this allocation cannot overflow:
assert(SIZE_MAX/n/n/sizeof(float) >= 1);
Then, having verified size_t is wide enough, use size_t math to do the multiplication and use size_t math for the array index calculation. Rather than int*int*size_t, do size_t*int*int.
// float *x = malloc(n*n*sizeof(float));
// Uses at least `size_t` math by leading the multiplication with that type.
float *x = malloc(sizeof(float) * n*n);
// or better
float *x = malloc(sizeof *x * n*n);
for (i=0; i<n; i++){
for (j=0; j<n; j++){
x[(size_t)n*i + j] = random() / (float)RAND_MAX;
}
}
The 2nd loop did not "fail" because n*n is not the large value expected; it is likely the same wrapped (UB) value that was used in the allocation.
First off, you're invoking undefined behavior due to signed integer overflow. Assuming an int is 32-bit, the value of 50000*50000 is outside the range of an int, causing the overflow.
You can fix this by putting sizeof(float) first in the expression. The result of sizeof is a size_t which is unsigned and at least as large as an int. Then when each n is multiplied, it is first converted to size_t thus avoiding overflow.
float *x = malloc(sizeof(float)*n*n);
However, even if you fix this you're asking for too much memory.
Assuming sizeof(float) is 4 bytes, n*n*sizeof(float) is about 10GB of memory. If you check the return value of malloc, you'll probably see that it returns NULL.
You'll need to make your array much smaller. Try n=1000 instead, which will only use about 4MB.
I believe the issue is related to integer overflow:
50,000 * 50,000 = 2.5 Billion
2^31 ~ 2.1 Billion
Thus, you are invoking undefined behavior when calculating the array index. As to why it works for one but not the other, that's just the way it is. Undefined behavior means the compiler (and computer) can do whatever it wants including doing what you expect and crashing the program.
To fix, change the types of i, j, n, and ii to long long from int. That should solve the overflow issue and the segmentation fault.
Edit:
You should also check that malloc returns a valid pointer before you perform operations on the pointer. If malloc fails, you will receive a null pointer.

sse precision error with Matrix multiplication

My program does NxN matrix multiplication where the elements of both matrices are initialized to the values (0, 1, 2, ... N) using a for loop. Both matrices' elements are of type float. There is no memory allocation problem. Matrix sizes are input as a multiple of 4, e.g. 4x4 or 8x8. The answers are verified with a sequential calculation. Everything works fine up to a matrix size of 64x64. A difference between the sequential version and the SSE version is observed only when the matrix size exceeds 64 (e.g. 68x68).
SSE snippet is as shown (size = 68):
void matrix_mult_sse(int size, float *mat1_in, float *mat2_in, float *ans_out) {
__m128 a_line, b_line, r_line;
int i, j, k;
for (k = 0; k < size * size; k += size) {
for (i = 0; i < size; i += 4) {
j = 0;
b_line = _mm_load_ps(&mat2_in[i]);
a_line = _mm_set1_ps(mat1_in[j + k]);
r_line = _mm_mul_ps(a_line, b_line);
for (j = 1; j < size; j++) {
b_line = _mm_load_ps(&mat2_in[j * size + i]);
a_line = _mm_set1_ps(mat1_in[j + k]);
r_line = _mm_add_ps(_mm_mul_ps(a_line, b_line), r_line);
}
_mm_store_ps(&ans_out[i + k], r_line);
}
}
}
With this, the answer differs at element 3673 where I get the answers of multiplication as follows
scalar: 576030144.000000 & SSE: 576030208.000000
I also wrote a similar program in Java with the same initialization and setup and N = 68 and for element 3673, I got the answer as 576030210.000000
Now there are three different answers and I'm not sure how to proceed. Why does this difference occur and how do we eliminate this?
I am summarizing the discussion in order to close this question as answered.
So according to the linked article (What Every Computer Scientist Should Know About Floating-Point Arithmetic), floating point almost always incurs rounding error, a direct consequence of the approximate representation of floating-point numbers.
Arithmetic operations such as addition and subtraction compound that precision error. Hence, only about the 6-7 most significant decimal digits of a float result (irrespective of where the decimal point is situated) can be considered accurate, while the remaining digits may differ depending on the order of operations.

optimizing a line of C code for 8 bit processor

I'm working on a 8bit processor and have written code in a C compiler, now more than 140 lines of code are taking just 1200 bytes and this single line is taking more than 200 bytes of ROM space. eeprom_read() is a function, there should be a problem with this 1000 and 100 and 10 multiplication.
romAddr = eeprom_read(146)*1000 + eeprom_read(147)*100 +
eeprom_read(148)*10 + eeprom_read(149);
Processor is 8-bit and data type of romAddr is int. Is there any way to write this line in a more optimized way?
It's possible that the thing that uses the most space is the use of multiplication. If your processor lacks an instruction to do multiplication, the compiler is forced to use software to do it step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to somehow try to reduce inlining, so the code to multiply by 10 (which is used in all four terms) can be re-used.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.
Sometimes the multiplication can be compiled into a sequence of additions and shifts. You can optimize it using the left shift operator:
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing (note the parentheses: in C, + binds tighter than <<):
(A<<9) + (A<<8) + (A<<7) + (A<<6) + (A<<5) + (A<<3)
This is still far longer than a single "multiply" instruction, but your processor apparently doesn't have one anyway, so this might be the next best thing.
You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be
load integer constant into register (6 bytes)
push register (2 bytes)
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes),
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes).
Let's see: 6+2+6+4+6+4+6+4+6+2+6 = 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;
romAddr = 0;
for (i = 146; i < 150; i++){
romAddr += powerOf10[i-146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get handy with the assembler language that the compiler generates.
It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146)*10 + eeprom_read(147))*10 +
eeprom_read(148))*10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.

CUDA warning "floating-point value does not fit in required integral type" - Why?

I wrote a code for multiplying 2 vectors of length "N" elements, and returning the product vector of the same length in CUDA 5.0. Here is my code
I vary the value of "N" just to see how the GPU fares compared to the CPU. I am able to go up to 2000000000 elements. However, when I go to 3000000000 I get these warnings:
vecmul.cu(52): warning: floating-point value does not fit in required integral type
vecmul.cu(52): warning: floating-point value does not fit in required integral type
vecmul.cu: In function `_Z6vecmulPiS_S_':
vecmul.cu:15: warning: comparison is always false due to limited range of data type
vecmul.cu: In function `int main()':
vecmul.cu:40: warning: comparison is always true due to limited range of data type
And here is my code
// Summing 2 Arrays
#include<stdio.h>
#include <fstream>
#define N (3000000000)
//const int threadsPerBlock = 256;
// Declare add function for Device
__global__ void vecmul(int *a,int *b,int *c)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid >= N) {return;} // (LINE 15)
c[tid] = a[tid] * b[tid];
}
int main(void)
{
// Allocate Memory on Host
int *a_h = new int[N];
int *b_h = new int[N];
int *c_h = new int[N];
// Allocate Memory on GPU
int *a_d;
int *b_d;
int *c_d;
cudaMalloc((void**)&a_d,N*sizeof(int));
cudaMalloc((void**)&b_d,N*sizeof(int));
cudaMalloc((void**)&c_d,N*sizeof(int));
//Initialize Host Array
for (int i=0;i<N;i++) // (LINE 40)
{
a_h[i] = i;
b_h[i] = (i+1);
}
// Copy Data from Host to Device
cudaMemcpy(a_d,a_h,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(b_d,b_h,N*sizeof(int),cudaMemcpyHostToDevice);
// Run Kernel
int blocks = int(N - 0.5)/256 + 1; // (LINE 52)
vecmul<<<blocks,256>>>(a_d,b_d,c_d);
// Copy Data from Device to Host
cudaMemcpy(c_h,c_d,N*sizeof(int),cudaMemcpyDeviceToHost);
// Free Device Memory
cudaFree(a_d);
cudaFree(b_d);
cudaFree(c_d);
// Free Memory from Host
free(a_h);
free(b_h);
free(c_h);
return 0;
}
Is this something because of the number of blocks is not sufficient for this array size?
Any suggestions would be welcome since I am a beginner in CUDA.
I am running this on a NVIDIA Quadro 2000.
The warnings are caused by overflowing a 32-bit signed int. 2147483647 is the maximum 32-bit signed int, so 3000000000 does not fit; an int expression can never reach N, which is why your boolean tests are always true or always false, exactly as the warnings say.
The other problem is around
int blocks = int(N - 0.5)/256 + 1; // (LINE 52)
trying to turn N into a floating point and then turn it back into an int. The value in the floating point number is too big -- again because you've overflowed a 32-bit int.
I think if you can remove the int(), it will work since once you divide by 256, you will be small enough, but you're forcing it to int before the division, so it's too big causing the error. It's not the assignment into blocks that's the problem, it's the explicit conversion to int.
edit: Wondering if now that we've fixed some of the computation problems with N and floating point vs int that you're seeing issues with the overflow. For example:
for (int i=0;i<N;i++) // (LINE 40)
{
a_h[i] = i;
b_h[i] = (i+1);
}
When N is over 2^31-1, the test i<N will always be true (at least until i itself overflows, which is undefined behavior). This should make the loop either infinite, or run about 2^31-1 iterations and then exit. The compiler says the comparison will ALWAYS be true; if that's the case, the loop should never end.
Also, I don't know what a size_t is in CUDA, but
cudaMemcpy(c_h,c_d,N*sizeof(int),cudaMemcpyDeviceToHost);
doing N*sizeof(int) is going way over 2^31 and even 2^32 when N=3B.
At some point you need to ask yourself why you are trying to allocate this much space and if there is a better approach.

Accessing elements in a static array using pointer(arithmetic) in C

If I have the following code in a function:
int A[5][5];
int i; int j;
for(i=0;i<5;i++){
for(j=0;j<5;j++){
A[i][j]=i+j;
printf("%d\n", A[i][j]);
}
}
This simply prints out the sum of each index. What I want to know is if it's possible to access each index in the static array in a similar fashion to dynamic array. So for example, if I wanted to access A[2][2], can I say:
*(A+(2*5+2)*sizeof(int))?
I want to perform some matrix operations on statically allocated matrices and I feel like the method used to dereference dynamic matrices would work the best for my purposes. Any ideas? Thank you.
That's the way to do it: A[i][j].
It prints out the sum of the indexes because, well, you set the element A[i][j] to the sum of the indexes: A[i][j] = i+j.
You can use:
*(*(A + 2) + 2)
for A[2][2]. Pointer arithmetic is done in units of the pointed-to type, not in units of char.
Of course, the preferred way is to write A[2][2] in your program.
The subscript operation a[i] is defined as *(a + i) - you compute an offset of i elements (not bytes) from a and then dereference the result. For a 2D array, you just apply that definition recursively:
a[i][j] == *(a[i] + j) == *(*(a + i) + j)
If the array is allocated contiguously, you could also flatten the indexing yourself: *((int *)a + i * cols + j), where cols is the length of a row (note that the multiplier is the number of columns, and a must first be converted to a pointer to the element type).
When doing pointer arithmetic, the size of the base type is taken into account. Given a pointer
T *p;
the expression p + 1 will evaluate to the address of the next object of type T, which is sizeof T bytes after p.
Note that using pointer arithmetic may not be any faster than using the subscript operator (code up both versions and run them through a profiler to be sure). It will definitely be less readable.
Pointer arithmetic can be tricky.
You are on the right track, however there are some differences between pointer and normal arithmetic.
For example consider this code
int I = 0;
float F = 0;
double D = 0;
int* PI = 0;
float* PF = 0;
double* PD = 0;
cout<<I<<" "<<F<<" "<<D<<" "<<PI<<" "<<PF<<" "<<PD<<endl;
I++;F++;D++;PI++;PF++;PD++;
cout<<I<<" "<<F<<" "<<D<<" "<<PI<<" "<<PF<<" "<<PD<<endl;
cout<<I<<" "<<F<<" "<<D<<" "<<(int)PI<<" "<<(int)PF<<" "<<(int)PD<<endl;
If you run it, the output you see would look something like this (depending on your architecture and compiler):
0 0 0 0 0 0
1 1 1 0x4 0x4 0x8
1 1 1 4 4 8
As you can see the pointer arithmetic is handled depending on the type of the variable it points to.
So keep in mind which type of variable you are accessing when working with pointer arithmetic.
Just for the sake of example consider this code too:
void* V = 0;
int* IV = (int*)V;
float* FV = (float*)V;
double* DV = (double*)V;
IV++;FV++;DV++;
cout<<IV<<" "<<FV<<" "<<DV<<endl;
You will get the output (again depending on your architecture and compiler)
0x4 0x4 0x8
Remember that the code snippets above are just for demonstration purposes. There are a lot of things NOT to use from here.
