What causes my array address to be corrupted (change) when passed to function? - c

I am performing Compressed Sparse Raw Matrix Vector multiplications (CSR SPMV): This involves dividing the array A into multiple chunks, then pass this chunk by reference to a function, however only the first part of the array (A[0] first chunk starting the beginning of the array) is modified. However starting from the second loop A[0 + chunkIndex], when the function reads the sub array it jumps and reads a different address beyond the total array address range, although the indices are correct.
For reference:
The SPMV kernel is:
void serial_matvec(size_t TS, double *A, int *JA, int *IA, double *X, double *Y)
{
double sum;
for (int i = 0; i < TS; ++i)
{
sum = 0.0;
for (int j = IA[i]; j < IA[i + 1]; ++j)
{
sum += A[j] * X[JA[j]]; // the error is here , the function reads diffrent
// address of A, and JA, so the access
// will be out-of-bound
}
Y[i] = sum;
}
}
and it is called this way:
int chunkIndex = 0;
for(size_t k = 0; k < rows/TS; ++k)
{
chunkIndex = IA[k * TS];
serial_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k*TS], &X[0], &Y[k*TS]);
}
assume I process (8x8) Matrix, and I process 2 rows per chunk, so the loop k will be rows/TS = 4 loops, the chunkIndex and array passed to the function will be as following:
chunkIndex: 0 --> loop k = 0, &A[0], &JA[0]
chunkIndex: --> loop k = 1, &A[16], &JA[16] //[ERROR here, function reads different address]
chunkIndex: --> loop k = 2, &A[32], &JA[32] //[ERROR here, function reads different address]
chunkIndex: --> loop k = 3, &A[48], &JA[48] //[ERROR here, function reads different address]
When I run the code, only the first chunk executes correctly, the other 3 chunks memory are corrupted and the array pointers jump into boundary beyond the array size.
I've checked all indices manually, of all the parameter, they are all correct, however when I print the addresses they are not the same. (debugging this for 3 days now)
I used valgrind and it reported:
Invalid read of size 8 and Use of uninitialised value of size 8 at the sum += A[j] * X[JA[j]]; line
I compiled it with -g -fsanitize=address and I got
heap-buffer-overflow
I tried to access these chunks manually outside the function, and they are correct, so what can cause the heap memory to be corrupted like this ?
The code is here, This is the minimum I can do.

The problem was that I was using global indices (indices inside main) when indexing the portion of the array (chunk) passed to the function, hence the out-of-bound problem.
The solution is to start indexing the sub-arrays from 0 at each function call, but I had another problem. At each function call, I process TS rows, each row has different number of non-zeros.
As an example, see the picture, chunk 1, sorry for my bad handwriting, it is easier this way. As you can see we will need 3 indices, one for the TS rows proceeded per chunk i , and the other because each row has different number of non-zeros j, and the third one to index the sub-array passed l, which was the original problem.
and the serial_matvec function will be as following:
void serial_matvec(size_t TS, const double *A, const int *JA, const int *IA,
const double *X, double *Y) {
int l = 0;
for (int i = 0; i < TS; ++i) {
for (int j = 0; j < (IA[i + 1] - IA[i]); ++j) {
Y[i] += A[l] * X[JA[l]];
l++;
}
}
}
The complete code with test is here If anyone has a more elegant solution, you are more than welcome.

Related

Trying to make a function which squares all values in an array. Getting strange number on the last value

int squaring_function (int *array, int i);
int main()
{
int array[5];
int i;
for(i=0; (i <= 5) ; i++)
{
array[i] = i;
printf("\nArray value %d is %d",i,array[i]);
}
for(i=0; (i <= 5) ; i++)
{
array[i] = (squaring_function(array, i));
printf("\nSquared array value %d is %d",i,array[i]);
}
return 0;
}
int squaring_function (int *array, int i)
{
return pow((array[i]),2);
}
I'm trying to use this squaring_function to square each value in turn in my array (containing integers 0 to 5). It seems to work however the last value (which should be 5)^2 is not coming up as 25. cmd window
I have tried reducing the array size to 5 (so the last value is 4) however this prints an incorrect number also.
I'm quite new to C and don't understand why this last value is failing.
I'm aware I could do this without a separate function however I'd quite like to learn why this isn't working.
Any help would be much appreciated.
Thanks,
Dan.
There are 2 bugs in your code. First is that you're accessing array out of bounds. The memory rule is that with n elements the indices must be smaller than n, hence < 5, not <= 5. And if you want to count up to 5, then you must declare
int array[6];
The other problem is that your code calculates pow(5, 2) as 24.99999999 which gets truncated to 24. The number 24 went to the memory location immediately after array overwriting i; which then lead to array[i] evaluating to array[24] which happened to be all zeroes.
Use array[i] * array[i] instead of pow to ensure that the calculation is done with integers.
The code
int array[5];
for(int i=0; (i <= 5) ; i++)
exceeds array bounds and introduces undefined behaviour. Note that 0..5 are actually 6 values, not 5. If you though see some "meaningful" output, well - good or bad luck - it's just the result of undefined behaviour, which can be everything (including sometimes meaningful values).
Your array isn't big enough to hold all the values.
An array of size 5 has indexes from 0 - 4. So array[5] is off the end of the array. Reading or writing past the end of an array invokes undefined behavior.
Increase the size of the array to 6 to fit the values you want.
int array[6];
The other answers show the flaws in the posted code.
If your goal is to square each element of an array, you can either write a function which square a value
void square(int *x)
{
*x *= *x;
}
and apply it to every element of an array or write a function which takes an entire array as an input and perform that transformation:
void square_array(int size, int arr[size])
{
for (int i = 0; i < size; ++i)
{
arr[i] *= arr[i];
}
}
// ... where given an array like
int nums[5] = {1, 2, 3, 4, 5};
// you can call it like this
square_array(5, nums); // -> {1, 4, 9, 16, 25}

C: Array not allocating more memory correctly

I'm fairly new to C and I'm working on a project. Given an integer array, I want to move all the zeros in it to the left of the array with the rest of the elements in any order to the right of all the zeros. The basic idea of my algorithm is to count the number of zeros in the array, create a new array with the number of zeros in it from the old array, and then "append" the rest of the non-zero integers onto this array. And then of course I print finished product.
int main(int argc, const char * argv[]) {
int a[10] = {3, 0, 1, 4, 0, 0, 7, 20, 1, 5};
int n = 10, count = 0;
// counts the number of 0's in the original array
for (int i = 0; i < n; ++i)
{
if (a[i] == 0)
{
++count;
}
}
// creates a new array and makes each element 0
int *array = NULL;
for (int j = 0; j < count; ++j)
{
array = realloc(array, (j + 1) * sizeof(int));
array[j] = 0;
}
// adds the nonzero elements of the array to the new array
for (int l = count; l < n; ++l)
{
array = realloc(array, l * sizeof(int)); // getting an error here
if (a[l] != 0)
{
array[l+count] = a[l];
}
}
// prints the array out in a nice format
printf("%s", "{");
for (int k = 0; k < n-1; ++k)
{
printf("%d%s", array[k], ",");
}
printf("%d", array[n-1]);
printf("%s", "}\n");
free(array);
return 0;
}
I'm getting a "Thread 1: EXC_BAD_ACCESS (code=1, address=0x40)" error when I run this code. I think it's got to do something with invalid pointers to the new array, but I'm not sure how to fix it.
array[l+count] = a[l];
This accesses the memory block that array points at beyond its allocated size. You have to do it differently, using a second index:
// adds the nonzero elements of the array to the new array
int l = count;
for (int j=0; j < n; ++j)
{
if (a[j] != 0)
{
array = realloc(array, (l+1) * sizeof(int));
array[l] = a[j];
++l;
}
}
About your algorithm:
I think u don't need to create a new array, just use a int tmp as swap area, and a int foundZeroCount as index, u swap 2 numbers at a time.
About memory allocation:
If u want to allocate memory for a fixed size array, just use malloc() to allocate array once, later when u need to extend the array, just call realloc() once.
About memory reset:
Just use memset(), and don't need a loop.
Suggestion - about c programming
Try improve your c basic, especially about array / pointer / memory, and try to know more functions from glibc.
Books like <The c programming language 2nd>, GNU c library document, and <The linux programming interface> would be useful, I guess.
The problem is with array[l+count] = a[l]; right when you are done allocating your 'zero-array' it's size is 3 and then you try to access (l + count)'th position which is 6.
And even if you have fixed those issues with memory it still wouldn't work because a[l] and further may still be zeros. (Your initial array is doesn't have zeroes in the beggining, remember?)
And there is a couple of suggestions:
use calloc() to build your initial array of zeros because as man states:
The calloc() function allocates memory for an array of nmemb elements
of size bytes each and returns a pointer to the allocated memory. The
memory is set to zero
First allocate then set because operations with memory are quite taxing for performance. It would be better for you to first allocate some memory and work with it instead of reallocating it each step. It would be much easier to keep track of as well.
Other answers address your immediate issue, that
array[l+count] = a[l];
attempts to access outside the bounds of the allocated space to which array points. I'll focus instead on your approach to the problem, which is flawed.
Dynamic memory allocation is comparatively expensive. You do not want to do any more than necessary, and it is particularly poor form to reallocate many times to increase by small increments each time, as you do.
Since you know at compile time how many elements you will need, dynamic allocation is altogether unnecessary here. You could instead do this:
int a[10] = {3, 0, 1, 4, 0, 0, 7, 20, 1, 5};
int array[10] = { 0 };
(Note also here that when an array initializer is provided, any array elements it does not explicitly initialize are initialized to 0.)
Even if you did not know at compile time how many elements you would need, it would be far better to perform the whole allocation in one chunk. Moreover, if you did that via calloc() then you would get automatic initialization of the allocated space to all-zeroes.
The count is known before defining the array. You can allocate memory using malloc as shown below.
array = malloc( count * sizeof(int)).
The error indicates you are trying to access address 0x40. This indicates one of the pointer has become NULL and you are trying to dereference ptr+0x40.
Before you start to hack away, take your time to consider the actual problem, which you have described as:
count the number of zeros in the array, create a new array with the number of zeros in it from the old array, and then "append" the rest of the non-zero integers onto this array.
Your comments say what the code should do, yet the code does something entirely different. Your algorithm to solve the problem is wrong - this, and nothing else, is the cause of the bugs.
To begin with, if the new array should contain all zeroes of the old array plus all non-zeroes, then common sense says that the new array will always have the same size as the old array. You don't even need to use dynamic memory allocation.
The code you have creates a new array and discards the old one, in the same memory location, over and over. This doesn't make any sense. On top of that, it is very ineffective to call realloc repeatedly in a loop.
You should do something like this instead:
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
int main(int argc, const char * argv[]) {
int a[10] = {3, 0, 1, 4, 0, 0, 7, 20, 1, 5};
const int n = 10;
// counts the number of 0's in the original array
int zeroes = 0;
for (int i = 0; i < n; ++i)
{
if (a[i] == 0)
{
++zeroes;
}
}
// creates a new array and makes each element 0
// there's actually no need to allocate this dynamically at all...
int *array = calloc(1, sizeof(int[n]) );
assert(array != NULL);
// skip zeroes, ie "move the zeroes from the original array"
int* not_zeroes = array + zeroes; // point at first item to contain "not zeroes"
// adds the non-zero elements of the original array to the new array
for (int i = 0; i < n; ++i)
{
if(a[i] != 0)
{
*not_zeroes = a[i];
not_zeroes++;
}
}
// prints the array out in a nice format
printf("%s", "{");
for (int i = 0; i < n; ++i)
{
printf("%d,", array[i]);
}
printf("}\n");
free(array);
return 0;
}

Arguments to Partition function for Quicksort

So I've been attempting to write a stack-based quicksort function, which calls a partition function. The header for partition() is as follows:
int partition(void **A, int n, void *pivot, int (cmp)(void *, void *));
where A is the array, n is the size of the array and pivot is the value of the pivot (not the index).
My current call to partition is:
partition(&A[low_int], high_int + 1, A[low_int+(high_int-low_int)/2], cmp)
Above, my values for low and high are the classic 'l' and 'h' used in iterative quicksort, where l begins as the lowest index (and h the highest). These values then change as the function continues.
I'll post my partition function below:
int
partition(void **A, int n, void *pivot, int (cmp)(void *, void *)) {
int k;
int i = 0;
for (int j = 0; j <= n-2; j++) {
if (cmp(A[j], pivot) <= 0) {
i++;
swap(A, i, j);
}
}
swap(A, i+1, n-1);
k = i + 2;
return k; //k is the first value after the pivot in partitioned A
}
My problem is deciding on the inputs for my call to partition(). For the first argument, I've chose &A[low_int], as I'm not using a "left" as one of my inputs and therefore am trying to create a pointer to essentially start my array later. The third argument if for pivot, where I've been trying to select an element within that range, but both this and argument 1 have been causing my code to either return an unsorted array or run infinitely.
Could I please get some help here with what I've done wrong and how to fix it?
I've tried to include all relevant code, but please let me know if I've missed anything important or if anything I've written is unclear, or you need more info. Thank you
Consider what happens if low_int is 1000 and high_int is 2000 and the array ends at 2000. Now you give it the array B = &A[1000] and the value 2001. The value of 2001 causes it to access the element B[2001-1] = B[2000] = A[3000]. It's accessing the array out of bounds.
Shouldn't you use something like high_int - low_int + 1 for the second argument? Note: I haven't verified that your code doesn't have off-by-one errors with the argument high_int - low_int + 1 but anyway it seems to me that you should be substracting low_int from high_int.
Another option would be to give it A, low_int and high_int.

Allocate contiguous memory

I'm trying to allocate a large space of contiguous memory in C and print this out to the user. My strategy for doing this is to create two pointers (one a pointer to double, one a pointer to pointer to double), malloc one of them to the entire size (m * n) in this case the pointer to pointer to double. Then malloc the second one to the size of m. The last step will be to iterate through the size of m and perform pointer arithmetic that would ensure the addresses of the doubles in the large array will be stored in contiguous memory. Here is my code. But when I print out the address it doesn't seem to be in contiguous (or in any sort of order). How do i print out the memory addresses of the doubles (all of them are of value 0.0) correctly?
/* correct solution, with correct formatting */
/*The total number of bytes allocated was: 4
0x7fd5e1c038c0 - 1
0x7fd5e1c038c8 - 2
0x7fd5e1c038d0 - 3
0x7fd5e1c038d8 - 4*/
double **dmatrix(size_t m, size_t n);
int main(int argc, char const *argv[])
{
int m,n,i;
double ** f;
m = n = 2;
i = 0;
f = dmatrix(sizeof(m), sizeof(n));
printf("%s %d\n", "The total number of bytes allocated was: ", m * n);
for (i=0;i<n*m;++i) {
printf("%p - %d\n ", &f[i], i + 1);
}
return 0;
}
double **dmatrix(size_t m, size_t n) {
double ** ptr1 = (double **)malloc(sizeof(double *) * m * n);
double * ptr2 = (double *)malloc(sizeof(double) * m);
int i;
for (i = 0; i < n; i++){
ptr1[i] = ptr2+m*i;
}
return ptr1;
}
Remember that memory is just memory. Sounds trite, but so many people seem to think of memory allocation and memory management in C as being some magic-voodoo. It isn't. At the end of the day you allocate whatever memory you need, and free it when you're done.
So start with the most basic question: If you had a need for 'n' double values, how would you allocate them?
double *d1d = calloc(n, sizeof(double));
// ... use d1d like an array (d1d[0] = 100.00, etc. ...
free(d1d);
Simple enough. Next question, in two parts, where the first part has nothing to do with memory allocation (yet):
How many double values are in a 2D array that is m*n in size?
How can we allocate enough memory to hold them all.
Answers:
There are m*n doubles in a m*n 2D-matrix of doubles
Allocate enough memory to hold (m*n) doubles.
Seems simple enough:
size_t m=10;
size_t n=20;
double *d2d = calloc(m*n, sizeof(double));
But how do we access the actual elements? A little math is in order. Knowing m and n, you can simple do this
size_t i = 3; // value you want in the major index (0..(m-1)).
size_t j = 4; // value you want in the minor index (0..(n-1)).
d2d[i*n+j] = 100.0;
Is there a simpler way to do this? In standard C, yes; in C++ no. Standard C supports a very handy capability that generates the proper code to declare dynamically-sized indexible arrays:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
Can't stress this enough: Standard C supports this, C++ does NOT. If you're using C++ you may want to write an object class to do this all for you anyway, so it won't be mentioned beyond that.
So what does the above actual do ? Well first, it should be obvious we are still allocating the same amount of memory we were allocating before. That is, m*n elements, each sizeof(double) large. But you're probably asking yourself,"What is with that variable declaration?" That needs a little explaining.
There is a clear and present difference between this:
double *ptrs[n]; // declares an array of `n` pointers to doubles.
and this:
double (*ptr)[n]; // declares a pointer to an array of `n` doubles.
The compiler is now aware of how wide each row is (n doubles in each row), so we can now reference elements in the array using two indexes:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
d2d[2][5] = 100.0; // does the 2*n+5 math for you.
free(d2d);
Can we extend this to 3D? Of course, the math starts looking a little weird, but it is still just offset calculations into a big'ol'block'o'ram. First the "do-your-own-math" way, indexing with [i,j,k]:
size_t l=10;
size_t m=20;
size_t n=30;
double *d3d = calloc(l*m*n, sizeof(double));
size_t i=3;
size_t j=4;
size_t k=5;
d3d[i*m*n + j*m + k] = 100.0;
free(d3d);
You need to stare at the math in that for a minute to really gel on how it computes where the double value in that big block of ram actually is. Using the above dimensions and desired indexes, the "raw" index is:
i*m*n = 3*20*30 = 1800
j*m = 4*20 = 80
k = 5 = 5
======================
i*m*n+j*m+k = 1885
So we're hitting the 1885'th element in that big linear block. Lets do another. what about [0,1,2]?
i*m*n = 0*20*30 = 0
j*m = 1*20 = 20
k = 2 = 2
======================
i*m*n+j*m+k = 22
I.e. the 22nd element in the linear array.
It should be obvious by now that so long as you stay within the self-prescribed bounds of your array, i:[0..(l-1)], j:[0..(m-1)], and k:[0..(n-1)] any valid index trio will locate a unique value in the linear array that no other valid trio will also locate.
Finally, we use the same array pointer declaration like we did before with a 2D array, but extend it to 3D:
size_t l=10;
size_t m=20;
size_t n=30;
double (*d3d)[m][n] = calloc(l, sizeof(*d3d));
d3d[3][4][5] = 100.0;
free(d3d);
Again, all this really does is the same math we were doing before by hand, but letting the compiler do it for us.
I realize is may be a bit much to wrap your head around, but it is important. If it is paramount you have contiguous memory matrices (like feeding a matrix to a graphics rendering library like OpenGL, etc), you can do it relatively painlessly using the above techniques.
Finally, you might wonder why would anyone do the whole pointer arrays to pointer arrays to pointer arrays to values thing in the first place if you can do it like this? A lot of reasons. Suppose you're replacing rows. swapping a pointer is easy; copying an entire row? expensive. Supposed you're replacing an entire table-dimension (m*n) in your 3D array (l*n*m), even more-so, swapping a pointer: easy; copying an entire m*n table? expensive. And the not-so-obvious answer. What if the rows widths need to be independent from row to row (i.e. row0 can be 5 elements, row1 can be 6 elements). A fixed l*m*n allocation simply doesn't work then.
Best of luck.
Never mind, I figured it out.
/* The total number of bytes allocated was: 8
0x7fb35ac038c0 - 1
0x7fb35ac038c8 - 2
0x7fb35ac038d0 - 3
0x7fb35ac038d8 - 4
0x7fb35ac038e0 - 5
0x7fb35ac038e8 - 6
0x7fb35ac038f0 - 7
0x7fb35ac038f8 - 8 */
double ***d3darr(size_t l, size_t m, size_t n);
int main(int argc, char const *argv[])
{
int m,n,l,i;
double *** f;
m = n = l = 10; i = 0;
f = d3darr(sizeof(l), sizeof(m), sizeof(n));
printf("%s %d\n", "The total number of bytes allocated was: ", m * n * l);
for (i=0;i<n*m*l;++i) {
printf("%p - %d\n ", &f[i], i + 1);
}
return 0;
}
double ***d3darr(size_t l, size_t m, size_t n){
double *** ptr1 = (double ***)malloc(sizeof(double **) * m * n * l);
double ** ptr2 = (double **)malloc(sizeof(double *) * m * n);
double * ptr3 = (double *)malloc(sizeof(double) * m);
int i, j;
for (i = 0; i < l; ++i) {
ptr1[i] = ptr2+m*n*i;
for (j = 0; j < l; ++j){
ptr2[i] = ptr3+j*n;
}
}
return ptr1;
}

Optimizing array transposing function

I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code, so that it runs faster, regardless of how messy it becomes. We're supposed to use stuff like exploiting cache blocks and loop unrolling.
Problem:
//transpose a dim x dim matrix into dist by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
int i, j;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
}
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
int i, j, id, jd;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
jd = 0;
for(j = 0; j < dim; j++, jd+=dim) {
dst[jd + i] = src[id + j];
}
}
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
int i, j, id;
int *pd, *ps;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
pd = dst + i;
ps = src + id;
for(j = 0; j < dim; j++) {
*pd = *ps++;
pd += dim;
}
}
}
Some ideas, please correct me if I'm wrong:
I have thought about loop unrolling but I dont think that would help, because we don't know if the NxN matrix has prime dimensions or not. If I checked for that, it would include excess calculations which would just slow down the function.
Cache blocks wouldn't be very useful, because no matter what, we will be accessing one array linearly (1,2,3,4) while the other we will be accessing in jumps of N. While we can get the function to abuse the cache and access the src block faster, it will still take a long time to place those into the dst matrix.
I have also tried using pointers instead of array accessors, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks
Cache blocking can be useful. For an example, lets say we have a cache line size of 64 bytes (which is what x86 uses these days). So for a large enough matrix such that it's larger than the cache size, then if we transpose a 16x16 block (since sizeof(int) == 4, thus 16 ints fit in a cache line, assuming the matrix is aligned on a cacheline bounday) we need to load 32 (16 from the source matrix, 16 from the destination matrix before we can dirty them) cache lines from memory and store another 16 lines (even though the stores are not sequential). In contrast, without cache blocking transposing the equivalent 16*16 elements requires us to load 16 cache lines from the source matrix, but 16*16=256 cache lines to be loaded and then stored for the destination matrix.
Unrolling is useful for large matrixes.
You'll need some code to deal with excess elements if the matrix size isn't a multiple of the times you unroll. But this will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses - it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient that increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you could somehow tell the compiler that they can't overlap, it may be great help. However, I'm not sure how to do that (maybe use the restrict qualifier).
Messyness is not a problem, so: I would add a transposed flag to each matrix. This flag indicates, whether the stored data array of a matrix is to be interpreted in normal or transposed order.
All matrix operations should receive these new flags in addition to each matrix parameter. Inside each operation implement the code for all possible combinations of flags. Perhaps macros can save redundant writing here.
In this new implementation, the matrix transposition just toggles the flag: The space and time needed for the transpose operation is constant.
Just an idea how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
int i, j;
const int dim1 = (dim / 4) * 4;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim1; j+=4) {
dst[j*dim + i] = src[i*dim + j];
dst[(j+1)*dim + i] = src[i*dim + (j+1)];
dst[(j+2)*dim + i] = src[i*dim + (j+2)];
dst[(j+3)*dim + i] = src[i*dim + (j+3)];
}
for( ; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
__builtin_prefetch (&src[(i+1)*dim], 0, 1);
}
}
Of cource you should remove counting ( like i*dim) from the inner loop, as you already did in your attempts.
Cache prefetch could be used for source matrix.
you probably know this but register int (you tell the compiler that it would be smart to put this in register). And making the int's unsigned, may make things go little bit faster.

Resources