My task is to run this matrix-vector multiplication C program for about 14-16 minutes, but I can only get it to run for about 20 seconds at most. I also tried increasing the number of rows and columns, and a bunch of other things, and I still can't make it run longer. Can you please help me?
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define NUM_ROWS 600000 // M dimension
#define NUM_COLS 200000 // N dimension
void main(int argc, char *argv[])
{
struct timeval start, stop, elapse; // declaring the structure elements
gettimeofday( &start, NULL ); // calling gettimeofday function to calculate time
long long int myid, numprocs, i, j;
long long int local_row_sum;
double starttime, endtime;
long long int *matrix = (long long int*)malloc(NUM_ROWS * NUM_COLS * sizeof(long long int));
long long int *vector = (long long int*)malloc(NUM_COLS * sizeof(long long int ));
long long int *local_result = (long long int *)malloc(NUM_ROWS * sizeof(long long int));
long long int *final_result = (long long int *)malloc(NUM_ROWS * sizeof(long long int ));
// Initialize array and vector
for (i = 0; i < NUM_ROWS; i++) {
for(j = 0; j < NUM_COLS; j++) {
matrix[i * NUM_COLS + j] = i + j;
}
}
for (j = 0; j < NUM_COLS; j++) {
vector[j] = j;
}
int k, l;
long long int*sequential_result = (long long int*)malloc(NUM_ROWS * 1 * sizeof(long long int ));
for (k = 0; k < NUM_ROWS; k++)
{
sequential_result[k] = 0;
for (l = 0; l < NUM_COLS; l++)
{
sequential_result[k] += *(matrix + k * NUM_COLS + l) * *(vector + l);
}
printf("The result: %lld\n", sequential_result[k]);
// Check that sequential result equals MPI result (0 is error code here)
}
gettimeofday( &stop, NULL ); // calling gettimeofday function to stop counting the time taken
timersub( &stop, &start, &elapse ); // calling this function to calculate the time difference
fprintf( stderr, "Elapse time\t%g\n", elapse.tv_sec+0.000001*elapse.tv_usec );
}
You're getting compile-time warnings on:
long long int *matrix = (long long int *) malloc(NUM_ROWS * NUM_COLS * sizeof(long long int));
The warnings are:
orig.c:21:60: warning: integer overflow in expression of type ‘int’ results in ‘-259084288’ [-Woverflow]
orig.c:21:44: warning: argument 1 value ‘18446744071636877312’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
That's because you put the sizeof as the last term, so the leading NUM_ROWS * NUM_COLS multiplication was done in int (32 bits) instead of size_t [probably 64 bits].
Thus, the size computation wraps/truncates to 32 bits before it is passed as the argument to malloc. So, you'll only allocate a much smaller area than you expect.
This probably results in UB (undefined behavior) because the rest of the code will assume the larger area and will modify memory beyond the end of the actual allocation.
To fix, do this:
long long int *matrix = malloc(sizeof(*matrix) * NUM_ROWS * NUM_COLS);
This will force the entire argument expression to be 64 bits [which is what you want/need]. You should do similar changes for your other malloc calls.
Edit: Oops, I missed the fact that your allocation is so large that it can't be fulfilled by malloc [which was flagged by the second warning message]. You'll have to cut down the array sizes [see below].
BTW, don't cast the return from malloc; see: Do I cast the result of malloc?
And, using the sizeof(*matrix) trick is much more likely to produce robust code.
Also, you should definitely check the return value from malloc for NULL and abort if so. You're allocating a huge array. Your system may not have enough memory (either physical memory or logical memory if you've defined swap areas).
Since this check has to be repeated, consider using a wrapper function:
void *
safe_malloc(size_t size)
{
void *ptr;
ptr = malloc(size);
if (ptr == NULL) {
fprintf(stderr,"safe_malloc: malloc failure -- size=%zu -- %s\n",
size,strerror(errno));
exit(1);
}
return ptr;
}
And, replace the [fixed/adjusted] malloc calls with calls to safe_malloc.
I sometimes also use a macro to further reduce repeating boilerplate:
#define DEFINE_AND_ALLOC_VECTOR(_typ,_ptr,_count) \
_typ *_ptr = safe_malloc(sizeof(*_ptr) * (_count))
Then, you can do:
DEFINE_AND_ALLOC_VECTOR(long long int,matrix,NUM_ROWS * NUM_COLS);
UPDATE:
what about long long int*sequential_result = (long long int*)malloc(NUM_ROWS * 1 * sizeof(long long int ));
No, I wouldn't do that, for a few reasons. It violates the DRY (don't repeat yourself) principle [because you repeated long long int three times].
Under C, casting the return value of malloc (which returns a void *) is cruft and can silently mask a bug [if you inadvertently fail to do #include <stdlib.h>, a pre-C99 compiler will assume malloc returns int and truncate the return value to 32 bits].
Suppose you changed the type of sequential_result from long long int to (e.g. long double). But, didn't change the [bad] cast type or the sizeof.
Now, you've got a bug:
long double *sequential_result = (long long int *) malloc(sizeof(long long int) * NUM_ROWS);
Here not enough space was allocated.
That's why experienced programmers [usually] prefer the sizeof(*ptr) form.
Further tip: The preferred grouping is:
long long int *sequential_result = ...;
And, not:
long long int* sequential_result = ...;
That's because if you have:
int* a,b;
That seems to be the equivalent of:
int *a;
int *b;
But, it is actually:
int *a;
int b;
The asterisk binds to the declarator on its right (the variable name), not to the type on its left. So, writing int* is deceptive.
UPDATE #2:
Can you please rewrite the program?
Here is the refactored code. I've cut down the dimensions by a factor of 100 to fit into a sane amount of memory.
From your question, it seems you want large arrays so you can benchmark for an extended period of time.
I do a lot of code optimization and benchmarking. Smaller arrays will still give you accurate timing data. For a simple matrix multiply, about 10 seconds is sufficient.
I've used cpp conditionals to show old vs. new code:
#if 0
// old [broken] code
#else
// new [fixed] code
#endif
Anyway, here is the refactored code [as I would write it]:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#if 0
#define NUM_ROWS 600000 // M dimension
#define NUM_COLS 200000 // N dimension
#else
#define NUM_ROWS 6000 // M dimension
#define NUM_COLS 2000 // N dimension
#endif
#define DEFINE_AND_ALLOC_VECTOR(_typ,_ptr,_count) \
_typ *_ptr = safe_malloc(sizeof(*_ptr) * _count)
typedef long long int bignum_t;
double
tscgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_MONOTONIC,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
void *
safe_malloc(size_t size)
{
void *ptr;
ptr = malloc(size);
if (ptr == NULL) {
fprintf(stderr,"safe_malloc: malloc failure -- size=%zu -- %s\n",
size,strerror(errno));
exit(1);
}
return ptr;
}
int
main(int argc, char **argv)
{
double start, stop;
start = tscgetf();
//int myid;
//int numprocs;
size_t i;
size_t j;
//bignum_t local_row_sum;
DEFINE_AND_ALLOC_VECTOR(bignum_t,matrix,NUM_ROWS * NUM_COLS);
DEFINE_AND_ALLOC_VECTOR(bignum_t,vector,NUM_COLS);
DEFINE_AND_ALLOC_VECTOR(bignum_t,local_result,NUM_ROWS);
DEFINE_AND_ALLOC_VECTOR(bignum_t,final_result,NUM_ROWS);
bignum_t *mat;
// Initialize array and vector
#if 0
for (i = 0; i < NUM_ROWS; i++) {
for (j = 0; j < NUM_COLS; j++) {
matrix[i * NUM_COLS + j] = i + j;
}
}
#else
for (i = 0; i < NUM_ROWS; i++) {
mat = &matrix[i * NUM_COLS];
for (j = 0; j < NUM_COLS; j++)
mat[j] = i + j;
}
#endif
for (j = 0; j < NUM_COLS; j++)
vector[j] = j;
DEFINE_AND_ALLOC_VECTOR(bignum_t,sequential_result,NUM_ROWS);
#if 0
int k, l;
#else
size_t k, l;
#endif
#if 0
for (k = 0; k < NUM_ROWS; k++) {
sequential_result[k] = 0;
for (l = 0; l < NUM_COLS; l++) {
sequential_result[k] += *(matrix + k * NUM_COLS + l) *
*(vector + l);
}
#else
for (k = 0; k < NUM_ROWS; ++k) {
bignum_t acc = 0;
mat = &matrix[k * NUM_COLS];
for (l = 0; l < NUM_COLS; ++l)
acc += mat[l] * vector[l];
sequential_result[k] = acc;
#endif
printf("The %zu result: %lld\n", k, sequential_result[k]);
// Check that sequential result equals MPI result (0 is error code here)
}
stop = tscgetf();
fprintf(stderr, "Elapse time\t%.9f seconds\n", stop - start);
return 0;
}
I want to create a random int array in CUDA, and I need to check for duplicates within array indexes 0-9, 10-19, ... and repair them.
Any idea how to make this efficient? I really don't want to check each element against every other element.
Here is my code:
__global__ void generateP(int *d_p, unsigned long seed)
{
int i = X * blockIdx.x + threadIdx.x * X;
int buffer[X];
curandState state;
curand_init(seed, i, 0, &state);
for (int j = 0; j < X; j++)
{
float random = HB + (curand_uniform(&state) * (LB - HB));
buffer[j] = (int)truncf(random);
}
// TODO unique check and repair duplicity
for (int k = 0; k < X; k++)
{
d_p[i] = buffer[k];
i++;
}
}
Is there some kind of Contains function in CUDA? Thanks for the help.
You really are asking the wrong question here. You should be looking for a way of randomly ordering a list of unique values, rather than attempting to fill a list with unique random numbers by searching and replacing duplicates repeatedly until you have the unique list. The latter is terribly inefficient and a poor fit to a data parallel execution model like CUDA.
There are simple, robust algorithms for randomly shuffling list of values that only require at most N calls to a random generator in order to shuffle a list of N values. The Fisher-Yates shuffle is almost universally used for this.
I'm not going to comment much on this code except to say that it illustrates one approach, using one thread per list. It isn't intended to be performant, just a teaching example to get you started. I think it does something close to what you are asking for (based more on your previous attempt at this question than on this one). I recommend you study it as a lead-in to writing your own implementation that does whatever it is you are trying to do.
#include <ctime>
#include <iostream>
#include <curand_kernel.h>
struct source
{
int baseval;
__device__ source(int _b) : baseval(_b) {};
__device__ int operator()(int v) { return baseval + v; };
};
__device__ int urandint(int minval, int maxval, curandState_t& state)
{
float rval = curand_uniform(&state);
rval *= (float(maxval) - float(minval) + 0.99999999f);
rval += float(minval);
return (int)truncf(rval);
}
template<int X>
__global__ void kernel(int* out, int N, unsigned long long seed)
{
int tidx = threadIdx.x + blockIdx.x * blockDim.x;
if (tidx < N) {
curandState_t state;
curand_init(seed, tidx, 0, &state);
int seq[X];
source vals(tidx * X);
// Fisher-Yates shuffle straight from Wikipedia
#pragma unroll
for(int i=0; i<X; ++i) {
int j = urandint(0, i, state);
if (j != i)
seq[i] = seq[j];
seq[j] = vals(i);
}
// Copy local shuffled sequence to output array
int* dest = &out[X * tidx];
memcpy(dest, &seq[0], X * sizeof(int));
}
}
int main(void)
{
const int X = 10;
const int nsets = 200;
int* d_result;
size_t sz = size_t(nsets) * sizeof(int) * size_t(X);
cudaMalloc((void **)&d_result, sz);
int tpb = 32;
int nblocks = (nsets/tpb) + ((nsets%tpb !=0) ? 1 : 0);
kernel<X><<<nblocks, tpb>>>(d_result, nsets, std::time(0));
int h_result[nsets][X];
cudaMemcpy(&h_result[0][0], d_result, sz, cudaMemcpyDeviceToHost);
for(int i=0; i<nsets; ++i) {
std::cout << i << " : ";
for(int j=0; j<X; ++j) {
std::cout << h_result[i][j] << ",";
}
std::cout << std::endl;
}
cudaDeviceReset();
return 0;
}
I'm writing an implementation of the Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) on the GPU, but nothing like this one: http://developer-resource.blogspot.com/2008/07/cuda-sieve-of-eratosthenes.html
Method:
Creating an n-element array with default values 0/1 (0 = prime, 1 = not prime) and passing it to the GPU (I know this could be done directly in the kernel, but that's not the problem at the moment).
Each thread in a block checks the multiples of a single number. Each block checks sqrt(n) possibilities in total. Each block == a different interval.
Marking multiples as 1 and passing the data back to the host.
Code:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
return 0;
}
Run:
./sieve 10240000
It works correctly for n = 16, 64, 1024, 102400... but for n = 10240000 I get an incorrect result. Where is the problem?
This code has a variety of problems, in my view.
You are fundamentally accessing items out of range. Consider this sequence in your kernel:
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
You (in some cases -- see below) have launched a thread array exactly equal in size to your global array. So what happens when the highest numbered thread runs the above code? number = threadIdx.x+1+blockIdx.x*blockDim.x. This number index will be one beyond the end of your array. This is true for many possible values of n. This problem would have been evident to you if you had either used proper cuda error checking or had run your code with cuda-memcheck. You should always do those things when you are having trouble with a CUDA code and also before asking for help from others.
The code only has a chance of working correctly if the input n is a perfect square. The reason for this is contained in these lines of code (as well as dependencies in the kernel):
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
...
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
(note that the correct function here would be atoi not atol, but I digress...) Unless n is a perfect square, the resultant n_sqrt will be somewhat less than the actual square root of n. This will lead you to compute a total thread array that is smaller than the necessary size. (It's OK if you don't believe me at this point. Run the code I will post below and input a size like 1025, then see if the number of threads * blocks is of sufficient size to cover an array of 1025.)
As you've stated:
Each block checks in total sqrt(n) possibilities.
Hopefully this also points out the danger of a non-perfect-square n, but we must now ask: "what if n is larger than the square of the largest threadblock size (1024)?" The answer is that the code will not work correctly in many cases, and your chosen input of 10240000, although a perfect square, exceeds 1024^2 (1048576) and does not work for this reason. Your algorithm (which I claim is not a Sieve of Eratosthenes) requires that each block be able to check sqrt(n) possibilities, just as you stated in the question. When that is no longer possible because of the limit on threads per block, your algorithm starts to break.
Here is a code that makes some attempt to fix issue #1 above, and at least give an explanation for the failures associated with #2 and #3:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
#define MAX 10240000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
if ((blockIdx.x != (gridDim.x-1)) || (threadIdx.x != (blockDim.x-1))){
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];}
}
int cpu_sieve(int n){
int limit = floor(sqrt(n));
int *test_arr = (int *)malloc(n*sizeof(int));
if (test_arr == NULL) return -1;
memset(test_arr, 0, n*sizeof(int));
for (int i = 2; i < limit; i++)
if (!test_arr[i]){
int j = i*i;
while (j <= n){
test_arr[j] = 1;
j += i;}}
int count = 0;
for (int i = 2; i < n; i++)
if (!test_arr[i]) count++;
return count;
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
if (argc != 2) {printf("must supply n as command line parameter\n"); return 1;}
int n = atoi(argv[1]);
if ((n < 1) || (n > MAX)) {printf("n out of range %d\n", n); return 1;}
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
printf("threads = %d, blocks = %d\n", threads, blocks);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
printf("CPU Sieve: %d\n", cpu_sieve(n));
return 0;
}
There are a couple of issues, I think, but here's a pointer to the actual problem: The sieve of Eratosthenes removes iteratively multiples of already encountered prime numbers, and you want to separate the work-load into thread-blocks, where each thread-block operates on a piece of shared memory (cache, in your example). Thread-blocks, however, are generally independent from all other thread-blocks and cannot easily communicate with one another. One example to illustrate the problem: The thread with index 0 in thread-block with index 0 removes multiples of 2. Thread blocks with index > 0 have no way to know about this.
I found out about variable length arrays (VLAs) in C99, but it looks like they behave almost the same as malloc + free.
The practical differences I found:
Too big array handling:
unsigned size = 4000000000;
int* ptr = malloc(size); // ptr is 0, program doesn't crash
int array[size]; // segmentation fault, program crashes
Memory leaks: only possible in dynamic array allocation:
int* ptr = malloc(size);
...
if(...)
return;
...
free(ptr);
Life of the object and the possibility of returning it from a function: a dynamically allocated array lives until the memory is freed, and can be returned from the function that allocated it.
Resizing: resizing is possible only with pointers to allocated memory.
My questions are:
What other differences are there (I'm interested in practical advice)?
What problems can a programmer run into with each of these two ways of handling variable-length arrays?
When should one choose a VLA, and when dynamic allocation?
Which is faster: VLA or malloc+free?
Some practical advice:
VLAs are in practice located on the space-limited stack, while malloc() and its friends allocate on the heap, which is likely to allow bigger allocations. Moreover, you have more control over that process, as malloc() returns NULL if it fails. In other words, you have to be careful not to blow your stack with a VLA at runtime.
Not all compilers support VLAs, e.g. Visual Studio. Moreover, C11 made them an optional feature, allowing implementations not to support them; such implementations define the __STDC_NO_VLA__ macro.
From my experience (numerical programs like finding primes with trial division, Miller-Rabin, etc.), I wouldn't say that VLAs are any faster than malloc(). There is some overhead to the malloc() call of course, but what seems to matter more is data-access efficiency.
Here is a quick & dirty comparison using GNU/Linux x86-64 and the GCC compiler. Note that results may vary from one platform to another, or even between compiler versions. You might use it as a basic (though far from complete) malloc() vs. VLA data-access benchmark.
prime-trial-gen.c:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
bool isprime(int n);
int main(void)
{
FILE *fp = fopen("primes.txt", "w");
assert(fp);
fprintf(fp, "%d\n", 2);
for (int i = 3; i < 10000; i += 2)
if (isprime(i))
fprintf(fp, "%d\n", i);
fclose(fp);
return 0;
}
bool isprime(int n)
{
if (n % 2 == 0)
return false;
for (int i = 3; i * i <= n; i += 2)
if (n % i == 0)
return false;
return true;
}
Compile & run:
$ gcc -std=c99 -pedantic -Wall -W prime-trial-gen.c
$ ./a.out
Then here is second program, that take use of generated "primes dictionary":
prime-trial-test.c:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
bool isprime(int n, int pre_prime[], int num_pre_primes);
int get_num_lines(FILE *fp);
int main(void)
{
FILE *fp = fopen("primes.txt", "r");
assert(fp);
int num_lines = get_num_lines(fp);
rewind(fp);
#if WANT_VLA
int pre_prime[num_lines];
#else
int *pre_prime = malloc(num_lines * sizeof *pre_prime);
assert(pre_prime);
#endif
for (int i = 0; i < num_lines; i++)
assert(fscanf(fp, "%d", pre_prime + i));
fclose(fp);
/* NOTE: primes.txt holds primes <= 10 000 (10**4), thus we are safe upto 10**8 */
int num_primes = 1; // 2
for (int i = 3; i < 10 * 1000 * 1000; i += 2)
if (isprime(i, pre_prime, num_lines))
++num_primes;
printf("pi(10 000 000) = %d\n", num_primes);
#if !WANT_VLA
free(pre_prime);
#endif
return 0;
}
bool isprime(int n, int pre_prime[], int num_pre_primes)
{
for (int i = 0; i < num_pre_primes && pre_prime[i] * pre_prime[i] <= n; ++i)
if (n % pre_prime[i] == 0)
return false;
return true;
}
int get_num_lines(FILE *fp)
{
int ch, c = 0;
while ((ch = fgetc(fp)) != EOF)
if (ch == '\n')
++c;
return c;
}
Compile & run (malloc version):
$ gcc -O2 -std=c99 -pedantic -Wall -W prime-trial-test.c
$ time ./a.out
pi(10 000 000) = 664579
real 0m1.930s
user 0m1.903s
sys 0m0.013s
Compile & run (VLA version):
$ gcc -DWANT_VLA=1 -O2 -std=c99 -pedantic -Wall -W prime-trial-test.c
$ time ./a.out
pi(10 000 000) = 664579
real 0m1.929s
user 0m1.907s
sys 0m0.007s
As you can check, π(10**7) is indeed 664,579. Notice that both execution times are almost the same.
One advantage of VLAs is that you can pass variably-dimensioned arrays to functions, which can be handy when dealing with (sanely sized) matrices, for example:
int n = 4;
int m = 5;
int matrix[n][m];
// …code to initialize matrix…
another_func(n, m, matrix);
// No call to free()
where:
void another_func(int n, int m, int matrix[n][m])
{
int sum = 0;
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
// …use matrix just like normal…
sum += matrix[i][j];
}
}
// …do something with sum…
}
This is particularly valuable since the alternatives using malloc() without VLAs mean that you either have to do the subscript calculations manually in the called function, or you have to create a vector of pointers.
Manual subscript calculations
int n = 4;
int m = 5;
int *matrix = malloc(sizeof(*matrix) * n * m);
// …code to initialize matrix…
another_func2(n, m, matrix);
free(matrix);
and:
void another_func2(int n, int m, int *matrix)
{
int sum = 0;
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
// …do manual subscripting…
sum += matrix[i * m + j];
}
}
// …do something with sum…
}
Vector of pointers
int n = 4;
int m = 5;
int **matrix = malloc(sizeof(*matrix) * n);
for (int i = 0; i < n; i++)
matrix[i] = malloc(sizeof(*matrix[i]) * m);
// …code to initialize matrix…
another_func3(n, m, matrix);
for (int i = 0; i < n; i++)
free(matrix[i]);
free(matrix);
and:
void another_func3(int n, int m, int **matrix)
{
int sum = 0;
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
// …use matrix 'just' like normal…
// …but there is an extra pointer indirection hidden in this notation…
sum += matrix[i][j];
}
}
// …do something with sum…
}
This form can be optimized to two allocations:
int n = 4;
int m = 5;
int **matrix = malloc(sizeof(*matrix) * n);
int *values = malloc(sizeof(*values) * n * m);
for (int i = 0; i < n; i++)
matrix[i] = &values[i * m];
// …code to initialize matrix…
another_func3(n, m, matrix);
free(values);
free(matrix);
Advantage VLA
There is less bookkeeping work to do when you use VLAs. But if you need to deal with preposterously sized arrays, malloc() still scores. You can use VLAs with malloc() et al if you're careful — see calloc() for an array of array with negative index in C for an example.
I am trying to run a C program which mallocs memory according to the input given by the user.
Whenever I input something as big as 1000000000, rather than returning NULL, my Ubuntu 14.04 machine freezes completely! I am quite sure that malloc is the culprit...
But I am surprised to see Ubuntu freeze!
Does anyone have any idea why this may be happening?
I have a laptop with 12GB RAM, an i5 processor, a 500GB hard disk, and Ubuntu 14.04.
Here is the code:
#include<stdio.h>
#include<stdlib.h>
#define LEFT(x) (2*(x)+1)
#define RIGHT(x) (2*(x)+2)
long long int *err, *sorted, *size, *id;
short int *repeat;
void max_heapify(long long int *arr, long long int length, long long int index)
{
long long int largest, left, right, temp, flag = 1;
while (flag)
{
left = LEFT(index);
right = RIGHT(index);
if (left < length && arr[left] > arr[index])
largest = left;
else
largest = index;
if (right < length && arr[right] > arr[largest])
largest = right;
if (largest != index)
{
temp = arr[index];
arr[index] = arr[largest];
arr[largest] = temp;
index = largest;
}
else
flag = 0;
}
}
void build_max_heap(long long int *arr, long long int length)
{
long long int i, j;
j = (length / 2) - 1;
for (i = j; i >= 0; i--)
max_heapify(arr, length, i);
}
void heapsort(long long int *arr, long long int length)
{
long long int i, temp, templength;
build_max_heap(arr, length);
templength = length;
for (i = 0; i < templength; i++)
{
temp = arr[0]; // maximum number
arr[0] = arr[length - 1];
arr[length - 1] = temp;
length--;
max_heapify(arr, length, 0);
}
}
int main()
{
long long int n, k, p, i, j;
scanf("%lld%lld%lld",&n, &k, &p);
err = (long long int*)malloc((n + 1) * sizeof(long long int));
//repeat = (short int*)calloc(1000000001 , sizeof(short int));
sorted = (long long int*)malloc((n + 1) * sizeof(long long int));
j = 0;
for(i = 0; i < n; i++)
{
scanf("%lld",&err[i]);
sorted[j++] = err[i];
}
heapsort(sorted, j);
for(i = 0; i < j; i++)
printf("%lld, ",sorted[i]);
//These malloc statements cause the problem!!
id = (long long int*)malloc((sorted[j - 1] + 1) * sizeof(long long int));
size = (long long int*)malloc((sorted[j - 1] + 1) * sizeof(long long int));
for(i = 0; i <= sorted[j - 1]; i++)
{
id[i] = i;
size[i] = 1;
}
return 0;
}
Basically I am trying to sort the numbers and then allocate an array of the size of the maximum element. This program works for smaller inputs, but when I enter this:
5 5 5
1000000000 999999999 999999997 999999995 999999994
It freezes Ubuntu... I even added a condition to check whether id or size is NULL, but that didn't help! If the system is unable to allocate that much memory, then malloc should return NULL, but instead the system freezes! And this code works fine on a Mac!
Thanks!
I am attempting to benchmark some different algorithm designs, and I would like to generalize my functions by making them type-unsafe internally, with user-enforced type safety.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void * foo1(unsigned n)
{
unsigned i;
unsigned long sum = 0;
void * psum = &sum;
for(i = 0; i < n; ++i)
++sum;
return psum;
}
...
int main(void)
{
int i;
unsigned n;
unsigned long sum = 0;
void *(*foo[6])(unsigned) = {&foo1, &foo2, &foo3, &foo4, &foo5, &foo6};
unsigned n_max[6] = {1000000001, 1000001, 10001, 1000001, 101, 1001};
clock_t start, end;
for(i = 0; i < 6; ++i){
n = 1;
do{
start = clock();
sum = *(unsigned long *)(*foo[i])(n);
end = clock();
printf("%d|%lu|%.6f\n", n, sum, (double) (end - start) / CLOCKS_PER_SEC);
n *= 10;
}while(n < n_max[i]);
}
return 0;
}
My question specifically is about how I might make my foo() functions more generic than they are, enabling me to use them for every type.
To that end, I had an idea, but I'm not sure about it: if a signed data type is cast to unsigned and then modified by some other unsigned data type, will the arithmetic stick, or will erroneous math happen?