Computer freezes when more memory is malloced - c

I am trying to run a C program which mallocs memory according to the input given by the user.
Whenever I input something as big as 1000000000, rather than malloc returning NULL, my Ubuntu 14.04 machine freezes completely! I am quite sure that malloc is the culprit...
But I am surprised to see Ubuntu freeze!
Does anyone have any idea why this may be happening?
I have a laptop with 12 GB RAM, an i5 processor, a 500 GB hard disk, and Ubuntu 14.04.
Here is the code:
#include <stdio.h>
#include <stdlib.h>

#define LEFT(x)  (2*(x)+1)
#define RIGHT(x) (2*(x)+2)

long long int *err, *sorted, *size, *id;
short int *repeat;

void max_heapify(long long int *arr, long long int length, long long int index)
{
    long long int largest, left, right, temp, flag = 1;
    while (flag)
    {
        left = LEFT(index);
        right = RIGHT(index);
        if (left < length && arr[left] > arr[index])
            largest = left;
        else
            largest = index;
        if (right < length && arr[right] > arr[largest])
            largest = right;
        if (largest != index)
        {
            temp = arr[index];
            arr[index] = arr[largest];
            arr[largest] = temp;
            index = largest;
        }
        else
            flag = 0;
    }
}

void build_max_heap(long long int *arr, long long int length)
{
    long long int i, j;
    j = (length / 2) - 1;
    for (i = j; i >= 0; i--)
        max_heapify(arr, length, i);
}

void heapsort(long long int *arr, long long int length)
{
    long long int i, temp, templength;
    build_max_heap(arr, length);
    templength = length;
    for (i = 0; i < templength; i++)
    {
        temp = arr[0]; // maximum number
        arr[0] = arr[length - 1];
        arr[length - 1] = temp;
        length--;
        max_heapify(arr, length, 0);
    }
}

int main()
{
    long long int n, k, p, i, j;
    scanf("%lld%lld%lld", &n, &k, &p);
    err = (long long int*)malloc((n + 1) * sizeof(long long int));
    //repeat = (short int*)calloc(1000000001 , sizeof(short int));
    sorted = (long long int*)malloc((n + 1) * sizeof(long long int));
    j = 0;
    for (i = 0; i < n; i++)
    {
        scanf("%lld", &err[i]);
        sorted[j++] = err[i];
    }
    heapsort(sorted, j);
    for (i = 0; i < j; i++)
        printf("%lld, ", sorted[i]);
    //These malloc statements cause the problem!!
    id = (long long int*)malloc((sorted[j - 1] + 1) * sizeof(long long int));
    size = (long long int*)malloc((sorted[j - 1] + 1) * sizeof(long long int));
    for (i = 0; i <= sorted[j - 1]; i++)
    {
        id[i] = i;
        size[i] = 1;
    }
    return 0;
}
Basically I am trying to sort the numbers and then allocate an array whose size is the maximum element. This program works for smaller inputs, but when I enter this
5 5 5
1000000000 999999999 999999997 999999995 999999994
it freezes Ubuntu. I even added a check for whether id or size is NULL, but that didn't help! If the system is unable to allocate that much memory then it should return NULL, but instead the system freezes! And this code works fine on a Mac!
Thanks!
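
For reference, here is a minimal sketch (reusing the variable names from the code above) that makes the requested allocation size explicit and checks the result. With the sample input above, the maximum element is 1000000000, so each of these two arrays alone asks for roughly (10^9 + 1) * 8 bytes, i.e. about 8 GB:

/* Sketch only: "sorted", "j", "id" and "size" are the variables from the code above. */
long long int max_elem = sorted[j - 1];                          /* 1000000000 for the sample input   */
size_t bytes = ((size_t)max_elem + 1) * sizeof(long long int);   /* about 8 GB per array in that case */

id = malloc(bytes);
size = malloc(bytes);
if (id == NULL || size == NULL)
{
    fprintf(stderr, "could not allocate %zu bytes\n", bytes);
    return 1;
}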

Related

Segmentation fault in malloc()

I need to create a function which returns an array of ints. This int array should contain all values between min and max (both included).
If min >= max a null pointer should be returned.
The question is why, when min = -2147483648 and max = 2147483647 (and len becomes 4294967296), do I get "Segmentation fault"?
My code:
#include <stdlib.h>
#include <stdio.h>

int *ft_range(int min, int max)
{
    int *range;
    long int len;
    long int i;

    range = NULL;
    if (min >= max)
        return (NULL);
    len = max - min + 1;
    if (!(range = (int *)malloc(sizeof(int) * len)))
        return (NULL);
    i = 0;
    while (min < max)
    {
        range[i] = min;
        min++;
        i++;
    }
    range[i] = max;
    return (range);
}

int main(void)
{
    int max;
    int min;
    long int len;
    int *range;
    long int i;

    max = 2147483647;
    min = -2147483648;
    if (max != min)
        len = max - min + 1;
    else
        len = 0;
    i = 0;
    range = ft_range(min, max);
    while (i < len)
    {
        printf("%d", range[i]);
        i++;
    }
    free(range);
    return (0);
}
But if I enter min = -2147483648 and max = 2147483646, with len = 4294967295, it works.
min and max are of type int, which is only guaranteed to be 16 bits signed (-32768, 32767), although the compiler may choose to use more bits to store the values. Therefore, if you expect values ranging over (-2147483648, 2147483647), these should be of type long int. The program may or may not be truncating some of the bits when you supply (-2147483648, 2147483647) or (-2147483648, 2147483646) as inputs. This would also apply to the type for range.
Secondly, the variable len is long int, which is only guaranteed to be 32 bits signed (-2147483648, 2147483647). Since you want to be able to store the value 4294967296, this will need to be either long long int or long long unsigned int. Even long unsigned int will only have a range of (0, 4294967295). This would also apply to i.
Additionally, the statement len = max - min + 1; will need a type cast to long long int to avoid overflow when performing the arithmetic. You can do it by adding (long long int) this way: len = (long long int)max - min + 1; or, if you want to be more explicit: len = ((long long int)max - (long long int)min) + 1LL;
To summarize:
#include <stdlib.h>
#include <stdio.h>

int *ft_range(long int min, long int max)
{
    int *range;
    long long int len;
    long long int i;

    range = NULL;
    if (min >= max)
        return (NULL);
    len = (long long int)max - min + 1;
    if (!(range = (int *)malloc(sizeof(int) * len)))
        return (NULL);
    i = 0;
    while (min < max)
    {
        range[i] = min;
        min++;
        i++;
    }
    range[i] = max;
    return (range);
}
Side note: the range (-2147483648, 2147483647) contains 4294967296 values, which at 4 bytes per int is around 16 GB of memory to be allocated, so I hope that you are ready for that.
int overflow with max - min + 1;
Use wider math for size calculations.
Use size_t for allocation size and indexing
Add more error checks.
int *ft_range(int min, int max) {
    // Add required test explicitly
    if (min >= max) {
        return NULL;
    }

    long long size = 1LL + max - min;                 // Use long long math
    if (size > SIZE_MAX / sizeof(int) || size < 1) {  // SIZE_MAX comes from <stdint.h>
        return NULL;
    }
    size_t usize = (size_t)size;

    int *range = malloc(sizeof *range * usize);
    if (range == NULL) {
        return NULL;
    }

    size_t i = 0;
    while (min < max) {
        range[i] = min;
        min++;
        i++;
    }
    range[i] = max;
    return range;
}
I solved the problem by making len and i long long int and adding the cast like this:
len = (long long int)max - min + 1;
Also, I had forgotten to check whether malloc returned NULL in the main() function.
This is the correct version:
#include <stdlib.h>
#include <stdio.h>

int *ft_range(int min, int max)
{
    int *range;
    long long int len;
    long long int i;

    range = NULL;
    if (min >= max)
        return (NULL);
    len = (long long int)max - min + 1;
    if (!(range = (int *)malloc(sizeof(int) * len)))
        return (NULL);
    i = 0;
    while (min < max)
    {
        range[i] = min;
        min++;
        i++;
    }
    range[i] = max;
    return (range);
}

int main(void)
{
    int *range;
    int max;
    int min;
    long long int len;
    long long int i;

    max = 2147483647;
    min = -2147483648;
    len = 0;
    if (max != min)
        len = (long long int)max - min + 1;
    i = 0;
    range = ft_range(min, max);
    if (!range)
        return (0);
    while (i < len)
    {
        printf("%d", range[i]);
        i++;
    }
    free(range);
    return (0);
}

Massive performance slowdown when changing int to unsigned long long in segmented sieve

I'm confused about the performance of my C program that implements a segmented sieve. I originally used only int as the datatype, but to support bigger numbers I decided to switch to unsigned long long. I expected some performance penalty due to overhead, but when I try the segmented sieve with an upper limit of 100 billion, the int approach takes 23 seconds whereas the unsigned long long version doesn't even finish (or at least takes too long for me to wait).
Here's the segmented sieve with just the int datatype, with N (upper bound) preset to 100B:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <math.h>
#include <time.h>

int size = 5;
int current_slot = 0;
int* primes_arr;

void append(int data)
{
    if ((current_slot + 1) < size)
    {
        primes_arr[current_slot++] = data;
        return;
    }
    int* newarr = realloc(primes_arr, (size += 5) * sizeof(int));
    if (newarr != NULL)
    {
        newarr[current_slot++] = data;
        primes_arr = newarr;
    }
    else
    {
        printf("\nAn error occurred while re-allocating memory\n");
        exit(1);
    }
}

// The following is just a standard approach to segmented sieve, nothing interesting
void simpleSieve(int limit)
{
    int p;
    if (primes_arr == NULL)
    {
        printf("\nAn error occurred while allocating primes_arr for mark in simpleSieve\n");
        exit(1);
    }
    bool* mark = malloc((limit + 1) * sizeof(bool));
    if (mark != NULL)
    {
        memset(mark, true, sizeof(bool) * (limit + 1));
    }
    else
    {
        printf("\nAn error occurred while allocating memory for mark in simpleSieve\n");
        exit(1);
    }
    for (p = 2; p * p < limit; p++)
    {
        if (mark[p])
        {
            for (int i = p * 2; i < limit; i += p)
            {
                mark[i] = false;
            }
        }
    }
    for (p = 2; p < limit; p++)
    {
        if (mark[p])
        {
            append(p);
            // printf("%d ", p);
        }
    }
}

void segmentedSieve(int n)
{
    int limit = (int)floor(sqrt(n)) + 1;
    simpleSieve(limit);
    int low = limit;
    int high = 2 * limit;
    while (low < n)
    {
        if (high >= n)
        {
            high = n;
        }
        bool* mark = malloc((limit + 1) * sizeof(bool));
        if (mark != NULL)
        {
            memset(mark, true, sizeof(bool) * (limit + 1));
        }
        else
        {
            printf("\nAn error occurred while allocating memory for mark in segmentedSieve\n");
            exit(1);
        }
        for (int i = 0; i < current_slot; i++)
        {
            int lower_lim = (int)floor(low / primes_arr[i]) * primes_arr[i];
            if (lower_lim < low)
            {
                lower_lim += primes_arr[i];
            }
            for (int j = lower_lim; j < high; j += primes_arr[i])
            {
                mark[j - low] = false;
            }
        }
        for (int i = low; i < high; i++)
        {
            if (mark[i - low] == true)
            {
                // printf("%d ", i);
            }
        }
        low = low + limit;
        high = high + limit;
        free(mark);
    }
}

int main()
{
    primes_arr = malloc(size * sizeof(int));
    clock_t t0 = clock();
    segmentedSieve(100000000000);
    clock_t t1 = clock();
    double time_taken = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("\nDone! Time taken: %f\n", time_taken);
    return 0;
}
And here's the segmented sieve with just the unsigned long long datatype, with N (upper bound) preset to 100B:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <math.h>
#include <time.h>

unsigned long size = 5, current_slot = 0;
unsigned long long* primes_arr;

void append(unsigned long long data)
{
    if ((current_slot + 1) < size)
    {
        primes_arr[current_slot++] = data;
        return;
    }
    unsigned long long* newarr = realloc(primes_arr, (size += 5l) * sizeof(unsigned long long));
    if (newarr != NULL)
    {
        newarr[current_slot++] = data;
        primes_arr = newarr;
    }
    else
    {
        printf("\nAn error occurred while re-allocating memory\n");
        exit(1);
    }
}

void simpleSieve(unsigned long limit)
{
    unsigned long long p;
    if (primes_arr == NULL)
    {
        printf("\nAn error occurred while allocating primes_arr for mark in simpleSieve\n");
        exit(1);
    }
    bool* mark = malloc((limit + 1) * sizeof(bool));
    if (mark == NULL)
    {
        printf("\nAn error occurred while allocating memory for mark in segmentedSieve\n");
        exit(1);
    }
    memset(mark, true, sizeof(bool) * (limit + 1));
    for (p = 2; p * p < limit; p++)
    {
        if (mark[p])
        {
            for (unsigned long long i = p * 2; i < limit; i += p)
            {
                mark[i] = false;
            }
        }
    }
    for (p = 2; p < limit; p++)
    {
        if (mark[p])
        {
            append(p);
            //printf("%llu ", p);
        }
    }
}

void segmentedSieve(unsigned long long n)
{
    unsigned long limit = (unsigned long)floor(sqrt(n)) + 1l;
    simpleSieve(limit);
    unsigned long low = limit;
    unsigned long high = 2 * limit;
    while (low < n)
    {
        if (high >= n)
        {
            high = n;
        }
        bool* mark = malloc((limit + 1) * sizeof(bool));
        if (mark == NULL)
        {
            printf("\nAn error occurred while allocating memory for mark in segmentedSieve\n");
            exit(1);
        }
        memset(mark, true, sizeof(bool) * (limit + 1));
        for (unsigned long long i = 0; i < current_slot; i++)
        {
            unsigned long long lower_lim = (unsigned long)floor(low / primes_arr[i]) * primes_arr[i];
            if (lower_lim < low)
            {
                lower_lim += primes_arr[i];
            }
            for (unsigned long long j = lower_lim; j < high; j += primes_arr[i])
            {
                mark[j - low] = false;
            }
        }
        for (unsigned long long i = low; i < high; i++)
        {
            if (mark[i - low])
            {
                //printf("%llu ", i);
            }
        }
        low = low + limit;
        high = high + limit;
        free(mark);
    }
}

int main()
{
    primes_arr = malloc(size * sizeof(unsigned long long));
    clock_t t0 = clock();
    segmentedSieve(100000000000);
    clock_t t1 = clock();
    double time_taken = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("\nDone! Time taken: %f\n", time_taken);
    return 0;
}
I fail to see why this is happening; am I doing something wrong?
Edit: I also realize that int shouldn't be capable of handling 100 billion anyway, and yet the program executes with no errors and even prints the final time report. Meanwhile, the unsigned long long program doesn't even finish in double the time it takes the int one.
On the other hand, setting the upper bound to 1B on both actually returns pretty similar results. Why?
100 billion decimally is 17 4876 E800H hexadecimally. As it doesn't fit into int, the 'surplus' most significant bits are cut away, leaving 4876 E800H, which is 1,215,752,192 decimal, so you actually set the limit to roughly a hundredth of what you intended when calling segmentedSieve within main.
Actually, you have been lucky not to produce a negative number that way.
Be aware, though, that you have entered the land of undefined behaviour due to signed integer overflow. Anything could have happened, including your programme crashing or switching off the sun...
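As a small illustration of that truncation (a sketch only; strictly speaking, converting an out-of-range value to int is implementation-defined, but on typical two's-complement platforms the low 32 bits survive):

#include <stdio.h>

int main(void)
{
    long long big = 100000000000LL;          /* 0x174876E800                 */
    int truncated = (int)big;                /* low 32 bits only: 0x4876E800 */
    printf("%lld -> %d\n", big, truncated);  /* 100000000000 -> 1215752192   */
    return 0;
}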
Far more interesting, though, have a look at the following:
segmentedSieve(unsigned long long n)
{
    unsigned long low = limit;
    while (low < n)
That's critical. On many systems, long has the same size as int (with e.g. 64-bit Linux being an exception). If that's the case on your system as well, then you produce an endless loop, as low will simply wrap around too (only this time it's not undefined behaviour) and will never be able to reach the 100 billion stored in n...
You should use unsigned long long consistently – or maybe even better, uint64_t from stdint.h.
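A minimal sketch of what the consistently 64-bit bookkeeping could look like (names taken from the code above; simpleSieve and the body of the segment loop are assumed to be adjusted the same way):

#include <stdint.h>
#include <math.h>

void segmentedSieve(uint64_t n)
{
    uint64_t limit = (uint64_t)sqrt((double)n) + 1;
    simpleSieve(limit);         /* assumed to take a uint64_t limit as well */

    uint64_t low  = limit;      /* 64-bit, so the loop can actually reach n */
    uint64_t high = 2 * limit;
    while (low < n)
    {
        if (high >= n)
            high = n;

        /* ... mark the segment [low, high) exactly as before ... */

        low  += limit;
        high += limit;
    }
}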

Segmentation Fault in C (array related)

I wrote a program in C to solve Project Euler Problem 45 (https://projecteuler.net/problem=45). I keep getting segmentation fault error 139. I am sure it is not about trying to access a memory location that I do not have permission for.
My guess is that the problem is related to the sizes of my arrays. I looked up the answer and it is some 10-digit number. To get that ten-digit number, the size of the array "triangle" has to be somewhere between one million and two million. But when I make the array that big I get the error. I don't get the error in the code below, since the size of that array is 500000 (but of course that is not enough).
I use Ubuntu 16.04 and Geany.
If you need more information please ask. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>

unsigned long pentagonalgenerator(int n);
unsigned long trianglegenerator(int n);
unsigned long hexagonalgenerator(int n);
_Bool search_function(unsigned int to_be_looked_for, unsigned long array[], int sizeofarray);

int main(void)
{
    unsigned long pentagon[28000] = {0};
    int sizeofpentagon = 28000;
    unsigned long hexagon[100000] = {0};
    int sizeofhexagon = 100000;
    unsigned long triangle[500000] = {0};
    int sizeoftriangle = 500000;
    int counter;

    for (counter = 0; counter < sizeofpentagon; counter++)
    {
        pentagon[counter] = pentagonalgenerator(counter + 2);
    }
    for (counter = 0; counter < sizeofhexagon; counter++)
    {
        hexagon[counter] = hexagonalgenerator(counter + 2);
    }
    for (counter = 0; counter < sizeoftriangle; counter++)
    {
        triangle[counter] = trianglegenerator(counter + 2);
    }
    printf("%lu \n%lu \n%lu \n", hexagon[sizeofhexagon - 1], pentagon[sizeofpentagon - 1], triangle[sizeoftriangle - 1]);

    for (counter = 0; counter < sizeofhexagon; counter++)
    {
        if (search_function(hexagon[counter], pentagon, sizeofpentagon))
        {
            if (search_function(hexagon[counter], triangle, sizeoftriangle) && hexagon[counter] != 40755)
            {
                printf("%lu", hexagon[counter]);
                return 0;
            }
        }
    }
    return 1;
}

_Bool search_function(unsigned int to_be_looked_for, unsigned long array[], int sizeofarray)
{
    int left = 0, right = sizeofarray - 1, middle = 0;
    while (left <= right)
    {
        middle = (left + right) / 2;
        if (to_be_looked_for == array[middle]) return 1;
        else if (to_be_looked_for < array[middle]) right = middle - 1;
        else if (to_be_looked_for > array[middle]) left = middle + 1;
    }
    return 0;
}

unsigned long pentagonalgenerator(int n)
{
    unsigned int return_value = 0;
    return_value = (n*(3*n - 1)) / 2;
    return return_value;
}

unsigned long hexagonalgenerator(int n)
{
    unsigned int return_value = 0;
    return_value = n*(2*n - 1);
    return return_value;
}

unsigned long trianglegenerator(int n)
{
    unsigned int return_value = 0;
    return_value = (n*(n + 1)) / 2;
    return return_value;
}
That's a lot of memory for the stack. Instead of this
unsigned long pentagon[28000] = {0};
int sizeofpentagon = 28000;
unsigned long hexagon[100000] = {0};
int sizeofhexagon = 100000;
unsigned long triangle[500000] = {0};
int sizeoftriangle = 500000;
Try this:
unsigned long *pentagon = calloc(28000, sizeof(unsigned long));
int sizeofpentagon = 28000;
unsigned long *hexagon = calloc(100000, sizeof(unsigned long));
int sizeofhexagon = 100000;
unsigned long *triangle = calloc(500000, sizeof(unsigned long));
int sizeoftriangle = 500000;
You have very large arrays defined as local variables on the stack, and you are getting a stack overflow because of that. The arrays pentagon, hexagon and triangle are very large.
These need to be moved to the global space, or they should be dynamically allocated. For your use case, it is easier to move the arrays to global scope.
unsigned long pentagon[28000] = {0};
unsigned long hexagon[100000] = {0};
unsigned long triangle[500000] = {0};

int main(void)
{
    int sizeofpentagon = 28000;
    int sizeofhexagon = 100000;
    int sizeoftriangle = 500000;
    ....
The maximum size for automatic variables is an implementation-dependent detail, BUT major implementations have options to set it.
For example, if you are using gcc or clang, automatic variables are stored on the stack, and the stack size can be controlled at link time by the option --stack <size>. The default size is 2 MB, and your arrays require 628000 unsigned long values, so at least 5 MB.
Provided you have more standard requirements in other places of this code, I would try an 8 MB stack:
cc myprog.c -Wl,--stack -Wl,0x800000 -o myprog
(-Wl, is used to pass the argument to the linker phase of the build).
This avoids having to rework your code (for example, by using dynamically allocated arrays) just to get around what is really a stack-size limit. (On Linux you can also raise the limit at run time, e.g. with ulimit -s, before running the program.)

What is the fastest way to list the elements of the unit group of a given size?

There are several fast algorithms to calculate the prime numbers up to a given number n. But what is the fastest implementation to list all the numbers r relatively prime to some number n in C? That is, find all the elements of the multiplicative group of integers modulo n as efficiently as possible in C. In particular, I am interested in the case where n is a primorial.
The primorial n# is like the factorial, except that only prime numbers are multiplied together and all other numbers are ignored. So, for example, 12 primorial would be 12# = 11*7*5*3*2.
My current implementation is very naive. I hard-code the first 3 groups as arrays and use those to create the larger ones. Is there something faster?
#include "stdafx.h"
#include <stdio.h> /* printf, fgets */
#include <stdlib.h> /* atoi */
#include <math.h>
int IsPrime(unsigned int number)
{
if (number <= 1) return 0; // zero and one are not prime
unsigned int i;
unsigned int max=sqrt(number)+.5;
for (i = 2; i<= max; i++)
{
if (number % i == 0) return 0;
}
return 1;
}
unsigned long long primorial( int Primes[], int size)
{
unsigned long long answer = 1;
for (int k = 0;k < size;k++)
{
answer *= Primes[k];
}
return answer;
}
unsigned long long EulerPhi(int Primes[], int size)
{
unsigned long long answer = 1;
for (int k = 0;k < size;k++)
{
answer *= Primes[k]-1;
}
return answer;
}
int gcd( unsigned long long a, unsigned long long b)
{
while (b != 0)
{
a %= b;
a ^= b;
b ^= a;
a ^= b;
}
//Return whethere a is relatively prime to b
if (a > 1)
{
return false;
}
return true;
}
void gen( unsigned long long *Gx, unsigned int primor, int *G3)
{
//Get the magic numbers
register int Blocks = 30; //5 primorial=30.
unsigned long long indexTracker = 0;
//Find elements using G3
for (unsigned long long offset = 0; offset < primor; offset+=Blocks)
{
for (int j = 0; j < 8;j++) //The 8 comes from EulerPhi(2*3*5=30)
{
if (gcd(offset + G3[j], primor))
{
Gx[indexTracker] = offset + G3[j];
indexTracker++;
}
}
}
}
int main(int argc, char **argv)
{
//Hardcoded values
int G1[] = {1};
int G2[] = {1,5};
int G3[] = {1,7,11,13,17,19,23,29};
//Lazy input checking. The world might come to an end
//when unexpected parameters given. Its okey, we will live.
if (argc <= 1) {
printf("Nothing done.");
return 0;
}
//Convert argument to integer
unsigned int N = atoi(argv[1]);
//Known values
if (N <= 2 )
{
printf("{1}");
return 0;
}
else if (N<=4)
{
printf("{1,5}");
return 0;
}
else if (N <=6)
{
printf("{1,7,11,13,17,19,23,29}");
return 0;
}
//Hardcoded for simplicity, also this primorial is ginarmous as it is.
int Primes[50] = {0};
int counter = 0;
//Find all primes less than N.
for (int a = 2; a <= N; a++)
{
if (IsPrime(a))
{
Primes[counter] = a;
printf("\n Prime: : %i \n", a);
counter++;
}
}
//Get the group size
unsigned long long MAXELEMENT = primorial(Primes, counter);
unsigned long long Gsize = EulerPhi(Primes, counter);
printf("\n GSize: %llu \n", Gsize);
printf("\n GSize: %llu \n", Gsize);
//Create the list to hold the values
unsigned long long *GROUP = (unsigned long long *) calloc(Gsize, sizeof(unsigned long long));
//Populate the list
gen( GROUP, MAXELEMENT, G3);
//Print values
printf("{");
for (unsigned long long k = 0; k < Gsize;k++)
{
printf("%llu,", GROUP[k]);
}
printf("}");
return 0;
}
If you are looking for a faster prime number check, here is one that is reasonably fast and eliminates all calls to computationally intensive functions (e.g. sqrt, etc.):
int isprime (int v)
{
    int i;

    if (v < 0) v = -v;          /* insure v non-negative   */
    if (v < 2) return 0;        /* 0 and 1 are not prime   */
    if (v == 2) return 1;       /* 2 is prime              */
    if (!((unsigned)v & 1))     /* even > 2 are not prime  */
        return 0;

    for (i = 2; i * i <= v; i++)
        if (v % i == 0)
            return 0;

    return 1;
}
(note: You can adjust the type as required if you are looking for numbers above the standard int range.)
Give it a try and let me know how it compares to the one you are currently using.
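For example, a small (hypothetical) driver to sanity-check it could look like this:

#include <stdio.h>

/* isprime() as defined above */

int main(void)
{
    for (int v = 0; v < 50; v++)   /* print the primes below 50 */
        if (isprime(v))
            printf("%d ", v);
    printf("\n");                  /* expected: 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 */
    return 0;
}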

Search an ordered array in a CUDA kernel

I'm writing a CUDA kernel and each thread has to complete the following task: suppose I have an ordered array a of n unsigned integers (the first one is always 0) stored in shared memory. Each thread has to find the array index i such that a[i] ≤ threadIdx.x and a[i + 1] > threadIdx.x.
A naive solution could be:
for (i = 0; i < n - 1; i++)
    if (a[i + 1] > threadIdx.x) break;
but I suppose this is not the optimal way to do it... can anyone suggest anything better?
Like Robert, I was thinking that a binary search has got to be faster than a naïve loop -- the upper bound on the operation count for a binary search is O(log(n)), compared to O(n) for the loop.
My extremely simple implementation:
#include <iostream>
#include <climits>
#include <assert.h>

__device__ __host__
int midpoint(int a, int b)
{
    return a + (b - a) / 2;
}

__device__ __host__
int eval(int A[], int i, int val, int imin, int imax)
{
    int low = (A[i] <= val);
    int high = (A[i+1] > val);

    if (low && high) {
        return 0;
    } else if (low) {
        return -1;
    } else {
        return 1;
    }
}

__device__ __host__
int binary_search(int A[], int val, int imin, int imax)
{
    while (imax >= imin) {
        int imid = midpoint(imin, imax);
        int e = eval(A, imid, val, imin, imax);
        if (e == 0) {
            return imid;
        } else if (e < 0) {
            imin = imid;
        } else {
            imax = imid;
        }
    }
    return -1;
}

__device__ __host__
int linear_search(int A[], int val, int imin, int imax)
{
    int res = -1;
    for (int i = imin; i < (imax - 1); i++) {
        if (A[i+1] > val) {
            res = i;
            break;
        }
    }
    return res;
}

template<int version>
__global__
void search(int * source, int * result, int Nin, int Nout)
{
    extern __shared__ int buff[];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int val = INT_MAX;
    if (tid < Nin) val = source[threadIdx.x];
    buff[threadIdx.x] = val;
    __syncthreads();

    int res;
    switch (version) {
        case 0:
            res = binary_search(buff, threadIdx.x, 0, blockDim.x);
            break;
        case 1:
            res = linear_search(buff, threadIdx.x, 0, blockDim.x);
            break;
    }

    if (tid < Nout) result[tid] = res;
}

int main(void)
{
    const int inputLength = 128000;
    const int isize = inputLength * sizeof(int);
    const int outputLength = 256;
    const int osize = outputLength * sizeof(int);

    int * hostInput = new int[inputLength];
    int * hostOutput = new int[outputLength];
    int * deviceInput;
    int * deviceOutput;

    for (int i = 0; i < inputLength; i++) {
        hostInput[i] = -200 + 5*i;
    }

    cudaMalloc((void**)&deviceInput, isize);
    cudaMalloc((void**)&deviceOutput, osize);
    cudaMemcpy(deviceInput, hostInput, isize, cudaMemcpyHostToDevice);

    dim3 DimBlock(256, 1, 1);
    dim3 DimGrid(1, 1, 1);
    DimGrid.x = (outputLength / DimBlock.x) +
                ((outputLength % DimBlock.x > 0) ? 1 : 0);
    size_t shmsz = DimBlock.x * sizeof(int);

    for (int i = 0; i < 5; i++) {
        search<1><<<DimGrid, DimBlock, shmsz>>>(deviceInput, deviceOutput,
                                                inputLength, outputLength);
    }
    for (int i = 0; i < 5; i++) {
        search<0><<<DimGrid, DimBlock, shmsz>>>(deviceInput, deviceOutput,
                                                inputLength, outputLength);
    }

    cudaMemcpy(hostOutput, deviceOutput, osize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < outputLength; i++) {
        int idx = hostOutput[i];
        int tidx = i % DimBlock.x;
        assert( (hostInput[idx] <= tidx) && (tidx < hostInput[idx+1]) );
    }
    cudaDeviceReset();

    return 0;
}
gave about a five-times speedup compared to the loop:
>nvprof a.exe
======== NVPROF is profiling a.exe...
======== Command: a.exe
======== Profiling result:
Time(%) Time Calls Avg Min Max Name
60.11 157.85us 1 157.85us 157.85us 157.85us [CUDA memcpy HtoD]
32.58 85.55us 5 17.11us 16.63us 19.04us void search<int=1>(int*, int*, int, int)
6.52 17.13us 5 3.42us 3.35us 3.73us void search<int=0>(int*, int*, int, int)
0.79 2.08us 1 2.08us 2.08us 2.08us [CUDA memcpy DtoH]
I'm sure that someone clever could do a lot better than that. But perhaps this gives you at least a few ideas.
can anyone suggest anything better?
A brute force approach would be to have each thread do a binary search (on threadIdx.x + 1).
// sets idx to the index of the first element in a that is
// equal to or larger than key
__device__ void bsearch_range(const int *a, const int key, const unsigned len_a, unsigned *idx){
    unsigned lower = 0;
    unsigned upper = len_a;
    unsigned midpt;
    while (lower < upper){
        midpt = (lower + upper) >> 1;
        if (a[midpt] < key) lower = midpt + 1;
        else upper = midpt;
    }
    *idx = lower;
    return;
}

__global__ void find_my_idx(const int *a, const unsigned len_a, int *my_idx){
    unsigned idx = (blockDim.x * blockIdx.x) + threadIdx.x;
    unsigned sp_a;
    int val = idx + 1;
    bsearch_range(a, val, len_a, &sp_a);
    my_idx[idx] = ((val - 1) < a[sp_a]) ? sp_a : -1;
}
This is coded in browser, not tested. It's hacked from a piece of working code, however. If you have trouble making it work, I can revisit it. I don't recommend this approach on a device without caches (cc 1.x device).
This is actually searching on the full unique 1D thread index (blockDim.x * blockIdx.x + threadIdx.x + 1). You can change val to be anything you like.
You could also add an appropriate thread check, if the number of threads you intend to launch is greater than the length of your my_idx result vector.
I imagine there is a more clever approach that may use something akin to prefix sums.
This is the best algorithm so far. It's called LPW Indexed Search:
__global__ void find_position_lpw(int *a, int n)
{
    int idx = threadIdx.x;

    __shared__ int aux[ MAX_THREADS_PER_BLOCK /*1024*/ ];
    aux[idx] = 0;

    if (idx < n)
        atomicAdd( &aux[a[idx]], 1); // atomics in case there are duplicates
    __syncthreads();

    int tmp;
    for (int j = 1; j <= MAX_THREADS_PER_BLOCK / 2; j <<= 1)
    {
        if (idx >= j) tmp = aux[idx - j];
        __syncthreads();
        if (idx >= j) aux[idx] += tmp;
        __syncthreads();
    }

    // result in "i"
    int i = aux[idx] - 1;

    // use "i" here...
    // ...
}
