I tried to reduce the execution time of this function, and I got the execution time down to
Sys:0.001s
Is there any way to reduce the execution time of this function further?
int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
int i;
uint32_t ri, c=0;
for (i = 0; i < n; i ++)
{
ri = a[i] + b[i] + c;
c = ((ri < a[i]) || ((ri == a[i]) && c));
r[i] = ri;
}
return ((int) c);
}
My guess is that you lose most of the time in your conditional expression: most modern CPUs hate branches that they cannot predict correctly most of the time. The branches introduced by most loops are fine, because they are typically mispredicted only once for the entire loop. Branching on a carry condition, however, will likely result in about 50% of the branches being mispredicted (assuming effectively random input), and each misprediction costs roughly 10 to 20 cycles. Even worse, the && and || operators introduce sequence points, which are a hindrance to the optimizer.
So, I would try to eliminate these conditionals:
int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n) {
int i;
uint64_t ri, c=0;
for (i = 0; i < n; i ++) {
ri = (uint64_t)a[i] + (uint64_t)b[i] + c;
c = ri >> 32;
r[i] = (uint32_t)ri;
}
return ((int) c);
}
Here, I have used 64-bit arithmetic, since modern CPUs do 64-bit arithmetic just as fast as 32-bit arithmetic. However, if 64-bit arithmetic is slow on your hardware, you can fall back to 32-bit arithmetic:
int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n) {
int i;
uint32_t ri, c=0;
for (i = 0; i < n; i ++) {
uint32_t curA = a[i], curB = b[i];
uint32_t lowA = curA & 0xffffu, highA = curA >> 16;
uint32_t lowB = curB & 0xffffu, highB = curB >> 16;
uint32_t lowR = lowA + lowB + c;
uint32_t highR = highA + highB + (lowR >> 16);
c = highR >> 16;
r[i] = (highR << 16) + lowR;
}
return ((int) c);
}
Even though this looks like a monster, it is only 12 simple operations, each of which should execute with a latency of one cycle on virtually all hardware. The entire loop body should therefore take fewer than 12 cycles per iteration, so the bottleneck should be the memory bus (and you can't avoid that).
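To check that the branch-free rewrite really computes the same sum and carry as the original loop, a small comparison helper can be sketched. The names add_ref, add_64 and variants_agree are made up for this example, and the scratch size of 16 words is an arbitrary assumption to keep it short.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the original formulation with the conditional carry test. */
int add_ref(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t c = 0;
    for (int i = 0; i < n; i++) {
        uint32_t ri = a[i] + b[i] + c;
        c = (ri < a[i]) || ((ri == a[i]) && c);
        r[i] = ri;
    }
    return (int)c;
}

/* Branch-free formulation using a 64-bit accumulator. */
int add_64(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
    uint64_t c = 0;
    for (int i = 0; i < n; i++) {
        uint64_t ri = (uint64_t)a[i] + b[i] + c;
        c = ri >> 32;              /* carry is bit 32 of the wide sum */
        r[i] = (uint32_t)ri;
    }
    return (int)c;
}

/* Returns 1 if both variants agree on every word and on the carry-out. */
int variants_agree(const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t r1[16], r2[16];
    assert(n >= 0 && n <= 16);     /* fixed scratch size (assumption) */
    int c1 = add_ref(r1, a, b, n);
    int c2 = add_64(r2, a, b, n);
    if (c1 != c2)
        return 0;
    for (int i = 0; i < n; i++)
        if (r1[i] != r2[i])
            return 0;
    return 1;
}
```

Calling variants_agree with a = {0xffffffff, 0xffffffff, 0xffffffff} and b = {1, 0, 0} exercises a carry that ripples through every word; both variants then produce all-zero words and a carry-out of 1.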
You can get rid of the subscript notation and use pointer arithmetic instead, which is said to be faster; however, I don't know how much CPU time that would actually save.
int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
int i;
uint32_t ri, c=0;
for (i = 0; i < n; i ++)
{
ri = *(a + i) + *(b + i) + c;
c = ((ri < *(a + i)) || ((ri == *(a +i)) && c));
*(r + i) = ri;
}
return ((int) c);
}
For the reasons, see: Accessing array values via pointer arithmetic vs. subscripting in C
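c = (ri < a[i]) | ((ri == a[i]) & c) might be faster than your code: it computes the same carry, but the bitwise operators do not short-circuit, so no branch is needed to test whether c == 0.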
I want to create a random int array in CUDA, and I need to check each block of indices (0-9, 10-19, ...) for duplicates and repair them.
Any idea how to do this efficiently? I really don't want to compare each element with every other one.
Here is my code:
__global__ void generateP(int *d_p, unsigned long seed)
{
int i = X * blockIdx.x + threadIdx.x * X;
int buffer[X];
curandState state;
curand_init(seed, i, 0, &state);
for (int j = 0; j < X; j++)
{
float random = HB + (curand_uniform(&state) * (LB - HB));
buffer[j] = (int)truncf(random);
}
// TODO unique check and repair duplicity
for (int k = 0; k < X; k++)
{
d_p[i] = buffer[k];
i++;
}
}
Is there some kind of Contains function in CUDA? Thanks for the help.
You really are asking the wrong question here. You should be looking for a way of randomly ordering a list of unique values, rather than attempting to fill a list with unique random numbers by searching and replacing duplicates repeatedly until you have the unique list. The latter is terribly inefficient and a poor fit to a data parallel execution model like CUDA.
There are simple, robust algorithms for randomly shuffling a list of values that require at most N calls to a random generator to shuffle a list of N values. The Fisher-Yates shuffle is almost universally used for this.
I'm not going to comment much on this code except to say that it illustrates one approach to doing this, using one thread per list. It isn't intended to be performant, just a teaching example to get you started. I think it probably does close to what you are asking for (more based on your previous attempt at this question than this one). I recommend you study it as a lead-in to writing your own implementation which does whatever it is you are trying to do.
#include <ctime>
#include <iostream>
#include <curand_kernel.h>
struct source
{
int baseval;
__device__ source(int _b) : baseval(_b) {};
__device__ int operator()(int v) { return baseval + v; };
};
__device__ int urandint(int minval, int maxval, curandState_t& state)
{
float rval = curand_uniform(&state);
rval *= (float(maxval) - float(minval) + 0.99999999f);
rval += float(minval);
return (int)truncf(rval);
}
template<int X>
__global__ void kernel(int* out, int N, unsigned long long seed)
{
int tidx = threadIdx.x + blockIdx.x * blockDim.x;
if (tidx < N) {
curandState_t state;
curand_init(seed, tidx, 0, &state);
int seq[X];
source vals(tidx * X);
// Fisher-Yates shuffle straight from Wikipedia
#pragma unroll
for(int i=0; i<X; ++i) {
int j = urandint(0, i, state);
if (j != i)
seq[i] = seq[j];
seq[j] = vals(i);
}
// Copy local shuffled sequence to output array
int* dest = &out[X * tidx];
memcpy(dest, &seq[0], X * sizeof(int));
}
}
int main(void)
{
const int X = 10;
const int nsets = 200;
int* d_result;
size_t sz = size_t(nsets) * sizeof(int) * size_t(X);
cudaMalloc((void **)&d_result, sz);
int tpb = 32;
int nblocks = (nsets/tpb) + ((nsets%tpb !=0) ? 1 : 0);
kernel<X><<<nblocks, tpb>>>(d_result, nsets, std::time(0));
int h_result[nsets][X];
cudaMemcpy(&h_result[0][0], d_result, sz, cudaMemcpyDeviceToHost);
for(int i=0; i<nsets; ++i) {
std::cout << i << " : ";
for(int j=0; j<X; ++j) {
std::cout << h_result[i][j] << ",";
}
std::cout << std::endl;
}
cudaDeviceReset();
return 0;
}
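For experimenting on the host, the same shuffle loop can be sketched in plain C. This is only an illustration: shuffled_range, rand_upto and is_permutation are hypothetical names, and rand() with a modulo stands in for cuRAND, which introduces a slight bias that is accepted here for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Integer in [0, bound]; rand() % k has a slight modulo bias. */
static int rand_upto(int bound)
{
    return rand() % (bound + 1);
}

/* Fills seq[0..n-1] with a random permutation of base, base+1, ...,
   base+n-1, using the same loop structure as the kernel above. */
void shuffled_range(int *seq, int n, int base)
{
    for (int i = 0; i < n; i++) {
        int j = rand_upto(i);
        if (j != i)
            seq[i] = seq[j];   /* move the old occupant of slot j to slot i */
        seq[j] = base + i;     /* place the new value in slot j */
    }
}

/* Returns 1 if seq holds each of base..base+n-1 exactly once (n <= 64). */
int is_permutation(const int *seq, int n, int base)
{
    unsigned char seen[64] = {0};
    assert(n >= 0 && n <= 64);
    for (int i = 0; i < n; i++) {
        int v = seq[i] - base;
        if (v < 0 || v >= n || seen[v])
            return 0;
        seen[v] = 1;
    }
    return 1;
}
```

Because every value is written exactly once and only moved afterwards, the result is duplicate-free by construction, so no search-and-repair pass is needed.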
So, I have something like this. It is supposed to create arrays of 10, 20, 50, 100, ... up to 5000 random numbers, sort each one with insertion sort, and print out how many comparisons and swaps were done. However, I get a runtime exception when I reach an array of 200 numbers: "Access violation writing location 0x00B60000." Sometimes I don't even reach 200 and it stops right after 10 numbers. I have literally no idea why.
long *arrayIn;
int *swap_count = (int*)malloc(sizeof(int)), *compare_count = (int*)malloc(sizeof(int));
compare_count = 0;
swap_count = 0;
int i, j;
for (j = 10; j <= 1000; j*=10) {
for (i = 1; i <= 5; i++){
if (i == 1 || i == 2 || i == 5) {
int n = i * j;
arrayIn = malloc(sizeof(long)*n);
fill_array(&arrayIn, n);
InsertionSort(&arrayIn, n, &swap_count, &compare_count);
print_array(&arrayIn, n, &swap_count, &compare_count);
compare_count = 0;
swap_count = 0;
free(arrayIn);
}
}
}
EDIT: OK, with free(arrayIn); I get "Stack cookie instrumentation code detected a stack-based buffer overrun." and get nowhere. Without it, it's "just" "Access violation writing location 0x00780000.", but I eventually get up to 200 numbers.
void fill_array(int *arr, int n) {
int i;
for (i = 0; i < n; i++) {
arr[i] = (RAND_MAX + 1)*rand() + rand();
}
}
void InsertionSort(int *arr, int n, int *swap_count, int *compare_count) {
int i, j, t;
for (j = 0; j < n; j++) {
(*compare_count)++;
t = arr[j];
i = j - 1;
*swap_count = *swap_count + 2;
while (i >= 0 && arr[i]>t) { // the compare_count increment is missing here
*compare_count = *compare_count + 2;
arr[i + 1] = arr[i];
(*swap_count)++;
i--;
(*swap_count)++;
}
arr[i + 1] = t;
(*swap_count)++;
}
}
I am sure your compiler told you what was wrong.
You are passing a long** to a function that expects an int* at the line
fill_array(&arrayIn, n);
function prototype is
void fill_array(int *arr, int n)
Same problem with the other function. From there, anything can happen.
Always, ALWAYS heed the warnings your compiler gives you.
MAJOR EDIT
First - yes, the name of an array already acts as a pointer to its first element (and arrayIn is a pointer to begin with), so no & is needed.
Second - declare function prototypes at the start of your code; the compiler will then give you helpful messages that let you catch these mistakes.
Third - if you want to pass the address of a simple variable to a function, there is no need for a malloc; just use the address of the variable.
Fourth - the rand() function returns an integer between 0 and RAND_MAX. The code
a[i] = (RAND_MAX + 1) * rand() + rand();
is a roundabout way of getting
a[i] = rand();
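since (RAND_MAX + 1) will overflow and give you zero (assuming RAND_MAX equals INT_MAX; strictly speaking, signed overflow is undefined behavior, so even that result is not guaranteed)... If you actually wanted to be able to get a "really big" random number, you would have to do the following: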
1) make sure a is a long * (with the correct prototypes etc)
2) convert the numbers before adding / multiplying:
a[i] = (RAND_MAX + 1L) * rand() + rand();
might do it - or maybe you need to do some more casting to (long); I can never remember my order of precedence so I usually would do
a[i] = ((long)(RAND_MAX) + 1L) * (long)rand() + (long)rand();
to be 100% sure.
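As a minimal sketch of the same idea with the width made explicit: the helper below (big_rand is a made-up name) widens to long long before multiplying. It assumes int is 32 bits, so RAND_MAX is at most 2^31 - 1 and the product fits comfortably in the at-least-64-bit long long.

```c
#include <assert.h>
#include <stdlib.h>

/* Combines two rand() draws into one value in [0, (RAND_MAX + 1)^2 - 1].
   Assumption: int is 32 bits, so span * rand() stays below 2^62. */
long long big_rand(void)
{
    const long long span = (long long)RAND_MAX + 1;  /* widen BEFORE adding 1 */
    assert(span > 0);
    return span * rand() + rand();
}
```

The cast has to be applied to RAND_MAX itself; writing (long long)(RAND_MAX + 1) would still perform the addition in int and overflow first.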
Putting these and other lessons together, here is an edited version of your code that compiles and runs (I did have to "invent" a print_array) - I have written comments where the code needed changing to work. The last point above (making long random numbers) was not taken into account in this code yet.
#include <stdio.h>
#include <stdlib.h>
// include prototypes - it helps the compiler flag errors:
void fill_array(int *arr, int n);
void InsertionSort(int *arr, int n, int *swap_count, int *compare_count);
void print_array(int *arr, int n, int *swap_count, int *compare_count);
int main(void) {
// change data type to match function
int *arrayIn;
// instead of mallocing, use a fixed location:
int swap_count, compare_count;
// often a good idea to give your pointers a _p name:
int *swap_count_p = &swap_count;
int *compare_count_p = &compare_count;
// the pointer must not be set to zero: it's the CONTENTs that you set to zero
*compare_count_p = 0;
*swap_count_p = 0;
int i, j;
for (j = 10; j <= 1000; j*=10) {
for (i = 1; i <= 5; i++){
if (i == 1 || i == 2 || i == 5) {
int n = i * j;
arrayIn = malloc(sizeof(*arrayIn)*n); // element size must match the int* type
fill_array(arrayIn, n);
InsertionSort(arrayIn, n, swap_count_p, compare_count_p);
print_array(arrayIn, n, swap_count_p, compare_count_p);
swap_count = 0;
compare_count = 0;
free(arrayIn);
}
}
}
return 0;
}
void fill_array(int *arr, int n) {
int i;
for (i = 0; i < n; i++) {
// arr[i] = (RAND_MAX + 1)*rand() + rand(); // causes integer overflow
arr[i] = rand();
}
}
void InsertionSort(int *arr, int n, int *swap_count, int *compare_count) {
int i, j, t;
for (j = 0; j < n; j++) {
(*compare_count)++;
t = arr[j];
i = j - 1;
*swap_count = *swap_count + 2;
while (i >= 0 && arr[i]>t) { // the compare_count increment is missing here
*compare_count = *compare_count + 2;
arr[i + 1] = arr[i];
(*swap_count)++;
i--;
(*swap_count)++;
}
arr[i + 1] = t;
(*swap_count)++;
}
}
void print_array(int *a, int n, int* sw, int *cc) {
int ii;
for(ii = 0; ii < n; ii++) {
if(ii%20 == 0) printf("\n");
printf("%d ", a[ii]);
}
printf("\n\nThis took %d swaps and %d comparisons\n\n", *sw, *cc);
}
You are assigning the literal value 0 to some pointers. You are also mixing "pointers" with "address-of-pointers"; &swap_count gives the address of the pointer, not the address of its value.
First off, no need to malloc here:
int *swap_count = (int*)malloc(sizeof(int)) ..
Just make an integer:
int swap_count;
Then you no longer assign
swap_count = 0;
to a pointer (which is what causes your errors). Doing so on a regular int variable is, of course, just fine.
(With the above fixed, &swap_count ought to work, so don't change that as well.)
As I said in the comments, you are passing the addresses of pointers rather than the pointers themselves.
The ampersand prefix (&) passes the address of something.
You only need it when the variable is a plain value (a primitive type) and the function expects a pointer to it.
For example, you would use it to pass the address of an int. But you are already holding pointers, so there is no need for the ampersand.
What actually happens is that the functions read and write the memory where the pointer itself is stored, not the memory it points to. This corrupts adjacent memory and causes the errors you see.
Remove every & in front of a pointer in these lines:
fill_array(&arrayIn, n);
InsertionSort(&arrayIn, n, &swap_count, &compare_count);
print_array(&arrayIn, n, &swap_count, &compare_count);
So it becomes:
fill_array(arrayIn, n);
InsertionSort(arrayIn, n, swap_count, compare_count);
print_array(arrayIn, n, swap_count, compare_count);
I also note that you malloc memory for single primitive values, which can be done much more simply:
int compare_count = 0;
int swap_count = 0;
But if you choose to use the last block of code, DO use &swap_count and &compare_count since you are passing primitive types, not pointers!
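The rule can be illustrated with a tiny sketch. The function names bump and count_descents are hypothetical and are not part of the code in the question.

```c
#include <assert.h>

/* Takes the ADDRESS of a counter, so the caller's variable is updated. */
void bump(int *counter)
{
    assert(counter != 0);
    (*counter)++;
}

/* Counts adjacent pairs that are out of order; the result is written
   through the pointer. */
void count_descents(const int *arr, int n, int *count)
{
    *count = 0;               /* zero the pointee, not the pointer */
    for (int i = 1; i < n; i++)
        if (arr[i - 1] > arr[i])
            bump(count);      /* count is already a pointer: no & here */
}
```

A caller that owns a plain int count; would write count_descents(data, n, &count); while the array argument is passed without an ampersand because data already converts to a pointer.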
I have n (8 bit) character strings all of them of the same length (say m), and another string s of the same length. I need to compute Hamming distances from s to each of the others strings. In plain C, something like:
unsigned char strings[n][m];
unsigned char s[m];
int distances[n];
for(i=0; i<n; i++) {
distances[i] = 0;
for(j=0; j<m; j++) {
if(strings[i][j] != s[j])
distances[i]++;
}
}
I would like to use SIMD instructions with gcc to perform such computations more efficiently. I have read that PcmpIstrI in SSE 4.2 can be useful and my target computer supports that instruction set, so I would prefer a solution using SSE 4.2.
EDIT:
I wrote following function to compute Hamming distance between two strings:
static inline int popcnt128(__m128i n) {
const __m128i n_hi = _mm_unpackhi_epi64(n, n);
return _mm_popcnt_u64(_mm_cvtsi128_si64(n)) + _mm_popcnt_u64(_mm_cvtsi128_si64(n_hi));
}
int HammingDist(const unsigned char *p1, unsigned const char *p2, const int len) {
#define MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY)
__m128i smm1 = _mm_loadu_si128 ((__m128i*) p1);
__m128i smm2 = _mm_loadu_si128 ((__m128i*) p2);
__m128i ResultMask;
int iters = len / 16;
int diffs = 0;
int i;
for(i=0; i<iters; i++) {
ResultMask = _mm_cmpestrm (smm1,16,smm2,16,MODE);
diffs += popcnt128(ResultMask);
p1 = p1+16;
p2 = p2+16;
smm1 = _mm_loadu_si128 ((__m128i*)p1);
smm2 =_mm_loadu_si128 ((__m128i*)p2);
}
int mod = len % 16;
if(mod>0) {
ResultMask = _mm_cmpestrm (smm1,mod,smm2,mod,MODE);
diffs += popcnt128(ResultMask);
}
return diffs;
}
So I can solve my problem by means of:
for(i=0; i<n; i++) {
distances[i] = HammingDist(s, strings[i], m);
}
Is this the best I can do or can I use the fact that one of the strings compared is always the same? In addition, should I do some alignment on my arrays to improve performance?
ANOTHER ATTEMPT
Following Harold's recommendation, I have written the following code:
void _SSE_hammingDistances(const ByteP str, const ByteP strings, int *ds, const int n, const int m) {
int iters = m / 16;
__m128i *smm1, *smm2, diffs;
for(int j=0; j<n; j++) {
smm1 = (__m128i*) str;
smm2 = (__m128i*) &strings[j*(m+1)]; // m+1, as strings are '\0' terminated
diffs = _mm_setzero_si128();
for (int i = 0; i < iters; i++) {
diffs = _mm_add_epi8(diffs, _mm_cmpeq_epi8(*smm1, *smm2));
smm1 += 1;
smm2 += 1;
}
int s = m;
signed char *ptr = (signed char *) &diffs;
for(int p=0; p<16; p++) {
s += *ptr;
ptr++;
}
*ds = s;
ds++;
}
}
but I am not able to do the final addition of bytes in __m128i by using psadbw. Can anyone please help me with that?
Here's an improved version of your latest routine, which uses PSADBW (_mm_sad_epu8) to eliminate the scalar code:
void hammingDistances_SSE(const uint8_t * str, const uint8_t * strings, int * const ds, const int n, const int m)
{
const int iters = m / 16;
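assert((m & 15) == 0); // m must be a multiple of 16
assert(iters <= 255); // the 8-bit per-lane counters would overflow beyond 255 blocks
for (int j = 0; j < n; j++)
{
const uint8_t *cur = &strings[j*(m+1)]; // m+1, as strings are '\0' terminated
__m128i diffs = _mm_setzero_si128();
for (int i = 0; i < iters; i++)
{
// load the i-th 16-byte block of both strings
const __m128i smm1 = _mm_loadu_si128((const __m128i*)(str + 16*i));
const __m128i smm2 = _mm_loadu_si128((const __m128i*)(cur + 16*i));
// cmpeq gives 0xFF (-1) for each equal byte, so the subtraction counts matches per lane
diffs = _mm_sub_epi8(diffs, _mm_cmpeq_epi8(smm1, smm2));
}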
diffs = _mm_sad_epu8(diffs, _mm_setzero_si128());
ds[j] = m - (_mm_extract_epi16(diffs, 0) + _mm_extract_epi16(diffs, 4));
}
}
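When working on a SIMD routine like this, it helps to have a scalar reference to compare against. The following is a sketch only: hammingDistances_ref is a made-up name, and it assumes the same layout as above (n strings of length m, each followed by a '\0', stored back to back).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: ds[j] receives the number of positions in which
   str differs from the j-th string. The stride between strings is m + 1. */
void hammingDistances_ref(const uint8_t *str, const uint8_t *strings,
                          int *ds, int n, int m)
{
    assert(n >= 0 && m >= 0);
    for (int j = 0; j < n; j++) {
        const uint8_t *cur = &strings[j * (m + 1)];
        int d = 0;
        for (int i = 0; i < m; i++)
            d += (str[i] != cur[i]);   /* comparison yields 0 or 1 */
        ds[j] = d;
    }
}
```

Running both routines on the same random input and comparing the ds arrays is a cheap way to catch indexing or overflow mistakes in the vectorized version.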
I have two sets of sorted elements and want to merge them in a way that I can parallelize later. I have a simple merge implementation that has data dependencies because it uses the maximum function, and a first version of a parallelizable merge that uses binary search to find the rank and compute the index for a given value.
The getRank function returns the number of elements less than or equal to the given needle.
#define ATYPE int
int getRank(ATYPE needle, ATYPE *haystack, int size) {
int low = 0, mid;
int high = size - 1;
int cmp;
ATYPE midVal;
while (low <= high) {
mid = ((unsigned int) (low + high)) >> 1;
midVal = haystack[mid];
cmp = midVal - needle;
if (cmp < 0) {
low = mid + 1;
} else if (cmp > 0) {
high = mid - 1;
} else {
return mid; // key found
}
}
return low; // key not found
}
The merge algorithms operate on the two sorted sets a and b and store the result in c.
void simpleMerge(ATYPE *a, int n, ATYPE *b, int m, ATYPE *c) {
int i, l = 0, r = 0;
for (i = 0; i < n + m; i++) {
if (l < n && (r == m || max(a[l], b[r]) == b[r])) {
c[i] = a[l];
l++;
} else {
c[i] = b[r];
r++;
}
}
}
void merge(ATYPE *a, int n, ATYPE *b, int m, ATYPE *c) {
int i;
for (i = 0; i < n; i++) {
c[i + getRank(a[i], b, m)] = a[i];
}
for (i = 0; i < m; i++) {
c[i + getRank(b[i], a, n)] = b[i];
}
}
The merge function is very slow for a large number of elements, although it can be parallelized; simpleMerge is always faster, even though it cannot be parallelized.
So my question now is, do you know any better approach for parallel merging and if so, can you point me to a direction or is my code just bad?
Complexity of simpleMerge function:
O(n + m)
Complexity of merge function:
O(n*logm + m*logn)
Without having thought about this too much, my suggestion for parallelizing it is to find a single value around the middle of each array, using something similar to the getRank function, and to run the simple merge on each part from there. That can be O(n + m + log m + log n) = O(n + m) (even if you do a small, constant number of lookups to find a value around the middle).
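A rough sketch of that idea, assuming int elements: pick the middle element of a as a pivot, binary-search its position in b, and merge the two halves independently. The names lower_rank, seq_merge and split_merge are invented for this example, and the two halves run back to back here rather than on real threads.

```c
#include <assert.h>

/* Number of elements in sorted haystack[0..size) strictly less than needle. */
static int lower_rank(int needle, const int *haystack, int size)
{
    int low = 0, high = size;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (haystack[mid] < needle)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}

/* Plain sequential two-way merge of sorted a[0..n) and b[0..m) into c. */
static void seq_merge(const int *a, int n, const int *b, int m, int *c)
{
    int l = 0, r = 0, i = 0;
    while (l < n && r < m)
        c[i++] = (a[l] <= b[r]) ? a[l++] : b[r++];
    while (l < n) c[i++] = a[l++];
    while (r < m) c[i++] = b[r++];
}

/* Splits the work at a pivot into two independent merges. */
void split_merge(const int *a, int n, const int *b, int m, int *c)
{
    assert(n >= 0 && m >= 0);
    if (n == 0) { seq_merge(a, n, b, m, c); return; }
    int ha = n / 2;                    /* split point in a */
    int hb = lower_rank(a[ha], b, m);  /* matching split point in b */
    /* The two calls touch disjoint parts of a, b and c, so they could be
       handed to separate threads. */
    seq_merge(a, ha, b, hb, c);
    seq_merge(a + ha, n - ha, b + hb, m - hb, c + ha + hb);
}
```

Everything left of the split is less than or equal to the pivot and everything right of it is greater than or equal to it, so the two output ranges never interleave. Applying the split recursively would give more than two independent pieces.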
The algorithm used by simpleMerge is asymptotically optimal. Its complexity is O(n+m), and you cannot find a better algorithm, since reading the input and writing the output alone takes O(n+m).
int *s;
/* assume memory for s[100] has been allocated elsewhere */
void func (int *a, int *b)
{
int i;
for (i = 0; i < 100; i++)
{
s[i] = a[i] ^ b[i];
}
}
Assume that this particular code snippet is called 1000 times and is the most time-consuming operation in my code. Also assume that the addresses of a and b change on every call. 's' is a global variable that is updated with different sets of values of a and b.
I assume the main performance bottleneck is memory access, because the only other operation is XOR, which is trivial.
Could you please suggest how I can best optimize this code?
The question I really wanted to ask, which I think did not come across properly, is this: suppose the for loop contains 10 such XOR operations, the loop count is 100, and the function is called 1000 times, so that memory access is heavy. If the code is to be executed on a single-core machine, what scope is there for improvement?
I've tested the proposed solutions, plus two others. I was not able to test onemasse's proposal, as the result saved to s[] was not correct, and I was not able to fix it either. I had to make some changes to moonshadow's code. The unit of measurement is clock cycles, so lower is better.
Original code:
#define MAX 100
void inline STACKO ( struct timespec *ts, struct timespec *te ){
int i, *s, *a, *b;
/* allocate MAX ints per array so the loops below stay in bounds */
s = (int *) malloc (MAX * sizeof (int));
a = (int *) malloc (MAX * sizeof (int));
b = (int *) malloc (MAX * sizeof (int));
srand ( 1024 );
for (i = 0; i < MAX; ++i){
a[i] = ( rand() % 2 );
b[i] = ( rand() % 2 );
}
rdtscb_getticks ( ts ); /* start measurement */
for (i = 0; i < MAX; i++)
s[i] = a[i] ^ b[i];
rdtscb_getticks ( te ); /* end measurement */
/*
printf("\n");
for (i = 0; i < MAX; ++i)
printf("%d", s[i]);
printf("\n");
*/
}
New proposal 1: register int
From:
int i, *s, *a, *b;
To:
register int i, *s, *a, *b;
New proposal 2: No array notation
s_end = &s[MAX];
for (s_ptr = &s[0], a_ptr = &a[0], b_ptr = &b[0]; \
s_ptr < s_end; \
++s_ptr, ++a_ptr, ++b_ptr){
*s_ptr = *a_ptr ^ *b_ptr;
}
moonshadow proposed optimization:
s_ptr = &s[0];
a_ptr = &a[0];
b_ptr = &b[0];
for (i = 0; i < (MAX/4); i++){
s_ptr[0] = a_ptr[0] ^ b_ptr[0];
s_ptr[1] = a_ptr[1] ^ b_ptr[1];
s_ptr[2] = a_ptr[2] ^ b_ptr[2];
s_ptr[3] = a_ptr[3] ^ b_ptr[3];
s_ptr+=4; a_ptr+=4; b_ptr+=4;
}
moonshadow proposed optimization + register int:
From:
int i, *s, ...
To:
register int i, *s, ...
Christoffer proposed optimization:
#pragma omp for
for (i = 0; i < MAX; i++)
{
s[i] = a[i] ^ b[i];
}
Results (clock cycles, lower is better):
Original code     1036.727264
New proposal 1     611.147928
New proposal 2     450.788845
moonshadow         713.3845
moonshadow2        452.481192
Christoffer       1054.321943
There is another simple way of optimizing the resulting binary: passing -O2 to gcc tells it that you want optimization. To learn exactly what -O2 does, refer to the gcc man page.
After enabling -O2 (clock cycles, lower is better):
Original code      464.233031
New proposal 1     452.620255
New proposal 2     454.519383
moonshadow         428.651083
moonshadow2        419.317444
Christoffer        452.079057
Source code available at: http://goo.gl/ud52m
Don't use the loop variable to index.
Unroll the loop.
for (i = 0; i < (100/4); i++)
{
s[0] = a[0] ^ b[0];
s[1] = a[1] ^ b[1];
s[2] = a[2] ^ b[2];
s[3] = a[3] ^ b[3];
s+=4; a+=4; b+=4;
}
Work out how to perform SIMD XOR on your platform.
Performing these XORs as an explicit step is potentially more expensive than doing them as part of another calculation: you're having to read from a and b and store the result in s - if s is read again for more calculation, you'd save a read and a write per iteration, and all the function call and loop overhead, by doing the XOR there instead; likewise, if a and b are outputs of some other functions, you do better by performing the XOR at the end of one of those functions.
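As one possible take on the SIMD suggestion, here is a sketch that uses SSE2 intrinsics when the compiler targets x86 with SSE2 and falls back to the plain loop otherwise. The function name xor_ints is made up, and "four at a time" assumes a 4-byte int, which holds on the usual x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* XORs n ints: four per step with SSE2 where available, scalar otherwise. */
void xor_ints(int *s, const int *a, const int *b, size_t n)
{
    size_t i = 0;
    assert(s != 0 && a != 0 && b != 0);
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads and stores: no alignment assumption is made. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(s + i), _mm_xor_si128(va, vb));
    }
#endif
    for (; i < n; i++)        /* scalar tail, or the whole loop without SSE2 */
        s[i] = a[i] ^ b[i];
}
```

With optimization enabled, many compilers vectorize the plain loop on their own, so it is worth measuring before keeping hand-written intrinsics.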
int *s;
/* assume memory for s[100] has been allocated elsewhere */
void func (int *a, int *b)
{
int i;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
s[i] = a[i] ^ b[i];
}
}
Of course, for only a hundred elements you might not see any particular improvement :-)
Just a guess here. If this is a cache issue you could try this:
int *s;
/* assume memory for s[100] has been allocated elsewhere */
void func (int *a, int *b)
{
int i;
memcpy( s, a, 100 * sizeof(int) ); /* size is in bytes, not elements */
for (i = 0; i < 100; i++)
{
s[i] = s[i] ^ b[i];
}
}
The memcpy, although it's a function call, will often be inlined by the compiler if the size argument is a constant. Loop unrolling will probably not help here, as the compiler can do it automatically. But you shouldn't take my word for it; verify on your platform.