I'm trying to add parallelism to this function. I want it to use as many threads as possible and write the results to a file.
The results need to be written to the file in increasing order: the first result first, the second second, and so on.
The keyGen function is simply an MD5 of the integer m, which is used as the starting point for each chain. reduction32 is a reduction function: it takes the first 8 bytes, adds t, and returns that value. When a chain reaches its endpoint, the endpoint is stored in the binary file.
Is there a smart way to make this parallel without changing the order the endpoints are stored in?
void tableGenerator32(uint32_t * text){
    int mMax = 33554432, lMax = 236;
    int m, t, i;
    uint16_t * temp;
    uint16_t * key, ep[2];
    uint32_t tp;
    FILE * write_ptr;

    write_ptr = fopen("table32bits.bin", "wb");
    for(m = 0; m < mMax; m++){
        key = keyGen(m);
        for (t = 0; t < lMax; t++){
            keyschedule(key);
            temp = kasumi_enc(text);
            tp = reduction32(t, temp);
            temp[0] = tp >> 16;
            temp[1] = tp;
            for(i = 0; i < 8; i++){
                key[i] = temp[i % 2];
            }
        }
        for(i = 0; i < 2; i++)
            ep[i] = key[i];
        fwrite(ep, sizeof(ep), 1, write_ptr);
    }
    fclose(write_ptr);
}
The best way to parallelize the above function without running into concurrency issues is to create one memory buffer per thread you wish to use and then divide the task into equal fractions. For example, if you have 4 threads:
one thread performs the task from 0 to mMax / 4
one thread performs the task from mMax / 4 to (mMax / 4) * 2
one thread performs the task from (mMax / 4) * 2 to (mMax / 4) * 3
one thread performs the task from (mMax / 4) * 3 to (mMax / 4) * 4
Then you concatenate the per-thread buffers in order and write them to the file, as in the sketch below.
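A hedged sketch of that idea with pthreads. compute_endpoint() is a hypothetical stand-in for the chain walk from the question (keyGen plus lMax rounds of keyschedule/kasumi_enc/reduction32), and it is assumed to be thread-safe, i.e. no shared global cipher state:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define M_MAX 33554432

/* hypothetical: walks one chain from start point m to its endpoint */
void compute_endpoint(int m, uint16_t ep[2]);

typedef struct { int lo, hi; uint16_t *buf; } job_t;

static void *worker(void *arg) {
    job_t *j = (job_t *)arg;
    for (int m = j->lo; m < j->hi; m++)
        compute_endpoint(m, &j->buf[2 * (m - j->lo)]); /* 2 x uint16_t per result */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    job_t job[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        job[t].lo = (int)((long long)M_MAX * t / NTHREADS);
        job[t].hi = (int)((long long)M_MAX * (t + 1) / NTHREADS);
        job[t].buf = malloc(2 * sizeof(uint16_t) * (size_t)(job[t].hi - job[t].lo));
        pthread_create(&tid[t], NULL, worker, &job[t]);
    }

    FILE *fp = fopen("table32bits.bin", "wb");
    for (int t = 0; t < NTHREADS; t++) {
        /* join and flush in range order, so endpoints keep the serial order */
        pthread_join(tid[t], NULL);
        fwrite(job[t].buf, 2 * sizeof(uint16_t), job[t].hi - job[t].lo, fp);
        free(job[t].buf);
    }
    fclose(fp);
    return 0;
}

Since each buffer is flushed only after its worker finishes, the endpoints land in the file in exactly the order of the serial loop; the price is holding all results in memory at once (33554432 endpoints x 4 bytes = 128 MB here).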
Related
I'm trying to get some experience with OpenCL; the environment is set up and I can create and execute kernels. I am currently trying to compute pi in parallel using the Leibniz formula, but have been receiving some strange results.
The kernel is as follows:
__kernel void leibniz_cl(__global float *space, __global float *result, int chunk_size)
{
    __local float pi[THREADS_PER_WORKGROUP];

    pi[get_local_id(0)] = 0.;
    for (int i = 0; i < chunk_size; i += THREADS_PER_WORKGROUP) {
        // `idx` is the work item's `i` in the grander scheme
        int idx = (get_group_id(0) * chunk_size) + get_local_id(0) + i;
        float idx_f = 1 / ((2 * (float) idx) + 1);

        // Make the fraction negative if needed
        if (idx & 1)
            idx_f = -idx_f;
        pi[get_local_id(0)] += idx_f;
    }

    // Reduction within workgroups (in `pi[]`)
    for (int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
        if (get_local_id(0) < groupsize)
            pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
If I end the function here and set result to pi[get_local_id(0)] for !get_global_id(0) (i.e. have the first work item return its group's reduced value), printing result prints -nan.
Remainder of kernel:
    // Reduction amongst workgroups (into `space[]`)
    if (!get_local_id(0)) {
        space[get_group_id(0)] = pi[get_local_id(0)];
        for (int groupsize = get_num_groups(0) / 2; groupsize > 0; groupsize >>= 1) {
            if (get_group_id(0) < groupsize)
                space[get_group_id(0)] += space[get_group_id(0) + groupsize];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_global_id(0) == 0)
        *result = space[get_group_id(0)] * 4;
}
Returning space[get_group_id(0)] * 4 gives either -nan or a very large number, which is clearly not an approximation of pi.
I can't decide if it is an OpenCL concept I'm missing or a parallel-execution one in general. Any help is appreciated.
Links
Reduction template: OpenCL float sum reduction
Leibniz Formula: https://www.wikiwand.com/en/Leibniz_formula_for_%CF%80
These may not be the most critical issues with the code, but they are a likely source of the problem:
You definitely should use barrier(CLK_LOCAL_MEM_FENCE); before the local reduction, so that every work item has finished accumulating its partial sum before any other work item reads it. This can only be avoided if you know the work group size is equal to or smaller than the number of threads in a wavefront running the same instruction in parallel: 64 for AMD GPUs, 32 for NVIDIA GPUs.
The global reduction must be done in multiple kernel launches, because barrier() only synchronizes work items of the same work group. A clear and 100% working way to insert a global barrier into a kernel is to split it in two at the place where the global barrier is needed.
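For the first point, a minimal sketch of the corrected local reduction; it is the loop from the question with one barrier added in front, so every work item's accumulation is finished before any other work item reads pi[]:

barrier(CLK_LOCAL_MEM_FENCE); // all partial sums written before the reduction starts

for (int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
    if (get_local_id(0) < groupsize)
        pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
    barrier(CLK_LOCAL_MEM_FENCE); // wait before the next halving step
}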
I'm trying to optimize my C/Fortran code for calculating finite-difference gradients using OpenMP. The code outputs the correct values in both the serial and multithreaded cases. However, when I try to store the calculated values in an array, the code slows down a lot compared to just performing the computations.
I wrote a simplified example for asking this question:
I allocate my arrays in C:
/* phi - field over which derivatives are calculated */
float *phi = (float *) calloc(SIZE, sizeof(float));

/* rhs - derivatives in each direction are summed and
   stored in this variable */
float *rhs = (float *) calloc(SIZE, sizeof(float));
My array size is 256^3, and I have 3 cells of "padding" on each end, which makes SIZE equal to 262^3.
Then I call my Fortran function inside the parallel OMP region, dividing the work equally over the threads in the k-direction:
#pragma omp parallel default(none) shared(phi, rhs)
{
    /* Divide up slices in the z-direction over the available threads */
    cur_thread = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    nslices = khi_fb - klo_fb + 1;

    /* Current lo and hi indices in the z-direction */
    cur_klo_fb = klo_fb + nslices * cur_thread/num_threads;
    cur_khi_fb = klo_fb + nslices * (cur_thread + 1)/num_threads - 1;
    if (cur_khi_fb > khi_fb)
        cur_khi_fb = khi_fb;

    /* Start timing the program */
    time0 = omp_get_wtime();

    /* Run 100 times for better timing */
    for (i = 0; i < 100; i++)
    {
        /* Calling Fortran routine for calculating derivatives */
        CALC_DERIV(phi, rhs,
                   &(ilo_gb), &(ihi_gb), &(jlo_gb), &(jhi_gb), &(klo_gb), &(khi_gb),
                   &(ilo_fb), &(ihi_fb), &(jlo_fb), &(jhi_fb),
                   &(cur_klo_fb), &(cur_khi_fb),
                   &dx);
    }
    time1 = omp_get_wtime();
In my Fortran routine, I iterate over the array, calculating the central difference in each direction at each point:
do k=klo_fb,khi_fb
  do j=jlo_fb,jhi_fb
    do i=ilo_fb,ihi_fb
      phi_x = (phi(i+1,j,k) - phi(i-1,j,k))*dx_factor
      phi_y = (phi(i,j+1,k) - phi(i,j-1,k))*dy_factor
      phi_z = (phi(i,j,k+1) - phi(i,j,k-1))*dz_factor
      temp = rhs(i,j,k) + phi_x + phi_y + phi_z
      rhs(i,j,k) = temp
    enddo
  enddo
enddo
Here, the suffix "_fb" refers to the fill box. So the iterations run from 3:258 in the i and j directions, and over cur_klo_fb to cur_khi_fb in the k-direction.
Here's my problem: when I run the code as shown, the timing for 1 thread (serial behavior) is ~2.17 s. When I comment out the line rhs(i,j,k) = temp, the timing is 3e-4 s. Why is there such a big difference? I am doing the same number of computations. The only difference is that I am storing the temporary variable temp in a given array location in rhs. Moreover, I am reading the array phi plenty of times, and that doesn't seem to affect the speed; it is writing temp to the array rhs that slows things down.
When I run with OpenMP, I get a speedup, but it is not optimal. I guess I am missing something.
I hope I explained my problem clearly. I would be happy to provide the complete code I am testing.
EDIT 2:
So, based on the comments, I experimented to see whether the compiler is actually just ignoring the loops altogether.
I have further modified the post to include the suggestions by @jabirali and @VladimirF, and I have added timings too.
So I modified the Fortran loops:
      integer count
      count = 1

c { begin loop over grid
      do k=klo_fb,khi_fb
        do j=jlo_fb,jhi_fb
          do i=ilo_fb,ihi_fb
            phi_x = (phi(i+1,j,k) - phi(i-1,j,k))*dx_factor
            phi_y = (phi(i,j+1,k) - phi(i,j-1,k))*dy_factor
            phi_z = (phi(i,j,k+1) - phi(i,j,k-1))*dz_factor
            temp = rhs(i,j,k) + phi_x + phi_y + phi_z
            temp2 = temp2 + temp
c           rhs(i,j,k) = temp
          enddo
        enddo
      enddo
c } end loop over grid
Here are the timings for various cases (for 1 and 2 threads):
Case 1: With temp2, no rhs(i,j,k), 1 thread: 3.24s
Case 2: With temp2, no rhs(i,j,k), 2 threads: 1.65s
Case 3: Without temp2, with rhs(i,j,k), 1 thread: 1.23s
Case 4: Without temp2, with rhs(i,j,k), 2 threads: 0.74s
Case 5: Without temp2, without rhs, 1 thread: 1e-6s
Still confused :(.
Update 2. Solved! This was a memory issue. Some benchmarking about it is here:
http://dontpad.com/bench_mem
Update. My goal is to achieve the best throughput. All my results are here:
Sequential Results:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdE8yQVNHRkRiQ2VzeElIRWwxMWtRcVE&usp=sharing
Parallel Results*:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdEhTb2plT09PNEs3ajBvWUlVaWt0ZUE&usp=sharing
multsoma_par1_vN, where N determines how the data is accessed by each thread.
N: 1 - NTHREADS displacement, 2 - L1 displacement, 3 - L2 displacement, 4 - TAM/NTHREADS
I am having a hard time trying to figure out why my parallel code runs only slightly faster than the sequential code.
What I basically do is loop through a big array (10^8 elements) of a type (int/float/double) and apply the computation A = A * CONSTANT + B, where A and B are arrays of the same size.
The sequential code only does a single function call.
The parallel version creates pthreads and uses the same function as the thread start routine.
I am using gettimeofday(), RDTSC() and more recently getrusage() to measure timings. My main results are expressed in Clocks per Element (CPE).
My processor is an i5-3570K: 4 cores, no hyper-threading.
The problem is that I get about 2.00 CPE with the sequential code, and when going parallel my best performance was 1.84 CPE. I know that I get some overhead from creating pthreads and calling more timing routines, but I don't think this is the reason for not getting better timings.
I did measure each thread's CPE and executed the program with 1, 2, 3 and 4 threads. When creating only one thread, I get the expected CPE of around 2.00 (plus some overhead expressed in milliseconds, but the overall CPE is not affected at all).
When running with 2 threads or more, the main CPE decreases, but each thread's CPE increases.
With 2 threads I get a main CPE around 1.9 and each thread at 3.8 (why is this not 2.0?!).
The same happens with 3 and 4 threads.
With 4 threads I get a main CPE around 1.85 (my best timing) and each thread at 7.0~7.5 CPE.
Using more threads than available cores (4), I still get a CPE under 2.0, but no better than 1.85 (most times higher, due to overhead).
I suspect that context switching could be the limiting factor here. When running with 2 threads I can count 5 to 10 involuntary context switches per thread...
But I am not so sure about this. Are those seemingly few context switches enough to almost double my CPE? I was expecting to at least get around 1.00 CPE using all my CPU cores.
I went further and analyzed the assembly code for this function. The versions are identical, except for some extra shifts and adds (4 instructions) at the very beginning of the function, outside of the loops.
In case you want to see some code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <cpuid.h>
typedef union{
    unsigned long long int64;
    struct {unsigned int lo, hi;} int32;
} tsc_counter;

#define RDTSC(cpu_c) \
    __asm__ __volatile__ ("rdtsc" : \
        "=a" ((cpu_c).int32.lo), \
        "=d" ((cpu_c).int32.hi) )

#define CNST 5
#define NTHREADS 4
#define L1_SIZE 8096
#define L2_SIZE 72512

typedef int data_t;

data_t * A;
data_t * B;
int tam;
double avg_thread_CPE;
tsc_counter thread_t0[NTHREADS], thread_t1[NTHREADS];
struct timeval thread_sec0[NTHREADS], thread_sec1[NTHREADS];

void fillA_B(int tam){
    int i;
    for (i=0;i<tam;i++){
        A[i]=2; B[i]=2;
    }
    return;
}

void* multsoma_par4_v4(void *arg){
    int w;
    int i,j;
    int *id = (int *) arg;
    int limit = tam-14;
    int size = tam/NTHREADS;
    int tam2 = ((*id+1)*size);
    int limit2 = tam2-14;

    gettimeofday(&thread_sec0[*id],NULL);
    RDTSC(thread_t0[*id]);

    // Multiply and add
    for (i=(*id)*size;i<limit2 && i<limit;i+=15){
        A[i] = A[i] * CNST + B[i];
        A[i+1] = A[i+1] * CNST + B[i+1];
        A[i+2] = A[i+2] * CNST + B[i+2];
        A[i+3] = A[i+3] * CNST + B[i+3];
        A[i+4] = A[i+4] * CNST + B[i+4];
        A[i+5] = A[i+5] * CNST + B[i+5];
        A[i+6] = A[i+6] * CNST + B[i+6];
        A[i+7] = A[i+7] * CNST + B[i+7];
        A[i+8] = A[i+8] * CNST + B[i+8];
        A[i+9] = A[i+9] * CNST + B[i+9];
        A[i+10] = A[i+10] * CNST + B[i+10];
        A[i+11] = A[i+11] * CNST + B[i+11];
        A[i+12] = A[i+12] * CNST + B[i+12];
        A[i+13] = A[i+13] * CNST + B[i+13];
        A[i+14] = A[i+14] * CNST + B[i+14];
    }
    for (; i<tam2 && i<tam; i++)
        A[i] = A[i] * CNST + B[i];

    RDTSC(thread_t1[*id]);
    gettimeofday(&thread_sec1[*id],NULL);

    double CPE, elapsed_time;
    CPE = ((double)(thread_t1[*id].int64-thread_t0[*id].int64))/((double)(size));

    elapsed_time = (double)(thread_sec1[*id].tv_sec-thread_sec0[*id].tv_sec)*1000;
    elapsed_time+= (double)(thread_sec1[*id].tv_usec - thread_sec0[*id].tv_usec)/1000;

    //printf("Thread %d workset - %d\n",*id,size);
    //printf("CPE Thread %d - %lf\n",*id, CPE);
    //printf("Time Thread %d - %lf\n",*id, elapsed_time/1000);

    avg_thread_CPE+=CPE;
    free(arg);
    pthread_exit(NULL);
}

void imprime(int tam){
    int i;
    int ans = 12;
    for (i=0;i<tam;i++){
        //printf("%d ",A[i]);
        //checking...
        if (A[i]!=ans) printf("WA!!\n");
    }
    printf("\n");
    return;
}

int main(int argc, char *argv[]){
    tsc_counter t0,t1;
    struct timeval sec0,sec1;
    pthread_t thread[NTHREADS];

    double CPE;
    double elapsed_time;

    int i;
    int* id;

    tam = atoi(argv[1]);

    A = (data_t*) malloc (tam*sizeof(data_t));
    B = (data_t*) malloc (tam*sizeof(data_t));

    fillA_B(tam);
    avg_thread_CPE = 0;

    //Start computing...
    gettimeofday(&sec0,NULL);
    RDTSC(t0); //Time stamp 0

    for (i=0;i<NTHREADS;i++){
        id = (int*) malloc(sizeof(int));
        *id = i;
        if (pthread_create(&thread[i], NULL, multsoma_par4_v4, (void*)id)) {
            printf("--ERRO: pthread_create()\n"); exit(-1);
        }
    }

    for (i=0; i<NTHREADS; i++) {
        if (pthread_join(thread[i], NULL)) {
            printf("--ERRO: pthread_join() \n"); exit(-1);
        }
    }

    RDTSC(t1); //Time stamp 1
    gettimeofday(&sec1,NULL);
    //End computing...

    imprime(tam);

    CPE = ((double)(t1.int64-t0.int64))/((double)(tam)); // difference between time stamps / number of elements
    elapsed_time = (double)(sec1.tv_sec-sec0.tv_sec)*1000;
    elapsed_time+= (double)(sec1.tv_usec - sec0.tv_usec)/1000;

    printf("Main CPE: %lf\n",CPE);
    printf("Avg Thread CPE: %lf\n",avg_thread_CPE/NTHREADS);
    printf("Time: %lf\n",elapsed_time/1000);
    free(A); free(B);

    return 0;
}
I appreciate any help.
After seeing the full code, I rather agree with the guess of @nosid in the comments: since the ratio of compute operations to memory loads is low, and the data (about 800 MB if I am not mistaken) don't fit in cache, memory bandwidth is likely the limiting factor. The link to main memory is shared by all cores in a processor, so when its bandwidth is saturated, all memory operations start stalling and take longer; thus CPE increases.
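As a rough sanity check (assuming data_t stays int and the i5-3570K's 3.4 GHz base clock): two arrays of 10^8 four-byte elements is 2 x 4 x 10^8 = 800 MB, and each element update reads A[i] and B[i] and writes A[i] back, roughly 12 bytes of traffic per element. At 2.0 CPE that works out to (3.4e9 / 2.0) x 12 = ~20 GB/s, already in the neighborhood of dual-channel DDR3 peak bandwidth.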
Also, the following place in your code is a data race:
avg_thread_CPE += CPE;
since you sum up CPE values calculated on different threads into a single global variable without any synchronization.
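A minimal fix sketch, assuming a single mutex is acceptable here (a per-thread CPE array summed after pthread_join would avoid the lock entirely):

pthread_mutex_t cpe_lock = PTHREAD_MUTEX_INITIALIZER;

/* inside multsoma_par4_v4, replacing the unsynchronized update */
pthread_mutex_lock(&cpe_lock);
avg_thread_CPE += CPE;
pthread_mutex_unlock(&cpe_lock);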
Below I have left the part of my initial answer, including the "first statement" referred to in the comments. I still find it correct, for the definition of CPE as the number of clocks taken by the operations on a single element.
You should not expect the clocks-per-element (CPE) metric to decrease due to the use of multiple threads. By definition, it is how fast a single data item is processed, on average. Threading helps to process all the data faster (by simultaneous processing on different cores), so the elapsed wallclock time, i.e. the time to execute the whole program, should be expected to decrease.
I have a problem that is parallel on two levels: I have a ton of sets of (x0, x1, y0, y1) coordinate pairs, which are turned into the variables vdx, vdy and vyy, and for each of these sets I'm trying to calculate the values of all "monomials" composed of them up to degree n (i.e. all possible combinations of different powers of them, like vdx^3*vdy*vyy^2 or vdx*1*vyy^4). These values are then added up over all the sets.
My strategy (and for now I'd just like to get it to work; it doesn't have to be optimized with multiple kernels or complex reductions, unless it really has to be) is to have each thread deal with one set of coordinate pairs and calculate the values of all the corresponding monomials. Each block's shared memory holds all the monomial sums, and when the block is done, the first thread in the block adds the result to the global sum. Since each block's shared memory is accessed by all threads in all places, I'm using atomicAdd; same with the blocks and the global memory.
Unfortunately there still seems to be a race condition somewhere, since I get different results every time I run the kernel.
If it helps, I'm currently using degree = 3 and omitting one of the variables, which means that in the code below the innermost for loop (over evbl) doesn't do anything and just repeats 4 times. Indeed, the output of the kernel looks like this: 51502,55043.1,55043.1,51502,47868.5,47868.5,48440.5,48440.6,46284.7,46284.7,46284.7,46284.7,46034.3,46034.3,46034.3,46034.3,44972.8,44972.8,44972.8,44972.8,43607.6,43607.6,43607.6,43607.6,43011,43011,43011,43011,42747.8,42747.8,42747.8,42747.8,45937.8,45937.8,46509.9,46509.9,... and it's noticeable that there is a (rough) pattern of 4-tuples. But every time I run it the values are all very different.
Everything is in floats, but I'm on a compute capability 2.1 GPU, so that shouldn't be a problem. cuda-memcheck also reports no errors.
Can somebody with more CUDA experience give me some pointers on how to track down the race condition here?
__global__ void kernel(...) {
    extern __shared__ float s_data[];
    // just use global memory for now
    // get threadID:
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= nPairs) return;

    // ... do some calculations to get x/y...

    // calculate vdx, vdy and vyy
    float vdx = (x1 - x0)/(float)xheight;
    float vdy = (y1 - y0)/(float)xheight;
    float vyy = 0.5*(y0 + y1)/(float)xheight;

    const int offs1 = degree + 1;
    const int offs2 = offs1 * offs1;
    const int offs3 = offs2 * offs1;
    float sol = 1.0;

    // now calculate monomial results and store in shared memory
    for(int evdx = 0; evdx <= degree; evdx++) {
        for(int evdy = 0; evdy <= degree; evdy++) {
            for(int evyy = 0; evyy <= degree; evyy++) {
                for(int evbl = 0; evbl <= degree; evbl++) {
                    sol = powf(vdx, evdx) * powf(vdy, evdy) * powf(vyy, evyy);
                    atomicAdd(&(s_data[evbl + offs1*evyy + offs2*evdy +
                                       offs3*evdx]), sol/1000.0 );
                }
            }
        }
    }

    // now copy shared memory to global
    __syncthreads();
    if(threadIdx.x == 0) {
        for(int i = 0; i < nMonomials; i++) {
            atomicAdd(&outmD[i], s_data[i]);
        }
    }
}
You are using shared memory but you are never initializing it.
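For instance, a minimal sketch of one way to fix it, assuming s_data holds nMonomials partial sums (the zeroing must complete before any thread's first atomicAdd):

// zero the shared sums cooperatively, then synchronize
for (int i = threadIdx.x; i < nMonomials; i += blockDim.x)
    s_data[i] = 0.0f;
__syncthreads();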
I want to implement an optimized queue between threads. To increase performance, I want to use a pipelining technique by splitting the queue into parts.
I have a large queue for communication between two threads, one called the producer and the other the consumer. By splitting the queue, while the producer writes in one part of the queue, the consumer can read the part that was already written by the producer. And while the consumer is reading one part of the queue, the producer can write in the other part.
But I suspect that when the cache reads the array backing the queue, the part size won't match the cache line size.
So I want to know how much data the cache brings in at once when reading or writing the array, i.e. the cache line size.
If you're running on Linux, this information is sometimes listed in /proc/cpuinfo as cache_alignment.
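On glibc you can also query it programmatically; a minimal sketch (note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and sysconf() may return 0 or -1 where the value is unavailable):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: L1 data cache line size in bytes */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 dcache line: %ld bytes\n", line);
    return 0;
}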
You could also find this information indirectly by stepping through an array, varying your stride, and timing the loop. When accesses aren't block aligned you'll see the performance drop, so you can get a pretty good idea of what your block size is. Here's a quick and dirty version that basically does this; I think it'll give you a good idea:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main () {
    int i, STEP_SIZE = 8;
    int * a;
    struct timeval t1, t2;
    double el;

    a = (int*)malloc(1024*1024*64*sizeof(int));
    for (i = 0; i < 1024*1024*64; i++)
        a[i] = 0;

    gettimeofday(&t1, NULL);
    for (i = 0; i < 1024*1024*64; i += STEP_SIZE)
        a[i] += 10;
    gettimeofday(&t2, NULL);

    el = (t2.tv_sec - t1.tv_sec) * 1000.0;
    el += (t2.tv_usec - t1.tv_usec) / 1000.0;
    printf("%d %3.2f\n", STEP_SIZE, el);
    return 0;
}
Basically you would want to vary STEP_SIZE (1, 2, 4, 8, 16, ...) and watch the timings: the total time drops roughly in half each time you double the stride while several accesses still land in the same cache line, and it stops improving once the stride reaches the line size, because every access then touches a new line.
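For example, a variant of the same program that sweeps the stride (same idea, just wrapped in a loop; the array size and stride range are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (1024 * 1024 * 64)

int main(void) {
    int *a = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++)
        a[i] = 0;

    for (int step = 1; step <= 64; step *= 2) {
        struct timeval t1, t2;
        gettimeofday(&t1, NULL);
        for (int i = 0; i < N; i += step)
            a[i] += 10;
        gettimeofday(&t2, NULL);

        double el = (t2.tv_sec - t1.tv_sec) * 1000.0
                  + (t2.tv_usec - t1.tv_usec) / 1000.0;
        printf("stride %2d: %7.2f ms\n", step, el);
    }
    free(a);
    return 0;
}

The time should roughly halve with each doubling of the stride while consecutive accesses share a cache line, and flatten out once the stride reaches the line size (e.g. around stride 16 for 4-byte ints on a 64-byte line).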