Optimize and rewrite the following C code

This is a textbook question that involves rewriting some C code to make it perform best on a given processor architecture.
Given: targeting a single superscalar processor with 4 adders and 2 multiplier units.
Input structure (initialized elsewhere):
struct s {
    short a;
    unsigned v;
    short b;
} input[100];
Here is the routine to operate on this data. Obviously correctness must be ensured, but the goal is to optimize the crap out of it.
int compute(int x, int *r, int *q, int *p) {
    int i;
    for(i = 0; i < 100; i++) {
        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }
    return i;
}
So this method has three statements per iteration that update the integers pointed to by r, q, and p.
Here's my attempt with comments explaining what I'm thinking:
//Use temp variables so we don't keep using loads and stores for mem accesses;
//hopefully the temps will just be kept in the register file
int r_temp = *r;
int q_temp = *q;
for (i = 0; i < 99; i = i + 2) {
    struct s data1 = input[i];
    struct s data2 = input[i+1]; //going to try partially unrolling loop
    int a1 = data1.a;
    int a2 = data2.a;
    int b1 = data1.b;
    int b2 = data2.b;
    int v1 = data1.v;
    int v2 = data2.v;
    //I will use brackets to make clear the order of operations I was planning
    //with respect to the functional (adder, mul) units available
    //This is calculating the next iteration's new q value
    //from q += v1 + a1 + b1, or q(new) = q(old) + v1 + a1 + b1
    q_temp = ((v1 + q_temp) + (a1 + b1)) + ((a2 + b2) + v2);
    //For the first step I am trying to use a max of 3 adders in parallel,
    //saving one to start the next computation
    //This is calculating next iter's new r value
    //from r *= v1 + x, or r(new) = r(old)*(v1+x)
    r_temp = ((r_temp * v1) + (r_temp * x)) * (v2 + x);
}
//Because i will end on i=98 and I only unrolled by 2, I don't need to
//worry about final few values because there will be none
*p = input[99].v; //Why it's in the loop I don't understand, this should be correct
*r = r_temp;
*q = q_temp;
Ok so what did my solution give me? Looking at the old code, each loop iteration of i will take a minimum latency of max(1A + 1M, 3A), where the former term is for calculating the new r and the 3 adds are for q.
In my solution, I am unrolling by 2 and trying to calculate the 2nd new value of r and q. Assuming the latency of adders/multipliers is such that M = c*A (c is some integer > 1), the multiply operations for r are definitely the rate-limiting step, so I focus on that. I tried to use the multipliers in parallel as much as I could.
In my code, 2 multipliers are used at first in parallel to help calculate intermediate steps, then an add must combine those, then a final multiply is used for obtaining the last result. So for 2 new values of r (even though I only keep/care about the last one), it takes me (1M // 1M // 1A) + 1A + 1M, for a total latency of 2M + 1A sequentially. Dividing by 2, my latency per loop value is 1M + 0.5A. I calculate the original latency/value (for r) to be 1A + 1M. So if my code is correct (I did this all by hand, haven't tested yet!) then I have a small performance boost.
Also, hopefully by not accessing and updating pointers directly in the loop as much (thanks to temp variables r_temp and q_temp mainly), I save on some load/store latency.
That was my stab at it. Definitely interested in seeing what others come up with that's better!

Yes, it is possible to leverage the two shorts. Rearrange your struct like so
struct s {
    unsigned v;
    short a;
    short b;
} input[100];
and you might get better alignment of the fields on your architecture, which might allow more of these structs to fit in the same memory page, and so cause fewer page faults.
It's all speculative; that's why it is so important to profile.
On the right architecture, the rearrangement gives you better data structure alignment, which results in higher data density in memory: fewer bits are lost to the padding needed to keep each field aligned to the boundaries imposed by common memory architectures.
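A quick way to see the effect is to compare sizeof for the two layouts (a sketch; s_orig and s_rearr are just illustrative names, and the exact numbers depend on the ABI, but with a 4-byte unsigned and a 2-byte short the original layout typically occupies 12 bytes and the rearranged one 8):
#include <stdio.h>

struct s_orig  { short a; unsigned v; short b; };  /* padding after a and after b   */
struct s_rearr { unsigned v; short a; short b; };  /* the two shorts share one slot */

int main(void) {
    printf("original layout:   %zu bytes\n", sizeof(struct s_orig));
    printf("rearranged layout: %zu bytes\n", sizeof(struct s_rearr));
    return 0;
}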

Related

If C is row-major order, why does ARM intrinsic code assume column-major order?

I'm not sure where the best place to ask this is, but I am currently working with ARM intrinsics and am following this guide: https://developer.arm.com/documentation/102467/0100/Matrix-multiplication-example
However, the code there was written assuming that the arrays are stored column-major order. I have always thought C arrays were stored row-major. Why did they assume this?
EDIT:
For example, if instead of this:
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}
They had done this:
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int k_idx=0; k_idx < k; k_idx++) {
            for (int j_idx=0; j_idx < m; j_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}
The code would run faster due to spatial locality of accessing C in the order C[0], C[1], C[2], C[3] instead of in the order C[0], C[2], C[1], C[3] (where C[0], C[1], C[2], C[3] are contiguous in memory).
You're not using C 2D arrays like C[i][j], so it's not a matter of how C stores anything, it's how 2D indexing is done manually in this code, using n * idx_1 + idx_2, with a choice of which you loop over in the inner vs. outer loops.
But the hard part of a matmul with both matrices non-transposed is that you need to make opposite choices for the two input matrices: a naive matmul has to stride through distant elements of one of the input matrices, so it's inherently screwed. That's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. (O(n^3) work over O(n^2) data - you want to get the most use out of it for every time you bring it into L1d cache, and/or into registers.)
Loop interchange can speed things up to take advantage of spatial locality in the inner-most loop, if you do it right.
See the cache-blocked matmul example in What Every Programmer Should Know About Memory? which traverses contiguous memory in all 3 inputs in the inner few loops, picking the index that isn't scaled in any of the 3 matrices as the inner one. That looks like this:
for (j_idx)
for (k_idx)
for (i_idx)
C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
Notice that B[k * j_idx + k_idx] is invariant over the inner loop, and that you're doing a simple dst[0..n] += const * src[0..n] operation over contiguous memory (which is easy to SIMD vectorize), although you're still doing 2 loads + 1 store for every FMA, so that's not going to max out your FP throughput.
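Filled out as a complete function with the same signature and manual indexing as the question's code (a sketch; matrix_multiply_jki is just an illustrative name, and C is assumed to be zeroed by the caller):
void matrix_multiply_jki(float32_t *A, float32_t *B, float32_t *C,
                         uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t j_idx = 0; j_idx < m; j_idx++) {
        for (uint32_t k_idx = 0; k_idx < k; k_idx++) {
            float32_t b = B[k * j_idx + k_idx];   /* invariant over the inner loop */
            for (uint32_t i_idx = 0; i_idx < n; i_idx++) {
                /* contiguous dst[0..n] += const * src[0..n] */
                C[n * j_idx + i_idx] += A[n * k_idx + i_idx] * b;
            }
        }
    }
}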
Separate from the cache access pattern, that also avoids a long dependency chain into a single accumulator (element of C). But that's not a real problem for an optimized implementation: you can of course use multiple accumulators. FP math isn't strictly associative because of rounding error, but multiple accumulators are closer to pairwise summation and likely to have less bad FP rounding error than serially adding each element of the row x column dot product.
It will give different results from adding in the order the standard simple C loop does, but usually closer to the exact answer.
Your proposed loop order i,k,j is the worst possible.
You're striding through distant elements of 2 of the 3 matrices in the inner loop, including discontiguous access to C[], opposite of what you said in your last paragraph.
With j as the inner-most loop, you'd access C[0], C[n], C[2n], etc. on the first outer iteration. And same for B[], so that's really bad.
Interchanging the i and j loops would give you contiguous access to C[] in the middle loop instead of strided, and still rows of one, columns of the other, in the inner-most loop. So that would be strictly an improvement: yes you're right that this naive example is constructed even worse than it needs to be.
But the key issue is the strided access to something in the inner loop: that's a performance disaster; that's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. The only index that is never used with a scale factor is i.
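For a rough idea of what cache blocking looks like with this manual indexing, here is a minimal sketch that tiles the j and k dimensions (BLOCK is a hypothetical tuning parameter; a real implementation would also tile i and block for registers):
#define BLOCK 64   /* hypothetical tile size; tune for your cache */

void matmul_tiled(float32_t *A, float32_t *B, float32_t *C,
                  uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t jj = 0; jj < m; jj += BLOCK) {
        uint32_t j_end = (jj + BLOCK < m) ? jj + BLOCK : m;
        for (uint32_t kk = 0; kk < k; kk += BLOCK) {
            uint32_t k_end = (kk + BLOCK < k) ? kk + BLOCK : k;
            for (uint32_t j_idx = jj; j_idx < j_end; j_idx++) {
                for (uint32_t k_idx = kk; k_idx < k_end; k_idx++) {
                    float32_t b = B[k * j_idx + k_idx];
                    for (uint32_t i_idx = 0; i_idx < n; i_idx++)
                        C[n * j_idx + i_idx] += A[n * k_idx + i_idx] * b;
                }
            }
        }
    }
}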
C is not inherently row-major or column-major.
When writing a[i][j], it's up to you to decide whether i is a row index or a column index.
While it's somewhat of a common convention to write the row index first (making the arrays row-major), nothing stops you from doing the opposite.
Also, remember that A × B = C is equivalent to Bt × At = Ct (t meaning a transposed matrix), and reading a row-major matrix as if it was column-major (or vice versa) transposes it, meaning that if you want to keep your matrices row-major, you can just reverse the order of the operands.

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment: compare two matrix multiplications - the default way, and multiplication after transposing the second matrix - and point out which method is faster. I've written the code below, but time and time2 are nearly equal to each other. In one run the first method is faster and in another run, with the same matrix size, the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum = 0;
        for(int k=0; k<size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;

for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum2 = 0;
        for(int k=0; k<size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but the resolution may be coarser (i.e., the clocks may change by more than 1 nanosecond whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
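For example, a sketch of wrapping an operation with these helpers and reporting the median of a number of repeats (REPEATS, cmp_double, and do_operation() are illustrative names, not part of the snippet above):
#include <stdio.h>
#include <stdlib.h>

#define REPEATS 300   /* illustrative repeat count */

static int cmp_double(const void *a, const void *b)
{
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* somewhere in main(), after the timing helpers above: */
double wall[REPEATS];
for (int rep = 0; rep < REPEATS; rep++) {
    timing_start();
    do_operation();            /* hypothetical: the code being measured */
    timing_stop();
    wall[rep] = wall_seconds;
}
qsort(wall, REPEATS, sizeof wall[0], cmp_double);
printf("median wall time: %.9f s\n", wall[REPEATS / 2]);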
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. They prefer to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is

    c[r][c] = a[r][0] * b[0][c]
            + a[r][1] * b[1][c]
            + ...
            + a[r][L] * b[L][c]

where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then

    c[r][c] = a[r][0] * B[c][0]
            + a[r][1] * B[c][1]
            + ...
            + a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
    For c in columns:
        temporary = b[r][c]
        b[r][c] = b[c][r]
        b[c][r] = temporary
    End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
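A minimal corrected transpose, visiting only the upper triangle so that each off-diagonal pair is swapped exactly once (a sketch assuming a square size x size matrix, as in the question):
for (int r = 0; r < size; ++r) {
    for (int c = r + 1; c < size; ++c) {   /* upper triangle only */
        int temporary = m2[r][c];
        m2[r][c] = m2[c][r];
        m2[c][r] = temporary;
    }
}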
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them to the way they were), your innermost loop runs over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop access items far from each other in memory, it is anti-optimized: it uses the access pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Pointer Math with Complex Array

I have this snippet of code with some pointer math that I'm having trouble understanding:
#include <stdlib.h>
#include <complex.h>
#include <fftw3.h>

int main(void)
{
    int i, j, k;
    int N, N2;
    fftwf_complex *box;
    fftwf_plan plan;
    float *smoothed_box;

    // Allocate memory for arrays (Ns are set elsewhere and properly,
    // I've just left it out for clarity)
    box = (fftwf_complex *)fftwf_malloc(N * sizeof(fftwf_complex));
    smoothed_box = (float *)malloc(N2 * sizeof(float));

    // Create complex data and fill box with it. Do FFT. Box has the
    // Hermitian symmetry that complex data has when doing FFTs with
    // real data
    plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
                                 FFTW_ESTIMATE);
    ...
    // end fft

    // Now do the loop I don't understand
    for(i = 0; i < N2; i++)
    {
        for(j = 0; j < N2; j++)
        {
            for(k = 0; k < N2; k++)
            {
                smoothed_box[R_INDEX(i,j,k)] = *((float *)box +
                    R_FFT_INDEX(i*f + 0.5, j*f + 0.5, k*f +0.5))/V;
            }
        }
    }

    // Do other stuff
    ...
    return 0;
}
Where f and V are just some numbers that are set elsewhere in the code and don't matter for this particular question. Additionally, the functions R_FFT_INDEX and R_INDEX don't really matter, either. What's important is that, for the first loop iteration, when i=j=k=0, R_INDEX = 0 and R_FFT_INDEX = 45. smoothed_box has 8 elements and box has 320.
So, in gdb, when I print smoothed_box[0] after the loop, I get smoothed_box[0] = some number. Now, I understand that, for an array of normal types, say floats, array + integer will give array[integer], assuming that integer is within the bounds of the array.
However, fftwf_complex is defined as typedef float fftwf_complex[2], as you need to hold both the real and imaginary parts of the complex number. It's also being cast to a float * from an fftwf_complex *, and I'm unsure what this does, given the typedef.
All I know is that when I print box[45] in gdb, I get box[45] = some complex number that is not smoothed_box[0] * V. Even when I print *((float *)box + 45)/V, I get a different number than smoothed_box[0].
So, I was just wondering if anyone could explain to me the pointer math that is being done in the above loop? Thank you, and I appreciate your time!
box is allocated as an array of N fftwf_complex. Then a backward 3D c2r fftw transform using N,N,N is performed on box, requiring N*N*(N/2+1) fftwf_complex. See http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format Therefore, this code might trigger undefined behavior, such as a segmentation fault, before reaching the pointer arithmetic...
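For an in-place N x N x N c2r transform, the allocation needs to cover the padded complex layout, i.e. something along these lines (a sketch based on the FFTW documentation linked above):
/* room for N*N*(N/2+1) complex values, i.e. N*N*2*(N/2+1) floats */
box = (fftwf_complex *)fftwf_malloc(N * N * (N/2 + 1) * sizeof(fftwf_complex));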
It is practical to cast box back to an array of float because the DFT is performed in place. Indeed, box is used twice when the fftwf_plan is created: box is both the input array of complex and the output array of real:
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
Once fftwf_execute(plan); is called, box is better seen as an array of real. Nevertheless, this array is of size N*N*2*(N/2+1), where the items located at positions i,j,k where k>N-1 are meaningless. See FFTW's Real-data DFT Array Format:
For an in-place transform, some complications arise since the complex data is slightly larger than the real data. In this case, the final dimension of the real data must be padded with extra values to accommodate the size of the complex data - two extra if the last dimension is even and one if it is odd. That is, the last dimension of the real data must physically contain 2 * (n_{d-1}/2 + 1) double values (exactly enough to hold the complex data). This physical array size does not, however, change the logical array size - only n_{d-1} values are actually stored in the last dimension, and n_{d-1} is the last dimension passed to the planner.
This is the reason why the real array smoothed_box is introduced, though an N*N*N array would be expected. If smoothed_box were an array of size N*N*N, then the following conversion could have been performed:
for(i=0; i<N; i++){
    for(j=0; j<N; j++){
        for(k=0; k<N; k++){
            smoothed_box[(i*N+j)*N+k] = ((float *)box)[(i*N+j)*(2*(N/2+1))+k];
        }
    }
}

Best approach to FIFO implementation in a kernel OpenCL

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and the temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into the array (temp array), which is the same size as the coefficient array. Now every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel and the values of each index are then accumulated into one single element. This will continue until the input array reaches its final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms!! and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a way better solution to do what I am attempting. Here is my code below.
Here are some things that I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting everything except the first for loop of the FIFO function. As a result this took 50 ms. Then when I uncommented the next loop, it jumped to 8000ms. So the delay would have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps use some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
    //take array of 58 elements (or same size as number of coefficients)
    //shift all elements to the right one
    //bring next element into index 0 from input
    //multiply the coefficient array with the array that's the same size as the coefficients and accumulate
    //store into one output value of the output array
    //repeat till input array has reached the end

    int globalId = get_global_id(0);
    float output = 0.0f;

    //Shift everything down from 1 to 57
    //takes about 50ms here
    for(int i=1; i<58; i++){
        tempArrayForShift[i] = temp[i];
    }

    //Input the new value passed from the main kernel. Rest of values were shifted over so the element is written at index 0.
    tempArrayForShift[0] = inputValue;

    //Takes about 8000ms with this loop included
    //Write values back into temp array
    for(int i=0; i<58; i++){
        temp[i] = tempArrayForShift[i];
    }

    //all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
    //I am 100% sure this line is crashing the program.
    //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

    //Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
    for (int i = 0; i < 58; i++) {
        // output = multipliedResult[i] + output;
    }

    //Returned summed value of temp array
    return output;
}

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    //Initialize the temporary array values to 0
    for (int i = 0; i < 58; i++) {
        temp[i] = 0;
        tempArrayForShift[i] = 0;
        multipliedResult[i] = 0;
    }

    //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
    for (int i = 0; i < 60; i++) {
        Output[i] = fifo(Array[i], coefficients, 58);
    }
}
I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are just using a single thread; a GPU is very slow per thread but can be very fast if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered; you can just grab these values from memory, multiply them with the coefficients, and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a convolution (1D). NVidia has a 2D example called oclConvolutionSeparable in their GPU Computing SDK, which shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
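For completeness, launching this simple kernel with one work item per output sample might look roughly like this on the host side (a sketch: buffer and kernel creation and error checking are omitted, and numOutputs, arrayBuf, coeffBuf, and outputBuf are hypothetical names):
size_t globalSize = numOutputs;   /* total number of output samples */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &arrayBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);

/* one work item per output value; let the runtime pick the local size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFinish(queue);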
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
the steps:
1. init local values to 0
2. copy coefficients to local variable
then, looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
   5a. multiplication - one per work item
   5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums){
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1 init local values to 0
    localArray[localId] = 0.0f;

    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp;
    //outputSize is assumed to be defined elsewhere (e.g. passed in or #defined)
    for(int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4 copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);

        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            if ((localId & mask) == 0) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M-element Output. To allow multiple work groups, you could divide the total output size by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i< (workAmount * (get_group_id(0)+1) -1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
if(groupId == 0){   // groupId = get_group_id(0)
    localArray[localId] = 0.0f;
}else{
    localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel which makes them all wait fairly regularly.

Is vectorization profitable in this case?

I broke a kernel down into several loops, in order to vectorize each one of them afterwards. One of these loops looks like:
int *array1; //Its size is "size+1";
int *array2; //Its size is "size+1";

//All positions of array1 and array2 are set to 0 here;

int *sArray1 = array1+1; //Shift one position so I start writing on pos 1
int *sArray2 = array2+1; //Shift one position so I start writing on pos 1

int bb = 0;
for(int i=0; i<size; i++){
    if(A[i] + bb > B[i]){
        bb = 1;
        sArray1[i] = S;
        sArray2[i] = 1;
    }
    else
        bb = 0;
}
Please note the loop-carried dependency, in bb - each comparison depends upon bb's value, which is modified on the previous iteration.
What I thought about:
I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries from the previous iteration.
When A[i] equals B[i], I need to know what value bb carries from the previous iteration. However, I also need to account for the case when this happens in two consecutive positions. When I started to shape up these cases, it seemed that things become overly complicated and vectorization doesn't pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but instead loop over chunks (where a chunk is defined as a run of consecutive elements yielding the same bb).
The search for chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics).
And the action to be taken for a single chunk of bb=1 could be vectorized, too.
The loop transformation is as follows:
size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;

while (i_chunk_start < size) {
    if(bb_chunk) {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
            if(A[i_chunk_end] < B[i_chunk_end]) {
                break;
            }
        }
        /* process current chunk */
        for(size_t i = i_chunk_start; i < i_chunk_end; ++i) {
            sArray1[i] = S;
            sArray2[i] = 1;
        }
        bb_chunk = 0;
    } else {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
            if(A[i_chunk_end] > B[i_chunk_end]) {
                break;
            }
        }
        bb_chunk = 1;
    }
    /* prepare for next chunk */
    i_chunk_start = i_chunk_end;
}
Now, each of the inner loops (all for loops) could potentially get vectorized.
Whether or not vectorization in this manner is superior to non-vectorization depends on whether the chunks have sufficient length on average. You will only find out by benchmarking.
The effect of your loop body depends on two conditions:
A[i] > B[i]
A[i] + 1 > B[i]
Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).
You can harvest those bits from a SSE register by _mm_movemask_epi8. It's a bit inconvenient that it works on bytes and not on ints, but you can take care of it by a suitable shuffle.
Afterwards, use the 8 bits as an address to a LUT (of 256 entries), which stores 4-bit masks. These masks can be used to store the elements into destination conditionally, using _mm_maskmoveu_si128.
I am not sure such a complicated program is worthwhile - it involves a lot of bit-fiddling for just a 4x improvement in speed. Maybe it's better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.
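As a rough sketch of the comparison part with SSE2 intrinsics (this variant harvests the bits with _mm_movemask_ps on the 32-bit comparison results, which sidesteps the byte/int mismatch mentioned above; the 256-entry LUT and the conditional stores are not shown):
#include <emmintrin.h>   /* SSE2 */

/* compute the two condition masks for elements i..i+3 */
__m128i va = _mm_loadu_si128((const __m128i *)&A[i]);
__m128i vb = _mm_loadu_si128((const __m128i *)&B[i]);

__m128i c1 = _mm_cmpgt_epi32(va, vb);                                   /* A[i]     > B[i] */
__m128i c2 = _mm_cmpgt_epi32(_mm_add_epi32(va, _mm_set1_epi32(1)), vb); /* A[i] + 1 > B[i] */

int m1 = _mm_movemask_ps(_mm_castsi128_ps(c1));   /* 4 bits */
int m2 = _mm_movemask_ps(_mm_castsi128_ps(c2));   /* 4 bits */
int lut_index = (m2 << 4) | m1;                   /* 8-bit index into the 256-entry LUT */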
