Best approach to FIFO implementation in a kernel OpenCL - c

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into an array (the temp array) which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel, and the products at each index are then summed into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms!! and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a way better solution to do what I am attempting. Here is my code below.
Here are some things that I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting everything except the first for loop of the FIFO function. As a result this took 50 ms. Then when I uncommented the next loop, it jumped to 8000ms. So the delay would have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps use some logical shifting method for integer arrays? (I know there is a '>>' operator).
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];
float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array thats the same size of coefficients and accumilate
//store into one output value of the output array
//repeat till input array has reached the end
int globalId = get_global_id(0);
float output = 0.0f;
//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
tempArrayForShift[i] = temp[i];
}
//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
temp[i] = tempArrayForShift[i];
}
//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];
//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i < 58; i ++) {
// output = multipliedResult[i] + output;
}
//Returned summed value of temp array
return output;
}
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
//Initialize the temporary array values to 0
for (int i = 0; i < 58; i ++) {
temp[i] = 0;
tempArrayForShift[i] = 0;
multipliedResult[i] = 0;
}
//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i < 60; i ++) {
Output[i] = fifo(Array[i], coefficients, 58);
}
}
I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?

To get good performance out of a GPU, you need to parallelize your work across many threads. Your code uses just a single thread, and a GPU is very slow per thread; it is only fast when many threads run at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: each output value depends on a window of 58 input values, so you can just grab these values from memory, multiply them with the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        //skip window positions that fall before the start of the input
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a 1D convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
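When trying out a kernel like the one above, it can help to have a plain C reference on the host to compare the GPU output against. The following is a minimal sketch under the assumption that samples before the start of the input count as 0, which is what the bounds check in the kernel implies. The name fir_reference and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference FIR filter: out[n] = sum over k of coeff[k] * in[n - k].
 * Samples before the start of the input are treated as 0. */
void fir_reference(const float *in, size_t n_in,
                   const float *coeff, size_t n_coeff,
                   float *out)
{
    assert(in != NULL && coeff != NULL && out != NULL);
    for (size_t n = 0; n < n_in; n++) {
        float sum = 0.0f;
        /* k <= n keeps the index n - k from going below 0 */
        for (size_t k = 0; k < n_coeff && k <= n; k++) {
            sum += coeff[k] * in[n - k];
        }
        out[n] = sum;
    }
}
```

The kernel adds the products in the opposite order, so for general float data the two results may differ in the last bits; comparing with a small tolerance is safer than testing for exact equality.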

Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
the steps:
1. init local values to 0
2. copy coefficients to local variable
looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
5a. multiplication - one per work item
5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums, int outputSize){
int globalId = get_global_id(0);
int localId = get_local_id(0);
int localSize = get_local_size(0);
//1 init local values to 0
localArray[localId] = 0.0f;
//2 copy coefficients to local
//don't bother with this if __constant is working for you
//requires another local to be passed in: localCoeff
//localCoeff[localId] = coefficients[localId];
//barrier for both steps 1 and 2
barrier(CLK_LOCAL_MEM_FENCE);
float tmp = 0.0f;
for(int i = 0; i< outputSize; i++)
{
//3 shift elements (+barrier)
if(localId > 0){
tmp = localArray[localId -1];
}
barrier(CLK_LOCAL_MEM_FENCE);
localArray[localId] = tmp;
//4 copy new element (work item 0 only, + barrier)
if(localId == 0){
localArray[0] = Array[i];
}
barrier(CLK_LOCAL_MEM_FENCE);
//5 compute dot product
//5a multiply + barrier
localSums[localId] = localArray[localId] * coefficients[localId];
barrier(CLK_LOCAL_MEM_FENCE);
//5b reduction loop + barrier
for(int j = 1; j < localSize; j <<= 1) {
int mask = (j << 1) - 1;
if ((localId & mask) == 0 && localId + j < localSize) {
localSums[localId] += localSums[localId + j];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
//6 copy dot product (WI 0 only)
if(localId == 0){
Output[i] = localSums[0];
}
//7 barrier
//only needed if there is more code after the loop.
//the barrier in #3 covers this in the case where the loop continues
//barrier(CLK_LOCAL_MEM_FENCE);
}
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M Output. To allow multiple work groups, you could divide the output size by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
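//1 init local values
int groupId = get_group_id(0);
if(groupId == 0){
localArray[localId] = 0.0f;
}else{
//preload the samples just before this group's first input, newest at index 0
//(assumes workAmount >= localSize so the index stays non-negative)
localArray[localId] = Array[workAmount * groupId - 1 - localId];
}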
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands though. Play around with this, sometimes what seems optimal on a high-level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel which makes them all wait fairly regularly.

Related

Is there a point in transforming Sparse Matrix Multiplication into block form?

In an assignment for a parallel computing class we have been assigned to program Sparse Binary Matrix-Matrix multiplication (SpGEMM) in C. Julia has a relatively easy to follow implementation based on Gustavson's algorithm that works great.
Thing is we also need to do the multiplication in block form, which I already did, but I don't really see any speedup in doing so. From what I understand you're supposed to use the result of A(i,k)*B(k,j), where (i,j) are coordinates in the block matrix, as a mask/filter for the next block multiplication in the sum C(i,j) = Σ( A(i,k)*B(k,j) ).
Julia's implementation though, which I followed, already has a dense boolean array when computing each row that acts as a "flag" for when not to add something again in the resulting matrix.
My question is: is there any merit in turning this into block matrix multiplication, or is there something that I might be doing wrong myself?
Keep in mind my C code currently runs in half the time Matlab takes in multiplying a 5,000,000 x 5,000,000 sparse matrix. The blocked version, which I really tried to optimize and I'm also doing in the Gustavson order, gets slower and slower the smaller the block-size is set.
Here is my current code
//C=D+(A*B) (basically OR)
bool SpGEMM_dor(int *Acol, int *Arow, int An,
int *Bcol, int *Brow, int Bm,
int **Ccol, int *Crow, int *Csize,//output
int *Dcol, int *Drow)//previous
{
//printCSR(Arow,Acol,An,An,An);
int nnzcum=0;
bool *xb = calloc(An,sizeof(bool)); //boolean flag
for(int i=0; i<An; i++){
int nnzpv = nnzcum;//nnz of previous row;
Crow[i] = nnzcum;
if(nnzcum + An > *Csize){ //make sure theres enough space
*Csize += MAX(An, *Csize/4);
*Ccol = realloc(*Ccol,*Csize*sizeof(int));
}
//---OR---
//add previous row items in order to exist in the next block
for(int jj=Drow[i]; jj<Drow[i+1]; jj++){
int j = Dcol[jj];
xb[j] = true;
(*Ccol)[nnzcum] = j;
nnzcum++;
}
//--------
//add new row items
for(int jj=Arow[i]; jj<Arow[i+1]; jj++){
int j = Acol[jj];
for(int kp=Brow[j]; kp<Brow[j+1]; kp++){
int k = Bcol[kp];
if(!xb[k]){
xb[k] = true;
(*Ccol)[nnzcum] = k;
nnzcum++;
}
}
}
if(nnzcum > nnzpv){
quickSort(*Ccol,nnzpv,nnzcum-1);
for(int p=nnzpv; p<nnzcum; p++){
xb[ (*Ccol)[p] ] = false;
}
}
}
Crow[An] = nnzcum;
free(xb);
return Crow[An];
}
The part of code that I have inside of the ----OR---- section only happens in the block version in order to add the previous block to the now-calculating one. It basically does C = D+(A*B). I've also tried calculating the next block and then merging the 2 sorted arrays of each row of the 2 CSR matrices, which seems to be slower. Also all matrices are in CSR format.

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment: compare two matrix multiplications, one done the default way and one done after transposing the second matrix, and point out which method is faster. I've written the code below, but time and time2 are nearly equal to each other. With the same matrix size, the first method is faster in one run and the second method is faster in another. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum = 0;
for(int k=0; k<size; ++k) {
sum = sum + (m1[i][k] * m2[k][j]);
}
score[i][j] = sum;
}
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
int temp = m2[i][j];
m2[i][j] = m2[j][i];
m2[j][i] = temp;
}
}
clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum2 = 0;
for(int k=0; k<size; ++k) {
sum2 = sum2 + (m1[k][i] * m2[k][j]);
}
score[i][j] = sum2;
}
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
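As a rough illustration of the "store each duration, report the median" idea, here is a minimal sketch. The names median_seconds and compare_doubles are made up for this example; the caller is assumed to fill the samples array with one measured duration per repeat.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int compare_doubles(const void *a, const void *b)
{
    const double x = *(const double *)a;
    const double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns their median.
 * For an even count, the two middle values are averaged. */
double median_seconds(double *samples, size_t count)
{
    assert(samples != NULL && count > 0);
    qsort(samples, count, sizeof samples[0], compare_doubles);
    if (count % 2 == 1) {
        return samples[count / 2];
    }
    return 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
}
```

A single glitch such as one 5-second run among runs of a few hundred milliseconds moves the mean noticeably but leaves the median where the typical runs are.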
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>
static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;
void timing_start(void)
{
clock_gettime(CLOCK_REALTIME, &wall_start);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}
void timing_stop(void)
{
struct timespec cpu_end, wall_end;
clock_gettime(CLOCK_REALTIME, &wall_end);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
+ (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
+ (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
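The formula with the transposed matrix can be sketched in C as follows. This is only an illustration: it assumes square n x n matrices of int stored as flat row-major arrays, and the name matmul_transposed is made up. The point is that the innermost loop walks both a and bt through consecutive memory.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n row-major matrices, where bt holds the
 * transpose of b, i.e. bt[col * n + k] == b[k * n + col]. */
void matmul_transposed(const int *a, const int *bt, int *c, size_t n)
{
    assert(a != NULL && bt != NULL && c != NULL);
    for (size_t row = 0; row < n; row++) {
        for (size_t col = 0; col < n; col++) {
            int sum = 0;
            /* both operands advance by one element per iteration */
            for (size_t k = 0; k < n; k++) {
                sum += a[row * n + k] * bt[col * n + k];
            }
            c[row * n + col] = sum;
        }
    }
}
```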
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2]       b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2]   ⇔   b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2]       b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
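A minimal sketch of a transpose that swaps only the upper triangular portion (c > r), assuming a square matrix stored as a flat row-major array; the name transpose_in_place is made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Transposes an n x n row-major matrix in place.
 * Each off-diagonal pair is swapped exactly once because the inner
 * loop starts at c = r + 1. */
void transpose_in_place(int *m, size_t n)
{
    assert(m != NULL);
    for (size_t r = 0; r < n; r++) {
        for (size_t c = r + 1; c < n; c++) {
            const int tmp = m[r * n + c];
            m[r * n + c] = m[c * n + r];
            m[c * n + r] = tmp;
        }
    }
}
```

Calling it twice on the same matrix restores the original, which is effectively what the double loop over all r and c does in a single pass.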
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop access items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular question and in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Pointer Math with Complex Array

I have this snippet of code with some pointer math that I'm having trouble understanding:
#include <stdlib.h>
#include <complex.h>
#include <fftw3.h>
int main(void)
{
int i, j, k;
int N, N2;
fftwf_complex *box;
fftwf_plan plan;
float *smoothed_box;
// Allocate memory for arrays (Ns are set elsewhere and properly,
// I've just left it out for clarity)
box = (fftwf_complex *)fftwf_malloc(N * sizeof(fftwf_complex));
smoothed_box = (float *)malloc(N2 * sizeof(float));
// Create complex data and fill box with it. Do FFT. Box has the
// Hermitian symmetry that complex data has when doing FFTs with
// real data
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
...
// end fft
// Now do the loop I don't understand
for(i = 0; i < N2; i++)
{
for(j = 0; j < N2; j++)
{
for(k = 0; k < N2; k++)
{
smoothed_box[R_INDEX(i,j,k)] = *((float *)box +
R_FFT_INDEX(i*f + 0.5, j*f + 0.5, k*f +0.5))/V;
}
}
}
// Do other stuff
...
return 0;
}
Where f and V are just some numbers that are set elsewhere in the code and don't matter for this particular question. Additionally, the functions R_FFT_INDEX and R_INDEX don't really matter, either. What's important is that, for the first loop iteration, when i=j=k=0, R_INDEX = 0 and R_FFT_INDEX=45. smoothed_box has 8 elements and box has 320.
So, in gdb, when I print smoothed_box[0] after the loop, I get smoothed_box[0] = some number. Now, I understand that, for an array of normal types, say floats, array + integer will give array[integer], assuming that integer is within the bounds of the array.
However, fftwf_complex is defined as typedef float fftwf_complex[2], as you need to hold both the real and imaginary parts of the complex number. It's also being cast to a float * from a fftwf_complex *, and I'm unsure what this does, given the typedef.
All I know is that when I print box[45] in gdb, I get box[45] = some complex number that is not smoothed_box[0] * V. Even when I print *((float *)box + 45)/V, I get a different number than smoothed_box[0].
So, I was just wondering if anyone could explain to me the pointer math that is being done in the above loop? Thank you, and I appreciate your time!
box is allocated as an array of N fftwf_complex. Then a backward 3D c2r fftw transform using N,N,N is performed on box, requiring N*N*(N/2+1) fftwf_complex. See http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format. Therefore, this code might trigger undefined behavior, such as a segmentation fault, before reaching the pointer arithmetic...
It is practical to cast back box to an array of float because the DFT is performed in place. Indeed, box is used twice as the fftwf_plan is created. box is both the input array of complex and the output array of real:
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
Once fftwf_execute(plan); is called, box is better seen as an array of real. Nevertheless, this array is of size N*N*2*(N/2+1), where the items located at positions i,j,k where k>N-1 are meaningless. See FFTW's Real-data DFT Array Format:
For an in-place transform, some complications arise since the complex data is slightly larger than the real data. In this case, the final dimension of the real data must be padded with extra values to accommodate the size of the complex data—two extra if the last dimension is even and one if it is odd. That is, the last dimension of the real data must physically contain 2 * (n_{d-1}/2+1) double values (exactly enough to hold the complex data). This physical array size does not, however, change the logical array size—only n_{d-1} values are actually stored in the last dimension, and n_{d-1} is the last dimension passed to the planner.
This is the reason why the real array smoothed_box is introduced, though an N*N*N array would be expected. If smoothed_box were an array of size N*N*N, then the following conversion could have been performed:
for(i=0;i<N;i++){
for(j=0;j<N;j++){
for(k=0;k<N;k++){
smoothed_box[(i*N+j)*N+k]=((float *)box)[(i*N+j)*(2*(N/2+1))+k];
}
}
}
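The index arithmetic in the conversion loop above can be tried without FFTW at all. The sketch below is an illustration only: it takes the padded data as a plain float array (what box looks like after the cast to float *), and the names padded_row_length and copy_unpadded are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Number of floats physically stored per row of the last dimension
 * for an in-place c2r transform with logical row length n. */
size_t padded_row_length(size_t n)
{
    return 2 * (n / 2 + 1);
}

/* Copies the n*n*n meaningful values out of the padded layout into
 * a dense n*n*n array, skipping the padding at the end of each row. */
void copy_unpadded(const float *padded, float *dense, size_t n)
{
    assert(padded != NULL && dense != NULL);
    const size_t stride = padded_row_length(n);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                dense[(i * n + j) * n + k] = padded[(i * n + j) * stride + k];
            }
        }
    }
}
```

For n = 2 each row holds 4 floats of which only the first 2 are meaningful, so float offset 45 into a padded array generally does not correspond to complex element box[45], whose real part sits at float offset 90.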

Is vectorization profitable in this case?

I broke a kernel down into several loops, in order to vectorize each one of them afterwards. One of these loops looks like this:
int *array1; //Its size is "size+1";
int *array2; //Its size is "size+1";
//All positions of array1 and array2 are set to 0 here;
int *sArray1 = array1+1; //Shift one position so I start writing on pos 1
int *sArray2 = array2+1; //Shift one position so I start writing on pos 1
int bb = 0;
for(int i=0; i<size; i++){
if(A[i] + bb > B[i]){
bb = 1;
sArray1[i] = S;
sArray2[i] = 1;
}
else
bb = 0;
}
Please note the loop-carried dependency, in bb - each comparison depends upon bb's value, which is modified on the previous iteration.
What I thought about:
I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries from the previous iteration;
When A[i] equals B[i], I need to know what value bb carries from the previous iteration. However, I also need to account for the case when this happens in two consecutive positions. When I started to shape up these cases, it seemed that this becomes overly complicated and vectorization doesn't pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but have a loop over the chunks (where a chunk is defined by all elements within yielding the same bb).
The search for chunk boundaries could be vectorized (probably by hand, using compiler-specific SIMD intrinsics).
And the action to be taken for a single chunk of bb=1 could be vectorized, too.
The loop transformation is as follows:
const size_t isize = (size_t)size; /* element count, "size" in the question */
size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;
while (i_chunk_start < isize) {
if(bb_chunk) {
/* find end of current chunk */
for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
if(A[i_chunk_end] < B[i_chunk_end]) {
break;
}
}
/* process current chunk */
for(size_t i = i_chunk_start; i < i_chunk_end; ++i) {
sArray1[i] = S;
sArray2[i] = 1;
}
bb_chunk = 0;
} else {
/* find end of current chunk */
for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
if(A[i_chunk_end] > B[i_chunk_end]) {
break;
}
}
bb_chunk = 1;
}
/* prepare for next chunk */
i_chunk_start = i_chunk_end;
}
Now, each of the inner loops (all for loops) could potentially get vectorized.
Whether or not vectorization in this manner is superior to non-vectorization depends on whether the chunks are sufficiently long on average. You will only find out by benchmarking.
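Before benchmarking, it is worth checking that the chunked form produces the same stores as the original loop. The following is a sketch for that purpose, with made-up function names (mark_scalar, mark_chunked) and the arrays passed as parameters. It assumes size > 0 and that A[i] + 1 does not overflow, so that "A[i] + 1 > B[i]" can be written as "A[i] >= B[i]".

```c
#include <assert.h>
#include <stddef.h>

/* The loop from the question, with the carried flag bb. */
void mark_scalar(const int *A, const int *B, size_t size, int S,
                 int *out1, int *out2)
{
    int bb = 0;
    for (size_t i = 0; i < size; i++) {
        if (A[i] + bb > B[i]) {
            bb = 1;
            out1[i] = S;
            out2[i] = 1;
        } else {
            bb = 0;
        }
    }
}

/* Chunked form: each chunk is a run of elements that yield the same bb.
 * The boundary searches and the fill loop carry no dependency on bb. */
void mark_chunked(const int *A, const int *B, size_t size, int S,
                  int *out1, int *out2)
{
    assert(size > 0);
    size_t start = 0;
    int bb_chunk = A[0] > B[0];
    while (start < size) {
        size_t end = start + 1;
        if (bb_chunk) {
            /* a bb=1 chunk continues while A[i] + 1 > B[i] */
            while (end < size && A[end] >= B[end]) { end++; }
            for (size_t i = start; i < end; i++) {
                out1[i] = S;
                out2[i] = 1;
            }
        } else {
            /* a bb=0 chunk continues while A[i] > B[i] is false */
            while (end < size && A[end] <= B[end]) { end++; }
        }
        bb_chunk = !bb_chunk;
        start = end;
    }
}
```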
The effect of your loop body depends on two conditions:
A[i] > B[i]
A[i] + 1 > B[i]
Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).
You can harvest those bits from a SSE register by _mm_movemask_epi8. It's a bit inconvenient that it works on bytes and not on ints, but you can take care of it by a suitable shuffle.
Afterwards, use the 8 bits as an address to a LUT (of 256 entries), which stores 4-bit masks. These masks can be used to store the elements into destination conditionally, using _mm_maskmoveu_si128.
I am not sure such a complicated program is worthwhile - it involves much bit-fiddling for just x4 improvement in speed. Maybe it's better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.

Optimize and rewrite the following C code

This is a textbook question that involves rewriting some C code to make it perform best on a given processor architecture.
Given: targeting a single superscalar processor with 4 adders and 2 multiplier units.
Input structure (initialized elsewhere):
struct s {
short a;
unsigned v;
short b;
} input[100];
Here is the routine to operate on this data. Obviously correctness must be ensured, but the goal is to optimize the crap out of it.
int compute(int x, int *r, int *q, int *p) {
int i;
for(i = 0; i < 100; i++) {
*r *= input[i].v + x;
*p = input[i].v;
*q += input[i].a + input[i].v + input[i].b;
}
return i;
}
So this method has 3 arithmetic instructions to update the integers r, q, p.
Here's my attempt with comments explaining what I'm thinking:
//Use temp variables so we don't keep using loads and stores for mem accesses;
//hopefully the temps will just be kept in the register file
int r_temp = *r;
int q_temp = *q;
for (i=0;i<99;i = i+2) {
struct s data1 = input[i];
struct s data2 = input[i+1]; //going to try partially unrolling loop
int a1 = data1.a;
int a2 = data2.a;
int b1 = data1.b;
int b2 = data2.b;
int v1 = data1.v;
int v2 = data2.v;
//I will use brackets to make my intention clear the order of operations I was planning
//with respect to the functional (adder, mul) units available
//This is calculating the next iteration's new q value
//from q += v1 + a1 + b1, or q(new)=q(old)+v1+a1+b1
q_temp = ((v1+q_temp)+(a1+b1)) + ((a2+b2)+v2);
//For the first step I am trying to use a max of 3 adders in parallel,
//saving one to start the next computation
//This is calculating next iter's new r value
//from r *= v1 + x, or r(new) = r(old)*(v1+x)
r_temp = ((r_temp*v1) + (r_temp*x)) * (v2+x);
}
//Because i will end on i=98 and I only unrolled by 2, I don't need to
//worry about final few values because there will be none
*p = input[99].v; //Why it's in the loop I don't understand, this should be correct
*r = r_temp;
*q = q_temp;
Ok so what did my solution give me? Looking at the old code, each loop iteration of i will take the minimum latency of max((1A + 1M),(3A)) where the former value is for calculating the new r while the latency of 3 Adds is for q.
In my solution, I am unrolling by 2 and trying to calculate the 2nd new value of r and q. Assuming the latency of adders/multipliers is such that M = c*A (c is some integer > 1), the multiply operations for r are definitely the rate-limiting step, so I focus on that. I tried to use the multipliers in parallel as much as I could.
In my code, 2 multipliers are used at first in parallel to help calculate intermediate steps, then an add must combine those, then a final multiply is used for obtaining the last result. So for 2 new values of r (even though I only keep/care about the last one), it takes me (1M // 1M // 1A) + 1A + 1M, for a total latency of 2M + 1A sequentially. Dividing by 2, my latency per loop value is 1M + 0.5A. I calculate the original latency/value (for r) to be 1A + 1M. So if my code is correct (I did this all by hand, haven't tested yet!) then I have a small performance boost.
Also, hopefully by not accessing and updating pointers directly in the loop as much (thanks to temp variables r_temp and q_temp mainly), I save on some load/store latency.
That was my stab at it. Definitely interested in seeing what others come up with that's better!
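One way to check an unrolled rewrite like this is to keep the original routine as a reference and compare results. The sketch below is a hypothetical variant, not a definitive answer: compute_unrolled is a made-up name, it assumes r, q and p do not alias each other or input, and it accumulates in unsigned because that is the type the original expressions are evaluated in (v is unsigned). It multiplies the two factors together first, so only one multiply per pair of elements depends on the running value of r.

```c
#include <assert.h>

struct s {
    short a;
    unsigned v;
    short b;
};
struct s input[100];

/* The routine from the question, kept as the reference. */
int compute_reference(int x, int *r, int *q, int *p)
{
    int i;
    for (i = 0; i < 100; i++) {
        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }
    return i;
}

/* Hypothetical variant, unrolled by 2, with accumulators in locals. */
int compute_unrolled(int x, int *r, int *q, int *p)
{
    assert(r != 0 && q != 0 && p != 0);
    unsigned r_acc = (unsigned)*r;
    unsigned q_acc = (unsigned)*q;
    const unsigned ux = (unsigned)x;
    for (int i = 0; i < 100; i += 2) {
        const unsigned v1 = input[i].v;
        const unsigned v2 = input[i + 1].v;
        /* factor and term do not depend on r_acc or q_acc, so they
         * can overlap with the previous iteration's updates */
        const unsigned factor = (v1 + ux) * (v2 + ux);
        const unsigned term =
            (v1 + (unsigned)(input[i].a + input[i].b)) +
            (v2 + (unsigned)(input[i + 1].a + input[i + 1].b));
        r_acc *= factor; /* one dependent multiply per two elements */
        q_acc += term;   /* one dependent add per two elements */
    }
    *p = (int)input[99].v; /* only the last store to *p survives */
    *r = (int)r_acc;
    *q = (int)q_acc;
    return 100;
}
```

Whether this is actually faster on the target processor still has to be measured; the sketch only shows that the reassociation keeps the results identical.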
Yes, it is possible to leverage the two shorts. Rearrange your struct as so
struct s {
unsigned v;
short a;
short b;
} input[100];
and you might be able to get better alignment of the memory fields on your architecture, which might allow more of these structs to lie in the same memory page, which might allow you to encounter fewer memory page faults.
It's all speculative, that's why it is so important to profile.
If you have the right architecture, a rearrangement will give you better data structure alignment, which results in higher data density within the memory as fewer bits are lost to necessary padding to ensure type alignment with the data boundaries imposed by common memory architectures.
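How much padding each layout carries can be checked directly with sizeof. The sketch below uses made-up names (s_original, s_reordered, padding_original, padding_reordered); the actual numbers depend on the platform's type sizes and alignment rules, so treat any specific figure as an assumption until measured.

```c
#include <assert.h>
#include <stddef.h>

struct s_original  { short a; unsigned v; short b; };
struct s_reordered { unsigned v; short a; short b; };

/* The first member of a struct always sits at offset 0. */
static_assert(offsetof(struct s_reordered, v) == 0, "v is the first member");

/* Bytes in each layout that hold no member data. */
size_t padding_original(void)
{
    return sizeof(struct s_original) - (2 * sizeof(short) + sizeof(unsigned));
}

size_t padding_reordered(void)
{
    return sizeof(struct s_reordered) - (2 * sizeof(short) + sizeof(unsigned));
}
```

On a typical platform with 2-byte short and 4-byte unsigned, the original layout needs padding after a and after b, while the reordered one packs the two shorts together after v.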

Resources