How to use openBLAS to improve vectorized operations? - c

I am self-learning how to write efficient, optimized deep learning code; but I am very much a newbie at this.
For example: I am reading that numpy uses vectorization to avoid python loops.
They have also pretty much coined the term broadcasting according to that link, which is used by TensorFlow, PyTorch and others.
I did some digging, and found that ldd on my Debian box shows multiarray.so links libopenblasp-r0-39a31c03.2.18.so.
So let's take the use case of a matrix subtraction. I would like to understand how to use openBLAS to improve this very naive implementation:
void matrix_sub(Matrix *a, Matrix *b, Matrix *res)
{
assert(a->cols == b->cols);
assert(a->rows == b->rows);
zero_out_data(res, a->rows, a->cols);
for (int i = 0; i < (a->rows*a->cols); i++)
{
res->data[i] = a->data[i] - b->data[i];
}
}
Likewise for an inner product, or an addition?

Related

Why this OpenMP parallel for loop doesn't work properly?

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
...
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
Where I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>
int main() {
...
omp_set_num_threads(2);
#pragma omp parallel for
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread is being executed, and not the other ones...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
The iteration i of the loop is modifying iteration i+1. Meanwhile, iteration i+1 might be happening at the same time.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know what's at t+1 without first knowing what's at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
therefore you can parallelize your code like this
#pragma omp parallel for
for (int i = 0; i < n; i++) {
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function you can call it once per thread rather than once per iteration, like this (since the number of threads is much smaller than n).
#pragma omp parallel
{
int nt = omp_get_num_threads();
int t = omp_get_thread_num();
int s = (t+0)*n/nt;
int f = (t+1)*n/nt;
u[s] = pow((1+h), s)*u[0];
v[s] = v[0]*pow(1.0/(1-h), s);
for(int i=s; i<f-1; i++) {
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
}
}
You can also write your own pow(double, int) function optimized for integer powers.
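A minimal sketch of such an integer-power function, using exponentiation by squaring. The name ipow is hypothetical and not part of any library; negative exponents are deliberately not handled here.

```c
#include <assert.h>

/* Hypothetical helper: raise base to a non-negative integer power using
 * exponentiation by squaring, which needs O(log exp) multiplications. */
double ipow(double base, int exp)
{
    assert(exp >= 0);        /* negative powers are out of scope for this sketch */
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1)         /* lowest bit set: fold the current base into the result */
            result *= base;
        base *= base;        /* square the base for the next bit */
        exp >>= 1;
    }
    return result;
}
```

In the per-thread version above, pow((1+h), s) could then be replaced by ipow(1+h, s). As with the closed form itself, the rounding can differ slightly from repeated multiplication.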
Note that the relationship I used is not in fact 100% equivalent because floating point arithmetic is not associative. That's not usually a problem but it's something one should be aware of.
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results, due to so-called race conditions.
If you just wanna learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of would be computing the area under a curve by means of integration.
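As an illustration of that kind of independent work, here is a small sketch that integrates 4/(1+x^2) over [0,1] (which equals pi) with the midpoint rule. The function name integrate_pi and the choice of integrand are assumptions made for the example. Each iteration only reads i, so the iterations are independent, and the reduction clause takes care of combining the per-thread partial sums.

```c
#include <assert.h>

/* Approximate the integral of 4/(1+x^2) over [0,1] with the midpoint rule.
 * Every iteration depends only on its own index i, so the loop can be split
 * across threads; reduction(+:sum) gives each thread a private sum and adds
 * them together at the end. */
double integrate_pi(int steps)
{
    assert(steps > 0);
    const double width = 1.0 / steps;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < steps; i++) {
        const double x = (i + 0.5) * width;   /* midpoint of slice i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * width;
}
```

Compiled with OpenMP enabled (for example -fopenmp with GCC) the loop runs in parallel; without it the pragma is ignored and the same code runs serially with the same result up to rounding.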
Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with a hidden ( not correctly handled ) breach of data-{-access | -value} integrity in time.
A pure-[SERIAL] flow of processing is free from such dangers as the principally serialised steps indirectly introduce ( right by a rigid order of executing nothing but a one-step-after-another as a sequence ) order, in which there is no chance to "touch" the same memory location twice or more times at the same time.
This "peace-of-mind" is inadvertently lost, once a process goes into a "just"-[CONCURRENT] or the true-[PARALLEL] processing.
Suddenly there is an almost random order ( in a case of a "just"-[CONCURRENT] ) or a principally "immediate" singularity ( avoiding any original meaning of "order" - in the case of a true-[PARALLEL] code execution mode -- like a robot, having 6DoF, arrives into each and every trajectory-point in a true-[PARALLEL] fashion, driving all 6DoF-axes in parallel, not a one-after-another, in a pure-[SERIAL]-manner, not in a some-now-some-other-later-and-the-rest-as-it-gets in a "just"-[CONCURRENT] fashion, as the 3D-trajectory of robot-arm will become hardly predictable and mutual collisions would be often on a car assembly line ... ).
Solution:
Using either a defensive tool, called atomic operations, or a principal approach - design (b)locking-free algorithm, where possible, or explicitly signal and coordinate reads and writes ( sure, at a cost in excess-time and degraded performance ), so as to warrant the values will not get damaged into an inconsistent digital trash, if protective steps ( ensuring all "old"-writes get safely "through" before any "next"-reads go ahead to grab a "right"-value ) were not coded in ( as was demonstrated above ).
Epilogue:
Using a tool like OpenMP for problems where it cannot bring any advantage will result in wasted time and decreased performance ( as there are needs to handle all tool-related overheads, while there is literally zero net-effect of parallelism in cases where the algorithm does not allow any parallelism to be enjoyed ), so one finally pays way more than one finally gets.
A good point to learn about OpenMP best practices could be sources for example from Lawrence Livermore National Laboratory ( indeed very competent ) and similar publications on using OpenMP.

Getting the compiler to auto-vectorize code in a sensible manner

I'm trying to figure out how to structure the main loop code for a numerical simulation in such a way that the compiler generates nicely vectorized instructions in a compact way.
The problem is most easily explained with C pseudocode, but I also have a Fortran version which is affected by the same kind of issue. Consider the following loop, where lots_of_code_* are some complicated expressions which produce a fair number of machine instructions.
void process(const double *in_arr, double *out_arr, int len)
{
for (int i = 0; i < len; i++)
{
const double a = lots_of_code_a(i, in_arr);
const double b = lots_of_code_b(i, in_arr);
...
const double z = lots_of_code_z(i, in_arr);
out_arr[i] = final_expr(a, b, ..., z);
}
}
When compiled with an AVX target the Intel compiler generates code which goes like
process:
AVX_loop
AVX_code_a
AVX_code_b
...
AVX_code_z
AVX_final_expr
...
SSE_loop
SSE_instructions
...
scalar_loop
scalar_instructions
...
The resulting binary is already quite sizable. My actual calculation loop, though, looks more like the following:
void process(const double *in_arr1, ... , const double *in_arr30,
double *out_arr1, ... double *out_arr30,
int len)
{
for (int i = 0; i < len; i++)
{
const double a1 = lots_of_code_a(i, in_arr1);
...
const double a30 = lots_of_code_a(i, in_arr30);
const double b1 = lots_of_code_b(i, in_arr1);
...
const double b30 = lots_of_code_b(i, in_arr30);
...
...
const double z1 = lots_of_code_z(i, in_arr1);
...
const double z30 = lots_of_code_z(i, in_arr30);
out_arr1[i] = final_expr1(a1, ..., z1);
...
out_arr30[i] = final_expr30(a30, ..., z30);
}
}
This results in a very large binary indeed (400KB for the Fortran version, 800KB for C99). If I now define lots_of_code_* as functions, then each function gets turned into non-vectorized code. Whenever the compiler decides to inline a function it does vectorize it, but seems to also duplicate the code each time as well.
In my mind, the ideal code should look like:
AVX_lots_of_code_a:
AVX_code_a
AVX_lots_of_code_b:
AVX_code_b
...
AVX_lots_of_code_z:
AVX_code_z
SSE_lots_of_code_a:
SSE_code_a
...
scalar_lots_of_code_a:
scalar_code_a
...
...
process:
AVX_loop
call AVX_lots_of_code_a
call AVX_lots_of_code_a
...
SSE_loop
call SSE_lots_of_code_a
call SSE_lots_of_code_a
...
scalar_loop
call scalar_lots_of_code_a
call scalar_lots_of_code_a
...
This clearly results in a much smaller code which is still just as well optimized as the fully-inlined version. With luck it might even fit in L1.
Obviously I can write this myself using intrinsics or whatever, but is it possible to get the compiler to automatically vectorize in the way described above through "normal" source code?
I understand that the compiler will probably never generate separate symbols for each vectorized version of the functions, but I thought it could still just inline each function once inside process and use internal jumps to repeat the same code block, rather than duplicating code for each input array.
Formal answer to questions like yours:
Consider using OpenMP4.0 SIMD-enabled (I didn't say inlined) functions or equivalent proprietary mechanisms. Available in Intel Compiler or fresh GCC4.9.
See more details here: https://software.intel.com/en-us/node/522650
Example:
//Invoke this function from vectorized loop
#pragma omp declare simd
int vfun(int x, int y)
{
return x*x+y*y;
}
It will give you capability to vectorize loop with function calls without inlining and as a result without huge code generation. (I didn't really explore your code snippet in details; instead I answered the question you asked in textual form)
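A hedged sketch of how such a SIMD-enabled function might be called from a loop. The wrapper name apply_vfun and its parameters are made up for illustration; vfun is the function from the example above.

```c
#include <assert.h>
#include <stddef.h>

/* Ask the compiler to also generate vector variants of vfun. */
#pragma omp declare simd
int vfun(int x, int y)
{
    return x * x + y * y;
}

/* Hypothetical caller: the simd pragma requests vectorization of the loop,
 * and the call can then use a vector variant of vfun instead of being
 * inlined. restrict promises that the three arrays do not overlap. */
void apply_vfun(const int *restrict xs, const int *restrict ys,
                int *restrict out, size_t n)
{
    assert(xs != NULL && ys != NULL && out != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = vfun(xs[i], ys[i]);
}
```

The pragmas only take effect when OpenMP SIMD support is switched on (with GCC, -fopenmp or -fopenmp-simd); otherwise they are ignored and the code still computes the same values. Whether the compiler really emits a call to a vector variant is worth checking in the generated assembly.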
The immediate problem that comes to mind is the lack of restrict on the input/output-pointers. The input is const though, so it's probably not too much of a problem, unless you have multiple output-pointers.
Other than that, I recommend -fassociative-math or whatever the ICC equivalent is. Structurally, you seem to iterate over the array, doing multiple independent operations on the array that are only munged together in the very end. Strict fp compliance might kill you on the array-operations.
Finally, there's probably no way this will get vectorized if you need more intermediate results than vector_registers - input_arrays.
Edit:
I think I see your problem now. You call the same function on different data, and want each result stored independently, right?
The problem is that the same function always writes to the same output register, so subsequent, vectorized calls would clobber earlier results. The solution could be:
A stack of results (either in memory or like the old x87 FPU-stack), that gets pushed every time. If in memory, it is slow, if x87, it's not vectorized. Bad idea.
Effectively multiple functions to write into different registers. Code duplication. Bad idea.
Rotating registers, like on the Itanium. You don't have an Itanium? You're not alone.
It's possible that this can't be easily vectorized on current architectures. Sorry.
Edit, you're apparently fine with going to memory:
void function1(double const *restrict inarr1, double const *restrict inarr2, \
double *restrict outarr, size_t n)
{
for (size_t i = 0; i<n; i++)
{
double intermediateres[NUMFUNCS];
double * rescursor = intermediateres;
*rescursor++ = mungefunc1(inarr1[i]);
*rescursor++ = mungefunc1(inarr2[i]);
*rescursor++ = mungefunc2(inarr1[i]);
*rescursor++ = mungefunc2(inarr2[i]);
...
outarr[i] = finalmunge(intermediateres[0],...,intermediateres[NUMFUNCS-1]);
}
}
This might be vectorizable. I don't think it'll be all that fast, going at memory speed, but you never know till you benchmark.
If you moved the lots_of_code blocks into separate compilation units without the for loop, they will probably not vectorize. Unless the compiler has a motive for vectorization, it will not vectorize the code, because vectorization might lead to longer latencies in the pipelines. To get around that, split the loop into 30 loops, and put each one of them in a separate compilation unit like this:
for (int i = 0; i < len; i++)
{
lots_of_code_a(i, in_arr1);
}

Hough Transform: improving algorithm efficiency over OpenCL

I am trying to detect a circle in binary image using hough transform.
When I use Opencv's built-in function for the circular hough transform, it is OK and I can find the circle.
Now I am trying to write my own 'kernel' code for doing the hough transform, but it is very slow:
kernel void hough_circle(read_only image2d_t imageIn, global int* in,const int w_hough,__global int * circle)
{
sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int gid0 = get_global_id(0);
int gid1 = get_global_id(1);
uint4 pixel;
int x0=0,y0=0,r;
int maxval=0;
pixel=read_imageui(imageIn,sampler,(int2)(gid0,gid1));
if(pixel.x==255)
{
for(int r=20;r<150;r+=2)
{
// int r=100;
for(int theta=0; theta<360;theta+=2)
{
x0=(int) round(gid0-r*cos( (float) radians( (float) theta) ));
y0=(int) round(gid1-r*sin( (float) radians( (float) theta) ));
if((x0>0) && (x0<get_global_size(0)) && (y0>0)&&(y0<get_global_size(1)))
atom_inc(&in[w_hough*y0+x0]);
}
if(maxval<in[w_hough*y0+x0])
{
maxval=in[w_hough*y0+x0];
circle[0]=gid0;
circle[1]=gid1;
circle[2]=r;
}
}
}
}
There is source code for the hough OpenCL library in OpenCV, but it's hard for me to extract a specific function that helps me.
Can anyone offer a better source code example, or help me understand why this is so inefficient?
The code (main.cpp and kernel.cl) is compressed in a rar file: http://www.files.com/set/527152684017e
It uses the OpenCV lib to read and display the image.
Making repeated calls to sin() and cos() is computationally expensive. Since you only ever call these functions with the same 180 values of theta, you could speed things up by precalculating these values and storing them in an array.
A more robust approach would be to use the midpoint circle algorithm to find the perimeters of these circles by simple integer arithmetic.
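A rough sketch of the midpoint circle idea in plain C, voting into a Hough accumulator. The names vote and vote_circle and the accumulator layout (row-major, width * height ints) are assumptions for illustration; in the kernel from the question the increment would be the atomic one.

```c
#include <assert.h>
#include <stddef.h>

/* Add one vote at (x, y) if it lies inside the accumulator. */
static void vote(int *acc, int width, int height, int x, int y)
{
    if (x >= 0 && x < width && y >= 0 && y < height)
        acc[y * width + x]++;
}

/* Midpoint circle algorithm: walk one octant of the circle of radius r
 * around (cx, cy) using only integer arithmetic and mirror each point
 * into the other seven octants. Points on the axes and diagonals are
 * visited twice, so they receive two votes in this simple sketch. */
void vote_circle(int *acc, int width, int height, int cx, int cy, int r)
{
    assert(acc != NULL && r >= 0);
    int x = r, y = 0;
    int err = 1 - r;                    /* decision variable */
    while (x >= y) {
        vote(acc, width, height, cx + x, cy + y);
        vote(acc, width, height, cx - x, cy + y);
        vote(acc, width, height, cx + x, cy - y);
        vote(acc, width, height, cx - x, cy - y);
        vote(acc, width, height, cx + y, cy + x);
        vote(acc, width, height, cx - y, cy + x);
        vote(acc, width, height, cx + y, cy - x);
        vote(acc, width, height, cx - y, cy - x);
        y++;
        if (err < 0) {
            err += 2 * y + 1;           /* stay on the same x */
        } else {
            x--;                        /* step inwards */
            err += 2 * (y - x) + 1;
        }
    }
}
```

Compared with stepping theta in fixed increments, this visits every perimeter pixel of the circle without calling sin() or cos(), and it does not leave gaps on large radii.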
What you are doing is running a huge, CPU-style block of code in only 1 workitem; the result, as expected, is a slow kernel.
Detailed answer:
The only place where you use the work-item ID is for the pixel value; if that condition is met, then you run a big chunk of code. Some of the work-items will trigger this, some of them won't. The ones that trigger it will indirectly make the whole work group run that code, and this will slow you down.
In addition, the workitems that don't enter that condition will be idle. Depending on the image maybe 99% of them are idle.
I would rewrite your algorithm to use 1 workgroup per pixel.
If the condition is met the workgroup will run the algorithm, if it is not, the whole workgroup will skip. And in the case the workgroup enters the condition, you will have many workitems to play with. This will allow a redesign of the code such that the inner for loops run in parallel.

Node js comparison to C

On my dual core machine, Node JS runs faster than an equivalent program written in C.
Is node so well optimized that it actually is more efficient, or is there something wrong with my C program that makes it slower?
Node js code:
var Parallel = require("paralleljs");
function slow(n){
var i = 0;
while(++i < n * n){}
return i;
}
var p = new Parallel([20001, 32311, 42222]);
p.map(slow).then(function(data){
console.log("Done!"+data.toString());
});
C code:
#include <stdio.h>
#include <pthread.h>
struct thread_s {
long int n;
long int r;
};
void *slow(void *p){
thread_s *t = (thread_s*)p;
long int i = 0;
while(++i < t->n * t->n){}
t->r = i;
pthread_exit( 0 );
}
thread_s arr[] = {{20001, 0}, {32311, 0}, {42222, 0}};
int main(){
pthread_t t[3];
for(int c = 0; c < 3; c++){
pthread_create(&t[c], NULL, slow, &arr[c]);
}
for(int c = 0; c < 3; c++){
pthread_join(t[c], NULL);
}
printf("Done! %ld %ld %ld\n", arr[0].r, arr[1].r, arr[2].r);
return 0;
}
You are benchmarking a toy program which is not a good way to compare compilers. Also, the loops you are doing have no side-effects. All it does is set i to n * n. The loops should be optimized out. Are you running unoptimized?
Try to compute something real that approximates the work-load that you will later apply in production. If your code will be numerics-heavy you could benchmark a naive matrix multiplication for example.
All basic operations (+-, Math.xx etc.) are mapped to the V8 engine, which just executes them as a C program would. So you should get pretty much the same results for C vs Node.js in these kinds of scenarios.
I have also tried C#.NET vs Node with Fibonacci of 45. The first time I ran it, C# was 5 times slower, which was really strange. After a moment I understood that this was because I ran the C# app in debug mode.
Switching to release made it very close (20sec node, 22sec C#); probably this is just measurement inconsistency.
In any case this is just a matter of a few percent.
The question currently lacks details on the benchmark, so it is impossible to say anything definitive about it. However, a general comparison between V8 running JavaScript and a binary program compiled from C source is possible.
V8 is pretty darn good at JIT compilation, so while there is the overhead of JIT compilation, this compensates for the dynamic nature of JavaScript, so for simple integer operations in a loop there's no reason for JIT code to be slower.
Another consideration is startup time. If you load node.js first and load javascript from interactive prompt, startup time of script is minimal even with JIT, especially compared to dynamically linked binary which needs to resolve symbols etc. If you have small statically linked binary, it will start very fast, and will have done a lot of processing by the time a new node.js is even started and starts to look for some Javascript to execute. You need to be careful how you handle this in benchmarks, or results will be meaningless.

What would be an efficient way to add multithreading to this simple algorithm?

I would say my knowledge in C is fair, and I wish to extend a program to enhance my knowledge of parallel programming.
Essentially, the program I am referring to is a brute force generator that increments through passwords, such as from 0000 .. zzzz, of a specific character set:
Need help with brute force code for crypt(3)
The algorithm is outlined below (credit to Jerome for this)
int len = 3;
char letters[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
int nbletters = sizeof(letters)-1;
int main() {
int i, entry[len];
for(i=0 ; i<len ; i++) entry[i] = 0;
do {
for(i=0 ; i<len ; i++) putchar(letters[entry[i]]);
putchar('\n');
for(i=0 ; i<len && ++entry[i] == nbletters; i++) entry[i] = 0;
} while(i<len);
}
In what logical way would you say this could be extended by multithreading?
CUDA is a silly, if simple, solution. I had heard of OpenMP which in my books looks like a good solution, how do you think this could be split up to benefit from multiple cores of my computer? I.e. core 1 computing aaaa..ffff, and core 2 computing ffff...zzzz, is this the only method that would make sense with this?
I think you answered your own question. The aaaa..ffff on thread #1 and ffff..zzzz on thread #2 is probably the way to go, except to maybe break it down into more threadable parts in case you have more cores available. Trying to start a thread to perform some part of the do loop would probably introduce more overhead than benefit in such a tight algorithm.
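One way to sketch that kind of range split with OpenMP: number every candidate from 0 to 62^len - 1, decode a number into its string, and let a parallel for hand contiguous ranges of numbers to the threads. The helper names index_to_candidate and find_candidate, and the idea of comparing against a known target string, are assumptions for illustration only; a real cracker would hash each candidate (e.g. with crypt) instead.

```c
#include <assert.h>
#include <string.h>

static const char letters[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
enum { NBLETTERS = sizeof(letters) - 1 };
enum { MAX_LEN = 10 };  /* 62^10 still fits in a long long */

/* Hypothetical helper: decode candidate number index into len characters.
 * The first character changes fastest, like entry[0] in the original loop. */
void index_to_candidate(unsigned long long index, int len, char *out)
{
    assert(len > 0 && len <= MAX_LEN && out != NULL);
    for (int i = 0; i < len; i++) {
        out[i] = letters[index % NBLETTERS];
        index /= NBLETTERS;
    }
    out[len] = '\0';
}

/* Hypothetical search: returns the candidate number that matches target,
 * or -1 if there is none. Every index is independent of the others, so
 * OpenMP can give each thread its own contiguous block of the range. */
long long find_candidate(const char *target, int len)
{
    assert(target != NULL && len > 0 && len <= MAX_LEN);
    long long total = 1;
    for (int i = 0; i < len; i++)
        total *= NBLETTERS;

    long long found = -1;
    #pragma omp parallel for
    for (long long idx = 0; idx < total; idx++) {
        char buf[MAX_LEN + 1];            /* private to each iteration */
        index_to_candidate((unsigned long long)idx, len, buf);
        if (strcmp(buf, target) == 0) {
            #pragma omp critical
            found = idx;                  /* at most one index can match */
        }
    }
    return found;
}
```

Because each thread works on its own block, nothing is printed in order here; the sketch only reports the matching index. It keeps scanning after a hit, which is wasteful but keeps the loop in the simple form OpenMP requires.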
I assume that you want to see your output characters in the order they are referenced in the entry array.
This is a sequential operation; you cannot parallelize it.
Edit:
OK, now I see how wrong I was :) You actually CAN parallelize this program, but you have to implement an additional layer handling the order of letters in the output. You also need to implement synchronization.
