CBLAS: Issue with *gbmv

I am trying to use the BLAS function dgbmv to multiply a vector by a band matrix. My code is in C and I use cblas to call the function.
Here is an example:
int main(int argc, char **argv)
{
    /* Band representation of A = { 1.0 1.0 1.0 0.0
                                    2.0 2.0 2.0 2.0
                                    3.0 3.0 3.0 3.0
                                    4.0 4.0 4.0 4.0
                                    0.0 5.0 5.0 5.0 } */
    double A[24] = { 0., 0., 1., 2.,
                     0., 1., 2., 3.,
                     1., 2., 3., 4.,
                     2., 3., 4., 5.,
                     3., 4., 5., 0.,
                     4., 5., 0., 0. };
    /* Vectors x and y */
    double X[4] = {1, 2, 3, 4};
    double Y[10] = {1, 0, 2, 0, 3, 0, 4, 0, 5, 0};
    double alpha = 2.;
    double beta = 10.;
    /* Use BLAS to compute y=beta*y+alpha*A*x */
    cblas_dgbmv(CblasRowMajor, CblasNoTrans, 5, 4, 3, 2, alpha, A, 6, X, 1, beta, Y, 2);
}
This code gives y={20.,0.,80.,0.,106.,0.,114.,0.,58.,0.} which is wrong.
However, if I use the transpose of A and change the first parameter of cblas_dgbmv to "CblasColMajor" I get the correct result:
int main(int argc, char **argv)
{
    ....
    ....
    double B[24];
    /* Set B=transpose(A) */
    int l, m;
    for (l = 0; l < 6; l++)
    {
        for (m = 0; m < 4; m++)
        {
            B[6*m+l] = A[l*4+m];
        }
    }
    ...
    ...
    cblas_dgbmv(CblasColMajor, CblasNoTrans, 5, 4, 3, 2, alpha, B, 6, X, 1, beta, Y, 2);
}
This code gives y={22.,0,60.,0.,90.,0.,120.,0.,140.,0.}, which is correct.
I think that both codes should produce the same result, since switching the storage format from row-major to column-major and transposing the matrix should cancel each other out.
I need to handle large matrices and cannot afford to transpose all of them. I would also like to understand why the two examples give different results.
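As an independent check on which of the two results is right, here is a minimal sketch (not part of the original post) that multiplies the dense 5x4 matrix from the comment directly, with no band storage involved. The function name dense_gemv_ref is made up for illustration; it assumes a dense row-major matrix, unit stride for x, and a stride of incy for y.

```c
/* Reference computation of y := beta*y + alpha*A*x for a dense,
   row-major m-by-n matrix. Hypothetical helper, not a BLAS routine. */
static void dense_gemv_ref(int m, int n, double alpha, const double *a,
                           const double *x, double beta, double *y, int incy)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];   /* row i of A times x */
        y[i * incy] = beta * y[i * incy] + alpha * sum;
    }
}
```

Called with the dense A from the comment, X = {1,2,3,4}, alpha = 2, beta = 10 and incy = 2, this leaves 22, 60, 90, 120, 140 in the even slots of Y, which agrees with the column-major result.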

Related

Enabling HVX SIMD in Hexagon DSP by using instruction intrinsics

I am using Hexagon SDK 3.0 to compile a sample application for the HVX DSP architecture. Several Hexagon LLVM tools are available in the folder:
~/Qualcomm/HEXAGON_Tools/7.2.12/Tools/bin
I wrote a small example that calculates the product of two arrays to make sure I can use HVX hardware acceleration. However, when I generate assembly with either -S or -S -emit-llvm, I don't find any HVX instructions such as vmem, vX, etc. For now my C application runs on hexagon-sim, until I find a way to run it on the board as well.
As far as I understand, I need to write the HVX part of the code with C intrinsics, but I was not able to adapt the existing examples to my own needs. It would be great if somebody could demonstrate how this is done. Also, many of the intrinsics are not defined in the Hexagon V62 Programmer's Reference Manual.
Here is my small app in pure C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#if defined(__hexagon__)
#include "hexagon_standalone.h"
#include "subsys.h"
#endif
#include "io.h"
#include "hvx.cfg.h"
#define KERNEL_SIZE 9
#define Q 8
#define PRECISION (1<<Q)
double vectors_dot_prod2(const double *x, const double *y, int n)
{
double res = 0.0;
int i = 0;
for (; i <= n-4; i+=4)
{
res += (x[i] * y[i] +
x[i+1] * y[i+1] +
x[i+2] * y[i+2] +
x[i+3] * y[i+3]);
}
for (; i < n; i++)
{
res += x[i] * y[i];
}
return res;
}
int main (int argc, char* argv[])
{
int n;
long long start_time, total_cycles;
/* -----------------------------------------------------*/
/* Allocate memory for input/output */
/* -----------------------------------------------------*/
//double *res = memalign(VLEN, 4 *sizeof(double));
const double *x = memalign(VLEN, n *sizeof(double));
const double *y = memalign(VLEN, n *sizeof(double));
if ( x == NULL || y == NULL ){
printf("Error: Could not allocate Memory for image\n");
return 1;
}
#if defined(__hexagon__)
subsys_enable();
SIM_ACQUIRE_HVX;
#if LOG2VLEN == 7
SIM_SET_HVX_DOUBLE_MODE;
#endif
#endif
/* -----------------------------------------------------*/
/* Call function */
/* -----------------------------------------------------*/
RESET_PMU();
start_time = READ_PCYCLES();
vectors_dot_prod2(x,y,n);
total_cycles = READ_PCYCLES() - start_time;
DUMP_PMU();
printf("Array product of x[i] * y[i] = %f\n",vectors_dot_prod2(x,y,4));
#if defined(__hexagon__)
printf("AppReported (HVX%db-mode): Array product of x[i] * y[i] =%f\n", VLEN, vectors_dot_prod2(x,y,4));
#endif
return 0;
}
I compile it using hexagon-clang:
hexagon-clang -v -O2 -mv60 -mhvx-double -DLOG2VLEN=7 -I../../common/include -I../include -DQDSP6SS_PUB_BASE=0xFE200000 -o arrayProd.o -c arrayProd.c
Then I link it with subsys.o (found in the SDK and already compiled) and -lhexagon to generate my executable:
hexagon-clang -O2 -mv60 -o arrayProd.exe arrayProd.o subsys.o -lhexagon
Finally, run it using the sim:
hexagon-sim -mv60 arrayProd.exe

A bit late, but might still be useful.
Hexagon Vector eXtensions instructions are not emitted automatically, and the current instruction set (as of the 8.0 SDK) supports only integer operations, so the compiler will not emit any HVX code for C code that uses the "double" type. (It is similar to SSE programming: you have to pack the xmm registers manually and use SSE intrinsics to do what you need.)
You need to define what your application really requires.
E.g., if you are writing something 3D-related and really need to calculate double (or float) dot products, you might convert your floats to 16.16 fixed point and then use instructions (i.e., C intrinsics) like
Q6_Vw_vmpyio_VwVh and Q6_Vw_vmpye_VwVuh to emulate fixed-point multiplication.
To "enable" HVX you should use HVX-related types defined in
#include <hexagon_types.h>
#include <hexagon_protos.h>
The instructions like 'vmem' and 'vmemu' are emitted automatically for statements like
// I assume 64-byte mode, no `-mhvx-double`. For 128-byte mode use 32 int array
int values[16] = { 1, 2, 3, ..... };
/* The following line compiles to
{
r4 = __address_of_values
v1 = vmem(r4 + #0)
}
You can get the exact code by using '-S' switch, as you already do
*/
HVX_Vector v = *(HVX_Vector*)values;
Your (fixed-point) version of dot_product may read out 16 integers at a time, multiply all 16 integers in a couple of instructions (see HVX62 programming manual, there is a tip to implement 32-bit integer multiplication from 16-bit one),
then shuffle/deal/ror data around and sum up rearranged vectors to get dot product (this way you may calculate 4 dot products almost at once and if you preload 4 HVX registers - that is 16 4D vectors - you may calculate 16 dot products in parallel).
If what you are doing is really just byte/int image processing, you might use specific 16-bit and 8-bit hardware dot products in Hexagon instruction set, instead of emulating doubles and floats.
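To make the 16.16 fixed-point idea concrete, here is a scalar sketch in plain C. It deliberately uses no HVX intrinsics; it only shows the arithmetic that a vectorised version would have to reproduce. The type and function names (q16_16, q16_from_double, q16_to_double, q16_dot) are made up for illustration, and the sketch assumes inputs small enough that neither the conversion nor the 64-bit accumulator overflows.

```c
#include <stddef.h>
#include <stdint.h>

/* 16.16 fixed point: 16 integer bits, 16 fractional bits (hypothetical type) */
typedef int32_t q16_16;

/* Convert with rounding to nearest; valid only for |v| < 32768 */
static q16_16 q16_from_double(double v)
{
    return (q16_16)(v * 65536.0 + (v >= 0.0 ? 0.5 : -0.5));
}

static double q16_to_double(q16_16 v)
{
    return (double)v / 65536.0;
}

/* Dot product: each 32x32 product is a 32.32 value, so accumulate in
   64 bits and scale back to 16.16 once at the end. */
static q16_16 q16_dot(const q16_16 *x, const q16_16 *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)x[i] * (int64_t)y[i];
    return (q16_16)(acc / 65536);
}
```

For example, the dot product of {1.5, 2.0} and {2.0, 0.5} converts back to exactly 4.0, because all of those values are representable in 16.16.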

Inaccurate Result from Haversine's Bearing Calculation

I am trying to implement the Haversine Formula in a little GPS program I'm writing. The distance calculations appear to be spot-on. However, I believe the bearing is being computed in radians, and I don't know how to properly convert the result to compass directions (0 for North, 90 for East, etc).
Any help would be greatly appreciated; all this talk of cosines and arctangents is giving me a major headache! I'm a coder, not a mathematician!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
void FindDistance (double latHome, double lonHome, double latDest, double lonDest)
{
double pi=3.141592653589793;
int R=6371; //Radius of the Earth in kilometers
//Keep the parameters passed to the function immutable
double latHomeTmp=(pi/180)*(latHome);
double latDestTmp=(pi/180)*(latDest);
double differenceLon= (pi/180)*(lonDest- lonHome);
double differenceLat=(pi/180)*(latDest- latHome);
double a= sin (differenceLat/2.)*sin (differenceLat/2.)+cos (latHomeTmp)*cos (latDestTmp)*sin (differenceLon/2.)*sin (differenceLon/2.);
double c=2*atan2 (sqrt (a), sqrt (1-a));
double Distance=R*c;
printf ("Distance is %f\n", Distance);
double RadBearing=atan2 (sin (differenceLon)*cos (latDestTmp), cos (latHomeTmp)*sin (latDestTmp)-sin (latHomeTmp)*cos (latDestTmp)*cos (differenceLon));
double DegBearing=RadBearing*57.2958;
if (DegBearing<0) DegBearing=360+DegBearing;
printf ("Bearing is %f\n", DegBearing);
} //Function FindDistance
int main (void) {
puts ("LA to NY");
FindDistance (34.052235, -118.243683, 40.748817, -73.985428);
puts ("NY to LA");
FindDistance (40.748817, -73.985428, 34.052235, -118.243683);
} //Function main
gcc -o gps gps.c -lm
It returns a bearing of 65 from LA to NY, and a bearing of 273 from NY to LA.
If we add the bearings together, we get 338 which can't be right - shouldn't it equal 360?
Or am I completely out to lunch?
Anyway, as you can see, I always compute both distance and bearing at the same time. If you could also suggest a way to clean up the code so it doesn't perform unnecessary calculations, that would be so very outstanding! I'm running this on a small microprocessor where I like to make every cycle count!

Not a problem.
Consider two locations at the same latitude that differ in longitude. Along the shortest great-circle route, the second is not due east (90) of the first, nor is the first due west (270) of the second. The bearings are not necessarily complementary.
FindDistance(34.0, -118.0, 34.0, -73.0);
FindDistance(34.0, -73.0, 34.0, -118.0);
Distance is 4113.598081
Bearing is 76.958824
Distance is 4113.598081
Bearing is 283.041176
@user3386109 adds more good information.
Per the site suggested by @M Oehm, your code is about correct.
Per the OP's request, here are some modifications that may give a slight speed improvement.
void FindDistance(double latHome, double lonHome, double latDest,
                  double lonDest) {
    // A few extra digits sometimes is worth it - rare is adding more digits a problem
    // double pi=3.141592653589793;
    static const double pi_d180 = 3.1415926535897932384626433832795 / 180;
    static const double d180_pi = 180 / 3.1415926535897932384626433832795;
    // int R=6371; //Radius of the Earth in kilometers
    // (Compiler may do this already)
    static const double R = 6371.0; // better to make FP to avoid the need to convert
    //Keep the parameters passed to the function immutable
    double latHomeTmp = pi_d180 * (latHome);
    double latDestTmp = pi_d180 * (latDest);
    double differenceLon = pi_d180 * (lonDest - lonHome);
    double differenceLat = pi_d180 * (latDest - latHome);
    double a = sin(differenceLat / 2.) * sin(differenceLat / 2.)
             + cos(latHomeTmp) * cos(latDestTmp) * sin(differenceLon / 2.)
               * sin(differenceLon / 2.);
    double c = 2 * atan2(sqrt(a), sqrt(1 - a));
    double Distance = R * c;
    printf("Distance is %f\n", Distance);
    double RadBearing = atan2(sin(differenceLon) * cos(latDestTmp),
                              cos(latHomeTmp) * sin(latDestTmp)
                              - sin(latHomeTmp) * cos(latDestTmp) * cos(differenceLon));
    // double DegBearing = RadBearing * 57.2958;
    double DegBearing = RadBearing * d180_pi;
    // Why is this even needed?
    if (DegBearing < 0) DegBearing = 360 + DegBearing;
    printf("Bearing is %f\n", DegBearing);
} //Function FindDistance

newton raphson in C

I have implemented the Newton-Raphson algorithm for finding roots in C. I want to print the most accurate approximation of the root possible without going into NaN land. My strategy for this is while (!isnan(x0)) { dostuff(); }, but this continues to print the result multiple times. Ideally I would like to set a tolerance so that the iteration stops when the difference between the previous and the current approximation is less than some range, .000001 in my case. I have a possible implementation below. When I input 2.999 it takes only one step, but when I input 3.0 it takes 20 steps, which seems incorrect to me.
(When I input 3.0)
λ newton_raphson 3
2.500000
2.250000
2.125000
2.062500
2.031250
2.015625
2.007812
2.003906
2.001953
2.000977
2.000488
2.000244
2.000122
2.000061
2.000031
2.000015
2.000008
2.000004
2.000002
2.000001
Took 20 operation(s) to approximate a proper root of 2.000002
within a range of 0.000001
(When I input 2.999)
λ newton_raphson 2.999
Took 1 operation(s) to approximate a proper root of 2.000000
within a range of 0.000001
My code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define RANGE 0.000001
double absolute(double number)
{
if (number < 0) return -number;
else return number;
}
double newton_raphson(double (*func)(double), double (*derivative)(double), double x0){
int count;
double temp;
count = 0;
while (!isnan(x0)) {
temp = x0;
x0 = (x0 - (func(x0)/derivative(x0)));
if (!isnan(x0))
printf("%f\n", x0);
count++;
if (absolute(temp - x0) < RANGE && count > 1)
break;
}
printf("Took %d operation(s) to approximate a proper root of %6f\nwithin a range of 0.000001\n", count, temp);
return x0;
}
/* (x-2)^2 */
double func(double x){ return pow(x-2.0, 2.0); }
/* 2x-4 */
double derivative(double x){ return 2.0*x - 4.0; }
int main(int argc, char ** argv)
{
double x0 = atof(argv[1]);
double (*funcPtr)(double) = &func; /* this is a user defined function */
double (*derivativePtr)(double) = &derivative; /* this is the derivative of that function */
double result = newton_raphson(funcPtr, derivativePtr, x0);
return 0;
}

You call trunc(x0) which turns 2.999 into 2.0. Naturally, when you start at the right answer, no iteration is needed! In other words, although you intended to use 2.999 as your starting value, you actually used 2.0.
Simply remove the call to trunc().

Worth pointing out: taking 20 steps to converge is also anomalous; because you are converging to a multiple root, the convergence is only linear instead of the typical quadratic convergence that Newton-Raphson gives in the general case. You can see this in the fact that your error is halved with each iteration (with the usual quadratic convergence, you would get twice as many correct digits on each iteration, and converge much, much faster).
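The linear convergence is easy to see by counting steps. The sketch below is not from the original posts: it uses the same f(x) = (x-2)^2, and it adds a multiplicity factor m to the Newton step (x - m*f/f'), which is a standard textbook remedy for a root of known multiplicity m. The function name newton_steps and its parameters are made up for illustration.

```c
/* f(x) = (x-2)^2 and its derivative, as in the question */
static double f(double x)  { return (x - 2.0) * (x - 2.0); }
static double df(double x) { return 2.0 * x - 4.0; }

/* Iterate x <- x - m*f(x)/f'(x) until the step is below tol.
   m = 1 is plain Newton-Raphson; m = 2 matches the double root at x = 2.
   Returns the number of steps taken and stores the estimate in *root. */
static int newton_steps(double x0, double m, double tol, int max_iter,
                        double *root)
{
    int count = 0;
    while (count < max_iter) {
        double d = df(x0);
        if (d == 0.0)            /* derivative vanished: avoid 0/0 -> NaN */
            break;
        double x1 = x0 - m * f(x0) / d;
        double step = x1 > x0 ? x1 - x0 : x0 - x1;
        x0 = x1;
        count++;
        if (step < tol)
            break;
    }
    *root = x0;
    return count;
}
```

Starting from 3.0 with tol = 0.000001, m = 1 takes 20 steps because the error halves each time, while m = 2 lands exactly on 2.0 after a single step. The explicit check for a zero derivative also keeps the loop out of NaN territory once the root has been hit exactly.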

Gaussian random number generator

I'm trying to implement a gaussian distributed random number generator in the interval [0,1].
float rand_gauss (void) {
    float v1,v2,s;
    do {
        v1 = 2.0 * ((float) rand()/RAND_MAX) - 1;
        v2 = 2.0 * ((float) rand()/RAND_MAX) - 1;
        s = v1*v1 + v2*v2;
    } while ( s >= 1.0 );
    if (s == 0.0)
        return 0.0;
    else
        return (v1*sqrt(-2.0 * log(s) / s));
}
It's pretty much a straightforward implementation of the algorithm in Knuth's TAOCP, volume 2, 3rd edition, page 122.
The problem is that rand_gauss() sometimes returns values outside the interval [0,1].

Knuth describes the polar method on p 122 of the 2nd volume of TAOCP. That algorithm generates a normal distribution with mean = 0 and standard deviation = 1. But you can adjust that by multiplying by the desired standard deviation and adding the desired mean.
You might find it fun to compare your code to another implementation of the polar method in the C-FAQ.

Change your loop condition to (s >= 1.0 || s == 0.0). Better yet, use a break as seen in the following example of a SIMD Gaussian random number generator returning a complex pair (u,v). This uses the Mersenne twister random number generator dsfmt(). If you only want a single real random number, return only u and save v for the next pass.
inline static void randn(double *u, double *v)
{
    double s, x, y; // SIMD Marsaglia polar version for complex u and v
    while (1){
        x = dsfmt_genrand_close_open(&dsfmt) - 1.;
        y = dsfmt_genrand_close_open(&dsfmt) - 1.;
        s = x*x + y*y;
        if (s < 1) break;
    }
    s = sqrt(-2.0*log(s)/s);
    *u = x*s; *v = y*s;
    return;
}
This algorithm is surprisingly fast. Execution times for computing two random numbers (u,v) for four different Gaussian random number generators are:
Times for delivering two Gaussian numbers (u + iv)
i7-2600K @ 4GHz, gcc -Wall -Ofast -msse2 ..
gsl_ziggurat                        = 20.3 (ns)
Box-Muller                          = 78.8 (ns)
Box-Muller with fast_sin fast_cos   = 28.1 (ns)
SIMD Marsaglia polar                = 35.0 (ns)
The fast_sin and fast_cos polynomial routines of Charles K. Garrett speed up the Box-Muller computation by a factor of about 2.8 (78.8 ns versus 28.1 ns) using a nested polynomial implementation of cos() and sin(). The SIMD Box-Muller and polar algorithms are certainly competitive, and they can be parallelized easily. Using gcc -Ofast -S, the assembly code dump shows that the square root is the SIMD SSE2 instruction: sqrt --> sqrtsd %xmm0, %xmm0
Comment: it is really hard and frustrating to get accurate timings with gcc5, but I think these are ok: as of 2/3/2016: DLW
[1] Related link: c malloc array pointer return in cython
[2] A comparison of algorithms, but not necessarily for SIMD versions: http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf
[3] Charles K. Garrett: http://krisgarrett.net/papers/l2approx.pdf

cblas_dgemm - works ONLY if (beta) is power-of-two

I am totally stumped. I have a fairly large recursive program written in c that calls cblas_dgemm(). The result is verified independently by a program that works correctly.
C = alpha*A*B + beta*C
On repeated tests using random matrices and all possible combinations of parameters, the program gives the correct answer ONLY if abs(beta) = 2^n (1,2,4,8..). Any value works for alpha. Any other value for beta, positive or negative, odd or even, gives the correct answer between 10% and 30% of the time.
I am using Ubuntu 10.04, GCC 4.4.x, I have tried system installed blas/cblas/atlas as well as manually compiled atlas.
Any hints or suggestions would be greatly appreciated. I am amazed at the wonderfully generous (and smart) folks lurking at this site.
Thanking you all in advance,
Russ

Two completely unrelated errors conspired to produce a misleading picture. It made me look for problems in the wrong place.
(1) There was a simple error in the logic of the function calling dgemm. It would have been easily fixed had I not been chasing the wrong problem.
(2) My double-compare function: double version of AlmostEqual2sComplement() (http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm) used incorrect sized integer - resulting in an incorrect TRUE under certain rare circumstances. This was the first time the error bit me!
Thanks again for the useful suggestion of using the scientific method when trying to debug a program.
Russ
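For readers who hit the same integer-size trap, here is a minimal sketch of a ULP-based comparison for doubles. It is not Russ's actual function: the name almost_equal_ulps is made up, and the sketch follows the general idea of the linked article while using a 64-bit integer (int64_t) to match the 64-bit double, and memcpy to reinterpret the bits without aliasing problems. It assumes IEEE 754 doubles.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if a and b are within max_ulps representable doubles of
   each other, 0 otherwise. NaN never compares equal. */
static int almost_equal_ulps(double a, double b, int64_t max_ulps)
{
    if (isnan(a) || isnan(b))
        return 0;

    /* Opposite signs: only +0.0 and -0.0 count as equal. This also
       avoids overflow when subtracting the integer representations. */
    if (!signbit(a) != !signbit(b))
        return a == b;

    int64_t ia, ib;                 /* must be 64-bit for a 64-bit double */
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);

    int64_t diff = ia > ib ? ia - ib : ib - ia;
    return diff <= max_ulps;
}
```

With this version, 1.0 and the next representable double above it compare equal at max_ulps = 1, while 1.0 and 1.0 + 1e-9 (millions of ULPs apart) do not compare equal at max_ulps = 4.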

Yes, a full example would be handy. Here is an old example I had hanging around using GSL's sgemm variant; should be easy to fix to double. Please try and see if this gives the result shown in the GSL manual:
/* from the gsl info documentation in node 'gsl cblas examples' */
/* compile via 'gcc -o $file $file.c -lgslcblas' */
/* edd 15 Nov 2003 */
#include <stdio.h>
#include <gsl/gsl_cblas.h>
int
main (void)
{
int lda = 3;
float A[] = { 0.11, 0.12, 0.13,
0.21, 0.22, 0.23 };
int ldb = 2;
float B[] = { 1011, 1012,
1021, 1022,
1031, 1032 };
int ldc = 2;
float C[] = { 0.00, 0.00,
0.00, 0.00 };
/* Compute C = A B */
cblas_sgemm (CblasRowMajor,
CblasNoTrans, CblasNoTrans, 2, 2, 3,
1.0, A, lda, B, ldb, 0.0, C, ldc);
printf ("[ %g, %g\n", C[0], C[1]);
printf (" %g, %g ]\n", C[2], C[3]);
return 0;
}
