Arithmetic operations on 64-bit double values using ARM NEON intrinsics in ARM64 - C

I'm trying to implement a simple 64-bit double addition operation using ARM NEON. I've come across this question, but the answer contained no sample implementation using ARM intrinsics, so any help in providing a complete example is greatly appreciated. Here is what I have tried so far, using integer-type registers.
Side note:
Please note that I'm using the intel/ARM_NEON_2_x86_SSE library to simulate this ARM NEON code with SSE instructions. Should I switch to native ARM NEON to test this code?
#include <arm_neon.h>
#include <iostream>
using namespace std;

int main()
{
    double Val1[2] = { 2.46574621, 0.46546221 };
    double Val2[2] = { 2.63565654, 0.46574621 };
    double Sum[2]   = { 0.0, 0.0 };
    double Sum_C[2] = { 0.0, 0.0 };

    // Reinterprets the doubles as 64-bit integers, adds them as integers,
    // and stores the result back over the doubles.
    vst1q_s64((int64_t*)Sum,                        // Store int64x2_t
        vaddq_s64(                                  // Add int64x2_t
            vld1q_s64((const int64_t*)&Val1[0]),    // Load int64x2_t
            vld1q_s64((const int64_t*)&Val2[0])));  // Load int64x2_t

    for (size_t i = 0; i < 2; i++)
    {
        Sum_C[i] = Val1[i] + Val2[i];
        if (Sum_C[i] != Sum[i])
        {
            cout << "[Error] Sum : " << Sum[i] << " != " << Sum_C[i] << "\n";
        }
        else
            cout << "[Passed] Sum : " << Sum[i] << " == " << Sum_C[i] << "\n";
    }
    cout << "\n";
}
[Error] Sum : -1.22535e-308 != 5.1014
[Error] Sum : 1.93795e+307 != 0.931208

Double precision isn't supported on aarch32 NEON.
Therefore, if you target armv7-a while using the data type float64x2_t, it won't build.
If your test platform is an aarch64 one with a 64-bit OS installed, just exclude the aarch32 target from your makefile.
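For reference, here is a minimal sketch of the same addition using the f64 variants of the intrinsics. This only builds for an AArch64 target with a native arm_neon.h (not the SSE translation header), which is exactly the restriction described above:
#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    double val1[2] = { 2.46574621, 0.46546221 };
    double val2[2] = { 2.63565654, 0.46574621 };
    double sum[2];

    float64x2_t v1 = vld1q_f64(val1);       // load two doubles
    float64x2_t v2 = vld1q_f64(val2);
    vst1q_f64(sum, vaddq_f64(v1, v2));      // add and store two doubles

    printf("%f %f\n", sum[0], sum[1]);      // expected: 5.101403 0.931208
    return 0;
}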

Related

In Eigen3 the norm() method does not provide the precise answer, why

I have a situation where, using the Eigen3 library, norm() does not provide the correct answer. The norm() should be just the square root of the sum of the squared coefficients of the vector:
NORM = sqrt( v[1]*v[1] + v[2]*v[2] + .... + v[N]*v[N] )
However, the following function calculates the norm in two ways: with the norm() method of Eigen3 and by hand. The results are slightly different:
void mytest()
{
    double mvec[3];
    mvec[0] = -3226.9276456286984;
    mvec[1] = 6153.3425006471571;
    mvec[2] = 2548.5894934614853;

    Vector3d v;
    v(0) = mvec[0];
    v(1) = mvec[1];
    v(2) = mvec[2];

    double normEigen = v.norm();
    double normByHand = sqrt(v(0)*v(0) + v(1)*v(1) + v(2)*v(2));
    double mdiff = std::abs(normEigen - normByHand);

    std::cout.precision(17);
    std::cout << "normEigen= " << normEigen << std::endl;
    std::cout << "normByHand= " << normByHand << std::endl;
    std::cout << "mdiff= " << mdiff << std::endl;
}
The output of this function is:
normEigen= 7400.8103858007089
normByHand= 7400.8103858007107
mdiff= 1.8189894035e-12
They differ from the 15th digit onward. Why? Where is some rounding happening?
Thanks in advance
PedroC.
The calculation is one that uses floating-point computations. As such, the order of operations, as well as things like vectorization, can result in (usually) slightly different results (due to different roundings, different orders of magnitude, etc.).
In this case, the difference is only in the 15th digit. The maximum accuracy of a 64-bit floating-point number is around the 16th significant digit.
If we look at the distance in ULPs using boost:
#include <boost/math/special_functions/next.hpp>
#include <iostream>

int main()
{
    double normEigen = 7400.8103858007089;
    double normByHand = 7400.8103858007107;
    std::cout << boost::math::float_distance(normEigen, normByHand);
    return 0;
}
we see that the distance (at least on my system) is 2. So the binary number is e.g. 0101...011 instead of 0101...001. Such a small difference is almost always due to the reasons I listed above.
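As a small illustration of the order-of-operations point, using the squared terms from the question (the exact last digits depend on the platform): floating-point addition is not associative, so grouping the sum differently can change the final bits.
#include <cstdio>

int main()
{
    double a = 3226.9276456286984 * 3226.9276456286984;
    double b = 6153.3425006471571 * 6153.3425006471571;
    double c = 2548.5894934614853 * 2548.5894934614853;

    // The two groupings round differently and may disagree in the last bits.
    std::printf("%.17g\n", (a + b) + c);
    std::printf("%.17g\n", a + (b + c));
    return 0;
}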
Going deeper, I see that the sum of the squared values introduces the discrepancy: when I calculate the squaredNorm() of a single vector and of 3 vectors each holding only one term, the totals are not identical.
void mytest2()
{
    double mvec[3];
    mvec[0] = -3226.9276456286984;
    mvec[1] = 6153.3425006471571;
    mvec[2] = 2548.5894934614853;

    Vector3d v, v1, v2, v3;
    v(0) = mvec[0];
    v(1) = mvec[1];
    v(2) = mvec[2];
    v1(0) = mvec[0]; v1(1) = v1(2) = 0.0;
    v2(0) = 0.0; v2(1) = mvec[1]; v2(2) = 0.0;
    v3(0) = v3(1) = 0.0; v3(2) = mvec[2];

    double squnorm  = v.squaredNorm();
    double squnorm1 = v1.squaredNorm();
    double squnorm2 = v2.squaredNorm();
    double squnorm3 = v3.squaredNorm();
    double squnormbyhand = squnorm1 + squnorm2 + squnorm3;
    double sqdiff = std::abs(squnorm - squnormbyhand);

    std::cout.precision(17);
    std::cout << "normEigen= " << squnorm << std::endl;
    std::cout << "normByHand= " << squnormbyhand << std::endl;
    std::cout << "mdiff= " << sqdiff << std::endl;
}
The output of this function is:
normEigen= 54771994.366575643
normByHand= 54771994.366575658
mdiff= 1.49011161193847656e-8
For some reason, when adding the squared values, Eigen introduces a rounding difference.
Thanks for your answer anyway.
Pedro

C rotl32 alternative

Is there an alternative for rotl32 in the C language?
I found this: Near constant time rotate that does not violate the standards, but I'm still trying to get an optimized one.
My code:
k0 = rotl32((k3 ^ k2 ^ k ^ k0), 1u);
I think this is the best portable option:
uint32_t rotl32(uint32_t var, uint32_t hops)
{
    return (var << hops) | (var >> (32 - hops));
}
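A quick sanity check of that helper (hypothetical values; see the discussion further down about the hops == 0 case):
#include <inttypes.h>
#include <stdio.h>

uint32_t rotl32(uint32_t var, uint32_t hops)
{
    return (var << hops) | (var >> (32 - hops));
}

int main(void)
{
    uint32_t k0 = 0x80000001u;
    /* the top bit wraps around to bit 0: prints 0x00000003 */
    printf("0x%08" PRIX32 "\n", rotl32(k0, 1u));
    return 0;
}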
You have an opencl tag in your question, so with a kernel
__kernel void rotateGpu(__global unsigned int * a, __global unsigned int * b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];
    for (int i = 0; i < 100; i++)
        a0 = rotate(a0, 1280u);
    b[idx] = rotate(a0, 1280u);
}
rotate() performance on an R7-240 GPU according to a benchmark:
For a 32-million-element array of 32-bit unsigned integers such as a0, kernel execution takes 16 ms when each thread does 100 rotations of step length 1280u (10 ms when each thread does just 1, so latency is independent of step length). That is more than 200 Gflops (but on integers), reaching 40% of the theoretical maximum of the GPU. Maybe it's even faster for integers than for floats (floats would need normalization after the shift, I suppose).
Example:
__kernel void rotateGpu(__global unsigned int * a, __global unsigned int * b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];
    b[idx] = rotate(a0, 2u);
}
input:
buf[0] = 80;
buf[1] = 12;
buf[2] = 14;
buf[3] = 5 ;
buf[4] = 70;
output:
320
48
56
20
280
dromtrund posted a good portable solution:
uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> (32 - hops));
}
Unfortunately, this function has undefined behavior for hops == 0, because var >> 32 is an out-of-range shift. On x86 processors, only the low-order bits of hops are significant. This behavior can be forced this way:
uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> ((32 - hops) & 31));
}
Both functions compile to optimal code with gcc 4.9 and up, clang 3.5 and up and icc 17, as can be verified with Godbolt's Compiler Explorer.
John Regehr has an interesting blog article on this very subject.

How to save a vector of keypoints using openCV

I was wondering if it is possible to save out a vector of cv::KeyPoint objects using the CvFileStorage class or the cv::FileStorage class. Also, is it the same process to read them back in?
Thanks.
I am not sure what you really expect:
The code I provide is simply an example to show how file storage works in the OpenCV C++ bindings. It assumes that you write all the KeyPoints separately to the file, each one named after its position in the vector it was stored in (prefixed with a letter, since FileStorage keys cannot start with a digit).
It also assumes that when you read them back, you know how many you want to read; if not, the code is a little more complex. You'll find a way (for instance, read from the file storage and test what it gives you; if it doesn't give you anything, there are no more points to read). It's just an idea, you have to find a solution; maybe this piece of code will be enough for you.
I should mention that I use ostringstream to put the integer into a string and thereby change the key under which each point is written in the *.yml file.
//TO WRITE
vector<KeyPoint> myKpVec;
FileStorage fs(filename, FileStorage::WRITE);
for (size_t i = 0; i < myKpVec.size(); ++i) {
    ostringstream oss;
    oss << "kp" << i;            // key must start with a letter
    fs << oss.str() << myKpVec[i];
}
fs.release();

//TO READ
vector<KeyPoint> myKpVec;
FileStorage fs(filename, FileStorage::READ);
KeyPoint aKeypoint;
for (size_t i = 0; i < nbKeypoints; ++i) {   // nbKeypoints: count known in advance
    ostringstream oss;
    oss << "kp" << i;
    fs[oss.str()] >> aKeypoint;
    myKpVec.push_back(aKeypoint);
}
fs.release();
Julien,
const char* key;
FileStorage f;
vector<KeyPoint> keypoints;
//writing
write(f, key, keypoints);
//reading
read(f[key], keypoints);
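For completeness, a minimal self-contained sketch of those two calls (the file name and keypoint values are made up; it assumes the write()/read() overloads for std::vector<cv::KeyPoint> that ship with OpenCV, and the 2.x-style header path):
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

int main()
{
    std::vector<cv::KeyPoint> keypoints;
    keypoints.push_back(cv::KeyPoint(10.0f, 20.0f, 3.0f));   // x, y, size

    // writing
    cv::FileStorage fsWrite("keypoints.yml", cv::FileStorage::WRITE);
    cv::write(fsWrite, "keypoints", keypoints);
    fsWrite.release();

    // reading
    std::vector<cv::KeyPoint> loaded;
    cv::FileStorage fsRead("keypoints.yml", cv::FileStorage::READ);
    cv::read(fsRead["keypoints"], loaded);
    fsRead.release();

    return 0;
}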
int main() {
    String filename = "data.xml";
    FileStorage fs(filename, FileStorage::WRITE);

    vector<Mat> vecMat;
    Mat A(3, 3, CV_32F, Scalar(5));
    Mat B(3, 3, CV_32F, Scalar(6));
    Mat C(3, 3, CV_32F, Scalar(7));
    vecMat.push_back(A);
    vecMat.push_back(B);
    vecMat.push_back(C);

    for (size_t i = 0; i < vecMat.size(); i++) {
        stringstream ss;
        ss << i;
        string str = "x" + ss.str();
        fs << str << vecMat[i];
    }
    fs.release();

    vector<Mat> matVecRead;
    FileStorage fr(filename, FileStorage::READ);
    Mat aMat;
    int countlabel = 0;
    while (1) {
        stringstream ss;
        ss << countlabel;
        string str = "x" + ss.str();
        cout << str << endl;
        if (fr[str].isNone()) {      // no more nodes with this key
            break;
        }
        fr[str] >> aMat;
        matVecRead.push_back(aMat.clone());
        countlabel++;
    }
    fr.release();

    for (unsigned j = 0; j < matVecRead.size(); j++) {
        cout << matVecRead[j] << endl;
    }
}
Put a letter, e.g. 'a', in front of the numbering, as the OpenCV XML format specifies that the XML key must start with a letter.
This is code to save a vector<Mat>, built with Visual Studio 2010; I think it will work for vector<KeyPoint> as well.

Unset the most significant bit in a word (int32) [C]

How can I unset the most significant set bit of a word (e.g. 0x00556844 -> 0x00156844)? There is __builtin_clz in GCC, but it just counts the leading zeroes, which is not what I need. Also, how should I replace __builtin_clz for the MSVC or Intel C compilers?
Currently my code is
int msb = 1<< ((sizeof(int)*8)-__builtin_clz(input)-1);
int result = input & ~msb;
UPDATE: OK, if you say that this code is rather fast, then how should I make it portable? This version is for GCC; what about MSVC and ICC?
Just round down to the nearest power of 2 and then XOR that with the original value, e.g. using flp2() from Hacker's Delight:
uint32_t flp2(uint32_t x) // round x down to nearest power of 2
{
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x - (x >> 1);
}

uint32_t clr_msb(uint32_t x) // clear most significant set bit in x
{
    uint32_t msb = flp2(x);  // get MS set bit in x
    return x ^ msb;          // XOR MS set bit to clear it
}
If you are truly concerned with performance, the best way to clear the msb has recently changed for x86 with the addition of BMI instructions.
In x86 assembly:
clear_msb:
    bsrq %rdi, %rax
    bzhiq %rax, %rdi, %rax
    retq
Now to rewrite in C and let the compiler emit these instructions while gracefully degrading for non-x86 architectures or older x86 processors that don't support BMI instructions.
Compared to the assembly code, the C version is really ugly and verbose. But at least it meets the objective of portability. And if you have the necessary hardware and compiler directives (-mbmi, -mbmi2) to match, you're back to the beautiful assembly code after compilation.
As written, bsr() relies on a GCC/Clang builtin. If targeting other compilers you can replace with equivalent portable C code and/or different compiler-specific builtins.
#include <inttypes.h>
#include <stdio.h>

uint64_t bsr(const uint64_t n)
{
    return 63 - (uint64_t)__builtin_clzll(n);
}

uint64_t bzhi(const uint64_t n,
              const uint64_t index)
{
    const uint64_t leading = (uint64_t)1 << index;
    const uint64_t keep_bits = leading - 1;
    return n & keep_bits;
}

uint64_t clear_msb(const uint64_t n)
{
    return bzhi(n, bsr(n));
}

int main(void)
{
    uint64_t i;
    /* start at 1: __builtin_clzll(0) is undefined */
    for (i = 1; i < (uint64_t)1 << 16; ++i) {
        printf("%" PRIu64 "\n", clear_msb(i));
    }
    return 0;
}
Both assembly and C versions lend themselves naturally to being replaced with 32-bit instructions, as the original question was posed.
You can do
unsigned resetLeadingBit(uint32_t x) {
    return x & ~(0x80000000U >> __builtin_clz(x));
}
For MSVC there is _BitScanReverse, which returns the same index as 31 - __builtin_clz().
Actually it's the other way around: BSR is the natural x86 instruction, and the GCC intrinsic is implemented as 31 - BSR.
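To address the portability part of the question, here is a hedged sketch of one way to hide the two intrinsics behind a common helper (the bsr32/clr_msb32 names are made up; x must be non-zero for either intrinsic):
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>
static unsigned bsr32(uint32_t x)          /* x must be non-zero */
{
    unsigned long index;
    _BitScanReverse(&index, x);            /* index of highest set bit */
    return (unsigned)index;
}
#else
static unsigned bsr32(uint32_t x)          /* x must be non-zero */
{
    return 31u - (unsigned)__builtin_clz(x);
}
#endif

static uint32_t clr_msb32(uint32_t x)      /* clear the highest set bit */
{
    return x & ~(UINT32_C(1) << bsr32(x));
}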

SIMD code for exponentiation

I am using SIMD to compute a fast (modular) exponentiation result, and I compare the timing against non-SIMD code. The exponentiation is implemented using the square-and-multiply algorithm.
Ordinary(non-simd) version of code:
b = 1;
for (i = WPE - 1; i >= 0; --i) {
    ew = e[i];
    for (j = 0; j < BPW; ++j) {
        b = (b * b) % p;
        if (ew & 0x80000000U) b = (b * a) % p;
        ew <<= 1;
    }
}
SIMD version:
B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;
P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;
for (i = WPE - 1; i >= 0; --i) {
    EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];
    for (j = 0; j < BPW; ++j) {
        B.v *= B.v; B.v -= (B.v / P.v) * P.v;
        EWV.v = _mm_srli_epi32(EW.v, 31);
        M.data[0] = (EWV.data[0]) ? a1 : 1U;
        M.data[1] = (EWV.data[1]) ? a2 : 1U;
        M.data[2] = (EWV.data[2]) ? a3 : 1U;
        M.data[3] = (EWV.data[3]) ? a4 : 1U;
        B.v *= M.v; B.v -= (B.v / P.v) * P.v;
        EW.v = _mm_slli_epi32(EW.v, 1);
    }
}
The issue is that, although it computes correctly, the SIMD version is taking more time than the non-SIMD version.
Please help me figure out the reasons. Any suggestions on SIMD coding are also welcome.
Thanks & regards,
Anup.
All operations in the for loops should be SIMD operations, not only two. The time taken to set up the arguments for your two intrinsics makes this less optimal than your original example (which is most likely optimized by the compiler anyway).
A SIMD loop for 32 bit int data typically looks something like this:
for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    __m128i va = _mm_load_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_load_si128((const __m128i *)&B[i]);
    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);
    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}
If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.
Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited number of supported instructions and data types that a given SIMD architecture provides. You will often need to exploit a priori knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32 bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
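As a small illustration of that last point (made-up values, not the poster's algorithm): if the operands are known to fit in 16 bits, SSE2's _mm_mullo_epi16 multiplies eight pairs at once, whereas a full 32x32-bit packed multiply needs SSE4.1's _mm_mullo_epi32 or extra shuffling around _mm_mul_epu32.
#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    short a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    short b[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    short c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_mullo_epi16(va, vb));  /* 8 products at once */

    for (int i = 0; i < 8; ++i)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}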
