if condition in loop and neon SIMD - arm

I am trying to write NEON SIMD code for the scalar code below:
Scalar code :
int *xt = new int[50];
float32_t input1[16] = {12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,12.0f,};
float32_t input2[16] = {13.0f,12.0f,9.0f,12.0f,12.0f,12.0f,12.0f,12.0f,13.0f,12.0f,9.0f,12.0f,12.0f,12.0f,12.0f,12.0f};
float32_t threshq = 13.0f;
uint32_t corners_count = 0;
for (uint32_t x = 0; x < 16; x++)
{
if ( (input1[x] == input2[x]) && (input2[x] > threshq) )
{
xt[corners_count] = x;
corners_count++;
}
}
Neon:
float32x4_t t1,t2,t3;
uint32x4_t rq1,rq2,rq3;
t1 = vld1q_f32(input1); // 12 12 12 12
t2 = vld1q_f32(input2); // 13 12 09 12
t3 = vdupq_n_f32(threshq); // 13 13 13 13
rq1 = vceqq_f32(t1,t2); // condition to check for input1 equal to input2
rq2 = vcgtq_f32(t1,t3); // condition to check for input1 greater than to threshold
rq3 = vandq_u32(rq1,rq2); // anding the result of two conditions
for( int i = 0;i < 4; i++){
corners_count = corners_count + rq3[i];
//...Not able to write a logic in neon for the same
}
I cannot work out how to write this logic in NEON.
Can anyone guide me? I have worn myself out thinking about it.

Because of the dependencies in your loop, I think you need to refactor your code into a SIMD loop followed by a scalar loop. Pseudo code:
// SIMD loop
for each set of 4 float elements
apply SIMD threshold test
store 4 x bool results in temp[]
// scalar loop
for each bool element in temp[]
if temp[x]
xt[corners_count] = x
corners_count++
This way you get the benefits of SIMD for most of the operations, and you just have to resort to scalar code for the last part.
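The pseudo code above could be sketched in C roughly as follows. This is only a sketch under some assumptions: the function names build_mask and collect_indices are made up for illustration, the mask buffer is a temporary the caller provides, and the NEON path is only compiled when __ARM_NEON is defined (a plain scalar loop handles the tail and non-ARM builds).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Pass 1 (SIMD where available): one mask word per element,
   all ones when in1[x] == in2[x] && in2[x] > thresh, else 0.
   build_mask is a hypothetical helper name. */
void build_mask(const float *in1, const float *in2, float thresh,
                uint32_t *mask, size_t n)
{
    size_t x = 0;
#if defined(__ARM_NEON)
    const float32x4_t t = vdupq_n_f32(thresh);
    for (; x + 4 <= n; x += 4) {
        float32x4_t a = vld1q_f32(in1 + x);
        float32x4_t b = vld1q_f32(in2 + x);
        uint32x4_t eq = vceqq_f32(a, b);   /* in1 == in2 */
        uint32x4_t gt = vcgtq_f32(b, t);   /* in2 > thresh */
        vst1q_u32(mask + x, vandq_u32(eq, gt));
    }
#endif
    /* Scalar tail, and the whole loop on non-NEON builds. */
    for (; x < n; x++)
        mask[x] = (in1[x] == in2[x] && in2[x] > thresh) ? UINT32_MAX : 0u;
}

/* Pass 2 (scalar): store the index of every set mask word in xt
   and return how many were stored. */
uint32_t collect_indices(const uint32_t *mask, size_t n, int *xt)
{
    uint32_t count = 0;
    for (size_t x = 0; x < n; x++) {
        if (mask[x])
            xt[count++] = (int)x;
    }
    return count;
}
```

Note that with the inputs from the question and threshq = 13.0f nothing passes the test, because no element of input2 is strictly greater than 13 while also equal to input1.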


Fill an array at index n with m times data without bit-fields

I try to send a maximum of 8 bytes of data. The first 4 bytes are always the same and involve defined commands and an address. The last 4 bytes should be variable.
So far I'm using this approach. Unfortunately, I was told not to use any for loops in this case.
// Construct data
local_transmit_buffer[0] = EEPROM_CMD_WREN;
local_transmit_buffer[1] = EEPROM_CMD_WRITE;
local_transmit_buffer[2] = High(MSQ_Buffer.address);
local_transmit_buffer[3] = Low(MSQ_Buffer.address);
uint_fast8_t i = 0;
for(i = 0; i < MSQ_Buffer.byte_lenght && i < 4; i++){ // assign data, at most 4 bytes
local_transmit_buffer[i + 4] = MSQ_Buffer.dataPointer[i];
}
This is some test code I wrote while trying to solve my problem:
#include <stdio.h>
__UINT_FAST8_TYPE__ local_transmit_buffer[8];
__UINT_FAST8_TYPE__ MSQ_Buffer_data[8];
void print_local(){
for (int i = 0; i < 8; i++)
{
printf("%d ", local_transmit_buffer[i]);
}
printf("\n");
}
void print_msg(){
for (int i = 0; i < 8; i++)
{
printf("%d ", MSQ_Buffer_data[i]);
}
printf("\n");
}
int main(){
// assign all local values to 0
for (int i = 0; i < 8; i++)
{
local_transmit_buffer[i] = 0;
} print_local();
// assign msg values 1..8
for (int i = 0; i < 8; i++)
{
MSQ_Buffer_data[i] = i + 1;
} print_msg();
*(local_transmit_buffer + 3) = (__UINT_FAST32_TYPE__)MSQ_Buffer_data;
printf("\n");
print_local();
return 0;
}
The first loop fills local_transmit_buffer with 0's and the second fills MSQ_Buffer_data with 1,2,3,...
local_transmit_buffer -> 0 0 0 0 0 0 0 0
MSQ_Buffer_data -> 1 2 3 4 5 6 7 8
Now I want to assign the first 4 values of MSQ_Buffer_data to local_transmit_buffer like this:
local_transmit_buffer -> 0 0 0 0 1 2 3 4
Is there another way of solving this problem without using for loops or a bit_field?
Solved:
I used the memcpy function to solve my problem
// uint_fast8_t i = 0;
// for(i = 0; i < MSQ_Buffer.byte_lenght && i < 4; i++){ // assign data
// local_transmit_buffer[i + 4] = MSQ_Buffer.dataPointer[i];
// }
// copy a defined number data from the message to the local buffer to send
memcpy(&local_transmit_buffer[4], MSQ_Buffer.dataPointer, local_save_data_length); // pass the pointer itself, not its address
Either just unroll the loop manually by typing out each line, or simply use memcpy. In this case there's no reason why you need abstraction layers, so I'd write the sanest possible code, which is just manual unrolling (and get rid of icky macros):
uint8_t local_transmit_buffer [8];
...
local_transmit_buffer[0] = EEPROM_CMD_WREN;
local_transmit_buffer[1] = EEPROM_CMD_WRITE;
local_transmit_buffer[2] = (uint8_t) ((MSQ_Buffer.address >> 8) & 0xFFu);
local_transmit_buffer[3] = (uint8_t) (MSQ_Buffer.address & 0xFFu);
local_transmit_buffer[4] = MSQ_Buffer.dataPointer[0];
local_transmit_buffer[5] = MSQ_Buffer.dataPointer[1];
local_transmit_buffer[6] = MSQ_Buffer.dataPointer[2];
local_transmit_buffer[7] = MSQ_Buffer.dataPointer[3];
It is not obvious why you can't use a loop, though. This doesn't look like the actual EEPROM programming (where overhead code might cause hiccups), but just preparations for it. Start to question such requirements.
Also note that you should use uint8_t rather than __UINT_FAST8_TYPE__. Never use homebrewed types; always use stdint.h. Even the standard fast types are wrong for a RAM buffer used for EEPROM programming: uint_fast8_t may be wider than 8 bits, and such a buffer must never contain padding. This is a bug.
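For comparison, the memcpy variant could look something like the sketch below. The function name build_frame is hypothetical, the two command values are placeholders rather than values taken from any datasheet, and the length is clamped to 4 so the 8-byte buffer cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder command bytes, assumed for illustration only. */
#define EEPROM_CMD_WREN  0x06u
#define EEPROM_CMD_WRITE 0x02u

/* Fill an 8-byte frame: 4 fixed header bytes, then up to 4 data bytes.
   Returns the number of bytes in the frame that are valid to send. */
size_t build_frame(uint8_t frame[8], uint16_t address,
                   const uint8_t *data, size_t len)
{
    if (len > 4u)
        len = 4u;                       /* never write past frame[7] */

    frame[0] = EEPROM_CMD_WREN;
    frame[1] = EEPROM_CMD_WRITE;
    frame[2] = (uint8_t)(address >> 8); /* high byte */
    frame[3] = (uint8_t)(address & 0xFFu);
    memcpy(&frame[4], data, len);       /* data is already a pointer */
    return 4u + len;
}
```

Because the elements are uint8_t on both sides, len counts bytes and elements alike, which is not guaranteed with the fast types.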

What is the matrix/vector operation that corresponds to this code?

Here is the code:
long long mul(long long x)
{
uint64_t M[64] = INIT;
uint64_t result = 0;
for ( int i = 0; i < 64; i++ )
{
uint64_t a = x & M[i];
uint64_t b = 0;
while ( a ){
b ^= a & 1;
a >>= 1;
}
result |= b << (63 - i);
}
return result;
}
This code implements matrix-vector multiplication over GF(2): it returns result as the product of the 64x64 matrix M and the 64-element vector x.
I want to know what linear algebraic operation( on GF(2) ) this code is:
long long unknown(long long x)
{
uint64_t A[] = INIT;
uint64_t a = 0, b = 0;
for( i = 1; i <= 64; i++ ){
for( j = i; j <= 64; j++ ){
if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
a ^= A[b];
b++;
}
}
return a;
}
I want to know what linear algebraic operation( on GF(2) ) this code is:
Of course you mean GF(2)^64, the space of 64-dimensional vectors over GF(2).
Consider first the loop structure:
for( i = 1; i <= 64; i++ ){
for( j = i; j <= 64; j++ ){
That's looking at every distinct pair of indices (the indices themselves not necessarily distinct from each other). That should provide a first clue. We then see
if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
, which is testing whether vector x has both bit i and bit j set. If it does, then we add a row of matrix A into accumulation variable a, by vector sum (== element-wise exclusive or). By incrementing b on every inner-loop iteration, we ensure that each iteration services a different row of A. And that also tells us that A must have 64 * 65 / 2 = 2080 rows (that matter).
In general, this is not a linear operation at all. The criterion for an operation o on a vector space over GF(2) to be linear boils down to this expression holding for all pairs of vectors x and y:
o(x + y) = o(x) + o(y)
Now, for notational convenience, let's consider the space GF(2)^2 instead of GF(2)^64; the result can be extended from the former to the latter simply by adding zeroes. Let x be the bit vector (1, 0) (represented, for example, by the integer 2). Let y be the bit vector (0, 1) (represented by the integer 1). And let A be this matrix:
1 0
0 1
1 0
Your operation has the following among its results:
operand   result   as integer   comment
x         (1, 0)   2            Only the first row is accumulated
y         (1, 0)   2            Only the third row is accumulated
x + y     (0, 1)   1            All rows are accumulated
Here o(x) + o(y) = (1, 0) + (1, 0) = (0, 0), whereas o(x + y) = (0, 1). So it is not the case that o(x) + o(y) = o(x + y) for this x, y, and characteristic A, and the operation is not linear for this A.
There are matrices A for which the corresponding operation is linear, but what linear operation they represent will depend on A. For example, it is possible to represent a wide variety of matrix-vector multiplications this way. It's not clear to me whether linear operations other than matrix-vector multiplications can be represented in this form, but I'm inclined to think not.
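The small counterexample can be checked mechanically. Below is a sketch that generalizes the loop from the question to an n-bit vector; the name pair_xor and the width parameter n are assumptions added for illustration, and n = 64 corresponds to the original code.

```c
#include <stdint.h>

/* For every index pair i <= j (1-based, bit 1 being the most significant
   of the n bits) whose bits are both set in x, XOR row A[b] into the
   accumulator. b advances once per pair, so A needs n*(n+1)/2 rows.
   Valid for 1 <= n <= 64. */
uint64_t pair_xor(uint64_t x, const uint64_t *A, unsigned n)
{
    uint64_t acc = 0;
    unsigned b = 0;
    for (unsigned i = 1; i <= n; i++) {
        for (unsigned j = i; j <= n; j++) {
            if (((x >> (n - i)) & 1u) && ((x >> (n - j)) & 1u))
                acc ^= A[b];
            b++;
        }
    }
    return acc;
}
```

With n = 2 and the rows (1, 0), (0, 1), (1, 0) encoded as the integers 2, 1, 2, the function returns 2 for x = 2, 2 for x = 1, and 1 for x = 3, matching the table: XORing the first two results gives 0, not 1.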

Floating average with reading of ADC values

I want to compute a moving average or something similar, because I am getting noisy values from the ADC. This is my first try at computing a moving average, but the value goes to 0 every time. Can you help me?
This is part of code, which makes this magic:
char buffer[8]; // room for the digits, '\n' and the terminating '\0'
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
adc_result0 = adc_read(0); // read adc value at PA0
samples++;
//adc_result1 = adc_read(1); // read adc value at PA1
ADCaverage = (ADCaverage + adc_result0)/samples;
sprintf(buffer, "%d\n", (int)ADCaverage);
char * p = buffer;
while (*p) { USART_Transmit0(*p++); }
_delay_ms(1000);
}
return(0);
}
I send this result over USART to display the value.
Your equation is not correct.
Let s_n = (sum_{i=1}^{n} x[i])/n. Then:
s_(n-1) = (sum_{i=1}^{n-1} x[i])/(n-1)
sum_{i=1}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=1}^{n} x[i] = n*s_n
sum_{i=1}^{n} x[i] = sum_{i=1}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n]-s_(n-1))
s_n = s_(n-1) + (x[n]-s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;
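As a small self-contained sketch of that update rule (the struct and function names here are made up for illustration, and the sample values stand in for real ADC readings):

```c
/* Running (cumulative) mean state: current mean and number of samples. */
typedef struct {
    float mean;
    unsigned long count;
} running_mean_t;

/* Fold one new sample into the mean using
   s_n = s_(n-1) + (x[n] - s_(n-1)) / n  and return the new mean. */
float running_mean_update(running_mean_t *s, float x)
{
    s->count++;
    s->mean += (x - s->mean) / (float)s->count;
    return s->mean;
}
```

Starting from a zeroed state and feeding 10, 20, 30 gives means of 10, 15 and 20 in turn. Keep in mind this averages over all samples since start-up, so it reacts more and more slowly as count grows.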
You can use an exponential moving average which only needs 1 memory unit.
y[0] = (x[0] + y[-1] * (a-1) )/a
Where a is the filter factor.
If a is a power of 2, say a = 2^k, you can use shifts and optimize for speed significantly:
y[0] = ( x[0] + ( ( y[-1] << k ) - y[-1] ) ) >> k
This works especially well with left-aligned ADCs. Just keep an eye on the word size of the shift result.
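A minimal integer sketch of the shift form, assuming 16-bit samples and a 32-bit accumulator so the left shift has headroom; the function name ema_update is hypothetical, and the shift count is called k here (the filter factor is then 2^k):

```c
#include <stdint.h>

/* One step of the exponential moving average with factor 2^k:
   y = (x + (2^k - 1) * y_prev) / 2^k, using shifts only.
   The caller must keep y_prev small enough that y_prev << k fits in 32 bits. */
uint32_t ema_update(uint32_t y_prev, uint16_t x, unsigned k)
{
    return ((uint32_t)x + ((y_prev << k) - y_prev)) >> k;
}
```

Because the final shift truncates, the output can settle up to 2^k - 1 counts below a constant input. Keeping extra fractional bits in the state (which is what a left-aligned ADC result effectively provides) reduces that error.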

Modulo optimization

I have a C program which does some extensive swapping operations on a large array. It has a modulo operation in its tight loop. In fact there is an integer in the range [-N|N[ with N a power of 2 and it should be wrapped to [0,N[.
Example with N=4: -4 => 0, -3 => 1, -2 => 2, -1 => 3, 0 => 0, ..., 3 => 3
At first I tried the version 1 below but was surprised that version 2 is actually notably faster even though it has a conditional expression.
Can you explain why version 2 is faster than version 1 for this special case?
Version 1:
#define N (1<<(3*5))
inline int modBitAnd(int x)
{
return (x & (N-1));
}
Runtime: 17.1 seconds (for the whole program)
Version 2:
inline int modNeg1(int x)
{
return (x < 0 ? x + N : x);
}
Runtime: 14.6 seconds (for the whole program)
The program is compiled with GCC 4.8.2 using -std=c99 -O3.
Edit:
Here is the main loop in my program:
int en(uint16_t* p, uint16_t i, uint16_t v)
{
uint16_t n1 = p[modNeg1((int)i - 1)];
uint16_t n2 = p[modBitAnd((int)i + 1)];
uint16_t n3 = p[modNeg1((int)i - C_WIDTH)];
uint16_t n4 = p[modBitAnd((int)i + C_WIDTH)];
return d(n1,v) + d(n2,v) + d(n3,v) + d(n4,v);
}
void arrange(uint16_t* p)
{
for(size_t i=0; i<10000000; i++) {
uint16_t ia = random(); // random integer [0|2^15[
uint16_t va = p[ia];
uint16_t ib = random(); // random integer [0|2^15[
uint16_t vb = p[ib];
if(en(p,ia,vb) + en(p,ib,va) < en(p,ia,va) + en(p,ib,vb)) {
p[ia] = vb;
p[ib] = va;
}
}
}
int d(uint16_t a, uint16_t b) is a distance function e.g. abs((int)a-(int)b).
This is how p is initialized:
uint16_t* p = malloc(sizeof(uint16_t)*N);
for(unsigned i=0; i<N; i++) p[i] = i; // index instead of *p++ so p still points at the start
First I used modBitAnd everywhere, but found out that modNeg1 is actually faster for the two cases where it can be used.
First take a few stackshots to find out where the time is actually going. Your mod functions will grab some fraction of the samples, but you've also got two calls to random, plus a fair amount of array indexing. Also, it looks like you've got four calls to en with some arguments that are the same, so maybe your modularity is leading to repeat calls to the mod functions.
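Before comparing timings it is also worth confirming that the two wrappers really agree over the stated range [-N, N). A quick sketch, using plain functions instead of inline ones and a made-up checker named wraps_agree; it assumes the usual two's complement representation for negative int values:

```c
#define N (1 << (3 * 5))

/* Version 1 from the question: mask with N-1. */
int mod_bit_and(int x)
{
    return x & (N - 1);
}

/* Version 2 from the question: add N only when x is negative. */
int mod_neg(int x)
{
    return x < 0 ? x + N : x;
}

/* Returns 1 if both versions give the same result for every x in [-N, N). */
int wraps_agree(void)
{
    for (int x = -N; x < N; x++) {
        if (mod_bit_and(x) != mod_neg(x))
            return 0;
    }
    return 1;
}
```

Since the results are identical on that range, any timing difference comes from the generated code and its surroundings, which is exactly what sampling the running program should reveal.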

How to loop and increase by 0.01 everytime?

I'm really confused on this code.
Here's what I want it to do: Start with a "v" value of 5, carry out the rest of the functions/calculations, increase the "v" value by 0.01, carry out the functions/calculations, then increase the "v" value by 0.01 again, carry out the functions...do this 500 times or until a "v" value of 10.00 is reached, whichever is easier to code.
Here is my code at the moment:
//start loop over v
for(iv=5;iv<=500;iv++) {
v=0.01*iv;
//Lots and lots of calculations with v here
}
Here is what I get: I tried setting iv<=10 so it does 10 loops only just so I could test it first before leaving it on all night. It did only 6 loops, starting at v=0.05 and ending at 0.1. So the problem is that a) it didn't run for 10 loops, b) it didn't start at 5.00, it started at 0.05.
Any help would be appreciated.
EDIT: Holy crap, so many answers! I've tried 2 different answers so far, both work! I've been staring at this and changing code around for 3 hours, can't believe it was so easy.
You need to start at iv = 500. If you want 10 loops and iv++ is the update, then you stop before 510.
Reason: v = 0.01*iv, so v = 5 means iv = 5/0.01 = 500. As for the number of iterations, if your for loop is of the form for (x = N; x < M; x++) (constant N and M), then max(0, M-N) loops are executed, if x is not changed in the loop and no weird stuff (e.g. overflow, hidden casts of negative numbers to unsigned, etc.) occurs.
EDIT
Instead of using v = 0.01 * iv, v = iv / 100.0 is probably more accurate. Reason: 0.01 is not exactly representable in floating point, but 100.0 is.
Changing SiegeX's code so it uses integers ("more accurate"):
double dv;
int iv;
for(iv = 500; iv <= 1000; iv += 1)
{
dv = (double)iv / 100.0;
}
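To make the integer-counter idea concrete, here is a sketch that sweeps v from 5.00 to 10.00 inclusive and reports what it saw. The function name sweep and its output parameters are assumptions for illustration; the real calculations would go where the comment is.

```c
/* Drive the loop with an integer counter and derive v from it,
   so rounding errors do not accumulate from step to step.
   Stores the first and last v visited and returns the iteration count. */
int sweep(double *first, double *last)
{
    int count = 0;
    for (int iv = 500; iv <= 1000; iv++) {
        double v = iv / 100.0;   /* 100.0 is exactly representable */
        if (count == 0)
            *first = v;
        *last = v;
        /* ... calculations with v would go here ... */
        count++;
    }
    return count;
}
```

Including both endpoints gives 501 iterations; use iv < 1000 if exactly 500 steps are wanted and v = 10.00 should be left out.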
double iv;
for(iv = 5.0; iv <= 10.0 ; iv += 0.01) {
/* stuff here */
}
int i;
double v;
v = 5;
for (i = 0; i < 500; i++)
{
v += 0.01;
// Do Calculations Here.
if (v >= 10.00) break;
}
This gives you both. This will iterate at most 500 times, but will break out of that loop if the v value reaches (or exceeds) 10.00.
If you wanted only one or the other:
The 10.00 Version:
double v;
v = 5.0;
while ( v < 10.00 )
{
v += 0.01;
// Do Calculations Here.
}
The 500 iterations version:
double v;
int i;
v = 5.0;
for( i = 0; i < 500; i++ )
{
v += 0.01;
// Do Calculations.
}
(Note that this isn't C99, which allows for a cleaner declaration syntax in the loops).
iv <= 10 doesn't do it for 10 loops, it does it until iv is greater than 10.
//start loop over v
for(iv=0;iv<500;iv++) //loop from 0 to 499
{
v=v+0.01; //increase v by 0.01
//Lots and lots of calculations with v here
}
This should do it.
