Among the NEON intrinsics, there are four load families (vld1, vld2, vld3, vld4) that perform 1-way to 4-way de-interleaving. But how can an 8-way de-interleave be implemented?
For example, the data is:
uint8_t src[64] = {0, 1, 2, 3, 4, 5, 6, 7, ..., 63};
After loading the data into NEON registers and performing an 8-way de-interleave, I would like src_reg1 and src_reg2 to hold the following values:
uint8x8x4_t src_reg1;
uint8x8x4_t src_reg2;
src_reg1.val[0] = {0, 8, 16, 24, 32, 40, 48, 56}
src_reg1.val[1] = {1, 9, 17, 25, ...}
src_reg1.val[2] = {2, 10, 18, 26, ...}
src_reg1.val[3] = {3, 11, 19, 27, ...}
src_reg2.val[0] = {4, 12, 20, 28, ...}
src_reg2.val[1] = {5, 13, 21, 29, ...}
src_reg2.val[2] = {6, 14, 22, 30, ...}
src_reg2.val[3] = {7, 15, 23, 31, 39, 47, 55, 63}
Does anyone know how to achieve this? Thank you very much!
It's as simple as doing two 4-element loads to get two sets of 4-way deinterleaved data, then further deinterleaving those sets with each other via one of the register-interleaving operations, e.g.:
uint8x8x4_t src_reg1 = vld4_u8(src);
uint8x8x4_t src_reg2 = vld4_u8(src + 32);
for (int i = 0; i < 4; i++) {
    // This is a bit of a faff thanks to the intrinsic datatypes, but
    // compiling at -O3 tidies it all up into sensible code
    uint8x8x2_t tmp = vuzp_u8(src_reg1.val[i], src_reg2.val[i]);
    src_reg1.val[i] = tmp.val[0];
    src_reg2.val[i] = tmp.val[1];
}
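For reference, here is a minimal, self-contained test of the above (a sketch of my own, assuming a toolchain that provides arm_neon.h). The printed lanes should match the expected src_reg1.val[0]:

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    uint8_t src[64];
    for (int i = 0; i < 64; i++) src[i] = (uint8_t)i;

    uint8x8x4_t src_reg1 = vld4_u8(src);
    uint8x8x4_t src_reg2 = vld4_u8(src + 32);
    for (int i = 0; i < 4; i++) {
        uint8x8x2_t tmp = vuzp_u8(src_reg1.val[i], src_reg2.val[i]);
        src_reg1.val[i] = tmp.val[0];
        src_reg2.val[i] = tmp.val[1];
    }

    uint8_t check[8];
    vst1_u8(check, src_reg1.val[0]);
    for (int i = 0; i < 8; i++) printf("%u ", check[i]); // expect 0 8 16 24 32 40 48 56
    printf("\n");
    return 0;
}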
I am implementing a simple multiplication over an array using ARM NEON intrinsics. The input is a uint8 array and the output is a uint16 array. However, the regular native code is faster than the NEON-optimized one. Can anyone help figure out how I can improve the NEON code?
My regular code is:
uint16_t scale_factor = 300;
for(int i = 0; i < output_size; i++)
{
    out_16bit[i] = (uint16_t)(in_ptr[i] * scale_factor);
}
My NEON code is:
uint16_t* out_ptr = out_16bit;
uint8_t* in_ptr = in_8bit;
uint16_t scale_factor = 300;
for(int i = 0; i < out_size/16; i++)
{
    uint8x16_t in_v0 = vld1q_u8(in_ptr);
    in_ptr += 16;
    uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
    uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
    uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
    uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);
    // the code below takes a long time
    vst1q_u16(out_ptr, res_0);
    vst1q_u16(out_ptr + 8, res_1);
    out_ptr += 16;
}
I also did some profiling and found that if I comment out either the vst1q_u16 calls or out_ptr += 16, the code runs fast, but if I keep both as above, it is very slow. So I guess the pointer increment might be waiting for the vst1q_u16 stores to finish? I then updated the NEON code to put some work between the vst1q_u16 calls and out_ptr += 16, as below:
uint8x16_t in_v0 = vld1q_u8(in_ptr);
uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);
vst1q_u16(out_ptr, res_0);
vst1q_u16(out_ptr + 8, res_1);
for(int i = 1; i < out_size/16; i++)
{
    in_v0 = vld1q_u8(in_ptr);
    in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
    in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
    out_ptr += 16;
    res_0 = vmulq_n_u16(in_16_v0, scale_factor);
    res_1 = vmulq_n_u16(in_16_v1, scale_factor);
    vst1q_u16(out_ptr, res_0);
    vst1q_u16(out_ptr + 8, res_1);
}
But this change didn't help... Please advise what I should do. Thank you.
The simple answer, as in the comments, is auto-vectorization.
I'm not sure about clang 6, but more recent clang will certainly auto-vectorize to Neon by default when targeting Neon platforms, and it will be very hard to beat that auto-vectorization on something as simple as this multiplication. Maybe with the best loop unrolling for your particular processor, but it is very easy to do worse than auto-vectorization. Godbolt is a very good way to compare, along with profiling all your changes.
All the comments make good points too.
For more documentation on best practice for Neon intrinsics, Arm's Neon microsite has very useful information, especially the doc on Optimizing C with Neon intrinsics.
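As a rough illustration of that advice (my own sketch, not code from the question), the whole kernel can be left to the compiler: a plain scalar loop with restrict-qualified pointers, built at -O3 for a Neon target, is typically auto-vectorized into the same widen/multiply/store sequence, with unrolling and scheduling chosen for the target core:

#include <stdint.h>
#include <stddef.h>

void scale_u8_to_u16(uint16_t *restrict out, const uint8_t *restrict in,
                     size_t n, uint16_t scale_factor)
{
    // restrict tells the compiler that out and in do not alias, which is
    // what lets it vectorize the loop; inspect the output on Godbolt to confirm.
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)(in[i] * scale_factor);
}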
I have a vector with many frequencies. I am trying to program a sine wave that generates one period for each frequency and puts them all into one vector (similar to a sweep signal).
Finally I want to plot this...
I already tried the following, but it doesn't work correctly:
%fr = frequency vector with 784 elements from 2.0118e+04 down to 1.9883e+04 Hz
fs = 48000; % Sampling frequency [Hz]
tstart3 = 0;
tstep3 = 1/fs;
tend3 = length(fr)*(1/min(fr)) - tstep3;
t3 = tstart3:tstep3:tend3;
sin3 = [];
for i = 1:length(fr)/2
    sin3 = [sin3 sin(2*pi*fr(i)*t3)];
end
tstart4 = 0;
tstep4 = 1/fs;
tend4 = tstep4*length(sin3);
t4 = tstart4:tstep4:tend4-tstep4;
figure;
plot(t4,sin3)
Could you please help me?
Thanks!
If I reverse-engineered your code correctly, it seems you want to generate a chirp signal. It can be done more efficiently as follows:
fr = linspace(2.0118e4, 1.9883e4, 784); % Frequency content
%fr = linspace(2e4, 1e4, 784); % Try this for a wider chirp
fs = 48e3;
phi = cumsum(2*pi*fr/fs);
s1 = sin(phi);
spectrogram(s1, 128, 120, 128, fs); % View the signal in time vs frequency
Consider the following two pieces of code; the first is the C version:
void __attribute__((no_inline)) proj(uint8_t * line, uint16_t length)
{
    uint16_t i;
    int16_t tmp;
    for(i=HPHD_MARGIN; i<length-HPHD_MARGIN; i++) {
        tmp = line[i-3] - 4*line[i-2] + 5*line[i-1] - 5*line[i+1] + 4*line[i+2] - line[i+3];
        hphd_temp[i] = ABS(tmp);
    }
}
The second is the same function (except for the borders), using NEON intrinsics:
void __attribute__((no_inline)) proj_neon(uint8_t * line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8, p2p9, p4p11, p5p12, p6p13, m4, m5;
    uint16x8_t result;
    m4 = vdup_n_u8(4);
    m5 = vdup_n_u8(5);
    b0b7 = vld1_u8(line);
    for(i = 0; i < length - 16; i += 8) {
        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7, b8b15, 1);
        p2p9  = vext_u8(b0b7, b8b15, 2);
        p4p11 = vext_u8(b0b7, b8b15, 4);
        p5p12 = vext_u8(b0b7, b8b15, 5);
        p6p13 = vext_u8(b0b7, b8b15, 6);
        result = vsubl_u8(b0b7, p6p13);       // p[-3] - p[+3]
        result = vmlal_u8(result, p2p9, m5);  // +5 * p[-1]
        result = vmlal_u8(result, p5p12, m4); // +4 * p[+2]
        result = vmlsl_u8(result, p1p8, m4);  // -4 * p[-2]
        result = vmlsl_u8(result, p4p11, m5); // -5 * p[+1]
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
    /* todo: remaining pixels */
}
I am disappointed by the performance gain: it is around 10-15%. If I look at the generated assembly:
The C version is transformed into a 108-instruction loop.
The NEON version is transformed into a 72-instruction loop.
But one iteration of the NEON loop processes 8 times as much data as an iteration of the C loop, so a dramatic improvement should be seen.
Do you have any explanation for the small difference between the two versions?
Additional details:
Test data is a 10 Mpix image; computation time is around 2 seconds for the C version.
CPU: ARM Cortex-A8
I'm going to take a wild guess and say that caching (data) is the reason you don't see the big performance gain you are expecting. While I don't know if your chipset supports caching or at what level, if the data spans cache lines, has poor alignment, or is running in an environment where the CPU is doing other things at the same time (interrupts, threads, etc.), then that also could muddy your results.
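If memory really is the bottleneck, one thing worth experimenting with is an explicit software prefetch ahead of the loads. The sketch below is hypothetical (not from the question): __builtin_prefetch is a GCC/Clang builtin, hphd_temp is assumed to be the same int16_t buffer as in the original code, and the prefetch distance would need tuning on the actual Cortex-A8.

#include <arm_neon.h>
#include <stdint.h>

extern int16_t hphd_temp[];

void proj_neon_prefetch(uint8_t *line, uint16_t length)
{
    uint8x8_t m4 = vdup_n_u8(4), m5 = vdup_n_u8(5);
    uint8x8_t b0b7 = vld1_u8(line);
    for (int i = 0; i < length - 16; i += 8) {
        __builtin_prefetch(line + i + 64);   // pull upcoming input into the cache early
        uint8x8_t b8b15 = vld1_u8(line + i + 8);
        uint8x8_t p1p8  = vext_u8(b0b7, b8b15, 1);
        uint8x8_t p2p9  = vext_u8(b0b7, b8b15, 2);
        uint8x8_t p4p11 = vext_u8(b0b7, b8b15, 4);
        uint8x8_t p5p12 = vext_u8(b0b7, b8b15, 5);
        uint8x8_t p6p13 = vext_u8(b0b7, b8b15, 6);
        uint16x8_t result = vsubl_u8(b0b7, p6p13);
        result = vmlal_u8(result, p2p9, m5);
        result = vmlal_u8(result, p5p12, m4);
        result = vmlsl_u8(result, p1p8, m4);
        result = vmlsl_u8(result, p4p11, m5);
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
}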
Is there an intrinsic which allows one to add together all of the elements across the lanes of a vector? I am using NEON to multiply 8 pairs of numbers together, and I need to sum the results. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised):
int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;
p[0] = some_number;
p[1] = some_other_number;
//etc etc
pneon = vld1q_s16(p);
q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );
Is there a "better" way to do this?
If you're targeting the newer 64-bit Arm architecture (AArch64), then ADDV is just the right instruction for you.
Here's how your code will look with it.
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
sum = vaddvq_s16(result);
That's it. Just one instruction to sum up all of the lanes in the vector register.
Sadly, this instruction is not available in the older 32-bit Arm architecture.
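One caveat worth adding (my note, not part of the answer above): vaddvq_s16 produces a 16-bit sum, so adding eight 16-bit products can overflow before the result is widened to int32_t. If that matters for your data, AArch64 also has a widening reduction, e.g.:

sum = vaddlvq_s16(result); // SADDLV: adds all eight int16 lanes and returns an int32_t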
Something like this should work pretty optimally (caution: not tested):
const int16x4_t result_low = vget_low_s16(result); // Extract low 4 elements
const int16x4_t result_high = vget_high_s16(result); // Extract high 4 elements
const int32x4_t twopartsum = vaddl_s16(result_low, result_high); // Extend to 32 bits and add (4 partial 32-bit sums are formed)
const int32x2_t twopartsum_low = vget_low_s32(twopartsum); // Extract 2 low 32-bit partial sums
const int32x2_t twopartsum_high = vget_high_s32(twopartsum); // Extract 2 high 32-bit partial sums
const int32x2_t fourpartsum = vadd_s32(twopartsum_low, twopartsum_high); // Add partial sums (2 partial 32-bit sum are formed)
const int32x2_t eightpartsum = vpadd_s32(fourpartsum, fourpartsum); // Final reduction
const int32_t sum = vget_lane_s32(eightpartsum, 0); // Move to general-purpose registers
The same two-step pairwise reduction for float data (here summing a float32x4_t accumulator) looks like this:
temp = vadd_f32(vget_high_f32(variance_n), vget_low_f32(variance_n));
sum = vget_lane_f32(vpadd_f32(temp, temp), 0);
I am using SIMD to compute fast exponentiation results, and I am comparing the timing with non-SIMD code. The exponentiation is implemented using the square-and-multiply algorithm.
Ordinary (non-SIMD) version of the code:
b = 1;
for (i=WPE-1; i>=0; --i){
    ew = e[i];
    for(j=0; j<BPW; ++j){
        b = (b * b) % p;
        if (ew & 0x80000000U) b = (b * a) % p;
        ew <<= 1;
    }
}
SIMD version:
B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;
P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;
for (i=WPE-1; i>=0; --i) {
    EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];
    for (j=0; j<BPW; ++j){
        B.v *= B.v; B.v -= (B.v / P.v) * P.v;
        EWV.v = _mm_srli_epi32(EW.v, 31);
        M.data[0] = (EWV.data[0]) ? a1 : 1U;
        M.data[1] = (EWV.data[1]) ? a2 : 1U;
        M.data[2] = (EWV.data[2]) ? a3 : 1U;
        M.data[3] = (EWV.data[3]) ? a4 : 1U;
        B.v *= M.v; B.v -= (B.v / P.v) * P.v;
        EW.v = _mm_slli_epi32(EW.v, 1);
    }
}
The issue is that although it computes correctly, the SIMD version takes more time than the non-SIMD version.
Please help me figure out the reasons. Any suggestions on SIMD coding are also welcome.
Thanks & regards,
Anup.
All of the operations in the for loops should be SIMD operations, not only two of them. The time taken to set up the arguments for your two intrinsic calls makes this worse than your original example (which is most likely already optimized by the compiler).
A SIMD loop for 32-bit int data typically looks something like this:
for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    __m128i va = _mm_load_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_load_si128((const __m128i *)&B[i]);
    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);
    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}
If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.
Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited number of supported instructions and data types that a given SIMD architecture provides. You will often need to exploit a priori knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32 bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
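To make that last point concrete, here is a small hypothetical sketch (names and layout are mine): if the operands are known to fit in 16 bits, SSE2's _mm_mullo_epi16 gives eight products per instruction, whereas a general 32-bit multiply has to be pieced together on older SSE levels.

#include <emmintrin.h> // SSE2
#include <stdint.h>

void mul_u16_arrays(uint16_t *c, const uint16_t *a, const uint16_t *b, int n)
{
    // n is assumed to be a multiple of 8; unaligned loads/stores keep the
    // sketch simple, aligned variants are faster where alignment is guaranteed.
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
        __m128i vc = _mm_mullo_epi16(va, vb); // low 16 bits of each 16x16 product
        _mm_storeu_si128((__m128i *)&c[i], vc);
    }
}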