SIMD code for exponentiation - c

I am using SIMD to compute modular exponentiation quickly, and I am comparing the timing against non-SIMD code. The exponentiation is implemented using the square-and-multiply algorithm.
Ordinary (non-SIMD) version of the code:
b = 1;
for (i = WPE - 1; i >= 0; --i) {
    ew = e[i];
    for (j = 0; j < BPW; ++j) {
        b = (b * b) % p;
        if (ew & 0x80000000U) b = (b * a) % p;
        ew <<= 1;
    }
}
SIMD version:
B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;
P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;
for (i = WPE - 1; i >= 0; --i) {
    EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];
    for (j = 0; j < BPW; ++j) {
        B.v *= B.v; B.v -= (B.v / P.v) * P.v;
        EWV.v = _mm_srli_epi32(EW.v, 31);
        M.data[0] = (EWV.data[0]) ? a1 : 1U;
        M.data[1] = (EWV.data[1]) ? a2 : 1U;
        M.data[2] = (EWV.data[2]) ? a3 : 1U;
        M.data[3] = (EWV.data[3]) ? a4 : 1U;
        B.v *= M.v; B.v -= (B.v / P.v) * P.v;
        EW.v = _mm_slli_epi32(EW.v, 1);
    }
}
The issue is that, although it computes correctly, the SIMD version takes more time than the non-SIMD version.
Please help me find the reason. Any suggestions on SIMD coding are also welcome.
Thanks & regards,
Anup.

All operations in the loops should be SIMD operations, not only two of them. The time taken to set up the arguments for your two intrinsics outweighs any gain, whereas your original scalar version is most likely already well optimised by the compiler.

A SIMD loop for 32 bit int data typically looks something like this:
for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    // (the arrays must be 16-byte aligned for _mm_load_si128)
    __m128i va = _mm_load_si128((__m128i const *)&A[i]);
    __m128i vb = _mm_load_si128((__m128i const *)&B[i]);
    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);
    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}
If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.
Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited number of supported instructions and data types that a given SIMD architecture provides. You will often need to exploit a priori knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32 bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
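To make that last point concrete, here is a hedged sketch (not from the original answer; the helper name is invented) of how 32-bit lanes known to hold values below 2^16 can be multiplied with SSE2 alone, which has no 32-bit `_mm_mullo_epi32` (that only arrived with SSE4.1):

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: multiply the four 32-bit lanes of x and y, assuming
   every lane holds a value < 2^16 so each product fits in 32 bits.
   _mm_mullo_epi16 yields the low 16 bits of each 16-bit product and
   _mm_mulhi_epu16 the high 16 bits; because the upper half of every 32-bit
   lane is zero, the odd 16-bit products are zero and the two halves can be
   recombined with a shift and an OR. */
static inline __m128i mul_u16_lanes(__m128i x, __m128i y)
{
    __m128i lo = _mm_mullo_epi16(x, y); /* low 16 bits of each product  */
    __m128i hi = _mm_mulhi_epu16(x, y); /* high 16 bits of each product */
    return _mm_or_si128(lo, _mm_slli_epi32(hi, 16));
}
```

This keeps the whole multiplication in SIMD registers, which is exactly the kind of a-priori-knowledge trick the paragraph above describes.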

Related

Solving a large system of linear equations over the finite field F2

I have 10163 equations and 9000 unknowns, all over the finite field F2, in this style:
Of course my real system is much larger than this; I have 10163 rows and 9000 different x.
Presented in matrix form it is AX = B. A is a 10163x9000 coefficient matrix and it may be sparse, X is a 9000x1 vector of unknowns, and B is the result of their multiplication, mod 2.
Because of the large number of unknowns to solve for, this can be time consuming. I'm looking for a faster way to solve this system of equations in C.
I tried Gaussian elimination to solve this system. To make the elimination between rows more efficient, I store matrix A in a two-dimensional array of 64-bit words and let the last column of the array hold the value of B, so that XOR operations can reduce the computation time.
The code I am using is as follows:
uint8_t guss_x_main[R_BITS] = {0};
uint64_t tmp_guss[guss_j_num];
for (uint16_t guss_j = 0; guss_j < x_weight; guss_j++)
{
    uint64_t mask_1 = 1;
    uint64_t mask_guss = (mask_1 << (guss_j % GUSS_BLOCK));
    uint16_t eq_j = guss_j / GUSS_BLOCK;
    for (uint16_t guss_i = guss_j; guss_i < R_BITS; guss_i++)
    {
        if ((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
        {
            if (guss_x_main[guss_j] == 0)
            {
                guss_x_main[guss_j] = 1;
                for (uint16_t change_i = 0; change_i < guss_j_num; change_i++)
                {
                    tmp_guss[change_i] = equations_guss_byte[guss_j][change_i];
                    equations_guss_byte[guss_j][change_i] =
                        equations_guss_byte[guss_i][change_i];
                    equations_guss_byte[guss_i][change_i] = tmp_guss[change_i];
                }
            }
            else
            {
                GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
                             equations_guss_byte[guss_j], guss_j_num));
            }
        }
    }
    for (uint16_t guss_i = 0; guss_i < guss_j; guss_i++)
    {
        if ((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
        {
            GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
                         equations_guss_byte[guss_j], guss_j_num));
        }
    }
}
R_BITS = 10163, x_weight = 9000, GUSS_BLOCK = 64, and guss_j_num = x_weight / GUSS_BLOCK + 1. equations_guss_byte is a two-dimensional array of uint64_t in which the first x_weight / GUSS_BLOCK columns store matrix A and the last column stores vector B. xor_64() XORs two arrays, and GUARD() checks that a function call succeeded.
Using this method takes about 8 seconds to run on my machine. Is there a better way to speed up the calculation?
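For reference, xor_64() is not shown in the post; a minimal sketch of what such a routine presumably looks like (the signature and the 0-for-success convention are assumptions inferred from the GUARD() calls above):

```c
#include <stdint.h>

/* Hypothetical reconstruction of xor_64(): res[i] = a[i] ^ b[i] over len
   64-bit words; returns 0 on success so it can be wrapped in a
   GUARD()-style check. A plain loop like this is easily auto-vectorized
   by the compiler at -O2/-O3. */
static int xor_64(uint64_t *res, const uint64_t *a, const uint64_t *b,
                  uint64_t len)
{
    for (uint64_t i = 0; i < len; i++)
        res[i] = a[i] ^ b[i];
    return 0;
}
```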

Low pass shape curve with low cycle count

Hi, I want to achieve the above curve in software without using a DSP function. I was hoping to use a fast, low-cycle ARM operation such as multiply-accumulate.
Is there any fast way of doing this in C on an embedded ARM processor?
The curve shown is that of the simplest possible first-order filter, characterised by a 3dB cut-off frequency fc and a 6dB/octave (20dB/decade) roll-off. As an analogue filter it could be implemented as a simple passive RC filter thus:
In the digital domain such a filter is implemented by:
y[n] = a0 * x[n] + b1 * y[n-1]
where x are the input samples and y the output samples. Or in code:
void lowPassFilter( const tSample* x, tSample* y, size_t sample_count )
{
    static tSample y_1 = 0 ;
    for( size_t i = 0; i < sample_count; ++i )
    {
        y[i] = a0 * x[i] + b1 * y_1 ;
        y_1 = y[i] ;
    }
}
The filter is characterised by the coefficients:
a0 = 1 - x
b1 = x
where x is a value between 0 and 1 (I'll address the eradication of the implied floating point operations in due course):
x = e^(-2πfc)
where fc is the desired -3dB cut-off frequency expressed as a fraction of the sample rate. So for a sample rate of 32ksps and a cut-off frequency of 1kHz, fc = 1000/32000 = 0.03125, so:
b1 = x = e^(-2πfc) = 0.821725
a0 = 1 - x = 0.178275
Now naïvely plugging those constants into lowPassFilter() will result in floating-point code being generated; on an MCU without an FPU that might be prohibitive, and even with an FPU it might be sub-optimal. So in this case we might use a fixed-point representation. Since all the real values are less than one and the machine is 32-bit, a UQ0.16 representation is appropriate, as intermediate multiplication results will not then overflow a 32-bit machine word. This does require the sample width to be 16 bits or less (or scaled accordingly). Using fixed point, the code might look like:
typedef uint16_t tSample ;

#define b1 53852    // 0.821725 * 65536
#define a0 (65536 - b1)
#define FIXED_MUL( x, y ) (((uint32_t)(x) * (y)) >> 16)

void lowPassFilter( const tSample* x, tSample* y, size_t sample_count )
{
    static tSample y_1 = 0 ;
    for( size_t i = 0; i < sample_count; ++i )
    {
        y[i] = FIXED_MUL( a0, x[i] ) + FIXED_MUL( b1, y_1 ) ;
        y_1 = y[i] ;
    }
}
Now that is not a significant amount of processing for most ARM processors at the 32ksps suggested in this example. Obviously it depends what other demands are on the processor, but on its own this would not be a significant load, even without applying compiler optimisation. As with any optimisation, you should implement it, measure it and improve it only if necessary.
As a first stab I'd trust the compiler optimiser to generate code that in most cases will meet requirements, or at least be as good as you might achieve with handwritten assembler. Whether or not it chooses a multiply-accumulate instruction is out of your hands, but if it doesn't the chances are that there is no advantage.
Bear in mind that ARM Cortex-M4 and M7, for example, include DSP instructions not supported on M0 or M3 parts. The compiler may or may not utilise these, but the simplest way to guarantee it without resorting to assembler would be to use the CMSIS DSP library; whether that provides greater performance or better fidelity than the above, you would have to test.
Worth noting that lowPassFilter() retains its state statically, so it can be called iteratively for "blocks" of samples (from an ADC DMA transfer, for example); so you might have:
int dma_buffer_n = 0 ;
for(;;)
{
    waitEvent( DMA_BUFFER_READY ) ;
    lowPassFilter( dma_buffer[dma_buffer_n], output_buffer, DMA_BLOCK_SIZE ) ;
    dma_buffer_n = dma_buffer_n == 0 ? 1 : 0 ; // Flip buffers
}
The use of DMA double-buffering is likely to be far more important to performance than the filter function implementation. I have worked on a DSP application sampling two channels at 48ksps on a 72MHz Cortex-M3 with far more complex DSP requirements than this: each channel had a high-pass IIR, an 18-coefficient FIR and a Viterbi decoder. So I really do think that your assumption that this simple filter will not be fast enough is somewhat premature.

ARM NEON: Regular C code is faster than ARM Neon code in simple multiplication?

I am implementing a simple multiplication of an array using ARM NEON intrinsics. The input is a uint8 array and the output is a uint16 array. However, the regular scalar code is faster than the NEON-optimized one. Can anyone help me figure out how to improve the NEON code?
My regular code is
uint16_t scale_factor = 300;
for (int i = 0; i < output_size; i++)
{
    out_16bit[i] = (uint16_t)(in_ptr[i] * scale_factor);
}
My NEON code is
uint16_t* out_ptr = out_16bit;
uint8_t* in_ptr = in_8bit;
uint16_t scale_factor = 300;
for (int i = 0; i < out_size/16; i++)
{
    uint8x16_t in_v0 = vld1q_u8(in_ptr);
    in_ptr += 16;
    uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
    uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
    uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
    uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);
    // code below takes a long time
    vst1q_u16(out_ptr, res_0);
    vst1q_u16(out_ptr + 8, res_1);
    out_ptr += 16;
}
I also did some profiling and found that if I comment out either the vst1q_u16 calls or out_ptr += 16, the loop is fast, but with both present it is very slow. So I guess the pointer increment might be waiting for vst1q_u16 to finish? I then updated the NEON code to place some work between vst1q_u16 and out_ptr += 16, as below,
uint8x16_t in_v0 = vld1q_u8(in_ptr);
in_ptr += 16;
uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);
vst1q_u16(out_ptr, res_0);
vst1q_u16(out_ptr + 8, res_1);
for (int i = 1; i < out_size/16; i++)
{
    in_v0 = vld1q_u8(in_ptr);
    in_ptr += 16;
    in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
    in_16_v1 = vmovl_u8(vget_high_u8(in_v0));
    out_ptr += 16;
    res_0 = vmulq_n_u16(in_16_v0, scale_factor);
    res_1 = vmulq_n_u16(in_16_v1, scale_factor);
    vst1q_u16(out_ptr, res_0);
    vst1q_u16(out_ptr + 8, res_1);
}
But this change didn't work... Please advise what I should do. Thank you.
The simple answer, as in the comments, is auto-vectorization.
I'm unsure about clang 6, but more recent clang will certainly auto-vectorize to Neon by default when targeting Neon platforms, and it will be very hard to beat that auto-vectorization on something as simple as this multiplication, except perhaps with the best loop unrolling for your particular processor. It is, however, very easy to do worse than auto-vectorization. Godbolt is a very good way to compare, along with profiling all your changes.
All the comments make good points too.
For more documentation on best practice for Neon intrinsics, Arm's Neon microsite has very useful information, especially the doc on Optimizing C with Neon intrinsics.
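To make the auto-vectorization point concrete, here is a hedged sketch of the scalar loop written so the compiler is free to vectorize it (the function name is invented; restrict is the only substantive addition to the original scalar code):

```c
#include <stddef.h>
#include <stdint.h>

/* Plain scalar loop; at -O2/-O3, clang (and gcc) will typically turn this
   into Neon code on their own when targeting a Neon platform. restrict
   promises the buffers do not alias, which removes one obstacle to
   vectorization. */
void scale_u8_to_u16(const uint8_t *restrict in, uint16_t *restrict out,
                     size_t n, uint16_t scale)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)(in[i] * scale);
}
```

Comparing the assembly for this against the hand-written intrinsics on Godbolt is usually the quickest way to see which one the compiler does better with.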

Can anyone help me to optimize this for loop use SSE?

I have a for loop that will run many times and costs a lot of time:
for (int z = 0; z < temp; z++)
{
    float findex = a + b * A[z];
    int iindex = findex;
    outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
    a++;
}
I have tried to optimize this code with SSE, but got no performance improvement. Maybe my SSE code is bad; can anyone help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray, in which case no parallelization would be possible.
Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but your current loop only allows one sum into outArray[z] at a time. To fix this you should unroll your loop.
for (int z = 0; z < temp; z += 2) {
    float findex_v1 = a + b * A[z];
    int iindex_v1 = findex_v1;
    outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
    float findex_v2 = (a+1) + b * A[z+1];
    int iindex_v2 = findex_v2;
    outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
    a += 2;
}
In terms of SIMD the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations indexed by z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
float4 a4 = a + float4(0,1,2,3);
for (int z = 0; z < temp; z += 4) {
    // use SSE for contiguous memory
    float4 findex4 = a4 + b * float4.load(&A[z]);
    int4 iindex4 = truncate_to_int(findex4);
    // don't use SSE for non-contiguous memory
    iindex4.store(indexa);
    for (int i = 0; i < 4; i++) {
        inArraya[i] = inArray[indexa[i]];
        dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
    }
    // loading from an array right after writing to it causes a CPU stall
    float4 inArraya4 = float4.load(inArraya);
    float4 dinArraya4 = float4.load(dinArraya);
    // back to SSE
    float4 outArray4 = float4.load(&outArray[z]);
    outArray4 += inArraya4 + (findex4 - iindex4) * dinArraya4;
    outArray4.store(&outArray[z]);
    a4 += 4;
}

ARM neon performance issue

Consider the two following pieces of code, the first is the C version :
void __attribute__((noinline)) proj(uint8_t *line, uint16_t length)
{
    uint16_t i;
    int16_t tmp;
    for (i = HPHD_MARGIN; i < length - HPHD_MARGIN; i++) {
        tmp = line[i-3] - 4*line[i-2] + 5*line[i-1] - 5*line[i+1] + 4*line[i+2] - line[i+3];
        hphd_temp[i] = ABS(tmp);
    }
}
The second is the same function (except for the border) using neon intrinsics
void __attribute__((noinline)) proj_neon(uint8_t *line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8, p2p9, p4p11, p5p12, p6p13, m4, m5;
    uint16x8_t result;
    m4 = vdup_n_u8(4);
    m5 = vdup_n_u8(5);
    b0b7 = vld1_u8(line);
    for (i = 0; i < length - 16; i += 8) {
        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7, b8b15, 1);
        p2p9  = vext_u8(b0b7, b8b15, 2);
        p4p11 = vext_u8(b0b7, b8b15, 4);
        p5p12 = vext_u8(b0b7, b8b15, 5);
        p6p13 = vext_u8(b0b7, b8b15, 6);
        result = vsubl_u8(b0b7, p6p13);       // p[-3] - p[+3]
        result = vmlal_u8(result, p2p9, m5);  // +5 * p[-1]
        result = vmlal_u8(result, p5p12, m4); // +4 * p[+2]
        result = vmlsl_u8(result, p1p8, m4);  // -4 * p[-2]
        result = vmlsl_u8(result, p4p11, m5); // -5 * p[+1]
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
    /* todo: remaining pixels */
}
I am disappointed by the performance gain: it is only around 10-15%. If I look at the generated assembly:
the C version compiles to a 108-instruction loop
the Neon version compiles to a 72-instruction loop
But one iteration of the Neon loop computes 8 times as much data as an iteration of the C loop, so a dramatic improvement should be seen.
Do you have any explanation for the small difference between the two versions?
Additional details :
Test data is a 10 Mpix image; computation time is around 2 seconds for the C version.
CPU : ARM cortex a8
I'm going to take a wild guess and say that data caching is the reason you don't see the big performance gain you are expecting. While I don't know whether your chipset supports caching or at what level, if the data spans cache lines, is poorly aligned, or is processed in an environment where the CPU is doing other things at the same time (interrupts, threads, etc.), then that could also muddy your results.
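If memory latency is indeed the bottleneck, one cheap experiment (my own suggestion, not from the answer above) is explicit software prefetch, which GCC and Clang expose as a builtin and which compiles to a PLD hint on the Cortex-A8; shown here on a trivial byte sum rather than the filter itself:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: ask for the next cache line while consuming the
   current one. __builtin_prefetch is a GCC/Clang builtin; it is only a
   hint and is a no-op on targets without prefetch support. */
uint32_t sum_with_prefetch(const uint8_t *line, size_t length)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < length; i++) {
        if ((i & 63) == 0 && i + 64 < length)
            __builtin_prefetch(&line[i + 64]); /* one 64-byte line ahead */
        sum += line[i];
    }
    return sum;
}
```

Whether prefetching helps here can only be settled by measuring, since the A8's automatic prefetcher may already cover a sequential access pattern like this one.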
