How can I make this loop run faster? - c

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
for (int i = sz; i > 0; i--)
{
if (returnPixel->temperature < *image)
{
returnPixel->temperature = *image;
temp = i;
}
image++;
}
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
With an image size of 640x480 it takes around 35ms to run through this function, which is too slow for what I need it for (under 10ms ideally).
This is executing on an ARM A9 processor running Linux.
The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.
I'm using optimize -O3 and the following compile options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.
This is the output from the compiler:
000127f4 <_findMax>:
for(int i = sz; i > 0; i--)
127f4: e3510000 cmp r1, #0
{
127f8: e52de004 push {lr} ; (str lr, [sp, #-4]!)
for(int i = sz; i > 0; i--)
127fc: da000014 ble 12854 <_findMax+0x60>
12800: e1d2c0b0 ldrh ip, [r2]
12804: e2400002 sub r0, r0, #2
int temp = 0;
12808: e3a0e000 mov lr, #0
if(returnPixel->temperature < *image)
1280c: e1f030b2 ldrh r3, [r0, #2]!
12810: e153000c cmp r3, ip
returnPixel->temperature = *image;
12814: 81a0c003 movhi ip, r3
12818: 81a0e001 movhi lr, r1
1281c: 81c230b0 strhhi r3, [r2]
for(int i = sz; i > 0; i--)
12820: e2511001 subs r1, r1, #1
12824: 1afffff8 bne 1280c <_findMax+0x18>
12828: e30c3ccd movw r3, #52429 ; 0xcccd
1282c: e34c3ccc movt r3, #52428 ; 0xcccc
12830: e0831e93 umull r1, r3, r3, lr
12834: e1a034a3 lsr r3, r3, #9
12838: e0831103 add r1, r3, r3, lsl #2
1283c: e6ff3073 uxth r3, r3
12840: e04ee381 sub lr, lr, r1, lsl #7
12844: e6ffe07e uxth lr, lr
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
12848: e1c2e0b4 strh lr, [r2, #4]
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
1284c: e1c230b6 strh r3, [r2, #6]
}
12850: e49df004 pop {pc} ; (ldr pc, [sp], #4)
for(int i = sz; i > 0; i--)
12854: e3a03000 mov r3, #0
12858: e1a0e003 mov lr, r3
1285c: eafffff9 b 12848 <_findMax+0x54>
For clarity after comments:
Each pixel is an unsigned 16-bit integer; image[0] is the pixel with coordinates 0,0, and the last element in the array has the coordinates 639,479.

This is executing on an ARM A9 processor running Linux.
ARM Cortex-A9 supports Neon.
With this in mind the goal should be to load 8 values (128 bits of pixel data) into a register, then do "compare with the current maximums for each of the 8 places" to get a mask, then use the mask and its inverse to mask out the "too small" old maximums and the "too small" new values; then OR the results to merge the new higher values into the "current maximums for each of the 8 places".
Once that has been done for all pixels (using a loop); you'd want to find the highest value in the "current maximums for each of the 8 places".
However; to find the location of the hottest pixel (rather than just how hot it is) you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon); then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach - e.g. create a smaller image containing the maximum from each tile in the original image; then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...
Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).
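As a rough illustration of the "current maximums for each of the 8 places" idea in plain C, here is a minimal sketch. It assumes 8 lanes of 16-bit pixels to mirror one 128-bit register; the name max_u16_lanes and the LANES constant are made up for this example, and whether a given compiler actually vectorizes the inner loop is not guaranteed. It only finds the value, not the location.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumption: eight 16-bit pixels per 128-bit register */

/* Returns the largest pixel value in image[0..n); returns 0 when n == 0. */
uint16_t max_u16_lanes(const uint16_t *image, size_t n)
{
    uint16_t lane_max[LANES] = {0};
    size_t i = 0;

    /* Main loop: lane k only ever sees pixels k, k+8, k+16, ...
       The select below has no data-dependent branch. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t k = 0; k < LANES; k++) {
            uint16_t p = image[i + k];
            lane_max[k] = (p > lane_max[k]) ? p : lane_max[k];
        }
    }

    /* Horizontal reduction of the eight per-lane maxima. */
    uint16_t best = 0;
    for (size_t k = 0; k < LANES; k++) {
        best = (lane_max[k] > best) ? lane_max[k] : best;
    }

    /* Scalar tail for the last n % LANES pixels. */
    for (; i < n; i++) {
        best = (image[i] > best) ? image[i] : best;
    }
    return best;
}
```

Keeping eight independent accumulators removes the dependency of every pixel on one running maximum, which is the property the SIMD version relies on.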

You should minimize memory access, especially in loops.
Every * or -> could cause unnecessary memory access resulting in serious performance hits.
Local variables are your best friend:
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
uint16_t temperature = returnPixel->temperature;
uint16_t pixel;
for (int i = sz; i > 0; i--)
{
pixel = *image++;
if (temperature < pixel)
{
temperature = pixel;
temp = i;
}
}
returnPixel->temperature = temperature;
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
Below is how this can be optimized by utilizing neon:
#include <stdint.h>
#include <arm_neon.h>
#include <assert.h>
static inline void findMax128_neon(uint16_t *pDst, uint16x8_t *pImage)
{
// vld1q_u16 expects a pointer to uint16_t, so view the sector as plain pixels.
const uint16_t *p = (const uint16_t *) pImage;
uint16x8_t in0, in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13, in14, in15;
uint16x4_t dmax;
// Load 16 vectors of 8 pixels each (128 pixels in total).
in0 = vld1q_u16(p);
in1 = vld1q_u16(p + 8);
in2 = vld1q_u16(p + 16);
in3 = vld1q_u16(p + 24);
in4 = vld1q_u16(p + 32);
in5 = vld1q_u16(p + 40);
in6 = vld1q_u16(p + 48);
in7 = vld1q_u16(p + 56);
in8 = vld1q_u16(p + 64);
in9 = vld1q_u16(p + 72);
in10 = vld1q_u16(p + 80);
in11 = vld1q_u16(p + 88);
in12 = vld1q_u16(p + 96);
in13 = vld1q_u16(p + 104);
in14 = vld1q_u16(p + 112);
in15 = vld1q_u16(p + 120);
in0 = vmaxq_u16(in1, in0);
in2 = vmaxq_u16(in3, in2);
in4 = vmaxq_u16(in5, in4);
in6 = vmaxq_u16(in7, in6);
in8 = vmaxq_u16(in9, in8);
in10 = vmaxq_u16(in11, in10);
in12 = vmaxq_u16(in13, in12);
in14 = vmaxq_u16(in15, in14);
in0 = vmaxq_u16(in2, in0);
in4 = vmaxq_u16(in6, in4);
in8 = vmaxq_u16(in10, in8);
in12 = vmaxq_u16(in14, in12);
in0 = vmaxq_u16(in4, in0);
in8 = vmaxq_u16(in12, in8);
in0 = vmaxq_u16(in8, in0);
dmax = vmax_u16(vget_high_u16(in0), vget_low_u16(in0));
dmax = vpmax_u16(dmax, dmax);
dmax = vpmax_u16(dmax, dmax);
vst1_lane_u16(pDst, dmax, 0);
}
void _findMax_neon(uint16_t *image, int sz, sPixelData *returnPixel)
{
assert((sz % 128) == 0);
const uint32_t nSector = sz/128;
uint16_t max[nSector];
uint32_t i, s, nMax;
uint16_t *pImage;
for (i = 0; i < nSector; ++i)
{
findMax128_neon(&max[i], (uint16x8_t *) &image[i*128]);
}
s = 0;
nMax = max[0];
for (i = 1; i < nSector; ++i)
{
if (max[i] > nMax)
{
s = i;
nMax = max[i];
}
}
if (nMax < returnPixel->temperature)
{
returnPixel->x_location = 0;
returnPixel->y_location = 0;
return;
}
pImage = &image[s * 128]; // start of the winning 128-pixel sector
i = 0;
while(1) {
if (*pImage++ == nMax) break;
i += 1;
}
i += 128 * s;
returnPixel->temperature = nMax;
returnPixel->x_location = i % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = i / IMAGE_HORIZONTAL_SIZE;
}
Beware that the function above assumes sz being a multiple of 128.
And yes, it will run in less than 10ms.

The culprit here is the slow linear search for the highest "temperature". I'm not quite sure how to improve that search algorithm with the information given, if at all possible (could you sort the data in advance?), but you could start with this:
uint16_t max = 0;
size_t found_index = 0;
for(size_t i=0; i<sz; i++)
{
if(max < image[i])
{
max = image[i];
found_index = sz - i - 1; // or whatever makes sense here for the algorithm
}
}
returnPixel->temperature = max;
returnPixel->x_location = found_index % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = found_index / IMAGE_HORIZONTAL_SIZE;
This might give a very slight performance gain because of the top to bottom iteration order and not touching unrelated memory returnPixel in the middle of the loop. max should get stored in a register and with luck you might get slightly better cache performance overall. Still, this comes with a branch like the original code, so it is a minor improvement.
Another micro-optimization is to change the parameter to const uint16_t* image - this might give slightly better pointer aliasing, in case returnPixel happens to contain a uint16_t too. image should be const regardless of performance, since const correctness is always good practice.
Further, more obscure optimization tricks might be possible if you read image 32 or 64 bits at a time, then come up with a fast look-up method to find the largest pixel inside that 32/64 bit chunk.

If you have to find the hottest pixel in the image and there is no structure to the image data itself, then I think you're stuck with iterating through the pixels. If so, you have a number of different ways to make this faster:
As suggested above, try loop unrolling and other micro-optimisation tricks; this might give you the performance boost you need.
Go parallel: split the array into N chunks, find the MAX[N] for each chunk, and then find the largest of the MAX[N] values. You have to be careful here, as setting up the parallel processes can take longer than doing the work.
If there is some structure to the image, say lots of cold pixels and a hot spot (larger than 1 pixel) that you're trying to find, then there are other techniques you could use.
One approach could be to split the image up into N boxes and then sample each box. The hottest pixel in the hottest box (and maybe in the boxes adjacent to it) would then be your result. However, this depends on there being some structure to the image which you can rely on.

The assembly language reveals the compiler is storing in returnPixel->temperature each time a new maximum is found, with the instruction strhhi r3, [r2]. Eliminate this unnecessary store by caching the maximum in a local object and only updating returnPixel->temperature after the loop ends:
uint16_t maximum = returnPixel->temperature;
for (int i = sz; i > 0; i--)
{
if (maximum < *image)
{
maximum = *image;
temp = i;
}
image++;
}
returnPixel->temperature = maximum;
That is unlikely to reduce the execution time as much as you need, but it might if there is some bad cache or memory interaction occurring. It is a very simple change, so try it before moving on to the SIMD vectorizations suggested in other answers.
Regarding vectorization, two approaches are:
Iterate through the image using a vmax instruction to update the maximum value seen so far in each SIMD lane. Then consolidate the lanes to find the overall maximum. Then iterate through the image again looking for that maximum in any lane. (I forget what that architecture has for instructions that would assist in testing whether a comparison produced true in any lane.)
Iterate through the image maintaining three registers: One with the maximum seen so far in each lane, one with a counter of position in the image, and one with, in each lane, a record of the counter value at the time each new maximum was seen. The first can be updated with vmax, as above. The second can be updated with vadd. The third can be updated with a NEON compare (such as vcgt) and vbit. After the loop, figure out which lane has the maximum, and get the position of the maximum from the recorded counter for that lane.
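The three-register scheme in the second approach can be emulated in portable C with three small arrays, one element per lane. This is only a sketch of the bookkeeping, not NEON code: argmax_u16_lanes, LANES and the array names are invented for the example, and it assumes the pixel count is a non-zero multiple of 8. The comments note which SIMD step each line stands in for.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumption: eight 16-bit pixels per vector */

/* Returns the index of the first maximum in image[0..n) and stores its
   value in *max_out. n must be a non-zero multiple of LANES. */
size_t argmax_u16_lanes(const uint16_t *image, size_t n, uint16_t *max_out)
{
    uint16_t lane_max[LANES];  /* maximum seen so far in each lane */
    uint32_t counter[LANES];   /* current image position of each lane */
    uint32_t lane_pos[LANES];  /* position recorded when lane_max last grew */

    for (size_t k = 0; k < LANES; k++) {
        lane_max[k] = image[k];
        counter[k] = (uint32_t)k;
        lane_pos[k] = (uint32_t)k;
    }

    for (size_t i = LANES; i < n; i += LANES) {
        for (size_t k = 0; k < LANES; k++) {
            uint16_t p = image[i + k];
            counter[k] += LANES;                              /* vector add */
            int greater = p > lane_max[k];                    /* compare -> mask */
            lane_pos[k] = greater ? counter[k] : lane_pos[k]; /* select by mask */
            lane_max[k] = greater ? p : lane_max[k];          /* vector max */
        }
    }

    /* Consolidate the lanes; on equal values prefer the smaller position. */
    size_t best = 0;
    for (size_t k = 1; k < LANES; k++) {
        if (lane_max[k] > lane_max[best] ||
            (lane_max[k] == lane_max[best] && lane_pos[k] < lane_pos[best])) {
            best = k;
        }
    }
    *max_out = lane_max[best];
    return lane_pos[best];
}
```

Because each lane only records a position on a strictly greater value, every lane holds its own first occurrence, and the final tie-break picks the earliest of those.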
Depending on the performance of the necessary instructions, a hybrid approach may be faster:
Set some strip size S. Partition the image into strips of that size. For each strip in the image, find the maximum (using the fast vmax loop described above). If the maximum is greater than seen in previous strips, remember it and the current strip number. After processing the whole image, the task has been reduced to finding the location of the maximum in a particular strip. Use the second loop from the first approach above for that. (For large images, further refinements may be possible, possibly refining the location using a shorter strip size before finding the exact location, depending on cache behavior and other factors.)
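The hybrid approach can be sketched in scalar C as follows. The names strip_max and argmax_u16_strips are hypothetical, and the inner strip_max loop is the part that would be replaced by the fast vmax loop on real hardware; here it is plain C so the control flow is easy to follow. The sketch assumes n > 0 and strip > 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Fast pass: maximum of one strip, with no position bookkeeping. */
static uint16_t strip_max(const uint16_t *p, size_t len)
{
    uint16_t m = 0;
    for (size_t i = 0; i < len; i++) {
        m = (p[i] > m) ? p[i] : m;
    }
    return m;
}

/* Returns the index of the first maximum in image[0..n) and stores its
   value in *max_out. Assumes n > 0 and strip > 0. */
size_t argmax_u16_strips(const uint16_t *image, size_t n, size_t strip,
                         uint16_t *max_out)
{
    uint16_t best = 0;
    size_t best_start = 0;
    size_t best_len = (strip < n) ? strip : n;

    for (size_t start = 0; start < n; start += strip) {
        size_t len = (n - start < strip) ? (n - start) : strip;
        uint16_t m = strip_max(image + start, len);
        if (m > best) {          /* strictly greater: keep the earliest strip */
            best = m;
            best_start = start;
            best_len = len;
        }
    }

    /* Slow pass, confined to the winning strip: locate the maximum. */
    size_t i = best_start;
    while (i < best_start + best_len && image[i] != best) {
        i++;
    }
    *max_out = best;
    return i;
}
```

The position search touches only one strip, so its cost is small compared with the full-image pass as long as the strip size is much smaller than the image.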

In my opinion you cannot improve that algorithm: any array element can hold the maximum, so you need at least one pass through the data, and I don't believe you can improve on that without going multithreaded. You can start several threads (as many as you have cores/processors) and give each a subset of your image. Once they have finished, you will have as many local maximum values as threads you started. Do a second pass over those values to get the overall maximum, and you are finished. But consider the extra workload of creating threads, allocating stack memory for them, and scheduling: if the number of values is low, that overhead can exceed the cost of running everything in a single thread. If you have a thread pool somewhere that provides ready-to-run threads, and you can use it, then you will probably be able to finish in about one Nth of the time needed to run the whole loop on a single processor (where N is the number of cores on the machine).
Note: using a dimension that is a power of two will save you the job of calculating a quotient and a remainder, by solving the division problem with a bit shift and a bit mask. You use it only once in your function, but it's an improvement anyway.
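To make the shift-and-mask remark concrete, here is a small sketch. It assumes a hypothetical layout in which each row is padded to 1024 pixels (the asker's image is 640 wide, so this would require a padded buffer); STRIDE_LOG2, pixel_xy and index_to_xy are names invented for the example and are not the asker's sPixelData.

```c
#include <stdint.h>

#define STRIDE_LOG2 10u                 /* hypothetical: rows padded to 1024 pixels */
#define STRIDE      (1u << STRIDE_LOG2)

/* Illustrative coordinate pair, not the asker's sPixelData. */
typedef struct {
    uint16_t x;
    uint16_t y;
} pixel_xy;

/* Converts a linear pixel index into column/row for a power-of-two stride. */
pixel_xy index_to_xy(uint32_t index)
{
    pixel_xy p;
    p.x = (uint16_t)(index & (STRIDE - 1u));  /* same result as index % STRIDE */
    p.y = (uint16_t)(index >> STRIDE_LOG2);   /* same result as index / STRIDE */
    return p;
}
```

For this particular function the gain is likely small: the disassembly in the question shows the compiler already replaced the division by 640 with a multiply and shifts, and the conversion runs once per call.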

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem. There doesn't seem to be a very good way to write code that has a branch.
See update at end for finished code using some of the suggested answers.
Here's a sample implementation (more complete version on godbolt):
#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1, 6,7, 4,5
#define SWAP32 1, 0, 3, 2, 5, 4, 7, 6
static bool any(vNb x) {
x = x | __builtin_shufflevector(x,x, SWAP128);
x = x | __builtin_shufflevector(x,x, SWAP64);
x = x | __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
vNs max = {0,0,0,0,0,0,0,0};
vNs tmax;
unsigned imax = 0;
for (unsigned i = 0 ; i < n; i += VLEN) {
vNs t = *(vNs*)(data + i);
t = -t < t ? t : -t; // Absolute value
vNb cmp = t > max;
if (any(cmp)) {
tmax = t; imax = i;
// broadcast horizontal max of t into every element of max
vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
t = t < tswap128 ? tswap128 : t;
vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
t = t < tswap64 ? tswap64 : t;
vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
max = t < tswap32 ? tswap32 : t;
}
}
// To simplify example, ignore finding index of true value in tmax==max
*index = imax; // + which(tmax == max);
return max[0];
}
Code on godbolt allows changing VLEN to 8 or 4.
This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers and for the SWAP128 horizontal op will max or OR the two registers without any unnecessary permute. gcc on the other hand really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.
But there's a problem with if (any(cmp)). For clang + SSE/AVX, it works well, vcmpltps + vptest, with an orps to go from 8->4 on SSE.
But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.
Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?
I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
Update
I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.
While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version is still troubled by a sub-optimal implementation of "any()".
Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.
Some notes on writing generic SIMD code.
The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (an AND with a mask that clears the sign bit) and on NEON (the fabs instruction).
static vNs vabs(vNs x) {
return -x < x ? x : -x;
}
This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max" instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something to do with denormals or exceptions, I guess.
template <typename v> // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
return a < b ? b : a; // compiles best with "<" as compare op
}
This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8 element vectors on SSE/NEON, which have only 4 element registers, the compiler is smart enough to use just one register for the broadcasted result, since the top 4 and bottom 4 elements are the same.
template <typename v>
static v hmax(v x) {
if (VLEN >= 8)
x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}
This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.
static bool any(vNb x) {
if (VLEN >= 8)
x |= __builtin_shufflevector(x,x, SWAP128);
x |= __builtin_shufflevector(x,x, SWAP64);
x |= __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
Also interesting, on AVX the code i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of -1s is produced with vpcmpeqd ymm0, ymm0, ymm0 and that's faster than broadcasting a vector of 1s.
Here is the best which() I've come up with. This gives you the index of the 1st true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask. I don't know about the best NEON.
// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
vNb cidx = __builtin_convertvector(x, vNb);
return __builtin_ctzll((sNb)cidx) / 8u;
}
As commented by chtz, the most generic and typical method is to have another mask to gather indices:
Vec8s indices = { 0,1,2,3,4,5,6,7};
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));
for (auto i = 8; i + 8 <= vec_length; i+=8) {
indices = indices + 8; // advance first so the indices match the block at ptr + i
Vec8f data = abs(load8(ptr + i));
auto mask = is_greater(data, max_abs);
max_idx = bitselect(mask, indices, max_idx);
max_abs = max(max_abs, data);
}
Another option is to interleave the values and indices:
auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);
for (int i = 8; i + 8 <= size; i+=8) {
idx = idx + 8;
auto d1 = load8s(ptr + i) & 0x7fffffff;
auto lo1 = zip_lo(idx, d1);
auto hi1 = zip_hi(idx, d1);
lo = max_u64(lo, lo1);
hi = max_u64(hi, hi1);
}
This method is especially lucrative, if the range of inputs is small enough to shift the input left, while appending a few bits from the index to the LSB bits of the same word.
Even in this case we can repurpose 1 bit in the float allowing us to save one half of the bit/index selection operations.
auto data0 = load8u(ptr) << 1; // take abs by shifting left
auto data1 = (load8u(ptr + 8) << 1) + 1; // encode odd index to data
auto mx = max_u32(data0, data1); // the LSB contains one bit of index
Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/Nan handling, which don't encode as Inf/Nan any more when reinterpreted as the high part of 64-bit double).
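The interleaving idea can be shown in scalar form: pack the value into the high bits of a wider integer and the index into the low bits, so that a single max operation carries the position along. The sketch below is an adaptation rather than the answer's code: it uses 16-bit unsigned samples instead of float bit patterns, the name argmax_packed is made up, and it stores the inverted index so that ties resolve to the first occurrence. It assumes n > 0 and that indices fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first maximum in data[0..n) and stores its
   value in *max_out. Assumes n > 0 and n <= 2^32. */
size_t argmax_packed(const uint16_t *data, size_t n, uint16_t *max_out)
{
    uint64_t best = 0;
    for (size_t i = 0; i < n; i++) {
        /* Value in the high half, inverted index in the low half:
           a larger value always wins, and for equal values the
           smaller index produces the larger key. */
        uint64_t key = ((uint64_t)data[i] << 32) | (uint32_t)~(uint32_t)i;
        best = (key > best) ? key : best;
    }
    *max_out = (uint16_t)(best >> 32);
    return (size_t)(uint32_t)~(uint32_t)best;
}
```

The loop body is one comparison and one select on a 64-bit key, which is the property that makes the trick attractive when a wide unsigned max is available.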
UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.
UPD: MISSED THE ABS
Terribly sorry about that, I missed the absolute value from the definition.
I do not have the measurements, but here are all 3 functions vectorised:
max value with abs: https://godbolt.org/z/6Wznrc5qq
find with abs: https://godbolt.org/z/61r9Efxvn
one pass with abs: https://godbolt.org/z/EvdbfnWjb
Asm stashed in a gist
On the method
The way to do max element with simd is to first find the value and then find the index.
Alternatively you have to keep a register of indexes and blend the indexes.
This requires keeping indexes, doing more operations and the problem of the overflow needs to be addressed.
Here are my timings on avx2 by type (char, short and int) for 10'000 bytes of data
The min_element is my implementation of keeping the index.
reduce(min) + find is doing two loops - first get the value, then find where.
For ints (should behave like floats), performance is 25% faster for the two loops solution, at least on my measurements.
For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.
How to do it
finding the maximum value is auto-vectorised across all platforms if you write it as reduce
if (!arr.size()) return {};
// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
res = res > arr[i] ? res : arr[i];
}
return res;
https://godbolt.org/z/EsazWf1vT
Now the find portion is trickier; none of the compilers I know autovectorize find.
We have eve library that provides you with find algorithm: https://godbolt.org/z/93a98x6Tj
Or I explain how to implement find in this talk if you want to do it yourself.
UPD:
UPD2: changed the blend to max
@Peter Cordes in the comments said that there is maybe a point to doing the one pass solution in case of bigger data.
I have no evidence of this - my measurements point to reduce + find.
However, I hacked together roughly how keeping the index looks (there is an aligning issue at the moment, we should definitely align reads here)
https://godbolt.org/z/djrzobEj4
AVX2 main loop:
.L6:
vmovups ymm6, YMMWORD PTR [rdx]
add rdx, 32
vcmpps ymm3, ymm6, ymm0, 30
vmaxps ymm0, ymm6, ymm0
vpblendvb ymm3, ymm2, ymm1, ymm3
vpaddd ymm1, ymm5, ymm1
vmovdqa ymm2, ymm3
cmp rcx, rdx
jne .L6
ARM-64 main loop:
.L6:
ldr q3, [x0], 16
fcmgt v4.4s, v3.4s, v0.4s
fmax v0.4s, v3.4s, v0.4s
bit v1.16b, v2.16b, v4.16b
add v2.4s, v2.4s, v5.4s
cmp x0, x1
bne .L6
Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2
I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.
Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:
// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
const __m128 greater = _mm_cmpgt_ps( vec, aa );
aa = _mm_max_ps( vec, aa );
// If you don't have SSE4, emulate with bitwise ops: and, andnot, or
maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}
float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
// Initialize things
const float* const end = rsi + length;
const float* const endAligned = rsi + ( ( length / 4 ) * 4 );
__m128 aa = _mm_set1_ps( -1 );
__m128 maxIdx = _mm_setzero_ps();
__m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );
// Main vectorized portion
while( rsi < endAligned )
{
__m128 vec = _mm_loadu_ps( rsi );
rsi += 4;
updateMax( vec, idx, aa, maxIdx );
idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
}
// Handle the remainder, if present
if( rsi < end )
{
__m128 vec;
if( length > 4 )
{
// The source has at least 5 elements
// Offset the source pointer + index back, by a few elements
const int offset = (int)( 4 - ( length % 4 ) );
rsi -= offset;
idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
vec = _mm_loadu_ps( rsi );
}
else
{
// The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
alignas( 16 ) float buff[ 4 ];
_mm_store_ps( buff, _mm_setzero_ps() );
for( size_t i = 0; i < length; i++ )
buff[ i ] = rsi[ i ];
vec = _mm_load_ps( buff );
}
updateMax( vec, idx, aa, maxIdx );
}
// Reduce to scalar
__m128 tmpMax = _mm_movehl_ps( aa, aa );
__m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
__m128 greater = _mm_cmpgt_ps( tmpMax, aa );
aa = _mm_max_ps( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
// SSE3 has 100% market penetration in 2022
tmpMax = _mm_movehdup_ps( tmpMax );
tmpMaxIdx = _mm_movehdup_ps( tmpMaxIdx );
greater = _mm_cmpgt_ss( tmpMax, aa );
aa = _mm_max_ss( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
return _mm_cvtss_f32( aa );
}
As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.
SSE doesn’t have bitselect; blendvps is not quite that, it selects 32-bit lanes based on the high bit of the selector. Unlike NEON, SSE doesn’t have instructions for absolute value; it needs to be emulated with a bitwise andnot.
The final reduction is going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds the horizontal maximum over the complete SIMD vector.
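The two emulations mentioned here (a blend built from and, andnot, or; and an absolute value obtained by clearing the sign bit) can be illustrated on a single 32-bit lane in plain C. This is a scalar sketch only; bitselect_u32 and abs_by_mask are hypothetical helper names, not intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Per-bit select: takes bits from a where mask is 1, from b where mask is 0.
   With an all-ones or all-zeros mask this picks a whole lane. */
uint32_t bitselect_u32(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/* Absolute value of an IEEE-754 single by clearing the sign bit,
   the scalar equivalent of an andnot with -0.0f. */
float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bits */
    bits &= 0x7fffffffu;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A compare instruction produces exactly the all-ones or all-zeros lane masks that bitselect_u32 expects, which is why the and/andnot/or sequence can stand in for a blend.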

Pointer and regular variable with the same name

In the "Memory and pointers" chapter of the book "Embedded C programming", Mark Siegesmund gives the following example:
void Find_Min_Max( int list[], int count, int * min, int * max){
for( int i=0, min=255, max=0; i<count; i++){
if( *min>list[i])
*min=list[i];
if( *max<list[i] )
*max=list[i];
}
}
// call like this: Find_Min_Max( table, sizeof(table), &lowest, &highest);
If I understand correctly:
table is an array of int,
count is the size of the array
when calling, &lowest and &highest are addresses of int variables where one wants to store the results
int * min and int * max in the function definition refer to pointers * min and * max, both to integer types
the &lowest and &highest as defined in his last line are actually pointers to an address with an int type variable
(Not 100% sure about that last one.)
Inside the for loop, he each time compares the next int in the array list with what is at the pointer *min and *max addresses and updates the values at those addresses if necessary.
But in the definition of the loop, he defines min = 255, max = 0.
It seems to me as if these are two completely new variables that have not been initialized.
Should that line not be
for( int i=0, *min=255, *max=0; i<count; i++){
Is that a mistake in the book or something I misunderstand?
It seems like an error in the book - it does indeed declare new variables inside the loop. (Hint: before publishing programming books, at least compile the code first...)
But even with that embarrassing bug fixed, the code is naively written. Here are more mistakes:
Always const qualify array parameters that aren't modified.
Always use stdint.h in embedded systems.
Never use magic numbers like 255. In this case, use UINT8_MAX instead.
The above is industry standard consensus. (Also required by MISRA-C etc.)
Also, it is most correct practice to use size_t rather than int for size of arrays, but that's more of a style issue remark.
In addition, a better algorithm is to have the pointers point at the min and max values found in the array, meaning we don't just get the values but also their locations in the data container. Finding the location is a very common use-case. It's about the same execution speed but we get more information.
So if we should attempt to rewrite this to some book-worthy code, it would rather look like:
void find_min_max (const uint8_t* data, size_t size, const uint8_t** min, const uint8_t** max);
Bit harder to read and use with pointer-to-pointers, but more powerful.
(Normally we would micro-optimize pointers of same type with restrict, but in this case all pointers may end up pointing at the same object, so it isn't possible.)
Complete example:
#include <stddef.h>
#include <stdint.h>
void find_min_max (const uint8_t* data, size_t size, const uint8_t** min, const uint8_t** max)
{
*min = data;
*max = data;
for(size_t i=0; i<size; i++)
{
if(**min > data[i])
{
*min = &data[i];
}
if(**max < data[i])
{
*max = &data[i];
}
}
}
Usage example for PC: (please note that int main (void) and stdio.h shouldn't be used in embedded systems.)
#include <stdio.h>
#include <inttypes.h>
int main (void)
{
const uint8_t data[] = { 1, 2, 3, 4, 5, 4, 3, 2, 1, 0};
const uint8_t* min;
const uint8_t* max;
find_min_max(data, sizeof data, &min, &max);
printf("Min: %"PRIu8 ", index: %d\n", *min, (int)(min-data));
printf("Max: %"PRIu8 ", index: %d\n", *max, (int)(max-data));
return 0;
}
Disassembling this search algorithm for ARM gcc -O3:
find_min_max:
cmp r1, #0
str r0, [r2]
str r0, [r3]
bxeq lr
push {r4, lr}
add r1, r0, r1
.L5:
mov lr, r0
ldr ip, [r2]
ldrb r4, [ip] # zero_extendqisi2
ldrb ip, [r0], #1 # zero_extendqisi2
cmp r4, ip
strhi lr, [r2]
ldr r4, [r3]
ldrbhi ip, [r0, #-1] # zero_extendqisi2
ldrb r4, [r4] # zero_extendqisi2
cmp r4, ip
strcc lr, [r3]
cmp r1, r0
bne .L5
pop {r4, pc}
Not the most effective code still, very branch-intensive. I think there's plenty of room for optimizing this further, if the aim is library-quality code. But then it's also a specialized algorithm, finding both min and max, and their respective indices.
For small data sets it would probably be wiser to just sort the data first, then just grab min and max out of the sorted smallest and largest indices. If you plan to search through the data for other purposes elsewhere in the code, then definitely sort it first, so that you can use binary search.
int i=0, min=255, max=0 and int i=0, *min=255, *max=0 both define three new variables, which are initialized, but used incorrectly in the loop body.
The limits should be initialized before the loop:
*min=255;
*max=0;
for(int i=0; i<count; i++)
Alternatively the new variable i could be defined before the loop, but this is not as easy to read as the first one:
int i;
for(i=0, *min=255, *max=0; i<count; i++)
Note that if there are values smaller than 0 or larger than 255, the returned minimum and maximum values will be incorrect.
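One way to avoid both the shadowing problem and the 0..255 limitation is to seed the running minimum and maximum from the first element. The following is a sketch under the assumption that the array holds at least one element; find_min_max_int is a name chosen for this example and is not the book's function.

```c
#include <stddef.h>

/* Stores the smallest and largest element of list[0..count) in *min and *max.
   Assumes count >= 1. */
void find_min_max_int(const int list[], size_t count, int *min, int *max)
{
    int lo = list[0];   /* seed from real data, so any int range works */
    int hi = list[0];
    for (size_t i = 1; i < count; i++) {
        if (list[i] < lo) {
            lo = list[i];
        }
        if (list[i] > hi) {
            hi = list[i];
        }
    }
    *min = lo;          /* write through the pointers once, after the loop */
    *max = hi;
}
```

When calling it, pass the element count, for example sizeof table / sizeof table[0]; sizeof(table) on its own, as in the book's call comment, gives the size in bytes, which is larger than the element count for an int array.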
It is some mistake in the code. Maybe it should read:
for( int i=0, *min=255, *max=0; i<count; i++){
if( *min>list[i])
*min=list[i];
if( *max<list[i] )
*max=list[i];
}

Determine length in bytes of content of a variable

I want to automatically determine length in bytes of a field (addr) that is uint32, based on its contents. Compiler is GCC. I use this:
uint8 len;
if(addr < 256) len = 1;
else if (addr < 65536) len = 2;
else if (addr < 16777216) len = 3;
else len = 4;
Is there a more efficient way?
This is inside an SPI function for an embedded device. I'm interested in the fastest way except macros, since addr can be a variable.
You can do it using an approach similar to binary search: first compare to 65536, then either to 256 or 16777216, depending on the outcome of the first comparison. This way you always finish in two comparisons, while your code sometimes would require three:
uint8 len = (addr < 65536)
? ((addr < 256) ? 1 : 2)
: ((addr < 16777216) ? 3 : 4);
gcc has a __builtin_clz() function (count leading zeros) which can be used like
if (addr == 0)
len = 0;
else
len = (sizeof(int) * 8 - __builtin_clz(addr) + 7) / 8;
The zero check is needed because the result of __builtin_clz(0) is undefined.
Update:
under ARM, this compiles to
cmp r0, #0
clzne r0, r0
rsbne r0, r0, #39
lsrne r0, r0, #3
bx lr
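For reference, here is a self-contained sketch of the leading-zero-count approach using __builtin_clz. It assumes GCC or Clang and a 32-bit unsigned int; the helper name addr_len_bytes is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to hold addr.
   Assumes unsigned int is 32 bits wide (true on ARM with GCC). */
static inline uint8_t addr_len_bytes(uint32_t addr)
{
    if (addr == 0)
        return 0;   /* __builtin_clz(0) is undefined, so handle it here */

    /* Position of the highest set bit, counted from 1. */
    unsigned bits = 32u - (unsigned)__builtin_clz(addr);

    /* Round up to whole bytes. */
    return (uint8_t)((bits + 7u) / 8u);
}
```

Note that this returns 0 for addr == 0, whereas the if/else chain in the question returns 1; return 1 from the zero branch if at least one byte must always be sent.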
Get the position of the highest bit – the only one that counts for your question (from https://stackoverflow.com/a/14085901). Divide by 8 to get the number of bytes.
addr = 0x20424;
printf ("%d\n", (fls(addr)+7)>>3);
It returns 0 when addr == 0.
Note that only ffs (find first set) is specified by POSIX.1-2001, POSIX.1-2008 and 4.3BSD; fls is a BSD extension (available on FreeBSD and macOS, for example, but not in glibc). If your current system does not contain it, look at the above link or What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? for more suggestions to find the highest bit set.
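Where fls is not available, a portable fallback is easy to write. The sketch below is a plain shift loop, not the library implementation, and the name portable_fls32 is hypothetical; it follows the same convention of returning 0 for a zero argument and a 1-based bit position otherwise.

```c
#include <stdint.h>

/* Hypothetical fallback: 1-based index of the highest set bit, 0 if x == 0. */
static int portable_fls32(uint32_t x)
{
    int pos = 0;

    while (x != 0) {
        pos++;
        x >>= 1;
    }
    return pos;
}
```

With addr = 0x20424 the highest set bit is bit 17, so portable_fls32 returns 18 and (18 + 7) >> 3 gives 3 bytes.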

ARMCC 5 optimization of strtol and strtod

I have a board based on STM32L4 MCU (Ultra Low Power Cortex-M4) for GNSS tracking purposes. I don't use RTOS, so I use a custom scheduler. Compiler and environment is KEIL uVision 5 (compiler 5.05 and 5.06, behavior doesn't change)
The MCU speaks with GNSS module via plain UART and the protocol is a mix of NMEA and AT. GNSS position is given as plain text that must be converted to a pair of float/double coordinates.
To get the double/float value from text, I use strtod (or strtof).
Note that string operations are made in a separate buffer, different from the UART RX one.
The typical string for a latitude on the UART is
4256.45783
which means 42° 56.45783'
to get absolute position in degrees, I use the following formula
42 + 56.45783 / 60
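As an illustration of that formula, the conversion from the ddmm.mmmmm text field to decimal degrees could look roughly like this. The function name nmea_ddmm_to_degrees is hypothetical, and the sketch assumes a well-formed, non-negative field (the hemisphere is carried separately in NMEA).

```c
#include <stdlib.h>

/* Hypothetical helper: converts "4256.45783" (42 deg 56.45783 min)
   to decimal degrees. Assumes a valid, non-negative input string. */
double nmea_ddmm_to_degrees(const char *field)
{
    double raw = strtod(field, NULL);        /* 4256.45783 */
    int degrees = (int)(raw / 100.0);        /* 42 */
    double minutes = raw - degrees * 100.0;  /* 56.45783 */

    return degrees + minutes / 60.0;
}
```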
Without optimization the code works and the position is converted correctly. With level 1 optimization (or higher) and the standard C library, the integer part (42 in the example) converts fine, but converting 56.45783 yields only 56, that is, the integer part of the minutes up to the dot.
If I replace the standard library function with a custom strtod downloaded from an ANSI C source library, I simply get 0 with an ERANGE error.
Elsewhere in the code I use strtol, which also behaves strangely at level 1 optimization: when the first digit is 9 and the conversion base is 10, it skips that 9 and carries on with the remaining digits.
So if the buffer contains 92, only 2 is parsed. To work around this I prepended a + sign to the number, and the result is always OK (as far as I can tell). This workaround doesn't work with strtod.
Note that I tried to use static, volatile and on-stack variables, behavior doesn't change.
EDIT: I simplified the code in order to get where it goes wrong, as per comments hereafter
C code is like this:
void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
{
double dbl = 0.0;
dbl = strtod("56.45783",NULL);
if(struc != NULL)
{
struc->Axis = (float)((dbl / 60.0) + 42.0);
}
}
Level 0 optimization:
559: void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
0x08011FEE BDF8 POP {r3-r7,pc}
560: {
0x08011FF0 B570 PUSH {r4-r6,lr}
0x08011FF2 4605 MOV r5,r0
0x08011FF4 ED2D8B06 VPUSH.64 {d8-d10}
0x08011FF8 460C MOV r4,r1
561: double dbl = 0.0;
0x08011FFA ED9F0BF8 VLDR d0,[pc,#0x3E0]
0x08011FFE EEB08A40 VMOV.F32 s16,s0
0x08012002 EEF08A60 VMOV.F32 s17,s1
562: dbl = strtod("56.45783",NULL);
0x08012006 2100 MOVS r1,#0x00
0x08012008 A0F6 ADR r0,{pc}+4 ; #0x080123E4
0x0801200A F7FDFED1 BL.W __hardfp_strtod (0x0800FDB0)
0x0801200E EEB08A40 VMOV.F32 s16,s0
0x08012012 EEF08A60 VMOV.F32 s17,s1
563: if(struc != NULL)
564: {
0x08012016 B1A4 CBZ r4,0x08012042
565: struc->Axis = (float)((dbl / 60.0) + 42.0);
566: }
0x08012018 ED9F0BF5 VLDR d0,[pc,#0x3D4]
0x0801201C EC510B18 VMOV r0,r1,d8
0x08012020 EC532B10 VMOV r2,r3,d0
0x08012024 F7FEF880 BL.W __aeabi_ddiv (0x08010128)
0x08012028 EC410B1A VMOV d10,r0,r1
0x0801202C ED9F0BF2 VLDR d0,[pc,#0x3C8]
0x08012030 EC532B10 VMOV r2,r3,d0
0x08012034 F7FDFFBC BL.W __aeabi_dadd (0x0800FFB0)
0x08012038 EC410B19 VMOV d9,r0,r1
0x0801203C F7FDFF86 BL.W __aeabi_d2f (0x0800FF4C)
0x08012040 6020 STR r0,[r4,#0x00]
567: }
LEVEL 1 optimization
557: void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
0x08011FEE BDF8 POP {r3-r7,pc}
558: {
559: double dbl = 0.0;
0x08011FF0 B510 PUSH {r4,lr}
0x08011FF2 460C MOV r4,r1
560: dbl = strtod("56.45783",NULL);
0x08011FF4 2100 MOVS r1,#0x00
0x08011FF6 A0F7 ADR r0,{pc}+2 ; #0x080123D4
0x08011FF8 F7FDFEDA BL.W __hardfp_strtod (0x0800FDB0)
561: if(struc != NULL)
562: {
0x08011FFC 2C00 CMP r4,#0x00
0x08011FFE D010 BEQ 0x08012022
563: struc->Axis = (float)((dbl / 60.0) + 42.0);
564: }
0x08012000 ED9F1BF7 VLDR d1,[pc,#0x3DC]
0x08012004 EC510B10 VMOV r0,r1,d0
0x08012008 EC532B11 VMOV r2,r3,d1
0x0801200C F7FEF88C BL.W __aeabi_ddiv (0x08010128)
0x08012010 ED9F1BF5 VLDR d1,[pc,#0x3D4]
0x08012014 EC532B11 VMOV r2,r3,d1
0x08012018 F7FDFFCA BL.W __aeabi_dadd (0x0800FFB0)
0x0801201C F7FDFF96 BL.W __aeabi_d2f (0x0800FF4C)
0x08012020 6020 STR r0,[r4,#0x00]
565: }
I looked at the disassembly of __hardfp_strtod and __strtod_int called by these functions and, as they are linked in as binaries, they don't change with the optimization level.
strtod did not work once optimization was enabled.
Thanks to @old_timer's suggestion, I wrote my own strtod function, which works even at optimization level 2.
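#include <math.h> /* pow */
#include <string.h> /* strchr */
/* Parses an unsigned decimal such as "56.45783": no sign, exponent or
whitespace handling, and it returns 0.0 when there is no '.'. */
double simple_strtod(const char* str)
{
int inc;
double result = 0.0;
const char* dot = strchr(str, '.');
const char* c_tmp;
if(dot != NULL)
{
/* fractional part: at most 8 digits after the dot */
c_tmp = dot + 1;
inc = -1;
while(*c_tmp != 0 && inc > -9)
{
result += (*c_tmp - '0') * pow(10.0, inc);
c_tmp++; inc--;
}
/* integer part: walk backwards from the digit before the dot,
never stepping past the start of the string */
inc = 0;
c_tmp = dot;
while(c_tmp > str)
{
c_tmp--;
result += (*c_tmp - '0') * pow(10.0, inc);
inc++;
}
}
return result;
}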
It can be further optimized by not calling 'pow' and use something more clever, but just like this it works perfectly.
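One way to drop pow, sketched under the same assumptions (unsigned input, no exponent, no whitespace): accumulate all digits as one number and divide once by the matching power of ten. The name simple_strtod_nopow is hypothetical, and unlike the version above this sketch also accepts input without a decimal point.

```c
/* Hypothetical pow-free variant: parses digits, an optional '.',
   and more digits. Stops at the first other character. */
double simple_strtod_nopow(const char *str)
{
    double result = 0.0;   /* all digits read so far, as one integer */
    double divisor = 1.0;  /* 10^(number of fractional digits) */

    /* Integer part. */
    while (*str >= '0' && *str <= '9') {
        result = result * 10.0 + (*str - '0');
        str++;
    }

    /* Optional fractional part. */
    if (*str == '.') {
        str++;
        while (*str >= '0' && *str <= '9') {
            result = result * 10.0 + (*str - '0');
            divisor *= 10.0;
            str++;
        }
    }

    /* A single division keeps the rounding error small. */
    return result / divisor;
}
```

For very long digit strings the accumulated value exceeds what a double represents exactly, so this is only suitable for short fields such as NMEA coordinates.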

Efficient Neon Implementation Of Clipping

Within a loop I have to implement a sort of clipping:
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
However, this clipping takes up almost half of the loop's execution time in the NEON version.
This is what the whole loop looks like:
for (row = 0; row < height; row++)
{
for (col = 0; col < width; col++)
{
Int sum;
//...Calculate the sum
Short val = ( sum + offset ) >> shift;
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
dst[col] = val;
}
}
This is how the clipping has been implemented in Neon
cmp %10,#1 //if(isLast)
bne 3f
vmov.i32 %4, d4[0] //put val in %4
cmp %4,#0 //if( val < 0 )
blt 4f
b 5f
4:
mov %4,#0
vmov.i32 d4[0],%4
5:
cmp %4,%11 //if( val > maxVal )
bgt 6f
b 3f
6:
mov %4,%11
vmov.i32 d4[0],%4
3:
This is the mapping of variables to registers-
isLast- %10
maxVal- %11
Any suggestions to make it faster ?
Thanks
EDIT-
The clipping now looks like-
"cmp %10,#1 \n\t"//if(isLast)
"bne 3f \n\t"
"vmin.s32 d4,d4,d13 \n\t"
"vmax.s32 d4,d4,d12 \n\t"
"3: \n\t"
//d13 contains maxVal(255)
//d12 contains 0
Time consumed by this portion of the code has dropped from 223ms to 18ms
Using normal compares with NEON is almost always a bad idea because it forces the contents of a NEON register into a general purpose ARM register, and this costs lots of cycles.
You can use the vmin and vmax NEON instructions. Here is a little example that clamps an array of integers to any min/max values.
void clampArray (int minimum,
int maximum,
int * input,
int * output,
int numElements)
{
// get two NEON values with your minimum and maximum in each lane:
int32x2_t lower = vdup_n_s32 (minimum);
int32x2_t higher = vdup_n_s32 (maximum);
int i;
for (i=0; i<numElements; i+=2)
{
// load two integers
int32x2_t x = vld1_s32 (&input[i]);
// clamp against maximum:
x = vmin_s32 (x, higher);
// clamp against minimum
x = vmax_s32 (x, lower);
// store two integers
vst1_s32 (&output[i], x);
}
}
Warning: This code assumes the numElements is always a multiple of two, and I haven't tested it.
You may even make it faster if you process four elements at a time using the vminq / vmaxq instructions and load/store four integers per iteration.
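A sketch of that four-lane variant follows. It uses the q-register intrinsics vdupq_n_s32, vld1q_s32, vminq_s32, vmaxq_s32 and vst1q_s32, handles leftover elements with a scalar tail so the length need not be a multiple of four, and falls back to the scalar loop entirely when NEON is not available. The function name clamp_array_q is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clamps each input element to [minimum, maximum]. */
void clamp_array_q(int32_t minimum, int32_t maximum,
                   const int32_t *input, int32_t *output,
                   size_t num_elements)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Broadcast the limits into all four lanes. */
    const int32x4_t lower = vdupq_n_s32(minimum);
    const int32x4_t higher = vdupq_n_s32(maximum);

    for (; i + 4 <= num_elements; i += 4) {
        int32x4_t x = vld1q_s32(&input[i]);  /* load four integers */
        x = vminq_s32(x, higher);            /* clamp against maximum */
        x = vmaxq_s32(x, lower);             /* clamp against minimum */
        vst1q_s32(&output[i], x);            /* store four integers */
    }
#endif

    /* Scalar tail (and full fallback on non-NEON targets). */
    for (; i < num_elements; i++) {
        int32_t x = input[i];
        if (x > maximum) x = maximum;
        if (x < minimum) x = minimum;
        output[i] = x;
    }
}
```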
If the clipping limits are exactly the range of a narrower integer type (for example 0..UCHAR_MAX, SCHAR_MIN..SCHAR_MAX, SHRT_MIN..SHRT_MAX or 0..USHRT_MAX), you can simply narrow with saturation in NEON and let the conversion do the clamping.
For example:
// Converts four int32 values to signed short values, with saturation.
int16x4_t vqmovn_s32 (int32x4_t)
// Converts eight signed short values to unsigned char values, with saturation.
uint8x8_t vqmovun_s16 (int16x8_t)
If you do not want to use multiple-data capabilities, you can still use those instructions, by simply loading and reading one of the lanes.
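To make the saturating-narrowing idea concrete, here is a rough sketch that clips int32 values to 0..255 by chaining vqmovn_s32 and vqmovun_s16, eight values per iteration. The function name clip_to_u8 is hypothetical, and a scalar loop covers the tail as well as builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes input[i] clamped to 0..255 into output[i]. */
void clip_to_u8(const int32_t *input, uint8_t *output, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        /* int32 -> int16, saturating to -32768..32767 */
        int16x4_t lo = vqmovn_s32(vld1q_s32(&input[i]));
        int16x4_t hi = vqmovn_s32(vld1q_s32(&input[i + 4]));
        /* int16 -> uint8, saturating to 0..255 */
        uint8x8_t packed = vqmovun_s16(vcombine_s16(lo, hi));
        vst1_u8(&output[i], packed);
    }
#endif

    /* Scalar tail (and full fallback on non-NEON targets). */
    for (; i < n; i++) {
        int32_t x = input[i];
        if (x < 0) x = 0;
        if (x > 255) x = 255;
        output[i] = (uint8_t)x;
    }
}
```

Because the result is stored as bytes, this also removes a separate narrowing step when dst is an 8-bit buffer.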

Resources