Optimizing horizontal boolean reduction in ARM NEON

I'm experimenting with a cross-platform SIMD library à la ecmascript_simd (aka SIMD.js), and part of this is providing a few "horizontal" SIMD operations. In particular, the API the library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.
For example, let v be a <bool32 x 4> (a 128-bit vector), it could be the result of VCLT.S32 or something. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].
This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):
#include <xmmintrin.h>

int all(__m128 x) {
    return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}
and similarly for any.
I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which has no instruction like movmskps. There's the approach of simply extracting each element and computing with scalars (the naive method below), and there's the approach of using the "horizontal" operations NEON supports, like VPMAX and VPMIN.
#include <arm_neon.h>

int all_naive(uint32x4_t v) {
    return v[0] && v[1] && v[2] && v[3];
}

int all_horiz(uint32x4_t v) {
    uint32x2_t x = vpmin_u32(vget_low_u32(v),
                             vget_high_u32(v));
    uint32x2_t y = vpmin_u32(x, x);
    return vget_lane_u32(y, 0) != 0;
}
(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)
Are there are other tricks one can use to implement this?
Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes they are useful, e.g. many SIMD implementations of Mandelbrot will operate on 4 points at once, and bail out of the inner loop when all of them are out of range... which requires doing a comparison and then a horizontal AND.

This is my current solution that is implemented in eve library.
If your backend has C++20 support, you can just use the library: it has implementations for arm-v7, arm-v8 (only little-endian at the moment) and all x86 from sse2 to avx-512. It's open source and MIT licensed, and in beta at the moment. Feel free to reach out (for example with an issue) if you are trying out the library.
Take everything with a grain of salt - I don't yet have the arm benchmarks set up.
NOTE: On top of basic all and any we also have a movemask equivalent to do more complex operations like first_true. That wasn't part of the question and it's not amazing but the code can be found here
ARM-V7, 8 bytes register
Now, arm-v7 is 32 bit architecture, so we try to get to 32 bit elements where we can.
any
Use pairwise 32 bit max. If any element is true, the max is true.
// cast to dwords
dwords = vpmax_u32(dwords, dwords);
return vget_lane_u32(dwords, 0);
all
Pairwise min instead of max. What you test against also changes: if you have 4-byte elements, just test for non-zero; for shorts or chars, you need to test for -1 (all bits set).
// cast to dwords
dwords = vpmin_u32(dwords, dwords);
std::uint32_t combined = vget_lane_u32(dwords, 0);
// Assuming T is your scalar type
if constexpr ( sizeof(T) >= 4 ) return combined;
// I decided that !~ is better than -1, the compiler will figure it out.
return !~combined;
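
To make this concrete, here is a minimal self-contained sketch of the 8-byte versions for 32-bit lanes (the function names are mine, not eve's):

#include <arm_neon.h>

/* any: pairwise max folds both lanes into lane 0 */
int any_u32x2(uint32x2_t m) {
    m = vpmax_u32(m, m);
    return vget_lane_u32(m, 0) != 0;
}

/* all: pairwise min; for 32-bit lanes, non-zero means all-true */
int all_u32x2(uint32x2_t m) {
    m = vpmin_u32(m, m);
    return vget_lane_u32(m, 0) != 0;
}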
ARM-V7, 16 bytes register
For anything bigger than chars, just do a narrowing conversion to a 64-bit register. Here is the list of vector narrowing integer conversions.
For chars, the best I found is to reinterpret as uint32 and do an extra check.
So compare for == -1 for all and > 0 for any.
This seemed to give nicer asm than splitting into two 8-byte registers.
Then just do all/any on that dword register.
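
As an illustration, a hedged sketch of the narrowing approach for 32-bit lanes (vmovn_u32 keeps the low half of each lane; since mask lanes are all-ones or all-zeros, the truth value survives; the function name is mine):

#include <arm_neon.h>

int all_u32x4_v7(uint32x4_t m) {
    uint16x4_t narrowed = vmovn_u32(m);              /* 4 x 32-bit -> 4 x 16-bit */
    uint32x2_t dwords   = vreinterpret_u32_u16(narrowed);
    dwords = vpmin_u32(dwords, dwords);
    /* 16-bit lanes: all-true iff the combined dword is all ones */
    return !~vget_lane_u32(dwords, 0);
}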
ARM-v8, 8 byte
ARM-v8 has 64 bit support, so you can just get a 64 bit lane. That one is trivially testable.
ARM-v8, 16 byte
We use vmaxvq_u32 for any (since there is no 64-bit across-vector reduction) and vminvq_u32, vminvq_u16 or vminvq_u8 for all, depending on the element size.
(This is similar to what glibc strlen does.)
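
Written out, a minimal AArch64 sketch for 32-bit lanes (the function names are mine):

#include <arm_neon.h>

int any_u32x4(uint32x4_t m) { return vmaxvq_u32(m) != 0; }
int all_u32x4(uint32x4_t m) { return vminvq_u32(m) != 0; }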
Conclusion
The lack of benchmarks definitely makes me worried; some instructions are sometimes problematic and I wouldn't know about it.
Regardless, that's the best I've got, so far at least.

NOTE: first time looking at arm today, I might be wrong about things.
UPD: Removed ARM-V7 and will write up what we ended up doing in a separate answer
ARM-V8.
For ARM-V8, have a look at this strlen implementation from glibc:
https://code.woboq.org/userspace/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html
ARM-V8 introduced reductions across registers. Here they use min to compare with 0
uminv datab2, datav.16b
mov tmp1, datav2.d[0]
cbnz tmp1, L(main_loop)
Find the smallest byte and compare it with 0: if it is non-zero, there is no NUL here, so take the next 16 bytes.
There are a few other reductions in ARM-V8 like vaddvq_u8.
I'm pretty sure you can do most of the things you'd want from movemask and the like with this.
Another interesting thing here is how they find the first_true
/* Set the NULL byte as 0xff and the rest as 0x00, move the data into a
   pair of scalars and then compute the length from the earliest NULL
   byte. */
cmeq datav.16b, datav.16b, #0
mov data1, datav.d[0]
mov data2, datav.d[1]
cmp data1, 0
csel data1, data1, data2, ne
sub len, src, srcin
rev data1, data1
add tmp2, len, 8
clz tmp1, data1
csel len, len, tmp2, ne
add len, len, tmp1, lsr 3
Looks a bit intimidating, but my understanding is:
they narrow it down to a 64-bit number just by doing an if/else (if the first half doesn't have the zero, the second half does);
then use count leading zeroes to find the position (I didn't quite understand all of the endianness stuff here, but it's glibc, so this is the correct one).
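
Expressed in intrinsics, a hedged little-endian sketch of the same first_true idea (the function name is mine; glibc uses rev + clz, which on little-endian amounts to the count-trailing-zeros below):

#include <arm_neon.h>
#include <stdint.h>

/* Returns the index of the first all-ones byte lane, or 16 if none. */
int first_true_u8x16(uint8x16_t mask) {
    uint64_t lo = vgetq_lane_u64(vreinterpretq_u64_u8(mask), 0);
    uint64_t hi = vgetq_lane_u64(vreinterpretq_u64_u8(mask), 1);
    if (lo != 0) return __builtin_ctzll(lo) >> 3;       /* byte index in the low half */
    if (hi != 0) return 8 + (__builtin_ctzll(hi) >> 3); /* byte index in the high half */
    return 16;                                          /* no true lanes */
}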
So - if you only need V8 - there is a solution.

Is this how the + operator is implemented in C?

When understanding how primitive operators such as +, -, * and / are implemented in C, I found the following snippet from an interesting answer.
// replaces the + operator
int add(int x, int y) {
    while (x) {
        int t = (x & y) << 1;
        y ^= x;
        x = t;
    }
    return y;
}
It seems that this function demonstrates how + actually works in the background. However, it's too confusing for me to understand. For a long time I believed that such operations were done using assembly instructions generated by the compiler!
Is the + operator implemented as the code posted on MOST implementations? Does this take advantage of two's complement or other implementation-dependent features?
To be pedantic, the C specification does not specify how addition is implemented.
But to be realistic, the + operator on integer types smaller than or equal to the word size of your CPU gets translated directly into an addition instruction for the CPU, and larger integer types get translated into multiple addition instructions with some extra bits to handle overflow.
The CPU internally uses logic circuits to implement the addition, and does not use loops, bitshifts, or anything that has a close resemblance to how C works.
When you add two bits, the following is the result (truth table):

a | b | sum (a^b) | carry bit (a&b) (goes to next position)
--+---+-----------+-----------------------------------------
0 | 0 |     0     |     0
0 | 1 |     1     |     0
1 | 0 |     1     |     0
1 | 1 |     0     |     1
So if you do bitwise xor, you can get the sum without carry.
And if you do bitwise and you can get the carry bits.
Extending this observation to multi-bit numbers a and b:

a+b = sum_without_carry(a, b) + carry_bits(a, b) shifted left by 1 bit
    = (a^b) + ((a&b) << 1)
Once b is 0:
a+0 = a
So algorithm boils down to:
Add(a, b)
  if b == 0
    return a;
  else
    carry_bits = a & b;
    sum_bits = a ^ b;
    return Add(sum_bits, carry_bits << 1);
If you get rid of recursion and convert it to a loop
Add(a, b)
  while (b != 0) {
    carry_bits = a & b;
    sum_bits = a ^ b;
    a = sum_bits;
    b = carry_bits << 1; // In the next iteration, add the carry bits to a
  }
  return a;
With the above algorithm in mind, the explanation of the code becomes simpler:

int t = (x & y) << 1;

Carry bits: bit i of t is 1 if bit i-1 is 1 in both operands.

y ^= x; // x is used now

Addition without carry (carry bits ignored).

x = t;

Reuse x to hold the carry.

while (x)

Repeat while there are more carry bits.
A recursive implementation (easier to understand) would be:

int add(int x, int y) {
    return (y == 0) ? x : add(x ^ y, (x & y) << 1);
}
Seems that this function demonstrates how + actually works in the background

No. Usually (almost always) integer addition translates to the machine instruction add. This just demonstrates an alternative implementation using bitwise xor and and.
Seems that this function demonstrates how + actually works in the background
No. This is translated to the native add machine instruction, which actually uses the hardware adder in the ALU.
If you're wondering how does the computer add, here is a basic adder.
Everything in the computer is done using logic gates, which are mostly made of transistors. The full adder has half-adders in it.
For a basic tutorial on logic gates, and adders, see this. The video is extremely helpful, though long.
In that video, a basic half-adder is shown. If you want a brief description, this is it:
The half adder adds the two given bits. The possible combinations are:
Add 0 and 0 = 0
Add 1 and 0 = 1
Add 1 and 1 = 10 (binary)
So now, how does the half adder work? Well, it is made up of three logic gates: the and, the xor and the nand. The nand gives a positive output if both the inputs are negative, so that solves the case of 0 and 0. The xor gives a positive output if one of the inputs is positive and the other negative, so that solves the problem of 1 and 0. The and gives a positive output only if both the inputs are positive, so that solves the problem of 1 and 1. So basically, we have now got our half-adder. But we can still only add single bits.
Now we make our full-adder. A full adder consists of calling the half-adder again and again. Now this has a carry. When we add 1 and 1, we get a carry 1. So what the full-adder does is, it takes the carry from the half-adder, stores it, and passes it as another argument to the half-adder.
If you're confused about how you can pass the carry: you basically first add the bits using the half-adder, and then add the sum and the carry. So now you've added the carry with the two bits. You do this again and again till the bits you have to add are over, and then you get your result.
Surprised? This is how it actually happens. It looks like a long process, but the computer does it in fractions of a nanosecond, or to be more specific, in half a clock cycle. Sometimes it is performed even in a single clock cycle. Basically, the computer has the ALU (a major part of the CPU), memory, buses, etc..
If you want to learn computer hardware, from logic gates, memory and the ALU, and simulate a computer, you can see this course, from which I learnt all this: Build a Modern Computer from First Principles
It's free if you do not want an e-certificate. Part two of the course is coming up this spring.
C uses an abstract machine to describe what C code does. So how it works is not specified. There are C "compilers" that actually compile C into a scripting language, for example.
But, in most C implementations, + between two integers smaller than the machine integer size will be translated into an assembly instruction (after many steps). The assembly instruction will be translated into machine code and embedded within your executable. Assembly is a language "one step removed" from machine code, intended to be easier to read than a bunch of packed binary.
That machine code (after many steps) is then run on the target hardware platform, where it is interpreted by the instruction decoder on the CPU. This instruction decoder takes the instruction and translates it into signals to send along "control lines". These signals route data from registers and memory through the CPU, where the values are added together, often in an arithmetic logic unit.
The arithmetic logic unit might have separate adders and multipliers, or might mix them together.
The arithmetic logic unit has a bunch of transistors that perform the addition operation, then produce the output. Said output is routed via the signals generated from the instruction decoder, and stored in memory or registers.
The layout of said transistors in both the arithmetic logic unit and instruction decoder (as well as parts I have glossed over) is etched into the chip at the plant. The etching pattern is often produced by compiling a hardware description language, which takes an abstraction of what is connected to what and how they operate and generates transistors and interconnect lines.
The hardware description language can contain shifts and loops that don't describe things happening in time (like one after another) but rather in space -- it describes the connections between different parts of hardware. Said code may look very vaguely like the code you posted above.
The above glosses over many parts and layers and contains inaccuracies. This is both from my own incompetence (I have written both hardware and compilers, but am an expert in neither) and because full details would take a career or two, and not a SO post.
Here is a SO post about an 8-bit adder. Here is a non-SO post, where you'll note some of the adders just use operator+ in the HDL! (The HDL itself understands + and generates the lower level adder code for you).
Almost any modern processor that can run compiled C code will have builtin support for integer addition. The code you posted is a clever way to perform integer addition without executing an integer add opcode, but it is not how integer addition is normally performed. In fact, the function linkage probably uses some form of integer addition to adjust the stack pointer.
The code you posted relies on the observation that when adding x and y, you can decompose it into the bits they have in common and the bits that are unique to one of x or y.
The expression x & y (bitwise AND) gives the bits common to x and y. The expression x ^ y (bitwise exclusive OR) gives the bits that are unique to one of x or y.
The sum x + y can be rewritten as the sum of two times the bits they have in common (since both x and y contribute those bits) plus the bits that are unique to x or y.
(x & y) << 1 is twice the bits they have in common (the left shift by 1 effectively multiplies by two).
x ^ y is the bits that are unique to one of x or y.
So if we replace x by the first value and y by the second, the sum should be unchanged. You can think of the first value as the carries of the bitwise additions, and the second as the low-order bit of the bitwise additions.
This process continues until x is zero, at which point y holds the sum.
The code that you found tries to explain how very primitive computer hardware might implement an "add" instruction. I say "might" because I can guarantee that this method isn't used by any CPU, and I'll explain why.
In normal life, you use decimal numbers and you have learned how to add them: to add two numbers, you add the lowest two digits. If the result is less than 10, you write down the result and proceed to the next digit position. If the result is 10 or more, you write down the result minus 10, proceed to the next digit, but you remember to add 1 more. For example: 23 + 37, you add 3+7 = 10, you write down 0 and remember to add 1 more for the next position. At the 10s position, you add (2+3) + 1 = 6 and write that down. The result is 60.
You can do the exact same thing with binary numbers. The difference is that the only digits are 0 and 1, so the only possible sums are 0, 1, 2. For a 32 bit number, you would handle one digit position after the other. And that is how really primitive computer hardware would do it.
This code works differently. You know the sum of two binary digits is 2 if both digits are 1. So if both digits are 1, you would add 1 more at the next binary position and write down 0. That's what the calculation of t does: it finds all places where both binary digits are 1 (that's the &) and moves them to the next digit position (<< 1). Then it does the addition: 0+0 = 0, 0+1 = 1, 1+0 = 1, 1+1 is 2, but we write down 0. That's what the exclusive or operator does.
But all the 1's that you had to handle in the next digit position haven't been handled. They still need to be added. That's why the code does a loop: In the next iteration, all the extra 1's are added.
Why does no processor do it that way? Because it's a loop, and processors don't like loops, and it is slow. It's slow, because in the worst case, 32 iterations are needed: If you add 1 to the number 0xffffffff (32 1-bits), then the first iteration clears bit 0 of y and sets x to 2. The second iteration clears bit 1 of y and sets x to 4. And so on. It takes 32 iterations to get the result. However, each iteration has to process all bits of x and y, which takes a lot of hardware.
A primitive processor would do things just as quick in the way you do decimal arithmetic, from the lowest position to the highest. It also takes 32 steps, but each step processes only two bits plus one value from the previous bit position, so it is much easier to implement. And even in a primitive computer, one can afford to do this without having to implement loops.
A modern, fast and complex CPU will use a "conditional sum adder". Especially if the number of bits is high, for example a 64 bit adder, it saves a lot of time.
A 64 bit adder consists of two parts: First, a 32 bit adder for the lowest 32 bit. That 32 bit adder produces a sum, and a "carry" (an indicator that a 1 must be added to the next bit position). Second, two 32 bit adders for the higher 32 bits: One adds x + y, the other adds x + y + 1. All three adders work in parallel. Then when the first adder has produced its carry, the CPU just picks which one of the two results x + y or x + y + 1 is the correct one, and you have the complete result. So a 64 bit adder only takes a tiny bit longer than a 32 bit adder, not twice as long.
The 32 bit adder parts are again implemented as conditional sum adders, using multiple 16 bit adders, and the 16 bit adders are conditional sum adders, and so on.
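
To make the idea concrete, here is a hedged C sketch of a single carry-select step, simulating a 64-bit add from 32-bit halves (real hardware computes the three partial sums in parallel; the function name is mine):

#include <stdint.h>

uint64_t add64_carry_select(uint64_t x, uint64_t y) {
    uint32_t xl = (uint32_t)x,         yl = (uint32_t)y;
    uint32_t xh = (uint32_t)(x >> 32), yh = (uint32_t)(y >> 32);
    uint32_t sum_lo  = xl + yl;
    uint32_t carry   = sum_lo < xl;   /* did the low half overflow? */
    uint32_t sum_hi0 = xh + yh;       /* high half assuming carry-in 0 */
    uint32_t sum_hi1 = xh + yh + 1;   /* high half assuming carry-in 1 */
    uint32_t sum_hi  = carry ? sum_hi1 : sum_hi0;
    return ((uint64_t)sum_hi << 32) | sum_lo;
}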
My question is: Is the + operator implemented as the code posted on MOST implementations?
Let's answer the actual question. All operators are implemented by the compiler as some internal data structure that eventually gets translated into code after some transformations. You can't say what code will be generated by a single addition because almost no real world compiler generates code for individual statements.
The compiler is free to generate any code as long as it behaves as if the actual operations were performed according to the standard. But what actually happens can be something completely different.
A simple example:
static int foo(int a, int b)
{
    return a + b;
}

[...]

int a = foo(1, 17);
int b = foo(x, x);
some_other_function(a, b);
There's no need to generate any addition instructions here. It's perfectly legal for the compiler to translate this into:
some_other_function(18, x * 2);
Or maybe the compiler notices that you call the function foo a few times in a row and that it is a simple arithmetic and it will generate vector instructions for it. Or that the result of the addition is used for array indexing later and the lea instruction will be used.
You simply can't talk about how an operator is implemented because it is almost never used alone.
In case a breakdown of the code helps anyone else, take the example x=2, y=6:

x isn't zero, so commence adding to y:

while(2) {

x & y = 2 because

  x:    0 0 1 0   //2
  y:    0 1 1 0   //6
  x&y:  0 0 1 0   //2

2 << 1 = 4 because << 1 shifts all bits to the left:

  x&y:         0 0 1 0   //2
  (x&y) << 1:  0 1 0 0   //4

In summary, stash that result, 4, in t with

int t = (x & y) << 1;

Now apply the bitwise XOR y ^= x:

  x:     0 0 1 0   //2
  y:     0 1 1 0   //6
  y^=x:  0 1 0 0   //4

So x=2, y=4. Finally, sum t+y by resetting x=t and going back to the beginning of the while loop:

x = t;

When t=0 (or, at the beginning of the loop, x=0), finish with

return y;
Just out of interest, on the Atmega328P processor, with the avr-g++ compiler, the following code implements adding one by subtracting -1 :
volatile char x;

int main ()
{
    x = x + 1;
}
Generated code:
00000090 <main>:
volatile char x;
int main ()
{
x = x + 1;
90: 80 91 00 01 lds r24, 0x0100
94: 8f 5f subi r24, 0xFF ; 255
96: 80 93 00 01 sts 0x0100, r24
}
9a: 80 e0 ldi r24, 0x00 ; 0
9c: 90 e0 ldi r25, 0x00 ; 0
9e: 08 95 ret
Notice in particular that the add is done by the subi instruction (subtract constant from register) where 0xFF is effectively -1 in this case.
Also of interest is that this particular processor does not have an addi instruction, which implies that the designers thought that doing a subtract of the complement would be adequately handled by the compiler-writers.
Does this take advantage of two's complement or other implementation-dependent features?
It would probably be fair to say that compiler-writers would attempt to implement the wanted effect (adding one number to another) in the most efficient way possible for that particular architecture. If that requires subtracting the complement, so be it.

Speed comparison between bitwise operations

I have a question about the number of cycles needed for bitwise operations, or more precisely, the XOR operation. In my program, I have two 1D arrays of uint8_t variables with a fixed size of 8. I want to XOR both arrays, and I was wondering about the most effective way to do so. This is code summarizing the options I have found (shown with size-4 arrays for brevity):
#include <stdint.h>
#include <stdio.h>

int main() {
    uint8_t tab[4] = {1,0,0,2};
    uint8_t tab2[4] = {2,3,4,1};

    /* First option: element by element */
    uint8_t tab3[4] = {tab[0]^tab2[0], tab[1]^tab2[1],
                       tab[2]^tab2[2], tab[3]^tab2[3]};

    /* Second option: reinterpret as one 32-bit word
       (note: these casts violate strict aliasing and assume alignment) */
    uint32_t* t = (uint32_t*)tab;
    uint32_t* t2 = (uint32_t*)tab2;
    uint32_t t3 = *t ^ *t2;
    uint8_t* tab4 = (uint8_t*)&t3;

    /* Comparison */
    printf("%d & %d\n", tab3[0], tab4[0]);
    printf("%d & %d\n", tab3[1], tab4[1]);
    printf("%d & %d\n", tab3[2], tab4[2]);
    printf("%d & %d\n", tab3[3], tab4[3]);
    return 0;
}
What is the best option from a cycle/byte point of view?
All the basic binary operations (and, or, xor, not) execute in one clock cycle (or less) on almost every processor architecture since the 1960s. I say "or less" because the overhead of fetching instructions, tracking ready registers, etc., may put the binary operation time into the noise.
To make the algorithm faster, it would be necessary to study the caching characteristics of the data.
Most any practical algorithm crunching with binary operations will be faster than the associated I/O. Hashing algorithms (like the SHA family) are probably the exception.
Single integer operations are usually faster than four single-byte operations. For example, a memchr() implemented with the single-instruction loop rep scasb, which is byte-oriented, is slower than an integer-optimized version of memchr(), even though the latter involves about 12 instructions per integer.
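
As a practical note, a hedged sketch of the word-at-a-time XOR that avoids the aliasing problem in the question's second option, using memcpy (which compilers typically optimize to a plain load/store; the function name is mine):

#include <stdint.h>
#include <string.h>

void xor4(uint8_t out[4], const uint8_t a[4], const uint8_t b[4]) {
    uint32_t wa, wb, wc;
    memcpy(&wa, a, 4);   /* alignment-safe load */
    memcpy(&wb, b, 4);
    wc = wa ^ wb;        /* one 32-bit XOR instead of four 8-bit ones */
    memcpy(out, &wc, 4);
}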

Fast and efficient array lookup and modification

I have been wondering for a while which of the two following methods are faster or better.
MY CURRENT METHOD
I'm developing a chess game and the pieces are stored as numbers (really bytes, to preserve memory) in a one-dimensional array. There is a position for the cursor corresponding to the index in the array. Accessing the piece at the current position in the array is easy (piece = pieces[cursorPosition]).
The problem is that getting the x and y values for checking whether a move is valid requires division and modulo operations (x = cursorPosition % 8; y = cursorPosition / 8).
Likewise when using x and y to check if moves are valid (you have to do it this way for reasons that would fill the entire page), you have to do something like - purely as an example - if pieces[y * 8 + x] != 0: movePiece = False. The obvious problem is having to do y * 8 + x a bunch of times to access the array.
Ultimately, this means that getting a piece is trivial but then getting the x and y requires another bit of memory and a very small amount of time to compute it each round.
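
As an aside, the mapping being discussed is tiny when written out; a hedged sketch (the helper names are mine; the first answer below shows that compilers reduce % 8 and / 8 to exactly these bit operations):

#include <stdint.h>

static inline uint8_t getX(uint8_t pos) { return pos & 7; }   /* pos % 8 */
static inline uint8_t getY(uint8_t pos) { return pos >> 3; }  /* pos / 8 */
static inline uint8_t toIndex(uint8_t x, uint8_t y) { return (uint8_t)((y << 3) | x); }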
A MORE TRADITIONAL METHOD
Using a two-dimensional array, one can implement the above process a little more easily, except that piece lookup is now a little harder and more memory is used (i.e. piece = pieces[cursorPosition[0]][cursorPosition[1]] or piece = pieces[x][y]).
I don't think this is faster and it definitely doesn't look less memory intensive.
GOAL
My end goal is to have the fastest possible code that uses the least amount of memory. This will be developed for the unix terminal (and potentially Windows CMD, if I can figure out how to represent the pieces without color using ANSI escape sequences), and I will either use a secure (encrypted, with protocol and structure) TCP connection to connect people p2p to play chess, or something else. I don't know how much memory people will have, how fast their computers will be, or how strong their internet connections will be.
I also just want to learn to do this the best way possible and see if it can be done.
I suppose my question is one of the following:
Which of the above methods is better, assuming that there are slightly more computations involving move validation (which means that y * 8 + x has to be used a lot)?
or
Is there perhaps a method that combines the benefits of 1D and 2D arrays without as many drawbacks as I described?
First, you should profile your code to make sure that this is really a bottleneck worth spending time on.
Second, if you're representing your position as an unsigned byte, decomposing it into X and Y coordinates will be very fast. If we use the following C code:

int getX(unsigned char pos) {
    return pos % 8;
}

We get the following assembly with gcc 4.8 -O2:

getX(unsigned char):
    movl %edi, %eax
    andl $7, %eax
    ret

If we get the Y coordinate with:

int getY(unsigned char pos) {
    return pos / 8;
}

We get the following assembly with gcc 4.8 -O2:

getY(unsigned char):
    shrb $3, %dil
    movzbl %dil, %eax
    ret
There is no short answer to this question; it all depends on how much time you spend optimizing.
On some architectures, two-dimensional arrays might work better than one-dimensional. On other architectures, bitmapped integers might be the best.
Do not worry about division and multiplication.
You're dividing, taking the modulo, and multiplying by 8.
This number is a power of two, so any computer can use bitwise operations to achieve the result (for unsigned values):

(x * 8) is the same as (x << 3)
(x % 8) is the same as (x & (8 - 1))
(x / 8) is the same as (x >> 3)
Those operations are normally performed in a single clock cycle. On many modern architectures, they can be performed in less than a single clock cycle (including ARM architectures).
Do not worry about using bitwise operators instead of *, % and /. If you're using a compiler that's less than a decade old, it'll optimize it for you and use bitwise operations.
What you should focus on instead, is how easy it will be for you to find out whether or not a move is legal, for instance. This will help your computer-player to "think quickly".
If you're using an 8*8 array, then it's easy for you to see where a castle (rook) can move by checking whether only x or only y changes. If checking the queen, then X must either stay the same or change by the same number of steps as Y.
If you use a one-dimensional array, you also have advantages.
But performance-wise, it might be a really good idea to use a 16x16 array or a 1x256 array.
Fill the entire array with 0x80 values (i.e. "illegal position"), then fill the legal fields with 0x00.
If using a 1x256 array, you can check bits 3 and 7 of the index. If either of those is set, then the position is outside the board.
Testing can be done this way:
if(position & 0x88)
{
/* move is illegal */
}
else
{
/* move is legal */
}
... or ...
if(0 == (position & 0x88))
{
/* move is legal */
}
'position' (the index) should be an unsigned byte (uint8_t in C). This way, you'll never have to worry about pointing outside the buffer.
Some people optimize their chess-engines by using 64-bit bitmapped integers.
While this is good for quickly comparing the positions, it has other disadvantages; for instance checking if the knight's move is legal.
It's not easy to say which is better, though.
Personally, I think the one-dimensional array in general might be the best way to do it.
I recommend getting familiar (very familiar) with AND, OR, XOR, bit-shifting and rotating.
See Bit Twiddling Hacks for more information.

Quickly find whether a value is present in a C array?

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check whether a value matches the array's contents. A bool will be set to true if this is the case.
The microcontroller is an NXP LPC4357, ARM Cortex-M4 core, and the compiler is GCC. I have already combined optimisation level 2 (3 is slower) with placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop that counts down instead of up (checking if i != 0 is faster than checking if i < 256). All in all, I end up with a duration of 12.5 µs, which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:
uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i = 256; i != 0; i--)
{
    if (compareVal == *array_ptr++)
    {
        validFlag = true;
        break;
    }
}
What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.
In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand-tuned assembly language. I tend to take the path of least resistance: for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears, because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: loop unrolling, cache prefetch, and use of the multiple-load (LDM) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take memory delays into account.
Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.
; r0 = count, r1 = source ptr, r2 = comparison value
stmfd sp!,{r4-r11} ; save non-volatile registers
mov r3,r0,LSR #3 ; loop count = total count / 8
pld [r1,#128]
ldmia r1!,{r4-r7} ; pre load first set
loop_top:
pld [r1,#128]
ldmia r1!,{r8-r11} ; pre load second set
cmp r4,r2 ; search for match
cmpne r5,r2 ; use conditional execution to avoid extra branch instructions
cmpne r6,r2
cmpne r7,r2
beq found_it
ldmia r1!,{r4-r7} ; use 2 sets of registers to hide load delays
cmp r8,r2
cmpne r9,r2
cmpne r10,r2
cmpne r11,r2
beq found_it
subs r3,r3,#1 ; decrement loop count
bne loop_top
mov r0,#0 ; return value = false (not found)
ldmia sp!,{r4-r11} ; restore non-volatile registers
bx lr ; return
found_it:
mov r0,#1 ; return true
ldmia sp!,{r4-r11}
bx lr
Update:
There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:
.L9:
    cmp r3, r0
    beq .L8
.L3:
    ldr r2, [r3, #4]!
    cmp r2, r1
    bne .L9
    mov r0, #1
.L2:
    add sp, sp, #1024
    bx lr
.L8:
    mov r0, #0
    b .L2
GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.
Update 2:
I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):
loop_top:
ldr r3,[r1],#4
cmp r3,r2
beq true_exit
subs r0,r0,#1
bne loop_top
false_exit: xxx
bx lr
true_exit: xxx
bx lr
As I said, I don't own the OP's exact hardware, but I will be testing the performance of the 3 different versions on an nVidia Tegra 3 and Tegra 4, and will post the results here soon.
Update 3:
I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.
              My Code   MS Code
Surface RT    297 ns    562 ns
Surface RT 2  172 ns    296 ns
In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.
There's a trick for optimizing it (I was asked this on a job interview once):

- If the last entry in the array holds the value that you're looking for, then return true
- Write the value that you're looking for into the last entry of the array
- Iterate the array until you encounter the value that you're looking for
- If you've encountered it before the last entry in the array, then return true
- Return false
bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)
        return true;
    theArray[SIZE-1] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    theArray[SIZE-1] = x;
    return i != SIZE-1;
}
This yields one branch per iteration instead of two branches per iteration.
UPDATE:
If you're allowed to allocate the array to SIZE+1, then you can get rid of the "last entry swapping" part:
bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}
You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:
bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray + SIZE;
}
If the compiler doesn't already apply it, this function will do so for sure. On the other hand, it might make it harder for the optimizer to unroll the loop, so you will have to verify this in the generated assembly code...
Keep the table in sorted order, and use Bentley's unrolled binary search:
i = 0;
if (key >= a[i+512]) i += 512;
if (key >= a[i+256]) i += 256;
if (key >= a[i+128]) i += 128;
if (key >= a[i+ 64]) i += 64;
if (key >= a[i+ 32]) i += 32;
if (key >= a[i+ 16]) i += 16;
if (key >= a[i+ 8]) i += 8;
if (key >= a[i+ 4]) i += 4;
if (key >= a[i+ 2]) i += 2;
if (key >= a[i+ 1]) i += 1;
return (key == a[i]);
The point is:

- if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it;
- there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it;**
- finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.
** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it.
For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch.
So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches.
That's why it's fast!
On the other hand, what if your first test is if (key == a[i+512])? The probability of it being true is 1/1024, while the probability of false is 1023/1024. So if it's true, you learn all 10 bits!
But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing!
So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit.
That test is not carrying its weight!
You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:
perfect hash function
binary search
Perfect hash function
If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:
Keep the hash function reasonably fast.
Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.
Note for efficient hash functions, n is often a power of 2, which is equivalent to a bitwise mask of low bits (AND operation). Example hash functions:
CRC of input bytes, modulo n.
((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)
Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.
Then in your interrupt routine, with input x:
Hash x to index i (which is in the range 0..n)
Look up entry i in the table and see if it contains the value x.
This will be much faster than a linear search of 256 or 1024 values.
I've written some Python code to find reasonable hash functions.
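
To make the final lookup shape concrete, here is a minimal hedged sketch; the table contents and the hash constants are placeholders you would generate offline for your specific value set, not a real perfect hash:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024  /* power of two, so "% n" is a cheap AND */

/* Generated offline so that table[hash(v)] == v for every valid v, and
   every other slot holds a value that does not hash to its own index. */
static uint32_t table[TABLE_SIZE];

static inline uint32_t hash(uint32_t x) {
    /* illustrative shape only; the real shifts come from the offline search */
    return ((x << 3) ^ (x >> 7) ^ x) & (TABLE_SIZE - 1);
}

bool is_valid(uint32_t x) {
    return table[hash(x)] == x;  /* a single table probe in the ISR */
}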
Binary search
If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.
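
A hedged sketch of that search, in case it helps (the function name is mine):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Iterative binary search over a sorted table:
   8 iterations for 256 entries, 10 for 1024. */
bool contains(const uint32_t *table, size_t n, uint32_t key) {
    size_t lo = 0, hi = n;            /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid] < key)      lo = mid + 1;
        else if (table[mid] > key) hi = mid;
        else                       return true;
    }
    return false;
}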
If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function
that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).
Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.
Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.
One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k > 1 probes might be practical.
Use a hash set. It will give O(1) lookup time.
The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data.
The solution can be expanded for a situation where this is not the case.
#define HASH(x) (((x >> 16) ^ x) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];

int lookup(uint32_t value)
{
    int i = HASH(value);
    while (my_hash[i] != 0 && my_hash[i] != value)
        i = (i + 1) % HASH_LEN;
    return i;
}

void store(uint32_t value)
{
    int i = lookup(value);
    if (my_hash[i] == 0)
        my_hash[i] = value;
}

bool contains(uint32_t value)
{
    return (my_hash[lookup(value)] == value);
}
In this example implementation, the lookup time will typically be very low, but in the worst case it can be up to the number of entries stored. For a realtime application, you can also consider an implementation using binary trees, which will have a more predictable lookup time.
In this case, it might be worthwhile investigating Bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.
Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.
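
For illustration, a minimal hedged Bloom-filter sketch (the filter size and hash mixing below are illustrative choices, not tuned parameters):

#include <stdbool.h>
#include <stdint.h>

static uint32_t bloom[8];  /* a 256-bit filter */

static inline uint32_t h1(uint32_t x) { return (x * 2654435761u) >> 24; } /* top 8 bits */
static inline uint32_t h2(uint32_t x) { return (x ^ (x >> 13)) & 255u; }

void bloom_add(uint32_t x) {
    bloom[h1(x) >> 5] |= 1u << (h1(x) & 31);
    bloom[h2(x) >> 5] |= 1u << (h2(x) & 31);
}

/* false means "definitely absent"; true means "possibly present",
   so a hit still needs the real check. */
bool bloom_maybe_contains(uint32_t x) {
    return (bloom[h1(x) >> 5] & (1u << (h1(x) & 31)))
        && (bloom[h2(x) >> 5] & (1u << (h2(x) & 31)));
}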
Assuming your processor runs at 204 MHz, which seems to be the maximum for the LPC4357, and also assuming your timing result reflects the average case (half of the array traversed), we get:
CPU frequency: 204 MHz
Cycle period: 4.9 ns
Duration in cycles: 12.5 µs / 4.9 ns = 2551 cycles
Cycles per iteration: 2551 / 128 = 19.9
So, your search loop spends around 20 cycles per iteration. That doesn't sound awful, but I guess that in order to make it faster you need to look at the assembly.
I would recommend dropping the index and using a pointer comparison instead, and making all the pointers const.
bool arrayContains(const uint32_t *array, size_t length)
{
    const uint32_t * const end = array + length;
    while (array != end)
    {
        if (*array++ == 0x1234ABCD)
            return true;
    }
    return false;
}
That's at least worth testing.
Other people have suggested reorganizing your table, adding a sentinel value at the end, or sorting it in order to provide a binary search.
You state "I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i != 0 is faster than checking if i < 256)."
My first advice is: get rid of the pointer arithmetic and the downcounting. Stuff like
for (i = 0; i < 256; i++)
{
    if (compareVal == the_array[i])
    {
        [...]
    }
}
tends to be idiomatic to the compiler. The loop is idiomatic, and the indexing of an array over a loop variable is idiomatic. Juggling with pointer arithmetic and pointers will tend to obfuscate the idioms to the compiler and make it generate code related to what you wrote rather than what the compiler writer decided to be the best course for the general task.
For example, the above code might be compiled into a loop running from -256 or -255 to zero, indexing off &the_array[256]. Possibly stuff that is not even expressible in valid C but matches the architecture of the machine you are generating for.
So don't microoptimize. You are just throwing spanners into the works of your optimizer. If you want to be clever, work on the data structures and algorithms but don't microoptimize their expression. It will just come back to bite you, if not on the current compiler/architecture, then on the next.
In particular using pointer arithmetic instead of arrays and indexes is poison for the compiler being fully aware of alignments, storage locations, aliasing considerations and other stuff, and for doing optimizations like strength reduction in the way best suited to the machine architecture.
Vectorization can be used here, as it often is in implementations of memchr. Use the following algorithm:
Create a mask of your query repeated, equal in length to your platform's native word size (64-bit, 32-bit, etc.). On a 64-bit system you would repeat the 32-bit query twice.
Process the list as a list of multiple pieces of data at once, simply by casting the list to a list of a larger data type and pulling values out. For each chunk, XOR it with the mask, then XOR with 0b0111...1, then add 1, then & with a mask of 0b1000...0 repeating. If the result is 0, there is definitely not a match. Otherwise, there may (usually with very high probability) be a match, so search the chunk normally.
Example implementation: https://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/memchr.c?rev=1.3&content-type=text/x-cvsweb-markup&cvsroot=src
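
A hedged SWAR sketch of the idea (the zero-lane test is the classic "haszero" trick from Bit Twiddling Hacks, widened from bytes to 32-bit lanes; the function name is mine):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

bool contains_swar(const uint32_t *arr, size_t n, uint32_t key) {
    const uint64_t mask = ((uint64_t)key << 32) | key;  /* query repeated */
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t chunk;
        memcpy(&chunk, &arr[i], 8);        /* scan two elements per word */
        uint64_t x = chunk ^ mask;         /* matching lanes become zero */
        if ((x - 0x0000000100000001ULL) & ~x & 0x8000000080000000ULL)
            return true;                   /* some 32-bit lane was zero */
    }
    return i < n && arr[i] == key;         /* odd-length tail */
}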
If you can accommodate the domain of your values with the amount of memory that's available to your application, then, the fastest solution would be to represent your array as an array of bits:
bool theArray[MAX_VALUE]; // of which 1024 values are true, the rest false
uint32_t compareVal = 0x1234ABCD;
bool validFlag = theArray[compareVal];
EDIT
I'm astounded by the number of critics. The title of this thread is "How do I quickly find whether a value is present in a C array?" for which I will stand by my answer because it answers precisely that. I could argue that this has the most speed efficient hash function (since address === value). I've read the comments and I'm aware of the obvious caveats. Undoubtedly those caveats limit the range of problems this can be used to solve, but, for those problems that it does solve, it solves very efficiently.
Rather than reject this answer outright, consider it as the optimal starting point from which you can evolve, using hash functions to achieve a better balance between speed and memory.
I'm sorry if my answer duplicates an existing one - I'm a lazy reader. Feel free to downvote then ))
1) You could remove the counter 'i' altogether - just compare pointers, i.e.

for (ptr = &the_array[0]; ptr < the_array + 1024; ptr++)
{
    if (compareVal == *ptr)
    {
        break;
    }
}

... and compare ptr against the_array + 1024 here - you do not need validFlag at all.
All that won't give any significant improvement though; such optimization could probably be achieved by the compiler itself.
2) As already mentioned in other answers, almost all modern CPUs are RISC-based, for example ARM. Even modern Intel x86 CPUs use RISC cores inside, as far as I know (compiling from x86 on the fly). A major optimization for RISC is pipeline optimization (and for Intel and other CPUs as well), minimizing code jumps. One type of such optimization (probably a major one) is "cycle rollback" (loop unrolling). It's incredibly stupid and efficient; even the Intel compiler can do that, AFAIK. It looks like:
if (compareVal == the_array[0]) { validFlag = true; goto end_of_compare; }
if (compareVal == the_array[1]) { validFlag = true; goto end_of_compare; }
...and so on...
end_of_compare:
This way the optimization is that the pipeline is not broken for the worst case (if compareVal is absent from the array), so it is as fast as possible (of course, not counting algorithm optimizations such as hash tables, sorted arrays and so on, mentioned in other answers, which may give better results depending on array size; the cycle-rollback approach can be applied there as well, by the way - I'm writing here about what I think I didn't see in other answers).
The second part of this optimization is that the array item is taken by direct address (calculated at compile time; make sure you use a static array), so no additional ADD op is needed to compute the pointer from the array's base address. This optimization may not have a significant effect, since AFAIK the ARM architecture has special features to speed up array addressing. But anyway, it's always better to know that you did all the best just in C code directly, right?
Cycle rollback may look awkward due to the waste of ROM (yes, you did right placing it in a fast part of RAM, if your board supports this feature), but actually it's a fair price for speed, being based on the RISC concept. This is just a general point of computation optimization: you sacrifice space for the sake of speed, and vice versa, depending on your requirements.
If you think that rollback for an array of 1024 elements is too large a sacrifice for your case, you can consider 'partial rollback', for example dividing the array into 2 parts of 512 items each, or 4x256, and so on.
3) Modern CPUs often support SIMD ops, for example the ARM NEON instruction set - it allows executing the same ops in parallel. Frankly speaking, I do not remember whether it is suitable for comparison ops, but I feel it may be; you should check that. Googling shows that there may be some tricks as well to get max speed, see https://stackoverflow.com/a/5734019/1028256
I hope it can give you some new ideas.
This is more like an addendum than an answer.
I've had a similar case in the past, but my array was constant over a considerable number of searches.
In half of them, the searched value was NOT present in array. Then I realized I could apply a "filter" before doing any search.
This "filter" is just a simple integer number, calculated ONCE and used in each search.
It's in Java, but it's pretty simple:
binaryfilter = 0;
for (int i = 0; i < array.length; i++)
{
    // just apply the binary OR operator over all values
    binaryfilter = binaryfilter | array[i];
}
So, before doing a binary search, I check binaryfilter:
// Check binaryfilter vs value with the binary AND operator
if ((binaryfilter & valuetosearch) != valuetosearch)
{
    // valuetosearch is not in the array!
    return false;
}
else
{
    // valuetosearch MAY BE in the array, so let's check it out
    // ... do binary search stuff ...
}
You can use a 'better' hash algorithm, but this one can be very fast, especially for large numbers.
Maybe this could save you even more cycles.
Make sure the instructions ("the pseudo code") and the data ("theArray") are in separate (RAM) memories, so the CM4 Harvard architecture is utilized to its full potential. From the user manual:
To optimize the CPU performance, the ARM Cortex-M4 has three buses for Instruction (code) (I) access, Data (D) access, and System (S) access. When instructions and data are kept in separate memories, then code and data accesses can be done in parallel in one cycle. When code and data are kept in the same memory, then instructions that load or store data may take two cycles.
Following this guideline I observed ~30% speed increase (FFT calculation in my case).
I'm a great fan of hashing. The problem of course is to find an efficient algorithm that is both fast and uses a minimum amount of memory (especially on an embedded processor).
If you know beforehand the values that may occur you can create a program that runs through a multitude of algorithms to find the best one - or, rather, the best parameters for your data.
I created such a program that you can read about in this post and achieved some very fast results. 16000 entries translates roughly to 2^14 or an average of 14 comparisons to find the value using a binary search. I explicitly aimed for very fast lookups - on average finding the value in <=1.5 lookups - which resulted in greater RAM requirements. I believe that with a more conservative average value (say <=3) a lot of memory could be saved. By comparison the average case for a binary search on your 256 or 1024 entries would result in an average number of comparisons of 8 and 10, respectively.
My average lookup required around 60 cycles (on a laptop with an intel i5) with a generic algorithm (utilizing one division by a variable) and 40-45 cycles with a specialized (probably utilizing a multiplication). This should translate into sub-microsecond lookup times on your MCU, depending of course on the clock frequency it executes at.
It can be tweaked further for real-life use if the entry array keeps track of how many times an entry was accessed. If the entry array is sorted from most to least accessed before the indices are computed, then it'll find the most commonly occurring values with a single comparison.

How to properly add/subtract a 128-bit number (as two uint64_t)?

I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:

bigNum[0] = 0;
bigNum[1] = 1;
subtract(bigNum, 1);

At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.
In many architectures it's very easy to add/subtract any arbitrarily-long integers because there's a carry flag and add/sub-with-flag instruction. For example on x86 rdx:rax += r8:r9 can be done like this
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag, so you must calculate the flag on your own. The easiest way is to check whether the unsigned sum is less than either of the operands, like this. For example, to do a += b we'll do:
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output
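
For completeness, a minimal sketch wrapping the a += b idea above into the question's layout (bigNum[0] is the least significant word; the function name is mine):

#include <stdint.h>

void add128_64(uint64_t bigNum[2], uint64_t b)
{
    bigNum[0] += b;
    bigNum[1] += (bigNum[0] < b);  /* carry out of the low word */
}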
This should work for the subtraction:

typedef uint64_t bigNum[2];  /* bigNum[0] is the least significant word */

void subtract(bigNum a, uint64_t b)
{
    const uint64_t borrow = b > a[0];
    a[0] -= b;
    a[1] -= borrow;
}

Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 1, 0 }, subtracting two would make it equal { ~0UL, ~0UL }, which is the proper bit pattern to represent -1. Here, UL is assumed to promote an integer to 64 bits, which is compiler-dependent of course.
In grade 1 or 2, you learnt how to break down the addition of 1 and 10 into parts, splitting it into separate additions of tens and units. When dealing with big numbers, the same principles can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing your units are now units of 2^bits, your "tens" are 2^bits larger, and so on.
For the case where the value you are subtracting is less than or equal to bignum[0], you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0] anyhow; this operation will wrap around, but that is the behavior you need here. In addition, you'd then have to subtract 1 from bignum[1].
Most compilers targeting 64-bit platforms support a __int128 type intrinsically.
Try it and you might be lucky.
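
A hedged sketch using GCC/Clang's unsigned __int128 (64-bit targets only; the compiler emits the add/adc-style instruction pair for you, and the function name is mine):

#include <stdint.h>

void subtract_u128(uint64_t bigNum[2], uint64_t b)
{
    unsigned __int128 v = ((unsigned __int128)bigNum[1] << 64) | bigNum[0];
    v -= b;
    bigNum[0] = (uint64_t)v;
    bigNum[1] = (uint64_t)(v >> 64);
}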
