I want to do addition using scalar. Here is what I've tried:
ex) uint32x4_t result, result2, op, one;
// op + 1
result = vaddq_u32(op, 1); //error, 1 is not vector
one = vdupq_n_u32(1);
result2 = vaddq_u32(op, one); // ok
What is the best way to save memory space when doing this?
There are no NEON instructions for vector-scalar ALU-type operations, only vector-by-scalar multiplications of elements 16 bits or wider.
Neither are there instructions for add/sub with immediate values.
What you already did is the way it is supposed to be done.
One thing you could try to boost performance is to declare the vector of 1s as a constant outside of the loop, hoping the compiler is smart enough not to reload the same value on every iteration.
Unfortunately, the available ARM compilers aren't that reliable when it comes to NEON. Checking the disassembly is pretty much a necessity, which defeats the point of writing intrinsics in the first place.
I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for operation in an ARM Cortex-M4. The function accepts 3 pointers to 2x2 arrays, 2 for the inputs to be multiplied and an output buffer passed by the using function. Here is what I have so far...
static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
int32_t a00a01, a10a11, b00b10, b01b11;
a00a01 = a[0][0] | a[0][1]<<16;
b00b10 = b[0][0] | b[1][0]<<16;
b01b11 = b[0][1] | b[1][1]<<16;
c[0][0] = __SMUAD(a00a01, b00b10);
c[0][1] = __SMUAD(a00a01, b01b11);
a10a11 = a[1][0] | a[1][1]<<16;
c[1][0] = __SMUAD(a10a11, b00b10);
c[1][1] = __SMUAD(a10a11, b01b11);
}
Basically, my strategy is to use the ARM Cortex-M4 __SMUAD() function to do the actual multiply accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is, given that the C array should be contiguous in memory, is there a more efficient way to pass the data into the function directly without the intermediate variables? Secondary question, am I overthinking this and should I just let the compiler do its job, as it is smarter than I am? I tend to do that a lot.
Thanks!
You could break the strict aliasing rules and load the matrix row directly into the 32-bit register, using an int16_t* to int32_t* typecast. An expression such as a00a01 = a[0][0] | a[0][1]<<16 just takes some consecutive bits from RAM and arranges them into other consecutive bits in registers. Consult your compiler manual for the flag that disables its strict-aliasing assumptions and makes the cast safe to use.
You could also perhaps avoid transposing matrix columns into registers, by generating b in transposed format in the first place.
The best way to learn about the compiler, and get a sense of the cases for which it's smarter than you, is to disassemble its results and compare the instruction sequence to your intentions.
The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And then bitwise OR of two int16_t where either is negative does not necessarily form a valid int32_t either. Do you actually need the sign or can you drop it?
ARM examples use unsigned int, which in turn supposedly contains 2x int16_t in raw binary form. This is what you actually want too.
Also, it would seem that it shouldn't matter for SMUAD which 16-bit word you place where. So a[0][0] | a[0][1]<<16 just serves to needlessly shuffle data around in memory. It will confuse the compiler, which can't optimize such code well. Sure, shifts etc. are always very fast, but this is pointless overhead.
(As someone noted, this whole thing is likely much easier to write in pure assembler without concern of all the C type rules and undefined behavior.)
To avoid all these issues you could define your own union type:
typedef union
{
int16_t i16 [2][2];
uint32_t u32 [2];
} mat2x2_t;
u32[0] corresponds to i16[0][0] and i16[0][1]
u32[1] corresponds to i16[1][0] and i16[1][1]
C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.
The function can then become something along the lines of this pseudo code:
static uint32_t mat_mul16 (mat2x2_t a, mat2x2_t b)
{
uint32_t c0 = __SMUAD(a.u32[0], b.u32[0]);
...
}
Each such line should give two signed 16-bit multiplications plus an addition, as per the SMUAD instruction.
As for if this actually gives some revolutionary performance increase compared to some default MUL, I kind of doubt it. Disassemble and count CPU ticks.
am I overthinking this and I should just let the compiler do its job as it is smarter than I am?
Most likely :) The old rule of thumb: benchmark and then only manually optimize at the point when you've actually found a performance bottleneck.
I want to know which of these functions is easier for CPU to calculate/run. I was told that direct multiplication (e.g. 4x3) is more difficult for CPU to calculate than a series of summation (e.g. 4+4+4). Well the first one has direct multiplication, but the second one has a for loop.
Algorithm 1
The first one is like x*y:
int multi_1(int x, int y)
{
return x * y;
}
Algorithm 2
The second one is like x+x+x+...+x (as much as y):
int multi_2(int num1, int num2)
{
int sum=0;
for(int i=0; i<num2; i++)
{
sum += num1;
}
return sum;
}
Please don't respond with "Don't try to do micro-optimization" or something similar. How can I evaluate which of these codes run better/faster? Does C language automatically convert direct multiplication to summation?
You can generally expect the multiplication operator * to be implemented as efficiently as possible. Beating it with a custom multiplication algorithm is highly unlikely. If for any reason multi_2 is faster than multi_1 for all but some edge cases, consider writing a bug report against your compiler vendor.
On modern (i.e. made in this century) machines, multiplications by arbitrary integers are extremely fast, taking a few cycles at most, which is faster than even initializing the loop in multi_2.
The more "high level" your code is, the more optimization paths your compiler will be able to use. So, I'd say that code #1 will have the most chances to produce a fast and optimized code.
In fact, for a simple CPU architecture that doesn't support direct multiplication operations, but does support addition and shifts, the second algorithm won't be used at all. The usual procedure is something similar to the following code:
unsigned int mult_3 (unsigned int x, unsigned int y)
{
unsigned int res = 0;
while (x)
{
res += (x&1)? y : 0;
x>>=1;
y<<=1;
}
return res;
}
Typical modern CPUs can do multiplication in hardware, often at the same speed as addition. So clearly #1 is better.
Even if multiplication is not available and you are stuck with addition there are algorithms much faster than #2.
You were misinformed. Multiplication is not "more difficult" than repeated addition. Multipliers are built into the ALU (Arithmetic & Logic Unit) of modern CPUs, and they work in constant time. By contrast, repeated addition takes time proportional to the value of one of the operands, which could be as large as one billion!
Actually, multiplication is rarely performed by straight addition; when you have to implement it in software, you do it by repeated shifts and adds, using a method similar to duplation, known to the ancient Egyptians.
This depends on the architecture you run it on, as well as the compiler and the values for x and y.
If y is small, the second version might be faster. However, when y is a very large number, the second version will certainly be much slower, since its running time grows linearly with y.
The only way to really find out is to measure the running time of your code, for example like this: https://stackoverflow.com/a/9085330/369009
Since you're dealing with int values, the multiplication operator (*) will be far more efficient. C will compile into the CPU-specific assembly language, which will have a multiplication instruction (e.g., x86's mul/imul). Virtually all modern CPUs can multiply integers within a few clock cycles. It doesn't get much faster than that. Years ago (and on some relatively uncommon embedded CPUs) it used to be the case that multiplication took more clock cycles than addition, but even then, the additional jump instructions to loop would result in more cycles being consumed, even if only looping once or twice.
The C language does not require that multiplications by integers be converted into series of additions. It permits implementations to do that, I suppose, but I would be surprised to find an implementation that did so, at least in the general context you present.
Note also that in your case #2 you have replaced one multiplication operation with not just num2 addition operations, but with at least 2 * num2 additions, num2 comparisons, and 2 * num2 storage operations. The storage operations probably end up being approximately free, as the values likely end up living in CPU registers, but they don't have to be.
Overall, I would expect alternative #1 to be much faster, but it is always best to answer performance questions by testing. You will see the largest difference for large values of num2. For instance, try with num1 == 1 and num2 == INT_MAX.
I have a C program which uses GCC's __uint128_t which is great, but now my needs have grown beyond it.
What are my options for fast arithmetic with 196 or 256 bits?
The only operation I need is addition (and I don't need the carry bit, i.e., I will be working mod 2^192 or 2^256).
Speed is important, so I don't want to move to a general multi-precision if at all possible. (In fact my code does use multi-precision in some places, but this is in the critical loop and will run tens of billions of times. So far the multi-precision needs to run only tens of thousands of times.)
Maybe this is simple enough to code directly, or maybe I need to find some appropriate library.
What is your advice, Oh great Stack Overflow?
Clarification: GMP is too slow for my needs. Although I actually use multi-precision in my code it's not in the inner loop and runs less than 10^5 times. The hot loop runs more like 10^12 times. When I changed my code (increasing a size parameter) so that the multi-precision part ran more often vs. the single-precision, I had a 100-fold slowdown (mostly due to memory management issues, I think, rather than extra µops). I'd like to get that down to a 4-fold slowdown or better.
256-bit version
__uint128_t a[2], b[2], c[2]; // c = a + b
c[0] = a[0] + b[0]; // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]); // add high part and carry
Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as @harold suggested:
struct uint192_t {
__uint128_t H;
uint64_t L;
} a, b, c; // c = a + b
c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);
Alternatively you can use the integer overflow builtins or checked arithmetic builtins (the generic __builtin_add_overflow avoids the unsigned long vs. uint64_t mismatch that __builtin_uaddl_overflow has on LLP64 platforms such as Windows):
bool carry = __builtin_add_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;
Demo on Godbolt
If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array of struct of arrays, as suggested here: practical BigNum AVX/SSE possible?
SSE2: llhhllhhllhhllhh
AVX2: llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?
With AVX2 or AVX-512 you may also have very fast horizontal adds, so it may also be worth a try for 256-bit even if you don't have parallel addition chains. But for 192-bit addition, 3 add/adc instructions would be much faster.
There are also many libraries with a fixed-width integer type. For example Boost.Multiprecision
#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;
uint256_t myUnsignedInt256 = 1;
Some other libraries:
ttmath: ttmath::UInt<3> (an integer type with 3 limbs, which is 192 bits on 64-bit computers)
uint256_t
See also
C++ 128/256-bit fixed size integer types
You could test whether the "add (low < oldlow) to simulate carry" technique from this answer is fast enough. It's slightly complicated by the fact that low is an __uint128_t here, which could hurt code generation. You might try it with 4 uint64_t's as well; I don't know whether that'll be better or worse.
If that's not good enough, drop to inline assembly, and directly use the carry flag - it doesn't get any better than that, but you'd have the usual downsides of using inline assembly.
I have the following code
void Fun2()
{
if(X<=A)
X=ceil(M*1.0/A*X);
else
X=M*1.0/(M-A)*(M-X);
}
I want to program it in fast manner using C99, take into account the following comments.
X and A are 32-bit variables, and I declare them as uint64_t, while M is a static const uint64_t.
This function is called by another function and the value of A are changed to a new value every n times of calling.
The optimization is needed in the execution time, CPU is Core i3, OS is windows 7
The math model I want to implement it is
F = ceil(M/A*X) if X <= A
F = floor(M/(M-A)*(M-X)) if X > A
For clarity and to avoid confusion, my previous post was:
I have the following code
void Fun2()
{
if(X0<=A)
X0=ceil(Max1*X0);
else
X0=Max2*(Max-X0);
}
I want to program it in fast manner using C99, take into account the following comments.
X0, A, Max1, and Max2 are 32-bit variables, and I declare them as uint64_t, while Max is a static const uint64_t.
This function is called by another function and the values of Max1, A, Max2 are changed to random values every n times of calling.
I work in Windows 7 and in codeblocks software
Thanks
It is completely pointless and impossible to optimize code like this without a specific target in mind. In order to do so, you need the following knowledge:
Which CPU is used.
Which OS is used (if any).
In-depth knowledge of the above, to the point where you know more, or about as much of the system as the people who wrote the optimizer for the given compiler port.
What kind of optimization that is most important: execution speed, RAM usage or program size.
The only kind of optimization you can do without knowing the above is on the algorithm level. There are no such algorithms in the code posted.
Thus your question cannot be answered by anyone until more information is provided.
If "fast manner" means fast execution, your first change is to declare this function as an inline one, a feature of C99.
inline void Fun2()
{
...
...
}
I recall that GNU CC has some interesting macros that may help optimizing this code as well. I don't think this is C99 compliant but it is always interesting to note. I mean: your function has an if statement. If you can know by advance what probability has each branch of being taken, you can do things like:
if (likely(X0<=A)).....
If it's probable that X0 is less or equal than A. Or:
if (unlikely(X0<=A)).....
If it's not probable that X0 is less or equal than A.
With that information, the compiler will optimize the comparison and jump so the most probable branch will be executed with no jumps, so it will be executed faster in architectures with no branch prediction.
Another thing that may improve speed is to use the ?: ternary operator, as both branches assign a value to the same variable, something like this:
inline void Func2()
{
X0 = (X0<=A)? Max1*X0 : Max2*(Max-X0);
}
BTW: why use ceil()? ceil() operates on doubles and rounds a number up to the nearest integer not less than it. If X0 and Max1 are integer numbers, there won't be a fractional part in the product, so ceil() won't have any effect.
I think one thing that can be improved is not to use floating point. Your code mostly deals with integers, so you want to stick to integer arithmetic.
The only floating point number is Max1. If it's always whole, it can be an integer. If not, you may be able to replace it with two integers: Max1*X0 -> X0 * Max1_nom / Max1_denom. If you calculate the nominator/denominator once, and use many times, this can speed things up.
I'd transform the math model to
Ceil (M*(X-0) / (A-0)) when X <= A
Floor (M*(X-M) / (A-M)) when X > A
with
Ceil (A / B) = Floor((A + (B-1)) / B)
Which substituted to the first gives:
((M * (X - m0) + c ) / ( A - m0))
where
c = A-1; m0 = 0, when X <= A
c = 0; m0 = M, when X > A
Everything will be performed in integer arithmetic, but it'll be quite tough to calculate the reciprocals in advance;
It may still be possible to use some form of DDA to avoid calculating the division between iterations.
Using the temporary constants c, m0 is simply for unifying the pipeline for both branches as the next step is in pursuit of parallelism.
Is there any single instruction or function that can invert the sign of every float inside a __m128?
i.e. a = r0:r1:r2:r3 ===> a = -r0:-r1:-r2:-r3?
I know this can be done by _mm_sub_ps(_mm_set1_ps(0.0),a), but isn't it potentially slow since _mm_set1_ps(0.0) is a multi-instruction function?
In practice your compiler should do a good job of generating the constant vector for 0.0. It will probably just use _mm_xor_ps, and if your code is in a loop it should hoist the constant generation out of the loop anyway. So, bottom line, use your original idea of:
v = _mm_sub_ps(_mm_set1_ps(0.0), v);
or another common trick, which is:
v = _mm_xor_ps(v, _mm_set1_ps(-0.0));
which just flips the sign bits instead of doing a subtraction. The two are not quite equivalent on special values: XOR negates every element bit-exactly, including zeros and NaNs, whereas subtracting from zero maps a +0.0 input to +0.0 instead of -0.0. In most code either is fine, and the XOR form may be more efficient in some cases.