Multiword addition in C

I have a C program which uses GCC's __uint128_t which is great, but now my needs have grown beyond it.
What are my options for fast arithmetic with 192 or 256 bits?
The only operation I need is addition (and I don't need the carry bit, i.e., I will be working mod 2^192 or 2^256).
Speed is important, so I don't want to move to a general multi-precision library if at all possible. (In fact my code does use multi-precision in some places, but this is in the critical loop and will run tens of billions of times. So far the multi-precision needs to run only tens of thousands of times.)
Maybe this is simple enough to code directly, or maybe I need to find some appropriate library.
What is your advice, Oh great Stack Overflow?
Clarification: GMP is too slow for my needs. Although I actually use multi-precision in my code, it's not in the inner loop and runs less than 10^5 times. The hot loop runs more like 10^12 times. When I changed my code (increasing a size parameter) so that the multi-precision part ran more often relative to the single-precision part, I saw a 100-fold slowdown (mostly due to memory management, I think, rather than extra µops). I'd like to get that down to a 4-fold slowdown or better.

256-bit version
__uint128_t a[2], b[2], c[2]; // c = a + b
c[0] = a[0] + b[0]; // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]); // add high part and carry
Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as #harold noted:
struct uint192_t {
    __uint128_t H;
    uint64_t L;
} a, b, c; // c = a + b

c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);
Alternatively, you can use the integer overflow builtins or checked arithmetic builtins:
bool carry = __builtin_uaddl_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;
Demo on Godbolt
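Since a.L is a uint64_t, the "l" suffix of __builtin_uaddl_overflow assumes unsigned long is 64 bits wide; the type-generic __builtin_add_overflow (GCC 5+ and Clang) sidesteps that assumption:

bool carry = __builtin_add_overflow(a.L, b.L, &c.L); // carry-out of the low part
c.H = a.H + b.H + carry;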
If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array of struct of arrays, as suggested here: practical BigNum AVX/SSE possible?
SSE2: llhhllhhllhhllhh
AVX2: llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?
With AVX2 or AVX-512 you may also have a very fast horizontal add, so it may be worth a try for 256-bit even if you don't have parallel addition chains. For a single 192-bit addition, though, 3 add/adc instructions would be much faster.
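As an illustration of the llll/hhhh layout, here is a minimal AVX2 sketch of my own (the function name and layout are assumptions) that adds four 128-bit numbers at once:

#include <immintrin.h>
#include <stdint.h>

/* Four 128-bit adds at once. lo/hi hold the four low and the four high
   64-bit halves ("llll"/"hhhh"); blo/bhi likewise for the other operand. */
static inline void add4x128(__m256i *lo, __m256i *hi, __m256i blo, __m256i bhi)
{
    const __m256i bias = _mm256_set1_epi64x(INT64_MIN);
    __m256i sum_lo = _mm256_add_epi64(*lo, blo);
    /* Carry-out iff sum_lo < blo (unsigned). AVX2 only has a signed 64-bit
       compare, so XOR both sides with the sign bit to translate. */
    __m256i carry = _mm256_cmpgt_epi64(_mm256_xor_si256(blo, bias),
                                       _mm256_xor_si256(sum_lo, bias));
    /* Carry lanes are all-ones (-1); subtracting them adds 1 to the high part. */
    *hi = _mm256_sub_epi64(_mm256_add_epi64(*hi, bhi), carry);
    *lo = sum_lo;
}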
There are also many libraries with a fixed-width integer type. For example, Boost.Multiprecision:
#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;
uint256_t myUnsignedInt256 = 1;
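If I recall correctly, the unchecked fixed-width unsigned cpp_int types wrap like built-in unsigned types, which matches the mod-2^256 requirement from the question; a quick sketch:

#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;

int main() {
    uint256_t a = uint256_t(1) << 255;
    uint256_t b = a;
    uint256_t c = a + b; // wraps around: c == 0 (arithmetic mod 2^256)
}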
Some other libraries:
ttmath: ttmath::UInt<3> (an int type with 3 limbs, which is 192 bits on 64-bit computers)
uint256_t
See also
C++ 128/256-bit fixed size integer types

You could test if the "add (low < oldlow) to simulate carry" technique from this answer is fast enough. It's slightly complicated by the fact that low is an __uint128_t here, which could hurt code generation. You might try it with 4 uint64_t's as well; I don't know whether that'll be better or worse.
If that's not good enough, drop to inline assembly, and directly use the carry flag - it doesn't get any better than that, but you'd have the usual downsides of using inline assembly.
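For x86-64 with GNU C extended asm, a minimal sketch of that approach might look like the following (representing uint192_t as three 64-bit limbs is my assumption):

#include <stdint.h>

typedef struct { uint64_t w[3]; } u192; /* least significant limb first */

/* c = a + b mod 2^192: one add, then two adc's riding the carry flag. */
static inline u192 add192(u192 a, u192 b)
{
    __asm__("addq %3, %0\n\t"
            "adcq %4, %1\n\t"
            "adcq %5, %2"
            : "+r"(a.w[0]), "+r"(a.w[1]), "+r"(a.w[2])
            : "r"(b.w[0]), "r"(b.w[1]), "r"(b.w[2])
            : "cc");
    return a;
}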

Related

Operating Rightmost/Leftmost n-Bits, Not All the Bits of A Integer Type Data Variable

In a programming task, I have to add a smaller integer in variable B (data type int) to a larger integer (20 decimal digits) in variable A (data type long long int), then compare A with variable C, which is as large an integer (data type long long int) as A.
What I realized is that, since I add a smaller B to A, I don't need to check all the digits of A when I compare it with C; in other words, we don't need to check all the bits of A and C.
Given that I know how many bits from the right I need to check, say n bits, is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster in the C programming language?
Comparing all the bits takes more time, and since I am working with large numbers, the program becomes slower.
Every time I search on Google, bit masking comes up, which uses all the bits of A and C; that doesn't do what I am asking for, so I am probably not using the correct terminology. Please help.
Addition:
Initial comments on this post made me think there was no way, but I found the following:
Bit Manipulation by University of Colorado Boulder
(#cuboulder, after 7:45)
...the bit band region is accessed via a bit band alias; each bit in a
supported bit band region has its own unique address, and we can access
that bit using a pointer to its bit band alias location. The least
significant bit in an alias location can be set or cleared, and that
will be mapped to the bit in the corresponding data or peripheral
memory. Unfortunately this will not help you if you need to write to
multiple bit locations in memory; dependent operations only allow a
single bit to be cleared or set...
Is the above what I am asking for? If yes, then
where can I find the details, as a beginner?
Updated question:
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in the C programming language or any other language, that makes the program faster?
Your assumption that comparing fewer bits is faster might be true in some cases but is probably not true in most cases.
I'm only familiar with x86 CPUs. An x86-64 processor has 64-bit-wide registers. These can be accessed as 64-bit registers, but the lower bits also as 32-, 16- and 8-bit registers. There are processor instructions which work with the 64-, 32-, 16- or 8-bit part of a register. Comparing 8 bits is one instruction, but so is comparing 64 bits.
If the 32-bit comparison were faster than the 64-bit comparison, you could gain some speed. But it seems there is no speed difference on current processor generations. (Check out the "cmp" instruction via the link to uops.info from #harold.)
If your long long data type is actually bigger than the word size of your processor, then it's a different story. E.g. if your long long is 64 bits but you are on a 32-bit processor, then these comparisons cannot be handled by one register and you would need multiple instructions. So if you know that comparing only the lower 32 bits would be enough, this could save some time.
Also note that comparing only e.g. 20 bits would actually take more time than comparing 32 bits: you would have to mask off the 12 highest bits and then compare the 32-bit result, so you would need a bitwise AND instruction in addition to the comparison.
As you can see, this is very processor-specific, and you are down at the processor's opcode level. As #RawkFist wrote in his comment, you could try to get the C compiler to emit such instructions, but that does not automatically mean this is even faster.
All of this is only relevant if these operations are executed a lot. I'm not sure what you are doing. If, e.g., you add many values B to A and compare the result to C each time, it might be faster to start with C, subtract the B values from it, and compare with 0, because the compare operation works internally like a subtraction. So instead of an add and a compare instruction, a single subtraction would be enough within the loop. But modern CPUs and compilers are very smart and optimize a lot, so maybe the compiler already performs such or similar optimizations.
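A sketch of that transformation (hypothetical names; assumes no overflow):

#include <stddef.h>

/* Reports whether A + B[0] + ... + B[i] ever equals C. */
static int sums_hit_target(long long A, long long C, const int *B, size_t n)
{
    long long rem = C - A;          /* subtract once, up front */
    for (size_t i = 0; i < n; i++) {
        rem -= B[i];                /* fold each addition into rem */
        if (rem == 0)
            return 1;               /* A plus the B's so far equals C */
    }
    return 0;
}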
Try this question.
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in the C programming language or any other language, that makes the program faster?
Yes - when A + B != C. We can short-cut the comparison once a difference is found: from least to most significant.
No - when A + B == C. All bits need comparison.
Now back to OP's original question
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in the C programming language or any other language, that makes the program faster?
No. In order to do so, we would need to out-think the compiler. A well-enabled compiler will itself notice any "tricks" available for long long + (signed char)int == long long and emit efficient code.
Yet what about really long compares? How about a custom uint1000000 for A and C?
For long compares of a custom type, a quick compare can be had.
First, select a fast working type. unsigned is a prime candidate.
typedef unsigned ufast;
Now define the wide integer.
#include <limits.h>
#include <stdbool.h>
#define UINT1000000_N (1000000/(sizeof(ufast) * CHAR_BIT))
typedef struct {
    // Least significant first
    ufast digit[UINT1000000_N];
} uint1000000;
Perform the addition and compare one "digit" at a time.
bool uint1000000_fast_offset_compare(const uint1000000 *A, unsigned B,
                                     const uint1000000 *C) {
    ufast carry = B;  // B seeds the carry into the lowest "digit"
    for (unsigned i = 0; i < UINT1000000_N; i++) {
        ufast sum = A->digit[i] + carry;
        if (sum != C->digit[i]) {
            return false;  // first mismatch: stop, no need to look further
        }
        carry = sum < A->digit[i];  // carry out of this digit (0 or 1)
    }
    return true;
}
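For example (hypothetical values), 7 + 42 == 49 in the lowest "digit" and all higher digits match:

uint1000000 a = {{0}}, c = {{0}};
a.digit[0] = 7;
c.digit[0] = 49;
bool equal = uint1000000_fast_offset_compare(&a, 42, &c); /* true */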

Efficient Embedded Fixed Point 2x2 Matrix Multiplication in ARM Cortex-M4 C code

I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for an ARM Cortex-M4. The function accepts 3 pointers to 2x2 arrays: 2 for the inputs to be multiplied and an output buffer passed in by the calling function. Here is what I have so far...
static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
    int32_t a00a01, a10a11, b00b10, b01b11;

    a00a01 = a[0][0] | a[0][1]<<16;  // pack row 0 of a
    b00b10 = b[0][0] | b[1][0]<<16;  // pack column 0 of b
    b01b11 = b[0][1] | b[1][1]<<16;  // pack column 1 of b
    c[0][0] = __SMUAD(a00a01, b00b10);
    c[0][1] = __SMUAD(a00a01, b01b11);
    a10a11 = a[1][0] | a[1][1]<<16;  // pack row 1 of a
    c[1][0] = __SMUAD(a10a11, b00b10);
    c[1][1] = __SMUAD(a10a11, b01b11);
}
Basically, my strategy is to use the ARM Cortex-M4 __SMUAD() intrinsic to do the actual multiply-accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is, given that a C array should be continuous in memory, is there a more efficient way to pass the data into the function directly without the intermediate variables? Secondary question: am I overthinking this, and should I just let the compiler do its job, as it is smarter than I am? I tend to do that a lot.
Thanks!
You could break the strict aliasing rules and load a matrix row directly into a 32-bit register, using an int16_t* to int32_t* typecast. An expression such as a00a01 = a[0][0] | a[0][1]<<16 just takes some consecutive bits from RAM and arranges them into other consecutive bits in a register. Consult your compiler manual for the flag that disables its strict-aliasing assumptions and makes the cast safely usable.
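A minimal sketch of that cast (only well-defined once strict-aliasing assumptions are off, e.g. GCC's -fno-strict-aliasing, and assuming a little-endian layout so a[0][1] lands in the upper half; the Cortex-M4 tolerates the potentially unaligned 32-bit load):

/* Inside multiply_2x2_2x2 above: load row 0 of a in one access
   instead of two loads, a shift and an OR. */
int32_t a00a01 = *(const int32_t *)&a[0][0];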
You could also perhaps avoid transposing matrix columns into registers, by generating b in transposed format in the first place.
The best way to learn about the compiler, and get a sense of the cases for which it's smarter than you, is to disassemble its results and compare the instruction sequence to your intentions.
The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And then bitwise OR of two int16_t where either is negative does not necessarily form a valid int32_t either. Do you actually need the sign or can you drop it?
ARM examples use unsigned int, which in turn supposedly contains 2x int16_t in raw binary form. This is what you actually want too.
Also it would seem that it shouldn't matter for SMUAD which 16 bit word you place where. So the a[0][0] | a[0][1]<<16; just serves to needlessly swap data around in memory. It will confuse the compiler which can't optimize such code well. Sure, shifts etc are always very fast, but this is pointless overhead.
(As someone noted, this whole thing is likely much easier to write in pure assembler without concern of all the C type rules and undefined behavior.)
To avoid all these issues you could define your own union type:
typedef union
{
    int16_t i16 [2][2];
    uint32_t u32 [2];
} mat2x2_t;
u32[0] corresponds to i16[0][0] and i16[0][1]
u32[1] corresponds to i16[1][0] and i16[1][1]
C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.
The function can then become something along the lines of this pseudo code:
static uint32_t mat_mul16 (mat2x2_t a, mat2x2_t b)
{
    uint32_t c0 = __SMUAD(a.u32[0], b.u32[0]);
    ...
}
Supposedly each such line should give 2x signed 16 multiplications as per the SMUAD instruction.
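A fuller sketch along those lines (hypothetical: assumes a CMSIS-style header providing __SMUAD, and that b is stored pre-transposed as the other answer suggests, so bT.u32[0] packs column 0 of b, i.e. {b00, b10}):

static void mat_mul16(mat2x2_t a, mat2x2_t bT, int32_t c[2][2])
{
    c[0][0] = (int32_t)__SMUAD(a.u32[0], bT.u32[0]); /* a00*b00 + a01*b10 */
    c[0][1] = (int32_t)__SMUAD(a.u32[0], bT.u32[1]); /* a00*b01 + a01*b11 */
    c[1][0] = (int32_t)__SMUAD(a.u32[1], bT.u32[0]); /* a10*b00 + a11*b10 */
    c[1][1] = (int32_t)__SMUAD(a.u32[1], bT.u32[1]); /* a10*b01 + a11*b11 */
}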
As for whether this actually gives some revolutionary performance increase compared to a default MUL, I kind of doubt it. Disassemble and count CPU ticks.
am I overthinking this and I should just let the compiler do its job as it is smarter than I am?
Most likely :) The old rule of thumb: benchmark and then only manually optimize at the point when you've actually found a performance bottleneck.

Which of these C multiplication algorithms is easier on the CPU and has lower overhead?

I want to know which of these functions is easier for the CPU to calculate/run. I was told that direct multiplication (e.g. 4x3) is more difficult for the CPU to calculate than a series of additions (e.g. 4+4+4). Well, the first one uses direct multiplication, while the second one uses a for loop.
Algorithm 1
The first one is like x*y:
int multi_1(int x, int y)
{
    return x * y;
}
Algorithm 2
The second one is like x+x+x+...+x (as much as y):
int multi_2(int num1, int num2)
{
    int sum = 0;
    for(int i = 0; i < num2; i++)
    {
        sum += num1;
    }
    return sum;
}
Please don't respond with "don't try to do micro-optimization" or something similar. How can I evaluate which of these codes runs better/faster? Does the C language automatically convert direct multiplication to summation?
You can generally expect the multiplication operator * to be implemented as efficiently as possible. Beating it with a custom multiplication algorithm is highly unlikely. If for any reason multi_2 is faster than multi_1 for all but some edge cases, consider writing a bug report against your compiler vendor.
On modern (i.e. made in this century) machines, multiplications by arbitrary integers are extremely fast, taking four cycles at most, which is faster than initializing the loop in multi_2.
The more "high level" your code is, the more optimization paths your compiler will be able to use. So, I'd say that code #1 will have the most chances to produce a fast and optimized code.
In fact, for a simple CPU architecture that doesn't support direct multiplication operations, but does support addition and shifts, the second algorithm won't be used at all. The usual procedure is something similar to the following code:
unsigned int mult_3 (unsigned int x, unsigned int y)
{
    unsigned int res = 0;
    while (x)
    {
        res += (x&1)? y : 0;  // add y wherever a bit of x is set
        x>>=1;                // move on to the next bit of x
        y<<=1;                // y doubles each round: y * 2^i
    }
    return res;
}
Typical modern CPUs can do multiplication in hardware, often at the same speed as addition. So clearly #1 is better.
Even if multiplication is not available and you are stuck with addition there are algorithms much faster than #2.
You were misinformed. Multiplication is not "more difficult" than repeated addition. Multipliers are built into the ALU (Arithmetic & Logic Unit) of modern CPUs, and they work in constant time. By contrast, repeated additions take time proportional to the value of one of the operands, which could be as large as one billion!
Actually, multiplications are rarely performed by straight additions; when you have to implement them in software, you do it by repeated shifts and adds, using a method similar to duplation, known to the Ancient Egyptians.
This depends on the architecture you run it on, as well as the compiler and the values for x and y.
If x and y are small, the second version might be faster. However, when x and y are very large numbers, the second version will certainly be much slower.
The only way to really find out is to measure the running time of your code, for example like this: https://stackoverflow.com/a/9085330/369009
Since you're dealing with int values, the multiplication operator (*) will be far more efficient. C will compile into the CPU-specific assembly language, which will have a multiplication instruction (e.g., x86's mul/imul). Virtually all modern CPUs can multiply integers within a few clock cycles. It doesn't get much faster than that. Years ago (and on some relatively uncommon embedded CPUs) it used to be the case that multiplication took more clock cycles than addition, but even then, the additional jump instructions to loop would result in more cycles being consumed, even if only looping once or twice.
The C language does not require that multiplications by integers be converted into series of additions. It permits implementations to do that, I suppose, but I would be surprised to find an implementation that did so, at least in the general context you present.
Note also that in your case #2 you have replaced one multiplication operation with not just num2 addition operations, but with at least 2 * num2 additions, num2 comparisons, and 2 * num2 storage operations. The storage operations probably end up being approximately free, as the values likely end up living in CPU registers, but they don't have to be.
Overall, I would expect alternative #1 to be much faster, but it is always best to answer performance questions by testing. You will see the largest difference for large values of num2. For instance, try with num1 == 1 and num2 == INT_MAX.

Create own type of variable

Is it possible to create a custom type of variable in C/C++? I want something like "super long int" that occupies, let's say, 40 bytes and allows the same operations as a usual int (+, -, /, %, <, >, etc.).
There's nothing built-in for something like that, at least not in C. You'll need to use a big-number library like GMP. It doesn't allow for using the normal set of operators, but it can handle numbers of an arbitrarily large size.
EDIT:
If you're targeting C++, GMP does have overloaded operators that will allow you to use the standard set of operators like you would with a regular int. See the manual for more details.
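A minimal sketch of that C++ interface (mpz_class from <gmpxx.h>; link with -lgmpxx -lgmp):

#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class a("123456789012345678901234567890"); // far beyond 64 bits
    mpz_class b = a * a + 42;                      // ordinary operators work
    std::cout << b << '\n';
}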
Some CPUs have support for working with very large numbers. With SSE on the x86-64 architecture you can implement 128-bit (16-byte) values that you can calculate with normally.
With AVX this extends to 256 bits (32 bytes). The upcoming AVX-512 extension is supposed to bring 512 bits (64 bytes), thus enabling "super large" integers.
But there are two caveats to these extensions:
The compiler has to support it (GCC for example uses immintrin.h for AVX support and xmmintrin.h for SSE support). Alternatively you can try to implement the abstractions via inline assembler, but then the assembler has to understand these instructions (GCC uses AS as far as I know).
The machine you are running the compiled code on has to support these instructions. If the CPU does not support AVX or SSE (depending on what you want to do), the application will crash on these instructions, as the CPU does not understand them.
AVX/SSE is used in implementations of memset, memcpy, etc., since they also let you reduce the number of memory accesses by a good deal (keep in mind that, while your cache line is going to be loaded into cache once, loading from it still takes up some cycles, and AVX/SSE help you eliminate a good chunk of those costs as well).
Here is a working example (it compiles with GCC 4.9.3; you have to add -mavx to your compiler options):
#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;
    /*********************************************************************
    **Hack-ish way to ensure that malloc's alignment does not screw with
    **us. On this box it aligns to 0x10 bytes, but AVX needs 0x20.
    *********************************************************************/
#define AVX_BASE (0x20ULL)
    uint64_t*real_raw = malloc(128);
    uint64_t*raw = (uint64_t*)((uintptr_t)real_raw + (AVX_BASE - ((uintptr_t)real_raw % AVX_BASE)));

    __m256i value = _mm256_setzero_si256();
    for(i = 0;i < 10;i++)
    {
        /*No special function here to do the math (a GCC vector extension).*/
        value += i * i;

        /*************************************************************
        **Extract the value from the register and print the lowest
        **64-bit lane.
        *************************************************************/
        _mm256_store_si256((__m256i*)raw,value);
        printf("%" PRIu64 "\n",raw[0]);
    }

    _mm256_store_si256((__m256i*)raw,value);
    printf("End: %" PRIu64 "\n",raw[0]);

    free(real_raw);
    return 0;
}

Optimize C code

I have the following code
void Fun2()
{
    if(X<=A)
        X=ceil(M*1.0/A*X);
    else
        X=M*1.0/(M-A)*(M-X);
}
I want to program it in a fast manner using C99, taking into account the following comments:
X and A are 32-bit variables that I declare as uint64_t, while M is a static const uint64_t.
This function is called by another function, and the value of A is changed to a new value every n calls.
The optimization needed is in execution time; the CPU is a Core i3 and the OS is Windows 7.
The math model I want to implement is
F = ceil(M/A * X) if X <= A
F = floor(M/(M-A) * (M-X)) if X > A
For clarity and to avoid confusion, my previous post was:
I have the following code
void Fun2()
{
    if(X0<=A)
        X0=ceil(Max1*X0);
    else
        X0=Max2*(Max-X0);
}
I want to program it in a fast manner using C99, taking into account the following comments:
X0, A, Max1, and Max2 are 32-bit variables that I declare as uint64_t, while Max is a static const uint64_t.
This function is called by another function, and the values of Max1, A, and Max2 are changed to random values every n times it is called.
I work on Windows 7 with the Code::Blocks IDE.
Thanks
It is completely pointless and impossible to optimize code like this without a specific target in mind. In order to do so, you need the following knowledge:
Which CPU is used.
Which OS is used (if any).
In-depth knowledge of the above, to the point where you know more about the system, or about as much, as the people who wrote the optimizer for the given compiler port.
What kind of optimization that is most important: execution speed, RAM usage or program size.
The only kind of optimization you can do without knowing the above is on the algorithm level. There are no such algorithms in the code posted.
Thus your question cannot be answered by anyone until more information is provided.
If "fast manner" means fast execution, your first change is to declare this function as an inline one, a feature of C99.
inline void Fun2()
{
    ...
    ...
}
I recall that GNU CC has some interesting macros that may help optimize this code as well. I don't think they are C99-compliant, but they are always interesting to note. I mean: your function has an if statement. If you can know in advance what probability each branch has of being taken, you can do things like:
if (likely(X0<=A)).....
If it's probable that X0 is less than or equal to A. Or:
if (unlikely(X0<=A)).....
If it's not probable that X0 is less than or equal to A.
With that information, the compiler will optimize the comparison and jump so that the most probable branch is executed with no jumps, so it will execute faster on architectures with no branch prediction.
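These macros are not standard C; with GCC they are conventionally defined on top of __builtin_expect:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)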
Another thing that may improve speed is to use the ?: ternary operator, as both branches assign a value to the same variable, something like this:
inline void Func2()
{
    X0 = (X0<=A)? Max1*X0 : Max2*(Max-X0);
}
BTW: why use ceil()? ceil() is for doubles; it rounds a number up to the nearest integer that is not less than it. If X0 and Max1 are integer numbers, there won't be any decimals in the result, so ceil() won't have any effect.
I think one thing that can be improved is not to use floating point. Your code mostly deals with integers, so you want to stick to integer arithmetic.
The only floating-point number is Max1. If it's always whole, it can be an integer. If not, you may be able to replace it with two integers: Max1*X0 -> X0 * Max1_nom / Max1_denom. If you calculate the numerator/denominator once and use them many times, this can speed things up.
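A sketch of that idea (assuming Max1 == Max1_nom / Max1_denom exactly, positive values, and that the intermediate product fits in the integer type):

/* ceil(Max1 * X0) in pure integer arithmetic */
X0 = (Max1_nom * X0 + Max1_denom - 1) / Max1_denom;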
I'd transform the math model to
Ceil(M*(X - 0) / (A - 0)) when X <= A
Floor(M*(X - M) / (A - M)) when X > A
with
Ceil(A / B) = Floor((A + (B-1)) / B)
which, substituted into the first, gives:
(M * (X - m0) + c) / (A - m0)
where
c = A-1, m0 = 0, when X <= A
c = 0, m0 = M, when X > A
Everything will be performed in integer arithmetic, but it'll be quite tough to calculate the reciprocals in advance;
It may still be possible to use some form of DDA to avoid calculating the division between iterations.
Using the temporary constants c, m0 is simply for unifying the pipeline for both branches as the next step is in pursuit of parallelism.
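A sketch of the unified form (my own illustration; assumes the values are small enough that the intermediate product M*(X - m0) fits in int64_t, otherwise a wider type such as __int128 is needed):

#include <stdint.h>

/* One expression covers both branches; the comparison only selects
   the constants c and m0 from the table above. */
static uint64_t fun2_unified(uint64_t X, uint64_t A, uint64_t M)
{
    int64_t m0 = (X <= A) ? 0 : (int64_t)M;
    int64_t c  = (X <= A) ? (int64_t)A - 1 : 0;
    /* X <= A: (M*X + A-1)/A   == ceil(M*X/A)
       X >  A: (M*(X-M))/(A-M) == floor(M*(M-X)/(M-A)), both factors negated */
    return (uint64_t)(((int64_t)M * ((int64_t)X - m0) + c) / ((int64_t)A - m0));
}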
