Pure high-bit multiplication in assembly?

Pure high-bit multiplication in assembly? - c

To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32 bit integers or 16 bit words, which add like normal integers/words, but which multiply the "wrong way", meaning that when you multiply X times Y, you keep the high order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices, for numerically intensive physics simulations. It makes for more MMX/SSE to have integer quantities, but you need higher level access to PMULHW.

If 16 bit fixed point arithmetic is sufficient and you are on x86 or a similar architecture, you can directly use SSE.
The SSE3 instruction pmulhrsw directly implements signed 0.15 fixed point arithmetic multiplication (mod 2 as you call it, from -1..+1) in hardware. Addition is not different than the standard 16 bit vector operations, just using paddw.
So a library which handles multiplication and addition of eight signed 16 bit fixed point variables at a time could look like this:
typedef __v8hi fixed16_t;
fixed16_t mul(fixed16_t a, fixed16_t b) {
return _mm_mulhrs_epi16(a,b);
}
fixed16_t add(fixed16_t a, fixed16_t b) {
return _mm_add_epi16(a,b);
}
Permission granted to use it in any way you like ;-)

Related

Optimal frequency of modulo operation in finite field arithmetic implementation

I'm trying to implement finite field arithmetic to use it in Elliptic Curve calculations. Since all that's ever used are arithmetic operations that commute with the modulo operator, I don't see a reason not to delaying that operation till the very end. One thing that may happen is that the numbers involved might become (way) too big and impractical/inefficient to work with, but I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
I'm coding in C.

To avoid the complexity of elliptic curve crypto (as I'm unfamiliar with its algorithm); let assume you're doing temp = (a * b) % M; result = (temp * c) % M, and you're thinking about just doing result = (a * b * c) % M instead.
Let's also assume that you're doing this a lot with the same modulo M; so you've precomputed "multiples of M" lookup tables, so that your modulo code can use the table to find the highest multiple of "M shifted left by N" that is not greater than the dividend and subtract it from dividend, and repeat that with decreasing values of N until you're left with the quotient.
If your lookup table has 256 entries, the dividend is 4096 bits and the divisor is 2048 bits; then you'd reduce the size of the dividend by 8 bits per iteration, so dividend would become smaller than the divisor (and you'd find the quotient) after no more than 256 "search and subtract" operations.
For multiplication; it's almost purely "multiply and add digits" for each pair of digits. E.g. using uint64_t as a digit, multiplying 2048 bit numbers is multiplying 32 digit numbers and involves 32 * 32 = 1024 of those "multiply and add digits" operations.
Now we can make comparisons. Specifically, assuming a, b, c, M are 2048-bit numbers:
a) the original temp = (a * b) % M; result = (temp * c) % M would be 1024 "multiply and add", then 256 "search and subtract", then 1024 "multiply and add", then 256 "search and subtract". For totals it'd be 2048 "multiply and add" and 512 "search and subtract".
b) the proposed result = (a * b * c) % M would be 1024 "multiply and add", then would be 2048 "multiply and add" (as the result of a*b will be a "twice as big" 4096-bit number), then 512 "search and subtract" (as a*b*c will be twice as big as a*b). For totals it'd be 3072 "multiply and add" and 512 "search and subtract".
In other words; (assuming lots of assumptions) the proposed result = (a * b * c) % M would be worse, with 50% more "multiply and add" and the exact same "search and subtract".
Of course none of this (the operations you need for elliptic curve crypto, the sizes of your variables, etc) can be assumed to apply for your specific case.
I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
Yes; the way to determine the optimal conditions/frequency is to do similar to what I did above - determine the true costs (in terms of lower level operations, like my "search and subtract" and "multiply and add") and compare them.
In general (regardless of how modulo is implemented, etc) I'd expect you'll find that doing modulo as often as possible is the fastest option (as it reduces the cost of multiplications and also reduces the cost of later/final modulo) for all cases don't involve addition or subtraction, and that don't fit in simple integers.

If M is a constant, then an alternative for modulo is to multiply by the logical inverse of M. Looking at Polk's comment about 256 bits being a common case, then assuming M is polynomial of degree 256 with 1 bit coefficients, then define the inverse of M to be x^512 / M, which results in a 256 bit "inverse". Name this inverse to be I. Then for a multiply modulo M:
C = A * B ; 512 bit product
Q = (upper 256 bits of C * I)>>256 ; Q = C / M = 256 bit quotient
P = M * Q ; 512 bit product
R = lower 256 bits of (C xor P) ; (A * B)% M
So this require 3 extended precision multiplies and one xor.
If the processor for this code has a carryless multiply, such as X86 PCLMULQDQ, which multiplies two 64 bit operands to produce a 128 bit result, then that could be used as the basis for an extended precision multiply. A basic implementation would need 16 multiplies for a 256 bit by 256 bit multiply to produce a 512 bit product. This could be improved using somthing like Karatsuba:
https://en.wikipedia.org/wiki/Karatsuba_algorithm
but on currernt X86, PCLMULQDQ is fast, taking 1 to 3 cycles, so the main issue would be loading the data into the XMM registers, and I'm not sure Karatsuba would save much time.

optimal conditions/frequency which should trigger a modulo operation in the calculations
Standard practice is to replace all actual modulo operations with something else. So the frequency is never. There are different ways to accomplish that:
Choose the modulus to be a Mersenne prime or pseudo-Mersenne prime. There is a large repertoire of mathematical tricks to implement arithmetic modulo a (pseudo-)Mersenne prime efficiently, without doing any actual modulo operations. In the context of elliptic curves, the prime-modulus NIST curves are chosen this way and for this reason.
Use Barrett reduction. This has the same effect as a real modulo operation, but relies on some precomputation and a precondition on the range of the input to be able to reduce the cost of a modulo-like operation to the cost to a couple of multiplications (plus some supporting operations). Also applicable to polynomial fields.
Do arithmetic in Montgomery form.
Additionally, and perhaps more in the spirit of your question, a common technique is to do various additions without reducing every time (addition does not significantly change the size of a number). It takes a lot of additions before you need an extra limb in your integers, so a lot of them can be done before it starts to make sense to reduce. For multiplications, unless it's by a small constant it almost always makes sense to reduce immediately afterwards to prevent the numbers from getting much physically larger than they need to be (which would be especially bad if the result was fed into another multiplication).
Another technique especially associated with Barrett reductions is to work, most of the time, in a slightly larger range than [0 .. N), eg [0 .. 2N). This enables skipping the conditional subtraction that Barrett reduction needs in order to fully reduce to the range [0 .. N), while still using the most important part, the reduction from the range [0 .. N²) to the range [0 .. 2N).

How do you convert the remainder of a division operation to a fixed point in C?

I understand the concept of fixed point pretty well at this point, but I'm having trouble making a logical jump.
I'm working with M68000 CPUs using gcc with no standard libraries of any sort. Using DIVU/DIVS opcodes, I can obtain the quotient and the remainder. Given a Q16.16 fixed point value stored in an unsigned 32bit memory space, I know I can put the quotient in the upper 16 bits. However, how does one convert the integer remainder into the fractional portion of the fixed point value?
I'm sure this is something simple and I'm just missing it. Any help would be greatly appreciated.

The way to think about it is that fixed point numbers are actually integers hold the value of your number times some fixed multiplier. You want to build you fixed point operations out of the integer operations you have available in your hardware.
So for a 16.16 fixed-point format, your multiplier is 65536 (216), so if you want to do a divide c = a/b, the numbers (integers) you have to work with are actually a' = a * 65536 and b' = b * 65536 and you want to find c' = c * 65536. So substituting into the desired c = a/b, you have
c'/65536 = (a'/65536) / (b'/65536) = a'/b'
c' = 65536 * a' / b'
So you actually want to first (integer) mulitply the fixed-point value of a by 65536 (left shift by 16), then do an integer divide by the fixed point value of b, and that will give you the fixed point value of c. The issue is that the first multiply will almost certainly overflow 32 bits, so you need a 64 bit (actually only 48 bit) intermediate. So if you're using a 68020+ with a 64/32 DIVS.L instruction (divides a 64 bit value in a pair of registers by a 32 bit value), you're fine. You don't need the remainder at all.
If you're using a pure 68000 that doesn't have the wide divide, you'll need to do 16-bit long division on the values (where you use 16 bit numbers as "digits", so you're dividing a 3-"digit" number by a 2-"digit" one)

Multiplying 2 32-bit numbers and taking the top 32 bits using AVX2

I am using multiplication (with the addition of other operations) as a substitution for integer division. My solution eventually requires me to multiply 2 32-bit numbers together and take the top 32 bits (just like the mulhi function), but AVX2 does not offer a 32-bit variant of _mm256_mulhi_epu16 (Ex: there's no '_mm256_mulhi_epu32' function).
I have tried various methods such as checking the functions of AVX512, or even manipulating the 32-bit integers to be 2 hi/lo 16-bit integers. I'm very new to working with low-level programming, so I'm unaware what is optimal, or even just possible.

This can be done by doing the following:
__m256i t1 = _mm256_mul_epu32(m, n);
t1 = _mm256_srli_epi64(t1, 32);

fixed point arithmetic in modern systems

I'd like to start out by saying this isn't about optimizations so please refrain from dragging this topic down that path. My purpose for using fixed point arithmetic is because I want to control the precision of my calculations without using floating point.
With that being said let's move on. I wanted to have 17 bits for range and 15 bits for the fractional part. The extra bit is for the signed value. Here are some macros below.
const int scl = 18;
#define Double2Fix(x) ((x) * (double)(1 << scl))
#define Float2Fix(x) ((x) * (float)(1 << scl))
#define Fix2Double(x) ((double)(x) / (1 << scl))
#define Fix2Float(x) ((float)(x) / (1 << scl))
Addition and subtraction are fairly straight forward but things gets a bit tricky with mul and div.
I've seen two different ways to handle these two types of operations.
1) if I am using 32 bits then use a temp 64bit variable to store intermediate multiplication steps then scale at the end.
2) right in the multiplication step scale both variables to a lesser bit range before multiplication. For example if you have a 32 bit register with 16 bits for the whole number you could shift like this:
(((a)>>8)*((b)>>6) >> 2) or some combination that makes sense for you app.
It seems to me that if you design your fixed point math around 32 bits it might be impractical to always depend on having a 64bit variable able to store your intermediate values but on the other hand shifting to a lower scale will seriously reduce your range and precision.
questions
Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
Also i've notice
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?

Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
Multiply N x N => low N bits (native C multiplication)
Multiply N x N => high N bits (the C language has no operator for this)
Multiply N x N => all 2N bits (cast to wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into pieces of (N/2) bits and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that doesn't have it or #2.
Is that precision loss just a part of using fixed point numbers?
log2( 9.1234567890 – 9.123444 ) = –16.25, and you used 16 bits of precision, so yep, that's very typical.

Fixed-point multiplication in a known range

I'm trying to multiply A*B in 16-bit fixed point, while keeping as much accuracy as possible. A is 16-bit in unsigned integer range, B is divided by 1000 and always between 0.001 and 9.999. It's been a while since I dealt with problems like that, so:
I know I can just do A*B/1000 after moving to 32-bit variables, then strip back to 16-bit
I'd like to make it faster than that
I'd like to do all the operations without moving to 32-bit (since I've got 16-bit multiplication only)
Is there any easy way to do that?
Edit: A will be between 0 and 4000, so all possible results are in the 16-bit range too.
Edit: B comes from user, set digit-by-digit in the X.XXX mask, that's why the operation is /1000.

No, you have to go to 32 bit. In general the product of two 16 bit numbers will always give you a 32 bit wide result.
You should check the CPU instruction set of the CPU you're working on because most multiply instructions on 16 bit machines have an option to return the result as a 32 bit integer directly.
This would help you a lot because:
short testfunction (short a, short b)
{
int A32 = a;
int B32 = b;
return A32*B32/1000
}
Would force the compiler to do a 32bit * 32bit multiply. On your machine this could be very slow or even done in multiple steps using 16bit multiplies only.
A little bit of inline assembly or even better a compiler intrinsic could speed things up a lot.
Here is an example for the Texas Instruments C64x+ DSP which has such intrinsics:
short test (short a, short b)
{
int product = _mpy (a,b); // calculates product, returns 32 bit integer
return product / 1000;
}
Another thought: You're dividing by 1000. Was that constant your choice? It would be much faster to use a power of two as the base for your fixed-point numbers. 1024 is close. Why don't you:
return (a*b)/1024
instead? The compiler could optimize this by using a shift right by 10 bits. That ought to be much faster than doing reciprocal multiplication tricks.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight