I have AVX (no AVX2 or AVX-512). I have a vector with 32bit values (only 4 lowest bits are used, rest is always zero):
[ 1010, 0000, 0000, 0000, 0000, 1010, 1010, 0000]
Internally, I keep the vector as __m256 because of bitwise operations, and the bits represent "float numbers". I need to export a single 8-bit number from the vector, which will contain 1 for each non-zero element and 0 for each zero element.
So for above example, I need 8-bit number: 10000110
I have an idea to use _mm256_cmp_ps and then _mm256_movemask_ps. However, I don't know whether cmp will work correctly if the numbers are not exactly floats and can be any "junk". In this case, which operand should I use for cmp?
Or is there any other solution?
Conceptually, what you're doing should work. Floats with the upper 24 bits zero are valid floats. However, they are denormal.
While it should work, there are two potential problems:
If the FP mode is set to flush denormals to zero, then they will all be treated as zero (thus breaking that approach).
Because these are denormals, you may end up taking massive performance penalties depending on whether the hardware can handle them natively.
Alternative Approach:
Since the upper 24 bits are zero, you can normalize them. Then do the floating-point comparison.
(Warning: untested code)
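#include <immintrin.h>

int to_mask(__m256 data){
    const __m256 MASK = _mm256_set1_ps(8388608.0f);  // 2^23
    data = _mm256_or_ps(data, MASK);                  // each lane becomes 2^23 + x
    data = _mm256_cmp_ps(data, MASK, _CMP_NEQ_UQ);    // all-ones where x != 0
    return _mm256_movemask_ps(data);                  // one result bit per lane
}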
Here, data is your input where the upper 24 bits of each "float" are zero. Let's call each of these 8-bit integers x.
OR'ing with 2^23 sets the exponent field of the float such that it becomes a normalized float with value 2^23 + x.
Then you compare against 2^23 as float - which will give a 1 only if the x is non-zero.
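To see why the OR trick works, here is a scalar sketch of a single lane in plain C. It is only an illustration: the helper names lane_as_normalized_float and lane_is_nonzero are made up for this example, and it assumes IEEE-754 binary32 floats and that x fits in the low 23 bits.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one lane: OR the integer bit pattern x with the
 * encoding of 2^23 (0x4B000000) and reinterpret the bits as a float.
 * Assumes IEEE-754 binary32 and x < 2^23. */
float lane_as_normalized_float(uint32_t x) {
    uint32_t bits = x | 0x4B000000u;
    float f;
    memcpy(&f, &bits, sizeof f);   /* well-defined way to reinterpret bits */
    return f;                      /* equals 2^23 + x */
}

/* Mirrors the compare step: 1 if the lane held a non-zero pattern. */
int lane_is_nonzero(uint32_t x) {
    return lane_as_normalized_float(x) != 8388608.0f;
}
```

For the pattern 1010 (decimal 10), the reinterpreted float is 8388618.0, which differs from 2^23, so the lane is reported as non-zero without ever touching a denormal.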
Alternate answer, for future readers that do have AVX2
You can cast to __m256i and use SIMD integer compares.
This avoids any problems with DAZ treating these small-integer bit patterns as exactly zero, or microcode assists for denormal (aka subnormal) inputs.
There might be 1 extra cycle of bypass latency between vpcmpeqd and vmovmskps on some CPUs, but you still come out ahead because integer compare is lower latency than FP compare.
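int nonzero_positions_avx2(__m256 v)
{
    __m256i vi = _mm256_castps_si256(v);
    vi = _mm256_cmpeq_epi32(vi, _mm256_setzero_si256());          // all-ones where the element == 0
    int zero_mask = _mm256_movemask_ps(_mm256_castsi256_ps(vi));
    return ~zero_mask & 0xFF;                                      // invert: 1 where the element != 0
}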
#include <stdio.h>
int main() {
double b = 3.14;
double c = -1e20;
c = -1e20 + b;
return 0;
}
As far as I know, type "double" has 52 bits of fraction. To align 3.14's exponent with that of -1e20, 3.14's fraction part extends past 60 bits, which never fits in 52 bits.
In my understanding, the fraction bits beyond 52, roughly 14 bits, invade unassigned memory space, like this.
[image: rough drawing]
So I examined memory map in debug mode (gdb), suspecting that the bits next to the variable b or c would be corrupted. But I couldn't see any changes. What am I missing here?
You mix up 2 very different things:
Buffer overflow/overrun
Your added image shows what happens when you overflow your buffer.
Like defining char[100] and writing to index 150. Then the memory layout is important, as you might corrupt neighboring variables.
Overflow in values of a data type
What your code shows can only be an overflow of values.
If you do int a= INT_MAX; a++ you get an integer overflow.
This only affects the resulting value.
It does not cause the variable to grow in size. An int will always stay an int.
You do not invade any memory area outside your data type.
Depending on data type and architecture the overflowing bits could just be chopped off or some saturation could be applied to set the value to maximum/minimum representable value.
I did not check your values, but without inspecting the value of c in a debugger or printing it, you cannot tell anything about the overflow there.
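As a small sketch of the two outcomes mentioned above (bits chopped off versus saturation), here are two hypothetical helpers on uint8_t. Unsigned types are used on purpose, because overflow of signed integers is undefined behavior in C. In both cases the result is still exactly one byte wide; nothing spills into neighboring memory.

```c
#include <stdint.h>

/* Wrapping add: the bits that do not fit in 8 bits are simply dropped. */
uint8_t wrap_add_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Saturating add: the result is clamped to the largest representable value. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}
```

With a = 250 and b = 10, the wrapping version yields 4 and the saturating version yields 255.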
Floating-point arithmetic is not defined to work by writing out all the bits of the operands, performing the arithmetic using all the bits involved, and storing those bits in memory. Rather, the way elementary floating-point operations work is that each operation is performed “as if it first produced an intermediate result correct to infinite precision and with unbounded range” and then rounded to a result that is representable in the floating-point format. That “as if” is important. It means that when computer processor designers are designing the floating-point arithmetic instructions, they figure out how to compute what the final rounded result would be. The processor does not always need to “write out” all the bits to do that.
Consider an example using decimal floating-point with four significant digits. If we add 6.543•10^20 and 1.037•10^17 (equal to 0.001037•10^20), the infinite-precision result would be 6.544037•10^20, and then rounding that to the nearest number representable in the four-significant-digit format would give 6.544•10^20. But we do not have to write out the infinite-precision result to compute that. We can compute the result is 6.544•10^20 plus a tiny fraction, and then we can discard that fraction without actually writing out its digits. This is what processor designers do. The add, multiply, and other instructions compute the main part of a result, and they carefully manage information about the other parts to determine whether they would cause the result to round upward or downward in its last digit.
The resulting behavior is that, given any two operands in the format used for double, the computer always produces a result in that same format. It does not produce any extra bits.
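A minimal check of this for the values in the question, assuming double is IEEE-754 binary64 (the function name add_tiny_to_huge is just for illustration). Near 1e20 the spacing between adjacent doubles is 2^14 = 16384, so adding 3.14 is far less than half a step and the sum rounds back to -1e20.

```c
/* Returns -1e20 + 3.14, computed in double.
 * volatile keeps the compiler from folding the addition away. */
double add_tiny_to_huge(void) {
    volatile double b = 3.14;
    volatile double c = -1e20;
    return c + b;   /* rounded to the nearest representable double */
}
```

The returned value compares equal to -1e20: the low-order bits of 3.14 are rounded away inside the addition, and the result occupies the same number of bytes as any other double.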
Supplement
There are 53 bits in the fraction portion of the format commonly used for double. (This is the IEEE-754 binary64 format, also called double precision.) The fraction portion is called the significand. (You may see it referred to as a mantissa, but that is an old term for the fraction portion of a logarithm. The preferred term is “significand.” Significands are linear; mantissas are logarithmic.) You may see some people describe there being 52 bits for the significand, but that refers to a part of the encoding of the floating-point value, and it is only part of it.
Mathematically, a floating-point representation is defined to be s•f•b^e, where b is a fixed numeric base, s provides a sign (+1 or −1), f is a number with a fixed number of digits p in base b, and e is an exponent within fixed limits. p is called the precision of the format, and it is 53 for the binary64 format. When this number is encoded into bits, the last 52 bits of f are stored in the significand field, which is where the 52 comes from. However, the first bit is also encoded, by way of the exponent field. Whenever the stored exponent field is not zero (or the special value of all one bits), it means the first bit of f is 1. When the stored exponent field is zero, it means the first bit of f is 0. So there are 53 bits present in the encoding.
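The encoding rule can be sketched in C by pulling the fields out of a double. This is an illustration under the assumption that double is IEEE-754 binary64; the function name significand53 is hypothetical, and infinities and NaNs (exponent field all ones) are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Rebuilds the 53-bit significand of a finite double as an integer:
 * 52 stored bits plus the leading bit implied by the exponent field. */
uint64_t significand53(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                      /* raw encoding */
    uint64_t exponent_field = (bits >> 52) & 0x7FFu;     /* 11 bits */
    uint64_t fraction_field = bits & (((uint64_t)1 << 52) - 1);  /* 52 bits */
    uint64_t leading_bit = (exponent_field != 0) ? 1u : 0u;
    return (leading_bit << 52) | fraction_field;
}
```

For 1.0 this gives 2^52 (a leading 1 followed by 52 zero bits), and for 3.0 it gives 3•2^51 (binary 11 followed by 51 zero bits), even though only 52 of those bits are physically stored.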
I know that many had similar questions over here about converting from/to two's complement format and I tried many of them but nothing seems to help in my case.
Well, I'm working on an embedded project that involves writing/reading registers of a slave device over SPI. The register concerned here is a 22-bit position register that stores the uStep value in two's complement format and it ranges from -2^21 to +2^21 -1. The problem is when I read the register, I get a big integer that has nothing to do with the actual value.
Example:
After sending a command to the slave to move 4000 steps (forward/positive), I read the position register and I get exactly 4000. However, if I send a reverse move command, say -1, and then read the register, the value I get is something like 4292928. I believe it's the negative offset of the register as the two's complement has no zero. I have no problem sending a negative integer to the device to move x number of steps, however, getting the actual negative integer from the value retrieved is something else.
I know that this involves two's complement but the question is, how to get the actual negative integer out of that strange value? I mean, if I moved the device -4000 steps, what I have to do to get the exact value for the negative steps moved so far from my register?
You need to sign-extend bit 21 through the bits to the left.
For negative values when bit 21 is set, you can do this by ORring the value with 0xFFC00000.
For positive values when bit 21 is clear, you can ensure the upper bits are zero by ANDing the value with 0x003FFFFF.
The solutions by Clifford and Weather Vane assume the target machine is two's-complement. This is very likely true, but a solution that removes this dependency is:
static const int32_t sign_bit = 0x00200000;
int32_t pos_count = (getPosRegisterValue() ^ sign_bit) - sign_bit;
It has the additional advantage of being branch-free.
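A sketch of the same XOR-and-subtract idea wrapped in a function, with the raw value passed in as a parameter so it can be checked in isolation. The name sign_extend22 is hypothetical. This variation masks to 22 bits and converts to int32_t before doing the arithmetic, so every intermediate value is representable in int32_t.

```c
#include <stdint.h>

/* Sign-extends a 22-bit two's-complement field held in the low bits of raw. */
int32_t sign_extend22(uint32_t raw) {
    const int32_t sign_bit = 0x00200000;          /* bit 21 */
    int32_t v = (int32_t)(raw & 0x003FFFFFu);     /* 0 .. 2^22 - 1 */
    return (v ^ sign_bit) - sign_bit;             /* -2^21 .. 2^21 - 1 */
}
```

A register reading of 0x3FFFFF comes out as -1, a reading of 4190304 (2^22 - 4000) comes out as -4000, and 4000 stays 4000.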
The simplest method perhaps is simply to shift the position value left by 10 bits and assign to an int32_t. You will then have a 32 bit value and the position will be scaled up by 2^10 (1024), and have 32 bit resolution, but 10 bit granularity, which normally shouldn't matter since the position units are entirely arbitrary in any case, and can be converted to real-world units if necessary taking into account the scaling:
int32_t pos_count = (int32_t)(getPosRegisterValue() << 10) ;
Where getPosRegisterValue() returns a uint32_t.
If you do however want to retain 22 bit resolution then it is simply a case of dividing the value by 1024:
int32_t pos_count = ((int32_t)(getPosRegisterValue() << 10)) / 1024 ;
Both solutions rely on the implementation-defined behaviour of casting a uint32_t of value not representable in an int32_t; but on a two's complement machine any plausible implementation will not modify the bit-pattern and the result will be as required.
Another perhaps less elegant solution also retaining 22 bit resolution and single bit granularity is:
int32_t pos_count = getPosRegisterValue() ;
// If 22 bit sign bit set...
if( (pos_count & 0x00200000) != 0)
{
// Sign-extend to 32bit
pos_count |= 0xFFC00000 ;
}
It would be wise perhaps to wrap the solution in a function to isolate any implementation-defined behaviour:
int32_t posCount()
{
    return ((int32_t)(getPosRegisterValue() << 10)) / 1024 ;
}
I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly are A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random signed input signal (in) with values between -10V and 10V, which is scaled to A(15,16), and the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
int in_seq[i][j] = (int)(roundf(temp *(1 << numFracBits)));
And similarly for the twiddle factors.
Now I need to perform
res = a*tw
Questions:
a) how do I implement this?
b) Should the size of res be 64 bit?
c) can I make 'res' A(17,14) since I know the ranges of a and tw? if yes, should I be scaling a*tw by 2^14 to store correct value in res?
a + res
Questions:
a) How do I add these two numbers of different Q formats?
b) if not, how do I do this operation?
Maybe it's easiest to make an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).
You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.
A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:
You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.
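The addition example can be sketched in C as follows. The function names are made up for this illustration, the raw values are assumed to be stored in int32_t, and multiplication and division by powers of two are used instead of shifts so that negative values stay well-defined.

```c
#include <stdint.h>

/* a is A(3,5), b is A(2,10). Conservative sum in A(4,10):
 * bring a up to 10 fraction bits, then add. No precision is lost. */
int32_t add_full_precision(int32_t a, int32_t b) {
    return a * (1 << 5) + b;
}

/* Cheaper alternative: bring b down to 5 fraction bits instead.
 * Result is in A(4,5); the low 5 bits of b are discarded
 * (the division truncates toward zero). */
int32_t add_low_precision(int32_t a, int32_t b) {
    return a + b / (1 << 5);
}
```

For example, 1.5 in A(3,5) is 48 and 0.25 in A(2,10) is 256. The full-precision sum is 1792, which is 1.75 with 10 fraction bits; the low-precision sum is 56, which is 1.75 with 5 fraction bits.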
Now, on multiplication.
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.
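A minimal sketch of that rule for the A(3,5) by A(2,10) example, assuming the raw values are held in int32_t. The helper names are hypothetical. The raw integers are multiplied directly, and the product is interpreted with 5 + 10 = 15 fraction bits.

```c
#include <stdint.h>

/* a is A(3,5), b is A(2,10); the exact product is A(6,15). */
int64_t mul_a3_5_by_a2_10(int32_t a, int32_t b) {
    return (int64_t)a * b;
}

/* Interprets a raw A(6,15) value as a real number (for checking only). */
double a6_15_to_double(int64_t raw) {
    return (double)raw / 32768.0;   /* 2^15 */
}
```

With 1.5 stored as 48 and 0.25 stored as 256, the raw product is 12288, which reads back as 0.375.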
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.
For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You probably should cut 32 bits from it, and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative number 20 above is needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge-case).
So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of
res = a*tw;
you probably want
int64_t res_tmp = (int64_t)a * tw; // A(20, 43)
if (res_tmp == ((int64_t)1 << 62)) // you might want to neglect this edge case
--res_tmp; // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31); // A(19, 12)
Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.
res = a*tw
a is represented as A(15,16) and tw is represented as A(2,29), so the natural representation of their product is A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does require scaling. I would probably implement it like this:
/*
 * Computes a magnitude-preserving fixed-point product of any two signed
 * fixed-point numbers with a combined 31 (or fewer) value bits. If x
 * is represented as A(s,t) and y is represented as A(u,v),
 * where s + t == u + v == 31, then the representation of the result is
 * A(s + u + 1, t + v - 32).
 */
int32_t fixed_product(int32_t x, int32_t y) {
    const int64_t scale = (int64_t) 1 << 32;   /* drop 32 fraction bits */
    int64_t full_product = (int64_t) x * (int64_t) y;
    int64_t quotient = full_product / scale;   /* truncates toward zero */
    int64_t remainder = full_product % scale;
    if (remainder < 0) {                       /* turn it into floor division */
        quotient -= 1;
        remainder += scale;
    }
    int round_up = remainder >= scale / 2;     /* round half up */
    return (int32_t) (quotient + round_up);
}
That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.
a + res
Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:
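int32_t a_plus_res(int32_t a, int32_t res) {
    int64_t res16 = ((int64_t) res) * (1 << 3);  /* A(18,13) -> 16 fraction bits */
    int64_t sum16 = a + res16;                   /* exact sum, 16 fraction bits */
    int64_t quotient = sum16 / (1 << 5);         /* drop 5 fraction bits */
    int64_t remainder = sum16 % (1 << 5);
    if (remainder < 0) {                         /* turn it into floor division */
        quotient -= 1;
        remainder += (1 << 5);
    }
    return (int32_t) (quotient + (remainder >= (1 << 4)));  /* round half up */
}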
A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.
All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.
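As a rough sketch of that consistent-scale alternative, suppose every value uses 16 fraction bits (this particular scale is an assumption for the example, not something taken from the question). Addition is then a plain integer add, while multiplication needs a wide intermediate and a rescale, and neither function checks for overflow.

```c
#include <stdint.h>

/* All operands and results share one format with 16 fraction bits. */
int32_t q16_add(int32_t a, int32_t b) {
    return a + b;                        /* may overflow for large inputs */
}

int32_t q16_mul(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a * b;       /* 32 fraction bits */
    return (int32_t)(wide / 65536);      /* back to 16; truncates toward zero,
                                            no overflow check */
}
```

For instance, 1.5 is 98304 and 2.0 is 131072 in this format; their product comes out as 196608, which is 3.0.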
I'd like to start out by saying this isn't about optimizations so please refrain from dragging this topic down that path. My purpose for using fixed point arithmetic is because I want to control the precision of my calculations without using floating point.
With that being said let's move on. I wanted to have 17 bits for range and 15 bits for the fractional part. The extra bit is for the signed value. Here are some macros below.
const int scl = 16;
#define Double2Fix(x) ((x) * (double)(1 << scl))
#define Float2Fix(x) ((x) * (float)(1 << scl))
#define Fix2Double(x) ((double)(x) / (1 << scl))
#define Fix2Float(x) ((float)(x) / (1 << scl))
Addition and subtraction are fairly straightforward, but things get a bit tricky with mul and div.
I've seen two different ways to handle these two types of operations.
1) if I am using 32 bits then use a temp 64bit variable to store intermediate multiplication steps then scale at the end.
2) right in the multiplication step scale both variables to a lesser bit range before multiplication. For example if you have a 32 bit register with 16 bits for the whole number you could shift like this:
(((a)>>8)*((b)>>6) >> 2) or some combination that makes sense for your app.
It seems to me that if you design your fixed point math around 32 bits it might be impractical to always depend on having a 64bit variable able to store your intermediate values but on the other hand shifting to a lower scale will seriously reduce your range and precision.
questions
Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
Also, I've noticed:
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?
Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
1. Multiply N x N => low N bits (native C multiplication)
2. Multiply N x N => high N bits (the C language has no operator for this)
3. Multiply N x N => all 2N bits (cast to wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into pieces of (N/2) bits and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that doesn't have it or #2.
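To give an idea of the extra work involved, here is a sketch of a 32 x 32 => 64 bit unsigned multiply built only from 32-bit operations by splitting each operand into 16-bit halves. The struct and function names are hypothetical, and it assumes unsigned int is 32 bits wide so that uint32_t arithmetic is not promoted.

```c
#include <stdint.h>

struct u64_parts { uint32_t hi, lo; };   /* 64-bit result as two halves */

/* Schoolbook long multiplication on 16-bit "digits". */
struct u64_parts mul32x32(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint32_t p1 = a_lo * b_hi;   /* weight 2^16 */
    uint32_t p2 = a_hi * b_lo;   /* weight 2^16 */
    uint32_t p3 = a_hi * b_hi;   /* weight 2^32 */

    /* Sum of everything landing in bits 16..31; cannot overflow 32 bits. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    struct u64_parts r;
    r.lo = (p0 & 0xFFFFu) | (mid << 16);
    r.hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
    return r;
}
```

That is four multiplies plus several shifts and adds, compared with a single widening multiply where the hardware provides one.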
Is that precision loss just a part of using fixed point numbers?
log2( 9.1234567890 – 9.123444 ) = –16.25, and you used 16 bits of precision, so yep, that's very typical.
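The round trip can be reproduced with a small sketch, assuming 16 fraction bits as stated above. The function names and the SCALE_BITS macro are made up for this example.

```c
#include <stdint.h>

#define SCALE_BITS 16   /* assumed number of fraction bits */

/* Converts to fixed point; the cast truncates toward zero. */
int32_t double_to_fix(double x) {
    return (int32_t)(x * (double)(1 << SCALE_BITS));
}

/* Converts a fixed-point value back to double (exact). */
double fix_to_double(int32_t x) {
    return (double)x / (double)(1 << SCALE_BITS);
}
```

9.1234567890 becomes the integer 597914, and converting back gives about 9.1234436. The difference is below 2^-16, one step of the fixed-point grid, which is the most that truncation to 16 fraction bits can lose.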
To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32 bit integers or 16 bit words, which add like normal integers/words, but which multiply the "wrong way", meaning that when you multiply X times Y, you keep the high order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices, for numerically intensive physics simulations. It makes more sense for MMX/SSE to have integer quantities, but you need higher-level access to PMULHW.
If 16 bit fixed point arithmetic is sufficient and you are on x86 or a similar architecture, you can directly use SSE.
The SSSE3 instruction pmulhrsw directly implements signed 0.15 fixed point arithmetic multiplication (mod 2 as you call it, from -1..+1) in hardware. Addition is no different from the standard 16 bit vector operations, just using paddw.
So a library which handles multiplication and addition of eight signed 16 bit fixed point variables at a time could look like this:
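#include <tmmintrin.h>  // SSSE3 intrinsics, including _mm_mulhrs_epi16

typedef __m128i fixed16_t;  // eight signed 0.15 fixed-point values
fixed16_t mul(fixed16_t a, fixed16_t b) {
    return _mm_mulhrs_epi16(a,b);
}
fixed16_t add(fixed16_t a, fixed16_t b) {
    return _mm_add_epi16(a,b);
}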
Permission granted to use it in any way you like ;-)
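For reference, a scalar sketch of what one 16-bit lane of pmulhrsw computes, written in plain C so it can be checked without SIMD hardware. The name q15_mul_round is hypothetical. The instruction forms the 32-bit product, adds half of an output step, and keeps the bits above the 15 fraction bits, which is the same as floor((a*b + 2^14) / 2^15).

```c
#include <stdint.h>

/* Scalar model of one pmulhrsw lane: rounded 0.15 fixed-point multiply. */
int16_t q15_mul_round(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * b + 0x4000;   /* product plus half an output LSB */
    int32_t q = p / 32768;                 /* truncates toward zero */
    if (p % 32768 < 0) {
        q -= 1;                            /* adjust to floor for negatives */
    }
    /* For a == b == -32768 the exact result (+1.0) is not representable;
     * the conversion below is then implementation-defined. */
    return (int16_t)q;
}
```

For example, 0.5 is 16384 in this format, and 0.5 times 0.5 gives 8192, which is 0.25; -0.5 times 0.5 gives -8192.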