Fixed point numbers in C without float

In C, is it possible to represent a fixed-point number in binary form, so it can be transmitted, without using floats?
I know how to convert a float or double to the desired fixed point representation but I'm stuck when it shall be done without floating points. The problem is that the system I have to develop on has this limitation.
My idea is to create a struct which holds the full representation and a processable integer and fractional part. And after creating the struct with either only the received binary representation or the integer and fractional values there shall be a function which does the conversion.
My Question seems not to be precise enough so I'll add some details.
Within my code I have to create and receive numbers in a certain fixed-point representation. As described by the answers below, this is nothing but a pointer to a sequence of bits. My problem is that I have to create this sequence of bits when sending, or interpret it when receiving the information.
This conversion is my problem. Ignoring signedness, it is quite an easy thing to do when you can use a float to convert from (code not tested, but it should work like this):
float sourceValue = 12.223445;
int intPart = 0;
float fractPart = 0.0;
//integer part is easy, just cast it
intPart = (int)sourceValue;
//the fractional part is the rest
fractPart = sourceValue - intPart;
//multiplying the fract part by the precision of the fixed point number (Q9.25)
//gets us the fractional part in the desired representation
uint64_t factor = 1;
factor = factor << 25;
uint64_t fractBits = fractPart * factor; //separate name, since fractPart is already a float
The rest can be done by some shifting and the use of logical bit operators.
But how can I do this without a float in the middle, starting with something like this:
int intPart = 12;
int fractPart = 223445;
Is it even possible? As I said, I'm kind of stuck here.
I don't know what you are really up to, but a fixed-point number can be viewed as an integer number with a constant factor applied to it.
For example, if you want to express a number in the interval [0; 1) in 16 bits, you can map it to the range [0; 65536) by simply multiplying it with 65536.
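As a rough sketch of that idea applied to the question's Q9.25 format, the fractional bits can be computed with integer arithmetic alone, provided the number of decimal digits in the fractional part is known. The function name to_q9_25 and the fracDigits parameter are made up for this illustration, and sign handling is left out:

```c
#include <stdint.h>

#define FRAC_BITS 25  /* Q9.25: 25 fractional bits (taken from the question) */

/* Hypothetical helper: packs an unsigned decimal number given as
 * intPart.fracPart (fracPart has fracDigits decimal digits) into Q9.25
 * without any floating-point arithmetic. Assumes fracDigits <= 9 so that
 * the intermediate product fits in 64 bits. */
uint64_t to_q9_25(uint32_t intPart, uint32_t fracPart, unsigned fracDigits)
{
    uint64_t scale = 1;                 /* becomes 10^fracDigits */
    for (unsigned i = 0; i < fracDigits; ++i)
        scale *= 10;

    /* fracPart / scale is the fraction in [0, 1). Multiply by 2^25 first,
     * then divide, rounding to the nearest representable step. */
    uint64_t fracBits = (((uint64_t)fracPart << FRAC_BITS) + scale / 2) / scale;

    /* Addition (not OR) lets a rounded-up fraction carry into the integer part. */
    return ((uint64_t)intPart << FRAC_BITS) + fracBits;
}
```

With the values from the question, to_q9_25(12, 223445, 6) yields the Q9.25 bit pattern for 12.223445. The digit count matters: 12.05 would be passed as fracPart 5 with fracDigits 2.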
Everything boils down to bits, be it an integer, float, etc. All you need is the memory base address and the size of that certain memory. For example,
float src = 0.5;
float dest;
char bytes[sizeof(src)];
memcpy(bytes, &src, sizeof(src));
memcpy(&dest, bytes, sizeof(dest));
should give you dest equal to src.
Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float values. The cast to int is UB for any float whose whole-number part lies outside the [INT_MIN ... INT_MAX] range.
So the code is UB for about 38% of all typical float values: the large-valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
#include <math.h>
// round the y position to the nearest 64th value
float round_to_64th(float x) {
  if (isfinite(x)) {
    float ipart;
    // The modf functions break the argument value into integral and fractional parts
    float frac = modff(x, &ipart);
    x = ipart + roundf(frac*64)/64;
  }
  return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
The closest equivalent of the shift operators for floating-point values, without using additional helper functions, is repeated multiplication or division by two:
Left shift: x * 2^count
Right shift: x / 2^count
float ShiftFloat(float x, int count, int ismultiplication)
{
    float value = x;
    for (int i = 0; i < count; ++i)
        value *= ismultiplication ? 2.0f : 0.5f; // double for a left shift, halve for a right shift
    return value;
}
Converting int16 to float in C

How do I convert a 16-bit int to a floating point number?
I have a signed 16-bit variable which I'm told I need to display with an accuracy of 3 decimal places, so I presume this would involve a conversion to float?
I've tried the below, which just copies my 16 bits into a float, but this doesn't seem right.
float myFloat = 0;
int16_t myInt = 0x3e00;
memcpy(&myFloat, &myInt, sizeof(int));
I've also read about the half-precision floating-point format but am unsure how to handle this... if I need to.
I'm using GCC.
The source of the data is a char array [2] which I get from an I2C interface. I then stitch this together into a signed int.
Can anyone help?
I have a signed 16 bit variable which i'm told i need to display with
an accuracy of 3 decimal places
If someone told you that a plain integer value can be displayed this way, they should start with a C beginners' course.
The only possibility is that the integer value has been scaled (multiplied). For example the value of 12.456 can be stored in the integer if multiplied by 1000. If this is the case:
float flv;
int intv = 12456;
flv = (float)intv / 1000.0f;
You can also print this scaled integer without converting to float.
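A minimal sketch of printing such a scaled value with integer arithmetic only might look like this. The helper name is invented for this example, and the scale factor of 1000 is the same assumption the answer makes:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a value stored as thousandths
 * (e.g. 12456 means 12.456) into buf using integer arithmetic only.
 * Returns whatever snprintf returns. */
int format_thousandths(int16_t raw, char *buf, size_t size)
{
    long v = raw;                    /* widen so that -32768 can be negated */
    const char *sign = (v < 0) ? "-" : "";
    if (v < 0)
        v = -v;
    /* %03ld keeps leading zeros in the fractional part, e.g. 5 -> "0.005" */
    return snprintf(buf, size, "%s%ld.%03ld", sign, v / 1000, v % 1000);
}
```

The sign is handled separately so that values between -1 and 0, whose integer part is zero, still print with a leading minus.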
Division of two floats giving incorrect answer

Attempting to divide two floats in C, using the code below:
#include <stdio.h>
#include <math.h>
int main(){
float fpfd = 122.88e6;
float flo = 10e10;
float int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return 0;
}
To build and run this code, I use the commands:
>> gcc test_prog.c -o test_prog -lm
>> ./test_prog
I then get this output:
Int_Part = 813.000000
Frac_Part = 0.802063
Now, this Frac_Part seems to be incorrect. I have tried the same calculation on a calculator first and then in Wolfram Alpha, and they both give me:
Frac_Part = 0.802083
Notice the number at the fifth decimal place is different.
This may seem insignificant to most, but for the calculations I am doing it is of paramount importance.
Can anyone explain to me why the C code is making this error?
When you have inadequate precision from floating point operations, the first most natural step is to just use floating point types of higher precision, e.g. use double instead of float. (As pointed out immediately in the other answers.)
Second, examine the different floating point operations and consider their precisions. The one that stands out to me as being a source of error is the method above of separating a float into integer part and fractional part, by simply casting to int and subtracting. This is not ideal, because, when you subtract the integer part from the original value, you are doing arithmetic where the three numbers involved (two inputs and result) have very different scales, and this will likely lead to precision loss.
I would suggest to use the C <math.h> function modf instead to split floating point numbers into integer and fractional part.
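As an illustration of that suggestion, a small wrapper around modf could look like the following. The wrapper name is hypothetical, and it works in double rather than float, which is an additional assumption on top of the modf advice:

```c
#include <math.h>

/* Hypothetical helper: splits the quotient num/den into integral and
 * fractional parts with modf, working in double precision throughout.
 * The integral part is stored in *int_part; the fractional part is returned. */
double split_quotient(double num, double den, double *int_part)
{
    return modf(num / den, int_part);
}
```

Called with the question's values (10e10 and 122.88e6), the integral part comes out as 813 and the fractional part agrees with the calculator result to the digits shown in the question.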
(In greater detail: When you do an operation like f - (int)f, the floating point addition procedure is going to see that two numbers of some given precision X are being added, and it's going to naturally assume that the result will also have precision X. Then it will perform the actual computation under that assumption, and finally reevaluate the precision of the result at the end. Because the initial prediction turned out not to be ideal, some low order bits are going to get lost.)
float is the single-precision floating-point type; you should try double instead. The following code gives me the right result:
#include <stdio.h>
#include <math.h>
int main(){
double fpfd = 122.88e6;
double flo = 10e10;
double int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return 0;
}
Why?
As I said, float is a single-precision floating-point type and is smaller than double (on most architectures, sizeof(float) < sizeof(double)).
By using double instead of float you get more bits to store the mantissa and the exponent of the number (see Wikipedia).
float has only 6~9 significant digits, it's not precise enough for most uses in practice. Changing all float variables to double (which provides 15~17 significant digits) gives output:
Int_Part = 813.000000
Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two, but compilers are allowed some freedom when it comes to floating-point computations. For example, compilers are free to use higher precision for intermediate results of computations, but higher still means different, so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating point numbers (say 64bit double-precision numbers). This will make the code slower, however.
Specifically, the number 45.33 cannot be represented exactly as a floating-point value (it is a repeating fraction when expressed in binary and would require an infinite number of bits). When you multiply this value by 100 you may not get an integer, but something very close to one (just below or just above).
Conversion or a cast to int truncates, so something very close to 4533 but below it becomes 4532, while something above becomes 4533, even if the difference is incredibly tiny, say 1E-300.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.
@6502 has given you the theory; here's how to look at things experimentally:
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
Arithmetic Bit Shift of Double Variable Data Type in C

I am trying to arithmetic bit shift a double data type in C. I was wondering if this is the correct way to do it:
NOTE: firdelay[ ][ ] is declared in main as
double firdelay[8][12]
void function1(double firdelay[][12]) {
int * shiftptr;
// Cast address of element of 2D matrix (type double) to integer pointer
*shiftptr = (int *) (&firdelay[0][5]);
// Dereference integer pointer and shift right by 12 bits
*shiftptr >>= 12;
}
Bitwise shifting a floating point data type will not give you the result you're looking for.
In Simulink, the Shift Arithmetic block only does bit shifting for integer data types. If you feed it a floating point type it divides the input signal by 2^N where N is the number of bits to shift specified in the mask dialog box.
Since you don't have the capability to perform any floating point math, your options are:
1. Understand the layout of a single-precision floating point number, then figure out how to manipulate it bitwise to achieve division.
2. Convert whatever algorithm you're porting to use fixed point data types instead of floating point.
I'd recommend option 2; it's a whole lot easier than option 1.
Bit-shifting a floating-point data type (reinterpreted as an int) will give you gibberish (take a look at the diagrams of the binary representation here to see why).
If you want to multiply/divide by a power of 2, then you should do that explicitly.
According to the poorly worded and very unclear documentation, it seems that "bit shifting" in Simulink takes two arguments for floating point values and has the effect of multiplying a floating point value by two raised to the difference of the arguments.
You can use ldexp(double_number, bits_to_pseudo_shift) to obtain this behavior. The function ldexp is found in <math.h>.
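A short sketch of that approach, with wrapper names invented for this example:

```c
#include <math.h>

/* Hypothetical wrappers around ldexp: scale x by 2^bits or 2^-bits,
 * the floating-point counterpart of an arithmetic shift. */
double shift_left(double x, int bits)
{
    return ldexp(x, bits);
}

double shift_right(double x, int bits)
{
    return ldexp(x, -bits);
}
```

For the 12-bit right shift in the question, shift_right(firdelay[0][5], 12) would divide the element by 4096 without touching its bit pattern directly.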
There is no correct way to do this. Both operands of << and >> must be of integer type.
What you're doing is interpreting ("type-punning") a double object as if it were an int object, and then shifting the resulting int value. Even if double and int happen to be the same size, this is very unlikely to do anything useful. (And even if it is useful, it makes more sense to shift unsigned values rather than signed values).
One potential use case for this is capturing the mantissa bits, exponent bits, and sign bit, if you are interested in them. For this you can use a union:
union doubleBits {
double d;
unsigned long long l; // at least 64 bits; plain long is only 32 bits on some platforms
};
You can then take your double and store it in the union:
union doubleBits myUnion;
myUnion.d = myDouble;
And bit shift the long portion of the union after extracting the bits like so:
myUnion.l >>= 1;
Since the number of bits for each portion of the double is defined, this is one way to extract the underlying bit representation. This is one use case where it could be feasible to want to get the raw underlying bits. I'm not familiar with Simulink, but if that is possibly why the double was being shifted in the first place, this could be a way to achieve that behavior in C. The fact that it was always 12 bits makes me think otherwise, but just in case figured it was worth pointing out for others who stumble upon this question.
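To make the field extraction concrete, here is a sketch that pulls out the three parts of a double. It uses memcpy into a uint64_t rather than the union shown above, and it assumes double is an IEEE-754 binary64 value; the struct and function names are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct holding the three fields of an IEEE-754 binary64
 * value; assumes double uses that format, as on most current platforms. */
struct double_fields {
    unsigned sign;        /* 1 bit */
    unsigned exponent;    /* 11 bits, biased by 1023 */
    uint64_t mantissa;    /* 52 bits, without the implicit leading 1 */
};

struct double_fields split_double(double d)
{
    uint64_t bits;
    struct double_fields f;

    memcpy(&bits, &d, sizeof bits);   /* copy the object representation */
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.mantissa = bits & 0xFFFFFFFFFFFFFULL;
    return f;
}
```

For 1.0 this gives sign 0, exponent 1023 and mantissa 0. Shifting is done on an unsigned type here, so no sign bits are smeared into the result.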
There's a way to achieve this: just add your n to the exponent field of the bitwise representation of the double.
Reinterpret your double as a 64-bit integer (using a union, for instance), extract bits 52 to 62 (the 11 exponent bits), add your shift, and put the result back into the exponent field.
You should take into account the special values of double (infinities, NaN, and zero).
double operator_shift_left(double a, int n)
{
    union {
        unsigned long long l;
        double d;
    } r;
    r.d = a;
    switch (r.l)
    {
    case 0x0000000000000000ULL: // 0
    case 0x8000000000000000ULL: // -0
    case 0x7FF0000000000000ULL: // pos infinity
    case 0xFFF0000000000000ULL: // neg infinity
    case 0x7FF0000000000001ULL: // NaN
    case 0x7FF8000000000001ULL: // NaN
        return a;
    }
    // other NaN payloads and subnormal inputs are not special-cased here
    int nexp = (int)((r.l >> 52) & 0x7FF) + n; // new exponent
    if (nexp <= 0) // underflow: exponent field 0 is reserved for zero and subnormals
    {
        r.l = r.l & 0x8000000000000000ULL;
        // returns +/- 0
        return r.d;
    }
    if (nexp >= 2047) // overflow: exponent field 2047 is reserved for infinity and NaN
    {
        r.l = (r.l & 0x8000000000000000ULL) | 0x7FF0000000000000ULL;
        // returns +/- infinity
        return r.d;
    }
    // returns the number with the new exponent
    r.l = (r.l & 0x800FFFFFFFFFFFFFULL) | ((unsigned long long)nexp << 52);
    return r.d;
}
