arithmetic operations using float binary "0b"

I'm trying to understand, I'm a beginner.
I want to do arithmetic operations with float numbers in binary.
I used http://www.binaryconvert.com/result_float.html to get the bit patterns.
But my program prints:
1069547520.000000
1069547520.000000
2139095040.000000
What is it?
I was hoping for this:
00111111110000000000000000000000
00111111110000000000000000000000
01000000010000000000000000000000
Is %f in printf() wrong here too?
#include <stdio.h>
int main()
{
float a = 0b00111111110000000000000000000000; /* 1.5 */
float b = 0b00111111110000000000000000000000; /* 1.5 */
float c;
c = a + b; /* 3.0 !? */
printf("%f\n", a);
printf("%f\n", b);
printf("%f\n", c);
return 0;
}

The binary constant 0b00111111110000000000000000000000 is a GCC extension (binary integer constants only became standard in C23), and it has type int with the value 1069547520. Assigning it to a float converts that integer value, so the variable holds the float closest to 1069547520, not the float whose bit pattern is 00111111110000000000000000000000.
C has no binary floating-point constants, although hexadecimal ones exist. If it had them, 1.5 would be written in binary as something like
0b1.1f
i.e. its numeric value in binary is 1.1.
C has supported hexadecimal floating-point constants since C99 (they are also in C11 and C17); you can use
0x1.8p0f
for 1.5f; p0 is the binary exponent, i.e. the significand 0x1.8 is multiplied by 2^0.
If you really want to fiddle with the IEEE 754 binary format, you need to use a union or memcpy. For example
#include <stdio.h>
#include <string.h>
#include <stdint.h>
int main(void) {
float a;
uint32_t a_v = 0b00111111110000000000000000000000;
memcpy(&a, &a_v, sizeof(float));
printf("%f\n", a);
// prints 1.500000 on linux x86-64
}

Your binary literals are integer literals. Then you print the floating point values as floating point values, not using binary representation.

Related

GNU MP low precision while using mpf_pow function

While writing this answer, I used the mpf_pow_ui function to calculate 12.3 ^ 123, and the result is different from the one given by WolframAlpha (which by the way also uses GMP).
I converted the code to plain C to simplify it:
#include <stdio.h>
#include <gmp.h>
int main (void) {
mpf_t a, c;
unsigned long int b = 123UL;
mpf_set_default_prec(100000);
mpf_inits(a, c, NULL);
mpf_set_d(a, 12.3);
mpf_pow_ui(c, a, b);
gmp_printf("c = %.50Ff\n", c);
return 0;
}
Which results in
114374367934618002778643226182707594198913258409535335775583252201365538178632825702225459029661601216944929436371688246107986574246790.32099077871758646985223686110515186972735931183764
While WolframAlpha returns
1.14374367934617190099880295228066276746218078451850229775887975052369504785666896446606568365201542169649974727730628842345343196581134895919942820874449837212099476648958359023796078549041949007807220625356526926729664064846685758382803707100766740220839267 × 10^134
which starts to disagree with the GMP output at the 15th digit.
Am I doing something wrong in the code, is this a limitation of GMP, or is WolframAlpha giving an incorrect result?

Am I doing something wrong in the code, is this a limitation of GMP, or is WolframAlpha giving an incorrect result?
You are doing something different from what Wolfram is doing (obviously). Your code is not wrong, per se, but it is not doing what you probably think it is doing. Compare the output of this variation:
#include <stdio.h>
#include <gmp.h>
int main (void) {
mpf_t a, c;
unsigned long int b = 123UL;
mpf_set_default_prec(100000);
mpf_inits(a, c, NULL);
mpf_set_d(a, 12.3);
mpf_pow_ui(c, a, b);
gmp_printf("c = %.50Ff\n", c);
putchar('\n');
mpf_t a1, c1;
mpf_inits(a1, c1, NULL);
mpf_set_str(a1, "12.3", 10);
mpf_pow_ui(c1, a1, b);
gmp_printf("c' = %.50Ff\n", c1);
return 0;
}
...
c = 114374367934618002778643226182707594198913258409535335775583252201365538178632825702225459029661601216944929436371688246107986574246790.32099077871758646985223686110515186972735931183764
c' = 114374367934617190099880295228066276746218078451850229775887975052369504785666896446606568365201542169649974727730628842345343196581134.89591994282087444983721209947664895835902379607855
The difference between the two output values arises because my C implementation and yours represent values of type double in binary floating point, and 12.3 is not exactly representable in binary floating point (see Is floating point math broken?). C provides the closest approximation available, which, assuming 64-bit IEEE 754 representation, matches to about 15 decimal digits of precision. When you initialize a GMP variable with such a value, you get an exact GMP representation of the actual double value, which is only an approximation to 12.3 decimal.
But GMP can represent 12.3 (decimal) to whatever precision you choose.* You chose a very high precision, so when you use a decimal string to initialize your MP-float variable you get a much closer approximation than when you used a double. Naturally, performing the same operation on those different values produces different results. The GMP result in the latter case appears to agree with the Wolfram result to the full precision in which it is expressed.
Note also that in a general sense, one can also use decimal floating-point, in software or (if you are so equipped) in hardware. The value 12.3 (decimal) can be represented exactly in such a format, but that's not what GMP uses.
* Or indeed, GMP can represent 12.3 exactly as an MP rational, though that's not what the code above does.

This gives a result similar to WolframAlpha's:
from decimal import Decimal
from decimal import getcontext
getcontext().prec = 200
print(Decimal('12.3') ** 123)
So the discrepancy comes from the value handed to GMP (the double 12.3 rather than the decimal 12.3), not from GMP's precision.

Using 0b representation to print (printf) float

I'm trying to express a fractional number in binary and then have it print out as a float. I've done the fixed point to floating point conversion.
The number in decimal: -342.265625
fixed point: -101010110.010001
32-bit float: 11000011101010110010001000000000
64-bit float (double): 1100000001110101011001000100000000000000000000000000000000000000
*I've double checked with an IEEE 754 Converter
*I'm also aware that printf changes floats into doubles to print them, but declaring it as a double should work? I thought...?
Code:
int main()
{
float floaty = 0b11000011101010110010001000000000;
double doubley = 0b1100000001110101011001000100000000000000000000000000000000000000;
printf("Float: %f\n", floaty);
printf("Double: %lf\n", doubley);
}
Output:
Float: 3282772480.000000
Double: 13868100853597995008.000000
The compiler is gcc and the standard is c99

From gcc's documentation:
The type of these constants follows the same rules as for octal or
hexadecimal integer constants, so suffixes like ‘L’ or ‘UL’ can be
applied.
So, the binary numbers you assign to float and double are actually of integer types and don't directly map to the bit pattern of the underlying types you assign to.
In other words, this:
float floaty = 0b11000011101010110010001000000000;
double doubley = 0b1100000001110101011001000100000000000000000000000000000000000000;
is equivalent to:
float floaty = 3282772480;
double doubley = 13868100853597995008;

The problem is that the compiler is doing exactly what you asked. Your literals (0b1...), which by the way are a non-standard extension (hexadecimal, 0x..., is the portable spelling), are treated as integer literals. The compiler then converts those integer values to the types of the variables you assign them to, so you end up with very large values equal to the integer value of your literals.
To set the bit pattern of a variable directly, you have to use a union (or memcpy; casting between float * and uint32_t * pointers violates the strict aliasing rule). This code works:
#include <stdio.h>
#include <stdint.h>
union floatint {
float f;
uint32_t i;
};
union doubleint {
double d;
uint64_t i;
};
int main(void)
{
union floatint floaty; /* in C the union keyword is required here */
union doubleint doubley;
floaty.i = 0xC3AB2200;
doubley.i = 0xC075644000000000;
printf("Float: %f\n", floaty.f); // depends on the representation, in your case IEEE 754
printf("Double: %f\n", doubley.d); // ditto
return 0;
}
Note that this is what a union is for: two (or more) members that share the same storage but are interpreted as different types.

You can use the binary constants with some more work.
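We have to assume that floating point uses IEEE 754 and that float and uint32_t have the same size and byte order (little endian is not required, only that integers and floats agree, which holds on virtually every current platform):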
uint32_t value = 0b11000011101010110010001000000000;
float f;
memcpy( &f , &value , sizeof( f ) );
printf( "%f\n" , f );

Using round() function in c

I'm a bit confused about the round() function in C.
First of all, man says:
SYNOPSIS
#include <math.h>
double round(double x);
RETURN VALUE
These functions return the rounded integer value.
If x is integral, +0, -0, NaN, or infinite, x itself is returned.
The return value is a double / float or an int?
Second, I've created a function that first rounds, then casts to int. Later in my code I use it as a means to compare doubles
int tointn(double in,int n)
{
int i = 0;
i = (int)round(in*pow(10,n));
return i;
}
This function apparently isn't stable throughout my tests. Is there redundancy here? Well... I'm not looking only for an answer, but a better understanding on the subject.

The wording in the man-page is meant to be read literally, that is in its mathematical sense. The wording "x is integral" means that x is an element of Z, not that x has the data type int.
Casting a double to int can be dangerous because a double can hold integral values far larger than an int can: every integer up to 2^53 is exactly representable (assuming an IEEE 754 conforming binary64), while int is usually 32 bits wide, even on most 64-bit platforms. Converting a value that does not fit into int is undefined behavior.
You can see where powers of ten stop fitting into an int with this little program:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(){
int i;
for(i = 0;i < 26;i++){
printf("%d:\t%.2f\t%d\n",i, pow(10,i), (int)pow(10,i));
}
exit(EXIT_SUCCESS);
}
Instead of casting you should use the functions that return a proper integral data type like e.g.: lround(3).

Here is an excerpt from the man page.
#include <math.h>
double round(double x);
float roundf(float x);
long double roundl(long double x);
Note: the returned value never has an integer type. It is a floating-point value whose fractional part is 0.
Note: the type of the returned value depends on which of the functions is called.
Here is an excerpt from the man page about which way the rounding will be done:
These functions round x to the nearest integer, but round halfway cases
away from zero (regardless of the current rounding direction, see
fenv(3)), instead of to the nearest even integer like rint(3).
For example, round(0.5) is 1.0, and round(-0.5) is -1.0.

If you want a long integer to be returned, use lround (declared in <math.h>):
long int tolongint(double in)
{
return lround(in);
}
For details see the documentation of lround, which has been available since C99 (and C++11).

Why did we add a dot in multiplication in order to run the program?

I'm writing a program that calculates the roots of the quadratic equation. When I first wrote the code I didn't type a dot after 4 and 2 in the x equation and it didn't work! So what does that dot represent here and when should I use it?
#include<stdio.h>
#include<conio.h>
#include<math.h>
int main()
{
int a, b, c;
double x;
scanf("%d %d %d", &a, &b, &c);
x = (-b + sqrt(b*b-4.*a*c) ) / (2.*a);
printf("%lf", x);
getch();
return 0;
}

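4. is 4.0
The decimal point makes it a floating-point literal (of type double) rather than an integer literal.
This matters whenever both operands would otherwise be integers: 1/2 is an integer division and yields 0, while 1/2. yields 0.5. In this particular formula the numerator is already a double because sqrt() returns one, so the division by 2*a would be carried out in floating point even without the dot; writing 4. and 2. makes the intent explicit and keeps 4.*a*c from overflowing int.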

Integer literals are interpreted by the compiler as integers, which means that operations such as division are performed in their integer form if all operands are integers. The decimal point makes it a floating literal, which means that the compiler will use the floating form of the operations instead.

Wrong answer of average function?

I'm practically new to C programming and I've been trying to get a simple average function right, but the fractional part of the answer keeps messing up...??
#include <stdio.h>
#include <float.h>
float cal(int num1,int num2,int num3);
int main(){
int a,b,c;
float avg;
a=10;
b=5;
c=11;
avg=cal(a,b,c);
printf("Average is : %E\n", avg);
return 0;
}
float cal(int num1,int num2,int num3){
float avg1;
avg1=(num1+num2+num3)/3;
return avg1;
}
The answer (avg) should be 8.66666666667, but instead I get 8.00000000...

You're doing integer division here. Cast it to float (at least one of them) or use float literals before division to force it to use float division.
For example, change
avg1=(num1+num2+num3)/3;
to
avg1=(num1+num2+num3)/(float)3; // 1. cast one to float
avg1=(num1+num2+num3)/3.0f; // 2. use float literals

Change this
avg1=(num1+num2+num3)/3;
to this
avg1=(num1+num2+num3)/(float)3;
That way you force a division by a float.
With your code you actually perform integer division, which means that the decimal digits get discarded. Then the result of the division is assigned to a float number, but the decimal digits are already gone. That's why you need to cast at least one operand of the division to a float, in order to get what you want.

It is because all operands are integer in here (num1+num2+num3)/3. So you get an integer division, that is later cast to a float (i.e. upon the assignment, but after the evaluation).
You need to make one of the division operands a float, so the rest will be converted. And the division will be a float division.
E.g:
(num1+num2+num3)/(float)3
(num1+num2+num3)/3.0f
((float)(num1+num2+num3))/3
Note that the additions are still integer additions, because of parentheses.
A nice read on the conversion rules is here.

Force the division to be performed in floating point
avg1=(num1+num2+num3)/3.0f;
What happens in your case is that you perform an integer division and then convert the result to float:
Resulting type of (num1+num2+num3)/3 is an integer, while the type of (num1+num2+num3)/3.0f is a float.
Integer division will give the result without the decimal point.

Change
avg1=(num1+num2+num3)/3;
to
avg1=(float)(num1+num2+num3)/3;
If you perform integer division then the result will also be an integer.
Here the cast applies to the sum (num1+num2+num3) before the division takes place, so the division is performed in floating point and the fraction is kept.

You can simplify your code further
float cal(int num1,int num2,int num3){
return ((num1+num2+num3)/3.0);
}
Just change the value 3 to 3.0; that is enough, because when one operand is a double the compiler converts the other operand to double as well (the usual arithmetic conversions). The result is then converted to float on return.

In addition to the well pointed out need to use floating point division rather than integer division, a typical float will not provide a precise number like 8.66666666667, only about 6 to 7 significant digits. Further, converting a typical 32-bit int to float may lose precision, because a float has only 24 bits of significand.
For a more precise answer with 11 digits to the right of the decimal point, use double instead of float
double cal(int num1,int num2,int num3){
double avg1;
avg1=(num1+num2+num3)/3.0; // 3 --> 3.0
return avg1;
}
int main(void){ // added void
int a,b,c;
double avg;
a=10;
b=5;
c=11;
avg=cal(a,b,c);
// printf("Average is : %E\n", avg);
printf("Average is : %.11E\n", avg); // Print to 11 digits after the dp.
return 0;
}
