Converting negative float values to integer values in C language [closed]

I want to convert negative float values to unsigned int values. Is it possible?
For example:
float x = -10000.0;
int y;
y = x;
When we assign the value of x to y, can the negative value be stored in an integer?
If not, how can we store the negative values into integer variables?

can the negative (float f) value be stored in an integer?
Yes, with limitations.
With a signed integer type like int16_t i, i = f is well defined for
-32768.999... to 32767.999...
With an unsigned integer type like uint16_t u, u = f is well defined for
-0.999... to 65535.999...
The result is a truncated value (fraction thrown away). All other float values result in undefined behavior.
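As a quick, hypothetical demonstration of that truncation rule (it assumes int16_t and uint16_t are available via <stdint.h>):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t i = (int16_t) -32768.999f;  /* in range: truncates toward zero to -32768 */
    uint16_t u = (uint16_t) -0.75f;     /* in range: integral part is 0 */
    printf("%d %u\n", i, (unsigned) u); /* prints: -32768 0 */
    return 0;
}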
If not, how can we store the negative values into integer variables?
Best to use wide signed integer types and test for range limitations.
In any case, the fraction of the float is lost. A -0.5f can be stored in an unsigned, yet the value becomes 0u.
The code below performs some simple tests to ensure y is in range.
#include <limits.h>
#include <stdio.h>

float x = ...;
int y = 0;
if (x >= INT_MAX + 1u) puts("Too pos");
else if (x <= INT_MIN - 1.0) puts("Too neg");
else y = (int) x;
Note the tests above are illustrative, as they lack high portability.
Example: INT_MIN - 1.0 is inexact in select situations.
To cope with this on common 2's complement int, the test below is a better formulation. With 2's complement, INT_MIN is a (negated) power of 2 and usually within the range of float, making for an exact subtraction near the negative threshold.
// if (x <= INT_MIN - 1.0)
if (x - INT_MIN <= -1.0f)
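Putting the pieces together, a minimal sketch of a complete conversion function (my own illustration, not from the original answer; it assumes 2's complement int, and, like the tests above, does not handle NaN):

#include <limits.h>
#include <stdbool.h>

bool float_to_int(float x, int *y) {
    if (x >= INT_MAX + 1u) return false;    /* too positive: INT_MAX + 1u is a power of 2, exact in float */
    if (x - INT_MIN <= -1.0f) return false; /* too negative: subtraction is exact near the threshold */
    *y = (int) x;                           /* fraction is truncated */
    return true;
}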
Another alternative is to explore a union. Leave that for others to explain its possibilities and limitations.
union {
    float f;
    unsigned u;
} x;
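For what it's worth, a small sketch of what the union actually does (note it reinterprets the float's bit pattern; it does not perform a value conversion; this demo assumes a 32-bit unsigned and IEEE-754 binary32 float):

#include <stdio.h>

int main(void) {
    union {
        float f;
        unsigned u;
    } x;
    x.f = -10000.0f;
    printf("bit pattern: 0x%08X\n", x.u); /* typically 0xC61C4000 on IEEE-754 systems */
    /* Note: a direct value conversion, (unsigned) x.f, would be undefined
       behavior here, per the conversion ranges quoted above. */
    return 0;
}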

float x = 10000.0f;
int a;
a = (int)(x + 0.5); /* rounds to nearest, but only correct for non-negative x */
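Since adding 0.5 misrounds negative inputs ((int)(-10000.4 + 0.5) yields -9999, not -10000), a hedged alternative is C99's lroundf, which rounds halfway cases away from zero:

#include <math.h>

int round_to_int(float x)    /* helper name is mine, for illustration */
{
    return (int) lroundf(x); /* caller must ensure the result fits in int */
}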


Why is integer conversion done this way? [duplicate]

I ran into an interesting scenario with integer conversion:
#include <stdio.h>

int main()
{
    unsigned int x = 20;
    unsigned int y = 40;
    printf("%d\n", x - y);
    printf("%d\n", (x - y) / 4);
}
~ % ./a.out
-20
1073741819
I wasn't expecting the 2nd result. Since x and y are both unsigned ints, is the result of x - y unsigned (and in this case displayed as signed by printf)?
The things you are printing are indeed unsigned integers, and you should print them with %u, but that alone does not explain the surprising result for the second number. The surprising result comes from the wraparound that occurs when you calculate x - y: the mathematical result of that subtraction is negative and thus not representable as an unsigned int, so it wraps around modulo UINT_MAX + 1.
Unsigned overflow/underflow is not undefined behavior, so it's OK to have code like this in production if you know what you are doing.
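To make the wraparound concrete (assuming a 32-bit unsigned int):

#include <stdio.h>

int main(void)
{
    unsigned int x = 20;
    unsigned int y = 40;
    printf("%u\n", x - y);       /* 4294967276, i.e. 2^32 - 20 */
    printf("%u\n", (x - y) / 4); /* 4294967276 / 4 = 1073741819 */
}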

fast double exp2 function in C [closed]

I need a version of the following fast exp2 function working in double precision; can you help me? Please don't answer by saying that it is an approximation, so a double version is pointless and casting the result to (double) is enough. The function, which I found somewhere and which works for me, is the following; it is much faster than exp2f(), but unfortunately I could not find any double-precision version:
#include <stdint.h>

inline float fastexp2(float p)
{
    if (p < -126.f) p = -126.f;
    int w = (int) p;
    float z = p - (float) w;
    if (p < 0.f) z += 1.f;
    union { uint32_t i; float f; } v = { (uint32_t)((1 << 23) * (p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z)) };
    return v.f;
}
My assumption is that the existing code from the question assumes IEEE-754 binary floating-point computation, in particular mapping C's float type to IEEE-754's binary32 format.
The existing code suggests that only floating-point results in the normal range are of interest: subnormal results are avoided by clamping the input from below and overflow is ignored. So for float computation valid inputs are in the interval [-126, 128). By exhaustive test, I found that the maximum relative error of the function in the question is 7.16e-5, and that the root-mean-square (RMS) error is 1.36e-5.
My assumption is that the desired change to double computation should widen the range of allowed inputs to [-1022, 1024), and that identical relative accuracy should be maintained. The code is written in a fairly cryptic fashion. So as a first step, I rearranged it into a more readable version. In a second step, I tweaked the coefficients of the core approximation so as to minimize the maximum relative error. This results in the following ISO-C99 code:
/* compute 2**p, for p in [-126, 128). Maximum relative error: 5.04e-5; RMS error: 1.03e-5 */
#include <stdint.h>  // int32_t
#include <string.h>  // memcpy
#include <math.h>    // floorf

float fastexp2 (float p)
{
    const int FP32_MIN_EXPO = -126; // exponent of minimum binary32 normal
    const int FP32_MANT_BITS = 23;  // number of stored mantissa (significand) bits
    const int FP32_EXPO_BIAS = 127; // binary32 exponent bias
    float res;
    p = (p < FP32_MIN_EXPO) ? FP32_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    float w = floorf (p); // integral part
    float z = p - w;      // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    float approx = -0x1.6e7592p+2f + 0x1.bba764p+4f / (0x1.35ed00p+2f - z) - 0x1.f5e546p-2f * z;
    /* assemble the exponent and mantissa components into final result */
    int32_t resi = ((1 << FP32_MANT_BITS) * (w + FP32_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}
Refactoring and retuning of the coefficients resulted in slight improvements in accuracy, with a maximum relative error of 5.04e-5 and an RMS error of 1.03e-5. It should be noted that floating-point arithmetic is generally not associative; therefore, any re-association of floating-point operations, whether by manual code changes or by compiler transformations, could affect the stated accuracy and requires careful re-testing.
I do not expect any need to modify the code as it compiles into efficient machine code for common architectures, as can be seen from trial compilations with Compiler Explorer, e.g. gcc with x86-64 or gcc with ARM64.
At this point it is obvious what needs to be changed to switch to double computation: change all instances of float to double and all instances of int32_t to int64_t, adjust the type suffixes of literal constants and math functions, and replace the format-specific parameters of IEEE-754 binary32 with those of IEEE-754 binary64. The core approximation also needs re-tuning to make the best possible use of double-precision coefficients.
/* compute 2**p, for p in [-1022, 1024). Maximum relative error: 4.93e-5; RMS error: 9.91e-6 */
#include <stdint.h>  // int64_t
#include <string.h>  // memcpy
#include <math.h>    // floor

double fastexp2 (double p)
{
    const int FP64_MIN_EXPO = -1022; // exponent of minimum binary64 normal
    const int FP64_MANT_BITS = 52;   // number of stored mantissa (significand) bits
    const int FP64_EXPO_BIAS = 1023; // binary64 exponent bias
    double res;
    p = (p < FP64_MIN_EXPO) ? FP64_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    double w = floor (p); // integral part
    double z = p - w;     // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    double approx = -0x1.6e75d58p+2 + 0x1.bba7414p+4 / (0x1.35eccbap+2 - z) - 0x1.f5e53c2p-2 * z;
    /* assemble the exponent and mantissa components into final result */
    int64_t resi = ((1LL << FP64_MANT_BITS) * (w + FP64_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}
Both the maximum relative error and the root-mean-square error decrease very slightly, to 4.93e-5 and 9.91e-6, respectively. This is as expected, because for an approximation that is roughly accurate to 15 bits it matters little whether intermediate computation is performed with 24 bits of precision or 53. The computation uses a division, which tends to be slower for double than for float on all platforms I am familiar with, so the double-precision port doesn't seem to provide any significant advantage other than perhaps eliminating conversion overhead if the calling code uses double-precision computation.
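For completeness, a minimal usage sketch (my own, not part of the answer) comparing the double-precision port above against the C99 standard library's exp2:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 10.5;
    printf("fastexp2(%g) = %.10g\n", p, fastexp2(p)); /* approximation */
    printf("exp2(%g)     = %.10g\n", p, exp2(p));     /* reference */
    return 0;
}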

Bitset membership function [closed]

I'm writing a bitset membership predicate which should handle all values of x:
int Contains(unsigned int A, int x)
{
    return (x >= 0) && (x < 8 * sizeof A) && (((1u << x) & A) != 0);
}
Is there a more efficient implementation?
You can skip the lower-bound check by converting x to unsigned.
From N1570:
6.3.1.3 Signed and unsigned integers
When a value with integer type is converted to another integer type other than _Bool, if
the value can be represented by the new type, it is unchanged.
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type. 60)
unsigned int y = x; // Wraps around if x is negative
return (y < CHAR_BIT * sizeof A) && ((1u << y) & A) != 0;
Not sure if this really brings any meaningful improvement though. Check your compiler output to make sure.
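Spelled out as a full function (a sketch following the quoted rule; it keeps the question's signature and needs <limits.h> for CHAR_BIT):

#include <limits.h>

int Contains(unsigned int A, int x)
{
    unsigned int y = x; /* a negative x wraps to a large value and fails the bound check */
    return (y < CHAR_BIT * sizeof A) && ((1u << y) & A) != 0;
}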

Test the subtraction of multiple unsigned int

After a few unsuccessful searches, I still don't know if there's a way to subtract two unsigned ints (or more) and detect whether the result of the subtraction is negative.
I've tried things like:
if(((int)x - (int)y) < 0)
But I don't think it's the best way.
Realize that what you intend by
unsigned int x;
unsigned int y;
if (x - y < 0)
is mathematically equivalent to:
unsigned int x;
unsigned int y;
if (y > x)
EDIT
There aren't many questions for which I can assert a definitive proof, but I can for this one. It's basic inequality algebra:
x - y < 0
add y to both sides:
x < y, which is the same as y > x.
You can do similarly with more variables, if you need:
x - y - z < 0 is equivalent to x < y + z, or y + z > x
see chux's comment to his own answer, though, for a valid warning about integer overflow when dealing with multiple values.
Simply compare.
unsigned x, y, diff;
diff = x - y;
if (x < y) {
    printf("Difference is negative and not representable as an unsigned.\n");
}
[Edit] The OP changed from "2 unsigned int" to "multiple unsigned int".
I am confident that N*(N-1)/2 compares would be needed if a wider integer width is not available when subtracting N unsigned values.
With N > 2, the simplest approach, if available, is to use wider integers, such as:
long long diff;
// or
#include <stdint.h>
intmax_t diff;
Depending on your platform, these types may or may not be wider than unsigned; they are certainly not narrower.
Note: this issue similarly applies to multiple signed int too, though different compares are used there. But that is another question.
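A sketch of the wider-type approach for three values (it assumes long long is wider than unsigned int, which holds on typical platforms with 32-bit unsigned and 64-bit long long):

#include <stdio.h>

int main(void)
{
    unsigned int x = 10, y = 20, z = 30;
    long long diff = (long long) x - y - z; /* exact, given the width assumption */
    if (diff < 0) {
        printf("x - y - z is negative: %lld\n", diff);
    }
    return 0;
}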

char or float overflow

I am unsure how to tell whether an overflow is possible. I am given this sample code:
char x;
float f, g;
// some values get assigned to x and f here...
g = f + x;
Can someone please explain?
A float at its highest limits (binary exponent of 127) does not have sufficient precision (23 bits) to register a difference as small as the largest possible char (127, 7 bits). Overflow is therefore not possible: the addition has no effect, since a precision of roughly 127 - 7 = 120 bits would be required for the char to change the sum.
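A demonstration of that reasoning (assuming IEEE-754 binary32 float, where the spacing between adjacent values near FLT_MAX is 2^104, vastly larger than any char):

#include <float.h>
#include <stdio.h>

int main(void)
{
    float f = FLT_MAX;
    char x = 127;
    float g = f + x;              /* the char is absorbed by rounding */
    printf("%d\n", g == FLT_MAX); /* prints 1: no overflow occurred */
    return 0;
}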
