Floats and Longs - c

I used sizeof to check the sizes of long and float on my 64-bit AMD Opteron machine. Both show up as 4.
When I check limits.h and float.h for maximum float and long values these are the values I get:
Max value of Float:340282346638528859811704183484516925440.000000
Max value of long:9223372036854775807
Since they both are of the same size, how can a float store such a huge value when compared to the long?
I assume that they have a different storage representation for float. If so, does this impact performance, i.e., is using longs faster than using floats?

It is a tradeoff.
A 32 bit signed integer can express every integer between -2^31 and +2^31-1.
A 32 bit float uses exponential notation and can express a much wider range of numbers, but would be unable to express all of the numbers in the range -- not even all of the integers. It uses some of the bits to represent a fraction, and the rest to represent an exponent. It is effectively the binary equivalent of a notation like 6.023 × 10^23 or what have you, with the distance between representable numbers quite large at the ends of the range.
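To make that concrete, here is a small sketch (assuming a 32-bit int and an IEEE 754 single-precision float, which is what most current platforms use):

#include <stdio.h>

int main(void)
{
    /* 16777216 (2^24) is exactly representable as a float,
       but 16777216 + 1 rounds back down to 16777216.       */
    float f = 16777216.0f;
    printf("%.1f\n", f + 1.0f);   /* prints 16777216.0 */

    /* A 32-bit int has no such gaps anywhere in its range. */
    int i = 16777216;
    printf("%d\n", i + 1);        /* prints 16777217 */
    return 0;
}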
For more information, I would read this article, "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg: http://web.cse.msu.edu/~cse320/Documents/FloatingPoint.pdf
By the way, on your platform, I would expect a float to be a 32 bit quantity and a long to be a 64 bit quantity, but that isn't really germane to the overall point.
Performance is kind of hard to define here. Floating point operations may or may not take significantly longer than integer operations depending on the nature of the operations and whether hardware acceleration is used for them. Typically, operations like addition and subtraction are much faster in integer -- multiplication and division less so. At one point, people trying to bum every cycle out when doing computation would represent real numbers as "fixed point" arithmetic and use integers to represent them, but that sort of trick is much rarer now. (On an Opteron, such as you are using, floating point arithmetic is indeed hardware accelerated.)
Almost all platforms that C runs on have distinct "float" and "double" representations, with "double" floats being double precision, that is, a representation that occupies twice as many bits. In addition to the space tradeoff, operations on these are often somewhat slower, and again, people highly concerned about performance will try to use floats if the precision of their calculation does not demand doubles.

It's unlikely to matter whether operations on long are faster than operations on float, or vice versa.
If you only need to represent whole number values, use an integer type. Which type you should use depends on what you're using it for (signed vs. unsigned, short vs. int vs. long vs. long long, or one of the exact-width types in <stdint.h>).
If you need to represent real numbers, use one of the floating-point types: float, double, or long double. (float is actually not used much unless memory space is at a premium; double has better precision and often is no slower than float.)
In short, choose a type whose semantics match what you need, and worry about performance later. There's no great advantage in getting wrong answers quickly.
As for storage representation, the other answers have pretty much covered that. Typically unsigned integers use all their bits to represent the value, signed integers devote one bit to representing the sign (though usually not directly), and floating-point types devote one bit to the sign, a few bits to an exponent, and the rest to the value. (That's a gross oversimplification.)

Floating point maths is a subject all to itself, but yes: int types are typically faster than float types.
One thing to remember is that not all values can be expressed exactly as a float.
For example, the closest you may be able to get to 1.9 is 1.899999976. This leads to fun bugs where you write if (v == 1.9) and things behave unexpectedly!
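A minimal sketch of that pitfall (assuming IEEE 754 single precision; the 1e-6 tolerance is just an illustrative choice):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float v = 1.9f;                 /* stored as the nearest float, not exactly 1.9 */

    if (v == 1.9)                   /* v is promoted to double; the two values differ */
        printf("equal\n");
    else
        printf("not equal\n");      /* this branch is taken */

    if (fabs(v - 1.9) < 1e-6)       /* compare against a tolerance instead */
        printf("close enough\n");

    return 0;
}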

If so, does this impact performance, i.e., is using longs faster than using floats?
Yes, arithmetic with longs will be faster than with floats.
I assume that they have a different storage representation for float.
Yes. The float types are in IEEE 754 (single precision) format.
Since they both are of the same size, how can a float store such a huge value when compared to the long?
The float format is optimized to represent numbers densely near certain points (near 0, for example), but it is not optimized for uniform accuracy across its whole range. For example you could add 1 to 1000000000. With the float, there probably won't be any difference in the sum (1000000000 instead of 1000000001), but with the long there will be.
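A small illustration of that (assuming a 32-bit float; long is at least 32 bits, so the integer sum is exact either way):

#include <stdio.h>

int main(void)
{
    float f = 1000000000.0f;        /* exactly representable as a float    */
    long  l = 1000000000L;

    printf("%.1f\n", f + 1.0f);     /* 1000000000.0 -- the added 1 is lost */
    printf("%ld\n",  l + 1);        /* 1000000001   -- exact, as expected  */
    return 0;
}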

Related

Is there a fixed point representation available in C or Assembly

As far as I know, representing a fraction in C relies on floats and doubles which are in floating point representation.
Assume I'm trying to represent 1.5 which is a fixed point number (only one digit to the right of the radix point). Is there a way to represent such number in C or even assembly using a fixed point data type?
Are there even any fixed point instructions on x86 (or other architectures) which would operate on such type?
Every integral type can be used as a fixed point type. A favorite of mine is to use int64_t with an implied 8 digit shift, e.g. you store 1.5 as 150000000 (1.5e8). You'll have to analyze your use case to decide on an underlying type and how many digits to shift (that is, assuming you use base-10 scaling, which most people do). But 64 bits scaled by 10^8 is a pretty reasonable starting point with a broad range of uses.
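A rough sketch of that scheme (the 10^8 scale factor follows the answer above; the naive multiply shown here can overflow for large operands and ignores rounding):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SCALE 100000000LL           /* 10^8: eight implied decimal digits */

int main(void)
{
    int64_t a = 150000000;          /* represents 1.5  */
    int64_t b = 25000000;           /* represents 0.25 */

    int64_t sum  = a + b;           /* addition needs no rescaling               */
    int64_t prod = a * b / SCALE;   /* multiplication must divide out one SCALE  */

    printf("%" PRId64 ".%08" PRId64 "\n", sum  / SCALE, sum  % SCALE);  /* 1.75000000 */
    printf("%" PRId64 ".%08" PRId64 "\n", prod / SCALE, prod % SCALE);  /* 0.37500000 */
    return 0;
}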
While some C compilers offer special fixed-point types as an extension (not part of the standard C language), there's really very little use for them. Fixed point is just integers, interpreted with a different unit. For example, fixed point currency in typical cent denominations is just using integers that represent cents instead of dollars (or whatever the whole currency unit is) for your unit. Likewise, you can think of 8-bit RGB as having units of 1/256 or 1/255 "full intensity".
Adding and subtracting fixed point values with the same unit is just adding and subtracting integers. This is just like arithmetic with units in the physical sciences. The only value in having the language track that they're "fixed point" would be ensuring that you can only add/subtract values with matching units.
For multiplication and division, the result will not have the same units as the operands, so you have to either treat the result as a different fixed-point type or renormalize. For example, if you multiply two values representing 1/16 units, the result will have 1/256 units. You can then scale the value down by a factor of 16 (rounding in whatever way is appropriate) to get back to a value with 1/16 units, as in the sketch below.
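For instance (a sketch using binary scaling; rounding on the renormalizing shift is ignored for brevity):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Values carry an implied factor of 1/16 (Q4 fixed point). */
    int32_t a = 24;              /* 24/16 = 1.5  */
    int32_t b = 40;              /* 40/16 = 2.5  */

    int32_t raw = a * b;         /* 960, now in 1/256 units (1.5 * 2.5 = 3.75) */
    int32_t c   = raw >> 4;      /* renormalize back to 1/16 units: 60         */

    printf("%g\n", c / 16.0);    /* prints 3.75 */
    return 0;
}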
If the issue here is representing decimal values as fixed point, there's probably a library for this in C; you could try a web search. You could create your own BCD fixed point library in assembly, using the BCD-related instructions AAA (adjust after addition), AAS (adjust after subtraction) and AAM (adjust after multiplication). However, these instructions are invalid in x86-64 (64-bit) mode, so you'll need to build a 32-bit program, which should still be runnable on a 64-bit OS.
Financial institutions in the USA and other countries are required by law to perform decimal based math on currency values, to avoid decimal -> binary -> decimal conversion issues.

Converting 32-bit number to 16 bits or less

On my mbed LPC1768 I have an ADC on a pin which, when polled, returns a 16-bit short number normalised to a floating point value between 0 and 1.
Because it converts it to a floating point number, does that mean it's 32 bits? The number I have is given to six decimal places.
I'm running Autocorrelation and I want to reduce the time it takes to complete the analysis.
Is it correct that the floating point numbers are 32 bits long, and if so, is it correct that multiplying two 32-bit floating point numbers will take a lot longer than multiplying two 16-bit short (non-decimal) values together?
I am working with C to program the mbed.
Cheers.
I should be able to comment on this quite accurately. I used to do DSP processing work where we would "integerize" code, which effectively meant we'd take a signal/audio/video algorithm and replace all the floating point logic with fixed point arithmetic (i.e. Qm.n notation, etc.).
On most modern systems, you'll usually get better performance using integer arithmetic, compared to floating point arithmetic, at the expense of more complicated code you have to write.
The chip you are using (a Cortex-M3) doesn't have a hardware FPU: floating point operations have to be emulated in software, so they are going to be expensive (take a lot of time).
In your case, you could just read the 16-bit value via read_u16(), shift the value right by 4 bits, and you're done, as sketched below. If you're working with audio data, you might consider looking into companding algorithms (a-law, u-law), which will give better subjective performance than simply chopping off the 4 LSBs to get a 12-bit number from a 16-bit number.
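Something along these lines (a sketch; read_u16() here stands in for however you obtain the left-justified 16-bit sample from the mbed API):

#include <stdint.h>

extern uint16_t read_u16(void);    /* assumed: returns the ADC sample left-justified in 16 bits */

uint16_t read_adc_12bit(void)
{
    uint16_t raw16 = read_u16();   /* 0..65535, top 12 bits are the sample      */
    return (uint16_t)(raw16 >> 4); /* 0..4095, plain integer, no floating point */
}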
Yes, a float on that system is 32bit, and is likely represented in IEEE754 format. Multiplying a pair of 32-bit values versus a pair of 16-bit values may very well take the same amount of time, depending on the chip in use and the presence of an FPU and ALU. On your chip, multiplying two floats will be horrendously expensive in terms of time. Also, if you multiply two 32-bit integers, they could potentially overflow, so there is one potential reason to go with floating point logic if you don't want to implement a fixed-point algorithm.
It is correct to assume that multiplying two 32-bit floating point numbers will take longer than multiplying two 16-bit short values if special hardware (a floating point unit) is not present in the processor.

Why don't the authors of the C99 standard specify a standard for the size of floating point types?

I noticed on Windows and Linux x86, float is a 4-byte type, double is 8, but long double is 12 and 16 on x86 and x86_64 respectively. C99 is supposed to be breaking such barriers with the specific integral sizes.
The initial technological limitation appears to be due to the x86 processor not being able to handle more than 80-bit floating point operations (plus 2 bytes to round it up) but why the inconsistency in the standard compared to int types? Why don't they go at least to 80-bit standardization?
The C language doesn't specify the implementation of various types, so that it can be efficiently implemented on as wide a variety of hardware as possible.
This extends to the integer types too - the C standard integral types have minimum ranges (e.g. signed char is -127 to 127, short and int are both -32,767 to 32,767, long is -2,147,483,647 to 2,147,483,647, and long long is -9,223,372,036,854,775,807 to 9,223,372,036,854,775,807). For almost all purposes, this is all that the programmer needs to know.
C99 does provide "fixed-width" integer types, like int32_t - but these are optional - if the implementation can't provide such a type efficiently, it doesn't have to provide it.
For floating point types, there are equivalent limits (e.g. double must have at least 10 decimal digits worth of precision).
They were trying to (mostly) accommodate pre-existing C implementations, some of which don't even use IEEE floating point formats.
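To see what a particular implementation actually provides, you can simply print the macros from <limits.h> and <float.h> (a small sketch; the values vary by platform):

#include <stdio.h>
#include <limits.h>
#include <float.h>

int main(void)
{
    printf("int:    %d .. %d\n",   INT_MIN,  INT_MAX);
    printf("long:   %ld .. %ld\n", LONG_MIN, LONG_MAX);
    printf("float:  %d decimal digits, max %g\n", FLT_DIG, (double)FLT_MAX);
    printf("double: %d decimal digits, max %g\n", DBL_DIG, DBL_MAX);
    return 0;
}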
ints can be used to represent abstract things like ids, colors, error codes, requests, etc. In this case ints are not really used as numbers but as sets of bits (i.e. a container). Most of the time a programmer knows exactly how many bits are needed, so they want to be able to use just as many bits as needed.
floats on the other hand are designed for a very specific usage (floating point arithmetic). You are very unlikely to be able to say precisely how many bits you need for your float.
Actually, most of the time the more bits you have the better it is.
C99 is supposed to be breaking such barriers with the specific integral sizes.
No, those fixed-width (u)intN_t types are completely optional, because not all processors use type sizes that are a power of 2. C99 only requires that (u)int_fastN_t and (u)int_leastN_t be defined. That means the premise "why the inconsistency in the standard compared to int types" is simply wrong, because there's no such consistency in the sizes of the int types either.
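A brief sketch of the distinction (all of these typedefs come from <stdint.h>):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t       exact;   /* exactly 32 bits -- optional, only if the hardware supports it */
    int_least32_t least;   /* smallest type with at least 32 bits -- always provided        */
    int_fast32_t  fast;    /* "fastest" type with at least 32 bits -- always provided       */

    printf("%zu %zu %zu\n", sizeof exact, sizeof least, sizeof fast);
    return 0;
}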
Lots of modern DSPs use 24-bit word for 24-bit audio. There are even 20-bit DSPs like the Zoran ZR3800x family or 28-bit DSPs like the ADAU1701 which allows transformation of 16/24-bit audio without clipping. Many 32 or 64-bit architectures also have some odd-sized registers to allow accumulation of values without overflow, for example the TI C5500/C6000 with 40-bit long and SHARC with 80-bit accumulator. The Motorola DSP5600x/3xx series also has odd sizes: 2-byte short, 3-byte int, 6-byte long. In the past there were lots of architectures with other word sizes like 12, 18, 36, 60-bit... and lots of CPUs that use one's complement of sign-magnitude. See Exotic architectures the standards committees care about
C was designed to be flexible to support all kinds of such platforms. Specifying a fixed size, whether for integer or floating-point types, defeats that purpose. Floating-point support in hardware varies wildly just like integer support. There are different formats that use decimal, hexadecimal or possibly other bases. Each format has different sizes of exponent/mantissa, different positions of sign/exponent/mantissa, and even different signed representations. For example, some use two's complement for the mantissa while others use two's complement for the exponent or the whole floating-point value. You can see many formats here but that's obviously not every format that ever existed. For example, the SHARC above has a special 40-bit floating-point format. Some platforms also use double-double arithmetic for long double. See also
What uncommon floating-point sizes exist in C++ compilers?
Do any real-world CPUs not use IEEE 754?
That means you can't standardize a single floating-point format for all platforms, because there's no one-size-fits-all solution. If you're designing a DSP then obviously you need a format that's best for your purpose so that you can churn through as much data as possible. There's no reason to use IEEE-754 binary64 when a 40-bit format has enough precision for your application, fits better in cache and needs far less die size. Or if you're on a small embedded system then an 80-bit long double is usually useless, as you don't even have enough ROM for that 80-bit long double library. That's why some platforms limit long double to 64 bits, the same as double.

Why does the value of this float change from what it was set to?

Why is this C program giving the "wrong" output?
#include <stdio.h>
#include <conio.h>   /* for getch() in VC++ */

int main(void)
{
    float f = 12345.054321;
    printf("%f\n", f);
    getch();         /* keep the console window open */
    return 0;
}
Output:
12345.054688
But the output should be, 12345.054321.
I am using VC++ in VS2008.
It's giving the "wrong" answer simply because not all real values are representable by floats (or doubles, for that matter). What you'll get is an approximation based on the underlying encoding.
In order to represent every real value, even between 1.0×10^-100 and 1.1×10^-100 (a truly minuscule range), you still require an infinite number of bits.
Single-precision IEEE754 values have only 32 bits available (some of which are tasked to other things such as the exponent and NaN/Inf representations) and cannot therefore give you infinite precision. They actually have 23 bits available, giving precision of about 2^24 (there's an extra implicit bit), or just over 7 decimal digits (log10(2^24) is roughly 7.2).
I enclose the word "wrong" in quotes because it's not actually wrong. What's wrong is your understanding about how computers represent numbers (don't be offended though, you're not alone in this misapprehension).
Head on over to http://www.h-schmidt.net/FloatApplet/IEEE754.html and type your number into the "Decimal representation" box to see this in action.
If you want a more accurate number, use doubles instead of floats - these have double the number of bits available for representing values (assuming your C implementation is using IEEE754 single and double precision data types for float and double respectively).
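For example (a sketch assuming IEEE 754 float and double, as VC++ uses):

#include <stdio.h>

int main(void)
{
    float  f = 12345.054321f;
    double d = 12345.054321;

    printf("%f\n", f);   /* 12345.054688 -- nearest representable float  */
    printf("%f\n", d);   /* 12345.054321 -- double has enough precision  */
    return 0;
}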
If you want arbitrary precision, you'll need to use a "bignum" library like GMP although that's somewhat slower than native types so make sure you understand the trade-offs.
The decimal number 12345.054321 cannot be represented accurately as a float on your platform. The result that you are seeing is a decimal approximation to the closest number that can be represented as a float.
floats are about convenience and speed, and use a binary representation - if you care about precision use a decimal type.
To understand the problem, read What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
For a solution, see the Decimal Arithmetic FAQ:
http://speleotrove.com/decimal/decifaq.html
It's all to do with precision. Your number cannot be stored accurately in a float.
Single-precision floating point values can only represent about seven significant (decimal) digits. Beyond that point, you're seeing quantization error.

Can doubles be used to represent a 64 bit number without loss of precision

I want to use Lua (which internally uses only doubles) to represent an integer between 0 and 2^64-1 that can't have rounding errors, or terrible things will happen.
Is it possible to do so?
No
At least some bits of a 64-bit double must be used to represent the exponent (position of the binary point), and hence there are fewer than 64-bits available for the actual number. So no, a 64-bit double can't represent all the values a 64-bit integer can (and vice-versa).
Even though you've gotten some good answers to your question about 64-bit types, you may still want a practical solution to your specific problem. The most reliable solution I know of is to build Lua 5.1 with the LNUM patch (also known as the Lua integer patch) which can be downloaded from LuaForge. If you aren't planning on building Lua from the C source, there is at least one pure Lua library that handles 64-bit signed integers -- see the Lua-users wiki.
A double is a 64-bit type itself. However, you lose 1 bit for the sign and 11 for the exponent.
So the answer is no: it can't be done.
From memory, a double can represent a 53-bit signed integer exactly.
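A quick way to check that limit (assuming IEEE 754 doubles):

#include <stdio.h>

int main(void)
{
    double d = 9007199254740992.0;   /* 2^53: exactly representable            */

    printf("%.0f\n", d);             /* 9007199254740992                       */
    printf("%.0f\n", d + 1.0);       /* still 9007199254740992: 2^53 + 1 has no
                                        exact double representation            */
    return 0;
}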
No, you cannot use Double to store 64-bit integers without losing precision.
However, you can apply a Lua patch that adds support for true 64-bit integers to the Lua interpreter. Apply the LNUM patch to your Lua source and recompile.
On 64 bits you can only store 2^64 different codes. This means that a 64-bit type which can represent 2^64 integers doesn't have any place for representing something else, such as floating point numbers.
Obviously double can represent a lot of non-integer numbers, so it can't fit your requirements.
An IEEE 754 double cannot represent every 64-bit integer exactly. It can, however, represent every 32-bit integer value exactly.
I know nothing about Lua, but if you could figure out how to perform bitwise manipulation of the double in that language, you could in theory make a wrapper class that takes your number in the form of a string and sets the bits in the order that represents the number you gave it.
A more practical solution would be to use some bignum library.

Resources