As far as I know, representing a fraction in C relies on floats and doubles which are in floating point representation.
Assume I'm trying to represent 1.5 which is a fixed point number (only one digit to the right of the radix point). Is there a way to represent such number in C or even assembly using a fixed point data type?
Are there even any fixed point instructions on x86 (or other architectures) which would operate on such type?
Every integral type can be used as a fixed point type. A favorite of mine is to use int64_t with an implied 8 digit shift, e.g. you store 1.5 as 150000000 (1.5e8). You'll have to analyze your use case to decide on an underlying type and how many digits to shift (that is, assuming you use base-10 scaling, which most people do). But 64 bits scaled by 10^8 is a pretty reasonable starting point with a broad range of uses.
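A minimal sketch of this scheme in C, assuming the 10^8 scale described above (the SCALE constant and the printing helper are just illustrative, and the printing assumes non-negative values):

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define SCALE 100000000LL   /* implied shift of 8 decimal digits */

    int main(void)
    {
        int64_t a = 150000000;            /* 1.5  stored as 1.5e8  */
        int64_t b = 25000000;             /* 0.25 stored as 0.25e8 */

        int64_t sum  = a + b;             /* addition needs no adjustment */
        int64_t prod = (a * b) / SCALE;   /* multiplication must be rescaled */

        printf("sum  = %" PRId64 ".%08" PRId64 "\n", sum / SCALE, sum % SCALE);
        printf("prod = %" PRId64 ".%08" PRId64 "\n", prod / SCALE, prod % SCALE);
        return 0;
    }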
While some C compilers offer special fixed-point types as an extension (not part of the standard C language), there's really very little use for them. Fixed point is just integers, interpreted with a different unit. For example, fixed point currency in typical cent denominations is just using integers that represent cents instead of dollars (or whatever the whole currency unit is) for your unit. Likewise, you can think of 8-bit RGB as having units of 1/256 or 1/255 "full intensity".
Adding and subtracting fixed point values with the same unit is just adding and subtracting integers. This is just like arithmetic with units in the physical sciences. The only value in having the language track that they're "fixed point" would be ensuring that you can only add/subtract values with matching units.
For multiplication and division, the result will not have the same units as the operands, so you have to either treat the result as a different fixed-point type or renormalize. For example, if you multiply two values representing 1/16 units, the result will have 1/256 units. You can then scale the value down by a factor of 16 (rounding in whatever way is appropriate) to get back to a value with 1/16 units.
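A small sketch of that 1/16-unit case in C (the q4_t name and helper are illustrative; rounding of negative products would need more care than the simple shift used here):

    #include <stdint.h>
    #include <stdio.h>

    /* Values in 1/16 units (4 fractional bits). */
    typedef int32_t q4_t;

    /* Multiply two 1/16-unit values; the raw product is in 1/256 units,
       so shift right by 4 with rounding to get back to 1/16 units. */
    static q4_t q4_mul(q4_t a, q4_t b)
    {
        int64_t raw = (int64_t)a * b;      /* 1/256 units */
        return (q4_t)((raw + 8) >> 4);     /* round to nearest 1/16 (non-negative case) */
    }

    int main(void)
    {
        q4_t x = 24;  /* 1.5 = 24/16 */
        q4_t y = 40;  /* 2.5 = 40/16 */
        q4_t z = q4_mul(x, y);
        printf("%d/16 = %.4f\n", z, z / 16.0);  /* expect 3.75 */
        return 0;
    }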
If the issue here is representing decimal values as fixed point, there's probably a library for this for C; you could try a web search. You could create your own BCD fixed-point library in assembly, using the BCD-related instructions AAA (adjust after addition), AAS (adjust after subtraction) and AAM (adjust after multiplication). However, these instructions are invalid in x86-64 (64-bit) mode, so you'd need to build a 32-bit program, which can still run on a 64-bit OS.
Financial institutions in the USA and other countries are required by law to perform decimal based math on currency values, to avoid decimal -> binary -> decimal conversion issues.
On my mbed LPC1768 I have an ADC on a pin which, when polled, returns a 16-bit short number normalised to a floating point value between 0 and 1. Document here.
Because it converts it to a floating point number, does that mean it's 32 bits? The number I have is given to six decimal places. Data Types here
I'm running Autocorrelation and I want to reduce the time it takes to complete the analysis.
Is it correct that the floating point numbers are 32 bits long, and if so, is it correct that multiplying two 32-bit floating point numbers will take a lot longer than multiplying two 16-bit short (non-decimal) values together?
I am working with C to program the mbed.
Cheers.
I should be able to comment on this quite accurately. I used to do DSP processing work where we would "integerize" code, which effectively meant we'd take a signal/audio/video algorithm and replace all the floating point logic with fixed point arithmetic (i.e. Qm.n notation, etc.).
On most modern systems, you'll usually get better performance using integer arithmetic, compared to floating point arithmetic, at the expense of more complicated code you have to write.
The chip you are using (Cortex-M3) doesn't have a dedicated hardware FPU: floating point operations have to be emulated in software, so they are going to be expensive (take a lot of time).
In your case, you could just read the 16-bit value via read_u16(), and shift the value right 4 times, and you're done. If you're working with audio data, you might consider looking into companding algorithms (a-law, u-law), which will give a better subjective performance than simply chopping off the 4 LSBs to get a 12-bit number from a 16-bit number.
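A rough sketch of that approach in C, using a hypothetical adc_read_u16() stand-in for the read_u16() call mentioned above:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the mbed read_u16() call mentioned above. */
    static uint16_t adc_read_u16(void) { return 0xABCD; /* fake ADC reading */ }

    int main(void)
    {
        uint16_t raw    = adc_read_u16();  /* full 16-bit ADC reading */
        uint16_t sample = raw >> 4;        /* keep the top 12 bits, drop the 4 LSBs */

        /* 'sample' is a 12-bit integer (0..4095) that can go straight into an
           integer/fixed-point autocorrelation, with no float conversion at all. */
        printf("raw=%u sample=%u\n", raw, sample);
        return 0;
    }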
Yes, a float on that system is 32-bit, and is likely represented in IEEE 754 format. Multiplying a pair of 32-bit values versus a pair of 16-bit values may very well take the same amount of time, depending on the chip in use and the presence of an FPU and ALU. On your chip, multiplying two floats will be horrendously expensive in terms of time. Also, if you multiply two 32-bit integers, they could potentially overflow, so that is one potential reason to go with floating point logic if you don't want to implement a fixed-point algorithm.
It is correct to assume that multiplying two 32-bit floating point numbers will take longer than multiplying two 16-bit short values if special hardware (a floating point unit) is not present in the processor.
I have been reading a lot about floats and computer-processed floating-point operations. The biggest question I see when reading about them is why are they so inaccurate? I understand this is because binary cannot accurately represent all real numbers, so the numbers are rounded to the 'best' approximation.
My question is, knowing this, why do we still use binary as the base for computer operations? Surely using a larger base number than 2 would increase the accuracy of floating-point operations exponentially, would it not?
What are the advantages of using a binary number system for computers as opposed to another base, and has another base ever been tried? Or is it even possible?
First of all: you cannot represent all real numbers even when using, say, base 100. But you already know this. Anyway, this means inaccuracy will always arise from not being able to represent all real numbers.
Now let's talk about "what can higher bases bring you when doing math?": higher bases bring exactly nothing in terms of precision. Why?
If you want to use base 4, then a 16-digit base-4 number provides exactly 4^16 different values.
But you can get the same number of different values from a 32-digit base-2 number (2^32 = 4^16).
As another answer already said: transistors can either be on or off. So your newly designed base-4 registers need to be an abstraction over (base-2) ON/OFF 'bits'. This means: use two 'bits' to represent a base-4 digit. But you'll still get exactly 2^N levels by spending N 'bits' (or N/2 base-4 digits). You can only get better accuracy by spending more bits, not by increasing the base. Which base you 'imagine/abstract' your numbers to be in (e.g. like how printf can print these base-2 numbers in base 10) is really just a matter of abstraction and not precision.
Computers are built on transistors, which have a "switched on" state, and a "switched off" state. This corresponds to high and low voltage. Pretty much all digital integrated circuits work in this binary fashion.
Ignoring the fact that transistors just simply work this way, using a different base (e.g. base 3) would require these circuits to operate at an intermediate voltage state (or several) as well as 0V and their highest operating voltage. This is more complicated, and can result in problems at high frequencies - how can you tell whether a signal is just transitioning between 2V and 0V, or actually at 1V?
When we get down to the floating point level, we are (as nhahtdh mentioned in their answer) mapping an infinite space of numbers down to a finite storage space. It's an absolute guarantee that we'll lose some precision. One advantage of IEEE floats, though, is that the precision is relative to the magnitude of the value.
Update: You should also check out Tunguska, a ternary computer emulator. It uses base-3 instead of base-2, which makes for some interesting (albeit mind-bending) concepts.
We are essentially trying to map an infinite set of real numbers onto a finite space, so it is not even a problem of base anyway.
Base 2 is chosen, like Polynomial said, for implementation reasons, as it is easier to differentiate two levels of energy.
We can either throw more space at representing more numbers / increasing precision, limit the range that we want to encode, or mix the two.
It boils down to getting the most from the available chip area.
If you use on/off switches to represent numbers, you can't get more precision per switch than with a base-2 representation. This is simply because N switches can represent 2^N quantities no matter what you choose these values to be. There were early machines that used base 16 floating point digits, but each of these needed 4 binary bits, so the overall precision per bit was the same as base 2 (actually somewhat less due to edge cases).
If you choose a base that's not a power of 2, precision is obviously lost. For example, you need 4 bits to represent one decimal digit, but 6 of the 16 possible values of those 4 bits are never used. This system is called binary-coded decimal and it's still used occasionally, usually when doing computations with money.
Multi-level logic could efficiently implement other bases, but at least with current chip technologies it turns out to be very expensive to implement more than 2 levels. Even quantum computers are being designed assuming two quantum levels: quantum bits, or qubits.
The nature of the world and of math is what makes the floating point situation hopeless. There is a hierarchy of real numbers: Integer -> Rational -> Algebraic -> Transcendental. There's a wonderful mathematical proof, Cantor diagonalization, that most numbers, i.e. a "bigger infinity" than the other sets, are transcendental. Yet no matter what floating point system you choose, there will still be lowly rational numbers with no perfect representation (e.g. 1/3 in base 10). This is our universe. No amount of clever hardware design will change it.
Software can and does use rational representations, storing a numerator and denominator as integers. However with these your programmer's hands are tied. For example, square root is not "closed." Sqrt(2) has no rational representation.
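A toy sketch of such a rational representation in C (illustrative only: it ignores overflow, and sqrt(2) still has no home in it):

    #include <stdint.h>
    #include <stdio.h>

    /* A toy rational type: value = num / den, with den > 0. */
    typedef struct { int64_t num, den; } rat_t;

    static int64_t gcd64(int64_t a, int64_t b)
    {
        while (b != 0) { int64_t t = a % b; a = b; b = t; }
        return a < 0 ? -a : a;
    }

    static rat_t rat_mul(rat_t a, rat_t b)
    {
        rat_t r = { a.num * b.num, a.den * b.den };  /* can overflow for big operands */
        int64_t g = gcd64(r.num, r.den);
        if (g > 1) { r.num /= g; r.den /= g; }
        return r;
    }

    int main(void)
    {
        rat_t third = { 1, 3 }, half = { 1, 2 };
        rat_t p = rat_mul(third, half);              /* exactly 1/6, no rounding */
        printf("%lld/%lld\n", (long long)p.num, (long long)p.den);
        return 0;
    }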
There has been research with algebraic number reps and "lazy" reps of arbitrary reals that produce more digits as needed. Most work of this type seems to be in computational geometry.
Your first paragraph makes sense but the second is a non sequitur: a larger base would not make a difference to the precision.
The precision of a number depends on the amount of storage that is used for it. For example, a 16-bit binary number has the same precision as a 2-digit base-256 number: both take up the same amount of information.
See the usual floating point reference for more detail; it generalises to all bases.
Yes, computers have been built using other bases; I know of ones that used base 10 (decimal), cf. Wikipedia.
Yes, there are/have been computers that use other than binary (i.e., other than base-2 representations and arithmetic):
Decimal computers.
Designers of computing systems have looked into many alternatives. But it's hard to find a model that's as simple to implement in a physical device as one using two discrete states. So start with a binary circuit that's very easy and cheap to build, and work up to a computer with complex operations. That's the history of binary in a nutshell.
I am not a EE, so everything I say below may be totally wrong. But...
The advantage of binary is that it maps very cleanly to distinguishing between on/off (or, more accurately, high/low voltage) states in real circuits. Trying to distinguish between multiple voltages would, I think, present a bit more of a challenge.
This may go completely out the window if quantum computers make it out of the lab.
There are 2 issues arising from the use of binary floating-point numbers to represent mathematical real numbers -- well, there are probably a lot more issues, but 2 is enough for the time being.
All computer numbers are finite, so any number which requires an infinite number of digits cannot be accurately represented on a computer, whatever number base is chosen. That deals with pi, e, and most of the other real numbers.
Whatever base is chosen will have difficulty representing some fractions finitely. Base 2 can only approximate any fraction with a factor of 3 in the denominator, but base 5 or base 7 run into the same problem with other fractions.
Over the years computers with circuitry based on devices with more than 2 states have been built. The old Soviet Union developed a series of computers with 3-state devices and at least one US computer manufacturer at one time offered computers using 10-state devices for arithmetic.
I suspect that binary representation has won out (so far) because it is simple, both to reason about and to implement with current electronic devices.
I vote that we move to a rational number storage system: two 32-bit integers that evaluate as p/q. Multiplication and division will be really cheap operations. Yeah, there will be redundant evaluated numbers (1/2 = 2/4), but who really uses the full dynamic range of a 64-bit double anyway?
I'm neither an electrical engineer nor a mathematician, so take that into consideration when I make the following statement:
All floating point numbers can be represented as integers.
I used sizeof to check the sizes of longs and floats on my 64-bit AMD Opteron machine. Both show up as 4.
When I check limits.h and float.h for maximum float and long values these are the values I get:
Max value of Float:340282346638528859811704183484516925440.000000
Max value of long:9223372036854775807
Since they both are of the same size, how can a float store such a huge value when compared to the long?
I assume that they have a different storage representation for float. If so, does this impact performance: i.e., is using longs faster than using floats?
It is a tradeoff.
A 32-bit signed integer can express every integer between -2^31 and +2^31 - 1.
A 32-bit float uses exponential notation and can express a much wider range of numbers, but would be unable to express all of the numbers in the range -- not even all of the integers. It uses some of the bits to represent a fraction, and the rest to represent an exponent. It is effectively the binary equivalent of a notation like 6.023 * 10^23 or what have you, with the distance between representable numbers quite large at the ends of the range.
For more information, I would read this article, "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg: http://web.cse.msu.edu/~cse320/Documents/FloatingPoint.pdf
By the way, on your platform, I would expect a float to be a 32 bit quantity and a long to be a 64 bit quantity, but that isn't really germane to the overall point.
Performance is kind of hard to define here. Floating point operations may or may not take significantly longer than integer operations depending on the nature of the operations and whether hardware acceleration is used for them. Typically, operations like addition and subtraction are much faster in integer -- multiplication and division less so. At one point, people trying to bum every cycle out when doing computation would represent real numbers as "fixed point" arithmetic and use integers to represent them, but that sort of trick is much rarer now. (On an Opteron, such as you are using, floating point arithmetic is indeed hardware accelerated.)
Almost all platforms that C runs on have distinct "float" and "double" representations, with "double" floats being double precision, that is, a representation that occupies twice as many bits. In addition to the space tradeoff, operations on these are often somewhat slower, and again, people highly concerned about performance will try to use floats if the precision of their calculation does not demand doubles.
It's unlikely to matter whether operations on long are faster than operations on float, or vice versa.
If you only need to represent whole number values, use an integer type. Which type you should use depends on what you're using it for (signed vs. unsigned, short vs. int vs. long vs. long long, or one of the exact-width types in <stdint.h>).
If you need to represent real numbers, use one of the floating-point types: float, double, or long double. (float is actually not used much unless memory space is at a premium; double has better precision and often is no slower than float.)
In short, choose a type whose semantics match what you need, and worry about performance later. There's no great advantage in getting wrong answers quickly.
As for storage representation, the other answers have pretty much covered that. Typically unsigned integers use all their bits to represent the value, signed integers devote one bit to representing the sign (though usually not directly), and floating-point types devote one bit to the sign, a few bits to an exponent, and the rest to the value. (That's a gross oversimplification.)
Floating point maths is a subject all to itself, but yes: int types are typically faster than float types.
One trick to remember is that not all values can be expressed as a float.
e.g. the closest you may be able to get to 1.9 is 1.899999999. This leads to fun bugs where code like if (v == 1.9) behaves unexpectedly!
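A minimal C demonstration of that kind of bug:

    #include <stdio.h>

    int main(void)
    {
        float v = 1.9f;
        if (v == 1.9)          /* compares a float against the double constant 1.9 */
            printf("equal\n");
        else
            printf("not equal: v is actually %.10f\n", v);
        return 0;
    }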
If so, does this impact performance: ie, is using longs faster than using floats?
Yes, arithmetic with longs will be faster than with floats.
I assume that they have a different storage representation for float.
Yes. The float types are in IEEE 754 (single precision) format.
Since they both are of the same size, how can a float store such a huge value when compared to the long?
It's optimized to store numbers densely near a few points (near 0, for example), but it's not optimized to be exact everywhere. For example, add 1 to 1000000000: with the float there probably won't be any difference in the sum (1000000000 instead of 1000000001), but with the long there will be.
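A small C sketch of that example:

    #include <stdio.h>

    int main(void)
    {
        long  li = 1000000000L;
        float fl = 1000000000.0f;

        li += 1;
        fl += 1.0f;   /* 1 is smaller than the gap between adjacent floats near 1e9 */

        printf("long : %ld\n", li);   /* 1000000001 */
        printf("float: %.1f\n", fl);  /* still 1000000000.0 */
        return 0;
    }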
I'm writing a utility to calculate π to a million digits after the decimal. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number accurate to the millionth digit?
clarification: The language would be C.
Forget floating point; you need bit strings that represent integers.
This takes a bit less than 1/2 megabyte per number. "Efficient" can mean a number of things. Space-efficient? Time-efficient? Easy to program with?
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that our data is only known to a few significant figures and even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and then simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really really need is a string of bits. This is simply an array of convenient types. I suggest <stdint.h> and simply using uint32_t[125000] (or 64) to get started. This actually might be a great use of the more obscure constants from that header that pick out bit sizes that are fast on a given platform.
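A quick back-of-the-envelope sketch in C of that storage estimate (the constant and names are just illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define DECIMAL_DIGITS 1000000

    int main(void)
    {
        /* Each decimal digit needs log2(10) ~= 3.32 bits. */
        size_t bits  = (size_t)(DECIMAL_DIGITS * 3.322) + 1;
        size_t words = (bits + 31) / 32;

        printf("%zu bits, %zu uint32_t words, ~%zu KB\n",
               bits, words, words * sizeof(uint32_t) / 1024);

        /* The big number itself is then just an array of 32-bit "limbs",
           e.g. static uint32_t pi_bits[125000]; with room to spare. */
        return 0;
    }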
To be more specific we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports Bignum's, like Ruby?
Then the storage is someone else's problem. But, if what you really want to do is implement a big number package, then I might suggest using bcd (4-bit) strings or even ordinary ascii 8-bit strings with printable digits, simply because things will be easier to write and debug and maximum space and time efficiency may not matter so much.
I'd recommend storing it as an array of short ints, one per digit, and then carefully write utility classes to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number - so use its exact representation. This isn't the most efficient way in terms of space, but a million ints isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm (the Bailey-Borwein-Plouffe formula) that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million-digit numbers.
Of course, if you want to get the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as GNU Multiprecision. Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers.
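As a hedged sketch of what using GMP's mpf_t looks like (it computes sqrt(2) rather than pi, just to show the setup and API shape; the precision figure is approximate):

    /* Build with: gcc example.c -lgmp */
    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Roughly a million decimal digits: 1e6 * log2(10) ~= 3321929 bits. */
        mpf_set_default_prec(3321929 + 64);

        mpf_t x;
        mpf_init(x);
        mpf_sqrt_ui(x, 2);          /* any arbitrary-precision computation */
        gmp_printf("%.50Ff\n", x);  /* print the first 50 digits */

        mpf_clear(x);
        return 0;
    }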
If you are just doing this for fun/learning, then represent numbers as an array of chars, with each array element storing one decimal digit. You'll have to implement long addition, long multiplication, etc.
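A toy sketch of the long-addition part in C, assuming least-significant digit first (the array size here is tiny and purely illustrative):

    #include <stdio.h>

    #define NDIGITS 8   /* tiny for illustration; use ~1000000 for pi */

    /* a, b, sum: least-significant digit first, one decimal digit per element. */
    static void long_add(const char *a, const char *b, char *sum, int n)
    {
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int d = a[i] + b[i] + carry;
            sum[i] = (char)(d % 10);
            carry  = d / 10;
        }
        /* A real implementation would grow the array if carry is still set. */
    }

    int main(void)
    {
        char a[NDIGITS] = {9,9,9,9,0,0,0,0};  /* 9999 */
        char b[NDIGITS] = {2,0,0,0,0,0,0,0};  /* 2    */
        char s[NDIGITS];
        long_add(a, b, s, NDIGITS);
        for (int i = NDIGITS - 1; i >= 0; i--) putchar('0' + s[i]);
        putchar('\n');                         /* prints 00010001 (= 10001) */
        return 0;
    }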
Try PARI/GP; see Wikipedia.
You could store its decimal digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need good precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representations.
I am not too good with complex math, but I reckon there are solutions which are elegant when storing numbers with millions of bits of precision.
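A rough C sketch of that log trick (the log is kept in a double here rather than an int, purely for illustration; precision is coarse, as the answer notes):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Store huge numbers by their base-10 logarithm (coarse precision only). */
        double log_a = 123456.0;          /* represents 10^123456 */
        double log_b = 500000.0;          /* represents 10^500000 */

        double log_prod = log_a + log_b;  /* multiplication = adding the logs */
        printf("product ~= 10^%.0f\n", log_prod);

        /* Recover an approximate mantissa/exponent form if needed: */
        double frac = pow(10.0, fmod(log_prod, 1.0));
        printf("~ %.3fe%d\n", frac, (int)floor(log_prod));
        return 0;
    }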
IMO, anyone programming arbitrary-precision arithmetic needs an understanding of base conversion. It solves two problems at once: being able to calculate pi in hex digits and convert the result to a decimal representation, as well as finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript you always have 53 bits of accuracy, meaning that a Uint32Array whose numbers use at most 26 bits can be processed natively (a waste of 6 bits per word).
On a 32-bit architecture with C/C++ you can easily get A*B mod 2^32, suggesting a basic element of 16 bits. (Those can be parallelized in many SIMD architectures, starting from MMX.) Also, each 16-bit result can contain a 4-digit decimal number (wasting about 2.5 bits per word).
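A minimal C illustration of why 16-bit limbs are convenient in 32-bit arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* With 16-bit limbs, a limb product always fits in 32 bits,
           so carries can be handled with plain uint32_t arithmetic. */
        uint16_t a = 0xFFFF, b = 0xFFFF;

        uint32_t prod = (uint32_t)a * b;        /* at most 0xFFFE0001, no overflow */
        uint16_t lo   = (uint16_t)(prod & 0xFFFF);
        uint16_t hi   = (uint16_t)(prod >> 16); /* carry into the next limb */

        printf("lo = 0x%04X, hi = 0x%04X\n", lo, hi);
        return 0;
    }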
I want to use Lua (which internally uses only doubles) to represent an integer between 0 and 2^64-1 that can't have rounding errors, or terrible things will happen.
Is it possible to do so?
No
At least some bits of a 64-bit double must be used to represent the exponent (position of the binary point), and hence there are fewer than 64-bits available for the actual number. So no, a 64-bit double can't represent all the values a 64-bit integer can (and vice-versa).
Even though you've gotten some good answers to your question about 64-bit types, you may still want a practical solution to your specific problem. The most reliable solution I know of is to build Lua 5.1 with the LNUM patch (also known as the Lua integer patch) which can be downloaded from LuaForge. If you aren't planning on building Lua from the C source, there is at least one pure Lua library that handles 64-bit signed integers -- see the Lua-users wiki.
A double is a 64-bit type itself. However, you lose 1 bit for the sign and 11 for the exponent, leaving only 52 explicit mantissa bits (53 significant bits with the implied leading 1).
So the answer is no: it can't be done.
From memory, a double can represent a 53-bit signed integer exactly.
No, you cannot use Double to store 64-bit integers without losing precision.
However, you can apply a Lua patch that adds support for true 64-bit integers to the Lua interpreter. Apply the LNUM patch to your Lua source and recompile.
In 64 bits you can only store 2^64 different codes. This means that a 64-bit type which could represent 2^64 different integers would have no codes left over for representing anything else, such as floating point numbers.
Obviously double can represent a lot of non-integer numbers, so it can't fit your requirements.
IEEE 754 double cannot represent 64-bit integers exactly. It can, however, represent exactly every 32-bit integer value.
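A small C demonstration of both claims (the exact printed values assume the usual round-to-nearest conversion):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Every 32-bit integer survives a round trip through double... */
        uint32_t u = 4294967295u;              /* UINT32_MAX */
        printf("%u -> %.0f\n", u, (double)u);  /* prints 4294967295 */

        /* ...but above 2^53, consecutive 64-bit integers collapse together. */
        uint64_t big = (1ULL << 53) + 1;       /* 9007199254740993 */
        printf("%llu -> %.0f\n", (unsigned long long)big, (double)big);
        /* typically prints 9007199254740992: the +1 is lost */
        return 0;
    }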
I know nothing about Lua, but if you could figure out how to perform bitwise manipulation of the float in this language, you could in theory make a wrapper class that takes your number in the form of a string and sets the bits of the float in the order which represents the number you gave it.
A more practical solution would be to use some bignum library.