Binary, Floats, and Modern Computers

I have been reading a lot about floats and computer-processed floating-point operations. The biggest question I see when reading about them is why are they so inaccurate? I understand this is because binary cannot accurately represent all real numbers, so the numbers are rounded to the 'best' approximation.
My question is, knowing this, why do we still use binary as the base for computer operations? Surely using a larger base number than 2 would increase the accuracy of floating-point operations exponentially, would it not?
What are the advantages of using a binary number system for computers as opposed to another base, and has another base ever been tried? Or is it even possible?

First of all: you cannot represent all real numbers even when using, say, base 100. But you already know this. Either way, this means inaccuracy will always arise from not being able to represent all real numbers.
Now let's talk about what higher bases can bring you when doing math: higher bases bring exactly nothing in terms of precision. Why?
If you want to use base 4, then a 16-digit base-4 number provides exactly 4^16 different values.
But you can get the same number of different values from a 32-digit base-2 number (2^32 = 4^16).
As another answer already said: Transistors can either be on or off. So your newly designed base 4 registers need to be an abstraction over (base 2) ON/OFF 'bits'. This means: Use two 'bits' to represent a base 4 digit. But you'll still get exactly 2^N levels by spending N 'bits' (or N/2 base-4 digits). You can only get better accuracy by spending more bits, not by increasing the base. Which base you 'imagine/abstract' your numbers to be in (e.g. like how printf can print these base-2 numbers in base-10) is really just a matter of abstraction and not precision.
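To make that concrete, here is a minimal C sketch (the helper name is just illustrative): the same 32-bit pattern can be printed as 32 base-2 digits or as 16 base-4 digits, and either way it is one of exactly 2^32 = 4^16 possible values, so the choice of base changes only the presentation, not the precision.

#include <stdio.h>
#include <stdint.h>

/* Print a 32-bit value digit by digit, grouping 'bits_per_digit' bits per digit
   (1 -> base 2, 2 -> base 4). The set of representable values never changes. */
static void print_in_base(uint32_t v, int bits_per_digit) {
    for (int shift = 32 - bits_per_digit; shift >= 0; shift -= bits_per_digit)
        printf("%u", (v >> shift) & ((1u << bits_per_digit) - 1u));
    printf("\n");
}

int main(void) {
    uint32_t x = 3141592653u;
    print_in_base(x, 1);   /* 32 base-2 digits */
    print_in_base(x, 2);   /* 16 base-4 digits, same information */
    return 0;
}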

Computers are built on transistors, which have a "switched on" state, and a "switched off" state. This corresponds to high and low voltage. Pretty much all digital integrated circuits work in this binary fashion.
Ignoring the fact that transistors just simply work this way, using a different base (e.g. base 3) would require these circuits to operate at an intermediate voltage state (or several) as well as 0V and their highest operating voltage. This is more complicated, and can result in problems at high frequencies - how can you tell whether a signal is just transitioning between 2V and 0V, or actually at 1V?
When we get down to the floating point level, we are (as nhahtdh mentioned in their answer) mapping an infinite space of numbers down to a finite storage space. It's an absolute guarantee that we'll lose some precision. One advantage of IEEE floats, though, is that the precision is relative to the magnitude of the value.
Update: You should also check out Tunguska, a ternary computer emulator. It uses base-3 instead of base-2, which makes for some interesting (albeit mind-bending) concepts.
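To illustrate the earlier point that the precision of an IEEE float is relative to the value's magnitude, here is a small C sketch (nothing platform-specific assumed beyond the standard library): the gap between a float and the next representable float grows with the value.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* The spacing between adjacent float values grows roughly in
       proportion to the magnitude of the value itself. */
    float values[] = { 1.0f, 1000.0f, 1000000.0f, 1e9f };
    for (int i = 0; i < 4; i++) {
        float v = values[i];
        float gap = nextafterf(v, INFINITY) - v;
        printf("value %.1e -> gap to next float %.3e\n", v, gap);
    }
    return 0;
}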

We are essentially mapping a finite space onto an infinite set of real numbers, so it is not even a problem of base anyway.
Base 2 is chosen, like Polynomial said, for implementation reasons, as it is easier to differentiate between two voltage levels.
We either throw more space at representing more numbers / increasing precision, or limit the range we want to encode, or a mix of both.

It boils down to getting the most from the available chip area.
If you use on/off switches to represent numbers, you can't get more precision per switch than with a base-2 representation. This is simply because N switches can represent 2^N quantities no matter what you choose these values to be. There were early machines that used base 16 floating point digits, but each of these needed 4 binary bits, so the overall precision per bit was the same as base 2 (actually somewhat less due to edge cases).
If you choose a base that's not a power of 2, precision is obviously lost. For example, you need 4 bits to represent one decimal digit, but 6 of the 16 available values of those 4 bits are never used. This system is called binary-coded decimal and it's still used occasionally, usually when doing computations with money.
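As a small illustration of that wasted encoding space, here is a packed-BCD sketch in C (the helper names are purely illustrative): two decimal digits share one byte, and the nibble codes 0xA-0xF are simply never produced.

#include <stdio.h>
#include <stdint.h>

/* Pack two decimal digits (0-9 each) into one byte: high nibble and low nibble. */
static uint8_t bcd_pack(unsigned tens, unsigned ones) {
    return (uint8_t)((tens << 4) | ones);
}

static unsigned bcd_unpack(uint8_t b) {
    return ((b >> 4) & 0xFu) * 10u + (b & 0xFu);
}

int main(void) {
    uint8_t b = bcd_pack(4, 2);
    printf("packed 0x%02X unpacks to %u\n", b, bcd_unpack(b));   /* 0x42 -> 42 */
    return 0;
}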
Multi-level logic could efficiently implement other bases, but at least with current chip technologies, it turns out to be very expensive to implement more than 2 levels. Even quantum computers are being designed assuming two quantum levels: quantum bits, or qubits.
The nature of the world and math is what makes the floating point situation hopeless. There is a hierarchy of real numbers: Integer -> Rational -> Algebraic -> Transcendental. There's a wonderful mathematical proof, Cantor diagonalization, that most numbers - a "bigger infinity" than the other sets - are transcendental. Yet no matter what floating point system you choose, there will still be lowly rational numbers with no perfect representation (e.g. 1/3 in base 10). This is our universe. No amount of clever hardware design will change it.
Software can and does use rational representations, storing a numerator and denominator as integers. However with these your programmer's hands are tied. For example, square root is not "closed." Sqrt(2) has no rational representation.
There has been research with algebraic number reps and "lazy" reps of arbitrary reals that produce more digits as needed. Most work of this type seems to be in computational geometry.

Your first paragraph makes sense but the second is a non sequitur. A larger base would not make a difference to the precision.
The precision of a number depends on the amount of storage that is used for it - for example a 16-bit binary number has the same precision as a 2-digit base-256 number - both take up the same amount of information.
See the usual floating point reference for more detail - it generalises to all bases.
Yes, computers have been built using other bases - I know of ones that use base 10 (decimal), cf. Wikipedia.

Yes, there are/have been computers that use other than binary (i.e., other than base 2 representations and arithmetic):
Decimal computers.
Designers of computing systems have looked into many alternatives. But it's hard to find a model that's as simple to implement in a physical device as one using two discrete states. So start with a binary circuit that's very easy and cheap to build, and work up to a computer with complex operations. That's the history of binary in a nutshell.

I am not an EE, so everything I say below may be totally wrong. But...
The advantage of binary is that it maps very cleanly to distinguishing between on/off (or, more accurately, high/low voltage) states in real circuits. Trying to distinguish between multiple voltages would, I think, present a bit more of a challenge.
This may go completely out the window if quantum computers make it out of the lab.

There are 2 issues arising from the use of binary floating-point numbers to represent mathematical real numbers -- well, there are probably a lot more issues, but 2 is enough for the time being.
All computer numbers are finite, so any number which requires an infinite number of digits cannot be accurately represented on a computer, whatever number base is chosen. So that deals with pi, e, and most of the other real numbers.
Whatever base is chosen will have difficulty representing some fractions finitely. Base 2 can only approximate any fraction with a factor of 3 in the denominator, but base 5 or base 7 have the same problem.
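A quick C demonstration of the second point (plain standard C, nothing else assumed): 1/10 terminates in base 10 but not in base 2, and 1/3 terminates in neither, so both are stored as rounded approximations.

#include <stdio.h>

int main(void) {
    /* Neither 0.1 nor 1.0/3.0 has a finite binary expansion,
       so the stored doubles are merely the nearest representable values. */
    printf("0.1     stored as %.20f\n", 0.1);
    printf("1.0/3.0 stored as %.20f\n", 1.0 / 3.0);
    return 0;
}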
Over the years computers with circuitry based on devices with more than 2 states have been built. The old Soviet Union developed a series of computers with 3-state devices and at least one US computer manufacturer at one time offered computers using 10-state devices for arithmetic.
I suspect that binary representation has won out (so far) because it is simple, both to reason about and to implement with current electronic devices.

I vote that we move to a rational number storage system: two 32-bit integers that evaluate as p/q. Multiplication and division will be really cheap operations. Yeah, there will be redundant evaluated numbers (1/2 = 2/4), but who really uses the full dynamic range of a 64-bit double anyway?
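A bare-bones sketch of what that could look like in C (the rat type and helpers are hypothetical; no overflow handling, just normalisation by the gcd):

#include <stdio.h>
#include <stdint.h>

typedef struct { int32_t p; int32_t q; } rat;    /* value = p / q, q > 0 */

static int32_t gcd32(int32_t a, int32_t b) {
    while (b != 0) { int32_t t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

static rat rat_norm(rat r) {
    int32_t g = gcd32(r.p, r.q);
    if (g != 0) { r.p /= g; r.q /= g; }
    if (r.q < 0) { r.p = -r.p; r.q = -r.q; }
    return r;
}

static rat rat_mul(rat a, rat b) {               /* cheap: two integer multiplies */
    rat r = { a.p * b.p, a.q * b.q };
    return rat_norm(r);
}

int main(void) {
    rat half = { 1, 2 }, third = { 1, 3 };
    rat prod = rat_mul(half, third);
    printf("1/2 * 1/3 = %d/%d\n", prod.p, prod.q);   /* 1/6 */
    return 0;
}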

I'm neither an electrical engineer nor a mathematician, so take that into consideration when I make the following statement:
All floating point numbers can be represented as integers.

Related

Is there a fixed point representation available in C or Assembly

As far as I know, representing a fraction in C relies on floats and doubles which are in floating point representation.
Assume I'm trying to represent 1.5, which is a fixed point number (only one digit to the right of the radix point). Is there a way to represent such a number in C or even assembly using a fixed point data type?
Are there even any fixed point instructions on x86 (or other architectures) which would operate on such type?
Every integral type can be used as a fixed point type. A favorite of mine is to use int64_t with an implied 8 digit shift, e.g. you store 1.5 as 150000000 (1.5e8). You'll have to analyze your use case to decide on an underlying type and how many digits to shift (that is, assuming you use base-10 scaling, which most people do). But 64 bits scaled by 10^8 is a pretty reasonable starting point with a broad range of uses.
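A sketch of that scheme (the 10^8 scale is the one suggested above; SCALE and the helper names are only illustrative, and the multiply can overflow for large operands):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SCALE 100000000LL    /* 10^8: eight implied decimal places */

/* 1.5 is stored as 150000000. Addition needs nothing special;
   multiplication must divide one factor of SCALE back out. */
static int64_t fx_add(int64_t a, int64_t b) { return a + b; }
static int64_t fx_mul(int64_t a, int64_t b) { return (a * b) / SCALE; }

static void fx_print(int64_t v) {                /* assumes v >= 0 */
    printf("%" PRId64 ".%08" PRId64 "\n", v / SCALE, v % SCALE);
}

int main(void) {
    int64_t x = 150000000;   /* 1.5  */
    int64_t y = 25000000;    /* 0.25 */
    fx_print(fx_add(x, y));  /* 1.75000000 */
    fx_print(fx_mul(x, y));  /* 0.37500000 */
    return 0;
}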
While some C compilers offer special fixed-point types as an extension (not part of the standard C language), there's really very little use for them. Fixed point is just integers, interpreted with a different unit. For example, fixed point currency in typical cent denominations is just using integers that represent cents instead of dollars (or whatever the whole currency unit is) for your unit. Likewise, you can think of 8-bit RGB as having units of 1/256 or 1/255 "full intensity".
Adding and subtracting fixed point values with the same unit is just adding and subtracting integers. This is just like arithmetic with units in the physical sciences. The only value in having the language track that they're "fixed point" would be ensuring that you can only add/subtract values with matching units.
For multiplication and division, the result will not have the same units as the operands, so you have to either treat the result as a different fixed-point type, or renormalize. For example, if you multiply two values representing 1/16 units, the result will have 1/256 units. You can then scale the value down by a factor of 16 (rounding in whatever way is appropriate) to get back to a value with 1/16 units, as sketched below.
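A short sketch of that renormalisation with binary scaling (here values carry 4 fraction bits, i.e. units of 1/16; the names are illustrative):

#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 4    /* operands are in units of 1/16 */

int main(void) {
    uint32_t a = 24;   /* 24/16 = 1.5 */
    uint32_t b = 40;   /* 40/16 = 2.5 */

    /* The raw product is in units of 1/256 (1/16 * 1/16). */
    uint32_t prod_1_256 = a * b;   /* 960, i.e. 960/256 = 3.75 */

    /* Renormalise back to 1/16 units, rounding to nearest. */
    uint32_t prod_1_16 = (prod_1_256 + (1u << (FRAC_BITS - 1))) >> FRAC_BITS;

    printf("1.5 * 2.5 = %u/16 = %.4f\n", prod_1_16, prod_1_16 / 16.0);   /* 60/16 = 3.75 */
    return 0;
}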
If the issue here is representing decimal values as fixed point, there's probably a library for this for C; you could try a web search. You could create your own BCD fixed point library in assembly, using the BCD-related instructions AAA (adjust after addition), AAS (adjust after subtraction) and AAM (adjust after multiplication). However, these instructions are invalid in x86-64 (64-bit) mode, so you'll need to build a 32-bit program, which should be runnable on a 64-bit OS.
Financial institutions in the USA and other countries are required by law to perform decimal based math on currency values, to avoid decimal -> binary -> decimal conversion issues.

Converting 32-bit number to 16 bits or less

On my mbed LPC1768 I have an ADC on a pin which when polled returns a 16-bit short number normalised to a floating point value between 0-1. Document here.
Because it converts it to a floating point number, does that mean it's 32 bits? The number I have is given to six decimal places. Data Types here
I'm running Autocorrelation and I want to reduce the time it takes to complete the analysis.
Is it correct that the floating point numbers are 32 bits long, and if so, is it correct that multiplying two 32-bit floating point numbers will take a lot longer than multiplying two 16-bit short (non-decimal) values together?
I am working with C to program the mbed.
Cheers.
I should be able to comment on this quite accurately. I used to do DSP processing work where we would "integerize" code, which effectively meant we'd take a signal/audio/video algorithm, and replace all the floating point logic with fixed point arithmetic (i.e. Qm.n notation, etc.).
On most modern systems, you'll usually get better performance using integer arithmetic, compared to floating point arithmetic, at the expense of more complicated code you have to write.
The chip you are using (Cortex-M3) doesn't have a dedicated hardware FPU: floating point operations have to be emulated in software, so they are going to be expensive (take a lot of time).
In your case, you could just read the 16-bit value via read_u16(), and shift the value right 4 times, and you're done. If you're working with audio data, you might consider looking into companding algorithms (a-law, u-law), which will give a better subjective performance than simply chopping off the 4 LSBs to get a 12-bit number from a 16-bit number.
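A rough sketch of the integer path (read_u16() is the mbed call mentioned above; the lag loop below is only an illustrative fragment of an autocorrelation, not a full implementation):

#include <stdint.h>

/* Samples are 12-bit values (the 16-bit reading shifted right by 4),
   e.g. filled with something like: samples[i] = read_u16() >> 4;
   12-bit x 12-bit products are 24 bits, so sums of up to ~256 of them
   still fit in a 32-bit accumulator. */
uint32_t autocorr_lag(const uint16_t *samples, int n, int lag) {
    uint32_t acc = 0;
    for (int i = 0; i + lag < n; i++)
        acc += (uint32_t)samples[i] * samples[i + lag];
    return acc;
}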
Yes, a float on that system is 32 bits, and is likely represented in IEEE 754 format. Multiplying a pair of 32-bit values versus a pair of 16-bit values may very well take the same amount of time, depending on the chip in use and the presence of an FPU and ALU. On your chip, multiplying two floats will be horrendously expensive in terms of time. Also, if you multiply two 32-bit integers, they could potentially overflow, so that is one potential reason to go with floating point logic if you don't want to implement a fixed-point algorithm.
It is correct to assume that multiplying two 32-bit floating point numbers will take longer than multiplying two 16-bit short values if special hardware (a floating point unit) is not present in the processor.

Handling Decimals on Embedded C

I have my code below and I want to ask: what's the best way to handle numbers (division, multiplication, logarithm, exponents) up to 4 decimal places? I'm using a PIC16F1789 as my device.
float sensorValue;
float sensorAverage;
void main(){
    // Get an average by sampling 100 times
    for(int x = 0; x < 100; x++){
        // Accumulate the sum of all 100 readings
        sensorValue = (sensorValue + ADC_GetConversion(SENSOR));
    }
    // Get the average
    sensorAverage = sensorValue/100.0;
}
In general, on MCUs, floating point types are more costly (clocks, code) to process than integer types. While this is often true even for devices which have a hardware floating point unit, it becomes vital information on devices without one, like the PIC16/18 controllers. These have to emulate all floating point operations in software. This can easily cost >100 clock cycles per addition (much more for multiplication) and bloats the code.
So it is best to avoid float (not to speak of double) on such systems.
For your example, the ADC returns an integer type anyway, so the summation can be done purely with integer types. You just have to make sure the sum does not overflow, so for your code it has to hold roughly 100 times the maximum ADC reading.
Finally, to calculate the average, you can either divide the integer by the number of iterations (round to zero), or - better - apply a simple "round to nearest" by:
#define NUMBER_OF_ITERATIONS 100
sensorAverage = (sensorValue + NUMBER_OF_ITERATIONS / 2) / NUMBER_OF_ITERATIONS;
If you really want to speed up your code, set NUMBER_OF_ITERATIONS to a power of two (64 or 128 here), if your code can tolerate this.
Finally: to get not only the integer part of the division, you can treat the sum (sensorValue) as a fractional value. For the given 100 iterations, you can treat it as a decimal fraction: when converting to a string, just print a decimal point left of the lower 2 digits. As you divide by 100, there will be no more than two significant digits of decimal fraction. If you really need 4 digits, e.g. for other operations, you can multiply the sum by 100 (the overall scale factor is then 10000, but the loop has already multiplied by 100).
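A sketch of that decimal-fixed-point printing for this example (assuming the sum of the 100 readings is held in an unsigned integer; the function name is illustrative):

#include <stdio.h>
#include <stdint.h>

/* The sum of 100 readings is the average scaled by 100, so printing a decimal
   point in front of the lower two digits prints the average: 375 -> "3.75". */
void print_average(uint32_t sum_of_100)
{
    printf("%lu.%02lu\n",
           (unsigned long)(sum_of_100 / 100),
           (unsigned long)(sum_of_100 % 100));
}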
This is called decimal fixed point. Faster for processing (replaces multiplication by shifts) would be to use binary fixed point, as I stated above.
On PIC16, I would strongly suggest thinking about using binary fractions, as multiplication and division are very costly on this platform. In general, these devices are not well suited for signal processing. If you need to sustain some performance, an ARM Cortex-M0 or M4 would be the far better choice - at similar prices.
In your example it is trivial to avoid non-integer representations altogether. However, to answer your question more generally: an ISO compliant compiler will support floating point arithmetic and the maths library, but for performance and code size reasons you may want to avoid them.
Fixed-point arithmetic is what you probably need. For simple calculations an ad-hoc approach to fixed point can be used, whereby for example you treat the units of sensorAverage in your example as hundredths (1/100), and avoid the expensive division altogether. However, if you want to perform full maths library operations, then a better approach is to use a fixed-point library. One such library is presented in Optimizing Applications with Fixed-Point Arithmetic by Anthony Williams. The code is C++ and PIC16 may lack a decent C++ compiler, but the methods can be ported somewhat less elegantly to C. It also uses a huge 64-bit fixed-point 36Q28 format, which would be expensive and slow on PIC16; you might want to adapt it to use 16Q16 perhaps.
If you are really concerned about performance, stick to integer arithmetic and try to make the number of samples to average a power of two, so the division can be done by means of bit shifts. However, if it is not a power of two, let's say 100 (as Olaf points out for fixed point), you can also use bit shifts and additions: How can I multiply and divide using only bit shifting and adding?
If you are not concerned about performance and still want to work with floats (you have already been warned this may not be very fast on a PIC16 and may use a lot of flash), math.h has the following functions: http://en.cppreference.com/w/c/numeric/math including exponentiation, pow(base,exp), and logarithms - but only base 2, base 10 and base e; for an arbitrary base, use the change-of-base logarithmic property.
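For the arbitrary-base case, the change-of-base identity log_b(x) = log(x) / log(b) is a one-liner (plain math.h, remembering the earlier warning about floating point cost on this device):

#include <math.h>

/* Logarithm of x in an arbitrary base b, via the change-of-base property. */
double log_base(double x, double b)
{
    return log(x) / log(b);
}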

What is the difference between a fixed point and a floating point processor? How are floats handled in each kind?

I am a bit confused about how floating point operations are handled by a processor which does not support floating point operations.
Also, how is a floating point processor different from a fixed point processor?
In which cases are IEEE floating point formats used?
First off, there are a number of different floating point formats, for various reasons. (Some) DSPs do not use IEEE for performance reasons; it carries a lot of extra baggage (which most folks never use).
From elementary school we learned how to count, then we learned how to add, which is just a shortcut for counting; then we learned to multiply, which is just a shortcut for adding; likewise subtraction and division are shortcuts for counting down rather than counting up. We also learned to do all of our math one column at a time, so if you have a processor that can do at least 1-bit math operations, you can do addition, subtraction, multiplication and division as wide (as many bits per operand) as you desire. It may take a lot of operations, but it is quite doable, and anyone that made it through grade school has the tool box/skill set to do such a thing.
Floating point is a middle school thing: manipulate a decimal point and use powers of some base, (1.3 * 10^5) + (1.5 * 10^5). We know we have to get the powers of 10 the same, then we can just do basic elementary addition with the decimal points lined up. Multiplication is even easier, as you don't have to line up the decimal points; you just do the math on the significant digits and simply add the exponents.
When your processor has a multiply instruction, it is just a shortcut for you having to do multiple additions (the shortcuts usually involve multiple additions). Depending on how many clock cycles designers want to get the multiply operation down to, it uses an increasingly large amount of chip real estate. Likewise for division; that is why you don't see divide on a lot of instruction sets, and don't see multiply on some of the ones that don't have divide. It is a cost trade-off: performance vs power and chip real estate, yield, etc.
Floating point is then just an extension of that: at the core of a floating point operation you still have the fixed point operations. A floating point multiply requires a fixed point multiplication, an addition and some adjustment. A floating point addition requires some adjustment, an addition and some more adjustment.
Now which processors have FPUs and which don't? Which processors with an FPU support IEEE and which don't? That is as easy to find as the information above, but I will leave you to find that out yourself.
If you are, for example, able to do math operations using scientific notation (1.345*10^4 + 2.456*10^6, or 2.3*10^6 * 4.5*10^7), then you should be able to break down the math steps involved and write your own soft float routines - not optimized, but you can see how a CPU that doesn't have an FPU, or a programmer that doesn't want to use the FPU, can do floating point operations. You have to be able to think in terms of base 2 rather than base 10, though, which actually makes the problem significantly easier: 1.101001*2^4 + 1.010101*2^5; in particular, multiplies get real easy.
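In that spirit, here is a deliberately simplified soft-float addition sketch (a toy base-2 format with an integer mantissa and exponent; no rounding, normalisation or special-case handling, nothing like a full IEEE implementation):

#include <stdio.h>
#include <stdint.h>

typedef struct { int64_t mant; int exp; } soft;   /* value = mant * 2^exp */

static soft soft_add(soft a, soft b)
{
    /* Step 1: align the exponents, like lining up decimal points in school. */
    while (a.exp > b.exp) { a.mant <<= 1; a.exp--; }
    while (b.exp > a.exp) { b.mant <<= 1; b.exp--; }
    /* Step 2: a plain fixed-point (integer) addition. */
    soft r = { a.mant + b.mant, a.exp };
    return r;
}

int main(void)
{
    soft x = { 13, 4 };   /* 13 * 2^4 = 208 */
    soft y = { 21, 5 };   /* 21 * 2^5 = 672 */
    soft s = soft_add(x, y);
    printf("%lld * 2^%d = %lld\n",
           (long long)s.mant, s.exp, (long long)(s.mant << s.exp));   /* 55 * 2^4 = 880 */
    return 0;
}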
When floating point is not supported in hardware, the calculations are done by heavily optimised pieces of assembly code, usually from a library.
One Google search away you could have found this about fixed-point. I assume you can find info about IEEE floating point yourself ;-)
Good luck!
Explanation and arbitrary fixed point library can be found here.

What is the most efficient way to store and work with a floating point number with 1,000,000 significant digits in C?

I'm writing a utility to calculate π to a million digits after the decimal. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number accurate to the millionth digit?
clarification: The language would be C.
Forget floating point; you need bit strings that represent integers.
This takes a bit less than 1/2 megabyte per number. "Efficient" can mean a number of things. Space-efficient? Time-efficient? Easy to program with?
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that our data is only known to a few significant figures and even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and then simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really really need is a string of bits. This is simply an array of convenient types. I suggest <stdint.h> and simply using uint32_t[125000] (or 64) to get started. This actually might be a great use of the more obscure constants from that header that pick out bit sizes that are fast on a given platform.
To be more specific we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports bignums, like Ruby?
Then the storage is someone else's problem. But if what you really want to do is implement a big number package, then I might suggest using BCD (4-bit) strings or even ordinary ASCII 8-bit strings with printable digits, simply because things will be easier to write and debug, and maximum space and time efficiency may not matter so much.
I'd recommend storing it as an array of short ints, one per digit, and then carefully write utility classes to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number - so use its exact representation. This isn't the most efficient way in terms of space, but a million ints isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm (the Bailey-Borwein-Plouffe digit-extraction formula) that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million digit numbers.
Of course, if you want to get the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as GNU Multiprecision. Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers.
If you are just doing this for fun/learning, then represent numbers as an array of chars, with each array element storing one decimal digit. You'll have to implement long addition, long multiplication, etc.
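For instance, long addition over such digit arrays might look like this (least-significant digit first; a sketch with no sign handling, and the helper name is illustrative):

#include <stdio.h>

/* a and b hold one decimal digit (0-9) per element, least significant first.
   out must have room for n+1 digits; returns the number of digits written. */
int long_add(const char *a, const char *b, char *out, int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int d = a[i] + b[i] + carry;
        out[i] = (char)(d % 10);
        carry = d / 10;
    }
    out[n] = (char)carry;
    return carry ? n + 1 : n;
}

int main(void)
{
    char a[4] = { 9, 9, 9, 0 };   /* 999 */
    char b[4] = { 2, 0, 0, 0 };   /* 2   */
    char out[5];
    int len = long_add(a, b, out, 4);
    for (int i = len - 1; i >= 0; i--) putchar('0' + out[i]);   /* prints 1001 */
    putchar('\n');
    return 0;
}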
Try PARI/GP; see Wikipedia.
You could store its decimal digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need good precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representations.
I am not too good with complex math, but I reckon there are solutions which are elegant when storing numbers with millions of bits of precision.
IMO, any programmer of arbitrary precision arithmetic needs an understanding of base conversion. This solves two problems anyway: being able to calculate pi in hex digits and convert the result to decimal representation, as well as finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript one always has 53 bits of accuracy, meaning that a Uint32Array with numbers having at most 26 bits can be processed natively (a waste of 6 bits per word).
On a 32-bit architecture with C/C++ one can easily get A*B mod 2^32, suggesting a basic element of 16 bits. (Those can be parallelized in many SIMD architectures, starting from MMX.) Also, each 16-bit result can contain a 4-digit decimal number (wasting about 2.5 bits per word).
