64-bit integer implementation for 8-bit microcontroller - c

I'm working on OKI 431 microcontroller. This is 8-bit microcontroller. We don't like to have any floating point operation to be performed in our project so we've eliminated all floating point operations and converted them into integer operations in some way. But we cannot eliminate one floating point operation because optimizing the calculation for integer operation requires 64-bit integer which the micro doesn't natively support. It has C compiler that supports upto 32-bit integer operation. The calculation takes too long time which is noticeable in a way to user.
I'm wondering if there is any 64-bit integer library that can be easily used in C for microcontoller coding. Or what is the easiest way to write such thing efficiently? Here efficiently implies minimize amount of time required.
Thanks in advance.

Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.

You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it,
but it could also be even slower.

Whenever speed is a problem with floating point math in small embedded systems, and when integer math is not enough, fixed point math is a fast replacement.
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit

Related

Converting 32-bit number to 16 bits or less

On my mbed LPC1768 I have an ADC on a pin which when polled returns a 16-bit short number normalised to a floating point value between 0-1. Document here.
Because it converts it to a floating point number does that mean its 32-bits? Because the number I have is a number to six decimal places. Data Types here
I'm running Autocorrelation and I want to reduce the time it takes to complete the analysis.
Is it correct that the floating point numbers are 32-bits long and if so is it correct that multiplying two 32-bit floating point numbers will take a lot longer than multiplying two 16-bit short value (non-demical) numbers together?
I am working with C to program the mbed.
Cheers.
I should be able to comment on this quite accurately. I used to do DSP processing work where we would "integerize" code, which effectively meant we'd take a signal/audio/video algorithm, and replace all the floating point logic with fixed point arithmetic (ie: Q_mn notation, etc).
On most modern systems, you'll usually get better performance using integer arithmetic, compared to floating point arithmetic, at the expense of more complicated code you have to write.
The Chip you are using (Cortex M3) doesn't have a dedicated hardware-based FPU: it only emulates floating point operations, so floating point operations are going to be expensive (take a lot of time).
In your case, you could just read the 16-bit value via read_u16(), and shift the value right 4 times, and you're done. If you're working with audio data, you might consider looking into companding algorithms (a-law, u-law), which will give a better subjective performance than simply chopping off the 4 LSBs to get a 12-bit number from a 16-bit number.
Yes, a float on that system is 32bit, and is likely represented in IEEE754 format. Multiplying a pair of 32-bit values versus a pair of 16-bit values may very well take the same amount of time, depending on the chip in use and the presence of an FPU and ALU. On your chip, multiplying two floats will be horrendously expensive in terms of time. Also, if you multiply two 32-bit integers, they could potentially overflow, so there is one potential reason to go with floating point logic if you don't want to implement a fixed-point algorithm.
It is correct to assume that multiplying two 32-bit floating point numbers will take longer than multiplying two 16-bit short value if special hardware(Floating point unit) is not present in the processor.

Standard guarantees for using floating point arithmetic to represent integer operations

I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32 bit words (unsigned ints) to represent high precision boundaries for the intervals. It seems to me that representing some words in floating point in some situations may produce a speedup. So, my question is two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
Thanks!
I don't know about the Standard, but it seems that you can assume all your processors are using the normal IEEE floating point format. In this case, it's pretty easy to determine whether your calculations are correct. The first integer not representable by the 32-bit float format is 16777217 (224+1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 224 (in absolute value) and odd, float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 211, and the other by 213, you will be fine (just barely). If, for example, both are limited by 216, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 224 and be odd.
All that you need to know to which limits you may go and still have integer precision should be available to you through the macros defined in <float.h>. There you have the exact description of the floating point types, FLT_RADIX for the radix, FLT_MANT_DIG for the number of the digits, etc.
As you say, whether or not such an approach is efficient will depend on the platform. You should be aware that this is much dependent of the particular processor you'd have, not only the processor family. From one Intel or AMD processor variant to another there could already be sensible differences. So you'd basically benchmark all possibilities and have code that decides on program startup which variant to use.

Looking for Ansi C89 arbitrary precision math library

I wrote an Ansi C compiler for a friend's custom 16-bit stack-based CPU several years ago but I never got around to implementing all the data types. Now I would like to finish the job so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle 16-bit integer data types since they are native to the CPU and therefore I have all the math routines (ie. +, -, *, /, %) done for them. However, since his CPU does not handle floating point then I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (bother integer and floats/doubles). I'm pretty sure this has been done and redone many times and since I'm not particularly looking forward to recreating the wheel I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (library must be absolutely huge, not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings which would be wasteful for obvious reasons. For example :
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, I like the api... but I would rather pass in an array rather than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass it two arrays of size two where each array contains two 16-bit values that actually represent a 32-bit long and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!
Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on them, you simply pad them with zero (or one if signed and negative) to make them 16-bit. Then you proceed with the normal 16-bit operation.
32-bit arithmetic
Note: so long as the standard is concerned, you don't really need to have 32-bit integers.
This could be a bit tricky, but it is still not worth using a library for. For each operation, you would need to take a look at how you learned to do them in elementary school in base 10, and then do the same in base 216 for 2 digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base 10 math (and hence the algorithms), you would need to implement them in assembly of your CPU.
This basically means loading the most significant 16 bit on one register, and the least significant in another register. Then follow the algorithm for each operation and perform it. You would most likely need to get help from overflow and other flags.
Floating point arithmetic
Note: so long as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software emulated floating points. You may find this gcc wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.

Is NEON of ARM faster for integers than floating points?

Or both floating point and integer operations are same speed? And if not so, how much faster is the integer version?
You can find information about Instruction-specific scheduling for Advanced SIMD instructions for Cortex-A8 (they don't publish it for newer cores since timing business got quite complicated since).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read explanation of how to read those tables.
To give a complete answer, in general floating point instructions take two cycles while instructions executes on ALU takes one cycle. On the other hand multiplication of long long (8 byte integer) is four cycles (forum same source) while multiplication of double is two cycles.
In general it seems you shouldn't care about float versus integer but carefully choosing data type (float vs double, int vs long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64 bit numbers back to a 32 bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. To achieve good performance if you are using Q15, then you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.

How to avoid FPU when given float numbers?

Well, this is not at all an optimization question.
I am writing a (for now) simple Linux kernel module in which I need to find the average of some positions. These positions are stored as floating point (i.e. float) variables. (I am the author of the whole thing, so I can change that, but I'd rather keep the precission of float and not get involved in that if I can avoid it).
Now, these position values are stored (or at least used to) in the kernel simply for storage. One user application writes these data (through shared memory (I am using RTAI, so yes I have shared memory between kernel and user spaces)) and others read from it. I assume read and write from float variables would not use the FPU so this is safe.
By safe, I mean avoiding FPU in the kernel, not to mention some systems may not even have an FPU. I am not going to use kernel_fpu_begin/end, as that likely breaks the real-time-ness of my tasks.
Now in my kernel module, I really don't need much precision (since the positions are averaged anyway), but I would need it up to say 0.001. My question is, how can I portably turn a floating point number to an integer (1000 times the original number) without using the FPU?
I thought about manually extracting the number from the float's bit-pattern, but I'm not sure if it's a good idea as I am not sure how endian-ness affects it, or even if floating points in all architectures are standard.
If you want to tell gcc to use a software floating point library there's apparently a switch for that, albeit perhaps not turnkey in the standard environment:
Using software floating point on x86 linux
In fact, this article suggests that linux kernel and its modules are already compiled with -msoft-float:
http://www.linuxsmiths.com/blog/?p=253
That said, #PaulR's suggestion seems most sensible. And if you offer an API which does whatever conversions you like then I don't see why it's any uglier than anything else.
The SoftFloat software package has the function float32_to_int32 that does exactly what you want (it implements IEEE 754 in software).
In the end it will be useful to have some sort of floating point support in a kernel anyway (be it hardware or software), so including this in your project would most likely be a wise decision. It's not too big either.
Really, I think you should just change your module's API to use data that's already in integer format, if possible. Having floating point types in a kernel-user interface is just a bad idea when you're not allowed to use floating point in kernelspace.
With that said, if you're using single-precision float, it's essentially ALWAYS going to be IEEE 754 single precision, and the endianness should match the integer endianness. As far as I know this is true for all archs Linux supports. With that in mind, just treat them as unsigned 32-bit integers and extract the bits to scale them. I would scale by 1024 rather than 1000 if possible; doing that is really easy. Just start with the mantissa bits (bits 0-22), "or" on bit 23, then right shift if the exponent (after subtracting the bias of 127) is less than 23 and left shift if it's greater than 23. You'll need to handle the cases where the right shift amount is greater than 32 (which C wouldn't allow; you have to just special-case the zero result) or where the left shift is sufficiently large to overflow (in which case you'll probably want to clamp the output).
If you happen to know your values won't exceed a particular range, of course, you might be able to eliminate some of these checks. In fact, if your values never exceed 1 and you can pick the scaling, you could pick it to be 2^23 and then you could just use ((float_bits & 0x7fffff)|0x800000) directly as the value when the exponent is zero, and otherwise right-shift.
You can use rational numbers instead of floats. The operations (multiplication, addition) can be implemented without loss in accuracy too.
If you really only need 1/1000 precision, you can just store x*1000 as a long integer.

Resources