Does doing pointer arithmetic incur the cost of a divide - c

I'm working on an embedded processor where the cost of doing a divide is high. When tracking down divide calls in the assembler output I was surprised to see pointer arithmetic generating a call to the divide function.
I can't see how compilers can avoid the divide unless the size of the struct is a power of 2. Anyone know if cleverer compilers like gcc manage to avoid this somehow?

Division by a constant can usually be optimized into a wide multiplication followed by a shift. This may still be too slow for you, I don't know. But this only happens for pointer subtraction, which can probably be avoided, depending on how you're using it.
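To make the point about pointer subtraction concrete, here is a minimal sketch. The 12-byte `sample_t` record and all three function names are hypothetical, chosen only so that the element size is not a power of two. Subtracting two element pointers implies a divide by `sizeof(sample_t)`; carrying the index along, or working with byte distances, sidesteps it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: 12 bytes on typical targets, not a power of two. */
typedef struct {
    uint32_t id;
    uint32_t value;
    uint32_t flags;
} sample_t;

/* Pointer subtraction: the byte distance must be divided by
   sizeof(sample_t) to yield an element count. */
static ptrdiff_t index_by_subtraction(const sample_t *base, const sample_t *p)
{
    return p - base;
}

/* Alternative: keep the index while searching, so it never has to be
   recovered by subtraction. Returns -1 if the id is not present. */
static ptrdiff_t index_by_counting(const sample_t *base, size_t count, uint32_t id)
{
    for (size_t i = 0; i < count; i++) {
        if (base[i].id == id) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}

/* If only the distance in bytes matters, subtract as byte pointers:
   the element size is 1, so no divide is needed. */
static ptrdiff_t byte_distance(const sample_t *base, const sample_t *p)
{
    return (const char *)p - (const char *)base;
}
```

Whether the compiler emits a library divide call or a multiply-and-shift for `index_by_subtraction` depends on the target and optimisation level, so it is worth checking the assembler output.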

On certain processors, when full optimisations are on, compilers can do strength reduction to turn a divide into a multiply. So for instance instead of dividing by 10 they will multiply by 3435973837 and take the upper 32 bits, which is equivalent to multiplying by 0.8, and then divide by 8 using a shift.
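The divide-by-10 example above can be written out by hand as follows. This is only a sketch of what a compiler may generate; the function name `div10` is made up for illustration, and it assumes a 32x32 to 64-bit multiply is reasonably cheap on the target.

```c
#include <stdint.h>

/* Unsigned 32-bit divide by 10 without a divide instruction.
   0xCCCCCCCD (3435973837) is ceil(2^35 / 10). Shifting the 64-bit product
   right by 32 takes the upper word (a multiply by 0.8); shifting by
   3 more divides by 8. */
static uint32_t div10(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;
    return (uint32_t)(product >> 35);
}
```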

Related

Fastest way to get mod 10 in C

My program uses the operation n % 10 very heavily. I know that the modulo operation can be done much faster for n % m when m is a power of 2, since it can be replaced by n & (m-1). However, is there any faster way to calculate the modulus if the operand is 10?
In my case n is a uint8_t in some cases and a uint32_t in others.
Because most modern processors can do multiplication much, much faster than division, it is often possible to speed up division and modulus operations where the divisor is a known small constant by replacing the division with one or two multiplications and a few other fast operations (such as shift and addition).
To do so requires computing at compile-time some magic numbers dependent on the divisor; fortunately most modern compilers know how to do this so you don't need to do anything to take advantage. Just let your compiler do the heavy lifting for you, as @chux suggests in an excellent answer.
You can help the compiler by using unsigned types; for some divisors, signed division and modulus are harder to replace.
The basic outline of the optimisation of modulus looks like this:
If you had exact arithmetic, you could replace x % p with p * ((x * (1/p)) % 1). For constant p, 1/p can be precomputed at compile time. The % 1 operation simply keeps the fraction part and discards the integer part, which in fixed point is just a mask (or falls out of letting the product wrap); taking the integer part of the final product is then a right-shift. So that replaces a division with two multiplies, and if p only has a few bits set, the multiply by p might be further optimised into a few left-shifts and adds.
We can do that computation with fixed-point arithmetic, taking advantage of the fact that most processors produce a double-sized result for integer multiplication. Since we don't care about the integer part of the inner multiplication and we know that the result of the outer multiplication must be less than p, we only need to reserve ceil(log2 p) bits for the integer part of the computation, leaving the rest of the bits for the fraction. And that might give us enough precision to correctly handle the possible range of values of x, particularly if x has a limited range (eg. uint8_t or even uint16_t). The key is finding a position of the fixed-point which minimises the error in representation of 1/p.
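As a sketch of the fixed-point computation described above, here is x % 10 for 16-bit inputs using 32 fraction bits. The names `INV10_Q32` and `mod10_u16` are hypothetical. The choice of 32 fraction bits is an assumption that is more than enough for 16-bit x; wider inputs would need a wider fraction, so test exhaustively over your own input range before relying on it.

```c
#include <stdint.h>

/* Fixed-point reciprocal of 10 with 32 fraction bits: ceil(2^32 / 10). */
#define INV10_Q32 429496730u

/* x % 10 for 16-bit x, using two multiplies and no divide. */
static uint32_t mod10_u16(uint16_t x)
{
    /* Inner multiply: the 32-bit product wraps, which discards the integer
       part of x/10 and keeps only the fraction, scaled by 2^32. */
    uint32_t fraction = INV10_Q32 * (uint32_t)x;

    /* Outer multiply: fraction * 10, then keep the integer part by taking
       the upper 32 bits of the 64-bit product. */
    return (uint32_t)(((uint64_t)fraction * 10u) >> 32);
}
```

Because the reciprocal is rounded up, the stored fraction is slightly too large; with 16-bit inputs the excess is far too small to push the final product across an integer boundary.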
For many small values of p, that works. For others, there is an alternative (but slower) solution which involves estimating q = x/p using multiplication by the inverse, and then computing x - q * p. If the estimate of q can be guaranteed to be either correct or off by one in a known direction, we only need to correct the final computation by conditionally adding or subtracting p; that can be accomplished without a branch on many modern CPUs. (The direction of the error is known because it will depend only on whether the approximation we chose for the inverse of the divisor was too small or too big.)
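The quotient-estimate variant might look like the sketch below for 32-bit x and p = 10. The function name is hypothetical, and the reciprocal is deliberately rounded down so that the estimate is either exact or one too small, which fixes the direction of the correction.

```c
#include <stdint.h>

/* x % 10 via an estimated quotient and a one-step correction. */
static uint32_t mod10_u32(uint32_t x)
{
    /* floor(2^32 / 10): slightly too small, so q never overshoots. */
    const uint32_t inv10_floor = 429496729u;

    /* Estimated quotient: upper 32 bits of the 64-bit product. */
    uint32_t q = (uint32_t)(((uint64_t)x * inv10_floor) >> 32);

    /* Remainder candidate is in 0..19 because q is exact or one too low. */
    uint32_t r = x - q * 10u;

    /* Single correction; compilers often turn this into a conditional
       move or a masked subtract rather than a branch. */
    if (r >= 10u) {
        r -= 10u;
    }
    return r;
}
```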
In the very specific case of x % 10 where x is a uint8_t, you might be able to do better than the above using a 256-byte lookup table. That would only be worthwhile if you were doing the modulus operation in a tight loop over a large number of values, and even then you'd want to profile carefully to verify that it is an improvement.
I doubt whether that's the best expenditure of your time; there are probably much more fruitful optimisation opportunities in your application.
however Is there any faster way to calculate modulus if the operand is 10?
With a good compiler, no. The compiler would have already emitted good code. You can explore different optimization settings with the compiler.
OTOH, if you know of some restrictions that the compiler cannot assume with n % 10, like values that are always positive or confined to a sub-range, you might be able to out-optimize the compiler.
Such micro-optimisation is usually not efficient use of programmer's time.

Handling Decimals on Embedded C

I have my code below and I want to ask what's the best way in solving numbers (division, multiplication, logarithm, exponents) up to 4 decimals places? I'm using PIC16F1789 as my device.
float sensorValue;
float sensorAverage;
void main(){
//Get an average data by testing 100 times
for(int x = 0; x < 100; x++){
// Get the total sum of all 100 data
sensorValue = (sensorValue + ADC_GetConversion(SENSOR));
}
// Get the average
sensorAverage = sensorValue/100.0;
}
In general, on MCUs, floating point types are more costly (clocks, code) to process than integer types. While this is often true even for devices which have a hardware floating point unit, it becomes vital on devices without one, like the PIC16/18 controllers. These have to emulate all floating point operations in software. This can easily cost >100 clock cycles per addition (much more for multiplication) and bloats the code.
So it is best to avoid float (not to speak of double) on such systems.
For your example, the ADC returns an integer type anyway, so the summation can be done purely with integer types. You just have to make sure the sum does not overflow, so the accumulator has to hold about 100 times the largest ADC reading for your code.
Finally, to calculate the average, you can either divide the integer by the number of iterations (which rounds toward zero), or, better, apply a simple "round to nearest" by:
#define NUMBER_OF_ITERATIONS 100
sensorAverage = (sensorValue + NUMBER_OF_ITERATIONS / 2) / NUMBER_OF_ITERATIONS;
If you really want to speed up your code, set NUMBER_OF_ITERATIONS to a power of two (64 or 128 here), if your code can tolerate this.
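Putting the integer sum, the rounding term and the power-of-two sample count together gives something like the following sketch. The names `AVG_SHIFT`, `AVG_COUNT` and `average_samples` are hypothetical, and the samples are passed in as an array (rather than read from `ADC_GetConversion`) purely so the example is self-contained. It assumes each reading fits in 16 bits.

```c
#include <stdint.h>

#define AVG_SHIFT 7u
#define AVG_COUNT (1u << AVG_SHIFT)   /* 128 samples, a power of two */

/* Averages AVG_COUNT readings using only integer adds and one shift. */
static uint16_t average_samples(const uint16_t *samples)
{
    uint32_t sum = 0;   /* 128 * 65535 fits comfortably in 32 bits */

    for (uint16_t i = 0; i < AVG_COUNT; i++) {
        sum += samples[i];
    }

    /* Round to nearest: add half the divisor, then divide by shifting. */
    return (uint16_t)((sum + AVG_COUNT / 2u) >> AVG_SHIFT);
}
```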
Finally: to get more than just the integer part of the division, you can treat the sum (sensorValue) as a fractional value. For the given 100 iterations, you can treat it as a decimal fraction: when converting to a string, just print a decimal point to the left of the lower 2 digits. As you divide by 100, there will be no more than two significant digits of decimal fraction. If you really need 4 digits, e.g. for other operations, you can multiply the sum by 100 (the total scale is then 10000, since the loop has already multiplied by 100).
This is called decimal fixed point. Binary fixed point would be faster to process, because scaling by a power of two is a shift rather than a multiplication or division, as I stated above.
On PIC16, I would strongly suggest considering binary fractions, as multiplication and division are very costly on this platform. In general, these devices are not well suited for signal processing. If you need to sustain some performance, an ARM Cortex-M0 or M4 would be the far better choice, at similar prices.
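As an illustration of the decimal fixed-point idea, the sum of exactly 100 readings can be printed as an average with two decimals without any floating point. The function name is hypothetical, and `snprintf` is used only for brevity; on a small PIC a hand-rolled digit conversion is likely to be cheaper.

```c
#include <stdint.h>
#include <stdio.h>

/* The sum of 100 readings already is the average in units of 1/100, so
   formatting only has to place the decimal point. Returns the number of
   characters written, as snprintf does. */
static int format_average_x100(uint32_t sum_of_100, char *out, size_t out_size)
{
    return snprintf(out, out_size, "%lu.%02lu",
                    (unsigned long)(sum_of_100 / 100u),
                    (unsigned long)(sum_of_100 % 100u));
}
```

The divide and modulus by 100 here run once per displayed value, not once per sample, which is usually acceptable even when division is slow.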
In your example it is trivial to avoid non-integer representations altogether. To answer your question more generally, however: an ISO-compliant compiler will support floating point arithmetic and the maths library, but for performance and code size reasons you may want to avoid them.
Fixed-point arithmetic is what you probably need. For simple calculations an ad-hoc approach to fixed point can be used whereby for example you treat the units of sensorAverage in your example as hundredths (1/100), and avoid the expensive division altogether. However if you want to perform full maths library operations, then a better approach is to use a fixed-point library. One such library is presented in Optimizing Applications with Fixed-Point Arithmetic by Anthony Williams. The code is C++ and PIC16 may lack a decent C++ compiler, but the methods can be ported somewhat less elegantly to C. It also uses a huge 64bit fixed-point 36Q28 format, which would be expensive and slow on PIC16; you might want to adapt it to use 16Q16 perhaps.
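For a feel of what a 16Q16 format involves, here is a minimal sketch in C. It is not the library from the article mentioned above; the type and function names are invented for illustration, and it leans on a 64-bit intermediate, which is itself costly on an 8-bit part.

```c
#include <stdint.h>

typedef int32_t q16_16_t;             /* 16 integer bits, 16 fraction bits */
#define Q16_ONE ((q16_16_t)1 << 16)   /* the value 1.0 */

static q16_16_t q16_from_int(int16_t v)
{
    return (q16_16_t)v * Q16_ONE;
}

/* Addition and subtraction are plain integer operations on q16_16_t.
   Multiplication needs a wider intermediate and a rescale; the division
   by Q16_ONE truncates toward zero and avoids shifting a negative value. */
static q16_16_t q16_mul(q16_16_t a, q16_16_t b)
{
    return (q16_16_t)(((int64_t)a * b) / Q16_ONE);
}

/* Division pre-scales the dividend so the quotient keeps 16 fraction
   bits. The caller must ensure b is non-zero. */
static q16_16_t q16_div(q16_16_t a, q16_16_t b)
{
    return (q16_16_t)(((int64_t)a * Q16_ONE) / b);
}
```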
If you are really concerned about performance, stick to integer arithmetic and try to make the number of samples to average a power of two, so the division can be done with bit shifts. If it is not a power of two, say 100 (as Olaf points out for fixed point), you can still use bit shifts and additions: How can I multiply and divide using only bit shifting and adding?
If you are not concerned about performance and still want to work with floats (you have already been warned that this may not be very fast on a PIC16 and may use a lot of flash), math.h has the following functions: http://en.cppreference.com/w/c/numeric/math including exponentiation, pow(base,exp), and logarithms (only base 2, base 10 and base e are provided; for an arbitrary base, use the change-of-base identity).
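A small sketch of the shift-and-add idea: multiplying by 100 decomposes into three shifts because 100 = 64 + 32 + 4, and a general divide can be built from shifts and subtracts (classic restoring division). Both function names are hypothetical, and the divide loop is shown for clarity rather than speed.

```c
#include <stdint.h>

/* x * 100 as (x << 6) + (x << 5) + (x << 2). */
static uint32_t mul100(uint16_t x)
{
    uint32_t v = x;
    return (v << 6) + (v << 5) + (v << 2);
}

/* Restoring division using only shifts, compares and subtracts.
   The caller must ensure d is non-zero. */
static uint16_t div_shift_sub(uint16_t n, uint16_t d, uint16_t *rem)
{
    uint16_t q = 0;
    uint32_t r = 0;

    for (int bit = 15; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((uint32_t)(n >> bit) & 1u);
        if (r >= d) {
            r -= d;
            q |= (uint16_t)(1u << bit);
        }
    }
    *rem = (uint16_t)r;
    return q;
}
```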

Is NEON of ARM faster for integers than floating points?

Or both floating point and integer operations are same speed? And if not so, how much faster is the integer version?
You can find information about instruction-specific scheduling for Advanced SIMD instructions for the Cortex-A8 (they don't publish it for newer cores, since timing has become quite complicated).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating point instructions take two cycles while instructions executed on the ALU take one cycle. On the other hand, multiplication of long long (8-byte integer) takes four cycles (from the same source) while multiplication of double takes two cycles.
In general it seems that float versus integer matters less than carefully choosing the data type (float vs double, int vs long long).
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64 bit numbers back to 32 bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. To achieve good performance if you are using Q15, then you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
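For readers unfamiliar with Q15, the following scalar C function models what one 16-bit lane of a saturating doubling multiply returning the high half computes, as I understand the VQDMULH semantics. It is a portable illustration of the arithmetic, not NEON code; the function name is hypothetical, and it assumes right-shifting a negative int32_t is an arithmetic shift (true for GCC and Clang).

```c
#include <stdint.h>

/* Q15 multiply: both inputs and the result represent values in [-1, 1)
   scaled by 2^15. */
static int16_t q15_mul_sat(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* Q30 result */

    /* Doubling -1.0 * -1.0 would overflow, so saturate that one case. */
    if (product == 0x40000000) {
        return INT16_MAX;
    }

    /* Doubling then taking the high 16 bits equals shifting right by 15. */
    return (int16_t)(product >> 15);
}
```

On NEON the same operation is applied to several lanes per instruction, which is where the advantage over single-precision float comes from in the 16-bit case.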
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.

How can floating point calculations be made deterministic?

Floating point calculation is neither associative nor distributive on processors. So,
(a + b) + c is not equal to a + (b + c)
and a * (b + c) is not equal to a * b + a * c
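A small sketch that shows the effect (the function name is made up, and the `volatile` stores are only there to force each result to be rounded to double):

```c
/* Returns 1 if the two groupings of a + b + c give different doubles. */
static int grouping_changes_sum(double a, double b, double c)
{
    volatile double left = (a + b) + c;
    volatile double right = a + (b + c);
    return left != right;
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first grouping yields 1.0, while in the second the 1.0 is absorbed by -1e20 and the result is 0.0.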
Is there any way to perform deterministic floating point calculations that do not give different results? It would be deterministic on a uniprocessor of course, but it would not be deterministic in multithreaded programs if threads add to a sum, for example, as there might be different interleavings of the threads.
So my question is, how can one achieve deterministic results for floating point calculations in multithreaded programs?
Floating-point is deterministic. The same floating-point operations, run on the same hardware, always produce the same result. There is no black magic, noise, randomness, fuzzing, or any of the other things that people commonly attribute to floating-point. The tooth fairy does not show up, take the low bits of your result, and leave a quarter under your pillow.
Now, that said, certain blocked algorithms that are commonly used for large-scale parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
What can you do about it?
First, make sure that you actually can't live with the situation. Many things that you might try to enforce ordering in a parallel computation will hurt performance. That's just how it is.
I would also note that although blocked algorithms may introduce some amount of non-determinism, they frequently deliver results with smaller rounding errors than do naive unblocked serial algorithms (surprising but true!). If you can live with the errors produced by a naive serial algorithm, you can probably live with the errors of a parallel blocked algorithm.
Now, if you really, truly, need exact reproducibility across runs, here are a few suggestions that tend not to adversely affect performance too much:
Don't use multithreaded algorithms that can reorder floating-point computations. Problem solved. This doesn't mean you can't use multithreaded algorithms at all, merely that you need to ensure that each individual result is only touched by a single thread between synchronization points. Note that this can actually improve performance on some architectures if done properly, by reducing D$ contention between cores.
In reduction operations, you can have each thread store its result to an indexed location in an array, wait for all threads to finish, then accumulate the elements of the array in order. This adds a small amount of memory overhead, but is generally pretty tolerable, especially when the number of threads is "small".
Find ways to hoist the parallelism. Instead of computing 24 matrix multiplications, each one of which uses parallel algorithms, compute 24 matrix products in parallel, each one of which uses a serial algorithm. This, too, can be beneficial for performance (sometimes enormously so).
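The indexed-slot reduction from the suggestions above can be sketched as follows. Thread creation and joining are deliberately left out: `worker_sum` is the body each thread would run, and `combine_in_order` runs after all of them have finished. `NUM_WORKERS` and both function names are hypothetical.

```c
#include <stddef.h>

#define NUM_WORKERS 4

/* Work done by one thread: sum a fixed, contiguous chunk into its own
   slot. The chunk boundaries depend only on the worker index, never on
   timing, and no other thread touches partial[worker]. */
static void worker_sum(const double *data, size_t n, size_t worker, double *partial)
{
    size_t begin = n * worker / NUM_WORKERS;
    size_t end = n * (worker + 1) / NUM_WORKERS;
    double s = 0.0;

    for (size_t i = begin; i < end; i++) {
        s += data[i];
    }
    partial[worker] = s;
}

/* Runs after every worker has finished: add the slots in index order,
   so the rounding sequence is the same on every run. */
static double combine_in_order(const double *partial)
{
    double total = 0.0;

    for (size_t w = 0; w < NUM_WORKERS; w++) {
        total += partial[w];
    }
    return total;
}
```

The order in which workers complete does not matter, because each one writes only its own slot and the final accumulation always walks the array from 0 upward.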
There are lots of other ways to handle this. They all require thought and care. Parallel programming usually does.
I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.
Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating point. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example if you know the accumulator will never exceed 2^32, you can start the accumulator at 0x1p32. This will lock you at 32 bits of precision to the left of the radix point, and 20 bits of fractional precision (assuming double). If that's not enough precision, you could use a smaller bias (assuming the accumulator will not grow too large) or switch to long double. If long double is 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision.
Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.
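A sketch of the bias trick with double follows. The names are hypothetical, and the example assumes non-negative addends whose total stays below 2^32, so the accumulator never leaves the range [2^32, 2^33). An addend that falls exactly halfway between two multiples of 2^-20 could still round differently depending on order, so this reduces order dependence rather than providing a proof of it.

```c
#include <stddef.h>

#define ACC_BIAS 0x1p32   /* C99 hexadecimal floating constant: 2^32 */

/* Sums the values with the exponent pinned by the bias, so every
   addition is rounded to the same fixed step of 2^-20. */
static double biased_sum(const double *values, size_t n)
{
    /* volatile forces each partial sum to be rounded to double. */
    volatile double acc = ACC_BIAS;

    for (size_t i = 0; i < n; i++) {
        acc = acc + values[i];
    }

    /* Remove the bias to recover the actual total. */
    return acc - ACC_BIAS;
}
```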
Even using a high-precision fixed point datatype would not solve the problem of making the results for said equations deterministic (except in certain cases). As Keith Thompson pointed out in a comment, 1/3 is a trivial counter-example of a value that cannot be stored correctly in either a standard base-10 or base-2 floating point representation (regardless of precision or memory used).
One solution that, depending upon particular needs, may address this issue (it still has limits) is to use a Rational number data-type (one that stores both a numerator and denominator). Keith suggested GMP as one such library:
GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating point numbers. There is no practical limit to the precision...
Whether it is suitable (or adequate) for this task is another story...
Happy coding.
Use a decimal type or library supporting such a type.
Try storing each intermediate result in a volatile object:
volatile double a_plus_b = a + b;
volatile double a_plus_b_plus_c = a_plus_b + c;
This is likely to have nasty effects on performance. I suggest measuring both versions.
EDIT: The purpose of volatile is to inhibit optimizations that might affect the results even in a single-threaded environment, such as changing the order of operations or storing intermediate results in wider registers. It doesn't address multi-threading issues.
EDIT2: Something else to consider is that
A floating expression may be contracted, that is, evaluated as though
it were an atomic operation, thereby omitting rounding errors implied
by the source code and the expression evaluation method.
This can be inhibited by using
#include <math.h>
...
#pragma STDC FP_CONTRACT OFF
Reference: C99 standard (large PDF), sections 7.12.2 and 6.5 paragraph 8. This is C99-specific; some compilers might not support it.
Use packed decimal.

64-bit integer implementation for 8-bit microcontroller

I'm working on an OKI 431 microcontroller. This is an 8-bit microcontroller. We don't want any floating point operations in our project, so we've eliminated them all and converted them into integer operations in some way. But there is one floating point operation we cannot eliminate, because converting that calculation to integer arithmetic requires 64-bit integers, which the micro doesn't natively support. It has a C compiler that supports up to 32-bit integer operations. The calculation takes long enough that the delay is noticeable to the user.
I'm wondering if there is any 64-bit integer library that can be easily used in C for microcontroller coding. Or what is the easiest way to write such a thing efficiently? Here "efficiently" means minimizing the amount of time required.
Thanks in advance.
Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.
You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
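The same building blocks can be sketched in C on top of the 32-bit integers the compiler already supports; whether this is fast enough compared with hand-written assembly would have to be measured. The `u64_pair` type and all function names are hypothetical, and the right shift shown is the logical (unsigned) variant.

```c
#include <stdint.h>

/* A 64-bit unsigned value held as two 32-bit halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

static u64_pair u64_add(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
    return r;
}

/* Two's complement: invert and increment. */
static u64_pair u64_negate(u64_pair a)
{
    u64_pair inverted = { ~a.lo, ~a.hi };
    u64_pair one = { 1u, 0u };
    return u64_add(inverted, one);
}

static u64_pair u64_sub(u64_pair a, u64_pair b)
{
    return u64_add(a, u64_negate(b));
}

static u64_pair u64_shl1(u64_pair a)
{
    u64_pair r = { a.lo << 1, (a.hi << 1) | (a.lo >> 31) };
    return r;
}

static u64_pair u64_shr1(u64_pair a)
{
    u64_pair r = { (a.lo >> 1) | (a.hi << 31), a.hi >> 1 };
    return r;
}

/* 32 x 32 -> 64 multiply built from add and shift-by-one only. */
static u64_pair u64_mul_u32(uint32_t x, uint32_t y)
{
    u64_pair acc = { 0u, 0u };
    u64_pair addend = { x, 0u };

    while (y != 0u) {
        if (y & 1u) {
            acc = u64_add(acc, addend);
        }
        addend = u64_shl1(addend);
        y >>= 1;
    }
    return acc;
}
```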
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it, but it could also be even slower.
Whenever speed is a problem with floating point math in small embedded systems, and when integer math is not enough, fixed point math is a fast replacement.
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit
