Is there any particular speed at which the compiler reads code? - c

I have noticed that 10^7 (10,000,000) increments take about 10 seconds in my environment.
Here is an example of a custom function that works for me; it wastes x seconds before the next line:
void pause(unsigned short seconds)
{
    unsigned long long f;
    unsigned long long deltaTime = seconds * 10000000ULL;  /* ULL so the multiplication does not overflow int */

    for (f = 0; f < deltaTime; f++)
        ;   /* busy-wait */
}
With this function you can request a specific number of seconds to "pause".
However, I am not sure whether that is even correct. Maybe the speed of executing the code depends on the compiler, or the processor, or both?

Several things wrong here:
In most compilers, if you enable optimizations (-O), the loop will be removed entirely once the compiler realizes it does nothing.
The speed of the loop is determined by the compiler, the processor, system load, and many other factors.
There's already a sleep function (see the sketch below).
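For reference, a minimal sketch of that approach on a POSIX system (sleep() from <unistd.h>; on Windows the equivalent is Sleep() from <windows.h>):

#include <unistd.h>   /* POSIX sleep() */

int main(void)
{
    /* Suspends the calling thread for 10 seconds; the CPU is free for
       other work, and the optimizer cannot remove the call. */
    sleep(10);
    return 0;
}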

Related

Time measurements differ on microcontroller

I am measuring the cycle count of different C functions which I am trying to make constant time in order to mitigate side-channel attacks (crypto).
I am working with a microcontroller (Aurix from Infineon) which has an onboard cycle counter that gets incremented on each clock tick and which I can read out.
Consider the following:
int result[32], cnt = 0;
int secret[32];
/* ... some other code ... */
reset_and_startCounter();            // resets cycles to 0 and starts the counter
int tmp = readCycles();              // read cycles before the function call
function(secret);                    // the function to measure; should be constant time
result[cnt++] = readCycles() - tmp;  // read cycles again and subtract to get the result
When I measure the cycles as shown above, I sometimes get a different number of cycles depending on the input given to the function (a difference of ~1-10 cycles; the function itself takes about 3000 cycles).
I wondered whether the function was not yet perfectly constant time and the calculations depended on some input. I looked into the function and did the following:
void function(int* input){
    reset_and_startCounter();
    int tmp = readCycles();

    /*********************************
     ***** calculations on input *****
     *********************************/

    result[cnt++] = readCycles() - tmp;
}
and I received the same number of cycles no matter what input was given.
I then also measured the time needed just to call the function and to return from it. Both measurements were the same regardless of the input.
I was always using the gcc compiler flags -O3 and -fomit-frame-pointer: -O3 because the runtime is critical and I need the code to be fast. Also important: no other code was running on the microcontroller (no OS etc.).
Does anyone have a possible explanation for this? I want to be sure that my code is constant time and that those extra cycles are arbitrary...
And sorry for not providing runnable code here, but I believe not many have an Aurix lying around :O
Thank you
The Infineon Aurix microcontroller you're using is designed for hard real-time applications. It has intentionally been designed to provide consistent runtime performance -- it lacks most of the features that can lead to inconsistent performance on more sophisticated CPUs, like cache memory or branch prediction.
While showing that your code has constant runtime on this part is a start, it is still possible for your code to have variable runtime when run on other CPUs. It is also possible that a device containing this CPU may leak information through other channels, particularly through power analysis. If making your application resistant to side-channel analysis is critical, you may want to consider using a part designed for cryptographic applications. (The Aurix is not such a part.)
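As a cross-check on the measurement itself, one common approach is to measure each input many times and compare the minimum cycle counts: if the minima agree across inputs while individual runs jitter by a few cycles, the variation is in the call/measurement overhead rather than in the calculation. A sketch only, assuming the un-instrumented version of function() and the reset_and_startCounter()/readCycles() helpers from the question:

#define RUNS 1000

/* Hypothetical helper: smallest cycle count seen over RUNS measurements. */
int min_cycles(int *input)
{
    int best = -1;
    for (int r = 0; r < RUNS; r++) {
        reset_and_startCounter();          /* reset hardware cycle counter */
        int tmp = readCycles();
        function(input);                   /* the code under test */
        int cycles = readCycles() - tmp;
        if (best < 0 || cycles < best)
            best = cycles;
    }
    return best;
}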

Fastest (optimal in time) way of converting seconds and nanoseconds to microseconds

I want to write a function in C that takes seconds and nanoseconds as input, converts them to microseconds, and returns the total in microseconds.
unsigned long long get_microseconds(int seconds, unsigned long long nSeconds);
Now the conversion is pretty trivial. I can use the following formula:
mSeconds = Seconds*1000000 + nSeconds/1000 (loss of precision in the nanosecond conversion is alright; my timer has a minimum resolution of 100 microseconds anyway)
What would be the fastest way of implementing this equation without using multiplication and division operators, to get the best accuracy and the least number of CPU cycles?
EDIT: I am running on a custom DSP with a GNU-based but custom-designed toolchain. I have not actually tested the performance of the arithmetic operations; I am simply curious whether they would affect performance and whether there is a way to improve it.
return Seconds*1000000 + nSeconds/1000;
If there's any worthwhile bit-shifting or other bit manipulation worth doing, your compiler will probably take care of it.
The compiler will almost certainly optimize the multiplication as far as it can. What it will not do is "accept a small loss" when dividing by 1000, so you will perhaps find it somewhat faster to write
return Seconds*1000000 + nSeconds/1024; /* Explicitly show the error */
...keeping in mind that nSeconds can't grow too much, or the error may become unacceptable.
But whatever you do, test the results - both speed and accuracy over real inputs. Also explore converting the function to a macro, saving the call altogether. Frankly, for so simple a calculation there's precious little chance of doing better than an optimizing compiler.
Also, consider the weight of this optimization in the scope of the global algorithm. Is this function really called with such a frequency that its savings are worth the hassle?
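To make the caveat about the /1024 error concrete: dividing by 1024 instead of 1000 undershoots by about 2.3%, which for nSeconds just under 10^9 is roughly 23,000 microseconds. A quick standalone check you could adapt to your actual input range:

#include <stdio.h>

int main(void)
{
    unsigned long long nSeconds = 999999999ULL;        /* worst case for a timespec-style split time */
    unsigned long long exact  = nSeconds / 1000;       /* 999999 us */
    unsigned long long approx = nSeconds / 1024;       /* 976562 us, about 2.3% low */
    printf("exact=%llu approx=%llu error=%llu us\n",
           exact, approx, exact - approx);
    return 0;
}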
If nSeconds never gets above 2^32 (it shouldn't if you are working with "split time" as from timespec - it should be below 10^9), you should probably use a 32-bit integer for it.
On a 64-bit machine it's not a problem to use 64-bit integers for everything (the division is optimized to a multiply by inverse + shift), but on a 32-bit one the compiler gets tricked into using a full 64-bit division routine, which is quite heavyweight. So, I would do:
unsigned long long get_microseconds(int seconds, unsigned long nSeconds) {
    return seconds * 1000000ULL + nSeconds / 1000;
}
This, at least on x86, doesn't call external routines and manages to keep the 64 bit overhead to a minimum.
Of course, these are tests done on x86 (which has a 32x32=>64 multiply instruction even in 32 bit mode), given that you are working on a DSP you would need to check the actual code produced by your compiler.
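For completeness, a typical call site on a POSIX system (your DSP toolchain may expose a different time source), using the get_microseconds() defined above:

#include <time.h>

unsigned long long now_microseconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* tv_nsec is always < 10^9 */
    return get_microseconds((int)ts.tv_sec, (unsigned long)ts.tv_nsec);
}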

Looping timer function in avr?

I recently had to make an Arduino project using the AVR library and without the delay library. For that I had to create my own implementation of a delay function.
After searching on the internet I found this particular code in many, many places.
The only explanation I got was that it kills time in a calibrated manner.
void delay_ms(int ms) {
    int delay_count = F_CPU / 17500;   // where does this 17500 come from?
    volatile int i;
    while (ms != 0) {
        for (i = 0; i != delay_count; i++)
            ;                          // empty busy-wait loop
        ms--;
    }
}
I am not able to understand how this works (though it did do the job), i.e., how the delay count was determined to be F_CPU/17500. Where is this number coming from?
Delay functions are better done in assembly, because you must know how many instruction cycles your code takes in order to know how many times to repeat it to achieve the total delay.
I didn't test your code, but this value (17500) is chosen so that the inner loop takes 1 ms.
For example, if F_CPU = 1000000 then delay_count = 57; for the loop to reach 1 ms by counting to 57, a simple calculation shows that each count must take about 17.5 us, which is the time one loop iteration takes when compiled to assembly (in other words, each iteration of the compiled loop is assumed to take about 17.5 clock cycles).
But of course different compiler versions will produce different assembly code, which means an inaccurate delay.
My advice is to use the standard avr/delay.h library; I cannot see any reason why you can't use it. But if you must create your own, you should learn assembly!
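For reference, a minimal sketch of the standard avr-libc approach recommended above (the header is <util/delay.h> in current avr-libc; F_CPU must be defined before the include, often via -DF_CPU=... on the command line, and optimization must be enabled for _delay_ms() to be accurate):

#define F_CPU 16000000UL     /* adjust to your clock; value here is an assumption */
#include <util/delay.h>

void delay_ms_builtin(unsigned int ms)
{
    /* _delay_ms() wants a compile-time constant argument, so loop in
       1 ms steps to get a runtime-variable delay. */
    while (ms--)
        _delay_ms(1);
}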

Delay on PIC18F

I'm using a PIC18F14K50 with HiTech ANSI C Compiler and MPLAB v8.43. My PIC code is finally up and running and working, with the exception of the delay function. This is crucial for my application - I need it to be in certain states for a given number of milliseconds, seconds, or minutes.
I have been trying to find a solution for this for about 2 weeks but have been unsuccessful so far. I gave up and wrote my own delay function with asm("nop"); in a loop, but this gives very unpredictable results. If I tell it to wait for half a second or 5 seconds, it works accurately enough. But as soon as I tell it to wait longer - like 10 minutes - the delay only lasts about 10-20 seconds, and 2 minutes ends up being a blink, shorter than a 500 ms delay.
Here are my config fuses and wait() function:
#include <htc.h>

__CONFIG(1, FOSC_IRC & FCMEN_OFF & IESO_OFF & XINST_OFF);
__CONFIG(2, PWRTEN_OFF & BOREN_OFF & WDTEN_OFF);
__CONFIG(3, MCLRE_OFF);
__CONFIG(4, STVREN_ON & LVP_OFF & DEBUG_OFF);
__CONFIG(5, 0xFFFF);
__CONFIG(6, 0xFFFF);
__CONFIG(7, 0xFFFF);

void wait(int ms)
{
    for (int i = 0; i < ms; i++)
        for (int j = 0; j < 12; j++)
            asm("nop");
}
Like I said, if I call wait(500) up to wait(30000) then I get a half-second to 30-second delay within the tolerance I'm interested in - however, if I call wait(600000) then I do not get a 10-minute delay as I would expect, but rather about 10-15 seconds, and wait(120000) doesn't give a 2-minute delay, but rather a quick blink.
Ideally, I'd like to get the built-in __delay_ms() function working and being called from within my wait(), however I haven't had any success with this. If I try to #include <delay.h> then my MPLAB complains there is no such file or directory. If I look at the delay.h in my HiTech samples, there is a DelayUs(unsigned char) defined and an extern void DelayMs(unsigned char) which I haven't tried, however when I try to put the extern directly into my C code, I get an undefined symbol error upon linking.
The discrepancy between the short to medium delays and the long delays makes no sense. The only explanation I have is that the compiler has optimised out the NOPs or something.
Like I said, it's a PIC18F14K50 with the above configuration fuses. I don't have a great deal of experience with PICs, but I assume it's running at 4MHz given this set-up.
I'm happy with an external function from a library or macro, or with a hand-written function with NOPs. All I need is for it to be accurate to within a couple of seconds per minute or so.
Is the PIC a 16-bit microcontroller? My guess is that you're getting overflow on the value passed to wait(), which would overflow above 2^15 - 1 (32,767 is the maximum value of a signed 16-bit int).
If you change your int variables to unsigned, you can go up to 65,535 ms. To go higher than that, you need to use long as your parameter type and nest your loops even deeper.
A better long-term solution would be to write a delay function that uses one of the built-in hardware timers in your chip. Your NOP delay will not be accurate over long periods if things like other interrupts are firing and using CPU cycles.
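For reference, a sketch combining the wider parameter type with the built-in __delay_ms() macro mentioned in the question (renamed wait_ms here to avoid clashing with the existing wait(); it assumes your HI-TECH version provides __delay_ms(), which requires _XTAL_FREQ to be defined to the oscillator frequency, 4 MHz here to match the internal-oscillator configuration). A hardware-timer based delay, as suggested above, would still be more robust for multi-minute waits.

#define _XTAL_FREQ 4000000       /* assumed internal oscillator frequency */

void wait_ms(unsigned long ms)   /* unsigned long, so wait_ms(600000) cannot overflow */
{
    while (ms--)
        __delay_ms(1);           /* compiler-provided, cycle-counted 1 ms busy-wait */
}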

Illogical benchmarking?

I witnessed the following weird behavior. I have two functions which do almost the same thing - they measure the number of cycles it takes to do a certain operation. In one function, I increment a variable inside the loop; in the other, nothing happens. The variables are volatile so they won't be optimized away. These are the functions:
unsigned int _osm_iterations = 5000;

double osm_operation_time(){
    // volatile is used so that j will not be optimized away, and the ++
    // operation will be done in each loop iteration
    volatile unsigned int j = 0;
    volatile unsigned int i;
    tsc_counter_t start_t, end_t;

    start_t = tsc_readCycles_C();
    for (i = 0; i < _osm_iterations; i++){
        ++j;
    }
    end_t = tsc_readCycles_C();

    if (tsc_C2CI(start_t) == 0 || tsc_C2CI(end_t) == 0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
        return -1;
    return (tsc_C2CI(end_t) - tsc_C2CI(start_t)) / _osm_iterations;
}

double osm_empty_time(){
    volatile unsigned int i;
    volatile unsigned int j = 0;
    tsc_counter_t start_t, end_t;

    start_t = tsc_readCycles_C();
    for (i = 0; i < _osm_iterations; i++){
        ;
    }
    end_t = tsc_readCycles_C();

    if (tsc_C2CI(start_t) == 0 || tsc_C2CI(end_t) == 0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
        return -1;
    return (tsc_C2CI(end_t) - tsc_C2CI(start_t)) / _osm_iterations;
}
There are some non-standard functions there but I'm sure you'll manage.
The thing is, the first function returns 4, while the second function returns 6, even though the second one obviously does less than the first.
Does that make any sense to anyone?
Actually I made the first function so I could reduce the loop overhead for my measurement of the second. Do you have any idea how to do that (as this method doesn't really cut it)?
I'm on Ubuntu (64 bit I think).
Thanks a lot.
I can see a couple of things here. One is that the code for the two loops looks identical. Secondly, the compiler will probably realise that the variable i and the variable j will always have the same value and optimise one of them away. You should look at the generated assembly and see what is really going on.
Another theory is that the change to the inner body of the loop has affected the cacheability of the code - this could have moved it across cache lines or something similar.
Since the code is so trivial, you may find it difficult to get an accurate timing value; even if you are doing 5000 iterations, you may find that the time is inside the margin of error for the timing code you are using. A modern computer can probably run that in far less than a millisecond - perhaps you should increase the number of iterations?
To see the generated assembly in gcc, specify the -S compiler option:
Q: How can I peek at the assembly code generated by GCC?
Q: How can I create a file where I can see the C code and its assembly translation together?
A: Use the -S (note: capital S) switch to GCC, and it will emit the assembly code to a file with a .s extension. For example, the following command:
gcc -O2 -S -c foo.c
will leave the generated assembly code on the file foo.s.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
which will output the combined C/assembly listing to the file foo.lst.
It's sometimes difficult to guess at this sort of thing, especially due to the small number of iterations. One thing that might be happening, though, is that the increment could be executing on a free integer execution unit, gaining some slight degree of parallelism, since it has no dependency on the value of i.
Since you mentioned this is a 64-bit OS, it's almost certain all these values are in registers, since there are more registers in the x86_64 architecture. Other than that, I'd say perform many more iterations and see how stable the results are.
If you are truly trying to test the operation of a piece of code ("j++;" in this case), you're actually better off doing the following:
1/ Do it in two separate executables since there is a possibility that position within the executable may affect the code.
2/ Make sure you use CPU time rather than elapsed time (I'm not sure what "tsc_readCycles_C()" gives you). This is to avoid errant results from a CPU loaded up with other tasks (see the sketch after this list).
3/ Turn off compiler optimization (e.g., "gcc -O0") to ensure gcc doesn't put in any fancy stuff that's likely to skew the results.
4/ You don't need to worry about volatile if you use the actual result, such as placing:
printf ("%d\n",j);
after the loop, or:
FILE *fx = fopen ("/dev/null","w");
fprintf (fx, "%d\n", j);
fclose (fx);
if you don't want any output at all. I can't remember whether volatile was a suggestion to the compiler or enforced.
5/ Iterations of 5,000 seem a little on the low side, where "noise" could affect the readings. Maybe a higher value would be better. This may not be an issue if you're timing a larger piece of code and you've just included "j++;" as a place-holder.
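Regarding point 2, a minimal sketch of measuring CPU time with clock() from <time.h> (standard C; its resolution is coarse, CLOCKS_PER_SEC ticks, so keep the iteration count high):

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile unsigned int j = 0;
    clock_t start = clock();               /* CPU time, not wall-clock time */
    for (unsigned int i = 0; i < 100000000u; i++)
        ++j;
    clock_t end = clock();
    printf("j=%u, cpu time=%.3f s\n", j,
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}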
When I'm running tests similar to this, I normally:
Ensure that the times are measured in at least seconds, preferably (small) tens of seconds.
Have a single run of the program call the first function, then the second, then the first again, then the second again, and so on, just to see if there are weird cache warmup issues (see the sketch below).
Run the program multiple times to see how stable the timing is across runs.
I'm still at a loss to explain your observed results, but if you're sure you've got your functions identified properly (not self-evidently the case given that there were copy'n'paste errors earlier, for example), then looking at the assembler output is the main option left.
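A minimal sketch of that interleaving, using the two functions from the question (it assumes osm_operation_time() and osm_empty_time() are declared as shown above):

#include <stdio.h>

int main(void)
{
    /* Alternate the two measurements several times so that cache and
       branch-predictor warm-up affects both functions roughly equally,
       then eyeball the run-to-run stability. */
    for (int pass = 0; pass < 5; pass++) {
        double with_inc = osm_operation_time();
        double empty    = osm_empty_time();
        printf("pass %d: ++j loop = %.2f cycles/iter, empty loop = %.2f cycles/iter\n",
               pass, with_inc, empty);
    }
    return 0;
}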
