When I run gprof on my C program it reports "no time accumulated" and shows zero time for every function, although it does count the function calls.
How do I modify my program so that gprof will be able to count how much time something takes to run?
Did you specify -pg when compiling?
http://sourceware.org/binutils/docs-2.20/gprof/Compiling.html#Compiling
Once it is compiled, you run the program and then run gprof on the binary.
E.g.:
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 10000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Compile with cc -pg test.c, run the resulting a.out, then run gprof a.out; that gives me
granularity: each sample hit covers 4 byte(s) for 1.47% of 0.03 seconds
  %   cumulative    self               self     total
 time   seconds   seconds    calls   ms/call   ms/call  name
 45.6      0.02      0.02    10000      0.00      0.00  __sys_write [10]
 45.6      0.03      0.02        0    100.00%           .mcount (26)
  2.9      0.03      0.00    20000      0.00      0.00  __sfvwrite [6]
  1.5      0.03      0.00    20000      0.00      0.00  memchr [11]
  1.5      0.03      0.00    10000      0.00      0.00  __ultoa [12]
  1.5      0.03      0.00    10000      0.00      0.00  _swrite [9]
  1.5      0.03      0.00    10000      0.00      0.00  vfprintf [2]
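Spelled out as shell commands (the output redirect here is my addition, just to keep the terminal quiet; gmon.out is written when the program exits normally):

cc -pg test.c
./a.out > /dev/null
gprof a.out gmon.out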
What are you getting?
I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 1000000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of its time flushing the output buffer to disk (or console) as part of the printf on line 7.
(What I mean by the "cost" of a line is the fraction of total time spent at the request of that line, and that's roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if there is some of your code that is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.
One gotcha with gprof: it doesn't work with code in dynamically-linked libraries. For that, you need to use sprof. See this answer: gprof : How to generate call graph for functions in shared library that is linked to main program
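For the record, the sprof workflow looks roughly like this (a sketch; the soname libcrypto.so.1.1 is just a placeholder for whatever library your binary actually loads):

LD_PROFILE=libcrypto.so.1.1 LD_PROFILE_OUTPUT=. ./a.out
sprof -p libcrypto.so.1.1 ./libcrypto.so.1.1.profile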
First compile your application with -g, and check which CPU counters you are using.
If your application runs very quickly, the profiler may simply miss all the events, or collect fewer than required (try reducing the number of events per sample).
Ordinary profiling should show you CPU_CLK_UNHALTED or INST_RETIRED events without any special switches. But with such data you will only be able to say how efficiently your code is performing: INST_RETIRED/CPU_CLK_UNHALTED (instructions retired per unhalted cycle).
Try the Intel VTune profiler; it's free for 30 days and for educational use.
I am trying to profile a C program that uses some methods of openssl/libcrypto. Everything works well when I compile and run the code without profiling information. When I add options to profile it with gprof, I get unexpected results from the profiling tool.
I did a lot of research but didn't find any page that solved my problem.
This is my code (named test.c):
#include <stdio.h>
#include <openssl/bn.h>
#include <openssl/rand.h>

static BIGNUM *x;
static BIGNUM *y;
static BIGNUM *z;
static BIGNUM *p;
static BN_CTX *tmp;
static unsigned int max_size;

int main(void){
    int max_bytes, res_gen;

    max_bytes = 50;
    tmp = BN_CTX_new();
    BN_CTX_init(tmp);

    x = BN_new();
    y = BN_new();
    z = BN_new();
    p = BN_new();

    RAND_load_file("/dev/urandom", max_bytes);

    max_size = 256;
    BN_rand(x, max_size, 0, 0);
    BN_rand(y, max_size, 0, 0);

    res_gen = BN_generate_prime_ex(p, max_size, 0, NULL, NULL, NULL);
    BN_mul(z, x, y, tmp);
    BN_nnmod(x, z, p, tmp);

    printf("\nOk\n");

    BN_free(x);
    BN_free(y);
    BN_free(z);
    BN_free(p);
    BN_CTX_free(tmp);
    return 0;
}
When I compile with profiling information using gcc -pg -static test.c -lcrypto -ldl, it produces the following results: 0% (and 0 seconds) for everything, which is unexpected.
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 1 0.00 0.00 main
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time self children called name
0.00 0.00 1/1 __libc_start_main [4282]
[1] 0.0 0.00 0.00 1 main [1]
0.00 0.00 0/0 mcount (3495)
0.00 0.00 0/0 BN_CTX_new [275]
0.00 0.00 0/0 BN_CTX_init [274]
0.00 0.00 0/0 BN_new [372]
0.00 0.00 0/0 RAND_load_file [1636]
0.00 0.00 0/0 BN_rand [386]
0.00 0.00 0/0 BN_generate_prime_ex [331]
0.00 0.00 0/0 BN_mul [370]
0.00 0.00 0/0 BN_nnmod [378]
0.00 0.00 0/0 puts [3696]
0.00 0.00 0/0 BN_free [327]
0.00 0.00 0/0 BN_CTX_free [272]
-----------------------------------------------
Also, it seems that the profiler detects only the main method, because details for the other methods don't appear in the flat profile or the call graph.
So, I would like to know if I must compile the OpenSSL library with some options (which options?), or something else.
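Note that, as covered at the top of this page, gprof only accounts for functions that were themselves compiled with -pg, so a stock libcrypto contributes no instrumented code. If rebuilding the library is an option, something like the following might work; the pass-through of -pg via ./config and the paths are assumptions on my part, not a tested recipe:

./config -pg no-shared
make
gcc -pg -static test.c -L. -lcrypto -ldl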
gprof is a CPU-profiler. That means it cannot account for any time spent in blocking system calls like I/O, sleep, locks, etc.
Assuming the goal is to find speedups,
what a lot of people do is get it running under a debugger like GDB and take stack samples manually.
It's not measuring (with any precision). It's pinpointing problems. Anything you see it doing that you could avoid, if you see it on more than one sample, will give you a substantial speedup. The fewer samples you have to take before seeing it twice, the bigger the speedup.
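A minimal way to take one such sample from the command line (assuming gdb is installed and the process runs long enough to attach; use ps to find the pid if pidof isn't available):

gdb -batch -ex bt -p $(pidof a.out)

Run it ten or so times and look for the lines that show up in several samples.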
I have a decent understanding of x86 assembly, and I know that when a function is called, all of its arguments are pushed onto the stack.
I have a function which basically loops through an 8 by 8 array and calls some functions based on the values in the array. Each of these calls involves passing 6-10 arguments. This program takes a very long time to run; it is a chess AI, and this function takes 20% of the running time.
So I guess my question is: what can I do to give my functions access to the variables they need in a faster way?
int row, col, i;
determineCheckValidations(eval_check, b, turn);
int *eval_check_p = &(eval_check[0][0]);

for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++, eval_check_p++) {
        if (b->colors[row][col] == turn) {
            int type = b->types[row][col];
            if (type == PAWN)
                findPawnMoves(b, moves_found, turn, row, col, last_move, *eval_check_p);
            else if (type == KNIGHT)
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_knight, 8, *eval_check_p);
            else if (type == BISHOP)
                findMappedIters(b, moves_found, turn, row, col, *move_map_bishop, 4, *eval_check_p);
            else if (type == ROOK)
                findMappedIters(b, moves_found, turn, row, col, *move_map_rook, 4, *eval_check_p);
            else if (type == QUEEN)
                findMappedIters(b, moves_found, turn, row, col, *move_map_queen, 8, *eval_check_p);
            else if (type == KING) {
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_king, 8, *eval_check_p);
                findCastles(b, moves_found, turn, row, col);
            }
        }
    }
}
All the code can be found at https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
% cumulative self self total
time seconds seconds calls s/call s/call name
20.00 1.55 1.55 2071328 0.00 0.00 findAllValidMoves
14.84 2.70 1.15 10418354 0.00 0.00 checkMove
10.06 3.48 0.78 1669701 0.00 0.00 encodeBoard
7.23 4.04 0.56 10132526 0.00 0.00 findMappedIters
6.84 4.57 0.53 1669701 0.00 0.00 getElement
6.71 5.09 0.52 68112169 0.00 0.00 createNormalMove
You have done good work profiling. Now take the worst-offending function and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring out invariants from loops.
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting it to CodeReview.StackExchange.com.
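On the argument-passing question itself, one common shape is to bundle the invariant arguments into a context struct and pass a single pointer. A sketch (the types and names here are invented for illustration, not taken from the linked repository):

/* Hypothetical context: bundles the arguments that never change
   across the inner loop, so each call passes one pointer plus
   the two coordinates. */
typedef struct {
    int colors[8][8];
    int types[8][8];
} Board;

typedef struct {
    Board *b;          /* board state            */
    int    turn;       /* side to move           */
    int   *eval_check; /* per-square check flags */
} SearchContext;

static void findPawnMovesCtx(const SearchContext *ctx, int row, int col)
{
    (void)ctx; (void)row; (void)col; /* body elided in this sketch */
}

Whether this wins anything depends on the ABI: on x86-64 the first several arguments travel in registers anyway, so measure before and after.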
Don't assume anything.
I'm having a hard time using sscanf to scan hours and minutes from a list. Below is a small snippet of the list.
1704 86 2:30p 5:50p Daily
1711 17 10:40a 2:15p 5
1712 86 3:10p 6:30p 1
1731 48 6:25a 9:30a 156
1732 100 10:15a 1:30p Daily
1733 6 2:15p 3:39p Daily
I've tried this, but it keeps giving me a segmentation fault. (I'm putting this information into structures.)
for (i = 0; i < check_enter; i++) {
    sscanf(all_flights[i],
           "%d %d %d:%d%c %d:%d%c %s",
           &all_flights_divid[1].flight_number,
           &all_flights_divid[i].route_id,
           &all_flights_divid[i].departure_time_hour,
           &all_flights_divid[i].departure_time_minute,
           &all_flights_divid[i].departure_time_format,
           &all_flights_divid[i].arrival_time_minute,
           &all_flights_divid[i].arrival_time_minute,
           &all_flights_divid[i].arrival_time_format,
           &all_flights_divid[i].frequency);

    printf("%d ", all_flights_divid[i].flight_number);
    printf("%d ", all_flights_divid[i].route_id);
    printf("%d ", all_flights_divid[i].departure_time_hour);
    printf("%d ", all_flights_divid[i].departure_time_minute);
    printf("%c ", all_flights_divid[i].departure_time_format);
    printf("%d ", all_flights_divid[i].arrival_time_hour);
    printf("%d ", all_flights_divid[i].arrival_time_minute);
    printf("%c ", all_flights_divid[i].arrival_time_format);
    printf("%s\n", all_flights_divid[i].frequency);
}
This is how I declared it.
struct all_flights{
    int flight_number;
    int route_id;
    int departure_time_hour;
    int departure_time_minute;
    char departure_time_format;
    int arrival_time_hour;
    int arrival_time_minute;
    char arrival_time_format;
    char frequency[10];
};
struct all_flights all_flights_divid[3000];
These are the results I get
0 86 2 30 p 0 50 p Daily
0 17 10 40 a 0 15 p 5
0 86 3 10 p 0 30 p 1
0 48 6 25 a 0 30 a 156
0 100 10 15 a 0 30 p Daily
0 6 2 15 p 0 39 p Daily
Look carefully at your list of output targets in sscanf. Do you see the difference between &all_flights_divid[i].departure_time_minute and all_flights_divid[i].departure_time_format? Similarly for .arrival_time_format and .frequency.
What do you think the & ampersand is for? Hint: what is one way of returning multiple values from a single function call, and what does this have to do with the & ampersand?
A segmentation fault arises when your program tries to write data into memory the operating system has never instructed the CPU to make available to the program. Segmentation faults do not always occur when data is misplaced, because sometimes data is misplaced within available memory. By way of analogy, if you inadvertently put a book in the wrong place on the bookshelf, you'll not easily find the book later, but the book is still on a bookshelf and does not seem to anyone to be out of place. On the other hand, if you inadvertently put the same book in the refrigerator, well, when mother goes to get the milk she's going to issue you a segmentation fault! That's the analogy, anyway.
In general, it is hard to guess whether misplacing data will cause a segmentation fault (as misplaced into the refrigerator) or not (as misplaced on the bookshelf) until you run the program. The segmentation fault (refrigerator) is preferable because it makes the mistake obvious, so the operating system tries to give you as many segmentation faults as it can by affording the program as little memory as possible.
I am avoiding giving a 100 percent direct answer because of your "homework" tag. See if you cannot figure out the & ampersand matter, then come back here if it still does not make sense.
Your pointers are all messed up. Perhaps a series of local variables specifically for reading this stuff in would help you organize this all in your head.
int flightNum, routeID, depHour, depMin, arrHour, arrMin;
char depFormat, arrFormat;
char freq[10]; /* a real buffer, not an uninitialized pointer */

for (i = 0; i < check_enter; i++) {
    /* %9s keeps freq from overflowing; sscanf returns the number of
       items converted, so anything other than 9 means a malformed line. */
    if (sscanf(all_flights[i], "%d %d %d:%d%c %d:%d%c %9s",
               &flightNum, &routeID,
               &depHour, &depMin, &depFormat,
               &arrHour, &arrMin, &arrFormat,
               freq) != 9) {
        continue;
    }
    all_flights_divid[i].flight_number = flightNum;
    all_flights_divid[i].route_id = routeID;
    all_flights_divid[i].departure_time_hour = depHour;
    all_flights_divid[i].departure_time_minute = depMin;
    all_flights_divid[i].departure_time_format = depFormat;
    all_flights_divid[i].arrival_time_hour = arrHour;
    all_flights_divid[i].arrival_time_minute = arrMin;
    all_flights_divid[i].arrival_time_format = arrFormat;
    strcpy(all_flights_divid[i].frequency, freq);
}
I want to store event time.
I found these two functions, but don't know which one is faster.
#include <sys/time.h> /* gettimeofday */
#include <time.h>     /* clock_gettime */

int main()
{
    struct timespec tp;
    struct timeval tv;
    int i = 0;
    int j = 0;

    for (i = 0; i < 500000; i++)
    {
        gettimeofday(&tv, NULL);
        j += tv.tv_usec % 2;
        clock_gettime(CLOCK_HIGHRES, &tp); /* CLOCK_HIGHRES is Solaris-specific */
        j += tp.tv_sec % 2;
    }
    return 0;
}
%Time Seconds Cumsecs #Calls msec/call Name
68.3 0.28 0.28 500000 0.0006 __clock_gettime
22.0 0.09 0.37 500000 0.0002 _gettimeofday
7.3 0.03 0.40 1000009 0.0000 _mcount
2.4 0.01 0.41 1 10. main
0.0 0.00 0.41 4 0. atexit
0.0 0.00 0.41 1 0. _exithandle
0.0 0.00 0.41 1 0. _fpsetsticky
0.0 0.00 0.41 1 0. _profil
0.0 0.00 0.41 1 0. exit
This was run on a Solaris 9 V440 box. Profile the code on your own box; this result is completely different from the result quoted in a link supplied earlier. In other words, if there is a difference, it will be due to the implementation.
It will depend on your system; there's no standard for which is faster. My guess would be they're usually both about the same speed. gettimeofday is more traditional and probably a good bet if you want your code to be portable to older systems. clock_gettime has more features.
You don't care.
If you're calling either of them often enough that it matters, You're Doing It Wrong.
Profile code which is causing specific performance problems and check how much time it is spending in there. If it's too much, consider refactoring it to call the function less often.
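One common shape of that refactoring (a sketch; the batch size is arbitrary): if many events arrive close together and your required precision allows it, stamp a whole batch with a single clock read instead of one read per event.

#include <time.h>

#define BATCH 64

/* Sketch: one clock_gettime() call stamps BATCH events, trading
   per-event precision for far fewer clock reads. */
static void stamp_batch(struct timespec stamps[BATCH])
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    for (int i = 0; i < BATCH; i++)
        stamps[i] = now;
}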
It depends on the constant used in clock_gettime(). For the fastest clocks there are the CLOCK_*_COARSE constants; those timers are the fastest, but not precise.
gettimeofday() should return the same as clock_gettime(CLOCK_REALTIME).
Also, on Linux, benchmark results depend on the architecture: Linux has a mechanism (the vDSO) that eliminates the syscall needed to read the time, but it does not work on 32-bit x86. Check the strace output to see whether a real syscall is made.
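You can check what resolution each clock advertises on your own box; a small sketch (link with -lrt on older glibc; the COARSE clock is guarded because it is Linux-only):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    clock_getres(CLOCK_REALTIME, &res);
    printf("CLOCK_REALTIME:        %ld ns\n", res.tv_nsec);

#ifdef CLOCK_REALTIME_COARSE
    /* Coarse clocks trade precision for a cheaper read. */
    clock_getres(CLOCK_REALTIME_COARSE, &res);
    printf("CLOCK_REALTIME_COARSE: %ld ns\n", res.tv_nsec);
#endif
    return 0;
}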
Have a look at this discussion. It seems of good quality and very closely related to what you might be looking for though a little dated.
I have a system that needs at least 10 milliseconds of accuracy for timers.
I went with timerfd as it suits me perfectly, but found that it is not accurate at all, even for intervals up to 15 milliseconds; either that or I have misunderstood how it works.
The times I have measured were up to 21 milliseconds on a 10 millisecond timer.
I have put together a quick test that shows my problem.
Here is the test:
#include <sys/timerfd.h>
#include <time.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <inttypes.h>

int main(int argc, char *argv[]){
    int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
    int milliseconds = atoi(argv[1]);
    struct itimerspec timspec;

    bzero(&timspec, sizeof(timspec));
    timspec.it_interval.tv_sec = 0;
    timspec.it_interval.tv_nsec = milliseconds * 1000000;
    timspec.it_value.tv_sec = 0;
    timspec.it_value.tv_nsec = 1;

    int res = timerfd_settime(timerfd, 0, &timspec, 0);
    if(res < 0){
        perror("timerfd_settime:");
        return 1;
    }

    uint64_t expirations = 0;
    int iterations = 0;
    while((res = read(timerfd, &expirations, sizeof(expirations)))){
        if(res < 0){ perror("read:"); continue; }
        if(expirations > 1){
            printf("%" PRIu64 " expirations, %d iterations\n", expirations, iterations);
            break;
        }
        iterations++;
    }
    return 0;
}
And executed like this:
Zack ~$ for i in 2 4 8 10 15; do echo "intervals of $i milliseconds"; ./test $i;done
intervals of 2 milliseconds
2 expirations, 1 iterations
intervals of 4 milliseconds
2 expirations, 6381 iterations
intervals of 8 milliseconds
2 expirations, 21764 iterations
intervals of 10 milliseconds
2 expirations, 1089 iterations
intervals of 15 milliseconds
2 expirations, 3085 iterations
Even allowing for some delay, 15 milliseconds sounds like too much to me.
Try altering it as follows; this should pretty much guarantee that it never misses a wakeup. But be careful with it: running at realtime priority can lock your machine hard if the task never sleeps, and you may need to set things up so that your user has the ability to run stuff at realtime priority (see /etc/security/limits.conf).
#include <sys/timerfd.h>
#include <time.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>   /* atoi */
#include <stdio.h>
#include <unistd.h>   /* read */
#include <inttypes.h> /* PRIu64: %ld would be wrong for uint64_t on 32-bit */
#include <sched.h>

int main(int argc, char *argv[])
{
    int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
    int milliseconds = atoi(argv[1]);
    struct itimerspec timspec;
    struct sched_param schedparm;

    memset(&schedparm, 0, sizeof(schedparm));
    schedparm.sched_priority = 1; // lowest rt priority
    sched_setscheduler(0, SCHED_FIFO, &schedparm);

    bzero(&timspec, sizeof(timspec));
    timspec.it_interval.tv_sec = 0;
    timspec.it_interval.tv_nsec = milliseconds * 1000000;
    timspec.it_value.tv_sec = 0;
    timspec.it_value.tv_nsec = 1;

    int res = timerfd_settime(timerfd, 0, &timspec, 0);
    if(res < 0){
        perror("timerfd_settime:");
    }

    uint64_t expirations = 0;
    int iterations = 0;
    while((res = read(timerfd, &expirations, sizeof(expirations)))){
        if(res < 0){ perror("read:"); continue; }
        if(expirations > 1){
            printf("%" PRIu64 " expirations, %d iterations\n", expirations, iterations);
            break;
        }
        iterations++;
    }
    return 0;
}
If you are using threads you should use pthread_setschedparam instead of sched_setscheduler.
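The threaded equivalent, for reference (a sketch; compile with -pthread):

#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Sketch: raise the calling thread to the lowest SCHED_FIFO priority. */
static int make_thread_realtime(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 1;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}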
Realtime also isn't about low latency; it's about guarantees. RT means that if you want to wake up exactly once every second, on the second, you WILL. The normal scheduler does not give you this: it might decide to wake you up 100ms later, because it had some other work to do at that time anyway. If you want to wake up every 10ms, and you REALLY do need to, then you should set yourself to run as a realtime task; the kernel will then wake you up every 10ms without fail, unless a higher-priority realtime task is busy doing stuff.
If you need to guarantee that your wakeup interval is exactly some time, it doesn't matter whether it's 1ms or 1 second: you won't get it unless you run as a realtime task. There are good reasons the kernel will do this to you (saving power is one of them, higher throughput is another), and it's well within its rights to do so, since you never told it you need better guarantees. Most stuff doesn't actually need to be this accurate, or to never miss a wakeup, so you should think hard about whether or not you really do need it.
quoting from http://www.ganssle.com/articles/realtime.htm

A hard real time task or system is one where an activity simply must be completed - always - by a specified deadline. The deadline may be a particular time or time interval, or may be the arrival of some event. Hard real time tasks fail, by definition, if they miss such a deadline.

Notice this definition makes no assumptions about the frequency or period of the tasks. A microsecond or a week - if missing the deadline induces failure, then the task has hard real time requirements.
Soft realtime is pretty much the same, except that missing a deadline, while undesirable, is not the end of the world (for example video and audio playback are soft realtime tasks, you don't want to miss displaying a frame, or run out of buffer, but if you do it's just a momentary hiccough, and you simply continue). If what you are trying to do is 'soft' realtime I wouldn't bother with running at realtime priority since you should generally get your wakeups in time (or at least close to it).
EDIT:
If you aren't running realtime, the kernel will by default give any timer you create some 'slack', so that it can merge your requested wakeup with other events that happen at times close to the one you asked for. That is, if the other event is within your slack window, the kernel will not wake you at exactly the time you asked, but a little earlier or later, at a moment when it was already going to do something else; this saves power.
For a little more info see High- (but not too high-) resolution timeouts and Timer slack. (Note I'm not sure whether either of those things is exactly what's really in the kernel, since both of those articles are about lkml mailing-list discussions, but something like the first one really is in the kernel.)
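If the slack mechanism works the way those discussions describe, a process can also shrink its own slack without going realtime; a sketch (PR_SET_TIMERSLACK takes nanoseconds and needs a reasonably recent kernel):

#include <sys/prctl.h>

int main(void)
{
    /* Default slack is 50000 ns (50us); ask for 1us instead. */
    prctl(PR_SET_TIMERSLACK, 1000UL);
    /* ... set up the timerfd as in the snippets above ... */
    return 0;
}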
I've got a feeling that your test is very hardware dependent. When I ran your sample program on my system, it appeared to hang at 1ms. To make your test at all meaningful on my computer, I had to change from milliseconds to microseconds. (I changed the multiplier from 1000000 to 1000.)
$ grep 1000 test.c
timspec.it_interval.tv_nsec = microseconds * 1000;
$ for i in 1 2 4 5 7 8 9 15 16 17\
31 32 33 47 48 49 63 64 65 ; do\
echo "intervals of $i microseconds";\
./test $i;done
intervals of 1 microseconds
11 expirations, 0 iterations
intervals of 2 microseconds
5 expirations, 0 iterations
intervals of 4 microseconds
3 expirations, 0 iterations
intervals of 5 microseconds
2 expirations, 0 iterations
intervals of 7 microseconds
2 expirations, 0 iterations
intervals of 8 microseconds
2 expirations, 0 iterations
intervals of 9 microseconds
2 expirations, 0 iterations
intervals of 15 microseconds
2 expirations, 7788 iterations
intervals of 16 microseconds
4 expirations, 1646767 iterations
intervals of 17 microseconds
2 expirations, 597 iterations
intervals of 31 microseconds
2 expirations, 370969 iterations
intervals of 32 microseconds
2 expirations, 163167 iterations
intervals of 33 microseconds
2 expirations, 3267 iterations
intervals of 47 microseconds
2 expirations, 1913584 iterations
intervals of 48 microseconds
2 expirations, 31 iterations
intervals of 49 microseconds
2 expirations, 17852 iterations
intervals of 63 microseconds
2 expirations, 24 iterations
intervals of 64 microseconds
2 expirations, 2888 iterations
intervals of 65 microseconds
2 expirations, 37668 iterations
(Somewhat interesting that I got the longest runs from 16 and 47 microseconds, but 17 and 48 were awful.)
time(7) has some suggestions on why our platforms are so different:
High-Resolution Timers
Before Linux 2.6.21, the accuracy of timer and sleep system
calls (see below) was also limited by the size of the jiffy.
Since Linux 2.6.21, Linux supports high-resolution timers
(HRTs), optionally configurable via CONFIG_HIGH_RES_TIMERS. On
a system that supports HRTs, the accuracy of sleep and timer
system calls is no longer constrained by the jiffy, but instead
can be as accurate as the hardware allows (microsecond accuracy
is typical of modern hardware). You can determine whether
high-resolution timers are supported by checking the resolution
returned by a call to clock_getres(2) or looking at the
"resolution" entries in /proc/timer_list.
HRTs are not supported on all hardware architectures. (Support
is provided on x86, arm, and powerpc, among others.)
All the 'resolution' lines in my /proc/timer_list are 1ns on my (admittedly ridiculously powerful) x86_64 system.
I decided to try to figure out where the 'breaking point' is on my computer, but gave up on the 110 microsecond run:
$ for i in 70 80 90 100 110 120 130\
; do echo "intervals of $i microseconds";\
./test $i;done
intervals of 70 microseconds
2 expirations, 639236 iterations
intervals of 80 microseconds
2 expirations, 150304 iterations
intervals of 90 microseconds
4 expirations, 3368248 iterations
intervals of 100 microseconds
4 expirations, 1964857 iterations
intervals of 110 microseconds
^C
90 microseconds ran for three million iterations before it failed a few times; that's 22 times better resolution than your very first test, so I'd say that given the right hardware, 10ms shouldn't be anywhere near difficult. (90 microseconds is 111 times better resolution than 10 milliseconds.)
But if your hardware doesn't provide the timer support that high-resolution timers need, then Linux can't help you without resorting to SCHED_RR or SCHED_FIFO. And even then, perhaps another kernel could better provide you with the software timer support you need.
Good luck. :)
Here's a theory: if HZ is set to 250 for your system (as is typical), then you have a 4 millisecond timer resolution. Once your process is swapped out by the scheduler, it's likely that a number of other processes will be scheduled and run before your process gets another time slice. This might explain why you are seeing timer resolutions in the 15 to 21 millisecond range. The only way around this would be to run a real-time kernel.
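You can check what your kernel was built with (the config file path varies by distro; this is the common location):

grep CONFIG_HZ= /boot/config-$(uname -r)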
The typical solution for high resolution timing on non-realtime systems is basically to busy wait with a call to select.
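That idiom looks roughly like this (a sketch: select() watches no file descriptors at all here, it just blocks for the timeout, which on many systems is finer-grained than sleep()):

#include <sys/select.h>

/* Sketch: sub-second sleep via select's timeout argument. */
static void short_sleep_us(long usec)
{
    struct timeval tv;
    tv.tv_sec  = usec / 1000000;
    tv.tv_usec = usec % 1000000;
    select(0, NULL, NULL, NULL, &tv);
}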
Depending on what else the system is doing, it may be a bit slow in switching back to your task. Unless you have a "real" realtime system, there's no guarantee it will do better than what you're seeing, although I agree that result is a bit disappointing.
You can (mostly) eliminate that task switch / scheduler time. If you have CPU power (and electrical power!) to spare, a brutal but effective solution would be a busy wait spin loop.
The idea is to run your program in a tight loop that continuously polls the clock for what time it is, and then calls your other code when the time is right. At the expense of making your system act very sluggish for everything else and heating up your CPU, you will end up with task scheduling that is mostly jitter free.
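A sketch of such a spin loop (CLOCK_MONOTONIC avoids jumps from wall-clock adjustments; note this pins a core at 100% while waiting):

#include <time.h>

/* Sketch: burn CPU until the given deadline has passed. */
static void spin_until(const struct timespec *deadline)
{
    struct timespec now;
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec < deadline->tv_sec ||
             (now.tv_sec == deadline->tv_sec &&
              now.tv_nsec < deadline->tv_nsec));
}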
I wrote a system like this once under Windows XP to spin a stepper motor, supplying evenly spaced pulses up to 40K times per second, and it worked fine. Of course, your mileage may vary.