Why does my Linux app get stopped every 0.5 seconds?

I have a 16 core Linux machine that is idle. If I run a trivial, single threaded C program that sits in a loop reading the cycle counter forever (using the rdtsc instruction), then every 0.5 seconds, I see a 0.17 ms jump in the timer value. In other words, every 0.5 seconds it seems that my application is stopped for 0.17ms. I would like to understand why this happens and what I can do about it. I understand Linux is not a real time operating system. I'm just trying to understand what is going on, so I can make the best use of what Linux provides.
I found someone else's software for measuring this problem - https://github.com/nokia/clocktick_jumps. Its results are consistent with my own.
To answer the "tell us what specific problem you are trying to solve" question - I work on high-speed networking applications using DPDK. About 60 million packets arrive per second. I need to decide what size to make the RX buffers and have reasons that the number I pick is sensible. The answer to this question is one part of that puzzle.
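For scale, a back-of-the-envelope check using the numbers above (my own arithmetic, not a measured figure): 60 million packets/s x 0.17 ms = 60e6 x 1.7e-4 s, which is roughly 10,200 packets that would have to sit in the RX ring during one such stall.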
My code looks like this:
// Build with gcc -O2 -Wall
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

int main() {
    // Bad way to learn frequency of cycle counter.
    unsigned long long t1 = __rdtsc();
    usleep(1000000);
    double millisecs_per_tick = 1e3 / (double)(__rdtsc() - t1);

    // Loop forever. Print message if any iteration takes unusually long.
    t1 = __rdtsc();
    while (1) {
        unsigned long long t2 = __rdtsc();
        double delta = t2 - t1;
        delta *= millisecs_per_tick;
        if (delta > 0.1) {
            printf("%4.2f - Delay of %.2f ms.\n", (double)t2 * millisecs_per_tick, delta);
        }
        t1 = t2;
    }
    return 0;
}
I'm running on Ubuntu 16.04, amd64. My processor is an Intel Xeon X5672 @ 3.20GHz.

I expect your system is scheduling another process to run on the same CPU, and your process is either preempted or moved to another core, with some timing penalty either way.
You can find the reason by digging into the kernel events happening at the same time. For example, SystemTap or perf can give you some insight. I'd start with the scheduler events to rule those out first: https://github.com/jav/systemtap/blob/master/tapset/scheduler.stp
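For example, a rough perf starting point (the CPU number 3 and the 10-second window here are placeholders for your setup) would be to record the scheduler switch and wakeup tracepoints on the core your loop is pinned to, and then read back what ran during the gaps:
perf record -e sched:sched_switch -e sched:sched_wakeup -C 3 -- sleep 10
perf script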

Related

Practical jitter with clock_nanosleep()

I'm trying to establish what practical jitter I can achieve by using clock_nanosleep() in a loop and through experimentation I'm observing something I'm not confident I understand.
I'm using code posted in this SO question by another user to benchmark performance, targeting a 250ms interval. I've observed that on my system the sleep function returns very consistently 10us late with only about 2us jitter the vast majority of the time (fairly narrow statistical distribution).
NOTE: I haven't collected data to present a plot of statistical distribution but casual qualitative description should hopefully suffice.
I decided to subtract the 10us offset from the target wakeup time to compensate for it, and this caused the average error to be approximately zero as expected; however, the jitter increased dramatically - I would estimate most wakeups are >100us early/late, and much more widely distributed.
Why is this?
My theory is that with the 10us correction the target waketimes are less nicely aligned with the underlying hardware clock, but it would be helpful to get confirmation. If this is true, is there a method to synchronize the phase of the target waketimes with the hardware clock?
Manpages for clock_nanosleep(2) say: "Furthermore, after the sleep completes, there may still be a delay before the CPU becomes free to once again execute the calling thread."
To understand your question better, I created the source code below, based on the SO reference you provided. I include the source code so that you or someone else can check it, test it, and play with it.
The debug print reports the deviation from a sleep of exactly 1 second. It is shorter than the print left in the comments, and it always refers to the deviation from 1 second, no matter which wakeTime has been defined. That makes it possible to try a reduced wakeTime (wakeTime.tv_nsec -= some_value;) and see how close you get to the 1-second target.
Conclusions:
I generally agree with everything you (davegravy) write about it in your post, except that I am seeing much higher delays and deviations.
The delay changes only slightly between an idle system and a heavily loaded one (all CPUs at 100% load). On a heavily loaded system the scatter of the delay shrinks, and the average delay also drops a little (on my system, but not significantly).
As expected, the delay changes quite a bit when I try it on another machine (and, as expected, the Raspberry Pi is worse :o).
For a specific machine at a specific moment, it is possible to define a correction value in nanoseconds that brings the average sleep closer to the target. However, the correction value is not necessarily equal to the delay error measured without correction, and it may differ between machines.
Idea: Since the provided code can measure how well it is doing, it could run a few calibration loops and derive an optimized delay correction value by itself. (This auto-correction might be interesting mainly from a theoretical point of view, but it is an idea.)
Idea 2: Alternatively, corrections could be applied just to avoid a long-term drift when many intervals are chained one after another. (A sketch of this idea follows the test results below.)
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define CLOCK CLOCK_MONOTONIC
//#define CLOCK CLOCK_REALTIME
//#define CLOCK CLOCK_TAI
//#define CLOCK CLOCK_BOOTTIME

static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
    long diff = t1->tv_nsec - t2->tv_nsec;
    diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
    return diff;
}

static void* tickThread(void)
{
    struct timespec sleepStart;
    struct timespec currentTime;
    struct timespec wakeTime;
    long sleepTime;
    long wakeDelay;
    while (1)
    {
        clock_gettime(CLOCK, &wakeTime);
        wakeTime.tv_sec += 1;
        wakeTime.tv_nsec -= 0; // Value to play with for delay "correction"
        clock_gettime(CLOCK, &sleepStart);
        clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
        clock_gettime(CLOCK, &currentTime);
        sleepTime = calcTimeDiff(&currentTime, &sleepStart);
        wakeDelay = calcTimeDiff(&currentTime, &wakeTime);
        {
            /*printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
                (long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
                (long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
                (long) currentTime.tv_sec, (long) currentTime.tv_nsec,
                sleepTime, wakeDelay);*/
            // Debug short print with respect to target sleep = 1 sec = 1000000000 ns
            long debugTargetDelay = sleepTime - 1000000000;
            printf("sleep=%-ld delay=%-ld targetdelay=%-ld\n",
                sleepTime, wakeDelay, debugTargetDelay);
        }
    }
}

int main(int argc, char* argv[])
{
    tickThread();
}
Some output with wakeTime.tv_nsec -= 0;
sleep=1000095788 delay=96104 targetdelay=95788
sleep=1000078989 delay=79155 targetdelay=78989
sleep=1000080717 delay=81023 targetdelay=80717
sleep=1000068001 delay=68251 targetdelay=68001
sleep=1000080475 delay=80519 targetdelay=80475
sleep=1000110925 delay=110977 targetdelay=110925
sleep=1000082415 delay=82561 targetdelay=82415
sleep=1000079572 delay=79713 targetdelay=79572
sleep=1000098609 delay=98664 targetdelay=98609
and with wakeTime.tv_nsec -= 65000;
sleep=1000031711 delay=96987 targetdelay=31711
sleep=1000009400 delay=74611 targetdelay=9400
sleep=1000015867 delay=80912 targetdelay=15867
sleep=1000015612 delay=80708 targetdelay=15612
sleep=1000030397 delay=95592 targetdelay=30397
sleep=1000015299 delay=80475 targetdelay=15299
sleep=999993542 delay=58614 targetdelay=-6458
sleep=1000031263 delay=96310 targetdelay=31263
sleep=1000002029 delay=67169 targetdelay=2029
sleep=1000031671 delay=96821 targetdelay=31671
sleep=999998462 delay=63608 targetdelay=-1538
Anyway, the delays change all the time. I tried different CLOCK definitions and different compiler options, but without any special results.
Some statistics from further testing, sample size = 100 in both cases.
targetdelay from wakeTime.tv_nsec -= 0;
Mean value = 97503 Standard deviation = 27536
targetdelay from wakeTime.tv_nsec -= 97508;
Mean value = -1909 Standard deviation = 32682
In both cases, there were a few massive outliers, such that even this result from 100 samples might not quite be representative.
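Following up on Idea 2 above, here is a minimal sketch (my own variation, not part of the test program above) that avoids long-term drift by deriving every wake time from one fixed base instead of re-reading the clock each iteration; per-wakeup delays then no longer accumulate across intervals.
#include <stdio.h>
#include <time.h>

#define CLOCK  CLOCK_MONOTONIC
#define PERIOD 1000000000L   // interval in nanoseconds (1 second here)

int main(void)
{
    struct timespec base, wake, now;
    clock_gettime(CLOCK, &base);   // fixed reference point
    wake = base;
    for (long n = 1; ; n++) {
        // n-th wake time = base + n * PERIOD, so individual delays do not accumulate
        wake.tv_nsec += PERIOD;
        while (wake.tv_nsec >= 1000000000L) {
            wake.tv_nsec -= 1000000000L;
            wake.tv_sec += 1;
        }
        clock_nanosleep(CLOCK, TIMER_ABSTIME, &wake, NULL);
        clock_gettime(CLOCK, &now);
        long late = (now.tv_sec - wake.tv_sec) * 1000000000L
                  + (now.tv_nsec - wake.tv_nsec);
        printf("interval %ld: woke %ld ns after target\n", n, late);
    }
    return 0;
}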

Equivalent to Arduino millis()

I am currently working on integrating a "shunt"-type sensor on an electronic board. I chose a Linear Technology part (LTC2947); unfortunately it only has an Arduino driver. I have to translate everything to C under Linux to be compatible with my microprocessor (APQ8009 ARM Cortex-A7). I have a small question about one of the functions:
int16_t LTC2947_wake_up() // Wake up LTC2947 from shutdown mode and measure the wakeup time
{
    byte data[1];
    unsigned long wakeupStart = millis(), wakeupTime;
    LTC2947_WR_BYTE(LTC2947_REG_OPCTL, 0);
    do
    {
        delay(1);
        LTC2947_RD_BYTE(LTC2947_REG_OPCTL, data);
        wakeupTime = millis() - wakeupStart;
        if (data[0] == 0) //! check if we are in idle mode
        {
            return wakeupTime;
        }
        if (wakeupTime > 200)
        {
            //! failed to wake up due to timeout, return -1
            return -1;
        }
    }
    while (true);
}
I found usleep() as an equivalent for delay(), but I cannot find an equivalent for millis() in C. Can you help me translate this function, please?
Arduino millis() is based on a timer that trips an overflow interrupt at very close to 1 kHz, or once per millisecond. To achieve the same thing, I suggest you set up a timer on the ARM platform and update a volatile unsigned long variable with a counter. That will be the equivalent of millis(). (A minimal Linux sketch of this approach appears after the Arduino internals below.)
Here is what millis() is doing behind the scenes:
SIGNAL(TIMER0_OVF_vect)
{
    // copy these to local variables so they can be stored in registers
    // (volatile variables must be read from memory on every access)
    unsigned long m = timer0_millis;
    unsigned char f = timer0_fract;

    m += MILLIS_INC;
    f += FRACT_INC;
    if (f >= FRACT_MAX) {
        f -= FRACT_MAX;
        m += 1;
    }

    timer0_fract = f;
    timer0_millis = m;
    timer0_overflow_count++;
}

unsigned long millis()
{
    unsigned long m;
    uint8_t oldSREG = SREG;

    // disable interrupts while we read timer0_millis or we might get an
    // inconsistent value (e.g. in the middle of a write to timer0_millis)
    cli();
    m = timer0_millis;
    SREG = oldSREG;

    return m;
}
Coming from the embedded world, arguably the first thing you should do when starting a project on a new platform is establish clocks and get a timer interrupt going at a prescribed rate. That is the "Hello World" of embedded systems. ;) If you choose to do this at 1 kHz, you're most of the way there.
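To illustrate that suggestion on Linux, here is a minimal sketch. It is only an assumption-laden stand-in: a 1 kHz periodic SIGALRM from setitimer plays the role of the hardware timer interrupt, and signal delivery jitter makes it less accurate than the clock_gettime approaches below.
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static volatile unsigned long timer_millis = 0;   // incremented by the "timer interrupt"

static void on_tick(int sig) {
    (void)sig;
    timer_millis++;
}

static unsigned long millis(void) {
    return timer_millis;                           // volatile read of the counter
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_tick;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval it = {
        .it_interval = { .tv_sec = 0, .tv_usec = 1000 },  // fire every 1 ms
        .it_value    = { .tv_sec = 0, .tv_usec = 1000 },
    };
    setitimer(ITIMER_REAL, &it, NULL);

    while (millis() < 1000)                        // wait for roughly 1000 ticks
        pause();
    printf("elapsed ~%lu ms\n", millis());
    return 0;
}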
#include <time.h>

unsigned int millis(void) {
    struct timespec t;
    // change CLOCK_MONOTONIC_RAW to CLOCK_MONOTONIC on non-Linux systems
    clock_gettime(CLOCK_MONOTONIC_RAW, &t);
    return t.tv_sec * 1000 + (t.tv_nsec + 500000) / 1000000;
}
or
#include <sys/time.h>

unsigned int millis(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec * 1000 + (t.tv_usec + 500) / 1000;
}
The gettimeofday() version probably does not work on non-Linux systems.
The clock_gettime() version probably does not work with old C compilers.
The Arduino millis() returns unsigned long, a 32-bit unsigned integer. Most computers are 32-bit or 64-bit, so there is no need for long except on 16-bit machines like the Arduino, which is why these versions return unsigned int. If you want to measure a period longer than about 50 days in milliseconds, or you want the number of milliseconds since the Unix epoch in 1970, you need a long long (64-bit) integer.
If a computer clock has the incorrect time, the operating system, the system administrator, or a program that synchronizes the computer clock with internet time servers may change the clock to the correct time. This will affect these functions, especially the gettimeofday() version. Usually there is a big change in the computer clock when the computer boots, connects to the network, and synchronizes its clock with the network time server. But most programs are not running that early in the boot process and thus are not affected. Other changes to the computer clock are usually very small, and their effect on other programs is very small, so changes to the computer clock are usually not a problem.
clock_gettime() requires a clock ID.
CLOCK_MONOTONIC is not affected by discontinuous jumps in the system time, but it is affected by incremental adjustments, and it does not count time while the computer is suspended.
CLOCK_MONOTONIC_RAW is Linux-only. It is not affected by discontinuous jumps in the system time or by incremental adjustments, and it does not count time while the computer is suspended.
CLOCK_BOOTTIME is Linux-only. It is not affected by discontinuous jumps in the system time, but it is affected by incremental adjustments, and it does count time while the computer is suspended. It counts the time since the computer booted.
CLOCK_REALTIME is affected by discontinuous jumps in the system time and by incremental adjustments. It does count time while the computer is suspended. It counts standard Unix time (time since the Unix epoch in 1970).
I think CLOCK_MONOTONIC_RAW is the best choice on Linux, and CLOCK_MONOTONIC the best choice elsewhere. Usually millisecond time is used to measure short periods, like how long part of a program takes to run. Over a short period there will probably be no changes to the computer clock and the computer will probably not be suspended, so any clock ID will work and the choice is not important.
Precise time measurements are unreliable on multitasking computers because the measurement might be interrupted. The errors are usually small; sometimes this is a problem and sometimes it isn't. If you need more precise time measurements, you need dedicated hardware that cannot be interrupted, and some computers have such hardware built in. For example, if a program uses software PWM, changes to the output will be delayed whenever the computer is interrupted at the moment the output needs to change. But if the program uses hardware PWM, the PWM controller cannot be interrupted and will change the output at the correct time.
Tested on a Raspberry Pi.
I hope this is useful. It works for me under Lubuntu 20.04 LTS.
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

struct timeval __millis_start;

void init_millis() {
    gettimeofday(&__millis_start, NULL);
}

unsigned long int millis() {
    long mtime, seconds, useconds;
    struct timeval end;
    gettimeofday(&end, NULL);
    seconds = end.tv_sec - __millis_start.tv_sec;
    useconds = end.tv_usec - __millis_start.tv_usec;
    mtime = ((seconds) * 1000 + useconds / 1000.0) + 0.5;
    return mtime;
}

int main()
{
    init_millis();
    printf("Elapsed time: %ld milliseconds\n", millis());
    return 0;
}
Note:
Based on the discussion in the comments (with dear @MarcCompere), I must mention that the conversion of seconds and useconds to mtime in the millis function is rounded by adding 0.5 (read the comments to understand how!); the 0.5 can be removed, depending on your application. If you are using millis for accurate time measurement, keep it to statistically lower the mean squared error (MSE) of the conversion. But if you need timing for general logic-based decisions (or behaviour closer to that of the Arduino), then the floor (the natural behaviour when casting in this case) can be considered the better option, so do not add the 0.5.
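A tiny comparison of the two conversions, using a made-up elapsed time purely for illustration:
#include <stdio.h>

int main(void) {
    double elapsed = 1234.6;                 // hypothetical elapsed milliseconds
    long rounded = (long)(elapsed + 0.5);    // 1235: nearest millisecond (lower MSE)
    long floored = (long)elapsed;            // 1234: truncation, Arduino-like behaviour
    printf("rounded = %ld, floored = %ld\n", rounded, floored);
    return 0;
}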

OpenMP for beginners

I just got started with OpenMP; I wrote a little C program to check that what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes the PC to do this operation using 1 and 2 CPUs; to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't implicitly mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for very basic operations such as the single multiplication you are doing, a sequential (single-core) program will perform better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and certainly not of the shared cache between CPUs.
So you should start reading about general algorithm performance. In my opinion, making use of the shared cache is the essence of benefiting from multiple cores.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read, or write. This means that when using multiple cores, you will only see a benefit if you exploit things like shared cache, because the data distribution, the initialization of the threads, and managing them take time as well. To really see a performance speedup (see the link; it is an essential term in parallel computing) you should program an algorithm with a heavy emphasis on computation rather than memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from multiple cores, test it on a matrix-matrix multiplication with big matrices such as 10,000 x 10,000. Plot some graphs of input size (matrix size) versus time and matrix size versus GFLOPS, and compare the multicore version with the sequential one. A minimal sketch of such a benchmark follows below.
Also make yourself comfortable with complexity analysis (Big O notation).
Matrix-matrix multiplication has a locality of O(n).
Hope this helps :-)
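As a starting point for that experiment, here is a minimal sketch of such a benchmark (my own example, with the size reduced to 1000 x 1000 so it finishes quickly; scale N up and add the GFLOPS calculation and plotting yourself). Compile with gcc -fopenmp -O2 and vary OMP_NUM_THREADS as above.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000   // increase towards 10000 for a serious measurement

int main(void) {
    // allocate on the heap: N*N doubles won't fit on the stack for large N
    double *a = malloc((size_t)N * N * sizeof *a);
    double *b = malloc((size_t)N * N * sizeof *b);
    double *c = malloc((size_t)N * N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (int i = 0; i < N * N; i++) {
        a[i] = rand() / (double)RAND_MAX;
        b[i] = rand() / (double)RAND_MAX;
        c[i] = 0.0;
    }

    double t0 = omp_get_wtime();             // wall-clock time, not CPU time
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)          // i-k-j order for better locality
            for (int j = 0; j < N; j++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    double t1 = omp_get_wtime();

    printf("N=%d  threads=%d  wall time=%.3f s\n",
           N, omp_get_max_threads(), t1 - t0);
    printf("c[0]=%f\n", c[0]);               // use the result so it isn't optimized away
    free(a); free(b); free(c);
    return 0;
}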
I suggest setting the number of cores/threads within the code itself, either directly on the #pragma line (#pragma omp parallel for num_threads(2)) or with the omp_set_num_threads function (omp_set_num_threads(2);).
Further, when doing timing/performance analysis it is really important to run the program multiple times and then take the mean of the runtimes (or a similar statistic). Running the program only once will not give you a meaningful measurement of the time used. Always run it several times in a row, and also vary the input data.
I suggest writing a test.c file that calls your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm: add some rand() values, seed them with srand(), and so on (a self-contained sketch follows below). If you have more questions about the subject or my answer, leave a comment and I'll try to explain further.
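A self-contained version of that idea might look like this; function_multiplication and values are stand-ins for your actual routine and data, as in the snippet above.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define VALUES_LEN 1000

// Stand-in for the routine you actually want to benchmark.
static double function_multiplication(const double *values) {
    double sum = 0.0;
    for (int i = 0; i < VALUES_LEN; i++)
        sum += 2.0 * values[i];
    return sum;
}

int main(void) {
    double values[VALUES_LEN];
    srand((unsigned)time(NULL));             // vary the input data between runs
    for (int i = 0; i < VALUES_LEN; i++)
        values[i] = rand() / (double)RAND_MAX;

    int executiontimes = 20;
    double sink = 0.0;                       // keep the results so nothing is optimized away
    clock_t initial_time = clock();
    for (int i = 0; i < executiontimes; i++)
        sink += function_multiplication(values);
    clock_t final_time = clock();

    clock_t passed_time = final_time - initial_time;
    double secs_per_exec = (double)passed_time / executiontimes / CLOCKS_PER_SEC;
    printf("avg %.6f s per execution (sink=%f)\n", secs_per_exec, sink);
    return 0;
}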
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall-clock time), try the OpenMP runtime library function omp_get_wtime() defined in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;

clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);

double elapsed_time = (stop.tv_sec - start.tv_sec) +
                      1e-9 * (stop.tv_nsec - start.tv_nsec);

Computing algorithm running time in C

I am using the time.h library in C to find the time taken to run an algorithm. The code structure is roughly as follows:
#include <stdio.h>
#include <time.h>

int main()
{
    time_t start, end, diff;
    start = clock();
    // ALGORITHM COMPUTATIONS
    end = clock();
    diff = end - start;
    printf("%d", diff);
    return 0;
}
The values for start and end are always zero. Is it that the clock() function doesn't work? Please help.
Thanks in advance.
Not that it doesn't work. In fact, it does. But it is not the right way to measure time, as the clock() function returns an approximation of the processor time used by the program. I am not sure about other platforms, but on Linux you should use clock_gettime() with the CLOCK_MONOTONIC flag; that will give you the real wall time elapsed. You can also read the TSC, but be aware that it won't work if you have a multi-processor system and your process is not pinned to a particular core.
If you want to analyze and optimize your algorithm, I'd recommend using some performance measurement tools. I've been using Intel's VTune for a while and am quite happy. It will show you not only which part uses the most cycles, but also highlight memory problems, possible parallelism issues, etc. You may be very surprised by the results. For example, most of the CPU cycles might be spent waiting for the memory bus. Hope it helps!
UPDATE: Actually, if you run later versions of Linux, it might provide CLOCK_MONOTONIC_RAW, which is a hardware-based clock that is not subject to NTP adjustments. Here is a small piece of code you can use:
stopwatch.hpp
stopwatch.cpp
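(The stopwatch.hpp/stopwatch.cpp files referenced above are not reproduced here. As a rough plain-C equivalent of the idea, assuming Linux and the CLOCK_MONOTONIC_RAW clock discussed above:)
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    // ... algorithm computations ...
    clock_gettime(CLOCK_MONOTONIC_RAW, &stop);

    double elapsed = (stop.tv_sec - start.tv_sec)
                   + 1e-9 * (stop.tv_nsec - start.tv_nsec);
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}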
Note that clock() returns the execution time in clock ticks, as opposed to wall clock time. Divide a difference of two clock_t values by CLOCKS_PER_SEC to convert the difference to seconds. The actual value of CLOCKS_PER_SEC is a quality-of-implementation issue. If it is low (say, 50), your process would have to run for 20ms to cause a nonzero return value from clock(). Make sure your code runs long enough to see clock() increasing.
I usually do it this way:
clock_t start = clock();
clock_t end;

// algo

end = clock();
printf("%f", (double)(end - start));  // divide by CLOCKS_PER_SEC to convert ticks to seconds
Consider the code below:
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t t1, t2;
    t1 = t2 = clock();

    // loop until t2 gets a different value
    while (t1 == t2)
        t2 = clock();

    // print resolution of clock()
    printf("%f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
    return 0;
}
Output:
$ ./a.out
10.000000 ms
It might be that your algorithm runs for a shorter amount of time than that.
Use gettimeofday() for a higher-resolution timer.

Why is my computer not showing a speedup when I use parallel code?

So I realize this question sounds stupid (and yes, I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, they were both using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm uncomfortable at lower layers.) This is super weird. Any ideas?
EDIT: Added detail for Grand Central Dispatch in response to OP comment.
While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timing. clock() measures CPU time which is added up across the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more due to threading overhead). Search for clock() on this page, to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."
It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine omp_get_wtime() to do it. Take the following routine as an example:
#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    for (nthreads = 1; nthreads <= 8; nthreads++) {
        clock_timer = clock();
        wall_timer = omp_get_wtime();
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 100000000; i++) cos(i);
        printf("%d threads: time on clock() = %.3f, on wall = %.3f\n",
               nthreads,
               (double)(clock() - clock_timer) / CLOCKS_PER_SEC,
               omp_get_wtime() - wall_timer);
    }
}
The results are:
1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033
You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so it's a little slower using openMP with one thread than not using openMP at all, but the wall time decreases with each thread.
The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.
EDIT: For Grand Central Dispatch, the GCD reference states that it uses gettimeofday for wall time. So I created a new Cocoa app, and in applicationDidFinishLaunching I put:
struct timeval t1, t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
    int stride = 1e8 / iterations;
    gettimeofday(&t1, 0);
    dispatch_apply(iterations, queue, ^(size_t i) {
        for (int j = 0; j < stride; j++) cos(j);
    });
    gettimeofday(&t2, 0);
    NSLog(@"%d iterations: on wall = %.3f\n", iterations,
          t2.tv_sec + t2.tv_usec / 1e6 - (t1.tv_sec + t1.tv_usec / 1e6));
}
and I get the following results on the console:
2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034
which is about the same as I was getting above.
This is a very contrived example. In fact, you need to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and not do the loop at all. Also, the integer that I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See the STRIDE discussion on the manpage for dispatch_apply for how to do it properly and for why iterations is broadly comparable to num_threads in this case.
EDIT: I note that Jacob's answer includes "I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores."
which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:
#include <omp.h>
#include <stdio.h>
int main() {
int i;
#pragma omp parallel for private(i) num_threads(50)
for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}
prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at the Activity Monitor (the OP mentioned GCD, so must be on a Mac - go Window/CPU Usage), you can see jobs switching between cores, so core != thread.
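If you do want to see the core as well as the thread on Linux, here is a small sketch using the glibc-specific sched_getcpu() (not available on macOS, so it wouldn't apply to the GCD case above):
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        // sched_getcpu() reports the CPU the calling thread is currently on;
        // the scheduler may migrate the thread, so this is only a snapshot.
        printf("thread %d running on cpu %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}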
Most likely your execution time isn't bound by those loops you parallelized.
My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.
It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?
Your question is missing some very crucial details such as what the nature of your application is, what portion of it are you trying to improve, profiling results (if any), etc...
Having said that you should remember several critical points when approaching a performance improvement effort:
Efforts should always concentrate on the code areas that profiling has proven to be inefficient.
Parallelizing CPU bound code will almost never improve performance (on a single core machine). You will be losing precious time on unnecessary context switches and gaining nothing. You can very easily worsen performance by doing this.
Even if you are parallelizing CPU bound code on a multicore machine, you must remember you never have any guarantee of parallel execution.
Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.
If you are using a lot of memory inside the loop, that might prevent it from being faster. Also, you could look into the pthread library to handle threading manually.
I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on if you don't specify num_threads. For example:
printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());
The above will work for this pragma
#pragma omp parallel for default(none) shared(a,b,c)
This way you can be sure that it's running on both cores since only 2 threads will be created.
Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, C++ -> Language and set OpenMP Support to Yes
