OpenMP for beginners

I just got started with OpenMP and wrote a small C program to check that I have understood it correctly. However, I ran into some trouble. Here is main.c:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long the PC takes to do this operation using 1 and 2 CPUs. To compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2. However, I don't get the execution times I expect; my results are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
for N=1 and N=2 respectively. As you can see, with 2 CPUs it takes slightly more time than with just one! What am I doing wrong? How can I fix this problem?

First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the distribution of work among your cores, which takes time as well. Especially for very basic operations, such as the single multiplication you are doing, a sequential (single-core) program can perform better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and certainly not of cache shared between CPUs.
So you should start by reading about general algorithm performance. In my opinion, making good use of cache is the essence of benefiting from multiple cores.
Today's CPUs are much faster than memory allocation, reads or writes. This means that with multiple cores you'll only see a benefit if you exploit things like cache, because distributing the data, starting the threads and managing them all cost time as well. To really see a speedup (an essential term in parallel computing), you should program an algorithm that leans heavily on computation rather than on memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from using multiple cores, test it on a matrix-matrix multiplication of big matrices, such as 10'000 x 10'000. Plot input size (matrix size) against time and against GFLOPS, and compare the multicore version with the sequential one.
Also make yourself comfortable with complexity analysis (Big O notation).
Matrix-matrix multiplication performs O(n^3) arithmetic operations on O(n^2) data, so each element is reused O(n) times; that is the O(n) locality that caches can exploit.
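As a rough sketch of the compute-heavy kernel suggested here, the following is a naive matrix-matrix multiplication with the outermost loop parallelized. The function name matmul and the row-major layout are assumptions made for illustration; a serious benchmark would more likely use a blocked or library implementation.

```c
#include <stddef.h>

/* Hypothetical helper: C = A * B for n x n row-major matrices.
 * Each thread computes whole rows of C, so no two threads write
 * to the same element and no reduction clause is needed. */
void matmul(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
    }
}
```

Compiled without -fopenmp the pragma is ignored and the same function runs sequentially, which gives a convenient baseline for the comparison.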
Hope this helps :-)
I suggest setting the number of threads within the code itself, either directly on the #pragma line with #pragma omp parallel for num_threads(2) or by calling omp_set_num_threads(2); beforehand.
Further, when doing time/performance analysis it is really important to run the program multiple times and then take the mean of the runtimes, or a similar statistic. Running a program only once will not give you a meaningful reading of the time used. Always run it several times in a row, and remember to vary the input data as well.
I suggest writing a test.c file that calls your actual function in a loop and then calculates the time per execution:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test harness: generate your values with rand(), seed it with srand(), and so on. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.

The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total cpu time will always be longer than the serial time.
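If you want the real time (wall-clock time), use the OpenMP runtime library function omp_get_wtime(), declared in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
You can also use the POSIX functions declared in time.h (for measuring intervals, CLOCK_MONOTONIC is a safer choice than CLOCK_REALTIME if the system clock might be adjusted during the run):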
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);
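To tie the two timing options together, here is a minimal sketch that times the loop from the question with a wall clock. The helper names wall_seconds, fill_doubled and time_fill are made up for this example; the code uses omp_get_wtime() when compiled with -fopenmp and falls back to clock_gettime() otherwise.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: wall-clock seconds since an arbitrary fixed point. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* The loop from the question, with 0-based indexing so that every
 * write stays inside the array: a[i] = 2 * i for i in [0, n). */
void fill_doubled(int *a, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2 * i;
}

/* Returns the wall-clock seconds taken by one call to fill_doubled. */
double time_fill(int *a, int n)
{
    double t0 = wall_seconds();
    fill_doubled(a, n);
    return wall_seconds() - t0;
}
```

Note that the loop in the question runs i from 1 to N and therefore writes a[N], one element past the end of an N-element array. For N = 1000000 it is also safer to allocate the array with malloc than to place it on the stack.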

Related

OpenMP low performance

I am trying to parallelize a loop in my program, so I looked into multithreading. First I took a look at a POSIX multithreaded programming tutorial, but it was so complicated that I tried something easier: OpenMP. I have successfully parallelized my code, but the execution time is worse than in the serial case. Below is a portion of my program. Can you tell me what the problem is? Should I specify which variables are shared and which are private? And how can I know the kind of each variable? I hope you can answer, because I have searched many forums and I still don't know what to do.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define D 0.215 // magnetic dipolar constant
main()
{
int i,j,n,p,NTOT = 1600,Nc = NTOT-1;
float r[2],spin[2*NTOT],w[2],d;
double E,F,V,G,dU;
.
.
.
for(n = 1; n <= Nc; n++){
fscanf(voisins,"%d%d%f%f%f",&i,&j,&r[0],&r[1],&d);
V = 0.0;E = 0.0;F = 0.0;
#pragma omp parallel num_threads(4)
{
#pragma omp for schedule(auto)
for(p = 0;p < 2;p++)
{
V += (D/pow(d,3.0))*(spin[2*i-2+p]-w[p])*spin[2*j-2+p];
E += (spin[2*i-2+p]-w[p])*r[p];
F += spin[2*j-2+p]*r[p];
}
}
G = -3*(D/pow(d,5.0))*E*F;
dU += (V+G);
}
.
.
.
}//End of main()

You are parallelizing a loop with only 2 iterations: p=0 and p=1. The way that OpenMP's omp for works is by splitting up the loop iterations among your threads in the parallel team (which you've defined as 4 threads) and letting them work through their part of the problem in parallel.
With only 2 iterations, 2 of your threads will be sitting idle. On top of that, actually figuring out which threads will work on which part of the problem takes overhead. And if your actual loop doesn't take long (which in this case it clearly doesn't), the overhead will cost more than the benefits you've gained from parallelization.
A better strategy is usually to parallelize the outermost loops with OpenMP whenever possible, in order to solve both the problem of splitting up the work evenly and the problem of (relative) overhead. Alternatively, you can parallelize at the lowest loop level using OpenMP 4.0's omp simd directive.
Lastly, you are not computing the variables V, E, and F correctly. Because they are summed from iteration to iteration, you should declare them all as reduction variables, for example with reduction(+:V,E,F); as written, the threads race on the shared variables. I would be surprised if you are currently getting the correct answer as is.
(Also as High Performance Mark says: make sure you're timing the wall time execution of your program and not the CPU time execution of your program. This is typically done with omp_get_wtime().)
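A minimal sketch of the reduction advice, assuming a loop long enough to be worth parallelizing. The function three_sums and its arguments are hypothetical and only mirror the shape of the V, E and F accumulations in the question.

```c
/* Hypothetical example: three running sums over n elements.
 * reduction(+:v,e,f) gives each thread private copies of v, e and f
 * and adds them together when the loop ends. */
void three_sums(const double *x, const double *y, int n,
                double *v_out, double *e_out, double *f_out)
{
    double v = 0.0, e = 0.0, f = 0.0;
    #pragma omp parallel for reduction(+:v,e,f)
    for (int p = 0; p < n; p++) {
        v += x[p] * y[p];
        e += x[p];
        f += y[p];
    }
    *v_out = v;
    *e_out = e;
    *f_out = f;
}
```

Because floating-point addition is not associative, results for general inputs may differ in the last bits from a serial run; that is expected with reductions.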

OpenMP Parallel for-loop showing little performance increase

I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes)
for (i = 1; i <= n; i++)
{
if (is_prime(i) == true)
numprimes ++;
}
I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.
The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):
8-core machine:
real 0m8.230s
user 0m50.425s
sys 0m0.004s
dual-core machine:
real 0m10.846s
user 0m17.233s
sys 0m0.004s
And this pattern continues when counting more primes: the machine with more cores shows a slight performance increase, but not as much as I would expect from having so many more cores available. (I would expect 4 times more cores to imply almost 4 times less running time?)
Counting primes up to 50 000 000:
8-core machine:
real 1m29.056s
user 8m11.695s
sys 0m0.017s
dual-core machine:
real 1m51.119s
user 2m50.519s
sys 0m0.060s
If anyone can clarify this for me, it would be much appreciated.
EDIT
This is my prime-checking function.
static int is_prime(int n)
{
/* handle special cases */
if (n == 0) return 0;
else if (n == 1) return 0;
else if (n == 2) return 1;
int i;
for(i=2;i<=(int)(sqrt((double) n));i++)
if (n%i==0) return 0;
return 1;
}

This performance is happening because:
is_prime(i) takes longer the higher i gets, and
Your OpenMP implementation uses static scheduling by default for parallel for constructs without the schedule clause, i.e. it chops the for loop into equal sized contiguous chunks.
In other words, the highest-numbered thread is doing all of the hardest operations.
Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.
This version will divide the work better:
int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
if (is_prime(i) == true)
numprimes ++;
}
Information on scheduling syntax is available via MSDN and Wikipedia.
schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP whitepaper.
Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.

The reason for the apparently poor scaling of your program is, as @naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found, so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.
As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000 and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
Rather than using the default scheduling you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:
#pragma omp parallel for schedule(dynamic,m)
Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small and your computation will be dominated by the work that the run time does in parcelling out iterations, too large and your computation will revert to the unbalanced loads that you have already seen.
Take heart though, you will learn some useful lessons about the costs, and benefits, of parallel computation by working through all of this.
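To experiment with the chunk size m, it can help to make it a function argument. The sketch below is one way to do that; count_primes and its chunk parameter are names invented here, and is_prime is rewritten with an integer test (d * d <= n) instead of sqrt() so the example needs no -lm.

```c
/* Trial division without sqrt(), so no -lm is needed. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (long long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Counts primes in [1, n]; chunk is the m in schedule(dynamic, m)
 * and must be positive. */
int count_primes(int n, int chunk)
{
    int numprimes = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:numprimes)
    for (int i = 1; i <= n; i++)
        numprimes += is_prime(i);
    return numprimes;
}
```

Timing count_primes(10000000, m) for a range of m values, with a wall clock rather than clock(), is a simple way to look for the sweet spot on a given machine.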

I think your code needs dynamic scheduling, so that each thread can consume a different number of iterations, because your iterations have different workloads. The current code splits the iterations evenly, which won't help in your case. Please try this:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes) schedule(dynamic,1)
for (i = 1; i <= n; i++){
if (is_prime(i) == true)
++numprimes;
}

Calculation time elapsed by a particular function in C program

I have code in which I want to measure the time taken by two sorting algorithms, merge sort and quick sort, to sort N numbers, in microseconds or more precisely.
The two times will then be printed to the terminal.
Code (part of the code):
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
mergesort(extarr,0,n-1);
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
Please show me how this can be done. I have tried clock_t, taking two variables, start and stop, and placing them above and below the function call respectively, but this doesn't help at all: the difference always prints as zero.
Please suggest other methods or functions, keeping in mind that the code should run without problems on any type of OS.
Thanks for any help in advance.

Method : 1
To measure the total time taken by a program, you can use the Linux utility "time".
Suppose your program is test.cpp.
$ g++ -o test test.cpp
$ time ./test
The output will look like:
real 0m11.418s
user 0m0.004s
sys 0m0.004s
Method : 2
You can also use the Linux profiler "gprof" to find the time spent in different functions.
First you have to compile the program with the "-pg" flag.
$ g++ -pg -o test test.cpp
$ ./test
$ gprof test gmon.out
PS: gmon.out is the default profile file, written when the instrumented program runs; gprof reads it.

You can call the gettimeofday function on Linux and timeGetTime on Windows. Call the function before and after your sorting function and take the difference.
Please check the man page for further details. If you still can't get meaningful data (the time taken may be too small with small data sets), either measure the total time for n iterations and derive the time for a single run, or increase the size of the data set to be sorted.

Not sure if you tried the following. I know your original post says that you have tried using clock_t. Dividing (stop - start) by CLOCKS_PER_SEC gives you seconds; casting to double before dividing keeps the fractional part instead of truncating to whole seconds.
#include <time.h>
int main(void)
{
clock_t launch = clock();
//do work
clock_t done = clock();
// cast before dividing: clock_t is usually an integer type
double diff = (double)(done - launch) / CLOCKS_PER_SEC;
(void)diff; // print or use diff here
return 0;
}

The reason you get zero as the result is likely the poor resolution of the time source you're using. These time sources typically increment in steps of some 10 to 20 ms. This is poor, but that's the way they work. When your sorting finishes in less than this time increment, the result will be zero. You may improve this resolution into the 1 ms regime by increasing the system's interrupt frequency. There is no standard way to accomplish this on both Windows and Linux; each has its own way.
An even higher resolution can be obtained from a high-frequency counter. Windows and Linux both provide access to such counters, but again, the code may look slightly different.
If you want one piece of code that runs on both Windows and Linux, I'd recommend performing the time measurement in a loop. Run the code to be measured hundreds of times or more in a loop and capture the time outside the loop. Divide the captured time by the number of loop cycles to get the result.
Of course, this is for evaluation only. You don't want to have that in the final code.
And, taking into account that the time resolution is in the range of 1 to 20 ms, you should choose the total run time carefully to get a decent resolution for your measurement. (Hint: adjust the loop count so that it runs for at least a second or so.)
Example:
clock_t start, end;
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
start = clock();
for(int i = 0; i < 100; i++){
mergesort(extarr,0,n-1);
}
end = clock();
double diff = (double)(end - start) / CLOCKS_PER_SEC / 100; // seconds per call, averaged over the 100 runs
// and so on...
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
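One caveat with a repeat loop is that after the first pass the array is already sorted, so later passes measure a different workload. A possible sketch, assuming a scratch buffer is acceptable: copy the unsorted input back before each run. The names seconds_per_sort and sort_ints are hypothetical, and qsort stands in for the mergesort or quicksort being measured.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Comparison callback for qsort, ascending ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the sort being measured (hypothetical; swap in your own). */
static void sort_ints(int *arr, int n)
{
    qsort(arr, (size_t)n, sizeof arr[0], cmp_int);
}

/* Average CPU seconds per call of sort_ints over `reps` runs.
 * The unsorted input is copied into `work` before each run, so every run
 * sorts the same unsorted data; the copy time is included in the result. */
double seconds_per_sort(const int *unsorted, int *work, int n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        memcpy(work, unsorted, (size_t)n * sizeof work[0]);
        sort_ints(work, n);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

If the copy cost matters, it can be estimated separately by timing the same loop with only the memcpy call and subtracting.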

If you are on Linux 2.6.26 or later, getrusage(2) with RUSAGE_THREAD is an accurate way to go:
#include <sys/time.h>
#include <sys/resource.h>
// since Linux 2.6.26
// The macro is not defined in all headers, but supported if your Linux version matches
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif
// If you are single-threaded then RUSAGE_SELF is POSIX compliant
// http://linux.die.net/man/2/getrusage
struct rusage rusage_start, rusage_stop;
getrusage(RUSAGE_THREAD, &rusage_start);
...
getrusage(RUSAGE_THREAD, &rusage_stop);
// amount of microseconds spent in user space
size_t user_time = ((rusage_stop.ru_utime.tv_sec - rusage_start.ru_utime.tv_sec) * 1000000) + rusage_stop.ru_utime.tv_usec - rusage_start.ru_utime.tv_usec;
// amount of microseconds spent in kernel space
size_t system_time = ((rusage_stop.ru_stime.tv_sec - rusage_start.ru_stime.tv_sec) * 1000000) + rusage_stop.ru_stime.tv_usec - rusage_start.ru_stime.tv_usec;
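The subtraction is easier to get right when it lives in one small helper. The following is a sketch under the assumption that process-wide user time is what you want; tv_diff_usec and user_usec_of are invented names, and RUSAGE_SELF is used because it is the option POSIX specifies.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Microseconds elapsed from `start` to `stop`. */
long long tv_diff_usec(struct timeval start, struct timeval stop)
{
    return (long long)(stop.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(stop.tv_usec - start.tv_usec);
}

/* User-mode CPU time of the whole process, in microseconds, spent in fn().
 * RUSAGE_SELF is used here because it is the POSIX-specified option. */
long long user_usec_of(void (*fn)(void))
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    fn();
    getrusage(RUSAGE_SELF, &after);
    return tv_diff_usec(before.ru_utime, after.ru_utime);
}
```

Passing ru_stime instead of ru_utime to tv_diff_usec gives the kernel-space time in the same way. Keep in mind that this is CPU time, so for multithreaded code it sums over threads, just like clock().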

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project deals with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among the threads in such a way that the particles are not cached efficiently in each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}

If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering why you aren't getting any speedup with parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have enough total work to be worth parallelizing, so threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).

There are no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP-related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e. use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that most of the time all particles are processed in a loop like this one. Working with an individual particle then requires more cache lines to be loaded, but if you process them all in a loop, the net number of cache lines loaded is nearly the same.
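For completeness, here is what the data layout behind that loop could look like. The type names particle_arrays and simulation_soa and the function update_soa are assumptions for this sketch; the original post does not show its struct definitions.

```c
/* Hypothetical struct-of-arrays layout: one contiguous array per field. */
typedef struct {
    double *pos;
    double *vel;
} particle_arrays;

typedef struct {
    particle_arrays particles;
    double force;
} simulation_soa;

/* One time step over the first n particles. */
void update_soa(simulation_soa *s, unsigned n, double dt)
{
    #pragma omp parallel for
    for (unsigned i = 0; i < n; ++i) {
        s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
        s->particles.vel[i] = (1 - dt * .1) * s->particles.vel[i] + dt * s->force;
    }
}
```

With this layout all positions are adjacent in memory, and likewise all velocities, which is the access pattern that tends to suit SIMD loads and stores.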

How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core nehalem, I get for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}

Why is my computer not showing a speedup when I use parallel code?

So I realize this question sounds stupid (and yes I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (for the record they were both using their own form of parallel for). They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C, I'm uncomfortable at lower layers.) This is super weird. Any ideas?

EDIT: Added detail for Grand Central Dispatch in response to OP comment.
While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timing. clock() measures CPU time which is added up across the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more due to threading overhead). Search for clock() on this page, to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."
It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine omp_get_wtime() to do it. Take the following routine as an example:
#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int i, nthreads;
clock_t clock_timer;
double wall_timer;
for (nthreads = 1; nthreads <=8; nthreads++) {
clock_timer = clock();
wall_timer = omp_get_wtime();
#pragma omp parallel for private(i) num_threads(nthreads)
for (i = 0; i < 100000000; i++) cos(i);
printf("%d threads: time on clock() = %.3f, on wall = %.3f\n", \
nthreads, \
(double) (clock() - clock_timer) / CLOCKS_PER_SEC, \
omp_get_wtime() - wall_timer);
}
}
The results are:
1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033
You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so it's a little slower using openMP with one thread than not using openMP at all, but the wall time decreases with each thread.
The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.
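Amdahl's law can be written as a one-line function, which makes it easy to see what speedup to expect before measuring. The function name amdahl_speedup is made up for this sketch; p is the fraction of the run time that parallelizes perfectly and n is the number of threads.

```c
/* Amdahl's law: predicted speedup on n threads when a fraction p
 * (0 <= p <= 1) of the run time can be parallelized perfectly. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with p = 0.9 the predicted speedup on 8 threads is about 4.7, and no number of threads can push it past 10.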
EDIT: For Grand Central Dispatch, the GCD reference states that GCD uses gettimeofday for wall time. So I created a new Cocoa App, and in applicationDidFinishLaunching I put:
struct timeval t1,t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
int stride = 1e8/iterations;
gettimeofday(&t1,0);
dispatch_apply(iterations, queue, ^(size_t i) {
for (int j = 0; j < stride; j++) cos(j);
});
gettimeofday(&t2,0);
NSLog(@"%d iterations: on wall = %.3f\n",iterations, \
t2.tv_sec+t2.tv_usec/1e6-(t1.tv_sec+t1.tv_usec/1e6));
}
and I get the following results on the console:
2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034
which is about the same as I was getting above.
This is a very contrived example. In fact, you need to be sure to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and not do the loop at all. Also, the integer that I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See the STRIDE on the manpage for dispatch_apply for how to do it properly and for why iterations is broadly comparable to num_threads in this case.
EDIT: I note that Jacob's answer includes
"I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores."
which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:
#include <omp.h>
#include <stdio.h>
int main() {
int i;
#pragma omp parallel for private(i) num_threads(50)
for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}
prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at the Activity Monitor (the OP mentioned GCD, so must be on a Mac - go Window/CPU Usage), you can see jobs switching between cores, so core != thread.

Most likely your execution time isn't bound by those loops you parallelized.
My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.

It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?

Your question is missing some very crucial details such as what the nature of your application is, what portion of it are you trying to improve, profiling results (if any), etc...
Having said that you should remember several critical points when approaching a performance improvement effort:
Efforts should always concentrate on the code areas which have been proven, by profiling, to be inefficient.
Parallelizing CPU bound code will almost never improve performance (on a single core machine). You will be losing precious time on unnecessary context switches and gaining nothing. You can very easily worsen performance by doing this.
Even if you are parallelizing CPU bound code on a multicore machine, you must remember you never have any guarantee of parallel execution.
Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.

If you are using a lot of memory inside the loop, that might prevent it from being faster. You could also look into the pthread library to handle threading manually.

I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on if you don't specify num_threads. For example:
printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());
The above will work for this pragma
#pragma omp parallel for default(none) shared(a,b,c)
This way you can be sure that it's running on both cores since only 2 threads will be created.
Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, C++ -> Language and set OpenMP Support to Yes