Increase of execution time while using multithreaded FFTW

I am new to the FFTW library. I have successfully implemented 1D and 2D FFTs with it, and then converted my 2D FFT code into a multithreaded 2D FFT. The result was the opposite of what I expected: the multithreaded 2D FFT code takes longer to run than the serial 2D FFT code. I must be missing something somewhere, although I followed all the instructions given in the FFTW documentation for parallelizing the code.
This is my parallelized 2D FFT C program:
#include <mpi.h>
#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define N 2000
#define M 2000
#define index(i, j) (j + i*M)
int i, j;
void get_input(fftw_complex *in) {
for(i=0;i<N;i++){
for(j=0;j<M;j++){
in[index(i, j)][0] = sin(i + j);
in[index(i, j)][1] = sin(i * j);
}
}
}
void show_out(fftw_complex *out){
for(i=0;i<N;i++){
for(j=0;j<M;j++){
printf("%lf %lf \n", out[index(i, j)][0], out[index(i, j)][1]);
}
}
}
int main(){
clock_t start, end;
double time_taken;
start = clock();
int a = fftw_init_threads();
printf("%d\n", a);
fftw_complex *in, *out;
fftw_plan p;
in = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
out = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
get_input(in);
fftw_plan_with_nthreads(4);
p = fftw_plan_dft_2d(N, M, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(p);
/*p = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
puts("In Real Domain");
show_out(out);*/
fftw_destroy_plan(p);
fftw_free(in);
fftw_free(out);
fftw_cleanup_threads();
end = clock();
time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("%g \n", time_taken);
return 0;
}
Can someone please help me find the mistake I am making?

That kind of behavior is typical of incorrect binding.
Generally speaking, OpenMP threads should all be bound to cores of the same socket in order to avoid NUMA effects (which can make performance suboptimal, or even worse).
Also, make sure MPI tasks are correctly bound (one task should be bound to several cores from the same socket, and you should use one OpenMP thread per core).
Because of MPI, there is a risk that your OpenMP threads end up time sharing.
First, I recommend you start by printing both the MPI and the OpenMP binding.
How to achieve that depends on both the MPI library and the OpenMP runtime. If you use Open MPI and the Intel compilers, you can run
KMP_AFFINITY=verbose mpirun --report-bindings --tag-output ...
Then, as suggested earlier, I recommend you start easy and increase complexity:
1. 1 MPI task and 1 OpenMP thread
2. 1 MPI task and x OpenMP threads (x is the number of cores on one socket)
3. x MPI tasks and 1 OpenMP thread per task
4. x MPI tasks and y OpenMP threads per task
Hopefully, 2. will be faster than 1. and 4. will be faster than 3.
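One more thing worth ruling out when comparing these configurations is the timer itself. The program in the question uses clock(), which on Linux and most POSIX systems reports CPU time summed over all threads of the process, so a threaded run can look slower even when it finishes sooner. Below is a minimal sketch of timing a region with both a wall clock and clock(); the helper names (wall_seconds, cpu_seconds, time_region) are made up for illustration, and POSIX clock_gettime is assumed to be available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std=c99 */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock. */
double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU seconds consumed by the process so far, as reported by clock(). */
double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical helper: run work() once and print both timings. */
void time_region(void (*work)(void), const char *label) {
    assert(work != NULL);
    double w0 = wall_seconds(), c0 = cpu_seconds();
    work();
    double w1 = wall_seconds(), c1 = cpu_seconds();
    printf("%s: wall %.6f s, cpu %.6f s\n", label, w1 - w0, c1 - c0);
}
```

Wrapping only the fftw_execute(p) call in such a region (rather than allocation, input generation and planning as well) should make the serial and threaded runs directly comparable. With several threads busy, the CPU figure is expected to exceed the wall figure.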

Related

Why does the time for this simple program to run double if run quickly in succession?

I have been working through the introductory OpenMP examples, and on the first multithreaded example (a numerical integration to pi) I knew the bit about false sharing would be coming, so I implemented the following:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"
#define STEPS 100000000.0
#define MAX_THREADS 4
void pi(double start, double end, double **sum);
int main(){
double * sum[MAX_THREADS];
omp_set_num_threads(MAX_THREADS);
double inc;
bool set_inc=false;
double start=omp_get_wtime();
#pragma omp parallel
{
int ID=omp_get_thread_num();
#pragma omp critical
if(!set_inc){
int num_threads=omp_get_num_threads();
printf("Using %d threads.\n", num_threads);
inc=1.0/num_threads;
set_inc=true;
}
pi(ID*inc, (ID+1)*inc, &sum[ID]);
}
double end=omp_get_wtime();
double tot=0.0;
for(int i=0; i<MAX_THREADS; i++){
tot=tot+*sum[i];
}
tot=tot/STEPS;
printf("The value of pi is: %.8f. Took %f secs.\n", tot, end-start);
return 0;
}
void pi(double start, double end, double **sum_ptr){
double *sum=(double *) calloc(1, sizeof(double));
for(double i=start; i<end; i=i+1/STEPS){
*sum=*sum+4.0/(1.0+i*i);
}
*sum_ptr=sum;
}
My idea was that, by using calloc, the pointers returned would almost never be contiguous, so it would be very unlikely for them to be pulled into the same cache line (though I'm a tad unsure as to why there would be false sharing anyway, as double is 64 bits here and I believed my cache lines were 8 bytes as well, so if you can enlighten me there as well...). Edit: I now realize cache lines are typically 64 bytes, not 64 bits.
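For comparison, the usual way to keep per-thread accumulators apart is to pad each one out to a full cache line rather than rely on where the allocator happens to put them. The following is only a sketch under stated assumptions: a 64-byte line size is assumed (typical on x86-64, but not guaranteed), the names padded_sum and integrate_pi are made up for illustration, and the loop uses an integer counter with the midpoint rule instead of the floating-point loop variable above.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64    /* assumed cache line size in bytes */
#define MAX_THREADS 4

/* One accumulator per thread, padded so no two values share a cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum;

/* Midpoint-rule integration of 4/(1+x^2) on [0,1] using `steps` intervals. */
double integrate_pi(long steps) {
    assert(steps > 0);
    padded_sum sums[MAX_THREADS] = {{0}};
    int used = 1;                      /* number of threads actually running */
    double width = 1.0 / (double)steps;
    #pragma omp parallel num_threads(MAX_THREADS)
    {
        int id = 0, nthreads = 1;
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        if (id == 0) used = nthreads;
        /* Each thread takes every nthreads-th interval. */
        for (long k = id; k < steps; k += nthreads) {
            double x = (k + 0.5) * width;
            sums[id].value += 4.0 / (1.0 + x * x);
        }
    }
    double total = 0.0;
    for (int t = 0; t < used; t++) total += sums[t].value;
    return total * width;
}
```

Built without -fopenmp, the pragma is ignored and the function runs on a single thread, which makes it easy to check the result before measuring anything.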
For fun, after compiling I ran the program in quick succession, and here's a short example of what I got (I was definitely pressing the arrow and enter keys in the terminal faster than once every 0.5 secs):
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.104703 secs.
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.196900 secs.
I thought that maybe something was happening because of the way I tried to avoid the false sharing and since I am still ignorant about the complete happenings amongst the levels of memory I chalked it up to that. So, I followed the prescribed method of the tutorial using a "critical" section like so:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"
#define STEPS 100000000.0
#define MAX_THREADS 4
double pi(double start, double end);
int main(){
double sum=0.0;
omp_set_num_threads(MAX_THREADS);
double inc;
bool set_inc=false;
double start=omp_get_wtime();
#pragma omp parallel
{
int ID=omp_get_thread_num();
#pragma omp critical
if(!set_inc){
int num_threads=omp_get_num_threads();
printf("Using %d threads.\n", num_threads);
inc=1.0/num_threads;
set_inc=true;
}
double temp=pi(ID*inc, (ID+1)*inc);
#pragma omp critical
sum+=temp;
}
double end=omp_get_wtime();
sum=sum/STEPS;
printf("The value of pi is: %.8f. Took %f secs.\n", sum, end-start);
return 0;
}
double pi(double start, double end){
double sum=0.0;
for(double i=start; i<end; i=i+1/STEPS){
sum=sum+4.0/(1.0+i*i);
}
return sum;
}
The doubling in run time is virtually identical. What's the explanation for this? Does it have anything to do with the low level memory? Can you answer my intermediate question?
Thanks a lot.
Edit:
The compiler is gcc 7 on Kubuntu 17.10; the options used were -fopenmp -W -o (in that order).
The system specs include an i5 6500 @ 3.2 GHz and 16 GB of DDR4 RAM (though I forget its clock speed).
As some have asked, the program time does not continue to double if run more than twice in quick succession. After the initial doubling, it remains at around the same time (~.2 secs) for as many successive runs as I have tested (5+). Waiting a second or two, the time to run returns to the lesser amount. However, when the runs are not run manually in succession but rather in one command line such as ./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe; I get:
The value of pi is: 3.14159273. Took 0.100528 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.097707 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.098078 secs.
...
Adding gcc optimization options (-O3) made no difference to any of the results.

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a Monte Carlo pi approximation with OpenMP. I think my code works fine, with convincing output. I am running it with 4 threads for now. What I am not sure about is whether this code is vulnerable to a race condition and, if it is, how I should coordinate the threads in this code.
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
double sample_interval(double a, double b) {
double x = ((double) rand())/((double) RAND_MAX);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int i;
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
{
srand(time(NULL));
for (int i=0; i < N; ++i)
{
x = sample_interval(-1.,1.);
y = sample_interval(-1.,1.);
z = ((x*x)+(y*y));
if (z<= 1)
{
counter++;
}
}
}
double approx_pi = 4.0 * counter/ (double)N;
printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
return 0;
}
Also, I was wondering whether the random number generator should be seeded inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
This looks OK for now. Your suggestions on the race condition and seed placement are most welcome.

There are a few problems in your code that I can see. The main one is from my standpoint that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP while compiling it. Here is the way one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to be 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty-much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time for getting a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for for example)
But that is if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator, one that allows you to privatize its state. That could be rand_r(), for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly go over the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue, in the sense that you want your parallel sequences to be uncorrelated with each other. And just using a different seed value per thread doesn't guarantee that (although with a wide-enough RNG, you have a bit of headroom for that).
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per-thread based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to finally have overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
typedef struct drand48_data RNGstate;
double sample_interval(double a, double b, RNGstate *state) {
double x;
drand48_r(state, &x);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
time_t ctime = time(NULL);
#pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
{
RNGstate state;
srand48_r(ctime+omp_get_thread_num(), &state);
for (int i=0; i < N; ++i) {
x = sample_interval(-1, 1, &state);
y = sample_interval(-1, 1, &state);
z = ((x*x)+(y*y));
if (z<= 1) {
counter++;
}
}
}
double approx_pi = 4.0 * counter / ((double)NumThreads * N); // cast avoids int overflow for large N
printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp
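The other option mentioned above, distributing the loop over N with #pragma omp for instead of dividing the result by NumThreads, could be sketched as follows. This is an illustration rather than a drop-in replacement: the function name monte_carlo_pi and its seed parameter are made up, and POSIX erand48() (which keeps its state in a caller-supplied array) is swapped in for drand48_r() so that the sketch does not depend on the GNU extensions. The caveat about correlation between per-thread sequences applies here just the same.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* for erand48 under strict -std=c99 */
#endif
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Estimate pi from n random points; the n iterations are shared among threads. */
double monte_carlo_pi(long n, unsigned short seed) {
    assert(n > 0);
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        /* Private generator state per thread, derived from seed and thread id. */
        unsigned short state[3] = { seed, (unsigned short)id, 0x330E };
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            double x = 2.0 * erand48(state) - 1.0;   /* uniform in [-1, 1) */
            double y = 2.0 * erand48(state) - 1.0;
            if (x * x + y * y <= 1.0) hits++;
        }
    }
    /* Exactly n points were drawn in total, so no division by the thread count. */
    return 4.0 * (double)hits / (double)n;
}
```

Because the work is shared, the total number of samples stays at N whatever the thread count, so the estimate keeps the same meaning with or without -fopenmp.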

How to measure overall performance of parallel programs (with papi)

I have been wondering what the best way is to measure the performance (in flops) of a parallel program. I read about PAPI_flops. It seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.
I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance as a function of the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"
int main(int argc, char *argv[] )
{
int i, j, l, k, n, m, idx, iter;
int mat, mat_min, mat_max;
int threads;
double *A, *B, *C;
double alpha =1.0, beta=0.0;
float rtime1, rtime2, ptime1, ptime2, mflops;
long long flpops;
#pragma omp parallel
{
#pragma omp master
threads = omp_get_num_threads();
}
if(argc < 4){
printf("pass me 3 arguments!\n");
return( -1 );
}
else
{
mat_min = atoi(argv[1]);
mat_max = atoi(argv[2]);
iter = atoi(argv[3]);
}
m = mat_max; n = mat_max; k = mat_max;
printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
" A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
A = (double *) malloc( m*k * sizeof(double) );
B = (double *) malloc( k*n * sizeof(double) );
C = (double *) malloc( m*n * sizeof(double) );
printf (" Intializing matrix data \n\n");
for (i = 0; i < (m*k); i++)
A[i] = (double)(i+1);
for (i = 0; i < (k*n); i++)
B[i] = (double)(-i-1);
memset(C,0,m*n*sizeof(double));
// actual meassurment
for(mat=mat_min;mat<=mat_max;mat+=5)
{
m = mat; n = mat; k = mat;
for( idx=-1; idx<iter; idx++ ){
PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
m, n, k, alpha, A, k, B, n, beta, C, n);
PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
}
printf("%d threads: %d in %f sec, %f MFLOPS\n",threads,mat,rtime2-rtime1,mflops);fflush(stdout);
}
printf("Done\n");fflush(stdout);
free(A);
free(B);
free(C);
return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
We can see for the execution time, that the function gemm scales. But the flops that I am measuring is only the performance of thread 0.
My question is: How can I measure the overall performance? I am grateful for any input.

First, I'm just curious: why do you need the FLOPS? Don't you just care how much time is taken? Or maybe the time taken compared with other BLAS libraries?
PAPI is thread-based, so it is not much help on its own here.
What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than there are physical cores (HT is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should ideally just divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
Other than that, there are 2 things you can do to really measure it:
1. Use whatever you use now but inject your start/end measurement code around the BLAS code. One way to do that (in Linux) is by pre-loading a lib that defines pthread_start and using your own functions that call the originals but do some extra measurements. Another way is to override the function pointer when the process is already running (=trampoline). In Linux it's in the GOT/PLT and in Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or better yet, to report the number of floating point instructions executed. A little problem with this is that SSE instructions multiply or add 2 or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
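For routines whose operation count is known, such as gemm, the wall-clock approach gives an aggregate rate directly, whatever number of threads the library spawns. A small sketch, assuming the usual 2*m*n*k count for C = A*B and a wall-clock time measured around the call; the function name gemm_mflops is made up for illustration.

```c
#include <assert.h>

/* Aggregate MFLOPS of an m x k by k x n matrix product that took `seconds`
 * of wall-clock time, using the nominal 2*m*n*k operation count. */
double gemm_mflops(int m, int n, int k, double seconds) {
    assert(seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1.0e6;
}
```

Applied to the 4-thread row of the question (n = 200, 0.000423 s), this works out to roughly 37,800 MFLOPS for the whole process. It does not help for routines with an unknown operation count, which is where the two approaches above come in.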

MPI wrapper that imitates OpenMP's for-loop pragma

I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region( chunk_size=100 , num_proc=10 );
for( int i=0 ; i<1000 ; i++ )
{
//some computation
}
end_parallel_region();
The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synched and are put idle.
EDITED in response to High Performance Mark.
I have no intention to simulate the OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathematical functions.
In these mathematical functions, there often exist for loops like the one below.
for( int i=0 ; i<n ; i++ )
{
s = s + sin(x[i]);
}
So I want to first be able to distribute sin(x[i]) to slave processors and at the end reduce to the single variable, just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
Thanks.

There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops; what is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade single-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
EDIT
Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum on to one process using MPI_Reduce.
I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.

Yes, you could do this, for specific tasks. But you shouldn't.
Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
typedef struct state_t {
int globaln;
int localn;
int *locals;
int *offsets;
double *localin;
double *localout;
double (*map)(double);
} state;
state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
state *s = malloc(sizeof(state));
s->globaln = n;
s->map = map;
/* figure out decomposition */
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
s->locals = malloc(size * sizeof(int));
s->offsets = malloc(size * sizeof(int));
s->offsets[0] = 0;
for (int i=0; i<size; i++) {
s->locals[i] = (n+i)/size;
if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
}
/* allocate local arrays */
s->localn = s->locals[rank];
s->localin = malloc(s->localn*sizeof(double));
s->localout = malloc(s->localn*sizeof(double));
/* distribute */
MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
s->localin, s->locals[rank], MPI_DOUBLE,
0, MPI_COMM_WORLD);
return s;
}
double end_parallel_mapandsum(state **s) {
double localanswer=0., answer;
/* sum up local answers */
for (int i=0; i<((*s)->localn); i++) {
localanswer += ((*s)->localout)[i];
}
/* and get global result. Everyone gets answer */
MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
free( (*s)->localin );
free( (*s)->localout );
free( (*s)->locals );
free( (*s)->offsets );
free( (*s) );
return answer;
}
int main(int argc, char **argv) {
int rank;
double *inputs = NULL; /* only allocated on rank 0; ignored by MPI_Scatterv elsewhere */
double result;
int n=100;
const double pi=4.*atan(1.);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
inputs = malloc(n * sizeof(double));
for (int i=0; i<n; i++) {
inputs[i] = 2.*pi/n*i;
}
}
state *s=begin_parallel_mapandsum(inputs, n, sin);
for (int i=0; i<s->localn; i++) {
s->localout[i] = (s->map)(s->localin[i]);
}
result = end_parallel_mapandsum(&s);
if (rank == 0) {
printf("Calculated result: %lf\n", result);
double trueresult = 0.;
for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
printf("True result: %lf\n", trueresult);
}
MPI_Finalize();
}
That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit in any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just
for( int i=0 ; i<localn ; i++ )
{
s = s + sin(x[i]);
}
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.

Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:
http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project deals with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among the threads in such a way that the particles are not cached efficiently in each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}

If you're asking whether it's parallelized correctly, it looks fine. I don't see any data races or loop dependencies that could break it.
But I think you're wondering why you aren't getting any speedup with parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have enough total work to be worth parallelizing, so threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).

There are no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP-related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler to vectorize the code (i.e. use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that most of the time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net number of cache lines loaded is nearly the same.
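For concreteness, the struct-of-arrays loop presupposes a layout along the following lines. This is a sketch only: the type and function names (particle_arrays, simulation_soa, init_soa, update_soa, free_soa) are made up for illustration, and a single double per position and velocity is assumed, since the question does not show the real particle type.

```c
#include <assert.h>
#include <stdlib.h>

/* Struct-of-arrays layout: one contiguous array per particle field. */
typedef struct {
    double *pos;
    double *vel;
} particle_arrays;

typedef struct {
    particle_arrays particles;
    double force;
} simulation_soa;

/* Allocate np particles with unit position and velocity; returns 0 on success. */
int init_soa(simulation_soa *s, unsigned np) {
    assert(s != NULL);
    s->force = 1.0;
    s->particles.pos = malloc(np * sizeof(double));
    s->particles.vel = malloc(np * sizeof(double));
    if (!s->particles.pos || !s->particles.vel) {
        free(s->particles.pos);
        free(s->particles.vel);
        s->particles.pos = s->particles.vel = NULL;
        return -1;
    }
    for (unsigned i = 0; i < np; i++) {
        s->particles.pos[i] = 1.0;
        s->particles.vel[i] = 1.0;
    }
    return 0;
}

/* Same update as in the question, written against the two field arrays. */
void update_soa(simulation_soa *s, unsigned psize, double dt) {
    double *pos = s->particles.pos;
    double *vel = s->particles.vel;
    #pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i) {
        pos[i] = pos[i] + dt * vel[i];
        vel[i] = (1 - dt * .1) * vel[i] + dt * s->force;
    }
}

void free_soa(simulation_soa *s) {
    free(s->particles.pos);
    free(s->particles.vel);
}
```

Whether this layout actually vectorizes better depends on the compiler and on the real particle type, so it is worth timing both variants.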

How sure are you that you're not getting speedup?
Trying it both ways (array of structs and struct of arrays), compiled with gcc -O3 (gcc 4.6) on a dual quad-core Nehalem, I get the following for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}