Benchmarking, sequential x parallel program. Sublinear speedup?

Benchmarking, sequential x parallel program. Sublinear speedup? - c

Update2. Solved! This is memory issue. Some benching about it here:
http://dontpad.com/bench_mem
Update. My goal is to achieve best throughput. All my results are here.
Sequential Results:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdE8yQVNHRkRiQ2VzeElIRWwxMWtRcVE&usp=sharing
Parallel Results*:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdEhTb2plT09PNEs3ajBvWUlVaWt0ZUE&usp=sharing
multsoma_par1_vN, N determines how data is acessed by each thread.
N: 1 - NTHREADS displacement, 2 - L1 displacement, 3 - L2 displacement, 4 - TAM/NTHREADS
I am having a hard time trying to figure out why my parallel code runs just slighty faster than sequential code.
What I basically do is to loop through a big array (10^8 elements) of a type (int/float/double) and apply the computation: A = A * CONSTANT + B. Where A and B are arrays of same size.
Sequential code only do a single function call.
Parallel version create pthreads and uses the same function as starting function.
I am using gettimeofday(), RDTSC() and more recently getrusage() to measure timings. My main results are expressed by Clocks per Element (CPE).
My processor is an i5-3570K. 4 Cores, no hyper-threading.
The problem is that I can get up to 2.00 CPE under sequential code and when going parallel my best performance was 1.84 CPE. I know that I get an overhead by creating pthreads and calling more timing routines, but I don't think this is the reason for not getting better timings.
I did measured each thread CPE and executed the program with 1, 2, 3 and 4 threads. When creating only one thread, I get the expected result CPE around 2.00 (+ some overhead expressed in miliseconds but overall CPE is not affected at all).
When running with 2 threads or more the main CPE decreases, but each thread CPE increases.
2 threads I get main CPE around 1.9 and each thread to 3.8 ( Why this is not 2.0 ?! )
The same happens to 3 and 4 threads.
4 threads I get main CPE around 1.85 (my best timing) and each thread with 7.0~7.5 CPE.
Using many threads more than avaiable cores(4) I still getting CPE under 2.0 but not better than 1.85 (most times higher due to overhead).
I suspect that maybe context switching could be the limiting factor here. When running with 2 threads I can count 5 to 10 involuntary contexts switch from each thread...
But I am not so sure about this. Are those seemly few context switches enough to almost double my CPE ? I was expecting to atleast get around 1.00 CPE using all my CPU Cores.
I went further on this and analyzed the assembly code for this function. They are identical, except for some extra shifts and adds (4 instructions) at the very beginning of the function and they are out of loops.
In case you want to see some code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <cpuid.h>
typedef union{
unsigned long long int64;
struct {unsigned int lo, hi;} int32;
} tsc_counter;
#define RDTSC(cpu_c) \
__asm__ __volatile__ ("rdtsc" : \
"=a" ((cpu_c).int32.lo), \
"=d" ((cpu_c).int32.hi) )
#define CNST 5
#define NTHREADS 4
#define L1_SIZE 8096
#define L2_SIZE 72512
typedef int data_t;
data_t * A;
data_t * B;
int tam;
double avg_thread_CPE;
tsc_counter thread_t0[NTHREADS], thread_t1[NTHREADS];
struct timeval thread_sec0[NTHREADS], thread_sec1[NTHREADS];
void fillA_B(int tam){
int i;
for (i=0;i<tam;i++){
A[i]=2; B[i]=2;
}
return;
}
void* multsoma_par4_v4(void *arg){
int w;
int i,j;
int *id = (int *) arg;
int limit = tam-14;
int size = tam/NTHREADS;
int tam2 = ((*id+1)*size);
int limit2 = tam2-14;
gettimeofday(&thread_sec0[*id],NULL);
RDTSC(thread_t0[*id]);
//Mult e Soma
for (i=(*id)*size;i<limit2 && i<limit;i+=15){
A[i] = A[i] * CNST + B[i];
A[i+1] = A[i+1] * CNST + B[i+1];
A[i+2] = A[i+2] * CNST + B[i+2];
A[i+3] = A[i+3] * CNST + B[i+3];
A[i+4] = A[i+4] * CNST + B[i+4];
A[i+5] = A[i+5] * CNST + B[i+5];
A[i+6] = A[i+6] * CNST + B[i+6];
A[i+7] = A[i+7] * CNST + B[i+7];
A[i+8] = A[i+8] * CNST + B[i+8];
A[i+9] = A[i+9] * CNST + B[i+9];
A[i+10] = A[i+10] * CNST + B[i+10];
A[i+11] = A[i+11] * CNST + B[i+11];
A[i+12] = A[i+12] * CNST + B[i+12];
A[i+13] = A[i+13] * CNST + B[i+13];
A[i+14] = A[i+14] * CNST + B[i+14];
}
for (; i<tam2 && i<tam; i++)
A[i] = A[i] * CNST + B[i];
RDTSC(thread_t1[*id]);
gettimeofday(&thread_sec1[*id],NULL);
double CPE, elapsed_time;
CPE = ((double)(thread_t1[*id].int64-thread_t0[*id].int64))/((double)(size));
elapsed_time = (double)(thread_sec1[*id].tv_sec-thread_sec0[*id].tv_sec)*1000;
elapsed_time+= (double)(thread_sec1[*id].tv_usec - thread_sec0[*id].tv_usec)/1000;
//printf("Thread %d workset - %d\n",*id,size);
//printf("CPE Thread %d - %lf\n",*id, CPE);
//printf("Time Thread %d - %lf\n",*id, elapsed_time/1000);
avg_thread_CPE+=CPE;
free(arg);
pthread_exit(NULL);
}
void imprime(int tam){
int i;
int ans = 12;
for (i=0;i<tam;i++){
//printf("%d ",A[i]);
//checking...
if (A[i]!=ans) printf("WA!!\n");
}
printf("\n");
return;
}
int main(int argc, char *argv[]){
tsc_counter t0,t1;
struct timeval sec0,sec1;
pthread_t thread[NTHREADS];
double CPE;
double elapsed_time;
int i;
int* id;
tam = atoi(argv[1]);
A = (data_t*) malloc (tam*sizeof(data_t));
B = (data_t*) malloc (tam*sizeof(data_t));
fillA_B(tam);
avg_thread_CPE = 0;
//Start Computing...
gettimeofday(&sec0,NULL);
RDTSC(t0); //Time Stamp 0
for (i=0;i<NTHREADS;i++){
id = (int*) malloc(sizeof(int));
*id = i;
if (pthread_create(&thread[i], NULL, multsoma_par4_v4, (void*)id)) {
printf("--ERRO: pthread_create()\n"); exit(-1);
}
}
for (i=0; i<NTHREADS; i++) {
if (pthread_join(thread[i], NULL)) {
printf("--ERRO: pthread_join() \n"); exit(-1);
}
}
RDTSC(t1); //Time Stamp 1
gettimeofday(&sec1,NULL);
//End Computing...
imprime(tam);
CPE = ((double)(t1.int64-t0.int64))/((double)(tam)); //diferenca entre Time_Stamps/repeticoes
elapsed_time = (double)(sec1.tv_sec-sec0.tv_sec)*1000;
elapsed_time+= (double)(sec1.tv_usec - sec0.tv_usec)/1000;
printf("Main CPE: %lf\n",CPE);
printf("Avg Thread CPE: %lf\n",avg_thread_CPE/NTHREADS);
printf("Time: %lf\n",elapsed_time/1000);
free(A); free(B);
return 0;
}
I appreciate any help.

After seeing the full code, I rather agree with the guess of #nosid in comments: since the ratio of compute operations to memory loads is low, and the data (about 800M if I am not mistaken) don't fit in cache, the memory bandwidth is likely the limiting factor. The link to the main memory is shared to all cores in a processor, so when its bandwidth is saturated, all memory operations start stalling and take longer time; thus CPE increases.
Also, the following place in your code is a data race:
avg_thread_CPE+=CPE;
as you sum up CPE values calculated on different threads to a single global variable without any synchronization.
Below I left the part of my initial answer, including the "first statement" referred in the comments. I still find it correct, for the definition of CPE as the number of clocks taken by the operations over a single element.
You should not expect the clocks per element (CPE) metric to decrease
due to use of multiple threads. By definition, it's how fast a
single data item is processed, in average. Threading helps to process all data faster (by simultaneous processing on different
cores), so the elapsed wallclock time, i.e. the time to execute the
whole program, should be expected to decrease.

Related

False sharing in multi threads

The following code runs slower as I increase the NTHREADS. Why use more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing but I do not really understand that concept.
The program basicly calculate the sum from 1 to 100000000. The idea to use multithread is to seperate the number list into several chuncks, and calculate the sum of each chunck parallelly to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)
typedef struct {
size_t id;
long *array;
long result;
} worker_args;
void *worker(void *args) {
worker_args *wargs = (worker_args*) args;
const size_t start = wargs->id * CHUNCK;
const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id+1) * CHUNCK;
for (size_t i = start; i < end; ++i) {
wargs->result += wargs->array[i];
}
return NULL;
}
int main(void) {
long* numbers = malloc(sizeof(long) * LENGTH);
for (size_t i = 0; i < LENGTH; ++i) {
numbers[i] = i + 1;
}
worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
for (size_t i = 0; i < NTHREADS; ++i) {
args[i] = (worker_args) {
.id = i,
.array = numbers,
.result = 0
};
}
pthread_t thread_ids[NTHREADS];
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_create(thread_ids+i, NULL, worker, args+i);
}
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
long sum = 0;
for (size_t i = 0; i < NTHREADS; ++i) {
sum += args[i].result;
}
printf("Run %2zu: total sum is %ld\n", n, sum);
free(args);
free(numbers);
}

Why use more threads make the program run slower?
There is an overhead creating and joining threads. If the threads hasn't much to do then this overhead may be more expensive than the actual work.
Your threads are only doing a simple sum which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the work load per thread a lot.
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work load saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble.
You need a more processing intensive task before multi-threading is worth doing.
Also notice that it seldom make sense to create more threads than the number of cores (incl hyper thread) your computer has.
false sharing
yes, "false sharing" can impact the performance of a multi-threaded program but I doubt it's the real problem in your case.
"false sharing" is something that happens in (some) cache systems when two threads (or rather two cores) writes to two different variables that belongs to the same cache line. In such cases the two threads/cores competes to own the cache line (for writing) and consequently, they'll have to refresh the memory and the cache again and again. That's bad for performance.
As I said - I doubt that is your problem. A clever compiler will do your loop solely be using CPU registers and only write to memory at the end. You can check the disassemble of your code to see if that is the case.
You can avoid "false sharing" by increasing the sizeof of your struct so that each struct fits the size of a cache line on your system.

Non-deterministic CUDA C kernel

I'm still a beginner with CUDA and I have been trying to write a simple kernel to perform a parallel prime sieve on the GPU. Originally I had written my code in C but I wanted to investigate the speed up on a GPU so I rewrote it:
41.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define B 1024
#define T 256
#define N (B*T)
#define checkCudaErrors(error) {\
if (error != cudaSuccess) {\
printf("CUDA Error - %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(error));\
exit(1);\
}\
}\
__global__ void prime_sieve(int *primes) {
unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
primes[i] = i;
primes[0] = primes[1] = 0;
if (i > 1 && i<N) {
for (int j=2; j<N/2; j++) {
if (i*j < N) {
primes[i*j] = 0;
}
}
}
}
int main() {
int *h_primes=(int*)malloc(N * sizeof(int));
int *d_primes;
checkCudaErrors(cudaMalloc( (void**)&d_primes, N*sizeof(int)));
checkCudaErrors(cudaMemcpy(d_primes,h_primes,N*sizeof(int),cudaMemcpyHostToDevice));
prime_sieve<<<B,T>>>(d_primes);
checkCudaErrors(cudaMemcpy(h_primes,d_primes,N*sizeof(int),cudaMemcpyDeviceToHost));
checkCudaErrors(cudaFree(d_primes));
int size = 0;
int total = 0;
for (int i=2; i<N; i++) {
if (h_primes[i]) {
size++;
}
total++;
}
printf("\n");
printf("Length = %d\tPrimes = %d\n",total,size);
free(h_primes);
return 0;
}
I run the program on Ubuntu 16.04 (4.4.0-83-generic) and I compile using nvcc 41.cu -o 41.o -arch=sm_30 under version 8.0.61. The program is run on a GeForce GTX 780 Ti but everytime it runs, it always produces non-deterministic results:
Length = 262142 Primes = 49477
Length = 262142 Primes = 49486
Length = 262142 Primes = 49596
Length = 262142 Primes = 49589
There were no errors reported back. At first I thought it was a race condition but cuda-memcheck didn't report back any hazards for racecheck,initcheck or synccheck and I couldn't think of any problems with my assumptions. I was thinking this could be a synchronisation problem?
This non-deterministic behaviour only occurs when I increase the block size and thread size as seen in the code. When I tried a block size and thread size of say 16, then there were no problems (as far as I could tell). It seems that not all threads get the chance to execute? I was planning to run this on very large array sizes (< 1 billion integers) but I am stuck at this point.
What am I doing wrong here?

There is a giant race-condition
So prime[i] > 0 means prime, while prime[i]=0 means composite.
primes[i] = i; is executed as first update on primes by each thread. Keep this in mind.
Now let's see what happen when thread 16 executes. It marks primes[16]=16 and and all multiples of 16 too. Something like the following
primes[16] = primes[32] = primes[48]=....=primes[k*16]=0
Imagine that thread 48 gets scheduled just after thread 16 completed its job (or when j>3 in thread 16 loop`).
Thread 48 sets primes[48] = 48. You have lost the update made by thread 16.
That is a race condition.
When coding in CUDA you should make sure that the correctness of your code does not depend on a particular scheduling of warps.
You should think as the order of execution as something non-deterministic.

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect the changes on an array. Function array_modifier() picks a random element and toggles it's value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear, I know this is bad practice). change_detector() scans the array and when an element doesn't match it's prior value and it is equal to 1, the change is detected and diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)
typedef struct {
unsigned int tid;
} parm;
static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};
void* array_modifier(void* args);
void* change_detector(void* arg);
int main(int argc, char** argv) {
if (argc < 2) {
exit(1);
}
N = (unsigned int)strtoul(argv[1], NULL, 0);
my_array = calloc(N, sizeof(int));
time_array = malloc(N * sizeof(struct timeval));
old_value = calloc(N, sizeof(int));
parm* p = malloc(NTHREADS * sizeof(parm));
pthread_t generator_thread;
pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
for (unsigned int i = 0; i < NTHREADS; i++) {
p[i].tid = i;
pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
}
pthread_create(&generator_thread, NULL, array_modifier, NULL);
pthread_join(generator_thread, NULL);
usleep(500);
for (unsigned int i = 0; i < NTHREADS; i++) {
pthread_cancel(detector_thread[i]);
}
for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
fprintf(stderr, "\n");
_exit(0);
}
void* array_modifier(void* arg) {
UNUSED(arg);
srand(time(NULL));
unsigned int changing_signals = CHANGES;
while (changing_signals--) {
usleep(TIME_INTERVAL);
const unsigned int r = rand() % N;
gettimeofday(&time_array[r], NULL);
my_array[r] ^= 1;
}
pthread_exit(NULL);
}
void* change_detector(void* arg) {
const parm* p = (parm*) arg;
const unsigned int tid = p->tid;
const unsigned int start = tid * (N / NTHREADS) +
(tid < N % NTHREADS ? tid : N % NTHREADS);
const unsigned int end = start + (N / NTHREADS) +
(tid < N % NTHREADS);
unsigned int r = start;
while (1) {
unsigned int tmp;
while ((tmp = my_array[r]) == old_value[r]) {
r = (r < end - 1) ? r + 1 : start;
}
old_value[r] = tmp;
if (tmp) {
struct timeval tv;
gettimeofday(&tv, NULL);
// detection time in usec
diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
}
}
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.

Short Answer
You are sharing memory between thread and sharing memory between threads is slow.
Long Answer
Your program is using a number of thread to write to my_array and another thread to read from my_array. Effectively my_array is shared by a number of threads.
Now lets assume you are benchmarking on a multicore machine, you probably are hoping that the OS will assign different cores to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance CPUs have multi-level caches. The fastest Cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20 - 30 cycles.
Now in lots of CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between thread (cores) has to go through the L2 cache which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.
Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.
Aside
Never rely on volatile to do the correct thing when sharing memory between thread, either use your library atomic operations or use mutexes. This is because some CPUs allow out of order reads and writes that may do strange things if you do not know what you are doing.

It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of ca 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in L1/L2/L3 cache. And if the data is too large to fit into RAM you need to factor in disk I/O. Usually, multithreaded implementations are slower, because they want to work on more data at a time but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the for loop. Each iteration takes but a few clock cycles. Then you call gettimeofday which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

How to measure overall performance of parallel programs (with papi)

I asked myself what would be the best way to measure the performance (in flops) of a parallel program. I read about papi_flops. This seems to work fine for a serial program. But I don't know how I can measure the overall performance of a parallel program.
I would like to measure the performance of a blas/lapack function, in my example below gemm. But I also want to measure other function, specially functions where the number of operation is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance as a function of the number of operations and the execution time.) The library (I am using Intel MKL) spawn the threads automatically. So I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"
int main(int argc, char *argv[] )
{
int i, j, l, k, n, m, idx, iter;
int mat, mat_min, mat_max;
int threads;
double *A, *B, *C;
double alpha =1.0, beta=0.0;
float rtime1, rtime2, ptime1, ptime2, mflops;
long long flpops;
#pragma omp parallel
{
#pragma omp master
threads = omp_get_num_threads();
}
if(argc < 4){
printf("pass me 3 arguments!\n");
return( -1 );
}
else
{
mat_min = atoi(argv[1]);
mat_max = atoi(argv[2]);
iter = atoi(argv[3]);
}
m = mat_max; n = mat_max; k = mat_max;
printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
" A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
A = (double *) malloc( m*k * sizeof(double) );
B = (double *) malloc( k*n * sizeof(double) );
C = (double *) malloc( m*n * sizeof(double) );
printf (" Intializing matrix data \n\n");
for (i = 0; i < (m*k); i++)
A[i] = (double)(i+1);
for (i = 0; i < (k*n); i++)
B[i] = (double)(-i-1);
memset(C,0,m*n*sizeof(double));
// actual meassurment
for(mat=mat_min;mat<=mat_max;mat+=5)
{
m = mat; n = mat; k = mat;
for( idx=-1; idx<iter; idx++ ){
PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
m, n, k, alpha, A, k, B, n, beta, C, n);
PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
}
printf("%d threads: %d in %f sec, %f MFLOPS\n",threads,mat,rtime2-rtime1,mflops);fflush(stdout);
}
printf("Done\n");fflush(stdout);
free(A);
free(B);
free(C);
return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
We can see for the execution time, that the function gemm scales. But the flops that I am measuring is only the performance of thread 0.
My question is: How can I measure the overall performance? I am grateful for any input.

First, I'm just curious - why do you need the FLOPS? don't you just care how much time is taken? or maybe time taken in compare to other BLAS libraries?
PAPI is thread based not much help on its own here.
What I would do is measure around the function call and see how time changes with number of threads it spawns. It should not spawn more threads than physical cores (HT is no good here). Then, if the matrix is big enough, and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 core should become 2.5 seconds.
Other than that, there are 2 things you can do to really measure it:
1. Use whatever you use now but inject your start/end measurement code around the BLAS code. One way to do that (in linux) is by pre-loading a lib that defines pthread_start and using your own functions that call the originals but do some extra measurements. Another way to to override the function pointer when the process is already running (=trampoline). In linux it's in the GOT/PLT and in windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report number of instructions executed in the time you care for. Or better yet, to report the number of floating point instructions executed. A little problem with this is that SSE instructions are multiplying or adding 2 or more doubles at a time so you'd have to account for that. I guess you can assume they always use the maximum possible operands.

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project is dealing with particle systems, but I don't think that should relevant to the parallelization of the code. Is it a caching problem where the for loop divides the threads in a way such that the particles are not cached in each core in an efficient manner?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}

If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering on why you aren't getting any speedup with parallelism.
Since you mentioned that the trip count, psize-n_dead will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have much total work to be worth parallelizing. So threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).

There is no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), possibly performance improvement can be achieved by switching from array-of-structs to struct-of-arrays, as this might help compiler to vectorize the code (i.e. use SIMD instructions of a target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such reorganization assumes that most time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net amount of cache lines loaded is nearly the same.

How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core nehalem, I get for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported time are in milliseconds)
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, I because the operation is so short (sub-ms) I didn't see any speedup without doing 100 iterations because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight