I am trying to determine the L1 cache line size in C, on a platform where the L1 instruction and data caches are 32 KB each and the L2 cache is 2 MB.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define SIZE 100

long long wall_clock_time();

int main()
{
    int *arr = calloc(SIZE, sizeof(int));
    register int r, i;
    long long before, after;
    double time_elapsed;

    for (i = 0; i < SIZE; i++)
    {
        before = wall_clock_time();
        r = arr[i];
        after = wall_clock_time();
        time_elapsed = ((double)(after - before)) / 1000000000;
        printf("Element Index = %d, Time Taken = %1.4f\n", i, time_elapsed);
    }
    free(arr);
    return 0;
}

long long wall_clock_time() {
#ifdef __linux__
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)(tv.tv_usec * 1000 + (long long)tv.tv_sec * 1000000000ll);
#endif
}
Above is a small code snippet that I am using to access elements of an array, trying to detect the jump in access delay at cache line boundaries. However, when I execute the code, all the timing outputs are 0.0000. I have read several threads on Stack Overflow about this topic but couldn't understand much, hence I attempted to write this code.
Can anybody explain to me whether there is an error, conceptual or syntactic?
The 0.0000 should have hinted that you're measuring something too small: the overhead of calling the measurement function is several orders of magnitude higher than what you are measuring.
Instead, measure the overall time it takes to traverse the array, and divide by SIZE to amortize it. Since SIZE is also rather small, you should probably repeat the traversal several hundred times and amortize over the entire run.
Note that this still won't give you the latency, but rather the throughput of the accesses. You'll need to come up with a way to derive the line size from that: try reading from the 2nd-level cache, and use the fact that reads to the same line will hit in L1; by increasing your stride, you'll be able to see when your bandwidth stops degrading and stays constant. A sketch of this approach follows.
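For illustration, here is a minimal sketch of that amortized, stride-based measurement. The buffer size, stride range, and pass count are arbitrary choices; volatile keeps the reads from being optimized away, and hardware prefetching may blur the transition.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (16 * 1024 * 1024)  /* much larger than L1/L2, so reads miss L1 */
#define PASSES 100

int main(void)
{
    volatile char *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    for (size_t i = 0; i < BUF_BYTES; i++) buf[i] = (char)i;  /* touch all pages */

    /* For strides below the line size, several accesses share one line,
       so the cost per access is low; once the stride reaches the line
       size, every access misses and the per-access time levels off. */
    for (size_t stride = 4; stride <= 512; stride *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < PASSES; p++)
            for (size_t i = 0; i < BUF_BYTES; i += stride)
                (void)buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("stride %4zu: %.3f ns/access\n", stride,
               ns / ((double)PASSES * (BUF_BYTES / stride)));
    }
    free((void *)buf);
    return 0;
}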
I have searched for and tried many approaches to measuring elapsed time; there are many questions on this. For example, this question is very good, but when I needed a really accurate time recorder I couldn't find a good method. So I want to share my method here, to be used and to be corrected if something is wrong.
UPDATE & NOTE: This question is about benchmarking at sub-nanosecond granularity. It is completely different from using clock_gettime(CLOCK_MONOTONIC, &start), which can only record times of a nanosecond or more.
UPDATE: A common way to measure a speedup is to repeat the section of the program being benchmarked. But, as mentioned in a comment, this may trigger different optimizations when the researcher relies on auto-vectorization.
NOTE: Measuring a single run is not accurate enough. In some cases my results show that the section must be repeated more than 1K or even 1M times to capture the smallest time.
SUGGESTION: I'm not familiar with shell programming (I just know some basic commands), but it might be possible to measure the smallest time without repeating inside the program.
MY CURRENT SOLUTION: To avoid branches, I repeat the code section using a macro, #define REP_CODE(X) X X X... X X, where X is the code section I want to benchmark, as follows:
//numbers
#include <stdio.h>
#include <x86intrin.h>   // _rdtsc()

#define MAX1 100         /* problem size; an assumed value, pick to taste */
#define FMAX1 MAX1*MAX1
#define COEFF 8
/* Illustrative definition; the real macro repeats X many more times: */
#define REP_CODE(X) X X X X

int __attribute__(( aligned(32))) input[FMAX1+COEFF];
int __attribute__(( aligned(32))) output[FMAX1];
int __attribute__(( aligned(32))) coeff[COEFF] = {1,2,3,4,5,6,7,8};

long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[16];  /* large enough for the repetitions */
int ii = 0;

int main()
{
    int i, j;
    REP_CODE(
        t1_rdtsc=_rdtsc();
        //Code
        for(i = 0; i < FMAX1; i++){
            for(j = 0; j < COEFF; j++){//IACA_START
                output[i] += coeff[j] * input[i+j];
            }//IACA_END
        }
        t2_rdtsc=_rdtsc();
        ttotal_rdtsc[ii++]=t2_rdtsc-t1_rdtsc;
    )
    // The smallest element in `ttotal_rdtsc` is the answer.
    return 0;
}
This does not perturb the optimization, but it is restricted by code size, and in some cases compilation time becomes excessive.
Any suggestions or corrections?
Thanks in advance.
If you have a problem with the auto-vectorizer and want to work around it, just add an asm("#something"); after your begin_rdtsc; it will separate the do-while loop. I just checked: it then vectorized your posted code, which the auto-vectorizer was previously unable to vectorize.
I changed your macro; you can use it:
#include <stdio.h>
#include <x86intrin.h>

/* do_while (the repetition count) and OVERAL_TIME are assumed to be
   #defined before this point. */
long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[do_while],
          ttbest_rdtsc = 99999999999999999, elapsed, elapsed_rdtsc = do_while,
          overal_time = OVERAL_TIME, ttime = 0;
int ii = 0;

#define begin_rdtsc\
    do{\
        asm("#mmmmmmmmmmm");\
        t1_rdtsc=_rdtsc();

#define end_rdtsc\
        t2_rdtsc=_rdtsc();\
        asm("#mmmmmmmmmmm");\
        ttotal_rdtsc[ii]=t2_rdtsc-t1_rdtsc;\
    }while (++ii<do_while); /* ++ii keeps the last index in bounds */\
    for(ii=0; ii<do_while; ii++){\
        if (ttotal_rdtsc[ii]<ttbest_rdtsc){\
            ttbest_rdtsc = ttotal_rdtsc[ii];}}\
    printf("\nthe best is %lld in %lld iterations\n", ttbest_rdtsc, elapsed_rdtsc);
I have developed this solution from my first answer, but I still want a better one, because measuring time accurately and with the least impact is very important. I put this part in a header file and include it in the main program files.
//Header file header.h
#include <stdio.h>
#include <x86intrin.h>

#define count 1000 // number of repetitions

long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[count],
          ttbest_rdtsc = 99999999999999999, elapsed, elapsed_rdtsc = count, ttime = 0;
int ii = 0;

#define begin_rdtsc\
    do{\
        t1_rdtsc=_rdtsc();

#define end_rdtsc\
        t2_rdtsc=_rdtsc();\
        ttotal_rdtsc[ii]=t2_rdtsc-t1_rdtsc;\
    }while (++ii<count); /* ++ii keeps the last index in bounds */\
    for(ii=0; ii<count; ii++){\
        if (ttotal_rdtsc[ii]<ttbest_rdtsc){\
            ttbest_rdtsc = ttotal_rdtsc[ii];}}\
    printf("\nthe best is %lld in %lldth iteration \n", ttbest_rdtsc, elapsed_rdtsc);
//Main program
#include "header.h"
.
.
.
int main()
{
    //before the section
    begin_rdtsc
    //put your code here to measure the clocks.
    end_rdtsc
    return 0;
}
I recommend using this method on x86 microarchitectures.
NOTE:
NUM_LOOP should be a number large enough to improve the accuracy; repeating your code lets you record the best time.
ttbest_rdtsc must be initialized to something bigger than the worst time; I recommend setting it as large as possible.
I used OVERAL_TIME as an additional stopping rule (you might not want it), because I used this for many kernels where NUM_LOOP was sometimes very big and I didn't want to change it; OVERAL_TIME limits the iterations and stops after a specific time.
UPDATE: The whole program is this:
#include <stdio.h>
#include <x86intrin.h>

#define NUM_LOOP 100 //executes your code NUM_LOOP times and keeps the smallest time, to avoid overheads such as cache misses, etc.

int main()
{
    long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc, ttbest_rdtsc = 99999999999999999;
    int do_while = 0;

    do{
        t1_rdtsc = _rdtsc();
        //put your code here
        t2_rdtsc = _rdtsc();
        ttotal_rdtsc = t2_rdtsc - t1_rdtsc;
        //store the smallest time:
        if (ttotal_rdtsc<ttbest_rdtsc)
            ttbest_rdtsc = ttotal_rdtsc;
    }while (do_while++ < NUM_LOOP);

    printf("\nthe best is %lld in %d repetitions\n", ttbest_rdtsc, NUM_LOOP );
    return 0;
}
I have since changed it into the following and added it to a header for myself, so I can use it simply in my programs.
#include <stdio.h>
#include <x86intrin.h>

/* NUM_LOOP must be #defined before including this header. */
#define do_while NUM_LOOP
#define OVERAL_TIME 999999999

long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc, ttbest_rdtsc = 99999999999999999,
          elapsed, elapsed_rdtsc = do_while, overal_time = OVERAL_TIME, ttime = 0;

#define begin_rdtsc\
    do{\
        t1_rdtsc=_rdtsc();

#define end_rdtsc\
        t2_rdtsc=_rdtsc();\
        ttotal_rdtsc=t2_rdtsc-t1_rdtsc;\
        if (ttotal_rdtsc<ttbest_rdtsc){\
            ttbest_rdtsc = ttotal_rdtsc;\
            elapsed=(do_while-elapsed_rdtsc);}\
        ttime+=ttotal_rdtsc;\
    }while (elapsed_rdtsc-- && (ttime<overal_time));\
    printf("\nthe best is %lld in %lldth iteration and %lld repetitions\n", ttbest_rdtsc, elapsed, (do_while-elapsed_rdtsc));
How to use this method? Well, it is very simple!
int main()
{
    //before the section
    begin_rdtsc
    //put your code here to measure the clocks.
    end_rdtsc
    return 0;
}
Be creative: you can change it to measure speedups in your program, etc.
An example of the output is:
the best is 9600 in 384751th iteration and 569179 repetitions
That is, my tested code took 9600 clock cycles at best; the best time was recorded in the 384751st iteration, and the code was tested 569179 times.
I have tested them on GCC and Clang.
I have used the code below to read the processor's time-stamp counter:
unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
I get some value, say 43, but what is the unit here? Microseconds or nanoseconds?
I used the command below to get the frequency of my board:
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
1700000
I also used the command below to find my processor's speed:
dmidecode -t processor | grep "Speed"
Max Speed: 3700 MHz
Current Speed: 3700 MHz
Now, how do I use the above frequency to convert the counter value into microseconds or milliseconds?
A simple answer to the stated question, "how do I convert the TSC frequency to microseconds or milliseconds?", is: you do not. What the TSC (Time Stamp Counter) clock frequency actually is varies depending on the hardware, and may even vary at runtime on some machines. To measure real time, use clock_gettime(CLOCK_REALTIME) or clock_gettime(CLOCK_MONOTONIC) in Linux.
As Peter Cordes mentioned in a comment (Aug 2018), on most current x86-64 architectures the Time Stamp Counter (accessed by the RDTSC instruction and __rdtsc() function declared in <x86intrin.h>) counts reference clock cycles, not CPU clock cycles. His answer to a similar question in C++ is valid for C also in Linux on x86-64, because the compiler provides the underlying built-in when compiling C or C++, and rest of the answer deals with the hardware details. I recommend reading that one, too.
The rest of this answer assumes the underlying issue is microbenchmarking code, to find out how two implementations of some function compare to each other.
On x86 (Intel 32-bit) and x86-64 (AMD64, Intel and AMD 64-bit) architectures, you can use __rdtsc() from <x86intrin.h> to find out the number of TSC clock cycles elapsed. This can be used to measure and compare the number of cycles used by different implementations of some function, typically a large number of times.
Do note that there are hardware differences in how the TSC clock relates to the CPU clock. The abovementioned more recent answer goes into some detail on that. For practical purposes in Linux, it is sufficient to use cpufreq-set to disable frequency scaling (to ensure the relationship between the CPU and TSC frequencies does not change during microbenchmarking), and optionally taskset to restrict the microbenchmark to specific CPU core(s). That ensures the results gathered in the microbenchmark can be compared to each other.
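For example (a sketch; cpufreq-set ships with the cpufrequtils package, and the core number and governor choice here are arbitrary):

$ sudo cpufreq-set -c 2 -g performance
$ taskset -c 2 ./microbenchmark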
(As Peter Cordes commented, we also want to add _mm_lfence() from <emmintrin.h> (included by <immintrin.h>). This ensures that the CPU does not internally reorder the RDTSC operation compared to the function to be benchmarked. You can use -DNO_LFENCE at compile time to omit those, if you want.)
Let's say you have functions void foo(void); and void bar(void); that you wish to compare:
#include <stdlib.h>
#include <x86intrin.h>
#include <stdio.h>
#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif
static int cmp_ull(const void *aptr, const void *bptr)
{
    const unsigned long long a = *(const unsigned long long *)aptr;
    const unsigned long long b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

unsigned long long *measure_cycles(size_t count, void (*func)())
{
    unsigned long long *elapsed, started, finished;
    size_t i;

    elapsed = malloc((count + 2) * sizeof elapsed[0]);
    if (!elapsed)
        return NULL;

    /* Call func() count times, measuring the TSC cycles for each call. */
    for (i = 0; i < count; i++) {
        /* First, let's ensure our CPU executes everything thus far. */
        lfence();
        /* Start timing. */
        started = __rdtsc();
        /* Ensure timing starts before we call the function. */
        lfence();
        /* Call the function. */
        func();
        /* Ensure everything has been executed thus far. */
        lfence();
        /* Stop timing. */
        finished = __rdtsc();
        /* Ensure we have the counter value before proceeding. */
        lfence();
        elapsed[i] = finished - started;
    }

    /* The very first call is likely the cold-cache case,
       so in case that measurement might contain useful
       information, we put it at the end of the array.
       We also terminate the array with a zero. */
    elapsed[count] = elapsed[0];
    elapsed[count + 1] = 0;

    /* Sort the cycle counts. */
    qsort(elapsed, count, sizeof elapsed[0], cmp_ull);

    /* This function returns all cycle counts, in sorted order,
       although the median, elapsed[count/2], is the one
       I personally use. */
    return elapsed;
}
/* foo() and bar() are assumed to be implemented elsewhere: */
extern void foo(void);
extern void bar(void);

void benchmark(const size_t count)
{
    unsigned long long *foo_cycles, *bar_cycles;

    if (count < 1)
        return;

    printf("Measuring run time in Time Stamp Counter cycles:\n");
    fflush(stdout);

    foo_cycles = measure_cycles(count, foo);
    bar_cycles = measure_cycles(count, bar);

    if (foo_cycles && bar_cycles) {
        printf("foo(): %llu cycles (median of %zu calls)\n", foo_cycles[count/2], count);
        printf("bar(): %llu cycles (median of %zu calls)\n", bar_cycles[count/2], count);
    }

    free(bar_cycles);
    free(foo_cycles);
}
Note that the above results are very specific to the compiler and compiler options used, and of course on the hardware it is run on. The median number of cycles can be interpreted as "the typical number of TSC cycles taken", because the measurement is not completely reliable (may be affected by events outside the process; for example, by context switches, or by migration to another core on some CPUs). For the same reason, I don't trust the minimum, maximum, or average values.
However, the two implementations' (foo() and bar()) cycle counts above can be compared to find out how their performance compares to each other, in a microbenchmark. Just remember that microbenchmark results may not extend to real work tasks, because of how complex tasks' resource use interactions are. One function might be superior in all microbenchmarks, but poorer than others in real world, because it is only efficient when it has lots of CPU cache to use, for example.
In Linux in general, you can use the CLOCK_REALTIME clock to measure real time (wall clock time) used, in the very same manner as above. CLOCK_MONOTONIC is even better, because it is not affected by direct changes to the realtime clock the administrator might make (say, if they noticed the system clock is ahead or behind); only drift adjustments due to NTP etc. are applied. Daylight saving time, and changes thereof, do not affect the measurements, using either clock. Again, the median of a number of measurements is the result I seek, because events outside the measured code itself can affect the result.
For example:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif
static int cmp_double(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

double median_seconds(const size_t count, void (*func)())
{
    struct timespec started, stopped;
    double *seconds, median;
    size_t i;

    seconds = malloc(count * sizeof seconds[0]);
    if (!seconds)
        return -1.0;

    for (i = 0; i < count; i++) {
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &started);
        lfence();
        func();
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &stopped);
        lfence();
        seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
                   + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
    }

    qsort(seconds, count, sizeof seconds[0], cmp_double);

    median = seconds[count / 2];
    free(seconds);

    return median;
}

static double realtime_precision(void)
{
    struct timespec t;

    if (clock_getres(CLOCK_REALTIME, &t) == 0)
        return (double)t.tv_sec
             + (double)t.tv_nsec / 1000000000.0;

    return 0.0;
}
void benchmark(const size_t count)
{
    double median_foo, median_bar;

    if (count < 1)
        return;

    printf("Median wall clock times over %zu calls:\n", count);
    fflush(stdout);

    median_foo = median_seconds(count, foo);
    median_bar = median_seconds(count, bar);

    printf("foo(): %.3f ns\n", median_foo * 1000000000.0);
    printf("bar(): %.3f ns\n", median_bar * 1000000000.0);
    printf("(Measurement unit is approximately %.3f ns)\n", 1000000000.0 * realtime_precision());
    fflush(stdout);
}
In general, I personally prefer to compile the benchmarked function in a separate unit (a separate object file), and to also benchmark a do-nothing function to estimate the function call overhead. Note that this tends to overestimate the overhead, because some of the function call cost is latency rather than time actually consumed, and some operations can proceed during those latencies in the real functions. A sketch of that baseline follows.
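For instance, a minimal sketch of that overhead baseline, reusing the median_seconds() helper above (do_nothing() is a hypothetical name; compiling it in a separate unit keeps the compiler from inlining or eliding the call):

/* nothing.c -- compiled separately, e.g.: gcc -O2 -c nothing.c */
void do_nothing(void) { }

/* In the benchmarking unit: */
extern void do_nothing(void);

void benchmark_overhead(const size_t count)
{
    const double median = median_seconds(count, do_nothing);
    if (median >= 0.0)
        printf("call overhead: ~%.3f ns (median of %zu calls)\n",
               median * 1000000000.0, count);
}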
It is important to remember that the above measurements should only be used as indications, because in a real world application, things like cache locality (especially on current machines, with multi-level caching, and lots of memory) hugely affect the time used by different implementations.
For example, you might compare the speeds of a quicksort and a radix sort. Depending on the size of the keys, the radix sort requires rather large extra arrays (and uses a lot of cache). If the real application the sort routine is used in does not simultaneously use a lot of other memory (and thus the sorted data is basically what is cached), then a radix sort will be faster if there is enough data (and the implementation is sane). However, if the application is multithreaded, and the other threads shuffle (copy or transfer) a lot of memory around, then the radix sort using a lot of cache will evict other data also cached; even though the radix sort function itself does not show any serious slowdown, it may slow down the other threads and therefore the overall program, because the other threads have to wait for their data to be re-cached.
This means that the only "benchmarks" you should trust are wall clock measurements taken on the actual hardware, running actual work tasks with actual work data. Everything else is subject to many conditions and is more or less suspect: an indication, yes, but not very reliable.
I have a C program which, for the following numbers of inputs, executes in the following amounts of time:
23 0.001s
100 0.001s
I tried finding a formula for this but was unsuccessful. As far as I can see, the time sometimes doubles and sometimes it doesn't, which is why I couldn't find a formula.
Any thoughts?
NOTES
1) I am measuring this in CPU time (user+sys).
2) My program uses quicksort.
3) The asymptotic run-time analysis/complexity of my program is O(N log N).
Plotting this, it looks very much like you are hitting a "cache cliff": your data is big enough that it does not all fit in one CPU cache level, so it must be swapped between cheaper and more expensive levels.
The differences at the lower levels are likely due to precision; if you increase the loop count inside the timer block, this might smooth out.
Generally, when hitting a cache cliff, the performance looks like plotting O*Q,
where O is the average runtime (N log(N) in your case)
and Q is a step function that increases as you pass the allocated size for each cache,
so Q = 1 below the L1 size, 10 above the L1 size, 100 above the L2 size, etc. (numbers for example only).
In fact, there are people who verify their cache sizes by plotting an O(1)-per-element function and looking for the memory sizes at which the performance cliffs appear (a rough probe of this kind is sketched after the diagram):

              ________
      _______|
_____|
  L1  |  L2  |  L3
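For illustration, a probe of this kind might look like the following sketch. The constants, size range, and access pattern are arbitrary, and the exact numbers depend heavily on the hardware and compiler; the summing (and printing the sum) just keeps the loop from being optimized away.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stream a fixed number of accesses over working sets of increasing
   size; the ns/access figure jumps when the set outgrows a cache level. */
#define ACCESSES (1 << 26)

int main(void)
{
    for (size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(long);
        long *a = malloc(n * sizeof(long));
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = i;

        long sum = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t p = 0, i = 0; p < ACCESSES; p++) {
            sum += a[i];
            if (++i == n) i = 0;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KiB: %.3f ns/access (sum %ld)\n", kb, ns / ACCESSES, sum);
        free(a);
    }
    return 0;
}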
I always use this to get precise run times:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

clock_t startm, stopm;
#define START if ( (startm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define STOP if ( (stopm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define PRINTTIME printf( "%6.3f seconds used by the processor.", ((double)stopm-startm)/CLOCKS_PER_SEC);

int main() {
    int i, x;
    START;
    scanf("%d", &x);
    for(i=0; i<10000; i++){
        printf("%d\n", i);
    }
    STOP;
    PRINTTIME;
}
My program is going to race different sorting algorithms against each other, both in time and space. I've got space covered, but measuring time is giving me some trouble. Here is the code that runs the sorts:
void test(short* n, short len) {
    short i, j, a[1024];
    for(i=0; i<2; i++) {        // Loop over each sort algo
        memused = 0;            // Initialize memory marker
        for(j=0; j<len; j++)    // Copy scrambled list into fresh array
            a[j] = n[j];        // (Sorting algos are in-place)
        // ***Point A***
        switch(i) {             // Pick sorting algo
        case 0:
            selectionSort(a, len);
            break;              /* break added: without it, case 0 fell through into quicksort */
        case 1:
            quicksort(a, len);
            break;
        }
        // ***Point B***
        spc[i][len] = memused;  // Record how much mem was used
    }
}
(I removed some of the sorting algos for simplicity)
Now, I need to measure how much time the sorting algo takes. The most obvious way to do this is to record the time at point (a) and then subtract that from the time at point (b). But none of the C time functions are good enough:
time() gives me time in seconds, but the algos are faster than that, so I need something more accurate.
clock() gives me CPU ticks since the program started, but seems to round to the nearest 10,000; still not small enough
The time shell command works well enough, except that I need to run over 1,000 tests per algorithm, and I need the individual time for each one.
I have no idea what getrusage() returns, but it's also too long.
What I need is a time unit that is (significantly, if possible) smaller than the run time of the sorting functions, which is about 2 ms. So my question is: where can I get that?
gettimeofday() has microsecond resolution and is easy to use.
A pair of useful timer functions is:
#include <stdio.h>
#include <sys/time.h>

static struct timeval tm1;

static inline void start()
{
    gettimeofday(&tm1, NULL);
}

static inline void stop()
{
    struct timeval tm2;
    gettimeofday(&tm2, NULL);
    unsigned long long t = 1000 * (tm2.tv_sec - tm1.tv_sec) + (tm2.tv_usec - tm1.tv_usec) / 1000;
    printf("%llu ms\n", t);
}
For measuring time, use clock_gettime with CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW if it is available). Where possible, avoid using gettimeofday. It is specifically deprecated in favor of clock_gettime, and the time returned from it is subject to adjustments from time servers, which can throw off your measurements.
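For example, a clock_gettime-based equivalent of the gettimeofday timer pair above might look like this (a sketch, assuming Linux; very old glibc versions require linking with -lrt):

#include <stdio.h>
#include <time.h>

static struct timespec tm1;

static inline void start(void)
{
    clock_gettime(CLOCK_MONOTONIC, &tm1);
}

static inline void stop(void)
{
    struct timespec tm2;
    clock_gettime(CLOCK_MONOTONIC, &tm2);
    unsigned long long t = 1000000000ull * (tm2.tv_sec - tm1.tv_sec)
                         + (tm2.tv_nsec - tm1.tv_nsec);
    printf("%llu ns\n", t);
}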
You can get the total user + kernel time (or choose just one) using getrusage as follows:
#include <sys/time.h>
#include <sys/resource.h>

double get_process_time() {
    struct rusage usage;
    if( 0 == getrusage(RUSAGE_SELF, &usage) ) {
        return (double)(usage.ru_utime.tv_sec + usage.ru_stime.tv_sec) +
               (double)(usage.ru_utime.tv_usec + usage.ru_stime.tv_usec) / 1.0e6;
    }
    return 0;
}
I elected to create a double containing fractional seconds...
double t_begin, t_end;

t_begin = get_process_time();
// Do some operation...
t_end = get_process_time();

printf( "Elapsed time: %.6f seconds\n", t_end - t_begin );
The Time Stamp Counter could be helpful here:
static unsigned long long rdtsctime() {
    unsigned int eax, edx;
    unsigned long long val;

    __asm__ __volatile__("rdtsc":"=a"(eax), "=d"(edx));

    val = edx;
    val = val << 32;
    val += eax;
    return val;
}
Though there are some caveats to this. The timestamps for different processor cores may be different, and changing clock speeds (due to power saving features and the like) can cause erroneous results.
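One mitigation for the per-core discrepancy is to pin the process to a single core before timing, so all readings come from the same TSC. A minimal Linux-specific sketch (the core number passed in is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one CPU core; returns 0 on success. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}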
I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project is dealing with particle systems, but I don't think that should be relevant to parallelizing the code. Is it a caching problem, where the for loop divides the work among the threads in such a way that the particles are not cached efficiently on each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
    s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
    s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
    // printf("%d", omp_get_thread_num());
}
If you're asking whether it's parallelized correctly, it looks fine. I don't see any data races or loop dependencies that could break it.
But I think you're wondering why you aren't getting any speedup from parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small, given the amount of work in the loop.
In other words, you don't have much total work to be worth parallelizing. So threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).
There are no correctness issues, such as data races, in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP-related performance issues in this code. By default, OpenMP splits the loop iterations statically into equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. in just a few iterations of the loop.
Unrelated to OpenMP (and thus to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e., use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
    s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
    s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that, most of the time, all particles are processed in a loop like this one. Working with an individual particle then requires more cache lines to be loaded, but if you process them all in a loop, the net number of cache lines loaded is nearly the same.
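For reference, a data layout matching that loop might look like the following sketch (the type names are made up here; allocation and psize management are left to the caller):

/* struct-of-arrays layout: one contiguous array per field */
typedef struct {
    double *pos;   /* accessed as s->particles.pos[i] */
    double *vel;   /* accessed as s->particles.vel[i] */
} particles_soa;

typedef struct {
    particles_soa particles;
    double force;
} simulation;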
How sure are you that you're not getting speedup?
Trying it both ways (array of structs and struct of arrays), compiled with gcc -O3 (gcc 4.6) on a dual quad-core Nehalem, I get the following for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, due to timer accuracy. Also, you'd need a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct particle_struct {
    double pos;
    double vel;
} particle;

typedef struct simulation_struct {
    particle *particles;
    double force;
} simulation;

void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i)
    {
        s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
        s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
    }
}

void init(simulation *s, unsigned np) {
    s->force = 1.;
    s->particles = malloc(np*sizeof(particle));
    for (unsigned i=0; i<np; i++) {
        s->particles[i].pos = 1.;
        s->particles[i].vel = 1.;
    }
}   /* note: this closing brace was missing in the original */

int main(void)
{
    const unsigned np = 200000;
    simulation s;
    struct timeval clock;

    init(&s, np);
    tick(&clock);
    for (int iter=0; iter<100; iter++)
        update(&s, np, 0.75);
    double elapsed = tock(&clock)*1000.;
    printf("Took time %lf\n", elapsed);
    free(s.particles);
}