Changing an irrelevant part of the function changes the PAPI measurement of branch prediction - C

I am playing with code that I found online, trying different branch-prediction test cases to get a better understanding of branch predictors.
The CPU is an AMD Ryzen 3600.
Basically, in the code below I am trying to measure the misprediction rate of a given function/code segment. In shortened pseudocode, here is what I do:
int measurement(int length, int* arr){
r1 = papi_read();
for(int i = 0; i < length; i++){
if(arr[i]){
do_sth();
}
}
r2 = papi_read();
return r2 - r1;
}
void warmup(){
for(volatile int i = 0; i< 10000; i++){
for(volatile int j = 0; j < 100; j++){} // important line
}
}
int main() {
init_papi();
init_others(); //creates arrays, initialize them, etc.
warmup();
for(int i = 0; i < 20; i++){
results[i] = measurement(128, array);
usleep(1200); // 2nd important line
}
print_mispredictions();
}
My setup
I have isolated the core I am working on so that no other user process runs on it. I have also isolated its SMT sibling, so I am fully in charge of both logical cores, barring the occasional interrupt or kernel routine.
Previously, I observed that if I sleep between iterations (as in the main function), the CPU enters a deeper C-state, so the branch prediction state (the BHT in this case) is reset. This explains the 2nd important line in the code.
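For reference, one way to test this C-state hypothesis is to pin the allowed C-state depth via Linux's PM QoS interface; a minimal sketch, assuming root access (the request holds only while the fd stays open):
// Sketch: writing 0 to /dev/cpu_dma_latency asks the kernel to keep
// cores out of deep C-states for as long as the fd remains open.
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static int forbid_deep_cstates(void) {
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0)
        return -1;                  // typically requires root
    int32_t max_latency_us = 0;     // 0 => shallowest C-state only
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
        close(fd);
        return -1;
    }
    return fd;                      // caller keeps it open; close() undoes the request
}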
What I want to see
Without the sleep line, I see lower and lower misprediction counts in each iteration. That is because the branch predictor is learning the pattern in the array.
With the sleep line, what I want to achieve is similar misprediction numbers at each iteration, since the BPU entries are being reset.
What is the problem
The problem is that when I change the warmup loop from
void warmup(){
for(volatile int i = 0; i< 10000; i++){
for(volatile int j = 0; j < 100; j++){}
}
}
to
void warmup(){
for(volatile int i = 0; i< 1000000; i++){ // notice that this is the
// same total number of iterations
}
}
then the measurements are messed up. I ran into this kind of issue in a previous question of mine, which was never answered: changing a line that is irrelevant to the measurement changes the measurement's behavior.
These are my results:
# With 1 for loop: expected behavior. At every iteration it resets to ~60.
# For 128 if statements, 60 mispredictions is ~50% guessing.
$ ./exp
0:73 #iteration count, misprediction count
1:62
2:63
3:21
4:63
...
# With 2 for loops. Unexpected behavior. It should always reset to ~ 60 but it keeps decreasing.
./exp
0:66
1:18
2:4
3:4
4:1
5:0
6:0
...
Putting an mfence or lfence instruction after the warmup doesn't change the result either.
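For reference, a sketch of how such a fence can be placed after warmup() (an assumption about the exact placement; equivalent to _mm_mfence() from <immintrin.h>):
warmup();
__asm__ __volatile__("mfence" ::: "memory"); // full memory fence; no reordering across this point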
Below is the whole code, in case someone wants to try it and/or has an answer for this behavior.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/sysinfo.h>
#include <time.h>
#include "papi.h"
#define stick_this_thread_to_core(retval, core_id){ \
int num_cores = sysconf(_SC_NPROCESSORS_ONLN); \
if (core_id < 0 || core_id >= num_cores) \
retval = EINVAL; \
cpu_set_t cpuset; \
CPU_ZERO(&cpuset); \
CPU_SET(core_id, &cpuset); \
retval = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);\
}
#define ERROR_RETURN(retval) { \
fprintf(stderr, "Error %d, (%s) %s:line %d: \n", retval,PAPI_strerror(retval), __FILE__,__LINE__); \
exit(retval); \
}
void papi_my_native_add_event(int* EventSet1, char* eventname, int *native){
int retval;
//printf("native add\n");
if((retval = PAPI_event_name_to_code(eventname, native)) != PAPI_OK)
ERROR_RETURN(retval);
//printf("native add to_code is successful\n");
if ((retval = PAPI_add_event(*EventSet1, *native)) != PAPI_OK)
ERROR_RETURN(retval);
//printf("native add add_event is successful\n");
int number = 0;
if((retval = PAPI_list_events(*EventSet1, NULL, &number)) != PAPI_OK)
ERROR_RETURN(retval);
//fprintf(stderr, "Added %d events.\n", number);
}
void papi_my_native_add_start_event(int* EventSet1, char* eventname, int *native){
papi_my_native_add_event(EventSet1, eventname, native);
int retval = 0;
if((retval = PAPI_start(*EventSet1)) != PAPI_OK)
ERROR_RETURN(retval);
//printf("START %s\n", eventname);
}
int RNG_SIZE = 128;
uint64_t* rng_arr;
uint64_t dummy;
// 12th core
int cpuid = 11;
// Code from
// https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/11/05
// acts like a random number generator, but it is deterministic.
static inline uint64_t rng(uint64_t h) {
h ^= h >> 33;
h *= UINT64_C(0xff51afd7ed558ccd);
h ^= h >> 33;
h *= UINT64_C(0xc4ceb9fe1a85ec53);
h ^= h >> 33;
return h;
}
uint64_t measurement(int* EventSet, uint64_t howmany, uint64_t* arr){
long long reads[2] = {0};
PAPI_read(*EventSet, &reads[0]);
for(int j = 0; j < howmany; j++){
if(arr[j]){
dummy &= arr[j];
}
}
PAPI_read(*EventSet, &reads[1]);
return (reads[1] - reads[0]);
}
void precompute_rng(){
int howmany = RNG_SIZE;
for(int i = 0; i < RNG_SIZE; i++){
rng_arr[i] = rng(howmany) &0x1;
howmany--;
}
}
int stick_to_core(){
int retval = 0;
stick_this_thread_to_core(retval, cpuid);
if(retval){
printf("Affinity error: %s\n", strerror(errno));
return 1;
}
return 0;
}
void init_papi(int* EventSet, int cpuid){
int retval = 0;
// papi init
if((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT )
ERROR_RETURN(retval);
PAPI_option_t opts1;
opts1.cpu.cpu_num = cpuid;
if((retval = PAPI_create_eventset(EventSet)) != PAPI_OK)
ERROR_RETURN(retval);
if((retval = PAPI_assign_eventset_component(*EventSet, 0)) != PAPI_OK)
ERROR_RETURN(retval);
opts1.cpu.eventset = *EventSet;
if((retval =PAPI_set_opt(PAPI_CPU_ATTACH, &opts1)) != PAPI_OK)
ERROR_RETURN(retval);
char* eventname = "RETIRED_BRANCH_INSTRUCTIONS_MISPREDICTED";
int native = 0x0;
papi_my_native_add_start_event(EventSet, eventname, &native);
}
void warmup(){
for(volatile int i = 0; i< 100000; i++){
for(volatile int j = 0; j < 100; j++){} // important line
}
}
int main() {
if(stick_to_core()){
printf("Error on sticking to the core\n");
return 1;
}
int EventSet = PAPI_NULL;
int* EventSetPtr = &EventSet;
init_papi(EventSetPtr, cpuid);
rng_arr = (uint64_t*) malloc(RNG_SIZE * sizeof(uint64_t));
precompute_rng();
int iter = 4096;
uint64_t* results = (uint64_t*) malloc(iter * sizeof(uint64_t));
for(int i = 0; i < iter; i++)
results[i] = 0;
warmup();
for(int i = 0; i < 20; i++){
results[i] = measurement(&EventSet, RNG_SIZE, rng_arr);
usleep(1200);
}
// prints
for(int i = 0; i < 20; i++){
printf("%d:%ld\n", i, results[i]);
}
printf("\n");
free(results);
return 0;
}
Compile with
gcc -O0 main.c -lpthread -lpapi -o exp
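For reference, a typical run on the isolated core might look like this (an assumption; the program additionally pins itself to core 11 via pthread_setaffinity_np):
$ taskset -c 11 ./exp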

Related

Performance of multithreaded algorithm to find max number in array

I'm trying to learn about multithreaded algorithms, so I've implemented a simple find-max function for an array. I made a baseline program (findMax1.c) which loads about 263 million ints from a file into memory and then uses a simple for loop to find the max. I then made another program (findMax2.c) which uses 4 threads. I chose 4 threads because the CPU I'm using (an Intel i5 4460) has 4 cores with 1 thread per core, so my guess was that assigning each core a chunk of the array to process would be more efficient, since each core would have fewer cache misses. Each thread finds the max of its chunk, then I join all the threads and take the max over the chunk results. The baseline findMax1.c takes about 660 ms, so my initial thought was that findMax2.c (with 4 threads) would take about 165 ms (660 ms / 4), since I now have 4 threads running in parallel on the same task. But findMax2.c takes about 610 ms, only 50 ms less than findMax1.c.
What am I missing? Is there something wrong with the implementation of the threaded program?
findMax1.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
int main(void)
{
int i, *array, max = 0, position;
size_t array_size_in_bytes = 1024*1024*1024, elements_read, array_size;
FILE *f;
clock_t t;
double time;
array = (int*) malloc(array_size_in_bytes);
assert(array != NULL); // assert fails if the condition is false
printf("Loading array...");
t = clock();
f = fopen("numbers.bin", "rb");
assert(f != NULL);
elements_read = fread(array, array_size_in_bytes, 1, f);
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
assert(elements_read == 1);
printf("done!\n");
printf("File load time: %f [s]\n", time);
fclose(f);
array_size = array_size_in_bytes / sizeof(int);
printf("Finding max...");
t = clock();
for(i = 0; i < array_size; i++)
if(array[i] > max)
{
max = array[i];
position = i;
}
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
return 0;
}
findMax2.c:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#define NUM_THREADS 4
int max_chunk[NUM_THREADS], pos_chunk[NUM_THREADS];
int *array;
pthread_t tid[NUM_THREADS];
void *thread(void *arg)
{
size_t array_size_in_bytes = 1024*1024*1024;
int i, rc, offset, chunk_size, array_size, *core_id = (int*) arg, num_cores = sysconf(_SC_NPROCESSORS_ONLN);
pthread_t id = pthread_self();
cpu_set_t cpuset;
if (*core_id < 0 || *core_id >= num_cores)
return NULL;
CPU_ZERO(&cpuset);
CPU_SET(*core_id, &cpuset);
rc = pthread_setaffinity_np(id, sizeof(cpu_set_t), &cpuset);
if(rc != 0)
{
printf("pthread_setaffinity_np() failed! - rc %d\n", rc);
return NULL;
}
printf("Thread running on CPU %d\n", sched_getcpu());
array_size = (int) (array_size_in_bytes / sizeof(int));
chunk_size = (int) (array_size / NUM_THREADS);
offset = chunk_size * (*core_id);
// Find max number in the array chunk
for(i = offset; i < (offset + chunk_size); i++)
{
if(array[i] > max_chunk[*core_id])
{
max_chunk[*core_id] = array[i];
pos_chunk[*core_id] = i;
}
}
return NULL;
}
void load_array(void)
{
FILE *f;
size_t array_size_in_bytes = 1024*1024*1024, elements_read;
array = (int*) malloc(array_size_in_bytes);
assert(array != NULL); // assert if condition is false
printf("Loading array...");
f = fopen("numbers.bin", "rb");
assert(f != NULL);
elements_read = fread(array, array_size_in_bytes, 1, f);
assert(elements_read == 1);
printf("done!\n");
fclose(f);
}
int main(void)
{
int i, max = 0, position, id[NUM_THREADS], rc;
clock_t t;
double time;
load_array();
printf("Finding max...");
t = clock();
// Create threads
for(i = 0; i < NUM_THREADS; i++)
{
id[i] = i; // use id to pass a distinct pointer to each thread
rc = pthread_create(&(tid[i]), NULL, &thread, (void*)(id + i));
if (rc != 0)
printf("Can't create thread! rc = %d\n", rc);
else
printf("Thread %lu created\n", tid[i]);
}
// Join threads
for(i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
// Find max number from all chunks
for(i = 0; i < NUM_THREADS; i++)
if(max_chunk[i] > max)
{
max = max_chunk[i];
position = pos_chunk[i];
}
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
free(array);
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
pthread_exit(NULL);
return 0;
}
First of all, you're measuring your time wrong.
clock() measures process CPU time, i.e., the time used by all threads combined. The real elapsed time will be a fraction of that. clock_gettime(CLOCK_MONOTONIC, ...) yields better measurements.
Second, your core loops aren't at all comparable.
In the multithreaded program, every loop iteration writes to global variables that sit very close to each other in memory, and that is horrible for cache contention (false sharing between the cores).
You could space that global memory apart (make each array item a cache-aligned struct, _Alignas(64)), which would help the timing; see the sketch after the loop below. But a better and fairer approach is to use local variables (which should live in registers), copying the approach of the single-threaded loop, and then write the chunk result out to memory once at the end of the loop:
int l_max_chunk=0, l_pos_chunk=0, *a;
for(i = 0,a=array+offset; i < chunk_size; i++)
if(a[i] > l_max_chunk) l_max_chunk=a[i], l_pos_chunk=i;
max_chunk[*core_id] = l_max_chunk;
pos_chunk[*core_id] = l_pos_chunk;
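For comparison, the cache-aligned-struct variant mentioned above might look like this (a sketch, not benchmarked here; each slot is padded out to its own 64-byte cache line so the threads never share one):
// Sketch: one aligned result slot per thread; replaces max_chunk[]/pos_chunk[].
typedef struct {
    _Alignas(64) int max;  // 64-byte alignment pads each element to a full cache line
    int pos;
} chunk_result_t;
chunk_result_t chunk_result[NUM_THREADS];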
Here's your modified test program with expected speedups (I'm getting approx. a 2x speedup on my two-core processor).
(I've also taken the liberty of replacing the file load with in-memory initialization, to make it simpler to test.)
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <stdint.h>
struct timespec ts0,ts1;
uint64_t sc_timespec_diff(struct timespec Ts1, struct timespec Ts0) { return (Ts1.tv_sec - Ts0.tv_sec)*1000000000+(Ts1.tv_nsec - Ts0.tv_nsec); }
#define NUM_THREADS 4
int max_chunk[NUM_THREADS], pos_chunk[NUM_THREADS];
int *array;
pthread_t tid[NUM_THREADS];
void *thread(void *arg)
{
size_t array_size_in_bytes = 1024*1024*1024;
int i, rc, offset, chunk_size, array_size, *core_id = (int*) arg, num_cores = sysconf(_SC_NPROCESSORS_ONLN);
#if 1 //shouldn't make much difference
pthread_t id = pthread_self();
cpu_set_t cpuset;
if (*core_id < 0 || *core_id >= num_cores)
return NULL;
CPU_ZERO(&cpuset);
CPU_SET(*core_id, &cpuset);
rc = pthread_setaffinity_np(id, sizeof(cpu_set_t), &cpuset);
if(rc != 0)
{
printf("pthread_setaffinity_np() failed! - rc %d\n", rc);
return NULL;
}
printf("Thread running on CPU %d\n", sched_getcpu());
#endif
array_size = (int) (array_size_in_bytes / sizeof(int));
chunk_size = (int) (array_size / NUM_THREADS);
offset = chunk_size * (*core_id);
// Find max number in the array chunk
#if 0 //horrible for caches
for(i = offset; i < (offset + chunk_size); i++)
{
if(array[i] > max_chunk[*core_id])
{
max_chunk[*core_id] = array[i];
pos_chunk[*core_id] = i;
}
}
#else
int l_max_chunk=0, l_pos_chunk=0, *a;
for(i = 0,a=array+offset; i < chunk_size; i++)
if(a[i] > l_max_chunk) l_max_chunk=a[i], l_pos_chunk=i;
max_chunk[*core_id] = l_max_chunk;
pos_chunk[*core_id] = l_pos_chunk;
#endif
return NULL;
}
void load_array(void)
{
FILE *f;
size_t array_size_in_bytes = 1024*1024*1024, array_size=array_size_in_bytes/sizeof(int);
array = (int*) malloc(array_size_in_bytes);
if(array == NULL) abort(); // abort if the allocation failed
for(size_t i=0; i<array_size; i++) array[i]=i;
}
int main(void)
{
int i, max = 0, position, id[NUM_THREADS], rc;
clock_t t;
double time;
load_array();
printf("Finding max...");
t = clock();
clock_gettime(CLOCK_MONOTONIC,&ts0);
// Create threads
for(i = 0; i < NUM_THREADS; i++)
{
id[i] = i; // use id to pass a distinct pointer to each thread
rc = pthread_create(&(tid[i]), NULL, &thread, (void*)(id + i));
if (rc != 0)
printf("Can't create thread! rc = %d\n", rc);
else
printf("Thread %lu created\n", tid[i]);
}
// Join threads
for(i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
// Find max number from all chunks
for(i = 0; i < NUM_THREADS; i++)
if(max_chunk[i] > max)
{
max = max_chunk[i];
position = pos_chunk[i];
}
clock_gettime(CLOCK_MONOTONIC,&ts1);
printf("Time2 %.6LF\n", sc_timespec_diff(ts1,ts0)/1E9L);
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
free(array);
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
pthread_exit(NULL);
return 0;
}
My timings:
0.188917 s for the single-threaded version
2.511590 s for the original multithreaded version (measured with clock_gettime(CLOCK_MONOTONIC, ...))
0.099802 s with the modified threaded version (measured with clock_gettime(CLOCK_MONOTONIC, ...))
Run on a Linux machine with an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz.

Will there be serious performance degradation when using multiple threads to write to memory concurrently on Linux?

I wrote a multi-threaded program today. Each thread's task is to write data to a large array. A single thread takes about 0.7 s, but with two threads writing independently and concurrently it takes more than 20 s. The same operation takes about 0.7 s under Windows, and likewise about 0.7 s with multiple processes under Linux.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>
#include <sys/types.h>
#define SIZE_IN_MB 256
#define NUM_BYTE (SIZE_IN_MB*1024*1024)
#define NUM_LONG (NUM_BYTE/sizeof(long))
#define CHILD_COUNT 2
#define STEP_SIZE 1 // set to 8 (one long per 64-byte cache line) to defeat caching
unsigned long Time[CHILD_COUNT];
struct Arg {
unsigned long *data;
int index;
};
unsigned long diffTime(struct timeval *end, struct timeval *start) {
return labs((end->tv_sec - start->tv_sec) * 1000 + (end->tv_usec - start->tv_usec) / 1000);
}
void getTime(struct timeval *t) {
gettimeofday(t, NULL);
}
unsigned long writeData() {
struct timeval start, end;
getTime(&start);
unsigned long *data = (unsigned long *) malloc(NUM_LONG * sizeof(long));
for (int i = 0; i < STEP_SIZE; ++i) {
for (size_t k = i; k < NUM_LONG; k+=STEP_SIZE)
data[k] = 0x5a5a5a5a5a5a5a5a + rand();
}
getTime(&end);
free(data);
return diffTime(&end, &start);
}
void *child(void *arg) {
Time[((struct Arg *) arg)->index] = writeData();
return NULL;
}
void waitAll(pthread_t threads[]) {
for (int i = 0; i < CHILD_COUNT; i++) {
pthread_join(threads[i], NULL);
}
}
void printAverTime(int count) {
unsigned long time = 0;
for (int i = 0; i < count; ++i) {
time += Time[i];
}
printf("Thread: %ld\n", time / count);
}
void thread_test() {
pthread_t threads[CHILD_COUNT];
struct Arg arg[CHILD_COUNT] = {};
for (int i = 0; i < CHILD_COUNT; i++) {
arg[i].index = i;
pthread_create(&threads[i], NULL, child, (void *) &arg[i]);
}
waitAll(threads);
printAverTime(CHILD_COUNT);
}
void process_test() {
int p[CHILD_COUNT][2];
for (int i = 0; i < CHILD_COUNT; ++i) {
pipe(p[i]);
}
for (int i = 0; i < CHILD_COUNT; i++) {
if (fork() == 0) {
unsigned long t = writeData();
write(p[i][1], &t, sizeof(t));
exit(0);
}
}
unsigned long t = 0,tmp= 0;
for (int i = 0; i < CHILD_COUNT; ++i) {
read(p[i][0], &tmp, sizeof(tmp));
t += tmp;
}
printf("Process: %ld\n", t / CHILD_COUNT);
}
int main() {
thread_test();
process_test();
}
The penalty you are paying when using multiple threads is not for writing to memory but for the fact that you are calling rand(), which involves locking, many times in the following nested loops in writeData():
for (int i = 0; i < STEP_SIZE; ++i) {
for (size_t k = i; k < NUM_LONG; k+=STEP_SIZE)
data[k] = 0x5a5a5a5a5a5a5a5a + rand();
}
So you are incurring a huge penalty: only one thread at a time can be inside rand(), all the others have to wait, and the waiting itself has overhead.
You can fix your code to avoid the collisions in the inner loop by using a reentrant form of rand(), such as rand_r() (documented at https://man7.org/linux/man-pages/man3/rand.3.html):
unsigned int seed = rand();
for (int i = 0; i < STEP_SIZE; ++i) {
for (size_t k = i; k < NUM_LONG; k+=STEP_SIZE)
data[k] = 0x5a5a5a5a5a5a5a5a + rand_r(&seed);
}
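As a further refinement (a hypothetical one, not part of the original fix), the seed can be derived from the thread index instead, which avoids even the single locked rand() call and guarantees distinct sequences per thread; this assumes writeData() is changed to take the index from struct Arg:
// Sketch: lock-free, distinct seed per thread (assumes writeData(int index)).
unsigned int seed = 0x9e3779b9u * (unsigned int)(index + 1);
for (int i = 0; i < STEP_SIZE; ++i) {
    for (size_t k = i; k < NUM_LONG; k += STEP_SIZE)
        data[k] = 0x5a5a5a5a5a5a5a5a + rand_r(&seed);
}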

Difference in behavior between clang and gcc?

I'm writing a C function to simulate a cache given an address trace. The function works as expected when compiled on my mac using gcc (really clang). gcc --version on my mac returns this:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.1.0 (clang-802.0.42)
When I compile the same program on Linux using gcc, the results are way off, and eC and hC in my program (the cache eviction counter and hit counter) are in the hundreds of thousands when they should be below 10. Typing gcc --version on the Linux machine returns this:
gcc (Ubuntu 4.9.3-8ubuntu2~14.04) 4.9.3
Here is the program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <limits.h>
#include <getopt.h>
#include "cachelab.h"
typedef struct{
int v;
int t;
int LRU;
} block;
typedef struct{
block *blocks;
} set;
typedef struct{
set *sets;
} cache;
void simulate(int s, int E, int b, char* file, int* hC, int* mC, int* eC)
{
int numSets = (1 << s);
char operation;
int address;
int size;
int curTag;
int curSet;
int maxLRU = 0;
int curLRU = 0;
int check = 0;
cache c;
set *sets = malloc(sizeof(set) * numSets);
c.sets = sets;
int i = 0;
while(i < numSets)
{
c.sets[i].blocks = malloc(sizeof(block) * E);
for (int j = 0; j < E; j++)
{
c.sets[i].blocks[j].v = 0;
c.sets[i].blocks[j].t = INT_MIN;
c.sets[i].blocks[j].LRU = 0;
}
i++;
}
FILE *f = fopen(file, "r");
while(fscanf(f," %c %x,%d", &operation, &address, &size) != EOF)
{
check = 0;
curTag = ((unsigned int) address) >> (s+b);
curSet = (address >> b) & ((1 << s) - 1);
for (int i = 0; i < E; i++)
{
c.sets[curSet].blocks[i].LRU++;
if(c.sets[curSet].blocks[i].LRU >= maxLRU)
{
maxLRU = c.sets[curSet].blocks[i].LRU;
curLRU = i;
}
if(curTag == c.sets[curSet].blocks[i].t)
{
*hC = *hC + 1;
if (operation == 'M')
{
*hC = *hC + 1;
}
c.sets[curSet].blocks[i].LRU = 0;
check = 1;
}
}
if(check == 0)
{
for(int i = 0; i < E; i++)
{
if(c.sets[curSet].blocks[i].v == 0)
{
*mC = *mC + 1;
if (operation == 'M')
{
*hC = *hC + 1;
}
c.sets[curSet].blocks[i].v = 1;
c.sets[curSet].blocks[i].LRU = 0;
c.sets[curSet].blocks[i].t = curTag;
check = 1;
break;
}
}
}
if(check == 0)
{
*eC = *eC + 1;
*mC = *mC + 1;
if (operation == 'M')
{
*hC = *hC + 1;
}
c.sets[curSet].blocks[curLRU].t = curTag;
c.sets[curSet].blocks[curLRU].v = 1;
c.sets[curSet].blocks[curLRU].LRU = 0;
}
}
}
int main(int argc, char** argv)
{
int hitCount, missCount, evictionCount;
int s, E, b;
char *file;
char opt;
while((opt = getopt(argc,argv,"v:h:s:E:b:t:")) != -1)
{
switch(opt){
case 'v':
break;
case 'h':
break;
case 's':
s = atoi(optarg);
break;
case 'E':
E = atoi(optarg);
break;
case 'b':
b = atoi(optarg);
break;
case 't':
file = optarg;
break;
default:
exit(1);
}
}
simulate(s, E, b, file, &hitCount, &missCount, &evictionCount);
printSummary(hitCount, missCount, evictionCount);
return 0;
}
EDIT:
I understand that this is due to a difference between clang and gcc. Does anyone have any information about how I can go about fixing this discrepancy?
Here is cachelab.c:
/*
* cachelab.c - Cache Lab helper functions
*/
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cachelab.h"
#include <time.h>
trans_func_t func_list[MAX_TRANS_FUNCS];
int func_counter = 0;
/*
* printSummary - Summarize the cache simulation statistics. Student cache simulators
* must call this function in order to be properly autograded.
*/
void printSummary(int hits, int misses, int evictions)
{
printf("hits:%d misses:%d evictions:%d\n", hits, misses, evictions);
FILE* output_fp = fopen(".csim_results", "w");
assert(output_fp);
fprintf(output_fp, "%d %d %d\n", hits, misses, evictions);
fclose(output_fp);
}
/*
* initMatrix - Initialize the given matrix
*/
void initMatrix(int M, int N, int A[N][M], int B[M][N])
{
int i, j;
srand(time(NULL));
for (i = 0; i < N; i++){
for (j = 0; j < M; j++){
// A[i][j] = i+j; /* The matrix created this way is symmetric */
A[i][j]=rand();
B[j][i]=rand();
}
}
}
void randMatrix(int M, int N, int A[N][M]) {
int i, j;
srand(time(NULL));
for (i = 0; i < N; i++){
for (j = 0; j < M; j++){
// A[i][j] = i+j; /* The matrix created this way is symmetric */
A[i][j]=rand();
}
}
}
/*
* correctTrans - baseline transpose function used to evaluate correctness
*/
void correctTrans(int M, int N, int A[N][M], int B[M][N])
{
int i, j, tmp;
for (i = 0; i < N; i++){
for (j = 0; j < M; j++){
tmp = A[i][j];
B[j][i] = tmp;
}
}
}
/*
* registerTransFunction - Add the given trans function into your list
* of functions to be tested
*/
void registerTransFunction(void (*trans)(int M, int N, int[N][M], int[M][N]),
char* desc)
{
func_list[func_counter].func_ptr = trans;
func_list[func_counter].description = desc;
func_list[func_counter].correct = 0;
func_list[func_counter].num_hits = 0;
func_list[func_counter].num_misses = 0;
func_list[func_counter].num_evictions =0;
func_counter++;
}
You forgot to initialize the counters and the parameter variables, so they start with indeterminate values. The following lines:
int hitCount, missCount, evictionCount;
int s, E, b;
should be:
int hitCount = 0, missCount = 0, evictionCount = 0;
int s = 0, E = 0, b = 0;
It just happens that the garbage initial values are lower on the Mac, so you're not getting guaranteed-correct results there either (the initial values are indeterminate).
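As a side note, the compiler can usually catch this class of bug when warnings are enabled; a sketch (the file names are assumed, and the exact diagnostic wording varies across gcc and clang versions):
gcc -Wall -Wextra -O2 csim.c cachelab.c -o csim
# e.g.: warning: 'hitCount' may be used uninitialized in this function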

Why is my parallel code slower than serial?

Issue
Hello everyone. I have a program (from the net) that I intend to speed up by converting it into a parallel version using pthreads. But surprisingly, it runs slower than the serial version. Below is the program:
# include <stdio.h>
# include <stdbool.h> // for bool when compiled as C; built in for C++
//fast square root algorithm
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//number generator iterated from 0 to n
int main()
{
int n = 1000000; //maximum number
int k = 0, j;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
First attempt for parallelization
I let the pthread manage the for loop
# include <stdio.h>
.
.
int main()
{
.
.
//----->pthread code here<----
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
Well, it runs slower than the serial one
Second attempt
I divided the for loop into two threads and run them in parallel using pthreads
However, it still runs slower. I expected it to run about twice as fast, or at least noticeably faster. But it's not!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct arg_struct
{
int initialPrime;
int nextPrime;
};
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
void *parallel_launcher(void *arguments)
{
struct arg_struct *args = (struct arg_struct *)arguments;
int j = args -> initialPrime;
int n = args -> nextPrime - 1;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1)
{
printf("This is prime: %d\n",j);
pthread_mutex_lock( &mutex1 );
k++;
pthread_mutex_unlock( &mutex1 );
}
if(j == n) printf("Count: %d\n",k);
}
pthread_exit(NULL);
}
int main()
{
int f = 100000000;
int m;
pthread_t thread_id[NTHREADS];
struct arg_struct args;
int rem = (f+1)%NTHREADS;
int n = floor((f+1)/NTHREADS);
for(int h = 0; h < NTHREADS; h++)
{
if(rem > 0)
{
m = n + 1;
rem-= 1;
}
else if(rem == 0)
{
m = n;
}
args.initialPrime = args.nextPrime;
args.nextPrime = args.initialPrime + m;
pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
pthread_join(thread_id[h], NULL);
}
// printf("Count: %d\n",k);
return 0;
}
Note:
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical core, 4 logical),
Mem: 6 Gb,
HDD: 340 Gb,
You need to split the range you are examining for primes up into n parts, where n is the number of threads.
The code that each thread runs becomes:
typedef struct start_end {
int start;
int end;
} start_end_t;
void *find_primes_in_range(void *in) {
start_end_t *start_end = (start_end_t *) in;
int num_primes = 0;
for (int j = start_end->start; j <= start_end->end; j++) {
if (isPrime(j) == 1)
num_primes++;
}
pthread_exit((void *) (intptr_t) num_primes); // cast via intptr_t; needs <stdint.h>
}
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
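A minimal sketch of that main routine, under the assumption of NTHREADS threads covering the range 0..f and the find_primes_in_range above (the int-through-pointer cast mirrors the pthread_exit there):
#include <stdint.h> // intptr_t

start_end_t ranges[NTHREADS];
pthread_t tids[NTHREADS];
int chunk = (f + 1) / NTHREADS, total = 0;
for (int h = 0; h < NTHREADS; h++) {
    ranges[h].start = h * chunk;
    ranges[h].end = (h == NTHREADS - 1) ? f : ranges[h].start + chunk - 1;
    pthread_create(&tids[h], NULL, find_primes_in_range, &ranges[h]);
}
for (int h = 0; h < NTHREADS; h++) {
    void *ret;
    pthread_join(tids[h], &ret);    // join only after all threads have been started
    total += (int)(intptr_t)ret;    // sum private per-thread counts; no mutex needed
}
printf("Count: %d\n", total);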
The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise the threads will spend far more time waiting on and handling that mutex than they spend on the actual calculation. You are essentially forcing the threads to execute serially.
Instead, sum everything up with a private counter variable and once a thread is done with its work, return the counter variable and sum them up in main().
Also, you should not call printf() from inside the threads. If there is a context switch in the middle of a printf call, you'll end up with crappy output such as This is This is prime: 2. In which case you must synchronize the printf calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work that the thread is doing. So some sort of re-design of who does the printing might be a good idea, depending on what you want to do with the results.
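One possible redesign, sketched under the assumption that the primes themselves are wanted: each thread records its primes in a private buffer, and main() prints everything after joining, so the printf() calls need no synchronization.
// Sketch: per-thread result buffer; all printing happens in main() after join.
typedef struct {
    int start, end;  // range to scan
    int *primes;     // primes found (allocated by the thread)
    int count;       // how many were found
} range_result_t;

void *collect_primes(void *in) {
    range_result_t *r = (range_result_t *) in;
    r->primes = malloc((r->end - r->start + 1) * sizeof(int));
    r->count = 0;
    for (int j = r->start; j <= r->end; j++)
        if (isPrime(j))
            r->primes[r->count++] = j; // no locking: the buffer is private
    return NULL;
}
// After pthread_join, main() walks each range_result_t and prints r->primes.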
Summary
Indeed, the use of pthreads sped up my code. My flaws were placing pthread_join right after each pthread_create and sharing a common counter through the thread arguments. After fixing these, I tested my parallel code on the primality of 100 million numbers, then compared its processing time with the serial code. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have much reputation yet; instead, I am including a link.)
I conducted three trials of each to account for the variation caused by other OS activity. We got a speedup from parallelizing with pthreads. What is surprising is that the pthread code running in ONE thread was a bit faster than the purely serial code. I cannot explain that one; nevertheless, pthreads are surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct start_end_f
{
int start;
int end;
};
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn = asmSqrt(n);
for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
int k = 0;
struct start_end_f *start_end_h = (struct start_end_f *)in;
for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
{
if(isPrime(j) == 1) k++;
}
int *t = new int;
*t = k;
pthread_exit(t);
}
int main()
{
int f = 100000000; //maximum number to be tested for prime
pthread_t thread_id[NTHREADS];
struct start_end_f start_end[NTHREADS];
int rem = (f+1)%NTHREADS;
int n = (f+1)/NTHREADS;
int rem_change = rem;
int m;
if(rem>0) m = n+1;
else if(rem == 0) m = n;
//distributes task 'evenly' to the number of parallel threads requested
for(int h = 0; h < NTHREADS; h++)
{
if(rem_change > 0)
{
start_end[h].start = m*h;
start_end[h].end = start_end[h].start+m-1;
rem_change -= 1;
}
else if(rem_change<= 0)
{
start_end[h].start = m*(h+rem_change)-rem_change*n;
start_end[h].end = start_end[h].start+n-1;
rem_change -= 1;
}
pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
}
//retrieving returned values
int *t;
int c = 0;
for(int h = 0; h < NTHREADS; h++)
{
pthread_join(thread_id[h], (void **)&t);
c += *t;
delete t; // free the int allocated by the thread
}
printf("\nNumber of Primes: %d\n",c);
return 0;
}
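A plausible build command for this corrected version, since it uses C++ (new/delete) together with pthreads (the file name is an assumption):
g++ -O2 -pthread primes.cpp -o primes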

How to get the correct order of execution of pthreads

I was implementing a histogram using pthreads, and after a long struggle it finally said:
Segmentation Fault (Core Dumped)
Unfortunately, I had this line
p=(struct1 *)malloc(sizeof(struct1));
after assigning the command-line values to the struct members, so that has since been fixed. Thanks to @DNT for letting me know.
Now when I execute the following program, it sometimes displays the histogram and sometimes exits from the Which_bin function, printing the following.
Output type 1 (not the correct output):
Data = 0.000000 doesn't belong to a bin!
Quitting
Output type 2 (almost the correct histogram output, with the time taken):
10.000-28.000:
28.000-46.000:
46.000-64.000:
64.000-82.000:
82.000-100.000: XXXXXXXXXX
The code to be timed took 0.000415 seconds
My question is why the same program shows different outputs on different runs. I am confused about what exactly is going on.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "timer.h"
void Usage(char prog_name[]);
void Gen_data(void *p);
void Gen_bins(void *p);
int Which_bin(void *p);
void Print_histo(void *p);
void func(void *p);
struct test
{
int bin_count, i, bin;
float min_meas, max_meas;
float* bin_maxes;
int* bin_counts;
int data_count;
float* data;
};
typedef struct test struct1;
int main(int argc, char* argv[])
{
double start, finish, elapsed;
GET_TIME(start);
struct1 *p;
pthread_t th1, th2, th3;
p=(struct1 *)malloc(sizeof(struct1));
if (argc != 5)
Usage(argv[0]);
p->bin_count = strtol(argv[1], NULL, 10);
p->min_meas = strtof(argv[2], NULL);
p->max_meas = strtof(argv[3], NULL);
p->data_count = strtol(argv[4], NULL, 10);
p->bin_maxes = malloc(p->bin_count*sizeof(float));
p->bin_counts = malloc(p->bin_count*sizeof(int));
p->data = malloc(p->data_count*sizeof(float));
pthread_create(&th1,NULL,(void*) Gen_data,(void*) p);
pthread_create(&th2,NULL,(void*) Gen_bins,(void*) p);
pthread_create(&th3,NULL,(void*) func,(void*) p);
printf("Hi\n");
pthread_join(th1,NULL);
pthread_join(th2,NULL);
pthread_join(th3,NULL);
Print_histo(p);
free(p->data);
free(p->bin_maxes);
free(p->bin_counts);
GET_TIME(finish);
elapsed = finish - start;
printf("The code to be timed took %f seconds\n", elapsed);
return 0;
} /* main */
void func(void *p)
{
int i;
struct1 *args;
args=(struct1*)p;
for (i = 0; i < args->data_count; i++)
{
args->bin = Which_bin(args);
args->bin_counts[args->bin]++;
}
# ifdef DEBUG
printf("bin_counts = ");
for (i = 0; i < args->bin_count; i++)
printf("%d ", args->bin_counts[i]);
printf("\n");
# endif
}
/*---------------------------------------------------------------------
* Function: Usage
* Purpose: Print a message showing how to run program and quit
* In arg: prog_name: the name of the program from the command line
*/
void Usage(char prog_name[] /* in */)
{
fprintf(stderr, "usage: %s ", prog_name);
fprintf(stderr, "<bin_count> <min_meas> <max_meas> <data_count>\n");
exit(0);
} /* Usage */
void Gen_data(void *p)
{
struct1 *args;
args=(struct1*)p;
int i;
srandom(0);
for (i = 0; i < args->data_count; i++)
args->data[i] = args->min_meas + (args->max_meas - args->min_meas)*random()/((double) RAND_MAX);
#ifdef DEBUG
printf("data = ");
for (i = 0; i < args->data_count; i++)
printf("%4.3f ", args->data[i]);
printf("\n");
#endif
} /* Gen_data */
void Gen_bins(void* p)
{
struct1 *args;
args=(struct1*)p;
float bin_width;
int i;
bin_width = (args->max_meas - args->min_meas)/args->bin_count;
for (i = 0; i < args->bin_count; i++)
{
args->bin_maxes[i] = args->min_meas + (i+1)*bin_width;
args->bin_counts[i] = 0;
}
# ifdef DEBUG
printf("bin_maxes = ");
for (i = 0; i < args->bin_count; i++)
printf("%4.3f ", args->bin_maxes[i]);
printf("\n");
# endif
}
int Which_bin(void* p)
{
struct1 *args;
args=(struct1*)p;
int bottom = 0, top = args->bin_count-1;
int mid;
float bin_max, bin_min;
while (bottom <= top)
{
mid = (bottom + top)/2;
bin_max = args->bin_maxes[mid];
bin_min = (mid == 0) ? args->min_meas: args->bin_maxes[mid-1];
if (*(args->data) >= bin_max)
bottom = mid+1;
else if (*(args->data) < bin_min)
top = mid-1;
else
return mid;
}
fprintf(stderr, "Data = %f doesn't belong to a bin!\n", args->data);
fprintf(stderr, "Quitting\n");
exit(-1);
}
void Print_histo(void *p)
{
struct1 *args;
args=(struct1*)p;
int i, j;
float bin_max, bin_min;
for (i = 0; i < args->bin_count; i++)
{
bin_max = args->bin_maxes[i];
bin_min = (i == 0) ? args->min_meas: args->bin_maxes[i-1];
printf("%.3f-%.3f:\t", bin_min, bin_max);
for (j = 0; j < args->bin_counts[i]; j++)
printf("X");
printf("\n");
}
}
/* Print_histo */
Here is the second version of the program, with a mutex and trace prints added:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "timer.h"
void Usage(char prog_name[]);
void Gen_data(void *p);
void Gen_bins(void *p);
int Which_bin(void *p);
void Print_histo(void *p);
void func(void *p);
pthread_mutex_t lock;
struct test
{
int bin_count, i, bin;
float min_meas, max_meas;
float* bin_maxes;
int* bin_counts;
int data_count;
float* data;
};
typedef struct test struct1;
int main(int argc, char* argv[])
{
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
double start, finish, elapsed;
GET_TIME(start);
struct1 *p;
pthread_t th1, th2, th3;
p=(struct1 *)malloc(sizeof(struct1));
if (argc != 5)
Usage(argv[0]);
p->bin_count = strtol(argv[1], NULL, 10);
p->min_meas = strtof(argv[2], NULL);
p->max_meas = strtof(argv[3], NULL);
p->data_count = strtol(argv[4], NULL, 10);
p->bin_maxes = malloc(p->bin_count*sizeof(float));
p->bin_counts = malloc(p->bin_count*sizeof(int));
p->data = malloc(p->data_count*sizeof(float));
pthread_create(&th1,NULL,(void*) Gen_data,(void*) p);
pthread_create(&th2,NULL,(void*) Gen_bins,(void*) p);
pthread_create(&th3,NULL,(void*) func,(void*) p);
printf("Hi\n");
pthread_join(th1,NULL);
pthread_join(th2,NULL);
pthread_join(th3,NULL);
Print_histo(p);
free(p->data);
free(p->bin_maxes);
free(p->bin_counts);
GET_TIME(finish);
elapsed = finish - start;
printf("The code to be timed took %f seconds\n", elapsed);
return 0;
} /* main */
void func(void *p)
{
pthread_mutex_lock(&lock);
printf("th3 from Gen_func\n");
int i;
struct1 *args;
args=(struct1*)p;
for (i = 0; i < args->data_count; i++)
{
args->bin = Which_bin(args);
args->bin_counts[args->bin]++;
}
# ifdef DEBUG
printf("bin_counts = ");
for (i = 0; i < args->bin_count; i++)
printf("%d ", args->bin_counts[i]);
printf("\n");
# endif
pthread_mutex_unlock(&lock);
}
/*---------------------------------------------------------------------
* Function: Usage
* Purpose: Print a message showing how to run program and quit
* In arg: prog_name: the name of the program from the command line
*/
void Usage(char prog_name[] /* in */)
{
fprintf(stderr, "usage: %s ", prog_name);
fprintf(stderr, "<bin_count> <min_meas> <max_meas> <data_count>\n");
exit(0);
} /* Usage */
void Gen_data(void *p)
{
pthread_mutex_lock(&lock);
printf("th1 from Gen_data\n");
struct1 *args;
args=(struct1*)p;
int i;
srandom(0);
for (i = 0; i < args->data_count; i++)
args->data[i] = args->min_meas + (args->max_meas - args->min_meas)*random()/((double) RAND_MAX);
#ifdef DEBUG
printf("data = ");
for (i = 0; i < args->data_count; i++)
printf("%4.3f ", args->data[i]);
printf("\n");
#endif
pthread_mutex_unlock(&lock);
} /* Gen_data */
void Gen_bins(void* p)
{
pthread_mutex_lock(&lock);
printf("th2 from Gen_bins\n");
struct1 *args;
args=(struct1*)p;
float bin_width;
int i;
bin_width = (args->max_meas - args->min_meas)/args->bin_count;
for (i = 0; i < args->bin_count; i++)
{
args->bin_maxes[i] = args->min_meas + (i+1)*bin_width;
args->bin_counts[i] = 0;
}
# ifdef DEBUG
printf("bin_maxes = ");
for (i = 0; i < args->bin_count; i++)
printf("%4.3f ", args->bin_maxes[i]);
printf("\n");
# endif
pthread_mutex_unlock(&lock);
}
int Which_bin(void* p)
{
struct1 *args;
args=(struct1*)p;
int bottom = 0, top = args->bin_count-1;
int mid;
float bin_max, bin_min;
while (bottom <= top)
{
mid = (bottom + top)/2;
bin_max = args->bin_maxes[mid];
bin_min = (mid == 0) ? args->min_meas: args->bin_maxes[mid-1];
if (*(args->data) >= bin_max)
bottom = mid+1;
else if (*(args->data) < bin_min)
top = mid-1;
else
return mid;
}
fprintf(stderr, "Data = %f doesn't belong to a bin!\n", args->data);
fprintf(stderr, "Quitting\n");
exit(-1);
}
void Print_histo(void *p)
{
struct1 *args;
args=(struct1*)p;
int i, j;
float bin_max, bin_min;
for (i = 0; i < args->bin_count; i++)
{
bin_max = args->bin_maxes[i];
bin_min = (i == 0) ? args->min_meas: args->bin_maxes[i-1];
printf("%.3f-%.3f:\t", bin_min, bin_max);
for (j = 0; j < args->bin_counts[i]; j++)
printf("X");
printf("\n");
}
}
/* Print_histo */
I added those print lines to see whether all the threads reach their functions, and I observed this:
output 1:
Hi
th1 from Gen_data
th3 from Gen_func
Data = 0.000000 doesn't belong to a bin!
Quitting
In output 1, I can see that th2 never executed and the program ended with the error.
output 2:
th1 from Gen_data
Hi
th2 from Gen_bins
th3 from Gen_func
10.000-28.000:
28.000-46.000:
46.000-64.000:
64.000-82.000:
82.000-100.000: XXXXXXXXXX
The code to be timed took 0.000348 seconds
In output 2, all the threads executed, and the histogram is printed.
I am confused about why thread th2 is sometimes not executed, and how I can make sure all the threads run in the correct order.
Is the program logically wrong? If so, why does it show the histogram output some of the time? Thanks!
Order of thread execution is not guaranteed. On a modern, multi-core processor, the threads may even execute concurrently. There is no guarantee that the Gen_bins thread completes before the func thread. Since your threads access and manipulate the same data structures, the results are unpredictable as you have noticed.
While I don't think threads are necessary for this application, make the following change to ensure the threads execute in the order listed. Change:
pthread_create(&th1,NULL,(void*) Gen_data,(void*) p);
pthread_create(&th2,NULL,(void*) Gen_bins,(void*) p);
pthread_create(&th3,NULL,(void*) func,(void*) p);
pthread_join(th1,NULL);
pthread_join(th2,NULL);
pthread_join(th3,NULL);
to:
pthread_create(&th1,NULL,(void*) Gen_data,(void*) p);
pthread_join(th1,NULL);
pthread_create(&th2,NULL,(void*) Gen_bins,(void*) p);
pthread_join(th2,NULL);
pthread_create(&th3,NULL,(void*) func,(void*) p);
pthread_join(th3,NULL);
This ensures that each thread executes and completes before the next starts. Again, since the threads aren't executing concurrently, threading isn't necessary for this program and just adds complexity.
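Since nothing then runs concurrently, an equivalent and simpler version drops the threads entirely and calls the functions directly (a sketch using the functions from the question):
Gen_data(p);
Gen_bins(p);
func(p);
Print_histo(p);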
