According to the source, rand() in C uses a mutex to lock its internal state (http://sourcecodebrowser.com/uclibc/0.9.27/rand_8c.html). So if I use multiple threads that call it, my program will be slow, because all the threads will contend for that lock.
So I found drand48(), another random number generator function, which does not take any locks (http://sourcecodebrowser.com/uclibc/0.9.27/drand48_8c.html#af9329f9acef07ca14ea2256191c3ce74). But somehow my parallel program is still slower than the serial one! The code is pasted below:
Serial version:
#include <cstdlib>

#define M 100000000

int main()
{
    for (int i = 0; i < M; ++i)
        drand48();
    return 0;
}
Parallel version:
#include <pthread.h>
#include <cstdlib>

#define M 100000000
#define N 4

pthread_t threads[N];

void* f(void* p)
{
    for (int i = 0; i < M/N; ++i)
        drand48();
    return NULL;
}

int main()
{
    for (int i = 0; i < N; ++i)
        pthread_create(&threads[i], NULL, f, NULL);
    for (int i = 0; i < N; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
I executed both codes. The serial one runs in ~0.6 seconds and the parallel in ~2.1 seconds.
Could anyone explain why this happens?
Some additional information: I have 4 cores on my PC. I compile the serial version using
g++ serial.cpp -o serial
and the parallel using
g++ parallel.cpp -lpthread -o parallel
Edit:
Apparently, this performance loss happens whenever I update a global variable in my threads. In the example below, the x variable is global (note that in the parallel example, the operation is not thread-safe):
Serial:
#include <cstdlib>

#define M 1000000000

int x = 0;

int main()
{
    for (int i = 0; i < M; ++i)
        x = x + 10 - 10;
    return 0;
}
Parallel:
#include <pthread.h>
#include <cstdlib>

#define M 1000000000
#define N 4

pthread_t threads[N];
int x;

void* f(void* p)
{
    for (int i = 0; i < M/N; ++i)
        x = x + 10 - 10;
    return NULL;
}

int main()
{
    for (int i = 0; i < N; ++i)
        pthread_create(&threads[i], NULL, f, NULL);
    for (int i = 0; i < N; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
Note that drand48() uses the global struct variable __libc_drand48_data.
drand48() uses the global struct variable __libc_drand48_data; it keeps its state there (writes to it on every call), and that is the source of the cache-line contention, which is very likely the cause of the performance degradation. It isn't false sharing, as I initially suspected and wrote in the comments; it is bona fide sharing. The reason there is no locking in the implementation of drand48() is twofold:
drand48() is not required to be thread-safe: "The drand48(), lrand48(), and mrand48() functions need not be thread-safe."
If two threads happen to access it at the same time and their writes to memory are interleaved, there is no harm done: the data structure is not corrupted, and it is, after all, supposed to return pseudo-random data.
There are some subtle considerations (race conditions) in the use of drand48() when one thread is initializing the state, but they are considered harmless.
Notice below in __drand48_iterate how it stores to three 16-bit words in the global variable. This is where the random generator keeps its state, and this is the source of the cache-line contention between your threads:
xsubi[0] = result & 0xffff;
xsubi[1] = (result >> 16) & 0xffff;
xsubi[2] = (result >> 32) & 0xffff;
Source code
You provided the link to the drand48() source code, which I've included below for reference. The problem is cache-line contention when the state is updated:
#include <stdlib.h>

/* Global state for non-reentrant functions.  Defined in drand48-iter.c.  */
extern struct drand48_data __libc_drand48_data;

double drand48(void)
{
    double result;
    erand48_r (__libc_drand48_data.__x, &__libc_drand48_data, &result);
    return result;
}
And here is the source for erand48_r:
extern int __drand48_iterate(unsigned short xsubi[3], struct drand48_data *buffer);

int erand48_r (xsubi, buffer, result)
    unsigned short int xsubi[3];
    struct drand48_data *buffer;
    double *result;
{
    union ieee754_double temp;

    /* Compute next state.  */
    if (__drand48_iterate (xsubi, buffer) < 0)
        return -1;

    /* Construct a positive double with the 48 random bits distributed over
       its fractional part so the resulting FP number is [0.0,1.0).  */
    temp.ieee.negative = 0;
    temp.ieee.exponent = IEEE754_DOUBLE_BIAS;
    temp.ieee.mantissa0 = (xsubi[2] << 4) | (xsubi[1] >> 12);
    temp.ieee.mantissa1 = ((xsubi[1] & 0xfff) << 20) | (xsubi[0] << 4);

    /* Please note the lower 4 bits of mantissa1 are always 0.  */
    *result = temp.d - 1.0;
    return 0;
}
And the implementation of __drand48_iterate, which is where it writes back to the global:
int
__drand48_iterate (unsigned short int xsubi[3], struct drand48_data *buffer)
{
    uint64_t X;
    uint64_t result;

    /* Initialize buffer, if not yet done.  */
    if (unlikely(!buffer->__init))
    {
        buffer->__a = 0x5deece66dull;
        buffer->__c = 0xb;
        buffer->__init = 1;
    }

    /* Do the real work.  We choose a data type which contains at least
       48 bits.  Because we compute the modulus it does not care how
       many bits really are computed.  */
    X = (uint64_t) xsubi[2] << 32 | (uint32_t) xsubi[1] << 16 | xsubi[0];
    result = X * buffer->__a + buffer->__c;
    xsubi[0] = result & 0xffff;
    xsubi[1] = (result >> 16) & 0xffff;
    xsubi[2] = (result >> 32) & 0xffff;

    return 0;
}
This question already has answers here (closed 2 years ago):
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?
The advantages of using 32bit registers/instructions in x86-64
Why does Clang do this optimization trick only from Sandy Bridge onward?
I was testing some program and I came upon a rather unexpected anomaly.
I wrote a simple program that computed prime numbers, and used pthreads API to parallelize this workload.
After conducting some tests, I found that if I used uint64_t as the datatype for calculations and loops, the program took significantly more time to run than if I used uint32_t.
Here is the code that I ran:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

#define UINT uint64_t
#define SIZE (1024 * 1024)

typedef struct _data
{
    UINT start;
    UINT len;
    int t;
    UINT c;
} data;

int isprime(UINT x)
{
    uint8_t flag = 1;
    if(x < 2)
        return 0;
    for(UINT i = 2; i < x/2; i++)
    {
        if(!(x % i))
        {
            flag = 0;
            break;
        }
    }
    return flag;
}

void* calc(void *p)
{
    data *a = (data*)p;
    //printf("thread no. %d has start: %lu length: %lu\n",a->t,a->start,a->len);
    for(UINT i = a->start; i < a->len; i++)
    {
        if(isprime(i))
            a->c++;
    }
    //printf("thread no. %d found %lu primes\n", a->t,a->c);
    pthread_exit(NULL);
}

int main(int argc,char **argv)
{
    pthread_t *t;
    data *a;
    uint32_t THREAD_COUNT;
    if(argc < 2)
        THREAD_COUNT = 1;
    else
        sscanf(argv[1],"%u",&THREAD_COUNT);
    t = (pthread_t*)malloc(THREAD_COUNT * sizeof(pthread_t));
    a = (data*)malloc(THREAD_COUNT * sizeof(data));
    printf("executing the application on %u thread(s).\n",THREAD_COUNT);
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
    {
        a[i].t = i;
        a[i].start = i * (SIZE / THREAD_COUNT);
        a[i].len = a[i].start + (SIZE / THREAD_COUNT);
        a[i].c = 0;
    }
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_create(&t[i],NULL,calc,(void*)&a[i]);
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_join(t[i],NULL);
    free(a);
    free(t);
    return 0;
}
I switched the UINT macro between uint32_t and uint64_t, compiled and ran the program, and measured its runtime with the time command on Linux.
I found a major difference between the runtimes for uint64_t and uint32_t.
With uint32_t the program took 46 s to run, while with uint64_t it took 2 m 49 s!
I wrote a blog post about it here : https://qcentlabs.com/index.php/2021/02/01/intelx86_64-64-bit-vs-32-bit-arithmetic-big-performance-difference/
You can check out the post if you want more information.
What might be the issue behind this? Is 64-bit arithmetic slower on x86_64 than 32-bit arithmetic?
In general, 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and bandwidth, and the fact that on x86-64 addressing the full 64-bit registers requires longer instructions.
However, you have managed to hit one of the few exceptions to this rule, namely the div instruction used for division.
EDIT TO QUESTION: Is it possible to have thread-safe access to a bit array? My implementation below seems to require mutex locks, which defeats the purpose of parallelizing.
I've been tasked with creating a parallel implementation of a twin prime generator using pthreads. I decided to use the Sieve of Eratosthenes and to divide the work of marking the factors of known primes, staggering which factors each thread gets.
For example, if there are 4 threads:
thread one marks multiples 3, 11, 19, 27...
thread two marks multiples 5, 13, 21, 29...
thread three marks multiples 7, 15, 23, 31...
thread four marks multiples 9, 17, 25, 33...
I skipped the even multiples as well as the even base numbers. I've used a bit array, so I run it up to INT_MAX. The problem I have is that at a max value of 10 million, the result is off by about 5 numbers compared to a known-good file. The discrepancy shrinks with the max value, down to about 10000, where it is off by 1 number. Anything below that is error-free.
At first I didn't think there was a need for communication between the threads. When I saw the results, I added a pthread barrier to let all the threads catch up after each set of multiples. This didn't make any change. Adding a mutex lock around the mark() function did the trick, but that slows everything down.
Here is my code. Hoping someone might see something obvious.
#include <pthread.h>
#include <stdio.h>
#include <sys/times.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#include <limits.h>
#include <getopt.h>

#define WORDSIZE 32

struct t_data{
    int *ba;
    unsigned int val;
    int num_threads;
    int thread_id;
};

pthread_mutex_t mutex_mark;

/* original, unsynchronized version:
void mark( int *ba, unsigned int k )
{
    ba[k/32] |= 1 << (k%32);
}
*/

/* version after the edit, protected by a mutex */
void mark( int *ba, unsigned int k )
{
    pthread_mutex_lock(&mutex_mark);
    ba[k/32] |= 1 << (k%32);
    pthread_mutex_unlock(&mutex_mark);
}

/* helper used below but missing from the post; a plausible definition */
int isMarked( int *ba, unsigned int k )
{
    return (ba[k/32] >> (k%32)) & 1;
}

void initBa(int **ba, unsigned int val)
{
    *ba = calloc((val/WORDSIZE)+1, sizeof(int));
}

void getPrimes(int *ba, unsigned int val)
{
    int i, p;
    p = -1;
    for(i = 3; i <= val; i += 2){
        if(!isMarked(ba, i)){
            if(++p == 8){
                printf(" \n");
                p = 0;
            }
            printf("%9d", i);
        }
    }
    printf("\n");
}

void markTwins(int *ba, unsigned int val)
{
    int i;
    for(i = 3; i <= val; i += 2){
        if(!isMarked(ba, i)){
            if(isMarked(ba, i+2)){
                mark(ba, i);
            }
        }
    }
}

void *setPrimes(void *arg)
{
    int *ba, thread_id, num_threads;
    unsigned int val, i, p, start;
    struct t_data *data = (struct t_data*)arg;
    ba = data->ba;
    thread_id = data->thread_id;
    num_threads = data->num_threads;
    val = data->val;
    start = (2*(thread_id+2))-1; // stagger threads
    for(i = 3; i <= sqrt(val); i += 2){
        if(!isMarked(ba, i)){
            p = start;
            while(i*p <= val){
                mark(ba, (i*p));
                p += (2*num_threads);
            }
        }
    }
    return 0;
}

void usage(char *filename)
{
    printf("Usage: \t%s [option] [arg]\n", filename);
    printf("\t-q generate #'s internally only\n");
    printf("\t-m [size] maximum size twin prime to calculate\n");
    printf("\t-c [threads] number of threads\n");
    printf("Defaults:\n\toutput results\n\tsize = INT_MAX\n\tthreads = 1\n");
}

int main(int argc, char **argv)
{
    int *ba, i, num_threads, opt, output;
    unsigned int val;
    output = 1;
    num_threads = 1;
    val = INT_MAX;
    while ((opt = getopt(argc, argv, "qm:c:")) != -1){
        switch (opt){
        case 'q': output = 0;
            break;
        case 'm': val = atoi(optarg);
            break;
        case 'c': num_threads = atoi(optarg);
            break;
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
        }
    }
    struct t_data data[num_threads];
    pthread_t thread[num_threads];
    pthread_attr_t attr;
    pthread_mutex_init(&mutex_mark, NULL);
    initBa(&ba, val);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for(i = 0; i < num_threads; i++){
        data[i].ba = ba;
        data[i].thread_id = i;
        data[i].num_threads = num_threads;
        data[i].val = val;
        if(0 != pthread_create(&thread[i],
                               &attr,
                               setPrimes,
                               (void*)&data[i])){
            perror("Cannot create thread");
            exit(EXIT_FAILURE);
        }
    }
    for(i = 0; i < num_threads; i++){
        pthread_join(thread[i], NULL);
    }
    markTwins(ba, val);
    if(output)
        getPrimes(ba, val);
    free(ba);
    return 0;
}
EDIT: I got rid of the barrier and added a mutex lock to the mark function. The output is accurate now, but using more than one thread slows it down. Any suggestions on speeding it up?
Your current implementation of mark is correct, but the locking is extremely coarse-grained: there is only one lock for your entire array, which means your threads are constantly contending for it.
One way of improving performance is to make the lock finer-grained: each 'mark' operation only requires exclusive access to a single integer within the array, so you could have a mutex for each array entry:
struct bitarray
{
    int *bits;
    pthread_mutex_t *locks;
};

struct t_data
{
    struct bitarray ba;
    unsigned int val;
    int num_threads;
    int thread_id;
};

void initBa(struct bitarray *ba, unsigned int val)
{
    const size_t array_size = val / WORDSIZE + 1;
    size_t i;
    ba->bits = calloc(array_size, sizeof ba->bits[0]);
    ba->locks = calloc(array_size, sizeof ba->locks[0]);
    for (i = 0; i < array_size; i++)
    {
        pthread_mutex_init(&ba->locks[i], NULL);
    }
}

void mark(struct bitarray ba, unsigned int k)
{
    const unsigned int entry = k / 32;
    pthread_mutex_lock(&ba.locks[entry]);
    ba.bits[entry] |= 1 << (k%32);
    pthread_mutex_unlock(&ba.locks[entry]);
}
Note that your algorithm has a race condition: consider the example where num_threads = 4, so Thread 0 starts at 3, Thread 1 starts at 5 and Thread 2 starts at 7. It is possible for Thread 2 to execute fully, marking every multiple of 7 and then start again at 15, before Thread 0 or Thread 1 gets a chance to mark 15 as a multiple of 3 or 5. Thread 2 will then do useless work, marking every multiple of 15.
Another alternative, if your compiler supports Intel-style atomic builtins, is to use those instead of a lock:
void mark(int *ba, unsigned int k)
{
    __sync_or_and_fetch(&ba[k/32], 1U << k % 32);
}
Your mark() function is not thread-safe: if two threads try to set bits within the same int location, one might overwrite with 0 a bit that was just set by another thread.
This crashes for some reason whenever the limit is 2082192 or more.
Why, and how can I raise this limit? Does the number 2082192 ring any bells?
It seems to be a problem only on my machine - it runs fine on Ideone.com
I'm using MinGW with Code::Blocks and -std=c++11, but playing around with the settings hasn't fixed it.
Any help would be appreciated!
#include <iostream>
#include <math.h>
#include <time.h>

#define limit 2082192

int main()
{
    float time_start = clock();
    bool bucket[limit];
    float root = sqrt(limit);
    for (unsigned int i = 1; i <= limit; ++i)
        bucket[i] = true;
    for (unsigned int i = 2; i <= root; ++i)
        for (unsigned int j = i*2; j <= limit; j+=i)
            bucket[j] = false;
    unsigned int primes = 0;
    for (unsigned int i = 2; i <= limit; ++i)
        if (bucket[i] == true)
            ++primes;
    float time_taken = clock() - time_start;
    std::cout << "Primes found: " << primes << " up to " << limit << " in " << time_taken;
    return 0;
}
The problem is almost certainly the limited stack size.
On Linux you can fix it by typing
$ ulimit -s unlimited
in the terminal where you call this program from.
On Windows you can adjust the stack size with the -Wl,--stack=67108864 link flag, where the number is the desired size in bytes.
I had a short interview question like this: set an integer value to 0xaa55 at address 0x*****9.
The only thing I noticed is that the given address is not aligned on a word boundary, so setting an int *p to that address should not work. Is the point, then, just to assign the value byte-wise through an unsigned char *p? Is that the point of this interview question? And is there any reason to do this in real life?
You need to get back to the interviewer with a number of subsidiary questions:
What is the size in bytes of an int?
Is the machine little-endian or big-endian?
Does the machine handle non-aligned access automatically?
What is the performance penalty for handling non-aligned access automatically?
What is the point of this?
The chances are that someone is thinking of marshalling data the quick and dirty way.
You're right that one basic process is to write the bytes via a char * or unsigned char * that is initialized to the relevant address. The answers to my subsidiary questions 1 and 2 determine the exact mechanism to use, but for a 2-byte int in little-endian format, you might use:
unsigned char *p = 0x*****9; // Copied from question!
unsigned int v = 0xAA55;
*p++ = v & 0xFF;
v >>= 8;
*p = v & 0xFF;
You can generalize to 4-byte or 8-byte integers easily; handling big-endian integers is a bit more fiddly.
I assembled some timing code to see what the relative costs were. Tested on a MacBook Pro (2.3 GHz Intel Core i7, 16 GiB 1333 MHz DDR3 RAM, Mac OS X 10.7.5, home-built GCC 4.7.1), I got the following times for the non-optimized code:
Aligned: 0.238420
Marshalled: 0.931727
Unaligned: 0.243081
Memcopy: 1.047383
Aligned: 0.239070
Marshalled: 0.931718
Unaligned: 0.242505
Memcopy: 1.060336
Aligned: 0.239915
Marshalled: 0.934913
Unaligned: 0.242374
Memcopy: 1.049218
When compiled with optimization, I got segmentation faults, even without -DUSE_UNALIGNED — which puzzles me a bit. Debugging was not easy; there seemed to be a lot of aggressive inline optimization which meant that variables could not be printed by the debugger.
The code is below. The Clock type and the time.h header (and timer.c source) are not shown, but can be provided on request (see my profile). They provide high resolution timing across most platforms (Windows is shakiest).
#include <string.h>
#include <stdio.h>
#include "timer.h"

static int array[100000];
enum { ARRAY_SIZE = sizeof(array) / sizeof(array[0]) };
static int repcount = 1000;

static void uac_aligned(int value)
{
    int *base = array;
    for (int i = 0; i < repcount; i++)
    {
        for (int j = 0; j < ARRAY_SIZE - 2; j++)
            base[j] = value;
    }
}

static void uac_marshalled(int value)
{
    for (int i = 0; i < repcount; i++)
    {
        char *base = (char *)array + 1;
        for (int j = 0; j < ARRAY_SIZE - 2; j++)
        {
            *base++ = value & 0xFF;
            value >>= 8;
            *base++ = value & 0xFF;
            value >>= 8;
            *base++ = value & 0xFF;
            value >>= 8;
            *base = value & 0xFF;
            value >>= 8;
        }
    }
}

#ifdef USE_UNALIGNED
static void uac_unaligned(int value)
{
    int *base = (int *)((char *)array + 1);
    for (int i = 0; i < repcount; i++)
    {
        for (int j = 0; j < ARRAY_SIZE - 2; j++)
            base[j] = value;
    }
}
#endif /* USE_UNALIGNED */

static void uac_memcpy(int value)
{
    for (int i = 0; i < repcount; i++)
    {
        char *base = (char *)array + 1;
        for (int j = 0; j < ARRAY_SIZE - 2; j++)
        {
            memcpy(base, &value, sizeof(int));
            base += sizeof(int);
        }
    }
}

static void time_it(int value, const char *tag, void (*function)(int value))
{
    Clock c;
    char buffer[32];
    clk_init(&c);
    clk_start(&c);
    (*function)(value);
    clk_stop(&c);
    printf("%-12s %12s\n", tag, clk_elapsed_us(&c, buffer, sizeof(buffer)));
}

int main(void)
{
    int value = 0xAA55;
    for (int i = 0; i < 3; i++)
    {
        time_it(value, "Aligned:", uac_aligned);
        time_it(value, "Marshalled:", uac_marshalled);
#ifdef USE_UNALIGNED
        time_it(value, "Unaligned:", uac_unaligned);
#endif /* USE_UNALIGNED */
        time_it(value, "Memcopy:", uac_memcpy);
    }
    return(0);
}
memcpy((void *)0x23456789, &(int){0xaa55}, sizeof(int));
Yes, you may need to deal with unaligned multi-byte values in real life. Imagine your device exchanges data with another device. For example, this data may be a message structure sent over a network or a file structure saved to disk. The format of that data may be predefined and not under your control, and the definition of the data structure may not account for the alignment (or even endianness) restrictions of your device. In these situations you'll need to take care when accessing these unaligned multi-byte values.
So I was trying to make a GPGPU emulator with C and pthreads, but I ran into a rather strange problem and have no idea why it's occurring. The code is below:
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <assert.h>

// simplifies malloc
#define MALLOC(a) (a *)malloc(sizeof(a))

// Index of x/y coordinate
#define x (0)
#define y (1)

// Defines size of a block
#define BLOCK_DIM_X (3)
#define BLOCK_DIM_Y (2)

// Defines size of the grid, i.e., how many blocks
#define GRID_DIM_X (5)
#define GRID_DIM_Y (7)

// Defines the number of threads in the grid
#define GRID_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y * GRID_DIM_X * GRID_DIM_Y)

// execution environment for the kernel
typedef struct exec_env {
    int threadIdx[2]; // thread location
    int blockIdx[2];
    int blockDim[2];
    int gridDim[2];
    float *A,*B; // parameters for the thread
    float *C;
} exec_env;

// kernel
void *kernel(void *arg)
{
    exec_env *env = (exec_env *) arg;
    // compute number of threads in a block
    int sz = env->blockDim[x] * env->blockDim[y];
    // compute the index of the first thread in the block
    int k = sz * (env->blockIdx[y]*env->gridDim[x] + env->blockIdx[x]);
    // compute the index of a thread inside a block
    k = k + env->threadIdx[y]*env->blockDim[x] + env->threadIdx[x];
    // check whether it is in range
    assert(k >= 0 && k < GRID_SIZE && "Wrong index computation");
    // print coordinates in block and grid and computed index
    /*printf("tx:%d ty:%d bx:%d by:%d idx:%d\n",env->threadIdx[x],
           env->threadIdx[y],
           env->blockIdx[x],
           env->blockIdx[y], k);
    */
    // retrieve two operands
    float *A = &env->A[k];
    float *B = &env->B[k];
    printf("%f %f \n",*A, *B);
    // retrieve pointer to result
    float *C = &env->C[k];
    // do actual computation here !!!
    // For assignment replace the following line with
    // the code to do matrix addition and multiplication.
    *C = *A + *B;
    // free execution environment (not needed anymore)
    free(env);
    return NULL;
}

// main function
int main(int argc, char **argv)
{
    float A[GRID_SIZE] = {-1};
    float B[GRID_SIZE] = {-1};
    float C[GRID_SIZE] = {-1};
    pthread_t threads[GRID_SIZE];
    int i=0, bx, by, tx, ty;
    //Error location
    /*for (i = 0; i < GRID_SIZE;i++){
        A[i] = i;
        B[i] = i+1;
        printf("%f %f\n ", A[i], B[i]);
    }*/
    // Step 1: create execution environment for threads and create thread
    for (bx=0;bx<GRID_DIM_X;bx++) {
        for (by=0;by<GRID_DIM_Y;by++) {
            for (tx=0;tx<BLOCK_DIM_X;tx++) {
                for (ty=0;ty<BLOCK_DIM_Y;ty++) {
                    exec_env *e = MALLOC(exec_env);
                    assert(e != NULL && "memory exhausted");
                    e->threadIdx[x]=tx;
                    e->threadIdx[y]=ty;
                    e->blockIdx[x]=bx;
                    e->blockIdx[y]=by;
                    e->blockDim[x]=BLOCK_DIM_X;
                    e->blockDim[y]=BLOCK_DIM_Y;
                    e->gridDim[x]=GRID_DIM_X;
                    e->gridDim[y]=GRID_DIM_Y;
                    // set parameters
                    e->A = A;
                    e->B = B;
                    e->C = C;
                    // create thread
                    pthread_create(&threads[i++],NULL,kernel,(void *)e);
                }
            }
        }
    }
    // Step 2: wait for completion of all threads
    for (i=0;i<GRID_SIZE;i++) {
        pthread_join(threads[i], NULL);
    }
    // Step 3: print result
    for (i=0;i<GRID_SIZE;i++) {
        printf("%f ",C[i]);
    }
    printf("\n");
    return 0;
}
OK, this code runs fine, but as soon as I uncomment the "Error location" loop (the for loop which assigns A[i] = i and B[i] = i + 1), I get hit by a segmentation fault on Unix, and by these random 0s within C on Cygwin. I must admit my fundamentals in C are pretty poor, so it may be highly likely that I missed something. If someone can give an idea of what's going wrong, it'd be greatly appreciated. Thanks.
It works when you comment that loop out because i is still 0 when the 4 nested loops start.
You have this:
for (i = 0; i < GRID_SIZE; i++){
    A[i] = i;
    B[i] = i+1;
    printf("%f %f\n ", A[i], B[i]);
}
/* What value is `i` now ? */
And then
pthread_create(&threads[i++],NULL,kernel,(void *)e);
^
So pthread_create will try to access some interesting indexes indeed.