I'm working through a thread exercise in C. It's a typical thread scheduling program that many schools teach; a basic version can be seen here, and my code is basically the same except for my altered runner function:
http://webhome.csc.uvic.ca/~wkui/Courses/CSC360/pthreadScheduling.c
What I'm doing is altering the runner part so that my code prints an array of random numbers within a certain range, instead of just printing some words. My runner code is here:
void *runner(void *param) {
int i, j, total;
int threadarray[100];
for (i = 0; i < 100; i++)
threadarray[i] = rand() % ((199 + modifier*100) + 1 - (100 + modifier*100)) + (100 + modifier*100);
/* prints array and add to total */
for (j = 0; j < 100; j += 10) {
printf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n", threadarray[j], threadarray[j+1], threadarray[j+2], threadarray[j+3], threadarray[j+4], threadarray[j+5], threadarray[j+6], threadarray[j+7], threadarray[j+8], threadarray[j+9]);
total = total + threadarray[j] + threadarray[j+1] + threadarray[j+2] + threadarray[j+3] + threadarray[j+4] + threadarray[j+5] + threadarray[j+6] + threadarray[j+7] + threadarray[j+8] + threadarray[j+9];
}
printf("Thread %d finished running, total is: %d\n", pthread_self(), total);
pthread_exit(0);
}
My question lies in the first for loop, where I'm assigning random numbers to my array. I want this modifier to change based on which thread it is, but I can't figure out how to do it. For example, if it's the first thread the range will be 100-199, the 2nd will be 200-299, and so on. I have tried assigning i to a variable before calling pthread_create and copying that value into an int in runner to use as the modifier, but since there are 5 concurrent threads, the same number ends up being seen by all 5 threads, and they all have the same modifier.
So I'm looking for an approach that works for each individual thread instead of giving them all the same value. I have tried to change the parameters to something like (void *param, int modifier), but when I do this I have no idea how to reference runner, since by default it's referenced like pthread_create(&tid[i],&attr,runner,NULL);
You want to make param point to a data structure or variable whose lifetime is longer than the thread's lifetime. Then, inside the thread function, you cast the void* parameter back to the actual data type it was allocated as.
Easy example:
struct thread_data
{
int thread_index;
int start;
int end;
};
struct thread_info
{
struct thread_data data;
pthread_t thread;
};
struct thread_info threads[10];
for (int x = 0; x < 10; x++)
{
// Allocate the parameter from the heap so it outlives this loop iteration. A pointer to a stack variable is unsafe if that variable changes or goes out of scope before the thread reads it.
struct thread_data* pData = malloc(sizeof(struct thread_data));
pData->thread_index = x;
pData->start = 100 * (x + 1); // first thread gets 100..199, second 200..299, ...
pData->end = pData->start + 99;
pthread_create(&(threads[x].thread), NULL, runner, pData);
}
Then your runner:
void *runner(void *param)
{
struct thread_data* data = (struct thread_data*)param;
int modifier = data->thread_index;
int i, j, total;
int threadarray[100];
for (i = 0; i < 100; i++)
{
threadarray[i] = ...
}
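To make the idea above concrete, here is a minimal sketch of a complete runner that draws its values from the range it was handed. Several things in it are assumptions rather than part of the original code: the thread count of 5 (taken from the question), the extra total/min_seen/max_seen fields, the run_threads helper, and the small next_random generator, which is used only so that each thread has its own random state instead of sharing rand(). It also keeps the parameters in a caller-owned array rather than calling malloc, which is fine here only because every thread is joined before that array goes away.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 5          /* assumed: the question mentions 5 threads */
#define VALUES_PER_THREAD 100

/* Per-thread parameters; the range [start, end] is inclusive. */
struct thread_data {
    int thread_index;
    int start;
    int end;
    long total;                /* written by the thread, read after join */
    int min_seen;
    int max_seen;
};

/* Hypothetical tiny generator so each thread owns its random state. */
static unsigned int next_random(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

static void *runner(void *param)
{
    struct thread_data *data = param;      /* cast back to the real type */
    unsigned int state = (unsigned int)data->thread_index + 1u;
    int span = data->end - data->start + 1;

    data->total = 0;
    data->min_seen = data->end;
    data->max_seen = data->start;
    for (int i = 0; i < VALUES_PER_THREAD; i++) {
        int value = data->start + (int)(next_random(&state) % (unsigned int)span);
        if (value < data->min_seen) data->min_seen = value;
        if (value > data->max_seen) data->max_seen = value;
        data->total += value;
    }
    return NULL;
}

/* Gives every thread its own struct, runs them all, returns 0 on success. */
int run_threads(struct thread_data args[NUM_THREADS])
{
    pthread_t tids[NUM_THREADS];
    int started = 0;
    int rc = 0;

    assert(args != NULL);
    for (int x = 0; x < NUM_THREADS; x++) {
        args[x].thread_index = x;
        args[x].start = 100 * (x + 1);     /* 100..199, 200..299, ... */
        args[x].end = args[x].start + 99;
        if (pthread_create(&tids[x], NULL, runner, &args[x]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int x = 0; x < started; x++)
        pthread_join(tids[x], NULL);       /* args must stay valid until here */
    return rc;
}
```

The important part is that each thread receives the address of a different struct (&args[x]), so no two threads read the same modifier. Passing the address of a single shared loop variable is what produces the "all threads see the same value" symptom described in the question.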
Related
I've been writing a C program to simulate the motion of n bodies under the influence of gravity. I have a working version that uses a single thread, and I'm attempting to write a version that uses multi-threading with the POSIX pthreads library. Essentially, the program initializes a specified number n of bodies, and stores their randomly selected initial positions as well as masses and radii in an array 'data', made using the pointer-to-pointer method. Data is a pointer in the global scope, and it is allocated the correct amount of memory in the 'populate()' function in main(). Then, I spawn twelve threads (I am using a 6 core processor, so I thought 12 would be a good starting point), and each thread is assigned a set of objects in the simulation. The thread function below calculates the interaction between all objects in 'data' and the object currently being operated on. See the function below:
void* calculate_step(void*index_val) {
int index = * (int *)index_val;
long double x_dist;
long double y_dist;
long double distance;
long double force;
for (int i = 0; i < (rows/nthreads); ++i) { //iterate over every object assigned to this thread
data[i+index][X_FORCE] = 0; //reset all forces to 0
data[i+index][Y_FORCE] = 0;
data[i+index][X_ACCEL] = 0;
data[i+index][X_ACCEL] = 0;
for (int j = 0; j < rows; ++j) { //iterate over every possible pair with this object i
if (i != j && data[j][DELETED] != 1 && data[i+index][DELETED] != 1) { //continue if not comparing an object with itself and if the other object has not been deleted previously.
x_dist = data[j][X_POS] - data[i+index][X_POS];
y_dist = data[j][Y_POS] - data[i+index][X_POS];
distance = sqrtl(powl(x_dist, 2) + powl(y_dist, 2));
if (distance > data[i+index][RAD] + data[j][RAD]) {
force = G * data[i+index][MASS] * data[j][MASS] /
powl(distance, 2); //calculate accel, vel, pos, data for pair of non-colliding objects
data[i+index][X_FORCE] += force * (x_dist / distance);
data[i+index][Y_FORCE] += force * (y_dist / distance);
data[i+index][X_ACCEL] = data[i+index][X_FORCE]/data[i+index][MASS];
data[i+index][X_VEL] += data[i+index][X_ACCEL]*dt;
data[i+index][X_POS] += data[i+index][X_VEL]*dt;
data[i+index][Y_ACCEL] = data[i+index][Y_FORCE]/data[i+index][MASS];
data[i+index][Y_VEL] += data[i+index][Y_ACCEL]*dt;
data[i+index][Y_POS] += data[i+index][Y_VEL]*dt;
}
else{
if (data[i+index][MASS] < data[j][MASS]) {
int temp;
temp = i;
i = j;
j = temp;
} //conserve momentum
data[i+index][X_VEL] = (data[i+index][X_VEL] * data[i+index][MASS] + data[j][X_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
data[i+index][Y_VEL] = (data[i+index][Y_VEL] * data[i+index][MASS] + data[j][Y_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
//conserve center of mass position
data[i+index][X_POS] = (data[i+index][X_POS] * data[i+index][MASS] + data[j][X_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
data[i+index][Y_POS] = (data[i+index][Y_POS] * data[i+index][MASS] + data[j][Y_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
//conserve mass
data[i+index][MASS] += data[j][MASS];
//increase radius proportionally to dM
data[i+index][RAD] = powl(powl(data[i+index][RAD], 3) + powl(data[j][RAD], 3), ((long double) 1 / (long double) 3));
data[j][DELETED] = 1;
data[j][MASS] = 0;
data[j][RAD] = 0;
}
}
}
}
return NULL;
}
This calculates values for velocity, acceleration, etc. and writes them to the array. Each thread does this once for each object assigned to it (i.e. 36 objects means each thread calculates values for 3 objects). The thread then returns and the main loop jumps to the next time step (usually increments of 0.01 seconds), and the process repeats again. If two balls collide, their masses, momenta and centers of mass are added, and the 'DELETED' index in one object's row of the array is set to 1. This object is then ignored in all future iterations. See the main loop below:
int main() {
pthread_t *thread_array; //pointer to future thread array
long *thread_ids;
short num_obj;
short sim_time;
printf("Number of objects to simulate: \n");
scanf("%hd", &num_obj);
num_obj = num_obj - num_obj%12;
printf("Timespan of the simulation: \n");
scanf("%hd", &sim_time);
printf("Length of time steps: \n");
scanf("%f", &dt);
printf("Relative complexity score: %.2f\n", (((float)sim_time/dt)*((float)(num_obj^2)))/1000);
thread_array = malloc(nthreads*sizeof(pthread_t));
thread_ids = malloc(nthreads*sizeof(long));
populate(num_obj);
int index;
for (int i = 0; i < nthreads; ++i) { //initialize all threads
}
time_t start = time(NULL);
print_data();
for (int i = 0; i < (int)((float)sim_time/dt); ++i) { //main loop of simulation
for (int j = 0; j < nthreads; ++j) {
index = j*(rows/nthreads);
thread_ids[j] = j;
pthread_create(&thread_array[j], NULL, calculate_step, &index);
}
for (int j = 0; j < nthreads; ++j) {
pthread_join(thread_array[j], NULL);
//pthread_exit(NULL);
}
}
time_t end = time(NULL) - start;
printf("\n");
print_data();
printf("Took %zu seconds to simulate %d frames with %d objects initially, now %d objects.\n", end, (int)((float)sim_time/dt), num_obj, rows);
}
Every time the program runs, I get the following message:
Number of objects to simulate:
36
Timespan of the simulation:
10
Length of time steps:
0.01
Relative complexity score: 38.00
Process finished with exit code -1073740940 (0xC0000374)
which seems to indicate the heap is getting corrupted. I am guessing this has to do with the data array pointer being a global variable, but that was my workaround for only being allowed to pass one arg to the pthreads function.
I have tried stepping through the program with the debugger, and it seems to work when I run it in debug mode (I am using CLion), but not in regular compile mode. Furthermore, when I debug the program and it outputs the values of the data array for the last simulation 'frame', the first chunk of values, which was supposed to be handled by the first thread that spawns, is unchanged. When I go through it with the debugger, however, I can see that thread being created in the thread generation loop. What are some issues with this code structure, and what could be causing the heap corruption and the first thread doing nothing?
I'm having an issue with my code. Disclaimer btw, I'm new to C. Trying to learn it on my own. Anyways, I'm trying to get the minimum and maximum of an array. I broke the array into 4 parts to make 4 separate arrays and then used those 4 to pass in one of the parameters of each thread. With that being said, I'm only able to get the maximum for each part of the array and not the minimum and I don't understand why.
I think we can simplify your code, avoid all these unnecessary malloc calls, and simplify your algorithm for finding a min/max pair in an array.
Start by having a thread function that takes the following as input: an array (represented by a pointer), an index into the array at which to start searching, and an index at which to stop. Further, this function will need two output parameters: the smallest and largest integers found in that subset of the array.
Start with the parameter declaration. Similar to your MaxMin, but has both input and output parameters:
struct ThreadParameters
{
// input
int* array;
int start;
int end;
// output
int smallest;
int largest;
};
And then a thread function that scans from array[start] all the way up to (but not including) array[end]. And it puts the results of its scan into the smallest and largest member of the above struct:
void* find_min_max(void* args)
{
struct ThreadParameters* params = (struct ThreadParameters*)args;
int *array = params->array;
int start = params->start;
int end = params->end;
int smallest = array[start];
int largest = array[start];
for (int i = start; i < end; i++)
{
if (array[i] < smallest)
{
smallest = array[i];
}
if (array[i] > largest)
{
largest = array[i];
}
}
// write the result back to the parameter structure
params->smallest = smallest;
params->largest = largest;
return NULL;
}
And while we are at it, use capital letters for your macros:
#define THREAD_COUNT 4
Now you can keep with your "4 separate arrays" design. But there's no reason to since the thread function can scan any range of any array. So let's declare a single global array as follows:
#define ARRAY_SIZE 400
int arr[ARRAY_SIZE];
The capital-letter convention is preferred for macros.
fillArray becomes simpler:
void fillArray()
{
for (int i = 0; i < ARRAY_SIZE; i++)
{
arr[i] = rand() % 1000 + 1;
}
}
Now main becomes a whole lot simpler with these techniques:
We'll leverage the stack to allocate our thread parameter structures (no malloc and free).
We'll simply start 4 threads, passing each thread a pointer to a ThreadParameters struct. Since main joins every thread before it returns, the threads won't outlive the parameter array, so this is safe.
After starting the threads, we just wait for each one to finish.
Then we scan the list of thread parameters to get the final smallest and largest.
main becomes much easier to manage:
int main()
{
int smallest;
int largest;
// declare an array of threads and associated parameter instances
pthread_t threads[THREAD_COUNT] = {0};
struct ThreadParameters thread_parameters[THREAD_COUNT] = {0};
// initialize the array
fillArray();
// smallest and largest needs to be set to something
smallest = arr[0];
largest = arr[0];
// start all the threads
for (int i = 0; i < THREAD_COUNT; i++)
{
thread_parameters[i].array = arr;
thread_parameters[i].start = i * (ARRAY_SIZE / THREAD_COUNT);
thread_parameters[i].end = (i+1) * (ARRAY_SIZE / THREAD_COUNT);
thread_parameters[i].largest = 0;
pthread_create(&threads[i], NULL, find_min_max, &thread_parameters[i]);
}
// wait for all the threads to complete
for (int i = 0; i < THREAD_COUNT; i++)
{
pthread_join(threads[i], NULL);
}
// Now aggregate the "smallest" and "largest" results from all thread runs
for (int i = 0; i < THREAD_COUNT; i++)
{
if (thread_parameters[i].smallest < smallest)
{
smallest = thread_parameters[i].smallest;
}
if (thread_parameters[i].largest > largest)
{
largest = thread_parameters[i].largest;
}
}
printf("Smallest is %d\n", smallest);
printf("Largest is %d\n", largest);
}
I'm currently learning about pthreads in C and came across the issue of False Sharing. I think I understand the concept of it and I've tried experimenting a bit.
Below is a short program that I've been playing around with. Eventually I'm going to change it into a program to take a large array of ints and sum it in parallel.
#include <stdio.h>
#include <pthread.h>
#define THREADS 4
#define NUMPAD 14
struct s
{
int total; // 4 bytes
int my_num; // 4 bytes
int pad[NUMPAD]; // 4 * NUMPAD bytes
} sum_array[4];
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
for (int i = 0; i < 10; ++i) {
sum_array[curr_ind].total += sum_array[curr_ind].my_num;
}
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
int main(void) {
int args[THREADS] = { 0, 1, 2, 3 };
pthread_t thread_ids[THREADS];
for (size_t i = 0; i < THREADS; ++i) {
sum_array[i].total = 0;
sum_array[i].my_num = i + 1;
pthread_create(&thread_ids[i], NULL, worker, &args[i]);
}
for (size_t i = 0; i < THREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
}
My question is, is it possible to prevent false sharing without using padding? Here struct s has a size of 64 bytes so that each struct is on its own cache line (assuming that the cache line is 64 bytes). I'm not sure how else I can achieve parallelism without padding.
Also, if I were to sum an array of a varying size between 1000-50,000 bytes, how could I prevent false sharing? Would I be able to pad it out using a similar program? My current thoughts are to put each int from the big array, into an array of struct s and then use parallelism to sum it. However I'm not sure if this is the optimal solution.
Partition the problem: In worker(), sum into a local variable, then add the local variable to the array:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 0;
for (int i = 0; i < 10; ++i) {
localsum += sum_array[curr_ind].my_num;
}
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
This may still have false sharing after the loop, but that is one time per thread. Thread creation overhead is much more significant than a single cache-miss. Of course, you probably want to have a loop that actually does something time-consuming, as your current code can be optimized to:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 10 * sum_array[curr_ind].my_num;
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
The runtime of which is definitely dominated by thread creation and synchronization in printf().
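Applied to the larger goal in the question (summing a big array of ints in parallel), the same local-accumulator idea could look like the sketch below. The names sum_task, sum_worker, parallel_sum and the MAX_THREADS limit are made up for illustration, and falling back to the calling thread when pthread_create fails is just one possible policy. Each thread reads its own contiguous slice of the input, accumulates into a local variable, and writes to shared memory exactly once, so no padding of the input array is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* assumed upper bound for this sketch */

/* One slice of the input plus a slot for that slice's result. */
struct sum_task {
    const int *values;
    size_t begin;        /* first index of the slice */
    size_t end;          /* one past the last index  */
    long long partial;   /* written once by the worker */
};

static void *sum_worker(void *arg)
{
    struct sum_task *task = arg;
    long long local = 0;             /* private accumulator, no sharing */

    for (size_t i = task->begin; i < task->end; ++i)
        local += task->values[i];
    task->partial = local;           /* the only write to shared memory */
    return NULL;
}

/* Sums values[0..count-1] using nthreads worker threads. */
long long parallel_sum(const int *values, size_t count, size_t nthreads)
{
    pthread_t tids[MAX_THREADS];
    struct sum_task tasks[MAX_THREADS];
    int created[MAX_THREADS];
    size_t chunk;
    long long total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    chunk = count / nthreads;

    for (size_t t = 0; t < nthreads; ++t) {
        tasks[t].values = values;
        tasks[t].begin = t * chunk;
        /* the last thread also takes the remainder */
        tasks[t].end = (t + 1 == nthreads) ? count : (t + 1) * chunk;
        tasks[t].partial = 0;
        created[t] = (pthread_create(&tids[t], NULL, sum_worker, &tasks[t]) == 0);
        if (!created[t])
            sum_worker(&tasks[t]);   /* fall back to the calling thread */
    }
    for (size_t t = 0; t < nthreads; ++t) {
        if (created[t])
            pthread_join(tids[t], NULL);
        total += tasks[t].partial;
    }
    return total;
}
```

The partial fields of neighbouring tasks may still sit on the same cache line, but each one is written only once per thread, which is the "one time per thread" cost mentioned above. The input array is only read, and read-only sharing does not cause false sharing.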
I have the following piece of code
#include "stdio.h"
#include "stdlib.h"
#include <string.h>
#define MAXBINS 8
void swap_long(unsigned long int **x, unsigned long int **y){
unsigned long int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void swap(unsigned int **x, unsigned int **y){
unsigned int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void truncated_radix_sort(unsigned long int *morton_codes,
unsigned long int *sorted_morton_codes,
unsigned int *permutation_vector,
unsigned int *index,
int *level_record,
int N,
int population_threshold,
int sft, int lv){
int BinSizes[MAXBINS] = {0};
unsigned int *tmp_ptr;
unsigned long int *tmp_code;
level_record[0] = lv; // record the level of the node
if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the permutation vector
memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes
return;
}
else{
// Find which child each point belongs to
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
// scan prefix
int offset = 0, i = 0;
for(i=0; i<MAXBINS; i++){
int ss = BinSizes[i];
BinSizes[i] = offset;
offset += ss;
}
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
permutation_vector[BinSizes[ii]] = index[j];
sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
BinSizes[ii]++;
}
//swap the index pointers
swap(&index, &permutation_vector);
//swap the code pointers
swap_long(&morton_codes, &sorted_morton_codes);
/* Call the function recursively to split the lower levels */
offset = 0;
for(i=0; i<MAXBINS; i++){
int size = BinSizes[i] - offset;
truncated_radix_sort(&morton_codes[offset],
&sorted_morton_codes[offset],
&permutation_vector[offset],
&index[offset], &level_record[offset],
size,
population_threshold,
sft-3, lv+1);
offset += size;
}
}
}
I tried to make this block
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
parallel by substituting it with the following
int rc,j;
pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
pthread_mutex_init(&bin_mtx, NULL);
for (j = 0; j < NTHREADS; j++){
belongs[j].n = NTHREADS;
belongs[j].N = N;
belongs[j].tid = j;
belongs[j].sft = sft;
belongs[j].BinSizes = BinSizes;
belongs[j].mcodes = morton_codes;
rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
}
for (j = 0; j < NTHREADS; j++){
rc = pthread_join(thread[j], NULL);
}
and defining these outside the recursive function
typedef struct{
int n, N, tid, sft;
int *BinSizes;
unsigned long int *mcodes;
}belong;
pthread_mutex_t bin_mtx;
void * belong_wrapper(void *arg){
int n, N, tid, sft, j;
int *BinSizes;
unsigned int ii;
unsigned long int *mcodes;
n = ((belong *)arg)->n;
N = ((belong *)arg)->N;
tid = ((belong *)arg)->tid;
sft = ((belong *)arg)->sft;
BinSizes = ((belong *)arg)->BinSizes;
mcodes = ((belong *)arg)->mcodes;
for (j = tid; j<N; j+=n){
ii = (mcodes[j] >> sft) & 0x07;
pthread_mutex_lock(&bin_mtx);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx);
}
}
However it takes a lot more time than the serial one to execute... Why is this happening? What should I change?
Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can call BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence but incurring the extra overhead of creating and destroying threads.
There are several options I can think of for you (there are probably more):
Do as @Chris suggests and make each thread update one portion of BinSizes. This might not be viable depending on the properties of the calculation you're using to compute ii.
Create multiple mutexes representing different partitions of BinSizes. For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4, and another for elements 5-9, then use them in your thread something like so:
if (ii < 5) {
mtx_index = 0;
} else {
mtx_index = 1;
}
pthread_mutex_lock(&bin_mtx[mtx_index]);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx[mtx_index]);
You could generalize this idea to any size of BinSizes and any range: potentially you could have a different mutex for each array element. Of course, then you're opening yourself up to the overhead of creating each of these mutexes, and the possibility of deadlock if someone tries to lock several of them at once.
Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned using threads this way is subject to some level of diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".
tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallelizable, and this code has hardly any actual concurrency.
You lock and unlock a mutex for every (cheap) integer operation on BinSizes. This will crush any parallelism, because all your threads are serialized on this lock.
The few instructions you can run concurrently (the for loop and a couple of operations on the morton code array) are much cheaper than (un)locking a mutex: even using an atomic increment (if available) would be more expensive than the un-synchronized part.
One fix would be to give each thread its own output array, and combine them after all tasks are complete.
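A minimal sketch of that fix, under a few assumptions: the names bin_task, count_bins and parallel_bin_count are invented here, NTHREADS is fixed at 4, and each thread gets a contiguous slice of the codes rather than the strided j += n loop from the question. Every thread counts into a private array with no locking at all, and the calling thread adds the private counts together after the joins.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAXBINS 8
#define NTHREADS 4   /* assumed; tune to the number of available cores */

/* Input slice plus a private histogram for one thread. */
struct bin_task {
    const unsigned long *codes;
    int begin;                  /* first index of the slice */
    int end;                    /* one past the last index  */
    int sft;
    int local_bins[MAXBINS];    /* filled once when the thread finishes */
};

static void *count_bins(void *arg)
{
    struct bin_task *task = arg;
    int bins[MAXBINS] = {0};    /* thread-private, so no mutex is needed */

    for (int j = task->begin; j < task->end; j++) {
        unsigned int ii = (task->codes[j] >> task->sft) & 0x07;
        bins[ii]++;
    }
    memcpy(task->local_bins, bins, sizeof bins);
    return NULL;
}

/* Fills BinSizes[0..MAXBINS-1] with the counts for codes[0..N-1]. */
void parallel_bin_count(const unsigned long *codes, int N, int sft,
                        int BinSizes[MAXBINS])
{
    pthread_t tids[NTHREADS];
    struct bin_task tasks[NTHREADS];
    int created[NTHREADS];
    int chunk = N / NTHREADS;

    assert(N >= 0 && sft >= 0);
    for (int t = 0; t < NTHREADS; t++) {
        tasks[t].codes = codes;
        tasks[t].begin = t * chunk;
        /* the last thread also takes the remainder */
        tasks[t].end = (t + 1 == NTHREADS) ? N : (t + 1) * chunk;
        tasks[t].sft = sft;
        created[t] = (pthread_create(&tids[t], NULL, count_bins, &tasks[t]) == 0);
        if (!created[t])
            count_bins(&tasks[t]);   /* fall back to the calling thread */
    }

    memset(BinSizes, 0, MAXBINS * sizeof BinSizes[0]);
    for (int t = 0; t < NTHREADS; t++) {
        if (created[t])
            pthread_join(tids[t], NULL);
        for (int b = 0; b < MAXBINS; b++)
            BinSizes[b] += tasks[t].local_bins[b];   /* combine after the join */
    }
}
```

Combining costs only NTHREADS * MAXBINS additions, which is negligible next to locking a mutex once per element. Whether this beats the serial loop still depends on N: for small N the thread creation cost is likely to dominate.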
Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.
Even if you do this, you need to tune the number of threads according to how many (free) cores you have. If you do this in a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.
Oh, and you're leaking memory: in the snippet shown, the thread and belongs arrays are malloc'ed on every call and never freed.
I'm looking to do a matrix multiply using threads where each thread does a single multiplication and then the main thread will add up all of the results and place them in the appropriate spot in the final matrix (after the other threads have exited).
The way I am trying to do it is to create a single row array that holds the results of each thread. Then I would go through the array and add + place the results in the final matrix.
Ex: If you have the matrices:
A = [{1,4}, {2,5}, {3,6}]
B = [{8,7,6}, {5,4,3}]
Then I want an array holding [8, 20, 7, 16, 6, 12, 16 etc]
I would then loop through the array adding up every 2 numbers and placing them in my final array.
This is a HW assignment so I am not looking for exact code, but some logic on how to store the results in the array properly. I'm struggling with how to keep track of where I am in each matrix so that I don't miss any numbers.
Thanks.
EDIT2: Forgot to mention that there must be a single thread for every single multiplication to be done. Meaning for the example above, there will be 18 threads each doing its own calculation.
EDIT: I'm currently using this code as a base to work off of.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10
int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];
struct v {
int i; /* row */
int j; /* column */
};
void *runner(void *param); /* the thread */
int main(int argc, char *argv[]) {
int i,j, count = 0;
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
//Assign a row and column for each thread
struct v *data = (struct v *) malloc(sizeof(struct v));
data->i = i;
data->j = j;
/* Now create the thread passing it data as a parameter */
pthread_t tid; //Thread ID
pthread_attr_t attr; //Set of thread attributes
//Get the default attributes
pthread_attr_init(&attr);
//Create the thread
pthread_create(&tid,&attr,runner,data);
//Make sure the parent waits for all thread to complete
pthread_join(tid, NULL);
count++;
}
}
//Print out the resulting matrix
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
printf("%d ", C[i][j]);
}
printf("\n");
}
}
//The thread will begin control in this function
void *runner(void *param) {
struct v *data = param; // the structure that holds our data
int n, sum = 0; //the counter and sum
//Row multiplied by column
for(n = 0; n< K; n++){
sum += A[data->i][n] * B[n][data->j];
}
//assign the sum to its coordinate
C[data->i][data->j] = sum;
//Exit the thread
pthread_exit(0);
}
Source: http://macboypro.wordpress.com/2009/05/20/matrix-multiplication-in-c-using-pthreads-on-linux/
You need to store M * K * N element-wise products. The idea is presumably that the threads will all run in parallel, or at least will be able to do so, so each thread needs its own distinct storage location of appropriate type. A straightforward way to do that would be to create an array with that many elements ... but of what element type?
Each thread will need to know not only where to store its result, but also which multiplication to perform. All of that information needs to be conveyed via a single argument of type void *. One would typically, then, create a structure type suitable for holding all the data needed by one thread, create an instance of that structure type for each thread, and pass pointers to those structures. Sounds like you want an array of structures, then.
The details could be worked a variety of ways, but the one that seems most natural to me is to give the structure members for the two factors, and a member in which to store the product. I would then have the main thread declare a 3D array of such structures (if the needed total number is smallish) or else dynamically allocate one. For example,
struct multiplication {
// written by the main thread; read by the compute thread:
int factor1;
int factor2;
// written by the compute thread; read by the main thread:
int product;
} partial_result[M][K][N];
How to write code around that is left as the exercise it is intended to be.
Not sure how many threads you would need to dispatch, and I am also not sure if you would use join later to pick them up. I am guessing you are in C here, so I would use the thread number as a way to track which row to process, something like:
#define NUM_THREADS 64
/*
* struct to pass parameters to a dispatched thread
*/
typedef struct {
int value; /* thread number */
char somechar[128]; /* char data passed to thread */
unsigned long ret;
struct foo *row;
} thread_parm_t;
Where I am guessing that each thread will pick up its row data in the pointer *row which has some defined type foo. A bunch of integers or floats or even complex types. Whatever you need to pass to the thread.
/*
* the thread to actually crunch the row data
*/
void *thr_rowcrunch( void *parm );
pthread_t tid[NUM_THREADS]; /* POSIX array of thread IDs */
Then in your main code segment something like :
thread_parm_t *parm=NULL;
Then dispatch the threads with something like :
for ( i = 0; i < NUM_THREADS; i++) {
parm = malloc(sizeof(thread_parm_t));
parm->value = i;
strcpy(parm->somechar, char_data_to_pass );
fill_in_row ( parm->row, my_row_data );
pthread_create(&tid[i], NULL, thr_rowcrunch, (void *)parm);
}
Then later on :
for ( i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
However the real work needs to be done in thr_rowcrunch( void *parm ) which receives the row data and then each thread just knows its own thread number. The guts of what you do in that dispatched thread however I can only guess at.
Just trying to help here, not sure if this is clear.