Optimizing thread-level parallelism in C

I'm currently writing a contrived producer/consumer program that uses a bounded buffer and a number of worker threads to multiply matrices (for the sake of exposition). While trying to ensure that the concurrency of my program is optimal, I'm observing some unusual behavior: when executing the program I can only reach up to 100% CPU usage (observed in top), despite having 6 cores. Toggling Irix mode with Shift+I changes the upper bound to ~16.7%, and when viewing per-core usage by pressing 1 I can clearly see that either a single core is fully maxed out or the load of one core is spread across all six.
No matter what stress tests I run (I tried the stress program and a simple stress test that creates multiple spinning threads), I cannot get a single process to use more than 100% CPU (or ~16.7% relative to all available cores), so I presume the parallelism is bound to a single core. This behavior occurs on Ubuntu 20.10 running inside VirtualBox on a Mac host with a 2.3 GHz 8-core Intel Core i9 processor. Is there some setting I must enable for multi-core parallelism in VirtualBox, or is this just an idiosyncrasy of the setup?
For reference, here is the simple stress test I used:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *prod_worker(void *arg) {
    while (1) {
        printf("...");
    }
    return NULL; /* never reached */
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s num_threads\n", argv[0]);
        return 1;
    }
    printf("pid: %ld\n", (long)getpid());
    getc(stdin);
    int numw = atoi(argv[1]);
    pthread_t *prod_threads = malloc(sizeof(pthread_t) * numw);
    for (int i = 0; i < numw; i++) {
        pthread_t prod;
        int rcp = pthread_create(&prod, NULL, prod_worker, NULL);
        if (rcp == 0) { /* pthread_create returns 0 on success */
            prod_threads[i] = prod;
        } else {
            printf("Failed to create producer thread #%d...\n", i);
            printf("Error code: prod = %d\n", rcp);
            printf("Retrying...\n");
            i--;
        }
    }
    for (int i = 0; i < numw; i++) {
        pthread_join(prod_threads[i], NULL);
    }
    return 0;
}

printf is thread-safe, which typically means there is a mutex of some kind on the stdout stream, so that only one thread can print at a time. Thus, even though all your threads are "running", at any given time all but one of them is likely waiting to take ownership of the stream, doing nothing useful and using no CPU.
You might instead want to try a test where the threads do some computation rather than just I/O.
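For instance, a minimal compute-bound worker might look like the following sketch, which you could drop in place of prod_worker above (the counter loop is illustrative; any sustained arithmetic will do):

#include <pthread.h>

/* Spins on pure arithmetic, so the thread stays runnable
 * instead of blocking on the stdout lock. */
void *compute_worker(void *arg) {
    volatile unsigned long x = 0; /* volatile keeps the loop from
                                   * being optimized away */
    while (1) {
        x++;
    }
    return NULL; /* never reached */
}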

Related

How to achieve parallelism in C?

This might be a dumb question, and I'm very sorry if that's the case, but I'm struggling to take advantage of the multiple cores in my Quad-Core MacBook to perform multiple computations at the same time. This is not for any particular project, just a general question, since I want to learn for when I eventually do need to do this kind of thing.
I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff though!).
I'm also aware of processes that can be created with fork, but I'm not sure they are guaranteed to use more CPU, or whether they, like threads, just help with IO-bound operations.
Finally, I'm aware of CUDA, which allows parallelism on the GPU (and I think OpenCL and compute shaders also allow my code to run in parallel), but I'm currently looking for something that will let me take advantage of the multiple CPU cores my computer has.
In Python, I'm aware of the multiprocessing module, which provides an API very similar to threads, and there I do seem to gain an edge by running multiple functions performing computations in parallel. I'm looking into how I could get the same advantage in C, but I don't seem to be able to.
Any help pointing me in the right direction would be very much appreciated.
Note: I'm trying to achieve true parallelism, not concurrency.
Note 2: I'm only aware of threads and multiple processes in C; with threads I don't seem to get the performance boost I want, and I'm not very familiar with processes, so I'm still not sure whether running multiple processes is guaranteed to give me the advantage I'm looking for.
A simple program to heat up your CPU (100% utilization of all available cores).
Hint: the thread function never returns; exit the program via Ctrl+C.
#include <pthread.h>

void *func(void *arg)
{
    while (1);
}

int main()
{
#define NUM_THREADS 4 /* use the number of cores (if known) */
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, func, NULL);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
Compilation:
gcc -pthread -o thread_test thread_test.c
If I start ./thread_test, all cores are at 100%.
A word about fork and pthread_create:
fork creates a new process (the current process image is copied and executed in parallel), while pthread_create creates a new thread, sometimes called a lightweight process.
Both processes and threads run 'in parallel' with the parent.
When to use a child process rather than a thread depends on your needs: a child is able to replace its process image (via the exec family) and has its own address space, while threads share the address space of the parent process.
There are of course many more differences; for those I recommend studying the following pages:
man fork
man pthreads
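To illustrate the address-space difference, here is a small sketch (the identifiers are mine, not from the man pages): after fork, the child's writes to a global variable are not visible to the parent, whereas writes from a thread would be.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0; /* one copy per process after fork */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {        /* child: gets its own copy-on-write copy */
        counter = 42;
        printf("child sees counter = %d\n", counter);  /* 42 */
        return 0;
    }
    wait(NULL);            /* parent: wait for the child to finish */
    printf("parent sees counter = %d\n", counter);     /* still 0 */
    return 0;
}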
I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff though!).
No, they don't. Unless your threads block, you'll see all of them running. Just try this (beware that it consumes all your CPU time): it starts 16 threads, each counting in a busy loop for 60 s. You will see all of them running and bringing your cores to their knees (it only runs for a minute this way, then everything ends):
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 16 /* I had 16 cores, so I used this. Put here as many
              * threads as cores you have. */

struct thread_data {
    pthread_t thread_id;      /* the thread id */
    struct timespec end_time; /* time to get out of the tunnel */
    int id;                   /* the array position of the thread */
    unsigned long result;     /* number of times looped */
};

void *thread_body(void *data)
{
    struct thread_data *p = data;
    p->result = 0UL;
    clock_gettime(CLOCK_REALTIME, &p->end_time);
    p->end_time.tv_sec += 60; /* 60 s. */
    struct timespec now;
    do {
        /* just get the time */
        clock_gettime(CLOCK_REALTIME, &now);
        p->result++;
        /* if you call printf() you will see the threads slow down,
         * as there's a common buffer that forces all threads to
         * serialize their outputs */
        /* check if we are over */
    } while (now.tv_sec < p->end_time.tv_sec
             || (now.tv_sec == p->end_time.tv_sec
                 && now.tv_nsec < p->end_time.tv_nsec));
    return p;
} /* thread_body */

int main()
{
    struct thread_data thrd_info[N];
    for (int i = 0; i < N; i++) {
        struct thread_data *d = &thrd_info[i];
        d->id = i;
        d->result = 0;
        printf("Starting thread %d\n", d->id);
        int res = pthread_create(&d->thread_id,
                                 NULL, thread_body, d);
        if (res != 0) { /* pthread_create returns an error number, not -1 */
            fprintf(stderr, "pthread_create: %s\n", strerror(res));
            exit(EXIT_FAILURE);
        }
        printf("Thread %d started\n", d->id);
    }
    printf("All threads created, waiting for all to finish\n");
    for (int i = 0; i < N; i++) {
        struct thread_data *joined;
        int res = pthread_join(thrd_info[i].thread_id,
                               (void **)&joined);
        if (res != 0) { /* same error convention as pthread_create */
            fprintf(stderr, "pthread_join: %s\n", strerror(res));
            exit(EXIT_FAILURE);
        }
        printf("PTHREAD %d ended, with value %lu\n",
               joined->id, joined->result);
    }
} /* main */
Linux and all multithreaded systems work the same way: they create a new execution unit (if the two units don't share a virtual address space, they are both processes --not exactly so, but this captures the main difference between a process and a thread--) and the available processors are given to each thread as necessary. Threads are normally encapsulated inside processes (they share the process id --though historically not on Linux, if that has not changed recently-- and virtual memory). Processes each run in a separate virtual address space, so they can only share things through system resources (files, shared memory, communication sockets/pipes, etc.).
The problem with your test case (you don't show it, so I have to guess) is that you probably run all threads in a loop in which each tries to print something. If you do that, each thread probably spends most of its time blocked trying to do I/O (to printf() something).
Stdio FILEs have the problem that they share a buffer between all threads that print to the same FILE, and the kernel serializes all write(2) system calls to the same file descriptor. So if most of the time in the loop is spent blocked in a write, the kernel (and stdio) ends up serializing all the print calls, making it appear that only one thread is working at a time (all the threads become blocked by the one doing the I/O). The busy loop above, by contrast, makes all the threads run in parallel and shows you the CPU being saturated.
Parallelism in C can also be achieved with the fork() function, which creates a new process. The child is a copy of the parent and runs concurrently with it; on a multi-core machine the two processes can genuinely run in parallel. Unlike threads, the processes do not share memory by default, so data must be exchanged through pipes, shared memory, or files.
To synchronize with a child, the parent calls wait() (or waitpid()), which blocks until a child process terminates; only then should the parent read any result the child has handed back.
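A minimal sketch of that pattern, assuming the child reports its result through a pipe (the pipe is my addition, since fork'd processes don't share ordinary variables):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    pipe(fd);                  /* fd[0] = read end, fd[1] = write end */
    pid_t pid = fork();
    if (pid == 0) {            /* child: do some work in parallel */
        long result = 21 * 2;  /* stand-in for a real computation */
        write(fd[1], &result, sizeof result);
        _exit(0);
    }
    wait(NULL);                /* parent: block until the child exits */
    long result;
    read(fd[0], &result, sizeof result);
    printf("child computed %ld\n", result);
    return 0;
}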

How can multithreading give a speedup factor that is larger than the number of cores?

I am using pthreads with gcc. The simple code example takes the number of threads "N" as user-supplied input. It splits a long array into N roughly equally sized subblocks, and each subblock is written by an individual thread.
The dummy processing for this example just sleeps for a fixed amount of time at each array index and then writes a number into that array location.
Here's the code:
/******************************************************************************
* FILE: threaded_subblocks_processing
* DESCRIPTION:
* We have a bunch of parallel processing to do and store the results in a
* large array. Let's try to use threads to speed it up.
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#define BIG_ARR_LEN 10000
typedef struct thread_data{
int start_idx;
int end_idx;
int id;
} thread_data_t;
int big_result_array[BIG_ARR_LEN] = {0};
void* process_sub_block(void *td)
{
struct thread_data *current_thread_data = (struct thread_data*)td;
printf("[%d] Hello World! It's me, thread #%d!\n", current_thread_data->id, current_thread_data->id);
printf("[%d] I'm supposed to work on indexes %d through %d.\n", current_thread_data->id,
current_thread_data->start_idx,
current_thread_data->end_idx-1);
for(int i=current_thread_data->start_idx; i<current_thread_data->end_idx; i++)
{
int retval = usleep(1000.0*1000.0*10.0/BIG_ARR_LEN);
if(retval)
{
printf("sleep failed");
}
big_result_array[i] = i;
}
printf("[%d] Thread #%d done, over and out!\n", current_thread_data->id, current_thread_data->id);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
if (argc!=2)
{
printf("usage: ./a.out number_of_threads\n");
return(1);
}
int NUM_THREADS = atoi(argv[1]);
if (NUM_THREADS<1)
{
printf("usage: ./a.out number_of_threads (where number_of_threads is at least 1)\n");
return(1);
}
pthread_t *threads = malloc(sizeof(pthread_t)*NUM_THREADS);
thread_data_t *thread_data_array = malloc(sizeof(thread_data_t)*NUM_THREADS);
int block_size = BIG_ARR_LEN/NUM_THREADS;
for(int i=0; i<NUM_THREADS-1; i++)
{
thread_data_array[i].start_idx = i*block_size;
thread_data_array[i].end_idx = (i+1)*block_size;
thread_data_array[i].id = i;
}
thread_data_array[NUM_THREADS-1].start_idx = (NUM_THREADS-1)*block_size;
thread_data_array[NUM_THREADS-1].end_idx = BIG_ARR_LEN;
thread_data_array[NUM_THREADS-1].id = NUM_THREADS-1;
int ret_code;
long t;
for(t=0;t<NUM_THREADS;t++){
printf("[main] Creating thread %ld\n", t);
ret_code = pthread_create(&threads[t], NULL, process_sub_block, (void *)&thread_data_array[t]);
if (ret_code){
printf("[main] ERROR; return code from pthread_create() is %d\n", ret_code);
exit(-1);
}
}
printf("[main] Joining threads to wait for them.\n");
void* status;
for(int i=0; i<NUM_THREADS; i++)
{
pthread_join(threads[i], &status);
}
pthread_exit(NULL);
}
and I compile it with
gcc -pthread threaded_subblock_processing.c
and then I call it from command line like so:
$ time ./a.out 4
I see a speed up when I increase the number of threads. With 1 thread the process takes just a little over 10 seconds. This makes sense because I sleep for 1000 usec per array element, and there are 10,000 array elements. Next when I go to 2 threads, it goes down to a little over 5 seconds, and so on.
What I don't understand is that I get a speed-up even after my number of threads exceeds the number of cores on my computer! I have 4 cores, so I was expecting no speed-up for >4 threads. But, surprisingly, when I run
$ time ./a.out 100
I get a 100x speedup and the processing completes in ~0.1 seconds! How is this possible?
Some general background
A program's progress can be slowed by many things, but, in general, slow spots (otherwise known as hot spots) fall into two categories:
CPU bound: the processor is doing some heavy number crunching (like trigonometric functions). If all the CPU's cores are engaged in such tasks, other processes must wait.
Memory bound: the processor is waiting for information to be retrieved from the hard disk or RAM. Since these are typically orders of magnitude slower than the processor, from the CPU's perspective this takes forever.
But you can also imagine other situations in which a process must wait, such as for a network response.
In many of these memory-/network-bound situations, it is possible to put a thread "on hold" while the memory crawls towards the CPU and do other useful work in the meantime. If this is done well then a multi-threaded program can well out-perform its single-threaded equivalent. Node.js makes use of such asynchronous programming techniques to achieve good performance.
Your question
Now, getting back to your question: you have multiple threads going, but they are performing neither CPU-intensive nor memory-intensive work: there's not much there to take up time. In fact, the sleep function essentially tells the operating system that no work is being done. Sleeping does not occupy a core, so any number of threads can sleep simultaneously; with 100 threads, the 10 seconds of total sleep is served out concurrently in ~0.1 seconds of wall time. So, naturally, the apparent performance increases dramatically.
Note that for low-latency applications, such as MPI, busy waiting is sometimes used instead of a sleep function. In this case, the program goes into a tight loop and repeatedly checks a condition. Externally the effect looks similar, but sleep uses no CPU while the busy wait uses ~100% of a core.
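As a rough sketch of the two waiting styles (the atomic flag is my illustrative addition, not from the question's code):

#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

atomic_bool ready; /* flipped to true by some other thread */

/* Sleeping wait: the thread is descheduled and uses ~0% CPU,
 * but wakes with millisecond-scale latency. */
void wait_sleeping(void)
{
    struct timespec ts = { 0, 1000000 }; /* 1 ms */
    while (!atomic_load(&ready))
        nanosleep(&ts, NULL);
}

/* Busy wait: keeps a core at ~100%, but reacts within
 * nanoseconds of the flag flipping. */
void wait_spinning(void)
{
    while (!atomic_load(&ready))
        ; /* spin */
}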

How to program so that different processes run on different CPU cores?

I'm writing a linux C program with 2 processes. I will run the program on different machines.
These machine may have multiple CPU cores.
When I run the program, will the system allocate different CPU core for different processes? or I need to write some codes so as to fully utilize the CPU cores?
If you wish to pin threads/processes to specific CPUs, you have to use the sched_setaffinity(2) system call or the pthread_setaffinity_np(3) library call. Each core in Linux has its own virtual CPU ID.
These calls allow you to set the allowed CPU mask.
Otherwise it is up to the discretion of the scheduler to run your threads wherever it feels like running them.
Neither will guarantee that your processes run in parallel, though. That is something only the scheduler can decide, unless you run with real-time priority.
Here is some sample code:
#define _GNU_SOURCE /* needed for the CPU_* macros and pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int run_on_cpu(int cpu) {
    cpu_set_t allcpus;
    CPU_ZERO(&allcpus);
    sched_getaffinity(0, sizeof(cpu_set_t), &allcpus);
    int num_cpus = CPU_COUNT(&allcpus);
    fprintf(stderr, "%d cpus available for scheduling\nAvailable CPUs: ", num_cpus);
    size_t i;
    for (i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &allcpus))
            fprintf(stderr, "%zu, ", i);
    }
    fprintf(stderr, "\n");
    if (CPU_ISSET(cpu, &allcpus)) {
        cpu_set_t cpu_set;
        CPU_ZERO(&cpu_set);
        CPU_SET(cpu, &cpu_set);
        /* pin the calling thread to the single CPU requested */
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu_set);
    }
    return -1;
}

Test program for processor load generates only 3 threads but we need more

I wrote a simple test program to produce some processor load. It spawns 6 threads, and each thread calculates pi. But only 3 threads appear on the target platform (ARM); the same program on a normal Linux PC spawns all 6 threads.
What is the problem?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

#define ITERATIONS 10000000000000
#define NUM_THREADS 6

void *calculate_pi(void *threadID) {
    double i;
    double pi;
    int add = 0;
    pi = 4;
    for (i = 0; i < ITERATIONS; i++) {
        if (add == 1) {
            pi = pi + (4/(3+i*2));
            add = 0;
        } else {
            pi = pi - (4/(3+i*2));
            add = 1;
        }
    }
    printf("pi from thread %d = %20lf in %20lf iterations\n",
           (int)(intptr_t)threadID, pi, i);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    int i;
    for (i = 0; i < NUM_THREADS; i++) {
        /* pass the index through the void* argument */
        rc = pthread_create(&threads[i], NULL, calculate_pi, (void *)(intptr_t)i);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(EXIT_FAILURE);
        }
    }
    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return EXIT_SUCCESS;
}
When your main thread creates a new thread, depending on how many CPUs you've got and a bunch of other things, the library/OS can decide to switch to the new thread immediately and run it until it blocks or terminates, then switch back to the main thread, which creates another new thread that runs until it blocks or terminates, and so on. In this case you'd never have more than 2 threads actually running at the same time (the main thread and one of the new threads).
Of course, the more CPUs you have, the more likely it is that the main thread keeps running long enough to spawn all of the new threads. I'm guessing that this is what happened: your PC simply has a lot more CPUs than the ARM system.
The best way to prevent this would be to make the new threads lower priority than the main thread. That way, when the higher-priority main thread creates a lower-priority thread, the library/kernel should be smart enough not to stop running the higher-priority thread.
Sadly, the implementation of pthreads on Linux has a habit of ignoring normal pthreads thread priorities. The last time I looked into it, the only alternative was to use real-time thread priorities instead, which requires root access and creates a security/permissions disaster. This is possibly due to limitations of the underlying scheduler in the kernel (i.e. a problem that the pthreads library can't work around).
There is another alternative: if your main thread acquires a mutex before creating any new threads and releases it after all new threads are created, and if the other threads attempt to acquire (and release) the same mutex before doing any real work, then you'd force all 7 threads to exist at the same time.
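A sketch of that mutex gate (the identifiers are mine): no worker starts real work until main releases the lock, so all threads must have been created before any of them proceeds.

#include <pthread.h>

pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    /* Block here until main() has finished creating everyone. */
    pthread_mutex_lock(&gate);
    pthread_mutex_unlock(&gate);
    /* ... real work starts here ... */
    return NULL;
}

int main(void)
{
    enum { NUM_THREADS = 6 };
    pthread_t t[NUM_THREADS];
    pthread_mutex_lock(&gate);            /* close the gate */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    pthread_mutex_unlock(&gate);          /* open it: all run together */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}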
If the purpose is just to load the processors, and you have a compiler that supports OpenMP, you can use the following:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
double calculate_pi(int iterations) {
double pi;
int add = 0;
pi = 4;
for (int ii = 0; ii < iterations; ii++) {
if (add == 1) {
pi = pi + (4.0/(3.0+ii*2));
add = 0;
} else {
pi = pi - (4.0/(3.0+ii*2));
add = 1;
}
}
return pi;
}
int main(int argc, char *argv[]) {
if ( argc != 2 ) {
printf("Usage: %s <niter>",argv[0]);
return 1;
}
const int iterations = atoi(argv[1]);
#pragma omp parallel
{
double pi = calculate_pi(iterations);
printf("Thread %d, pi = %g\n",omp_get_thread_num(),pi);
}
return 0;
}
In this way you can set the number of iterations from command line, and the number of threads from the environment variable OMP_NUM_THREADS. For instance:
export OMP_NUM_THREADS=4
./pi.x 1000
will run the executable with 1000 iterations and 4 threads.
There's nothing that guarantees the operating system will create as many kernel-level threads/tasks as the threads you spawn with pthread_create. There are pthreads implementations that do everything in userland and only use one kernel-level thread and CPU. Many (most?) implementations do 1:1 threading, where one user thread is one kernel-level thread, because it's the simplest to implement. Some implement an M:N hybrid model, where the userland library decides how many kernel-level threads to spawn. This might be the case for the implementation you use. "ps -eLF" only shows kernel-level threads; it has no information about user-level threads.
The advantage of M:N threading is that context switching between the various user-level threads can be orders of magnitude faster in some cases. The disadvantage is that it's much more complicated to implement, and the implementations are usually quite fragile.
Maybe 1000 seconds (on the sleep) is not enough for that many iterations to finish, so the program might be exiting before the 6 threads are done.
Have you tried joining instead of sleeping?
Try replacing the sleep() with this:
for (i = 0; i < NUM_THREADS; i++) {
    pthread_join(threads[i], NULL);
}

Matrix Multiplication with Threads: Why is it not faster?

So I've been playing around with pthreads, specifically trying to calculate the product of two matrices. My code is extremely messy because it was just supposed to be a quick fun project for myself, but the thread theory I used was very similar to:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10
int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];
struct v {
int i; /* row */
int j; /* column */
};
void *runner(void *param); /* the thread */
int main(int argc, char *argv[]) {
int i,j, count = 0;
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
//Assign a row and column for each thread
struct v *data = (struct v *) malloc(sizeof(struct v));
data->i = i;
data->j = j;
/* Now create the thread passing it data as a parameter */
pthread_t tid; //Thread ID
pthread_attr_t attr; //Set of thread attributes
//Get the default attributes
pthread_attr_init(&attr);
//Create the thread
pthread_create(&tid,&attr,runner,data);
//Make sure the parent waits for all thread to complete
pthread_join(tid, NULL);
count++;
}
}
//Print out the resulting matrix
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
printf("%d ", C[i][j]);
}
printf("\n");
}
}
//The thread will begin control in this function
void *runner(void *param) {
struct v *data = param; // the structure that holds our data
int n, sum = 0; //the counter and sum
//Row multiplied by column
for(n = 0; n< K; n++){
sum += A[data->i][n] * B[n][data->j];
}
//assign the sum to its coordinate
C[data->i][data->j] = sum;
//Exit the thread
pthread_exit(0);
}
source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/
For the non-threaded version, I used the same setup (3 2-d matrices, dynamically allocated structs to hold r/c) and added a timer. First trials indicated that the non-threaded version was faster. My first thought was that the dimensions were too small to notice a difference and that it was taking longer to create the threads. So I upped the dimensions to about 50x50, randomly filled, and ran it, and I'm still not seeing any performance upgrade with the threaded version.
What am I missing here?
Unless you're working with very large matrices (many thousands of rows/columns), then you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually pretty expensive in relative terms of CPU time, much more time than a few multiply operations.
Also, it's usually not worthwhile to set up more than one thread per CPU core that you have available. If you have, say, only two cores and you set up 2500 threads (for 50x50 matrices), then the OS is going to spend all its time managing and switching between those 2500 threads rather than doing your calculations.
If you were to set up the two threads beforehand (still assuming a two-core CPU), keep those threads available all the time waiting for work to do, and supply them with the 2500 dot products you need to calculate in some kind of synchronised work queue, then you might start to see an improvement. However, it still won't ever be more than twice as fast as using only one core.
I'm not entirely sure I understand the source code, but here's what it looks like: You have a loop that runs M*N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But right after you launch the thread, you wait for it to complete. I don't think that you're ever actually running more than one thread.
Even if you were running more than one thread, the thread is doing a trivial amount of work. Even if K was large (you mention 50), 50 multiplications isn't much compared to the cost of starting the thread in the first place. The program should create fewer threads--certainly no more than the number of processors--and assign more work to each.
You don't allow much parallel execution: you wait for the thread immediately after creating it, so there is almost no way for your program to use additional CPUs (i.e. it can never use a third CPU/core). Try to allow more threads to run (probably up to the count of cores you have).
If you have a processor with two cores, then you should just divide the work to be done into two halves and give each thread one half. The same principle applies if you have 3, 4, or 5 cores. The optimal performance design will always match the number of threads to the number of available cores (by available I mean cores that aren't already being heavily used by other processes).
One other thing you have to consider is that each thread must have its data contiguous and independent from the data of the other threads. Otherwise, cache misses will significantly slow down the processing.
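As an illustration of both points (one thread per core, each thread owning a contiguous block of rows), here is a hedged sketch; NUM_CORES is an assumed placeholder, and the matrix names only mirror the question's A, B, C:

#include <pthread.h>

#define M 512
#define K 512
#define N 512
#define NUM_CORES 2 /* assumed core count; match your machine */

double A[M][K], B[K][N], C[M][N];

struct slice { int row_start, row_end; };

/* Each thread computes a contiguous block of rows of C,
 * so its writes stay in its own cache lines. */
void *multiply_slice(void *arg)
{
    struct slice *s = arg;
    for (int i = s->row_start; i < s->row_end; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0;
            for (int n = 0; n < K; n++)
                sum += A[i][n] * B[n][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_CORES];
    struct slice s[NUM_CORES];
    int rows = M / NUM_CORES;
    for (int i = 0; i < NUM_CORES; i++) {
        s[i].row_start = i * rows;
        s[i].row_end = (i == NUM_CORES - 1) ? M : (i + 1) * rows;
        pthread_create(&t[i], NULL, multiply_slice, &s[i]);
    }
    for (int i = 0; i < NUM_CORES; i++)
        pthread_join(t[i], NULL);
    return 0;
}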
To better understand these issues, I'd recommend the book Patterns for Parallel Programming
http://astore.amazon.com/amazon-books-20/detail/0321228111
Although its code samples are more oriented toward OpenMP and MPI while you're using pthreads, the first half of the book is very rich in fundamental concepts and the inner workings of multithreading environments, and very useful for avoiding most of the performance bottlenecks you'll encounter.
Provided the code parallelizes correctly (I won't check it), performance likely improves only when the code is parallelized in hardware, i.e. when the threads run truly in parallel (multiple cores, multiple CPUs, other technologies...) and not just apparently in parallel (time-sliced "multitasking"). Just an idea; I am not sure this is the case here.
