I have a program in C that takes an arbitrary number of files as command-line arguments and calculates the sha1sum for every file. I am using pthreads so that I can take advantage of all 4 of my cores.
Currently, my code runs all threads in parallel at the same time.
Here is a snippet:
c = 0;
for (n = optind; n < argc; n++) {
    if (pthread_create(&t[c], NULL, &sha1sum, (void *) argv[n])) {
        fprintf(stderr, "Error creating thread\n");
        return 1;
    }
    c++;
}

c = 0;
for (n = optind; n < argc; n++) {
    pthread_join(t[c], NULL);
    c++;
}
Obviously, it is not efficient (or scalable) to start all threads at once.
What would be the best way to make sure that only 4 threads are running at any time? Somehow I need to start 4 threads at the beginning, and then "replace" a thread with a new one as soon as it completes.
How can I do that?
Creating 4 threads does not necessarily provide the best performance on a 4-core machine. If the threads are doing I/O or waiting on something, then creating more than 4 threads could also result in better performance/efficiency. You just need to figure out an approximate number based on the work your threads do, and possibly a mini-benchmark.
Regardless of what number you choose (i.e. number of threads), what you are looking for is a thread pool. The idea is to create a fixed number of threads and feed them work as soon as they complete.
See C: What's the way to make a poolthread with pthreads? for a simple skeleton. This repo also shows a self-contained example (check the license if you are going to use it). You can find many similar examples online.
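If you'd rather not pull in a general-purpose pool for this particular program, a minimal sketch of the same idea is a fixed set of workers pulling the next filename from a shared, mutex-protected index (worker, next_index and NUM_WORKERS are names I've made up; sha1sum is assumed to be your existing thread routine):

#include <pthread.h>

#define NUM_WORKERS 4

static char **files;            /* set to &argv[optind] in main()    */
static int file_count;          /* set to argc - optind in main()    */
static int next_index = 0;      /* next file to process              */
static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;

extern void *sha1sum(void *filename);   /* your existing routine */

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        /* Grab the index of the next unprocessed filename, if any. */
        pthread_mutex_lock(&index_lock);
        int i = next_index++;
        pthread_mutex_unlock(&index_lock);

        if (i >= file_count)
            return NULL;                /* no work left */
        sha1sum(files[i]);
    }
}

/* In main(), after option parsing:
 *
 *   files = &argv[optind];
 *   file_count = argc - optind;
 *   pthread_t t[NUM_WORKERS];
 *   for (int c = 0; c < NUM_WORKERS; c++)
 *       pthread_create(&t[c], NULL, worker, NULL);
 *   for (int c = 0; c < NUM_WORKERS; c++)
 *       pthread_join(t[c], NULL);
 */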
The thing you are looking for is a semaphore; it will let you restrict the number of threads running at any one time to 4. You can still start them all up initially, and the semaphore will take care of letting a new one proceed when a running one finishes.
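A hedged sketch of that approach, still creating one thread per file but gating the actual work with an unnamed POSIX semaphore initialized to 4 (sha1sum_gated and sem are names I've made up):

#include <pthread.h>
#include <semaphore.h>

static sem_t sem;                        /* allows 4 threads to do work at once */

extern void *sha1sum(void *filename);    /* your existing routine */

static void *sha1sum_gated(void *filename)
{
    sem_wait(&sem);                      /* block until one of the 4 slots is free */
    void *ret = sha1sum(filename);
    sem_post(&sem);                      /* release the slot for the next thread   */
    return ret;
}

/* In main(), before the creation loop:
 *
 *   sem_init(&sem, 0, 4);
 *
 * and create the threads with sha1sum_gated instead of sha1sum.
 * Note this still creates argc - optind threads; it only limits how
 * many of them run sha1sum() at the same time.
 */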
Related
I've gotten ideas for multiple projects recently that all involve reading IP addresses from a file. Since they are all supposed to be able to handle a large amount of hosts, I've attempted to implement multi-threading or creating a pool of sockets and select()-ing from them in order to achieve some form of concurrency for better performance. On multiple occasions, reading from the file seems to be the bottleneck in enhancing performance. The way I understand it, reading from a file with fgets or similar is a synchronous, blocking operation. So even if I successfully implemented a client that connects to multiple hosts asynchronously, the operation would still be synchronous because I can only read one address at a time from a file.
/* partially pseudo code */
/* getaddrinfo() stuff here */
while (fgets(ip, sizeof(ip), file)) {
    FD_ZERO(&readfds);
    /* create n sockets here in a for loop */
    for (i = 0; i < socket_num; i++) {
        if (newfd > fd[i]) newfd = fd[i];
        FD_SET(fd[i], &readfds);
    }
    /* here's where I think I should connect n sockets to n addresses from file
     * but I'm only getting one IP at a time from file, so I'm not sure how to connect to
     * n addresses at once with fgets
     */
    for (j = 0; j < socket_num; j++) {
        if ((connect(socket, ai->ai_addr, ai->ai_addrlen)) == -1) {
            /* error */
        } else {
            freeaddrinfo(ai);
            FD_SET(socket, &master);
            fdmax = socket;
            if (select(socket + 1, &master, NULL, NULL, &tv) == -1) {
                /* error */
            }
            if ((recvd = read(socket, banner, RECVD)) <= 0) {
                /* error */
            }
            if (FD_ISSET(socket, &master)) {
                /* print success */
            }
        }
    }
    /* clear sets and close sockets and stuff */
}
I've pointed out my issues with comments, but just to clarify: I'm not sure how to perform asynchronous I/O operations on multiple target servers read from a file, since reading entries from a file seems to be strictly synchronous. I've run into similar issues with multithreading, with a marginally better degree of success.
void *function_passed_to_pthread_create(void *opts)
{
    while (fgets(ip_addr, sizeof(ip_addr), opts->file)) {
        /* speak to ip_addr and get response */
    }
}
main()
{
    /* necessary stuff */
    for (i = 0; i < thread_num; i++) {
        pthread_create(&tasks, NULL, above_function, opts);
    }
    for (j = 0; j < thread_num; j++)
        /* join threads */;
    return 0;
}
This seems to work, but since multiple threads are all processing the same file the results aren't always accurate. I imagine it's because multiple threads may process the same address from file at the same time.
I've considered loading all the entries from the file into an array/into memory, but if the file was particularly large I imagine that could cause memory issues. On top of that, I'm not sure that it even makes sense to do anyway.
As a final note; if the file I'm reading from happens to be a particularly large file with a huge number of IPs then I do not believe either solution scales well. Anything is possible with C though, so I imagine there is some way to achieve what I'm hoping to.
To sum this post up: I'd like to find a way to improve a client-side application's performance using asynchronous I/O or multi-threading when reading entries from a file.
Several people have hinted at a good solution to this in their comments, but it's probably worth spelling it out in more detail. The full solution has quite a lot of details and is pretty complicated code, so I'm going to use pseudocode to explain what I'd recommend.
What you have here is really a variation on a classic producer/consumer problem: You have a single thing producing data, and many things trying to consume that data. In your case, it must be a "single thing" producing that data, because the lengths of each line of the source file are unknown: You can't just jump forward 'n' bytes and somehow be at the next IP. There can only be one actor at a time moving the read pointer toward the next unknown position of the \n, so you by definition have a single producer.
There are three general ways to attack this:
Solution A involves having each thread pulling a little more out of a shared file buffer, and kicking off an asynchronous (nonblocking) read every time the last read completes. There are a whole host of headaches getting this solution right, as it's very sensitive to timing differences between the filesystem and the work being performed: If the file reads are slow, the workers will all stall waiting for the file. If the workers are slow, the file reader will either stall or fill up memory waiting for them to consume the data. This solution is likely the absolute fastest, but it's also incredibly difficult synchronization code to get right with about a zillion caveats. Unless you're an expert in threading (or extremely clever abuse of epoll_wait()), you probably don't want to go this route.
Solution B has a "master" thread, responsible for reading the file, and populating some kind of thread-safe queue with the data it reads, with one IP address (one string) per queue entry. Each of the worker threads just consumes queue entries as fast as it can, querying the remote server and then requesting another queue entry. This requires a little care to get right, but is generally a lot safer than Solution A, especially if you use somebody else's queue implementation.
Solution C is pretty hacktastic, but you shouldn't dismiss it out-of-hand, depending on what you're doing. This solution just involves using something like the Un*x sed command (see Get a range of lines from a file given the start and end line numbers) to slice your source file into a bunch of "chunky" source files in advance — say, twenty of them. Then you just run twenty copies of a really simple single-thread program in parallel using &, each on a different "slice" of file. Mushed together with a little shell script to automate it, this can be a "good enough" solution for a lot of needs.
Let's take a closer look at Solution B — a master thread with a thread-safe queue. I'm going to cheat and assume you can construct a working queue implementation (if not, there are StackOverflow articles on implementing a thread-safe queue using pthreads: pthread synchronized blocking queue).
In pseudocode, this solution is then something like this:
main()
{
    /* Create a queue. */
    queue = create_queue();

    /* Kick off the master thread to read the file, and give it the queue. */
    master_thread = pthread_create(master, queue);

    /* Kick off a bunch of workers with access to the queue. */
    for (i = 0; i < 20; i++) {
        worker_thread[i] = pthread_create(worker, queue);
    }

    /* Wait for everybody to finish. */
    pthread_join(master_thread);
    for (i = 0; i < 20; i++) {
        pthread_join(worker_thread[i]);
    }
}

void master(queue q)
{
    FILE *fp = fopen("ips.txt", "r");
    char buffer[BIGGER_THAN_ANY_IP];

    /* Inhale the file as fast as we can, and push each line we
       read onto the queue. */
    while (fgets(buffer, sizeof(buffer), fp) != NULL) {
        char *next_ip = strdup(buffer);
        enqueue(q, next_ip);
    }

    /* Add some final messages in the queue to let the workers
       know that we're out of data. There are *much* better ways
       of notifying them that we're "done", but in this case,
       pushing a bunch of NULLs equal to the number of threads is
       simple and probably good enough. */
    for (i = 0; i < 20; i++) {
        enqueue(q, NULL);
    }
}

void worker(queue q)
{
    char *ip;

    /* Inhale messages off the queue as fast as we can until
       we get a "NULL", which means that it's time to stop.
       The call to dequeue() *must* block if there's nothing
       in the queue; the call should only return NULL if the
       queue actually had NULL pushed into it. */
    while ((ip = dequeue(q)) != NULL) {
        /* Insert code to actually do the work here. */
        connect_and_send_and_receive_to(ip);
    }
}
There are plenty of caveats and details in a real implementation (like: how do we implement the queue, ring buffers or a linked list? what if the text isn't all IPs? what if the char buffer isn't big enough? how many threads is enough? how do we deal with file or network errors? will malloc performance become a bottleneck? what if the queue gets too big? can we do better to overlap the network I/O?).
But, caveats and details aside, the pseudocode I presented above is a good enough starting point that you likely can expand it into a working solution.
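Since the pseudocode leans on create_queue()/enqueue()/dequeue(), here is one hedged sketch of how such a blocking queue could look with a mutex and a condition variable (an unbounded linked list; everything beyond those three names is my own invention):

#include <pthread.h>
#include <stdlib.h>

typedef struct node {
    void *data;
    struct node *next;
} node;

typedef struct queue {
    node *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} queue;

queue *create_queue(void)
{
    queue *q = calloc(1, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    return q;
}

void enqueue(queue *q, void *data)
{
    node *n = malloc(sizeof(*n));
    n->data = data;
    n->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->not_empty);   /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

/* Blocks until an item is available; returns whatever was enqueued,
   including the NULL "we're done" markers pushed by the master. */
void *dequeue(queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->not_empty, &q->lock);
    node *n = q->head;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);

    void *data = n->data;
    free(n);
    return data;
}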
Read IPs from a file, have worker threads, and keep handing IPs to the worker threads; let all socket communication happen in the worker threads. Also, if the IPv4 addresses are stored in binary/hex format instead of ASCII, you could probably read several of them in a single read, which would be faster.
If you just want to read asynchronously you can use getch() from ncurses with the delay set to 0. It is part of the curses API available on practically every Unix-like system, so you don't need any exotic dependencies. There is also unlocked_stdio.
On the other hand, I have to wonder why fgets() is a bottleneck. As long as there is data in the file it should not block. And even if the data is huge (say 1 MB, or 100k IP addresses), reading it into a list at startup should take less than a second.
And why are you opening socket_num connections to every IP in the list? You end up with socket_num multiplied by the number of IP addresses open at the same time. Since every socket is a file descriptor on Linux, you will hit system limits when you try to open more than a few thousand files (see ulimit -Sn). Can you confirm that the issue is not in connect() in that case?
With the very simple code below, my system (Ubuntu Linux 14.04) simply crashes, not even letting my mouse respond. I had to force quit with the power button. I thought Linux was a stable OS, capable of tolerating such basic program errors. Did I miss something?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <semaphore.h>
void check(int isOkay){
    if(!isOkay){
        printf("error\n");
        abort();
    }
}

int main(void){
    #define n 1000000
    int array[n];
    sem_t blocker;
    int i;
    while(1){
        if(!fork()){
            for(i = 0; i < n; ++i){
                array[i] = rand();
            }
            check(sem_init(&blocker, 0, 0) == 0);
            check(sem_wait(&blocker) == 0);
        }
    }
    return 0;
}
Congratulations, you've discovered the fork bomb. There are shell one-liners that can wreak the same sort of havoc with a lot less typing on your part.
It is in fact possible to limit the number of processes that a user can spawn using ulimit -- see the bottom of the linked Wikipedia article for details.
A desktop install of Ubuntu is not exactly a hardened server, though. It's designed for usability first and foremost. If you need a locked down system that can't crash, there are better options.
The command ulimit -u shows the maximum number of processes that you can start. However, do not actually start that many processes in the background: your machine would spend all its time switching between processes and wouldn't get around to doing any actual work.
Linux does its job of processing your requests to create processes; it is up to the user to write code that respects this limit.
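If you want to enforce such a limit from inside a program rather than from the shell, one possible sketch uses setrlimit() with RLIMIT_NPROC (a Linux/BSD resource limit; the value 100 here is arbitrary):

#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    /* Cap the number of processes this user may have to 100.
       Any fork() beyond that fails with EAGAIN instead of taking
       the machine down. */
    struct rlimit rl = { .rlim_cur = 100, .rlim_max = 100 };
    if (setrlimit(RLIMIT_NPROC, &rl) != 0)
        perror("setrlimit");

    /* ... the fork-happy code would go here ... */
    return 0;
}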
The main problem here is determining the best limit. A lot of software doesn't use fork() at all, so do you set the limit to something small like 5? Some software might create a new process whenever it receives a request from network, so do you set the limit to "max. number of network packets"? If you assume most software isn't buggy, then you'd be tempted to set the limit relatively high so that correct software works properly.
The other problem is one of scheduling priorities. In a well designed system things like the GUI would be "high priority" and if it wants CPU time it'd preempt normal/lower priority work immediately. If this was the case, a massive fork bomb running at normal/lower priority would have no effect on the system's ability to respond to the user, and the user would be able to kill the fork bomb without much problem.
Sadly, for a variety of reasons, the scheduler in Linux doesn't work like that. It does support priorities, but to use them you have to be a "real time" process and have to be running as root (which is a massive security disaster). Without sane priorities, Linux assumes that every forked process is as important as everything else, and the CPU/s end up busy doing the forking and there's no CPU time left to respond to the user.
I have a C application, part of which does some threaded stuff that I'm having some difficulty implementing. I'm using pthread.h (POSIX threads) as a guideline.
I need to synchronize two threads that repeat a certain task a predefined number of times, and with each repetition the two tasks need to start at the same time. My idea is to let each thread initialize and do its work before the synced task begins, and when this happens, thread one (let's call it TX) will signal thread two (RX) that it can begin doing the task.
Here's an example:
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_condvar = PTHREAD_COND_INITIALIZER;
static bool tx_ready = false;
These are declared in a header file. The TX thread is shown below:
while (reps > 0 && !task->quit) {
    pthread_mutex_lock(&mutex);
    tx_ready = true;
    pthread_cond_signal(&tx_condvar);
    pthread_mutex_unlock(&mutex);

    status = do_stuff();
    if (status != 0) {
        print_error();
        goto tx_task_out;
    }

    reps--;

    // one task done, wait till it's time to do the next
    usleep(delay_value_us);
    tx_ready = false;
}
And then the RX thread:
while (!done && !task->quit) {
    // wait for the tx_ready signal before carrying on
    pthread_mutex_lock(&mutex);
    while (!tx_ready){
        pthread_cond_wait(&tx_condvar, &mutex);
    }
    pthread_mutex_unlock(&mutex);

    status = do_stuff();
    if (status != 0) {
        print_error();
        goto rx_task_out;
    }

    n = fwrite(samples, 2 * sizeof(samples[0]), to_rx, p->out_file);
    num_rx += to_rx;
    if (num_rx == s->rx_length){
        done = true;
    }
}
Is there a better way to handle this, and am I even doing it correctly? It's incredibly important that the two tasks within the tx/rx threads start at the same time for each repetition.
Thanks in advance for your input!
What you are looking for is called a barrier. Basically it blocks threads entering the barrier until a certain number of threads have entered and then it releases them all.
I believe pthreads have a barrier, although it might be an extension.
https://computing.llnl.gov/tutorials/pthreads/man/pthread_barrier_wait.txt
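For what it's worth, here is a minimal, self-contained sketch of the idea (not your actual TX/RX code; printf() and usleep() stand in for do_stuff(), and the count of 2 passed to pthread_barrier_init() means both threads must reach the barrier before either is released). Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define REPS 5

static pthread_barrier_t rep_barrier;

static void *tx(void *arg)
{
    (void)arg;
    for (int i = 0; i < REPS; i++) {
        /* Neither thread passes this point until both have reached it. */
        pthread_barrier_wait(&rep_barrier);
        printf("TX repetition %d\n", i);
        usleep(1000);                     /* stands in for do_stuff() */
    }
    return NULL;
}

static void *rx(void *arg)
{
    (void)arg;
    for (int i = 0; i < REPS; i++) {
        pthread_barrier_wait(&rep_barrier);
        printf("RX repetition %d\n", i);
        usleep(1000);
    }
    return NULL;
}

int main(void)
{
    pthread_t t_tx, t_rx;

    /* 2 = both TX and RX must arrive before either proceeds. */
    pthread_barrier_init(&rep_barrier, NULL, 2);
    pthread_create(&t_tx, NULL, tx, NULL);
    pthread_create(&t_rx, NULL, rx, NULL);
    pthread_join(t_tx, NULL);
    pthread_join(t_rx, NULL);
    pthread_barrier_destroy(&rep_barrier);
    return 0;
}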
First, if you are not using a multi-core processor, you cannot run two tasks at precisely the same time.
Starting two tasks at precisely the same time requires a multi-core system (an SMP architecture where multiple cores with shared memory run under a single OS). Several development environments provide extensions for taking advantage of features such as processor affinity, where you can dedicate a particular thread to run only on a specific core, or dedicate specific cores to specific threads and control when they run.
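For example, on Linux with glibc there is the non-portable pthread_setaffinity_np() extension; a hedged sketch of pinning the calling thread to one core (pin_to_core is my own helper name):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core (0-based core index).
   This is a GNU/Linux extension, not portable POSIX. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* e.g. at the start of the TX thread: pin_to_core(0);
   and at the start of the RX thread:  pin_to_core(1); */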
I use LabWindows/CVI (ANSI C with extensions for instrumentation). HERE is a small white paper on capabilities using multi-core. Here is another that is mostly NI specific, but also includes some generic techniques applicable using any ANSI C compiler (toward bottom) dealing with time critical loops.
I'm having some trouble writing pseudocode for a homework assignment in my operating systems class, in which we are programming in C.
You will be implementing a Producer-Consumer program with a bounded buffer queue of N elements, P producer threads and C consumer threads (N, P and C should be command line arguments to your program, along with three additional parameters, X, Ptime and Ctime, that are described below). Each Producer thread should Enqueue X different numbers onto the queue (spin-waiting for Ptime*100,000 cycles in between each call to Enqueue). Each Consumer thread should Dequeue P*X/C items from the queue (spin-waiting for Ctime*100,000 cycles in between each call to Dequeue). The main program should create/initialize the Bounded Buffer Queue, print a timestamp, spawn off C consumer threads & P producer threads, wait for all of the threads to finish and then print off another timestamp & the duration of execution.
My main difficulty is understanding what my professor means by spin-waiting for Ptime/Ctime times 100,000 cycles; that is the part that is confusing me.
I understand a time stamp will be used to print the difference between each thread. We are using semaphores and implementing synchronization at the moment. Any suggestions on the above queries would be much appreciated.
I'm guessing it means busy-waiting; repeatedly checking the loop condition and consuming unnecessary CPU power in a tight loop:
while (current_time() <= wake_up_time);
One would ideally use something that suspends your thread until it's woken up externally, by the scheduler (so resources such as the CPU can be diverted elsewhere):
sleep(2 * 60 * 1000 ms);
or at least give up some CPU (i.e. not be so tight):
while (current_time() <= wake_up_time)
sleep(100 ms);
But I guess they don't want you to manually yield to the scheduler, i.e. hint to the OS (or your threading library) that it's a good time to make a context switch.
I'm not sure what cycles are; in assembly they might be CPU cycles but given that your question is tagged C, I'll bet that they're simply loop iterations:
for (int i=0; i<Ptime*100000; ++i); //spin-wait for Ptime*100,000 cycles
Though it's always safest to ask whoever issued the homework.
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, such as whether keyboard input is available or whether a lock is available.
So the assignment says that each Producer thread should spin-wait for Ptime*100,000 cycles before enqueuing the next element, i.e. enqueue X different numbers with that spin-wait in between. Similarly, each Consumer thread should dequeue P*X/C items from the queue, spin-waiting for Ctime*100,000 cycles after each item it consumes.
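To make that concrete, here is a hedged sketch of what one producer thread could look like (the semaphores, queue_lock, enqueue(), X and Ptime are placeholders for whatever your bounded-buffer implementation provides; the empty volatile loop is the spin-wait):

#include <pthread.h>
#include <semaphore.h>

/* Assumed bounded-buffer plumbing -- replace with your own. */
extern sem_t empty_slots, full_slots;     /* counting semaphores  */
extern pthread_mutex_t queue_lock;        /* protects the buffer  */
extern void enqueue(int value);
extern int X, Ptime;

static void spin_wait(long cycles)
{
    /* volatile stops the compiler from optimizing the loop away */
    for (volatile long i = 0; i < cycles; i++)
        ;
}

void *producer(void *arg)
{
    int id = *(int *)arg;
    for (int k = 0; k < X; k++) {
        sem_wait(&empty_slots);           /* wait for room in the buffer */
        pthread_mutex_lock(&queue_lock);
        enqueue(id * X + k);              /* some "different number"     */
        pthread_mutex_unlock(&queue_lock);
        sem_post(&full_slots);            /* signal a consumer           */

        spin_wait((long)Ptime * 100000);  /* spin-wait between Enqueues  */
    }
    return NULL;
}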
I suspect that your professor is being a complete putz - by actually ASKING for the worst "busy waiting" technique in existence:
int n = pTime * 100000;
for ( int i=0; i<n; ++i) ; // waste some cycles.
I also suspect that he still uses a pterosaur thigh-bone as a walking stick, has a very nice (dry) cave, and a partner with a large bald patch.... O/S guys tend to be that way. It goes with the cool beards.
No wonder his thoroughly modern students misunderstand him. He needs to (re)learn how to grunt IN TUNE.
Cheers. Keith.
I've written a program that executes some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array, appends its individual part to it, and then some extra work is done to eliminate repetitions.
I tested it and found that the cost of merging increases with the number of threads, at an unexpected rate:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why does the cost explode like this as the number of threads increases, and how do I change this?
------------------------------------------------------------------------
Actually, the code is very simple; here is the pseudo-code:
typedef struct my_object{
    long no;
    int count;
    double value;
    //something others
} my_object_t;

static my_object_t** global_result_array; //about ten thousand entries
static pthread_mutex_t global_lock;

void* thread_function(void* arg){
    my_object_t** local_result;
    int local_result_number;
    int i;
    my_object_t* ptr;

    for(;;){
        if( exit_condition ){ return NULL;}

        if( merge_condition){
            //start time point to log
            pthread_mutex_lock( &global_lock);
            for( i = local_result_number-1; i>=0 ; i--){
                ptr = local_result[ i] ;
                if( NULL == global_result_array[ ptr->no] ){
                    global_result_array[ ptr->no] = ptr;                  //step 4
                }else{
                    global_result_array[ ptr->no] -> count += ptr->count; // step 5
                    global_result_array[ ptr->no] -> value += ptr->value; // step 6
                }
            }
            pthread_mutex_unlock( &global_lock); // end time point to log
        }else{
            //do some calculation and produce the partial, thread-local result,
            //namely local_result and local_result_number
        }
    }
}
As shown above, the only difference between two threads and six threads is in steps 5 and 6; I counted roughly hundreds of millions of executions of those steps. Everything else is the same.
So, from my point of view, the merge operation is very light; whether I use 2 threads or 6 threads, they all need to take the lock and do the merge exclusively.
Another astonishing thing: when using six threads, the cost of step 4 exploded! That was the root cause of the explosion in the total cost.
BTW: the test server has two CPUs, and each CPU has four cores.
There are various reasons for the behaviour shown:
More threads mean more lock contention and more time spent blocked waiting on other threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up from threads is better when the data sets are largely exclusive.
Unless your system has at least as many processors/cores as the number of threads, not all of them can run concurrently. You can set the maximum concurrency using pthread_setconcurrency.
Context switching is an overhead. Hence the difference. If your computer had 6 cores it would be faster. Otherwise you need to have more context switches for the threads.
This is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to produce such a huge discrepancy. You seem to have succeeded :((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication (locks etc.) is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections (e.g. with a merge sort), the time wasted on inter-thread comms and cache-thrashing starts to dominate. This is why multi-threaded merge-sorts frequently switch to an in-place sort once the data has been divided up into chunks smaller than the L1 cache.
'each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global array once serializes the threads and effectively leaves you with one thread plus a lot of inter-thread comms; continually locking/releasing it leaves you with one thread plus far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own.
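One way to apply that to your case is sketched below (a hedged sketch, not your actual code): let each worker keep its local_result private and hand it back through pthread_join(), and have a single thread do the whole merge afterwards, so the global array never needs a lock at all.

#include <pthread.h>
#include <stdlib.h>

typedef struct my_object {
    long   no;
    int    count;
    double value;
} my_object_t;

typedef struct {
    my_object_t **items;    /* this thread's private results */
    int           n;
} local_result_t;

#define NTHREADS   6
#define ARRAY_SIZE 10000

static my_object_t *global_result_array[ARRAY_SIZE];

/* Each worker builds a local_result_t privately and returns it
   from its thread function -- no lock is taken anywhere. */
extern void *thread_function(void *arg);

static void merge_one(const local_result_t *lr)
{
    for (int i = 0; i < lr->n; i++) {
        my_object_t *ptr = lr->items[i];
        if (global_result_array[ptr->no] == NULL) {
            global_result_array[ptr->no] = ptr;
        } else {
            global_result_array[ptr->no]->count += ptr->count;
            global_result_array[ptr->no]->value += ptr->value;
        }
    }
}

int run(void)
{
    pthread_t t[NTHREADS];
    void *ret;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, thread_function, NULL);

    /* Merge strictly after the workers are done, in one thread. */
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], &ret);
        merge_one((local_result_t *)ret);
    }
    return 0;
}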
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery.
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(