I'm trying to roughly calculate the time of a thread context switch on a Linux system. I've written a program that uses pipes and multithreading to achieve this. When running the program, the calculated time is clearly wrong (see output below). I am unsure whether this is because I'm using the wrong clock_id for this procedure, or perhaps because of my implementation.
I have used sched_setaffinity() so that the program only runs on core 0. I've tried to leave as much fluff as possible out of the code so as to only measure the time of a context switch, so the thread only writes a single character to the pipe and the parent does a 0-byte read.
I have a parent thread that creates one child thread with a one-way pipe between them to pass data; the child thread runs a simple function to write to the pipe.
//pipe file descriptors, read buffer, and loop constants
//(declarations assumed from context; the values are illustrative)
int fd1[2], fd2[2];
char input[2];
const int iterations = 100;
const uint64_t billion = 1000000000ULL;

void* thread_1_function(void* arg)
{
    write(fd2[1], "", sizeof(""));
    return NULL;
}
The parent thread creates the child thread, starts the time counter, and then calls read() on the pipe that the child thread writes to.
int main(int argc, char *argv[])
{
//time struct declaration
struct timespec start,end;
//sets program to only use core 0
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(0,&cpu_set);
if((sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set) < 1))
{
int nproc = sysconf(_SC_NPROCESSORS_ONLN);
int k;
printf("Processor used: ");
for(k = 0; k < nproc; ++k)
{
printf("%d ", CPU_ISSET(k, &cpu_set));
}
printf("\n");
if(pipe(fd1) == -1)
{
printf("fd1 pipe error");
return 1;
}
//fail on file descriptor 2 fail
if(pipe(fd2) == -1)
{
printf("fd2 pipe error");
return 1;
}
pthread_t thread_1;
pthread_create(&thread_1, NULL, &thread_1_function, NULL);
pthread_join(thread_1,NULL);
int i;
uint64_t sum = 0;
for(i = 0; i < iterations; ++i)
{
//initialize clock start
clock_gettime(CLOCK_MONOTONIC, &start);
//wait for child thread to write to pipe
read(fd2[0],input,0);
//record clock end
clock_gettime(CLOCK_MONOTONIC, &end);
write(fd1[1],"",sizeof(""));
uint64_t diff;
diff = billion * (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec);
sum += diff;
printf("%lu\n", (unsigned long)diff);
}
The results I get while running this are typically of this order:
3000
3000
4000
2000
12000
3000
5000
and so forth. When I inspect the times returned in the start and end timespec structs, I see that tv_nsec seems to be a 'rounded' number as well:
start.tv_nsec: 714885000, end.tv_nsec: 714888000
Would this be caused by CLOCK_MONOTONIC not being precise enough for what I'm attempting to measure, or some other problem that I'm overlooking?
When I inspect the time returned in the start and end timespec structs, I see that tv_nsec seems to be a 'rounded' number as well: 2626, 714885000, 2626, 714888000. Would this be caused by CLOCK_MONOTONIC not being precise enough for what I'm attempting to measure, or some other problem that I'm overlooking?
Yes, that's a possibility. Every clock supported by the system has a fixed resolution. struct timespec is capable of supporting clocks with nanosecond resolution, but that does not mean that you can expect every clock to actually have such resolution. It looks like your CLOCK_MONOTONIC might have a resolution of 1 microsecond (1000 nanoseconds), but you can check that via the clock_getres() function.
If it is available to you, then you might try CLOCK_PROCESS_CPUTIME_ID. It is possible that that would have higher resolution than CLOCK_MONOTONIC for you, but do note that single-microsecond resolution is pretty precise -- that's on the order of one tick per 3000 CPU cycles on a modern machine.
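For instance, a minimal sketch that queries both clocks' granularity with clock_getres() (assuming a POSIX system; on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    /* print the tick granularity of each clock, in nanoseconds */
    if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
        printf("CLOCK_MONOTONIC resolution: %ld ns\n", res.tv_nsec);

    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) == 0)
        printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld ns\n", res.tv_nsec);

    return 0;
}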
Even so, I see several possible problems with your approach:
Although you set your process to have affinity for a single CPU, that does not prevent the system from scheduling other processes on that CPU, too. Thus, unless you've taken additional measures, you can't be certain -- it's not even likely -- that every context switch away from one of your program's threads is to the other thread.
You start your second thread and then immediately join it. There is no more context switching between your threads after that, because your second thread no longer exists after being successfully joined.
read() with a count of 0 may or may not check for errors, and it certainly does not transfer any data. It is totally unclear to me why you identify the time for that call with the time for a context switch.
If a context switch does occur in the space you're timing, then at least two need to occur there -- away from your program and back to it. Also, you're measuring the time consumed by whatever else runs in the other context as well, not just the switch time. The 1000-nanosecond steps may thus reflect time slices, rather than switching time.
Your main thread is writing null characters to the write end of a pipe, but there does not appear to be anything reading them. If indeed there isn't then this will eventually fill up the pipe's buffer and block. The purpose is lost on me.
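For contrast, a common technique for approximating switch cost is a pipe "ping-pong" between two threads, averaged over many round trips; each round trip contains at least two context switches, so the per-switch figure is an upper bound. A minimal sketch (error handling and the CPU pinning from your code omitted for brevity):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int ping[2], pong[2];        /* two pipes, one per direction */
enum { ITERATIONS = 100000 };

static void *echo_thread(void *arg)
{
    char c;
    for (int i = 0; i < ITERATIONS; ++i) {
        read(ping[0], &c, 1);       /* block until main writes */
        write(pong[1], &c, 1);      /* then wake main back up */
    }
    return NULL;
}

int main(void)
{
    struct timespec start, end;
    pthread_t t;
    char c = 'x';

    pipe(ping);
    pipe(pong);
    pthread_create(&t, NULL, echo_thread, NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; ++i) {
        write(ping[1], &c, 1);
        read(pong[0], &c, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    uint64_t ns = 1000000000ULL * (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec);
    /* each round trip involves at least two switches */
    printf("~%llu ns per switch (upper bound)\n",
           (unsigned long long)(ns / ITERATIONS / 2));
    return 0;
}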
This might be a dumb question; I'm very sorry if that's the case. But I'm struggling to take advantage of the multiple cores in my Quad-Core MacBook to perform multiple computations at the same time. This is not for any particular project, just a general question, since I want to learn for when I eventually do need to do this kind of thing.
I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff, though!).
I'm also aware of processes that can be created with fork, but I'm not sure they are guaranteed to use more CPUs, or if they, like threads, just help with IO-bound operations.
Finally, I'm aware of CUDA, allowing parallelism on the GPU (and I think OpenCL and compute shaders also allow my code to run on the CPU in parallel), but I'm currently looking for something that will let me take advantage of the multiple CPU cores that my computer has.
In Python, I'm aware of the multiprocessing module, which seems to provide an API very similar to threads, but there I do seem to gain an edge by running multiple functions performing computations in parallel. I'm looking into how I could get this same advantage in C, but I don't seem to be able to.
Any help pointing me in the right direction would be very much appreciated.
Note: I'm trying to achieve true parallelism, not concurrency.
Note 2: I'm only aware of threads and multiple processes in C. With threads I don't seem to be able to get the performance boost I want, and I'm not very familiar with processes, so I'm still not sure whether running multiple processes is guaranteed to give me the advantage I'm looking for.
A simple program to heat up your CPU (100% utilization of all available cores).
Hint: the thread function never returns; exit the program via Ctrl+C.
#include <pthread.h>
void* func(void *arg)
{
    (void)arg;      /* unused */
    while (1);      /* spin forever, pegging one core */
}
int main()
{
#define NUM_THREADS 4 //use the number of cores (if known)
pthread_t threads[NUM_THREADS];
for (int i=0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, func, NULL);
for (int i=0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
return 0;
}
Compilation:
gcc -pthread -o thread_test thread_test.c
If I start ./thread_test, all cores are at 100%.
A word about fork and pthread_create:
fork creates a new process (the current process image is copied and executed in parallel), while pthread_create creates a new thread, sometimes called a lightweight process.
Both processes and threads run in 'parallel' with the parent process.
It depends on the situation when to use a child process over a thread; e.g. a child is able to replace its process image (via the exec family) and has its own address space, while threads share the address space of the parent process.
There are of course many more differences; for those I recommend studying the following pages:
man fork
man pthreads
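To make the address-space point concrete, here is a tiny hypothetical demo (not from the man pages): a forked child modifies only its own copy of x, while a thread modifies the x it shares with its parent. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int x = 0;

static void *bump(void *arg)
{
    x = 42;             /* same address space: the parent sees this */
    return NULL;
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {     /* child: private copy-on-write address space */
        x = 99;
        _exit(0);
    }
    wait(NULL);
    printf("after fork + wait: x = %d\n", x);   /* still 0 */

    pthread_t t;
    pthread_create(&t, NULL, bump, NULL);
    pthread_join(t, NULL);
    printf("after thread join: x = %d\n", x);   /* 42 */
    return 0;
}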
I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff, though!).
No, they don't. Unless your threads block, you'll see all of them running. Just try this (beware that it consumes all your CPU time): it starts 16 threads, each counting in a busy loop for 60 s. You will see all of them running and bringing your cores to their knees (it runs only a minute this way, then everything ends):
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strerror() for pthread error codes */
#include <time.h>
#define N 16 /* had 16 cores, so I used this. Put here as many
* threads as cores you have. */
struct thread_data {
pthread_t thread_id; /* the thread id */
struct timespec end_time; /* time to get out of the tunnel */
int id; /* the array position of the thread */
unsigned long result; /* number of times looped */
};
void *thread_body(void *data)
{
struct thread_data *p = data;
p->result = 0UL;
clock_gettime(CLOCK_REALTIME, &p->end_time);
p->end_time.tv_sec += 60; /* 60 s. */
struct timespec now;
do {
/* just get the time */
clock_gettime(CLOCK_REALTIME, &now);
p->result++;
/* if you call printf() you will see them slowing, as there's a
* common buffer that forces all thread to serialize their outputs
*/
/* check if we are over: loop while now < end_time */
} while (now.tv_sec < p->end_time.tv_sec
|| (now.tv_sec == p->end_time.tv_sec
&& now.tv_nsec < p->end_time.tv_nsec));
return p;
} /* thread_body */
int main()
{
struct thread_data thrd_info[N];
for (int i = 0; i < N; i++) {
struct thread_data *d = &thrd_info[i];
d->id = i;
d->result = 0;
printf("Starting thread %d\n", d->id);
int res = pthread_create(&d->thread_id,
NULL, thread_body, d);
if (res != 0) {
/* pthread_create() returns an error number; errno is not set */
fprintf(stderr, "pthread_create: %s\n", strerror(res));
exit(EXIT_FAILURE);
}
printf("Thread %d started\n", d->id);
}
printf("All threads created, waiting for all to finish\n");
for (int i = 0; i < N; i++) {
struct thread_data *joined;
int res = pthread_join(thrd_info[i].thread_id,
(void **)&joined);
if (res != 0) {
/* pthread_join() also returns an error number, not -1/errno */
fprintf(stderr, "pthread_join: %s\n", strerror(res));
exit(EXIT_FAILURE);
}
printf("PTHREAD %d ended, with value %lu\n",
joined->id, joined->result);
}
} /* main */
Linux and all multithreaded systems work the same way: they create a new execution unit (if the two units don't share the virtual address space, they are both processes -- not exactly so, but this explains the main difference between a process and a thread), and the available processors are given to each thread as necessary. Threads are normally encapsulated inside processes (they share the process id -- though not on old Linux implementations, if that has not changed recently -- and virtual memory). Processes each run in a separate virtual address space, so they can only share things through system resources (files, shared memory, communication sockets/pipes, etc.).
The problem with your test case (you don't show it, so I have to guess) is that you probably make all threads loop trying to print something. If you do that, most of the time each thread is probably blocked trying to do I/O (to printf() something).
Stdio FILEs have the problem that they share a buffer between all threads that want to print to the same FILE, and the kernel serializes all write(2) system calls to the same file descriptor. So if most of the time in the loop is spent blocked in a write, the kernel (and stdio) end up serializing all the calls to print, making it appear that only one thread is working at a time (all the threads become blocked behind the one doing the I/O). The busy loop above instead lets all the threads run in parallel and shows you how the CPU is saturated.
Parallelism in C can be achieved with the fork() function. fork() creates a new process that is a copy of the caller; the two processes then run simultaneously, each in its own address space, and the kernel is free to schedule them on different cores.
Because the processes do not share memory after the fork, data must be exchanged through pipes, shared memory, or the child's exit status. Use the wait() function in the parent before reading any results: wait() blocks execution of the current program until a child process terminates.
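A minimal sketch of that idea (the workload is a placeholder; each child is an independent process the kernel can schedule on its own core):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 4

int main(void)
{
    for (int i = 0; i < NPROC; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: do an independent chunk of CPU-bound work */
            unsigned long sum = 0;
            for (unsigned long j = 0; j < 500000000UL; ++j)
                sum += j;
            _exit(sum & 0xff);  /* exit status as a tiny result channel */
        }
    }
    /* parent: wait() blocks until each child terminates */
    for (int i = 0; i < NPROC; ++i)
        wait(NULL);
    return 0;
}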
I am writing a data acquisition program that needs to:
wait for serial with select()
read serial data (RS232 at 115200 baud),
timestamp it (clock_gettime()),
read an ADC on SPI,
interpret it,
send new data over another tty device
loop and repeat
The ADC is irrelevant for now.
At the end of the loop I use select() again with a 0 timeout to poll and see whether data is available already; if it is, it means I have overrun, i.e. I expect the loop to end before more data arrives, and the select() at the start of the loop to block and get it as soon as it arrives.
The data should arrive every 5ms, my first select() timeout is calculated as (5.5ms - loop time) - which should be about 4ms.
I get no timeouts but many overruns.
Examining the timestamps reveals that select() blocks for longer than the timeout (but still returns > 0).
It looks like select() returns late after getting data before timeout.
This might happen 20 times in 1000 repeats.
What might be the cause? How do I fix it?
EDIT:
Here is a cut-down version of the code (I do much more error checking than this!):
#include <bcm2835.h> /* for bcm2835_init(), bcm2835_close() */
int main(int argc, char **argv){
int err = 0;
/* Set real time priority SCHED_FIFO */
struct sched_param sp;
sp.sched_priority = 30;
if ( pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) ){
perror("pthread_setschedparam():");
err = 1;
}
/* 5ms between samples on /dev/ttyUSB0 */
int interval = 5;
/* Setup tty devices with termios, both totally uncooked, 8 bit, odd parity, 1 stop bit, 115200baud */
int fd_wc=setup_serial("/dev/ttyAMA0");
int fd_sc=setup_serial("/dev/ttyUSB0");
/* Setup GPIO for SPI, SPI mode, clock is ~1MHz which equates to more than 50ksps */
bcm2835_init();
setup_mcp3201spi();
int collecting = 1;
struct timespec starttime;
struct timespec time;
struct timespec ftime;
ftime.tv_nsec = 0;
fd_set readfds;
int countfd;
struct timeval interval_timeout;
struct timeval notime;
uint16_t p1;
float w1;
uint8_t *datap = malloc(8);
int data_size;
char output[25];
clock_gettime(CLOCK_MONOTONIC, &starttime);
while ( !err && collecting ){
/* Set timeout to (5*1.2)ms - (looptime)ms, or 0 if looptime was longer than (5*1.2)ms */
interval_timeout.tv_sec = 0;
interval_timeout.tv_usec = interval * 1200 - ftime.tv_nsec / 1000;
interval_timeout.tv_usec = (interval_timeout.tv_usec < 0)? 0 : interval_timeout.tv_usec;
FD_ZERO(&readfds);
FD_SET(fd_wc, &readfds);
FD_SET(0, &readfds); /* so that we can quit, code not included */
if ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &interval_timeout))<0 ){
perror("select()");
err = 1;
} else if (countfd == 0){
printf("Timeout on select()\n");
fflush(stdout);
err = 1;
} else if (FD_ISSET(fd_wc, &readfds)){
/* timestamp for when data is just available */
clock_gettime(CLOCK_MONOTONIC, &time);
if (starttime.tv_nsec > time.tv_nsec){
time.tv_nsec = 1000000000 + time.tv_nsec - starttime.tv_nsec;
time.tv_sec = time.tv_sec - starttime.tv_sec - 1;
} else {
time.tv_nsec = time.tv_nsec - starttime.tv_nsec;
time.tv_sec = time.tv_sec - starttime.tv_sec;
}
/* get ADC value, which is sampled fast so corresponds to timestamp */
p1 = getADCvalue();
/* receive_frame, receiving is slower so do it after getting ADC value. It is timestamped anyway */
/* This function consists of a loop that gets data from serial 1 byte at a time until a 'frame' is collected. */
/* it uses select() with a very short timeout (enough for 1 byte at baudrate) just to check comms are still going */
/* It never times out and behaves well */
/* The interval_timeout is passed because it is used as a timeout for responding an ACK to the device */
/* That select also never times out */
receive_frame(&datap, fd_wc, &data_size, interval_timeout.tv_sec, interval_timeout.tv_usec);
/* do stuff with it */
/* This takes most of the time in the loop, about 1.3ms at 115200 baud */
snprintf(output, 24, "%ld.%04ld,%u,%.2f\n", (long)time.tv_sec, time.tv_nsec/100000, (unsigned)p1, w1);
write(fd_sc, output, strnlen(output, 23));
/* Check how long the loop took (minus the polling select() that follows */
clock_gettime(CLOCK_MONOTONIC, &ftime);
if ((time.tv_nsec+starttime.tv_nsec) > ftime.tv_nsec){
ftime.tv_nsec = 1000000000 + ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec - 1;
} else {
ftime.tv_nsec = ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec;
}
/* Poll with 0 timeout to check that data hasn't arrived before we're ready yet */
FD_ZERO(&readfds);
FD_SET(fd_wc, &readfds);
notime.tv_sec = 0;
notime.tv_usec = 0;
if ( !err && ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &notime)) < 0 )){
perror("select()");
err = 1;
} else if (countfd > 0){
printf("OVERRUN!\n");
snprintf(output, 25, ",,,%ld.%04ld\n\n", (long)ftime.tv_sec, ftime.tv_nsec/100000);
write(fd_sc, output, strnlen(output, 24));
}
}
}
return 0;
}
The timestamps I see on the serial stream that I output is fairly regular (a deviation is caught up by the next loop usually). A snippet of output:
6.1810,0,225.25
6.1867,0,225.25
6.1922,0,225.25
6.2063,0,225.25
,,,0.0010
Here, up to 6.1922 s everything is OK. The next sample is 6.2063 -- 14.1 ms after the last -- but it didn't time out, nor did the previous loop from 6.1922 to 6.2063 catch the overrun with the polling select(). My conclusion is that the last loop was within the sampling time and select() took ~10 ms too long to return without timing out.
The ,,,0.0010 indicates the loop time (ftime) of the following loop -- I should really be checking what the loop time was when it went wrong. I'll try that tomorrow.
The timeout passed to select is a rough lower bound — select is allowed to delay your process for slightly more than that. In particular, your process will be delayed if it is preempted by a different process (a context switch), or by interrupt handling in the kernel.
Here is what the Linux manual page has to say on the subject:
Note that the timeout interval will be rounded up to the system clock
granularity, and kernel scheduling delays mean that the blocking
interval may overrun by a small amount.
And here's the POSIX standard:
Implementations may
also place limitations on the granularity of timeout intervals. If the
requested timeout interval requires a finer granularity than the
implementation supports, the actual timeout interval shall be
rounded up to the next supported value.
Avoiding that is difficult on a general purpose system. You will get reasonable results, especially on a multi-core system, by locking your process in memory (mlockall) and setting your process to a real-time priority (use sched_setscheduler with SCHED_FIFO, and remember to sleep often enough to give other processes a chance to run).
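Those two calls might look roughly like this (a sketch; the priority value is illustrative, and both calls typically require root or suitable rlimits):

#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* lock all current and future pages to avoid page-fault latency */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* real-time FIFO scheduling; priority range is 1..99, keep it modest */
    struct sched_param sp = { .sched_priority = 30 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... real-time loop here; block or sleep regularly so other
       processes are not starved ... */
    return 0;
}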
A much more difficult approach is to use a real-time microcontroller that is dedicated to running the real-time code. Some people claim to reliably sample at 20MHz on fairly cheap hardware using that technique.
If values for struct timeval are set to zero, then select will not block, but if timeout argument is a NULL pointer, it will...
If the timeout argument is not a NULL pointer, it points to an object of type struct timeval that specifies a maximum interval to
wait for the selection to complete. If the timeout argument points to
an object of type struct timeval whose members are 0, select() does
not block. If the timeout argument is a NULL pointer, select()
blocks until an event causes one of the masks to be returned with
a valid (non-zero) value or until a signal occurs that needs to be
delivered. If the time limit expires before any event occurs that
would cause one of the masks to be set to a non-zero value, select()
completes successfully and returns 0.
Read more in the POSIX specification for select().
EDIT to address comments, and add new information:
A couple of noteworthy points.
First - in the comments, there is a suggestion to add sleep() to your worker loop. This is a good suggestion. The reasons stated here, although dealing with thread entry points, still apply, since you are instantiating a continuous loop.
Second - Linux select() is a system call with an interesting implementation history, and as such has a range of varying behaviours from implementation to implementation, some of which may contribute to the unexpected behaviour you are seeing. I am not sure which of the major lineages of Linux Arch Linux comes from, but the man7.org page for select() includes the following two segments, which per your description appear to describe conditions that could possibly contribute to the delays you are experiencing.
Bad checksum:
Under Linux, select() may report a socket file descriptor as "ready
for reading", while nevertheless a subsequent read blocks. This could
for example happen when data has arrived but upon examination has wrong
checksum and is discarded.
Race condition: (introduces and discusses pselect())
...Suppose the signal handler sets a global flag and returns. Then a test
of this global flag followed by a call of select() could hang indefinitely
if the signal arrived just after the test but just before the call...
Given the description of your observations, and depending on how your version of Linux is implemented, either one of these implementation features may be a possible contributor.
I'd like to create a multi-threaded program in C (Linux) with:
An infinite loop with an infinite number of tasks
One thread per task
A limit on the total number of threads, so that if, for instance, the total number of threads is more than MAX_THREADS_NUMBER, it sleeps until the total number of threads becomes less than MAX_THREADS_NUMBER, then continues.
To summarize: I need to perform an infinite number of tasks (one task per thread) and I'd like to know how to implement that using pthreads in C.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#define MAX_THREADS 50
pthread_t thread[MAX_THREADS];
int counter;
pthread_mutex_t lock;
void* doSomeThing(void *arg)
{
pthread_mutex_lock(&lock);
counter += 1;
printf("Job %d started\n", counter);
pthread_mutex_unlock(&lock);
return NULL;
}
int main(void)
{
int i = 0;
int ret;
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
// Wait all threads to finish
for (i = 0; i < MAX_THREADS; i++) {
pthread_join(thread[i], NULL);
}
pthread_mutex_destroy(&lock);
return 0;
}
How to make this loop infinite?
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
I need something like this:
while (1) {
if (thread_number > MAX_THREADS_NUMBER)
sleep(1);
ret = pthread_create(...);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
Your current program is based on a simple dispatch design: the initial thread creates worker threads, assigning each one a task to perform. Your question is how to make this work for any number of tasks and any number of worker threads. The answer is: you don't, because your chosen design makes such a modification basically impossible.
Even if I were to answer your stated questions, it would not make the program behave the way you'd like. It might work after a fashion, but it'd be like a bicycle with square wheels: not very practical, nor robust -- not even fun after you stop laughing at how silly it looks.
The solution, as I wrote in a comment to the original question, is to change the underlying design: from a simple dispatch to a thread pool approach.
Implementing a thread pool requires two things: First, is to change your viewpoint from starting a thread and having it perform a task, to each thread in the "pool" grabbing a task to perform, and returning to the "pool" after they have performed it. Understanding this is the hard part. The second part, implementing a way for each thread to grab a new task, is simple: this typically centers around a data structure, protected with locks of some sort. The exact data structure does depend on what the actual work to do is, however.
Let's assume you wanted to parallelize the calculation of the Mandelbrot set (or rather, the escape time, or the number of iterations needed before a point can be ruled to be outside the set; the Wikipedia page contains pseudocode for exactly this). This is one of the "embarrassingly parallel" problems; those where the sub-problems (here, each point) can be solved without any dependencies.
Here's how I'd do the core of the thread pool in this case. First, the escape time or iteration count needs to be recorded for each point. Let's say we use an unsigned int for this. We also need the number of points (it is a 2D array), a way to calculate the complex number that corresponds to each point, plus some way to know which points have either been computed, or are being computed. Plus mutually exclusive locking, so that only one thread will modify the data structure at once. So:
typedef struct {
int x_size, y_size;
size_t stride;
double r_0, i_0;
double r_dx, i_dx;
double r_dy, i_dy;
unsigned int *iterations;
sem_t done;
pthread_mutex_t mutex;
int x, y;
} fractal_work;
When an instance of fractal_work is constructed, x_size and y_size are the number of columns and rows in the iterations map. The number of iterations (or escape time) for point x,y is stored in iterations[x+y*stride]. The real part of the complex coordinate for that point is r_0 + x*r_dx + y*r_dy, and imaginary part is i_0 + x*i_dx + y*i_dy (which allows you to scale and rotate the fractal freely).
When a thread grabs the next available point, it first locks the mutex, and copies the x and y values (for itself to work on). Then, it increases x. If x >= x_size, it resets x to zero, and increases y. Finally, it unlocks the mutex, and calculates the escape time for that point.
However, if x == 0 && y >= y_size, the thread posts on the done semaphore and exits, letting the initial thread know that the fractal is complete. (The initial thread just needs to call sem_wait() once for each thread it created.)
The thread worker function is then something like the following:
void *fractal_worker(void *data)
{
fractal_work *const work = (fractal_work *)data;
int x, y;
while (1) {
pthread_mutex_lock(&(work->mutex));
/* No more work to do? */
if (work->x == 0 && work->y >= work->y_size) {
sem_post(&(work->done));
pthread_mutex_unlock(&(work->mutex));
return NULL;
}
/* Grab this task (point), advance to next. */
x = work->x;
y = work->y;
if (++(work->x) >= work->x_size) {
work->x = 0;
++(work->y);
}
pthread_mutex_unlock(&(work->mutex));
/* z.r = work->r_0 + (double)x * work->r_dx + (double)y * work->r_dy;
z.i = work->i_0 + (double)x * work->i_dx + (double)y * work->i_dy;
TODO: implement the fractal iteration,
and count the iterations (say, n)
save the escape time (number of iterations)
in the work->iterations array; e.g.
work->iterations[(size_t)x + work->stride*(size_t)y] = n;
*/
}
}
The program first creates the fractal_work data structure for the worker threads to work on, initializes it, then creates some number of threads giving each thread the address of that fractal_work structure. It can then call fractal_worker() itself too, to "join the thread pool". (This pool automatically "drains", i.e. threads will return/exit, when all points in the fractal are done.)
Finally, the main thread calls sem_wait() on the done semaphore, as many times as it created worker threads, to ensure all the work is done.
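Putting it together, the setup could look roughly like this (a sketch only, assuming the definitions above plus the usual pthread/semaphore/stdlib includes; the fractal parameters and thread count are placeholders, and error checking is omitted):

#define NUM_THREADS 4

int main(void)
{
    fractal_work work = {
        .x_size = 1024, .y_size = 768, .stride = 1024,
        .r_0 = -2.0, .i_0 = -1.2,
        .r_dx = 3.0 / 1024, .i_dx = 0.0,
        .r_dy = 0.0, .i_dy = 2.4 / 768,
        .x = 0, .y = 0,
    };
    work.iterations = calloc((size_t)work.x_size * work.y_size,
                             sizeof *work.iterations);
    sem_init(&work.done, 0, 0);
    pthread_mutex_init(&work.mutex, NULL);

    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, fractal_worker, &work);

    fractal_worker(&work);              /* main thread joins the pool */

    /* one sem_wait() per created worker, then reap them */
    for (int i = 0; i < NUM_THREADS; i++)
        sem_wait(&work.done);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    /* work.iterations now holds the escape times */
    free(work.iterations);
    return 0;
}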
The exact fields in the fractal_work structure above do not matter. However, it is at the very core of the thread pool. Typically, there is at least one mutex or rwlock protecting the work details, so that each worker thread gets unique work details, as well as some kind of flag or condition variable or semaphore to let the original thread know that the task is now complete.
In a multithreaded server, there is usually only one instance of the structure (or variables) describing the work queue. It may even contain things like minimum and maximum number of threads, allowing the worker threads to control their own number to dynamically respond to the amount of work available. This sounds magical, but is actually simple to implement: when a thread has completed its work, or is woken up in the pool with no work, and is holding the mutex, it first examines how many queued jobs there are, and what the current number of worker threads is. If there are more than the minimum number of threads, and no work to do, the thread reduces the number of threads, and exits. If there are less than the maximum number of threads, and there is a lot of work to do, the thread first creates a new thread, then grabs the next task to work on. (Yes, any thread can create new threads into the process. They are all on equal footing, too.)
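In rough outline, with hypothetical field names, the decision a worker makes while holding the queue mutex is just (a fragment, not a complete implementation):

/* inside a worker that holds queue->mutex (hypothetical fields) */
if (queue->njobs == 0 && queue->nthreads > queue->min_threads) {
    queue->nthreads--;                     /* shrink the pool */
    pthread_mutex_unlock(&queue->mutex);
    return NULL;                           /* this worker exits */
}
if (queue->njobs > 1 && queue->nthreads < queue->max_threads) {
    pthread_t tid;                         /* any thread may grow the pool */
    queue->nthreads++;
    pthread_create(&tid, NULL, worker_main, queue);
}
/* ...grab the next task, then unlock the mutex... */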
A lot of the code in a practical multithreaded application using one or more thread pools to do work is some sort of bookkeeping. Thread pool approaches concentrate very much on the data, and the computation needed to be performed on the data. I'm sure there are much better examples of thread pools out there somewhere; the hard part is to think of a good task for the application to perform, as the data structures are so task-dependent, and many computations are so simple that parallelizing them makes no sense (since creating new threads does have a small computational cost, it'd be silly to waste time creating threads when a single thread does the same work in the same or less time).
Many tasks that benefit from parallelization, on the other hand, require information to be shared between workers, and that requires a lot of thinking to implement correctly. (For example, although solutions exist for parallelizing molecular dynamics simulations efficiently, most simulators still calculate and exchange data in separate steps, rather than at the same time. It's just that hard to do right, you see.)
All this means that you cannot expect to be able to write the code unless you understand the concept. Indeed, truly understanding the concepts is the hard part: writing the code is comparatively easy.
Even in the above example, there are certain tripping points: Does the order of posting the semaphore and releasing the mutex matter? (Well, it depends on what the thread that is waiting for the fractal to complete does -- and indeed, if it is waiting yet.) If it was a condition variable instead of a semaphore, it would be essential that the thread that is interested in the fractal completion is waiting on the condition variable, otherwise it would miss the signal/broadcast. (This is also why I used a semaphore.)
TL;DR I need to emulate a timer in C that allows concurrent writes and reads, whilst preserving constant decrements at 60 Hz (not exactly, but approximately accurate). It will be part of a Linux CHIP8 emulator. Using a thread-based approach with shared memory and semaphores raises some accuracy problems, as well as race conditions depending on how the main thread uses the timer.
Which is the best way to devise and implement such a timer?
I am writing a Linux CHIP8 interpreter in C, module by module, in order to dive into the world of emulation.
I want my implementation to be as accurate as possible with the specifications. In that matter, timers have proven to be the most difficult modules for me.
Take for instance the delay timer. In the specifications it is a "special" register, initially set to 0. There are specific opcodes that set a value to, and get it from, the register.
If a value different from zero is entered into the register, it will automatically start decrementing itself, at a frequency of 60 Hz, stopping once zero is reached.
My idea regarding its implementation consists of the following:
The use of an ancillary thread that does the decrementing automatically, at a frequency of nearly 60 Hz, by using nanosleep(). I use fork() to create the 'thread' for the time being.
The use of shared memory via mmap() in order to allocate the timer register and store its value on it. This approach allows both the ancillary and the main thread to read from and write to the register.
The use of a semaphore to synchronise the access for both threads. I use sem_open() to create it, and sem_wait() and sem_post() to lock and unlock the shared resource, respectively.
The following code snippet illustrates the concept:
void *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
/* Error checking here */
sem_t *mutex = sem_open("timersem", O_CREAT, O_RDWR, 1);
/* Error checking and unlinking */
int *val = (int *) p;
*val = 120; // 2-second delay
pid_t pid = fork();
if (pid == 0) {
// Child process
while (*val > 0) { // Possible race condition
sem_wait(mutex); // Possible loss of frequency depending on main thread code
--(*val); // Safe access
sem_post(mutex);
/* Here it goes the nanosleep() */
}
} else if (pid > 0) {
// Parent process
if (*val == 10) { // Possible race condition
sem_wait(mutex);
*val = 50; // Safe access
sem_post(mutex);
}
}
A potential problem I see with this implementation relates to the third point. If the main program happens to update the timer register while it holds a value different from zero, the ancillary thread must not wait for the main thread to unlock the resource, or else the 60 Hz cadence will not be met. This implies both threads may freely update and/or read the register (constant writes in the case of the ancillary thread), which obviously introduces race conditions.
Once I have explained what I am doing and what I try to achieve, my question is this:
Which is the best way to devise and emulate a timer that allows concurrent writes and reads, whilst preserving an acceptable fixed frequency?
Don't use threads and synchronization primitives (semaphores, shared memory, etc) for this. In fact, I'd go as far as to say: don't use threads for anything unless you explicitly need multi-processor concurrency. Synchronization is difficult to get right, and even more difficult to debug when you get it wrong.
Instead, figure out a way to implement this in a single thread. I'd recommend one of two approaches:
Keep track of the time the last value was written to the timer register. When reading from the register, calculate how long ago it was written to, and subtract an appropriate value from the result.
Keep track of how many instructions are being executed overall, and subtract 1 from the timer register every N instructions, where N is a large number such that N instructions is about 1/60 second.
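A minimal sketch of the first approach (names and the choice of CLOCK_MONOTONIC are illustrative; CHIP8's delay register is 8 bits wide):

#include <stdint.h>
#include <time.h>

static struct timespec write_time;   /* when the register was last set */
static uint8_t written_value;        /* the value that was written */

void timer_write(uint8_t v)
{
    written_value = v;
    clock_gettime(CLOCK_MONOTONIC, &write_time);
}

uint8_t timer_read(void)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    /* elapsed 60 Hz ticks since the last write */
    uint64_t ns = 1000000000ULL * (now.tv_sec - write_time.tv_sec)
                + (now.tv_nsec - write_time.tv_nsec);
    uint64_t ticks = ns * 60 / 1000000000ULL;

    return (ticks >= written_value) ? 0 : (uint8_t)(written_value - ticks);
}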
I'm trying to port u8glib (a graphics library) to a MIPS processor on an OpenWrt router.
Here's an example in an ARM environment.
As such, one of the routines I must implement is:
delay_micro_seconds(uint32_t us)
Since this is a high resolution unit of time, how can I do this reliably in my environment? I've tried the following, but I'm not even sure how to validate it:
nanosleep((struct timespec[]){{0, 1000}}, NULL);
How can I validate this approach? If it's a bad approach, how could I reliably delay by 1 microsecond in C?
EDIT: I've tried this, but I'm getting strange output. I expect the difference between the two prints to be 1000 ns * 10 iterations = 10,000 ns, but it is actually closer to 670,000 nanoseconds:
int main(int argc, char **argv)
{
long res, resb;
struct timespec ts, tsb;
int i;
res = clock_gettime(CLOCK_REALTIME, &ts);
for(i=0;i<10;i++){
nanosleep((struct timespec[]){{0,1000}}, NULL);
}
resb = clock_gettime(CLOCK_REALTIME, &tsb);
if (0 == res) printf("%ld %ld\n", ts.tv_sec, ts.tv_nsec);
else perror("clock_gettime");
if (0 == resb) printf("%ld %ld\n", tsb.tv_sec, tsb.tv_nsec);
else perror("clock_gettime"); //getting 670k delta instead of 10k for tv_nsec
return 0;
}
I assume that your code runs under Linux.
First, use clock_getres(2) to find the clock's resolution. Then, using clock_nanosleep(2) might give a more accurate sleep.
To validate the sleep, I suggest you check the elapsed time with clock_gettime(2):
res = clock_gettime(CLOCK_REALTIME, &ts);
if (0 == res) printf("%ld %ld\n", ts.tv_sec, ts.tv_nsec);
else perror("clock_gettime");
clock_nanosleep(CLOCK_REALTIME, 0, &delay, NULL);
res = clock_gettime(CLOCK_REALTIME, &ts);
if (0 == res) printf("%ld %ld\n", ts.tv_sec, ts.tv_nsec);
else perror("clock_gettime");
Also, if necessary, you can recompile your kernel with a higher HZ configuration.
It would be helpful to read the time(7) man page, especially the sections "The software clock, HZ, and jiffies" and "High-resolution timers".
Though my man page says that high-resolution timers are not supported on the mips architecture, I just googled it, and mips-linux apparently does support HRT.
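As an aside, sleeping to an absolute deadline with clock_nanosleep() avoids accumulating drift across loop iterations. A sketch (assumes _POSIX_C_SOURCE >= 200112L; initialize the deadline once with clock_gettime(CLOCK_MONOTONIC, &deadline) before the loop):

#include <time.h>

/* sleep until 'deadline', then advance it by 'period_ns' */
static void sleep_until(struct timespec *deadline, long period_ns)
{
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
    deadline->tv_nsec += period_ns;
    if (deadline->tv_nsec >= 1000000000L) {
        deadline->tv_nsec -= 1000000000L;
        deadline->tv_sec += 1;
    }
}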
Is this sleep synchronous? (I'm not familiar with your platform)
Validation
If you have development tools available, are there any profiling tools included? They can at least help you measure execution time.
You can also loop around your sleep call 1000+ times in a test program. Before you enter the loop, get a timestamp from the system clock. After the loop is done cycling, take another timestamp and compare it with the first. Remember the loop itself will have some time overhead, but otherwise this will let you know how accurately (overshoot or undershoot) your sleep performs. (If you cycle 1,000,000 times around a 1-microsecond sleep function, you would expect it to finish quite near to 1 second.)
Alternative
Sleep functions are not always perfect to their stated resolution, but they promise to be simple to use and always get you in the neighborhood of what they say. Many statements, such as a++;, run much quicker than a microsecond.
Using a similar method as above, you could make a homemade synchronous timer with good accuracy using a for loop with some pointless statement inside it. Once you find out how many iterations land you nearest to 1 microsecond, it should never change, and you could hardcode a function out of it.
If you intend your delay to be asynchronous, with a multi-tasking process in mind, this obviously would not cooperate well with the other tasks.
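For illustration, such a calibrated busy-wait might look like this (a sketch: LOOPS_PER_US is a placeholder that must be measured once on the target machine, and the volatile counter keeps the compiler from removing the empty loop):

#include <stdint.h>

/* number of loop iterations that take ~1 microsecond on the target;
   measure this once with clock_gettime() and hardcode the result */
#define LOOPS_PER_US 50UL   /* placeholder value */

static void delay_micro_seconds(uint32_t us)
{
    volatile unsigned long i;
    for (i = 0; i < us * LOOPS_PER_US; ++i)
        ;   /* busy-wait; volatile prevents the loop being optimized away */
}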