Implementing a WatchDog timer

I need to implement a timer that checks a condition every 35 seconds. My program uses IPC to pass information back and forth between client and server processes. The problem is that I call msgrcv() in a loop, and it blocks the loop until a message arrives. That is no good, because I need the timer to keep checking whether a client has stopped sending messages. (If it only checks when a message is received, it is useless.)
In short, I need a way to implement a watchdog timer that checks a condition every 35 seconds.
I currently have this code:
time_t start = time(NULL);
//Enter main processing loop
while(running)
{
size_t size = sizeof(StatusMessage) - sizeof(long);
if(msgrcv(messageID, &statusMessage, size, 0, 0) != -1)
{
printf("(SERVER) Message Data (ID #%ld) = %d : %s\n", statusMessage.machineID, statusMessage.status_code, statusMessage.status);
masterList->msgQueueID = messageID;
int numDCs = ++masterList->numberOfDCs;
masterList->dc[numDCs].dcProcessID = (pid_t) statusMessage.machineID;
masterList->dc[numDCs].lastTimeHeardFrom = 1000;
printf("%d\n", masterList->numberOfDCs);
}
printf("%.2f\n", (double) (time(NULL) - start));
}
The only problem, as I stated before, is that the code that checks how much time has passed won't be reached if no message comes in, because msgrcv() blocks the process.
I hope I am making sense and that someone can help me with this problem.

You may want to try msgctl(msqid, IPC_STAT, &buf), where buf is a struct msqid_ds. If the call succeeds, the current number of messages in the queue is in buf.msg_qnum, so you can check it before calling msgrcv() and avoid blocking on an empty queue. The caller needs read permission on the queue, which I think you have here.
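A minimal sketch of how this could fit into the server loop, not a drop-in fix. It assumes a message layout named status_msg (the real StatusMessage is not shown in the question), and it uses IPC_NOWAIT so that msgrcv() returns at once with ENOMSG when the queue is empty. The helper names are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define WATCHDOG_SECONDS 35   /* interval taken from the question */

/* Hypothetical message layout; the real StatusMessage is not shown. */
struct status_msg {
    long machine_id;          /* doubles as the System V message type */
    int  status_code;
};

/* Number of messages currently waiting in the queue, or -1 on error. */
long queued_messages(int msqid)
{
    struct msqid_ds info;
    if (msgctl(msqid, IPC_STAT, &info) == -1)
        return -1;
    return (long)info.msg_qnum;
}

/* True when a client has been silent for longer than the limit. */
int client_timed_out(time_t last_heard, time_t now)
{
    return now - last_heard > WATCHDOG_SECONDS;
}

/*
 * One pass of the server loop: take a message if one is waiting
 * (IPC_NOWAIT keeps msgrcv() from blocking), then report whether the
 * client has timed out. Returns 1 on timeout, 0 otherwise, -1 on error.
 */
int server_tick(int msqid, time_t *last_heard)
{
    struct status_msg msg;
    size_t size = sizeof msg - sizeof(long);

    if (msgrcv(msqid, &msg, size, 0, IPC_NOWAIT) != -1)
        *last_heard = time(NULL);      /* heard from the client */
    else if (errno != ENOMSG)
        return -1;                     /* a real error, not "queue empty" */

    return client_timed_out(*last_heard, time(NULL));
}
```

The main loop would call server_tick() and then sleep() for a second or so between passes, so the timeout check runs whether or not a message arrived.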

Related

Asynchronous I/O in C with a callback function -- exactly when and how does I/O thread return?

My question is about exactly when the I/O thread in an asynchronous I/O call returns when a call back function is involved. Specifically, given this very general code for reading a file ...
#include<stdio.h>
#include<aio.h>
...
// callback function:
void finish_aio(sigval_t sigval) {
/* do stuff ... maybe close the file */
}
int main() {
struct aiocb my_aiocb;
int aio_return;
...
//Open file, take care of any other prelims, then
//Fill in file-specific info for my_aiocb, then
//Fill in callback information for my_aiocb:
my_aiocb.aio_sigevent.sigev_notify = SIGEV_THREAD;
my_aiocb.aio_sigevent.sigev_notify_function = finish_aio;
my_aiocb.aio_sigevent.sigev_notify_attributes = NULL;
my_aiocb.aio_sigevent.sigev_value.sival_ptr = &info_on_file;
// then read the file:
aio_return = aio_read(&my_aiocb);
// do stuff that doesn't need data that is being read ...
// then block execution until read is complete:
while(aio_error(&my_aiocb) == EINPROGRESS) {}
// etc.
}
I understand that the callback function is called as soon as the read of the file is completed. But what exactly happens then? Does the I/O thread start running the callback finish_aio()? Or does it spawn a new thread to handle that callback, while it returns to the main thread? Another way to put this would be: When does aio_error(&my_aiocb) stop returning EINPROGRESS? Is it just before the call to the callback, or when the callback is completed?

I understand that the callback function is called as soon as the read of the file is completed. But what exactly happens then?
What happens is that when the IO finishes it "behaves as if" it started a new thread (similar to calling pthread_create(&ignored, NULL, finish_aio, &info_on_file)).
When does aio_error(&my_aiocb) stop returning EINPROGRESS?
I'd expect that aio_error(&my_aiocb) stops returning EINPROGRESS as soon as the IO finishes, then the system (probably the standard library) either begins creating a new thread to call finish_aio() or "unblocks" a "previously created without you knowing" thread. However, I don't think the exact order is documented anywhere ("implementation defined") because it doesn't make much sense to call aio_error(&my_aiocb) from anywhere other than the finish_aio() anyway.
More specifically; if you're using polling (my_aiocb.aio_sigevent.sigev_notify = SIGEV_NONE) then you'd repeatedly check aio_error(&my_aiocb) yourself and you can't care if you're notified before or after this happens because you're not notified at all; and if you aren't using polling you'd wait until you are notified (via. a new thread or a signal) that there's a reason to check aio_error(&my_aiocb).
In other words, your finish_aio() would look more like this:
void finish_aio(sigval_t sigval) {
    /* Assumes sigev_value.sival_ptr was set to &my_aiocb in main() */
    struct aiocb *my_aiocb = sigval.sival_ptr;
    int status;
    status = aio_error(my_aiocb);
    /* Figure out what to do (handle the error or handle the file's data) */
}
... and in main() the while(aio_error(&my_aiocb) == EINPROGRESS) loop (which may waste a huge amount of CPU time for nothing) would be deleted and/or replaced with something else (e.g. a pthread_cond_wait() that waits until the code in finish_aio() calls pthread_cond_signal() to tell the main thread it can continue).
To understand this, let's take a look at what pure polling would look like:
int main() {
struct aiocb my_aiocb;
int aio_return;
...
//Open file, take care of any other prelims, then
//Fill in file-specific info for my_aiocb, then
my_aiocb.aio_sigevent.sigev_notify = SIGEV_NONE; /* CHANGED! */
// my_aiocb.aio_sigevent.sigev_notify_function = finish_aio;
// my_aiocb.aio_sigevent.sigev_notify_attributes = NULL;
// my_aiocb.aio_sigevent.sigev_value.sival_ptr = &info_on_file;
// then read the file:
aio_return = aio_read(&my_aiocb);
// do stuff that doesn't need data that is being read ...
// then block execution until read is complete:
while(aio_error(&my_aiocb) == EINPROGRESS) {}
finish_aio(sigval_t sigval); /* ADDED! */
}
In this case it behaves almost the same as your original code, except that there's no extra thread (and you can't care if the "thread that doesn't exist" is started before or after aio_error(&my_aiocb) returns a value other than EINPROGRESS).
The problem with pure polling is that the while(aio_error(&my_aiocb) == EINPROGRESS) could waste a huge amount of CPU time constantly checking when nothing has happened yet.
The main purpose of using my_aiocb.aio_sigevent.sigev_notify = SIGEV_THREAD is to avoid wasting a possibly huge amount of CPU time polling when nothing changed (not forgetting that in some cases wasting CPU time polling like this can prevent other threads, including the finish_aio() thread, from getting CPU time). In other words, you want to delete the while(aio_error(&my_aiocb) == EINPROGRESS) loop, so you used SIGEV_THREAD so that you can delete that polling loop.
The new problem is that (if the main thread has to wait until the data is ready) you need some other way for the main thread to wait until the data is ready. However, typically it's not "the aio_read() completed" that you actually care about, it's something else. For example, maybe the raw file data is a bunch of values in a text file (like "12, 34, 56, 78") and you want to parse that data and create an array of integers, and want to notify the main thread that the array of integers is ready (and don't want to notify the main thread if you're starting to parse the file's data). It might be like:
int parsed_file_result = 0;
void finish_aio(sigval_t sigval) {
struct aiocb *my_aiocb = sigval.sival_ptr; /* assumes sival_ptr was set to &my_aiocb */
int status;
status = aio_error(my_aiocb);
close(my_aiocb->aio_fildes);
if(status == 0) {
/* Read was successful */
parsed_file_result = parse_file_data(); /* Create the array of integers */
} else {
/* Read failed, handle the error somehow */
parsed_file_result = -1; /* Tell main thread we failed to create the array of integers */
}
/* Tell the main thread it can continue somehow */
}
One of the best ways to tell the main thread it can continue (at the end of finish_aio()) is to use pthread condition variables (e.g. pthread_cond_signal() called at the end of finish_aio(), with pthread_cond_wait() in the main thread). In this case the main thread will simply block (the kernel/scheduler will not give it any CPU time until pthread_cond_signal() is called), so it wastes no CPU time polling.
Sadly, pthread condition variables aren't trivial (they require a mutex, initialization, etc.), and teaching their use in full is a little too far from the original topic. Fortunately, you shouldn't have much trouble finding a good tutorial elsewhere.
The important part is that if you used SIGEV_THREAD (so that you can delete that awful while(aio_error(&my_aiocb) == EINPROGRESS) polling loop) you're left with no reason to call aio_error(&my_aiocb) until after the finish_aio() has already been started; and no reason to care if aio_error(&my_aiocb) would've been changed (or not) before finish_aio() is started.
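For what it's worth, the condition-variable handshake mentioned above can be sketched roughly like this. It is an illustration only: an ordinary thread (fake_callback_thread, a made-up name) stands in for the thread that SIGEV_THREAD would start to run finish_aio(), and the result value 42 is arbitrary.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared state protected by the mutex. */
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;
static int done = 0;                 /* set once the result is ready */
static int parsed_file_result = 0;

/* Called at the end of the callback: publish the result, wake the waiter. */
void notify_done(int result)
{
    pthread_mutex_lock(&done_lock);
    parsed_file_result = result;
    done = 1;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&done_lock);
}

/* Called by the main thread: sleep (no polling) until notify_done() ran. */
int wait_until_done(void)
{
    int result;
    pthread_mutex_lock(&done_lock);
    while (!done)                    /* loop guards against spurious wakeups */
        pthread_cond_wait(&done_cond, &done_lock);
    result = parsed_file_result;
    pthread_mutex_unlock(&done_lock);
    return result;
}

/* Stand-in for the thread that SIGEV_THREAD would start for finish_aio(). */
void *fake_callback_thread(void *arg)
{
    (void)arg;
    notify_done(42);                 /* hypothetical result value */
    return NULL;
}
```

In the real program, finish_aio() would call notify_done() as its last step, and main() would call wait_until_done() where the polling loop used to be.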

C: Accessing or trying to alter a passed variable in a particular segment of a function involving message queues stops the process

I am trying to get processes to communicate and form groups through message queues.
In this case, a process forks into a number of children that communicate.
What bugs me is that, on entering a particular function that uses both the message queue and some variables I need to change, accessing the dereferenced values seems to stop the execution of the process (and of the others too, since it doesn't release the semaphore they share) until the parent process stops its children and starts doing something else.
struct msgbuf {
long mtype;
int mtext[2];
};
This is the msgbuf i'm using.
int checkMsg(int msgid, struct msgbuf messaggio, int voto, int * partner, int * took, int * rejects, int *pending){
...
while(msgrcv(msgid, &messaggio, sizeof(messaggio), getpid(), IPC_NOWAIT) != -1){
if(messaggio.mtext[0]==1){
printf("%s\n", strerror(errno)); // this is shown in the terminal
printf("%d\n", * partner); // this is not shown, blocking the execution
* partner = messaggio.mtext[1];
* took = 1;
* pending = *pending - 1;
refuseInvites(msgid, messaggio);
return 1;
}
...
}
...
}
And this is the problematic piece of code.
I tried changing the second printf to output *took, and it does, but then the execution stops again (I presume at the moment *partner is accessed). The value of partner should be 0 before entering the function, and messaggio.mtext[1] is a process ID.
I expect the function not to stop and to return successfully, maintaining the consistency of the data.
EDIT : The call to checkMsg() is this one:
taken = checkMsg(msgid, messaggio, voto, &partner, &took, &rejects, &pending);
partner, took, rejects and pending are set to 0 before the forking
EDIT 2: all the variables are of type int and the signature of the checkMsg is included in a header before the program starts.

Accuracy of clock_gettime() in a context switch scenario

I'm trying to roughly calculate the time of a thread context switch on a Linux system. I've written a program that uses pipes and multi-threading to achieve this. When I run the program, the calculated time is clearly wrong (see output below). I am unsure whether this is due to using the wrong clock_id for this procedure or to my implementation.
I have used sched_setaffinity() so that the program only runs on core 0. I've tried to leave as much fluff as possible out of the code so that it only measures the time of a context switch, so the child thread only writes a single character to the pipe and the parent does a 0-byte read.
I have a parent thread that creates one child thread, with a one-way pipe between them to pass data. The child thread runs a simple function that writes to a pipe.
void* thread_1_function()
{
write(fd2[1],"",sizeof(""));
}
while the parent thread creates the child thread, starts the time counter and then calls a read on the pipe that the child thread writes to.
int main(int argc, char *argv[])
{
//time struct declaration
struct timespec start,end;
//sets program to only use core 0
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(0,&cpu_set);
if((sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set) < 1))
{
int nproc = sysconf(_SC_NPROCESSORS_ONLN);
int k;
printf("Processor used: ");
for(k = 0; k < nproc; ++k)
{
printf("%d ", CPU_ISSET(k, &cpu_set));
}
printf("\n");
if(pipe(fd1) == -1)
{
printf("fd1 pipe error");
return 1;
}
//fail on file descriptor 2 fail
if(pipe(fd2) == -1)
{
printf("fd2 pipe error");
return 1;
}
pthread_t thread_1;
pthread_create(&thread_1, NULL, &thread_1_function, NULL);
pthread_join(thread_1,NULL);
int i;
uint64_t sum = 0;
for(i = 0; i < iterations; ++i)
{
//initalize clock start
clock_gettime(CLOCK_MONOTONIC, &start);
//wait for child thread to write to pipe
read(fd2[0],input,0);
//record clock end
clock_gettime(CLOCK_MONOTONIC, &end);
write(fd1[1],"",sizeof(""));
uint64_t diff;
diff = billion * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
diff = diff;
sum += diff;
}
The results i get while running this are typically in this manner:
3000
3000
4000
2000
12000
3000
5000
and so forth, when I inspect the time returned to the start and end timespec structs i see that tv_nsec seems to be a 'rounded' number as well:
start.tv_nsec: 714885000, end.tv_nsec: 714888000
Would this be caused by a clock_monotonic not being precise enough for what im attempting to measure, or some other problem that i'm overlooking?

i see that tv_nsec seems to be a 'rounded' number as well:
2626, 714885000, 2626, 714888000
Would this be caused by a clock_monotonic not being precise enough for
what im attempting to measure, or some other problem that i'm
overlooking?
Yes, that's a possibility. Every clock supported by the system has a fixed resolution. struct timespec is capable of supporting clocks with nanosecond resolution, but that does not mean that you can expect every clock to actually have such resolution. It looks like your CLOCK_MONOTONIC might have a resolution of 1 microsecond (1000 nanoseconds), but you can check that via the clock_getres() function.
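A small sketch of that check, in case it helps. The helper name clock_resolution_ns is made up for this example; clock_getres() itself is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Resolution of the given clock in nanoseconds, or -1 if the query fails. */
long long clock_resolution_ns(clockid_t id)
{
    struct timespec res;
    if (clock_getres(id, &res) == -1)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Printing clock_resolution_ns(CLOCK_MONOTONIC) on the machine in question shows whether the 1000 ns steps come from the clock itself.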
If it is available to you, then you might try CLOCK_PROCESS_CPUTIME_ID. It is possible that that would have higher resolution than CLOCK_MONOTONIC for you, but do note that single-microsecond resolution is pretty precise -- that's on the order of one tick per 3000 CPU cycles on a modern machine.
Even so, I see several possible problems with your approach:
Although you set your process to have affinity for a single CPU, that does not prevent the system from scheduling other processes on that CPU, too. Thus, unless you've taken additional measures, you can't be certain -- it's not even likely -- that every context switch away from one of your program's threads is to the other thread.
You start your second thread and then immediately join it. There is no more context switching between your threads after that, because your second thread no longer exists after being successfully joined.
read() with a count of 0 may or may not check for errors, and it certainly does not transfer any data. It is totally unclear to me why you identify the time for that call with the time for a context switch.
If a context switch does occur in the space you're timing, then at least two need to occur there -- away from your program and back to it. Also, you're measuring the time consumed by whatever else runs in the other context as well, not just the switch time. The 1000-nanosecond steps may thus reflect time slices, rather than switching time.
Your main thread is writing null characters to the write end of a pipe, but there does not appear to be anything reading them. If indeed there isn't then this will eventually fill up the pipe's buffer and block. The purpose is lost on me.

How to be notified when a thread has been terminated for some error

I am working on a program with a fixed number of threads in C using posix threads.
How can i be notified when a thread has been terminated due to some error?
Is there a signal to detect it?
If so, can the signal handler create a new thread to keep the number of threads the same?

Make the threads detached.
Get them to handle errors gracefully, i.e. release mutexes, close files, etc.
Then you will have no problems.
Perhaps send a SIGUSR1 signal to the main thread to tell it that things have gone pear-shaped.

Create your threads by passing the function pointers to an intermediate function. Start that intermediate function asynchronously and have it synchronously call the passed function. When the function returns or throws an exception, you can handle the results in any way you like.
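With pthreads, the intermediate function could look roughly like the sketch below. All names here (struct launch, trampoline, start_watched, sample_worker) are invented for illustration, and the "handling" is reduced to bumping a counter. Note that this only catches a normal return from the thread function, not pthread_exit() or cancellation.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Everything the intermediate function needs; all names are hypothetical. */
struct launch {
    void *(*fn)(void *);     /* the real thread function */
    void *arg;               /* its argument */
    atomic_int *finished;    /* incremented when fn returns */
};

/* Intermediate function: runs the real one, then records that it ended. */
static void *trampoline(void *p)
{
    struct launch l = *(struct launch *)p;
    free(p);
    void *result = l.fn(l.arg);
    atomic_fetch_add(l.finished, 1);   /* the place to notify a manager */
    return result;
}

/* Start fn(arg) in a new thread through the trampoline. Returns 0 on success. */
int start_watched(pthread_t *tid, void *(*fn)(void *), void *arg,
                  atomic_int *finished)
{
    struct launch *l = malloc(sizeof *l);
    if (l == NULL)
        return -1;
    l->fn = fn;
    l->arg = arg;
    l->finished = finished;
    int rc = pthread_create(tid, NULL, trampoline, l);
    if (rc != 0)
        free(l);
    return rc;
}

/* Example worker used for illustration only: returns its argument. */
void *sample_worker(void *arg)
{
    return arg;
}
```

Instead of incrementing a counter, the trampoline could just as well post to a queue or signal a condition variable that the thread manager waits on.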

With the latest inputs you've provided, I suggest you do something like this to get the number of threads a particular process has started-
#include <stdio.h>
#define THRESHOLD 50
int main(void)
{
    unsigned count = 0;
    FILE *a;
    a = popen("ps H `ps -A | grep a.out | awk '{print $1}'` | wc -l", "r");
    if (a == NULL)
    {
        printf("Error in executing command\n");
        return 1;   /* do not read from a NULL stream */
    }
    if (fscanf(a, "%u", &count) != 1)
        count = 0;
    pclose(a);      /* streams opened with popen() are closed with pclose() */
    if (count < THRESHOLD)
    {
        printf("Number of threads = %u\n", count - 1);
        // count - 1 in order to eliminate header.
        // count - 2 if you don't want to include the main thread
        /* Take action. May be start a new thread etc */
    }
    return 0;
}
Notes:
ps H displays all threads.
$1 prints the first column, which is where the PID is displayed on my system (Ubuntu). The column number might differ on other systems.
Replace a.out with your process name.
The backticks will evaluate the expression within them and give you the PID of your process. We are taking advantage of the fact that all POSIX threads will have same PID.

I doubt Linux would signal you when a thread dies or exits for any reason. You can do so manually though.
First, let's consider 2 ways for the thread to end:
It terminates itself
It dies
In the first method, the thread itself can tell someone (say the thread manager) that it is being terminated. The thread manager will then spawn another thread.
In the second method, a watchdog thread can keep track of whether the threads are alive or not. This is done more or less like this:
Thread:
while (do stuff)
this_thread->is_alive = true
work
Watchdog:
for all threads t
t->timeout = 0
while (true)
for all threads t
if t->is_alive
t->timeout = 0
t->is_alive = false
else
++t->timeout
if t->timeout > THRESHOLD
Thread has died! Tell the thread manager to respawn it
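The pseudocode above could be turned into C along these lines. This is only a sketch: the struct and function names are invented, the THRESHOLD value is arbitrary, and the respawn step is left as a comment. C11 atomics are used for the is_alive flag because it is written by one thread and read by another.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define THRESHOLD 3   /* missed checks before a worker counts as dead (assumed) */

struct worker_state {
    atomic_bool is_alive;   /* set by the worker on every loop iteration */
    int timeout;            /* consecutive checks without a heartbeat */
};

/* Worker side: call once per iteration of the work loop. */
void heartbeat(struct worker_state *w)
{
    atomic_store(&w->is_alive, true);
}

/* Watchdog side: one pass over all workers.
 * Returns how many workers are currently over the threshold. */
size_t watchdog_scan(struct worker_state *workers, size_t n)
{
    size_t dead = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Read and clear the flag in one atomic step. */
        if (atomic_exchange(&workers[i].is_alive, false)) {
            workers[i].timeout = 0;
        } else if (++workers[i].timeout > THRESHOLD) {
            ++dead;   /* here the thread manager would be told to respawn */
        }
    }
    return dead;
}
```

The watchdog thread would call watchdog_scan() at a fixed interval, for example once per second, so THRESHOLD effectively becomes a timeout in seconds.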

If for any reason one could not go for Ed Heal's "just work properly"-approach (which is my favorite answer to the OP's question, btw), the lazy fox might take a look at the pthread_cleanup_push() and pthread_cleanup_pop() macros, and think about including the whole thread function's body in between such two macros.

The clean way to know whether a thread is done is to call pthread_join() against that thread.
// int pthread_join(pthread_t thread, void **retval);
void *retval = NULL;
int r = pthread_join(that_thread_id, &retval);
... here you know that_thread_id returned ...
The problem with pthread_join() is, if the thread never returns (continues to run as expected) then you are blocked. That's therefore not very useful in your case.
However, you can check whether the thread can be joined (tryjoin) as follows. Note that pthread_tryjoin_np() is a GNU extension, and like the other pthread functions it returns the error number directly rather than setting errno:
// int pthread_tryjoin_np(pthread_t thread, void **retval);
void *retval = NULL;
int r = pthread_tryjoin_np(that_thread_id, &retval);
// here 'r' tells you whether the thread returned (joined) or not.
if(r == 0)
{
    // that_thread_id is done, create new thread here
    ...
}
else if(r != EBUSY)
{
    // react to "weird" errors... (perror() will not help here; use strerror(r))
}
// else -- thread is still running
There is also a timed join which will wait for the amount of time you specified, like a few seconds. Depending on the number of threads to check and if your main process just sits around otherwise, it could be a solution. Block on thread 1 for 5 seconds, then thread 2 for 5 seconds, etc. which would be 5,000 seconds per loop for 1,000 threads (about 85 minutes to go around all threads with the time it takes to manage things...)
There is sample code in the man page showing how to use the pthread_timedjoin_np() function. All you would have to do is put a for loop around it to check each one of your threads.
struct timespec ts;
int s;
...
if (clock_gettime(CLOCK_REALTIME, &ts) == -1) {
/* Handle error */
}
ts.tv_sec += 5;
s = pthread_timedjoin_np(thread, NULL, &ts);
if (s != 0) {
/* Handle error */
}
If your main process has other things to do, I would suggest you do not use the timed version and just go through all the threads as fast as you can.

ways of implementing timer in worker thread in C

I have a worker thread that gets work from pipe. Something like this
void *worker(void *param) {
while (!work_done) {
read(g_workfds[0], work, sizeof(work));
do_work(work);
}
}
I need to implement a 1 second timer in the same thread to do some book-keeping about the work. Following is what I have in mind:
void *worker(void *param) {
prev_uptime = get_uptime();
while (!work_done) {
// set g_workfds[0] as non-block
now_uptime = get_uptime();
if (now_uptime - prev_uptime > 1) {
do_book_keeping();
prev_uptime = now_uptime;
}
n = poll(g_workfds[0], 1000); // Wait for 1 second else timeout
if (n == 0) // timed out
continue;
read(g_workfds[0], work, sizeof(work));
do_work(work); // This can take more than 1 second also
}
}
I am using system uptime instead of system time because system time can get changed while this thread is running. I was wondering if there is any other better way to do this. I don't want to consider using another thread. Using alarm() is not an option as it already used by another thread in same process. This is getting implemented in Linux environment.

I agree with most of what webbi wrote in his answer. But there is one issue with his suggestion of using time instead of uptime. If the system time is updated "forward" it will work as intended. But if the system time is set back by say 30 seconds, then there will be no book keeping done for 30 seconds as (now_time - prev_time) will be negative (unless an unsigned type is used, in which case it will work anyway).
An alternative would be to use clock_gettime() with CLOCK_MONOTONIC as clockid ( http://linux.die.net/man/2/clock_gettime ). A bit messy if you don't need smaller time units than seconds.
Also, adding code to detect a backwards clock jump isn't hard either.
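If only second-level granularity is needed, the CLOCK_MONOTONIC variant can be wrapped in a couple of small helpers. This is a sketch; monotonic_seconds and interval_elapsed are names made up here, not anything from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds since an arbitrary fixed point; not affected by date changes. */
double monotonic_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* True when at least `interval` seconds passed since *prev; updates *prev. */
int interval_elapsed(double *prev, double interval)
{
    double now = monotonic_seconds();
    if (now - *prev < interval)
        return 0;
    *prev = now;
    return 1;
}
```

In the worker loop this would replace the get_uptime() comparison: initialize prev once with monotonic_seconds(), then call do_book_keeping() whenever interval_elapsed(&prev, 1.0) returns 1.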

I have found a better way, but it is Linux-specific: the timerfd_create() system call. It takes care of system time changes. Following is possible pseudo code:
void *worker(void *param) {
int timerfd = timerfd_create(CLOCK_MONOTONIC, 0); // Monotonic doesn't get affected by system time change
// set timerfd to non-block
timerfd_settime(timerfd, 1 second timer); // timer starts
while (!work_done) {
// set g_workfds[0] as non-block
n = poll(g_workfds[0] and timerfd, -1); // poll on both pipe and timerfd and wait indefinitely (a timeout of 0 would return immediately)
if (timerfd is readable)
do_book_keeping();
if (g_workfds[0] is readable) {
read(g_workfds[0], work, sizeof(work));
do_work(work); // This can take more than 1 second also
}
}
}
It seems cleaner, and read() on timerfd returns the number of timer expirations since the last read. That is quite useful when do_work() takes a long time, because do_book_keeping() expects to be called every second.

I found a few odd things in your code...
poll() takes 3 arguments, but you are passing 2. The first is an array of struct pollfd, the second is the number of structs in that array, and the third is the timeout in milliseconds.
Reference: http://linux.die.net/man/2/poll
Besides that, the workaround looks fine to me. It's not the best, of course, but it works without involving another thread or alarm().
You could use time rather than uptime. A change to the system date could cause one mistimed interval, but after that it keeps working, because the previous timestamp is updated and the loop goes on waiting 1 second, no matter what the time is.
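To illustrate the three-argument form of poll() described above, here is a minimal sketch. The helper names wait_readable and read_with_timeout are made up for this example.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int n = poll(&pfd, 1, timeout_ms);   /* array, element count, timeout */
    if (n == -1)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Poll, then read if data is available.
 * Returns bytes read, 0 on timeout (or end of file), -1 on error. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    int r = wait_readable(fd, timeout_ms);
    if (r <= 0)
        return r;
    return read(fd, buf, len);
}
```

In the worker loop, read_with_timeout(g_workfds[0], work, sizeof(work), 1000) would stand in for the separate poll() and read() calls.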
