Design considerations for an error logging/notification library in C

I'm working on several programs right now, and I've become frustrated with some of the haphazard ways I'm debugging my programs and logging errors. As such, I've decided to take a couple of days to write an error library that I can use across all of my programs. I do most of my development on Windows, making extensive use of the Windows API, but the key here is flexibility: the library needs to offer the programmer notification options in console apps and GUI apps, on both Windows and Unix-like environments.
My initial idea is a single library that conditionally includes the Windows or Unix headers via the preprocessor, based on the current environment. For example, in a Win32 GUI app, console error notification (while possible) isn't necessary; instead, a simple
MessageBox(hWndParent, TEXT("Some error message that makes sense in context"), TEXT("Application Name"), MB_ICONERROR | MB_OK);
would make the most sense. On the other hand, on Linux things are a little more complex:
GtkWidget *w;
w = gtk_message_dialog_new(pOwner, GTK_DIALOG_MODAL, GTK_MESSAGE_ERROR,
                           GTK_BUTTONS_OK, "%s", "Some error message");
gtk_window_set_title(GTK_WINDOW(w), "Application Name");
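Something like this single entry point is what I have in mind (a rough sketch; notify_error and the owner parameter are placeholders, not a final API):

#ifdef _WIN32
#include <windows.h>
#else
#include <gtk/gtk.h>
#endif

/* One call in the program, two notification paths chosen at compile time. */
void notify_error(void *owner, const char *title, const char *msg)
{
#ifdef _WIN32
    MessageBoxA((HWND)owner, msg, title, MB_ICONERROR | MB_OK);
#else
    GtkWidget *w = gtk_message_dialog_new(GTK_WINDOW(owner),
                                          GTK_DIALOG_MODAL,
                                          GTK_MESSAGE_ERROR,
                                          GTK_BUTTONS_OK, "%s", msg);
    gtk_window_set_title(GTK_WINDOW(w), title);
    gtk_dialog_run(GTK_DIALOG(w));
    gtk_widget_destroy(w);
#endif
}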
In either case, simple file logging with more information about the error (function, file, line, etc.) would also be useful for pinpointing the source and tracing the flow.
Moreover, trace logging should be possible: if that level of logging is activated, the library should be able to record the sequence of function calls in the program, not just the errors.
Thus, my initial plan is a library that incorporates all of these features, with minimal overhead by dint of preprocessor conditionals. I think it would make the most sense to break this up into a couple of structs (roughly sketched after the list):
The "main" struct, which is passed to every function in the main program requiring logging, and contains
A bitmask containing the status codes for which file logging is necessary (e.g., NOERROR | WARNING | CRITICALERROR or CRITICALERROR | CRASH)
A pointer to an "output" struct
A linked list of "error information" nodes
The "output" struct, which handles printing to file, displaying message boxes, and printing to the console
The "error information" struct, which contains information about the error (function, file, line, message, type, etc.)
These are just my initial thoughts about this library. Can any of you think of any more information that I should include? Another major issue for me is atomicity of error addition: it's likely that an error will be created by a different thread than the one logging it, so I need to make sure that creating and adding an error node is actually an atomic operation. Thus, mutexes would likely be how I'd go about synchronization.
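For instance, adding a node to the list from the sketch above might be guarded like this (pthreads shown for illustration; on Windows a CRITICAL_SECTION would do the same job):

#include <pthread.h>

static pthread_mutex_t err_lock = PTHREAD_MUTEX_INITIALIZER;

/* Prepend a prepared node under the lock so that concurrent threads
   can't corrupt the list. */
void err_add(err_state *st, err_info *node)
{
    pthread_mutex_lock(&err_lock);
    node->next = st->head;
    st->head   = node;
    pthread_mutex_unlock(&err_lock);
}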
Thanks for the help!

For anything short of CRITICALERROR | CRASH, where the app would be expected to continue after the logging call, it would be better to queue each log struct off to a logging thread (via a thread-safe producer-consumer queue; see the sketch after these lists) that performs the requested action(s). The logging thread would normally free the structs after handling them. Some advantages:
1) The action taken for each logging request is reduced to mallocing the struct, loading it, and pushing it onto the queue. If the disk is temporarily busy, has high latency because it's on a network, or becomes genuinely unavailable, the calling thread(s) will continue to run almost normally. With this, the set of apps that will fail just because logging has been turned on is reduced. A user with some intermittent problem that you cannot reproduce can be instructed to turn on the logger, with little chance that the logging will affect normal operations or, worse, introduce delays that cover up the bug.
2) For the same reasons as (1), adding or changing logger functionality, even at runtime, is much easier. For example, maybe you want to restrict the size of log files or roll over to a new date/timestamped log file every day. A 'normal' call would introduce a long delay into the calling thread while the old file was closed and the new one opened. If the logging is queued off, all you get is a temporary increase in the number of queued log structs.
3) Controlling the logging is easier. In a GUI app, the logger could have its own form where the logging options can be modified. You could have a 'New log file now' button which, when clicked, queued a 'LOGCONTROL' struct to the logging thread, along with all the other logging messages. When the thread gets it, it opens a new log file.
4) Forwarding the log messages is fairly easy. Maybe you want to watch the logged messages as well as write them to disk - queue up a 'LOGCONTROL' struct that instructs the thread to save a function pointer passed in the struct, and henceforth call this function with subsequent logging messages after writing them to disk. The function passed could queue the messages up to your GUI for display in a 'terminal'-type window (PostMessage on Windows; Qt etc. have similar functionality for passing data to the GUI). Sure, on *nix you could open a console window and 'tail -f' the log file, but that will not appear particularly elegant to a GUI user, is more difficult for users to manage, and is anyway not available as standard on Windows (how many users know how to copy-paste from a console window and email you the error message?).
Another possibility is that the logging thread might be instructed to stream the log text to a remote server - another 'LOGCONTROL' struct could pass the hostname/port to the logger thread. The temporary delays of opening the network connection to the server would not matter because of the queued communications.
5) 'Lazy writing' and other such performance enhancements become easier, but:
Disadvantages:
1) The main one is that when the log call returns to the requestor, the log operation has probably not yet happened. This is very bad news in the case of CRITICALERROR | CRASH, and can be unacceptable in some cases even with 'ordinary' logging of progress messages etc. There should be an option to bypass the queue in these cases and make a direct disk write/flush - fopen()/CreateFile() a separate 'Direct.log', append, write, flush, close. Slow - but secure, just in case the app explodes after the log call returns.
2) More complex, so more development, more conditionals, and a bigger API/include interface.
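For illustration, a minimal pthreads sketch of such a queue (condensed: no shutdown handling, and the log struct is reduced to a text field):

#include <pthread.h>
#include <stdlib.h>

typedef struct log_msg {
    char           *text;
    struct log_msg *next;
} log_msg;

static log_msg        *q_head, *q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Producer: malloc, load, push -- the only work done in the calling thread. */
void log_push(log_msg *m)
{
    m->next = NULL;
    pthread_mutex_lock(&q_lock);
    if (q_tail) q_tail->next = m; else q_head = m;
    q_tail = m;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* Consumer: the logging thread pops, writes, frees. */
void *log_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!q_head)
            pthread_cond_wait(&q_cond, &q_lock);
        log_msg *m = q_head;
        q_head = m->next;
        if (!q_head) q_tail = NULL;
        pthread_mutex_unlock(&q_lock);

        /* ... write m->text to file/console/GUI here ... */
        free(m->text);
        free(m);
    }
    return NULL;
}

The CRITICALERROR | CRASH bypass from disadvantage (1) would simply skip log_push() and write/flush directly.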
Rgds,
Martin

Hi, I use this with another language, but you could research it and follow its design:
http://www.gurock.com/smartinspect/
Regards

Related

Logging Events---Looking For a Good Way

I created a program that monitors for events.
I want to log these events "in the right way".
Currently I have a string array, log[500][100].
Each line is a string of characters (up to 100) that report something about the event.
I have it set up so that only the last 500 events are saved in the array.
After that, new events overwrite the oldest events.
Currently I just keep revolving through the array until the program terminates, then I write the array to a file.
Going forward I would like to view the log in real time, any time I wish, without disturbing the event processing and logging process.
I considered opening the file for "appending" but here are my concerns:
(1) The program is running on a Raspberry Pi which has a flash memory as a "disk drive". I believe flash memories have a limited number of write cycles before problems can occur. This program runs 24/7 "forever" so I am afraid the "disk drive" will "wear out".
(2) I am using pretty much all the CPU capacity of the RPi so I don't want to add a lot of overhead/CPU cycles.
How would experienced programmers attack this problem?
Please go easy on me, this is my first C program.
[EDIT]
I began reviewing all the information and became intrigued by Mark A's suggestion of tmpfs. I looked into it more, and I am sure this answers my question. It allows the creation of files in RAM rather than on the SD card. They are lost on power-down, but I don't care.
In order to keep the files from growing too large, I created a double-buffer approach. First I write 500 events to file A, then switch to file B. When 500 events have been written to file B, I close and reopen file A (to delete the contents and start at 0 events) and switch back to writing to file A. I found I needed to fflush() after each write, or else the file was empty until fclose().
Normally that would be OK, but right now I am fighting a nasty segmentation fault, so I want as much insight as possible into what is going on. When I hit the fault, I never get to my fclose() statements.
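For reference, my double-buffer logic looks roughly like this (simplified; the file names are placeholders):

#include <stdio.h>

#define EVENTS_PER_FILE 500

static FILE *logf;
static int   which, count;

/* Write one event line, swapping between two tmpfs files every
   EVENTS_PER_FILE events so neither file grows without bound. */
void log_event(const char *line)
{
    if (!logf || count == EVENTS_PER_FILE) {
        if (logf)
            fclose(logf);
        which = !which;
        /* "w" truncates, so the older file restarts at 0 events */
        logf = fopen(which ? "/tmp/log_b.txt" : "/tmp/log_a.txt", "w");
        count = 0;
    }
    if (logf) {
        fprintf(logf, "%s\n", line);
        fflush(logf);   /* otherwise the file looks empty until fclose */
        count++;
    }
}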
Welcome to Stack Overflow and to C programming! A wonderful world of possibilities awaits you.
I have several thoughts in response to your situation.
The very short summary is to use stdout and delegate the output-file management to the shell.
The longer, rambling answer full of my personal musing is as follows:
1 : A very typical thing for C programs to do is to not be in charge of where their output goes. You might have heard of the "built-in" file handles stdin, stdout, and stderr. These file handles are (under normal circumstances) always available to your program for input (from stdin) and output (stdout and stderr). As you might guess from their names, stdout is customarily used for regular output and stderr for error/exception output. It is exceedingly typical for a C program to simply read from stdin and write to stdout and stderr, and let something else (e.g., the shell) take care of what those actually are.
For example, reading from stdin means that your program can be used for keyboard entry and for file reading, without needing to change your program's code. The same goes for stdout and stderr; simply output to those file handles, and let the user decide whether those should go to the screen or be redirected to a file. And, because stdout and stderr are separate file handles, the user can have them go to separate 'destinations'.
In your case, to implement this, drop the array entirely, and simply
fprintf(stdout, "event notice : %s\n", eventdetailstring);
(or similar) every time your program has something to say. Take a look at fflush(), too, because of potential output buffering.
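For instance (a minimal sketch; setvbuf() line buffering is one way to tame the buffering):

#include <stdio.h>

int main(void)
{
    /* Line-buffer stdout so each event appears as soon as it is
       printed, even when redirected to a file. */
    setvbuf(stdout, NULL, _IOLBF, 0);

    const char *eventdetailstring = "example event";   /* placeholder */
    fprintf(stdout, "event notice : %s\n", eventdetailstring);
    fflush(stdout);   /* extra safety before anything that might crash */
    return 0;
}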
2a : This gets you continuous output. This itself can help with your concern about memory wear on the Pi's flash disk. If you do something like:
eventmonitor > logfile
then logfile will be appended to during the lifetime of your program, which will tend to write to new parts of the flash disk. Of course, if you only ever append, you will eventually run out of space on the disk, so you might set up a cron job to kill the currently running eventmonitor and restart it every day at midnight. Done with the above command, that would cause it to overwrite logfile once per day. This prevents endless growth, and it might even use a new physical area of the flash drive for the new file (even though it's the same name; underneath, it's a different file, with a different inode, etc.). But even if it reuses the exact same area of the flash drive, now you are down to worrying whether this will last more than 10,000 days instead of 10,000 writes. I'm betting that within 10,000 days, new options will be available -- worst case, you buy a new Pi every 27 years or so!
There are other possible variations on this theme as well. E.g., you could have a sophisticated script kicked off by cron every day at midnight that kills any currently running eventmonitor, deletes output files older than a week, and starts a new eventmonitor outputting to a file whose name is based partly on the date, so that past days' files aren't overwritten. But all of this is in the realm of using your program. You can make your program easier to use by writing it to use stdin, stdout, and stderr.
2b : Or, you can just have stdout go to the screen, which is typically how it already is when a program is started from an interactive shell / terminal window. I imagine you could have the Pi running headless most of the time, and when you want to see what your program is outputting, hook up a monitor. Generally, things will stay running between disconnecting and reconnecting your monitor. This avoids affecting the flash drive at all.
3 : Another approach is to have your event monitoring program send its output somewhere off-system. This is getting into more advanced programming territory, so you might want to save this for a later enhancement, after you've mastered more of the basics. But, your program could establish a network connection to, say, a JSON API and send event information there. This would let you separate the functions of event monitoring from event reporting.
You will discover as you learn more programming that this idea of separation of concerns is an important concept, and applies at various levels of a program or a system of interoperating programs. In this case, the Pi is a good fit for the data monitoring aspect because it is a lightweight solution, and some other system with more capacity and more stable storage can cover the data collection aspect.

Should we error check every call in C?

When we write C programs we make calls to malloc or printf. But do we need to check every call? What guidelines do you use?
e.g.
char error_msg[BUFFER_SIZE];
if (fclose(file) == EOF) {
    sprintf(error_msg, "Error closing %s", filename);
    perror(error_msg);
}
The answer to your question is: "do whatever you want"; there is no written rule. BUT the right question is "what do users want in case of failure?"
Let me explain: if you are a student writing a test program, for example, there's no absolute need to check for errors; it may be a waste of time.
Now, if your code may be distributed or used by other people, that's quite different: put yourself in the shoes of future users. Which message do you prefer when something goes wrong with an application:
Core was generated by `./cut --output-d=: -b1,1234567890- /dev/fd/63'.
Program terminated with signal SIGSEGV, Segmentation fault.
or
MySuperApp failed to start MySuperModule because there is not enough space on the disk.
Try to free space on disk, then relaunch the app.
If this error persists contact us at support#mysuperapp.com
As has already been addressed in the comments, you have to consider two types of error:
A fatal error is one that kills your program (app / server / site / whatever it is). It renders it unusable, either by crashing or by putting it in some state where it can't do its useful work, e.g. memory allocation failure, lack of disk space...
A non-fatal error is one where something messes up, but the program can continue to do what it's supposed to do, e.g. a file not found; the program can go on serving other users who aren't requesting the thing that triggered the error.
Source : https://www.quora.com/What-is-the-difference-between-an-error-and-a-fatal-error
Only do error checking if your program has to behave differently when an error is detected. Let me illustrate this with an example: assume you have used a temporary file in your program, and you use the unlink(2) system call to erase that temporary file at the end of the program. Do you have to check whether the file has been successfully erased? Let's analyse the problem with some common sense: if you check for errors, are you going to be able, inside the program, to do some alternate thing to cope with the failure? This situation is uncommon (if you created the file, it's rare that you will not be able to erase it, though something can happen in between --- for example, a change in directory permissions that forbids you to write to the directory anymore). But what can you do in that case? Is it possible to erase the temporary file by some different approach? Probably not... so checking a possible error from the unlink(2) system call will, in that case, be almost useless.
Of course, this doesn't always apply; you have to use common sense while programming. Errors about writing to files should always be considered, as they usually come down to access permissions or, mostly, full filesystems (in that case, even trying to generate a log message can be useless, as you have filled your disk --- or not; that depends). You don't always know the precise environment details needed to decide whether a full-filesystem error can be ignored. Suppose you have to connect to a server in your program. Should a connect(2) system call failure be acted upon? Probably, most of the time: at the very least, a message with the protocol error (or the cause of the failure) must be given to the user. Assuming everything goes OK can save you time in a prototype, but in production programs you have to cope with whatever can happen.
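To illustrate the two cases (a condensed sketch; the socket, address, and temp-file name are assumed to be set up by the caller):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

void cleanup_and_connect(int sock, struct sockaddr *addr, socklen_t alen,
                         const char *tmpname)
{
    /* unlink(2) on our own temp file: there is usually no alternate
       strategy on failure, so a warning is the most we can do. */
    if (unlink(tmpname) == -1)
        fprintf(stderr, "warning: could not remove %s: %s\n",
                tmpname, strerror(errno));

    /* connect(2): the failure changes program behaviour, so it must
       be checked and reported with its cause. */
    if (connect(sock, addr, alen) == -1) {
        fprintf(stderr, "cannot reach server: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
}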
When you want to use the return value of a function, it is suggested to check that value before using it.
For example, a function returning a pointer can also return NULL, so it is suggested to keep a NULL check before using it.
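For example (a minimal sketch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(64);
    if (buf == NULL) {              /* malloc can return NULL */
        fprintf(stderr, "out of memory\n");
        return EXIT_FAILURE;
    }
    strcpy(buf, "safe to use after the NULL check");
    puts(buf);
    free(buf);
    return EXIT_SUCCESS;
}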

Ensure that UID/GID check in system call is executed in RCU-critical section

Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power-consumption metrics. I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake. This is present in its entirety in the kernel headers (i.e., /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching the gcc-build headers in /usr/include/linux.
Even if the above worked, it doesn't mention whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note, that iterating through processes and reading process's credentials should be done under RCU-critical section.
... how do I ensure my check is run in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, but it isn't even possible. The kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! And if that's new to you, it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually look for a capability rather than being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with. RCU is often the right answer. The answer provided there is somewhat wrong in the sense that there are other ways as well.
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code, as opposed to an exit path). A question arises: what about the stability of credentials? Good news: they are also guaranteed to be there and can be accessed with no preparation whatsoever. This can be easily verified by checking the code doing credential switching.
We are left with the question of what primitives can be used to do the access. To that end, one can use make_kuid, uid_eq, and similar primitives.
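For the current task, that can look roughly like this (a sketch; for a foreign task_struct the access must sit inside rcu_read_lock()/rcu_read_unlock(), as the comment shows):

#include <linux/cred.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Sketch: is the caller root (UID 0 in the initial user namespace)?
   current_euid() needs no extra locking: the current task's
   credentials can't go away while it is executing this code. */
static bool caller_is_root(void)
{
    return uid_eq(current_euid(), GLOBAL_ROOT_UID);
}

/* For some *other* task t, the credentials must be read under RCU:
 *
 *     rcu_read_lock();
 *     uid = __task_cred(t)->uid;
 *     rcu_read_unlock();
 */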
The real question is why is this a syscall as opposed to just a /proc file.
See this blogpost for somewhat elaborated description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html

cryptsetup backend safe with multithreading?

Is there a crypto backend for cryptsetup that either is always thread safe, or can be easily used (or even modified, preferably with minimal effort) in a thread safe manner for simply testing if a key is correct?
Background and what I have tried:
I started by testing whether I could modify the source of cryptsetup to simply test multiple keys using pthreads. This crashed; I believe I was using gcrypt initially. Eventually I tried all of the backends available in the stable cryptsetup source and found that openssl and nettle seem to avoid crashing.
However, my testing was not very thorough and even though it (nettle specifically) does not crash, it seems that it does not work correctly when using threads. When using a single thread it always works, but increasing the number of threads makes it increasingly likely it will silently never find the correct key.
This is for brute-forcing LUKS devices. I am aware the PBKDF slows it down to a crawl. I'm also aware the key space of AES cannot be exhausted even if the KDF were not there. This is just for the fun of making it work in a network-distributed and multithreaded manner.
I noticed in the source of cryptsetup (libdevmapper.c):
/*
* libdevmapper is not context friendly, switch context on every DM call.
* FIXME: this is not safe if called in parallel but neither is DM lib.
*/
However, it is possible I'm simply not using it correctly.
if (!LUKS_open_key_with_hdr(CRYPT_ANY_SLOT, key, strlen(key),
                            &cd->u.luks1.hdr, &vk, cd)) {
    return 0;
}
Each worker thread does this. I only call crypt_init() and crypt_load() once before the worker threads start up and pass them their own separate copy of the struct crypt_device. vk is created locally for each attempt. The keys are simply fetched from a wordlist with access control by a mutex. I found that if each thread calls these functions (crypt_init and crypt_load) every time, it seems to crash more easily.
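The wordlist hand-off is the usual mutex pattern, roughly (condensed sketch; next_key is my name for it):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t wl_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *wordlist;   /* opened once before the workers start */

/* Hand out one candidate key at a time; returns 0 at end of list. */
static int next_key(char *buf, size_t len)
{
    int ok = 0;
    pthread_mutex_lock(&wl_lock);
    if (fgets(buf, (int)len, wordlist)) {
        buf[strcspn(buf, "\n")] = '\0';   /* strip the newline */
        ok = 1;
    }
    pthread_mutex_unlock(&wl_lock);
    return ok;
}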
Is it completely incorrect to start removing and rewriting the code that uses dmcrypt? In LUKS_endec_template() it attaches a loop device to the crypto device and creates a dm device, whose path it eventually gives to open(); the resulting fd is then given to read_blockwise(). My idea was simply to skip all of that, since I don't really need to use the device except to verify the key. However, simply opening the crypto device directly (and giving it to read_blockwise()) does not work.

What are the reasons to check for error on close()?

Note: Please read to the end before marking this as duplicate. While it's similar, the scope of what I'm looking for in an answer extends beyond what the previous question was asking for.
Widespread practice, which I tend to agree with, is to treat close purely as a resource-deallocation function for file descriptors rather than as a potential I/O operation with meaningful failure cases. And indeed, prior to the resolution of issue 529, POSIX left the state of the file descriptor (i.e. whether it was still allocated or not) unspecified after errors, making it impossible to respond portably to errors in any meaningful way.
However, a lot of GNU software goes to great lengths to check for errors from close, and the Linux man page for close calls failure to do so "a common but nevertheless serious programming error". NFS and quotas are cited as circumstances under which close might produce an error, but no details are given.
What are the situations under which close might fail, on real-world systems, and are they relevant today? I'm particularly interested in knowing whether there are any modern systems where close fails for any non-NFS, non-device-node-specific reasons, and as for NFS or device-related failures, under what conditions (e.g. configurations) they might be seen.
Once upon a time (24 March 2007), Eric Sosman had the following tale to share in the comp.lang.c newsgroup:
(Let me begin by confessing to a little white lie: It wasn't
fclose() whose failure went undetected, but the POSIX close()
function; this part of the application used POSIX I/O. The lie
is harmless, though, because the C I/O facilities would have
failed in exactly the same way, and an undetected failure would
have had the same consequences. I'll describe what happened in
terms of C's I/O to avoid dwelling on POSIX too much.)
The situation was very much as Richard Tobin described.
The application was a document management system that loaded a
document file into memory, applied the user's edits to the in-
memory copy, and then wrote everything to a new file when told
to save the edits. It also maintained a one-level "old version"
backup for safety's sake: the Save operation wrote to a temp
file, and then if that was successful it deleted the old backup,
renamed the old document file to the backup name, and renamed the
temp file to the document. bak -> trash, doc -> bak, tmp -> doc.
The write-to-temp-file step checked almost everything. The
fopen(), obviously, but also all the fwrite()s and even a final
fflush() were checked for error indications -- but the fclose()
was not. And on one system it happened that the last few disk
blocks weren't actually allocated until fclose() -- the I/O
system sat atop VMS' lower-level file access machinery, and a
little bit of asynchrony was inherent in the arrangement.
The customer's system had disk quotas enabled, and the
victim was right up close to his limit. He opened a document,
edited for a while, saved his work thus far, and exceeded his
quota -- which went undetected because the error didn't appear
until the unchecked fclose(). Thinking that the save succeeded,
the application discarded the old backup, renamed the original
document to become the backup, and renamed the truncated temp
file to be the new document. The user worked a little longer
and saved again -- same thing, except you'll note that this time
the only surviving complete file got deleted, and both the
backup and the master document file are truncated. Result: the
whole document file became trash, not just the latest session
of work but everything that had gone before.
As Murphy would have it, the victim was the boss of the
department that had purchased several hundred licenses for our
software, and I got the privilege of flying to St. Louis to be
thrown to the lions.
[...]
In this case, the failure of fclose() would (if detected) have
stopped the delete-and-rename sequence. The user would have been
told "Hey, there was a problem saving the document; do something
about it and try again. Meanwhile, nothing has changed on disk."
Even if he'd been unable to save his latest batch of work, he would
at least not have lost everything that went before.
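A condensed sketch of that save sequence with the missing fclose() check added (file names are placeholders):

#include <stdio.h>

static int save_doc(const char *buf, size_t len)
{
    FILE *tmp = fopen("doc.tmp", "wb");
    if (!tmp)
        return -1;

    int ok = fwrite(buf, 1, len, tmp) == len && fflush(tmp) == 0;
    /* fclose() is checked too: on some systems the final blocks are
       only allocated here, so this is where quota errors surface. */
    ok = fclose(tmp) == 0 && ok;
    if (!ok) {
        remove("doc.tmp");   /* nothing on disk has changed */
        return -1;
    }
    remove("doc.bak");                 /* bak -> trash */
    rename("doc", "doc.bak");          /* doc -> bak   */
    rename("doc.tmp", "doc");          /* tmp -> doc   */
    return 0;
}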
Consider the inverse of your question: "Under what situations can we guarantee that close will succeed?" The answer is:
when you call it correctly, and
when you know that the file system the file is on does not return errors from close in this OS and Kernel version
If you are convinced that your program doesn't have any logic errors and you have complete control over the kernel and file system, then you don't need to check the return value of close.
Otherwise, you have to ask yourself how much you care about diagnosing problems with close. I think there is value in checking and logging the error for diagnostic purposes:
If a coder makes a logic error and passes an invalid fd to close, then you'll be able to quickly track it down. This may help to catch a bug early before it causes problems.
If a user runs the program in an environment where close does return an error when (for example) data was not flushed, then you'll be able to quickly diagnose why the data got corrupted. It's an easy red flag because you know the error should not occur.
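A minimal wrapper along these lines (a sketch):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Close fd and log (but don't retry) on failure.  Per POSIX, the state
   of fd after an error such as EINTR is unspecified, so retrying close()
   risks closing a descriptor another thread has already reused. */
static int close_logged(int fd, const char *what)
{
    if (close(fd) == -1) {
        fprintf(stderr, "close(%s): %s\n", what, strerror(errno));
        return -1;
    }
    return 0;
}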
