Programs running in parallel, reading/writing files in C

I'm considering a set of 4 programs: (Prog1, Prog2, Prog3, Prog4)
interacting with 4 files (FileA, FileB, FileC, FileD)
Prog1: writes (appends) to FileA
Prog2: reads FileA and writes (appends) to FileB
Prog3: reads FileA and writes (appends) to FileC
Prog4: reads FileB and writes (appends) to FileD
Potentially, Prog1 might also read upon startup and write continuously to, say, FileX.
Now all 4 programs will be running simultaneously (potentially over a network, but that shouldn't matter). Will this work?
Do I need to set "strobe" or "busy" signals (I could do that with, say, mkdir and rmdir)?

Is your problem that of synchronizing the reads/writes? Writes are the more problematic part since they modify the contents. Further, the nature of the write (append at end, append at beginning, etc.) may further complicate your situation. I have a feeling that you may need to look up "file locks"/mutexes etc. A lot depends on the OS(-es) you plan to run these on. Boost.Interprocess is a good place to start.

I think you need some kind of real FIFO structure here, also called a pipe. There are constructs with that name under Windows and Unix-flavoured OSes.
An example under Linux can be found here; named pipes under Windows are described here.
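As a rough illustration in C, here is a minimal sketch of the writer side of a named pipe; the path /tmp/progA_fifo is a made-up name, and the reader (say Prog2) would open the same path read-only:
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Writer side (e.g. Prog1). A reader (e.g. Prog2) would open the same
 * path with O_RDONLY and read records as they arrive. */
int main(void)
{
    const char *path = "/tmp/progA_fifo";   /* made-up name */
    mkfifo(path, 0666);                     /* harmless if the FIFO already exists */

    int fd = open(path, O_WRONLY);          /* blocks until a reader opens the FIFO */
    if (fd < 0)
        return 1;

    const char *record = "one appended record\n";
    write(fd, record, strlen(record));
    close(fd);
    return 0;
}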

It can be made to work. You need to consider which process opens each file, and where the data that Prog1 is writing comes from.
If each program opens the files it works with, then there is no major problem. The main issue is the same as 'tail -f' has to deal with, namely that each of the reading processes is likely to read to EOF, and then has to pause and retry to see when more data becomes available.
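To sketch that 'tail -f'-style loop in C (the file name and sleep interval are arbitrary choices here), each reading process would do something like:
#include <stdio.h>
#include <unistd.h>

/* Keep reading FileA, pausing briefly whenever we run out of data. */
int main(void)
{
    FILE *in = fopen("FileA", "r");
    if (!in)
        return 1;

    char line[256];
    for (;;) {
        if (fgets(line, sizeof line, in)) {
            fputs(line, stdout);   /* process the newly appended line */
        } else {
            clearerr(in);          /* clear EOF so later reads can see new data */
            sleep(1);              /* wait for the writer to append more */
        }
    }
}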
If you have a central process that opens all the files, then you need to open file A for reading twice so that Prog2 and Prog3 have independent access to the file. However, it seems more sensible for any coordinator process to simply tell the children which files to open.
I don't see any need for strobes or busy signals. You've not warned of any 'near hard real time' response requirements or other special conditions that might warrant special programming.

Related

Logging Events---Looking For a Good Way

I created a program that monitors for events.
I want to log these events "in the right way".
Currently I have a string array, log[500][100].
Each line is a string of characters (up to 100) that report something about the event.
I have it set up so that only the last 500 events are saved in the array.
After that, new events overwrite the oldest events.
Currently I just keep revolving through the array until the program terminates, then I write the array to a file.
Going forward I would like to view the log in real time, any time I wish, without disturbing the event processing and logging process.
I considered opening the file for "appending" but here are my concerns:
(1) The program is running on a Raspberry Pi which has a flash memory as a "disk drive". I believe flash memories have a limited number of write cycles before problems can occur. This program runs 24/7 "forever" so I am afraid the "disk drive" will "wear out".
(2) I am using pretty much all the CPU capacity of the RPi so I don't want to add a lot of overhead/CPU cycles.
How would experienced programmers attack this problem?
Please go easy on me, this is my first C program.
[EDIT]
I began reviewing all the information and I became intrigued by Mark A's suggestion for tmpfs. I looked into it more and I am sure this answers my question. It allows the creation of files in RAM not the SD card. They are lost on power down but I don't care.
In order to keep the files from growing too large I created a double-buffer approach. First I write 500 events to file A, then switch to file B. When 500 events have been written to file B, I close and reopen file A (to delete the contents and start at 0 events) and switch back to writing to file A. I found I needed to call fflush() after each write or else the file appeared empty until fclose().
Normally that would be OK, but right now I am fighting a nasty segmentation fault, so I want as much insight as possible into what is going on. When I hit the fault, I never get to my fclose statements.
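In outline, that double-buffer scheme looks something like the following sketch (the /dev/shm paths are just placeholders for files on a tmpfs mount):
#include <stdio.h>

#define EVENTS_PER_FILE 500

/* Alternate between two files on a tmpfs mount (placeholder paths). */
static const char *paths[2] = { "/dev/shm/log_A.txt", "/dev/shm/log_B.txt" };
static FILE *logfp = NULL;
static int current = 1;   /* flipped before first use, so file A comes first */
static int count = 0;

void log_event(const char *msg)
{
    if (logfp == NULL || count == EVENTS_PER_FILE) {
        if (logfp)
            fclose(logfp);
        current = !current;
        logfp = fopen(paths[current], "w");   /* "w" truncates the old contents */
        count = 0;
    }
    if (logfp) {
        fprintf(logfp, "%s\n", msg);
        fflush(logfp);   /* otherwise the file looks empty until fclose() */
        count++;
    }
}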
Welcome to Stack Overflow and to C programming! A wonderful world of possibilities awaits you.
I have several thoughts in response to your situation.
The very short summary is to use stdout and delegate the output-file management to the shell.
The longer, rambling answer full of my personal musing is as follows:
1 : A very typical thing for C programs to do is not be in charge of how outputs are kept. You might have heard of the "built in" file handles, stdin, stdout, and stderr. These file handles are (under normal circumstances) always available to your program for input (from stdin) and output (stdout and stderr). As you might guess from their names stdout is customarily used for regular output and stderr is customarily used for error / exception output. It is exceedingly typical for a C program to simply read from stdin and output to stdout and stderr, and let something else (e.g., the shell) take care of what those actually are.
For example, reading from stdin means that your program can be used for keyboard entry and for file reading, without needing to change your program's code. The same goes for stdout and stderr; simply output to those file handles, and let the user decide whether those should go to the screen or be redirected to a file. And, because stdout and stderr are separate file handles, the user can have them go to separate 'destinations'.
In your case, to implement this, drop the array entirely, and simply
fprintf(stdout, "event notice : %s\n", eventdetailstring);
(or similar) every time your program has something to say. Take a look at fflush(), too, because of potential output buffering.
2a : This gets you continuous output. This itself can help with your concern about memory wear on the Pi's flash disk. If you do something like:
eventmonitor > logfile
then logfile will be appended to for the lifetime of your program, which will tend to write to new parts of the flash disk. Of course, if you only ever append, you will eventually run out of space on the disk, so you might set up a cron job to kill the currently running eventmonitor and restart it every day at midnight. Done with the above command, that would cause it to overwrite logfile once per day. This prevents endless growth, and it might even use a new physical area of the flash drive for the new file (even though it has the same name; underneath, it's a different file, with a different inode, etc.). But even if it reuses the exact same area of the flash drive, you are now down to worrying whether this will last more than 10,000 days instead of 10,000 writes. I'm betting that within 10,000 days, new options will be available -- worst case, you buy a new Pi every 27 years or so!
There are other possible variations on this theme, as well. E.g., you could have a sophisticated script kicked off by cron every day at midnight that kills any currently running eventmonitor, deletes output files older than a week, and starts a new eventmonitor outputting to a file whose filename is based partly on the date so that past days' files aren't overwritten. But all of this is in the realm of using your program. You can make your program easier to use by writing it to use stdin, stdout, and stderr.
2b : Or, you can just have stdout go to the screen, which is typically how it already is when a program is started from an interactive shell / terminal window. I imagine you could have the Pi running headless most of the time, and when you want to see what your program is outputting, hook up a monitor. Generally, things will stay running between disconnecting and reconnecting your monitor. This avoids affecting the flash drive at all.
3 : Another approach is to have your event monitoring program send its output somewhere off-system. This is getting into more advanced programming territory, so you might want to save this for a later enhancement, after you've mastered more of the basics. But, your program could establish a network connection to, say, a JSON API and send event information there. This would let you separate the functions of event monitoring from event reporting.
You will discover as you learn more programming that this idea of separation of concerns is an important concept, and applies at various levels of a program or a system of interoperating programs. In this case, the Pi is a good fit for the data monitoring aspect because it is a lightweight solution, and some other system with more capacity and more stable storage can cover the data collection aspect.

Creating unflushed file output buffers

I am trying to clear up an issue that occurs with unflushed file I/O buffers in a couple of programs, in different languages, running on Linux. The solution of flushing buffers is easy enough, but this issue of unflushed buffers happens quite randomly. Rather than seek help on what may cause it, I am interested in how to create (reproduce) and diagnose this kind of situation.
This leads to a two-part question:
Is it feasible to artificially and easily construct instances where, for a given period of time, one can have output buffers that are known to be unflushed? My searches are turning up empty. A trivial baseline is to hammer the hard drive (e.g. swapping) in one process while trying to write a large amount of data from another process. While this "works", it makes the system practically unusable: I can't poke around and see what's going on.
Are there commands from within Linux that can identify that a given process has unflushed file output buffers? Is this something that can be run at the command line, or is it necessary to query the kernel directly? I have been looking at fsync, sync, ioctl, flush, bdflush, and others. However, lacking a method for creating unflushed buffers, it's not clear what these may reveal.
In order to reproduce for others, an example for #1 in C would be excellent, but the question is truly language agnostic - just knowing an approach to create this situation would help in the other languages I'm working in.
Update 1: My apologies for any confusion. As several people have pointed out, buffers can be in the kernel space or the user space. This helped pinpoint the problems: we're creating big dirty kernel buffers. This distinction and the answers completely resolve #1: it now seems clear how to re-create unflushed buffers in either user space or kernel space. Identifying which process ID has dirty kernel buffers is not yet clear, though.
If you are interested in the kernel-buffered data, then you can tune the VM writeback through the sysctls in /proc/sys/vm/dirty_*. In particular, dirty_expire_centisecs is the age, in hundredths of a second, at which dirty data becomes eligible for writeback. Increasing this value will give you a larger window of time in which to do your investigation. You can also increase dirty_ratio and dirty_background_ratio (which are percentages of system memory, defining the point at which synchronous and asynchronous writeback start respectively).
Actually creating dirty pages is easy - just write(2) to a file and exit without syncing, or dirty some pages in a MAP_SHARED mapping of a file.
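For example, a minimal sketch of the write-and-exit variant (the file name and size are arbitrary); the data sits in the page cache as dirty pages until writeback catches up:
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create dirty kernel buffers: write a chunk of data and exit without
 * calling fsync()/fdatasync(); close() does not imply a sync. */
int main(void)
{
    int fd = open("dirty_demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    size_t len = 16 * 1024 * 1024;   /* 16 MiB, an arbitrary amount */
    char *buf = malloc(len);
    if (!buf)
        return 1;
    memset(buf, 'x', len);

    if (write(fd, buf, len) < 0)     /* lands in the page cache, not on disk */
        return 1;
    close(fd);
    return 0;                        /* exit while the pages are still dirty */
}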
A simple program that would have an unflushed buffer would be:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("moo");   /* no newline, so the output sits in stdio's buffer */
    pause();         /* wait forever without ever flushing it */
}
By default, stdio only flushes stdout on newlines when it is connected to a terminal.
It is very easy to cause unflushed buffers by controlling the receiving side. The beauty of *nix systems is that everything looks like a file, so you can use special files to do what you want. The easiest option is a pipe. If you just want to control stdout, this is the simplest option: unflushed_program | slow_consumer. Otherwise, you can use named pipes:
mkfifo pipe_file
unflushed_program --output pipe_file
slow_consumer --input pipe_file
slow_consumer is most likely a program you design to read data slowly, or just read X bytes and stop.
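Such a slow_consumer can be almost trivial; here is one possible sketch that reads its standard input in small chunks with a delay in between (the sizes and delays are arbitrary):
#include <stdio.h>
#include <unistd.h>

/* Deliberately slow consumer: read stdin in tiny chunks, sleeping between
 * reads so the writer's buffers and the pipe back up. */
int main(void)
{
    char chunk[64];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        fwrite(chunk, 1, n, stdout);
        sleep(1);   /* throttle: keep the upstream pipe full */
    }
    return 0;
}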

How to know if a file is being copied?

I am currently trying to check whether the copy of a file from one directory to another is done.
I would like to know if the target file is still being copied.
So I would like to get the number of file descriptors opened on this file.
I use the C language and haven't really found a way to solve this problem.
If you have control of it, I would recommend using the copy-move idiom on the program doing the copying:
cp file1 otherdir/.file1.tmp
mv otherdir/.file1.tmp otherdir/file1
The mv just changes some filesystem entries and is atomic and very fast compared to the copy.
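Done from inside a C program, the same idiom is a copy into a temporary name in the target directory followed by rename(), which is atomic when both names are on the same filesystem (the function and names below are just an illustration):
#include <stdio.h>

/* After copying the data into a hidden temporary name in the target
 * directory, publish it atomically with rename(). */
int publish(const char *tmp_path, const char *final_path)
{
    /* ... copy the source file's bytes into tmp_path first ... */
    if (rename(tmp_path, final_path) != 0) {
        perror("rename");
        return -1;
    }
    return 0;   /* readers only ever see a complete file under final_path */
}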
If you're able to open the file for writing, there's a good chance that the OS has finished the copy and has released its lock on it. Different operating systems may behave differently for this, however.
Another approach is to open both the source and destination files for reading and compare their sizes. If they're of identical size, the copy has very likely finished. You can use fseek() and ftell() to determine the size of a file in C:
fseek(fp, 0L, SEEK_END);
long sz = ftell(fp);
In linux, try the lsof command, which lists all of the open files on your system.
edit 1: The only C language feature that comes to mind is the fstat function. You might be able to use that with the struct's st_mtime (last modification time) field - once that value stops changing (for, say, a period of 10 seconds), then you could assume that file copy operation has stopped.
edit 2: also, on Linux, you could traverse /proc/[pid]/fd to see which files are open. The entries in there are symlinks, but C's readlink() function can tell you what each one points to, so you can see whether the file is still open. Using getpid(), you would know the process ID of your program (if you are doing the file copy from within your program) and therefore where to look in /proc.
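A sketch of that /proc walk (the function name and the PATH_MAX-sized buffers are my own choices):
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Check whether the process with the given pid has `path` open, by walking
 * /proc/<pid>/fd and resolving each symlink with readlink(). */
int pid_has_open(pid_t pid, const char *path)
{
    char dirname[64];
    snprintf(dirname, sizeof dirname, "/proc/%d/fd", (int)pid);

    DIR *dir = opendir(dirname);
    if (!dir)
        return 0;

    struct dirent *de;
    int found = 0;
    while ((de = readdir(dir)) != NULL) {
        char link[PATH_MAX], target[PATH_MAX];
        snprintf(link, sizeof link, "%s/%s", dirname, de->d_name);
        ssize_t n = readlink(link, target, sizeof target - 1);
        if (n > 0) {
            target[n] = '\0';
            if (strcmp(target, path) == 0)
                found = 1;
        }
    }
    closedir(dir);
    return found;
}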
I think your basic mistake is trying to synchronize a C program with a shell tool/external program that's not intended for synchronization. If you have some degree of control over the program/script doing the copying, you should modify it to perform advisory locking of some sort (preferably fcntl-based) on the target file. Then your other program can simply block on acquiring the lock.
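A rough sketch of the fcntl() side of that, assuming the copying program takes a write lock while it copies; F_SETLKW blocks until the lock can be granted:
#include <fcntl.h>
#include <unistd.h>

/* Block until nobody holds a conflicting advisory lock on the file. This
 * only helps if the program doing the copy also takes an fcntl() write lock. */
int wait_for_copier(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct flock fl = { 0 };
    fl.l_type = F_RDLCK;      /* a read lock is enough to wait out a writer */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;           /* start 0, length 0 means "the whole file" */
    fl.l_len = 0;

    if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* F_SETLKW waits for the lock */
        close(fd);
        return -1;
    }
    return fd;   /* read the file, then close(fd) to release the lock */
}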
If you don't have any control over the program performing the copy, the only solutions depend on non-portable hacks like lsof or Linux inotify API.
(This answer makes the big, big assumption that this will be running on Linux.)
The C source code of lsof, a tool that tells which programs currently have an open file descriptor to a specific file, is freely available. However, just to warn you, I couldn't make any sense out of it. There are references to reading kernel memory, so to me it's either voodoo or black magic.
That said, nothing prevents you from running lsof through your own program. Running third-party programs from your own program is normally something you try to avoid for several reasons, like security (if a rogue user changes lsof for a malicious program, it will run with your program's privileges, with potentially catastrophic consequences) but inspecting the lsof source code, I came to the conclusion that there's no public API to determine which program has which file open. If you're not afraid of people changing programs in /usr/sbin, you might consider this.
#define _GNU_SOURCE   // for asprintf()
#include <stdio.h>
#include <stdlib.h>

int isOpen(const char* file)
{
    char* command;
    // BE AWARE THAT THIS WILL NOT WORK IF THE FILE NAME CONTAINS A DOUBLE QUOTE
    // OR IF IT CAN SOMEHOW BE ALTERED THROUGH SHELL EXPANSION
    // you should either try to fix it yourself, or use a function of the `exec`
    // family that won't trigger shell expansion.
    // It would be an EXTREMELY BAD idea to call `lsof` without an absolute path
    // since it could result in another program being run. If this is not where
    // `lsof` resides on your system, change it to the appropriate absolute path.
    asprintf(&command, "/usr/sbin/lsof \"%s\"", file);
    int result = system(command);
    free(command);
    // lsof exits with status 0 when it found at least one open instance of the
    // file, so a zero return from system() means "open".
    return result == 0;
}
If you also need to know which program has your file open (presumably cp?), you can use popen to read the output of lsof in a similar fashion. popen descriptors behave like fopen descriptors, so all you need to do is fread them and see if you can find your program's name. On my machine, lsof output looks like this:
$ lsof document.pdf
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
SomeApp 873 felix txt REG 14,3 303260 5165763 document.pdf
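A sketch of that popen() approach, reusing the /usr/sbin/lsof path from the snippet above and assuming the copier shows up as "cp" in the COMMAND column (both are assumptions about your setup, and the quoting caveat from above still applies):
#include <stdio.h>
#include <string.h>

/* Run lsof on the file and look for "cp" in the COMMAND column of its
 * output. The same quoting caveat as above applies to `path`. */
int copier_has_file_open(const char *path)
{
    char command[1024];
    snprintf(command, sizeof command, "/usr/sbin/lsof \"%s\"", path);

    FILE *out = popen(command, "r");
    if (!out)
        return -1;

    char line[512];
    int found = 0;
    while (fgets(line, sizeof line, out)) {
        if (strncmp(line, "cp ", 3) == 0)   /* COMMAND is the first column */
            found = 1;
    }
    pclose(out);
    return found;
}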
As poundifdef mentioned, the fstat() function can give you the current modification time. But fstat also gives you the size of the file.
Back in the dim dark ages of C, when I was monitoring files being copied by various programs I had no control over, I always:
Waited until the target file size was >= the source size, and
Waited until the target modification time was at least N seconds older than the current time. N being a number such as 5, and set larger if experience showed that was necessary. Yes, 5 seconds seems extreme, but it is safe.
If you don't know what the source file is, then the only real choice you have is #2, but use a larger N to allow for worst-case network and local CPU delays, with a healthy safety factor.
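In C, that wait loop might look roughly like this, using stat() for both the size and the modification time (the threshold and polling interval are as arbitrary as the N above):
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Poll until the target is at least as large as the source and its
 * modification time is at least n_seconds in the past. */
void wait_until_copied(const char *source, const char *target, int n_seconds)
{
    struct stat src, dst;
    if (stat(source, &src) != 0)
        return;

    for (;;) {
        if (stat(target, &dst) == 0 &&
            dst.st_size >= src.st_size &&
            time(NULL) - dst.st_mtime >= n_seconds)
            break;      /* size has caught up and the file has gone quiet */
        sleep(1);       /* polling interval is arbitrary */
    }
}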
Using the Boost libraries will also solve the issue:
#include <boost/filesystem/fstream.hpp>

boost::filesystem::fstream fileStream(filePath, std::ios_base::in | std::ios_base::binary);
if (fileStream.is_open()) {
    // not getting copied
} else {
    // wait, the file is still getting copied
}

popen performance in C

I'm designing a program I plan to implement in C and I have a question about the best way (in terms of performance) to call external programs. The user is going to provide my program with a filename, and then my program is going to run another program with that file as input. My program is then going to process the output of the other program.
My typical approach would be to redirect the other program's output to a file and then have my program read that file when it's done. However, I understand I/O operations are quite expensive and I would like to make this program as efficient as possible.
I did a little bit of looking and I found the popen function for running system commands and grabbing the output. How does the performance of this approach compare to the performance of the approach I just described? Does popen simply write the external program's output to a temporary file, or does it keep the program output in memory?
Alternatively, is there another way to do this that will give better performance?
On Unix systems, popen will pass data through an in-memory pipe. Assuming the data isn't swapped out, it won't hit disk. This should give you just about as good performance as you can get without modifying the program being invoked.
popen does pretty much what you are asking for: it does the pipe-fork-exec idiom and gives you a file pointer that you can read and write from.
However, there is a limitation on the size of the pipe buffer (~4K iirc), and if you aren't reading quickly enough, the other process could block.
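For reference, a minimal popen() sketch that keeps draining the pipe in reasonably sized chunks (the command string is only a placeholder):
#include <stdio.h>

/* Run an external command and consume its output in 4096-byte chunks;
 * "sort somefile.txt" is only a placeholder command. */
int main(void)
{
    FILE *child = popen("sort somefile.txt", "r");
    if (!child)
        return 1;

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, child)) > 0) {
        fwrite(buf, 1, n, stdout);   /* process n bytes; here we just echo them */
    }
    return pclose(child);
}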
Do you have access to shared memory as a mount point? [On Linux systems there is a /dev/shm mount point.]
1) popen keeps the program output in memory. It actually uses pipes to transfer data between the processes.
2) popen looks, IMHO, like the best option for performance.
It also has an advantage over files of reducing latency. I.e., your program will be able to get the other program's output on the fly, while it is produced. If this output is large, then you don't have to wait until the other program is finished to start processing its output.
The problem with having your subcommand redirect to a file is that it's potentially insecure, while popen communication can't be intercepted by another process. Plus you need to make sure the filename is unique if you're running several instances of your master program (and thus of your subcommand). The popen solution doesn't suffer from this.
The performance of popen is just fine as long as you don't read/write one-byte chunks. Always read/write multiples of 512 (like 4096). But that applies to file operations as well. popen connects your process and the child process through pipes, so if you don't read, then the pipe fills up and the child can't write, and vice versa. So all the exchanged data is in memory, but it's only small amounts.
(Assuming Unix or Linux)
Writing to the temp file may be slow if the file is on a slow disk. It also means the entire output will have to fit on the disk.
popen connects to the other program using a pipe, which means that output will be sent to your program incrementally. As it is generated, it is copied to your program chunk-by-chunk.

System.out.print + OS redirect vs writing to a file, which is faster?

I'm making a program that processes a big file and outputs something to another file that I need to use later. I'm wondering whether I should just print the output and redirect that to a file, or write to the file in the program. Since this will be a very big file, I would like to know which way is faster; every bit counts.
Your question is really, "Should I write to stdout, or use native file I/O?"
The answer will depend somewhat on how you process the file (can it be processed and output line by line), and how optimally your file i/o code is written.
It's quite possible to write code that outputs directly to a file that is slower than code that writes to stdout.
What's the difference? Stdout is a stream, and so is a file. On most operating systems, there is literally no difference. On Windows, there are different functions you have to use when handling file streams vs output streams, but they're still almost exactly the same API (just the file ones are prefixed with 'f'). I'd be very surprised if there's a difference in performance.
You can of course use alternative APIs for files, but I do not see a compelling reason to do so, since the files are still streams at the OS level.
If the typical output is a text format, I would prefer stdout. You can simply check whether you are on a terminal, redirect to a file, or pipe it to the next command. The performance should be the same. For binary output, writing to a file is more typical.

Resources