popen performance in C

I'm designing a program I plan to implement in C and I have a question about the best way (in terms of performance) to call external programs. The user is going to provide my program with a filename, and then my program is going to run another program with that file as input. My program is then going to process the output of the other program.
My typical approach would be to redirect the other program's output to a file and then have my program read that file when it's done. However, I understand I/O operations are quite expensive and I would like to make this program as efficient as possible.
I did a little bit of looking and found the popen function for running system commands and grabbing their output. How does the performance of that approach compare to the one I just described? Does popen simply write the external program's output to a temporary file, or does it keep the output in memory?
Alternatively, is there another way to do this that will give better performance?

On Unix systems, popen will pass data through an in-memory pipe. Assuming the data isn't swapped out, it won't hit disk. This should give you just about as good performance as you can get without modifying the program being invoked.
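To illustrate, a minimal sketch of that approach might look like the following (the command string "sort datafile" and the 4096-byte chunk size are placeholders, not anything from the question):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* "sort datafile" is a placeholder command; substitute your own */
    FILE *child = popen("sort datafile", "r");
    if (child == NULL) {
        perror("popen");
        return EXIT_FAILURE;
    }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, child)) > 0) {
        /* process n bytes of the child's output as they arrive */
        fwrite(buf, 1, n, stdout);
    }

    /* pclose() waits for the child and returns its exit status */
    return pclose(child) == -1 ? EXIT_FAILURE : EXIT_SUCCESS;
}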

popen does pretty much what you are asking for: it does the pipe-fork-exec idiom and gives you a file pointer that you can read from or write to (one direction per call on POSIX).
However, there is a limit on the pipe buffer's capacity (traditionally around 4 KB; 64 KB on modern Linux), and if you aren't reading quickly enough, the other process can block.
Do you have access to shared memory as a mount point? [On Linux systems there is a /dev/shm mountpoint.]
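For reference, here is a rough sketch of the pipe-fork-exec idiom that popen wraps; the command ("sort datafile") is again a placeholder:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {                     /* child */
        dup2(fds[1], STDOUT_FILENO);    /* send the child's stdout into the pipe */
        close(fds[0]);
        close(fds[1]);
        execlp("sort", "sort", "datafile", (char *)NULL);
        _exit(127);                     /* only reached if exec fails */
    }

    close(fds[1]);                      /* the parent only reads */
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    close(fds[0]);
    waitpid(pid, NULL, 0);              /* reap the child, as pclose() would */
    return 0;
}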

1) popen keeps the program output in memory. It actually uses pipes to transfer data between the processes.
2) popen looks, IMHO, like the best option for performance.
It also has an advantage over files in reducing latency: your program can consume the other program's output on the fly, while it is being produced. If that output is large, you don't have to wait until the other program finishes before you start processing it.

The problem with having your subcommand redirect to a file is that it's potentially insecure, whereas popen communication can't be intercepted by another process. You also need to make sure the filename is unique if you're running several instances of your master program (and thus of your subcommand). The popen solution doesn't suffer from either problem.
The performance of popen is just fine as long as you don't read/write one-byte chunks. Always read/write in multiples of 512 bytes (like 4096), but that applies to file operations as well. popen connects your process and the child process through pipes, so if you don't read, the pipe fills up and the child can't write, and vice versa. So all the exchanged data stays in memory, but only in small amounts at a time.

(Assuming Unix or Linux)
Writing to a temporary file may be slow if the file is on a slow disk. It also means the entire output will have to fit on the disk.
popen connects to the other program using a pipe, which means that output will be sent to your program incrementally. As it is generated, it is copied to your program chunk-by-chunk.

Related

How are files written? Why do I not see my data written immediately?

I understand the general process of writing to and reading from a file, but I was curious what happens under the hood during file writing. For instance, I have written a program that writes a series of numbers, line by line, to a .txt file. One thing that bothers me, however, is that I don't see the information written until after my C program has finished running. Is there a way to see the information written while the program is running, rather than after? Is this even possible? This is a hard question to phrase in one line, so please forgive me if it's already been answered elsewhere.
The reason I ask this is because I'm writing to a file and was hoping that I could scan the file for the highest and lowest values (the program would optimally be able to run for hours).
Research buffering and caching.
There are a number of layers of optimisation performed by:
your application,
your OS, and
your disk driver,
in order to extend the life of your disk and increase performance.
With the careful use of flushing commands, you can generally make things happen "quite quickly" when you really need them to, though you should generally do so sparingly.
Flushing can be particularly useful when debugging.
The GNU C Library documentation has a good page on the subject of file flushing, listing functions such as fflush which may do what you want.
You observe an effect solely caused by the C standard I/O (stdio) buffers. I claim that any OS or disk driver buffering has nothing to do with it.
In stdio, I/O happens in one of three modes:
Fully buffered: data is written once BUFSIZ (from <stdio.h>) characters have accumulated. This is the default when I/O is redirected to a file or pipe, and it is what you observe. Typically BUFSIZ is anywhere from 1 KB to several KB.
Line buffered: data is written once a newline is seen (or BUFSIZ is reached). This is the default when I/O goes to a terminal.
Unbuffered: data is written immediately.
You can use the setvbuf() (<stdio.h>) function to change the default, using the _IOFBF, _IOLBF or _IONBF macros, respectively. See your friendly setvbuf man page.
In your case, you can set your output stream (stdout or the FILE * returned by fopen) to line buffered.
Alternatively, you can call fflush() on the output stream whenever you want I/O to happen, regardless of buffering.
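As a minimal sketch of both options (the file name "numbers.txt" is hypothetical):

#include <stdio.h>

int main(void)
{
    FILE *out = fopen("numbers.txt", "w");  /* hypothetical output file */
    if (out == NULL) { perror("fopen"); return 1; }

    /* must be called before the first I/O on the stream */
    setvbuf(out, NULL, _IOLBF, BUFSIZ);     /* switch to line buffering */

    for (int i = 0; i < 100; i++) {
        fprintf(out, "%d\n", i);            /* each newline now triggers a write */
        /* with default (full) buffering you would force it instead: */
        /* fflush(out); */
    }

    fclose(out);
    return 0;
}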
Indeed, there are several layers between the write functions and the actual file.
First, you open the file for writing, which causes the file to be either created or emptied. When you then write, the write doesn't actually happen immediately; the data is buffered until the buffer is full or the file is flushed or closed.
You can call fflush() after writing each portion of data, or you can simply wait until the file is closed.
Yes, it is possible to see what's been written to the file(s). If you program under Linux, you can open a new terminal and watch the progress with, for example, "less filename".

Allocating a lot of file descriptors

I am interested in bringing a system down (for, say, 15 minutes) by allocating a lot of file descriptors and causing out-of-file-descriptor failures. (Don't worry, I am not trying to hack into anything. This is for testing a service I am writing... to see how it behaves when other programs misbehave.) Any best practices for that? Should I just keep calling fopen() in an infinite loop, and after 15 minutes kill the process? Does anybody have experience with this?
Update: I am running Linux and the program I am writing will have super user privileges.
Thanks,
~yogi
Did you consider lowering the file descriptor limit with setrlimit RLIMIT_NOFILE before running your program?
This can be done simply with the bash ulimit -n builtin, in the same shell where you test your application, e.g.:
ulimit -n 32
It won't perturb the other services already running much, and the lowered limit will make your application (run in the same shell) hit it quickly, which is what you want for testing.
At the whole-system level, you might also write to /proc/sys/fs/file-max, e.g. with
echo 1024 > /proc/sys/fs/file-max
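If you'd rather lower the limit from inside the test harness itself, a sketch with setrlimit might look like this (the value 32 just mirrors the ulimit example above):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) { perror("getrlimit"); return 1; }

    rl.rlim_cur = 32;   /* lower only the soft limit, as ulimit -n 32 would */
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1) { perror("setrlimit"); return 1; }

    /* from here on, this process can hold at most 32 open descriptors */
    return 0;
}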
This depends on the OS implementation, but on POSIX systems each fopen() call allocates a new file descriptor, even when it opens the same file from the same process (reference counting only happens via dup or fork). So repeatedly opening even a single file is enough to exhaust the per-process limit.
I would recommend reading up on stress testing.
Here is some usable software (you didn't tag an OS platform):
http://www.opensourcetesting.org/performance.php
I had this happen once in normal use. I believe you run out of inodes in Linux. I don't know a faster way than just opening files. Just be careful; we locked our system up. It was a while ago, so I don't remember what was trying to open a file, but things generally assume they can get a file handle and don't behave as well as they should when they can't. ~Ben
My 2 cents:
1. Write a program that creates a lot of file descriptors. You can achieve this by one of the following methods:
(a) opening a lot of different files in your code
(b) opening a lot of socket descriptors
(c) creating a lot of threads
2. Now keep spawning multiple instances of the program created in step 1 (i.e. create multiple processes) using a shell script or something similar.
Note:
In Linux, as in most other operating systems, there is a limit on the number of file descriptors per process (in Linux the default is 1024, I believe; you can check it using ulimit -a). So your process will simply fail when you do this. I am really not sure that you can bring the whole system down just by increasing file descriptor usage.
You can use mkstemp to get file descriptors of temporary files.
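Putting the pieces together, a sketch of the exhaustion loop itself (using /dev/null so no real files are touched) might be:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int count = 0;

    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            /* EMFILE: per-process limit hit; ENFILE: system-wide limit hit */
            if (errno == EMFILE || errno == ENFILE)
                fprintf(stderr, "out of descriptors after %d opens\n", count);
            else
                perror("open");
            break;
        }
        count++;
    }

    pause();   /* hold the descriptors until the process is killed */
    return 0;
}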

Creating unflushed file output buffers

I am trying to clear up an issue that occurs with unflushed file I/O buffers in a couple of programs, in different languages, running on Linux. The solution of flushing buffers is easy enough, but this issue of unflushed buffers happens quite randomly. Rather than seek help on what may cause it, I am interested in how to create (reproduce) and diagnose this kind of situation.
This leads to a two-part question:
Is it feasible to artificially and easily construct instances where, for a given period of time, one can have output buffers that are known to be unflushed? My searches are turning up empty. A trivial baseline is to hammer the hard drive (e.g. swapping) in one process while trying to write a large amount of data from another process. While this "works", it makes the system practically unusable: I can't poke around and see what's going on.
Are there commands from within Linux that can identify that a given process has unflushed file output buffers? Is this something that can be run at the command line, or is it necessary to query the kernel directly? I have been looking at fsync, sync, ioctl, flush, bdflush, and others. However, lacking a method for creating unflushed buffers, it's not clear what these may reveal.
In order to reproduce for others, an example for #1 in C would be excellent, but the question is truly language agnostic - just knowing an approach to create this situation would help in the other languages I'm working in.
Update 1: My apologies for any confusion. As several people have pointed out, buffers can be in the kernel space or the user space. This helped pinpoint the problems: we're creating big dirty kernel buffers. This distinction and the answers completely resolve #1: it now seems clear how to re-create unflushed buffers in either user space or kernel space. Identifying which process ID has dirty kernel buffers is not yet clear, though.
If you are interested in the kernel-buffered data, then you can tune the VM writeback through the sysctls in /proc/sys/vm/dirty_*. In particular, dirty_expire_centisecs is the age, in hundredths of a second, at which dirty data becomes eligible for writeback. Increasing this value will give you a larger window of time in which to do your investigation. You can also increase dirty_ratio and dirty_background_ratio (which are percentages of system memory, defining the point at which synchronous and asynchronous writeback start respectively).
Actually creating dirty pages is easy - just write(2) to a file and exit without syncing, or dirty some pages in a MAP_SHARED mapping of a file.
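As a sketch of the first method (the path "dirty.dat" and the 4 MB size are arbitrary choices):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("dirty.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    char block[4096];
    memset(block, 'x', sizeof block);
    for (int i = 0; i < 1024; i++)          /* ~4 MB of dirty pages */
        if (write(fd, block, sizeof block) == -1) { perror("write"); break; }

    close(fd);   /* deliberately no fsync(): the pages stay dirty in the kernel */
    return 0;
}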
A simple program that would have an unflushed buffer would be:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("moo");   /* no newline, so the output sits in the stdio buffer */
    pause();         /* block forever; the buffer is never flushed */
}
By default, stdio only flushes stdout on newlines when it is connected to a terminal; when stdout is redirected to a file or pipe, it is fully buffered.
It is very easy to cause unflushed buffers by controlling the receiving side. The beauty of *nix systems is that everything looks like a file, so you can use special files to do what you want. The easiest option is a pipe. If you just want to control stdout, this is the simplest option: unflushed_program | slow_consumer. Otherwise, you can use named pipes:
mkfifo pipe_file
unflushed_program --output pipe_file
slow_consumer --input pipe_file
slow_consumer is most likely a program you design to read data slowly, or just read X bytes and stop.
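A possible slow_consumer, reading from stdin so it works both in a pipeline and redirected from the FIFO (slow_consumer < pipe_file), could be as simple as:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        sleep(1);                        /* drain roughly 16 bytes per second */
        fwrite(buf, 1, (size_t)n, stdout);
    }
    return 0;
}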

System.out.print + OS redirect vs writing to a file, which is faster?

I'm making a program that processes a big file and outputs something to another file that I need to use later. I'm wondering whether I should just print the output and redirect that to a file, or write to the file in the program. Since this will be a very big file, I would like to know which way is faster; every bit counts.
Your question is really, "Should I write to stdout, or use native file I/O?"
The answer will depend somewhat on how you process the file (can it be processed and output line by line?) and how optimally your file I/O code is written.
It's quite possible to write code that outputs directly to a file yet is slower than code that writes to stdout.
What's the difference? Stdout is a stream, and so is a file. On most operating systems, there is literally no difference. On Windows, there are different functions you have to use when handling file streams vs. output streams, but they're still almost exactly the same API (the file ones are just prefixed with 'f'). I'd be very surprised if there's a difference in performance.
You can of course use alternative APIs for files, but I do not see a compelling reason to do so, since the files are still streams at the OS level.
If the typical output is a text format, I would prefer stdout. You can simply check whether you are on a terminal, redirect to a file, or pipe the output to the next command. The performance should be the same. For binary output, writing to a file is more typical.
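Checking whether stdout is a terminal is a one-liner with isatty; a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (isatty(STDOUT_FILENO))
        fprintf(stderr, "stdout is a terminal: stdio will line-buffer it\n");
    else
        fprintf(stderr, "stdout is redirected: stdio will fully buffer it\n");

    printf("actual output goes here\n");
    return 0;
}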

programs running in parallel, read/writing in C

I'm considering a set of 4 programs: (Prog1, Prog2, Prog3, Prog4)
interacting with 4 files (FileA, FileB, FileC, FileD)
Prog1: writes (appends) to FileA
Prog2: reads FileA and writes (appends) to FileB
Prog3: reads FileA and writes (appends) to FileC
Prog4: reads FileB and writes (appends) to FileD
Potentially, Prog1 might also read something on startup and write continuously to, say, FileX.
Now all 4 programs will be running simultaneously (potentially over a network, but that shouldn't matter). Will this work?
Do I need to set "strobes" or "busy" signals (I could do that with, say, mkdir and rmdir)?
Is your problem that of synchronizing the reads/writes? Writes are the more problematic part, since they modify the contents. Further, the nature of the write (append at the end, append at the beginning, etc.) may complicate your situation further. I have a feeling you may need to look up file locks, mutexes, etc. A lot depends on the OS(es) you plan to run these on. Boost.Interprocess is a good place to start.
I think you need some kind of real FIFO structure here, also called pipes. There are constructs with that name under both Windows and Unix-flavoured OSes.
An example under Linux can be found here; named pipes under Windows, here.
It can be made to work. You need to consider which process opens each file, and where the data that Prog1 is writing comes from.
If each program opens the files it works with, then there is no major problem. The main issue is the same as 'tail -f' has to deal with, namely that each of the reading processes is likely to read to EOF, and then has to pause and retry to see when more data becomes available.
If you have a central process that opens all the files, then you need to open file A for reading twice so that Prog2 and Prog3 have independent access to the file. However, it seems more sensible for any coordinator process to simply tell the children which files to open.
I don't see any need for strobes or busy signals. You've not warned of any 'near hard real time' response requirements or other special conditions that might warrant special programming.
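A sketch of that 'tail -f' pattern for one of the readers (FileA is from the question; the 1-second sleep is an arbitrary choice):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *in = fopen("FileA", "r");
    if (in == NULL) { perror("fopen"); return 1; }

    char line[4096];
    for (;;) {
        if (fgets(line, sizeof line, in) != NULL) {
            fputs(line, stdout);     /* process the newly appended data */
        } else {
            clearerr(in);            /* clear EOF so the next fgets can succeed */
            sleep(1);                /* wait for the writer to append more */
        }
    }
}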
