I have the following Bash script:
cat | command1 | command2 | command3
The commands never change.
For performance reasons, I want to replace it with a small C program that runs the commands and creates and assigns the pipes accordingly.
Is there a way to do that in C?
As others said, you probably won't get a significant performance benefit.
It's reasonable to assume that the commands you run take most of the time, not the shell script gluing them together, so even if the glue becomes faster, it will change almost nothing.
Having said that, if you want to do it, you should use the fork(), pipe(), dup2() and exec() functions.
fork will give you multiple processes.
pipe will give you a pair of file descriptors - what you write into one, you can read from the other.
dup2 can be used to change file descriptor numbers. You can take one side of a pipe and make it become file descriptor 1 (stdout) in one process, and the other side you'll make file descriptor 0 (stdin) in another (don't forget to close the normal stdin, stdout first).
exec (or one of its variants) will be used to execute the programs.
There are lots of details to fill in. Have fun.
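For the concrete pipeline in the question, a minimal sketch could look like the following (error checking is mostly omitted, and since cat merely copies stdin, command1 just reads the program's own stdin directly):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void spawn(char *const argv[], int in_fd, int out_fd)
{
    pid_t pid = fork();
    if (pid == 0) {                           /* child */
        if (in_fd != STDIN_FILENO) {
            dup2(in_fd, STDIN_FILENO);        /* pipe read end -> stdin   */
            close(in_fd);
        }
        if (out_fd != STDOUT_FILENO) {
            dup2(out_fd, STDOUT_FILENO);      /* pipe write end -> stdout */
            close(out_fd);
        }
        execvp(argv[0], argv);
        perror("execvp");
        _exit(127);
    }
    /* parent: close its copies so readers see EOF when the writers exit */
    if (in_fd != STDIN_FILENO)
        close(in_fd);
    if (out_fd != STDOUT_FILENO)
        close(out_fd);
}

int main(void)
{
    char *cmd1[] = { "command1", NULL };
    char *cmd2[] = { "command2", NULL };
    char *cmd3[] = { "command3", NULL };
    int p1[2], p2[2];

    pipe(p1);
    spawn(cmd1, STDIN_FILENO, p1[1]);         /* command1 < our stdin > p1 */
    pipe(p2);
    spawn(cmd2, p1[0], p2[1]);                /* command2 < p1        > p2 */
    spawn(cmd3, p2[0], STDOUT_FILENO);        /* command3 < p2        > our stdout */

    while (wait(NULL) > 0)                    /* reap all three children */
        ;
    return 0;
}

Run it with the same input you gave the script and it behaves like the original pipeline.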
Here is an example that does pretty much this.
There is no performance benefit for the processing itself, just a couple of milliseconds in initialization. Obviously we don't know the context in which you're doing this, but just using dash instead of bash would probably have gotten you 80% of those milliseconds with a single-character change in your #! line.
I'm writing a library that should execute a program in a child process, capture the output, and make the output available line by line (as string vectors). There is one vector for STDOUT, one for STDERR, and one for "STDCOMBINED", i.e. all output in the order it was printed by the program. The child process is connected via two pipes to the parent process: one pipe for STDOUT and one for STDERR. In the parent process I read from the read ends of the pipes; in the child process I dup2()'ed STDOUT/STDERR to the write ends of the pipes.
My problem:
I'd like to capture STDOUT, STDERR, and "STDCOMBINED" (= both in the order they appeared). But the order in the combined vector is different from the original order.
My approach:
I iterate until both pipes show EOF and the child process has exited. At each iteration I read exactly one line (or EOF) from STDOUT and exactly one line (or EOF) from STDERR. This works so far. But when I capture the lines as they arrive in the parent process, the order of STDOUT and STDERR is not the same as if I execute the program in a shell and look at the output.
Why is this so and how can I fix this? Is this possible at all? I know in the child process I could redirect STDOUT and STDERR both to a single pipe but I need STDOUT and STDERR separately, and "STDCOMBINED".
PS: I'm familiar with libc/unix system calls, like dup2(), pipe(), etc. Therefore I didn't post code. My question is about the general approach and not a coding problem in a specific language. I'm doing it in Rust against the raw libc bindings.
PPS: I made a simple test program that emits a mix of 5 stdout and 5 stderr messages. That's enough to reproduce the problem.
At each iteration I read exactly one line (or EOF) from STDOUT and exactly one line (or EOF) from STDERR.
This is the problem. This will only capture the correct order if that was exactly the order of output in the child process.
You need to capture the asynchronous nature of the beast: make your pipe endpoints nonblocking, select* on the pipes, and read whatever data is present as soon as select returns. Then you'll capture the correct order of the output. Of course now you can't be reading "exactly one line": you'll have to read whatever data is available and no more, so that you won't block, and maintain a per-pipe buffer where you append new data, extract any complete lines, move the unprocessed remainder to the beginning, and repeat. You could also use a circular buffer to save a little bit of memcpy-ing, but that's probably not very important.
Since you're doing this in Rust, I presume there's already a good asynchronous reaction pattern that you could leverage (I'm spoiled by Go, I guess, and project my hopes onto the unsuspecting).
*Always prefer platform-specific higher-performance primitives like epoll on Linux, /dev/poll on Solaris, pollset &c. on AIX
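To make the idea concrete, here is a rough C sketch of the select() loop described above (the Rust version against raw libc would have the same shape). out_fd and err_fd are assumed to be the read ends of the stdout and stderr pipes, and the fwrite() call stands in for the real per-pipe line buffering:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

void merge_child_output(int out_fd, int err_fd)
{
    fcntl(out_fd, F_SETFL, O_NONBLOCK);
    fcntl(err_fd, F_SETFL, O_NONBLOCK);

    int  open_fds = 2;
    char buf[4096];

    while (open_fds > 0) {
        fd_set rfds;
        FD_ZERO(&rfds);
        if (out_fd >= 0) FD_SET(out_fd, &rfds);
        if (err_fd >= 0) FD_SET(err_fd, &rfds);
        int maxfd = (out_fd > err_fd ? out_fd : err_fd);

        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR) continue;
            break;
        }

        for (int which = 0; which < 2; which++) {
            int *fd = (which == 0) ? &out_fd : &err_fd;
            if (*fd < 0 || !FD_ISSET(*fd, &rfds))
                continue;
            ssize_t n = read(*fd, buf, sizeof buf);
            if (n > 0) {
                /* Append buf[0..n) to this pipe's buffer and to the
                   combined log; split out complete lines later. */
                fwrite(buf, 1, n, stdout);    /* placeholder for real handling */
            } else if (n == 0 || (errno != EAGAIN && errno != EINTR)) {
                close(*fd);                   /* EOF: stop watching this pipe */
                *fd = -1;
                open_fds--;
            }
        }
    }
}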
Another possibility is to launch the target process with LD_PRELOAD, with a dedicated library that takes over glibc's POSIX write, detects writes to the pipes, and encapsulates each such write (and only those) in a packet, prepending a header that contains an (atomically updated) process-wide incrementing counter as well as the size of the write. Such headers can easily be decoded on the other end of the pipe to reorder the writes with a higher chance of success.
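For illustration only, an interposer along those lines might look roughly like this. All names are made up, a real version would check that the descriptor is actually one of the pipes (not just fd 1/2), stdio inside the child may bypass an interposed write() in some libc builds, and partial-write/atomicity handling is left out:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* Build with: gcc -shared -fPIC -o seqwrite.so seqwrite.c -ldl
   Run the child with LD_PRELOAD=./seqwrite.so */

static atomic_uint_fast64_t seq;                     /* process-wide counter */

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    if (fd != STDOUT_FILENO && fd != STDERR_FILENO)  /* only tag stdout/stderr */
        return real_write(fd, buf, count);

    uint64_t header[2] = {
        atomic_fetch_add(&seq, 1),                   /* ordering counter */
        count                                        /* payload size     */
    };
    real_write(fd, header, sizeof header);           /* the reader strips this */
    return real_write(fd, buf, count);
}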
I think it's not possible to strictly do what you want to do.
If you think about how it's done when running a command in an interactive shell, what happens is that both stdout and stderr point to the same file descriptor (the TTY), so the total ordering is correct by means of synchronization against the same file.
To illustrate, imagine what happens if the child process has 2 completely independent threads, one only writing to stderr, and the other only writing to stdout. The total ordering would depend on however the scheduler decided to schedule these threads, and if you wanted to capture that, you'd need to synchronize those threads against something.
And of course, something can write thousands of lines to stdout before writing anything to stderr.
There are 2 ways to relax your requirements into something workable:
Have the user pass a flag waiving separate stdout and stderr streams in favor of a correct stdcombined, and then redirect both to a single file descriptor (see the sketch after this list). You might need to change the buffering settings (like stdbuf does) before you execute the process.
Assume that stdout and stderr are "reasonably interleaved", an assumption pointed out by @Nate Eldredge, in which case you can use @Unslander Monica's answer.
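The sketch for option 1 is tiny: in the child, right before exec, point both streams at the write end of a single pipe (combined_fd is an assumed name for that descriptor), so the interleaving is decided by the kernel exactly as it would be on a shared TTY:

#include <unistd.h>

static void redirect_combined(int combined_fd)
{
    dup2(combined_fd, STDOUT_FILENO);
    dup2(combined_fd, STDERR_FILENO);
    close(combined_fd);
    /* Caveat: the exec'ed program will usually switch stdout to full
       buffering because it now writes to a pipe; an stdbuf(1)-style
       LD_PRELOAD trick is needed to keep it line buffered. */
}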
I'm looking to run X number of processes that I'm able to iterate through, in order to run programs where there's a master and 'slaves' that take the master's orders and return a string.
I'm writing in C. I'm wondering how I'd be able to set up pipes and forking between these processes to read from standard in and out. I'm currently able to have them work one at a time until they are killed, but I would like to simply read one line and then move on to the next process. Any help?
Generally, the common strategy for this sort of programming is to set up an event loop.
You would set up pipes and connect them to stdin and stdout of your program.
You don't specify what language you're using.
In C, you would create two pipes, one for reading, and one for writing.
Then you would fork. After the fork, in the child, you close stdin and stdout, and you use the dup2 system call to duplicate the appropriate pipe ends onto the child's file descriptors 0 and 1.
In the parent, you connect each process to an event loop, which lets you know when one of your FDs is ready for reading or writing.
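A bare-bones C sketch of that setup might look like this (spawn_worker, to_child and from_child are names I made up for illustration):

#include <unistd.h>

pid_t spawn_worker(char *const cmd[], int *to_child, int *from_child)
{
    int in_pipe[2], out_pipe[2];          /* parent->child, child->parent */
    pipe(in_pipe);
    pipe(out_pipe);

    pid_t pid = fork();
    if (pid == 0) {                       /* child */
        dup2(in_pipe[0],  STDIN_FILENO);  /* read the master's orders     */
        dup2(out_pipe[1], STDOUT_FILENO); /* answer on stdout             */
        close(in_pipe[0]);  close(in_pipe[1]);
        close(out_pipe[0]); close(out_pipe[1]);
        execvp(cmd[0], cmd);
        _exit(127);
    }

    close(in_pipe[0]);                    /* parent keeps the other ends */
    close(out_pipe[1]);
    *to_child   = in_pipe[1];
    *from_child = out_pipe[0];
    return pid;
}

The parent then registers each from_child descriptor with its event loop and writes requests to the corresponding to_child descriptor.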
Take a look at these class notes for discussion of using pipes and dup2.
Here's an introduction to libevent, one of the common event loops for C.
For other languages you'll do something similar. For example for Python, take a look at the asyncio support for subprocesses.
I wrote a C program and in the program there are many printf() calls that output log information to stdout. Now I want to use multiple processes to run the program simultaneously with different arguments, and I want to redirect the output from stdout to a log file using >.
But since multiple processes are running at the same time, their log output overlaps, which can be confusing for future analysis.
One solution is: considering that different processes will exit at different times, modify the C program so that each log message is temporarily written into a temporary file. When the C program is about to exit, read from that temporary file and write the content to stdout. But this requires a lot of modification.
My idea is: in the C program, buffer all the printf() output and write it to stdout (or the redirection target) only when the process exits.
Is this possible or not?
thanks!
This is not really possible unless you are sure that the output is reasonably bounded (e.g. the total output is less than a few megabytes); otherwise, use a logging mechanism which sends to some central logger (like syslog).
On Linux and most Posix systems, the simplest way to do logging would be to use syslog(3) which is designed for logging (and is able to deal with different processes). I think this is the preferable approach.
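Usage is just a few calls; "myprog" here is a placeholder identifier:

#include <syslog.h>

int main(void)
{
    openlog("myprog", LOG_PID, LOG_USER);   /* LOG_PID distinguishes the processes */
    syslog(LOG_INFO, "worker started with argument %d", 42);
    /* ... the rest of the program logs with syslog() instead of printf() ... */
    closelog();
    return 0;
}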
With GNU libc, you could consider using open_memstream(3) to write to memory (here you need to be sure the total output is bounded) and use atexit(3) to have the memory stream written into some file at program exit; you probably want to use some locking mechanism like flock(2).
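A minimal sketch of that combination, assuming the program writes its log to the memory stream instead of stdout and that "combined.log" stands in for whatever file you want the output merged into:

#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

static char  *log_buf;
static size_t log_len;
static FILE  *log_stream;

static void dump_log(void)                  /* registered with atexit() */
{
    fclose(log_stream);                     /* finalizes log_buf / log_len */
    FILE *out = fopen("combined.log", "a");
    if (!out) return;
    flock(fileno(out), LOCK_EX);            /* keep other processes' dumps whole */
    fwrite(log_buf, 1, log_len, out);
    flock(fileno(out), LOCK_UN);
    fclose(out);
    free(log_buf);
}

int main(void)
{
    log_stream = open_memstream(&log_buf, &log_len);
    atexit(dump_log);

    fprintf(log_stream, "this line is buffered in memory until exit\n");
    /* ... the rest of the program logs to log_stream ... */
    return 0;
}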
As commented by J.Holetzeck the simplest way is to redirect output into different files (perhaps using freopen(3), or simply in the invoking shell), and later merge these files.
I'm guessing you use Linux, or some Posix system. For Windows, I have no idea.
Is it possible for a caller program in C to know how many bytes it has printed to a file stream such as stdout without actually counting and adding up the return values of printf?
I am trying to implement control of the quantity of output of a C program which uses libraries to print, but the libraries don't report the amount of data they have printed out.
I am interested in either a general solution or a Unix-specific one.
POSIX-specific: redirect stdout to a file, flush after all writing is done, then stat the file and look at st_size (or use the ls command).
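If you want the number from inside the process rather than from ls, a variant of the same idea is to fstat() the descriptor behind stdout after flushing. This is only meaningful when the program was started with its stdout freshly redirected to a regular file (./prog > out.txt) and nothing else writes to that file:

#include <stdio.h>
#include <sys/stat.h>

long long bytes_printed_so_far(void)
{
    struct stat st;
    fflush(stdout);                       /* push stdio's buffer to the file */
    if (fstat(fileno(stdout), &st) != 0)
        return -1;
    return (long long)st.st_size;
}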
Update: You say you're trying to control the quantity of output of a program. The POSIX head command will do that. If that's not satisfactory, then state your requirements clearly.
It's rather a heavyweight solution, but the following will work:
Create a pipe by calling pipe()
Spawn a child process
In the parent: redirect stdout to the write-side of the pipe, and close the read side (and the old stdout)
In the child: keep reading from the read side of the pipe, and copying the data to the inherited stdout (which is the original stdout) - counting it as it goes past
In the parent, keep writing to stdout (which is now the pipe) as usual
Use some form of IPC to communicate the result of the count from the child to the parent.
Basically the idea is to spawn a child process, and pipe all output through it, and have the child process count all the data as it goes through.
The precise form of IPC to use may vary - for example, shared memory (with atomic reads/writes on each side) would work well for fast transfer of data, but other methods (such as sockets, more pipes etc) are possible, and offer better scope for synchronisation.
The trickiest part is the synchronisation, i.e. ensuring that, at the time the child tells the parent how much data has been written, it has already processed all the data that the parent said (and there is none left in the pipe, for example). How important this is will depend on exactly what your aim is - if an approximate indication is all that's required, then you may be able to get away with using shared memory for IPC and not performing any explicit synchronisation; if the total is only required at the end then you can close stdout from the parent, and have the child indicate in the shared memory when it has received the eof notification.
If you require more frequent readouts, which must be exact, then something more complex will be required, but this can be achieved by designing some sort of protocol using sockets, pipes, or even condvars/semaphores/etc in the shared memory.
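A stripped-down sketch of the basic plumbing, without the synchronisation discussed above; for brevity the child simply reports the total on its (inherited, original) stderr once the pipe reaches EOF instead of sending it back over separate IPC:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                         /* child: the counter */
        close(fd[1]);
        char buf[4096];
        ssize_t n;
        long long total = 0;
        while ((n = read(fd[0], buf, sizeof buf)) > 0) {
            fwrite(buf, 1, n, stdout);      /* forward to the original stdout */
            total += n;
        }
        fprintf(stderr, "bytes written: %lld\n", total);
        _exit(0);
    }

    /* parent: route its own stdout into the pipe */
    close(fd[0]);
    dup2(fd[1], STDOUT_FILENO);
    close(fd[1]);

    printf("hello %s\n", "world");          /* ordinary output, now counted */
    fclose(stdout);                         /* EOF for the child */
    wait(NULL);
    return 0;
}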
printf returns the number of bytes written.
Add them up.
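For example, with a small wrapper (counted_printf is just an illustrative name) the adding-up happens in one place; call it everywhere you currently call printf() and total_printed holds the running sum:

#include <stdarg.h>
#include <stdio.h>

static long long total_printed;             /* running byte count */

int counted_printf(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vprintf(fmt, ap);
    va_end(ap);
    if (n > 0)
        total_printed += n;
    return n;
}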
No idea how reliable this is, but you can use ftell on stdout:
long int start = ftell(stdout);
printf("abcdef\n");
printf("%ld\n", ftell(stdout) - start); // >> 7
EDIT
Checked this on Ubuntu Precise: it does not work if the output goes to the console, but does work if it is redirected to a file.
$ ./a.out
abcdef
0
$ ./a.out >tt
$ cat tt
abcdef
7
$ echo `./a.out`
abcdef 0
$ echo `cat tt`
abcdef 7
How does piping work? If I run a program via the CLI and redirect output to a file, will I be able to pipe that file into another program as it is being written?
Basically when one line is written to the file I would like it to be piped immediately to my second application (I am trying to dynamically draw a graph off an existing program). Just unsure if piping completes the first command before moving on to the next command.
Any feed back would be greatly appreciated!
If you want to redirect the output of one program into the input of another, just use a simple pipeline:
program1 arg arg | program2 arg arg
If you want to save the output of program1 into a file and pipe it into program2, you can use tee(1):
program1 arg arg | tee output-file | program2 arg arg
All programs in a pipeline are run simultaneously. Most programs typically use blocking I/O: if, when they try to read their input, nothing is there, they block: that is, they stop, and the operating system de-schedules them until more input becomes available (to avoid eating up the CPU). Similarly, if a program earlier in the pipeline is writing data faster than a later program can read it, eventually the pipe's buffer fills up and the writer blocks: the OS de-schedules it until the pipe's buffer gets emptied by the reader, and then it can continue writing again.
EDIT
If you want to use the output of program1 as the command-line parameters, you can use the backquotes or the $() syntax:
# Runs "program1 arg", and uses the output as the command-line arguments for
# program2
program2 `program1 arg`
# Same as above
program2 $(program1 arg)
The $() syntax should be preferred, since it is clearer and can be nested.
Piping does not complete the first command before running the second. Unix (and Linux) piping runs all commands concurrently. A command will be suspended if
It is starved for input.
It has produced significantly more output than its successor is ready to consume.
For most programs, output is buffered, which means that the C library accumulates a substantial amount of output (perhaps 8000 characters or so) before passing it on to the next stage of the pipeline. This buffering is used to avoid too much switching back and forth between the process and the kernel.
If you want output on a pipeline to be sent right away, you can flush it explicitly, which in C means calling something like fflush() after each write (or disabling buffering with setvbuf()) to be sure that any buffered output is immediately sent on to the next process. Unbuffered input is also possible but is generally unnecessary, because a process that is starved for input typically does not wait for a full buffer but will process any input it can get.
For typical applications, unbuffered output is not recommended; you generally get the best performance with the defaults. In your case, however, where you want to do dynamic graphing as soon as the first process has the info available, you definitely want to be using unbuffered output. If you're using C, calling fflush(stdout) whenever you want output sent will be sufficient.
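A sketch of both options in C; note that setvbuf() must be called before anything is written to the stream:

#include <stdio.h>

int main(void)
{
    /* Force line buffering even when stdout is a pipe, so each completed
       line is handed to the next process immediately. */
    setvbuf(stdout, NULL, _IOLBF, 0);

    for (int i = 0; i < 10; i++) {
        printf("sample %d\n", i);
        /* If you'd rather keep the default buffering, call
           fflush(stdout); here instead of using setvbuf() above. */
    }
    return 0;
}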
If your programs are communicating using stdin and stdout, then make sure that you are either calling fflush(stdout) after you write or finding some way to disable standard I/O buffering. The best references that I can think of that really describe how to best implement pipelines in C/C++ are Advanced Programming in the UNIX Environment and UNIX Network Programming: Volume 2. You could probably start with this article as well.
If your two programs insist on reading and writing to files and do not use stdin/stdout, you may find you can use a named pipe instead of a file.
Create a named pipe with the mknod(1) command:
$ mknod /tmp/named-pipe p
Then configure your programs to read and write to /tmp/named-pipe (use whatever path/name you feel is appropriate).
In this case, both programs will run in parallel, blocking as necessary when the pipe becomes full/empty as described in the other answers.
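For completeness, the reading side in C is just ordinary file I/O on the FIFO path (the path from the mknod example above); the fopen() blocks until a writer opens the same path:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/tmp/named-pipe", "r");   /* blocks until a writer appears */
    if (!fp) { perror("fopen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, fp))        /* EOF when the writer closes */
        printf("got: %s", line);

    fclose(fp);
    return 0;
}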