The magic of STREAMS in Linux. When to finish? - c

Today at 5am I read an article about the read system call, and things became significantly clearer for me.
ssize_t read(int fd, void *buf, size_t count);
The design of *nix-like operating systems is amazing in its simplicity: a file interface for any entity. You just ask it to write some data from this fd into memory pointed to by buf. It's the same for network sockets, files, and streams.
But a question appears.
How do I distinguish two cases?
1) The stream is empty and I need to wait for new data. 2) The stream is closed and I need to exit the program.
Here is a scenario:
Reading data from STDIN in a loop, with STDIN redirected from a pipe.
Some text_data appears.
Do I just read byte by byte until... what? EOF in memory, or 0 as the result of the read call?
How will the program know whether to wait for new input or to exit?
This is unclear in the case of endless or continuous streams.
UPD After speaking with Bailey Kocin and reading some docs I have this understanding. Correct me if I'm wrong.
read blocks the program's execution and waits for count bytes.
When count bytes arrive, read writes them into buf and execution continues.
When the stream is closed, read returns 0, and that is the signal that the program may finish.
Question: Does EOF appear in buf?
UPD2 EOF is a constant that can appear in the output of the getc function:
ch = getc(fp);
while (ch != EOF) {
    /* display contents of file on screen */
    putchar(ch);
    ch = getc(fp);
}
But in the case of read, the EOF value does not appear in buf. The read system call signals the end of a file by returning 0, instead of writing the EOF constant into the data area as getc does.
EOF is a constant that may vary across systems, and it is used by getc.

Let's deal first with your original question. Note that man 7 pipe should give some useful information on this.
Say we have the standard input redirected to the input side of a descriptor created by a pipe call, as in:
pipe(p);
// ... fork a child to write to the output side of the pipe ...
dup2(p[0], 0); // redirect standard input to input side
and we call:
bytes = read(0, buf, 100);
First, note that this behaves no differently than simply reading directly from p[0], so we could have just done:
pipe(p);
// fork child
bytes = read(p[0], buf, 100);
Then, there are essentially three cases:
1) If there are bytes in the pipe (i.e., at least one byte has been written but not yet read), then the read call will return immediately, and it will return all bytes available up to a maximum of 100 bytes. The return value will be the number of bytes read, and it will always be a positive number between 1 and 100.
2) If the pipe is empty (no bytes) and the output side has been closed, the buffer won't be touched, and the call will return immediately with return value of 0.
3) Otherwise, the read call will block until something is written to the pipe or the output side is closed, and then the read call will return immediately using the rules in cases 1 and 2.
So, if a read() call returns 0, that means the end-of-file was reached, and no more bytes are expected. Waiting for additional data happens automatically, and after the wait, you'll either get data (positive return value) or an end-of-file signal (zero return value). In the special case that another process writes some bytes and then immediately closes (the output side of) the pipe, the next read() call will return a positive value up to the specified count. Subsequent read() calls will continue to return positive values as long as there's more data to read. When the data are exhausted, the read() call will return 0 (since the pipe is closed).
On Linux, the above is always true for pipes and any positive count. There can be differences for things other than pipes. Also, if the count is 0, the read() call will always return immediately with return value 0. Note that, if you are trying to write code that runs on platforms other than Linux, you may have to be more careful. An implementation is allowed to return a non-zero number of bytes less than the number requested, even if more bytes are available in the pipe -- this might mean that there's an implementation-defined limit (so you never get more than 4096 bytes, no matter how many you request, for example) or that this implementation-defined limit changes from call to call (so if you request bytes over a page boundary in a kernel buffer, you only get the end of the page or something). On Linux, there's no limit -- the read call will always return everything available up to count, no matter how big count is.
Anyway, the idea is that something like the following code should reliably read all bytes from a pipe until the output side is closed, even on platforms other than Linux:
#define _GNU_SOURCE 1
#include <errno.h>
#include <unistd.h>

/* ... */
char buffer[4096];   /* any convenient size */
ssize_t count;

while ((count = TEMP_FAILURE_RETRY(read(fd, buffer, sizeof(buffer)))) > 0) {
    // process "count" bytes in "buffer"
}
if (count == -1) {
    // handle error
}
// otherwise, end of data reached
If the pipe is never closed ("endless" or "continuous" stream), the while loop will run forever because read will block until it can return a non-zero byte count.
Note that the pipe can also be put into a non-blocking mode which changes the behavior substantially, but the above is the default blocking mode behavior.
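For completeness, here is a minimal sketch of the non-blocking case (my own illustration, not from the question; assume fd is the read end of the pipe): with O_NONBLOCK set, read() never blocks, and an empty-but-open pipe is reported through errno rather than by waiting.
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* ... inside some function; "fd" is the read end of the pipe ... */
int flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

char buffer[4096];
ssize_t count = read(fd, buffer, sizeof(buffer));
if (count > 0) {
    /* got "count" bytes */
} else if (count == 0) {
    /* write side closed: end of data */
} else if (errno == EAGAIN || errno == EWOULDBLOCK) {
    /* pipe is empty but still open: try again later (e.g. after poll()) */
} else {
    /* a real error */
}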
With respect to your UPD questions:
Yes, read holds program execution until data is available, but no, it doesn't necessarily wait for count bytes. It will wait for at least one non-empty write to the pipe, and that will wake the process; when the process gets a chance to run, it will return whatever's available, up to but not necessarily equal to count bytes. Usually, this means that if another process writes 5 bytes, a blocked read(fd, buffer, 100) call will return 5 and execution will continue.
Yes, if read returns 0, it's a signal that there's no more data to be read and the write side of the pipe has been closed (so no more data will ever be available).
No, an EOF value does not appear in the buffer. Only bytes read will appear there, and the buffer won't be touched when read() returns 0, so it'll contain whatever was there before the read() call.
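To make that concrete, here is a small self-contained sketch (my own illustration, assuming Linux/POSIX): the child writes 5 bytes, the parent asks for up to 100, and the first read() returns as soon as those 5 bytes are available; the second read() returns 0 once the write end is closed.
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int p[2];
    char buf[100];

    if (pipe(p) == -1) return 1;

    if (fork() == 0) {              /* child: write 5 bytes, then exit */
        close(p[0]);
        write(p[1], "hello", 5);
        close(p[1]);
        _exit(0);
    }

    close(p[1]);                    /* parent: keep only the read end open */
    ssize_t n = read(p[0], buf, sizeof(buf));
    printf("first read returned %zd bytes\n", n);   /* typically 5 */
    n = read(p[0], buf, sizeof(buf));
    printf("second read returned %zd bytes\n", n);  /* 0: write end closed */

    close(p[0]);
    wait(NULL);
    return 0;
}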
With respect to your UPD2 comment:
Yes, on Linux, EOF is a constant equal to the integer -1. (Technically, according to the C99 standard, it is an integer constant equal to a negative value; maybe someone knows of a platform where it's something other than -1.) This constant is not used by the read() interface, and it is certainly not written into the buffer. While read() returns -1 in case of error, it would be considered bad practice to compare the return value from read() with EOF instead of -1. As you note, the EOF value is really only used for C library functions like getc() and getchar() to distinguish the end of file from a successfully read character.
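As an aside (my own illustration): this is also why the result of getchar() must be stored in an int before comparing it with EOF; a char cannot reliably hold both every byte value and EOF.
#include <stdio.h>

int main(void) {
    int c;                              /* must be int, not char, to hold EOF */
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    /* If c were a char, a byte of value 0xFF could compare equal to EOF on
     * platforms where char is signed, or the loop might never end where
     * char is unsigned. */
    return 0;
}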

Related

getchar() keeps returning EOF even after subsequent calls but read() system calls seem to "clear" the stdin. What are the reasons behind this?

#include <stdio.h>
#include <unistd.h>

char buff[1];

int main() {
    int c;
    c = getchar();
    printf("%d\n", c); // output -1
    c = getchar();
    printf("%d\n", c); // output -1

    int res;
    // here I get a prompt for input. What happened to EOF?
    while ((res = read(0, buff, 1)) > 0) {
        printf("Hello\n");
    }
    while ((res = read(0, buff, 1)) > 0) {
        printf("Hello\n");
    }
    return 0;
}
The output shown in the comments in the code is the result of simply typing Ctrl-D (EOF on macOS).
I'm a bit confused about the behaviour of getchar(), especially when compared to read.
Shouldn't the read system calls inside the while loop also return EOF? Why do they prompt the user? Has some sort of stdin clear occurred?
Considering that getchar() uses the read system call under the hood, how come they behave differently? Shouldn't stdin be "unique" and the EOF condition shared?
How come, in the following code, both read system calls return EOF when a Ctrl-D input is given?
int res;
while ((res = read(0, buff, 1)) > 0) {
    printf("Hello\n");
}
while ((res = read(0, buff, 1)) > 0) {
    printf("Hello\n");
}
I'm trying to find the logic behind all this. I hope someone can make it clear what EOF really is and how it really behaves.
P.S. I'm using a macOS machine.
Once the end-of-file indicator is set for stdin, getchar() does not attempt to read.
Clear the end-of-file indicator (e.g. clearerr() or others) to re-try reading.
The getchar function is equivalent to getc with the argument stdin.
The getc function is equivalent to fgetc ...
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
read() still tries to read each time.
Note: Reading via a FILE *, like stdin, does not attempt to read if the end-of-file indicator is set. Yet even if the error indicator is set, a read attempt still occurs.
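A minimal sketch of the suggested fix (mine, assuming input comes from a terminal): clear the end-of-file indicator on stdin before calling getchar() again.
#include <stdio.h>

int main(void) {
    int c = getchar();      /* type Ctrl-D: returns EOF and sets the
                               end-of-file indicator on stdin */
    printf("%d\n", c);

    clearerr(stdin);        /* clear the end-of-file (and error) indicators */

    c = getchar();          /* stdio now attempts a real read again */
    printf("%d\n", c);
    return 0;
}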
macOS is a derivative of the BSD Unix systems. Its stdio implementation does not come from GNU software, so it is a different implementation. On EOF, the stream is marked as having hit end of file when a read(2) system call returns 0 as the number of characters read, and so stdio doesn't call read(2) again until that condition is reset; this produces the behaviour you observe. Use clearerr(stream); on the FILE * before issuing the next getchar(3) call, and everything will be fine. You can do that with glibc also, and then your program will run the same in either implementation of stdio (glibc vs. BSD).
I'm trying to find the logic behind all this. I hope someone can make it clear what EOF really is and how it really behaves.
EOF is simply a constant (normally valued -1) that is different from any possible character value returned by getchar(3). (getchar() returns an int in the range 0..255, not a char, precisely to extend the range of possible values with one more that represents the EOF condition; EOF is not a char.) The getchar family of functions (getchar, fgetc, etc.) needs such a value because the underlying end-of-file condition is signalled by a read(2) return value of 0 (zero characters returned), which doesn't map to any character. For that reason, the range of possible chars is widened to an int and a new value, EOF, is defined to be returned when the end-of-file condition is reached. This also keeps things compatible with files that contain Ctrl-D characters (ASCII EOT, decimal value 4) without those representing an end-of-file condition: when you read an ASCII EOT from a file it appears as a normal character with decimal value 4.
The Unix tty implementation, on the other hand, allows line input mode to use a special character (Ctrl-D, ASCII EOT/END OF TRANSMISSION, decimal value 4) to indicate an end of stream to the driver. This is a special character, like ASCII CR or ASCII DEL (which produce line editing of the input before it is fed to the program): the terminal just hands over all the input characters gathered so far and lets the application read them (if there are none, none are read, and you get the end of file). So bear in mind that Ctrl-D is only special in the Unix tty driver, and only when it is working in canonical mode (line input mode). So, finally, there are only two ways to input data to the program in line mode:
pressing the RETURN key (the terminal maps this to ASCII CR, which it then translates into ASCII LF, the famous '\n' character), and the ASCII LF character is input to the program
pressing the Ctrl-D key. This makes the terminal grab all that was keyed in up to this moment and send it to the program (without adding the Ctrl-D itself); no character is added to the input buffer, which means that, if the input buffer was empty, nothing is sent to the program and the read(2) call effectively reads zero characters from the buffer.
To unify: in every scenario, the read(2) system call normally blocks in the kernel until one or more characters are available; only at end of file does it unblock and return zero characters to the program. THIS SHOULD BE YOUR END-OF-FILE INDICATION. Many programs get an incomplete buffer (fewer characters than the number you passed as the parameter) before a true end of file is signalled, and so almost every program does another read to check whether that was an incomplete read or indeed an end-of-file indication.
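A small sketch (mine, not part of the answer) that makes this visible at a terminal: it prints how many bytes each read(2) call returns, so you can see that Ctrl-D in the middle of a line just flushes the pending input, while Ctrl-D on an empty line produces the 0-byte read that means end of file.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[256];
    ssize_t n;

    /* Type some text, press Ctrl-D mid-line, then Ctrl-D on an empty line. */
    while ((n = read(0, buf, sizeof(buf))) > 0) {
        printf("read() returned %zd bytes\n", n);
    }
    printf("read() returned %zd: end of file\n", n);
    return 0;
}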
Finally, what if I want to input a Ctrl-D character itself into a file? There's another special character in the tty implementation that escapes the special behaviour of the character that follows it. On today's systems, that character is Ctrl-V by default, so if you want to enter a special character (even Ctrl-V itself) you have to precede it with Ctrl-V; entering a literal Ctrl-D into the file therefore requires typing Ctrl-V + Ctrl-D.

Can an output stream in C that is fully buffered be flushed automatically even before the buffer is completely filled?

Consider this code:
#include <stdio.h>

int main()
{
    char buffer[500];
    int n = setvbuf(stdout, buffer, _IOFBF, 100);
    printf("Hello");
    while (1 == 1)
        ;
    return 0;
}
When run on Linux, the "Hello" message appears on the output device immediately, and the program then hangs indefinitely. Shouldn't the output instead be buffered until stdout is flushed or closed, either manually or at normal program termination? That's what seems to happen on Windows 10, and also what happens on Linux if the buffer size is specified as 130 bytes or more. I am using VS Code on both systems.
What am I missing? Am I wrong about the Full Buffering Concept?
What am I missing? Am I wrong about the Full Buffering Concept?
You are not wrong about the concept. There is wiggle room in the wording of the language of the specification, as #WilliamPursell observes in his answer, but your program's observed behavior does not exhibit full buffering according to the express intent of the specification. Moreover, I interpret the specification as leaving room here for implementations to conform despite being incapable for one reason or another of implementing the intent, not as offering a free pass for implementations that reasonably can implement the intent nevertheless to do something different at will.
I tested this variation on your program against Glibc 2.22 on Linux:
#include <stdio.h>

int main() {
    static char buffer[BUFSIZ] = { 0 };
    int n = setvbuf(stdout, buffer, _IOFBF, 100);
    if (n != 0) {
        perror("setvbuf");
        return 1;
    }
    printf("Hello");
    puts(buffer);
    return 0;
}
The program exited with status 0 and did not print any error output, so I conclude that setvbuf returned 0, indicating success. However, the program printed "Hello" only once, showing that in fact it did not use the specified buffer. If I increase the buffer size specified to setvbuf to 128 bytes (== 2^7), then the output is "HelloHello", showing that the specified buffer is used.
The observed behavior, then, seems to be[1] that this implementation of setvbuf silently sets the stream to unbuffered when the provided buffer is specified to be smaller than 128 bytes. That is consistent with the behavior of your version of the program, too, but inconsistent with my reading of the function's specifications:
[...] The argument mode determines how stream will be buffered, as follows: _IOFBF causes input/output to be fully buffered [...]. If buf is not a null pointer, the array it points to may be used instead of a buffer allocated by the setvbuf function and the argument size specifies the size of the array; otherwise, size may determine the size of a buffer allocated by the setvbuf function. The contents of the array at any time are indeterminate.
The setvbuf function returns zero on success, or nonzero if an invalid value is given for mode or if the request cannot be honored.
(C17, 7.21.5.6/2-3)
As I read the specification, setvbuf is free to use the specified buffer or not, at its discretion, and if it chooses not to do so then it may or may not use a buffer of the specified size, but it must either set the specified buffering mode or fail. It is inconsistent with those specifications for it to change the buffering mode to one that is different from both the original mode and the requested mode, and it is also inconsistent to fail to set the requested mode and nevertheless return 0.
Inasmuch as I conclude that this Glibc version's setvbuf is behaving contrary to the language specification, I'd say you've tripped over a glibc bug.
[1] But it should be noted that the specification says the contents of the buffer at any time are indeterminate. Therefore, by accessing the buffer after asking setvbuf to assign it as a stream buffer, this program invokes undefined behavior; hence, technically, it does not prove anything.
Given the lack of specificity in the standard, I would argue that such behavior is not prohibited.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05:
When a stream is "unbuffered", bytes are intended to appear from the source or at the destination as soon as possible; otherwise, bytes may be accumulated and transmitted as a block. When a stream is "fully buffered", bytes are intended to be transmitted as a block when a buffer is filled. When a stream is "line buffered", bytes are intended to be transmitted as a block when a <newline> is encountered. Furthermore, bytes are intended to be transmitted as a block when a buffer is filled, when input is requested on an unbuffered stream, or when input is requested on a line-buffered stream that requires the transmission of bytes. Support for these characteristics is implementation-defined, and may be affected via setbuf() and setvbuf().
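Whatever an implementation chooses to do within that wiggle room, the portable way to make fully buffered output appear at a definite point is an explicit flush. A minimal sketch (mine, not from either answer):
#include <stdio.h>

int main(void)
{
    /* Ask for full buffering with an implementation-allocated buffer. */
    setvbuf(stdout, NULL, _IOFBF, 4096);

    printf("Hello");
    fflush(stdout);   /* force the output out now, regardless of buffering mode */

    for (;;)
        ;             /* "Hello" is already visible even though we never return */
}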

How to control the output of fileno?

I'm facing a piece of code that I don't understand:
read(fileno(stdin), &i, 1);
switch (i)
{
    case '\n':
        printf("\a");
        break;
    ....
I know that fileno returns the file descriptor associated with stdin here, and then read puts this value in the i variable.
So, what should the value of stdin be to allow i to match the first case, i.e. '\n'?
Thank you
But what should the value of stdin be to match the first case, i.e. '\n'?
The case statement doesn't look at the "value" of stdin.
read(fileno(stdin),&i,1);
reads a single byte into i (assuming the read() call is successful), and if that byte is '\n' (the newline character) then it'll match the case. You probably need to read the man page of read(2) to understand what it does.
I know that fileno returns the file descriptor associated with stdin here,
Yes, though I suspect you don't know what that means.
then read puts this value in the i variable.
No. No no no no no. read() does not put the value of the file descriptor, or any part of it, into the provided buffer (in your case, the bytes of i). As its name suggests, read() attempts to read from the file represented by the file descriptor passed as its first argument. The bytes read, if any, are stored in the provided buffer.
stdin represents the program's standard input. If you run the program from an interactive shell, that will correspond to your keyboard. The program attempts to read user input, and to compare it with a newline.
The program is likely flawed, and maybe outright wrong, though it's impossible to tell from just the fragment presented. If i is a variable of type int then its representation is larger than one byte, but you're only reading one byte into it. That will replace only one byte of the representation, with results depending on C implementation and the data read.
What the program seems to be trying to do can be made to work with read(), but I would recommend using getchar() instead:
#include <stdio.h>

/* ... */
int i;
/* ... */
i = getchar();
/* ... */
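For comparison, a read()-based version of the same idea (a sketch under the assumption that only one byte is needed at a time; not code from the question) would read into a single unsigned char rather than into an int:
#include <stdio.h>
#include <unistd.h>

/* ... */
unsigned char c;
ssize_t n = read(fileno(stdin), &c, 1);
if (n == 1 && c == '\n') {
    printf("\a");     /* ring the bell on a newline, as in the original fragment */
}
/* n == 0 means end of input; n == -1 means read() failed */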

How do the statements inside this IF statement work?

I just recently started my programming education with inter-process communication, and this piece of code was written in the parent process's code section. From what I have read about write(), it returns -1 if it failed, 0 if nothing was written to the pipe, and a positive integer if successful. How exactly does comparing against sizeof(value) help us identify this? Isn't if (write(request[WRITE], &value, sizeof(value)) < 1) a much more reader-friendly alternative to the sizeof(value) comparison?
if (sizeof(value) != write(request[WRITE], &value, sizeof(value)))
{
    perror("Cannot write thru pipe.\n");
    return 1;
}
Code clarification: The variable value holds a digit entered in the parent process, which the parent then sends to the child process through a pipe for the child to do some arithmetic operation on.
Any help or clarification on the subject is very much appreciated.
Edit: How do I highlight my system functions here when asking questions?
This also captures a successful but partial write, which the application wants to treat as a failure.
It's slightly easier to read without the pointless parentheses:
if(write(request[WRITE], &value, sizeof value) != sizeof value)
So, for instance, if value is an int, it might occupy 4 bytes, but if write() writes just 2 of those, it will return 2, which is caught by this test.
At least in my opinion. Remember that sizeof is not a function.
That's not a read, that's a write. The principle is almost the same, but there's a bit of a twist.
As a general rule you are correct: write() could return a "short count", indicating a partial write. For instance, you might ask to write 2000 bytes to some file descriptor, and write might return a value like 1024 instead, indicating that 976 (2000 - 1024) bytes were not written but no actual error occurred. (This occurs, for instance, when receiving a signal while writing on a "slow" device like a tty or pty. Of course, the application must decide what to do about the partial write: should it consider this an error? Should it retry the remaining bytes? It's pretty common to wrap the write in a loop, that retries remaining bytes in case of short counts; the stdio fwrite code does this, for instance.)
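A minimal sketch of that retry-on-short-count pattern (my own illustration of what the paragraph describes, with a hypothetical helper name write_all):
#include <errno.h>
#include <unistd.h>

/* Write exactly "len" bytes to "fd", retrying after short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal: just retry */
            return -1;         /* real error */
        }
        p += n;                /* short write: advance past what was written */
        len -= (size_t)n;
    }
    return 0;
}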
With pipes, however, there's a special case: writes of sufficiently small size (less than or equal to PIPE_BUF) are atomic. So assuming sizeof(value) <= PIPE_BUF, and that this really is writing on a pipe, this code is correct: write will return either sizeof(value) or -1.
(If sizeof(value) is 1, the code is correct—albeit misleading—for any descriptor: write never returns zero. The only possible return values are -1 and some positive value between 1 and the number of bytes requested-to-write, inclusive. This is where read and write are not symmetric with respect to return values: read can, and does, return zero.)

using read system call after a scanf

I am having some confusion regarding the following code:
#include <stdio.h>

int main()
{
    char buf[100] = {'\0'};
    int data = 0;
    scanf("%d", &data);
    read(stdin, buf, 4); // attaching to stdin
    printf("buffer is %s\n", buf);
    return 1;
}
Suppose at runtime I provide the input 10abcd. As per my understanding, the following should happen:
scanf should place 10 in data
and abcd will still be in the stdin buffer
when read tries to read stdin (abcd is already there) it should place abcd into buf
so printf should print abcd
But that is not happening; printf shows no output.
Am I missing something here?
First of all, read(stdin, ...) should give warnings (if you have them enabled), which you would be wise to heed. read() takes an integer as its first parameter, specifying which channel to read from. stdin is of type FILE *.
Even if you changed it to read(0, ...), this is not recommended practice. scanf reads from FILE *stdin, which is buffered on top of file handle 0. read(0, ...) reads directly from the underlying file handle and ignores any characters which were buffered. This will cause strange results unless stdin is set unbuffered.
Ignoring mechanical issues related to the syntax of the read() function call, there are two cases to consider:
Input is from a terminal.
Input is from a file.
Terminal
No data will be available for reading until the user hits return. At that point, the standard I/O library will read all the available data into the buffer associated with stdin (that would be "10abcd\n"). It will then parse the number, leaving the a in the buffer to be read later by other standard I/O functions.
When the read() occurs, it will also wait for the user to provide some input. It has no clue about the data in the stdin buffer. It will hang until the user hits return, and will then read the next lot of data, returning up to 4 bytes in the buffer (no null termination unless it so happens that the fourth character is an ASCII NUL '\0').
File
Actually, this isn't all that much different, except that instead of reading a line of data into the buffer, the standard I/O library will probably read an entire buffer full, (BUFSIZ bytes, which might be 512 or larger). It will then convert the 10 and leave the a for later use. (If the file is shorter than the buffer size, it will all be read into the stdin buffer.)
The read will then collect the next 4 bytes from the file. If the whole file was read already, then it will return nothing — 0 bytes read.
You need to record and check the return value from read(). You should also check the return value from scanf() to ensure it did actually read a number.
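A sketch of that checking (mine; the details are illustrative). Note that, as described above, this still reads from file descriptor 0 directly, so characters already sitting in the stdin buffer after scanf() will not be seen by read():
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[100] = {'\0'};
    int data = 0;

    if (scanf("%d", &data) != 1) {
        fprintf(stderr, "no number read\n");
        return 1;
    }

    ssize_t n = read(0, buf, 4);    /* file descriptor 0, not the FILE *stdin */
    if (n == -1) {
        perror("read");
        return 1;
    }
    printf("read %zd bytes, buffer is %.*s\n", n, (int)n, buf);
    return 0;
}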
try... man read first.
read is declared as ssize_t read(int fd, void *buf, size_t count);
and stdin is declared as FILE *. That's the issue. Use fread() instead and you will be sorted.
#include <stdio.h>

int main()
{
    char buf[100] = {'\0'};
    int data = 0;
    scanf("%d", &data);
    fread(buf, 1, 4, stdin);
    printf("buffer is %s\n", buf);
    return 1;
}
EDIT: Your understanding is almost correct, but not completely.
To address your question properly, I will agree with Jonathen Laffer.
Here is how your code works:
1) scanf should place 10 in data.
2) abcd will still be in the stdin buffer when you press ENTER.
3) Then read() will again wait for input, and you have to press ENTER again for the program to run further.
4) Now, if you entered anything before pressing ENTER the second time, the printf should print it; otherwise you will not get anything on output other than your printf statement.
That's why I asked you to use fread instead. Hope it helps.
