File input and output streams in C

Suppose I open a text file in write mode using C, and add some text data to it.
1. Internally, how is the data stored in the file? Is each character stored as an 8-bit ASCII code?
We never add EOF at the end of writing to a file; we just use fclose() to close it.
2. How, then, is EOF added to the file? How is it stored in the file?
When we read that file character by character using getchar(), we are able to detect EOF. Now, if EOF is Ctrl+Z, then the two characters ^Z would be saved at the end of the file, so getchar() would get ^ and then Z. So:
3. How does getchar() detect EOF?

EOF is not a character that gets stored in a file, it is a special return code that you get when you read a file. The file I/O system knows how many characters there are in a file, because it stores the exact length of the file. When your program tries to read a character after the last available character, the file I/O system returns a special value EOF, which is outside the range of char (it is for that reason that character reading routines such as getchar() return an int instead of a char).
The Ctrl+Z sequence is not an EOF character either. It is a special sequence of keys that tells the shell to close the console input stream associated with the program. Once the stream is closed, the next read returns EOF to your program. It is important to understand, however, that Ctrl+Z is merely a keyboard sequence that is interpreted by the command line processor - in the same way that Ctrl+C is a sequence that tells the command line processor to terminate the program.
Finally, ^Z is not two characters that get stored in a file, it's a screen representation of the Ctrl+Z sequence produced by the command line processor to confirm visually that the keyboard sequence has been accepted.
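
To make that concrete, here is the usual reading idiom, a minimal sketch: the return value of getchar() must be stored in an int so the out-of-band EOF value can be distinguished from every valid character.

#include <stdio.h>

int main(void)
{
    int c;   /* int, not char: EOF lies outside the range of char */

    while ((c = getchar()) != EOF)   /* EOF is a return code, not a byte read from the file */
        putchar(c);
    return 0;
}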

1. Typically C will be using Latin-1 or some other single-byte encoding, but it should be possible to use a UTF-8 locale setting. Note that most C character/string handling routines will not properly handle UTF-8 or any other multibyte encoding -- you have to use special libraries. As for how the data is laid out, it depends on the operating system used, but most will simply store a continuous stream of characters, with a line-end marker (CR-LF on Windows, \n on Unixy systems) to mark the end of each line (you have to explicitly put it there).
2. Some operating systems, such as MS-DOS, may explicitly write an EOF character to the end of the file, but most don't. They simply run off the end of the file and report a status of EOF.
3. See 2.
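
As a small illustration of 1 and 2, here is a sketch of writing a line to a file (the file name is just an example): the newline must be written explicitly, and fclose() merely flushes and closes; no EOF byte is appended.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("demo.txt", "w");   /* example file name */
    if (f == NULL)
        return 1;
    fputs("hello\n", f);   /* the '\n' must be written explicitly */
    fclose(f);             /* flushes the buffer; no EOF marker is appended */
    return 0;              /* file is 6 bytes on Unix, 7 on Windows (CR-LF) */
}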

Related

Guarantee that getchar receives newline or EOF (eventually)?

I would like to read characters from stdin until one of the following occurs:
an end-of-line marker is encountered (the normal case, in my thinking),
the EOF condition occurs, or
an error occurs.
How can I guarantee that one of the above events will happen eventually? In other words, how do I guarantee that getchar will eventually return either \n or EOF, provided that no error (in terms of ferror(stdin)) occurs?
// (How) can we guarantee that the LABEL'ed statement will be reached?
int c;
int done = 0;
while (!0)
    if ((c = getchar()) == EOF || ferror(stdin) || c == '\n')
        break;
LABEL: done = !0;
If stdin is connected to a device that always delivers some character other than '\n', none of the above conditions will occur. It seems like the answer will have to do with the properties of the device. Where can those details be found (in the documentation for compiler, device firmware, or device hardware perhaps)?
In particular, I am interested to know if keyboard input is guaranteed to be terminated by an end-of-line marker or end-of-file condition. Similarly for files stored on disc / SSD.
Typical use case: user enters text on the keyboard. Program reads first few characters and discards all remaining characters, up to the end-of-line marker or end-of-file (because some buffer is full or after that everything is comments, etc.).
I am using C89, but I am curious if the answer depends on which C standard is used.
You can't.
Let's say I run your program, then I put a weight on my keyboard's "X" key and go on vacation to Hawaii. On the way there, I get struck by lightning and die.
There will never be any input other than 'x'.
Or, I may decide to type the complete story of Moby Dick, without pressing enter. It will probably take a few days. How long should your program wait before it decides that maybe I won't ever finish typing?
What do you want it to do?
Looking at all the discussion in the comments, it seems you are looking in the wrong place:
It is not a matter of keyboard drivers or wrapping stdin.
It is also not a matter of what programming language you are using.
It is a matter of the purpose of the input in your software.
Basically, it is up to you as a programmer to know how much input you want or need, and then decide when to stop reading input, even if valid input is still available.
Note that not only are there devices that can send input forever without triggering an EOF or end-of-line condition, but there are also programs that will happily read input forever.
This is by design.
Common examples can be found in POSIX style OS (like Linux) command line tools.
Here is a simple example:
cat /dev/urandom | hexdump
This will print random numbers for as long as your computer is running, or until you hit Ctrl+C.
Though cat will stop working when there is nothing more to print (EOF or any read error), it does not expect such an end, so unless there is a bug in the implementation you are using it should happily run forever.
So the real question is:
When does your program need to stop reading characters and why?
If stdin is connected to a device that always delivers some character other than '\n', none of the above conditions will occur.
A device such as /dev/zero, for example. Yes, stdin can be connected to a device that never provides a newline or reaches EOF, and that is not expected ever to report an error condition.
It seems like the answer will have to do with the properties of the device.
Indeed so.
Where can those details be found (in the doumentation for compiler, device firmware, or device hardware perhaps)?
Generally, it's a question of the device driver. And in some cases (such as the /dev/zero example) that's all there is anyway. Drivers generally do things that are sensible for the underlying hardware, but in principle, they don't have to.
In particular, I am interested to know if keyboard input is guaranteed to be terminated by an end-of-line marker or end-of-file condition.
No. Generally speaking, an end-of-line marker is sent by a terminal device if and only if the <enter> key is pressed. An end-of-file condition might be signaled if the terminal disconnects (but the program continues), or if the user explicitly causes one to be sent (by typing Ctrl+D on Linux or Mac, for example, or Ctrl+Z on Windows). Neither of those events need actually happen on any given run of a program, and it is very common for the latter never to occur.
Similarly for files stored on disc / SSD.
You can generally rely on data read from an ordinary file to contain newlines where they are present in the file itself. If the file is open in text mode, then the system-specific text line terminator will also be translated to a newline, if it differs. It is not necessary for a file to contain any of those, so a program reading from a regular file might never see a newline.
You can rely on EOF being signaled when a read is attempted while the file position is at or past the end of the file's data.
Typical use case: user enters text on the keyboard. Program reads first few characters and discards all remaining characters, up to the end-of-line marker or end-of-file (because some buffer is full or after that everything is comments, etc.).
I think you're trying too hard.
Reading to end-of-line might be a reasonable thing to do in some cases. Expecting a newline to eventually be reached is reasonable if the program is intended to support interactive use. But trying to ensure that invalid data cannot be fed to your program is a losing cause. Your objective should be to accept the widest array of inputs you reasonably can, and to fail gracefully when other inputs are presented.
If you need to read input in a line-by-line mode then by all means do that, and document that you do it. If only the first n characters of each line are significant to the program then document that, too. Then, if your program never terminates when a user connects its input to /dev/zero that's on them, not on you.
On the other hand, try to avoid placing arbitrary constraints, especially on sizes of things. If there is not a natural limit on the size of something, then no artificial limit you introduce will ever be enough.
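
As a sketch of the use case quoted above (the function name and size limit are illustrative, not from the question): keep the first few characters of a line and read-and-discard the rest, stopping at '\n' or EOF.

#include <stdio.h>

/* Keep at most size-1 characters of a line, discard the remainder.
   Returns '\n' or EOF, whichever ended the line. */
int read_line_prefix(char *buf, int size)
{
    int c, i = 0;

    while ((c = getchar()) != EOF && c != '\n') {
        if (i < size - 1)
            buf[i++] = (char)c;   /* keep what fits */
        /* characters past the limit are read and dropped */
    }
    buf[i] = '\0';
    return c;
}

If the input device never delivers '\n' or EOF, this loops forever, which is exactly the point above: no amount of wrapping can change that; only the program's own policy can.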

Why doesn't getchar() read characters such as backspace?

This is a very basic C question, coming from page 18 of Kernighan and Ritchie.
I've compiled this very simple code for counting characters input from the keyboard:
#include <stdio.h>

/* count characters in input; 1st version */
int main(void)
{
    long nc;

    nc = 0;
    while (getchar() != EOF)
        ++nc;
    printf("%ld\n", nc);
    return 0;
}
This compiles fine, runs fine, and behaves pretty much as expected, i.e. if I enter "Hello World", it returns a value of 11 when I press Ctrl+D to signal EOF.
What confuses me is that if I make a mistake, I can use backspace to delete characters and re-enter them, and the program counts only the number of characters displayed by the terminal when I invoke EOF.
If the code is counting each character, including special characters, then if I type four characters, delete two, and type another two, shouldn't the output be 8 characters (4 char + 2 del + 2 char), not 4?
I'm obviously misunderstanding how C handles backspace, and how/when the code increments the variable nc.
Typically, your terminal session is running in "line mode", that is, it only passes data to your program when a line is complete (eg, you pressed Return, etc). So you only see the line as it is complete (with any editing having been done before your program ever sees anything). Typically this is a good thing, so every program doesn't need to deal with delete/etc.
On most systems (eg Unix-based systems, etc), it is possible to put the terminal into "raw" mode -- that is, each character is passed as received to the program. For example, screen-oriented text editors commonly do this.
It's not that getchar() doesn't count the "deletions"; it doesn't even see the input until the terminal driver passes it to your program.
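
For POSIX systems, a minimal sketch of what that "raw" mode looks like with termios(3); with ICANON and ECHO cleared, getchar() sees every keystroke, backspaces included, as it is typed.

#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    struct termios old, raw;

    tcgetattr(STDIN_FILENO, &old);     /* save current settings */
    raw = old;
    raw.c_lflag &= ~(ICANON | ECHO);   /* no line editing, no echo */
    raw.c_cc[VMIN]  = 1;               /* read() returns after 1 byte */
    raw.c_cc[VTIME] = 0;
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);

    int c = getchar();                 /* delivered per keystroke now */
    printf("got byte 0x%02X\n", c);

    tcsetattr(STDIN_FILENO, TCSANOW, &old);   /* always restore */
    return 0;
}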
When you input something, it doesn't reach your C program until you press \n or send EOF (or EOL). This is what POSIX defines as Canonical Mode Input Processing - which is typically the default mode.
Backspace characters are normally used to edit input in cooked tty mode (see canonical input mode in tty(4) on BSD and termios(3) on Linux systems), so they are consumed by the tty driver and never reach the input your process receives. The same applies to Ctrl-D as the end-of-file indication, or to Ctrl-U as the kill-input character. There are several things the driver does behind the scenes that your process never gets to see. These are intended to make life easier for users and programmers: you normally don't want erased input in your life (that's the reason for erasing it), and you want line endings to be \n and not the \r the tty normally generates when you press the [RETURN] key. But if you read from a file that happens to contain backspaces, you'll get them as normal input anyway. Just create a file with backspaces and try reading it with redirected input, and you'll see those characters in your input.
By the way, if you want to generate backspaces at the terminal, just prepend a Ctrl-V character before each one (this is also managed by the tty driver and does not happen when reading from a file) and you'll see your backspace chars arrive as normal input (to send a literal Ctrl-V, just double it).
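
You can verify that last point with a short test program (the file name is just an example): backspace bytes written to a file come back unchanged, because no tty driver sits in between.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("bs.txt", "w");   /* example file name */
    if (f == NULL)
        return 1;
    fputs("ab\b\bcd\n", f);           /* two literal backspace characters */
    fclose(f);

    f = fopen("bs.txt", "r");
    int c;
    while ((c = fgetc(f)) != EOF)
        printf("%02X ", c);           /* prints: 61 62 08 08 63 64 0A */
    putchar('\n');
    fclose(f);
    return 0;
}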

Different behaviour of Ctrl-D (Unix) and Ctrl-Z (Windows)

As per the title, I am trying to understand the exact behavior of Ctrl+D / Ctrl+Z in a while loop with a gets (which I am required to use). The code I am testing is the following:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char str[80];

    while (printf("Insert string: ") && gets(str) != NULL) {
        puts(str);
    }
    return 0;
}
If my input is simply a Ctrl+D (or Ctrl+Z on Windows) gets returns NULL and the program exits correctly. The unclear situation is when I insert something like house^D^D (Unix) or house^Z^Z\n (Windows).
In the first case, my interpretation is that a getchar (or something similar inside the gets function) waits for read() to get the input; the first Ctrl+D flushes the buffer, which is not empty (hence not EOF), and then, the second time read() is called, EOF is triggered.
In the second case, though, I noticed that the first Ctrl+Z is inserted into the buffer while everything that follows is simply ignored. Hence my understanding is that the first read() call inserted house^Z and discarded everything else, returning 5 (the number of characters read). (I say 5 because otherwise I think a simple Ctrl+Z should return 1 without triggering EOF.) Then the program waits for more input from the user, hence a second read() call.
I'd like to know what I got right and wrong about the way this works, and which parts of it are simply implementation-dependent, if any.
Furthermore, I noticed that on both Unix and Windows, even after EOF is triggered, it seems to reset to false in the following gets() call, and I don't understand why this happens or where in the code.
I would really appreciate any kind of help.
(12/20/2016) I heavily edited my question in order to avoid confusion
The CTRL-D and CTRL-Z "end of file" indicators serve a similar purpose on Unix and Windows systems respectively, but are implemented quite differently.
On Unix systems (including Unix clones like Linux) CTRL-D, while officially described as the end-of-file character, is actually a delimiter character. It does almost the same thing as the end-of-line character (usually carriage return or CTRL-M), which is used to delimit lines. Both characters tell the operating system that the input line is finished and to make it available to the program. The only difference is that with the end-of-line character a line feed (CTRL-J) character is inserted at the end of the input buffer to mark the end of the line, while with the end-of-file character nothing is inserted.
This means when you enter house^D^D on Unix, the read system call will first return a buffer of length 5 with the 5 characters house in it. When read is called again to obtain more input, it will then return a buffer of length 0 with no characters in it. Since a zero-length read on a normal file indicates that the end of file has been reached, the gets library function also interprets this as end of file and stops reading the input. However, since it filled the buffer with 5 characters, it doesn't return NULL to indicate that it reached the end of the file. And since it hasn't actually reached end of file, as terminal devices aren't actually files, further calls to gets after this will make further calls to read, which will return any subsequent characters that the user types.
On Windows CTRL-Z is handled much differently. The biggest difference is that it's not treated specially by the operating system at all. When you type house^Z^Z^M on Windows, only the carriage return character is given special treatment. Just like on Unix, the carriage return makes the typed line available to the program, though in this case a carriage return and a line feed are added to the buffer to mark the end of the line. So the result is that the ReadFile function returns a 9-byte-long buffer with the 9 characters house^Z^Z^M^J in it.
It is actually the program itself, specifically the C runtime library, that treats CTRL-Z specially. In the case of the Microsoft C runtime library, when it sees the CTRL-Z character in the buffer returned by ReadFile, it treats it as an end-of-file marker and ignores everything else after it. Using the example in the previous paragraph, gets ends up calling ReadFile to get more input, because the fact that it has seen the CTRL-Z character isn't remembered when reading from the console (or other device) and it hasn't yet seen the end of the line (which was ignored). If you then press Enter again, gets will return with the buffer filled with the 7 bytes house^Z\0 (adding a 0 byte to indicate the end of the string). By default, it does much the same thing when reading from normal files: if a CTRL-Z character appears in a file, it and everything after it is ignored. This is for backward compatibility with CP/M, which only supported files with lengths that were multiples of 128 and used CTRL-Z to mark where text files were really supposed to end.
Note that both the Unix and Windows behaviours described above are only the normal default handling of user input. The Unix handling of CTRL-D only occurs when reading from a terminal device in canonical mode and it's possible to change the "end-of-file" character to something else. On Windows the operating system never treats CTRL-Z specially, but whether the C runtime library does or not depends on whether the FILE stream being read is in text or binary mode. This is why in portable programs you should always include the character b in the mode string when opening binary files (eg. fopen("foo.gif", "rb")).
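
A short test makes the text/binary difference visible (the file name is illustrative, and the text-mode count assumes the Microsoft C runtime on Windows): a CTRL-Z byte in a file cuts short text-mode reads on Windows but is an ordinary byte in binary mode, and in both modes on Unix.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("ctrlz.dat", "wb");   /* illustrative file name */
    if (f == NULL)
        return 1;
    fputs("a\x1A" "b", f);                /* 'a', CTRL-Z, 'b' */
    fclose(f);

    int c, n = 0;
    f = fopen("ctrlz.dat", "r");          /* text mode */
    while ((c = fgetc(f)) != EOF)
        n++;
    fclose(f);
    printf("text mode:   %d chars\n", n); /* 1 on Windows (MS CRT), 3 on Unix */

    n = 0;
    f = fopen("ctrlz.dat", "rb");         /* binary mode */
    while ((c = fgetc(f)) != EOF)
        n++;
    fclose(f);
    printf("binary mode: %d chars\n", n); /* 3 everywhere */
    return 0;
}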

Check if stdin has bytes pending (for escape sequences)

In plain C, how can I check if standard input has bytes pending, without blocking?
The reason for this is to handle escape sequences, which are multi-byte.
For example, if I use getchar() and the user presses Ctrl-Up Arrow, then 3 bytes are immediately placed in standard input: 1B 4F 41; however, getchar() only reads ONE of those bytes. I need to read all three before continuing on. Different escape sequences have different lengths, so the logic is that if the first character is an escape character, then I read ALL characters currently in the buffer and process them as an escape unit. But I can't do that with getchar() because it will block when it reaches the end of the buffer. I need to know how many characters are in the buffer.
There's no such provision in standard C; you need to use OS-specific calls to either do non-blocking reads or, as I think is more appropriate here, to read bytes from the input immediately rather than wait for a newline. For the latter, see
setvbuf not able to make stdin unbuffered
Note:
the logic is that if the first character is an escape character then I read ALL characters currently in the buffer and then process that as an escape unit
That really isn't the way to do it, and will fail if keys are pressed rapidly or are autorepeated, or, say, you're reading from a key log file. Each byte determines how many more bytes of escape sequence there are; you should read conditionally based on what you have already seen.
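
On POSIX systems, one way to sketch the non-blocking check (assuming the terminal has already been put into non-canonical mode, since in canonical mode nothing is readable until Enter) is poll(2) with a zero timeout: it reports whether a byte is available right now without blocking.

#include <poll.h>
#include <unistd.h>

/* Return nonzero if a byte can be read from stdin immediately. */
int stdin_has_pending(void)
{
    struct pollfd pfd;

    pfd.fd = STDIN_FILENO;
    pfd.events = POLLIN;
    return poll(&pfd, 1, 0) > 0;   /* 0 ms timeout: never blocks */
}

So after reading an ESC (0x1B) you could drain bytes only while stdin_has_pending() returns true, though for the reasons given above a robust parser should decode the sequence byte by byte instead.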

What is the role of '\n' in fprintf C?

I am confused about the role of '\n' in fprintf. I understand that it copies characters into an array and that the \n character signals when to stop reading the characters in the current buffer. But then, when does it know to make the write system call?
For example, fprintf(stdout, "hello") prints, but I never gave the \n character, so how does it know to make the system call?
The system call is made when the channel is synced/flushed. This can be when the buffer is full (try writing a LOT without a \n and you'll see output at some point), when a \n is seen (IF you have line buffering configured for the channel, which is the default for tty devices), or when a call to fflush() is made.
When the stream is closed, it will be flushed as well. When the process terminates, the operating system closes any open streams it has. Each of these events leads to the system call that emits the output.
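
A small demonstration of those flush points, a sketch assuming stdout is connected to a line-buffered terminal (sleep() is POSIX; on Windows you would use Sleep() from <windows.h> instead):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("hello");   /* no '\n': the text sits in stdio's buffer */
    sleep(2);          /* nothing appears on the terminal yet */
    fflush(stdout);    /* explicit flush: "hello" appears now */
    sleep(2);
    return 0;          /* exiting flushes any remaining output too */
}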
First of all, fprintf("hello"); isn't correct; it should be, for instance, fprintf(stdout, "hello");. Next, \n doesn't imply stopping reading or writing characters, because \n is itself a character: the line feed (ASCII 10).
The assumptions you've stated about the purpose and use of \n are wrong. \n is simply one of the 128 ASCII characters. It is not a command or directive that causes anything to happen on its own; it is just a passive character that, when used in various C functions (e.g. sprintf(), printf(), fprintf(), etc.), is interpreted so that the presentation of the output appears in the desired way.
By the way, \n (newline) is in good company with many other escape sequences, such as \t (tab) and \r (carriage return).
