I am creating a program that reads in the contents of a text file through the command line, character by character.
Is the NULL value automatically inserted or do I have to add it to the text file manually?
Text files do not need to have a terminator on modern platforms. (On some legacy platforms they did have one, but I doubt it is the case here.) You almost certainly should not write a terminator into the file, as it may cause problems with programs that do not expect one. The end of file serves as a terminator when reading.
Text strings in C are arrays of characters terminated by a zero, aka the null character, mnemonic NUL (with one L, and it is not the same thing as NULL in C). When creating strings, you do need to terminate them correctly. Functions returning strings, including ones that read them from files (e.g., fgets), terminate them for you.
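As a rough sketch of the two points above, the helper below reads a stream one character at a time until end of file and then terminates the buffer itself. The name read_chars and the fixed-capacity buffer are assumptions made for illustration; they do not come from the question.

```c
#include <stdio.h>

/* Reads characters from fp until EOF or until the buffer is full,
 * then appends the terminating '\0' itself.
 * Returns the number of characters stored, not counting the terminator.
 * read_chars is a hypothetical helper name. */
size_t read_chars(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;  /* int, not char, so that EOF can be told apart from data */

    if (cap == 0)
        return 0;
    while (n + 1 < cap && (c = fgetc(fp)) != EOF)
        buf[n++] = (char)c;
    buf[n] = '\0';  /* the file has no terminator; the string needs one */
    return n;
}
```

A caller would typically open the file named on the command line with fopen(), pass the stream to read_chars(), and only then treat buf as a C string.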
I understand that when using fgets, the program will not stop when it encounters NUL, namely '\0'. However, when will this be a problem that needs to be manually addressed?
My main use case for fgets is to get user input (like a better version of scanf that allows reading white space). I cannot think of a situation where a user would want to terminate their input by typing '\0'.
Recall that text file input is usually lines: characters followed by a '\n' (except maybe the last line). On reading text input, a null character is not special. It is not an alternate end-of-line. It is just another non-'\n' character.
It is functions like fgets() and fscanf() that append a null character to the read buffer to denote the end of the string. So when code reads that string, is a given null character one that was read or the one that was appended?
Whether code uses fgets(), fscanf(), getchar(), etc. is not really the issue. The issue is how code should detect null characters and how it should handle them.
Reading a null character from a text stream is uncommon, but not impossible. Null characters tend to reflect a problem more often than valid text data.
Reasons null characters exist in a text file
The text file is a wide character text file, perhaps UTF-16, where null characters are common. Code needs to read this file with fgetws() and related functions.
The text file is a binary data one. Better to use fread().
The file is a text file, yet through error or nefarious intent, it contains null characters. Usually best to detect them, if possible, and stop processing the file with an error message or status.
Legitimate text file uncommonly using null characters. fgets() is not the best tool. Likely need crafted input functions or other extensions like getline().
How to detect?
fgets(): prefill the buffer with a non-zero value. See if the characters after the first null character are all the pre-fill value.
fscanf(): Read a line with some size like char buf[200]; fscanf(f, "%199[^\n]%n", buf, &length); and use length for input length. Additional code needed to handle end-of-line, extra-long lines, 0 length lines, etc.
fgetc(): Build user code to read/handle as needed - tends to be slow.
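The fgets() pre-fill idea from the list above could be sketched as follows. The names fgets_detect_nul and FILL are hypothetical, and the sketch assumes that fgets() leaves the bytes after its appended terminator untouched, which common implementations do but the C standard does not spell out.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define FILL 0x7F  /* assumed pre-fill value; any non-zero byte would do */

/* Reads one line with fgets() after pre-filling the buffer.
 * On success stores the number of characters read in *len and returns
 * 1 if a null character was read from the stream, 0 otherwise.
 * Returns -1 on end of file, read error, or an unusable size. */
int fgets_detect_nul(char *buf, size_t size, FILE *fp, size_t *len)
{
    size_t end = size;

    if (size < 2 || size > INT_MAX)
        return -1;
    memset(buf, FILL, size);
    if (fgets(buf, (int)size, fp) == NULL)
        return -1;

    /* Everything after the appended terminator is still FILL, so the
     * last non-FILL byte is the '\0' that fgets() appended. */
    while (end > 0 && buf[end - 1] == FILL)
        end--;
    if (end == 0)
        return -1;  /* not expected if fgets() behaved as assumed */
    *len = end - 1;

    /* strlen() stops at the first null; a shorter result means a
     * null character came from the stream itself. */
    return strlen(buf) != *len;
}
```

Scanning backwards from the end of the buffer is just one way to apply the check; it also yields the true input length, so the caller can still walk the data when a null was read.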
How to handle?
In general, error out with a message or status.
If null characters are legitimate to this code's handling of text files, code needs to handle input, not as C strings, but as a buffer and length.
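A minimal sketch of the buffer-and-length approach, built on fgetc(), might look like this. The function name read_line_bytes is an assumption for illustration, and the result is deliberately not a C string.

```c
#include <stdio.h>

/* Reads up to cap bytes of one line from fp into buf, keeping null
 * characters as ordinary data. Stops after '\n' or at end of file.
 * The result is NOT a C string: use the returned length, not strlen().
 * Returns 0 when nothing could be read. */
size_t read_line_bytes(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    while (n < cap && (c = fgetc(fp)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;
    }
    return n;
}
```

Because the length travels separately, a caller would process the data with memcmp(), memchr(), or fwrite() rather than the str*() functions.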
Good luck.
I know in C++, you can check the length of the string, but in C, not so much.
Is it possible knowing the file size of a text file, to know how many characters are in the file?
Is it one byte per character or are other headers secretly stored whether or not I set them?
I would like to avoid performing a null check on every character as I iterate through the file for performance reasons.
Thanks.
You can open the file and read all the characters and count them.
Besides that, there's no fully portable method to check how long a file is -- neither on disk, nor in terms of how many characters will be read. This is true for text files and binary files.
The question "How do you determine the size of a file in C?" goes over some of the pitfalls. Perhaps one of the solutions there will suit a subset of systems that you run your code on; or you might like to use a POSIX or operating system call.
As mentioned in comments; if the intent behind the question is to read characters and process them on the fly, then you still need to check for read errors even if you knew the file size, because reading can fail.
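Counting by reading, with the read-error check mentioned above, can be sketched like this. The name count_chars is hypothetical; the loop relies only on standard fgetc() and ferror().

```c
#include <stdio.h>

/* Counts the characters that can be read from fp, from the current
 * position to end of file.
 * Returns 0 and stores the count in *count on success,
 * or -1 if reading stopped because of an error rather than EOF. */
int count_chars(FILE *fp, unsigned long *count)
{
    unsigned long n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF)
        n++;
    if (ferror(fp))
        return -1;  /* EOF was reported because the read failed */
    *count = n;
    return 0;
}
```

Note that the loop compares against EOF, not against '\0', so no per-character null check is needed to find the end of the file.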
Characters (of type char) are single byte values, as defined in the C standard (see CHAR_BIT). A NUL character is also a character, and so it, too, takes up a single byte.
Thus, if you are working with an ASCII text file, the file size will be the number of bytes and therefore equivalent to the number of characters.
If you are asking how long individual strings are inside the file, then you will indeed need to look for NUL and other extended character bytes and calculate string lengths on that basis. You might not be able to safely assume that there is only one NUL character and that it is at the end of the file, depending on how that file was made. There can also be newlines and other extended characters you would want to exclude. You have to decide on a character set and do counting from that set.
Further, if you are working with a file containing multibyte characters in, say, a Unicode encoding such as UTF-8, then this will be a different answer. You would use different functions to read a text file using a multibyte encoding.
So the answer will depend on what type of encoding your text file uses, and whether you are calculating characters or string lengths, which are two different measures.
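To illustrate how byte count and character count diverge, here is a small sketch that assumes the data is valid UTF-8 (an assumption; the question does not name an encoding). It counts code points by skipping continuation bytes, which have the bit pattern 10xxxxxx. The name utf8_count is hypothetical.

```c
#include <stddef.h>

/* Counts code points in nbytes of UTF-8 data.
 * Assumes the input is valid UTF-8; no validation is performed. */
size_t utf8_count(const char *s, size_t nbytes)
{
    size_t count = 0;

    for (size_t i = 0; i < nbytes; i++) {
        unsigned char b = (unsigned char)s[i];
        if ((b & 0xC0) != 0x80)  /* not a continuation byte */
            count++;
    }
    return count;
}
```

For pure ASCII input the result equals the byte count, which matches the single-byte case described above.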
I am trying to demonstrate a buffer overflow, and I wish to overwrite a local variable with gets. I have compiled my program using gcc with -fno-stack-protector, so I know that the buffer that gets uses is right next to another local variable I am trying to overwrite. My goal is to overflow the buffer and overwrite the adjacent variable so that both of them have the same string. However, I noticed that I need to be able to input the '\0' character so that strcmp will actually show that both are equal. How can I input '\0'?
On many keyboards, you can enter a NUL character with Ctrl+@ (might be Ctrl+Shift+2 or Ctrl+Alt+2).
Barring that, you can create a file with a NUL byte and redirect that as stdin.
I'm not sure you'll be able to input a '\0' into a gets(3) or fgets(3) function, as the function checks for newline terminators and probably has some way of protecting you from inputting a nul terminator to a C string (which is assumed to terminate on nul character).
Probably, what you are trying to demonstrate is something implementation dependent (so, undefined behaviour), and will work differently for different implementations.
If you want to correctly overwrite a local variable with only one input statement, just use read(2), which allows you to enter nulls and any other possible character value.
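As a sketch of the read(2) suggestion, the helper below collects raw bytes from a file descriptor on a POSIX system. The name read_raw is hypothetical. It only shows that read() stores '\0' like any other byte; it does not reproduce the overflow itself.

```c
#include <stddef.h>
#include <unistd.h>  /* POSIX: read(), ssize_t */

/* Reads raw bytes from fd into buf until cap bytes are stored or
 * end of input is reached. No byte value is special to read():
 * '\0' and '\n' are stored like any other byte.
 * Returns the number of bytes stored, or -1 on error. */
ssize_t read_raw(int fd, unsigned char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0)
            return -1;
        if (n == 0)
            break;  /* end of input */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Called with STDIN_FILENO, this would accept input redirected from a file containing NUL bytes; on a terminal it keeps reading until the buffer is full or end of input is signalled.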
I encountered a somewhat annoying bug today where a string (stored as a char[]) would be printed with junk at the end. The string that was supposed to be printed (using arduino print/write functions) was correct (it correctly included \r and \n). However, there would be junk printed at the end.
I then allocated an extra element to store a '\0' after '\r' and '\n' (which were the last 2 characters in the string to be printed). Then, print() printed the string correctly. It seems '\0' was used to indicate to the print() function that the string had terminated (I remember reading this in Kernighan's C).
This bug appeared in my code which reads from a text file. It occurred to me that I did not encounter '\0' at all when I designed my code. This leads me to believe that '\0' has no practical use in text editors and are merely used by print functions. Is this correct?
C strings are terminated by the NUL byte ('\0') - this is implicitly appended to any string literals in double quotes, and used as the terminator by all standard library functions operating on strings. From this it follows that C strings can not contain the '\0' terminator in between other characters, since there would be no way to tell whether it is the actual end of string or not.
(Of course you could handle strings in the C language other than as C strings - e.g., simply adding an integer to record the length of the string would make the terminator unnecessary, but such strings would not be fully interoperable with functions expecting C strings.)
A "text file" in general is not governed by the C standard, and a user of a C program could conceivably give a file containing a NUL byte as input to a C program (which would be unable to handle it "correctly" for the above reasons if it read the file into C strings). However, the NUL byte has no valid reason for existing in a plain text file, and it may be considered at least a de facto standard for text files that they do not contain the NUL byte (or certain other control characters, which might break transmission of that text through some terminals or serial protocols).
I would argue that it is an acceptable (though not necessary!) limitation for a program working on plain text input to not guarantee correct output if there are NUL bytes in the input. However, the programmer should be aware of this possibility regardless of whether it will be treated correctly, and not allow it to cause undefined behaviour in their program. Like all user input, it should be considered "unsafe" in the sense that it can contain anything (e.g., it could be maliciously formed on purpose).
This leads me to believe that '\0' has no practical use in text editors and are merely used by print functions. Is this correct?
This is wrong. In C, the end of a character string is designated by the \0 character. This is commonly known as the null terminator. Almost all string functions declared in the C library under <string.h> use this criterion to check or find the end of a string.
A text file, on the other hand, will not typically have any \0 characters in it. So, when reading text from a file, you have to null-terminate your character buffer before you then print it.
\0 is the C escape sequence for the null character (ASCII code 0) and is widely used to represent the end of a string in memory. The character normally doesn't appear explicitly in a text file, however, by convention, most C strings contain a null terminator at the end. Functions that read a string into memory will generally append a \0 to denote the end of the string, and functions that output a string from memory will similarly expect a \0.
Note that there are other ways of representing strings in memory, for example as a (length, content) pair (Pascal notably used this representation), which do not require a null terminator since the length of the string is known ahead of time.
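A (length, content) representation could be sketched in C roughly as follows. The struct layout, the 255-byte capacity, and the names counted_str, counted_set, and counted_write are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define CSTR_MAX 255  /* assumed capacity so the length fits in one byte */

struct counted_str {
    unsigned char len;    /* number of bytes in use */
    char data[CSTR_MAX];  /* content; no terminator is stored */
};

/* Copies n bytes, which may include '\0', into cs.
 * Returns 0 on success, -1 if n exceeds the capacity. */
int counted_set(struct counted_str *cs, const char *src, size_t n)
{
    if (n > CSTR_MAX)
        return -1;
    memcpy(cs->data, src, n);
    cs->len = (unsigned char)n;
    return 0;
}

/* Writes exactly cs->len bytes to out; no terminator is consulted.
 * Returns the number of bytes written. */
size_t counted_write(const struct counted_str *cs, FILE *out)
{
    return fwrite(cs->data, 1, cs->len, out);
}
```

The trade-off is the one noted earlier: such a value cannot be passed to functions like strlen() or puts(), which expect a terminator.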
Common Text Files
The null character '\0', even if rare, can appear in a text file. Code should be prepared to handle reading '\0'.
This also includes other char outside the typical ASCII range, which may be negative with a signed char.
UTF-16
Some "text" files use UTF-16 encoding, and code that encounters one while expecting a typical "text" file will see many null characters.
Line Length
Lines can be too long or too short (only "\n"), or other "text" problems may exist.
Robust code does not trust user/file input until it is qualified and meets expectations. It does not assume null characters are absent.
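One way to qualify input before trusting it is sketched below. The policy (printable characters plus '\n', '\r', and '\t') and the name looks_like_text are assumptions; real code would pick its own rules. The cast to unsigned char matters because char may be signed, and passing a negative value to isprint() is undefined.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the len bytes in buf look like plain text under the
 * assumed policy: printable characters plus '\n', '\r' and '\t'.
 * Returns 0 for null characters and other unexpected bytes.
 * The result of isprint() depends on the current locale. */
int looks_like_text(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char uc = (unsigned char)buf[i];  /* avoid negative values */

        if (uc == '\n' || uc == '\r' || uc == '\t')
            continue;
        if (!isprint(uc))
            return 0;
    }
    return 1;
}
```

The function takes a length instead of relying on a terminator, so a null character inside the data is seen and rejected rather than silently ending the scan.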
I was just wondering that when you input text just using a normal application such as textedit (on OSX) would it still harbour the same '\0' character at the end of each string, so that when read through fgets() it would pick said character up and stop reading?
Because I've created a normal text file, but fgets() keeps on stopping at the end of the designated length instead of when it finds that character, so I am suspicious about whether it actually exists when I write to a normal text file.
For Example:
How Are You
There
fgets(str, 15, stdin);
This would end up producing: TherAre You
No, in general, text files do not contain \0 characters. fgets reads the number of characters requested, or to the end of the line, whichever comes first. It's fgets itself that appends the \0. From the man page:
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.
No, text files don't generally contain any control characters. The termination is a C "feature", i.e. a property of how the C language and environment works with strings. Text files are independent of C. The termination is added (to the in-memory buffer into which the data has been read) by the fgets() function.
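The behaviour described in these answers can be seen with a small wrapper around fgets(). This is only a sketch; the name read_line and its return convention are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Reads one line with fgets() and strips the trailing newline, if any.
 * Returns 1 if a complete line was read, 0 if the line was cut short
 * (buffer too small, or a last line without '\n'), -1 on EOF or error.
 * In every successful case buf is a terminated C string, because
 * fgets() appended the '\0' itself. */
int read_line(char *buf, int size, FILE *fp)
{
    size_t n;

    if (fgets(buf, size, fp) == NULL)
        return -1;
    n = strcspn(buf, "\n");  /* index of the newline, or of the terminator */
    if (buf[n] == '\n') {
        buf[n] = '\0';
        return 1;
    }
    return 0;
}
```

With the two-line input from the question and a 15-byte buffer, the first call stores "How Are You" and the second stores "There", each stopping at the newline rather than at any character stored in the file.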
If your input file does contain a null byte and you're reading with fgets() or equivalent, you have difficulty knowing whether the null in the middle of the string was simply a null in the 'text' file or indicates that the last line of the file did not end with a newline, or that the line was truncated. Clearly, if you try another read and get more data, it was not a premature EOF. If the character immediately before the null byte is a newline, then you can assume that the null byte is the end of string marker added by fgets().
Generally speaking, therefore, if the file contains null bytes, it is not a good idea to use fgets() to read the file.