Does '\0' appear naturally in text files? - c

I encountered a somewhat annoying bug today where a string (stored as a char[]) would be printed with junk at the end. The string that was supposed to be printed (using arduino print/write functions) was correct (it correctly included \r and \n). However, there would be junk printed at the end.
I then allocated an extra element to store a '\0' after '\r' and '\n' (which were the last 2 characters in the string to be printed). Then, print() printed the string correctly. It seems '\0' was used to indicate to the print() function that the string had terminated (I remember reading this in Kernighan's C).
This bug appeared in my code which reads from a text file. It occurred to me that I did not encounter '\0' at all when I designed my code. This leads me to believe that '\0' has no practical use in text editors and is merely used by print functions. Is this correct?

C strings are terminated by the NUL byte ('\0') - this is implicitly appended to any string literals in double quotes, and used as the terminator by all standard library functions operating on strings. From this it follows that C strings cannot contain the '\0' terminator in between other characters, since there would be no way to tell whether it is the actual end of string or not.
(Of course you could handle strings in the C language other than as C strings - e.g., simply adding an integer to record the length of the string would make the terminator unnecessary, but such strings would not be fully interoperable with functions expecting C strings.)
A "text file" in general is not governed by the C standard, and a user of a C program could conceivably give a file containing a NUL byte as input to a C program (which would be unable to handle it "correctly" for the above reasons if it read the file into C strings). However, the NUL byte has no valid reason for existing in a plain text file, and it may be considered at least a de facto standard for text files that they do not contain the NUL byte (or certain other control characters, which might break transmission of that text through some terminals or serial protocols).
I would argue that it is an acceptable (though not necessary!) limitation for a program working on plain text input to not guarantee correct output if there are NUL bytes in the input. However, the programmer should be aware of this possibility regardless of whether it will be treated correctly, and not allow it to cause undefined behaviour in their program. Like all user input, it should be considered "unsafe" in the sense that it can contain anything (e.g., it could be maliciously formed on purpose).

This leads me to believe that '\0' has no practical use in text
editors and is merely used by print functions. Is this correct?
This is wrong. In C, the end of a character string is designated by the \0 character. This is commonly known as the null terminator. Almost all string functions declared in the C library under <string.h> use this criterion to check or find the end of a string.
A text file, on the other hand, will not typically have any \0 characters in it. So, when reading text from a file, you have to null-terminate your character buffer before you then print it.
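A minimal sketch of that step, assuming the data arrives through a standard FILE stream. The helper name read_terminated and its parameters are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Read at most cap - 1 bytes from fp into buf and append '\0'.
   Returns the number of bytes read (not counting the terminator). */
size_t read_terminated(FILE *fp, char *buf, size_t cap)
{
    size_t n;

    assert(fp != NULL && buf != NULL && cap > 0);
    n = fread(buf, 1, cap - 1, fp);  /* fread does not add a terminator */
    buf[n] = '\0';                   /* so add one before printing */
    return n;
}
```

If the file itself happens to contain a NUL byte, printing buf as a string stops at that byte, which is one reason to keep the returned length around.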

\0 is the C escape sequence for the null character (ASCII code 0) and is widely used to represent the end of a string in memory. The character normally doesn't appear explicitly in a text file. By convention, however, C strings contain a null terminator at the end. Functions that read a string into memory will generally append a \0 to denote the end of the string, and functions that output a string from memory will similarly expect a \0.
Note that there are other ways of representing strings in memory, for example as a (length, content) pair (Pascal notably used this representation), which do not require a null terminator since the length of the string is known ahead of time.
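As a rough illustration of the (length, content) idea in C, here is a sketch. The struct counted_string and the function counted_string_write are hypothetical names, not part of any standard library:

```c
#include <assert.h>
#include <stdio.h>

/* A (length, content) pair; the bytes need no terminator. */
struct counted_string {
    size_t length;
    const char *bytes;
};

/* Write every byte, including any '\0', and return how many were written. */
size_t counted_string_write(const struct counted_string *s, FILE *out)
{
    assert(s != NULL && out != NULL);
    return fwrite(s->bytes, 1, s->length, out);
}
```

Because the length travels with the data, a value such as { 7, "abc\0def" } is written in full, whereas a function expecting a C string would stop after "abc".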

Common Text Files
The null character '\0', even if rare, can appear in a text file. Code should be prepared to handle reading '\0'.
This also includes other char values outside the typical ASCII range, which may be negative with a signed char.
UTF-16
Some "text" files use UTF-16 encoding, and code that encounters one while expecting a typical "text" file will see many null characters.
Line Length
Lines can be too long or too short (only "\n"), or other "text" problems may exist.
Robust code does not trust user/file input until it is qualified and meets expectations. It does not assume null characters are absent.
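One way to qualify input before treating it as a C string is to scan the raw bytes for zeros. A minimal sketch, with count_nul_bytes being a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Count the zero bytes in the first len bytes of data. */
size_t count_nul_bytes(const unsigned char *data, size_t len)
{
    size_t count = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (data[i] == 0)
            count++;
    }
    return count;
}
```

For example, "Hi" encoded as UTF-16LE is the four bytes 48 00 69 00, so the function reports two zero bytes for it and none for the two-byte ASCII form.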

Related

How to input a string to C with null character in it via gets?

I am trying to demonstrate a buffer overflow, and I wish to overwrite a local variable with gets. I have compiled my program using gcc with -fno-stack-protector, so I know that the buffer that gets uses is right next to another local variable I am trying to overwrite. My goal is to overflow the buffer and overwrite the adjacent variable so that both of them have the same string. However, I noticed that I need to be able to input the '\0' character so that strcmp will actually show that both are equal. How can I input '\0'?
On many keyboards, you can enter a NUL character with Ctrl+@ (might be Ctrl+Shift+2 or Ctrl+Alt+2).
Barring that, you can create a file with a NUL byte and redirect that as stdin.
I'm not sure you'll be able to input a '\0' into a gets(3) or fgets(3) function, as the function checks for newline terminators and probably has some way of protecting you from inputting a nul terminator to a C string (which is assumed to terminate on nul character).
Probably, what you are trying to demonstrate is something implementation dependent (so, undefined behaviour), and will work differently for different implementations.
If you want to correctly overwrite a local variable with only one input statement, just use read(2), which allows you to enter nulls and any other possible character value.
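A minimal sketch of reading raw bytes with read(2), assuming a POSIX system. The helper name read_all is made up here, and for brevity the loop does not retry a call interrupted by a signal:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from fd until cap bytes are stored or end of input is reached.
   NUL bytes are stored like any other byte.
   Returns the number of bytes stored, or -1 on a read error. */
ssize_t read_all(int fd, unsigned char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0)
            return -1;   /* read error */
        if (n == 0)
            break;       /* end of input */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Since the function returns a byte count instead of relying on a terminator, the caller can tell exactly how much data arrived even when some of it is '\0'.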

How can I print a string with the same length with or without multicharacters?

I am trying to do exercise 1-22 in the K&R book. It asks to fold long lines (i.e., break them onto a new line) after a predefined number of characters in the string.
While testing the program I found that it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?
What happens in your code?
K&R was written at a time when every character was encoded in one single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding is used to represent the thousands of Unicode characters on 8-bit bytes, using a variable length scheme:
the ascii characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not manage variable-length encoding. So each of your special characters is counted as two, which tricks your algorithm.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
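I can propose a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, just ignore the characters c in memory that comply with the condition (c & 0xC0) == 0x80. These are the UTF-8 continuation bytes. The parentheses are needed because == binds more tightly than & in C.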
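The trick can be sketched as follows, assuming the input is valid UTF-8. The function name utf8_code_points is made up for this example, and it counts code points, which is only an approximation of display width (combining marks and double-width characters are not accounted for):

```c
#include <assert.h>
#include <stddef.h>

/* Count the bytes of s that are not UTF-8 continuation bytes,
   i.e. the number of code points if s is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With this, the four-byte string "\xC3\xB6\xC5\x9F" (ö followed by ş) counts as two characters, so a fold routine comparing against a column limit would no longer break early on such lines.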
Another way would be to use wide characters, wchar_t/wint_t (requires the header wchar.h), instead of char/int, and getwc()/putwc() instead of getc()/putc(). If sizeof(wchar_t) is 4 in your environment, you will be able to work with Unicode using just the wide characters and wide library functions instead of the normal ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you could work correctly with a larger subset of Unicode but could still encounter alignment issues in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
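A sketch of character-by-character counting with mbrtowc. The function name count_multibyte_chars is made up, and the result depends on the current locale: for UTF-8 input a real program would first call setlocale(LC_ALL, "") and the environment would need to provide a UTF-8 locale, both of which are assumptions here:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in s according to the current locale.
   Returns (size_t)-1 if s contains an invalid or incomplete sequence. */
size_t count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining;
    size_t count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);  /* start in the initial conversion state */
    remaining = strlen(s);

    while (remaining > 0) {
        /* n is the number of bytes making up the next character. */
        size_t n = mbrtowc(NULL, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            break;  /* a null character; not expected before remaining hits 0 */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

Returning an error for malformed input, instead of guessing, leaves it to the caller to decide whether to fall back to counting bytes.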
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.

NULL Terminator in text files

I am creating a program that reads in the contents of a text file through the command line, character by character.
Is the NULL value automatically inserted or do I have to add it to the text file manually?
Text files do not need to have a terminator on modern platforms. (On some legacy platforms they did have one, but I doubt it is the case here.) You almost certainly should not write a terminator into the file, as it may cause problems with programs that do not expect one. The end of file serves as a terminator when reading.
Text strings in C are arrays of characters terminated by a zero, aka the null character, mnemonic NUL (with one L, and it is not the same thing as NULL in C). When creating strings, you do need to terminate them correctly. Functions returning strings, including ones that read them from files (e.g., fgets), terminate them for you.
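For the character-by-character case, end of file is detected through the return value of the read function, not through a byte stored in the file. A minimal sketch, with count_characters being a made-up name:

```c
#include <assert.h>
#include <stdio.h>

/* Read fp one character at a time until end of file and return the count. */
long count_characters(FILE *fp)
{
    long count = 0;
    int c;  /* int, not char, so that EOF is distinguishable from every byte */

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF)
        count++;
    return count;
}
```

Nothing has to be added to the file for this loop to stop: getc returns EOF once the data runs out.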

What is wrong with adding null character to non null-terminated string?

Why I shouldn't add a null character to the end of a non null-terminated string like in this answer? I mean if I have a non null-terminated string and add null character to the end of the string, I now have a null-terminated string which should be good, right?
Is there any security problem I don't see?
Here's the code in case the answer gets deleted:
char letters[SIZE + 1]; // Leave room for the null-terminator.
// ...
// Populate letters[].
// ...
letters[SIZE] = '\0'; // Null-terminate the array.
To know the end of the string you must have a null-terminated string; otherwise there is no way to know where the string ends.
There is nothing technically wrong with terminating the string with \0 this way. However, the approaches you can use to populate the array before adding \0 are prone to error. Consider some situations:
Suppose you decide to populate letters char by char. What happens if you forget to add some letters? What if you add more letters than the expected size?
What if there are thousands of letters to populate the array?
What if you need to populate letters with Unicode characters that (often) require more than one byte per symbol?
Of course you can address these situations very carefully but they still will be prone to error when maintaining the code.
To be clear: a string in C always has one and only one null character - it is the last character of the string. A string is an array of characters. If an array of characters does not have a null character, it is not a string.
A string is a contiguous sequence of characters terminated by and including the first null character. C11dr 7.1.1 1
There is nothing wrong with adding a null character to an array of characters as OP coded.
This is a fine way to form a string if:
All the preceding characters are defined.
String functions are not called until after a null character is written.
You shouldn't use it, to avoid errors (or security holes) due to mixing C/Pascal strings.
C style string: An array of char, terminated by NUL ('\0')
Pascal style string: a kind of structure, with an int holding the size of the string, and an array with the string itself.
The Pascal style doesn't use in-band control, so it can use any char inside it, including NUL. C strings can't, as they use it as signaling control.
The problem is when you mix them, or assume one style when it's another. Or even try to convert between them.
Converting a C string to Pascal would do no harm. But if you have a legit Pascal string with one or more NUL characters inside it, converting it to C style will cause problems, as a C string can't represent it.
A good example of this is the X.509 Null Char Exploit, where you could register a ssl certificate to:
www.mysimplesite.com\0www.bigbank.com
The X.509 certificate uses Pascal strings, so this is valid. But when checking, the CA could use or assume C code or string style that just sees the first www.mysimplesite.com and signs the certificate. And some browsers parse this certificate as valid also for www.bigbank.com.
So, you CAN use it, but you SHOULDN'T, as it risks causing a bug or even a security breach.
More details and info:
https://www.blackhat.com/presentations/bh-usa-09/MARLINSPIKE/BHUSA09-Marlinspike-DefeatSSL-SLIDES.pdf
https://sites.google.com/site/cse825maninthemiddle/odds-and-ends/x-509-null-char-exploit
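To illustrate the mismatch between the two string styles, here is a sketch of a length-aware comparison. The function name counted_equals_cstr is hypothetical and is not taken from any TLS library:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True only if the len bytes at data are exactly the C string expected.
   An embedded '\0' in data makes the lengths differ, so it cannot match. */
bool counted_equals_cstr(const char *data, size_t len, const char *expected)
{
    assert(data != NULL && expected != NULL);
    return len == strlen(expected) && memcmp(data, expected, len) == 0;
}
```

The 36-byte name above is rejected when compared against "www.mysimplesite.com" this way, whereas strcmp would stop at the embedded '\0' after 20 bytes and report a match.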
In general, there are two ways of keeping track of an array of some variable number of things:
Use a terminator. Of course, this is the C approach to representing strings: an array of characters of some unknown size, with the actual string length given by a null terminator.
Use an explicit count stored somewhere else. (As it happens, this is how Pascal traditionally represents strings.)
If you have an array containing a known but not null-terminated sequence of characters, and if you want to turn it into a proper null-terminated string, and if you know that the underlying array is allocated big enough to contain the null terminator, then yes, explicitly setting array[N] to '\0' is not only acceptable, it is the way to do it.
Bottom line: it's a fine technique (if the constraints are met). I don't know why that earlier answer was criticized and downvoted.
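Those constraints can be made explicit in code. A minimal sketch, where make_c_string and its parameter names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Copy n bytes from src into dst and terminate the result.
   dst must have room for n + 1 bytes; if it does not, nothing is
   written and false is returned. */
bool make_c_string(char *dst, size_t dst_size, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n >= dst_size)
        return false;   /* no room for the terminator */
    memcpy(dst, src, n);
    dst[n] = '\0';      /* now dst is a proper C string */
    return true;
}
```

Checking the size first means the terminator is never written past the end of the array, and string functions are only safe to call on dst after the function has returned true.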

The terminating NULL in an array in C

I have a simple question. Why is it necessary to consider the terminating null in an array of chars (or simply a string) and not in an array of integers? When I want a string to hold 20 characters I need to declare char string[21];. When I want to declare an array of integers holding 5 digits then int digits[5]; is enough. What is the reason for this?
You don't have to terminate a char array with a null character if you don't want to, but when using it to represent a string, you need to do it because C uses null-terminated strings to represent its strings. When you use functions that operate on strings (like strlen for string length or printf to output a string), those functions will read through the data until a null character is encountered. If one isn't present, then you would likely run into buffer overflow or similar access violation/segmentation fault problems.
In short: that's how C represents string data.
Null terminators are required at the end of strings (or character arrays) because:
Most standard library string functions expect the null character to be there. It's put there in lieu of passing an explicit string length (though some functions require that instead.)
By design, the NUL character (ASCII 0x00) is used to designate the end of strings. It is not, however, an end-of-file marker: when reading files or streams, C reports the end of input through the separate EOF value returned by the I/O functions.
Technically, if you're doing your own string manipulation with your own coded functions, you don't need a null terminator; you just need to keep track of how long the string is. But, if you use just about anything standardized, it will expect it.
It is only by convention that C strings end in the ascii nul character. (That's actually something different than NULL.)
If you like, you can begin your strings with a nul byte, or randomly include nul bytes in the middle of strings. You will then need your own library.
So the answer is: all arrays must allocate space for all of their elements. Your "20 character string" is simply a 21-character string, including the nul byte.
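The arithmetic can be checked directly. In this small sketch, storage_for is a made-up helper:

```c
#include <assert.h>
#include <string.h>

/* Bytes of storage needed to hold s as a C string:
   its characters plus one for the terminating '\0'. */
size_t storage_for(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

The same count shows up with string literals: strlen("hello") is 5, while sizeof "hello" is 6 because the literal's array includes the terminator.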
The reason is that it was a design choice of the original implementors. A null-terminated string gives you a way to pass an array into a function without passing the size. With an integer array you must always pass the size. It's a convention of the language, nothing more: you could rewrite every string function in C without using a null terminator, but you would always have to keep track of your array size.
The purpose of null termination in strings is so that the parser knows when to stop iterating through the array of characters.
So, when you use printf with the %s format character, it's essentially doing this:
int i = 0;
while (input[i] != '\0') {
    putchar(input[i]);  /* emit one character */
    i++;
}
This concept is commonly known as a sentinel.
It's not about declaring an array that's one-bigger, it's really about how we choose to define strings in C.
C strings by convention are considered to be a series of characters terminated by a final NUL character, as you know. This is baked into the language in the form of interpreting "string literals", and is adopted by all the standard library functions like strcpy, printf, and so on. Everyone agrees that this is how we'll do strings in C, and that character is there to tell those functions where the string stops.
Looking at your question the other way around, the reason you don't do something similar in your arrays of integers is because you have some other way of knowing how long the array is-- either you pass around a length with it, or it has some assumed size. Strings could work this way in C, or have some other structure to them, but they don't -- the guys at Bell Labs decided that "strings" would be a standard array of characters, but would always have the terminating NUL so you'd know where it ended. (This was a good tradeoff at that time.)
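The two conventions side by side, as a sketch. The names sum_ints and sentinel_length are made up, and the second function simply mirrors what strlen does:

```c
#include <assert.h>
#include <stddef.h>

/* Explicit count: the caller says how many ints there are. */
long sum_ints(const int *values, size_t count)
{
    long total = 0;
    size_t i;

    assert(values != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Sentinel: walk the array until the terminating '\0' is found. */
size_t sentinel_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}
```

The int version needs no spare element because its length is passed separately; the char version needs one extra element because the length is encoded in the data itself.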
It's not absolutely necessary to have the character array be 21 elements. It's only necessary if you follow the (nearly always assumed) convention that the twenty characters be followed by a null terminator. There is usually no such convention for a terminator in integer and other arrays.
Because of the technical reasons of how C strings are implemented compared to other conventions.
Actually - you don't have to NUL-terminate your strings if you don't want to!
The only problem is you have to re-write all the string libraries because they depend on them. It's just a matter of doing it the way the library expects if you want to use their functionality.
Just like I have to bring home your daughter at midnight if I wish to date her - just an agreement with the library (or in this case, the father).

Resources