I am working with the 2nd edition of The C Programming Language by K & R.
In the example program on pg. 29, the authors create a function called getline(), whose purpose is to count the number of chars in a line and also append a '\0' to the end of a line (after the newline character '\n').
My question is: why would you want to do that? Can't you tell where each line starts and ends just from the newline characters themselves?
I think the intent is to split the text into lines.
In the C data model, \0 marks the end of a string. You can be given a string with multiple lines, each signaled by \n, but it'll have a single \0, at the very end.
If you put a \0 after every \n, you are effectively splitting the string into lines, one \0-terminated string for each line.
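To sketch the idea, here is a hypothetical helper (not the book's getline; unlike getline, it also drops the '\n') that turns one buffer into several \0-terminated lines in place:

```c
#include <string.h>

/* Split a buffer in place: replace each '\n' with '\0' and record where
   each line starts.  Returns the number of lines found.
   (Illustrative only -- K&R's getline works on one line at a time.) */
static int split_lines(char *buf, char *lines[], int max_lines)
{
    int n = 0;
    char *p = buf;
    while (n < max_lines && *p != '\0') {
        lines[n++] = p;            /* this line starts here            */
        char *nl = strchr(p, '\n');
        if (nl == NULL)
            break;                 /* last line had no newline         */
        *nl = '\0';                /* terminate the line with '\0'     */
        p = nl + 1;                /* next line starts after it        */
    }
    return n;
}
```

After the call, each entry of lines[] is an ordinary C string that can be passed to strlen, printf("%s", ...), and so on.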
In C, there's no string type properly speaking (as there is in Java or C#, for example). To C, a string is just a sequence of bytes that continues until a 0 byte is found. This is called a NUL-terminated (not to be confused with the NULL pointer constant) or zero-terminated string.
So \0 is appended to make the line a valid C string that can be manipulated as a normal C string afterwards (e.g. with the strlen function). If you don't append a \0, the character count will be wrong, because there is no way to know where the string ends. To show this, here's an example:
If we take a look at a C string containing "Hello" in memory, we find this:
48 65 6C 6C
6F 00 A4 00
48 65 6C 6C 6F is "Hello", plus the 00 byte (\0) that terminates it. So, to count the characters, we just count bytes until the terminating 00 byte: 5 bytes, i.e. 5 characters.
If you don't zero-terminate the string, then there's no way to know how many characters the string has. This is what the memory would look like for a non-zero-terminated "Hello" string:
48 65 6C 6C
6F A4 00 FF
As you can see, there's no way to know where the string ends, and hence no way to count how many bytes it has.
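As a small sketch of this point: strlen only gives a meaningful answer once the terminator is in place (the function name below is made up for illustration).

```c
#include <string.h>

/* Build a short "line" character by character, the way a getline-style
   loop does, then terminate it so the standard string functions know
   where it ends. */
static size_t count_line_chars(void)
{
    char line[8];
    line[0] = 'h';
    line[1] = 'i';
    line[2] = '\n';
    line[3] = '\0';      /* without this byte, strlen would read past the data */
    return strlen(line); /* 3: 'h', 'i' and the newline */
}
```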
The presence of the \0 character has nothing to do with the newline (which is not always \n in binary streams - see comment by Keith Thompson).
Newline is used for the on-screen formatting, (and is denoted in binary by a line feed, a carriage return, or both, depending on the platform); while \0 is used to mark the end of a string, which is, in C, a mere array of characters, with no inherent end.
I agree with the answers above: '\0' just marks the end of the string. It is important for functions such as strcmp. If there is no '\0' in a char array, such functions may read random bytes past the end of the intended string.
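A small sketch of why the terminator matters to strcmp (the function name is made up for illustration):

```c
#include <string.h>

/* Copy three bytes without a terminator, then add one ourselves;
   only after that is the buffer safe to hand to strcmp. */
static int compare_after_terminating(void)
{
    char buf[8];
    memcpy(buf, "cat", 3);     /* copies 'c', 'a', 't' only -- no '\0' yet */
    buf[3] = '\0';             /* terminate, or strcmp would read whatever
                                  indeterminate bytes follow in buf        */
    return strcmp(buf, "cat"); /* 0: the strings compare equal             */
}
```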
I'm trying to write a sample text file using C, but my code is doing it wrong.
The code is below:
#include <stdio.h>

int main() {
    FILE *ptr;
    char c[8] = {65, 66, 67, 13, 10, 67, 68, 69};
    int count;

    ptr = fopen("write.txt", "w");
    for (count = 0; count < 8; count++) {
        fprintf(ptr, "%c", c[count]);
    }
    return 0;
}
The text should be
ABC
CDE
But when I open it in a hex editor, this is what it displays:
41 42 43 0D 0D 0A 43 44 45
Edit: I put "wb" instead of "w" and it works.
Opening a file in text mode opens you up to having your data translated in certain ways, to comply with the requirements of the underlying environment.
For example, in Windows, opening a file in text mode may mean that the C concept of a newline character \n maps to and from the CRLF character sequence in the file.
Hence, when you write the 13 (CR), it goes out as is. When you write the 10 (LF), that's the newline character \n and is translated to Windows line endings, CRLF. That's why you're seeing 0D 0D 0A in the file.
If you want to avoid this translation, just open the file in binary mode(a):
ptr = fopen("write.txt", "wb");
As an aside, it's also a good idea to:
check functions that can fail to see if they do fail (for example, fopen returning NULL or fprintf returning anything other than one (in this case));
close files explicitly when you're done with them;
let C automatically size your arrays where possible (char c[] = {...}); and
use sizeof(c) rather than the "magic" number 8.
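Putting those suggestions together, a revised version might look like this (write_bytes is just an illustrative name; the point is binary mode plus checking every call that can fail):

```c
#include <stdio.h>

/* Write the example bytes out in binary mode, with error checking.
   Returns 0 on success, -1 on any failure. */
static int write_bytes(const char *path)
{
    char c[] = {65, 66, 67, 13, 10, 67, 68, 69}; /* let C size the array     */
    FILE *ptr = fopen(path, "wb");               /* "b": no CRLF translation */
    if (ptr == NULL)
        return -1;                               /* fopen can fail           */
    for (size_t i = 0; i < sizeof c; i++) {
        if (fprintf(ptr, "%c", c[i]) != 1) {     /* fprintf can fail too     */
            fclose(ptr);
            return -1;
        }
    }
    fclose(ptr);                                 /* close explicitly         */
    return 0;
}
```

With "wb", the 13 and 10 land in the file exactly as written: 41 42 43 0D 0A 43 44 45.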
(a) The relevant part of ISO C11 is 7.21.2 Streams /2:
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation.
But just keep in mind that even binary data may not go unmodified, as per /3 in that same section:
A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream.
The only time I personally have ever seen that clause come into play is on the System z mainframes with fixed length record sizes that had to be padded out.
I am reading a binary file using fread. Inside that binary file, there are hex codes that I need to store in a char array, and I am using sscanf to parse them, as such:
buffer holds the whole data, and I know how many bytes are in it, which is stored in an int called size.
An example of the data can be: B8 04 00 8B 5C.
The problem: whenever sscanf sees a 00, and because we're storing the values as characters, it thinks the string has ended, and all characters after the 00 come out wrong; for example, 32 once became 5D, and so on.
A small snippet:
int size=5;
char codes[255];
sscanf(buffer, "%sizec", codes);
.
.
.
printf("%2X ", (unsigned char) codes[i]);
The output: B8 04 00 99 58
while it should be: B8 04 00 3B 5C
If all you are trying to do is to copy size bytes from the front of buffer to the front of codes, then:
memcpy(codes, buffer, size);
Leaving aside the inconsistency of your format string, you cannot use sscanf on binary data at all.
The problem most relevant to your case is that the scanf family of functions treats '\0' as the null terminator of the input string. The values 0x99 0x58 that you see after 0x00 are simply leftover garbage from the time the uninitialized memory block was allocated.
However, the good news is that you do not need sscanf: all it does for you is copy the contents of the string into the codes array, something a plain memcpy can do much more efficiently.
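A sketch of the memcpy approach, using the byte values from the question (the wrapper function is just for illustration):

```c
#include <string.h>

/* Copy `size` raw bytes from buffer into codes.  memcpy copies exactly
   that many bytes and never inspects their values, so the 0x00 byte in
   the middle is preserved rather than treated as a terminator. */
static void copy_codes(unsigned char *codes,
                       const unsigned char *buffer, size_t size)
{
    memcpy(codes, buffer, size);
}
```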
I am writing a program which can encrypt a text file.
First I need to ask the users for the file name. I did something like:
char file_name[50];
fgets(file_name, 50, stdin);
but it didn't work. How can I do this?
I am confused that if I store the file name in a char array which has, say, 50 elements, but the file name has just 10 characters. When I pass the array or the pointer to fopen, what is the program going to do with the remaining 40 elements of that array? Are they storing any value? Will they be passed to fopen?
In C, a string is just a \0 terminated sequence of characters. So long as there is a \0 within the 50 bytes of file_name, file_name contains a valid string for fopen().
The reason fopen() probably failed is that the name you passed it had an extra \n in it after reading it from the input. This is because fgets() also stores the newline character into the buffer. You have to remove it before using the string as a file name.
char *p = strrchr(file_name, '\n');
if (p) *p = '\0';
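Put together, a sketch of reading the name and opening the file might be (open_named_file is an illustrative name, not a standard function):

```c
#include <stdio.h>
#include <string.h>

/* Read a file name from `stream`, strip the trailing newline that
   fgets keeps, and open the file in the given mode.
   Returns NULL on any failure. */
static FILE *open_named_file(FILE *stream, const char *mode)
{
    char file_name[50];
    if (fgets(file_name, sizeof file_name, stream) == NULL)
        return NULL;
    char *p = strrchr(file_name, '\n');
    if (p)
        *p = '\0';               /* remove the newline before fopen */
    return fopen(file_name, mode);
}
```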
I am confused that if I store the file name in a char array which, say, has 50 elements, but the file name has just 10 characters. When I pass the array or the pointer to the fopen() function, what is the program going to do with the remaining 40 elements of that array?
You mean the remaining 38, right? The 11th character will be the newline fgets() puts into the buffer, and the 12th will be the NUL ('\0') character.
It isn't going to do anything with the rest.
Are they storing any value? Will they be passed to fopen()?
Obviously, they are storing some value, which is indeterminate and irrelevant anyway. No, they won't be passed to fopen(). In fact, none of the characters will be passed to fopen(). When you use the array as the argument of a function, it automatically decays into a pointer to its first element, and it's only that pointer that fopen() sees.
Internally, fopen() is most likely implemented using the open() system call (on Unices, at least), and, as almost every function accepting a C string, it will interpret the pointer by searching for the NUL-terminator and assuming that the file name is made of characters up to (but not including) that '\0' character.
fopen will check the string character by character until it encounters \0, and then it stops. So the rest of the string you get from fgets won't matter, as long as the first 10 characters plus the trailing \0 make a valid string.
Also note that you need to remove the extra \n from the string you get from fgets.
Strings in C are just characters in memory followed by a null byte. For example:
'H' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd' '!' '\0'
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 00
When you pass an array to the filename argument of fopen, it decays into a pointer. Essentially, what fopen will do is it will use the characters up to and excluding the null byte. Anything after that isn't touched.
What is the string terminator sequence for a UTF-16 string?
EDIT:
Let me rephrase the question in an attempt to clarify: how does the call to wcslen() work?
Unicode does not define string terminators; your environment or language does. For instance, C strings use 0x00 as a terminator, whereas a .NET String stores the length of the string in a separate field of the class.
To answer your second question, wcslen looks for a terminating L'\0' character. As I read it, that is a null wide character of whatever width wchar_t has on your compiler; if you're using UTF-16, it will be the two-byte sequence 0x00 0x00 (encoding U+0000, NUL).
7.24.4.6.1 The wcslen function (from the Standard)
...
[#3] The wcslen function returns the number of wide
characters that precede the terminating null wide character.
And the null wide character is L'\0'
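A small sketch of that behaviour (the wrapper function is made up for illustration):

```c
#include <wchar.h>

/* wcslen counts wchar_t units up to, but not including, L'\0'. */
static size_t wide_length(void)
{
    const wchar_t *s = L"ab";
    return wcslen(s);   /* 2 wide characters precede the terminator */
}
```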
There isn't any. String terminators are not part of an encoding.
For example, if you had the string ab, it would be encoded in UTF-16 (little endian) with the following sequence of bytes: 61 00 62 00. And if you had 大家 you would get 27 59 B6 5B. So, as you can see, there is no predetermined terminator sequence.
How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more general, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.
Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically meants "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text: t h é \0
mem: 74 68 c3 a9 00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
size_t length;
char *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters) Let look at thé again:
text: t h é \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem: 74 00 68 00 e9 00 00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
Lastly, you have TCHARs, which are one of the above two cases, depending on if UNICODE is defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
According to MSDN, _tcslen corresponds to strlen when _MBCS is defined. strlen will return the number of bytes in the string. If you use _tcsclen that corresponds to _mbslen which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.