How to get byte size of multibyte string - c

How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more general, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.

Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically meants "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text: t h é \0
mem: 74 68 c3 a9 00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
    size_t length;
    char *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
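A tiny illustration of that, with the literal spelled out as UTF-8 escape sequences so it doesn't depend on the source file's encoding (a sketch, not anything MS-specific):
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "th\xc3\xa9";    /* "thé" as UTF-8 bytes */
    printf("%zu\n", strlen(s));      /* prints 4: t, h, plus two bytes for é */
    return 0;
}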
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters) Let look at thé again:
text: t h é \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem: 74 00 68 00 e9 00 00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
Lastly, you have TCHARs, which are one of the above two cases, depending on if UNICODE is defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
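So the byte-size calculation from the top of the question looks like this inside a function, assuming the Microsoft CRT's <tchar.h> is included (msg is just an example name):
TCHAR msg[] = _T("TCHAR string");
size_t bytes = _tcslen(msg) * sizeof(TCHAR);        /* payload bytes, terminator not counted */
size_t total = (_tcslen(msg) + 1) * sizeof(TCHAR);  /* bytes including the terminating null */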

According to MSDN, _tcslen corresponds to strlen when _MBCS is defined, and strlen will return the number of bytes in the string. If you use _tcsclen, that corresponds to _mbslen, which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.
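If you do need the character count as well as the byte count, here is a portable sketch using only standard C; it assumes the string is valid in the current locale's multibyte encoding (otherwise mbstowcs returns (size_t)-1):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");                     /* use the environment's multibyte encoding */
    const char *s = "th\xc3\xa9";              /* "thé", assuming a UTF-8 locale */
    size_t bytes = strlen(s);                  /* 4 */
    size_t chars = mbstowcs(NULL, s, 0);       /* 3, or (size_t)-1 if the bytes aren't valid */
    printf("%zu bytes, %zu characters\n", bytes, chars);
    return 0;
}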

Related

K&R C - Adding '\0' after a newline

I am working with the 2nd edition of The C Programming Language by K & R.
In the example program on pg. 29, the authors create a function called getline(), whose purpose is to count the number of chars in a line and also append a '\0' to the end of a line (after the newline character '\n').
My question is, why do you want to do that? Can't you figure out the start of a newline by the fact that you have the newline character?
I think the intent is to split the text into lines.
In the C data model, \0 marks the end of the string. You can be given a string with multiple lines signaled by \n, but it'll have a single \0, at the end.
If you put a \0 after every \n, you are effectively splitting the string into lines, one \0-terminated string for each line.
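For reference, the function in question looks roughly like this (reconstructed from memory, so treat it as a sketch rather than a verbatim quote; it's renamed from the book's getline to avoid clashing with the POSIX function of the same name):
#include <stdio.h>

/* read a line into s, return its length */
int get_line(char s[], int lim)
{
    int c, i;

    for (i = 0; i < lim - 1 && (c = getchar()) != EOF && c != '\n'; ++i)
        s[i] = c;
    if (c == '\n') {        /* keep the newline, as the book does */
        s[i] = c;
        ++i;
    }
    s[i] = '\0';            /* terminate, so s is a valid C string */
    return i;
}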
In C, there's no string type properly speaking (like in Java or C# for example). For C it's just a sequence of bytes until a 0 byte is found. This is called a NUL-terminated (do not confuse with NULL constant) or zero-terminated string.
So \0 is appended to make it a valid C string that represents a line and be able to manipulate it as a normal C string afterwards (e.g. use strlen function). If you don't append a \0 the character count will be wrong because you don't know where the string ends. To show this, here's an example:
If we take a look at a C string containing "Hello" in memory, we find this:
48 65 6C 6C
6F 00 A4 00
48 65 6C 6C 6F is "Hello", plus a 00 byte (\0) that terminates it. So to count how many characters there are, we just count bytes until the terminating 00 byte: that is 5 bytes (5 characters).
If you don't zero-terminate the string, then there's no way to know how many characters the string has. This is what the memory would look like for a non-zero-terminated "Hello" string:
48 65 6C 6C
6F A4 00 FF
As you can see, there's no way to know where the string ends, and hence, impossible to count how many bytes it has.
The presence of the \0 character has nothing to do with the newline (which is not always \n in binary streams - see comment by Keith Thompson).
Newline is used for on-screen formatting (and is denoted in binary by a line feed, a carriage return, or both, depending on the platform), while \0 is used to mark the end of a string, which in C is a mere array of characters with no inherent end.
I agree with the answers above: '\0' just marks the end of the string. It is important for functions such as strcmp; if there is no '\0' in a char array, the program may read past the string and return whatever random bytes follow it.
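The counting itself is nothing magic once the terminator is there; a strlen-style loop is just this (a sketch of the idea, not the library implementation):
#include <stddef.h>

size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')    /* stop at the first zero byte */
        ++n;
    return n;               /* number of bytes before the terminator */
}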

Why does printf("%s",charstr) increasingly prints more than expected with each fread()?

In an attempt to learn file structures, I am trying to read in a .wav file and simply print information about it. I have a struct that holds all the information defined as so:
typedef struct {
    char chunkId[4];
    unsigned int chunkSize;
    char format[4];
    char subchunk1Id[4];
    unsigned int subchunk1Size;
    unsigned short audioFormat;
    unsigned short numChannels;
    unsigned int sampleRate;
    unsigned int byteRate;
    unsigned short blockAlign;
    unsigned short bitsPerSample;
    char subchunk2Id[4];
    unsigned int subchunk2Size;
    void *data;
} WavFile;
What's happening is that each time I fread through the file, it causes my C strings to print longer and longer. Here's a sample code snippet:
fseek(file, SEEK_SET, 0);
fread(wavFile.chunkId, 1, sizeof(wavFile.chunkId), file);
fread(&wavFile.chunkSize, 1, sizeof(wavFile.chunkSize), file);
fread(wavFile.format, 1,sizeof(wavFile.format), file);
fread(wavFile.subchunk1Id, 1, sizeof(wavFile.subchunk1Id), file);
fread(&wavFile.subchunk1Size, 1, sizeof(wavFile.subchunk1Size), file);
fread(&wavFile.audioFormat, 1, sizeof(wavFile.audioFormat), file);
printf("%s\n",wavFile.chunkId);
printf("%d\n",wavFile.chunkSize);
printf("%s\n",wavFile.format);
printf("%s\n",wavFile.subchunk1Id);
printf("%d\n",wavFile.subchunk1Size);
printf("%d\n",wavFile.audioFormat);
Something in the way I have my struct setup, the way I'm reading the file, or the way that printf() is seeing the string is causing the output to print as shown:
RIFF�WAVEfmt
79174602
WAVEfmt
fmt
16
1
The expected output:
RIFF
79174602
WAVE
fmt
16
1
I do understand that C strings need to be null terminated, but then I got to thinking: how is printing a string from a binary file any different from printing a string literal like printf("test");? The file specification requires the members to have the exact sizes defined in my struct, so doing char chunkId[5]; and then chunkId[4]='\0'; doesn't seem like a good solution to this problem.
I've been trying to resolve this for a couple days now, so now I'm coming to SO to maybe get a push in the right direction.
For full disclosure, here's the hex output of the relevant portion of the file because this webform doesn't show all garbled mess that is showing up on my output.
52 49 46 46 CA 1B B8 04 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 02 00 44 AC 00 00 98 09 04 00 06 00 18 00 64 61 74 61
If you know the size, you can limit the output of printf:
// Only prints 4-bytes from format. No NULL-terminator needed.
printf("%.4s\n", wavFile.format);
If the size is stored in a different field, you can use that too:
// The * says: print number of chars, as dictated by "theSize"
printf("%.*s\n", wavFile.theSize, wavFile.format);
The way you have called printf(), it expects a '\0'-terminated string, but your struct elements aren't (fread() doesn't add a '\0', and format, chunkId etc. aren't long enough to hold one).
The simplest way is:
printf( "%.*s\n", (int)sizeof(wavFile.format), wavFile.format );
If it is not a null-terminated string, you can use .* and an extra int argument that specifies the length of the string to printf, for example:
printf("%.*s\n", (int)sizeof(wavFile.chunkId), wavFile.chunkId);
or alternatively:
printf("%.4s\n", wavFile.chunkId);
which may be simpler since the size seems to be fixed in your case.
From the printf documentation, the precision specifier in the format string works as follows:
(optional) . followed by integer number or * that specifies precision of the conversion. In the case when * is used, the precision is specified by an additional argument of type int. If the value of this argument is negative, it is ignored. See the table below for exact effects of precision.
and the table below which this text references says the following for character string:
Precision specifies the maximum number of bytes to be written.
First, be sure you're reading the file in binary mode (use fopen with the mode set to "rb"). This makes no difference on Unix-like systems, but on others reading a binary file in text mode may give you corrupted data. And you should be checking the value returned by each fread() call; don't just assume that everything works.
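For example, a hedged sketch of that kind of check, reusing the file and wavFile variables from the question:
if (fread(wavFile.chunkId, 1, sizeof wavFile.chunkId, file) != sizeof wavFile.chunkId) {
    fprintf(stderr, "short read or I/O error while reading chunkId\n");
    /* handle the error: close the file, return, etc. */
}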
printf with a %s format requires a pointer to a string. A string always has a null character '\0' to mark the end of it.
If you have a chunk of data read from a file, it's unlikely to have a terminating null character.
As the other answers say, there are variations of the %s format that can limit the number of characters printed, but even so, printf won't print anything past the first null character that happens to appear in the array. (A null character, which is simply a byte with the value 0, may be valid data, and there may be more valid data after it.)
To print arbitrary character data of known length, use fwrite:
fwrite(wavFile.chunkId, sizeof wavFile.chunkId, 1, stdout);
putchar('\n');
In this particular case, it looks like you're expecting chunkId to contain printable characters; in your example, it has "RIFF" (but without the trailing null character). But you could be reading an invalid file.
And printing binary data to standard output can be problematic. If it happens to consist of printable characters, that's fine, and you can assume that everything is printable in an initial version. But you might consider checking whether the characters in the array actually are printable (see isprint()), and print their values in hexadecimal if they're not.
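A minimal sketch of that idea (print_id is a hypothetical helper; it assumes the 4-byte ID fields from the question's struct):
#include <ctype.h>
#include <stdio.h>

void print_id(const char id[4])
{
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)id[i];
        if (isprint(c))
            putchar(c);              /* printable: output the character itself */
        else
            printf("\\x%02X", c);    /* otherwise show the byte value in hex */
    }
    putchar('\n');
}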

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
    fread(&sample, sizeof(wchar_t*), 1, fpi);
    if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0)) {
        fwrite(L"_", sizeof(wchar_t*), 1, fpo);
    } else {
        fwrite(&sample, sizeof(wchar_t*), 1, fpo);
    }
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε, I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
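To make the multibyte side concrete, here is a minimal sketch that walks a UTF-8 buffer one multibyte character at a time with mbrtowc; it assumes setlocale() ends up selecting a UTF-8 locale, and it ignores normalization entirely:
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                     /* must be a UTF-8 locale for this input */
    const char *s = "\xce\xac\xce\xb5x";       /* "άεx" spelled out as UTF-8 bytes */
    mbstate_t st = {0};
    const char *p = s;
    size_t left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                             /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                             /* an embedded null byte */
        /* wc is the Unicode code point if __STDC_ISO_10646__ is defined */
        printf("U+%04lX uses %zu byte(s)\n", (unsigned long)wc, n);
        p += n;
        left -= n;
    }
    return 0;
}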
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character and do comparisons with those. That won't work if the input stream is UTF-8, and it's not really possible to do with quite this structure.
In short: full Unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF-8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.
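Putting that advice together, a minimal corrected sketch of the loop, using the wide-character stdio functions on a system where the locale does the UTF-8 decoding (glibc does; on Windows you would go through MultiByteToWideChar as described above). It reuses the ifilename/ofilename variables from the question:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

setlocale(LC_ALL, "");                  /* assumed to select a UTF-8 locale */
FILE *fpi = fopen(ifilename, "r");
FILE *fpo = fopen(ofilename, "w");
wint_t wc;
while ((wc = fgetwc(fpi)) != WEOF) {    /* reads one decoded character at a time */
    if (wc == L'ά' || wc == L'ε')       /* plain value comparison, no wcscmp needed */
        fputwc(L'_', fpo);
    else
        fputwc(wc, fpo);
}
fclose(fpi);
fclose(fpo);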

UTF-16 string terminator

What is the string terminator sequence for a UTF-16 string?
EDIT:
Let me rephrase the question in an attempt to clarify. How does the call to wcslen() work?
Unicode does not define string terminators. Your environment or language does. For instance, C strings use 0x0 as a string terminator, whereas .NET strings store the length in a separate field of the String class instead.
To answer your second question, wcslen looks for a terminating L'\0' character. As I read it, that is a run of 0x00 bytes whose length depends on the compiler, but it will likely be the two-byte sequence 0x00 0x00 if you're using UTF-16 (encoding U+0000, 'NUL').
7.24.4.6.1 The wcslen function (from the Standard)
...
[#3] The wcslen function returns the number of wide
characters that precede the terminating null wide character.
And the null wide character is L'\0'
There isn't any. String terminators are not part of an encoding.
For example, if you had the string ab, it would be encoded in UTF-16 with the following sequence of bytes: 61 00 62 00. And if you had 大家 you would get 27 59 B6 5B. So, as you can see, there is no predetermined terminator sequence.
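A small sketch that makes the wcslen() behaviour concrete (the byte figure assumes a 16-bit wchar_t, as on Windows; on most Unix systems wchar_t is 32 bits):
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t s[] = L"ab";                       /* stored as 0x0061 0x0062 0x0000 */
    printf("%zu characters\n", wcslen(s));     /* 2 -- counting stops at the L'\0' */
    printf("%zu bytes\n", sizeof s);           /* 6 with 16-bit wchar_t (terminator included) */
    return 0;
}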

printf field width : bytes or chars?

The printf/fprintf/sprintf family supports a width field in its format specifier. I have a question about the case of (non-wide) char array arguments: is the width field supposed to mean bytes or characters? And what is the correct (de facto) behaviour if the char array corresponds to (say) a raw UTF-8 string? (I know that normally I should use some wide char type; that's not the point.)
For example, in
char s[] = "ni\xc3\xb1o"; // utf8 encoded "niño"
fprintf(f,"%5s",s);
Is that function supposed to try to output just 5 bytes (plain C chars), leaving you responsible for misalignment or other problems if two of those bytes form a single textual character? Or is it supposed to try to compute the length in "textual characters" of the array, decoding it according to the current locale? (In the example, this would amount to finding out that the string has 4 Unicode chars, so it would add a space for padding.)
UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set previously and one has the (today most common) LANG/LC_CTYPE=en_US.UTF-8.
Case in point:
#include <stdio.h>
#include <locale.h>

int main(void)
{
    char *locale = setlocale(LC_ALL, "");      /* I have LC_CTYPE="en_US.UTF-8" */
    char s[] = {'n','i', 0xc3,0xb1,'o',0};     /* "niño" in utf8: 5 bytes, 4 unicode chars */
    printf("|%*s|\n", 6, s);                   /* this should pad a blank - works ok */
    printf("|%.*s|\n", 4, s);                  /* this should eat a char - works ok */
    char s3[] = {'A',0xb1,'B',0};              /* this is not valid UTF8 */
    printf("|%s|\n", s3);                      /* print raw chars - ok */
    printf("|%.*s|\n", 15, s3);                /* panics (why???) */
    return 0;
}
So, even when a non-POSIX-C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars), not Unicode chars. That's fine. However, when given a char array that is not decodable in its locale, it silently panics (it aborts -- nothing is printed after the first '|' -- with no error message)... but only when it needs to count some width. I don't understand why it even tries to decode the string from UTF-8 when it doesn't need to. Is this a bug in glibc?
Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)
Note: it's not related to terminal display issues - you can check the output by piping to od : $ ./a.out | od -t cx1 Here's my output:
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n |
7c 41 b1 42 7c 0a 7c
UPDATE 2 (May 2015): This questionable behaviour has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works ok for me.
It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.
The ISO term for an 8-bit value is an octet.
Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.
I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT were defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)
Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.
If your setup is just stopping output after the first | (I assume that's what you mean by abort but, if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone should be happy to run that against the latest version if your distribution doesn't quite get there (most don't).
As an aside, I'm not sure what you meant by checking the od output. On my system, I get:
pax> ./qq | od -t cx1
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n | A 261 B | \n
7c 41 b1 42 7c 0a 7c 41 b1 42 7c 0a
0000034
so you can see the output stream contains the UTF-8, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.
Although I've just realised you may be saying that your od output has only the starting bar on that line as well (unlike mine which appears to not have the problem), meaning that it is something wrong within C/glibc, not something wrong with the terminal silently dropping the characters (in all honesty, I would expect the terminal to drop either the whole line or just the offending character (i.e., output |A) - the fact that you're just getting | seems to preclude a terminal problem). Please clarify that.
Bytes (chars). There is no built-in support for Unicode semantics. You can imagine it as resulting in at least 5 calls to fputc.
What you've found is a bug in glibc. Unfortunately it's an intentional one which the developers refuse to fix. See here for a description:
http://www.kernel.org/pub/linux/libs/uclibc/Glibc_vs_uClibc_Differences.txt
The original question (bytes or chars?) was rightly answered by several people: both according to the spec and the glibc implementation, the width (or precision) in the printf C functions counts bytes (or plain C chars, which are the same thing). So, fprintf(f,"%5s",s) in my first example definitely means "try to output at least 5 bytes (plain chars) from the array s - if there aren't enough, pad with blanks".
It does not matter whether the string (in my example, of byte-length 5) represents text encoded in, say, UTF-8 and in fact contains 4 "textual (Unicode) characters". To printf(), internally, it just has 5 (plain) C chars, and that's what counts.
Ok, this seems crystal clear. But it doesn't explain my other problem. Then we must be missing something.
Searching in glibc bug-tracker, I found some related (rather old) issues - I was not the first one caught by this... feature:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
This quote, from the last link, is especially relevant here:
ISO C99 requires for %.*s to only write complete characters that fit below the
precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1
characters as shown in the input file you provided, some of the strings are
not valid UTF-8 strings, therefore sprintf fails with -1 because of the
encoding error. That's not a bug in glibc.
Whether it is a bug (perhaps in interpretation or in the ISO spec itself) is debatable.
But what glibc is doing is clear now.
Recall my problematic statement: printf("|%.*s|\n",15,s3) . Here, glibc must find out if the length of s3 is greater than 15 and, if so, truncate it. For computing this length it doesn't need to mess with encodings at all. But, if it must be truncated, glibc strives to be careful: if it just keeps the first 15 bytes, it could potentially break a multibyte character in half, and hence produce an invalid text output (I'd be ok with that - but glibc sticks to its curious ISO C99 interpretation).
So, it unfortunately needs to decode the char array, using the environment locale, to find out where the real character boundaries are. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so bad, because then printf returns -1; not so good, because it prints part of the string anyway, so it's difficult to recover cleanly).
Apparently only in this case, when a precision is specified for a string and there is possibility of truncation, glibc needs to mix some Unicode semantics with the plain-chars/bytes semantics. Quite ugly, IMO, but so it is.
Update: Notice that this behaviour is relevant not only for the case of invalid original encodings, but also for invalid codes after the truncation. For example:
char s[] = "ni\xc3\xb1o"; /* "niño" in UTF8: 5 bytes, 4 unicode chars */
printf("|%.3s|",s); /* would cut the double-byte UTF8 char in two */
This truncates the field to 2 bytes, not 3, because it refuses to output an invalid UTF-8 string:
$ ./a.out
|ni|
$ ./a.out | od -t cx1
0000000 | n i | \n
7c 6e 69 7c 0a
UPDATE (May 2015): This (IMO) questionable behaviour has been changed (fixed) in newer versions of glibc. See the main question.
To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ).
%ls is the specifier for a wide string according to POSIX.
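A hedged sketch of that suggestion (it assumes the current locale can decode the bytes; mbstowcs returns (size_t)-1 if it can't):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");
    const char s[] = "ni\xc3\xb1o";            /* UTF-8 "niño" */
    wchar_t ws[32];

    if (mbstowcs(ws, s, 32) != (size_t)-1)
        printf("|%6ls|\n", ws);                /* %ls with a field width, as suggested above */
    return 0;
}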
There is no "de-facto" standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as a UTF-8 file, but I would expect printf to be ignorant of multibyte encoding because it isn't defined in those terms.
Don't use mbstowcs unless you also make sure that wchar_t is at least 32 bits long; otherwise you'll likely end up with UTF-16, which has all the disadvantages of UTF-8 and all the disadvantages of UTF-32. I'm not saying avoid mbstowcs, I'm just saying don't let Windows programmers use it.
It might be simpler to use iconv to convert to UTF-32.

Resources