I have a program I made to test I/O from a terminal:
#include <stdio.h>
int main()
{
    char *input[100];
    scanf("%s", input);
    printf("%s", input);
    return 0;
}
It works as it should with ASCII characters, but it also works with Unicode characters and emoji.
Why is this?
Your code works because the input and output streams have the same encoding, and you do not do anything with the bytes stored in input.
Basically, you type something, which is converted into a sequence of bytes that are stored in input; then you send that same sequence of bytes back to stdout, which converts them back into readable characters.
As long as the encoding and decoding processes are compatible, you will get the "expected" result.
Now, what happens if you try to use standard "string" C functions? Let's assume you typed "♠Hello" in your terminal; you will get the expected output, but:
strlen(input) -> 8
input[0] -> some strange character
input[3] -> H
You see? Being able to store whatever you want in a char array does not mean you should. If you want to deal with extended character sets, use wchar_t instead.
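To see this concretely, here is a minimal sketch; the spade is spelled out as its UTF-8 byte sequence so the source file stays plain ASCII:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "♠" is U+2660, encoded in UTF-8 as the three bytes E2 99 A0 */
    const char *s = "\xE2\x99\xA0Hello";

    printf("%s\n", s);                  /* prints ♠Hello on a UTF-8 terminal */
    printf("strlen: %zu\n", strlen(s)); /* 8: 3 bytes for the spade + 5 for Hello */
    printf("s[3]: %c\n", s[3]);         /* 'H': the fourth byte, not the fourth character */
    return 0;
}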
You're probably running on Linux, with your terminal set to UTF-8, so scanf produces UTF-8 and printf can output it. UTF-8 is designed so that a char[] can store it. I explicitly say char[] and not char, because a non-ASCII character needs more than one byte.
Your program has undefined behavior.
scanf("%s", input);
expects a pointer to the first char of a buffer, but with
char *input[100];
input is an array of 100 pointers to char, which decays to a pointer to pointer to char (char **) when passed to scanf.
Your program may appear to work because the buffer you pass to scanf happens to be of sufficient size to store the Unicode characters, and the bytes you type don't contain a null byte in between them, but it may just as well not work, because the implementation of C on your (and any other) machine is allowed to do anything in cases of UB.
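A minimal corrected sketch, assuming an ordinary byte buffer and a field width to keep scanf within bounds:
#include <stdio.h>

int main(void)
{
    char input[100];      /* array of char, not array of pointers */

    scanf("%99s", input); /* field width leaves room for the terminating '\0' */
    printf("%s", input);
    return 0;
}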
Related
I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file.
[screenshot of debugger]
And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "Â§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
    mode = 1;
    FILE *fp;
    fp = fopen("input2.txt", "r");
    int charCount = 0;
    while (!feof(fp)) {
        original_message[charCount] = fgetc(fp);
        charCount++;
    }
    original_message[charCount - 1] = '\0';
    fclose(fp);
    k = strlen(original_message); // split the original message into k input symbols
    printf("k: \n%lld\n", k);
    printf("ASCII code:\n");
    for (int i = 0; i < k; i++)
    {
        ASCII = original_message[i];
        printf("%d ", ASCII);
    }
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc, and #include <locale.h> for setlocale. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character-set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wint_t c; /* fgetwc returns wint_t, so WEOF can be told apart from real characters */

    setlocale(LC_CTYPE, "");
    while ((c = fgetwc(stdin)) != WEOF)
        printf("%lc %d\n", c, (int)c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
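For illustration, here is a sketch of the reading loop in the all-wide style; the buffer size is an assumption, and the file name is carried over from the question:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wchar_t original_message[1000]; /* size is assumed; the question doesn't show it */
    int charCount = 0;
    wint_t wc;

    setlocale(LC_CTYPE, "");
    FILE *fp = fopen("input2.txt", "r");
    if (fp == NULL)
        return 1;
    while ((wc = fgetwc(fp)) != WEOF && charCount < 999)
        original_message[charCount++] = (wchar_t)wc;
    original_message[charCount] = L'\0';
    fclose(fp);

    printf("k: %zu\n", wcslen(original_message)); /* wcslen, not strlen */
    for (size_t i = 0; original_message[i] != L'\0'; i++)
        printf("%d ", (int)original_message[i]); /* Unicode code point values */
    printf("\n");
    return 0;
}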
As mentioned, I'm using C and fscanf for this task, but it seems to crash the program each time, so surely I did something wrong here. I haven't dealt much with this type of input reading, so even after reading several topics here I still can't find the right way. I have this array to read the 2 bytes:
char p[2];
and this line to read them. Of course fopen was called earlier with the file pointer fp; I used "rb" as the read mode but tried other options too when I noticed this was crashing. I'm just saving space and focusing on the trouble itself.
fscanf(fp,"%x%x",p[0],p[1]);
Later, to convert into decimal, I have this line (if it's not the EOF that we reached):
v = strtol(p, 0, 10);
Well, v is a mere integer to store the final value we are seeking. But the program keeps crashing when fscanf is called, or I think that's the case. I'm not compiling to console, so it's a pity that I can't output what has been done and what hasn't, but in the debugger it seems to crash there.
I hope you can help me out with this; I'm a bit lost regarding this type of read/conversion. Any clue will help me greatly, thanks =).
PS: I forgot to add that this is not homework. A friend wants to do some file conversion for a game, and this code will manipulate the needed files on its own, so while I could be using any language or environment for this, I always feel better in C.
char strings in C are really called null-terminated byte strings. The null-terminated part is important, as it means a string of two characters needs space for three, to include the null-terminator character '\0'. Not having the terminator means string functions will go out of bounds in their search for it, leading to undefined behavior.
Furthermore, the "%x" format is for reading a hexadecimal integer number and storing it in an unsigned int. Mismatching format specifiers and arguments leads to undefined behavior.
Lastly, and probably what's causing the crash: the scanf family of functions expects pointers as their arguments. Not providing pointers will again lead to undefined behavior.
There are two solutions to the above problems:
Going with code similar to what you already use, first of all you must make space for the terminator in the array. Then you need to read two characters. Lastly you need to add the terminator:
char p[3] = { 0 }; // String for two characters, initialized to zero.
                   // The initialization means that we don't need to explicitly add the terminator.

// Read two characters, skipping possible leading white-space
fscanf(fp, " %c%c", &p[0], &p[1]);

// Now convert the string to an integer value.
// The string is in base-16 (two hexadecimal characters).
v = strtol(p, 0, 16);
Read the hexadecimal value into an integer directly:
unsigned int v;
fscanf(fp, "%2x", &v); // Read as hexadecimal
The second alternative is what I would recommend. It reads two characters and parses it as a hexadecimal value, and stores the result into the variable v. It's important to note that the value in v is stored in binary! Hexadecimal, decimal or octal are just presentation formats, internally in the computer it will still be stored in binary ones and zeros (which is true for the first alternative as well). To print it as decimal use e.g.
printf("%d\n", v);
You need to pass to fscanf() the addresses of the variable(s) to scan into.
Also, the conversion specifier needs to suit the variable provided. In your case those are chars. "%x" expects a pointer to unsigned int; to scan into a char, use the appropriate length modifier, here h two times ("hh"):
fscanf(fp, "%hhx%hhx", &p[0], &p[1]);
strtol() expects a C-string as 1st parameter.
What you pass isn't a C-string, as a C-string ought to be 0-terminated, which p isn't.
To fix this you could do the following:
char p[3];
fscanf(fp, "%x%x", p[0], p[1]);
p[2] = '\0';
long v = strtol(p, 0, 10);
char c;
scanf("%s",&c);
Is this correct or wrong? What happens exactly when the EOF character is typed into stdin?
I know there is something like this on SO, but I cannot find it.
I already know the exact right form, which should be:
scanf("%hhd",&c);
No, this is not correct.
While it might appear that passing the address of the char (&c) would be equivalent, it is not the same as a "string."
Of course, C doesn't really have a string type; instead, we find an abstraction in the standard library's string and printf-style functions, which use the convention of a character pointer to a sequence of bytes with a null byte (null terminator) at the end.
If you tell scanf to read into a "%s" and then provide the address of a single char, there is room for only one byte. Even if the input is only a single byte, a null terminator will be appended to the end (an extra byte), automatically causing a buffer overflow.
It's clearly wrong to use char c; scanf("%s", &c). The format specifier "%s" reads in a sequence of non-white-space characters and appends a '\0' at the end (cf. cppreference on scanf). So it will most likely exceed the memory provided by a single char c (unless the user just presses return or Ctrl-D).
Use char c; scanf("%c", &c); instead.
This is wrong. You can't put a string in a char. This is undefined behavior and will probably result in memory stomping.
scanf("%c", &c);
is how you take a char from scanf.
Better yet, avoid scanf() entirely and use getchar():
int c = getchar();
Just be aware of the return type.
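A minimal sketch of why that return type matters: getchar returns int so that EOF can be told apart from every valid byte value:
#include <stdio.h>

int main(void)
{
    int c; /* int, not char: EOF must remain distinguishable from data bytes */

    while ((c = getchar()) != EOF) /* copy stdin to stdout byte by byte */
        putchar(c);
    return 0;
}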
I started learning about inputting character strings in C. In the following source code I get a character array of length 5.
#include <stdio.h>
int main(void)
{
    char s1[5];
    printf("enter text:\n");
    scanf("%s", s1);
    printf("\n%s\n", s1);
    return 0;
}
when the input is:
1234567891234567: I've checked that it works fine up to 16 elements (which I don't understand, because it is more than 5 elements).
12345678912345678: it gives me the error segmentation fault: 11 (I gave 17 elements in this case).
123456789123456789: the error is Illegal instruction: 4 (I gave 18 elements in this case).
I don't understand why there are different errors. Is this the behavior of scanf() or of character arrays in C? The book that I am reading didn't have a clear explanation of these things. FYI, I don't know anything about pointers. Any further explanation about this would be really helpful.
Is this the behavior of scanf() or character arrays in C?
TL;DR - No, you're facing the side-effects of undefined behavior.
To elaborate: in your case, with code like
scanf("%s",s1);
where you have defined
char s1[5];
inputting anything more than 4 chars will cause your program to venture into an invalid memory area (past the allocated memory), which in turn invokes undefined behavior.
Once you hit UB, the behavior of the program cannot be predicted or justified in any way. It can do absolutely anything possible (or even impossible).
There is nothing inherent in scanf() that stops you from reading an overly long input and overrunning the buffer; you should keep control of the input-string scanning by using the field width, like
scanf("%4s", s1); // 1 byte saved for the terminating null
The scanf function, when reading strings, reads up to the next white-space (e.g. newline, space, tab) or the end of file. It has no idea about the size of the buffer you provide it.
If the string you read is longer than the buffer provided, then it will write out of bounds, and you will have undefined behavior.
The simplest way to stop this is to provide a field length to the scanf format, as in
char s1[5];
scanf("%4s",s1);
Note that I use 4 as field length, as there needs to be space for the string terminator as well.
You can also use the "secure" scanf_s (an optional Annex K function, available e.g. with MSVC), for which you need to provide the buffer size as an argument:
char s1[5];
scanf_s("%s", s1, sizeof(s1));
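Either way, a minimal sketch of the fixed program, using the portable field-width form:
#include <stdio.h>

int main(void)
{
    char s1[5];

    printf("enter text:\n");
    if (scanf("%4s", s1) == 1) /* at most 4 chars are stored, plus the '\0' */
        printf("\n%s\n", s1);
    return 0;
}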
I am trying to convert a string of non-printable ASCII characters to binary. Here is the code:
#include <stdio.h>

int main(int argc, char *argv[])
{
    char str[32];
    sprintf(str, "\x01\x00\x02");
    printf("\n[%x][%x][%x]", str[0], str[1], str[2]);
    return 1;
}
I expect the output should be [1][0][2], but it prints [1][0][4].
What am I doing wrong here?
The sprintf operation ended at the first instance of \x00 in your string literal, because NUL (U+0000) terminates strings in C. (That the compiler does not complain when you write \x00 inside a string literal is arguably a misfeature of the language.) Thus str[2] accesses uninitialized memory and the program is entitled to print complete nonsense or even crash.
To do what you wanted to do, simply eliminate the sprintf:
#include <stdio.h>

int main(void)
{
    static const unsigned char str[32] =
        { 0x01, 0x00, 0x02 }; // will be zero-filled to declared size

    printf("[%02x][%02x][%02x]\n", str[0], str[1], str[2]);
    return 0;
}
(Binary data should always be stored in arrays of unsigned char, not plain char; or uint8_t if you have it. Because U+0000 terminates strings, I think it's better style to write embedded binary data using an array literal rather than a string literal; but it is more typing. The static const is just because the data is never modified and known at compile time; the program would work without it. Don't declare argc and argv if you're not going to use them. Return zero, not one, from main to indicate successful completion.)
(Using sprintf the way you were using it is a bad idea for other reasons: for instance, if your binary block contained \x25 (also known as % in ASCII), it would try to read additional arguments-to-be-formatted, and again print complete nonsense or crash. If you have a good reason to not just use static initialized data, the right way to copy blocks of binary data around is memcpy.)
C strings end with a null byte, so sprintf only reads until \x00. Instead, you can use memcpy (a sketch follows below) or simply initialize with
char str[32] = "\x01\x00\x02";
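For instance, a minimal memcpy sketch (the array names are illustrative):
#include <stdio.h>
#include <string.h>

int main(void)
{
    static const unsigned char data[] = { 0x01, 0x00, 0x02 };
    unsigned char buf[32] = { 0 };

    memcpy(buf, data, sizeof(data)); /* copies all three bytes, the embedded NUL included */
    printf("[%02x][%02x][%02x]\n", buf[0], buf[1], buf[2]);
    return 0;
}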
"\x00" terminates the format string which is the 2nd argument of the sprint() prematurely. Obviously that was unintentional but there is no ways sprint() can figure out that the first NUL is not the last NUL. So the format string it works on is actually shorter than what you intended to pass.