Merge ascii char using C [duplicate] - c

This question already has answers here:
How can I merge two ASCII characters? [closed]
(2 answers)
Closed 2 years ago.
I tried the following code but couldn't get the desired output.
Result should be AB and it should come from single variable C
int main()
{
int a = 'A';
int b = 'B';
unsigned int C = a << 8 | b;
printf(" %c\n",C);
return 0;
}

%c will print a single character. If you want to print a string, you have to use %s and provide a pointer to this string. Strings in C have to be null-terminated, meaning they require one additional character after the text and this character carries the value \0 (zero).
You could do this in an int, but you'd have to understand some concepts first.
If you are using a computer with Intel architecture, integer variables larger than one byte will store data in reverse order in memory. This is called little-endianness.
So a number like 0x11223344 (hexadecimal) will be stored in memory as the sequence of bytes 44 33 22 11.
'A' is equivalent to the number 65, or 0x00000041, and if put in a 32-bit integer will be stored as 41 00 00 00.
When you do 'A' << 8 | 'B' you create the number 0x00004142, but in memory on a little-endian machine it is actually 42 41 00 00 (equivalent to the string "BA\0\0"). It's in the opposite order of what you're trying to do, but since it is technically null-terminated, it's a valid string.
You can print this using printf("%s", (char *)&C);
If you're on a big-endian architecture (ARM, for example, can be configured either way), the bytes are stored as 00 00 41 42, so the string starts with a null byte and you will have to work out the null terminator, but I think I already gave you enough information to figure out what is going on for yourself.

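%c converts its int argument to unsigned char and prints that single character, so you only get the low byte of C, which is 0x42 ('B'), and the 'A' is dropped. This depends on the value, not on how the bytes are laid out in memory, so it is the same whether your machine is little-endian or big-endian.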

Related

What actually is the type of C `char **argv` on Windows

From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits as defined in <limits.h>.
If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.
Test code
Given the following C code:
int main(int argc, char **argv) {
int length = 0;
while (argv[1][length] != '\0') {
// print the character, its hexa value, and its size
printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
length,
argv[1][length],
argv[1][length],
sizeof argv[1][length]);
length++;
}
printf("\nTotal length: %u\n", length);
printf("Actual char size: %u\n", CHAR_BIT);
return 0;
}
I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à.
Those chars are supposedly UTF-8, so written as multiple bytes each. I would expect them to get processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0) and when printing, I'd get one line per byte, and so 3 lines instead of 2 (which would be the actual latin character count).
Output
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
Question
What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why chars 5 and 6 have a hexadecimal value written on 4 bytes.
Is that what actually happens?
Is it standard behaviour?
Why are chars 5 and 6 not what was given as input?
CHAR_BIT == 8 and sizeof(achar) == 1 and somechar = 0xFFFFFFE7. This seems counter-intuitive. What's happening?
Environment
Windows 10
Terminal: Alacritty and Windows default cmd (tried in both just in case)
GCC under Mingw-w64
No, it's not received as an array of int.
But it's not far from the truth: printf is indeed receiving the char as an int.
When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %X expects an unsigned int, so it's treating the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.
A simple fix:
printf("%X\n", (unsigned char)c); // 74 65 73 74 5F E7 E0
But why did you get E7 and E0 in the first place?
Each Windows system call that deals with text has two versions:
An "ANSI" (A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
And a Wide (W) version that deals with text encoded using UTF-16le.
The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.
GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.
Finally, why did E7 and E0 show as τ and α?
The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.
Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)
You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.
[1] It's up to the compiler whether char is a signed or unsigned type.
[2] This can be changed on a per-program basis since Windows 10, Version 1903 (May 2019 Update).
From your code and the output on your system, it appears that:
type char has indeed 8 bits. Its size is 1 by definition. char **argv is a pointer to an array of pointers to C strings, null terminated arrays of char (8-bit bytes).
the char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
the 2 non ASCII characters çà are encoded as single bytes E7 and E0, which correspond to Microsoft's proprietary encoding, their code page Windows-1252, not UTF-8 as you assume.
The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.
Microsoft made historic decisions regarding the encoding of non ASCII characters that seem obsolete by today's standards, it is a shame they seem stuck with cumbersome choices other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.
UTF-8 was invented in September of 1992 by Unix team leaders Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight as it had a number of interesting properties for compatibility with the C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.

Why the answer of "printf("%d", '0/');" is 12335? [duplicate]

This question already has answers here:
Multiple characters in a character constant
(3 answers)
Closed 8 years ago.
This is the c code:
int main(int argc, const char *argv[])
{
printf("%d", '0/');
return 0;
}
The output is 12335! Then I try to replace '0/' with '00' and '000', and the outputs change to 12336 and 3158064, while 12336=48*(1+2^8), 3158064=48*(1+2^8+2^16). However, I still don't know why. What happens when '0/' is transformed to an integer for output?
PS: My computer is MBP, and the operating system is OS X 10.9.5 (13F34). The compiler is Apple LLVM 6.0.
You have constructed a "multi-character literal". The behaviour is implementation-defined, but in your case, the integer value is constructed from the ASCII values (12335 == 48 * 256 + 47).
'0/' is a multi-character constant, which means it has an implementation-defined value. In your case, the ASCII value of the characters is 0x30 0x2f. These are combined into 0x302f, which equals 12335.
Because '0/' is a multi-character constant of type int. With your compiler, the value of the first character '0' is multiplied by 256 and then the value of the second character '/' is added to it. This produces the value that you see:
printf("%d %d %d", '0', '/', '0'*256+'/');
prints
48 47 12335
demo.
Note that this behavior is system-dependent. On other systems you could see 12080 instead of 12335.
See this answer for more information on multicharacter constants.

scanf characters into an int without conversion

How can one use scanf to scan in an integer amount of characters and simply stuff them into an unsigned int without conversion?
Take an example, I have the following input characters (I have put them in hex for visibility):
5A 5F 03 00 FF FF 3D 2A
I want the first 4 (because 4 char's fit in an int). In base 10 (decimal) this is equal to 221018 (little-endian, i.e. 0x00035F5A with the first byte least significant). Great! That's what I want in my int. This seems to work as expected:
scanf("%s", &my_integer);
Somehow it seems to get the endianness right, placing the first character in the LSB of the int (why?). As you would expect however this produces a compiler warning as the pointer must be to a character array (man 3 scanf).
An alternate approach without using scanf():
for (int i = 0; i < 4; i++)
{
my_integer |= (getchar() << i * 8);
}
Note that I don't intend to do any conversion here, I simply wish to use the pointer type to specify how many characters to read. The same is true if &my_integer was a long, I would read and store eight characters.
Simple really.
It appears my idea behind the use of scanf isn't correct and there must be a better approach.
How would you do it?
N.B. I'm aware type sizes are architecture dependent.
So you want to read 4 bytes from stdin and use them, as they are, as the in-memory representation of a 32-bit integer (its value then depends on the host's byte order):
int my_integer;
if (fread (&my_integer, sizeof my_integer, 1, stdin) != 1) {
/* Some problem... */
}

how do I determine if this is latin1 or utf8?

I have a string "Artîsté" in latin1 table. I use a C mysql connector to get the string out of the table. I have character_set_connection set to utf8.
In the debugger it looks like :
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?
\x74\xee\x73 isn't a valid UTF-8 sequence, since UTF-8 never has a run of only 1 byte with the top bit set. So of the two, it must be Latin-1.
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
Latin-1 does have some invalid bytes (the ASCII control characters 0x00-0x1F and the unused range 0x7f-0x9F), so there are some UTF-8 strings that you can be sure are not Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1, that rejecting all those code points is fairly futile except in the case where you're converting from another charset to Latin-1, and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
As you can see from the layout of a UTF-8 sequence, there are two main cases:
1st bit = 0 (same as ASCII): 1 byte per char, with value <= 0x7F
1st bit = 1: the byte is part of a multi-byte UTF-8 sequence, whose length is >= 2 bytes, each with value >= 0x80
This is ISO-8859 encoding:
41 72 74 *EE* 73 74 *E9*
It has only 2 stand-alone bytes with values >= 0x80.
Addendum: beware
Be careful! Even if you find a well-formed UTF-8 sequence, you cannot distinguish it from a bunch of ISO-8859 chars!

How to convert struct to char array in C

I'm trying to convert a struct to a char array to send over the network. However, I get some weird output from the char array when I do.
#include <stdio.h>
struct x
{
int x;
} __attribute__((packed));
int main()
{
struct x a;
a.x=127;
char *b = (char *)&a;
int i;
for (i=0; i<4; i++)
printf("%02x ", b[i]);
printf("\n");
for (i=0; i<4; i++)
printf("%d ", b[i]);
printf("\n");
return 0;
}
Here is the output for various values of a.x (on an X86 using gcc):
127:
7f 00 00 00
127 0 0 0
128:
ffffff80 00 00 00
-128 0 0 0
255:
ffffffff 00 00 00
-1 0 0 0
256:
00 01 00 00
0 1 0 0
I understand the values for 127 and 256, but why do the numbers change when going to 128? Why wouldn't it just be:
80 00 00 00
128 0 0 0
Am I forgetting to do something in the conversion process or am I forgetting something about integer representation?
*Note: This is just a small test program. In a real program I have more in the struct, better variable names, and I convert to little-endian.
*Edit: formatting
What you see is the sign-preserving conversion from char to int. The behavior results from the fact that on your system, char is signed (Note: char is not signed on all systems). That leads to negative values whenever a bit pattern represents a negative value for a char. Promoting such a char to an int will preserve the sign and the int will be negative too. Note that even if you don't put an (int) cast explicitly, the compiler will automatically promote the character to an int when passing it to printf. The solution is to convert your value to unsigned char first:
for (i=0; i<4; i++)
printf("%02x ", (unsigned char)b[i]);
Alternatively, you can use unsigned char* from the start on:
unsigned char *b = (unsigned char *)&a;
And then you don't need any cast at the time you print it with printf.
The x format specifier by itself says that the argument is an int, and since the number is negative, printf requires eight characters to show all four non-zero bytes of the int-sized value. The 0 modifier tells to pad the output with zeros, and the 2 modifier says that the minimum output should be two characters long. As far as I can tell, printf doesn't provide a way to specify a maximum width, except for strings.
Now then, you're only passing a char, so bare x tells the function to use the full int that got passed instead — due to default argument promotion for "..." parameters. Try the hh modifier to tell the function to treat the argument as just a char instead:
printf("%02hhx", b[i]);
char is a signed type on your platform; so with two's complement, 0x80 is -128 for an 8-bit integer (i.e. a byte)
Reading a struct's bytes through a char pointer is allowed by the language, but the layout you get (byte order, padding, type sizes) is specific to your compiler and machine. To send it over the network, use proper serialization instead. It's a pain in C++ and even more so in C, but it's the only way your app will work independently of the machines reading and writing.
http://en.wikipedia.org/wiki/Serialization#C
Converting your structure to characters or bytes the way you're doing it, is going to lead to issues when you do try to make it network neutral. Why not address that problem now? There are a variety of different techniques you can use, all of which are likely to be more "portable" than what you're trying to do. For instance:
Sending numeric data across the network in a machine-neutral fashion has long been dealt with, in the POSIX/Unix world, via the functions htonl, htons, ntohl and ntohs. See, for example, the byteorder(3) manual page on a FreeBSD or Linux system.
Converting data to and from a completely neutral representation like JSON is also perfectly acceptable. The amount of time your programs spend converting the data between JSON and native forms is likely to pale in comparison to the network transmission latencies.
char is a signed type so what you are seeing is the two's complement representation, casting to (unsigned char*) will fix that (Rowland just beat me).
On a side note you may want to change
for (i=0; i<4; i++) {
//...
}
to
for (i=0; i<sizeof(struct x); i++) {
//...
}
The signedness of char array is not the root of the problem! (It is -a- problem, but not the only problem.)
Alignment! That's the key word here. That's why you should NEVER try to treat structs like raw memory. Compilers (and various optimization flags), operating systems, and phases of the moon all do strange and exciting things to the actual location in memory of "adjacent" fields in a structure. For example, if you have a struct with a char followed by an int, the whole struct will typically be EIGHT bytes in memory -- the char, 3 unused padding bytes, and then 4 bytes for the int. The compiler does this so that each member sits at an address that is a multiple of its alignment requirement, which the hardware can access efficiently.
Take an introductory course to machine architecture at your local college. Meanwhile, serialize properly. Never treat structs like char arrays.
When you go to send it, just use:
(char*)&CustomPacket
to convert. Works for me.
You may want to convert to an unsigned char array.
Unless you have very convincing measurements showing that every octet is precious, don't do this. Use a readable ASCII protocol like SMTP, NNTP, or one of the many other fine Internet protocols codified by the IETF.
If you really must have a binary format, it's still not safe just to shove out the bytes in a struct, because the byte order, basic sizes, or alignment constraints may differ from host to host. You must design your wire protocol to use well-defined sizes and to use a well-defined byte order. For your implementation, either use macros like ntohl(3) or use shifting and masking to put bytes into your stream. Whatever you do, make sure your code produces the same results on both big-endian and little-endian hosts.
