How can one use scanf to scan in an integer amount of characters and simply stuff them into an unsigned int without conversion?
Take an example, I have the following input characters (I have put them in hex for visibility):
5A 5F 03 00 FF FF 3D 2A
I want the first 4 (because 4 chars fit in an int). Read as a little-endian value (first byte least significant), this is equal to 221018 in base 10 (decimal). Great! That's what I want in my int. This seems to work as expected:
scanf("%s", &my_integer);
Somehow it seems to get the endianness right, placing the first character in the LSB of the int (why?). As you would expect however this produces a compiler warning as the pointer must be to a character array (man 3 scanf).
An alternate approach without using scanf():
for (int i = 0; i < 4; i++)
{
my_integer |= (getchar() << i * 8);
}
Note that I don't intend to do any conversion here; I simply wish to use the pointer type to specify how many characters to read. The same is true if my_integer were a long: I would read and store eight characters.
Simple really.
It appears my idea behind the use of scanf isn't correct and there must be a better approach.
How would you do it?
N.B. I'm aware type sizes are architecture dependent.
So you want to read 4 bytes from stdin and use them, as they are, as the in-memory representation of an int (with a 32-bit int on a little-endian machine, the first byte becomes the least significant one):
int my_integer;
if (fread (&my_integer, sizeof my_integer, 1, stdin) != 1) {
/* Some problem... */
}
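If the byte order should not depend on the host, a hedged alternative is to assemble the value one byte at a time. The sketch below assumes the first input byte is the least significant one (the interpretation that yields 221018 for 5A 5F 03 00); read_u32_le is a hypothetical helper name, not a standard function.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: reads 4 bytes from in and assembles them with the
   first byte as the least significant one.
   Returns 1 on success, 0 if EOF or an error occurs before 4 bytes arrive. */
int read_u32_le(FILE *in, uint32_t *out)
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        int c = getc(in);
        if (c == EOF)
            return 0;
        /* Widen to uint32_t before shifting so the shift never overflows int. */
        value |= (uint32_t)c << (8 * i);
    }
    *out = value;
    return 1;
}
```

Calling read_u32_le(stdin, &v) would replace both the scanf and the getchar loop, and it reports short input instead of silently mixing EOF into the result.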
From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits as defined in <limits.h>.
If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.
Test code
Given the following C code:
int main(int argc, char **argv) {
int length = 0;
while (argv[1][length] != '\0') {
// print the character, its hexa value, and its size
printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
length,
argv[1][length],
argv[1][length],
sizeof argv[1][length]);
length++;
}
printf("\nTotal length: %u\n", length);
printf("Actual char size: %u\n", CHAR_BIT);
return 0;
}
I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à.
Those chars are supposedly UTF-8, so written as multiple bytes each. I would expect them to get processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0) and when printing, I'd get one line per byte, and so 3 lines instead of 2 (which would be the actual latin character count).
Output
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
Question
What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why chars 5 and 6 have a hexadecimal value written on 4 bytes.
Is that what actually happens?
Is it standard behaviour?
Why are chars 5 and 6 not what was given as input?
CHAR_BIT == 8 and sizeof(achar) == 1, yet somechar == 0xFFFFFFE7. This seems counter-intuitive. What's happening?
Environment
Windows 10
Terminal: Alacritty and Windows default cmd (tried in both just in case)
GCC under Mingw-w64
No, it's not received as an array of int.
But it's not far from the truth: printf is indeed receiving the char as an int.
When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %X expects an unsigned int, so it's treating the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.
A simple fix:
printf("%X\n", (unsigned char)c); // 74 65 73 74 5F E7 E0
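As a minimal sketch of the same fix in reusable form, the function below formats one char as two hex digits. The name format_char_hex is hypothetical; the only technique it relies on is converting to unsigned char before the value is promoted.

```c
#include <stdio.h>

/* Hypothetical helper: writes the two-digit uppercase hex form of c into
   out, which must hold at least 3 bytes (two digits plus '\0'). */
void format_char_hex(char c, char out[3])
{
    /* The conversion to unsigned char happens before the argument is
       promoted, so the value passed is in 0..255 even where char is signed. */
    snprintf(out, 3, "%02X", (unsigned)(unsigned char)c);
}
```

With this, the byte that printed as 0xFFFFFFE7 in the question comes out as E7.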
But why did you get E7 and E0 in the first place?
Each Windows system call that deals with text has two versions:
An "ANSI" (A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
And a Wide (W) version that deals with text encoded using UTF-16le.
The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.
GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.
Finally, why did E7 and E0 show as τ and α?
The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.
Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)
You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.
[1] It's up to the compiler whether char is a signed or unsigned type.
[2] This can be changed on a per-program basis since Windows 10, Version 1903 (May 2019 Update).
From your code and the output on your system, it appears that:
type char has indeed 8 bits. Its size is 1 by definition. char **argv is a pointer to an array of pointers to C strings, null terminated arrays of char (8-bit bytes).
the char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
the 2 non ASCII characters çà are encoded as single bytes E7 and E0, which correspond to Microsoft's proprietary encoding, their code page Windows-1252, not UTF-8 as you assume.
The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.
Microsoft made historic decisions regarding the encoding of non ASCII characters that seem obsolete by today's standards, it is a shame they seem stuck with cumbersome choices other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.
UTF-8 was invented in September of 1992 by Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.
I tried the following code but couldn't get the desired output.
Result should be AB and it should come from single variable C
int main()
{
int a = 'A';
int b = 'B';
unsigned int C = a << 8 | b;
printf(" %c\n",C);
return 0;
}
%c will print a single character. If you want to print a string, you have to use %s and provide a pointer to this string. Strings in C have to be null-terminated, meaning they require one additional character after the text and this character carries the value \0 (zero).
You could do this in an int, but you'd have to understand some concepts first.
If you are using a computer with Intel architecture, integer variables larger than one byte will store data in reverse order in memory. This is called little-endianness.
So a number like 0x11223344 (hexadecimal) will be stored in memory as the sequence of bytes 44 33 22 11.
'A' is equivalent to the number 65, or 0x00000041, and if put in a 32-bit integer will be stored as 41 00 00 00.
When you do 'A' << 8 | 'B' you create the number 0x00004142, but in memory it is actually 42 41 00 00 (equivalent to the string "BA\0\0"). It's in the opposite order of what you're trying to do, but since it is technically null-terminated, it's a valid string.
You can print this using printf("%s", (char *)&C);
If you're on a big-endian architecture (for example SPARC, or an ARM system configured as big-endian), you will have to work out the null terminator, but I think I already gave you enough information to figure out what is going on for yourself.
%c takes an int argument and converts it to unsigned char, so you're printing only the low byte of C. That byte is 0x42, so the output is B whether your machine is little-endian or big-endian: the value itself is passed to printf, not the bytes stored at &C.
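A way to get AB out of the single variable without depending on memory layout is to pull the two bytes back out with shifts and masks. This is only a sketch; unpack_two_chars is a hypothetical name, and it assumes the value was built as hi << 8 | lo as in the question.

```c
/* Hypothetical helper: splits a value built as (hi << 8 | lo) into a
   two-character string. out must hold at least 3 bytes. */
void unpack_two_chars(unsigned int packed, char out[3])
{
    out[0] = (char)((packed >> 8) & 0xFFu); /* high byte, 'A' in the question */
    out[1] = (char)(packed & 0xFFu);        /* low byte, 'B' in the question */
    out[2] = '\0';
}
```

The resulting buffer can be printed with %s, and the order of the characters is the same on little-endian and big-endian machines because shifts operate on the value, not on its bytes in memory.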
I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
int val;
unsigned char c[sizeof(int)];
} data;
data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.f, &data.c[0], &data.c[sizeof(int)-1]);
for(int i = 0; i < sizeof(int); i++)
printf("%.2x", data.c[i]);
printf("\n");
Second:
for(int i = 0; i < 8*sizeof(int); i++) {
int j = 8 * sizeof(int) - 1 - i;
printf("%d", (val >> j) & 1);
}
printf("\n");
For the second approach, the outputs are 00000002 and 02000000. I also tried the other numbers and it seems that the bytes are swapped in the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Sometimes they store the most significant byte first, but on your platform it's the least significant.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels where there's a pointless war about which end of a boiled egg to start at. Which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes, it encounters them in the platform's byte order.
But because the >> is defined as operating on bits it is implemented to work 'logically' without regard to implementation.
It's right of C to not define the byte order because hardware not supporting the model C chose would be burdened with an overhead of shuffling bytes around endlessly and pointlessly.
There sadly isn't a built-in identifier telling you what the model is - though code that does can be found.
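One common way to find out at run time is to store a known value and look at its first byte. The function below is a minimal sketch; is_little_endian is a hypothetical name, and it only distinguishes little-endian from everything else.

```c
#include <stdint.h>

/* Returns 1 if the least significant byte is stored first (little-endian),
   0 otherwise (big-endian or a mixed layout). */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    /* Inspecting an object through unsigned char * is always permitted. */
    return *(const unsigned char *)&probe == 1;
}
```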
It will become relevant to you if (a) as above you want to breakdown integer types into bytes and manipulate them or (b) you receive files for other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Mark) in UTF-16 and UTF-32.
In fact a good reason (among many) for using UTF-8 is the problem goes away. Because each component is a single byte.
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers, and particularly of signed integers: sign-magnitude, two's complement, and ones' complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle along with tackling endian-ness we need to consider representation.
In principle. All modern computers use twos complement and extant machines that use anything else are very rare and unless you have a genuine requirement to support such platforms, I recommend assuming you're on a twos-complement system.
The correct hex representation as a string is 00000002, just as if you declared the integer with a hex literal:
int n = 0x00000002; //n=2
or as you would get when printing the integer as hex, like in:
printf("%08x", n);
But when printing the integer's bytes one after the other, you must also consider endianness, which is the byte order in multi-byte integers:
In a big-endian system (some UNIX systems use it) the 4 bytes will be ordered in memory as:
00 00 00 02
While in a little-endian system (most current systems) the bytes will be ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endian will print different results as they store integers in different ways.
The second prints the bits that make up the integer value most significant bit first. This result is independent of endian. The result is also independent of how the >> operator is implemented for signed ints as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C". Although there is a lot of ambiguity.
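A sketch of the second approach that avoids shifting a signed value and does not assume 8-bit bytes is shown below. format_bits is a hypothetical helper; the caller is expected to convert the int to unsigned first, for example format_bits((unsigned)val, buf).

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: writes the value bits of v, most significant first,
   as '0'/'1' characters into out and returns how many bits were written.
   out must hold at least sizeof(unsigned) * CHAR_BIT + 1 bytes. */
size_t format_bits(unsigned int v, char *out)
{
    size_t n = 0;
    /* UINT_MAX has every value bit set, so subtracting its right-shifted
       copy leaves only the most significant value bit. */
    for (unsigned int mask = UINT_MAX - (UINT_MAX >> 1); mask != 0; mask >>= 1)
        out[n++] = (v & mask) ? '1' : '0';
    out[n] = '\0';
    return n;
}
```

Because the mask is derived from UINT_MAX, the loop visits exactly the value bits of unsigned int and its output does not depend on byte order.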
It depends on your definition of "correct".
The first one will print the data exactly like it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done simpler by just aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers, in fact, accessing representations is a usecase for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (int i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in the order from the most significant byte to the least significant one. I wouldn't call it correct because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about Endianness on wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits, and even CHAR_BIT * sizeof(int) is wrong, because this would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look for example like this:
#include <stdio.h>
/* Number of value bits in a maximum value m of the form 2^n - 1 */
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
+ (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))
int main(void)
{
int x = 2;
for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
{
putchar((unsigned) x & mask ? '1' : '0');
}
puts("");
}
See this answer for an explanation of this strange macro.
I have been searching for this similar issue on internet but haven't found any good solution. I am in ubuntu and I am receiving serial hex data on serial port. I have connected my system using null modem cable to windows system and windows system is running docklight software which is sending hex data like below:
2E FF AA 4F CC 0D
Now, by saving the data I do not mean that I want to print it on the terminal. I just want to save this data as it is in a buffer so that I can process it later.
To read the data I am doing
res = read(fd, buf, sizeof buf);
//where buf is
unsigned char buf[255];
At this point after reading, buf contains some random chars. From some links on internet, I got a way to convert it into hex:
// two hex digits per byte, plus the terminating '\0' that sprintf writes
unsigned char hexData[2 * 255 + 1];
for (int i = 0; i < res; i++)
{
sprintf((char*)hexData+i*2, "%02X", buf[i]);
}
Now hexData contains 2EFFAA4FCC0D, which is OK for my purpose (I do not know whether this is the right way of doing it). Now let's say I want to convert the E from hexData into decimal. At this point, E will be treated as a character rather than a hex digit, so I may get 69 instead of 14.
How can I save hex data in a variable. Should I save it as array of chars or int. Thanks.
You already have data in binary form in buf
But if you still need to convert hex to decimal, you can use sscanf
sscanf((char *)&hexData[i], "%2X", &intVar); // intVar is an unsigned int
It will convert the hex byte formed by hexData[i] and hexData[i+1] to binary in intVar. When you printf intVar with %u you will see the decimal value.
You can store intVar as element of unsigned character array
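For single digits or pairs, the conversion can also be done by hand without sscanf. The helpers below are a hedged sketch with hypothetical names, and they assume a character set where a to f and A to F are contiguous (true for ASCII).

```c
/* Hypothetical helper: returns 0..15 for a hex digit, or -1 otherwise. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: combines two hex digits into a byte value 0..255,
   or returns -1 if either character is not a hex digit. */
int hex_pair_value(const char *pair)
{
    int hi = hex_digit_value(pair[0]);
    int lo = hex_digit_value(pair[1]);
    if (hi < 0 || lo < 0)
        return -1;
    return hi * 16 + lo;
}
```

With these, the character E maps to 14 rather than to its character code 69, and the pair 2E maps back to the original byte value 46 (0x2E).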
You may want to think about what you're trying to achieve.
Hexadecimal is just a representation. The byte you are receiving could be shown as hexadecimal pairs, as binary octet or as a series of (more or less) printable characters (what you see if you print your unsigned char array).
If what you need is storing only the hexadecimal representation of those bytes, convert them to hexadecimal as you are doing, but remember you'll need an array twice as big as your buffer (since a single byte will be represented by two hexadecimal characters once you convert it).
Usually, the best thing to do is to keep them as an array of bytes (chars), that is, your buf array, and only convert them when you need to show the hexadecimal representation.
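A minimal sketch of that approach: keep the received bytes in buf and produce the hex text only on demand. bytes_to_hex is a hypothetical helper name; the size check reflects the "twice as big" rule above, plus one byte for the terminator.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes 2 * len uppercase hex digits followed by '\0'
   into out. Returns 0 on success, -1 if out_size is too small. */
int bytes_to_hex(const unsigned char *in, size_t len, char *out, size_t out_size)
{
    if (out_size < 2 * len + 1)
        return -1;
    for (size_t i = 0; i < len; i++)
        snprintf(out + 2 * i, 3, "%02X", (unsigned)in[i]);
    out[2 * len] = '\0';
    return 0;
}
```

Processing (comparisons, checksums, parsing) can then work on the raw values in buf, where 0x2E is simply the number 46.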
I'm trying to convert a struct to a char array to send over the network. However, I get some weird output from the char array when I do.
#include <stdio.h>
struct x
{
int x;
} __attribute__((packed));
int main()
{
struct x a;
a.x=127;
char *b = (char *)&a;
int i;
for (i=0; i<4; i++)
printf("%02x ", b[i]);
printf("\n");
for (i=0; i<4; i++)
printf("%d ", b[i]);
printf("\n");
return 0;
}
Here is the output for various values of a.x (on an X86 using gcc):
127:
7f 00 00 00
127 0 0 0
128:
ffffff80 00 00 00
-128 0 0 0
255:
ffffffff 00 00 00
-1 0 0 0
256:
00 01 00 00
0 1 0 0
I understand the values for 127 and 256, but why do the numbers change when going to 128? Why wouldn't it just be:
80 00 00 00
128 0 0 0
Am I forgetting to do something in the conversion process or am I forgetting something about integer representation?
*Note: This is just a small test program. In a real program I have more in the struct, better variable names, and I convert to little-endian.
*Edit: formatting
What you see is the sign preserving conversion from char to int. The behavior results from the fact that on your system, char is signed (Note: char is not signed on all systems). That will lead to negative values if a bit-pattern yields to a negative value for a char. Promoting such a char to an int will preserve the sign and the int will be negative too. Note that even if you don't put a (int) explicitly, the compiler will automatically promote the character to an int when passing to printf. The solution is to convert your value to unsigned char first:
for (i=0; i<4; i++)
printf("%02x ", (unsigned char)b[i]);
Alternatively, you can use unsigned char* from the start on:
unsigned char *b = (unsigned char *)&a;
And then you don't need any cast at the time you print it with printf.
The x format specifier by itself says that the argument is an int, and since the number is negative, printf requires eight characters to show all four non-zero bytes of the int-sized value. The 0 modifier tells to pad the output with zeros, and the 2 modifier says that the minimum output should be two characters long. As far as I can tell, printf doesn't provide a way to specify a maximum width, except for strings.
Now then, you're only passing a char, so bare x tells the function to use the full int that got passed instead — due to default argument promotion for "..." parameters. Try the hh modifier to tell the function to treat the argument as just a char instead:
printf("%02hhx", b[i]);
On your platform char is a signed type, so with two's complement, 0x80 is -128 for an 8-bit integer (i.e. a byte).
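Treating your struct as if it were a char array is not portable. Reading an object's bytes through a char pointer is allowed, but what those bytes contain depends on the machine's type sizes, padding, and byte order. To send it over the network, use proper serialization instead. It's a pain in C++ and even more so in C, but it's the only way your app will work independently of the machines reading and writing.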
http://en.wikipedia.org/wiki/Serialization#C
Converting your structure to characters or bytes the way you're doing it, is going to lead to issues when you do try to make it network neutral. Why not address that problem now? There are a variety of different techniques you can use, all of which are likely to be more "portable" than what you're trying to do. For instance:
Sending numeric data across the network in a machine-neutral fashion has long been dealt with, in the POSIX/Unix world, via the functions htonl, htons, ntohl and ntohs. See, for example, the byteorder(3) manual page on a FreeBSD or Linux system.
Converting data to and from a completely neutral representation like JSON is also perfectly acceptable. The amount of time your programs spend converting the data between JSON and native forms is likely to pale in comparison to the network transmission latencies.
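As a hedged sketch of the htonl/ntohl route, the functions below convert one 32-bit value to and from four bytes in network order. The function names are hypothetical, and the header shown is the POSIX one.

```c
#include <arpa/inet.h> /* POSIX header providing htonl and ntohl */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores value in out as 4 bytes in network order. */
void encode_u32_network(uint32_t value, unsigned char out[4])
{
    uint32_t wire = htonl(value); /* host order to network (big-endian) order */
    memcpy(out, &wire, sizeof wire);
}

/* Hypothetical helper: reads 4 network-order bytes back into a host value. */
uint32_t decode_u32_network(const unsigned char in[4])
{
    uint32_t wire;
    memcpy(&wire, in, sizeof wire);
    return ntohl(wire); /* network order back to host order */
}
```

Each field of the struct would be encoded this way into the send buffer, rather than sending the struct's memory directly.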
char is a signed type here, so what you are seeing is the two's complement representation; casting to (unsigned char*) will fix that (Rowland just beat me).
On a side note you may want to change
for (i=0; i<4; i++) {
//...
}
to
for (i=0; i<sizeof a; i++) {
//...
}
The signedness of char array is not the root of the problem! (It is -a- problem, but not the only problem.)
Alignment! That's the key word here. That's why you should NEVER try to treat structs like raw memory. Compilers (and various optimization flags), operating systems, and phases of the moon all do strange and exciting things to the actual location in memory of "adjacent" fields in a structure. For example, if you have a struct with a char followed by an int, the whole struct will typically be EIGHT bytes in memory: the char, 3 blank, useless bytes, and then 4 bytes for the int. The compiler does this so that each field sits at an address the machine can access efficiently.
Take an introductory course to machine architecture at your local college. Meanwhile, serialize properly. Never treat structs like char arrays.
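The padding described above can be measured rather than guessed. The sketch below uses a hypothetical struct sample with a char followed by an int; the actual numbers depend on the compiler and target.

```c
#include <stddef.h>

/* Hypothetical example type: a char followed by an int. */
struct sample {
    char tag;
    int value;
};

/* Padding bytes the compiler placed between tag and value. */
size_t sample_padding_before_value(void)
{
    return offsetof(struct sample, value) - sizeof(char);
}

/* All padding bytes in the struct, including any after value. */
size_t sample_total_padding(void)
{
    return sizeof(struct sample) - (sizeof(char) + sizeof(int));
}
```

On a typical build with a 4-byte int both functions return 3, but nothing in the language guarantees that, which is the reason to serialize field by field.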
When you go to send it, just use:
(char*)&CustomPacket
to convert. Works for me.
You may want to convert to an unsigned char array.
Unless you have very convincing measurements showing that every octet is precious, don't do this. Use a readable ASCII protocol like SMTP, NNTP, or one of the many other fine Internet protocols codified by the IETF.
If you really must have a binary format, it's still not safe just to shove out the bytes in a struct, because the byte order, basic sizes, or alignment constraints may differ from host to host. You must design your wire protocol to use well-defined sizes and a well-defined byte order. For your implementation, either use functions like htonl(3) and ntohl(3) or use shifting and masking to put bytes into your stream. Whatever you do, make sure your code produces the same results on both big-endian and little-endian hosts.
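The shifting-and-masking option can be sketched as follows. The names put_u32_be and get_u32_be are hypothetical, and big-endian is chosen here only because it is the conventional network order.

```c
#include <stdint.h>

/* Hypothetical helper: writes v into out as 4 bytes, most significant first. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Hypothetical helper: rebuilds the value from 4 bytes, most significant first. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Because the shifts operate on values rather than on memory, this code gives the same byte stream on big-endian and little-endian hosts without needing any platform header.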