I am trying to cast a byte stream (raw data from serial port) into a structure for ease of use. I have managed to replicate the problem in a minimal working example:
#include <stdio.h>

typedef struct {
    unsigned int source: 4;
    unsigned int destination: 4;
    char payload[15];
} packet;

int main(void)
{
    // machine 9 sends a message to machine 10 (A)
    char raw[20] = {0x9A, 'H', 'e', 'l', 'l', 'o', '!', 0};
    packet *message = (packet *)raw;

    printf("machine %d ", message->source);
    printf("says '%s' to ", message->payload);
    printf("machine %d.\n", message->destination);

    return 0;
}
I would expect the source field to get 9 from 0x9A and the destination field to get A (10), so that the output says:
machine 9 says 'Hello!' to machine 10.
But I get:
machine 10 says 'Hello!' to machine 9.
Any idea why this might be so?
I am trying to cast a byte stream (raw data from serial port) into a structure for ease of use.
char raw[20] = {0x9A, 'H', 'e', 'l', 'l', 'o', '!', 0};
packet *message = (packet *)raw;
This is poor code for several reasons.
Alignment: (packet *)raw risks undefined behavior when the alignment needs of the structure packet exceed the alignment needs of a char.
Size: The members .source and .destination might not be packed into 1 byte; many attributes of bit-fields are implementation-defined. The overall size of raw[] (20) may also differ from sizeof(packet).
Aliasing: The compiler can assume that changes to raw[] do not affect *message.
What should be done depends on the unposted larger code.
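If the goal is just to extract the two nibbles and the payload from the received bytes, a minimal sketch that sidesteps all three issues could look like this (assuming, as the question implies, that the high nibble of the first byte is the source and the low nibble the destination):

#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned source;
    unsigned destination;
    char payload[16];       /* 15 payload bytes plus a forced terminator */
} packet;

static void parse_packet(const unsigned char raw[16], packet *out)
{
    out->source      = raw[0] >> 4;     /* high nibble */
    out->destination = raw[0] & 0x0F;   /* low nibble */
    memcpy(out->payload, raw + 1, 15);  /* copy instead of aliasing the buffer */
    out->payload[15] = '\0';
}

int main(void)
{
    unsigned char raw[16] = {0x9A, 'H', 'e', 'l', 'l', 'o', '!', 0};
    packet message;

    parse_packet(raw, &message);
    printf("machine %u says '%s' to machine %u.\n",
           message.source, message.payload, message.destination);
    return 0;
}

This copies the data into a normally declared, normally aligned structure, so neither alignment, nor bit-field layout, nor aliasing rules come into play.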
0x9A: the lowest 4 bits are A, the highest 4 bits are 9.
In your structure, if you compile with GCC, the member source (occupying the lower nibble) is assigned A and destination (occupying the higher nibble) is assigned 9,
so the program output is correct.
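For what it's worth, on implementations that allocate bit-fields starting from the least significant bit (as GCC does on x86), swapping the two members makes the program print what was expected. This is still implementation-defined behaviour, not portable C:

typedef struct {
    unsigned int destination: 4;  /* low nibble of the first byte on such ABIs */
    unsigned int source: 4;       /* high nibble */
    char payload[15];
} packet;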
I am trying to cast a byte stream (raw data from serial port) into a structure for ease of use.
Interpreting a raw byte sequence in some random protocol as the representation of a structure type requires taking the target ABI into account for full details of structure layout, including bitfields, and perhaps also understanding and applying the C extensions made available by your compiler. Since it is ABI- and compiler-dependent, the result is usually non-portable.
I have managed to replicate the problem in a minimal working example: [...]
I would expect [...]
There is not much you can safely expect here without referring to the ABI. About all you can rely on is that
the source and destination bit-fields are packed into adjacent ranges of bits in the same "addressable storage unit" (ASU), and
the ASU containing them starts at the first byte of the overall structure.
But you cannot assume
anything about the size of the ASU containing the bit-fields (other than that it is no smaller than the smallest addressable unit of storage), or
the relative order of the bit-fields within it, or
if it is larger than 8 bits, which 8 bits of it are used to store the two bit-fields' representations; nor
whether the storage for payload starts with the next byte following the ASU, or
whether the last byte of the payload member is the last byte of the overall structure.
I get:
machine 10 says 'Hello!' to machine 9.
Any idea why this might be so?
The machine seems to do what you appear to have expected when accessing the char array via a packet *, though that behavior is in fact undefined. The result reveals that the implementation has chosen a 1-byte ASU for the two bitfields, without any padding between that and the payload, and that it lays out the bitfields starting at the least-significant end of the ASU. That is well within the bounds of the C implementation's discretion.
Related
Here is something weird I found:
When I have a char* s of three elements and assign it to hold "21",
1. the printed short int value of s appears to be 12594, which is 00110001 00110010 in binary, i.e. 49 and 50 as separate chars. But according to the ASCII chart, the value of '2' is 50 and '1' is 49.
2. when I shift it to the right with *(short*)s >>= 8, the result agrees with (1.), which is '1' or 49. But after I then assign *s = '1', the printed string of s still appears as "1", which I earlier thought would become "11".
I am kind of confused about how the bits are stored in a char now; I hope someone can explain this.
Following is the code I use:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%lu,%lu\n", sizeof(char), sizeof(short));
    char* s = malloc(sizeof(char) * 3);
    *s = '2', *(s+1) = '1', *(s+2) = '\0';
    printf("%s\n", s);
    printf("%d\n", *(short int*)s);
    *(short*)s >>= 8;
    printf("%s\n", s);
    printf("%d\n", *(short int*)s);
    *s = '1';
    printf("%s\n", s);
    return 0;
}
And the output is:
1,2
21
12594
1
49
1
This program is compiled on macOS with gcc.
You need some understanding of the concept of "endianness" here: values can be represented as "little endian" or "big endian".
I am going to skip the discussion of how legal this is and the undefined behaviour involved.
(Here is however a relevant link, provided by Lundin, credits:
What is the strict aliasing rule?)
But let's look at a pair of bytes in memory, of which the lower-addressed one contains 50 and the higher-addressed one contains 49:
50 49
You introduce them exactly this way, by explicitly setting the lower byte and the higher byte (via the char type).
Then you read them, forcing the compiler to treat the pair as a short, which is a two-byte type on your system.
Compilers and hardware can be created with different "opinions" on what is a good representation of a two-byte value in two consecutive bytes. This is called "endianness".
Two compilers, both of which are perfectly standard-conforming, can act like this: the short to be returned is either
the value from the lower address, multiplied by 256, plus the value from the higher address, or
the value from the higher address, multiplied by 256, plus the value from the lower address.
They do not literally do this arithmetic; it is a much more efficient mechanism implemented in hardware, but the point is that even the hardware implicitly does one or the other.
You are re-interpreting representations by aliasing types in a way that is not allowed by the standard: you can process a short value as if it were a char array, but not the opposite. Doing that can cause weird errors with optimizing compilers that could assume that the value has never been initialized, or could optimize out a full branch of code that contains Undefined Behaviour.
Then the answer to your question is called endianness. In a big endian representation, the most significant byte has the lowest address (258, or 0x102, is represented as the two bytes 0x01, 0x02 in that order), while in a little endian representation the least significant byte has the lowest address (0x102 is represented as 0x02, 0x01 in that order).
Your system happens to be a little endian one.
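If the point is to build a 16-bit value from the two bytes in a byte order you choose, rather than in whatever order the hardware happens to use, a small sketch with shifts avoids the pointer cast entirely (this is an illustration, not part of the original program):

#include <stdio.h>

int main(void)
{
    unsigned char s[3] = {'2', '1', '\0'};

    /* combine the two bytes explicitly, independent of the host's endianness */
    unsigned little = s[0] | (s[1] << 8);  /* low-address byte is least significant */
    unsigned big    = (s[0] << 8) | s[1];  /* low-address byte is most significant  */

    printf("little-endian view: %u, big-endian view: %u\n", little, big);
    return 0;
}

The first number, 12594, matches what your program printed, which is consistent with a little-endian short.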
I wrote a small program which reverses a string and prints it to screen:
#include <stdio.h>
#include <string.h>

void ReverseString(char *String)
{
    char *Begin = String;
    char *End = String + strlen(String) - 1;
    char TempChar = '\0';

    while (Begin < End)
    {
        TempChar = *Begin;
        *Begin = *End;
        *End = TempChar;
        Begin++;
        End--;
    }
    printf("%s", String);
}
It works perfectly in Dev C++ on Windows (little endian).
But I have a sudden doubt about its correctness. If you look at this line:
while (Begin < End)
I am comparing the addresses of the beginning and the end. Is this the correct way?
Does this code work on a big-endian OS like Mac OS X?
Or am I thinking the wrong way?
I have got several doubts which I mentioned above.
Can anyone please clarify?
Your code has no endianness-related issues. There's also nothing wrong with the way you're comparing the two pointers. In short, your code's fine.
Endianness is defined as the order of significance of the bytes in a multi-byte primitive type. So if your int is big-endian, that means the first byte (i.e. the one with the lowest address) of an int in memory contains the most significant bits of the int, and so on to the last/least significant. That's all it means. When we say a system is big-endian, that generally means that all of its pointer and arithmetic types are big-endian, although there are some odd special cases out there. Endian-ness doesn't affect pointer arithmetic or comparison, or the order in which strings are stored in memory.
Your code does not use any multi-byte primitive types[*], so endian-ness is irrelevant. In general, endian-ness only becomes relevant if you somehow access the individual bytes of such an object (for example by casting a pointer to unsigned char*, writing the memory to a file or over the network, and the like).
Supposing a caller did something like this:
int x = 0x00010203; // assuming sizeof(int) == 4 and CHAR_BIT == 8
ReverseString((char *)&x);
Then their code would be endian-dependent. On a big-endian system, they would pass you an empty string, since the first byte would be 0, so your code would leave x unchanged. On a little-endian system they would pass you a three-byte string, since the first three bytes would be 0x03, 0x02, 0x01 and the fourth byte 0, so your code would change x to 0x00030201.
[*] well, the pointers are multi-byte, on OSX and on pretty much every C implementation. But you don't inspect their storage representations, you just use them as values, so there's no opportunity for behavior to differ according to endianness.
As far as I know, endianness does not affect a char *, as each character is a single byte and the string is just an array of such bytes. Have a look at http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?ca=drs-
The effect will be seen in multi-byte data types like int.
As long as you manipulate whole type T objects (which is what you do with type T being char) you just can't run into endianness problems.
You could run into them if you for example tried to manipulate separate bytes within a larger type (an int for example) but you don't do anything like that. This is why endianness problems are impossible in your code, period.
I'm fighting with socket programming now and I've encountered a problem, which I don't know how to solve in a portable way.
The task is simple: I need to send an array of 16 bytes over the network, receive it in a client application and parse it. I know there are functions like htonl, htons and so on to use with uint16 and uint32, but what should I do with chunks of data larger than that?
Thank you.
You say an array of 16 bytes. That doesn't really help. Endianness only matters for things larger than a byte.
If it's really raw bytes, then just send them; you will receive them just the same.
If it's really a struct that you want to send, e.g.
struct msg
{
    int foo;
    int bar;
    .....
then you need to work through the buffer, pulling out the values you want.
When you send, you must assemble the packet in a standard order:
int off = 0;
*(int*)&buff[off] = htonl(foo);
off += sizeof(int);
*(int*)&buff[off] = htonl(bar);
...
When you receive:
int foo = ntohl(*(int*)&buff[off]);
off += sizeof(int);
int bar = ntohl(*(int*)&buff[off]);
....
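A slightly more defensive version of the same idea, using memcpy so that the buffer never has to be aligned for int and no pointer casting is involved (the helper names here are illustrative, not from the question):

#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* sending side: append one 32-bit value in network byte order */
size_t put_u32(unsigned char *buff, size_t off, uint32_t value)
{
    uint32_t net = htonl(value);
    memcpy(buff + off, &net, sizeof net);
    return off + sizeof net;
}

/* receiving side: read one 32-bit value back into host order */
size_t get_u32(const unsigned char *buff, size_t off, uint32_t *value)
{
    uint32_t net;
    memcpy(&net, buff + off, sizeof net);
    *value = ntohl(net);
    return off + sizeof net;
}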
EDIT: I see you want to send an IPv6 address; those are always in network byte order, so you can just stream it raw.
Endianness is a property of multibyte variables such as 16-bit and 32-bit integers. It has to do with whether the high-order or low-order byte goes first. If the client application processes the array as individual bytes, it doesn't have to worry about endianness, since single bytes have no byte order.
htons, htonl, etc., are for dealing with a single data item (e.g. an int) that's larger than one byte. An array of bytes where each one is used as a single data item itself (e.g., a string) doesn't need to be translated between host and network byte order at all.
Bytes themselves don't have endianness, in the sense that any single byte transmitted by one computer will have the same value on a different receiving computer. These days, endianness is only relevant to multibyte data types such as ints.
In your particular case it boils down to knowing what the receiver will do with your 16 bytes. If it will treat each of the 16 entries in the array as a discrete single-byte value, then you can just send them without worrying about endianness. If, on the other hand, the receiver will treat your 16-byte array as four 32-bit integers, then you'll need to run each integer through htonl() prior to sending, as sketched below.
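For the four-integer case, a hedged sketch of the sending side might look like this (the function and variable names are made up for illustration):

#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>

/* pack four host-order 32-bit values into a 16-byte wire buffer */
void pack_four(unsigned char wire[16], const uint32_t values[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t net = htonl(values[i]);         /* host -> network byte order */
        memcpy(wire + 4 * i, &net, sizeof net);  /* byte-wise copy, no alignment worries */
    }
}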
Does that help?
If I have an int32 type integer in an 8-bit processor's memory, say an 8051, how can I identify the endianness of that integer? Is it compiler-specific? I think this is important when sending multibyte data over serial lines etc.
With an 8 bit microcontroller that has no native support for wider integers, the endianness of integers stored in memory is indeed up to the compiler writer.
The SDCC compiler, which is widely used on 8051, stores integers in little-endian format (the user guide for that compiler claims that it is more efficient on that architecture, due to the presence of an instruction for incrementing a data pointer but not one for decrementing).
If the processor has any operations that act on multi-byte values, or has any multi-byte registers, it has the possibility of having an endianness.
http://69.41.174.64/forum/printable.phtml?id=14233&thread=14207 suggests that the 8051 mixes different endianness in different places.
The endianness is specific to the CPU architecture. Since a compiler needs to target a particular CPU, the compiler would have knowledge of the endianness as well. So if you need to send data over a serial connection, network, etc. you may wish to use built-in functions to put data in network byte order, especially if your code needs to support multiple architectures.
For more information, see: http://www.gnu.org/s/libc/manual/html_node/Byte-Order.html
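On a bare 8-bit target you may not have htonl() available at all; a common portable approach is to define the wire order yourself with shifts, which works regardless of how the compiler stores the integer in memory. A sketch, with a made-up function name:

#include <stdint.h>

/* write a 32-bit value as four bytes, most significant first ("network order") */
void u32_to_be_bytes(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)(value);
}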
It's not just up to the compiler - the '51 has some native 16-bit registers (DPTR and PC in the standard core, ADC_IN, DAC_OUT and such in variants) of a given endianness, which the compiler has to obey - but outside of that, the compiler is free to use any endianness it prefers, or one you choose in the project configuration...
An integer does not carry its endianness with it. You can't determine just from looking at the bytes whether it's big- or little-endian; you just have to know. For example, if your 8-bit processor is little-endian and you're receiving a message that you know to be big-endian (because, for example, the fieldbus system defines big-endian), you have to convert values of more than 8 bits. You'll need to either hard-code that or have some definition on the system of which bytes to swap.
Note that swapping bytes is the easy part. You may also have to swap bits in bit-fields, since the order of bits in bit-fields is compiler-specific. Again, you basically have to know this at build time.
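As a sketch of the receive direction, assuming the bus defines big-endian as in the example above, the value can be rebuilt with shifts so that no knowledge of the host's own byte order is needed (the function name is made up):

#include <stdint.h>

/* rebuild a 32-bit value from four bytes sent most-significant-first */
uint32_t be_bytes_to_u32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
            (uint32_t)in[3];
}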
unsigned long int x = 1;
unsigned char *px = (unsigned char *) &x;

printf("%s\n", *px == 0 ? "big endian" : "little endian");
If x is assigned the value 1 then the value 1 will be in the least significant byte.
If we then cast x to be a pointer to bytes, the pointer will point to the lowest memory location of x. If that memory location is 0 it is big endian, otherwise it is little endian.
#include <stdio.h>

union foo {
    int as_int;
    char as_bytes[sizeof(int)];
};

int main() {
    union foo data;
    int i;

    for (i = 0; i < sizeof(int); ++i) {
        data.as_bytes[i] = 1 + i;
    }
    printf("%0x\n", data.as_int);
    return 0;
}
Interpreting the output is up to you.
I am trying to write a server that will communicate with any standard client that can make socket connections (e.g. a telnet client).
It started out as an echo server, which of course did not need to worry about network byte ordering.
I am familiar with the ntohs, ntohl, htons, htonl functions. These would be great by themselves if I were transferring either 16- or 32-bit ints, or if the characters in the string being sent were multiples of 2 or 4 bytes.
I'd like to create a function that operates on strings, such as:
str_ntoh(char* net_str, char* host_str, int len)
{
    uint32_t* netp, hostp;
    netp = (uint32_t*)&net_str;
    for(i=0; i < len/4; i++){
        hostp[i] = ntoh(netp[i]);
    }
}
Or something similar. The above assumes that the word size is 32 bits. We can't be sure that the word size on the sending machine isn't 16 bits or 64 bits, right?
Client programs, such as telnet, must be using hton* before they send and ntoh* after they receive data, correct?
EDIT: For the people who think that because a char is one byte, endianness doesn't matter:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x01020304;
    char* c = (char*)&a;
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
Run this snippet of code. The output for me is as follows:
$ ./a.out
4 3 2 1
Those on PowerPC chipsets should get '1 2 3 4', but those of us on Intel chipsets should see what I got above, for the most part.
Maybe I'm missing something here, but are you sending strings, that is, sequences of characters? Then you don't need to worry about byte order. That is only for the bit pattern in integers. The characters in a string are always in the "right" order.
EDIT:
Derrick, to address your code example, I've run the following (slightly expanded) version of your program on an Intel i7 (little-endian) and on an old Sun SPARC (big-endian):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x01020304;
    char* c = (char*)&a;
    char d[] = { 1, 2, 3, 4 };

    printf("The integer: %x %x %x %x\n", c[0], c[1], c[2], c[3]);
    printf("The string: %x %x %x %x\n", d[0], d[1], d[2], d[3]);
    return 0;
}
As you can see, I've added a real char array to your print-out of an integer.
The output from the little-endian Intel i7:
The integer: 4 3 2 1
The string: 1 2 3 4
And the output from the big-endian Sun:
The integer: 1 2 3 4
The string: 1 2 3 4
Your multi-byte integer is indeed stored in different byte order on the two machines, but the characters in the char array have the same order.
With your function signature as posted you don't have to worry about byte order. It accepts a char *, which can only handle 8-bit characters. With one byte per character, you cannot have a byte order problem.
You'd only run into a byte order problem if you sent Unicode in a UTF-16 or UTF-32 encoding and the endianness of the sending machine didn't match that of the receiving machine. The simple solution is to use the UTF-8 encoding, which is what most text is sent as across networks; being byte-oriented, it doesn't have a byte order issue. Or you could send a BOM.
If you'd like to send the data in an 8-bit encoding (the fact that you're using char implies this is what you want), there's no need to byte-swap. However, for the unrelated issue of non-ASCII characters, so that the same character > 127 appears the same on both ends of the connection, I would suggest sending the data in something like UTF-8, which can represent all Unicode characters and can be safely treated as ASCII strings. The way to get UTF-8 text from the default encoding varies by the platform and the set of libraries you're using.
If you're sending a 16-bit or 32-bit encoding... you can include one character holding the byte order mark, which the other end can use to determine the endianness of the data. Or, you can assume network byte order and use htons() or htonl() as you suggest. But if you'd like to use char, please see the previous paragraph. :-)
It seems to me that the function prototype doesn't match its behavior. You're passing in a char *, but you're then casting it to uint32_t *. And, looking more closely, you're casting the address of the pointer, rather than the contents, so I'm concerned that you'll get unexpected results. Perhaps the following would work better:
void arr_ntoh(uint32_t* netp, uint32_t* hostp, int len)
{
    int i;
    for (i = 0; i < len; i++)
        hostp[i] = ntohl(netp[i]);
}
I'm basing this on the assumption that what you've really got is an array of uint32_t and you want to run ntohl() on all of them.
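A hypothetical usage sketch (the buffer and function names are made up): copy the received bytes into a properly aligned uint32_t array before converting, so the original char buffer is never reinterpreted in place:

#include <stdint.h>
#include <string.h>

void handle_message(const unsigned char *recv_buf)  /* raw bytes from the socket */
{
    uint32_t net_vals[4], host_vals[4];
    memcpy(net_vals, recv_buf, sizeof net_vals);    /* align before converting */
    arr_ntoh(net_vals, host_vals, 4);               /* arr_ntoh as defined above */
    /* ... use host_vals ... */
}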
I hope this is helpful.