Converting struct into short int array - c

Hello, I have the following structure:
struct TestStruct {
    unsigned char a;
    unsigned char b;
    unsigned char c;
    unsigned char d;
};

struct TestStruct test;
test.a = 0x01;
test.b = 0x02;
test.c = 0x01;
test.d = 0x02;

unsigned short int *ptr = (unsigned short int *)&test;
printf("%04x\n", *ptr++);
printf("%04x\n", *ptr++);
I expect to get the values 0x0102, but I actually get 0x0201. How can I fix this without reordering the fields in the struct? I want to keep this ordering because I am creating an IP header from scratch (for learning purposes), and for better readability I want the fields in the same order as in the RFC documentation.
Thanks in advance.

In computers, there is a concept of endianness. In short, when storing a multi-byte field, you must choose between storing the most significant byte first (big-endian) or the least significant byte first (little-endian). RFC documents often call this the byte order.
If you are implementing code that communicates across endianness boundaries, you need to be aware of which format values arrive in. On glibc systems, the header byteswap.h provides efficient byte-swapping routines. Consider the following example program:
#include <stdio.h>
#include <byteswap.h>

int main(void) {
    unsigned int x = 0x01020304;
    unsigned char *arr = (unsigned char *)&x;

    printf("int: %08x\n", x);
    printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);

    x = __bswap_32(x);
    printf("swapped\n");
    printf("int: %08x\n", x);
    printf("raw: %02x %02x %02x %02x\n", arr[0], arr[1], arr[2], arr[3]);
}
On my computer, it outputs:
int: 01020304
raw: 04 03 02 01
swapped
int: 04030201
raw: 01 02 03 04
This shows that my computer is little endian. For the integer 0x01020304, it stores the byte 0x04 at the lowest memory address.
For network usage specifically, Linux provides functions that convert between network and host byte order. These have the benefit of already 'knowing' what your host's internal order is, and they handle the conversion for you. For example, here's an old snippet I wrote that parses the headers of ARP packets:
recvfrom(socket->fd, buffer, ETHER_FRAME_MAX_SIZE, 0, NULL, NULL);
frame->type = ntohs(frame->type);
frame->htype = ntohs(frame->htype);
frame->ptype = ntohs(frame->ptype);
frame->oper = ntohs(frame->oper);
This snippet converts the shorts in the struct into the correct host byte order, using ntohs (short for network-to-host-short), provided by arpa/inet.h.
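For the original question of building an IP header, the conversion goes the other way: htons/htonl convert from host to network (big-endian) order when filling in fields. Here is a minimal sketch of the idea; the struct and field names are made up for illustration, not taken from any real header:
#include <arpa/inet.h> /* htons */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical two-field header, kept in network byte order */
struct toy_header {
    uint16_t id;
    uint16_t length;
};

int main(void) {
    struct toy_header h;
    h.id = htons(0x0102);     /* stored as bytes 01 02 on any host */
    h.length = htons(0x0304); /* stored as bytes 03 04 on any host */

    unsigned char *raw = (unsigned char *)&h;
    printf("%02x %02x %02x %02x\n", raw[0], raw[1], raw[2], raw[3]);
    return 0;
}
This prints 01 02 03 04 regardless of the host's endianness.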

Your implementation assumes that your machine is big-endian, which is usually not true on modern machines.
Big-endian machines store multi-byte values with the least significant byte at the highest address and the most significant byte at the lowest address, while little-endian machines (which tend to be more common these days) do the exact opposite, storing the least significant byte at the lowest address and the most significant byte at the highest address. For instance, this is how each architecture would represent the 4-byte value 0x01020304 if it were stored at memory addresses 0x10-0x13:
Endianness   Byte 0x10   Byte 0x11   Byte 0x12   Byte 0x13
Big          0x01        0x02        0x03        0x04
Little       0x04        0x03        0x02        0x01
The C standard forces your compiler to place the elements of your struct in the order in which they are defined, so when you fill the struct and then use type punning to interpret that memory location as a 2-byte int instead of (effectively) an array of 1-byte ints, the computer will decide which byte is most significant and which is least significant based on its own endianness.
To manually force the computer to recognize a multi-byte value as the endianness you expect, you need to use bit-shifting to move each byte into its proper place, for instance, using your struct as an example:
unsigned short fixedEndianness = ((unsigned short)test.a << 8) | (unsigned short)test.b;
...which will work on any architecture.
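Applied to the struct from the question, a quick sketch of how both 16-bit values could be assembled portably (reusing the test variable defined above):
unsigned short first = ((unsigned short)test.a << 8) | test.b;  /* 0x0102 */
unsigned short second = ((unsigned short)test.c << 8) | test.d; /* 0x0102 */
printf("%04x\n%04x\n", first, second);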

Related

Does memcpy copy bytes in reverse order?

I am a little bit confused about the usage of memcpy. I thought memcpy could be used to copy chunks of binary data to an address we desire. I was trying to implement a small piece of logic to directly convert 2 bytes of hex to a 16-bit signed integer without using a union.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main()
{
    uint8_t message[2] = {0xfd, 0x58};
    // int16_t roll = message[0]<<8;
    // roll |= message[1];
    int16_t roll = 0;
    memcpy((void *)&roll, (void *)&message, 2);
    printf("%x", roll);
    return 0;
}
This prints 58fd instead of fd58.
No, memcpy did not reverse the bytes as it copied them. That would be a strange and wrong thing for memcpy to do.
The reason the bytes seem to be in the "wrong" order in the program you wrote is that that's the order they're actually in! There's probably a canonical answer on this somewhere, but here's what you need to understand about byte order, or "endianness".
When you declare a string, it's laid out in memory just about exactly as you expect. Suppose I write this little code fragment:
#include <stdio.h>
char string[] = "Hello";
printf("address of string: %p\n", (void *)&string);
printf("address of 1st char: %p\n", (void *)&string[0]);
printf("address of 5th char: %p\n", (void *)&string[4]);
If I compile and run it, I get something like this:
address of string: 0xe90a49c2
address of 1st char: 0xe90a49c2
address of 5th char: 0xe90a49c6
This tells me that the bytes of the string are laid out in memory like this:
0xe90a49c2 H
0xe90a49c3 e
0xe90a49c4 l
0xe90a49c5 l
0xe90a49c6 o
0xe90a49c7 \0
Here I've shown the string vertically, but if we laid it out horizontally, with addresses increasing from left to right, we would see the characters of the string "Hello" laid out from left to right also, just as we would expect.
That's for strings, which are arrays of char. But integers of various sizes are not really built out of characters, and it turns out that the individual bytes of an integer are not necessarily laid out in memory in "left-to-right" order as we might expect. In fact, on the vast majority of machines today, the bytes within an integer are laid out in the opposite order. Let's take a closer look at how that works.
Suppose I write this code:
int16_t i2 = 0x1234;
printf("address of short: %p\n", (void *)&i2);
unsigned char *p = (unsigned char *)&i2;
printf("%p: %02x\n", p, *p);
p++;
printf("%p: %02x\n", p, *p);
This initializes a 16-bit (or "short") integer to the hex value 0x1234, and then uses a pointer to print the two bytes of the integer in "left-to-right" order, that is, with the lower-addressed byte first, followed by the higher-addressed byte.
On my machine, the result is something like:
address of short: 0xe68c99c8
0xe68c99c8: 34
0xe68c99c9: 12
You can clearly see that the byte that's stored at the "front" of the two-byte region in memory is 34, followed by 12. The least-significant byte is stored first. This is referred to as "little endian" byte order, because the "little end" of the integer — its least-significant byte, or LSB — comes first.
Larger integers work the same way:
int32_t i4 = 0x5678abcd;
printf("address of long: %p\n", (void *)&i4);
p = (unsigned char *)&i4;
printf("%p: %02x\n", p, *p);
p++;
printf("%p: %02x\n", p, *p);
p++;
printf("%p: %02x\n", p, *p);
p++;
printf("%p: %02x\n", p, *p);
This prints:
address of long: 0xe68c99bc
0xe68c99bc: cd
0xe68c99bd: ab
0xe68c99be: 78
0xe68c99bf: 56
There are machines that lay the bytes out in the other order, with the most-significant byte (MSB) first. Those are called "big endian" machines, but for reasons I won't go into they're not as popular.
How do you construct an integer value out of individual bytes if you don't know your machine's byte order? The best way is to do it "mathematically", based on the properties of the numbers. For example, let's go back to your original array of bytes:
uint8_t message[2] = {0xfd, 0x58};
Now, you know, because you wrote it, that 0xfd is supposed to be the MSB and 0x58 is supposed to be the LSB. So one good way of combining them together into an integer is like this:
int16_t roll = message[0] << 8; /* MSB */
roll |= message[1]; /* LSB */
The nice thing about this code is that it works correctly on machines of either endianness. I called this technique "mathematical" because it's equivalent to doing it this other way:
int16_t roll = message[0] * 256; /* MSB */
roll += message[1]; /* LSB */
And, in fact, this suggestion of mine involving roll = message[0] << 8 is very close to something you already tried, but had commented out in the code you posted. The difference is that you don't want to think about it in terms of two bytes next to each other in memory; you want to think about it in terms of the most- and least-significant byte. When you say << 8, you're obviously thinking about the most-significant byte, so that should be message[0].
Does memcpy copy bytes in reverse order?
memcpy does not reverse the order of the bytes.
This prints 58fd instead of fd58
Yes, your computer is little endian, so bytes 0xfd,0x58 in order are interpreted by your computer as the value 0x58fd.
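For completeness, the same "mathematical" technique works in reverse when splitting a 16-bit value back into MSB-first bytes; a minimal sketch:
uint16_t value = 0xfd58;
uint8_t out[2];
out[0] = (uint8_t)(value >> 8);   /* MSB: 0xfd */
out[1] = (uint8_t)(value & 0xff); /* LSB: 0x58 */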

C, Little and Big Endian confusion

I am trying to understand memory byte order in C programming, but I'm confused.
I tested my app against some values on this site to verify my output: www.yolinux.com/TUTORIALS/Endian-Byte-Order.html
For the 64-bit value, I use this in my C program:
volatile long long ll = (long long)1099511892096;
__mingw_printf("\tlong long, %u Bytes, %u bits,\t%lld to %lli, %lli, 0x%016llX\n", sizeof(long long), sizeof(long long)*8, LLONG_MIN, LLONG_MAX , ll, ll);
void printBits(size_t const size, void const * const ptr)
{
    unsigned char *b = (unsigned char *)ptr;
    unsigned char byte;
    int i, j;

    printf("\t");
    for (i = size - 1; i >= 0; i--)
    {
        for (j = 7; j >= 0; j--)
        {
            byte = b[i] & (1 << j);
            byte >>= j;
            printf("%u", byte);
        }
        printf(" ");
    }
    puts("");
}
Out
long long, 8 Bytes, 64 bits, -9223372036854775808 to 9223372036854775807, 1099511892096, 0x0000010000040880
80 08 04 00 00 01 00 00 (Little-Endian)
10000000 00001000 00000100 00000000 00000000 00000001 00000000 00000000
00 00 01 00 00 04 08 80 (Big-Endian)
00000000 00000000 00000001 00000000 00000000 00000100 00001000 10000000
Tests
0x8008040000010000, 1000000000001000000001000000000000000000000000010000000000000000 // online website hex2bin conv.
1000000000001000000001000000000000000000000000010000000000000000 // my C app
0x8008040000010000, 1000010000001000000001000000000000000100000000010000000000000000 // yolinux.com
0x0000010000040880, 0000000000000000000000010000000000000000000001000000100010000000 //online website hex2bin conv., 1099511892096 ! OK
0000000000000000000000010000000000000000000001000000100010000000 // my C app, 1099511892096 ! OK
[Convert]::ToInt64("0000000000000000000000010000000000000000000001000000100010000000", 2) // using powershell for other verif., 1099511892096 ! OK
0x0000010000040880, 0000000000000000000000010000010000000000000001000000100010000100 // yolinux.com, 1116691761284 (from powershell bin conv.) ! BAD !
Problem
The yolinux.com website announces 0x0000010000040880 for BIG ENDIAN! But my computer uses LITTLE ENDIAN, I think (Intel processor),
and I get same value 0x0000010000040880 from my C app and from another website hex2bin converter.
__mingw_printf(...0x%016llX...,...ll) also print 0x0000010000040880 as you can see.
Following the yolinux website, I have inverted my "(Little-Endian)" and "(Big-Endian)" labels in my output for the moment.
Also, the sign bit must be 0 for a positive number; that's the case in my result, but also in the yolinux result (so it cannot help me be sure).
If I understand endianness correctly, only bytes are swapped, not bits, and my groups of bits seem to be correctly inverted.
Is it simply an error on yolinux.com, or am I missing a step regarding 64-bit numbers and C programming?
When you print some "multi-byte" integer using printf (and the correct format specifier) it doesn't matter whether the system is little or big endian. The result will be the same.
The difference between little and big endian is the order that multi-byte types are stored in memory. But once data is read from memory into the core processor, there is no difference.
This code shows how an integer (4 bytes) is placed in memory on my machine.
#include <stdio.h>

int main()
{
    unsigned int u = 0x12345678;
    printf("size of int is %zu\n", sizeof u);
    printf("DEC: u=%u\n", u);
    printf("HEX: u=0x%x\n", u);
    printf("memory order:\n");
    unsigned char *p = (unsigned char *)&u;
    for (int i = 0; i < sizeof u; ++i)
        printf("address %p holds %x\n", (void *)&p[i], p[i]);
    return 0;
}
Output:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 78
address 0x7ffddf2c263d holds 56
address 0x7ffddf2c263e holds 34
address 0x7ffddf2c263f holds 12
So I can see that I'm on a little endian machine as the LSB (least significant byte, i.e. 78) is stored on the lowest address.
Executing the same program on a big endian machine would (assuming same address) show:
size of int is 4
DEC: u=305419896
HEX: u=0x12345678
memory order:
address 0x7ffddf2c263c holds 12
address 0x7ffddf2c263d holds 34
address 0x7ffddf2c263e holds 56
address 0x7ffddf2c263f holds 78
Now it is the MSB (most significant byte, i.e. 12) that is stored at the lowest address.
The important thing to understand is that this only relates to "how multi-byte type are stored in memory". Once the integer is read from memory into a register inside the core, the register will hold the integer in the form 0x12345678 on both little and big endian machines.
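A short sketch of that distinction, assuming a little-endian host: comparing values is endianness-independent, while comparing raw memory is not.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t x = 0x12345678;
    unsigned char big[4] = {0x12, 0x34, 0x56, 0x78}; /* big-endian byte order */

    printf("value check: %d\n", x == 0x12345678);        /* 1 on any host */
    printf("memory is big-endian: %d\n",
           memcmp(&x, big, sizeof x) == 0); /* 0 here, 1 on a big-endian host */
    return 0;
}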
There is only a single way to represent an integer in decimal, binary or hexadecimal format. For example, number 43981 is equal to 0xABCD when written as hexadecimal, or 0b1010101111001101 in binary. Any other value (0xCDAB, 0xDCBA or similar) represents a different number.
The way your compiler and CPU choose to store this value internally is irrelevant as far as the C standard is concerned; the value could be stored as a 36-bit one's complement if you're particularly unlucky, as long as all operations mandated by the standard have equivalent effects.
You will rarely have to inspect your internal data representation when programming. Practically the only time you care about endianness is when working on a communication protocol, because then the binary format of the data must be precisely defined, but even then your code will not differ across architectures:
// input value is big endian, this is defined
// by the communication protocol
uint32_t parse_comm_value(const unsigned char * ptr)
{
    // bit shifts in C have the same meaning regardless
    // of the endianness of your architecture; unsigned char
    // avoids sign extension of the input bytes
    uint32_t result = 0;
    result |= (uint32_t)(*ptr++) << 24;
    result |= (uint32_t)(*ptr++) << 16;
    result |= (uint32_t)(*ptr++) << 8;
    result |= (uint32_t)(*ptr++);
    return result;
}
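For instance, a usage sketch (the byte values are arbitrary):
const unsigned char buf[4] = {0x00, 0x00, 0x01, 0x00};
uint32_t value = parse_comm_value(buf); /* 0x00000100 == 256 on any host */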
Tl;dr calling a standard function like printf("0x%llx", number); always prints the correct value using the specified format. Inspecting the contents of memory by reading individual bytes gives you the representation of the data on your architecture.

How does this code that tests for big- or little-endianness work?

I'm studying endianness, and the textbook I'm reading supplied this run-time test to check whether the code is running on a little-endian or big-endian system. The book doesn't explain anything, and I am confused about how this code works. Can anyone please explain how this piece of code works? Thanks in advance.
/* Test platform Endianness */
int bigendian(void) {
    int i;
    union {
        char Array[4];
        long Chars;
    } TestUnion;
    char c = 'a';

    for (i = 0; i < 4; i++)
        TestUnion.Array[i] = c++;

    if (TestUnion.Chars == 0x61626364)
        return 1;
    else
        return 0;
}
A union provides different views of the same data - here TestUnion can be interpreted as:
a char[4] array, or
a long int
The for loop populates TestUnion as a char[4] array - note that the character a has ASCII code 0x61, b is 0x62, and so on. So the memory is filled with the 4 bytes 0x61, 0x62, 0x63, 0x64, stored at ascending addresses.
The if statement checks whether it's big-endian or not by interpreting TestUnion as a long int. If it's big-endian, the most significant byte sits at the lowest address, so reading the bytes in address order gives 0x61626364. Otherwise it's little-endian, and the bytes are taken in the opposite order, giving 0x64636261.
You can check that function in your system using code such as:
printf( "bigendian ? %s\n", bigendian() ? "true" : "false" );
Endianness means whether the low byte or the high byte is stored first in memory. Little endian machines store the low byte first, big endian ones store the high byte first. If integers have more than two bytes, other orders are possible but are vanishingly rare.
What the code is doing is setting up a union, so that a char array and a long integer share the same address space. It then checks which way round the bytes of the long are, by comparing against the expected value. It is poorly written for many reasons. Technically it's undefined behaviour to write to one field of a union and then read from another. The code assumes that sizeof(long) == 4, that ASCII is the character set, and that the compiler will treat the union as expected. Probably all of these will hold. He's also comparing a signed value to a hex value with the high bit set - I think that's OK, but it rather depends on the minutiae of the C standard.
A better test is simply
int x = 0xFF;
unsigned char *test = (unsigned char *)&x;
if(test[sizeof(int)-1] == 0xFF)
/* big-endian */
if(test[sizeof(int)-1] == 0x00)
/* little-endian */
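Wrapped into a runnable program (a minimal sketch; it prints one line or the other depending on the host):
#include <stdio.h>

int main(void)
{
    int x = 0xFF;
    unsigned char *test = (unsigned char *)&x;

    if (test[sizeof(int) - 1] == 0xFF)
        printf("big-endian\n");
    if (test[sizeof(int) - 1] == 0x00)
        printf("little-endian\n");
    return 0;
}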
Assuming you're working on a system that uses ASCII for its character set, the values 0x61, 0x62, 0x63, and 0x64 represent the letters a, b, c, and d, respectively.
The loop:
char c = 'a';
for(i=0; i<4; i++)
TestUnion.Array[i] = c++;
populates the char array part of the union TestUnion with a, b, c, and d. On a big-endian machine, accessing the union as a 32-bit long gives 0x61626364; on a little-endian machine, it gives 0x64636261.
This isn't a good "general case" algorithm to test for endianness, though, because:
Not all systems use ASCII to represent characters
A long is not necessarily 4*sizeof(char) bytes wide

Casting uint8_t array into uint16_t value in C

I'm trying to convert a 2-byte array into a single 16-bit value. For some reason, when I cast the array as a 16-bit pointer and then dereference it, the byte ordering of the value gets swapped.
For example,
#include <stdint.h>
#include <stdio.h>
int main()
{
    uint8_t a[2] = {0x15, 0xaa};
    uint16_t b = *(uint16_t*)a;
    printf("%x\n", (unsigned int)b);
    return 0;
}
prints aa15 instead of 15aa (which is what I would expect).
What's the reason behind this, and is there an easy fix?
I'm aware that I can do something like uint16_t b = a[0] << 8 | a[1]; (which does work just fine), but I feel like this problem should be easily solvable with casting and I'm not sure what's causing the issue here.
As mentioned in the comments, this is due to endianness.
Your machine is little-endian, which (among other things) means that multi-byte integer values have the least significant byte first.
If you compiled and ran this code on a big-endian machine (ex. a Sun), you would get the result you expect.
Since your array is set up as big-endian, which also happens to be network byte order, you could get around this by using ntohs and htons. These functions convert a 16-bit value from network byte order (big endian) to the host's byte order and vice versa:
uint16_t b = ntohs(*(uint16_t*)a);
There are similar functions called ntohl and htonl that work on 32-bit values.
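For example, a 32-bit variant of the same idea, using memcpy to avoid the pointer cast (a sketch; assumes string.h and arpa/inet.h are included):
uint8_t a4[4] = {0x12, 0x34, 0x56, 0xaa};
uint32_t v;
memcpy(&v, a4, sizeof v); /* copy the raw bytes in host order */
v = ntohl(v);             /* 0x123456aa on any host */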
This is because of the endianness of your machine.
In order to make your code independent of the machine consider the following function:
#define LITTLE_ENDIAN 0
#define BIG_ENDIAN 1

int endian() {
    int i = 1;
    char *p = (char *)&i;

    if (p[0] == 1)
        return LITTLE_ENDIAN;
    else
        return BIG_ENDIAN;
}
So for each case you can choose which operation to apply.
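For instance, a sketch of how that check could drive the conversion of the array from the question (assuming the endian() function above and string.h):
uint16_t b;
memcpy(&b, a, sizeof b);        /* copy the two bytes in host order */
if (endian() == LITTLE_ENDIAN)  /* the array holds big-endian data, so swap */
    b = (uint16_t)((b << 8) | (b >> 8));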
You cannot do anything like *(uint16_t*)a because of the strict aliasing rule. Even if code appears to work for now, it may break later in a different compiler version.
A correct version of the code could be:
b = ((uint16_t)a[0] << CHAR_BIT) + a[1];
The version suggested in your question involving a[0] << 8 is incorrect because on a system with 16-bit int, this may cause signed integer overflow: a[0] promotes to int, and << 8 means * 256.
This might help to visualize things. When you create the array, you have two bytes in order. When you print the value, you get the human-readable hex form, which is the opposite of the little-endian order in which it was stored. The value 1 as a little-endian uint16_t is stored as follows, where a0 is a lower address than a1...
a0 a1
|10000000|00000000
Note that the least significant byte comes first in memory, but when we print the value in hex, the least significant byte appears on the right, which is what we normally expect on any machine.
This program prints a little-endian and a big-endian 1 in binary, starting from the least significant bit...
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>
void print_bin(uint64_t num, size_t bytes) {
    int i = 0;
    for (i = bytes * 8; i > 0; i--) {
        if (i % 8 == 0)
            printf("|");
        (num & 1) ? printf("1") : printf("0");
        num >>= 1;
    }
    printf("\n");
}

int main(void) {
    uint8_t a[2] = {0x15, 0xaa};
    uint16_t b = *(uint16_t*)a;
    uint16_t le = 1;
    uint16_t be = htons(le);

    printf("Little Endian 1\n");
    print_bin(le, 2);
    printf("Big Endian 1 on little endian machine\n");
    print_bin(be, 2);
    printf("0xaa15 as little endian\n");
    print_bin(b, 2);
    return 0;
}
This is the output (printed least significant bit first):
Little Endian 1
|10000000|00000000
Big Endian 1 on little endian machine
|00000000|10000000
0xaa15 as little endian
|10101000|01010101

Copying a 4 element character array into an integer in C

A char is 1 byte and an integer is 4 bytes. I want to copy byte-by-byte from a char[4] into an integer. I thought of different methods but I'm getting different answers.
char str[4]="abc";
unsigned int a = *(unsigned int*)str;
unsigned int b = str[0]<<24 | str[1]<<16 | str[2]<<8 | str[3];
unsigned int c;
memcpy(&c, str, 4);
printf("%u %u %u\n", a, b, c);
Output is
6513249 1633837824 6513249
Which one is correct? What is going wrong?
It's an endianness issue. When you interpret the char* as an int* the first byte of the string becomes the least significant byte of the integer (because you ran this code on x86 which is little endian), while with the manual conversion the first byte becomes the most significant.
To put this into pictures, this is the source array:
a b c \0
+------+------+------+------+
| 0x61 | 0x62 | 0x63 | 0x00 | <---- bytes in memory
+------+------+------+------+
When these bytes are interpreted as an integer in a little endian architecture the result is 0x00636261, which is decimal 6513249. On the other hand, placing each byte manually yields 0x61626300 -- decimal 1633837824.
Of course treating a char* as an int* is undefined behavior, so the difference is not important in practice because you are not really allowed to use the first conversion. There is however a way to achieve the same result, which is called type punning:
union {
char str[4];
unsigned int ui;
} u;
strcpy(u.str, "abc");
printf("%u\n", u.ui);
Neither of the first two is correct.
The first violates aliasing rules and may fail because the address of str is not properly aligned for an unsigned int. To reinterpret the bytes of a string as an unsigned int with the host system byte order, you may copy it with memcpy:
unsigned int a; memcpy(&a, &str, sizeof a);
(Presuming the size of an unsigned int and the size of str are the same.)
The second may fail with integer overflow because str[0] is promoted to an int, so str[0]<<24 has type int, but the value required by the shift may be larger than is representable in an int. To remedy this, use:
unsigned int b = (unsigned int) str[0] << 24 | …;
This second method interprets the bytes from str in big-endian order, regardless of the order of bytes in an unsigned int in the host system.
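A runnable version of the two corrected approaches side by side (a sketch; the first result depends on the host's byte order):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char str[4] = "abc";
    unsigned int a, b;

    memcpy(&a, str, sizeof a); /* host byte order: 6513249 on little-endian */
    b = (unsigned int)str[0] << 24 | (unsigned int)str[1] << 16
      | (unsigned int)str[2] << 8 | (unsigned int)str[3]; /* always 1633837824 */

    printf("%u %u\n", a, b);
    return 0;
}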
unsigned int a = *(unsigned int*)str;
This initialization is not correct and invokes undefined behavior. It violates C aliasing rules and potentially violates processor alignment restrictions.
You said you want to copy byte-by-byte.
That means that the line unsigned int a = *(unsigned int*)str; is not allowed. However, what you're doing is a fairly common way of reading an array as a different type (such as when you're reading a stream from disk).
It just needs some tweaking:
char * str ="abc";
int i;
unsigned a;
char * c = (char * )&a;
for(i = 0; i < sizeof(unsigned); i++){
c[i] = str[i];
}
printf("%d\n", a);
Bear in mind, the data you're reading may not share the same endianness as the machine you're reading from. This might help:
void changeEndian32(void *data)
{
    uint8_t *cp = (uint8_t *)data;
    union {
        uint32_t word;
        uint8_t bytes[4];
    } temp;

    temp.bytes[0] = cp[3];
    temp.bytes[1] = cp[2];
    temp.bytes[2] = cp[1];
    temp.bytes[3] = cp[0];
    *((uint32_t *)data) = temp.word;
}
Both are correct in a way:
Your first solution copies in native byte order (i.e. the byte order the CPU uses) and thus may give different results depending on the type of CPU.
Your second solution copies in big endian byte order (i.e. most significant byte at lowest address) no matter what the CPU uses. It will yield the same value on all types of CPUs.
What is correct depends on how the original data (array of char) is meant to be interpreted.
E.g. Java class files always use big-endian byte order (no matter what CPU is used). So if you want to read ints from a Java class file, you have to use the second way. In other cases you might want to use the CPU-dependent way (I think Matlab writes ints in native byte order into files, cf. this question).
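For instance, a minimal sketch of reading a 4-byte big-endian integer from a buffer, independent of host endianness (the helper name here is ours, for illustration):
#include <stdint.h>

/* Read a 4-byte big-endian value (e.g. from a Java class file) */
uint32_t read_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}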
If you're using the CVI (National Instruments) compiler, you can use the function Scan to do this:
unsigned int a;
For big endian:
Scan(str,"%1i[b4uzi1o3210]>%i",&a);
For little endian:
Scan(str,"%1i[b4uzi1o0123]>%i",&a);
The o modifier specifies the byte order.
The i inside the square brackets indicates where to start in the str array.
