memcpy inverting data, C language

I have a question here: I'm trying to use memcpy() to copy a string[9] into an unsigned long long int variable. Here's the code:
unsigned char string[9] = "messagee";
string[8] = '\0';
unsigned long long int aux;
memcpy(&aux, string, 8);
printf("%llx\n", aux); // prints inverted data
/*
 * expected: 6d65737361676565
 * printed:  656567617373656d
 */
How do I make this copy without inverting the data?

Your system is using little endian byte ordering for integers. That means that the least significant byte comes first. For example, a 32 bit integer would store 258 (0x00000102) as 0x02 0x01 0x00 0x00.
Rather than copying your string into an integer, just loop through the characters and print each one in hex:
int i;
int len = strlen((const char *)string);  /* cast needed: strlen expects a char pointer */
for (i = 0; i < len; i++) {
    printf("%02x ", string[i]);
}
printf("\n");
Since string is an array of unsigned char and you're doing bit manipulation for the purpose of implementing DES, you don't need to change it at all. Just use it as it is.

Looks like you've just discovered by accident how CPUs store integer values. There are two competing conventions, referred to as endianness, with little-endian and big-endian both found in the wild.
If you want them in byte-for-byte order, an integer type will be problematic and should be avoided. Just use a byte array.
There are conversion functions that can go from one endian form to another, though you need to know what sort your architecture uses before converting properly.
So if you're reading in a binary value you must know what endian form it's in in order to import it correctly into a native int type. It's generally good practice to pick one consistent endian form when writing binary files to avoid guessing; the "network byte order" (big-endian) scheme used in the vast majority of internet protocols is a good default. Then you can use functions like htonl and ntohl to convert back and forth as necessary.
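For example, a 32-bit field could be written and read back like this. This is only a minimal sketch, assuming a POSIX system that provides <arpa/inet.h>; the function names put_u32 and get_u32 are just for illustration:

#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Write: convert the value to network (big-endian) order, then copy the bytes out. */
void put_u32(unsigned char *buf, uint32_t value)
{
    uint32_t net = htonl(value);
    memcpy(buf, &net, sizeof net);
}

/* Read: copy the bytes in, then convert from network order back to host order. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}

Going through memcpy rather than casting buf to uint32_t * also sidesteps alignment problems.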

Related

What is a cross OS method of converting different sized integers to and from byte arrays in C?

I'm making a data parser/encoder that has to work on different machines of both endianness.
Metadata in the byte array dynamically declares the number of bytes used to represent each integer, and some integers (I'll know which ones) must be read in big endian, and some must be read in little endian.
I currently have the integer -> byte functions written (developing on macOS, which is little-endian) and working on the Mac.
void longlong_to_bytes_big(long long num, unsigned char *byte_arr, unsigned char num_bytes)
{
    unsigned char i;
    for (i = 0; i < num_bytes; i++)
        byte_arr[i] = (num >> ((num_bytes - i - 1) * 8)) & 0xFF;
}

void longlong_to_bytes_little(long long num, unsigned char *byte_arr, unsigned char num_bytes)
{
    unsigned char i;
    for (i = 0; i < num_bytes; i++)
        byte_arr[i] = (num >> (i * 8)) & 0xFF;
}
But I'm worried this code actually only works for char, short and int on a little endian machine, and would give me the opposite endianness on a big endian machine.
Then for the other direction, I don't think I can combine all the different integer sizes into one function but I think each one should look something like this:
long long bytes_to_longlong_big(unsigned char *byte_arr)
{
    unsigned char i, a[8];
    for (i = 0; i < 8; i++)
        a[i] = byte_arr[8 - i - 1];
    return *(long long *)a;
}

long long bytes_to_longlong_small(unsigned char *byte_arr)
{
    return *(long long *)byte_arr;
}
but again I'm pretty sure these will be backwards on a different-endian machine due to the compiler's implementation of (long long *).
Is there a machine endian agnostic way to accomplish this? Given the choice I'd prefer performance over simplicity.
The goal is that these byte arrays be in the same order, regardless of the compiler's endianness, but also regardless of endianness, the code needs to correctly interpret the byte array.
You can save/exchange data in "network order" and then use functions like ntohl and htonl (and friends) when reading and writing data. These functions automatically take care of the endianness of the "current" system, so you don't need to write your own conversion code.
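That said, if the field widths vary as in the question, the shift-based approach used in the *_to_bytes functions above can simply be mirrored for decoding, and it never depends on the machine's byte order. A minimal sketch (the num_bytes parameter mirrors the encoders; note it does not sign-extend values stored in fewer than 8 bytes):

long long bytes_to_longlong_big(const unsigned char *byte_arr, unsigned char num_bytes)
{
    unsigned long long num = 0;
    unsigned char i;
    for (i = 0; i < num_bytes; i++)
        num = (num << 8) | byte_arr[i];   /* most significant byte first */
    return (long long)num;
}

long long bytes_to_longlong_little(const unsigned char *byte_arr, unsigned char num_bytes)
{
    unsigned long long num = 0;
    unsigned char i;
    for (i = 0; i < num_bytes; i++)
        num |= (unsigned long long)byte_arr[i] << (i * 8);   /* least significant byte first */
    return (long long)num;
}

Because the values are reassembled arithmetically instead of through a pointer cast, these behave identically on big- and little-endian machines.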
You might be interested in textual formats such as JSON, XML, or YAML. For human developers, they make debugging easier, and you'll find many libraries supporting them.
You could also look into portable binary formats like XDR or ASN.1.
You could find C or C++ code generators (a metaprogramming approach) related to them (rpcgen, SWIG), or you could consider writing your own C/C++ generator with tools such as GPP or GNU m4, or with your own Guile or Python script.
For real network exchanges (e.g. over Ethernet) or disk I/O, the bottleneck is usually the network (or the disk), not the encoding/decoding, which is why textual formats often make sense.

Printing actual bit representation of integers in C

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
    int val;
    unsigned char c[sizeof(int)];
} data;

data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.f, &data.c[0], &data.c[sizeof(int)-1]);
for (int i = 0; i < sizeof(int); i++)
    printf("%.2x", data.c[i]);
printf("\n");
Second:
for (int i = 0; i < 8 * sizeof(int); i++) {
    int j = 8 * sizeof(int) - 1 - i;
    printf("%d", (val >> j) & 1);
}
printf("\n");
With these two approaches the outputs I get are 00000002 and 02000000. I also tried other numbers, and it seems that the bytes are swapped between the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Sometimes they store the most significant byte first, but on your platform it's the least significant.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels where there's a pointless war about which end of a boiled egg to start at. Which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes, it encounters them in storage (endian) order.
But because >> is defined as operating on the value, bit by bit, it works 'logically', without regard to how the bytes are laid out in memory.
It's right of C to not define the byte order because hardware not supporting the model C chose would be burdened with an overhead of shuffling bytes around endlessly and pointlessly.
There sadly isn't a built-in identifier telling you what the model is, though code that detects it can be found.
It will become relevant to you if (a) as above you want to break integer types down into bytes and manipulate them, or (b) you receive files from other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Marker) in UTF-16 and UTF-32.
In fact a good reason (among many) for using UTF-8 is that the problem goes away, because each code unit is a single byte.
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers, and particularly of signed integers: specifically sign-and-magnitude, two's complement and ones' complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle along with tackling endian-ness we need to consider representation.
In principle, anyway. All modern computers use two's complement, and extant machines that use anything else are very rare; unless you have a genuine requirement to support such platforms, I recommend assuming you're on a two's-complement system.
The correct hex representation as a string is 00000002, as if you had declared the integer with a hex representation:
int n = 0x00000002; //n=2
or as you would get when printing the integer as hex, as in:
printf("%08x", n);
But when printing the integer's bytes one after the other, you must also consider the endianness, which is the byte order of multi-byte integers:
On a big-endian system (some UNIX systems use it) the 4 bytes will be ordered in memory as:
00 00 00 02
While on a little-endian system (most desktop OSes) the bytes will be ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endianness will print different results, as they store integers in different ways.
The second prints the bits that make up the integer value, most significant bit first. This result is independent of endianness. The result is also independent of how the >> operator is implemented for signed ints, as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C", although there is a lot of ambiguity.
It depends on your definition of "correct".
The first one will print the data exactly as it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done more simply by just aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers; in fact, accessing representations is a use case for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (int i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in order from the most significant bit to the least significant one. I wouldn't call it correct because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about Endianness on wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits, and even CHAR_BIT * sizeof(int) is wrong, because this would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look for example like this:
#include <stdio.h>

#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
                      + (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))

int main(void)
{
    int x = 2;
    for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
    {
        putchar((unsigned) x & mask ? '1' : '0');
    }
    puts("");
}
See this answer for an explanation of this strange macro.

What does casting char* do to a reference of an int? (Using C)

In my course for intro to operating systems, our task is to determine if a system is big or little endian. There are plenty of results I've found on how to do it, and I've done my best to reconstruct my own version of the code. I suspect it's not the best way of doing it, but it seems to work:
#include <stdio.h>

int main() {
    int a = 0x1234;
    unsigned char *start = (unsigned char*) &a;
    int len = sizeof( int );
    if( start[0] > start[ len - 1 ] ) {
        // biggest in front (Little Endian)
        printf("1");
    } else if( start[0] < start[ len - 1 ] ) {
        // smallest in front (Big Endian)
        printf("0");
    } else {
        // unable to determine with set value
        printf( "Please try a different integer (non-zero). " );
    }
}
I've seen this line of code (or some version of) in almost all answers I've seen:
unsigned char *start = (unsigned char*) &a;
What is happening here? I understand casting in general, but what happens when you cast the address of an int to a char pointer? I know:
unsigned int *p = &a;
assigns the memory address of a to p, and that you can affect the value of a by dereferencing p. But I'm totally lost with what's happening with the char and, more importantly, not sure why my code works.
Thanks for helping me with my first SO post. :)
When you cast between pointers of different types, the result is generally implementation-defined (it depends on the system and the compiler). There are no guarantees that you can access the pointer or that it is correctly aligned, etc.
But for the special case when you cast to a pointer to character, the standard actually guarantees that you get a pointer to the lowest addressed byte of the object (C11 6.3.2.3 §7).
So the compiler will implement the code you have posted in such a way that you get a pointer to the least significant byte of the int. As we can tell from your code, that byte may contain different values depending on endianess.
If you have a 16-bit CPU, the char pointer will point at memory containing 0x12 in case of big endian, or 0x34 in case of little endian.
For a 32-bit CPU, the int would contain 0x00001234, so you would get 0x00 in case of big endian and 0x34 in case of little endian.
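A shorter version of the same idea, often seen in practice, inspects only the first byte of a known value. This is just a minimal sketch of that check:

#include <stdio.h>

int main(void)
{
    unsigned int x = 1;
    unsigned char *p = (unsigned char *)&x;  /* points at the lowest addressed byte */
    if (p[0] == 1)
        printf("little endian\n");           /* least significant byte stored first */
    else
        printf("big endian\n");
    return 0;
}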
If you dereference an integer pointer you will get 4 bytes of data (this depends on the compiler; assuming gcc on a typical platform). But if you want only one byte, then cast that pointer to a character pointer and dereference it; you will get one byte of data. Casting means you are telling the compiler to read that many bytes instead of the original data type's byte size.
Values stored in memory are a set of '1's and '0's which by themselves do not mean anything. Data types are used for recognizing and interpreting what the values mean. So let's say that at a particular memory location the data stored is the following set of bits, ad infinitum: 01001010... By itself this data is meaningless.
A pointer (other than a void pointer) carries two pieces of information: the starting position of a set of bytes, and (through its type) the way in which that set of bits is to be interpreted. For details, you can see: http://en.wikipedia.org/wiki/C_data_types and references therein.
So if you have
a char *c,
a short int *i,
and a float *f
all looking at the bits mentioned above, then c, i, and f hold the same address, but *c takes the first 8 bits and interprets them in a certain way. So you can do things like printf("The character is %c", *c). On the other hand, *i takes the first 16 bits and interprets them in its own way, so it is meaningful to say printf("The value is %d", *i). Again, for *f, printf("The value is %f", *f) is meaningful.
The real differences come when you do math with these. For example,
c++ advances the pointer by 1 byte,
i++ advances it by 2 bytes (for a typical 2-byte short),
and f++ advances it by 4 bytes (for a typical 4-byte float).
More importantly, for
(*c)++, (*i)++, and (*f)++ the arithmetic used for doing the addition is quite different: integer addition for the first two, floating-point addition for the last.
In your question, when you cast from one pointer type to another, you already know that the algorithm you are going to use for manipulating the bits present at that location will be easier if you interpret those bits as an unsigned char rather than an unsigned int. The same operators +, -, etc. will act differently depending upon what data type they are looking at. If you have worked on physics problems where a coordinate transformation makes the solution much simpler, this is the closest analogue: you are transforming one problem into another that is easier to solve.
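To make the reinterpretation concrete, here is a minimal sketch. The byte values are arbitrary, it assumes a typical platform with a 2-byte short and a 4-byte float, and it uses memcpy rather than direct pointer casts to avoid strict-aliasing problems:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[4] = { 0x41, 0x42, 0x43, 0x44 };  /* some arbitrary bit pattern */

    /* The same four bytes, viewed as characters. */
    printf("as chars: %c %c %c %c\n", bytes[0], bytes[1], bytes[2], bytes[3]);

    /* The first two bytes, reinterpreted as a short. */
    unsigned short s;
    memcpy(&s, bytes, sizeof s);
    printf("as short: %u\n", (unsigned)s);

    /* All four bytes, reinterpreted as a float. */
    float f;
    memcpy(&f, bytes, sizeof f);
    printf("as float: %f\n", f);

    return 0;
}

On a little-endian machine the short prints as 0x4241 (16961); on a big-endian one it prints as 0x4142 (16706), which is exactly the kind of difference the rest of this page is about.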

Is this program compatible on both big and little endian systems?

I wrote a small program which reverses a string and prints it to screen:
void ReverseString(char *String)
{
    char *Begin = String;
    char *End = String + strlen(String) - 1;
    char TempChar = '\0';

    while (Begin < End)
    {
        TempChar = *Begin;
        *Begin = *End;
        *End = TempChar;
        Begin++;
        End--;
    }
    printf("%s", String);
}
It works perfectly in Dev C++ on Windows (little endian).
But I have a sudden doubt of its efficiency. If you look at this line:
while (Begin < End)
I am comparing the address of the beginning and end. Is this the correct way?
Does this code work on a big endian OS like Mac OS X ?
Or am I thinking the wrong way ?
I have got several doubts which I mentioned above.
Can anyone please clarify ?
Your code has no endianness-related issues. There's also nothing wrong with the way you're comparing the two pointers. In short, your code's fine.
Endianness is defined as the order of significance of the bytes in a multi-byte primitive type. So if your int is big-endian, that means the first byte (i.e. the one with the lowest address) of an int in memory contains the most significant bits of the int, and so on to the last/least significant. That's all it means. When we say a system is big-endian, that generally means that all of its pointer and arithmetic types are big-endian, although there are some odd special cases out there. Endian-ness doesn't affect pointer arithmetic or comparison, or the order in which strings are stored in memory.
Your code does not use any multi-byte primitive types[*], so endian-ness is irrelevant. In general, endian-ness only becomes relevant if you somehow access the individual bytes of such an object (for example by casting a pointer to unsigned char*, writing the memory to a file or over the network, and the like).
Supposing a caller did something like this:
int x = 0x00010203; // assuming sizeof(int) == 4 and CHAR_BIT == 8
ReverseString((char *)&x);
Then their code would be endian-dependent. On a big-endian system, they would pass you an empty string, since the first byte would be 0, so your code would leave x unchanged. On a little-endian system they would pass you a three-byte string, since the first three bytes would be 0x03, 0x02, 0x01 and the fourth byte 0, so your code would change x to 0x00030201
[*] well, the pointers are multi-byte, on OSX and on pretty much every C implementation. But you don't inspect their storage representations, you just use them as values, so there's no opportunity for behavior to differ according to endianness.
As far as I know, endianness does not affect a char * as each character is a single byte and forms an array of characters. Have a look at http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?ca=drs-
The effect will be seen in multi byte data types like int.
As long as you manipulate whole type T objects (which is what you do with type T being char) you just can't run into endianness problems.
You could run into them if you for example tried to manipulate separate bytes within a larger type (an int for example) but you don't do anything like that. This is why endianness problems are impossible in your code, period.

What's a portable way of converting Byte-Order of strings in C

I am trying to write a server that will communicate with any standard client that can make socket connections (e.g. a telnet client).
It started out as an echo server, which of course did not need to worry about network byte ordering.
I am familiar with the ntohs, ntohl, htons, and htonl functions. These would be great by themselves if I were transferring either 16- or 32-bit ints, or if the characters in the string being sent were multiples of 2 or 4 bytes.
I'd like to create a function that operates on strings, such as:
str_ntoh(char* net_str, char* host_str, int len)
{
    uint32_t* netp, hostp;
    netp = (uint32_t*)&net_str;
    for(i=0; i < len/4; i++){
        hostp[i] = ntoh(netp[i]);
    }
}
Or something similar. The above assumes that the word size is 32 bits. We can't be sure that the word size on the sending machine is not 16 bits or 64 bits, right?
For client programs, such as telnet, they must be using hton* before they send and ntoh* after they receive data, correct?
EDIT: For the people who think that because 1 char is a byte, endianness doesn't matter:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x01020304;
    char* c = (char*)&a;
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
Run this snippet of code. The output for me is as follows:
$ ./a.out
4 3 2 1
Those on PowerPC chips should get '1 2 3 4', but those of us on Intel chips should see what I got above for the most part.
Maybe I'm missing something here, but are you sending strings, that is, sequences of characters? Then you don't need to worry about byte order; that only applies to the byte layout of multi-byte integers. The characters in a string are always in the "right" order.
EDIT:
Derrick, to address your code example, I've run the following (slightly expanded) version of your program on an Intel i7 (little-endian) and on an old Sun Sparc (big-endian)
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x01020304;
    char* c = (char*)&a;
    char d[] = { 1, 2, 3, 4 };
    printf("The integer: %x %x %x %x\n", c[0], c[1], c[2], c[3]);
    printf("The string: %x %x %x %x\n", d[0], d[1], d[2], d[3]);
    return 0;
}
As you can see, I've added a real char array to your print-out of an integer.
The output from the little-endian Intel i7:
The integer: 4 3 2 1
The string: 1 2 3 4
And the output from the big-endian Sun:
The integer: 1 2 3 4
The string: 1 2 3 4
Your multi-byte integer is indeed stored in different byte order on the two machines, but the characters in the char array have the same order.
With your function signature as posted you don't have to worry about byte order. It accepts a char *, which can only handle 8-bit characters. With one byte per character, you cannot have a byte order problem.
You'd only run into a byte order problem if you send Unicode in a UTF-16 or UTF-32 encoding and the endianness of the sending machine doesn't match that of the receiving machine. The simple solution for that is to use the UTF-8 encoding, which is what most text is sent as across networks; being byte oriented, it doesn't have a byte order issue either. Or you could send a BOM.
If you'd like to send them as an 8-bit encoding (the fact that you're using char implies this is what you want), there's no need to byte swap. However, for the unrelated issue of non-ASCII characters, so that the same character > 127 appears the same on both ends of the connection, I would suggest that you send the data in something like UTF-8, which can represent all unicode characters and can be safely treated as ASCII strings. The way to get UTF-8 text based on the default encoding varies by the platform and set of libraries you're using.
If you're sending 16-bit or 32-bit encoding... You can include one character with the byte order mark which the other end can use to determine the endianness of the character. Or, you can assume network byte order and use htons() or htonl() as you suggest. But if you'd like to use char, please see the previous paragraph. :-)
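For the 16-bit case, the conversion might look like the following sketch, assuming the sender put each code unit in network byte order and that <arpa/inet.h> (POSIX) is available; the function name units_ntoh is just for illustration:

#include <arpa/inet.h>  /* ntohs */
#include <stddef.h>
#include <stdint.h>

/* Convert an array of 16-bit code units from network order to host order, in place. */
void units_ntoh(uint16_t *units, size_t count)
{
    for (size_t i = 0; i < count; i++)
        units[i] = ntohs(units[i]);
}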
It seems to me that the function prototype doesn't match its behavior. You're passing in a char *, but you're then casting it to uint32_t *. And, looking more closely, you're casting the address of the pointer, rather than the contents, so I'm concerned that you'll get unexpected results. Perhaps the following would work better:
void arr_ntoh(uint32_t* netp, uint32_t* hostp, int len)
{
    int i;
    for (i = 0; i < len; i++)
        hostp[i] = ntohl(netp[i]);
}
I'm basing this on the assumption that what you've really got is an array of uint32_t and you want to run ntohl() on all of them.
I hope this is helpful.
