Is this program compatible on both big and little endian systems? - c

I wrote a small program which reverses a string and prints it to screen:
void ReverseString(char *String)
{
char *Begin = String;
char *End = String + strlen(String) - 1;
char TempChar = '\0';
while (Begin < End)
{
TempChar = *Begin;
*Begin = *End;
*End = TempChar;
Begin++;
End--;
}
printf("%s",String);
}
It works perfectly in Dev C++ on Windows (little endian).
But I have a sudden doubt of its efficiency. If you look at this line:
while (Begin < End)
I am comparing the address of the beginning and end. Is this the correct way?
Does this code work on a big endian OS like Mac OS X ?
Or am I thinking the wrong way ?
I have got several doubts which I mentioned above.
Can anyone please clarify ?

Your code has no endianness-related issues. There's also nothing wrong with the way you're comparing the two pointers. In short, your code's fine.

Endianness is defined as the order of significance of the bytes in a multi-byte primitive type. So if your int is big-endian, that means the first byte (i.e. the one with the lowest address) of an int in memory contains the most significant bits of the int, and so on to the last/least significant. That's all it means. When we say a system is big-endian, that generally means that all of its pointer and arithmetic types are big-endian, although there are some odd special cases out there. Endian-ness doesn't affect pointer arithmetic or comparison, or the order in which strings are stored in memory.
Your code does not use any multi-byte primitive types[*], so endian-ness is irrelevant. In general, endian-ness only becomes relevant if you somehow access the individual bytes of such an object (for example by casting a pointer to unsigned char*, writing the memory to a file or over the network, and the like).
Supposing a caller did something like this:
int x = 0x00010203; // assuming sizeof(int) == 4 and CHAR_BIT == 8
ReverseString((char *)&x);
Then their code would be endian-dependent. On a big-endian system, they would pass you an empty string, since the first byte would be 0, so your code would leave x unchanged. On a little-endian system they would pass you a three-byte string, since the first three bytes would be 0x03, 0x02, 0x01 and the fourth byte 0, so your code would change x to 0x00030201.
[*] well, the pointers are multi-byte, on OSX and on pretty much every C implementation. But you don't inspect their storage representations, you just use them as values, so there's no opportunity for behavior to differ according to endianness.

As far as I know, endianness does not affect a char * as each character is a single byte and forms an array of characters. Have a look at http://www.ibm.com/developerworks/aix/library/au-endianc/index.html?ca=drs-
The effect will be seen in multi byte data types like int.

As long as you manipulate whole type T objects (which is what you do with type T being char) you just can't run into endianness problems.
You could run into them if you for example tried to manipulate separate bytes within a larger type (an int for example) but you don't do anything like that. This is why endianness problems are impossible in your code, period.

Related

What's the proper way to copy a char array of a given size to an integer in C?

Suppose I have a char array and an associated length: Arr and Len. Not a string, a char array. There is no null terminator. Yet I have to copy the array data into an integer of type int64_t. Here's how it's done, and for the purpose of this question I'm assuming Len will not exceed 8:
int64_t Word = 0;
memcpy(&Word, Arr, Len);
Is this actually the proper way to do this? I am copying memory, but is there a faster way to do it inline, for example? So Word can be register?
The problem with a type pun is that it assumes Arr has 8 bytes allocated. No, Arr has at most 8 bytes allocated. It could have 5, so casting Arr to an int64_t * and then dereferencing it could try to access three illegal bytes at the end, resulting in a segfault.
Is the proper way to do what I describe a memcpy() call, or is there a faster or better way?
Since Len can be anything up to 8, I'll assume little-endian storage, i.e. the least-significant byte at Arr[0].
If Len were fixed at 8, the compiler might be able to replace memcpy simply by loading the value from memory. That would also depend on whether the platform can do unaligned reads (if the compiler can't prove alignment), and on a big-endian architecture it would additionally need a byte swap, comparable to the bswap instruction on x86-64.
Because Len is a run-time value, the compiler will likely generate a call to memcpy. The overhead of the call itself is not trivial. All things considered, it's probably best just to handle this in an endian-independent way using byte arithmetic. The code assumes 8-bit bytes, which seems consistent with your question.
uint64_t Word = 0;
while (Len--)
    Word = (Word << 8) | (unsigned char)Arr[Len]; /* cast avoids sign extension if char is signed */
On more exotic platforms, where (CHAR_BIT > 8), you can replace the right-hand side of the OR expression with (Arr[Len] & 0xff). In fact, this is optimised away on platforms with 8-bit (normative) bytes, so you might as well add it for completeness. Or just keep these issues in mind.
There are platforms with legal C implementations where char, short, int are 32-bit values, for example. These are quite common in the embedded world.

memcpy inverting data, C language

I have a doubt here: I'm trying to use memcpy() to copy a string[9] to an unsigned long long int variable. Here's the code:
unsigned char string[9] = "message";
string[8] = '\0';
unsigned long long int aux;
memcpy(&aux, string, 8);
printf("%llx\n", aux); // prints inverted data
/*
* expected: 6d65737361676565
* printed: 656567617373656d
*/
How do I make this copy without inverting the data?
Your system is using little endian byte ordering for integers. That means that the least significant byte comes first. For example, a 32 bit integer would store 258 (0x00000102) as 0x02 0x01 0x00 0x00.
Rather than copying your string into an integer, just loop through the characters and print each one in hex:
size_t i;
size_t len = strlen((const char *)string); /* strlen expects a char pointer */
for (i = 0; i < len; i++) {
    printf("%02x ", string[i]);
}
printf("\n");
Since string is an array of unsigned char and you're doing bit manipulation for the purpose of implementing DES, you don't need to change it at all. Just use it as is.
Looks like you've just discovered by accident how CPUs store integer values. There are two competing schools of thought that are termed endian, with little-endian and big-endian both found in the wild.
If you want them in byte-for-byte order, an integer type will be problematic and should be avoided. Just use a byte array.
There are conversion functions that can go from one endian form to another, though you need to know what sort your architecture uses before converting properly.
So if you're reading in a binary value you must know what endian form it's in in order to import it correctly into a native int type. It's generally a good practice to pick a consistent endian form when writing binary files to avoid guessing, where the "network byte order" scheme used in the vast majority of internet protocols is a good default. Then you can use functions like htonl and ntohl to convert back and forth as necessary.

What does casting char* do to a reference of an int? (Using C)

In my course for intro to operating systems, our task is to determine if a system is big or little endian. There's plenty of results I've found on how to do it, and I've done my best to reconstruct my own version of a code. I suspect it's not the best way of doing it, but it seems to work:
#include <stdio.h>
int main() {
int a = 0x1234;
unsigned char *start = (unsigned char*) &a;
int len = sizeof( int );
if( start[0] > start[ len - 1 ] ) {
//biggest in front (Little Endian)
printf("1");
} else if( start[0] < start[ len - 1 ] ) {
//smallest in front (Big Endian)
printf("0");
} else {
//unable to determine with set value
printf( "Please try a different integer (non-zero). " );
}
}
I've seen this line of code (or some version of) in almost all answers I've seen:
unsigned char *start = (unsigned char*) &a;
What is happening here? I understand casting in general, but what happens if you cast an int pointer to a char pointer? I know:
unsigned int *p = &a;
assigns the memory address of a to p, and that you can affect the value of a by dereferencing p. But I'm totally lost with what's happening with the char and, more importantly, not sure why my code works.
Thanks for helping me with my first SO post. :)
When you cast between pointers of different types, the result is generally implementation-defined (it depends on the system and the compiler). There are no guarantees that you can access the pointed-to object through the result or that it is correctly aligned, etc.
But for the special case when you cast to a pointer to character, the standard actually guarantees that you get a pointer to the lowest addressed byte of the object (C11 6.3.2.3 §7).
So the compiler will implement the code you have posted in such a way that you get a pointer to the lowest-addressed byte of the int. As we can tell from your code, that byte may contain different values depending on endianness.
If you have a 16-bit CPU, the char pointer will point at memory containing 0x12 in case of big endian, or 0x34 in case of little endian.
For a 32-bit CPU, the int would contain 0x00001234, so you would get 0x00 in case of big endian and 0x34 in case of little endian.
If you dereference an int pointer you get sizeof(int) bytes of data (4 with gcc on common platforms). If you want only one byte, cast that pointer to a character pointer and dereference it; you will get one byte of data. The cast tells the compiler to read that many bytes instead of the original data type's size.
Values stored in memory are a set of '1's and '0's which by themselves do not mean anything. Datatypes are used for recognizing and interpreting what the values mean. So lets say, at a particular memory location, the data stored is the following set of bits ad infinitum: 01001010 ..... By itself this data is meaningless.
A pointer (other than a void pointer) contains 2 pieces of information. It contains the starting position of a set of bytes, and the way in which the set of bits are to be interpreted. For details, you can see: http://en.wikipedia.org/wiki/C_data_types and references therein.
So if you have
a char *c,
a short int *i,
and a float *f
which look at the bits mentioned above, c, i, and f hold the same address, but *c takes the first 8 bits and interprets them in a certain way. So you can do things like printf("The character is %c", *c). On the other hand, *i takes the first 16 bits and interprets them in a certain way. In this case, it is meaningful to say printf("The value is %d", *i). Again, for *f, printf("The value is %f", *f) is meaningful.
The real differences come when you do math with these. For example,
c++ advances the pointer by 1 byte,
i++ advances it by sizeof(short) bytes (typically 2),
and f++ advances it by sizeof(float) bytes (typically 4).
More importantly, for
(*c)++, (*i)++, and (*f)++ the algorithm used for doing the addition is totally different.
In your question, when you cast from one pointer type to another, you already know that the algorithm you are going to use for manipulating the bits present at that location will be easier if you interpret those bits as an unsigned char rather than an unsigned int. The same operators +, -, etc. will act differently depending upon what datatype the operators are looking at. If you have worked on physics problems in which a coordinate transformation made the solution very simple, then this is the closest analog to that operation. You are transforming one problem into another that is easier to solve.

Char C question about encoding signed/unsigned

I read that C does not define whether a char is signed or unsigned, and the GCC documentation says that it can be signed on x86 and unsigned on PowerPC and ARM.
Okay, I'm writing a program with GLib, which defines char as gchar (nothing more than that, only a way of standardizing).
My question is: what about UTF-8? Does it use more than one block of memory?
Say that I have a variable
unsigned char *string = "My string with UTF8 enconding ~> çã";
See, if I declare my variable as unsigned, will I have only 127 values (so my program will have to store more blocks of memory), or does UTF-8 change to negative too?
Sorry if I can't explain it correctly, but I think that it is a bit complex.
NOTE:
Thanks for all the answers.
I don't understand how it is interpreted normally.
I think that, as with ASCII, if I have signed and unsigned chars in my program, the strings have different values, and that leads to confusion; imagine it in UTF-8.
I've had a couple requests to explain a comment I made.
The fact that a char type can default to either a signed or unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF8 uses the high bit (assuming that char is an 8-bit type, which is true in the vast majority of platforms) to indicate that a character code point requires more than one byte to be represented.
A quick and dirty example of the problem:
#include <stdio.h>
int main( void)
{
signed char flag = 0xf0;
unsigned char uflag = 0xf0;
if (flag < (signed char) 'z') {
printf( "flag is smaller than 'z'\n");
}
else {
printf( "flag is larger than 'z'\n");
}
if (uflag < (unsigned char) 'z') {
printf( "uflag is smaller than 'z'\n");
}
else {
printf( "uflag is larger than 'z'\n");
}
return 0;
}
On most projects that I work on, the unadorned char type is typically avoided in favor of using a typedef that explicitly specifies an unsigned char. Something like the uint8_t from stdint.h or
typedef unsigned char u8;
Generally dealing with an unsigned char type seems to work well and have few problems - the one area that I have seen occasional problems is when using something of that type to control a loop:
while (uchar_var-- >= 0) {
// infinite loop...
}
Two things:
Whether a char type is signed or unsigned won't affect your ability to translate UTF8-encoded-strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.
Some of your confusion may be that you're trying to do this:
unsigned char *string = "This is a UTF8 string";
Don't do this-- you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although for some cases (like mine here) they end up being the same thing, in your example in the question, they may not. And certainly in other cases they won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not.
Using unsigned char has its pros and cons. The biggest benefits are that you don't get sign extension or other funny features such as signed overflow that would produce unexpected results from calculations. Unsigned char is also compatible with the <ctype.h> macros/functions such as isalpha(ch) (all of these require values in unsigned char range, or EOF). On the other hand, all I/O functions require char*, requiring you to cast whenever you do I/O.
As for UTF-8, storing it in signed or unsigned arrays is fine but you have to be careful with those string literals as there is little guarantee about them being valid UTF-8. C++0x adds UTF-8 string literals to avoid possible issues and I would expect the next C standard to adopt those as well.
In general you should be fine, though, as long as you make sure that your source code files are always UTF-8 encoded.
Signed / unsigned affects only arithmetic operations. If char is unsigned, then the higher values will be positive; if it is signed, they will be negative. But the number of representable values is still the same.
Not really, unsigned / signed does not specify how many values a variable can hold. It specifies how they are interpreted.
So, an unsigned char has the same amount of values as a signed char, except that the one has negative numbers and the other doesn't. It is still 8 bits (if we assume that a char holds 8 bits, I'm not sure it does everywhere).
It makes no differences when using a char* as a string. The only time signed/unsigned would make a difference is if you would be interpreting it as a number, like for arithmetic or if you were to print it as an integer.
A UTF-8 character cannot be assumed to fit in one byte. UTF-8 characters can be 1-4 bytes wide. So a char or wchar_t, signed or unsigned, would not be sufficient if you assume one unit can always store one UTF-8 character.
Most platforms (such as PHP, .NET, etc.) have you build strings normally (such as char[] in C) and you use a library to convert between encodings and parse characters out of the string.
As to your question:
think if I have a singed or unsigned ARRAY of chars can be it make my program run wrong? – drigoSkalWalker
Yes. Mine did. Here's a simple runnable excerpt from my app that comes out totally wrong when using ordinary signed chars.
Try running it after changing all chars to unsigned in the parameters, like this:
int is_valid(unsigned char c);
It should then work properly.
#include <stdio.h>
int is_valid(char c);
int main() {
    char ch = 0xFE;
    int ans = is_valid(ch);
    printf("%d", ans);
}
int is_valid(char c) {
    /* With a signed char, c is promoted to a negative int here, so it never equals 0xFF or 0xFE. */
    if((c == 0xFF) || (c == 0xFE)) {
        printf("NOT valid\n");
        return 0;
    }
    else {
        printf("valid\n");
        return 1;
    }
}
What it does is check whether the char is a valid byte within UTF-8.
0xFF and 0xFE are NOT valid bytes in UTF-8.
Imagine the problem if the function accepts one of them as a valid byte.
What happens is this:
0xFE = 11111110 = 254
If you store this in an ordinary char (which is signed here), the leftmost, most significant bit makes it negative. But which negative number is it?
The magnitude is found by flipping the bits and adding one (two's complement):
11111110 flipped is
00000001
00000001 + 00000001 =
00000010 = 2
Remember that the value was negative, so it is -2.
So (-2 == 0xFE) in the function is of course false, because 0xFE is the int 254.
The same goes for (-2 == 0xFF).
So a function that checks for invalid bytes ends up accepting invalid bytes as if they were OK :-o.
Two other reasons I can think of to stick to unsigned when dealing with UTF-8 are:
If you need to shift bits to the right, there can be trouble, because with signed chars you might end up shifting in 1s from the left.
UTF-8 and Unicode only use non-negative numbers, so why don't you as well? Keeping it simple :)

Does an 8-bit processor have to face endianness problems?

If I have an int32 type integer in an 8-bit processor's memory, say an 8051, how could I identify the endianness of that integer? Is it compiler specific? I think this is important when sending multibyte data through serial lines etc.
With an 8 bit microcontroller that has no native support for wider integers, the endianness of integers stored in memory is indeed up to the compiler writer.
The SDCC compiler, which is widely used on 8051, stores integers in little-endian format (the user guide for that compiler claims that it is more efficient on that architecture, due to the presence of an instruction for incrementing a data pointer but not one for decrementing).
If the processor has any operations that act on multi-byte values, or has any multi-byte registers, it can have an endianness.
http://69.41.174.64/forum/printable.phtml?id=14233&thread=14207 suggests that the 8051 mixes different endian-ness in different places.
The endianness is specific to the CPU architecture. Since a compiler needs to target a particular CPU, the compiler has knowledge of the endianness as well. So if you need to send data over a serial connection, network, etc., you may wish to use built-in functions to put data in network byte order, especially if your code needs to support multiple architectures.
For more information, see: http://www.gnu.org/s/libc/manual/html_node/Byte-Order.html
It's not just up to the compiler - '51 has some native 16-bit registers (DPTR, PC in standard, ADC_IN, DAC_OUT and such in variants) of given endianness which the compiler has to obey - but outside of that, the compiler is free to use any endianness it prefers or one you choose in project configuration...
An integer does not have endianness in it. You can't determine just from looking at the bytes whether it's big or little endian. You just have to know: For example if your 8 bit processor is little endian and you're receiving a message that you know to be big endian (because, for example, the field bus system defines big endian), you have to convert values of more than 8 bits. You'll need to either hard-code that or to have some definition on the system on which bytes to swap.
Note that swapping bytes is the easy thing. You may also have to swap bits in bit fields, since the order of bits in bit fields is compiler-specific. Again, you basically have to know this at build time.
#include <stdio.h>
int main(void) {
    unsigned long int x = 1;
    unsigned char *px = (unsigned char *) &x;
    puts(*px == 0 ? "big endian" : "little endian");
}
If x is assigned the value 1 then the value 1 will be in the least significant byte.
If we then cast &x to a pointer to bytes, the pointer will point to the lowest memory location of x. If the byte at that location is 0 the system is big endian, otherwise it is little endian.
#include <stdio.h>
union foo {
int as_int;
char as_bytes[sizeof(int)];
};
int main() {
union foo data;
int i;
for (i = 0; i < sizeof(int); ++i) {
data.as_bytes[i] = 1 + i;
}
printf ("%0x\n", data.as_int);
return 0;
}
Interpreting the output is up to you.

Resources