Porting C endianness & pointers black magic to Swift - c

I'm trying to translate this snippet:
ntohs(*(UInt16*)VALUE) / 4.0
and some other similar ones, from C to Swift.
The problem is that I have very little knowledge of Swift and I just can't understand what this snippet does. Here's all I know:
ntohs converts from network endianness to host endianness
VALUE is a char[32]
I just discovered that this Swift expression does the same thing: (UInt(data.0) << 6) + (UInt(data.1) >> 2). Could someone please explain?
I want to return a Swift UInt (UInt64).
Thanks!

VALUE is a pointer to 32 bytes (char[32]).
The pointer is cast to a UInt16 pointer. That means the first two bytes of VALUE are interpreted as a UInt16 (2 bytes).
* dereferences the pointer. We get the first two bytes of VALUE as a 16-bit number. However, it is in network byte order (big-endian), so we cannot yet do integer arithmetic on it.
ntohs then converts it to host byte order, and we get a normal integer.
We divide the integer by 4.0.
To do the same in Swift, just combine the byte values into an integer.
let host = (UInt(data.0) << 8) | UInt(data.1)
Note that to divide by 4.0 you will have to convert the integer to Float.

The C you quote is technically incorrect, although it will be compiled as intended by most production C compilers.¹ A better way to achieve the same effect, which should also be easier to translate to Swift, is
unsigned int val = ((((unsigned int)VALUE[0]) << 8) | // ² ³
(((unsigned int)VALUE[1]) << 0)); // ⁴
double scaledval = ((double)val) / 4.0; // ⁵
The first statement reads the first two bytes of VALUE, interprets them as a 16-bit unsigned number in network byte order, and converts them to host byte order (whether or not those byte orders are different). The second statement converts the number to double and scales it.
¹ Specifically, *(UInt16*)VALUE provokes undefined behavior because it violates the type-based aliasing rules, which are asymmetric: a pointer with character type may be used to access an object with any type, but a pointer with any other type may not be used to access an object with (array-of-)character type.
² In C, a cast to unsigned int here is necessary in order to make the subsequent shifting and or-ing happen in an unsigned type. If you left the operand as a narrow type such as unsigned char or uint8_t, which might seem more appropriate, the "integer promotions" would convert it to int, which is signed, before doing the left shift. This would provoke undefined behavior on a system where int was only 16 bits wide (you're not allowed to shift a 1 bit into the sign bit). Swift almost certainly has completely different rules for arithmetic on types with small ranges; you'll probably need to cast to something before the shift, but I cannot tell you what.
³ I have over-parenthesized this expression so that the order of operations will be clear even if you aren't terribly familiar with C.
⁴ Left shifting by zero bits has no effect; it is only included for parallel structure.
⁵ An explicit conversion to double before the division operation is not necessary in C, but it is in Swift, so I have written it that way here.
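As a minimal sketch of the approach above, assuming the first two bytes of VALUE hold a big-endian 16-bit quantity: the helper names read_be16, read_be16_quarters and read_be16_div4 are hypothetical and exist only for illustration. The last helper shows why the integer expression from the question, (b0 << 6) + (b1 >> 2), yields the whole-number part of the same result: dividing by 4 is a right shift by two bits.

```c
#include <stdint.h>

/* Combine two bytes stored in network (big-endian) order into a host
 * integer. The result is the same on little- and big-endian hosts. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned int)p[0] << 8) | (unsigned int)p[1]);
}

/* The 16-bit value scaled by 1/4, mirroring ntohs(...) / 4.0. */
static double read_be16_quarters(const unsigned char *p)
{
    return (double)read_be16(p) / 4.0;
}

/* Integer-only variant: ((b0 << 8) | b1) >> 2 equals (b0 << 6) + (b1 >> 2),
 * so the fractional part of the division is discarded. */
static unsigned int read_be16_div4(const unsigned char *p)
{
    return ((unsigned int)p[0] << 6) + ((unsigned int)p[1] >> 2);
}
```

Since VALUE is declared as char[32], a caller would pass (const unsigned char *)VALUE so that bytes above 0x7F are not treated as negative.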

It looks like the code is taking the single byte value[0]. This is then dereferenced, which should retrieve a number from a low memory address, 1 to 127 (possibly 255).
Whatever number is there is then divided by 4.
I genuinely can't believe my interpretation is correct, and I can't check it because I have no laptop. I really think there may be a typo in your code, as it is not a good thing to do. Portable, reusable
I must stress that the string is not converted to a number. Which is then used

Related

convert uint8_t* to uint64_t

What's more recommended or advisable way to convert array of uint8_t at offset i to uint64_t and why?
uint8_t * bytes = ...
uint64_t const v = ((uint64_t *)(bytes + i))[0];
or
uint64_t const v = ((uint64_t)(bytes[i+7]) << 56)
| ((uint64_t)(bytes[i+6]) << 48)
| ((uint64_t)(bytes[i+5]) << 40)
| ((uint64_t)(bytes[i+4]) << 32)
| ((uint64_t)(bytes[i+3]) << 24)
| ((uint64_t)(bytes[i+2]) << 16)
| ((uint64_t)(bytes[i+1]) << 8)
| ((uint64_t)(bytes[i]));
There are two primary differences.
One, the behavior of ((uint64_t *)(bytes + i))[0] is not defined by the C standard (unless certain prerequisites about what bytes point to are met). Generally, an array of bytes should not be accessed using a uint64_t type.
When memory defined as one type is accessed with another type, it is called aliasing, and the C standard only defines certain combinations of aliasing. Some compilers may support some aliasing beyond what the standard requires, but using it is not portable. Additionally, if bytes + i is not suitably aligned for a uint64_t, the access may cause an exception or otherwise malfunction.
Two, loading the bytes through aliasing, if it is defined (by the standard or by compiler extension), interprets the bytes using the memory ordering for the C implementation. Some C implementations store the bytes representing integers in memory from low address to high address for low-position-value bytes to high-position-value bytes, and some store them from high address to low address. (And they can be stored in non-consecutive orders too, although this is rare.) So loading the bytes this way will produce different values from the same bytes in memory based on what order the C implementation uses.
But loading the bytes and using shifts to combine them will always produce the same value from the same bytes in memory regardless of what order the C implementation uses.
The first method should be avoided, because there is no need for it. If one desires to interpret the bytes using the C implementation’s ordering, this can be done with:
uint64_t t;
memcpy(&t, bytes+i, sizeof t);
const uint64_t v = t;
Using memcpy provides a portable way of aliasing the uint64_t to store bytes into it. Good compilers recognize this idiom and will optimize the memcpy to a load from memory, if suitable for the target architecture (and if optimization is enabled).
If one desires to interpret the bytes using little-endian ordering, as shown in the code in the question, then the second method may be used. (Sometimes platforms will have routines that may provide more efficient code for this.)
You can also use memcpy
uint64_t n;
memcpy(&n, bytes + i, sizeof(uint64_t));
const uint64_t v = n;
The first option has two big problems that qualify as undefined behavior (anything can happen):
A uint8_t* or array of uint8_t is not necessarily aligned the same way as required by a larger type like uint64_t. Simply casting to uint64_t* leads to misaligned access. This can cause hardware exceptions, program crashes, slower code etc, all depending on the alignment requirements of the specific target.
It violates the internal type system of C, where each object in memory known by the compiler has an "effective type" that the compiler keeps track of. Based on this, the compiler is allowed to make certain assumptions during optimization about whether a certain memory region has been accessed or not. If your code violates these type rules, as it would in this case, wrong machine code could get generated.
This is most commonly referred to as the strict aliasing rule and your cast followed by dereferencing would be a so-called "strict aliasing violation".
The second option is sound code, because:
When doing shifts or other forms of bitwise arithmetic, a large integer type should be used. That is, unsigned int or larger - depending on system. Using signed types or small integer types can lead to undefined behavior or unexpected results. See Implicit type promotion rules regarding problems with small integer types implicitly changing signedness in some expressions.
If not for the cast to uint64_t, then the bytes[i+7] << 56 shift would involve an implicit promotion of the left operand from uint8_t to int, which would be a bug. Because if the most significant bit (MSB) of the byte is set and we shift into/beyond the sign bit, we invoke undefined behavior - again, anything can happen.
And naturally we need to use a 64 bit type in this specific case or otherwise we wouldn't be able to shift as far as 56 bits. Shifting beyond the range of the type of the left operand is also undefined behavior.
Note that whether to pick the order of bytes[i+7] << 56 versus the alternative bytes[i+0] << 56 depends on the byte order of the data stored in the array, not on the CPU. Bit shifts are nice since the shift itself behaves the same whether the host is big or little endian. But you must know in advance which byte in the source array should correspond to the most significant one. The code you have here will work if the array was built using little-endian formatting, since the last byte of the array is shifted to the most significant position.
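The choice of byte order can be sketched with three small helpers. The names load_le64, load_be64 and load_native64 are hypothetical; the first two give the same result on any host, while the memcpy version follows whatever order the implementation uses.

```c
#include <stdint.h>
#include <string.h>

/* p[0] is the least significant byte (little-endian data). */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int k = 7; k >= 0; k--)
        v = (v << 8) | p[k];   /* v is 64-bit, so the shift is done in uint64_t */
    return v;
}

/* p[0] is the most significant byte (big-endian data). */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int k = 0; k < 8; k++)
        v = (v << 8) | p[k];
    return v;
}

/* Host byte order: copy the object representation without aliasing. */
static uint64_t load_native64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For the bytes 01 02 03 04 05 06 07 08, load_le64 gives 0x0807060504030201 and load_be64 gives 0x0102030405060708; load_native64 matches one of the two on common little- or big-endian hosts.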
As for the uint64_t const v = , the const qualifier is a bit strange to have at local scope like that. It's harmless but confusing and doesn't really add anything of value inside a local scope. I would just drop it.

Operator "<<=": What does it mean?

I need help solving this problem in my mind, so if anyone had a similar problem it would help me.
Here's my code:
char c=0xAB;
printf("01:%x\n", c<<2);
printf("02:%x\n", c<<=2);
printf("03:%x\n", c<<=2);
Why the program prints:
01:fffffeac
02:ffffffac
03:ffffffb0
What I expected to print, that is, what I got on paper is:
01:fffffeac
02:fffffeac
03:fffffab0
I obviously realized I didn't know what the operator <<= was doing, I thought c = c << 2.
If anyone can clarify this, I would be grateful.
You're correct in thinking that
c <<= 2
is equivalent to
c = c << 2
But you have to remember that c is a single byte (on almost all systems), it can only contain eight bits, while a value like 0xeac requires 12 bits.
When the value 0xeac is assigned back to c then the value will be truncated and the top bits will simply be ignored, leaving you with 0xac (which when promoted to an int becomes 0xffffffac).
<<= means shift and assign. It's the compound assignment version of c = c << 2;.
There are several problems here:
char c=0xAB; is not guaranteed to give a positive result, since char could be an 8 bit signed type. See Is char signed or unsigned by default?. In that case 0xAB gets converted to a negative number in an implementation-defined way. Avoid this bug by always using uint8_t when dealing with raw binary bytes.
c<<2 is subject to Implicit type promotion rules - specifically c will get promoted to a signed int. If the previous issue occurred and your char got a negative value, the promoted operand is a negative int.
Left-shifting negative values in C invokes undefined behavior - it is always a bug. Shifting signed operands in general is almost never correct.
%x isn't a suitable format specifier to print the int you ended up with, since it expects an unsigned int, nor is it suitable for char.
As for how to fix the code, it depends on what you wish to achieve. It's recommended to cast to uint32_t before shifting.
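A small sketch of what the answers describe, using fixed-width types so that nothing depends on whether plain char is signed. The helper names shl_u8 and as_printed_hex are hypothetical, and the second one assumes a 32-bit int when comparing against the output in the question.

```c
#include <stdint.h>

/* Shift a raw byte left in an unsigned 32-bit type, then keep only the low
 * 8 bits. The truncation is what assigning the result back to an 8-bit
 * variable does. count must be less than 32. */
static uint8_t shl_u8(uint8_t value, unsigned int count)
{
    return (uint8_t)((uint32_t)value << count);
}

/* What %x ends up displaying for a negative 8-bit value: the value is
 * sign-extended to int before it is reinterpreted as unsigned. */
static unsigned int as_printed_hex(int8_t value)
{
    return (unsigned int)(int)value;
}
```

Here shl_u8(0xAB, 2) keeps 0xAC out of 0x2AC, and shifting that again keeps 0xB0, which matches the low bytes of the second and third lines the program printed. The leading ff digits come from the sign extension shown in as_printed_hex.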

Explain how specific C #define works

I have been looking at some of the codes at http://www.netlib.org/fdlibm/ to see how some functions work and I was looking at the code for e_log.c and in some parts of the code it says:
hx = __HI(x); /* high word of x */
lx = __LO(x); /* low word of x */
The code for __HI(x) and __LO(x) is:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
which I really don't understand because I am not familiar with this type of C. Can someone please explain to me what __HI(x) and __LO(x) are doing?
Also later in the code for the function there is a statement:
__HI(x) = hx|(i^0x3ff00000);
Can someone please explain to me:
how is it possible to make a function equal to something (I generally work with python so I don't really know what is going on)?
what are __HI(x) and __LO(x) doing?
what does the program mean by "high word" and "low word" of x?
The final purpose of my analysis is understanding this code in order to port it into a Python implementation
These macros use compiler-dependent properties to access the representations of double types.
In C, all objects other than bit-fields are represented as sequences of bytes. The fdlibm code you are looking at is designed for implementations where int is four bytes and the double type is represented using eight bytes in a format defined by the IEEE-754 floating-point specification. That format is called binary64 or IEEE-754 basic 64-bit binary floating-point. It is also designed for an implementation where the C compiler guarantees that aliasing via pointer conversions is supported. (This is not guaranteed by the C standard, but C implementations may support it.)
Consider a double object named x. Given these macros:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
When __LO(x) is used in source code, it is replaced by *(int*)&x. The &x takes the address of x. The address of x has type double *. The cast (int *) converts this to int *, a pointer to an int. Then * dereferences this pointer, resulting in a reference to the int that is at the low-address part of x.
When __HI(x) is used in the source code, (int*)&x again points to the low-address part of x. Adding 1 changes it to point to the high-address part. Then * dereferences this, resulting in a reference to the int that is at the high-address part.
The routines in fdlibm are special mathematical routines. To operate, they need to examine and modify the bytes that represent double values. The __LO and __HI macros give them this access.
These definitions of __HI and __LO work for implementations that store the double values in little-endian order (with the “least significant” part of the double in the lower-addressed memory location). The fdlibm code may contain alternate definitions for big-endian systems, likely selected by some #if statement.
In the code __HI(x) = hx|(i^0x3ff00000);, the value 0x3ff00000 is the high word of the double value 1.0: its exponent field holds the bias (1023) and the significand bits in that word are all zero. Without context, we cannot say precisely what is happening here, but the code appears to be merging hx with some value from i. It is likely completing some computation of the bytes representing a new double value it is creating and storing those bytes in the “high” part of a double object.
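For comparison, here is a sketch of how the same two words can be reached without pointer casts. It assumes double is IEEE-754 binary64 and is stored with the same byte order as uint64_t; the names high_word, low_word and with_high_word are hypothetical and are not part of fdlibm.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Upper 32 bits of the representation: sign, exponent, top of significand. */
static uint32_t high_word(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits >> 32);
}

/* Lower 32 bits of the representation: the rest of the significand. */
static uint32_t low_word(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Return x with its upper 32 bits replaced, like __HI(x) = hi; */
static double with_high_word(double x, uint32_t hi)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = ((uint64_t)hi << 32) | (bits & 0xFFFFFFFFu);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the words are extracted with shifts on the 64-bit integer, this version does not need separate definitions for little- and big-endian hosts.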
I add a reply to integrate the one already present (not substitute).
hx = __HI(x); /* high word of x */
lx = __LO(x); /* low word of x */
Comments are useful... even if in this case the macro name could be clear enough. "high" and "low" refer to the two halves of an integer representation, typically of a 16- or 32-bit value; the two 4-bit halves of an 8-bit byte are called "nibbles".
If we take a 16-bit unsigned integer which can range from 0 to 65535, or in hex 0x0000 to 0xFFFF, for example 0x1234, the two halves are:
0x1234
^^-------------------- lower half, or "low"
^^---------------------- upper half, or "high"
Note that "lower" means the less significant part. The correct way to get the two halves, assuming 16 bits, is to do a bitwise AND with 0xFF to get lo(), and to shift right by 8 bits (divide by 256) to get hi().
Now, inside a CPU the number 0x1234 is written in two consecutive locations, either as 0x12 then 0x34 if big-endian, or 0x34 then 0x12 if little-endian. Given this, other ways are possible to read single halves, reading the correct one directly from memory without calculation. To get the lo() of 0x1234 in a little endian machine, it is possible to read the single byte in the first location.
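The two ways of getting a half can be sketched as follows. The helper names lo8, hi8 and first_byte_in_memory are hypothetical; the first two are arithmetic and give the same answer everywhere, while the third reads memory directly and so depends on the machine's byte order.

```c
#include <stdint.h>

/* Arithmetic extraction: independent of how the value is stored. */
static uint8_t lo8(uint16_t v) { return (uint8_t)(v & 0xFFu); }
static uint8_t hi8(uint16_t v) { return (uint8_t)(v >> 8); }

/* Direct memory access: returns the byte at the lowest address, which is
 * the low half on a little-endian machine and the high half on a
 * big-endian one. Reading through unsigned char is always permitted. */
static uint8_t first_byte_in_memory(uint16_t v)
{
    const unsigned char *p = (const unsigned char *)&v;
    return p[0];
}
```

For 0x1234, lo8 gives 0x34 and hi8 gives 0x12 on any machine, whereas first_byte_in_memory gives 0x34 only on a little-endian one.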
From the question:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
Both __LO and __HI peek directly into memory (the less sure way) rather than computing the halves with a bitwise AND and a shift. The value to be split is twice the size of an int: if int is 32 bits, the value to be split is 64 bits long, which matches the size of a double. And there is another caveat: these macros can read the halves, but they can also be used to write the two halves separately. In fact, from the question:
__HI(x) = hx|(i^0x3ff00000);
the result is to set only the HI part (upper, most significant) of x. Note also the value used, 0x3ff00000, which is 32 bits wide; this is consistent with x being 64 bits long, so that each half is 32 bits.
Hope this is clear enough to translate C to Python. You should work on the 64-bit representation of the value as an integer. To get the LO() part, use a bitwise AND with 0xFFFFFFFF; to get HI(), shift right by 32 bits or do the equivalent division.
When HI and LO are to the left of an assignment, only that half of the value is written, and you should construct separately the two halves and sum them up (or bitwise or them together).
Hope it helps...
#define A B
is a preprocessor directive that substitutes literal A with literal B all over the source code before the compilation.
#define A(x) B
is a function-like preprocessor macro which uses a parameter x in order to do a parameterized preprocessor substitution. In this case, B can be a function of x as well.
Your macros
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
// called as
__HI(x) = hx|(i^0x3ff00000);
Since it is just a matter of code substitution, the assignment is perfectly legit. Why? Because in both cases the macro is substituted by an lvalue, that is, an expression that designates an object and can appear on the left of an assignment.
That lvalue is in both cases an object of type int:
take x's address
cast it to a pointer to int
dereference it (in case of __LO())
add 1 and then dereference it (in case of __HI()).
What it will actually point to depends on the architecture, because the size of int, and therefore the distance covered by adding 1 to an int pointer, is architecture dependent. Endianness also has to be taken into account.
What we can say is that they are designed to access the lower and the higher halves of a data type whose size is 2*sizeof (int) (so, if for example int is 32 bits wide, they give access to the lower 32 bits and to the upper 32 bits). Furthermore, from the macro definitions we understand that they target a little-endian architecture (the least significant part comes first).
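To see why assigning to a macro "call" compiles, here is a minimal sketch with a hypothetical macro, FIRST_BYTE, written in the same style. After substitution the left-hand side is simply a dereferenced pointer, which can be assigned to. It goes through unsigned char so that the access stays within the aliasing rules.

```c
#include <stdint.h>

/* Hypothetical macro in the style of __LO: expands to an expression naming
 * the lowest-addressed byte of x. */
#define FIRST_BYTE(x) (*(unsigned char *)&(x))

static uint32_t set_first_byte(uint32_t value, unsigned char byte)
{
    /* After preprocessing: (*(unsigned char *)&(value)) = byte; */
    FIRST_BYTE(value) = byte;
    return value;
}
```

Which part of the 32-bit value changes depends on endianness: set_first_byte(0, 0xFF) gives 0x000000FF on a little-endian machine and 0xFF000000 on a big-endian one.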
In order to port code containing these macros to Python, you will need to do it at a higher level, since Python does not support pointers.
These tips don't solve your specific task, but they give you a working method for this task and similar ones:
A way to understand what a macro does is checking how it is actually translated by the preprocessor. This can be done on most compilers through the -E compiler option.
Use a debugger to understand the functionality: set a breakpoint just before the call to the macro, and analyze its effects on addresses and variables.

AVR uint8_t doesn't get correct value

I have a uint8_t that should contain the result of a bitwise calculation. The debugger says the variable is set correctly, but when I check the memory, the variable is always 0. The code proceeds as if the variable were 0, no matter what the debugger tells me. Here's the code:
temp = (path_table & (1 << current_bit)) >> current_bit;
//temp is always 0, debugger shows correct value
if (temp > 0) {
DS18B20_send_bit(pin, 0x01);
} else {
DS18B20_send_bit(pin, 0x00);
}
Temp's a uint8_t, path_table's a uint64_t and current_bit's a uint8_t. I've tried to make them all uint64_t but nothing changed. I've also tried using unsigned long long int instead. Nothing again.
The code always enters the else clause.
Chip's Atmega4809, and uses uint64_t in other parts of the code with no issues.
Note - If anyone knows a more efficient/compact way to extract a single bit from a variable, I would really appreciate it if you could share ^^
1 is an integer constant, of type int. The expression 1 << current_bit also has type int, but for 16-bit int, the result of that expression is undefined when current_bit is larger than 14. The behavior being undefined in your case, it is plausible that your debugger presents results for the overall expression that seem inconsistent with the observed behavior. If you used an unsigned int constant instead, i.e. 1u, the shift would be well defined for current_bit up to 15, but it would still be undefined for 16 or more, because the shift count would then be at least the width of the type. Either way, a 16-bit mask cannot reach the upper bits of a 64-bit path_table.
Solve this problem by performing the computation in a type wide enough to hold the result. Here's a compact, correct, and pretty clear way to correct your code to do that:
DS18B20_send_bit(pin, (path_table & (((uint64_t) 1) << current_bit)) != 0);
Or if path_table has an unsigned type then I prefer this, though it's more of a departure from your original:
DS18B20_send_bit(pin, (path_table >> current_bit) & 1);
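Both forms can be wrapped in small helpers so they are easy to compare. The names bit_by_shift and bit_by_mask are hypothetical; each returns 0 or 1 and requires bit to be less than 64.

```c
#include <stdint.h>

/* Move the wanted bit down to position 0, then mask everything else off.
 * The shift is done on the 64-bit operand, so no narrow type is involved. */
static uint8_t bit_by_shift(uint64_t table, uint8_t bit)
{
    return (uint8_t)((table >> bit) & 1u);
}

/* Build a 64-bit mask first; the cast makes the shift happen in uint64_t
 * instead of int. */
static uint8_t bit_by_mask(uint64_t table, uint8_t bit)
{
    return (table & ((uint64_t)1 << bit)) != 0;
}
```

With either helper the call from the question would read DS18B20_send_bit(pin, bit_by_shift(path_table, current_bit)).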
Realization #1 here is that the AVR core is 1980s-1990s technology. It is not an x64 PC that chews 64 bit numbers for breakfast, but an extremely inefficient 8-bit MCU. As such:
It likes 8 bit arithmetic.
It will struggle with 16 bit arithmetic, by doing tricks with 16 bit index registers, double accumulators or whatever 8 bit core tricks it prefers to do.
It will literally take ages to execute 32 bit arithmetic, by invoking software libraries inline.
It will probably melt through the floor if attempting 64 bit arithmetic.
Before you do anything else, you need to get rid of all 64 bit arithmetic and radically minimize the use of 32 bit arithmetic. Period. There should be no single variable of uint64_t in your code or you are doing it very very wrong.
With this revelation also comes the fact that 8 bit MCUs such as the AVR typically have an int type which is 16 bits.
In the code 1<<current_bit, the integer constant 1 is of type int. Meaning that if current_bit is 15 or larger, you will shift bits into the sign bit of this temporary int. This is always a bug. Strictly speaking this is undefined behavior. In practice, you might end up with random change of sign of your numbers.
To avoid this, never use any form of bitwise operators on signed numbers. When mixing integer constants such as 1 with bitwise operators, change them to 1u to avoid bugs like the one mentioned.
If anyone knows a more efficient/compact way to extract a single bit from a variable i would really appreciate if you could share
The most efficient way in C is: uint8_t variable; ... if(variable & (1u << bits)). This should translate to the relevant "branch if bit set" instruction.
My general advice would be to find your tool chain's disassembler and see what machine code the C code actually generated. You don't have to be an assembler guru to read it; peeking at the instruction set should be enough.

Using bit operations to "turn off" binary digits of a pointer

I was able to use bit operations to "turn off" binary digits of a number.
Ex:
x = x & ~(1<<0)
x = x & ~(1<<1)
(and repeat until desired number of digits starting from the right are changed to 0)
I would like to apply this technique to a pointer's address.
Unfortunately, the & operator cannot be used with pointers. Using the same lines of code as above, where x is a pointer, the compiler says "invalid operands to binary & (have int and int)."
I tried to typecast the pointers as ints, but that doesn't work as I assume the ints are too small (and I just realized I'm not allowed to cast).
(note: though this is part of a homework problem, I've already reasoned out why I need to turn off some digits after a good couple hours, so I'm fine in that regard. I'm simply trying to see if I can get a clever technique to do what I want to do here).
Restrictions: I cannot use loops, conditionals, any special functions, constants greater than 255, division, mod.
(edit: added restrictions to the bottom)
Use uintptr_t from <stdint.h>. You should always use unsigned types for bit twiddling, and (u)intptr_t is specifically chosen to be able to hold a pointer's value.
Note however that adjusting a pointer manually and dereferencing it is undefined behaviour, so watch your step. You must recover the exact original value of the pointer (or another valid pointer) before doing so.
Edit : from your comment I understand that you don't plan on dereferencing the twiddled pointer at all, so no undefined behaviour for you. Here is how you can check if your pointers share the same 64-byte block :
uintptr_t p1 = (uintptr_t)yourPointer1;
uintptr_t p2 = (uintptr_t)yourPointer2;
uintptr_t mask = ~(uintptr_t)63u; // Shave off 6 low-order bits
return (p1 & mask) == (p2 & mask);
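Wrapped into a complete function, the check might look like the sketch below. The name same_64_byte_block is hypothetical, and the code assumes that converting a pointer to uintptr_t yields its numeric address, which is implementation-defined but holds on mainstream platforms. The pointers are only compared, never dereferenced.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if both addresses fall in the same 64-byte-aligned block, i.e. they
 * agree in every bit except the 6 least significant ones. */
static bool same_64_byte_block(const void *a, const void *b)
{
    const uintptr_t mask = ~(uintptr_t)63u;   /* clears the 6 low-order bits */
    return ((uintptr_t)a & mask) == ((uintptr_t)b & mask);
}
```

For a buffer that starts on a 64-byte boundary, offsets 0 and 63 share a block while offsets 63 and 64 do not, even though the latter pair is only one byte apart.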
C language standard library includes the (optional though) type intptr_t, for which there is guarantee that "any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer".
Of course, if you perform bitwise operations on the integer and then convert the result back to a pointer and dereference it, the result is undefined behaviour.
Edit:
How unfortunate haha. I need a function to show two pointers are in
the same 64-byte block of memory. This holds true so long as every
digit but the least significant 6 digits of their binary
representations are equal. By making sure the last 6 digits are all
the same (ex: 0), I can return true if both pointers are equal. Well,
at least I hope so.
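You should be able to check if they're within 64 bytes of each other with something like this (note that this is a distance test, which is weaker than checking that both lie in the same 64-byte-aligned block):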
if ((char *)high_pointer - (char *)low_pointer < 64) {
// do stuff
}
Edit2: This is likely to be undefined behaviour as pointed out by chris.
Original post:
You're probably looking for intptr_t or uintptr_t. The standard says you can cast to and from these types to pointers and have the value equal to the original.
However, despite it being a standard type, it is optional so some library implementations may choose not to implement it. Some architectures might not even represent pointers as integers so such a type wouldn't make sense.
It is still better than casting to and from an int or a long since it is guaranteed to work on implementations that supply it. Otherwise, at least you'll know at compile time that your program will break on a certain implementation/architecture.
(Oh, and as other answers have stated, manually changing the pointer when cast to an integer type and dereferencing it is undefined behaviour)
