When I write the following program and use the GNU C++ compiler, the output is 1 which I think is due to the rotation operation performed by the compiler.
#include <iostream>
int main()
{
int a = 1;
std::cout << (a << 32) << std::endl;
return 0;
}
But logically, as it's said that the bits are lost if they overflow the bit width, the output should be 0. What is happening?
The code is on ideone, http://ideone.com/VPTwj.
This is caused due to a combination of an undefined behaviour in C and the fact that code generated for IA-32 processors has a 5 bit mask applied on the shift count. This means that on IA-32 processors, the range of a shift count is 0-31 only. 1
From The C programming language 2
The result is undefined if the right operand is negative, or greater than or equal to the number of bits in the left expression’s type.
From IA-32 Intel Architecture Software Developer’s Manual 3
The 8086 does not mask the shift count. However, all other IA-32 processors (starting with the Intel 286 processor) do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions.
1 http://codeyarns.com/2004/12/20/c-shift-operator-mayhem/
2 A7.8 Shift Operators, Appendix A. Reference Manual, The C Programming Language
3 SAL/SAR/SHL/SHR – Shift, Chapter 4. Instruction Set Reference, IA-32 Intel Architecture Software Developer’s Manual
In C++, shift is only well-defined if you shift a value less steps than the size of the type. If int is 32 bits, then only 0 to, and including, 31 steps is well-defined.
So, why is this?
If you take a look at the underlying hardware that performs the shift, if it only has to look at the lower five bits of a value (in the 32 bit case), it can be implemented using less logical gates than if it has to inspect every bit of the value.
Answer to question in comment
C and C++ are designed to run as fast as possible, on any available hardware. Today, the generated code is simply a ''shift'' instruction, regardless how the underlying hardware handles values outside the specified range. If the languages would have specified how shift should behave, the generated could would have to check that the shift count is in range before performing the shift. Typically, this would yield three instructions (compare, branch, shift). (Admittedly, in this case it would not be necessary as the shift count is known.)
It's undefined behaviour according to the C++ standard:
The value of E1 << E2 is E1
left-shifted E2 bit positions; vacated
bits are zero-filled. If E1 has an
unsigned type, the value of the result
is E1 × 2^E2, reduced modulo one more
than the maximum value representable
in the result type. Otherwise, if E1
has a signed type and non-negative
value, and E1×2^E2 is representable in
the result type, then that is the
resulting value; otherwise, the
behavior is undefined.
The answers of Lindydancer and 6502 explain why (on some machines) it happens to be a 1 that is being printed (although the behavior of the operation is undefined). I am adding the details in case they aren't obvious.
I am assuming that (like me) you are running the program on an Intel processor. GCC generates these assembly instructions for the shift operation:
movl $32, %ecx
sall %cl, %eax
On the topic of sall and other shift operations, page 624 in the Instruction Set Reference Manual says:
The 8086 does not mask the shift count. However, all other Intel Architecture processors
(starting with the Intel 286 processor) do mask the shift count to five bits, resulting in a
maximum count of 31. This masking is done in all operating modes (including the virtual-8086
mode) to reduce the maximum execution time of the instructions.
Since the lower 5 bits of 32 are zero, then 1 << 32 is equivalent to 1 << 0, which is 1.
Experimenting with larger numbers, we would predict that
cout << (a << 32) << " " << (a << 33) << " " << (a << 34) << "\n";
would print 1 2 4, and indeed that is what is happening on my machine.
It doesn't work as expected because you're expecting too much.
In the case of x86 the hardware doesn't care about shift operations where the counter is bigger than the size of the register (see for example SHL instruction description on x86 reference documentation for an explanation).
The C++ standard didn't want to impose an extra cost by telling what to do in these cases because generated code would have been forced to add extra checks and logic for every parametric shift.
With this freedom implementers of compilers can generate just one assembly instruction without any test or branch.
A more "useful" and "logical" approach would have been for example to have (x << y) equivalent to (x >> -y) and also the handling of high counters with a logical and consistent behavior.
However this would have required a much slower handling for bit shifting so the choice was to do what the hardware does, leaving to the programmers the need to write their own functions for side cases.
Given that different hardware does different things in these cases what the standard says is basically "Whatever happens when you do strange things just don't blame C++, it's your fault" translated in legalese.
Shifting a 32 bit variable by 32 or more bits is undefined behavior and may cause the compiler to make daemons fly out of your nose.
Seriously, most of the time the output will be 0 (if int is 32 bits or less) since you're shifting the 1 until it drops off again and nothing but 0 is left. But the compiler may optimize it to do whatever it likes.
See the excellent LLVM blog entry What Every C Programmer Should Know About Undefined Behavior, a must-read for every C developer.
Since you are bit shifting an int by 32 bits; you'll get: warning C4293: '<<' : shift count negative or too big, undefined behavior in VS. This means that you're shifting beyond the integer and the answer could be ANYTHING, because it is undefined behavior.
You could try the following. This actually gives the output as 0 after 32 left shifts.
#include<iostream>
#include<cstdio>
using namespace std;
int main()
{
int a = 1;
a <<= 31;
cout << (a <<= 1);
return 0;
}
I had the same problem and this worked for me:
f = ((long long)1 << (i-1));
Where i can be any integer bigger than 32 bits. The 1 has to be a 64 bit integer for the shifting to work.
Try using 1LL << 60. Here LL is for long long. You can shift now to max of 61 bits.
Related
Say for example I have a uint8_t that can be of any value, and I only want to flip all the bits from the least significant bit up to the most significant last 1 bit value? How would I do that in the most efficient way?, Is there a solution where I can avoid using a loop?
here are some cases:
left side is the original bits - right side after the flips.
00011101 -> 00000010
00000000 -> 00000000
11111111 -> 00000000
11110111 -> 00001000
01000000 -> 00111111
[EDIT]
The type could also be larger than uint8_t, It could be uint32_t, uint64_t and __uint128_t. I just use uint8_t because it's the easiest size to show in the example cases.
In general I expect that most solutions will have roughly this form:
Compute the mask of bits that need to flipped
XOR by that mask
As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:
Find the 1-based position p of the most significant 1, by leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32 whichever is appropriate).
Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.
There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).
Putting it together:
x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))
Which could be further simplified to
_bzhi_u64(~x, 64 - _lzcnt_u64(x))
As shown by Peter. This doesn't follow the original 2-step plan, rather all bits are flipped, and then the bits that were originally leading zeroes are reset.
Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 264, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute, at least I could not come up with a good way to do it, it seems like a dead end to me.
Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:
// Get all-ones below the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift
// even though you have to special case x==0
m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32; // last step should be removed if x is 32-bit
AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.
This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.
TL:DR: use a uint64_t shift to implement efficiently with uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr that's baseline for x86) the n==0 case is still special.
For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like foo >> (c&63)
Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.
Same for 32-bit operand-size shifts: they mask c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or 32-bit for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand size is efficient, no need to mess with partial-register writes.)
This strategy is even more efficient than bzhi for a type narrower than register width.
// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good
#ifdef __LZCNT__
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
uint64_t mask32 = 0xffffffff; // (uint64_t)(uint32_t)-1u32
// this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
// If lznct isn't available, we can't avoid handling n==0 zero specially
uint32_t mask = mask32 >> _lzcnt_u32(n);
return n ^ mask;
}
#endif
This works equivalently for uint8_t and uint16_t (literally the same code with same mask, using a 32-bit lzcnt on them after zero-extension). But not uint64_t (You could use a unsigned __int128 shift, but shrd masks its shift count mod 64 so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)
Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang11 and later assume that can't produce 32 even when they compile it to an lzcnt instruction1, and optimize the shift operand-size down to 32 which will act as mask32 >> clz(n) & 31.
# clang 14 -O3 -march=haswell (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
lzcnt eax, edi # skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt. Clang doesn't care, it's reckless about false deps except inside a loop in a single function.
mov ecx, 4294967295
shrx rax, rcx, rax
xor eax, edi
ret
Without BMI2, e.g. with -march=bdver1 or barcelona (aka k10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.
(I'm curious if Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsf. They lack BMI1/BMI2, but lzcnt has its own feature flag.
It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but Instlat does have those available if someone wants to check.)
Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.
This shrx version can keep its -1 constant around in a register across loops. So the mov reg,-1 can be hoisted out of a loop after inlining, if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its latency critical path length is the same as this. (lzcnt / sub / bzhi).
Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of them back to the original state. This could reduce critical path latency. IDK if a C compiler could be coaxed into emitting this. Especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero, but only AMD documents this fact. (Intel says it's an "undefined" result.)
(TODO: finish this hand-written asm idea.)
Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt to shorten the critical path latency? See the Godbolt link where I was playing with that.
ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.
x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; only useful if you have multiple elements to do.)
Footnote 1: clang codegen with __builtin_clz or ...ll
Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt. The 63-bsr a compiler would use without lzcnt available would not produce the 64 we need for that case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.
If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL
Here’s a simple example for 32 bit ints that works with gcc and compatible compilers (clang et al), and is portable across most architectures.
uint32_t flip(uint32_t n)
{
if (n == 0) return 0;
uint32_t mask = ~0U >> __builtin_clz(n);
return n ^ mask;
}
DEMO
We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM), and we were using a shift that allowed a count of 32. (In C, shifts by the type-width or larger are undefined behaviour. On x86, in practice the shift count is masked &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)
Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.
See #PeterCordes's answer for details.
If your use case is performance-critical you might also want to consider a SIMD implementation for performing the bit flipping operation on a large number of elements. Here's an example using AVX512 for 32 bit elements:
void flip(const uint32_t in[], uint32_t out[], size_t n)
{
assert((n & 7) == 0); // for this example we only handle arrays which are vector multiples in size
for (size_t i = 0; i + 8 <= n; i += 8)
{
__m512i vin = _mm512_loadu_si512(&in[i]);
__m512i vlz = _mm512_lzcnt_epi32(vin);
__m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
__m512i vout = _mm512_xor_si512(vin, vmask);
_mm512_storeu_si512(&out[i], vout);
}
}
This uses the same approach as other solutions, i.e. count leading zeroes, create mask, XOR, but for 32 bit elements it processes 8 elements per loop iteration. You could implement a 64 bit version of this similarly, but unfortunately there are no similar AVX512 intrinsics for element sizes < 32 bits or > 64 bits.
You can see the above 32 bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).
I recently picked up a copy of Applied Cryptography by Bruce Schneier and it's been a good read. I now understand how several algorithms outlined in the book work, and I'd like to start implementing a few of them in C.
One thing that many of the algorithms have in common is dividing an x-bit key, into several smaller y-bit keys. For example, Blowfish's key, X, is 64-bits, but you are required to break it up into two 32-bit halves; Xl and Xr.
This is where I'm getting stuck. I'm fairly decent with C, but I'm not the strongest when it comes to bitwise operators and the like.
After some help on IRC, I managed to come up with these two macros:
#define splitup(a, b, c) {b = a >> 32; c = a & 0xffffffff; }
#define combine(a, b, c) {a = (c << 32) | a;}
Where a is 64 bits and b and c are 32 bits. However, the compiler warns me about the fact that I'm shifting a 32 bit variable by 32 bits.
My questions are these:
What's bad about shifting a 32-bit variable 32 bits? I'm guessing it's undefined, but these macros do seem to be working.
Also, would you suggest I go about this another way?
As I said, I'm fairly familiar with C, but bitwise operators and the like still give me a headache.
EDIT
I figured out that my combine macro wasn't actually combining two 32-bit variables, but simply ORing 0 by a, and getting a as a result.
So, on top of my previous questions, I still don't have a method of combining the two 32-bit variables to get a 64-bit one; a suggestion on how to do it would be appreciated.
Yes, it is undefined behaviour.
ISO/IEC 9899:1999 6.5.7 Bitwise shift operators ¶3
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
C11 aka ISO/IEC 9899:2011 says the same.
You should first cast b to the target integer type. Another point is that you should put parentheses around the macro parameters to avoid surprises by operator precedences. Additionally, the comma operator is very useful here, allowing you to avoid the braces, so that the macro can be used as a normal command, closed with a semicolon.
#define splitup(a,b,c) ( (b) = (a) >> 32, (c) = (a) & 0xffffffff )
#define combine(a,b,c) ( (a) = ((unsigned long long)(b) << 32) | (c) )
Additional casts may be necessary for `splitup to silence warnings about precision loss by over-paranoid compilers.
#define splitup(a,b,c) ( (b) = (unsigned long)((a) >> 32), (c) = (unsigned long)((a) & 0xffffffff) )
And please don't even think about using your self-written encryption for production code.
Shifting a 32-bit value by 32 bits or more is undefined in C and C++. One of the reasons it was left undefined is that on some hardware platforms the 32-bit shift instruction only takes into account 5 lowest bits of the supplied shift count. This means that whatever shift count you pass, it will be interpreted modulo 32. Attempting to shift by 32 on such platform will actually shift by 0, i.e. not shift at all.
The language authors did not want to burden the compilers written for such platform with the task of analyzing the shift count before doing the shift. Instead, the language specification says that the behavior is undefined. This means that if you want to get a 0 value from a 32-bit shift by 32 (or more), it is up to you to recognize the situation and process it accordingly.
what's bad about shifting a 32-bit variable 32 bits?
Its better to assign 0 to n-bit integer than to shift it by n-bits.
Example:
0 0 1 0 1 ----- 5 bit Integer
0 1 0 1 0 ----- 1st shift
1 0 1 0 0 ----- 2nd shift
0 1 0 0 0 ----- 3rd shift
1 0 0 0 0 ----- 4th shift
0 0 0 0 0 ----- 5th shift (all the bits are shifted!)
I still don't have a method of combining the two 32-bit variables to get a 64-bit one
Consider: a is 64 bit, b and c are 32 bit
a = b;
a = a << 32; //Note: a is 64 bit
a = a | c;
Unless this is some "reinventing the wheel to understand how it works" project, don't implement your own crypto functions.
Ever.
It's hard enough to use the available algorithms to work (and to choose the right one), don't shoot yourself in the foot by putting in production some home grown cryptor API. Chances are your encryption won't encrypt
What's bad about shifting a 32-bit variable 32 bits?
In addition to what have been already said, the 32nd bit is the sign bit, and you may get sign extension to preserve the sing, thereby losing significant bits.
Let uint8 and uint16 be datatypes for 8bit and 16bit positive integers.
uint8 a = 1;
uint16 b = a << 8;
I tested this program on 32Bit architecture with result
b = 256
Would the same programm on a system with registers of 8bit length yield the result:
b = 0 ?
because all bits in register gets shifted to 0 by a << 8?
Registers are irrelevant. This is about the width of your types.
When you shift a value by more bits than it possesses, the behaviour is undefined. The compiler, the program, the computer, the tax office can legally manifest any results accordingly. And, no, that's not just theoretical.
However, operands in C are promoted before interesting things are done on them. So, your uint8_t becomes an int before the left-shift.
Now it depends on your architecture (as determined by your compiler configuration) as to what happens: is int on your implementation only 8-bit? No, it's not! The result, then — regardless of any "register size" — must abide by the rules of the language, yielding the mathematically appropriate answer (256). And, even if it were, you'd hit that undefined behaviour so the question would be moot.
Under the bonnet, if more than one register is needed to hold a variable, then that's what will and must happen (at whatever performance cost is implied as a result). That's if a register is used at all; remember, you're programming in an abstraction, not hand-crafting machine code. The program snippet you showed can be completely optimised away during compilation and doesn't require any runtime instructions at all.
Would the same programm on a system with registers of 8bit length the result be b=0?
No.
In the expression a << 8 the variable a will get promoted to an int before the bit shift. And an int is guaranteed to be at least 16 bits.
b will have the value 256 on all platforms unless there's a bug in the compiler.
However, if you changed the second line to uint32 b = a << 16; you might get strange results. a would still get promoted to an int, but if int is two bytes long, then a << 16 will invoke undefined behavior.
I have a uint8_t that should contain the result of a bitwise calculation. The debugger says the variable is set correctly, but when i check the memory, the var is always at 0. The code proceeds like the var is 0, no matter what the debugger tells me. Here's the code:
temp = (path_table & (1 << current_bit)) >> current_bit;
//temp is always 0, debugger shows correct value
if (temp > 0) {
DS18B20_send_bit(pin, 0x01);
} else {
DS18B20_send_bit(pin, 0x00);
}
Temp's a uint8_t, path_table's a uint64_t and current_bit's a uint8_t. I've tried to make them all uint64_t but nothing changed. I've also tried using unsigned long long int instead. Nothing again.
The code always enters the else clause.
Chip's Atmega4809, and uses uint64_t in other parts of the code with no issues.
Note - If anyone knows a more efficient/compact way to extract a single bit from a variable i would really appreciate if you could share ^^
1 is an integer constant, of type int. The expression 1 << current_bit also has type int, but for 16-bit int, the result of that expression is undefined when current_bit is larger than 14. The behavior being undefined in your case, then, it is plausible that your debugger presents results for the overall expression that seem inconsistent with the observed behavior. If you used an unsigned int constant instead, i.e. 1u, then the resulting value of temp would be well defined as 0 whenever current_bit was greater than 15, because the result of the left shift would be zero.
Solve this problem by performing the computation in a type wide enough to hold the result. Here's a compact, correct, and pretty clear way to correct your code to do that:
DS18B20_send_bit(pin, (path_table & (((uint64_t) 1) << current_bit)) != 0);
Or if path_table has an unsigned type then I prefer this, though it's more of a departure from your original:
DS18B20_send_bit(pin, (path_table >> current_bit) & 1);
Realization #1 here is that AVR is 1980-1990s technology core. It is not a x64 PC that chews 64 bit numbers for breakfast, but an extremely inefficient 8-bit MCU. As such:
It likes 8 bit arithmetic.
It will struggle with 16 bit arithmetic, by doing tricks with 16 bit index registers, double accumulators or whatever 8 bit core tricks it prefers to do.
It will literally take ages to execute 32 bit arithmetic, by invoking software libraries inline.
It will probably melt through the floor if attempting 64 bit arithmetic.
Before you do anything else, you need to get rid of all 64 bit arithmetic and radically minimize the use of 32 bit arithmetic. Period. There should be no single variable of uint64_t in your code or you are doing it very very wrong.
With this revelation also comes that all 8 bit MCUs always have an int type which is 16 bits.
In the code 1<<current_bit, the integer constant 1 is of type int. Meaning that if current_bit is 15 or larger, you will shift bits into the sign bit of this temporary int. This is always a bug. Strictly speaking this is undefined behavior. In practice, you might end up with random change of sign of your numbers.
To avoid this, never use any form of bitwise operators on signed numbers. When mixing integer constants such as 1 with bitwise operators, change them to 1u to avoid bugs like the one mentioned.
If anyone knows a more efficient/compact way to extract a single bit from a variable i would really appreciate if you could share
The most efficient way in C is: uint8_t variable; ... if(variable & (1u << bits)). This should translate to the relevant "branch if bit set" instruction.
My general advise would be find your tool chain's disassembler and see what machine code that the C code actually generated. You don't have to be an assembler guru to read it, peeking at the instruction set should be enough.
I recently picked up a copy of Applied Cryptography by Bruce Schneier and it's been a good read. I now understand how several algorithms outlined in the book work, and I'd like to start implementing a few of them in C.
One thing that many of the algorithms have in common is dividing an x-bit key, into several smaller y-bit keys. For example, Blowfish's key, X, is 64-bits, but you are required to break it up into two 32-bit halves; Xl and Xr.
This is where I'm getting stuck. I'm fairly decent with C, but I'm not the strongest when it comes to bitwise operators and the like.
After some help on IRC, I managed to come up with these two macros:
#define splitup(a, b, c) {b = a >> 32; c = a & 0xffffffff; }
#define combine(a, b, c) {a = (c << 32) | a;}
Where a is 64 bits and b and c are 32 bits. However, the compiler warns me about the fact that I'm shifting a 32 bit variable by 32 bits.
My questions are these:
What's bad about shifting a 32-bit variable 32 bits? I'm guessing it's undefined, but these macros do seem to be working.
Also, would you suggest I go about this another way?
As I said, I'm fairly familiar with C, but bitwise operators and the like still give me a headache.
EDIT
I figured out that my combine macro wasn't actually combining two 32-bit variables, but simply ORing 0 by a, and getting a as a result.
So, on top of my previous questions, I still don't have a method of combining the two 32-bit variables to get a 64-bit one; a suggestion on how to do it would be appreciated.
Yes, it is undefined behaviour.
ISO/IEC 9899:1999 6.5.7 Bitwise shift operators ¶3
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
C11 aka ISO/IEC 9899:2011 says the same.
You should first cast b to the target integer type. Another point is that you should put parentheses around the macro parameters to avoid surprises by operator precedences. Additionally, the comma operator is very useful here, allowing you to avoid the braces, so that the macro can be used as a normal command, closed with a semicolon.
#define splitup(a,b,c) ( (b) = (a) >> 32, (c) = (a) & 0xffffffff )
#define combine(a,b,c) ( (a) = ((unsigned long long)(b) << 32) | (c) )
Additional casts may be necessary for `splitup to silence warnings about precision loss by over-paranoid compilers.
#define splitup(a,b,c) ( (b) = (unsigned long)((a) >> 32), (c) = (unsigned long)((a) & 0xffffffff) )
And please don't even think about using your self-written encryption for production code.
Shifting a 32-bit value by 32 bits or more is undefined in C and C++. One of the reasons it was left undefined is that on some hardware platforms the 32-bit shift instruction only takes into account 5 lowest bits of the supplied shift count. This means that whatever shift count you pass, it will be interpreted modulo 32. Attempting to shift by 32 on such platform will actually shift by 0, i.e. not shift at all.
The language authors did not want to burden the compilers written for such platform with the task of analyzing the shift count before doing the shift. Instead, the language specification says that the behavior is undefined. This means that if you want to get a 0 value from a 32-bit shift by 32 (or more), it is up to you to recognize the situation and process it accordingly.
what's bad about shifting a 32-bit variable 32 bits?
Its better to assign 0 to n-bit integer than to shift it by n-bits.
Example:
0 0 1 0 1 ----- 5 bit Integer
0 1 0 1 0 ----- 1st shift
1 0 1 0 0 ----- 2nd shift
0 1 0 0 0 ----- 3rd shift
1 0 0 0 0 ----- 4th shift
0 0 0 0 0 ----- 5th shift (all the bits are shifted!)
I still don't have a method of combining the two 32-bit variables to get a 64-bit one
Consider: a is 64 bit, b and c are 32 bit
a = b;
a = a << 32; //Note: a is 64 bit
a = a | c;
Unless this is some "reinventing the wheel to understand how it works" project, don't implement your own crypto functions.
Ever.
It's hard enough to use the available algorithms to work (and to choose the right one), don't shoot yourself in the foot by putting in production some home grown cryptor API. Chances are your encryption won't encrypt
What's bad about shifting a 32-bit variable 32 bits?
In addition to what have been already said, the 32nd bit is the sign bit, and you may get sign extension to preserve the sing, thereby losing significant bits.