Faster way for extracting and combining bits from UINT16 to UINT8 - c

I'm searching for a faster way for my required special extract and combine operation as described below:
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| D1 | D0 | C1 | C0 | B1 | B0 | A1 | A0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | D | C | B | A |
+-------+-------+-------+-------+-------+-------+-------+-------+
For sake of simplicity above is only an 8-bit example, the same applies for 16 bit values. It should be implemented as fast as possible on dsPIC33F microcontroller.
The easy way in C is:
PairFlags |= (ChannelFlags & 0x0003) ? 0x0001 : 0;
PairFlags |= (ChannelFlags & 0x000C) ? 0x0002 : 0;
PairFlags |= (ChannelFlags & 0x0030) ? 0x0004 : 0;
PairFlags |= (ChannelFlags & 0x00C0) ? 0x0008 : 0;
PairFlags |= (ChannelFlags & 0x0300) ? 0x0010 : 0;
PairFlags |= (ChannelFlags & 0x0C00) ? 0x0020 : 0;
PairFlags |= (ChannelFlags & 0x3000) ? 0x0040 : 0;
PairFlags |= (ChannelFlags & 0xC000) ? 0x0080 : 0;
This will produce approx. 40 instructions (with O3) which corresponds to 1µs in my case.
The amount of instruction cycles should be reduced if possible. Is there a faster way either in C or inline assembly?

The following should work for reducing a 16-bit value to 8 bits (with each bit of output formed by ORing a pair of bits of input):
// Set even bits to bits in pair ORed together, and odd bits to 0...
PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
// Compress the '00' or '01' bit pairs down to single '0' or '1' bits...
PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333; // '00hg00fe00dc00ba'
PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F; // '0000hgfe0000dcba'
PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF; // '00000000hgfedcba'
Note: The ^ can be replaced by | in the above for the same result.

Assuming I got everything right (not tested), this seems to generate good, branch-free code at least on gcc and clang for x86 (-O3):
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
( ((ChannelFlags & B1B0)!=0) << B_POS ) |
( ((ChannelFlags & C1C0)!=0) << C_POS ) |
( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
This masks out each individual bitset, then check against zero to end up with 1 or 0 in a temporary int. This value is shifted in position in the result, before everything is finally bitwise OR:ed together. Full code:
#include <stdint.h>
#define A1A0 (3u << 0)
#define B1B0 (3u << 2)
#define C1C0 (3u << 4)
#define D1D0 (3u << 6)
#define A_POS 0
#define B_POS 1
#define C_POS 2
#define D_POS 3
uint8_t convert (uint8_t ChannelFlags)
{
return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
( ((ChannelFlags & B1B0)!=0) << B_POS ) |
( ((ChannelFlags & C1C0)!=0) << C_POS ) |
( ((ChannelFlags & D1D0)!=0) << D_POS ) ;
}
clang disassembly x86 gives 18 instructions branch free:
convert: # #convert
test dil, 3
setne al
test dil, 12
setne cl
add cl, cl
or cl, al
test dil, 48
setne al
shl al, 2
or al, cl
mov ecx, edi
shr cl, 7
shr dil, 6
and dil, 1
or dil, cl
shl dil, 3
or al, dil
ret

Not sure if more efficient but instead of using a ternary if, why not use only bitwise operations ? And just offset it with the bitshift operator
PairFlags = ((ChannelFlags & (0b1 << 0)) | (ChannelFlags & (0b10 << 0))) << 0;
PairFlags = ((ChannelFlags & (0b1 << 2)) | (ChannelFlags & (0b10 << 2))) << 1;
PairFlags = ((ChannelFlags & (0b1 << 4)) | (ChannelFlags & (0b10 << 4))) << 2;
//...

Here is an idea.
Observe one thing here:
A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1
You have 4 or operations. You can perform all of them in 1 instruction:
PairFlags = (PairFlags | (PairFlags >> 1))
Now you bits are aligned like that:
[D1][D1 or D0][D0 or C1][C1 or C0][C0 or B1][B1 or B0][B0 or A1][A1 or A0]
So you just need to extract bits 0, 2, 4, 6 to get the result.
Bit 0. Is already OK.
Bit 1 should be set to bit 2.
Bit 2 should be set to bit 4.
Bit 3 should be set to bit 6.
Final code something like that:
PairFlags = (PairFlags | (PairFlags >> 1))
PairFlags = (PairFlags&1) | ((PairFlags&4)>>1) | ((PairFlags&16)>>2) | ((PairFlags&64)>>3)

Related

Why does data packing 4 integers into a 32 bit integer have different results in Nextion and Teensy(Arduino compatible)

I'm controlling a Teensy 3.5 with a Nextion touchscreen. On the Nextion the following code packs 4 8 bit integers into a 32 bit integer:
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
Using the same shift amount (8) has a different result on the Teensy, however, the following generates the same result:
combinedValue = (v24 << 24) | (v16 << 16) | (v08 << 8) | (v00);
I'm curious why these shifts work differently.
Nextion documentation: https://nextion.tech/instruction-set/
//Nextion:
vaShift_24.val=5
vaShift_16.val=4
vaShift_8.val=1
vaShift_0.val=51
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
//Result is 84148531
//Teensy, Arduino, C++:
value24 = 5;
value16 = 4;
value8 = 1;
value0 = 51;
packedValue = (value24 << 24) | (value16 << 16) | (value8 << 8) | (value0);
Serial.print("24 to 0: ");
Serial.println(packedValue);
packedValue = (value24 << 8) | (value16 << 8) | (value8 << 8) | (value0);
Serial.print("8: ");
Serial.println(packedValue);
//Result:
//24 to 0: 84148531
//8: 1331
Problem seems to be in this line:
sys0=vaShift_24.val<<8|vaShift_16.val<<8|vaShift_8.val<<8|vaShift_0.val
You are shifting by 8 in many places. Presumably you want:
sys0 = vaShift_24.val << 24 | vaShift_16.val << 16 | vaShift_8.val << 8 | vaShift_0.val
Now result from bytes 5, 4, 1, and 55 should be, in hex
0x05040133.
If you are instead seeing
0x33010405
it means you would also have a byte order issue. But probably not.

Binary Interleaving, Binary Swizzling, Alternating Bits

Problem:
I have a sequence of bits of indices 7 6 5 4 3 2 1 0 and I want to swizzle them the following way :
7 6 5 4 3 2 1 0 = 7 6 5 4 3 2 1 0
_____| | | | | | | |_____
| ___| | | | | |___ |
| | _| | | |_ | |
| | | | | | | |
v v v v v v v v
_ 3 _ 2 _ 1 _ 0 7 _ 6 _ 5 _ 4 _
|___________________|
|
v
7 3 6 2 5 1 4 0
i.e. I want to interleave the bits of the low and high nibbles from a byte.
Naive solution:
I can achieve this behavior in C using the following way :
int output =
((input & (1 << 0)) << 0) |
((input & (1 << 1)) << 1) |
((input & (1 << 2)) << 2) |
((input & (1 << 3)) << 3) |
((input & (1 << 4)) >> 3) |
((input & (1 << 5)) >> 2) |
((input & (1 << 6)) >> 1) |
((input & (1 << 7)) >> 0);
However it's obviously very clunky.
Striving for a more elegant solution:
I was wondering if there where something I could do to achieve this behavior faster in less machine instructions. Using SSE for example?
Some context for curious people :
I use this for packing 2d signed integer vector coordinates into a 1d value that conserves proximity when dealing with memory and caching. The idea is similar to some texture layouts optimization used by some GPUs on mobile devices.
(i ^ 0xAAAAAAAA) - 0xAAAAAAAA converts from 1d integer to 1d signed integer with this power of two proximity I was talking about.
(x + 0xAAAAAAAA) ^ 0xAAAAAAAA is just the reverse operation, going from 1d signed integer to a 1d integer, still with the same properties.
To have it become 2d and keep the proximity property, I want to alternate the x and y bits.
So you want to interleave the bits of the low and high nibbles in each byte? For scalar code a 256-byte lookup table (LUT) is probably your best bet.
For x86 SIMD, SSSE3 pshufb (_mm_shuffle_epi8) can be used as a parallel LUT of 16x nibble->byte lookups in parallel. Use this to unpack a nibble to a byte.
__m128i interleave_high_low_nibbles(__m128i v) {
const __m128i lut_unpack_bits_low = _mm_setr_epi8( 0, 1, 0b00000100, 0b00000101,
... // dcba -> 0d0c0b0a
);
const __m128i lut_unpack_bits_high = _mm_slli_epi32(lut_unpack_bits_low, 1);
// dcba -> d0c0b0a0
// ANDing is required because pshufb uses the high bit to zero that element
// 8-bit element shifts aren't available so also we have to mask after shifting
__m128i lo = _mm_and_si128(v, _mm_set1_epi8(0x0f));
__m128i hi = _mm_and_si128(_mm_srli_epi32(v, 4), _mm_set1_epi8(0x0f));
lo = _mm_shuffle_epi8(lut_unpack_bits_low, lo);
hi = _mm_shuffle_epi8(lut_unpack_bits_high, hi);
return _mm_or_si128(lo, hi);
}
This is not faster than a memory LUT for a single byte, but it does 16 bytes in parallel. pshufb is a single-uop instruction on x86 CPUs made in the last decade. (Slow on first-gen Core 2 and K8.)
Having separate lo/hi LUT vectors means that setup can be hoisted out of a loop; otherwise we'd need to shift one LUT result before ORing together.

I am looking for an algorithm to shuffle the first 25 bits of a (32-bit) int

All of the bit shuffling algorithms I've found deal with 16-bit or 32-bit, which means that even if I use only the first 25-bits of an int, the shuffle will leave bits outside. This function is in an inner loop of a CPU-intensive process so I'd prefer it to be as fast as possible. I've tried modifying the code of the Hacker's Delight 32-bit shuffle algorithm
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
but am having difficulty in doing some partly because I'm not sure where the masks come from. I tried shifting the number and re-shuffling but so far the results are all for naught. Any help would be GREATLY appreciated!
(I am using C but I can convert an algorithm from another language)
First, for the sake of evenness, we can extend the problem to a 26-bit shuffle by remembering that bit 25 will appear at the end of the interleaved list, so we can trim it off after the interleaving operation without affecting the positions of the other bits.
Now we want to interleave the first and second sets of 13 bits; but we only have an algorithm to interleave the first and second sets of 16 bits.
A straightfoward approach might be to just move the high and low parts of x into more workable positions before applying the standard algorithm:
x = (x & 0x1ffe000) << 3 | x & 0x00001fff;
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
The zeroes at the top of each half will be interleaved and appear at the top of the result.

Implementing logical negation with only bitwise operators (except !)

~ & ^ | + << >> are the only operations I can use
Before I continue, this is a homework question, I've been stuck on this for a really long time.
My original approach: I thought that !x could be done with two's complement and doing something with it's additive inverse. I know that an xor is probably in here but I'm really at a loss how to approach this.
For the record: I also cannot use conditionals, loops, ==, etc, only the functions (bitwise) I mentioned above.
For example:
!0 = 1
!1 = 0
!anything besides 0 = 0
Assuming a 32 bit unsigned int:
(((x>>1) | (x&1)) + ~0U) >> 31
should do the trick
Assuming x is signed, need to return 0 for any number not zero, and 1 for zero.
A right shift on a signed integer usually is an arithmetical shift in most implementations (e.g. the sign bit is copied over). Therefore right shift x by 31 and its negation by 31. One of those two will be a negative number and so right shifted by 31 will be 0xFFFFFFFF (of course if x = 0 then the right shift will produce 0x0 which is what you want). You don't know if x or its negation is the negative number so just 'or' them together and you will get what you want. Next add 1 and your good.
implementation:
int bang(int x) {
return ((x >> 31) | ((~x + 1) >> 31)) + 1;
}
The following code copies any 1 bit to all positions. This maps all non-zeroes to 0xFFFFFFFF == -1, while leaving 0 at 0. Then it adds 1, mapping -1 to 0 and 0 to 1.
x = x | x << 1 | x >> 1
x = x | x << 2 | x >> 2
x = x | x << 4 | x >> 4
x = x | x << 8 | x >> 8
x = x | x << 16 | x >> 16
x = x + 1
For 32 bit signed integer x
// Set the bottom bit if any bit set.
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x ^= 1; // Toggle the bottom bit - now 0 if any bit set.
x &= 1; // Clear the unwanted bits to leave 0 or 1.
Assuming e.g. an 8-bit unsigned type:
~(((x >> 0) & 1)
| ((x >> 1) & 1)
| ((x >> 2) & 1)
...
| ((x >> 7) & 1)) & 1
You can just do ~x & 1 because it yields 1 for 0 and 0 for everything else

How to think of bit operations for simple operations?

for example:
unsigned int a; // value to merge in non-masked bits
unsigned int b; // value to merge in masked bits
unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
unsigned int r; // result of (a & ~mask) | (b & mask) goes here
r = a ^ ((a ^ b) & mask);
merges bits from two values according to the mask.
[taken from here]
In this case, I can see that it works, but I am not sure what the logic is? And I am not sure I can create my own bit operations like this from scratch. How do I start thinking in bits?
Pencil and paper helps the best in cases like this. I usually write it down:
a = 10101110
b = 01100011
mask = 11110000
a ^ b = 10101110
01100011
--------
x => 11001101
x & mask = 11001101
11110000
--------
x => 11000000
a ^ x = 11000000
10101110
--------
x => 01101110
(final x is your r)
I don't know if this is the result you were after, but that's what it does. Writing it out usually helps when I don't understand a bitwise operation.
In this case, I can see that it works, but I am not sure what the logic is? And I am not sure I can create my own bit operations like this from scratch. How do I start thinking in bits?
People have answered your first question -- explaining the logic. I shall hopefully show you a terribly basic, long-winded but standard method of making any bit twiddling operations. (note, once you get used to working with bits you'll start thinking in & and | straight off without doing such nonsense).
Figure out what you'd like your operation to do.
Write out a FULL truth table.
Either read the sum-of-products direct from the table or make a Karnaugh map. The km will reduce the final eqution a lot.
???
Profit
Deriving for the example you gave. ie, where a mask selects bits from A or B. (0 is A, 1 is B)
This table is for 1 bit per input. I'm not doing more than one bit, as I don't want to waste my time :) ( why? 2^(2bits * 3inputs) = 64 cases :( 2^(3bits * 3inputs) = 512 cases :(()
But the good news is that in this case the operation is independant of the number of bits, so a 1 bit example is 100% fine. Infact it's recommended by me :)
| A | B | M || R |
============++====
| 0 | 0 | 0 || 0 |
| 0 | 0 | 1 || 0 |
| 0 | 1 | 0 || 0 |
| 0 | 1 | 1 || 1 |
| 1 | 0 | 0 || 1 |
| 1 | 0 | 1 || 0 |
| 1 | 1 | 0 || 1 |
| 1 | 1 | 1 || 1 |
Hopefully you can see how this truth table works.
how to get an expression from this? Two methods: KMaps and by-hand. Let's do it by-hand first, should we? :)
Looking at the points where R is true, we see:
| A | B | M || R |
============++====
| 0 | 1 | 1 || 1 |
| 1 | 0 | 0 || 1 |
| 1 | 1 | 0 || 1 |
| 1 | 1 | 1 || 1 |
From this we can dervive an expresion:
R = (~A & B & M) |
( A & ~B & ~M) |
( A & B & ~M) |
( A & B & M) |
Hopefully you can see how this works: just or together the full expressions seen in each case. By full I imply that you need to not-variables i nthere.
Let's try it in python:
a = 0xAE #10101110b
b = 0x64 #01100011b
m = 0xF0 #11110000b
r = (~a & b & m) | ( a & ~b & ~m) | ( a & b & ~m) | ( a & b & m)
print hex(r)
OUTPUT:
0x6E
These numbers are from Abel's example. The output is 0x6E, which is 01101110b.
So it worked! Hurrah. (ps, it's possible to derive an expression for ~r from the first table, should you wish to do so. Just take the cases where r is 0).
This expression you've made is a boolean "sum of products", aka Disjunctive Normal Form, although DNF is really the term used when using first-order predicate logic. This expression is also pretty unweidly. Making it smaller is a tedious thing to do on paper, and is the kind of thing you'll do 500,000 times at Uni' on a CS degree if you take the compiler or hardware courses. (Highly recommended :))
So let's do some boolean algebra magic on this (don't try and follow this, it's a waste of time):
(~a & b & m) | ( a & ~b & ~m) | ( a & b & ~m) | ( a & b & m)
|= ((~a & b & m) | ( a & ~b & ~m)) | ( a & b & ~m) | ( a & b & m)
take that first sub-clause that I made:
((~a & b & m) | ( a & ~b & ~m))
|= (~a | (a & ~b & ~m)) & (b | ( a & ~b & ~m)) & (m | ( a & ~b & ~m))
|= ((~a | a) & (a | ~b) &( a | ~m)) & (b | ( a & ~b & ~m)) & (m | ( a & ~b & ~m))
|= (T & (a | ~b) &( a | ~m)) & (b | ( a & ~b & ~m)) & (m | ( a & ~b & ~m))
|= ((a | ~b) & (a | ~m)) & (b | ( a & ~b & ~m)) & (m | ( a & ~b & ~m))
etc etc etc. This is the massively tedious bit incase you didn't guess. So just whack the expression in a website of your choice, which will tell you
r = (a & ~m) | (b & m)
Hurrah! Correct result. Note, it might even go so far as giving you an expression involving XORs, but who cares? Actually, some people do, as the expression with ands and ors is 4 operations (1 or, 2 and, 1 neg), whilst r = a ^ ((a ^ b) & mask) is 3 (2 xor, 1 and).
Now, how do you do it with kmaps? Well, first you need to know how to make them, I'll leave you to do that. :) Just google for it. There's software available, but I think it's best to do it by hand -- it's more fun and the programs don't allow you to cheat.
Cheat? Well, if you have lots of inputs, it's often best to reduce the table like so:
| A | B | M || R |
============++====
| X | X | 0 || A |
| X | X | 1 || B |
eg that 64 case table?
| A1| A0| B1| B0| M1| M0|| R1| R0|
========================++========
| X | X | X | X | 0 | 0 || A1| A0|
| X | X | X | X | 0 | 1 || A1| B0|
| X | X | X | X | 1 | 0 || B1| A0|
| X | X | X | X | 1 | 1 || B1| B0|
Boils down to 4 cases in this example :)
(Where X is "don't care".) Then put that table in your Kmap. Once again, an exercise for you to work out [ie, I've forgotten how to do this].
Hopefully you can now derive your own boolean madness, given a set of inputs and an expected set of outputs.
Have fun.
In order to create boolean expressions like that one, I think you'd have to learn some boolean algebra.
This looks good:
http://www.allaboutcircuits.com/vol_4/chpt_7/1.html
It even has a page on generating boolean expressions from truth tables.
It also has a section on Karnaugh Maps. To be honest, I've forgotten what those are, but they look like they could be useful for what you want to do.
http://www.allaboutcircuits.com/vol_4/chpt_8/1.html
a ^ x for some x gives the result of flipping those bits in a which are set in x.
a ^ b gives you a 1 where the bits in a and b differ, a 0 where they are the same.
Setting x to (a ^ b) & mask gives the result of flipping the bits in a which are different in a and b and are set in the mask. Thus a ^ ((a ^ b) & mask) gives the result of changing, where necessary, the values of the bits which are set in the mask from the value they take in a to the value they take in b.
The basis for most bitwise operations (&, |, ^, ~) is Boolean algebra. Think of performing Boolean algebra to multiple Boolean values in parallel and you've got the bitwise operators. The only operators this doesn't cover are the shift operators (<<, >>), which you can think of as shifting the bits or as multiplication by powers of two. x << 3 == x * pow(2,3) and x >> 4 == (int)(x * pow(2,-4)).
Thinking in bits in not that hard, you just need to convert, in your head, all the values into bits and work on them a bit at a time. That sounds hard but it does get easier over time. A good first step is to start thinking of them as hex digits (4 bits at a time).
For example, let's say a is 0x13, b is 0x22 and mask is 0x0f:
a : 0x13 : 0001 0011
b : 0x22 : 0010 0010
---------------------------------
a^b : 0x31 : 0011 0001
mask : 0x0f : 0000 1111
---------------------------------
(a^b)&mask : 0x01 : 0000 0001
a : 0x13 : 0001 0011
---------------------------------
a^((a^b)&mask) : 0x12 : 0001 0010
This particular example is a way to combine the top four bits of a with the bottom 4 bits of b (the mask decides which bits come from a and b.
As the site says, it's an optimization of (a & ~mask) | (b & mask):
a : 0x13 : 0001 0011
~mask : 0xf0 : 1111 0000
---------------------------------
a & ~mask : 0x10 : 0001 0000
b : 0x22 : 0010 0010
mask : 0x0f : 0000 1111
---------------------------------
b & mask : 0x20 : 0000 0010
a & ~mask : 0x10 : 0001 0000
b & mask : 0x20 : 0000 0010
---------------------------------
(a & ~mask) | : 0x12 : 0001 0010
(b & mask)
Aside: I wouldn't be overly concerned about not understanding something on that page you linked to. There's some serious "black magic" going on there. If you really want to understand bit fiddling, start with unoptimized ways of doing it.
First learn the logical (that is, 1-bit) operators well. Try to write down some rules, like
a && b = b && a
1 && a = a
1 || a = 1
0 && a = ... //you get the idea. Come up with as many as you can.
Include the "logical" xor operator:
1 ^^ b = !b
0 ^^ b = ...
Once you have a feel for these, move onto bitwise operators. Try some problems, look at some common tricks and techniques. With a bit of practice you'll feel much more confident.
Break the expression down into individual bits. Consider a single bit position in the expression (a ^ b) & mask. If the mask has zero at that bit position, (a ^ b) & mask will simply give you zero. Any bit xor'ed with zero will remain unchanged, so a ^ (a ^ b) & mask will simply return a's original value.
If the mask has a 1 at that bit position, (a ^ b) & mask will simply return the value of a ^ b. Now if we xor the value with a, we get a ^ (a ^ b) = (a ^ a) ^ b = b. This is a consequence of a ^ a = 0 -- any value xor'ed with itself will return zero. And then, as previously mentioned, zero xor'ed with any value will just give you the original value.
How to think in bits:
Read up on what others have done, and take note of their strategies. The Stanford site you link to is a pretty good resource -- there are often several techniques shown for a particular operation, allowing you to see the problem from different angles. You might have noticed that there are people who've submitted their own alternative solutions for a particular operation, which were inspired by the techniques applied to a different operation. You could take the same approach.
Also, it might help you to remember a handful of simple identities, which you can then string together for more useful operations. IMO listing out the results for each bit-combination is only useful for reverse-engineering someone else's work.
Maybe you dont need to think in bits - perhaps you can get your compiler to think in bits for you
and you can focus on the actual problem you're trying to solve instead. Using bit manipulation directly
in your code can produce some profoundly impenetrable (if impressive) code -- here's some nice macros
(from the windows ddk) that demonstrate this
// from ntifs.h
// These macros are used to test, set and clear flags respectivly
#define FlagOn(_F,_SF) ((_F) & (_SF))
#define BooleanFlagOn(F,SF) ((BOOLEAN)(((F) & (SF)) != 0))
#define SetFlag(_F,_SF) ((_F) |= (_SF))
#define ClearFlag(_F,_SF) ((_F) &= ~(_SF))
now if you want to set a flag in a value you can simply say SetFlag(x, y) much clearer I think. Moreover
if you focus on the problem you're trying to address with your bit fiddling the mechanics will become
second nature without you having to expend any effort. Look after the bits and the bytes will look after
themselves!

Resources