I am trying to accelerate my code using SSE, and the following code works well.
Basically, a __m128* variable should point to 4 floats in a row, in order to do 4 operations at once.
This code is equivalent to computing c[i]=a[i]+b[i] with i from 0 to 3.
float *data1, *data2, *data3;
// ... code ... allocating data1-2-3 which are very long.
__m128* a = (__m128*) (data1);
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
However, when I want to shift a bit the data that I use (see below), in order to compute c[i]=a[i+1]+b[i] with i from 0 to 3, it crashes at execution time.
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
My guess is that it is related to the fact that __m128 is 128 bits while my float data are 32 bits, so it may be impossible for a __m128* to point to an address that is not 16-byte (128-bit) aligned.
Anyway, do you know what the problem is and how I could work around it?
Instead of using implicit aligned loads/stores like this:
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
use explicit aligned/unaligned loads/stores as appropriate, e.g.:
__m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
__m128 vb = _mm_load_ps(data2);
__m128 vc = _mm_add_ps(va, vb);
_mm_store_ps(data3, vc);
Same amount of code (i.e. same number of instructions), but it won't crash, and you have explicit control over which loads/stores are aligned and which are unaligned.
Note that recent CPUs have relatively small penalties for unaligned loads, but on older CPUs there can be a 2x or greater hit.
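For a whole array, the same idea works in a loop. Here is a minimal sketch (my own illustration, not code from the question), assuming n is a multiple of 4 and data1 has at least n+1 readable floats:
#include <stddef.h>
#include <xmmintrin.h>
// c[i] = a[i+1] + b[i] for i in [0, n)
void shifted_add(const float *data1, const float *data2, float *data3, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(data1 + i + 1);   // offset by one float: may be misaligned, so loadu
        __m128 vb = _mm_loadu_ps(data2 + i);
        _mm_storeu_ps(data3 + i, _mm_add_ps(va, vb));
    }
}
If you know data2 and data3 are 16-byte aligned, you can use _mm_load_ps / _mm_store_ps for those two as in the snippet above.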
Your problem here is that a ends up pointing to something that is not a __m128; it points to something that contains the last 96 bits of an __m128 and 32 bits outside, which can be anything. It may be the first 32 bits of the next __m128, but eventually, when you arrive at the last __m128 in the same memory block, it will be something else. Maybe reserved memory that you cannot access, hence the crash.
How could I convert my code to AVX2 code and get the same result as before?
Is it possible to use __m256i in the LongNumInit, LongNumPrint functions instead of uint8_t *L, or some similar type of variable?
My knowledge of AVX is quite limited; I investigated quite a bit, however I do not understand very well how to transform my code. Any suggestion and explanation is welcome.
I'm really interested in this code in AVX2.
void LongNumInit(uint8_t *L, size_t N )
{
for(size_t i = 0; i < N; ++i){
L[i] = myRandom()%10;
}
}
void LongNumPrint( uint8_t *L, size_t N, uint8_t *Name )
{
printf("%s:", Name);
for ( size_t i=N; i>0;--i )
{
printf("%d", L[i-1]);
}
printf("\n");
}
int main (int argc, char **argv)
{
int i, sum1, sum2, sum3, N=10000, Rep=50;
seed = 12345;
// obtain parameters at run time
if (argc>1) { N = atoi(argv[1]); }
if (argc>2) { Rep = atoi(argv[2]); }
// Create Long Nums
unsigned char *V1= (unsigned char*) malloc( N);
unsigned char *V2= (unsigned char*) malloc( N);
unsigned char *V3= (unsigned char*) malloc( N);
unsigned char *V4= (unsigned char*) malloc( N);
LongNumInit ( V1, N ); LongNumInit ( V2, N ); LongNumInit ( V3, N );
//Print last 32 digits of Long Numbers
LongNumPrint( V1, 32, "V1" );
LongNumPrint( V2, 32, "V2" );
LongNumPrint( V3, 32, "V3" );
LongNumPrint( V4, 32, "V4" );
free(V1); free(V2); free(V3); free(V4);
return 0;
}
The result that I obtain in my initial code is this:
V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000
This is a terrible format for BigInteger in general, see https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could/should do instead.
And see Can long integer routines benefit from SSE? for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.
But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.
We can vectorize both of these quite well. My LongNumPrintName() is a drop-in replacement for yours.
For LongNumInit I'm just showing a building-block that stores two 32-byte chunks and returns an incremented pointer. Call it in a loop. (It naturally produces 2 vectors per call so for small N you might make an alternate version.)
LongNumInit
What's the fastest way to generate a 1 GB text file containing random digits? generates space-separated random ASCII decimal digits at about 33 GB/s on 4GHz Skylake, including overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)
It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i) like AVX/SSE version of xorshift128+. That should be similar quality of randomness to your rand() % 10.
It breaks that up into decimal digits via a multiplicative inverse to divide and modulo by 10 with shifts and vpmulhuw, using Why does GCC use multiplication by a strange number in implementing integer division?. (Actually using GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts for convenient syntax like v16u dig1 = v % ten; and v /= ten;)
You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant, see below)
Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics).
See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.
#include <immintrin.h>
// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));
// returns p + size of stores. Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digits(__m256i vec, __m256i *restrict p)
{
v16u v = (v16u)vec;
v16u ten = (v16u)_mm256_set1_epi16(10);
v16u divisor = (v16u)_mm256_set1_epi16(6554); // ceil((2^16-1) / 10.0)
v16u div6554 = v / divisor; // Basically the entropy from the upper two decimal digits: 0..65.
// Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
// dig4 for more ILP and fewer instructions total.
v16u dig1 = v % ten;
v /= ten;
v16u dig2 = v % ten;
v /= ten;
v16u dig3 = v % ten;
// dig4 would overlap much of the randomness that div6554 gets
// __m256i or v16u assignment is an aligned store
v16u *vecbuf = (v16u*)p;
// pack 16->8 bits.
vecbuf[0] = (v16u)_mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
vecbuf[1] = (v16u)_mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
return p + 2; // always a constant number of full vectors
}
The logic in random_decimal_fill_buffer that inserts newlines can be totally removed because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.
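For example, a sketch of such a caller (my illustration; next_random_vec() is a hypothetical stand-in for one step of the AVX2 xorshift128+ PRNG from the linked answer, and N is assumed to be padded to a multiple of 64 as discussed below):
// Hypothetical sketch: fill L with N random decimal digits, 64 digits per call.
void LongNumInit_avx2(uint8_t *L, size_t N)
{
    __m256i *out = (__m256i*)L;           // L assumed 32-byte aligned
    __m256i *end = (__m256i*)(L + N);     // N assumed a multiple of 64
    while (out < end)
        out = vec_store_digits(next_random_vec(), out);
}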
Handling small sizes (less than a full vector):
It would be convenient to pad your malloc up to the next multiple of 32 bytes so it's always safe to do a 32-byte load without checking for maybe crossing into an unmapped page.
And use C11 aligned_alloc to get 32-byte aligned storage. So for example, aligned_alloc(32, (size+31) & -32). This lets us just do full 32-byte stores even if N is odd. Logically only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over to avoid any extra conditional checks for N being less than 32, or not a multiple of 32.
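For example (sketch):
#include <stdint.h>
#include <stdlib.h>
// 32-byte aligned allocation, padded so full 32-byte loads/stores are always safe.
static uint8_t *alloc_digits(size_t N)
{
    size_t padded = (N + 31) & ~(size_t)31;   // round up to a multiple of 32
    return aligned_alloc(32, padded);         // C11; padded is a multiple of the alignment
}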
Unfortunately ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those: Why is there no 'aligned_realloc' on most platforms? allowing you to sometimes allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++, which might need to run copy-constructors if non-trivially copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.
LongNumPrint
Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.
Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.
Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.
You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially-huge buffer. You probably want to still go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass that on to write() directly instead of memcpy into the stdio buffer. You might want to play around with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache will mean it's still hot in cache when it's re-read after you wrote it.
Cache blocking makes the loop bounds a lot more complex but it's worth it for very large N.
Byte-reversing 32 bytes at a time:
You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition it would have to loop from the end backwards.
Your function could be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.
Or better, use vmovdqu / vinserti128 16-byte loads to feed _mm256_shuffle_epi8 to byte-reverse within lanes, setting up for 32-byte stores.
On Intel CPUs, vinserti128 decodes to a load+ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if data was hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or in Ice Lake, 2 loads + 2 stores). We'll probably bottleneck on the front-end if there are no memory bottlenecks, so in practice not saturating load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)
This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.
#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
__m128i hi = _mm_loadu_si128(1 + (const __m128i*)src); // unaligned loads
__m128i lo = _mm_loadu_si128(src);
__m256i v = _mm256_set_m128i(lo, hi); // reverse 128-bit hi/lo halves
// compilers will hoist constants out of inline functions
__m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__m256i byterev = _mm256_broadcastsi128_si256(byterev_lane); // same in each lane
v = _mm256_shuffle_epi8(v, byterev); // in-lane reverse
v = _mm256_or_si256(v, _mm256_set1_epi8('0')); // digits to ASCII
_mm256_storeu_si256(dst, v); // Will usually be aligned in practice.
}
// Tested for N=32; could be bugs in the loop bounds for other N
// returns bytes written, like fwrite: N means no error, 0 means error in all fwrites
size_t LongNumPrint( uint8_t *num, size_t N)
{
// caller can print a name if it wants
const int revbufsize = 8192; // 8kiB on the stack should be fine
alignas(32) char revbuf[revbufsize];
if (N<32) {
// TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
ASCIIrev32B(revbuf, num); // the data we want is at the *end* of a 32-byte reverse
return fwrite(revbuf+32-N, 1, N, stdout);
}
size_t bytes_written = 0;
const uint8_t *inp = num+N; // start with last 32 bytes of num[]
do {
size_t chunksize = (inp - num >= revbufsize) ? revbufsize : inp - num;
const uint8_t *inp_stop = inp - chunksize + 32; // leave one full vector for the end
uint8_t *outp = revbuf;
while (inp > inp_stop) { // may run 0 times
inp -= 32;
ASCIIrev32B(outp, inp);
outp += 32;
}
// reverse first (lowest address) 32 bytes of this chunk of num
// into last 32 bytes of this chunk of revbuf
// if chunksize%32 != 0 this will overlap, which is fine.
ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
bytes_written += fwrite(revbuf, 1, chunksize, stdout);
inp = inp_stop - 32;
} while ( inp > num );
return bytes_written;
// caller can putchar('\n') if it wants
}
// wrapper that prints name and newline
void LongNumPrintName(uint8_t *num, size_t N, const char *name)
{
printf("%s:", name);
//LongNumPrint_scalar(num, N);
LongNumPrint(num, N);
putchar('\n');
}
// main() included on Godbolt link that runs successfully
This compiles and runs (on Godbolt) with gcc -O3 -march=haswell and produces identical output to your scalar loop for the N=32 that main passes. (I used rand() instead of MyRandom(), so we could test with the same seed and get the same numbers, using your init function.)
Untested for larger N, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid.
We could load (not just store) aligned vectors if we converted the first N%32 bytes and passed that to fwrite before starting the main loop. But that probably either leads to an extra write() system call, or to clunky copying inside stdio. (Unless there was already buffered text not printed yet, like Name:, in which case we already have that penalty.)
Note that it's technically C UB to decrement inp past the start of num. So inp -= 32 instead of inp = inp_stop-32 would have that UB for the iteration that leaves the outer loop. I actually avoid that in this version, but it generally works anyway because I think GCC assumes a flat memory model and de facto defines the behaviour of pointer compares enough. And normal OSes reserve the zero page, so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address). This paragraph is mostly left over from the first totally untested attempt, which I thought was decrementing the pointer farther in the inner loop than it actually was.
I would like to vectorize an equality test in which all elements in a vector are compared against the same value, and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit-packing the booleans is not an important detail in this problem.) This function can be written as:
#include <stdint.h>
void vecEq (uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
for(int i = 0; i < len; i++) {
results[i] = numbers[i] == target;
}
}
If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then using SIMD's _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?
The approach I've considered is broadcasting the 8-bit word to a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work though, because I cannot find a vectorized way to convert the result of XOR (zero means equal, anything else means unequal) to the equality test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:
void vecEq (uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
uint64_t targetA = (uint64_t)target;
uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32 | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
for(int i = 0; i < len; i++) {
uint64_t tmp = numbers[i] ^ targetB;
results[i] = ... something with tmp ...;
}
}
Further to the comments above (the code will vectorise just fine). If you are using AVX, the best strategy is usually just to use unaligned load/store intrinsics. They have no extra cost if your data does happen to be aligned, and are as cheap as the HW can make them for cases of misalignment. (On Intel CPUs, there's still a penalty for loads/stores that span two cache lines, aka a cache-line split).
Ideally you can still align your buffers by 32, but if your data has to come from L2 or L3 or RAM, misalignment often doesn't make a measurable difference. And the best strategy for dealing with possible misalignment is usually just to let the HW handle it, instead of scalar up to an alignment boundary or something like you'd do with SSE, or with AVX512 where alignment matters again (any misalignment leads to every load/store being a cache-line split).
Just use _mm256_loadu_si256 / _mm256_storeu_si256 and forget about it.
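For example, a minimal sketch of the loop with unaligned AVX2 intrinsics (my illustration; for brevity it assumes len is a multiple of 32, with a scalar cleanup loop needed otherwise):
#include <immintrin.h>
#include <stdint.h>
// results[i] = (numbers[i] == target) ? 1 : 0
void vecEq_avx2(const uint8_t *numbers, uint8_t *results, int len, uint8_t target)
{
    __m256i vtarget = _mm256_set1_epi8((char)target);
    __m256i ones    = _mm256_set1_epi8(1);
    for (int i = 0; i < len; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i*)(numbers + i));  // unaligned load
        __m256i eq = _mm256_cmpeq_epi8(v, vtarget);   // 0xFF where equal, 0x00 otherwise
        __m256i b  = _mm256_and_si256(eq, ones);      // squash 0xFF down to 1
        _mm256_storeu_si256((__m256i*)(results + i), b);                 // unaligned store
    }
}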
As an interesting aside, Visual C++ will no longer emit aligned loads or stores, even if you request them.
https://godbolt.org/z/pL9nw9 (e.g. vmovups instead of vmovaps)
If compiling with GCC, you probably want to use -march=haswell or -march=znver1 not just -mavx2, or at least -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store so 256-bit unaligned loads compile to single instructions. The CPUs that benefit from those tune=generic defaults don't support AVX2, for example Sandybridge and Piledriver.
What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I work with the AVX2 instruction set only, so intrinsic functions like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it does exactly what I want.
void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
// input is a matrix actually; num_rows and width represent the number of rows and columns of the matrix
assert(width % 16 == 0);
int num_input_chunks = width / 16;
__m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
quant_mult, quant_mult, quant_mult, quant_mult);
for (int i = 0; i < num_rows; ++i) {
const float* input_row = input + i * width;
__m256i* output_row = output + i * num_input_chunks;
for (int j = 0; j < num_input_chunks; ++j) {
const float* x = input_row + j * 16;
// Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
__m256 f_0 = _mm256_loadu_ps(x);
__m256 f_1 = _mm256_loadu_ps(x + 8);
__m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
__m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
__m256i i_0 = _mm256_cvtps_epi32(m_0);
__m256i i_1 = _mm256_cvtps_epi32(m_1);
*(output_row + j) = _mm256_packs_epi32(i_0, i_1);
}
}
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
__m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
__m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
__m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
__m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
__m256i ab = _mm256_packs_epi32(a,b); // 16x int16_t
__m256i cd = _mm256_packs_epi32(c,d);
__m256i abcd = _mm256_packs_epi16(ab, cd); // 32x int8_t
// packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
// if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
// but if you need sequential order, then vpermd:
__m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
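For example (sketch; n assumed a multiple of 32, out assumed 32-byte aligned; calls pack_float_int8 from the block above):
#include <stddef.h>
#include <stdint.h>
// Convert n floats to int8_t with signed saturation, 32 at a time.
void convert_all(const float *in, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        _mm256_store_si256((__m256i*)(out + i), pack_float_int8(in + i));
    }
}
(In your Quantize, the multiply by avx2_quant_mult would go right before each _mm256_cvtps_epi32 inside the helper.)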
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
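A sketch of that single-vector variant (my illustration):
// One __m256i of 8x int32 -> 8x int8_t (in the low 8 bytes of the result), with signed saturation.
__m128i pack1vec_epi32_epi8(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);   // vextracti128
    __m128i w  = _mm_packs_epi32(lo, hi);          // packssdw: 8x int16
    return _mm_packs_epi16(w, w);                  // packsswb: result in the low 8 bytes
}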
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard format used to store float values. First understand how float and double are stored in memory; then you will know how to convert a float or double to char. It is quite simple.
In gcc 4.1.2, vec_ld() does not work correctly on a board with an MPC74XX CPU.
float temp[4];
__vector float Src;
Src = (__vector float)vec_ld(0, temp);
However, if the float array is aligned to 16 bytes, it works correctly:
float temp[4] __attribute__((aligned(16)));
Is this by design?
Yes, AltiVec loads and stores require 16 byte alignment. This is very well documented in the AltiVec manuals.
Unlike other SIMD architectures such as SSE however, note that AltiVec silently truncates unaligned addresses to the next lowest 16 byte boundary, rather than generating an exception, so your code will not crash, but it will not behave correctly if you attempt to load or store at an unaligned address.
In cases where you can not avoid unaligned loads you can load two adjacent aligned vectors and then use vec_lvsl + vec_perm to create the required vector:
float temp[4];
__vector float src1, src2, src;
src1 = vec_ld(0, temp);
src2 = vec_ld(16, temp);
src = vec_perm(src1, src2, vec_lvsl(0, temp));
By the way, in Power8 they finally added support for unaligned load/store vector access. For details, see information on lxvd2x / lxvw4x and stxvd2x / stxvw4x instructions in section "7.6 VSX Instruction Set" of Power ISA 2.07 document.
Those who have access to IBM XL C/C++ Compiler, could use vec_xld2() / vec_xlw4() and vec_xstd2() / vec_xstw4() intrinsics.
As of version "g++ (GCC) 4.10.0 20140419 (experimental)", I am not aware of GCC equivalents, but I believe users of GCC can access unaligned memory by pointer dereferencing:
signed int *data;
// ...
vector signed int r = *(vector signed int *)&(data[i]);
I read about Endianness and understood squat...
so I wrote this
main()
{
int k = 0xA5B9BF9F;
BYTE *b = (BYTE*)&k; //value at *b is 9f
b++; //value at *b is BF
b++; //value at *b is B9
b++; //value at *b is A5
}
k was equal to A5 B9 BF 9F
and the (byte) pointer "walk" output was 9F BF B9 A5
so I get it bytes are stored backwards...ok.
~
so now I thought how is it stored at BIT level...
I mean, is "9F" (1001 1111) stored as "F9" (1111 1001)?
so I wrote this
int _tmain(int argc, _TCHAR* argv[])
{
int k = 0xA5B9BF9F;
void *ptr = &k;
bool temp= TRUE;
cout<<"ready or not here I come \n"<<endl;
for(int i=0;i<32;i++)
{
temp = *( (bool*)ptr + i );
if( temp )
cout<<"1 ";
if( !temp)
cout<<"0 ";
if(i==7||i==15||i==23)
cout<<" - ";
}
}
I get some random output; even for numbers like 32 I don't get anything sensible.
Why?
Just for completeness, machines are described in terms of both byte order and bit order.
The intel x86 is called Consistent Little Endian because it stores multi-byte values in LSB to MSB order as memory address increases. Its bit numbering convention is b0 = 2^0 and b31 = 2^31.
The Motorola 68000 is called Inconsistent Big Endian because it stores multi-byte values in MSB to LSB order as memory address increases. Its bit numbering convention is b0 = 2^0 and b31 = 2^31 (same as intel, which is why it is called 'Inconsistent' Big Endian).
The 32-bit IBM/Motorola PowerPC is called Consistent Big Endian because it stores multi-byte values in MSB to LSB order as memory address increases. Its bit numbering convention is b0 = 2^31 and b31 = 2^0.
Under normal high level language use the bit order is generally transparent to the developer. When writing in assembly language or working with the hardware, the bit numbering does come into play.
Endianness, as you discovered by your experiment refers to the order that bytes are stored in an object.
Bits within a byte do not get stored differently; a byte is always 8 bits, and the bits are always "human readable" (high->low).
Now that we've discussed that you don't need your code... About your code:
for(int i=0;i<32;i++)
{
temp = *( (bool*)ptr + i );
...
}
This isn't doing what you think it's doing. You're iterating over 0-31, i.e. the 32 bits in a word - good. But your temp assignment is all wrong :)
It's important to note that a bool* is the same size as an int* is the same size as a BigStruct*. All pointers on the same machine are the same size - 32bits on a 32bit machine, 64bits on a 64bit machine.
ptr + i is adding i bytes to the ptr address. When i>3, you're reading a whole new word... this could possibly cause a segfault.
What you want to use is bit-masks. Something like this should work:
for (int i = 0; i < 32; i++) {
unsigned int mask = 1u << i;
bool bit_is_one = *static_cast<unsigned int*>(ptr) & mask;  // dereference ptr, then test the bit
...
}
Your machine almost certainly can't address individual bits of memory, so the layout of bits inside a byte is meaningless. Endianness refers only to the ordering of bytes inside multibyte objects.
To make your second program make sense (though there isn't really any reason to, since it won't give you any meaningful results) you need to learn about the bitwise operators - particularly & for this application.
Byte Endianness
On different machines this code may give different results:
union endian_example {
unsigned long u;
unsigned char a[sizeof(unsigned long)];
} x;
x.u = 0x0a0b0c0d;
int i;
for (i = 0; i< sizeof(unsigned long); i++) {
printf("%u\n", (unsigned)x.a[i]);
}
This is because different machines are free to store values in any byte order they wish. This is fairly arbitrary. There is no backwards or forwards in the grand scheme of things.
Bit Endianness
Usually you don't have to ever worry about bit endianness. The most common way to access individual bits is with shifts ( >>, << ), but those are really tied to values, not bytes or bits. They perform an arithmetic operation on a value. That value is stored in bits (which are in bytes).
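For example, this reads the 2^3 bit of a value the same way on any machine:
unsigned long v = 0x0a0b0c0d;
unsigned bit3 = (v >> 3) & 1;   /* shifts operate on the value, not on a byte layout */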
Where you may run into a problem in C with bit endianness is if you ever use a bit field. This is a rarely used (for this reason and a few others) "feature" of C that allows you to tell the compiler how many bits a member of a struct will use.
struct thing {
unsigned y:1; // y will be one bit and can have the values 0 and 1
signed z:1; // z can only have the values 0 and -1
unsigned a:2; // a can be 0, 1, 2, or 3
unsigned b:4; // b is just here to take up the rest of the byte
};
In this the bit endianness is compiler dependent. Should y be the most or least significant bit in a thing? Who knows? If you care about the bit ordering (describing things like the layout of an IPv4 packet header, control registers of a device, or just a storage format in a file) then you probably don't want to worry about some different compiler doing this the wrong way. Also, compilers aren't always as smart about how they work with bit fields as one would hope.
This line here:
temp = *( (bool*)ptr + i );
... when you do pointer arithmetic like this, the compiler moves the pointer on by the number you added times the sizeof the thing you are pointing to. Because you are casting your void* to a bool*, the pointer moves along by sizeof(bool) at a time - typically one byte - so you are stepping through the bytes of k (and then on into neighbouring memory), not through its bits.
You can't address the individual bits in a byte, so it's almost meaningless to ask which way round they are stored. (Your machine can store them whichever way it wants and you won't be able to tell). The only time you might care about it is when you come to actually spit bits out over a physical interface like I2C or RS232 or similar, where you have to actually spit the bits out one-by-one. Even then, though, the protocol would define which order to spit the bits out in, and the device driver code would have to translate between "an int with value 0xAABBCCDD" and "a bit sequence 11100011... [whatever] in protocol order".