Unsigned short int operation with Intel Intrinsics - c

I want to do some operations using Intel intrinsics on vectors of 16-bit unsigned ints, and the operations are the following:
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics Guide, but it looks like there are only intrinsics for signed short integers and not the unsigned ones. Does anyone have a trick that could help me with this?
In fact I'm trying to store an image of a specific raster format in an array with a specific ordering. So I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, remainder = 0, i = 0;

    y = Pixel_number/(x_size*z_size);
    remainder = Pixel_number % (x_size*z_size);
    i = remainder/(x_size*interleaving_depth);
    remainder = remainder % (x_size*interleaving_depth);
    if(i == z_size/interleaving_depth){
        x = remainder/(z_size - i*interleaving_depth);
        remainder = remainder % (z_size - i*interleaving_depth);
    }
    else
    {
        x = remainder/interleaving_depth;
        remainder = remainder % interleaving_depth;
    }
    z = interleaving_depth*i + remainder;
    if(z >= z_size)
        z = z_size - 1;
    return x + y*x_size + z*x_size*y_size;
}

If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
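As a minimal sketch of those points (array names are hypothetical, and n is assumed to be a multiple of 8), loading, low-half multiplying, and storing unsigned shorts uses the plain epi16 intrinsics:
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

// Multiply two arrays of unsigned 16-bit ints element-wise, keeping the low 16 bits.
// The epi16 load / multiply / store intrinsics work unchanged for unsigned data.
void mul_u16_arrays(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i lo = _mm_mullo_epi16(va, vb);   // low half: same result for signed/unsigned
        _mm_storeu_si128((__m128i *)(out + i), lo);
    }
}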
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:
#include <immintrin.h>

__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;

    // clang allows "lax" vector type conversions without casts.
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1;
    // to work with both, we need to jump through all the syntax hoops.
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
div13_epu16:
pmulhuw xmm0, XMMWORD PTR .LC0[rip] # tmp93,
psrlw xmm0, 2 # tmp95,
ret
.section .rodata
.LC0:
.value 20165
# repeats 8 times
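For reference, a scalar sketch (mine, not from the answer) of what that constant does per element: 20165 is 2^18/13 rounded up, so the high-half 16-bit multiply followed by the further shift of 2 (18 bits total) gives an exact divide-by-13 for every 16-bit input:
#include <stdint.h>

// Scalar equivalent of the pmulhuw + psrlw pair above:
// (x * 20165) >> 18 == x / 13 for all x in [0, 65535].
uint16_t div13_u16_scalar(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 20165u) >> 18);
}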
If you have runtime-variable divisors, it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once. But code that works with an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and also for scalar integer code), and potentially more instructions, because some divisors require a more complicated sequence than others.
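The usual shape with libdivide's C API (function and type names here are from memory of that API, not from the answer; check libdivide.h) is to generate the fixed-point inverse once and reuse it inside the loop:
#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"   // assumed header name from the libdivide project

// Hedged sketch: precompute the inverse for one runtime divisor and reuse it,
// so no hardware divide runs inside the loop.
uint64_t sum_of_quotients(const uint64_t *vals, size_t n, uint64_t divisor)
{
    struct libdivide_u64_t d = libdivide_u64_gen(divisor);   // one-time setup
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += libdivide_u64_do(vals[i], &d);
    return sum;
}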

Related

How can I quickly get the value 2^64 divided by random integer in C lang? [duplicate]

How to compute the integer division, 2^64/n? Assuming:
unsigned long is 64-bit
We use a 64-bit CPU
1 < n < 2^64
If we do 18446744073709551616ul / n, we get warning: integer constant is too large for its type at compile time. This is because we cannot express 2^64 in a 64-bit integer. Another way is the following:
#define IS_POWER_OF_TWO(x) ((x & (x - 1)) == 0)
unsigned long q = 18446744073709551615ul / n;
if (IS_POWER_OF_TWO(n))
return q + 1;
else
return q;
Is there any faster (CPU cycle) or cleaner (coding) implementation?
I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.
phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 2^64 - n, then (-n)/n = 2^64/n - 1, and we can simply add back the 1.
uint64_t divide_two_to_the_64(uint64_t n) {
return (-n)/n + 1;
}
The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
I've come up with another solution which was inspired by this question. From there we know that
(a1 + a2 + a3 + ... + an)/n =
(a1/n + a2/n + a3/n + ... + an/n) + (a1 % n + a2 % n + a3 % n + ... + an % n)/n
By choosing a1 = a2 = a3 = ... = a(n-1) = 1 and an = 2^64 - n we'll have
(a1 + a2 + a3 + ... + an)/n = (1 + 1 + 1 + ... + (2^64 - n))/n = 2^64/n
= [(n - 1)*1/n + (2^64 - n)/n] + [(n - 1)*0 + (2^64 - n) % n]/n
= (2^64 - n)/n + ((2^64 - n) % n)/n
2^64 - n is the 2's complement of n, which is -n, or we can also write it as ~0 - n + 1. So the final solution would be
uint64_t twoPow64div(uint64_t n)
{
return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}
The last part is to correct the result, because we deal with unsigned integers instead of signed ones like in the other question. I checked both the 32 and 64-bit versions on my PC and the result matches your solution.
On MSVC, however, there's an intrinsic for 128-bit division, so you can use it like this:
uint64_t remainder;
return _udiv128(1, 0, n, &remainder);
which results in the cleanest output
mov edx, 1
xor eax, eax
div rcx
ret 0
Here's the demo
On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of precision, so you can use either of these
(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)
although the performance would probably be worse. This can also be applied to any implementation where long double has more than 63 bits of significand, like PowerPC with its double-double implementation.
There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1: Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow, which also has clever solutions. Based on that we have
2^64/n = (2^64 - n + n)/n = (2^64 - n)/n + 1 = (-n)/n + 1
which is essentially just another way to get Nate Eldredge's answer
Here's some demo for other compilers on godbolt
See also:
Trick to divide a constant (power of two) by an integer
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
We use a 64-bit CPU
Which 64-bit CPU?
In general, if you multiply a number with N bits by another number that has M bits, the result will have up to N+M bits. For integer division it's similar - if a number with N bits is divided by a number with M bits the result will have N-M+1 bits.
Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has less digits); some CPUs support "widening multiplication" and "narrowing division".
In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.
Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".
Ironically (for unsigned 64-bit divisors on 64-bit 80x86) there is no other choice and the compiler must use the DIV instruction that will divide a 128-bit number by a 64-bit number. This means that the C language forces you to use a 64-bit numerator, then the code generated by the compiler extends your 64 bit numerator to 128 bits and divides it by a 64 bit number to get a 64 bit result; and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.
Hopefully you can see how this situation might be considered "less than ideal".
What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:
__uint128_t numerator = (__uint128_t)1 << 64;
if(n > 1) {
return (uint64_t)(numerator/n);
}
I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed (they all generated code that does a call __udivti3, which is an expensive function to get a 128 bit result). The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zeros).
In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.
For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
...and the best code that's possible is:
mov edx, 1
xor rax, rax
div rdi
ret
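If you do resort to inline assembly, a minimal GNU C sketch (mine, x86-64 GCC/Clang only, not from the answers) that feeds RDX:RAX = 1:0 to a single DIV looks like this; as in the snippet above it requires n > 1, since DIV faults on quotient overflow for n == 1 and on division by zero for n == 0:
#include <stdint.h>

static inline uint64_t div_2_64(uint64_t n)
{
    uint64_t quotient, remainder;
    // Divide the 128-bit value 2^64 (RDX:RAX = 1:0) by n with one DIV instruction.
    __asm__("divq %[divisor]"
            : "=a"(quotient), "=d"(remainder)
            : [divisor] "rm"(n), "a"(0ULL), "d"(1ULL)
            : "cc");
    return quotient;
}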
Your way is pretty good. It might be better to write it like this:
return 18446744073709551615ul / n + ((n&(n-1)) ? 0:1);
The hope is to make sure the compiler notices that it can do a conditional move instead of a branch.
Compile and disassemble.

Porting ASM function for finding least significant set bit into C [duplicate]

This is sort of a follow-up on some previous questions on bit manipulation. I modified the code from this site to enumerate strings with K of N bits set (x is the current int64_t with K bits set, and at the end of this code it is the lexicographically next integer with K bits set):
int64_t b, t, c, m, r,z;
b = x & -x;
t = x + b;
c = x^t;
// was m = (c >> 2)/b per link
z = __builtin_ctz(x);
m = c >> 2+z;
x = t|m;
The modification using __builtin_ctz() works fine as long as the least significant one bit is in the lower DWORD of x, but if it is not, it totally breaks. This can be seen with the following code:
for(int i=0; i<64; i++) printf("i=%i, ctz=%i\n", i, __builtin_ctz(1UL << i));
which prints for GCC version 4.4.7:
i=0, ctz=0
i=1, ctz=1
i=2, ctz=2
...
i=30, ctz=30
i=31, ctz=31
i=32, ctz=0
or for icc version 14.0.0 something similar (except i>32 gives random results, not zero). Using division instead of shifting by 2+z works in both cases, but it's about 5x slower on my Sandy Bridge Xeon. Are there other intrinsics I should be using for 64-bit, or will I have to do some inline assembler?
Thanks!
__builtin_ctz takes arguments of type unsigned int, which is 32 bits on most platforms.
If long is 64 bits, you can use __builtin_ctzl, which takes unsigned long. Or you can use __builtin_ctzll, which takes unsigned long long; in that case you should use 1ULL << i instead of 1UL << i.
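For reference, a sketch of the question's next-bit-permutation step with the 64-bit builtin substituted in (assuming x is a non-zero 64-bit value):
#include <stdint.h>

// Lexicographically next value with the same number of set bits, using __builtin_ctzll.
uint64_t next_bit_permutation(uint64_t x)
{
    uint64_t b = x & -x;                            // lowest set bit
    uint64_t t = x + b;                             // flips the lowest run of ones
    uint64_t c = x ^ t;                             // bits that changed
    uint64_t m = c >> (2 + __builtin_ctzll(x));     // right-adjust the moved ones
    return t | m;
}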

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I'm working with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it does exactly what I want.
#include <immintrin.h>
#include <assert.h>

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is a matrix actually; num_rows and width are the number of rows and columns
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>

// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float *p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 24));
    __m256i ab = _mm256_packs_epi32(a, b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);    // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
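A hypothetical caller might look like this (names are mine; count is assumed to be a multiple of 32 and out 32-byte aligned so the aligned store is safe, otherwise use _mm256_storeu_si256):
#include <stddef.h>
#include <stdint.h>

// Convert `count` floats to int8_t, 32 at a time, using pack_float_int8 from above.
void quantize_to_int8(const float *in, int8_t *out, size_t count)
{
    for (size_t i = 0; i < count; i += 32) {
        __m256i v = pack_float_int8(in + i);
        _mm256_store_si256((__m256i *)(out + i), v);
    }
}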
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
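A sketch of that unsigned variant, changing only the final pack:
#include <immintrin.h>

// Same as pack_float_int8 above, but the 16 -> 8 step saturates to unsigned [0, 255].
__m256i pack_float_uint8(const float *p)
{
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 24));
    __m256i ab = _mm256_packs_epi32(a, b);        // still signed 32 -> 16
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packus_epi16(ab, cd);   // unsigned saturation 16 -> u8
    return _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
}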
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Check the IEEE 754 standard for how float values are stored; once you understand how float and double are stored in memory, you will know how to convert a float or double to a char. It is quite simple.

32-bit multiplication through 16-bit shifting

I am writing a soft-multiplication function call using shifting and addition. The existing function call goes like this:
unsigned long __mulsi3 (unsigned long a, unsigned long b) {
    unsigned long answer = 0;
    while (b)
    {
        if (b & 1) {
            answer += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return answer;
}
Although my hardware does not have a multiplier, I have a hard shifter. The shifter is able to shift up to 16 bits at one time.
I want to make full use of my 16-bit shifter. Any suggestions on how I can adapt the code above to reflect my hardware's capabilities? The given code shifts only 1 bit per iteration.
The 16-bit shifter can shift 32-bit unsigned long values up to 16 places at a time, and unsigned long is 32 bits wide.
The ability to shift multiple bits is not going to help much, unless you have a hardware multiply, say 8-bit x 8-bit, or you can afford some RAM/ROM to do (say) a 4-bit by 4-bit multiply by lookup.
The straightforward shift and add (as you are doing) can be helped by swapping the arguments so that the multiplier is the smaller.
If your machine is faster doing 16 bit things in general, then treating your 32-bit 'a' as 'a1:a0' 16-bits at a time, and similarly 'b', you just might be able to save some cycles. Your result is only 32-bits, so you don't need to do 'a1 * b1' -- though one or both of those may be zero, so the win may not be big! Also, you only need the ls 16-bits of 'a0 * b1', so that can be done entirely in 16 bits -- but if b1 (assuming b <= a) is generally zero this is not a big win, either. For 'a * b0', you need a 32-bit 'a' and 32-bit adds into 'answer', but your multiplier is 16-bits only... which may or may not help.
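As a rough sketch of that a1:a0 / b1:b0 split (my code, not the answer's; it assumes a 32-bit unsigned long and a hypothetical mul16() helper that is just the question's shift-and-add loop, now capped at 16 iterations):
static unsigned long mul16(unsigned short a, unsigned short b)
{
    unsigned long answer = 0, aa = a;
    while (b)                      // at most 16 iterations
    {
        if (b & 1)
            answer += aa;
        aa <<= 1;
        b >>= 1;
    }
    return answer;
}

unsigned long mulsi3_split(unsigned long a, unsigned long b)
{
    unsigned short a0 = a & 0xFFFFu, a1 = a >> 16;
    unsigned short b0 = b & 0xFFFFu, b1 = b >> 16;
    // Only the low 32 bits of the product are needed, so a1*b1 never matters and
    // only the low 16 bits of each cross term survive the shift by 16.
    unsigned long low   = mul16(a0, b0);
    unsigned long cross = (mul16(a1, b0) + mul16(a0, b1)) & 0xFFFFu;
    return low + (cross << 16);
}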
Skipping runs of multiplier zeros could help -- depending on processor and any properties of the multiplier.
FWIW: doing the magic 'a1*b1', '(a1-a0)*(b0-b1)', 'a0*b0' and combining the result by shifts, adds and subtracts is, in my small experience, an absolute nightmare... the signs of '(a1-a0)', '(b0-b1)' and their product have to be respected, which makes a bit of a mess of what looks like a cute trick. By the time you have finished with that and the adds and subtracts, you have to have a mighty slow multiply to make it all worth while ! When multiplying very, very long integers this may help... but there the memory issues may dominate... when I tried it, it was something of a disappointment.
Having 16-bit shifts can help you in making minor speed enhancement using the following approach:
(U1 * P + U0) * (V1 * P + V0) =
= U1 * V1 * P * P + U1 * V0 * P + U0 * V1 * P + U0 * V0 =
= U1 * V1 * (P*P+P) + (U1-U0) * (V0-V1) * P + U0 * V0 * (1+P)
provided P is a convenient power of 2 (for example, 2^16, 2^32), so multiplying to it is a fast shift. This reduces from 4 to 3 multiplications of smaller numbers, and, recursively, O(N^1.58) instead of O(N^2) for very long numbers.
This method is named Karatsuba's multiplication. There are more advanced versions described there.
For small numbers (e.g. 8 by 8 bits), the following method is fast, if you have enough fast ROM:
a * b = square(a+b)/4 - square(a-b)/4
If you tabulate int(square(x)/4), you'll need 1022 bytes for unsigned multiplication and 510 bytes for the signed one.
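A sketch of that table-lookup idea for unsigned 8x8-bit operands (table and helper names are mine):
#include <stdint.h>

// qsq[i] = i*i/4 for i in [0, 510]: 511 16-bit entries = 1022 bytes.
static uint16_t qsq[511];

static void init_qsq(void)
{
    for (int i = 0; i < 511; i++)
        qsq[i] = (uint16_t)((i * i) / 4);
}

// a*b = floor((a+b)^2/4) - floor(|a-b|^2/4); the identity holds exactly for integers.
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    int d = (a > b) ? (a - b) : (b - a);
    return (uint16_t)(qsq[a + b] - qsq[d]);
}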
The basic approach is (assuming shifting by 1) :-
Shift the top 16 bits
Set the bottom bit of the top 16 bits to the top bit of the bottom 16 bits
Shift the bottom 16 bits
Depends a bit on your hardware...
but you could try :-
assuming unsigned long is 32 bits
assuming Big Endian
then :-
union Data32
{
    unsigned long l;
    unsigned short s[2];
};

unsigned long shiftleft32(unsigned long valueToShift, unsigned short bitsToShift)
{
    union Data32 u;
    u.l = valueToShift;
    u.s[0] <<= bitsToShift;
    u.s[0] |= (u.s[1] >> (16 - bitsToShift));
    u.s[1] <<= bitsToShift;
    return u.l;
}
then do the same in reverse for shifting right
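A sketch of that reverse direction, under the same assumptions (32-bit unsigned long, big endian; not part of the original answer):
unsigned long shiftright32(unsigned long valueToShift, unsigned short bitsToShift)
{
    union Data32 u;
    u.l = valueToShift;
    u.s[1] >>= bitsToShift;
    u.s[1] |= (u.s[0] << (16 - bitsToShift));   // carry the low bits of the high half in
    u.s[0] >>= bitsToShift;
    return u.l;
}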
The code in the question multiplies the traditional way, the way we learned in primary school:
EX:
0101
* 0111
-------
0101
0101.
0101..
--------
100011
Of course you cannot approach it like that if you don't have either a multiply operator or a 1-bit shifter!
You can do it in other ways though, for example with a loop:
unsigned long _mult(unsigned long a, unsigned long b)
{
    unsigned long res = 0;
    while (a > 0)
    {
        res += b;
        a--;
    }
    return res;
}
It is costly but it serves your needs; anyway, you can think about other approaches if you have more constraints (like computation time...).

_mm_crc32_u64 poorly defined

Why in the world was _mm_crc32_u64(...) defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
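For what it's worth, here is a GNU C (GCC/Clang) sketch of what I believe that macro is aiming for, written as a function instead of a macro (my attempt; MSVC would still need something else):
#include <stdint.h>

static inline uint32_t crc32_u64(uint32_t crc, uint64_t data)
{
    // crc32 with a 64-bit source needs a 64-bit destination register;
    // the upper 32 bits of the destination come back as zero.
    uint64_t c = crc;
    __asm__("crc32q %1, %0" : "+r"(c) : "rm"(data));
    return (uint32_t)c;
}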
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction allows your code to feed input data to it 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time could simplify your code logic if the input byte count is odd or not a multiple of 4/8.
#include <stdio.h>
#include <stdint.h>
#include <intrin.h>

int main (int argc, char *argv [])
{
    int index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    total1 = total2 = total3 = total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);
    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);
    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);
    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using 8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    printf ("CRC32 result using 64-bit chunks: %08X\n", (uint32_t) total4);
    return 0;
}
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My friend and I wrote a C++ SSE intrinsics wrapper which contains the preferred usage of the crc32 instruction with a 64-bit source.
http://code.google.com/p/sse-intrinsics/
See the i_crc32() instruction.
(Sadly there are even more flaws with Intel's SSE intrinsic specifications on other instructions; see this page for more examples of flawed intrinsic design.)

Resources