SSE _mm_movemask_epi8 equivalent method for ARM NEON

SSE _mm_movemask_epi8 equivalent method for ARM NEON - arm

I decided to continue Fast corners optimisation and stucked at
_mm_movemask_epi8 SSE instruction. How can i rewrite it for ARM Neon with uint8x16_t input?

I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16]=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);
// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));
// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.

The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
// Example input (half scale):
// 0x89 FF 1D C0 00 10 99 33
// Shift out everything but the sign bits
// 0x01 01 00 01 00 00 01 00
uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
// Merge the even lanes together with vsra. The '??' bytes are garbage.
// vsri could also be used, but it is slightly slower on aarch64.
// 0x??03 ??02 ??00 ??01
uint32x4_t paired16 = vreinterpretq_u32_u16(
vsraq_n_u16(high_bits, high_bits, 7));
// Repeat with wider lanes.
// 0x??????0B ??????04
uint64x2_t paired32 = vreinterpretq_u64_u32(
vsraq_n_u32(paired16, paired16, 14));
// 0x??????????????4B
uint8x16_t paired64 = vreinterpretq_u8_u64(
vsraq_n_u64(paired32, paired32, 28));
// Extract the low 8 bits from each lane and join.
// 0x4B
return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}

This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
const uint8_t __attribute__ ((aligned (16))) ucShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
uint8x16_t vshift = vld1q_u8(ucShift);
uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
uint32_t out;
vmask = vshlq_u8(vmask, vshift);
out = vaddv_u8(vget_low_u8(vmask));
out += (vaddv_u8(vget_high_u8(vmask)) << 8);
return out;
}

after some tests it looks like following code works correct:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
uint8x8_t mask_and = vdup_n_u8(0x80);
int8x8_t mask_shift = vld1_s8(xr);
uint8x8_t lo = vget_low_u8(input);
uint8x8_t hi = vget_high_u8(input);
lo = vand_u8(lo, mask_and);
lo = vshl_u8(lo, mask_shift);
hi = vand_u8(hi, mask_and);
hi = vshl_u8(hi, mask_shift);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
return ((hi[0] << 8) | (lo[0] & 0xFF));
}

Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.

I know this question is here for 8 years already but let me give you the answer which might solve all performance problems with emulation. It's based on the blog Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most usages of movemask instructions are coming from comparisons where the vectors have 0xFF or 0x00 values from the result of every 16 bytes. After that most cases to use movemasks are to check if none/all match, find leading/trailing or iterate over bits.
If this is the case which often is, then you can use shrn reg1, reg2, #4 instruction. This instruction called Shift-Right-then-Narrow instruction can reduce a 128-bit byte mask to a 64-bit nibble mask (by alternating low and high nibbles to the result). This allows the mask to be extracted to a 64-bit general purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all bit operations you typically use on x86 with very minor tweaks like shifting by 2 or doing a scalar AND.

Related

Efficient way of rotating a byte inside an AVX register

Summary/tl;dr: Is there any way to rotate a byte in an YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together?
For each 8 bytes in an YMM register, I need to left-rotate 7 bytes in it. Each byte needs to be rotated one bit more to the left than the former. Thus the 1 byte should be rotated 0 bits and the seventh should be rotated 6 bits.
Currently, I have made an implementation that does this by [I use the 1-bit rotate as an example here] shifting the register 1 bit to the left, and 7 to the right individually. I then use the blend operation (intrinsic operation _mm256_blend_epi16) to chose the correct bits from the first and second temporary result to get my final rotated byte.
This costs a total of 2 shift operations and 1 blend operation per byte, and 6 bytes needs to be rotated, thus 18 operations per byte (shift and blend has just about the same performance).
There must be a faster way to do this than by using 18 operations to rotate a single byte!
Furthermore, I need to assemble all the bytes afterwards in the new register. I do this by loading 7 masks with the "set" instruction into registers, so I can extract the correct byte from each register. I AND these mask with the registers to extract the correct byte from them. Afterwards I XOR the single byte registers together to get the new register with all the bytes.
This takes a total of 7+7+6 operations, so another 20 operations (per register).
I could use the extract intrinsic (_mm256_extract_epi8) to get the single bytes, and then use _mm256_set_epi8 to assemble the new registers, but I don't know yet whether that would be faster. (There is no listed performance for these functions in the Intel intrinsics guide, so maybe I am misunderstanding something here.)
This gives a total of 38 operations per register, which seems less than optimal for rotating 6 bytes differently inside a register.
I hope someone more proficient in AVX/SIMD can guide me here—whether I am going about this the wrong way—as I feel I might be doing just that right now.

The XOP instruction set does provide _mm_rot_epi8() (which is NOT Microsoft-specific; it is also available in GCC since 4.4 or earlier, and should be available in recent clang, too). It can be used to perform the desired task in 128-bit units. Unfortunately, I don't have a CPU with XOP support, so I cannot test that.
On AVX2, splitting the 256-bit register into two halves, one containing even bytes, and the other odd bytes shifted right 8 bits, allows a 16-bit vector multiply to do the trick. Given constants (using GCC 64-bit component array format)
static const __m256i epi16_highbyte = { 0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL };
static const __m256i epi16_lowbyte = { 0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL };
static const __m256i epi16_oddmuls = { 0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL };
static const __m256i epi16_evenmuls = { 0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL };
the rotation operation can be written as
__m256i byteshift(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_lowbyte), epi16_oddmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_lowbyte), epi16_evenmuls), epi16_highbyte));
}
This has been verified to yield correct results on Intel Core i5-4200U using GCC-4.8.4. As an example, the input vector (as a single 256-bit hexadecimal number)
88 87 86 85 84 83 82 81 38 37 36 35 34 33 32 31 28 27 26 25 24 23 22 21 FF FE FD FC FB FA F9 F8
gets rotated into
44 E1 D0 58 24 0E 05 81 1C CD C6 53 A1 CC 64 31 14 C9 C4 52 21 8C 44 21 FF BF BF CF DF EB F3 F8
where the leftmost octet is rotated left by 7 bits, next 6 bits, and so on; seventh octet is unchanged, eighth octet is rotated by 7 bits, and so on, for all 32 octets.
I am not sure if the above function definition compiles to optimal machine code -- that depends on the compiler --, but I'm certainly happy with its performance.
Since you probably dislike the above concise format for the function, here it is in procedural, expanded form:
static __m256i byteshift(__m256i value)
{
__m256i low, high;
high = _mm256_srai_epi16(value, 8);
low = _mm256_and_si256(value, epi16_lowbyte);
high = _mm256_and_si256(high, epi16_lowbyte);
low = _mm256_mullo_epi16(low, epi16_lowmuls);
high = _mm256_mullo_epi16(high, epi16_highmuls);
low = _mm256_srli_epi16(low, 8);
high = _mm256_and_si256(high, epi16_highbyte);
return _mm256_or_si256(low, high);
}
In a comment, Peter Cordes suggested replacing the srai+and with an srli, and possibly the final and+or with a blendv. The former makes a lot of sense, as it is purely an optimization, but the latter may not (yet, on current Intel CPUs!) actually be faster.
I tried some microbenchmarking, but was unable to get reliable results. I typically use the TSC on x86-64, and take the median of a few hundred thousand tests using inputs and outputs stored to an array.
I think it is most useful if I will just list the variants here, so any user requiring such a function can make some benchmarks on their real-world workloads, and test to see if there is any measurable difference.
I also agree with his suggestion to use odd and even instead of high and low, but note that since the first element in a vector is numbered element 0, the first element is even, the second odd, and so on.
#include <immintrin.h>
static const __m256i epi16_oddmask = { 0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL };
static const __m256i epi16_evenmask = { 0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL };
static const __m256i epi16_evenmuls = { 0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL };
static const __m256i epi16_oddmuls = { 0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL };
/* Original version suggested by Nominal Animal. */
__m256i original(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_evenmask), epi16_oddmuls), epi16_oddmask));
}
/* Optimized as suggested by Peter Cordes, without blendv */
__m256i no_blendv(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask));
}
/* Optimized as suggested by Peter Cordes, with blendv.
* This is the recommended version. */
__m256i optimized(__m256i value)
{
return _mm256_blendv_epi8(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask);
}
Here are the same functions written in a way that shows the individual operations. Although it does not affect sane compilers at all, I've marked the function parameter and each temporary value const, so that it is obvious how you can insert each into a subsequent expression, to simplify the functions to their above concise forms.
__m256i original_verbose(const __m256i value)
{
const __m256i odd1 = _mm256_srai_epi16(value, 8);
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd2 = _mm256_and_si256(odd1, epi16_evenmask);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd3 = _mm256_mullo_epi16(odd3, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even3, 8);
const __m256i odd4 = _mm256_and_si256(odd3, epi16_oddmask);
return _mm256_or_si256(even3, odd4);
}
__m256i no_blendv_verbose(const __m256i value)
{
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd1 = _mm256_srli_epi16(value, 8);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd2 = _mm256_mullo_epi16(odd1, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even2, 8);
const __m256i odd3 = _mm256_and_si256(odd2, epi16_oddmask);
return _mm256_or_si256(even3, odd3);
}
__m256i optimized_verbose(const __m256i value)
{
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd1 = _mm256_srli_epi16(value, 8);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd2 = _mm256_mullo_epi16(odd1, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even2, 8);
return _mm256_blendv_epi8(even3, odd2, epi16_oddmask);
}
I personally do write my test functions initially in their above verbose forms, as forming the concise version is a trivial set of copy-pasting. I do, however, testing both versions to verify against introducing any errors, and keeping the verbose version accessible (as a comment or so), because the concise versions are basically write-only. It is much easier to edit the verbose version, then simplify it to the concise form, than trying to edit the concise version.

[Based on the first comment and some edits, the resulting solution is a little different. I will present that first, then leave the original thought below]
The main idea here is using multiplication by powers of 2 to accomplish the shifting, since these constants can vary across the vector. #harold pointed out the next idea, which is that multiplication of two duplicated bytes will automatically do the "rotation" of the shifted-out bits back into the lower bits.
Unpack and duplicate bytes into 16-bit values [... d c b a] -> [... dd cc bb aa]
Generate a 16-bit constant [128 64 32 16 8 4 2 1]
Multiply
The byte you want is the top eight bits of each 16-bit value, so right-shift and repack
Assuming __m128i source (you only have 8 bytes, right?):
__m128i duped = _mm_unpacklo_epi8(src, src);
__m128i res = _mm_mullo_epi16(duped, power_of_two_vector);
__m128i repacked = _mm_packus_epi16(_mm_srli_epi16(res, 8), __mm_setzero_si128());
[saving this original idea for comparison]
What about this: Use multiplication by powers of 2 to accomplish the shifts, using 16-bit products. Then OR the upper and lower halves of the product to accomplish the rotation.
Unpack the bytes into 16-bit words.
Generate a 16-bit [ 128 64 32 16 8 4 2 1 ]
Multiply the 16-bit words
Re-pack the 16-bit into two eight-bit vectors, a high byte vector and a low-byte vector
OR those two vectors to accomplish the rotate.
I'm a little fuzzy on the available multiply options and your instruction set limitation, but ideal would be an 8-bit by 8-bit multiply that produces 16-bit products. As far as I know it doesn't exist, which is why I suggest unpacking first, but I've seen other neat algorithms for doing this.

How to check the number of set bits in an 8-bit unsigned char?

So I have to find the set bits (on 1) of an unsigned char variable in C?
A similar question is How to count the number of set bits in a 32-bit integer? But it uses an algorithm that's not easily adaptable to 8-bit unsigned chars (or its not apparent).

The algorithm suggested in the question How to count the number of set bits in a 32-bit integer? is trivially adapted to 8 bit:
int NumberOfSetBits( uint8_t b )
{
b = b - ((b >> 1) & 0x55);
b = (b & 0x33) + ((b >> 2) & 0x33);
return (((b + (b >> 4)) & 0x0F) * 0x01);
}
It is simply a case of shortening the constants the the least significant eight bits, and removing the final 24 bit right-shift. Equally it could be adapted for 16bit using an 8 bit shift. Note that in the case for 8 bit, the mechanical adaptation of the 32 bit algorithm results in a redundant * 0x01 which could be omitted.

The fastest approach for an 8-bit variable is using a lookup table.
Build an array of 256 values, one per 8-bit combination. Each value should contain the count of bits in its corresponding index:
int bit_count[] = {
// 00 01 02 03 04 05 06 07 08 09 0a, ... FE FF
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, ..., 7, 8
};
Getting a count of a combination is the same as looking up a value from the bit_count array. The advantage of this approach is that it is very fast.
You can generate the array using a simple program that counts bits one by one in a slow way:
for (int i = 0 ; i != 256 ; i++) {
int count = 0;
for (int p = 0 ; p != 8 ; p++) {
if (i & (1 << p)) {
count++;
}
}
printf("%d, ", count);
}
(demo that generates the table).
If you would like to trade some CPU cycles for memory, you can use a 16-byte lookup table for two 4-bit lookups:
static const char split_lookup[] = {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};
int bit_count(unsigned char n) {
return split_lookup[n&0xF] + split_lookup[n>>4];
}
Demo.

I think you are looking for Hamming Weight algorithm for 8bits?
If it is true, here is the code:
unsigned char in = 22; //This is your input number
unsigned char out = 0;
in = in - ((in>>1) & 0x55);
in = (in & 0x33) + ((in>>2) & 0x33);
out = ((in + (in>>4) & 0x0F) * 0x01) ;

Counting the number of digits different than 0 is also known as a Hamming Weight. In this case, you are counting the number of 1's.
Dasblinkenlight provided you with a table driven implementation, and Olaf provided you with a software based solution. I think you have two other potential solutions. The first is to use a compiler extension, the second is to use an ASM specific instruction with inline assembly from C.
For the first alternative, see GCC's __builtin_popcount(). (Thanks to Artless Noise).
For the second alternative, you did not specify the embedded processor, but I'm going to offer this in case its ARM based.
Some ARM processors have the VCNT instruction, which performs the count for you. So you could do it from C with inline assembly:
inline
unsigned int hamming_weight(unsigned char value) {
__asm__ __volatile__ (
"VCNT.8"
: "=value"
: "value"
);
return value;
}
Also see Fastest way to count number of 1s in a register, ARM assembly.
For completeness, here is Kernighan's bit counting algorithm:
int count_bits(int n) {
int count = 0;
while(n != 0) {
n &= (n-1);
count++;
}
return count;
}
Also see Please explain the logic behind Kernighan's bit counting algorithm.

I made an optimized version. With a 32-bit processor, utilizing multiplication, bit shifting and masking can make smaller code for the same task, especially when the input domain is small (8-bit unsigned integer).
The following two code snippets are equivalent:
unsigned int bit_count_uint8(uint8_t x)
{
uint32_t n;
n = (uint32_t)(x * 0x08040201UL);
n = (uint32_t)(((n >> 3) & 0x11111111UL) * 0x11111111UL);
/* The "& 0x0F" will be optimized out but I add it for clarity. */
return (n >> 28) & 0x0F;
}
/*
unsigned int bit_count_uint8_traditional(uint8_t x)
{
x = x - ((x >> 1) & 0x55);
x = (x & 0x33) + ((x >> 2) & 0x33);
x = ((x + (x >> 4)) & 0x0F);
return x;
}
*/
This produces smallest binary code for IA-32, x86-64 and AArch32 (without NEON instruction set) as far as I can find.
For x86-64, this doesn't use the fewest number of instructions, but the bit shifts and downcasting avoid the use of 64-bit instructions and therefore save a few bytes in the compiled binary.
Interestingly, in IA-32 and x86-64, a variant of the above algorithm using a modulo ((((uint32_t)(x * 0x08040201U) >> 3) & 0x11111111U) % 0x0F) actually generates larger code, due to a requirement to move the remainder register for return value (mov eax,edx) after the div instruction. (I tested all of these in Compiler Explorer)
Explanation
I denote the eight bits of the byte x, from MSB to LSB, as a, b, c, d, e, f, g and h.
abcdefgh
* 00001000 00000100 00000010 00000001 (make 4 copies of x
--------------------------------------- with appropriate
abc defgh0ab cdefgh0a bcdefgh0 abcdefgh bit spacing)
>> 3
---------------------------------------
000defgh 0abcdefg h0abcdef gh0abcde
& 00010001 00010001 00010001 00010001
---------------------------------------
000d000h 000c000g 000b000f 000a000e
* 00010001 00010001 00010001 00010001
---------------------------------------
000d000h 000c000g 000b000f 000a000e
... 000h000c 000g000b 000f000a 000e
... 000c000g 000b000f 000a000e
... 000g000b 000f000a 000e
... 000b000f 000a000e
... 000f000a 000e
... 000a000e
... 000e
^^^^ (Bits 31-28 will contain the sum of the bits
a, b, c, d, e, f, g and h. Extract these
bits and we are done.)

Maybe not the fastest, but straightforward:
int count = 0;
for (int i = 0; i < 8; ++i) {
unsigned char c = 1 << i;
if (yourVar & c) {
//bit n°i is set
//first bit is bit n°0
count++;
}
}

For 8/16 bit MCUs, a loop will very likely be faster than the parallel-addition approach, as these MCUs cannot shift by more than one bit per instruction, so:
size_t popcount(uint8_t val)
{
size_t cnt = 0;
do {
cnt += val & 1U; // or: if ( val & 1 ) cnt++;
} while ( val >>= 1 ) ;
return cnt;
}
For the incrementation of cnt, you might profile. If still too slow, an assember implementation might be worth a try using carry flag (if available). While I am in against using assembler optimizations in general, such algorithms are one of the few good exceptions (still just after the C version fails).
If you can omit the Flash, a lookup table as proposed by #dasblinkenlight is likey the fastest approach.
Just a hint: For some architectures (notably ARM and x86/64), gcc has a builtin: __builtin_popcount(), you also might want to try if available (although it takes int at least). This might use a single CPU instruction - you cannot get faster and more compact.

Allow me to post a second answer. This one is the smallest possible for ARM processors with Advanced SIMD extension (NEON). It's even smaller than __builtin_popcount() (since __builtin_popcount() is optimized for unsigned int input, not uint8_t).
#ifdef __ARM_NEON
/* ARM C Language Extensions (ACLE) recommends us to check __ARM_NEON before
including <arm_neon.h> */
#include <arm_neon.h>
unsigned int bit_count_uint8(uint8_t x)
{
/* Set all lanes at once so that the compiler won't emit instruction to
zero-initialize other lanes. */
uint8x8_t v = vdup_n_u8(x);
/* Count the number of set bits for each lane (8-bit) in the vector. */
v = vcnt_u8(v);
/* Get lane 0 and discard other lanes. */
return vget_lane_u8(v, 0);
}
#endif

Add all elements in a lane

Is there an intrinsic which allows one to add all of the elements in a lane? I am using Neon to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised):
int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;
p[0] = some_number;
p[1] = some_other_number;
//etc etc
pneon = vld1q_s16(p);
q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(p,q);
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );
Is there a "better" way to do this?

If you're targeting the newer arm 64 bit architecture, then ADDV is just the right instruction for you.
Here's how your code will look with it.
qneon = vld1q_s16(q);
result = vmulq_s16(p,q);
sum = vaddvq_s16(result);
That's it. Just one instruction to sum up all of the lanes in the vector register.
Sadly, this instruction doesn't feature in the older 32 bit arm architecture.

Something like this should work pretty optimal (caution: not tested)
const int16x4_t result_low = vget_low_s16(result); // Extract low 4 elements
const int16x4_t result_high = vget_high_s16(result); // Extract high 4 elements
const int32x4_t twopartsum = vaddl_s16(result_low, result_high); // Extend to 32 bits and add (4 partial 32-bit sums are formed)
const int32x2_t twopartsum_low = vget_low_s32(twopartsum); // Extract 2 low 32-bit partial sums
const int32x2_t twopartsum_high = vget_high_s32(twopartsum); // Extract 2 high 32-bit partial sums
const int32x2_t fourpartsum = vadd_s32(twopartsum_low, twopartsum_high); // Add partial sums (2 partial 32-bit sum are formed)
const int32x2_t eightpartsum = vpadd_s32(fourpartsum, fourpartsum); // Final reduction
const int32_t sum = vget_lane_s32(eightpartsum, 0); // Move to general-purpose registers

temp = vadd_f32(vget_high_f32(variance_n), vget_low_f32(variance_n));
sum = vget_lane_f32(vpadd_f32(variance_temp, variance_temp), 0);

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 each uint16_t).
Here is my c code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
//Rotation <<< 24 over 128 bits (x << shift) | (x >> (16 - shift)
rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the gcc documentation about Neon Intrinsics and it doesn't have instruction for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8) but all the bits shifted outside a uint16_t word are lost.
How to achieve this using neon intrinsics ? By the way is there a better documentation about GCC Neon Intrinsics ?

After some reading on Arm Community Blogs, I've found this :
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following Neon GCC Intrinsic does the same as the assembly provided in the picture :
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the the 24bit rotation over a full 128bit vector (not over each element) could be done by the following:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1);
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);

Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
uint8x16_t tmp = vreinterpretq_u8_u16(input);
uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
vext.8 q0, q0, q0, #13
bx lr
A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).
Related: x86 also has instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.

I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require with a left shift, a right shit and an or, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
return (in >> rotation) | (in << (8-rotation));
}
Just do the same with the Neon intrinsics for left shift, right shit and or.
uint16x8_t temp;
uint8_t rot;
uint16x8_t rotated = vorrq_u16 ( vshlq_n_u16(temp, rot) , vshrq_n_u16(temp, 16 - rot) );
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.

Fast method to copy memory with translation - ARGB to BGR

Overview
I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel, Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel, Blue, Green, and Red.
So the brute force method is:
// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB orig[IMAGESIZE];
BGR dest[IMAGESIZE];
for(x = 0; x < IMAGESIZE; x++)
{
dest[x].Red = orig[x].Red;
dest[x].Green = orig[x].Green;
dest[x].Blue = orig[x].Blue;
}
However, I need more speed than is provided by a loop and three byte copies. I'm hoping there might be a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32 bit machine.
Additional info
Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely into 32 bit boundaries.
I have access to OpenCL - and while that requires moving the entire buffer into the GPU memory, then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient may make this a worthwhile exploration.
While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.
Running on Intel processors (Core 2 and above) and thus there are streaming and data processing commands I'm aware exist, but don't know about - perhaps pointers on where to look for specialized data handling instructions would be good.
This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.
Pseudo-code is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.
Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on memory pointed-to by pointer inputs in registers. Apparently this worked ok for the benchmark. I fixed the constraints to be properly safe for all callers. This should not affect benchmark numbers, only make sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).
In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.
Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.
--- #PeterCordes
The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).
void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
unsigned x;
for(x = 0; x < imageSize; x++) {
*((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
// warning: strict-aliasing UB. Use memcpy for unaligned loads/stores
}
}
The second method performs the same operation, but uses an inline assembly loop instead of a C loop.
void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
"0:\n\t" // do {
"movl (%1),%%eax\n\t"
"bswapl %%eax\n\t"
"movl %%eax,(%0)\n\t" // copy a dword byte-reversed
"add $4,%1\n\t" // orig += 4 bytes
"add $3,%0\n\t" // dest += 3 bytes
"dec %2\n\t"
"jnz 0b" // }while(--imageSize)
: "+r" (dest), "+r" (orig), "+r" (imageSize)
: // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
: "flags", "eax", "memory"
);
}
The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; it's fine to use movdqu but there's no downside.)
typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 16, dest += 12) {
__builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
}
}
Finally, the fourth version is the inline assembly equivalent of the third.
void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
asm volatile (
"lddqu %3,%%xmm1\n\t"
"0:\n\t"
"lddqu (%1),%%xmm0\n\t"
"pshufb %%xmm1,%%xmm0\n\t"
"movdqu %%xmm0,(%0)\n\t"
"add $16,%1\n\t"
"add $12,%0\n\t"
"sub $4,%2\n\t"
"jnz 0b"
: "+r" (dest), "+r" (orig), "+r" (imagesize)
: "m" (mask) // whole array as a memory operand. "x" would get the compiler to load it
: "flags", "xmm0", "xmm1", "memory"
);
}
(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)
On my 2010 MacBook Pro, 2.4 Ghz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3: 9.3163 milliseconds
Version 4: 9.3584 milliseconds
As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel macs, which didn't support SSSE3.
Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.
Average | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1: 11.13120 ms | 0.81076 ms (7.3%)
Version 2: 11.27092 ms | 0.66209 ms (5.9%)
Version 3: 9.29184 ms | 0.27851 ms (3.0%)
Version 4: 9.40948 ms | 0.32702 ms (3.5%)
Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314

The obvious, using pshufb.
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
assert((uintptr_t)orig % 16 == 0);
assert(imagesize % 4 == 0);
__m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 16, dest += 12) {
_mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
}
}

Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.
EDIT2: More detail about the underlying code structure:
With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3 byte pixel is only alignable to 16-bytes for every 16 pixels, we batch up 16 pixels at a time using a combination of shuffles and masks and ors of 16 input pixels at a time.
From LSB to MSB, the inputs look like this, ignoring the specific components:
s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333
and the ouptuts look like this:
d[0]: 000 000 000 000 111 1
d[1]: 11 111 111 222 222 22
d[2]: 2 222 333 333 333 333
So to generate those outputs, you need to do the following (I will specify the actual transformations later):
d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_1_low(s[2]), f_1_high(s[3]))
Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:
combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000
But the actual transformations (f_<x>_low, f_<x>_high) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination for brevity):
d[0]=
s[0][0].Blue s[0][0].Green s[0][0].Red
s[0][1].Blue s[0][1].Green s[0][1].Red
s[0][2].Blue s[0][2].Green s[0][2].Red
s[0][3].Blue s[0][3].Green s[0][3].Red
s[1][0].Blue s[1][0].Green s[1][0].Red
s[1][1].Blue
If you translate the above into byte offsets from source to dest, you get:
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7
(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)
Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):
f_0_low= 3 2 1 7 6 5 11 10 9 15 14 13 X X X X
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
f_1_high= X X X X X X X X 3 2 1 7 6 5 11 10
f_2_low= 9 15 14 13 X X X X X X X X X X X X
f_2_high= X X X X 3 2 1 7 6 5 11 10 9 15 14 13
We can further optimize this by looking the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.
EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
assert((uintptr_t)orig % 16 == 0);
assert(imagesize % 16 == 0);
__m128i shuf0 = _mm_set_epi8(
-128, -128, -128, -128, // top 4 bytes are not used
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
__m128i shuf1 = _mm_set_epi8(
7, 1, 2, 3, // top 4 bytes go to the first pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel
__m128i shuf2 = _mm_set_epi8(
10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9); // bottom 4 go to third pixel
__m128i shuf3 = _mm_set_epi8(
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
-128, -128, -128, -128); // unused
__m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
__m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
__m128i mask2 = _mm_set_epi32(0, 0, 0, -1);
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 64, dest += 48) {
__m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
__m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
__m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
__m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
_mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(b, mask0));
_mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(c, mask1));
_mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(d, mask2));
}
}

I am coming a little late to the party, seeming that the community has already decided for poseur's pshufb-answer but distributing 2000 reputation, that is so extremely generous i have to give it a try.
Here's my version without platform specific intrinsics or machine-specific asm, i have included some cross-platform timing code showing a 4x speedup if you do both the bit-twiddling like me AND activate compiler-optimization (register-optimization, loop-unrolling):
#include "stdlib.h"
#include "stdio.h"
#include "time.h"
#define UInt8 unsigned char
#define IMAGESIZE (1920*1080)
int main() {
time_t t0, t1;
int frames;
int frame;
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
if(!orig) {printf("nomem1");}
BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
if(!dest) {printf("nomem2");}
printf("to start original hit a key\n");
getch();
t0 = time(0);
frames = 1200;
for(frame = 0; frame<frames; frame++) {
int x; for(x = 0; x < IMAGESIZE; x++) {
dest[x].Red = orig[x].Red;
dest[x].Green = orig[x].Green;
dest[x].Blue = orig[x].Blue;
x++;
}
}
t1 = time(0);
printf("finished original of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook the original took 16 sec
// (8 sec with compiler optimization -O3) so at 60 FPS
// (instead of the 1200) this would be faster than realtime
// (if you disregard any other rendering you have to do).
// However if you either want to do other/more processing
// OR want faster than realtime processing for e.g. a video-conversion
// program then this would have to be a lot faster still.
printf("to start alternative hit a key\n");
getch();
t0 = time(0);
frames = 1200;
unsigned int* reader;
unsigned int* end = reader+IMAGESIZE;
unsigned int cur; // your question guarantees 32 bit cpu
unsigned int next;
unsigned int temp;
unsigned int* writer;
for(frame = 0; frame<frames; frame++) {
reader = (void*)orig;
writer = (void*)dest;
next = *reader;
reader++;
while(reader<end) {
cur = next;
next = *reader;
// in the following the numbers are of course the bitmasks for
// 0-7 bits, 8-15 bits and 16-23 bits out of the 32
temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255);
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280);
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680);
*writer = temp;
reader++;
writer++;
}
}
t1 = time(0);
printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook this alternative took 10 sec
// (4 sec with compiler optimization -O3)
}
The results are these (on my core 2 subnotebook):
F:\>gcc b.c -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds
F:\>gcc b.c -O3 -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

You want to use a Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It's also working in JavaScript. This post however it's a bit funny to read http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff device with 512 Kbytes of moves.

In combination with one of the fast conversion functions here, given access to Core 2s it might be wise to split the translation into threads, which work on their, say, fourth of the data, as in this psudeocode:
void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
thread threads[] = {
create_thread(bgrFromArgb, dest, src, n/4),
create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
}
join_threads(threads);
}

This assembly function should do, however I don't know if you would like to keep old data or not, this function overrides it.
The code is for MinGW GCC with intel assembly flavour, you will have to modify it to suit your compiler/assembler.
extern "C" {
int convertARGBtoBGR(uint buffer, uint size);
__asm(
".globl _convertARGBtoBGR\n"
"_convertARGBtoBGR:\n"
" push ebp\n"
" mov ebp, esp\n"
" sub esp, 4\n"
" mov esi, [ebp + 8]\n"
" mov edi, esi\n"
" mov ecx, [ebp + 12]\n"
" cld\n"
" convertARGBtoBGR_loop:\n"
" lodsd ; load value from [esi] (4byte) to eax, increment esi by 4\n"
" bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
" stosd ; store 4 bytes to [edi], increment edi by 4\n"
" sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
" loop convertARGBtoBGR_loop\n"
" leave\n"
" ret\n"
);
}
You should call it like so:
convertARGBtoBGR( &buffer, IMAGESIZE );
This function is accessing memory only twice per pixel/packet (1 read, 1 write) comparing to your brute force method that had (at least / assuming it was compiled to register) 3 read and 3 write operations. Method is the same but implementation makes it more efficent.

You can do it in chunks of 4 pixels, moving 32 bits with unsigned long pointers. Just think that with 4 32 bits pixels you can construct by shifting and OR/AND, 3 words representing 4 24bits pixels, like this:
//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR 32 bits writing (4 pixels)
Shifting operations are always done by 1 instruction cycle in all modern 32/64 bits processors (barrel shifting technique) so its the fastest way of constructing those 3 words for writing, bitwise AND and OR are also blazing fast.
Like this:
//assuming we have 4 ARGB1 ... ARGB4 pixels and 3 32 bits words, W1, W2 and W3 to write
// and *dest its an unsigned long pointer for destination
W1 = ((ARGB1 & 0x000f) << 24) | ((ARGB1 & 0x00f0) << 8) | ((ARGB1 & 0x0f00) >> 8) | (ARGB2 & 0x000f);
*dest++ = W1;
and so on.... with next pixels in a loop.
You'll need some adjusting with images that are not multiple of 4, but I bet this is the fastest approach of all, without using assembler.
And btw, forget about using structs and indexed access, those are the SLOWER ways of all for moving data, just take a look at a disassembly listing of a compiled C++ program and you'll agree with me.

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:
union uARGB
{
struct ARGB argb;
UInt32 x;
};
union uBGRA
{
struct
{
BGR bgr;
UInt8 Alpha;
} bgra;
UInt32 x;
};
and then for your code kernel, with whatever loop unrolling is appropriate:
inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
uARGB* puargb = (uARGB*)pargb;
uBGRA ubgra;
ubgra.x = __byte_reverse_32(pargb->x);
*pbgr = ubgra.bgra.bgr;
}
where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word.
To summarize the underlying approach:
view ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure

Although you can use some tricks based on CPU usage,
This kind of operations can be done fasted with GPU.
It seems that you use C/ C++... So your alternatives for GPU programming may be ( on windows platform )
DirectCompute ( DirectX 11 ) See this video
Microsoft Research Project Accelerator Check this link
Cuda
"google" GPU programming ...
Shortly use GPU for this kind of array operations for make faster calculations. They are designed for it.

I haven't seen anyone showing an example of how to do it on the GPU.
A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.
I started out with the smooth_opengl3.c example from the freeglut distribution.
The data is copied as YUV into the texture and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all macs nowadays and it will be significantly faster than all the CPU approaches.
Note that I have no experience on how you get the data back. In theory glReadPixels should read the data back but I never measured its performance.
OpenCL might be the easier approach, but then I will only start developing for that when I have a notebook that supports it.
(defparameter *vertex-shader*
"void main(){
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
gl_FrontColor = gl_Color;
gl_TexCoord[0] = gl_MultiTexCoord0;
}
")
(progn
(defparameter *fragment-shader*
"uniform sampler2D textureImage;
void main()
{
vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
float v=q.z;
if(int(gl_FragCoord.x)%2 == 0)
v=q.x;
float x=0; // 1./255.;
v-=.278431;
v*=1.7;
if(v>=(1.0-x))
gl_FragColor = vec4(255,0,0,255);
else if (v<=x)
gl_FragColor = vec4(0,0,255,255);
else
gl_FragColor = vec4(v,v,v,255);
}
")

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight