Optimizing a scanline conversion function for ARM

The code below converts a row from an 8-bit palettized format to 32-bit RGBA.
Before trying to implement it, I would like to know whether the code below is even suited to being optimized with Direct-Math or, alternatively, ARM NEON intrinsics or inline assembly. My first look at the documentation did not reveal anything that would cover the table-lookup part.
void CopyPixels(BYTE *pDst, BYTE *pSrc, int width,
                const BYTE mask, Color* pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGB value if not masked
                const Color* pColor = pColorTable + b;
                pDst[0] = pColor->Blue;
                pDst[1] = pColor->Green;
                pDst[2] = pColor->Red;
                pDst[3] = 0xFF;
            }
            // Skip to next pixel
            pDst += 4;
        }
        while (--width);
    }
}

You will need a LUT of size 256 * 4 bytes = 1024 bytes.
This kind of job is not suited for SIMD at all (except with gather loads, such as AVX2 gather on Intel's new Haswell core).
NEON can handle LUTs of at most 32 bytes in size with VTBL and VTBX, but those are more or less meant to work in conjunction with CLZs as starting values for Newton-Raphson iterations.
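To illustrate that 32-byte limit, a minimal sketch (assuming ARMv7 NEON; the table has to fit in four 64-bit d-registers):
#include <arm_neon.h>
// Look up 8 indices in a 32-byte table held across four 64-bit NEON registers.
// With VTBL, out-of-range indices (>= 32) produce 0; VTBX would leave them unchanged.
uint8x8_t lut32_lookup(uint8x8x4_t table32, uint8x8_t indices)
{
    return vtbl4_u8(table32, indices);
}
A 256-entry, 4-bytes-per-entry palette is far beyond that, which is why the lookup itself stays scalar.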

I agree with Jake that this isn't a great vector-processor problem, and it may be handled more efficiently by the ARM main pipeline. That doesn't mean that you couldn't optimize it in assembly (just plain ARMv7) for drastically improved results.
In particular, a simple improvement would be to construct your lookup table such that it can be used with a word-sized copy. This would involve making sure the Color struct follows the 32-bit RGBA format, including having the 4th 0xFF byte as part of the lookup, so that you can just do a single word copy. This could be a significant performance boost with no assembly required, since it is a single memory fetch rather than three (plus a constant assignment).
void CopyPixels(RGBA32Color *pDst, BYTE const *pSrc, int width,
                const BYTE mask, RGBA32Color const *pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGB value if not masked
                *pDst = pColorTable[b];
            }
            // Skip to next pixel
            pDst++;
        }
        while (--width);
    }
}
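For illustration, a minimal sketch of building such a word-sized table once up front (the RGBA32Color layout here is an assumption; it only has to match the destination byte order B, G, R, A used above):
typedef struct { BYTE B, G, R, A; } RGBA32Color;   // assumed 4-byte layout matching pDst[0..3]

void BuildRGBA32Table(RGBA32Color table32[256], const Color *pColorTable)
{
    for (int i = 0; i < 256; ++i) {
        table32[i].B = pColorTable[i].Blue;
        table32[i].G = pColorTable[i].Green;
        table32[i].R = pColorTable[i].Red;
        table32[i].A = 0xFF;                       // bake the constant alpha into the LUT
    }
}
The table is only 1 KiB, so it stays hot in L1 cache while the scanlines are converted.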

Related

Vectorize random init and print for BigInt with decimal digit array, with AVX2?

How could I pass my code to AVX2 code and get the same result as before?
Is it possible to use __m256i in the LongNumInit, LongNumPrint functions instead of uint8_t *L, or some similar type of variable?
My knowledge of AVX is quite limited; I investigated quite a bit, but I do not understand very well how to transform my code. Any suggestion and explanation is welcome.
I'm really interested in this code in AVX2.
void LongNumInit(uint8_t *L, size_t N)
{
    for (size_t i = 0; i < N; ++i) {
        L[i] = myRandom() % 10;
    }
}
void LongNumPrint(uint8_t *L, size_t N, uint8_t *Name)
{
    printf("%s:", Name);
    for (size_t i = N; i > 0; --i)
    {
        printf("%d", L[i-1]);
    }
    printf("\n");
}
int main(int argc, char **argv)
{
    int i, sum1, sum2, sum3, N = 10000, Rep = 50;
    seed = 12345;
    // obtain parameters at run time
    if (argc > 1) { N = atoi(argv[1]); }
    if (argc > 2) { Rep = atoi(argv[2]); }
    // Create Long Nums
    unsigned char *V1 = (unsigned char*) malloc(N);
    unsigned char *V2 = (unsigned char*) malloc(N);
    unsigned char *V3 = (unsigned char*) malloc(N);
    unsigned char *V4 = (unsigned char*) malloc(N);
    LongNumInit(V1, N); LongNumInit(V2, N); LongNumInit(V3, N);
    // Print last 32 digits of Long Numbers
    LongNumPrint(V1, 32, "V1");
    LongNumPrint(V2, 32, "V2");
    LongNumPrint(V3, 32, "V3");
    LongNumPrint(V4, 32, "V4");
    free(V1); free(V2); free(V3); free(V4);
    return 0;
}
The result that I obtain in my initial code is this:
V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000
This is a terrible format for BigInteger in general, see https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could/should do instead.
And see Can long integer routines benefit from SSE? for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.
But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.
We can vectorize both of these quite well. My LongNumPrintName() is a drop-in replacement for yours.
For LongNumInit I'm just showing a building-block that stores two 32-byte chunks and returns an incremented pointer. Call it in a loop. (It naturally produces 2 vectors per call so for small N you might make an alternate version.)
LongNumInit
What's the fastest way to generate a 1 GB text file containing random digits? generates space-separated random ASCII decimal digits at about 33 GB/s on 4GHz Skylake, including overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)
It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i) like AVX/SSE version of xorshift128+. That should be similar quality of randomness to your rand() % 10.
It breaks that up into decimal digits via a multiplicative inverse to divide and modulo by 10 with shifts and vpmulhuw, using Why does GCC use multiplication by a strange number in implementing integer division?. (Actually using GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts for convenient syntax like v16u dig1 = v % ten; and v /= ten;)
You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant, see below)
Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics).
See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.
#include <immintrin.h>
// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));

// returns p + size of stores.  Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digits(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);
    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;  // Basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.
    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    // dig4 would overlap much of the randomness that div6554 gets

    // __m256i or v16u assignment is an aligned store
    v16u *vecbuf = (v16u*)p;
    // pack 16->8 bits (explicit casts between vector types keep GCC happy without -flax-vector-conversions)
    vecbuf[0] = (v16u)_mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
    vecbuf[1] = (v16u)_mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
    return p + 2;  // always a constant number of full vectors
}
The logic in random_decimal_fill_buffer that inserts newlines can be totally removed because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.
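For example, a minimal caller sketch (next_random_vec() is a hypothetical placeholder for your vectorized PRNG, e.g. the AVX2 xorshift128+ from the linked answer; L is assumed 32-byte aligned and padded to a multiple of 64 bytes as discussed below):
__m256i next_random_vec(void);   // placeholder: returns a vector of random bits

void LongNumInit_avx2(uint8_t *L, size_t N)
{
    __m256i *outpos = (__m256i *)L;
    __m256i *end = (__m256i *)(L + ((N + 63) & ~(size_t)63));  // each call stores 64 digits
    while (outpos < end) {
        outpos = vec_store_digits(next_random_vec(), outpos);
    }
}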
Handling small sizes (less than a full vector):
It would be convenient to pad your malloc up to the next multiple of 32 bytes so it's always safe to do a 32-byte load without checking for maybe crossing into an unmapped page.
And use C11 aligned_alloc to get 32-byte aligned storage. So for example, aligned_alloc(32, (size+31) & -32). This lets us just do full 32-byte stores even if N is odd. Logically only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over to avoid any extra conditional checks for N being less than 32, or not a multiple of 32.
Unfortunately ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those: Why is there no 'aligned_realloc' on most platforms?, allowing you to sometimes allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++, which might need to run copy-constructors if non-trivially-copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.
LongNumPrint
Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.
Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.
Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.
You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially huge buffer. You probably still want to go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass that on to write() directly instead of memcpy-ing into the stdio buffer. You might want to play around with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache will mean it's still hot in cache when it's re-read after you wrote it.
Cache blocking makes the loop bounds a lot more complex but it's worth it for very large N.
Byte-reversing 32 bytes at a time:
You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition it would have to loop from the end backwards.
Your function could then be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.
Or better, use vmovdqu / vinserti128 16-byte loads to feed _mm256_shuffle_epi8 to byte-reverse within lanes, setting up for 32-byte stores.
On Intel CPUs, vinserti128 decodes to a load+ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if data were hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or in Ice Lake, 2 loads + 2 stores). We'll probably bottleneck on the front-end if there are no memory bottlenecks, so in practice we won't saturate the load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)
This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.
#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
    __m128i hi = _mm_loadu_si128(1 + (const __m128i*)src);  // unaligned loads
    __m128i lo = _mm_loadu_si128(src);
    __m256i v  = _mm256_set_m128i(lo, hi);                   // reverse 128-bit hi/lo halves

    // compilers will hoist constants out of inline functions
    __m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m256i byterev = _mm256_broadcastsi128_si256(byterev_lane);  // same in each lane

    v = _mm256_shuffle_epi8(v, byterev);            // in-lane reverse
    v = _mm256_or_si256(v, _mm256_set1_epi8('0'));  // digits to ASCII
    _mm256_storeu_si256(dst, v);                    // Will usually be aligned in practice.
}

// Tested for N=32; could be bugs in the loop bounds for other N
// returns bytes written, like fwrite: N means no error, 0 means error in all fwrites
size_t LongNumPrint( uint8_t *num, size_t N)
{
    // caller can print a name if it wants
    const int revbufsize = 8192;    // 8kiB on the stack should be fine
    alignas(32) char revbuf[revbufsize];

    if (N < 32) {
        // TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
        ASCIIrev32B(revbuf, num);   // the data we want is at the *end* of a 32-byte reverse
        return fwrite(revbuf + 32 - N, 1, N, stdout);
    }

    size_t bytes_written = 0;
    const uint8_t *inp = num + N;   // start with last 32 bytes of num[]
    do {
        size_t chunksize = (inp - num >= revbufsize) ? revbufsize : inp - num;
        const uint8_t *inp_stop = inp - chunksize + 32;  // leave one full vector for the end
        uint8_t *outp = revbuf;
        while (inp > inp_stop) {    // may run 0 times
            inp -= 32;
            ASCIIrev32B(outp, inp);
            outp += 32;
        }
        // reverse first (lowest address) 32 bytes of this chunk of num
        // into last 32 bytes of this chunk of revbuf
        // if chunksize%32 != 0 this will overlap, which is fine.
        ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
        bytes_written += fwrite(revbuf, 1, chunksize, stdout);
        inp = inp_stop - 32;
    } while (inp > num);
    return bytes_written;
    // caller can putchar('\n') if it wants
}

// wrapper that prints name and newline
void LongNumPrintName(uint8_t *num, size_t N, const char *name)
{
    printf("%s:", name);
    //LongNumPrint_scalar(num, N);
    LongNumPrint(num, N);
    putchar('\n');
}
// main() included on Godbolt link that runs successfully
// main() included on Godbolt link that runs successfully
This compiles and runs (on Godbolt) with gcc -O3 -march=haswell and produces identical output to your scalar loop for the N=32 that main passes. (I used rand() instead of MyRandom(), so we could test with the same seed and get the same numbers, using your init function.)
Untested for larger N, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid.
We could load (not just store) aligned vectors if we converted the first N%32 bytes and passed that to fwrite before starting the main loop. But that probably either leads to an extra write() system call, or to clunky copying inside stdio. (Unless there was already buffered text not printed yet, like Name:, in which case we already have that penalty.)
Note that it's technically C UB to decrement inp past the start of num. So inp -= 32 instead of inp = inp_stop - 32 would have that UB for the iteration that leaves the outer loop. I actually avoid that in this version, but it generally works anyway because I think GCC assumes a flat memory model and de facto defines the behaviour of pointer compares enough. And normal OSes reserve the zero page, so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address). This paragraph is mostly left over from the first totally untested attempt that I thought was decrementing the pointer farther in the inner loop than it actually was.

Vectorize equality test without SIMD

I would like to vectorize an equality test in which all elements in a vector are compared against the same value, and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit-packing the booleans is not an important detail in this problem.) This function can be written as:
#include <stdint.h>
#include <stdint.h>
void vecEq(uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
    for (int i = 0; i < len; i++) {
        results[i] = numbers[i] == target;
    }
}
If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then using SIMD's _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?
The approach I've considered is broadcasting the 8-bit word to a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work, though, because I cannot find a vectorized way to convert the result of the XOR (zero means equal, anything else means unequal) to the equality-test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:
void vecEq(uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
    uint64_t targetA = (uint64_t)target;
    uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32
                     | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
    for (int i = 0; i < len; i++) {
        uint64_t tmp = numbers[i] ^ targetB;
        results[i] = ... something with tmp ...;
    }
}
Further to the comments above (the code will vectorise just fine). If you are using AVX, the best strategy is usually just to use unaligned load/store intrinsics. They have no extra cost if your data does happen to be aligned, and are as cheap as the HW can make them for cases of misalignment. (On Intel CPUs, there's still a penalty for loads/stores that span two cache lines, aka a cache-line split).
Ideally you can still align your buffers by 32, but if your data has to come from L2 or L3 or RAM, misalignment often doesn't make a measurable difference. And the best strategy for dealing with possible misalignment is usually just to let the HW handle it, instead of scalar up to an alignment boundary or something like you'd do with SSE, or with AVX512 where alignment matters again (any misalignment leads to every load/store being a cache-line split).
Just use _mm256_loadu_si256 / _mm256_storeu_si256 and forget about it.
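A minimal sketch of that version (the signature mirrors the question's vecEq; a scalar tail handles any length that isn't a multiple of 32):
#include <immintrin.h>
#include <stdint.h>
void vecEq_avx2(const uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
    __m256i t   = _mm256_set1_epi8((char)target);
    __m256i one = _mm256_set1_epi8(1);
    int i = 0;
    for (; i <= len - 32; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i*)(numbers + i));
        __m256i eq = _mm256_cmpeq_epi8(v, t);            // 0xFF where equal, 0x00 elsewhere
        _mm256_storeu_si256((__m256i*)(results + i),
                            _mm256_and_si256(eq, one));  // squash 0xFF down to 1
    }
    for (; i < len; i++)                                 // scalar tail
        results[i] = numbers[i] == target;
}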
As an interesting aside, Visual C++ will no longer emit aligned loads or stores, even if you request them.
https://godbolt.org/z/pL9nw9 (e.g. vmovups instead of vmovaps)
If compiling with GCC, you probably want to use -march=haswell or -march=znver1 not just -mavx2, or at least -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store so 256-bit unaligned loads compile to single instructions. The CPUs that benefit from those tune=generic defaults don't support AVX2, for example Sandybridge and Piledriver.

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I work with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.
void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is a matrix actually; num_rows and width represent the number of rows and columns of the matrix
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float* p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
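For example, a hedged caller sketch (num_floats is assumed to be a multiple of 32 and output 32-byte aligned; the quant_mult scaling from the question would be folded in by multiplying each loaded __m256 by _mm256_set1_ps(quant_mult) before the cvtps step inside pack_float_int8):
void QuantizeToInt8(const float* input, int8_t* output, size_t num_floats) {
    for (size_t i = 0; i < num_floats; i += 32) {
        __m256i v = pack_float_int8(input + i);
        _mm256_store_si256((__m256i*)(output + i), v);   // aligned store of 32 int8_t
    }
}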
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard format for storing float values. First understand how float and double are stored in memory; then you will know how to convert a float or double to a char. It is quite simple.

Timer wraparound with flexible bit sizes

Given a counter/timer that increases and simply wraps at a given bit width, a well-known solution to the problem of finding the difference between two captured values of the counter (where the counter might have wrapped between the two points) is simply to perform unsigned subtraction on the counter (possibly then interpreting the result as signed if it's not known which one is larger).
For example given a 32-bit timer, code like this can be used to determine the length of time some code takes to run:
uint32_t start = GetSomePlatformSpecificTimer();
RunSomeOtherCode();
uint32_t end = GetSomePlatformSpecificTimer();
uint32_t platformTicksTakenByCode = end - start;
Or alternatively to check if some time limit has been reached:
uint32_t limit = GetSomePlatformSpecificTimer() + timeLimitInTicks;
while (true)
{
    bool finished = DoSomethingSmall();
    if (finished)
        break;
    if ((int32_t)(GetSomePlatformSpecificTimer() - limit) >= 0)
        return ERROR_TIMEOUT;
}
This works great if the timer is known to be 32 bits wide. It also can be adjusted for 16-bit or 8-bit timers by changing the types used.
Is there a similarly simple way to do the same thing where the timer size does not match a type size? For example, a 24-bit timer, or an 18-bit timer.
Assume that the bit size is <= 32 and is specified by a #define COUNTER_WIDTH in some external header (and might change).
Is the best solution to sign-extend the two counter values from COUNTER_WIDTH to 32-bits and then use the code above? I can see that possibly working for the FF -> 00 rollover but I think it would break the 7F -> 80 rollover, so presumably there would have to be some sort of check for this (perhaps sign-extending if the values are near zero and zero-extending if the values are near the midpoint). I think this also means that the difference between two values should be no more than a quarter of the counter range, otherwise it could cause issues.
Or is there a better way to do this?
Instead of sign-extending, you could multiply up so that the full range becomes the same size as your arithmetic type. In other words, use fixed-point arithmetic to fill the integer. In your case, with uint32_t, that would look like
uint32_t start = GetSomePlatformSpecificTimer();
RunSomeOtherCode();
uint32_t end = GetSomePlatformSpecificTimer();
start <<= 32-COUNTER_WIDTH;
end <<= 32-COUNTER_WIDTH;
uint32_t platformTicksTakenByCode = end - start;
platformTicksTakenByCode >>= 32-COUNTER_WIDTH;
Obviously you'd want to encapsulate that arithmetic:
const uint32_t start = GetScaledTimer();
RunSomeOtherCode();
const uint32_t end = GetScaledTimer();
const uint32_t platformTicksTakenByCode = RescaleDuration(end - start);
with
uint32_t GetScaledTimer()
{
    return GetSomePlatformSpecificTimer() << 32-COUNTER_WIDTH;
}
uint32_t RescaleDuration(uint32_t d)
{
    return d >> 32-COUNTER_WIDTH;
}
You then have much the same behaviour as for your full-width timer, and the same option to use signed types if necessary.
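For example, the time-limit check from the question becomes (a sketch; timeLimitInTicks must still fit in COUNTER_WIDTH bits, and the limit is kept in the scaled domain):
uint32_t limit = GetScaledTimer() + (timeLimitInTicks << (32-COUNTER_WIDTH));
while (true)
{
    bool finished = DoSomethingSmall();
    if (finished)
        break;
    if ((int32_t)(GetScaledTimer() - limit) >= 0)
        return ERROR_TIMEOUT;
}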

Turn a large chunk of memory backwards, fast

I need to rewrite about 4 KB of data in reverse order, at the bit level (the last bit of the last byte becoming the first bit of the first byte), as fast as possible. Are there any clever snippets to do it?
Rationale: The data is the display contents of an LCD screen in an embedded device that is usually positioned so that the screen is at about shoulder level. The screen has a "6 o'clock" orientation, i.e. it is meant to be viewed from below, as when lying flat or hanging above your eye level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by a library), which is 1 bit = 1 pixel, starting with the upper left of the screen. The CPU isn't very powerful and the device has enough work already, plus several frames a second would be desirable, so performance is an issue; RAM not so much.
edit:
Single core, ARM9 series. 64 MB (to be scaled down to 32 MB later), Linux. The data is pushed from system memory to the LCD driver over an 8-bit IO port.
The CPU is 32-bit and performs much better at this word size than at byte level.
There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. It informs the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words; if you're not, then this is just a start.
I also don't know which bit-shifting / rotation primitives are fast on the ARM and which are slow; this is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try -O2, -O3, and -Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.
This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.
void
copy_rev(unsigned int *restrict dest,
         unsigned int const *restrict src,
         unsigned int n)
{
    unsigned int i, x;
    for (i = 0; i < n; ++i) {
        x = src[i];
        x = (x >> 16) | (x << 16);
        x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
        x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
        x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
        x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
        dest[n-1-i] = x;
    }
}
This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.
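A hedged sketch of that variant, using __builtin_bswap32 (GCC/Clang) so the compiler can emit REV on cores that have it (ARMv6 and later):
static inline unsigned int bit_reverse_word(unsigned int x)
{
    x = __builtin_bswap32(x);  // byte swap; replaces the 16-bit and 8-bit swap steps above
    x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
    x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
    x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
    return x;
}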
Fastest way would probably to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.
Build a 256 element lookup table of byte values that are bit-reversed from their index.
{0x00, 0x80, 0x40, 0xc0, etc}
Then iterate through your array copying using each byte as an index into your lookup table.
If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.
You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).
Here's the basic code (not including the cache line optimization)
// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];

void ReverseBitsInPlace(BYTE * pb, int cb)
{
    int iter = cb/2;
    for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
    {
        BYTE b1 = g_RevBits[pb[ii]];
        pb[ii] = g_RevBits[pb[jj]];
        pb[jj] = b1;
    }

    if (cb & 1) // if the number of bytes was odd, swap the middle one in place
    {
        pb[cb/2] = g_RevBits[pb[cb/2]];
    }
}

// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
    0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
    0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,

const BYTE g_RevBits[256] = {
    BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C),
    BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E),
    BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D),
    BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F),
};
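A hedged sketch of the 16-byte-chunk idea mentioned before the code (cb is assumed to be a multiple of 32 here, which holds for a 4 KB framebuffer); it reverses the byte order across the buffer and bit-reverses each byte through g_RevBits, touching each cache line once per pass:
void ReverseBitsInPlaceChunked(BYTE *pb, int cb)
{
    for (int lo = 0, hi = cb - 16; lo < hi; lo += 16, hi -= 16)
    {
        BYTE tmp[16];
        for (int k = 0; k < 16; ++k) tmp[k] = g_RevBits[pb[lo + k]];
        for (int k = 0; k < 16; ++k) pb[lo + k] = g_RevBits[pb[hi + 15 - k]];
        for (int k = 0; k < 16; ++k) pb[hi + k] = tmp[15 - k];
    }
}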
The Bit Twiddling Hacks site is always a good starting point for these kinds of problems. Take a look here for fast bit reversal. Then it's up to you to apply it to each byte/word of your memory block.
EDIT:
Inspired by Dietrich Epp's answer and looking at the ARM instruction set, there is an RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.
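A minimal sketch using inline assembly (RBIT needs ARMv6T2/ARMv7 or later, so a plain ARM9 core would have to stay with the table approach; ACLE-capable compilers also expose this as __rbit in arm_acle.h):
static inline unsigned int rbit32(unsigned int x)
{
    unsigned int r;
    __asm__("rbit %0, %1" : "=r"(r) : "r"(x));
    return r;
}

// Reverse n words in place so the last bit of the last word becomes
// the first bit of the first word. n is assumed to be nonzero.
void reverse_bits_inplace(unsigned int *buf, unsigned int n)
{
    for (unsigned int i = 0, j = n - 1; i < j; ++i, --j) {
        unsigned int a = rbit32(buf[i]);
        unsigned int b = rbit32(buf[j]);
        buf[i] = b;
        buf[j] = a;
    }
    if (n & 1)
        buf[n / 2] = rbit32(buf[n / 2]);
}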
Loop through half of the array, converting and exchanging bytes:
for( int i = 0; i < arraySize / 2; i++ ) {
    char inverted1 = invert( array[i] );
    char inverted2 = invert( array[arraySize - i - 1] );
    array[i] = inverted2;
    array[arraySize - i - 1] = inverted1;
}
For conversion use a precomputed table: an array of 2^CHAR_BIT (CHAR_BIT will most likely be 8) elements, where at position "I" the result of inverting the byte with value "I" is stored. This will be very fast (one pass) and consume only 2^CHAR_BIT bytes for the table.
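A hedged sketch of such a table together with the invert() helper used above (names are illustrative):
#include <limits.h>
static unsigned char invTable[1 << CHAR_BIT];      // 256 entries when CHAR_BIT == 8

void initInvertTable(void)
{
    for (int v = 0; v < (1 << CHAR_BIT); ++v) {
        unsigned char r = 0;
        for (int b = 0; b < CHAR_BIT; ++b)
            r |= ((v >> b) & 1) << (CHAR_BIT - 1 - b);
        invTable[v] = r;
    }
}

static char invert(char byte)
{
    return (char)invTable[(unsigned char)byte];
}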
It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine: 7.6 seconds for a million array flips, single threaded. It prints some ASCII art based on patterns of 1s and 0s. I rotated the picture 180 degrees after reversing the bit array, using a graphics editor, and they look identical to me. A double-reversed image comes out the same as the original.
As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.
Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.
Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.
// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
// Manifest constants
#define uchar unsigned char
#define BUFF_BYTES 510  // 400 supports a display of 80x40 bits
#define DW 80           // Display Width
// ----------------------------------------------------------------------------
uchar mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void BitSet(uchar * BitArray, unsigned long BitNumber);
void BitClr(uchar * BitArray, unsigned long BitNumber);
void BitTog(uchar * BitArray, unsigned long BitNumber);
uchar BitGet(uchar * BitArray, unsigned long BitNumber);
void BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt) {
    unsigned long front = 0, back = BitKnt-1;
    uchar temp;
    while( front < back ) {
        temp = BitGet(Buff, front);              // copy front bit to temp before overwriting
        BitPut(Buff, front, BitGet(Buff, back)); // copy back bit to front bit
        BitPut(Buff, back, temp);                // copy saved value of front in temp to back of bit array
        front++;
        back--;
    }
    return Buff;
}
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[]) {
    int i, j, k, LoopKnt = 1000001;
    time_t start;
    uchar Buff[BUFF_BYTES];
    memset(Buff, 0, sizeof(Buff));
    // make an ASCII art picture
    for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++) {
        for(j=0; j<DW/2; j++) {
            BitSet(Buff, (i*DW)+j+k);
        }
        k++;
    }
    // print ASCII art picture
    for(i=0; i<sizeof(Buff); i++) {
        if(!(i % 10)) printf("\n");  // print bits in blocks of 80
        PrintIntBits(Buff[i], 8);
    }
    i = LoopKnt;
    start = clock();
    while( i-- ) {
        ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
    }
    // print ASCII art pic flipped upside-down and rotated left
    printf("\nMilliseconds elapsed = %d", clock() - start);
    for(i=0; i<sizeof(Buff); i++) {
        if(!(i % 10)) printf("\n");  // print bits in blocks of 80
        PrintIntBits(Buff[i], 8);
    }
    printf("\n\nBenchmark time for %d loops\n", LoopKnt);
    getchar();
    return 0;
}
// -----------------------------------------------------------------------------
// Scaffolding...
static void PrintIntBits(long x, int bits) {
    unsigned long long z = 1;
    int i = 0;
    z = z << (bits-1);
    for (; z > 0; z >>= 1) {
        printf("%s", ((x & z) == z) ? "#" : ".");
    }
}
// These routines do bit manipulations on a bit array of unsigned chars
// ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber) {
    buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitClr(uchar *buff, unsigned long BitNumber) {
    buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitTog(uchar *buff, unsigned long BitNumber) {
    buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
uchar BitGet(uchar *buff, unsigned long BitNumber) {
    return (uchar) ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
// ----------------------------------------------------------------------------
void BitPut(uchar *buff, unsigned long BitNumber, uchar value) {
    if(value) {  // if the bit at buff[BitNumber] is true
        BitSet(buff, BitNumber);
    } else {
        BitClr(buff, BitNumber);
    }
}
Below is the code listing for an optimization using a new buffer, instead of swapping bytes in place. Given that only 2030 of the 4080 BitSet()s are needed because of the if() test, and about half the BitGet()s and BitPut()s are eliminated by eliminating TEMP, I suspect memory access time is a large, fixed cost for these kinds of operations, providing a hard limit on optimization.
Using a look-up approach, and CONDITIONALLY swapping bytes rather than bits, reduces the number of memory accesses by a factor of 8, and testing for a 0 byte gets amortized across 8 bits rather than 1.
Using these two approaches together, testing whether the entire 8-bit char is 0 before doing ANYTHING, including the table lookup and the write, is likely the fastest possible approach, but it would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic, though.
// -----------------------------------------------------------------------------
// Reverse the bit ordering into a new array
uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt) {
    int front = 0, back = BitKnt - 1;
    memset(Dst, 0, BitKnt/BitsInByte);
    while( front < BitKnt ) {         // walk all BitKnt bits, copying back to front
        if(BitGet(Src, back--)) {     // memset() has already set all bits in Dst to 0,
            BitSet(Dst, front);       // so only set the Dst bit if the Src bit is 1
        }
        front++;
    }
    return Dst;
}
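A hedged sketch of the byte-at-a-time variant described above, reusing a 256-entry bit-reversal LUT such as g_RevBits from the earlier answer (ByteKnt is the buffer size in bytes):
uchar *ReverseBytesToNewArray(uchar *Dst, const uchar *Src, const int ByteKnt) {
    memset(Dst, 0, ByteKnt);                 // zeroed up front, so zero source bytes need no store
    for (int i = 0; i < ByteKnt; i++) {
        uchar b = Src[ByteKnt - 1 - i];
        if (b) {                             // test the whole byte before any lookup or write
            Dst[i] = g_RevBits[b];
        }
    }
    return Dst;
}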
To reverse a single byte x you can handle the bits one at a time:
unsigned char a = 0;
for (i = 0; i < 8; ++i) {
    a += (unsigned char)(((x >> i) & 1) << (7 - i));
}
You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.
Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.
Single Core?
How much memory?
Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screen's memory?
The data is pushed from system memory to the LCD driver over an 8-bit IO port.
Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along those lines should be faster than any of the other answers:
void send_to_LCD(uint8_t* data, int len, bool rotate) {
    if (rotate)
        for (int i = len-1; i >= 0; i--)
            write(reverse(data[i]));
    else
        for (int i = 0; i < len; i++)
            write(data[i]);
}
Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.
This approach avoids the need to store two copies of the video data in RAM and also avoids the read-invert-write round trip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may even be able to do it for you.

Resources