Few extremely large values when using RDTSCP in C - c

I am using the following function to measure clock cycles taken by a simple operation:
uint64_t rdtscp(void) {
uint32_t lo, hi;
__asm__ volatile ("rdtscp"
: /* outputs */ "=a" (lo), "=d" (hi)
: /* no inputs */
: /* clobbers */ "%rcx");
return (uint64_t)lo | (((uint64_t)hi) << 32);
}
To avoid taking into account possible delay due to interrupts, I parsed /proc/interrupts file in the read_intr() function and considered the value only if there were no new interrupts on CPU 0.
These functions are used to measure the cycles taken by the operation in a loop:
for (i = 0; i < 1000; i++)
{
for (j = 0; j < 256; j++)
{
rnd1 = rand() % 256;
intr1 = read_intr();
start = rdtscp();
rnd1 + 1; // Operation
end = rdtscp();
intr2 = read_intr();
if (intr1 == intr2)
{
printf("%lu\n", end - start);
fflush(stdout);
}
}
}
I then ran the above program using:
$ sudo schedtool -F -p 99 -e taskset 0x1 ./a.out > output
That is, the program is scheduled using SCHED_FIFO with 99 priority and forced to run on CPU 0.
I analyzed the output counter values to obtain the following statistical parameters about the execution cycles taken:
Mean = 36.49
Median = 33
However, when the output is sorted in descending order the first few values are:
32593
29081
28878
1113
1065
612
What is the reason for these (very few) extremely high counter readings even after interrupts are being considered? Also, given that the first three sorted readings are much higher than the following three (shown), could this be because of different reasons?
Edit: Based on gudok's comment, I turned off hyper-threading and also automatic power management from the BIOS settings, but still find a discrepancy as mentioned below.

Related

What is responsible for a RAM bottleneck when performing random memory accesses?

I have a program which accesses single bytes in a large array at random. Since this array exceeds the L2 cache, it requires many queries to RAM. I created a benchmark which emulates this program by generating random numbers and querying a large array. It seems like the benchmark is under-performing relative to my RAM's advertised speed.
I have DDR4-2933 RAM which is supposed to handle 2933 MT/s. The maximum transfer rate I have been able to achieve is 56.8 MT/s. What would be the bottleneck preventing faster execution? If I had to speculate, I might say the CPU could only reorder instructions within some fixed window and this would limit the parallelization of the fetches. Although, I have no evidence beyond the benchmarks.
Benchmark & Methodology
I created a short program which populates a large array with random numbers. Then, it loads values at random offsets from the array and XORs them together. Memory accesses should be reorderable.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static uint64_t s[4];
uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 3) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
clock_t start = clock();
double cpu_time_used;
uint8_t val = 0;
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
cpu_time_used = ((double) (clock() - start)) / CLOCKS_PER_SEC;
printf("time: %.3f\n", cpu_time_used);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", cpu_time_used / calls);
printf("MT / sec: %.3e\n", calls / cpu_time_used / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
All benchmarks were executed on a Intel(R) Core(TM) i9-10885H CPU running at 2.40 GHz with CPU scaling disabled. The program was compiled with gcc -O3 main.c -o main -g -O3 -flto and called with nice -n -2 ./main 31 1000000000 for an array of size 2^31 and 1e9 accesses. The number of transactions per second issued to RAM can be approximated by the number of bytes fetched since virtually all of the lookups result in cache missed. I tested this using perf and it seemed like around 95% of cache lookups resulted in misses.
The random number generator occupies about 7.1% of program overhead. This was measured by removing the line val ^= buf[idx]; which executes the fetch. perf reported around 99% of memory overhead was due to last-level cache misses.
Follow-up Benchmarks
Loop unrolling:
By default, GCC 12.2.0 did not unroll the loop. It took some cajoling. I replaced:
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
with:
for(size_t i = 0; i < batches; i++) {
#pragma GCC unroll 3
for(size_t j = 0; j < 8; j++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
}
I look five timings at 2.0 GHz. The version without loop unrolling seemed to perform better but not significantly. I'm not very surprised since the branch would seem highly predictable.
no unrolling unrolling
0 23.249 24.287
1 24.044 24.520
2 23.455 24.358
3 23.233 24.167
4 22.969 23.833
Multi-processing
I implemented a threaded version of the benchmark. It evenly divides the accesses among the threads. I also switch to measuring wall clock time. Here are the measurements (taken at 2.4 GHz):
threads time MT/s
0 1 17.345 57.653502
1 2 8.949 111.744329
2 4 5.022 199.123855
3 8 3.533 283.045570
4 16 3.203 312.207306
5 32 3.200 312.500000
6 64 3.173 315.159155
7 124 3.165 315.955766
Modified program:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
#include <threads.h>
#include <pthread.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static thread_local uint64_t s[4];
static uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
struct xor_args {
size_t iters;
uint8_t * buf;
uint8_t res;
uint64_t smask;
pthread_t id;
};
#include <sys/random.h>
void * xor_worker(struct xor_args * a) {
if(-1 == getrandom(&s, sizeof(s), GRND_RANDOM)) {
fprintf(stderr, "getrandom() failed\n");
exit(1);
}
uint8_t val = 0;
for(size_t i = 0; i < a->iters; i++) {
uint64_t idx = next() & a->smask;
val ^= a->buf[idx];
}
a->res = val;
return NULL;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 4) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs] [threads]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
int threads = strtol(argv[3], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
struct xor_args * args = malloc(sizeof(struct xor_args) * threads);
int s;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
arg->iters = calls / (uint64_t) threads;
if(i == 0) {
args->iters += calls % threads;
}
arg->smask = smask;
arg->buf = buf;
s = pthread_create(&arg->id, NULL, (void * (*)(void *)) xor_worker, arg);
if(s != 0) {
fprintf(stderr, "pthread_create() failed");
exit(1);
}
}
uint8_t val = 0;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
s = pthread_join(arg->id, NULL);
if(s != 0) {
fprintf(stderr, "pthread_join() failed");
exit(1);
}
val ^= arg->res;
}
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("time: %.3f\n", elapsed);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", elapsed / calls);
printf("M T / sec: %.3e\n", calls / elapsed / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
The program is bound by the memory hierarchy. Indeed, on modern processor, accessing 1 byte from the RAM causes the whole cache line to be fetched from the RAM. This means 64 bytes on your processor. One channel of DDR4-2933 RAM can reach a theoretical bandwidth of 8*2933e6/1024**3 = 21.85 GiB/s. However, regarding the fact that the cache is useless here and that 64 bytes are fetched, the maximum throughput is 21.85/64*1024 = 350 MiB/s. Based on this, the program should at least take 1000000000/(350*1024**2) = 2.72 s.
In practice, no modern processor can fully saturate a DDR4 memory (especially using only 1 core). Modern RAM are designed to be fast mainly for contiguous access, not random one, despite the name. This is because speeding up random accesses is very hard and reading contiguous data is frequent (and far simpler to optimize at the hardware level).
The main problem with DDR RAM is its latency which is huge since several decades (50-120 ns per cache line fetch). It has not changed much since the last decade while processors have become significantly faster. As a result, it is critical to be able to mitigate this latency by sending a lot of simultaneous requests, that is increasing the concurrency. However, the execution flow is sequential and the number of simultaneous in-flight requests that 1 core can achieve is limited. Modern processors can execution few instructions in parallel from a sequential program as long as the instructions to be executed are independent. The thing is the program is pretty sequential due to the random number generator though multiple instructions can still be executed in parallel.
Multiple read requests can be done concurrently, but the size of the buffers to receive data from the RAM and the caches is limited. Intel processor units dedicated for that is the line-fill buffer (LFB) between the L1 and L2 cache, the super-queue between the L2 and the L3, and the integrated memory controller (iMC) between the L3 and the RAM. You can use perf to check which one is the bottleneck. AFAIK, the LFB has about 12-14 slots on your target processor so it should not be a bottleneck here. The super-queue can read 32-bytes/cycle, that is, 1 cache-line every 2 cycle, or 1 byte of buf every 2 cycles. That being said the L3 latency is pretty big: at least about a dozen of nanoseconds. In general, prefetching units are used to mitigate the latency but random accesses are unpredictable so hardware prefetchers are completely useless in this case.
Virtual memory make random accesses slower. Indeed, when a cache-line is fetched, the processor needs to translate a virtual address to a physical one. This is done by the Translation Lookaside Buffer (TLB) unit of a processor core. It is basically a cache able to translate the address of a virtual memory page to a physical one. There are multiple TLB units for different caches and the size of the cache is dependent of the size of the pages. Pages are typically 4K ones on your processor unless your OS decides to use huge pages in this specific case. Assuming its does not, the biggest TLB (STLB) has 1536 entries for 4K pages. This means it cannot be efficient for random accesses done on a 1536*4K = 6 MiB buffer. Beyond this limit, the number of TLB miss will be frequent. When a TLB miss happen, the processor needs to call special kernel functions so to know how to translate the virtual addresses based on kernel data structures. This operation is very expensive (like any kernel calls in general) so it introduces a significant additional latency. If huge pages are used, then the threshold is 3 GiB (for pages of 2 MiB) or even 16 GiB (for 1 GiB pages). Using huge pages can thus significantly speed things up compared to basic 4K pages.
Assuming the processor has large enough buffers and TLB caches so not to limit the concurrency, the DDR4-SDRAM are typically the main bottleneck. Indeed, it does not only has a huge latency, but it is also optimized for contiguous accesses. Indeed, such RAM devices are split in banks so to reduce the device latency. Contiguous memory requests can be efficiently managed by multiple banks, but random requests are often significantly less efficient due to bank-conflict: when 2 requests are assigned to the same banks, they are processed serially rather than concurrently. While random requests tends to to be spread amongst banks, this load-balancing is not as good as contiguous request resulting in a bit lower throughout.
In the end, the speed of the program is certainly latency-bound and the time taken is calls * latency / concurrency. It looks like latency / concurrency is 10-15 ns in your case.
One solution is to speed up a bit this program is to use prefetching instructions. It often help to hide the latency by telling the processor to prefetch data in the caches (or temporary buffers) early so subsequent accesses are faster since data are supposed to be closer from computing units. This optimization is brittle : data should be prefetched early enough (otherwise the processor will stall waiting due to cache misses) and data should be kept in the target caches (not be evicted by newer prefetching instructions). In your case, this is not so easy since the index is random. Prefetching relatively large chunks so to then read the values faster should help a bit. It should be slower for small arrays and faster for big ones. Note that software prefetching is not a silver-bullet : it is generally not as good as hardware prefetching (due to the limited hardware concurrency).
An alternative solution is generally to use multiple cores (thanks to multiple threads). That being said, this is not easy too here due to the (inherently sequential) random number generator (RNG) unless it is Ok to use multiple RNG. In this case, fetching full cache lines is generally the main bottleneck due to the high concurrency, the small RAM bandwidth and mainly due to the poor efficiency (1 byte used over 64). Unfortunately, there is currently no way to efficiently extract single bytes from a DDR4-SDRAM using any mainstream x86-64 processor (including yours).
For more information, please read:
What Every Programmer Should Know About Memory
How much of ‘What Every Programmer Should Know About Memory’ is still valid?
How does the CPU cache affect the performance of a C program

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
if (mask & (1ull << i)) {
target[i]++
}
}
The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];
int main(int argc, char *argv[]) {
int i, j;
unsigned long x = 123;
unsigned long m = 1;
char *p = malloc(8 * 10000000);
if (!p) {
printf("failed to allocate\n");
exit(0);
}
memset(p, 0xff, 80000000);
printf("p=%p\n", p);
unsigned long *pLong = (unsigned long*)p;
double start = getTS();
for (j = 0; j < 10000000; j++) {
m = 1;
for (i = 0; i < 64; i++) {
if ((pLong[j] & m) == m) {
target[i]++;
}
m = (m << 1);
}
}
printf("took %f secs\n", getTS() - start);
return 0;
}
Thanks in advance!
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocate\n");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%p\n", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }\n");
printf("took %f secs\n", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
Related:
an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.
Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.
It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy is to widen gradually
Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t x = *(input++); // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555; // 0b...01010101;
uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo; // can do up to 3 iterations of this without overflow
accum2_hi += hi; // because a 2-bit integer overflows at 4
Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32 and accumulate into the array in memory because you'll run out of registers anyway, and this outer outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize).
Biggest downside: this doesn't auto-vectorize, unlike #njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from #njuffa's code.
But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)
With good manual vectorization, this should be a factor of almost 2 or 4 faster.
But if you have to choose between this scalar or #njuffa's with AVX2 autovectorization, #njuffa's is faster on Skylake with -march=native
If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i
void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
const uint64_t *endp = input + length - 3*4*21; // 252 masks per outer iteration
while(input <= endp) {
uint64_t accum8[8] = {0}; // 8-bit accumulators
for (int k=0 ; k<21 ; k++) {
uint64_t accum4[4] = {0}; // 4-bit accumulators can hold counts up to 15. We use 4*3=12
for(int j=0 ; j<4 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;
uint64_t lo = x & even_1bits; // 0b...01010101;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this without overflow
}
const uint64_t even_2bits = 0x3333333333333333;
accum4[0] += accum2_lo & even_2bits; // 0b...001100110011; // same constant 4 times, because we shift *first*
accum4[1] += (accum2_lo >> 2) & even_2bits;
accum4[2] += accum2_hi & even_2bits;
accum4[3] += (accum2_hi >> 2) & even_2bits;
}
for (int i = 0 ; i<4 ; i++) {
accum8[i*2 + 0] += accum4[i] & 0x0f0f0f0f0f0f0f0f;
accum8[i*2 + 1] += (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
}
}
// char* can safely alias anything.
unsigned char *narrow = (uint8_t*) accum8;
for (int i=0 ; i<64 ; i++){
target[i] += narrow[i];
}
}
/* target[0] = bit 0
* target[1] = bit 8
* ...
* target[8] = bit 1
* target[9] = bit 9
* ...
*/
// TODO: 8x8 transpose
}
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
Benchmark results on Arch Linux i7-6700k from #njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results.
But the printed counts match another position of the reference array.)
I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).
ref: the best bit-loop (next section)
fast: #njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but #njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.
I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.
But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.
Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.
(vs. 0.01 sec for #njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m != 0);, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using #chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().
One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfers the counts form the byte-wise accumulator into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 # 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
int main (void)
{
double start_ref, stop_ref, start, stop;
uint64_t *pLong;
unsigned int target_ref [BITS] = {0};
unsigned int target [BITS] = {0};
int i, j;
pLong = malloc (sizeof(pLong[0]) * N);
if (!pLong) {
printf("failed to allocate\n");
return EXIT_FAILURE;
}
printf("p=%p\n", pLong);
/* init data */
for (j = 0; j < N; j++) {
pLong[j] = KISS64;
}
/* count bits slowly */
start_ref = second();
for (j = 0; j < N; j++) {
uint64_t m = 1;
for (i = 0; i < BITS; i++) {
if ((pLong[j] & m) == m) {
target_ref[i]++;
}
m = (m << 1);
}
}
stop_ref = second();
/* count bits fast */
start = second();
for (j = 0; j < N / BLOCK_SIZE; j++) {
sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
}
sum_block (pLong, target, j * BLOCK_SIZE, N);
stop = second();
/* check whether result is correct */
for (i = 0; i < BITS; i++) {
if (target[i] != target_ref[i]) {
printf ("error # %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
}
}
/* print benchmark results */
printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
return EXIT_SUCCESS;
}
For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.
After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).
At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.
typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;
static inline __m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
const __m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: byte == 0 ? 0 : -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}
void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
{
__m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
for (size_t inner = 0; inner < block_size; ++inner) {
for (size_t j = 0; j < bits / 32; j++)
{
const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
}
}
for (size_t j = 0; j < bits; j++)
{
accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
}
}
for (size_t outer = count - (count % block_size); outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by #njuffa.
See
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
Positional popcnt for 16-bit subwords is implemented here:
https://github.com/mklarqvist/positional-popcount
// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry
Note: the accumulate step for positional-popcnt is more expensive than for normal simd popcnt. Which I believe makes it feasible to add a couple of half-adders to the end of the CSU, it might pay to go all the way up to 256 words before accumulating.

cpu cacheline and prefetch policy

I read this article http://igoro.com/archive/gallery-of-processor-cache-effects/. The article said that because cacheline delay, the code:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
will almost have same execute time, and I wrote some sample c code to test it. I run the code on Xeon(R) E3-1230 V2 with Ubuntu 64bit, ARMv6-compatible processor rev 7 with Debian, and also run it on Core 2 T6600. All results are not what the article said.
My code is as follows:
long int jobTime(struct timespec start, struct timespec stop) {
long int seconds = stop.tv_sec - start.tv_sec;
long int nsec = stop.tv_nsec - start.tv_nsec;
return seconds * 1000 * 1000 * 1000 + nsec;
}
int main() {
struct timespec start;
struct timespec stop;
int i = 0;
struct sched_param param;
int * arr = malloc(LENGTH * 4);
printf("---------sieofint %d\n", sizeof(int));
param.sched_priority = 0;
sched_setscheduler(0, SCHED_FIFO, &param);
//clock_gettime(CLOCK_MONOTONIC, &start);
//for (i = 0; i < LENGTH; i++) arr[i] *= 5;
//clock_gettime(CLOCK_MONOTONIC, &stop);
//printf("step %d : time %ld\n", 1, jobTime(start, stop));
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i += 2) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 2, jobTime(start, stop));
}
Each time I choose one piece to compile and run (comment one and uncomment another).
compile with:
gcc -O0 -o cache cache.c -lrt
On Xeon I get this:
step 1 : 258791478
step 2 : 97875746
I want to know whether or not what the article said was correct? Alternatively, do the newest cpus have more advanced prefetch policies?
Short Answer (TL;DR): you're accessing uninitialized data, your first loop has to allocate new physical pages for the entire array within the timed loop.
When I run your code and comment each of the sections in turn, I get almost the same timing for the two loops. However, I do get the same results you report when I uncomment both sections and run them one after the other. This makes me suspect you also did that, and suffered from cold start effect when comparing the first loop with the second. It's easy to check - just replace the order of the loops and see if the first is still slower.
To avoid, either pick a large enough LENGTH (depending on your system) so that you dont get any cache benefits from the first loop helping the second, or just add a single traversal of the entire array that's not timed.
Note that the second option wouldn't exactly prove what the blog wanted to say - that memory latency masks the execution latency, so it doesn't matter how many elements of a cache line you use, you're still bottlenecked by the memory access time (or more accurately - the bandwidth)
Also - benchmarking code with -O0 is a really bad practice
Edit:
Here's what i'm getting (removed the scheduling as it's not related).
This code:
for (i = 0; i < LENGTH; i++) arr[i] = 1; // warmup!
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i++) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 1, jobTime(start, stop));
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i+=16) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
Gives :
---------sieofint 4
step 1 : time 58862552
step 16 : time 50215446
While commenting the warmup line gives the same advantage as you reported on the second loop:
---------sieofint 4
step 1 : time 279772411
step 16 : time 50615420
Replacing the order of the loops (warmup is still commented) shows it's indeed not related to the step size but to the ordering:
---------sieofint 4
step 16 : time 250033980
step 1 : time 59168310
(gcc version 4.6.3, on Opteron 6272)
Now a note about what's going on here - in theory, you'd expect warmup to be meaningful only when the array is small enough to sit in some cache - in this case the LENGTH you used is too big even for the L3 on most machines. However, you're forgetting the pagemap - you didn't just skip warming the data itself - you avoided initializing it in the first place. This can never give you meaningful results in real life, but since this a benchmark you didn't notice that, you're just multiplying junk data for the latency of it.
This means that each new page you access on the first loop doesn't only go to memory, it would probably get a page fault and have to call the OS to map a new physical page for it. This is a lengthy process, multiplies by the number of 4K pages you use - accumulating to a very long time. At this array size you can't even benefit from TLBs (you have 16k different physical 4k pages, way more than most TLBs can support even with 2 levels), so it's just the question of the fault flows. This can probably be measures by any profiling tool.
The second iteration on the same array won't have this effect and would be much faster - even though is still has to do a full pagewalk on each new page (that's done purely in HW), and then fetch the data from memory.
By the way, this is also the reason when you benchmark some behavior, you repeat the same thing multiple times (in this case it would have solved your problem if you had run over the array several time with the same stride, and ignored the first few rounds).

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.
Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;
#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#define ENABLE_PREFETCH
#define f_vector __m128d
#define i_ptr size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
// To be super-optimized later.
f_vector *stop = A + L;
do{
f_vector tmpA = *A;
f_vector tmpB = *B;
*A++ = tmpB;
*B++ = tmpA;
}while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
// Transposes T.
// T contains x columns and x rows.
// Each unit is of size (block * sizeof(f_vector)) bytes.
//Conditions:
// - 0 < block
// - 1 < x
i_ptr row_size = block * x;
i_ptr iter_size = row_size + block;
// End of entire matrix.
f_vector *stop_T = T + row_size * x;
f_vector *end = stop_T - row_size;
// Iterate each row.
f_vector *y_iter = T;
do{
// Iterate each column.
f_vector *ptr_x = y_iter + block;
f_vector *ptr_y = y_iter + row_size;
do{
#ifdef ENABLE_PREFETCH
_mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
swap_block(ptr_x,ptr_y,block);
ptr_x += block;
ptr_y += row_size;
}while (ptr_y < stop_T);
y_iter += iter_size;
}while (y_iter < end);
}
int main(){
i_ptr dimension = 4096;
i_ptr block = 16;
i_ptr words = block * dimension * dimension;
i_ptr bytes = words * sizeof(f_vector);
cout << "bytes = " << bytes << endl;
// system("pause");
f_vector *T = (f_vector*)_mm_malloc(bytes,16);
if (T == NULL){
cout << "Memory Allocation Failure" << endl;
system("pause");
exit(1);
}
memset(T,0,bytes);
// Perform in-place data transpose
cout << "Starting Data Transpose... ";
clock_t start = clock();
transpose_even(T,block,dimension);
clock_t end = clock();
cout << "Done" << endl;
cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
_mm_free(T);
system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 # 2.8 GHz, 8 GB DDR3 # 1333 MHz
Prefetch : 0.868
No Prefetch: 0.960
Intel Core i7 920 # 3.5 GHz, 12 GB DDR3 # 1333 MHz
Prefetch : 0.725
No Prefetch: 0.822
Intel Core i7 2600K # 4.6 GHz, 16 GB DDR3 # 1333 MHz
Prefetch : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 # 3.2 GHz, 64 GB DDR2 # 800 MHz
Prefetch : 2.273
No Prefetch: 2.666
Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
#ifdef DO_PREFETCH
// low path
__builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
// high path
__builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else if(array[mid] > key)
high = mid-1;
}
return -1;
}
int main() {
int SIZE = 1024*1024*512;
int *array = malloc(SIZE*sizeof(int));
for (int i=0;i<SIZE;i++){
array[i] = i;
}
int NUM_LOOKUPS = 1024*1024*8;
srand(time(NULL));
int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
for (int i=0;i<NUM_LOOKUPS;i++){
lookups[i] = rand() % SIZE;
}
for (int i=0;i<NUM_LOOKUPS;i++){
int result = binarySearch(array, SIZE, lookups[i]);
}
free(array);
free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.
I learned a lot from the excellent answers provided by #JamesScriven and #Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for #Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the #JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
//set up the oracle - let see it in the future oracle_offset steps
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//use oracle and prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work, the less the better
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index); #update oracle
index=next(mem[index]); #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
}
}
Some remarks:
data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
std::vector<int> mem;
double result;
int oracle_offset;
void operator()(){
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work:
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index);
index=next(mem[index]);
}
}
Worker(std::vector<int> &mem_):
mem(mem_), result(0.0), oracle_offset(0)
{}
};
template <typename Worker>
double timeit(Worker &worker){
auto begin = std::chrono::high_resolution_clock::now();
worker();
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
int main() {
//set up the data in special way!
std::vector<int> keys(SIZE);
for (int i=0;i<SIZE;i++){
keys[i] = i;
}
Worker<false> without_prefetch(keys);
Worker<true> with_prefetch(keys);
std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
std::cout<<std::setprecision(17);
for(int i=0;i<20;i++){
//let oracle see i steps in the future:
without_prefetch.oracle_offset=i;
with_prefetch.oracle_offset=i;
//calculate:
double time_with_prefetch=timeit(with_prefetch);
double time_no_prefetch=timeit(without_prefetch);
std::cout<<i<<"\t"
<<time_no_prefetch<<"\t"
<<time_with_prefetch<<"\t"
<<(time_no_prefetch/time_with_prefetch)<<"\n";
}
}
From the documentation:
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}
Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.
For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.
Pre-fetching an uint32_t[32] original code example...
int addrindex = &B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.
My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

fast way to check if an array of chars is zero [duplicate]

This question already has answers here:
Faster approach to checking for an all-zero buffer in C?
(19 answers)
Closed 4 years ago.
I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?
Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.
In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:
int sum = 0;
for (i = 0; i < ARRAY_SIZE; ++i) {
sum |= array[i];
}
if (sum != 0) {
printf("At least one array element is non-zero\n");
}
However, with today's pipelined super-scalar processor designs complete with branch prediction, all non-SSE approaches are virtualy indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could be, in the long run, more efficient than the sum |= array[i] approach (which always traverses the entire array) unless, that is, you expect your array to be almost always made up exclusively of zeroes (in which case making the sum |= array[i] approach truly branchless by using GCC's -funroll-loops could give you the better numbers -- see the numbers below for an Athlon processor, results may vary with processor model and manufacturer.)
#include <stdio.h>
int a[1024*1024];
/* Methods 1 & 2 are equivalent on x86 */
int main() {
int i, j, n;
# if defined METHOD3
int x;
# endif
for (i = 0; i < 100; ++i) {
# if defined METHOD3
x = 0;
# endif
for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
# if defined METHOD1
if (a[j] != 0) { n = 1; }
# elif defined METHOD2
n |= (a[j] != 0);
# elif defined METHOD3
x |= a[j];
# endif
}
# if defined METHOD3
n = (x != 0);
# endif
printf("%d\n", n);
}
}
$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real 0m0.377s
user 0m0.372s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real 0m0.351s
user 0m0.348s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real 0m0.343s
user 0m0.340s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real 0m0.209s
user 0m0.206s
sys 0m0.003s
Here's a short, quick solution, if you're okay with using inline assembly.
#include <stdio.h>
int main(void) {
int checkzero(char *string, int length);
char str1[] = "wow this is not zero!";
char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
printf("%d\n", checkzero(str1, sizeof(str1)));
printf("%d\n", checkzero(str2, sizeof(str2)));
}
int checkzero(char *string, int length) {
int is_zero;
__asm__ (
"cld\n"
"xorb %%al, %%al\n"
"repz scasb\n"
: "=c" (is_zero)
: "c" (length), "D" (string)
: "eax", "cc"
);
return !is_zero;
}
In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in a register, and ask the processor to scan the string for a zero (we specify this by setting the lower 8 bits of the accumulator, namely %%al, to zero), reducing the value of said register on each iteration, until a non-zero byte is encountered. Now, if the string was all zeroes, the register, too, will be zero, since it was decremented length number of times. However, if a non-zero value was encountered, the "loop" that checked for zeroes terminated prematurely, and hence the register will not be zero. We then obtain the value of that register, and return its boolean negation.
Profiling this yielded the following results:
$ time or.exe
real 0m37.274s
user 0m0.015s
sys 0m0.000s
$ time scasb.exe
real 0m15.951s
user 0m0.000s
sys 0m0.046s
(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)
If you want to do this in 32-bit C, probably just loop over the array as a 32-bit integer array and compare it to 0, then make sure the stuff at the end is also 0.
Split the checked memory half, and compare the first part to the second.
a. If any difference, it can't be all the same.
b. If no difference repeat for the first half.
Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)
bool memcheck(char* p, char chr, size_t size) {
// Check if first char differs from expected.
if (*p != chr)
return false;
int near_half, far_half;
while (size > 1) {
near_half = size/2;
far_half = size-near_half;
if (memcmp(p, p+far_half, near_half))
return false;
size = far_half;
}
return true;
}
If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.
Make sure to use cache prefetching for a decent distance ahead (i.e. 1-2K) with something like __dcbt or prefetchnta (or prefetch0 if you are going to use the buffer again soon).
You will also want to do something like SIMD or SWAR to or multiple bytes at a time. Even with 32-bit words, it will be 4X less operations than a per character version. I'd recommend unrolling the or's and making them feed into a "tree" of or's. You can see what I mean in my code example - this takes advantage of superscalar capability to do two integer ops (the or's) in parallel by making use of ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1) but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.
The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. Also you will possibly need to do unaligned check before if your buffer is not 4-byte aligned and after if your buffer (after alignment) is not a multiple of 32-bytes in length.
const UINT32 *pmem = ***aligned-buffer-pointer***;
UINT32 a0,a1,a2,a3;
while(bytesremain >= 32)
{
// Compare an aligned "line" of 32-bytes
a0 = pmem[0] | pmem[1];
a1 = pmem[2] | pmem[3];
a2 = pmem[4] | pmem[5];
a3 = pmem[6] | pmem[7];
a0 |= a1; a2 |= a3;
pmem += 8;
a0 |= a2;
bytesremain -= 32;
if(a0 != 0) break;
}
if(a0!=0) then ***buffer-is-not-all-zeros***
I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:
int is_empty1(unsigned char * buf, int size)
{
int i;
for(i = 0; i < size; i++) {
if(buf[i] != 0) return 0;
}
return 1;
}
int is_empty2(unsigned char * buf, int size)
{
int sum = 0;
for(int i = 0; i < size; i++) {
sum |= buf[i];
}
return sum == 0;
}
Results:
All results, in microseconds:
is_empty1 is_empty2
MEDIAN 0.350 3.554
AVG 1.636 3.768
only false results:
is_empty1 is_empty2
MEDIAN 0.003 3.560
AVG 0.382 3.777
only true results:
is_empty1 is_empty2
MEDIAN 3.649 3,528
AVG 3.857 3.751
Summary: only for datasets where the probability of false results is very small, the second algorithm using ORing performs better, due to the omitted branch. Otherwise, returning early is clearly the outperforming strategy.
Rusty Russel's memeqzero is very fast. It reuses memcmp to do the heavy lifting:
https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92.

Resources