Find two worst values and delete in sum - c

A microcontroller has the job of sampling ADC values (analog-to-digital conversion). Since these parts are affected by tolerance and noise, the accuracy can be significantly increased by discarding the 4 worst values. Finding and discarding them takes time, which is not ideal, since it increases the cycle time.
Imagine a clock frequency of 100 MHz, so each software instruction takes 10 ns to process; the more instructions, the longer the controller is blocked from taking the next set of samples.
So my goal is to do this sorting process as fast as possible. For this I currently use the code below, but it only removes the two worst values:
uint16_t adcval[8];                      /* filled with the 8 ADC samples elsewhere */

uint16_t getValue(void){
    uint16_t min = 16383;                /* 14bit full */
    uint16_t max = 1;                    /* zero is physically almost impossible! */
    uint32_t sum = 0;                    /* variable for the summing */
    for(uint8_t i = 0; i < 8; i++){
        if(adcval[i] > max) max = adcval[i];
        if(adcval[i] < min) min = adcval[i];
        sum = sum + adcval[i];
    }
    uint16_t result = (sum - max - min) / 6;  /* remove two worst and divide by 6 */
    return result;
}
Now I would like to extend this function to delete the 4 worst values out of the 8 samples to get more precision. Any advice on how to do this?
Additionally, it would be wonderful to build an efficient function that finds the most deviating values instead of just the highest and lowest. For example, imagine these two arrays:
uint16_t adc1[8] = {5,6,10,11,11,12,20,22};
uint16_t adc2[8] = {5,6,7,7,10,11,15,16};
The first case would gain precision from the described mechanism (delete the 4 worst). But in the second case the values 5 and 6 as well as 15 and 16 would be deleted, which would theoretically make the calculation worse, since deleting 10, 11, 15 and 16 would be better. Is there any fast way of deleting the 4 most deviating values?

If your ADC is returning values from 5 to 16 (14 bits, 3.3 V voltage reference), the voltage varies from about 1 mV to 3 mV. It is very likely that these are correct readings. It is very difficult to design a good input circuit for a 14-bit ADC.
It is better to use a running average, which is a software low-pass filter.
(The original answer includes plots: blue shows the raw ADC readings, red the running average. A second plot shows a very low amplitude sine wave, 9-27 mV assuming 14 bits and a 3.3 V reference.)
The algorithm:
static int average;

/* "level" is the averaging factor; "average" holds the filtered value scaled by level */
int running_average(int val, int level)
{
    average -= average / level;
    average += val;
    return average / level;
}

void init_average(int val, int level)
{
    average = val * level;
}
If the level is a power of 2, the divisions can be replaced by shifts; here level is the base-2 logarithm of the averaging factor. This version needs only 6 instructions (no branches) to calculate the average.
static int average;

/* here "level" is log2 of the averaging factor, so the divisions become shifts */
int running_average(int val, int level)
{
    average -= average >> level;
    average += val;
    return average >> level;
}

void init_average(int val, int level)
{
    average = val << level;
}
I assume that average will not overflow. If it does, you need to choose a larger type.
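For example, a minimal usage sketch of the division version with level = 8 (the getSample() call and the wrapper below are illustrative assumptions, not part of the original answer):

#include <stdint.h>

extern uint16_t getSample(void);       /* hypothetical: returns one raw ADC reading */

uint16_t filteredValue(void)
{
    static int initialized = 0;
    if (!initialized) {
        init_average(getSample(), 8);  /* seed the filter with the first sample */
        initialized = 1;
    }
    return (uint16_t)running_average(getSample(), 8);
}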

This answer is kind of off topic as it recommends a hardware solution, but if performance is required and the MCU can't implement P__J__'s solution, then this is your next best thing.
It seems you want to remove noise from your input signal. This can be done in software using DSP (digital signal processing), but it can also be done by configuring your hardware differently.
By adding the proper filter at the proper place before your ADC, it is possible to remove much of the (external) noise from your ADC output. (You can't, of course, go below a certain amount of noise that is innate in the ADC, but alas.)
There are several Q&As about this on electronics.stackexchange.com.
One solution is adding a capacitor to filter some high-frequency noise, as noted by DerStorm8.
The Photon has another great solution there, suggesting RC, Sallen-Key, and cascaded Sallen-Key filters for continuous-signal filtering.
Here (ADN007) is an Analog Design Note from Microchip on "Techniques that Reduce System Noise in ADC Circuits":
It may seem that designing a low noise, 12-bit Analog-to-Digital Converter (ADC) board or even a 10-bit board is easy. This is true, unless one ignores the basics of low noise design. For instance, one would think that most amplifiers and resistors work effectively in 12-bit or 10-bit environments. However, poor device selection becomes a major factor in the success or failure of the circuit. Another, often ignored, area that contributes a great deal of noise, is conducted noise. Conducted noise is already in the circuit board by the time the signal arrives at the input of the ADC. The most effective way to remove this noise is by using a low-pass (anti-aliasing) filter prior to the ADC. Including by-pass capacitors and using a ground plane will also eliminate this type of noise. A third source of noise is radiated noise. The major sources of this type of noise are Electromagnetic Interference (EMI) or capacitive coupling of signals from trace-to-trace. If all three of these issues are addressed, then it is true that designing a low noise 12-bit ADC board is easy.
And their recommended solution path:
It is easy to design a true 12-bit ADC system by using a few key low noise guidelines. First, examine your devices (resistors and amplifiers) to make sure they are low noise. Second, use a ground plane whenever possible. Third, include a low-pass filter in the signal path if you are changing the signal from analog to digital. Finally, and always, include by-pass capacitors. These capacitors not only remove noise but also foster circuit stability.
Here is a good paper by Analog Devices on input noise. They note that "there are some instances where input noise can actually be helpful in achieving higher resolution":
All analog-to-digital converters (ADCs) have a certain amount of input-referred noise—modeled as a noise source connected in series with the input of a noise-free ADC. Input-referred noise is not to be confused with quantization noise, which is only of interest when an ADC is processing time-varying signals. In most cases, less input noise is better; however, there are some instances where input noise can actually be helpful in achieving higher resolution. If this doesn’t seem to make sense right now, read on to find out how some noise can be good noise.

Given that you have a fixed-size array, a hard-coded sorting network should be able to correctly sort the entire array with only 19 comparisons. Currently you have 8+2*8=24 comparisons already, although it is possible that the compiler unrolls the loop, leaving you with 16 comparisons. It is conceivable that, depending on the microcontroller hardware, a sorting network can be implemented with some degree of parallelism -- perhaps you also have to query the ADC values sequentially, which would give you the opportunity to pre-sort them while waiting for the next conversion.
An optimal sorting network should be searchable online. Wikipedia has some pointers.
So, you would end up with some code like this:
sort_data(adcval);
return (adcval[2]+adcval[3]+adcval[4]+adcval[5])/4;
Update:
As you can see from this picture of optimal sorting networks (see the linked source), a complete sort of 8 values takes 19 comparisons. However, 3 of those are not strictly needed if you only want to extract the middle 4 values, so you get down to 16 comparisons.
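For illustration, sort_data() could be a hard-coded network like the sketch below. The 19-comparator schedule is one commonly published optimal network for 8 inputs (an addition for illustration, not taken from the linked picture), and the 3 comparators that could be dropped when only the middle 4 values are needed are not marked here:

#include <stdint.h>

/* Compare-exchange: after the call, a[i] <= a[j]. */
#define CSWAP(a, i, j) do { \
    if ((a)[i] > (a)[j]) { uint16_t t = (a)[i]; (a)[i] = (a)[j]; (a)[j] = t; } \
} while (0)

/* Full sort of 8 values with 19 compare-exchanges. */
static void sort_data(uint16_t a[8])
{
    CSWAP(a,0,1); CSWAP(a,2,3); CSWAP(a,4,5); CSWAP(a,6,7);
    CSWAP(a,0,2); CSWAP(a,1,3); CSWAP(a,4,6); CSWAP(a,5,7);
    CSWAP(a,1,2); CSWAP(a,5,6); CSWAP(a,0,4); CSWAP(a,3,7);
    CSWAP(a,1,5); CSWAP(a,2,6);
    CSWAP(a,1,4); CSWAP(a,3,6);
    CSWAP(a,2,4); CSWAP(a,3,5);
    CSWAP(a,3,4);
}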

to delete the 4 worst values out of the 8 samples
The methods are described in the GeeksforGeeks article "k largest (or smallest) elements in an array", and you can implement whichever method suits you best; a simple one is sketched below.
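For reference, a minimal sketch of one such method (simple partial selection, an illustration rather than the approach used in the rest of this answer):

#include <stdint.h>

/* Partial selection sort: after the call, s[0..k-1] hold the k smallest
   samples (in ascending order), so they can be skipped when summing. */
static void select_k_smallest(uint16_t s[], uint8_t n, uint8_t k)
{
    for (uint8_t i = 0; i < k; i++) {
        uint8_t min = i;
        for (uint8_t j = i + 1; j < n; j++)
            if (s[j] < s[min]) min = j;
        uint16_t t = s[i]; s[i] = s[min]; s[min] = t;
    }
}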
I decided to use this good site to generate an optimal sorting network of SWAP() macros for an array of 8 elements. Then I created a small C program that tests my sorting function on permutations of an 8-element array. Then, because we only care about groups of 4 elements, I did something brute force: for each of the SWAP() macros I tried commenting the macro out and checking whether the test program still succeeds. I could comment out 5 SWAP macros, leaving 14 comparisons needed to identify the smallest 4 elements in the array of 8 samples.
/**
* Sorts the array, but only so that groups of 4 matter.
* So group of 4 smallest elements and 4 biggest elements
* will be sorted ok.
* s[0]...s[3] will have lowest 4 elements
* so they have to be "deleted"
* s[4]...s[7] will have the highest 4 values
*/
void sort_but_4_matter(int s[8]) {
#define SWAP(x, y) do { \
        if (s[x] > s[y]) { \
            const int t = s[x]; \
            s[x] = s[y]; \
            s[y] = t; \
        } \
    } while(0)
    SWAP(0, 1);
    //SWAP(2, 3);
    SWAP(0, 2);
    //SWAP(1, 3);
    //SWAP(1, 2);
    SWAP(4, 5);
    SWAP(6, 7);
    SWAP(4, 6);
    SWAP(5, 7);
    //SWAP(5, 6);
    SWAP(0, 4);
    SWAP(1, 5);
    SWAP(1, 4);
    SWAP(2, 6);
    SWAP(3, 7);
    //SWAP(3, 6);
    SWAP(2, 4);
    SWAP(3, 5);
    SWAP(3, 4);
#undef SWAP
}
/* -------- testing code */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int cmp_int(const void *a, const void *b) {
    return *(const int*)a - *(const int*)b;
}

void printit_arr(const int *arr, size_t n) {
    printf("{");
    for (size_t i = 0; i < n; ++i) {
        printf("%d", arr[i]);
        if (i != n - 1) {
            printf(" ");
        }
    }
    printf("}");
}

void printit(const char *pre, const int arr[8],
             const int in[8], const int res[4]) {
    printf("%s: ", pre);
    printit_arr(arr, 8);
    printf(" ");
    printit_arr(in, 8);
    printf(" ");
    printit_arr(res, 4);
    printf("\n");
}

int err = 0;

void test(const int arr[8], const int res[4]) {
    int in[8];
    memcpy(in, arr, sizeof(int) * 8);
    sort_but_4_matter(in);
    // sort for memcmp below
    qsort(in, 4, sizeof(int), cmp_int);
    if (memcmp(in, res, sizeof(int) * 4) != 0) {
        printit("T", arr, in, res);
        err = 1;
    }
}

void test_all_combinations() {
    const int result[4] = { 0, 1, 2, 3 }; // sorted
    const size_t n = 8;
    int num[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n-1; i++) {
            int temp = num[i];
            num[i] = num[i+1];
            num[i+1] = temp;
            test(num, result);
        }
    }
}

int main() {
    test_all_combinations();
    return err;
}
Tested on godbolt. With gcc -O2 on x86_64, sort_but_4_matter compiles to fewer than 100 instructions.
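A hypothetical use in a getValue()-style routine from the question (the wrapper below is an illustration, not part of the original answer): since s[0..3] end up holding the four smallest samples, the remaining four can be averaged directly.

uint16_t getValue4(void) {
    int s[8];
    for (uint8_t i = 0; i < 8; i++)
        s[i] = adcval[i];               /* adcval as in the question */
    sort_but_4_matter(s);               /* s[4..7] now hold the 4 largest samples */
    return (uint16_t)((s[4] + s[5] + s[6] + s[7]) / 4);
}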

Micro-optimizing a linear search loop over a huge array with OpenMP: can't break on a hit

I have a loop that takes between 90% and 99% of the program time approximately. It reads a huge LUT, and this loop is executed > 100,000 times, so it deserves some optimization.
EDIT:
The LUT (actually there are various arrays that compose the LUT) is made of arrays of ptrdiff_t and of unsigned __int128. They have to be that wide because of the algorithm (especially the 128 bit ones). T_RDY is the only bool array.
EDIT:
The LUT stores past combinations used to try to solve a problem that didn't work. There's no relation between them (that I can see yet), so I don't see a more appropriate search pattern.
The single threaded version of the loop is:
k = false;
for (ptrdiff_t i = 0; i < T_IND; i++) {
if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN)) {
k = true;
break;
}
}
With this code, which makes use of OpenMP, I reduced the time between 2x and 3x in a 4 core processor:
k = false;
#pragma omp parallel for shared(k)
for (ptrdiff_t i = 0; i < T_IND; i++) {
if (k)
continue;
if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN))
k = true;
}
EDIT:
Info about the data used:
#define DIM_MAX 128
#define P_LEN prb_lvl[0]
#define P_LVL prb_lvl[1]
#define M_RWS prb_mtx_rws[prb_lvl[1]]
#define T_RWS prb_tab
#define T_NUM prb_tab_num
#define T_RDY prb_tab_rdy
#define T_IND prb_tab_ind
extern ptrdiff_t prb_lvl [2];
extern uint128_t prb_mtx_rws [DIM_MAX];
extern uint128_t prb_tab [10000000];
extern ptrdiff_t prb_tab_num [10000000];
extern bool prb_tab_rdy [10000000];
extern ptrdiff_t prb_tab_ind;
However, the fact that I don't get an improvement of approximately 4x means that the parallelization introduces an overhead, which I estimate at between 1.5x and 2x. Part of the overhead is unavoidable (creating and destroying the threads), but there's additional overhead due to the fact that OpenMP doesn't allow breaking out of a parallel loop, so I added an if to each iteration, and I would like to get rid of it if possible.
Is there any other optimization that I could apply? Maybe using pthreads instead.
Should I bother editing some assembly?
I'm using GCC 9 with -O3 -flto (among others).
EDIT:
CPU: i7-5775C
But I plan to use other x64 CPUs with more cores.
You can coalesce k into bit tables and then do comparisons 64 at a time. If an entry in the main tables change, recompute that bit in the bit table.
If different queries use different M_RWS or P_LVL or something, then you'd need separate caches for separate search inputs. Or rebuild the cache for their current values, if you do multiple queries between changes. But hopefully that's not the case, otherwise the all-caps names are misleading.
Set up k as a bit table
#define KSZ (10000000/64 + !!(10000000 % 64))
static uint64_t k[KSZ];

void init_k(void){
    // We can split this up to minimize cache misses, see below
    for (size_t i = 0; i < 10000000; ++i)
        k[i/64] |= (uint64_t)((!!T_RDY[i]) & (!(~T_RWS[i] & M_RWS)) & ((T_NUM[i] + P_LVL) <= P_LEN)) << (i&63);
}
You can find the bit-index into k by searching for a non-zero 64-bit chunk, then using a bitscan to find the bit within that chunk:
size_t k2index(void){
    size_t i;
    for (i = 0; i < KSZ; ++i)
        if (k[i]) break;
    return 64 * i + __builtin_ctzll(k[i]);
}
You may want to split up your data reads so that you get sequential data access (each table is tens of megabytes, as described) and don't get a cache miss on every single iteration.
#define KSZ (10000000/64 + !!(10000000 % 64))
static uint64_t k[KSZ], k0[KSZ], k1[KSZ]; //use calloc instead?

void init_k(void){
    //I split these up to minimize cache misses
    for (size_t i = 0; i < 10000000; ++i)
        k[i/64] |= (uint64_t)(!!T_RDY[i]) << (i&63);
    for (size_t i = 0; i < 10000000; ++i)
        k0[i/64] |= (uint64_t)(!(~T_RWS[i] & M_RWS)) << (i&63);
    for (size_t i = 0; i < 10000000; ++i)
        k1[i/64] |= (uint64_t)((T_NUM[i] + P_LVL) <= P_LEN) << (i&63);
    //now combine them 64 bits at a time
    for (size_t i = 0; i < KSZ; ++i)
        k[i] &= k0[i];
    for (size_t i = 0; i < KSZ; ++i)
        k[i] &= k1[i];
}
If you split it up like this, you could also initialize (some of) them when you set up your other tables. Or if the tables updated, you could update the k value as well.
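For instance, a minimal sketch of such an incremental update (an illustration, assuming the k table and the all-caps macros from the question; it recomputes just the one bit for entry i):

void update_k_bit(size_t i) {
    uint64_t bit = (uint64_t)((!!T_RDY[i]) & (!(~T_RWS[i] & M_RWS)) &
                              ((T_NUM[i] + P_LVL) <= P_LEN));
    k[i / 64] = (k[i / 64] & ~((uint64_t)1 << (i & 63))) | (bit << (i & 63));
}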

I do *not* want correct rounding for function exp

The GCC implementation of the C mathematical library on Debian systems apparently has an (IEEE 754-2008)-compliant implementation of the function exp, implying that rounding shall always be correct:
(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.
It turns out that I am encountering a case where this feature is actually a hindrance, because the exact result of the exp function is often almost exactly at the midpoint between two consecutive double values (1), and the implementation then has to carry out many further computations, losing up to a factor of 400 (!) in speed: this was actually the explanation of my (ill-asked :-S) Question #43530011.
(1) More precisely, this happens when the argument of exp turns out to be of the form (2k + 1) × 2^-53 with k a rather small integer (like 242, for instance). In particular, the computations involved in pow(1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^-44.
Since implementations of correct rounding can be so time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)
… But how to do this??
(2) What I mean is that I just do not want some exceptional values of the argument, like (2k + 1) × 2^-53, to be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.
Here is a minimal program showing the phenomenon:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
    int i;
    double a, c;
    c = 0;
    clock_t start = clock ();
    for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
        a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
        c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
    }
    clock_t stop = clock ();
    printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
    printf ("Clock time spent: %ld\n", (long) (stop - start));
    return 0;
}
Now after gcc -std=c99 program53.c -lm -o program53:
$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53
1.000000e+06
Clock time spent: 13292721
$ ./program53
1.000000e+06
Clock time spent: 13201616
On the other hand, with program52 and program54 (obtained by replacing 0x20000000000000 with 0x10000000000000 and 0x40000000000000 respectively):
$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612
Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.
P.-S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did not use the relevant keywords… :-/
To answer the general question on why the library functions are required to give correctly rounded results:
Floating-point is hard, and often counterintuitive. Not every programmer has read what they should have. When libraries used to allow slightly inaccurate rounding, people complained about the precision of the library functions when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.
In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:
Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.
A similar concern exists for computing log(1 + x), for which there is the function log1p(x).
A quick fix that speeds up the provided testcase:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
    int i;
    double a, c;
    c = 0;
    clock_t start = clock ();
    for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
        a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
        c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
    clock_t stop = clock ();
    printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
    printf ("Clock time spent: %ld\n", (long) (stop - start));
    return 0;
}
For this case, the timings on my machine are thus:
Original code
1.000000e+06
Clock time spent: 21543338
Modified code
1.000000e+06
Clock time spent: 55076
Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical.
For an experienced programmer it may be possible to write an approximate implementation of a slow function using methods like Newton-Raphson iteration or Taylor/Maclaurin polynomials, to use inexactly rounded specialty functions from libraries like Intel's MKL or AMD's ACML, to relax the floating-point standard compliance of the compiler, to reduce the precision to IEEE 754 binary32 (float), or a combination of these.
Note that a better description of the problem would enable a better answer.
Regarding your comment on @EOF's answer, the "write your own" remark from @NominalAnimal seems simple enough here, even trivial, as follows.
Your original code above seems to have a max possible argument for exp() of a=(1+2*0x400)/0x2000...=4.55e-13 (that should really be 2*0x3FF, and I'm counting 13 zeroes after your 0x2000..., which makes it 2x16^13). So that 4.55e-13 max argument is very, very small.
And then the trivial Taylor expansion is exp(a) = 1 + a + a^2/2 + a^3/6 + ..., which already gives you all of double's precision for such small arguments. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a) = a*(1.+a*(1.+a/3.)/2.). And that should go pretty darn quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that?).
>>EDIT<<
I modified the OP's test program as follows to test a little more stuff (discussion follows code)
/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define BASE 16   /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
    int N          = (argc>1?atoi(argv[1]):1e6),
        multiplier = (argc>2?atoi(argv[2]):2),
        isexp      = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
    int isexpm1 = 1;                           /* and expm1() for timing tests*/
    int i, n=0;
    double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
    double a, c=0.0, cm1=0.0, tm1=0.0;
    clock_t start = clock();
    n=0; c=cm1=tm1=0.0;
    /* --- to smooth random fluctuations, do the same type of computation
           a large number of (N) times with different values --- */
    for (i=0; i<N; i++) {
        n++;
        a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                              significant digits, and its last non-zero
                              digit is at (fixed-point) position 53. */
        if ( isexp ) c += exp(a);  /* turn this off to time expm1() alone */
        if ( isexpm1 ) {           /* you can turn this off to time exp() alone, */
            cm1 += expm1(a);       /* but difference is negligible */
            tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
    int nticks = (int)(clock()-start);
    printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
            n, multiplier,BASE,EXPON,
            nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
    printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
            c,c-(double)n,cm1,tm1);
    return 0;
} /* --- end-of-function main() --- */
Compile and run it as test to reproduce the OP's 0x2000... scenario, or run it with (up to three) optional args test #trials multiplier timeexp, where #trials defaults to the OP's 1000000, and multiplier defaults to 2 for the OP's 2x16^13 (change it to 4, etc., for her other tests). For the last arg, timeexp, enter a 0 to do only the expm1() (and my unnecessary Taylor-like) calculation. The point of that is to show that the bad-timing cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.
So default runs, test and test 1000000 4, produce (okay, I called the program rounding)...
bash-4.3$ ./rounding
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
c=1.00000000000000023283e+06,
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
c=1.00000000000000011642e+06,
c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08
So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my taylor approx. If you reduce N they come into agreement, as follows...
N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
c=1.00000000000007140954e+01,
c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
c=1.00000000000010103918e+02,
c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
c=1.00000000000011345946e+03,
c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
c=1.00000000000002328306e+04,
c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
c=1.00000000000000232831e+05,
c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08
And as far as timing of exp() versus expm1() is concerned, see for yourself...
bash-4.3$ ./rounding 1000000 2
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
c=1.00000000000000023283e+06,
c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
c=0.00000000000000000000e+00,
c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07
Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening. (Presumably, once the running sum exceeds roughly 1e4, each term's tiny excess over 1 (at most about 4.6e-13) falls below half an ULP of the sum and gets rounded away, so further contributions are absorbed; the hierarchical accumulation in the second edit below addresses exactly this.)
>>__SECOND EDIT__<<
Okay, @EOF, "you made me look" with your "hierarchical accumulation" comment. And that indeed works to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code is immediately below, followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its place is expon, so that the denominator is now 2^expon where the default is 53, matching the OP's default (and, I believe, better matching how she was thinking about it). Okay, and here's the code...
/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define BASE 2    /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
    int N       = (argc>1?atoi(argv[1]):1e6),
        expon   = (argc>2?atoi(argv[2]):EXPON),
        isexp   = (argc>3?atoi(argv[3]):1),  /* flags to turn on/off exp() */
        ncparts = (argc>4?atoi(argv[4]):1),  /* #partial sums for c */
        binsize = (argc>5?atoi(argv[5]):10); /* #doubles to sum in each bin */
    int isexpm1 = 1;                         /* and expm1() for timing tests*/
    int i, n=0;
    double denom = pow((double)BASE,(double)expon);
    double a, c=0.0, cm1=0.0, tm1=0.0;
    double csums[10], cbins[10][65537];      /* c partial sums and hierarchy */
    int nbins[10], ibin=0;                   /* start at lowest level */
    clock_t start = clock();
    n=0; c=cm1=tm1=0.0;
    if ( ncparts > 65536 ) ncparts=65536;    /* array size check */
    if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
    /* --- to smooth random fluctuations, do the same type of computation
           a large number of (N) times with different values --- */
    for (i=0; i<N; i++) {
        n++;
        a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                              significant digits, and its last non-zero
                              digit is at (fixed-point) position 53. */
        if ( isexp ) {               /* turn this off to time expm1() alone */
            double expa = exp(a);                           /* exp(a) */
            c += expa;               /* just accumulate in a single "bin" */
            if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
        if ( isexpm1 ) {             /* you can turn this off to time exp() alone, */
            cm1 += expm1(a);         /* but difference is negligible */
            tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
    int nticks = (int)(clock()-start);
    if ( ncparts > 1 ) {             /* need to sum the partial-sum bins */
        nbins[ibin=0] = ncparts;     /* lowest-level has everything */
        while ( nbins[ibin] > binsize ) {  /* need another hierarchy level */
            if ( ibin >= 9 ) break;        /* no more bins */
            ibin++;                        /* next available hierarchy bin level */
            nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
            for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0;     /* init bins */
            for(i=0;i<nbins[ibin-1];i++) {
                cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
                csums[ibin-1] += cbins[ibin-1][i]; }  /* accumulate in "one bin" */
        } /* --- end-of-while(nprevbins>binsize) --- */
        for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
    } /* --- end-of-if(ncparts>1) --- */
    printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
            nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
    printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
            c,c-(double)n,cm1,tm1);
    if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
        for (i=0;i<=ibin;i++)        /* display hierarchy */
            printf("\t level#%d: #bins=%5d, c-n=%e\n",
                   i,nbins[i],csums[i]-(double)n); }
    return 0;
} /* --- end-of-function main() --- */
Okay, and now you can notice two additional command-line args following the old timeexp. The first is ncparts, the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the hierarchy, each bin should (modulo bugs:) have the sum of #trials/ncparts doubles. The argument after that is binsize, which will be the number of doubles summed in each bin at every successive level, until the last level has no more bins than binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20 doubles/bin at the lowest level, and 5 doubles/bin thereafter...
bash-4.3$ ./rounding 1000000 53 1 50000 5
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
c=1.00000000000000465661e+06,
c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
binsize=5...
level#0: #bins=50000, c-n=4.656613e-09
level#1: #bins=10002, c-n=1.734588e-08
level#2: #bins= 2002, c-n=7.974450e-08
level#3: #bins= 402, c-n=1.059379e-07
level#4: #bins= 82, c-n=1.133885e-07
level#5: #bins= 18, c-n=1.136214e-07
level#6: #bins= 5, c-n=1.138542e-07
Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note if you break the #trials into only 5000 initial bins, you get just as good a result,
bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
c=1.00000000000003527384e+06,
c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
binsize=5...
level#0: #bins= 5000, c-n=3.527384e-08
level#1: #bins= 1002, c-n=1.164153e-07
level#2: #bins= 202, c-n=1.158332e-07
level#3: #bins= 42, c-n=1.136214e-07
level#4: #bins= 10, c-n=1.137378e-07
level#5: #bins= 4, c-n=1.136214e-07
In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for @EOF ...???
>>EDIT -- example showing pair addition "binary tree" hierarchy<<
Example below added as per @EOF's comment
(Note: re-copy the preceding code. I had to edit the nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative", in order to create the ...16,8,4,2 sequence.)
bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
c=1.02400000000011573320e+03,
c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
binsize=2...
level#0: #bins= 512, c-n=1.159606e-10
level#1: #bins= 256, c-n=1.166427e-10
level#2: #bins= 128, c-n=1.166427e-10
level#3: #bins= 64, c-n=1.161879e-10
level#4: #bins= 32, c-n=1.166427e-10
level#5: #bins= 16, c-n=1.166427e-10
level#6: #bins= 8, c-n=1.166427e-10
level#7: #bins= 4, c-n=1.166427e-10
level#8: #bins= 2, c-n=1.164153e-10
>>EDIT -- to show @EOF's elegant solution in comment below<<
"Pair addition" can be elegantly accomplished recursively, as per @EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)
/* Quoting from EOF's comment...
What I (EOF) proposed is effectively a binary tree of additions:
a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
Like this: Add adjacent pairs of elements, this produces
a new sequence of n/2 elements.
Recurse until only one element is left.
(Note that this will require n/2 elements of storage,
rather than a fixed number of bins like your implementation) */
double trecu(double *vals, double sum, int n) {
    int midn = n/2;
    switch (n) {
        case 0: break;
        case 1: sum += *vals; break;
        default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
    return(sum);
}
This is an "answer"/followup to EOF's preceding comments re his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all that in one organized place, but I haven't done that yet...
...What I did do was build EOF's trecu() into the test program from the preceding answer that I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.
But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. Long-and-short of it is that, below, same thing happens -- bintreesum() remains close to correct answer, while trecu() gets further away, exactly reproducing the "plain sum".
What we're summing below is just sum(i),i=1...n, which is just the well-known n(n+1)/2. But that's not quite right -- to reproduce OP's problem, summand is not sum(i) alone but rather sum(1+i*10^(-e)), where e can be given on the command-line. So for, say, n=5, you don't get 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...
/* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>

double trecu(double *vals, double sum, int n) {
    int midn = n/2;
    switch (n) {
        case 0: break;
        case 1: sum += *vals; break;
        default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
    return(sum);
} /* --- end-of-function trecu() --- */

double bintreesum(double *vals, int n, int binsize) {
    double binsum = 0.0;
    int nbin0 = (n+(binsize-1))/binsize,
        nbin1 = (nbin0+(binsize-1))/binsize,
        nbins[2] = { nbin0, nbin1 };
    double *vbins[2] = {
        (double *)malloc(nbin0*sizeof(double)),
        (double *)malloc(nbin1*sizeof(double)) },
        *vbin0=vbins[0], *vbin1=vbins[1];
    int ibin=0, i;
    for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
    for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
    while ( nbins[ibin] > 1 ) {
        int jbin = 1-ibin;              /* other bin, 0<-->1 */
        nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
        for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
        for ( i=0; i<nbins[ibin]; i++ )
            vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
        ibin = jbin;                    /* swap bins for next pass */
    } /* --- end-of-while(nbins[ibin]>0) --- */
    binsum = vbins[ibin][0];
    free((void *)vbins[0]); free((void *)vbins[1]);
    return ( binsum );
} /* --- end-of-function bintreesum() --- */

#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
    int N       = (argc>1? atoi(argv[1]) : 1000000 ),
        e       = (argc>2? atoi(argv[2]) : -10 ),
        binsize = (argc>3? atoi(argv[3]) : 2 );
    double tens = pow(10.0,(double)e);
    double *vals = (double *)malloc(sizeof(double)*MAXN),
           sum = 0.0;
    double trecu(), bintreesum();
    int i;
    if ( N > MAXN ) N=MAXN;
    for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
    for ( i=0; i<N; i++ ) sum += vals[i];
    printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N = %.8e,\n"
           "\t plain_sum-N = %.8e,\n"
           "\t trecu-N = %.8e,\n"
           "\t bintreesum-N = %.8e \n",
           N, tens, tens*((double)N)*((double)(N+1))/2.0,
           sum-(double)N,
           trecu(vals,0.0,N)-(double)N,
           bintreesum(vals,N,binsize)-(double)N );
} /* --- end-of-function main() --- */
#endif
So if you save that as trecu.c, then compile it as cc -DTESTTRECU trecu.c -lm -o trecu, and then run it with zero to three optional command-line args as trecu #trials e binsize. Defaults are #trials=1000000 (like the OP's program), e=-10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than using larger-size bins).
And here are some test results illustrating the problem described above,
bash-4.3$ ./trecu
N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N = 5.00000500e+01,
plain_sum-N = 5.00000500e+01,
trecu-N = 5.00000500e+01,
bintreesum-N = 5.00000500e+01
bash-4.3$ ./trecu 1000000 -15
N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N = 5.00000500e-04,
plain_sum-N = 5.01087168e-04,
trecu-N = 5.01087168e-04,
bintreesum-N = 5.00000548e-04
bash-4.3$
bash-4.3$ ./trecu 1000000 -16
N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N = 5.00000500e-05,
plain_sum-N = 6.67552231e-05,
trecu-N = 6.67552231e-05,
bintreesum-N = 5.00001479e-05
bash-4.3$
bash-4.3$ ./trecu 1000000 -17
N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N = 5.00000500e-06,
plain_sum-N = 0.00000000e+00,
trecu-N = 0.00000000e+00,
bintreesum-N = 4.99992166e-06
So you can see that for the default run, e=-10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=-10 test case. But for the e=-15 and e=-16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=-17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.
So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kind of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double-recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.
Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.
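For what it's worth, here is a minimal sketch of a recursion that does sum pairwise (an illustration, not code from the discussion above). The key difference is that each call returns the total of its own half only, instead of threading the running sum through the recursion the way trecu() does, which is what makes trecu() equivalent to the plain left-to-right sum.

/* Pairwise ("binary tree") summation: each half is summed independently and
   the two partial sums are added last, so intermediate sums stay small. */
double trecu_pairwise(double *vals, int n) {
    if (n == 0) return 0.0;
    if (n == 1) return vals[0];
    int midn = n/2;
    return trecu_pairwise(vals, midn) + trecu_pairwise(vals + midn, n - midn);
}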

Identifying a trend in C - Micro controller sampling

I'm working on an MC68HC11 microcontroller and have an analogue voltage signal going in that I have sampled. The scenario is a weighing machine: the large peaks are when the object hits the sensor, then the signal stabilises (these are the samples I want), and then it peaks again before the object rolls off.
The problem I'm having is figuring out a way for the program to detect this stable region and average it to produce an overall weight, but I can't work out how :/. One way I have thought of is comparing previous values to see whether there is a large difference between them, but I haven't had any success. Below is the C code that I am using:
#include <stdio.h>
#include <stdarg.h>
#include <iof1.h>

void main(void)
{
    /* PORTA, DDRA, DDRG etc... are LEDs and switch ports */
    unsigned char *paddr, *adctl, *adr1;
    unsigned short i = 0;
    unsigned short k = 0;
    unsigned char switched = 1; /* is char the smallest data type? */
    unsigned char data[2000];

    DDRA = 0x00; /* All in */
    DDRG = 0xff;

    adctl = (unsigned char*) 0x30;
    adr1 = (unsigned char*) 0x31;
    *adctl = 0x20; /* single continuous scan */

    while(1)
    {
        if(*adr1 > 40)
        {
            if(PORTA == 128) /* Debugging switch */
            {
                PORTG = 1;
            }
            else
            {
                PORTG = 0;
            }
            if(i < 2000)
            {
                while(((*adctl) & 0x80) == 0x00);
                {
                    data[i] = *adr1;
                }
                /* if(i > 10 && (data[(i-10)] - data[i]) < 20) */
                i++;
            }
            if(PORTA == switched)
            {
                PORTG = 31;
                /* Print a delimiter so teemtalk can send to excel */
                for(k=0;k<2000;k++)
                {
                    printf("%d,",data[k]);
                }
                if(switched == 1) /* bitwise manipulation more efficient? */
                {
                    switched = 0;
                }
                else
                {
                    switched = 1;
                }
                PORTG = 0;
            }
            if(i >= 2000)
            {
                i = 0;
            }
        }
    }
}
Look forward to hearing any suggestions :)
(The graph in the original post shows how these values look; the red box is the area I would like to identify.)
As your sample sequence has glitches (short-lived transients), try to improve the hardware first, i.e. change the layout, add decoupling, add filtering, etc.
If that approach fails, then use a median filter [1] of, say, five places, which takes the last five samples, sorts them and outputs the middle one, so up to two transient samples have no effect on its output (seven places tolerate three transient samples).
Then apply a computationally efficient exponential-averaging low-pass filter [2]:
y(n) = y(n-1) + alpha * (x(n) - y(n-1))
choosing alpha (1/2^n, so the division can be done with right shifts) to yield a time constant [3] shorter than the underlying response (~50 samples), but still filter out the noise. Increasing the number of effective fractional bits will avoid quantising issues.
With this improved sample sequence, thresholds and cycle counts can be applied to detect quiescent durations.
Additionally, if the end of the quiescent period is always followed by a large, abrupt change, then using a sample-delay "array" enables detection of the abrupt change while still having the last of the quiescent samples available for logging. (A sketch of the two filters follows the note below.)
[1] http://en.wikipedia.org/wiki/Median_filter
[2] http://www.dsprelated.com/showarticle/72.php
[3] http://en.wikipedia.org/wiki/Time_constant
Note: adding code for the above filtering operations will lower the maximum possible sample rate, but printf can be substituted with something faster.
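A minimal sketch of the two filters described above (median-of-five followed by an exponential low-pass with a power-of-two alpha); the window handling, SHIFT value and fixed-point scaling are illustrative assumptions:

#include <stdint.h>

/* Median of a five-sample window: insertion-sort a copy and take the middle value. */
static uint8_t median5(const uint8_t w[5])
{
    uint8_t s[5];
    for (int i = 0; i < 5; i++) s[i] = w[i];
    for (int i = 1; i < 5; i++) {
        uint8_t v = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > v) { s[j + 1] = s[j]; j--; }
        s[j + 1] = v;
    }
    return s[2];
}

/* Exponential low-pass y(n) = y(n-1) + alpha*(x(n) - y(n-1)) with alpha = 1/2^SHIFT,
   kept in fixed point with 8 fractional bits to reduce quantising error. */
#define SHIFT 3
static int32_t y_fp;                    /* filtered value, scaled by 256 */

static uint8_t lowpass(uint8_t x)
{
    int32_t x_fp = (int32_t)x << 8;
    y_fp += (x_fp - y_fp) >> SHIFT;     /* division by 2^SHIFT as a right shift */
    return (uint8_t)(y_fp >> 8);
}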
Continuously store the current value and the delta from the previous value.
Note when the delta is decreasing, marking the start of weight application to the scale.
Note when the delta is increasing, marking the end of weight application to the scale.
Take X values with a small delta and average them (see the sketch below).
BTW, I'm sure this has been done 1M times before, I'm thinking that a search for scale PID or weight PID would find a lot of information.
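A minimal sketch of that idea (the DELTA_MAX threshold, window size X and 8-bit sample type are illustrative assumptions): collect consecutive samples whose change from the previous sample is small, and report their average once X of them have been seen.

#define X          32
#define DELTA_MAX  2

static unsigned char  prev;
static unsigned short acc;
static unsigned char  count;

/* Feed each new sample; returns 1 and writes the averaged weight once X
   consecutive "stable" samples have been accumulated. */
int stable_average(unsigned char sample, unsigned char *weight)
{
    int delta = (int)sample - (int)prev;
    prev = sample;
    if (delta <= DELTA_MAX && delta >= -DELTA_MAX) {
        acc += sample;
        if (++count == X) {
            *weight = (unsigned char)(acc / X);
            acc = 0; count = 0;
            return 1;
        }
    } else {
        acc = 0; count = 0;    /* not stable: restart the window */
    }
    return 0;
}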
Don't forget to use a __delay_ms(XX) function somewhere between reading values if you are going to compare each one with the previous one. The difference at each step will obviously be small if the code loops continuously.
Looking at your nice graphs, I would say you should look only for the falling edge; it is much more consistent than the leading edge.
In other words, let the samples accumulate and calculate a running average all the time with a predefined window size, remembering the deviation of the previous values just for reference. Check for a large negative bump in your values (say, an absolute value ten times smaller than the current running average); your running average is then your value. You could go back a little (disregarding the last few values in your average and recalculating) to compensate for the small positive bump visible in your picture before each negative bump. No need for heavy math here; you cannot model reality better than your picture has shown. Just make sure that your code detects the end of each and every sample, and that you sample fast enough that no negative bump is missed (or you will have a big timing error in your data averaging).
And you don't need such large arrays; a running average based on a smaller window size gives a smaller residual error in your case when you detect the negative bump.

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.
Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticeable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same ones that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;

#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>   // for system() and exit()

#define ENABLE_PREFETCH

#define f_vector    __m128d
#define i_ptr       size_t

inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
    //  To be super-optimized later.
    f_vector *stop = A + L;
    do{
        f_vector tmpA = *A;
        f_vector tmpB = *B;
        *A++ = tmpB;
        *B++ = tmpA;
    }while (A < stop);
}

void transpose_even(f_vector *T,i_ptr block,i_ptr x){
    //  Transposes T.
    //  T contains x columns and x rows.
    //  Each unit is of size (block * sizeof(f_vector)) bytes.
    //Conditions:
    //  - 0 < block
    //  - 1 < x
    i_ptr row_size = block * x;
    i_ptr iter_size = row_size + block;
    //  End of entire matrix.
    f_vector *stop_T = T + row_size * x;
    f_vector *end = stop_T - row_size;
    //  Iterate each row.
    f_vector *y_iter = T;
    do{
        //  Iterate each column.
        f_vector *ptr_x = y_iter + block;
        f_vector *ptr_y = y_iter + row_size;
        do{
#ifdef ENABLE_PREFETCH
            _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
            swap_block(ptr_x,ptr_y,block);
            ptr_x += block;
            ptr_y += row_size;
        }while (ptr_y < stop_T);
        y_iter += iter_size;
    }while (y_iter < end);
}

int main(){
    i_ptr dimension = 4096;
    i_ptr block = 16;
    i_ptr words = block * dimension * dimension;
    i_ptr bytes = words * sizeof(f_vector);
    cout << "bytes = " << bytes << endl;
    //  system("pause");
    f_vector *T = (f_vector*)_mm_malloc(bytes,16);
    if (T == NULL){
        cout << "Memory Allocation Failure" << endl;
        system("pause");
        exit(1);
    }
    memset(T,0,bytes);
    //  Perform in-place data transpose
    cout << "Starting Data Transpose... ";
    clock_t start = clock();
    transpose_even(T,block,dimension);
    clock_t end = clock();
    cout << "Done" << endl;
    cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
    _mm_free(T);
    system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
Prefetch   : 0.868
No Prefetch: 0.960
Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
Prefetch   : 0.725
No Prefetch: 0.822
Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
Prefetch   : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
Prefetch   : 2.273
No Prefetch: 2.666
Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements-1, mid;
    while(low <= high) {
        mid = (low + high)/2;
#ifdef DO_PREFETCH
        // low path
        __builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
        // high path
        __builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
        if(array[mid] < key)
            low = mid + 1;
        else if(array[mid] == key)
            return mid;
        else if(array[mid] > key)
            high = mid-1;
    }
    return -1;
}

int main() {
    int SIZE = 1024*1024*512;
    int *array = malloc(SIZE*sizeof(int));
    for (int i=0;i<SIZE;i++){
        array[i] = i;
    }
    int NUM_LOOKUPS = 1024*1024*8;
    srand(time(NULL));
    int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
    for (int i=0;i<NUM_LOOKUPS;i++){
        lookups[i] = rand() % SIZE;
    }
    for (int i=0;i<NUM_LOOKUPS;i++){
        int result = binarySearch(array, SIZE, lookups[i]);
    }
    free(array);
    free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.
I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost - the objective of this answer is to present an (I must confess somewhat artificial) example where prefetching has a bigger impact (about a factor of 4 on my machine).
There are three possible bottlenecks on modern architectures: CPU speed, memory bandwidth and memory latency. Prefetching is all about reducing the latency of memory accesses.
In a perfect scenario, where the latency corresponds to X calculation-steps, we would have an oracle which would tell us which memory we will access in X calculation-steps; the prefetch of this data would be launched and it would arrive just in time, X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason why there is such a modest improvement for @Mystical's example: the prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much bandwidth is left - it could be the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from @JamesScriven's example: neither we nor the CPU knows the next access address before the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would look forward and the hardware would be able to prefetch the data. However, because we can speculate only one step ahead, there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
    return (current*10001+328)%SIZE;
}

//the actual work is happening here
void operator()(){
    //set up the oracle - let it see oracle_offset steps into the future
    unsigned int prefetch_index=0;
    for(int i=0;i<oracle_offset;i++)
        prefetch_index=next(prefetch_index);

    unsigned int index=0;
    for(int i=0;i<STEP_CNT;i++){
        //use oracle and prefetch memory block used in a future iteration
        if(prefetch){
            __builtin_prefetch(mem.data()+prefetch_index,0,1);
        }

        //actual work, the less the better
        result+=mem[index];

        //prepare next iteration
        prefetch_index=next(prefetch_index);  //update oracle
        index=next(mem[index]);               //dependency on `mem[index]` is VERY important to prevent hardware from predicting future accesses
    }
}
Some remarks:
the data is prepared in such a way that the oracle is always right.
maybe surprisingly, the less CPU-bound the task, the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is (CPU-time + original-latency-time)/CPU-time.
Compiling and executing yields:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demo.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>

const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;

unsigned int next(unsigned int current){
    return (current*10001+328)%SIZE;
}

template<bool prefetch>
struct Worker{
    std::vector<int> mem;
    double result;
    int oracle_offset;

    void operator()(){
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);

        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //prefetch memory block used in a future iteration
            if(prefetch){
                __builtin_prefetch(mem.data()+prefetch_index,0,1);
            }
            //actual work:
            result+=mem[index];

            //prepare next iteration
            prefetch_index=next(prefetch_index);
            index=next(mem[index]);
        }
    }

    Worker(std::vector<int> &mem_):
        mem(mem_), result(0.0), oracle_offset(0)
    {}
};

template <typename Worker>
double timeit(Worker &worker){
    auto begin = std::chrono::high_resolution_clock::now();
    worker();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}

int main() {
    //set up the data in special way!
    std::vector<int> keys(SIZE);
    for (int i=0;i<SIZE;i++){
        keys[i] = i;
    }

    Worker<false> without_prefetch(keys);
    Worker<true> with_prefetch(keys);

    std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
    std::cout<<std::setprecision(17);

    for(int i=0;i<20;i++){
        //let oracle see i steps in the future:
        without_prefetch.oracle_offset=i;
        with_prefetch.oracle_offset=i;
        //calculate:
        double time_with_prefetch=timeit(with_prefetch);
        double time_no_prefetch=timeit(without_prefetch);
        std::cout<<i<<"\t"
                 <<time_no_prefetch<<"\t"
                 <<time_with_prefetch<<"\t"
                 <<(time_no_prefetch/time_with_prefetch)<<"\n";
    }
}
From the documentation:
for (i = 0; i < n; i++)
{
    a[i] = a[i] + b[i];
    __builtin_prefetch (&a[i+j], 1, 1);
    __builtin_prefetch (&b[i+j], 0, 1);
    /* ... */
}
Pre-fetching data can be optimized to the cache-line size, which for most modern 64-bit processors is 64 bytes, so that for example a uint32_t[16] can be pre-loaded with one instruction.
For example, on ARMv8 I discovered through experimentation that casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the instructions required, because before that I had to increment the offset by 8, as only half the data was being loaded, even though my understanding was that it fetches a full cache line.
Pre-fetching a uint32_t[32], original code example...
int addrindex = B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason, the int datatype for the address index/offset gave better performance. Tested with GCC 8 on a Cortex-A53. Using an equivalent 64-byte vector on other architectures might give the same improvement if you find it is not pre-fetching all the data, as in my case. In my application, with a one-million-iteration loop, this alone improved performance by 5%. There were further requirements for the improvement:
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of NEON intrinsics, since the intrinsics require regular datatype pointers (in my case uint32_t *); otherwise the new built-in prefetch approach showed a performance regression.
My real-world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in scrypt_core() and its internal function, which are all easy to read. The hard work is done by GCC 8. The overall performance improvement was 25%.
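To make the cache-line idea concrete, here is a minimal, self-contained sketch (not the scrypt code; buf, n and PREFETCH_AHEAD are illustrative names) that issues one __builtin_prefetch per 64-byte line, a few lines ahead of the reads:
/* prefetch_line_sketch.c -- sketch only, assuming a 64-byte cache line
   and GCC/Clang's __builtin_prefetch builtin. */
#include <stdint.h>
#include <stddef.h>

#define LINE_WORDS 16        /* 64-byte cache line / sizeof(uint32_t) */
#define PREFETCH_AHEAD 8     /* how many cache lines to run ahead; tune for your CPU */

uint64_t sum_with_prefetch(const uint32_t *buf, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* one prefetch per cache line, a few lines ahead of the current read */
        if ((i % LINE_WORDS) == 0 && i + PREFETCH_AHEAD * LINE_WORDS < n)
            __builtin_prefetch(buf + i + PREFETCH_AHEAD * LINE_WORDS, 0, 1);
        sum += buf[i];
    }
    return sum;
}
For a purely sequential scan like this the hardware prefetcher usually does the job already; the pattern mainly pays off when the address of the next block is computed from data, as in the scrypt example above.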

Hash table implementation

I just bought the book "C Interfaces and Implementations".
In chapter one it implements an "Atom" structure; sample code follows:
#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))
static struct atom {
struct atom *link;
int len;
char *str;
} *buckets[2048];
static unsigned long scatter[] = {
2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,
1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,
813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,
1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,
980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,
471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,
1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,
744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,
1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,
704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,
269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,
1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,
907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,
259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,
348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,
1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,
871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,
109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,
706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,
958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,
1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,
299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,
1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,
618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,
2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,
1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,
316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,
1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,
946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,
2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,
523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,
2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,
1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,
1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,
281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,
1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,
275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,
2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,
48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,
365326415, 790369079, 264348929, 513183458, 536647531, 13672163,
313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,
2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,
1884137923, 53392249, 1735424165, 1602280572
};
const char *Atom_new(const char *str, int len) {
unsigned long h;
int i;
struct atom *p;
assert(str);
assert(len >= 0);
for (h = 0, i = 0; i < len; i++)
h = (h<<1) + scatter[(unsigned char)str[i]];
h &= NELEMS(buckets)-1;
for (p = buckets[h]; p; p = p->link)
if (len == p->len) {
for (i = 0; i < len && p->str[i] == str[i]; )
i++;
if (i == len)
return p->str;
}
p = ALLOC(sizeof (*p) + len + 1);
p->len = len;
p->str = (char *)(p + 1);
if (len > 0)
memcpy(p->str, str, len);
p->str[len] = '\0';
p->link = buckets[h];
buckets[h] = p;//insert atom in front of list
return p->str;
}
At the end of the chapter, in Exercise 3.1, the book's author says:
"Most texts recommend using a prime number for the size of
buckets. Using a prime and a good hash function usually gives a
better distribution of the lengths of the lists hanging off of buckets.
Atom uses a power of two, which is sometimes explicitly cited
as a bad choice. Write a program to generate or read, say, 10,000
typical strings and measure Atom_new’s speed and the distribution
of the lengths of the lists. Then change buckets so that it has
2,039 entries (the largest prime less than 2,048), and repeat the
measurements. Does using a prime help? How much does your
conclusion depend on your specific machine?"
So I changed the hash table size to 2,039, but it seems a prime number actually made
a worse distribution of the lengths of the lists. I also tried 64 and 61; 61 gave a bad distribution too.
I just want to know why a prime table size makes a bad distribution here. Is it because the hash function used by Atom_new is a bad hash function?
I am using this function to print out the lengths of the atom lists:
#define B_SIZE 2048
void Atom_print(void)
{
int i,t;
struct atom *atom;
for(i= 0;i<B_SIZE;i++) {
t = 0;
for(atom=buckets[i];atom;atom=atom->link) {
++t;
}
printf("%d ",t);
}
}
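Since the exercise asks you to measure the distribution rather than eyeball 2,048 numbers, a summary helps. Here is a minimal sketch (Atom_stats is my name, not the book's) that assumes it lives in the same file as Atom_print, so the static buckets[] and B_SIZE are visible:
/* Sketch only: summarizes the distribution that Atom_print dumps raw. */
void Atom_stats(void)
{
    int i, len, used = 0, max = 0;
    long total = 0;
    struct atom *atom;
    for (i = 0; i < B_SIZE; i++) {
        len = 0;
        for (atom = buckets[i]; atom; atom = atom->link)
            ++len;
        if (len > 0) {
            used++;
            total += len;
            if (len > max)
                max = len;
        }
    }
    printf("buckets used: %d/%d, max chain: %d, avg chain (non-empty): %.2f\n",
           used, B_SIZE, max, used ? (double)total / used : 0.0);
}
Comparing "buckets used" and "max chain" between the 2,048-entry and 2,039-entry versions gives a quick picture of which one distributes better.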
Well, a long time ago I had to implement a hash table (in driver development), and I wondered about the same thing. Why the heck should I use a prime number? OTOH a power of 2 seemed even better - instead of calculating the modulus, with a power of 2 you can use a bitwise AND.
So I implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Eventually I noticed that only 1/4 of all the entries in my hash table were filled, because the hash function I used was the identity function, and, as it turned out, all the returned pointers were multiples of 4.
The idea of using prime numbers for the hash table size is the following: real-world hash functions do not produce evenly distributed values; usually there is (or at least there may be) some dependency. So, in order to spread out this distribution, it is recommended to use prime numbers.
BTW, it may theoretically happen that the hash function occasionally produces numbers that are multiples of your chosen prime, but the probability of this is lower than with a non-prime size.
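To see that effect in isolation, here is a small, self-contained sketch (the sizes 1024 and 1021 and the multiples-of-4 keys are illustrative, not the original driver code): it hashes keys that are all multiples of 4 by identity into a power-of-two table and into a prime-sized table, and counts how many buckets get used:
#include <stdio.h>

#define POW2  1024
#define PRIME 1021

int main(void)
{
    int used_pow2 = 0, used_prime = 0;
    static int hit_pow2[POW2], hit_prime[PRIME];
    /* simulate 4096 "pointers" that are all multiples of 4 */
    for (unsigned long key = 0; key < 4096 * 4; key += 4) {
        if (!hit_pow2[key & (POW2 - 1)]++)  used_pow2++;
        if (!hit_prime[key % PRIME]++)      used_prime++;
    }
    printf("power-of-two table: %d/%d buckets used\n", used_pow2, POW2);
    printf("prime table:        %d/%d buckets used\n", used_prime, PRIME);
    return 0;
}
With the power-of-two size only a quarter of the buckets are ever hit (the low two bits are always zero), while the prime size spreads the same keys over every bucket.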
I think it's the code to select the bucket. In the code you pasted it says:
h &= NELEMS(buckets)-1;
That works fine for sizes which are powers of two, since its final effect is choosing the lower bits of h. For other sizes, NELEMS(buckets)-1 will have some bits set to 0, and the bit-wise & operator will discard those bits, effectively leaving "holes" in the bucket list.
The general formula for bucket selection is:
h = h % NELEMS(buckets);
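In other words, if you shrink buckets to 2,039 for the exercise, the reduction step in Atom_new has to switch from the mask to the modulo. A minimal sketch of the two forms (the function names are just for illustration):
#include <assert.h>

/* General form: correct for any number of buckets, including 2,039. */
unsigned long bucket_index(unsigned long h, unsigned long nelems)
{
    return h % nelems;
}

/* Masking shortcut: only equivalent when nelems is a power of two, e.g. 2,048. */
unsigned long bucket_index_pow2(unsigned long h, unsigned long nelems)
{
    assert(nelems && (nelems & (nelems - 1)) == 0);
    return h & (nelems - 1);
}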
This is what Julienne Walker from Eternally Confuzzled has to say about hash table sizes:
"When it comes to hash tables, the most recommended table size is any prime number. This recommendation is made because hashing in general is misunderstood, and poor hash functions require an extra mixing step of division by a prime to resemble a uniform distribution. Another reason that a prime table size is recommended is because several of the collision resolution methods require it to work. In reality, this is a generalization and is actually false (a power of two with odd step sizes will typically work just as well for most collision resolution strategies), but not many people consider the alternatives and in the world of hash tables, prime rules."
There's another factor at work here: the constant hashing values should all be odd/prime and widely dispersed. If you have an even number of units (characters, for instance) in the key to be hashed, then having all-odd constants will give you an even initial hash value; for an odd number of units you'd get an odd value. I've done some experimenting with this, and just that 50/50 split was worth a lot in evening out the distribution. Of course, if all keys are equally long, this doesn't matter.
The hashing also needs to ensure that you won't get the same initial hash value for "AAB" as for "ABA" or "BAA".
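As a quick check of that last point against the book's hash, here is a sketch (atom_hash is my name; it assumes it is compiled in the same file as the Atom code above, so the static scatter[] and buckets[] are visible): the left shift makes the hash position-dependent, so for typical scatter values "AAB", "ABA" and "BAA" should land in different buckets:
#include <stdio.h>
#include <string.h>

/* Recomputes the bucket index the same way Atom_new does; sketch only. */
static unsigned long atom_hash(const char *s)
{
    unsigned long h = 0;
    size_t i, n = strlen(s);
    for (i = 0; i < n; i++)
        h = (h << 1) + scatter[(unsigned char)s[i]];
    return h & (NELEMS(buckets) - 1);
}

int main(void)
{
    const char *keys[] = { "AAB", "ABA", "BAA" };
    int i;
    for (i = 0; i < 3; i++)
        printf("%s -> bucket %lu\n", keys[i], atom_hash(keys[i]));
    return 0;
}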

Resources