Structure of the file and description of the system
The stream I want to analyze (a large binary file) is composed as follows:
A 40-byte header
A stream of 10-byte signals:
The first 8 bytes represent the time when the signal was registered
The last 2 bytes describe the channel where the signal was registered
The signal is emitted by a source which sends an impulse every SIGNAL_INTERVAL, and it may or may not be picked up by the detector. If a detector registers the impulse, it sends the result to one of the counter's channels, which records the count in the format shown above. The counter has 8 channels in total.
Multiplexing
In order to increase the number of detectors, a multiplexing approach is used. Two detectors send their counts to the same channel (say, detectors 1 and 9 are coupled on channel 1 of the counter). One of the signals (for example, 9) is delayed by DELAY, so that the delayed counts are shifted with respect to the non-delayed ones.
Demultiplexing
The idea would be to separate the delayed data from the non-delayed data, then subtract the delay (adding 8 to the channel value, so that a delayed count on channel 1 is shown as a count on channel 9) and then rejoin the two arrays.
If SIGNAL_INTERVAL is constant, this is relatively easy: I define a "mask" [0, DELAY, SIGNAL_INTERVAL] and, taking a reference timestamp value, see where every count falls in the mask.
By trying different masks and counting which one captures the most counts, the delayed counts can be told apart from the non-delayed ones. This last step is needed because the time count is allowed an error, so the stream will not be perfectly clustered; moreover, it is impossible to know a priori whether the first count is a delayed one, a non-delayed one, or even a spurious count.
This is done channel by channel, as the channels may have a different response time from each other.
With this approach, the code is quite simple:
uint64_t maskCheck(struct count *data, int ch_num, int elements){
const int MAX_NUM = 27; // Maximum number of masks checked
int ref = 0; // The reference variable used as starting point at every cycle
uint64_t sing_count[2][MAX_NUM]; // The array containing the singles counts
uint64_t max_count; // Variable used to find the maximum in the array
int t = 0; // Time index for the following loop
uint64_t result = 0; // The final result, i.e. the index with the most singles counts
// Initializing sing_count (it has MAX_NUM columns, so it must be initialized after being declared)
for(int i = 0; i < MAX_NUM; i++)sing_count[0][i] = 0;
for(int i = 0; i < MAX_NUM; i++)sing_count[1][i] = 0;
while(getChannel(data[t])!=ch_num){
t++;
if(t == elements - 1){
printf("%s\n", "Nothing found");
return 0;
}
}
if(getChannel(data[t]) == ch_num) ref = getTimestamp(data[t]);
uint64_t ref_indexed = ref;
for(int index = 0; index < MAX_NUM; index++){
sing_count[1][index] = ref + nsToBins(index) - nsToBins(MAX_NUM/2);
ref_indexed = sing_count[1][index];
for(t = 0; t < elements; t++){
// Skips the counts not occurring at ch_num
if(getChannel(data[t]) != ch_num) {
continue;
}
if(longAbs(getTimestamp(data[t]), ref_indexed) % nsToBins(SIGNAL_INTERVAL) <= nsToBins(MASK) + nsToBins(COUNT_ERROR) &&
longAbs(getTimestamp(data[t]), ref_indexed) % nsToBins(SIGNAL_INTERVAL) >= nsToBins(MASK) - nsToBins(COUNT_ERROR)){
sing_count[0][index]++;
}
else if(longAbs(getTimestamp(data[t]), ref_indexed) % nsToBins(SIGNAL_INTERVAL) <= nsToBins(COUNT_ERROR) ||
longAbs(getTimestamp(data[t]), ref_indexed) % nsToBins(SIGNAL_INTERVAL) >= nsToBins(SIGNAL_INTERVAL) - nsToBins(COUNT_ERROR)){
sing_count[0][index]++;
}
}
}
// This last part maximizes the array.
max_count = sing_count[0][0];
result = sing_count[1][0];
for(int i = 1; i < MAX_NUM; i++){
if(sing_count[0][i] > max_count)
{
max_count = sing_count[0][i];
result = sing_count[1][i];
}
}
return result;
}
where struct count is defined as a 10-byte array read by the functions getTimestamp() and getChannel(), and nsToBins() simply converts the time units.
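For reference, a minimal sketch of what those helpers might look like (the byte order, the bins-per-nanosecond factor, and the definition of longAbs() are assumptions, not stated in the post):
#include <stdint.h>

#define BINS_PER_NS 1ULL  // assumption: 1 bin per nanosecond; the real conversion factor is not given

struct count {
    uint8_t raw[10];      // 8-byte timestamp followed by 2-byte channel, as described above
};

// Assumes the file stores the fields little-endian; adjust the loops if it is big-endian.
static uint64_t getTimestamp(struct count c) {
    uint64_t t = 0;
    for (int i = 0; i < 8; i++)
        t |= (uint64_t)c.raw[i] << (8 * i);
    return t;
}

static int getChannel(struct count c) {
    return c.raw[8] | (c.raw[9] << 8);
}

static uint64_t nsToBins(uint64_t ns) {
    return ns * BINS_PER_NS;
}

// longAbs() is used above but not described; presumably the absolute difference of two unsigned timestamps.
static uint64_t longAbs(uint64_t a, uint64_t b) {
    return a > b ? a - b : b - a;
}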
Having found the "best mask", I can split the array with it and then perform all the other needed operations.
The problem
Now, here comes the problem. SIGNAL_INTERVAL is not constant, and it's not even well determined (to give you an idea, the frequency oscillates between 75.6 MHz and 76.3 MHz).
The above approach turns out to be very unsuccessful now:
SIGNAL_INTERVAL has an error of about 0.3 ns
The measurement is performed over 30 seconds
Keeping in mind that the order of magnitude of SIGNAL_INTERVAL is 10 ns, after just one second the error would be too big: at roughly 76 million pulses per second, even a per-interval error of 0.3 ns accumulates to tens of milliseconds, far larger than the mask window
This results in the timestamps being incorrectly divided, affecting all the subsequent operations.
What I had in mind was to analyze the clusters in the data (SIGNAL_INTERVAL is not constant, but its oscillation is much smaller than DELAY, so some clustering should in principle still be observable) and find another way to separate the two arrays.
But so far I have nothing. Any help would be appreciated.
Related
I am using this program written in C to determine the permutations of size 10 of a regular alphabet.
When I run the program it only uses 36% of my 3GHz CPU leaving 50% free. It also only uses 7MB of my 8GB of RAM.
I would like to use at least 70-80% of my computer's performance and not just this small fraction. This limitation makes the procedure very time-consuming, and I don't know how many days it will take to produce the complete output. I need help to resolve this issue in the shortest possible time, whether by improving the source code or through other possibilities.
Any help is welcome, even if the solution means not using the C language but another one that gives better performance in the execution of the program.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
static int count = 0;
void print_permutations(char arr[], char prefix[], int n, int k) {
int i, j, l = strlen(prefix);
char newprefix[l + 2];
if (k == 0) {
printf("%d %s\n", ++count, prefix);
return;
}
for (i = 0; i < n; i++) {
//Concatenation of currentPrefix + arr[i] = newPrefix
for (j = 0; j < l; j++)
newprefix[j] = prefix[j];
newprefix[l] = arr[i];
newprefix[l + 1] = '\0';
print_permutations(arr, newprefix, n, k - 1);
}
}
int main() {
int n = 26, k = 10;
char arr[27] = "abcdefghijklmnopqrstuvwxyz";
print_permutations(arr, "", n, k);
system("pause");
return 0;
}
There are fundamental problems with your approach:
What are you trying to achieve?
If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed, as it enumerates all combinations of 10 letters from the alphabet, with repetitions allowed. Your program will produce 26^10 combinations, a huge number: 141,167,095,653,376, about 141,167 billion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 petabytes, unlikely to fit on your storage space. Writing this at a top speed of 100MB/s would take roughly six months.
The number of permutations, that is, arrangements of 10 distinct letters from the 26-letter alphabet, is not quite as large: 26! / 16!, which is still huge: 19,275,223,968,000, about 7 times less than the previous result. That is still more than 212 terabytes of storage and roughly 24 days at 100MB/s.
Storing these permutations is therefore impractical. You could change your program to just count the permutations, measure how long that takes, and check that the count matches the expected value. The first step, of course, is to correct your program to produce the correct set.
Test on smaller sets to verify correctness
Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.
Once you have correctness, only then focus on performance
Selecting different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent producing the output. Once you stop producing output, you might see that strlen() consumes a significant fraction of the execution time, which is wasteful since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, removing the need to copy it at each recursive step.
Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.
Memory use is no measure of performance
Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. The 7MB you observe is actually much more than your program's actual needs: this figure includes the full stack space assigned to the program, of which only a tiny fraction will be used.
As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.
Here is a modified program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static unsigned long long count;
void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
if (pos == k) {
prefix[k] = '\0';
++count;
//printf("%llu %s\n", count, prefix);
//puts(prefix);
return;
}
for (int i = 0; i < n; i++) {
if (!used[i]) {
used[i] = 1;
prefix[pos] = arr[i];
print_permutations(arr, n, used, prefix, pos + 1, k);
used[i] = 0;
}
}
}
int main(int argc, char *argv[]) {
int n = 26, k = 10;
char arr[27] = "abcdefghijklmnopqrstuvwxyz";
char used[27] = { 0 };
char perm[27];
unsigned long long expected_count;
clock_t start, elapsed;
if (argc >= 2)
k = strtol(argv[1], NULL, 0);
if (argc >= 3)
n = strtol(argv[2], NULL, 0);
start = clock();
print_permutations(arr, n, used, perm, 0, k);
elapsed = clock() - start;
expected_count = 1;
for (int i = n; i > n - k; i--)
expected_count *= i;
printf("%llu permutations, expected %llu, %.0f permutations per second\n",
count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
return 0;
}
Without output, this program enumerates 140 million permutations per second on my slow laptop; at that rate it would take about 1.5 days to enumerate the 19,275,223,968,000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual-core hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global variable count.
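A minimal sketch of that change (hypothetical, not part of the original answer): drop the global, give each thread a fixed first letter and its own counter, and add the per-thread counts at the end.
#include <pthread.h>
#include <stdio.h>

#define N 26   // alphabet size
#define K 10   // permutation length

struct job {
    int first;                 // index of the fixed first letter for this thread
    unsigned long long count;  // per-thread counter, so no shared global is needed
};

// Counts permutations recursively, exactly as before, but into a caller-supplied counter.
static void count_permutations(char used[], int pos, unsigned long long *count) {
    if (pos == K) {
        ++*count;
        return;
    }
    for (int i = 0; i < N; i++) {
        if (!used[i]) {
            used[i] = 1;
            count_permutations(used, pos + 1, count);
            used[i] = 0;
        }
    }
}

static void *worker(void *arg) {
    struct job *job = arg;
    char used[N] = { 0 };
    used[job->first] = 1;                  // fix the first letter for this thread
    count_permutations(used, 1, &job->count);
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    struct job jobs[N];
    unsigned long long total = 0;

    for (int i = 0; i < N; i++) {
        jobs[i].first = i;
        jobs[i].count = 0;
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < N; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].count;
    }
    printf("%llu permutations\n", total);
    return 0;
}
Compile with -pthread. Spawning 26 threads on fewer cores is fine here; the scheduler keeps every core busy and the per-thread counts are only combined after the joins.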
There are multiple reasons for your bad experience:
Your metric:
Your metric is fundamentally flawed. Peak CPU % is an imprecise measurement of "how much work my CPU does", which normally isn't what you're most interested in anyway. You can inflate this number by doing more work (like starting another thread that doesn't contribute to the output at all).
Your proper metric would be items per second: How many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4), and measure how long it takes.
Your problem: Your problem is hard. Printing or writing down all 26^10 ~1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be 1.4 petabytes - most likely more than your hard drive will accept. Also, even if you used your CPU at 100% and spent one thousand cycles per word, it would take 1.5 years. 1000 cycles is an optimistic estimate; you most likely won't be faster than this while still printing your result, as printf usually takes around 1000 cycles to complete.
Your output: Writing to stdout is slow compared to writing to a file, see https://stackoverflow.com/a/14574238/4838547.
Your program: There are issues with your program that could be a problem for your performance. However, they are dominated by the other problems stated here. With my setup, this program uses 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
Apologies if I have missed any similar post...
I have a ring buffer of BUFFER_SIZE elements that stores event data, so every event increases the buffer_index, starting at index 0.
Now I have a check_cyclic() function that needs to calculate the number of events since last call.
That function does only know the current buffer_index, and it stores that value statically to calculate the difference.
It's guaranteed that BUFFER_SIZE exceeds the maximum number of events that can happen during cycle time.
So we have to consider that there can have been at most one overflow (wrap-around) between two cyclic calls. My implementation is as follows:
void check_cyclic(int buffer_index)
{
static unsigned int last_buffer_index = 0;
static unsigned int events_since_last_call = 0;
if (buffer_index < last_buffer_index)
{ // overflow occurred
events_since_last_call = buffer_index + 1 + BUFFER_SIZE - last_buffer_index;
}
else
{
events_since_last_call = buffer_index + 1 - last_buffer_index;
}
last_buffer_index = buffer_index;
//... do something
}
Of course, letting the event routine itself increment a counter and resetting it in check_cyclic() would be more efficient, but let's assume the interface is as given above.
Is there some more efficient way of calculating events_since_last_call, maybe if we define some requirements on the ring buffer? "Size must be a power of two" or whatever?
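For what it's worth, here is a minimal sketch of the power-of-two variant the question alludes to (not from the original post; it gives the same result as the branchy version as long as fewer than BUFFER_SIZE events occur per cycle, which the stated guarantee provides):
#define BUFFER_SIZE 256  // must be a power of two for the mask trick to work

void check_cyclic_pow2(unsigned int buffer_index)
{
    static unsigned int last_buffer_index = 0;
    // Unsigned wrap-around plus the mask replaces the explicit overflow branch.
    unsigned int events_since_last_call =
        (buffer_index + 1u - last_buffer_index) & (BUFFER_SIZE - 1u);
    last_buffer_index = buffer_index;
    //... do something
    (void)events_since_last_call;
}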
Hi: I have been ramping up on C and I have a couple of philosophical questions about arrays and pointers, and how to make things simple, quick, and small, or at least balance the three, I suppose.
I imagine an MCU sampling an input every so often and storing the sample in an array, called "val", of size "NUM_TAPS". The index of 'val' gets decremented for the next sample after the current, so for instance if val[0] just got stored, the next value needs to go into val[NUM_TAPS-1].
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
It is a slightly different problem than many have solved on this and other forums describing rotating, circular, queue etc. buffers. I don't need (I think) a head and tail pointer because I always have NUM_TAPS data values. I only need to remap the indexes based on a "head pointer".
Below is the code I came up with. It seems to be working fine but it raises a few more questions I'd like to pose to the wider, much more expert community:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS - 1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves, data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
offset = (offset + 1) % NUM_TAPS;
OR
offset++;
if (NUM_TAPS == offset) { offset = 0; }
Thank you!
#include <stdio.h>
#define NUM_TAPS 10
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval+(sample_offset + i)%NUM_TAPS; }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval+(sample_offset + i)%NUM_TAPS)); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
}
}
The cost of a % operator is about 26 clock cycles, since it is implemented using the DIV instruction. An if statement is likely faster, since its instructions will already be present in the pipeline and a correctly predicted branch simply skips the untaken instructions.
Note that both solutions are slow compared to doing a BITWISE AND operation, which takes only 1 clock cycle. For reference, if you want the gory details, check out this chart of the various instruction costs (measured in CPU clock ticks):
http://www.agner.org/optimize/instruction_tables.pdf
The best way to do a fast modulo on a buffer index is to use a power of 2 value for the number of buffers so then you can use the quick BITWISE AND operator instead.
#define NUM_TAPS 16
With a power of 2 value for the number of buffers, you can use a bitwise AND to implement modulo very efficiently. Recall that bitwise AND with a 1 leaves the bit unchanged, while bitwise AND with a 0 leaves the bit zero.
So by doing a bitwise AND of NUM_TAPS-1 with your incremented index, assuming that NUM_TAPS is 16, then it will cycle through the values 0,1,2,...,14,15,0,1,...
This works because NUM_TAPS-1 equals 15, which is 00001111b in binary. The bitwise AND results in a value where only the last 4 bits are preserved, while any higher bits are zeroed.
So everywhere you use "% NUM_TAPS", you can replace it with "& (NUM_TAPS-1)" - note the extra parentheses below, since & binds less tightly than + in C. For example:
#define NUM_TAPS 16
...
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++)
{ x[i] = pval + ((sample_offset + i) & (NUM_TAPS-1)); }
Here is your code modified to work with BITWISE AND, which is the fastest solution.
#include <stdio.h>
#define NUM_TAPS 16 // Use a POWER of 2 for speed, 16=2^4
#define MOD_MASK (NUM_TAPS-1) // Saves typing and makes code clearer
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval + ((sample_offset + i) & MOD_MASK); }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval + ((sample_offset + i) & MOD_MASK))); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
// sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
// sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
// MOD_MASK works faster than the above
sample_offset = (sample_offset - 1) & MOD_MASK;
}
}
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
Any way you implement this is very expensive, because each time you record a new sample, you have to move all the other samples (or pointers to them, or an equivalent). Pointers don't really help you here. In fact, using pointers as you do is probably a little more costly than just working directly with the buffer.
My suggestion would be to give up the idea of "remapping" indices persistently, and instead do it only virtually, as needed. I'd probably ease that and ensure it is done consistently by writing data access macros to use in place of direct access to the buffer. For example,
// expands to an expression designating the sample at the specified
// (virtual) index
#define SAMPLE(index) (val[((index) + sample_offset) % NUM_TAPS])
You would then use SAMPLE(n) instead of x[n] to read the samples.
I might consider also providing a macro for adding new samples, such as
// Updates sample_offset and records the given sample at the new offset
#define RECORD_SAMPLE(sample) do { \
sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS; \
val[sample_offset] = sample; \
} while (0)
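A short, self-contained usage example (hypothetical; it re-declares the buffer from the question so it compiles on its own) might look like this:
#include <stdio.h>

#define NUM_TAPS 10

static int val[NUM_TAPS];
static int sample_offset = 0;

#define SAMPLE(index) (val[((index) + sample_offset) % NUM_TAPS])

#define RECORD_SAMPLE(sample) do { \
    sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS; \
    val[sample_offset] = sample; \
} while (0)

int main(void) {
    for (int s = 1; s <= 15; s++) {
        RECORD_SAMPLE(s);                  // the newest sample is now SAMPLE(0)
        for (int i = 0; i < NUM_TAPS; i++)
            printf("%3d ", SAMPLE(i));     // SAMPLE(NUM_TAPS - 1) is the oldest
        printf("\n");
    }
    return 0;
}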
With regard to your specific questions:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS - 1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?
I would choose modulus over a conditional every time. Do, however, watch out for taking the modulus of a negative number (see above for an example of how to avoid doing so); such a computation may not mean what you think it means. For example -1 % 2 == -1, because C specifies that (a/b)*b + a%b == a for any a and b such that the quotient is representable.
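A quick illustration (a hypothetical snippet, not from the original answer) of why the decrement should add NUM_TAPS before taking the modulus:
#include <stdio.h>

#define NUM_TAPS 10

int main(void) {
    int offset = 0;
    printf("%d\n", (offset - 1) % NUM_TAPS);             // prints -1: not a valid index
    printf("%d\n", (offset + NUM_TAPS - 1) % NUM_TAPS);  // prints 9: wraps as intended
    return 0;
}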
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves, data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?
But your implementation does not rotate the indices. Instead, it shifts pointers. Not only is this about as expensive as shifting the data themselves, but it also adds the cost of indirection for access to the data.
Additionally, you seem to have the impression that pointer representations are small compared to representations of other built-in data types. This is rarely the case. Pointers are usually among the largest of a given C implementation's built-in data types. In any event, neither shifting around the data nor shifting around pointers is efficient.
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
On modern machines, the modulus operator is much faster on average than a conditional whose result is difficult for the CPU to predict. CPUs these days have long instruction pipelines, and they perform branch prediction and corresponding speculative computation to enable them to keep these full when a conditional instruction is encountered, but when they discover that they have predicted incorrectly, they need to flush the whole pipeline and redo several computations. When that happens, it's a lot more expensive than a small number of unconditional arithmetical operations.
I'm trying to implement a circular buffer in order to average a stream of data points generated by a pressure sensor in C running on an embedded controller. The idea is to store the last N pressure readings in the buffer while maintaining a running sum of the buffer. Average = sum / N. Should be trivial.
However, the average I'm seeing is a value that starts near the pressure reading (I preload the buffer registers with a typical value), but which subsequently trends towards zero. If I also display the sum, it too is dropping asymptotically to zero. If the pressure changes, the average moves away from zero in the direction of the pressure change, but returns to its zero trend as soon as the pressure stabilizes.
If anyone could spot the error I'm making, it would be very helpful.
#define ARRAYSIZE 100
double Sum; // variable for running sum
double Average; // variable for average
double PressureValue[ARRAYSIZE]; // declare value array
int i; // data array index
int main(void) {
while (1)
{
if (i == ARRAYSIZE) i = 0; // test index, reset if it reaches the upper boundary
Sum = Sum - PressureValue[i]; // subtract the old datapoint from running sum
PressureValue[i] = PRESSURE; // replace previous loop datapoint with new data
Sum = Sum + PressureValue[i]; // add back the new current value to the running sum
Average = Sum / ARRAYSIZE; // calculate average value = SUM / ARRAYSIZE
++i; // increment index
} // end while loop
} // end main
The averaging code takes place in an interrupt handler; I'm reading the data from the pressure sensor via I2C with interrupts triggered at the end of each I2C communication phase. During the last phase, after the four bytes comprising the pressure data have been retrieved, they are assembled into a complete reading, and then converted to a decimal reading in PSI contained in the PRESSURE variable.
Obviously, this isn't a direct cut and paste from my code, but I didn't want anyone to have to wade through the whole thing, so I've limited it to just the stuff relevant to computing the average, and changed the variable names to be more readable. Still, I can't spot what I'm doing wrong.
Thanks for your attention!
Doug G.
I don't see anything obviously wrong with your code, but as you say, you're not providing all of it, so who knows what's happening in the rest of it (in particular, how/if you're initializing i and Sum), but the following works fine for me, which is basically the same algorithm you have:
#include <stdio.h>
#include <stddef.h>
double PressureValue[8];
double Pressures[800];
int main(void) {
const size_t array_size = sizeof(PressureValue) / sizeof(PressureValue[0]);
const size_t num_pressures = sizeof(Pressures) / sizeof(Pressures[0]);
size_t count = 0, i = 0;
double average = 0;
/* Initialize PressureValue to {0, 1, 2, 3, ...} */
for ( size_t n = 0; n < array_size; ++n ) {
PressureValue[n] = n;
}
double sum = ((array_size - 1) / (double) 2) * array_size;
/* Initialize pressures to repeats of PressureValue */
for ( size_t n = 0; n < num_pressures; ++n ) {
Pressures[n] = n % array_size;
}
while ( count < num_pressures ) {
if ( i == array_size )
i = 0;
sum -= PressureValue[i];
PressureValue[i] = Pressures[count++];
sum += PressureValue[i++];
}
average = sum / array_size;
printf("Sum is %f\n", sum);
printf("Counted %zu pressures\n", count);
printf("Average is %f\n", average);
return 0;
}
Outputs:
paul@local:~/src/c/scratch$ ./pressure
Sum is 28.000000
Counted 800 pressures
Average is 3.500000
paul@local:~/src/c/scratch$
Just one more possibility, when you say they are "converted to a decimal reading in PSI contained in the PRESSURE variable", and elsewhere, for that matter, make sure you're not getting things truncated to zero because of integer division. If you've got things "trending to zero" as you're adding more, that's something I'd be immediately suspicious of. A classic error in converting Fahrenheit to Celsius, for instance, would be to write c = (f - 32) * (5 / 9), where that (5 / 9) truncates to zero every time, and always leaves you with c == 0.
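To illustrate (a hypothetical snippet, not taken from the question's code), the parenthesization alone decides whether the conversion collapses to zero:
#include <stdio.h>

int main(void) {
    int f = 212;                     // boiling point in Fahrenheit
    int wrong = (f - 32) * (5 / 9);  // 5 / 9 is integer division: it truncates to 0, so wrong == 0
    int right = (f - 32) * 5 / 9;    // multiply first, then divide: 180 * 5 / 9 == 100
    printf("wrong = %d, right = %d\n", wrong, right);
    return 0;
}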
Also, as a general rule, I understand that you "didn't want anyone to have to wade through the whole thing", but you'd be surprised how many times the real problem is not in the part of the code that you think it is. This is why it's important to provide an SSCCE to ensure that you can narrow down your code and actually isolate and reproduce the problem. If you try to narrow down your code and find that you can't isolate and reproduce the problem, then it's almost certain that your issue is not being caused by the thing you think is causing it.
It is also possible your code is working exactly as intended. If you are preloading your array with typical values outside of this loop and then running this code, you would get the behavior you are describing. If you are preloading the array, make sure you also preload the sum and the average; otherwise you are essentially measuring gauge pressure, with your preloaded value as atmospheric pressure.
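For example, a preload that keeps the running sum consistent with the buffer might look like this (a sketch; TYPICAL_PRESSURE is a hypothetical constant, not part of the original code):
#define ARRAYSIZE 100
#define TYPICAL_PRESSURE 14.7   // hypothetical typical reading, in PSI

double PressureValue[ARRAYSIZE];
double Sum;
double Average;

void preload_buffer(void) {
    for (int i = 0; i < ARRAYSIZE; i++)
        PressureValue[i] = TYPICAL_PRESSURE;
    Sum = TYPICAL_PRESSURE * ARRAYSIZE;  // keep the running sum consistent with the buffer contents
    Average = TYPICAL_PRESSURE;
}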
There are 2 very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements in the second series. This can be traversed and processed with 2 nested loops. But the unpredictability of the number of matching elements for each member of the first array makes things very, very slow.
The actual processing of the 2nd series of elements involves a bitwise AND (&) and a population count.
I couldn't find good optimizations using C, but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then doing batch processing of the matching bytes of the second series, perhaps in buffers of 1MB or something. But the code would get quite messy.
Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!
Sample/demo code with simplified problem, first series are "people" and second series are "events", for clarity's sake. (the original problem is actually 100m and 10,000m entries!)
#include <stdio.h>
#include <stdint.h>
#define PEOPLE 1000000 // 1m
struct Person {
uint8_t age; // Filtering condition
uint8_t cnt; // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags
#define EVENTS 100000000 // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags
void init_arrays() {
for (int i = 0; i < PEOPLE; i++) { // just some stuff
P[i].age = i & 0x07;
P[i].cnt = i % 220; // assert( sum < EVENTS );
}
for (int i = 0; i < EVENTS; i++) {
P1[i] = i % 7; // just some stuff
P2[i] = i % 9; // just some other stuff
}
}
int main(int argc, char *argv[])
{
uint64_t sum = 0, fcur = 0;
int age_filter = 7; // just some
init_arrays(); // Init P, P1, P2
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
sum += __builtin_popcount( P1[fcur] & P2[fcur] );
else
fcur += P[p].cnt; // skip this person's events
printf("(dummy %ld %ld)\n", sum, fcur );
return 0;
}
gcc -O5 -march=native -std=c99 test.c -o test
Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:
uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
if (P[p].age < age_filter) {
int64_t e = P[p].cnt;
for ( ; e >= 8 ; e -= 8) {
sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
p1 += 8;
p2 += 8;
}
for ( ; e ; e--) {
sum += __builtin_popcount( *p1++ & *p2++ );
}
} else {
p1 += P[p].cnt;
p2 += P[p].cnt;
}
}
In my testing this speeds up your code from 1.515s to 0.855s.
The answer by Neil doesn't require sorting by age, which, by the way, could be a good idea --
If the second loop has holes (please correct the original source code to support that idea), a common solution is to build a cumulative sum: cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person:
sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
Anyway, it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can still take place by calling the inner loop once over all consecutive people meeting the condition.
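Put together, a minimal sketch of the prefix-sum idea (reusing the arrays declared in the question; the cumsum array is an addition, costing 8 bytes per event, so about 800 MB at the sample sizes):
#include <stdint.h>

// Assumes P, P1, P2, PEOPLE and EVENTS exactly as declared in the question.
static uint64_t cumsum[EVENTS + 1];

uint64_t filtered_sum(int age_filter) {
    // One pass over the events builds the prefix sums of the popcounts.
    cumsum[0] = 0;
    for (int64_t e = 0; e < EVENTS; e++)
        cumsum[e + 1] = cumsum[e] + __builtin_popcount(P1[e] & P2[e]);

    // One pass over the people answers the query as a difference of prefix sums.
    uint64_t sum = 0, fcur = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        if (P[p].age < age_filter)
            sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
        fcur += P[p].cnt;
    }
    return sum;
}
Building cumsum is itself an O(EVENTS) pass, so this pays off mainly when several different filters are evaluated against the same event data.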
If there are really at most 8 predicates, it could make sense to precalculate all the sums (popcounts of predicate[0..255]) for each person into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).
Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely, just some SSE parallelism, 8 or 16 people at a time, will do.
A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. First, if the event tables are not very random, or if they do not consist of pathological functions such as the truth tables of bignum multiplication, then one may achieve compression of the functions; second, arithmetic operations on truth tables can be calculated in compressed form. Each subtree can be shared between users, and each arithmetic operation for two identical subtrees has to be calculated only once.
I don't know if your sample code accurately reflects your problem but it can be rewritten like this:
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
fcur += P[p].cnt;
for (int64_t e = 0; e < fcur ; e++)
sum += __builtin_popcount( P1[e] & P2[e] );
I don't know about gcc -O5 (it does not seem to be documented) and it seems to produce exactly the same code as gcc -O3 here with my gcc 4.5.4 (though only tested on a relatively small code sample).
depending on what you want to achieve, -O3 can be slower than -O2
As with your problem, I'd suggest thinking more about your data structure than the actual algorithm.
You should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't represented in a convenient manner.
If you want to quickly cut a large set of your data based on a single criterion (here, age in your example), I'd recommend using a variant of a sorted tree.
If your actual data (age, count, etc.) is indeed 8-bit, there is probably a lot of redundancy in the calculations. In this case you can replace the processing by lookup tables - for each 8-bit value there are only 256 possibilities, and instead of computing, it might be possible to read the precomputed data from a table.
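As a minimal sketch of that idea (the table name and layout are illustrative, not from the original answer): since P1[i] and P2[i] are both 8-bit, popcount(P1[i] & P2[i]) has only 256x256 possible inputs, so it can be tabulated once up front.
#include <stdint.h>

// 64 KB table: popcount of (a & b) for every pair of 8-bit values.
static uint8_t popcnt_and[256][256];

void init_popcnt_and(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            popcnt_and[a][b] = (uint8_t)__builtin_popcount(a & b);
}

// The inner loop then becomes a plain table lookup instead of an AND plus a popcount.
uint64_t sum_range(const uint8_t *p1, const uint8_t *p2, int64_t n) {
    uint64_t sum = 0;
    for (int64_t e = 0; e < n; e++)
        sum += popcnt_and[p1[e]][p2[e]];
    return sum;
}
Whether this beats a hardware popcount instruction depends on the target; it helps most on CPUs without one, since the 64 KB table competes for cache with the data streams.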
To tackle the branch mispredictions (not addressed in the other answers), the code could do something like:
#ifdef MISPREDICTIONS
if (cond)
    sum += value;
#else
mask = -(cond != 0);    // cond false: mask = 0, binary 00..; cond true: mask = -1, binary 11..
sum += (value & mask);  // adds value when mask is all ones, adds 0 when mask is 0
#endif
It's not completely free since there are data dependencies (think superscalar cpu). But it usually gets a 10x boost for mostly unpredictable conditions.