Efficient Rice decoding implementation in C

I made a naïve implementation of a Rice decoder (and encoder):
void rice_decode(int k) {
    int i = 0;
    int j = 0;
    int x = 0;
    while (i < size - k) {
        /* unary part: count the zero bits up to the terminating 1 */
        int q = 0;
        while (get(i) == 0) {
            q++;
            i++;
        }
        x = q << k;
        i++;                        /* skip the terminating 1 bit */
        /* remainder: the next k bits, least significant first */
        for (j = 0; j < k; j++) {
            x += get(i + j) << j;
        }
        i += k;
        printf("%i\n", x);
        x = 0;
    }
}
with size the size of the input bitset, get(i) a primitive returning the i-th bit of the bitset, and k the Rice parameter. As I am concerned with performance, I also made a more elaborate implementation with precomputation, which is faster. However, when I turn on the -O3 flag in gcc, the naïve implementation actually outperforms the latter.
My question is: do you know of any existing efficient implementation of a Rice encoder/decoder (I am more concerned with decoding) that fares better than this? The ones I could find are either slower or comparable. Alternatively, do you have any clever idea that could make decoding faster, other than precomputation?

Rice coding can be viewed as a variant of variable-length codes. Consequently, in the past I've used table-based techniques to generate automata/state machines that decode fixed Huffman codes and Rice codes quickly. A quick search of the web for fast variable-length codes or fast Huffman yields many applicable results, some of them table-based.
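As a small illustration of the table idea (my own sketch, not taken from any particular decoder; it assumes the bitstream is packed LSB-first into bytes): precompute, for every possible byte value, where its lowest set bit lies, and consume the unary run of a Rice code a byte at a time instead of a bit at a time.
#include <stdint.h>
/* lowest_set[b] = index of the lowest set bit of byte b (for b != 0).
   A zero byte means the unary run continues into the next byte. */
static uint8_t lowest_set[256];
static void init_lowest_set(void) {
    for (int b = 1; b < 256; b++) {
        int i = 0;
        while (((b >> i) & 1) == 0)
            i++;
        lowest_set[b] = (uint8_t)i;
    }
}
Codes that start in the middle of a byte need the already-consumed low bits masked off before the lookup; a full table-driven decoder folds that, and the k remainder bits, into a small state machine.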

There are some textbook bithacks applicable here.
while (get(i) == 0) { q++; i++; } finds the lowest set bit in the stream, counting the zeros before it.
That can be replaced with data & -data, which isolates the lowest set bit. That single-bit value can be converted to the index q with a small hash plus LUT (e.g. the classic one involving a modulus by 37); or, using the SSE4.2 crc32 instruction, I'd bet one could simply do LUT[crc32(data & -data) & 63];
The next loop, for(j=0; j<k; j++) x += get(i+j)<<j;, should on the other hand be replaced with x += data & ((1<<k)-1);, as it simply takes k bits from the stream and treats them as an unsigned integer (with data shifted so that those k bits are the lowest ones).
Finally one advances past the q + 1 + k bits just consumed (the zeros, the terminating 1, and the remainder), shifting data accordingly and reading in enough bytes from the input stream.
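Putting those pieces together, here is a minimal sketch of one decode step (my illustration, not the poster's code; it assumes the stream has been packed LSB-first into a 64-bit window, that the window always holds at least one set bit plus k remainder bits, and that refilling the window from the byte stream is handled elsewhere):
#include <stdint.h>
/* Decode one Rice-coded value from the 64-bit window w (LSB-first bits).
   Writes the number of bits consumed to *consumed.  Undefined if w == 0,
   since __builtin_ctzll requires a nonzero argument. */
static uint64_t rice_decode_one(uint64_t w, int k, int *consumed) {
    int q = __builtin_ctzll(w);            /* length of the unary run of zeros */
    w >>= q + 1;                           /* drop the run and its terminating 1 */
    uint64_t r = w & ((1ull << k) - 1);    /* k remainder bits, LSB-first */
    *consumed = q + 1 + k;
    return ((uint64_t)q << k) | r;
}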

Related

How to Optimize Simple Circular/Rotating Buffer/FIFO Handling for Performance

Hi: I have been ramping up on C and I have a couple of philosophical questions about arrays and pointers and how to make things simple, quick, and small, or at least how to balance the three.
I imagine an MCU sampling an input every so often and storing the sample in an array, called "val", of size "NUM_TAPS". The index of 'val' gets decremented for the next sample after the current, so for instance if val[0] just got stored, the next value needs to go into val[NUM_TAPS-1].
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
It is a slightly different problem than many have solved on this and other forums describing rotating, circular, queue etc. buffers. I don't need (I think) a head and tail pointer because I always have NUM_TAPS data values. I only need to remap the indexes based on a "head pointer".
Below is the code I came up with. It seems to be working fine but it raises a few more questions I'd like to pose to the wider, much more expert community:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS - 1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves that data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
offset = (++offset)%N;
OR
offset++;
if (NUM_TAPS == offset) { offset = 0; }
Thank you!
#include <stdio.h>
#define NUM_TAPS 10
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval+(sample_offset + i)%NUM_TAPS; }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval+(sample_offset + i)%NUM_TAPS)); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
}
}
The cost of a % operator is about 26 clock cycles, since it is implemented using the DIV instruction. An if statement is likely faster, since its instructions will already be present in the pipeline and the processor only has to skip a few of them, which it can do quickly.
Note that both solutions are slow compared to doing a BITWISE AND operation, which takes only one clock cycle. For reference, if you want the gory details, check out this chart of instruction costs (measured in CPU clock ticks):
http://www.agner.org/optimize/instruction_tables.pdf
The best way to do a fast modulo on a buffer index is to use a power-of-2 value for the buffer size, so that you can use the quick BITWISE AND operator instead.
#define NUM_TAPS 16
With a power-of-2 buffer size, you can use a bitwise AND to implement the modulo very efficiently. Recall that ANDing with a 1 bit leaves the bit unchanged, while ANDing with a 0 bit forces it to zero.
So by ANDing your incremented index with NUM_TAPS-1, assuming that NUM_TAPS is 16, it will cycle through the values 0,1,2,...,14,15,0,1,...
This works because NUM_TAPS-1 equals 15, which is 00001111b in binary. The bitwise AND preserves only the last 4 bits, while any higher bits are zeroed.
So everywhere you use "% NUM_TAPS", you can replace it with "& (NUM_TAPS-1)" (mind that & binds more loosely than +, so parenthesize the index expression). For example:
#define NUM_TAPS 16
...
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++)
{ x[i] = pval + ((sample_offset + i) & (NUM_TAPS-1)); }
Here is your code modified to work with BITWISE AND, which is the fastest solution.
#include <stdio.h>
#define NUM_TAPS 16 // Use a POWER of 2 for speed, 16=2^4
#define MOD_MASK (NUM_TAPS-1) // Saves typing and makes code clearer
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval + ((sample_offset + i) & MOD_MASK); }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval + ((sample_offset + i) & MOD_MASK))); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
// sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
// sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
// MOD_MASK works faster than the above
sample_offset = (sample_offset - 1) & MOD_MASK;
}
}
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
Any way you implement this is very expensive, because each time you record a new sample, you have to move all the other samples (or pointers to them, or an equivalent). Pointers don't really help you here. In fact, using pointers as you do is probably a little more costly than just working directly with the buffer.
My suggestion would be to give up the idea of "remapping" indices persistently, and instead do it only virtually, as needed. I'd probably ease that and ensure it is done consistently by writing data access macros to use in place of direct access to the buffer. For example,
// expands to an expression designating the sample at the specified
// (virtual) index
#define SAMPLE(index) (val[((index) + sample_offset) % NUM_TAPS])
You would then use SAMPLE(n) instead of x[n] to read the samples.
I might consider also providing a macro for adding new samples, such as
// Updates sample_offset and records the given sample at the new offset
#define RECORD_SAMPLE(sample) do { \
sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS; \
val[sample_offset] = sample; \
} while (0)
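For example, the body of the event loop could then look something like this (my sketch, reusing the question's live_sample value):
RECORD_SAMPLE(live_sample);            /* store the newest sample */
for (int i = 0; i < NUM_TAPS; i++)
    printf("%3d ", SAMPLE(i));         /* SAMPLE(0) is the newest, SAMPLE(NUM_TAPS-1) the oldest */
printf("\n");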
With regard to your specific questions:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS -1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?
I would choose modulus over a conditional every time. Do, however, watch out for taking the modulus of a negative number (see above for an example of how to avoid doing so); such a computation may not mean what you think it means. For example -1 % 2 == -1, because C specifies that (a/b)*b + a%b == a for any a and b such that the quotient is representable.
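A tiny illustration of the pitfall and the usual remedies (my example, not from the original answer):
int i = -1;
int a = i % NUM_TAPS;                            /* -1: C division truncates toward zero */
int b = (i + NUM_TAPS) % NUM_TAPS;               /* NUM_TAPS - 1: add the modulus before taking % */
int c = ((i % NUM_TAPS) + NUM_TAPS) % NUM_TAPS;  /* same result, safe for any int i */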
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves that data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?
But your implementation does not rotate the indices. Instead, it shifts pointers. Not only is this about as expensive as shifting the data themselves, but it also adds the cost of indirection for access to the data.
Additionally, you seem to have the impression that pointer representations are small compared to representations of other built-in data types. This is rarely the case. Pointers are usually among the largest of a given C implementation's built-in data types. In any event, neither shifting around the data nor shifting around pointers is efficient.
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
On modern machines, the modulus operator is much faster on average than a conditional whose result is difficult for the CPU to predict. CPUs these days have long instruction pipelines, and they perform branch prediction and corresponding speculative computation to enable them to keep these full when a conditional instruction is encountered, but when they discover that they have predicted incorrectly, they need to flush the whole pipeline and redo several computations. When that happens, it's a lot more expensive than a small number of unconditional arithmetical operations.

How to generate a very large non singular matrix A in Ax = b?

I am solving the system of linear algebraic equations Ax = b using the Jacobi method, but currently with manual input. I want to analyze the performance of the solver for large systems. Is there a method to generate a matrix A that is guaranteed to be non-singular?
I am attaching my code here.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define TOL = 0.0001
void main()
{
int size,i,j,k = 0;
printf("\n enter the number of equations: ");
scanf("%d",&size);
double reci = 0.0;
double *x = (double *)malloc(size*sizeof(double));
double *x_old = (double *)malloc(size*sizeof(double));
double *b = (double *)malloc(size*sizeof(double));
double *coeffMat = (double *)malloc(size*size*sizeof(double));
printf("\n Enter the coefficient matrix: \n");
for(i = 0; i < size; i++)
{
for(j = 0; j < size; j++)
{
printf(" coeffMat[%d][%d] = ",i,j);
scanf("%lf",&coeffMat[i*size+j]);
printf("\n");
//coeffMat[i*size+j] = 1.0;
}
}
printf("\n Enter the b vector: \n");
for(i = 0; i < size; i++)
{
x[i] = 0.0;
printf(" b[%d] = ",i);
scanf("%lf",&b[i]);
}
double sum = 0.0;
while(k < size)
{
for(i = 0; i < size; i++)
{
x_old[i] = x[i];
}
for(i = 0; i < size; i++)
{
sum = 0.0;
for(j = 0; j < size; j++)
{
if(i != j)
{
sum += (coeffMat[i * size + j] * x_old[j] );
}
}
x[i] = (b[i] -sum) / coeffMat[i * size + i];
}
k = k+1;
}
printf("\n Solution is: ");
for(i = 0; i < size; i++)
{
printf(" x[%d] = %lf \n ",i,x[i]);
}
}
This is all a bit Heath Robinson, but here's what I've used. I have no idea how 'random' such matrices are; in particular, I don't know what distribution they follow.
The idea is to generate the SVD of the matrix. (Called A below, and assumed nxn).
Initialise A to all 0s
Then generate n positive numbers, and put them, with random signs, in the diagonal of A. I've found it useful to be able to control the ratio of the largest of these positive numbers to the smallest. This ratio will be the condition number of the matrix.
Then repeat n times: generate a random n-vector f, and multiply A on the left by the Householder reflector I - 2*f*f' / (f'*f). Note that this can be done more efficiently than by forming the reflector matrix and doing a normal multiplication; indeed it's easy to write a routine that, given f and A, will update A in place.
Repeat the above but multiplying on the right.
As for generating test data, a simple way is to pick an x0 and then compute b = A * x0. Don't expect to get exactly x0 back from your solver; even if it is remarkably well behaved, you'll find that the errors get bigger as the condition number gets bigger.
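A minimal sketch of the reflector step described above (my own illustration; the function names, the flat row-major layout of A, and updating in place are assumptions, not part of the original answer):
/* A <- (I - 2*f*f'/(f'*f)) * A : apply the Householder reflector on the left,
   updating the n x n row-major matrix A in place. */
void householder_left(double *A, const double *f, int n) {
    double ff = 0.0;
    for (int i = 0; i < n; i++) ff += f[i] * f[i];
    for (int j = 0; j < n; j++) {
        double dot = 0.0;                            /* (f' * A)[j] */
        for (int i = 0; i < n; i++) dot += f[i] * A[i*n + j];
        double s = 2.0 * dot / ff;
        for (int i = 0; i < n; i++) A[i*n + j] -= s * f[i];
    }
}
/* A <- A * (I - 2*f*f'/(f'*f)) : the same reflector applied on the right. */
void householder_right(double *A, const double *f, int n) {
    double ff = 0.0;
    for (int i = 0; i < n; i++) ff += f[i] * f[i];
    for (int i = 0; i < n; i++) {
        double dot = 0.0;                            /* (A * f)[i] */
        for (int j = 0; j < n; j++) dot += A[i*n + j] * f[j];
        double s = 2.0 * dot / ff;
        for (int j = 0; j < n; j++) A[i*n + j] -= s * f[j];
    }
}
Starting from the diagonal matrix of chosen singular values, one would call each of these n times with freshly generated random vectors f.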
Talonmies' comment mentions http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-658.pdf which is probably the right approach (at least in principle, and in full generality).
However, you are probably not handling "very large" matrices (e.g. because your program uses naive algorithms, and because you don't run it on a large supercomputer with a lot of RAM). So the naive approach of generating a matrix with random coefficients and testing afterwards that it is non-singular is probably enough.
Very large matrices would have many billions of coefficients, and you would need a powerful supercomputer with, e.g., terabytes of RAM. You probably don't have that; if you did, your program would probably run for too long (it has no parallelism) and might give very wrong results (read http://floating-point-gui.de/ for more), so this case doesn't concern you.
A matrix of a million coefficients (e.g. 1024*1024) is considered small by current hardware standards (and is more than enough to test your code on current laptops or desktops, and even to test some parallel implementations), and randomly generating some of them (and computing their determinant to check that they are not singular) is enough and easily doable. You might even generate them and/or check their regularity with some external tool, e.g. Scilab, R, Octave, etc. Once your program has computed a solution x0, you could use some tool (or write another program) to compute A*x0 - b and check that it is very close to the zero vector (there are some cases where you would be disappointed or surprised, since round-off errors matter).
You'll need a good enough pseudo-random number generator, perhaps something as simple as drand48(3), which is considered nearly obsolete (you should find and use something better); you could seed it with some random source (e.g. /dev/urandom on Linux).
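For the naive approach, here is a sketch of the two pieces, random fill plus a residual check once the solver has produced x (my own example; drand48 from <stdlib.h> is POSIX, and the flat row-major layout matches the question's coeffMat):
#include <math.h>
#include <stdlib.h>
/* Fill an n x n row-major matrix with random coefficients in [-1, 1). */
void fill_random(double *A, int n) {
    for (int i = 0; i < n * n; i++)
        A[i] = 2.0 * drand48() - 1.0;
}
/* Max-norm of the residual A*x - b; compare it against a tolerance
   such as the question's 0.0001 after the solver has run. */
double residual_norm(const double *A, const double *x, const double *b, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double r = -b[i];
        for (int j = 0; j < n; j++) r += A[i*n + j] * x[j];
        if (fabs(r) > worst) worst = fabs(r);
    }
    return worst;
}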
BTW, compile your code with all warnings & debug info (e.g. gcc -Wall -Wextra -g). Your #define TOL = 0.0001 is probably wrong (should be #define TOL 0.0001 or const double tol = 0.0001;). Use the debugger (gdb) & valgrind. Add optimizations (-O2 -mcpu=native) when benchmarking. Read the documentation of every used function, notably those from <stdio.h>. Check the result count from scanf... In C99, you should not cast the result of malloc, but you forgot to test against its failure, so code:
double *b = malloc(size*sizeof(double));
if (!b) {perror("malloc b"); exit(EXIT_FAILURE); };
You should rather end, not start, your printf control strings with \n, because stdout is often (but not always!) line buffered. See also fflush.
You should probably also read some basic linear algebra textbook...
Notice that actually writing robust and efficient programs to invert matrices or to solve linear systems is a difficult art (which I don't know at all: it has programming issues, algorithmic issues, and mathematical issues; read some numerical analysis book). You can still get a PhD and spend your whole life working on that. Please understand that you need ten years to learn programming (or many other things).

memmove vs. copying individual array elements

In CLRS chapter 2 there is an exercise which asks whether the worst-case running time of insertion sort can be improved to O(n lg n). I saw this question and found that it cannot be done.
The worst-case complexity cannot be improved, but would the real running time be better using memmove compared to moving the array elements individually?
Code for individually moving elements
void insertion_sort(int arr[], int length)
{
/*
Sorts into increasing order
For decreasing order change the comparison in for-loop
*/
for (int j = 1; j < length; j++)
{
int temp = arr[j];
int k;
for (k = j - 1; k >= 0 && arr[k] > temp; k--){
arr[k + 1] = arr[k];
}
arr[k + 1] = temp;
}
}
Code for moving elements by using memmove
void insertion_sort(int arr[], int length)
{
for (int j = 1; j < length; j++)
{
int temp = arr[j];
int k;
for (k = j - 1; k >= 0 && arr[k] > temp; k--){
;
}
if (k != j - 1){
memmove(&arr[k + 2], &arr[k + 1], sizeof(int) *(j - k - 2));
}
arr[k + 1] = temp;
}
}
I couldn't get the 2nd one to run perfectly but that is an example of what I am thinking of doing.
Would there be any visible speed improvements by using memmove?
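For reference, here is one way the memmove variant above can be fixed (my correction; the apparent problem is the element count passed to memmove, which is off by one):
#include <string.h>
void insertion_sort_memmove(int arr[], int length)
{
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        for (k = j - 1; k >= 0 && arr[k] > temp; k--)
            ;                                   /* only locate the insertion point */
        if (k != j - 1)                         /* shift arr[k+1 .. j-1] right by one slot */
            memmove(&arr[k + 2], &arr[k + 1], sizeof(int) * (j - k - 1));
        arr[k + 1] = temp;
    }
}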
The implementation behind memmove() might be more optimized in your C library. Some architectures have instructions for moving whole blocks of memory at once very efficiently. The theoretical running-time complexity won't be improved, but it may still run faster in real life.
memmove would be perfectly tuned to make maximum use of the available system resources (unique for each implementation, of course).
Here is a little quote from Expert C Programming - Deep C Secrets on the difference between using a loop and using memcpy (preceding it are two code snippets, one copying a source into a destination using a for loop and the other using memcpy):
In this particular case both the source and destination use the same cache line, causing every memory reference to miss the cache and stall the processor while it waited for regular memory to deliver. The library memcpy() routine is especially tuned for high performance. It unrolls the loop to read for one cache line and then write, which avoids the problem. Using the smart copy, we were able to get a huge performance improvement. This also shows the folly of drawing conclusions from simple-minded benchmark programs.
This dates back to 1994, but it still illustrates how much better optimised the standard library functions are compared to anything you roll on your own. The loop case took around 7 seconds to run versus 1 second for the memcpy.
While memmove will be only slightly slower than memcpy, due to the assumptions it needs to make about the source and destination (in memcpy they cannot overlap), it should still be far superior to any standard loop.
Note that this does not affect complexity (as it's been pointed out by another poster). Complexity does not depend on having a bigger cache or an unrolled loop :)
As requested here are the code snippets (slightly changed):
#include <string.h>
#define DUMBCOPY for (i = 0; i < 65536; i++) destination[i] = source[i]
#define SMARTCOPY memcpy(destination, source, 65536)
int main()
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; j++)
DUMBCOPY; /* or put SMARTCOPY here instead */
return 0;
}
On my machine (32 bit, Linux Mint, GCC 4.6.3) I got the following times:
Using SMARTCOPY:
$ time ./a.out
real 0m0.002s
user 0m0.000s
sys 0m0.000s
Using DUMBCOPY:
$ time ./a.out
real 0m0.050s
user 0m0.036s
sys 0m0.000s
It all depends on your compiler and other implementation details. It is true that memmove can be implemented in some tricky super-optimized way. But at the same time a smart compiler might be able to figure out what your per-element copying code is doing and optimize it in the same (or very similar) way. Try it and see for yourself.
You cannot beat memcpy with a plain C implementation, because it is written in assembly with good algorithms.
If you write assembly code with a specific CPU in mind, and develop a good algorithm that takes the cache into account, you may have a chance.
Standard library functions are so well optimized that it is always better to use them.

Nested loop traversing arrays

There are two very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements in the second series. This can be traversed and processed with two nested loops. But the unpredictable number of matching elements for each member of the first series makes things very, very slow.
The actual processing of the 2nd series of elements involves logical and (&) and a population count.
I couldn't find good optimizations using C, but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then doing batch processing of the matching bytes of the second series, perhaps in buffers of 1MB or something. But the code would get quite messy.
Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!
Sample/demo code with simplified problem, first series are "people" and second series are "events", for clarity's sake. (the original problem is actually 100m and 10,000m entries!)
#include <stdio.h>
#include <stdint.h>
#define PEOPLE 1000000 // 1m
struct Person {
uint8_t age; // Filtering condition
uint8_t cnt; // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags
#define EVENTS 100000000 // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags
void init_arrays() {
for (int i = 0; i < PEOPLE; i++) { // just some stuff
P[i].age = i & 0x07;
P[i].cnt = i % 220; // assert( sum < EVENTS );
}
for (int i = 0; i < EVENTS; i++) {
P1[i] = i % 7; // just some stuff
P2[i] = i % 9; // just some other stuff
}
}
int main(int argc, char *argv[])
{
uint64_t sum = 0, fcur = 0;
int age_filter = 7; // just some
init_arrays(); // Init P, P1, P2
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
sum += __builtin_popcount( P1[fcur] & P2[fcur] );
else
fcur += P[p].cnt; // skip this person's events
printf("(dummy %ld %ld)\n", sum, fcur );
return 0;
}
gcc -O5 -march=native -std=c99 test.c -o test
Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:
uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
if (P[p].age < age_filter) {
int64_t e = P[p].cnt;
for ( ; e >= 8 ; e -= 8) {
sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
p1 += 8;
p2 += 8;
}
for ( ; e ; e--) {
sum += __builtin_popcount( *p1++ & *p2++ );
}
} else {
p1 += P[p].cnt;
p2 += P[p].cnt;
}
}
In my testing this speeds up your code from 1.515s to 0.855s.
The answer by Neil doesn't require sorting by age, which btw could be a good idea --
If the second loop has holes (please correct the original source code to support that idea), a common solution is to compute cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person:
sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
Anyway it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can anyway take place by calling the inner loop for all the consecutive people meeting the condition.
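A sketch of that prefix-sum idea in terms of the question's arrays (my illustration; the cumsum table is an addition, and at the question's sizes it costs 8 bytes per event, i.e. roughly 800 MB, so a narrower type may be preferable):
/* cumsum[n] = total popcount of P1[0..n-1] & P2[0..n-1], computed once */
static uint64_t cumsum[EVENTS + 1];
void build_cumsum(void) {
    cumsum[0] = 0;
    for (int64_t n = 0; n < EVENTS; n++)
        cumsum[n + 1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
}
/* Each person then costs O(1) regardless of event count:
     if (P[p].age < age_filter)
         sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
     fcur += P[p].cnt;                                     */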
If there are really at most 8 predicates, it could make sense to precalculate all the sums (popcounts of predicate[0..255]) for each person into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).
Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely just some SSE parallelism, 8 or 16 people at a time, will do.
A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. If the event tables are not very random, or if they do not consist of pathological functions such as the truth tables of bignum multiplication, then one may first achieve compression of the functions, and secondly arithmetic operations on the truth tables can be calculated in compressed form. Each subtree can be shared between users, and each arithmetic operation for two identical subtrees has to be calculated only once.
I don't know if your sample code accurately reflects your problem but it can be rewritten like this:
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
fcur += P[p].cnt;
for (int64_t e = 0; e < fcur ; e++)
sum += __builtin_popcount( P1[e] & P2[e] );
I don't know about gcc -O5 (it doesn't seem to be documented) and it seems to produce the exact same code as gcc -O3 with my gcc 4.5.4 (though I only tested on a relatively small code sample);
depending on what you want to achieve, -O3 can be slower than -O2.
As with your problem, I'd suggest thinking more about your data structure than the actual algorithm.
You should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't represented in a convenient manner.
If you want to quickly cut a large set of your data based on a single criterion (here, age in your example), I'd recommend using a variant of a sorted tree.
If your actual data (age, count, etc.) is indeed 8-bit, there is probably a lot of redundancy in the calculations. In this case you can replace the processing with lookup tables: for each 8-bit value there are only 256 possible outputs, and instead of computing it might be possible to read the precomputed data from a table.
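For instance, applied to the popcount in the question, a minimal sketch (my example; note that on CPUs with a hardware POPCNT instruction the builtin may already be as fast or faster):
/* popcnt8[v] = number of set bits in byte v */
static uint8_t popcnt8[256];
void init_popcnt8(void) {
    for (int v = 0; v < 256; v++)
        popcnt8[v] = (uint8_t)__builtin_popcount(v);
}
/* the inner loop then becomes: sum += popcnt8[P1[fcur] & P2[fcur]]; */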
To tackle the branch mispredictions (missing in other answers) the code could do something like:
#ifdef MISPREDICTIONS
if (cond)
sum += value
#else
mask = - (cond != 0); // cond false: mask = 0, binary 00..; cond true: mask = -1, binary 11..
sum += (value & mask); // adds value when cond is true, adds 0 otherwise
#endif
It's not completely free since there are data dependencies (think superscalar cpu). But it usually gets a 10x boost for mostly unpredictable conditions.

Optimizing a search algorithm in C

Can the performance of this sequential search algorithm (taken from The Practice of Programming) be improved using any of C's native utilities, e.g. if I set the i variable to be a register variable?
int lookup(char *word, char*array[])
{
int i;
for (i = 0; array[i] != NULL; i++)
if (strcmp(word, array[i]) == 0)
return i;
return -1;
}
Yes, but only very slightly. A much bigger performance improvement can be achieved by using better algorithms (for example keeping the list sorted and doing a binary search).
In general optimizing a given algorithm only gets you so far. Choosing a better algorithm (even if it's not completely optimized) can give you a considerable (order of magnitude) performance improvement.
I think it will not make much of a difference. The compiler will already optimize it in that direction.
Besides, the variable i does not have much impact; word stays constant throughout the function and the rest is too large to fit in any register. It is only a matter of how large the cache is and whether the whole array fits in it.
String comparisons are rather expensive computationally.
Can you perhaps use some kind of hashing for the array before searching?
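A minimal sketch of that idea (my illustration; the hash function choice and the precomputed hashes array are assumptions), rejecting most non-matches without calling strcmp:
#include <string.h>
/* djb2-style string hash, used here only as an example of a cheap hash */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}
/* hashes[i] must have been precomputed as hash_str(array[i]) */
int lookup_hashed(char *word, char *array[], unsigned long hashes[])
{
    unsigned long hw = hash_str(word);
    for (int i = 0; array[i] != NULL; i++)
        if (hashes[i] == hw && strcmp(word, array[i]) == 0)
            return i;
    return -1;
}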
There is a well-known technique called the sentinel method.
To use the sentinel method, you must know the length of array[].
You can remove the "array[i] != NULL" comparison by using a sentinel.
int lookup(char *word, char*array[], int array_len)
{
int i = 0;
array[array_len] = word;
for (;; ++i)
if (strcmp(word, array[i]) == 0)
break;
array[array_len] = NULL;
return (i != array_len) ? i : -1;
}
If you're reading TPOP, you will next see how they make this search many times faster with different data structures and algorithms.
But you can make things a bit faster by replacing things like
for (i = 0; i < n; ++i)
    foo(a[i]);
with
char **p = a;
for (i = 0; i < n; ++i)
    foo(*p);
    ++p;
If there is a known value at the end of the array (e.g. NULL) you can eliminate the loop counter:
for (p = a; *p != NULL; ++p)
    foo(*p)
Good luck, that's a great book!
To optimize that code the best bet would be to rewrite the strcmp routine since you are only checking for equality and don't need to evaluate the entire word.
Other than that you can't do much else. You can't sort as it appears you are looking for text within a larger text. Binary search won't work either since the text is unlikely to be sorted.
My 2p (C pseudocode):
wrd_end = wrd_ptr + wrd_len;
arr_end = arr_ptr - wrd_len;
while (arr_ptr < arr_end)
{
wrd_beg = wrd_ptr; arr_beg = arr_ptr;
while (wrd_ptr == arr_ptr)
{
wrd_ptr++; arr_ptr++;
if (wrd_ptr == wrd_end)
return wrd_beg;
}
wrd_ptr++;
}
Mark Harrison: Your for loop will never terminate! (++p is indented, but is not actually within the for :-)
Also, switching between pointers and indexing will generally have no effect on performance, nor will adding register keywords (as mat already mentions) -- the compiler is smart enough to apply these transformations where appropriate, and if you tell it enough about your CPU arch, it will do a better job of these than manual pseudo-micro-optimizations.
A faster way to match strings would be to store them Pascal style. If you don't need more than 255 characters per string, store them roughly like this, with the count in the first byte:
char s[] = "\x05Hello";
Then you can do:
for(i=0; i<len; ++i) {
s_len = strings[i][0];
if(
s_len == match_len
&& strings[i][s_len] == match[s_len-1]
&& 0 == memcmp(strings[i]+1, match, s_len-1)
) {
return 1;
}
}
And to get really fast, add memory prefetch hints for string start + 64, + 128 and the start of the next string. But that's just crazy. :-)
Another fast way to do it is to get your compiler to use a SSE2 optimized memcmp. Use fixed-length char arrays and align so the string starts on a 64-byte alignment. Then I believe you can get the good memcmp functions if you pass const char match[64] instead of const char *match into the function, or strncpy match into a 64,128,256,whatever byte array.
Thinking a bit more about this, these SSE2 match functions might be part of packages like Intel's and AMD's accelerator libraries. Check them out.
Realistically, setting i to be a register variable won't do anything that the compiler wouldn't do already.
If you are willing to spend some time upfront preprocessing the reference array, you should google "The World's Fastest Scrabble Program" and implement that. Spoiler: it's a DAG optimized for character lookups.
/* there is nothing quicker */
int lookup(char *word, char *array[])
{
    int i;
    for (i = 0; *array != NULL; i++, array++)
        if (strcmp(word, *array) == 0)
            return i;
    return -1;
}
