I just bought the book "C Interfaces and Implementations". In chapter one, it implements an "Atom" structure; sample code follows:
#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))
static struct atom {
struct atom *link;
int len;
char *str;
} *buckets[2048];
static unsigned long scatter[] = {
2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,
1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,
813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,
1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,
980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,
471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,
1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,
744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,
1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,
704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,
269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,
1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,
907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,
259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,
348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,
1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,
871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,
109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,
706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,
958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,
1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,
299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,
1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,
618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,
2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,
1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,
316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,
1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,
946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,
2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,
523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,
2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,
1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,
1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,
281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,
1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,
275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,
2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,
48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,
365326415, 790369079, 264348929, 513183458, 536647531, 13672163,
313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,
2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,
1884137923, 53392249, 1735424165, 1602280572
};
const char *Atom_new(const char *str, int len) {
unsigned long h;
int i;
struct atom *p;
assert(str);
assert(len >= 0);
for (h = 0, i = 0; i < len; i++)
h = (h<<1) + scatter[(unsigned char)str[i]];
h &= NELEMS(buckets)-1;
for (p = buckets[h]; p; p = p->link)
if (len == p->len) {
for (i = 0; i < len && p->str[i] == str[i]; )
i++;
if (i == len)
return p->str;
}
p = ALLOC(sizeof (*p) + len + 1); /* ALLOC is the book's Mem interface (a checked malloc) */
p->len = len;
p->str = (char *)(p + 1);
if (len > 0)
memcpy(p->str, str, len);
p->str[len] = '\0';
p->link = buckets[h];
buckets[h] = p; /* insert the new atom at the front of its list */
return p->str;
}
At the end of the chapter, in exercise 3.1, the author says:
"Most texts recommend using a prime number for the size of
buckets. Using a prime and a good hash function usually gives a
better distribution of the lengths of the lists hanging off of buckets.
Atom uses a power of two, which is sometimes explicitly cited
as a bad choice. Write a program to generate or read, say, 10,000
typical strings and measure Atom_new’s speed and the distribution
of the lengths of the lists. Then change buckets so that it has
2,039 entries (the largest prime less than 2,048), and repeat the
measurements. Does using a prime help? How much does your
conclusion depend on your specific machine?"
So I changed the hash table size to 2,039, but the prime actually seemed to give a worse distribution of list lengths. I also tried 64 and 61, and 61 gave a worse distribution too.
I just want to know why a prime table size gives a worse distribution here. Is it because the hash function used by Atom_new is a bad hash function?
I am using this function to print out the lengths of the atom lists:
#define B_SIZE 2048 /* keep in sync with the size of buckets */
void Atom_print(void)
{
int i,t;
struct atom *atom;
for(i= 0;i<B_SIZE;i++) {
t = 0;
for(atom=buckets[i];atom;atom=atom->link) {
++t;
}
printf("%d ",t);
}
}
Well, a long time ago I had to implement a hash table (in driver development), and I wondered about the same thing. Why the heck should I use a prime number? OTOH, a power of 2 seemed even better: instead of calculating the modulus, with a power of 2 you can use a bitwise AND.
So I implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Then, eventually, I noticed that only 1/4 of the entries in my hash table were ever filled. Because the hash function I used was the identity function, and it just so happened that all the returned pointers were multiples of 4.
The idea of using prime numbers for the hash table size is the following: real-world hash functions do not produce equally-distributed values; usually there is (or at least there may be) some dependency. So, in order to diffuse this distribution, it is recommended to use prime table sizes.
BTW, theoretically the hash function may occasionally produce numbers that are all multiples of your chosen prime. But the probability of this is lower than with a non-prime size.
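To make this concrete, here is a minimal sketch (the identity hash and the table sizes 64 and 61 are illustrative assumptions, not from the original driver): with keys that are all multiples of 4, a power-of-two table fills only a quarter of its buckets, while a prime-sized table uses all of them.

#include <stdio.h>

/* Identity-hash pointer-like keys (all multiples of 4) into a
 * power-of-two table vs. a prime-sized table. */
int main(void)
{
    int pow2[64] = {0}, prime[61] = {0};
    int used2 = 0, usedp = 0, i;
    unsigned long key;

    for (key = 0; key < 1000 * 4; key += 4) { /* 1000 keys, all multiples of 4 */
        pow2[key & 63]++;   /* power of two: key & (size-1) */
        prime[key % 61]++;  /* prime:        key % size     */
    }
    for (i = 0; i < 64; i++) if (pow2[i])  used2++;
    for (i = 0; i < 61; i++) if (prime[i]) usedp++;
    printf("power-of-two table: %d/64 buckets used\n", used2); /* prints 16/64 */
    printf("prime table:        %d/61 buckets used\n", usedp); /* prints 61/61 */
    return 0;
}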
I think the problem is the code that selects the bucket. In the code you pasted it says:
h &= NELEMS(buckets)-1;
That works fine for sizes which are powers of two, since its final effect is keeping the low bits of h. For other sizes, NELEMS(buckets)-1 has some bits equal to 0, and the bitwise & forces the corresponding bits of h to zero, effectively leaving "holes" in the bucket list: those buckets can never be selected.
The general formula for bucket selection is:
h = h % NELEMS(buckets);
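Applied to the book's code, the fix is a one-line change; a sketch, assuming buckets has already been resized to 2,039 entries:

/* In Atom_new, after computing the raw hash h: */
h %= NELEMS(buckets);     /* correct for any table size, prime or not */
/* instead of:
   h &= NELEMS(buckets)-1;    only correct when NELEMS(buckets) is a power of two */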
This is what Julienne Walker from Eternally Confuzzled has to say about hash table sizes:
When it comes to hash tables, the most
recommended table size is any prime
number. This recommendation is made
because hashing in general is
misunderstood, and poor hash functions
require an extra mixing step of
division by a prime to resemble a
uniform distribution. Another reason
that a prime table size is recommended
is because several of the collision
resolution methods require it to work.
In reality, this is a generalization
and is actually false (a power of two
with odd step sizes will typically
work just as well for most collision
resolution strategies), but not many
people consider the alternatives and
in the world of hash tables, prime
rules.
There's another factor at work here: the constant hashing values should all be odd (or prime) and widely dispersed. If the key to be hashed has an even number of units (characters, for instance), then all-odd constants give you an even initial hash value; for an odd number of units you get an odd value. I've done some experimenting with this, and just getting that 50/50 split was worth a lot in evening out the distribution. Of course, if all keys are equally long, this doesn't matter.
The hashing also needs to ensure that you won't get the same initial hash value for "AAB" as for "ABA" or "BAA".
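The book's hash already has this property, precisely because each character's contribution is shifted by a different amount. A quick sketch (reusing the scatter table from the question) makes it visible:

/* Each scatter value is shifted left once per remaining character,
 * so permutations of the same characters give different sums. */
unsigned long atom_hash(const char *s, int len)
{
    unsigned long h = 0;
    int i;
    for (i = 0; i < len; i++)
        h = (h << 1) + scatter[(unsigned char)s[i]];
    return h;
}
/* With a = scatter['A'] and b = scatter['B'], the three hashes expand to
 *   "AAB": 4a + 2a + b,  "ABA": 4a + 2b + a,  "BAA": 4b + 2a + a,
 * which are pairwise distinct whenever a != b. */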
A microcontroller has the job of sampling ADC values (analog-to-digital conversion). Since these parts are affected by tolerance and noise, the accuracy can be significantly increased by deleting the 4 worst values. The find-and-delete takes time, which is not ideal, since it increases the cycle time.
Imagine a frequency of 100 MHz, so each software instruction takes 10 ns to process; the more instructions, the longer the controller is blocked from taking the next set of samples.
So my goal is to do this selection process as fast as possible. For it I currently use the code below, but it only deletes the two worst values!
uint16_t getValue(void){
    /* adcval[8] is assumed to hold the 8 raw samples */
    uint16_t min = 16383;  /* 14-bit full scale */
    uint16_t max = 1;      /* zero is physically almost impossible! */
    uint32_t sum = 0;      /* accumulator for the summing */
    for(uint8_t i = 0; i < 8; i++){
        if(adcval[i] > max) max = adcval[i];
        if(adcval[i] < min) min = adcval[i];
        sum = sum + adcval[i];
    }
    uint16_t result = (sum - max - min) / 6; /* remove the two worst and divide by 6 */
    return result;
}
Now I would like to extend this function to delete the 4 worst values out of the 8 samples to get more precision. Any advice on how to do this?
Additionally, it would be wonderful to have an efficient function that finds the most deviating values, instead of just the highest and lowest. For example, imagine these two arrays:
uint16_t adc1[8] = {5,6,10,11,11,12,20,22};
uint16_t adc2[8] = {5,6,7,7,10,11,15,16};
The first case would gain precision from the described mechanism (delete the 4 worst). But in the second case the values 5 and 6 as well as 15 and 16 would be deleted, which would theoretically make the calculation worse, since deleting 10, 11, 15 and 16 would be better. Is there any fast way of deleting the 4 most deviating values?
If your ADC is returning values from 5 to 16 (14 bits, 3.3 V reference), the input only varies from about 1 mV to 3 mV. It is very likely that the readings are correct; it is very difficult to design a good input circuit for a 14-bit ADC.
It is better to use a running average. What is a running average? It is a software low-pass filter.
(Figure: blue shows the raw ADC readings, red the running average. A second figure shows a very-low-amplitude sine wave, 9-27 mV, again assuming 14 bits and a 3.3 V reference.)
The algorithm:
static int average;   /* accumulator, kept scaled by 'level' */

int running_average(int val, int level)
{
    average -= average / level;
    average += val;
    return average / level;
}

void init_average(int val, int level)
{
    average = val * level;
}
If the level is a power of two, the divisions can be replaced with shifts (here 'level' is the shift amount, so the effective window is 2^level samples). This version needs only 6 instructions (no branches) to calculate the average:
static int average;   /* accumulator, kept scaled by 1 << level */

int running_average(int val, int level)
{
    average -= average >> level;
    average += val;
    return average >> level;
}

void init_average(int val, int level)
{
    average = val << level;
}
I assume that average will not overflow; if it can, you need to choose a larger type.
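For example, the 8-sample block from the question could be fed straight through the filter; a sketch, assuming the adcval buffer from the question and a level of 3 (a 2^3 = 8 sample window):

/* Sketch: filter the sample stream instead of sorting each block. */
uint16_t getValue(void)
{
    static int initialized = 0;
    int avg = 0;
    if (!initialized) {            /* preload so the filter starts settled */
        init_average(adcval[0], 3);
        initialized = 1;
    }
    for (uint8_t i = 0; i < 8; i++)
        avg = running_average(adcval[i], 3);
    return (uint16_t)avg;
}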
This answer is kind of off-topic, as it recommends a hardware solution, but if performance is required and the MCU can't implement P__J__'s solution, then this is your next best thing.
It seems you want to remove noise from your input signal. This can be done in software using DSP (digital signal processing), but it can also be done by configuring your hardware differently.
By adding the proper filter in the proper place before your ADC, it is possible to remove much of the (outside) noise from the ADC output. (You can't, of course, go below a certain amount that is innate to the ADC, but alas.)
There are several Q&As on electronics.stackexchange.com.
One solution is adding a capacitor to filter some high-frequency noise, as noted by DerStorm8.
The Photon has another great solution here, suggesting RC, Sallen-Key, and cascaded Sallen-Key filters for continuous-signal filtering.
Here (ADN007) is an Analog Design Note from Microchip on "Techniques that Reduce System Noise in ADC Circuits":
It may seem that designing a low noise, 12-bit Analog-to-Digital Converter (ADC) board or even a 10-bit board is easy. This is true, unless one ignores the basics of low noise design. For instance, one would think that most amplifiers and resistors work effectively in 12-bit or 10-bit environments. However, poor device selection becomes a major factor in the success or failure of the circuit. Another, often ignored, area that contributes a great deal of noise, is conducted noise. Conducted noise is already in the circuit board by the time the signal arrives at the input of the ADC. The most effective way to remove this noise is by using a low-pass (anti-aliasing) filter prior to the ADC. Including by-pass capacitors and using a ground plane will also eliminate this type of noise. A third source of noise is radiated noise. The major sources of this type of noise are Electromagnetic Interference (EMI) or capacitive coupling of signals from trace-to-trace.
If all three of these issues are addressed, then it is true that designing a low noise 12-bit ADC board is easy.
And their recommended solution path:
It is easy to design a true 12-bit ADC system by using a few key low noise guidelines. First, examine your devices (resistors and amplifiers) to make sure they are low noise. Second, use a ground plane whenever possible. Third, include a low-pass filter in the signal path if you are changing the signal from analog to digital. Finally, and always, include by-pass capacitors. These capacitors not only remove noise but also foster circuit stability.
Here is a good paper by Analog Devices on input noise. They note that "there are some instances where input noise can actually be helpful in achieving higher resolution."
All analog-to-digital converters (ADCs) have a certain amount of input-referred noise—modeled as a noise source connected in series with the input of a noise-free ADC. Input-referred noise is not to be confused with quantization noise, which is only of interest when an ADC is processing time-varying signals. In most cases, less input noise is better; however, there are some instances where input noise can actually be helpful in achieving higher resolution. If this doesn’t seem to make sense right now, read on to find out how some noise can be good noise.
Given that you have a fixed-size array, a hard-coded sorting network should be able to sort the entire array with only 19 comparisons. Currently you have 8 + 2*8 = 24 comparisons (the loop condition plus two tests per iteration), although the compiler may unroll the loop, leaving you with 16. It is conceivable that, depending on the microcontroller hardware, a sorting network can be implemented with some degree of parallelism. Perhaps you also have to query the ADC values sequentially, which would give you the opportunity to pre-sort them while waiting for the next conversion.
An optimal sorting network should be searchable online. Wikipedia has some pointers.
So, you would end up with some code like this:
sort_data(adcval);
return (adcval[2]+adcval[3]+adcval[4]+adcval[5])/4;
Update:
As you can see in this picture (source) of optimal sorting networks, a complete sort takes 19 comparisons. However, 3 of those are not strictly needed if you only want to extract the middle 4 values, so you get down to 16 comparisons.
to delete the 4 worst values out of the 8 samples
The methods are described in the GeeksforGeeks article "k largest (or smallest) elements in an array"; you can implement whichever method suits you best.
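If you don't need a full sort, a partial selection sort is enough here; a sketch (assuming "4 worst" means the 2 lowest plus the 2 highest samples, as in the sorting-network answer):

#include <stdint.h>

/* Average the middle 4 of 8 samples: two selection passes pull the
 * two smallest to the front, two more push the two largest to the
 * back, and the remaining middle four are averaged. */
uint16_t middle4_average(uint16_t s[8])
{
    for (uint8_t k = 0; k < 2; k++)          /* two smallest -> s[0], s[1] */
        for (uint8_t i = k + 1; i < 8; i++)
            if (s[i] < s[k]) { uint16_t t = s[k]; s[k] = s[i]; s[i] = t; }
    for (uint8_t k = 7; k > 5; k--)          /* two largest  -> s[7], s[6] */
        for (uint8_t i = 2; i < k; i++)
            if (s[i] > s[k]) { uint16_t t = s[k]; s[k] = s[i]; s[i] = t; }
    uint32_t sum = (uint32_t)s[2] + s[3] + s[4] + s[5];
    return (uint16_t)(sum / 4);
}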
I decided to use this good site to generate the optimal sorting network, with the SWAP() macros needed to sort an array of 8 elements. Then I created a small C program to test my sorting function on arrangements of an 8-element array. Then, because we only care about groups of 4 elements, I did something brute force: for each SWAP() macro, I tried commenting it out and checked whether the program still succeeded. I could comment out 5 SWAP macros, leaving 14 comparisons needed to separate the smallest 4 elements of the 8 samples.
/**
* Sorts the array, but only so that groups of 4 matter.
* So group of 4 smallest elements and 4 biggest elements
* will be sorted ok.
* s[0]...s[3] will have lowest 4 elements
* so they have to be "deleted"
* s[4]...s[7] will have the highest 4 values
*/
void sort_but_4_matter(int s[8]) {
#define SWAP(x, y) do { \
if (s[x] > s[y]) { \
const int t = s[x]; \
s[x] = s[y]; \
s[y] = t; \
} \
} while(0)
SWAP(0, 1);
//SWAP(2, 3);
SWAP(0, 2);
//SWAP(1, 3);
//SWAP(1, 2);
SWAP(4, 5);
SWAP(6, 7);
SWAP(4, 6);
SWAP(5, 7);
//SWAP(5, 6);
SWAP(0, 4);
SWAP(1, 5);
SWAP(1, 4);
SWAP(2, 6);
SWAP(3, 7);
//SWAP(3, 6);
SWAP(2, 4);
SWAP(3, 5);
SWAP(3, 4);
#undef SWAP
}
/* -------- testing code */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int cmp_int(const void *a, const void *b) {
return *(const int*)a - *(const int*)b;
}
void printit_arr(const int *arr, size_t n) {
printf("{");
for (size_t i = 0; i < n; ++i) {
printf("%d", arr[i]);
if (i != n - 1) {
printf(" ");
}
}
printf("}");
}
void printit(const char *pre, const int arr[8],
const int in[8], const int res[4]) {
printf("%s: ", pre);
printit_arr(arr, 8);
printf(" ");
printit_arr(in, 8);
printf(" ");
printit_arr(res, 4);
printf("\n");
}
int err = 0;
void test(const int arr[8], const int res[4]) {
int in[8];
memcpy(in, arr, sizeof(int) * 8);
sort_but_4_matter(in);
// sort for memcmp below
qsort(in, 4, sizeof(int), cmp_int);
if (memcmp(in, res, sizeof(int) * 4) != 0) {
printit("T", arr, in, res);
err = 1;
}
}
void test_all_combinations() {
const int result[4] = { 0, 1, 2, 3 }; // sorted
const size_t n = 8;
int num[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
for (size_t j = 0; j < n; j++) {
for (size_t i = 0; i < n-1; i++) {
int temp = num[i];
num[i] = num[i+1];
num[i+1] = temp;
test(num, result);
}
}
}
int main() {
test_all_combinations();
return err;
}
Tested on godbolt. With gcc -O2 on x86_64, sort_but_4_matter compiles to fewer than 100 instructions.
I need a 'good' way to initialize the pseudo-random number generator in C++. I've found an article that states:
In order to generate random-like numbers, srand is usually initialized to some distinctive value, like those related with the execution time. For example, the value returned by the function time (declared in header ctime) is different each second, which is distinctive enough for most randoming needs.
Unix time isn't distinctive enough for my application. What's a better way to initialize this? Bonus points if it's portable, but the code will primarily be running on Linux hosts.
I was thinking of doing some pid/unixtime math to get an int, or possibly reading data from /dev/urandom.
Thanks!
EDIT
Yes, I am actually starting my application multiple times a second and I've run into collisions.
This is what I've used for small command line programs that can be run frequently (multiple times a second):
unsigned long seed = mix(clock(), time(NULL), getpid());
Where mix is:
// Robert Jenkins' 96-bit mix function (the call site above needs
// <time.h> and <unistd.h> for clock(), time() and getpid())
unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
{
a=a-b; a=a-c; a=a^(c >> 13);
b=b-c; b=b-a; b=b^(a << 8);
c=c-a; c=c-b; c=c^(b >> 13);
a=a-b; a=a-c; a=a^(c >> 12);
b=b-c; b=b-a; b=b^(a << 16);
c=c-a; c=c-b; c=c^(b >> 5);
a=a-b; a=a-c; a=a^(c >> 3);
b=b-c; b=b-a; b=b^(a << 10);
c=c-a; c=c-b; c=c^(b >> 15);
return c;
}
The best answer is to use <random>. If you are using a pre-C++11 compiler, you can look at the Boost random number library.
But if we are talking about rand() and srand(), the simplest way is just to use time():
int main()
{
srand(time(nullptr));
...
}
Be sure to do this at the beginning of your program, and not every time you call rand()!
Side Note:
NOTE: There is a discussion in the comments below about this being insecure (which is true, but ultimately not relevant; read on). An alternative is to seed from the random device /dev/random (or some other more genuinely random source). BUT: don't let this lull you into a false sense of security. This is rand() we are using. Even if you seed it with a brilliantly generated seed, it is still predictable: from any one value you can predict the full sequence of subsequent values. This is only useful for generating "pseudo" random values.
If you want "secure", you should probably be using <random> (though I would do some more reading on a security-informed site). See this answer as a starting point for something better: https://stackoverflow.com/a/29190957/14065
Secondary note: the random device actually solves the multiple-copies-per-second issue better than my original suggestion below (just not the security issue).
Back to the original story:
Every time you start up, time() will return a unique value (unless you start the application multiple times a second). On 32-bit systems, it will only repeat every 68 years or so.
I know you don't think time is unique enough, but I find that hard to believe. But I have been known to be wrong.
If you are starting a lot of copies of your application simultaneously, you could use a timer with a finer resolution, but then you run the risk of a shorter period before the seed value repeats.
OK, so if you really think you are starting multiple applications a second, then use a finer grain on the timer:
int main()
{
struct timeval time;
gettimeofday(&time,NULL);
// microsecond has 1 000 000
// Assuming you did not need quite that accuracy
// Also do not assume the system clock has that accuracy.
srand((time.tv_sec * 1000) + (time.tv_usec / 1000));
// The trouble here is that the seed will repeat every
// 24 days or so.
// If you use 100 (rather than 1000) the seed repeats every 248 days.
// Do not make the MISTAKE of using just the tv_usec
// This will mean your seed repeats every second.
}
If you need a better random number generator, don't use the libc rand(). Instead, use something like /dev/random or /dev/urandom directly (read an int directly from it, or something like that).
The only real benefit of the libc rand() is that, given a seed, it is predictable, which helps with debugging.
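Reading a seed's worth of bytes from /dev/urandom takes only a few lines; a minimal sketch:

#include <stdio.h>

/* Sketch: pull one unsigned int of kernel entropy from /dev/urandom. */
unsigned int urandom_seed(void)
{
    unsigned int seed = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(&seed, sizeof seed, 1, f) != 1)
            seed = 0;   /* read failed; caller may fall back to time() */
        fclose(f);
    }
    return seed;
}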
On windows:
srand(GetTickCount());
provides a better seed than time(), since it's in milliseconds.
C++11 random_device
If you need reasonable quality then you should not be using rand() in the first place; you should use the <random> library. It provides lots of great functionality like a variety of engines for different quality/size/performance trade-offs, re-entrancy, and pre-defined distributions so you don't end up getting them wrong. It may even provide easy access to non-deterministic random data, (e.g., /dev/random), depending on your implementation.
#include <random>
#include <iostream>
int main() {
std::random_device r;
std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()};
std::mt19937 eng(seed);
std::uniform_int_distribution<> dist{1,100};
for (int i=0; i<50; ++i)
std::cout << dist(eng) << '\n';
}
eng is a source of randomness, here a built-in implementation of the Mersenne Twister. We seed it using random_device, which in any decent implementation will be a non-deterministic RNG, and seed_seq, to combine more than 32 bits of random data. For example, in libc++, random_device accesses /dev/urandom by default (though you can give it another file to access instead).
Next we create a distribution such that, given a source of randomness, repeated calls to the distribution will produce a uniform distribution of ints from 1 to 100. Then we proceed to using the distribution repeatedly and printing the results.
The best way is to use another pseudorandom number generator. The Mersenne Twister (and Wichmann-Hill) is my recommendation.
http://en.wikipedia.org/wiki/Mersenne_twister
I suggest you look at the unix_random.c file in the Mozilla code (I guess it is in mozilla/security/freebl/...); it should be in the freebl library.
There it uses system call info (like pwd, netstat, ...) to generate noise for the random number, and it is written to support most of the platforms (which can gain me bonus points :D).
The real question you must ask yourself is what quality of randomness you need.
libc random is an LCG (linear congruential generator); the quality of its randomness will be low whatever input you provide srand with.
If you simply need to make sure that different instances will have different initializations, you can mix the process id (getpid), the thread id, and a timer. Mix the results with xor. The entropy should be sufficient for most applications.
Example :
struct timeb tp;
ftime(&tp);
srand(static_cast<unsigned int>(getpid()) ^
static_cast<unsigned int>(pthread_self()) ^
static_cast<unsigned int >(tp.millitm));
For better random quality, use /dev/urandom. You can make the above code portable by using boost::thread and boost::date_time.
The C++11 version of the top-voted post by Jonathan Wright:
#include <ctime>
#include <random>
#include <thread>
...
const auto time_seed = static_cast<size_t>(std::time(0));
const auto clock_seed = static_cast<size_t>(std::clock());
const size_t pid_seed =
std::hash<std::thread::id>()(std::this_thread::get_id());
std::seed_seq seed_value { time_seed, clock_seed, pid_seed };
...
// E.g seeding an engine with the above seed.
std::mt19937 gen;
gen.seed(seed_value);
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("%ld\n", (long)tv.tv_usec);
    return 0;
}
tv.tv_usec is in microseconds. This should be an acceptable seed.
As long as your program is only running on Linux (and is an ELF executable), you are guaranteed that the kernel provides your process with a unique random seed in the ELF aux vector. The kernel gives you 16 random bytes, different for each process, which you can get with getauxval(AT_RANDOM). To seed srand, use just an int's worth of them, like so:
#include <sys/auxv.h>
void initrand(void)
{
unsigned int *seed;
seed = (unsigned int *)getauxval(AT_RANDOM);
srand(*seed);
}
It may be possible that this also translates to other ELF-based systems. I'm not sure what aux values are implemented on systems other than Linux.
Suppose you have a function with a signature like:
int foo(char *p);
An excellent source of entropy for a random seed is a hash of the following:
Full result of clock_gettime (seconds and nanoseconds) without throwing away the low bits - they're the most valuable.
The value of p, cast to uintptr_t.
The address of p, cast to uintptr_t.
At least the third, and possibly also the second, derives entropy from the system's ASLR, if available (the initial stack address, and thus the current stack address, is somewhat random).
I would also avoid using rand/srand entirely, both for the sake of not touching global state and so you can have more control over the PRNG that's used. But the above procedure is a good (and fairly portable) way to get some decent entropy without a lot of work, regardless of which PRNG you use.
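A sketch of that hash, with an arbitrary multiply/xor mixer standing in for whatever integer hash you prefer:

#include <stdint.h>
#include <time.h>

/* Sketch: mix clock_gettime with the value and the address of p.
 * The constant is an arbitrary odd 64-bit multiplier, not a standard. */
unsigned int seed_from(char *p)
{
    struct timespec ts;
    uint64_t h;

    clock_gettime(CLOCK_REALTIME, &ts);
    h  = (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
    h ^= (uintptr_t)p;           /* value of p   */
    h *= 0x9e3779b97f4a7c15u;    /* diffuse bits */
    h ^= (uintptr_t)&p;          /* address of p */
    h *= 0x9e3779b97f4a7c15u;
    return (unsigned int)(h ^ (h >> 32));
}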
For those using Visual Studio here's yet another way:
#include "stdafx.h"
#include <time.h>
#include <windows.h>
const __int64 DELTA_EPOCH_IN_MICROSECS= 11644473600000000;
struct timezone2
{
__int32 tz_minuteswest; /* minutes W of Greenwich */
bool tz_dsttime; /* type of dst correction */
};
struct timeval2 {
__int32 tv_sec; /* seconds */
__int32 tv_usec; /* microseconds */
};
int gettimeofday(struct timeval2 *tv/*in*/, struct timezone2 *tz/*in*/)
{
FILETIME ft;
__int64 tmpres = 0;
TIME_ZONE_INFORMATION tz_winapi;
int rez = 0;
ZeroMemory(&ft, sizeof(ft));
ZeroMemory(&tz_winapi, sizeof(tz_winapi));
GetSystemTimeAsFileTime(&ft);
tmpres = ft.dwHighDateTime;
tmpres <<= 32;
tmpres |= ft.dwLowDateTime;
/*converting file time to unix epoch*/
tmpres /= 10; /*convert into microseconds*/
tmpres -= DELTA_EPOCH_IN_MICROSECS;
tv->tv_sec = (__int32)(tmpres * 0.000001);
tv->tv_usec = (tmpres % 1000000);
//_tzset() doesn't work properly, so we use GetTimeZoneInformation
rez = GetTimeZoneInformation(&tz_winapi);
tz->tz_dsttime = (rez == 2) ? true : false;
tz->tz_minuteswest = tz_winapi.Bias + ((rez == 2) ? tz_winapi.DaylightBias : 0);
return 0;
}
int main(int argc, char** argv) {
struct timeval2 tv;
struct timezone2 tz;
ZeroMemory(&tv, sizeof(tv));
ZeroMemory(&tz, sizeof(tz));
gettimeofday(&tv, &tz);
unsigned long seed = tv.tv_sec ^ (tv.tv_usec << 12);
srand(seed);
}
Maybe a bit overkill, but it works well for quick intervals. The gettimeofday function was found here.
Edit: upon further investigation, rand_s might be a good alternative for Visual Studio. It's not just a safe rand(): it's totally different and doesn't use the seed from srand. I had presumed it was almost identical to rand, just "safer".
To use rand_s, just don't forget to #define _CRT_RAND_S before stdlib.h is included.
Assuming that the randomness of srand() + rand() is enough for your purposes, the trick is in selecting the best seed for srand. time(NULL) is a good starting point, but you'll run into problems if you start more than one instance of the program within the same second. Adding the pid (process id) is an improvement, as different instances will get different pids. I would multiply the pid by a factor to spread them out more.
But let's say you are using this for some embedded device and you have several in the same network. If they are all powered at once and you are launching the several instances of your program automatically at boot time, they may still get the same time and pid and all the devices will generate the same sequence of "random" numbers. In that case, you may want to add some unique identifier of each device (like the CPU serial number).
The proposed initialization would then be:
srand(time(NULL) + 1000 * getpid() + (uint) getCpuSerialNumber());
In a Linux machine (at least in the Raspberry Pi where I tested this), you can implement the following function to get the CPU Serial Number:
// Gets the CPU Serial Number as a 64 bit unsigned int. Returns 0 if not found.
uint64_t getCpuSerialNumber() {
FILE *f = fopen("/proc/cpuinfo", "r");
if (!f) {
return 0;
}
char line[256];
uint64_t serial = 0;
while (fgets(line, 256, f)) {
if (strncmp(line, "Serial", 6) == 0) {
serial = strtoull(strchr(line, ':') + 2, NULL, 16);
}
}
fclose(f);
return serial;
}
Include the <cstdlib> and <ctime> headers at the top of your program, and write
srand(time(NULL));
before you generate your random number. Here is an example of a program that prints a random number between one and ten:
#include <iostream>
#include <cstdlib> // rand, srand
#include <ctime>   // time
using namespace std;
int main()
{
//Initialize srand
srand(time(NULL));
//Create random number
int n = rand() % 10 + 1;
//Print the number
cout << n << endl; //End the line
//The main function is an int, so it must return a value
return 0;
}
I'm working on an MC68HC11 microcontroller and have an analogue voltage signal going in that I have sampled. The scenario is a weighing machine: the large peaks are when the object hits the sensor; the signal then stabilises (those are the samples I want) and peaks again before the object rolls off.
The problem I'm having is figuring out a way for the program to detect this stable region and average it to produce an overall weight. One way I have thought of is comparing previous values to see whether there is a large difference between them, but I haven't had any success with it. Below is the C code that I am using:
#include <stdio.h>
#include <stdarg.h>
#include <iof1.h>
void main(void)
{
/* PORTA, DDRA, DDRG etc... are LEDs and switch ports */
unsigned char *paddr, *adctl, *adr1;
unsigned short i = 0;
unsigned short k = 0;
unsigned char switched = 1; /* is char the smallest data type? */
unsigned char data[2000];
DDRA = 0x00; /* All in */
DDRG = 0xff;
adctl = (unsigned char*) 0x30;
adr1 = (unsigned char*) 0x31;
*adctl = 0x20; /* single continuous scan */
while(1)
{
if(*adr1 > 40)
{
if(PORTA == 128) /* Debugging switch */
{
PORTG = 1;
}
else
{
PORTG = 0;
}
if(i < 2000)
{
while(((*adctl) & 0x80) == 0x00)
    ; /* wait for the conversion-complete flag (CCF) */
data[i] = *adr1;
/* if(i > 10 && (data[(i-10)] - data[i]) < 20) */
i++;
}
if(PORTA == switched)
{
PORTG = 31;
/* Print a delimiter so teemtalk can send to excel */
for(k=0;k<2000;k++)
{
printf("%d,",data[k]);
}
if(switched == 1) /*bitwise manipulation more efficient? */
{
switched = 0;
}
else
{
switched = 1;
}
PORTG = 0;
}
if(i >= 2000)
{
i = 0;
}
}
}
}
I look forward to hearing any suggestions :)
(Figure: a graph of the sampled values; the red box marks the stable region I would like to identify.)
As your sample sequence has glitches (short-lived transients), first try to improve the hardware, i.e. change the layout, add decoupling, add filtering, etc.
If that approach fails, use a median filter [1], say five places long, which takes the last five samples, sorts them, and outputs the middle one. Two samples of a transient then have no effect on its output (seven places tolerate three transient samples).
Then apply a computationally efficient exponential-averaging low-pass filter [2]:
y(n) = y(n-1) + alpha * (x(n) - y(n-1))
choosing alpha = 1/2^k (so the division becomes a right shift) to yield a time constant [3] shorter than the underlying response (~50 samples) while still filtering out the noise. Increasing the effective number of fractional bits avoids quantizing issues.
With this improved sample sequence, thresholds and cycle counts can be applied to detect quiescent durations.
Additionally, if the end of the quiescent period is always followed by a large, abrupt change, then a sample-delay "array" lets you detect the abrupt change while still having the last of the quiescent samples available for logging.
[1] http://en.wikipedia.org/wiki/Median_filter
[2] http://www.dsprelated.com/showarticle/72.php
[3] http://en.wikipedia.org/wiki/Time_constant
Note
Adding code for the above filtering operations will lower the maximum possible sample rate, but printf can be substituted with something faster.
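A sketch of the two stages described above, assuming 16-bit samples (median5 expects the last five samples; lowpass keeps its state scaled by 2^k for fractional precision):

#include <stdint.h>
#include <string.h>

/* Stage 1: median of the last five samples (rejects up to two transients). */
uint16_t median5(const uint16_t h[5])
{
    uint16_t s[5];
    memcpy(s, h, sizeof s);
    for (int i = 1; i < 5; i++) {        /* tiny insertion sort */
        uint16_t v = s[i];
        int j = i;
        while (j > 0 && s[j-1] > v) { s[j] = s[j-1]; j--; }
        s[j] = v;
    }
    return s[2];
}

/* Stage 2: y(n) = y(n-1) + alpha*(x(n) - y(n-1)) with alpha = 1/2^k. */
static int32_t y_scaled;                 /* y kept scaled by 2^k */

uint16_t lowpass(uint16_t x, int k)
{
    y_scaled += (((int32_t)x << k) - y_scaled) >> k;
    return (uint16_t)(y_scaled >> k);
}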
Continuously store the current value and the delta from the previous value.
Note when the delta is decreasing: that marks the start of weight application to the scale.
Note when the delta is increasing: that marks the end of weight application to the scale.
Take the X values with small deltas in between and average them. (A sketch of these steps follows.)
BTW, I'm sure this has been done a million times before; I'm thinking a search for "scale PID" or "weight PID" would find a lot of information.
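A minimal sketch of the delta-based window; the threshold and window size are placeholders to tune against real data:

#include <stdint.h>

#define DELTA_QUIET 3   /* placeholder: max |delta| still counted as stable  */
#define STABLE_N    32  /* placeholder: consecutive stable samples required  */

/* Feed samples one at a time; returns 1 and stores the averaged
 * weight once STABLE_N consecutive small deltas have accumulated. */
int stable_weight(uint16_t sample, uint16_t *weight)
{
    static uint16_t prev;
    static uint32_t sum;
    static uint16_t count;
    int16_t delta = (int16_t)(sample - prev);

    prev = sample;
    if (delta > -DELTA_QUIET && delta < DELTA_QUIET) {
        sum += sample;
        if (++count == STABLE_N) {
            *weight = (uint16_t)(sum / STABLE_N);
            sum = 0;
            count = 0;
            return 1;
        }
    } else {
        sum = 0;        /* movement detected: restart the window */
        count = 0;
    }
    return 0;
}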
Don't forget to use a __delay_ms(XX) call somewhere between readings if you will compare each value with the previous one; if the code loops continuously, the difference at each step will obviously be small.
Looking at your nice graphs, I would say you should look only for the falling edge; it is much more consistent than the leading edge.
In other words: let the samples accumulate, calculate a running average over a predefined window the whole time, remember the deviation of the previous values just for reference, and check for a large negative bump in the values (say, an absolute value ten times smaller than the current running average); your running average at that point is your result. You could go back a little (disregarding the last few values in the average and recalculating) to compensate for the small positive bump visible in your picture before each negative bump. No need for heavy math here: you cannot model reality better than your picture has shown. Just make sure your code detects the end of each and every weighing, and sample fast enough that no negative bump is missed (or you will have a large error in your averaged data).
And you don't need such large arrays: a running average is better based on a smaller window size, which also gives a smaller residual error when you detect the negative bump.
Any ideas why it works fine for values like 0, 1, 2, 3, 4..., and seg faults for values > 15?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *fib(void *fibToFind);
int main(void){
pthread_t mainthread;
long fibToFind = 15;
long finalFib;
pthread_create(&mainthread,NULL,fib,(void*) fibToFind);
pthread_join(mainthread,(void*)&finalFib);
printf("The number is: %d\n",finalFib);
}
void *fib(void *fibToFind){
long retval;
long newFibToFind = ((long)fibToFind);
long returnMinusOne;
long returnMinustwo;
pthread_t minusone;
pthread_t minustwo;
if(newFibToFind == 0 || newFibToFind == 1)
return newFibToFind;
else{
long newFibToFind1 = ((long)fibToFind) - 1;
long newFibToFind2 = ((long)fibToFind) - 2;
pthread_create(&minusone,NULL,fib,(void*) newFibToFind1);
pthread_create(&minustwo,NULL,fib,(void*) newFibToFind2);
pthread_join(minusone,(void*)&returnMinusOne);
pthread_join(minustwo,(void*)&returnMinustwo);
return returnMinusOne + returnMinustwo;
}
}
Does it run out of memory (space for stacks), or out of valid thread handles?
You're asking for an awful lot of threads, and each requires its own stack/context.
Windows (and Linux) have a stupid "big [contiguous] stack" idea.
From the documentation on pthread_create:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes."
If you manufacture 10,000 threads, you need 20 GB of RAM.
I built a version of the OP's program, and it bombed with some 3,500 (p)threads on Windows XP64.
See this SO thread for more details on why big stacks are a really bad idea:
Why are stack overflows still a problem?
If you give up on big stacks and implement a parallel language with heap allocation for activation records (our PARLANSE is one of these), the problem goes away.
Here's the first (sequential) program we wrote in PARLANSE:
(define fibonacci_argument 45)
(define fibonacci
(lambda(function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? 1)
?
(+ (fibonacci (-- ?))
(fibonacci (- ? 2))
)+
)ifthenelse
)lambda
)define
Here's an execution run on an i7:
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
Result: 1134903170
Here's the second, which is parallel:
(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work
(define parallel_fibonacci
(lambda (function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? coarse_grain_threshold)
(fibonacci ?)
(let (;; [n natural ] [m natural ] )
(value (|| (= m (parallel_fibonacci (-- ?)) )=
(= n (parallel_fibonacci (- ? 2)) )=
)||
(+ m n)
)value
)let
)ifthenelse
)lambda
)define
Making the parallelism explicit makes the programs a lot easier to write, too.
The parallel version we test by calling (parallel_fibonacci 45). Here
is the execution run on the same i7 (which arguably has 8 processors,
but it is really 4 processors hyperthreaded so it really isn't quite 8
equivalent CPUs):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170
A speedup near 6x; not bad for not-quite-8 processors. One of the other answers to this question ran the pthreads version: it took "a few seconds" (to blow up) computing Fib(18), whereas this takes 5.5 seconds for Fib(45). This tells you pthreads is a fundamentally bad way to do lots of fine-grain parallelism, because it has really, really high forking overhead. (PARLANSE is designed to minimize that forking overhead.)
Here's what happens if you set the technology constant to zero (forks on every call
to fib):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170
You can see that amortizing fork overhead is a good idea, even if you have fast forks.
Fib(45) produces a lot of grains. Heap allocation
of activation records solves the OP's first-order problem (thousands of pthreads each
with 1Mb of stack burns gigabytes of RAM).
But there's a second-order problem: 2^45 PARLANSE "grains" will burn all your memory too, just keeping track of the grains, even if your grain control block is tiny.
So it helps to have a scheduler that throttles forks once you have "a lot" (for some definition of "a lot" significantly less than 2^45) of grains, to prevent the explosion of parallelism from swamping the machine with grain-tracking data structures.
It has to unthrottle forks when the number of grains falls below a threshold
too, to make sure there is always lots of logical, parallel work for the physical
CPUs to do.
You are not checking for errors, in particular from pthread_create(). When pthread_create() fails, the pthread_t variable is left undefined, and the subsequent pthread_join() may crash.
If you do check for errors, you will find that pthread_create() is failing. This is because you are trying to generate almost 2,000 threads; with default settings, this would require 16 GB of thread stacks to be allocated alone.
You should revise your algorithm so that it does not generate so many threads.
I tried to run your code, and came across several surprises:
printf("The number is: %d\n", finalFib);
This line has a small error: %d means printf expects an int, but it is passed a long int. On most platforms these are the same, or the call will have the same behavior anyway, but pedantically speaking (or if you just want to stop the warning from coming up, which is a very noble ideal too), you should use %ld instead, which expects a long int.
Your fib function, on the other hand, seems non-functional. Testing it on my machine, it doesn't crash, but it yields 1047, which is not a Fibonacci number. Looking closer, it seems your program is incorrect in several respects:
void *fib(void *fibToFind)
{
long retval; // retval is never used
long newFibToFind = ((long)fibToFind);
long returnMinusOne; // variable is read but never initialized
long returnMinustwo; // variable is read but never initialized
pthread_t minusone; // variable is never used (?)
pthread_t minustwo; // variable is never used
if(newFibToFind == 0 || newFibToFind == 1)
// you miss a cast here (but you really shouldn't do it this way)
return newFibToFind;
else{
long newFibToFind1 = ((long)fibToFind) - 1; // variable is never used
long newFibToFind2 = ((long)fibToFind) - 2; // variable is never used
// reading undefined variables (and missing a cast)
return returnMinusOne + returnMinustwo;
}
}
Always take care of compiler warnings: when you get one, usually, you really are doing something fishy.
Maybe you should revise the algorithm a little: right now, all your function does is return the sum of two undefined values, hence the 1047 I got earlier.
Implementing the Fibonacci sequence using a recursive algorithm means you need to call the function again. As others noted, it's quite an inefficient way of doing it, but it's easy, so I guess all computer science teachers use it as an example.
The regular recursive algorithm looks like this:
int fibonacci(int iteration)
{
if (iteration == 0 || iteration == 1)
return 1;
return fibonacci(iteration - 1) + fibonacci(iteration - 2);
}
I don't know to what extent you were supposed to use threads: just run the algorithm on a secondary thread, or create new threads for each call? Let's assume the former for now, since it's a lot more straightforward.
Casting integers to pointers and vice versa is bad practice: looked at from a higher level, they are widely different things. Integers do math, and pointers resolve memory addresses. It happens to work because they're represented the same way, but really, you shouldn't do this. Instead, you might notice that the function called to run your new thread accepts a void* argument: we can use it to convey both where the input is and where the output will be.
So building upon my previous fibonacci function, you could use this code as the thread main routine:
void* fibonacci_offshored(void* pointer)
{
int* pointer_to_number = pointer;
int input = *pointer_to_number;
*pointer_to_number = fibonacci(input);
return NULL;
}
It expects a pointer to an integer, takes its input from it, then writes its output to it.1 You would then create the thread like this:
int main()
{
int value = 15;
pthread_t thread;
// on input, value should contain the number of iterations;
// after the end of the function, it will contain the result of
// the fibonacci function
int result = pthread_create(&thread, NULL, fibonacci_offshored, &value);
// error checking is important! try to crash gracefully at the very least
if (result != 0)
{
perror("pthread_create");
return 1;
}
if (pthread_join(thread, NULL) != 0)
{
perror("pthread_join");
return 1;
}
// now, value contains the output of the fibonacci function
// (note that value is an int, so just %d is fine)
printf("The value is %d\n", value);
return 0;
}
If you need to call the Fibonacci function from new distinct threads (please note: that's not what I'd advise, and others seem to agree with me; it will just blow up for a sufficiently large number of iterations), you'll first need to merge the fibonacci function with the fibonacci_offshored function. It will bulk it up considerably, because dealing with threads is heavier than dealing with regular functions.
void* threaded_fibonacci(void* pointer)
{
int* pointer_to_number = pointer;
int input = *pointer_to_number;
if (input == 0 || input == 1)
{
*pointer_to_number = 1;
return NULL;
}
// we need one argument per thread
int minus_one_number = input - 1;
int minus_two_number = input - 2;
pthread_t minus_one;
pthread_t minus_two;
// don't forget to check! especially that in a recursive function where the
// recursion set actually grows instead of shrinking, you're bound to fail
// at some point
if (pthread_create(&minus_one, NULL, threaded_fibonacci, &minus_one_number) != 0)
{
perror("pthread_create");
*pointer_to_number = 0;
return NULL;
}
if (pthread_create(&minus_two, NULL, threaded_fibonacci, &minus_two_number) != 0)
{
perror("pthread_create");
*pointer_to_number = 0;
return NULL;
}
if (pthread_join(minus_one, NULL) != 0)
{
perror("pthread_join");
*pointer_to_number = 0;
return NULL;
}
if (pthread_join(minus_two, NULL) != 0)
{
perror("pthread_join");
*pointer_to_number = 0;
return NULL;
}
*pointer_to_number = minus_one_number + minus_two_number;
return NULL;
}
Now that you have this bulky function, adjustments to your main function are going to be quite easy: just change the reference to fibonacci_offshored to threaded_fibonacci.
int main()
{
int value = 15;
pthread_t thread;
int result = pthread_create(&thread, NULL, threaded_fibonacci, &value);
if (result != 0)
{
perror("pthread_create");
return 1;
}
pthread_join(thread, NULL);
printf("The value is %d\n", value);
return 0;
}
You might have been told that threads speed up parallel processes, but there's a limit somewhere where it's more expensive to set up a thread than to run its contents. This is a very good example of such a situation: the threaded version of the program runs much, much slower than the non-threaded one.
For educational purposes: this program runs out of threads on my machine when the number of desired iterations is 18, and it takes a few seconds to run. By comparison, using an iterative implementation we never run out of threads, and we get our answer in a matter of milliseconds. It's also considerably simpler. This is a great example of how using a better algorithm fixes many problems.
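For reference, an iterative version matching the convention used above (fibonacci(0) == fibonacci(1) == 1); a sketch:

/* Iterative Fibonacci: O(n) additions, no recursion, no threads. */
int fibonacci_iterative(int iteration)
{
    int a = 1, b = 1; /* fibonacci(0), fibonacci(1) */
    for (int i = 2; i <= iteration; i++) {
        int next = a + b;
        a = b;
        b = next;
    }
    return b;
}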
Also, out of curiosity, it would be interesting to see if it crashes on your machine, and where/how.
1. Usually, you should try to avoid changing the meaning of a variable between its value on input and its value after the function returns. For instance, here, on input, the variable is the number of iterations we want; on output, it's the result of the function. Those are two very different meanings, and that's not really good practice. I didn't feel like using dynamic allocation to return a value through the void* return value.
I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64-bit integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be in the range 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions to be accessed randomly (assume uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A c library would be ideal, but a c++ one would also allow me to run some initial benchmarks.
A few more details, if you are still following: this will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the library cmph (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory.
Since it is a library, I don't have control over the input, but the typical use case that I want to optimize has hundreds of millions of values, an average value size in the few-kilobytes range, and a maximum value size of 2^31.
For the record, if I don't find a library ready to use, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are way too many other options, and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
I appreciate your help and let me know if you have any doubts.
I use FastBit (Kesheng Wu, LBL.GOV). It seems you need something good, fast and NOW, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and generally very good.
However, given more time, you may want to look at a Gray-code solution; it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/C++/Java released on code.google; I've read over some of his papers, and they are quite nice: several advancements on FastBit, plus alternative approaches to column reordering with permuted Gray codes.
Almost forgot: I also came across Tokyo Cabinet, though I do not think it is well suited for my current project. I might have considered it more if I had known about it before ;). It has a large degree of interoperability:
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.
As you referred to CDB: the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write performance.
With respect to your delta-encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content-patching system; it may also have some fundamental interfaces for your general needs.
Google's new binary compression application, Courgette, may be worth checking out, in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save that space?
If so, one thing you could try is to chop the 64-bit space in half and store the positions in two tables: the first stores (upper uint, start index, length, pointer to second table), and the second stores (index, lower uint).
For fast searching, the indices would be implemented using something like a B+ tree. A sketch of this layout follows.
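A sketch of the split (the names and fixed run layout are illustrative assumptions): positions sharing the same upper 32 bits are grouped into a run, so the upper half is stored once per run rather than once per entry.

#include <stdint.h>

/* First table: one entry per run of positions sharing the upper 32 bits. */
struct upper_entry {
    uint32_t upper;   /* high 32 bits shared by the run          */
    uint32_t start;   /* index of the run's first entry in lower */
    uint32_t length;  /* number of positions in the run          */
};

/* Second table: the low 32 bits of every position, in order:
 *   uint32_t lower[];
 * Reconstructing position i (after locating its run, e.g. by binary
 * search or the B+ tree mentioned above) is one shift and one OR: */
uint64_t position(const struct upper_entry *run, const uint32_t *lower,
                  uint32_t i)
{
    return ((uint64_t)run->upper << 32) | lower[i];
}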
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record consisting of a record number (document id) and a word number (it could just as easily have stored word offsets), and these needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that a word tends to occur several times within a document, so the record number often did not need to be repeated at all, and the word-offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code is not going to be useful to you as-is, but it can be a good starting point for writing compression routines.
Please excuse the Hungarian notation and the magic numbers strewn within the code. Like I said, I wrote it many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once
#include "File.h"
const int IC_BUFFER_SIZE = 8192;
//
// index compressor
//
class IndexCompressor
{
private :
File *m_pFile;
WA_DWORD m_dwRecNo;
WA_DWORD m_dwWordNo;
WA_DWORD m_dwRecordCount;
WA_DWORD m_dwHitCount;
WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
WA_DWORD m_dwBytes;
bool m_bDebugDump;
void FlushBuffer(void);
public :
IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
~IndexCompressor(void) {}
void Attach(File& File) { m_pFile = &File; }
void Begin(void);
void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
void End(void);
WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
WA_DWORD GetHitCount(void) { return m_dwHitCount; }
void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"
void IndexCompressor::FlushBuffer(void)
{
ASSERT(m_pFile != 0);
if (m_dwBytes > 0)
{
m_pFile->Write(m_byBuffer, m_dwBytes);
m_dwBytes = 0;
}
}
void IndexCompressor::Begin(void)
{
ASSERT(m_pFile != 0);
m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
m_dwBytes = 0;
}
void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
ASSERT(m_pFile != 0);
WA_BYTE buffer[16];
int nbytes = 1;
ASSERT(dwRecNo >= m_dwRecNo);
if (dwRecNo != m_dwRecNo)
m_dwWordNo = 0;
if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
++m_dwRecordCount;
++m_dwHitCount;
WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;
if (m_bDebugDump)
{
TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
}
// 1WWWWWWW
if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
{
buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
}
// 01WWWWWW WWWWWWWW
else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
{
buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
nbytes += sizeof(WA_BYTE);
}
// 001RRRRR WWWWWWWW WWWWWWWW
else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
{
buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
WA_WORD *p = (WA_WORD *) (buffer+1);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
// 0001rrww
buffer[0] = 0x10;
// encode recno
if (dwRecNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwRecNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwRecNoDelta < 65536)
{
buffer[0] |= 0x04;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwRecNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x08;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwRecNoDelta;
nbytes += sizeof(WA_DWORD);
}
// encode wordno
if (dwWordNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwWordNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwWordNoDelta < 65536)
{
buffer[0] |= 0x01;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x02;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwWordNoDelta;
nbytes += sizeof(WA_DWORD);
}
}
// update current setting
m_dwRecNo = dwRecNo;
m_dwWordNo = dwWordNo;
// add compressed data to buffer
ASSERT(buffer[0] != 0);
ASSERT(nbytes > 0 && nbytes < 10);
if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
FlushBuffer();
CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
m_dwBytes += nbytes;
if (m_bDebugDump)
{
for (int i = 0; i < nbytes; ++i)
TRACE("%02X ", buffer[i]);
TRACE("\n");
}
}
void IndexCompressor::End(void)
{
FlushBuffer();
m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information: the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit integers incurs at most 3% overhead (8/256). If the total length of the string file is less than 4 GB, you could use 32-bit indices and incur at most 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem, a variation on LZ77 seems in order.
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding: group the deltas and use a variable-length code within each group. I wouldn't hard-wire 64 as the group size; I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding, I doubt you will find much. For variable-length integer codes I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try: it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9, among many other sources.
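If you do roll your own, a UTF-8-style variable-length code is only a few lines. A minimal LEB128-like sketch (7 data bits per byte, high bit as the continuation flag), so small deltas cost a single byte:

#include <stdint.h>
#include <stddef.h>

/* Encode v as 1..10 bytes; returns the number of bytes written. */
size_t varint_encode(uint64_t v, unsigned char *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (unsigned char)(v | 0x80); /* low 7 bits + "more" flag */
        v >>= 7;
    }
    out[n++] = (unsigned char)v;              /* final byte, flag clear   */
    return n;
}

/* Decode one value; returns the number of bytes consumed. */
size_t varint_decode(const unsigned char *in, uint64_t *v)
{
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do {
        *v |= (uint64_t)(in[n] & 0x7f) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    return n;
}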
Are you running on Windows? If so, I recommend creating the mmap'ed file using the naive solution you originally proposed, and then compressing the file with NTFS compression. Your application code never knows the file is compressed, and the OS does the compression for you. You might not think this would perform well or compress much, but I think you'll be surprised if you try it.