Online entropy evaluation algorithm - solr

Is there a way to evaluate entropy on a stream of discrete values, similar to SummaryStatistics for mean and deviation?
I need this algorithm for a real-time Solr component, and it will probably iterate over large document collections (100,000).
Related question: what is the best way to compute entropy in a MapReduce-like environment?

There may be a way - it depends somewhat on the characteristics of the stream, and what you want to do with the results.
The sample entropy is a function of the sample probability distribution. You can store a running count of each value together with the running total count, which means that the distribution can be calculated on demand. Excuse my sloppy Java, it's been about a year since I wrote it.
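// needs java.util.HashMap and java.util.Map
Map<K, Integer> runningCount = new HashMap<>();
int totalCount = 0;
public void addValue(K k) {
    // merge() starts a missing key at 1 and increments an existing one
    runningCount.merge(k, 1, Integer::sum);
    totalCount += 1;
}
public Map<K, Double> getDistribution() {
    Map<K, Double> dist = new HashMap<>();
    for (Map.Entry<K, Integer> e : runningCount.entrySet()) {
        // cast to double, otherwise integer division yields 0
        dist.put(e.getKey(), e.getValue() / (double) totalCount);
    }
    return dist;
}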
This means that you can also compute the entropy on demand:
public double getEntropy() {
    Map<K, Double> dist = getDistribution();
    double entropy = 0;
    for (K k : dist.keySet()) {
        double p = dist.get(k);
        entropy -= p * Math.log(p);
    }
    return entropy;
}
Computing the distribution and the entropy is O(n), where n is the number of distinct values your stream might take on. The cost is independent of the length of the stream, as you can see from the fact that the addValue method only updates counters and doesn't store the stream values.

Related

Finding maximum along nth dimension in C

I'm working on a problem where the execution time is critical. I have another C function that produces 3-D grids of values at a series of timestamps. What I want is to find the max_value in each 3-D grid at each timestamp. Additionally I am tracking the average value (sum / ncell) of each grid, and returning a maximum normalised by the average value.
I am not proficient in C, so I wanted to check if there is anything I am missing, either in terms of actual code, or use of OpenMP. I guess my question is:
What is the most efficient way to find the maximum values of a n-dimensional array sliced along the nth dimension?
I understand that the best you can hope for (as the grids are unordered) is O(n). My assessment is that this problem is then O(m x n), m = time dimension, n = dimension of the grid, and I think my implementation reaches that. Typical values for these dimensions are perhaps m=5000 to 20000, n=200*200*60.
Currently, I am timing my Python wrapper function (which includes the initialisation of the numpy.ndarrays that receive the max, normMax, and maxIndex values) with:
m = 2400
n = 54000
threads = 8
For which I am averaging ~0.33 seconds to find the maximum values.
If it's relevant, this is on my laptop with:
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (6MB cache)
32GB RAM
Code:
void find_max(double *mapPt, double *maxPt, double *normMaxPt,
int64_t *indexPt, int32_t nsamp, int32_t ncell,
int64_t threads)
{
double maxValue, currentValue, sum;
int32_t cell, maxIndex, timeSample;
#pragma omp parallel for num_threads(threads)
for (timeSample=0; timeSample<nsamp; timeSample++)
{
maxValue = 0.0;
maxIndex = 0;
sum = 0.0;
for (cell=0; cell<ncell; cell++)
{
currentValue = mapPt[cell * nsamp + timeSample];
sum += currentValue;
if (currentValue > maxValue)
{
maxValue = currentValue;
maxIndex = cell;
}
}
maxPt[timeSample] = maxValue;
normMaxPt[timeSample] = maxValue * ncell / sum;
indexPt[timeSample] = maxIndex;
}
}
I am compiling with gcc 7.4.0, with the important flags probably -Ofast and -lm.
I am completely happy for the answer to be "there's nothing more you can do", just want to know for peace of mind.
One suggestion would be to set double *timesame_mapcells = &mapPt[timeSample]; at the start of each iteration of the outer loop.
Then you can index with just cell * nsamp, which saves one addition per access. The compiler might already be clever enough to optimize that, though.
You could also try having two incremented variables in the for loop:
for (cell = 0, map_idx = timeSample; cell < ncell; cell++, map_idx += nsamp)
{
currentValue = mapPt[map_idx];
[...]
}
This might save some cycles by replacing the timeSample addition and the nsamp multiplication on every access with a single addition.
Then again, this is just a suggestion for you to try. I don't know whether it will have an observable impact on performance (but I'm curious to know whether it does, if you give it a go).
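Two further points may be worth checking, offered here as a sketch rather than a verified fix. First, the question's code declares maxValue, sum, maxIndex, currentValue and cell outside the parallel loop, which under OpenMP's default rules makes them shared between threads; declaring them inside the loop body gives each thread its own copy. Second, cell * nsamp is computed in int32_t, which overflows for the larger sizes mentioned (2,400,000 cells times 20,000 samples exceeds 2^31), so the sketch steps a size_t index instead. The name find_max_local is hypothetical; the memory layout and the initial maxValue of 0.0 are taken from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as in the question: the value for (cell, timeSample)
   lives at mapPt[cell * nsamp + timeSample]. */
void find_max_local(const double *mapPt, double *maxPt, double *normMaxPt,
                    int64_t *indexPt, int32_t nsamp, int32_t ncell,
                    int64_t threads)
{
    #pragma omp parallel for num_threads(threads)
    for (int32_t timeSample = 0; timeSample < nsamp; timeSample++) {
        /* Declared inside the loop body, so every thread has its own copy. */
        double maxValue = 0.0;  /* as in the question: assumes non-negative data */
        double sum = 0.0;
        int32_t maxIndex = 0;
        size_t idx = (size_t)timeSample;  /* index of cell 0 for this sample */

        for (int32_t cell = 0; cell < ncell; cell++, idx += (size_t)nsamp) {
            double currentValue = mapPt[idx];
            sum += currentValue;
            if (currentValue > maxValue) {
                maxValue = currentValue;
                maxIndex = cell;
            }
        }
        maxPt[timeSample] = maxValue;
        normMaxPt[timeSample] = maxValue * ncell / sum;
        indexPt[timeSample] = maxIndex;
    }
}
```

Without -fopenmp the pragma is ignored and the function runs serially, which is a convenient way to compare results against the threaded build.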

What sort of indexing method can I use to store the distances between X^2 vectors in an array without redundancy?

I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors, and finds the distance between each one, meaning it runs the distance function X^4 times, even though (I think) there are only (X^2)/2 unique distances.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for(int i = 0; i < MATRIX_WIDTH; i++)
{
for(int j = 0; j < MATRIX_WIDTH; j++)
{
float xd, yd;
float distance;
for(int k = 0; k < MATRIX_WIDTH; k++)
{
for(int l = 0; l < MATRIX_WIDTH; l++)
{
int index_a = (i * MATRIX_WIDTH) + j;
int index_b = (k * MATRIX_WIDTH) + l;
xd = matrix[index_a][0] - matrix[index_b][0];
yd = matrix[index_a][1] - matrix[index_b][1];
distance = sqrtf(powf(xd, 2) + powf(yd, 2));
}
}
// More code that uses the distances between each vector
}
}
What I'd like to do is create and populate an array of (X^2) / 2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
Performance ideas:
a) if possible, work with the squared distance to avoid the root calculation
b) never use pow for constant integer powers; use xd*xd instead
I would consider changing your algorithm, as O(n^4) is really bad. When dealing with interactions in physics (also O(n^4) for distances in a 2D field) one would implement b-trees etc. and neglect particle interactions with a low impact. But it will depend on what "more code that uses the distance..." really does.
After some more thought: the number of unique distances is 0.5*n*(n+1) with n = w*h, counting each vector's distance to itself.
If you write down when unique distances occur, you will see that both inner loops can be shortened by starting them at i and j.
Additionally, if you only need to access those distances via the matrix index, you can set up a 4D distance matrix.
If memory is limited, we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line index to avoid summing up on every access:
float distanceArray[(H*W+1)*H*W/2];
int lineIndices[H*W]; /* one precalculated line start per vector */
float searchDistance(int i, int j)
{
return i<j?distanceArray[i+lineIndices[j]]:distanceArray[j+lineIndices[i]];
}
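The triangular lookup can be fleshed out as follows. This is a sketch under assumptions: N_VEC stands in for W*H, the names tri_init, tri_index, tri_fill and tri_get are invented for illustration, and the table stores squared distances, following suggestion a) above. Row j of the triangle starts at j*(j+1)/2, so the pair (i, j) with i <= j maps to j*(j+1)/2 + i.

```c
#include <stddef.h>

#define N_VEC 64  /* hypothetical number of vectors (W*H in the question) */

/* One entry per unordered pair (i, j) with i <= j, diagonal included. */
static float dist_tri[N_VEC * (N_VEC + 1) / 2];
static size_t row_start[N_VEC];

/* Precalculate where each row of the triangle begins: j*(j+1)/2. */
void tri_init(void)
{
    for (size_t j = 0; j < N_VEC; j++)
        row_start[j] = j * (j + 1) / 2;
}

/* Map an unordered pair to its slot; (i, j) and (j, i) share one slot. */
size_t tri_index(size_t i, size_t j)
{
    return i <= j ? row_start[j] + i : row_start[i] + j;
}

/* Fill the table once; squared distances avoid the sqrt entirely. */
void tri_fill(float v[][2])
{
    for (size_t j = 0; j < N_VEC; j++) {
        for (size_t i = 0; i <= j; i++) {
            float xd = v[i][0] - v[j][0];
            float yd = v[i][1] - v[j][1];
            dist_tri[tri_index(i, j)] = xd * xd + yd * yd;
        }
    }
}

/* Look up the squared distance between vectors i and j. */
float tri_get(size_t i, size_t j)
{
    return dist_tri[tri_index(i, j)];
}
```

Call tri_init() once, then tri_fill() each time the vectors move, and read pairs back with tri_get() in either order.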

Weight Initialisation

I plan to use the Nguyen-Widrow Algorithm for an NN with multiple hidden layers. While researching, I found a lot of ambiguities and I wish to clarify them.
The following is pseudo code for the Nguyen-Widrow Algorithm
Initialize all weight of hidden layers with random values
For each hidden layer{
beta = 0.7 * Math.pow(hiddenNeurons, 1.0 / number of inputs);
For each synapse{
For each weight{
Adjust weight by dividing by norm of weight for neuron and multiplying by beta value
}
}
}
Just wanted to clarify whether the value of hiddenNeurons is the size of the particular hidden layer, or the size of all the hidden layers within the network. I got mixed up by viewing various sources.
In other words, if I have a network (3-2-2-2-3) (index 0 is input layer, index 4 is output layer), would the value hiddenNeurons be:
NumberOfNeuronsInLayer(1) + NumberOfNeuronsInLayer(2) + NumberOfNeuronsInLayer(3)
Or just
NumberOfNeuronsInLayer(i) , where i is the current Layer I am at
EDIT:
So, the hiddenNeurons value would be the size of the current hidden layer, and the input value would be the size of the previous hidden layer?
The Nguyen-Widrow initialization algorithm is the following:
1. Initialize all weights of the hidden layers with (ranged) random values
2. For each hidden layer
2.1 calculate the beta value: 0.7 * Nth root (N = #neurons of input layer) of #neurons of current layer
2.2 for each synapse
2.2.1 for each weight
2.2.2 adjust the weight by dividing by the norm of the weights for the neuron and multiplying by the beta value
Encog Java Framework
Sounds to me like you want more precise code. Here are some actual code lines from a project I'm participating in. I hope you read C. It's a bit abstracted and simplified. There is a struct nn that holds the neural net data. You probably have your own abstract data type.
Code lines from my project (somewhat simplified):
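float *w = nn->the_weight_array;
float factor = 0.7f * powf( (float) nn->n_hidden, 1.0f / nn->n_input);
unsigned int i;
/* one weight per (input, hidden) pair, matching the scaling loop below */
for( i = 0; i < nn->n_input * nn->n_hidden; i++ )
*w++ = random_range( -factor, factor );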
/* Nguyen/Widrow */
w = nn->the_weight_array;
for( i = nn->n_input; i; i-- ){
_scale_nguyen_widrow( factor, w, nn->n_hidden );
w += nn->n_hidden;
}
Functions called:
static void _scale_nguyen_widrow( float factor, float *vec, unsigned int size )
{
unsigned int i;
float magnitude = 0.0f;
for ( i = 0; i < size; i++ )
magnitude += vec[i] * vec[i];
magnitude = sqrtf( magnitude );
for ( i = 0; i < size; i++ )
vec[i] *= factor / magnitude;
}
static inline float random_range( float min, float max)
{
float range = fabsf(max - min);
return ((float)rand()/(float)RAND_MAX) * range + min;
}
Tip:
After you've implemented the Nguyen/Widrow weight initialization, you can add a line of code in the forward calculation that dumps each activation to a file. Then you can check how well the set of neurons hits the activation function. Find the mean and standard deviation. You can even plot it with a plotting tool, e.g. gnuplot. (You need a plotting tool like gnuplot anyway for plotting error rates etc.) I did that for my implementation. The plots came out nice, and the initial learning became much faster using Nguyen/Widrow for my project.
PS: I'm not sure my implementation is correct according to Nguyen and Widrow's intentions. I don't even think I care, as long as it does improve the initial learning.
Good luck,
-Øystein

DFT function implementation in C

I am implementing BFSK on a DSP processor and am currently simulating it on a Linux machine using C. I am working on the demodulation function, which involves taking an FFT of the incoming data. For simulation purposes, I have a pre-defined function for the DFT, which is:
void dft(complex_float* in, complex_float* out, int N, int inv)
{
int i, j;
float a, f;
complex_float s, w;
f = inv ? 1.0/N : 1.0;
for (i = 0; i < N; i++) {
s.re = 0;
s.im = 0;
for (j = 0; j < N; j++) {
a = -2*PI*i*j/N;
if (inv) a = -a;
w.re = cos(a);
w.im = sin(a);
s.re += in[j].re * w.re - in[j].im * w.im;
s.im += in[j].im * w.re + in[j].re * w.im;
}
out[i].re = s.re*f;
out[i].im = s.im*f;
}
} /* end of dft() */
Here the complex_float is a struct defined as follows:
typedef struct {
float re;
float im;
} complex_float;
In the dft() function, the parameter N denotes the number of DFT points.
My question is this: since the algorithm also involves a frequency hopping sequence, I need to check the amplitude of the DFT of the signal at different frequency components while demodulating it.
In MATLAB this was quite simple as the FFT function there involves the sampling frequency as well and I could find the power at any frequency point as
powerat_at_freq = floor((freq * fftLength) / Sampling_freq)
But the C function does not involve any frequencies, so how can I determine the magnitude of the DFT at any particular frequency?
The index in the FFT table for a particular frequency is calculated as follows:
int i = round(f / fT*N)
where f is the wanted frequency, fT is the sampling frequency and N is the number of FFT points.
The FFT should be fine-grained enough (i.e. N should be large) to cover all the frequencies.
If the precise frequency isn't present in the FFT, the nearest one will be used. More info about FFT indexes versus frequencies:
How do I obtain the frequencies of each value in an FFT?
The frequency represented depends on the sample rate of the data fed to it (divided by the length of the FFT). Thus any DFT or FFT can represent any frequency you want just by feeding it the right amount of data at the right sample rate.
You can also refer to the FFTW library, which is well known and widely used for FFT applications.
The official website is: http://www.fftw.org/
By the way, MATLAB's FFT function is also implemented using the FFTW library.

What is wrong with my low pass filter?

I have an array of int samples ranging from 32766 to -32767. In part of trying to create an envelope detector I've written a low pass filter, but it doesn't seem to be doing the job. Please keep in mind I'm trying to filter an entire array in one shot (no buffers).
This is not streamed, but applied to recorded audio for later playback. It is written in C. An example cutoff argument would be 0.5.
void lopass(int *input, float cutoff, int *output)
{
float sample = 0;
for (int i=1 ; i < (1430529-10); i++) // we will go through all except the last 10 samples
{
for (int j = i; j < (i+10); j++) { // only do this for a WINDOW of a hundred samples
float _in = (float)input[j];
float _out = (float)output[j-1];
sample = (cutoff * _in) + (32766 - (32766*cutoff)) * _out;
}
output[i] = (int)sample;
}
}
I thought that I would run my filtering statement on a window of 10 samples. Not only is it super slow, but it doesn't really do much but seemingly lower the overall amplitude.
If you have any advice, or suggestions (or code!) on how to do this properly, that would be great!
A low-pass filter is basically some variant of averaging a number of values together. That means at least in the normal case your inner loop will accumulate a value. It's hard to guess the exact intent from your code, but you end up with something on the extremely general order of:
sample = 0;
for (int j=i; j<i+10; j++)
sample += input[j];
output[i] = sample / 10;
As it stands right now, this just does averaging, with no cutoff specified. That means it has a fixed (and fairly slow) cutoff curve. The cutoff is governed only by the number of samples in the window.
To control the cutoff, you do not (at least normally) multiply all the input values by the same amount -- that would basically just modify the scale factor. Instead, you take a set of samples (10 of them, in your case) of the cutoff curve you want to apply, run them through an inverse FFT, and get a set of 10 coefficients. You then apply those coefficients in your loop:
sample = 0;
for (j=0; j<10; j++)
sample += input[i+j] * coefficients[j];
output[i] = sample;
The number of samples in your window isn't normally an input to the design process; rather, it's an output. You start by specifying the cutoff frequency (as a fraction of the sampling frequency) and the cutoff width, and based on those you compute the necessary window size.
There are quite a few different techniques for computing your coefficients. Regardless of how you compute them, however, you normally end up with something on this general order -- accumulate the sum of the samples in the window, each multiplied by its respective coefficient.
The EE times had a pretty good article on filter design a few years ago.
Don't know if it's relevant, but in your code the inner loop is doing nothing
for (j=???; j<???; j++) {
sample = ???;
}
is the same as
// for (j=???; j<???; j++) {
sample = ???; // for last j
// }
The arithmetic in the filter looks wrong, and as @pmg already pointed out, you are not storing output values correctly. It should probably be:
void lopass(int *input, float cutoff, int *output)
{
float sample = 0.0f;
output[0] = 0.0f;
for (int i = 1 ; i < (1430529 - 10); i++)
{
for (int j = i; j < (i + 10); j++)
{
float _in = (float)input[j];
float _out = (float)output[j-1];
sample = (cutoff * _in) + (1.0f - cutoff) * _out;
output[i] = (int)sample;
}
}
}
There are still a few minor issues to be fixed but this should at least work as a fairly crude single pole recursive (IIR) filter.
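For comparison, a single-pole recursive filter does not need an inner loop at all. The following is a sketch, not the answer's code: the name lopass_single_pole and the explicit length parameter n are assumptions, and alpha plays the role of cutoff (between 0 and 1, with smaller values smoothing more). The filter state is kept in a float so that truncation to int does not feed back into later samples.

```c
#include <stddef.h>

/* Single-pole IIR low-pass: y[i] = alpha * x[i] + (1 - alpha) * y[i-1].
   alpha is assumed to lie in (0, 1]; alpha = 1 passes the input through. */
void lopass_single_pole(const int *input, int *output, size_t n, float alpha)
{
    float y = 0.0f;  /* filter state, starts at zero */

    for (size_t i = 0; i < n; i++) {
        y = alpha * (float)input[i] + (1.0f - alpha) * y;
        output[i] = (int)y;  /* truncate only the stored copy */
    }
}
```

With alpha = 0.5 and a constant input of 1000, the output climbs 500, 750, 875 and so on towards 1000, which is the step response one would expect from this kind of filter.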
It's a broken moving-window filter of 10 samples in the inner loop (where you actually use only the last sample of the 10), when in your comments you say you want 100 samples in your rectangular filter window.
The first error will give you a filter transition frequency 10X too high.
