GNU Scientific Library probability distribution functions in C

I have a set of GSL histograms from which I build a set of probability distribution functions. According to the documentation, each PDF is stored in a struct, as follows:
Data Type: gsl_histogram_pdf
size_t n
This is the number of bins used to approximate the probability distribution function.
double * range
The ranges of the bins are stored in an array of n+1 elements pointed to by range.
double * sum
The cumulative probability for the bins is stored in an array of n elements pointed to by sum.
I intend to use a Kolmogorov-Smirnov test to determine whether two data sets are similar. So I am trying to access the cumulative sum of a given bin in this structure, to calculate the 'distance', and I assumed I could access that value using:
((my_type)->pdf->sum+x)
with x being the bin number.
Yet this always returns 0 no matter what I do. Does anyone have any idea what is going wrong?
Thanks in advance
---- EDIT ----
Here is a snippet of my code that deals with the pdf / histogram:
/* GSL Histogram creation */
for (i = 0; i < chrom->hits; i++) {
    if ((chrom+i)->spectra->peaks != 0) {
        (chrom+i)->hist = gsl_histogram_alloc(bins);
        gsl_histogram_set_ranges_uniform((chrom+i)->hist, low_mz, high_mz);
        for (j = 0; j < (chrom+i)->spectra->peaks; j++) {
            gsl_histogram_increment((chrom+i)->hist, ((chrom+i)->spectra+j)->mz_value);
        }
    } else {
        printf("0 value encountered!\n");
    }
}

/* Histogram probability distribution function creation */
for (i = 0; i < chrom->hits; i++) {
    if ((chrom+i)->spectra->peaks != 0) {
        (chrom+i)->pdf = gsl_histogram_pdf_alloc(bins);
        gsl_histogram_pdf_init((chrom+i)->pdf, (chrom+i)->hist);
    } else {
        continue;
    }
}

/* Kolmogorov-Smirnov */
float D;
for (i = 0; i < chrom->hits-1; i++) {
    printf("%f\n", ((chrom+i)->pdf->sum+25));
    for (j = i+1; j < chrom->hits; j++) {
        D = 0;
        diff = 0;
        /* Determine max distance */
    }
}

You compute a pointer to the value you intend to access, not the value itself.
Change your current pointer computation
printf("%f\n",((chrom+i)->pdf->sum+25));
either to a normal array subscript
printf("%f\n",(chrom+i)->pdf->sum[25]);
or to a pointer computation followed by a dereference
printf("%f\n",*((chrom+i)->pdf->sum+25));
See whether that fixes your issue. The value passed isn't really 0 either, but it may well get displayed as 0: a pointer handed to %f can be reinterpreted as a very small floating-point number, depending on the virtual memory layout.
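Once the dereference is fixed, the KS statistic itself is just the maximum absolute difference between the two cumulative arrays. A minimal sketch, with plain double arrays standing in for the sum members of two gsl_histogram_pdf structs built over the same bin ranges (ks_distance is an illustrative name, not a GSL function):

```c
#include <math.h>
#include <stddef.h>

/* Two-sample KS distance from two cumulative-probability arrays of the
   same length n.  Note the subscript sum_a[x]: the array element is
   read, not the pointer sum_a + x. */
double ks_distance(const double *sum_a, const double *sum_b, size_t n) {
    double d = 0.0;
    for (size_t x = 0; x < n; x++) {
        double diff = fabs(sum_a[x] - sum_b[x]);
        if (diff > d)
            d = diff;
    }
    return d;
}
```

With GSL structs this would be called as ks_distance(a->sum, b->sum, a->n), assuming both PDFs were built with identical bin ranges.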


What's the difference when you put a dot in numbers?

The program should read 5 numbers from the user. Then it has to draw 5 'lines' according to the ratio of these numbers: for the biggest number it has to draw 20 '*'s, and for every other number proportionally fewer '*'s relative to the biggest number.
There's a program which can check if my code is correct, and I tried a lot of things to make it work, but I always got a bad result (right side of the picture). Then I checked the solutions of similar exercises and found that the only difference is a dot after the defined number:
#define SEGMENT 20.
When I changed just this one thing in my code, it worked correctly (left side of the picture). Can someone explain why it makes such a huge difference, and why it works without the dot for some inputs?
Here's the full code:
#include <stdio.h>
#define PCS 5
#define SEGMENT 20
//The only difference is in the line above
//#define SEGMENT 20.

int main(void) {
    double nums[PCS];
    int max;
    //Reading the five numbers and finding the biggest value
    for(int i = 0; i < PCS; i++) {
        scanf("%lf", &nums[i]);
        if(i == 0 || max < nums[i]) {
            max = nums[i];
        }
    }
    //Calculate the rate for the lines
    double rate = SEGMENT / max;
    //Print the lines
    for(int i = 0; i < PCS; i++) {
        for(int j = 0; j < (int)(nums[i]*rate); j++) {
            printf("*");
        }
        printf("\n");
    }
    return 0;
}
In the context of this statement
double rate = SEGMENT / max;
20 is an integer, so the expression 20 / max is integer division: the remainder is dropped before the result is assigned to rate.
20. is the same as 20.0 - a floating-point value. Hence 20.0 / max is evaluated as a real number (max is effectively promoted from int to double before the division is applied), and the result of that expression is assigned to rate without any truncation.
If it helps, you can avoid the floating point computation (and all the weirdness that goes with fp) by doing integer math and having the division operation performed as the last step of your calculation.
That is, instead of this:
#define SEGMENT 20.0
...
double rate = SEGMENT / max;
...
for(int j = 0; j < (int)(nums[i]*rate); j++) {
This:
#define SEGMENT 20
...
for(int j = 0; j < (int)(nums[i]*SEGMENT)/max; j++) {
This keeps your math in the integer space and should still work. (Disclaimer: nums is still an array of double, so the expression (nums[i]*20)/max is still a floating-point value, but it no longer throws away the fractional part of the rate up front.)
When you divide an int by an int you get an int (truncated):
7 / 2 == 3
7 / 2.0 == 3.5 (as does 7.0 / 2)
In your code, 20. is just shorthand for 20.0.

I have a question about sparse matrix multiplication code in C

I'm learning about sparse matrices from "Fundamentals of Data Structures in C" by Horowitz.
My problem is with sparse matrix multiplication. I understand how the algorithm works, but I can't understand the code.
Below is the code for "mmult".
It is the so-called "boundary condition" that confuses me. I don't understand why this condition is needed. Isn't it just fine without these terms? I need some help understanding this part.
The book says "...these dummy terms serve as sentinels that enable us to obtain an elegant algorithm.."
typedef struct {
    int row;
    int col;
    int value;
} SM; /* type SM is "Sparse Matrix" */

void mmult(SM* A, SM* B, SM* C) {
    int i, j;
    int rowsA, colsB, totalA, totalB, totalC;
    int rowbegin, A_Row, B_Col, sum;
    SM* newB; /* must point to allocated storage before transpose is called */

    rowsA = A[0].row, colsB = B[0].col;
    totalA = A[0].value, totalB = B[0].value;
    totalC = 0;
    if (A[0].col != B[0].row) {
        fprintf(stderr, "can't multiply\n");
        return;
    }
    transpose(B, newB); /* newB is the transpose of B */
    /* set boundary condition */
    A[totalA+1].row = rowsA;
    newB[totalB+1].row = colsB;
    newB[totalB+1].col = -1;
    rowbegin = 1;
    for (i = 1, A_Row = A[1].row, sum = 0; i <= totalA;) {
        B_Col = newB[1].row;
        for (j = 1; j <= totalB + 1;) { /* runs to totalB+1 so the sentinel is visited */
            /* current multiplying row != A[i].row */
            if (A[i].row != A_Row) {
                storesum(C, A_Row, B_Col, &totalC, &sum);
                for (; newB[j].row == B_Col; j++)
                    ;
                i = rowbegin; /* reset i to rowbegin, the first term of the current row */
            }
            /* current multiplying column != newB[j].row */
            else if (newB[j].row != B_Col) {
                storesum(C, A_Row, B_Col, &totalC, &sum);
                B_Col = newB[j].row;
                i = rowbegin;
            }
            /* otherwise, in the middle of a row-by-column multiplication */
            else {
                switch (compare(A[i].col, newB[j].col)) {
                case -1:
                    i++;
                    break;
                case 0:
                    sum += (A[i].value * newB[j].value);
                    i++, j++;
                    break;
                case 1:
                    j++;
                }
            }
        }
        for (; A[i].row == A_Row;) i++;
        A_Row = A[i].row;
        rowbegin = i;
    }
}

void storesum(SM* C, int row, int col, int* totalC, int* sum) {
    /* store to C and reset sum when the current row/column pair is finished */
    if (*sum) {
        (*totalC)++;
        C[*totalC].row = row;
        C[*totalC].col = col;
        C[*totalC].value = *sum;
        *sum = 0;
    }
}
It is the part about so called "boundary condition" that makes me
confused with this code. I don't understand why this condition is
needed. Isn't it just fine without these terms??
The matrix multiplication could be computed without including the extra entry, yes, but the function given would not do it correctly, no.
The book says "...these dummy terms serve as sentinels that enable us to obtain an elegant algorithm.."
That's a little vague, I agree. Consider how the algorithm works: for each row of matrix A, it must scan each column of matrix B (== row of matrix newB) to compute one element of the product matrix. But the chosen sparse-matrix representation does not record how many elements there are in each row or column, so the only way to know when you've processed the last element for a given column is to look at the next one in linear element order, and see that it belongs to a new column.
The given code integrates the check for end-of-column and the storage of the resulting element into the processing of the next element, but that leaves a problem: what do you do about the last element in the matrix's element list? It has no following element with which to trigger recording of an element of the result matrix -- at least, not a natural one. That could be solved with some special-case logic, but it is tidier to just add a synthetic extra element that definitely belongs to a different column, so that the end of the matrix no longer constitutes a special case.
I'm not sure I agree that the term "sentinel" is a good fit for this. It's just the opposite of a sentinel in many ways, as a matter of fact. The term normally means a special value that cannot be part of ordinary data, and therefore can be recognized as an end-of-data marker; string terminators are an example. This "sentinel", on the other hand, works by mimicking real data. It is, nevertheless, an extra, artificial element at the end of the list, and in that sense it's not crazy to call it a sentinel.
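The idea generalizes beyond matrices. Here's a minimal, self-contained sketch (Entry and sum_groups are illustrative names, not from the book): summing values grouped by a key, where the caller appends one artificial entry with an impossible key so that the boundary check that flushes every group also flushes the last one:

```c
/* Each entry belongs to a group; entries are sorted by group id. */
typedef struct { int group; int value; } Entry;

/* Sum each group's values, emitting a total whenever the group id changes.
   The caller appends a sentinel entry (value 0, group id matching no real
   group), so the last real group is flushed by the same comparison that
   handles every other group boundary -- no special end-of-list code. */
void sum_groups(const Entry *e, int n_with_sentinel, int *out, int *n_out) {
    int cur = e[0].group, sum = 0, k = 0;
    for (int i = 0; i < n_with_sentinel; i++) {
        if (e[i].group != cur) {  /* boundary: flush the finished group */
            out[k++] = sum;
            cur = e[i].group;
            sum = 0;
        }
        sum += e[i].value;        /* the sentinel adds 0 and is never flushed */
    }
    *n_out = k;
}
```

Without the sentinel, the loop would end while the last group's sum is still pending, and a duplicate flush would be needed after the loop, exactly the special case mmult avoids.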

Comparing elements of an array in C to proceed in the removal of element

I have a problem that I cannot figure out (it probably has an easy solution, but I can't see it).
I have a program that generates all the possible combinations of numbers. The program asks for the size of the set and the size of the subsets, and generates all the possible combinations accordingly. So far so good... now...
I want to write some routines that check for certain conditions in order to eliminate combinations. One of those routines checks the array in order to exclude combinations that exceed a given number of consecutive numbers; for this, the program asks for the maximum run of consecutive numbers allowed. For example:
Size of the set?    10 (stored in n)
Size of the subset? 10 (stored in k)
Maximum of seq num: 10 (stored in maxp)
the array is called comb[] (integer) it is initialized as
for (i = 0; i < k; i++)
    comb[i] = i;
but I have trouble with the routine that excludes certain combinations. The routine is
int todel (int comb[], int k)
{
    int i, j, seq;
    for (i = 0, seq = 0; (i+maxp) < k; i++)
    {
        int j = 0;
        for (j = i; j < maxp; j++)
        {
            fprintf(stderr, "checkin comb %d with comb %d\n", j, j+1);
            if (comb[j] == (comb[j+1] - 1))
            {
                seq++;
            }
            if (seq >= maxp) return 1;
        }
    }
    return 0;
}
If I have a set of 10, a subset of 10, and a max allowed of 10, the program should not need to exclude anything.
But for a set of 10, a subset of 9, and a max allowed of 1, the program should exclude all 10 combinations. As it is, the program allows the following combination:
0,2,3,4,5,6,7,8,9
and it should exclude it: 0,2 does not match the criteria, but 2,3 and all the following pairs do.
Another thing is that if I set the maximum allowed to 0, it takes all combinations as valid instead of none.
I know the fix should not be very hard and I am missing something really dumb.
I hope to get some insight from you (probably insults too).
Thank you!

Comparison between N data

I want to compare N data values of float type. The comparison must be done with a tolerance: if the difference between two values (within the N data) is less than or equal to the tolerance, the two values are considered valid and I keep one of them; otherwise, if the difference is greater than the tolerance, the data is invalid.
Do you have any ideas?
here is my code:
float mytab[N];
int i, j, index = 0;
for (i = 0; i < N-1; i++)
{
    for (j = i+1; j < N; j++)
    {
        if (tab[i].valid && tab[j].valid)
        {
            if (ABS(tab[i]-tab[j]) <= tolerance)
            {
                mytab[index] = tab[i];
                index++;
            }
        }
    }
}
/* afterwards I search for the min value of mytab, which contains
   the valid values within tolerance */
Example:
tolerance = 0.15;
Data: 20.005, 20.017, 21.20, 21.25, 25.75, 25.9, 20.1
In this example, based on the tolerance, we could choose (20.005 OR 20.017 OR 20.1) OR (21.20 OR 21.25).
But based on majority voting, we would choose the 20.x values instead of the 21.x values.
If I understand your basic question, you need to compare two floats with a tolerance. I think you are close with ABS, but you need the floating-point absolute value fabs, available in math.h.
#include <stdio.h>
#include <math.h>

int main (void)
{
    float f1 = 1.00001;
    float f2 = 1.00003;
    float tol = 0.00010;

    if (fabs(f1 - f2) <= tol) {
        puts("Test1: f1 and f2 are equal-ish.");
    } else {
        puts("Test1: f1 and f2 are not equal-ish.");
    }

    tol = 0.0000001;
    if (fabs(f1 - f2) <= tol) {
        puts("Test2: f1 and f2 are equal-ish.");
    } else {
        puts("Test2: f1 and f2 are not equal-ish");
    }
}
Testing
$ cc -g -Wall -O0 -std=c99 -pedantic -o Test test.c && ./Test
Test1: f1 and f2 are equal-ish.
Test2: f1 and f2 are not equal-ish
Please make yourself clearer. Depending on the set of numbers, you can create multiple (different and non-intersecting) subsets that share this property.
If you intend to create the largest subset of values that are within the tolerance range of every single value of the original super set, then it's unique, but you're doing it the wrong way. You should check, for each value in the set, whether it's within the tolerance range of every single other value. Only after checking it against every number can you include it.
Like this:
float mytab[N];
int marker = 1; /* becomes 0 if the number is outside the tolerance range of some other element */
int i, j, index = 0;
for (i = 0; i < N; i++)
{
    marker = 1; /* for every new number, reset the marker */
    for (j = 0; j < N; j++)
    {
        if (tab[i].valid && tab[j].valid)
        {
            if (fabs(tab[i]-tab[j]) > tolerance)
            {
                marker = 0;
            }
        }
    }
    if (marker)
    {
        /* marker is still 1 only if the number is within the tolerance range of every element */
        mytab[index] = tab[i];
        index++;
    }
}
Of course this is very inefficient code. The greatest differences will be between your candidate number and the smallest and largest numbers in the set. So what I would do is sort the list (or simply find the largest and smallest numbers in the set) and compare each element to those two extremes: if an element is within range of both, it is within range of every other element. That's 2 comparisons per number instead of N (or N/2 if you are a bit smarter than the code above, as you tried to be in the first place).
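The shortcut described above might be sketched like this (within_all is an illustrative name; it assumes a plain float array without the valid flag):

```c
#include <math.h>

/* Copy into out every element that is within tolerance of ALL elements
   of tab, and return how many were copied.  An element is within
   tolerance of every element iff it is within tolerance of both the
   minimum and the maximum, so two comparisons per element suffice. */
int within_all(const float *tab, int n, float tolerance, float *out) {
    float lo = tab[0], hi = tab[0];
    int count = 0;
    for (int i = 1; i < n; i++) {       /* find the extremes in one pass */
        if (tab[i] < lo) lo = tab[i];
        if (tab[i] > hi) hi = tab[i];
    }
    for (int i = 0; i < n; i++)
        if (fabsf(tab[i] - lo) <= tolerance && fabsf(hi - tab[i]) <= tolerance)
            out[count++] = tab[i];
    return count;
}
```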

logsumexp implementation in C?

Does anybody know of an open source numerical C library that provides the logsumexp-function?
The logsumexp(a) function computes the logarithm of the sum of exponentials, log(e^{a_1} + ... + e^{a_n}), of the components of the array a, avoiding numerical overflow.
Here's a very simple implementation from scratch (tested, at least minimally):
#include <math.h>
#include <stddef.h>

double logsumexp(double nums[], size_t ct) {
    double max_exp = nums[0], sum = 0.0;
    size_t i;

    /* find the largest argument */
    for (i = 1; i < ct; i++)
        if (nums[i] > max_exp)
            max_exp = nums[i];

    /* sum the arguments scaled down by the largest, then undo the shift */
    for (i = 0; i < ct; i++)
        sum += exp(nums[i] - max_exp);
    return log(sum) + max_exp;
}
This does the trick of effectively dividing all of the arguments by the largest, then adding its log back in at the end to avoid overflow, so it's well-behaved for adding a large number of similarly-scaled values, with errors creeping in if some arguments are many orders of magnitude larger than others.
If you want it to run without crashing when given 0 arguments, you'll have to add a case for that :)
