(can skip this part just an explanation of the code below. my problems are under the code block.)
hi. i'm trying to algro for throttling loop cycles based on how much bandwidth the linux computer is using. i'm reading /proc/net/dev once a second and keeping track of the bytes transmitted in 2 variables. one is the last time it was checked the other is the recent time. from there subtracts the recent one from the last one to calculate how many bytes has been sent in 1 second.
from there i have the variables max_throttle, throttle, max_speed, and sleepp.
the idea is to increase or decrease sleepp depending on bandwidth being used. the less bandwidth the lower the delay and the higher the longer.
i am currently having to problems dealing with floats and ints. if i set all my variables to ints max_throttle becomes 0 always no matter what i set the others to and even if i initialize them.
also even though my if statement says "if sleepp is less then 0 return it to 0" it keeps going deeper and deeper into the negatives then levels out at aroung -540 with 0 bandwidth being used.
and the if(ii & 0x40) is for speed and usage control. in my application there will be no 1 second sleep so this code allows me to limit the sleepp from changing about once every 20-30 iterations. although im also having a problem with it where after the 2X iterations when it does trigger it continues to trigger every iteration after instead of only being true once and then being true again after 20-30 more iterations.
edit:: simpler test cast for my variable problem.
#include <stdio.h>
int main()
{
int max_t, max_s, throttle;
max_s = 400;
throttle = 90;
max_t = max_s * (throttle / 100);
printf("max throttle:%d\n", max_t);
return 0;
}
In C, operator / is an integer division when used with integers only. Therefore, 90/100 = 0. In order to do floating-point division with integers, first convert them to floats (or double or other fp types).
max_t = max_s * (int)(((float)throttle / 100.0)+0.5);
The +0.5 is rounding before converting to int. You might want to consider some standard flooring functions, I don't know your use case.
Also note that the 100.0 is a float literal, whereas 100 would be an intger literal. So, although they seem identical, they are not.
As kralyk pointed out, C’s integer division of 90/100 is 0. But rather than using floats you can work with ints… Just do the division after the multiplication (note the omission of parentheses):
max_t = max_s * throttle / 100;
This gives you the general idea. For example if you want the kind of rounding kralyk mentions, add 50 before doing the division:
max_t = (max_s * throttle + 50) / 100;
Related
I came across some code online in which the PID is implemented for arduino. I am confused of the implementation. I have basic understanding of how PID works, however my source of confusion is why the hexadecimal is being used for m_prevError? what is the value 0x80000000L representing and why is right shifting by 10 when calculating the velocity?
// ServoLoop Constructor
ServoLoop::ServoLoop(int32_t proportionalGain, int32_t derivativeGain)
{
m_pos = RCS_CENTER_POS;
m_proportionalGain = proportionalGain;
m_derivativeGain = derivativeGain;
m_prevError = 0x80000000L;
}
// ServoLoop Update
// Calculates new output based on the measured
// error and the current state.
void ServoLoop::update(int32_t error)
{
long int velocity;
char buf[32];
if (m_prevError!=0x80000000)
{
velocity = (error*m_proportionalGain + (error - m_prevError)*m_derivativeGain)>>10;
m_pos += velocity;
if (m_pos>RCS_MAX_POS)
{
m_pos = RCS_MAX_POS;
}
else if (m_pos<RCS_MIN_POS)
{
m_pos = RCS_MIN_POS;
}
}
m_prevError = error;
}
Shifting a binary number to right by 1 means multiplying its corresponding decimal value by 2. Here shifting by 10 means multiplying by 2^10 which is 1024. As any basic control loop, it could be a gain of the velocity where the returned-back value is converted to be suitable to re-use by any other method.
The L here 0x80000000L is declaring that value as long. So, this value 0x80000000 may be an initial value of error or so. Also, you need to revise the full program to see how things work and what value is assigned to something like error.
Contrary to the other answer, shifting to the right has the effect to divide by a power of two, in this case >> 10 would divide by 1024. But a real division would be better, more clear, and optimized by the compiler with a shift anyway. So I find this shift ugly.
The intent is to implement some float math without actually use floating point numbers - it is a kind of fixed point calculation, where the fractional part is about 10 bits. To understand, assuming to simplify the derivative coefficient=0, an m_proportionalGain set to 1024 would mean 1, while if set to 512 it would mean 0.5. In fact in the case of proportional=1024, and error=100, the formula would give
100*1024 / 1024 = 100
(gain=1), while proportional=512 would give
100*512 / 1024 = 50
(gain=0.5).
As for previous error m_prevError set to 0x80000000, it is simply a special value which is checked in the loop to see if "there is already" a previous error. If not, i.e. if prevError has the special value, the entire loop is skipped once; in other words, it serves the purpose to skip the first update after creation of the object. Not very cleaver I suppose, I would prefer to simply set the previous error equal to 0 and skip completely the check in ::update(). Using special values as flag has the problem that sometimes the calculations result in the special value itself - it would be a big bug. If absolutely needed, it is better to use a true flag.
All in all, I think this is a poor PID algorithm, as it lacks completely the integrative part; it seems that the variable m_pos is thought for this integrative purpose, it is managed quite that way, but never used - only set. Nevertheless this algorithm can work, but all depends on the target system and the wanted performances: on most situations, this algorithm leaves a residual error.
EDIT:
My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.
People always mention:
low bits are not random in rand()%N,
rand()%N is very predictable,
you can use it for games but not for cryptography
Can someone explain if any of these points are the case here and how to see that?
The idea of the non-randomness of the lower bits is something that should make the PE of the two cases that I show differ, but it's not the case.
I guess many like me would always avoid using rand(), or rand()%N because we've been always taught that it is pretty bad. I was curious to see how "wrong" random integers generated with c rand()%N effectively are. This is also a follow up to Ryan Reich's answer in How to generate a random integer number from within a range.
The explanation there sounds very convincing, to be honest; nevertheless, I thought I’d give it a try. So, I compare the distributions in a VERY naive way. I run both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of histograms, so I just computed histograms and, just by looking, I would say they both look just as uniform. Regarding the other point that was raised, about the actual randomness (despite being uniformly distributed). I — again naively —compute the permutation entropy for these runs, which are the same for both sample sets, which tell us that there's no difference between both regarding the ordering of the occurrence.
So, for many purposes, it seems to me that rand()%N would be just fine, how can we see their flaws?
Here I show you a very simple, inefficient and not very elegant (but I think correct) way of computing these samples and get the histograms together with the permutation entropies.
I show plots for domains (0,i) with i in {5,10,25,50,100} for different number of samples:
There's not much to see in the code I guess, so I will leave both the C and the matlab code for replication purposes.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int main(int argc, char *argv[]){
unsigned long max = atoi(argv[2]);
int samples=atoi(argv[3]);
srand(time(NULL));
if(atoi(argv[1])==1){
for(int i=0;i<samples;++i)
printf("%ld\n",rand()%(max+1));
}else{
for(int i=0;i<samples;++i){
unsigned long
num_bins = (unsigned long) max + 1,
num_rand = (unsigned long) RAND_MAX + 1,
bin_size = num_rand / num_bins,
defect = num_rand % num_bins;
long x;
do {
x = rand();
}
while (num_rand - defect <= (unsigned long)x);
printf("%ld\n",x/bin_size);
}
}
return 0;
}
And here is the Matlab code to plot this and compute the PEs (the recursion for the permutations I took it from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):
system('gcc randomTest.c -o randomTest.exe;');
max = 100;
samples = max*10000;
trials = 200;
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1=load('file1');
a2=load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
as=[a1,a2];
ns=3:8;
H = nan(numel(ns),size(as,2));
for op=1:size(as,2)
x = as(:,op);
for n=ns
sequenceOcurrence = zeros(1,factorial(n));
sequences = myperms(1:n);
sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
for i=1:numel(x)-n
[~,sequenceOrder] = sort(x(i:i+n-1));
out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
end
chunks = length(x) - n + 1;
ps = sequenceOcurrence/chunks;
hh = sum(ps(logical(ps)).*log2(ps(logical(ps))));
H(n,op) = hh/log2(factorial(n));
end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples) ];
export_fig(filename)
Due to the way modulo arithmetic works if N is significant compared to RAND_MAX doing %N will make it so you're considerably more likely to get some values than others. Imagine RAND_MAX is 12, and N is 9. If the distribution is good then the chances of getting one of 0, 1, or 2 is 0.5, and the chances of getting one of 3, 4, 5, 6, 7, 8 is 0.5. The result being that you're twice as likely to get a 0 instead of a 4. If N is an exact divider of RAND_MAX this distribution problem doesn't happen, and if N is very small compared to RAND_MAX the issue becomes less noticeable. RAND_MAX may not be a particularly large value (maybe 2^15 - 1), making this problem worse than you may expect. The alternative of doing (rand() * n) / (RAND_MAX + 1) also doesn't give an even distribution, however, it will be every mth value (for some m) that will be more likely to occur rather than the more likely values all being at the low end of the distribution.
If N is 75% of RAND_MAX then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as this is where the extra values map to)
The quality of rand() will depend on the implementation of the system that you're on. I believe that some systems have had very poor implementation, OS Xs man pages declare rand obsolete. The Debian man page says the following:
The versions of rand() and srand() in the Linux C Library use the same
random number generator as random(3) and srandom(3), so the lower-order
bits should be as random as the higher-order bits. However, on older
rand() implementations, and on current implementations on different
systems, the lower-order bits are much less random than the higher-
order bits. Do not use this function in applications intended to be
portable when good randomness is needed. (Use random(3) instead.)
Both approaches have their pitfalls, and your graphs are little more than a pretty verification of the central limit theorem! For a sensible implementation of rand():
% N suffers from a "pigeon-holing" effect if 1u + RAND_MAX is not a multiple of N
/((RAND_MAX + 1u)/N) does not, in general, evenly distribute the return of rand across your range, due to integer truncation effects.
On balance, if N is small cf. RAND_MAX, I'd plump for % for its tractability. In any case test your generator to see it it has the appropriate statistical properties for your application.
rand() % N is considered extremely poor not because the distribution is bad, but because the randomness is poor-to-nonexistent. (If anything the distribution will be too good.)
If N is not small with respect to RAND_MAX, both
rand() % N
and
rand() / (RAND_MAX / N + 1)
will have more or less the same, poor distribution -- certain values will occur with significantly higher probability than others.
Looking at distribution histograms won't show you that for some implementations, rand() % N has a much, much worse problem -- to show that you'd have to perform some correlations with previous values. (For example, try taking rand() % 2, then subtracting from the previous value you got, and plotting a histogram of the differences. If the difference is never 0, you've got a problem.)
I would like to say that the implementations for which rand()'s low-order bits aren't random are simply buggy. I'd like to think that all those buggy implementations would have disappeared by now. I'd like to think that programmers shouldn't have to worry about calling rand()%N any more. But, unfortunately, my wishes don't change the fact that this seems to be one of those bugs that never get fixed, meaning that programmers do still have to worry.
See also the C FAQ list, question 13.16.
I am trying to optimize a code in C, specificly a critical loop which takes almost 99.99% of total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
# pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
for(j = 1; j <= NTOT; j++){
if(j == i) continue;
dx = (X[j][0]-X[i][0])*a;
dy = (X[j][1]-X[i][1])*a;
d = sqrt(dx*dx+dy*dy);
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
E = dS[0]*dx+dS[1]*dy;
F = spin[2*j-2]*dx+spin[2*j-1]*dy;
G = -3*(D/(d*d*d*d*d))*E*F;
dU += (V+G);
}
}
All variables are local. The loop takes 0.7 second for NTOT=3600 which is a large amount of time, especially when I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is if there are other things to be optimized in this loop?
My computer's processor is an Intel core i5 with 4 CPU(4X1600Mhz) and 3072K L3 cache.
Optimize for hardware or software?
Soft:
Getting rid of time consuming exceptions such as divide by zeros:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack , Terje Mathisen and Gary Tarolli 's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of division too.
float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;
with sacrificing some precision.
There is another division also to be gotten rid of:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what ?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups especially when is is many to many O(N*N) algorithm. Such as:
get 256 particles.
compute 256 x 256 relations.
save 256 results on variables.
select another 256 particles as target(saving the first 256 group in place)
do same calculations but this time 1st group vs 2nd group.
save first 256 results again.
move to 3rd group
repeat.
do same until all particles are versused against first 256 particles.
Now get second group of 256.
iterate until all 256's are complete.
Your CPU has big cache so you can try 32k particles versus 32k particles directly. But L1 may not be big so I would stick with 512 vs 512(or 500 vs 500 to avoid cache line ---> this is going to be dependent on architecture) if I were you.
Hard:
SSE, AVX, GPGPU, FPGA .....
As #harold commented, SSE should be start point to compare and you should vectorize or at least parallelize through 4-packed vector instructions which have advantage of optimum memory fetching ability and pipelining. When you need 3x-10x more performance(on top of SSE version using all cores), you will need an opencl/cuda compliant gpu(equally priced as i5) and opencl(or cuda) api or you can learn opengl too but it seems harder(maybe directx easier).
Trying SSE is easiest, should give 3x faster than the fast inverse I mentionad above. An equally priced gpu should give another 3x of SSE at least for thousands of particles. Going or over 100k particles, whole gpu can achieve 80x performance of a single core of cpu for this type of algorithm when you optimize it enough(making it less dependent to main memory). Opencl gives ability to address cache to save your arrays. So you can use terabytes/s of bandwidth in it.
I would always do random pausing
to pin down exactly which lines were most costly.
Then, after fixing something I would do it again, to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-1(and 2)] look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer can remove it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
I cannot believe that 3600 iterations of an O(1) loop can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking if optimization is enabled, and how long threads spawning takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
Minor optimisations:
(d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5
Modify 2 * x by x<<1;
I have several variables listed below:
int cpu_time_b = 6
float clock_cycles_a = 2 * pow(10, 10));
float cpi_a = 2.0;
int cycle_time_a = 250;
float cpi_b = 1.2;
int cycle_time_b = 500
I am working out the clock rate of b with the following calculation:
(((1.2*clock_cycles_a)/cpu_time_b)/(1 * pow(10, 9)))
Clearly the answer should be 4 however my program is outputting 6000000204800000000.0 as the answer
I think that overflow is possibly happening here. Is this the case and if so, how could I fix the problem?
All calculations should be made to ensure comparable numbers are "reduced" together. in your example, it seems like only
cpu_time_b
is truly variable (undefined in the scope of your snippet. All other variables appears as constants. All constants should be computed before compilation especially if they are susceptible to cause overflow.
clock_cycles_a
cancels the denominator. pow is time consuming (may not be critical here) and not always that precise. You multiply the 2 explicitly when you declare clock_cycles_a and then use 1.2 below. etc. Reducing the whole thing keeping only the actual variable becomes:
24.0/cpu_time_b
which makes me deduce that cpu_time_b should be 6?
Finaly, while you write the equation, we have no idea of what you do with the result. Store it in the wrong variable type? printf with the wrong format? etc?
For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data.
Now I have problem implementing the algorithm because of the limitations in the C's datatype.
( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering )
PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 .... pn are probabilities of word-1, 2, 3 ... n. The probability of doc being spam or not is calculated using
Here, probability value can be very easily around 0.01. So even if I use datatype "double" my calculation will go for a toss. To confirm this I wrote a sample code given below.
#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99)
int main()
{
int index;
long double numerator = 1.0;
long double denom1 = 1.0, denom2 = 1.0;
long double doc_spam_prob;
/* Simulating FEW unlikely spam words */
for(index = 0; index < 162; index++)
{
numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
}
/* Simulating lot of mostly definite spam words */
for (index = 0; index < 1000; index++)
{
numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD);
}
doc_spam_prob= (numerator/(denom1+denom2));
return 0;
}
I tried Float, double and even long double datatypes but still same problem.
Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!.
This is the first time I am hitting such issue. So how exactly should this problem be tackled?
This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, resp.
SO I decided to do the math,
The original equation is:
I slightly modify it:
Taking logs on both sides:
Let,
Substituting,
Hence the alternate formula for computing the combined probability:
If you need me to expand on this, please leave a comment.
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let`s expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you will obtain a product of quite large numbers (between 0 and, for p_i = 0.01, 99). The idea is, not to multiply tons of small numbers with one another, to obtain, well, 0, but to make a quotient of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the above method will give you 0/(0+0) which is NaN, whereas the proposed method will give you 1/(1+1*...1), which is 0.5.
You can get even better results, when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n), then the following formula will get even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
that way you devide big numerators (small p_i) with big denominators (big p_(n+1-i)), and small numerators with small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
Your problem is caused because you are collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.
So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you'd multiply them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >=1.0, you take a term from the front, and when your intemediate result is < 1.0 you take a result from the back.
The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so it the intermediate result will slowly approach the final result.
There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
Try computing the inverse 1/p. That gives you an equation of the form 1 + 1/(1-p1)*(1-p2)...
If you then count the occurrence of each probability--it looks like you have a small number of values that recur--you can use the pow() function--pow(1-p, occurences_of_p)*pow(1-q, occurrences_of_q)--and avoid individual roundoff with each multiplication.
You can use probability in percents or promiles:
doc_spam_prob= (numerator*100/(denom1+denom2));
or
doc_spam_prob= (numerator*1000/(denom1+denom2));
or use some other coefficient
I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:
http://www.nongnu.org/hpalib/
and
http://www.tc.umn.edu/~ringx004/mapm-main.html