Perfect Power detection in linear time - c

I'm trying to write a C program which, given a positive integer n (> 1), detects whether there exist integers x and r such that n = x^r.
This is what I did so far:
while (c>=d) {
    double y = pow(sum, 1.0/d);
    if (floor(y) == y) {
        out = y;
        break;
    }
    d++;
}
In the program above, "c" is the maximum value for the exponent (r) and "d" starts at 2. "sum" is the value to be checked, and the variable "out" stores the root so it can be output later on. Basically, the program checks whether the square root of sum is an integer: if not, it tries the cube root, and so on. When it finds one, it stores the value of y in "out", so that sum = out^d.
My question is, is there any more efficient way to find these values? I found some documentation online, but that's far more complicated than my high-school algebra. How can I implement this in a more efficient way?
Thanks!

In one of your comments, you state you want this to be compatible with gigantic numbers. In that case, you may want to bring in the GMP library, which supports operations on arbitrarily large numbers, one of those operations being checking if it is a perfect power.
It is open source, so you can check out the source code and see how they do it, if you don't want to bring in the whole library.

If n fits in a fixed-size (e.g. 32-bit) integer variable, the optimal solution is probably just hard-coding the list of such numbers and binary-searching it. Keep in mind, in int range, there are roughly
sqrt(INT_MAX) perfect squares
cbrt(INT_MAX) perfect cubes
etc.
In 32 bits, that's at most 65536 + 2048 + 256 + 128 + 64 + ... < 70000 (with each root rounded up to a power of two).
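A minimal sketch of the table idea, assuming the list is built once at startup rather than literally hard-coded. The names power_table, build_power_table and is_perfect_power_u32 are made up for this illustration, and the capacity of 70000 comes from the estimate above. Duplicates such as 64 = 8^2 = 4^3 = 2^6 are left in the table because they do not affect the binary search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_CAP 70000u  /* upper bound taken from the estimate above */

static uint32_t power_table[TABLE_CAP];
static size_t power_count = 0;

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Collect every x^r with x >= 2, r >= 2 that fits in 32 bits, then sort. */
static void build_power_table(void) {
    for (uint64_t x = 2; x * x <= UINT32_MAX; x++) {
        for (uint64_t p = x * x; p <= UINT32_MAX; p *= x) {
            assert(power_count < TABLE_CAP);
            power_table[power_count++] = (uint32_t)p;
        }
    }
    qsort(power_table, power_count, sizeof power_table[0], cmp_u32);
}

/* Returns 1 if n == x^r for some integers x >= 2 and r >= 2, else 0. */
int is_perfect_power_u32(uint32_t n) {
    if (power_count == 0)
        build_power_table();  /* lazy one-time initialisation */
    return bsearch(&n, power_table, power_count,
                   sizeof power_table[0], cmp_u32) != NULL;
}
```

After the first call, each query costs one binary search over fewer than 70000 entries, i.e. about 17 comparisons.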

You need the base-r logarithm; use an identity to calculate it from the natural log.
So:
log_r(x) = log(x)/log(r)
So you need to calculate:
x = log(n)/log(r)
(In my neck of the woods, this is high-school math. Which immediately explains my having to look up whether I remembered that identity correctly :))

After calculating y in
double y = pow(sum, 1.0/d);
you can take the nearest integer to it and use your own integer power function to check for equality with sum:
int x = (int)(y+0.5);
int a = your_power_func(x,d);
if (a == sum)
    break;
I guess this way you can confirm whether a number is an integer power of some other number or not.

Related

Finding whether an interval contains at least one integer without math.h

For a class project I need to split some audio clips into smaller sections, for which we are given a min length and a max length. To figure out whether this is possible, I do the following:
a = length/max
b = length/min
Mathematically, I figured that [a,b] contains at least one integer if ⌊b⌋ >= ⌈a⌉, but I can't use math.h for floor() and ceil(). Since a and b are always positive I can use type casting for floor(), but I am at a loss as to how to do ceil(). I thought about using ((int)x)+1, but that would also round exact integers up, which would break the formula.
I would like either a way to do ceil() which would solve my problem, or another way to check whether an interval contains at least one integer.
You don't need math.h to perform floor. Please look at the following code:
int length=5,min=2,max=3; // only an example of inputs.
int a = length/max;
int b = length/min;
if(a!=b){
    //there is at least one integer in the interval.
}else{
    if(length % min==0 || length % max==0 ){
        //there is at least one integer in the interval.
    }else{
        //there is no integer in the interval.
    }
}
The result for the above example will be that there is an integer in the interval.
You can also perform ceil without using math.h as follows:
int a;
if(length % max == 0){
    a = length / max;
}else{
    a = (length / max) + 1;
}
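Putting the integer floor and ceil together, the condition ⌊b⌋ >= ⌈a⌉ from the question can be sketched as one function. The names ceil_div and interval_has_integer are invented for this example, and it assumes all three inputs are positive integers with min <= max.

```c
#include <assert.h>

/* Ceiling of x / y for positive integers, without math.h. */
static int ceil_div(int x, int y) {
    return x / y + (x % y != 0);
}

/* Returns 1 if [length/max, length/min] contains at least one integer. */
int interval_has_integer(int length, int min, int max) {
    assert(length > 0 && min > 0 && max >= min);
    int floor_b = length / min;          /* integer division truncates */
    int ceil_a = ceil_div(length, max);
    return floor_b >= ceil_a;
}
```

For length=5, min=2, max=3 the interval is roughly [1.67, 2.5], floor_b and ceil_a are both 2, and the function returns 1.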
If I understood your question right, you can compute ceil(a) and then check whether the result is less than or equal to b. For example, for the interval [1.3, 3.5], ceil(1.3) returns 2, which lies within the interval.
UPD
You could also compute (b - a). If it's > 1, there is certainly at least one integer between them (this test is sufficient but not necessary).
There is a general trick in programming that will come in handy if you ever find yourself programming Apple Basic, or any other language where floating point math is supported.
You can "round" a number by addition, then truncation, as follows:
x = some floating value
rounded_x = int(x + roundoff_amount)
Where roundoff_amount is the difference between the lowest fraction to round up, and 1.
So, to round at .5, your round_off would be 1 - .5 = .5, and you would do int(x + .5). If x is .5 or .51 then the result becomes 1.0 or 1.01 and int() takes that to 1. Obviously, if x is higher, then you still get rounded to 1, until x becomes 1.5 when rounding takes it to 2. To round upwards starting at .6, your roundoff amount would be 1 - .6 = .4, and you would do int(x + .4), etc.
You can do a similar thing to get ceil behavior. Set your roundoff_amount to be 0.99999... and do the round. You can choose your value to provide a "nearby" window, since floats have some inaccuracy inherent that might prevent getting a perfectly integer value after adding fractions.
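In C the add-then-truncate trick might look like the sketch below. The function name round_at and its threshold parameter are made up for illustration, and it assumes x is non-negative, since a cast to int truncates toward zero.

```c
#include <assert.h>

/* Rounds x up once its fractional part reaches `threshold`, else down.
 * threshold = 0.5 gives ordinary rounding; a tiny threshold such as
 * 0.00001 approximates ceil() with a small tolerance window. */
int round_at(double x, double threshold) {
    assert(x >= 0.0 && threshold > 0.0 && threshold <= 1.0);
    double roundoff_amount = 1.0 - threshold;
    return (int)(x + roundoff_amount);   /* the cast truncates */
}
```

With threshold 0.75, for example, 2.5 stays at 2 while 2.75 becomes 3.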

How do you use bitwise operators, masks, to find if a number is a multiple of another number?

So I have been told that this can be done and that bitwise operations and masks can be very useful but I must be missing something in how they work.
I am trying to calculate whether a number, say x, is a multiple of y. If x is a multiple of y great end of story, otherwise I want to increase x to reach the closest multiple of y that is greater than x (so that all of x fits in the result). I have just started learning C and am having difficulty understanding some of these tasks.
Here is what I have tried but when I input numbers such as 5, 9, or 24 I get the following respectively: 0, 4, 4.
if(x&(y-1)){ //if not 0 then multiple of y
    x = x&~(y-1) + y;
}
Any explanations, examples of the math that is occurring behind the scenes, are greatly appreciated.
EDIT: So to clarify, I somewhat understand the shifting of bits to get whether an item is a multiple. (As was explained in a reply 10100 is a multiple of 101 as it is just shifted over). If I have the number 16, which is 10000, its complement is 01111. How would I use this complement to see if an item is a multiple of 16? Also can someone give a numerical explanation of the code given above? Showing this may help me understand why it does not work. Once I understand why it does not work I will be able to problem solve on my own I believe.
Why would you even think about using bit-wise operations for this? They certainly have their place but this isn't it.
A better method is to simply use something like:
unsigned multGreaterOrEqual(unsigned x, unsigned y) {
    if ((x % y) == 0)
        return x;
    return (x / y + 1) * y;
}
In the trivial cases, multiplying a number by a power of 2 just shifts it to the left (this doesn't apply when the shift could alter the sign bit).
For example
10100
is 4 times
101
and
10100
is 2 times
1010
As for other multiples, they would have to be found by combining the outputs of several shifts. You might want to look up some primitive means of computer division, where division looks roughly like
x = a / b
implemented like
buffer = a
x = 0
while buffer is at least as big as b; do
    subtract b from buffer
    add 1 to x
done
Faster routines try to figure out the higher place values first, skipping lots of subtractions. All of these routines can be done bitwise, but it is a big pain. In the ALU these routines are done bitwise. You might want to look at a digital logic design book for more ideas.
Ok, so I have discovered what the error was in my code and since the majority say that it is impossible to calculate whether a number is a multiple of another number using masks I figured I would share what I have learned.
It is possible! - if you are using the correct data types that is.
The code given above works if y is declared as a constant unsigned long, since the x being passed in was also an unsigned long. The key point is not the long or constant part but that the number is unsigned. The sign bit causes miscalculation, as the first place in the number indicates the sign, and when performing bitwise operations signs can get muddled.
So here is my code if we are looking for multiples of 16:
const unsigned long y = 16; //declared globally in my case
Then an unsigned long is passed to the function which runs the following code:
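if(x&(y-1)){ //if not 0 then x is not a multiple of y
    x = (x&~(y-1)) + y; //parentheses needed: + binds tighter than &
}
x will now be the nearest multiple of 16 that is greater than or equal to the original x. (The mask trick relies on y being a power of 2.)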
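A small self-contained sketch of the mask approach, assuming y is a power of 2 (the function name round_up_to_multiple is made up here). With y = 16 the mask y-1 is binary 01111, so x & mask is x modulo 16, and x & ~mask clears the low four bits, i.e. rounds x down to a multiple of 16.

```c
#include <assert.h>

/* Rounds x up to the nearest multiple of y, where y is a power of 2. */
unsigned long round_up_to_multiple(unsigned long x, unsigned long y) {
    assert(y != 0 && (y & (y - 1)) == 0);  /* y must be a power of 2 */
    unsigned long mask = y - 1;            /* e.g. 16 -> 0b01111 */
    if (x & mask)                          /* non-zero remainder: not a multiple */
        x = (x & ~mask) + y;               /* round down, then add one step */
    return x;
}
```

For x = 24 (binary 11000) and y = 16: x & mask is 8, so x is not a multiple; x & ~mask is 16, and adding y gives 32.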

Deterministic bit scrambling to filter coordinates

I am trying to write a function that, given an (x,y) coordinate pair and the random seed of the program, will pseudo-randomly return true for some preset percentage of all such pairs. There are no limits on x or y beyond the restrictions of the data type, which is a 32-bit signed int.
My current approach is to scramble the bits of x, y, and the seed together and then compare the resulting number to the percentage:
float percentage = 0.005;
...
unsigned int n = (x ^ y) ^ seed;
return (((float) n / UINT_MAX) < percentage);
However, it seems that this approach would be biased for certain values of x and y. For example, if it returns true for (0,a), it will also return true for (a,0).
I know this implementation that just XORs them together is naive. Is there a better bit-scrambling algorithm to use here that will not be biased?
Edit: To clarify, I am not starting with a set of (x,y) coordinates, nor am I trying to get a fixed-size set of coordinates that evaluate to true. The function should be able to evaluate a truth value for arbitrary x, y, and seed, with the percentage controlling the average frequency of "true" coordinates.
The easy solution is to use a good hashing algorithm. You can do the range check on the value of hash(seed || x || y).
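As a sketch of the hash-then-range-check idea in C: the mixer below uses the finalizer constants from the public-domain SplitMix64 generator as a stand-in for "a good hashing algorithm". That choice, and the names mix64 and coord_selected, are assumptions made for this example rather than a recommendation.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit bit mixer (SplitMix64 finalizer). */
static uint64_t mix64(uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Returns 1 for roughly `percentage` of all (x, y) pairs, deterministically
 * for a given seed. */
int coord_selected(int32_t x, int32_t y, uint64_t seed, double percentage) {
    assert(percentage >= 0.0 && percentage <= 1.0);
    /* Pack x and y into separate halves so (a, b) and (b, a) differ. */
    uint64_t key = ((uint64_t)(uint32_t)x << 32) | (uint32_t)y;
    uint64_t h = mix64(mix64(seed + 0x9e3779b97f4a7c15ULL) ^ key);
    /* Use the top 53 bits to form a double in [0, 1). */
    double u = (double)(h >> 11) * (1.0 / 9007199254740992.0);
    return u < percentage;
}
```

Packing x and y into different halves of the key avoids the symmetry of the XOR version, where (0,a) and (a,0) always give the same result.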
Of course, selecting points individually with percentage p does not guarantee that you will end up with a sample whose size will be exactly p * N. (That's the expected size of the sample, but any given sample will deviate a bit.) If you want to get a sample of size precisely k from a universe of N objects, you can use the following simple algorithm:
Examine the elements of the universe one at a time until k reaches 0.
When examining element i, add it to the sample if its hash value mapped onto the range [0, N-i) is less than k. If you add the element to the sample, decrement k.
There's no way to get the arithmetic absolutely perfect (since there is no way to perfectly partition 2^i different hash values into n buckets unless n is a power of 2), so there will always be a tiny bias. (Floating point arithmetic does not help; the number of possible floating point values is also fixed, and suffers from the same bias.)
If you do 64-bit arithmetic, the bias will be truly tiny, but the arithmetic is more complicated unless your environment provides a 128-bit multiply. So you might feel satisfied with 32-bit computations, where the bias of one in a couple of thousand million [Note 1] doesn't matter. Here, you can use the fact that any 32 bits in your hash should be as unbiased as any other 32 bits, assuming your hash algorithm is any good (see below). So the following check should work fine:
// I need k elements from a remaining universe of n, and I have a 64-bit hash.
// Return true if I should select this element
bool select(uint32_t n, uint32_t k, uint64_t hash) {
    // Map the low 32 bits of the hash onto [0, n) and compare with k.
    return (((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32) < k;
}
// Untested example sampler
// select exactly k elements from U, using a seed value
std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) {
    std::vector<E> retval;
    // n counts the elements not yet examined, including the current one.
    for (uint32_t n = U.size(); k && n; --n) {
        const E& elt = U[n - 1];
        if (select(n, k, hash_function(seed, elt))) {
            retval.push_back(elt);
            --k;
        }
    }
    return retval;
}
Assuming you need to do this a lot, you'll want to use a fast hash algorithm; since you're not actually working in a secure environment, you don't need to worry about whether the algorithm is cryptographically secure.
Many high-speed hashing algorithms work on 64-bit units, so you could maximize the speed by constructing a 128-bit input consisting of a 64-bit seed and the two 32-bit co-ordinates. You can then unroll the hash loop to do exactly two blocks.
I won't venture a guess at the best hash function for your purpose. You might want to check out one or more of these open-source hashing functions:
Farmhash https://code.google.com/p/farmhash/
Murmurhash https://code.google.com/p/smhasher/
xxhash https://code.google.com/p/xxhash/
siphash https://github.com/majek/csiphash/
... and many more.
Notes
1. A couple of billion, if you're on that side of the Atlantic.
I would prefer feeding seed, x, and y through a Combined Linear Congruential Generator.
This is generally much faster than hashing, and it is designed specifically for the purpose: To output a pseudo-random number uniformly in a certain range.
Using coefficients recommended by Wichmann-Hill (which are also used in some versions of Microsoft Excel) we can do:
si = 171 * s % 30269;
xi = 172 * x % 30307;
yi = 170 * y % 30323;
r_combined = fmod(si/30269. + xi/30307. + yi/30323., 1.);
return r_combined < percentage;
Where s is the seed on the first call, and the previous si on each subsequent call. (Thanks to rici's comment for this point.)

Generating random number in sorted order

I want to generate random numbers in sorted order.
I wrote the code below:
void CreateSortedNode(pNode head)
{
    int size = 10, last = 0;
    pNode temp;
    while(size-- > 0) {
        temp = (pNode)malloc(sizeof(struct node));
        last += (rand()%10);
        temp->data = last;//randomly generate number in sorted order
        list_add(temp);
    }
}
[EDIT:]
I expect the numbers to be generated in increasing or decreasing order, e.g. {2, 5, 9, 23, 45, 68}
int main()
{
    int size = 10, last = 0;
    while(size-- > 0) {
        last += (rand()%10);
        printf("%4d",last);
    }
    return 0;
}
Any better idea?
Solved back in 1979 (by Bentley and Saxe at Carnegie-Mellon):
https://apps.dtic.mil/dtic/tr/fulltext/u2/a066739.pdf
The solution is ridiculously compact in terms of code too!
Their paper is in Pascal, I converted it to Python so it should work with any language:
from random import random
cur_max=100 #desired maximum random number
n=100 #size of the array to fill
x=[0]*(n) #generate an array x of size n
for i in range(n,0,-1):
    cur_max=cur_max*random()**(1/i) #the magic formula
    x[i-1]=cur_max
print(x) #the results
Enjoy your sorted random numbers...
Without any information about sample size or sample universe, it's not easy to know if the following is interesting but irrelevant or a solution, but since it is in any case interesting, here goes.
The problem:
In O(1) space, produce an unbiased ordered random sample of size n from an ordered set S of size N: <S_1, S_2, …, S_N>, such that the elements in the sample are in the same order as the elements in the ordered set.
The solution:
1. With probability n/|S|, do the following: add S_1 to the sample and decrement n.
2. Remove S_1 from S.
3. Repeat steps 1 and 2, each time with the new first element (and size) of S until n is 0, at which point the sample will have the desired number of elements.
The solution in python:
from random import randrange
# select n random integers in order from range(N)
def sample(n, N):
    # insist that 0 <= n <= N
    for i in range(N):
        if randrange(N - i) < n:
            yield i
            n -= 1
            if n <= 0:
                break
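Since the question is about C, the same selection loop might look like this in C. This is only a sketch: sample_sorted is a made-up name, and rand() % (N - i) is used for brevity even though it has a slight modulo bias and only works well while N stays below RAND_MAX.

```c
#include <assert.h>
#include <stdlib.h>

/* Writes n distinct integers from [0, N) into out[] in increasing order
 * and returns how many were written (always n when n <= N). */
size_t sample_sorted(size_t n, size_t N, size_t *out) {
    assert(n <= N && (n == 0 || out != NULL));
    size_t count = 0;
    for (size_t i = 0; i < N && n > 0; i++) {
        /* Select i with probability n / (N - i): n picks are still
         * needed from the N - i candidates that remain. */
        if ((size_t)rand() % (N - i) < n) {
            out[count++] = i;
            n--;
        }
    }
    return count;
}
```

When the number of remaining candidates equals the number still needed, the test always succeeds, which is why the sample can never come up short.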
The problem with the solution:
It takes O(N) time. We'd really like to take O(n) time, since n is likely to be much smaller than N. On the other hand, we'd like to retain the O(1) space, in case n is also quite large.
A better solution (outline only)
(The following is adapted from a 1987 paper by Jeffrey Scott Vitter, "An Efficient Algorithm for Sequential Random Sampling". See Dr. Vitter's publications page. Please read the paper for the details.)
Instead of incrementing i and selecting a random number, as in the above python code, it would be cool if we could generate a random number according to some distribution which would be the number of times that i will be incremented without any element being yielded. All we need is the distribution (which will obviously depend on the current values of n and N.)
Of course, we can derive the distribution precisely from an examination of the algorithm. That doesn't help much, though, because the resulting formula requires a lot of time to compute accurately, and the end result is still O(N).
However, we don't always have to compute it accurately. Suppose we have some easily computable, reasonably good approximation which consistently underestimates the probabilities (with the consequence that it will sometimes not make a prediction). If that approximation works, we can use it; if not, we'll need to fall back to the accurate computation. If that happens sufficiently rarely, we might be able to achieve O(n) on average. And indeed, Dr. Vitter's paper shows how to do this. (With code.)
Suppose you wanted to generate just three random numbers, x, y, and z so that they are in sorted order x <= y <= z. You will place these in some C++ container, which I'll just denote as a list like D = [x, y, z], so we can also say that x is component 0 of D, or D_0 and so on.
For any sequential algorithm that first draws a random value for x, let's say it comes up with 2.5, then this tells us some information about what y has to be, namely y >= 2.5.
So, conditional on the value of x, your desired random number algorithm has to satisfy the property that p(y >= x | x) = 1. If the distribution you are drawing from is anything like a common distribution, like uniform or Gaussian, then it's clear that usually p(y >= x) would be some other expression involving the density for that distribution. (In fact, only a pathological distribution like a Dirac Delta at "infinity" could be independent, and would be nonsense for your application.)
So what we can speculate with great confidence is that p(y >= t | x) for various values of t is not equal to p(y >= t). That's the definition for dependent random variables. So now you know that the random variable y (second in your eventual list) is not statistically independent of x.
Another way to state it is that in your output data D, the components of D are not statistically independent observations. And in fact they must be positively correlated since if we learn that x is bigger than we thought, we also automatically learn that y is bigger than or equal to what we thought.
In this sense, a sequential algorithm that provides this kind of output is an example of a Markov Chain. The probability distribution of a given number in the sequence is conditionally dependent on the previous number.
If you really want a Markov Chain like that (I suspect that you don't), then you could instead draw a first number at random (for x) and then draw positive deltas, which you will add to each successive number, like this:
Draw a value for x, say 2.5
Draw a strictly positive value for y-x, say 13.7, so y is 2.5 + 13.7 = 16.2
Draw a strictly positive value for z-y, say 0.001, so z is 16.201
and so on...
You just have to acknowledge that the components of your result are not statistically independent, and so you cannot use them in an application that relies on statistical independence assumptions.

Problem with Precision floating point operation in C

For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data.
Now I have problem implementing the algorithm because of the limitations in the C's datatype.
( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering )
PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p_1, p_2, p_3, ..., p_n are the probabilities of word 1, 2, 3, ..., n, the probability of the doc being spam or not is calculated using
p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1 - p_1) * (1 - p_2) * ... * (1 - p_n))
Here, a probability value can very easily be around 0.01. So even if I use the datatype "double" my calculation will go for a toss. To confirm this I wrote the sample code given below.
#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99)
int main()
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;
    /* Simulating FEW unlikely spam words */
    for(index = 0; index < 162; index++)
    {
        numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating lot of mostly definite spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob= (numerator/(denom1+denom2));
    return 0;
}
I tried Float, double and even long double datatypes but still same problem.
Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!.
This is the first time I am hitting such issue. So how exactly should this problem be tackled?
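This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, respectively.
So I decided to do the math.
The original equation is:
p = (p_1 * ... * p_n) / (p_1 * ... * p_n + (1 - p_1) * ... * (1 - p_n))
I slightly modify it:
1/p - 1 = ((1 - p_1) * ... * (1 - p_n)) / (p_1 * ... * p_n)
Taking logs on both sides:
ln(1/p - 1) = sum_{i=1..n} [ln(1 - p_i) - ln(p_i)]
Let
eta = sum_{i=1..n} [ln(1 - p_i) - ln(p_i)]
Substituting,
1/p - 1 = e^eta
Hence the alternate formula for computing the combined probability:
p = 1 / (1 + e^eta)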
If you need me to expand on this, please leave a comment.
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let's expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you will obtain a product of quite large numbers (between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another and obtain, well, 0, but to form quotients of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the original formula will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1+1*...*1), which is 0.5.
You can get even better results when all p_i are sorted and you pair them up in opposite order (let's assume p_1 < ... < p_n); then the following formula gives even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
That way you divide big numerators (small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
Your problem is caused because you are collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.
So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you multiplied them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.
The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.
There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
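The two-iterator scheme described above can be sketched in C roughly as follows. The function name spam_probability_balanced, the error value -1.0 and the use of qsort are choices made for this illustration, and every p_i is assumed to lie strictly between 0 and 1.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Computes p = 1 / (1 + prod((1 - p_i) / p_i)), multiplying small and
 * large terms alternately so the running product stays near 1.0.
 * Returns -1.0 if memory allocation fails. */
double spam_probability_balanced(const double *p, size_t n) {
    if (n == 0)
        return 0.5;                       /* empty product is 1 */
    double *terms = malloc(n * sizeof *terms);
    if (terms == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++) {
        assert(p[i] > 0.0 && p[i] < 1.0);
        terms[i] = (1.0 - p[i]) / p[i];
    }
    qsort(terms, n, sizeof *terms, cmp_double);   /* ascending order */

    double product = 1.0;
    size_t front = 0, back = n;           /* unused terms are [front, back) */
    while (front < back) {
        if (product >= 1.0)
            product *= terms[front++];    /* too big: take a small term */
        else
            product *= terms[--back];     /* too small: take a big term */
    }
    free(terms);
    return 1.0 / (1.0 + product);
}
```

If the product still underflows or overflows once one kind of term runs out, the true value is itself out of range, and the final division still yields the limiting answer 1.0 or 0.0.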
Try computing the inverse 1/p. That gives you an equation of the form 1 + 1/(1-p1)*(1-p2)...
If you then count the occurrences of each probability (it looks like you have a small number of values that recur), you can use the pow() function, e.g. pow(1-p, occurrences_of_p)*pow(1-q, occurrences_of_q), and avoid individual roundoff with each multiplication.
You can express the probability in percent or per mille:
doc_spam_prob= (numerator*100/(denom1+denom2));
or
doc_spam_prob= (numerator*1000/(denom1+denom2));
or use some other coefficient
I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:
http://www.nongnu.org/hpalib/
and
http://www.tc.umn.edu/~ringx004/mapm-main.html
