Interview: Hash function: sine function - C

I was asked this interview question. I am not sure what the correct answer for it is (and the reasoning behind the answer):
Is sin(x) a good hash function?

If you mean sin(), it's not a good hashing function because:
it's quite predictable, and for some x it's no better than x itself. There should be no apparent relationship between the key and the hash of the key.
it does not produce an integer value. You cannot index/subscript arrays with floating-point indices and there must be some kind of array in the hash table.
floating-point is very implementation-specific and even if you make a hash function out of sin(), it may not work with a different compiler or on a different kind of CPU/computer.
sin() may be much slower than some simpler integer-arithmetic function.
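For contrast, here is a sketch of the kind of simple integer-arithmetic function that last point alludes to: Knuth's multiplicative (Fibonacci) hashing. It needs one multiply and one shift, produces an integer directly, and behaves identically on every platform.

#include <stdint.h>

/* Multiplicative (Fibonacci) hashing: multiply by floor(2^32/phi)
   and keep the top bits. bits is the table size as a power of two,
   in the range 1..32. */
static uint32_t fib_hash(uint32_t key, unsigned bits)
{
    return (key * 2654435769u) >> (32 - bits);  /* 2654435769 = floor(2^32/phi) */
}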

Not really.
It's horribly slow.
You'll need to convert the result to some integer type anyway to avoid the insanity of floating-point equality comparisons. (Not actually the usual precision problems that are endemic to FP equality comparisons and which arise from calculating two things slightly different ways; I mean specifically the problems caused by things like the fact that 387-derived FPUs store extra bits of precision in their registers, so if a comparison is done between two freshly-calculated values in registers you could get a different answer than if exactly one of the operands was loaded into a register from memory.)
It's almost flat near the peaks and troughs, so the quantisation step (multiplying by some large number and rounding to an integer) will produce many hash values near the min and max, rather than an even distribution.
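That flatness is easy to check empirically. Here is a small C experiment (my own sketch, not part of the original answer) that hashes the keys 0..99999 into 16 buckets with a quantised sin() and prints the bucket counts:

#include <math.h>
#include <stdio.h>

int main(void)
{
    enum { BUCKETS = 16 };
    unsigned long counts[BUCKETS] = {0};
    for (unsigned k = 0; k < 100000; k++) {
        /* map sin's range [-1,1] onto bucket indices [0,BUCKETS) */
        int h = (int)((sin((double)k) + 1.0) / 2.0 * BUCKETS);
        if (h == BUCKETS)
            h = BUCKETS - 1;  /* guard the sin(k) == 1 edge case */
        counts[h]++;
    }
    for (int i = 0; i < BUCKETS; i++)
        printf("bucket %2d: %lu\n", i, counts[i]);
    return 0;
}

Because sin over equally spaced integer keys follows an arcsine distribution, buckets 0 and 15 should come out several times fuller than the middle ones.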

Based on mathematical knowledge:
Sine(x) is periodic, so it reaches the same value from many different values of x. That makes it awful as a hashing function: multiple keys hash to exactly the same point. There are **infinitely many distinct return values between 0 and pi, but past that the values repeat, so 0, pi and 2*pi all hash to the same point.
If you could make the increment small enough and multiplied Sine(x) by, say, x^2 or something of that nature, it'd be mediocre at best; but then again, if you were to do that, why not just use x^2 anyway and toss out the periodic function altogether?
**infinitely: a large enough number that I'm not willing to count.
NOTE: Sine(x) produces values that are small and can be affected by rounding error.
NOTE: Any value taken from a sine function should be multiplied by an integer and then floored, ceilinged, or modded so that it can be used as an array offset, etc.

sin(x) is a trigonometric function which repeats itself every 360 degrees, so it makes a poor hash function: the hash repeats far too often.
A simple refutation:
sin(0) == sin(360) == sin(720) == sin(..)
This is not a property of a good hash function.
Even if you decide to use it, it's difficult to represent the value returned by sin.
Sine function:
sin x = x - x^3/3! + x^5/5! - ...
This can't be represented accurately due to floating-point precision issues, which means the same key may produce two different hashes on different platforms or compilers!

Another point to note:
For sin(x) as a hash function, keys in a close range will have hash values in a close range too, which is not desirable. A good hash function distributes hash values evenly, regardless of the nature of the keys.

Hash values generally have to be integers to be useful. Since sin doesn't generate integers it wouldn't be appropriate.

Let's say we have a string s. It can be expressed as a number in hexadecimal and fed to the function. If you added 2 pi to that number, it would cease to be a valid input, as it would no longer be an integer (only non-negative integers are accepted by the function). You have to find a string that gives a collision, not just multiply the hex expression of the string by 2 pi; and adding (concatenating?) 2 pi directly to the string wouldn't help find a collision either. There might be another way, but it is not that trivial.

I think sin(x) can make an excellent cryptographic hash function, if used wisely. The input should be a natural number in radians and never contain pi. We must use arbitrary-precision arithmetic. For every natural number x (radians), sin(x) is a transcendental irrational number, and no other natural number has the same sine (a consequence of the Lindemann-Weierstrass theorem and of the irrationality of pi). But there's a catch: an attacker could gain information about the input by computing the arcsin of the hash. To prevent this, we ignore the integer part and some of the first digits of the fractional part, keeping only the next n (say 100) digits, making such an attack computationally infeasible.
It seems that a small change in the input gives a completely different result, which is a desirable property. The result of the function seems statistically random, again a good property. I'm not sure how to prove that it is collision-resistant, but I can't see why it couldn't be. Also, I can't think of a way to find a specific input that results in a specific hash. I'm not saying that we should blindly believe that it is certainly a good cryptographic hash function; I just think that it seems like a good candidate to be one. We should give it a chance and focus on proving that it is. And it might be a very good one.
To those who might say it is slow: yes, it is. And that's good when hashing passwords.
Here I'm attaching some Perl code for this idea. It runs on Linux with bash and bc. (bc is a command-line arbitrary-precision calculator, included in most distros.)
I'll be checking this page for any answers, since this interests me a lot. Don't be harsh though, I'm just a CS undergrad, willing to learn more.
use warnings;
use strict;
my $input='5AFF36B7';          # input for bc (as a hex number)
$input='1'.$input;             # prepend '1' so that 0x0, 0x00, 0x1, 0x01, etc.
                               # all give different nonzero results
# Call bc and keep the result in $a. bc -l loads the math library,
# whose s() is the sine function; piping through echo avoids the
# bash-only <<< here-string, since backticks run /bin/sh.
my $a=`echo "scale=256;obase=16;ibase=16;s($input)" | bc -l -q`;
# keep only the fractional digits:
$a=~tr/a-zA-Z0-9//cd;          # drop the sign, the point and bc's line wraps
my @m = $a =~ /./g;            # convert string to array of chars
# PRINT OUTPUT
# We ignore some digits, for security reasons:
# if we don't ignore any of the first digits, an attacker could gain information
# about the input by computing the inverse of sin (the arcsin of the hash).
# By ignoring enough of the first digits, it becomes computationally
# infeasible to compute arcsin.
# Also, to avoid problems with roundoff error, we ignore some of the last digits.
for (my $c=100;$c<200;$c++){
    print $m[$c];
}

Related

Determine if a given integer number is element of the Fibonacci sequence in C without using float

I recently had an interview, where I failed and was finally told I did not have enough experience to work for them.
The position was embedded C software developer. The target platform was some kind of very simple 32-bit architecture whose processor does not support floating-point numbers and their operations. Therefore double and float numbers cannot be used.
The task was to develop a C routine for this architecture that takes one integer and returns whether or not it is a Fibonacci number. However, only an additional 1K of temporary memory may be used during execution. That means: even if I simulate very large integers, I can't just build up the sequence and iterate through it.
As far as I know, a positive integer n is a Fibonacci number exactly when one of
(5n ^ 2) + 4
or
(5n ^ 2) − 4
is a perfect square. Therefore I answered: it is simple, since the routine only has to determine whether that is the case.
They responded: the target architecture supports no floating-point-like operations, so no square roots can be obtained using the stdlib's sqrt function. It was also mentioned that basic operations like division and modulus may not work either, because of the architecture's limitations.
Then I said: okay, we could build an array of the square numbers up to 256, iterate through it, and compare the entries to the numbers given by the formulas above. They said this was a bad approach, even if it would work, and did not accept the answer.
Finally I gave up, since I had no other ideas. I asked what the solution was; they said it would not be told, but advised me to look for it myself. My first approach (the two formulas) should be the key, but the square root has to be computed some other way.
I googled a lot at home but never found any "alternative" square-root algorithms; everywhere, the use of floating-point numbers was permitted.
For operations like division and modulus, so-called "integer division" may be used. But what can be used for the square root?
Even though I failed the interview test, this is a very interesting topic for me: working on architectures where no floating-point operations are allowed.
Therefore my questions:
How can floating-point numbers be simulated if only integers are allowed?
What would be a possible solution in C for the problem above? Code examples are welcome.
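Since the question asks for code: a standard way to get the square root without floating point, division, or modulus is a bit-by-bit integer square root, which needs only shifts, additions and comparisons. A minimal sketch (my own, with the caveat that 5n^2+4 only fits in 64 bits for n up to about 1.9 billion; larger n needs the wider arithmetic discussed below):

#include <stdbool.h>
#include <stdint.h>

/* Integer square root, one result bit per iteration: no division,
   no modulus, no floating point. Returns floor(sqrt(v)). */
static uint64_t isqrt64(uint64_t v)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;  /* highest power of 4 in a uint64 */
    while (bit > v)
        bit >>= 2;
    while (bit != 0) {
        if (v >= root + bit) {
            v -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* n is a Fibonacci number iff 5n^2+4 or 5n^2-4 is a perfect square */
static bool is_perfect_square(uint64_t v)
{
    uint64_t r = isqrt64(v);
    return r * r == v;
}

With is_perfect_square in hand, the test from the question is just is_perfect_square(5*x*x + 4) || is_perfect_square(5*x*x - 4), with x already widened to uint64_t.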
The point of this type of interview is to see how you approach new problems. If you happen to already know the answer, that is undoubtedly to your credit but it doesn't really answer the question. What's interesting to the interviewer is watching you grapple with the issues.
For this reason, it is common that an interviewer will add additional constraints, trying to take you out of your comfort zone and seeing how you cope.
I think it's great that you knew that fact about recognising Fibonacci numbers. I wouldn't have known it without consulting Wikipedia. It's an interesting fact but does it actually help solve the problem?
Apparently, it would be necessary to compute 5n²±4, compute the square roots, and then verify that one of them is an integer. With access to a floating point implementation with sufficient precision, this would not be too complicated. But how much precision is that? If n can be an arbitrary 32-bit signed number, then n² is obviously not going to fit into 32 bits. In fact, 5n²+4 could be as big as 65 bits, not including a sign bit. That's far beyond the precision of a double (normally 52 bits) and even of a long double, if available. So computing the precise square root will be problematic.
Of course, we don't actually need a precise computation. We can start with an approximation, square it, and see if it is either four more or four less than 5n². And it's easy to see how to compute a good guess: it will be very close to n×√5. By using a good precomputed approximation of √5, we can easily do this computation without the need for floating point, without division, and without a sqrt function. (If the approximation isn't accurate, we might need to adjust the result up or down, but that's easy to do using the identity (n+1)² = n²+2n+1; once we have n², we can compute (n+1)² with only addition.)
We still need to solve the problem of precision, so we'll need some way of dealing with 66-bit integers. But we only need to implement addition and multiplication of positive integers, which is considerably simpler than a full-fledged bignum package. Indeed, if we can prove that our square root estimation is close enough, we could safely do the verification modulo 2³¹.
So the analytic solution can be made to work, but before diving into it, we should ask whether it's the best solution. One very common category of suboptimal programming is clinging desperately to the first idea you came up with even as its complications become increasingly evident. That will be one of the things the interviewer wants to know about you: how flexible are you when presented with new information or new requirements?
So what other ways are there to know whether n is a Fibonacci number? One interesting fact is that if n is Fib(k), then k is the floor of logφ(n×√5 + 0.5). Since logφ is easily computed from log2, which in turn can be approximated by a simple bitwise operation, we could try finding an approximation of k and verifying it using the classic O(log k) recursion for computing Fib(k). None of this involves numbers bigger than the capacity of a 32-bit signed type.
Even more simply, we could just run through the Fibonacci series in a loop, checking to see if we hit the target number. Only 47 loops are necessary. Alternatively, these 47 numbers could be precalculated and searched with binary search, using far less than the 1k bytes you are allowed.
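A minimal sketch of that loop, assuming a 32-bit unsigned input (the 64-bit temporaries just sidestep overflow near the top of the range):

#include <stdbool.h>
#include <stdint.h>

/* Walk the Fibonacci sequence until it reaches or passes n.
   At most ~47 iterations for any 32-bit input, and only two
   words of temporary storage. */
bool is_fibonacci(uint32_t n)
{
    uint64_t a = 0, b = 1;
    while (a < n) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a == n;
}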
It is unlikely that an interviewer for a programming position would be testing for knowledge of a specific property of the Fibonacci sequence. Thus, unless they present the property to be tested, they are examining the candidate's approach to problems of this nature and their general knowledge of algorithms. Notably, the notion of iterating through a table of squares is a poor response on several fronts:
At a minimum, binary search should be the first thought for table look-up. Some calculated look-up approaches could also be proposed for discussion, such as using a find-first-set-bit instruction to index into a table.
Hashing might be another idea worth considering, especially since an efficient customized hash might be constructed.
Once we have decided to use a table, it is likely a direct table of Fibonacci numbers would be more useful than a table of squares.

How to model random variables?

I want to know how to model random variables using "basic operations". The only random function I know, at least for C, is rand(), along with srand for seeding. There probably exist packages somewhere online, but let's say I want to implement it on my own. I don't know if there are other very common random functions; if not, let's just stick with rand() and the C language.
rand() allows me to pseudo-randomly generate an int from 0 to RAND_MAX. I can then use mod to get an int in some range, and mod 2 to choose a sign and get negative numbers. I can also compute rand()/RAND_MAX to model values in the interval (0,1) and shift this to model Uniform(a,b).
But what I am not sure about is whether I can extend this to model any probability distribution, and at what point I have to worry about accuracy, especially when dealing with infinities and irrational probabilities. Also, this method is very crude, so I would like to know of more standard ways using basic tools, if any.
A simple example:
I have a random variable X such that Pr(X=1)=1/pi and Pr(X=0)=1-1/pi. Since pi is irrational, I would approximate the probability 1/pi with rand() and choose X=1 if I get an int from 0 to Round(RAND_MAX*1/pi). So this approximates twice: once for pi and another time for the rounding.
Is there a better approach? How would one go about modeling something more complicated, such as a continuous random variable on the interval (0, infinity), or a discrete random variable with irrational probabilities on a countably infinite set? Would my approach still work, or would I have to worry about rounding errors?
EDIT: Also, how does the pseudo-randomness (rather than true randomness) of rand() change things, and how would I account for those changes?
I can then use mod to get an int in some range
No, you can't. Try it with dice. You want a number between 1 and 5, so you take the roll mod 5 (kind of; it would actually be ((roll-1)%5)+1). This maps 1 to 1, 2 to 2, and so on up to 5, and maps 6 to 1. You now have 1 twice as likely as any other result.
The correct way of doing this is to find the smallest power of 2 that is at least as large as your range, mask out the bits of the random number above that power of 2, then check whether you're in range. If you aren't, try again (this can potentially loop forever; in practice the average number of retries is less than 2). This assumes that your random numbers are a stream of bits and not something else, which is usually a safe assumption for decent generators.
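A minimal C sketch of this mask-and-retry scheme (my own; it assumes rand() yields uniformly distributed bits and that RAND_MAX is at least as large as the mask):

#include <stdlib.h>

/* Unbiased random integer in [0, n), n >= 1. Round the range up to
   the next power of two with a bit-smearing trick, mask, and retry
   until the value lands in range (fewer than 2 tries on average). */
unsigned rand_below(unsigned n)
{
    unsigned mask = n - 1;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;  /* mask is now 2^k - 1 with 2^k >= n */
    unsigned r;
    do {
        r = (unsigned)rand() & mask;
    } while (r >= n);
    return r;
}

For the dice example above, rand_below(5) + 1 yields an unbiased result from 1 to 5.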
I can also do rand()/RAND_MAX to model values in the interval (0,1)
No, you can't. That's not how floating point numbers work. This generates a horrible distribution.
Either the number of bits in the integer is smaller than the number of bits in the mantissa, in which case there is a whole set of floating-point numbers you can never generate; or the number of bits in the integer is larger than the number of bits in the mantissa, in which case the integer is truncated when converted to floating point before the division, and certain numbers will be generated much more often than others.
in the interval (0,1) and shift this to model Uniform(a,b).
This makes things even worse. First you lose bits in one direction, then you lose bits in the other direction.
To actually generate uniformly distributed floating point numbers in an arbitrary range is harder than it looks.
I did some experiments to figure this out myself a few years ago, learning floating-point internals in the process, and I wrote some heavily commented code with the reasoning here: https://github.com/art4711/random-double
In short, to generate random floating-point numbers in an arbitrary range: take the end of the range with the bigger absolute value; that is the start, and the other end of the range is the end. Find the next representable number from start towards end; the difference between that next number and start becomes the step. Calculate how many steps exist between start and end, generate a uniformly distributed random integer between 0 and that number of steps, and the answer is start + step * (random number). Because of how floating point works, this might not be exactly what you're looking for: all possible floating-point values in the range are most certainly not generated by this method (except in very special cases), but the method does guarantee that every value it can generate is equally likely.
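Below is my own rough translation of that recipe into C; the linked repository is the authoritative, fully reasoned version, and this sketch only shows the shape of the idea. nextafter() from math.h supplies the neighbouring representable value, and the 64-bit helpers are stitched from rand() purely to keep the example self-contained:

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* 64 random bits stitched from rand(); assumes RAND_MAX >= 32767 */
static uint64_t rand_u64(void)
{
    uint64_t r = 0;
    for (int i = 0; i < 5; i++)
        r = (r << 15) | ((unsigned)rand() & 0x7FFF);
    return r;
}

/* unbiased integer in [0, n): the same mask-and-retry scheme as above */
static uint64_t rand_u64_below(uint64_t n)
{
    uint64_t mask = n - 1;
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;
    uint64_t r;
    do { r = rand_u64() & mask; } while (r >= n);
    return r;
}

/* Uniform double in [a, b]: every result is a multiple of the coarsest
   representable step in the range, so all outcomes are exactly
   representable and equally likely. */
double uniform_range(double a, double b)
{
    if (a == b)
        return a;
    double start = (fabs(a) >= fabs(b)) ? a : b;  /* grid is coarsest here */
    double end   = (start == a) ? b : a;
    double step  = fabs(nextafter(start, end) - start);
    uint64_t nsteps = (uint64_t)(fabs(end - start) / step);
    double dir = (end > start) ? 1.0 : -1.0;
    return start + dir * step * (double)rand_u64_below(nsteps + 1);
}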
Notice that your misconceptions are very common; almost everyone does these things. Random numbers in the industry are anything but random. In computer science, the word "random" pretty much means "predictable, repeatable, easily breakable and exploitable, and quite possibly not well distributed". And don't get me started on the quality of the "random" number generators in standard libraries. If you dig around my GitHub, you'll find a package for Go with a long README rant about this.
I'm not going to respond to the rest of your question, those bits require a book or two.

Factorial Using FFT

I'm trying to implement a program in C that calculates the factorial of a very large n (up to a million), using an FFT and the binary splitting method.
I've implemented a simple library to represent arbitrary-precision integers.
To calculate the FFT and IFFT, I use the twofft.c and four1.c routines from "Numerical Recipes in C".
Up to a certain n all goes right, but when the numbers (floating arrays) are too big, the IFFT (calculated with four1), after normalization and rounding, contains values that are wrong.
For example, if I have two numbers of 2000 digits each, ending in 40 zeros, and I multiply them (using the FFT), then when I calculate the IFFT some of the trailing zeros become ones.
This happens because when I round one of these "zeros" (0.50009, for example), it becomes a one.
Now, I don't know whether my implementation is wrong or whether I have to round these numbers in a different way.
I've tried both the binary splitting method and prime factorization, but for n >= 9000 the result is wrong.
Is there a way to resolve this?
Thanks for your attention, and sorry for my bad English.
How do you represent arbitrary-precision integers?
I mean, what type are you actually using?
Can you please show us your code?
If you feel really lazy you can clone this project I made a few months ago:
https://github.com/nomadster/ESP
Edit:
From this statement of yours:
"this happens because when I round one of these 'zeros' (0.50009 for example), it becomes a one"
I suppose you are still unaware that FFT multiplication only works when the roundoff error is smaller than 0.5.
So it seems to me (if I've interpreted your somewhat cryptic message correctly) that you are using a floating-point type that doesn't have the required precision.
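One practical safeguard, my suggestion rather than anything from the original post: when rounding the IFFT output back to digits, track how far the coefficients are from integers, so the code can detect that precision has run out instead of silently producing wrong digits. A sketch:

#include <math.h>
#include <stddef.h>

/* Round each inverse-FFT coefficient to the nearest integer and
   report the worst roundoff error seen. If the result creeps toward
   0.5, the product is unreliable and a smaller digit base (fewer
   digits packed per FFT element) or a wider float type is needed. */
static double round_coefficients(const double *coeff, long *digits, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = floor(coeff[i] + 0.5);
        double e = fabs(coeff[i] - r);
        if (e > worst)
            worst = e;
        digits[i] = (long)r;
    }
    return worst;  /* caller should e.g. require worst < 0.25 */
}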
For the record:
I also noticed wrong values returned by the IFFT from four1.c of Numerical Recipes. I only tested it with N=256 complex values as input, assembled in such a way that they should result in a real-only time-domain signal.
The resulting time-domain vector has to be mirrored (end to start and vice versa) and shifted by one to correspond with the IFFTs of other implementations. (I tested numpy.fft.ifft, Octave's ifft, and an inverse discrete Fourier transform without any optimisation, simply based on the IDFT formula, which should definitely be correct.)
There has to be a fundamental algorithmic fault in the version provided by Numerical Recipes. Nothing related to this problem is described in their book.

Normalising 18 bit input between 0-9999

I'm writing a program in which I need to normalise an 18-bit input to the range 0-9999. This is something I have never come across before.
I have searched the internet, and correct me if I am wrong here, but is this as simple as converting the 18-bit binary (000000000000000000) input into a natural number and then dividing it by 1000?
Is there a different, more efficient method?
Thank you
No, what you want to do is multiply your input by 0.03814697265.
The reasoning is pretty simple: you take your range of inputs (0..2^18) and split it into 10000 "slices". Each slice will then cover a range of just over 26. If you divide your input by this 26 (or multiply it by 1/26), you'll get your number in the 0..9999 range.
Edit: depending on your background, you may need to know that here I use ^ to mean exponentiation. That might be moot since this question is tagged C, which has no first-class concept of exponentiation, but it's definitely not XOR!
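If floating point is unavailable or undesirable, the same scaling works in pure integer arithmetic, because multiplying by 10000/2^18 is just a multiply followed by a shift. A minimal sketch (my own, assuming a 64-bit intermediate is available):

#include <stdint.h>

/* Scale an 18-bit value (0..262143) to 0..9999: multiply by 10000,
   then divide by 2^18 with a right shift. The 64-bit intermediate
   keeps the product from overflowing. */
uint32_t scale_18bit(uint32_t in)
{
    return (uint32_t)(((uint64_t)in * 10000u) >> 18);
}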

casting doubles to integers in order to gain speed

In Redis (http://code.google.com/p/redis) there are scores associated with elements, used to keep those elements sorted. These scores are doubles, even though many users actually sort by integers (for instance, Unix times).
When the database is saved, we need to write these doubles to disk. This is what is currently used:
snprintf((char*)buf+1,sizeof(buf)-1,"%.17g",val);
Additionally, infinity and not-a-number conditions are checked so that they can also be represented in the final database file.
Unfortunately, converting a double into its string representation is pretty slow. We have a function in Redis that converts an integer into a string representation in a much faster way. So my idea was to check whether a double can be cast to an integer without loss of data, and if so, use the integer-to-string function.
For this to provide a good speedup, the test for integer "equivalence" must of course be fast. So I used a trick that is probably undefined behavior but worked very well in practice. Something like this:
double x = ... some value ...
if (x == (double)((long long)x))
    use_the_fast_integer_function((long long)x);
else
    use_the_slow_snprintf(x);
My reasoning is that the cast above converts the double into a long long, and then back into a double. If the value fits the range and there is no fractional part, the number survives the round trip and is exactly the same as the initial number.
As I wanted to make sure this would not break things on some system, I joined #c on freenode and got a lot of insults ;) So now I'm trying here.
Is there a standard way to do what I'm trying to do without going outside ANSI C? Otherwise, is the above code supposed to work on all the POSIX systems that Redis currently targets? That is, architectures where Linux / Mac OS X / *BSD / Solaris run nowadays?
What I can add to make the code saner is an explicit check of the double's range before attempting the cast at all.
Thank you for any help.
Perhaps some old-fashioned fixed-point math could help you out. If you converted your double to a fixed-point value, you would still get decimal precision, and converting to a string is as easy as with ints, with the addition of a single shift.
Another thought would be to roll your own snprintf() function. The conversion from double to int is natively supported by many FPUs, so that should be lightning fast. Converting that to a string is simple as well.
Just a few random ideas for you.
The problem with doing that is that the comparisons won't work out the way you'd expect. Just because one floating-point value is less than another doesn't mean that its representation as an integer will be less than the other's. Also, I see you comparing one of the (former) double values for equality. Due to rounding and representation errors in the low-order bits, you almost never want to do that.
If you are just looking for some kind of key to do something like hashing on, it would probably work out fine. If you actually care about which values are really greater or lesser, it's a bad idea.
I don't see a problem with the casts, as long as x is within the range of long long. Maybe you should check out the modf() function, which separates a double into its integral and fractional parts. You can then check the integral part against (double)LLONG_MIN and (double)LLONG_MAX to make sure, though there may be difficulties with the precision of double.
But before doing any of this, have you made sure it actually is a bottleneck by measuring its performance? And is the percentage of integer values high enough that it would really make a difference?
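One subtlety with the suggested check against (double)LLONG_MAX: LLONG_MAX itself is not exactly representable as a double, and the conversion rounds it up to 2^63, so a <= comparison would let an out-of-range value slip through. A sketch of a safe guard (a hypothetical helper, not Redis code):

#include <stdbool.h>

/* True if x can be cast to long long without undefined behaviour.
   2^63 is exactly representable as a double, so the half-open test
   below is exact; NaN fails both comparisons, so it is rejected too. */
static bool double_fits_long_long(double x)
{
    return x >= -9223372036854775808.0   /* -(2^63) == LLONG_MIN  */
        && x <   9223372036854775808.0;  /*   2^63  == LLONG_MAX+1 */
}

With this guard in front of it, the x == (double)((long long)x) test from the question is safe on any conforming implementation.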
Your test is perfectly fine (assuming you have already handled infinities and NaNs separately by this point), and it's probably one of the very few occasions when you really do want to compare floats for equality. One caveat: if x is outside the range of long long, the conversion to an integer is actually undefined behaviour in standard C (in practice you typically just get a garbage value), so it is worth pairing the test with the explicit range check you mentioned.
The only fly in the ointment is that negative zero will end up as positive zero (because negative zero compares equal to positive zero).
