Casting doubles to integers in order to gain speed - C

In Redis (http://code.google.com/p/redis) there are scores associated with elements, used to retrieve those elements in sorted order. The scores are doubles, even though many users actually sort by integers (for instance, Unix times).
When the database is saved we need to write these doubles to disk. This is what is currently used:
snprintf((char*)buf+1,sizeof(buf)-1,"%.17g",val);
Additionally, infinity and not-a-number conditions are checked so that they can also be represented in the final database file.
Unfortunately, converting a double into its string representation is pretty slow, while we have a function in Redis that converts an integer into a string representation much faster. So my idea was to check whether a double can be cast to an integer without loss of data, and if so, use the fast integer-to-string function.
For this to provide a good speedup, of course, the test for integer "equivalence" must be fast. So I used a trick that is probably undefined behavior but that worked very well in practice. Something like this:
double x = ... some value ...;
if (x == (double)((long long)x))
    use_the_fast_integer_function((long long)x);
else
    use_the_slow_snprintf(x);
In my reasoning, the double cast above converts the double into a long long, and then back into a double. If the value fits the range and there is no fractional part, the number will survive the round trip and compare exactly equal to the initial number.
As I wanted to make sure this would not break things on some system, I joined #c on freenode and I got a lot of insults ;) So I'm now trying here.
Is there a standard way to do what I'm trying to do without going outside ANSI C? Otherwise, is the above code supposed to work on all the POSIX systems that Redis currently targets? That is, the architectures where Linux / Mac OS X / *BSD / Solaris run nowadays?
What I can add to make the code saner is an explicit check on the range of the double before attempting the cast at all.
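For illustration, a minimal sketch of what that range-checked version might look like (double_fits_ll is just a placeholder name, not an existing Redis function):

#include <limits.h>

/* Returns 1 and stores the integer if the double survives a lossless
   round trip, 0 otherwise. The range check happens before the cast,
   because casting an out-of-range double is where the trouble lies.
   Note that (double)LLONG_MAX rounds up to 2^63 on typical platforms,
   hence the strict < on the upper bound. */
static int double_fits_ll(double x, long long *out)
{
    if (x < (double)LLONG_MIN || x >= (double)LLONG_MAX)
        return 0;
    long long ll = (long long)x;
    if ((double)ll != x)
        return 0;
    *out = ll;
    return 1;
}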
Thank you for any help.

Perhaps some old-fashioned fixed-point math could help you out. If you converted your double to a fixed-point value, you would still get decimal precision, and converting it to a string would be as easy as with ints, with the addition of a single shift.
Another thought would be to roll your own snprintf() function. The conversion from double to int is natively supported by many FPUs, so that should be lightning fast. Converting that to a string is simple as well.
Just a few random ideas for you.
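A rough sketch of the fixed-point idea, assuming a 16-bit fractional part (the format choice and function name are mine, not the answer's):

#include <stdio.h>

/* Positive values assumed for brevity; right-shifting a negative
   value is implementation-defined in C. */
static void print_q16(double val)
{
    long long fx    = (long long)(val * 65536.0); /* to Q47.16 fixed point */
    long long ipart = fx >> 16;                   /* integer part: one shift */
    long long frac  = fx & 0xFFFF;                /* 16 fractional bits */
    printf("%lld + %lld/65536\n", ipart, frac);
}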

The problem with doing that is that the comparisons won't work out the way you'd expect. Just because one floating-point value is less than another doesn't mean that its representation as an integer will be less than the other's. Also, I see you comparing one of the (former) double values for equality. Due to rounding and representation errors in the low-order bits, you almost never want to do that.
If you are just looking for some kind of key to do something like hashing on, it would probably work out fine. If you actually care about which values are really greater or lesser, it's a bad idea.

I don't see a problem with the casts, as long as x is within the range of long long. Maybe you should check out the modf() function, which separates a double into its integral and fractional parts. You can then check the integral part against (double)LLONG_MIN and (double)LLONG_MAX to make sure, though there may be difficulties with the precision of double.
But before doing any of this, have you made sure it actually is a bottleneck by measuring its performance? And is the percentage of integer values high enough that it would really make a difference?
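A sketch of the modf()-based check described above (the bounds are approximate, as the answer warns, since double cannot represent LLONG_MAX exactly on typical platforms):

#include <math.h>
#include <limits.h>

static int is_integral_ll(double x, long long *out)
{
    double ipart;
    if (modf(x, &ipart) != 0.0)   /* nonzero fractional part */
        return 0;
    /* (double)LLONG_MAX rounds up to 2^63, so use a strict < there */
    if (ipart < (double)LLONG_MIN || ipart >= (double)LLONG_MAX)
        return 0;
    *out = (long long)ipart;
    return 1;
}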

Your test is perfectly fine (assuming you have already separately handled infinities and NaNs by this point) - and it's probably one of the very few occasions when you really do want to compare floats for equality. It doesn't invoke undefined behaviour - even if x is outside the range of long long, you'll just get an "implementation-defined result", which is OK here.
The only fly in the ointment is that negative zero will end up as positive zero (because negative zero compares equal to positive zero).
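If that distinction matters for the database format, negative zero can be detected explicitly; a small sketch using C99's signbit():

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = -0.0;
    if (x == 0.0 && signbit(x))
        puts("negative zero: fall back to the slow path if the sign matters");
    return 0;
}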

Why is infinity = 0x3f3f3f3f?

In some situations, one generally uses a large enough integer value to represent infinity. I usually use the largest representable positive/negative integer. That usually yields more code, since you need to check whether one of the operands is infinity before virtually every arithmetic operation in order to avoid overflow. Sometimes it would be desirable to have saturated integer arithmetic. For that reason, some people use smaller values for infinity, which can be added or multiplied several times without overflow. What intrigues me is the fact that it's extremely common to see (especially in programming competitions):
const int INF = 0x3f3f3f3f;
Why is that number special? Its binary representation is:
00111111001111110011111100111111
I don't see any especially interesting property here. I see it's easy to type, but if that were the reason, almost anything would do (0x3e3e3e3e, 0x2f2f2f2f, etc.). It can be added once without overflow, which allows for:
a = min(INF, b + c);
But all the other constants would do, then. Googling only shows me a lot of code snippets that use that constant, but no explanations or comments.
Can anyone spot it?
I found some evidence about this here (original content in Chinese); the basic idea is that 0x7fffffff is problematic since it's already "the top" of the range of 4-byte signed ints, so adding anything to it results in negative numbers; 0x3f3f3f3f, instead:
is still quite big (same order of magnitude as 0x7fffffff);
has a lot of headroom: if you say that the valid range of integers is limited to numbers below it, you can add any "valid positive number" to it and still get an "infinite" value (i.e. something >= INF). Even INF+INF doesn't overflow. This allows you to keep it always "under control":
a += b;
if (a > INF)
    a = INF;
is a repetition of equal bytes, which means you can easily memset stuff to INF;
also, as @Jörg W Mittag noticed above, it has a nice ASCII representation, which lets you both spot it on the fly when looking at memory dumps and write it directly into memory.
I may or may not be one of the earliest discoverers of 0x3f3f3f3f. I published a Romanian article about it in 2004 (http://www.infoarena.ro/12-ponturi-pentru-programatorii-cc #9), but I had been using this value since at least 2002 in programming competitions.
There are two reasons for it:
0x3f3f3f3f + 0x3f3f3f3f doesn't overflow int32. For this, some use 1000000000 (one billion).
one can set an array of ints to infinity by doing memset(array, 0x3f, sizeof(array))
0x3f3f3f3f is the ASCII representation of the string ????.
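A minimal check of both properties (a sketch; the array size is arbitrary):

#include <stdio.h>
#include <string.h>

#define INF 0x3f3f3f3f

int main(void)
{
    int dist[8];
    /* every byte becomes 0x3f, so every int becomes 0x3f3f3f3f */
    memset(dist, 0x3f, sizeof(dist));
    printf("%d\n", dist[0]);            /* 1061109567 == 0x3f3f3f3f */
    printf("%d\n", dist[0] + dist[0]);  /* 2122219134: INF+INF still fits int32 */
    return 0;
}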
Krugle finds 48 instances of that constant in its entire database. 46 of those instances are in a Java project, where it is used as a bitmask for some graphics manipulation.
1 project is an operating system, where it is used to represent an unknown ACPI device.
1 project is again a bitmask for Java graphics.
So, in all of the projects indexed by Krugle, it is used 47 times because of its bit pattern, once because of its ASCII interpretation, and not a single time as a representation of infinity.

Interview: Hash function: sine function

I was asked this interview question. I am not sure what the correct answer for it is (and the reasoning behind the answer):
Is sin(x) a good hash function?
If you mean sin(), it's not a good hashing function because:
it's quite predictable, and for some x it's no better than x itself. There should be no apparent relationship between the key and the hash of the key.
it does not produce an integer value. You cannot index/subscript arrays with floating-point indices and there must be some kind of array in the hash table.
floating-point is very implementation-specific and even if you make a hash function out of sin(), it may not work with a different compiler or on a different kind of CPU/computer.
sin() may be much slower than some simpler integer-arithmetic function.
Not really.
It's horribly slow.
You'll need to convert the result to some integer type anyway to avoid the insanity of floating-point equality comparisons. (Not actually the usual precision problems that are endemic to FP equality comparisons and which arise from calculating two things slightly different ways; I mean specifically the problems caused by things like the fact that 387-derived FPUs store extra bits of precision in their registers, so if a comparison is done between two freshly-calculated values in registers you could get a different answer than if exactly one of the operands was loaded into a register from memory.)
It's almost flat near the peaks and troughs, so the quantisation step (multiplying by some large number and rounding to an integer) will produce many hash values near the min and max, rather than an even distribution.
Based on mathematical knowledge:
Sine(x) is periodic, so it will reach the same value from different values of x; Sine(x) would therefore be awful as a hashing function, because multiple keys hash to exactly the same point. There are **infinitely many values between 0 and pi for the return value, but past that the values repeat. So 0, pi, and 2*pi will all hash to the same point.
If you could make the increment small enough and have Sine(x) multiplied by, say, x^2 or something of that nature, it'd be mediocre at best; but then again, if you were to do that, why not just use x^2 anyway and toss out the periodic function altogether?
**infinitely: a large enough number that I'm not willing to count.
NOTE: Sine(x) will have values that are small and could be affected by rounding error.
NOTE: Any value taken from a sine function should be multiplied by an integer and then floored, ceiled, or modded so that the value can be used as an array offset, etc.
sin(x) is a trigonometric function which repeats itself every 360 degrees, so it's going to be a poor hash function: the hashes will repeat too often.
A simple refutation:
sin(0) == sin(360) == sin(720) == sin(..)
This is not a property of a good hash function.
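To make the refutation concrete, a small sketch (the bucket scheme is mine; C's sin() works in radians, so the collisions appear at multiples of 2*pi):

#include <math.h>
#include <stdio.h>

/* Hypothetical sin()-based hash: quantise sin(x) into 1024 buckets. */
static unsigned sin_hash(double x)
{
    return (unsigned)((sin(x) + 1.0) * 511.5); /* map [-1,1] to [0,1023] */
}

int main(void)
{
    double pi = acos(-1.0);
    /* keys one full period apart land in the same bucket */
    printf("%u %u %u\n", sin_hash(0.0), sin_hash(2*pi), sin_hash(4*pi));
    return 0;
}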
Even if you decide to use it, it's difficult to represent the value returned by sin. The sine function:
sin x = x - x^3/3! + x^5/5! - ...
This can't be represented accurately due to floating-point precision issues, which means the same value may produce two different hashes!
Another point to note:
For sin(x) as a hash function, keys in a close range will have hash values in a close range too, which is not desirable. A good hash function distributes hash values evenly, irrespective of the nature of the keys.
Hash values generally have to be integers to be useful. Since sin doesn't generate integers it wouldn't be appropriate.
Let's say we have a string s. It can be expressed as a number in hexadecimal and fed to the function. If you added 2*pi, it would cease to be a valid input, as it wouldn't be an integer anymore (only non-negative integers are accepted by the function). You have to find a string that gives a collision, not just multiply the hex expression of the string by 2*pi. And adding (concatenating?) 2*pi directly to the string wouldn't help find a collision. There might be another way, though, but it's not that trivial.
I think sin(x) can make an excellent cryptographic hash function, if used wisely. The input should be a natural number in radians and never contain pi. We must use arbitrary-precision arithmetic.
For every natural number x (radians), sin(x) is always a transcendental irrational number, and there is no other natural number with the same sine. But there's a catch: an attacker could gain information about the input by computing the arcsin of the hash.
In order to prevent this, we ignore the integer part and some of the first digits of the fractional part, keeping only the next n (say 100) digits, making such an attack computationally infeasible.
It seems that a small change in the input gives a completely different result, which is a desirable property. The result of the function seems statistically random, again a good property.
I'm not sure how to prove that it is collision-resistant, but I can't see why it couldn't be. Also, I can't think of a way to find a specific input that results in a specific hash. I'm not saying that we should blindly believe that it is certainly a good cryptographic hash function. I just think that it seems like a good candidate to be one. We should give it a chance and focus on proving that it is. And it might be a very good one.
To those who might say it is slow: yes, it is. And that's good when hashing passwords.
Here I'm attaching some Perl code for this idea. It runs on Linux with bash and bc. (bc is a command-line arbitrary-precision calculator, included in most distros.)
I'll be checking this page for any answers, since this interests me a lot. Don't be harsh though, I'm just a CS undergrad, willing to learn more.
use warnings;
use strict;
my $input='5AFF36B7';#Input for bc (as a hex number)
$input='1'.$input;#put '1' in front of input, so that 0x0 , 0x00 , 0x1 , 0x01 , etc ... ,
#all give different nonzero results
my $a=`bc -l -q <<< "scale=256;obase=16;ibase=16;s($input)"`;#call bc, keep result in $a
#keep only fractional part
$a=~tr/a-zA-Z0-9//cd;#Clean up string, keep only alphanumerics
my @m = $a =~ /./g;#Convert string to array of chars
#PRINT OUTPUT
#We ignore some digits, for security reasons:
#If we don't ignore any of the first digits, an attacker could gain information
#about the input by computing the inverse of sin (the arcsin of the hash)
#By ignoring enough of the first digits, it becomes computationally
#infeasible to compute arcsin
#Also, to avoid problems with roundoff error, we ignore some of the last digits
for (my $c = 100; $c < 200; $c++) {
    print $m[$c];
}

Long Long, decimals and input validation in C

Currently I'm using TCC as it's the easiest thing to get set up on Windows. Simply unzip and you're ready to go. However, I'm open to other compilers: GCC, whatever Microsoft has on offer, etc.
My problem is that I need to validate the input to a size-16 array of integers. Here are my rules:
if the number is under 15 (including negative values), the input is valid
if the number is under -2147483648, then -2147483648
if the number is over 2147483647, then 15
else if the number is over 15, then mod 16
if the number is a decimal, remove the decimal point and validate again
Considering I'm using C, the last point scares me, and I'll come back to it later. For now I'm just trying to handle the first four conditions.
The problem I'm running into is that trying to test for the outer bounds results in integer overflows and screws up my checks. So I've made a temporary array of long longs to hold the input for validation purposes. The moment everything is successfully validated it should fit in an array of ints, so I will (somehow) copy the long longs from the temp array to the actual one and start the program as normal.
I've messed around with long longs trying to do what I want, but my code is getting messy fast, and everything is so vague and machine-dependent in C that when something goes wrong I don't know whether it's me and my crappy coding, or the fact that my machine is different from everyone else's. I am going to stick at it because I know this sort of thing can be investigated and worked out; however, I don't want to waste too much time on it, so I'll ask SO and see if there's a shortcut.
I've got various ideas on how to approach the decimal-validation part, but I'm not hopeful. What's your opinion?
Anyone who wants to know why I'm doing this: it doesn't matter; I can solve the higher-level problem that requires this array quite easily, and it will work for all valid inputs. However, I'm just being pedantic right now, hence this question.
First, your conditions have some problems. If a number is under -2147483648, it's also under 15, so that check never matches (and neither does the decimal check, for numbers under 15).
Second, you can check for overflow with strtol (check errno for ERANGE) and then compare with your limits (though there's no need if your long is 32 bits wide and two's complement).
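A minimal sketch of that approach against the question's rules (the function name is mine; strtoll is used so the check also works where long is only 32 bits):

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

static long long validate(const char *s)
{
    errno = 0;
    long long v = strtoll(s, NULL, 10);
    if (errno == ERANGE)              /* doesn't even fit a long long */
        return (v > 0) ? 15 : INT_MIN;
    if (v > INT_MAX) return 15;       /* over 2147483647 -> 15 */
    if (v < INT_MIN) return INT_MIN;  /* under -2147483648 -> clamp */
    if (v > 15) return v % 16;        /* over 15 -> mod 16 */
    return v;                         /* under 15: valid as-is */
}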
As for decimals, if you always want to remove the decimal point (which isn't quite what you're saying, because you condition that on a series of other checks failing), you can set up a preprocessing step that removes periods from the string. It can easily be done in place with two pointers: a read pointer and a write pointer.
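A sketch of that two-pointer preprocessing step (untested against exotic inputs):

/* removes every '.' from s, in place */
static void strip_points(char *s)
{
    char *rd = s, *wr = s;   /* read and write pointers */
    while (*rd) {
        if (*rd != '.')
            *wr++ = *rd;
        rd++;
    }
    *wr = '\0';
}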

What is the most efficient way to store and work with a floating point number with 1,000,000 significant digits in C?

I'm writing a utility to calculate π to a million digits after the decimal point. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number, accurate to the millionth digit?
clarification: The language would be C.
Forget floating point, you need bit strings that represent integers
This takes a bit less than half a megabyte per number. "Efficient" can mean several things: space-efficient? Time-efficient? Easy to program with?
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that the data is only known to a few significant figures; even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating-point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really need is a string of bits. This is simply an array of a convenient type. I suggest <stdint.h> and simply using uint32_t[125000] (or uint64_t) to get started. This might actually be a great use of the more obscure constants from that header that pick out bit sizes that are fast on a given platform.
To be more specific, we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports bignums, like Ruby?
Then the storage is someone else's problem. But if what you really want to do is implement a big-number package, then I might suggest using BCD (4-bit) strings or even ordinary ASCII 8-bit strings with printable digits, simply because things will be easier to write and debug, and maximum space and time efficiency may not matter so much.
I'd recommend storing it as an array of short ints, one per digit, and then carefully writing utility functions to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number, so use its exact representation. This isn't the most space-efficient way, but a million shorts isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm (the Bailey-Borwein-Plouffe formula) that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million-digit numbers.
Of course, if you want the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as GNU MP (the GNU multiple-precision arithmetic library). Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers.
If you are just doing this for fun/learning, then represent numbers as an array of chars, with each array element storing one decimal digit. You'll have to implement long addition, long multiplication, and so on.
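A toy sketch of long addition on such digit arrays (least significant digit first, digits stored as values 0-9 rather than ASCII; the length is illustrative):

#define NDIGITS 8   /* use ~1000000 for the real thing */

static void add_digits(const char *a, const char *b, char *sum)
{
    int carry = 0;
    for (int i = 0; i < NDIGITS; i++) {
        int d = a[i] + b[i] + carry;
        sum[i] = (char)(d % 10);
        carry  = d / 10;
    }
    /* a final nonzero carry would need one more digit; ignored here */
}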
Try PARI/GP; see Wikipedia.
You could store its decimal digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need great precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representation.
I am not too good with complex math, but I reckon there are solutions which are elegant when storing numbers with millions of bits of precision.
IMO, any programmer of arbitrary-precision arithmetic needs an understanding of base conversion. It solves two problems at once: being able to calculate pi in hex digits and then convert the result to decimal representation, as well as finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript one always has 53 bits of accuracy, meaning that a Uint32Array whose numbers have at most 26 bits can be processed natively (a waste of 6 bits per word).
On a 32-bit architecture with C/C++ one can easily get A*B mod 2^32, suggesting a basic element of 16 bits. (Those can be parallelized in many SIMD architectures, starting from MMX.) Also, each 16-bit word can hold a 4-digit decimal number (wasting about 2.5 bits per word).
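As a sketch of why 16-bit elements work: a single base-10000 limb multiplication stays comfortably within 32 bits (9999 * 9999 = 99980001 < 2^32). The function name here is mine:

#include <stdint.h>

static uint32_t mul_limb(uint16_t a, uint16_t b, uint16_t *lo)
{
    uint32_t p = (uint32_t)a * b;    /* full product fits in 32 bits */
    *lo = (uint16_t)(p % 10000);     /* low decimal limb */
    return p / 10000;                /* carry into the next limb */
}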

Rounding shortcuts in C

I am working in C to implement pseudo-code that says:
delay = ROUND(64*(floatDelay - intDelay))
where intDelay = (int) floatDelay
The floatDelay will always be positive. Is there an advantage to using the round function from math.h:
#include <math.h>
delay = (int)round(64*(floatDelay - intDelay));
or can I use:
delay = (int)(64*(floatDelay - intDelay) + 0.5);
There aren't any advantages that I know of, other than the fact that it may not be immediately obvious to other programmers that the cast to int works like a trunc...
Whereas with the round function, your intentions are clear.
You should always use the appropriate math libraries when dealing with floating-point numbers. A float may be only a very close approximation of the actual value, and that can cause weirdness.
For instance, 5.0f might be approximated as 4.9999999..., and if you try to cast directly to int it will be truncated to 4.
To see why in depth, you should look up floating-point numbers on Wikipedia. In short, instead of storing the number as a straight series of bits like an int, it's stored in two parts: a "fraction" and an exponent, where the final value of the float is fraction * (base ^ exponent).
Either is fine, provided, as you say, floatDelay is positive.
It's possible that one is marginally faster than the other, though it would be hard to tell which without benchmarking, given that round() is quite possibly implemented as a compiler intrinsic. It's even more likely that any speed difference is overwhelmingly unimportant, so use whichever you feel is clearer.
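A quick sketch checking that both variants agree for a typical positive input (the sample value is arbitrary):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double floatDelay = 2.3;
    int    intDelay   = (int)floatDelay;
    /* the two variants from the question; positive input assumed */
    int a = (int)round(64 * (floatDelay - intDelay));
    int b = (int)(64 * (floatDelay - intDelay) + 0.5);
    printf("%d %d\n", a, b);   /* prints "19 19" here */
    return 0;
}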
"For instance, 5.0f might be approximated as 4.9999999..., and if you try to cast directly to int it will be truncated to 4."
Is this really true? If you make sure you add the 0.5 before you truncate to int, is 4.9999 really a problem?
I mean:
4.9999 + 0.5 = 5.4999 -> 5
/Johan
