lightweight (quasi-random) integer fingerprint of C string - c

I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, esp such similar strings as "ab" and "ba"
I want it to be difficult to invert back from the fingerprint to the string (well, my string is typically longer than 32 bits, which means that many strings would map into the same integer), which means again that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result,
I want the function source to be small
I want the function to be fast.
One usage is security-related (but not encryption). I can ask a user for a text password, convert it into an integer for storage, and later test whether this integer is correct. (I know I could store strings, but I don't want to.) Guessing a 32-bit integer correctly is impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing. Another use of this function is as the start of a hash index function (mod array length) into an array.
alas, I am probably reinventing the wheel here. such functions have probably been written a million times, and by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight. the use is different.
my first thinking was
mod 64 each character to take advantage of the ASCII text aspect. now I have 6 bits. call this x.
I can place a 6bit string into 5 locations in a 32-bit space, leaving 2 bits over.
take the current string index position (0, 1, 2...), mod5 it to determine where I want to start to place my x into my running integer result code. XOR my x into this running-result integer.
use the remaining 2 bits to increment a counter [mod 4 to prevent overflow] for each character processed.
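The first scheme above can be sketched roughly as follows. This is only one reading of the description: the function name xor_fingerprint, the choice to put the counter in the top two bits, and the slot order are all assumptions made for illustration.

```c
#include <stdint.h>

/* Sketch of the "five 6-bit slots plus a 2-bit counter" idea.
 * Assumed layout: bits 0..29 hold the five slots, bits 30..31 hold
 * the number of characters processed, mod 4. */
uint32_t xor_fingerprint(const char *s)
{
    uint32_t slots = 0;   /* running XOR of the 6-bit values */
    uint32_t count = 0;   /* character counter, kept mod 4   */

    for (uint32_t i = 0; s[i] != '\0'; i++) {
        uint32_t x = (unsigned char)s[i] % 64u;  /* keep the low 6 bits        */
        slots ^= x << (6u * (i % 5u));           /* slot chosen by position    */
        count = (count + 1u) % 4u;
    }
    return (count << 30) | slots;
}
```

With this layout "ab" and "ba" do get different codes, but changing a single character only changes one 6-bit slot, so similar strings still produce similar integers.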
then I thought that bit operations may be computer-fast but take more source code. I can think of other choices. take each index position i and multiply it by an ascii representation of each character [or the x from above], and call this y[i]. now do the following:
calculate the natural logarithm of the sums of the y (or this sum plus the running result), and just pretend that the first 32 bits of this result [maybe leaving off the first few bits], which are really a double, are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
do it even cheaper. just add the y's, and then do the logarithm with 32-bit pickoff just once at the end. alternatively, run a sum-y through srand as a seed and grab a rand.
there are probably a few other ways to do it, too. in sum, the function should map strings into very different integers, be short to code, and be very fast.
Any pointers?

A common method of generating a non-reversible digest or hash of a string is to generate a Cyclic Redundancy Check (CRC).
Source for CRC is widely available; in this case you should use a common CRC-32 such as that used by Ethernet. Different CRCs work on the same principle, but use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal.
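As a minimal sketch of what such a function looks like, here is a bit-at-a-time CRC-32 using the reflected polynomial 0xEDB88320 (the one used by Ethernet and zlib). The function name crc32_string is made up for this example; table-driven versions are faster but longer.

```c
#include <stdint.h>

/* Bitwise CRC-32 of a NUL-terminated string.
 * Initial value 0xFFFFFFFF, reflected polynomial 0xEDB88320,
 * final bitwise inversion. */
uint32_t crc32_string(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (; *s != '\0'; s++) {
        crc ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = -(crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

The standard check value for this CRC is 0xCBF43926 for the input "123456789". For a hash table index you would take the result modulo the array length.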

What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash, on the other hand, is indeed designed with security in mind, as stated on its homepage. MurmurHash has 2 versions that return a 32-bit and a 64-bit output. SipHash returns a 64-bit output.

Related

How to manipulate big integers for the RSA algorithm in C?

I want to implement the RSA cryptosystem in C. Right now I can encrypt values that fit in one byte (which I know is way too small for any security), but when I increase the size of the primes p and q (and thus of the modulus n = pq), and the size of the encrypted values, it doesn't work.
I believe I know the reasons why my code fails:
the value to be encrypted must be less than the n value (which is not always the case), and
the value of n = pq that I calculate is incorrect, because the real value can't be stored in the variable type that I'm using, and overflows instead.
My question is, how can I use big numbers (pack into 256 bytes, 512 bytes, etc.) and manipulate them in C? Must I use libraries (e.g. GMP)?
The C language has no native support for arbitrary-precision integer ("bignum") arithmetic, so you will either have to use a library that provides it (I've heard that GMP is a popular choice) or write your own code to handle it.
If you choose the do-it-yourself path, I would recommend representing your numbers as arrays of some reasonably large native unsigned integer type (e.g. uint32_t or uint64_t), with each array element representing a digit in base 2^k, where k is the number of bits in the underlying native integers.
For RSA, you don't need to worry about representing negative values, since all the math is done with numbers ranging from 0 up to the RSA modulus n. If you want, you can also take advantage of the upper limit n to fix the number of base 2^k digits used for all the values in a particular RSA instance, so that you don't have to explicitly store it alongside each digit array.
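To make the digit-array idea concrete, here is a sketch of schoolbook multiplication with k = 32, which is the operation needed to compute n = pq without overflow. The name bignum_mul and the little-endian digit order are choices made for this example, not something the answer prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* result[0 .. na+nb-1] = a * b.
 * Digits are base 2^32, least significant digit first.
 * result must have room for na + nb digits and must not alias a or b. */
void bignum_mul(uint32_t *result,
                const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    for (size_t i = 0; i < na + nb; i++)
        result[i] = 0;

    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* The 64-bit intermediate cannot overflow:
             * (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)a[i] * b[j] + result[i + j] + carry;
            result[i + j] = (uint32_t)t;   /* low 32 bits stay in place  */
            carry = t >> 32;               /* high 32 bits move up a digit */
        }
        result[i + nb] = (uint32_t)carry;
    }
}
```

Modular reduction and exponentiation would be built on top of routines like this; a library such as GMP already provides all of them.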
Ps. Note that "textbook RSA" is not a secure encryption scheme by itself. To make it semantically secure, you need to also include a suitable randomized padding scheme such as OAEP. Also, padded or not, normal RSA can only encrypt messages shorter than the modulus, minus the length taken up by padding, if any. To encrypt longer messages, the standard solution is to use hybrid encryption, where you first encrypt the message using a symmetric encryption scheme (I would recommend AES-SIV) with a random key, and then encrypt the key with RSA.

Why is infinity = 0x3f3f3f3f?

In some situations, one generally uses a large enough integer value to represent infinity. I usually use the largest representable positive/negative integer. That usually yields more code, since you need to check if one of the operands is infinity before virtually all arithmetic operations in order to avoid overflows. Sometimes it would be desirable to have saturated integer arithmetic. For that reason, some people use smaller values for infinity, that can be added or multiplied several times without overflow. What intrigues me is the fact that it's extremely common to see (especially in programming competitions):
const int INF = 0x3f3f3f3f;
Why is that number special? Its binary representation is:
00111111001111110011111100111111
I don't see any specially interesting property here. I see it's easy to type, but if that was the reason, almost anything would do (0x3e3e3e3e, 0x2f2f2f2f, etc). It can be added once without overflow, which allows for:
a = min(INF, b + c);
But all the other constants would do, then. Googling only shows me a lot of code snippets that use that constant, but no explanations or comments.
Can anyone spot it?
I found some evidence about this here (original content in Chinese); the basic idea is that 0x7fffffff is problematic since it's already "the top" of the range of 4-byte signed ints; so, adding anything to it results in negative numbers; 0x3f3f3f3f, instead:
is still quite big (same order of magnitude of 0x7fffffff);
has a lot of headroom; if you say that the valid range of integers is limited to numbers below it, you can add any "valid positive number" to it and still get an infinite value (i.e. something >=INF). Even INF+INF doesn't overflow. This lets you keep it always "under control":
a+=b;
if(a>INF)
a=INF;
is a repetition of equal bytes, which means you can easily memset stuff to INF;
also, as @Jörg W Mittag noticed above, it has a nice ASCII representation, that allows both to spot it on the fly looking at memory dumps, and to write it directly in memory.
I may or may not be one of the earliest discoverers of 0x3f3f3f3f. I published a Romanian article about it in 2004 (http://www.infoarena.ro/12-ponturi-pentru-programatorii-cc #9), but I've been using this value since 2002 at least for programming competitions.
There are two reasons for it:
0x3f3f3f3f + 0x3f3f3f3f doesn't overflow int32. For this some use 1000000000 (one billion).
one can set an array of ints to infinity by doing memset(array, 0x3f, sizeof(array))
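Both properties can be checked with a few lines of C. This is a small sketch assuming a 32-bit int; the helper names fill_with_inf and saturating_add are invented for the example.

```c
#include <string.h>

static const int INF = 0x3f3f3f3f;   /* 1061109567 */

/* Every byte of the array becomes 0x3f, so every 4-byte int becomes INF. */
void fill_with_inf(int *array, size_t count)
{
    memset(array, 0x3f, count * sizeof *array);
}

/* Assumes 0 <= a, b <= INF, so a + b is at most 2122219134,
 * which still fits in a 32-bit signed int. */
int saturating_add(int a, int b)
{
    int sum = a + b;
    return sum > INF ? INF : sum;
}
```

The same memset trick would not work for 0x7fffffff, because memset writes one repeated byte value and that constant's bytes are not all equal.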
0x3f3f3f3f is the ASCII representation of the string ????.
Krugle finds 48 instances of that constant in its entire database. 46 of those instances are in a Java project, where it is used as a bitmask for some graphics manipulation.
1 project is an operating system, where it is used to represent an unknown ACPI device.
1 project is again a bitmask for Java graphics.
So, in all of the projects indexed by Krugle, it is used 47 times because of its bitpattern, once because of its ASCII interpretation, and not a single time as a representation of infinity.

Interview : Hash function: sine function

I was asked this interview question. I am not sure what the correct answer for it is (and the reasoning behind the answer):
Is sin(x) a good hash function?
If you mean sin(), it's not a good hashing function because:
it's quite predictable and for some x it's no better than just x itself. There should be no seemingly apparent relationship between the key and the hash of the key.
it does not produce an integer value. You cannot index/subscript arrays with floating-point indices and there must be some kind of array in the hash table.
floating-point is very implementation-specific and even if you make a hash function out of sin(), it may not work with a different compiler or on a different kind of CPU/computer.
sin() may be much slower than some simpler integer-arithmetic function.
Not really.
It's horribly slow.
You'll need to convert the result to some integer type anyway to avoid the insanity of floating-point equality comparisons. (Not actually the usual precision problems that are endemic to FP equality comparisons and which arise from calculating two things slightly different ways; I mean specifically the problems caused by things like the fact that 387-derived FPUs store extra bits of precision in their registers, so if a comparison is done between two freshly-calculated values in registers you could get a different answer than if exactly one of the operands was loaded into a register from memory.)
It's almost flat near the peaks and troughs, so the quantisation step (multiplying by some large number and rounding to an integer) will produce many hash values near the min and max, rather than an even distribution.
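The clustering near the extremes can be quantified. As a rough sketch, assume the keys are spread uniformly over one period; then the hash value before quantisation follows the arcsine distribution:

```latex
\begin{align*}
X \sim \mathrm{Uniform}[0, 2\pi),\quad Y = \sin X
  \;&\Longrightarrow\;
  f_Y(y) = \frac{1}{\pi\sqrt{1 - y^2}}, \qquad -1 < y < 1, \\
P\bigl(|Y| > 0.9\bigr) &= 1 - \frac{2}{\pi}\arcsin(0.9) \approx 0.287 .
\end{align*}
```

So under that assumption the outer 10% of the output range receives about 29% of the keys, and the density grows without bound as y approaches -1 or 1.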
Based off of mathematical knowledge:
Sine(x) is periodic so it's going to reach the same number from different values of x, so Sine(x) would be awful as a hashing function because you will get multiple values hashing to the exact same point. There are **infinitely many values between 0 and 2*pi for the return value, but then past that the values will repeat. So 0 & pi & 2*pi will all hash to the same point.
If you could make the increment small enough and have Sine(x) multiplied by say x^2 or something of that nature it'd be mediocre at best, but then again, if you were to do that why not just use x^2 anyway and toss out the periodic function altogether.
**infinitely: a large enough number that I'm not willing to count.
NOTE: Sine(x) will have values that are small and could be affected by rounding error.
NOTE: Any value taken from a sine function should be multiplied by an integer and then either modded or the floor or ceiling taken so that the value can be used as an array offset, etc.
sin(x) is trigonometric function which repeats itself after every 360 degrees, so it's going to be a poor hash function as the hash will be repeated too often.
A simple refutation:
sin(0) == sin(360) == sin(720) == sin(..)
This is not a property of a good hash function.
Even if you decide to use it, it's difficult to represent the value returned by sin.
Sin function:
sin x = x - x^3/3! + x^5/5! - ...
This can't be accurately represented because of floating-point precision issues, which means that for the same value it may produce two different hashes!
Another point to note:
For sine(x) as hash function - Keys in a given close range will have hash values in close range too, it is not desirable. A good hash function evenly distributes hash values irrespective of the nature of the keys.
Hash values generally have to be integers to be useful. Since sin doesn't generate integers it wouldn't be appropriate.
Let's say we have a string s. It can be expressed as a number in hexadecimal and fed to the function. If you added 2 pi it would cease to be a valid input, as it wouldn't be an integer anymore (only non-negative integers are accepted by the function). You have to find a string that gives a collision, not just multiply the hex expression of the string with 2 pi. And adding (concatenating?) 2 pi directly to the string wouldn't help finding a collision. There might be another way though but not that trivial.
I think sin(x) can make an excellent cryptographic hash function, if used wisely. The input should be a natural number in radians and never contain pi. We must use arbitrary-precision arithmetic. For every natural number x (radians), sin(x) is always a transcendental irrational number and there is no other natural number with the same sine. But there's a catch: an attacker could gain information about the input by computing the arcsin of the hash. In order to prevent this, we ignore the integer part and some of the first digits of the fractional part, keeping only the next n (say 100) digits, making such an attack computationally infeasible.
It seems that a small change in the input gives a completely different result, which is a desirable property. The result of the function seems statistically random, again a good property. I'm not sure how to prove that it is collision-resistant, but I can't see why it couldn't be. Also, I can't think of a way to find a specific input that results in a specific hash. I'm not saying that we should blindly believe that it is certainly a good cryptographic hash function. I just think that it seems like a good candidate to be one. We should give it a chance and focus on proving that it is. And it might be a very good one.
To those who might say it is slow: yes, it is. And that's good when hashing passwords.
Here I'm attaching some Perl code for this idea. It runs on Linux with bash and bc (bc is a command-line arbitrary-precision calculator, included in most distros).
I'll be checking this page for any answers, since this interests me a lot. Don't be harsh though, I'm just a CS undergrad, willing to learn more.
use warnings;
use strict;
my $input='5AFF36B7';#Input for bc (as a hex number)
$input='1'.$input;#put '1' in front of input, so that 0x0 , 0x00 , 0x1 , 0x01 , etc ... ,
#all give different nonzero results
my $a=`bc -l -q <<< "scale=256;obase=16;ibase=16;s($input)"`;#call bc, keep result in $a
#keep only fractional part
$a=~tr/a-zA-Z0-9//cd;#Clean up string, keep only alphanumerics
my @m = $a =~ /./g;#Convert string to array of chars
#PRINT OUTPUT
#We ignore some digits, for security reasons:
#If we don't ignore any of the first digits, an attacker could gain information
#about the input by computing the inverse of sin (the arcsin of the hash)
#By ignoring enough of the first digits, it becomes computationally
#infeasible to compute arcsin
#Also, to avoid problems with roundoff error, we ignore some of the last digits
for (my $c=100;$c<200;$c++){
print $m[$c];
}

What is the most efficient way to store and work with a floating point number with 1,000,000 significant digits in C?

I'm writing a utility to calculate π to a million digits after the decimal. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number accurate to the millionth digit?
clarification: The language would be C.
Forget floating point, you need bit strings that represent integers
This takes a bit less than 1/2 megabyte per number. "Efficient" can mean a number of things. Space-efficient? Time-efficient? Easy-to-program with?
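The "bit less than 1/2 megabyte" figure can be checked with a quick calculation, counting only the million digits after the decimal point:

```latex
\begin{align*}
10^{6}\ \text{decimal digits} \times \log_2 10
  &\approx 10^{6} \times 3.3219 \approx 3.32 \times 10^{6}\ \text{bits} \\
  &\approx 4.15 \times 10^{5}\ \text{bytes}
   \approx 1.04 \times 10^{5}\ \text{32-bit words}.
\end{align*}
```

An array of 125000 32-bit words (500,000 bytes) therefore leaves some headroom for guard digits.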
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that our data is only known to a few significant figures and even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and then simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really really need is a string of bits. This is simply an array of convenient types. I suggest <stdint.h> and simply using uint32_t[125000] (or 64) to get started. This actually might be a great use of the more obscure constants from that header that pick out bit sizes that are fast on a given platform.
To be more specific we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports bignums, like Ruby?
Then the storage is someone else's problem. But, if what you really want to do is implement a big number package, then I might suggest using bcd (4-bit) strings or even ordinary ascii 8-bit strings with printable digits, simply because things will be easier to write and debug and maximum space and time efficiency may not matter so much.
I'd recommend storing it as an array of short ints, one per digit, and then carefully write utility classes to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number - so use its exact representation. This isn't the most efficient way in terms of space, but a million ints isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million digit numbers.
Of course, if you want to get the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
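The answer does not name the algorithm; it is presumably the Bailey-Borwein-Plouffe (BBP) formula, so take that as an assumption. Below is a minimal sketch of BBP digit extraction using plain doubles. The function names are invented for the example, and double precision limits it to modest positions (it is not a production implementation).

```c
#include <stdint.h>

/* 16^e mod m by square-and-multiply; m is small enough that
 * the 64-bit products cannot overflow. */
static uint64_t pow16_mod(uint64_t e, uint64_t m)
{
    uint64_t result = 1 % m, base = 16 % m;
    while (e > 0) {
        if (e & 1)
            result = result * base % m;
        base = base * base % m;
        e >>= 1;
    }
    return result;
}

/* Fractional part in [0, 1), also for negative x. */
static double frac(double x)
{
    x -= (double)(long long)x;
    return x < 0.0 ? x + 1.0 : x;
}

/* Fractional part of sum over k of 16^(n-k) / (8k + j). */
static double bbp_series(int j, int n)
{
    double s = 0.0;
    for (int k = 0; k <= n; k++) {          /* terms with 16^(n-k) >= 1 */
        uint64_t d = (uint64_t)(8 * k + j);
        s = frac(s + (double)pow16_mod((uint64_t)(n - k), d) / (double)d);
    }
    double p = 1.0 / 16.0;                   /* terms with 16^(n-k) < 1 */
    for (int k = n + 1; p > 1e-17; k++, p /= 16.0)
        s += p / (double)(8 * k + j);
    return frac(s);
}

/* Hex digit of pi at position n after the point (n = 0 is the first). */
int pi_hex_digit(int n)
{
    double x = 4.0 * bbp_series(1, n) - 2.0 * bbp_series(4, n)
             - bbp_series(5, n) - bbp_series(6, n);
    return (int)(16.0 * frac(x));
}
```

Since pi is 3.243F6A88... in hexadecimal, pi_hex_digit(0) gives 2 and pi_hex_digit(3) gives 15 (F). Each call only needs a handful of doubles, no matter how far out the digit is.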
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as GNU Multiprecision. Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers.
If you are just doing this for fun/learning, then represent numbers as an array of chars, with each array element storing one decimal digit. You'll have to implement long addition, long multiplication, etc.
Try PARI/GP, see wikipedia.
You could store its decimals digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need good precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representations.
I am not too good with complex math, but I reckon there are solutions which are elegant when storing numbers with millions of bits of precision.
IMO, any programmer of arbitrary-precision arithmetic needs an understanding of base conversion. It solves two problems at once: calculating pi in hex digits and converting the result to decimal, and finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript one always has 53 bits of accuracy, meaning that a Uint32Array holding numbers of at most 26 bits each can be processed natively (a waste of 6 bits per word).
In a 32-bit architecture with C/C++ one can easily get A*B mod 2^32, suggesting a basic element of 16 bits. (Those can be parallelized in many SIMD architectures, starting from MMX.) Each 16-bit word can also hold a 4-digit decimal number (wasting about 2.7 bits per word).
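As an illustration of the 16-bit, 4-decimal-digits-per-word container, here is a sketch of schoolbook multiplication in base 10000. The name mul_base10000 and the least-significant-word-first order are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

enum { BASE = 10000 };   /* four decimal digits per 16-bit word */

/* result[0 .. na+nb-1] = a * b, words in base 10000,
 * least significant word first. result must not alias a or b. */
void mul_base10000(uint16_t *result,
                   const uint16_t *a, size_t na,
                   const uint16_t *b, size_t nb)
{
    for (size_t i = 0; i < na + nb; i++)
        result[i] = 0;

    for (size_t i = 0; i < na; i++) {
        uint32_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* At most 9999*9999 + 9999 + 9999, which is below 2^32. */
            uint32_t t = (uint32_t)a[i] * b[j] + result[i + j] + carry;
            result[i + j] = (uint16_t)(t % BASE);
            carry = t / BASE;
        }
        result[i + nb] = (uint16_t)carry;
    }
}
```

Because the base is a power of ten, printing needs no base conversion: emit the words from most significant to least, zero-padding each to four digits.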

Convert really big number from binary to decimal and print it

I know how to convert binary to decimal. I know at least 2 methods: table and power ;-)
I want to convert binary to decimal and print this decimal. Moreover, I'm not interested in this `decimal'; I want just to print it.
But, as I wrote above, I know only 2 methods to convert binary to decimal and both of them required addition. So, I'm computing some value for 1 or 0 in binary and add it to the remembered value. This is a thin place. I have a really-really big number (1 and 64 zeros). While converting I need to place some intermediate result in some 'variable'. In C, I have an `int' type, which is only 4 bytes and cannot hold more than about 2*10^9.
So, I don't have enough memory to store intermediate result while converting from binary to decimal. As I wrote above, I'm not interested in THAT decimal, I just want to print the result. But, I don't see any other ways to solve it ;-( Is there any solution to "just print" from binary?
Or, maybe, I should use something like BCD (Binary Coded Decimal) for intermediate representation? I really don't want to use this, 'cause it is not so cross-platform (Intel's processors have a built-in feature, but for other I'll need to write own implementation).
I would be glad to hear your thoughts. Thanks for patience.
Language: C.
I highly recommend using a library such as GMP (GNU multiprecision library). You can use the mpz_t data type for large integers, the various import/export routines to get your data into an mpz_t, and then use mpz_out_str() to print it out in base 10.
Biggest standard integral data type is unsigned long long int - on my system (32-bit Linux on x86) it has range 0 - 1.8*10^19 which is not enough for you, so you need to create your own type (struct or array) and write basic math (basically you just need an addition) for that type.
If I were you (and memory is not an issue), I'd use an array - one byte per decimal digit rather than BCD. BCD is more compact as it stores 2 decimal digits per byte but you need to put much more effort working with high and low nibbles separately.
And to print you just add '0' (character, not digit) to every byte of your array and you get a printable string.
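A minimal sketch of that approach: keep one decimal digit per byte, and for each input bit double the whole digit array and add the bit. The function name binary_to_decimal and the fixed 128-digit work buffer are assumptions made for this example.

```c
#include <stddef.h>

/* Converts a string of '0'/'1' characters (most significant bit first)
 * to a NUL-terminated decimal string in out.
 * Returns the number of decimal digits, or 0 if a buffer is too small. */
size_t binary_to_decimal(const char *bits, char *out, size_t out_size)
{
    unsigned char digits[128] = {0};  /* one decimal digit per byte, least significant first */
    size_t len = 1;                   /* the value starts as "0" */

    for (; *bits == '0' || *bits == '1'; bits++) {
        unsigned carry = (unsigned)(*bits - '0');   /* incoming bit */
        for (size_t i = 0; i < len; i++) {          /* value = value * 2 + bit */
            unsigned t = digits[i] * 2u + carry;
            digits[i] = (unsigned char)(t % 10u);
            carry = t / 10u;
        }
        if (carry != 0) {
            if (len == sizeof digits)
                return 0;
            digits[len++] = (unsigned char)carry;
        }
    }

    if (len + 1 > out_size)
        return 0;
    for (size_t i = 0; i < len; i++)
        out[i] = (char)('0' + digits[len - 1 - i]); /* add '0' to get a character */
    out[len] = '\0';
    return len;
}
```

Feeding it a 1 followed by 64 zeros yields the 20-digit string 18446744073709551616, using only single-digit additions.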
Well, when converting from binary to decimal, you really don't need ALL the binary bits at the same time. You just need the bits you are currently calculating the power of and probably a double variable to hold the results.
You could put the binary value in an array, lets say i[64], iterate through it, get the power depending on its position and keep adding it to the double.
Converting to decimal really means calculating each power of ten, so why not just store these in an array of bytes? Then printing is just looping through the array.
Couldn't you allocate memory for, say, 5 int's, and store your number at the beginning of the array? Then manually iterate over the array in int-sized chunks. Perhaps something like:
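int *big = malloc(5 * sizeof *big); /* needs <stdlib.h>; `new` is C++, not C */
/* then fill big[0] .. big[4] with the int-sized chunks of <my big number> */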
