Non-commutative combination of two byte arrays

If I want to combine two numbers (Int, Long, ...) n1, n2 in a non-commutative way, p*n1 + n2 where p is an arbitrary prime seems reasonable enough a choice.
As many hashing options return a byte array, though, I am now trying to substitute the numbers with byte arrays.
Assume a,b:Array[Byte] are of the same length.
+ simply becomes an xor
but what should I use as a "Multiplication"?
p: Long an (arbitrary) prime, a: Array[Byte] of arbitrary length
I could, of course, convert a to a long, multiply, then convert the result back to an Array of Bytes. The problem with that is that I will need "p*a" to be of the same length as a for the subsequent xor to make sense. I could circumvent this by zero-extending the shorter of the two byte arrays, but then the byte arrays quickly grow in length.
I could, on the other hand, convert p to a byte array and xor it with a. Here, the issue is that then (p*(p*a+b)+c) becomes (a+b+c), which is commutative, which we don't want.
I could add p to every byte in the array (throwing away the overflow).
I could add p to every byte in the array (not throwing away the overflow).
I could circular shift a by some f(p) bits (and hope it doesn't end up becoming a again)
And I could think of a lot more nonsense. But what should I do? What actually makes sense?

If you want to mimic the original idea of multiplying by a prime, the obvious generalization is to do arithmetic in the Galois field GF(2^8) - see https://en.wikipedia.org/wiki/Finite_field_arithmetic and note that you can essentially use log and antilog tables of size 256 to replace multiplication with little more than a table lookup - https://en.wikipedia.org/wiki/Finite_field_arithmetic#Implementation_tricks. Arithmetic over a finite field of any sort will have many of the nice properties of arithmetic modulo a prime - arithmetic modulo p is GF(p), or GF(p^1) if you prefer.
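A minimal sketch of this in C. The reduction polynomial (0x11B, the one AES uses) and the multiplier p = 0x1D are arbitrary choices for illustration - any irreducible degree-8 polynomial and any field element other than 0 or 1 would do:

```c
#include <stdint.h>

/* Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1
 * (0x11B). This is the shift-and-add ("Russian peasant") method; the
 * log/antilog tables mentioned above are a precomputed, faster version
 * of the same operation. */
uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t product = 0;
    while (b) {
        if (b & 1)
            product ^= a;          /* addition in GF(2^8) is XOR */
        b >>= 1;
        uint8_t carry = a & 0x80;  /* would the shift exceed degree 7? */
        a <<= 1;
        if (carry)
            a ^= 0x1B;             /* reduce: XOR in the low bits of 0x11B */
    }
    return product;
}

/* Combine equal-length byte arrays non-commutatively, byte by byte:
 * out[i] = p * a[i] + b[i] in GF(2^8). */
void gf_combine(uint8_t *out, const uint8_t *a, const uint8_t *b, int n) {
    const uint8_t p = 0x1D;        /* arbitrary element, not 0 or 1 */
    for (int i = 0; i < n; i++)
        out[i] = gf256_mul(p, a[i]) ^ b[i];
}
```

Because p has a multiplicative inverse in the field, p*a never collapses to something order-independent the way plain XOR with p does.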
However, this is all rather untried and perhaps a little high-flown. Other options include checksum algorithms such as https://en.wikipedia.org/wiki/Adler-32, or - if you already have a hash algorithm that maps long strings into a short array of bytes - simply concatenating the two byte arrays to be combined and running the result through the hash algorithm again, perhaps with some padding before and after to give you parameters you can play with if you need to vary or tune things.
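A sketch of the concatenate-and-rehash idea, using 32-bit FNV-1a purely as a stand-in for whatever hash you already have, and an arbitrary 0xFF separator byte as the tunable padding:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in hash: 32-bit FNV-1a. Substitute your real hash here. */
static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Combine by hashing a || sep || b. Concatenation is ordered, so
 * combine_hashes(a, b) != combine_hashes(b, a) in general. */
uint32_t combine_hashes(const uint8_t *a, size_t alen,
                        const uint8_t *b, size_t blen) {
    uint8_t buf[512];                  /* assumes alen + blen + 1 <= 512 */
    memcpy(buf, a, alen);
    buf[alen] = 0xFF;                  /* separator byte - a free parameter */
    memcpy(buf + alen + 1, b, blen);
    return fnv1a(buf, alen + 1 + blen);
}
```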

Related

Splitting number into bit halves

I'm implementing Karatsuba's method as part of an exercise. Karatsuba's method itself isn't terribly difficult, but one part of it is confusing me. Both numbers being multiplied have to be split into two halves, the high and the low bits. But I can't find much information about how this split is done.
I noticed most Karatsuba implementations use strings to represent huge numbers, but I'm doing something a bit different. I'm representing them as an array of ints, where each element is the next 30 bits of the huge number. Note that this means these arrays may be odd-length. If the huge number's size is not a multiple of 30, it gets leading zeros so it can still be represented as such.
So how can this be split into high and low halves? The main problem I'm running into is that since it can be odd-length, that means I can't just divide the arrays by their elements. Basically, how can I select the first and last bit halves of these int arrays so I can continue recursing in Karatsuba's method?
As long as I can retrieve the bits, I can create two smaller int arrays from them.
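One way to do the bit-level split, assuming (this is an assumption - the question doesn't fix the layout) that limb 0 holds the least significant 30 bits; if your arrays are most-significant-first, reverse the indexing accordingly. Splitting at a bit position rather than a limb boundary is what lets odd-length limb arrays be halved evenly:

```c
#include <stdint.h>

#define LIMB_BITS 30
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

/* Split a number stored as little-endian 30-bit limbs at bit position
 * `split`: `lo` receives the bottom `split` bits, `hi` receives the
 * rest shifted down. Caller provides large-enough output buffers. */
void split_at_bit(const uint32_t *limbs, int nlimbs, int split,
                  uint32_t *lo, int *nlo, uint32_t *hi, int *nhi) {
    *nlo = (split + LIMB_BITS - 1) / LIMB_BITS;
    for (int i = 0; i < *nlo; i++)
        lo[i] = limbs[i];
    int rem = split % LIMB_BITS;       /* bits kept low in boundary limb */
    if (rem)
        lo[*nlo - 1] &= (1u << rem) - 1u;

    /* High half: the whole number shifted right by `split` bits. */
    int limb_shift = split / LIMB_BITS;
    *nhi = nlimbs - limb_shift;
    for (int i = 0; i < *nhi; i++) {
        uint32_t cur = limbs[i + limb_shift] >> rem;
        if (rem && i + limb_shift + 1 < nlimbs)
            cur |= (limbs[i + limb_shift + 1] << (LIMB_BITS - rem)) & LIMB_MASK;
        hi[i] = cur;
    }
}
```

For example, splitting the two-limb number {5, 3} (i.e. 5 + 3*2^30) at bit 31 gives a low half of {5, 1} and a high half of {1}.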

Finding the pair of strings with the most identical letters in an array

Suppose I have an array of strings of different lengths.
It can be assumed that the strings have no repeating characters.
Using a brute-force algorithm, I can find the pair of strings that have the most identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
Is there a more efficient way to do this?
Attempt: I've tried to think of a solution using the trie data structure, but this won't work since a trie would only group together strings with similar prefixes.
I can find the pair of strings that have the most identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
I don't think you can avoid the O(n^2) loop over all pairs, and string comparison itself is O(max(length(s1), length(s2))). However, you can optimize the comparison of the strings to some extent.
As you mentioned, the strings have no duplicate characters, and I am assuming from your input that they consist only of uppercase letters. So each string can be at most 26 characters long.
For each string, we can use a bitmask. And for each character of a string, we can set the corresponding bit 1. For example:
ABCGH
11000111 (written MSB to LSB: the bits for A, B, C, G, H are set)
Thus, we have n bit-masks for n strings.
Way #1
Now you can check all possible pairs of strings in an O(n^2) loop and compare each pair by ANDing the two corresponding masks and counting the set bits of the result (its Hamming weight). This is an obvious improvement over your version because the string comparison is now just an AND of two 32-bit integers - an O(1) operation.
For example for any two strings comparison will be:
ABCDG
ABCEF
X1 = mask(ABCDG) => 1001111
X2 = mask(ABCEF) => 0110111
X1 AND X2 => 0000111
hamming weight(0000111) => 3 // number of set bits
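In C, the whole mask-and-popcount comparison might look like this (the manual popcount loop is only for self-containment; `__builtin_popcount` on GCC/Clang or `Integer.bitCount` in Java would do the same in one call):

```c
#include <stdint.h>

/* Build the 26-bit mask of a string of distinct uppercase letters. */
uint32_t mask_of(const char *s) {
    uint32_t m = 0;
    for (; *s; s++)
        m |= 1u << (*s - 'A');   /* bit i <=> letter 'A' + i is present */
    return m;
}

/* Number of letters two strings share: AND the masks, count set bits. */
int common_letters(const char *s1, const char *s2) {
    uint32_t x = mask_of(s1) & mask_of(s2);
    int count = 0;               /* Hamming weight of x */
    while (x) {
        x &= x - 1;              /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

In the O(n^2) loop you would of course precompute all n masks once rather than rebuilding them per pair.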
Way #2
Now, one observation: ANDing two set bits gives 1. So the AND of two masks has a set bit exactly where both strings share a character - the pair with the most matched characters is the pair that maximizes the Hamming weight (total number of set bits) of the AND of its two masks.
Now build a Trie over all masks - each level of the trie corresponds to one bit, and a node branches on whether that bit is 0 or 1. Insert each mask from MSB to LSB. Before inserting the ith mask into the trie (already holding i - 1 masks), query it: recursively try to maximize the Hamming weight of the AND by following the same bit's branch (to make that bit 1 in the final AND value), but also explore the opposite bit's branch, because later levels might contribute more set bits along that path.
Regarding this Trie part, for nice pictorial explanation, you can find a similar thread here (this works with XOR).
In the worst case we will need to traverse many branches of the trie to maximize the Hamming weight - around 6 * 10^6 operations (roughly a second on a typical machine) - and we also need additional space for the trie. But if the total number of strings is 10^5, an O(n^2) algorithm needs about 10^10 operations, which is far too much - so the trie approach is still far better.
Let me know if you're having problems with the implementation. Unfortunately, I can only help you with code if you're a C/C++ or Java person.
Thanks @JimMischel for pointing out a major flaw. I slightly misunderstood the statement at first.

lightweight (quasi-random) integer fingerprint of C string

I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, especially such similar strings as "ab" and "ba"
I want it to be difficult to invert back from the fingerprint to the string (well, my string is typically longer than 32 bits, which means that many strings would map into the same integer), which means again that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result,
I want the function source to be small
I want the function to be fast.
one usage is security (but not encryption) related. I can ask a user for a text password, convert it into an integer for storage and later test whether this integer is correct. (I know I could store strings, but I don't want to. guessing a 32-bit integer correctly is impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing. another use of this function is as the start of a hash index function (mod array length) into an array.)
alas, I am probably reinventing the wheel here. such functions have probably been written a million times, and by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight. the use is different.
my first thinking was
mod 64 each character to take advantage of the ASCII text aspect. now I have 6 bits. call this x.
I can place a 6bit string into 5 locations in a 32-bit space, leaving 2 bits over.
take the current string index position (0, 1, 2...), mod5 it to determine where I want to start to place my x into my running integer result code. XOR my x into this running-result integer.
use the remaining 2 bits to increment a counter [mod 4 to prevent overflow] for each character processed.
then I thought that bit operations may be computer-fast but take more source code. I can think of other choices. take each index position i and multiply it by an ascii representation of each character [or the x from above], and call this y[i]. now do the following:
calculate the natural logarithm of the sums of the y (or this sum plus the running result), and just pretend that the first 32 bits of this result [maybe leaving off the first few bits], which are really a double, are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
do it even cheaper. just add the y's, and then do the logarithm with 32-bit pickoff just once at the end. alternatively, run a sum-y through srand as a seed and grab a rand.
there are probably a few other ways to do it, too. in sum, the function should map strings into very different integers, be short to code, and be very fast.
Any pointers?
A common method of generating a non-reversible digest or hash of a string is to generate a Cyclic Redundancy Checksum (CRC).
Source for CRC is widely available; in this case you should use a common CRC-32 such as that used by Ethernet. Different CRCs work on the same principle, but use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal.
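A minimal bitwise CRC-32 in C using the reflected Ethernet/zip/PNG polynomial 0xEDB88320 (table-driven versions are much faster; this is just the shortest correct form):

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-32: initial value 0xFFFFFFFF, process LSB-first, final inversion.
 * The standard check value is crc32("123456789") == 0xCBF43926. */
uint32_t crc32(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```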
What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash is indeed designed with security in mind, as stated on its homepage. MurmurHash has two versions that return a 32-bit and a 64-bit output. SipHash returns a 64-bit output.

Use GMP types (mpf_t/mpz_t) as keys in a hashtable

I need to use GMP's mpf_t/mpz_t as keys in a hashtable.
Is there any efficient way to access the raw bytes of the number representation so I can run a hash function over them?
I already read the documentation but I don't really feel smarter now. ;)
Thanks!
Regards,
Ethon
Out of curiosity, why use hashing when you can sort by value? Comparison is very quick, as it compares the bytes / limbs from MSB to LSB, returning a result as soon as they differ.
You can access the raw data using the platform-dependent mp_limb_t type. Both mpz_t and mpf_t have an mp_limb_t vector stored at the address specified by _mp_d, with the number of significant limbs given by the absolute value of the _mp_size field. (the definitions are in gmp.h)
Of course, if the hash function expects an 8-bit byte vector, you will need to convert the limb vector. Fortunately, the number of bits in an mp_limb_t - GMP_LIMB_BITS - is always going to be divisible by 8 on any sane platform.

How to find the 0 in an integer array of size 100, where 99 elements are 1 and one element is 0, in the most efficient way

I need to find the position (or index) i of an integer array A of size 100 such that A[i] = 0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way of solving this problem (so no one-by-one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
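A sketch of that binary chop in C, assuming element i is stored as bit i of a two-word array, with the 28 spare top bits of the second word set to 1:

```c
#include <stdint.h>

/* Find the index of the single zero bit among bits 0..99 of words[0..1].
 * One 64-bit comparison rejects 64 candidates at once; halving the
 * remaining word then takes 6 more comparisons - 7 in total. */
int find_zero_bit(const uint64_t words[2]) {
    int base = 0;
    uint64_t w = words[0];
    if (w == 0xFFFFFFFFFFFFFFFFull) {      /* all ones: zero is in word 1 */
        base = 64;
        w = words[1];
    }
    int width = 64;
    while (width > 1) {
        width /= 2;
        uint64_t low_mask = (1ull << width) - 1;
        if ((w & low_mask) == low_mask) {  /* low half all ones? */
            base += width;                 /* zero is in the high half */
            w >>= width;
        } else {
            w &= low_mask;                 /* zero is in the low half */
        }
    }
    return base;
}
```

(On real hardware you would likely just do `__builtin_ctzll(~w)` or equivalent, which is the same idea implemented in a single instruction.)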
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do better than a linear scan - unless the data is sorted or you have some extra information about it. At the very least you need to read all the data, since you have no clue where the 0 is hiding.
If the data is sorted, just access the relevant (minimum) location directly.
Something tells me that the expected answer is "compare pairs":
while (a[i] == a[i+1]) i += 2;
Although it looks better than the obvious approach, it's still O(n).
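Fleshed out, the pair-compare idea might look like this (the bounds guard and the final check are additions not in the one-liner above):

```c
/* Pair scan: every pair not containing the zero is (1, 1), so the
 * first unequal pair pinpoints it. About n/2 comparisons, still O(n). */
int find_zero_by_pairs(const int *a, int n) {
    int i = 0;
    while (i + 1 < n && a[i] == a[i + 1])
        i += 2;
    /* Either pair (i, i+1) is unequal, or n is odd and a[i] itself
     * must be the zero. A zero is guaranteed to exist. */
    return (a[i] == 0) ? i : i + 1;
}
```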
Keep track of it as you insert to build the array. Then just access the stored value directly. O(1) with a very small constant.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear.
Grover's algorithm
