Hash function for a matrix - C

I have two matrices and I need to compare them, but I don't want to compare them position by position; I don't think that is the best way. I thought of hash functions. Does anyone know how to compute the hash of a matrix?

If your matrices are implemented as arrays, I'd suggest using memcmp() from string.h to determine if they are equal.
If floating-point values are involved and the values result from actual computations, there's no way around checking them value by value, as you'll have to include epsilons to accommodate numeric errors.
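A minimal sketch of both approaches, assuming the matrices are stored as flat double arrays of n elements (the function names are illustrative):

#include <cstring>   // memcmp
#include <cmath>     // std::fabs

// Exact, bitwise comparison: only meaningful when both matrices were
// filled with identical bit patterns (e.g. copied, never recomputed).
bool equal_exact(const double *a, const double *b, size_t n) {
    return std::memcmp(a, b, n * sizeof(double)) == 0;
}

// Tolerant comparison for values produced by floating-point computations;
// eps is the absolute error you are willing to accept per element.
bool equal_approx(const double *a, const double *b, size_t n, double eps) {
    for (size_t i = 0; i < n; ++i)
        if (std::fabs(a[i] - b[i]) > eps)
            return false;
    return true;
}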

You can compute a hash of the whole floating point array (as a byte sequence). If you want a comparison function able to cope with small differences in the coefficients, you can compare cheap scalars and vectors computed from each matrix. This makes sense if you have to compare each matrix against more than one other matrix. Examples that come to mind (a sketch of the first two follows the list):
trace of the matrix
vector of L0, L1, L2 norms of all columns or rows
diagonal of LU factorization
tridiagonal reduction (if symmetric)
diagonal of the eigenvalue decomposition (if possible)
diagonal of the SVD (the singular values)
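A sketch of the first two invariants, assuming Eigen (which appears elsewhere on this page) and a square matrix for the trace. Equal fingerprints do not prove equality, but differing fingerprints prove inequality:

#include <Eigen/Dense>

// Cheap invariants that can be compared with a tolerance instead of
// comparing every entry of two matrices.
struct MatrixFingerprint {
    double trace;              // sum of the diagonal entries
    Eigen::VectorXd colNorms;  // L2 norm of every column
};

MatrixFingerprint fingerprint(const Eigen::MatrixXd &m) {
    MatrixFingerprint f;
    f.trace    = m.trace();
    f.colNorms = m.colwise().norm().transpose();
    return f;
}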

First, a hash won't tell you that two matrices are equal; it can only tell you that they're distinct, because there can be (and there will be, Murphy's Law is always lurking) collisions.
You can calculate a hash by chaining any function over the elements. If you can reinterpret the elements as integer values (not truncation, but the binary representation), you could XOR all of them (but keep in mind that this wouldn't work if some values compare equal but have distinct representations, like -0 and +0, or NaN).
So my advice is to have some kind of hash function (even the sum of all the elements could be valid), precalculated (this is important: there's no point in calculating the hash each time you want to make a comparison and then comparing the hashes), to quickly discard different matrices; but if the hashes are equal, you will still have to compare each position.
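A minimal sketch of the XOR idea over the bit patterns (assuming double elements; note that plain XOR is also order-insensitive, so two matrices holding the same values in a different order collide):

#include <cstdint>
#include <cstring>

// XOR-fold the bit patterns of all elements into one 64-bit value.
// The caveats above apply: +0.0 and -0.0 compare equal but hash
// differently, and NaN has many bit patterns.
uint64_t xor_hash(const double *data, size_t n) {
    uint64_t h = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t bits;
        std::memcpy(&bits, &data[i], sizeof bits);  // bit cast, not truncation
        h ^= bits;
    }
    return h;
}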

When you say hash, I guess you want to checksum the matrices and compare the checksums to confirm equality. Assuming that each of your matrices is stored as a contiguous chunk of data, you could compute the start address and length (in bytes) of each chunk and then generate checksums for both (since you expect them to be equal sometimes, the lengths would be the same). If the checksums are the same, then with very high probability the two matrices are also equal. If correctness is critical, you can run a comparison loop over the two matrices once their checksums match. That way you will not incur the cost of the full comparison unless equality is very likely.
Wikipedia checksum reference: https://en.wikipedia.org/wiki/Checksum
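A compact sketch of the checksum-then-verify approach, assuming zlib's crc32 is available (link with -lz):

#include <zlib.h>    // crc32
#include <cstddef>
#include <cstdint>

// Checksum a contiguous matrix buffer. Compare checksums first; fall
// back to an element-wise loop only when they match.
uint32_t matrix_crc(const double *data, size_t n) {
    uLong crc = crc32(0L, Z_NULL, 0);   // initial CRC value
    return (uint32_t)crc32(crc,
                           (const Bytef *)data,
                           (uInt)(n * sizeof(double)));
}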

Related

counterintuitive speed difference between LM and shift-invert modes in scipy.sparse.linalg.eigsh?

I'm trying to find the several smallest (as in most negative, not smallest in magnitude) eigenvalues of a list of sparse Hermitian matrices in Python using scipy.sparse.linalg.eigsh. The matrices are ~1000x1000, and the list length is ~500-2000. In addition, I know upper and lower bounds on the eigenvalues of all the matrices -- call them eig_UB and eig_LB, respectively.
I've tried two methods:
Using shift-invert mode with sigma=eig_LB.
Subtracting eig_UB from the diagonal of each matrix (thus shifting the smallest eigenvalues to be the largest magnitude eigenvalues), diagonalizing the resulting matrices with default eigsh settings (no shift-invert mode and using which='LM'), and then adding eig_UB to the resulting eigenvalues.
Both methods work and their results agree, but method 1 is around 2-2.5x faster. This seems counterintuitive, since (at least as I understand the eigsh documentation) shift-invert mode subtracts sigma from the diagonal, inverts the matrix, and then finds eigenvalues, whereas default mode directly finds the largest magnitude eigenvalues. Does anyone know what could explain the difference in performance?
One other piece of information: I've checked, and the matrices that result from shift-inverting (that is, (M-sigma*identity)^(-1) if M is the original matrix) are no longer sparse, which seems like it should make finding their eigenvalues take even longer.
This is probably resolved. As pointed out in https://arxiv.org/pdf/1504.06768.pdf, you don't actually need to invert the shifted sparse matrix and then repeatedly apply the dense inverse in some Lanczos-type method; you just need to repeatedly solve the linear system (M-sigma*identity)*v(n+1)=v(n) to generate the sequence of vectors {v(n)}. Each of these solves is fast for a sparse matrix once you have its LU decomposition.
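A sketch of that factor-once, solve-many idea. The question is about scipy, which does the analogous factorization internally; this illustration uses Eigen's SparseLU, and the names here are not scipy's:

#include <Eigen/Sparse>

using SpMat = Eigen::SparseMatrix<double>;

// Factor (M - sigma*I) once; each Lanczos/Arnoldi step is then a cheap
// sparse solve instead of a product with an explicit (dense) inverse.
struct ShiftInvertOp {
    Eigen::SparseLU<SpMat> lu;

    ShiftInvertOp(const SpMat &M, double sigma) {
        SpMat I(M.rows(), M.cols());
        I.setIdentity();
        SpMat shifted = M - sigma * I;  // still sparse
        lu.compute(shifted);            // sparse LU factorization
    }

    // Applies (M - sigma*I)^(-1) * v without ever forming the inverse.
    Eigen::VectorXd apply(const Eigen::VectorXd &v) const {
        return lu.solve(v);
    }
};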

non-commutative combination of two byte arrays

If I want to combine two numbers (Int, Long, ...) n1, n2 in a non-commutative way, p*n1 + n2, where p is an arbitrary prime, seems reasonable enough a choice.
As many hashing options return a byte array, though, I am now trying to substitute the numbers with byte arrays.
Assume a,b:Array[Byte] are of the same length.
+ simply becomes an xor
but what should I use as a "Multiplication"?
p: Long is an (arbitrary) prime; a: Array[Byte] is of arbitrary length
I could, of course, convert a to a long, multiply, then convert the result back to an Array of Bytes. The problem with that is that I will need "p*a" to be of the same length as a for the subsequent xor to make sense. I could circumvent this by zero-extending the shorter of the two byte arrays, but then the byte arrays quickly grow in length.
I could, on the other hand, convert p to a byte array and xor it with a. Here, the issue is that then (p*(p*a+b)+c) becomes (a+b+c), which is commutative, which we don't want.
I could add p to every byte in the array (throwing away the overflow).
I could add p to every byte in the array (not throwing away the overflow).
I could circular shift a by some f(p) bits (and hope it doesn't end up becoming a again)
And I could think of a lot more nonsense. But what should I do? What actually makes sense?
If you want to mimic the original idea of multiplying by a prime, the obvious generalization is to do arithmetic in the Galois field GF(2^8) - see https://en.wikipedia.org/wiki/Finite_field_arithmetic and note that you can essentially use log and antilog tables of size 256 to replace multiplication with little more than table lookups - https://en.wikipedia.org/wiki/Finite_field_arithmetic#Implementation_tricks. Arithmetic over a finite field of any sort will have many of the nice properties of arithmetic modulo a prime - arithmetic modulo p is GF(p), or GF(p^1) if you prefer.
However, this is all rather untried and perhaps a little high-flown. Other options include checksum algorithms such as https://en.wikipedia.org/wiki/Adler-32 or - if you already have a hash algorithm that maps long strings into a short array of bytes - simply concatenating the two arrays of bytes to be combined and running the result through the hash algorithm again, perhaps with some padding before and after to give you some parameters you can play with if you need to vary or tune things.
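A small sketch of GF(2^8) multiplication without the tables (carry-less "Russian peasant" multiplication; the reduction polynomial x^8 + x^4 + x^3 + x + 1 is the AES choice, but any irreducible degree-8 polynomial works), plus the byte-wise non-commutative combine h = p*h + next:

#include <cstdint>
#include <cstddef>

// Multiply two elements of GF(2^8). "Addition" in this field is XOR.
uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;        // add a when the low bit of b is set
        bool carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1B;     // reduce modulo the field polynomial
        b >>= 1;
    }
    return p;
}

// Non-commutative combine of equal-length byte arrays: h = p*h XOR next.
void combine(uint8_t *h, const uint8_t *next, size_t n, uint8_t p) {
    for (size_t i = 0; i < n; ++i)
        h[i] = gf256_mul(p, h[i]) ^ next[i];
}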

Why is the array of complex numbers declared row-wise in fftw?

The fftw manual states that (Page 4)
The data is an array of type fftw_complex, which is by default a double[2] composed of the real (in[i][0]) and imaginary (in[i][1]) parts of a complex number.
For an FFT of a time series of n samples, the size of the matrix becomes n rows and 2 columns. If we wish to do element-by-element manipulation, isn't accessing in[i][0] for different values of i slow for large values of n, since C stores 2D arrays in row-major format?
The real and imaginary parts are stored consecutively in memory (addresses decrease from left to right below, so byte 0 of R0 is at the smallest address):
I(n-1),R(n-1) | ... | I1,R1 | I0,R0
That means it's possible to copy an element i into place while accessing a single cache line (usually 64 bytes today), as the real and imaginary parts are adjacent. If you stored the 2D array in Fortran order and wanted to assign to one element, you would immediately access memory on 2 different cache lines, as the two parts would be stored N*sizeof(double) locations apart in memory - see https://en.wikipedia.org/wiki/Row-_and_column-major_order
Now if your processing operated on just the real parts in one thread and the imaginary parts separately in another, for some reason, then yes, it would be more efficient to store them in column-major order, or even as separate parallel arrays. In general, though, data is stored close together because it is used together.
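A small illustration of why the interleaved layout is cache-friendly for element-wise work (fftw_complex is double[2], so both parts of element i sit side by side):

#include <fftw3.h>

// Element-wise scaling of an interleaved fftw_complex array: in[i][0]
// and in[i][1] are adjacent, so each iteration touches one contiguous
// region rather than two locations N*sizeof(double) apart.
void scale(fftw_complex *in, int n, double s) {
    for (int i = 0; i < n; ++i) {
        in[i][0] *= s;   // real part
        in[i][1] *= s;   // imaginary part, right next to it in memory
    }
}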
All arrays in C are really single-dimensional byte arrays, unless you store an array of pointers to arrays, as is usually done for things like strings of varying lengths.
Sometimes in matrix calculations it's actually faster to first transpose one array, because of the access patterns of matrix multiplication; it's complex, but if you want the real nitty-gritty details, search for Ulrich Drepper's LWN.net article about memory ("What Every Programmer Should Know About Memory"), which shows an example that benefits from this technique (section 5, IIRC).
Very often scientific numeric libraries have worked in column-major order, because Fortran compatibility was more important than using the array in a natural way. Most languages prefer row-major, as it's generally more desirable, for instance when you store fixed-length strings in a table.

Eigen Sparse Matrix

I am trying to multiply two large sparse matrices of size 300k x 1000k and 1000k x 300k using Eigen. The matrices are highly sparse, with ~0.01% non-zero entries; however, there's no block or other structure in their sparsity.
It turns out that Eigen chokes and ends up taking 55-60 GB of memory. Actually, it makes the final matrix dense, which explains why it takes so much memory.
I have tried multiplying matrices of similar sizes where one of the matrices is diagonal, and the multiplication works fine, with ~2-3 GB of memory.
Any thoughts on what's going wrong?
Even though your matrices are sparse, the result might be completely dense. You can try to remove the smallest entries with (A*B).prune(ref,eps); where ref is a reference value for what is not a zero and eps is a tolerance value. Basically, all entries smaller than ref*eps will be removed during the computation of the product, thus reducing both the memory usage and the size of the result. A better option would be to find a way to avoid performing this product at all.
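A sketch of that pruned product. Note that in recent Eigen releases the method on a product expression is spelled pruned(); prune() is the in-place member on an already-built SparseMatrix:

#include <Eigen/Sparse>

using SpMat = Eigen::SparseMatrix<double>;

// Drop entries with magnitude below ref*eps while the product is being
// formed, so the result is not stored fully dense.
SpMat pruned_product(const SpMat &A, const SpMat &B,
                     double ref, double eps) {
    return (A * B).pruned(ref, eps);
}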

Interview: Hash function: sine function

I was asked this interview question. I am not sure what the correct answer for it is (and the reasoning behind the answer):
Is sin(x) a good hash function?
If you mean sin(), it's not a good hashing function because:
it's quite predictable, and for some x it's no better than just x itself. There should be no apparent relationship between the key and the hash of the key.
it does not produce an integer value. You cannot index/subscript arrays with floating-point indices and there must be some kind of array in the hash table.
floating-point is very implementation-specific and even if you make a hash function out of sin(), it may not work with a different compiler or on a different kind of CPU/computer.
sin() may be much slower than some simpler integer-arithmetic function.
Not really.
It's horribly slow.
You'll need to convert the result to some integer type anyway to avoid the insanity of floating-point equality comparisons. (Not actually the usual precision problems that are endemic to FP equality comparisons and which arise from calculating two things slightly different ways; I mean specifically the problems caused by things like the fact that 387-derived FPUs store extra bits of precision in their registers, so if a comparison is done between two freshly-calculated values in registers you could get a different answer than if exactly one of the operands was loaded into a register from memory.)
It's almost flat near the peaks and troughs, so the quantisation step (multiplying by some large number and rounding to an integer) will produce many hash values near the min and max, rather than an even distribution.
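A quick demonstration of that pile-up: quantize sin(x) for integer x into 10 buckets, and the end buckets (near -1 and +1) collect far more values, because the curve flattens at its peaks and troughs:

#include <cmath>
#include <cstdio>

int main() {
    int bucket[10] = {0};
    for (int x = 0; x < 10000; ++x) {
        double h = (std::sin((double)x) + 1.0) / 2.0;  // map [-1,1] to [0,1]
        int b = (int)(h * 10.0);
        if (b == 10) b = 9;                            // clamp h == 1.0
        ++bucket[b];
    }
    for (int i = 0; i < 10; ++i)
        std::printf("bucket %d: %d\n", i, bucket[i]);  // the ends are heaviest
    return 0;
}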
Based on mathematical knowledge:
Sine(x) is periodic, so it reaches the same number from different values of x; Sine(x) would be awful as a hashing function because you will get multiple values hashing to the exact same point. There are **infinitely many distinct return values for x between 0 and pi/2, but past that the values repeat. So 0, pi, and 2*pi will all hash to the same point.
If you could make the increment small enough and have Sine(x) multiplied by, say, x^2 or something of that nature, it'd be mediocre at best; but then again, if you were to do that, why not just use x^2 anyway and toss out the periodic function altogether.
**infinitely: a large enough number that I'm not willing to count.
NOTE: Sine(x) will have values that are small and could be affected by rounding error.
NOTE: Any value taken from a sine function should be multiplied by an integer and then either modded or the floor or ceiling taken so that the value can be used as an array offset, etc.
sin(x) is a trigonometric function which repeats itself every 360 degrees (2*pi radians), so it's going to be a poor hash function, as the hash values will repeat too often.
A simple refutation:
sin(0) == sin(360) == sin(720) == sin(..)
This is not a property of a good hash function.
Even if you decide to use it, it's difficult to represent the value returned by sin.
Sin function:
sin x = x - x^3/3! + x^5/5! - ...
This can't be accurately represented due to floating-point precision issues, which means that the same value may produce two different hashes on different platforms or with different compilers!
Another point to note:
For sin(x) as a hash function: keys in a close range will have hash values in a close range too, which is not desirable. A good hash function distributes hash values evenly, irrespective of the nature of the keys.
Hash values generally have to be integers to be useful. Since sin doesn't generate integers it wouldn't be appropriate.
Let's say we have a string s. It can be expressed as a number in hexadecimal and fed to the function. If you added 2*pi to it, it would cease to be a valid input, as it wouldn't be an integer anymore (only non-negative integers are accepted by the function). You have to find a string that gives a collision, not just multiply the hex expression of the string by 2*pi. And adding (concatenating?) 2*pi directly to the string wouldn't help in finding a collision. There might be another way, though, but it is not that trivial.
I think sin(x) can make an excellent cryptographic hash function, if used wisely. The input should be a natural number in radians and should never contain pi. We must use arbitrary-precision arithmetic. For every natural number x (radians), sin(x) is a transcendental irrational number, and there is no other natural number with the same sine. But there's a catch: an attacker could gain information about the input by computing the arcsin of the hash. In order to prevent this, we ignore the integer part and some of the first digits of the fractional part, keeping only the next n (say 100) digits, making such an attack computationally infeasible.
It seems that a small change in the input gives a completely different result, which is a desirable property. The result of the function seems statistically random, again a good property. I'm not sure how to prove that it is collision-resistant, but I can't see why it couldn't be. Also, I can't think of a way to find a specific input that results in a specific hash. I'm not saying that we should blindly believe that it is certainly a good cryptographic hash function; I just think that it seems like a good candidate to be one. We should give it a chance and focus on proving that it is. And it might be a very good one.
To those that might say it is slow: yes, it is. And that's good when hashing passwords.
Here I'm attaching some Perl code for this idea. It runs on Linux with bash and bc. (bc is a command-line arbitrary-precision calculator, included in most distros.)
I'll be checking this page for any answers, since this interests me a lot. Don't be harsh though; I'm just a CS undergrad, willing to learn more.
use warnings;
use strict;

my $input = '5AFF36B7';   # input for bc (as a hex number)
$input = '1' . $input;    # put '1' in front of the input, so that 0x0, 0x00, 0x1, 0x01, etc.,
                          # all give different nonzero results

# Call bc and keep the sine in $a. Pipe through echo rather than the
# bash-only <<< here-string, because Perl backticks run /bin/sh.
my $a = `echo "scale=256;obase=16;ibase=16;s($input)" | bc -l -q`;

# Keep only the fractional part: drop the leading "." and bc's
# line-continuation backslashes, keeping only alphanumerics.
$a =~ tr/a-zA-Z0-9//cd;
my @m = $a =~ /./g;       # convert the string to an array of chars

# PRINT OUTPUT
# We ignore some digits, for security reasons: if we kept the first
# digits, an attacker could gain information about the input by computing
# the inverse of sin (the arcsin of the hash). By ignoring enough of the
# first digits, computing arcsin becomes computationally infeasible.
# Also, to avoid problems with roundoff error, we ignore some of the last digits.
for (my $c = 100; $c < 200; $c++) {
    print $m[$c];
}
print "\n";
