Storing and mapping/searching binary vectors? - database

I want to store between roughly 1,000 and 1 million binary vectors, each roughly 100,000 to 1 million bits long.
The search is based on binary operations: overlap bit count, Hamming distance, etc.
Something like:
SELECT id,vec FROM abc ORDER BY count(param & vec) LIMIT TOP 5%
SELECT id,vec FROM abc ORDER BY overlap(param,vec) LIMIT TOP 5%
SELECT vec FROM abc WHERE vec = param
It doesn't have to be SQL; this is just for illustration.
What is the best option: a database, files, a cache?
My current best idea is to use indexed binary vectors and store them as arrays in a PostgreSQL database. The vectors are sparse, so it should work.
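A minimal sketch of the sparse representation mentioned above, assuming "indexed" means that each vector is stored as the set of positions of its 1-bits. The function names (`overlap`, `hamming`, `top_matches`) and the sample data are illustrative only, not part of any library:

```python
def overlap(a, b):
    """Number of bit positions set in both vectors (popcount of a & b)."""
    return len(a & b)


def hamming(a, b):
    """Number of bit positions where the vectors differ (popcount of a ^ b)."""
    return len(a ^ b)


def top_matches(param, vectors, fraction=0.05):
    """Return the ids of the top `fraction` of vectors, ranked by overlap with `param`."""
    ranked = sorted(vectors, key=lambda vid: overlap(param, vectors[vid]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always return at least one id
    return ranked[:keep]


# Hypothetical data: each vector is a frozenset of the indices of its 1-bits.
vectors = {
    "a": frozenset({1, 5, 9}),
    "b": frozenset({2, 5, 9, 700000}),
    "c": frozenset({3}),
}
param = frozenset({5, 9, 700000})
best = top_matches(param, vectors, fraction=0.34)
```

With this layout the cost of each comparison depends on the number of set bits rather than on the nominal vector length, which is what makes sparsity useful here.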

Related

How to count for 2 different arrays how many times the elements are repeated, in MATLAB?

I have array A (44x1) and B (41x1), and I want to count for both arrays how many times the elements are repeated. And if the repeated values are present in both arrays, I want their counting to be divided (for instance: value 0.5 appears 500 times in A and 350 times in B, so now divide 500 by 350).
I have to do this for bigger arrays as well, so I was thinking about using a loop (but I have no idea how to do it in MATLAB).
I got what I want in Python:
import pandas as pd
data1 = pd.read_excel('C:/Users/Desktop/Python/data1.xlsx')
data2 = pd.read_excel('C:/Users/Desktop/Python/data2.xlsx')
for i in data1['Mag'].value_counts() & data2['Mag'].value_counts():
    a = data1['Mag'].value_counts()/data2['Mag'].value_counts()
    print(a)
    break
Any idea of how to do the same on MATLAB? Thanks!
Since you can enumerate all valid earthquake magnitude values, you could use:
% Make up some data
A=randi([2 58],[100 1])/10;
B=randi([2 58],[20 1])/10;
% Round data to nearest tenth
%A=round(A,1); %uncomment if necessary
%B=round(B,1); %same
% Divide frequencies
validmags=0.2:0.1:5.8;
Afreqs=sum(double( abs(A-validmags)<1e-6 ),1); %relies on implicit expansion; A must be a column vector and validmags must be a row vector; dimension argument to sum() only to remind user; double() not really needed
Bfreqs=sum(double( abs(B-validmags)<1e-6 ),1); %same
Bfreqs./Afreqs, %for a fancier version: [{'Magnitude'} num2cell(validmags) ; {'Freq(B)/Freq(A)'} num2cell(Bfreqs./Afreqs)].'
The last line will produce NaN for 0/0, +Inf for nn/0, and 0 for 0/nn.
You could also use uniquetol, align the unique values of each vector, and divide the respective absolute frequencies. But I think the above approach is cleaner and easier to understand.

How to calculate the product of frequencies of different elements of an array efficiently?

We are given an array, and I have to calculate the product of the frequencies of the numbers in a particular range of the array, i.e. [L,R].
How can I do it?
My approach: say the array is [1,2,2,2,45,45,4], with L=2 and R=6. Answer = 3 (frequency of 2) * 2 (frequency of 45) = 6.
Just traverse the array (from L to R) and put the frequency of each number in a map; finally multiply all those values. Is there any better method to do this for multiple range queries online?
Do we require persistence?
If the size of the array is N and the number of queries is Q, I want a much better time complexity than O(N*Q).
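For reference, the naive per-query approach described in the question can be sketched as follows. This is only the O(N*Q) baseline, not a faster method; the function name `frequency_product` is illustrative, and L and R are assumed to be 1-based and inclusive, as in the example:

```python
from collections import Counter
from math import prod


def frequency_product(arr, left, right):
    """Product of the frequencies of the distinct values in arr[left..right].

    `left` and `right` are 1-based and inclusive, matching the example above.
    """
    counts = Counter(arr[left - 1:right])  # value -> number of occurrences in the range
    return prod(counts.values())


# The example from the question: 2 occurs 3 times, 45 occurs twice.
result = frequency_product([1, 2, 2, 2, 45, 45, 4], 2, 6)
```

Each call scans the whole range, so Q queries cost O(N*Q) in the worst case.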

hashing of an integer vector for fast querying

I represent a monomial with a 3x2 vector. For example,
x y z : ( (variable id, exponent of variable) )
: ( (1,1) , (2, 1) , (3,1) )
x^2 z^2 : ( (1,2) , (3,2) , (null, null) )
The total number of distinct variables is one million, but a monomial contains at most three distinct variables.
I want to do query like
Is x in the monomial y^2 z^2?
Is the power of y greater than 3 in y^2 z^2?
Is there a hash function that will answer those question in O(1)?
Or should I just loop over the 3x2 vector?
I ask because I maintain a hash table with 5 million such monomials.
So, by default, I must use a hashing structure to do a search.
Maybe there is a hashing scheme that can also answer the two questions above.
Related question: best codebook for manipulation of monomials
I think you have to loop over the 3x2 vector.
And one could argue that looping over the vector is an O(1) operation. Why? Because the number of entries in the vector is exactly 3, which is a small constant number. To put it another way, the loop count is independent of the big numbers:
n which is the number of variables, and
m which is the number of monomials.
It doesn't matter how big (or small) m and n are, looping over the vector always takes the same amount of time, so it's technically an O(1) operation.
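A minimal sketch of the loop-based queries, assuming a monomial is stored as a tuple of up to three (variable id, exponent) pairs, with unused slots omitted rather than stored as (null, null). The helper names are illustrative:

```python
def contains_variable(monomial, var_id):
    """True if `var_id` appears in the monomial; scans at most three pairs."""
    return any(v == var_id for v, _ in monomial)


def exponent_of(monomial, var_id):
    """Exponent of `var_id` in the monomial, or 0 if the variable is absent."""
    for v, e in monomial:
        if v == var_id:
            return e
    return 0


X, Y, Z = 1, 2, 3               # variable ids, as in the question
y2z2 = ((Y, 2), (Z, 2))         # y^2 z^2

x_in_monomial = contains_variable(y2z2, X)   # "Is x in y^2 z^2?"
y_power_gt_3 = exponent_of(y2z2, Y) > 3      # "Is the power of y greater than 3?"

# Tuples are hashable, so the same representation can serve as a hash-table key.
table = {y2z2: "some payload"}
```

Both helpers touch at most three entries, so their cost does not depend on the number of variables or monomials.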

Choosing distributed computing framework for very large overlap queries

I am trying to analyze 2 billion rows (of text files in HDFS). Each file's lines contain an array of sorted integers:
[1,2,3,4]
The integer values can be 0 to 100,000. Within each array of integers, I want to generate all possible pairs (one-way, i.e. having both 1,2 and 2,1 is not necessary), then reduce and sum the counts of those pairs. For example:
File:
[1,2,3,4]
[2,3,4]
Final Output:
(1,2) - 1
(1,3) - 1
(1,4) - 1
(2,3) - 2
(2,4) - 2
(3,4) - 2
The approach I have tried is to use Apache Spark to create a simple job that parallelizes the processing and reducing of blocks of data. However, I am running into issues where memory can't hold a hash of ((100,000)^2)/2 options, so I am having to resort to a traditional map-reduce: map, sort, shuffle, reduce locally, then sort, shuffle, reduce globally. I know creating the combinations is a double for loop, so O(n^2), but what is the most efficient way to do this programmatically so that I write to disk as little as possible? I am trying to complete this task in under 2 hours on a cluster of 100 nodes (64 GB RAM, 2 cores each). I would also welcome any recommended technologies or frameworks. Below is what I have been using in Apache Spark and Pydoop. I tried more memory-optimized hashes, but they still used too much memory.
import collection.mutable.HashMap
import collection.mutable.ListBuffer
def getArray(line: String):List[Int] = {
  var a = line.split("\\x01")(1).split("\\x02")
  var ids = new ListBuffer[Int]
  for (x <- 0 to a.length - 1){
    ids += Integer.parseInt(a(x).split("\\x03")(0))
  }
  return ids.toList
}
var textFile = sc.textFile("hdfs://data/")
val counts = textFile.mapPartitions(lines => {
  val hashmap = new HashMap[(Int,Int),Int]()
  lines.foreach( line => {
    val array = getArray(line)
    for((x,i) <- array.view.zipWithIndex){
      for (j <- (i+1) to array.length - 1){
        hashmap((x,array(j))) = hashmap.getOrElse((x,array(j)),0) + 1
      }
    }
  })
  hashmap.toIterator
}).reduceByKey(_ + _)
Also tried Pydoop:
def mapper(_, text, writer):
    columns = text.split("\x01")
    slices = columns[1].split("\x02")
    slice_array = []
    for slice_obj in slices:
        slice_id = slice_obj.split("\x03")[0]
        slice_array.append(int(slice_id))
    # Emit each pair (earlier element, later element) once with a count of 1.
    for (i, x) in enumerate(slice_array):
        # The upper bound is len(slice_array) so that the last element is paired too.
        for j in range(i + 1, len(slice_array)):
            writer.emit((x, slice_array[j]), 1)
def reducer(key, vals, writer):
    writer.emit(key, sum(map(int, vals)))
def combiner(key, vals, writer):
    writer.count('combiner calls', 1)
    reducer(key, vals, writer)
I think your problem can be reduced to word count where the corpus contains at most 5 billion distinct words.
In both of your code examples, you're trying to pre-count all of the items appearing in each partition and sum the per-partition counts during the reduce phase.
Consider the worst-case memory requirements for this, which occur when every partition contains all of the 5 billion keys. The hashtable requires at least 8 bytes to represent each key (as two 32-bit integers) and 8 bytes for the count if we represent it as a 64-bit integer. Ignoring the additional overheads of Java/Scala hashtables (which aren't insignificant), you may need at least 74 gigabytes of RAM to hold the map-side hashtable:
num_keys = 100000**2 / 2
bytes_per_key = 4 + 4 + 8
bytes_per_gigabyte = 1024 **3
hashtable_size_gb = (num_keys * bytes_per_key) / (1.0 * bytes_per_gigabyte)
The problem here is that the keyspace at any particular mapper is huge. Things are better at the reducers, though: assuming a good hash partitioning, each reducer processes an even share of the keyspace, so the reducers only require roughly (74 gigabytes / 100 machines) ~= 740 MB per machine to hold their hashtables.
Performing a full shuffle of the dataset with no pre-aggregation is probably a bad idea, since the 2 billion row dataset probably becomes much bigger once you expand it into pairs.
I'd explore partial pre-aggregation, where you pick a fixed size for your map-side hashtable and spill records to reducers once the hashtable becomes full. You can employ different policies, such as LRU or randomized eviction, to pick elements to evict from the hashtable. The best technique might depend on the distribution of keys in your dataset (if the distribution exhibits significant skew, you may see larger benefits from partial pre-aggregation).
This gives you the benefit of reducing the amount of data transfer for frequent keys while using a fixed amount of memory.
You could also consider using a disk-backed hashtable that can spill blocks to disk in order to limit its memory requirements.
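The partial pre-aggregation idea can be sketched in plain Python, independent of Spark or Pydoop. This is a minimal illustration under stated assumptions: `partial_preaggregate`, `reduce_counts`, and `max_entries` are hypothetical names, eviction is randomized, and `max_entries` is assumed to be at least 1:

```python
import random
from collections import Counter
from itertools import combinations


def partial_preaggregate(rows, max_entries, rng=None):
    """Yield ((a, b), partial_count) records while holding at most `max_entries` keys.

    When the table is full and a new key arrives, a randomly chosen entry is
    evicted, i.e. emitted downstream with the count accumulated so far.
    """
    rng = rng or random.Random(0)
    table = {}
    for row in rows:
        # Rows are sorted, so each pair comes out as (smaller, larger).
        for pair in combinations(row, 2):
            if pair in table:
                table[pair] += 1
            else:
                if len(table) >= max_entries:
                    # list(table) is O(size); kept simple for clarity, not speed.
                    victim = rng.choice(list(table))
                    yield victim, table.pop(victim)
                table[pair] = 1
    # Flush whatever is still in memory at the end of the partition.
    yield from table.items()


def reduce_counts(records):
    """Reducer side: sum the partial counts for each key."""
    totals = Counter()
    for key, count in records:
        totals[key] += count
    return dict(totals)


rows = [[1, 2, 3, 4], [2, 3, 4]]  # the example file from the question
totals = reduce_counts(partial_preaggregate(rows, max_entries=2))
```

The totals are the same whatever the eviction policy is; the policy only changes how many partial records are sent to the reducers.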

efficient methods to do summation

Are there any efficient techniques for the following summation?
Given a finite set A containing n integers, A={X1,X2,…,Xn}, where each Xi is an integer. There are n subsets of A, denoted by A1, A2, ..., An. We want to calculate the sum of each subset. Are there any efficient techniques?
(Note that n is typically larger than the average size of the subsets of A.)
For example, if A={1,2,3,4,5,6,7,9}, A1={1,3,4,5}, A2={2,3,4}, A3= ... . A naive way of computing the sums of A1 and A2 needs 5 flops for additions:
Sum(A1)=1+3+4+5=13
Sum(A2)=2+3+4=9
...
Now, if we compute 3+4 first and record its result 7, we need only 3 flops for the two sums (plus 1 for the shared 3+4, so 4 in total):
Sum(A1)=1+7+5=13
Sum(A2)=2+7=9
...
What about the general case? Are there any efficient methods to speed up the calculation? Thanks!
For some choices of subsets there are ways to speed up the computation, if you don't mind doing some (potentially expensive) precomputation, but not for all. For instance, suppose your subsets are {1,2}, {2,3}, {3,4}, {4,5}, ..., {n-1,n}, {n,1}; then the naive approach uses one arithmetic operation per subset, and you obviously can't do better than that. On the other hand, if your subsets are {1}, {1,2}, {1,2,3}, {1,2,3,4}, ..., {1,2,...,n} then you can get by with n-1 arithmetic ops, whereas the naive approach is much worse.
Here's one way to do the precomputation. It will not always find optimal results. For each ordered pair of subsets (X, Y), define the transition cost from X to Y to be min(size of the symmetric difference of X and Y, (size of Y) - 1). (The symmetric difference of X and Y is the set of things that are in X or Y but not both.) So the transition cost is the number of arithmetic operations you need to do to compute the sum of Y's elements, given the sum of X's. Add the empty set to your list of subsets, and compute a minimum-cost directed spanning tree using Edmonds' algorithm (http://en.wikipedia.org/wiki/Edmonds%27_algorithm) or one of the faster but more complicated variations on that theme. Now make sure that when your spanning tree has an edge X -> Y you compute X before Y. (This is a "topological sort" and can be done efficiently.)
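A minimal sketch of the transition cost only, assuming subsets are represented as Python frozensets. The name `transition_cost` is illustrative, and the spanning-tree step (Edmonds' algorithm) is not implemented here; the `costs` table is just the set of edge weights such a routine would take as input:

```python
def transition_cost(x, y):
    """Arithmetic operations needed to get sum(y), given that sum(x) is known."""
    from_x = len(x ^ y)                 # add or subtract each element of the symmetric difference
    from_scratch = max(len(y) - 1, 0)   # add up y's elements directly (0 for an empty y)
    return min(from_x, from_scratch)


# Edge weights for a small family of subsets, with the empty set added as the root.
subsets = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
costs = {(x, y): transition_cost(x, y) for x in subsets for y in subsets if x != y}

# The two subsets from the question: {2,3,4} is cheaper to sum from scratch (2 ops)
# than to derive from {1,3,4,5} (3 ops).
example = transition_cost(frozenset({1, 3, 4, 5}), frozenset({2, 3, 4}))
```

For the nested family {1}, {1,2}, {1,2,3}, each step along the chain costs 1, which matches the n-1 operations mentioned above.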
This will give distinctly suboptimal results when, e.g., you have {1,2}, {3,4}, {1,2,3,4}, {5,6}, {7,8}, {5,6,7,8}. After deciding your order of operations using the procedure above you could then do an optimization pass where you find cheaper ways to evaluate each set's sum given the sums already computed, and this will probably give fairly decent results in practice.
I suspect, but have made no attempt to prove, that finding an optimal procedure for a given set of subsets is NP-hard or worse. (It is certainly computable; the set of possible computations you might do is finite. But, on the face of it, it may be awfully expensive; potentially you might be keeping track of about 2^n partial sums, be adding any one of them to any other at each step, and have up to about n^2 steps, for a super-naive cost of (2^2n)^(n^2) = 2^(2n^3) operations to try every possibility.)
Assuming that 'addition' isn't simply an ADD operation but instead some very intensive function involving two integer operands, then an obvious approach would be to cache the results.
You could achieve that via a suitable data structure, for example a key-value dictionary containing keys formed by the two operands and the answers as the value.
But as you specified C in the question, the simplest approach would be an n by n array of integers, where the solution to x + y is stored at array[x][y].
You can then repeatedly iterate over the subsets, and for each pair of operands you check the appropriate position in the array. If no value is present then it must be calculated and placed in the array. The value then replaces the two operands in the subset and you iterate.
If the operation is commutative then the operands should be sorted prior to looking up the array (i.e. so that the first index is always the smaller of the two operands) as this will maximise "cache" hits.
A common optimization technique is to pre-compute intermediate results. In your case, you might pre-compute all sums with 2 summands from A and store them in a lookup table. This will result in |A|*(|A|+1)/2 table entries, where |A| is the cardinality of A.
In order to compute the element sum of Ai, you:
- look up the sum of the first two elements of Ai and save it in tmp
- while there is an element x left in Ai:
    look up the sum of tmp and x, and save it in tmp
In order to compute the element sum of A1 = {1,3,4,5} from your example, you do the following:
lookup(1,3) = 4
lookup(4,4) = 8
lookup(8,5) = 13
Note that computing the sum of any given Ai doesn't require summation, since all the work has already been conducted while pre-computing the lookup table.
If you store the lookup table in a hash table, then lookup() is in O(1).
Possible optimizations to this approach:
construct the lookup table while computing the summation results; hence, you only compute those summations that you actually need. Your lookup table is now a cache.
if your addition operation is commutative, you can save half of your cache size by storing only those summations where the smaller summand comes first. Then modify lookup() such that lookup(a,b) = lookup(b,a) if a > b.
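The cache variant with the commutative-key optimization might look like the sketch below. It assumes the expensive operation is passed in as a function; `make_cached_add`, `lookup`, and `stats` are illustrative names, and plain integer addition stands in for the costly operation:

```python
from functools import reduce


def make_cached_add(add):
    """Wrap a commutative two-operand operation with a cache filled on demand."""
    cache = {}
    stats = {"computed": 0}  # how many times the real operation actually ran

    def lookup(a, b):
        # Store only the ordering with the smaller operand first,
        # so that lookup(a, b) and lookup(b, a) share one entry.
        key = (a, b) if a <= b else (b, a)
        if key not in cache:
            cache[key] = add(*key)
            stats["computed"] += 1
        return cache[key]

    return lookup, stats


# Plain addition stands in for the expensive operation.
lookup, stats = make_cached_add(lambda a, b: a + b)

sum_a1 = reduce(lookup, [1, 3, 4, 5])        # lookup(1,3), lookup(4,4), lookup(8,5)
sum_a2 = reduce(lookup, [2, 3, 4])           # lookup(2,3), lookup(5,4)
sum_a1_again = reduce(lookup, [3, 1, 4, 5])  # every step is a cache hit
```

The cache only pays off when the same operand pairs recur, so how much work it saves depends on the order in which each subset's elements are combined.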
If summation is assumed to be a time-consuming operation, you can find the LCS of every pair of subsets (assuming they are sorted, as mentioned in the comments; if they are not, sort them first). Then calculate the sum of the longest LCS (over all pairs), replace its elements in the affected subsets with that sum, update the LCSs, and continue this way until no LCS has more than one number. This is certainly not optimal, but it is better than the naive algorithm (fewer summations). You can, however, use backtracking to find the best solution.
E.g., for your sample input:
A1={1,3,4,5} , A2={2,3,4}
LCS (A_1,A_2) = {3,4} ==>7 ==>replace it:
A1={1,5,7}, A2={2,7} ==> LCS = {7}, maximum LCS length is `1`, so calculate sums.
You can still improve on this by calculating the sum of two random numbers, then taking the LCS again, and so on.
No. There is no efficient technique.
This is because it is an NP-complete problem, and there are no known efficient solutions for such problems.
Why is it NP-complete?
We could use an algorithm for this problem to solve the set cover problem, just by adding an extra set containing all elements to the collection of sets.
Example:
We have sets of elements
A1={1,2}, A2={2,3}, A3 = {3,4}
We want to solve the set cover problem.
We add to this collection a set containing all elements:
A4 = {1,2,3,4}
We use the algorithm that John Smith is asking for and check what the solution for A4 is represented with.
We have solved an NP-complete problem.
