Fast response for subset queries - database

I have a database of 10,000 vectors of integers ranging from 1 to 1,000. The length of each vector can be up to 1,000. For example, it could look like this:
vec1: 1 2 56 78
vec2: 23 34 35 36 37 38
vec3: 1 2 3 4 5 7
vec4: 2 3 4 6 100
...
vec10000: 13 234
Now, I want to store this database in a way that is fast in response to a particular type of request. Each request will come in the form of an integer vector, up to 10,000 long:
query: 1 2 3 4 5 7 56 78 100
The response should be the indices of the vectors that are subsets of this query vector. For example, in the above list, only vec1 and vec3 are subsets of the query, so the response in this case should be
response: 1 3
This database is not going to change, so you can preprocess it in any way you like. You may also specify the protocol in which queries arrive, as long as the information is the same; for example, a query could come as a sorted list or as a boolean table.
What is the best strategy for encoding the database and the queries so that responses are as fast as possible?

Since you are using Python, this method is easy: Python handles arbitrarily large integers natively. (It is implementable in any other language as well, but will involve big-integer/modular arithmetic.)
For each number from 1 to 1,000, assign a prime number to it:
1 => 2
2 => 3
3 => 5
4 => 7
...
...
25 => 97
...
...
1000 => 7919
For every vector, compute a hash value as the product of the primes assigned to its elements.
e.g. if your vector is vec-x = {1, 2, 5, 25}, then hash(vec-x) = 2 * 3 * 11 * 97.
Compute the value of your query vector in the same way; call it Q.
If Q % hash(vec-i) == 0, then vec-i is a subset of the query; otherwise it is not.
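A small sketch of this approach in Python (my own illustration; it assumes Python 3.8+ for math.prod and sympy for the value-to-prime mapping):

from sympy import prime
from math import prod

primes = {v: prime(v) for v in range(1, 1001)}   # 1 -> 2, 2 -> 3, ..., 1000 -> 7919

def encode(vec):
    # product of the primes assigned to the distinct values in the vector
    return prod(primes[v] for v in set(vec))

db = {1: encode([1, 2, 56, 78]),
      2: encode([23, 34, 35, 36, 37, 38]),
      3: encode([1, 2, 3, 4, 5, 7]),
      4: encode([2, 3, 4, 6, 100])}

def subset_query(query_vec):
    q = encode(query_vec)
    return [idx for idx, h in db.items() if q % h == 0]

print(subset_query([1, 2, 3, 4, 5, 7, 56, 78, 100]))   # [1, 3]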

What about just preprocessing your vector list into an indicator matrix and using matrix multiplication, something like:
import numpy as np

# generate 10000 random vectors with length in [0-1000]
# and elements in [0-1000]
vectors = [np.random.randint(1000, size=n)
           for n in np.random.randint(1000, size=10000)]

# generate indicator matrix
database = np.zeros((10000, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

def query(ints):
    tmp = np.zeros(1000, dtype='int8')
    tmp[ints] = 1
    return np.where(database.dot(tmp) == lengths)[0]
The dot product of a database row and the transformed query will be equal to the number of elements of the row that are in the query. If this number is equal to the total number of elements in the row, then we've found a subset. Note that this uses 0-based indexing.
Here it is revised for your example data:
vectors = [[1, 2, 56, 78],
           [23, 34, 35, 36, 37, 38],
           [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100],
           [13, 234]]
database = np.zeros((5, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

print query([1, 2, 3, 4, 5, 7, 56, 78, 100])
# [0, 2] 0-based indexing

Related

Uniform proportion algorithm

Suppose that I have two arrays A and B, where A is an m by n matrix and B is a vector of size m. Each value in B refers to the same row in A and has a value of 1 or 0. Now assume A and B look like this:
A = 1 2 3 4    B = 1
    5 6 7 8        0
    5 6 7 8        0
    5 6 7 8        0
    5 6 7 8        0
    5 6 7 8        1
    5 6 7 8        0
    5 6 7 8        0
I want to break both arrays into k parts, and I want all parts to have a (semi) uniform number of 1s and 0s. With a naive split, some parts end up with no 1s at all while others have many.
I need an algorithm to sort both arrays before doing this breaking (splitting) job. How should this kind of sort be done, or what is the best approach?
It is worth mentioning that the real data has 679 rows, with a corresponding 1 for 70 of them and 0 for the others, and the desired k is currently 10.
You haven't given any code examples, and I don't want to give any code, because I've recently asked a similar question as a homework exercise. However, here is some pseudocode mixed with Java-esque method signatures. In the following, I will assume that one row of your dataset is modeled as Pair<A, B> with some generic types A and B (thinking of A as "features" and B as "labels" in a supervised machine learning task). In your concrete case, A would be some kind of list of integers, and B might be Boolean.
First, you define a helper method that can shuffle and split the dataset into k parts, completely ignoring the labels. In Java-syntax:
public static <A,B> ArrayList<ArrayList<Pair<A,B>>> split(
    ArrayList<Pair<A, B>> dataset,
    int k
) {
    // shuffle the dataset
    // generate `k` new lists
    // add rows from the shuffled list to the `k` lists
    // in round-robin fashion, i.e.
    // move `i`-th item to the `i%k`-th list.
}
Building on top of that, you can define the stratified split version:
public static <A,B> ArrayList<ArrayList<Pair<A,B>>> stratifiedSplit(
    ArrayList<Pair<A,B>> dataset,
    int k
) {
    // create a (hash?)map for the strata.
    // In this map, you want to collect rows in separate
    // lists, depending on their label:
    HashMap<B, ArrayList<Pair<A,B>>> strata = ...;
    // (Check whether your collection library of choice
    // provides a `groupBy` or a `groupingBy` of some sort.
    // In C#, this might help:
    // https://msdn.microsoft.com/en-us/library/bb534304(v=vs.110).aspx )

    // In your concrete case,
    // your map should look something like this:
    // {
    //   false -> [
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false)
    //   ],
    //   true -> [
    //     ([5, 6, 7, 8], true),
    //     ([1, 2, 3, 4], true)
    //   ]
    // }
    // where `{}`=map, `[]`=list/array, `()`=tuple/pair.

    // Now you generate `k` lists to hold the result.
    // For each stratum, you call the ordinary non-stratified
    // `split` method, and append the `k` pieces returned by
    // this method to the `k` result lists.

    // In the end, you again shuffle each of the `k` result
    // lists (so that the labels aren't sorted in the end)
    // and return the `k` result lists.
}
Writing out the details is left as an exercise.
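If a concrete sketch helps, here is the same two-step idea in Python (my own illustration, assuming each row is a (features, label) tuple; not a drop-in solution):

import random
from collections import defaultdict

def split(rows, k):
    # shuffle, then deal rows into k lists round-robin
    rows = list(rows)
    random.shuffle(rows)
    parts = [[] for _ in range(k)]
    for i, row in enumerate(rows):
        parts[i % k].append(row)
    return parts

def stratified_split(rows, k):
    # group rows by label, split each stratum, then merge part-wise
    strata = defaultdict(list)
    for features, label in rows:
        strata[label].append((features, label))
    parts = [[] for _ in range(k)]
    for stratum in strata.values():
        for i, piece in enumerate(split(stratum, k)):
            parts[i].extend(piece)
    for part in parts:
        random.shuffle(part)   # so labels are not grouped within each part
    return parts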

How can I use numpy to more efficiently modify the recorded sizes of nested sub-arrays, where the modification is condition-dependent?

I have a working declustering algorithm that I would like to speed up using numpy. Given an array a, the consecutive differences diffa are obtained. Each consecutive difference is then checked against some threshold value t_c, which produces an array of 0's and 1's (False and True). Since diffa is one element shorter than a, the counting scheme is slightly modified: first, the size of each cluster of 0's and 1's is calculated as the array cl_size; if a cluster contains 0's, its size counts as its original size plus one, and if it contains 1's, its size counts as its original size minus one. Below is an example that I would like to adapt for a much larger dataset.
import numpy as np

thresh = 21
a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
diffa = np.diff(a)
print(diffa)
>> [ 1 3 5 10 20 30 1 1 2 26 30 30 11 29 1]

def get_cluster_statistics(array, t_c, func_kw='all'):
    """ This function separates clusters of datapoints such that the number
    of clusters and the number of events in each cluster can be known. """
    # GET CONSECUTIVE DIFFERENCES
    ts_dif = np.diff(array)
    # GET BOOLEAN ARRAY MASK OF 0's AND 1's FOR TIMES ABOVE THRESHOLD T_C
    bool_mask = np.array(ts_dif > t_c) * 1
    # COPY BOOLEAN ARRAY MASK (DO NOT MODIFY ORIGINAL ARRAY)
    bm_arr = bool_mask[:]
    # SPLIT CLUSTERS INTO SUB-ARRAYS
    res = np.split(bm_arr, np.where(abs(np.diff(bm_arr)) != 0)[0] + 1)
    print(res)
    >> [array([0, 0, 0, 0, 0]), array([1]), array([0, 0, 0]), array([1, 1, 1]), array([0]), array([1]), array([0])]
    # GET SIZE OF EACH SUB-ARRAY CLUSTER
    cl_size = np.array([res[idx].size for idx in range(len(res))])
    print(cl_size)
    >> [5 1 3 3 1 1 1]
    # CHOOSE BETWEEN CHECKING ANY OR ALL VALUES OF SUB-ARRAYS (check timeit)
    func = dict(zip(['all', 'any'], [np.all, np.any]))[func_kw]
    # INITIALIZE EMPTY OUTPUT LIST
    ans = []
    # CHECK EACH SPLIT SUB-ARRAY IN RES
    for idx in range(len(res)):
        # print("res[%d] = %s" %(idx, res[idx]))
        if func(res[idx] == 1):
            term = [1 for jdx in range(cl_size[idx]-1)]
            # cl_size[idx] = cl_size[idx]-1
            ans.append(term)
        elif func(res[idx] == 0):
            # cl_size[idx] = cl_size[idx]+1
            term = [cl_size[idx]+1]
            ans.append(term)
    print(ans)
    >> [[6], [], [4], [1, 1], [2], [], [2]]
    out = np.sum(ans)
    print(out)
    >> [6, 4, 1, 1, 2, 2]

get_cluster_statistics(a, thresh, 'any')
After this, I apply collections.Counter to count the frequency of clusters of various sizes.
I am not sure how, but I think there is a more efficient numpy solution, specifically for the section of code under # CHECK EACH SPLIT SUB-ARRAY IN RES. Any help would be appreciated.
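One possible direction (a sketch of my own, not a tested drop-in replacement) is to compute run lengths directly from the boundary indices instead of looping over the sub-arrays returned by np.split; the cluster sizes come out in a different order than in the loop above, but the Counter frequencies are the same:

import numpy as np

def cluster_sizes_vectorized(array, t_c):
    mask = (np.diff(array) > t_c).astype(np.int8)
    bounds = np.flatnonzero(np.diff(mask)) + 1            # indices where runs change
    starts = np.concatenate(([0], bounds))
    ends = np.concatenate((bounds, [mask.size]))
    run_len = ends - starts                                # length of each run
    run_val = mask[starts]                                 # 0 or 1 for each run
    # counting scheme from the question: a run of 0's becomes one cluster of
    # size (length + 1); a run of 1's becomes (length - 1) clusters of size 1
    zero_clusters = run_len[run_val == 0] + 1
    one_clusters = np.ones((run_len[run_val == 1] - 1).sum(), dtype=int)
    return np.concatenate((zero_clusters, one_clusters))

print(cluster_sizes_vectorized(a, thresh))   # [6 4 2 2 1 1]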

How to trim a series of arrays in Matlab by eliminating elements

I wonder if there is a way of looping through a number of arrays of different sizes and trimming data from the beginning of each array in order to end up with the same number of elements in each array?
For instance, if I have:
A = [4 3 9 8 13]
B = [15 2 6 11 1 12 8 9 10 13 4]
C = [2 3 11 12 10 9 15 4 14]
and I want B and C to lose some elements at the beginning, such that they end up being 5 elements in length, just like A, to achieve:
A = [4 3 9 8 13]
B = [8 9 10 13 4]
C = [10 9 15 4 14]
How would I do that?
EDIT/UPDATE:
I have accepted the answer proposed by @excaza, who wrote a nice function called "naivetrim". I saved that function as a .m script and then used it: first I defined my three arrays and, as @excaza suggests, called the function:
[A, B, C] = naivetrim(A, B, C);
Another variation that worked for me, based on @Sardar_Usama's answer below (looping over it). I liked this as well because it was a bit more straightforward (at my level, I can follow what is happening in the code):
A = [4 3 9 8 13]
B = [15 2 6 11 1 12 8 9 10 13 4]
C = [2 3 11 12 10 9 15 4 14]
arrays = {A,B,C}
temp = min([numel(A),numel(B), numel(C)]); %finding the minimum number of elements
% Storing only required elements
for i = 1:size(arrays,2)
    currentarray = arrays{i}
    arrays(i) = {currentarray(end-temp+1:end)}
end
A naive looped solution:
function testcode()
% Sample data arrays
A = [4, 3, 9, 8, 13];
B = [15, 2, 6, 11, 1, 12, 8, 9, 10, 13, 4];
C = [2, 3, 11, 12, 10, 9, 15, 4, 14];

[A, B, C] = naivetrim(A, B, C);
end

function varargout = naivetrim(varargin)
% Assumes all inputs are vectors

% Find minimum length
lengths = zeros(1, length(varargin), 'uint32'); % Preallocate
for ii = 1:length(varargin)
    lengths(ii) = length(varargin{ii});
end

% Loop through input arrays and trim any that are longer than the shortest
% input vector
minlength = min(lengths);
varargout = cell(size(varargin)); % Preallocate
for ii = 1:length(varargout)
    if length(varargin{ii}) >= minlength
        varargout{ii} = varargin{ii}(end-minlength+1:end);
    end
end
end
Which returns:
A =
4 3 9 8 13
B =
8 9 10 13 4
C =
10 9 15 4 14
If you have a large number of arrays you may be better off with alternative intermediate storage data types, like cells or structures, which would be "simpler" to assign and iterate through.
Timing code for a few different similar approaches can be found in this Gist.
Performance Profile, MATLAB (R2016b)
Number of Elements in A: 999999
Number of Elements in B: 424242
Number of Elements in C: 101325
Trimming, deletion: 0.012537 s
Trimming, copying: 0.000430 s
Trimming, cellfun copying: 0.000493 s
If there are not many matrices then it can be done as:
temp = min([numel(A),numel(B), numel(C)]); %finding the minimum number of elements
% Storing only required elements
A = A(end-temp+1:end);
B = B(end-temp+1:end);
C = C(end-temp+1:end);

Python2.7 looping through custom sequence

Rookie questions season continues :)
I've got a function that has to be fed with numerical values from a certain range. This part of the code will be replicated for each data source I'm linking in, but with changed numerical parameters.
Example (that works):
for i in [0, 1, 2, 3, 7, 8, 15, 31, 32]:
    RowTDE(i)
Question
I would like to avoid typing in all the necessary values, therefore I would like to use something like this:
for i in [:2]+[7:10]+[15:]:
    RowTDE(i)
I've tried it and got:
SyntaxError: invalid syntax
Do I need to create a list of integers first to use it? Like
intList = [1, 2, 3, 4, ... 33].
Also, as mentioned previously, this range will differ for each data source, but the maximum numerical value will be less than 40 (each number represents a column index).
As always I would much appreciate your help with this and just let me know if you need more info.
Happy Monday morning :)
You can add ranges:
>>> for i in range(3) + range(7, 9) + range(15, 16) + range(31, 33):
...     print i
0
1
2
7
8
15
31
32
or build the range then slice it:
>>> r = range(33)
>>> for i in r[:3] + r[7:9] + r[15:16] + r[31:]:
...     print i
0
1
2
7
8
15
31
32
But you can't slice nothing, hence [:2] on its own is a SyntaxError.
Slice notation on its own doesn't make sense. It's implemented by objects that support it using the __getitem__ method.
You could (ab)use __getitem__ to create an object that uses that syntax:
import itertools

class SliceAbuse(object):
    def __getitem__(self, key):
        last = None
        for obj in key:
            if isinstance(obj, slice):
                for n in xrange(obj.start, obj.stop + 1, obj.step or 1):
                    last = n
                    yield n
            elif obj is Ellipsis:
                for n in itertools.count(last + 1):
                    yield n
            else:
                last = obj
                yield obj
For example:
for n in SliceAbuse()[1:5, 7:9, 11, ...]: # To infinity and beyond
    print n
    if n == 20:
        break
Although since your ranges are rather small, you can use the fact that range() in Python 2 returns a list object, which you can concatenate with other lists:
range(1, 4) + range(10, 15) == [1, 2, 3, 10, 11, 12, 13, 14]
Note that this won't work in Python 3, as range doesn't return a list.
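If you do need this on Python 3, one option (my suggestion, not from the original answer) is itertools.chain, which lazily concatenates the range objects:

from itertools import chain

for i in chain(range(3), range(7, 9), range(15, 16), range(31, 33)):
    print(i)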

Partition an array of numbers into sets by proximity

Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into them?
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting each element from its left and right neighbours, but that doesn't ensure the correct number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's the Ruby code based on Vikram's algorithm:
def partition(arr,clusters)
  # Return same array if clusters are less than zero or more than array size
  return arr if (clusters >= arr.size) || (clusters < 0)
  edges = {}
  # Get weights of edges
  arr.each_with_index do |a,i|
    break if i == (arr.length-1)
    edges[i] = a - arr[i+1]
  end
  # Sort edge weights in ascending order
  sorted_edges = edges.sort_by{|k,v| v}.collect{|k| k.first}
  # Maintain counter for joins happening.
  prev_edge = arr.size+1
  joins = 0
  sorted_edges.each do |edge|
    # If join is on right of previous, subtract the number of previous joins that happened on left
    if (edge > prev_edge)
      edge -= joins
    end
    joins += 1
    # Join the elements on the sides of edge.
    arr[edge] = arr[edge,2].flatten
    arr.delete_at(edge+1)
    prev_edge = edge
    # Get out when right clusters are done
    break if arr.size == clusters
  end
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these given differences (you should attach the index in the original array to each element in the difference array to make this most easy).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
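A minimal Python sketch of this gap-based idea (my own illustration, not code from the answer), assuming the input is already sorted in descending order:

def partition_by_gaps(arr, k):
    # pair each gap with the index of the element that follows it
    gaps = [(arr[i] - arr[i + 1], i + 1) for i in range(len(arr) - 1)]
    # cut positions: the indices of the k-1 largest gaps, in array order
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    return [arr[lo:hi] for lo, hi in zip([0] + cuts, cuts + [len(arr)])]

print(partition_by_gaps([37, 20, 16, 8, 5, 5, 3, 0], 3))
# [[37], [20, 16], [8, 5, 5, 3, 0]]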
I think your problem can be solved with k-clustering using Kruskal's algorithm. Kruskal's algorithm is used to find clusters such that there is maximum spacing between them.
Algorithm:
Construct a path graph from your data set, like the following:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
Then the weight of each edge is the absolute difference between the values of its endpoints:
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Use Kruskal on this graph until there are only k clusters remaining.
First sort the edges according to weight, in ascending order:
(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)
Then run Kruskal until exactly k = 3 clusters remain:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
stop as clusters = 3
Reconstructed solution: [(37), (20, 16), (8, 5, 5, 3, 0)], which is what you desired.
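For reference, a rough Python sketch of this merging process (my own simplified illustration; for a 1-D path graph, repeatedly joining the neighbouring clusters separated by the smallest gap gives the same result as running Kruskal on the sorted edge list):

def kruskal_1d_clusters(arr, k):
    clusters = [[x] for x in arr]                     # start with singletons
    while len(clusters) > k:
        # gap between cluster i and i+1 is the original edge weight
        # between their boundary elements
        gaps = [abs(clusters[i][-1] - clusters[i + 1][0])
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))                     # cheapest remaining edge
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

print(kruskal_1d_clusters([37, 20, 16, 8, 5, 5, 3, 0], 3))
# [[37], [20, 16], [8, 5, 5, 3, 0]]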
While @anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP-hard, there is a simple greedy solution.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough: just pick them from a de-duplicated copy of the values.
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
The Wikipedia page on this is pretty good.
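A compact 1-d sketch of that iteration in Python (my own illustration, not the answerer's code; the resulting grouping depends on the random initialization):

import random

def kmeans_1d(values, k, iterations=100):
    means = random.sample(sorted(set(values)), k)      # distinct initial means
    for _ in range(iterations):
        # assignment step: each value goes to the set of its closest mean
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - means[i]))].append(v)
        # update step: recompute means, keeping the old mean for an empty set
        new_means = [sum(g) / len(g) if g else means[i]
                     for i, g in enumerate(groups)]
        if new_means == means:                         # no element moved
            break
        means = new_means
    return groups

print(kmeans_1d([37, 20, 16, 8, 5, 5, 3, 0], 3))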
