Suppose that I have two arrays A and B, where A is an m by n matrix and B is a vector of size m. Each value in B refers to the same row in A and is either 1 or 0. Now assume A and B are as below:
A = 1 2 3 4        B = 1
    5 6 7 8            0
    5 6 7 8            0
    5 6 7 8            0
    5 6 7 8            0
    5 6 7 8            1
    5 6 7 8            0
    5 6 7 8            0
I want to break both arrays into k parts, and I want all parts to have a (semi-)uniform number of 1s and 0s. With a naive split, some parts end up with no 1s at all while others have many.
I need an algorithm to sort both arrays before doing this breaking (splitting) job. How should this kind of sort be done, or what is the best approach?
It is worth mentioning that the real data has 679 rows, with a 1 for 70 of them and a 0 for the others. For now, the desired k is 10.
You haven't given any code examples, and I don't want to give any code, because I've recently asked a similar question as a homework exercise. However, here is some pseudocode mixed with Java-esque method signatures. In the following, I will assume that one row of your dataset is modeled as Pair<A, B> with some generic types A and B (thinking of A as "features" and B as "labels" in a supervised machine learning task). In your concrete case, A would be some kind of list of integers, and B might be Boolean.
First, you define a helper method that can shuffle and split the dataset into k parts, completely ignoring the labels. In Java-syntax:
public static <A,B> ArrayList<ArrayList<Pair<A,B>>> split(
        ArrayList<Pair<A, B>> dataset,
        int k
) {
    // shuffle the dataset
    // generate `k` new lists
    // add rows from the shuffled list to the `k` lists
    // in round-robin fashion, i.e.
    // move `i`-th item to the `i%k`-th list.
}
Building on top of that, you can define the stratified split version:
public static <A,B> ArrayList<ArrayList<Pair<A,B>>> stratifiedSplit(
        ArrayList<Pair<A,B>> dataset,
        int k
) {
    // create a (hash?)map for the strata.
    // In this map, you want to collect rows in separate
    // lists, depending on their label:
    HashMap<B, ArrayList<Pair<A,B>>> strata = ...;
    // (Check whether your collection library of choice
    // provides a `groupBy` or a `groupingBy` of some sort.
    // In C#, this might help:
    // https://msdn.microsoft.com/en-us/library/bb534304(v=vs.110).aspx )
    // In your concrete case,
    // your map should look something like this:
    // {
    //   false -> [
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false),
    //     ([5, 6, 7, 8], false)
    //   ],
    //   true -> [
    //     ([5, 6, 7, 8], true),
    //     ([1, 2, 3, 4], true)
    //   ]
    // }
    // where `{}`=map, `[]`=list/array, `()`=tuple/pair.
    // Now you generate `k` lists to hold the result.
    // For each stratum, you call the ordinary non-stratified
    // `split` method, and append the `k` pieces returned by
    // this method to the `k` result lists.
    // In the end, you again shuffle each of the `k` result
    // lists (so that the labels aren't sorted in the end)
    // and return the `k` result lists.
}
Writing out the details is left as an exercise.
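If a library routine is acceptable, scikit-learn's StratifiedKFold implements exactly this kind of label-balanced split. A minimal Python sketch (the A and B below are toy stand-ins mimicking the 679-row, 70-ones case):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy stand-ins for the real data: 679 rows, 70 of them labelled 1
A = np.random.rand(679, 4)
B = np.zeros(679, dtype=int)
B[:70] = 1

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (_, part) in enumerate(skf.split(A, B)):
    # each of the 10 parts should get roughly 70/10 = 7 of the ones
    print(fold, len(part), B[part].sum())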
Related
I need to develop an algorithm which would accept two numbers m and n (the dimensions of a 2D array) as input and generate a 2D array filled with the numbers [1..m*n] under the following condition:
All (4) elements adjacent to a given element cannot be equal to currentElement + 1
Adjacent elements are located to the two/three/four sides (depending on position) of a given element
0 1 0
1 2 1
0 1 0
(E.g four 1s are adjacent to 2)
Example:
Input: m = 3, n = 3 (does not essentially have to be square matrix)
(Sample) output:
[
[7, 2, 5],
[1, 6, 9],
[3, 8, 4]
]
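For concreteness, a small Python checker for this condition (the helper below is mine, just to make the adjacency rule precise):

def satisfies_condition(grid):
    # no orthogonal neighbour of a cell may hold that cell's value + 1
    m, n = len(grid), len(grid[0])
    for i in range(m):
        for j in range(n):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] == grid[i][j] + 1:
                    return False
    return True

print(satisfies_condition([[7, 2, 5], [1, 6, 9], [3, 8, 4]]))  # True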
Note that there apparently may exist more than one possible output. In that case, numbers in the array have to be generated randomly (though still meeting the conditions), not following any preset sequence (e.g not [ [1, 3, 5], [4, 6, 2], [7, 9, 8] ] because it clearly uses a non-randomly generated sequence of numbers, odds first, then evens, etc)
Basically, for the same input, on two different occasions, two different arrays should be generated.
P.S: that was a coding interview question and I wonder how I could solve it, so, any help is highly appreciated.
I have a database of 10,000 vectors of integers ranging from 1 to 1,000. The length of each vector can be up to 1,000. For example, it can look like this:
vec1: 1 2 56 78
vec2: 23 34 35 36 37 38
vec3: 1 2 3 4 5 7
vec4: 2 3 4 6 100
...
vec10000: 13 234
Now, I want to store this database in a way that is fast in response to a particular type of request. Each request will come in the form of an integer vector, up to 10,000 long:
query: 1 2 3 4 5 7 56 78 100
The response should be the indices of the vectors that are subsets of this query string. For example, in the above list, only vec1 and vec3 are subsets of the query, so the response in this case should be
response: 1 3
This database is not going to change so you can preprocess it in any possible way. You may specify that queries come in any protocol as well, as long as the information is the same. For example, it can come as a sorted list or a boolean table.
What is the best strategy to encode the database and the query to achieve the highest response rate possible?
Since you are using Python, this method seems easy. (It can be implemented in any other language as well, but will involve modular arithmetic etc.)
So, for each number from 1 to 1000, assign a distinct prime number to it:
1 => 2
2 => 3
3 => 5
4 => 7
...
...
25 => 97
...
...
1000 => 7919
For every vector, use as its hash value the product of the primes assigned to its elements.
E.g. if your vector is vec-x = {1, 2, 5, 25}, then hash(vec-x) = 2 * 3 * 11 * 97.
Similarly, the hash of your query vector can be calculated as above. Let its value be Q.
If Q % hash(vec-i) == 0, vec-i is a subset of the query; otherwise it is not.
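A rough Python sketch of this prime-product encoding (sympy.prime is used here to fetch the n-th prime; a hand-rolled sieve would do just as well):

from sympy import prime

# map each possible value 1..1000 to a distinct prime
prime_of = {v: prime(v) for v in range(1, 1001)}

def encode(values):
    # hash = product of the primes of the (distinct) values
    h = 1
    for v in set(values):
        h *= prime_of[v]
    return h

vectors = [[1, 2, 56, 78], [23, 34, 35, 36, 37, 38], [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100], [13, 234]]
hashes = [encode(v) for v in vectors]

def query(q):
    Q = encode(q)
    # vec-i is a subset of the query iff Q is divisible by its hash
    return [i + 1 for i, h in enumerate(hashes) if Q % h == 0]

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))  # [1, 3]

Note that Q can become a very large integer for long queries, which is where Python's arbitrary-precision integers help.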
What about just preprocessing your vector list into an indicator matrix and using matrix multiplication, something like:
import numpy as np

# generate 10000 random vectors with lengths in [0, 1000)
# and elements in [0, 1000)
vectors = [np.random.randint(1000, size=n)
           for n in np.random.randint(1000, size=10000)]

# generate indicator matrix
database = np.zeros((10000, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

def query(ints):
    # int32 avoids int8 overflow in the dot product for rows with many elements
    tmp = np.zeros(1000, dtype='int32')
    tmp[ints] = 1
    return np.where(database.dot(tmp) == lengths)[0]
The dot product of a database row and the transformed query will be equal to the number of elements of the row that are in the query. If this number is equal to total number of elements in the row, then we've found a subset. Note that this uses 0-based indexing.
Here's the same approach revised for your example data:

vectors = [[1, 2, 56, 78],
           [23, 34, 35, 36, 37, 38],
           [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100],
           [13, 234]]

database = np.zeros((5, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))
# [0 2]  (0-based indexing)
I have the following sample sheet:
    | A | B      | C | D | E | F | G | H | I | J
  3 |   | Points | 8 | 4 | 2 | 1 |   |   |   |
  5 |   | Values | 1 | 2 | 3 | 4 | 4 | 3 | 1 | 2
I'm trying to sum the 'Points' based upon the array index from the 'Values'.
My expected result from this is: 30
Here is my formula:
{=SUM(INDEX($C$3:$F$3,1,C5:J5))}
For some reason though, this only returns the first value of the array, rather than the entire sum.
To clarify, the C# version would be something like:
var points = new int[] { 8, 4, 2, 1 };
var values = new int[] { 2, 4, 3, 1, 2, 4, 2 };
var result = (from v in values
select points[v - 1]).Sum(); // -1 as '4' will crash, but in Excel '4' is fine
Edit: Adding further clarifying example
Another example to clarify:
Points is the array. The 'values' represents the index of the array to sum.
The example above is the same as:
=SUM(8, 4, 2, 1, 1, 2, 8, 4)
INDEX will not take its row or column parameters from an array and then evaluate multiple times within one array formula contained in one cell. For this, OFFSET is needed.
Either
{=SUM(N(OFFSET($C$3,,C5:J5-1)))}
as an array formula.
Or
=SUMPRODUCT(N(OFFSET($C$3,,C5:J5-1)))
as an implicit array formula without the need for [Ctrl]+[Shift]+[Enter].
With the sample data, OFFSET($C$3,,C5:J5-1) resolves to the references of {8, 4, 2, 1, 1, 2, 8, 4}, which N() coerces to their numeric values, so either formula returns 30.
I have an array in which I want to replace values at a known set of indices with the value immediately preceding it. As an example, my array might be
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0];
and the indices of values to be replaced by previous values might be
y = [2, 3, 8];
I want this replacement to occur from left to right, or else start to finish. That is, the value at index 2 should be replaced by the value at index 1, before the value at index 3 is replaced by the value at index 2. The result using the arrays above should be
[1, 1, 1, 4, 5, 6, 7, 7, 9, 0]
However, if I use the obvious method to achieve this in Matlab, my result is
>> x(y) = x(y-1)
x =
1 1 2 4 5 6 7 7 9 0
Hopefully you can see that this operation was performed right to left and the value at index 3 was replaced by the value at index 2, then 2 was replaced by 1.
My question is this: Is there some way of achieving my desired result in a simple way, without brute force looping over the arrays or doing something time consuming like reversing the arrays around?
Well, practically this is a loop, but the number of iterations is only the length of the longest run of consecutive indices in y:
while ~isequal(x(y),x(y-1))
x(y)=x(y-1)
end
Using nancumsum you can achieve a fully vectorized version. Nevertheless, for most cases the solution karakfa provided is probably the one to prefer; only for extreme cases with long consecutive runs in y is this code faster.
c1=[0,diff(y)==1];
c1(c1==0)=nan;
shift=nancumsum(c1,2,4);
y(~isnan(shift))=y(~isnan(shift))-shift(~isnan(shift));
x(y)=x(y-1)
Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into them?
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting each element from its right and left neighbours, but that doesn't ensure the correct number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's the Ruby code for Vikram's algorithm:
def partition(arr, clusters)
  # Return same array if clusters are less than zero or more than array size
  return arr if (clusters >= arr.size) || (clusters < 0)

  edges = {}
  # Get weights of edges
  arr.each_with_index do |a, i|
    break if i == (arr.length - 1)
    edges[i] = a - arr[i + 1]
  end

  # Sort edge weights in ascending order
  sorted_edges = edges.sort_by { |k, v| v }.collect { |k| k.first }

  # Maintain counter for joins happening.
  prev_edge = arr.size + 1
  joins = 0
  sorted_edges.each do |edge|
    # If join is on right of previous, subtract the number of previous joins that happened on left
    if (edge > prev_edge)
      edge -= joins
    end
    joins += 1
    # Join the elements on the sides of edge.
    arr[edge] = arr[edge, 2].flatten
    arr.delete_at(edge + 1)
    prev_edge = edge
    # Get out when right clusters are done
    break if arr.size == clusters
  end
  # Return the partitioned array
  arr
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these given differences (attach the index in the original array to each element of the difference array to make this easier).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
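A short Python sketch of this largest-gaps approach (the function name is mine; it assumes the input is already sorted in descending order):

def partition_by_gaps(arr, k):
    # differences between adjacent elements, remembering their positions
    gaps = [(arr[i] - arr[i + 1], i) for i in range(len(arr) - 1)]
    # the positions of the k-1 largest gaps become the cut points
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    parts, start = [], 0
    for c in cuts:
        parts.append(arr[start:c + 1])
        start = c + 1
    parts.append(arr[start:])
    return parts

print(partition_by_gaps([37, 20, 16, 8, 5, 5, 3, 0], 3))
# [[37], [20, 16], [8, 5, 5, 3, 0]]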
I think your problem can be solved using k-clustering with Kruskal's algorithm. Kruskal's algorithm is used to find clusters such that there is maximum spacing between them.
Algorithm:
Construct a path graph from your data set, like the following:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
The weight of each edge is the absolute difference between the values of its endpoints:
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Run Kruskal's algorithm on this graph until only k clusters remain.
First, sort the edges by weight in ascending order:
(4,5), (5,6), (6,7), (3,4), (1,2), (2,3), (0,1)
Then merge clusters along these edges until exactly k = 3 clusters remain:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
stop, as clusters = 3
The reconstructed solution [(37), (20, 16), (8, 5, 5, 3, 0)] is what you desired.
While #anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP-hard, there is a simple greedy solution.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough: just pick them from the set of distinct values (the input with duplicates removed).
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
The Wikipedia page on this is pretty good.
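A small Python sketch of that greedy iteration (Lloyd's algorithm) in one dimension; the function and its defaults are illustrative rather than a tuned implementation:

import random

def kmeans_1d(values, k, iterations=100):
    # initial means: k distinct values picked at random
    means = random.sample(sorted(set(values)), k)
    clusters = []
    for _ in range(iterations):
        # assignment step: each value joins the cluster of its closest mean
        clusters = [[] for _ in range(k)]
        for v in values:
            closest = min(range(k), key=lambda j: abs(v - means[j]))
            clusters[closest].append(v)
        # update step: recompute each mean (keep the old one if a cluster is empty)
        new_means = [sum(c) / len(c) if c else means[j]
                     for j, c in enumerate(clusters)]
        if new_means == means:
            break
        means = new_means
    return [sorted(c, reverse=True) for c in clusters if c]

print(kmeans_1d([37, 20, 16, 8, 5, 5, 3, 0], 3))
# e.g. [[37], [20, 16], [8, 5, 5, 3, 0]] (the result depends on the random start)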