Rookie questions season continues :)
I've got a function that has to be fed numerical values from a certain range. This part of the code will be replicated for each data source I'm linking in, but with different numerical parameters.
Example (that works):
for i in [0, 1, 2, 3, 7, 8, 15, 31, 32]:
    RowTDE(i)
Question
I would like to avoid typing in all the necessary values, therefore I would like to use something like this:
for i in [:2]+[7:10]+[15:]:
    RowTDE(i)
I've tried it and got:
SyntaxError: invalid syntax
Do I need to create a list of integers first to use it? Like
intList = [1, 2, 3, 4, ... 33].
Also, as mentioned previously, this range will differ for each data source, but the maximum numerical value will be less than 40 (each number represents a column index).
As always I would much appreciate your help with this and just let me know if you need more info.
Happy Monday morning :)
You can add ranges:
>>> for i in range(3) + range(7, 9) + range(15, 16) + range(31, 33):
...     print i
0
1
2
7
8
15
31
32
or build the range then slice it:
>>> r = range(33)
>>> for i in r[:3] + r[7:9] + r[15:16] + r[31:]:
...     print i
0
1
2
7
8
15
31
32
But you can't slice nothing, hence [:2] on its own is a SyntaxError.
Slice notation on its own doesn't make sense. It's implemented by objects that support it using the __getitem__ method.
You could (ab)use __getitem__ to create an object that uses that syntax:
import itertools
class SliceAbuse(object):
    def __getitem__(self, key):
        last = None
        for obj in key:
            if isinstance(obj, slice):
                for n in xrange(obj.start, obj.stop + 1, obj.step or 1):
                    last = n
                    yield n
            elif obj is Ellipsis:
                for n in itertools.count(last + 1):
                    yield n
            else:
                last = obj
                yield obj
For example:
for n in SliceAbuse()[1:5, 7:9, 11, ...]:  # To infinity and beyond
    print n
    if n == 20:
        break
Although since your ranges are rather small, you can use the fact that range() in Python 2 returns a list object, which you can concatenate with other lists:
range(1, 4) + range(10, 15) == [1, 2, 3, 10, 11, 12, 13, 14]
Note that this won't work in Python 3, as range doesn't return a list.
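For Python 3, where range returns a lazy sequence rather than a list, one possible equivalent is to chain the ranges with itertools.chain. This is only a sketch: the bounds are chosen to reproduce the list from the question, and the print call is a stand-in for RowTDE, which is not defined here.

```python
from itertools import chain

# Column indices 0-3, 7-8, 15 and 31-32, built from ranges instead of typed out.
columns = list(chain(range(4), range(7, 9), range(15, 16), range(31, 33)))

for i in columns:
    print(i)  # stand-in for RowTDE(i)
```

If the list is only iterated once, the list() call can be dropped and the chain object looped over directly.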
I have a question about coding. I came across similar questions in the database, but none of them cleared up my doubt. I am going through the book "Scala for the Impatient". The code below removes the negative elements from the array and keeps the non-negative ones as output
val a = ArrayBuffer(-1, 1, 0, -2, -1, 2, 5, 6, 7)
val positionsToKeep = for (i <- a.indices if a(i) >= 0) yield i
for (j <- positionsToKeep.indices) a(j) = a(positionsToKeep(j))
a.trimEnd(a.length - positionsToKeep.length)
It gives the output as (1,0,2,5,6,7) removing all negative elements.
I am unable to understand line 3 & 4.
for (j <- positionsToKeep.indices) a(j) = a(positionsToKeep(j))
a.trimEnd(a.length - positionsToKeep.length)
I've been scratching my head over these 2 lines for 2 days, but I can't give up, so I am finally posting here to seek some help.
Since a is an ArrayBuffer, it is mutable, so we can change its values in place.
Line 3:
Line 3 copies the elements to keep to the front of a. positionsToKeep holds the indices of the non-negative elements, so
a(j) = a(positionsToKeep(j))
// which runs like this
// a(0) = a(positionsToKeep(0))
// a(1) = a(positionsToKeep(1)) .... and so on
After all the kept values have been copied to the front of a, some older values may remain untouched at the end. Line 4 drops these leftover elements. When every value in a is non-negative, line 4 has nothing to do, but when the length of a is greater than that of positionsToKeep we need it.
Line 4: consider the scenario
val a = ArrayBuffer(1, 2, 3, 4, 5, 6)
Then positionsToKeep will hold the index of every element, and both collections will have the same length.
val positionsToKeep = Vector(0, 1, 2, 3, 4, 5)
In this case line 4 calls trimEnd(0), because the lengths of a and positionsToKeep are equal.
val a = ArrayBuffer(1, 2, 3, 4, -5, -6, 8, 9, -3)
In this case positionsToKeep will hold the indices (0, 1, 2, 3, 6, 7).
Line 3 updates a, and before line 4 runs a looks like this:
ArrayBuffer(1, 2, 3, 4, 8, 9, 8, 9, -3). We only need the first 6 values, as there are only 6 non-negative values, so the last 3 elements must be dropped. That is what trimEnd does for us.
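The same two steps can be traced outside Scala. Below is a rough Python mirror of the book's snippet, using the question's input; the names positions_to_keep and after_copy are illustrative only, and del on a slice plays the role of trimEnd.

```python
a = [-1, 1, 0, -2, -1, 2, 5, 6, 7]

# Indices (not values) of the elements to keep.
positions_to_keep = [i for i, v in enumerate(a) if v >= 0]

# "Line 3": copy each kept element forward into slot j.
# The source index is never smaller than j, so nothing is overwritten before it is read.
for j, src in enumerate(positions_to_keep):
    a[j] = a[src]
after_copy = list(a)  # snapshot: kept values in front, stale tail behind

# "Line 4": drop the stale tail, which is what trimEnd does in Scala.
del a[len(positions_to_keep):]

print(positions_to_keep)
print(after_copy)
print(a)
```

The snapshot taken between the two steps shows why the trim is needed: the tail still holds old values that were never overwritten.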
I have a database of 10,000 vectors of integers ranging from 1 to 1,000. The length of each vector can be up to 1,000. For example, it can look like this:
vec1: 1 2 56 78
vec2: 23 34 35 36 37 38
vec3: 1 2 3 4 5 7
vec4: 2 3 4 6 100
...
vec10000: 13 234
Now, I want to store this database in a way that is fast in response to a particular type of request. Each request will come in the form of an integer vector, up to 10,000 long:
query: 1 2 3 4 5 7 56 78 100
The response should be the indices of the vectors that are subsets of this query string. For example, in the above list, only vec1 and vec3 are subsets of the query, so the response in this case should be
response: 1 3
This database is not going to change so you can preprocess it in any possible way. You may specify that queries come in any protocol as well, as long as the information is the same. For example, it can come as a sorted list or a boolean table.
What is the best strategy to encode the database and the query to achieve the highest response rate possible?
Since you are using Python, this method is easy to implement, because its integers have arbitrary precision. (It can be implemented in any other language too, but will then need big-integer or modular arithmetic.)
So, for each number from 1-1000, assign a prime number to it. So,
1 => 2
2 => 3
3 => 5
4 => 7
...
...
25 => 97
...
...
1000 => 7919
For every set, compute its hash as the product of the primes assigned to its elements.
E.g. if your vector is vec-x = {1, 2, 5, 25}, its hash is 2 * 3 * 11 * 97.
The hash of the query vector is calculated the same way. Let its value be Q.
If Q % hash(vec-i) == 0, then vec-i is a subset of the query; otherwise it is not.
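A minimal sketch of the prime-product idea, assuming the number n is mapped to the n-th prime as in the table above. The helper names (first_primes, encode, query) are made up for illustration, and the sample database is the one from the question. Note that the products become very large integers, so this leans on Python's arbitrary-precision arithmetic.

```python
def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = first_primes(1000)  # PRIMES[n - 1] is the prime assigned to n

def encode(vector):
    """Product of the primes assigned to the distinct elements of vector."""
    product = 1
    for n in set(vector):
        product *= PRIMES[n - 1]
    return product

database = [
    [1, 2, 56, 78],
    [23, 34, 35, 36, 37, 38],
    [1, 2, 3, 4, 5, 7],
    [2, 3, 4, 6, 100],
    [13, 234],
]
codes = [encode(v) for v in database]  # preprocessing, done once

def query(ints):
    """Return 1-based indices of database vectors that are subsets of ints."""
    q = encode(ints)
    return [i + 1 for i, code in enumerate(codes) if q % code == 0]

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))
```

Each query still performs one big-integer modulo per stored vector, so whether this beats other encodings would need to be measured.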
What about just preprocessing your vector list into an indicator matrix and using matrix multiplication, something like:
import numpy as np
# generate 10000 random vectors with length in [0, 999]
# and elements in [0, 999]
vectors = [np.random.randint(1000, size=n)
           for n in np.random.randint(1000, size=10000)]
# generate indicator matrix; int32 rather than int8, so that dot products
# larger than 127 do not overflow
database = np.zeros((10000, 1000), dtype='int32')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)
def query(ints):
    tmp = np.zeros(1000, dtype='int32')
    tmp[ints] = 1
    return np.where(database.dot(tmp) == lengths)[0]
The dot product of a database row and the transformed query will be equal to the number of elements of the row that are in the query. If this number is equal to total number of elements in the row, then we've found a subset. Note that this uses 0-based indexing.
Here's this revised for your example data
vectors = [[1, 2, 56, 78],
           [23, 34, 35, 36, 37, 38],
           [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100],
           [13, 234]]
database = np.zeros((5, 1000), dtype='int32')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)
print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))
# [0 2], using 0-based indexing
I have a tensor of lengths in tensorflow, let's say it looks like this:
[4, 3, 5, 2]
I wish to create a mask of 1s and 0s whose number of 1s correspond to the entries to this tensor, padded by 0s to a total length of 8. I.e. I want to create this tensor:
[[1,1,1,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[1,1,1,1,1,0,0,0],
[1,1,0,0,0,0,0,0]
]
How might I do this?
This can now be achieved with tf.sequence_mask, e.g. tf.sequence_mask(lengths, maxlen=8), which returns a boolean mask by default.
This can be achieved using a variety of TensorFlow transformations:
# Make a 4 x 1 column of lengths so that it broadcasts against the row of positions.
lengths = [4, 3, 5, 2]
lengths_transposed = tf.expand_dims(lengths, 1)
# Make a 1 x 8 row containing [0, 1, ..., 7]
positions = tf.range(0, 8, 1)
range_row = tf.expand_dims(positions, 0)
# Broadcasting the comparison yields a 4 x 8 boolean mask
mask = tf.less(range_row, lengths_transposed)
# Select 1 or 0 for each value (tf.select was renamed to tf.where in TensorFlow 1.0).
result = tf.where(mask, tf.ones([4, 8]), tf.zeros([4, 8]))
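The comparison at the heart of this answer does not depend on TensorFlow: entry (row, col) of the mask is 1 exactly when col is smaller than the length for that row. A plain-Python sketch of that rule, useful for checking expected output by hand (max_len is an illustrative name):

```python
lengths = [4, 3, 5, 2]
max_len = 8

# mask[row][col] is 1 when position col lies inside the first lengths[row] entries.
mask = [[1 if col < n else 0 for col in range(max_len)] for n in lengths]

for row in mask:
    print(row)
```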
I've got a slightly shorter version than the previous answer. I'm not sure whether it is more efficient or not.
def mask(self, seq_length, max_seq_length):
    return tf.map_fn(
        lambda x: tf.pad(tf.ones([x], dtype=tf.int32), [[0, max_seq_length - x]]),
        seq_length)
I have an array in which I want to replace values at a known set of indices with the value immediately preceding it. As an example, my array might be
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0];
and the indices of values to be replaced by previous values might be
y = [2, 3, 8];
I want this replacement to occur from left to right, or else start to finish. That is, the value at index 2 should be replaced by the value at index 1, before the value at index 3 is replaced by the value at index 2. The result using the arrays above should be
[1, 1, 1, 4, 5, 6, 7, 7, 9, 0]
However, if I use the obvious method to achieve this in Matlab, my result is
>> x(y) = x(y-1)
x =
1 1 2 4 5 6 7 7 9 0
Hopefully you can see that this operation was performed right to left and the value at index 3 was replaced by the value at index 2, then 2 was replaced by 1.
My question is this: Is there some way of achieving my desired result in a simple way, without brute force looping over the arrays or doing something time consuming like reversing the arrays around?
Well, this is still a loop in practice, but the number of iterations grows with the length of the longest run of consecutive indices in y rather than with the size of the arrays:
while ~isequal(x(y),x(y-1))
    x(y)=x(y-1)
end
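To pin down the intended left-to-right semantics, here is a small reference version in Python (not MATLAB), which can be handy for checking vectorized variants against. It is only a sketch: fill_from_previous is a made-up name, and y is assumed to hold 1-based indices greater than 1, as in the question.

```python
def fill_from_previous(x, y):
    """Replace x at each 1-based index in y with the value just before it,
    processing indices from left to right so that replacements cascade."""
    out = list(x)
    for idx in sorted(y):
        out[idx - 1] = out[idx - 2]  # convert 1-based index to 0-based
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
y = [2, 3, 8]
print(fill_from_previous(x, y))
```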
Using nancumsum you can achieve a fully vectorized version. Nevertheless, for most cases the solution karakfa provided is probably the one to prefer; only in extreme cases with long runs of consecutive indices in y is this code faster.
c1=[0,diff(y)==1];
c1(c1==0)=nan;
shift=nancumsum(c1,2,4);
y(~isnan(shift))=y(~isnan(shift))-shift(~isnan(shift));
x(y)=x(y-1)
Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into them?
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting the element with right and left numbers but that doesn't ensure the correct number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's Ruby code for Vikram's algorithm:
def partition(arr, clusters)
  # Return the array unchanged if clusters is negative or not smaller than the array size
  return arr if (clusters >= arr.size) || (clusters < 0)
  edges = {}
  # Get weights of edges
  arr.each_with_index do |a, i|
    break if i == (arr.length - 1)
    edges[i] = a - arr[i + 1]
  end
  # Sort edge weights in ascending order
  sorted_edges = edges.sort_by { |k, v| v }.collect { |k| k.first }
  # Maintain counter for joins happening.
  prev_edge = arr.size + 1
  joins = 0
  sorted_edges.each do |edge|
    # If join is on right of previous, subtract the number of previous joins that happened on left
    if (edge > prev_edge)
      edge -= joins
    end
    joins += 1
    # Join the elements on the sides of edge.
    arr[edge] = arr[edge, 2].flatten
    arr.delete_at(edge + 1)
    prev_edge = edge
    # Get out when right clusters are done
    break if arr.size == clusters
  end
  # Return the partitioned array (it is modified in place)
  arr
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these differences (the easiest way is to attach the index in the original array to each element of the difference array).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
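The steps above can be sketched in Python as follows. This assumes the input is already sorted in descending order, as stated; partition_by_gaps is an illustrative name, and ties between equal gaps are broken toward the leftmost gap, which is a choice the description leaves open.

```python
def partition_by_gaps(values, k):
    """Split a descending-sorted list into k contiguous groups by cutting
    at the k - 1 largest differences between adjacent elements."""
    if k <= 1 or len(values) <= 1:
        return [list(values)]
    k = min(k, len(values))
    # Pair each difference with the index of its left-hand element.
    gaps = [(values[i] - values[i + 1], i) for i in range(len(values) - 1)]
    # Largest gaps first; ties go to the leftmost gap.
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[:k - 1]
    cuts = sorted(i + 1 for _, i in largest)
    bounds = [0] + cuts + [len(values)]
    return [values[start:end] for start, end in zip(bounds, bounds[1:])]

data = [37, 20, 16, 8, 5, 5, 3, 0]
for k in (2, 3, 4):
    print(k, partition_by_gaps(data, k))
```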
I think your problem can be solved with k-clustering using Kruskal's algorithm. Kruskal's algorithm is used here to find the clusters such that the spacing between them is maximal.
Algorithm:
Construct a path graph from your data set like the following:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
The weight of each edge is then the difference between the values at its endpoints:
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Run Kruskal's algorithm on this graph until only k clusters remain:
First sort the edges by weight in ascending order:
(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)
Run Kruskal's algorithm on it to find exactly k = 3 clusters:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
stop as clusters = 3
The reconstructed solution, [(37), (20, 16), (8, 5, 5, 3, 0)], is what you wanted.
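A compact Python sketch of this procedure, using a simple union-find structure to track clusters. The function name kruskal_clusters is illustrative, and edges with equal weight are joined in left-to-right order, which the description above does not prescribe.

```python
def kruskal_clusters(values, k):
    """Join adjacent elements along the cheapest path-graph edges
    until only k clusters remain (Kruskal-style clustering)."""
    n = len(values)
    parent = list(range(n))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edge i connects element i and element i + 1.
    edges = sorted((abs(values[i] - values[i + 1]), i) for i in range(n - 1))
    clusters = n
    for _, i in edges:
        if clusters <= k:
            break
        a, b = find(i), find(i + 1)
        if a != b:
            parent[b] = a
            clusters -= 1

    # Collect members per representative, in original left-to-right order.
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    return list(groups.values())

print(kruskal_clusters([37, 20, 16, 8, 5, 5, 3, 0], 3))
```

On a path graph no edge can ever close a cycle, so the union-find check is not strictly required here; it is kept so the sketch follows the usual shape of Kruskal's algorithm.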
While @anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1D.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP-hard in general, there is a simple greedy heuristic.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough: just pick them from a copy of the input that has been "uniquified" (deduplicated).
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
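A minimal sketch of this iteration for 1D data. To keep the result reproducible it seeds the means deterministically, with k distinct values spread evenly over the sorted distinct inputs; that seeding is an assumption of this sketch, not the random start described above. The name kmeans_1d and the max_iter cap are illustrative as well.

```python
def kmeans_1d(values, k, max_iter=100):
    """Group values into k clusters by repeatedly assigning each value
    to its nearest mean and then recomputing the means."""
    distinct = sorted(set(values))  # deduplicate so the seeds are distinct
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    if k == 1:
        means = [float(distinct[0])]
    else:
        # Deterministic seeding (an assumption): evenly spaced distinct values.
        means = [float(distinct[j * (len(distinct) - 1) // (k - 1)])
                 for j in range(k)]

    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: abs(v - means[c]))
                          for v in values]
        if new_assignment == assignment:
            break  # no element moved, so the iteration has converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = sum(members) / len(members)

    return [[v for v, a in zip(values, assignment) if a == c] for c in range(k)]

data = [37, 20, 16, 8, 5, 5, 3, 0]
print(kmeans_1d(data, 3))
print(kmeans_1d(data, 2))
```

Because k-means minimizes squared distance to the cluster means rather than cutting at the largest gaps, its split need not match the one asked for in the question. With this seeding, k=3 reproduces the desired grouping, while k=2 groups 37 together with 20.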
The Wikipedia page on this is pretty good.