Looping through a collection and deleting things on the way - loops

I want to go through a collection and find the first pair of matching elements, but my current approach keeps going out of bounds on the indexing.
Here's a simplified MWE:
function processstuff(stuff)
    for pointer1 in 1:length(stuff)
        for pointer2 in pointer1:length(stuff)
            println("$(stuff)")
            pointer1 == pointer2 && continue
            if stuff[pointer1] == stuff[pointer2]
                # items match, remove them
                deleteat!(stuff, pointer1)
                deleteat!(stuff, pointer2)
            end
        end
    end
end
processstuff(collect(rand(1:5, 20)))
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[1, 4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 2, 1, 1]
[4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 1, 2, 4, 3, 2, 1, 1]
[4, 3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 1, 2, 4, 3, 2, 1, 1]
[3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 1, 2, 4, 2, 1, 1]
[3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 1, 2, 4, 2, 1, 1]
[3, 3, 2, 4, 5, 2, 2, 2, 3, 1, 1, 2, 4, 2, 1, 1]
ERROR: LoadError: BoundsError: attempt to access 16-element Array{Int64,1} at index [17]
(Obviously this example just compares two numbers; the real comparison is more involved.)
The idea of updating the collection of stuff by removing both elements once they have been processed looks like it works, because I think Julia updates the iteration range each time through. But only for a while...?

You can use the following approach (assuming you want to remove pairs):
function processstuff!(stuff)
    pointer1 = 1
    while pointer1 < length(stuff)
        for pointer2 in pointer1+1:length(stuff)
            if stuff[pointer1] == stuff[pointer2]
                deleteat!(stuff, (pointer1, pointer2))
                pointer1 -= 1 # correct pointer location as we later add 1 to it
                break
            end
        end
        pointer1 += 1
    end
end
In your code there were several problems:
you called deleteat! twice, so after the first deletion shifted the elements, the second index no longer pointed at the matching element (and could even fall past the end)
your inner loop kept running after a match, so it could try to delete at pointer1 several times
in the outer loop I use while so that the shrinking length of stuff is re-checked on every pass

Related

Sort the rows of the array by the value of the element of the main diagonal in each of the rows (in the initial array)

[[3, 2, 7, 1, 3, 7, 2, 6, 4, 8],
[5, 3, 7, 1, 1, 1, 6, 4, 6, 7],
[1, 9, 7, 8, 2, 1, 3, 7, 9, 8],
[1, 7, 3, 7, 6, 6, 6, 8, 4, 8],
[4, 2, 3, 2, 2, 3, 2, 4, 7, 6]]
Given such an array, how should the result look?
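For illustration, a minimal numpy sketch of one reading of this task (sort the rows by each row's main-diagonal element a[i, i] of the initial array); the variable names and the use of numpy are my own assumptions, not from the question:

import numpy as np

a = np.array([[3, 2, 7, 1, 3, 7, 2, 6, 4, 8],
              [5, 3, 7, 1, 1, 1, 6, 4, 6, 7],
              [1, 9, 7, 8, 2, 1, 3, 7, 9, 8],
              [1, 7, 3, 7, 6, 6, 6, 8, 4, 8],
              [4, 2, 3, 2, 2, 3, 2, 4, 7, 6]])

keys = a.diagonal()                          # a[i, i] for each row: [3, 3, 7, 7, 2]
result = a[np.argsort(keys, kind="stable")]  # rows reordered by their diagonal value

A stable sort keeps rows with equal diagonal values in their original relative order.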

numpy arrays: building a 3d array by adding 2d slices one at a time

Looking for some help with numpy and building a 3d array from multiple 2d arrays. I want to make a loop such that on every iteration I make a new 2d array and make it a new slice in an existing 3d array. Here's my code sample.
import numpy as np
import random
import array
a = np.random.randint(0, 9, size=(10, 10))  # make random 10x10 matrix
b = a                                       # save copy
a = np.random.randint(0, 9, size=(10, 10))  # make random 10x10 matrix
a.shape
(10, 10)  # verify it's 10x10
b.shape
(10, 10)  # verify it's 10x10
b = np.array([b, a])  # combine two 2d matrices into one 3d matrix
b.shape
(2, 10, 10)  # verify it's a 3d matrix with two planes
a = np.random.randint(0, 9, size=(10, 10))  # make new random 10x10 matrix
b = np.array([b, a])  # add new 2d plane to the 3d matrix
b.shape
(2,)  # should be (3, 10, 10)
Can anyone see what I'm doing wrong?
When you combine two arrays by using np.array([...]), they have to be the same shape. If they aren't, numpy treats them not as numpy arrays but as opaque objects. There should have been a warning when you ran the last b = np.array([b, a]):
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
Instead, use np.stack
b = np.stack([*b, a])
*b basically expands the children of b, so the above is equivalent to b = np.stack([b[0], b[1], a])
Or you can use np.vstack (vertical stack):
b = np.vstack([b, a[None]])
a[None] basically wraps a in another array. a.shape == (10, 10), a[None].shape == (1, 10, 10)
Both of the above produce the following:
>>> b.shape
(3, 10, 10)
>>> b
array([[[3, 8, 0, 2, 8, 0, 0, 5, 7, 7],
[0, 5, 2, 8, 8, 2, 1, 4, 5, 8],
[3, 2, 2, 4, 1, 8, 2, 0, 7, 5],
[5, 6, 5, 0, 8, 7, 4, 0, 4, 6],
[6, 2, 3, 7, 4, 3, 6, 6, 4, 8],
[2, 5, 1, 7, 1, 3, 0, 6, 0, 5],
[3, 4, 0, 7, 3, 4, 5, 0, 7, 4],
[0, 7, 2, 8, 7, 7, 4, 3, 2, 6],
[4, 6, 2, 5, 5, 8, 5, 8, 0, 8],
[3, 4, 1, 0, 3, 7, 0, 6, 7, 3]],
[[4, 0, 6, 2, 4, 4, 7, 0, 7, 2],
[5, 8, 5, 8, 2, 8, 3, 7, 4, 6],
[2, 1, 2, 0, 4, 5, 6, 3, 0, 0],
[8, 7, 3, 0, 8, 8, 0, 4, 1, 4],
[0, 2, 5, 7, 5, 3, 0, 5, 1, 7],
[1, 5, 8, 0, 2, 6, 5, 0, 3, 2],
[4, 4, 4, 3, 3, 8, 6, 6, 5, 5],
[5, 3, 6, 8, 0, 3, 0, 8, 8, 3],
[4, 2, 6, 6, 6, 2, 0, 0, 6, 2],
[7, 3, 8, 0, 7, 1, 1, 8, 6, 2]],
[[6, 6, 1, 1, 6, 4, 6, 2, 6, 7],
[0, 5, 6, 7, 5, 0, 0, 5, 8, 2],
[6, 6, 1, 5, 2, 3, 2, 3, 3, 2],
[0, 3, 7, 6, 4, 5, 3, 1, 7, 2],
[7, 6, 3, 0, 1, 7, 8, 3, 8, 5],
[3, 1, 8, 6, 1, 5, 0, 8, 6, 1],
[1, 4, 8, 1, 7, 0, 1, 1, 5, 3],
[2, 1, 4, 8, 2, 3, 1, 6, 8, 7],
[8, 1, 1, 0, 6, 1, 0, 6, 1, 6],
[1, 8, 4, 7, 7, 5, 0, 3, 8, 6]]])
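Since the original goal was a loop that adds one 2d slice per iteration, a common pattern is to collect the slices in a plain Python list and stack once at the end, which avoids re-copying a growing 3d array on every iteration. A minimal sketch (the sizes and loop count here are just placeholders):

import numpy as np

slices = []
for _ in range(5):                                  # 5 iterations -> 5 planes
    plane = np.random.randint(0, 9, size=(10, 10))  # new 2d slice each iteration
    slices.append(plane)                            # cheap: only stores a reference

volume = np.stack(slices)                           # single copy at the end
print(volume.shape)                                 # (5, 10, 10)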

Removing array rows based on certain matches between elements

Consider the small sample of a 6-column integer array:
import numpy as np
J = np.array([[1, 3, 1, 3, 2, 5],
              [2, 6, 3, 4, 2, 6],
              [1, 7, 2, 5, 2, 5],
              [4, 2, 8, 3, 8, 2],
              [0, 3, 0, 3, 0, 3],
              [2, 2, 3, 3, 2, 3],
              [4, 3, 4, 3, 3, 4]])
I want to remove, from J:
a) all rows where the first and second PAIRS of elements are exact matches
(this removes rows like [1,3, 1,3, 2,5])
b) all rows where the second and third PAIRS of elements are exact matches
(this removes rows like [1,7, 2,5, 2,5])
Matches between any other pairs are OK.
I have a solution, below, but it is handled in two steps. If there is a more direct, cleaner, or more readily extendable approach, I'd be very interested.
K = J[~(np.logical_and(J[:,0] == J[:,2], J[:,1] == J[:,3]))]
L = K[~(np.logical_and(K[:,2] == K[:,4], K[:,3] == K[:,5]))]
K removes the 1st, 5th, and 7th rows from J, leaving
K = [[2, 6, 3, 4, 2, 6],
     [1, 7, 2, 5, 2, 5],
     [4, 2, 8, 3, 8, 2],
     [2, 2, 3, 3, 2, 3]]
L removes the 2nd row from K, giving the final outcome.
L = [[2, 6, 3, 4, 2, 6],
     [4, 2, 8, 3, 8, 2],
     [2, 2, 3, 3, 2, 3]]
I'm hoping for an efficient solution because, learning from this problem, I need to extend these ideas to 8-column arrays where
I eliminate rows having exact matches between the 1st and 2nd PAIRS, the 2nd and 3rd PAIRS, and the 3rd and 4th PAIRS.
Since we are checking adjacent pairs for equality, differencing on the 3D-reshaped data seems like one way to get a cleaner vectorized solution -
# a is input array
In [117]: b = a.reshape(a.shape[0],-1,2)
In [118]: a[~(np.diff(b,axis=1)==0).all(2).any(1)]
Out[118]:
array([[2, 6, 3, 4, 2, 6],
[4, 2, 8, 3, 8, 2],
[2, 2, 3, 3, 2, 3]])
If you are going for performance, skip the differencing and go for sliced equality -
In [142]: a[~(b[:,:-1] == b[:,1:]).all(2).any(1)]
Out[142]:
array([[2, 6, 3, 4, 2, 6],
[4, 2, 8, 3, 8, 2],
[2, 2, 3, 3, 2, 3]])
Generic number of columns
This extends just as well to a generic number of columns -
In [156]: a
Out[156]:
array([[1, 3, 1, 3, 2, 5, 1, 3, 1, 3, 2, 5],
[2, 6, 3, 4, 2, 6, 2, 6, 3, 4, 2, 6],
[1, 7, 2, 5, 2, 5, 1, 7, 2, 5, 2, 5],
[4, 2, 8, 3, 8, 2, 4, 2, 8, 3, 8, 2],
[0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3],
[2, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 3],
[4, 3, 4, 3, 3, 4, 4, 3, 4, 3, 3, 4]])
In [158]: b = a.reshape(a.shape[0],-1,2)
In [159]: a[~(b[:,:-1] == b[:,1:]).all(2).any(1)]
Out[159]:
array([[4, 2, 8, 3, 8, 2, 4, 2, 8, 3, 8, 2],
[2, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 3]])
Of course, we are assuming the number of cols allows pairing.
What you have is quite reasonable. Here's what I would write:
def eliminate_pairs(x: np.ndarray) -> np.ndarray:
    first_second = (x[:, 0] == x[:, 2]) & (x[:, 1] == x[:, 3])
    second_third = (x[:, 2] == x[:, 4]) & (x[:, 3] == x[:, 5])
    return x[~(first_second | second_third)]
You could also apply De Morgan's laws and eliminate the extra not operation, but that's less important than clarity.
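For illustration, a sketch of that De Morgan form (my own phrasing, assuming the same 6-column layout): keep a row when, for each of the two pair comparisons, at least one element differs.

def eliminate_pairs_demorgan(x):
    # ~(A | B) == ~A & ~B, and ~((p == q) & (r == s)) == (p != q) | (r != s)
    keep_first_second = (x[:, 0] != x[:, 2]) | (x[:, 1] != x[:, 3])
    keep_second_third = (x[:, 2] != x[:, 4]) | (x[:, 3] != x[:, 5])
    return x[keep_first_second & keep_second_third]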
Let's try a loop:
mask = False
for i in range(0, 3, 2):
    mask = (J[:, i:i+2] == J[:, i+2:i+4]).all(1) | mask
J[~mask]
Output:
array([[2, 6, 3, 4, 2, 6],
[4, 2, 8, 3, 8, 2],
[2, 2, 3, 3, 2, 3]])
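The question mentions extending this to 8-column arrays. As a sketch (reusing J from the question and assuming, as above, that only adjacent non-overlapping pairs are compared), the same loop generalizes by deriving the range from the number of columns:

mask = False
for i in range(0, J.shape[1] - 2, 2):                  # compare pair (i, i+1) with pair (i+2, i+3)
    mask = (J[:, i:i+2] == J[:, i+2:i+4]).all(1) | mask
J[~mask]                                               # rows with no adjacent matching pair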

slice 2D numpy array based on condition

I have a numpy array
import numpy as np
a = np.array([
    [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
    [999, 999, 999, 1, 2, 3, 4, 999, 999, 999],
    [999, 999, 999, 5, 6, 7, 8, 999, 999, 999],
    [999, 999, 999, 9, 10, 11, 12, 999, 999, 999],
    [999, 999, 999, 999, 999, 999, 999, 999, 999, 999]])
How do I return the filtered values, i.e. only the values different from 999, using numpy slicing?
filtered = np.where(a != 999)
In [5]: filtered
Out[5]:
(array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]),
 array([3, 4, 5, 6, 3, 4, 5, 6, 3, 4, 5, 6]))
Desired output:
output = np.array([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
You can do the following:
>>> mask = (a!=999)
>>> dim1 = np.any(mask, axis=1).sum()
>>> a[mask].reshape(dim1, -1)
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
This of course assumes that you only have a single contiguous box in the whole array.
Yours is a special case, because the subarray is rectangular. You can get the flat values using fancy indexing:
>>> a[filtered]
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
And if you know the shape already, you can reshape that:
>>> a[filtered].reshape(3,4)
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
However, there is no guarantee in general that the input data will leave you with a rectangular array after filtering. Consider, for example, what the output array should look like if the input array had a[0,0] == 13.
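As an aside (my own sketch, not part of the answers above, reusing a from the question): if the kept values are known to form a single rectangular block but its shape is not known in advance, the block's bounds can be derived from the mask itself:

mask = (a != 999)
rows = np.flatnonzero(mask.any(axis=1))   # row indices containing at least one kept value
cols = np.flatnonzero(mask.any(axis=0))   # column indices containing at least one kept value
block = a[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
# block == array([[ 1,  2,  3,  4],
#                 [ 5,  6,  7,  8],
#                 [ 9, 10, 11, 12]])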
You can also do this: create a 2D mask from the condition, cast the mask to int or float (depending on the array), and multiply it with the original array. Note that this zeroes out the unwanted values rather than removing them.
In [8]: arr
Out[8]:
array([[ 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10.]])
In [9]: arr*(arr % 2 == 0).astype(int)
Out[9]:
array([[ 0., 2., 0., 4., 0.],
[ 6., 0., 8., 0., 10.]])

LingPipe LDA matrix representation

I am trying to extract possible topics from a list of tweets, and LingPipe LDA seems easy to understand and is well documented with code samples.
My challenge is to produce the matrix representation from the tweet data. For example,
static String[] WORDS = new String[] {
    "river", "stream", "bank", "money", "loan"
};
static final int[][] DOC_WORDS = new int[][] {
    { 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 0, 0, 0 },
    { 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 0, 0 },
    { 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 0 },
    { 0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4 }
};
The zeros at the end of the matrix above are supposed to represent that none of the words in the WORDS array was found in the content. In this representation, however, a zero is taken to mean index zero, i.e. that the word 'river' was found.
As tweets are short, I am not sure how I can build the matrix so that it also represents the 'absence' of a word.
Any advice, or a suggestion of another method, is much appreciated.
