Unique values in particular column only 2-d array (using numpy) - arrays

I have a 2-d array in numpy. I wish to obtain unique values only in a particular column.
import numpy as np
data = np.genfromtxt('somecsvfile',dtype='str',delimiter=',')
#data looks like
[a,b,c,d,e,f,g],
[e,f,z,u,e,n,c],
...
[g,f,z,u,a,v,b]
Using numpy/scipy only, how do I obtain an array or list of the unique values in the 5th column? (I know it can easily be done with pandas.)
The expected output would be 2 values: [e,a]

Correct answer posted. A simple referencing question in essence.
np.unique(data[:, 4])
With thanks.
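A minimal sketch of the indexing involved, using a small made-up array in place of the CSV file (the sample rows are assumptions, not the asker's actual data). Note that np.unique returns its result sorted, so the values come back as a, e rather than in order of first appearance.

```python
import numpy as np

# Hypothetical stand-in for the array loaded with np.genfromtxt
data = np.array([
    ["a", "b", "c", "d", "e", "f", "g"],
    ["e", "f", "z", "u", "e", "n", "c"],
    ["g", "f", "z", "u", "a", "v", "b"],
])

# Column indices are zero-based, so the 5th column is index 4
fifth_column = data[:, 4]

# np.unique drops duplicates and sorts the remaining values
unique_values = np.unique(fifth_column)
print(unique_values)  # ['a' 'e']
```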

Related

Deleting consecutive and non-consecutive columns from a cell array in Matlab

I'm trying to delete multiple consecutive and non-consecutive columns from an 80-column, 1-row cell array mycells. My question is: what's the correct indexing of a vector of columns in Matlab?
What I tried to do is: mycells(1,[4:6,8,9]) = [] in an attempt to remove columns 4 to 6, column 8 and 9. But I get the error: A null assignment can have only one non-colon index.
Use a colon for the first index. That way only the 2nd index is "non-colon". E.g.,
mycells(:,[4:6,8,9]) = []
MATLAB could have been smart enough to recognize that when there is only one row the 1 and : amount to the same thing and you will still get a rectangular array result, but it isn't.
Before getting the above VERY VERY HELPFUL AND MUCH SIMPLER answers, I ended up doing something more convoluted. As it worked in my case, I'll post it here for anyone in future:
So, I had a cell array vector, of which I wanted to drop specific cells. I created another cell array of the ones I wanted to remove:
remcols = mycells(1,[4:6,8,9])
Then I used the function below to overwrite mycells with only those cells that differ between remcols and mycells (these were actually the cells I wanted to keep from mycells):
mycells = setdiff(mycells,remcols)
This is not neat at all but hopefully serves the purpose of someone somewhere in the world.

Dividing an array by an array column in R

My data is the following:
print(xr)
[1] 1.1235685 1.0715964 0.2043725 4.0639341
> class(xr)
[1] "array"
I'm trying to divide the values of all the columns in my array by the value given by the 1st column (ie, 1.1235685). The resulting array would be:
1.000 0.953 0.181 3.616
How would I do this in R, given my R-data object type? The columns do not have names, because of the datatype. (If there's a way I can assign a column names before dividing them, then that's even better.)
I'm new to R, so apologies for the simple question.
Thank you.
Some people already answered this in the comments, but I'll try to provide a more comprehensive one. The code to do what you want is pretty simple.
xr <- array(data = c(1.1235685, 1.0715964, 0.2043725, 4.0639341))
xr/xr[1]
However, if you created that array with only one dimension, I would recommend you use a numeric vector instead, which has no "dim" attribute. You'd create it as follows:
xr <- c(1.1235685, 1.0715964, 0.2043725, 4.0639341)
xr/xr[1]

Selecting columns from multi-dimensional numpy array

I created a multi-dimensional numpy array using this code:
pt=[[0 for j in range(intervals+1)] for i in range(users+1)]
A print(np.shape(pt)) gives me
(1001,169)
I then proceeded to populate the array (code not shown) before trying to select everything but the first column to feed into matplotlib.
I referred to posts on how to select columns from a multi-dimensional array:
here, here, and here
all of whom say I should do:
pt[:,1:]
to select everything but the first column. However this gives me the error message:
TypeError: list indices must be integers or slices, not tuple
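The code above builds a nested Python list, not a numpy array, and lists do not accept a tuple index like [:,1:]. For anyone else who reaches this post having made the same mistake (see comments above): if you want to continue using lists, slice each row with [row[1:] for row in pt] (pt[:][0:1] only returns the first row), but really I recommend switching to numpy, e.g. np.array(pt)[:,1:], and ignoring all the results you get when you search for 'declaring python array'.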
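A short sketch contrasting the two containers, assuming small hypothetical sizes in place of the 1001 x 169 from the question. The names pt_list, without_first_list and without_first are illustrative only.

```python
import numpy as np

# Small hypothetical sizes; the question used users=1000, intervals=168
users, intervals = 3, 4

# Nested list, built as in the question: tuple indexing is not supported
pt_list = [[0 for j in range(intervals + 1)] for i in range(users + 1)]
try:
    pt_list[:, 1:]
except TypeError as exc:
    print("list:", exc)

# Staying with lists: slice each row individually
without_first_list = [row[1:] for row in pt_list]

# Switching to numpy: allocate an array up front (or convert with np.array)
pt = np.zeros((users + 1, intervals + 1), dtype=int)
without_first = pt[:, 1:]
print(without_first.shape)  # (4, 4)
```

Converting an already populated list with np.array(pt_list) gives the same slicing behaviour, provided every row has the same length.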

Aggregate values of one column by classes in second column using numpy

I have a numpy array with shape (N, 2) and N > 10000. In the first column I have e.g. 6 class values (e.g. 0.0, 0.2, 0.4, 0.6, 0.8, 1.0); in the second column I have float values. Now I want to calculate the average of the second column for each distinct class in the first column, resulting in 6 averages, one for each class.
Is there a numpy way to do this, to avoid manual loops especially if N is very large?
In pure numpy you would do something like:
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True,
return_counts=True)
avg = np.bincount(idx, weights=arr[:, 1]) / cnt
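As a worked illustration of the np.unique / np.bincount approach, here is a small self-contained run. The array arr below is made-up sample data with only three classes, not the asker's data.

```python
import numpy as np

# Hypothetical data: column 0 holds the class, column 1 the value
arr = np.array([
    [0.0, 1.0],
    [0.2, 2.0],
    [0.0, 3.0],
    [0.4, 10.0],
    [0.2, 4.0],
])

# unq: sorted distinct classes; idx: position of each row's class in unq;
# cnt: number of rows per class
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True, return_counts=True)

# bincount with weights sums the values per class; dividing by cnt gives means
avg = np.bincount(idx, weights=arr[:, 1]) / cnt

for cls, mean in zip(unq, avg):
    print(cls, mean)
```

Here class 0.0 averages to 2.0, class 0.2 to 3.0 and class 0.4 to 10.0, with no Python-level loop over the N rows.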
I copied the answer from Warren to here, since it solves my problem best and I want to check it as solved:
This is a "groupby/aggregation" operation. The question is this close
to being a duplicate of
getting median of particular rows of array based on index.
... You could also use scipy.ndimage.labeled_comprehension as
suggested there, but you would have to convert the first column to
integers (e.g. idx = (5*data[:, 0]).astype(int))
I did exactly this.
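A hedged sketch of the scipy.ndimage.labeled_comprehension route, again on made-up sample data. Rounding with np.rint before the integer conversion is an extra precaution added here against floating-point error; it is not part of the quoted answer.

```python
import numpy as np
from scipy import ndimage

# Hypothetical data: column 0 holds the class, column 1 the value
data = np.array([
    [0.0, 1.0],
    [0.2, 2.0],
    [0.0, 3.0],
    [0.4, 10.0],
    [0.2, 4.0],
])

# Map classes 0.0, 0.2, ..., 1.0 to integer labels 0..5
labels = np.rint(5 * data[:, 0]).astype(int)

# Ask for the mean of every label 0..5; labels with no rows get NaN
index = np.arange(6)
avg = ndimage.labeled_comprehension(
    data[:, 1], labels, index, np.mean, float, np.nan
)
print(avg)
```

Unlike the np.unique version, this returns one slot per requested label, so classes that never occur show up as the default value (NaN here) instead of being omitted.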

R array permutation naming columns and rows

This is a follow-up question to convert 4-dimensional array to 2-dimensional data set in R, which was answered by @Ben-Bolker.
I have a 3D array called 'y' with dimensions [37,29,2635] (i.e. firms, years, class). Using Ben's formula:
avm11<-matrix(aperm(y,c(1,3,2)),prod(dim(y)[c(1,3)]))
I managed to convert it to a 2D array with dimensions [37*2635,29]. However, the row names have become meaningless numbers and I'd need to generate row names during the permutation so that I'd get 97495 unique row names of the type firm_class.
I've been trying to do so via paste0() but I'm doing something wrong. Any suggestions?
You could use seq_len() for this. E.g.
rownames(avm11)<-paste(seq_len(length(avm11[,1])),"observation",sep=" ")
To make the rownames depend on the variable firm_class:
#Create data frame:
df<-data.frame(X=c(0,1,2,3,4,5),Y=c("a","b","c","d","e","f"))
n<-length(df[,1])
#Generate names containing class name and unique number:
namevector<-sapply(seq_len(n),function(i) paste("Class", df[i,1], i,sep="."))
#Equate rownames to generated names:
rownames(df)<-namevector
