Aggregate values of one column by classes in a second column using numpy arrays

I have a numpy array with shape (N, 2) and N > 10000. In the first column I have e.g. 6 class values (e.g. 0.0, 0.2, 0.4, 0.6, 0.8, 1.0); in the second column I have float values. Now I want to calculate the average of the second column for each distinct class in the first column, resulting in 6 averages, one per class.
Is there a numpy way to do this, to avoid manual loops especially if N is very large?

In pure numpy you would do something like:
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True,
                          return_counts=True)
avg = np.bincount(idx, weights=arr[:, 1]) / cnt
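As a minimal sketch of how those two lines behave, here is the same idea on a small made-up array (the values are illustrative only, not from the question):

```python
import numpy as np

# Made-up data: column 0 holds the class, column 1 the value to average.
arr = np.array([
    [0.0, 1.0],
    [0.2, 2.0],
    [0.0, 3.0],
    [0.2, 6.0],
    [0.4, 5.0],
])

# unq: sorted distinct classes
# idx: for every row, the position of its class in unq
# cnt: number of rows per class
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True, return_counts=True)

# bincount with weights sums the values that share the same class index.
sums = np.bincount(idx, weights=arr[:, 1])
avg = sums / cnt

for cls, mean in zip(unq, avg):
    print(cls, mean)
```

The averages come back in the same order as unq, so zip(unq, avg) pairs each class with its mean.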

I copied the answer from Warren to here, since it solves my problem best and I want to check it as solved:
This is a "groupby/aggregation" operation. The question is this close
to being a duplicate of
getting median of particular rows of array based on index.
... You could also use scipy.ndimage.labeled_comprehension as
suggested there, but you would have to convert the first column to
integers (e.g. idx = (5*data[:, 0]).astype(int)).
I did exactly this.
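A rough sketch of that integer-label idea in plain numpy, assuming the classes really are multiples of 0.2. The data, the n_classes value and the use of np.rint (instead of a bare astype(int), which truncates) are my assumptions, and the per-label loop only imitates what a labeled comprehension does:

```python
import numpy as np

# Made-up data: column 0 holds the class, column 1 the value.
data = np.array([
    [0.0, 1.0],
    [0.2, 2.0],
    [0.0, 3.0],
    [0.2, 6.0],
    [1.0, 5.0],
])

# Classes are multiples of 0.2, so 5 * class should be an integer in 0..5.
# np.rint rounds to the nearest integer before the cast, which guards
# against values such as 2.9999999 being truncated to 2.
labels = np.rint(5 * data[:, 0]).astype(int)

n_classes = 6  # assumed number of classes (0.0, 0.2, ..., 1.0)

# Apply an arbitrary function (here: median) per label; classes with no
# rows get NaN.
medians = np.array([
    np.median(data[labels == k, 1]) if np.any(labels == k) else np.nan
    for k in range(n_classes)
])
print(labels)
print(medians)
```

The Python-level loop runs only once per class (6 times here), not once per row, so it stays cheap for large N.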

Related

Per se number of row in a numpy array - Python

Imagine that I have a numpy array of 'n' rows by 'm' elements, i.e. shape = (n,m).
If the array is called xrandom and I have a loop like the one shown below, and I additionally want to know the row number, what alternatives do I have? (I know that if I had done things differently I would already have the row number; however, I want to know whether there is another way to get the row number per se.)
for xreg in xrandom:
    print(xreg, -line number of xreg-)
I haven't tried anything as I don't know how to.
You can use Python's enumerate:
for row_number, row in enumerate(xrandom):
    print(row, row_number)
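For illustration, a small self-contained version, with a made-up 3 x 2 array standing in for xrandom. The start argument of enumerate is shown in case 1-based row numbers are preferred:

```python
import numpy as np

# Stand-in for the real array (values are illustrative only).
xrandom = np.arange(6).reshape(3, 2)

rows_seen = []
for row_number, row in enumerate(xrandom):
    # row_number counts from 0, matching xrandom[row_number]
    rows_seen.append((row_number, row.tolist()))
    print(row_number, row)

# enumerate(..., start=1) yields 1-based numbering instead.
one_based = [n for n, _ in enumerate(xrandom, start=1)]
```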

Sum Each Row in Array (Google Sheets)

I have a complex formula that produces an array (10+ rows and 10+ columns).
For simplicity's sake, let's just say it's =unique(a1:z10)
I'm looking for a formula that can counta() each Row of the array individually. It should basically return a 1-column array that counts the number of values in each row.
Because I will then wrap that in a max() function to see the highest count among them all.
Thanks guys. I hope my question is intelligible.
Let me know if further clarification needed.
The standard way of getting row totals of an m rows by n columns array is
=mmult(<array>,<colvector>)
where <array> is an array of numbers and <colvector> is an array n rows high and one column wide containing all ones.
The standard way of getting <colvector> for a range is
=row(<range>)^0
but this doesn't work for an array because you can only use the row function with a range.
So I think you'd have to generate <colvector> another way - the easiest way is to use Sequence, but unfortunately it means repeating the formula for your <array> to get the column count.
Example
Supposing we choose this as our complex array:
=ArrayFormula(if(mod(sequence(10,10),8),"",sequence(10,10)))
a 10 X 10 array with some spaces in it.
The whole formula to get the row counts would be:
=ArrayFormula(mmult(n(if(mod(sequence(10,10),8),"",sequence(10,10))<>""),
sequence(columns(if(mod(sequence(10,10),8),"",sequence(10,10))))^0))
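For readers who find the matrix-product trick easier to follow outside a spreadsheet, here is a hedged NumPy sketch of the same idea: turn the array into a 0/1 matrix of non-empty cells and multiply it by a column of ones. The grid contents are made up, and this is only an analogy to the Sheets formula, not a replacement for it:

```python
import numpy as np

# Stand-in for the array produced by the formula; "" marks a blank cell.
grid = np.array([
    ["a", "",  "c"],
    ["",  "",  ""],
    ["x", "y", "z"],
])

# 1 where the cell is non-empty, 0 otherwise (like n(<array> <> "")).
non_empty = (grid != "").astype(int)

# Column vector of ones, one entry per column of the grid
# (like sequence(columns(<array>))^0).
ones = np.ones((grid.shape[1], 1), dtype=int)

# Each entry of the product is the count of non-empty cells in that row.
row_counts = non_empty @ ones
max_count = int(row_counts.max())
print(row_counts.ravel(), max_count)
```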
try:
=MAX(ARRAYFORMULA(MMULT(IFERROR(LEN(B:K)/LEN(B:K), 0), TRANSPOSE(COLUMN(B:K)^0))))
if you want to do it all in one step use:
=MAX(ARRAYFORMULA(MMULT(IFERROR(LEN(B:K)/LEN(B:K), 0),
ROW(INDIRECT("A1:A"&TRANSPOSE(COLUMNS(B:K))))^0)))
where you replace B:K ranges with your formula that outputs the array

Dividing an array by an array column in R

My data is the following:
> print(xr)
[1] 1.1235685 1.0715964 0.2043725 4.0639341
> class(xr)
[1] "array"
I'm trying to divide the values of all the columns in my array by the value given by the 1st column (ie, 1.1235685). The resulting array would be:
1.000 0.953 0.181 3.616
How would I do this in R, given my R-data object type? The columns do not have names, because of the datatype. (If there's a way I can assign a column names before dividing them, then that's even better.)
I'm new to R, so apologies for the simple question.
Thank you.
Some people already answered this in the comments, but I'll try to provide a more comprehensive one. The code to do what you want is pretty simple.
xr <- array(data = c(1.1235685, 1.0715964, 0.2043725, 4.0639341))
xr/xr[1]
However, if you created that array with only one dimension, I would recommend you use a numeric vector instead, which has no "dim" attribute. You'd create it as follows:
xr <- c(1.1235685, 1.0715964, 0.2043725, 4.0639341)
xr/xr[1]

Concatenate subcells through one dimension of a cell array without using loops in MATLAB

I have a cell array. Each cell contains a vector of variable length. For example:
example_cell_array=cellfun(@(x)x.*rand([length(x),1]),cellfun(@(x)ones(x,1), num2cell(ceil(10.*rand([7,4]))), 'UniformOutput', false), 'UniformOutput', false)
I need to concatenate the contents of the cells down through one dimension and then perform an operation on each concatenated vector, generating a scalar for each column in my cell array (like sum(), for example; the actual operation is complex, time consuming, and not naturally vectorisable, especially for different length vectors).
I can do this with loops easily (for my concatenated vector sum example) as follows:
[M N]=size(example_cell_array);
result=zeros(1,N);
cat_cell_array=cell(1,N);
for n=1:N
    cat_cell_array{n}=[];
    for m=1:M
        cat_cell_array{n}=[cat_cell_array{n};example_cell_array{m,n}];
    end
end
result=cell2mat(cellfun(@(x)sum(x), cat_cell_array, 'UniformOutput', false))
Unfortunately this is WAY too slow. (My cell array is 1Mx5 with vectors in each cell ranging in length from 100-200)
Is there a simple way to produce the concatenated cell array where the vectors contained in the cells have been concatenated down one dimension?
Something like:
dim=1;
cat_cell_array=?concatcells?(dim,example_cell_array);
Edit:
Since so many people have been testing the solutions: Just FYI, the function I'm applying to each concatenated vector is circ_kappa(x) available from Circular Statistics Toolbox
Some approaches might suggest you to unpack the numeric data from example_cell_array using {..} and then after concatenation pack it back into bigger sized cells to form your cat_cell_array. Then, again you need to unpack numeric data from that concatenated cell array to perform your operation on each cell.
Now, in my view, this repeated unpacking and packing won't be efficient if example_cell_array isn't one of your intended outputs. So, considering all this, let me suggest two approaches here.
Loopy approach
The first one is a for-loop code -
data1 = vertcat(example_cell_array{:}); %// extract all numeric data for once
starts = [1 sum(cellfun('length',example_cell_array),1)]; %// intervals lengths
idx = cumsum(starts); %// get indices to work on intervals basis
result = zeros(1,size(example_cell_array,2));
%// replace this with "result(size(example_cell_array,2))=0;" for performance
for k1 = 1:numel(idx)-1
    result(k1) = sum(data1(idx(k1):idx(k1+1)-1));
end
So, you need to edit sum with your actual operation.
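To make the bookkeeping in the loopy approach easier to see, here is a rough NumPy translation on a tiny made-up "cell array" (a list of rows, each cell a 1-D vector). The names cells, col_lens and bounds are mine, and this sketch only mirrors the idea of flattening once and slicing by cumulative lengths; it is not the MATLAB code itself:

```python
import numpy as np

# Hypothetical 2 x 3 cell array: cells[m][n] is the vector in row m, column n.
cells = [
    [np.array([1.0, 2.0]), np.array([3.0]),      np.array([4.0, 5.0, 6.0])],
    [np.array([7.0]),      np.array([8.0, 9.0]), np.array([10.0])],
]
n_cols = len(cells[0])

# Flatten everything once, column by column (MATLAB's {:} is column-major).
data1 = np.concatenate([row[c] for c in range(n_cols) for row in cells])

# Total number of elements per column, then the slice boundaries.
col_lens = [sum(len(row[c]) for row in cells) for c in range(n_cols)]
bounds = np.concatenate(([0], np.cumsum(col_lens)))

# One slice per column; replace .sum() with the real per-column operation.
result = np.array([data1[bounds[k]:bounds[k + 1]].sum() for k in range(n_cols)])
print(result)
```

Each slice data1[bounds[k]:bounds[k + 1]] is a view of the concatenated vector for column k, so no per-cell packing or unpacking happens inside the loop.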
Almost-vectorized approach
If example_cell_array has a lot of columns, my second suggestion would be an almost vectorized approach, though it doesn't perform badly either with a small number of columns. Now this code uses cellfun at the first line to get the lengths for each cell in concatenated version. cellfun is basically a wrapper to a loop code, but this is not very expensive in terms of runtime and that's why I categorized this approach as an almost vectorized one.
The code would be -
lens = sum(cellfun('length',example_cell_array),1); %// intervals lengths
maxlens = max(lens);
numlens = numel(lens);
array1(maxlens,numlens)=0;
array1(bsxfun(@ge,lens,[1:maxlens]')) = vertcat(example_cell_array{:}); %//'
result = sum(array1,1);
The thing you need to do now, is to make your operation run on column basis with array1 using the mask created by the bsxfun implementation. Thus, if array1 is a M x 5 sized array, you need to select the valid elements from each column using the mask and then do the operation on those elements. Let me know if you need more info on the masking issue.
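The masking step can be sketched in NumPy as follows. The lengths and data are made up, and the layout is transposed relative to the MATLAB code (one row per cell-array column) because NumPy fills boolean-masked assignments in row-major order, so treat this as an illustration of the idea rather than a port:

```python
import numpy as np

# Assumed inputs: per-column total lengths and the data flattened
# column by column.
col_lens = np.array([3, 3, 4])
data1 = np.array([1.0, 2.0, 7.0, 3.0, 8.0, 9.0, 4.0, 5.0, 6.0, 10.0])

max_len = col_lens.max()

# mask[c, i] is True when position i holds a valid element of column c.
mask = np.arange(max_len) < col_lens[:, None]

# Zero-padded array; the masked assignment fills the valid slots in order.
padded = np.zeros(mask.shape)
padded[mask] = data1

# For sum, the zero padding is harmless.
result = padded.sum(axis=1)

# For operations where padding matters, select the valid elements per
# column with the mask before applying the operation.
per_column = [padded[c, mask[c]] for c in range(len(col_lens))]
print(result)
```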
Hope one of these approaches would work for you!
Quick Tests: Using a 250000x5 sized example_cell_array, quick tests show that both these approaches for the sum operation perform very well and give about 400x speedup over the code in the question at my end.
For the concatenation itself, it sounds like you might want the functional form of cat:
for n=1:N
    cat_cell_array{n} = cat(1, example_cell_array{:,n});
end
This will concatenate all the arrays in the cells in each column in the original input array.
You can define a function like this:
cellcat = @(C) arrayfun(@(k) cat(1, C{:, k}), 1:size(C,2), 'uni', 0);
And then just use
>> cellcat(example_cell_array)
ans =
[42x1 double] [53x1 double] [51x1 double] [47x1 double]
I think you are looking to generate cat_cell_array without using for loops. If so, you can do it as follows:
cat_cell_array=cellfun(@(x) cell2mat(x),num2cell(example_cell_array,1),'UniformOutput',false);
The above line can replace your entire for loop according to me. Then you can calculate your complex function over this cat_cell_array.
If only result is important to you and you do not want to store cat_cell_array, then you can do everything in a single line (not recommended for readability):
result=cell2mat(cellfun(@(x)sum(x), cellfun(@(x) cell2mat(x),num2cell(example_cell_array,1),'Uni',false), 'Uni', false));

How to get mean, median, and other statistics over entire matrix, array or dataframe?

I know this is a basic question but for some strange reason I am unable to find an answer.
How should I apply basic statistical functions like mean, median, etc. over an entire array, matrix or dataframe to get a single value, and not a vector over rows or columns?
Since this comes up a fair bit, I'm going to treat this a little more comprehensively, to include the 'etc.' piece in addition to mean and median.
For a matrix, or array, as the others have stated, mean and median will return a single value. However, var will compute the covariances between the columns of a two dimensional matrix. Interestingly, for a multi-dimensional array, var goes back to returning a single value. sd on a 2-d matrix will work, but is deprecated, returning the standard deviation of the columns. Even better, mad returns a single value on a 2-d matrix and a multi-dimensional array. If you want a single value returned, the safest route is to coerce using as.vector() first. Having fun yet?
For a data.frame, mean is deprecated, but will again act on the columns separately. median requires that you coerce to a vector first, or unlist. As before, var will return the covariances, and sd is again deprecated but will return the standard deviation of the columns. mad requires that you coerce to a vector or unlist. In general for a data.frame if you want something to act on all values, you generally will just unlist it first.
Edit: Late breaking news(): In R 3.0.0 mean.data.frame is defunctified:
o mean() for data frames and sd() for data frames and matrices are
defunct.
By default, mean and median etc work over an entire array or matrix.
E.g.:
# array:
m <- array(runif(100),dim=c(10,10))
mean(m) # returns *one* value.
# matrix:
mean(as.matrix(m)) # same as before
For data frames, you can coerce them to a matrix first (the reason this is by default over columns is because a dataframe can have columns with strings in it, which you can't take the mean of):
# data frame
mdf <- as.data.frame(m)
# mean(mdf) returns column means
mean( as.matrix(mdf) ) # one value.
Just be careful that your dataframe has all numeric columns before coercing to matrix. Or exclude the non-numeric ones.
You can use library dplyr via install.packages('dplyr') and then
dataframe.mean <- dataframe %>%
summarise_all(mean) # replace for median
