Complex NumPy Array Manipulation - arrays

I have two numpy arrays:
e.g.
import numpy as np

array_1 = np.array([
    [5, 2, 0],
    [4, 3, 0],
    [4, 2, 0],
    [3, 2, 1],
    [4, 1, 1],
])
array_2 = np.array([
    [5, 2, 10],
    [4, 2, 52],
    [3, 2, 80],
    [1, 2, 4],
    [5, 3, 6],
])
In array_1, the 0 and 1 at index 2 represent two different categories. For argument's sake, say 0 = Red and 1 = Blue.
So, where the first two elements of a row match in the two arrays, I need to average the third element of array_2 by category. For example, [5,2,10] and [4,2,52] both match rows with category 0, i.e. Red. The code should return the average of the elements at index 2 for the Red category, and do the same for the Blue category.
I have no idea where to start with this; any ideas are welcome.

You tagged your post with NumPy because of the type of the source arrays,
but it is much easier and more intuitive to generate your result using pandas.
Start by converting both arrays to pandas DataFrames.
While converting the first array, also convert 0 and 1 in the last
column to Red and Blue:
import pandas as pd
df1 = pd.DataFrame(array_1, columns=['A', 'B', 'key'])
df1['key'] = df1['key'].replace({0: 'Red', 1: 'Blue'})
df2 = pd.DataFrame(array_2, columns=['A', 'B', 'C'])
Then, to generate the result, run:
result = df2.merge(df1, on=['A', 'B']).groupby('key').C.mean().rename('Mean')
The result is:
key
Blue    80.0
Red     31.0
Name: Mean, dtype: float64
Details:
df2.merge(df1, on=['A', 'B']) - Generates:
A B C key
0 5 2 10 Red
1 4 2 52 Red
2 3 2 80 Blue
The default inner join also drops the rows of df2 whose (A, B) pair has no
match in df1, i.e. rows that are neither Red nor Blue.
groupby('key') - From the above result, generates groups by key
(Red / Blue).
C.mean() - the last step is to take C column (from each group)
and compute its mean.
The result is a Series with:
index - the grouping key,
value - the value computed for the corresponding group.
rename('Mean') - Change the name from the source column name (C)
to a more meaningful Mean.
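To make explicit what the merge and groupby steps do, here is a minimal plain-Python sketch of the same logic without pandas. The names category_of, values_by_category and means are illustrative assumptions, not part of the answer above, and the sketch assumes each (A, B) pair occurs at most once in the first array.

```python
# Rows from the question: (A, B, category) and (A, B, value).
array_1 = [(5, 2, 0), (4, 3, 0), (4, 2, 0), (3, 2, 1), (4, 1, 1)]
array_2 = [(5, 2, 10), (4, 2, 52), (3, 2, 80), (1, 2, 4), (5, 3, 6)]
labels = {0: "Red", 1: "Blue"}

# "merge": look up the category of each (A, B) pair taken from array_1.
# Assumption: every (A, B) pair is unique in array_1.
category_of = {(a, b): labels[cat] for a, b, cat in array_1}

# Keep only array_2 rows whose (A, B) pair exists in array_1 (inner join)
# and collect their third element per category ("groupby").
values_by_category = {}
for a, b, value in array_2:
    key = category_of.get((a, b))
    if key is not None:
        values_by_category.setdefault(key, []).append(value)

# "mean": average the collected values in each category.
means = {key: sum(vals) / len(vals) for key, vals in values_by_category.items()}
print(means)  # {'Red': 31.0, 'Blue': 80.0}
```

The rows [1,2,4] and [5,3,6] have no partner in the first array, so they never reach the averaging step.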

Related

Countif the Result of Subtracting Two Arrays Exceeds a Certain Value in Excel

I am new to array formulae and am having trouble with the following scenario:
I have the following matrix:
F G H I J ... R S T U V
1 0 0 1 1
0 1 1 1 2 3 1 2
2 0 2 3 1 2 0 1 0 0
2 1 0 0 1 0 0 3 0 0
My goal is to count the number of rows within which the difference between the sum of columns F:J and the sum of columns R:V is greater than a threshold. Critically, only rows with full data should be included: row 1 (where there are only values for columns F1:J1) and row 2 (where there are only some values for columns F2:J2) should be ignored.
If the threshold = 2.5, then the solution is 1. That is, row 3 is the only row with complete data where the difference between the sum of F3:J3 (8) and the sum of R3:V3 (3) is greater than 2.5 (e.g., 5 > 2.5).
I have tried to put together the following formula, rather pathetically, based on the teachings of @Tom Sharpe and @QHarr:
=COUNT(IF(SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(9,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>2.5,IF(AND(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1),SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))),IF(AND(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1),SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),SUBTOTAL(9,OFFSET(R1,ROW(R1:V1)-ROW(R1),0,1,COLUMNS(R1:V1))))))
But it always seems to produce a value of 1, even if I edit the matrix so that the difference between the sums of F4:J4 and R4:V4 also exceeds 2.5. Sadly, I am struggling to understand why and would appreciate any guidance on the matter.
As an array formula in one cell without volatile functions:
=SUM((MMULT(--(LEN(F2:J5)*LEN(R2:V5)>0),--TRANSPOSE(COLUMN(F2:J2)>0))=5)*(MMULT(F2:J5-R2:V5,TRANSPOSE(--(COLUMN(F2:J2)>0)))>2.5))
should do the trick :D
Maybe, in say X1 (assuming you have labelled your columns):
=COUNTIF(Y:Y,TRUE)
In Y1 whatever your chosen cutoff (eg 2.5) and in Y2:
=(COUNTBLANK(F2:J2)+COUNTBLANK(R2:V2)=0)*(SUM(F2:J2)-SUM(R2:V2))>Y$1
copied down to suit.
Try this:
=SUMPRODUCT((MMULT(F1:J4-R1:V4,--(ROW(INDIRECT("1:"&COLUMNS(F1:J4)))>0))>2.5)*(MMULT((LEN(F1:J4)>0)+(LEN(R1:V4)>0),--(ROW(INDIRECT("1:"&COLUMNS(F1:J4)))>0))=(COLUMNS(F1:J4)+COLUMNS(R1:V4))))
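I think this will do it, replacing your ANDs with multiplication (*), because AND collapses a whole array into a single TRUE/FALSE instead of evaluating row by row: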
=SUMPRODUCT(--((SUBTOTAL(9,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(9,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>2.5)*(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1))>0))
It could be simplified a bit more, but I'm a bit short of time.
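As a rough analogy in Python (not Excel itself), the difference between AND over arrays and element-wise multiplication can be sketched like this. The two lists are hypothetical per-row completeness checks, invented for illustration.

```python
# Hypothetical per-row checks for four rows (True = the range is complete).
left_complete = [False, False, True, True]
right_complete = [False, True, True, True]

# Analogue of AND(array1, array2): a single answer for the whole array.
collapsed = all(left_complete) and all(right_complete)

# Analogue of (array1)*(array2): one 0/1 answer per row.
per_row = [int(l) * int(r) for l, r in zip(left_complete, right_complete)]

print(collapsed)  # False
print(per_row)    # [0, 0, 1, 1]
```

With the single collapsed value every row gets the same verdict, which is why the original formula could not tell complete rows from incomplete ones.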
Just another option...
=IF(NOT(OR(IFERROR(MATCH(TRUE,ISBLANK(F1:J1),0),FALSE),IFERROR(MATCH(TRUE,ISBLANK(R1:V1),0),FALSE))), SUBTOTAL(9,F1:J1)-SUBTOTAL(9,R1:V1), "Missing Value(s)")
My approach was a little different from what you tried to adapt from @Tom Sharpe, in that I first validate that the cells contain data (are not blank) and then perform the calculation, otherwise returning an error message. This is still an array formula, so press Ctrl+Shift+Enter when you enter it.
The condition in the opening IF() checks that no cell in either range is blank. MATCH(TRUE,ISBLANK(range),0) returns a position when some cell is blank (bad); when no cell is blank, MATCH returns an #N/A error, which IFERROR turns into FALSE (good). So FALSE for both ranges means no blank cells were found.
Then the threshold condition becomes:
=COUNTIF(X1:X4,">"&Threshold)
Note: this one is not an array formula.
I gave the threshold (cell W6) a named range for readability.
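For readers who want to check the expected answer outside Excel, the counting rule from the question can be sketched in Python. Rows 3 and 4 follow the question's data; the exact position of the blanks in rows 1 and 2 is an assumption, as are the function names.

```python
THRESHOLD = 2.5

# Each row is (F:J values, R:V values); None marks a blank cell.
# Assumption: the blank positions in rows 1 and 2 are guessed for illustration.
rows = [
    ([1, 0, 0, 1, 1], [None, None, None, None, None]),
    ([0, 1, None, 1, None], [1, 2, 3, 1, 2]),
    ([2, 0, 2, 3, 1], [2, 0, 1, 0, 0]),
    ([2, 1, 0, 0, 1], [0, 0, 3, 0, 0]),
]


def is_complete(cells):
    """True when no cell in the range is blank."""
    return all(cell is not None for cell in cells)


def count_rows_over_threshold(rows, threshold):
    """Count complete rows whose left sum minus right sum exceeds threshold."""
    count = 0
    for left, right in rows:
        if is_complete(left) and is_complete(right):
            if sum(left) - sum(right) > threshold:
                count += 1
    return count


print(count_rows_over_threshold(rows, THRESHOLD))  # 1
```

Only row 3 qualifies (8 - 3 = 5), while row 4 is complete but has a difference of just 1.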

Reshape a 3D array and remove missing values

I have an NxMxT array where each element of the array is a grid of Earth. If the grid is over the ocean, then the value is 999. If the grid is over land, it contains an observed value. N is longitude, M is latitude, and T is months.
In particular, I have an array called tmp60 for the ten years 1960 through 1969, so 120 months for each grid.
To test what the global mean in January 1960 was, I write:
tmpJan60=tmp60(:,:,1);
tmpJan60(tmpJan60(:,:)>200)=NaN;
nanmean(nanmean(tmpJan60))
which gives me 5.855.
I am confused about the reshape function. I thought the following code should yield the same average, namely 5.855, but it does not:
load tmp60
N1=size(tmp60,1)
N2=size(tmp60,2)
N3=size(tmp60,3)
reshtmp60 = reshape(tmp60, N1*N2,N3);
reshtmp60( reshtmp60(:,1)>200,: )=[];
mean(reshtmp60(:,1))
this gives me -1.6265, which is not correct.
I have checked the result in Excel (!) and 5.855 is correct, so I assume I make a mistake in the reshape function.
Ideally, I want a matrix that takes each grid, going first down the N-dimension, and make the 720 rows with 120 columns (each column is a month). These first 720 rows will represent one longitude band around Earth for the same latitude. Next, I want to increase the latitude by 1, thus another 720 rows with 120 columns. Ultimately I want to do this for all 360 latitudes.
If longitude and latitude were inputs, say column 1 and 2, then the matrix should look like this:
temp = [-179.75 -89.75 -1 2 ...
-179.25 -89.75 2 4 ...
...
179.75 -89.75 5 9 ...
-179.75 -89.25 2 5 ...
-179.25 -89.25 3 4 ...
...
-179.75 89.75 2 3 ...
...
179.75 89.75 6 9 ...]
So temp(:,3) should be all January 1960 observations.
One way to do this is:
grid1 = tmp60(1,1,:);
g1 = reshape(grid1, [1,120]);
grid2 = tmp60(2,1,:);
g2 = reshape(grid2,[1,120]);
g = [g1;g2];
But obviously very cumbersome.
I am not able to automate this procedure for the N*M elements, so comments are appreciated!
A link to the file tmp60.mat
The main problem in your code is how the NaNs are treated. Observe the following example:
a = randi(10,6);
a(a>7)=nan
m = [mean(a(:),'omitnan') mean(mean(a,'omitnan'),'omitnan')]
m =
3.8421 3.6806
Both elements in m are supposed to be the mean of all elements in a, but they are different! The reason is that taking the mean of all values together, with mean(a(:),'omitnan'), amounts to summing all non-NaN values and dividing by the number of values summed:
sum(a(:),'omitnan')/sum(~isnan(a(:)))==mean(a(:),'omitnan') % this is true
but taking the mean of the first dimension, we get 6 mean values:
sum(a,'omitnan')./sum(~isnan(a))==mean(a,'omitnan') % this is also true
and when we take the mean of these 6 values, each column gets the same weight regardless of how many NaNs were omitted from it, so the result differs from the overall mean:
mean(sum(a,'omitnan')./sum(~isnan(a)))==mean(a(:),'omitnan') % this is false
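The same effect can be reproduced with a tiny example in plain Python. The two columns below are made-up data, and nanmean here is a small helper written for this sketch, not a library function.

```python
import math

nan = math.nan

# Hypothetical data stored column by column.
columns = [
    [1.0, 2.0, 3.0],   # three valid values, column mean 2.0
    [10.0, nan, nan],  # one valid value, column mean 10.0
]


def nanmean(values):
    """Mean of the values that are not NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


# Mean of all valid values together: (1 + 2 + 3 + 10) / 4.
overall = nanmean([v for col in columns for v in col])

# Mean of the column means: (2 + 10) / 2, each column weighted equally.
mean_of_means = nanmean([nanmean(col) for col in columns])

print(overall, mean_of_means)  # 4.0 6.0
```

The second column contributes a single observation to the overall mean but half of the weight in the mean of means.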
Here is what I think you want in your code:
% this is exactly as your first test:
tmpJan60=tmp60(:,:,1);
tmpJan60(tmpJan60>200) = nan;
m1 = mean(mean(tmpJan60,'omitnan'),'omitnan')
% this creates the matrix as you want it:
result = reshape(permute(tmp60,[3 1 2]),120,[]).';
result(result>200) = nan;
r = reshape(result(:,1),720,360);
m2 = mean(mean(r,'omitnan'),'omitnan')
isequal(m1,m2)
To create the matrix, you first permute the dimensions so that the one you want to keep as is (time) comes first. Then reshape the array to T x (lon*lat), so you get 120 rows for the time steps and 259200 columns for all combinations of the coordinates. All that's left is to transpose it.
m1 is your first calculation, and m2 is what you tried to do in the second one. They are equal here, but their value is not 5.855, even if I use your code.
However, I think the right solution is to take the mean of all values together:
mean(result(:,1),'omitnan')
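The target layout (one row per grid cell, longitude varying fastest, one column per month) can also be sketched in plain Python on a tiny made-up cube. The cube values and its 3 x 2 x 2 size are assumptions for illustration; only the 999 fill value and the > 200 test come from the question.

```python
import math

OCEAN = 999  # fill value used for ocean cells in the question

# Hypothetical cube indexed as cube[lon][lat][month], with N=3, M=2, T=2.
cube = [
    [[1.0, 2.0], [OCEAN, OCEAN]],
    [[3.0, 4.0], [5.0, 6.0]],
    [[OCEAN, OCEAN], [7.0, 8.0]],
]
n_lon, n_lat = len(cube), len(cube[0])

# One row per grid cell, longitude varying fastest, one column per month.
rows = [cube[lon][lat] for lat in range(n_lat) for lon in range(n_lon)]

# Mark ocean cells as NaN instead of deleting rows, so the layout is kept.
rows = [[math.nan if v > 200 else v for v in row] for row in rows]

# Mean of the first month over all land cells taken together.
first_month = [row[0] for row in rows if not math.isnan(row[0])]
mean_first_month = sum(first_month) / len(first_month)
print(mean_first_month)  # 4.0
```

Keeping the ocean rows as NaN rather than removing them means row k still corresponds to a known longitude and latitude.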

cell arrays manipulations in MATLAB --- creating a relation matrix

I have two cell arrays, named as countryname and export.
There is only one column in countryname, which is the code of the names of countries:
USA
CHN
ABW
There are two columns in export:
USA ABW
USA CHN
CHN USA
ABW USA
Each pair (X,Y) in a row of export means "country X has relation with country Y". The size of countryname has been simplified to 3. How can I achieve the following in MATLAB?
Create a square 3 by 3 (in general n by n, where n is the size of countryname) matrix M such that
M(i,j)=1 if country i has relation with country j
M(i,j)=0 otherwise.
The country names are relabeled as positive integers in countryname.
The first thing you need to do is establish a mapping from the country name to an integer value from 1 to 3. You can do that with a containers.Map where the input is a string and the output is an integer. Therefore, we will assign 'USA' to 1, 'CHN' to 2 and 'ABW' to 3. Assuming that you've initialized the cell arrays like you've mentioned above:
countryname = {'USA', 'CHN', 'ABW'};
export = {'USA', 'ABW'; 'USA', 'CHN'; 'CHN', 'USA'; 'ABW', 'USA'};
... you would create a containers.Map like so:
map = containers.Map(countryname, 1:numel(countryname));
Once you have this, you simply map the country names to integers and you can use the values function to help you do this. However, what will be returned is a cell array of individual elements. We need to unpack the cell array, so you can use cell2mat for that. As such, we can now create a 4 x 2 index matrix where each cell element is converted to a numerical value:
ind = cell2mat(values(map, export));
We thus get:
>> ind
ind =
1 3
1 2
2 1
3 1
Now that we have this, you can use sparse to create the final matrix for you where the first column serves as the row locations and the second column serves as the column locations. These locations will tell you where it will be non-zero in your final matrix. However, this will be a sparse matrix and so you'll need to convert the matrix to full to finally get a numerical matrix.
M = full(sparse(ind(:,1), ind(:,2), 1));
We get:
>> M
M =
0 1 1
1 0 0
1 0 0
As a more convenient representation, you can create a table to display the final matrix. Convert the matrix M to a table using array2table and we can add the row and column names to be the country names themselves:
>> T = array2table(M, 'RowNames', countryname, 'VariableNames', countryname)
T =
USA CHN ABW
___ ___ ___
USA 0 1 1
CHN 1 0 0
ABW 1 0 0
Take note that the above code to create the table only works for MATLAB R2013b and above. If that isn't what you require, just stick with the original numerical matrix M.
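For comparison, the same name-to-index mapping idea can be sketched in Python with a plain dictionary. This is only an illustration of the technique, not a translation of the MATLAB API; index_of is a name chosen here, and the indices are 0-based rather than 1-based.

```python
countryname = ["USA", "CHN", "ABW"]
export = [("USA", "ABW"), ("USA", "CHN"), ("CHN", "USA"), ("ABW", "USA")]

# Map each country code to its position in countryname (0-based).
index_of = {name: i for i, name in enumerate(countryname)}

# Start with an n-by-n matrix of zeros and set M[i][j] = 1 for each relation.
n = len(countryname)
M = [[0] * n for _ in range(n)]
for source, target in export:
    M[index_of[source]][index_of[target]] = 1

for row in M:
    print(row)
# [0, 1, 1]
# [1, 0, 0]
# [1, 0, 0]
```

Because the matrix is allocated as n by n up front, countries that never appear in export still get a row and column of zeros.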
This uses basic MATLAB functionality only. The solution posted above by @rayryeng is surely much more advanced and may be faster to code as well. However, this should help you understand the problem at a fundamental level:
clear
country = {'USA','CHN','ABW'};
export = {'USA' 'ABW'; 'USA' 'CHN'; 'CHN' 'USA'; 'ABW' 'USA'};
M = zeros(length(country));
for i = 1:length(country)
    c = country(i);
    % strfind returns 1 for each row of column 1 that matches country i, and [] otherwise
    ind_state = strfind(export(:,1), char(c));
    % keep only the row indices where a match was found
    ind_match = find(not(cellfun('isempty', ind_state)));
    % partner countries from the second column; there may be several,
    % e.g. 'USA' appears twice in the first column of export
    exp_match = export(ind_match, 2);
    for j = 1:length(exp_match)
        c = exp_match(j);
        ind_state = strfind(country, char(c));
        ind_match = find(not(cellfun('isempty', ind_state)));
        M(i, ind_match) = 1; % mark the relation between country i and its partner
    end
end
M

Find closest value in array column 4 where array column 1 and 2 match data of another array. Create a new array extracting the results

I have an extensive dataset in an array of the form
a=[X, Y, Z, value]. I also have another array b=[X,Y], with all the unique combinations of coordinates (X,Y) for the same dataset.
I would like to generate a new array that, for a given z=100, contains the records of the original array a[X,Y,Z,value] where Z is closest to z=100 for each possible X,Y combination.
The purpose of this is to extract a Z slice of the original dataset at a given depth.
A description of the desired outcome would go like this:
np.in1d(a[:,0], b[:,0]) and np.in1d(a[:,1], b[:,1]) # for each row
#where both these two arguments are True
a[:,2] == z+min(abs(a[:,2]-z)) # find the rows where Z is closest to z=100
#and append these rows to a new array c[X,Y,Z,value]
The idea is to first find the unique X,Y data and effectively slice the dataset into X,Y columns of the domain, then search each of these columns to extract the row where Z is closest to the given z value.
Any suggestion, even for a much different approach, would be highly appreciated.
import numpy as np

a = np.random.rand(10000, 4) * [20, 20, 200, 1]  # data in a 20*20*200 space
a[:, :2] //= 1                   # integer coordinates for X, Y
bj = a[:, 0] + 1j * a[:, 1]      # trick: encode (X, Y) as one complex number to sort on 2 columns
b = np.unique(bj)
ib = bj.argsort()                # indices that sort the rows by (X, Y)
splits = bj[ib].searchsorted(b)  # start index of each (X, Y) group
xy = np.split(a[ib], splits)     # list of row subsets grouped by (X, Y); xy[0] is empty
c = np.array([s[abs(s[:, 2] - 100).argmin()] for s in xy[1:]])  # row with Z closest to 100 in each group
print(c[:10])
gives:
[[ 0. 0. 110.44068611 0.71688432]
[ 0. 1. 103.64897184 0.31287547]
[ 0. 2. 100.85948189 0.74353677]
[ 0. 3. 105.28286975 0.98118126]
[ 0. 4. 99.1188121 0.85775638]
[ 0. 5. 107.53733825 0.61015178]
[ 0. 6. 100.82311896 0.25322798]
[ 0. 7. 104.16430907 0.26522796]
[ 0. 8. 100.47370563 0.2433701 ]
[ 0. 9. 102.40445547 0.89028359]]
At a higher level, with pandas:
import pandas as pd
labels = list('xyzt')
df = pd.DataFrame(a, columns=labels)
df['dist'] = abs(df.z - 100)
indices = df.groupby(['x', 'y'])['dist'].idxmin()  # row label of the closest Z in each (x, y) group
c = df.loc[indices, labels].reset_index(drop=True)
print(c.head())
which gives:
x y z t
0 0 0 110.440686 0.716884
1 0 1 103.648972 0.312875
2 0 2 100.859482 0.743537
3 0 3 105.282870 0.981181
4 0 4 99.118812 0.857756
It is clearer, but 8x slower.
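If neither NumPy tricks nor pandas are wanted, the same "closest Z per (X, Y)" selection can be sketched with a dictionary in plain Python. The function name closest_z_slice and the five sample rows are assumptions made for this illustration.

```python
def closest_z_slice(rows, z_target):
    """For each (x, y) pair keep the row whose z is closest to z_target."""
    best = {}
    for row in rows:
        x, y, z, _value = row
        key = (x, y)
        # Replace the stored row only when this one is strictly closer.
        if key not in best or abs(z - z_target) < abs(best[key][2] - z_target):
            best[key] = row
    # Return the selected rows ordered by (x, y).
    return [best[key] for key in sorted(best)]


# Hypothetical sample rows in the form (X, Y, Z, value).
rows = [
    (0, 0, 90.0, 0.1),
    (0, 0, 104.0, 0.2),
    (0, 1, 150.0, 0.3),
    (0, 1, 99.0, 0.4),
    (1, 0, 20.0, 0.5),
]

c = closest_z_slice(rows, 100)
for row in c:
    print(row)
# (0, 0, 104.0, 0.2)
# (0, 1, 99.0, 0.4)
# (1, 0, 20.0, 0.5)
```

This makes a single pass over the data, so it does not need the separate array of unique (X, Y) pairs.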

Array constants aren't working as expected in excel

I was trying to use an array constant to do some calculations. I saw this thread: Array Constants in Excel, but I am using the array constant within the formula, so it is not relevant. If I use =SUM({1,2,3}) the result is 6 as expected. However, if I use it with DCOUNT, it doesn't work as expected:
A
1 Colour
2 Red
3 Yellow
4 Green
5 Red
6
7 Colour
8 =Red
The result of =DCOUNT(A1:A5;;A7:A8) is 2.
The result of =DCOUNT(A1:A5;;{"Colour";"=Red"}) is #VALUE!. The error message is "Value used in formula is wrong data type".
Is this some inconsistency in MS Excel 2010? Or have I done something wrong?
EDIT
It was suggested that "=Red" was the issue, but the reference to this page, under the heading Elements you can use in constants, doesn't really explain it in my opinion. If it were the issue, however, then the following should work:
A
1 Number
2 1
3 2
4 3
5 1
6
7 Number
8 1
The formula =DCOUNT(A1:A5;;A7:A8) gives 2, but the formula =DCOUNT(A1:A5;;{"Number";1}) or =DCOUNT(A1:A5;;{"Number";"1"}) both still give the same error as my previous example.
A range can be used as an array, but an array cannot be used as a range.
Since DCOUNT accepts only a range for that parameter, an array constant is an illegal type for it.
According to these pages:
Introducing array formulas in Excel
Putting advanced array formulas to work
array constants are meant to be used with arguments that do not require ranges but accept:
either an array, which results in a single value,
or single values, which result in an array (Ctrl+Shift+Enter must be used to generate the array result).
To do what I was trying to do (count all of the cells that contained the string Red in range A2:A5), I would do something like this:
A
1 Colour
2 Red
3 Yellow
4 Green
5 Red
=SUM(IF(A2:A5="Red", 1, 0)), entered as an array formula with Ctrl+Shift+Enter, counts the Red entries by creating the intermediate array {1;0;0;1} and then adding its elements together, giving 2.
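The intermediate array that SUM(IF(...)) builds can be mimicked in a few lines of Python, purely as an illustration of the idea. The list below stands in for cells A2:A5.

```python
colours = ["Red", "Yellow", "Green", "Red"]  # stand-in for cells A2:A5

# Like IF(A2:A5="Red", 1, 0): one 1/0 flag per cell.
flags = [1 if colour == "Red" else 0 for colour in colours]

print(flags)       # [1, 0, 0, 1]
print(sum(flags))  # 2
```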

Resources