How to concatenate two strings in H2O - concatenation

I have a H2O frame in R with two character columns and I would like to create a new column by concatenating them. I tried the following but it failed as Paste function is not supported by H2O. Any other ideas? I searched for a solution but haven't found one so far. Thank you.
df$Col3 = paste(df$Col1, df$Col2)

One option would be to use the h2o.interaction function. It's not as simple as a paste function and I don't think you can choose the concatenation separator (it uses _), but it may work for your purposes. Here is a brief example.
library(h2o)
h2o.init()
h2oframe <- as.h2o(Titanic)
h2oframe$Col3 <- h2o.interaction(h2oframe,
factors = list(c("Sex", "Age")),
pairwise = T,
max_factors = 100000,
min_occurrence = 1)
head(h2oframe)
Class Sex Age Survived Freq Col3
1 1st Male Child No 0 Male_Child
2 2nd Male Child No 0 Male_Child
3 3rd Male Child No 35 Male_Child
4 Crew Male Child No 0 Male_Child
5 1st Female Child No 0 Female_Child
6 2nd Female Child No 0 Female_Child

Related

How can I identify three highest values in a column by ID and take their squares and then add them in SAS?

I am working on injury severity scores (ISS) and my dataset has these four columns: ID, High_AIS, Dxcode (diagnosis code), ISS_bodyregion. Each ID/case has several values for "dxcode" and respective High_AIS and ISS_bodyregion - which means each ID/case has multiple injuries in different body regions. The rule to calculate ISS specifies that we have to select AIS values of three different ISS body regions
For some IDs, we have only one value (of course when a person only has single injury and one associated dxcode and AIS). My goal is to calculate ISS (ranges from 0-75) and in order to do this, I want to tell SAS the following things:
Select three largest AIS values by ID (of course when ID has more than 3 values for AIS), take their squares and add them to get ISS.
If ID has only one injury and that has the AIS = 6, the ISS will automatically be equal to 75 (regardless of the injuries elsewhere).
If ID has less than 3 AIS values (for example, 5th ID has only two AIS values: 0 and 1), then consider only two, square them and add them, as we do not have third severely ISS body region for this ID.
If ID has only 3 AIS (for example, 1,0,0) then consider only three, square them and add them even if it is ISS=1.
If ID has all the injuries and AIS values equal to 0 (for example: 0,0) then ISS will equal to 0.
If ID has multiple injuries, and AIS values are: 2,2,1,1,1 and ISS_bodyregion = 5,5,6,6,6. Then we see that ISS_bodyregion repeats itself, the instructions suggest that we only select highest AIS value of ISS body region only once, because it has to be from DIFFERENT ISS body regions. So, in such situation, I want to tell SAS that if ISS_bodyregion repeats itself, only select the one with highest AIS value and leave the rest.
I am so confused as I am telling SAS to keep account of all these aforementioned considerations and I cannot seem to put them all in a single code. Thank you so much in advance. I have already sorted my data by ID descending high_AIS.
So if you are trying to implement this algorithm https://aci.health.nsw.gov.au/networks/institute-of-trauma-and-injury-management/data/injury-scoring/injury_severity_score then you need data like this:
data have;
input id region :$20. ais ;
cards;
1 HEAD/NECK 4
1 HEAD/NECK 3
1 FACE 1
1 CHEST 2
1 ABDOMEN 2
1 EXTREMITIES 3
1 EXTERNAL 1
2 ABDOMEN 3
3 FACE 1
3 CHEST 2
4 HEAD/NECK 6
;
So first find the max per id per region. For example by using PROC SUMMARY.
proc summary data=have nway;
class id region;
var ais;
output out=bodysys max=ais;
run;
Now order by ID and AIS
proc sort data=bodysys ;
by id ais ;
run;
Now you can process by ID and accumulate the AIS scores into an array. You can use MOD() function to cycle through the array so that the last three observations per ID will be the values left in the array (skips the need to first subset to three observations per ID).
data want;
do count=0 by 1 until(last.id);
set bodysys;
by id;
array x[3] ais1-ais3 ;
x[1+mod(count,3)] = ais;
end;
iss=0;
if ais>5 then iss=75;
else do count=1 to 3 ;
iss + x[count]**2;
end;
keep id ais1-ais3 iss ;
run;
Result:
Obs id ais1 ais2 ais3 iss
1 1 2 3 4 29
2 2 3 . . 9
3 3 1 2 . 5
4 4 6 . . 75

Link two tables based on conditions in matlab

I am using matlab to prepare my dataset in order to run it in certain data mining models and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values in a certain timestamps and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are going to be added in table C when all the following conditions stand:
The time difference between A and B is more than 3 sec but less than 5 sec
The recorded value of A is the 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>2sec && (A(:,1)-B(:,1))<5sec) where you do need to use datenum or equivalent to transform your timestamps. For the second condition this works the same, use [row,col,val]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2)
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
you might need to check the documentation on datenum, regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1)&& (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2);
The variables of row and col you might not need when you want only the values though. val1 contains the values of condition 1, val2 of condition 2. If you want both conditions to be valid at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1)&& ...
(A(:,1)-B(:,1))<time1(2) && A(:,2)>0.4*B(:,2)...
&& A(:,2)<0.5*B(:,2);
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However for the time I followed a different approach by converting hh:mm:ss to seconds that will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?

summing & matching cell arrays of different sizes

I have a 4016 x 4 cell, called 'totalSalesCell'. The first two columns contain text the remaining two are numeric.
1st field CompanyName
2nd field UniqueID
3rd field NumberItems
4th field TotalValue
In my code I have a loop which goes over the last month in weekly steps - i.e. 4 loops.
At each loop my code returns a cell of the same structure as totalSalesCell, called weeklySalesCell which generally contains a different number of rows to totalSalesCell.
There are two things I need to do. First if weeklySalesCell contains a company that is not in totalSalesCell it needs to be added to totalSalesCell, which I believe the code below will do for me.
co_list = unique([totalSalesCell(:, 1); weeklySalesCell (:, 1)]);
index = ismember(co_list, totalSalesCell(:, 1));
new_co = co_list(index==0, :);
totalSalesCell = [totalSalesCell; new_co];
The second thing I need to do and am unsure of the best way of going about it is to then add the weeklySalesCell numeric fields to the totalSalesCell. As mentioned the cells will 90% of the time have different row numbers so cannot apply a simple addition. Below is an example of what I wish to achieve.
totalSalesCell weeklySalesCell Result
co_id sales_value co_id sales_value co_id sales_value
23DFG 5 DGH84 3 23DFG 5
DGH84 6 ABC33 1 DGH84 9
12345 7 PLM78 4 ABC33 1
PLM78 4 12345 3 12345 10
KLH11 11 PLM78 8
KLH11 11
I believe the following codes must take care of both of your tasks -
[x1,x2] = ismember(totalSalesCell(:,1),weeklySalesCell(:,1))
corr_c2 = nonzeros(x1.*x2)
newval = cell2mat(totalSalesCell(x1,2)) + cell2mat(weeklySalesCell(corr_c2,2))
totalSalesCell(x1,2) = num2cell(newval)
excl_c2 = ~ismember(weeklySalesCell(:,1),totalSalesCell(:,1))
out = vertcat(totalSalesCell,weeklySalesCell(excl_c2,:)) %// desired output
Output -
out =
'23DFG' [ 5]
'DGH8444' [ 9]
'12345' [10]
'PLM78' [ 8]
'KLH11' [11]
'ABC33' [ 1]

Setting Up a Dynamic Stopping Point for a Loop

Data is setup with a bunch of information corresponding to an ID, which can show-up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that signifies which instance of the ID I am looking at. Is it the first time, second time, etc? I want it as a string in the form _# so I have to go beyond the simple _n function in Stata, to my knowledge. If someone knows a way to do what I want without the loop let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N

Changing indices and order in arrays

I have a struct mpc with the following structure:
num type col3 col4 ...
mpc.bus = 1 2 ... ...
2 2 ... ...
3 1 ... ...
4 3 ... ...
5 1 ... ...
10 2 ... ...
99 1 ... ...
to from col3 col4 ...
mpc.branch = 1 2 ... ...
1 3 ... ...
2 4 ... ...
10 5 ... ...
10 99 ... ...
What I need to do is:
1: Re-order the rows of mpc.bus, such that all rows of type 1 are first, followed by 2 and at last, 3. There is only one element of type 3, and no other types (4 / 5 etc.).
2: Make the numbering (column 1 of mpc.bus, consecutive, starting at 1.
3: Change the numbers in the to-from columns of mpc.branch, to correspond to the new numbering in mpc.bus.
4: After running simulations, reverse the steps above to turn up with the same order and numbering as above.
It is easy to update mpc.bus using find.
type_1 = find(mpc.bus(:,2) == 1);
type_2 = find(mpc.bus(:,2) == 2);
type_3 = find(mpc.bus(:,2) == 3);
mpc.bus(:,:) = mpc.bus([type1; type2; type3],:);
mpc.bus(:,1) = 1:nb % Where nb is the number of rows of mpc.bus
The numbers in the to/from columns in mpc.branch corresponds to the numbers in column 1 in mpc.bus.
It's OK to update the numbers on the to, from columns of mpc.branch as well.
However, I'm not able to find a non-messy way of retracing my steps. Can I update the numbering using some simple commands?
For the record: I have deliberately not included my code for re-numbering mpc.branch, since I'm sure someone has a smarter, simpler solution (that will make it easier to redo when the simulations are finished).
Edit: It might be easier to create normal arrays (to avoid woriking with structs):
bus = mpc.bus;
branch = mpc.branch;
Edit #2: The order of things:
Re-order and re-number.
Columns (3:end) of bus and branch are changed. (Not part of this question)
Restore original order and indices.
Thanks!
I'm proposing this solution. It generates a n x 2 matrix, where n corresponds to the number of rows in mpc.bus and a temporary copy of mpc.branch:
function [mpc_1, mpc_2, mpc_3] = minimal_example
mpc.bus = [ 1 2;...
2 2;...
3 1;...
4 3;...
5 1;...
10 2;...
99 1];
mpc.branch = [ 1 2;...
1 3;...
2 4;...
10 5;...
10 99];
mpc.bus = sortrows(mpc.bus,2);
mpc_1 = mpc;
mpc_tmp = mpc.branch;
for I=1:size(mpc.bus,1)
PAIRS(I,1) = I;
PAIRS(I,2) = mpc.bus(I,1);
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = I;
mpc.bus(I,1) = I;
end
mpc_2 = mpc;
% (a) the following mpc_tmp is only needed if you want to truly reverse the operation
mpc_tmp = mpc.branch;
%
% do some stuff
%
for I=1:size(mpc.bus,1)
% (b) you can decide not to use the following line, then comment the line below (a)
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = PAIRS(I,2);
mpc.bus(I,1) = PAIRS(I,2);
end
% uncomment the following line, if you commented (a) and (b) above:
% mpc.branch = mpc_tmp;
mpc.bus = sortrows(mpc.bus,1);
mpc_3 = mpc;
The minimal example above can be executed as is. The three outputs (mpc_1, mpc_2 & mpc_3) are just in place to demonstrate the workings of the code but are otherwise not necessary.
1.) mpc.bus is ordered using sortrows, simplifying the approach and not using find three times. It targets the second column of mpc.bus and sorts the remaining matrix accordingly.
2.) The original contents of mpc.branch are stored.
3.) A loop is used to replace the entries in the first column of mpc.bus with ascending numbers while at the same time replacing them correspondingly in mpc.branch. Here, the reference to mpc_tmp is necessary so ensure a correct replacement of the elements.
4.) Afterwards, mpc.branch can be reverted analogously to (3.) - here, one might argue, that if the original mpc.branch was stored earlier on, one could just copy the matrix. Also, the original values of mpc.bus are re-assigned.
5.) Now, sortrows is applied to mpc.bus again, this time with the first column as reference to restore the original format.

Resources