Reordering a complicated list of data

I have a very useful script that creates (long) lists of this kind:
The first elements are x y coordinates on a 2D surface.
The next elements are ordered so:
['colorName1', 'colorName2', 'colorName3', 'colorName4']
[ density1, density2, density3, density4]
1 2 ['PINKwA','GB','PINK','TUwA'] [0.23816363 0.61917833 0.01219634 0.13046169]
1 3 ['PINKwA','GB','PINK','TUwA'] [0.23638376 0.6241587 0.01482295 0.12463459]
1 4 ['PINKwA','GB','PINK','TUwA'] [0.23460388 0.62913907 0.01744955 0.11880749]
1 5 ['PINKwA','GB','PINK','TUwA'] [0.23282401 0.63411944 0.02007616 0.11298039]
... and it continues ... the color names are changing and also their order
3 55 ['OR0A','PINK','PINKwA','GB'] [0.08645924 0.09921065 0.08746096 0.72686915]
3 56 ['OR0A','PINK','PINKwA','GB'] [0.08900035 0.10021389 0.0836124 0.72717336]
3 57 ['OR0A','PINK','PINKwA','GB'] [0.09154145 0.10121713 0.07976385 0.72747757]
4 1 ['PINKwA','GB','PINK','TUwA'] [0.26096751 0.61844932 0.01412691 0.10645625]
4 2 : ['PINKwA','GB','PINK','TUwA'] [0.25918763 0.62342969 0.01675352 0.10062915]
...etc.
I have a list of the color names and a list of the x y coordinates.
I would like to extract, for each color name:
(1) the set of x y coordinates at which it appears,
(2) the corresponding density at each of those coordinates,
(3) a density of 0 wherever the color name is not present.
Any ideas?

I suggest creating a data structure that holds the color name, density and coordinates as its properties.
You can then iterate through a list of these structures to find a given color and read off its properties.
For a color that is absent at a coordinate, check for the color name first and set the density to 0 when it is not found.
The data structure should look something like this:
Color
{
colorName
posX
posY
density
}
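A minimal Python sketch of this idea, assuming each line of the list is available as text in exactly the form shown in the question. The names parse_line and densities_by_color are hypothetical, and a dictionary keyed by color name stands in for the Color structure above:

```python
import re
from collections import defaultdict

# Matches: x y ['A','B',...] [d1 d2 ...]
# An optional ':' after the coordinates is tolerated (assumption based on one sample line).
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s*:?\s*\[(.*?)\]\s*\[(.*?)\]\s*$")

def parse_line(line):
    """Return ((x, y), {color_name: density}) for one line of the list."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    x, y = int(m.group(1)), int(m.group(2))
    names = re.findall(r"'([^']*)'", m.group(3))
    densities = [float(v) for v in m.group(4).split()]
    return (x, y), dict(zip(names, densities))

def densities_by_color(lines, color_names):
    """Map each color name to {(x, y): density}, using 0.0 where the color is absent."""
    table = defaultdict(dict)
    for line in lines:
        xy, found = parse_line(line)
        for name in color_names:
            table[name][xy] = found.get(name, 0.0)
    return dict(table)

sample = [
    "1 2 ['PINKwA','GB','PINK','TUwA'] [0.23816363 0.61917833 0.01219634 0.13046169]",
    "3 55 ['OR0A','PINK','PINKwA','GB'] [0.08645924 0.09921065 0.08746096 0.72686915]",
]
result = densities_by_color(sample, ["PINKwA", "GB", "PINK", "TUwA", "OR0A"])
```

Here result["GB"] holds the GB density at every coordinate, and result["TUwA"][(3, 55)] is 0.0 because TUwA does not appear on that line.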

How can I identify three highest values in a column by ID and take their squares and then add them in SAS?

I am working on injury severity scores (ISS) and my dataset has these four columns: ID, High_AIS, Dxcode (diagnosis code), ISS_bodyregion. Each ID/case has several values for "dxcode" and respective High_AIS and ISS_bodyregion - which means each ID/case has multiple injuries in different body regions. The rule to calculate ISS specifies that we have to select AIS values of three different ISS body regions.
For some IDs, we have only one value (of course when a person only has single injury and one associated dxcode and AIS). My goal is to calculate ISS (ranges from 0-75) and in order to do this, I want to tell SAS the following things:
Select three largest AIS values by ID (of course when ID has more than 3 values for AIS), take their squares and add them to get ISS.
If ID has only one injury and that has the AIS = 6, the ISS will automatically be equal to 75 (regardless of the injuries elsewhere).
If an ID has fewer than 3 AIS values (for example, the 5th ID has only two AIS values: 0 and 1), then consider only those two, square them and add them, as there is no third injured ISS body region for this ID.
If ID has only 3 AIS (for example, 1,0,0) then consider only three, square them and add them even if it is ISS=1.
If ID has all the injuries and AIS values equal to 0 (for example: 0,0) then ISS will equal to 0.
If ID has multiple injuries, and AIS values are: 2,2,1,1,1 and ISS_bodyregion = 5,5,6,6,6. Then we see that ISS_bodyregion repeats itself, the instructions suggest that we only select highest AIS value of ISS body region only once, because it has to be from DIFFERENT ISS body regions. So, in such situation, I want to tell SAS that if ISS_bodyregion repeats itself, only select the one with highest AIS value and leave the rest.
I am so confused as I am telling SAS to keep account of all these aforementioned considerations and I cannot seem to put them all in a single code. Thank you so much in advance. I have already sorted my data by ID descending high_AIS.

So if you are trying to implement this algorithm https://aci.health.nsw.gov.au/networks/institute-of-trauma-and-injury-management/data/injury-scoring/injury_severity_score then you need data like this:
data have;
input id region :$20. ais ;
cards;
1 HEAD/NECK 4
1 HEAD/NECK 3
1 FACE 1
1 CHEST 2
1 ABDOMEN 2
1 EXTREMITIES 3
1 EXTERNAL 1
2 ABDOMEN 3
3 FACE 1
3 CHEST 2
4 HEAD/NECK 6
;
So first find the max per id per region. For example by using PROC SUMMARY.
proc summary data=have nway;
class id region;
var ais;
output out=bodysys max=ais;
run;
Now order by ID and AIS
proc sort data=bodysys ;
by id ais ;
run;
Now you can process by ID and accumulate the AIS scores into an array. You can use MOD() function to cycle through the array so that the last three observations per ID will be the values left in the array (skips the need to first subset to three observations per ID).
data want;
do count=0 by 1 until(last.id);
set bodysys;
by id;
array x[3] ais1-ais3 ;
x[1+mod(count,3)] = ais;
end;
iss=0;
if ais>5 then iss=75;
else do count=1 to 3 ;
iss + x[count]**2;
end;
keep id ais1-ais3 iss ;
run;
Result:
Obs id ais1 ais2 ais3 iss
1 1 2 3 4 29
2 2 3 . . 9
3 3 1 2 . 5
4 4 6 . . 75
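For readers who want to cross-check the logic outside SAS, here is a rough Python sketch of the same three steps (maximum AIS per region, three largest per ID, sum of squares with the AIS 6 override). The function name injury_severity_scores is hypothetical, and the input is assumed to be (id, region, ais) tuples like the rows of the have dataset:

```python
from collections import defaultdict

def injury_severity_scores(records):
    """records: iterable of (id, region, ais) tuples. Returns {id: iss}."""
    # Step 1: keep only the highest AIS per ID and body region.
    worst = defaultdict(dict)
    for pid, region, ais in records:
        worst[pid][region] = max(ais, worst[pid].get(region, ais))

    scores = {}
    for pid, regions in worst.items():
        # Step 2: the three largest regional maxima (fewer if fewer regions exist).
        top3 = sorted(regions.values(), reverse=True)[:3]
        # Step 3: any AIS of 6 sets the score to 75, otherwise sum the squares.
        scores[pid] = 75 if top3[0] >= 6 else sum(a * a for a in top3)
    return scores

have = [
    (1, "HEAD/NECK", 4), (1, "HEAD/NECK", 3), (1, "FACE", 1), (1, "CHEST", 2),
    (1, "ABDOMEN", 2), (1, "EXTREMITIES", 3), (1, "EXTERNAL", 1),
    (2, "ABDOMEN", 3),
    (3, "FACE", 1), (3, "CHEST", 2),
    (4, "HEAD/NECK", 6),
]
iss = injury_severity_scores(have)
print(iss)  # {1: 29, 2: 9, 3: 5, 4: 75}
```

The scores agree with the iss column of the SAS result above.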

How to display an array with textbox in a figure?

I'm trying to display an array as a figure in MATLAB using coloured textbox that varies according to the value at that location.
So far, I have tried to use the MATLAB Edit Plot Tool to draw such a figure and then generate the code to see what it might look like. Here is what I came up with:
figure1=figure
annotation(figure1,'textbox',...
[0.232125037302298 0.774079320113315 0.034810205908684 0.0410764872521246],...
'String','HIT',...
'FitBoxToText','off',...
'BackgroundColor',[0.470588235294118 0.670588235294118 0.188235294117647]);
annotation(figure1,'textbox',...
[0.27658937630558 0.774079320113315 0.034810205908684 0.0410764872521246],...
'String',{'STAY'},...
'FitBoxToText','off',...
'BackgroundColor',[1 0 0]);
Here the result does not look so good. I'd like something neat and not as hard to write. Visually, I'd like something like this:

I've found a possible solution using the pcolor function.
Warning: I've tested it only with Octave
If you want to create an (m x n) table with, as per your picture, 4 colours, you have to:
create an array of size (m+1 x n+1) of integers in the 1:4 range, setting them according to the desired order
call pcolor to plot the table
adjust the size of the figure
create your own colormap according to the desired colors
set the colormap
add the desired text using the text function
set the tick and ticklabel of the axes
Edit to answer the comment
In the following you can find a possible implementation of the proposed solution.
The code creates two figures:
In the first one the values of the input matrix are plotted
In the second one the user-defined strings
The association "color - value" is performed through the user-defined colormap.
Since in the matrix x there are 4 different possible values (it has been defined as x=randi([1 4],n_row+1,n_col+1);) the colormap has to consist of 4 RGB entries as follows.
cm=[1 0.3 0.3 % RED
0.3 0.3 1 % BLUE
0 1 0 % GREEN
1 1 1]; % WHITE
Should you want to change the association, you just have to change the order of the rows of the colormap.
The comments in the code should clarify the above steps.
Code updated
% Define a random data set
n_row=24;
n_col=10;
x=randi([1 4],n_row+1,n_col+1);
for fig_idx=1:2
% Open two figures
% In the first one the values of the input matrix are plotted
% In the second one the user-defined strings
figure('position',[ 1057 210 606 686])
% Plot the matrix
s=pcolor(x);
set(s,'edgecolor','w','linewidth',3)
% Define the colormap
%cm=[1 1 1
% 0 1 0
% 0.3 0.3 1
% 1 0.3 0.3];
cm=[1 0.3 0.3 % RED
0.3 0.3 1 % BLUE
0 1 0 % GREEN
1 1 1]; % WHITE
% Set the colormap
colormap(cm);
% Write the text according to the color
[r,c]=find(x(1:end-1,1:end-1) == 1);
for i=1:length(r)
if(fig_idx == 1)
ht=text(c(i)+.1,r(i)+.5,num2str(x(r(i),c(i))));
else
ht=text(c(i)+.1,r(i)+.5,'SUR');
end
set(ht,'fontweight','bold','fontsize',10);
end
% Write the text according to the color
[r,c]=find(x(1:end-1,1:end-1) == 2);
for i=1:length(r)
if(fig_idx == 1)
ht=text(c(i)+.1,r(i)+.5,num2str(x(r(i),c(i))));
else
ht=text(c(i)+.1,r(i)+.5,'DBL');
end
set(ht,'fontweight','bold','fontsize',10);
end
% Write the text according to the color
[r,c]=find(x(1:end-1,1:end-1) == 3);
for i=1:length(r)
if(fig_idx == 1)
ht=text(c(i)+.1,r(i)+.5,num2str(x(r(i),c(i))));
else
ht=text(c(i)+.1,r(i)+.5,'HIT');
end
set(ht,'fontweight','bold','fontsize',10);
end
% Write the text according to the color
[r,c]=find(x(1:end-1,1:end-1) == 4);
for i=1:length(r)
if(fig_idx == 1)
ht=text(c(i)+.1,r(i)+.5,num2str(x(r(i),c(i))));
else
ht=text(c(i)+.1,r(i)+.5,'STK');
end
set(ht,'fontweight','bold','fontsize',10);
end
% Create and set the X labels
xt=.5:10.5;
xtl={' ';'2';'3';'4';'5';'6';'7';'8';'9';'10';'A'};
set(gca,'xtick',xt);
set(gca,'xticklabel',xtl,'xaxislocation','top','fontweight','bold');
% Create and set the Y labels
yt=.5:24.5;
ytl={' ';'Soft20';'Soft19';'Soft18';'Soft17';'Soft16';'Soft15';'Soft14';'Soft13'; ...
'20';'19';'18';'17';'16';'15';'14';'13';'12';'11';'10';'9';'8';'7';'6';'5'};
set(gca,'ytick',yt);
set(gca,'yticklabel',ytl,'fontweight','bold');
title('Dealer''s Card')
end
Table with the values in the input matrix
Table with the user-defined strings

This is an answer inspired by il_raffa's answer, but with also quite a few differences. There is no better or worse, it's just a matter of preferences.
Main differences are:
it uses imagesc instead of pcolor
it uses a second overlaid axes for fine control of the grid color/thickness/transparency etc...
The association between value - label - color is set right at the beginning in one single table. All the code will then respect this table.
It goes like this:
%% Random data
n_row = 24;
n_col = 10;
vals = randi([1 4], n_row, n_col);
%% Define labels and associated colors
% these are your different labels and their associated colors. They will be
% associated with the values 1,2,3, etc ... in the order they appear in this
% table:
Categories = {
'SUR' , [1 0 0] % red <= Label and color associated to value 1
'DBL' , [0 0 1] % blue <= Label and color associated to value 2
'HIT' , [0 1 0] % green <= Label and color associated to value 3
'STK' , [1 1 1] % white <= you know what this is by now ;-)
} ;
% a few more settings
BgColor = 'w' ; % Background color for various elements
strTitle = 'Dealer''s Card' ;
%% Parse settings
% get labels according to the "Categories" defined above
labels = Categories(:,1) ;
% build the colormap according to the "Categories" defined above
cmap = cell2mat( Categories(:,2) ) ;
%% Display
hfig = figure('Name',strTitle,'Color',BgColor,...
'Toolbar','none','Menubar','none','NumberTitle','off') ;
ax1 = axes ;
imagesc(vals) % Display each cell with an associated color
colormap(cmap); % Set the colormap
grid(ax1,'off') % Make sure there is no grid
% Build and place the texts objects
textStrings = labels(vals) ;
[xl,yl] = meshgrid(1:n_col,1:n_row);
hStrings = text( xl(:), yl(:), textStrings(:), 'HorizontalAlignment','center');
%% Modify text color if needed
% (White text for the darker box colors)
textColors = repmat(vals(:) <= 2 ,1,3);
set(hStrings,{'Color'},num2cell(textColors,2));
%% Set the axis labels
xlabels = [ cellstr(num2str((2:10).')) ; {'A'} ] ;
ylabels = [ cellstr(num2str((5:20).')) ; cellstr(reshape(sprintf('soft %2d',[13:20]),7,[]).') ] ;
set(ax1,'XTick', 1:numel(xlabels), ...
'XTickLabel', xlabels, ...
'YTick', 1:numel(ylabels), ...
'YTickLabel', ylabels, ...
'TickLength', [0 0], ...
'fontweight', 'bold' ,...
'xaxislocation','top') ;
title(strTitle)
%% Prettify
ax2 = axes ; % create new axes and retrieve handle
% superpose the new axes on top, at the same position
set(ax2,'Position', get(ax1,'Position') );
% make it transparent (no color)
set(ax2,'Color','none')
% set the X and Y grid ticks and properties
set(ax2,'XLim',ax1.XLim , 'XTick',[0 ax1.XTick+0.5],'XTickLabel','' ,...
'YLim',ax1.YLim , 'YTick',[0 ax1.YTick+0.5],'YTickLabel','' ,...
'GridColor',BgColor,'GridAlpha',1,'Linewidth',2,...
'XColor',BgColor,'YColor',BgColor) ;
% Make sure the overlaid axes follow the underlying one
resizeAxe2 = @(s,e) set(ax2,'Position', get(ax1,'Position') );
hfig.SizeChangedFcn = resizeAxe2 ;
It produces the following figure:
Of course, you can replace the colors with your favorite colors.
I would encourage you to play with the grid settings of the ax2 for different effects, and you can also play with the properties of the text objects (make them bold, other color etc ...). Have fun !

Filling in missing values in one dataset based on another in presence of repeated observations in R

Using R, I would like to use information from dataframe 2 to fill in missing values in dataframe 1. Here are the headers from my files. File 1 is a dataframe with data and location (long/lat) of an event. Some of the spatial information is missing.
> head(file1)
day.of.event longitude latitude PLZ
1 01.01.2009 750303 243535 9050
2 01.01.2009 645616 235136 5056
3 01.01.2009 722132 253715 9602
4 01.01.2009 645149 222845 8836
5 01.01.2009 NA NA 3000
6 01.01.2009 NA NA 3000
However, based on the postcode (PLZ) , I can find these in the Swiss official register (cadastre). The NAs in the first file should be replaced by the E/N corresponding to the PLZ (postcode).
> head(file2)
Ortschaftsname PLZ Zusatzziffer Gemeindename Kantonskürzel E N
1 Aadorf 8355 0 Aadorf TG 710450 261277
2 Aarau 5000 0 Aarau AG 646063 248867
3 Aarau 5004 0 Aarau AG 646950 250197
4 Aarau Rohr 5032 0 Aarau AG 648491 250615
5 Aarberg 3270 0 Aarberg BE 588188 210368
6 Aarburg 4663 0 Aarburg AG 635148 241461
Now as I have several hundreds of thousands of events, the postcode will be repeated but I would like to replace all NAs for postcode "3000"(for example) with the same longitude (E) and latitude (N)(repeat for all NAs).
There must be an easier way than doing this manually?

The following is not the best way to do this task, but if the order does not matter then you could do something like this.
a<-subset(file1,PLZ==3000) # extract all the rows where PLZ is 3000
b<-subset(file1,PLZ!=3000) # remaining part of dataframe
a$longitude<-rep(lonvalue,nrow(a))
a$latitude<-rep(latvalue,nrow(a))
file1<-rbind(b,a)
In the above code, either hard-code lonvalue and latvalue or pass them in as variables holding the longitude and latitude you want to assign.
EDIT:
You can write a loop. Iterate over all rows of file1
something like:
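for (row in seq_len(nrow(file1))) {
  # only touch rows whose longitude is missing
  if (is.na(file1$longitude[row])) {
    # look up the register entry with the same postcode
    hit <- subset(file2, PLZ == file1$PLZ[row])
    file1$longitude[row] <- hit$E
    file1$latitude[row] <- hit$N
  }
}
The above will work if file2 has exactly one row for each PLZ.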
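The loop re-subsets file2 for every missing row, which may be slow with several hundred thousand events. A hedged sketch of the usual alternative, building a postcode lookup once and filling from it, is shown here in Python. The function name fill_missing_coords and the dictionary-per-row layout are assumptions for illustration, with None standing in for NA:

```python
def fill_missing_coords(events, register):
    """Fill missing longitude/latitude in events from a PLZ -> (E, N) lookup.

    events:   list of dicts with keys "PLZ", "longitude", "latitude"
    register: list of dicts with keys "PLZ", "E", "N"
    If a PLZ occurs more than once in register, the last row wins.
    """
    lookup = {row["PLZ"]: (row["E"], row["N"]) for row in register}
    filled = []
    for ev in events:
        ev = dict(ev)  # copy so the input is left untouched
        if ev["longitude"] is None or ev["latitude"] is None:
            coords = lookup.get(ev["PLZ"])
            if coords is not None:  # leave the gap if the postcode is unknown
                ev["longitude"], ev["latitude"] = coords
        filled.append(ev)
    return filled

# Register rows taken from head(file2); the event with PLZ 5000 is made up for illustration.
register = [
    {"PLZ": 8355, "E": 710450, "N": 261277},
    {"PLZ": 5000, "E": 646063, "N": 248867},
]
events = [
    {"PLZ": 9050, "longitude": 750303, "latitude": 243535},
    {"PLZ": 5000, "longitude": None, "latitude": None},
    {"PLZ": 3000, "longitude": None, "latitude": None},
]
filled = fill_missing_coords(events, register)
```

Every event sharing a postcode receives the same coordinates, and existing coordinates are never overwritten. The PLZ 3000 event stays empty here only because that postcode is not in the six register rows shown.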

Setting Up a Dynamic Stopping Point for a Loop

Data is setup with a bunch of information corresponding to an ID, which can show-up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that signifies which instance of the ID I am looking at. Is it the first time, second time, etc? I want it as a string in the form _# so I have to go beyond the simple _n function in Stata, to my knowledge. If someone knows a way to do what I want without the loop let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.

It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.

Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
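To make explicit what _n and _N compute within each ID, here is a small Python sketch on the example data from the question. The variable names mirror the Stata ones but are otherwise arbitrary:

```python
from collections import Counter

ids = [1, 1, 2, 2, 2, 3]  # the ID column from the question

seen = Counter()
count_two = []
for i in ids:
    seen[i] += 1                      # running position within the ID, like _n
    count_two.append(f"_{seen[i]}")   # string form "_#"

# After the pass, seen[i] is the number of rows per ID, like _N.
maxcount = [seen[i] for i in ids]

print(count_two)  # ['_1', '_2', '_1', '_2', '_3', '_1']
print(maxcount)   # [2, 2, 3, 3, 3, 1]
```

No upper bound has to be hard-coded, which is the point of using _n and _N instead of a forval loop over a fixed range.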

Changing indices and order in arrays

I have a struct mpc with the following structure:
num type col3 col4 ...
mpc.bus = 1 2 ... ...
2 2 ... ...
3 1 ... ...
4 3 ... ...
5 1 ... ...
10 2 ... ...
99 1 ... ...
to from col3 col4 ...
mpc.branch = 1 2 ... ...
1 3 ... ...
2 4 ... ...
10 5 ... ...
10 99 ... ...
What I need to do is:
1: Re-order the rows of mpc.bus, such that all rows of type 1 are first, followed by 2 and at last, 3. There is only one element of type 3, and no other types (4 / 5 etc.).
2: Make the numbering (column 1 of mpc.bus) consecutive, starting at 1.
3: Change the numbers in the to-from columns of mpc.branch, to correspond to the new numbering in mpc.bus.
4: After running simulations, reverse the steps above to turn up with the same order and numbering as above.
It is easy to update mpc.bus using find.
type_1 = find(mpc.bus(:,2) == 1);
type_2 = find(mpc.bus(:,2) == 2);
type_3 = find(mpc.bus(:,2) == 3);
mpc.bus(:,:) = mpc.bus([type_1; type_2; type_3],:);
mpc.bus(:,1) = 1:nb % Where nb is the number of rows of mpc.bus
The numbers in the to/from columns in mpc.branch corresponds to the numbers in column 1 in mpc.bus.
It's OK to update the numbers on the to, from columns of mpc.branch as well.
However, I'm not able to find a non-messy way of retracing my steps. Can I update the numbering using some simple commands?
For the record: I have deliberately not included my code for re-numbering mpc.branch, since I'm sure someone has a smarter, simpler solution (that will make it easier to redo when the simulations are finished).
Edit: It might be easier to create normal arrays (to avoid working with structs):
bus = mpc.bus;
branch = mpc.branch;
Edit #2: The order of things:
Re-order and re-number.
Columns (3:end) of bus and branch are changed. (Not part of this question)
Restore original order and indices.
Thanks!

I'm proposing this solution. It generates an n x 2 matrix (PAIRS), where n corresponds to the number of rows in mpc.bus, and a temporary copy of mpc.branch:
function [mpc_1, mpc_2, mpc_3] = minimal_example
mpc.bus = [ 1 2;...
2 2;...
3 1;...
4 3;...
5 1;...
10 2;...
99 1];
mpc.branch = [ 1 2;...
1 3;...
2 4;...
10 5;...
10 99];
mpc.bus = sortrows(mpc.bus,2);
mpc_1 = mpc;
mpc_tmp = mpc.branch;
for I=1:size(mpc.bus,1)
PAIRS(I,1) = I;
PAIRS(I,2) = mpc.bus(I,1);
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = I;
mpc.bus(I,1) = I;
end
mpc_2 = mpc;
% (a) the following mpc_tmp is only needed if you want to truly reverse the operation
mpc_tmp = mpc.branch;
%
% do some stuff
%
for I=1:size(mpc.bus,1)
% (b) you can decide not to use the following line, then comment the line below (a)
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = PAIRS(I,2);
mpc.bus(I,1) = PAIRS(I,2);
end
% uncomment the following line, if you commented (a) and (b) above:
% mpc.branch = mpc_tmp;
mpc.bus = sortrows(mpc.bus,1);
mpc_3 = mpc;
The minimal example above can be executed as is. The three outputs (mpc_1, mpc_2 & mpc_3) are just in place to demonstrate the workings of the code but are otherwise not necessary.
1.) mpc.bus is ordered using sortrows, simplifying the approach and not using find three times. It targets the second column of mpc.bus and sorts the remaining matrix accordingly.
2.) The original contents of mpc.branch are stored.
3.) A loop is used to replace the entries in the first column of mpc.bus with ascending numbers while at the same time replacing them correspondingly in mpc.branch. Here, the reference to mpc_tmp is necessary to ensure a correct replacement of the elements.
4.) Afterwards, mpc.branch can be reverted analogously to (3.) - here, one might argue, that if the original mpc.branch was stored earlier on, one could just copy the matrix. Also, the original values of mpc.bus are re-assigned.
5.) Now, sortrows is applied to mpc.bus again, this time with the first column as reference to restore the original format.
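The same bookkeeping can be sketched language-independently with an explicit old-to-new mapping. The Python version below is only an illustration of the idea, not a translation of the MATLAB code: the names renumber and restore are hypothetical, and bus and branch are assumed to be lists of rows with the bus number and the from/to numbers in the leading columns:

```python
def renumber(bus, branch):
    """Sort bus rows by type (column 2), renumber 1..n, and remap branch ends."""
    # Stable sort of row indices by type keeps the original order within a type.
    order = sorted(range(len(bus)), key=lambda i: bus[i][1])
    old_ids = [bus[i][0] for i in order]
    new_of_old = {old: new for new, old in enumerate(old_ids, start=1)}
    new_bus = [[new_of_old[bus[i][0]]] + list(bus[i][1:]) for i in order]
    new_branch = [[new_of_old[a], new_of_old[b]] + list(rest)
                  for a, b, *rest in branch]
    return new_bus, new_branch, order, old_ids

def restore(new_bus, new_branch, order, old_ids):
    """Undo renumber(): original numbers and original row order."""
    old_of_new = {new: old for new, old in enumerate(old_ids, start=1)}
    bus = [None] * len(new_bus)
    for k, row in enumerate(new_bus):
        # Row k of the sorted table came from row order[k] of the original.
        bus[order[k]] = [old_of_new[row[0]]] + list(row[1:])
    branch = [[old_of_new[a], old_of_new[b]] + list(rest)
              for a, b, *rest in new_branch]
    return bus, branch

bus = [[1, 2], [2, 2], [3, 1], [4, 3], [5, 1], [10, 2], [99, 1]]
branch = [[1, 2], [1, 3], [2, 4], [10, 5], [10, 99]]

new_bus, new_branch, order, old_ids = renumber(bus, branch)
back_bus, back_branch = restore(new_bus, new_branch, order, old_ids)
```

Keeping order and old_ids plays the role of PAIRS above. Because the reverse step uses the stored row order rather than sorting by bus number, it also works when the original bus numbers were not in ascending order.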
