Merge or join unequal datasets and duplicate the value of one of them in SAS

I am trying to merge two datasets (df1, df2). One of them, df2, has only one observation, and I want its value assigned to every row of df1 when I merge them in SAS.
I am aware that I can add the value manually, but I want an automated way, as this is just one step in a long program that works with big data.
Here is a reproducible example and datasets:
data df1;
input a b c;
datalines;
1 2 3
6 7 8
5 6 9
;
run;
data df2;
input d ;
datalines;
4
;
run;
data df3;
merge df1 df2;
run;
/*I need the resulting df3 to be */;
a b c d
1 2 3 4
6 7 8 4
5 6 9 4
Any help will be greatly appreciated.

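Then you don't want to MERGE the datasets, since there are no common variables that the merge could actually use. Without a BY statement, MERGE pairs observations one-to-one, so d would be filled in only on the first row.
Instead just SET both datasets, but take care not to read past the end of the single-observation dataset. Variables read with SET are retained across iterations of the DATA step, so the value read on the first iteration carries over to every later row. With the example data, long_dataset is df1 and short_dataset is df2.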
data want;
set long_dataset;
if _n_=1 then set short_dataset;
run;

Related

SAS Compare values across multiple variables

I have the following example data set with an ID and the contract status in six months (01/2017 - 06/2017).
Example data:
ID Month1 Month2 Month3 Month4 Month5 Month6
12 5 5 5 5 5 5
34 5 5 6 6 5 5
56 6 6 6 -7 -7 -7
78 6 6 5 5 5 5
12 5 5 5 5 6 -7
If the status is 5 the ID is active, if 6 it's canceled and -7 is "not able to reactivate".
I want to check two kinds of changes:
1) IDs which change from status 5 to 6
2) IDs which change from 6 to 5
When the status changes from 5 to 6 I want a new variable "churn" containing the month in which the status changes to 6.
For the second group, I want a new variable "reactivation" containing the month in which the status changes to 5.
If an ID is in both groups (from 5 to 6 to 5) both variables should be filled.
What I have so far is an array, which shows me how many status matches occur in one row, but I do not get the next step. Here is the code:
data want (drop= i j);
set have (obs=100);
array stat_check {*} month1-month6;
sum=0;
do i=1 to dim(stat_check)-1;
do j=i+1 to dim(stat_check);
sum=sum(sum,stat_check(i) eq stat_check(j));
end;
end;
run;
Thanks in advance!
For an array approach, sounds like you need to compare each variable in the array to the variable immediately before it. You don't need two passes through the array, only one. You want to compare month2 to month1, month3 to month2 ... month6 to month5.
I would try something like (untested):
data want (drop= month);
  set have (obs=100);
  array stat_check {*} month1-month6;
  /* compare each month with the one immediately before it */
  do month=2 to dim(stat_check);
    /* 5 -> 6 is a difference of +1 (churn); 6 -> 5 is -1 (reactivation) */
    if stat_check{month}-stat_check{month-1} = 1 then Churn=month;
    else if stat_check{month}-stat_check{month-1} = -1 then Reactivation=month;
  end;
run;
If you could have multiple churns or multiple reactivations for the same ID, that would capture the latest churn or reactivation.
But honestly, I would transpose the data to have one row per ID-month. That would avoid the need for an array, and would allow you to capture multiple churns/reactivations. Generally it is easier to work with tall skinny data rather than short wide data. For example, it would be easy to count the number of months each ID was active.
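To illustrate the point about capturing multiple churns and reactivations, here is a minimal sketch of the adjacent-month comparison that records every change rather than only the latest one. It is written in Python rather than SAS purely to show the logic. The function name find_transitions and the two status constants are made up for this example, and it assumes months are numbered from 1 as in the question.

```python
ACTIVE = 5      # status code for an active contract (from the question)
CANCELED = 6    # status code for a canceled contract (from the question)


def find_transitions(statuses):
    """Return (churn_months, reactivation_months) for one ID.

    `statuses` holds the monthly status codes in order; months are
    reported as 1-based positions in that list.
    """
    churn_months = []
    reactivation_months = []
    for month in range(2, len(statuses) + 1):
        previous = statuses[month - 2]
        current = statuses[month - 1]
        if previous == ACTIVE and current == CANCELED:
            churn_months.append(month)
        elif previous == CANCELED and current == ACTIVE:
            reactivation_months.append(month)
    return churn_months, reactivation_months


# Rows taken from the example data in the question.
churned_then_back = find_transitions([5, 5, 6, 6, 5, 5])
reactivated_only = find_transitions([6, 6, 5, 5, 5, 5])
churned_late = find_transitions([5, 5, 5, 5, 6, -7])
```

Because the results are lists, an ID that churns twice simply gets two entries, which is the kind of thing a one-row-per-ID-month layout makes easy.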
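You can try this one. The vname function is used to get the name of the month variable in which the change happens. Each month is compared with the one immediately before it, so a difference of 1 (5 to 6) is a churn and a difference of -1 (6 to 5) is a reactivation. Here the month variables are assumed to be named m1-m6.
data two (drop= i diff);
  set one;
  /* vname returns a long string unless the length is declared */
  length churn reactivation $32;
  array stat_check {*} m1-m6;
  do i=2 to dim(stat_check);
    diff=stat_check(i)-stat_check(i-1);
    if diff=1 then churn=vname(stat_check(i));
    else if diff=-1 then reactivation=vname(stat_check(i));
  end;
run;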

Dynamically set size of array without hardcoding

So I have datasets with variables and values like so:
A1 A2 A3 A4 A5 A6
1 3 5 6 10 2
The variables can go up to A2000 in certain cases. I want to perform the same operation on each variable using an array. Is there a way to dynamically set the size of the array without manually typing it?
Example code of what I am striving for is below
data A;
input A1-A6;
datalines;
1 3 5 6 10 2
;
run;
data A;
set A;
array a[*] a1-a&size;
do i=1 to &size;
/* perform some operation here */
end;
run;
My question is how can I write code to get the parameter &size that represents the size of the array? In this example, &size = six.
Sure, use the : wildcard. a: matches every variable whose name begins with a, so this only works if a1-a6 (or a-whatever) are already defined in the dataset, and if no unrelated variable names start with a.
data have;
input a1-a6;
datalines;
1 2 3 4 5 6
7 8 9 10 11 12
;;;;
run;
data want;
set have;
array a a:;
do i=1 to dim(a);
sum = sum(sum ,a[i]);
end;
run;
Otherwise, what you put above would absolutely work. You don't need the [*] bit, though, and I prefer to keep the dim instead of &size on the loop control in case you change the way this works in the future. Of course you need to have a way to determine &size which will depend on your data.
%let size=6;
data want;
set have;
array a a1-a&size.;
do i=1 to dim(a);
sum = sum(sum ,a[i]);
end;
run;

Divide a data set into bins of size n matlab

I have a data set of size 11490x1. The data is recorded every 0.25 seconds (i.e. at 4 Hz), so 1 second accounts for 4 data points. The goal is to create subsets covering 3 seconds each (12 data points), so that I can analyze the data 3 seconds at a time. For example, if I had data such as [1 2 3 4 5 6 8 2 4 2 4 3 2 4 2 5 2 5 24 2 5 1 5 1], I want the first subset to be [1 2 3 4 5 6 8 2 4 2 4 3], and so on.
Any help would be appreciated.
It really depends on how you plan to "analyse" your data. The simplest way is to use a loop:
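n = 4*3; %// 4 samples per second * 3 seconds = 12 samples per bin
breaks = 0:n:numel(data); %// samples left over after the last full bin are skipped
for i = 1:numel(breaks)-1
    sub = data(breaks(i)+1:breaks(i+1));
    %// do analysis
    %// OR sub{i} = data(breaks(i)+1:breaks(i+1));
end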
A vectorized approach might use reshape(data,12,[]) after padding data so that mod(numel(data),12)==0. Because reshape fills column by column, each column of the result is then one 12-sample bin.
A third way might be to break your matrix up into a cell array using mat2cell or in a for loop like above but instead of sub=... rather use sub{i}=...
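For comparison, the same binning idea can be sketched outside Matlab. The snippet below is a minimal Python illustration of the logic only, not a replacement for the Matlab code above. The function name split_into_bins is made up for this example, and unlike the loop above it keeps a trailing partial bin as a shorter final bin instead of skipping it.

```python
def split_into_bins(samples, bin_size):
    """Return consecutive, non-overlapping bins of `bin_size` samples.

    A trailing partial bin is kept as a shorter final bin.
    """
    return [samples[i:i + bin_size] for i in range(0, len(samples), bin_size)]


sample_rate_hz = 4                          # one sample every 0.25 seconds
window_seconds = 3
bin_size = sample_rate_hz * window_seconds  # 12 samples per bin

# The example data from the question (24 samples, i.e. two full bins).
data = [1, 2, 3, 4, 5, 6, 8, 2, 4, 2, 4, 3,
        2, 4, 2, 5, 2, 5, 24, 2, 5, 1, 5, 1]
bins = split_into_bins(data, bin_size)
```

With 11490 samples this gives 957 full bins plus one final bin of 6 samples, so the analysis step has to decide whether to use or discard that last short bin.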

Calculating moving average using do loop in SAS

I am trying to find a way to calculate a moving average using SAS do loops. I am having difficulty. I essentially want to calculate a 4 unit moving average.
DATA data;
INPUT a b;
CARDS;
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
;
run;
data test(drop = i);
set data;
retain c 0;
do i = 1 to _n_-4;
c = (c+a)/4;
end;
run;
proc print data = test;
run;
One option is to use the merge-ahead:
DATA have;
INPUT a b;
CARDS;
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
;
run;
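data want;
  /* keep=a so the look-ahead copies do not overwrite b */
  merge have
        have(firstobs=2 keep=a rename=(a=a_1))
        have(firstobs=3 keep=a rename=(a=a_2))
        have(firstobs=4 keep=a rename=(a=a_3));
  c = mean(of a:); /* averages a, a_1, a_2 and a_3 */
run;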
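Merge the dataset with itself, each copy starting one observation later, so the second copy starts at observation 2, the third at observation 3, and so on. That gives you all four 'a' values on one line. Note that this averages each row with the three rows after it, and that the last three rows average fewer than four values because the look-ahead copies have run out.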
SAS has a lag() function, which returns the value its argument had on an earlier observation (strictly, on an earlier call of the function). So, for example, if your data looked like this:
DATA data;
INPUT a ;
CARDS;
1
2
3
4
5
;
Then the following would create variables holding the first, second and third lags:
data data2;
  set data;
  a_1=lag(a);
  a_2=lag2(a);
  a_3=lag3(a);
run;
would create the following dataset
a a_1 a_2 a_3
1 . . .
2 1 . .
3 2 1 .
4 3 2 1
etc.
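Moving averages can be easily calculated from these, for example with c = mean(a, a_1, a_2, a_3). The mean function ignores missing values, so the first three rows average fewer than four values.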
Check out http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
(Please note, I did not get a chance to run the codes, so they may have errors.)
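To make the target numbers concrete, here is a minimal sketch of a 4-value trailing moving average, written in Python only to show the arithmetic that the lagged variables lead to. The function name trailing_moving_average is made up for this example. As an assumption of the sketch, it returns None until a full window is available rather than averaging fewer values.

```python
from collections import deque


def trailing_moving_average(values, window):
    """Return one entry per input value: the mean of the current value and
    the `window - 1` values before it, or None until the window is full."""
    recent = deque(maxlen=window)   # holds at most the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
        else:
            averages.append(None)
    return averages


# Column a from the question's example data.
a = [1, 3, 5, 7, 9, 11, 13, 15, 17]
averages = trailing_moving_average(a, 4)
```

The first non-missing value is (1 + 3 + 5 + 7) / 4 = 4, which is what a SAS solution built on lag, lag2 and lag3 should reproduce on the fourth observation.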
Straight from Cody's Collection of Popular SAS Programming Tasks and How to Tackle Them.
*Presenting a macro to compute a moving average;
%macro Moving_ave(In_dsn=, /*Input data set name */
Out_dsn=, /*Output data set name */
Var=, /*Variable on which to compute
the average */
Moving=, /* Variable for moving average */
n= /* Number of observations on which
to compute the average */);
data &Out_dsn;
set &In_dsn;
***compute the lags;
_x1 = &Var;
%do i = 1 %to &n - 1;
%let Num = %eval(&i + 1);
_x&Num = lag&i(&Var);
%end;
***if the observation number is greater than or equal to the
number of values needed for the moving average, output;
if _n_ ge &n then do;
&Moving = mean (of _x1 - _x&n);
output;
end;
drop _x:;
run;
%mend Moving_ave;
*Testing the macro;
%moving_Ave(In_dsn=data,
Out_dsn=test,
Var=a,
Moving=Average,
n=4)

What format does matlab need for n-dimensional data input?

I have a 4-dimensional dictionary I made with a Python script for a data mining project I'm working on, and I want to read the data into Matlab to do some statistical tests on the data.
To read a 2-dimensional matrix is trivial. I figured that since my first dimension is only 4-deep, I could just write each slice of it out to a separate file (4 files total) with each file having many 2-dimensional slices, looking something like this:
2 3 6
4 5 8
6 7 3
1 4 3
6 6 7
8 9 0
This, however, does not work: Matlab reads it as a single continuous 6 x 3 matrix. I even took a look at dlmread but could not figure out how to get it to do what I wanted. How do I format this so I can put 3 (or preferably more) dimensions in a single file?
A simple solution is to create a file with two lines only: the first line contains the target array size, the second line contains all your data. Then, all you need to do is reshape the data.
Say your file is
3 2 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
You do the following to read the array into the variable data
fid = fopen('myFile'); %# open the file (don't forget the extension)
arraySize = str2num(fgetl(fid)); %# read the first line, convert to numbers
data = str2num(fgetl(fid)); %# read the second line
data = reshape(data,arraySize); %# reshape the data
fclose(fid); %# close the file
Have a look at data to see how Matlab orders elements in multidimensional arrays.
Matlab stores data column-wise. So for your example (assuming it's a 2x3x3 array, i.e. three slices of two rows and three columns each), Matlab will store the first, second and third columns of the first slice, followed by the first, second and third columns of the second slice, and so on, like this:
2
4
3
5
6
8
6
1
7
4
3
3
6
8
6
9
7
0
So you can write the data out in this order from Python (I don't know how) and then read it into Matlab. Then you can reshape it back into a 2x3x3 array and you'll retain your correct ordering.
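Since the Python side is left open above, here is one possible sketch of it. It assumes the data is available as a plain nested list of slices, each slice being a list of rows. The question's dictionary would first have to be converted to that layout, and a fourth dimension would need one more loop level. The function names flatten_column_major and format_for_matlab are made up for this example. It produces the two-line layout suggested earlier (size on the first line, all values on the second), with the values in Matlab's column-wise order.

```python
def flatten_column_major(slices):
    """Flatten equally sized 2-D slices so that the row index varies
    fastest, then the column index, then the slice index."""
    n_rows = len(slices[0])
    n_cols = len(slices[0][0])
    flat = [
        one_slice[row][col]
        for one_slice in slices
        for col in range(n_cols)
        for row in range(n_rows)
    ]
    return (n_rows, n_cols, len(slices)), flat


def format_for_matlab(slices):
    """Return the file contents: array size on line 1, values on line 2."""
    shape, flat = flatten_column_major(slices)
    header = " ".join(str(n) for n in shape)
    body = " ".join(str(v) for v in flat)
    return header + "\n" + body + "\n"


# The question's example, read as three slices of two rows by three columns.
slices = [
    [[2, 3, 6], [4, 5, 8]],
    [[6, 7, 3], [1, 4, 3]],
    [[6, 6, 7], [8, 9, 0]],
]
text = format_for_matlab(slices)
```

Writing text to a file and reading it back with the fgetl / str2num / reshape steps shown above should then give an array whose slices match the original ones.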
