Creating dummy variables in SAS from a categorical variable - loops

I am looking to create dummy variables for a categorical variable in SAS. The categorical variable contains site information and takes on values such as Manila, Rabat, etc.; in all there are about 50 different sites. What would be the most efficient way to create the dummies without coding each one separately with "if then"? Maybe using loops? How would that look?

Short answer: yes. Without further input I'm afraid there is little more we can provide. Here are a few examples:
data with_categoric(keep=category:);
set sashelp.zipcode;
/* give category2 enough length so 'medium' is not truncated */
length category2 $6;
/* 0/1 flag: Central time zone and a short county name */
category1 = (TIMEZONE='Central' and length(COUNTYNM) <= 4);
if Y < 35 then category2 = 'low';
else if Y < 41 then category2 = 'medium';
else category2 = 'high';
run;
An alternative way to derive category2 is via PROC FORMAT:
proc format;
value level
low -< 35 = 'low'
35 -< 41 = 'medium'
41 - high = 'high';
run;
data w_proc_format;
set sashelp.zipcode;
levelled = Y;
format levelled level.;
run;
You can find more detail in the SAS documentation.

The easiest way to create the dummy category is to use the observation number as a suffix.
Solution:
/*Create Table with 5 Records*/
data input;
input Category $40.;
cards;
A
B
C
D
E
;;;;
run;
/*Create dummy categories using "_N_" record number as suffix */
data work.dummy;
set work.input;
dummy= catx("-","CAT",put(_N_,8.));
put _all_;
run;
Output:
Category=A dummy=CAT-1 _ERROR_=0 _N_=1
Category=B dummy=CAT-2 _ERROR_=0 _N_=2
Category=C dummy=CAT-3 _ERROR_=0 _N_=3
Category=D dummy=CAT-4 _ERROR_=0 _N_=4
Category=E dummy=CAT-5 _ERROR_=0 _N_=5

I needed to convert categorical variables into dummy variables in SAS and run a linear regression, but did not find all the answers in one place, so I will put the result of my search here.
Say we have a dataset (mydata) with dependent variable Y and categorical variables A1, A2, ..., An. Each independent variable has valid values X1, X2, ..., Xm, e.g.:
A1 | A2 | A3
---|----|---
x1 | y1 | z1
x2 | y1 | z2
x1 | y2 | z3
The output after dummy conversion would be:
A1x1 | A1x2 | A2y1 | A2y2| A3z1| A3z2 | A3z3
-----|------|------|-----|-----|------|-----
1 | 0 | 1 | 0 | 1 | 0 | 0
0 | 1 | 1 | 0 | 0 | 1 | 0
1 | 0 | 0 | 1 | 0 | 0 | 1
The code to accomplish the conversion to dummy is:
DATA mydata;
set mydata;
dummy=1;
RUN;
PROC logistic data=mydata outdesignonly outdesign=design;
CLASS A1 A2 A3/param=glm;
MODEL dummy=A1 A2 A3;
RUN;
DATA mydata_dummy;
merge mydata(drop=dummy) design(drop=dummy intercept);
RUN;
DATA mydata_dummy;
SET mydata_dummy;
DROP A1 A2 A3;
RUN;
A side effect of converting categorical variables into dummy variables is the inflation in variable names.
To avoid listing all of the new column names (e.g. for PROC REG): you cannot simply put everything on the right-hand side with something like MODEL Y=all, because Y itself would be included among the regressors. So instead of
MODEL Y=A1x1 A1x2 A2y1 A2y2 A3z1 A3z2 A3z3
do the following:
PROC CONTENTS data=mydata_dummy noprint out=_contents_;
RUN;
PROC sql noprint;
SELECT name into :names separated by ' '
from _contents_ where upcase(name) ^= 'Y';
QUIT;
PROC reg DATA=mydata_dummy;
MODEL Y=&names;
RUN;
Thanks.

Related

Making foreach go through all values of a variable + return the current value

I want to build an iteration using the Stata FAQ here. The second method seems fitting for my case. I have built the following code:
levelsof ID, local(levels)
foreach l of local levels {
var var1 var2 if ID == `l', lags(1/4) vsquish
vargranger
}
Idea: iterate over all IDs in ID, then run vargranger. However, it runs once and then reports no observations, which cannot be right, as I have 200 IDs in my ID variable.
The second thing I want to add to my loop is a way to return or print the current ID being used.
The output should look like this, for each value of ID:
ID = XYZ
Sample: 2001 - 2019 Number of obs = 16
...
vargranger
Granger causality Wald tests
+------------------------------------------------------------------+
| Equation Excluded | chi2 df Prob > chi2 |
|--------------------------------------+---------------------------|
| var1 var2 | 11.617 4 0.020 |
| var1 ALL | 11.617 4 0.020 |
|--------------------------------------+---------------------------|
| var2 var1 | 6.2796 4 0.179 |
| var2 ALL | 6.2796 4 0.179 |
+------------------------------------------------------------------+
Display of the current level is easy enough, say:
levelsof ID, local(levels)
foreach l of local levels {
di "{title:`l'}" _n
count if !missing(var1, var2) & ID == `l'
var var1 var2 if ID == `l', lags(1/4) vsquish
vargranger
}
The report "no observations" is presumably coming from var and is not a reflection of how many identifiers you have. You should add checks like the count above to see (e.g.) how many observations you have to play with.
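For example, a minimal sketch of such a guard; the threshold of 20 usable observations is purely illustrative, so adjust it to whatever your lag length and samples require:
levelsof ID, local(levels)
foreach l of local levels {
di "{title:`l'}" _n
* how many usable observations does this ID actually have?
count if !missing(var1, var2) & ID == `l'
if r(N) < 20 {
di as txt "skipping ID `l': only " r(N) " usable observations"
continue
}
var var1 var2 if ID == `l', lags(1/4) vsquish
vargranger
}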

Mask some fields when loading hdb in kdb

Suppose there are two instances, a1 and a2, both connecting to the same hdb. I would like a2 to connect to the hdb but apply some filter. For example, there is a table called elec.
I would like a2 to start by filtering out some of the values. If I write code and have a2 load it at startup, doesn't that load the information into memory? Is there any way I can load it like a normal hdb when starting the a2 instance?
Basically, the question is how to mask some fields in one table when loading the hdb?
It is possible to prevent columns from being returned by select statements by manipulating the table definition in your HDB instance. The example below has a single date-partitioned table. We update the definition to a flipped dictionary with only a subset of the columns defined. This is, however, reversible, and it does not update the meta of the table in your instance, which will still show all columns.
q)meta trade
c | t f a
----| -----
date| d
sym | s p
size| j
px | f
side| s
q)flip trade
`sym`size`px`side!`trade
q)`trade set flip `sym`size`px!`trade
q)select from trade where date=2017.05.27
date sym size px
------------------------------
2017.05.27 APPl 9968 92.79204
2017.05.27 APPl 9788 94.97189
2017.05.27 APPl 9660 27.62907
q)meta trade
c | t f a
----| -----
date| d
sym | s p
size| j
px | f
side| s

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 equals 1 only if Drug.1 is the drug I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that instance is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA1_B2.1 = DateA1 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA1_B2.3 = DateA1 - DateB2.3
DateDiffA1_B3.3 = DateA1 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually, there is no need to add the vector and the loop; all you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax will first put a zero in the count variable N2W. Then it will loop through all the dates, and only where the matching IV is 1 will it compare them to DateA, adding 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs is 1, then instead of compute N2W=0., start the syntax with:
If any(1, IV.1 to IV.3) N2W=0.

Loop across many datasets to get one summary table

I have about 100 datasets in Stata. I want to loop across all of them to get one summary table for the proportion of people across all datasets who are taking a drug aceinhib. I can write code which produces a table for each dataset, but what I want is a summary of all these tables in one table.
Here is an example using just 5 datasets:
forval i=1/5 {
capture use "FILEADDRESS\FILENAME`i'", clear
table aceinhib
capture save "FILEADDRESS\NEW_FILENAME`i'", replace
}
This gives me:
----------------------
aceinhib | Freq.
----------+-----------
0 | 1578935
1 | 138,961
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 5671774
1 | 421,732
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 2350391
1 | 198,875
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 884,660
1 | 51,087
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 1470388
1 | 130,614
----------------------
What I want is:
----------------------
aceinhib | Freq.
----------+-----------
0 | 11956148
1 | 941269
----------------------
-- namely, the combined results of the 5 tables above.
Consider this pattern:
scalar a = 0
scalar b = 0
quietly forval i = 1/1000 {
sysuse auto, clear
count if foreign
scalar a = scalar(a) + r(N)
count if !foreign
scalar b = scalar(b) + r(N)
}
gen double count = cond(_n == 1, scalar(a), cond(_n == 2, scalar(b), .))
gen which = cond(_n == 1, "Foreign", cond(_n == 2, "Domestic", ""))
list which count in 1/2
Just cumulate counts from one file to another. For the real problem, don't read in the same dataset repeatedly, but read the different files in a loop, as sketched below.
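A minimal sketch of that adaptation, using the file-naming pattern from the question and assuming aceinhib is coded 0/1 in every file:
scalar a = 0
scalar b = 0
quietly forval i = 1/5 {
use "FILEADDRESS\FILENAME`i'", clear
* accumulate counts of each aceinhib value across files
count if aceinhib == 0
scalar a = scalar(a) + r(N)
count if aceinhib == 1
scalar b = scalar(b) + r(N)
}
di as txt "aceinhib = 0: " as res scalar(a)
di as txt "aceinhib = 1: " as res scalar(b)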
Perhaps this will point you in a useful direction.
clear
tempfile working
save `working', emptyok
forval i=1/5 {
quietly use "FILEADDRESS\FILENAME`i'", clear
* replace "somevariable" with the name of a variable that is never missing
collapse (count) N=somevariable, by(aceinhib)
append using `working'
quietly save `working', replace
}
use `working', clear
collapse (sum) N, by(aceinhib)
list
If all files have the same structure, you could append them into one file before your table command. The following solutions also rely on aceinhib being coded as 0/1. If the files are not too large to append, it could be as simple as:
use "FILEADDRESS\FILENAME1", clear
forvalues i = 2/100 {
append using "FILEADDRESS\FILENAME`i'"
}
table aceinhib
If the resulting data file from append is too large, and there are no weights involved, you may continue as you have and employ the replace option for table:
forvalues i = 1/100 {
use "FILENAME`i'", clear
table aceinhib, replace
rename table1 freq
save "NEW_FILENAME`i'"
}
use "NEW_FILENAME1", clear
forvalues i = 2/100 {
append using "NEW_FILENAME`i'"
}
collapse (sum) freq, by(aceinhib)
list
Note that this approach will create data files containing the individual frequency tables. A third approach relies on storing the results of tab into a matrix for each iteration of the loop, and adding them to another matrix to store the cumulative freq of 0/1 values for aceinhib in each dataset:
mat aceinhib = (0\0)
forvalues i = 1/100 {
use "FILENAME`i'", clear
tab aceinhib, matcell(aceinhib`i')
mat aceinhib = aceinhib + aceinhib`i'
}
mat list aceinhib
This is how I would approach the problem, although there may be cleaner solutions leveraging user written packages or other base Stata functionality that I haven't included here.

SAS - Do a loop while a column value doesn't change

I'm trying, via SAS Enterprise Guide, to use a loop (via PROC LOOP) to create a new column with an ID that increments whenever the value of a specific column changes.
Just for example I'm looking for something like this:
Date | Name | Status | ID
------------------------------------------
20150101 | Tiago | Single | 1
20150102 | Tiago | Single | 1
20150103 | Tiago | Married | 2
20150104 | Tiago | Divorced | 3
20150105 | Tiago | Divorced | 3
20150106 | Tiago | Married | 4
In this case, the new column is ID, which increments whenever the status changes across the records. With this I can then group by name to see every change that occurred over time (even when values repeat).
This question seems a little confused. If the original data are already sorted as in the sample provided, a data step like this will do:
data new;
set test;
by status notsorted;
if first.status then id + 1;
run;
The notsorted option tells SAS that the data are grouped by status but not sorted, so the original order is kept. first.status is true for the first record of each run of status values. id + 1 is a sum statement; a variable used in a sum statement is retained across iterations and initialized to 0 rather than to missing.
And by the way, what is PROC LOOP?
To expand DaJun's answer to work with grouping by name:
/* Sort so that name forms groups and id will go in date order */
proc sort data = test;
by name date;
run;
data want;
set test;
/* Tell SAS we want to know when the value of name or status changes */
by name status notsorted;
/* Reset the ID for each group */
if first.name then id = 0;
/* iterate the ID as per DaJun */
if first.status then id + 1;
run;
