Loop across many datasets to get one summary table - loops

I have about 100 datasets in Stata. I want to loop across all of them to produce a single summary table of the proportion of people, pooled across all datasets, who are taking the drug aceinhib. I can write code that produces a table for each dataset, but what I want is a summary of all these tables in one table.
Here is an example using just 5 datasets:
forval i = 1/5 {
    capture use "FILEADDRESS\FILENAME`i'", clear
    table aceinhib
    capture save "FILEADDRESS\NEW_FILENAME`i'", replace
}
This gives me:
----------------------
 aceinhib |      Freq.
----------+-----------
        0 | 1,578,935
        1 |   138,961
----------------------
----------------------
 aceinhib |      Freq.
----------+-----------
        0 | 5,671,774
        1 |   421,732
----------------------
----------------------
 aceinhib |      Freq.
----------+-----------
        0 | 2,350,391
        1 |   198,875
----------------------
----------------------
 aceinhib |      Freq.
----------+-----------
        0 |   884,660
        1 |    51,087
----------------------
----------------------
 aceinhib |      Freq.
----------+-----------
        0 | 1,470,388
        1 |   130,614
----------------------
What I want is:
-----------------------
 aceinhib |       Freq.
----------+------------
        0 | 11,956,148
        1 |    941,269
-----------------------
-- namely, the combined results of the 5 tables above.

Consider this pattern:
* running totals, initialized once before the loop
scalar a = 0
scalar b = 0
quietly forval i = 1/1000 {
    sysuse auto, clear
    count if foreign
    scalar a = scalar(a) + r(N)    // cumulate the foreign count
    count if !foreign
    scalar b = scalar(b) + r(N)    // cumulate the domestic count
}
* park the totals in the first two observations for a quick report
gen double count = cond(_n == 1, scalar(a), cond(_n == 2, scalar(b), .))
gen which = cond(_n == 1, "Foreign", cond(_n == 2, "Domestic", ""))
list which count in 1/2
Just cumulate counts from one file to another. For the real problem, don't read the same dataset repeatedly; read a different file on each pass of the loop.
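For comparison, here is the same accumulate-as-you-go logic sketched in Python/pandas (a sketch only; the .dta file names are hypothetical placeholders for your real files):
import pandas as pd

# Running totals, one file read per pass; mirrors the Stata pattern above.
totals = {0: 0, 1: 0}
for i in range(1, 6):
    df = pd.read_stata(f"FILENAME{i}.dta")           # hypothetical file name
    counts = df["aceinhib"].value_counts()           # this file's 0/1 table
    for level in (0, 1):
        totals[level] += int(counts.get(level, 0))   # cumulate across files

print(pd.Series(totals, name="Freq."))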

Perhaps this will point you in a useful direction.
clear
tempfile working
save `working', emptyok
forval i = 1/5 {
    quietly use "FILEADDRESS\FILENAME`i'", clear
    * replace "somevariable" with the name of a variable that is never missing
    collapse (count) N=somevariable, by(aceinhib)
    append using `working'
    quietly save `working', replace
}
use `working', clear
collapse (sum) N, by(aceinhib)
list
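The same collect-then-collapse pattern can also be sketched in Python/pandas (hypothetical file names again; one small frequency table per file, stacked, then summed):
import pandas as pd

# One small frequency table per file, stacked, then summed per level.
pieces = []
for i in range(1, 6):
    df = pd.read_stata(f"FILENAME{i}.dta")            # hypothetical file name
    pieces.append(df.groupby("aceinhib").size().rename("N").reset_index())

combined = pd.concat(pieces).groupby("aceinhib", as_index=False)["N"].sum()
print(combined)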

If all files have the same structure, you could append them into one file before your table command. The following solutions also rely on aceinhib being coded as 0/1. If the files are not too large to append, it could be as simple as:
use "FILEADDRESS\FILENAME1", clear
forvalues i = 2/100 {
append using "FILEADDRESS\FILENAME`i'"
}
table aceinhib
If the resulting data file from append is too large, and there are no weights involved, you may continue as you have and employ the replace option for table:
forvalues i = 1/100 {
    use "FILENAME`i'", clear
    table aceinhib, replace
    rename table1 freq
    save "NEW_FILENAME`i'"
}
use "NEW_FILENAME1", clear
forvalues i = 2/100 {
    append using "NEW_FILENAME`i'"
}
collapse (sum) freq, by(aceinhib)
list
Note that this approach will create data files containing the individual frequency tables. A third approach relies on storing the results of tab in a matrix at each iteration of the loop and adding it to a running matrix that holds the cumulative frequencies of the 0/1 values of aceinhib across datasets:
* initialize the running 2x1 total before the loop
mat aceinhib = (0\0)
forvalues i = 1/100 {
    use "FILENAME`i'", clear
    * assumes both 0 and 1 occur in every file, so the matrices conform
    tab aceinhib, matcell(aceinhib`i')
    mat aceinhib = aceinhib + aceinhib`i'
}
mat list aceinhib
This is how I would approach the problem, although there may be cleaner solutions leveraging user-written packages or other base Stata functionality that I haven't included here.

Related

How to validate two data sets coming from an algorithm to check its effectiveness

I have two data sets:
1st (AKA "OLD") [smaller, just a sample]:
Origin | Alg.Result | Score
Star123 | Star123 | 100
Star234 | Star200 | 90
Star421 | Star420 | 98
Star578 | Star570 | 95
... | ... | ...
2nd (AKA "NEW") [bigger, uses all the real data]:
Origin | Alg.Result | Score
Star123 | Star120 | 90
Star234 | Star234 | 100
Star421 | Star423 | 98
Star578 | Star570 | 95
... | ... | ...
These DFs are the results of two different algorithms; call them "OLD" and "NEW".
The logic of both algorithms is as follows:
each takes a value from one table (represented in the Origin column) and tries to match it against values from a different table (the outcome is represented in the Alg.Result column). It also calculates a score for the match based on some internal logic (the Score column).
Additional important information:
the 1st DF (OLD) is a smaller sample
the 2nd DF (NEW) is a bigger sample
values in Origin are the same in both datasets, except that the OLD dataset has fewer of them than the NEW set
values in Alg.Result can:
be exactly the same as in Origin
be similar
be something else entirely
In the solution where these algorithms are used, a threshold is applied based on Score. For OLD it's Score > 90; for NEW it's the same.
What I want to achieve:
How accurate is the new algorithm? That is, validate how accurate the NEW approach is at matching values to the Origin values.
What are the discrepancies between the OLD and NEW sets:
which cases OLD has that NEW doesn't have
which cases NEW has that OLD doesn't have
What kind of comparison would you do to achieve those goals?
I thought about checking:
True positive => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score == 100
False positive => by taking the NEW dataset with the condition NEW.Origin != NEW.Alg.Result and NEW.Score == 100
False negative => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score != 100
I don't see the sense in counting true negatives if the algorithm always generates some match; I'm not sure what that would even look like.
What else would you suggest? What would you do to compare the OLD and NEW values? Do you have any ideas?
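One way to make those checks concrete is a small pandas sketch; the column names are adapted from the question, and the frames are toy stand-ins:
import pandas as pd

# Toy stand-ins for the two result sets (column names adapted).
old = pd.DataFrame({
    "Origin":    ["Star123", "Star234"],
    "AlgResult": ["Star123", "Star200"],
    "Score":     [100, 90],
})
new = pd.DataFrame({
    "Origin":    ["Star123", "Star234", "Star421", "Star578"],
    "AlgResult": ["Star120", "Star234", "Star423", "Star570"],
    "Score":     [90, 100, 98, 95],
})

# The proposed TP/FP/FN buckets on the NEW set.
exact = new["Origin"] == new["AlgResult"]
tp = new[exact & (new["Score"] == 100)]
fp = new[~exact & (new["Score"] == 100)]
fn = new[exact & (new["Score"] != 100)]
print(len(tp), len(fp), len(fn))

# Discrepancies: join on Origin, then compare the algorithms' answers.
both = old.merge(new, on="Origin", suffixes=("_old", "_new"))
disagree = both[both["AlgResult_old"] != both["AlgResult_new"]]
only_in_new = new[~new["Origin"].isin(old["Origin"])]
print(disagree)
print(only_in_new)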

Creating dummy variables in SAS from a categorical variable

I am looking to create dummy variables for a categorical variable in SAS. The categorical variable contains information on sites and takes on values such as Manila, Rabat, etc.; all in all there are about 50 different sites. What would be the most efficient way to create the dummies without creating each one separately using "if then"? Maybe using loops? How would that look?
Short answer: yes. Without further input I'm afraid there is little we can provide, but here are a few examples:
data with_categoric(keep=category:);
    set sashelp.zipcode;
    category1 = (TIMEZONE='Central' and length(COUNTYNM) <= 4);
    if 35 > Y then category2 = 'low';
    else if 35 < Y < 41 then category2 = 'medium';
    else category2 = 'high';
run;
An alternative way to derive category2 is via PROC FORMAT (note the -< operator, which keeps the ranges from overlapping):
proc format;
    value level
        low -< 35 = 'low'
        35 -< 41  = 'med'
        41 - high = 'high';
quit;
data w_proc_format;
    set sashelp.zipcode;
    levelled = Y;
    format levelled level.;
run;
You can find more in the documentation.
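As a side note, the equivalent binning in Python/pandas looks like this (an illustrative sketch only, mirroring the ranges above):
import pandas as pd

# Bin a numeric column into labelled levels, matching the PROC FORMAT
# ranges: [-inf, 35) -> low, [35, 41) -> med, [41, inf) -> high.
df = pd.DataFrame({"Y": [20, 35, 38, 41, 50]})
df["levelled"] = pd.cut(
    df["Y"],
    bins=[float("-inf"), 35, 41, float("inf")],
    labels=["low", "med", "high"],
    right=False,   # left-inclusive intervals
)
print(df)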
The easiest way to create the dummy category is to use the observation number as a suffix.
Solution:
/* Create a table with 5 records */
data input;
    input Category $40.;
    cards;
A
B
C
D
E
;;;;
run;

/* Create dummy categories using the _N_ record number as a suffix */
data work.dummy;
    set work.input;
    dummy = catx("-", "CAT", put(_N_, 8.));
    put _all_;
run;
Output:
Category=A dummy=CAT-1 _ERROR_=0 _N_=1
Category=B dummy=CAT-2 _ERROR_=0 _N_=2
Category=C dummy=CAT-3 _ERROR_=0 _N_=3
Category=D dummy=CAT-4 _ERROR_=0 _N_=4
Category=E dummy=CAT-5 _ERROR_=0 _N_=5
I needed to convert categorical variables into dummy variables in SAS and run a linear regression, but did not find all the answers in one place, so I will put the result of my search here.
Say we have a dataset (mydata) with a dependent variable Y and categorical variables A1, A2, ..., An. Each independent variable has valid values X1, X2, ..., Xm. e.g.:
A1 | A2 | A3
---|----|---
x1 | y1 | z1
x2 | y1 | z2
x1 | y2 | z3
The output after dummy conversion would be:
A1x1 | A1x2 | A2y1 | A2y2 | A3z1 | A3z2 | A3z3
-----|------|------|------|------|------|-----
  1  |  0   |  1   |  0   |  1   |  0   |  0
  0  |  1   |  1   |  0   |  0   |  1   |  0
  1  |  0   |  0   |  1   |  0   |  0   |  1
The code to accomplish the conversion to dummy is:
data mydata;
    set mydata;
    dummy = 1;
run;

proc logistic data=mydata outdesignonly outdesign=design;
    class A1 A2 A3 / param=glm;
    model dummy = A1 A2 A3;
run;

data mydata_dummy;
    merge mydata(drop=dummy) design(drop=dummy intercept);
run;

data mydata_dummy;
    set mydata_dummy;
    drop A1 A2 A3;
run;
The side effect of converting categorical variables into dummy variables is the inflation in variable names. To avoid listing all the new column names (e.g. for PROC REG): you cannot write MODEL Y=_all_ because Y would also be in _all_. So instead of:
MODEL Y=A1x1 A1x2 A2y1 A2y2 A3z1 A3z2 A3z3
Do the following:
proc contents data=mydata_dummy noprint out=_contents_;
run;

proc sql noprint;
    select name into :names separated by ' '
    from _contents_ where upcase(name) ^= 'Y';
quit;

proc reg data=mydata_dummy;
    model Y = &names;
run;
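Purely as an aside for readers using Python, the same expansion is a one-liner with pandas (an illustrative sketch, not part of the SAS workflow):
import pandas as pd

# One-hot encode A1..A3 into A1x1, A1x2, A2y1, ... columns.
mydata = pd.DataFrame({
    "A1": ["x1", "x2", "x1"],
    "A2": ["y1", "y1", "y2"],
    "A3": ["z1", "z2", "z3"],
})
dummies = pd.get_dummies(mydata, prefix=["A1", "A2", "A3"], prefix_sep="")
print(dummies)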

Use TQuery.Locate() function to find other than first matching

Locate moves the cursor to the first row matching a specified set of search criteria.
Let's say that q is a TQuery component connected to a database table with two columns, TAG and TAGTEXT. With the following code I get the letter a; I would like to use the Locate() function to get the letter d.
if q.Locate('TAG', '1', [loPartialKey]) then
begin
  tag60 := q.FieldByName('TAGTEXT').AsString;  // read the field value as a string
end;
For example, if I have a table like this:
+-----+---------+
| TAG | TAGTEXT |
+-----+---------+
|  1  |    a    |
|  2  |    b    |
|  3  |    c    |
|  1  |    d    |
|  4  |    e    |
|  1  |    f    |
+-----+---------+
Is it possible to locate the row where the number 1 occurs in the table for the second time?
EDIT
My job is to find a given occurrence of the TAG value 1 (which occurrence I need depends on a parameter I receive). I need to iterate through the table and collect the values from all the TAGTEXT fields until the value in the TAG field is 1 again. The number 1 here marks the start of a new segment, and everything between two 1s belongs to one segment. Segments do not all contain the same number of rows. Also, I am not allowed to make any changes to the table.
What I thought I could do is create a counter variable that is increased by one every time a TAG with the value 1 comes up. When the counter equals the parameter that represents the occurrence, I know I am in the right segment, and I can iterate through that segment and get the values I need.
But this might be a slow solution, and I wanted to know if there is a faster one.
You need to be a bit wary of using Locate for a purpose like this, because some TDataSet descendants' implementations of Locate (or of the underlying db-access layer) construct a temporary index on the dataset, which may be discarded immediately afterwards; so repeatedly calling Locate to iterate the rows of a given segment may be a lot less efficient than one might expect.
Also, TClientDataSet constructs, uses, and then discards an expression parser for each invocation of Locate (in its internal call to LocateRecord), which is a lot of overhead for repeated calls, especially when they are entirely avoidable.
In any case, the best way to do this is to ensure that your table records which segment a given row belongs to, adding a column like the SegmentID below if your table does not already have one:
TAG | TAGTEXT | SegmentID
----+---------+----------
 1  |    a    |    1
 2  |    b    |    1
 3  |    c    |    1
 1  |    d    |    2
 4  |    e    |    2
 1  |    f    |    3
(By the way, what happened to the two missing rows after the "1 | d" row in your example?)
Then, you could use code like this to iterate the rows of a segment:
procedure IterateSegment(Query: TSomeTypeOfQueryComponent; SegmentID: Integer);
var
  Sql: string;
begin
  Sql := Format('select * from mytable where SegmentID = %d order by Tag', [SegmentID]);
  if Query.Active then
    Query.Close;
  Query.Sql.Text := Sql;
  Query.Open;
  Query.DisableControls;  // avoid UI updates while scanning
  try
    while not Query.Eof do begin
      // process row here
      Query.Next;
    end;
  finally
    Query.EnableControls;
  end;
end;
Once you have the SegmentID column in the table, if you don't want to open a new query to iterate a block, you can set up a local index (by SegmentID, then Tag), assuming your dataset type supports it, set a filter on the dataset to restrict it to a given SegmentID, and then iterate over it.
You have many options for doing this.
If your component doesn't provide a LocateNext, you can write your own: compare the value and call Next until you find the next match (a sketch of this idea follows after this list).
You can also have the SQL return the rows in order (ORDER BY), use Locate for the first value, and then test whether the next row matches the comparison.
If you use a TClientDataSet you can filter via the component's Filter property, or set IndexFieldNames to order the values instead of using ORDER BY in the SQL as in the prior suggestion.
You can also filter in the SQL WHERE clause.
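To illustrate the hand-rolled counting idea in a language-neutral way, here is a small Python sketch over the example rows from the question (a sketch of the logic only, not Delphi code):
# Count occurrences of TAG = 1, then collect the TAGTEXT values of the
# segment that the requested occurrence opens.
rows = [(1, "a"), (2, "b"), (3, "c"), (1, "d"), (4, "e"), (1, "f")]

def segment(rows, occurrence):
    seen = 0
    out = []
    for tag, text in rows:
        if tag == 1:
            seen += 1
            if seen > occurrence:   # the next segment has started: stop
                break
        if seen == occurrence:
            out.append(text)
    return out

print(segment(rows, 2))   # ['d', 'e'] -- the segment starting at 'd'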

Excel Lookup IP addresses in multiple ranges

I am trying to find a formula for column A that will check an IP address in column B and find whether it falls into a range (i.e. between two addresses) given in the other two columns, C and D.
E.g.
+--------+------------+------------+------------+
|   A    |     B      |     C      |     D      |
| valid? |  address   |   start    |    end     |
+--------+------------+------------+------------+
| yes    | 10.1.1.5   | 10.1.1.0   | 10.1.1.31  |
| yes    | 10.1.3.13  | 10.1.2.16  | 10.1.2.31  |
| no     | 10.1.2.7   | 10.1.1.128 | 10.1.1.223 |
| no     | 10.1.1.62  | 10.1.3.0   | 10.1.3.127 |
| yes    | 10.1.1.9   | 10.1.4.0   | 10.1.4.255 |
| no     | 10.1.1.50  | …          | …          |
| yes    | 10.1.1.200 |            |            |
+--------+------------+------------+------------+
This is meant to represent an Excel table with 4 columns, a header row, and 7 data rows as an example.
I can do a lateral check with
=IF(AND(B3>C3, B3<D3), "yes", "no")
which only checks one address against the range next to it.
I need something that will check one IP address against all of the ranges, i.e. rows 1 to 100.
This is for checking access-list rules against routes to see if I can eliminate redundant rules... but it has other uses if I can get it going.
To make it extra special, I cannot use VBA macros to get it done.
I'm thinking some kind of INDEX/MATCH to look it up in an array, but I'm not sure how to apply it. I don't know if it can even be done. Good luck.
OK, so I've been tracking this problem since my initial comment, but have not taken the time to answer because, just like Lana B:
I like a good puzzle, but it's not a good use of time if I have to keep guessing
+1 to Lana for her patience and effort on this question.
However, IP addressing is something I deal with regularly, so I decided to tackle this one for my own benefit. Also, no offense, but getting the MIN of the start and the MAX of the end is wrong: it will not account for gaps in the IP white-list. As I mentioned, this required 15 helper columns, and my result is simply 1 or 0, corresponding to In or Out respectively. Here is a screenshot (with the formulas shown below each column):
The formulas in F2:J2 are:
=NUMBERVALUE(MID(B2,1,FIND(".",B2)-1))
=NUMBERVALUE(MID(B2,FIND(".",B2)+1,FIND(".",B2,FIND(".",B2)+1)-1-FIND(".",B2)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2)+1)+1,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)-1-FIND(".",B2,FIND(".",B2)+1)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)+1,LEN(B2)))
=F2*256^3+G2*256^2+H2*256+I2
Yes, I used formulas instead of "Text to Columns" to automate the process of adding more information to a "living" worksheet.
The formulas in L2:P2 are the same, but replace B2 with C2.
The formulas in R2:V2 are also the same, but replace B2 with D2.
The formula for X2 is
=SUMPRODUCT(--($P$2:$P$8<=J2)*--($V$2:$V$8>=J2))
I also copied your original "valid" set in column A, which you'll see matches my result.
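If it helps to see the underlying arithmetic outside Excel, here is a small Python sketch of the same dotted-quad-to-integer range test (illustrative only; the ipaddress module does the octet math that the helper columns do above):
import ipaddress

# Convert dotted quads to integers and test membership in any range.
ranges = [
    ("10.1.1.0",   "10.1.1.31"),
    ("10.1.2.16",  "10.1.2.31"),
    ("10.1.1.128", "10.1.1.223"),
    ("10.1.3.0",   "10.1.3.127"),
    ("10.1.4.0",   "10.1.4.255"),
]

def in_any_range(addr):
    n = int(ipaddress.IPv4Address(addr))
    return any(int(ipaddress.IPv4Address(lo)) <= n <= int(ipaddress.IPv4Address(hi))
               for lo, hi in ranges)

for ip in ["10.1.1.5", "10.1.3.13", "10.1.2.7", "10.1.1.62"]:
    print(ip, in_any_range(ip))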
You will need helper columns.
Organise your data as outlined in the picture.
Split address, start, and end into separate octet columns (ribbon menu Data => Text to Columns, with "." as the delimiter).
Above the start/end parts, calculate MIN FOR START and MAX FOR END over all the split parts (e.g. MIN(K5:K1000)).
FORMULAS:
VALIDITY formula - copy into cell D5, and drag down:
=IF(AND(B6>$I$1,B6<$O$1),"In",
IF(OR(B6<$I$1,B6>$O$1),"Out",
IF(B6=$I$1,
IF(C6<$J$1, "Out",
IF( C6>$J$1, "In",
IF( D6<$K$1, "Out",
IF( D6>$K$1, "In",
IF(E6>=$L$1, "In", "Out"))))),
IF(B6=$O$1,
IF(C6>$P$1, "Out",
IF( C6<$P$1, "In",
IF( D6>$Q$1, "Out",
IF( D6<$Q$1, "In",
IF(E6<=$R$1, "In", "Out") )))) )
)))

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 equals 1 only if Drug.1 is a receipt that I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and likewise for the other IV variables. Otherwise, IV = 0 if the drug in that slot is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. in the example above I want to calculate the new variables:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA2_B2.1 = DateA2 - DateB2.1
DateDiffA2_B2.3 = DateA2 - DateB2.3
DateDiffA3_B3.3 = DateA3 - DateB3.3
(i.e. one difference for each receipt of Drug "X": cases 1 and 2 received it in slots 1 and 3, and case 3 in slot 3.)
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually there is no need to add the vector and the loop; all you need can be done within one DO REPEAT:
compute N2W = 0.
do repeat DateB = DateB.1 to DateB.3 / IV = IV.1 to IV.3.
    if IV = 1 and datediff(DateA, DateB, "days") <= 14 N2W = N2W + 1.
end repeat.
execute.
This syntax first puts a zero in the count variable N2W. It then loops through all the dates and, only where the matching IV is 1, compares each to DateA, adding 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs is 1, then instead of compute N2W=0. start the syntax with:
if any(1, IV.1 to IV.3) N2W=0.
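For comparison, the same per-case count can be sketched in Python/pandas (toy data; the column names follow the question):
import pandas as pd

# Count, per case, how many Drug "X" receipts (IV = 1) fall within
# 14 days of DateA. The wide layout mirrors the SPSS dataset.
df = pd.DataFrame({
    "DateA":   pd.to_datetime(["2020-01-10", "2020-02-01"]),
    "DateB.1": pd.to_datetime(["2020-01-01", "2020-01-05"]),
    "DateB.3": pd.to_datetime(["2020-01-09", "2020-01-31"]),
    "IV.1":    [1, 1],
    "IV.3":    [1, 1],
})

df["N2W"] = 0
for k in ("1", "3"):   # the slots present in this toy frame
    diff = (df["DateA"] - df["DateB." + k]).dt.days
    df["N2W"] += ((df["IV." + k] == 1) & (diff <= 14)).astype(int)
print(df["N2W"])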
