Export summary table with two group-by variables to LaTeX

I am trying to export a two-way summary table to LaTeX using the community-contributed command estout. The table summarizes the mean of the numerical variable weight across two categorical variables, foreign and pricehigh:
sysuse auto, clear
gen pricehigh = 0
replace pricehigh = 1 if price > 6165
tabulate foreign pricehigh, summarize(weight) means label
Means of Weight (lbs.)
| pricehigh
Car type | 0 1 | Total
-----------+----------------------+----------
Domestic | 3,080.513 4,026.923 | 3,317.115
Foreign | 2,118.462 2,601.111 | 2,315.909
-----------+----------------------+----------
Total | 2,840 3,443.636 | 3,019.459
However, Stata tells me that the summarize() option for tabulate is not allowed when using tabulate and estpost:
estpost tabulate foreign pricehigh, summarize(weight) means label
option summarize() not allowed
r(198);
I have been searching the estout documentation (particularly here) and Statalist, but cannot find how to re-create this table using estout.

The community-contributed command tabout can easily produce the desired output as follows:
sysuse auto, clear
generate pricehigh = 0
replace pricehigh = 1 if price > 6165
tabout foreign pricehigh using table.tex, style(tex) content(mean weight) sum replace
type table.tex
\begin{center}
\footnotesize
\newcolumntype{Y}{>{\raggedleft\arraybackslash}X}
\begin{tabularx} {14} {#{} l Y Y Y #{}}
\toprule
& \multicolumn{3}{c}{pricehigh} \\
\cmidrule(l{1em}){2-4}
& 0 & 1 & Total \\
\cmidrule(l{1em}){2-4}
& Mean weight & Mean weight & Mean weight \\
\midrule
Car type \\
Domestic & 3,080.5 & 4,026.9 & 3,317.1 \\
Foreign & 2,118.5 & 2,601.1 & 2,315.9 \\
Total & 2,840.0 & 3,443.6 & 3,019.5 \\
\bottomrule
\end{tabularx}
\normalsize
\end{center}
In contrast, doing the same with estout requires you to create the table yourself:
sysuse auto, clear
generate pricehigh = 0
replace pricehigh = 1 if price > 6165
matrix A = J(3, 3, 0)
summarize weight if !foreign & !pricehigh, meanonly
matrix A[1,1] = r(mean)
summarize weight if !foreign & pricehigh, meanonly
matrix A[1,2] = r(mean)
summarize weight if !foreign, meanonly
matrix A[1,3] = r(mean)
summarize weight if foreign & !pricehigh, meanonly
matrix A[2,1] = r(mean)
summarize weight if foreign & pricehigh, meanonly
matrix A[2,2] = r(mean)
summarize weight if foreign, meanonly
matrix A[2,3] = r(mean)
summarize weight if !pricehigh, meanonly
matrix A[3,1] = r(mean)
summarize weight if pricehigh, meanonly
matrix A[3,2] = r(mean)
summarize weight, meanonly
matrix A[3,3] = r(mean)
matrix colnames A = 0 1 Total
matrix rownames A = Domestic Foreign Total
Result:
esttab matrix(A), title(Means of Weight (lbs.)) mtitles(pricehigh) gaps
Means of Weight (lbs.)
---------------------------------------------------
pricehigh
0 1 Total
---------------------------------------------------
Domestic 3080.513 4026.923 3317.115
Foreign 2118.462 2601.111 2315.909
Total 2840 3443.636 3019.459
---------------------------------------------------
The command esttab (a wrapper for estout) is used here for illustration.
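The logic behind the hand-built matrix (cell means plus row, column and grand-total margins, then one LaTeX row per group) can also be sketched outside Stata. The following Python sketch is only an illustration: the records are hypothetical toy values, not the auto data, and the function names two_way_means and to_latex are made up for this example.

```python
from statistics import mean

# Hypothetical toy records: (row group, column group, value). Not the auto data.
records = [
    ("Domestic", 0, 3000.0),
    ("Domestic", 0, 3200.0),
    ("Domestic", 1, 4000.0),
    ("Foreign", 0, 2100.0),
    ("Foreign", 1, 2500.0),
    ("Foreign", 1, 2700.0),
]

def two_way_means(rows):
    """Return {(row_label, col_label): mean}, including 'Total' margins."""
    cells = {}
    for r, c, v in rows:
        # Each value contributes to its cell, both margins and the grand total.
        for key in ((r, c), (r, "Total"), ("Total", c), ("Total", "Total")):
            cells.setdefault(key, []).append(v)
    return {key: mean(vals) for key, vals in cells.items()}

def to_latex(means, row_labels, col_labels):
    """Render the means as a plain LaTeX tabular, one row per row label."""
    lines = [
        "\\begin{tabular}{l" + "r" * len(col_labels) + "}",
        " & " + " & ".join(str(c) for c in col_labels) + " \\\\",
        "\\hline",
    ]
    for r in row_labels:
        vals = " & ".join(f"{means[(r, c)]:,.1f}" for c in col_labels)
        lines.append(f"{r} & {vals} \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

means = two_way_means(records)
latex = to_latex(means, ["Domestic", "Foreign", "Total"], [0, 1, "Total"])
print(latex)
```

Collecting every value under four keys at once replaces the nine separate summarize calls with a single pass over the data.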

Related

Creating dummy variables in SAS from a categorical variable

I am looking to create dummy variables for a categorical variable in SAS. The categorical variable holds site information and takes on values such as Manila, Rabat, etc.; in all there are about 50 different sites. What would be the most efficient way to create the dummies without creating each one separately using "if then"? Maybe using loops? What would that look like?
Short answer: yes. Without further input, I'm afraid there is little more we can provide. Here are a few examples:
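data with_categoric(keep=category:);
set sashelp.zipcode;
/* Declare the length first; otherwise 'low' fixes it at 3 and longer values are truncated */
length category2 $6;
category1 = (TIMEZONE='Central' and length(COUNTYNM) <=4);
if Y<35 then category2='low';
else if 35<=Y<41 then category2='medium';
else category2='high';
run;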
An alternative way to do the Category2 is via proc format:
proc format;
value level
low-35 = 'low'
35-41 = 'med'
41-high ='high';
run;
data W_proc_format;
set sashelp.zipcode;
levelled = Y;
format levelled level.;
run;
You can find more in the documentation.
The easiest way to create the dummy category is to use the observation number as a suffix.
Solution:
/*Create Table with 5 Records*/
data input;
input Category $40.;
cards;
A
B
C
D
E
;;;;
run;
/*Create dummy categories using "_N_" record number as suffix */
data work.dummy;
set work.input;
dummy= catx("-","CAT",put(_N_,8.));
put _all_;
run;
Output:
Category=A dummy=CAT-1 _ERROR_=0 _N_=1
Category=B dummy=CAT-2 _ERROR_=0 _N_=2
Category=C dummy=CAT-3 _ERROR_=0 _N_=3
Category=D dummy=CAT-4 _ERROR_=0 _N_=4
Category=E dummy=CAT-5 _ERROR_=0 _N_=5
I needed to convert categorical into dummy variable in SAS and run linear regression, but did not find one place with all answers, so I will put here the result of my search.
Say we have a dataset (mydata) with dependent variable Y and categorical variables A1, A2, ..., An. Each independent variable takes a set of valid values X1, X2, ..., Xm, e.g.:
A1 | A2 | A3
---|----|---
x1 | y1 | z1
x2 | y1 | z2
x1 | y2 | z3
The output after dummy conversion would be:
A1x1 | A1x2 | A2y1 | A2y2| A3z1| A3z2 | A3z3
-----|------|------|-----|-----|------|-----
1 | 0 | 1 | 0 | 1 | 0 | 0
0 | 1 | 1 | 0 | 0 | 1 | 0
1 | 0 | 0 | 1 | 0 | 0 | 1
The code to accomplish the conversion to dummy is:
DATA mydata;
set mydata;
dummy=1;
RUN;
PROC logistic data=mydata outdesignonly outdesign=design;
CLASS A1 A2 A3/param=glm;
MODEL dummy=A1 A2 A3;
RUN;
DATA mydata_dummy;
merge mydata(drop=dummy) design(drop=dummy intercept);
RUN;
DATA mydata_dummy;
SET mydata_dummy;
DROP A1 A2 A3;
RUN;
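For readers who want to check what the design matrix should contain, here is a small Python sketch of the same full-rank-plus-one (one column per level) encoding applied to the three example rows above. It is an illustration only; make_dummies is a hypothetical helper, and the naming rule (variable name followed by level) is taken from the A1x1-style headers in the example.

```python
def make_dummies(rows, columns):
    """One column per level of each variable, named <variable><level>."""
    # Levels are sorted so the column order is reproducible.
    levels = {col: sorted({row[col] for row in rows}) for col in columns}
    names = [f"{col}{lvl}" for col in columns for lvl in levels[col]]
    encoded = [
        {f"{col}{lvl}": int(row[col] == lvl)
         for col in columns for lvl in levels[col]}
        for row in rows
    ]
    return names, encoded

# The three example rows from the question.
rows = [
    {"A1": "x1", "A2": "y1", "A3": "z1"},
    {"A1": "x2", "A2": "y1", "A3": "z2"},
    {"A1": "x1", "A2": "y2", "A3": "z3"},
]

names, dummies = make_dummies(rows, ["A1", "A2", "A3"])
print(" | ".join(names))
for d in dummies:
    print(" | ".join(str(d[n]) for n in names))
```

Each row ends up with exactly one 1 per original variable, which is a quick sanity check for any dummy-coded table.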
A side effect of converting categorical variables into dummy variables is the inflation in the number of variable names.
To avoid listing all the new column names (e.g. for REG), note that you cannot use MODEL Y=_all_, because Y itself is also in _all_.
Instead of:
MODEL Y=A1x1 A1x2 A2y1 A2y2 A3z1 A3z2 A3z3
Do the following:
PROC CONTENTS data=mydata_dummy noprint out=_contents_;
RUN;
PROC sql noprint;
SELECT name into :names separated by ' '
from _contents_ where upcase(name) ^='Y';
QUIT;
PROC reg DATA=mydata_dummy;
MODEL Y=&names;
RUN;
....Thanks

Pivot Table with merged date fields

I have a source data sheet, each data item having two date fields, startDate and endDate. What I would like to do in Excel is generate a pivot table with row headers for each date from either of these columns, and two summary columns, one for Count Started, the other for Count Ended.
For example, the following source data:
ItemId | startDate | endDate
1 | 6/1/16 | 6/2/16
2 | 6/2/16 | 6/3/16
3 | 6/1/16 | 6/3/16
Would produce a pivot table like this:
Date | Started | Ended
6/1/16 | 2 | 0
6/2/16 | 1 | 1
6/3/16 | 0 | 2
I doubt I would choose a PivotTable solution for this (that's unlike me!), but I think it is possible with a PT:
1) Create a PT from multiple consolidation ranges (example here) with ranges A:B and A:C (assuming ItemID is in A1).
2) After 7. select Columns A:C (in the new sheet) and apply Remove Duplicates (with all Columns checked).
3) Create a new PT from what remains (Column for COLUMNS, Value for ROWS, Count of Row for VALUES)
4) Right-click on startDate, Move, and click on first option.
5) In PivotTable Options..., Totals & Filters uncheck both Grand Totals and in Layout & Format, Format, check For empty cells show and enter 0.
6) Adjust labels to suit.
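The target table is simply two counts per date, one over startDate and one over endDate, taken over the union of both date columns. As a cross-check on the PivotTable result, that logic can be sketched in Python using the three items from the question (the variable names here are illustrative, not part of any Excel feature):

```python
from collections import Counter
from datetime import datetime

# (ItemId, startDate, endDate) rows from the question.
items = [
    (1, "6/1/16", "6/2/16"),
    (2, "6/2/16", "6/3/16"),
    (3, "6/1/16", "6/3/16"),
]

started = Counter(start for _, start, _ in items)
ended = Counter(end for _, _, end in items)

# Union of both date columns, ordered as real dates rather than as text.
dates = sorted(set(started) | set(ended),
               key=lambda d: datetime.strptime(d, "%m/%d/%y"))

# A Counter returns 0 for dates it has not seen, which fills the empty cells.
table = [(d, started[d], ended[d]) for d in dates]
for row in table:
    print(*row, sep=" | ")
```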

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 only = 1 if Drug.1 is the case that I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that scenario is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA2_B2.1 = DateA2 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA2_B2.3 = DateA2 - DateB2.3
DateDiffA3_B3.3 = DateA3 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually, there is no need for the vector and the loop; everything can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax first puts a zero in the count variable N2W. It then loops through all the dates and, only where the matching IV is 1, compares each one to DateA and adds 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs is 1, start the syntax with the following instead of compute N2W=0.:
If any(1, IV.1 to IV.3) N2W=0.
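The per-case logic of the DO REPEAT can be restated in Python as a rough cross-check. This is a sketch under assumptions: count_within_two_weeks is a hypothetical name, the dates are invented, and None stands in for an SPSS missing value in the variant where no IV equals 1.

```python
from datetime import date

def count_within_two_weeks(date_a, dates_b, indicators, max_days=14):
    """Count linked instances with IV == 1 and DateA - DateB <= max_days.

    Returns None (standing in for "missing") when no indicator is 1.
    """
    if not any(iv == 1 for iv in indicators):
        return None
    count = 0
    # zip pairs each DateB with its linked indicator, like the DO REPEAT lists.
    for date_b, iv in zip(dates_b, indicators):
        # As in the SPSS condition, a DateB later than DateA gives a
        # negative difference, which also satisfies <= max_days.
        if iv == 1 and (date_a - date_b).days <= max_days:
            count += 1
    return count

# Invented example: differences of 10, 1 and 50 days; indicators 1, 0, 1.
n2w = count_within_two_weeks(
    date(2016, 6, 20),
    [date(2016, 6, 10), date(2016, 6, 19), date(2016, 5, 1)],
    [1, 0, 1],
)
print(n2w)
```

Only the first instance counts here: the second is within two weeks but its indicator is 0, and the third has indicator 1 but is 50 days apart.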

Loop across many datasets to get one summary table

I have about 100 datasets in Stata. I want to loop across all of them to get one summary table for the proportion of people across all datasets who are taking a drug aceinhib. I can write code which produces a table for each dataset, but what I want is a summary of all these tables in one table.
Here is an example using just 5 datasets:
forval i=1/5 {
capture use "FILEADDRESS\FILENAME`i'", clear
table aceinhib
capture save "FILEADDRESS\NEW_FILENAME`i'", replace
}
This gives me:
----------------------
aceinhib | Freq.
----------+-----------
0 | 1578935
1 | 138,961
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 5671774
1 | 421,732
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 2350391
1 | 198,875
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 884,660
1 | 51,087
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 1470388
1 | 130,614
----------------------
What I want is:
----------------------
aceinhib | Freq.
----------+-----------
0 | 11956148
1 | 941269
----------------------
-- namely, the combined results of the 5 tables above.
Consider this pattern:
scalar a = 0
scalar b = 0
quietly forval i = 1/1000 {
sysuse auto, clear
count if foreign
scalar a = scalar(a) + r(N)
count if !foreign
scalar b = scalar(b) + r(N)
}
gen double count = cond(_n == 1, scalar(a), cond(_n == 2, scalar(b), .))
gen which = cond(_n == 1, "Foreign", cond(_n == 2, "Domestic", ""))
list which count in 1/2
Just cumulate counts from one file to another. For the real problem, don't read in the same dataset, repeatedly, but different files in a loop.
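The same accumulate-as-you-go idea, sketched in Python with the five frequency tables from the question standing in for the per-file counts (in practice each table would be computed from one file inside the loop; the name combine is illustrative):

```python
from collections import Counter

# Frequency tables copied from the five outputs shown in the question.
per_file_tables = [
    {0: 1578935, 1: 138961},
    {0: 5671774, 1: 421732},
    {0: 2350391, 1: 198875},
    {0: 884660, 1: 51087},
    {0: 1470388, 1: 130614},
]

def combine(tables):
    """Add up per-file counts for each level of the variable."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds counts instead of replacing
    return total

combined = combine(per_file_tables)
for level in sorted(combined):
    print(level, combined[level])
```

The totals agree with the combined table the question asks for, so only two running counts need to be kept however many files are read.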
Perhaps this will point you in a useful direction.
clear
tempfile working
save `working', emptyok
forval i=1/5{
quietly use "FILEADDRESS\FILENAME`i'", clear
* replace "somevariable" with the name of a variable that is never missing
collapse (count) N=somevariable, by(aceinhib)
append using `working'
quietly save `working', replace
}
use `working', clear
collapse (sum) N, by(aceinhib)
list
If all files have the same structure, you could append them into one file before your table command. The following solutions also rely on aceinhib being coded as 0/1. If the files are not too large to append, it could be as simple as:
use "FILEADDRESS\FILENAME1", clear
forvalues i = 2/100 {
append using "FILEADDRESS\FILENAME`i'"
}
table aceinhib
If the resulting data file from append is too large, and there are no weights involved, you may continue as you have and employ the replace option for table:
forvalues i = 1/100 {
use "FILENAME`i'", clear
table aceinhib, replace
rename table1 freq
save "NEW_FILENAME`i'"
}
use "NEW_FILENAME1", clear
forvalues i = 2/100 {
append using "NEW_FILENAME`i'"
}
collapse (sum) freq, by(aceinhib)
list
Note that this approach will create data files containing the individual frequency tables. A third approach relies on storing the results of tab into a matrix for each iteration of the loop, and adding them to another matrix to store the cumulative freq of 0/1 values for aceinhib in each dataset:
* initialize the 2 x 1 accumulator that the loop adds to
mat aceinhib = (0\0)
forvalues i = 1/100 {
use "FILENAME`i'", clear
tab aceinhib, matcell(aceinhib`i')
mat aceinhib = aceinhib + aceinhib`i'
}
mat list aceinhib
This is how I would approach the problem, although there may be cleaner solutions leveraging user written packages or other base Stata functionality that I haven't included here.

Sum array of data within date range and other = text

I have a dataset with two tabs, one with monthly goals (targets) and another with sales and order data. I'm trying to summarize sales data from the other tab into the target tab using several parameters with INDEX/MATCH and SUMIFS:
My Attempt:
=SUMIFS(INDEX(OrderBreakdown!$A$2:$T$8048,,MATCH(C2,OrderBreakdown!$G$2:$G$8048)),OrderBreakdown!$I$2:$I$8048,">="&A2,OrderBreakdown!$I$2:$I$8048,"<="&B2)
OrderBreakdown is the other sheet. Column D in the OrderBreakdown sheet is what I want to sum if OrderBreakdown Category (Col G) = Col C, OrderBreakdown Order Date (Col I) >= Start Date (Col A), and OrderBreakdown Order Date (Col I) <= End Date (Col B).
My answer should be much more in line with Col D, but instead I'm getting values in the millions ($MM).
Here's a sample of the dataset I'm pulling from:
OK, I am not sure why your range to sum runs from A through T; that is probably where you went wrong. Also, I did not find the INDEX method necessary. This should work for you:
=SUMIFS(OrderBreakdown!$D$2:$D$8048,OrderBreakdown!$I$2:$I$8048, ">=" & A2,OrderBreakdown!$I$2:$I$8048, "<=" & B2, OrderBreakdown!$G$2:$G$8048, C2)
Here is my sample data, starting on row 2 of the first sheet:
1/1/2011 1/30/2011 Office Supplies
Then the OrderBreakdown tab starts on column C:
Discount Sales Profit Quantity Category sub-category OrderDate
0.5 $45.00 ($26.00) 3 Office Supplies Paper 1/1/11 Eugene Mo Stockholm Sweden North Home Offic 1/5/11 Second Cla: Stockholm 2011-(11 0.1-2011 2011 1/1/2011
0 $854.00 $290.00 7 Furniture BookCases 1/2/2011
0 $854.00 $290.00 7 Furniture BookCases 12/32/2010
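To make the three conditions from the question explicit (category equal to the target category, order date on or after the start date, order date on or before the end date), here is a rough Python equivalent. It is a sketch only: sum_ifs is a hypothetical helper, and the rows are illustrative values loosely based on the sample above, with the last date chosen as an assumption so that it is a valid date outside the range.

```python
from datetime import date

def sum_ifs(rows, start, end, category):
    """Sum sales for rows matching the category within [start, end]."""
    # Note: == is case-sensitive here, whereas SUMIFS text criteria are not.
    return sum(
        sales
        for sales, cat, order_date in rows
        if cat == category and start <= order_date <= end
    )

# Illustrative (sales, category, order date) rows.
orders = [
    (45.00, "Office Supplies", date(2011, 1, 1)),
    (854.00, "Furniture", date(2011, 1, 2)),
    (854.00, "Furniture", date(2010, 12, 31)),
]

total = sum_ifs(orders, date(2011, 1, 1), date(2011, 1, 30), "Office Supplies")
print(total)
```

Only one column is summed and the category test is an equality, which is the part the original INDEX-based attempt got wrong by passing a multi-column range.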

Resources