Create tables of means and export them to LaTeX

I am using the following code to create tables in Stata:
sysuse auto, clear
table rep78, contents(mean mpg mean weight)
--------------------------------------
Repair |
Record |
1978 | mean(mpg) mean(weight)
----------+---------------------------
1 | 21 3,100
2 | 19.125 3,353.8
3 | 19.4333 3,299
4 | 21.6667 2,870
5 | 27.3636 2,322.7
--------------------------------------
How can I directly export such tables in LaTeX markup?

The community-contributed command tabout provides an out-of-the-box solution:
. tabout rep78 using table.tex, style(tex) content(mean mpg mean weight) sum replace
Table output written to: table.tex
\begin{center}
\footnotesize
\newcolumntype{Y}{>{\raggedleft\arraybackslash}X}
\begin{tabularx} {14} {@{} l Y Y @{}}
\toprule
& mpg & weight \\
\midrule
Repair Record 1978 \\
1 & 21.0 & 3,100.0 \\
2 & 19.1 & 3,353.8 \\
3 & 19.4 & 3,299.0 \\
4 & 21.7 & 2,870.0 \\
5 & 27.4 & 2,322.7 \\
Total & 21.3 & 3,032.0 \\
\bottomrule
\end{tabularx}
\normalsize
\end{center}

Below is a very simple working example of using the community-contributed command texdoc:
/* Create sample data */
clear *
set obs 10
gen date = _n
expand 10
set seed 123
gen i = runiform()
cd "/path/to/my/output"
/* Initialize and create LaTeX document */
texdoc init TexTest, replace
tex \documentclass{article}
tex \usepackage{stata}
tex \begin{document}
tex \section{Table 1}
texdoc stlog TexLog
table date, contents(mean i)
texdoc stlog close
tex \end{document}
The above code snippet creates TexTest.tex at /path/to/my/output/.
I use texdoc stlog before the table command to capture the output of table in a log file. If you open the resulting .tex file, you will notice a line containing \input{TexLog.log.tex}; the compiler inserts the contents of the log file at that location in the LaTeX document.
You can compile the tex file using your preferred method. I use pdflatex in a Linux environment but had issues after initially installing it and had to resolve some dependencies for stata.sty.
After compiling, the resulting PDF file will contain the captured table.


How to use loops with texdoc

Say that I run these two regressions and save the output to be used with texdoc to create a LaTeX file:
capture ssc install texdoc
sysuse auto2, clear
global spec1 "if foreign"
global spec2 "if !foreign"
foreach spec in 1 2 {
    reg price mpg ${spec`spec'}
    global b_mpg_`spec': di %6.2fc _b[mpg]
    global se_mpg_`spec': di %6.2fc _se[mpg]
    qui test mpg = 0
    global mpg_p_`spec': di %12.2fc r(p)
    global mpg_star_`spec' = cond(${mpg_p_`spec'}<.01,"***",cond(${mpg_p_`spec'}<.05,"**",cond(${mpg_p_`spec'}<.1,"*","")))
    local N = e(N)
    global N_`spec': di %12.0fc `N'
    scalar r2 = e(r2)
    global r2_`spec': di %6.3fc r2
    sum price if e(sample)
    global ymean_`spec': di %12.2fc r(mean)
}
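The nested cond() above encodes the usual significance-star thresholds. For readers less familiar with Stata's cond(), here is the same mapping sketched in Python (a language-agnostic illustration, not part of the Stata workflow):

```python
def stars(p):
    """Map a p-value to significance stars, mirroring the nested cond() above."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

print(stars(0.003), stars(0.03), stars(0.07), stars(0.5))
```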
I can then use texdoc to create the .tex file as follows:
texdoc init "test.tex", replace force
tex MPG & ${b_mpg_1}${mpg_star_1} & ${b_mpg_2}${mpg_star_2} \\
tex & (${se_mpg_1}) & (${se_mpg_2}) \\ \addlinespace
tex Y Mean & ${ymean_1} & ${ymean_2} \\
tex Observations & ${N_1} & ${N_2} \\
tex R-Squared & ${r2_1} & ${r2_2} \\
tex Sample & Foreign & Domestic
texdoc close
How can I instead use loops (e.g., foreach or forvalues) in this last chunk of code so as to not have to write out each variable column more than once? In this example, there are only two columns, but in other examples, I have up to 9 columns, and so it quickly gets unwieldy to have to put in each column.
It would be trivial to add rows using a loop. However, I am not sure how to add different columns - I don't know how to add a loop within a line that begins with tex. And if I do something like:
tex MPG
foreach i in 1 2 {
tex & ${b_mpg_`i'}${mpg_star_`i'}
}
tex \\
the tex code goes onto multiple lines and so is unusable (and is also very inelegant).
Use a loop to build up a string of the desired output and then use texdoc to write the string to the file:
// Initialize each row's string with its label
local tex_MPG "MPG"
local tex_se_MPG ""
local tex_Y_Mean "Y Mean"
local tex_Observations "Observations"
local tex_R_Squared "R-Squared"
local tex_Sample "Sample"
// Loop over the two specifications, appending one column per pass
// (note these are locals, so they are referenced with `...' rather than ${...})
foreach spec in 1 2 {
    local tex_MPG `"`tex_MPG' & ${b_mpg_`spec'}${mpg_star_`spec'}"'
    local tex_se_MPG `"`tex_se_MPG' & (${se_mpg_`spec'})"'
    local tex_Y_Mean `"`tex_Y_Mean' & ${ymean_`spec'}"'
    local tex_Observations `"`tex_Observations' & ${N_`spec'}"'
    local tex_R_Squared `"`tex_R_Squared' & ${r2_`spec'}"'
    local tex_Sample `"`tex_Sample' & `=cond(`spec'==1, "Foreign", "Domestic")'"'
}
// Initialize the LaTeX file
texdoc init "test.tex", replace force
// Write the output strings to the file
tex `tex_MPG' \\
tex `tex_se_MPG' \\ \addlinespace
tex `tex_Y_Mean' \\
tex `tex_Observations' \\
tex `tex_R_Squared' \\
tex `tex_Sample'
// Close the LaTeX file
texdoc close
The loop builds up each row's output string one column at a time, and the completed strings are then written to the LaTeX file with the tex command. The \addlinespace command adds a small vertical space between rows in the table.
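The same build-then-write pattern can be sketched in Python; the numeric values below are made-up placeholders, not real regression output:

```python
# Hypothetical per-column results keyed by spec, standing in for the Stata globals.
results = {
    1: {"b": "233.44", "star": "**", "se": "102.77"},
    2: {"b": "-141.07", "star": "", "se": "82.09"},
}

# Start each row string with its label, then append one column per spec.
row_b = "MPG"
row_se = ""
for spec in (1, 2):
    r = results[spec]
    row_b += f" & {r['b']}{r['star']}"
    row_se += f" & ({r['se']})"

# Write the finished rows out as LaTeX table lines.
lines = [row_b + r" \\", row_se + r" \\ \addlinespace"]
print("\n".join(lines))
```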

Create a list of lists from two columns in a data frame - Scala

I have a data frame where passengerId and path are Strings. The path represents the flight path of the passenger, so passenger 10096 started in country CO and traveled to country BM. I need to find the largest number of consecutive flights each passenger has taken without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
  (1, List("UK", "IR", "AT", "UK", "CH", "PK")),
  (2, List("CG", "IR")),
  (3, List("CG", "IR", "SG", "BE", "UK")),
  (4, List("CG", "IR", "NO", "UK", "SG", "UK", "IR", "TJ", "AT")),
  (5, List("CG", "IR"))
)
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.
If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of steps without going to the UK, you could do something like this:
df
// split the string on 'uk' and generate one row per sub journey
.withColumn("path", explode(split('path, ",?uk,?")))
// compute the size of each sub journey
.withColumn("path_size", size(split('path, ",")))
// retrieve the longest one
.groupBy("passengerId")
.agg(max('path_size) as "max_path_size")
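To make the logic concrete outside Spark, here is a plain-Python sketch of the same computation (a hypothetical helper for illustration): split the comma-joined path wherever "uk" occurs, then take the size of the longest remaining segment.

```python
import re

def max_flights_without_uk(path: str) -> int:
    """Largest number of consecutive countries visited without touching the UK."""
    # Mirror split('path, ",?uk,?"): break the journey at every 'uk' stop.
    segments = re.split(r",?uk,?", path)
    # Count the countries in each sub-journey and keep the maximum.
    return max(len([c for c in seg.split(",") if c]) for seg in segments)

print(max_flights_without_uk("co,co,cn,tj,us,ir"))  # 6 (never visits the UK)
print(max_flights_without_uk("cn,tk,cg,uk,uk"))     # 3
```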

Benchmark channel creation in Nextflow

I am performing a scatter-gather operation in Nextflow.
It looks like the following:
reads = PATH+"test_1.fq"
outdir = "results"
split_read_ch = channel.fromFilePairs(reads, checkIfExists: true, flat:true ).splitFastq( by: 10, file:"test_split" )
process Scatter_fastP {
tag 'Scatter_fastP'
publishDir outdir
input:
tuple val(name), path(reads) from split_read_ch
output:
file "${reads}.trimmed.fastq" into gather_fastp_ch
script:
"""
fastp -i ${reads} -o ${reads}.trimmed.fastq
"""
}
gather_fastp_ch.collectFile().view().println{ it.text }
I run this code with all the benchmarks options proposed by Nextflow (https://www.nextflow.io/docs/latest/tracing.html):
nextflow run main.nf -with-report nextflow_report -with-trace nextflow_trace -with-timeline nextflow_timeline -with-dag nextflow_dag.html
In these tracing files, I can find the resources and speed of the 10 Scatter_fastP processes.
But I would like to also measure the resources and speed of the creation of the split_read_ch and the gather_fastp_ch channels.
I have tried to include the channels' creation in processes but I cannot find a solution to make it work.
Is there a way to include the channel creation into the tracing files? Or is there a way I have not found to create these channels into processes?
Thank you in advance for your help.
Although Nextflow can parse FASTQ files and split them into smaller files etc, generally it's better to pass off these operations to another process or set of processes, especially if your input FASTQ files are large. This is beneficial in two ways: (1) your main nextflow process doesn't need to work as hard, and (2) you get granular task process stats in your nextflow reports.
The following example uses GNU split to split the input FASTQ files, and gathers the outputs using the groupTuple() operator and the groupKey() built-in to stream the collected values as soon as possible. You'll need to adapt for your non-gzipped inputs:
nextflow.enable.dsl=2
params.num_lines = 40000
params.suffix_length = 5
process split_fastq {
input:
tuple val(name), path(fastq)
output:
tuple val(name), path("${name}-${/[0-9]/*params.suffix_length}.fastq.gz")
shell:
'''
zcat "!{fastq}" | split \\
-a "!{params.suffix_length}" \\
-d \\
-l "!{params.num_lines}" \\
--filter='gzip > ${FILE}.fastq.gz' \\
- \\
"!{name}-"
'''
}
process fastp {
input:
tuple val(name), path(fastq)
output:
tuple val(name), path("${fastq.getBaseName(2)}.trimmed.fastq.gz")
"""
fastp -i "${fastq}" -o "${fastq.getBaseName(2)}.trimmed.fastq.gz"
"""
}
workflow {
Channel.fromFilePairs( './data/*.fastq.gz', size: 1 ) \
| split_fastq \
| map { name, fastq -> tuple( groupKey(name, fastq.size()), fastq ) } \
| transpose() \
| fastp \
| groupTuple() \
| map { key, fastqs -> tuple( key.toString(), fastqs ) } \
| view()
}

Make dictionary from a chart setup

I'm trying to use keys (pipe sizes) that have outlets of different sizes, where each outlet has its own set of values. I created dictionaries nested within a dictionary, which is read from a pickerView, but I can't access individual items from the buried dictionaries/arrays...
-------> _____
| | |
M ___| |___ size >>> 1/2"
|_|__ _ __ | ___|___
|______|______| | |
| | | outlets >>> 1/2" 3/8"
|__C___|___C__| /\ /\
/ \ / \
C M C M
I have already tried this setup with a structure, and I also tried individual dictionaries that all call to each other:
var out2 = ["1": [["1": [1.5 , 1.5]], ["3/4": [1.5 , 1.5]], ["1/2": [1.5 , 1.5]]]]
This does work but I can't call individual pieces from it.
struct teeDict {
var out1: [(size: String) :
[[(outlet1: String) :
[(c: Double) , (m: Double)]],
[(outlet2: String) :
[(c: Double) , (m: Double)]]]] =
["1/2" :
[["1/2" :
[1.0 , 1.0]],
["3/8" :
[1.0 , 1.0]]]]
}
I tried to create a dictionary instead of using arrays, but nothing is recognized. How do I get this to work? Thanks.
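Language aside, the chart describes a two-level lookup: pipe size, then outlet size, then a (C, M) pair. Here is a sketch of that shape in Python, populated with the 1/2" row from the question:

```python
# Pipe size -> outlet size -> {"c": ..., "m": ...}; values taken from the question.
tee = {
    "1/2": {
        "1/2": {"c": 1.0, "m": 1.0},
        "3/8": {"c": 1.0, "m": 1.0},
    },
}

# Individual items are reached by chaining the keys level by level.
print(tee["1/2"]["3/8"]["c"])
```

In Swift, the analogous type would be along the lines of `[String: [String: [String: Double]]]`, and individual values can be reached by chaining subscripts the same way.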

creating a hierarchy using do loop

I'm trying to build a hierarchy from two columns, where one column is a client identifier and the other is its direct parent. The difficulty is that a client can have a parent which in turn has another parent.
At the moment I have a number of merge statements in a macro:
%macro hierarchy (level);
data sourcedata (drop = PARENT_ID);
merge sourcedata (in = a)
inputdata (in = b);
by CLIENT_ID;
if a;
length Parent_L&level $32.;
Parent_L&level = PARENT_ID;
if Parent_L&level = Parent_L%eval(&level-1) then Parent_L&level ="";
CLIENT_ID = Parent_L&level;
proc sort;
by Parent_L&level;
run;
%mend;
%hierarchy(2)
%hierarchy(3)
%hierarchy(4)
my output looks like
client_ID Parent_L1 Parent_L2 Parent_L3 Parent_L4
clientA clientB . . .
ClientE clientA clientB . .
What I'm looking for is a way to do this until the last Parent_Ln is all blank, as I'm not sure how many levels I need to go to.
The length you need is the depth of the tree. It is a bit cumbersome to compute that in SQL, since you really need a hash table, or at least arrays. You could use a hash in a data step or DS2 to access the parent from the child. If you have SAS/OR, it is easier: you can solve a shortest-path problem and produce your table from the paths from the roots of the forest.
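The core idea (walk each client's parent links upward until no parent remains, however deep the tree goes) can be sketched in Python with a plain child-to-parent dictionary, using the two example clients from the question:

```python
# Child -> direct parent, as in the CLIENT_ID / PARENT_ID columns.
parent = {"clientE": "clientA", "clientA": "clientB"}

def ancestry(client):
    """Return [Parent_L1, Parent_L2, ...] by walking up until no parent is found.

    Assumes the parent links contain no cycles.
    """
    chain = []
    while client in parent:
        client = parent[client]
        chain.append(client)
    return chain

print(ancestry("clientE"))  # ['clientA', 'clientB']
print(ancestry("clientB"))  # [] -- clientB is a root
```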
For this example I will use the dataset from the link Reeza provided:
data have;
infile datalines delimiter='|';
input subject1 subject2;
datalines;
2 | 4
2 | 5
2 | 6
2 | 8
4 | 7
4 | 11
6 | 9
6 | 10
6 | 12
10 | 15
10 | 16
13 | 14
16 | 17
;
proc optmodel printlevel=0;
set<num,num> REFERRALS;
set CLIENTS = union{<u,v> in REFERRALS} {u, v};
set ROOTS = CLIENTS diff setof{<u,v> in REFERRALS} v;
read data have into REFERRALS=[subject1 subject2];
/* source, sink, seq, tail, head. */
set<num,num,num,num,num> PATHS;
num len{ROOTS,CLIENTS} init .;
num longest = max(of len[*]);
solve with NETWORK / graph_direction = directed
links = ( include = REFERRALS )
out = ( sppaths = PATHS spweights = len )
shortpath = ( source = ROOTS )
;
num parent{ci in CLIENTS, i in 1 .. longest} init .;
for{<u, v, i, s, t> in PATHS}
parent[v, len[u,v] - i + 1] = s;
print parent;
create data want from [client]=CLIENTS
{i in 1 .. longest} <COL('Parent_L'||i) = parent[client,i]>;
quit;
