Renaming and creating variables in a list of Stata files - database

I have a list of Stata datasets: among some a variable tor is absent, and I want to add that variable if it doesn't exist.
The datasets contain a variable called xclass where x could be anything (e.g. Aclass, lclass, etc.). I would like to rename those variables to dec.
I want to create a variable adjusted which is "yes" if the file name contains adjusted and "no" if not.
I guess it would look something like:
Loop through list of datasets and their variables {
if variable contains pattern class
rename to dec
if no variable tor, then
gen str tor = total
if file name contains pattern adjusted
gen str adjusted = yes
else gen str adjusted = no
}
But then in proper Stata language.
So I've got this now, but it's not working, it doesn't do anything...
cd "C:\Users\test"
local filelist: dir "." files "*.dta", respectcase
foreach filename of local myfilelist {
ds *class
local found `r(varlist)'
local nfound : word count `found'
if `nfound' == 1 {
rename `found' dec
}
else if `nfound' > 1 {
di as err "warning: multiple *class variables in `filename'"
}
capture confirm var tor
if !_rc == 0 {
gen tor = "total"
}
gen adjusted = cond(strpos("`filename'", "_adjusted_"), "yes", "no")
}

This is not an answer, this is advice that won't fit into a comment.
What you are attempting is not elementary Stata. If indeed you are unfamiliar with Stata (not stata) you will find it challenging to automate this process. I'm sympathetic to you as a new user of Stata - it's a lot to absorb. And even worse if perhaps you are under pressure to produce some output quickly. Nevertheless, I'd like to encourage you to take a step back from your immediate tasks.
When I began using Stata in a serious way, I started by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.
All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax.
The Stata documentation is really exemplary - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry.
With that said, you will perhaps find the foreach command helpful for looping, the filelist command for obtaining a list of Stata datasets (not databases), and the ds command for obtaining a list of variable names within a Stata dataset. More subtly, the capture command will let you attempt to generate your tor variable and will simply fail gracefully if it already exists, saving a small amount of program logic.

The middle part can be sketched:
// assumes local macro filename contains file name
ds *class
local found `r(varlist)'
local nfound : word count `found'
if `nfound' == 1 {
rename `found' dec
}
else if `nfound' > 1 {
di as err "warning: multiple *class variables in `filename'"
}
capture confirm var tor
if _rc {
gen tor = "total"
}
gen adjusted = cond(strpos("`filename'", "adjusted"), "yes", "no")
On managing lists of files: filelist (SSC) is very good; also see fs (SSC) for a different approach.
EDIT: Here is proof of concept for the last detail:
. local filename1 "something adjusted somehow"
. local filename2 "frog toad newt dragon"
. di cond(strpos("`filename1'", "adjusted"), "yes", "no")
yes
. di cond(strpos("`filename2'", "adjusted"), "yes", "no")
no
strpos("<string1>", "<string2>") returns a non-zero result, namely the starting position of the second string in the first if the first contains the second. Non-zero as an argument means true in Stata; zero means false.
See help strpos() and if desired help cond().
I can't see your filenames to comment or test your code, but one possible problem is that the local macro is not defined in the same namespace as that in which you are trying to evaluate the expression. (That's what local means.) A macro that isn't defined will be evaluated as an empty string, with the result you mention.

Related

Run TreSpEx analysis

I am trying to run a TreSpEx analysis on a series of trees, which are saved in Newick format as .fasta.txt files in a folder.
I have a list of taxa names saved in a .txt file.
I enter:
perl TreSpEx.v1.pl -fun e -ipt *fasta.txt -tf Taxa_List.txt
But it won't run. I tried writing a loop for each file within the folder but am not very good with them and my line of
for i in treefile/; do perl TreSpEx.v1.1.pl -fun e -ipt *.fasta.txt -tf Taxa_List.txt; done
won't work because -ipt apparently needs a name that starts with a letter or number
In your second example you are actually doing the same thing as in the first (but possibly several times), because the loop variable is never used.
I'm not familiar with TreSpEx, nor do I know Bash very well for that matter (which it seems you are using), but you might try something like the following.
for i in treefile/*.fasta.txt; do
    perl TreSpEx.v1.1.pl -fun e -ipt "$i" -tf Taxa_List.txt
done
Basically, you need to use the variable from the for loop (i) to pass the name of each file to the command. Quoting "$i" keeps file names with spaces intact.
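To see the pattern without TreSpEx installed, here is a minimal, self-contained sketch. The function run_analysis is a hypothetical stand-in for the real perl call, and the file names are made up for illustration; only the loop structure carries over to the real case.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real command, e.g.
#   perl TreSpEx.v1.1.pl -fun e -ipt "$1" -tf Taxa_List.txt
# It only reports which file it was handed.
run_analysis() {
    printf 'processing %s\n' "$1"
}

# Throwaway directory with made-up file names so the sketch runs anywhere.
workdir=$(mktemp -d)
mkdir "$workdir/treefile"
touch "$workdir/treefile/geneA.fasta.txt" "$workdir/treefile/gene B.fasta.txt"

count=0
for f in "$workdir"/treefile/*.fasta.txt; do
    # If nothing matched, the pattern is left as literal text; skip it.
    [ -e "$f" ] || continue
    run_analysis "$f"          # one call per file, name passed via the loop variable
    count=$((count + 1))
done

echo "handled $count files"
rm -rf "$workdir"
```

The glob belongs in the for line, so the shell expands it once into a list; the command inside the loop then receives exactly one file name per iteration.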

SPSS loop ROC analysis for lots of variables

In SPSS, I would like to perform ROC analysis for lots of variables (989). The problem is that when I select all variables, it gives me the AUC values and the curves, but a case is immediately excluded if it has a missing value in any of the 989 variables. So I was thinking of putting a single-variable ROC analysis into a loop, but I don't have any idea how to do so. I already named all the variables var1, var2, var3, ..., var988, var989.
So, how could I loop a ROC analysis? (Checking "Treat user-missing values as valid" doesn't do the trick)
Thanks!
This sounds like a job for Python. It's usually the best solution for this sort of job in SPSS.
So here's a framework that might help you. I am woefully unfamiliar with ROC analysis, but this general pattern is applicable to all kinds of looping scenarios:
begin program.
import spss
for i in range(spss.GetVariableCount()):
    var = spss.GetVariableName(i)
    cmd = r'''
* your variable-wise analysis goes here --> use spss syntax, between the three ' no
* indentation is needed. since I don't know what your syntax looks like, we'll just
* run descriptives and frequencies for all your variables as an example.
descriptives %(var)s
/sta mean stddev min max.
fre %(var)s.
'''%locals()
    spss.Submit(cmd)
end program.
Just to quickly go over what this does: the for line tells SPSS to repeat what follows as many times as there are variables in the active dataset, 989 in your case. The line after it defines a (Python) variable named var, which contains the name of the variable at index i (0 to 988, the first variable in the dataset having index 0). Then we define a command for SPSS to execute. I like to put it in raw strings because that simplifies things like giving directories. A raw string starts at the r''' and ends at the closing '''. spss.Submit(cmd) hands the command defined after "cmd = " to SPSS for execution. Most importantly, wherever the name of the variable would appear in your syntax, substitute "%(var)s".
If you put "set mprint on." on a line above the "begin program.", you'll see exactly what it does in the viewer.

error in looping over files, -fs- command

I'm trying to split some datasets in two parts, running a loop over files like this:
cd C:\Users\Macrina\Documents\exports
qui fs *
foreach f in `r(files)' {
use `r(files)'
keep id adv*
save adv_spa*.dta
clear
use `r(files)'
drop adv*
save fin_spa*.dta
}
I don't know whether what is inside the loop is correctly written but the point is that I get the error:
invalid '"e2.dta'
where e2.dta is the second file in the folder. Does this message refer to the loop or maybe what is inside the loop? Where is the mistake?
You want lines like
use "`f'"
not
use `r(files)'
given that fs (installed from SSC, as you should explain) returns r(files) as a list of all the files whereas you want to use each one in turn (not all at once).
The error message was informative: use is puzzled by the second filename it sees (as only one filename makes sense). The other filenames are ignored: use fails as soon as something is evidently wrong.
Incidentally, note that putting "" around filenames remains essential if any includes spaces.

SPSS: Use index variable inside quotation marks

I have several datasets over which I want to run identical commands.
My basic idea is to create a vector with the names of the datasets and loop over it, using the specified name in my GET command:
VECTOR=(9) D = Name1 to Name9.
LOOP #i = 1 to 9.
GET
FILE = Directory\D(#i).sav
VALUE LABELS V1 to V8 'some text D(#i)'
LOOP END.
Now SPSS doesn't recognize that I want it to use the specific value of the vector D.
In Stata I'd use
local D(V1 to V8)
foreach D{
....`D' .....
}
You can't use VECTOR in this way, i.e. using the GET command within a VECTOR/LOOP loop.
However, you can use DEFINE/!ENDDEFINE. This is SPSS's native macro facility; if you are not aware of it, you'll most likely need to do a lot of reading on it to understand its syntax and usage.
Here's an example:
DEFINE !RunJob ()
!DO !i = 1 !TO 9
GET FILE = !QUOTE(!CONCAT("Directory\D(",!i,").sav")).
VALUE LABELS V1 to V8 !QUOTE(!CONCAT("some text D(",!i,")")).
!DOEND
!ENDDEFINE.
SET MPRINT ON.
!RunJob.
SET MPRINT OFF.
All the code between DEFINE and !ENDDEFINE is the body of the macro, and the !RunJob. call near the end runs the procedures defined in the macro.
This is a very simple use of a macro with no parameters/arguments assigned, but there is scope for much more complexity.
If you are new to DEFINE/!ENDDEFINE I would actually suggest you NOT invest time in learning it but instead learn Python programmability, which can achieve the same (and much more) with relative ease compared to DEFINE/!ENDDEFINE.
A Python solution to your example would look like this (you will need the Python programmability integration for your SPSS):
BEGIN PROGRAM.
import spss
for i in range(1, 9 + 1):
    spss.Submit(r"""
GET FILE = 'Directory\D(%(i)s).sav'.
VALUE LABELS V1 to V8 'some text D(%(i)s)'.""" % locals())
END PROGRAM.
As you will notice, the Python solution is much simpler.
@Caspar: use Python for SPSS for such jobs. SPSS macros have long been deprecated and are better avoided.
If you use Python for this, you don't even have to type in the file names: you can simply look up all file names in some folder that end with ".sav" as shown in this example.
HTH!
The Python approach is as Ruben says much superior to the old macro facility, but you can use the SPSSINC PROCESS FILES extension command to do tasks like this without any need to know Python. PROCESS FILES is included in the Python Essentials in recent versions of Statistics but can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) in older versions.
The idea is that you create a syntax file that works on one data file, and PROCESS FILES iterates that over a list of input files or a wildcard specification. For each file, it defines file handles and macros that you can use in the syntax file to open and process the data.

Linux Bash pass function argument to array name

I'm working on a script that has a number of functions in place which pull data from a few different arrays. We hope to keep the arrays individualized for reporting purposes. The information in the arrays does not change and the only thing different between each function is which array name is being used. Since all of the functions have 98% the same content I'm trying to pull them into 1 single array for simplified management.
The issue I'm facing though is that I'm not able to figure out the correct syntax to obtain the length of an array based on the array title that is passed in the function argument. I can't post the actual script, but here is a mock up that details a simplified version of what I'm testing with. I believe if we can get it working using the mock script below I can transfer the needed changes to the actual script.
array1=(
"item1 123"
"item2 456"
)
array2=(
"stockA qwe"
"stockB asd"
"stockC zxc"
)
test() {
local ref=${1}[@]
IFS=$'\n'; for i in ${!ref}; do echo $i ; done
}
test array1
test array2
The script above will echo each element of the array named by argument 1 when the function is called with that argument, which is working as needed. I've tried many different combinations such as len=${#${1}[@]} but I always receive a "bad substitution" error. The functions I mentioned before have while loops and for statements that use the array length to know when to stop, so being able to pull that information really ties it all together. What I'm hoping for is something like the flow below
I plan to continue my research on this, but thank you for any help and knowledge that can be provided!
-Cyanide
I think the only solution is to create a copy of the array, then take the length of that array:
local ref=${1}[@]
copy=( "${!ref}" )
len=${#copy[@]}
Since bash does not allow chaining of the parameter expansion operators, I know of no shorter way to use both ${#...} and ${!...} on the same line.
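As a possible alternative to copying, bash 4.3 and later support namerefs (local -n), which make a local name an alias for the array whose name is passed in. The sketch below assumes such a bash version; the helper names array_len and print_items are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Assumes bash 4.3 or later: `local -n` (nameref) is not available in older versions.
array1=("item1 123" "item2 456")
array2=("stockA qwe" "stockB asd" "stockC zxc")

# Hypothetical helper: print the length of the array whose NAME is given as $1.
array_len() {
    local -n arr_ref=$1        # arr_ref now aliases the named array
    echo "${#arr_ref[@]}"
}

# Hypothetical helper: print each element of the named array on its own line.
print_items() {
    local -n arr_ref=$1
    local item
    for item in "${arr_ref[@]}"; do
        echo "$item"
    done
}

len1=$(array_len array1)
len2=$(array_len array2)
echo "array1 has $len1 items, array2 has $len2 items"
print_items array2
```

One caveat with this approach: the caller must not pass an array that is itself named arr_ref, since the nameref would then refer to itself. On older bash versions the copy shown above remains the safer route.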
