I am trying to write a loop that generates and fills in a dummy variable indicating whether an individual was a member of a particular party in a given year. My data are long, with each observation being a person-year. It looks like the following:
X1               X2                 X3
AR, 1972-1981    PDC, 1982-1986     PFL, 1986-.
MD, 1966-1980    PMDB, 1980-1988    PSB, 1988-.
MD, 1966-1968    AR, 1968-1980      PDS, 1980-1985
Before the comma is the party, and after it are the years during which the person was a member of that party.
Any help would be greatly appreciated!
So far the code I have is:
rename X1 XA
rename X2 XB
rename X3 XC
foreach var of varlist XA XB XC {
    split `var', parse(,)
}
tabulate XA1, gen(p)
Here's one way to do it. I had to make an assumption about what the missing year corresponds to in X3, so you will need to alter that.
/* Enter Data */
clear
input str20 X1 str20 X2 str20 X3
"AR, 1972-1981" "PDC, 1982-1986" "PFL, 1986-."
"MD, 1966-1980" "PMDB, 1980-1988" "PSB, 1988-."
"MD, 1966-1968" "AR, 1968-1980" "PDS, 1980-1985"
end
compress
/* Split X1,X2,X3 into party, start year and end year and create 3 ID variables that we need later */
forvalues v = 1/3 {
    split X`v', parse(", " "-")
    gen id`v' = _n
}
/* Make the years numeric, and get rid of the messy original data */
destring X12 X13 X22 X23 X32 X33, replace
replace X33 = 1990 if missing(X33) // enter your survey year here
drop X1 X2 X3
/* stack the spells on top of each other */
stack (id1 X11 X12 X13) (id2 X21 X22 X23) (id3 X31 X32 X33), into(id party year1 year2) clear
drop _stack
/* Put the data into long format and fill in the gaps */
reshape long year, i(id party) j(p)
drop p
/* need this b/c people can be in more than one party in a given year */
egen idparty = group(id party), label
xtset idparty year
tsfill
carryforward id party, replace // user-written: ssc install carryforward
drop idparty
/* create party dummies */
tab party, gen(DD_)
/* rename the dummies to have party affiliation at the end instead of numbers */
foreach var of varlist DD_* {
    levelsof party if `var'==1, local(party) clean
    rename `var' ind_`party'
}
drop party
/* get back down to one person-year observation */
collapse (max) ind_*, by(id year)
list id year ind_*, sepby(id) noobs
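With the 1990 cutoff above, the open-ended PFL and PSB spells are filled through 1990, and the final list shows one row per person-year with an ind_* dummy for each party.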
Following Dimitriy's lead (and interpretation), here is a slightly different way of doing it. I make a different assumption about the missing endpoints, i.e., I truncate the series to the last known years.
clear
set more off
input ///
str15 (XA XB XC)
"AR, 1972-1981" "PDC, 1982-1986" "PFL, 1986-."
"MD, 1966-1980" "PMDB, 1980-1988" "PSB, 1988-."
"MD, 1966-1968" "AR, 1968-1980" "PDS, 1980-1985"
end
list
*----- what you want? -----
// main
stack X*, into(X) clear
bysort _stack: gen id = _n
order id, first
split X, parse (, -)
rename (X1 X2 X3) (party sdate edate)
destring ?date, replace
gen diff = edate - sdate + 1
expand diff
bysort id party: replace sdate = sdate[1] + _n - 1
drop _stack X edate diff
// create indicator variables
tabulate party, gen(y)
// fix years with two or more parties
levelsof party, local(lp) clean
collapse (sum) y*, by(id sdate)
// rename
unab ly: y*
rename (`ly') (`lp')
list, sepby(id)
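The truncation works because destring turns the "." end years into missing, which makes diff missing for those spells; expand retains a single copy of an observation when its argument is missing, so the open-ended spells contribute only their start year.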
I want to calculate the number of overlapping days within multiple date ranges. For example, in the sample data below, there are 167 overlapping days: 89 from 07jan to 04apr and 78 from 30may to 15aug.
start       end
01jan2000   04apr2000
30may2000   15aug2000
07jan2000   31dec2000
This is fairly crude but gets the job done. Essentially, you:

1. Reshape the data to long format, which is usually a good idea when working with panel data in Stata.
2. Fill in the gaps between the start and end of each spell.
3. Keep the dates that occur more than once.
4. Count the distinct values of the dates.
clear
/* Fake Data */
input str9(start end)
"01jan2000" "04apr2000"
"30may2000" "15aug2000"
"07jan2000" "31dec2000"
end
foreach var of varlist start end {
    gen d = date(`var', "DMY")
    drop `var'
    gen `var' = d
    format %td `var'
    drop d
}
/* Count Overlapping Days */
rename (start end) date=
gen spell = _n
reshape long date, i(spell) j(range) string
drop range
xtset spell date, delta(1 day)
tsfill
bys date: keep if _N>1
distinct date
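Run on this example, distinct should report 167 dates, matching the count in the question. Note that distinct is a user-written command (ssc install distinct).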
I am encountering some difficulty with a dataset that I am analyzing with Stata. The dataset I have is a repeated cross section of the following form:
Individual Year Age VarA VarB VarC
Variable C has been calculated for each individual by year, using the egen command; as a result, it is year specific. I now want to match the value of this variable to the year in which each individual was x years old. (I create this new variable via the transform variableD = Year - Age + x.)
I want to match the value of Variable C that was obtained in the year "variableD" for each individual.
Here's an example of how to do this with the user-written command xfill:
net install xfill, from("http://www.sealedenvelope.com/")
webuse nlswork, clear
duplicates drop idcode age, force
gen x=20 if mod(idcode,2)==1
replace x=25 if mod(idcode,2)!=1
bys idcode year: egen var_c = mean(ln_wage)
bys idcode: gen var_c_at_x = var_c if age == x
xfill var_c_at_x, i(idcode)
edit idcode ln_wage year age x var_c*
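If you prefer to avoid a user-written command, here is a minimal sketch of the same match using only egen and cond() (var_c_at_x2 is a name introduced here for illustration):

* spread the age-x value of var_c across all of each idcode's rows
bys idcode: egen var_c_at_x2 = max(cond(age == x, var_c, .))

This fills the value within idcode just as xfill does; individuals never observed at age x are left missing in both versions.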
I have 55 weeks of sales data of a certain item. I created two SAS datasets from the original data. The first dataset has the date and the sum of quantity sold in each date. Therefore, I have 385 observations (55 x 7). The second table has detailed transaction data. Specifically, for each date, I have the time between transactions, which is the time between the arrival of one customer and the next one who purchased that item (I call it the interarrival times). What I need to do next is as follows:
1. For the first table (daily sales), take the sales data for each week, fit a number of distributions to find the parameters of each one, and record those parameters in a separate table. Note that each week has exactly 7 observations.
2. For the second table (interarrival times), likewise fit a number of distributions to find the parameters of each one, and record those parameters in the same table as above; here, however, the number of observations per week is not fixed.
Note: I already labeled the week number for the observations in each of the two datasets and I wrote the code that fits the distributions to the data. The only area in which I am struggling is how to tell SAS to take the data for one week, do the calculations, fit the distributions, and then move to the next week (i.e. group the data by week and perform multiple statements on each group).
I tried so many methods, and none of them worked, including nested loops. I know how to get the weekly sales using other methods and procedures such as PROC SQL, but I am not sure whether I can fit distributions with PROC SQL.
I am using proc nlp to estimate the parameters of each distribution using the maximum likelihood method. For example, if I need to estimate Mu and Sigma for the normal distribution, I am using the following code:
proc nlp data=temp vardef=n covariance=h outest=parms;
    title "Normal";
    max loglik;
    parms mu=0, sigma=1;
    bounds sigma > 1e-12;
    loglik = -log(sigma*(2*constant('PI'))**0.5) - 0.5*((x - mu)/sigma)**2;
run;
This method finds the mu and sigma that most likely produced the data.
For others wishing to use SAS's built-in BY-group processing, the nlp code becomes:
/* Ensure that the data is sorted to allow group processing */
proc sort data=temp;
    by week;
run;
proc nlp data=temp vardef=n covariance=h outest=parms;
    /* Produce separate output for each week */
    by week;
    title "Normal";
    max loglik;
    parms mu=0, sigma=1;
    bounds sigma > 1e-12;
    loglik = -log(sigma*(2*constant('PI'))**0.5) - 0.5*((x - mu)/sigma)**2;
run;
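With the by statement, the outest dataset parms contains the estimates for every week, keyed by the week variable, rather than a single overall fit.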
And here is a method using proc univariate:
/* Suppress printed output (remove to see all the details) */
ods select none;
proc univariate data=temp;
    /* Produce separate output for each week */
    by week;
    histogram x /
        /* Request a fit to the normal distribution */
        normal
        /* You can select other distributions too */
        lognormal;
    /* Put the fitted parameters in a dataset */
    ods output ParameterEstimates=parms;
    /* Put the fit statistics in a dataset */
    ods output GoodnessOfFit=quality;
run;
/* Restore printing output */
ods select all;
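Because of the by statement, both parms and quality gain a week column, so the per-week parameter estimates and fit statistics stack into single datasets.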
Here's what I used:
%macro weekly;
    %do i = 1 %to 55;
        proc sql;
            create table temp as
            select location, UPC, date, x, week
            from weeks
            where week = &i;
        quit;
        /* The rest of the code goes here: the calculations and the
           distribution fitting for each week's data */
    %end;
%mend weekly;
%weekly;
I knew proc sql would work, but I was wondering whether there might be a more efficient way to do it.
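If you do keep the macro approach, one small improvement is to read the number of weeks from the data rather than hard-coding 55. A sketch, assuming week runs from 1 up to its maximum:

%macro weekly;
    %local i nweeks;
    /* Derive the loop bound from the data */
    proc sql noprint;
        select max(week) into :nweeks from weeks;
    quit;
    %do i = 1 %to &nweeks;
        proc sql;
            create table temp as
            select location, UPC, date, x, week
            from weeks
            where week = &i;
        quit;
        /* Fit the distributions to temp here, as before */
    %end;
%mend weekly;
%weekly;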
I have a data set that looks similar to the one below. Basically, I have current prices for three different sizes of an item type. If the sizes are priced correctly (i.e. small < medium < large), I want to flag them with a "Y" and continue to use the current price. If they are not priced correctly, I want to flag them with an "N" and use the recommended price instead. I know that this is probably a time to use array programming, but my array skills are, admittedly, a bit weak. There are hundreds of locations but only one item type. I currently have the unique locations loaded in a macro variable list.
data have;
    input type $ location $ size $ cur_price rec_price;
    cards;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;
proc sql;
    select distinct location into :loc_list separated by ' '
    from have;
quit;
Any help would be greatly appreciated.
Thanks.
Not sure why you'd want to use an array here... proc transpose and some data step logic can easily solve this problem. Arrays are very useful (gotta admit, I'm not entirely comfortable with them either), but in a situation where you have that many locations, I think transpose is better.
Does the code below accomplish your goal?
/*sorts to get ready for transpose*/
proc sort data=have;
    by location;
run;
/*transpose current price*/
proc transpose data=have out=cur_tran prefix=cur_price;
    by location;
    id size;
    var cur_price;
run;
/*transpose recommended price*/
proc transpose data=have out=rec_tran prefix=rec_price;
    by location;
    id size;
    var rec_price;
run;
/*merge back together*/
data merged;
    merge cur_tran rec_tran;
    by location;
run;
/*creates flags and new field for final price*/
data want;
    set merged;
    /* SAS evaluates chained comparisons, so this tests S < M < L directly */
    if cur_priceS < cur_priceM < cur_priceL then do;
        FLAG = 'Y';
        priceS = cur_priceS;
        priceM = cur_priceM;
        priceL = cur_priceL;
    end;
    else do;
        FLAG = 'N';
        priceS = rec_priceS;
        priceM = rec_priceM;
        priceL = rec_priceL;
    end;
run;
I don't see how arrays would help here either. How about just using dif to queue the last record's price and compare (you could also retain the last price if you prefer; see the sketch below)? Make sure the dataset is properly sorted by type location descending size, then:
data want;
    set have;
    by type location descending size; *S > M > L alphabetically, so descending size reads S, M, L;
    retain price_check;
    *dif() sits in the condition, so its lag queue is updated on every row;
    if not first.location and dif(cur_price) le 0 then price_check = 1;
    *le 0 catches both decreases and ties, which violate strict small < medium < large;
    else if first.location then price_check = 0; *reset it;
    if last.location;
    keep type location price_check;
run;
Then merge that back to your original dataset by type location, and use the recommended price where price_check=1.
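For completeness, here is a sketch of the retain variant mentioned above, which sidesteps the lag queue behind dif entirely (price_flags and last_price are names introduced here for illustration):

data price_flags;
    set have;
    by type location descending size;
    retain price_check last_price;
    if first.location then do;
        price_check = 0; /* reset for each location */
        last_price = .;
    end;
    else if cur_price le last_price then price_check = 1; /* not strictly increasing */
    last_price = cur_price; /* queue this record's price for the next row */
    if last.location; /* keep one summary row per location */
    keep type location price_check;
run;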
Alternatively, you could do it in a single query, which is almost a re-statement of your requirements.
proc sql;
    create table want as
    select *
        /* Basically, I have current prices for three different sizes of an item type.
           If the sizes are priced correctly (i.e. small < medium < large) */
        , case
            when max(case when size eq 'S' then cur_price end)
                 lt max(case when size eq 'M' then cur_price end)
             and max(case when size eq 'M' then cur_price end)
                 lt max(case when size eq 'L' then cur_price end)
            /* I want to flag them with a "Y" and continue to use the current price */
            then 'Y'
            /* If they are not priced correctly,
               I want to flag them with an "N" and use the recommended price. */
            else 'N'
          end as Cur_Price_Sizes_Correct
        , case
            when calculated Cur_Price_Sizes_Correct eq 'Y'
            then cur_price
            else rec_price
          end as Price
    from have
    group by type, location;
quit;
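Note that mixing aggregate functions with select * makes proc sql remerge the summary statistics back onto every detail row (the log notes this), which is exactly what lets each size row within a location carry the group-level flag and price.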
The data are set up with a bunch of information corresponding to an ID, which can show up more than once.
ID   Data
1    X
1    Y
2    A
2    B
2    Z
3    X
I want a loop that signifies which instance of the ID I am looking at: is it the first time, second time, etc.? I want it as a string in the form _#, so I have to go beyond the simple _n system variable in Stata, to my knowledge. If someone knows a way to do what I want without the loop, let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
    replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID   Data   count_one   count_two
1    X      1           _1
1    Y      2           _2
2    A      1           _1
2    B      2           _2
2    Z      3           _3
3    X      1           _1
The question is how I can replace the 3 above to tell Stata to take the max of the count_one column, because I need to run this weekly, that max will change, and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question weren't entirely clear, but why don't you just use _n with tostring?
gsort +ID +data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
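As an aside, the loop and the tostring step can each be collapsed into one line with the string() function (assuming the second variable is named data, as in the gsort line above):

bysort ID (data): gen count_two = "_" + string(_n)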