Matching different variables for the same observation - dataset

I am encountering some difficulty with a dataset that I am analyzing with Stata. The dataset I have is a repeated cross section of the following form:
Individual Year Age VarA VarB VarC
Variable C has been calculated for each individual by year, using the egen command, so it is year-specific. I now want to match to each individual the value of this variable from the year when that individual was x years old. I compute that year as variableD = Year - Age + x.
In other words, for each individual I want the value of Variable C that was obtained in the year given by variableD.

Here's an example of how to do this with the user-written xfill command:
net install xfill, from("http://www.sealedenvelope.com/")
webuse nlswork, clear
duplicates drop idcode age, force
gen x=20 if mod(idcode,2)==1
replace x=25 if mod(idcode,2)!=1
bys idcode year: egen var_c = mean(ln_wage)
bys idcode: gen var_c_at_x = var_c if age == x
xfill var_c_at_x, i(idcode)
edit idcode ln_wage year age x var_c*
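For readers who want to see the lookup logic outside Stata, here is a minimal Python sketch. It assumes each row is a dict with the hypothetical keys id, year, age and var_c, and that var_c is stored once per individual and year. It is an illustration of the matching idea from the question, not a translation of the xfill approach.

```python
def match_value_at_age(rows, x):
    """Attach to every row the var_c observed in the year the person was x years old."""
    # Index var_c by (individual, year) so the target year can be looked up directly.
    lookup = {(r["id"], r["year"]): r["var_c"] for r in rows}
    matched = []
    for r in rows:
        target_year = r["year"] - r["age"] + x  # variableD in the question
        matched.append({
            **r,
            "var_d": target_year,
            # None when the individual has no observation in the target year.
            "var_c_at_x": lookup.get((r["id"], target_year)),
        })
    return matched


rows = [
    {"id": 1, "year": 1990, "age": 18, "var_c": 1.0},
    {"id": 1, "year": 1992, "age": 20, "var_c": 2.0},
    {"id": 1, "year": 1994, "age": 22, "var_c": 3.0},
    {"id": 2, "year": 1990, "age": 30, "var_c": 5.0},
]
result = match_value_at_age(rows, x=20)
```

Individuals without an observation in the target year get a missing value, which mirrors what happens in Stata when no row satisfies age == x.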

Related

SAS- setting missings to a range of columns based on the value of another variable

I currently have a dataset that includes an ID for each person, and then variables called day1 through day1826 that are binary indicators of availability of a drug on each of those days. I need to censor individuals on certain days. For example, if a person needs to be censored on day500, then I need day500 to be set to missing, as well as every day variable after that (i.e. day500 through day1826). I have a variable called time_for_censor that indicates what day to start the missings.
How can I code this in SAS?
I've tried to code it in a loop like this:
array daydummy (1826) day1-day1826;
if time_for_censor ne . then do time_for_censor=1 to 1825;
daydummy(time_for_censor)=.;
daydummy(time_for_censor + 1) =.;
end;
Just loop from the censor date to the end of the array.
array daydummy (1826) day1-day1826;
if not missing(time_for_censor) then do index=time_for_censor to 1826;
daydummy(index)=.;
end;
drop index;
You might need to change the lower bound on the do loop to time_for_censor+1, depending on whether the values are valid on the censoring date or not.
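The same idea, sketched in Python for comparison. The function name censor_days and the inclusive flag are made up for this illustration; days are numbered from 1 as in the day1-day1826 variables.

```python
def censor_days(days, time_for_censor, inclusive=True):
    """Return a copy of days with everything from the censoring day onward set to None.

    days            -- list of daily indicators; days[0] corresponds to day1
    time_for_censor -- 1-based day on which censoring starts, or None for no censoring
    inclusive       -- if False, the censoring day itself keeps its value
    """
    if time_for_censor is None:
        return list(days)
    start = time_for_censor if inclusive else time_for_censor + 1
    return [value if day < start else None
            for day, value in enumerate(days, start=1)]


censored = censor_days([1, 0, 1, 1, 0], time_for_censor=3)
```

Setting inclusive=False corresponds to starting the SAS loop at time_for_censor+1.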

Stata: Number of overlapping days within multiple date ranges

I want to calculate the number of overlapping days within multiple date ranges. For example, in the sample data below, there are 167 overlapping days: first from 07jan to 04apr and second from 30may to 15aug.
start end
01jan2000 04apr2000
30may2000 15aug2000
07jan2000 31dec2000
This is fairly crude but gets the job done. Essentially, you:
1. Reshape the data to be in long format, which is usually a good idea when working with panel data in Stata
2. Fill in gaps between the start and end of each spell
3. Keep dates that occur more than once
4. Count the distinct values of dates
clear
/* Fake Data */
input str9(start end)
"01jan2000" "04apr2000"
"30may2000" "15aug2000"
"07jan2000" "31dec2000"
end
foreach var of varlist start end {
gen d = date(`var', "DMY")
drop `var'
gen `var' = d
format %td `var'
drop d
}
/* Count Overlapping Days */
rename (start end) date=
gen spell = _n
reshape long date, i(spell) j(range) string
drop range
xtset spell date, delta(1 day)
tsfill
bys date: keep if _N>1
/* distinct is user-written: ssc install distinct */
distinct date
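As a cross-check on the approach, here is a small Python sketch of the same steps: expand every range into its individual days, then count the days that appear in more than one range. The function name overlapping_days is hypothetical, and the ranges are treated as inclusive of both endpoints.

```python
from collections import Counter
from datetime import date, timedelta


def overlapping_days(ranges):
    """Count calendar days covered by at least two of the (start, end) ranges."""
    coverage = Counter()
    for start, end in ranges:
        # Expand the spell into one entry per day, endpoints included.
        for offset in range((end - start).days + 1):
            coverage[start + timedelta(days=offset)] += 1
    return sum(1 for n in coverage.values() if n > 1)


ranges = [
    (date(2000, 1, 1), date(2000, 4, 4)),
    (date(2000, 5, 30), date(2000, 8, 15)),
    (date(2000, 1, 7), date(2000, 12, 31)),
]
n_overlap = overlapping_days(ranges)  # 167 for the sample data
```

Like the Stata version, this is brute force and only sensible when the spells are short enough to enumerate day by day.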

Translate a Stata loop into a SPSS loop

Could anyone help with the translation of the following Stata code? I need this code for further analysis in SPSS.
if year<1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
I tried to solve the problem with the following SPSS code:
DO IF year<1990.
SORT CASES BY country year ID.
COMPUTE sum080 = SUM(PY080g).
COMPUTE hydisp=(HY020+sum080)*HY025.
ELSE.
COMPUTE hydisp=HY020*HY025.
END IF.
EXECUTE.
But this code appears to be wrong. Do you have any idea how to resolve the problem?
This particular use of egen in Stata can be replicated in SPSS by using the AGGREGATE command. Using Nick Cox's revised Stata code:
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = (HY020 + sum080) * HY025 if year < 1990
replace hydisp = HY020 * HY025 if year >= 1990
An equivalent set of code in SPSS would be:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK = country year ID
/sum080 = SUM(PY080g).
DO IF Year < 1990.
COMPUTE hydisp = (HY020+sum080)*HY025.
ELSE.
COMPUTE hydisp = HY020*HY025.
END IF.
This is in no sense an answer on SPSS code, but it makes a point that would not go well in a comment.
The Stata code
if year < 1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
would get interpreted as
if year[1] < 1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
i.e. the branching is on the value of year in the first observation (case, record). The if command and the if qualifier are quite different constructs. It seems much more likely that the code desired is something like
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = (HY020 + sum080) * HY025 if year < 1990
replace hydisp = HY020 * HY025 if year >= 1990
or
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = cond(year < 1990, (HY020 + sum080) * HY025, HY020 * HY025)
The OP's comment that the code appears to be wrong is a poor problem report. What is wrong precisely? It may be nothing more than inability to replicate the results gained in Stata, which would not be surprising as the Stata code is almost certainly not what is intended. It seems unlikely that the first observation is special, but rather that the calculation should be carried out for all observations according to the value of year.
Detail: sum() as an egen function is undocumented in favour of total(), but the syntax remains legal.
Detail: The Stata code here would not be considered a loop just because there is a tacit loop over observations.
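To make the per-observation logic concrete in a neutral language, here is a minimal Python sketch. It assumes each observation is a dict using the variable names from the question; the function name compute_hydisp is made up, and missing values are not handled.

```python
from collections import defaultdict


def compute_hydisp(rows):
    """Add the group total sum080 and the row-wise hydisp to every observation."""
    # Group total of PY080g within country, year and ID
    # (the egen / AGGREGATE step).
    totals = defaultdict(float)
    for r in rows:
        totals[(r["country"], r["year"], r["ID"])] += r["PY080g"]

    result = []
    for r in rows:
        sum080 = totals[(r["country"], r["year"], r["ID"])]
        # The condition is evaluated for each observation,
        # not once on the first observation.
        if r["year"] < 1990:
            hydisp = (r["HY020"] + sum080) * r["HY025"]
        else:
            hydisp = r["HY020"] * r["HY025"]
        result.append({**r, "sum080": sum080, "hydisp": hydisp})
    return result


rows = [
    {"country": "A", "year": 1989, "ID": 1, "PY080g": 10, "HY020": 100, "HY025": 2},
    {"country": "A", "year": 1989, "ID": 1, "PY080g": 5, "HY020": 100, "HY025": 2},
    {"country": "A", "year": 1995, "ID": 2, "PY080g": 7, "HY020": 50, "HY025": 3},
]
out = compute_hydisp(rows)
```

The group total is computed for all rows first, and only the final formula depends on year, which matches the structure of the revised Stata and SPSS code.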

MATLAB - array2table nesting

For the purpose of simplicity I'll try to take an example from everyday life. Let's say I have a table in a CSV file loaded into a table called dataOriginal with 3 columns: names, jobs, dates.
Let's take a closer look at the column "date":
date
____
'13.01.2014 20:34'
'22.03.2014 11:17'
...
I want to split each date into a date vector and add this vector (along with the variable names for each of its columns; since we have multiple dates we have de facto a matrix) to a column in a new table, again named "date", but with all the naming goodies in it such as year, month etc.
Here is what I have done so far (sorry for the poor code quality but I've just started learning MATLAB :-/):
I split each date in a date-vector and also add names to each element like this:
dateFormat = 'dd.mm.yyyy HH:MM'; % four-digit year, matching '13.01.2014 20:34'
[year,month,day,hour,minute,second] = datevec(datesRaw, dateFormat);
so that I have this:
year(1) % returns '2014' since this is the first date in my column
year % returns all years in my entire column
Then I converted the above to a table:
dates = array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month',...,'second'});
so I get a nice output like this
year    month    ...    second
____    _____    ...    ______
2014    1        ...    0
2014    3        ...    0
...     ...      ...    ...
This allows me an easy-to-read access to each column by simply calling for example:
year % returns all years
year(1) % returns first entry's year (here: '2014' from '13.01.2014 20:34')
I've processed my other columns too doing various operations on those and at the end I'm trying to horizontally concatenate all like this:
name       job                      date
____       _____________________    _____________________
                                    year    month    ...    second
                                    ____    _____           ______
"Bob"      "Construction worker"    2014    1        ...    0
"Alice"    "Waitress"               2014    3        ...    0
...        ...                      ...     ...      ...    ...
I'm struggling exactly with the part with the nesting of year,month etc. in a single column named "date". I'd like to address a date's element in the table above as follows:
myData.name(1) % will return 'Bob'
myData.job(1) % will return 'construction worker'
myData.date(1).year(1) % should return '2014' for Bob, the construction worker
Currently I'm having the following code after some sweating and swearing:
dataFinal = horzcat( ...
    array2table([dataProcessed(:,1), dataProcessed(:,2)], 'VariableNames', ...
        [dataOriginal.Properties.VariableNames(1), dataOriginal.Properties.VariableNames(2)]), ...
    array2table([year,month,day,hour,minute,second], 'VariableNames', ...
        {'year','month','day','hour','minute','second'}))
where
dataProcessed(:,1) are my processed names
dataProcessed(:,2) are my processed jobs
dataOriginal.Properties.VariableNames(1) is the name of the first column in my original table - "name"
dataOriginal.Properties.VariableNames(2) is the name of the second column in my original table - "job"
I do not know how to insert
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'})
in a named column "date" in order to accomplish my goal.
Thanks!
Try the following, it may be what you're looking for:
data = table(names, jobs, table(years, months, ...), 'VariableNames', {'name', 'job', 'date'})
Though you will address as follows, which is slightly different from what you said you want; it may still work for your purposes:
data.name(1);
data.job(1);
data.date.year(1);
EDIT: To see your output, do
disp([data(:, ~strcmp(data.Properties.VariableNames, 'date')), data.date])
names    ids    years    months
_____    ___    _____    ______
'Bob'    1      2014     4
'Max'    2      2013     8
(when editing the comment I didn't exactly replicate the data and fields from the answer, but I think you should get the point here).
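For comparison, the same nesting can be sketched in Python with plain dicts: the date entry is itself a small table of named columns, so a component is addressed as data["date"]["year"][0], analogous to data.date.year(1). The function name build_nested and the sample values are only for illustration.

```python
from datetime import datetime


def build_nested(names, jobs, dates_raw, fmt="%d.%m.%Y %H:%M"):
    """Build a column-oriented table whose 'date' column is a nested table."""
    parsed = [datetime.strptime(s, fmt) for s in dates_raw]
    date_columns = {
        "year": [d.year for d in parsed],
        "month": [d.month for d in parsed],
        "day": [d.day for d in parsed],
        "hour": [d.hour for d in parsed],
        "minute": [d.minute for d in parsed],
        "second": [d.second for d in parsed],
    }
    return {"name": list(names), "job": list(jobs), "date": date_columns}


data = build_nested(
    ["Bob", "Alice"],
    ["Construction worker", "Waitress"],
    ["13.01.2014 20:34", "22.03.2014 11:17"],
)
first_year = data["date"]["year"][0]
```

As with the MATLAB answer, the row index comes last: the nested column is selected first and then indexed by row.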

Matlab: Number of observations per year for very large array

I have a large array with daily data from 1926 to 2012. I want to find out how many observations are in each year (it varies from year-to-year). I have a column vector which has the dates in the form of:
19290101
19290102
.
.
.
One year here is going to be July through June of the next year.
So 19630701 to 19640630
I would like to use this vector to find the number of days in each year. I need the number of observations to use as inputs into a regression.
I can't tell whether the dates are stored numerically or as a string of characters; I'll assume they're numbers. What I suggest doing is to convert each value to the year and then using hist to count the number of dates in each year. So try something like this:
year = floor(date/10000);
obs_per_year = hist(year,1926:2012);
This will give you a vector holding the number of observations in each year, starting from 1926.
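Alternatively, build a series of bin edges, one for each July 1st. This assumes dates holds serial date numbers; values stored as yyyymmdd would first need converting, for example with datenum(num2str(dates),'yyyymmdd').
bin = datenum(1926:2012,7,1);
Then bin your vector of dates within each year with bin(1) <= x < bin(2), bin(2) <= x < bin(3), ...
count = histc(dates,bin);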
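If the dates really are stored as yyyymmdd numbers, the July to June grouping can also be done with integer arithmetic alone. Here is a minimal Python sketch of that idea; the function names are hypothetical, and each year is labelled by the calendar year in which it starts (so 19630701 to 19640630 is year 1963).

```python
from collections import Counter


def july_june_year(yyyymmdd):
    """Return the starting calendar year of the July-June year containing the date."""
    year = yyyymmdd // 10000
    month = (yyyymmdd // 100) % 100
    # January-June belong to the year that started the previous July.
    return year if month >= 7 else year - 1


def observations_per_year(dates):
    """Count observations in each July-June year."""
    return Counter(july_june_year(d) for d in dates)


counts = observations_per_year([19630630, 19630701, 19640630, 19640701])
```

Years with no observations simply do not appear in the result, unlike the fixed-length vector returned by hist or histc.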
