Calculating sums conditionally on data values - loops

I have quite a large conflict dataset (71 million observations) with many variables and a daily date variable.
This is from the GDELT project. For each day, there is a target country and a source country of aggression. For example, on 1 January 2000, many countries were engaged in aggressive behaviour against others or themselves.
It looks like this:
clear
input long date_01 str18 source_01 str19 target_01 str4 cameocode_01
20000101 "AFG" "AFGGOV" "2"
20000101 "AFG" "AFGGOV" "8"
20000101 "AFG" "ARE" "3"
20000101 "AFG" "CVL" "4"
20000101 "AFG" "GOV" "10"
20000101 "AFG" "GOV" "4"
20000101 "AFGGOV" "kasUAF" "3"
20000101 "FRA" "kasUAF" "8"
20000101 "AFG" "IGOUNO" "3"
20000101 "AFG" "IND" "4"
20000101 "AFG" "IND" "12"
20000102 "AFG" "IND" "19"
end
Variable date_01 is the day, source_01 is the country that initiated aggression, target_01 is the victim, and cameocode_01 is the variable of concern, which states the degree of hostility or cooperation. If the number is between 10 and 20, that is a hostility event, with 20 being the most hostile. If the number is between 0 and 9, that indicates cooperation (a good event), with 9 being the friendliest.
With help from this platform, I have managed to isolate the events per country, namely the cameo codes involving a certain number of countries (I am interested in 30), so that I can follow their conflict evolution through time.
I did the following:
foreach c in AFG IND ARE {
generate ind_`c' = cameocode_01 if strmatch(source_01, "`c'") | ///
strmatch(target_01, "`c'")
}
This yields what is desired:
     date      source  target  cameocode  ind_AFG  ind_IND  ind_ARE
 1.  20000101  AFG     AFGGOV  2          2
 2.  20000101  AFG     IND     4          4        4
 3.  20000101  AFG     AFGGOV  8          8
 4.  20000101  AFG     ARE     3          3                 3
 5.  20000101  AFG     CVL     4          4
 6.  20000101  AFG     GOV     10         10
 7.  20000101  AFG     GOV     4          4
 8.  20000101  AFGGOV  kasUAF  3
 9.  20000101  AFGGOV  kasUAF  8
10.  20000101  AFG     IRQ     12         12
11.  20000102  AFG     IND     19         19       19
Whenever a given country is involved as either recipient or initiator, I create a new variable isolating that specific event and its intensity for a given date.
What I want to do now is create a standardized measure or ratio where, for each date and each country, the sum of the conflict measures (numbers from 10 to 20) is divided by the sum of the cooperation measures (numbers from 1 to 9).
So my desired output for the table above for AFG on 20000101 (5th column) would be:
(10+12) / (2+4+8+3+4+4)
I would like to repeat this for each date for each of the variables ind_COUNTRY CODE to have one number per day per country.
Is there a way to do this?

This appears to be the key trick you seek.
clear
input long date str6 source str6 target float cameocode
20000101 "AFG" "AFGGOV" 2
20000101 "AFG" "IND" 4
20000101 "AFG" "AFGGOV" 8
20000101 "AFG" "ARE" 3
20000101 "AFG" "CVL" 4
20000101 "AFG" "GOV" 10
20000101 "AFG" "GOV" 4
20000101 "AFGGOV" "kasUAF" 3
20000101 "AFGGOV" "kasUAF" 8
20000101 "AFG" "IRQ" 12
end
egen num = total(cond(cameocode >= 10, cameocode, .)), by(date source)
egen den = total(cond(cameocode < 10, cameocode, .)), by(date source)
generate wanted = num / den
sort date source
list, sepby(source)
+------------------------------------------------------------+
| date source target cameoc~e num den wanted |
|------------------------------------------------------------|
1. | 20000101 AFG IND 4 22 25 .88 |
2. | 20000101 AFG GOV 4 22 25 .88 |
3. | 20000101 AFG AFGGOV 2 22 25 .88 |
4. | 20000101 AFG AFGGOV 8 22 25 .88 |
5. | 20000101 AFG IRQ 12 22 25 .88 |
6. | 20000101 AFG GOV 10 22 25 .88 |
7. | 20000101 AFG CVL 4 22 25 .88 |
8. | 20000101 AFG ARE 3 22 25 .88 |
|------------------------------------------------------------|
9. | 20000101 AFGGOV kasUAF 8 0 11 0 |
10. | 20000101 AFGGOV kasUAF 3 0 11 0 |
+------------------------------------------------------------+
See sections 9 and 10 in this paper for technique. The essential idea is that many egen functions allow expressions as arguments, which can be more complicated than just variable names. Here we use cond() to specify that only values in certain intervals should be totalled.
A less transparent but less wasteful recipe in terms of creation of variables would run something like
egen wanted = total(cond(cameocode >= 10, cameocode, .)), by(date source)
egen den = total(cond(cameocode < 10, cameocode, .)), by(date source)
replace wanted = wanted / den
drop den
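For anyone who wants to cross-check the arithmetic outside Stata, here is a minimal Python sketch of the same conditional totals. It is not part of the Stata answer; the function name conflict_ratio and the use of None for a zero denominator are assumptions made for illustration.

```python
from collections import defaultdict

# Rows copied from the example above: (date, source, cameocode).
rows = [
    (20000101, "AFG", 2), (20000101, "AFG", 4), (20000101, "AFG", 8),
    (20000101, "AFG", 3), (20000101, "AFG", 4), (20000101, "AFG", 10),
    (20000101, "AFG", 4), (20000101, "AFGGOV", 3), (20000101, "AFGGOV", 8),
    (20000101, "AFG", 12),
]


def conflict_ratio(rows):
    """Return {(date, source): sum of codes >= 10 / sum of codes < 10}."""
    num = defaultdict(float)  # hostility total per (date, source)
    den = defaultdict(float)  # cooperation total per (date, source)
    for date, source, code in rows:
        if code >= 10:
            num[(date, source)] += code
        else:
            den[(date, source)] += code
    ratios = {}
    for key in set(num) | set(den):
        # Assumption: report None when there is nothing to divide by.
        ratios[key] = num[key] / den[key] if den[key] else None
    return ratios


ratios = conflict_ratio(rows)
print(ratios[(20000101, "AFG")])     # 22 / 25
print(ratios[(20000101, "AFGGOV")])  # 0 / 11
```

The two printed values correspond to the num, den and wanted columns in the listing above.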

Related

How to identify gaps in reported year in an unbalanced panel with repeated observations for firm-year?

I have a dataset at the firm-product-year level. I want to identify which firms have gaps in their reporting years between 1994 and 2004. Consider the example below:
clear
input id year sales product
14 1994 28.9 2
14 1994 67.9 3
14 1994 12.5 9
14 1994 451.8 34
14 1994 27.5 44
14 1994 647.6 45
14 1995 9.7 2
14 1995 33.5 3
14 1995 112.4 9
14 1995 712.2 15
14 1995 902.3 41
14 1995 67.3 45
14 1995 15.1 50
14 1996 6.5 2
14 1996 24.6 3
14 1996 1009.4 5
14 1996 77.1 9
14 1996 76.9 17
14 1996 12.4 45
14 1996 946.3 88
14 1996 15.4 92
14 1997 .7 2
14 1997 63.2 2
14 1997 91.7 3
14 1997 860.8 9
14 1997 12.4 21
14 1997 800.8 32
14 1997 33.7 45
14 1997 41 95
15 1999 .1 44
15 2000 .1 58
15 2001 .4 27
15 2001 .1 95
15 2002 .5 5
15 2002 .1 58
15 2003 .1 17
15 2004 3.5 28
15 2004 .1 39
16 2000 .8 2
16 2001 .6 2
16 2003 .2 2
16 2004 .1 2
16 2004 .1 8
16 2004 2.5 8
end
Firm 14 produced 6 products in 1994. It produced every year consecutively until 1997. Because there are no missing years in between, I keep this firm. But firm 16 reports in 2000, 2001 and then in 2003. I assume that the firm still operated in 2002 but doesn't report in the data. How to create a dummy variable for this firm?
tsfill doesn't help because I have repeated values within id-year.
In the first step, you delete the companies that do not produce any products in a year by creating a dummy variable "firm_any_production" that indicates whether a company has produced at least one product in a given year. Then the maximum of this dummy variable is calculated for each firm and the firms for which the maximum is 0 are deleted.
gen firm_any_production = sum(sales) > 0
bysort id (year): egen firm_missing_year = max(firm_any_production)
drop if firm_missing_year == 0
In step 2 you calculate whether the newly added products of a company have higher sales than the core product. This is calculated by creating a dummy variable "is_new_product", which indicates whether a product is a new product. Then the sales of these new products are calculated and compared to the sales of the core product. If the sum of the turnover of the new products is greater than the turnover of the core product, another dummy variable "greater_than_core" is created and set to 1.
bysort id year: egen core_product_sales = max(sales)
gen is_new_product = sales != core_product_sales
gen new_product_sales = sales * is_new_product
gen greater_than_core = sum(new_product_sales) > core_product_sales
Added:
The code is creating a firm_missing_year variable that takes the value of 1 if a firm doesn't report any product in the current year. The is_core_product variable indicates which product has the highest sales in a given year for each firm. The is_new_product variable takes the value of 1 if the product wasn't produced in the previous year. Finally, the higher_new_sales variable takes the value of 1 if the sum of sales of new products is greater than the sales of the core product.
use "your_data_file.dta", clear
gen firm_missing_year = 0
* egen has no unique() option; the by: prefix already limits each calculation to its group
bysort id (year): egen last_year = max(year)
replace firm_missing_year = 1 if year > last_year[1]
gen is_core_product = 0
bysort id year: egen max_sales = max(sales)
replace is_core_product = 1 if sales == max_sales
gen is_new_product = 0
bysort id year: gen lagged_product = product[_n-1]
replace is_new_product = 1 if product != lagged_product & sales != max_sales
* total() is the documented egen function for group sums
bysort id year: egen sum_new_sales = total(sales * is_new_product)
gen higher_new_sales = 0
replace higher_new_sales = 1 if sum_new_sales > max_sales
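As a point of comparison for the question as asked (flagging firms whose reported years have holes), here is a minimal Python sketch. It is an illustration rather than Stata code, and the name firms_with_gaps is made up for this example. The idea is that a firm has a gap when the span from its first to its last year is longer than its count of distinct years; repeated firm-year rows do no harm because years are collected in a set.

```python
from collections import defaultdict

# (id, year) pairs abbreviated from the example data; duplicates are deliberate.
firm_years = (
    [(14, y) for y in (1994, 1994, 1995, 1996, 1997)]
    + [(15, y) for y in (1999, 2000, 2001, 2001, 2002, 2003, 2004)]
    + [(16, y) for y in (2000, 2001, 2003, 2004, 2004)]
)


def firms_with_gaps(firm_years):
    """Return {id: 1 if the firm skips at least one year, else 0}."""
    years = defaultdict(set)
    for firm, year in firm_years:
        years[firm].add(year)  # the set removes repeated firm-year rows
    return {
        firm: int(max(ys) - min(ys) + 1 > len(ys))
        for firm, ys in years.items()
    }


gap_dummy = firms_with_gaps(firm_years)
print(gap_dummy)
```

Firm 16 reports 2000, 2001, 2003 and 2004, so its span is 5 years against 4 distinct years and it is flagged.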

How to calculate mean of previous values of other firms for Director ID before he joins the firm

I need to calculate a director's previous wage before he joins a new company.
I have created a simple dataset for one director (in practice I have many values of director_id). This director with ID = 1 manages 5 firms, which he joined in different years (the variable called enter). If the director joined firm number 2 in 2011, I need the average of the variable wage over all years before 2011 in the firms he was managing. For the same director = 1, I need a different mean(wage) for firm number 3, which he joined in 2012 (this will include wages from the 2 companies that he managed before entering company 3 in 2012).
Below is the data. I would really appreciate your help in coding this problem.
clear
input enter year wage director_id firm_id
2006 2006 6.4790964 1 1
2006 2010 6.4783854 1 1
2006 2011 6.4067149 1 1
2006 2012 6.3716507 1 1
2006 2013 6.2248578 1 1
2006 2014 6.0631728 1 1
2011 2011 5.0127039 1 2
2011 2012 4.9616795 1 2
2011 2013 4.9483747 1 2
2011 2014 5.2612371 1 2
2012 2012 4.5389338 1 3
2012 2013 4.4322848 1 3
2012 2014 4.3223209 1 3
2013 2013 4.336947 1 4
2013 2014 4.27459 1 4
2015 2015 -.60586482 1 5
2015 2016 .085194588 1 5
end
I just need to exclude from mean(wage) all values that happen after he enters, so really need to regard only years before he enters a new company.
A recipe for what I think you seek is that the mean previous wage in other firms =
(SUM of previous wages in all firms MINUS sum of previous wages in this firm) / (COUNT of previous years in all firms MINUS count of previous years in this firm).
Your example is helpful but the wage variable is too irregular to allow easy eyeball checks.
Consider this sequence, where rangestat is from SSC.
clear
input enter year wage director_id firm_id
2006 2006 6.4790964 1 1
2006 2010 6.4783854 1 1
2006 2011 6.4067149 1 1
2006 2012 6.3716507 1 1
2006 2013 6.2248578 1 1
2006 2014 6.0631728 1 1
2011 2011 5.0127039 1 2
2011 2012 4.9616795 1 2
2011 2013 4.9483747 1 2
2011 2014 5.2612371 1 2
2012 2012 4.5389338 1 3
2012 2013 4.4322848 1 3
2012 2014 4.3223209 1 3
2013 2013 4.336947 1 4
2013 2014 4.27459 1 4
2015 2015 -.60586482 1 5
2015 2016 .085194588 1 5
end
sort year firm_id
replace wage = _n
rangestat (sum) SUM=wage (count) COUNT=wage, int(year . -1) by(director_id)
rangestat (sum) sum=wage (count) count=wage, int(year . -1) by(director_id firm_id)
replace sum = 0 if sum == .
replace count = 0 if count == .
gen wanted = (SUM - sum) / (COUNT - count)
list, sepby(year)
+---------------------------------------------------------------------------------+
| enter year wage direct~d firm_id SUM COUNT sum count wanted |
|---------------------------------------------------------------------------------|
1. | 2006 2006 1 1 1 . . 0 0 . |
|---------------------------------------------------------------------------------|
2. | 2006 2010 2 1 1 1 1 1 1 . |
|---------------------------------------------------------------------------------|
3. | 2006 2011 3 1 1 3 2 3 2 . |
4. | 2011 2011 4 1 2 3 2 0 0 1.5 |
|---------------------------------------------------------------------------------|
5. | 2006 2012 5 1 1 10 4 6 3 4 |
6. | 2011 2012 6 1 2 10 4 4 1 2 |
7. | 2012 2012 7 1 3 10 4 0 0 2.5 |
|---------------------------------------------------------------------------------|
8. | 2006 2013 8 1 1 28 7 11 4 5.666667 |
9. | 2011 2013 9 1 2 28 7 10 2 3.6 |
10. | 2012 2013 10 1 3 28 7 7 1 3.5 |
11. | 2013 2013 11 1 4 28 7 0 0 4 |
|---------------------------------------------------------------------------------|
12. | 2006 2014 12 1 1 66 11 19 5 7.833333 |
13. | 2011 2014 13 1 2 66 11 19 3 5.875 |
14. | 2012 2014 14 1 3 66 11 17 2 5.444445 |
15. | 2013 2014 15 1 4 66 11 11 1 5.5 |
|---------------------------------------------------------------------------------|
16. | 2015 2015 16 1 5 120 15 0 0 8 |
|---------------------------------------------------------------------------------|
17. | 2015 2016 17 1 5 136 16 16 1 8 |
+---------------------------------------------------------------------------------+
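The recipe (total over all firms minus total over this firm, divided by the matching difference in counts) can also be checked with a short Python sketch. This is only an illustration under the same setup as above, with wage replaced by the row number after sorting on year and firm_id; the name mean_prev_other is hypothetical.

```python
# (firm_id, year) in the sort order used above; wage is the 1-based row number.
firm_years = [
    (1, 2006), (1, 2010), (1, 2011), (2, 2011), (1, 2012), (2, 2012),
    (3, 2012), (1, 2013), (2, 2013), (3, 2013), (4, 2013), (1, 2014),
    (2, 2014), (3, 2014), (4, 2014), (5, 2015), (5, 2016),
]
rows = [(1, firm, year, i + 1) for i, (firm, year) in enumerate(firm_years)]


def mean_prev_other(rows):
    """rows: (director_id, firm_id, year, wage). One result per row."""
    result = []
    for d, f, y, _ in rows:
        # Earlier years for this director in all firms.
        all_prev = [w2 for d2, f2, y2, w2 in rows if d2 == d and y2 < y]
        # Earlier years for this director in this firm only.
        own_prev = [w2 for d2, f2, y2, w2 in rows
                    if d2 == d and f2 == f and y2 < y]
        n = len(all_prev) - len(own_prev)
        # None plays the role of Stata's missing value when n is 0.
        result.append((sum(all_prev) - sum(own_prev)) / n if n else None)
    return result


wanted = mean_prev_other(rows)
for row, value in zip(rows, wanted):
    print(row, value)
```

The double loop is quadratic in the number of rows, so it is only meant for checking small examples, not as a replacement for rangestat on real data.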

How to track changes in the board of directors within the same firm

In Stata I need to create a new variable "changes in the board of directors" which indicates whether the same directors are observed in the same firm over time. Consider an example below:
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
Directors ID 1, 2, and 3 are in firm 10 in 2006 and in 2007. So there was no change in the board of directors from t-1 to t. The variable "changes in the board of directors" should be 0. However, in 2008 a new director came to the board dirid = 4, so there was a change in the board and the variable should be 1. The same in 2009 because dirid 1 and 2 left the company. So any change, whether the entrance or exit of directors, should be reported with 1 in the new binary variable.
Here's another way to do it. I think it should cope with directors leaving and later coming back.
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
bysort firmid year (dirid) : gen board = strofreal(dirid) if _n == 1
by firmid year : replace board = board[_n-1] + " " + strofreal(dirid) if _n > 1
by firmid year : replace board = board[_N]
by firmid : gen anychange = year != year[_n-1] & board != board[_n-1]
bysort firmid year (anychange) : replace anychange = anychange[_N]
sort firmid year dirid
list, sepby(firmid year)
+--------------------------------------------+
| dirid firmid year board anycha~e |
|--------------------------------------------|
1. | 1 10 2006 1 2 3 1 |
2. | 2 10 2006 1 2 3 1 |
3. | 3 10 2006 1 2 3 1 |
|--------------------------------------------|
4. | 1 10 2007 1 2 3 0 |
5. | 2 10 2007 1 2 3 0 |
6. | 3 10 2007 1 2 3 0 |
|--------------------------------------------|
7. | 1 10 2008 1 2 3 4 1 |
8. | 2 10 2008 1 2 3 4 1 |
9. | 3 10 2008 1 2 3 4 1 |
10. | 4 10 2008 1 2 3 4 1 |
|--------------------------------------------|
11. | 3 10 2009 3 4 1 |
12. | 4 10 2009 3 4 1 |
+--------------------------------------------+
See also this paper on concatenating rowwise: https://journals.sagepub.com/doi/full/10.1177/1536867X20909698
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
bysort firmid year (dirid): gen n = _n
reshape wide n, i(firmid year) j(dirid)
egen all_directors = concat(n*)
bysort firmid (year): gen change = all_directors != all_directors[_n-1] & _n > 1
reshape long
drop if missing(n)
drop all_directors n
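Both approaches above boil down to comparing the set of directors in one year with the set in the previous observed year. A minimal Python sketch of that comparison follows; it is an illustration only, the name board_changes is hypothetical, and coding the first observed year as 0 is an assumption (the first Stata listing above codes it as 1).

```python
from collections import defaultdict

# (dirid, firmid, year) rows from the example above.
rows = [
    (1, 10, 2006), (2, 10, 2006), (3, 10, 2006),
    (1, 10, 2007), (2, 10, 2007), (3, 10, 2007),
    (1, 10, 2008), (2, 10, 2008), (3, 10, 2008), (4, 10, 2008),
    (3, 10, 2009), (4, 10, 2009),
]


def board_changes(rows):
    """Return {(firmid, year): 1 if the board differs from the previous year}."""
    boards = defaultdict(set)
    for dirid, firmid, year in rows:
        boards[(firmid, year)].add(dirid)
    changes = {}
    previous = {}  # last board seen for each firm
    for firmid, year in sorted(boards):
        board = boards[(firmid, year)]
        # Assumption: the first observed year has no comparison, so it gets 0.
        changes[(firmid, year)] = int(firmid in previous
                                      and board != previous[firmid])
        previous[firmid] = board
    return changes


changes = board_changes(rows)
print(changes)
```

Because whole sets are compared, an entry, an exit, or a director leaving and later returning all register as a change. Note that the comparison is with the previous observed year, which is not necessarily year t-1 if a firm skips a year.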

Issues Regarding SAS

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject; it could be the first, second or third. In addition, some of the months and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to be set equal to 15, and any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tips regarding this issue? What did I do wrong? Was I supposed to create an additional array for the date on which the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...) Instead, choose an element of the array and assign the value to the element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
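To make the indexing idea concrete without writing the SAS for you, here is a minimal Python sketch of the same logic: recode the -9 values, build the three dates, then use the position of the maximum lead value to pick the matching date. The function names date_of_max and age_category are made up for this illustration, and the values are those of child 1 in the data above.

```python
from datetime import date


def date_of_max(days, months, leads, year=1990):
    """Return (largest lead value, date of the sample that produced it)."""
    # Investigators' rules: missing day (-9) becomes 15, missing month becomes 6.
    days = [15 if d == -9 else d for d in days]
    months = [6 if m == -9 else m for m in months]
    dates = [date(year, m, d) for m, d in zip(months, days)]
    max_lead = max(leads)
    max_date = None
    for i in range(len(leads)):
        if leads[i] == max_lead:
            max_date = dates[i]  # with ties, the last matching sample wins
    return max_lead, max_date


def age_category(age_years):
    """1 if under 4, 2 if at least 4 but under 8, 3 if 8 or older."""
    if age_years < 4:
        return 1
    if age_years < 8:
        return 2
    return 3


dob = date(1978, 4, 30)
max_lead, max_date = date_of_max([6, -9, 14], [10, 7, 1], [1.62, 1.35, 1.47])
age = round((max_date - dob).days / 365.25, 2)
print(max_lead, max_date, age, age_category(age))
```

The point to carry over is that the loop index selects an element of the date array; the variable holding the result is an ordinary variable, not an array.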
You are not properly handling ID field from lines #2-4
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example, you need to skip field 1 on lines 2-4, or perhaps read the IDs into separate variables so you can check that they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example shows how to check that you have 4 lines with the same ID and, if you do, read the rest of the variables; otherwise it executes LOSTCARD. ID 3 has a missing record.
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;
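The check itself (does every ID have exactly four consecutive lines?) can be sketched in Python for comparison. This is a simpler technique than LOSTCARD, which resynchronises line by line; here consecutive lines are just grouped on the first field. The name complete_ids is hypothetical.

```python
from itertools import groupby

# The same 15 lines as in the cards block above; ID 3 has only three records.
lines = """1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M""".splitlines()


def complete_ids(lines, expected=4):
    """Split IDs into those with exactly `expected` consecutive lines and the rest."""
    keep, dropped = [], []
    for id_, group in groupby(lines, key=lambda s: s.split()[0]):
        n = len(list(group))
        (keep if n == expected else dropped).append(id_)
    return keep, dropped


keep, dropped = complete_ids(lines)
print("complete:", keep, "incomplete:", dropped)
```

Three IDs survive, which agrees with the 3 observations reported in the SAS log.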

Functions with Arrays in R

Let's say I have maximum temperature data for the last 20 years. My data frame has a column for month, day, year and MAX_C (temperature data). I want to calculate the mean (and standard deviation, and range) of maximum temperature from July 1 of one year to June 30 of the following year (e.g. mean max daily temp from July 1, 1991 to June 30, 1992). Is there an efficient way to do this?
My approach, thus far, has been to create an array:
maxt.prev12<-tapply(maxt$MAX_C,INDEX=list(maxt$month,maxt$day,maxt$year),mean)
I put mean in as the function because tapply was not producing an array without a function after the INDEX, but mean is not actually calculating anything here. Then I was thinking about taking January through June from one of the matrices (e.g. 1992) and July through December from the preceding matrix (e.g. 1991), and then computing the mean. I'm not entirely sure how to do that part; however, there must be a more efficient way of performing these calculations in R.
EDIT
Here is a simple sample set of data
maxt
day month year MAX_C
1 1 1990 29
1 2 1990 28
1 3 1990 32
1 4 1990 26
1 5 1990 24
1 6 1990 32
1 7 1990 30
1 8 1990 28
1 9 1990 28
1 10 1990 24
1 11 1990 30
1 12 1990 30
1 1 1991 25
1 2 1991 26
1 3 1991 28
1 4 1991 25
1 5 1991 24
1 6 1991 32
1 7 1991 26
1 8 1991 32
1 9 1991 26
1 10 1991 26
1 11 1991 27
1 12 1991 26
1 1 1992 27
1 2 1992 25
1 3 1992 29
1 4 1992 32
1 5 1992 27
1 6 1992 27
1 7 1992 24
1 8 1992 25
1 9 1992 28
1 10 1992 26
1 11 1992 31
1 12 1992 27
I would create an "indicator year" column which was equal to the year if month in July-Dec but equal to year-1 when month in Jan-June.
EDITED month reference in light of the fact it was numeric rather than character:
> maxt$year2 <- maxt$year
> maxt[ maxt$month %in% 1:6, "year2"] <-
+ maxt[ maxt$month %in% 1:6, "year"] -1
> # month is numeric here, so 1:6 picks out January through June
>
> mean_by_year <- tapply(maxt$MAX_C, maxt$year2, mean, na.rm=TRUE)
> mean_by_year
1989 1990 1991 1992
28.50000 27.50000 27.50000 26.83333
If you wanted to change the labels so they reflected the non-calendar year derivation:
> names(mean_by_year) <- paste(substr(names(mean_by_year),3,4),
+ as.character( as.numeric(substr(names(mean_by_year),3,4))+1),
+ sep="_")
> mean_by_year
89_90 90_91 91_92 92_93
28.50000 27.50000 27.50000 26.83333
Although I don't think it will be quite right at the millennial turn.
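For comparison, here is a minimal Python sketch of the same indicator-year idea, with labels built by modular arithmetic so that they also behave at a century boundary. It is an illustration rather than R code; the names july_june_means and label are hypothetical, and the data are the sample values above (one reading per month).

```python
from collections import defaultdict
from statistics import mean

# MAX_C by calendar year, January through December, from the sample data.
max_c = {
    1990: [29, 28, 32, 26, 24, 32, 30, 28, 28, 24, 30, 30],
    1991: [25, 26, 28, 25, 24, 32, 26, 32, 26, 26, 27, 26],
    1992: [27, 25, 29, 32, 27, 27, 24, 25, 28, 26, 31, 27],
}
rows = [(month, year, temp)
        for year, temps in max_c.items()
        for month, temp in enumerate(temps, start=1)]


def label(start_year):
    """Two-digit label such as 91_92; modulo 100 keeps 1999 as 99_00."""
    return f"{start_year % 100:02d}_{(start_year + 1) % 100:02d}"


def july_june_means(rows):
    """Mean MAX_C per July-to-June year, keyed by a label like 91_92."""
    groups = defaultdict(list)
    for month, year, temp in rows:
        # January to June belong to the period that started the previous July.
        start = year if month >= 7 else year - 1
        groups[start].append(temp)
    return {label(s): mean(v) for s, v in sorted(groups.items())}


means = july_june_means(rows)
print(means)
```

Swapping mean for statistics.stdev, or for a small function returning max minus min, gives the standard deviation and range over the same groups.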
