Let's say I have maximum temperature data for the last 20 years. My data frame has columns for month, day, year and MAX_C (the temperature data). I want to calculate the mean (and standard deviation, and range) of the maximum temperature from July 1 of one year to June 30 of the following year (e.g. the mean maximum daily temperature from July 1, 1991 to June 30, 1992). Is there an efficient way to do this?
My approach, thus far, has been to create an array:
maxt.prev12<-tapply(maxt$MAX_C,INDEX=list(maxt$month,maxt$day,maxt$year),mean)
I put mean in as the function because tapply was not producing an array without a function after the INDEX, but mean is not actually calculating anything here, since each month/day/year cell holds a single value. Then I was thinking about taking January through June from one of the matrices (e.g. 1992) and July through December from the preceding matrix (e.g. 1991), and then computing the mean. I'm not entirely sure how to do that part, and there must be a more efficient way of performing these calculations in R.
EDIT
Here is a simple sample set of data
maxt
day month year MAX_C
1 1 1990 29
1 2 1990 28
1 3 1990 32
1 4 1990 26
1 5 1990 24
1 6 1990 32
1 7 1990 30
1 8 1990 28
1 9 1990 28
1 10 1990 24
1 11 1990 30
1 12 1990 30
1 1 1991 25
1 2 1991 26
1 3 1991 28
1 4 1991 25
1 5 1991 24
1 6 1991 32
1 7 1991 26
1 8 1991 32
1 9 1991 26
1 10 1991 26
1 11 1991 27
1 12 1991 26
1 1 1992 27
1 2 1992 25
1 3 1992 29
1 4 1992 32
1 5 1992 27
1 6 1992 27
1 7 1992 24
1 8 1992 25
1 9 1992 28
1 10 1992 26
1 11 1992 31
1 12 1992 27
I would create an "indicator year" column which was equal to the year if month in July-Dec but equal to year-1 when month in Jan-June.
EDITED month reference in light of the fact it was numeric rather than character:
> maxt$year2 <- maxt$year
> # months are numeric, so 1:6 selects January through June;
> # those rows are assigned to the period that began the previous July
> maxt[ maxt$month %in% 1:6, "year2"] <-
+ maxt[ maxt$month %in% 1:6, "year"] -1
>
> mean_by_year <- tapply(maxt$MAX_C, maxt$year2, mean, na.rm=TRUE)
> mean_by_year
1989 1990 1991 1992
28.50000 27.50000 27.50000 26.83333
If you wanted to change the labels so they reflected the non-calendar year derivation:
> names(mean_by_year) <- paste(substr(names(mean_by_year),3,4),
+ as.character( as.numeric(substr(names(mean_by_year),3,4))+1),
+ sep="_")
> mean_by_year
89_90 90_91 91_92 92_93
28.50000 27.50000 27.50000 26.83333
Note that this labelling will not be quite right at the turn of the century: 1999 would be labelled 99_100 rather than 99_00.
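For readers who want to check the numbers outside R, here is a minimal Python sketch of the same indicator-year idea, using the sample data from the question. The names MAX_C, july_year and summarize are hypothetical and introduced only for illustration. It also returns the standard deviation and range that the question asked for.

```python
from collections import defaultdict
from statistics import mean, stdev

# Sample values from the question, keyed by calendar year (January..December).
MAX_C = {
    1990: [29, 28, 32, 26, 24, 32, 30, 28, 28, 24, 30, 30],
    1991: [25, 26, 28, 25, 24, 32, 26, 32, 26, 26, 27, 26],
    1992: [27, 25, 29, 32, 27, 27, 24, 25, 28, 26, 31, 27],
}


def july_year(year, month):
    """Return the year in which the July-June period containing this month began."""
    return year - 1 if month <= 6 else year


def summarize(data):
    """Group values by July-June period and return mean, sd and range per period."""
    groups = defaultdict(list)
    for year, values in data.items():
        for month, value in enumerate(values, start=1):
            groups[july_year(year, month)].append(value)
    return {
        start: {
            "mean": mean(vals),
            "sd": stdev(vals),
            "range": (min(vals), max(vals)),
        }
        for start, vals in sorted(groups.items())
    }


summary = summarize(MAX_C)
for start, stats in summary.items():
    # Label each period by its two calendar years, e.g. 1991_1992.
    print(f"{start}_{start + 1}", round(stats["mean"], 5), stats["range"])
```

The means should agree with the tapply output above; periods 1989 and 1992 are based on only six months each, because the sample starts in January 1990 and ends in December 1992.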
Related
I have a dataset at the firm-product-year level. I want to identify which firms have gaps in their reporting years between 1994 and 2004. Consider the example below:
clear
input id year sales product
14 1994 28.9 2
14 1994 67.9 3
14 1994 12.5 9
14 1994 451.8 34
14 1994 27.5 44
14 1994 647.6 45
14 1995 9.7 2
14 1995 33.5 3
14 1995 112.4 9
14 1995 712.2 15
14 1995 902.3 41
14 1995 67.3 45
14 1995 15.1 50
14 1996 6.5 2
14 1996 24.6 3
14 1996 1009.4 5
14 1996 77.1 9
14 1996 76.9 17
14 1996 12.4 45
14 1996 946.3 88
14 1996 15.4 92
14 1997 .7 2
14 1997 63.2 2
14 1997 91.7 3
14 1997 860.8 9
14 1997 12.4 21
14 1997 800.8 32
14 1997 33.7 45
14 1997 41 95
15 1999 .1 44
15 2000 .1 58
15 2001 .4 27
15 2001 .1 95
15 2002 .5 5
15 2002 .1 58
15 2003 .1 17
15 2004 3.5 28
15 2004 .1 39
16 2000 .8 2
16 2001 .6 2
16 2003 .2 2
16 2004 .1 2
16 2004 .1 8
16 2004 2.5 8
end
Firm 14 produced 6 products in 1994. It reported in every consecutive year until 1997. Because there are no missing years in between, I keep this firm. But firm 16 reports in 2000 and 2001, and then not again until 2003. I assume that the firm still operated in 2002 but did not report in the data. How can I create a dummy variable that flags such firms?
tsfill doesn't help because I have repeated values within id-year.
In the first step, you delete the companies that do not produce any products in a year by creating a dummy variable "firm_any_production" that indicates whether a company has produced at least one product in a given year. Then the maximum of this dummy variable is calculated for each firm and the firms for which the maximum is 0 are deleted.
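* total sales per firm-year; sum() inside gen would be a running sum over all rows
bysort id year: egen firm_year_sales = total(sales)
gen firm_any_production = firm_year_sales > 0
bysort id: egen firm_missing_year = max(firm_any_production)
drop if firm_missing_year == 0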
In step 2 you calculate whether the newly added products of a company have higher sales than the core product. This is calculated by creating a dummy variable "is_new_product", which indicates whether a product is a new product. Then the sales of these new products are calculated and compared to the sales of the core product. If the sum of the turnover of the new products is greater than the turnover of the core product, another dummy variable "greater_than_core" is created and set to 1.
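bysort id year: egen core_product_sales = max(sales)
gen is_new_product = sales != core_product_sales
gen new_product_sales = sales * is_new_product
* total within firm-year; sum() inside gen would be a running sum over all rows
bysort id year: egen total_new_sales = total(new_product_sales)
gen greater_than_core = total_new_sales > core_product_sales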
Added:
The code is creating a firm_missing_year variable that takes the value of 1 if a firm doesn't report any product in the current year. The is_core_product variable indicates which product has the highest sales in a given year for each firm. The is_new_product variable takes the value of 1 if the product wasn't produced in the previous year. Finally, the higher_new_sales variable takes the value of 1 if the sum of sales of new products is greater than the sales of the core product.
use "your_data_file.dta", clear
gen firm_missing_year = 0
bysort id (year): egen last_year = max(year)
replace firm_missing_year = 1 if year > last_year[1]
gen is_core_product = 0
bysort id year: egen max_sales = max(sales)
replace is_core_product = 1 if sales == max_sales
gen is_new_product = 0
bysort id year: gen lagged_product = product[_n-1]
replace is_new_product = 1 if product != lagged_product & sales != max_sales
bysort id year: egen sum_new_sales = total(sales * is_new_product)
gen higher_new_sales = 0
replace higher_new_sales = 1 if sum_new_sales > max_sales
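On the question as originally asked (flagging firms with gaps in their reporting years), one way to think about it is that a firm has a gap exactly when the span between its first and last year is larger than its number of distinct reporting years. The following is only a sketch of that logic in Python, not Stata; the names rows and firms_with_gaps are hypothetical, and only a subset of the example rows is reproduced.

```python
from collections import defaultdict

# (id, year) pairs; repeated firm-years stand in for several products per year.
# Only part of the example data is reproduced here.
rows = [
    (14, 1994), (14, 1994), (14, 1995), (14, 1996), (14, 1997),
    (15, 1999), (15, 2000), (15, 2001), (15, 2002), (15, 2003), (15, 2004),
    (16, 2000), (16, 2001), (16, 2003), (16, 2004), (16, 2004),
]


def firms_with_gaps(rows):
    """Map each firm id to 1 if a year is missing between its first and last report, else 0."""
    years = defaultdict(set)
    for firm, year in rows:
        years[firm].add(year)  # a set ignores repeated firm-years
    return {
        firm: int(max(ys) - min(ys) + 1 > len(ys))
        for firm, ys in years.items()
    }


has_gap = firms_with_gaps(rows)
print(has_gap)
```

Because distinct years are counted, repeated values within id-year do not matter, which is the obstacle mentioned for tsfill.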
I need to calculate a director's previous wage before he joins a new company.
I have created a simple dataset for one director (in practice I have many values of director_id). This director, with ID = 1, manages 5 firms which he joined in different years (the variable called enter). Since the director joined firm number 2 in 2011, I need the average of the variable wage over all the years before 2011 in which he was managing other firms. For the same director, I need a different mean(wage) for firm number 3, which he joined in 2012 (this will include the wages from the 2 companies that he managed before entering company 3 in 2012).
Below is the data. I would really appreciate your help in coding this problem.
clear
input enter year wage director_id firm_id
2006 2006 6.4790964 1 1
2006 2010 6.4783854 1 1
2006 2011 6.4067149 1 1
2006 2012 6.3716507 1 1
2006 2013 6.2248578 1 1
2006 2014 6.0631728 1 1
2011 2011 5.0127039 1 2
2011 2012 4.9616795 1 2
2011 2013 4.9483747 1 2
2011 2014 5.2612371 1 2
2012 2012 4.5389338 1 3
2012 2013 4.4322848 1 3
2012 2014 4.3223209 1 3
2013 2013 4.336947 1 4
2013 2014 4.27459 1 4
2015 2015 -.60586482 1 5
2015 2016 .085194588 1 5
end
I just need to exclude from mean(wage) all values that happen after he enters, so really need to regard only years before he enters a new company.
A recipe for what I think you seek is that the mean previous wage in other firms =
(SUM of previous wages in all firms MINUS sum of previous wages in this firm) / (COUNT of previous years in all firms MINUS count of previous years in this firm).
Your example is helpful but the wage variable is too irregular to allow easy eyeball checks.
Consider this sequence, where rangestat is from SSC.
clear
input enter year wage director_id firm_id
2006 2006 6.4790964 1 1
2006 2010 6.4783854 1 1
2006 2011 6.4067149 1 1
2006 2012 6.3716507 1 1
2006 2013 6.2248578 1 1
2006 2014 6.0631728 1 1
2011 2011 5.0127039 1 2
2011 2012 4.9616795 1 2
2011 2013 4.9483747 1 2
2011 2014 5.2612371 1 2
2012 2012 4.5389338 1 3
2012 2013 4.4322848 1 3
2012 2014 4.3223209 1 3
2013 2013 4.336947 1 4
2013 2014 4.27459 1 4
2015 2015 -.60586482 1 5
2015 2016 .085194588 1 5
end
sort year firm_id
replace wage = _n
rangestat (sum) SUM=wage (count) COUNT=wage, int(year . -1) by(director_id)
rangestat (sum) sum=wage (count) count=wage, int(year . -1) by(director_id firm_id)
replace sum = 0 if sum == .
replace count = 0 if count == .
gen wanted = (SUM - sum) / (COUNT - count)
list, sepby(year)
+---------------------------------------------------------------------------------+
| enter year wage direct~d firm_id SUM COUNT sum count wanted |
|---------------------------------------------------------------------------------|
1. | 2006 2006 1 1 1 . . 0 0 . |
|---------------------------------------------------------------------------------|
2. | 2006 2010 2 1 1 1 1 1 1 . |
|---------------------------------------------------------------------------------|
3. | 2006 2011 3 1 1 3 2 3 2 . |
4. | 2011 2011 4 1 2 3 2 0 0 1.5 |
|---------------------------------------------------------------------------------|
5. | 2006 2012 5 1 1 10 4 6 3 4 |
6. | 2011 2012 6 1 2 10 4 4 1 2 |
7. | 2012 2012 7 1 3 10 4 0 0 2.5 |
|---------------------------------------------------------------------------------|
8. | 2006 2013 8 1 1 28 7 11 4 5.666667 |
9. | 2011 2013 9 1 2 28 7 10 2 3.6 |
10. | 2012 2013 10 1 3 28 7 7 1 3.5 |
11. | 2013 2013 11 1 4 28 7 0 0 4 |
|---------------------------------------------------------------------------------|
12. | 2006 2014 12 1 1 66 11 19 5 7.833333 |
13. | 2011 2014 13 1 2 66 11 19 3 5.875 |
14. | 2012 2014 14 1 3 66 11 17 2 5.444445 |
15. | 2013 2014 15 1 4 66 11 11 1 5.5 |
|---------------------------------------------------------------------------------|
16. | 2015 2015 16 1 5 120 15 0 0 8 |
|---------------------------------------------------------------------------------|
17. | 2015 2016 17 1 5 136 16 16 1 8 |
+---------------------------------------------------------------------------------+
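The recipe can be cross-checked independently of rangestat. Below is a minimal Python sketch, assuming a single director and the same wage = 1, 2, ..., 17 relabelling as above; the names obs, records and mean_prev_wage_other_firms are hypothetical. It applies the same subtraction: totals over all strictly earlier years minus the totals for the same firm.

```python
# (enter, year, firm_id) for one director, already sorted by year then firm_id,
# mirroring "sort year firm_id" in the Stata example.
obs = [
    (2006, 2006, 1), (2006, 2010, 1),
    (2006, 2011, 1), (2011, 2011, 2),
    (2006, 2012, 1), (2011, 2012, 2), (2012, 2012, 3),
    (2006, 2013, 1), (2011, 2013, 2), (2012, 2013, 3), (2013, 2013, 4),
    (2006, 2014, 1), (2011, 2014, 2), (2012, 2014, 3), (2013, 2014, 4),
    (2015, 2015, 5), (2015, 2016, 5),
]

# wage = 1, 2, ..., 17, as in "replace wage = _n"
records = [
    {"year": year, "firm_id": firm, "wage": i}
    for i, (_, year, firm) in enumerate(obs, start=1)
]


def mean_prev_wage_other_firms(records):
    """Return (SUM - sum) / (COUNT - count) per record, or None if undefined."""
    wanted = []
    for r in records:
        earlier = [o for o in records if o["year"] < r["year"]]
        own = [o for o in earlier if o["firm_id"] == r["firm_id"]]
        total_sum = sum(o["wage"] for o in earlier)   # SUM in the Stata code
        own_sum = sum(o["wage"] for o in own)         # sum in the Stata code
        denom = len(earlier) - len(own)               # COUNT - count
        wanted.append((total_sum - own_sum) / denom if denom else None)
    return wanted


wanted = mean_prev_wage_other_firms(records)
for r, w in zip(records, wanted):
    print(r["year"], r["firm_id"], r["wage"], w)
```

With several directors, the same calculation would be repeated within each director_id, which is what by(director_id) does in the rangestat calls.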
I'm trying to figure out if it's possible to transform table rows to columns where the number of rows included changes at the time of the query. Here's a sample of what I'm trying to do:
Characteristics Table
strategy year month aaa aa a
InvestmentA 2020 12 5 4 10
InvestmentB 2020 12 8 15 25
Investment(n) 2020 12 x x x
Output
year month Credit Type InvestmentA InvestmentB Investment(n)
2020 12 aaa 5 8 x
2020 12 aa 4 15 x
2020 12 a 10 25 x
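To make the intended transformation concrete, here is a minimal sketch of it done on the client side in Python rather than in SQL. The names characteristics, CREDIT_TYPES and pivot_strategies are hypothetical. The point it illustrates is that the strategy names are data, so the output columns are only known after the rows have been read; a SQL statement with a fixed column list would typically have to be generated dynamically to achieve the same thing.

```python
# Rows of the Characteristics table as dictionaries.
characteristics = [
    {"strategy": "InvestmentA", "year": 2020, "month": 12, "aaa": 5, "aa": 4, "a": 10},
    {"strategy": "InvestmentB", "year": 2020, "month": 12, "aaa": 8, "aa": 15, "a": 25},
]
CREDIT_TYPES = ["aaa", "aa", "a"]


def pivot_strategies(rows, credit_types=CREDIT_TYPES):
    """Turn one row per strategy into one row per (year, month, credit type)."""
    # Collect strategy names in first-seen order; these become the new columns.
    strategies = []
    for row in rows:
        if row["strategy"] not in strategies:
            strategies.append(row["strategy"])
    periods = sorted({(row["year"], row["month"]) for row in rows})
    cells = {(row["year"], row["month"], row["strategy"]): row for row in rows}

    out = []
    for year, month in periods:
        for credit in credit_types:
            record = {"year": year, "month": month, "Credit Type": credit}
            for name in strategies:
                source = cells.get((year, month, name))
                # None marks a strategy with no row for this period.
                record[name] = source[credit] if source else None
            out.append(record)
    return out


pivoted = pivot_strategies(characteristics)
for record in pivoted:
    print(record)
```

Adding a third strategy to the input adds a third column to every output record without any change to the function.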
I have a SQL query already that gets the data I need but I'm struggling to figure out how to get that into a chart. This is sample data as result of my query:
year month day mode amount duration
2013 2 22 0 1 36001
2013 7 7 1 1 55062
2015 12 23 1 6 13
2015 12 23 4 4 11
2015 12 23 7 31 104
2015 12 23 8 2 4
2015 12 23 12 11 21
2015 12 23 13 3 8
2016 3 24 1 207 519
If I wanted to graph, let's say, amount grouped by year, month and day, how would that be done in JFreeChart?
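Whatever charting library is used, the first step is to reduce the query result to one total per date. The sketch below shows only that grouping step, in Python rather than Java, using the sample rows above; the names rows and amount_per_day are hypothetical. The resulting label and total pairs are the kind of input a bar or category chart needs.

```python
# (year, month, day, mode, amount, duration) rows from the sample query result.
rows = [
    (2013, 2, 22, 0, 1, 36001),
    (2013, 7, 7, 1, 1, 55062),
    (2015, 12, 23, 1, 6, 13),
    (2015, 12, 23, 4, 4, 11),
    (2015, 12, 23, 7, 31, 104),
    (2015, 12, 23, 8, 2, 4),
    (2015, 12, 23, 12, 11, 21),
    (2015, 12, 23, 13, 3, 8),
    (2016, 3, 24, 1, 207, 519),
]


def amount_per_day(rows):
    """Sum amount per calendar date, keyed by an ISO-style label."""
    totals = {}
    for year, month, day, _mode, amount, _duration in rows:
        key = f"{year:04d}-{month:02d}-{day:02d}"
        totals[key] = totals.get(key, 0) + amount
    return totals


totals = amount_per_day(rows)
for label, total in totals.items():
    print(label, total)
```

The same grouping could equally be done in the SQL query itself with GROUP BY year, month, day, leaving the Java side to copy each row into the chart's dataset.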
I have been working on a simple C program to find the day of the week for a given date. For it, I wrote a lot of lines to calculate the day and month and to determine what kind of year the given year is. While surfing, I came across a single-line expression that finds the day of the week for a given date. The code is below:
( d += m < 3 ? y --: y- 2, 23 * m / 9 + d + 4 + y / 4 - y / 100 + y / 400) % 7 ;
// 0 - Sunday, 6 - saturday
It gave the correct answer for all inputs, but I couldn't understand the values used in this expression.
Why is the sum of day and month compared with 3?
Why is the year reduced by one when the condition holds, and by two when it fails?
Why are the numbers 3, 23 and 9 used in this expression?
I am confused about the operator precedence in this statement. Can anyone explain how this works?
What I've found so far:
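23 * m / 9 (integer division) results in the following, where the last column is the increase to the next month's value (wrapping around after December):
m 23*m/9 increase
1 2 3
2 5 2
3 7 3
4 10 2
5 12 3
6 15 2
7 17 3
8 20 3
9 23 2
10 25 3
11 28 2
12 30 3
The increase is the number of days by which each month exceeds 28, which is the weekday shift that month causes. The one exception is February, which is treated here as if it had 30 days.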
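The expression y / 4 - y / 100 + y / 400 results in the following, where the last column is the difference from the 1995 value:
year value difference
1995 483 0
1996 484 1
1997 484 1
1998 484 1
1999 484 1
2000 485 2
2001 485 2
so the result adds one day for every leap year: every fourth year, except century years that are not divisible by 400.
Because every 365-day year shifts the weekday by 1 (365 mod 7 == 1), the year itself is added to the day.
Because of the precedence of ?: over +=, the first part parses as d += (m < 3 ? y-- : y - 2), and it handles the leap-year correction. For January and February (m < 3), y is added to d and then decremented, so the leap-day count above does not yet include the current year's leap day, which only takes effect from March onward. For March through December, y - 2 is added instead; the - 2 compensates for the 23 * m / 9 values, which advance by 2 over February as if it had 30 days rather than 28.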
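The reading above can be tested by spelling the one-liner out step by step and comparing it with a calendar library. The following is a minimal sketch in Python rather than C; day_of_week and count_mismatches are hypothetical names, and // is used because Python's / is not integer division. The comma operator in the C version simply means "update d and y first, then evaluate the sum".

```python
from datetime import date, timedelta


def day_of_week(y, m, d):
    """Return 0 for Sunday through 6 for Saturday (Gregorian calendar)."""
    if m < 3:
        d += y      # d += y-- : add the year first ...
        y -= 1      # ... then count leap days only up to the previous year
    else:
        d += y - 2  # the - 2 undoes the "30-day February" in 23 * m / 9
    return (23 * m // 9 + d + 4 + y // 4 - y // 100 + y // 400) % 7


def count_mismatches(start=date(1900, 1, 1), end=date(2100, 12, 31)):
    """Count dates where the formula disagrees with datetime's weekday."""
    mismatches = 0
    current = start
    while current <= end:
        # date.weekday() uses Monday=0; shift so that Sunday=0.
        expected = (current.weekday() + 1) % 7
        if day_of_week(current.year, current.month, current.day) != expected:
            mismatches += 1
        current += timedelta(days=1)
    return mismatches


print(day_of_week(2000, 1, 1))   # 6, a Saturday
print(day_of_week(2024, 2, 29))  # 4, a Thursday
print(count_mismatches())
```

The range 1900 to 2100 is an arbitrary choice for the check; it covers both a century year that is a leap year (2000) and ones that are not (1900, 2100).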