Software: Stata
I have two datasets: one of company CEOs (dataset 1) and one of business agreements signed (dataset 2).
Dataset 1 is the following format, sorted by company:
company 1, CEO name, start date, end date, etc.
company 1, CEO name, start date, end date, etc.
...
company 2, CEO name, start date, end date, etc.
Dataset 2 is the following format, sorted by agreement (each with 2-150 parties):
agreement 1, party 1, party 1 accession date, party 2, party 2 accession date.
agreement 2, party 1, party 1 accession date, party 2, party 2 accession date.
I want to write code that, for each individual CEO, counts the number of agreements signed by the CEO's company during his/her tenure as CEO.
So far I have created a CEO-day dataset with expand.
gen duration = enddate - startdate
expand duration -1
sort id startdate
by id: gen n = _n -1
gen day = startdate + n
Ideally I would proceed with a code like this:
collapse (count) agreement, by(id)
However, Dataset 2 lists the different parties as different variables. Company 1 is not always "party 1"; sometimes it may be "party 150". Also, each party may have a different accession date. I need a loop that "scans" Dataset 2 for agreements to which company 1 acceded as one of the parties, with an accession date falling within the period when CEO 1 was CEO of company 1.
What should I do? Do I need to create a loop?
A loop is not strictly necessary. You can try using reshape and joinby:
clear
set more off
*----- example data -----
// ceo data set
input ///
firm str15(ceo startd endd)
1 "pete" "01/04/1999" "05/12/2010"
1 "bill" "06/12/2010" "12/01/2011"
1 "lisa" "13/01/2011" "15/06/2014"
2 "mary" "01/04/1999" "05/12/2010"
2 "hank" "06/12/2010" "12/01/2011"
2 "mary" "13/01/2011" "15/06/2014"
3 "bob" "01/04/1999" "05/12/2010"
3 "john" "06/12/2010" "12/01/2011"
end
gen double startd2 = date(startd, "DMY")
gen double endd2 = date(endd, "DMY")
format %td startd2 endd2
drop startd endd
tempfile ceo
save "`ceo'"
clear
// agreement data set
input ///
agree party1 str15 p1acc party2 str15 p2acc
1 2 "09/12/2010" 3 "10/01/2011"
2 1 "05/06/1999" 2 "17/01/2011"
3 1 "06/06/1999" 3 "05/04/1999"
4 2 "07/01/2011" . ""
5 2 "08/01/2011" . ""
end
gen double p1accn = date(p1acc, "DMY")
gen double p2accn = date(p2acc, "DMY")
format %td p?accn
drop p?acc
*----- what you want -----
// reshape
gen i = _n
reshape long party p@accn, i(i)
rename (party paccn) (firm date)
order firm agree date
sort firm agree
drop i _j
// joinby
joinby firm using "`ceo'"
// find under which ceo, agreement was signed
gen tag = inrange(date, startd2, endd2)
list, sepby(firm)
// count
keep if tag
collapse (count) agreenum=tag, by(ceo firm)
list
A potential pitfall is that joinby may create so many observations that you run out of memory.
See help datetime if you have no experience with dates in Stata.
(Notice how I set up example data for your problem. Providing it helps others help you.)
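For readers who want the reshape/joinby/inrange logic spelled out step by step, here is a minimal sketch in Python rather than Stata, using the same example data. The variable names are illustrative only. Like the Stata code, it pools a CEO's separate spells at the same firm (mary at firm 2) and leaves out CEOs with no agreements.

```python
from collections import Counter
from datetime import date

# CEO spells: (firm, ceo, start, end), copied from the example data above
ceos = [
    (1, "pete", date(1999, 4, 1), date(2010, 12, 5)),
    (1, "bill", date(2010, 12, 6), date(2011, 1, 12)),
    (1, "lisa", date(2011, 1, 13), date(2014, 6, 15)),
    (2, "mary", date(1999, 4, 1), date(2010, 12, 5)),
    (2, "hank", date(2010, 12, 6), date(2011, 1, 12)),
    (2, "mary", date(2011, 1, 13), date(2014, 6, 15)),
    (3, "bob", date(1999, 4, 1), date(2010, 12, 5)),
    (3, "john", date(2010, 12, 6), date(2011, 1, 12)),
]

# Agreements in wide form: (agreement, [(party, accession date), ...])
agreements = [
    (1, [(2, date(2010, 12, 9)), (3, date(2011, 1, 10))]),
    (2, [(1, date(1999, 6, 5)), (2, date(2011, 1, 17))]),
    (3, [(1, date(1999, 6, 6)), (3, date(1999, 4, 5))]),
    (4, [(2, date(2011, 1, 7))]),
    (5, [(2, date(2011, 1, 8))]),
]

# "reshape long": one (firm, agreement, date) row per party
long_rows = [(firm, agree, d) for agree, parties in agreements for firm, d in parties]

# "joinby" + "inrange": pair each accession with the firm's CEO spells, keep matches
counts = Counter()
for firm, agree, d in long_rows:
    for ceo_firm, ceo, start, end in ceos:
        if ceo_firm == firm and start <= d <= end:
            counts[(firm, ceo)] += 1

for key in sorted(counts):
    print(key, counts[key])
```

On this data the counts are pete 2, hank 3, mary 1, bob 1 and john 1.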
I'm using SSMS version 18.9.2 and I'm trying to get a list of IDs that gave a gift in the year after the year of their FIRST gift. For example, if someone's first gift was in 2019 and they also gave a gift in 2020, the row count is 1; if the next person's first gift was also in 2019 but they did NOT give a gift in 2020, the row count stays at 1 even though two people have been reviewed. Hope that makes sense.
Using the sample data below, I would expect a row count of 1, returning only ID 2:
ID Date
1 3/8/1981
1 2/11/1988
1 2/15/1995
2 2/22/1982
2 2/24/1983
2 3/15/1983
2 2/17/1984
3 2/16/1984
3 3/13/1984
3 6/13/1986
4 2/2/1983
4 3/11/1985
4 3/21/1986
This is the closest I've gotten. Note the two different HAVING clauses: the first one works, but the second one, which is the one I actually need, fails:
SELECT DISTINCT
gifts1.giftid,
YEAR(gifts2.gifteffdat) AS 'MINYR',
YEAR(MIN(gifts1.gifteffdat)) AS 'MINYR+1'
FROM
gifts AS gifts1
INNER JOIN
gifts AS gifts2 ON gifts1.giftid = gifts2.giftid
AND DATEDIFF(year, gifts2.gifteffdat, gifts1.gifteffdat) = 1
GROUP BY
gifts1.giftid, gifts1.gifteffdat, gifts2.gifteffdat
-- THIS HAVING WORKS
HAVING
(YEAR(gifts2.gifteffdat) = 1982)
-- THIS HAVING DOESNT WORK
-- HAVING YEAR(gifts2.gifteffdat) = YEAR(MIN(gifts1.gifteffdat)) +1
I appreciate any help! Thank you!
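The requirement can be split into two steps: find each ID's first gift year, then check for a gift in the following calendar year. Here is a minimal sketch of that logic in Python, using the sample data above. It only illustrates the rule; the variable names are made up, and turning it into T-SQL (for example with a subquery grouped by giftid alone) is an assumption about how the query could be restructured.

```python
from datetime import date

# (ID, gift date) rows from the sample table above
gifts = [
    (1, date(1981, 3, 8)), (1, date(1988, 2, 11)), (1, date(1995, 2, 15)),
    (2, date(1982, 2, 22)), (2, date(1983, 2, 24)), (2, date(1983, 3, 15)), (2, date(1984, 2, 17)),
    (3, date(1984, 2, 16)), (3, date(1984, 3, 13)), (3, date(1986, 6, 13)),
    (4, date(1983, 2, 2)), (4, date(1985, 3, 11)), (4, date(1986, 3, 21)),
]

# Step 1: calendar year of each ID's first gift
first_year = {}
for gift_id, d in gifts:
    first_year[gift_id] = min(first_year.get(gift_id, d.year), d.year)

# Step 2: IDs with at least one gift in the year right after that first year
returning = sorted({gift_id for gift_id, d in gifts if d.year == first_year[gift_id] + 1})
print(returning)  # [2]
```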
I have a list of employees and want to generate an "Employee ID" based on the Hire Date value. I was hoping to be able to check the Hire Date value and compare it to an array of all the hire dates and to return the correct number.
You can see a list of these dates here: https://docs.google.com/spreadsheets/d/1ogjWzFPWLUECIP9YXL-r7RHM-hPWTPZ-6wX0sTV0QNc/edit?usp=sharing
Ideally, (using a small sampling of the dates above) it would look like the following:
Hire Date Employee ID
3/6/2012 1
3/30/2015 4
8/10/2015 5
8/10/2015 6
9/18/2015 7
9/18/2015 8
6/26/2020 9
3/6/2012 2
2/7/2013 3
Use in B1:
={"ID"; ARRAYFORMULA(IFNA(VLOOKUP(A2:A, {SORT(A2:A), ROW(A2:A)-1}, 2, 0)))}
or:
=ARRAYFORMULA(RANK(A2:A, A2:A, 1))
If you don't want the same ID repeated for identical dates, use:
={"ID"; ARRAYFORMULA(IF(A2:A="",,IFNA(VLOOKUP(A2:A&"z"&
COUNTIFS(A2:A, A2:A, ROW(A2:A), "<="&ROW(A2:A)), {SORT({A2:A&"z"&
COUNTIFS(A2:A, A2:A, ROW(A2:A), "<="&ROW(A2:A))}), ROW(A2:A)-1}, 2, 0))))}
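What the last formula computes is a rank by hire date with ties broken by sheet order. A minimal sketch of the same idea in Python, using the sample dates from the question (the variable names are illustrative):

```python
from datetime import datetime

# Hire dates in sheet order, taken from the sample in the question
hire_dates = ["3/6/2012", "3/30/2015", "8/10/2015", "8/10/2015", "9/18/2015",
              "9/18/2015", "6/26/2020", "3/6/2012", "2/7/2013"]
parsed = [datetime.strptime(s, "%m/%d/%Y") for s in hire_dates]

# Sort row indexes by date; sorted() is stable, so equal dates keep sheet order
order = sorted(range(len(parsed)), key=lambda i: parsed[i])

# The 1-based position in that ordering becomes the employee ID of the row
employee_id = [0] * len(parsed)
for rank, row in enumerate(order, start=1):
    employee_id[row] = rank

print(employee_id)  # [1, 4, 5, 6, 7, 8, 9, 2, 3]
```

The result matches the Employee ID column in the question's sample.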
I have a dataset with two tabs, one with monthly goals (targets) and another with sales and order data. I'm trying to summarize sales data from the second tab into the target tab using several criteria, with INDEX/MATCH and SUMIFS:
My Attempt:
=SUMIFS(INDEX(OrderBreakdown!$A$2:$T$8048,,MATCH(C2,OrderBreakdown!$G$2:$G$8048)),OrderBreakdown!$I$2:$I$8048,">="&A2,OrderBreakdown!$I$2:$I$8048,"<="&B2)
OrderBreakdown is the other sheet. Column D in the OrderBreakdown sheet is what I want to sum if OrderBreakdown Category (Col G) = Col C, OrderBreakdown Order Date (Col I) >= Start Date (Col A), and OrderBreakdown Order Date (Col I) <= End Date (Col B).
My answer should be much more in line with Col D, but instead I'm getting values in the millions ($MM).
Here's a sample of the dataset I'm pulling from:
[image: dataset I'm pulling from]
OK, I am not sure why your range to sum runs from A through T; that is probably where you went wrong. Also, I did not find the INDEX method necessary. This should work for you:
=SUMIFS(OrderBreakdown!$D$2:$D$8048,OrderBreakdown!$I$2:$I$8048, ">=" & A2,OrderBreakdown!$I$2:$I$8048, "<=" & B2, OrderBreakdown!$G$2:$G$8048, C2)
Here is my sample data, starting on row 2 of the first sheet:
1/1/2011 1/30/2011 Office Supplies
Then the OrderBreakdown tab starts on column C:
Discount Sales Profit Quantity Category sub-category OrderDate
0.5 $45.00 ($26.00) 3 Office Supplies Paper 1/1/11 Eugene Mo Stockholm Sweden North Home Offic 1/5/11 Second Cla: Stockholm 2011-(11 0.1-2011 2011 1/1/2011
0 $854.00 $290.00 7 Furniture BookCases 1/2/2011
0 $854.00 $290.00 7 Furniture BookCases 12/32/2010
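The three conditions from the question (category equal to Col C, order date between the start and end dates) can be checked by hand with a small sketch. The Python below is only an illustration of what the SUMIFS should return; it uses two rows modeled on the sample above, and the function name is hypothetical.

```python
from datetime import date

# Two rows modeled on the sample above: (sales, category, order date)
orders = [
    (45.00, "Office Supplies", date(2011, 1, 1)),
    (854.00, "Furniture", date(2011, 1, 2)),
]

def sum_sales(orders, start, end, category):
    """Sum sales for one category with start <= order date <= end."""
    return sum(sales for sales, cat, d in orders
               if cat == category and start <= d <= end)

total = sum_sales(orders, date(2011, 1, 1), date(2011, 1, 30), "Office Supplies")
print(total)  # 45.0
```

Note that the category test is an exact match, not a range comparison.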
I need to make a comparison for ratings in two points in time and indicate if the change was upwards,downwards or stayed the same.
For example:
This would be a table with four columns:
ID T0 T0+1 Status
1 AAA AA Lower
2 BB A Higher
3 C C Same
However, this does not work when applying regular string comparison, because in SQL
A<B
B<BBB
I need
A>B
B<BBB
So my order(highest to lowest): AAA,AA,A,BBB,BB,B
SQL order(highest to lowest): BBB,BB,B,AAA,AA,A
Now I have 2 options in mind, but I wonder if someone know a better one:
1) Use CASE WHEN statements for all the possibilities of ratings going up and down (I have more values than indicated above)
CASE WHEN T0=T0+1 then 'Same'
WHEN T0='AAA' and T0+1<>'AAA' then 'Lower'
....address all other options for rating going down
ELSE 'Higher'
END
However, this generates a very large number of CASE WHEN statements.
2) My other option requires generating 2 tables. In table 1 I use case when statements to assign values/rank to the ratings.
For example:
CASE WHEN T0='AAA' THEN 6
WHEN T0='AA' THEN 5
WHEN T0='A' THEN 4
WHEN T0='BBB' THEN 3
WHEN T0='BB' THEN 2
WHEN T0='B' THEN 1
END
The same for T0+1.
Then in table 2 I use a regular comparison between column T0 and column T0+1 on the numeric values.
However, I am looking for a solution where I can do it in one table (with as few lines as possible), and ideally never show the ranking column.
I think a nested statement would be the best option, but it did not work for me.
Does anybody have suggestions?
I use SQL Server 2008.
If you are working with credit ratings, it is very likely that this is not just about AAA > AA or BBB > BB.
Whether you are using one agency or another, it could also be AA+ or Aa1 for long term, F1+ for short term, or something else in different contexts or with other agencies.
It is also often required to convert ratings from one agency's scale to another's.
Therefore it is better to use a mapping table such as:
Id | Rating
0 | AAA
1 | AA+
2 | AA
3 | AA-
4 | A+
5 | A
6 | A-
7 | BBB+
Using this table, you only have to join the rating in your data table with the rating in the mapping table:
SELECT d.Rating_T0, d.Rating_T1,
CASE WHEN d.Rating_T0 = d.Rating_T1 THEN '='
WHEN m0.id < m1.id THEN '<'
WHEN m0.id > m1.id THEN '>'
END
FROM yourData d
INNER JOIN RatingMapping m0
ON m0.Rating= d.Rating_T0
INNER JOIN RatingMapping m1
ON m1.Rating= d.Rating_T1
If you only store the Rating id in your data table, you will not only save space (1 byte for tinyint versus up to 4 chars) but will also be able to compare without the JOIN to the mapping table.
SELECT d.Rating_Id0, d.Rating_Id1,
CASE WHEN d.Rating_Id0 = d.Rating_Id1 THEN '='
WHEN d.Rating_Id0 < d.Rating_Id1 THEN '<'
WHEN d.Rating_Id0 > d.Rating_Id1 THEN '>'
END
FROM yourData d
The JOIN would only be required when you want to display the actual Rating value, such as AAA for Rating_ID = 0.
You could also add an agency_Id to the mapping table. This way, you can easily choose which rating agency you want to display and easily convert between Agency 1 and Agency 2 or Agency 3 (i.e. Id 1 => S&P, Id 2 => Fitch, Id 3 => ...)
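To make the direction of the comparison explicit, here is a minimal sketch of the mapping idea in Python, using the ids from the mapping table above. The function name and the 'Lower'/'Higher'/'Same' labels (taken from the question) are illustrative choices, not part of the SQL.

```python
# Rank of each rating: 0 is the best grade, as in the mapping table above
RATING_ID = {"AAA": 0, "AA+": 1, "AA": 2, "AA-": 3, "A+": 4, "A": 5, "A-": 6, "BBB+": 7}

def status(t0, t1):
    """Describe the move from rating t0 to rating t1."""
    id0, id1 = RATING_ID[t0], RATING_ID[t1]
    if id0 == id1:
        return "Same"
    # A larger id means a worse grade, so a growing id is a downgrade
    return "Lower" if id1 > id0 else "Higher"

print(status("AAA", "AA"))  # Lower
print(status("A", "AA"))    # Higher
print(status("A-", "A-"))   # Same
```

Because a smaller id means a better grade, a move to a larger id is a downgrade.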
I have two data sets. FIRST is a list of products and their daily prices from a supplier and SECOND is a list of start and end dates (as well as other important data for analysis). How can I tell Stata to pull the price at the beginning date and then the price at the end date from FIRST into SECOND for the given dates. Please note, if there is no exact matching date I would like it to grab the last date available. For example, if SECOND has the date 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ... it would grab the 12/31/2012 price.
I would usually do this with Excel, but I have millions of observations, and it is not feasible.
I have put an example of FIRST and SECOND as well as what the optimal solution would give as an output POST_SECOND
FIRST
Product Price Date
1 3 1/1/2010
1 3 1/3/2010
1 4 1/4/2010
1 2 1/8/2010
2 1 1/1/2010
2 5 2/5/2010
3 7 12/26/2009
3 2 1/1/2010
3 6 4/3/2010
SECOND
Product Start Date End Date
1 1/3/2010 1/4/2010
2 1/1/2010 1/1/2010
3 12/26/2009 4/3/2010
POST_SECOND
Product Start Date End Date Price_Start Price_End
1 1/3/2010 1/4/2010 3 4
2 1/1/2010 1/1/2010 1 1
3 12/26/2009 4/3/2010 7 6
Here's a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.
/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product byte Price str12 str_date
1 3 "1/1/2010"
1 3 "1/3/2010"
1 4 "1/4/2010"
1 2 "1/8/2010"
2 1 "1/1/2010"
2 5 "2/5/2010"
3 7 "12/26/2009"
3 7 "12/28/2009"
3 2 "1/1/2010"
3 6 "4/3/2010"
4 8 "12/30/2012"
4 9 "12/31/2012"
4 10 "1/2/2013"
4 10 "1/3/2013"
end
gen Date = date(str_date,"MDY")
format Date %td
drop str_date
save "First.dta", replace
clear
input byte Product str12 str_Start_Date str12 str_End_Date
1 "1/3/2010" "1/4/2010"
2 "1/1/2010" "1/1/2010"
3 "12/27/2009" "4/3/2010"
4 "1/1/2013" "1/2/2013"
end
gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace
/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen
/* distance to the closest date on or before each target date; later dates are ignored */
bys Product: egen ads = min(cond(Date <= Start_Date, Start_Date - Date, .))
bys Product: egen ade = min(cond(Date <= End_Date, End_Date - Date, .))
keep if (ads == Start_Date - Date) | (ade == End_Date - Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs
*Some people are reshapers. Others are collapsers. Often both can get the job done, but I think collapse is easier in this case.
In Stata, I've never been able to get something like this to work nicely in one step (something you can do in SAS via a SQL call). In any case, I think you'd be better off creating an intermediate file from FIRST.dta and then merging it twice, once on each of your StartDate and EndDate variables in SECOND.dta.
Say you have data for price adjustments from Jan 1, 2010 to Dec 31, 2013 (specified at varied intervals, as you have shown above). I assume all the date variables are already in date format in FIRST.dta and SECOND.dta, and that variable names in SECOND do not have spaces in them.
tempfile prod prices
use FIRST.dta, clear
keep Product
duplicates drop
save `prod'
clear
set obs 1461 /* one observation per day, Jan 1, 2010 through Dec 31, 2013 */
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'
merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1]
save `prices'
use SECOND.dta, clear
foreach i in Start End {
rename `i'Date Date
merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
rename Price Price_`i'
rename Date `i'Date
}
This should work if I understand your data structures correctly, and it should address the issue being discussed in the comments to @Dimitriy's answer. I'm open to critiques on how to make this nicer, as it's something I've had to do a few times and this is how I usually go about it.
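The rule from the question (take the price on the exact date, or else the last available earlier date) can also be written as a sorted lookup. Here is a minimal sketch in Python rather than Stata, using a subset of the example data from the first answer. The function name price_as_of is hypothetical, and the sketch assumes each product's history is already sorted by date.

```python
from bisect import bisect_right
from datetime import date

# Price history per product, sorted by date (subset of the example data above)
prices = {
    1: [(date(2010, 1, 1), 3), (date(2010, 1, 3), 3), (date(2010, 1, 4), 4), (date(2010, 1, 8), 2)],
    4: [(date(2012, 12, 30), 8), (date(2012, 12, 31), 9), (date(2013, 1, 2), 10), (date(2013, 1, 3), 10)],
}

def price_as_of(product, target):
    """Return the price on the latest date <= target, or None if there is none."""
    history = prices[product]
    dates = [d for d, _ in history]
    pos = bisect_right(dates, target)  # number of dates on or before target
    return history[pos - 1][1] if pos else None

print(price_as_of(1, date(2010, 1, 3)))  # 3
print(price_as_of(4, date(2013, 1, 1)))  # 9
print(price_as_of(4, date(2012, 1, 1)))  # None
```

The second call reproduces the case from the question: a 1/1/2013 request falls back to the 12/31/2012 price.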