I have two data sets. FIRST is a list of products and their daily prices from a supplier, and SECOND is a list of start and end dates (along with other data needed for the analysis). How can I tell Stata to pull the price at the start date and then the price at the end date from FIRST into SECOND for the given dates? Please note, if there is no exact matching date I would like it to grab the last date available. For example, if SECOND has the date 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ..., it would grab the 12/31/2012 price.
I would usually do this in Excel, but with millions of observations that is not feasible.
Below is an example of FIRST and SECOND, as well as the output the optimal solution would give (POST_SECOND).
FIRST
Product  Price  Date
1        3      1/1/2010
1        3      1/3/2010
1        4      1/4/2010
1        2      1/8/2010
2        1      1/1/2010
2        5      2/5/2010
3        7      12/26/2009
3        2      1/1/2010
3        6      4/3/2010
SECOND
Product  Start Date  End Date
1        1/3/2010    1/4/2010
2        1/1/2010    1/1/2010
3        12/26/2009  4/3/2010
POST_SECOND
Product  Start Date  End Date  Price_Start  Price_End
1        1/3/2010    1/4/2010  3            4
2        1/1/2010    1/1/2010  1            1
3        12/26/2009  4/3/2010  7            6
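What is being asked for is an "as-of" lookup: for each target date, take the price on the last available date at or before it. As a hedged sketch of just that rule (plain Python, not Stata; the dates are from the 1/1/2013 example in the question, the prices attached to them are hypothetical):

```python
from bisect import bisect_right
from datetime import date

def price_asof(dates, prices, target):
    """Price on the last available date <= target; dates must be sorted ascending."""
    i = bisect_right(dates, target)
    return prices[i - 1] if i > 0 else None

# Dates from the question's example; the prices are hypothetical
dates  = [date(2012, 12, 30), date(2012, 12, 31), date(2013, 1, 2)]
prices = [8, 9, 10]
print(price_asof(dates, prices, date(2013, 1, 1)))  # 9 -- falls back to 12/31/2012
```

The Stata answers below implement this same fallback rule at scale.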
Here's a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.
/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product byte Price str12 str_date
1 3 "1/1/2010"
1 3 "1/3/2010"
1 4 "1/4/2010"
1 2 "1/8/2010"
2 1 "1/1/2010"
2 5 "2/5/2010"
3 7 "12/26/2009"
3 7 "12/28/2009"
3 2 "1/1/2010"
3 6 "4/3/2010"
4 8 "12/30/2012"
4 9 "12/31/2012"
4 10 "1/2/2013"
4 10 "1/3/2013"
end
gen Date = date(str_date,"MDY")
format Date %td
drop str_date
save "First.dta", replace
clear
input byte Product str12 str_Start_Date str12 str_End_Date
1 "1/3/2010" "1/4/2010"
2 "1/1/2010" "1/1/2010"
3 "12/27/2009" "4/3/2010"
4 "1/1/2013" "1/2/2013"
end
gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace
/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen
bys Product: egen ads = min(abs(Start_Date-Date))
bys Product: egen ade = min(abs(End_Date - Date))
keep if (ads==abs(Date - Start_Date) & Date <= Start_Date) | (ade==abs(Date - End_Date) & Date <= End_Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs
*Some people are reshapers. Others are collapsers. Often both can get the job done, but I think collapse is easier in this case.
In Stata, I've never been able to get something like this to work nicely in one step (something you can do in SAS via a SQL call). In any case, I think you'd be better off creating an intermediate file from FIRST.dta and then merging it twice, once on each of the StartDate and EndDate variables in SECOND.dta.
Say you have data for price adjustments from Jan 1, 2010 to Dec 31, 2013 (specified with varied intervals as you have shown above). I assume all the date variables are already in date format in FIRST.dta and SECOND.dta, and that the variable names in SECOND do not contain spaces.
tempfile prod prices
use FIRST.dta, clear
keep Product
duplicates drop
save `prod'
clear
set obs 1461 /* one obs per day, 1jan2010 through 31dec2013 */
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'
merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1]
save `prices'
use SECOND.dta, clear
foreach i in Start End {
rename `i'Date Date
merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
rename Price Price_`i'
rename Date `i'Date
}
This should work if I understand your data structures correctly, and it should address the issue being discussed in the comments to @Dimitriy's answer. I'm open to critiques on how to make this nicer, as it's something I've had to do a few times and this is how I usually go about it.
I'm using SSMS version 18.9.2 and I'm trying to get a count of IDs who gave a gift in the year immediately after the year of their FIRST gift. Meaning, if a person's first gift was in 2019 and they gave a gift in 2020, then row count = 1; if the next person's first gift was also in 2019 but they did NOT give a gift in 2020, the row count remains 1 even though we have reviewed a total of two people. Hope that makes sense.
Using sample data like this, I would expect my row count to be 1, returning only ID 2:
ID  Date
1   3/8/1981
1   2/11/1988
1   2/15/1995
2   2/22/1982
2   2/24/1983
2   3/15/1983
2   2/17/1984
3   2/16/1984
3   3/13/1984
3   6/13/1986
4   2/2/1983
4   3/11/1985
4   3/21/1986
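As a check on the requirement (not SQL, just a hedged Python sketch): count the IDs whose set of gift years contains the year immediately after their first gift year, using the sample rows above:

```python
from collections import defaultdict

# (ID, year) pairs transcribed from the sample data above
gifts = [(1, 1981), (1, 1988), (1, 1995),
         (2, 1982), (2, 1983), (2, 1983), (2, 1984),
         (3, 1984), (3, 1984), (3, 1986),
         (4, 1983), (4, 1985), (4, 1986)]

years = defaultdict(set)
for gid, yr in gifts:
    years[gid].add(yr)

# IDs that gave a gift in the year immediately after their first gift year
qualifying = sorted(g for g, ys in years.items() if min(ys) + 1 in ys)
print(qualifying)  # [2]
```

Only ID 2 (first gift 1982, gifts again in 1983) qualifies, so the expected row count is 1.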
This is the closest I've gotten to working. Notice the two different HAVING clauses: the first works, but the second, which is how it actually needs to work, fails:
SELECT DISTINCT
gifts1.giftid,
YEAR(gifts2.gifteffdat) AS 'MINYR',
YEAR(MIN(gifts1.gifteffdat)) AS 'MINYR+1'
FROM
gifts AS gifts1
INNER JOIN
gifts AS gifts2 ON gifts1.giftid = gifts2.giftid
AND DATEDIFF(year, gifts2.gifteffdat, gifts1.gifteffdat) = 1
GROUP BY
gifts1.giftid, gifts1.gifteffdat, gifts2.gifteffdat
-- THIS HAVING WORKS
HAVING
(YEAR(gifts2.gifteffdat) = 1982)
-- THIS HAVING DOESN'T WORK
-- HAVING YEAR(gifts2.gifteffdat) = YEAR(MIN(gifts1.gifteffdat)) +1
I appreciate any help! Thank you!
I have records in a database with start and end dates.
I want to filter out all records that are in the range of a given start and end date.
I have two queries that work: one gives me records between the two dates, and the other gives me records whose start or end date falls in the range.
How do I combine these two LINQ queries into a single one that works both ways?
Linq 1
schedulelist = (From c In db.Table1 Where c.StartDate.Value.Date <= objStartDate.Date And c.EndDate.Value.Date >= objStartDate.Date And
c.UserID = CInt(Session("UserID"))).ToList()
Linq 2
schedulelist = (From c In db.Table1 Where (c.StartDate.Value.Date >=
objStartDate.Date And c.StartDate.Value.Date <= objEndDate.Date) Or
(c.EndDate.Value.Date >= objStartDate.Date And c.EndDate.Value.Date <=
objEndDate.Date) And c.UserID = CInt(Session("UserID"))).ToList()
The db table has these values:
StartDate EndDate
2019-10-08 07:00:00.000 2019-10-30 07:00:00.000
2019-10-15 07:00:00.000 2019-10-27 07:00:00.000
If I search with objStartDate 15/10/2019 00:00:00 and objEndDate 27/10/2019 00:00:00,
I get record 2 when I run Linq 2,
and I get record 1 when I run Linq 1.
What I should get is both records from either Linq 1 or Linq 2.
So what is the better solution to combine both into one query, or is this query all wrong?
The simplest query to check if date range 1 intersects date range 2 is this:
... WHERE range2_end > range1_start
AND range1_end > range2_start
This covers all the cases: range 1 fully inside range 2, fully containing it, starting inside it, or ending inside it.
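That predicate translates directly into code. A minimal Python sketch (not the LINQ itself), checked against the two rows and the 15/10-27/10 search window from the question:

```python
from datetime import date

def ranges_overlap(start1, end1, start2, end2):
    """True when range 1 and range 2 intersect (strict comparison, as in the answer)."""
    return end2 > start1 and end1 > start2

# The two rows and the search window from the question
window = (date(2019, 10, 15), date(2019, 10, 27))
rows = [(date(2019, 10, 8),  date(2019, 10, 30)),
        (date(2019, 10, 15), date(2019, 10, 27))]
print([ranges_overlap(s, e, *window) for s, e in rows])  # [True, True]
```

Both rows match, which is the behaviour the asker wanted from a single combined query.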
The member below returns a running total between the first and the chosen date. Is it possible to stop the aggregation one day/week/month before the chosen date?
WITH
MEMBER [Measures].[SUM] AS
AGGREGATE(
NULL:TAIL(EXISTING [Date].[Date].[Date].Members).Item(0),
[Measures].[X]
)
Here is an example (the date can be a day, month, year, ...):
DATE X SUM
------------
1 1 NULL
2 4 1
3 2 5
4 2 7
I think you've almost got it. To end the aggregation a number of days before the chosen date, you can use lag:
WITH
MEMBER [Measures].[SUM] AS
AGGREGATE(
NULL
:
TAIL(
EXISTING [Date].[Date].[Date].Members
).Item(0).lag(7) //<<<< finishes 7 days before chosen date
,[Measures].[X]
)
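The effect of lag on the running total can be mimicked outside MDX. A hedged Python sketch (a lag of 1 reproduces the SUM column in the question's example table):

```python
def running_total_lagged(xs, lag):
    """Cumulative sum ending `lag` positions before each row; None when out of range."""
    return [sum(xs[:i - lag + 1]) if i - lag >= 0 else None
            for i in range(len(xs))]

x = [1, 4, 2, 2]                   # the X column from the question
print(running_total_lagged(x, 1))  # [None, 1, 5, 7] -- the SUM column
print(running_total_lagged(x, 0))  # [1, 5, 7, 9] -- plain running total
```

So `lag(7)` in the answer is the same idea with the endpoint moved back seven date members.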
Software: Stata
I have two datasets: one of company CEOs (dataset 1) and one of business agreements signed (dataset 2).
Dataset 1 is the following format, sorted by company:
company 1, CEO name, start date, end date, etc.
company 1, CEO name, start date, end date, etc.
...
company 2, CEO name, start date, end date, etc.
Dataset 2 is the following format, sorted by agreement (each with 2-150 parties):
agreement 1, party 1, party 1 accession date, party 2, party 2 accession date.
agreement 2, party 1, party 1 accession date, party 2, party 2 accession date.
I want to write a code that, for each individual CEO, counts the number of agreements signed by the CEO's company in his/her tenure as CEO.
So far I have created a CEO-day dataset with expand.
gen duration = enddate - startdate
expand duration -1
sort id startdate
by id: gen n = _n -1
gen day = startdate + n
Ideally I would proceed with a code like this:
collapse (count) agreement, by(id)
However, Dataset 2 lists the different parties as different variables. Company 1 is not always "party 1"; sometimes it may be "party 150". Also, each party may have a different accession date. I need a loop that "scans" Dataset 2 for agreements where company 1 acceded to the agreement as one of the parties, with an accession date falling within the period during which CEO 1 of company 1 was CEO of company 1.
What should I do? Do I need to create a loop?
A loop is not strictly necessary. You can try using reshape and joinby:
clear
set more off
*----- example data -----
// ceo data set
input ///
firm str15(ceo startd endd)
1 "pete" "01/04/1999" "05/12/2010"
1 "bill" "06/12/2010" "12/01/2011"
1 "lisa" "13/01/2011" "15/06/2014"
2 "mary" "01/04/1999" "05/12/2010"
2 "hank" "06/12/2010" "12/01/2011"
2 "mary" "13/01/2011" "15/06/2014"
3 "bob" "01/04/1999" "05/12/2010"
3 "john" "06/12/2010" "12/01/2011"
end
gen double startd2 = date(startd, "DMY")
gen double endd2 = date(endd, "DMY")
format %td startd2 endd2
drop startd endd
tempfile ceo
save "`ceo'"
clear
// agreement data set
input ///
agree party1 str15 p1acc party2 str15 p2acc
1 2 "09/12/2010" 3 "10/01/2011"
2 1 "05/06/1999" 2 "17/01/2011"
3 1 "06/06/1999" 3 "05/04/1999"
4 2 "07/01/2011" . ""
5 2 "08/01/2011" . ""
end
gen double p1accn = date(p1acc, "DMY")
gen double p2accn = date(p2acc, "DMY")
format %td p?accn
drop p?acc
*----- what you want -----
// reshape
gen i = _n
reshape long party p@accn, i(i)
rename (party paccn) (firm date)
order firm agree date
sort firm agree
drop i _j
// joinby
joinby firm using "`ceo'"
// find under which ceo, agreement was signed
gen tag = inrange(date, startd2, endd2)
list, sepby(firm)
// count
keep if tag
collapse (count) agreenum=tag, by(ceo firm)
list
A potential pitfall is joinby creating so many observations that you run out of memory.
See help datetime if you have no experience with dates in Stata.
(Notice how I set up example data for your problem. Providing it helps others help you.)
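The reshape-then-match logic can also be sketched in plain Python (same example firms, CEOs, and accession dates as above; this only illustrates the matching rule, it is not a substitute for joinby on large data):

```python
from datetime import date

# (firm, ceo, tenure_start, tenure_end) from the ceo data set above
tenures = [
    (1, "pete", date(1999, 4, 1),  date(2010, 12, 5)),
    (1, "bill", date(2010, 12, 6), date(2011, 1, 12)),
    (1, "lisa", date(2011, 1, 13), date(2014, 6, 15)),
    (2, "mary", date(1999, 4, 1),  date(2010, 12, 5)),
    (2, "hank", date(2010, 12, 6), date(2011, 1, 12)),
    (2, "mary", date(2011, 1, 13), date(2014, 6, 15)),
    (3, "bob",  date(1999, 4, 1),  date(2010, 12, 5)),
    (3, "john", date(2010, 12, 6), date(2011, 1, 12)),
]

# (agreement, firm, accession date) -- the "long" form after reshape
accessions = [
    (1, 2, date(2010, 12, 9)), (1, 3, date(2011, 1, 10)),
    (2, 1, date(1999, 6, 5)),  (2, 2, date(2011, 1, 17)),
    (3, 1, date(1999, 6, 6)),  (3, 3, date(1999, 4, 5)),
    (4, 2, date(2011, 1, 7)),  (5, 2, date(2011, 1, 8)),
]

# Count agreements acceded to during each CEO's tenure at each firm
counts = {}
for firm, ceo, start, end in tenures:
    for _, f, d in accessions:
        if f == firm and start <= d <= end:
            counts[(firm, ceo)] = counts.get((firm, ceo), 0) + 1
print(counts)
```

Note that mary's two tenures at firm 2 accumulate into one count, matching the collapse by(ceo firm) in the Stata code.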
I have a CSV file with some high-frequency stock price data, and I'd like to get per-second price data from the table.
In each file there are columns named date, time, symbol, price, volume, etc.
There are some seconds with no trading, so data is missing for those seconds.
I'm wondering how I can fill the missing data in q to get complete per-second data from 9:30 to 16:00. If a price is missing for a second, just use the most recent price as the price for that second.
I'm considering writing some loop, but I don't know exactly how to do that.
Simplifying a little, I'll assume you have some random timestamps in your dataset like this:
time price
--------------------------------------
2015.01.20D22:42:34.776607000 7
2015.01.20D22:42:34.886607000 3
2015.01.20D22:42:36.776607000 4
2015.01.20D22:42:37.776607000 8
2015.01.20D22:42:37.886607000 7
2015.01.20D22:42:39.776607000 9
2015.01.20D22:42:40.776607000 4
2015.01.20D22:42:41.776607000 9
so there are some missing seconds there. I'm going to call this table t. So if you do a by-second type of query, obviously the seconds that are missing are still missing:
q)select max price by time.second from t
second | price
--------| -----
22:42:34| 7
22:42:36| 4
22:42:37| 8
22:42:39| 9
22:42:40| 4
22:42:41| 9
To get missing seconds, you have to join a list of nulls. In this case we know the data goes from 22:42:34 to 22:42:41, but in reality you'll have to find the min/max time and use that to create a temporary "null" table to join against:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N)
second price
--------------
22:42:34
22:42:35
22:42:36
22:42:37
22:42:38
22:42:39
22:42:40
22:42:41
Then left join:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35
22:42:36 4
22:42:37 8
22:42:38
22:42:39 9
22:42:40 4
22:42:41 9
You can use fills or whatever your favourite filling heuristic is after that.
q)fills `second xasc ([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35 7
22:42:36 4
22:42:37 8
22:42:38 8
22:42:39 9
22:42:40 4
22:42:41 9
(Note the sort on second before fills!)
By the way for larger tables this will be much faster than a loop. Loops in q are generally a bad idea.
EDIT
You could use a comma join too; both tables need to be keyed on the second column:
t,t1
(where t1 is the null-filled table keyed on second)
I haven't tested it, but I suspect it would be slightly faster than the lj version.
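The rack-then-fill idea is not q-specific. A hedged Python sketch of the same left-join-and-forward-fill, using plain second offsets in place of the timestamps above:

```python
def fill_seconds(ticks):
    """ticks: dict mapping second -> price. Returns a forward-filled list of
    (second, price) covering the full min..max range, like the q lj + fills."""
    lo, hi = min(ticks), max(ticks)
    out, last = [], None
    for s in range(lo, hi + 1):
        last = ticks.get(s, last)  # on a gap, keep the most recent price
        out.append((s, last))
    return out

# Per-second max prices from the example (seconds 34..41, with 35 and 38 missing)
ticks = {34: 7, 36: 4, 37: 8, 39: 9, 40: 4, 41: 9}
print([p for _, p in fill_seconds(ticks)])  # [7, 7, 4, 8, 8, 9, 4, 9]
```

The output matches the filled table above; in q, the vectorised lj/fills version avoids this explicit loop entirely.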
Using aj, which is one of the most powerful features of kdb+:
q)data
sym time price size
----------------------------
MS 10:24:04 93.35974 8
MS 10:10:47 4.586986 1
APPL 10:50:23 0.7831685 1
GOOG 10:19:52 49.17305 0
The in-memory table needs to be sorted by sym, time, with the g# attribute applied to the sym column:
q)data:update `g#sym from `sym`time xasc data
q)meta data
c | t f a
-----| -----
sym | s g
time | v
price| f
size | j
Creating a rack table, intervalized per second per sym:
q)rack: `sym`time xasc (select distinct sym from data) cross ([] time:{x[0]+til `int$x[1]-x[0]}(min;max)#\:data`time)
Using aj to join the data:
q)aj[`sym`time; rack; data]