Efficient way to get max date before a given date - sql-server

Suppose I have a table called Transaction and another table called Price. Price holds the prices for given funds at different dates. Each fund has prices added at various dates, but not at every possible date. So for fund XYZ I may have prices for 1 May, 7 May and 13 May, and fund ABC may have prices at 3 May, 9 May and 11 May.
Now I'm looking for the price that was prevailing for a fund at the date of a transaction. Say the transaction was for fund XYZ on 10 May. What I want is the latest known price at that date, which will be the price from 7 May.
Here's the code:
select d.TransactionID, d.FundCode, d.TransactionDate, v.OfferPrice
from Transaction d
inner join Price v
    on v.FundCode = d.FundCode
    and v.PriceDate = (
        select max(PriceDate)
        from Price
        where FundCode = v.FundCode
        /* */ and PriceDate < d.TransactionDate
    )
It works, but it is very slow (several minutes in real world use). If I remove the line with the leading comment, the query is very quick (2 seconds or so) but it then uses the latest price per fund, which is wrong.
The bad part is that the price table is minuscule compared to some of the other tables we use, and it isn't clear to me why it is so slow. I suspect the offending line forces SQL Server to process a Cartesian product, but I don't know how to avoid it.
I keep hoping to find a more efficient way to do this, but it has so far escaped me. Any ideas?

You don't specify the version of SQL Server you're using, but if you are on a version that supports ranking functions and CTEs, I think you'll find this quite a bit more performant than using a correlated subquery within your join.
It should be very similar in performance to Andriy's queries. Depending on the exact index topography of your tables, one approach might be slightly faster than another.
I tend to like CTE-based approaches because the resulting code is quite a bit more readable (in my opinion). Hope this helps!
;WITH set_gen (TransactionID, OfferPrice, Match_val)
AS
(
    SELECT d.TransactionID, v.OfferPrice,
           ROW_NUMBER() OVER (PARTITION BY d.TransactionID
                              ORDER BY v.PriceDate DESC) AS Match_val -- DESC so row 1 is the latest price
    FROM Transaction d
    INNER JOIN Price v
        ON v.FundCode = d.FundCode
    WHERE v.PriceDate <= d.TransactionDate
)
SELECT sg.TransactionID, d.FundCode, d.TransactionDate, sg.OfferPrice
FROM Transaction d
INNER JOIN set_gen sg ON d.TransactionID = sg.TransactionID
WHERE sg.Match_val = 1

There's a method for finding rows with maximum or minimum values that uses a LEFT JOIN to self, rather than the more intuitive, but probably more costly, INNER JOIN to a self-derived aggregated list.
Basically, the method uses this pattern:
SELECT t.*
FROM t
LEFT JOIN t AS t2
    ON t.key = t2.key
    AND t2.Value > t.Value /* ">" is for getting maximums; "<" is for minimums */
WHERE t2.key IS NULL
or its NOT EXISTS counterpart:
SELECT *
FROM t
WHERE NOT EXISTS (
    SELECT *
    FROM t AS t2
    WHERE t.key = t2.key
    AND t2.Value > t.Value /* the same note about ">" applies here as well */
)
So, the result is all the rows for which there doesn't exist a row with the same key and a greater value.
When there's just one table, applying the method is straightforward. It may be less obvious how to apply it when another table is involved, especially when, as in your case, the other table doesn't just add a join but also supplies additional filtering for the values we are looking for, namely the upper limit on the dates.
So, here's what the resulting query might look like when applying the LEFT JOIN version of the method:
SELECT
    d.TransactionID,
    d.FundCode,
    d.TransactionDate,
    v.OfferPrice
FROM Transaction d
INNER JOIN Price v
    ON v.FundCode = d.FundCode
    AND v.PriceDate < d.TransactionDate  /* only prices before the transaction qualify */
LEFT JOIN Price v2
    ON v2.FundCode = v.FundCode          /* this and */
    AND v2.PriceDate > v.PriceDate       /* this are where we apply the method; */
    AND v2.PriceDate < d.TransactionDate /* and this is where we limit the maximum value */
WHERE v2.FundCode IS NULL
And here's a similar solution with NOT EXISTS:
SELECT
    d.TransactionID,
    d.FundCode,
    d.TransactionDate,
    v.OfferPrice
FROM Transaction d
INNER JOIN Price v
    ON v.FundCode = d.FundCode
    AND v.PriceDate < d.TransactionDate  /* only prices before the transaction qualify */
WHERE NOT EXISTS (
    SELECT *
    FROM Price v2
    WHERE v2.FundCode = v.FundCode       /* this and */
    AND v2.PriceDate > v.PriceDate       /* this are where we apply the method; */
    AND v2.PriceDate < d.TransactionDate /* and this is where we limit the maximum value */
)

Are both PriceDate and TransactionDate indexed? If not, you are doing table scans, which is likely the cause of the performance bottleneck.
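For example, a pair of covering indexes along these lines (hypothetical names, columns taken from the query above) would let the correlated MAX(PriceDate) lookup run as an index seek instead of a scan per transaction row:
CREATE NONCLUSTERED INDEX IX_Price_FundCode_PriceDate
    ON Price (FundCode, PriceDate)
    INCLUDE (OfferPrice); -- covers both the MAX(PriceDate) seek and the OfferPrice lookup

CREATE NONCLUSTERED INDEX IX_Transaction_FundCode_Date
    ON [Transaction] (FundCode, TransactionDate);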

Related

Missing Rows when running SELECT in SQL Server

I have a simple select statement. It's basically two CTEs, one of which includes a ROW_NUMBER() OVER (PARTITION BY ...), and then a join from these into four other tables. No functions or anything unusual.
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
), Last_Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey
FROM
Safety_Check_CTE
WHERE
Check_No = 1
)
SELECT
COUNT(*)
FROM
Last_Safety_Check_CTE lc
JOIN
Pitches.Fact_Unit_Safety_Checks f ON lc.Fact_Unit_Safety_Checks_Wkey = f.Fact_Unit_Safety_Checks_Wkey
JOIN
DIM.Dim_Unit u ON f.Dim_Unit_Wkey = u.Dim_Unit_Wkey
JOIN
DIM.Dim_Safety_Check_Type t ON f.Dim_Safety_Check_Type_Wkey = t.Dim_Safety_Check_Type_Wkey
JOIN
DIM.Dim_Date d ON f.Dim_Safety_Check_Date_Wkey = d.Dim_Date_Wkey
WHERE
f.Safety_Check_Certificate_No IN ('GP/KB11007') --option (maxdop 1)
Sometimes it returns 0, 1, or 2 rows. The result should obviously be consistent.
I ran a Profiler trace while replicating the issue, and my session was the only one in the database.
I have compared the actual execution plans and they are the same, except that the final hash match returns the differing number of rows.
I cannot replicate the issue if I use MAXDOP 1.
Posting my comment as the answer:
My guess is that ORDER BY [Dim_Safety_Check_Date_Wkey] is not deterministic.
In the CTEs you are finding the [Fact_Unit_Safety_Checks_Wkey] associated with the most recent row for any given [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey] combination... with no regard for whether or not [Safety_Check_Certificate_No] is equal to 'GP/KB11007'.
Then, in the outer query, you are filtering results based on [Safety_Check_Certificate_No] = 'GP/KB11007'.
So, unless the most recent [Fact_Unit_Safety_Checks_Wkey] happens to have [Safety_Check_Certificate_No] = 'GP/KB11007', the data is going to be filtered out.
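If that is the cause, one possible fix (assuming [Fact_Unit_Safety_Checks_Wkey] is the table's unique key) is to add it as a tiebreaker so the ORDER BY becomes deterministic and ties cannot be broken differently between parallel executions:
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
                   ORDER BY [Dim_Safety_Check_Date_Wkey] DESC,
                            [Fact_Unit_Safety_Checks_Wkey] DESC) AS Check_No -- unique tiebreaker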

Summation in SQL over a computed column

I have a Transact-SQL question concerning summation over a computed column.
I am having a problem with double-counting of these computed values.
Usually I would extract all the raw data and post-process it in Perl, but I can't do that on this occasion due to the particular reporting system we need to use. I'm relatively inexperienced with the intricacies of SQL, so I thought I'd refer this to the experts.
My data is arranged in the following tables (highly simplified and reduced for the purposes of clarity):
Patient table:
PatientId
PatientSer
Course table:
PatientSer
CourseSer
CourseId
Diagnosis table:
PatientSer
DiagnosisId
Plan table:
PlanSer
CourseSer
PlanId
Field table:
PlanSer
FieldId
FractionNumber
FieldDateTime
What I would like to do is find the difference between the maximum and minimum fraction numbers over a range of dates (FieldDateTime in the Field table). I would then like to sum these values over the plan ids associated with a course, but I do not want to double-count over the two particular diagnosis ids (A or B or both) that I may encounter for a patient.
So, for a patient with two diagnosis codes (A and B) and two plans in the same course of treatment (Plan1 and Plan2), with a difference in fraction numbers of 24 for the first plan and 5 for the second, what I would like to get out is something like this:
PatientId  CourseId  PlanId  DiagnosisId  FractionNumberDiff  Sum
AB1234     1         Plan1   A            24                  29
AB1234     1         Plan1   B            *                   *
AB1234     1         Plan2   A            5                   *
AB1234     1         Plan2   B            *                   *
I've racked my brains about how to do this, and I've tried the following:
SELECT
Patient.PatientId,
Course.CourseId,
Plan.PlanId,
MAX(fractionnumber OVER PARTITION(Plan.PlanSer)) - MIN(fractionnumber OVER PARTITION(Plan.PlanSer)) AS FractionNumberDiff,
SUM(FractionNumberDiff OVER PARTITION(Course.CourseSer)
FROM
Patient P
INNER JOIN
Course C ON (P.PatientSer = C.PatientSer)
INNER JOIN
Plan Pl ON (Pl.CourseSer = C.CourseSer)
INNER JOIN
Diagnosis D ON (D.PatientSer = P.PatientSer)
INNER JOIN
Field F ON (F.PlanSer = Pl.PlanSer)
WHERE
FieldDateTime > [Start Date]
AND FieldDateTime < [End Date]
But this just double-counts over the diagnosis codes, meaning that I end up with 58 instead of 29.
Any ideas about what I can do?
Change the FractionNumberDiff to
MAX(fractionnumber) OVER (PARTITION BY Plan.PlanSer) -
MIN(fractionnumber) OVER (PARTITION BY Plan.PlanSer) AS FractionNumberDiff
and remove the "SUM(FractionNumberDiff OVER PARTITION(Course.CourseSer)" line.
Then make the existing query a derived table and calculate SUM(FractionNumberDiff) there (the derived table must expose CourseSer for the partition):
SELECT *, SUM(FractionNumberDiff) OVER (PARTITION BY CourseSer)
FROM
(
< the modified existing query here>
) AS d
As for the double-counting issue, please post some sample data and the expected result.
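In the meantime, here is a sketch of how the pieces might fit together, using the table and column names from the question (the Diagnosis join is left out here, since it only multiplies rows; treat this as illustrative rather than definitive):
SELECT d.*,
       SUM(d.FractionNumberDiff) OVER (PARTITION BY d.CourseSer) AS CourseSum
FROM (
    -- DISTINCT collapses the per-field rows down to one row per plan
    SELECT DISTINCT
        P.PatientId,
        C.CourseId,
        C.CourseSer,
        Pl.PlanId,
        MAX(F.FractionNumber) OVER (PARTITION BY Pl.PlanSer)
          - MIN(F.FractionNumber) OVER (PARTITION BY Pl.PlanSer) AS FractionNumberDiff
    FROM Patient P
    INNER JOIN Course C ON P.PatientSer = C.PatientSer
    INNER JOIN [Plan] Pl ON Pl.CourseSer = C.CourseSer
    INNER JOIN Field F ON F.PlanSer = Pl.PlanSer
    WHERE F.FieldDateTime > [Start Date]
      AND F.FieldDateTime < [End Date]
) AS d
For the example above this gives 24 + 5 = 29 for the course.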

Subtract 2 columns from PostgreSQL LEFT JOIN query with NULL values

I have a PostgreSQL query which should compute the actual stock of samples in our lab.
The initial samples are taken from a table (tblStudies), but there are then two tables to look in that decrease the number of samples.
So I made a union query for those two tables, and then matched the union query with tblStudies to calculate the actual stock.
But the union query only returns values when there is a decrease in samples.
So when a study still has its initial samples, no value is returned.
I figured out I should use a JOIN operation, but then I have NULL values for my studies with initial samples.
Here is how far I got. Any help please?
SELECT
"tblStudies"."Studie_ID", "SamplesWeggezet", c."Stalen_gebruikt", "SamplesWeggezet" - c."Stalen_gebruikt" as "Stock"
FROM
"Stability"."tblStudies"
LEFT JOIN
(
SELECT b."Studie_ID",sum(b."Stalen_gebruikt") as "Stalen_gebruikt"
FROM (
SELECT "tblAnalyses"."Studie_ID", sum("tblAnalyses"."Aant_stalen_gebruikt") AS "Stalen_gebruikt"
FROM "Stability"."tblAnalyses"
GROUP BY "tblAnalyses"."Studie_ID"
UNION
SELECT "tblStalenUitKamer"."Studie_ID", sum("tblStalenUitKamer".aant_stalen) AS "stalen_gebruikt"
FROM "Stability"."tblStalenUitKamer"
GROUP BY "tblStalenUitKamer"."Studie_ID"
) b
GROUP BY b."Studie_ID"
) c ON "tblStudies"."Studie_ID" = c."Studie_ID"
Because you're doing a LEFT JOIN to the inline query "c", some values of c."Stalen_gebruikt" can be NULL, and any number minus NULL yields NULL. To address this we can use COALESCE.
So change
"SamplesWeggezet" - c."Stalen_gebruikt" AS "Stock"
to
"SamplesWeggezet" - COALESCE(c."Stalen_gebruikt", 0) AS "Stock"

Oracle: Select values in date range with days where value is missing

I want to select values from a table within a date range.
Something like this:
SELECT
date_values.date_from,
date_values.date_to,
sum(values.value)
FROM values
inner join date_values on values.id_date = date_values.id
inner join date_units on date_values.id_unit = date_units.id
WHERE
date_values.date_from >= '14.1.2012' AND
date_values.date_to <= '30.1.2012' AND
date_units.id = 4
GROUP BY
date_values.date_from,
date_values.date_to
ORDER BY
date_values.date_from,
date_values.date_to;
But this query gives me back only the days on which there is a value, like this:
14.01.12 15.01.12 66
15.01.12 16.01.12 4
17.01.12 18.01.12 8
...etc
(Here 16.01.12 to 17.01.12 is missing.)
But I want to select the missing days too, with a zero value, like this:
14.01.12 15.01.12 66
15.01.12 16.01.12 4
16.01.12 17.01.12 0
17.01.12 18.01.12 8
...etc
I can't use PL/SQL, and a more general solution that I can expand for use with hours, months, and years would be great.
I'm going to assume you're providing date_from and date_to. If so, you can generate your list of dates first and then join to it to get the remainder of your result. Alternatively, you could UNION this query with your date_values table; since UNION does a DISTINCT, it will remove any duplicate rows.
If this is how the list of dates is generated:
select to_date('14.1.2012','dd.mm.yyyy') + level - 1 as date_from
, to_date('14.1.2012','dd.mm.yyyy') + level as date_to
from dual
connect by level <= to_date('30.1.2012','dd.mm.yyyy')
- to_date('14.1.2012','dd.mm.yyyy')
Your query might then become (note the unit filter has to move into the outer join condition, or the join will throw the missing days away again):
with the_dates as (
select to_date('14.1.2012','dd.mm.yyyy') + level - 1 as date_from
     , to_date('14.1.2012','dd.mm.yyyy') + level as date_to
  from dual
connect by level <= to_date('30.1.2012','dd.mm.yyyy')
                  - to_date('14.1.2012','dd.mm.yyyy')
)
select
    dv.date_from,
    dv.date_to,
    sum(nvl(v.value, 0)) -- nvl turns the missing days into zeroes
from ( -- every date in the range, with the matching date_values row if any
       select the_dates.date_from, the_dates.date_to, date_values.id
       from the_dates
       left outer join date_values
         on the_dates.date_from = date_values.date_from
        and date_values.id_unit = 4 -- the date_units.id = 4 filter, applied here
                                    -- so the left join keeps the missing days
     ) dv
left outer join values v
  on v.id_date = dv.id -- left join again, so days without values still appear
group by
    dv.date_from,
    dv.date_to
order by
    dv.date_from,
    dv.date_to;
The WITH syntax is known as sub-query factoring and isn't really needed in this case, but it makes the code cleaner.
I've also assumed that the date columns in date_values are, well, dates. It isn't obvious, as you're doing a string comparison. You should always explicitly convert to a date where applicable, and you should always store a date as a date. It saves a lot of hassle in the long run, as it becomes impossible for things to be input or compared incorrectly.
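For example, assuming the columns really are DATEs, the filter from your original query is safer written with explicit conversions:
WHERE date_values.date_from >= to_date('14.1.2012', 'dd.mm.yyyy')
  AND date_values.date_to   <= to_date('30.1.2012', 'dd.mm.yyyy')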

Sql MAX DateTime and if clause?

My 2nd SQL question on SO today - I need to brush up my SQL skills!
I have the following three tables....
Diary, Entry, and EntryType
A Diary can have many Entry(ies), each Entry is of a particular EntryType.
Entry has a created DateTime
EntryType has an Id, and SomeOtherValue (int)
EDIT
Sorry chaps. Thanks for the responses so far, but I didn't explain it well enough; the requirements have also changed a bit...
I would like the Id of any Diary whose latest Entry, considering only entries created after #lastDateTime and ignoring those with EntryType.SomeOtherValue = 0, has EntryType.SomeOtherValue = #someValue.
Or to put it another way: for all those Entry rows with a created DateTime > #lastDateTime, ignore the ones whose SomeOtherValue equals 0, and if the top 1 left over has SomeOtherValue = #someValue, return the Id of the diary!
Does that make sense? I'm not sure where I should be putting my MAX, WHERE, and HAVING (if anything at all!).
Thanks,
ETFairfax.
If you want the last entry, and then want to test whether that last entry's type has SomeOtherValue = #someValue, then this should be the right query:
select diaryid
from (
    select rn = row_number() over (partition by e.diaryid order by e.created desc),
           e.diaryid, et.someothervalue
    from entry e
    inner join entrytype et on e.entrytype = et.id
    where e.created > #lastDateTime
      and et.someothervalue <> 0 -- per the edit: zero values are ignored
) X
where rn = 1
and SomeOtherValue = #someValue -- of the last record
If, however, you mean: among the entries whose type has SomeOtherValue = #someValue, is the latest one created after #lastDateTime, then it is:
select e.diaryid
from entry e
inner join entrytype et on e.entrytype = et.id
where et.SomeOtherValue = #someValue
group by e.diaryid
having max(e.created) > #lastDateTime
Note: for a DB with proper references, the diaryid can be derived from entry without going back to diary. The only reason for linking back to diary is if you need the full diary record, or if there is no foreign key and you need to validate that the diaryid exists.
For the 2nd query, there is another way to write it that is still ANSI compliant and may be faster. It works by deducing that for MAX(e.created) to be greater than #lastDateTime, it is enough for ANY record to have e.created > #lastDateTime.
EDIT: As Andriy points out, the same fact means the HAVING clause of the 2nd query (directly preceding this statement) can be moved to the WHERE clause, but I have left that query in its original form to match the expression of the requirements (not using the deduced simplification).
select e.diaryid
from entry e
where exists (
select *
from entrytype et
where e.entrytype = et.id
and et.SomeOtherValue = #someValue)
and e.created > #lastDateTime
GROUP BY e.diaryid
This EXISTential test can stop inspecting entry types for an entry as soon as it finds one matching #someValue, and the created > #lastDateTime filter is applied directly in the WHERE clause, instead of fully processing the join and aggregating (MAX) before filtering rows out in the HAVING clause.
How about something like this:
select d.diary_id
from diary d
join entry e on(d.diary_id = e.diary_id)
join entry_type et on(e.entry_type_id = et.entry_type_id)
where et.someOthervalue = #someValue
group by d.diary_id
having max(e.created) > #lastDateTime;
Another approach if you are using SQL Server 2005+
With EntryDates As
(
Select E.DiaryId
, Max(E.Created) Over ( Partition By E.DiaryId ) As LastCreateDate
From Entry As E
Join EntryType As ET
On ET.Id = E.EntryTypeId
Where ET.SomeOtherValue = #SomeValue
)
Select DiaryId
From EntryDates
Where LastCreateDate > #lastDateTime
Group By DiaryId
SELECT D.DiaryID
FROM
Diary D
CROSS APPLY (
    SELECT TOP 1 E.EntryID
    FROM
    Entry E
    WHERE
    D.DiaryID = E.DiaryID
    AND E.CreatedDatetime > #LastDateTime
    AND EXISTS (
        SELECT 1 FROM EntryType ET
        WHERE E.EntryTypeID = ET.EntryTypeID
        AND ET.SomeOtherValue <> 0 -- or = #SomeValue ?
    )
    ORDER BY E.CreatedDatetime DESC
) X
An index in Entry on CreatedDatetime would be helpful, but either the table's clustered index should have DiaryID or the index should contain it as one of the indexed columns.
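For instance, something along these lines (hypothetical index name, columns taken from the query above):
CREATE NONCLUSTERED INDEX IX_Entry_DiaryID_Created
    ON Entry (DiaryID, CreatedDatetime DESC)
    INCLUDE (EntryTypeID); -- lets the TOP 1 ... ORDER BY CreatedDatetime DESC run as a seek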
I presume the EntryType table is small compared to the others, so it's most likely going to be a table scan anyway, and indexes won't help much there. If your execution plan shows a high number of reads on the EntryType table, please let me know and we'll try to fix it. I'm hoping the engine is smart enough to realize this table is small, hits it only once, and puts it on the left side of a LOOP join somewhere.
Please let me know how this goes and if it's not working right I'll tweak it. I'm sure we can get a really nicely-performing query going.
