My second SQL question on SO today - I need to brush up my SQL skills!
I have the following three tables:
Diary, Entry, and EntryType
A Diary can have many Entry(ies), each Entry is of a particular EntryType.
Entry has a created DateTime
EntryType has an Id, and SomeOtherValue (int)
EDIT
Sorry chaps. Thanks for the responses so far, but I didn't explain it well enough; the requirements have changed a bit.
I would like the Id of any Diary whose latest Entry (considering only entries created after @lastDateTime and with EntryType.someOtherValue != 0) has EntryType.someOtherValue == @someValue.
Or to put it another way: for all those Entry rows with a created DateTime > @lastDateTime, ignore the ones whose someOtherValue equals 0, and if the top 1 left over has someOtherValue = @someValue, return the Id of the diary!
Does that make sense? I'm not sure where I should be putting my MAX, WHERE, and HAVING (if anything at all)!
Thanks,
ETFairfax.
If you want the last entry, and then want to test whether that last entry's EntryType has someOtherValue = @someValue, then this should be the right query:
select diaryid
from (
    select rn = row_number() over (partition by e.diaryid order by e.created desc),
           e.diaryid, et.someothervalue
    from entry e
    inner join entrytype et on e.entrytype = et.id
    where e.created > @lastDateTime
) X
where rn = 1
  and SomeOtherValue = @someValue -- of the last record
If, however, you mean: among the entries whose EntryType has someOtherValue = @someValue, is the latest one created after @lastDateTime, then it is:
select e.diaryid
from entry e
inner join entrytype et on e.entrytype = et.id
where et.SomeOtherValue = @someValue
group by e.diaryid
having max(e.created) > @lastDateTime
Note: For a db with proper references, the diary_id can be derived from entry without going back to diary. The only reason for linking back to diary is if you need the full diary record, or if there is no foreign key and you need to validate that the diary_id exists.
For the 2nd type, there is another way to write this that is still ANSI compliant and may be faster. It works by deducing that for MAX(e.created) to be greater than @lastDateTime, it is enough for ANY record to have e.created > @lastDateTime, so the condition can be tested before aggregation.
EDIT: As Andriy points out for the 2nd query (directly preceding this statement), the HAVING clause can be moved to the WHERE clause based on the same fact, but I have left that query in the HAVING form to match the expression of the requirements (not using the deduced simplification).
select e.diaryid
from entry e
where exists (
    select *
    from entrytype et
    where e.entrytype = et.id
      and et.SomeOtherValue = @someValue)
  and e.created > @lastDateTime
group by e.diaryid
This EXISTential test can stop inspecting entry types for an entry as soon as it finds one matching @someValue, instead of fully processing the join and aggregating (MAX) before filtering it down in the HAVING clause.
How about something like this:
select d.diary_id
from diary d
join entry e on (d.diary_id = e.diary_id)
join entry_type et on (e.entry_type_id = et.entry_type_id)
where et.someOtherValue = @someValue
group by d.diary_id
having max(e.created) > @lastDateTime;
Another approach, if you are using SQL Server 2005+:
With EntryDates As
(
    Select E.DiaryId,
           Max(E.Created) Over ( Partition By E.DiaryId ) As LastCreateDate
    From Entry As E
    Join EntryType As ET
        On ET.Id = E.EntryTypeId
    Where ET.SomeOtherValue = @SomeValue
)
Select DiaryId
From EntryDates
Where LastCreateDate > @lastDateTime
Group By DiaryId
SELECT
    D.DiaryID
FROM
    Diary D
    CROSS APPLY (
        SELECT TOP 1 E.EntryID
        FROM
            Entry E
        WHERE
            D.DiaryID = E.DiaryID
            AND E.CreatedDatetime > @LastDateTime
            AND EXISTS (
                SELECT 1 FROM EntryType ET
                WHERE E.EntryTypeID = ET.EntryTypeID
                AND ET.SomeOtherValue <> 0 -- or = @SomeValue ?
            )
        ORDER BY E.CreatedDatetime DESC
    ) X
An index in Entry on CreatedDatetime would be helpful, but either the table's clustered index should have DiaryID or the index should contain it as one of the indexed columns.
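For illustration, here's a sketch of one way such an index might look (the index name and INCLUDE list are my assumptions, not anything given above):

-- Keyed on DiaryID first so the CROSS APPLY's TOP 1 ... ORDER BY
-- CreatedDatetime DESC can be answered with a per-diary seek.
CREATE NONCLUSTERED INDEX IX_Entry_DiaryID_Created
    ON Entry (DiaryID, CreatedDatetime DESC)
    INCLUDE (EntryID, EntryTypeID);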
I presume the EntryType table is small compared to the others, so it's most likely going to be a table scan anyway, and indexes won't help much. If your execution plan shows a high number of reads on the EntryType table, please let me know and we'll try to fix it. I'm hoping the engine is smart enough to realize this table is small, hits it only once, and puts it on the left side of a LOOP join somewhere.
Please let me know how this goes and if it's not working right I'll tweak it. I'm sure we can get a really nicely-performing query going.
I am trying to create a routine that can accept an SQL query as a string and the [table].[primaryKey] of the primary record in the returned dataset, then wrap that original query to implement pagination (return records 40-49 when requesting page 4 and 10 records per page).
The dataset returned by the original queries will frequently contain multiple instances of the primary record, one for each occurrence of supporting records. For the example provided, if a customer has three phone numbers on record the results for that customer in the original query would look like:
{5; John Smith; 205 W. Fort St; 17; Home; 123-123-4587}
{5; John Smith; 205 W. Fort St; 18; Work; 123-123-8547}
{5; John Smith; 205 W. Fort St; 19; Mobile; 123-123-1147}
I'm almost there, I think, with the following query:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;

WITH OriginalQuery AS (
    SELECT [Customer].[Id],
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id],
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].*
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[{Customer.Id}] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[{Customer.Id}]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
  AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
This solution performs a SELECT DISTINCT on the primary key of the primary (Customer) record, applies ROW_NUMBER(), and then joins the result back to the original query, so that each unique primary (customer) record is numbered 1 - {end of file} and I can pull only the RowNumber range that I want.
But because OriginalQuery may have multiple fields named Id (from different tables), I can't figure out how to properly access [Customer].[Id] in my SELECT DISTINCT clause of [RowNumberQuery] or in the INNER JOIN.
Is there a better way to implement pagination at the SQL level, or a more direct method of accessing the field I need from within the subquery based on the table to which it belongs?
EDIT:
I've caused confusion in the pagination I am looking for. I am using Dapper in C# to compile the resulting dataset into individual complex objects, so the goal in the example would be to retrieve customers 31-40 in the list regardless of how many individual records exist for each customer. If Customer 31 had five phone records, Customer 32 had three phone records, Customer 33 had 1 phone record, and the remaining seven customers had two phone records each, I would expect the resulting dataset to contain 23 records total, but only 10 distinct customers.
SOLUTION
Thank you for all of the assistance, and I apologize for those areas I should have clarified sooner. I am creating a toolset that will allow C# Data Access Libraries to implement a set of standard parameters. If I have an option to implement the pagination in an internal function that can accept the SQL statement, I can defer to the toolset and not have to remember (or count on others to remember) to add the appropriate text each time. I'll set it up to return the finished objects, but if I were going to just modify the original query string it would look like:
public static string AddPagination(string sql, string primaryKey, Parameter requestParameters)
{
    return $"WITH OriginalQuery AS ({sql.Replace("SELECT ", $"SELECT DENSE_RANK() OVER (ORDER BY {primaryKey}) AS PrimaryRecordCount, ", StringComparison.OrdinalIgnoreCase)}) " +
           $"SELECT TOP ({requestParameters.MaxRecords}) * " +
           $"FROM OriginalQuery " +
           $"WHERE PrimaryRecordCount >= 1 + (({requestParameters.PageNumber - 1}) * {requestParameters.RecordsPerPage})" +
           $" AND PrimaryRecordCount <= {requestParameters.PageNumber} * {requestParameters.RecordsPerPage}";
}
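For illustration, with the example Customer/Phone query, primaryKey = "[Customer].[Id]", page 4, and 10 records per page, the generated statement would look roughly like this (the TOP value of 1000 is a made-up stand-in for requestParameters.MaxRecords):

WITH OriginalQuery AS (
    SELECT DENSE_RANK() OVER (ORDER BY [Customer].[Id]) AS PrimaryRecordCount,
           [Customer].[Id] AS CustomerId, [Customer].[Name], [Customer].[Address],
           [Phone].[Id] AS PhoneId, [Phone].[Type], [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT TOP (1000) *
FROM OriginalQuery
WHERE PrimaryRecordCount >= 1 + ((4 - 1) * 10)  -- customers 31 onward
  AND PrimaryRecordCount <= 4 * 10;             -- through customer 40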
Just give your columns a different alias in your original query, e.g. [Customer].[Id] AS CustomerId, [Phone].[Id] AS PhoneId..., then you can reference OriginalQuery.CustomerId, or OriginalQuery.PhoneId
e.g.
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;

WITH OriginalQuery AS (
    SELECT [Customer].[Id] AS CustomerId,
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id] AS PhoneId,
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].*
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[CustomerId] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[CustomerId]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
  AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
It's worth noting that your paging logic is wrong too. Currently you are adding the page number to the records per page, so you are searching for:
Page 1: Customers 1 - 10
Page 2: Customers 2 - 11
Page 3: Customers 3 - 12
Your logic should be:
WHERE [WrappedQuery].[RowNumber] >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
  AND [WrappedQuery].[RowNumber] <= (@PageNumber * @RecordsPerPage)
Page 1: Customers 1 - 10
Page 2: Customers 11 - 20
Page 3: Customers 21 - 30
With that being said, you could just use DENSE_RANK() rather than ROW_NUMBER(), which would simplify everything. I think this would give you the same result:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;

WITH OriginalQuery AS (
    SELECT c.Id AS CustomerId,
           c.Name,
           c.Address,
           p.Id AS PhoneId,
           p.Type,
           p.Number,
           DENSE_RANK() OVER(ORDER BY c.Id) AS RowNumber
    FROM Customer AS c INNER JOIN Phone AS p ON c.Id = p.CustomerId
)
SELECT oq.CustomerId, oq.Name, oq.Address, oq.PhoneId, oq.Type, oq.Number
FROM OriginalQuery AS oq
WHERE oq.RowNumber >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
  AND oq.RowNumber <= (@PageNumber * @RecordsPerPage);
I've added table aliases to try and make the code a bit cleaner, and also removed all the unnecessary square brackets. This is not necessary, but I personally find them quite hard on the eye, and only use them to escape keywords.
Another difference is that adding ORDER BY c.Id ensures consistent results for your paging. Using ORDER BY (SELECT NULL) implies that you don't care about the order, but you should if you're using it for paging.
There are many concerns with what you are trying to do, and you might be better off explaining why you are trying to build this process.
SQL query as a string
You are receiving a SQL query as a string; how are you parsing that string into the OriginalQuery CTE? This raises concerns both about SQL injection and about global temp tables, if you are using those.
Secondly, your example isn't doing pagination as it is commonly understood. If someone were to request page 1 with 10 records per page, the calling application would expect to receive the first 10 records of the result set, but your example returns all records for the first 10 customers, meaning the result could be 40+ rows if each customer had 4 phone numbers as in your example data.
You should take a look at OFFSET and FETCH NEXT, and question why there is a requirement to parse an arbitrary SQL string at all. There is probably a better way to do that.
Here is a rough example using OFFSET and FETCH NEXT on a static query, returning only @RecordsPerPage records.
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;

SELECT [Customer].[Id],
       [Customer].[Name],
       [Customer].[Address],
       [Phone].[Id],
       [Phone].[Type],
       [Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
ORDER BY [Customer].[Id]
OFFSET (@PageNumber - 1) * @RecordsPerPage ROWS
FETCH NEXT @RecordsPerPage ROWS ONLY
If you wanted to return all records for the @RecordsPerPage number of customers which have a corresponding phone number, then it would be something like...
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;

SELECT [Customer].[Id],
       [Customer].[Name],
       [Customer].[Address],
       [Phone].[Id],
       [Phone].[Type],
       [Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
WHERE [Customer].[Id] IN (
    SELECT DISTINCT [Customer].[Id]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
    ORDER BY [Customer].[Id]
    OFFSET (@PageNumber - 1) * @RecordsPerPage ROWS
    FETCH NEXT @RecordsPerPage ROWS ONLY
)
This does leave a question: what is the point of this query when the calling application can just use its own OFFSET and FETCH NEXT? The callers already have the SQL to generate the initial dataset; all they need to do is add OFFSET / FETCH NEXT to the end of it and they have their own pagination, without trying to wrap it in a procedure of some sort.
To draw a comparison: would you create a stored procedure that accepts a SQL string and then filters specific fields by specific values? Or would the people calling that stored procedure just add a WHERE clause to their own queries instead?
You can use an alias name for the duplicated columns.
For example:
WITH OriginalQuery AS (
    SELECT [Customer].[Id] as CustomerID,
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id] as PhoneID,
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
Now you can use the two ids with their alias names in the next query.
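For example, the follow-on query can then disambiguate the two ids via their aliases (a minimal sketch; the filter value is made up):

SELECT oq.CustomerID, oq.[Name], oq.PhoneID, oq.[Number]
FROM OriginalQuery AS oq    -- the CTE defined above
WHERE oq.CustomerID = 5;    -- example filter on the now-unambiguous id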
select *
from dataentry with (nolock)
where clientnumber in (
'00602',
'00897',
'00940'
)
Background: clientnumber is a value that the client sends to me, which gets inserted into the database. The 3 numbers above aren't inserted in the database by me YET, so the search returns nothing when I look them up. Is there a way for SQL to show those numbers as null values?
Maybe one way to do this is using NOT IN with the client numbers I have already inserted? But say there are 2000 numbers already inserted; that would be very inefficient. What would be the best way to do this?
You can left join to a derived table containing the "numbers" (well, strings with a possible numeric representation actually).
SELECT de.clientnumber
FROM (VALUES ('00602'),
             ('00897'),
             ('00940')) AS n (n)
LEFT JOIN dataentry AS de
       ON de.clientnumber = n.n;
That'll give you as many NULLs as there are "numbers" in the list that aren't present in the table, and each "number" that is in the table as many times as it occurs there.
Or, what may be a bit more practical, use NOT EXISTS to get the "numbers" not in the table (yet).
SELECT n.n AS clientnumber
FROM (VALUES ('00602'),
             ('00897'),
             ('00940')) AS n (n)
WHERE NOT EXISTS (SELECT *
                  FROM dataentry AS de
                  WHERE de.clientnumber = n.n);
Or even an indicator whether they are present or not.
SELECT n.n AS clientnumber,
       CASE
           WHEN EXISTS (SELECT *
                        FROM dataentry AS de
                        WHERE de.clientnumber = n.n) THEN
               'true'
           ELSE
               'false'
       END AS existence
FROM (VALUES ('00602'),
             ('00897'),
             ('00940')) AS n (n);
I'm trying to optimize or completely rewrite this query. It currently takes about ~1500 ms to run. I know the DISTINCTs are fairly inefficient, as is the UNION, but I'm struggling to figure out exactly where to go from here.
I am thinking that the first select statement might not be needed to return output of the form:
[Key | User_ID, (User_ID)]
Note: Program and Program_Scenario both use clustered indexes. I can provide a screenshot of the execution plan if needed.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID] (@_CompKey INT)
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @UseID AS VARCHAR(8000);
    SET @UseID = '';

    SELECT @UseID = @UseID + ', ' + x.User_ID
    FROM
        (SELECT DISTINCT UPPER(p.User_ID) AS User_ID
         FROM [dbo].[Program] AS p WITH (NOLOCK)
         WHERE p.CompKey = @_CompKey
         UNION
         SELECT DISTINCT UPPER(ps.User_ID) AS User_ID
         FROM [dbo].[Program] AS p WITH (NOLOCK)
         LEFT OUTER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK)
             ON p.ProgKey = ps.ProgKey
         WHERE p.CompKey = @_CompKey
           AND ps.User_ID IS NOT NULL) x;

    RETURN Substring(@UseID, 3, 8000);
END
There are two things happening in this query:
1. Locating rows in the [Program] table matching the specified CompKey (@_CompKey).
2. Locating rows in the [Program_Scenario] table that have the same ProgKey as the rows located in (1) above.
Finally, non-null User_IDs from both these sets of rows are concatenated into a scalar.
For step 1 to be efficient, you'd need an index on the CompKey column (clustered or non-clustered)
For step 2 to be efficient, you'd need an index on the join key which is ProgKey on the Program_Scenario table (this likely is a non-clustered index as I can't imagine ProgKey to be PK). Likely, SQL would resort to a loop join strategy - i.e., for each row found in [Program] matching the CompKey criteria, it would need to lookup corresponding rows in [Program_Scenario] with same ProgKey. This is a guess though, as there is not sufficient information on the cardinality and distribution of data.
Ensure the above two indexes are present.
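A sketch of those two indexes, assuming the column names above (the index names and INCLUDE lists are illustrative, not from the original post):

-- Step 1: seek Program rows by CompKey; INCLUDE covers the columns the query reads.
CREATE NONCLUSTERED INDEX IX_Program_CompKey
    ON dbo.Program (CompKey) INCLUDE (ProgKey, User_ID);

-- Step 2: support the join into Program_Scenario on ProgKey.
CREATE NONCLUSTERED INDEX IX_Program_Scenario_ProgKey
    ON dbo.Program_Scenario (ProgKey) INCLUDE (User_ID);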
Also, as others have noted, the second LEFT OUTER JOIN is a bit confusing, as an inner join is the right way to deal with it.
Per my interpretation, the inner part of the query can be rewritten this way. This is also the query you'd ideally run and optimize before tackling the string concatenation part. The DISTINCT is dropped, as it is implied by UNION. Try this version of the query along with the indexes above, and if it provides the necessary boost, then include the string concatenation or the XML STUFF approach to return a scalar.
SELECT UPPER(p.User_ID) AS User_ID
FROM [dbo].[Program] AS p WITH (NOLOCK)
WHERE p.CompKey = @_CompKey
UNION
SELECT UPPER(ps.User_ID) AS User_ID
FROM [dbo].[Program] AS p WITH (NOLOCK)
INNER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK)
    ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
  AND ps.User_ID IS NOT NULL
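If you happen to be on SQL Server 2017 or later, the scalar step can also be a simple STRING_AGG wrapper over that same UNION, as a newer alternative to the XML STUFF trick (a sketch under that version assumption):

SELECT STRING_AGG(x.User_ID, ', ') AS User_IDs
FROM (
    SELECT UPPER(p.User_ID) AS User_ID
    FROM dbo.Program AS p
    WHERE p.CompKey = @_CompKey
    UNION
    SELECT UPPER(ps.User_ID)
    FROM dbo.Program AS p
    INNER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
    WHERE p.CompKey = @_CompKey
      AND ps.User_ID IS NOT NULL
) AS x;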
I am taking a shot in the dark here. I am guessing that the last code you posted is still a scalar function, and it also did not have all the logic of your original query. Again, this is a shot in the dark, since there are no table definitions or sample data posted.
This might be how this would look as an inline table valued function.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID]
(
    @_CompKey INT
) RETURNS TABLE AS RETURN
select MyResult = STUFF(
    (
        SELECT ', ' + x.User_ID
        FROM (
            SELECT DISTINCT UPPER(p.User_ID) AS User_ID
            FROM dbo.Program AS p
            WHERE p.CompKey = @_CompKey
            UNION
            SELECT DISTINCT UPPER(ps.User_ID)
            FROM dbo.Program AS p
            LEFT OUTER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
            WHERE p.CompKey = @_CompKey
              AND ps.User_ID IS NOT NULL
        ) x
        for xml path ('')
    ), 1, 2, '')
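With the inline TVF version, callers would select from the function rather than invoking it as a scalar; usage would be along these lines:

SELECT f.MyResult
FROM dbo.Fn_Get_Del_User_ID(@_CompKey) AS f;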
I have a table in SQL Server where user is allowed to make changes to the employee's details. Every time a new record is placed in the EMPLOYEE_HIST table. Only the EMP_ID is kept constant for the employee, and all other details are modifiable.
Also there the is a SEQ_NO column which maintains the sequence of entries made.
EMPLOYEE_HIST:
SEQ_NO EMP_ID SOME_VAL1 SOME_VAL2
1 E1 V11 V21 (initial value of this employee)
2 E2 V12 V22 (initial value of this employee)
3 E3 V13 V23 (initial value of this employee)
4 E2 V00 V22
5 E1 V01 V21
6 E2 V02 V22
7 E4 V00 V00 (initial value of this employee)
I want a query which will give me changes made to particular employees, something like
EMP_ID SOME_VAL1_OLD SOME_VAL1_NEW SOME_VAL2_OLD SOME_VAL2_NEW
E1 V11 V01 V21 V21
E2 V12 V00 V22 V22
E2 V00 V02 V22 V22
UPDATE
Also, employee details may be modified by the user any number of times, and for each change a row should be present in the result set.
Please help.
EDIT:
I finally settled on using the LAG function. It works like this:
SELECT *, ROW_NUMBER() OVER (PARTITION BY EMP_ID, CHANGE_NO ORDER BY SEQ_NO) AS RN
FROM (
    SELECT * FROM (SELECT EMP_ID, SEQ_NO, '1' AS CHANGE_NO,
                          LAG(SOME_VAL1) OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO) AS OLD_VAL,
                          SOME_VAL1 AS NEW_VAL
                   FROM EMPLOYEE_HIST) T1
    WHERE OLD_VAL <> NEW_VAL
    UNION ALL
    SELECT * FROM (SELECT EMP_ID, SEQ_NO, '2' AS CHANGE_NO,
                          LAG(SOME_VAL2) OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO) AS OLD_VAL,
                          SOME_VAL2 AS NEW_VAL
                   FROM EMPLOYEE_HIST) T2
    WHERE OLD_VAL <> NEW_VAL
) TEMP
But the performance is terribly slow: fetching a total of 500 rows from a table containing 3 million records. Please give some suggestions for reducing the sorting cost.
You can use a CTE with a Window function if you're using 2008 or newer:
;WITH r AS (
SELECT RANK() OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO DESC) [rank]
, EMP_ID
, SOME_VAL1
, SOME_VAL2
FROM EMPLOYEE_HIST
)
SELECT e.EMP_ID
, s2.SOME_VAL1 [SOME_VAL1_OLD]
, s1.SOME_VAL1 [SOME_VAL1_NEW]
, s2.SOME_VAL2 [SOME_VAL2_OLD]
, s1.SOME_VAL2 [SOME_VAL2_NEW]
FROM (SELECT DISTINCT EMP_ID FROM EMPLOYEE_HIST) AS e
LEFT JOIN r AS s1 ON e.EMP_ID = s1.EMP_ID and s1.rank = 1 --the last change
LEFT JOIN r AS s2 ON e.EMP_ID = s2.EMP_ID and s2.rank = 2 --the second to last change
If you want all of the changes, not just the top two, then you should be able to do something like this:
;WITH r AS (
SELECT RANK() OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO DESC) [rank]
, EMP_ID
, SOME_VAL1
, SOME_VAL2
FROM EMPLOYEE_HIST
)
SELECT e.EMP_ID
, s2.SOME_VAL1 [SOME_VAL1_OLD]
, s1.SOME_VAL1 [SOME_VAL1_NEW]
, s2.SOME_VAL2 [SOME_VAL2_OLD]
, s1.SOME_VAL2 [SOME_VAL2_NEW]
FROM (SELECT DISTINCT EMP_ID FROM EMPLOYEE_HIST) AS e
LEFT JOIN (r AS s1 --the change
INNER JOIN r AS s2 ON s1.EMP_ID = s2.EMP_ID and s2.rank = s1.rank + 1) --previous value
ON e.EMP_ID = s1.EMP_ID
This should enumerate all changes until it encounters the original value.
You could use a CTE to get a partitioned row number, by EMP_ID. Then join that against itself where the row number is offset by 1.
;WITH PartitionedRows AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY EMP_ID ORDER BY SEQ_NO) AS RowID,
           EMP_ID, SOME_VAL1, SOME_VAL2
    FROM EMPLOYEE_HIST
)
SELECT a.EMP_ID,
       b.SOME_VAL1 AS SOME_VAL1_OLD, a.SOME_VAL1 AS SOME_VAL1_NEW,
       b.SOME_VAL2 AS SOME_VAL2_OLD, a.SOME_VAL2 AS SOME_VAL2_NEW
FROM PartitionedRows a
LEFT JOIN PartitionedRows b ON a.EMP_ID = b.EMP_ID AND a.RowID = (b.RowID + 1)
WHERE b.RowID IS NOT NULL
You may be better off with a different data model. You could have a table EMPLOYEE_HIST_OLD with an identical structure. This would allow you to archive the former data (even with a timestamp and/or sequence number), keep the EMPLOYEE_HIST table smaller, and keep it free of data you would not reference regularly. It would also allow for a basic join between the two tables.
I would then suggest you use the timestamp of the EMPLOYEE_HIST_OLD records to find the most recent modifications, then join those records back to the current records. This will present only the changed records. You could limit the query on EMPLOYEE_HIST_OLD to return just one record (the most recent) if you like: SQL query to get most recent row for each instance of a given key
If you must stay within the single EMPLOYEE_HIST table and use the sequence-number approach, you may wish to use COUNT() to find changed records for a particular employee ID and return the values ordered by sequence number. You could also limit the query to employees with count > 1. You would then view the data vertically, though. Parsing the values into separate columns like VAR1_OLD and VAR1 essentially requires you to read only the last two values and make one record out of two, and you lose visibility of the earlier changes when viewing the data horizontally; there could be more than one historical change. Viewing the records horizontally would require some array manipulation outside of SQL after the data is returned from the query.
For info on counting:
SQL query for finding records where count > 1
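Against the EMPLOYEE_HIST table in the question, that count-based filter might look like this (a sketch; it returns only employees with at least one change):

SELECT EMP_ID, COUNT(*) AS Versions
FROM EMPLOYEE_HIST
GROUP BY EMP_ID
HAVING COUNT(*) > 1
ORDER BY EMP_ID;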
Suppose I have a table called Transaction and another table called Price. Price holds the prices for given funds at different dates. Each fund will have prices added at various dates, but they won't have prices at all possible dates. So for fund XYZ I may have prices for the 1 May, 7 May and 13 May and fund ABC may have prices at 3 May, 9 May and 11 May.
So now I'm looking at the price that was prevailing for a fund at the date of a transaction. The transaction was for fund XYZ on 10 May. What I want, is the latest known price on that day, which will be the price for 7 May.
Here's the code:
select d.TransactionID, d.FundCode, d.TransactionDate, v.OfferPrice
from Transaction d
inner join Price v
on v.FundCode = d.FundCode
and v.PriceDate = (
select max(PriceDate)
from Price
where FundCode = v.FundCode
/* */ and PriceDate < d.TransactionDate
)
It works, but it is very slow (several minutes in real world use). If I remove the line with the leading comment, the query is very quick (2 seconds or so) but it then uses the latest price per fund, which is wrong.
The bad part is that the price table is minuscule compared to some of the other tables we use, and it isn't clear to me why it is so slow. I suspect the offending line forces SQL Server to process a Cartesian product, but I don't know how to avoid it.
I keep hoping to find a more efficient way to do this, but it has so far escaped me. Any ideas?
You don't specify the version of SQL Server you're using, but if you are using a version with support for ranking functions and CTE queries I think you'll find this quite a bit more performant than using a correlated subquery within your join statement.
It should be very similar in performance to Andriy's queries. Depending on the exact index topography of your tables, one approach might be slightly faster than another.
I tend to like CTE-based approaches because the resulting code is quite a bit more readable (in my opinion). Hope this helps!
;WITH set_gen (TransactionID, OfferPrice, Match_val)
AS
(
    SELECT d.TransactionID, v.OfferPrice,
           -- DESC so that Match_val = 1 is the latest price on or before the transaction
           ROW_NUMBER() OVER(PARTITION BY d.TransactionID ORDER BY v.PriceDate DESC) AS Match_val
    FROM Transaction d
    INNER JOIN Price v
        ON v.FundCode = d.FundCode
    WHERE v.PriceDate <= d.TransactionDate
)
SELECT sg.TransactionID, d.FundCode, d.TransactionDate, sg.OfferPrice
FROM Transaction d
INNER JOIN set_gen sg ON d.TransactionID = sg.TransactionID
WHERE sg.Match_val = 1
There's a method for finding rows with maximum or minimum values which involves a LEFT JOIN to self, rather than the more intuitive, but probably more costly, INNER JOIN to a self-derived aggregated list.
Basically, the method uses this pattern:
SELECT t.*
FROM t
LEFT JOIN t AS t2 ON t.key = t2.key
AND t2.Value > t.Value /* ">" is when getting maximums; "<" is for minimums */
WHERE t2.key IS NULL
or its NOT EXISTS counterpart:
SELECT *
FROM t
WHERE NOT EXISTS (
SELECT *
FROM t AS t2
WHERE t.key = t2.key
AND t2.Value > t.Value /* same as above applies to ">" here as well */
)
So, the result is all the rows for which there doesn't exist a row with the same key and a value greater than the given one.
When there's just one table, applying the above method is pretty straightforward. However, it may not be as obvious how to apply it when there's another table, especially when, as in your case, the other table makes the actual query more complex not merely by being there, but also by providing additional filtering for the values we are looking for, namely the upper limits for the dates.
So, here's what the resulting query might look like when applying the LEFT JOIN version of the method:
SELECT
d.TransactionID,
d.FundCode,
d.TransactionDate,
v.OfferPrice
FROM Transaction d
INNER JOIN Price v ON v.FundCode = d.FundCode
LEFT JOIN Price v2 ON v2.FundCode = v.FundCode /* this and */
AND v2.PriceDate > v.PriceDate /* this are where we are applying
the above method; */
AND v2.PriceDate < d.TransactionDate /* and this is where we are limiting
the maximum value */
WHERE v2.FundCode IS NULL
And here's a similar solution with NOT EXISTS:
SELECT
d.TransactionID,
d.FundCode,
d.TransactionDate,
v.OfferPrice
FROM Transaction d
INNER JOIN Price v ON v.FundCode = d.FundCode
WHERE NOT EXISTS (
SELECT *
FROM Price v2
WHERE v2.FundCode = v.FundCode /* this and */
AND v2.PriceDate > v.PriceDate /* this are where we are applying
the above method; */
AND v2.PriceDate < d.TransactionDate /* and this is where we are limiting
the maximum value */
)
Are both PriceDate and TransactionDate indexed? If not, you are doing table scans, which are likely the cause of the performance bottleneck.
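If they aren't, indexes along these lines would let the correlated "latest price before date" lookup seek instead of scan (a sketch, assuming the table and column names from the question; Transaction is bracketed because it is a reserved word):

-- Serves "latest price for this fund before this date" lookups.
CREATE NONCLUSTERED INDEX IX_Price_FundCode_PriceDate
    ON Price (FundCode, PriceDate) INCLUDE (OfferPrice);

-- Helps the join/filter on the transaction side.
CREATE NONCLUSTERED INDEX IX_Transaction_FundCode_Date
    ON [Transaction] (FundCode, TransactionDate);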