Filtering a complex SQL Query - sql-server

Unit - hmy, scode, hProperty
InsurancePolicy - hmy, hUnit, dtEffective, sStatus
Select MAX(i2.dtEffective) as maxdate, u.hMy, MAX(i2.hmy) as InsuranceId,
i2.sStatus
from unit u
left join InsurancePolicy i2 on i2.hUnit = u.hMy
and i2.sStatus in ('Active', 'Cancelled', 'Expired')
where u.hProperty = 2
Group By u.hmy, i2.sStatus
order by u.hmy
This query will return values for the Insurance Policy with the latest Effective Date (Max(dtEffective)). I added Max(i2.hmy) so if there was more than one Insurance Policy for the latest Effective Date, it will return the one with the highest ID (i2.hmy) in the database.
Suppose there was a Unit that had 3 Insurance Policies attached with the same latest effective date, each with a different sStatus.
The result would look like this:
maxdate   UnitID   InsuranceID   sStatus
1/23/12   2949     1938          'Active'
1/23/12   2949     2343          'Cancelled'
1/23/12   2949     4323          'Expired'
How do I filter the results so that, if there are multiple Insurance Policies with different statuses for the same unit and same date, the policy with the 'Active' status is chosen first; if one doesn't exist, the 'Cancelled' one; and if that doesn't exist either, the 'Expired' one?

This seems to be a matter of proper ranking of InsurancePolicy's rows and then joining Unit to the set of the former's top-ranked rows:
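;WITH ranked AS (
  SELECT
    *,
    -- 'Active' < 'Cancelled' < 'Expired' alphabetically, so ordering by
    -- sStatus happens to match the required priority
    rnk = ROW_NUMBER() OVER (
      PARTITION BY hUnit
      ORDER BY dtEffective DESC, sStatus, hmy DESC
    )
  FROM InsurancePolicy
  WHERE sStatus IN ('Active', 'Cancelled', 'Expired')  -- same filter as the question's join
)
SELECT
  i2.dtEffective AS maxdate,
  u.hMy,
  i2.hmy AS InsuranceId,
  i2.sStatus
FROM Unit u
LEFT JOIN ranked i2 ON i2.hUnit = u.hMy AND i2.rnk = 1
WHERE u.hProperty = 2  -- property filter from the question
ORDER BY u.hMy;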

You could make this work with one SQL statement, but it would be nearly unreadable to your everyday T-SQL developer. I would suggest breaking this query up into a few steps.
First, I would declare a table variable and place all the records that require no manipulation into it (i.e., units that do not have multiple statuses for the same date: the "good" records).
Then, get a list of the records that need work done on them (multiple statuses on the same date for the same UnitID) and place them in a second table variable. I would create a "rank" column within this table variable using a CASE expression like this:
CASE sStatus WHEN 'Active' THEN 1 WHEN 'Cancelled' THEN 2 WHEN 'Expired' THEN 3 END
Then delete the rank 2 and rank 3 records for any unit and date that also has a rank 1 record.
Then delete the rank 3 records for any unit and date that also has a rank 2 record.
Finally, merge this updated table variable with the table variable containing your "good" records.
It is easy to get sucked into trying to do too much within one SQL statement. Break up the tasks to make them easier to develop and more manageable in the future. If you have to edit this SQL in a few years' time you will be thanking yourself, not to mention any other developers who may have to take over your code.
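For comparison, the same CASE ranking can also drive a window function directly, which avoids depending on the alphabetical order of the status names. This is only a minimal sketch: the table variable and its sample rows are made up for illustration, and only the column names come from the question.

```sql
-- Hypothetical sample data; the table variable stands in for InsurancePolicy.
DECLARE @InsurancePolicy TABLE (hmy int, hUnit int, dtEffective date, sStatus varchar(20));
INSERT INTO @InsurancePolicy (hmy, hUnit, dtEffective, sStatus) VALUES
    (1938, 2949, '2012-01-23', 'Active'),
    (2343, 2949, '2012-01-23', 'Cancelled'),
    (4323, 2949, '2012-01-23', 'Expired'),
    (5000, 3000, '2012-02-01', 'Expired'),
    (5001, 3000, '2011-06-15', 'Active');

WITH ranked AS (
    SELECT hmy, hUnit, dtEffective, sStatus,
           ROW_NUMBER() OVER (
               PARTITION BY hUnit
               ORDER BY dtEffective DESC,          -- latest effective date first
                        CASE sStatus               -- then explicit status priority
                            WHEN 'Active'    THEN 1
                            WHEN 'Cancelled' THEN 2
                            WHEN 'Expired'   THEN 3
                        END,
                        hmy DESC                   -- then highest ID as tie-breaker
           ) AS rnk
    FROM @InsurancePolicy
    WHERE sStatus IN ('Active', 'Cancelled', 'Expired')
)
SELECT dtEffective AS maxdate, hUnit, hmy AS InsuranceId, sStatus
FROM ranked
WHERE rnk = 1
ORDER BY hUnit;
```

With this sample data, unit 2949 returns policy 1938 ('Active'), and unit 3000 returns policy 5000 ('Expired'), because the latest effective date is considered before the status.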

Related

SQL Server: Slowly Changing Dimension Type 2 on historical records

I am trying to set up a SCD of Type 2 for historical records within my Customer table. Attached is how the Customer table is set up alongside the expected outcome. Note that the Customer table in practice has 2 million distinct Customer IDs. I tried to use the query below, but the Start_Date and End_Date are repeating for each row.
SELECT t.Customer_ID, t.Lifecyle_ID, t.Date As Start_Date,
LEAD(t.Date) OVER (ORDER BY t.Date) AS End_Date
FROM Customer AS t
I think a three-step query is likely needed.
1. Use LEAD and LAG, partitioned by Customer and ordered by date, to peek at the next row's values for both Date and Lifecycle.
2. Use a CASE expression to emit a value for End Date when the current row's Lifecycle <> the next row's Lifecycle (otherwise emit NULL). Now do the same using LAG for the Effective Date.
3. Group By or Distinct on the output from Step #2.
Hopefully that makes sense. I'll try to post a code example later today, but hopefully that's enough to get you started.
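In the meantime, here is a rough sketch of one way this could look. It is a variation on the steps above rather than a literal implementation: it uses LAG to keep only the rows where the lifecycle changes, then LEAD over those rows to find each period's end. The table variable and sample rows are invented for illustration, the column names (including the Lifecyle_ID spelling) follow the question, and it assumes SQL Server 2012 or later and that a period ends on the date the next one starts.

```sql
-- Hypothetical sample data; the table variable stands in for Customer.
DECLARE @Customer TABLE (Customer_ID int, Lifecyle_ID int, [Date] date);
INSERT INTO @Customer (Customer_ID, Lifecyle_ID, [Date]) VALUES
    (1, 10, '2020-01-01'),
    (1, 10, '2020-02-01'),
    (1, 20, '2020-03-01'),
    (1, 20, '2020-04-01'),
    (2, 10, '2020-01-15');

WITH flagged AS (
    SELECT Customer_ID, Lifecyle_ID, [Date],
           -- 1 when this row starts a new lifecycle period for the customer
           -- (LAG is NULL on the first row, so the comparison falls to ELSE)
           CASE WHEN LAG(Lifecyle_ID) OVER (PARTITION BY Customer_ID ORDER BY [Date]) = Lifecyle_ID
                THEN 0 ELSE 1 END AS is_start
    FROM @Customer
),
starts AS (
    SELECT Customer_ID, Lifecyle_ID, [Date] AS Start_Date
    FROM flagged
    WHERE is_start = 1
)
SELECT Customer_ID, Lifecyle_ID, Start_Date,
       -- the next period's start date; NULL for the current (open) period
       LEAD(Start_Date) OVER (PARTITION BY Customer_ID ORDER BY Start_Date) AS End_Date
FROM starts
ORDER BY Customer_ID, Start_Date;
```

The key difference from the query in the question is the PARTITION BY Customer_ID: without it, LEAD looks at the next row across all customers.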

How do you efficiently pull data from multiple records into 1 record

I currently have data in a table related to transactions. Each record has a purchase ID, a Transaction Number, and up to 5 purchases assigned to the transaction. Associated with each purchase there can be up to 10 transactions. For the first transaction of each purchase I need a field that is a string of each unique purchase concatenated. My solution was slow; I estimated it would take 40 days to complete. What would be the most effective way to do this?
What you are looking for can be achieved in 2 steps:
Step 1: Extracting the first transaction of each purchase
Depending upon your table configuration this can be done in a couple of different ways.
If your transaction IDs are sequential, you can use something like:
-- YourTable is a placeholder for the real table name
select a.* from YourTable a
inner join
(select purchaseid, min(transactionid) as transactionid
from YourTable group by purchaseid) b
on a.purchaseid = b.purchaseid and a.transactionid = b.transactionid
If there is a date variable driving the transaction sequence, then:
select a.* from
(select *, row_number() over (partition by purchaseid order by date) as rownum from YourTable) a
where a.rownum = 1
Step 2: Concatenating the Purchase details
This can be done with the STRING_AGG function if you are using SQL Server 2017 or later. If not, the following link highlights a couple of different ways you can do this:
Optimal way to concatenate/aggregate strings
Hope this helps.
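As a rough illustration of Step 2, here is a small STRING_AGG sketch. The question does not show the real table layout, so this assumes a normalized shape with one row per transaction and purchase item; the table variable, column names, and sample values are all hypothetical.

```sql
-- Hypothetical shape: one row per (transaction, purchase item).
DECLARE @TransactionPurchases TABLE (TransactionNumber int, PurchaseItem varchar(50));
INSERT INTO @TransactionPurchases (TransactionNumber, PurchaseItem) VALUES
    (1, 'apple'), (1, 'pear'), (1, 'apple'),
    (2, 'plum');

SELECT d.TransactionNumber,
       -- concatenate the de-duplicated items in a stable order
       STRING_AGG(d.PurchaseItem, ',') WITHIN GROUP (ORDER BY d.PurchaseItem) AS Purchases
FROM (
    -- DISTINCT removes repeated items before aggregation
    SELECT DISTINCT TransactionNumber, PurchaseItem
    FROM @TransactionPurchases
) AS d
GROUP BY d.TransactionNumber
ORDER BY d.TransactionNumber;
```

With this sample data, transaction 1 yields 'apple,pear' and transaction 2 yields 'plum'. Being set-based, this is a single pass over the data rather than a row-by-row loop.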

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to return the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, although I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows OVER (PARTITION BY UnitID ORDER BY your ordering column) and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
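A minimal sketch of the RANK approach, assuming JobUnitKeyID is the ordering column (the table variable and sample rows are made up for illustration):

```sql
-- Hypothetical sample data; the table variable stands in for JobUnits.
DECLARE @JobUnits TABLE (JobUnitKeyID int, UnitID int);
INSERT INTO @JobUnits (JobUnitKeyID, UnitID) VALUES
    (1, 101),
    (2, 105), (3, 105),
    (4, 110), (5, 110), (6, 110);

WITH ranked AS (
    SELECT JobUnitKeyID, UnitID,
           RANK() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS rk
    FROM @JobUnits
)
SELECT JobUnitKeyID, UnitID
FROM ranked
WHERE rk = 2          -- the second record within each UnitID
ORDER BY UnitID;
```

With this sample data the query returns (3, 105) and (5, 110). Note that RANK gives tied rows the same rank and then skips numbers, so if the ordering column is not unique within a UnitID, there may be no row with rank 2.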
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
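WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2  -- use DupID = 2 to return only the second record
ORDER BY UnitID DESC  -- ORDER BY is not allowed inside the CTE, so it goes here
;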
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case you would have to use Min(JobUnitKeyID) together with UnitID and join back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Using Min or Max requires you to join back to the same data, which will be inherently slower.
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

Filter Duplicate records by closest data?

Background
I have multiple records in a table that sometimes have duplicate entries, apart from the date the record was created.
I have to pick between the duplicate records and change a field of the one with the latest date (last one to be created).
Currently I am doing this manually by visually checking the dates.
Question
Is there a way of only bringing back one of the duplicates, the one with the date closest to today?
Example
Below is a query that brings back two sets of duplicates for one StationID. There should only be one record per assessment type. The isLive column would be changed to True for the bottom two records, as they have the latest FileDate values.
SQL
SELECT StationFileID
,StationID
,AssessmentType
,URL
,FileDate
,isLive
,StationObjectID
FROM StationFiles
WHERE StationID = '1066'
ORDER BY StationID;
Records Returned
You can use the ROW_NUMBER() function to identify the latest rows:
SELECT *
,CASE WHEN N = 1 THEN 'True'
ELSE 'False' END AS isLive
FROM (SELECT StationFileID
,StationID
,AssessmentType
,FileDate
,ROW_NUMBER() OVER (PARTITION BY StationID, AssessmentType ORDER BY FileDate DESC) AS N
FROM StationFiles
WHERE StationID = '1066') AS T
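If the goal is to actually change isLive rather than just display it, the same ROW_NUMBER() ranking can be used inside an updatable CTE. This is only a sketch: it assumes isLive is a bit column, and the table variable and sample rows are invented for illustration.

```sql
-- Hypothetical sample data; the table variable stands in for StationFiles.
DECLARE @StationFiles TABLE (
    StationFileID int, StationID varchar(10), AssessmentType varchar(20),
    FileDate date, isLive bit
);
INSERT INTO @StationFiles (StationFileID, StationID, AssessmentType, FileDate, isLive) VALUES
    (1, '1066', 'A', '2015-01-01', 0),
    (2, '1066', 'A', '2015-06-01', 0),
    (3, '1066', 'B', '2015-02-01', 0),
    (4, '1066', 'B', '2015-07-01', 0);

WITH T AS (
    SELECT isLive,
           ROW_NUMBER() OVER (PARTITION BY StationID, AssessmentType
                              ORDER BY FileDate DESC) AS N
    FROM @StationFiles
    WHERE StationID = '1066'
)
-- Updating the CTE updates the underlying rows: latest file per type becomes live
UPDATE T
SET isLive = CASE WHEN N = 1 THEN 1 ELSE 0 END;

SELECT StationFileID, AssessmentType, FileDate, isLive
FROM @StationFiles
ORDER BY StationFileID;
```

With this sample data, rows 2 and 4 (the latest FileDate for each assessment type) end up with isLive = 1 and the other two with isLive = 0.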

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is, this takes ages to get the data. Is there a quicker way to achieve the same in SQL?
Edit, just to be specific about the time: It takes just over 3 minutes, which is far too long for a query that is destined for a php intranet page.
Edit (11/09/2011): I've found out that the problem is the customers table. It's actually a view. It previously had good performance but now all of a sudden its performance is terrible, a simple query on it takes almost as long as the above query pair. I have therefore chosen an alternative table of customer data (that actually is a table, and not a view) and now the query pair takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
c.firstname,
c.lastname,
c.email,
a.sumhandle
from customers as c
inner join (
select accountnumber,
sum(handle) as sumhandle
from activity
where _date >= '2010-09-10' and
transaction_type = 'bet' and
accountnumber not in (
select accountnumber
from activity
where _date >= '2011-08-10' and
transaction_type = 'bet'
)
group by accountnumber
) as a
on c.accountno = a.accountnumber
I also included your first query as a sub-query instead. I'm not sure what that will do for performance. It could be better, it could be worse, you have to test on your data.
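Another option worth testing, not taken from the queries above, is to read activity only once and filter on the latest bet date per account with HAVING. A minimal sketch with made-up sample rows (the table variable and column types are assumptions; the column names come from the question):

```sql
-- Hypothetical sample data; the table variable stands in for activity.
DECLARE @activity TABLE (accountnumber int, _date date, transaction_type varchar(20), handle money);
INSERT INTO @activity (accountnumber, _date, transaction_type, handle) VALUES
    (1, '2010-10-01', 'Bet', 10),
    (1, '2011-03-01', 'Bet', 20),
    (2, '2010-12-01', 'Bet', 5),
    (2, '2011-08-20', 'Bet', 7),
    (3, '2011-08-15', 'Bet', 9);

SELECT accountnumber,
       SUM(handle) AS sumhandle
FROM @activity
WHERE transaction_type = 'Bet'
  AND _date >= '2010-09-10'            -- wagered in the last year
GROUP BY accountnumber
HAVING MAX(_date) < '2011-08-10';      -- but no bet in the last month
```

With this sample data only account 1 is returned, with sumhandle = 30. This result could replace the derived table in the query above and be joined to customers in the same way; whether it is actually faster depends on your data and indexes.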
I don't know your exact business need, but rarely will someone need access to inactive accounts over several months at a moment's notice. Depending on when you purge data, this may get worse.
You could maintain a summary table that contains the last transaction date for each account (an indexed view will not work here, because SQL Server does not allow MAX in indexed views):
max(_date) as RecentTransaction
If this table gets too large, it could be partitioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
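As a sketch of what such an index might look like: the temp table below only exists to make the example self-contained, its column types are assumptions, and the index name is made up. On the real system you would create the index on activity itself.

```sql
-- Hypothetical stand-in for the activity table; column types are assumed.
CREATE TABLE #activity (
    accountnumber    int,
    _date            date,
    transaction_type varchar(20),
    handle           money
);

-- Equality column first, range column second; INCLUDE covers the columns
-- the query reads so it does not have to go back to the base table.
CREATE INDEX IX_activity_type_date
    ON #activity (transaction_type, _date)
    INCLUDE (accountnumber, handle);
```

Check the execution plan before and after; an index only helps if the optimizer actually picks it for the date range involved.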
Answering my own question, as the problem wasn't the structure of the query but one of the tables being used. It was a view and its performance was terrible. I changed to an actual table containing the customer data, which reduced the execution time to about 15 seconds.

Resources