This is going to seem like a lame question for all experts in SQL server views but...
So I have a small set of data that my client needs for reporting purposes. I have to admit that although I did ask them their reporting requirements, it isn't till now that I see that my db could be better optimised.
One of the pieces of data they want is the time difference between two tasks that may have run:
select caseid, hy.createdate
from app_history hy
where hy.activityid in (303734, 303724)
This gives me two rows (after edit) per case-submission which then have to be measured; but a few wiggles:
Activity 303734 will always run, activity 303724 might run.
Each 303734 and 303724 combo match up. Conceivably a case can have one unmatched 303734 with a matched pair afterwards on the second submission. Matching these might come down to intuition. Not good.
There may be more than one submission per caseid, and if that is the case then both activities will run every subsequent time.
There is no way to write the submission number to this table.
The app_history table holds userid, caseid and activityid as foreign keys. The PK is the identity column ID.
Is there a better way to write the query?
After help from KM:
select
c.id, c.submissionno, hya.caseid, hya.createtime, hyb.caseid, hyb.createtime
,CASE
WHEN hyb.caseid IS NOT NULL THEN DATEDIFF(mi,hya.createtime,hyb.createtime)
ELSE NULL
END AS Difference
from app_case c
inner join app_history hya on c.id = hya.caseid
left outer join app_history hyb on c.id = hyb.caseid
where hya.activityid in (303734) and hyb.activityid in (303724) order by c.id asc
This nearly works.
I now have this issue:
460509|2|460509|15:15:39.000|460509|15:16:13.000|1
460509|2|460509|15:15:39.000|460509|15:18:13.000|3
460509|2|460509|15:17:52.000|460509|15:16:13.000|-1
460509|2|460509|15:17:52.000|460509|15:18:13.000|1
So I am now getting 1 row comparing each of the two for each of the four rows... mmm I think it is the best I can hope for. :(
Use a LEFT JOIN:
SELECT
a.caseid, a.createdate
,b.caseid, b.createdate
,CASE
WHEN b.caseid IS NOT NULL THEN DATEDIFF(mi,a.createdate,b.createdate)
ELSE NULL
END AS Difference
FROM app_history a
LEFT OUTER JOIN app_history b ON b.caseid=a.caseid AND b.activityid=303724
WHERE a.activityid=303734
EDIT after a little more schema info...
SELECT
a.caseid, a.createdate
,b.caseid, b.createdate
,CASE
WHEN b.caseid IS NOT NULL THEN DATEDIFF(mi,a.createdate,b.createdate)
ELSE NULL
END AS Difference
FROM (SELECT MAX(ID) AS MaxID FROM app_history WHERE activityid=303734) aa
INNER JOIN app_history a ON aa.MaxID=a.ID
LEFT OUTER JOIN (SELECT MAX(ID) AS MaxID FROM app_history WHERE activityid=303724) bb ON 1=1
LEFT OUTER JOIN app_history b ON bb.MaxID=b.ID
Do something like this:
select datediff(
day,
(select isnull(hy.createdate,0) from app_history hy where hy.activityid =303734),
(select isnull(hy.createdate,0) from app_history hy where hy.activityid =303724)
)
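For the pairing problem in the question (each 303734 should be matched only with the 303724 of the same submission), one possible approach is to pick, for every 303734 row, the first 303724 row of the same case that occurs before the next 303734. The following is a minimal sketch, not a tested solution for the real schema. It assumes SQL Server 2008 or later, that the identity column ID increases in insertion order, and it uses a table variable with made-up dates as a stand-in for app_history.

```sql
-- Hypothetical sample data mirroring the four history rows from the question.
DECLARE @app_history TABLE (
    id         INT IDENTITY(1,1) PRIMARY KEY,
    caseid     INT      NOT NULL,
    activityid INT      NOT NULL,
    createtime DATETIME NOT NULL
);

INSERT INTO @app_history (caseid, activityid, createtime) VALUES
    (460509, 303734, '2010-01-01T15:15:39'),
    (460509, 303724, '2010-01-01T15:16:13'),
    (460509, 303734, '2010-01-01T15:17:52'),
    (460509, 303724, '2010-01-01T15:18:13');

SELECT
    s.caseid,
    s.createtime AS started,
    f.createtime AS finished,   -- NULL when 303724 never ran for this submission
    DATEDIFF(second, s.createtime, f.createtime) AS diff_seconds
FROM @app_history AS s
OUTER APPLY (
    -- First 303724 after this 303734 ...
    SELECT TOP (1) e.createtime
    FROM @app_history AS e
    WHERE e.caseid = s.caseid
      AND e.activityid = 303724
      AND e.id > s.id
      -- ... with no other 303734 of the same case in between.
      AND NOT EXISTS (
            SELECT 1
            FROM @app_history AS n
            WHERE n.caseid = s.caseid
              AND n.activityid = 303734
              AND n.id > s.id
              AND n.id < e.id)
    ORDER BY e.id
) AS f
WHERE s.activityid = 303734
ORDER BY s.caseid, s.id;
```

With the sample rows this yields one row per submission (differences of 34 and 21 seconds) instead of the four cross-matched rows. An unmatched 303734 would still appear, with NULL in the finished and diff_seconds columns.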
I currently have the below query written within Query Designer. I asked a question yesterday and it worked on its own but I would like to incorporate it into my existing report.
SELECT Distinct
i.ProductNumber
,i.ProductType
,i.ProductPurchaseDate
,ih.SalesPersonComputerID
,ih.SalesPerson
,ic2.FlaggedComments
FROM [Products] i
LEFT OUTER JOIN
(SELECT Distinct
MIN(c2.Comments) AS FlaggedComments
,c2.SalesKey
FROM [SalesComment] AS c2
WHERE(c2.Comments like 'Flagged*%')
GROUP BY c2.SalesKey) ic2
ON ic2.SalesKey = i.SalesKey
LEFT JOIN [SalesHistory] AS ih
ON ih.SalesKey = i.SalesKey
WHERE
i.SaleDate between @StartDate and @StopDate
AND ih.Status = 'SOLD'
My question yesterday was how to select only the first comment made for each sale. I have a query for selecting the flagged comments, but I want both the first comment and the flagged comment. They would both be pulled from the same table. This was the query provided, and it worked on its own, but I can't figure out how to make it work with my existing query.
SELECT a.DateTimeCommented, a.ProductNumber, a.Comments, a.SalesKey
FROM (
SELECT
DateTimeCommented, ProductNumber, Comments, SalesKey,
ROW_NUMBER() OVER(PARTITION BY ProductNumber ORDER BY DateTimeCommented) as RowN
FROM [SalesComment]
) a
WHERE a.RowN = 1
Thank you so much for your assistance.
You can use a combination of row-numbering and aggregation to get both the Flagged% comments, and the first comment.
You may want to change the PARTITION BY clause to suit.
DISTINCT on the outer query is probably spurious; on the inner query it definitely is, as you have GROUP BY anyway. If you are getting multiple rows, don't just throw DISTINCT at it. Instead, think about your joins and whether you need aggregation.
The second LEFT JOIN logically becomes an INNER JOIN due to the WHERE predicate. Perhaps that predicate should have been in the ON instead?
SELECT
i.ProductNumber
,i.ProductType
,i.ProductPurchaseDate
,ih.SalesPersonComputerID
,ih.SalesPerson
,ic2.FlaggedComments
,ic2.FirstComments
FROM [Products] i
LEFT OUTER JOIN
(SELECT
MIN(CASE WHEN c2.RowN = 1 THEN c2.Comments END) AS FirstComments
,c2.SalesKey
,MIN(CASE WHEN c2.Comments like 'Flagged*%' THEN c2.Comments END) AS FlaggedComments
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ProductNumber ORDER BY DateTimeCommented) as RowN
FROM [SalesComment]
) AS c2
GROUP BY c2.SalesKey
) ic2 ON ic2.SalesKey = i.SalesKey
JOIN [SalesHistory] AS ih
ON ih.SalesKey = i.SalesKey
WHERE
i.SaleDate between @StartDate and @StopDate
AND ih.Status = 'SOLD'
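The row-numbering plus conditional aggregation idea can be tried in isolation before wiring it into the report query. This is a minimal sketch with invented sample rows in a table variable standing in for [SalesComment]. It partitions by SalesKey, which is an assumption; as noted above, the PARTITION BY may need to change to suit the real data.

```sql
-- Hypothetical stand-in for [SalesComment]; the rows are made up.
DECLARE @SalesComment TABLE (
    SalesKey          INT,
    DateTimeCommented DATETIME,
    Comments          VARCHAR(200)
);

INSERT INTO @SalesComment (SalesKey, DateTimeCommented, Comments) VALUES
    (1, '2020-01-01T09:00:00', 'Customer called'),
    (1, '2020-01-02T09:00:00', 'Flagged* price dispute'),
    (2, '2020-01-03T09:00:00', 'Delivered');

SELECT
    c.SalesKey,
    -- Only the row numbered 1 survives the CASE, so MIN returns the first comment.
    MIN(CASE WHEN c.RowN = 1 THEN c.Comments END) AS FirstComments,
    -- Only flagged rows survive the CASE; NULL when the sale has none.
    MIN(CASE WHEN c.Comments LIKE 'Flagged*%' THEN c.Comments END) AS FlaggedComments
FROM (
    SELECT SalesKey, Comments,
           ROW_NUMBER() OVER (PARTITION BY SalesKey
                              ORDER BY DateTimeCommented) AS RowN
    FROM @SalesComment
) AS c
GROUP BY c.SalesKey
ORDER BY c.SalesKey;
```

For the sample rows, sale 1 returns 'Customer called' and 'Flagged* price dispute', while sale 2 returns 'Delivered' and NULL.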
I have a table of 100,000,000+ values, so efficiency is very important to me. I need to take information from table A, join it to an index table B, then join to table C using the index retrieved from table B. The problem is, there are multiple indexes for each value in table A, and I want to retrieve the one with the most recent date.
The query below creates duplicates:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM Table_1 t1
INNER JOIN Table_2 t2 ON t1.ID_1=t2.ID_1
INNER JOIN Table_3 t3 ON t2.ID_2=t3.ID_2
This one does not, but when running with more than 35,000 vs 40,000 elements, the execution time goes from <5sec to >1min:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM
(SELECT * FROM Table_1 t1 CROSS APPLY Table_2 t2 WHERE t1.ID_1=t2.ID_1) t_temp
LEFT JOIN Table_3 t3 ON t_temp.ID_2=t3.ID_2
How can I decrease my execution time as much as possible?
Here is an example table:
For this table, I would be trying to get the most recent location for each person.
None of the columns are indexed and I cannot create indexes on this table.
First of all, when you are working with 100 million+ records and joining them to other tables, the first thing I would ask is what the rationale is for not creating indexes that can cover your query. If you are not the admin of that system, I would suggest that you bring this up with the admin group and try to understand the exact reason (if any) they do not want an index on that huge table, especially because you mentioned "efficiency is very important to me".
Remember that 'SQL Tuning' is only one of the steps of 'Database Performance Tuning', and you can only tune so much by writing a good SQL query. When the data volume gets huge, a good SQL query is never sufficient without other performance tuning measures.
Apart from what Roger has already provided, here are a few solutions that you can try out:
Solution 1
SELECT T1.ID_1, OA.ID_2, OA.Location
FROM Table1 T1
OUTER APPLY (
SELECT TOP 1 T3.ID_2, T3.Location
FROM Table2 T2
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2
WHERE T2.ID_1 = T1.ID_1
ORDER BY T3.Date DESC
) OA;
Solution 2:
SELECT DISTINCT
T1.ID_1
,ID_2 = FIRST_VALUE(T3.ID_2) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
,Location = FIRST_VALUE(T3.Location) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID_1 = T2.ID_1
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2;
Data Preparation:
DROP TABLE IF EXISTS Table1
DROP TABLE IF EXISTS Table2
DROP TABLE IF EXISTS Table3
SELECT TOP 10000 ID_1 = object_id, name
INTO Table1
FROM sys.all_objects
ORDER BY object_id
SELECT ID_1 = T1.ID_1, ID_2 = IDENTITY(INT, 1, 1)
INTO Table2
FROM Table1 T1
CROSS JOIN Table1 T2
SELECT ID_2, Location = 'City_'+ CAST(ID_2 AS VARCHAR(100)), Date = CAST(DATEADD(DAY, ID_2/10000, GETDATE()) AS DATE)
INTO Table3
FROM Table2
Indexes to cover the Solution 1:
CREATE NONCLUSTERED INDEX IX_TABLE1_ID_1 ON Table1 (ID_1)
CREATE NONCLUSTERED INDEX IX_TABLE2_ID_2 ON Table2 (ID_1, ID_2)
CREATE NONCLUSTERED INDEX IX_TABLE3_ID_2 ON Table3 (ID_2, Date DESC) INCLUDE (Location)
Execution Plan:
You can see that all operators are 'Index Seek' except for Table1, which is a legitimate 'Index Scan' because the outer loop visits every ID_1 value in Table1. If you put a WHERE clause on the outer query to search for a few specific ID_1 values, that 'Index Scan' will turn into an 'Index Seek' as well.
I will leave the index strategy for the second solution to you (as homework :) ). Tip: you have to make Location a key column as well. Or you can go with a COLUMNSTORE index approach.
You can use something like this:
select top (1) with ties
a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
order by row_number() over(partition by a.A_Id order by b.Date desc);
Alternatively, you can try an olde fashioneth approache:
select a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
where not exists (
select 0 from dbo.TableB pb where pb.A_Id = b.A_Id and pb.Date > b.Date
);
However, as with all such situations, performance will depend heavily on indexes. SSMS can suggest some if you look at the execution plan; off the top of my head, you will need all Id columns to be indexed, and you will need either a single (Date) or a composite (A_Id, Date, B_Id) index on TableB.
UPD: If you can't create or modify any indices, and performance is paramount, I would suggest copying the data in question into a separate schema or database, where you might have appropriate permissions. Apart from that... it's impossible to get something out of nothing.
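To see how TOP (1) WITH TIES combined with ROW_NUMBER() keeps exactly one row per group, here is a minimal sketch on a table variable. The three sample rows are invented, and only the TableB side is modelled, so this illustrates the mechanism rather than the full three-table join.

```sql
-- Hypothetical stand-in for dbo.TableB: several B_Id rows per A_Id, each dated.
DECLARE @TableB TABLE (A_Id INT, B_Id INT, [Date] DATE);

INSERT INTO @TableB (A_Id, B_Id, [Date]) VALUES
    (1, 10, '2020-01-01'),
    (1, 11, '2020-03-01'),
    (2, 20, '2020-02-01');

-- Every A_Id has exactly one row whose row number is 1 (its most recent date).
-- Ordering by that row number makes all of those rows tie for first place,
-- and WITH TIES returns every tied row.
SELECT TOP (1) WITH TIES
    b.A_Id, b.B_Id, b.[Date]
FROM @TableB AS b
ORDER BY ROW_NUMBER() OVER (PARTITION BY b.A_Id ORDER BY b.[Date] DESC);
```

For the sample data this returns (1, 11, 2020-03-01) and (2, 20, 2020-02-01). On a large table without indexes the sort behind ROW_NUMBER() is still expensive, so this shows the logic, not a performance fix.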
Good evening all!
I'm running into a really odd issue that I'm having trouble understanding.
I have 3 tables (parts table, parts move history and a parts detail table).
What I'm trying to do is have the result set return lot#,part#,product description,quantity,part location, what's currently in inventory (versus full history) and who last moved the product.
Now, for the query. When I run the below query, I get a result set of 4,751 rows; which lines up perfectly with my expected results. However, when I try to add in the userid field, I then get a result set of 186,573. This large result set appears to pull in all historic data versus just matching the userid to the 4,751 rows I actually need.
From the Parts Table I need (prod_desc)
From the Parts Detail Table I need (lot,part#,lotquantity,prtlocation)
From the Parts Move History Table I need (move_date,user_id)
4,751 Query:
SELECT DISTINCT
inv.lot,
inv.part#,
prt.prod_desc,
inv.lotquantity,
inv.prtlocation,
MAX(mv.move_date)AS 'Move Date'
FROM invdet AS inv
LEFT JOIN movetable AS mv ON inv.part# = mv.part#
LEFT JOIN partmstr AS prt ON inv.part# = prt.part#
WHERE inv.lot IS NOT NULL
GROUP BY inv.lot,inv.part#,prt.prod_desc,inv.lotquantity,inv.prtlocation
ORDER BY inv.prtlocation
186,573 Query:
SELECT DISTINCT
inv.lot,
inv.part#,
prt.prod_desc,
inv.lotquantity,
inv.prtlocation,
MAX(mv.move_date)AS 'Move Date',
mv.user_id
FROM invdet AS inv
LEFT JOIN movetable AS mv ON inv.part# = mv.part#
LEFT JOIN partmstr AS prt ON inv.part# = prt.part#
WHERE inv.lot IS NOT NULL
GROUP BY inv.lot,inv.part#,prt.prod_desc,inv.lotquantity,inv.prtlocation,mv.user_id
ORDER BY inv.prtlocation
If I don't use the MAX function, I do not get current inventory and instead get all results in the table, which I do not need. I'm still learning and my GROUP BY's leave a lot to be desired as I'm still wrapping my head around it (open to suggestions!). I'm sure there's a subquery I can throw in here somewhere, but I'm still figuring those out as well. Any help is greatly appreciated!
I think the problem is that when you add mv.user_id from movetable, you get all of a part's movements and not only the last one, the one dated MAX(mv.move_date).
One way is to remove the left join to movetable and use maybe a cross apply like
SELECT inv.lot,inv.part#,prt.prod_desc,inv.lotquantity,inv.prtlocation,x.move_date,x.user_id
FROM invdet AS inv
CROSS APPLY(SELECT TOP 1
mv.user_id,mv.move_date
FROM movetable mv
WHERE inv.part#=mv.part#
ORDER BY mv.move_date DESC) AS x
LEFT JOIN partmstr AS prt ON inv.part#=prt.part#
WHERE inv.lot IS NOT NULL
ORDER BY inv.prtlocation
I've not tested it, but it should be fine. It may be a bit slow, because CROSS APPLY runs the subquery once for each row of invdet. If it is too slow, you can use ROW_NUMBER to build a derived table containing only the last movement per part and then use it in the LEFT JOIN, as follows:
SELECT inv.lot,inv.part#,prt.prod_desc,inv.lotquantity,inv.prtlocation,y.move_date,y.user_id
FROM invdet AS inv
LEFT JOIN(SELECT x.user_id,x.move_date,x.part#
FROM (SELECT mv.user_id,mv.move_date,mv.part#,rn=ROW_NUMBER() OVER(PARTITION BY mv.part# ORDER BY mv.move_date DESC)
FROM movetable mv) AS x
WHERE x.rn=1) AS y ON y.part#=inv.part#
LEFT JOIN partmstr AS prt ON inv.part#=prt.part#
WHERE inv.lot IS NOT NULL
ORDER BY inv.prtlocation
Hope it helps.
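The row explosion described in this answer can be reproduced on a tiny scale. The sketch below uses a table variable with three made-up movements in place of movetable (the names and dates are hypothetical). It first shows what grouping by user_id does, then what the ROW_NUMBER filter does.

```sql
-- Hypothetical stand-in for movetable: part P1 was moved by two different users.
DECLARE @movetable TABLE (part# VARCHAR(10), move_date DATETIME, user_id VARCHAR(10));

INSERT INTO @movetable (part#, move_date, user_id) VALUES
    ('P1', '2020-01-01', 'alice'),
    ('P1', '2020-02-01', 'bob'),
    ('P2', '2020-01-15', 'carol');

-- Grouping by user_id gives one row per (part, user): 3 rows here,
-- because MAX(move_date) is now computed per user, not per part.
SELECT part#, user_id, MAX(move_date) AS last_move
FROM @movetable
GROUP BY part#, user_id;

-- Numbering the moves per part, newest first, and keeping rn = 1
-- gives one row per part together with the user who made that move: 2 rows.
SELECT x.part#, x.user_id, x.move_date
FROM (
    SELECT part#, user_id, move_date,
           ROW_NUMBER() OVER (PARTITION BY part# ORDER BY move_date DESC) AS rn
    FROM @movetable
) AS x
WHERE x.rn = 1;
```

The second query returns P1 with bob and P2 with carol, which is the "who last moved it" answer the question asks for.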
I have the following DB Structure (simplified):
Payments
----------------------
Id | int
InvoiceId | int
Active | bit
Processed | bit
Invoices
----------------------
Id | int
CustomerOrderId | int
CustomerOrders
------------------------------------
Id | int
ApprovalDate | DateTime
ExternalStoreOrderNumber | nvarchar
Each Customer Order has an Invoice and each Invoice can have multiple Payments.
The ExternalStoreOrderNumber is a reference to the order from the external partner store we imported the order from and the ApprovalDate the timestamp when that import happened.
Now we have the problem that we had a wrong import and need to move some payments to other invoices (several hundred, so too many to do by hand) according to the following logic:
Search the Invoice of the Order which has the same external number as the current one but starts with 0 instead of the current digit.
To do that I created the following query:
UPDATE DB.dbo.Payments
SET InvoiceId=
(SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
WHERE I.CustomerOrderId=
(SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O
WHERE O.ExternalOrderNumber='0'+SUBSTRING(
(SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
WHERE OO.Id=I.CustomerOrderId), 1, 10000)))
WHERE Id IN (
SELECT P.Id
FROM DB.dbo.Payments AS P
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00')
I started that query on a test system using the live data (~250,000 rows in each table) and it has now been running for 16 hours. Did I do something completely wrong in the query, or is there a way to speed it up a little?
It is not required to be really fast, as it is a one-time task, but several hours seems long to me, and since I want to learn for the (hopefully not happening) next time, I would like some feedback on how to improve it.
You might as well kill the query. Your update subquery is completely un-correlated to the table being updated. From the looks of it, when it completes, EVERY SINGLE dbo.payments record will have the same value.
To break down your query, you might find that the subquery runs fine on its own.
SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
WHERE I.CustomerOrderId=
(SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O
WHERE O.ExternalOrderNumber='0'+SUBSTRING(
(SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
WHERE OO.Id=I.CustomerOrderId), 1, 10000))
That is always a BIG worry.
The next thing is that it is running this row-by-row for every record in the table.
You are also double-dipping into payments, by selecting from where ... the id is from a join involving itself. You can reference a table for update in the JOIN clause using this pattern:
UPDATE P
....
FROM DB.dbo.Payments AS P
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
Moving on, another mistake is to use TOP without ORDER BY. That's asking for random results. If you know there's only one result, you wouldn't even need TOP. In this case, maybe you're ok with randomly choosing one from many possible matches. Since you have three levels of TOP(1) without ORDER BY, you might as well just mash them all up (join) and take a single TOP(1) across all of them. That would make it look like this
SET InvoiceId=
(SELECT TOP 1 I.Id
FROM DB.dbo.Invoices AS I
JOIN DB.dbo.CustomerOrders AS O
ON I.CustomerOrderId=O.Id
JOIN DB.dbo.CustomerOrders AS OO
ON O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber,1,100)
AND OO.Id=I.CustomerOrderId)
However, as I mentioned very early on, this is not being correlated to the main FROM clause at all. We move the entire search into the main query so that we can make use of JOIN-based set operations rather than row-by-row subqueries.
Before I show the final query (fully commented): I think your SUBSTRING is supposed to implement the rule "but starts with 0 instead of the current digit". However, if I read that correctly, it means that for an order number '5678' you're looking for '0678', which would also mean that SUBSTRING should be using 2,10000 instead of 1,10000.
UPDATE P
SET InvoiceId=II.Id
FROM DB.dbo.Payments AS P
-- invoices for payments
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
-- orders for invoices
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
-- another order with '0' as leading digit
JOIN DB.dbo.CustomerOrders AS OO
ON OO.ExternalOrderNumber='0'+substring(O.ExternalOrderNumber,2,1000)
-- invoices for this other order
JOIN DB.dbo.Invoices AS II ON OO.Id=II.CustomerOrderId
-- conditions for the Payments records
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
It is worth noting that SQL Server allows UPDATE ..FROM ..JOIN, which is less widely supported by other DBMSs, e.g. Oracle. This is because a single row in Payments (the update target) could have many candidate II.Id values coming out of the joins, and you will get an arbitrary one of them.
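The effect of the SUBSTRING start position discussed above is easy to check on a literal before running any UPDATE. A minimal sketch, using the example order number '5678' from this answer; the STUFF variant is an alternative added here for comparison, not something from the original query.

```sql
DECLARE @n NVARCHAR(50) = N'5678';

SELECT
    '0' + SUBSTRING(@n, 1, 10000) AS from_position_1,  -- '05678': prepends a digit
    '0' + SUBSTRING(@n, 2, 10000) AS from_position_2,  -- '0678': replaces the first digit
    STUFF(@n, 1, 1, '0')          AS via_stuff;        -- '0678': same result with STUFF
```

Starting at position 1 keeps the whole original number and only prepends a '0', so it can match only if the target order numbers are one character longer than the current ones.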
I think something like this will be more efficient, if I understood your query right. As I wrote it by hand and didn't run it, it may have some syntax errors.
UPDATE DB.dbo.Payments
set InvoiceId=(SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
inner join DB.dbo.CustomerOrders AS O ON I.CustomerOrderId=O.Id
inner join DB.dbo.CustomerOrders AS OO On OO.Id=I.CustomerOrderId
and O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber, 1, 10000))
FROM DB.dbo.Payments
JOIN DB.dbo.Invoices AS I ON I.Id=Payments.InvoiceId
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
WHERE Payments.Active=0
AND Payments.Processed=0
AND O.ApprovalDate='2012-07-19 00:00:00'
Try to rewrite using JOINs. This will highlight some of the problems. Will the following query do just the same? (The queries are somewhat different, but I guess this is roughly what you're trying to do.)
UPDATE P
SET InvoiceId= I.Id
FROM DB.dbo.Payments AS P
CROSS JOIN DB.dbo.Invoices AS I
INNER JOIN DB.dbo.CustomerOrders AS O
ON I.CustomerOrderId = O.Id
INNER JOIN DB.dbo.CustomerOrders AS OO
ON O.ExternalOrderNumber = '0' + SUBSTRING(OO.ExternalOrderNumber, 1, 10000)
AND OO.Id = I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
As you see, two problems stand out:
The unconditional join between Payments and Invoices (of course, you've cut this off with a TOP 1, but set-wise it's still unconditional). I'm not really sure if this really is a problem in your query. It will be in mine though :).
The join on a 10000-character column (SUBSTRING), embodied in a condition. This is highly inefficient.
If you need a one-time speedup, just take the queries on each table, try to store the in-between-results in temporary tables, create indices on those temporary tables and use the temporary tables to perform the update.
Is there an option for getting the row with the highest date without joining the same table and using MAX(date)? Is TOP 1 with ORDER BY ... DESC a valid option?
I use SQL Server 2000. And performance is important.
edit:
Table1:
columns: part - partdesc
Table 2:
columns: part - cost - date
select a.part,partdesc,b.cost
left join( select cost,part
right join(select max(date),part from table2 group by part) maxdate ON maxdate.date = bb.date
from table2 bb ) b on b.part = a.part
from table1
I don't know if the code above works but that is the query I dislike. And seems to me inefficient.
Here's a somewhat simplified query based on your edit.
SELECT
a.part,
a.partdesc,
sub.cost
FROM
Table1 A
INNER JOIN
(SELECT
B.part,
cost
FROM
Table2 B
INNER JOIN
(SELECT
part,
MAX(Date) as MaxDate
FROM
Table2
GROUP BY
part) BB
ON bb.part = b.part
AND bb.maxdate = b.date) Sub
ON sub.part = a.part
The sub-sub query will hopefully run a little bit quicker than your current version since it'll run once for the entire query, not once per part value.
SELECT TOP 1 columnlist
FROM table
ORDER BY datecol DESC
is certainly a valid option, assuming that your datecol values are precise enough that you get the results needed (in other words, if it's one row per day, and your date reflects that, then sure. If it's several rows per minute, you may not be precise enough).
Performance will depend on your indexing strategy and hardware.
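TOP 1 with ORDER BY returns a single row overall, so for "the latest cost per part" another common formulation is a correlated MAX subquery, which avoids ROW_NUMBER (only available from SQL Server 2005 onwards). A minimal sketch, with a table variable and invented rows standing in for Table2:

```sql
-- Hypothetical stand-in for Table2 (part - cost - date); the rows are made up.
DECLARE @Table2 TABLE (part VARCHAR(10), cost MONEY, [date] DATETIME);

INSERT INTO @Table2 (part, cost, [date])
SELECT 'A', 1.00, '20090101' UNION ALL
SELECT 'A', 1.50, '20090201' UNION ALL
SELECT 'B', 3.00, '20090115';

-- Keep only the rows whose date equals the latest date for the same part.
SELECT b.part, b.cost, b.[date]
FROM @Table2 AS b
WHERE b.[date] = (SELECT MAX(b2.[date])
                  FROM @Table2 AS b2
                  WHERE b2.part = b.part);
```

For the sample rows this returns A with cost 1.50 and B with cost 3.00. If two rows of one part share the same latest date, both are returned, which is the precision caveat mentioned above. Whether this beats the derived-table join depends on the indexes available, ideally one on (part, date).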