INNER JOIN Taking longer - sql-server

I am joining a subquery with a table.
The subquery runs for 5 seconds (returns 20 records) and the table has only 4 rows.
Subquery:
Select ID, Name, JoinID
FROM tableX
JOIN ..
Subquery sample result:
1, xx, 1
2, yy, 2
3, zz, 1
4, vv, 2
5, bb, 1
TableY (ID, Description) data:
1, test1
2, test2
3, test3
4, test4
My query below takes more than 30 seconds. What am I doing wrong here? I see no issue with the table statistics, and the subquery does not return any NULL values in the JoinID column.
Select sub.*, tab.*
from
(
sub query
) sub
Join tableY tab on tab.ID = sub.JoinID

Joining against a subquery can be slow in some cases. If your tables have no primary key or foreign key indexes, the join is harder for the database engine.
First check your tables. Add primary keys to the ID columns if you don't have them, and create a foreign key on JoinID in tableX. If you have already done that and get the same result, create an index on JoinID. If that doesn't work either, check your server configuration and the documentation for your database engine (Oracle, MSSQL, MySQL, etc.).
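On SQL Server, those suggestions would look roughly like this (a sketch; the constraint and index names are hypothetical, so adjust them to your schema):
--run only the statements your schema is actually missing
ALTER TABLE tableY ADD CONSTRAINT PK_tableY PRIMARY KEY (ID);
ALTER TABLE tableX ADD CONSTRAINT FK_tableX_tableY FOREIGN KEY (JoinID) REFERENCES tableY (ID);
CREATE INDEX IX_tableX_JoinID ON tableX (JoinID);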
You can use this SQL statement to get the same result without using a subquery:
SELECT X.*, Y.*
FROM tableX AS X
JOIN tableY AS Y ON (X.JoinID = Y.ID)
Reserve subqueries for cases where you need aggregate functions or multiple joins inside the subquery.

Related

How can I create a row id within a set of rows

I have some data like
Id, GroupId, Whatever
1, 1, 10
2, 1, 10
3, 1, 10
4, 2, 10
5, 2, 10
6, 3, 10
And I need to add a "group row id" column such as
Id, GroupId, Whatever, GroupRowId
1, 1, 10, 1
2, 1, 10, 2
3, 1, 10, 3
4, 2, 10, 1
5, 2, 10, 2
6, 3, 10, 1
Ideally it would be computed and enforced by the database. So when I do
INSERT INTO Foos (GroupId, Whatever) VALUES (1, 20)
I'd get the correct GroupRowId. Continuing the example data above, this row would then look like
Id, GroupId, Whatever, GroupRowId
7, 1, 20, 4
This data is to be shared with a 3rd party and one of the requirements is for those GroupRowIds to be fixed regardless of any different ORDER BY or WHERE clauses.
I've considered a view with a ROW_NUMBER() OVER (PARTITION BY ...), but that view could still be modified in the future, breaking previously shared data.
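For reference, roughly what I mean (a sketch, assuming the Foos table from the INSERT example above):
CREATE VIEW dbo.FoosWithGroupRowId
AS
SELECT Id, GroupId, Whatever
,ROW_NUMBER() OVER (PARTITION BY GroupId ORDER BY Id) AS GroupRowId
FROM dbo.Foos;
The danger is that a later edit to the view's ORDER BY would silently renumber rows that were already shared.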
Our business rules dictate that no rows will be deleted so the GroupRowId will never need to be recomputed in this respect and there will never** be missing values.
** in the perfect world of business rules.
My thinking is that it would be preferable for this to be a physical column, so that it exists within the row, can be queried, and won't change based on an ORDER BY or WHERE clause.
You might try something along these lines:
--create a test database (will be dropped at the end! Careful with real data!!)
USE master;
GO
CREATE DATABASE GroupingTest;
GO
USE GroupingTest;
GO
--Your table, I use an IDENTITY column for your Id column
CREATE TABLE dbo.tbl(Id INT IDENTITY,GroupId INT,Whatever INT);
GO
--Insert your test values
INSERT INTO tbl(GroupId, Whatever)
VALUES
(1,10)
,(1,10)
,(1,10)
,(2,10)
,(2,10)
,(3,10);
GO
--This is necessary to add the new column and to fill it initially
ALTER TABLE tbl ADD GroupRowId INT;
GO
WITH cte AS
(
SELECT GroupRowId
,ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY Id) AS NewValue
FROM tbl
)
UPDATE cte SET GroupRowId=NewValue;
--check the result
SELECT * FROM tbl ORDER BY GroupId,Id;
GO
--Now we create a trigger, which does exactly the same for new rows
--Very important: This must work with single inserts and with multiple inserts as well!
CREATE TRIGGER dbo.SetNextGroupRowId ON dbo.tbl
FOR INSERT
AS
BEGIN
WITH cte AS
(
SELECT GroupRowId
,ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY Id) AS NewValue
FROM tbl
)
UPDATE cte
SET GroupRowId=NewValue
WHERE GroupRowId IS NULL; --<-- this ensures to change only new rows
END
GO
--Now we can test this with a single value
INSERT INTO tbl(GroupId, Whatever)
VALUES(1,20);
SELECT * FROM tbl ORDER BY GroupId,Id;
--And we can test this with multiple inserts
INSERT INTO tbl(GroupId, Whatever)
VALUES
(1,30)
,(2,30)
,(2,30)
,(3,30)
,(4,30); --<-- the "4" is a new group
SELECT * FROM tbl ORDER BY GroupId,Id;
GO
--Cleaning
USE master;
GO
DROP DATABASE GroupingTest;
What you should keep in mind:
This might get into trouble with values inserted manually into GroupRowId, or with any other statement manipulating this column.
This might get into trouble with deleted rows.
You can think about an approach that selects MAX(GroupRowId)+1 for the given group. This depends on your needs (see the sketch after this list).
You might add a unique index on GroupId, GroupRowId. This would - at least - avoid giving out the same number twice, but it would raise an error instead.
...but in your perfect world of business rules :-) this won't happen...
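A hedged sketch of those last two ideas, assuming the dbo.tbl table from above. The CTE would replace the one inside dbo.SetNextGroupRowId (inserted only exists inside the trigger), and the filtered unique index ignores the rows that are still NULL between the INSERT and the trigger's UPDATE:
--MAX(GroupRowId)+1 variant of the trigger body (a sketch, not drop-in tested)
WITH cte AS
(
SELECT t.GroupRowId
,m.MaxVal + ROW_NUMBER() OVER(PARTITION BY t.GroupId ORDER BY t.Id) AS NewValue
FROM dbo.tbl t
INNER JOIN inserted i ON i.Id = t.Id
CROSS APPLY (SELECT ISNULL(MAX(x.GroupRowId),0) AS MaxVal
FROM dbo.tbl x
WHERE x.GroupId = t.GroupId) m --MAX ignores the new rows, which are still NULL
)
UPDATE cte SET GroupRowId = NewValue;
GO
--Filtered unique index: handing out the same (GroupId, GroupRowId) twice now fails loudly
CREATE UNIQUE INDEX UX_tbl_GroupId_GroupRowId
ON dbo.tbl(GroupId, GroupRowId)
WHERE GroupRowId IS NOT NULL;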
And to be honest: The whole issue has some smell...

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B) containing the record id from Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
-- Option 1
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
-- Option 2
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
-- Option 3
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
FROM TableB b
WHERE col1 IS NOT NULL
AND col2 IS NOT NULL) --if the field isn't nullable, it might be quicker to do <> ''
SELECT a.record_id, a.identifyColumnsNeededExplicitly
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows / general content of your rows / ...). Any analysis should be done case by case, and what's true for two tables may not be true for two other tables.
Theoretically:
Without any index, the first one should be the best, since it can be optimized into one table scan on TableB, two constant scans on TableB and one table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, OR an index on both columns, the second should be faster, since it will do an index scan and two constant scans.
In rare cases, it could be the 3rd one, depending on TableB's statistics. But it won't be far from number 2, since number 3 will scan all of TableB.
In even rarer cases, neither of the 3.
What I'm trying to say is: "Since we have neither your tables nor your rows, open your SQL Management Studio, turn the stats on and try it yourself."
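For example, in SSMS you can turn the runtime stats on before running each candidate (standard session settings, nothing specific to these tables):
SET STATISTICS IO ON;   --logical/physical reads per table
SET STATISTICS TIME ON; --parse/compile and CPU/elapsed times
--run each of the three candidate queries here and compare the numbers
--(Ctrl+M in SSMS also includes the actual execution plan)
Comparing logical reads and CPU time per candidate is usually more telling than the estimated plan alone.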

Database Index when SQL statement includes "IN" clause

I have a SQL statement which takes a really long time to execute, and I really need to improve it somehow.
select * from table where ID=1 and GROUP in
(select group from groupteam where department='marketing')
My question is: if I create an index on columns ID and GROUP, would it help?
Or if not, should I create an index on the second table's DEPARTMENT column?
Or should I create two indexes, one for each table?
The first table has 249,003 rows.
The second table has 900 rows in total, while the query on it returns only 2 rows.
That is why I am surprised that the response is so slow.
Thank you
You can also use EXISTS, depending on your database, like so:
select * from table t
where id = 1
and exists (
select 1 from groupteam
where department = 'marketing'
and group = t.group
)
Create a composite index or individual indexes on groupteam's department and group columns.
Create a composite index or individual indexes on table's id and group columns.
Do an EXPLAIN/ANALYZE (depending on your database) to review how the indexes are being used by your database engine; a sketch of the index statements follows below.
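In T-SQL those index suggestions would look roughly like this (a sketch; the index names are hypothetical, and group/table are bracketed because they are reserved words):
CREATE INDEX IX_groupteam_department_group ON groupteam (department, [group]);
CREATE INDEX IX_table_ID_group ON [table] (ID, [group]);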
Try a join instead:
select * from table t
JOIN groupteam gt
ON t.group = gt.group
where t.ID=1 AND gt.department='marketing'
An index on table's ID and GROUP columns, and on groupteam's GROUP column, would help too.

how to get multiple sets of distinct values

This is not about distinct combinations of values (Select distinct col1, col2 from table)
I have a table with a newly loaded csv file.
Some columns are linked to foreign key dimensions but the values in a given column may not exist in the reference tables.
My desire is to find all the values in each column that do not exist in the reference tables, but in such a way as to minimize the number of scans of the source table.
My current approach consumes the output of a bunch of queries like the following:
SELECT DISTINCT col2 FROM table WHERE col2 NOT IN (SELECT val FROM DimCol2)
SELECT DISTINCT col3 FROM table WHERE col3 NOT IN (SELECT val FROM DimCol3)
however, for N columns, this results in N table scans.
Table is up to 10M rows and columns range in cardinality from 5 through to 5M, but almost all values are already present in the dim tables (>99%).
DimColN ranges in size from 5 values to 50M values, and is well indexed.
The csv is loaded into the table via SSIS, so pre-processing inside SSIS is possible, but I would have to avoid a SQL query for each row.
The SSIS server does not have enough spare RAM to cache all the dim tables.
What about using a LEFT JOIN and checking where the result of the join is NULL, meaning the value doesn't exist in DimCol2:
SELECT DISTINCT a.Col2
FROM table a
LEFT JOIN DimCol2 b ON a.Col2 = b.val
WHERE b.val IS NULL
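If the N-scans cost is the real problem, one further idea the answer above doesn't cover (a hedged sketch, not tested against your schema) is to unpivot the columns with CROSS APPLY so the source table is read once; the column and table names follow the question, with table bracketed since it is a reserved word:
SELECT DISTINCT v.ColName, v.ColValue
FROM [table] a
CROSS APPLY (VALUES ('col2', a.col2)
,('col3', a.col3)) AS v(ColName, ColValue) --one row per checked column
WHERE (v.ColName = 'col2'
AND NOT EXISTS (SELECT 1 FROM DimCol2 d WHERE d.val = v.ColValue))
OR (v.ColName = 'col3'
AND NOT EXISTS (SELECT 1 FROM DimCol3 d WHERE d.val = v.ColValue))
Each dim lookup can still use its index, while [table] is scanned only once regardless of how many columns you check.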

Weird Behavior of Trigger

I've got a trigger (SQL 2008 R2) that does one simple operation, but the results are not logical.
Here is the overview:
A text file is fed to an SSIS package with ONE line (one record) that loads it into the "ORDERS_IN_PROCESS" table. The Data Access Mode is set to "Table or view" in the "OLE DB Destination" to allow triggers to fire.
Here is my ORDERS table:
OrderID ItemNo
--------- ---------
9813 1
9813 2
9813 3
9817 1
So, SSIS executes and
ORDERS_IN_PROCESS gets one record inserted which is OrderID 9813
Trigger is fired:
INSERT INTO ORDERS_ARCHIVE SELECT * FROM ORDERS WHERE OrderID=INSERTED.OrderID
Pretty simple so far...
The results I get in my ORDERS_ARCHIVE (identical layout to ORDERS) Table are
OrderId ItemNo
--------- ----------
9813 3
Where are the other 2 line items?
Note, it only inserted the last row read from the ORDERS table into ORDERS_ARCHIVE.
I need all 3 of them in ORDERS_ARCHIVE.
Why does this happen?
I believe it has something to do with the way SSIS processes it using the "OLE DB Destination", because if I insert a record into ORDERS_IN_PROCESS manually, the trigger does exactly what it's supposed to do and inserts all 3 records from ORDERS.
You may argue that the trigger fires once per batch, and I agree, but in this case I have a batch of just ONE record.
I'm thinking of a stored procedure, but I'd rather not add another level of complexity for something so trivial, supposedly.
Any ideas?
Thanks.
I don't think this has anything to do with SSIS at all, but with your trigger. Instead of the WHERE clause you have there, try using a JOIN in your query:
INSERT INTO ORDERS_ARCHIVE
SELECT O.*
FROM ORDERS O
INNER JOIN INSERTED I
ON O.ORderID = I.OrderID
I concur with Lamak's assessment: your trigger is incorrect.
The logic you provided for your trigger does not compile; the WHERE clause is not valid. I'm assuming, as Lamak did, that your intention was to join based on OrderID.
create table dbo.ORDERS_ARCHIVE
(
OrderID int
, ItemNo int
)
GO
create table dbo.ORDERS
(
OrderID int
, ItemNo int
)
GO
create trigger
trUpdate
ON
dbo.ORDERS
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
-- This doesn't work
-- Msg 4104, Level 16, State 1, Procedure trUpdate, Line 12
-- The multi-part identifier "INSERTED.OrderID" could not be bound.
--INSERT INTO dbo.ORDERS_ARCHIVE
--SELECT *
--FROM ORDERS
--WHERE OrderID=INSERTED.OrderID;
-- I think you meant
INSERT INTO dbo.ORDERS_ARCHIVE
SELECT *
FROM ORDERS
WHERE OrderID=(SELECT INSERTED.OrderID FROM INSERTED);
END
GO
I then ginned up a simple SSIS package with a data source that supplies the 4 rows you indicated and writes to dbo.ORDERS. I ran the package 2 times, and each run netted 4 rows in the ORDERS_ARCHIVE table: 3 rows with 9813 and 1 with 9817 per batch.
I am getting the right count of rows in there, so I believe the trigger is firing correctly; it is the logic that is incorrect. Since the OrderID is not unique in the ORDERS table, the database engine is free to pick whichever row happens to satisfy the search criteria. It just so happens that it picks the same row (ItemNo = 1) each time, but since there is no guarantee of order without an ORDER BY clause, this is just random, or an artifact of how the engine chooses, and not behaviour I would bank on remaining consistent.
How do you fix this?
Fix the trigger. Joining to the inserted virtual table only on the OrderID is resulting in multiple rows satisfying the condition.
create trigger
trUpdate
ON
dbo.ORDERS
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
-- This trigger will add all the rows from the ORDERS table
-- that match what was just inserted based on OrderID and ItemNo
INSERT INTO dbo.ORDERS_ARCHIVE
SELECT O.*
FROM dbo.ORDERS O
INNER JOIN INSERTED I
ON O.OrderID = I.OrderID
AND O.ItemNo = I.ItemNo;
END
Now when I run the ETL, I see 4 rows in ORDERS_ARCHIVE with the correct ItemNo values.
