SQL Server: efficient way to combine two columns to distinct values?


This is a performance question: I want to combine two columns from two separate tables into one. How can this be done?
I think of this as an OR condition, such as
SELECT a.contract1 or b.contract2 from TABLE1 a, TABLE2 b
where my goal is to get a single column in which each element is either in Contract1 of Table1 or in Contract2 of Table2. The OR notation does not distinguish distinct values from duplicates, and I need distinct values. The proposed solution, UNION, is slow on large datasets of many GBs because of the underlying distinct.
Please propose efficient methods to address the performance problem.
Input
Column in Table A
1
2
3
Column in Table B
1
3
5
Wanted Output
1
2
3
5

That's what UNION does
SELECT contract1 FROM TABLE1
UNION
SELECT contract2 FROM TABLE2
Edit
The performance problem you mention in your comment is probably caused by the nature of UNION itself: behind the scenes the DBMS executes both statements separately and then applies a distinct to the combined set. On large tables this last step can hurt overall performance, and you can confirm that by switching to UNION ALL (which does not perform the distinct).
If you cannot settle for UNION ALL because you don't want duplicates, I found this interesting article that proposes a solution for this kind of issue. It uses a table variable that you populate with your two statements and then select from to get the final result.
Essentially the steps are
DECLARE @Result TABLE (
Contract varchar(50),
-- Example of how to declare a PK within a table variable
PRIMARY KEY ( Contract )
)
-- Note: a plain PRIMARY KEY rejects duplicate values, so an INSERT fails if it brings in a contract that is already present
INSERT @Result
SELECT Contract1
FROM Table1
INSERT @Result
SELECT Contract2
FROM Table2
SELECT *
FROM @Result
but you can find a more detailed explanation at the link above.
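One way the steps above might be made tolerant of duplicates is sketched below. It swaps the table variable for a temp table and adds the IGNORE_DUP_KEY index option so that repeated contracts are silently skipped instead of raising a key violation. Both choices are assumptions made for illustration and are not necessarily what the linked article does; Table1, Table2, Contract1 and Contract2 follow the question.

```sql
-- Sketch only: a temp table is used here instead of the table variable.
-- IGNORE_DUP_KEY = ON makes SQL Server discard rows whose key already exists
-- (it reports a "Duplicate key was ignored." warning rather than an error).
CREATE TABLE #Result (
    Contract varchar(50) NOT NULL,
    PRIMARY KEY (Contract) WITH (IGNORE_DUP_KEY = ON)
);

-- NULLs are filtered out because a primary key column cannot hold them.
INSERT INTO #Result (Contract)
SELECT Contract1 FROM Table1 WHERE Contract1 IS NOT NULL;

INSERT INTO #Result (Contract)
SELECT Contract2 FROM Table2 WHERE Contract2 IS NOT NULL;

SELECT Contract FROM #Result;

DROP TABLE #Result;
```

Whether this beats a plain UNION depends on the data and the available indexes, so it is worth comparing the actual execution plans of both variants.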

Related

Split field and insert rows in SQL Server trigger, when multiple rows are affected without using a cursor

I have an INSERT trigger of a table, where one field of the table contains a comma-separated list of key-value pairs, that are separated by a :
I can select this field with the two values into a temp table easily with this statement:
-- SAMPLE DATA FOR PRESENTATION ONLY
DECLARE @messageIds VARCHAR(2000) = '29708332:55197,29708329:54683,29708331:54589,29708330:54586,29708327:54543,29708328:54539,29708333:54538,29708334:62162,29708335:56798';
SELECT
-- everything before the colon
SUBSTRING(value, 1, CHARINDEX(':', value) - 1) AS MessageId,
-- everything after the colon (a length of LEN(value) is always long enough)
SUBSTRING(value, CHARINDEX(':', value) + 1, LEN(value)) AS DeviceId
INTO #temp_messages
FROM STRING_SPLIT(@messageIds, ',')
SELECT * FROM #temp_messages
DROP TABLE #temp_messages
The result will look like this
29708332 55197
29708329 54683
29708331 54589
29708330 54586
29708327 54543
29708328 54539
29708333 54538
29708334 62162
29708335 56798
From here I can join the temp table to other tables and insert some of the results into a third table.
Inside the trigger I can get the messageIds with a simple SELECT statement like
DECLARE @messageIds VARCHAR(2000) = (SELECT ProcessMessageIds FROM INSERTED)
Now I create the temp table (as described above) and process my
INSERT INTO <new_table> SELECT col1, col2, .. FROM #temp_messages
JOIN <another_table> ON ...
Unfortunately this only works for single-row inserts. As soon as there is more than one row, my SELECT ProcessMessageIds FROM INSERTED fails, because there are multiple rows in the INSERTED table.
I could process the rows with a CURSOR, but as far as I know cursors are a no-go in triggers and I should avoid them whenever possible.
Therefore my question is: is there another way to do this without using a CURSOR inside the trigger?

Before we get into the details of the solution, let me point out that you would have no such issues if you normalized your database, as @Larnu pointed out in the comment section of your question.
Your
DECLARE @messageIds VARCHAR(2000) = (SELECT ProcessMessageIds FROM INSERTED)
statement assumes that there will be a single value to assign to @messageIds and, as you have pointed out, this is not necessarily true.
Solution 1: Join with INSERTED rather than load it into a variable
INSERT INTO t1
SELECT ...
FROM t2
JOIN T3
ON ...
JOIN INSERTED
ON ...
and then you can reach INSERTED.ProcessMessageIds without issues. This will no longer assume that a single value was used.
Solution 2: cursors
You can use a CURSOR, as you have already pointed out, but it's not a very good idea to use cursors inside a trigger, see https://social.msdn.microsoft.com/Forums/en-US/87fd1205-4e27-413d-b040-047078b07756/cursor-usages-in-trigger-in-sql-server?forum=aspsqlserver
Solution 3: insert a single line at a time
While this would not require a change in your trigger, it would require a change in how you insert and it would increase the number of db requests necessary, so I would advise you not to choose this approach.
Solution 4: normalize
See https://www.simplilearn.com/tutorials/sql-tutorial/what-is-normalization-in-sql
If you had a proper table rather than a table of composite values, you would have no such issues and you would have a much easier time processing the message ids in general.
Summary
It would be wise to normalize your tables and perform the refactoring that would be needed afterwards. It's a great effort now, but you will enjoy its fruits. If that's not an option, you can "act as if it was normalized" and choose Solution 1.

As pointed out in the answers, joining with the INSERTED table solved my problem.
SELECT INTAB.Id,
SUBSTRING(value, 1, CHARINDEX(':', value) - 1) AS MessageId,
SUBSTRING(value, CHARINDEX(':', value) + 1, LEN(value)) AS DeviceId
FROM INSERTED AS INTAB
CROSS APPLY STRING_SPLIT(ProcessMessageIds, ',')
I never used "CROSS APPLY" before, thank you.
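For completeness, here is a rough sketch of how the set-based approach could look inside the trigger itself. The table names dbo.ProcessQueue and dbo.MessageDevice, the trigger name, and the target columns are hypothetical; only Id, ProcessMessageIds, INSERTED and the CROSS APPLY STRING_SPLIT pattern come from the thread. STRING_SPLIT needs database compatibility level 130 or higher.

```sql
-- Hypothetical objects: dbo.ProcessQueue (table with the trigger) and
-- dbo.MessageDevice (target table). Adjust names and types to the real schema.
CREATE TRIGGER trg_ProcessQueue_Insert
ON dbo.ProcessQueue
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- One statement handles every inserted row; no variable, no cursor.
    INSERT INTO dbo.MessageDevice (SourceId, MessageId, DeviceId)
    SELECT i.Id,
           -- NULLIF turns "no colon found" (0) into NULL, so LEFT never
           -- receives a negative length for a malformed pair.
           LEFT(s.value, NULLIF(CHARINDEX(':', s.value), 0) - 1),
           SUBSTRING(s.value, CHARINDEX(':', s.value) + 1, LEN(s.value))
    FROM inserted AS i
    CROSS APPLY STRING_SPLIT(i.ProcessMessageIds, ',') AS s
    WHERE CHARINDEX(':', s.value) > 0;  -- skip pairs without a colon
END;
```

Because the statement reads from inserted as a set, it behaves the same for single-row and multi-row inserts.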

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE a = '1/1/2020'
However, if I join with another small table in the final SELECT statement it is extremely slow (> 30 sec)
DECLARE @anotherTable TABLE (A DATE, B INT)
INSERT INTO @anotherTable (A, B)
VALUES ('1/1/2020', 1)
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN @anotherTable T ON T.A = D.A AND T.B = D.B
In the real application I have a complex UPDATE as the final statement, so I introduced the UNION to consolidate the code and avoid copy-paste.
But now I am experiencing unexpected slowness.
I tried using UNION ALL instead of UNION - with the same result.
It looks like SQL Server pushes simple conditions down into each branch of the UNION, but when I join with another table this doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here are the estimated plans
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that the estimated plans are not for exactly the query above but for a more realistic one that has exactly the same problem.

When I'm doing complex updates I normally declare a temp table and insert into it the rows that I intend to update. There are two benefits to this approach. First, by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple (just update the rows whose primary key is in your temp table). Second, you can do some sanity checking before actually running your update, and "throw an error" by returning a different value.
I think it's generally a good practice to break down queries into simple steps like this, because it makes them much easier to troubleshoot in the future.
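The staged pattern described above might look roughly like this. Every table name, column name and the row-count threshold is hypothetical and only serves to illustrate the shape of the approach.

```sql
-- Step 1: collect the rows that should change (hypothetical tables/columns).
CREATE TABLE #ToUpdate (
    Id       int            NOT NULL PRIMARY KEY,
    NewValue decimal(18, 2) NOT NULL
);

INSERT INTO #ToUpdate (Id, NewValue)
SELECT t.Id, s.Value
FROM dbo.TargetTable AS t
JOIN dbo.SourceTable AS s ON s.Id = t.Id
WHERE t.Value <> s.Value;

-- Step 2: sanity check before touching the real table.
-- The limit of 10000 is an arbitrary example threshold.
IF (SELECT COUNT(*) FROM #ToUpdate) > 10000
BEGIN
    RAISERROR('Unexpected number of rows to update; aborting.', 16, 1);
    RETURN;
END;

-- Step 3: the update itself is now a simple key join.
UPDATE t
SET t.Value = u.NewValue
FROM dbo.TargetTable AS t
JOIN #ToUpdate AS u ON u.Id = t.Id;

DROP TABLE #ToUpdate;
```

Inside a stored procedure the RETURN could carry a status code instead, which matches the "returning a different value" idea mentioned above.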

This is based on the "similar" execution plan you shared. It would be better to have the actual plan as well, to know whether your estimates and memory grants are OK.
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the columns nav and equity.
Why? For every row the index returns, a lookup into the clustered index is needed to retrieve the values of nav and equity.
Do this only if it is reasonable for the application, for example if other queries may benefit as well.
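A minimal sketch of the extended index follows. The index name and the included columns come from the plan discussed above; the table name dbo.dperf and the key columns AsOfDate and FundId are guesses derived from the index name and must be replaced with the real definition.

```sql
-- Assumed table and key columns; only the index name and the INCLUDE list
-- (nav, equity) are taken from the answer.
CREATE NONCLUSTERED INDEX IX_dperf_date_fund
ON dbo.dperf (AsOfDate, FundId)
INCLUDE (nav, equity)
WITH (DROP_EXISTING = ON);  -- rebuilds the existing index in place
```

With nav and equity stored at the leaf level of the index, the query can be answered from the index alone and the key lookup disappears from the plan.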
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM
ETL.tblIncomingData
UNION ALL
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM ETL.vIncomingDataDPerf
) x
Why? CTEs are not materialized (see also this answer).
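Once the rows are in a temp table, the join from the question can run against it. The sketch below assumes #JointIncomingData was created by the SELECT ... INTO shown above; the index name is hypothetical, and the filter table mirrors the table variable from the question.

```sql
-- Assumes #JointIncomingData already exists in this session.
-- An index on the join columns is optional; it can help when the table is large.
CREATE CLUSTERED INDEX IX_JointIncomingData_Date_Fund
ON #JointIncomingData (AsOfDate, FundId);

-- Small filter table in the shape used in the question.
DECLARE @anotherTable TABLE (A date, B int);
INSERT INTO @anotherTable (A, B) VALUES ('20200101', 1);

SELECT d.*
FROM #JointIncomingData AS d
JOIN @anotherTable AS t
  ON t.A = d.AsOfDate
 AND t.B = d.FundId;
```

Unlike a CTE, the temp table has its own statistics, which gives the optimizer a better basis for choosing the join strategy.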
Bonus: parameter sniffing
If you pass in parameters you might be suffering from parameter sniffing.

SQL WHERE NOT EXISTS (skip duplicates)

Hello, I'm struggling to get the query below right. What I want is to return rows with unique names and surnames. What I get is all rows, including the duplicates.
This is my sql
DECLARE @tmp AS TABLE (Name VARCHAR(100), Surname VARCHAR(100))
INSERT INTO @tmp
SELECT CustomerName,CustomerSurname FROM Customers
WHERE
NOT EXISTS
(SELECT Name,Surname
FROM @tmp
WHERE Name=CustomerName
AND Surname=CustomerSurname
GROUP BY Name,Surname )
Please can someone point me in the right direction here.
//Desperate (I tried without GROUP BY as well but get same result)

DISTINCT would do the trick.
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
If you only want the records that really don't have duplicates (as opposed to getting duplicates represented as a single record) you could use GROUP BY and HAVING:
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1

At first I thought that @David's answer was what you want. But rereading your comments, perhaps you want all combinations of names and surnames:
SELECT n.CustomerName, s.CustomerSurname
FROM
( SELECT DISTINCT CustomerName
FROM Customers
) AS n
CROSS JOIN
( SELECT DISTINCT CustomerSurname
FROM Customers
) AS s ;

Are you doing that while your @tmp table is still empty?
If so: your entire "select" is fully evaluated before the "insert" runs. It does not "run the query and get one row, insert the row, run the query and get another row, insert the row, etc."
If you want to insert unique customers only, use that same "Customers" table in your NOT EXISTS clause:
SELECT c.CustomerName,c.CustomerSurname FROM Customers c
WHERE
NOT EXISTS
(SELECT 1
FROM Customers c1
WHERE c.CustomerName = c1.CustomerName
AND c.CustomerSurname = c1.CustomerSurname
AND c.Id <> c1.Id)
If you want to insert a unique set of customers, use "distinct"

Typically, if you're writing a WHERE NOT EXISTS, WHERE EXISTS, or WHERE NOT IN subquery, you should use what is called a "correlated subquery", as in ypercube's answer above, where table aliases are used for both the inner and the outer table and the inner table is joined to the outer table. ypercube gave a good example.
NOT EXISTS is often preferred over NOT IN (unless the WHERE NOT IN is selecting from a totally unrelated table that you can't join on). One reason is that NOT IN returns no rows at all if the subquery yields a NULL.
Sometimes, if you're tempted to write a WHERE EXISTS (SELECT from a small table with no duplicate values in the column), you could do the same thing by joining the main query with that table on the column you want in the EXISTS. This is not always the best or safest solution: it might make the query slower if there are many rows in that table, and it can produce many duplicate rows if there are duplicate values for that column in the joined table. In that case you'd have to add DISTINCT to the main query, which causes it to sort the data on all columns. That is not efficient at all.
Similarly, WHERE NOT IN or NOT EXISTS correlated subqueries can be expressed (often with the same execution plan) as a LEFT OUTER JOIN to the table you were going to subquery, plus a WHERE <joined column> IS NULL.
You have to be careful using that, but you don't need a DISTINCT. Frankly, I prefer the WHERE NOT IN subqueries or NOT EXISTS correlated subqueries, because the syntax makes the intention clear and it's hard to go wrong.
You also do not need a DISTINCT in the SELECT inside such subqueries (correlated or not). It would be a waste of processing, and for WHERE EXISTS or WHERE IN subqueries the SQL optimizer would ignore it anyway and just use the first value that matched for each row in the outer query. (Hope that makes sense.)
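To make the comparison concrete, here is a small self-contained sketch of the three forms discussed above. The table variables @Customers and @Blocked and their contents are made up for illustration.

```sql
-- Made-up sample data; note the NULL in the inner table.
DECLARE @Customers TABLE (Id int NOT NULL);
DECLARE @Blocked   TABLE (Id int NULL);

INSERT INTO @Customers (Id) VALUES (1), (2), (3);
INSERT INTO @Blocked   (Id) VALUES (1), (NULL);

-- Correlated NOT EXISTS: returns customers 2 and 3.
SELECT c.Id
FROM @Customers AS c
WHERE NOT EXISTS (SELECT 1 FROM @Blocked AS b WHERE b.Id = c.Id);

-- LEFT OUTER JOIN ... IS NULL: also returns customers 2 and 3.
SELECT c.Id
FROM @Customers AS c
LEFT OUTER JOIN @Blocked AS b ON b.Id = c.Id
WHERE b.Id IS NULL;

-- NOT IN: returns no rows, because comparing against the NULL in @Blocked
-- makes the predicate UNKNOWN for every customer that is not blocked.
SELECT c.Id
FROM @Customers AS c
WHERE c.Id NOT IN (SELECT b.Id FROM @Blocked AS b);
```

The last query illustrates why NOT EXISTS is usually the safer choice when the subquery column is nullable.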

Transact SQL parallel query execution

Suppose I have
INSERT INTO @tmp1 (ID) SELECT ID FROM Table1 WHERE Name = 'A'
INSERT INTO @tmp2 (ID) SELECT ID FROM Table2 WHERE Name = 'B'
SELECT ID FROM @tmp1 UNION ALL SELECT ID FROM @tmp3
I would like to run queries 1 & 2 in parallel, and then combine the results after they are finished.
Is there a way to do this in pure T-SQL, or a way to check whether it will happen automatically?
Some background for those who want it: I am investigating a complex search with multiple conditions that are later combined (term OR (term2 AND term3) OR term4 AND item5=term5), so I want to find out whether it would be useful to execute those largely unrelated conditions in parallel and combine the resulting tables afterwards (calculating ranks, weights, and so on).
E.g. there should be several resultsets:
SELECT COUNT(*) @tmp1 union @tmp3
SELECT ID from (@tmp1 union @tmp2) WHERE ...
SELECT * from TABLE3 where ID IN (SELECT ID FROM @tmp1 union @tmp2)
SELECT * from TABLE4 where ID IN (SELECT ID FROM @tmp1 union @tmp2)

You don't. SQL doesn't work like that: it isn't procedural. It leads to race conditions and data issues because of other connections.
Table variables are also scoped to the batch and connection, so you can't share results across two connections, in case you're wondering.
In any case, all you need is this, unless you gave us a bad example:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION
SELECT ID FROM Table2 WHERE Name = 'B'
I suspect you're thinking of "run in parallel" because of this procedural thinking. What is your actual problem and desired goal?
Note: table variables do not allow parallel operations: Can queries that read table variables generate parallel execution plans in SQL Server 2008?

You don't decide what to parallelise - SQL Server's optimizer does. And the largest unit of work that the optimizer will work with is a single statement - so, you find a way to express your query as a single statement, and then rely on SQL Server to do its job, which it will usually do quite well.
If, having constructed your query, the performance isn't acceptable, then you can look at applying hints or forcing certain plans to be used. A lot of people break their queries into multiple statements, either believing that they can do a better job than SQL Server, or because it's how they "naturally" think of the task at hand. Both are "wrong" (for certain values of wrong), but if there's a natural breakdown, you may be able to replicate it using Common Table Expressions - these would allow you to name each sub-part of the problem, and then combine them together, all as part of a single statement.
E.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabB AS (
SELECT ID FROM Table2 WHERE Name = 'B'
)
SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB
And this will allow the server to decide how best to resolve this query (e.g. deciding whether to store intermediate results in "temp" tables).
I see from one of your other comments that you need to "work with" the intermediate results. This can still be done with CTEs (if it's not just a case of you being unable to express the "final" result as a single query), e.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabAWithCalcs AS (
SELECT ID,(ID*5+6) as ModID from TabA
)
SELECT * FROM TabAWithCalcs

Why not just:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION ALL
SELECT ID FROM Table2 WHERE Name = 'B'
Then, if SQL Server wants to run the two selects in parallel, it will do so of its own volition.
Otherwise we need more context for what you're trying to achieve if this isn't practical.

Possible to test for null records in SQL only?

I am trying to help a co-worker with a peculiar problem, and she's limited to MS SQL query code only. The objective is to insert a dummy record (into a surrounding union) IF no records are returned from a query.
I am having a hard time going back and forth between PL/SQL and MS SQL, and I am appealing for help (I'm not particularly appealing, but I am appealing to the StackOverflow audience).
Basically, we need a single, testable value from the target Select ... statement.
In theory, it would do this:
(other records from unions)
Union
Select 'These' as fld1, 'are' as fld2, 'Dummy' as fld3, 'Fields' as fld4
where NOT (Matching Logic)
Union
Select fld1, fld2, fld3, fld4 -- Regular records exist
From tested_table
Where (Matching Logic)
Forcing an individual dummy record, with no conditions, works.
Is there a way to get a single, testable result from a Select?
We can't do it in code (not allowed), but we can feed it SQL.
Anybody? Anybody? Bueller?

You could put the unions in a WITH (a common table expression), then add another union branch that returns a null row only when the big union is empty:
; with BigUnion as
(
select *
from table1
union all
select *
from table2
)
select *
from BigUnion
union all
select null
where not exists (select * from BigUnion)
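Applied to the four-column shape from the question, the same idea could be sketched as follows. The table variable, its status column and the filter status = 'open' are hypothetical stand-ins for the real table and matching logic; the dummy values come from the question.

```sql
-- Hypothetical stand-in for tested_table and its matching logic.
DECLARE @tested_table TABLE (
    fld1 varchar(10), fld2 varchar(10),
    fld3 varchar(10), fld4 varchar(10),
    status varchar(10)
);

WITH Matches AS (
    SELECT fld1, fld2, fld3, fld4
    FROM @tested_table
    WHERE status = 'open'   -- placeholder for (Matching Logic)
)
SELECT fld1, fld2, fld3, fld4
FROM Matches
UNION ALL
-- The dummy row is produced only when Matches is empty.
SELECT 'These', 'are', 'Dummy', 'Fields'
WHERE NOT EXISTS (SELECT 1 FROM Matches);
```

Both branches list the same number of columns with compatible types, which a UNION requires. With the empty table variable above, the query returns just the dummy row.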
