Transact SQL parallel query execution - sql-server

Suppose I have
INSERT INTO #tmp1 (ID) SELECT ID FROM Table1 WHERE Name = 'A'
INSERT INTO #tmp2 (ID) SELECT ID FROM Table2 WHERE Name = 'B'
SELECT ID FROM #tmp1 UNION ALL SELECT ID FROM #tmp2
I would like to run queries 1 & 2 in parallel, and then combine results after they are finished.
Is there a way to do this in pure T-SQL, or a way to check if it will do this automatically?
Background, for those who want it: I am investigating a complex search with multiple conditions that are later combined (term OR (term2 AND term3) OR term4 AND item5=term5), and I want to know whether it would be useful to execute those largely unrelated conditions in parallel and then combine the resulting tables (calculating ranks, weights, and so on).
E.g. there should be several result sets:
SELECT COUNT(*) #tmp1 union #tmp2
SELECT ID from (#tmp1 union #tmp2) WHERE ...
SELECT * from TABLE3 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
SELECT * from TABLE4 where ID IN (SELECT ID FROM #tmp1 union #tmp2)

You don't. SQL doesn't work like that: it isn't procedural. Running statements like this in parallel leads to race conditions and data issues because of other connections.
Table variables are also scoped to the batch and connection, so you can't share results across 2 connections, in case you're wondering.
In any case, all you need is this, unless you gave us a bad example:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION
SELECT ID FROM Table2 WHERE Name = 'B'
I suspect you're thinking of "run in parallel" because of this procedural thinking. What is your actual desired problem and goal?
Note: table variables do not allow parallel operations: Can queries that read table variables generate parallel execution plans in SQL Server 2008?
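One detail worth checking before adopting the rewrite above: it uses UNION where the question used UNION ALL, and the two are not interchangeable when the same ID can come from both tables. The sketch below illustrates the difference. It runs the two statements through Python's sqlite3 module purely as a stand-in for SQL Server (an assumption for the sake of a runnable example), and the sample rows are made up.

```python
import sqlite3

# SQLite is only a stand-in for SQL Server here; UNION and UNION ALL
# are standard SQL and behave the same way for this purpose.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER, Name TEXT);
    -- Hypothetical sample data: ID 2 matches in both tables.
    INSERT INTO Table1 VALUES (1, 'A'), (2, 'A'), (3, 'X');
    INSERT INTO Table2 VALUES (2, 'B'), (4, 'B');
""")

# UNION ALL simply concatenates both result sets.
union_all_ids = sorted(row[0] for row in conn.execute("""
    SELECT ID FROM Table1 WHERE Name = 'A'
    UNION ALL
    SELECT ID FROM Table2 WHERE Name = 'B'
"""))

# UNION additionally removes duplicate rows.
union_ids = sorted(row[0] for row in conn.execute("""
    SELECT ID FROM Table1 WHERE Name = 'A'
    UNION
    SELECT ID FROM Table2 WHERE Name = 'B'
"""))

print(union_all_ids)  # ID 2 is listed twice
print(union_ids)      # ID 2 is listed once
```

If duplicates matter for later ranking or counting, UNION ALL is probably the form to keep.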

You don't decide what to parallelise - SQL Server's optimizer does. And the largest unit of work that the optimizer will work with is a single statement - so, you find a way to express your query as a single statement, and then rely on SQL Server to do its job, which it will usually do quite well.
If, having constructed your query, the performance isn't acceptable, then you can look at applying hints or forcing certain plans to be used. A lot of people break their queries into multiple statements, either believing that they can do a better job than SQL Server, or because it's how they "naturally" think of the task at hand. Both are "wrong" (for certain values of wrong), but if there's a natural breakdown, you may be able to replicate it using Common Table Expressions - these would allow you to name each sub-part of the problem, and then combine them together, all as part of a single statement.
E.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabB AS (
SELECT ID FROM Table2 WHERE Name = 'B'
)
SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB
And this will allow the server to decide how best to resolve this query (e.g. deciding whether to store intermediate results in "temp" tables).
I see in one of your other comments that you talk about having to "work with" the intermediate results. This can still be done with CTEs (if it's not just a case of you being unable to express the "final" result as a single query), e.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabAWithCalcs AS (
SELECT ID,(ID*5+6) as ModID from TabA
)
SELECT * FROM TabAWithCalcs

Why not just:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION ALL
SELECT ID FROM Table2 WHERE Name = 'B'
Then, if SQL Server wants to run the two selects in parallel, it will do so of its own volition.
Otherwise we need more context for what you're trying to achieve if this isn't practical.

Related

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE a = '1/1/2020'
However, if I join with another small table in the final SELECT statement it is extremely slow (> 30 sec)
DECLARE @anotherTable TABLE (A DATE, B INT)
INSERT INTO @anotherTable (A, B)
VALUES ('1/1/2020', 1)
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN @anotherTable T ON T.A = D.A AND T.B = D.B
In the real application, I have a complex UPDATE as the final statement, so I tried to avoid copy-paste and introduced the UNION to consolidate the code.
But now I am experiencing an unexpected issue with slowness.
I tried using UNION ALL instead of UNION - with the same result.
It looks like SQL Server pushes simple conditions down into each branch of the UNION, but when I join with another table this doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here are the estimated plans
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that the estimated plans are not exactly for the above query, but for a more realistic one that has exactly the same problem.
When I'm doing complex updates I normally declare a temp table and insert into it the rows that I intend to update. There are two benefits to this approach. One is that by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple (just update the rows whose primary key is in your temp table). The other big benefit is that you can do some sanity checking before actually running your update, and "throw an error" by returning a different value.
I think it's generally a good practice to break down queries into simple steps like this, because it makes them much easier to troubleshoot in the future.
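A minimal sketch of that collect, check, update pattern follows. It uses Python's sqlite3 module as a stand-in for SQL Server so that it can be run as-is; the Funds table, the RowsToUpdate temp table, and the row-count threshold are all hypothetical names and values invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table and data, for illustration only.
    CREATE TABLE Funds (FundId INTEGER PRIMARY KEY, Status TEXT, Nav REAL);
    INSERT INTO Funds VALUES (1, 'open', 10.0), (2, 'closed', 20.0), (3, 'open', 30.0);

    -- Step 1: collect the primary keys of the rows that should change.
    CREATE TEMP TABLE RowsToUpdate AS
        SELECT FundId FROM Funds WHERE Status = 'open';
""")

# Step 2: sanity check before touching any data.
# The limit of 2 rows is an arbitrary assumption for this sketch.
(pending,) = conn.execute("SELECT COUNT(*) FROM RowsToUpdate").fetchone()
if pending == 0 or pending > 2:
    raise RuntimeError(f"unexpected number of rows to update: {pending}")

# Step 3: the update itself only needs the collected keys.
conn.execute("""
    UPDATE Funds
    SET Nav = Nav * 2
    WHERE FundId IN (SELECT FundId FROM RowsToUpdate)
""")

navs = dict(conn.execute("SELECT FundId, Nav FROM Funds ORDER BY FundId"))
print(navs)
```

In T-SQL the same shape would typically use a #temp table and an early RETURN or THROW for the check.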
This is based on the "similar" execution plan you shared. It would be better to have the actual plan, to know whether your estimates and memory grants are OK.
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the following columns: nav, equity.
Why? For every row the index returns, SQL Server performs a lookup in the clustered index to retrieve the values of nav and equity.
Do this only if it is reasonable for the application, for example if other queries would benefit as well.
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM
ETL.tblIncomingData
UNION ALL
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM ETL.vIncomingDataDPerf
) x
Why? CTEs are not materialized.
Bonus: parameter sniffing
If you pass in parameters, you might be suffering from parameter sniffing.

erratic "delayed" CTE evaluation?

I observe a behaviour with CTEs which I did not expect (and seems inconsistent).
Not quite sure that it is correct...
Basically, through a CTE, I filter rows to avoid a particular problem, then use the result of that CTE to perform calculations that would break on the problematic rows which I thought I eliminated in my CTE...
Take a simple table with a varchar column that often has a number in it, but not always
CREATE TABLE MY_TABLE(ROW_ID INTEGER NOT NULL
, GOOD_ROW BOOLEAN NOT NULL
, SOME_VALUE VARCHAR NOT NULL);
INSERT INTO MY_TABLE(ROW_ID, GOOD_ROW, SOME_VALUE)
VALUES(1, TRUE, '1'), (2, TRUE, '2'), (3, FALSE, 'ABC');
I also create a small table with just numbers to join on
CREATE TABLE NUMBERS(NUMBER_ID INTEGER NOT NULL);
INSERT INTO NUMBERS(NUMBER_ID) VALUES(1), (2), (3);
Joining these two tables on SOME_VALUE results in an error because 'ABC' is not numeric and it appears that the JOIN is evaluated BEFORE the WHERE clause (BAD implications on performance here...)
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3; --> ERROR
So, I try to filter my first table through a CTE which only returns rows for which SOME_VALUE is numeric
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES;
Now, I would expect to be able to use the result of this CTE with SOME_VALUE being numeric.
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
Miracle!!!
It worked!
I get my 2 expected records.
So far so good...
However, if I had defined my CTE slightly differently (WHERE clause which filters the same records)
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES;
This CTE returns exactly the same thing as before
But if I try to join, it fails!
WITH ONLY_GOOD_ONES
AS (
SELECT *
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
I get the following error...
SQL Error [100038] [22018]: Numeric value 'ABC' is not recognized
Is there a particular explanation to this second version of the CTE behaving differently???
The actual answer is that Snowflake does not follow the SQL standard and does not always execute SQL in the order given.
It applies transforms to data prior to filtering when its optimizer decides it wants to.
So for your table MY_TABLE when you do
SELECT some_value::NUMBER FROM my_table WHERE row_id IN (1,2);
In some cases the NUMBER cast will happen on all rows and explode on the 'ABC'. This violates the SQL rule that WHERE is evaluated before SELECT transforms are done, but Snowflake has known this for years, and it's intentional, as it makes things run faster.
The solution is to understand that you have mixed data, assume the code can and will be run out of order, and therefore use the protective versions of the functions, like TRY_TO_NUMBER.
The kicker is that you can write a few nested SELECTs to avoid the problem, then put something like a window function around the code, and the optimizer jumps back into this behaviour and your SQL explodes again. Thus the solution is to understand whether you have mixed data, and handle it. Oh, and complain that it's a bug.
This is because you're getting a different execution plan with the different queries.
Here's how the query is executed with the working query:
... and here is how it's executed with the query generating a failure. The error comes from the fact that the join filter is applied directly on the table scan before the ROW_ID < 3 filter is applied, compared to the working query.
You can see these plans under history, clicking the query id and then the 'profile' tab.
It looks like the join filter is applied so early, maybe because of a wrong estimation. When I run the queries on my test database, they completed without any error.
To overcome the issue, you can always use the "Error-handling Conversion Functions":
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TRY_TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3;
More information:
https://docs.snowflake.com/en/sql-reference/functions-conversion.html#label-try-conversion-functions

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.
Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.
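To make the shape of the UNION-then-LEFT-JOIN query concrete, here is a small runnable sketch with three tables instead of ten. It uses Python's sqlite3 module as a stand-in for SQL Server, and the Score1, Score2, Score3 columns and the sample usernames are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical test tables: one row per student who took that test.
    CREATE TABLE t1 (Username TEXT PRIMARY KEY, Score1 INTEGER);
    CREATE TABLE t2 (Username TEXT PRIMARY KEY, Score2 INTEGER);
    CREATE TABLE t3 (Username TEXT PRIMARY KEY, Score3 INTEGER);
    INSERT INTO t1 VALUES ('ann', 10), ('bob', 20);
    INSERT INTO t2 VALUES ('bob', 30), ('cid', 40);
    INSERT INTO t3 VALUES ('ann', 50);
""")

rows = conn.execute("""
    SELECT u.Username AS RowID, t1.Score1, t2.Score2, t3.Score3
    FROM (
        -- UNION (not UNION ALL) already yields each username once.
        SELECT Username FROM t1
        UNION
        SELECT Username FROM t2
        UNION
        SELECT Username FROM t3
    ) AS u
    LEFT JOIN t1 ON t1.Username = u.Username
    LEFT JOIN t2 ON t2.Username = u.Username
    LEFT JOIN t3 ON t3.Username = u.Username
    ORDER BY u.Username
""").fetchall()

for row in rows:
    print(row)  # one row per student; missing tests show up as None
```

Every join condition refers only to the derived username list, which is what keeps the query from growing an extra condition per table the way a chain of full joins would.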

SQL WHERE NOT EXISTS (skip duplicates)

Hello I'm struggling to get the query below right. What I want is to return rows with unique names and surnames. What I get is all rows with duplicates
This is my sql
DECLARE @tmp AS TABLE (Name VARCHAR(100), Surname VARCHAR(100))
INSERT INTO @tmp
SELECT CustomerName,CustomerSurname FROM Customers
WHERE
NOT EXISTS
(SELECT Name,Surname
FROM @tmp
WHERE Name=CustomerName
AND Surname=CustomerSurname
GROUP BY Name,Surname )
Please can someone point me in the right direction here.
//Desperate (I tried without GROUP BY as well but get same result)
DISTINCT would do the trick.
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
If you only want the records that really don't have duplicates (as opposed to getting duplicates represented as a single record) you could use GROUP BY and HAVING:
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1
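The two queries in this answer return different things, which a tiny data set makes visible. The sketch below runs both through Python's sqlite3 module (a stand-in for SQL Server, chosen only so the example is runnable); the Customers rows are invented, and the ORDER BY is added just to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: 'Ann Lee' is stored twice, 'Bob Ray' once.
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, CustomerName TEXT, CustomerSurname TEXT);
    INSERT INTO Customers VALUES (1, 'Ann', 'Lee'), (2, 'Ann', 'Lee'), (3, 'Bob', 'Ray');
""")

# DISTINCT collapses duplicates into a single representative row.
distinct_rows = conn.execute("""
    SELECT DISTINCT CustomerName, CustomerSurname
    FROM Customers
    ORDER BY CustomerName
""").fetchall()

# GROUP BY ... HAVING COUNT(*) = 1 keeps only names that never repeat.
never_duplicated = conn.execute("""
    SELECT CustomerName, CustomerSurname
    FROM Customers
    GROUP BY CustomerName, CustomerSurname
    HAVING COUNT(*) = 1
    ORDER BY CustomerName
""").fetchall()

print(distinct_rows)     # both names, each listed once
print(never_duplicated)  # only the name that has no duplicate
```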
First, I thought that @David's answer was what you want. But rereading your comments, perhaps you want all combinations of Names and Surnames:
SELECT n.CustomerName, s.CustomerSurname
FROM
( SELECT DISTINCT CustomerName
FROM Customers
) AS n
CROSS JOIN
( SELECT DISTINCT CustomerSurname
FROM Customers
) AS s ;
Are you doing that while your @tmp table is still empty?
If so: your entire "select" is fully evaluated before the "insert" statement, it doesn't do "run the query and add one row, insert the row, run the query and get another row, insert the row, etc."
If you want to insert unique Customers only, use that same "Customer" table in your not exists clause
SELECT c.CustomerName,c.CustomerSurname FROM Customers c
WHERE
NOT EXISTS
(SELECT 1
FROM Customers c1
WHERE c.CustomerName = c1.CustomerName
AND c.CustomerSurname = c1.CustomerSurname
AND c.Id <> c1.Id)
If you want to insert a unique set of customers, use "distinct"
Typically, if you're doing a WHERE NOT EXISTS or WHERE EXISTS, or WHERE NOT IN subquery,
you should use what is called a "correlated subquery", as in ypercube's answer above, where table aliases are used for both inside and outside tables (where inside table is joined to outside table). ypercube gave a good example.
And often, NOT EXISTS is preferred over NOT IN (unless the WHERE NOT IN is selecting from a totally unrelated table that you can't join on.)
Sometimes, if you're tempted to do a WHERE EXISTS (SELECT from a small table with no duplicate values in column), you could also do the same thing by joining the main query with that table on the column you want in the EXISTS. This is not always the best or safest solution: it might make the query slower if there are many rows in that table, and it could cause many duplicate rows if there are duplicate values for that column in the joined table, in which case you'd have to add DISTINCT to the main query, which causes it to sort the data on all columns.
That is not efficient at all.
And, similarly, the WHERE NOT IN or NOT EXISTS correlated subqueries can be accomplished (often with the same execution plan) if you LEFT OUTER JOIN the table you were going to subquery and add a WHERE joined_table.column IS NULL.
You have to be careful using that, but you don't need a DISTINCT. Frankly, I prefer to use the WHERE NOT IN subqueries or NOT EXISTS correlated subqueries, because the syntax makes the intention clear and it's hard to go wrong.
And you do not need a DISTINCT in the SELECT inside such subqueries (correlated or not). It would be a waste of processing (and for WHERE EXISTS or WHERE IN subqueries, the SQL optimizer would ignore it anyway and just use the first value that matched for each row in the outer query). (Hope that makes sense.)
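As a rough illustration of the equivalence described above, the sketch below writes the same "customers with no duplicate" question twice: once as a correlated NOT EXISTS and once as a LEFT OUTER JOIN with an IS NULL test. It uses Python's sqlite3 module as a stand-in for SQL Server, the sample rows are invented, and it only compares results, not execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: 'Ann Lee' is stored twice, 'Bob Ray' once.
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, CustomerName TEXT, CustomerSurname TEXT);
    INSERT INTO Customers VALUES (1, 'Ann', 'Lee'), (2, 'Ann', 'Lee'), (3, 'Bob', 'Ray');
""")

# Correlated NOT EXISTS: keep a customer only if no other row shares the name.
not_exists_rows = conn.execute("""
    SELECT c.CustomerName, c.CustomerSurname
    FROM Customers c
    WHERE NOT EXISTS (
        SELECT 1
        FROM Customers c1
        WHERE c.CustomerName = c1.CustomerName
          AND c.CustomerSurname = c1.CustomerSurname
          AND c.Id <> c1.Id)
    ORDER BY c.CustomerName
""").fetchall()

# Anti-join: LEFT OUTER JOIN to the same table, then keep the rows that
# found no partner. Id is a primary key, so c1.Id IS NULL can only mean
# "no match", never a stored NULL.
left_join_rows = conn.execute("""
    SELECT c.CustomerName, c.CustomerSurname
    FROM Customers c
    LEFT OUTER JOIN Customers c1
      ON c.CustomerName = c1.CustomerName
     AND c.CustomerSurname = c1.CustomerSurname
     AND c.Id <> c1.Id
    WHERE c1.Id IS NULL
    ORDER BY c.CustomerName
""").fetchall()

print(not_exists_rows)
print(left_join_rows)
```

The IS NULL test should be on a column of the joined table that cannot itself be NULL, which is one reason the caution about being careful with this form applies.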

Possible to test for null records in SQL only?

I am trying to help a co-worker with a peculiar problem, and she's limited to MS SQL QUERY code only. The object is to insert a dummy record (into a surrounding union) IF no records are returned from a query.
I am having a hard time going back and forth from PL/SQL to MS SQL, and I am appealing for help (I'm not particularly appealing, but I am appealing to the StackOverflow audience).
Basically, we need a single, testable value from the target Select ... statement.
In theory, it would do this:
(other records from unions)
Union
Select 'These' as fld1, 'are' as fld2, 'Dummy' as fld3, 'Fields' as fld4
where NOT (Matching Logic)
Union
Select fld1, fld2, fld3, fld4 -- Regular records exist
From tested_table
Where (Matching Logic)
Forcing an individual dummy record, with no conditions, works.
Is there a way to get a single, testable result from a Select?
Can't do it in code (not allowed), but can feed SQL
Anybody? Anybody? Bueller?
You could put the unions in a with, then include another union that returns a null only when the big union is empty:
; with BigUnion as
(
select *
from table1
union all
select *
from table2
)
select *
from BigUnion
union all
select null -- needs one NULL (or dummy value) per column of BigUnion
where not exists (select * from BigUnion)
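The following sketch shows the same idea end to end, with the dummy row appearing only while the union is empty. It uses Python's sqlite3 module as a stand-in for SQL Server so it can be run directly; the two-column tables and the 'Dummy' value are assumptions for illustration, and the dummy SELECT supplies one value per column so that the column counts of the UNION ALL branches match.

```python
import sqlite3

# The dummy branch only produces a row when BigUnion is empty.
QUERY = """
    WITH BigUnion AS (
        SELECT fld1, fld2 FROM table1
        UNION ALL
        SELECT fld1, fld2 FROM table2
    )
    SELECT fld1, fld2 FROM BigUnion
    UNION ALL
    SELECT 'Dummy', NULL
    WHERE NOT EXISTS (SELECT 1 FROM BigUnion)
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, both empty to begin with.
    CREATE TABLE table1 (fld1 TEXT, fld2 TEXT);
    CREATE TABLE table2 (fld1 TEXT, fld2 TEXT);
""")

rows_when_empty = conn.execute(QUERY).fetchall()

conn.execute("INSERT INTO table1 VALUES ('real', 'row')")
rows_when_populated = conn.execute(QUERY).fetchall()

print(rows_when_empty)      # only the dummy row
print(rows_when_populated)  # only the real row
```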

Resources