I'm trying to optimize or completely rewrite this query. It currently takes about ~1500ms to run. I know the DISTINCTs and the UNION are fairly inefficient, but I'm struggling to figure out exactly where to go from here.
I am thinking that the first SELECT statement might not be needed to return the output of:
[Key | User_ID,(User_ID)]
Note: Program and Program_Scenario both have clustered indexes. I can provide a screenshot of the execution plan if needed.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID] (@_CompKey INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @UseID AS VARCHAR(8000);
SET @UseID = '';
SELECT @UseID = @UseID + ', ' + x.User_ID
FROM
(SELECT DISTINCT UPPER(p.User_ID) AS User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
WHERE p.CompKey = @_CompKey
UNION
SELECT DISTINCT UPPER(ps.User_ID) AS User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
LEFT OUTER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL) x
RETURN SUBSTRING(@UseID, 3, 8000);
END
There are two things happening in this query
1. Locating rows in the [Program] table matching the specified CompKey (@_CompKey)
2. Locating rows in the [Program_Scenario] table that have the same ProgKey as the rows located in (1) above.
Finally, non-null UserIDs from both these sets of rows are concatenated into a scalar.
For step 1 to be efficient, you'd need an index on the CompKey column (clustered or non-clustered).
For step 2 to be efficient, you'd need an index on the join key, which is ProgKey on the Program_Scenario table (this is likely a non-clustered index, as I can't imagine ProgKey being the PK there). SQL would likely resort to a loop join strategy - i.e., for each row found in [Program] matching the CompKey criteria, it would look up the corresponding rows in [Program_Scenario] with the same ProgKey. This is a guess, though, as there is not sufficient information on the cardinality and distribution of the data.
Ensure the above two indexes are present.
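For example, something along these lines might cover both steps (the index names are made up, and the INCLUDE columns are optional but let the seeks cover the query):
CREATE NONCLUSTERED INDEX IX_Program_CompKey
    ON dbo.Program (CompKey) INCLUDE (ProgKey, User_ID);
CREATE NONCLUSTERED INDEX IX_Program_Scenario_ProgKey
    ON dbo.Program_Scenario (ProgKey) INCLUDE (User_ID);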
Also, as others have noted, the second LEFT OUTER JOIN is a bit confusing; an INNER JOIN is the right way to express it.
Per my interpretation, the inner part of the query can be rewritten this way. This is also the query you'd ideally run and optimize before tackling the string concatenation part. The DISTINCT is dropped, as it is automatic with a UNION. Try this version of the query along with the indexes above, and if it provides the necessary boost, then add back the string concatenation or the FOR XML/STUFF approach to return a scalar.
SELECT UPPER(p.User_ID) as User_ID
FROM
[dbo].[Program] AS p WITH (NOLOCK)
WHERE
p.CompKey = @_CompKey
UNION
SELECT UPPER(ps.User_ID) as User_ID
FROM
[dbo].[Program] AS p WITH (NOLOCK)
INNER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE
p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL
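Once the inner query performs well, the concatenation can be bolted back on top of it. A minimal sketch, assuming SQL Server 2017+ for STRING_AGG (on older versions the FOR XML PATH/STUFF pattern serves the same purpose):
SELECT STRING_AGG(x.User_ID, ', ') AS User_IDs
FROM (
    SELECT UPPER(p.User_ID) AS User_ID
    FROM dbo.Program AS p
    WHERE p.CompKey = @_CompKey
    UNION
    SELECT UPPER(ps.User_ID)
    FROM dbo.Program AS p
    INNER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
    WHERE p.CompKey = @_CompKey
      AND ps.User_ID IS NOT NULL
) AS x;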
I am taking a shot in the dark here. I am guessing that the last code you posted is still a scalar function, and it also did not have all the logic of your original query. Again, this is a shot in the dark, since there are no table definitions or sample data posted.
This might be how this would look as an inline table valued function.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID]
(
    @_CompKey INT
) RETURNS TABLE AS RETURN
SELECT MyResult = STUFF(
    (
        SELECT ', ' + x.User_ID   -- no column alias here, so FOR XML PATH('') concatenates plain text
        FROM
        (
            SELECT UPPER(p.User_ID) AS User_ID
            FROM dbo.Program AS p
            WHERE p.CompKey = @_CompKey
            UNION
            SELECT UPPER(ps.User_ID)
            FROM dbo.Program AS p
            LEFT OUTER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
            WHERE p.CompKey = @_CompKey
            AND ps.User_ID IS NOT NULL
        ) AS x
        FOR XML PATH ('')
    ), 1, 2, '');   -- strip the leading ', '
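For completeness, a hypothetical call site, assuming the function compiles as above (dbo.Company is a made-up table name; any query that supplies a CompKey works):
-- single key
SELECT f.MyResult
FROM dbo.Fn_Get_Del_User_ID(@_CompKey) AS f;
-- or once per outer row, via CROSS APPLY
SELECT c.CompKey, f.MyResult
FROM dbo.Company AS c
CROSS APPLY dbo.Fn_Get_Del_User_ID(c.CompKey) AS f;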
I need to join two tables via their projects' names, but for a few project names that meet specific criteria, I need the join to match on their descriptions instead (the job description is like a name and is unique). I am not 100% sure how to do this. Can a CASE expression be applied? I have provided what I have so far, but it's not joining properly when I do the CASE expression on names that are like 'BTG -'.
SELECT
[Name] AS 'NAME'
,[DATA_Id] AS 'ID_FIELD'
,format([ApprovedOn], 'MM/dd/yyyy') as 'DATE_APPROVED'
,[DATA_PROJECT_NAME]
,[PHASE_NAME]
,[DATA_JOB_ID]
,[JOB_TYPE]
,[SUB_TYPE]
,format([CREATED_DATE], 'MM/dd/yyyy') as 'DATE_CREATED'
,CASE
WHEN [DATA_JOB_ID] = [DATA_Id] THEN 'OK'
WHEN [DATA_JOB_ID] != [DATA_Id] THEN 'NEED DATA NUMBER'
ELSE 'NEED DATA NUMBER'
END AS ACTION_SPECIALISTS
,DATA_PROJECTS
FROM [MI].[MI_B_View].[app_build]
LEFT JOIN
(SELECT * ,
CASE
WHEN [DATA_PROJECT_NAME] LIKE 'BTG -%' THEN [JOB_DESCRIPTION]
ELSE [DATA_PROJECT_NAME]
END AS DATA_PROJECTS
FROM [ExternalUser].[DATA].[JOB] WHERE [JOB_DESCRIPTION] LIKE '%ROW%' AND REGION = 'CITY') AS B
ON [Name] = [DATA_PROJECTS]
WHERE
REGION_ID = 1
AND APPROVED = 1
ORDER BY [ApprovedOn] DESC
TL;DR: The answer by Caius Jard is correct - you can join on anything, as long as it evaluates to true or false (ignoring unknown).
Unfortunately, the way you join between two tables can have drastically different performance depending on your methodology. If you join on an expression, you will usually get very poor performance. Using computed columns, materializing the intermediate result in a table, or splitting up your join conditions can all help with poor performance.
Joins are not the only place where expressions can ding you; grouping, aggregates, filters, or anything that relies on a good cardinality estimate will suffer when expressions are involved.
When I compare two methods of joining (they are functionally equivalent despite the new magic column; more on that later)
SELECT *
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
OR ( Job.DATA_PROJECT_NAME LIKE 'BTG -%'
AND Job.JOB_DESCRIPTION = AppBuild.Name );
SELECT *
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON AppBuild.Name = Job.JoinOnMe;
The resulting query plans have huge differences:
You'll notice that the estimated cost of the first join is much higher - but that doesn't even tell the whole story. If I actually run these two queries with ~6M rows in each table, I end up with the second one finishing in ~7 seconds, and the first one being nowhere near done after 2 minutes. This is because the join predicate was pushed down onto the #Job table's table scan:
SQL Server has no idea what percentage of records will have a DATA_PROJECT_NAME (NOT) LIKE 'BTG -%', so it picks an estimate of 1 row. This then leads it to pick a nested loop join, and a sort, and a spool, all of which really end up making things perform quite poorly for us when we get far more than 1 row out of that table scan.
The fix? Computed columns. I created my tables like so:
CREATE TABLE #Build
(
Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);
CREATE TABLE #Job
(
JOB_DESCRIPTION varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
JoinOnMe AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
ELSE JOB_DESCRIPTION END
);
It turns out that SQL Server will maintain statistics on JoinOnMe, even though there is an expression inside of it and this value has not been materialized anywhere. If you wanted to, you could even index the computed column.
Because we have statistics on JoinOnMe, a join on it will give a good cardinality estimate (when I tested it was exactly correct), and thus a good plan.
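If altering the table is an option, the computed column can also be indexed so that the join can seek rather than scan; a minimal sketch (the index name is made up):
CREATE INDEX IX_Job_JoinOnMe ON #Job (JoinOnMe);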
If you don't have the freedom to alter the table, then you should at least split the join into two joins. It may seem counterintuitive, but if you're ORing together a lot of conditions for an outer join, SQL Server will usually get a better estimate (and thus better plans) if each OR condition is a separate join and you then COALESCE the result sets.
When I include a query like this:
SELECT AppBuild.Name,
COALESCE( Job.JOB_DESCRIPTION, Job2.JOB_DESCRIPTION ) JOB_DESCRIPTION,
COALESCE( Job.DATA_PROJECT_NAME, Job2.DATA_PROJECT_NAME ) DATA_PROJECT_NAME
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
LEFT OUTER JOIN #Job Job2
ON ( Job2.DATA_PROJECT_NAME LIKE 'BTG -%'
AND Job2.JOB_DESCRIPTION = AppBuild.Name );
It is also 0% of total cost, relative to the first query. When compared against joining on the computed column, the difference is about 58%/42%.
Here is how I created the tables and populated them with test data
DROP TABLE IF EXISTS #Build;
DROP TABLE IF EXISTS #Job;
CREATE TABLE #Build
(
Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);
CREATE TABLE #Job
(
JOB_DESCRIPTION varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
JoinOnMe AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
ELSE JOB_DESCRIPTION END
);
INSERT INTO #Build
( Name )
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
FROM master.dbo.spt_values
CROSS APPLY master.dbo.spt_values SV2;
INSERT INTO #Job
( JOB_DESCRIPTION, DATA_PROJECT_NAME )
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
CASE WHEN ROUND( RAND(), 0 ) = 1 THEN CAST(ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )) AS nvarchar(20))
ELSE 'BTG -1' END
FROM master.dbo.spt_values SV
CROSS APPLY master.dbo.spt_values SV2;
Sure, any expression that evaluates to a truth can be used in a join:
SELECT *
FROM
person
INNER JOIN
country
ON
country.name =
CASE person.homeCity
WHEN 'London' THEN 'England'
WHEN 'Amsterdam' THEN 'Holland'
ELSE person.homecountry
END
Suppose homecountry contains records like 'united kingdom', 'great britain' and 'netherlands', but these don't match up with the country names in our countries table - we could use a CASE WHEN to convert them (and I've cased on the city name just to demonstrate that it doesn't have to be anything to do with the country), but for all the others (the ELSE case) we just pass the country name from the person table through unchanged.
Ultimately the CASE WHEN will output some string, and this will be matched against the other table's column (though it could just as well be matched against another CASE WHEN, etc.).
In your scenario you might find you can avoid all this and just write something using AND and OR, like
a JOIN b
ON
(a.projectname like 'abc%' and a.projectname = b.description) OR
(a.projectname like '%def' and a.whatever = b.othercolumn)
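Applied to the tables in the question, that might look roughly like this (table and column names are taken from the question, untested):
SELECT ab.[Name], j.[DATA_PROJECT_NAME], j.[JOB_DESCRIPTION]
FROM [MI].[MI_B_View].[app_build] AS ab
LEFT JOIN [ExternalUser].[DATA].[JOB] AS j
    ON (j.[DATA_PROJECT_NAME] NOT LIKE 'BTG -%' AND ab.[Name] = j.[DATA_PROJECT_NAME])
    OR (j.[DATA_PROJECT_NAME] LIKE 'BTG -%' AND ab.[Name] = j.[JOB_DESCRIPTION]);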
Evaluations in a CASE WHEN are short-circuited, evaluating from left to right.
Remember: anything that ultimately evaluates to a truth can be used in an ON. Even ON 5<10 is valid (it joins all rows to all other rows because it's always true).
I use PostgreSQL 9.6 and my table schema is as follows: department, key, value1, value2, value3, ... Each department has hundreds of millions of unique keys, but the set of keys is more or less the same for all departments. It's possible that some keys don't exist for some departments, but such situations are rare.
I would like to prepare a report that for two departments points out differences in values for each key (comparison involves some logic based only on values for the key).
My first approach was to write an external tool in python that:
creates a server-side cursor for query: SELECT * FROM my_table WHERE department = 'ABC' ORDER BY key;
creates another server-side cursor for query: SELECT * FROM my_table WHERE department = 'XYZ' ORDER BY key;
iterates over both cursors, and compares the values.
It worked fine, but I thought it would be more efficient to perform the comparison inside a stored procedure in PostgreSQL. I wrote a stored procedure that takes two cursors as arguments, iterates over them, and compares the values. Any differences are written into a temporary table. At the end, the external tool iterates just over the temporary table - there shouldn't be many rows there.
I thought that the latter approach would be more efficient, because it doesn't require transferring lots of data outside the database.
To my surprise, it turned out to be slower by almost 40%.
To isolate the problem I compared the performance of iterating a cursor inside a stored procedure, and in python:
FETCH cur_1 INTO row_1;
WHILE (row_1 IS NOT NULL) LOOP
rows = rows + 1;
FETCH FROM cur_1 INTO row_1;
END LOOP;
vs.
conn = psycopg2.connect(PG_URI)
cur = conn.cursor('test')
cur.execute(query)
cnt = 0
for row in cur:
cnt += 1
The query was the same in both cases. Again, the external tool was faster.
My hypothesis is that this is because the stored procedure fetches rows one by one (FETCH FROM curs_1 INTO row_1), while the application fetches rows in batches of 2000. But I couldn't find a way to FETCH a batch of rows from a cursor inside a PL/pgSQL procedure, so I can't test the hypothesis.
So my question is: is it possible to speed up my stored procedure?
What is the best approach for problems like this?
Why can you not do a self-join rather than using cursors? Something like:
SELECT t1.key, t1.value1 - t2.value1 as diff1,
t1.value2 - t2.value2 as diff2, ...
FROM my_table t1 inner join my_table t2 on t1.key = t2.key
WHERE t1.department = 'XYZ' and t2.department = 'ABC'
UNION
SELECT t1.key, t1.value1 as diff1,
t1.value2 as diff2, ...
FROM my_table t1 WHERE NOT EXISTS (SELECT 1 FROM my_table t2 WHERE
t1.key = t2.key AND t2.department = 'ABC') AND t1.department = 'XYZ'
UNION
SELECT t1.key, t1.value1 as diff1,
t1.value2 as diff2, ...
FROM my_table t1 WHERE NOT EXISTS (SELECT 1 FROM my_table t2 WHERE
t1.key = t2.key AND t2.department = 'XYZ') AND t1.department = 'ABC';
The first part deals with all the common cases and the two unions pick up the missing values. I would have thought this would be much faster than a cursor approach.
This might be faster as it will only return those rows that are different in at least one of the values:
select *
from (
SELECT key, value1, value2, value3
FROM my_table
WHERE department = 'ABC'
) d1
full join (
SELECT key, value1, value2, value3
FROM my_table
WHERE department = 'XYZ'
) d2 using (key)
where d1 is distinct from d2;
I am looking for help with optimization techniques or a hint to move ahead with the problem I have. Using a temp table for the IN clause makes my query run for more than 5 seconds; changing it to a static value returns the data in under a second. I am trying to understand how to optimize this.
-- details about the number of rows in table
dept_activity table
- total rows - 17,319,666
- rows for (dept_id = 10) - 36054
-- temp table
CREATE TABLE #tbl_depts (
Id INT Identity(1, 1) PRIMARY KEY
,dept_id integer
);
-- for example I inserted one row but based on conditions multiple department numbers are inserted in this temp table
insert into #tbl_depts(dept_id) values(10);
-- this query takes more than 5 seconds
SELECT activity_type,count(1) rc
FROM dept_activity da
WHERE (
@filter_by_dept IS NULL
OR da.depart_id IN (
SELECT td.dept_id
FROM #tbl_depts td
)
)
group by activity_type;
-- this query takes less than 500 milli seconds
SELECT activity_type,count(1) rc
FROM dept_activity da
WHERE (
@filter_by_dept IS NULL
OR da.depart_id IN (
10 -- changed to static value
)
)
group by activity_type;
What ways can I optimize this so the first query returns data in under a second?
You're testing this with just one value, but isn't your real case different?
The problem the optimizer has here is that it can't know how many rows the temp table in the IN clause will actually contain, so it has to make a guess, and that's probably why the results are different. Looking at estimated row counts (vs. actual) might give some insight into this.
If your clause only contains this one criterion:
@filter_by_dept IS NULL OR da.depart_id IN
it might be good to test what happens if you separate your logic with IF blocks: one branch that fetches everything, and another that filters the data.
If that's not the real case, you might want to test OPTION (RECOMPILE), which could result in a better plan but will use (a little bit) more CPU since the plan is regenerated every time. Or construct the query with dynamic SQL (either just with the temp table while optimizing away the OR, or building a full IN clause if there isn't a ridiculous number of values), but that can get really ugly.
There are different ways of writing the same thing. Use whichever fits your requirements.
Separate Block
IF @filter_by_dept IS NULL
BEGIN
SELECT da.activity_type, COUNT(1) rc
FROM dept_activity da
GROUP BY da.activity_type
END
ELSE
BEGIN
SELECT da.activity_type, COUNT(1) rc
FROM dept_activity da
INNER JOIN #tbl_depts td ON td.dept_id = da.depart_id
GROUP BY da.activity_type
END
Dynamic Query
DECLARE @sql_stmt VARCHAR(5000)
SET @sql_stmt = '
SELECT activity_type, COUNT(1) rc
FROM dept_activity da
'
IF @filter_by_dept IS NOT NULL
SET @sql_stmt = @sql_stmt + ' INNER JOIN #tbl_depts td ON td.dept_id = da.depart_id'
SET @sql_stmt = @sql_stmt + ' GROUP BY da.activity_type '
EXEC(@sql_stmt)
Simple Left Join
Comparatively, it can be slower than the above two options.
SELECT da.activity_type, count(1) rc
FROM dept_activity da
LEFT JOIN #tbl_depts td ON td.dept_id = da.depart_id
WHERE @filter_by_dept IS NULL OR td.id IS NOT NULL
GROUP BY da.activity_type
The biggest issue is most likely the use of an "optional parameter". The query optimizer has no idea whether or not @filter_by_dept is going to have a value the next time the query is executed, so it plays it safe and opts for an index scan rather than an index seek. This is where OPTION (RECOMPILE) can be your friend, especially on simple, easy-to-compile queries like this one.
Also, there are potential gains from using a WHERE EXISTS in place of the IN.
Try the following...
DECLARE @filter_by_dept INT = 10;
SELECT
da.activity_type,
rc = COUNT(1)
FROM
dbo.dept_activity da
WHERE
@filter_by_dept IS NULL
OR
EXISTS (SELECT 1 FROM #tbl_depts td WHERE da.depart_id = td.dept_id)
GROUP BY
da.activity_type
OPTION (RECOMPILE);
HTH, Jason
I have a query that I believe works (not fully tested, only sanity-checked), but it seems incredibly overwrought to me. It's a join over three tables.
One table is a quest table with questID, playerID, and score. Another table is a player table that contains the territory the player is in. A third table is not a real table; it's a subquery that produces a temporary result that consists of the highest scoring player for each territory.
SELECT
Q1.QuestID, Q1.PersonID, C1.TerritoryID
FROM
[dbo].[T_QuestSys_Quest] AS Q1
JOIN
[GZ_GAME].[dbo].[T_Character] AS C1 ON Q1.PersonID = C1.PersonID
JOIN
(SELECT
TerritoryID, MAX(Score) AS HighScore
FROM
[GZ_GAME].[dbo].[T_QuestSys_Quest] AS Q2
JOIN
[GZ_GAME].[dbo].[T_Character] AS C2 ON Q2.PersonID = C2.PersonID
AND Q2.QuestID = @QuestID
GROUP BY
TerritoryID) AS S ON Q1.Score = S.HighScore
AND C1.TerritoryID = S.TerritoryID
AND Q1.QuestID = @QuestID
Notably, the subquery would not allow me to add additional terms to the SELECT statement without resulting in an error:
"... is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
It seemed to me that the MAX(Score) would be sufficiently disambiguating as to which PersonID I wanted, but I guess not.
Anyway, my questions are: Is there a better way to do this in terms of elegance/simplicity? Is there a better way to do this in terms of performance?
One method is to use the ROW_NUMBER window function:
;WITH cte
AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY territoryid ORDER BY score DESC) RN
FROM [GZ_GAME].[dbo].[t_questsys_quest] AS Q2
JOIN [GZ_GAME].[dbo].[t_character] AS C2
ON Q2.personid = C2.personid
AND Q2.questid = @QuestID)
SELECT *
FROM cte
WHERE rn = 1
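If only the three columns from the original query are wanted, the outer SELECT can name them explicitly rather than using *; a sketch under that assumption (column names taken from the question):
;WITH cte AS (
    SELECT Q2.QuestID, Q2.PersonID, C2.TerritoryID,
           ROW_NUMBER() OVER (PARTITION BY C2.TerritoryID ORDER BY Q2.Score DESC) AS RN
    FROM [GZ_GAME].[dbo].[T_QuestSys_Quest] AS Q2
    JOIN [GZ_GAME].[dbo].[T_Character] AS C2
        ON Q2.PersonID = C2.PersonID
    WHERE Q2.QuestID = @QuestID
)
SELECT QuestID, PersonID, TerritoryID
FROM cte
WHERE RN = 1;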
I have the following Stored Procedure
ALTER PROCEDURE [dbo].[bt_BizForSale_GetByID]
(
@ID int
)
AS
SET NOCOUNT ON
SELECT dbo.bt_BizForSale.UserID,
dbo.bt_BizForSale.VendorType,
dbo.bt_BizForSale.BusinessName,
dbo.bt_BizForSale.isEmailSubscriber,
dbo.bt_BizForSale.isIntermediarySubscriber,
dbo.bt_Regions.Region AS Country,
bt_Regions_1.Region AS Province,
bt_Regions_2.Region AS City,
dbo.bt_BizForSale.[AdType]
FROM dbo.bt_BizForSale INNER JOIN
dbo.bt_Regions ON dbo.bt_BizForSale.[61] = dbo.bt_Regions.ID INNER JOIN
dbo.bt_Regions AS bt_Regions_1 ON dbo.bt_BizForSale.[62] = bt_Regions_1.ID INNER JOIN
dbo.bt_Regions AS bt_Regions_2 ON dbo.bt_BizForSale.[63] = bt_Regions_2.ID
WHERE (dbo.bt_BizForSale.ID = @ID)
And when I execute it with a VALID ID from the table, it's not returning any results. What could I possibly be missing?
PS: for "most" valid IDs I pass to the stored procedure, I get the results I'm looking for.
Example: 10010 will return results, but 10104 will not. Both are valid records in the database.
There is probably a Region ID in one of the [61], [62], or [63] columns that is not in dbo.bt_Regions (or is NULL, which will never satisfy any equality condition).
INNER JOIN requires that the (equality) condition be satisfied - so a row must be found
If any one of your INNER JOINs does not have a satisfied condition, you will get no rows.
You will need to either correct your foreign keys so they are valid, or change to a LEFT JOIN with appropriate handling of the NULLs that appear when the right-hand side has no rows satisfying the join criteria.
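A minimal sketch of the LEFT JOIN variant, based on the query in the question (the row comes back even when a region ID is missing or NULL; the Country/Province/City columns are simply NULL in that case):
SELECT  b.UserID,
        b.VendorType,
        b.BusinessName,
        b.isEmailSubscriber,
        b.isIntermediarySubscriber,
        r1.Region AS Country,
        r2.Region AS Province,
        r3.Region AS City,
        b.[AdType]
FROM dbo.bt_BizForSale AS b
LEFT JOIN dbo.bt_Regions AS r1 ON b.[61] = r1.ID
LEFT JOIN dbo.bt_Regions AS r2 ON b.[62] = r2.ID
LEFT JOIN dbo.bt_Regions AS r3 ON b.[63] = r3.ID
WHERE b.ID = @ID;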