I'm a beginner at this, so I hope you can help. I'm working in SQL Server 2008 R2 and have a view composed of four tables joined together:
SELECT DISTINCT ad.award_id,
bl.funding_id,
bl.budget_line,
dd4.monthnumberofyear AS month,
dd4.yearcalendar AS year,
CASE
WHEN frb.full_value IS NULL THEN '0'
ELSE frb.full_value
END AS Expenditure_value,
bl.budget_id,
frb.accode,
'Actual' AS Type
FROM dw.dbo.dimdate5 AS dd4
LEFT OUTER JOIN dbo.award_data AS ad
ON dd4.fulldate BETWEEN ad.usethisstartdate AND
ad.usethisenddate
LEFT OUTER JOIN dbo.budget_line AS bl
ON bl.award_id = ad.award_id
LEFT OUTER JOIN dw.dbo.fctresearchbalances AS frb
ON frb.el3 = bl.award_id
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The view has 9 columns and 1.5 million rows and growing. A select * from this view was taking 20 minutes for all the rows. I added indexes on the fields in the tables that are joined on and that improved it to 10 minutes. My question is what else could I do to get the select to run faster?
Many thanks, Violet.
Try getting rid of the case statement.
If you have 1.5 million rows and you're interested in an aggregation of those rows rather than the whole set, you might want to sum the rows in fctResearchBalances first and then do the joins.
(It's a bit difficult to determine what else you might benefit from, without seeing the access plan.)
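The "aggregate first, then join" suggestion could look roughly like the following. This is only a sketch: it assumes full_value is a numeric column and that one total per award, budget line, fiscal year and period is what is wanted. The alias total_value is made up for illustration.

```sql
-- Sketch: collapse fctresearchbalances to one row per join key first,
-- then join the (smaller) aggregated result to budget_line.
SELECT bl.award_id,
       bl.budget_line,
       agg.yr,
       agg.period,
       agg.total_value
FROM dbo.budget_line AS bl
LEFT OUTER JOIN (
    SELECT el3, element4groupidnew, yr, period,
           SUM(full_value) AS total_value   -- assumes a numeric column
    FROM dw.dbo.fctresearchbalances
    GROUP BY el3, element4groupidnew, yr, period
) AS agg
    ON agg.el3 = bl.award_id
   AND agg.element4groupidnew = bl.budget_line;
```

If the view really needs every detail row, this does not apply; it only helps when the consumer of the view sums the rows anyway.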
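1- You can wrap the query in a stored procedure so that its execution plan is cached and reused.
2- You can use an indexed view, which means creating a unique clustered index on a schema-bound view. Note that indexed views cannot contain outer joins or DISTINCT, so this view would have to be restructured first.
3- You can use join hints to tell the query optimizer to use a specific join algorithm.
4- You can use table partitioning.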
SELECT DISTINCT --#1 - potential bottleneck
ad.award_id
, bl.funding_id
, bl.budget_line
, [month] = dd4.monthnumberofyear
, [year] = dd4.yearcalendar
, Expenditure_value = ISNULL(frb.full_value, '0')
, bl.budget_id
, frb.accode
, [type] = 'Actual'
FROM dbo.dimdate5 dd4
LEFT JOIN dbo.award_data ad ON dd4.fulldate BETWEEN ad.usethisstartdate AND ad.usethisenddate
LEFT JOIN dbo.budget_line bl ON bl.award_id = ad.award_id
LEFT JOIN dbo.fctresearchbalances frb ON frb.el3 = bl.award_id --#2 - join by multiple columns
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The CASE expression can be replaced by
COALESCE(frb.full_value,'0') AS Expenditure_value
Without more info it's not possible to tell exactly what is wrong but just to give you some pointers.
When you have so many LEFT JOINS the order of the joins can make a difference.
Do you have standard indexes or covering indexes with included columns?
If you don't have covering indexes, then primary keys matter in the joins. Including all the primary key columns in the join condition will speed up the query.
Then look at your data - do you need all those LEFT JOINs, based on the foreign keys between those tables? Depending on your keys, a LEFT JOIN may be equivalent to an INNER JOIN.
And with all those LEFT JOINS is having a DISTINCT really useful?
How much RAM do you have? If you have 8GB+ then 1.5m rows is nothing for SQL Server. You need to optimise those joins.
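For reference, a covering index for the last join in the view might look something like this. It is a sketch, not a tuned recommendation: the index name is hypothetical, the key columns are simply the join columns from the view, and the INCLUDE list holds the two columns the view selects from this table.

```sql
-- Hypothetical index name. Key columns = join columns used by the view;
-- INCLUDE columns = the values the view reads from this table, so the
-- join can be answered from the index without touching the base table.
CREATE NONCLUSTERED INDEX IX_fctresearchbalances_join
    ON dw.dbo.fctresearchbalances (el3, element4groupidnew, yr, period)
    INCLUDE (full_value, accode);
```

Compare the execution plan before and after; if the optimizer still scans the table, the key column order may need to change to match the data.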
I am trying to improve string-search performance, and I chose a full-text index strategy.
However, after implementing it, I can still see a performance hit when searching for multiple strings across multiple full-text indexed tables with OR clauses.
(E.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE expecting a performance improvement.
Now, I am facing an issue with CONTAINSTABLE when it comes to joining tables with a LEFT JOIN
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want Query 1 to behave like an OR clause in SQL Server. As I understand it, the first CONTAINSTABLE in Query 1 is joined to the Building table and the rest of the results are ignored, so the CONTAINSTABLE on the Person table only sees data that has already been filtered by the keyword on the Building table.
If the keyword = Building, I want to match the keyword in both the tables regardless of searching a saved record in both the tables. Having a record in each table is enough.
Summary
Query 2 performs well, but it slows down as the number of words in the indexes grows. Query 1 seems to be the optimized approach (according to multiple online resources and the MS documentation);
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for other optimization methods are also welcome.
Thank you.
Hard to say definitively without your full data set but a couple of options to explore
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? The % character is a LIKE wildcard; full-text predicates such as CONTAINS do not treat it as one. Does performance improve if you use the search term without the wildcards (%)? If you want to match words by prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words aka words that reoccur many times in your data that are meaningless like "the" or perhaps some business jargon. Adding these to your stop list will mean your full text index will ignore them, making your index smaller thus faster
The query below will list indexed words with the most frequent at the top
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, are tricky. OR clauses generally perform poorly, even with simple equality operators.
I'd try running two queries and combining the results with UNION, like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
Or, and this is much more work, you could load all your data into a giant denormalized table with all your columns, then apply a full-text index to that table and adjust your search criteria accordingly. It would probably be the fastest method for searching, but you'd have to keep the denormalized table in sync with the underlying normalized tables.
SELECT B.*,P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON ... /*join condition*/
CREATE FULLTEXT INDEX ON DenormalizedTable ...
etc...
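Another option, if you want to stay with CONTAINSTABLE: Query 1 from the question can be given OR semantics by keeping both CONTAINSTABLE left joins and then requiring that at least one of them produced a key. The sketch below reuses the table and alias names from the question and the search terms from Query 2, written without the % characters. Whether it actually beats Query 2 is an assumption you would have to measure against your own data.

```sql
SELECT F.Name, p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building, *, '"Gayan"') AS FFTIndex ON F.ID = FFTIndex.[KEY]
LEFT JOIN P.Relationship PRSHIP ON PR.ID = PRSHIP.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON PR2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person, FirstName, '"John"') AS PFTIndex ON p.ID = PFTIndex.[KEY]
WHERE F.Name IS NOT NULL
  -- keep the row if either full-text search matched
  AND (FFTIndex.[KEY] IS NOT NULL OR PFTIndex.[KEY] IS NOT NULL);
```

The LEFT JOINs on their own never remove rows, which is why the original Query 1 returned unfiltered results; the final predicate is what turns the two searches into an OR.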
I'm getting the following error when running a query against my local database :
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.
This occurs when running a query that looks similar to this (apologies for the lack of detail):
SELECT <about 183 columns>
FROM tableA
INNER JOIN tvf1(<params>) tvf1
ON tvf1.id = tableA.X1
INNER JOIN tvf2(<params>) tvf2
ON tvf2.id = tableA.X2
INNER JOIN tvf3(<params>) tvf3
ON tvf3.id = tableA.X3
INNER JOIN tvf4(<params>) tvf4
ON tvf4.id = tableA.X4
INNER JOIN tvf5(<params>) tvf5
ON tvf5.id = tableA.X5
The table-valued-functions above all use a combination of GROUP BY, ROW_NUMBER() and other aggregation functions.
While binary debugging, I found that commenting out any two of the above joins makes the error go away; it doesn't matter which two.
My database is running on Compatibility Level 2019.
If I set Legacy Cardinality Estimation to On, the error no longer happens, but I don't understand what this setting does.
Edit:
If the database compatibility level is 2016, everything works as expected as well.
A concern I have is that the production database might be upgraded in the future, at which point this error could occur.
Edit :
I've managed to get the column count down to a handful now however my results are inconsistent.
SELECT
Other = TvfGroupData.Other
,GroupA = TvfGroupData.GroupA
,GroupB = TvfGroupData.GroupB
,GroupC = TvfGroupData.GroupC
, [Max Created Date] =
(SELECT MAX(Value)
FROM (VALUES
(Tvf1.CreatedDate)
,(Tvf2.CreatedDate)
,(Tvf3.CreatedDate)
--,(TvfGroupData.CreatedDate)
,(Tvf3.CreatedDate)
,(Tvf4.CreatedDate)
) AS AllValues(Value)
)
FROM TableA
LEFT JOIN Tvf1() ...
LEFT JOIN Tvf2() ...
LEFT JOIN TvfGroupData() ...
LEFT JOIN Tvf3() ...
LEFT JOIN Tvf4() ...
In the above query the following scenarios work :
excluding only GroupA column.
excluding only GroupB, GroupC column
Other combinations all fail with the error :
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
The cardinality estimator predicts how many rows each step of a query will return. SQL Server's optimizer uses those estimates to choose the execution plan, meaning how the engine will execute and assemble the disparate sets of data.
The recent changes in the engine usually (but not always) result in better execution plans, leading to faster query responses while consuming fewer resources.
If you look into how SQL Server processes a query statement, the full data set is gathered before the unwanted columns are excluded. When multiple data sets are combined, the engine may exclude columns before joining to another set, or it may join the sets before excluding columns. It is based on what the engine "sees" in your data patterns (statistics).
UDFs are a frequent stumbling block for the query plan optimizer as a poorly formed function masks the data statistics and prevents the engine from efficiently piecing the data together.
All that to say, the updated engine is looking at your data and determining it is more efficient to combine multiple sets before eliminating unwanted columns.
I believe you may be able to fix this by subselecting the columns you need from your functions before joining to the outer set.
SELECT *
FROM (select SpecificColumns from tableA) as tableA
INNER JOIN (select SpecificColumns from tvf1(<params>)) as tvf1
ON tvf1.id = tableA.X1
INNER JOIN (select SpecificColumns from tvf2(<params>)) as tvf2
ON tvf2.id = tableA.X2
Alternatively, you may want to reconsider the Do-Everything query approach to reporting.
At ~8kb per row, you're likely passing a tremendous amount of data to your reporting system.
You may also give Cross Apply a try.
SELECT <about 183 columns>
FROM tableA
CROSS APPLY tvf1(<params>) tvf1
CROSS APPLY tvf2(<params>) tvf2
WHERE tvf1.id = tableA.X1
AND tvf2.id = tableA.X2
CROSS APPLY can instruct the optimizer to process sets in a different order.
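On the Legacy Cardinality Estimation setting mentioned in the question: it makes the optimizer use the older row-estimation model (the one from before SQL Server 2014) while keeping the rest of the current compatibility level. It can be switched on for the whole database or, on SQL Server 2016 SP1 and later, for a single statement. The sketch below uses system catalog views only so that it runs anywhere; in practice the hint would go on the failing query.

```sql
-- Database-wide: same effect as setting "Legacy Cardinality Estimation" to ON
-- in the database properties dialog.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Per statement: leaves the database setting alone and applies the
-- legacy estimator to this one query only.
SELECT o.name, s.name AS schema_name
FROM sys.objects AS o
INNER JOIN sys.schemas AS s
    ON s.schema_id = o.schema_id
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```

The per-statement hint is the narrower change, which matters if other queries in the database benefit from the newer estimator.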
I've still been unable to track down the exact problem.
My workaround for now has been to evaluate the table-valued functions into table variables/temp tables and then join onto them instead.
So it has been changed to something like this:
DECLARE @tvf1 AS TABLE ....
INSERT INTO @tvf1 SELECT * FROM tvf1()...
DECLARE @tvf2 AS TABLE ....
INSERT INTO @tvf2 SELECT * FROM tvf2()...
DECLARE @tvf3 AS TABLE ....
INSERT INTO @tvf3 SELECT * FROM tvf3()...
SELECT <about 183 columns>
FROM tableA
INNER JOIN @tvf1 tvf1
ON tvf1.id = tableA.X1
INNER JOIN @tvf2 tvf2
ON tvf2.id = tableA.X2
INNER JOIN @tvf3 tvf3
ON tvf3.id = tableA.X3
I have a query that runs fairly fast under normal circumstances. But it is running very slow (at least 20 minutes in SSMS) due to how many values are in the filter.
Here's the generic version of it, and you can see that one part is filtering by over 8,000 values, making it run slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN @Start AND @End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need a separate index for the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
As gotqn wrote, if you put your 8,000 items into a temp table and inner join to it, the query will be faster still, but without an index on the other side of the join it will remain slow.
What you need is to put the filtering values in temporary table. Then use the table to apply filtering using INNER JOIN instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN;
DROP TABLE #FilterDataSource;
END;
CREATE TABLE #FilterDataSource
(
[ID] INT PRIMARY KEY
);
-- Populate #FilterDataSource here: split the 8,000+ values into rows
-- and INSERT them INTO #FilterDataSource ([ID]).
SELECT DISTINCT column
FROM table_a a
INNER JOIN table_b b
ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
ON b.id = FS.ID
WHERE a.date BETWEEN @Start AND @End
AND b.place IN ( ... 20 values)
ORDER BY a.column ASC;
A few important notes:
we are using a temporary table in order to allow parallel execution plans to be used
if you have a fast way of splitting the values (for example a CLR function), you can join to the function itself
it is not good to use IN with many values; SQL Server is not always able to build the execution plan, which may lead to timeouts or internal errors - you can find more information here
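As an illustration of the "split values" step, here is a minimal sketch that assumes SQL Server 2016 or later (database compatibility level 130+), where the built-in STRING_SPLIT function is available. The variable name and the sample IDs are made up; on older versions a custom or CLR splitter would take its place.

```sql
-- Made-up sample list; in the application this would carry the 8,000+ IDs.
DECLARE @IdList varchar(max) = '101,205,309,205';

IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
    DROP TABLE #FilterDataSource;

CREATE TABLE #FilterDataSource ([ID] int PRIMARY KEY);

-- STRING_SPLIT returns one row per element in a column named "value".
-- DISTINCT protects the primary key against repeated IDs in the list.
INSERT INTO #FilterDataSource ([ID])
SELECT DISTINCT CAST(value AS int)
FROM STRING_SPLIT(@IdList, ',');

SELECT [ID] FROM #FilterDataSource ORDER BY [ID];  -- returns 101, 205, 309
```

Passing the IDs as a table-valued parameter is another way to avoid the string handling altogether, if the calling application supports it.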
This query returns all the elements in the table la and all nulls for fields coming from the lar table which is not what I expected.
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id AND la.listing_id = 2780;
This query returns correct and expected results but shouldn't both queries do the same thing ?
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id
WHERE la.listing_id = 2780;
What am I missing here?
I want to make conditional joins because I have noticed that for complex queries PostgreSQL does the join and then applies the WHERE clause, which is very slow. How to make the database filter out some records before doing the JOIN?
The confusion around LEFT JOIN and WHERE clause has been clarified many times:
SQL / PostgreSQL left join ignores "on = constant" predicate, on left table
This interesting question remains:
How to make the database filter out some records before doing the JOIN?
There are no explicit query hints in Postgres. (Which is a matter of ongoing debate.) But there are still various tricks to make Postgres bend your way.
But first, ask yourself: Why did the query planner estimate the chosen plan to be cheaper to begin with? Is your server configuration basically sane? Cost settings adequate? autovacuum running? Postgres version outdated? Are you working around an underlying problem that should really be fixed?
If you force Postgres to do it your way, you should be sure it won't backfire after a version upgrade or an update to the server configuration ... You'd better know exactly what you are doing.
That said, you can force Postgres to "filter out some records before doing the JOIN" with a subquery where you add OFFSET 0 - which is just noise, logically, but prevents Postgres from rearranging it into the form of a regular join. (Query hint after all)
SELECT la.listing_id, la.id, lar.*
FROM (
SELECT listing_id, id
FROM la
WHERE listing_id = 2780
OFFSET 0
) la
LEFT JOIN lar ON lar.application_id = la.id;
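Or you can use a CTE (less obscure, but more expensive; note that since PostgreSQL 12 a simple CTE is inlined by default and only acts as an optimization barrier when declared AS MATERIALIZED). Or other tricks like setting certain config parameters. Or, in this particular case, I would use a LATERAL join to the same effect: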
SELECT la.listing_id, la.id, lar.*
FROM la
LEFT JOIN LATERAL (
SELECT *
FROM lar
WHERE application_id = la.id
) lar ON true
WHERE la.listing_id = 2780;
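For completeness, the CTE variant could be sketched as follows. It assumes the la and lar tables from the question and PostgreSQL 12 or later, where the MATERIALIZED keyword is needed to stop the planner from folding the CTE back into the main query. The CTE name la_filtered is made up.

```sql
-- MATERIALIZED forces the CTE to be computed on its own first,
-- so the filter on listing_id is applied before the join.
WITH la_filtered AS MATERIALIZED (
    SELECT listing_id, id
    FROM la
    WHERE listing_id = 2780
)
SELECT la.listing_id, la.id, lar.*
FROM la_filtered AS la
LEFT JOIN lar ON lar.application_id = la.id;
```

On versions before 12 the keyword does not exist, and a plain WITH already behaves as a barrier.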
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
Here is an extensive blog post on query hints by 2ndQuadrant. Five years old but still valid.
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
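So no matter how you filter with AND la.listing_id = 2780 in the ON clause, you still get all the rows from the first table. But only those with la.listing_id = 2780 can have something other than NULL on the right side.
The behaviour is different with an INNER JOIN: in that case only the matching rows are returned, so the AND condition does filter the rows.
So to make the first query drop the unwanted rows, you need to add a WHERE clause: either WHERE la.listing_id = 2780 as in the second query, or a test on the right-hand table such as WHERE lar.application_id IS NOT NULL (which also drops applications that have no row in lar).
The concern with the second query is that, logically, it joins every row first and then filters down to the ones you need, although the planner is free to apply the filter before the join.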
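The difference is easy to reproduce on a few rows. The sketch below uses throwaway temp tables with made-up data (three applications, two of them for listing 2780) and cuts the tables down to the columns the question uses.

```sql
CREATE TEMP TABLE la  (id int PRIMARY KEY, listing_id int);
CREATE TEMP TABLE lar (id int PRIMARY KEY, application_id int);

INSERT INTO la  VALUES (1, 2780), (2, 2780), (3, 9999);
INSERT INTO lar VALUES (10, 1), (11, 3);

-- Filter in ON: every la row is kept; the condition only decides
-- whether a lar row may be attached to it.
SELECT la.listing_id, la.id, lar.id AS lar_id
FROM la
LEFT JOIN lar
  ON lar.application_id = la.id AND la.listing_id = 2780
ORDER BY la.id;
-- 2780 | 1 | 10
-- 2780 | 2 | NULL
-- 9999 | 3 | NULL   (row kept, lar columns nulled out)

-- Filter in WHERE: rows of la that fail the condition are removed.
SELECT la.listing_id, la.id, lar.id AS lar_id
FROM la
LEFT JOIN lar
  ON lar.application_id = la.id
WHERE la.listing_id = 2780
ORDER BY la.id;
-- 2780 | 1 | 10
-- 2780 | 2 | NULL
```

Application 3 has a matching row in lar, yet the first query shows NULL for it, because the ON condition as a whole is false for that row.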
How can I write a stored procedure in SQL Server 2005 so that I can display repeated column names with a prefix added to them?
Example: I have 'Others' as a column name belonging to multiple categories, mapped to another table having the columns 'MyColumn' and 'YourColumn'. I need to join these two tables so that my output is 'M_Others' and 'Y_Others'. I can use a CASE, but I don't know in advance which other columns in the table are repeated. How can I write that dynamically so that it detects the repetitions?
Thanks in advance.
You should use aliases in the projection of the query: (bogus example, showing the usage)
SELECT c.CustomerID AS Customers_CustomerID, o.CustomerID AS Orders_CustomerID
FROM Customers c INNER JOIN Orders o ON c.CustomerID = o.CustomerID
You can't dynamically change the column names without using dynamic SQL.
You have to explicitly alias them. There is no way to change "A_Others" or "B_Others" in this query:
SELECT
A.Others AS A_Others,
B.Others AS B_Others
FROM
TableA A
JOIN
TableB B ON A.KeyCol = B.KeyCol
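If the goal is only to find out which column names repeat, so that you know where aliases are needed, the catalog can tell you. This is a sketch that assumes the two tables are called TableA and TableB, as in the example above, and that they live in the same database; it lists the shared names and does not rewrite any query.

```sql
-- Column names that appear in both tables (candidates for explicit aliases).
SELECT a.COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS AS a
INNER JOIN INFORMATION_SCHEMA.COLUMNS AS b
    ON b.COLUMN_NAME = a.COLUMN_NAME
WHERE a.TABLE_NAME = 'TableA'
  AND b.TABLE_NAME = 'TableB';
```

Add a TABLE_SCHEMA condition if the same table name exists in more than one schema.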
If the repeated columns contain the same data (i.e. they are the join fields), you should not be sending both in the query anyway, as this is a poor practice and is wasteful of both server and network resources. You should not use select * in queries on production, especially if there are joins. If you are properly writing SQL code, you would alias as you go along when there are two columns with the same name that mean different things (for instance if you joined twice to the person table, once to get the doctor name and once to get the patient name). Doing this dynamically from system tables would not only be inefficient but could end up giving you a big security hole, depending on how badly you wrote the code. You want to save five minutes or less in development by permanently affecting performance for every user and possibly negatively impacting data security. This is what database people refer to as a bad thing.
select n.id_pk,
(case when groupcount.n_count > 1 then substring(m.name, 1, 1) + '_' + n.name
else n.name end)
from test_table1 m
left join test_table2 n on m.id_pk = n.id_fk
left join (select name, count(name) as n_count
from test_table2 group by name)
groupcount on n.name = groupcount.name