I am trying to improve string searching performance, and I selected a full-text index strategy.
However, after implementing it, I can still see a performance hit when searching for multiple strings across multiple full-text indexed tables combined with OR clauses.
(E.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE, expecting a performance improvement.
Now I am facing an issue with CONTAINSTABLE when joining tables with a LEFT JOIN.
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want to use Query 1 in a way that behaves like a SQL Server OR clause. As far as I understand, the first CONTAINSTABLE in Query 1 joins its matches to the Building table and the remaining rows are ignored, so the CONTAINSTABLE on the Person table only sees rows that already matched the keyword in the Building table.
If the keyword is "Building", I want to search for it in both tables without requiring a matching record in both of them. A match in either table is enough.
Summary
Query 2 performs well but slows down as the number of indexed words grows. Query 1 seems optimized (according to multiple online resources and the MS documentation),
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for another optimization method are also welcome.
Thank you.
Hard to say definitively without your full data set but a couple of options to explore
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? Does performance improve if you use the search term without the wildcards (%)? If you want a word that matches a prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words, i.e. words that recur many times in your data but carry no meaning, like "the" or perhaps some business jargon. Adding these to your stop list means your full text index will ignore them, making the index smaller and thus faster.
The query below will list indexed words with the most frequent at the top
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, is tricky. OR clauses generally perform poorly, even when using simple equality operators.
I'd try either doing two queries and union the results like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
OR this is much more work, but you could load all your data into a giant denormalized table with all your columns. Then apply a full text index to that table and adjust your search criteria accordingly. It'd probably be the fastest method for searching, but then you'd have to keep the data in sync between the denormalized table and the underlying normalized tables.
SELECT B.*,P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON ... /* your join condition */
CREATE FULLTEXT INDEX ON DenormalizedTable ... /* columns and KEY INDEX */
etc...
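For the OR behaviour the question asks for with CONTAINSTABLE, one pattern that may be worth trying is to LEFT JOIN each key set and keep a row when either join found a key. The sketch below is only an illustration under assumptions: all table names and data are made up, and the two inline sets BuildingHits and PersonHits stand in for what CONTAINSTABLE would return (their HitKey column plays the role of [Key]), so the query runs without a full-text catalog.

```sql
-- Hypothetical sample data. BuildingHits / PersonHits are stand-ins for
-- CONTAINSTABLE(...) results; HitKey plays the role of the [Key] column.
SELECT B.Name, P.FirstName
FROM (VALUES (1, 'Gayan Tower'), (2, 'Main Hall'), (3, 'Annex'))
         AS B (ID, Name)
INNER JOIN (VALUES (1, 1, 'Mary'), (2, 2, 'John'), (3, 3, 'Anne'))
         AS P (ID, BuildingID, FirstName)
    ON P.BuildingID = B.ID
-- Building 1 matched the building search term.
LEFT JOIN (VALUES (1)) AS BuildingHits (HitKey)
    ON BuildingHits.HitKey = B.ID
-- Person 2 matched the person search term.
LEFT JOIN (VALUES (2)) AS PersonHits (HitKey)
    ON PersonHits.HitKey = P.ID
-- OR semantics: a hit on either side keeps the row.
WHERE BuildingHits.HitKey IS NOT NULL
   OR PersonHits.HitKey IS NOT NULL
ORDER BY B.ID;
-- Returns 2 rows: ('Gayan Tower', 'Mary') and ('Main Hall', 'John')
```

Without the final WHERE, the two LEFT JOINs filter nothing and every joined row comes back, which may explain why Query 1 does not behave like the OR version.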
Related
I'm getting the following error when running a query against my local database :
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.
This occurs when running a query that looks similar to this (apologies for the lack of detail):
SELECT <about 183 columns>
FROM tableA
INNER JOIN tvf1(<params>) tvf1
ON tvf1.id = tableA.X1
INNER JOIN tvf2(<params>) tvf2
ON tvf2.id = tableA.X2
INNER JOIN tvf3(<params>) tvf3
ON tvf3.id = tableA.X3
INNER JOIN tvf4(<params>) tvf4
ON tvf4.id = tableA.X4
INNER JOIN tvf5(<params>) tvf5
ON tvf5.id = tableA.X5
The table-valued-functions above all use a combination of GROUP BY, ROW_NUMBER() and other aggregation functions.
While binary debugging, commenting out any 2 of the above joins results in the error not occurring; it doesn't matter which ones.
My database is running on compatibility level 150 (SQL Server 2019).
If I try setting Legacy Cardinality Estimation to On then the error no longer happens, but I don't understand what this setting does.
edit :
If the database compatibility level is 130 (SQL Server 2016) then everything works as expected as well.
A concern I have is that the production database might be upgraded in future and this error could occur.
Edit :
I've managed to get the column count down to a handful now however my results are inconsistent.
SELECT
Other = TvfGroupData.Other
,GroupA = TvfGroupData.GroupA
,GroupB = TvfGroupData.GroupB
,GroupC = TvfGroupData.GroupC
, [Max Created Date] =
(SELECT MAX(Value)
FROM (VALUES
(Tvf1.CreatedDate)
,(Tvf2.CreatedDate)
,(Tvf3.CreatedDate)
--,(TvfGroupData.CreatedDate)
,(Tvf3.CreatedDate)
,(Tvf4.CreatedDate)
) AS AllValues(Value)
)
FROM TableA
LEFT JOIN Tvf1() ...
LEFT JOIN Tvf2() ...
LEFT JOIN TvfGroupData() ...
LEFT JOIN Tvf3() ...
LEFT JOIN Tvf4() ...
In the above query the following scenarios work :
excluding only GroupA column.
excluding only GroupB, GroupC columns
Other combinations all fail with the error :
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
The cardinality estimator predicts how many rows each operation in a query will return. The optimizer uses those estimates to choose the execution plan, meaning how the engine will execute and assemble the disparate sets of data.
The recent changes in the engine usually (but not always) result in better execution plans, leading to faster query responses while consuming fewer resources.
If you look into how SQL Server processes a query statement, the full data set is gathered before the unwanted columns are excluded. When multiple data sets are combined, the engine may exclude columns before joining to another set, or it may join the sets before excluding columns. It is based on what the engine "sees" in your data patterns (statistics).
UDFs are a frequent stumbling block for the query plan optimizer as a poorly formed function masks the data statistics and prevents the engine from efficiently piecing the data together.
All that to say, the updated engine is looking at your data and determining it is more efficient to combine multiple sets before eliminating unwanted columns.
I believe you may be able to fix this by subselecting the columns you need from your functions before joining to the outer set.
SELECT *
FROM (select SpecificColumns from tableA) as tableA
INNER JOIN (select SpecificColumns from tvf1(<params>)) as tvf1
ON tvf1.id = tableA.X1
INNER JOIN (select SpecificColumns from tvf2(<params>)) as tvf2
ON tvf2.id = tableA.X2
Alternatively, you may want to reconsider the Do-Everything query approach to reporting.
At ~8kb per row, you're likely passing a tremendous amount of data to your reporting system.
You may also give Cross Apply a try.
SELECT <about 183 columns>
FROM tableA
CROSS APPLY tvf1(<params>) tvf1
CROSS APPLY tvf2(<params>) tvf2
WHERE tvf1.id = tableA.X1
AND tvf2.id = tableA.X2
CROSS APPLY can instruct the optimizer to process sets in a different order.
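On the Legacy Cardinality Estimation setting from the question: it switches the optimizer back to the older estimation model (CE version 70, used up to SQL Server 2012). If the worry is a future upgrade, it may be enough to apply that model to the one problem query rather than to the whole database. A rough sketch, assuming SQL Server 2016 SP1 or later; the query against the system catalog is just a placeholder for the real one:

```sql
-- Check the compatibility level of the current database
-- (130 = SQL Server 2016, 150 = SQL Server 2019).
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- Per-query: use the legacy estimator for this statement only.
-- The SELECT is a placeholder; put the hint on the failing query.
SELECT o.name AS object_name, c.name AS column_name
FROM sys.objects AS o
INNER JOIN sys.columns AS c
    ON c.object_id = o.object_id
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

-- Database-wide equivalent of the "Legacy Cardinality Estimation" option.
-- Left commented out because it changes the current database's settings.
-- ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
```

The per-query hint keeps the rest of the workload on the newer estimator, so it is the less invasive of the two options.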
I've still been unable to track down the exact problem.
My workaround for now has been to evaluate the table-valued functions into table variables/temp tables and then join onto them instead.
So it has been changed to something like this
DECLARE @tvf1 AS TABLE ....
INSERT INTO @tvf1 SELECT * FROM tvf1()...
DECLARE @tvf2 AS TABLE ....
INSERT INTO @tvf2 SELECT * FROM tvf2()...
DECLARE @tvf3 AS TABLE ....
INSERT INTO @tvf3 SELECT * FROM tvf3()...
SELECT <about 183 columns>
FROM tableA
INNER JOIN @tvf1 tvf1
ON tvf1.id = tableA.X1
INNER JOIN @tvf2 tvf2
ON tvf2.id = tableA.X2
INNER JOIN @tvf3 tvf3
ON tvf3.id = tableA.X3
Using Microsoft T-SQL, I know there is a difference in the result set between putting a filter in the join versus putting the filter in the where clause, because I get a different row count, and it's hard to pinpoint because it is a complex query with thousands of rows.
So does anybody know how this might lead to different results?
I feel safer with the where clause, but I am curious why the other one didn't get the same results.
Select
.....
left outer join
datamart_agent ag on a.CUSTOMER_TKN = ag.customer_tkn
and a.CUSTOMER_ACCT_TKN = ag.customer_acct_tkn
and a.ACCOUNT_PKG_TKN = ag.account_pkg_tkn
--where (commented out)
and ag.TYPE_DESC = 'Agent'
versus
Select
.....
left outer join
datamart_agent ag on a.CUSTOMER_TKN = ag.customer_tkn
and a.CUSTOMER_ACCT_TKN = ag.customer_acct_tkn
and a.ACCOUNT_PKG_TKN = ag.account_pkg_tkn
where
--and (commented out)
ag.TYPE_DESC = 'Agent'
If the condition is in your WHERE clause, it drops every row for which no match was found. It acts like an eraser: for unmatched rows Type_Desc is NULL, the comparison ag.TYPE_DESC = 'Agent' is not true, and the row is removed, so the LEFT JOIN effectively behaves like an INNER JOIN.
The safer thing, if you actually want a left join, is to put the condition in the ON clause. If there was no match, the row still appears in the result. When Type_Desc is NULL, it means you don't know who the agent is, but not knowing is no reason to delete the record.
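A small illustration of the difference, using made-up inline data rather than the real datamart tables (the AcctID column and the three sample accounts are assumptions for the sketch):

```sql
-- Hypothetical data: account 1 has an 'Agent' row, account 2 only has a
-- 'Broker' row, account 3 has no row on the right side at all.

-- Filter in the ON clause: every account survives the LEFT JOIN.
SELECT a.AcctID, ag.TYPE_DESC
FROM (VALUES (1), (2), (3)) AS a (AcctID)
LEFT OUTER JOIN (VALUES (1, 'Agent'), (2, 'Broker')) AS ag (AcctID, TYPE_DESC)
    ON a.AcctID = ag.AcctID
   AND ag.TYPE_DESC = 'Agent'
ORDER BY a.AcctID;
-- Returns 3 rows: (1, 'Agent'), (2, NULL), (3, NULL)

-- Filter in the WHERE clause: rows whose ag side is NULL or not 'Agent'
-- are removed after the join.
SELECT a.AcctID, ag.TYPE_DESC
FROM (VALUES (1), (2), (3)) AS a (AcctID)
LEFT OUTER JOIN (VALUES (1, 'Agent'), (2, 'Broker')) AS ag (AcctID, TYPE_DESC)
    ON a.AcctID = ag.AcctID
WHERE ag.TYPE_DESC = 'Agent'
ORDER BY a.AcctID;
-- Returns 1 row: (1, 'Agent')
```

The row-count difference in the real query is most likely the set of accounts that look like accounts 2 and 3 here.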
This query returns all the elements in the table la and all nulls for fields coming from the lar table which is not what I expected.
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id AND la.listing_id = 2780;
This query returns correct and expected results but shouldn't both queries do the same thing ?
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id
WHERE la.listing_id = 2780;
What am I missing here?
I want to make conditional joins because I have noticed that for complex queries PostgreSQL does the join and then applies the WHERE clause, which is actually very slow. How can I make the database filter out some records before doing the JOIN?
The confusion around LEFT JOIN and WHERE clause has been clarified many times:
SQL / PostgreSQL left join ignores "on = constant" predicate, on left table
This interesting question remains:
How to make the database filter out some records before doing the JOIN?
There are no explicit query hints in Postgres. (Which is a matter of ongoing debate.) But there are still various tricks to make Postgres bend your way.
But first, ask yourself: Why did the query planner estimate the chosen plan to be cheaper to begin with? Is your server configuration basically sane? Cost settings adequate? autovacuum running? Postgres version outdated? Are you working around an underlying problem that should really be fixed?
If you force Postgres to do it your way, you should be sure it won't backfire after a version upgrade or an update to the server configuration ... You'd better know exactly what you are doing.
That said, you can force Postgres to "filter out some records before doing the JOIN" with a subquery where you add OFFSET 0 - which is just noise, logically, but prevents Postgres from rearranging it into the form of a regular join. (Query hint after all)
SELECT la.listing_id, la.id, lar.*
FROM (
SELECT listing_id, id
FROM la
WHERE listing_id = 2780
OFFSET 0
) la
LEFT JOIN lar ON lar.application_id = la.id;
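Or you can use a CTE (less obscure, but more expensive; note that in PostgreSQL 12 and later a CTE referenced only once is inlined unless it is declared AS MATERIALIZED). Or other tricks like setting certain config parameters. Or, in this particular case, I would use a LATERAL join to the same effect: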
SELECT la.listing_id, la.id, lar.*
FROM la
LEFT JOIN LATERAL (
SELECT *
FROM lar
WHERE application_id = la.id
) lar ON true
WHERE la.listing_id = 2780;
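For completeness, the CTE variant mentioned above might look roughly like the sketch below. It assumes PostgreSQL 12 or later (where MATERIALIZED is available); the inline sample rows and the note column are invented so the statement runs on its own:

```sql
-- Hypothetical inline data standing in for the real la / lar tables.
WITH la (listing_id, id) AS (
    VALUES (2780, 1), (2780, 2), (9999, 3)
),
lar (application_id, note) AS (
    VALUES (1, 'ref A'), (3, 'ref C')
),
-- MATERIALIZED asks Postgres to evaluate this CTE on its own instead of
-- folding it into the outer query, so the filter runs before the join.
la_filtered AS MATERIALIZED (
    SELECT listing_id, id
    FROM la
    WHERE listing_id = 2780
)
SELECT f.listing_id, f.id, lar.note
FROM la_filtered AS f
LEFT JOIN lar ON lar.application_id = f.id
ORDER BY f.id;
-- Returns 2 rows: (2780, 1, 'ref A') and (2780, 2, NULL)
```

As with OFFSET 0, this only overrides the planner's choice; it is worth comparing EXPLAIN ANALYZE output with and without it before keeping it.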
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
Here is an extensive blog on query hints by 2ndQuadrant. Five years old but still valid.
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
So no matter how you try to filter with AND la.listing_id = 2780; in the ON clause, you still get all the rows from the first table. But only those with la.listing_id = 2780 can have something other than NULL on the right side.
The behaviour is different if you use INNER JOIN: in that case only the matching rows are produced, and the AND condition filters the rows.
So to make the first query return only the rows you want, the condition on the left table has to go in the WHERE clause: WHERE la.listing_id = 2780
The concern with the second query is that, logically, it joins every row and only then filters down to the ones you need (although the planner is free to apply the filter first).
I have a simple query that relies on two full-text indexed tables, but it runs extremely slow when I have the CONTAINS combined with any additional OR search. As seen in the execution plan, the two full text searches crush the performance. If I query with just 1 of the CONTAINS, or neither, the query is sub-second, but the moment you add OR into the mix the query becomes ill-fated.
The two tables are nothing special, they're not overly wide (42 cols in one, 21 in the other; maybe 10 cols are FT indexed in each) or even contain very many records (36k recs in the biggest of the two).
I was able to solve the performance by splitting the two CONTAINS searches into their own SELECT queries and then UNION the three together. Is this UNION workaround my only hope?
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
WHERE a.CollrTeam_Text LIKE '%fa%'
OR CONTAINS(a.*, '"*fa*"')
OR CONTAINS(b.*, '"*fa*"')
Execution Plan:
I'd be curious to see if a LEFT JOIN to an equivalent CONTAINSTABLE would perform any better. Something like:
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
LEFT JOIN CONTAINSTABLE(collections, *, '"*fa*"') ct1 on a.CollectionID = ct1.[Key]
LEFT JOIN CONTAINSTABLE(determinations, *, '"*fa*"') ct2 on b.CollectionID = ct2.[Key]
WHERE a.CollrTeam_Text LIKE '%fa%'
OR ct1.[Key] IS NOT NULL
OR ct2.[Key] IS NOT NULL
I was going to suggest a UNION of each as its own query, but as I read your question I saw that you have already found that. I can't think of a better way, so if it helps, use it. The UNION method is a common approach to a poorly performing query that has several OR conditions where each performs well on its own.
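The shape of that rewrite can be sketched as follows. This is only an illustration: the two CTEs are inline stand-ins for the real tables, the Notes and Taxon columns are invented, and plain LIKE predicates take the place of CONTAINS so the statement runs without a full-text index.

```sql
-- Hypothetical inline stand-ins for the two full-text indexed tables.
WITH collections (CollectionID, Notes) AS (
    SELECT * FROM (VALUES (1, 'fauna'), (2, 'flora'), (3, 'rocks'))
        AS v (CollectionID, Notes)
),
determinations (CollectionID, Taxon) AS (
    SELECT * FROM (VALUES (1, 'oak'), (2, 'fagus'), (3, 'quartz'))
        AS v (CollectionID, Taxon)
)
-- Branch 1: only the predicate on the first table.
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
WHERE a.Notes LIKE 'fa%'
UNION
-- Branch 2: only the predicate on the second table.
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
WHERE b.Taxon LIKE 'fa%'
ORDER BY CollectionID;
-- Returns 2 rows: 1 and 2
```

UNION (rather than UNION ALL) matters here: a row matching both predicates would otherwise appear twice, which the single OR query would never do.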
I would probably use the UNION. If you are really against it, you might try something like:
SELECT a.CollectionID
FROM collections a
LEFT OUTER JOIN (SELECT CollectionID FROM collections WHERE CONTAINS(*, '"*fa*"')) c
ON c.CollectionID = a.CollectionID
LEFT OUTER JOIN (SELECT CollectionID FROM determinations WHERE CONTAINS(*, '"*fa*"')) d
ON d.CollectionID = a.CollectionID
WHERE a.CollrTeam_Text LIKE '%fa%'
OR c.CollectionID IS NOT NULL
OR d.CollectionID IS NOT NULL
We've experienced the exact same problem and, at the time, put it down to our query being badly formed: SQL 2005 had let us get away with it, but 2008 wouldn't.
In the end, we split the query into 2 SELECTs that were called using an IF. Glad someone else has had the same problem and that it's a known issue. We were seeing queries on a table with ~150,000 rows + full-text going from < 1 second (2005) to 30+ seconds (2008).
We have two Tables:
Document: id, title, document_type_id, showon_id
DocumentType: id, name
Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possibilities:
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42
);
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT Document.document_type_id FROM Document WHERE showon_id = 42
);
Our question is: when, if ever, is it better to use DISTINCT to get the smaller record set, versus retrieving every matching row and letting the IN statement walk the list to the first match? (We guess that's what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)
You can use a join:
SELECT DISTINCT DocumentType.*
FROM DocumentType
INNER JOIN Document
ON DocumentType.id=Document.document_type_id
WHERE Document.showon_id = 42
I think it's the best way to do it.
For the best performance you should use:
SELECT DISTINCT dt.*
FROM
DocumentType dt
INNER JOIN Document d ON dt.id=d.document_type_id and d.showon_id = 42
Joins are very efficient at bridging multiple tables, whereas the nested query in the WHERE clause needs to perform a separate result selection that then filters down the FROM clause results. The join statement is also much more readable.
I would also put an index on showon_id, in addition to the primary keys and foreign key relationship.
My answer differs from wmasm's answer only by moving the showon_id filter up to the inner join. For MS SQL 2k5, I think the interpreter is smart enough to do this automatically, but you always want to work with the smallest result set possible. Bringing your filters up to inner join statements can limit the number of rows the query has to work with when joining many tables together. If you do this, though, you should understand that the filter is evaluated for every row comparison, so complex filters (such as x LIKE '%a' or function calls) are better left for the WHERE clause so that the inner joins can first filter out unnecessary comparisons.
Use an EXISTS. It is sometimes faster and, in my opinion, more readable than a DISTINCT and JOIN. Just for kicks, pls reply with the query plan for this query and the JOIN above, and see if anything is different (they may be optimized down to the same plan). If they are the same, I'd recommend the EXISTS as it is closer to a "plain language" description than a JOIN (because you don't want any of the data from Document, etc.)
SELECT whatever
FROM DocumentType dt
WHERE EXISTS( SELECT *
FROM Document
WHERE dt.id = document_type_id
AND showon_id = 42)
To get the query plan (ref: http://msdn.microsoft.com/en-us/library/ms180765(SQL.90).aspx), do:
SET SHOWPLAN_TEXT ON
GO
SELECT ...
GO
From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the result set the server needs to go into the Document table and retrieve all document_type_id values WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished, the hash table will contain distinct values anyway. After that the query execution goes into the Document_Type table, scans the primary key and probes the hash table. Note that this depends, e.g. maybe it's more efficient not to use a hash table when the expected row count from the Document table is low compared to Document_Type, but in general you get the same query plan as for the query wmasm just suggested.
Follow up on Matt's answer:
I've enabled the query plan and tested the following four different queries that have come up so far:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DISTINCT DocumentType.* FROM DocumentType INNER JOIN Document ON DocumentType.id=Document.document_type_id WHERE Document.showon_id = 42;
SELECT DocumentType.* FROM DocumentType WHERE EXISTS ( SELECT * FROM Document WHERE DocumentType.id=Document.document_type_id AND showon_id = 42);
The query plan for all four queries turned out to be the same:
|--Hash Match(Right Semi Join, HASH:([Document].[document_type_id])=([DocumentType].[Id]))
|--Hash Match(Inner Join, HASH:([Document].[Title], [Uniq1005])=([Document].[Title], [Uniq1005]), RESIDUAL:([Document].[Title] as [Document].[Title] = [Document].[Title] as [Document].[Title] AND [Uniq1005] = [Uniq1005]))
| |--Index Seek(OBJECT:([Document].[IX_Document_3] AS [Document]), SEEK:([Document].[showon_id]=(1)) ORDERED FORWARD)
| |--Index Scan(OBJECT:([Document].[IX_Document_1] AS [Document]))
|--Table Scan(OBJECT:([DocumentType] AS [DocumentType]))
I am not sure what every line and element means, but it seems that from the performance perspective it does not matter how you construct the query for this kind of problem...
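One caveat when treating these forms as interchangeable: the JOIN only matches the others because of its DISTINCT. The sketch below uses invented inline rows (two documents of the same type with showon_id = 42) in place of the real tables to show what each form returns on its own:

```sql
-- Hypothetical inline data: two documents share document type 1.
WITH DocumentType (id, name) AS (
    SELECT * FROM (VALUES (1, 'Invoice'), (2, 'Contract'), (3, 'Memo'))
        AS v (id, name)
),
Document (id, title, document_type_id, showon_id) AS (
    SELECT * FROM (VALUES (100, 'Inv A', 1, 42),
                          (101, 'Inv B', 1, 42),
                          (102, 'Contract A', 2, 7))
        AS v (id, title, document_type_id, showon_id)
)
SELECT 'EXISTS' AS variant, dt.id, dt.name
FROM DocumentType dt
WHERE EXISTS (SELECT 1
              FROM Document d
              WHERE d.document_type_id = dt.id
                AND d.showon_id = 42)
UNION ALL
SELECT 'IN', dt.id, dt.name
FROM DocumentType dt
WHERE dt.id IN (SELECT d.document_type_id
                FROM Document d
                WHERE d.showon_id = 42)
UNION ALL
SELECT 'JOIN without DISTINCT', dt.id, dt.name
FROM DocumentType dt
INNER JOIN Document d ON dt.id = d.document_type_id
WHERE d.showon_id = 42;
-- EXISTS and IN each return (1, 'Invoice') once;
-- the JOIN without DISTINCT returns it twice, once per matching document.
```

So EXISTS and IN express "is there at least one matching document" directly, while the JOIN needs DISTINCT to remove the duplicates it creates.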