I'm using Microsoft SQL Server 2008. I'm not an expert with it but I created a full-text index and have been writing some queries.
It's working without errors and is returning some results, but rows I know should match are not always included.
Is there any way to verify or inspect the index? I went in several times and repopulated the index, so I'm pretty sure it's current. But what do you do when you don't seem to get the right results?
My query is fairly complex but here it is if anyone is thinking that's the problem:
DECLARE @StartRow int;
DECLARE @MaxRows int;
SET @StartRow = 1;
SET @MaxRows = 10;
WITH ArtTemp AS
(SELECT TOP (@StartRow + @MaxRows) ROW_NUMBER() OVER (ORDER BY ArtViews DESC) AS RowID,
Article.ArtID,Article.ArtTitle,Article.ArtSlug,Category.CatID,Category.CatTitle,
Article.ArtDescription,Article.ArtCreated,Article.ArtUpdated,Article.ArtUserID,
[User].UsrDisplayName AS UserName
FROM Article
INNER JOIN Subcategory ON Article.ArtSubcategoryID = Subcategory.SubID
INNER JOIN Category ON Subcategory.SubCatID = Category.CatID
INNER JOIN [User] ON Article.ArtUserID = [User].UsrID
WHERE CONTAINS(Article.*,'FORMSOF(INFLECTIONAL,"HTML")'))
SELECT ArtID,ArtTitle,ArtSlug,CatID,CatTitle,ArtDescription,ArtCreated,
ArtUpdated,ArtUserID,UserName
FROM ArtTemp
WHERE RowID BETWEEN @StartRow + 1 AND (@StartRow + @MaxRows)
ORDER BY RowID
In the query above, rows are returned. However, at least one row I know to contain the word "HTML" is not included.
Any troubleshooting tips?
I'm not a SQL expert, but 'SELECT TOP (@StartRow + @MaxRows)' reads to me as: select the top 11 rows (start = 1, max = 10) that match the criteria, regardless of their RowID, not rows 1-10. Later you select your results with 'WHERE RowID BETWEEN @StartRow + 1 AND (@StartRow + @MaxRows)', which only keeps rows with a RowID between 2 and 11. That may be why you aren't receiving all of the results you expect. If that is not the case, I would make sure the rows you are expecting meet all of the join criteria.
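To see the off-by-one concretely, here is the same windowing arithmetic mocked up in plain Python (the article names and row counts are made up; this only mimics the ROW_NUMBER/BETWEEN logic, not SQL Server itself):

```python
# Simulate the pagination logic from the question with plain Python.
# ROW_NUMBER() assigns 1..N over the matching rows ordered by views;
# the outer query then keeps RowID BETWEEN StartRow + 1 AND StartRow + MaxRows.
matching_rows = [f"article-{n}" for n in range(1, 21)]  # 20 rows that match CONTAINS

start_row, max_rows = 1, 10

# TOP (StartRow + MaxRows) -> first 11 rows by the ORDER BY
top_n = matching_rows[: start_row + max_rows]

# RowID BETWEEN StartRow + 1 AND StartRow + MaxRows -> RowIDs 2..11
page = [row for row_id, row in enumerate(top_n, start=1)
        if start_row + 1 <= row_id <= start_row + max_rows]

print(page[0])   # "article-2": the top-ranked "article-1" is silently dropped
print(len(page)) # 10 rows, but shifted down by one
```

So with @StartRow = 1 the first matching row never appears on any page; @StartRow = 0 is what a first page normally wants.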
I don't know if this is the issue, but when I first started working with MySQL and Fulltext indices, I often had issues with "stopwords" (http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html) and minimum word lengths (http://www.devcha.com/2008/03/display-mysql-fulltext-search-settings.html). Sometimes, the fulltext search would just ignore certain terms because they were on the stopword list, or they were shorter than the min word length.
There was also another issue where a standard fulltext search wouldn't return ANYTHING if more than 50% of the rows in my table met the criteria for the search. Switching to a boolean search mode solved the >50% problem, but not the stopword/min length issue.
I had to create a fallback LIKE '%...%' search for the full-text stuff. Possibly not the best way to go, but it at least returned valid results when the full-text search didn't.
Microsoft SQL Server may be different, but I hope this helps a bit!
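To illustrate the fallback idea, here is a rough sketch using Python's built-in sqlite3 as a stand-in database (the table, data, and "empty full-text result" are all invented; the point is only unioning a LIKE search with the full-text hits so a dropped term still matches):

```python
import sqlite3

# Illustrative only: sqlite3 stands in for MySQL/SQL Server here. We union the
# full-text results with a LIKE fallback so short or stopword-listed terms
# still match. The "full-text" side is faked as empty, as if the engine
# silently ignored the term.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO article (body) VALUES (?)",
                 [("Intro to HTML",), ("CSS basics",), ("More HTML tips",)])

term = "HTML"
fulltext_hits = []  # pretend the engine dropped the term (stopword / too short)

fallback = conn.execute(
    "SELECT id FROM article WHERE body LIKE '%' || ? || '%'", (term,)
).fetchall()

ids = sorted({row[0] for row in fulltext_hits} | {row[0] for row in fallback})
print(ids)  # [1, 3]: both HTML articles found despite the empty full-text result
```

A leading-wildcard LIKE can't use an index, so it's a safety net rather than a replacement for the full-text search.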
This is a poorly performing query I have ... what have I done so wrong?
Please help me: it is executed tons of times in my system, and solving this will give me a ladder to heaven.
I checked the system with sp_Blitz and no critical issues were found.
Here is the query:
SELECT MAX(F.id) OVER (PARTITION BY idstato ORDER BY F.id DESC) AS id
FROM jfel_tagxml_invoicedigi F
INNER JOIN jfel_invoice_state S ON F.id = S.idinvoice
WHERE S.idstato = @idstato
AND S.id = F.idstatocorrente
AND F.sequence_invoice % @number_service_installed = @idServizio
ORDER BY F.id DESC,
F.idstatocorrente OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY;
Here is the query plan
https://www.brentozar.com/pastetheplan/?id=SyYL5JOeE
I can send you my system properties privately.
Update:
I made some modifications and it is better, but I think it could be better still ...
Here is the new query:
SELECT MAX(F.id) AS id
FROM jfel_tagxml_invoicedigi F
INNER JOIN jfel_invoice_state S ON F.id = S.idinvoice
WHERE S.idstato = @idstato
AND S.id = F.idstatocorrente
AND F.sequence_invoice % @number_service_installed = @idServizio;
And the new plan:
https://www.brentozar.com/pastetheplan/?id=SJ-5GDqeE
Update:
I made some more modifications and it is better, but I think it could be better still ...
Here is the new query:
SELECT TOP 1 F.id AS id
FROM jfel_tagxml_invoicedigi AS F
INNER JOIN jfel_invoice_state AS S
ON F.idstatocorrente = S.id
WHERE S.idstato = 1 AND S.id = F.idstatocorrente
AND S.datastato > DATEADD(DAY, -5, GETDATE())
AND F.progressivo_fattura % 1 = 0
ORDER BY S.datastato
And the newest plan
https://www.brentozar.com/pastetheplan/?id=S1xRkL51S
Filtering on calculated fields hurts performance. You can apply your other filters first, and do the calculated filter as a last step, so there are fewer rows left to match. This may fill TEMPDB, because the intermediate recordset is stored there; in that case you can either increase its size or use another method.
Here is your second query written like this (you may need to adjust it, I just wrote it in Notepad++):
SELECT MAX(id) AS id
FROM (
    SELECT F.id, F.sequence_invoice % @number_service_installed AS idServizio
    FROM jfel_tagxml_invoicedigi F
    INNER JOIN jfel_invoice_state S ON F.id = S.idinvoice
    WHERE S.idstato = @idstato
    AND S.id = F.idstatocorrente
    -- AND F.sequence_invoice % @number_service_installed = @idServizio
) AS T -- the derived table needs an alias
WHERE idServizio = @idServizio
;
Instead of the subquery, you can try a temp table or CTE as well, maybe one is the clear winner above the others, worth a try for all if you want maximum performance.
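For what it's worth, here is the shape of that rewrite checked with Python's built-in sqlite3 (table and column names are simplified and the data is made up): both forms return the same answer, since the rewrite only moves where the non-sargable modulo is evaluated.

```python
import sqlite3

# Sketch of the rewrite above using sqlite3. The direct form filters on the
# modulo expression inline; the deferred form computes it in a derived table
# and filters in the outer query. Results must be identical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, seq INTEGER)")
conn.executemany("INSERT INTO invoice (seq) VALUES (?)", [(n,) for n in range(1, 11)])

services, target = 3, 1  # stand-ins for @number_service_installed, @idServizio

direct = conn.execute(
    "SELECT MAX(id) FROM invoice WHERE seq % ? = ?", (services, target)
).fetchone()[0]

deferred = conn.execute(
    """SELECT MAX(id) FROM (
           SELECT id, seq % ? AS svc FROM invoice
       ) AS t WHERE svc = ?""", (services, target)
).fetchone()[0]

print(direct, deferred)  # both 10: seq values 1,4,7,10 satisfy seq % 3 = 1
```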
The date calculation is non-sargable; you could try using a variable with OPTION (RECOMPILE):
DECLARE @d date
SET @d = DATEADD(DAY, -5, GETDATE())
SELECT TOP 1 F.id AS id
FROM jfel_tagxml_invoicedigi AS F
INNER JOIN jfel_invoice_state AS S
ON F.idstatocorrente = S.id
WHERE S.idstato = 1 AND S.id = F.idstatocorrente
AND S.datastato > @d
AND F.progressivo_fattura % 1 = 0
ORDER BY S.datastato
OPTION (RECOMPILE)
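The "compute the boundary once, pass it as a value" part of this can be sketched with Python's sqlite3 (OPTION (RECOMPILE) is SQL Server-only, so only the variable idea is shown; the table and dates are made up):

```python
import sqlite3
from datetime import date, timedelta

# Compute the date boundary once in the application (the @d variable above)
# and pass it as a plain parameter, instead of burying dateadd(...) in WHERE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (id INTEGER PRIMARY KEY, datastato TEXT)")
today = date(2024, 1, 10)
conn.executemany("INSERT INTO state (datastato) VALUES (?)",
                 [((today - timedelta(days=n)).isoformat(),) for n in range(10)])

cutoff = (today - timedelta(days=5)).isoformat()  # plays the role of @d
recent = conn.execute(
    "SELECT COUNT(*) FROM state WHERE datastato > ?", (cutoff,)
).fetchone()[0]
print(recent)  # 5: only the rows newer than the 5-day cutoff
```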
I think you need a NONCLUSTERED INDEX for the query you describe above.
If you cannot identify which fields of your table the nonclustered index needs, simply generate an actual execution plan from SQL Server 2008 Management Studio: it detects missing indexes for you
and shows the details of the missing index in green text.
You can hover your mouse pointer over the missing-index text and Management Studio will show the T-SQL code required to create the missing index, or you can right-click on the missing-index text and select the "Missing Index Details" option to see the details of the missing index.
For more information, you can visit this article Create Missing Index From the Actual Execution Plan
I hope this solution helps you.
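The workflow the missing-index hint automates ("create the index the plan asks for, then confirm the plan uses it") can be sketched with Python's sqlite3, whose EXPLAIN QUERY PLAN is a rough text-only analogue of Management Studio's graphical plan (table and index names here are invented):

```python
import sqlite3

# Create the index the query wants, then check the plan before and after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_state (id INTEGER PRIMARY KEY, idstato INTEGER)")

before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM invoice_state WHERE idstato = 1"
).fetchall()

conn.execute("CREATE INDEX ix_state ON invoice_state (idstato)")

after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM invoice_state WHERE idstato = 1"
).fetchall()

# The last column of each plan row is a human-readable detail string.
print(any("ix_state" in row[-1] for row in before))  # False: full table scan
print(any("ix_state" in row[-1] for row in after))   # True: the index is used
```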
Window aggregation carries a big performance penalty. Moving the sliding-window mechanism outside the database (i.e. into your application's RAM) is the universal way of optimizing it.
Otherwise, you can try to give more memory to each database session (in PostgreSQL you can tweak this via the work_mem parameter; other databases may or may not let you).
The main reason it is slow is that it forces sorting and materializing of the sorted table.
As a follow up to my earlier question:
Some of the answers and comments suggest that
select count(*) is mostly equivalent to select count(id) where id is the primary key.
I have always favored select count(1); I even always use if exists (select 1 from table_name) ...
Now my question is:
1) What is the optimal way of issuing a select count query over a table?
2) If we add a where clause: where msg_type = X; if msg_type has a non_clustered index, would select count(msg_type) from table_name where msg_type = X be the preferred option for counting?
Side-bar:
From a very early age, I was taught that select * from ... is BAD BAD BAD; I guess this has made me skeptical of select count(*) as well.
count(*) -- counts all rows, including those with null values
count(id) -- counts this column's values, excluding nulls
count(1) is the same as count(*)
If we add a where clause: where msg_type = X; if msg_type has a non_clustered index, would select count(msg_type) from table_name where msg_type = X be the preferred option for counting?
As I mentioned in my previous answer, SQL Server uses a cost-based optimizer, and the plan chosen depends on many factors; SQL tries to find the cheapest plan in the minimum time possible.
Now when you issue count(msg_type), SQL may choose that index if it is cheaper, or scan another one, as long as it gives the right results (no nulls in the output).
I always tend to use count(*), unless I want to exclude nulls.
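The null-handling claims above are easy to check; here is a quick demonstration with Python's built-in sqlite3 (the COUNT semantics involved are standard SQL, the same as SQL Server's; the table and data are made up):

```python
import sqlite3

# COUNT(*) and COUNT(1) count every row; COUNT(col) skips NULLs in that column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (msg_type TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), (None,)])

count_star, count_one, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(msg_type) FROM t"
).fetchone()

print(count_star, count_one, count_col)  # 3 3 2: the NULL row is excluded only by COUNT(msg_type)
```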
Well, those count queries are not identical and will do different things.
select count(1)
select count(*)
are identical, and will count every record!
select count(col_name)
will count only NOT NULL values of col_name!
So, unless col_name is the PK as you said, those queries will do different things.
As for your second question, it depends; we can't give you a generic answer that will always be true. You will have to look at the explain plan or just check for yourself, although I believe that adding this WHERE clause while having this index will be better.
I am having some performance issues with a query I am running in SQL Server 2008. I have the following query:
Query1:
SELECT GroupID, COUNT(*) AS TotalRows FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2
ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word) GROUP BY GroupID
Table1 contains about 500,000 rows. Table2 contains about 50,000, but will eventually contain millions. Playing around with the query, I found that re-writing the query as follows will reduce the execution time of the query to under 1 second.
Query 2:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
What I do not understand is that this is a simple count query. If I execute the following query on Table1, it returns in < 1 s:
Query 3:
SELECT Count(*) FROM Table1
This query returns around 500,000 as the result.
However, the original query (Query 1) mentioned above only returns a count of 50,000 and takes 3 s to execute, even though simply removing the GROUP BY (Query 2) reduces the execution time to < 1 s.
I do not believe this is an indexing issue as I already have indexes on the appropriate columns. Any help would be very appreciated.
Performing a simple COUNT(*) FROM table can do a much more efficient scan of the clustered index, since it doesn't have to care about any filtering, joining, grouping, etc. The queries that include full-text search predicates and mysterious subqueries have to do a lot more work. The count is not the most expensive part there - I bet they're still relatively slow if you leave the count out but leave the group by in, e.g.:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
GROUP BY GroupID;
Looking at the provided actual execution plan in the free SQL Sentry Plan Explorer*, I see two warnings (the plan screenshots are not reproduced here), which lead me to believe you should:
Update the statistics on both Inventory and A001_Store_Inventory so that the optimizer can get a better rowcount estimate (which could lead to a better plan shape).
Ensure that Inventory.ItemNumber and A001_Store_Inventory.ItemNumber are the same data type to avoid an implicit conversion.
(*) disclaimer: I work for SQL Sentry.
You should have a look at the query plan to see what SQL Server is doing to retrieve the data you requested. Also, I think it would be better to rewrite your original query as follows:
SELECT
Table1.GroupID -- When you use JOINs, it's always better to specify Table (or Alias) names
,COUNT(Table1.GroupID) AS TotalRows
FROM
Table1
INNER JOIN
Table2 ON
(Table2.Column1 = Table1.Column1) AND
(Table2.GroupID = @GroupID)
WHERE
CONTAINS(Table1.*, @Word)
GROUP BY
Table1.GroupID
Also, keep in mind that a simple COUNT and a COUNT with a JOIN and GROUP BY are not the same thing. In one case, it's just a matter of going through an index and making a count, in the other there are other tables and grouping involved, which can be time consuming depending on several factors.
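To make that difference concrete, here is a miniature version with Python's built-in sqlite3 (tables, columns, and data are all made up stand-ins for Table1/Table2): a bare COUNT(*) counts the whole table, while the join narrows the rows first and the GROUP BY then counts per group.

```python
import sqlite3

# t1 plays Table1, t2 plays the filtered Table2 subquery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 INTEGER, group_id INTEGER)")
conn.execute("CREATE TABLE t2 (col1 INTEGER, group_id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(n, n % 2) for n in range(10)])
conn.executemany("INSERT INTO t2 VALUES (?, 0)", [(0,), (2,), (4,)])

total = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]

grouped = conn.execute(
    """SELECT t1.group_id, COUNT(t1.group_id) AS total_rows
       FROM t1 JOIN t2 ON t2.col1 = t1.col1 AND t2.group_id = 0
       GROUP BY t1.group_id"""
).fetchall()

print(total)    # 10: every row in t1
print(grouped)  # [(0, 3)]: only the three rows surviving the join are counted
```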
I'm using the following SQL Server query, which searches a full-text index and appears to be working correctly. Some additional work is included so the query works with paging.
However, my understanding is that full-text searches return results sorted according to ranking, which would be nice.
But I get an error if I remove the OVER clause near the top. Can anyone tell me how this query could be modified to not resort the results?
DECLARE @StartRow int;
DECLARE @MaxRows int;
SET @StartRow = 0;
SET @MaxRows = 10;
WITH ArtTemp AS
(SELECT TOP (@StartRow + @MaxRows) ROW_NUMBER() OVER (ORDER BY ArtViews DESC) AS RowID,
Article.ArtID,Article.ArtTitle,Article.ArtSlug,Category.CatID,Category.CatTitle,
Article.ArtDescription,Article.ArtCreated,Article.ArtUpdated,Article.ArtUserID,
[User].UsrDisplayName AS UserName
FROM Article
INNER JOIN Subcategory ON Article.ArtSubcategoryID = Subcategory.SubID
INNER JOIN Category ON Subcategory.SubCatID = Category.CatID
INNER JOIN [User] ON Article.ArtUserID = [User].UsrID
WHERE CONTAINS(Article.*,'FORMSOF(INFLECTIONAL,"htmltag")'))
SELECT ArtID,ArtTitle,ArtSlug,CatID,CatTitle,ArtDescription,ArtCreated,ArtUpdated,
ArtUserID,UserName
FROM ArtTemp
WHERE RowID BETWEEN @StartRow + 1 AND (@StartRow + @MaxRows)
ORDER BY RowID
Thanks.
I'm really not an expert in FTS but hopefully this helps get you started.
First, ROW_NUMBER requires OVER (ORDER BY xxx) in SQL Server. Even if you tell it to order by a constant value, it still might end up rearranging the results. So, if you depend on row numbering to handle your pagination, you're stuck with some kind of sorting.
When I dig around on FTS for that "return results sorted according to ranking" bit, I find a couple articles that describe ordering by rank. In a nutshell, they say that RANK is a column explicitly returned by CONTAINSTABLE. So if you can't find a way to dig out the results ranking from CONTAINS, you might try joining against CONTAINSTABLE instead and use the RANK column explicitly as your order by value with ROW_NUMBER. Example (syntax may be a little off):
SELECT TOP (@StartRow + @MaxRows)
ROW_NUMBER() OVER (ORDER BY MyFTS.[RANK] DESC) AS RowID,
Article.ArtID,Article.ArtTitle,Article.ArtSlug,Category.CatID,Category.CatTitle,
Article.ArtDescription,Article.ArtCreated,Article.ArtUpdated,Article.ArtUserID,
[User].UsrDisplayName AS UserName
FROM Article
INNER JOIN Subcategory ON Article.ArtSubcategoryID = Subcategory.SubID
INNER JOIN Category ON Subcategory.SubCatID = Category.CatID
INNER JOIN [User] ON Article.ArtUserID = [User].UsrID
INNER JOIN CONTAINSTABLE(Article, *, 'FORMSOF(INFLECTIONAL,"htmltag")') AS MyFTS
    ON MyFTS.[KEY] = Article.ArtID -- join on the full-text key column (assumed here to be ArtID)
The end result is that you're still sorting, but you're doing so on your rankings.
Also, the MSDN page says that CONTAINSTABLE has an ability to limit results on a TOP N basis, too. Maybe this would also be of use to you.
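CONTAINSTABLE itself only exists in SQL Server, but the "number the rows by descending rank, then slice out the page" idea it feeds into is just this (ranks and slugs below are made up):

```python
# Each hit is (slug, rank), as if returned by CONTAINSTABLE's KEY/RANK columns.
hits = [("intro-to-html", 180), ("css-notes", 40), ("html-tags", 150)]

start_row, max_rows = 0, 2
ranked = sorted(hits, key=lambda h: h[1], reverse=True)  # ORDER BY [RANK] DESC
numbered = list(enumerate(ranked, start=1))              # ROW_NUMBER()
page = [slug for row_id, (slug, _) in numbered
        if start_row + 1 <= row_id <= start_row + max_rows]

print(page)  # ['intro-to-html', 'html-tags']: the two best-ranked hits
```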
I'll describe what I am trying to achieve:
I am passing down to an SP an XML document with name/value pairs that I put into a table variable, let's say @nameValuePairs.
I need to retrieve a list of IDs for expressions (a table) with those exact match of name-value pairs (attributes, another table) associated.
This is my schema:
Expressions table --> (expressionId, attributeId)
Attributes table --> (attributeId, attributeName, attributeValue)
After trying complicated stuff with dynamic SQL and evil cursors (which works but it's painfully slow) this is what I've got now:
--do the magic plz!
-- retrieve number of name-value pairs
SET @noOfAttributes = (SELECT COUNT(*) FROM @nameValuePairs)
select distinct
e.expressionId, a.attributeName, a.attributeValue
into
#temp
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
group by
e.expressionId, a.attributeName, a.attributeValue
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select distinct
expressionId
from
#temp
group by expressionId
having count(*) = @noOfAttributes
Can people please review and see if they can spot any problems? Is there a better way of doing this?
Any help appreciated!
I believe that this would satisfy the requirement you're trying to meet. I'm not sure how much prettier it is, but it should work and wouldn't require a temp table:
SET @noOfAttributes = (SELECT COUNT(*) FROM @nameValuePairs)
SELECT e.expressionid
FROM expression e
LEFT JOIN (
SELECT attributeid
FROM attributes a
JOIN @nameValuePairs nvp ON nvp.name = a.Name AND nvp.Value = a.value
) t ON t.attributeid = e.attributeid
GROUP BY e.expressionid
HAVING SUM(CASE WHEN t.attributeid IS NULL THEN (@noOfAttributes + 1) ELSE 1 END) = @noOfAttributes
EDIT: After doing some more evaluation, I found an issue where certain expressions would be included that shouldn't have been. I've modified my query to take that into account.
One error I see is that you have no table with an alias of b, yet you are using: a.attributeId = b.attributeId.
Try fixing that and see if it works, unless I am missing something.
EDIT: I think you just fixed this in your edit, but is it supposed to be a.attributeId = e.attributeId?
This is not a bad approach, depending on the sizes and indexes of the tables, including @nameValuePairs. If the row counts are high or it otherwise becomes slow, you may do better to put @nameValuePairs into a temp table instead, add appropriate indexes, and use a single query instead of two separate ones.
I notice that you are putting columns into #temp that you are not using; it would be faster to exclude them (though it would mean duplicate rows in #temp). Also, your second query has both a DISTINCT and a GROUP BY on the same columns. You don't need both, so I would drop the DISTINCT (it probably won't affect performance, because the optimizer has already figured this out).
Finally, #temp would probably be faster with a clustered non-unique index on expressionId (I am assuming this is SQL 2005). You could add it after the SELECT..INTO, but it is usually as fast or faster to add it before you load. That would require you to CREATE #temp first, add the clustered index, and then load it with INSERT..SELECT instead.
I'll add an example of merging the queries in a minute... OK, here's one way to merge them into a single query (this should be 2000-compatible also):
-- retrieve number of name-value pairs
SET @noOfAttributes = (SELECT COUNT(*) FROM @nameValuePairs)
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select
expressionId
from
(
select distinct
e.expressionId, a.attributeName, a.attributeValue
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
) as Temp
group by expressionId
having count(*) = @noOfAttributes
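This pattern is classic relational division: keep only the expressions whose attributes cover every name/value pair. Here is the merged query's logic run end-to-end with Python's built-in sqlite3 (the table variable becomes an ordinary table, and the attribute data is invented):

```python
import sqlite3

# Expression 10 has exactly {color=red, size=L}; expression 20 only {color=blue}.
# With pairs {color=red, size=L}, only expression 10 should come back.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expressions (expressionId INTEGER, attributeId INTEGER);
    CREATE TABLE attributes (attributeId INTEGER PRIMARY KEY,
                             attributeName TEXT, attributeValue TEXT);
    CREATE TABLE nameValuePairs (name TEXT, value TEXT);
""")
conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                 [(1, "color", "red"), (2, "size", "L"), (3, "color", "blue")])
conn.executemany("INSERT INTO expressions VALUES (?, ?)",
                 [(10, 1), (10, 2), (20, 3)])
conn.executemany("INSERT INTO nameValuePairs VALUES (?, ?)",
                 [("color", "red"), ("size", "L")])

rows = conn.execute("""
    SELECT expressionId FROM (
        SELECT DISTINCT e.expressionId, a.attributeName, a.attributeValue
        FROM expressions e
        JOIN attributes a ON e.attributeId = a.attributeId
        JOIN nameValuePairs nvp
          ON a.attributeName = nvp.name AND a.attributeValue = nvp.value
    ) GROUP BY expressionId
    HAVING COUNT(*) = (SELECT COUNT(*) FROM nameValuePairs)
""").fetchall()

print(rows)  # [(10,)]: only the expression matching both pairs survives
```

Note that, as written, an expression with the required pairs plus extra attributes would also qualify; if "exact match" must exclude extras, you would additionally compare each expression's total attribute count against @noOfAttributes.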