Full text searching scores across multiple columns

I am using full text searching on a SQL Server database to return results from multiple tables. The simplest situation would be searching a person's name fields and a description field. The code I use to do this looks like:
select t.ProjectID as ProjectID, sum(t.rnk) as weightRank
from
(
select KEY_TBL.RANK * 1.0 as rnk, FT_TBL.ProjectID as ProjectID
FROM Projects as FT_TBL
INNER JOIN FREETEXTTABLE(Projects, Description, @SearchText) AS KEY_TBL
ON FT_TBL.ProjectID=KEY_TBL.[KEY]
union all
select KEY_TBL.RANK * 50 as rnk, FT_TBL.ProjectID as ProjectID
FROM Projects as FT_TBL
... <-- complex unimportant join
INNER JOIN People as p on pp.PersonID = p.PersonID
INNER JOIN FREETEXTTABLE(People, (FirstName, LastName), @SearchText) AS KEY_TBL
ON p.PersonID=KEY_TBL.[KEY]
) as t
group by t.ProjectID
As is (hopefully) clear above, I am trying to weight matches on a person's name heavily over matches in a project description field. If I search for something like 'john', all projects with a person named john on them are heavily weighted (as expected). The issue I am having is with searches where someone provides a full name like 'john smith'. In this case the match on name is much weaker because (I presume) only half the search terms match in each of the firstname / lastname columns. In many cases this means someone whose name exactly matches the search text will not necessarily be returned near the top of the search results.
I have been able to correct this by searching each of the firstname / lastname fields separately and adding their scores together so my new query looks like:
select t.ProjectID as ProjectID, sum(t.rnk) as weightRank
from
(
select KEY_TBL.RANK * 1.0 as rnk, FT_TBL.ProjectID as ProjectID
FROM Projects as FT_TBL
INNER JOIN FREETEXTTABLE(Projects, Description, @SearchText) AS KEY_TBL
ON FT_TBL.ProjectID=KEY_TBL.[KEY]
union all
select KEY_TBL.RANK * 50 as rnk, FT_TBL.ProjectID as ProjectID
FROM Projects as FT_TBL
... <-- complex unimportant join
INNER JOIN People as p on pp.PersonID = p.PersonID
INNER JOIN FREETEXTTABLE(People, (FirstName), @SearchText) AS KEY_TBL
ON p.PersonID=KEY_TBL.[KEY]
union all
select KEY_TBL.RANK * 50 as rnk, FT_TBL.ProjectID as ProjectID
FROM Projects as FT_TBL
... <-- complex unimportant join
INNER JOIN People as p on pp.PersonID = p.PersonID
INNER JOIN FREETEXTTABLE(People, (LastName), @SearchText) AS KEY_TBL
ON p.PersonID=KEY_TBL.[KEY]
) as t
group by t.ProjectID
My question:
Is this the approach I should be taking, or is there some way to have the full text search operate on a list of columns as though it were a single blob of text, i.e. treat the firstname and lastname columns as a single name column, resulting in a higher scoring match for strings that include both the person's first and last name?

I have recently run into this and have used a computed column to concatenate the required columns together into one string and then have the full text index on that column.
I have achieved the weighting by duplicating the weighted fields in the computed column.
For example, below the surname appears three times and the forename twice.
ALTER TABLE dbo.person ADD
PrimarySearchColumn AS
COALESCE(NULLIF(forename,'') + ' ' + forename + ' ', '') +
COALESCE(NULLIF(surname,'') + ' ' + surname + ' ' + surname + ' ', '') PERSISTED
You must make sure you use the PERSISTED keyword so that the column isn't computed on each read.

I know this is an old question but I've come across the same issue and solved it a different way.
Rather than add computed columns to the original tables, which may not always be an option, I have created indexed views which contain the combined fields. To use the original example:
CREATE VIEW [dbo].[v_PeopleFullName]
WITH SCHEMABINDING
AS SELECT dbo.People.PersonID, ISNULL(dbo.People.FirstName + ' ', '') + dbo.People.LastName AS FullName
FROM dbo.People
GO
CREATE UNIQUE CLUSTERED INDEX UQ_v_PeopleFullName
ON dbo.[v_PeopleFullName] ([PersonID])
GO
Then I join that view in my query, along with the existing full-text predicate on the individual columns in the base table, so that I can find exact matches and partial matches in the individual columns, like so:
DECLARE @SearchText NVARCHAR(100) = ' "' + @OriginalSearchText + '" ' --For matching exact phrase
DECLARE @SearchTextWords NVARCHAR(100) = ' "' + REPLACE(@OriginalSearchText, ' ', '" OR "') + '" ' --For matching on words in phrase
SELECT FT_TBL.ProjectID as ProjectID,
ISNULL(KEY_TBL.[Rank], 0) + ISNULL(KEY_VIEW.[Rank], 0) AS [Rank]
FROM Projects as FT_TBL
INNER JOIN People as p on FT_TBL.PersonID = p.PersonID
LEFT OUTER JOIN CONTAINSTABLE(People, (FirstName, LastName), @SearchTextWords) AS KEY_TBL ON p.PersonID = KEY_TBL.[KEY]
LEFT OUTER JOIN CONTAINSTABLE(v_PeopleFullName, FullName, @SearchText) AS KEY_VIEW ON p.PersonID = KEY_VIEW.[KEY]
WHERE ISNULL(KEY_TBL.[Rank], 0) + ISNULL(KEY_VIEW.[Rank], 0) > 0
ORDER BY [Rank] DESC
Some notes on this:
I'm using CONTAINSTABLE rather than FREETEXTTABLE as it seems more appropriate to me for searching names. I'm not interested in finding words with similar meaning or inflections of words when it's names that I'm searching on.
Because I'm using CONTAINSTABLE I'm having to do some pre-processing on the @SearchText variable to make it compatible and to break it down into individual words with the OR operator for searching on the base table's full-text index.
Rather than using a UNION query to join separate queries each using a single, joined CONTAINSTABLE I'm joining on both CONTAINSTABLE predicates in the same query. This means using outer joins rather than inner joins, so I'm then using a WHERE clause to exclude any records from the base table which don't match on either full-text index. I confess that I haven't made any examination of how this performs compared to separate queries each with a single full-text index predicate UNIONised to produce a single result set.
Although there's no guarantee that the Rank of matches on the full search text in the indexed view will be higher than that of matches on individual words in the full-text index on the base table's columns, because the Rank value is arbitrary, my testing has shown that in practice it always is (so far!).
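
The pre-processing of the search text described in the notes can also be done in application code before the value reaches the query. Below is a minimal sketch in Python, not part of the original answer; the function name build_contains_terms is hypothetical. Unlike the plain REPLACE above, it splits on any run of whitespace and drops embedded double quotes, which is an assumption about how odd input should be handled.

```python
from typing import Tuple


def build_contains_terms(original_search_text: str) -> Tuple[str, str]:
    """Return (exact_phrase, any_word) search conditions for CONTAINSTABLE.

    Hypothetical helper: mirrors the two DECLARE statements above.
    """
    # Split on any whitespace so repeated spaces do not produce empty "" terms.
    # Double quotes are dropped because they delimit terms in a CONTAINS condition.
    words = [w.replace('"', "") for w in original_search_text.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("search text contains no words")
    exact_phrase = '"' + " ".join(words) + '"'
    any_word = " OR ".join('"' + w + '"' for w in words)
    return exact_phrase, any_word


if __name__ == "__main__":
    phrase, any_word = build_contains_terms("john  smith")
    print(phrase)    # "john smith"
    print(any_word)  # "john" OR "smith"
```

The two strings would then be passed as query parameters in place of @SearchText and @SearchTextWords rather than concatenated into the SQL text.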

Related

SQL Server FTS indexing correct keywords but not returning results with those keywords

I've created a FTS catalog which indexes the Title column of a table called Articles. The word breaker language is set to Dutch.
The article title is "Contactgegevens Wijkteams 2019". My search term is 'contactgegevens', which is Dutch for 'Contact details'. This word could potentially be split into 'contact' and 'gegevens', but I have checked the indexed keywords and the full word has been indexed for the correct table and column.
Search term:
declare @searchTerm nvarchar(100)
select @searchTerm = 'contactgegevens';
Using Freetext:
If I use FREETEXT in the where clause I do find the result but it comes up near the end of around 300 rows. The majority of the rows don't have this word in the title column, not even words close in meaning.
SELECT a.ArticleID, a.Title
FROM Articles a
WHERE
FREETEXT(a.Title, @searchTerm)
Using FreeTextTable:
With FREETEXTTABLE, I get far fewer results, but not one of them contains the keyword 'contactgegevens' or variations of it.
select *
from
freetexttable(Articles, Title, @searchTerm, LANGUAGE N'Dutch', 100) as key_table
inner join
Articles a on a.ArticleID = key_table.[Key]
order by
key_table.RANK desc
Using ContainsTable:
CONTAINSTABLE seems to return very similar results to FREETEXTTABLE.
SELECT key_table.rank, a.*
FROM Articles a
INNER JOIN
CONTAINSTABLE(Articles, Title, @searchTerm, LANGUAGE N'Dutch', 100) AS key_table on key_table.[KEY] = a.ArticleID
ORDER BY
key_table.rank DESC
As mentioned I've checked the indexed keywords using the following query:
select *
from sys.dm_fts_index_keywords(DB_ID('MyDatabase'), OBJECT_ID('Articles'))
where (display_term like 'contactgegevens%') and column_id = 3
order by display_term
and the keyword is indexed for the correct table/column and looking at records close to this result I can see it's indexed other words relevant to the article title I'm looking for.
I'm expecting to be able to do a search for a phrase such as "Contactgegevens Wijkteams 2019" and have the article with that exact title appear at the top of the list, but it doesn't. In some cases it doesn't appear in the search results at all.
What am I missing here?

Turns out it was a simple mistake on my part. My join was using the ArticleID instead of the ArticleVersionID which is what the Catalog uses as its unique key and what is represented by key_table.[KEY].
SELECT key_table.rank, a.*
FROM Articles a
INNER JOIN
CONTAINSTABLE(Articles, Title, @searchTerm, LANGUAGE N'Dutch', 100) AS key_table on key_table.[KEY] = a.ArticleVersionID
ORDER BY
key_table.rank DESC

SQL - join two tables based on up-to-date entries

I have two tables:
1- Table of TestModules
[image: TestModules]
2- Table of TestModule_Results
[image: TestModule_Results]
In order to get the required information for each TestModule, I am using FULL OUTER JOIN and it works fine.
[image: FULL OUTER JOIN result]
But what is required is slightly different. The above picture shows that TestModuleID = 5 is listed twice, and the requirement is to list the 'up-to-date' results based on time 'ChangedAt'
Of course, I can do the following:
SELECT TOP 1 * FROM TestModule_Results
WHERE DeviceID = 'xxx' and TestModuleID = 'yyy'
ORDER BY ChangedAt DESC
But this solution is for a single row and I want to do it in a Stored Procedure.
Expected output should be like:
[image: ExpectedOutput]
Any advice on how I can implement this in a stored procedure?

Use a Common Table Expression and ROW_NUMBER to add a field identifying the newest results, if any, and select just those:
--NOTE: a Common Table Expression requires the previous command
--to be explicitly terminated; prepending a ; covers that
;WITH cteTR as (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY DeviceID, TestModuleID
ORDER BY ChangedAt DESC) AS ResultOrder
FROM TestModule_Results
--cteTR is now just like TestModule_Results but has an
--additional field ResultOrder that is 1 for the newest,
--2 for the second newest, etc. for every unique (DeviceID,TestModuleID) pair
)
SELECT *
FROM TestModules as M --Use INNER JOIN to get only modules with results,
--or LEFT OUTER JOIN to include modules without any results yet
INNER JOIN cteTR as R
ON M.DeviceID = R.DeviceID AND M.TestModuleID = R.TestModuleID
WHERE R.ResultOrder = 1
-- OR R.ResultOrder IS NULL --add if Left Outer Join
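
To try the ROW_NUMBER pattern without a SQL Server instance, here is a small self-contained sketch using Python's built-in sqlite3 module (SQLite 3.25 or later is assumed, for window function support). The Result column, the function name latest_results and the sample rows are made up for illustration; only DeviceID, TestModuleID and ChangedAt come from the question.

```python
import sqlite3


def latest_results(rows):
    """Return the newest row per (DeviceID, TestModuleID) pair.

    rows: iterable of (DeviceID, TestModuleID, Result, ChangedAt) tuples,
    with ChangedAt as an ISO-style string so that text ordering is time ordering.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE TestModule_Results ("
        "DeviceID TEXT, TestModuleID INTEGER, Result TEXT, ChangedAt TEXT)"
    )
    con.executemany("INSERT INTO TestModule_Results VALUES (?, ?, ?, ?)", rows)
    query = """
        WITH cteTR AS (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY DeviceID, TestModuleID
                       ORDER BY ChangedAt DESC
                   ) AS ResultOrder
            FROM TestModule_Results
        )
        SELECT DeviceID, TestModuleID, Result, ChangedAt
        FROM cteTR
        WHERE ResultOrder = 1          -- keep only the newest row per pair
        ORDER BY DeviceID, TestModuleID
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    sample = [
        ("A", 5, "fail", "2019-01-01 10:00:00"),
        ("A", 5, "pass", "2019-01-02 09:00:00"),  # newer row for module 5
        ("A", 6, "pass", "2019-01-01 08:00:00"),
    ]
    for row in latest_results(sample):
        print(row)
```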

You say "this solution is for a single row"? Excellent. Use CROSS APPLY and change the WHERE clause from hand-input literal to the fields of the original table. APPLY operates at row level.
SELECT *
FROM TestModules t
CROSS APPLY
(
SELECT TOP 1 * FROM TestModule_Results
WHERE TestModule_Results.DeviceID = t.DeviceID -- put the connecting fields here
AND TestModule_Results.TestModuleID = t.TestModuleID
ORDER BY ChangedAt DESC
)tr

Creating XML file for tables with multiple indexes and indexes with multiple columns returns duplicate indexes

I am creating an xml file for tables that may have multiple indexes and the indexes may have multiple columns.
My problem is that when I have an index with multiple columns associated with it, the index is written to the xml file multiple times (once per column).
This is my query:
SELECT [IndTbl].IndexName AS "@IndexName", [IndTbl].PrimaryKeyIndex AS "@PrimaryKeyIndex", [IndTbl].IndexDescription AS "@IndexDescription",
[IndTbl].PadIndex as "@PadIndex", [IndTbl].StatisticsNoRecompute as "@Statistics_NoRecompute", [IndTbl].IgnoreDupKey as "@IgnoreDupKey",
[IndTbl].AllowRowLocks as "@AllowRowLocks", [IndTbl].AllowPageLocks as "@AllowPageLocks",
(
SELECT DISTINCT [IndColTbl].IndexColumnName as "@ICName", [IndColTbl].IsDescendingSort as "@IsDescendingSort", [IndColTbl].OrdinalPosition as "@OrdinalPosition"
FROM ( SELECT DISTINCT S4.IndexName, IndexColumnName, IsDescendingSort, OrdinalPosition
FROM #SourceDBObjects S4 JOIN #TableObjectsToAdd T4 ON S4.TableName = T4.TableName AND S4.IndexName = T4.IndexName
) AS [IndColTbl]
WHERE [IndColTbl].IndexName = [IndTbl].IndexName
FOR XML PATH('IndexColumn'),
TYPE
)
FROM (SELECT DISTINCT T3.TableName, T3.IndexName, S3.PrimaryKeyIndex, IndexDescription, PadIndex, StatisticsNoRecompute, IgnoreDupKey, AllowRowLocks, AllowPageLocks
FROM #SourceDBObjects S3 JOIN #TableObjectsToAdd T3 ON S3.TableName = T3.TableName AND S3.IndexName = T3.IndexName AND T3.IndexName IS NOT NULL ) AS [IndTbl]
WHERE [Table].TableName = [IndTbl].TableName
FOR XML PATH('Index'),
TYPE
The temporary Table #SourceDBObjects has all of the Index data.
The temp table #TableObjectsToAdd just has the Index Names to add.
What I need to be able to do is add a DISTINCT in the outside query. But DISTINCT is not allowed when using FOR XML.
So if the Index has 7 columns associated with the index, the Index will be displayed 7 times in the xml file.
I can't use TOP 1 because there might be multiple indexes associated with the table.
If I add the IndexColumnName restriction to the IndexColumn subquery (in the SELECT) then I get the Index 7 times with 1 column associated with the index.
How can I limit the Index to display once for multiple columns without using DISTINCT?

I might have found a solution...
I added GROUP BY to the outside query and it is limiting the number of times the Index is written to the xml file.

Isn't this a little bit too complicated?
This query will get all the information in one go... You can add more data (e.g. from sys.stats) quite easily...
SELECT tables.name AS [@name]
,tables.object_id AS [@object_id]
,(
SELECT indexes.name AS [@name]
,indexes.index_id AS [@id]
,(
SELECT columns.name AS [@name]
,columns.column_id AS [@column_id]
,index_columns.index_column_id AS [@index_column_id]
,index_columns.is_descending_key AS [@is_descending_key]
FROM sys.index_columns
INNER JOIN sys.columns ON index_columns.object_id=columns.object_id AND index_columns.column_id=columns.column_id
WHERE index_columns.index_id=indexes.index_id
AND index_columns.object_id=tables.object_id
FOR XML PATH('index_column'),TYPE
) AS [index_columns]
FROM sys.indexes
WHERE indexes.object_id=tables.object_id
FOR XML PATH('index'),TYPE
) AS [indexes]
FROM sys.tables WHERE tables.type='U'
FOR XML PATH('table'),ROOT('DataBase')
UPDATE: Even simpler...
FOR XML RAW will put all columns with their names as attributes into the XML (unless you specify ELEMENTS). With SELECT * you'd get all information without much typing...
SELECT tables.*
,(
SELECT indexes.*
,(
SELECT columns.name AS column_name
,index_columns.*
FROM sys.index_columns
INNER JOIN sys.columns ON index_columns.object_id=columns.object_id AND index_columns.column_id=columns.column_id
WHERE index_columns.index_id=indexes.index_id
AND index_columns.object_id=tables.object_id
FOR XML RAW('index_column'),TYPE
) AS [index_columns]
FROM sys.indexes
WHERE indexes.object_id=tables.object_id
FOR XML RAW('index'),TYPE
) AS [indexes]
FROM sys.tables WHERE tables.type='U'
FOR XML RAW('table'),ROOT('DataBase')

Want to Avoid Sorting Full-Text Search Results

I'm using the following SQL Server query, which searches a full-text index and appears to be working correctly. Some additional work is included so the query works with paging.
However, my understanding is that full-text searches return results sorted according to ranking, which would be nice.
But I get an error if I remove the OVER clause near the top. Can anyone tell me how this query could be modified so that it does not re-sort the results?
DECLARE @StartRow int;
DECLARE @MaxRows int;
SET @StartRow = 0;
SET @MaxRows = 10;
WITH ArtTemp AS
(SELECT TOP (@StartRow + @MaxRows) ROW_NUMBER() OVER (ORDER BY ArtViews DESC) AS RowID,
Article.ArtID,Article.ArtTitle,Article.ArtSlug,Category.CatID,Category.CatTitle,
Article.ArtDescription,Article.ArtCreated,Article.ArtUpdated,Article.ArtUserID,
[User].UsrDisplayName AS UserName
FROM Article
INNER JOIN Subcategory ON Article.ArtSubcategoryID = Subcategory.SubID
INNER JOIN Category ON Subcategory.SubCatID = Category.CatID
INNER JOIN [User] ON Article.ArtUserID = [User].UsrID
WHERE CONTAINS(Article.*,'FORMSOF(INFLECTIONAL,"htmltag")'))
SELECT ArtID,ArtTitle,ArtSlug,CatID,CatTitle,ArtDescription,ArtCreated,ArtUpdated,
ArtUserID,UserName
FROM ArtTemp
WHERE RowID BETWEEN @StartRow + 1 AND (@StartRow + @MaxRows)
ORDER BY RowID
Thanks.

I'm really not an expert in FTS but hopefully this helps get you started.
First, ROW_NUMBER requires OVER (ORDER BY xxx) in SQL Server. Even if you tell it to order by a constant value, it still might end up rearranging the results. So, if you depend on row numbering to handle your pagination, you're stuck with some kind of sorting.
When I dig around on FTS for that "return results sorted according to ranking" bit, I find a couple articles that describe ordering by rank. In a nutshell, they say that RANK is a column explicitly returned by CONTAINSTABLE. So if you can't find a way to dig out the results ranking from CONTAINS, you might try joining against CONTAINSTABLE instead and use the RANK column explicitly as your order by value with ROW_NUMBER. Example (syntax may be a little off):
SELECT TOP (@StartRow + @MaxRows)
ROW_NUMBER() OVER (ORDER BY MyFTS.RANK DESC) AS RowID,
Article.ArtID,Article.ArtTitle,Article.ArtSlug,Category.CatID,Category.CatTitle,
Article.ArtDescription,Article.ArtCreated,Article.ArtUpdated,Article.ArtUserID,
[User].UsrDisplayName AS UserName
FROM Article
INNER JOIN Subcategory ON Article.ArtSubcategoryID = Subcategory.SubID
INNER JOIN Category ON Subcategory.SubCatID = Category.CatID
INNER JOIN [User] ON Article.ArtUserID = [User].UsrID
INNER JOIN CONTAINSTABLE(Article, *, 'FORMSOF(INFLECTIONAL,"htmltag")') AS MyFTS
ON Article.ArtID = MyFTS.[KEY] -- assumes ArtID is the full-text key column
The end result is that you're still sorting, but you're doing so on your rankings.
Also, the MSDN page says that CONTAINSTABLE has an ability to limit results on a TOP N basis, too. Maybe this would also be of use to you.
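
The idea of numbering rows by the full-text rank and then slicing out one page can be tried in isolation. The sketch below uses Python's built-in sqlite3 module with a hypothetical Hits table standing in for the CONTAINSTABLE output (SQLite 3.25 or later assumed). The FtsRank column name, the function name page_by_rank and the ArtID tie-breaker are assumptions, the latter added so that paging is deterministic when ranks are equal.

```python
import sqlite3


def page_by_rank(hits, start_row, max_rows):
    """Return the ArtIDs for one page, highest rank first.

    hits: iterable of (ArtID, FtsRank) tuples, a stand-in for CONTAINSTABLE rows.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Hits (ArtID INTEGER PRIMARY KEY, FtsRank INTEGER)")
    con.executemany("INSERT INTO Hits VALUES (?, ?)", hits)
    query = """
        WITH ArtTemp AS (
            SELECT ArtID,
                   ROW_NUMBER() OVER (ORDER BY FtsRank DESC, ArtID) AS RowID
            FROM Hits
        )
        SELECT ArtID
        FROM ArtTemp
        WHERE RowID BETWEEN ? AND ?
        ORDER BY RowID
    """
    first_row = start_row + 1          # row numbers are 1-based
    last_row = start_row + max_rows
    page = [row[0] for row in con.execute(query, (first_row, last_row))]
    con.close()
    return page


if __name__ == "__main__":
    sample_hits = [(1, 10), (2, 50), (3, 30), (4, 50)]
    print(page_by_rank(sample_hits, 0, 2))  # [2, 4]
    print(page_by_rank(sample_hits, 2, 2))  # [3, 1]
```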

T-SQL filtering on dynamic name-value pairs

I'll describe what I am trying to achieve:
I am passing an XML document with name-value pairs down to a stored procedure, where I put them into a table variable, let's say @nameValuePairs.
I need to retrieve a list of IDs for expressions (a table) that have exactly those name-value pairs (attributes, another table) associated with them.
This is my schema:
Expressions table --> (expressionId, attributeId)
Attributes table --> (attributeId, attributeName, attributeValue)
After trying complicated stuff with dynamic SQL and evil cursors (which works but it's painfully slow) this is what I've got now:
--do the magic plz!
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
select distinct
e.expressionId, a.attributeName, a.attributeValue
into
#temp
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
group by
e.expressionId, a.attributeName, a.attributeValue
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select distinct
expressionId
from
#temp
group by expressionId
having count(*) = @noOfAttributes
Can people please review and see if they can spot any problems? Is there a better way of doing this?
Any help appreciated!

I believe that this would satisfy the requirement you're trying to meet. I'm not sure how much prettier it is, but it should work and wouldn't require a temp table:
SET @noOfAttributes = (select count(*) from @nameValuePairs)
SELECT e.expressionid
FROM expressions e
LEFT JOIN (
SELECT attributeid
FROM attributes a
JOIN @nameValuePairs nvp ON nvp.name = a.attributeName AND nvp.value = a.attributeValue
) t ON t.attributeid = e.attributeid
GROUP BY e.expressionid
HAVING SUM(CASE WHEN t.attributeid IS NULL THEN (@noOfAttributes + 1) ELSE 1 END) = @noOfAttributes
EDIT: After doing some more evaluation, I found an issue where certain expressions would be included that shouldn't have been. I've modified my query to take that into account.

One error I see is that you have no table with an alias of b, yet you are using: a.attributeId = b.attributeId.
Try fixing that and see if it works, unless I am missing something.
EDIT: I think you just fixed this in your edit, but is it supposed to be a.attributeId = e.attributeId?

This is not a bad approach, depending on the sizes and indexes of the tables, including @nameValuePairs. If these row counts are high or it otherwise becomes slow, you may do better to put @nameValuePairs into a temp table instead, add appropriate indexes, and use a single query instead of two separate ones.
I do notice that you are putting columns into #temp that you are not using; it would be faster to exclude them (though it would mean duplicate rows in #temp). Also, your second query has both a "distinct" and a "group by" on the same columns. You don't need both, so I would drop the "distinct" (this probably won't affect performance, because the optimizer has already figured this out).
Finally, #temp would probably be faster with a clustered non-unique index on expressionid (I am assuming that this is SQL 2005). You could add it after the SELECT..INTO, but it is usually as fast or faster to add it before you load. This would require you to CREATE #temp first, add the clustered index and then use INSERT..SELECT to load it instead.
I'll add an example of merging the queries in a minute... Ok, here's one way to merge them into a single query (this should be 2000-compatible also):
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select
expressionId
from
(
select distinct
e.expressionId, a.attributeName, a.attributeValue
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
) as Temp
group by expressionId
having count(*) = @noOfAttributes
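
The counting logic of the merged query can be checked on a tiny data set. This is a sketch using Python's built-in sqlite3 module rather than SQL Server; the table variable is replaced by an ordinary table named nameValuePairs, the function name matching_expressions is hypothetical, and the sample data is invented.

```python
import sqlite3


def matching_expressions(expressions, attributes, name_value_pairs):
    """Return expressionIds whose attributes cover every requested name-value pair."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE expressions (expressionId INTEGER, attributeId INTEGER);
        CREATE TABLE attributes (attributeId INTEGER, attributeName TEXT, attributeValue TEXT);
        CREATE TABLE nameValuePairs (name TEXT, value TEXT);
    """)
    con.executemany("INSERT INTO expressions VALUES (?, ?)", expressions)
    con.executemany("INSERT INTO attributes VALUES (?, ?, ?)", attributes)
    con.executemany("INSERT INTO nameValuePairs VALUES (?, ?)", name_value_pairs)
    query = """
        SELECT expressionId
        FROM (
            -- one row per (expression, matched pair)
            SELECT DISTINCT e.expressionId, a.attributeName, a.attributeValue
            FROM expressions e
            JOIN attributes a ON e.attributeId = a.attributeId
            JOIN nameValuePairs nvp
              ON a.attributeName = nvp.name AND a.attributeValue = nvp.value
        ) AS Temp
        GROUP BY expressionId
        -- keep expressions that matched as many pairs as were requested
        HAVING COUNT(*) = (SELECT COUNT(*) FROM nameValuePairs)
        ORDER BY expressionId
    """
    ids = [row[0] for row in con.execute(query)]
    con.close()
    return ids


if __name__ == "__main__":
    attrs = [(1, "color", "red"), (2, "size", "L"), (3, "color", "blue")]
    exprs = [(10, 1), (10, 2), (11, 1), (12, 1), (12, 2), (12, 3)]
    pairs = [("color", "red"), ("size", "L")]
    print(matching_expressions(exprs, attrs, pairs))  # [10, 12]
```

Note that an expression carrying additional attributes beyond the requested pairs (expression 12 in the sample) still qualifies, because only matching rows are counted.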
