Full-text Search on documents and related data mssql

I'm currently building a knowledge base app and am unsure of the best way to store and index the document information.
When uploading a document, the user selects a number of options from dropdown lists (such as category, topic, area; note that not all of these are mandatory) and also enters some keywords and a description of the document. At the moment, the selected category (and the others) is stored as a foreign key in the documents table, using the id from the categories table.
We want to be able to run a FREETEXTTABLE or CONTAINSTABLE query not only on the varchar(max) column that holds the document, but also on the category name, topic name, area name, etc.
I looked at creating an indexed view, but this wasn't possible because of the LEFT JOIN against the category column. I'm not sure how to go about this, so any ideas would be most appreciated.

I assume that you want to AND the two searches together. For example, find all documents containing the text "foo" AND in the category "Automotive Repair".
Perhaps you don't need to full text the additional data and can just use = or like? If the additional data is reasonably small it may not warrant the complication of full text.
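As a minimal sketch of that simpler option, the full-text search can stay on the document column while the category is filtered with a plain equality. The table and column names (documents, Id, text, CategoryId, Categories, Category) are assumptions taken from the rough code later in this answer, not a confirmed schema.

```sql
-- Full-text search on the document body only; the category is matched with "="
-- Assumes documents.Id is the full-text key column
SELECT d.Id, ft.[RANK]
FROM CONTAINSTABLE(documents, ([text]), '"foo*"') AS ft
INNER JOIN documents d ON d.Id = ft.[KEY]
INNER JOIN Categories c ON c.Id = d.CategoryId
WHERE c.Category = 'Automotive Repair'
ORDER BY ft.[RANK] DESC;
```

Because the category is not mandatory, an INNER JOIN here drops documents without a category; that is usually what you want when the user filters by category.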
However, if you want to use full text on both, use a stored procedure that pulls the results together for you. The trick here is to stage the results rather than trying to get a result set back straight away.
This is a rough starting point.
-- a staging table variable for the document results
declare @documentResults table (
Id int,
[Rank] int
)
insert into @documentResults
select d.Id, results.[rank]
from containstable (documents, ([text]), '"foo*"') results
inner join documents d on results.[key] = d.Id
-- now you have all of the primary keys that match the search criteria
-- whittle this list down to only include keys that are in the correct categories
-- a staging table variable for each of the metadata results
declare @categories table (
Id int
)
insert into @categories
select results.[KEY]
from containstable (Categories, (Category), '"Automotive Repair*"') results
declare @topics table (
Id int
)
insert into @topics
select results.[KEY]
from containstable (Topics, (Topic), '"Automotive Repair*"') results
declare @areas table (
Id int
)
insert into @areas
select results.[KEY]
from containstable (Areas, (Area), '"Automotive Repair*"') results
-- the staging tables hold only the matching ids, so join back to the
-- lookup tables for the names (assuming each lookup table's key column is Id)
select d.[text], c.Category, t.Topic, a.Area
from @documentResults r
inner join documents d on d.Id = r.Id
inner join @categories cr on cr.Id = d.CategoryId
inner join Categories c on c.Id = cr.Id
inner join @topics tr on tr.Id = d.TopicId
inner join Topics t on t.Id = tr.Id
inner join @areas ar on ar.Id = d.AreaId
inner join Areas a on a.Id = ar.Id

You could create a new column for your full text index which would contain the original document plus the categories appended as metadata. Then a search on that column could search both the document and the categories simultaneously. You'd need to invent a tagging system that would keep them unique within your document yet the tags would not be likely to be used as search phrases themselves. Perhaps something like:
This is my regular document text. <FTCategory: Automotive Repair> <FTCategory: Transmissions>
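One way the combined column might be populated is sketched below. SearchText is a hypothetical column added to the documents table for this purpose, and the lookup table and column names are assumptions borrowed from the other answer's code. The LEFT JOINs and ISNULL calls keep documents whose optional metadata is missing.

```sql
-- Hypothetical: documents.SearchText is a new varchar(max) column that will
-- carry the full-text index instead of the original text column
UPDATE d
SET SearchText = d.[text]
    + ISNULL(' <FTCategory: ' + c.Category + '>', '')  -- NULL + string is NULL, so a missing category adds nothing
    + ISNULL(' <FTTopic: ' + t.Topic + '>', '')
    + ISNULL(' <FTArea: ' + a.Area + '>', '')
FROM documents d
LEFT JOIN Categories c ON c.Id = d.CategoryId
LEFT JOIN Topics t ON t.Id = d.TopicId
LEFT JOIN Areas a ON a.Id = d.AreaId;
```

The column has to be refreshed whenever a document or one of its lookup rows changes, for example from the upload code path or a scheduled job.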

Related

SQL Server: Selecting Value That was met in the Where Criteria

My query selects records where given item types were ordered and I would like it to return a column that has the value for the criteria which a given record has met.
For example (since the above explanation is probably confusing):
DECLARE @Item1 VARCHAR(8) = 'Red Shoes',
@Item2 VARCHAR(8) = 'Brown Belt',
@Item3 VARCHAR(8) = 'Blue Shoes',
@Item4 VARCHAR(8) = 'Black Belt'
SELECT DISTINCT ord.Order_number,
ord.Item_number,
ord.Item_type,
ord.Item_desc,
link.Item_number AS linked_item_number
FROM Ordertbl ord
LEFT JOIN Item_tables link
ON link.item_number = ord.item_number
WHERE link.Item_number IN (@Item1,@Item2,@Item3,@Item4) AND
ord.Item_number NOT IN (@Item1,@Item2,@Item3,@Item4)
Desired Outcome: All items that were ordered whenever Item1,2,3, or 4 were ordered and, for each record, a field that depicts what item (1,2,3, or 4) was the source for that record being returned.
Using multiple Union queries with where criteria set to a single item provides the desired outcome if I set the linked_item_number field to the queried item, but that method is less than ideal because, at times, large numbers of items may be queried.

Edited: I've updated my answer a bit and expanded on a few areas using my best guesses; hopefully they help illuminate the points I'm making.
Using NOT IN in a WHERE clause can be bad for performance, and it returns no rows at all if the list contains a NULL. It would be better to convert your filters into tables, and then JOIN them to your order table.
But before we get to that, let's make a few DDL assumptions here that will help keep things clear. Let's say you have two tables like the following:
CREATE TABLE Ordertbl
(
Order_number INT
,Item_number INT
--you might have more columns in your table
)
CREATE TABLE Item_tables
(
Item_number INT
,Item_type INT
,Item_desc VARCHAR(20) --must be wide enough to hold values such as 'Brown Belt'
--again, you might have more columns in your table
)
I'm also going to assume that the details about an item are in Item_tables and not in Ordertbl, because that makes the most sense for a database design to me.
My original answer had the following block of text next:
In this scenario, you'd need two additional tables, one for the list
of Items in where Item_number in (@Item1,@Item2,@Item3,@Item4) which
would have the corresponding Item_numbers. The other table would be
the list of Subjs in Item_number Not in
(@Subj1,@Subj2,@Subj3,@Subj4,@Subj5,@Subj6,@Subj7), again, including
the Item_number for those Subj records.
The question has been updated so that the WHERE clause is different than it was when I wrote the original version of my answer. The design pattern still applies here, even if the variables being used are different.
So let's create a temp table to hold all of our "triggering" items, and then populate it.
-- VARCHAR(8) would silently truncate values such as 'Brown Belt', so use a wider type
DECLARE @Item1 VARCHAR(20) = 'Red Shoes',
@Item2 VARCHAR(20) = 'Brown Belt',
@Item3 VARCHAR(20) = 'Blue Shoes',
@Item4 VARCHAR(20) = 'Black Belt'
CREATE TABLE #TriggeringItems
(
Item_desc VARCHAR(20)
)
INSERT INTO #TriggeringItems
(
Item_desc
)
SELECT @Item1
UNION
SELECT @Item2
UNION
SELECT @Item3
UNION
SELECT @Item4
If you had more filter variables to add, you could keep UNIONing them onto the INSERT.
So now we have our temp table and we can filter our query. Great!
...right?
Well, not quite. Our input parameters are descriptions, and our foreign key is an INT, which means we'll have to do a few extra joins to get our key values into the query. But the general idea is that you'd use an INNER JOIN to replace WHERE ... IN ..., and a LEFT JOIN to replace the WHERE ... NOT IN ... (adding a WHERE X IS NULL clause, where X is the key of the LEFT JOINed table)
If you didn't care about getting the triggering items back in your SELECT, then you could just go ahead with replacing the WHERE ... IN ... with an INNER JOIN, and that would be the end of it. But let's say you only wanted the list of items that WEREN'T the triggering items. For that, you would need to join Ordertbl to itself, to get the list of Order_numbers with triggering items within them. Then you could INNER JOIN one side to the temp table, and LEFT JOIN the other half. I know my explanation might be hard to follow, so let me show you what I mean in code:
SELECT DISTINCT onums.Order_number,
orditems.Item_number,
orditems.Item_type,
orditems.Item_desc,
tinums.Item_number AS linked_item_number
FROM #TriggeringItems ti
INNER JOIN Item_tables tinums ON ti.Item_desc = tinums.Item_desc
INNER JOIN Ordertbl onums ON tinums.Item_number = onums.Item_number
INNER JOIN Ordertbl ord ON onums.Order_number = ord.Order_number
INNER JOIN Item_tables orditems ON ord.Item_number = orditems.Item_number
LEFT JOIN #TriggeringItems excl ON orditems.Item_desc = excl.Item_desc
WHERE excl.Item_desc IS NULL
onums is our list of order numbers, but ord is where we're going to pull our items from. But we only want to return items that aren't triggers, so we LEFT JOIN our temp table at the end and add the WHERE excl.Item_desc IS NULL.
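To try the pattern without touching real tables, here is a small self-contained sketch with made-up rows. The #Demo... table names and the sample data are hypothetical, and NOT EXISTS stands in for the LEFT JOIN ... IS NULL exclusion; both express the same anti-join. The multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data; column names follow the answer above
CREATE TABLE #DemoItems (Item_number INT, Item_type INT, Item_desc VARCHAR(20));
CREATE TABLE #DemoOrders (Order_number INT, Item_number INT);
CREATE TABLE #DemoTriggers (Item_desc VARCHAR(20));

INSERT INTO #DemoItems VALUES (1, 10, 'Red Shoes'), (2, 20, 'Brown Belt'), (3, 30, 'Wool Socks');
INSERT INTO #DemoOrders VALUES (100, 1), (100, 3), (200, 3);  -- order 200 has no triggering item
INSERT INTO #DemoTriggers VALUES ('Red Shoes'), ('Brown Belt');

SELECT DISTINCT ord.Order_number,
       orditems.Item_desc,
       tinums.Item_desc AS triggered_by
FROM #DemoTriggers ti
INNER JOIN #DemoItems tinums ON tinums.Item_desc = ti.Item_desc            -- resolve trigger descriptions to keys
INNER JOIN #DemoOrders onums ON onums.Item_number = tinums.Item_number     -- orders containing a trigger
INNER JOIN #DemoOrders ord ON ord.Order_number = onums.Order_number        -- every line of those orders
INNER JOIN #DemoItems orditems ON orditems.Item_number = ord.Item_number
WHERE NOT EXISTS (SELECT 1 FROM #DemoTriggers excl                         -- drop the triggers themselves
                  WHERE excl.Item_desc = orditems.Item_desc);

DROP TABLE #DemoItems, #DemoOrders, #DemoTriggers;
```

With these rows the query returns a single line: order 100, 'Wool Socks', triggered by 'Red Shoes'.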

Find Child with Parent having specific information

I am trying to find children whose parent have some specific information from different relational tables.
I have four tables as shown below
Search criteria: get all "Section" levels whose parent is an "Inventory" level and that have an attached user whose name contains the letter 'a' and whose role is 'employee' (please see the LevelsUser table for the relation).
I tried a CTE (common table expression) approach to find the correct Section level, but there I have to pass the level Id as a hard-coded value and cannot search all Sections in the table.
WITH LevelsTree AS
(
SELECT Id, ParentLevelId, Level
FROM Levels
WHERE Level='Section' -- here I need to pass a value
UNION ALL
SELECT ls.Id, ls.ParentLevelId, ls.Level
FROM Levels ls
JOIN LevelsTree lt ON ls.Id = lt.ParentLevelId
)
SELECT * FROM LevelsTree
I need to find all sections match the above criteria.
Please help me here.

For hierarchical checks you need to select from and then join to the same table Levels. So something like this should help you:
declare @parentLevelName varchar(20) = 'Inventory';
with cte as (
select distinct
l1.id,
l1.Level
from Levels l1
join Levels l2 on l2.id=l1.ParentLevelId
and l2.Level = @parentLevelName -- use variable instead of hardcoded `Inventory`
where l1.Level='Section' -- replace `Section` with @var containing your value
) select * from cte
join LevelUsers lu on lu.LevelId=cte.id
join Users u on u.Id = lu.UserId
and u.UserName like '%a%' -- this letter check is not efficient
join Role r on r.id=lu.RoleId and r.Role='employee'
Note that the above query selects data only from the 4 tables which you have described in the DB schema. However, your original query contains a reference to the HierarchyPosition table, which you haven't described. If you really need to include the HierarchyPosition reference, then specify how it relates to the other 4 tables.
Also note that the condition and u.UserName like '%a%', used to satisfy your requirement of a user name containing the letter 'a', is not efficient because the leading % prevents an index seek. If possible, consider changing the requirement to a user name that starts with the letter 'a'. Then and u.UserName like 'a%' can use an index on Users.UserName if one exists.
HTH

Search in multiple tables with Full-Text

I'm trying to make a detailed search with asp and SQL Server Full-text.
When a keyword submitted, I need to search in multiple tables. For example,
Table - Members
member_id
contact_name
Table - Education
member_id
school_name
My query;
select mem.member_id, mem.contact_name, edu.member_id, edu.school_name from Members mem FULL OUTER JOIN Education edu on edu.member_id=mem.member_id where CONTAINS (mem.contact_name, '""*"&keyword&"*""') or CONTAINS (edu.school_name, '""*"&keyword&"*""') order by mem.member_id desc;
This query works but it takes really long time to execute.
Imagine that the keyword is Phill: if mem.contact_name matches, list it; or if edu.school_name matches, list the members whose education matches the keyword.
I hope I could explain well :) Sorry for my english though.

Perhaps try an indexed view containing the merged dataset- you can add the fulltext index there instead of the individual tables, and it's further extensible to as many tables as you need down the line. Only trick, of course, is the space...

This is what I would do for my multi table full text search.
It's not exact, but it gives the basic idea. The key thing is to give each table its own CONTAINS predicate and combine them with OR.
DECLARE @SearchTerm NVARCHAR(250)
SET @SearchTerm = '"Texas*"'
SELECT * FROM table1
JOIN table2 on table1.Id = table2.FKID
WHERE (
(@SearchTerm = '""') OR
CONTAINS((table1.column1, table1.column2, table1.column3), @SearchTerm) OR
CONTAINS((table2.column1, table2.column2), @SearchTerm)
)

Couple of points I don't understand that will be affecting your speed.
Do you really need a full outer join? That's killing you. It looks like these tables are one to one. In that case can't you make it an inner join?
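A CONTAINS column list can only name columns from a single table, so give each table its own predicate, and put the prefix term in double quotes (a leading wildcard is not supported):
SELECT mem.member_id,
mem.contact_name,
edu.member_id,
edu.school_name
FROM members mem
INNER JOIN education edu ON edu.member_id = mem.member_id
WHERE CONTAINS(mem.contact_name, '"keyword*"')
OR CONTAINS(edu.school_name, '"keyword*"')
ORDER BY mem.member_id DESC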
Further info about contains.

T-SQL filtering on dynamic name-value pairs

I'll describe what I am trying to achieve:
I am passing an XML document with name-value pairs down to a stored procedure, which I put into a table variable, let's say @nameValuePairs.
I need to retrieve a list of IDs for expressions (a table) with those exact match of name-value pairs (attributes, another table) associated.
This is my schema:
Expressions table --> (expressionId, attributeId)
Attributes table --> (attributeId, attributeName, attributeValue)
After trying complicated stuff with dynamic SQL and evil cursors (which works but it's painfully slow) this is what I've got now:
--do the magic plz!
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
select distinct
e.expressionId, a.attributeName, a.attributeValue
into
#temp
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
group by
e.expressionId, a.attributeName, a.attributeValue
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select distinct
expressionId
from
#temp
group by expressionId
having count(*) = @noOfAttributes
Can people please review and see if they can spot any problems? Is there a better way of doing this?
Any help appreciated!

I believe that this would satisfy the requirement you're trying to meet. I'm not sure how much prettier it is, but it should work and wouldn't require a temp table:
SET @noOfAttributes = (select count(*) from @nameValuePairs)
SELECT e.expressionid
FROM expressions e
LEFT JOIN (
SELECT a.attributeid
FROM attributes a
JOIN @nameValuePairs nvp ON nvp.name = a.attributeName AND nvp.value = a.attributeValue
) t ON t.attributeid = e.attributeid
GROUP BY e.expressionid
-- a non-matching attribute adds more than the total, so such expressions can never hit the target sum
HAVING SUM(CASE WHEN t.attributeid IS NULL THEN (@noOfAttributes + 1) ELSE 1 END) = @noOfAttributes
EDIT: After doing some more evaluation, I found an issue where certain expressions would be included that shouldn't have been. I've modified my query to take that in to account.

One error I see is that you have no table with an alias of b, yet you are using: a.attributeId = b.attributeId.
Try fixing that and see if it works, unless I am missing something.
EDIT: I think you just fixed this in your edit, but is it supposed to be a.attributeId = e.attributeId?

This is not a bad approach, depending on the sizes and indexes of the tables, including @nameValuePairs. If these row counts are high or it otherwise becomes slow, you may do better to put @nameValuePairs into a temp table instead, add appropriate indexes, and use a single query instead of two separate ones.
I do notice that you are putting columns into #temp that you are not using; it would be faster to exclude them (though it would mean duplicate rows in #temp). Also, your second query has both a "distinct" and a "group by" on the same columns. You don't need both, so I would drop the "distinct" (it probably won't affect performance, because the optimizer already figured this out).
Finally, #temp would probably be faster with a clustered non-unique index on expressionid (I am assuming that this is SQL 2005). You could add it after the SELECT..INTO, but it is usually as fast or faster to add it before you load. This would require you to CREATE #temp first, add the clustered index, and then use INSERT..SELECT to load it instead.
I'll add an example of merging the queries in a minute... OK, here's one way to merge them into a single query (this should be 2000-compatible also):
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select
expressionId
from
(
select distinct
e.expressionId, a.attributeName, a.attributeValue
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
) as Temp
group by expressionId
having count(*) = @noOfAttributes
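To check the counting logic on a small case, here is a self-contained sketch with made-up rows. The #Demo... tables and the sample data are hypothetical; the column names follow the question's schema. The inline DECLARE initialiser and multi-row VALUES assume SQL Server 2008 or later.

```sql
-- Hypothetical sample data
DECLARE @nameValuePairs TABLE (name VARCHAR(50), value VARCHAR(50));
INSERT INTO @nameValuePairs VALUES ('color', 'red'), ('size', 'L');

CREATE TABLE #DemoAttributes (attributeId INT, attributeName VARCHAR(50), attributeValue VARCHAR(50));
CREATE TABLE #DemoExpressions (expressionId INT, attributeId INT);
INSERT INTO #DemoAttributes VALUES (1, 'color', 'red'), (2, 'size', 'L'), (3, 'size', 'M');
INSERT INTO #DemoExpressions VALUES (10, 1), (10, 2),   -- expression 10: color=red, size=L
                                    (20, 1), (20, 3);   -- expression 20: color=red, size=M

DECLARE @noOfAttributes INT = (SELECT COUNT(*) FROM @nameValuePairs);

-- Keep only expressions that match every requested pair
SELECT e.expressionId
FROM #DemoExpressions e
JOIN #DemoAttributes a ON a.attributeId = e.attributeId
JOIN @nameValuePairs nvp ON nvp.name = a.attributeName AND nvp.value = a.attributeValue
GROUP BY e.expressionId
HAVING COUNT(DISTINCT a.attributeId) = @noOfAttributes;

DROP TABLE #DemoAttributes, #DemoExpressions;
```

Only expression 10 is returned, since expression 20 matches just one of the two pairs. Note that this form also accepts expressions that carry extra attributes beyond the requested ones; if those must be rejected, an additional check on the expression's total attribute count is needed.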

Using Full-Text Search in SQL Server 2008 across multiple tables, columns

I need to search across multiple columns from two tables in my database using Full-Text Search. The two tables in question have the relevant columns full-text indexed.
The reason I'm opting for Full-text search:
1. To be able to search accented words easily (cafè)
2. To be able to rank according to word proximity, etc.
3. "Did you mean XXX?" functionality
Here is a dummy table structure, to illustrate the challenge:
Table Book
BookID
Name (Full-text indexed)
Notes (Full-text indexed)
Table Shelf
ShelfID
BookID
Table ShelfAuthor
AuthorID
ShelfID
Table Author
AuthorID
Name (Full-text indexed)
I need to search across Book Name, Book Notes and Author Name.
I know of two ways to accomplish this:
Using a Full-text Indexed View: This would have been my preferred method, but I can't do this because for a view to be full-text indexed, it needs to be schemabound, not have any outer joins, have a unique index. The view I will need to get my data does not satisfy these constraints (it contains many other joined tables I need to get data from).
Using joins in a stored procedure: The problem with this approach is that I need to have the results sorted by rank. If I am making multiple joins across the tables, SQL Server won't search across multiple fields by default. I can combine two individual CONTAINS queries on the two linked tables, but I don't know of a way to extract the combined rank from the two search queries. For example, if I search for 'Arthur', the results of both the Book query and the Author query should be taken into account and weighted accordingly.

Using FREETEXTTABLE, you just need to design some algorithm to calculate the merged rank on each joined table result. The example below skews the result towards hits from the book table.
-- ISNULL keeps the rank from becoming NULL when the LEFT JOIN finds no author hit
SELECT b.Name, a.Name, bkt.[Rank] + ISNULL(akt.[Rank], 0)/2 AS [Rank]
FROM Book b
INNER JOIN Author a ON b.AuthorID = a.AuthorID
INNER JOIN FREETEXTTABLE(Book, Name, @criteria) bkt ON b.ContentID = bkt.[Key]
LEFT JOIN FREETEXTTABLE(Author, Name, @criteria) akt ON a.AuthorID = akt.[Key]
ORDER BY [Rank] DESC
Note that I simplified your schema for this example.
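A variant of the same idea, sketched against the unsimplified schema from the question, collects hits from both tables with UNION ALL and sums the weighted ranks per book, so a book can be found through its author alone. It assumes BookID and AuthorID are the full-text key columns; the half weight on author hits is an arbitrary choice to tune.

```sql
DECLARE @criteria NVARCHAR(200) = N'Arthur';

SELECT b.BookID, b.Name, SUM(hits.WeightedRank) AS CombinedRank
FROM (
    -- hits on the book's own columns, full weight
    SELECT bkt.[KEY] AS BookID, bkt.[RANK] AS WeightedRank
    FROM FREETEXTTABLE(Book, (Name, Notes), @criteria) AS bkt
    UNION ALL
    -- hits on author names, mapped back to books through the link tables, half weight
    SELECT s.BookID, akt.[RANK] / 2
    FROM FREETEXTTABLE(Author, Name, @criteria) AS akt
    INNER JOIN ShelfAuthor sa ON sa.AuthorID = akt.[KEY]
    INNER JOIN Shelf s ON s.ShelfID = sa.ShelfID
) AS hits
INNER JOIN Book b ON b.BookID = hits.BookID
GROUP BY b.BookID, b.Name
ORDER BY CombinedRank DESC;
```

Ranks from different full-text indexes are not directly comparable, so the weights are a heuristic rather than a calibrated score.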

I had the same problem as you but it actually involved 10 tables (a Users table and several others for information)
My first query used FREETEXT in the WHERE clause for each table, but it took far too long.
I then saw several replies about using FREETEXTTABLE instead and checking for non-null values in the key column for each table, but that also took too long to execute.
I fixed it by using a combination of FREETEXTTABLE and UNION selects:
SELECT Users.* FROM Users INNER JOIN
(SELECT Users.UserId FROM Users INNER JOIN FREETEXTTABLE(Users, (column1, column2), @variableWithSearchTerm) UsersFT ON Users.UserId = UsersFT.[key]
UNION
SELECT Table1.UserId FROM Table1 INNER JOIN FREETEXTTABLE(Table1, TextColumn, @variableWithSearchTerm) Table1FT ON Table1.UserId = Table1FT.[key]
UNION
SELECT Table2.UserId FROM Table2 INNER JOIN FREETEXTTABLE(Table2, TextColumn, @variableWithSearchTerm) Table2FT ON Table2.UserId = Table2FT.[key]
... --same for all tables
) fts ON Users.UserId = fts.UserId
This proved to be much faster.
I hope it helps.

I don't think the accepted answer will solve the problem. If you try to find all the books from a certain author and, therefore, use the author's name (or part of it) as the search criteria, the only books returned by the query will be those which have the search criteria in its own name.
The only way I see around this problem is to replicate the Author's columns that you wish to search by in the Book table and index those columns (or column since it would probably be smart to store the author's relevant information in an XML column in the Book table).

FWIW, in a similar situation our DBA created DML triggers to maintain a dedicated full-text search table. It was not possible to use a materialized view because of its many restrictions.

I would use a stored procedure. FREETEXTTABLE (or CONTAINSTABLE) returns a rank which you can sort by. I am not sure how the ranks will be weighted against each other, but I'm sure you could tinker for a while and figure it out. For example:
Select SearchResults.[key], SearchResults.[rank] From FREETEXTTABLE(myTable, *, @searchString) as SearchResults Order By SearchResults.[rank] Desc

This answer is well overdue, but one way to do this if you cannot modify primary tables is to create a new table with the search parameters added to one column.
Then create a full text index on that column and query that column.
Example
SELECT
FT_TBL.[EANHotelID] AS HotelID,
ISNULL(FT_TBL.[Name],'-') AS HotelName,
ISNULL(FT_TBL.[Address1],'-') AS HotelAddress,
ISNULL(FT_TBL.[City],'-') AS HotelCity,
ISNULL(FT_TBL.[StateProvince],'-') AS HotelCountyState,
ISNULL(FT_TBL.[PostalCode],'-') AS HotelPostZipCode,
ISNULL(FT_TBL.[Latitude],0.00) AS HotelLatitude,
ISNULL(FT_TBL.[Longitude],0.00) AS HotelLongitude,
ISNULL(FT_TBL.[CheckInTime],'-') AS HotelCheckinTime,
ISNULL(FT_TBL.[CheckOutTime],'-') AS HotelCheckOutTime,
ISNULL(b.[CountryName],'-') AS HotelCountry,
ISNULL(c.PropertyDescription,'-') AS HotelDescription,
KEY_TBL.RANK
FROM [EAN].[dbo].[tblactivepropertylist] AS FT_TBL INNER JOIN
CONTAINSTABLE ([EAN].[dbo].[tblEanFullTextSearch], FullTextSearchColumn, @s)
AS KEY_TBL
ON FT_TBL.EANHotelID = KEY_TBL.[KEY]
INNER JOIN [EAN].[dbo].[tblCountrylist] b
ON FT_TBL.Country = b.CountryCode
INNER JOIN [EAN].[dbo].[tblPropertyDescriptionList] c
ON FT_TBL.[EANHotelID] = c.EANHotelID
In the code above, [EAN].[dbo].[tblEanFullTextSearch] is the new table and FullTextSearchColumn is the column with the fields added. You can now query the new table and join to the tables you want to display data from.
Hope this helps
