In a SQL server database, I have a table Messages with the following columns:
Id INT(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)
Messages are pretty basic, and only allow alphanumeric characters and a handful of special characters, which are as follows:
`¬!"£$%^&*()-_=+[{]};:'##~\|,<.>/?
Ignoring the bulk of the special characters bar the apostrophe, what I need is a way to list each word along with how many times the word occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.
Example output:
Word Frequency
-----------------
a 11280
the 10102
and 8845
when 2024
don't 2013
.
.
.
It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.
I'm having trouble splitting out the words into a temporary table called #Words.
Once I have a temporary table, I would apply the following query:
SELECT
Word,
SUM(Word) AS WordCount
FROM #Words
GROUP BY Word
ORDER BY SUM(Word) DESC
Please help.
Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.
You haven't posted what version of SQL you're using, so I've going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE (So REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' '), and find a string splitter (for example, Jeff Moden's DelimitedSplit8K).
USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));
INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
WITH Replacements AS(
SELECT TRANSLATE(Detail, '`¬!"£$%^&*()-_=+[{]};:##~\|,<.>/?',' ') AS StrippedDetail
FROM [Messages] M)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY string_split(R.StrippedDetail,' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO
DROP TABLE [Messages];
As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results and it'll include numbers in there. Things like dates are going to get split out,, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.
I have a full-text index on a column in a table that contains data like this:
searchColumn
90210 Brooks Diana Miami FL diana.brooks#email.com 5612233395
The column is an aggregate of Zip, last name, first name, city, state, e-mail and phone number.
I use this column to search for a customer based on any of this possible information.
The issue I am worried about is with the high number of reads that occurs when doing a query on this column. The query I am using is:
declare #searchTerm varchar(100) = ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" '
select *
from CustomerInformation c
where contains(c.searchColumn, #searchTerm)
Now, when running Profiler I can see that this search has about 50.000 page reads to return a single row, as opposed to when using a different approach using regular indexes and multiple variables, broken down like #firstName, #LastName, like below:
WHERE C.FirstName like coalesce(#FirstName + '%' , C.FirstName)
AND C.LastName like coalesce(#LastName + '%' , C.LastName)
etc.
Using this approach I get only around 140 page reads. I know the approaches are quite different, but I'm trying to understand why the full-text version has so much more reads and if there is any way I can bring that down to something closer to the numbers I get when using regular indexes.
I have a couple of thoughts on this. First the Select * will generate a great number of page reads because it has to pull all columns which may or may not be indexed. When you pull every column it most likely will not make use of the best Index plan out there.
As to your Where clauses, when using the #searchTerm and the value of "FL" AND "90210*" AND "Diana*" AND "Brooks*" it has to check the datapages multiple times each time it is run. Think of how you would look up this information if you had to do it. You look at a piece of paper with the info on it and see if the search column contains FL. Now does it contain FL and 90210*. Now does it contain both of those plus Diana...etc.
You can see why it would keep having to go back to the page to read over and over again. The second query only has to look at 2 columns narrowly defined.
If you want more information on this, I would suggest a class by Brent Ozar that is free right now.
How to think like the SQL Server Engine
I hope that helps.
I try to implement a search-mechanism with "CONTAINS()" on a SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search an exact phrase.
But e.q. if I use this CONTAINS(*, '"test"') I receive results containing words like "numerictest" also. If I try CONTAINS(*, '" test "') it is the same. I've noticed, that there are less results as if I would search with CONTAINS(*, '*test*') for a prefix, sufix search, so there is definitely a delta between the searches.
I didn't expect the "numerictest" in the first statement. Is there an explanation for this behaviour?
I have been wracking my brain about a very similar problem and I recently found the solution.
In my case I was searching full text fields for "#username" but using CONTAINS(body, "#username") returned just "username" as well. I wanted it to strictly match with the # sign.
I could use LIKE "%#username%" but the query took over a minute which was unacceptable so I kept looking.
With the help of some people in a chat room they suggested using both CONTAINS and LIKE. So:
SELECT TOP 25 * FROM table WHERE
CONTAINS(body, "#username") AND body LIKE "%#username%";
this worked perfectly for me because the contains pulls both username and #username records and then the LIKE filters out the ones with the # sign. Queries take 2-3 seconds now.
I know this is an old question but I came across it in my searching so having the answer I thought I would post it. I hope this helps.
Contains(*,'"test"') will only match full words of "test" as you expect.
Contains(*,'" test "') same as above
Contains(*,'"*test*"') will actually do a PREFIX ONLY search, basically strips out any special characters at the start of word and only uses the 2nd *.
You cannot do POSTFIX searches using full text search.
My concern lies with the Contains(*) part, this will search for any full text cataloged items in that entire row. Without seeing the data it is hard to tell but my guess is that another column in that row you think is bad is actually matching on "test" somewhere.
I would like to know how to exclude apostrophes from being indexed in full text search.
For example, if someone enters a search term for "o'brien" or for "obrien" I would want it to match all cases where someone's name matches either "O'Brien" or "OBrien".
However, if I search on:
select * from MyTable where contains (fullName, '"o''Brien*"')
It returns only ones with an apostrophe.
But if I do:
select * from MyTable where contains (fullName, '"oBrien*"')
It only returns the ones without an apostrophe.
In short, I want to know if it is possible for FTS to index both "O'Brien" and "OBrien" as "obrien" so that I can find both.
While the solution:
select * from MyTable where contains (fullName, '"oBrien*" OR "o''Brien*"')
would work, however, I can't make that assumption if the user entered "obrien".
I'm looking for a solution that works both on SQL Server 2005 and 2008.
depending on how much space you have, you could add another column
fullName_noPunctuation
containing only the alpha characters of the name, strip punctuation from your search criteria, and search the punctuation-free column instead.
Note: I am using SQL's Full-text search capabilities, CONTAINS clauses and all - the * is the wildcard in full-text, % is for LIKE clauses only.
I've read in several places now that "leading wildcard" searches (e.g. using "*overflow" to match "stackoverflow") is not supported in MS SQL. I'm considering using a CLR function to add regex matching, but I'm curious to see what other solutions people might have.
More Info: You can add the asterisk only at the end of the word or phrase. - along with my empirical experience: When matching "myvalue", "my*" works, but "(asterisk)value" returns no match, when doing a query as simple as:
SELECT * FROM TABLENAME WHERE CONTAINS(TextColumn, '"*searchterm"');
Thus, my need for a workaround. I'm only using search in my site on an actual search page - so it needs to work basically the same way that Google works (in the eyes on a Joe Sixpack-type user). Not nearly as complicated, but this sort of match really shouldn't fail.
Workaround only for leading wildcard:
store the text reversed in a different field (or in materialised view)
create a full text index on this column
find the reversed text with an *
SELECT *
FROM TABLENAME
WHERE CONTAINS(TextColumnREV, '"mrethcraes*"');
Of course there are many drawbacks, just for quick workaround...
Not to mention CONTAINSTABLE...
The problem with leading Wildcards: They cannot be indexed, hence you're doing a full table scan.
It is possible to use the wildcard "*" at the end of the word or phrase (prefix search).
For example, this query will find all "datab", "database", "databases" ...
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"datab*"')
But, unforutnately, it is not possible to search with leading wildcard.
For example, this query will not find "database"
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"*abase"')
To perhaps add clarity to this thread, from my testing on 2008 R2, Franjo is correct above. When dealing with full text searching, at least when using the CONTAINS phrase, you cannot use a leading , only a trailing functionally. * is the wildcard, not % in full text.
Some have suggested that * is ignored. That does not seem to be the case, my results seem to show that the trailing * functionality does work. I think leading * are ignored by the engine.
My added problem however is that the same query, with a trailing *, that uses full text with wildcards worked relatively fast on 2005(20 seconds), and slowed to 12 minutes after migrating the db to 2008 R2. It seems at least one other user had similar results and he started a forum post which I added to... FREETEXT works fast still, but something "seems" to have changed with the way 2008 processes trailing * in CONTAINS. They give all sorts of warnings in the Upgrade Advisor that they "improved" FULL TEXT so your code may break, but unfortunately they do not give you any specific warnings about certain deprecated code etc. ...just a disclaimer that they changed it, use at your own risk.
http://social.msdn.microsoft.com/Forums/ar-SA/sqlsearch/thread/7e45b7e4-2061-4c89-af68-febd668f346c
Maybe, this is the closest MS hit related to these issues... http://msdn.microsoft.com/en-us/library/ms143709.aspx
One thing worth keeping in mind is that leading wildcard queries come at a significant performance premium, compared to other wildcard usages.
Note: this was the answer I submitted for the original version #1 of the question before the CONTAINS keyword was introduced in revision #2. It's still factually accurate.
The wildcard character in SQL Server is the % sign and it works just fine, leading, trailing or otherwise.
That said, if you're going to be doing any kind of serious full text searching then I'd consider utilising the Full Text Index capabilities. Using % and _ wild cards will cause your database to take a serious performance hit.
Just FYI, Google does not do any substring searches or truncation, right or left. They have a wildcard character * to find unknown words in a phrase, but not a word.
Google, along with most full-text search engines, sets up an inverted index based on the alphabetical order of words, with links to their source documents. Binary search is wicked fast, even for huge indexes. But it's really really hard to do a left-truncation in this case, because it loses the advantage of the index.
As a parameter in a stored procedure you can use it as:
ALTER procedure [dbo].[uspLkp_DrugProductSelectAllByName]
(
#PROPRIETARY_NAME varchar(10)
)
as
set nocount on
declare #PROPRIETARY_NAME2 varchar(10) = '"' + #PROPRIETARY_NAME + '*"'
select ldp.*, lkp.DRUG_PKG_ID
from Lkp_DrugProduct ldp
left outer join Lkp_DrugPackage lkp on ldp.DRUG_PROD_ID = lkp.DRUG_PROD_ID
where contains(ldp.PROPRIETARY_NAME, #PROPRIETARY_NAME2)
When it comes to full-text searching, for my money nothing beats Lucene. There is a .Net port available that is compatible with indexes created with the Java version.
There's a little work involved in that you have to create/maintain the indexes, but the search speed is fantastic and you can create all sorts of interesting queries. Even indexing speed is pretty good - we just completely rebuild our indexes once a day and don't worry about updating them.
As an example, this search functionality is powered by Lucene.Net.
Perhaps the following link will provide the final answer to this use of wildcards: Performing FTS Wildcard Searches.
Note the passage that states: "However, if you specify “Chain” or “Chain”, you will not get the expected result. The asterisk will be considered as a normal punctuation mark not a wildcard character. "
If you have access to the list of words of the full text search engine, you could do a 'like' search on this list and match the database with the words found, e.g. a table 'words' with following words:
pie
applepie
spies
cherrypie
dog
cat
To match all words containing 'pie' in this database on a fts table 'full_text' with field 'text':
to-match <- SELECT word FROM words WHERE word LIKE '%pie%'
matcher = ""
a = ""
foreach(m, to-match) {
matcher += a
matcher += m
a = " OR "
}
SELECT text FROM full_text WHERE text MATCH matcher
% Matches any number of characters
_ Matches a single character
I've never used Full-Text indexing but you can accomplish rather complex and fast search queries with simply using the build in T-SQL string functions.
From SQL Server Books Online:
To write full-text queries in
Microsoft SQL Server 2005, you must
learn how to use the CONTAINS and
FREETEXT Transact-SQL predicates, and
the CONTAINSTABLE and FREETEXTTABLE
rowset-valued functions.
That means all of the queries written above with the % and _ are not valid full text queries.
Here is a sample of what a query looks like when calling the CONTAINSTABLE function.
SELECT RANK , * FROM TableName ,
CONTAINSTABLE (TableName, *, '
"*WildCard" ') searchTable WHERE
[KEY] = TableName.pk ORDER BY
searchTable.RANK DESC
In order for the CONTAINSTABLE function to know that I'm using a wildcard search, I have to wrap it in double quotes. I can use the wildcard character * at the beginning or ending. There are a lot of other things you can do when you're building the search string for the CONTAINSTABLE function. You can search for a word near another word, search for inflectional words (drive = drives, drove, driving, and driven), and search for synonym of another word (metal can have synonyms such as aluminum and steel).
I just created a table, put a full text index on the table and did a couple of test searches and didn't have a problem, so wildcard searching works as intended.
[Update]
I see that you've updated your question and know that you need to use one of the functions.
You can still search with the wildcard at the beginning, but if the word is not a full word following the wildcard, you have to add another wildcard at the end.
Example: "*ildcar" will look for a single word as long as it ends with "ildcar".
Example: "*ildcar*" will look for a single word with "ildcar" in the middle, which means it will match "wildcard". [Just noticed that Markdown removed the wildcard characters from the beginning and ending of my quoted string here.]
[Update #2]
Dave Ward - Using a wildcard with one of the functions shouldn't be a huge perf hit. If I created a search string with just "*", it will not return all rows, in my test case, it returned 0 records.