MSSql full-text search for words abbreviated with dots - sql-server

I have been struggling with the following query:
select * from table_name where contains((field1,field2),'"S.E.N.S"');
Following is the text I have in field 1:
"S.E.N.S Productions"
If I search for "production" I get the result, but not for "S.E.N.S".
Any idea regarding how to get the desired result ? Thanks.
Update: Sql Server version is 2005 with SP3.
Update: Well, it is quite weird. When I set the full-text to use noiseENG.txt the query in my question works fine. But everything stored in the database is in Turkish and all the settings are set accordingly, including noiseTRK.txt. "S.E.N.S" is not a word in Turkish nor in English as far as I know. I can set it to noiseENG.txt to get it work, but I doubt that would be appropriate. Can anyone know/think of a reason why noiseTRK.txt would break in above query ? Thanks.
P.S. I have tested with unmodified noise files as well as single-letters-removed version of Turkish noise file.

Out of the box, it may not work for SQL 2005.
As I recall, for the string "S.E.N.S Productions", the SQL 2005 FTS (Full Text Search) word breaker will ignore the punctiation, and break the string into individual letters (S E N S) and the word "Productions". Single letters are, by default, noise words, and therefore won't be included in the index.
You can do a couple things:
Edit your noise word list and remove the single letters and rebuild your index
Upgrade to a later version of SQL, as the word breaking behaviour changes to corretly return your result in 2008.

Related

Identify all strings in SQL Server code (red color - like in SSMS)

I was not able to solve this by myself so I hope I didn't miss any similar post here and I'm not wasting your time.
What I want is to identify (get a list) of all strings used in SQL Server code.
Example:
select 'WordToCatch1' as 'Column1'
from Table1
where Column2 = 'WordToCatch2'
If you put above code to SSMS all three words in apostrophes will be red but only words 'WordToCatch1' and 'WordToCatch2' are "real" strings used in code.
My goal is to find all those "real" strings in any code.
For example if I will have stored procedure 10k rows long it would be impossible to search them manually so I want something what will find all those "real" strings for me and return a list of them or something.
Thanks in advance!
The trouble is, Column1 is nothing particular different compared to WordToCatch1 and WordToCatch2 - not unless you parse the SQL yourself. You could modify your query to take the quotes away from Column1 and it will show up coloured black.
I guess a simple regex will show up all identifiers after an AS keyword, which would be easier than fully parsing SQL, if all the unwanted strings are like that, and its not just an example.

Strange SQL Server Fulltext match

I stumbled upon strange fulltextindex behavior in SQL Server 2008 R2 (my word-breaker language is German).
I have this text indexed:
[...] Java Editorerstellung in Eclipse eines Modellierungseditors(UML) mit den Eclipse Technologien [...]
I triple-checked: The only occurrence of the term edi is in this short snippet of text, I can only find it as part of Editorerstellung und Modellierungseditors.
But SQL Server still has edi as a single word in it's fulltextindex (occurrence: 1) and therefore returns it on ContainsTable(...) searches. Why is it recognized as a single word?
Has anybody an explanation for this behavior? Thanks.
German Compound words require special parsing in natural language word breakers. For example, "Editorerstellung" is parsed and stored as three separate terms, "editor", "erstellung" and "editorerstellung". Extensive research has been done on analyzing German compund words and while the techniques are improving, the process is not perfect.
It is likely that the behavior you are seeing is due to heuristics being using in the word breaker. I cannot re-produce your issue using the above snippet and the Sql Server 2012 word breaker, so either Microsoft's improvement in the German word-breaker between Sql Server 2008 R2 and Sql Server 2012 solved the problem or some text you didn't include is the source of "edi" in the full-text index.
You can use sys.dm_fts_index_keywords_by_document() to see what terms are in the index. Using a binary search pattern, you should be able to narrow it down to the specific text that is generating the "edi" term.

Full text search Sql Server 2008 question

Let's say I have the following string stored in a column which has full-text:
xx 3 555 7 4
My question is why a search using FREETEXT for the word '555' would not return anything
You can't use full text search on numbers :( You can try using LIKE, although performance may be an issue. Another thing that you can try is encapsulating the search term in double-quotes, although I don't think that will help you here.
One other thing that you can try... add "NN" to the start of your search term. So, instead of searching for '555', try searching for 'NN555'. Supposedly MS does store the numeric words in the index, but it appends "NN". That was back in SQL 2005. I don't know if it still holds true in 2008.
Try creating the fulltext index using the LANGUAGE [NEUTRAL] option.
http://msdn.microsoft.com/en-us/library/ms187317.aspx
Create Fulltext Index On YourTable (YourColumn Language [Neutral])
Key Index YourKey;

Full Text Search in SQL Server 2008 shows wrong display_item for Thai language

I am working with SQL Server 2008. My task is to investigate the issue where FTS cannot find the right result for Thai.
First, I have the table which enables the FTS on the column 'ItemName' which is nvarchar. The Catalog is created with the Thai Language. Note that the Thai language is one of the languages that doesn't separate the word by spaces, so 'หลวง' 'พ่อ' 'โสธร' are written like this in a sentence: 'หลวงพ่อโสธร'
In the table, there are many rows that include the word (โสธร); for example row#1 (ItemName: 'หลวงพ่อโสธร')
On the webpage, I try to search for 'โสธร' but SQL Server cannot find it.
So I try to investigate it by trying the following query in SQL Server:
select * from sys.dm_fts_parser(N'"หลวงพ่อโสธร"', 1054, 0, 0)
...to see how the words are broken. The first one is the text to be broken. The second parameter is to specify that we're using Thai (WorkBreaker, so on). Here is the result:
row#1 (display_item: 'ງลวง', source_item: 'หลวงพ่อโสธร')
row#2 (display_item: 'พຝโส', source_item: 'หลวงพ่อโสธร')
row#3 (display_item: 'ธร', source_item: 'หลวงพ่อโสธร')
Notice that the first and second row display the wrong display_item 'ງ' in the 'ງลวง' isn't even Thai characters. 'ຝ' in 'พຝโส' is not a Thai character either.
So the question is where did those alien characters come from? I guess this why I cannot search for 'โสธร' because the word breaker is broken and keeping the wrong character in the indexes.
Please help!
This should be due to the different Dialect of thai selected while the indexing was applied.
From FTS properties check what is your selected language / culture

How do you get leading wildcard full-text searches to work in SQL Server?

Note: I am using SQL's Full-text search capabilities, CONTAINS clauses and all - the * is the wildcard in full-text, % is for LIKE clauses only.
I've read in several places now that "leading wildcard" searches (e.g. using "*overflow" to match "stackoverflow") is not supported in MS SQL. I'm considering using a CLR function to add regex matching, but I'm curious to see what other solutions people might have.
More Info: You can add the asterisk only at the end of the word or phrase. - along with my empirical experience: When matching "myvalue", "my*" works, but "(asterisk)value" returns no match, when doing a query as simple as:
SELECT * FROM TABLENAME WHERE CONTAINS(TextColumn, '"*searchterm"');
Thus, my need for a workaround. I'm only using search in my site on an actual search page - so it needs to work basically the same way that Google works (in the eyes on a Joe Sixpack-type user). Not nearly as complicated, but this sort of match really shouldn't fail.
Workaround only for leading wildcard:
store the text reversed in a different field (or in materialised view)
create a full text index on this column
find the reversed text with an *
SELECT *
FROM TABLENAME
WHERE CONTAINS(TextColumnREV, '"mrethcraes*"');
Of course there are many drawbacks, just for quick workaround...
Not to mention CONTAINSTABLE...
The problem with leading Wildcards: They cannot be indexed, hence you're doing a full table scan.
It is possible to use the wildcard "*" at the end of the word or phrase (prefix search).
For example, this query will find all "datab", "database", "databases" ...
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"datab*"')
But, unforutnately, it is not possible to search with leading wildcard.
For example, this query will not find "database"
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"*abase"')
To perhaps add clarity to this thread, from my testing on 2008 R2, Franjo is correct above. When dealing with full text searching, at least when using the CONTAINS phrase, you cannot use a leading , only a trailing functionally. * is the wildcard, not % in full text.
Some have suggested that * is ignored. That does not seem to be the case, my results seem to show that the trailing * functionality does work. I think leading * are ignored by the engine.
My added problem however is that the same query, with a trailing *, that uses full text with wildcards worked relatively fast on 2005(20 seconds), and slowed to 12 minutes after migrating the db to 2008 R2. It seems at least one other user had similar results and he started a forum post which I added to... FREETEXT works fast still, but something "seems" to have changed with the way 2008 processes trailing * in CONTAINS. They give all sorts of warnings in the Upgrade Advisor that they "improved" FULL TEXT so your code may break, but unfortunately they do not give you any specific warnings about certain deprecated code etc. ...just a disclaimer that they changed it, use at your own risk.
http://social.msdn.microsoft.com/Forums/ar-SA/sqlsearch/thread/7e45b7e4-2061-4c89-af68-febd668f346c
Maybe, this is the closest MS hit related to these issues... http://msdn.microsoft.com/en-us/library/ms143709.aspx
One thing worth keeping in mind is that leading wildcard queries come at a significant performance premium, compared to other wildcard usages.
Note: this was the answer I submitted for the original version #1 of the question before the CONTAINS keyword was introduced in revision #2. It's still factually accurate.
The wildcard character in SQL Server is the % sign and it works just fine, leading, trailing or otherwise.
That said, if you're going to be doing any kind of serious full text searching then I'd consider utilising the Full Text Index capabilities. Using % and _ wild cards will cause your database to take a serious performance hit.
Just FYI, Google does not do any substring searches or truncation, right or left. They have a wildcard character * to find unknown words in a phrase, but not a word.
Google, along with most full-text search engines, sets up an inverted index based on the alphabetical order of words, with links to their source documents. Binary search is wicked fast, even for huge indexes. But it's really really hard to do a left-truncation in this case, because it loses the advantage of the index.
As a parameter in a stored procedure you can use it as:
ALTER procedure [dbo].[uspLkp_DrugProductSelectAllByName]
(
#PROPRIETARY_NAME varchar(10)
)
as
set nocount on
declare #PROPRIETARY_NAME2 varchar(10) = '"' + #PROPRIETARY_NAME + '*"'
select ldp.*, lkp.DRUG_PKG_ID
from Lkp_DrugProduct ldp
left outer join Lkp_DrugPackage lkp on ldp.DRUG_PROD_ID = lkp.DRUG_PROD_ID
where contains(ldp.PROPRIETARY_NAME, #PROPRIETARY_NAME2)
When it comes to full-text searching, for my money nothing beats Lucene. There is a .Net port available that is compatible with indexes created with the Java version.
There's a little work involved in that you have to create/maintain the indexes, but the search speed is fantastic and you can create all sorts of interesting queries. Even indexing speed is pretty good - we just completely rebuild our indexes once a day and don't worry about updating them.
As an example, this search functionality is powered by Lucene.Net.
Perhaps the following link will provide the final answer to this use of wildcards: Performing FTS Wildcard Searches.
Note the passage that states: "However, if you specify “Chain” or “Chain”, you will not get the expected result. The asterisk will be considered as a normal punctuation mark not a wildcard character. "
If you have access to the list of words of the full text search engine, you could do a 'like' search on this list and match the database with the words found, e.g. a table 'words' with following words:
pie
applepie
spies
cherrypie
dog
cat
To match all words containing 'pie' in this database on a fts table 'full_text' with field 'text':
to-match <- SELECT word FROM words WHERE word LIKE '%pie%'
matcher = ""
a = ""
foreach(m, to-match) {
matcher += a
matcher += m
a = " OR "
}
SELECT text FROM full_text WHERE text MATCH matcher
% Matches any number of characters
_ Matches a single character
I've never used Full-Text indexing but you can accomplish rather complex and fast search queries with simply using the build in T-SQL string functions.
From SQL Server Books Online:
To write full-text queries in
Microsoft SQL Server 2005, you must
learn how to use the CONTAINS and
FREETEXT Transact-SQL predicates, and
the CONTAINSTABLE and FREETEXTTABLE
rowset-valued functions.
That means all of the queries written above with the % and _ are not valid full text queries.
Here is a sample of what a query looks like when calling the CONTAINSTABLE function.
SELECT RANK , * FROM TableName ,
CONTAINSTABLE (TableName, *, '
"*WildCard" ') searchTable WHERE
[KEY] = TableName.pk ORDER BY
searchTable.RANK DESC
In order for the CONTAINSTABLE function to know that I'm using a wildcard search, I have to wrap it in double quotes. I can use the wildcard character * at the beginning or ending. There are a lot of other things you can do when you're building the search string for the CONTAINSTABLE function. You can search for a word near another word, search for inflectional words (drive = drives, drove, driving, and driven), and search for synonym of another word (metal can have synonyms such as aluminum and steel).
I just created a table, put a full text index on the table and did a couple of test searches and didn't have a problem, so wildcard searching works as intended.
[Update]
I see that you've updated your question and know that you need to use one of the functions.
You can still search with the wildcard at the beginning, but if the word is not a full word following the wildcard, you have to add another wildcard at the end.
Example: "*ildcar" will look for a single word as long as it ends with "ildcar".
Example: "*ildcar*" will look for a single word with "ildcar" in the middle, which means it will match "wildcard". [Just noticed that Markdown removed the wildcard characters from the beginning and ending of my quoted string here.]
[Update #2]
Dave Ward - Using a wildcard with one of the functions shouldn't be a huge perf hit. If I created a search string with just "*", it will not return all rows, in my test case, it returned 0 records.

Resources