I stumbled upon strange full-text index behavior in SQL Server 2008 R2 (my word-breaker language is German).
I have this text indexed:
[...] Java Editorerstellung in Eclipse eines Modellierungseditors(UML) mit den Eclipse Technologien [...]
I triple-checked: the only occurrence of the term edi is in this short snippet of text, and there it appears only as part of Editorerstellung and Modellierungseditors.
But SQL Server still has edi as a single word in its full-text index (occurrence: 1) and therefore returns it on ContainsTable(...) searches. Why is it recognized as a single word?
Does anybody have an explanation for this behavior? Thanks.
German compound words require special parsing in natural-language word breakers. For example, "Editorerstellung" is parsed and stored as three separate terms: "editor", "erstellung" and "editorerstellung". Extensive research has been done on analyzing German compound words, and while the techniques are improving, the process is not perfect.
It is likely that the behavior you are seeing is due to heuristics being used in the word breaker. I cannot reproduce your issue using the above snippet and the SQL Server 2012 word breaker, so either Microsoft's improvements to the German word breaker between SQL Server 2008 R2 and SQL Server 2012 solved the problem, or some text you didn't include is the source of "edi" in the full-text index.
You can use sys.dm_fts_index_keywords_by_document() to see what terms are in the index. Using a binary search pattern, you should be able to narrow it down to the specific text that is generating the "edi" term.
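A rough sketch of that narrowing, assuming the indexed table is called dbo.Documents (a made-up name, substitute your own). The first query lists which rows produced the term; the second feeds a fragment of the text through the German word breaker (LCID 1031), so you can keep halving the fragment until the "edi" term appears or disappears:

```sql
-- dbo.Documents is a hypothetical table name used for illustration only.
-- Which indexed rows contributed the term "edi"?
SELECT kw.document_id,        -- identifies the source row in the index
       kw.column_id,
       kw.display_term,
       kw.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(
         DB_ID(), OBJECT_ID(N'dbo.Documents')) AS kw
WHERE kw.display_term = N'edi';

-- Run a fragment of the suspect text through the German word breaker.
-- Arguments: query string, LCID (1031 = German),
-- stoplist id (0 = system stoplist), accent sensitivity (0 = insensitive).
SELECT display_term, source_term, occurrence
FROM sys.dm_fts_parser(N'"Editorerstellung in Eclipse"', 1031, 0, 0);
```

Both functions are available from SQL Server 2008 onwards. The fragment passed to sys.dm_fts_parser is only an example; paste in whichever half of the stored text you are currently testing.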
I'm doing some tests with SQL Server 2017.
I'm trying to store arbitrary Unicode code points in an NVARCHAR column.
I've tried different collations.
I have no problem with common characters in the BMP plane of Unicode.
For more exotic symbols, for example if I try to store the "𝌹" character (U+1D339), the following happens:
If I do it within Management Studio, I only see the infamous square symbol. But Management Studio has the proper font since I can paste it in the query editor.
If I send the text from Visual Studio, the value I see in Management Studio is "??", which is also what I get back in Visual Studio after performing a query.
My understanding is, for non-supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because NCHAR fields are limited to 2 bytes.
But, I tried with Latin1_General_100_CS_AS_KS_WS_SC, both at the DB level and column level, and it doesn't seem to work either.
Any ideas?
Thanks
I can't reproduce any data loss or encoding issue. I can reproduce a square that becomes 𝌹 when copied. It's probably caused by the font used to display results in the SSMS grid or the Visual Studio debugger windows.
SQL Server and Windows have used UTF-16 for some time now, not UCS-2. Few fonts cover the full UTF-16 range, though.
When I tried this in SSMS :
create table #tc(name nvarchar(20));
insert into #tc values (N'𝌹');
select name,len(name),DATALENGTH(name) from #tc;
I saw a square, 2 and 4 in the grid. This means the character was stored properly and took 4 bytes. When I tried to copy those results to SO though I saw :
name (No column name) (No column name)
𝌹 2 4
When I used Result to Text I got the actual character :
name
-------------------- ----------- -----------
𝌹 2 4
The correct character is there, but the SSMS grid's font can't display it.
Update
As Dan Guzman noted, the font can be changed from Tools-->Options-->Environment-->Fonts and Colors-->Show settings for:-->Grid Results. The default font is Microsoft Sans Serif, a small font (855KB) used as the default font on Windows. It contains "only" 3000 glyphs. Chinese characters aren't included, which is why squares are displayed.
Chinese computers use SimSun as the default though, whose file is 17.1MB. They wouldn't have any problem displaying Chinese characters.
I'm trying to store arbitrary Unicode code points in an NVARCHAR column. I've tried different collations. I have no problem with common characters in the BMP plane of Unicode.
Collations have nothing to do with what code points you can store in an NVARCHAR / NCHAR / NTEXT (deprecated) column, variable, or literal. Those datatypes can store all 1,114,112 Unicode code points (even though most haven't been mapped to a character yet).
if I try to store the "𝌹" character (U+1D339), ... within Management Studio, I only see the infamous square symbol. But Management Studio has the proper font since I can paste it in the query editor.
As others have explained already: this is merely a font issue. Fonts can hold a max of 65k characters, so you might need multiple fonts to cover all of the characters you are trying to use. I prefer Code2003 which you can find on FontSpace.com.
If I send the text from Visual Studio, the value I see in Management Studio is '??'
This should be due to forgetting to prefix the string literal with an upper-case "N" ;-).
SELECT '𝌹' AS [Oops], N'𝌹' AS [No Oops];
-- ?? 𝌹
My understanding is, for non supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because nchar fields are limited to 2 bytes.
The Supplementary Character-Aware (SCA) collations (those ending with _SC or with _140_ in their names) do support supplementary characters. However, "support" only means that the built-in functions handle a surrogate pair as a single supplementary code point instead of as a pair of surrogate code points. Support for sorting and comparing supplementary characters actually started in SQL Server 2005 with the introduction of the version 90 collations.
All code units in UCS-2 and UTF-16 are 16 bits / 2 bytes. Supplementary characters are merely two of those 2-byte code units. Hence, being able to store supplementary characters should have been available back in SQL Server 7.0 when NVARCHAR was introduced. Even though no supplementary characters were defined until years later (after SQL Server 2000 was released), the NVARCHAR types were still capable of storing and retrieving them. I don't have SQL Server 7.0 to test with, but I have confirmed this on SQL Server 2000.
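To see the "two 2-byte code units" point directly, here is a small sketch that builds one supplementary character from its raw UTF-16 bytes and measures it. U+1D339 is just an example code point, and the two collation names are only there to switch LEN between its two behaviours; the _SC collation needs SQL Server 2012 or later:

```sql
-- U+1D339 as a UTF-16 surrogate pair: high 0xD834, low 0xDF39.
-- NVARCHAR is stored little-endian, so the four bytes are 34 D8 39 DF.
DECLARE @s NVARCHAR(10) = CONVERT(NVARCHAR(10), 0x34D839DF);

SELECT DATALENGTH(@s)                              AS bytes,       -- two 2-byte code units
       LEN(@s COLLATE Latin1_General_100_CI_AS)    AS len_non_sc,  -- counts code units
       LEN(@s COLLATE Latin1_General_100_CI_AS_SC) AS len_sc;      -- counts code points
```

I would expect 4, 2 and 1 respectively. The stored bytes are identical in all three cases; only the way LEN counts them changes with the collation.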
For more info, please see:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
Collations Info
I have been struggling with the following query:
select * from table_name where contains((field1,field2),'"S.E.N.S"');
Following is the text I have in field 1:
"S.E.N.S Productions"
If I search for "production" I get the result, but not for "S.E.N.S".
Any idea regarding how to get the desired result ? Thanks.
Update: Sql Server version is 2005 with SP3.
Update: Well, it is quite weird. When I set full-text to use noiseENG.txt, the query in my question works fine. But everything stored in the database is in Turkish and all the settings are set accordingly, including noiseTRK.txt. "S.E.N.S" is not a word in Turkish, nor in English as far as I know. I can set it to noiseENG.txt to get it to work, but I doubt that would be appropriate. Does anyone know of a reason why noiseTRK.txt would break the above query? Thanks.
P.S. I have tested with unmodified noise files as well as single-letters-removed version of Turkish noise file.
Out of the box, it may not work for SQL 2005.
As I recall, for the string "S.E.N.S Productions", the SQL 2005 FTS (Full Text Search) word breaker will ignore the punctuation and break the string into individual letters (S E N S) and the word "Productions". Single letters are, by default, noise words, and therefore won't be included in the index.
You can do a couple of things:
Edit your noise word list to remove the single letters, then rebuild your index
Upgrade to a later version of SQL Server, as the word-breaking behaviour changes in 2008 to correctly return your result.
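If you do have a SQL Server 2008 (or later) instance at hand, a quick way to check how a given word breaker splits the string is sys.dm_fts_parser (it does not exist in 2005). A sketch, assuming LCID 1055 for Turkish, 1033 for English, and the system stoplist (id 0):

```sql
-- Compare how the Turkish and English word breakers tokenize the same phrase.
-- special_term shows whether a token is treated as a noise word or an exact match.
SELECT 'Turkish' AS breaker, display_term, special_term, occurrence
FROM sys.dm_fts_parser(N'"S.E.N.S Productions"', 1055, 0, 0)
UNION ALL
SELECT 'English', display_term, special_term, occurrence
FROM sys.dm_fts_parser(N'"S.E.N.S Productions"', 1033, 0, 0);
```

If the two result sets differ in how "S.E.N.S" is split or flagged, that would point at the word breaker rather than the noise file itself.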
I was writing a query against a table today on a SQL Server 2000 box, and while writing the query in Query Analyzer, to my surprise I noticed the word LineNo was converted to blue text.
It appears to be a reserved word according to MSDN documentation, but I can find no information on it, just speculation that it might be a legacy reserved word that doesn't do anything.
I have no problem escaping the field name, but I'm curious -- does anyone know what "LineNo" in T-SQL is actually used for?
OK, this is completely undocumented, and I had to figure it out via trial and error, but it sets the line number for error reporting. For example:
LINENO 25
SELECT * FROM NON_EXISTENT_TABLE
If you comment out the LINENO line (e.g., by prefixing it with two hyphens), the error is reported at line 3. With LINENO in place, the above gives you an error message indicating an error at line 27:
Msg 208, Level 16, State 1, Line 27
Invalid object name 'NON_EXISTENT_TABLE'.
This is related to similar mechanisms in programming languages, such as the #line preprocessor directives in Visual C++ and Visual C# (which are documented, by the way).
How is this useful, you may ask? Well, one use of this is to help SQL code generators, which generate code from some higher-level (than SQL) language and/or perform macro expansion, tie generated code lines to user code lines.
P.S., It is not a good idea to rely on undocumented features, especially when dealing with a database.
Update: This explanation is still correct up to and including the current version of SQL Server, which at the time of this writing is SQL Server 2008 R2 Cumulative Update 5 (10.50.1753.0).
Depending on where you use it, you can always use [LineNo]. For example:
select LnNo [LineNo] from OrderLines
Let's say I have the following string stored in a column which has full-text:
xx 3 555 7 4
My question is why a search using FREETEXT for the word '555' would not return anything.
You can't use full text search on numbers :( You can try using LIKE, although performance may be an issue. Another thing that you can try is encapsulating the search term in double-quotes, although I don't think that will help you here.
One other thing that you can try... add "NN" to the start of your search term. So, instead of searching for '555', try searching for 'NN555'. Supposedly MS does store the numeric words in the index, but it appends "NN". That was back in SQL 2005. I don't know if it still holds true in 2008.
Try creating the fulltext index using the LANGUAGE [NEUTRAL] option.
http://msdn.microsoft.com/en-us/library/ms187317.aspx
Create Fulltext Index On YourTable (YourColumn Language [Neutral])
Key Index YourKey;
I am working with SQL Server 2008. My task is to investigate the issue where FTS cannot find the right result for Thai.
First, I have a table with FTS enabled on the column 'ItemName', which is nvarchar. The catalog is created with the Thai language. Note that Thai is one of the languages that doesn't separate words with spaces, so 'āļŦāļĨāļ§āļ' 'āļāđāļ' 'āđāļŠāļāļĢ' are written like this in a sentence: 'āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ'
In the table, there are many rows that include the word (āđāļŠāļāļĢ); for example row#1 (ItemName: 'āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ')
On the webpage, I try to search for 'āđāļŠāļāļĢ' but SQL Server cannot find it.
So I try to investigate it by trying the following query in SQL Server:
select * from sys.dm_fts_parser(N'"āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ"', 1054, 0, 0)
...to see how the words are broken. The first parameter is the text to be broken. The second parameter (1054) specifies that we're using the Thai word breaker. Here is the result:
row#1 (display_item: 'āšāļĨāļ§āļ', source_item: 'āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ')
row#2 (display_item: 'āļāšāđāļŠ', source_item: 'āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ')
row#3 (display_item: 'āļāļĢ', source_item: 'āļŦāļĨāļ§āļāļāđāļāđāļŠāļāļĢ')
Notice that the first and second rows show the wrong display_item: 'āš' in 'āšāļĨāļ§āļ' isn't even a Thai character, and neither is 'āš' in 'āļāšāđāļŠ'.
So the question is: where did those alien characters come from? I guess this is why I cannot search for 'āđāļŠāļāļĢ': the word breaker is broken and is keeping the wrong characters in the index.
Please help!
This is probably because a different Thai dialect was selected when the index was created.
In the full-text index properties, check which language/culture is selected.
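One way to check this from T-SQL rather than the properties dialog, as a sketch: the catalog views below list the language each full-text indexed column was created with. dbo.Items is a placeholder table name; ItemName is the column from the question.

```sql
-- dbo.Items is a hypothetical table name; replace it with the indexed table.
SELECT c.name        AS column_name,
       ic.language_id,               -- LCID used by the full-text index (1054 = Thai)
       l.name        AS language_name
FROM sys.fulltext_index_columns AS ic
JOIN sys.columns AS c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.fulltext_languages AS l
  ON l.lcid = ic.language_id
WHERE ic.object_id = OBJECT_ID(N'dbo.Items');
```

If language_id for ItemName is not 1054, the index and the sys.dm_fts_parser test in the question are using different word breakers.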