SQL Server's SoundEx function on non-Latin character sets? - sql-server

Does SQL Server 2000's SOUNDEX function work on Asian character sets? I used it in a query and it does not appear to have worked properly, but I realize that could be because I don't know how to read Chinese...
Furthermore, are there any other languages the function might have trouble with? (Russian, for example.)
Thank you, Frank

Soundex is fairly specific to English; it may or may not work well on other languages. One example from New Zealand was an attempt at patient name matching using Soundex. Unfortunately, Pacific Island names did not work well with Soundex, in many cases hashing to the same small set of values. A different algorithm had to be used.
Your mileage may vary. On more recent versions of SQL Server you could write a CLR function to do some other computation.

By design it works best on English words written in the ASCII character set. I have used it on a project in Romania where I replaced the Romanian special characters with corresponding ASCII characters that sound more or less the same. It is not perfect, but in my case it was a lot better than nothing.
I don't think you will have much success applying SOUNDEX to Asian character sets.

I know that Soundex in older versions of SQL Server ignored any non-English characters. I believe it didn't even handle Latin-1, let alone anything more exotic.
I never dealt with Soundex much in SQL Server 2000; all I know for certain is that it did not handle Arabic correctly. This likely extends to other non-Latin character sets as well.
In any case, a Soundex-based algorithm is unlikely to yield acceptable results for non-English languages even aside from character set issues. Soundex was specifically designed for the English pronunciation of names (mostly those of Western European origin) and does not function particularly well outside that use. You would often be better off researching one of the several variants of Soundex, or other phonetic similarity algorithms, designed to address the language(s) in question.
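To make the limitation concrete, here is a small illustration. The English results are standard SOUNDEX behaviour; the non-Latin lines are there for you to run against your own collation and data, since the degenerate codes they produce are exactly the problem described above:

SELECT  SOUNDEX('Smith') AS SmithCode,        -- S530
        SOUNDEX('Smyth') AS SmythCode,        -- S530 (same code, so they "match")
        DIFFERENCE('Smith', 'Smyth') AS Diff; -- 4 = best possible similarity

-- SOUNDEX only encodes the letters A-Z, so strings written in other scripts
-- tend to collapse to degenerate codes and produce false matches. Verify the
-- exact output on your own server and collation.
SELECT  SOUNDEX(N'чай') AS RussianCode,
        SOUNDEX(N'您好') AS ChineseCode;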

You could use an algorithm like Levenshtein distance instead. There are various implementations of the algorithm as user-defined functions which you can call within a SELECT statement.
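As a rough starting point, here is a minimal sketch of such a user-defined function. The name dbo.Levenshtein, the 100-character cap, and the table-variable implementation are illustrative choices, not anything built into SQL Server; for long strings or hot code paths a CLR implementation will be much faster.

CREATE FUNCTION dbo.Levenshtein (@s NVARCHAR(100), @t NVARCHAR(100))
RETURNS INT
AS
BEGIN
    DECLARE @sLen INT, @tLen INT;
    SET @sLen = LEN(@s);
    SET @tLen = LEN(@t);
    IF @sLen = 0 RETURN @tLen;
    IF @tLen = 0 RETURN @sLen;

    -- @prev holds row i-1 of the dynamic-programming matrix, @curr holds row i
    DECLARE @prev TABLE (j INT PRIMARY KEY, d INT);
    DECLARE @curr TABLE (j INT PRIMARY KEY, d INT);
    DECLARE @i INT, @j INT, @cost INT, @d INT;

    SET @j = 0;
    WHILE @j <= @tLen               -- row 0: distance from '' to each prefix of @t
    BEGIN
        INSERT INTO @prev (j, d) VALUES (@j, @j);
        SET @j = @j + 1;
    END;

    SET @i = 1;
    WHILE @i <= @sLen
    BEGIN
        DELETE FROM @curr;
        INSERT INTO @curr (j, d) VALUES (0, @i);
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            -- note: this character comparison follows the database collation
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            SELECT @d = d + 1 FROM @curr WHERE j = @j - 1;                 -- insertion
            SELECT @d = CASE WHEN d + 1 < @d THEN d + 1 ELSE @d END
              FROM @prev WHERE j = @j;                                     -- deletion
            SELECT @d = CASE WHEN d + @cost < @d THEN d + @cost ELSE @d END
              FROM @prev WHERE j = @j - 1;                                 -- substitution
            INSERT INTO @curr (j, d) VALUES (@j, @d);
            SET @j = @j + 1;
        END;
        DELETE FROM @prev;
        INSERT INTO @prev (j, d) SELECT j, d FROM @curr;
        SET @i = @i + 1;
    END;

    RETURN (SELECT d FROM @prev WHERE j = @tLen);
END;

Once created, you can call it directly in a query, e.g. WHERE dbo.Levenshtein(Name, @SearchName) <= 2 (table and variable names are placeholders). Keep in mind a scalar UDF like this runs row by row, so narrow the candidate set first if the table is large.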

Related

SQL Server part number searching with possible variations on nvarchar

I've been tasked with looking into a part number matching issue. The basics: there are several different ranges with different patterns for the part numbers. For example, 0-732-012 might be stored in the database, but the user might then search using 0732012 or 0.732.012; it could also be that the part numbers themselves have been entered incorrectly in some cases (so 0732012 could exist as a valid entry as well).
There are also some alphanumeric part numbers mixed in, for example 0732012KG.
My question is this, then: is there an efficient way of checking numerical codes, either by stripping or ignoring the non-numeric characters when attempting to match?
The data is stored in SQL Server 2008, so it could support regular expressions via a CLR function should that be required.
This is what I would do.
Strip all dashes, spaces, and special characters from both the search entry and the field (only in the comparison).
It is much faster than using a LIKE.
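A sketch of that idea on SQL Server 2008, assuming a hypothetical Parts table with a PartNumber column: strip the known separators into a persisted, indexed computed column, and strip the same characters from the user's input, so the comparison becomes a plain equality (or prefix search) instead of a per-row function call.

ALTER TABLE dbo.Parts ADD PartNumberClean AS
    REPLACE(REPLACE(REPLACE(PartNumber, '-', ''), '.', ''), ' ', '') PERSISTED;

CREATE INDEX IX_Parts_PartNumberClean ON dbo.Parts (PartNumberClean);

-- Normalise the search term the same way before comparing
DECLARE @Search VARCHAR(50) = '0.732.012';
SET @Search = REPLACE(REPLACE(REPLACE(@Search, '-', ''), '.', ''), ' ', '');

SELECT PartNumber
FROM   dbo.Parts
WHERE  PartNumberClean = @Search;      -- or LIKE @Search + '%' for partial matches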

SQL Server Collation Choices

I've spent a lot of time this evening trying to find guidance about which choice of collation to apply in my SQL Server 2008 R2 installation, but almost everything online basically says "choose what is right for you." Extremely unhelpful.
My context is new application development. I am not worrying about backward compatibility with a prior version of SQL Server (viz. <= 2005). I am very interested in storing data representing languages from around the globe - not just Latin based. What very little help I've found online suggests I should avoid all "SQL_" collations. This narrows my choice to using either a binary or "not binary" collation based on the Windows locale.
If I use binary, I gather I should use "BIN2." So this is my question. How do I determine whether I should use BIN2 or just "Latin1_General_100_XX_XX_XX"? My spider-sense tells me that BIN2 will provide collation that is "less accurate," but more generic for all languages (and fast!). I also suspect the binary collation is case sensitive, accent sensitive, and kana-sensitive (yes?). In contrast, I suspect the non-binary collation would work best for Latin-based languages.
The documentation doesn't support my claims above; I'm making educated guesses. But this is the problem! Why is the online documentation so thin that the choice is left to guesswork? Even the book "SQL Server 2008 Internals" discusses the variety of choices without explaining why and when a binary collation would be chosen (compared with a non-binary Windows collation). Criminy!!!
"SQL Server 2008 Internals" has a good discussion on the topic imho.
Binary collation is tricky. If you intend to support text search for human beings, you'd better go with non-binary. Binary is good for gaining a tiny bit of performance once you have tuned everything else (architecture first), and in cases where case sensitivity and accent sensitivity are desired behaviour, like password hashes for instance. Binary collation is actually "more precise" in the sense that it does not treat similar texts as equal; the sort orders you get out of it, though, are only good for machines.
There is only a slight difference between the SQL_* collations and the native Windows ones. If you're not constrained by compatibility, go for the native ones, as they are the way forward as far as I know.
Collation decides sort order and equality, so choose what best suits your users. It's understood that you will use the Unicode types (like nvarchar) for your data to support international text; collation also affects what can be stored in a non-Unicode column, which then does not affect you.
What really matters is that you avoid mixing collations in WHERE clauses, because that's where you pay the price of not being able to use indexes. As far as I know there's no silver-bullet collation to support all languages. You can either choose one for the majority of your users or go into localization support with a different column for each language.
One important thing is to keep the server collation the same as your database collation. It will make your life much easier if you plan to use temporary tables: temporary tables created with "CREATE TABLE #ttt..." pick up the server collation, so otherwise you run into collation conflicts which you need to solve by specifying an explicit collation. This has a performance impact too.
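For illustration, this is the kind of conflict meant above, assuming a database whose collation differs from the server collation and a hypothetical dbo.Customers table. The join fails with "Cannot resolve the collation conflict..." unless a collation is forced:

CREATE TABLE #ttt (Name NVARCHAR(50));               -- picks up the server collation

SELECT c.CustomerId
FROM   dbo.Customers AS c                            -- Name here uses the database collation
JOIN   #ttt AS t
       ON c.Name = t.Name COLLATE DATABASE_DEFAULT;  -- forcing a collation resolves the conflict

DROP TABLE #ttt;

Declaring the temp table column with COLLATE DATABASE_DEFAULT in the first place avoids the issue as well.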
Please do not consider my answer complete, but you should take the following points into consideration:
(As Anthony said) All text fields must use the nvarchar data type. This will allow you to store any character from any language, as defined by the Unicode character set. If you do not do so, you will not be able to mix text from different origins (Latin, Cyrillic, Arabic, etc.) in your tables.
This said, your collation choice will mainly affect the following:
The collating sequence, or sorting rules, between characters such as 'e' and 'é', or 'c' and 'ç' (should they be considered equal or not?). In some cases, collating sequences also consider specific letter combinations, as in Hungarian, where C and CS, or D, DZ and DZS, are treated independently.
The way spaces (or other non-letter characters) are handled: which one is the correct 'alphabetical' order?
this one (spaces are considered as 'first rank' characters)?
San Juan
San Teodoro
Santa Barbara
or this one (spaces are not considered in the ordering)?
San Juan
Santa Barbara
San Teodoro
Collation also affects case sensitivity: should capital letters be considered equal to lowercase letters?
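A small, concrete illustration of the accent- and case-sensitivity points above (the collations named here exist in SQL Server; which one is right for you is exactly the question being discussed):

SELECT CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_CI_AI
            THEN 'equal' ELSE 'different' END AS AccentInsensitive,   -- 'equal'
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'different' END AS AccentSensitive,     -- 'different'
       CASE WHEN N'ABC' = N'abc' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'different' END AS CaseSensitive;       -- 'different'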
The best default collation for a global database (e.g. a website) is probably Latin1_General_CI_AS. More important than collation is making sure that all textual columns use the nvarchar data type.
As long as you use NVARCHAR columns (as you should for mixed international data), all *_BIN and *_BIN2 collations perform the same binary comparison/sorting based on the Unicode code points. It doesn't matter which one you pick. Latin1_General_BIN2 looks like a reasonable generic choice.
Source: http://msdn.microsoft.com/en-us/library/ms143350(v=sql.105).aspx

Why does SQL Server consider N'㐢㐢㐢㐢' and N'㐢㐢㐢' to be equal?

We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐢㐢㐢㐢' = N'㐢㐢㐢'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight; if it doesn't, it will be considered equal to the empty string.
Sorting it all Out: The jury will give this string no weight
In SQL Server this weight is influenced by the defined collation. Microsoft added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends using Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS:
You can always use any binary or binary2 collation, although it wouldn't give you linguistically correct results. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stroke_90_CI_AS, which support surrogate-pair comparison (but not linguistically). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS, which have better linguistic surrogate comparison. I do suggest you use these collations as your server/database/table collation instead of passing the collation name during comparison.
So the following SQL statement would work as expected:
select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;
A list of all supported collations can be found in MSDN:
SQL Server 2008 Books Online: Windows Collation Name
That character, U+3422, is from the CJK Unified Ideographs Extension A block, a relatively obscure (and politically loaded) part of the Unicode standard. My guess is that SQL Server simply does not know that part, or perhaps even intentionally does not implement it due to political considerations.
Edit: looks like my guess was wrong and the real problem was that neither Latin nor Japanese collation define weights for that character.
If you look at the Unihan data page, the character appears to only have the "K-Source" field which corresponds to the South Korean government's mappings.
My guess is that MS SQL asks "is this character a Chinese character?" and, if so, uses the Japanese sorting standard, discarding the character if no collation weight is available - likely a SQL Server-specific issue.
I very much doubt it's a political dispute, as another poster suggested, since the character doesn't even have a Taiwan or Hong Kong encoding mapping.
More technical info: the J-Source (the Japanese sorting order prescribed by the Japanese government) is blank, as the character was probably only used in classical Korean Hanja (Chinese characters, which are now only used in some contexts).
The Japanese government's JIS sorting standards generally sort Kanji characters by the Japanese On reading (which is usually the approximated Chinese pronunciation when the characters were imported into Japan.) But this character probably isn't used much in Japanese and may not even have a Japanese pronunciation to associate with it, so hasn't been added to the data.

How to best match two strings?

Do you know any good algorithms that match two strings and then return a percentage indicating how closely those two strings match?
And are there some that work with databases too?
The Levenshtein distance is such a measure. It basically tells you how many characters need to be substituted, deleted or inserted to get from the first string to the second. I'm not sure whether some database systems support it.
But I know for sure that a much more simplified algorithm named Soundex is supported in some database systems.
It depends upon your criteria for similarity. Other people have already referred you to Levenshtein distance (edit distance is the same thing). That's usually pretty good, and definitely more language-independent than something like Soundex. However, be aware that Levenshtein distance does not handle transposition very well. Thus:
Levenshtein("copy", "cpoy") == 2
If you're trying to deal with human input, transpositions are fairly common. Whether that's a problem or not depends on your metrics for similarity.
It's been a while, but I believe PostgreSQL has levenshtein() available via the fuzzystrmatch contrib module.
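To get the percentage the question asks for, you can normalise the edit distance by the longer string's length. A sketch, where dbo.Levenshtein stands in for whichever user-defined implementation you install (for example the one sketched earlier on this page):

DECLARE @a NVARCHAR(100), @b NVARCHAR(100);
SET @a = N'Jonathan';
SET @b = N'Jonathon';

SELECT 100.0 * (1.0 - 1.0 * dbo.Levenshtein(@a, @b)
                      / NULLIF(CASE WHEN LEN(@a) > LEN(@b) THEN LEN(@a) ELSE LEN(@b) END, 0))
       AS SimilarityPercent;   -- 87.5 here: distance 1 over a maximum length of 8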
I think the problem you're looking for is called Edit Distance. It is expensive to compute in general, but if you are looking for strings within small edit distance of other strings, it is not so bad. There is more information in the Wikipedia article.
How to best match two strings? Have them go out for coffee, and if they hit it off, dinner and a movie. Or maybe they could do some peer programming? It depends on the strings, really. Even coffee can often be tricky.
Would this be of help? I just ran into it. Comparing Two Strings producing a numeric delta

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert them to a uniform format? There could be millions of rows (with duplicates) and I don't want to tie up server resources (with too much preprocessing) every time some source data comes through.
Any suggestions are welcome.
Update: I have no control over the source data, just that the structure of the XML file is standard. I would like to keep the XML parsing to a minimum.
Once it is in the database, retrieval should be quick. One crazy suggestion going around here is that it should even work with an Ajax autocomplete feature (so sales reps can see the matching numbers immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10-character field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields: one for the original input and one with all non-numeric data stripped out, used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is an extension or other data and deal with it appropriately. Of course you could avoid the second column by doing something in the index that strips out the extra characters when creating it, but I'd just make a second column and probably do the stripping with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
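A sketch of that two-column layout, with hypothetical table and column names; the digits-only column is filled on insert (by a trigger or in the load process) and is the one that gets indexed and searched:

CREATE TABLE dbo.ContactPhone
(
    ContactId   INT         NOT NULL,
    PhoneRaw    VARCHAR(50) NOT NULL,   -- as supplied, e.g. '(212) 555-1212 ext 42'
    PhoneDigits VARCHAR(20) NOT NULL    -- digits only, e.g. '212555121242'
);
CREATE INDEX IX_ContactPhone_Digits ON dbo.ContactPhone (PhoneDigits);

-- Searches (including the autocomplete case) hit the stripped column only
SELECT ContactId, PhoneRaw
FROM   dbo.ContactPhone
WHERE  PhoneDigits LIKE '212555%';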
We use varchar(15) and certainly index on that field.
The reason is that the international standard (ITU-T E.164) allows up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support international numbers, I recommend storing the world zone code or country code separately, to make queries easier to filter, so that you do not find yourself parsing and checking the length of your phone number fields just to limit the returned rows to the USA, for example.
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). That's big enough to hold a North American phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries against text in indexed varchar fields. In 2005 they introduced new string summary statistics for indexed fields, which helps significantly with searching within strings.
Using varchar is pretty inefficient. Use the money type, create a user-defined type "phonenumber" from it, and create a rule to only allow positive numbers.
If you declare it as (19,4) you can even store a 4-digit extension, it is big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes on it are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting, and whatever else pops up. The challenge, in my opinion, will be building a good search strategy for the data, not how you store it. It's always a difficult task to deal with a large pile of data you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
It is always better to have separate tables for multi-valued attributes like phone numbers.
Since you have no control over the source data, you can parse the data from the XML file, convert it into a consistent format so that country-specific formats don't cause issues, and store it in a separate table so that both indexing and retrieval are efficient.
Thank you.
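A sketch of that separate, normalized table (all names here are illustrative, and the country code is stored on its own as suggested in an earlier answer):

CREATE TABLE dbo.CustomerPhone
(
    CustomerPhoneId INT IDENTITY(1,1) PRIMARY KEY,
    CustomerId      INT         NOT NULL REFERENCES dbo.Customer (CustomerId),
    CountryCode     VARCHAR(3)  NOT NULL,    -- kept separate for cheap filtering
    PhoneNumber     VARCHAR(15) NOT NULL,    -- digits only, normalised on load
    PhoneType       VARCHAR(10) NULL         -- e.g. 'work', 'mobile', 'fax'
);
CREATE INDEX IX_CustomerPhone_Number ON dbo.CustomerPhone (PhoneNumber);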
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
For example:
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
Use a larger integer data type instead. Don't use int, because it only allows whole numbers between -2,147,483,648 and 2,147,483,647, which is not enough for many 10-digit or international numbers; bigint can store numbers up to 9,223,372,036,854,775,807.
In most cases, bigint will do.
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar
