SQL Server part number searching with possible variations on nvarchar - sql-server

I've been tasked with looking into a part number matching issue. The basics are that there are several different ranges with different patterns for the part numbers; for example, 0-732-012 might be stored in the database, but the user might search using 0732012 or 0.732.012. It could also be that the part numbers themselves have been entered incorrectly in some cases (so 0732012 could exist as a valid entry as well).
There are also some alphanumeric part numbers mixed in, 0732012KG for example.
My question is this, then: is there an efficient way of matching numerical codes, either by stripping or ignoring the non-numerical characters when attempting to match?
The data is stored in SQL Server 2008, so RegEx (via a CLR function) is an option should that be required.

This is what I would do: strip all dashes, spaces, and special characters from both the search entry and the field (only in the comparison).
That is much faster than using a LIKE.
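A minimal sketch of that approach, assuming a hypothetical dbo.Parts table with a PartNumber column (all names here are placeholders):

DECLARE @Search nvarchar(50) = N'0.732.012';

-- Strip the common separators from both sides of the comparison
SELECT p.PartNumber
FROM   dbo.Parts AS p
WHERE  REPLACE(REPLACE(REPLACE(p.PartNumber, '-', ''), '.', ''), ' ', '')
     = REPLACE(REPLACE(REPLACE(@Search, '-', ''), '.', ''), ' ', '');

-- To keep the comparison index-friendly, the stripped value can be persisted and indexed
-- instead of being recomputed for every row at query time:
ALTER TABLE dbo.Parts
    ADD PartNumberStripped AS REPLACE(REPLACE(REPLACE(PartNumber, '-', ''), '.', ''), ' ', '') PERSISTED;
CREATE INDEX IX_Parts_PartNumberStripped ON dbo.Parts (PartNumberStripped);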

Related

Should NVARCHAR be used to save 'accented characters' into Sql Server?

I have the following two fields in a Sql Server table:
When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?
Basically, I thought:
VARCHAR = ASCII
NVARCHAR = Unicode
So is this a case where façade etc are actually ASCII .. while some other characters would error (if VARCHAR)?
I can see the ç and é characters in the extended ASCII chart (link above) .. so does this mean ASCII includes 0->127 or 0->255?
(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)
Edit
DB collation: Latin1_General_CI_AS
Server Version: 12.0.5223.6
Server Collation: SQL_Latin1_General_CP1_CI_AS
First the details of what Sql Server is doing.
VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and equating rules) to use the other half of the possible values in each byte. These code pages often include support for a limited and specific set of accented characters. If the code page used for your data supports an accent character, you can do it; if it doesn't, you see weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird stuff that way (but don't do this).
NVARCHAR is unicode, but there is still some reliance on collations. In most situations, you will end up with UTF-16, which does allow for the full range of unicode characters. Certain collations will result instead in UCS-2, which is slightly more limited. See the nchar/nvarchar documentation for more information.
As an additional quirk, the upcoming Sql Server 2019 will include support for UTF-8 in char and varchar types when using the correct collation.
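A quick way to see the difference for yourself (a sketch; the collation is the one from the question, and Ω stands in for any character outside its code page):

DECLARE @t TABLE (v varchar(20) COLLATE Latin1_General_CI_AS,
                  n nvarchar(20));

INSERT INTO @t VALUES (N'façade', N'façade'), (N'Ω test', N'Ω test');

-- The varchar column keeps ç and é (they exist in the collation's code page 1252),
-- but it cannot hold Ω, so that value comes back as a fallback character;
-- the nvarchar column keeps both strings intact.
SELECT v, n FROM @t;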
Now to answer the question.
In some rare cases, where you are sure your data only needs to support accent characters originating from a single specific (usually local) culture, and only those specific accent characters, you can get by with the varchar type.
But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.
Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.
VARCHAR is single-byte text interpreted using the current code page - so the set of characters you can save depends on which code page is in use.
NVARCHAR is Unicode, so you can store all the characters.

How does SQL server treat text comparisons with LEADING spaces

Much is made (and easily found on the internet) of the fact that you do not need to use WHERE RTRIM(columnname) = 'value' in SQL Server, because it automatically considers a value with or without trailing spaces to be the same.
However, I've had a hard time finding information about LEADING spaces. What if (for whatever reason) our data warehouse has leading spaces on certain varchar/char fields and we need WHERE clauses on them - do we still need WHERE LTRIM(columnname) = 'value'? I'm trying to avoid that big performance hit by researching other options.
Thank You
Leading spaces are never ignored in comparisons of any text based data type. If you are comparing the equality of text columns, the best option is to validate your values on data entry to make sure that text with unwanted spaces in front is not allowed. For example if your database is expecting a user to type something from a list of possible values that your database application is expecting, do not allow your user interfaces to let users enter the text free-form, force them to enter one of the explicit valid values. If you need the user to be able to enter free-form text but never want leading spaces, then strip them on the insert. Normalizing your database should prevent a lot of these types of issues.
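A quick demonstration of the difference (trailing spaces are padded out by the comparison rules, leading spaces are not):

SELECT CASE WHEN 'value ' = 'value' THEN 'equal' ELSE 'not equal' END AS trailing_space, -- 'equal'
       CASE WHEN ' value' = 'value' THEN 'equal' ELSE 'not equal' END AS leading_space;  -- 'not equal'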

Sql Server String Comparision

Is there any information as to how SQL Server compares strings and handles searching in them (LIKE statements)? I am trying to find out if there is a way to determine how efficient it is to store information as a large string and use SQL Server to do a bunch of comparisons on rows to determine which match. I know this is potentially going to be slow (each string of information would be 2400 characters long), but I need something documenting how the string is compared, so I can show the efficiency (or inefficiency) of it.
each string of information would be 2400 characters long
Exactly 2400? So you've got fixed-width fields in there? Save your time and just split it into separate columns. You'll thank yourself later.
If you must have data, set up a test db and try it both ways. Then at least you'll have data that's specific to your system.
Searching in them will be slow because you won't be able to create an index, since an index key can't be more than 900 bytes wide.
I would do what Joel Coehoorn suggests and split it up into columns.
You might also want to split it across more tables, because you can only store 3 rows per page at 2400 chars per row.
There are full text search indexes that you can apply to sql server, which are often used for things like search engines. The full text indexes typically allow for boolean logic operators for the search.
Just some additional information to what has already been mentioned: if you need to filter the large string with LIKE, indexes are also not used (unless the wildcard % appears only at the end of the search string). So it's best to avoid LIKE and make the part you need to filter on available in its own field.
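Roughly, the difference looks like this (hypothetical table and column names; PartCode is the extracted field suggested above):

SELECT Id FROM dbo.Documents WHERE PartCode LIKE 'ABC%';   -- prefix search: an index on PartCode can be used for a seek
SELECT Id FROM dbo.Documents WHERE PartCode LIKE '%ABC%';  -- leading wildcard: forces a scan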
In the MSDN Article about Full-Text searches the following is called out regarding how the LIKE predicate uses character patterns.
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
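For comparison, the full-text route would look something like this (a sketch; it assumes a full-text index has already been created on a hypothetical Documents.Body column):

SELECT DocumentId
FROM   dbo.Documents
WHERE  CONTAINS(Body, N'"invoice" AND "overdue"');  -- boolean operators are supported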

SQL Server's SoundEx function on non-Latin character sets?

Does SQL Server's (2000) Soundex function work on Asian character sets? I used it in a query and it appears not to have worked properly, but I realize that could be because I don't know how to read Chinese...
Furthermore, are there any other languages that the function might have trouble with? (Russian, for example)
Thank you, Frank
Soundex is fairly specific to English - it may or may not work well for other languages. One example from New Zealand was an attempt at patient name matching using Soundex: unfortunately, Pacific Island names did not work well with it, in many cases hashing to the same small set of values. A different algorithm had to be used.
Your mileage may vary. On more recent versions of SQL Server you could write a CLR function to do some other computation.
By design it works best on English sentences using the ASCII character set. I have used it on a project in Romania where I replaced the Romanian special characters with corresponding ASCII characters that sound more or less the same. It is not perfect but in my case it was a lot better than nothing.
I think you will have no great success with applying SOUNDEX on Asian character sets.
I know that Soundex in older versions of SQL Server ignored any non-English characters. I believe it didn't even handle Latin-1, let alone anything more exotic.
I never dealt with Soundex much in SQL2k; all I know for certain is that it does not handle Arabic correctly. This likely extends to other non-Latin character sets as well.
In any case, a Soundex-based algorithm is unlikely to yield acceptable results for non-English languages, even aside from character set issues. Soundex was specifically designed to handle the English pronunciation of names (mostly those of Western European origin) and does not function particularly well outside of that use. You would often be better off researching one of the several variants of Soundex, or other unrelated phonetic similarity algorithms, designed to address the language(s) in question.
You may use an algorithm like Levenshtein distance. There are various implementations of the algorithm as user-defined functions which you may use within a SELECT statement.
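For reference, the built-in functions mentioned above are used like this (hypothetical Patients table); as discussed, they are of little use outside English-style Latin names:

SELECT LastName,
       SOUNDEX(LastName) AS SoundexCode
FROM   dbo.Patients
WHERE  DIFFERENCE(LastName, 'Smith') >= 3;  -- DIFFERENCE returns 0 (no match) through 4 (strong match)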

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily, as Sales Reps can use this field for searching (including wildcard searches).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert them to a uniform format? There could be millions of records (with duplicates), and I don't want to tie up server resources (in activities like too much preprocessing) every time some source data comes through.
Any suggestions are welcome..
Update: I have no control over the source data, just that the structure of the XML file is standard. I would like to keep the XML parsing to a minimum.
Once it is in the database, retrieval should be quick. One crazy suggestion going around here is that it should even work with an Ajax autocomplete feature (so Sales Reps can see matching numbers immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10-character field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields: one for the original input and one with all non-numeric data stripped out, used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is an extension or other data and deal with it appropriately. Of course you could avoid the second column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
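One way to realize that digits-only second column without a trigger is a persisted computed column (a sketch; table and column names are made up, and the REPLACE chain only covers the most common separators):

ALTER TABLE dbo.Contacts ADD PhoneDigits AS
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Phone, '(', ''), ')', ''), '-', ''), ' ', ''), '.', '')
    PERSISTED;

CREATE INDEX IX_Contacts_PhoneDigits ON dbo.Contacts (PhoneDigits);

-- Prefix searches for the autocomplete scenario can then seek on the stripped column:
SELECT Phone FROM dbo.Contacts WHERE PhoneDigits LIKE '212555%';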
We use varchar(15) and certainly index on that field.
The reason is that international standards allow for up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support international numbers, I recommend storing a World Zone Code or Country Code separately to filter queries by, so that you don't find yourself parsing and checking the length of your phone number fields just to limit the returned rows to the USA, for example.
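A sketch of how the separate country code helps (hypothetical column names):

-- Filtering on the country code avoids parsing or measuring the number itself
SELECT PhoneNumber
FROM   dbo.Contacts
WHERE  CountryCode = '1'              -- USA/Canada
  AND  PhoneNumber LIKE '212%';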
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22), big enough to hold a North American phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries on text in indexed varchar fields. In 2005 they introduced new string summary statistics for index fields, which help significantly with wildcard searching.
Using varchar is pretty inefficient. Use the money type and create a user-defined type "phonenumber" out of it, with a rule to only allow positive numbers.
If you declare it as (19,4), you can even store a 4-digit extension and still be big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes on it are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting, and whatever else pops up. The challenge, in my opinion, will be in building a good search strategy for the data, not in how you store it. It's always a difficult task having to deal with a large pile of data you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
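A hedged sketch of that normalisation on input (hypothetical table and column names; the longer pattern is replaced first so "ext." doesn't leave a stray dot behind):

UPDATE dbo.Contacts
SET    Phone = REPLACE(REPLACE(Phone, 'ext.', 'x'), 'ext', 'x')
WHERE  Phone LIKE '%ext%';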
It is always better to have separate tables for multi-valued attributes like phone numbers.
Since you have no control over the source data, you can parse the data from the XML file and convert it into a proper format (so that there are no issues with a particular country's formats) and store it in a separate table, so that both indexing and retrieval will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing phone numbers as a numeric type for formatting purposes, specifically in the .NET Framework.
For example:
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
Use the bigint data type instead. Don't use int, because it only allows whole numbers between -2,147,483,648 and 2,147,483,647 (too small for many 11-digit numbers), whereas bigint lets you store numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807.
For most cases, it will be done with bigint
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar
