How does SQL Server treat text comparisons with LEADING spaces - sql-server

Much is made (and is easily found on the internet) about how you do not need to use where rtrim(columnname) = 'value' in SQL Server, because it automatically considers a value with or without trailing spaces to be the same.
However, I've had a hard time finding info about LEADING spaces. What if (for whatever reason) our data warehouse has leading spaces on certain varchar / char fields and we need to filter on them in where clauses: do we still need where ltrim()? I'm trying to avoid this big performance hit by researching other options.
Thank You

Leading spaces are never ignored in comparisons of any text based data type. If you are comparing the equality of text columns, the best option is to validate your values on data entry to make sure that text with unwanted spaces in front is not allowed. For example if your database is expecting a user to type something from a list of possible values that your database application is expecting, do not allow your user interfaces to let users enter the text free-form, force them to enter one of the explicit valid values. If you need the user to be able to enter free-form text but never want leading spaces, then strip them on the insert. Normalizing your database should prevent a lot of these types of issues.
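
The difference between leading and trailing spaces can be modelled outside the database. The following is a minimal Python sketch, not SQL Server itself: it assumes the ANSI rule of padding the shorter string with spaces before an equality comparison, and it ignores collation entirely. The names `sql_equals` and `strip_leading_on_insert` are hypothetical.

```python
def sql_equals(left: str, right: str) -> bool:
    """Model of '=' on character data: pad the shorter operand with spaces, then compare."""
    width = max(len(left), len(right))
    return left.ljust(width) == right.ljust(width)


def strip_leading_on_insert(value: str) -> str:
    """Remove leading spaces before the value is stored (the LTRIM-on-insert idea)."""
    return value.lstrip(" ")


trailing_match = sql_equals("value  ", "value")   # trailing spaces are padded away
leading_match = sql_equals("  value", "value")    # leading spaces stay significant
cleaned_match = sql_equals(strip_leading_on_insert("  value"), "value")
print(trailing_match, leading_match, cleaned_match)
```

Cleaning the value once at insert time is what lets later where clauses compare the column directly instead of wrapping it in ltrim().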

Related

SQL Server part number searching with possible variations on nvarchar

I've been tasked with looking into a part number matching issue. The basics are that there are several different ranges with different patterns for the part numbers; for example, 0-732-012 might be stored in the database but the user might then search using 0732012 or 0.732.012. It could also be that the part numbers themselves have been added wrongly in some cases (so there would be 0732012 as a valid entry as well).
There is also some alphanumeric part numbers mixed in as well so 0732012KG for example.
My question is this then, is there an efficient way of checking numerical codes? Either by stripping or ignoring the non-numerical values when attempting to match?
The data is stored in SQL Server 2008 so it does have the capacity for RegEx should that be required.
This is what I would do.
Strip all dashes, spaces, special characters from both entry and field level (only in the comparison).
This is much faster than using LIKE.
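
The stripping step described in this answer might look like the sketch below. It is an illustration in Python rather than T-SQL, and the function name `normalize_part_number` is hypothetical. It assumes that only letters and digits are meaningful and that letter case should not matter.

```python
import re

# Anything that is not an ASCII letter or digit is treated as a separator.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z]")


def normalize_part_number(raw: str) -> str:
    """Drop dashes, dots, spaces and other punctuation, then upper-case the rest."""
    return _NON_ALNUM.sub("", raw).upper()


def same_part(stored: str, searched: str) -> bool:
    """Compare two part numbers after normalizing both sides."""
    return normalize_part_number(stored) == normalize_part_number(searched)


print(same_part("0-732-012", "0.732.012"))
print(normalize_part_number("0732012kg"))
```

Applying the same function to the stored value and to the search term is what makes the three spellings from the question compare as equal.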

Is it good practice to trim whitespace (leading and trailing) when selecting/inserting/updating table field data?

Presuming that the spaces are not important in a field's data, is it good practice to trim off the spaces when inserting, updating or selecting data from the table ?
I imagine different databases implement handling of spaces differently, so to avoid that headache, I'm thinking I should disallow leading and trailing spaces in any field data.
What do you think?
I think it is a good practice. There are few things more soul crushing than spending an hour, a day, or any amount of time, chasing down a bug that ultimately was caused by a user typing an extra space. That extra space can cause reports to go subtly wrong, or can cause an exception somewhere in your program, and unless you have put brackets around every print statement in your logs and error messages, you might not realize that it is there. Even if you religiously trim spaces before using data you've pulled from the db, do future users of your data a favor and trim before putting it in.
If leading and trailing spaces are unimportant, then I'd trim them off before inserting or updating. There should then be no unnecessary spaces on a select.
This brings some advantages. Less space required in a row means that potentially more rows can exist in a data page which leads to faster data retrieval (less to retrieve). Also, you are not constantly trimming data on SELECTs. (Uses the DRY [don't repeat yourself] principle here)
I would say it's a good practice in most scenarios. If you can confidently say that data is worthless, and the cost of removing it is minimal, then remove it.
I would trim them (unless you are actually using the whitespace data), simply because it is easy to do, and spaces are particularly hard to spot if they do cause problem in your code.
For typical data entry it's not worth the overhead. Is there some reason you think you are going to get lots of extra blank lines? If you are then it might be a good idea to trim to keep DB size down but otherwise no.
Trailing spaces are particularly problematic, specifically with regard to the way SQL Server pads strings before comparing them (the ANSI padding rule; ANSI_NULLS only governs comparisons with NULL).
For instance, colname = '1' can return true where colname like '1' returns false
Thus, given trailing spaces in varchar columns are ambiguous, truncation is most likely preferable, particularly because there is no real information in such data and it creates ambiguity in the behaviour of SQL Server.
For example, look at the discussion at this question:
Why would SqlServer select statement select rows which match and rows which match and have trailing spaces
Handling trailing spaces is good practice. They are a common mistake in databases and lead to long bug hunts.
Either trim them during insert/ update, or add a check clause to your table like this:
ALTER TABLE tblData
WITH CHECK ADD CONSTRAINT CK_Spaces_tblData
CHECK
(
datalength(USERID)>(0)
AND datalength(ltrim(rtrim(USERID)))=datalength(USERID)
)
In this case, users get an error when they try to insert or update.
This has the advantage that users know about the mistake. Very often, they already have trailing spaces in some Excel sheet, and then they copy-paste. So it's good for them to know about this, so they can remove the error in their Excel sheets as well.
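
The same rule can also be checked in application code before the row ever reaches the database. The sketch below mirrors the logic of the CK_Spaces_tblData constraint in Python; the function name `passes_space_check` is hypothetical, and it assumes that only the space character matters, as with ltrim and rtrim.

```python
def passes_space_check(user_id: str) -> bool:
    """Accept only non-empty values with no leading or trailing spaces."""
    return len(user_id) > 0 and user_id.strip(" ") == user_id


for candidate in ["jsmith", "jsmith ", " jsmith", ""]:
    verdict = "ok" if passes_space_check(candidate) else "rejected"
    # repr() makes otherwise invisible spaces visible in the message
    print(f"{candidate!r}: {verdict}")
```

Printing the value with repr() addresses the point made above: the user can see exactly where the stray space is.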

Should I trim strings before storing in the database?

The situation I have run into is this. When storing the Code of an entity (must be unique in database), someone could technically put "12345" and "12345 " as codes and the database would think they're unique, but to the end user, displaying of the space makes it look like they are duplicated and could cause confusion.
In this case, I would definitely trim before storing.
Should this become the standard for all strings?
I would think that unless the space is important to the data, that you should remove it.
This is one of those questions whose answer is "It depends".
What you need to keep in mind here is the principle of least astonishment. A user would be very astonished to see two codes that look identical, especially since the space at the end essentially vanishes when you display it in a form or table. The user is also expecting that these codes are unique and they're probably expecting your system to enforce this. For a user, a space is not really something that they expect to cause a difference.
In some other cases, for example in a Content Management System or word processor for example, when the user is consciously putting in spaces, he expects the underlying data store to persist his spaces. In this case the user is probably putting in spaces to align content or for visual purposes. In this case removing spaces at the end would astonish the user.
So always look to model the user's workflow as far as possible.
Put a constraint on the table that disallows leading spaces if you use varchar. Remember GIGO?
If codes are numeric then use a numeric datatype
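
The "12345" versus "12345 " situation from the question can be detected in existing data before a trimming rule is introduced. This is a small Python sketch under the assumption that the codes have already been read into a list; the name `find_lookalike_codes` is hypothetical.

```python
from collections import defaultdict


def find_lookalike_codes(codes):
    """Group codes by their trimmed form and keep groups with more than one variant."""
    groups = defaultdict(list)
    for code in codes:
        groups[code.strip()].append(code)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


print(find_lookalike_codes(["12345", "12345 ", "67890"]))
```

Any group it reports would collide once values are trimmed, so those rows need to be resolved first.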

What's the best way to store a title in a database to allow sorting without the leading "The", "A"

I run (and am presently completely overhauling) a website that deals with theater (njtheater.com if you're interested).
When I query a list of plays from the database, I'd like "The Merchant of Venice" to sort under the "M"s. Of course, when I display the name of the play, I need the "The" in front.
What is the best way of designing the database to handle this?
(I'm using MS-SQL 2000)
You are on the right track with two columns, but I would suggest storing the entire displayable title in one column, rather than concatenating columns. The other column is used purely for sorting. This gives you complete flexibility over sorting and display, rather than being stuck with a simple prefix.
This is a fairly common approach when searching (which is related to sorting). One column (with an index) is case-folded, de-punctuated, etc. In your case, you'd also apply the grammatical convention of removing leading articles to the values in this field. This column is then used as a comparison key for searching or sorting. The other column is not indexed, and preserves the original key for display.
Store the title in two fields: TITLE-PREFIX and TITLE-TEXT (or some such). Then sort on the second, but display the concatenation of the two, with a space between.
My own solution to the problem was to create three columns in the database.
article varchar(4)
sorttitle varchar(255)
title computed (article + sorttitle)
"article" will only be either "The ", "A " "An " (note trailing space on each) or empty string (not null)
"sorttitle" will be the title with the leading article removed.
This way, I can sort on SORTTITLE and display TITLE. There's little actual processing going on the computed field (so it's fast), and there's only a little work to be done when inserting.
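
The "little work to be done when inserting" could be sketched as follows. This is Python for illustration only; the function name `split_article` is hypothetical, and matching the article case-insensitively is an assumption that the answer does not state.

```python
# Each article keeps its trailing space, as in the answer above.
ARTICLES = ("The ", "An ", "A ")


def split_article(full_title: str):
    """Return (article, sorttitle); article is an empty string when none is present."""
    for article in ARTICLES:
        if full_title.lower().startswith(article.lower()):
            return full_title[:len(article)], full_title[len(article):]
    return "", full_title


titles = ["The Merchant of Venice", "Hamlet", "A Midsummer Night's Dream"]
by_sort_title = sorted(titles, key=lambda t: split_article(t)[1])
print(by_sort_title)
```

Because the article keeps its trailing space, article + sorttitle reproduces the original title exactly, which is what the computed column relies on.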
I agree with doofledorfer, but I would recommend storing spaces entered as part of the prefix instead of assuming it's a single space. It gives your users more flexibility. You may also be able to do some concatenation in your query itself, so you don't have to merge the fields as part of your business logic.
I don't know if this can be done in SQL Server. If you can create function based indexes you could create one that does a regex on the field or that uses your own function. This would take less space than an additional field, would be kept up to date by the database itself, and allows the complete title to be stored together.

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert to a uniform format? There could be millions of records (with duplicates) and I don't want to tie up the server resources (in activities like preprocessing too much) every time some source data comes through.
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10 char field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields, one for the original input and one with all non-numeric data stripped and used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is extension or other data and deal with it appropriately. Of course you could avoid the 2nd column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
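
The two-column idea (original input plus a digits-only search key) can be sketched like this. It is a Python illustration, not the trigger itself, and the names `digits_only`, `to_row`, `phone_raw` and `phone_digits` are hypothetical.

```python
import re


def digits_only(raw_phone: str) -> str:
    """Strip everything except the ASCII digits 0-9."""
    return re.sub(r"[^0-9]", "", raw_phone)


def to_row(raw_phone: str) -> dict:
    """Keep the original text for display and a digits-only key for indexing."""
    return {"phone_raw": raw_phone, "phone_digits": digits_only(raw_phone)}


rows = [to_row(p) for p in ["(212) 555-1212", "212.555.9999", "646-555-0000"]]

# An autocomplete lookup becomes a prefix match on the digits-only key.
typed = digits_only("(212) 555")
matches = [r["phone_raw"] for r in rows if r["phone_digits"].startswith(typed)]
print(matches)
```

A prefix match on the digits-only key is the kind of search an index on that column can serve, which is why the answer suggests indexing the stripped column rather than the raw input.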
We use varchar(15) and certainly index on that field.
The reason being is that International standards can support up to 15 digits
Wikipedia - Telephone Number Formats
If you do support International numbers, I recommend the separate storage of a World Zone Code or Country Code to better filter queries by so that you do not find yourself parsing and checking the length of your phone number fields to limit the returned calls to USA for example
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries for text in indexed varchar fields. For 2005 they introduced new statistics to the string summary for index fields. This helps significantly with full text searching.
Using varchar is pretty inefficient. Use the money type and create a user declared type "phonenumber" out of it, and create a rule to only allow positive numbers.
If you declare it as (19,4) you can even store a 4 digit extension and be big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting and whatever else will pop up. In my opinion, the challenge will be in building a good search strategy for the data, not in how you store it. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
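
The "ext" to "x" normalisation and the resulting 20-character limit might be sketched as below. This is a Python illustration with a hypothetical function name, `normalize_with_extension`. It assumes the extension marker is "ext", "ext." or "x" in any letter case, and that everything else apart from digits can be discarded.

```python
import re

# Extension markers: "ext", "ext." or "x", with optional surrounding whitespace.
_EXT_MARKER = re.compile(r"\s*(?:ext\.?|x)\s*", re.IGNORECASE)
MAX_LENGTH = 20  # 15 digits + "x" + 4-digit extension


def normalize_with_extension(raw: str) -> str:
    """Return digits, optionally followed by 'x' and the extension digits."""
    parts = _EXT_MARKER.split(raw, maxsplit=1)
    number = re.sub(r"[^0-9]", "", parts[0])
    extension = re.sub(r"[^0-9]", "", parts[1]) if len(parts) > 1 else ""
    normalized = number + ("x" + extension if extension else "")
    if len(normalized) > MAX_LENGTH:
        raise ValueError(f"normalized number exceeds {MAX_LENGTH} characters")
    return normalized


print(normalize_with_extension("+44 20 7946 0958 ext. 1234"))
print(normalize_with_extension("212-555-1212 x42"))
```

Raising an error for over-long values is one possible choice; the column's length restriction would reject the same rows anyway.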
It is always better to have separate tables for multi valued attributes like phone number.
As you have no control over the source data, you can parse the data from the XML file and convert it into a proper format, so that there will not be any issue with the formats of a particular country, and store it in a separate table so that both indexing and retrieval will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
For example:
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
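Use a long integer type instead. Don't use a 16-bit integer, because it only allows whole numbers between -32,768 and 32,767, whereas a 32-bit integer allows numbers between -2,147,483,648 and 2,147,483,647. Note that in SQL Server these two ranges correspond to smallint and int, and a 10-digit phone number can exceed both.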
For most cases, it will be done with bigint
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar
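
Two properties of integer storage are worth checking against the sample numbers above. The sketch below uses Python only to do the arithmetic. The limits are those of 32-bit and 64-bit signed integers, which are the ranges of SQL Server's int and bigint; the helper name `fits` is hypothetical.

```python
INT_MAX = 2**31 - 1      # 2,147,483,647, upper bound of a 32-bit signed integer
BIGINT_MAX = 2**63 - 1   # upper bound of a 64-bit signed integer


def fits(number: int, maximum: int) -> bool:
    """True when a non-negative number is within the given upper bound."""
    return 0 <= number <= maximum


sample = 19876543210
print(fits(sample, INT_MAX), fits(sample, BIGINT_MAX))

# Converting to a number drops the leading zero of a value such as 02125551212.
round_trip = str(int("02125551212"))
print(round_trip)
```

The second result is the trade-off to weigh: a numeric column cannot preserve a leading zero, so numbers that start with 0 would need to be stored as text or reconstructed when displayed.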

Resources