I have a database with a field that holds permit numbers associated with requests. The permit numbers are 13 digits, but a permit may not be issued.
With that said, I currently have the field defined as char(13), allowing NULLs. I have been asked to change it to varchar(13) because a char column, even when NULL, still uses its full declared length.
Is this advisable? Other than space usage, are there any other advantages or disadvantages to this?
I know in an ideal relational system, the permit numbers would be stored in another related table to avoid the use of NULLs, but it is what it is.
Well, if you don't have to use as much space, then you can fit more pages in memory. If you can do that, then your system will run faster. This may seem trivial, but I just recently tweaked the data types on a table at a client, which reduced the amount of reads by 25% and the CPU by about 20%.
As for which is easier to work with, the benefits David Stratton mentioned are noteworthy. I hate having to use trim functions in string building.
If the field should always be exactly 13 characters, then I'd probably leave it as CHAR(13).
Also, an interesting note from BOL:
If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as varchar.
Edit: How frequently would you expect the field to be NULL? If it will be populated 95% of the time, it's hardly worth it to make this change.
The biggest advantage (in general, not necessarily your specific case) I know of is that in code, if you use varchar, you don't have to use a Trim function every time you want it displayed. I run into this a lot when taking FirstName fields and LastName fields and combining them into a FullName. It's just annoying and makes the code less readable.
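A quick illustration of the trimming annoyance (a sketch; the variable names and sizes are made up):

```sql
-- Hypothetical names; CHAR pads 'Jane' to 20 characters.
DECLARE @FirstName char(20) = 'Jane',
        @LastName  char(20) = 'Doe';

-- The padding leaks into the concatenation:
SELECT @FirstName + @LastName AS Padded;

-- With CHAR you end up trimming everywhere:
SELECT RTRIM(@FirstName) + ' ' + RTRIM(@LastName) AS FullName;  -- 'Jane Doe'
```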
If you are using SQL Server 2008, you should look at row compression, and perhaps sparse columns if the column is more than ~60% NULL.
I would keep the datatype char(13) if all of the populated values use the full length.
Row Compression Information:
http://msdn.microsoft.com/en-us/library/cc280449.aspx
Sparse columns:
http://msdn.microsoft.com/en-us/library/cc280604.aspx
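Both options are single ALTER statements (a sketch for SQL Server 2008+; dbo.Requests and PermitNumber are assumed names):

```sql
-- Row compression stores fixed-length columns in a variable-length
-- format, so trailing padding and NULLs stop costing space:
ALTER TABLE dbo.Requests REBUILD WITH (DATA_COMPRESSION = ROW);

-- A sparse column takes no space at all when NULL (worthwhile only
-- when the column is mostly NULL):
ALTER TABLE dbo.Requests ALTER COLUMN PermitNumber ADD SPARSE;
```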
I'm using SQL Server 2012. Is it possible to generate a uniqueidentifier value based on two or three values, mostly varchars or decimals (i.e. any data type that takes 0-9 and a-z)?
Usually a uniqueidentifier varies from system to system. For my requirement, I need a custom one: whenever I call this function, it should give me the same value on all systems.
I have been thinking of converting the values to varbinary, taking certain parts of it, and generating a uniqueidentifier from those. How good is this approach?
I'm still working on it.
Please provide your suggestions.
What you describe is a hash of the values. Use the HASHBYTES function to digest your values into a hash. But your definition of the problem contradicts the requirement for uniqueness since, by definition, reducing an input of size M to a hash of size N, where N < M, may generate collisions. If you truly need uniqueness then redefine the requirements in a manner which would at least allow for uniqueness. Namely, the requirement that "it should get me the same value in all the systems" must be dropped, since the only way to guarantee it is to output exactly the input. If you remove this requirement, then the new requirements are satisfied by NEWID() (yes, it does not consider the input, but it doesn't have to in order to meet your requirements).
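One possible shape of the HASHBYTES approach, if the collision risk is acceptable (a sketch; the values and the '|' separator are made up, and MD5's 16 bytes happen to convert cleanly to a uniqueidentifier):

```sql
DECLARE @a varchar(50) = 'ABC123',
        @b varchar(50) = '42.50';

-- Same inputs always yield the same value, on any system:
SELECT CONVERT(uniqueidentifier,
               HASHBYTES('MD5', @a + '|' + @b)) AS DerivedId;
```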
The standards document for Uniqueidentifier goes to some length showing how they are generated. http://www.ietf.org/rfc/rfc4122.txt
I would give this a read (especially section 4.1.2, as it breaks down how a GUID should be generated). You could keep the timestamp components but hard-code the network-location element, which will give you what you are looking for.
Is delimiting data in a database field something that would be ok to do?
Something like
create table column_names (
    id int identity(1,1) PRIMARY KEY,
    column_name varchar(5000)
);
and then storing data in it as follows
INSERT INTO column_names (column_name) VALUES ('stocknum|name|price');
No, this is bad:
in order to create new queries you have to track down how things are stored.
queries that join on price or name or stocknum are going to be nasty
the database can't assign data types to the data or validate it
you can't create constraints on any of this data now
Basically you're subverting the RDBMS' scheme for handling things and making up your own, so you're limiting how much the RDBMS tools can help you and you've made the system harder to understand for new people.
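To see how nasty even reading the values back gets, here is a sketch (the PARSENAME trick assumes no '.' or extra '|' in the data and at most four tokens):

```sql
-- Merely splitting the delimited column into its three values:
SELECT PARSENAME(REPLACE(column_name, '|', '.'), 3) AS stocknum,
       PARSENAME(REPLACE(column_name, '|', '.'), 2) AS name,
       PARSENAME(REPLACE(column_name, '|', '.'), 1) AS price
FROM column_names;
```

Every query, join, and update has to repeat this parsing, and none of it can use an index.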
The only possible advantage of this kind of system that I can think of is that it can serve as a workaround to avoid dealing with a totally impossible DBA who vetoes all schema changes regardless of merit. Which can happen, unfortunately.
Of course, there's an exception to everything. I'm currently on a project with pretty stringent audit-logging requirements. The logging is done to a database, and we're using delimited fields to store the data because the application is never going to interact with it: it gets written once and left alone.
Almost certainly not.
It violates principles of normalization. The data stored in a particular row of a particular column should be atomic-- you shouldn't be able to parse the data into smaller component parts.
It makes it substantially more difficult to get acceptable performance. Every piece of code that queries this table will need to know how to parse the data which is generally going to mean that more data needs to be read off disk and potentially sent over the network to the client. Every query that has to parse this data is going to have to be more complex which tends to cause grief for the query optimizer. Concatenated data cannot generally be indexed effectively for searches-- you'd have to do something like a full-text index with custom delimiters rather than a nice standard index on a character string. And if you ever have to update one of the delimited values (i.e. because a product name changes), those updates are going to have to scan every row in the table, parse the data, decide whether to actually update the row, and then update a ton of rows.
It makes the application much more brittle. What happens when someone decides to include a | character in the name attribute, for example? Even if you specify an optional enclosure in the spec (i.e. | is allowed if the entire token is enclosed in double quotes), what fraction of the bits of code that actually parse this column are going to implement and test that correctly?
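For contrast, a normalized sketch of the same data (the column names come from the example in the question; the sizes and types are assumptions):

```sql
CREATE TABLE products (
    id       int identity(1,1) PRIMARY KEY,
    stocknum varchar(20)   NOT NULL UNIQUE,
    name     varchar(100)  NOT NULL,
    price    decimal(10,2) NOT NULL CHECK (price >= 0)
);

-- Types, constraints, and indexes all work normally now:
CREATE INDEX ix_products_price ON products (price);
```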
I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (e.g. so performance isn't degraded, memory is not consumed, if this is even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or by just not filling the cell, it's considered empty, so there's nothing left to do? Sorry if it's silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks of using null. The efficiency of any feature depends on the specific database implementation you use. You haven't said whether you use MySQL, Oracle, Microsoft SQL Server, or something else.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row, it just stores the non-null columns. Other databases may do this differently. Likewise nulls in indexes should be handled efficiently, but it varies from product to product.
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relatively inexpensive these days, so worries about space consumption are no longer as prevalent as they once were, except perhaps in gargantuan databases. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable-sized string types (e.g. varchar, nvarchar), that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be non-empty? Should you declare them NOT NULL?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought in this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls and either separate out all the fields that could be null into related tables or create "fake" values to replace null. For varchar this would usually be the empty string, but the problem arises as to what the fake value should be for a date field or a numeric. Then you have to write code to exclude the fake data, just like you have to write code to deal with the nulls.
Personally, I prefer to use nulls, with some judicious moving of data to child tables if the data is truly a different entity. Often these fields turn out to need the one-to-many structure of a parent-child relationship anyway: when you may or may not know a person's phone number, put it in a separate phone table, and you will often discover you needed to store multiple phone numbers anyway.
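A sketch of that parent-child move (all names here are made up):

```sql
CREATE TABLE person (
    person_id int identity(1,1) PRIMARY KEY,
    name      varchar(100) NOT NULL
);

CREATE TABLE person_phone (
    person_id  int         NOT NULL REFERENCES person (person_id),
    phone      varchar(22) NOT NULL,
    phone_type varchar(10) NOT NULL DEFAULT 'home',
    PRIMARY KEY (person_id, phone)
);
-- No nullable phone column on person, and storing several numbers
-- per person comes for free.
```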
When designing a database, what decisions do you consider when deciding how big your nvarchar should be.
If I were to make an address table, my gut reaction would be for address line 1 to be nvarchar(255), like an old Access database.
I have found that using this has got me into bother with the old 'The string would be truncated' error. I know this can be prevented by limiting the input box, but if a user really has an address line 1 that is over 255 characters, it should be allowed.
How big should I make my nvarchar(????)
My recommendation: make them just as big as you REALLY need them.
E.g. for a zip code column, 10-20 chars are definitely enough. Ditto for a phone number. E-mails might be longer, 50-100 chars. Names - well, I usually get by with 50 chars, ditto for first names. You can always easily extend fields if you really need to - that's not a big undertaking at all.
There's really no point in making all varchar/nvarchar fields as big as they can be. After all, a SQL Server page is fixed and limited to 8060 bytes per row. Having 10 fields of NVARCHAR(4000) is just asking for trouble (since if you actually try to fill them all with too much data, SQL Server will barf at you).
If you REALLY need a really big field, use NVARCHAR/VARCHAR(MAX) - those are stored in your page, as long as they fit, and will be sent to "overflow" storage if they get too big.
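The sizing advice above as DDL (a sketch; the sizes follow the suggestions in this answer, and the table and column names are invented):

```sql
CREATE TABLE customer (
    customer_id int identity(1,1) PRIMARY KEY,
    first_name  nvarchar(50)  NOT NULL,
    last_name   nvarchar(50)  NOT NULL,
    email       varchar(100)  NULL,
    zip_code    varchar(20)   NULL,
    phone       varchar(20)   NULL,
    notes       nvarchar(max) NULL  -- sent to overflow storage only if it gets too big
);
```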
NVARCHAR vs. VARCHAR: this really boils down to whether you really need "exotic" characters, such as Japanese, Chinese, or other non-ASCII-style characters. In Europe, even some Eastern European characters cannot be represented by VARCHAR fields anymore (they will be stripped of their háček). Western European languages (English, German, French, etc.) are all very well served by VARCHAR.
BUT: NVARCHAR does use twice as much space - on disk and in your SQL Server memory - at all times. If you really need it, you need it - but do you REALLY ? :-) It's up to you.
Marc
I don't use nvarchar personally :-) I always use varchar.
However, I tend to use 100 for name and 1000 for comments. Trapping and dealing with longer strings is something the client can do, say via regex, so SQL only gets the data it expects.
You can avoid truncation errors by parameterising the calls, for example via stored procs.
If the parameter is defined as varchar(200), say, then truncation happens silently if you send > 200. The truncation error is thrown only for an INSERT or UPDATE statement: with parameters it won't happen.
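A sketch of the silent truncation described above (the variable assignment behaves the same way as a parameter):

```sql
DECLARE @comment varchar(10);
SET @comment = 'This string is much longer than ten characters';
SELECT @comment;  -- 'This strin': no error, silently truncated

-- The same over-length value in an INSERT into a varchar(10) column
-- would instead raise "String or binary data would be truncated".
```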
The 255 "limit" for SQL Server goes back to 6.5, because varchar was limited to 255 characters. SQL Server 7.0+ changed this to 8000 and added support for Unicode.
Edit:
Why I don't use nvarchar: Double memory footprint, double index size, double disk size, simply don't need it. I work for a big Swiss company with offices globally so I'm not being parochial.
Also discussed here: varchar vs nvarchar performance
On further reflection, I'd suggest unicode appeals to client developers but as a developer DBA I focus on performance and efficiency...
It depends on what the field represents. If I'm doing a quick prototype I leave the defaults of 255. For anything like comments etc I'd probably put it to 1000.
The only way I'd make it smaller, really, is on things I definitely know the size of: zip codes or NI numbers, etc.
For columns that you need to have certain constraints on - like names, emails, addresses, etc - you should put a reasonably high max length. For instance, a first name of more than 50 characters seems a bit suspicious, and an input above that size will probably contain more than just a first name. But for the initial design of a database, take that reasonable size and double it. So for first names, set it to 100 (or 200 if 100 is your 'reasonable size'). Then put the app in production, let the users play around for long enough to gather data, and then check the actual max(len(FirstName)). Are there any suspicious values there? Anything above 50 chars? Find out what's in there and see if it's actually a first name or not. If it's not, the input form probably needs better explanations/validations.
Do the same for comments; set them to nvarchar(max) initially. Then come back when your database has grown enough for you to start optimizing performance. Take the max length of the comments, double it, and you have a good max length for your column.
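The follow-up check described above might look like this (table and column names are assumed):

```sql
-- How long are first names actually getting?
SELECT MAX(LEN(FirstName)) AS longest_first_name FROM Person;

-- Inspect the outliers before trusting or shrinking the column:
SELECT FirstName FROM Person WHERE LEN(FirstName) > 50;
```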
I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert them to a uniform format? There could be millions of rows (with duplicates), and I don't want to tie up server resources (in activities like too much preprocessing) every time some source data comes through.
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10-char field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields, one for the original input and one with all non-numeric data stripped, used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is extension or other data and deal with it appropriately. Of course you could avoid the 2nd column by doing something with the index that strips out the extra characters when creating it, but I'd just make a second column, and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
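One way to keep that searchable digits-only copy is a persisted computed column instead of a trigger (a sketch; the names and the set of stripped characters are assumptions):

```sql
CREATE TABLE contact (
    contact_id   int identity(1,1) PRIMARY KEY,
    phone_raw    varchar(50) NOT NULL,  -- as received from the XML
    phone_digits AS (
        REPLACE(REPLACE(REPLACE(REPLACE(
            phone_raw, '(', ''), ')', ''), '-', ''), ' ', '')
    ) PERSISTED
);

-- The computed column is deterministic and persisted, so it can be
-- indexed for the reps' searches:
CREATE INDEX ix_contact_phone_digits ON contact (phone_digits);
```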
We use varchar(15) and certainly index on that field.
The reason is that international standards support up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support International numbers, I recommend the separate storage of a World Zone Code or Country Code to better filter queries by so that you do not find yourself parsing and checking the length of your phone number fields to limit the returned calls to USA for example
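A sketch of that separation (all names are made up):

```sql
CREATE TABLE contact_intl (
    country_code varchar(3)  NOT NULL,  -- e.g. '1' for the USA and Canada
    phone        varchar(15) NOT NULL
);

-- Filtering by country no longer needs to parse the number itself:
SELECT phone FROM contact_intl WHERE country_code = '1';
```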
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
Use nvarchar with preprocessing to standardize the numbers as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries for text in indexed varchar fields. For 2005 they introduced new statistics to the string summary for index fields. This helps significantly with full text searching.
Using varchar is pretty inefficient. Use the money type, create a user-defined type "phonenumber" from it, and create a rule to allow only positive numbers.
If you declare it as (19,4), you can even store a 4-digit extension, it's big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes on it are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting, and whatever else will pop up. The challenge will be in building a good search strategy for the data, not how you store it, in my opinion. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
It is always better to have separate tables for multi-valued attributes like phone numbers.
Since you have no control over the source data, you can parse the data from the XML file, convert it into a proper format so there are no issues with country-specific formats, and store it in a separate table so that indexing and retrieval are both efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
E.g.
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
Use the bigint data type instead. Don't use int, because it only allows whole numbers between -2,147,483,648 and 2,147,483,647, which an 11-digit phone number overflows; bigint supports values up to 9,223,372,036,854,775,807.
For most cases, bigint will do the job.
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar