Database Table design - database

I am having a problem choosing the variable types for database table.
Can someone please give me some general guidelines as to how to choose the types? The following are some of the questions I have --
What should a userid be? An INT seems small, as the design should take a large number of users into account. So if not INT, what else? BIGINT? VARCHAR? Am I not thinking straight?
When should I choose varchar, text, tinytext, or binary?
I searched online and did not find any helpful links. Can someone point me in the right direction? Maybe I need to read a book.

Update for 2013: The internet now has roughly as many users as a signed 32-bit integer can count. Facebook has about a billion users. So if you're planning on being the next Google or Facebook, and you definitely won't have time to re-engineer between now and then, go with something bigger. Otherwise stick with what's easy.
As of 2010:
An int seems small? Are you anticipating having 2,147,483,648 users? (For the record, that's like 1 in 3 people in the world, or 400 million more than the total number of people who use the internet; or 5 and a half times more users than Facebook).
Answer: use an int. Don't overthink it. If you get in a situation where our old friend the 32-bit integer just isn't cutting it for you, you will probably also have a staff of genius college graduates and $5b capital infusion to help solve the problem then.
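For reference, the limits being argued about follow directly from the bit width. A quick sketch in Python (the helper name signed_max is made up for illustration):

```python
def signed_max(bits: int) -> int:
    """Largest value a signed two's-complement integer of this width can hold."""
    return 2 ** (bits - 1) - 1

INT_MAX = signed_max(32)     # what a 4-byte INT can reach
BIGINT_MAX = signed_max(64)  # what an 8-byte BIGINT can reach

print(f"{INT_MAX:,}")        # 2,147,483,647
print(f"{BIGINT_MAX:,}")     # 9,223,372,036,854,775,807
```

So moving from INT to BIGINT costs four extra bytes per key and raises the ceiling by a factor of roughly four billion.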

I would actually make a user ID a character type since I'd rather log in as "pax" than 14860. However, if it has to be numeric (such as if you want the name to be changed easily or you want to minimise storage for foreign keys to the user table), an integer should be fine. Just how many users are you expecting to have? :-)
I would always use varchar for character data since I know that's supported on all SQL databases (or at least the ones that matter). I prefer to stay with the standards as much as possible so moving between databases is easier.

Your username field should probably be a varchar (you pick the length). If you have an id_user on the user table, I agree with the above answers that int (2 billion) will more than adequately cover your needs.
As far as your alphanumeric data types go, it depends on how long you want them to be. For example, I think a varchar(255), or probably even smaller (personally I'd go with 50), would work for a username and password field. This chart might help you.
Binary would be for storing non-character data such as images. bit can be used for boolean values.
A couple other useful types are uniqueidentifier for UUID/GUID and xml for XML. Although you can also use varchar or text types for those too.
There's also nvarchar for Unicode in SQL Server and several date types (I'm glad they came out with date in SQL Server 2008, which, unlike smalldatetime, doesn't include a time).

Related

Request suggestions for defining a primary key in Oracle 11g

All,
I have read several other posts before posing my question to you.
I have a lot of programming/admin experience with other databases: MySql, MSSQL, PostGres, and the one which will not be named. I just don't have much experience with Oracle.
I was tasked with designing a few web applications and supporting database tables. The tables were designed using an ER diagram, and sent to the development group for implementation. When they sent back the proposed table creation statements, I saw two things that seem wrong to me. The primary key is NUMBER(5) and the sequence sets the MAXVALUE to 99999.
I would have expected that the MAXVALUE would be omitted in favor of NOMAXVALUE, and that the primary key column would be a NUMBER(*,0) or a LONG. Since I don't have much experience with Oracle table design, would you please offer up your advice?
Sincerely
Kristofer Hoch
Edit
Thank you for the information on LONG. I'll make sure to use NUMBER, but I'm still unclear on the best way to define it: NUMBER, or NUMBER(*,0), or NUMBER(9), etc.
I agree with you: since the column is to hold a surrogate key generated from a sequence, the only possible purpose of the 5 digit limit would be to restrict the total number of rows ever allowed in the table to under 100,000 - which would seem perverse. It certainly does not confer any performance or space efficiency advantages. Probably it is just the default of their ERD tool's DDL generator.
Do not use LONG: in Oracle that is an obsolete and deprecated way of storing large text strings (for which CLOB is now preferred).
If it was me, I wouldn't specify the size of the number, but that may be more laziness (set it, forget it) than a good practice for design purposes.
But "long" would cause problems. In Oracle, "long" is a character data type that's being deprecated. It's not the same thing as the long data type for numbers in other languages/systems. It's tricky.

How BIG do you make your Nvarchar()

When designing a database, what decisions do you consider when deciding how big your nvarchar should be.
If I were to make an address table, my gut reaction would be for address line 1 to be nvarchar(255), like an old Access database.
I have found that using this has got me in bother with the old 'The string would be truncated' error. I know that this can be prevented by limiting the input box, but if a user really has an address line one that is over 255 characters, this should be allowed.
How big should I make my nvarchar(????)
My recommendation: make them just as big as you REALLY need them.
E.g. for a zip code column, 10-20 chars are definitely enough. Ditto for a phone number. E-Mails might be longer, 50-100 chars. Names - well, I usually get by with 50 chars, ditto for first names. You can always and easily extend fields if you really need to - that's not a big undertaking at all.
There's really no point in making all varchar/nvarchar fields as big as they can be. After all, a SQL Server page has a fixed size of 8 KB, and a row is limited to 8060 bytes. Having 10 fields of NVARCHAR(4000) is just asking for trouble.... (since if you actually try to fill them with too much data, SQL Server will barf at you).
If you REALLY need a really big field, use NVARCHAR/VARCHAR(MAX) - those are stored in your page, as long as they fit, and will be sent to "overflow" storage if they get too big.
NVARCHAR vs. VARCHAR: this really boils down to whether you really need "exotic" characters, such as Japanese, Chinese, or other non-ASCII style characters. In Europe, even some of the eastern European characters cannot be represented by VARCHAR fields anymore (they will be stripped of their háček). Western European languages (English, German, French, etc.) are all very well served by VARCHAR.
BUT: NVARCHAR does use twice as much space - on disk and in your SQL Server memory - at all times. If you really need it, you need it - but do you REALLY ? :-) It's up to you.
Marc
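To make the space and character-set trade-off concrete, here is a rough Python sketch. It is only an analogy: UTF-16LE stands in for how NVARCHAR stores text, and the cp1252 code page stands in for a VARCHAR column under a Western European collation (both are assumptions; the actual code page depends on the collation). Where SQL Server would quietly substitute a look-alike character, Python raises an error, which is reported here as None.

```python
def storage_bytes(text: str):
    """Return (bytes as UTF-16LE, bytes as cp1252 or None if not representable)."""
    wide = len(text.encode("utf-16-le"))        # 2 bytes per character for these samples
    try:
        narrow = len(text.encode("cp1252"))     # 1 byte per character
    except UnicodeEncodeError:
        narrow = None                           # character missing from the code page
    return wide, narrow

for sample in ["Zurich", "Dvořák", "東京"]:
    print(sample, storage_bytes(sample))
```

The Western European sample fits in half the space as single-byte text, while the Czech and Japanese samples cannot be stored faithfully in that code page at all.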
I don't use nvarchar personally :-) I always use varchar.
However, I tend to use 100 for name and 1000 for comments. Trapping and dealing with longer strings is something the client can do, say via regex, so SQL only gets the data it expects.
You can avoid truncation errors by parameterising the calls, for example via stored procs.
If the parameter is defined as varchar(200), say, then truncation happens silently if you send > 200. The truncation error is thrown only for an INSERT or UPDATE statement: with parameters it won't happen.
The 255 "limit" for SQL Server goes back to 6.5, because varchar was limited to 255. SQL Server 7.0+ changed this to 8000 and added support for Unicode.
Edit:
Why I don't use nvarchar: Double memory footprint, double index size, double disk size, simply don't need it. I work for a big Swiss company with offices globally so I'm not being parochial.
Also discussed here: varchar vs nvarchar performance
On further reflection, I'd suggest unicode appeals to client developers but as a developer DBA I focus on performance and efficiency...
It depends on what the field represents. If I'm doing a quick prototype I leave the defaults of 255. For anything like comments etc I'd probably put it to 1000.
The only way I'd make it smaller really is on things I definitely know the size of, zip codes or NI numbers etc.
For columns that you need to have certain constraints on - like names, emails, addresses, etc - you should put a reasonably high max length. For instance a first name of more than 50 characters seems a bit suspicious and an input above that size will probably contain more than just a first name. But for the initial design of a database, take that reasonable size and double it. So for first names, set it to 100 (or 200 if 100 is your 'reasonable size'). Then put the app in production, let the users play around for a sufficiently long time to gather data and then check the actual max(len(FirstName)). Are there any suspicious values there? Anything above 50 chars? Find out what's in there and see if it's actually a first name or not. If it's not, the input form probably needs better explanations/validations.
Do the same for comments: set them to nvarchar(max) initially. Then come back when your database has grown enough for you to start optimizing performance. Take the max length of the comments, double it and you have a good max length for your column.
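The "measure it after a while" step could look roughly like this. It is a sketch using Python's built-in sqlite3 module so it runs anywhere; the users table, first_name column and sample rows are invented, and SQLite spells the function LENGTH() where SQL Server uses LEN().

```python
import sqlite3

# Invented sample data: two plausible names and one value that is clearly not a name.
NAMES = [
    "Ann",
    "Maximilian",
    "Robert, but please ask for Bobby at the front desk when you call the office",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT)")
conn.executemany("INSERT INTO users (first_name) VALUES (?)", [(n,) for n in NAMES])

def length_report(conn, reasonable=50):
    """Return the longest stored length and every value above the 'reasonable' size."""
    max_len = conn.execute("SELECT MAX(LENGTH(first_name)) FROM users").fetchone()[0]
    rows = conn.execute(
        "SELECT first_name FROM users WHERE LENGTH(first_name) > ?", (reasonable,)
    )
    return max_len, [row[0] for row in rows]

max_len, suspicious = length_report(conn)
print(max_len, suspicious)
```

Anything that shows up in the suspicious list is worth reading by hand before deciding on the final column size.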

SQL 2008 data types - Which ones to use?

I'm using SQL Express 2008 edition. I've planned my database tables/relationships on a piece of paper and was ready to begin.
Unfortunately, when I clicked on data type for my pkID_Officer I was offered more than I expected, and therefore doubt has set in.
Can anyone point me to a place where these data types are explained, with examples of which fields work better with which data types?
Examples:
is int for an ID (primary key) still the obvious choice, or does uniqueidentifier take its crown?
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
nChar and vChar?
Any help is always appreciated.
Thanks
Mike.
The MSDN site has a good overview of the SQL 2008 datatypes.
http://msdn.microsoft.com/en-us/library/ms187752.aspx
For the ID field use a guid if it needs to be unique across separate systems or tables or you want to be able to generate the ID outside of the database. Otherwise an int/identity value works just fine.
I store telephone numbers as character data since I won't ever be doing calculations on them. I would think email would be stored the same way.
As for hyperlinks you can basically store the hyperlink by itself as a varchar and render the link on the client or store the markup itself in the database. Really depends on the circumstances.
Use nvarchar if you ever think you'll need to support double byte languages now or in the future.
For primary key I always prefer (as starting point) to use an auto increment int. It makes everything more usable and you don't have any "natural" relationship with the real data. Of course there can be exception to this...
is int for an ID (primary key) still the obvious choice, or does uniqueidentifier take its crown?
I personally prefer INT IDENTITY over GUIDs - especially for your clustered index. GUIDs are random in nature and thus lead to a lot of index fragmentation and therefore poor performance when used as a clustered index on SQL Server. INT doesn't have this trouble, plus it's only 4 bytes vs. 16 bytes, so if you have lots of rows, and lots of non-clustered indexes (the clustered key gets added to each and every entry in each and every non-clustered index), using a GUID will unnecessarily bloat your space requirements (on disk and also in your machine's RAM)
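A back-of-the-envelope version of that space argument, as a Python sketch. The function name and the row and index counts are made up, and the estimate deliberately ignores page overhead, row headers and fill factor, so treat it as an order-of-magnitude figure only.

```python
import struct
import uuid

int_key = struct.pack("<i", 14860)   # a 32-bit integer key
guid_key = uuid.uuid4().bytes        # a random 128-bit GUID

def key_overhead_bytes(rows: int, nonclustered_indexes: int, key_size: int) -> int:
    """Bytes spent on the clustered key alone: once per row in the table,
    plus once per row in every non-clustered index that carries it."""
    return rows * key_size * (1 + nonclustered_indexes)

rows, indexes = 1_000_000, 4         # hypothetical table
extra = key_overhead_bytes(rows, indexes, len(guid_key)) - key_overhead_bytes(
    rows, indexes, len(int_key)
)
print(len(int_key), len(guid_key), extra)
```

With these made-up numbers the GUID key costs about 60 MB more than the int key before any fragmentation is taken into account.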
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
I'd use string fields for all of these.
VARCHAR is fine, as long as you don't need any "foreign" language support, e.g. it's okay for English and Western European languages, but fails on Eastern European and Asian languages (Cyrillic, Chinese etc.).
NVARCHAR will handle all those extra pesky languages at a price - each character is stored in 2 bytes, e.g. a string of 100 chars will use 200 bytes of storage - ALWAYS.
Hope this helps a bit !
Marc
uniqueidentifier are GUIDs:
http://de.wikipedia.org/wiki/Globally_Unique_Identifier
Guids
- are worldwide unique
- have their own type in .NET (System.Guid)
- are 16 bytes (128 bits) long
If you want such things, then use it. If you are fine with int-ids, then it's ok.
Telephone numbers / emails / hyperlinks are normal strings.
NCHAR/NVARCHAR are the Unicode counterparts of the CHAR/VARCHAR datatypes. I almost always use them in my apps - unless I have compelling reasons not to use them.

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert to a uniform format? There could be millions of records (with duplicates) and I don't want to tie up the server resources (in activities like preprocessing too much) every time some source data comes through..
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10 char field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields, one for the original input and one with all non-numeric data stripped and used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is extension or other data and deal with it appropriately. Of course you could avoid the 2nd column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
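The two-column idea (original input plus a digits-only copy for indexing) might be sketched like this in Python. The function and key names are invented for illustration; in the answer's version the stripping would happen in a trigger rather than in application code.

```python
import re

def digits_only(raw: str) -> str:
    """Drop everything except the ASCII digits 0-9."""
    return re.sub(r"[^0-9]", "", raw)

def to_row(raw: str) -> dict:
    """Keep the original input and add the searchable, digits-only form."""
    return {"phone_raw": raw, "phone_digits": digits_only(raw)}

for raw in ["(212) 555-1212", "01.02.03.04.05", "+1 987-654-3210"]:
    print(to_row(raw))
```

A prefix search for autocomplete then only has to compare the digits the rep typed against the indexed phone_digits column.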
We use varchar(15) and certainly index on that field.
The reason is that international standards (E.164) allow up to 15 digits
Wikipedia - Telephone Number Formats
If you do support International numbers, I recommend the separate storage of a World Zone Code or Country Code to better filter queries by so that you do not find yourself parsing and checking the length of your phone number fields to limit the returned calls to USA for example
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries for text in indexed varchar fields. For 2005 they introduced new statistics to the string summary for index fields. This helps significantly with full text searching.
using varchar is pretty inefficient. use the money type and create a user declared type "phonenumber" out of it, and create a rule to only allow positive numbers.
if you declare it as (19,4) you can even store a 4 digit extension and be big enough for international numbers, and only takes 9 bytes of storage. Also, indexes are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting and whatever else will pop up. In my opinion, the challenge will be in building a good search strategy for the data, not in how you store it. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
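A minimal sketch of that normalisation in Python, assuming the extension always comes last and is at most four digits (the regular expression and function name are illustrative, and the code does not enforce the 15-digit limit on the main number):

```python
import re

# Trailing extension written as "x", "ext" or "ext." followed by 1-4 digits.
EXT_PATTERN = re.compile(r"\s*(?:ext\.?|x)\s*(\d{1,4})\s*$", re.IGNORECASE)

def normalise(raw: str) -> str:
    """Return digits only, with any trailing extension rewritten as 'x' + digits."""
    match = EXT_PATTERN.search(raw)
    main = raw[: match.start()] if match else raw
    digits = re.sub(r"[^0-9]", "", main)
    return f"{digits}x{match.group(1)}" if match else digits

print(normalise("(212) 555-1212 ext. 42"))   # 2125551212x42
print(normalise("+44 20 7946 0958 x1234"))   # 442079460958x1234
```

With 15 digits, one "x" and a 4-digit extension, the longest normalised value is 20 characters, which matches the figure above.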
It is always better to have separate tables for multi valued attributes like phone number.
As you have no control on source data so, you can parse the data from XML file and convert it into the proper format so that there will not be any issue with formats of a particular country and store it in a separate table so that indexing and retrieval both will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
IE
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
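Use a long (32-bit) integer type instead. Don't use a 16-bit integer (Integer in Access/VB, smallint in SQL Server), because it only allows whole numbers between -32,768 and 32,767; a 32-bit type (Long in Access/VB, int in SQL Server) holds numbers between -2,147,483,648 and 2,147,483,647. Note that many 10-digit phone numbers exceed even that range.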
For most cases, it will be done with bigint
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar

What's your opinion on using UUIDs as database row identifiers, particularly in web apps?

I've always preferred to use long integers as primary keys in databases, for simplicity and (assumed) speed. But when using a REST or Rails-like URL scheme for object instances, I'd then end up with URLs like this:
http://example.com/user/783
And then the assumption is that there are also users with IDs of 782, 781, ..., 2, and 1. Assuming that the web app in question is secure enough to prevent people entering other numbers to view other users without authorization, a simple sequentially-assigned surrogate key also "leaks" the total number of instances (older than this one), in this case users, which might be privileged information. (For instance, I am user #726 in stackoverflow.)
Would a UUID/GUID be a better solution? Then I could set up URLs like this:
http://example.com/user/035a46e0-6550-11dd-ad8b-0800200c9a66
Not exactly succinct, but there's less implied information about users on display. Sure, it smacks of "security through obscurity" which is no substitute for proper security, but it seems at least a little more secure.
Is that benefit worth the cost and complexity of implementing UUIDs for web-addressable object instances? I think that I'd still want to use integer columns as database PKs just to speed up joins.
There's also the question of in-database representation of UUIDs. I know MySQL stores them as 36-character strings. Postgres seems to have a more efficient internal representation (128 bits?) but I haven't tried it myself. Anyone have any experience with this?
Update: for those who asked about just using the user name in the URL (e.g., http://example.com/user/yukondude), that works fine for object instances with names that are unique, but what about the zillions of web app objects that can really only be identified by number? Orders, transactions, invoices, duplicate image names, stackoverflow questions, ...
I can't speak to the web side of your question. But uuids are great for n-tier applications. PK generation can be decentralized: each client generates its own pk without risk of collision.
And the speed difference is generally small.
Make sure your database supports an efficient storage datatype (16 bytes, 128 bits).
At the very least you can encode the uuid string in base64 and use char(22).
I've used them extensively with Firebird and do recommend.
For what it's worth, I've seen a long running stored procedure (9+ seconds) drop to just a few hundred milliseconds of run time simply by switching from GUID primary keys to integers. That's not to say displaying a GUID is a bad idea, but as others have pointed out, joining on them, and indexing them, by definition, is not going to be anywhere near as fast as with integers.
I can tell you that in SQL Server, if you use a uniqueidentifier (GUID) datatype and use the NEWID() function to create values, you will get horrible fragmentation because of page splits. The reason is that when using NEWID() the value generated is not sequential. SQL 2005 added the NEWSEQUENTIALID() function to remedy that.
One way to still use GUID and int is to have a guid and an int in a table so that the guid maps to the int. the guid is used externally but the int internally in the DB
for example
457180FB-C2EA-48DF-8BEF-458573DA1C10 1
9A70FF3C-B7DA-4593-93AE-4A8945943C8A 2
1 and 2 will be used in joins and the guids in the web app. This table will be pretty narrow and should be pretty fast to query
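As a rough illustration of that mapping table, here is a sketch using Python's sqlite3 module. The table and column names are invented, and SQLite has no GUID type, so the GUID is stored as text here; a real SQL Server table would use uniqueidentifier.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_public_id (
        public_id TEXT NOT NULL UNIQUE,   -- GUID shown in URLs
        user_id   INTEGER PRIMARY KEY     -- small integer used in joins
    )
    """
)

def register_user(conn):
    """Create a new GUID/int pair and return both."""
    public_id = str(uuid.uuid4())
    cursor = conn.execute(
        "INSERT INTO user_public_id (public_id) VALUES (?)", (public_id,)
    )
    return public_id, cursor.lastrowid

def resolve(conn, public_id):
    """Translate the external GUID into the internal integer, or None."""
    row = conn.execute(
        "SELECT user_id FROM user_public_id WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0] if row else None
```

The web layer calls resolve() once per request and everything downstream joins on the integer.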
Why couple your primary key with your URI?
Why not have your URI key be human readable (or unguessable, depending on your needs), and your primary index integer based, that way you get the best of both worlds. A lot of blog software does that, where the exposed id of the entry is identified by a 'slug', and the numeric id is hidden away inside of the system.
The added benefit here is that you now have a really nice URL structure, which is good for SEO. Obviously for a transaction this is not a good thing, but for something like stackoverflow, it is important (see URL up top...). Getting uniqueness isn't that difficult. If you are really concerned, store a hash of the slug inside a table somewhere, and do a lookup before insertion.
edit: Stackoverflow doesn't quite use the system I describe, see Guy's comment below.
Rather than URLs like this:
http://example.com/user/783
Why not have:
http://example.com/user/yukondude
Which is friendlier to humans and doesn't leak that tiny bit of information?
You could use an integer which is related to the row number but is not sequential. For example, you could take the 32 bits of the sequential ID and rearrange them with a fixed scheme (for example, bit 1 becomes bit 6, bit 2 becomes bit 15, etc..).
This is a reversible one-to-one mapping (obfuscation rather than real encryption), so you can be sure that two different IDs will always map to two different values.
It would obviously be easy to decode, if one takes the time to generate enough IDs and get the schema, but, if I understand correctly your problem, you just want to not give away information too easily.
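A sketch of the bit-rearranging idea in Python. The permutation here comes from a seeded shuffle purely to keep the example short (the seed is arbitrary); in a real system you would probably hard-code the table so it can never change between library versions.

```python
import random

# Fixed permutation of the 32 bit positions: bit i of the input moves to PERMUTATION[i].
PERMUTATION = list(range(32))
random.Random(20080808).shuffle(PERMUTATION)   # arbitrary seed, an assumption of this sketch

def scramble(seq_id: int) -> int:
    """Map a sequential 32-bit ID to its public, non-sequential form."""
    out = 0
    for src, dst in enumerate(PERMUTATION):
        if (seq_id >> src) & 1:
            out |= 1 << dst
    return out

def unscramble(public_id: int) -> int:
    """Invert scramble() by moving every bit back to where it came from."""
    out = 0
    for src, dst in enumerate(PERMUTATION):
        if (public_id >> dst) & 1:
            out |= 1 << src
    return out

for n in (781, 782, 783):
    print(n, scramble(n), unscramble(scramble(n)))
```

Because it is a pure permutation of bits, two different inputs can never collide, but as noted above anyone who collects enough IDs can work the scheme out.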
We use GUIDs as primary keys for all our tables as it doubles as the RowGUID for MS SQL Server Replication. Makes it very easy when the client suddenly opens an office in another part of the world...
I don't think a GUID gives you many benefits. Users hate long, incomprehensible URLs.
Create a shorter ID that you can map to the URL, or enforce a unique user name convention (http://example.com/user/brianly). The guys at 37Signals would probably mock you for worrying about something like this when it comes to a web app.
Incidentally you can force your database to start creating integer IDs from a base value.
It also depends on what you care about for your application. For n-tier apps GUIDs/UUIDs are simpler to implement and are easier to port between different databases. To produce integer keys, some databases support a sequence object natively and some require custom construction of a sequence table.
Integer keys probably (I don't have numbers) provide an advantage for query and indexing performance as well as space usage. Direct DB querying is also much easier using numeric keys, less copy/paste as they are easier to remember.
I work with a student management system which uses UUID's in the form of an integer. They have a table which hold the next unique ID.
Although this is probably a good idea from an architectural point of view, it makes working with it on a daily basis difficult. Sometimes there is a need to do bulk inserts, and having a UUID makes this very difficult, usually requiring writing a cursor instead of a simple SELECT INTO statement.
I've tried both in real web apps.
My opinion is that it is preferable to use integers and have short, comprehensible URLs.
As a developer, it feels a little bit awful seeing sequential integers and knowing that some information about total record count is leaking out, but honestly - most people probably don't care, and that information has never really been critical to my businesses.
Having long ugly UUID URLs seems to me like much more of a turn off to normal users.
I think that this is one of those issues that cause quasi-religious debates, and it's almost futile to talk about. I would just say use what you prefer. In 99% of systems it will not matter which type of key you use, so the benefits (stated in the other posts) of using one sort over the other will never be an issue.
I think using a GUID would be the better choice in your situation. It takes up more space but it's more secure.
YouTube uses 11 characters with base64 encoding, which offers 64^11 possibilities, and they are usually pretty manageable to write. I wonder if that would offer better performance than a full-on UUID. A UUID converted to base64 would be double the size, I believe.
More information can be found here: https://www.youtube.com/watch?v=gocwRvLhDf8
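The arithmetic is easy to check: 11 characters drawn from a 64-symbol alphabet give 64^11 combinations. The sketch below also shows one way to produce an 11-character URL-safe ID in Python. This is only an illustration, not how YouTube actually generates its IDs, and it draws 64 random bits, slightly fewer than the 66 bits that 11 characters could carry.

```python
import secrets

ALPHABET_SIZE = 64
ID_LENGTH = 11

possible_ids = ALPHABET_SIZE ** ID_LENGTH   # 64^11, the same number as 2^66
uuid_values = 2 ** 128                      # size of the full UUID space

def short_id() -> str:
    """8 random bytes encode to 11 URL-safe base64 characters (padding stripped)."""
    return secrets.token_urlsafe(8)

print(f"{possible_ids:,}")
print(short_id())
```

A full UUID needs 22 base64 characters, so the short form is half the length at the cost of a much smaller (but still enormous) ID space.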
Pros and Cons of UUID
Note: uuid_v7 is time based uuid instead of random. So you can
use it to order by creation date and solve some performance issues
with db inserts if you do really many of them.
Pros:
can be generated on api level (good for distributed systems)
hides count information about entity
doesn't have limit 2,147,483,647 as 32-bit int
removes a class of errors caused by accidentally passing one entity's id (userId: 25) to fetch another entity (bookId: 25)
more friendly graphql usage as ID key
Cons:
128-bit instead of a 32-bit int (slightly bigger size in db and ~40% bigger index, around ~30MB for 1 million rows), should be a minor concern
can't be sorted by creation (can be solved with uuid_v7)
non-time-ordered UUID versions such as UUIDv4 have poor database index locality (can be solved with uuid_v7)
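To show why a time-based UUID sorts by creation time, here is a hand-rolled sketch of the UUIDv7 layout in Python: a 48-bit millisecond timestamp in the most significant bits, then the version and variant fields, with random bits filling the rest. The function name is invented, and in practice you would use a maintained library or your database's own generator rather than this.

```python
import os
import time
import uuid

def uuid7_sketch(unix_ms=None) -> uuid.UUID:
    """Build a UUIDv7-style value: timestamp first, random bits after."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    value = (unix_ms & ((1 << 48) - 1)) << 80        # 48-bit timestamp on top
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random bits below it
    value = (value & ~(0xF << 76)) | (0x7 << 76)     # version field = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)     # variant field = binary 10
    return uuid.UUID(int=value)

earlier = uuid7_sketch(1_700_000_000_000)
later = uuid7_sketch(1_700_000_000_001)
print(earlier, later, earlier < later)
```

Because the timestamp occupies the leading bits, new rows land at the end of the index instead of at random positions, which is the locality problem the last two cons describe.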
URL usage
Depending on app you may care or not care about url. If you don't care, just use uuid as is, it's fine.
If you care, then you will need to decide on url format.
The best-case scenario is a unique slug, if you are OK with never changing it:
http://example.com/sale/super-duper-phone
If your url is generated from the title and you want to change the slug when the title changes, there are a few options. Use it as is and query by uuid (slug is just decoration):
http://example.com/book/035a46e0-6550-11dd-ad8b-0800200c9a66/new-title
Convert it to base64url:
you can get uuid back from AYEWXcsicACGA6PT7v_h3A
AYEWXcsicACGA6PT7v_h3A - 22 characters
035a46e0-6550-11dd-ad8b-0800200c9a66 - 36 characters
http://example.com/book/AYEWXcsicACGA6PT7v_h3A/new-title
Generate a unique short 11 chars length string just for slug usage:
http://example.com/book/icACEWXcsAY-new-title
http://example.com/book/icACEWXcsAY/new-title
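The base64url option above can be sketched in a few lines of Python using only the standard library (the two function names are made up for illustration):

```python
import base64
import uuid

def uuid_to_slug(value: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe base64 and drop the '==' padding."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")

def slug_to_uuid(slug: str) -> uuid.UUID:
    """Restore the padding and decode back to the original UUID."""
    padded = slug + "=" * (-len(slug) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))

original = uuid.UUID("035a46e0-6550-11dd-ad8b-0800200c9a66")
slug = uuid_to_slug(original)
print(slug, len(slug), slug_to_uuid(slug) == original)
```

The round trip is lossless, so the database can keep storing the plain UUID while URLs carry the 22-character form.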
If you don't want a uuid or short id in the url and want only the slug, but do care about SEO and user bookmarks, you will need to redirect all requests from
http://example.com/sale/phone-1-title
to
http://example.com/sale/phone-1-title-updated
This adds the complexity of managing slug history: every query that looks up by slug needs a fallback to the history, plus a redirect when the slug doesn't match the current one.
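The slug-history lookup could be sketched like this in Python. Plain dictionaries stand in for the two tables, and all names and the return shape are invented for illustration.

```python
# item id -> slug that is currently canonical
current_slug = {1: "phone-1-title-updated"}

# every slug the item has ever had (including the current one) -> item id
slug_history = {"phone-1-title": 1, "phone-1-title-updated": 1}

def resolve_slug(slug: str):
    """Return ('ok', id), ('redirect', url) for an outdated slug, or ('not_found', None)."""
    item_id = slug_history.get(slug)
    if item_id is None:
        return ("not_found", None)
    canonical = current_slug[item_id]
    if slug != canonical:
        return ("redirect", f"/sale/{canonical}")
    return ("ok", item_id)

print(resolve_slug("phone-1-title"))          # old bookmark gets redirected
print(resolve_slug("phone-1-title-updated"))  # current slug resolves directly
```

Every title change then has to write to both structures, which is the extra bookkeeping this approach costs.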
As long as you use a DB system with efficient storage, HDD is cheap these days anyway...
I know GUIDs can be a b*tch to work with sometimes and come with some query overhead; however, from a security perspective they are a savior.
Thinking security by obscurity, they fit well when forming obscure URIs, and when building normalised DBs with table-, record- and column-defined security you can't go wrong with GUIDs. Try doing that with integer-based IDs.

Resources