Using "varchar" as the primary key? bad idea? or ok? - sql-server

Is it really that bad to use "varchar" as the primary key?
(will be storing user documents, and yes it can exceed 2+ billion documents)

It totally depends on the data. There are plenty of perfectly legitimate cases where you might use a VARCHAR primary key, but if there's even the most remote chance that someone might want to update the column in question at some point in the future, don't use it as a key.

If you are going to be joining to other tables, a varchar, particularly a wide varchar, can be slower than an int.
Additionally, if you have many child records and the varchar is something subject to change, cascade updates can cause blocking and delays for all users. A varchar like a car's VIN that will rarely if ever change is fine. A varchar like a name that will change can be a nightmare waiting to happen. PKs should be stable if at all possible.
Next, many possible varchar PKs are not really unique. Some appear to be unique (like phone numbers) but can be reused (you give up the number, the phone company reassigns it), and then child records could be attached to the wrong parent. So be sure you really have a unique, unchanging value before using it.
If you do decide to use a surrogate key, then make a unique index for the varchar field. This gets you the benefits of faster joins and fewer records to update if something changes, but maintains the uniqueness that you want.
Now if you have no child tables and probably never will, most of this is moot and adding an integer PK is just a waste of time and space.
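The surrogate-key-plus-unique-index advice above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module rather than SQL Server, and the vehicle and service_record tables and their columns are made-up names, not anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY,          -- surrogate key, never changes
    vin        VARCHAR(17) NOT NULL UNIQUE   -- natural value, still kept unique
);
CREATE TABLE service_record (
    record_id  INTEGER PRIMARY KEY,
    vehicle_id INTEGER NOT NULL REFERENCES vehicle(vehicle_id),
    note       VARCHAR(100)
);
""")
conn.execute("INSERT INTO vehicle (vehicle_id, vin) VALUES (1, '1HGCM82633A004352')")
conn.executemany(
    "INSERT INTO service_record (vehicle_id, note) VALUES (?, ?)",
    [(1, "oil change"), (1, "new tires")],
)

# Correcting the natural value touches one parent row; the child rows
# reference vehicle_id and do not need a cascading update.
cur = conn.execute(
    "UPDATE vehicle SET vin = '1HGCM82633A004353' WHERE vehicle_id = 1"
)
rows_touched = cur.rowcount

# The unique constraint still rejects a second vehicle with the same VIN.
try:
    conn.execute(
        "INSERT INTO vehicle (vehicle_id, vin) VALUES (2, '1HGCM82633A004353')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

child_count = conn.execute(
    "SELECT COUNT(*) FROM service_record WHERE vehicle_id = 1"
).fetchone()[0]
```

Had the VIN itself been the primary key, the same correction would have had to be propagated to every service_record row.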

I realize I'm a bit late to the party here, but thought it would be helpful to elaborate a bit on previous answers.
It is not always bad to use a VARCHAR() as a primary key, but it almost always is. So far, I have not encountered a time when I couldn't come up with a better fixed size primary key field.
VARCHAR requires more processing than an integer (INT) or a short fixed length char (CHAR) field does.
In addition to storing extra bytes which indicate the "actual" length of the data stored in this field for each record, the database engine must do extra work to calculate the position (in memory) of the starting and ending bytes of the field before each read.
Foreign keys must also use the same data type as the primary key of the referenced parent table, so processing further compounds when joining tables for output.
With a small amount of data, this additional processing is not likely to be noticeable, but as a database grows you will begin to see degradation.
You said you are using a GUID as your key, so you know ahead of time that the column has a fixed length. This is a good time to use a fixed length CHAR(36) field, which incurs far less processing overhead.
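As a quick check on the sizes involved, here is a small sketch using Python's standard uuid module (not SQL Server itself). The text form of a GUID is always 36 characters, which is why a fixed CHAR(36) fits, while the raw value is 16 bytes.

```python
import uuid

guid = uuid.uuid4()

text_form = str(guid)   # e.g. 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
raw_form = guid.bytes   # the underlying 128-bit value

text_length = len(text_form)  # characters needed by a CHAR column
raw_length = len(raw_form)    # bytes needed by a binary/native GUID column
```

So storing the GUID as text costs more than twice the space of its raw 16-byte form, on every row and in every index that carries the key.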

I think int or bigint is often better.
int can be compared with fewer CPU instructions (join queries...)
an int sequence is ordered by default -> balanced index tree -> no reorganisation if you use the PK as the clustered index
the index potentially needs less space

Use an ID (this will come in handy if you want to show only 50 etc...). Then set a UNIQUE constraint on your varchar with the file names (I assume that is what you are storing).
This will do the trick and will increase speed.
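A minimal sketch of that idea, again using sqlite3 as a stand-in. The document table, the file_name column, and the fetch_page helper are hypothetical names. Note that LIMIT is SQLite syntax; SQL Server would use TOP or OFFSET ... FETCH instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE document (
    id        INTEGER PRIMARY KEY,            -- numeric ID used for paging
    file_name VARCHAR(260) NOT NULL UNIQUE    -- file names stay unique
)""")
conn.executemany(
    "INSERT INTO document (file_name) VALUES (?)",
    [(f"report_{n:04d}.pdf",) for n in range(1, 121)],
)

def fetch_page(conn, after_id=0, page_size=50):
    """Return the next page of rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, file_name FROM document WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

first_page = fetch_page(conn)
second_page = fetch_page(conn, after_id=first_page[-1][0])
```

Each page is found by seeking on the integer key, so showing "only 50" never requires comparing or sorting the file names.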

Related

Changing to newsequentialid() on an existing table with 4 million records

My question is almost the same as Changing newid() to newsequentialid() on an existing table
If I change to newsequentialid(), then the index should be more compact, correct?
And what happens if the sequence hits existing IDs while inserting new records? Will the database check that first?
It will have less fragmentation, so you could say it's more compact. But it will still be the same size per key (16 bytes + overhead) for a GUID key. The benefit of using a sequential GUID vs. a nonsequential GUID is that you have less chance of a page split. A page split happens when a record has to be inserted into a logical page but would push it past what the page is allowed to hold, so the page is "split": half to one page and half to another. Sometimes a page split causes another page to split, and theoretically you can have a cascading and costly page split by just inserting one new record. When you use a sequential key, it's less likely that you'll randomly trigger a page split somewhere in the middle of your index, so you reduce the likelihood of those happening. Using a sequential GUID also helps optimize range scans (e.g. selecting between one value and another value), but with a GUID it's very unlikely that you'll end up doing many range-based scans, since the value is basically meaningless.
What happens when the sequence hits an existing ID? You get a PK violation. SQL doesn't ensure that a GUID can only be used once. Sequential IDs start at a new seed every time the server is restarted so, in theory, you could skip back in the sequence and then wind up covering the same value twice. However, as with GUIDs in general, the likelihood of this happening is so astronomically small as to be statistically insignificant.
As with everything, the costs and benefits depend on your specific scenario. If you're looking to replace a GUID key with a sequential key, see if it's possible to use an int or bigint surrogate key instead of a GUID, because generally, all things being equal, an integer will outperform a GUID in every case. 4 million records will trivially fit into an INT data type and even more trivially into a bigint.
Hope this helps.
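The page-split argument can be illustrated with a toy simulation. This is a deliberately simplified model, not SQL Server's actual storage engine: pages hold a fixed number of keys (PAGE_CAPACITY is an arbitrary assumption), a full page receiving a key in its middle is split in half, and a key that lands past the end of the last page simply starts a new page.

```python
import bisect
import random

PAGE_CAPACITY = 8  # keys per page; arbitrary value for this toy model

def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert unique keys into ordered pages; return (mid_page_splits, fill_ratio)."""
    pages = [[]]
    mid_page_splits = 0
    for key in keys:
        # Locate the page whose key range should contain this key.
        first_keys = [page[0] for page in pages if page]
        idx = max(bisect.bisect_right(first_keys, key) - 1, 0)
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the index: start a fresh page.
            pages.append([key])
            continue
        # Full page and the key belongs inside it: split the page in half.
        half = capacity // 2
        new_page = page[half:]
        del page[half:]
        pages.insert(idx + 1, new_page)
        mid_page_splits += 1
        bisect.insort(page if key < new_page[0] else new_page, key)
    fill_ratio = sum(len(p) for p in pages) / (len(pages) * capacity)
    return mid_page_splits, fill_ratio

sequential_keys = list(range(2000))
random_keys = sequential_keys[:]
random.Random(42).shuffle(random_keys)

seq_splits, seq_fill = simulate_inserts(sequential_keys)
rand_splits, rand_fill = simulate_inserts(random_keys)
```

In this model the sequential keys never split a page and leave every page full, while the shuffled keys cause mid-index splits and leave pages partly empty, which is the fragmentation described above.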

Use of specifying lengths for surrogate keys

In one of my database class assignments, I wrote that I specifically didn't assign lengths to my NUMBER columns acting as surrogate keys since it would unnecessarily limit the number of records able to be stored in the table, and because there is literally no difference in performance or physical storage between NUMBER(n) and NUMBER.
My professor wrote back that it would be technically possible but "impractical" for large databases, and that most DBAs in real-life situations would not do that.
There is no difference whatsoever between NUMBER(n) and NUMBER as far as physical storage or performance goes, and thus no reason to specify a length for a NUMBER-based surrogate key column. Why does this professor think that using NUMBER alone would be "impractical"?
In my experience, most production DBAs in real life would likely do as you suggested and declare key columns as NUMBER rather than NUMBER(n).
It would be worthwhile to ask the professor what makes this approach impractical in his or her opinion. There are a couple of possibilities that I can think of:
Assuming that you are using a data modeling tool to design your schema, a reasonable tool will ensure that the data type of a key will be the same in the table where it is defined as a primary key and in the child table where it is a foreign key. If you specify a length for the primary key, forcing the key to generate foreign keys without length limits would be impractical. Of course, the counter to this is that you can just declare both the primary and foreign key columns as NUMBER.
DBAs tend to be extremely picky (and I mean this as a compliment). They like to see everything organized "just so". Adding a length qualifier to a field, whether it be a NUMBER or a VARCHAR2, serves as an implicit constraint that ensures that incorrect data does not get stored. Ideally, when you are designing a table you would know a reasonable upper bound on the number of rows you'll insert over the table's lifetime (i.e. if your PERSON table ended up with more than 10 billion rows, something would likely be seriously wrong). Applying length constraints to numeric columns demonstrates to the DBA that you've done this sort of analysis.
Today, however, that is rather unlikely to actually happen at least with respect to numeric columns both because it is something that is more in keeping with waterfall planning methods that would generally involve that sort of detailed design discussion and because people are less concerned with the growth analysis that would have traditionally been done at the same time. If you were designing a database schema 20 or 30 years ago, it wouldn't be uncommon to provide the DBA with a table-by-table breakdown of the projected size of each table at go-live and over the next few years. Today, it's more cost effective to potentially spend more on disk rather than investing the time to do this analysis up front.
It would probably be better from a readability and self-documentation standpoint to limit what can be stored in the column to numbers that are expected. I would agree that I don't see how it's impractical.
From this thread about number
number(n) is an edit -- restricting the number to n digits in length.
if you store the number '55', it takes the same space in a number(2)
as it does in a number(38).
the storage required = function of the number actually stored.
Left to my own devices I would declare surrogate primary keys as NUMBER(38) on Oracle instead of NUMBER, and possibly add a check constraint to make the key > 0. This is primarily to serve as documentation to outside systems about what they can expect in the column and what they need to be able to handle.
In theory, when building an application that is reading the surrogate primary key, seeing NUMBER means one needs to handle full floating point range of number, whereas NUMBER(38) means the application needs to handle an integer with up to 38 digits.
If I were working in an environment where all the front ends were going to be using a 32 bit integer for surrogate keys I'd define it as a number(10) with appropriate check constraint.
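The arithmetic behind "NUMBER(10) with appropriate check constraint" can be worked through in a few lines of Python. The fits_front_end helper is a hypothetical name standing in for whatever check constraint would be declared.

```python
INT32_MAX = 2**31 - 1                    # largest signed 32-bit integer

digits_needed = len(str(INT32_MAX))      # decimal digits a 32-bit key can use
max_number_10 = 10**digits_needed - 1    # largest value a 10-digit column accepts

def fits_front_end(key):
    """Mimic a check constraint: positive and within signed 32-bit range."""
    return 0 < key <= INT32_MAX
```

A 32-bit key needs 10 digits, but a 10-digit column alone would still accept values above 2,147,483,647, which is why the length qualifier is paired with a check constraint.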

SQL Server performance difference with single or multi column primary key?

Is there any difference in performance (in terms of inserting/updating & querying) a table if the primary key is a single column (e.g., a GUID generated for every row) or multiple columns (e.g., a foreign key GUID + an offset number)?
I would assume querying speeds should be quicker if anything with multi-column primary keys, however I would imagine inserting would be slower due to a slightly more complicated unique check? I also imagine the data types of a multi-column primary key could also matter (e.g., if one of the columns was a DateTime type it would add complexity). These are just my thoughts to invoke answers & discussion (hopefully!) and are not fact based.
I realise there are some other questions covering this topic, but I'm wondering about performance impacts rather than management/business concerns.
You will be affected more by each component of the key being (a) variable length and (b) wide [wide instead of narrow columns] than by the number of components in the key. Unless MS have broken it again in the latest release (they broke Heaps in 2005). Datatype does not slow it down; the width, and particularly variable length (any datatype), does. Note that a fixed-length column is made variable if it is set to Nullable. Variable-length columns in indices are bad news, because a bit of "unpacking" has to be performed on every access to get at the data.
Obviously, keep indexed columns as narrow as possible, using fixed, and not Nullable columns only.
In terms of number of columns in a compound key, sure one column is faster than seven, but not that much: three fat wide variable columns are much slower than seven thin fixed columns.
GUID is of course a very fat key; GUID plus anything else is very, very fat; a Nullable GUID is Guinness material. Unfortunately it is the knee-jerk reaction to solving the IDENTITY problem, which in turn is a consequence of not having chosen good natural relational keys. So you are best advised to fix the real problem at the source, and choose good natural keys; avoid IDENTITY; avoid GUID.
Experience and performance tuning, not conjecture.
It depends on your access patterns, read/write ratio and whether (possibly most importantly) the clustered index is defined on the Primary Key.
Rule of thumb is make your primary key as small as possible (32 bit int) and define the clustered index on a monotonically increasing key (think IDENTITY) where possible, unless you have range searches that form a large proportion of the queries against that table.
If your application is write intensive, and you define the clustered index on the GUID column you should note:
All non-clustered indexes will contain the clustered index key and will therefore be larger. This may have a negative effect on performance if there are many NC indexes.
Unless you are using an 'ordered' GUID (such as a COMB or using NEWSEQUENTIALID()), your inserts will fragment the index over time. This means you need a regular index rebuild and possibly increasing the amount of free space left in pages (fill factor).
Because there are many factors at work (hardware, access patterns, data size), I suggest you run some tests and benchmark your particular circumstances.
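To get a feel for the first point (every non-clustered index carries the clustered key), here is a back-of-the-envelope calculation. The row count and index count are made-up inputs, and the result is only a lower bound that ignores per-row and per-page overhead.

```python
GUID_KEY_BYTES = 16  # size of a GUID key
INT_KEY_BYTES = 4    # size of a 32-bit int key

def extra_index_bytes(row_count, nonclustered_indexes):
    """Extra bytes spent on clustered-key copies when the key is a GUID, not an int."""
    return (GUID_KEY_BYTES - INT_KEY_BYTES) * row_count * nonclustered_indexes

# Hypothetical table: 10 million rows with 5 non-clustered indexes.
example = extra_index_bytes(row_count=10_000_000, nonclustered_indexes=5)
example_megabytes = example / 1_000_000
```

For this hypothetical table the GUID clustering key adds roughly 600 MB of index data that an int key would not, before any fragmentation is taken into account.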
It depends on the indexing and storage in each case. All other things being equal, the choice of primary key is irrelevant as far as performance is concerned. The choice of indexes and other storage options would be the deciding factor.
If your situation is going to be geared towards a higher number of inserts, then the smaller the footprint, the better.
There are two things you need to separate, the concept of the primary key at the database level, and the concept of the key your application uses.
Why do you need a GUID? Are you going to be inserting into multiple database servers, and then combining the information into one centralized database?
If that is the case, then my recommendation is an identity followed by a GUID: a clustered index on the identity, and a unique non-clustered index on the GUID. If you use the GUID as the clustered index, then your data inserts will be all over the place. That is, your data will not be inserted sequentially, and this causes performance problems as your system will be inserting and moving pages around randomly.
Having your data inserted in a nice ordered fashion, thanks to the identity, is the way to go. You can leave the sorting to the index structure (the non-clustered unique index containing the GUID), which is a much more efficient structure to sort than the table data.
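A rough sketch of the "identity followed by a GUID" layout, using sqlite3 as a stand-in. In SQLite the INTEGER PRIMARY KEY is the key of the table's own B-tree, which loosely plays the role of the clustered index here, and the UNIQUE constraint creates a separate index on the GUID. The document table and its columns are hypothetical names.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE document (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- ever-increasing storage key
    guid  CHAR(36) NOT NULL UNIQUE,           -- globally unique, indexed separately
    title VARCHAR(100) NOT NULL
)""")

guids = [str(uuid.uuid4()) for _ in range(3)]
for n, g in enumerate(guids, start=1):
    conn.execute(
        "INSERT INTO document (guid, title) VALUES (?, ?)", (g, f"doc {n}")
    )

# Rows are stored in id order no matter how random the GUIDs are.
ids_in_insert_order = [
    row[0] for row in conn.execute("SELECT id FROM document ORDER BY id")
]

# Applications that only know the GUID still get an indexed lookup.
found = conn.execute(
    "SELECT id, title FROM document WHERE guid = ?", (guids[1],)
).fetchone()
```

The random GUID values only disturb the smaller secondary index, while new rows are always appended at the end of the table itself.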

good database design: enum values: ints or strings?

I have a column in a table that will store an enum value. E.g. Large, Medium, Small or the days of the week. This will correspond to displayed text on a web page or user selection from a droplist. What is the best design?
Store the values as an int and then perhaps have a table that has the enums/int corresponding string in it.
Just store the values in the column as a string, to make queries a little more self-explanatory.
At what point/quantity of values is it best to use ints or strings?
Thanks.
Assuming your RDBMS of choice doesn't have an ENUM type (which handles this for you), I think it's best to use ids instead of strings directly when the values can change (either in value or in quantity).
You might think that days of the week won't change, but what if your application needs to add internationalization support? (or an evil multinational corporation decides to rename them after taking control of the world?)
Also, that Large, Medium and Small categorization is probably changing after a while. Most values you think cannot change, can change after a while.
So, mainly for anticipating change reasons, I think it's best to use ids, you just need to change the translation table and everything works painlessly. For i18n, you can just expand the translation table and pull the proper records automatically.
Most likely (it'll depend on various factors) ints are going to perform better, at the very least in the amount of required storage. But I wouldn't do ints for performance reasons, I'd do ints for flexibility reasons.
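The translation-table idea might look like the following sketch, using sqlite3 as a stand-in. The size_label and product tables, the locale codes, and the product_size helper are illustrative names only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE size_label (
    size_id INTEGER NOT NULL,
    locale  VARCHAR(5) NOT NULL,
    label   VARCHAR(20) NOT NULL,
    PRIMARY KEY (size_id, locale)
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    size_id    INTEGER NOT NULL          -- the stored enum value is just an id
);
INSERT INTO size_label VALUES
    (1, 'en', 'Small'), (2, 'en', 'Medium'), (3, 'en', 'Large'),
    (1, 'fr', 'Petit'), (2, 'fr', 'Moyen'),  (3, 'fr', 'Grand');
INSERT INTO product VALUES (1, 'T-shirt', 2);
""")

def product_size(conn, product_id, locale):
    """Look up the display label for a product's size in the given locale."""
    row = conn.execute(
        """SELECT l.label
           FROM product p
           JOIN size_label l ON l.size_id = p.size_id
           WHERE p.product_id = ? AND l.locale = ?""",
        (product_id, locale),
    ).fetchone()
    return row[0] if row else None

french_label = product_size(conn, 1, "fr")

# Renaming a value later is one row in the translation table;
# the product rows are untouched.
conn.execute(
    "UPDATE size_label SET label = 'Regular' WHERE size_id = 2 AND locale = 'en'"
)
english_label = product_size(conn, 1, "en")
```

Adding a language means adding rows to size_label, not altering the tables that store the ids.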
This is an interesting question. You definitely have to take performance targets into consideration here. If you want to go for speed, int is a must. A database can index integers a bit better than strings, although I must say the performance loss is not at all bad.
One example is the Oracle database itself, where they have the luxury of using upper-case strings as enums in their system tables. Things like USER_ALLOCATION_TYPE are the norm. It's like you say, strings can be more "extensible" and more readable, but in any case in the code you will end up with:
static final String USER_ALLOCATION_TYPE = "USER_ALLOCATION_TYPE";
in place of
static final int USER_ALLOCATION_TYPE = 5;
Because if you don't do this, you will end up with all these string literals that are just aching for someone to go there and misplace a char! :)
In my company we use tables with integers primary keys; all the tables have a serial primary key, because even if you don't think you need one, sooner or later you'll regret that.
In the case you are describing what we do is that we have a table with (PK Int, Description String) and then we do Views over the master tables with joins to get the descriptions, that way we get to see the joined fields descriptions if we must and we keep the performance up.
Also, with a separate description table you can store EXTRA information about those ids that you would never have thought about at first. For example, let's say a user can have access to some entries in the combo box if and only if they have a certain property. You could use extra fields in the description table to store that in place of ad-hoc code.
My two cents.
Going with your first example. Let's say you create a lookup table: Sizes. It has the following columns:
Id - primary key + identity
Name - varchar / nvarchar
You'd have three rows in the table, Small, Medium and Large with values 1, 2, 3 if you inserted them in that order.
If you have another table that uses those values you can use the identity value as the foreign key...or you could create a third column which is a short hand value for the three values. It would have the values S, M & L. You could use that as the foreign key instead. You'd have to create a unique constraint on the column.
As far as the dropdown, you could use either one as the value behind the scenes.
You could also create S/M/L value as the primary key as well.
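The Sizes lookup table and its two foreign-key options can be sketched like this, with sqlite3 standing in for the real database. The Code column and the two shirt tables are hypothetical names added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE sizes (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity-style key
    name VARCHAR(20) NOT NULL,
    code CHAR(1) NOT NULL UNIQUE             -- short-hand value: S, M, L
);
INSERT INTO sizes (name, code) VALUES
    ('Small', 'S'), ('Medium', 'M'), ('Large', 'L');

-- Option 1: reference the identity value.
CREATE TABLE shirt_by_id (
    shirt_id INTEGER PRIMARY KEY,
    size_id  INTEGER NOT NULL REFERENCES sizes(id)
);
-- Option 2: reference the unique short-hand code.
CREATE TABLE shirt_by_code (
    shirt_id  INTEGER PRIMARY KEY,
    size_code CHAR(1) NOT NULL REFERENCES sizes(code)
);
INSERT INTO shirt_by_id VALUES (1, 2);
INSERT INTO shirt_by_code VALUES (1, 'M');
""")

size_rows = conn.execute("SELECT id, name FROM sizes ORDER BY id").fetchall()

# Either foreign key rejects a value that is not in the lookup table.
try:
    conn.execute("INSERT INTO shirt_by_code VALUES (2, 'X')")
    bad_code_rejected = False
except sqlite3.IntegrityError:
    bad_code_rejected = True
```

With the code as the foreign key, the referencing table stays readable without a join, at the cost of having to keep that code stable.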
For your other question about when its best to use the ints vs strings. There is probably a lot of debate on the subject. A lot of people only like using identity values as their primary keys. Other people say that it's better to use a natural key. If you are not using an identity as the primary key then it's just important to make sure you have a good candidate for the primary key (making sure it will always be unique and that the value does not change).
I too would be interested in people's thinking regarding this, I've always gone the route of storing the enum in a look up table and then in any data tables that referenced the enum I would store the ID and using FK relationship. In a certain way, I still like this approach, but there is something plain and simple about putting the string value directly in the table.
Going purely by size, an int is 4 bytes, whereas the string is n bytes (where n is the number of characters). The shortest value in your lookup is 5 characters and the longest is 6, so storing the actual value would use up more space eventually (if that was a problem).
Going by performance, I'm not sure if an index on an int or on a varchar would return any difference in speed / optimisation / index size?

Is it ok to use character values for primary keys?

Is there a performance gain or best practice when it comes to using unique, numeric ID fields in a database table compared to using character-based ones?
For instance, if I had two tables:
athlete
id ... 17, name ... Rickey Henderson, teamid ... 28
team
teamid ... 28, teamname ... Oakland
The athlete table, with thousands of players, would be easier to read if the teamid was, say, "OAK" or "SD" instead of "28" or "31". Let's take for granted the teamid values would remain unique and consistent in character form.
I know you CAN use characters, but is it a bad idea for indexing, filtering, etc for any reason?
Please ignore the normalization argument as these tables are more complicated than the example.
I find primary keys that are meaningless numbers cause less headaches in the long run.
Text is fine, for all the reasons you mentioned.
If the string is only a few characters, then it will be nearly as small as an integer anyway. The biggest potential drawback to using strings is the size: database performance is related to how many disk accesses are needed. Making the index twice as big, for example, could create disk-cache pressure and increase the number of disk seeks.
I'd stay away from using text as your key. What happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly what a primary key should let you avoid. Also, though I don't have any empirical evidence, I'd think the INT key would be significantly faster than the text one.
Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
I'm just going to roll with your example. Doug is correct when he says that text is fine. Even for a medium sized (~50gig) database having a 3 letter code be a primary key won't kill the database. If it makes development easier, reduces joins on the other table and it's a field that users would be typing in...I say go for it. Don't do it if it's just an abbreviation that you show on a page or because it makes the athletes table look pretty. I think the key is the question "Is this a code that the user will type in and not just pick from a list?"
Let me give you an example of when I used a text column for a key. I was making software for processing medical claims. After the claim got all digitized a human had to look at the claim and then pick a code for it that designated what kind of claim it was. There were hundreds of codes...and these guys had them all memorized or crib sheets to help them. They'd been using these same codes for years. Using a 3 letter key let them just fly through the claims processing.
I recommend using ints or bigints for primary keys. Benefits include:
This allows for faster joins.
Having no semantic meaning in your primary key allows you to change the fields with semantic meaning without affecting relationships to other tables.
You can always have another column to hold team_code or something for "OAK" and "SD". Also
The standard answer is to use numbers because they are faster to index; no need to compute a hash or whatever.
If you use a meaningful value as a primary key, you'll have to update it all through your database if the team name changes.
To satisfy the above, but still make the database directly readable,
use a number field as the primary key
immediately create a view Athlete_And_Team that joins the Athlete and Team tables
Then you can use the view when you're going through the data by hand.
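A small sketch of that view, using sqlite3 as a stand-in and the athlete and team rows from the question. The team_code column and the replacement code 'ATH' are hypothetical additions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team (
    teamid    INTEGER PRIMARY KEY,           -- meaningless numeric key
    team_code VARCHAR(3) NOT NULL UNIQUE,    -- readable code such as OAK
    teamname  VARCHAR(50) NOT NULL
);
CREATE TABLE athlete (
    id     INTEGER PRIMARY KEY,
    name   VARCHAR(50) NOT NULL,
    teamid INTEGER NOT NULL REFERENCES team(teamid)
);
CREATE VIEW Athlete_And_Team AS
    SELECT a.id, a.name, t.team_code, t.teamname
    FROM athlete a
    JOIN team t ON t.teamid = a.teamid;

INSERT INTO team VALUES (28, 'OAK', 'Oakland');
INSERT INTO athlete VALUES (17, 'Rickey Henderson', 28);
""")

row_before = conn.execute(
    "SELECT name, team_code, teamname FROM Athlete_And_Team WHERE id = 17"
).fetchone()

# Changing the readable code is a single-row update on team;
# no athlete row has to change.
conn.execute("UPDATE team SET team_code = 'ATH' WHERE teamid = 28")
row_after = conn.execute(
    "SELECT name, team_code, teamname FROM Athlete_And_Team WHERE id = 17"
).fetchone()
```

The view gives the readable output the question asks for while the tables themselves keep joining on the numeric key.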
Are you talking about your primary key or your clustered index? Your clustered index should be on the column you will most often use to uniquely identify a row. It also defines the logical ordering of the rows in your table. The clustered index will almost always be your primary key, but there are circumstances where they can be different.

Resources