Primary Key Theory & I/O Efficiency

According to RDBMS theory, when choosing a primary key we are supposed to choose among the minimal superkeys, effectively optimizing our key choice with respect to the number of columns.
Why are we interested in optimizing against the number of columns rather than the number of bytes? Wouldn't a PK with a smaller byte size result in smaller index tables and queries that are more read/write efficient overall? For example, choosing a PK composed of two VARCHAR(16) columns rather than one VARCHAR(64).

I think I agree with you.
I don't think theory accounts for physical storage.
Yes: if, for instance, you created a column that was a SHA-256 of two small columns, say two VARCHAR(16)s, then the nodes of the B-tree in the index would take up more space, and the index would be no faster than an index on the two 16-byte columns.
There is some efficiency lost in building an index that matches on the first column and then has to switch to comparisons on the second column; B-tree nodes are more efficient when the whole node is comparing on the same column.
Honestly, though, I don't think either amounts to much difference in efficiency. I think the statement comes down to RDBMS theory not accounting for storage size.
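To make the trade-off concrete, here is a hedged sketch of the two index shapes being compared (generic SQL; table and column names are invented). A B-tree entry stores the full key bytes, so total key width, not column count, is what drives node size and fan-out.

    -- Option A: composite key over two short columns (~32 bytes of key data)
    CREATE TABLE account_a (
        region_code VARCHAR(16) NOT NULL,
        local_code  VARCHAR(16) NOT NULL,
        PRIMARY KEY (region_code, local_code)
    );

    -- Option B: a single wider column (~64 bytes of key data); despite having
    -- fewer columns, each B-tree entry is larger than in Option A.
    CREATE TABLE account_b (
        account_code VARCHAR(64) NOT NULL,
        PRIMARY KEY (account_code)
    );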

The identification of minimal rather than non-minimal superkeys is very important when defining keys in a database. If you choose to enforce uniqueness on three columns A, B, C, then that's very different from enforcing uniqueness on just two columns A, B. A uniqueness constraint on A, B, C would not guarantee the uniqueness of A, B, so A, B would no longer be a superkey. On the other hand, if the uniqueness constraint is on A, B, then A, B, C is also a superkey. So it's essential from a data integrity point of view to know what the irreducible set of superkeys is.
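A small sketch of that distinction (generic SQL; names are illustrative): the constraint on (a, b, c) admits rows that a constraint on (a, b) would reject.

    CREATE TABLE t (
        a INT NOT NULL,
        b INT NOT NULL,
        c INT NOT NULL,
        UNIQUE (a, b, c)            -- {a,b,c} unique; (a,b) may still repeat
    );

    INSERT INTO t VALUES (1, 1, 1);
    INSERT INTO t VALUES (1, 1, 2); -- accepted: (a,b) repeats, so {a,b} is not a superkey
    -- With UNIQUE (a, b) instead, the second insert would fail,
    -- and {a,b,c} would still be a (non-minimal) superkey.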
This has nothing to do with primary keys as such because all keys must be minimal, not just the one you choose to call primary. Storage size and performance are something else. Internal storage is an important consideration in the design of indexes but size and performance are non-functional requirements whereas keys are all about logic and functionality.

Related

SQL - define keys to table

Are there any considerations when defining keys for a table that already has a lot of records and where most operations performed on it are inserts?
Key definition ultimately comes down to how you can uniquely and efficiently identify any specific row in a table. If a business key value fulfills that requirement, then it is a suitable candidate. An ideal key is also skinny. A GUID is horrible for this (IMHO) because it is far larger than it needs to be.
If insert performance is the most important priority and a suitable business key is not available, you can use an integer-based identity key. If you expect more than 2.1 billion records within a few years, use bigint (about 9 quintillion possible values) instead.
Keep in mind that every index you make on the table will always include the PK. Having a skinny PK can make your indexes more efficient, using less storage, memory and CPU.
Insert speed is affected by the clustered index sort order as well as the number and sort order of all non-clustered indexes on the table. Column-store indexes are not sorted and have minimal overhead on inserts.
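As an illustration, here is a minimal sketch of that advice in SQL Server syntax (table and column names are hypothetical):

    -- Skinny, monotonically increasing PK: cheap to compare, cheap to carry
    -- in every other index, and friendly to insert-heavy workloads.
    CREATE TABLE EventLog (
        EventId  bigint IDENTITY(1,1) NOT NULL, -- bigint, if > 2.1 billion rows are plausible
        LoggedAt datetime2 NOT NULL,
        Payload  varchar(400) NOT NULL,
        CONSTRAINT PK_EventLog PRIMARY KEY CLUSTERED (EventId)
    );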
A PK that stores something like an ID number from the business domain is heavier than an auto-incrementing number, so when you define the key, keep in mind that it is better to create a separate auto-incrementing column to serve as the PK.

Use of specifying lengths for surrogate keys

In one of my database class assignments, I wrote that I specifically didn't assign lengths to my NUMBER columns acting as surrogate keys since it would unnecessarily limit the number of records able to be stored in the table, and because there is literally no difference in performance or physical storage between NUMBER(n) and NUMBER.
My professor wrote back that it would be technically possible but "impractical" for large databases, and that most DBAs in real-life situations would not do that.
There is no difference whatsoever between NUMBER(n) and NUMBER as far as physical storage or performance goes, and thus no reason to specify a length for a NUMBER-based surrogate key column. Why does this professor think that using NUMBER alone would be "impractical"?
In my experience, most production DBAs in real life would likely do as you suggested and declare key columns as NUMBER rather than NUMBER(n).
It would be worthwhile to ask the professor what makes this approach impractical in his or her opinion. There are a couple of possibilities that I can think of:
Assuming that you are using a data modeling tool to design your schema, a reasonable tool will ensure that the data type of a key is the same in the table where it is defined as a primary key and in the child table where it is a foreign key. If you specify a length for the primary key, forcing the tool to generate foreign keys without length limits would be impractical. Of course, the counter to this is that you can just declare both the primary and foreign key columns as NUMBER.
DBAs tend to be extremely picky (and I mean this as a compliment). They like to see everything organized "just so". Adding a length qualifier to a field, whether it be a NUMBER or a VARCHAR2, serves as an implicit constraint that ensures that incorrect data does not get stored. Ideally, you would know, when you are designing a table, a reasonable upper bound on the number of rows you'll insert over the table's lifetime (e.g. if your PERSON table ended up with more than 10 billion rows, something would likely be seriously wrong). Applying length constraints to numeric columns demonstrates to the DBA that you've done this sort of analysis.
Today, however, that is rather unlikely to actually happen, at least with respect to numeric columns, both because it is more in keeping with waterfall planning methods that would generally involve that sort of detailed design discussion, and because people are less concerned with the growth analysis that would traditionally have been done at the same time. If you were designing a database schema 20 or 30 years ago, it wouldn't be uncommon to provide the DBA with a table-by-table breakdown of the projected size of each table at go-live and over the next few years. Today, it's more cost-effective to potentially spend more on disk than to invest the time to do this analysis up front.
It would probably be better, from a readability and self-documentation standpoint, to limit what can be stored in the column to the numbers that are expected. That said, I agree that I don't see how it's impractical.
From this thread about NUMBER:
number(n) is an edit -- restricting the number to n digits in length. if you store the number '55', it takes the same space in a number(2) as it does in a number(38). the storage required = function of the number actually stored.
Left to my own devices, I would declare surrogate primary keys as NUMBER(38) on Oracle instead of NUMBER, possibly with a check constraint to make the key > 0, primarily to serve as documentation to outside systems about what they can expect in the column and what they need to be able to handle.
In theory, when building an application that reads the surrogate primary key, seeing NUMBER means one needs to handle the full range of NUMBER values, fractional values included, whereas NUMBER(38) means the application needs to handle an integer with up to 38 digits.
If I were working in an environment where all the front ends were going to use a 32-bit integer for surrogate keys, I'd define it as NUMBER(10) with an appropriate check constraint.
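A sketch of those declarations in Oracle syntax (the table names are invented):

    -- Full-width integer surrogate key, documented as positive.
    CREATE TABLE person (
        person_id NUMBER(38) NOT NULL,
        CONSTRAINT person_pk PRIMARY KEY (person_id),
        CONSTRAINT person_id_positive CHECK (person_id > 0)
    );

    -- If every front end uses 32-bit integers: NUMBER(10) alone allows up to
    -- 9,999,999,999, so the check constraint pins down the real contract.
    CREATE TABLE account (
        account_id NUMBER(10) NOT NULL,
        CONSTRAINT account_pk PRIMARY KEY (account_id),
        CONSTRAINT account_id_int32 CHECK (account_id BETWEEN 1 AND 2147483647)
    );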

How to enforce uniqueness for big (BLOB) database field

I'm designing a database (SQLite, SQL Server and DB2), where a table holds a 32kB blob that must be unique. The table will typically hold about 20,000 rows.
I can think of two solutions:
1 - Make the blob a unique index.
2 - Calculate a hash index of the blob, use that as a non unique index, and write code that enforces the blob's uniqueness.
Solution 1 is safer, but is the storage space overhead and performance penalty bad enough to make solution 2 a better choice?
I would go with #2, partly as a space-saving measure, but more because some DBMSs don't allow indexes on LOBs (Oracle comes to mind, but that may be an old restriction).
I would probably create two columns for hash values, MD5 and SHA-1 (both commonly supported in client languages), then add a unique composite index that covers those two columns. The likelihood of a collision on both hashes is infinitesimally small, particularly given your anticipated table size. However, you should still have a recovery strategy (which could be as simple as setting one of the values to 0).
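A sketch of solution 2 in SQL Server syntax (names are illustrative; HASHBYTES accepts varbinary(max) input from SQL Server 2016 onward, and in SQLite or DB2 the hashes would typically be computed client-side):

    CREATE TABLE Document (
        DocumentId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        Payload     varbinary(max) NOT NULL,  -- the ~32 kB blob
        PayloadMd5  binary(16) NOT NULL,      -- MD5 digest of Payload
        PayloadSha1 binary(20) NOT NULL       -- SHA-1 digest of Payload
    );

    -- Enforce uniqueness on the pair of digests instead of on the blob itself.
    CREATE UNIQUE INDEX UX_Document_Hashes ON Document (PayloadMd5, PayloadSha1);

    -- Populated at insert time, e.g.:
    -- INSERT INTO Document (Payload, PayloadMd5, PayloadSha1)
    -- VALUES (@blob, HASHBYTES('MD5', @blob), HASHBYTES('SHA1', @blob));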

SQL Server performance difference with single or multi column primary key?

Is there any difference in performance (in terms of inserting/updating & querying) a table if the primary key is a single column (e.g., a GUID generated for every row) or multiple columns (e.g., a foreign key GUID + an offset number)?
I would assume querying speeds should be quicker, if anything, with multi-column primary keys; however, I would imagine inserting would be slower due to a slightly more complicated uniqueness check. I also imagine the data types of a multi-column primary key could matter (e.g., if one of the columns were a DateTime it would add complexity). These are just my thoughts to invite answers & discussion (hopefully!) and are not fact-based.
I realise there are some other questions covering this topic, but I'm wondering about performance impacts rather than management/business concerns.
You will be affected more by (each) component of the key being (a) variable-length and (b) wide [wide instead of narrow columns] than by the number of components in the key. Unless MS have broken it again in the latest release (they broke heaps in 2005). Datatype does not slow it down; the width, and particularly variable length (of any datatype), does. Note that a fixed-length column is made variable if it is set to Nullable. Variable-length columns in indices are bad news, because a bit of "unpacking" has to be performed on every access to get at the data.
Obviously, keep indexed columns as narrow as possible, using only fixed-length, non-Nullable columns.
In terms of the number of columns in a compound key, sure, one column is faster than seven, but not that much: three fat, wide, variable-length columns are much slower than seven thin, fixed-length columns.
GUID is of course a very fat key; GUID plus anything else is very, very fat; a Nullable GUID is Guinness material. Unfortunately it is the knee-jerk reaction to solving the IDENTITY problem, which in turn is a consequence of not having chosen good natural relational keys. So you are best advised to fix the real problem at the source and choose good natural keys; avoid IDENTITY; avoid GUIDs.
Experience and performance tuning, not conjecture.
It depends on your access patterns, read/write ratio and whether (possibly most importantly) the clustered index is defined on the Primary Key.
The rule of thumb is to make your primary key as small as possible (a 32-bit int) and to define the clustered index on a monotonically increasing key (think IDENTITY) where possible, unless you have range searches that form a large proportion of the queries against that table.
If your application is write intensive, and you define the clustered index on the GUID column you should note:
- All non-clustered indexes will contain the clustered index key and will therefore be larger. This may have a negative effect on performance if there are many NC indexes.
- Unless you are using an 'ordered' GUID (such as a COMB, or one generated with NEWSEQUENTIALID()), your inserts will fragment the index over time. This means you need regular index rebuilds and possibly an increase in the amount of free space left in pages (fill factor).
Because there are many factors at work (hardware, access patterns, data size), I suggest you run some tests and benchmark your particular circumstances.
It depends on the indexing and storage in each case. All other things being equal, the choice of primary key is irrelevant as far as performance is concerned. The choice of indexes and other storage options would be the deciding factor.
If your situation is going to be geared towards a higher number of inserts, then the smaller the footprint, the better.
There are two things you need to separate: the concept of the primary key at the database level, and the concept of the key your application uses.
Why do you need a GUID? Are you going to be inserting into multiple database servers and then combining the information into one centralized database?
If that is the case, then my recommendation is an identity followed by a GUID: a clustered index on the identity, and a unique non-clustered index on the GUID. If you use the GUID as a clustered index, then your data inserts will be all over the place. Meaning your data will not be inserted sequentially, and this causes performance problems as your system inserts and moves pages around randomly.
Having your data inserted nicely, in an ordered fashion, thanks to the identity, is the way to go. You can leave the sorting to the index structure (the non-clustered unique index containing the GUID), which is a much more efficient structure to sort than the table data.
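A minimal sketch of that arrangement in SQL Server syntax (table and column names are invented):

    CREATE TABLE Customer (
        CustomerId   int IDENTITY(1,1) NOT NULL,
        CustomerGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
        Name         nvarchar(100) NOT NULL,
        -- Clustered on the identity: inserts land sequentially at the end.
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
    );

    -- The GUID remains available for cross-server merging, but only as a
    -- non-clustered unique index, so it does not scatter the base data.
    CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_Guid ON Customer (CustomerGuid);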

Primary key question

Is there a benefit to having a single column primary key vs a composite primary key?
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on its own?
Database Normalization nuts will tell you one thing.
I'm just going to offer my own opinion of what I've learned over the years: I stick an auto-incrementing ID field on every ($&(##$)# one of my tables. It makes life a million times easier in the long run to be able to single out a single row with impunity.
This is from a "down in the trenches" developer.
Single column keys are simple to write, simple to maintain, and simple to understand.
If you're going to have a huge number of rows - billions? - maybe saving a byte here and there will help.
But if you're not looking at extreme cases, optimizing for "simple" is often the best way to go.
If you are a coder and the database is nothing to you but a glorified object-store, then sure, by all means inject surrogate keys willy-nilly. In fact, go one better and just delegate all DB schema design and DB interaction to your favourite ORM and be done with it. Indeed, when I want a small or medium-scale object-store, that's exactly what I do.
If you are approaching an information systems or information management problem, then it is a completely different story. When you start dealing with tens (or more likely hundreds) of millions of dirty records integrated from multiple sources, several or all of which are not under your control, the seductive lure of an easy answer to the problems of 'identity' is a trap.
Yes, you sometimes still introduce a surrogate key internally to allow for concise FK relationships and improved cache efficiency on covering indices; but you gain those benefits at the cost of substantial pain in managing the natural-key/surrogate-key relationship.
In this case it will be important to make sure you don't allow the surrogate key to leak. Your public APIs at the business-logic layer should use the natural key; nothing above a document/record-cache should be aware of the existence of a surrogate key. Be aware that the cost of matching updates against the existing surrogate keys can be prohibitive, and a far larger scalability hit than the incremental cost of moving a few extra bytes per request over the internal network.
So in conclusion:
If the DB is just being used as an object-store: let the ORM worry about object identity, and there should almost certainly be a surrogate key.
If the DB is being used as a database: the introduction of a surrogate key is an engineering design decision with serious tradeoffs in both directions. The decision will need to be made on a case by case basis, with full recognition of the resulting costs to be accepted in exchange for the benefits gained either way.
Update
The 'convenience' of a surrogate key is really just the ability to punt on the question of identity. This is often necessary in a database, and reasonable in the caching layer as I allow, but beyond that it leads to brittle data designs. The problem is that identity is not something that has one correct answer. For non-trivial data-intensive systems you will routinely find yourself needing to work in terms of equivalence classes, rather than the reference identity that object-oriented programming lulls us into thinking is normal.
What it really comes down to is a realization that the whole concept of a 'primary key' is a fiction invented to help the relational model work efficiently; adopting a surrogate key cements that fiction and makes the whole system brittle and inflexible. Business logic needs to be able to provide its own definitions of equality: sometimes four copies of the same file need to be considered four files, and sometimes they should be considered indistinguishable from the original file. When you edit one of them, is that then a new file? The same file? The answer to both questions is, of course, yes, when... Working with natural keys provides this critical ability to work in terms of conceptual equivalence classes. If you let surrogate keys infect your business logic, you quickly lose this.
I have had to use multi-column primary keys in the past, and it became quite a nightmare very quickly.
If you have one table that references your first table, how does it contain that primary key? Now add another table that references only the second table but needs to find data in the first. Now another... on down the rabbit hole.
If you know that you will only have the one table, there's probably not an issue either way: use whichever represents your data better. But if you'll be using it in joins, you can lose performance pretty quickly.
Is there a benefit to having a single column primary key vs a composit[sic] primary key?
Yes. If the primary key also happens to be the clustered index, it is common for the clustered index key to be duplicated in full in each secondary index on the table. Therefore, having a fatter clustered index, which is what one gets with a composite key, implies an increase in storage cost. Also, foreign references to this table need to specify both fields to refer to a unique entry, which implies a further storage cost. There is also an arguably greater cost in development time because there is a slight increase in the complexity of the join.
On the other hand, depending on the distribution of the values of your two key fields, concurrent access to your table may be greatly improved, because chronologically successive inserts could occur on different physical pages; this could be the case, for example, if your fields are time-independent (and non-monotonic, unlike an auto-incrementer), like a clientID. This could be significant for performance in a high-concurrency environment.
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on its own?
If the most common way in which your table is queried is to specify those three fields as restrictions, then having all three in a composite key would likely be the fastest lookup.
And there is another important point that I almost forgot. Since having a composite key means that foreign references to this table from other tables must specify all fields in the key, it also means that some queries on those other tables that require a restriction on one or more parts of this table's composite key can be performed without a join. This could be considered similar to denormalization for the sake of performance (arguably sacrificing a little ease of maintainability).
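A sketch of that effect (generic SQL; table and column names are invented):

    CREATE TABLE enrollment (
        student_id INT NOT NULL,
        course_id  INT NOT NULL,
        PRIMARY KEY (student_id, course_id)
    );

    CREATE TABLE grade (
        student_id INT NOT NULL,
        course_id  INT NOT NULL,
        score      INT NOT NULL,
        FOREIGN KEY (student_id, course_id)
            REFERENCES enrollment (student_id, course_id)
    );

    -- Both parts of the parent key already live in the child row, so this
    -- query needs no join back to enrollment; with a surrogate
    -- enrollment_id as the FK, it would.
    SELECT score FROM grade WHERE course_id = 101;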
In general I prefer to have a surrogate key, because there are very few truly good natural keys (the key problem is not uniqueness but that they change over time), and the longer the natural key, the more it affects performance when used as a PK. If you have a natural key, you should create a unique index on it and then use the surrogate key as the PK for joining to other tables. That enforces the uniqueness of the natural key data but fixes the problems of join performance and the extra time to update all child records when the natural key changes.
There is one case where I ignore this, and that is a joining table. If it is a table that is used to enforce a many-to-many relationship and consists only of two surrogate keys from other tables, then you really gain nothing from adding a surrogate key. Typically the individual keys are used for joins, not the PK, and surrogate keys almost never change. In a joining table, I just add the two columns I need and nothing else.
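A sketch of such a joining table (generic SQL; it assumes author and book tables with integer id primary keys):

    -- Pure many-to-many link: the two FK columns themselves form the PK;
    -- a separate surrogate column would add nothing here.
    CREATE TABLE author_book (
        author_id INT NOT NULL REFERENCES author (id),
        book_id   INT NOT NULL REFERENCES book (id),
        PRIMARY KEY (author_id, book_id)
    );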
In most databases I know (MySQL, PostgreSQL), the composite key will generate an index, so if you specify your key as composite, the DB should provide an efficient way to look up tuples using that key. I think this is the case for all DBs, so you should not have to worry about performance there.
Don't use multi-column keys. They get very difficult to maintain, especially if the components of the key are not human-understandable.
Use an internally generated key instead.
Imagine you have a composite primary key (field1 and field2, for example) instead of just one auto-incrementing identifier. Clients' requirements are very changeable, and if after some development the client says that field2 is not compulsory and can be nullable, then field2 can no longer remain part of the primary key of the table. Imagine this table is one of the most important in your model: all the foreign keys would then have to be changed if field2 cannot stay in the composite primary key. It's a nightmare changing the primary key all over the model.
Also, if there are a lot of foreign keys, I think it is not a very good idea to add several key columns to each table just to make the link.
I'm not sure there's enough information for us to make your call for you. Here are a few observations that might be helpful though.
Is the primary key a clustered index? Is the table referenced by other tables through a foreign key? If yes, then you may benefit from a single-column key, because that key will appear in those other tables. This is how you would save space.
If the table is not referenced by other tables, then you would be using extra space in your table without much additional benefit. And, if this table only contains the two columns now, then you would increase the table size by 50%.
If you use an extra column for the primary key, do not forget your natural key (the two-column key): create a unique constraint on the composite key. You still want to maintain the integrity of the real data.
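A sketch combining the surrogate key with the preserved natural key (generic SQL; names are invented):

    CREATE TABLE line_item (
        line_item_id INT NOT NULL PRIMARY KEY, -- surrogate key, referenced by child tables
        order_id     INT NOT NULL,
        line_no      INT NOT NULL,
        UNIQUE (order_id, line_no)             -- the natural two-column key stays enforced
    );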
The decision should always be based on requirements and the intended meaning of the data. A table with only a single-attribute key clearly enforces a different kind of constraint and implies that your table has a very different meaning from the same table with a multi-attribute key. On the other hand, adding an additional unique column would be a waste of resources and would add meaningless complexity if you don't actually need to use it anywhere.
One caveat to the auto-incrementing column is that it can give a false impression of uniqueness. Sure, your identity column is always unique, but that's just a meaningless value you've attached to the table. Unless you also have a unique constraint attached to the set of columns that represent the actual semantic primary key of the table, you have no guarantee of meaningful uniqueness.
