Redundant DB column for indexing - database

I'm defining a few database tables, roughly looking like this:
In order to quickly run a query where a Person's MailMessages are retrieved in time order, regardless of what MailAccount they were sent to, I want an index for the MailMessage table, sorted by (PersonId, ReceivedTime). That means adding a redundant PersonId column to the MailMessage table, like this:
...or does it? Is there any neater way of doing this? If not, is the best practice to make PersonId a foreign key in the MailMessage table, or should this not be done, as it's conceptually not a foreign key but rather just a column used for the (PersonId, ReceivedTime) index?

Yes you could do that, but it would require having a key in table MailAccount on {MailAccountId, PersonId}, so it can be referenced by the FK in table MailMessage. From the perspective of enforcing uniqueness, this is redundant, since {MailAccountId} alone is already unique.
There is an alternative: use identifying relationships and natural keys. For example:
This achieves essentially the same goal, but with just one key (and the underlying index) per table.
Note the order of PK fields in the bottom table: it allows a query...
SELECT *
FROM MailMessage
WHERE PersonId = ?
ORDER BY ReceivedTime
...to be satisfied by an index range scan on the primary index. And if the table happens to be clustered, the DBMS won't even have to access the table heap after that (there is no table heap at all - rows are stored directly in the B-Tree).
Avoidance of JOINs without resorting to redundant keys (which is also good for clustering) is one of the pros of natural keys versus surrogate keys. As you can imagine, the list of pros and cons does not end there.

What you are doing is called denormalization. A full discussion of the pros and cons of this concept are a bit much for SO.
This type of optimization is also possible using a Materialized View (called an Indexed View in SQL Server).

Related

In a junction table, should I use a Primary key and a unique constraint, or a compound/composite key?

I have read through handfuls of what would seem to make this a duplicate question. But reading through all of these has left me uncertain. I'm hoping to get an answer based on the absolute example below, as many questions/answers trail off into debates back and forth.
If I have:
dbo.Book
--------
BookID PK int identity(1,1)
dbo.Author
----------
AuthorID PK int identity(1,1)
Now I have two choices for a simple junction table:
dbo.BookAuthor
--------------
BookID CPK and FK
AuthorID CPK and FK
The above would be a compound/composite key on both FKs, as well as set up the FK relationships for both columns - also using Cascade on delete.
OR
dbo.BookAuthor
--------------
RecordID PK int identity(1,1)
BookID FK
AuthorID FK
Foreign key relationships on BookID and AuthorID, along with Cascade on delete. Also set up a unique constraint on BookID and AuthorID.
I'm looking for a simple answer as to why one method is better than another in the ABOVE particular example. The answers that I'm reading are very detailed, and I was just about to settle on a compound key, but then watched a video where the example used an Identity column like my first example.
It seems this topic is slightly torn in half, but my gut is telling me that I should just use a composite key.
What's more efficient for querying? It seems having a PK identity column along with setting up a unique constraint on the two columns, AND the FK relationships would be more costly, even if a little.
This is something I've always remembered from my database course way back in college. We were covering the section from the textbook on "Entity Design" and it was talking about junction tables... we called them intersect tables or intersection relations. I was actually paying attention in class that day. The professor said, in his experience, a many-to-many junction table almost always indicates an unidentified missing entity. These entities almost always end up with data of their own.
We were given an example of Student and Course entities. For a student to take a course, you need to junction between those two. What you actually have as a result is a new entity: an Enrollment. The additional data in this case would be things like Credit Type (audit vs regular) or Final Grade.
I remember that advice to this day... but I don't always follow it. What I will do in this situation is stop, and make sure to go back to the stakeholders on the issue and work with them on what data points we might still be missing in this junction. If we really can't find anything, then I'll use the compound key. When we do find data, we think of a better name and it gets a surrogate key.
Update in 2020
I still have the textbook, and by amazing coincidence both it and this question were brought to my attention within a few hours of each other. So for the curious, it was Chapter 5, section 6, of the 7th edition of this book:
https://www.amazon.com/Database-Processing-Fundamentals-Design-Implementation-dp-9332549958/dp/9332549958/
As a staunch proponent of, and proselytizer for, the benefits of surrogate keys, I none-the-less make an exception for all-key join tables such as your first example. One of the benefits of surrogate keys is that engines are generally optimized for joining on single integer fields, as the default and most common circumstance.
Your first proposal still obtains this benefit, but also has a 50% greater fan-put on each index level, reducing both the overall size and height of the indices on the join table. Although the performance benefits of this are likely negligible for anything smaller than a massive table it is best practice and comes at no cost.
When I might opt for the other design is if the relation were to accrue additional columns. At that point it is no longer strictly a join table.
I prefer the first design, using Composite Keys. Having an identity column on the junction table does not give you an advantage even if the parent tables have them. You won't be querying the BookAuthor using the identity column, instead you would query it using the BookID and AuthorID.
Also, adding an identity would allow for duplicate BookID-AuthorID combination, unless you put a constraint.
Additionally, if your primary key is (BookID, AuthorID), you need to an index on AuthorID, BookID). This will help if you want to query the the books written by an author.
Using composite key would be my choice too. Here's why:
Less storage overhead
Let's say you would use a surrogate key. Since you'd probably gonna want to query all authors for a specific book and vica versa you'd need indexes starting with both BookId and AuthorId. For performance reasons you should include the other column in both indexes to prevent a clustered key lookup. You'd probably would want to make one of them a unique to make sure no duplicate BookId/AuthorId combinations are added to the table.
So as a net result:
The data is stored 3 times instead of 2 times
2 unique constraints are to be validated instead of 1
Querying a junction table referencing table
Even if you'd add a table like Contributions (AuthorId, BookId, ...) referencing the junction table. Most queries won't require the junction table to be touched at all. E.g.: to find all contribution of a specific author would only involve the author and contributions tables.
Depending on the amount of data in the junction table, a compound key might end up causing poor performance over an auto generated sequential primary key.
The primary key is the clustered index for the table, which means that it determines the order in which rows are stored on disc. If the primary key's values are not generated sequentially (e.g. it is a composite key comprised of foreign keys from tables where rows do not fall in the same order as the junction table's rows, or it is a GUID or other random key) then each time a row is added to the junction table a reshuffle of the junction table's rows will be necessary.
You probably should use the compound/composite key. This way you are fully relational - one author can write many books and one book can have multiple authors.

How can we deal with intersection tables that quickly grow very large?

For example, we have table A, and table B which have a many-to-many relationship. An intersection table, Table C stores A.id and B.id along with a value that represents a relationship between the two. Or as a concrete example, imagine stackexchange which has a user account, a forum, and a karma score. Or, a student, a course, and a grade. If table A and B are very large, table C can and probably will grow monstrously large very quickly(in fact lets just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?
There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.
There are good reasons for using junction table versus comma-separated list (or similar), including:
Efficient querying (through indexing and clustering).
Enforcement of referential integrity.
When designing a junction table, ask the following questions:
Do I need to query in only one direction or both?1
If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
Is the "extra" data small?
If yes, cluster the table and cover the extra data in the secondary index as necessary.2
I no, don't cluster the table and don't cover the extra data in the secondary index.3
Are there any additional tables for which the junction table acts as a parent?
If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.
In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR2(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
Considerations:
ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.
COMPRESS: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.
JUNCTION_TABLE_IE1, EXTRA_DATA: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.
Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.
Here is the equivalent MS SQL Server syntax (sans index compression):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.
1 In other words, do you only need to get "children" of given "parent", or you might also need to get parents of given child.
2 Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.
3 This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!

SQL Server Performance Suggestion

I have been creating database tables using only a primary key of the datatype int and I have always had great performance but need to setup merge replication with updatable subscribers.
The tables use a typical primary key, data type int, and identity increment. Setting up merge replication, I have to add the rowguid to all tables with a newsequentialid() function for the default value. I noticed that the rowguid has indexable on and was wondering if I needed the primary key anymore?
Is it okay to have 2 indexes, the primary key int and the rowguid? What is the best layout for a merge replication table? Do I keep the int id for easy row referencing and just remove the index but keep the primary key? Not sure what route to take, Thanks.
Remember that if you remove the int id column and replace it with a GUID, you may need to rework a good deal of your data and your queries. And do you really want to do queries like:
select * from orders where customer_id = '2053995D-4EFE-41C0-8A04-00009890024A'
Remember if your ids are exposed to any users (often in the case of a customer because the customer table often has no natural key since names are not unique), they will find the guid daunting for doing research.
There is nothing wrong in an existing system with having both. In a new system, you could plan to not use the ints, but there is a great risk of introducing bugs if you try to remove them in a system already using them.
The only downside of replacing the integer primary key with the guid (that I know of) is that GUIDs are larger, so the btree (index space used) will be larger and if you have foreign keys to this table (which you'd also need to change) a lot more space may end up being used across (potentially) many tables.

Many-to-many relationship structure in SQL Server with or without extra primary key column?

Assume that we have two tables: Roles and Reports. And there exists
a many-to-many relationship between them. Of course, the only solution
that comes to my mind is to create a cross-table, let's name it RoleReport.
I can see two approaches to the structure of that table:
1. Columns: RoleReportId, RoleId, ReportId
PK: RoleReportId
2. Columns: RoleId, ReportId
PK: RoleId, ReportId
Is there any real difference between them (performance or whatever else)?
You will need a composite UNIQUE index on (RoleId, ReportId) anyway.
There is no point in not doing it a PRIMARY KEY.
If you do it a CLUSTERED PRIMARY KEY (which is default), this will be better performance-wise, since it will be less in size.
A clustered primary key will contain only two columns in each record: RoleID and ReportID, while a secondary index will contain three columns: RoleID, ReportID and RoleReportID (as a row pointer).
You may want to create an additional index on ReportID which may be used to search all Roles for a given Report.
There would be some point in making a surrogate key for this relationship if the two following conditions held:
You have additional attributes in your relationship (i. e. this table contains additional columns, like Date or anything else)
You have lots of tables that refer to this relationship with a FOREIGN KEY
In this case it would be nicer to have a single-column PRIMARY KEY to refer to in FOREIGN KEY relationships.
Since you don't seem to have this need, just make a composite PRIMARY KEY.
You don't actually need the RoleReportId. It adds nothing to the relationship.
Many people try to avoid using a naturally-unique key in real tables, instead opting for an artificially unique one, but I don't always agree with that. For example, if you can be sure that your SSN will never change, you can use that as a key. If it somehow does change in the future, you can fix it then.
But I don't intend arguing that point, there's good arguments on both sides. However, you certainly don't need an artificially unique key in this case since both your other fields are, and will remain, unique.
Unless you really need the RoleReportId as a foreign key in some other table (which is not usually going to be the case), go with option 2. It's going to require less storage, and that by itself will probably give a performance advantage -- plus why have a column you're never going to use?
Semantically, the difference is what you're using as the primary key.
Typically I let the remainder of my schema dictate what I do in this situation. If the cross-table is exclusively the implementation of the many-to-many relationship, I tend to use the concatenated primary key. If I'm hanging more information off the cross table, making it an entity in its own right, I'm more inclined to give it its own id independent of the two tables it's connecting.
This is, of course, subjective. I don't claim that this is the One True Way (tm).
If you have many rows, then it might be beneficial to have appropriately ordered indexes on your RoleId and/or ReportId columns, since this will speed up look up operations - but inversely this will slow down insert/delete operations. This is a classic usage profile issue...
If not required otherwise, omit the RoleReportId PK. It adds nothing to the relationship, forces the Server to generate a useless number on each insert, and leaves the other two columns unordered, which slows down lookups.
But all in all, we are talking about milliseconds here. This only becomes relevant, if there is a huge amount of data (say more than 10.000 rows)...
I would suggest du choose no PK for your second choice. You may use indices or an unique constraint over the combination of both columns.
The benefit of using RoleReportID as a single-column primary key comes when you (or the other guy, depending on the structure of your company) need to write a front end that addresses individual role<->report relationships (for instance, to delete one). At that point, you may prefer the fact that you need to address only one column, instead of two, to identify the linking record.
Other than that, you don't need the RoleReportID column.

Should each and every table have a primary key?

I'm creating a database table and I don't have a logical primary key assigned to it. Should each and every table have a primary key?
Short answer: yes.
Long answer:
You need your table to be joinable on something
If you want your table to be clustered, you need some kind of a primary key.
If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Note that a primary key can be composite.
If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.
Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.
And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.
See this article in my blog for why you should always create a unique index on unique data:
Making an index UNIQUE
P.S. There are some very, very special cases where you don't need a primary key.
Mostly they include log tables which don't have any indexes for performance reasons.
Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.
As stated by others, there are some reasons not to have a primary key, but most will not be harmed if there is a primary key
Disagree with the suggested answer. The short answer is: NO.
The purpose of the primary key is to uniquely identify a row on the table in order to form a relationship with another table. Traditionally, an auto-incremented integer value is used for this purpose, but there are variations to this.
There are cases though, for example logging time-series data, where the existence of a such key is simply not needed and just takes up memory. Making a row unique is simply ...not required!
A small example:
Table A: LogData
Columns: DateAndTime, UserId, AttribA, AttribB, AttribC etc...
No Primary Key needed.
Table B: User
Columns: Id, FirstName, LastName etc.
Primary Key (Id) needed in order to be used as a "foreign key" to LogData table.
Pretty much any time I've created a table without a primary key, thinking I wouldn't need one, I've ended up going back and adding one. I now create even my join tables with an auto-generated identity field that I use as the primary key.
Except for a few very rare cases (possibly a many-to-many relationship table, or a table you temporarily use for bulk-loading huge amounts of data), I would go with the saying:
If it doesn't have a primary key, it's not a table!
Marc
Just add it, you will be sorry later when you didn't (selecting, deleting. linking, etc)
Will you ever need to join this table to other tables? Do you need a way to uniquely identify a record? If the answer is yes, you need a primary key. Assume your data is something like a customer table that has the names of the people who are customers. There may be no natural key because you need the addresses, emails, phone numbers, etc. to determine if this Sally Smith is different from that Sally Smith and you will be storing that information in related tables as the person can have mulitple phones, addesses, emails, etc. Suppose Sally Smith marries John Jones and becomes Sally Jones. If you don't have an artifical key onthe table, when you update the name, you just changed 7 Sally Smiths to Sally Jones even though only one of them got married and changed her name. And of course in this case withouth an artificial key how do you know which Sally Smith lives in Chicago and which one lives in LA?
You say you have no natural key, therefore you don't have any combinations of field to make unique either, this makes the artficial key critical.
I have found anytime I don't have a natural key, an artifical key is an absolute must for maintaining data integrity. If you do have a natural key, you can use that as the key field instead. But personally unless the natural key is one field, I still prefer an artifical key and unique index on the natural key. You will regret it later if you don't put one in.
It is a good practice to have a PK on every table, but it's not a MUST. Most probably you will need a unique index, and/or a clustered index (which is PK or not) depending on your need.
Check out the Primary Keys and Clustered Indexes sections on Books Online (for SQL Server)
"PRIMARY KEY constraints identify the column or set of columns that have values that uniquely identify a row in a table. No two rows in a table can have the same primary key value. You cannot enter NULL for any column in a primary key. We recommend using a small, integer column as a primary key. Each table should have a primary key. A column or combination of columns that qualify as a primary key value is referred to as a candidate key."
But then check this out also: http://www.aisintl.com/case/primary_and_foreign_key.html
To make it future proof you really should. If you want to replicate it you'll need one. If you want to join it to another table your life (and that of the poor fools who have to maintain it next year) will be so much easier.
I am in the role of maintaining application created by offshore development team. Now I am having all kinds of issues in the application because original database schema did not contain PRIMARY KEYS on some tables. So please dont let other people suffer because of your poor design. It is always good idea to have primary keys on tables.
Late to the party but I wanted to add my two cents:
Should each and every table have a primary key?
If you are talking about "Relational Albegra", the answer is Yes. Modelling data this way requires the entities and tables to have a primary key. The problem with relational algebra (apart from the fact there are like 20 different, mismatching flavors of it), is that it only exists on paper. You can't build real world applications using relational algebra.
Now, if you are talking about databases from real world apps, they partially/mostly adhere to the relational algebra, by taking the best of it and by overlooking other parts of it. Also, database engines offer massive non-relational functionality nowadays (it's 2020 now). So in this case the answer is No. In any case, 99.9% of my real world tables have a primary key, but there are justifiable exceptions. Case in point: event/log tables (multiple indexes, but not a single key in sight).
Bottom line, in transactional applications that follow the entity/relationship model it makes a lot of sense to have primary keys for almost (if not) all of the tables. If you ever decide to skip the primary key of a table, make sure you have a good reason for it, and you are prepared to defend your decision.
I know that in order to use certain features of the gridview in .NET, you need a primary key in order for the gridview to know which row needs updating/deleting. General practice should be to have a primary key or primary key cluster. I personally prefer the former.
I'd like to find something official like this - 15.6.2.1 Clustered and Secondary Indexes - MySQL.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
So, why not create primary key or something like it by yourself? Besides, ORM cannot identify this hidden ID, meaning that you cannot use ID in your code.
I always have a primary key, even if in the beginning I don't have a purpose in mind yet for it. There have been a few times when I eventually need a PK in a table that doesn't have one and it's always more trouble to put it in later. I think there is more of an upside to always including one.
If you are using Hibernate its not possible to create an Entity without a primary key. This issues can create problem if you are working with an existing database which was created with plain sql/ddl scripts, and no primary key was added
In short, no. However, you need to keep in mind that certain client access CRUD operations require it. For future proofing, I tend to always utilize primary keys.

Resources