SQL Index on two columns - sql-server

Here's my simple scenario:
I've a Users table and a Locations table. ONE User can be related to MANY Locations so I've a UserLocation table which is as follows:
ID (int-Auto Increment) PK
UserID (Int FK to the Users table)
LocID (Int FK to the Locations table)
Now, as ID is the PK it is Indexed by default in SQL-Server. I was a bit confused about the other two columns:
OPT 1: Should I define an index on both columns, like:
IX_UserLocation_UserID_LocID
OR
OPT 2: Should I define two separate indexes, like: IX_UserLocation_UserID & IX_UserLocation_LocID
Pardon me if both do the same; in that case please explain. If not, which one is better and why?

You need
2 columns
UserID (Int FK to the Users table)
LocID (Int FK to the Locations table)
One PK on both (UserID, LocID)
Another index on the reverse (LocID, UserID)
You may not need both indexes but it's rare
Edit, some links to other SO answers
SQL: Do you need an auto-incremental primary key for Many-Many tables?
SQL - many-to-many table primary key
Difference between 2 indexes with columns defined in reverse order
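A minimal sketch of the layout suggested above: a composite primary key on (UserID, LocID) plus a second index on the reverse order. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example runs anywhere); the index name and the sample rows are made up, and the foreign keys to Users and Locations are omitted for brevity.

```python
import sqlite3

# SQLite in memory is only a stand-in for SQL Server here (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserLocation (
    UserID INTEGER NOT NULL,
    LocID  INTEGER NOT NULL,
    PRIMARY KEY (UserID, LocID)   -- serves "all locations for a user"
);
-- Reverse-order index serves "all users at a location".
CREATE INDEX IX_UserLocation_LocID_UserID ON UserLocation (LocID, UserID);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO UserLocation VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10)])


def insert_pair(user_id, loc_id):
    """Return True if the pair was inserted, False if the PK rejected it."""
    try:
        conn.execute("INSERT INTO UserLocation VALUES (?, ?)",
                     (user_id, loc_id))
        return True
    except sqlite3.IntegrityError:
        return False


locations_for_user_1 = [row[0] for row in conn.execute(
    "SELECT LocID FROM UserLocation WHERE UserID = ? ORDER BY LocID", (1,))]
users_at_location_10 = [row[0] for row in conn.execute(
    "SELECT UserID FROM UserLocation WHERE LocID = ? ORDER BY UserID", (10,))]
```

The composite key also enforces that a user is linked to a given location only once, so `insert_pair(1, 10)` is rejected while a new pair such as `insert_pair(2, 20)` goes through.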

There are several things we hire the database for. One is fast information retrieval and another is declarative referential integrity (DRI).
If your requirement is that a user may be related to a given location only once, then you want a unique index on UserID & LocationID.
If your question is how to retrieve the data fast, the answer is -- it depends. How are you accessing the data? If you always get the entire set of locations for a user, then I would probably use a clustered non-unique index on UserID. If your access is "who is in location x?" then you probably want a clustered non-unique index on LocationID.
If you ask both questions you'll probably want both indexes (although you only get 1 clustered, so the 2nd index may want to use an INCLUDE to grab the other column).
Either way, you probably don't want ID as your clustered index (the default when marking a column as PK in the SSMS table designer).
HTH,
-eric

In addition to gbn's answer: it will depend on the WHERE clause, i.e. whether you filter on user, on location, or on both.

You should probably create two separate indexes. One thing that is often forgotten with foreign keys is the fact that deleting a user might cascade-delete the user-location relation in your table. If there is no index on userID, this might lead to a table-lock of your user-location relation. The same applies to deleting a location.
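As a rough illustration of the cascade scenario described above, the sketch below declares both foreign keys with ON DELETE CASCADE and gives each its own index. It runs on Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption); it shows the cascade itself, not the locking behaviour, which is engine-specific. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Users     (UserID INTEGER PRIMARY KEY);
CREATE TABLE Locations (LocID  INTEGER PRIMARY KEY);
CREATE TABLE UserLocation (
    ID     INTEGER PRIMARY KEY,
    UserID INTEGER NOT NULL REFERENCES Users(UserID)    ON DELETE CASCADE,
    LocID  INTEGER NOT NULL REFERENCES Locations(LocID) ON DELETE CASCADE
);
-- One index per foreign key, so a cascading delete can find the child rows
-- without scanning the whole link table.
CREATE INDEX IX_UserLocation_UserID ON UserLocation (UserID);
CREATE INDEX IX_UserLocation_LocID  ON UserLocation (LocID);

-- Made-up sample rows.
INSERT INTO Users VALUES (1), (2);
INSERT INTO Locations VALUES (10), (20);
INSERT INTO UserLocation (UserID, LocID) VALUES (1, 10), (1, 20), (2, 10);
""")

# Deleting a user removes that user's link rows through the cascade.
conn.execute("DELETE FROM Users WHERE UserID = ?", (1,))
remaining = conn.execute(
    "SELECT UserID, LocID FROM UserLocation ORDER BY ID").fetchall()
```

After the delete, only the link row for user 2 is left in `remaining`.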

The best way is to set up all the indexes you think you need on dev, then look at the query plans of the queries your app runs and see which indexes actually get read.

Related

Redundant DB column for indexing

I'm defining a few database tables, roughly looking like this:
In order to quickly run a query where a Person's MailMessages are retrieved in time order, regardless of what MailAccount they were sent to, I want an index for the MailMessage table, sorted by (PersonId, ReceivedTime). That means adding a redundant PersonId column to the MailMessage table, like this:
...or does it? Is there any neater way of doing this? If not, is the best practice to make PersonId a foreign key in the MailMessage table, or should this not be done, as it's conceptually not a foreign key but rather just a column used for the (PersonId, ReceivedTime) index?
Yes you could do that, but it would require having a key in table MailAccount on {MailAccountId, PersonId}, so it can be referenced by the FK in table MailMessage. From the perspective of enforcing uniqueness, this is redundant, since {MailAccountId} alone is already unique.
There is an alternative: use identifying relationships and natural keys. For example:
This achieves essentially the same goal, but with just one key (and the underlying index) per table.
Note the order of PK fields in the bottom table: it allows a query...
SELECT *
FROM MailMessage
WHERE PersonId = ?
ORDER BY ReceivedTime
...to be satisfied by an index range scan on the primary index. And if the table happens to be clustered, the DBMS won't even have to access the table heap after that (there is no table heap at all - rows are stored directly in the B-Tree).
Avoidance of JOINs without resorting to redundant keys (which is also good for clustering) is one of the pros of natural keys versus surrogate keys. As you can imagine, the list of pros and cons does not end there.
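The effect of the key order can be sketched as follows. This is only an illustration under stated assumptions: the column list (AccountNo, MessageNo) is hypothetical because the original diagram is not reproduced here, the timestamps are made-up sample data, and SQLite (via Python's built-in sqlite3 module, with a WITHOUT ROWID table to mimic a clustered table) stands in for the real DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical natural-key layout; WITHOUT ROWID stores the rows directly
-- in the primary-key B-tree, similar to a clustered table.
CREATE TABLE MailMessage (
    PersonId     INTEGER NOT NULL,
    ReceivedTime TEXT    NOT NULL,
    AccountNo    INTEGER NOT NULL,
    MessageNo    INTEGER NOT NULL,
    PRIMARY KEY (PersonId, ReceivedTime, AccountNo, MessageNo)
) WITHOUT ROWID;
""")

# Made-up sample rows, deliberately inserted out of time order.
conn.executemany("INSERT INTO MailMessage VALUES (?, ?, ?, ?)", [
    (1, "2024-01-03T09:00", 2, 1),
    (1, "2024-01-01T08:00", 1, 1),
    (2, "2024-01-02T12:00", 1, 1),
    (1, "2024-01-02T10:00", 1, 2),
])

query = ("SELECT ReceivedTime FROM MailMessage "
         "WHERE PersonId = ? ORDER BY ReceivedTime")
received = [row[0] for row in conn.execute(query, (1,))]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = " ".join(str(row[3])
                for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,)))
# A separate sort step would show up as a temporary B-tree in the plan.
needs_sort = "TEMP B-TREE" in plan
```

Because PersonId leads the key and ReceivedTime follows it, the rows for one person are already stored in time order, so the planner should not need an extra sort step for this query.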
What you are doing is called denormalization. A full discussion of the pros and cons of this concept is a bit much for SO.
This type of optimization is also possible using a Materialized View (called an Indexed View in SQL Server).

SQL Server 2008 - Database Design Query

I have to load the data shown in the below image into my database.
For a particular row, either PartID or GroupID will be NULL, and the other columns refer to whichever entity is non-NULL. I have the following three options:
1. Use one database table with a single unified ID column holding both PartID and GroupID data. In this case I won't be able to apply a foreign key constraint, as the column will contain both entities' data.
2. Use one database table with separate PartID and GroupID columns. For each row one of them will be NULL, but in this case I will be able to apply foreign key constraints.
3. Use two database tables with similar structure, differing only in the PartID or GroupID column. In this case I will be able to apply foreign key constraints.
One thing to note here is that, the table(s) will be used in import processes to import about 30000 rows in one go and will also be heavily used in data retrieve operations. Also, the other columns will be used as pivot columns.
Can someone please suggest what should be best approach to achieve this?
I would use option 2 and add a constraint that only one can be non-null and the other must be null (just to be safe). I would not use option 1 because of the lack of a FK and the possibility of linking to the wrong table when not obeying the type identifier in the join.
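A possible shape for option 2 with the "exactly one is non-null" rule, sketched with Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption). The table names Parts, Groups and ImportRow are hypothetical; the CHECK expression itself uses only standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Parts  (PartID  INTEGER PRIMARY KEY);
CREATE TABLE Groups (GroupID INTEGER PRIMARY KEY);
CREATE TABLE ImportRow (
    RowID   INTEGER PRIMARY KEY,
    PartID  INTEGER REFERENCES Parts(PartID),
    GroupID INTEGER REFERENCES Groups(GroupID),
    -- Exactly one of the two references must be filled in.
    CHECK ((PartID IS NULL AND GroupID IS NOT NULL)
        OR (PartID IS NOT NULL AND GroupID IS NULL))
);
INSERT INTO Parts  VALUES (1);
INSERT INTO Groups VALUES (1);
""")


def try_insert(part_id, group_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO ImportRow (PartID, GroupID) VALUES (?, ?)",
                     (part_id, group_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, `try_insert(1, None)` and `try_insert(None, 1)` are accepted, while rows with both or neither column set, or with an unknown PartID, are rejected.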
There is a 4th option, which is to normalize them as "items" with another (surrogate) key and two link tables which link items to either parts or groups. This eliminates NULLs. There are further problems with that approach (items might be in both again or neither without any simple constraint), so unless that is necessary for other reasons, I wouldn't generally go down that path.
Option 3 could be fine - it really depends if these rows are a relation - i.e. data associated with a primary key. That's one huge problem I see with the data presented, the lack of a candidate key - I think you need to address that first.
IMO option 2 is the best - it's not perfectly normalized but will be the easiest to work with. 30K rows is not a lot of rows to import.
I would modify the table so it has one ID column and then add an IDType that is either "G" for Group or "P" for Part.

Many-to-many relationship structure in SQL Server with or without extra primary key column?

Assume that we have two tables: Roles and Reports. And there exists
a many-to-many relationship between them. Of course, the only solution
that comes to my mind is to create a cross-table, let's name it RoleReport.
I can see two approaches to the structure of that table:
1. Columns: RoleReportId, RoleId, ReportId
PK: RoleReportId
2. Columns: RoleId, ReportId
PK: RoleId, ReportId
Is there any real difference between them (performance or whatever else)?
You will need a composite UNIQUE index on (RoleId, ReportId) anyway.
There is no point in not making it a PRIMARY KEY.
If you make it a CLUSTERED PRIMARY KEY (which is the default), this will be better performance-wise, since it will be smaller in size.
A clustered primary key will contain only two columns in each record: RoleID and ReportID, while a secondary index will contain three columns: RoleID, ReportID and RoleReportID (as a row pointer).
You may want to create an additional index on ReportID which may be used to search all Roles for a given Report.
There would be some point in making a surrogate key for this relationship if the two following conditions held:
You have additional attributes in your relationship (i. e. this table contains additional columns, like Date or anything else)
You have lots of tables that refer to this relationship with a FOREIGN KEY
In this case it would be nicer to have a single-column PRIMARY KEY to refer to in FOREIGN KEY relationships.
Since you don't seem to have this need, just make a composite PRIMARY KEY.
You don't actually need the RoleReportId. It adds nothing to the relationship.
Many people try to avoid using a naturally-unique key in real tables, instead opting for an artificially unique one, but I don't always agree with that. For example, if you can be sure that your SSN will never change, you can use that as a key. If it somehow does change in the future, you can fix it then.
But I don't intend to argue that point; there are good arguments on both sides. However, you certainly don't need an artificially unique key in this case, since both your other fields together are, and will remain, unique.
Unless you really need the RoleReportId as a foreign key in some other table (which is not usually going to be the case), go with option 2. It's going to require less storage, and that by itself will probably give a performance advantage -- plus why have a column you're never going to use?
Semantically, the difference is what you're using as the primary key.
Typically I let the remainder of my schema dictate what I do in this situation. If the cross-table is exclusively the implementation of the many-to-many relationship, I tend to use the concatenated primary key. If I'm hanging more information off the cross table, making it an entity in its own right, I'm more inclined to give it its own id independent of the two tables it's connecting.
This is, of course, subjective. I don't claim that this is the One True Way (tm).
If you have many rows, then it might be beneficial to have appropriately ordered indexes on your RoleId and/or ReportId columns, since this will speed up look up operations - but inversely this will slow down insert/delete operations. This is a classic usage profile issue...
If not required otherwise, omit the RoleReportId PK. It adds nothing to the relationship, forces the Server to generate a useless number on each insert, and leaves the other two columns unordered, which slows down lookups.
But all in all, we are talking about milliseconds here. This only becomes relevant if there is a huge amount of data (say, more than 10,000 rows)...
I would suggest you choose no PK for your second choice. You may use indices or a unique constraint over the combination of both columns.
The benefit of using RoleReportID as a single-column primary key comes when you (or the other guy, depending on the structure of your company) need to write a front end that addresses individual role<->report relationships (for instance, to delete one). At that point, you may prefer the fact that you need to address only one column, instead of two, to identify the linking record.
Other than that, you don't need the RoleReportID column.

Should a database table always have primary keys?

Should I always have a primary key in my database tables?
Let's take the SO tagging. You can see the tags in any revision, so they are likely to be in a tag_rev table with the postID and revision number. Would I need a PK for that?
Also, since it is in a rev table and not currently in use, should the tags be a blob of tagIDs instead of multiple entries of post_id/tagid pairs?
A table should have a primary key so that you could identify each row uniquely with it.
Technically, you can have tables without a primary key, but you'll be breaking good database design rules.
You should strive to have a primary key in any non-trivial table where you're likely to want to access (or update or delete) individual records by that key. Primary keys can consist of multiple columns, and formally speaking, will be the shortest available superkey; that is, the shortest available group of columns which, together, uniquely identify any row.
I don't know what the Stack Overflow database schema looks like (and from some of the things I've read on Jeff's blog, I don't want to), but in the situation you describe, it's entirely possible there is a primary key across the post identifier, revision number and tag value; certainly, that would be the shortest (and only) superkey available.
With regards to your second point, while it may be reasonable to argue in favour of aggregating values in archive tables, it does go against the principle that each row/column intersection in a table ought to contain one single value. While it may slightly simplify development, there is no reason you can't keep to a normalised table with versioned metadata, even for something as trivial as tags.
I tend to agree that most tables should have a primary key. I can only think of two times where it doesn't make sense to do it.
If you have a table that relates keys to other keys. For example, to relate a user_id to an answer_id, that table wouldn't need a primary key.
A logging table, whose only real purpose is to create an audit trail.
Basically, if you are writing a table that may ever need to be referenced in a foreign key relationship then a primary key is important, and if you can't be positive it won't be, then just add the PK. :)
See this related question about whether an integer primary key is required. One of the answers uses tagging as an example:
Are there any good reasons to have a database table without an integer primary key
For more discussion of tagging and keys, see this question:
Id for tags in tag systems
From MySQL 5.5 Reference Manual section 13.1.17:
If you do not have a PRIMARY KEY and an application asks for the PRIMARY KEY in your tables, MySQL returns the first UNIQUE index that has no NULL columns as the PRIMARY KEY.
So, technically, the answer is no. However, as others have stated, in most cases it is quite useful.
I firmly believe every table should have a way to uniquely identify a record. For 99% of the tables, this is a primary key. For the rest you may get away with a unique index (I'm thinking of one-column lookup-type tables here). Any time I have had to work with a table without a way to uniquely identify records, there has been trouble.
I also believe that if you are using surrogate keys as your PK, you should, where at all possible, have a separate unique index on whatever combination of fields makes up the natural key. I realize there are all too many times when you don't have a true natural key (names are not unique, or what makes something unique might be spread across several parent-child tables), but if you do have one, please please please make sure it has a unique index or is created as the PK.
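A small sketch of that advice, assuming a hypothetical Country table whose surrogate key is CountryID and whose natural key is IsoCode (neither name comes from the thread). Python's built-in sqlite3 module stands in for whatever DBMS is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (
    CountryID INTEGER PRIMARY KEY,   -- surrogate key
    IsoCode   TEXT NOT NULL,         -- natural key
    Name      TEXT NOT NULL
);
-- Separate unique index so the natural key cannot be duplicated.
CREATE UNIQUE INDEX UX_Country_IsoCode ON Country (IsoCode);
""")


def add_country(iso_code, name):
    """Return the new surrogate id, or None if the natural key already exists."""
    try:
        cur = conn.execute(
            "INSERT INTO Country (IsoCode, Name) VALUES (?, ?)",
            (iso_code, name))
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None


first_id = add_country("DE", "Germany")
duplicate_id = add_country("DE", "Deutschland")   # same natural key, rejected
row_count = conn.execute("SELECT COUNT(*) FROM Country").fetchone()[0]
```

Without the unique index, the second insert would succeed and the table would hold two rows for the same real-world country under different surrogate ids.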
If there is no PK, how will you update or delete a single row? It would be impossible! To be honest, I have a few times used tables without a PK, for instance to store activity logs, but even in this case it is advisable to have one, because the timestamps might not be granular enough. Temporary tables are another example. But according to relational theory the PK is mandatory.
It is good to have keys and relationships; it helps a lot. However, if your app is good enough to handle the relationships, then you could possibly skip the keys (although I recommend that you have them).
Since I use Subsonic, I always create a primary key for all of my tables. Many DB Abstraction libraries require a primary key to work.
Note: that doesn't answer the "Grand Unified Theory" tone of your question, but I'm just saying that in practice, sometimes you MUST make a primary key for every table.
If it's a join table then I wouldn't say that you need a primary key. Suppose, for example, that you have tables PERSONS, SICKPEOPLE, and ILLNESSES. The ILLNESSES table has things like flu, cold, etc., each with a primary key. PERSONS has the usual stuff about people, each also with a primary key. The SICKPEOPLE table only has people in it who are sick, and it has two columns, PERSONID and ILLNESSID, foreign keys back to their respective tables, and no primary key. The PERSONS and ILLNESSES tables contain entities and entities get primary keys. The entries in the SICKPEOPLE table aren't entities and don't get primary keys.
Databases don't have keys, per se, but their constituent tables might. I assume you mean that, but just in case...
Anyway, tables with a large number of rows should absolutely have primary keys; tables with only a few rows don't need them, necessarily, though they don't hurt. It depends upon the usage and the size of the table. Purists will put primary keys in every table. This is not wrong; and neither is omitting PKs in small tables.
Edited to add a link to my blog entry on this question, in which I discuss a case in which database administration staff did not consider it necessary to include a primary key in a particular table. I think this illustrates my point adequately.
Cyberherbalist's Blog Post on Primary Keys

primary key on very small table

I have a very small table, with at most 5 records, that holds some labels. I am using Postgres.
The structure is as follows:
id - smallint
label - varchar(100)
The table will be used mainly to reference the rows from other tables. The question is if it's really necessary to have a primary key on id or to have just an index on the id or have them both?
I did read about indexes and primary keys and I understand that this depends quite a lot on what's the table going to be used for:
Tables with no Primary Key
Edit: I was going to ask about having a primary key or an index or have them both. I edited the question.
It is always good practice to have a primary key column. The typical scenario where it is needed is when you want to update or delete a row; having a PK makes that much easier and safer.
Yes, a primary key is not only good practice -- it's crucial. A table that lacks a unique key fails to be in First Normal Form.
You must declare a PRIMARY KEY or UNIQUE constraint if you want other tables to reference this one with a foreign key.
In most RDBMS brands, both PRIMARY KEY and UNIQUE constraints implicitly create an index on the column(s). If it doesn't do this implicitly, you may be required to define the index yourself before you can declare the constraint.
Yes, you will need a primary key on the id field, since you do not want two labels that share the same id.
You also want an index, to speed up the search/lookup process in this table (although for small tables there is less performance gain). The sequence will just help you fill in the next ID; it does not prevent you from changing a previous value into one that already exists.
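To illustrate the difference between a primary key and a plain index on id, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption). The two table names and the label values are made up; one table declares id as PRIMARY KEY, the other only has a non-unique index on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabelKeyed (
    id    SMALLINT PRIMARY KEY,
    label VARCHAR(100) NOT NULL
);
CREATE TABLE LabelIndexed (
    id    SMALLINT,
    label VARCHAR(100) NOT NULL
);
-- A plain index speeds up lookups but does not enforce uniqueness.
CREATE INDEX IX_LabelIndexed_id ON LabelIndexed (id);
""")

for table in ("LabelKeyed", "LabelIndexed"):
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)",
                     [(1, "new"), (2, "open")])


def renumber(table, old_id, new_id):
    """Try to change a row's id; return False if a constraint blocks it."""
    try:
        conn.execute(f"UPDATE {table} SET id = ? WHERE id = ?",
                     (new_id, old_id))
        return True
    except sqlite3.IntegrityError:
        return False


keyed_ok = renumber("LabelKeyed", 2, 1)      # blocked by the primary key
indexed_ok = renumber("LabelIndexed", 2, 1)  # silently creates a duplicate
duplicate_ids = conn.execute(
    "SELECT COUNT(*) FROM LabelIndexed WHERE id = 1").fetchone()[0]
```

The keyed table refuses to change id 2 into the already existing id 1, while the merely indexed table ends up with two rows sharing id 1.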
There are very few reasons for creating an index instead of a primary key. As Bill Karwin said, you won't save resources at all. And, as you may have already guessed, there is no need at all to create a new index if you have the primary key.
In some cases it may be hard to find a candidate key. But that doesn't seem to be the case here, and omitting the key clearly goes against good practice.
By the way, as your table is so small, most queries will use a full table scan even if there is an index. Don't worry if you see a full table scan.
From the developer's point of view, PRIMARY KEY is just a combination of NOT NULL and UNIQUE INDEX on one of the columns.
A UNIQUE INDEX is good if:
You need to enforce uniqueness. Using an index is the most efficient way to do that.
You need to perform a SELECT, UPDATE or DELETE with a WHERE condition that is selective on the indexed field, that is, the number of rows affected by the query is much less than the total number of rows (say, 10 rows of 2,000,000).
A UNIQUE INDEX is bad if:
You don't need uniqueness on this field, of course :) But you had better have one unique index in the table, so that each record is identifiable.
You need fast INSERTS
You need fast UPDATEs and DELETEs of the indexed value with a WHERE condition that is not selective on the indexed field, that is, the number of rows affected by the query is comparable to the total number of rows (say, 1,500,000 rows of 2,000,000).
Given that you are going to have a small table, I'd advise you to create a PRIMARY KEY on it.

Resources