How do we specify which key is used for building the index for a
database in SQL?
In most if not all RDBMS, is the search key used for building the index for a database always
the primary key?
From Database Management Systems, 3rd Edition, by Raghu
Ramakrishnan and Johannes Gehrke:
In principle, we can use any key, not just the primary
key, to refer to a tuple. However, using the primary key is
preferable because it is what the DBMS expects (this is the
significance of designating a particular candidate key as a
primary key) and optimizes for. For example, the DBMS may create
an index with the primary key fields as the search key, to make the
retrieval of a tuple given its primary key value efficient.
Thanks.
That depends on which RDBMS you are using. It will be something like
CREATE INDEX index_name ON table_name(key_name).
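To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whichever RDBMS you use. The employees table and the index name are made up for illustration, and the PRAGMA calls used to inspect the result are SQLite-specific.

```python
import sqlite3

# SQLite (in memory) stands in for your RDBMS; the table, column,
# and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept TEXT, salary REAL)"
)

# The search key of this index is dept, not the primary key emp_id.
conn.execute("CREATE INDEX employees_dept_idx ON employees (dept)")

# PRAGMA index_list: column 1 of each row is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('employees')")]
# PRAGMA index_info: column 2 of each row is the indexed column name.
index_columns = [
    row[2] for row in conn.execute("PRAGMA index_info('employees_dept_idx')")
]
print(index_names, index_columns)
```

So the search key is simply whatever column list you put in the parentheses of CREATE INDEX.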
YES and NO.
a) When you create a table, the RDBMS will generally build an index for it on the primary key you specify in your CREATE TABLE statement. If you don't specify a primary key, some RDBMSs will pick a unique, non-null key for you, or create an internal key (probably an integer type) to serve as the primary key for the table.
b) Sometimes, depending on the query pattern, you may find that keys other than the primary key are used frequently (in a WHERE clause, for example); in that case it is worth building additional indexes on those keys.
There are two aspects to your question:
I'm trying to interpret your questions. Perhaps what you need to understand is that there can be more than one index into a table?
Let's say you have a Customers table with 3 columns, CustomerID, LastName, and FirstName.
You create an index using a specific CREATE INDEX or ALTER TABLE command where you specify the specific columns you want to have included in the index. You do not have to have the primary key as part of an index; for example, you may create an index on a table of customers by their last name and first name to speed name searches while still having a different primary key like customerID. Here's some SQL-like syntax.
CREATE INDEX customer_name_idx ON Customers(LastName, FirstName)
This index doesn't include any primary keys nor does it require a primary key to function properly. Internally, it will likely point to some internal row IDs that only the DBMS cares about.
I'm trying to understand what you mean here as well.
A DBMS can return a result regardless of the presence of an index; an index just makes it more efficient if your query matches up nicely with an index.
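As a rough illustration of that point, the sketch below runs the same name lookup before and after creating the customer_name_idx index from above. It uses SQLite through Python's sqlite3 module as a stand-in, so the EXPLAIN QUERY PLAN output is SQLite-specific and its exact wording varies between versions; the sample rows and the plan() helper are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    "CustomerID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT)"
)
conn.executemany(
    "INSERT INTO Customers (LastName, FirstName) VALUES (?, ?)",
    [("Smith", "Sally"), ("Jones", "John"), ("Smith", "Adam")],
)

query = "SELECT CustomerID FROM Customers WHERE LastName = ? AND FirstName = ?"
params = ("Smith", "Sally")


def plan(sql, args):
    """Return SQLite's query plan text (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[3] for row in rows)


# Without the index the query still works; it just scans the table.
rows_before = conn.execute(query, params).fetchall()
plan_before = plan(query, params)

conn.execute("CREATE INDEX customer_name_idx ON Customers(LastName, FirstName)")

# Same result, but the plan now mentions the index.
rows_after = conn.execute(query, params).fetchall()
plan_after = plan(query, params)

print(plan_before)
print(plan_after)
```

The rows returned are identical in both cases; only the access path changes.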
Designating a column as a primary key provides benefits such as enforced uniqueness, and possibly some performance benefits for enforcing other foreign key constraints.
As your quote says though, there is no written rule that says a primary key must also be an index. MySQL, and probably many other DBMSes, create an index automatically on the table's primary key as it makes sense to do so from a technical level.
Anyway, I hope this makes sense and I hope I can clarify better if you have other questions.
Imagine a social network with a table that stores the like (favorite) action; an unlike deletes the row from this table:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id, like_time)
) WITH CLUSTERING ORDER BY (like_time DESC);
The above table has a problem in Cassandra: because liker_id is the first clustering key, we can't sort by the second clustering key, like_time.
We need to sort the table's data by like_time: when a user wants to see who liked a post, we show the list of people who liked it sorted by time (like_time DESC).
We also need to delete (unlike), and for that we need post_id and liker_id.
What is your suggestion? How can we sort this table by like_time?
After more research, I found this solution:
Picking the right data model is the hardest part of using Cassandra, and here is the solution we found for like tables. First of all, Cassandra's read and write paths are amazingly fast, so you don't need to worry about writing to your tables. Model around your queries, and remember that data duplication is okay: many of your tables may repeat the same data. Also, spread data evenly around the cluster and minimize the number of partitions read.
Cassandra is a NoSQL database, and one of the rules of NoSQL is denormalization: denormalize the data and think only about the queries you want to run. For the like data model we will have two tables, each designed around the reads we need:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id)
);
CREATE TABLE IF NOT EXISTS post_likes_by_time(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id), like_time, liker_id)
) WITH CLUSTERING ORDER BY (like_time DESC);
When a user likes a post, we just insert into both of the above tables.
Why do we have the post_likes_by_time table?
In a social network you need to show the list of users who liked a post, and it is common to sort those likes by like_time DESC. To sort likes by time, like_time has to be a clustering key.
Then why do we have the post_likes table too?
In post_likes_by_time the first clustering key is like_time, but we also need to remove a single like. We can't do that with only post_id and liker_id when the data is clustered by like_time first. That is why we also have the post_likes table.
Why can't you have just one table and do both actions, sorting and removing, on it?
To delete one like from the post_likes table we need to provide post_id and user_id (here liker_id) together. In post_likes_by_time we need to sort by like_time, so like_time must be the first clustering key, and liker_id can be the second. And here is the point: because like_time is the first clustering key, selecting or deleting by liker_id also requires like_time, and most of the time you do not have like_time.
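To show how the two tables could cooperate, here is a small in-memory model in plain Python. It is not Cassandra client code: a dict and a set stand in for the two tables, and the like(), unlike() and likers_newest_first() helpers are hypothetical. The unlike path, which first reads like_time from post_likes and then uses it to delete from post_likes_by_time, is my assumption about how to keep the tables in sync; it is not spelled out above.

```python
# In-memory stand-ins for the two tables (an illustration, not a Cassandra client).
# post_likes:         key (post_id, liker_id) -> like_time
# post_likes_by_time: rows (post_id, like_time, liker_id)
post_likes = {}
post_likes_by_time = set()


def like(post_id, liker_id, like_time):
    """Write the like to both tables, as the answer describes."""
    if (post_id, liker_id) in post_likes:
        return  # already liked; keep the original like_time
    post_likes[(post_id, liker_id)] = like_time
    post_likes_by_time.add((post_id, like_time, liker_id))


def unlike(post_id, liker_id):
    """Assumed flow: look up like_time by (post_id, liker_id), then delete both rows."""
    like_time = post_likes.pop((post_id, liker_id), None)
    if like_time is not None:
        post_likes_by_time.discard((post_id, like_time, liker_id))


def likers_newest_first(post_id):
    """Mimics reading one partition of post_likes_by_time ordered by like_time DESC."""
    rows = [row for row in post_likes_by_time if row[0] == post_id]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [liker_id for _, _, liker_id in rows]


like("post-1", "alice", 100)
like("post-1", "bob", 200)
like("post-1", "carol", 150)
before = likers_newest_first("post-1")

unlike("post-1", "carol")  # only post_id and liker_id are known here
after = likers_newest_first("post-1")
print(before, after)
```

In real Cassandra the two writes are separate statements, so you would also have to decide how to handle one of them failing; this sketch ignores that.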
I am creating a user management database schema, using PostgreSQL as the database. My approach is below. Please tell me whether this structure has any performance issues.
Requirements:
Expecting millions of users in the future.
I have to use the unique user ID in other systems too, maybe MongoDB, Redis, etc.
Approach:
I am using pseudo_encrypt() to generate the unique user_id (BIGINT or BIGSERIAL), so that no one can guess other IDs. For example: 3898573529235304961
I am using user_id as the foreign key in other tables. I am not using the primary key of the user table as the foreign key.
Any suggestions?
Using a unique key as the foreign key everywhere in other tables: am I doing it correctly?
Are there any performance issues during CRUD operations and with complex joins?
Is using the unique key in another database the correct way (in a distributed environment)?
You are wading into flame war territory here over the question of natural vs surrogate primary keys. I agree with you and often use unique keys as foreign keys, and designate natural primary keys as such. On PostgreSQL this is safe (on MySQL or MS SQL it would be a bad habit though).
In PostgreSQL the only differences between primary keys and unique constraints are:
A table can have only one primary key.
Primary key columns are all NOT NULL.
In practice, a column defined as NOT NULL UNIQUE behaves just about the same as a single-column primary key.
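Here is a minimal sketch of a foreign key that points at a NOT NULL UNIQUE column rather than at the primary key. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; PostgreSQL accepts the same kind of reference, but nothing here measures PostgreSQL behaviour. The orders table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE users (
        id      INTEGER PRIMARY KEY,      -- internal primary key
        user_id INTEGER NOT NULL UNIQUE,  -- public id, e.g. from pseudo_encrypt()
        name    TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (                 -- hypothetical referencing table
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users (user_id)
    )
""")

conn.execute(
    "INSERT INTO users (user_id, name) VALUES (3898573529235304961, 'alice')"
)
conn.execute("INSERT INTO orders (user_id) VALUES (3898573529235304961)")

# A row pointing at a user_id that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders (user_id) VALUES (42)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)
```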
On other databases the table structure is often optimized for primary key lookups, which is why this is a problem there. There may also be tools that don't like it, but those are questions outside the realm of database design per se.
You are better off using normal serials and real access controls than trying to build things on obscurity. Obscurity-based controls are likely to perform worse and be less secure than just doing things right.
I'm creating a database table and I don't have a logical primary key assigned to it. Should each and every table have a primary key?
Short answer: yes.
Long answer:
You need your table to be joinable on something
If you want your table to be clustered, you need some kind of a primary key.
If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Note that a primary key can be composite.
If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.
Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.
And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.
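A quick sketch of the link-table case, using SQLite through Python's sqlite3 module as a stand-in. The enrollments table is made up, and the automatic index that shows up in PRAGMA index_list is SQLite's way of backing the key; other engines do it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollments (            -- hypothetical many-to-many link table
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_id)
    )
""")

conn.execute("INSERT INTO enrollments VALUES (1, 10)")
conn.execute("INSERT INTO enrollments VALUES (1, 11)")  # same student, other course

# A second record describing the same link violates the composite key.
try:
    conn.execute("INSERT INTO enrollments VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The key is backed by a unique index that SQLite created automatically.
pk_indexes = [row[1] for row in conn.execute("PRAGMA index_list('enrollments')")]
link_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(duplicate_rejected, link_count, pk_indexes)
```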
See this article in my blog for why you should always create a unique index on unique data:
Making an index UNIQUE
P.S. There are some very, very special cases where you don't need a primary key.
Mostly they include log tables which don't have any indexes for performance reasons.
Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.
As stated by others, there are some reasons not to have a primary key, but most will not be harmed if there is a primary key
Disagree with the suggested answer. The short answer is: NO.
The purpose of the primary key is to uniquely identify a row on the table in order to form a relationship with another table. Traditionally, an auto-incremented integer value is used for this purpose, but there are variations to this.
There are cases though, for example logging time-series data, where the existence of such a key is simply not needed and just takes up memory. Making a row unique is simply not required!
A small example:
Table A: LogData
Columns: DateAndTime, UserId, AttribA, AttribB, AttribC etc...
No Primary Key needed.
Table B: User
Columns: Id, FirstName, LastName etc.
Primary Key (Id) needed in order to be used as a "foreign key" to LogData table.
Pretty much any time I've created a table without a primary key, thinking I wouldn't need one, I've ended up going back and adding one. I now create even my join tables with an auto-generated identity field that I use as the primary key.
Except for a few very rare cases (possibly a many-to-many relationship table, or a table you temporarily use for bulk-loading huge amounts of data), I would go with the saying:
If it doesn't have a primary key, it's not a table!
Marc
Just add it; you will be sorry later if you don't (selecting, deleting, linking, etc.).
Will you ever need to join this table to other tables? Do you need a way to uniquely identify a record? If the answer is yes, you need a primary key. Assume your data is something like a customer table that has the names of the people who are customers. There may be no natural key, because you need the addresses, emails, phone numbers, etc. to determine whether this Sally Smith is different from that Sally Smith, and you will be storing that information in related tables since a person can have multiple phones, addresses, emails, etc. Suppose Sally Smith marries John Jones and becomes Sally Jones. If you don't have an artificial key on the table, when you update the name you have just changed 7 Sally Smiths to Sally Jones even though only one of them got married and changed her name. And of course, without an artificial key, how do you know which Sally Smith lives in Chicago and which one lives in LA?
You say you have no natural key, so you don't have any combination of fields that is unique either; this makes the artificial key critical.
I have found that any time I don't have a natural key, an artificial key is an absolute must for maintaining data integrity. If you do have a natural key, you can use that as the key field instead. But personally, unless the natural key is one field, I still prefer an artificial key plus a unique index on the natural key. You will regret it later if you don't put one in.
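The Sally Smith problem is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module; the customers table layout, the city column (kept in the same table here only for brevity) and the make_db() helper are all made up for illustration.

```python
import sqlite3


def make_db():
    """Build a fresh in-memory table holding seven customers named Sally Smith."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, city TEXT)"
    )
    cities = ["Chicago", "LA", "Boston", "Denver", "Miami", "Austin", "Seattle"]
    conn.executemany(
        "INSERT INTO customers (first_name, last_name, city) "
        "VALUES ('Sally', 'Smith', ?)",
        [(city,) for city in cities],
    )
    return conn


# Updating by name, as you must when there is no key: every Sally Smith changes.
conn = make_db()
by_name = conn.execute(
    "UPDATE customers SET last_name = 'Jones' "
    "WHERE first_name = 'Sally' AND last_name = 'Smith'"
).rowcount

# Updating by the artificial key: only the one who got married changes.
conn = make_db()
chicago_id = conn.execute(
    "SELECT customer_id FROM customers WHERE city = 'Chicago'"
).fetchone()[0]
by_key = conn.execute(
    "UPDATE customers SET last_name = 'Jones' WHERE customer_id = ?",
    (chicago_id,),
).rowcount

print(by_name, by_key)  # prints "7 1"
```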
It is a good practice to have a PK on every table, but it's not a MUST. Most probably you will need a unique index, and/or a clustered index (which is PK or not) depending on your need.
Check out the Primary Keys and Clustered Indexes sections on Books Online (for SQL Server)
"PRIMARY KEY constraints identify the column or set of columns that have values that uniquely identify a row in a table. No two rows in a table can have the same primary key value. You cannot enter NULL for any column in a primary key. We recommend using a small, integer column as a primary key. Each table should have a primary key. A column or combination of columns that qualify as a primary key value is referred to as a candidate key."
But then check this out also: http://www.aisintl.com/case/primary_and_foreign_key.html
To make it future proof you really should. If you want to replicate it you'll need one. If you want to join it to another table your life (and that of the poor fools who have to maintain it next year) will be so much easier.
I maintain an application created by an offshore development team. I am now having all kinds of issues in the application because the original database schema did not contain PRIMARY KEYs on some tables. So please don't let other people suffer because of your poor design. It is always a good idea to have primary keys on tables.
Late to the party but I wanted to add my two cents:
Should each and every table have a primary key?
If you are talking about "Relational Algebra", the answer is Yes. Modelling data this way requires the entities and tables to have a primary key. The problem with relational algebra (apart from the fact there are like 20 different, mismatching flavors of it), is that it only exists on paper. You can't build real world applications using relational algebra.
Now, if you are talking about databases from real world apps, they partially/mostly adhere to the relational algebra, by taking the best of it and by overlooking other parts of it. Also, database engines offer massive non-relational functionality nowadays (it's 2020 now). So in this case the answer is No. In any case, 99.9% of my real world tables have a primary key, but there are justifiable exceptions. Case in point: event/log tables (multiple indexes, but not a single key in sight).
Bottom line, in transactional applications that follow the entity/relationship model it makes a lot of sense to have primary keys for almost (if not) all of the tables. If you ever decide to skip the primary key of a table, make sure you have a good reason for it, and you are prepared to defend your decision.
I know that in order to use certain features of the gridview in .NET, you need a primary key in order for the gridview to know which row needs updating/deleting. General practice should be to have a primary key or primary key cluster. I personally prefer the former.
I'd like to find something official like this - 15.6.2.1 Clustered and Secondary Indexes - MySQL.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
So, why not create a primary key, or something like it, yourself? Besides, an ORM cannot see this hidden ID, meaning that you cannot use it in your code.
I always have a primary key, even if in the beginning I don't have a purpose in mind yet for it. There have been a few times when I eventually need a PK in a table that doesn't have one and it's always more trouble to put it in later. I think there is more of an upside to always including one.
If you are using Hibernate, it's not possible to create an Entity without a primary key. This can cause problems if you are working with an existing database that was created with plain SQL/DDL scripts and no primary key was added.
In short, no. However, you need to keep in mind that certain client access CRUD operations require it. For future proofing, I tend to always utilize primary keys.
Although I'm guilty of this crime, it seems to me there can't be any good reason for a table to not have an identity field primary key.
Pros:
- whether you want to or not, you can now uniquely identify every row in your table which previously you could not do
- you can't do sql replication without a primary key on your table
Cons:
- an extra 32 bits for each row of your table
Consider for example the case where you need to store user settings in a table in your database. You have a column for the setting name and a column for the setting value. No primary key is necessary, but having an integer identity column and using it as your primary key seems like a best practice for any table you ever create.
Are there other reasons besides size that every table shouldn't just have an integer identity field?
Sure, an example in a single-database solution is if you have a table of countries, it probably makes more sense to use the ISO 3166-1-alpha-2 country code as the primary key as this is an international standard, and makes queries much more readable (e.g. CountryCode = 'GB' as opposed to CountryCode = 28). A similar argument could be applied to ISO 4217 currency codes.
In a SQL Server database solution using replication, a UNIQUEIDENTIFIER key would make more sense as GUIDs are required for some types of replication (and also make it much easier to avoid key conflicts if there are multiple source databases!).
The most clear example of a table that doesn't need a surrogate key is a many-to-many relation:
CREATE TABLE Authorship (
author_id INT NOT NULL,
book_id INT NOT NULL,
PRIMARY KEY (author_id, book_id),
FOREIGN KEY (author_id) REFERENCES Authors (author_id),
FOREIGN KEY (book_id) REFERENCES Books (book_id)
);
I also prefer a natural key when I design a tagging system:
CREATE TABLE Tags (
tag VARCHAR(20) PRIMARY KEY
);
CREATE TABLE ArticlesTagged (
article_id INT NOT NULL,
tag VARCHAR(20) NOT NULL,
PRIMARY KEY (article_id, tag),
FOREIGN KEY (article_id) REFERENCES Articles (article_id),
FOREIGN KEY (tag) REFERENCES Tags (tag)
);
This has some advantages over using a surrogate "tag_id" key:
You can ensure tags are unique, without adding a superfluous UNIQUE constraint.
You prevent two distinct tags from having the exact same spelling.
Dependent tables that reference the tag already have the tag text; they don't need to join to Tags to get the text.
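To see those advantages in action, here is a small sketch that loads the tables above into SQLite through Python's sqlite3 module. SQLite is only a stand-in; the Articles table definition and the sample rows are made up, since they are not shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    -- Articles is assumed here; it is referenced but not defined in the answer.
    CREATE TABLE Articles (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Tags (tag VARCHAR(20) PRIMARY KEY);
    CREATE TABLE ArticlesTagged (
        article_id INT NOT NULL,
        tag VARCHAR(20) NOT NULL,
        PRIMARY KEY (article_id, tag),
        FOREIGN KEY (article_id) REFERENCES Articles (article_id),
        FOREIGN KEY (tag) REFERENCES Tags (tag)
    );
    INSERT INTO Articles VALUES (1, 'Choosing keys');
    INSERT INTO Tags VALUES ('sql'), ('indexing');
    INSERT INTO ArticlesTagged VALUES (1, 'sql'), (1, 'indexing');
""")

# The tag text is already in the dependent table: no join to Tags is needed.
tags_for_article = [
    row[0]
    for row in conn.execute(
        "SELECT tag FROM ArticlesTagged WHERE article_id = 1 ORDER BY tag"
    )
]

# The primary key itself keeps tags unique; no extra UNIQUE constraint required.
try:
    conn.execute("INSERT INTO Tags VALUES ('sql')")
    duplicate_tag_rejected = False
except sqlite3.IntegrityError:
    duplicate_tag_rejected = True

print(tags_for_article, duplicate_tag_rejected)
```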
Every table should have a primary key. It doesn't matter if it's an integer, GUID, or the "setting name" column. The type depends on the requirements of the application. Ideally, if you are going to join the table to another, it would be best to use a GUID or integer as your primary key.
Yes, there are good reasons. You can have semantically meaningful true keys, rather than artificial identity keys. Also, it is not a good idea to have a separate autoincrementing primary key for a many-to-many table. There are some reasons you might want to choose a GUID.
That being said, I typically use autoincrementing 64bit integers for primary keys.
Every table should have a primary key. But it doesn't need to be a single field identifier. Take for example in a finance system, you may have the primary key on a journal table being the Journal ID and Line No. This will produce a unique combination for each row (and the Journal ID will be a primary key in its own table)
Your primary key needs to be defined on how you are going to link the table to other tables.
I don't think every table needs a primary key. Sometimes you only want to "connect" the contents of two tables - via their primary key.
So you have a table like users and one table like groups (each with primary keys) and you have a third table called users_groups with only two columns (user and group) where users and groups are connected with each other.
For example a row with user = 3 and group = 6 would link the user with primary key 3 to the group with primary key 6.
One reason not to have primary key defined as identity is having primary key defined as GUIDs or populated with externally generated values.
In general, every table that is semantically meaningful by itself should have primary key and such key should have no semantic meaning. A join table that realizes many-to-many relationship is not meaningful by itself and so it doesn't need such primary key (it already has one via its values).
To be a properly normalised table, each row should only have a single identifiable key. Many tables will already have natural keys, such as a unique invoice number. I agree that, especially with storage being so cheap, there is little overhead in having an autonumber/identity key on all tables, but in that case which is the real key?
Another area where I personally don't use this approach is reference data, where typically we have a Code and a Description:
Code, Description
'L', 'Live'
'O', 'Old'
'P', 'Pending'
In this situation making code a primary key ensures no duplicates, and is more human readable.
The key difference (sorry) between a natural primary key and a surrogate primary key is that the value of the natural key contains information whereas the value of a surrogate key doesn't.
Why is this important? Well a natural primary key is by definition guaranteed to be unique, but its value is not usually guaranteed to stay the same. When it changes, you have to update it in multiple places.
A surrogate key's value has no real meaning and simply serves to identify that row, so it never needs to be changed. It is a feature of the model rather than the domain itself.
So the only place I would say a surrogate key isn't appropriate is in an association table which only contains columns referring to rows in other tables (most many-to-many relations). The only information this table carries is the association between two (or more) rows, and it already consists solely of surrogate key values. In this case I would choose a composite primary key.
If such a table had bag semantics, or carried additional information about the association, I would add a surrogate key.
A primary key is ALWAYS a good idea. It allows for very fast and easy joining of tables. It aids external tools that read the system tables to build joins, allowing less skilled people to create their own queries by drag-and-drop. It also makes the implementation of referential integrity a breeze, and that is a good idea from the get-go.
I know for sure that some very smart people working for web giants do this. While I don't know their reasons, I know 2 cases where PK-less tables make sense:
Importing data. The table is temporary. Insertions and whole table scans need to be as fast as possible. Also, we need to accept duplicate records. Later we will clean the data, but the import process needs to work.
Analytics in a DBMS. Identifying a row is not useful - if we need to do it, it is not analytics. We just need a non-relational, redundant, horrible blob that looks like a table. We will build summary tables or materialized views by writing proper SQL queries.
Note that these cases have good reasons to be non-relational. But normally your tables should be relational, so... yes, they need a primary key.
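The import case can be sketched like this, with SQLite through Python's sqlite3 module standing in for the real DBMS. The staging_events and events tables, their columns and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging table: no primary key and no indexes, so duplicates are accepted.
conn.execute(
    "CREATE TABLE staging_events (user_id INTEGER, event TEXT, happened_at TEXT)"
)
conn.executemany(
    "INSERT INTO staging_events VALUES (?, ?, ?)",
    [
        (1, "login", "2020-01-01T10:00"),
        (1, "login", "2020-01-01T10:00"),  # duplicate record
        (2, "login", "2020-01-01T10:05"),
        (1, "logout", "2020-01-01T11:00"),
        (2, "login", "2020-01-01T10:05"),  # duplicate record
    ],
)

# Cleaned table: relational again, with a composite primary key.
conn.execute("""
    CREATE TABLE events (
        user_id     INTEGER NOT NULL,
        event       TEXT NOT NULL,
        happened_at TEXT NOT NULL,
        PRIMARY KEY (user_id, event, happened_at)
    )
""")
# The cleanup step: copy each distinct row once.
conn.execute(
    "INSERT INTO events "
    "SELECT DISTINCT user_id, event, happened_at FROM staging_events"
)

staged = conn.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
cleaned = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(staged, cleaned)  # prints "5 3"
```

The import never fails on a duplicate, and the key is only enforced once the data has been cleaned.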