Indices on composite primary key - database

I googled this a lot many times but I didn't get the exact explanation for the same.
I am working on a complex database structures (in Oracle 10g) where I hardly have a primary key on one single column except for the static tables.
Now my question is consider a composite primary key ID (LXI, VCODE, IVID, GHID). Since it's a primary key, Oracle will provide a default index.
Will I get ONE (system generated) single index for the primary key itself or for its sub-columns also?
Asking this because I am retrieving data (around millions of records) based on individual columns as well. Now if system generates the indices for the individual columns as well. Why my query runs pretty faster than how it actually runs when I explicitly define indices for each individual column.
Please give a satisfactory answer
Thanks in advance

A primary key is a non-NULL unique key. In your case, the unique index has four columns, LXI, VCODE, IVID GHID in the order of declaration.
If you have a condition on VCODE but not on LXI, then most databases would not use the index. Oracle has a special type of index scan called the "skip scan", which allows for this very situation. It is described in the documentation.
I would expect an index skip scan to be a bit slower than an index range scan on individual columns. However, which is better might also depend on the complexity of the where clause. For instance, three equality conditions on VCODE, IVID and GHID connected by AND might be a great example for the skip scan. And, such an index would cover the WHERE clause -- a great efficiency -- and better than one-column indexes.
As a note: index skip scans were introduced in Oracle 9i, so they are available in Oracle 10.

It will not generate index for individual column. it will generate a composite index
first it will index on LXI
then next column like that it will be a tree structure.
if you search on 1st column of primary key it will use index to use index for second you have to combine it with the first column
ex : select where ...LXI=? will use index PK
select where LXI=? and VCODE=? alse use pk
but select where VCODE=? will not use it (without LXI)

Related

SQL Server creating multiple nonclustered indexes for one column vs having multiple columns in just one index

Suppose I have following table
UserID (Identity) PK
UserName - unique non null
UserEmail - unique non null
What is recommended for the best performance?
creating non clustered index for UserName and UserEmail separately
OR
Just one including both the columns
Please do share you thoughts why one is preferable over other.
Another important point to consider is this: a compound index (made up of multiple columns) will only be used if the n left-most columns are being referenced (e.g. in a WHERE clause).
So if you have a single compound index on
(UserID, UserName, UserEmail)
then this index might be used in the following scenarios:
when you're searching for UserID alone (using just the 1 left-most column - UserID)
when you're searching for UserID and UserName (using the 2 left-most columns)
when you're searching for all three columns
But this single compound index will never be able to be used for searches on
just UserName - it's the second column in the index and thus this index cannot ever be used
just UserEmail - it's the third column in the index and thus this index cannot ever be used
Just remember this - just because a column is part in an index doesn't necessarily mean that searching on that single column alone will be supported and sped up by that index!
So if your usage patterns and your application really need to search on UserName and/or UserEmail alone (without providing other search values), then you must create separate indices on these columns - just a single compound one will not have any benefit at all.
The best way to define indexes depend entirely on how you will use the table. There is no sensible way of choosing indexes just by looking at the table definition.
If your code searches through your table with username or joins your table with another table through username, than it would be wise to define an index on that column. If your code joins the table with another table using two columns (username and usermail), than it would be wise to define index for those two columns. Since all your columns are defined to be unique, I hardly believe this will be the case so you will not need multiple column indexes on that table.
There might be some additional advice on using multiple column indexes: Multiple column indexes are also used for filters that fit the index partially, but with conditions.
Example:
If you define a two column index on username and usermail (in given order), you will have performance gain in searches that filter through both columns (username and usermail). With that index you will also have performance gains in filters that use username only because that is the first column of the index, but not in searches through usermail, that is because the second column of an index can not be used alone.
The rule is: An index can be used for filtering with exact matching columns or filtering with a subset of columns that match subsequent top columns in index definition.
Please do share you thoughts why one is preferable over other.
It depends on what you do.
See, an index is only used "left to right". So, an indes on UserID; UserName is useless if I select filtering by UserName ONLY.
Generally, I would assume three indices here:
Uniuqe Index, Clustered in UserID, as Primary Key.
Unique Index on UserName, non clustered.
Unique Index on UserEMail, non clustered.
The reason is totally not for Performance but:
You will Need the first as Primary key for forein key relationships.
You Need the other two to handle unique constraints properly - there is no way to do that without indices.
In Addition, you Need flexibility to seek by UserName AND UserEMail, which means that it is not possible to Combine them only.
Performance really enters last here - for performacne e reasons all of These indices may contain all additional fields (not as part of the index but as included columns. But really, there is no other sensible way to have this table work unless you alow multiple registrations for the same user.

Difference between clustered and nonclustered index [duplicate]

This question already has answers here:
What are the differences between a clustered and a non-clustered index?
(13 answers)
Closed 7 years ago.
I need to add proper index to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use index for non-int columns? Why/why not
I've read a lot about clustered and non-clustered index yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about and how can I know that it is all good before going to test phase?
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, that is a lookup table etc.) should have a clustering key. There's really no point not to have a clustering key. Actually, contrary to common believe, having a clustering key actually speeds up all the common operations - even inserts and deletes (since the table organization is different and usually better than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing has a great many excellent articles on the topic of why to have a clustering key, and what kind of columns to best use as your clustering key. Since you only get one per table, it's of utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL server performance. Usually that implies that columns that are used to find rows in a table are indexed.
Clustered indexes makes SQL server order the rows on disk according to the index order. This implies that if you access data in the order of a clustered index, then the data will be present on disk in the correct order. However if the column(s) that have a clustered index is frequently changed, then the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either. They cost to maintain. So start out with the obvious ones, and then profile to see which ones you miss and would benefit from. You do not need them from start, they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large. Also it is common to create indexes on groups of columns (e.g. country + city + street).
Also you will not notice performance issues until you have quite a bit of data in your tables. And another thing to think about is that SQL server needs statistics to do its query optimizations the right way, so make sure that you do generate that.
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let’s say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of the
EmployeeID
AND a pointer to the row in the Employee table where that value is actually stored. But a clustered index, on the other hand, will actually store the row data for a particular EmployeeID – so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table like
EmployeeName, EmployeeAddress, etc
. will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASCENDING
You might want to consider a clustered index on FOO.BAR.
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Studio can give you suggestions on where to optimize, and MSDN has some information1 that you might find useful
faster to read than non cluster as data is physically storted in index order
we can create only one per table.(cluster index)
quicker for insert and update operation than a cluster index.
we can create n number of non cluster index.

what does a B-tree index on more than 1 column look like?

So I was reading up on indexes and their implementation, and I stumbled upon this website that has a brief explanation of b-tree indexes:
http://20bits.com/articles/interview-questions-database-indexes/
The b-tree index makes perfect sense for indexes that are only on a single column, but let's say I create an index with multiple columns, how then does the b-tree work? What is the value of each node in the b-tree?
For example, if I have this table:
table customer:
id number
name varchar
phone_number varchar
city varchar
and I create an index on: (id, name, city)
and then run the following query:
SELECT id, name
FROM customer
WHERE city = 'My City';
how does this query utilize the multiple column index, or does it not utilize it unless the index is created as (city, id, name) or (city, name, id) instead?
With most implementations, the key is simply a longer key that includes all of the key values, with a separator. No magic there ;-)
In your example the key values could look something like
"123499|John Doe|Conway, NH"
"32144|Bill Gates| Seattle, WA"
One of the characteristics of these indexes with composite keys is that the intermediate tree nodes can be used in some cases to "cover" the query.
For example, if the query is to find the Name and City given the ID, since the ID is first in the index, the index can search by this efficiently. Once in the intermediate node, it can "parse" the Name and City, from the key, and doesn't need to go to the leaf node to read the same.
If however the query wanted also to display the phone number, then the logic would follow down the leaf when the full record is found.
Imagine that the key is represented by a Python tuple (col1, col2, col3) ... the indexing operation involves comparing tuple_a with tuple_b ... if you have don't know which value of col1 and col2 that you are interested in, but only col3, then it would have to read the whole index ("full index scan"), which is not as efficient.
If you have an index on (col1, col2, col3), then you can expect that any RDBMS will use the index (in a direct manner) when the WHERE clause contains reference to (1) all 3 columns (2) both col1 and col2 (3) only col1.
Otherwise (e.g. only col3 in the WHERE clause), either the RDBMS will not use that index at all (e.g. SQLite), or will do a full index scan (e.g. Oracle) [if no other index is better].
In your specific example, presuming that id is a unique identifier of a customer, it is pointless to have it appear in an index (other than the index that your DBMS should set up for a primary key or column noted as UNIQUE).
Some implementations simply concatenate the values in the order of the columns, with delimiters.
Another solution is to simply have a b-tree within a b-tree. When you hit a leaf on the first column, you get both a list of matching records and a mini b-tree of the next column, and so on. Thus, the order of the columns specified in the index makes a huge difference on whether that index will be useful for particular queries.
Here's a related question I wrote last week:
Does SQL Server jump leaves when using a composite clustered index?
"The index will be ordered by the first key element, then by second key element and so on"
https://www.qwertee.io/blog/postgresql-b-tree-index-explained-part-1/
In Oracle a composite key index can be used even though the leading columns are not filtered. This is done through three mechanisms:
A fast full index scan, in which multiblock reads are used to traverse the entire index segment.
An index full scan, in which the index is read in the logical order of the blocks (I believe I read that in recent versions Oracle can use multiblock reads for this, but really you should count on single block reads)
An inddex skip scan, where a very low cardinality for the non-predicated leading columns allows Oracle to perform multiple index range scans, one for each unique value of the leading column(s). These are pretty rare in my experience.
Look for articles by Richard Foote or Jonathan Lewis for more information on Oracle index internals.
Other than the "composite key" mechanism already described, one possibility is a kdtree which works like a binary tree, but as you traverse each level you cycle through k dimensions. That is, the first level of the tree separates the first dimension into two parts, the second level splits the second dimension, the k+1th level splits the first dimension again, etc.. This allows for efficient partitioning of data in any number of dimensions. This approach is common in "spatial" databases (e.g., Oracle Spatial, PostGIS, etc.), but probably not as useful in "regular" multi-indexed tables.
http://en.wikipedia.org/wiki/Kd-tree
It can use the (id,name,city) index to satisfy a "City = ? " predicate, but very very inefficently.
In order to use the index to satisfy this query it would need to walk most of tree structure looking for entries with the desired city. This is still probably an order of magnatude faster than scanning the table!
An index of (city,name,id) would be the best index for your query. It would find all the desired city entries easily and would not need to access the underlying table to get the id and name values.

primary key on very small table

I am having a very small tables with at most 5 records that holds some labels. I am using Postgres.
The structure is as follows:
id - smallint
label - varchar(100)
The table will be used mainly to reference the rows from other tables. The question is if it's really necessary to have a primary key on id or to have just an index on the id or have them both?
I did read about indexes and primary keys and I understand that this depends quite a lot on what's the table going to be used for:
Tables with no Primary Key
Edit: I was going to ask about having a primary key or an index or have them both. I edited the question.
It is always good practice to have a primary key column. The typical scenario it is needed is when you want to update or delete a row, having a PK makes it much easier and safer.
Yes, a primary key is not only good practice -- it's crucial. A table that lacks a unique key fails to be in First Normal Form.
You must declare a PRIMARY KEY or UNIQUE constraint if you want other tables to reference this one with a foreign key.
In most RDBMS brands, both PRIMARY KEY and UNIQUE constraints implicitly create an index on the column(s). If it doesn't do this implicitly, you may be required to define the index yourself before you can declare the constraint.
Yes, you will need a primary key on the id field, since you do not want two labels that share the same id.
You also want an index, to speed up the search/lookup process in this table (although for small tables there is less performance gain). The sequence will just help you fill in the next ID; it does not prevent you from changing a previous value into one that already exists.
There are very little reasons for creating an index instead of a primary key. AS Bill Karwin said, you won't save resources at all. And, as you may have already guessed, there is no need at all to create a new index if you have the primary key.
In some cases it may be hard to find a key candidate. But it doesn't seems to be the case and it clearly goes against some good practices.
By the way. As your table is so small most queries will rather use a full table scan even if there is an index. Don't worry you see a full table scan.
From the developer's point of view, PRIMARY KEY is just a combination of NOT NULL and UNIQUE INDEX on one of the columns.
A UNIQUE INDEX is good if:
You need to enforce uniqueness. Using index is the most efficient way to do that.
You need to perform a SELECT, UPDATE or DELETE with a WHERE condition that is selective on the indexed field, that is number of rows affected by the query is much less that total number of the rows (say, 10 rows of 2,000,000).
A UNIQUE INDEX is bad if:
You don't need uniqueness on this field, of course :) But you'll better have one unique index in the table, for each record to be identifiable.
You need fast INSERTS
You need fast UPDATE's and DELETE's of the indexed value with a WHERE condition that is not selective on the indexed field, that is number of rows affected by the query is comparable to the total number of the rows (say, 1,500,000 rows of 2,000,000).
Given that you are going to have a little table, I'd advice you to create a PRIMARY KEY on it.

Does selecting only indexed attributes result in faster queries?

When performing a query where the attributes selected make up the components of an index does that result in a faster query? I would imagine that the query planner/optimizer could see that the requested columns could be satisfied completely by the index scan.
Trivial Example
CREATE TABLE "liked" (
"id" BIGINT NOT NULL DEFAULT nextval('liked_id_seq'),
"userid" BIGINT NOT NULL,
"storyid" BIGINT NOT NULL,
"notes" TEXT,
PRIMARY KEY ("id")
);
CREATE INDEX "liked_user" ON "liked" (
"userid",
"storyid"
);
ALTER TABLE "liked" ADD FOREIGN KEY ("userid") REFERENCES "users" ("id") ON DELETE CASCADE;
ALTER TABLE "liked" ADD FOREIGN KEY ("storyid") REFERENCES "story" ("id") ON DELETE CASCADE;
SELECT storyid from liked where userid = 1;
With the query above there isn't any data external to what is already contained in the liked_user index so I would imagine there would be less actions if the query optimizer could infer that the resulting tuples could be satisfied by the index alone.
It's called a "covering index", and it improves the efficiency somewhat, by varying amounts depending on which DBMS you are using (and if you are using MySQL, which flavor of storage).
Try giving an example of a specific situation, if you have one.
In general, yes, but it depends on how you are accessing them. Using LIKE to key off a match in the middle out of a string field isn't going to be any faster with an index.
Not so much, in my experience. You speed up queries by optimizing their conditions, and trying to use the best possible index in those conditions. There are many ways to slow down a query based on what you are selecting, such as using subqueries, perhaps some UDFs--and of course you can slow down queries using some less-than-optimal joins.
It can do, but there are some caveats. These comments are based on Oracle, btw.
For example, SELECT COL1 FROM MY_TABLE might be able to use an index, but if all the columns of the index are nullable then there might be rows not included in a regular btree index, so the index might not be used.
It's also possible that an index might be larger (and therefore more costly to ful scan) than the underlying table (for example where the table only has a single column) because the index has to include a rowid for every entry as well as the column values. In that case, unless the query can leverage the index information in some special way (for example you are including an ORDER BY clause that the index can supply without the need for a sort) then the index might not be used.
You ought to also look into the various index access methods that the RDBMS can use in order to understand their strengths and weaknesses. In Oracle these would generally be INDEX RANGE SCAN, FULL INDEX SCAN, FAST FULL INDEX SCAN, and INDEX SKIP SCAN. This knowledge will help you understand whether an index could be used and in what way.

Resources