What is the exact difference between multilevel indexing and secondary indexing in a DBMS? They appear to be the same thing but are classified differently.
Secondary Index:
A secondary index in a DBMS can be built on a field that has a unique value for each record, and that field should be a candidate key. It is also known as a non-clustering index.
Multilevel Index:
A multilevel index is created when a primary index does not fit in memory. This method reduces the number of disk accesses needed to locate any record: the primary index is kept on disk as a sequential file, and a sparse index is built on that file.
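A rough Python sketch of the multilevel idea may help here. It is only an illustration under stated assumptions: `BLOCK_SIZE`, `build_levels` and `lookup` are hypothetical names, and lists in memory stand in for blocks on disk. Each level is a sparse index holding the first key of every block in the level below, and levels are added until the top one fits in a single block.

```python
from bisect import bisect_right

BLOCK_SIZE = 4  # hypothetical: number of entries that fit in one disk block


def build_levels(sorted_keys, block_size=BLOCK_SIZE):
    """Return the levels, bottom first. Each level above the first holds the
    first key of every block in the level below it (a sparse index)."""
    levels = [sorted_keys]
    while len(levels[-1]) > block_size:
        levels.append(levels[-1][::block_size])
    return levels


def lookup(levels, key, block_size=BLOCK_SIZE):
    """Descend from the top level, reading one block per level.
    Returns (found, blocks_read)."""
    blocks_read = 0
    block_no = 0  # the top level fits in a single block
    for level in reversed(levels):
        start = block_no * block_size
        block = level[start:start + block_size]
        blocks_read += 1
        if level is levels[0]:
            return key in block, blocks_read
        # Last entry in this block whose key is <= the search key.
        pos = bisect_right(block, key) - 1
        if pos < 0:
            return False, blocks_read
        block_no = start + pos  # entry number here = block number below


keys = list(range(0, 200, 2))  # 100 sorted keys
levels = build_levels(keys)
found, reads = lookup(levels, 42)
```

With 100 keys and 4 entries per block this builds four levels, so a lookup touches four blocks instead of scanning up to 25 data blocks.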
I have a table in SQL Server with a three-column clustered index.
I have a table with columns (CustomerID, A, ProductID, C, OtherID) and I have a clustered key on (OtherID, CustomerID, ProductID).
Is there a performance hit for that column order (in the table, not the index)? Or is there a hidden advantage to re-ordering the table so that the key columns are its first three columns: (OtherID, CustomerID, ProductID, A, C)?
Seems like it shouldn't be a big problem, but implementations can have hidden performance costs.
(I was looking for the cause of a performance issue we were having, and this was just one of those "It shouldn't be a problem, but maybe it could be a problem..." kind of guesses.)
I won't assume what type of clustered index we are talking about here, so I will try to cover all the basics. I would have to say that, logically, the impact (performance or otherwise) of the ordinal position of the columns within your table in relation to their ordinal position within the clustered index is inconsequential (unless someone out there has something to prove me wrong).
Rowstore
Keep in mind that your table data and rowstore clustered indexes end up becoming separate logical structures. Per Microsoft regarding the clustered rowstore index architecture:
indexes are organized as B-Trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom nodes in the index are called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows.
So when we are talking about the physical storage of both the clustered index and the table data, we can think of them as separate structures. The same documentation page illustrates this with a diagram of the root, intermediate, and leaf levels.
All three of these levels have at least one thing in common. They are all storing values (more or less) logically sorted by the value of your clustered index. Regardless of the ordinal position of the columns within your table structure, the leaf pages for your table data will be stored logically ordered by the columns/values within your clustered index. This is also true of your intermediate pages, which represent the storage of your clustered index values.
So, all of that to say: the ordinal position of your columns within the clustered index is what determines how both the intermediate-level and leaf pages are logically ordered. The ordinal position of those columns within your table definition has no impact on their storage order, because that order comes from the clustered index.
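To make that concrete, here is a small Python sketch. It is a toy model, not how SQL Server actually lays out pages: `leaf_order` is a hypothetical helper that sorts rows by the clustered key, and the sample values are made up. The same rows are declared with two different column orders and end up in the same key order.

```python
KEY = ("OtherID", "CustomerID", "ProductID")  # clustered key from the question


def leaf_order(rows, key=KEY):
    """Return rows in the order a clustered index on `key` would keep them."""
    return sorted(rows, key=lambda row: tuple(row[col] for col in key))


layout_a = ("CustomerID", "A", "ProductID", "C", "OtherID")   # current table
layout_b = ("OtherID", "CustomerID", "ProductID", "A", "C")   # re-ordered table

# Made-up sample values, listed in layout_a order.
data = [(7, "x", 3, "p", 2), (5, "y", 9, "q", 1), (5, "z", 1, "r", 2)]
rows_a = [dict(zip(layout_a, values)) for values in data]
rows_b = [{col: row[col] for col in layout_b} for row in rows_a]

keys_a = [tuple(r[c] for c in KEY) for r in leaf_order(rows_a)]
keys_b = [tuple(r[c] for c in KEY) for r in leaf_order(rows_b)]
```

Both `keys_a` and `keys_b` come out sorted by (OtherID, CustomerID, ProductID); the declaration order of the columns never enters the sort.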
Columnstore
Regarding clustered columnstore indexes, I would again say that it has no impact, but for a different (and simpler) reason. The columnstore index breaks up the column values into separate logical structures, which have no relation to each other by way of their ordinal position. So regardless of the column's ordinal position within the table, when you query a value from a column you are querying the separate physical structure that represents that column's values (ignoring the deltastore for simplicity here). Similarly, when you query multiple columns' values, you are querying each individual logical structure that represents each column's values separately.
This is why you are not even able to specify a column list when creating a clustered columnstore index. The ordinal position of the columns within the columnstore index itself has no impact, so I'd imagine that the ordinal position of those columns within the table itself (or any relationship between the two) also has no impact.
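A toy Python model of column-wise storage shows why the order does not matter. This is only a sketch: `to_columns` is a hypothetical helper, and it ignores rowgroups, compression and the deltastore entirely.

```python
def to_columns(rows, column_names):
    """Split row tuples into one list per column (a toy column-wise layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


rows = [(7, "x", 3), (5, "y", 9), (5, "z", 1)]  # made-up sample values

# The same data declared with two different column orders.
store = to_columns(rows, ("CustomerID", "A", "ProductID"))
reordered = to_columns(
    [(r[2], r[0], r[1]) for r in rows], ("ProductID", "CustomerID", "A")
)

# Aggregating one column touches only that column's structure.
total = sum(store["ProductID"])
```

`store` and `reordered` hold identical per-column structures, and the aggregate reads only the `ProductID` list.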
Heap
Lastly, should anyone else ask, even with tables stored as a heap I would still argue that the ordinal position of columns within the table has no impact on query performance. Under the hood, a heap is still a rowstore structure: its rows sit on data pages in the same row-wise format, just without a clustered index imposing a key order on them.
Per Microsoft:
A rowstore is data that is logically organized as a table with rows and columns, and then physically stored in a row-wise data format. This has been the traditional way to store relational table data such as a heap or clustered B-tree index.
So heaps use the same row-wise storage as a table with a clustered index. The main difference is that the rows are not kept in any key order, and each row is identified not by a business value but by a locator that exists only to identify the row. As described by Microsoft:
If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
This RID is not something you would ever normally use as a predicate in a query, which is the main disadvantage (since data is made to be queried, right?). But regardless, the ordinal position of these columns within your table still has no impact on how the rows are actually stored, so I can't imagine that it could impact your query performance.
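The RID described in the quote can be pictured with a short Python sketch. Everything here is hypothetical (the `heap` dictionary, `fetch`, `find_by_name`, the sample rows); it only mirrors the (file, page, slot) shape of the pointer.

```python
from typing import NamedTuple


class RID(NamedTuple):
    file_id: int
    page: int
    slot: int


# Hypothetical heap: file -> page -> rows, in no particular key order.
heap = {1: {0: [("carol", 30), ("alice", 25)], 1: [("bob", 41)]}}


def fetch(rid):
    """Direct lookup by physical location."""
    return heap[rid.file_id][rid.page][rid.slot]


def find_by_name(name):
    """Without an index, finding a value means scanning every page."""
    for file_id, pages in heap.items():
        for page_no, rows in pages.items():
            for slot, row in enumerate(rows):
                if row[0] == name:
                    return RID(file_id, page_no, slot)
    return None


bob_rid = find_by_name("bob")
alice = fetch(RID(1, 0, 1))
```

Note that neither function cares in which order the columns sit inside a row; only the row's location matters.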
From Database System Concepts
We use the term hash index to denote hash file structures as well as secondary hash indices. Strictly speaking, hash indices are only secondary index structures.
A hash index is never needed as a clustering index structure, since, if a file itself is organized by hashing, there is no need for a separate hash index structure on it. However, since hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a clustering hash index on it.
Is "secondary index" the same concept as "nonclustering index" (which is what I understood from the book)?
Is a hash index never a clustering index or not?
Could you rephrase or explain why the reason "A hash index is never needed as a clustering index structure" is "if a file itself is organized by hashing, there is no need for a separate hash index structure on it"? What about "if a file itself is not organized by hashing"?
Thanks.
The text tries to explain something but unfortunately creates more confusion than it resolves.
At the logical level, database tables (correct term : "relations") are made up of rows (correct term : "tuples") which represent facts about the real world the db is aimed to represent/reflect. Don't ever call those rows/tuples "records" because "records" is a concept pertaining to the physical level, which is distinct from the logical.
Typically, but this is not a universal law cast in stone, you will find that the physical organization consists of a "main" datastore which has a record for each tuple and where that record contains each and every attribute (column) value of the tuple (row). (That's unless there are LOBs in play or so.) Those records must be given a physical location in the store they are stored in and this is usually/typically done using a B-tree on the primary key values. This facilitates :
retrieving only specific [tuples/rows with] primary key values from the relation/table.
traversing the [tuples of] relation in-order of primary key values
retrieving only [tuples/rows within] specific ranges of primary key values from the relation/table.
This B-tree on the primary key values is typically called the "clustering" index.
Often, there is also a frequent need for retrieving only [tuples/rows with] specific values of attributes that are not the primary key. If that needs to be done as efficiently/fast as it can for values of the primary key, we use similar indexes that are then sometimes called "secondary". Those indexes typically do not contain all the attribute/column values of the tuple/row indexed, but only the attribute values to be indexed plus a mention of the primary key value (so we can find the rest of the attributes in the "main" datastore).
Those "secondary" indexes will mostly also be B-tree indexes which will permit in-order traversal for the attributes being indexed, but they can potentially also be hashing indexes, which permit only to look up tuples/rows using equality comparisons with a given key value ("key" = index key, nothing to do with the keys on the relation/table, though obviously for most keys on the table/relation, there will be a dedicated index too where the index key has the same attributes as the table key it supports).
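As a minimal Python sketch of that arrangement (all names are hypothetical; a dict stands in for the main datastore and a sorted list stands in for a B-tree): the secondary index stores only the indexed attribute plus the primary key, and the rest of the row is fetched from the main store.

```python
from bisect import bisect_left, insort

main_store = {}  # primary key -> full record (stand-in for the main datastore)
by_city = []     # sorted (city, primary key) pairs: the secondary index


def insert(pk, name, city):
    main_store[pk] = {"id": pk, "name": name, "city": city}
    insort(by_city, (city, pk))  # index entry holds only city + primary key


def find_by_city(city):
    """Look up the secondary index, then fetch full rows by primary key."""
    i = bisect_left(by_city, (city,))  # first entry with this city, if any
    result = []
    while i < len(by_city) and by_city[i][0] == city:
        result.append(main_store[by_city[i][1]])
        i += 1
    return result


insert(1, "Ann", "Ghent")
insert(2, "Bo", "Oslo")
insert(3, "Cy", "Ghent")
ghent_names = [row["name"] for row in find_by_city("Ghent")]
```

Because the index is sorted, it would also allow in-order traversal by city; a hash-based secondary index would keep the equality lookup but lose that.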
Finally, there is no theoretical reason why a "primary" (/"clustered") index could not be a hash index (the text kinda suggests the opposite but that is plain wrong). But given the poor level of the explanation in your textbook, it is probably not expected of you to be taught that.
Also note that there are still other ways to physically organize a database than just using B-tree or hash indexes.
So to sum up :
"Clustered" usually refers to the index on the primary data records store
and is usually a B-tree [or some such] on the primary key
and the textbook presumably does not want you to know about more advanced possibilities
"Secondary" usually refers to additional indexes that provide additional "fast access to specific tuples/rows"
and is usually also a B-tree that permits in-order traversal just like the "clustered"/"primary" index
but can also be a hash index that permits only "access by given value" but no in-order traversal.
Hope it helps.
I will try to oversimplify just to point where your confusion is.
There are different types of index organisation:
Clustered
Non Clustered
Each of them may use one of the following file structures:
Sequential File organisation
Hash file organisation
We can have clustered indexes and non clustered indexes using hash file organisations.
Your textbook assumes that clustered indexes are used only on primary keys.
It also supposes that hash indexes, which I suppose is referring to a non-clustered index using hash file organisation, are only used for secondary indexes (non primary-key fields).
But you can actually have clustered indexes on primary keys and non-primary keys. Maybe it is a simplification done for the sake of comprehension, or it is based on a specific implementation of a DB.
Is it possible to create a non-clustered index which is not unique? What data structure is used to implement non-clustered indexes?
Assuming you are talking about SQL Server then simply don't specify UNIQUE when creating the index.
CREATE /*UNIQUE*/ NONCLUSTERED INDEX IX ON T(C)
As UNIQUE is commented out above, this does not enforce uniqueness on the C column. But in fact the index will still be made unique behind the scenes, by adding the (unique) row locator into the non-clustered index key.
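If you want to see the behaviour without a SQL Server instance, here is a small sketch using Python's built-in sqlite3 module. Keep in mind the assumption: this is SQLite, not SQL Server, so there is no NONCLUSTERED keyword and the internals differ; it only shows that leaving out UNIQUE allows duplicates while including it rejects them. The table U and index UX are names added for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Index without UNIQUE: duplicate values of C are accepted.
conn.execute("CREATE TABLE T (C INTEGER)")
conn.execute("CREATE INDEX IX ON T(C)")
conn.executemany("INSERT INTO T (C) VALUES (?)", [(1,), (1,), (2,)])
duplicates = conn.execute("SELECT COUNT(*) FROM T WHERE C = 1").fetchone()[0]

# Same shape with UNIQUE: the second identical value is rejected.
conn.execute("CREATE TABLE U (C INTEGER)")
conn.execute("CREATE UNIQUE INDEX UX ON U(C)")
conn.execute("INSERT INTO U (C) VALUES (1)")
try:
    conn.execute("INSERT INTO U (C) VALUES (1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```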
Regarding data structure both clustered and non clustered indexes are B+ trees.
As stated by Martin Smith, the indexes don't need to be logically unique, but SQL Server makes them physically unique anyway: a non-unique clustered index gets a hidden 4-byte 'uniquifier' on duplicate key values, and a non-unique non-clustered index gets the row locator added to its key.
In terms of structural difference, non-clustered index rows carry a row locator: the clustered index key, or the RID pointing into the heap if you haven't created a clustered index.
You should note that while both are B-trees, there are other differences. The leaf level of a non-clustered index holds only index rows rather than full data rows, so the index is smaller and can have fewer levels. Reading from a non-clustered index can therefore be faster than reading from the clustered index, provided the data required is available in the leaf nodes (the columns required are in the key of the index).
Here's the clustered index structure from Books Online:
http://technet.microsoft.com/en-us/library/ms177443(v=sql.105).aspx
Here's the non-clustered index structure:
http://technet.microsoft.com/en-gb/library/ms177484(v=sql.105).aspx
So reading from a 'covered' non-clustered index can be faster. Each level traversed costs one page read, so when the non-clustered index has fewer levels to get to the data you incur fewer logical reads, which in turn means fewer physical disk reads and less work for the CPU.
You should also consider that a covering index holding only the columns a specific query needs means fewer total pages have to be read to fetch the data, which gives faster reads. Be aware, though, that the more indexes you have, the more cost your writes will incur.
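A quick way to see what 'covering' means is to ask a query planner. The sketch below uses Python's sqlite3 module as a stand-in, so it says nothing about SQL Server's actual page reads; the table t, the index ix_c_d and the helper `plan` are made up for the illustration. When every column the query needs is in the index, SQLite reports a covering index; when a column is missing, it has to go back to the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, c INTEGER, d INTEGER, e TEXT)"
)
conn.execute("CREATE INDEX ix_c_d ON t (c, d)")


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


covered = plan("SELECT d FROM t WHERE c = 1")      # c and d are both in the index
not_covered = plan("SELECT e FROM t WHERE c = 1")  # e must come from the table
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the phrase COVERING INDEX than to compare whole strings.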
Can HashTables be used to create indexes in databases? What is the ideal Data structure to create indexes?
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
Can HashTables be used to create indexes in databases?
Some DBMSes support hash-based indexes, some don't.
What is the ideal Data structure to create indexes?
No data structure occupies 0 bytes, nor can it be manipulated in 0 CPU cycles, therefore no data structure is "ideal". It is up to us, the software engineers, to decide which data structure has the most benefits and the fewest detriments for the specific goal we are trying to accomplish.
For example, B-Trees are useful for range scans and hash indexes aren't. Does that mean the B-Trees are "better"? Well, they are if you need range scans, but may not necessarily be if you don't.
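A tiny Python sketch of that trade-off, with a sorted list standing in for B-tree leaf order and a dict standing in for a hash index (the names and sample keys are made up):

```python
from bisect import bisect_left, bisect_right

keys = [3, 8, 15, 23, 42, 57]           # kept sorted, like B-tree leaf order
hashed = {k: f"row-{k}" for k in keys}  # hash index: key -> row


def range_sorted(lo, hi):
    """Range scan on sorted keys: two binary searches, then a slice."""
    return keys[bisect_left(keys, lo):bisect_right(keys, hi)]


def range_hashed(lo, hi):
    """A hash index has no key order, so a range means checking every key."""
    return sorted(k for k in hashed if lo <= k <= hi)


in_range = range_sorted(10, 45)
exact = hashed[23]  # equality lookup is where the hash index shines
```

Both structures return the same range, but the hash version has to visit every key to get it.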
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
You cannot normally have a foreign key toward another database, only toward another table.
And yes, it tends to help, since every time a row is updated or deleted in the parent table, the child table needs to be searched to see if the FK would be violated. This search can benefit significantly from such an index. Many (but not all) DBMSes require an index on the FK (and might even create it automatically if it is not already there).
OTOH, if you only add rows to the parent table, you could consider leaving the child table unindexed on FK fields (assuming your DBMS allows you to do so).
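To illustrate the search the DBMS has to do on the child table, here is a sketch with Python's sqlite3 module. It is an approximation under assumptions: the tables, the index name and the helper `plan` are hypothetical, and the SELECT only imitates the kind of lookup a foreign key check performs. Without an index on the FK column the planner scans the whole child table; with one it can search the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES parent(id))"
)


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Roughly the lookup needed when parent row 1 is updated or deleted.
check = "SELECT id FROM child WHERE parent_id = 1"

before = plan(check)  # no index on child.parent_id yet
conn.execute("CREATE INDEX ix_child_parent ON child (parent_id)")
after = plan(check)   # now the index can be used
```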
Oracle Perspective
Oracle supports clustering by hash value, either for single or multiple tables. This physically colocates rows having the same hash value for the cluster columns, and is faster than accessing via an index. There are disadvantages due to increased complexity and a certain need for preplanning.
You could also use a function-based index to index based on a hash function applied to one or more columns. I'm not sure what the advantage of that would be though.
Foreign key columns in Oracle generally benefit from indexing due to the obvious performance advantages.
In a non-clustered index, each entry is of fixed length, so the database can use binary search to locate the record address in O(log n) time.
Since tables have variable-length records, and a clustered index uses the underlying table itself for the search (or am I wrong?), how does the database find a record for a specific key in O(log n) time?
each entry is of fixed length
Not true for real-world databases.
Rows are split into groups called pages. Pages have a fixed size (~8KB). They form a tree structure with the top levels linking to the physical location of the bottom level pages.
That allows the tree to be traversed top-to-bottom, entering the relevant branch at each step.
Clustered indexes typically have exactly the same physical structure as non-clustered indexes.
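Here is a rough Python sketch of why variable-length rows are not a problem once they are grouped into pages. It is a simplification under assumptions: `PAGE_SIZE`, `pack_pages` and `lookup` are hypothetical, the page is tiny so the example stays readable, and there is a single upper level (a real tree adds more levels once that level outgrows a page).

```python
from bisect import bisect_right

PAGE_SIZE = 32  # hypothetical page capacity in bytes (real pages are ~8KB)


def pack_pages(rows, page_size=PAGE_SIZE):
    """Pack (key, payload) rows, sorted by key, into pages by byte size."""
    pages, current, used = [], [], 0
    for key, payload in sorted(rows):
        size = len(payload.encode())
        if current and used + size > page_size:
            pages.append(current)  # page is full, start a new one
            current, used = [], 0
        current.append((key, payload))
        used += size
    if current:
        pages.append(current)
    return pages


def lookup(pages, root, key):
    """`root` holds the first key of each page: pick the page, then search it."""
    page_no = bisect_right(root, key) - 1
    if page_no < 0:
        return None
    return dict(pages[page_no]).get(key)


# Rows of very different lengths.
rows = [(1, "a" * 5), (2, "b" * 30), (3, "c" * 12), (4, "d" * 20), (5, "e" * 3)]
pages = pack_pages(rows)
root = [page[0][0] for page in pages]  # upper level: first key of each page
value = lookup(pages, root, 4)
```

The binary search runs over the page-level keys, not over the rows themselves, so the row lengths never enter into it.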