Can hash tables be used to create indexes in databases? What is the ideal data structure for creating indexes?
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
Can hash tables be used to create indexes in databases?
Some DBMSes support hash-based indexes, some don't.
What is the ideal data structure for creating indexes?
No data structure occupies 0 bytes, nor can it be manipulated in 0 CPU cycles, so no data structure is "ideal". It is up to us, the software engineers, to decide which data structure has the most benefits and the fewest drawbacks for the specific goal we are trying to accomplish.
For example, B-Trees are useful for range scans and hash indexes aren't. Does that mean B-Trees are "better"? They are if you need range scans, but not necessarily if you don't.
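As a minimal sketch (PostgreSQL syntax; the table and data are invented for illustration), both kinds of index are created the same way, but only the B-tree can serve the range predicate:

    -- Hypothetical table.
    CREATE TABLE events (
        id         bigint PRIMARY KEY,
        created_at timestamptz NOT NULL
    );

    -- B-tree (the default): handles equality and range predicates.
    CREATE INDEX events_created_btree ON events (created_at);
    SELECT count(*) FROM events
    WHERE created_at >= now() - interval '1 day';   -- range scan: B-tree eligible

    -- Hash: equality only; the range query above cannot use it.
    CREATE INDEX events_id_hash ON events USING hash (id);
    SELECT * FROM events WHERE id = 42;             -- equality: hash eligible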
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
You cannot normally have a foreign key referencing another database, only another table.
And yes, it tends to help, since every time a row is updated or deleted in the parent table, the child table needs to be searched to see whether the FK would be violated. This search can benefit significantly from such an index. Many (but not all) DBMSes require an index on the FK (and might even create it automatically if it is not already there).
OTOH, if you only add rows to the parent table, you could consider leaving the child table unindexed on FK fields (assuming your DBMS allows you to do so).
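A minimal sketch of that trade-off (generic SQL; the names are invented). The index on the referencing column is what turns the FK check on parent deletes/updates from a scan into a lookup:

    CREATE TABLE parent (
        id int PRIMARY KEY
    );

    CREATE TABLE child (
        id        int PRIMARY KEY,
        parent_id int NOT NULL REFERENCES parent (id)
    );

    -- Without this index, every DELETE/UPDATE on parent must scan child
    -- to verify no referencing row remains.
    CREATE INDEX child_parent_id_idx ON child (parent_id);

    DELETE FROM parent WHERE id = 1;   -- FK check becomes an index lookup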
Oracle Perspective
Oracle supports clustering by hash value, for either a single table or multiple tables. This physically colocates rows having the same hash value for the cluster columns, and is faster than accessing via an index. There are disadvantages due to increased complexity and a certain need for preplanning.
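A sketch in Oracle syntax (the names are invented, and SIZE/HASHKEYS are placeholder values; choosing them well is exactly the preplanning mentioned above):

    -- Single-table hash cluster keyed on order_id.
    CREATE CLUSTER orders_cluster (order_id NUMBER)
        SIZE 256            -- estimated bytes per cluster key
        SINGLE TABLE
        HASHKEYS 100000;    -- estimated number of distinct keys

    CREATE TABLE orders (
        order_id   NUMBER PRIMARY KEY,
        status     VARCHAR2(20),
        order_date DATE
    ) CLUSTER orders_cluster (order_id);

    -- A lookup by order_id hashes directly to the data block,
    -- with no separate index segment to read first.
    SELECT * FROM orders WHERE order_id = 42;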
You could also use a function-based index to index based on a hash function applied to one or more columns. I'm not sure what the advantage of that would be though.
Foreign key columns in Oracle generally benefit from indexing due to the obvious performance advantages.
Related
My question is: is it necessary for a relation/table in a database to have a candidate key, and hence a primary key? Is it possible to have a relation where a row cannot be uniquely identified by any combination of attributes?
If not, why? And if so, how does a DBMS make operations like search and delete efficient?
Relations always have distinct tuples, which means that in a relational DBMS a table always has at least one candidate key.
SQL is a different case. SQL tables are "tuple bags", not relations. SQL tables can have duplicate rows, which is one of SQL's biggest flaws. Despite the fact that SQL supports duplicate rows, the language is ill-suited to cope with them. In the presence of duplicate rows, the standard SQL UPDATE and DELETE, for instance, have no guaranteed way to reference individual rows without resorting to complex cursor-based operations.
Consequent problems of duplicate rows are certain inefficiencies and complexities of SQL DBMSs and a lack of orthogonality in their features. SQL DBMS engines have to use internal structures and support special features as a prerequisite in order to deal with duplicate rows. Some DBMS vendors try to get around the difficulties by disabling certain features for tables that don't have keys.
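A minimal illustration of the UPDATE/DELETE problem described above (standard SQL; the table is hypothetical):

    CREATE TABLE payments (amount int);   -- no key, so duplicates are allowed

    INSERT INTO payments VALUES (100);
    INSERT INTO payments VALUES (100);    -- accidental duplicate

    -- Intent: remove just ONE of the duplicates. But any WHERE clause
    -- that matches one row matches both, so this deletes both rows.
    DELETE FROM payments WHERE amount = 100;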
A database does not require a primary key. A table is just an unordered set of rows. Without any indexes, the only mechanism for accessing rows in a table is a full table scan (or a full partition scan, if the table is partitioned). Such operations are only efficient for very small numbers of rows.
Tables are more useful when you can refer to particular rows. Often, the best primary keys are auto-incremented/identity primary keys. These are maintained by the database. In practice, all tables in a well-designed database are going to have primary keys. Here are three reasons:
Rows can be referred to by other tables.
Individual rows can be updated and deleted.
Individual rows can be selected efficiently and unambiguously.
Note: you can have indexes on a table without a primary key, and combinations of one or more columns can be made unique even if the combination is not a primary key. The reverse does not hold: the primary key is itself backed by an index. Also, all rows in a table have "row addresses" which are unique; whether these are available to queries depends on the database engine.
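As an engine-specific illustration of such row addresses (PostgreSQL; the tiny table is hypothetical), the system column ctid holds each row's physical address and can even single out one of two identical rows:

    CREATE TABLE t (x int);   -- keyless table with duplicates
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (1);

    -- ctid is PostgreSQL's physical row address (block number, slot).
    -- It is not stable across VACUUM, so it's a diagnostic, not a key.
    SELECT ctid, x FROM t;

    -- Delete exactly one of the two identical rows by its address.
    DELETE FROM t
    WHERE ctid = (SELECT ctid FROM t WHERE x = 1 LIMIT 1);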
Yes, this is possible.
Just note that some identifier does exist behind the scenes (example from SQL Server):
When a table is stored as a heap, individual rows are identified by reference to a row identifier (RID) consisting of the file number, data page number, and slot on the page.
How will operations be performed?
A table scan will be needed for almost any operation:
If a table is a heap and does not have any nonclustered indexes, then the entire table must be examined (a table scan) to find any row.
Do database engines utilize foreign keys transparently, or should a query explicitly use them?
Based on my experience, there is no explicit notion of foreign keys on a table, apart from a constraint that maintains the key's integrity and the fact that the key (a single field or a group of fields) is a key, which makes searching efficient.
To clarify why this is important, here is an example: I have a middleware (ArcGIS in my case), for which I can control the back-end database (so I can create keys, indexes, etc.), and I usually use the front end (a RESTful API here). The middleware itself is a black box that is supposed to provide effective tools for taking advantage of the underlying DBMS's capabilities. So what I want to understand is: if I build foreign key constraints and run queries that would normally translate into queries using those foreign keys, should I see performance improvements?
Is that generally the case, or do various engines do it differently? (I am using PostgreSQL.)
Foreign keys aren't there to improve performance. They're there to enforce data integrity. They will decrease performance for inserts/updates/deletes, but they make no difference to queries.
Some DBMSs will automatically add an index to the foreign key field, which may be where the confusion is coming from. Postgres does not do this; you'll need to create the index yourself. (And yes, the database will use this index transparently.)
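One way to see the transparency in PostgreSQL (names invented; the actual plan depends on table sizes and statistics) is to index the FK column and inspect the plan of an ordinary join - the query never mentions the index:

    CREATE TABLE authors (
        id   int PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE books (
        id        int PRIMARY KEY,
        author_id int NOT NULL REFERENCES authors (id),
        title     text
    );

    -- Postgres does not create this automatically; do it yourself.
    CREATE INDEX books_author_id_idx ON books (author_id);

    -- The planner may pick an index scan on books_author_id_idx here,
    -- with no hint in the query (the choice depends on statistics).
    EXPLAIN
    SELECT b.title
    FROM   authors a
    JOIN   books   b ON b.author_id = a.id
    WHERE  a.name = 'Tolstoy';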
As far as I know, database engines need specific queries in order to use foreign keys. You have to write some sort of join query to get data from related tables.
However, some data access frameworks hide the complexity of accessing data through foreign keys by providing a transparent way of accessing data from related tables, though I am not sure that provides much improvement in performance.
This depends completely on the database engine.
In PostgreSQL constraints won't cause performance improvements directly, only indexes will do that.
CREATE INDEX is a PostgreSQL language extension. There are no provisions for indexes in the SQL standard.
However, adding some constraints will automatically create an index for the column(s) involved -- e.g. UNIQUE and PRIMARY KEY constraints create a btree index on the affected column(s).
The FOREIGN KEY constraint won't create indexes on the referencing column(s), but:
A foreign key must reference columns that either are a primary key or form a unique constraint. This means that the referenced columns always have an index (the one underlying the primary key or unique constraint); so checks on whether a referencing row has a match will be efficient. Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns too. Because this is not always needed, and there are many choices available on how to index, declaration of a foreign key constraint does not automatically create an index on the referencing columns.
Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like "regular" tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically, an index-organized table is an index without a table. There is a table object, which we can find in USER_TABLES, but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column, you have a possible candidate for INDEX ORGANIZED.
The main use case for an index-organized table is a table which is almost always accessed by its primary key and from which we always want to retrieve all the columns. In practice, index-organized tables are most likely to be reference data: code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly, if we find ourselves contemplating the need for additional indexes on the non-primary-key columns, then we're probably better off with a regular heap table. So, as most tables probably need additional indexes, most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE and END_DATE are common). Also, there is no prevailing access path -- we want to find all the students who take a class as often as we want to find all the classes a student is taking -- so we need an indexing strategy which supports both equally well. I'm not saying intersection tables are not a use case for IOTs, just that they are not automatically so.
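For reference, the Oracle DDL for both cases discussed here is just a clause on CREATE TABLE (names are invented; an IOT must have a primary key, since the key defines the storage order):

    -- Classic lookup-table candidate: accessed by key, all columns wanted.
    CREATE TABLE order_status (
        status_id   NUMBER       PRIMARY KEY,
        description VARCHAR2(30) NOT NULL
    ) ORGANIZATION INDEX;

    -- An intersection table whose whole projection is its candidate key.
    CREATE TABLE students_classes (
        student_id NUMBER,
        class_id   NUMBER,
        CONSTRAINT students_classes_pk PRIMARY KEY (student_id, class_id)
    ) ORGANIZATION INDEX;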
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial (see "Overview of Oracle Spatial"), and OLAP applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and then asks whether an IOT would perform better than a heap-organised table. Tom's response is:
we can hypothesize all day long, but until you measure it, you'll never know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations on which other database features can and cannot be used with index-organized tables -- I recall that in at least one version you could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index, this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single-table hash cluster. When correctly created, they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical I/Os to locate the leaf node.
I can't comment on IOTs per se; however, if I'm reading this right, they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing, if it's not a primary key) is likely to be distributed fairly randomly, as such inserts can result in many page splits (expensive).
Identity columns (sequences in Oracle?) and dates 'around the current date' tend to make good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index-organized tables (IOTs) are indexes which actually hold the data being indexed, unlike ordinary indexes, which are stored somewhere else and hold links to the actual data.
If I have a lookup table with very few records in it (say, less than ten), should I bother putting an index on the Foreign Key of another table to which it is attached? For that matter, does the lookup table even need an index on the Primary Key?
Specifically, is there any performance benefit that outweighs the overhead of maintaining the indexes? If not, are there any benefits other than speed?
Note: an example of a lookup table might be Order Status, where the tuples are:
1 - Order Received
2 - In Process
3 - Shipped
4 - Paid
On a transactional system there may be no significant benefit to putting an index on such a column (i.e. a low-cardinality reference column), as the query optimiser probably won't use it. It will also generate additional disk traffic on writes to the table, as the indexes have to be updated. So for low-cardinality FKs on a transactional database it is usually better not to index the columns. This particularly applies to high-volume systems.
Note that you may still want the FK for referential integrity and that the FK lookup on a small reference table will probably generate no I/O as the lookup table will almost always be cached.
However, you may find that you want to include the column in a composite index for some reason - perhaps to create a covering index for a commonly used query.
On a table that is frequently bulk-loaded (e.g. a data warehouse) the index write traffic will be much larger than that of the table load if you have many indexed columns. You will probably need to drop or disable the FKs and indexes for a bulk load if any indexes are present.
On a Star Schema you can get some benefit from indexing low cardinality columns, even on SQL Server. If you are doing a highly selective query (i.e. one where the query optimiser decides that the row set returned will be small) then it can do a 'star query' plan where it uses a technique known as index intersection.
Generally, query plans on a star schema should be based around a table scan of the fact table or a highly selective process that bookmarks the fact table and then returns a smaller set of rows. Index intersection is efficient for the latter type of query as the selection can be resolved before doing any I/O on the fact table.
Bitmap indexes are a real win for low cardinality columns on platforms such as Oracle that support them, but SQL Server does not. Even so, low cardinality indexes can still participate in star query plans on SQL Server.
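For comparison on the Oracle side (hypothetical star-schema names), the bitmap indexes mentioned above are created like ordinary indexes, and the optimizer can combine several of them before touching the fact table:

    -- Hypothetical fact table in a star schema.
    CREATE TABLE sales (
        sale_id         NUMBER PRIMARY KEY,
        order_status_id NUMBER NOT NULL,
        region_id       NUMBER NOT NULL,
        amount          NUMBER
    );

    -- Bitmap indexes suit low-cardinality columns (few distinct values).
    CREATE BITMAP INDEX sales_status_bix ON sales (order_status_id);
    CREATE BITMAP INDEX sales_region_bix ON sales (region_id);

    -- The optimizer can AND the two bitmaps together and resolve the
    -- selection before doing any I/O on the fact table itself.
    SELECT SUM(amount)
    FROM   sales
    WHERE  order_status_id = 3
    AND    region_id = 7;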
Yes, always have an index.
The query optimizer of a modern database management system (DBMS) will determine which is faster: (1) reading from an index on a column, or (2) performing a full table scan.
The table size (in number of rows) needs to be "large enough" for use of the index to be considered.
Yes to both. Always index as a rule of thumb.
Points:
You also can't set up an FK without a unique index on the lookup table
What if you want to delete or update in the lookup table? Especially accidentally...
However, saying that, we don't always.
We have a very busy OLTP table (5 million+ rows per day) with several parent tables. We only have indexes on the FK columns where we need them. We assume no deletes/key updates on some parent tables, so we reduce the amount of work needed and the disk space used.
We used the SQL Server 2005 DMVs to establish that the indexes weren't used. We still have the FKs in place, though.
My personal opinion is that you should... it may be small now, but ALWAYS anticipate your tables growing in size. A good database schema will grow easily with more records. Foreign keys are almost always a good idea.
In SQL Server, the primary key becomes the clustered index if there isn't already one (a clustered index, that is).
I have several tables whose only unique data is a uniqueidentifier (a GUID) column. Because GUIDs are non-sequential (and they're client-side generated, so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is no true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is that most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID(), on the other hand, does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question, then you might as well make the PK a uniqueidentifier and flag the GUID field as ROWGUIDCOL. Replication will require a uniquely valued GUID field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit of using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but among all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid GUID now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new GUID from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
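Putting those points together, a replication-friendly GUID key might look like this in T-SQL (a hedged sketch; the table and constraint names are invented, and the DEFAULT only fires when the client does not supply its own GUID):

    CREATE TABLE Customer (
        CustomerId uniqueidentifier ROWGUIDCOL NOT NULL
            CONSTRAINT DF_Customer_Id DEFAULT NEWID(),
        Name       nvarchar(100) NOT NULL,
        -- Non-clustered, so random GUIDs don't dictate physical row order.
        CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerId)
    );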
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
documents the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, the two concepts are separate and should be considered separately. A CIX indicates the way the data is stored and is what NCIXs refer to, whereas the PK provides uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
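A short T-SQL sketch of that separation (names are invented): the PK carries the logical uniqueness while the clustered index lives on a different, ever-increasing column:

    -- Logical requirement: each row uniquely identified by its GUID.
    CREATE TABLE Orders (
        OrderGuid uniqueidentifier NOT NULL,
        OrderDate datetime NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderGuid)
    );

    -- Physical storage: cluster on the ever-increasing date, so inserts
    -- append at the end instead of splitting random pages.
    CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON Orders (OrderDate);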
Nobody has answered the actual question: what are the pluses/minuses of a table with NO PK NOR a CLUSTERED index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table - with NO clustered index, NO constraints, NO foreign keys, NO defaults, and NO primary key, in a database with the Simple recovery model - is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety), you may want to add non-clustered, non-unique indexes as needed, but keep them to a minimum.
I, too, have always heard that having an auto-incrementing int is good for performance, even if you don't actually use it.
A Primary Key needn't be an auto-incrementing field; in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't, you might need to normalise.)
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key but non-clustered, since you can't use newsequentialid. That strikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause the people who maintain the system to misunderstand the FK relationships, introducing bugs.