Does selecting only indexed attributes result in faster queries? - database

When performing a query where the attributes selected make up the components of an index, does that result in a faster query? I would imagine that the query planner/optimizer could see that the requested columns can be satisfied completely by an index scan.
Trivial Example
CREATE TABLE "liked" (
    "id" BIGINT NOT NULL DEFAULT nextval('liked_id_seq'),
    "userid" BIGINT NOT NULL,
    "storyid" BIGINT NOT NULL,
    "notes" TEXT,
    PRIMARY KEY ("id")
);
CREATE INDEX "liked_user" ON "liked" (
    "userid",
    "storyid"
);
ALTER TABLE "liked" ADD FOREIGN KEY ("userid") REFERENCES "users" ("id") ON DELETE CASCADE;
ALTER TABLE "liked" ADD FOREIGN KEY ("storyid") REFERENCES "story" ("id") ON DELETE CASCADE;
SELECT storyid FROM liked WHERE userid = 1;
With the query above, no data outside of what is already contained in the liked_user index is needed, so I would imagine there would be fewer operations if the query optimizer could infer that the resulting tuples can be satisfied by the index alone.

It's called a "covering index", and it improves efficiency somewhat, by varying amounts depending on which DBMS you are using (and, if you are using MySQL, which storage engine).
Try giving an example of a specific situation, if you have one.
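For what it's worth, in PostgreSQL (which the question's DDL resembles) you can confirm whether the covering index alone satisfies the query by looking at the plan; a minimal sketch against the schema above:
EXPLAIN SELECT storyid FROM liked WHERE userid = 1;
-- An "Index Only Scan using liked_user on liked" node in the output means
-- the heap was never touched (modulo visibility-map checks on recently
-- modified pages); a plain "Index Scan" means it still visited the table.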

In general, yes, but it depends on how you are accessing them. Using LIKE to match against the middle of a string field isn't going to be any faster with an index.

Not so much, in my experience. You speed up queries by optimizing their conditions and making sure those conditions can use the best possible index. There are many ways to slow down a query based on what you are selecting, such as using subqueries or perhaps some UDFs, and of course you can slow down queries with less-than-optimal joins.

It can do, but there are some caveats. These comments are based on Oracle, btw.
For example, SELECT COL1 FROM MY_TABLE might be able to use an index, but if all the columns of the index are nullable then there might be rows not included in a regular b-tree index, so the index might not be used.
It's also possible for an index to be larger (and therefore more costly to full-scan) than the underlying table (for example, where the table has only a single column), because the index has to include a rowid for every entry as well as the column values. In that case, unless the query can leverage the index in some special way (for example, an ORDER BY clause that the index can satisfy without a sort), the index might not be used.
You ought to also look into the various index access methods that the RDBMS can use in order to understand their strengths and weaknesses. In Oracle these would generally be INDEX RANGE SCAN, FULL INDEX SCAN, FAST FULL INDEX SCAN, and INDEX SKIP SCAN. This knowledge will help you understand whether an index could be used and in what way.
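A quick way to see which of those access methods the optimizer picked, sketched here with the placeholder names from above (MY_TABLE/COL1 are not real objects):
EXPLAIN PLAN FOR
    SELECT col1 FROM my_table ORDER BY col1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- The Operation column of the plan output will show e.g. INDEX FULL SCAN,
-- INDEX FAST FULL SCAN or INDEX SKIP SCAN if the index was chosen.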

Related

Cassandra - search by clustered key

This is my diseases table definition:
CREATE TABLE diseases (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (drugid, id)
);
Now I want to be able to search by the id column only (all values in this column are unique). The primary key above was created to make searching by drug fast.
So what would be the best solution for filtering this table by id? Creating a new table? Passing the additional value (drugid) to the SELECT? Is querying with only id even an option?
Thanks for any help :)
Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately, filtering by just the id is not possible unless you create a secondary index on it, which wouldn't be very good, since it could trigger a full cluster scan.
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that has id as its partition key (see the sketch below); in this case you will need to maintain both tables;
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
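A minimal sketch of the second option, assuming the diseases definition from the question (diseases_by_id is an invented name; your application must write to both tables on every change):
CREATE TABLE diseases_by_id (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (id)
);
-- This now reads a single partition from a single replica set:
SELECT * FROM diseases_by_id WHERE id = ?;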
Should you use a secondary index?
When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster, and there are performance implications when an index is built over a column with many distinct values. Here is some more reading on the matter - Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by @doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have more or less distinct values, so you might run into performance issues, since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.
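If the default doesn't suit you, the sort direction is set at table-creation time; a sketch reusing the question's definition, where only the WITH clause is new:
CREATE TABLE diseases (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (drugid, id)
) WITH CLUSTERING ORDER BY (id DESC);
-- rows within each drugid partition are now ordered by id, descending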
I would suggest some more reading - When not to use an index and Using a secondary index

Proper table design for sparse primary key

In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now, I need to store information about these entities, but because they are created based on rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) Primary key on a varchar column with a clustered index. According to various sources, this would be bad because of fragmentation and the wideness of the key; the format is also pretty awkward for sorting.
2) Primary key on a varchar column without a clustered index (heap table). Also a bad idea according to various sources, due to indexing and fragmentation issues.
3) Identity int column with a clustered index, and the varchar column as primary key with a unique index. I can't really see the benefit of the surrogate key here, since it would mainly help with range queries and ordering, and I would never query the table on this key because it would be unknown at all times.
4) Composite key of two columns: rule id + rule index. Now I don't have strings, but I have two columns that will be copied to FKs and nonclustered indexes. Also, I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
Edit:
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least the rule id.
If I use a surrogate primary key and a unique index on (rule id, index), then I can use the surrogate for subsequent operations after retrieving data by rule id, which would be faster; inserts would also be faster.
However, because the data will be stored according to the surrogate key, records with the same rule id but a different index might be stored quite far apart on disk, which means that even with an index on rule id, retrieving the data could be somewhat slow.
If I use (rule id, index) as a clustered primary key, rows with the same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. Even so, you will need a very solid reason for not having a clustered index (any one will make things better, even one on an identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
Having a composite primary key (and, subsequently, composite foreign keys) is completely acceptable, especially when dealing with natural keys like the one you have. This will give you the narrowest possible key - int + int or some such - while eliminating the sorting issue at the same time. I would recommend making this PK clustered to reduce additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or no. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
If partial key matches will take place (filtering by one part of the key but not the other), the part used most often should go first;
If No. 1 isn't applicable and all parts of the key are used in all queries, the column with the highest cardinality should go first.
The order of the remaining columns (if there is more than one) isn't of much importance, because SQL Server only creates distribution statistics for the first column of a composite index. However, it is a good idea to list them in order of decreasing cardinality.
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
    RuleId int not null,
    IndexId int not null,
    -- Remaining columns listed here
    EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables referencing this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess you could otherwise avoid. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in far fewer page splits, making inserts faster (despite having one more index compared to Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
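To make the foreign-key convenience concrete, here is a hypothetical child table (TempEntityNotes is an invented name) referencing the surrogate key from Option 2:
-- Hypothetical child table; it references the narrow surrogate key,
-- so a rule/index renumbering never cascades into it
create table dbo.TempEntityNotes (
    NoteId int identity(1,1) not null,
    TempEntityId bigint not null,
    Note nvarchar(max) not null,
    constraint PK_TempEntityNotes primary key clustered (NoteId),
    constraint FK_TempEntityNotes_TempEntities
        foreign key (TempEntityId) references dbo.TempEntities (Id)
);
go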
Shuffling the clustered attribute between the surrogate and natural keys has mostly academic value and will only make a difference on a high-load system with hundreds of inserts happening every second on a 24x7 schedule. If your system is indeed as such, please seek a professional consultant who will analyse your queries and provide a solution tailored to your situation.

Indices on composite primary key

I've googled this many times but never found an exact explanation.
I am working on a complex database structure (in Oracle 10g) where I hardly ever have a primary key on a single column, except for the static tables.
Now my question: consider a composite primary key (LXI, VCODE, IVID, GHID). Since it's a primary key, Oracle will provide a default index.
Will I get ONE (system-generated) index for the primary key itself, or indexes for its sub-columns as well?
I'm asking because I retrieve data (around millions of records) based on the individual columns as well. If the system generated indices for the individual columns too, why does my query run noticeably faster when I explicitly define an index for each individual column?
Please give a satisfactory answer
Thanks in advance
A primary key is a non-NULL unique key. In your case, the unique index has four columns: LXI, VCODE, IVID, GHID, in the order of declaration.
If you have a condition on VCODE but not on LXI, then most databases would not use the index. Oracle has a special type of index scan called the "skip scan", which allows for this very situation. It is described in the documentation.
I would expect an index skip scan to be a bit slower than an index range scan on individual columns. However, which is better might also depend on the complexity of the WHERE clause. For instance, three equality conditions on VCODE, IVID and GHID connected by AND might be a great fit for the skip scan. And such an index would cover the WHERE clause - a great efficiency - better than one-column indexes.
As a note: index skip scans were introduced in Oracle 9i, so they are available in Oracle 10.
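If you want to test the skip-scan path explicitly, Oracle accepts an INDEX_SS hint; a sketch with invented bind variables (your_table is a placeholder, since the question doesn't name the table):
SELECT /*+ INDEX_SS(t) */ *
FROM your_table t
WHERE vcode = :v AND ivid = :i AND ghid = :g;  -- no condition on the leading LXI
-- Compare this plan's cost against separate single-column indexes
-- before deciding which indexing strategy to keep.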
It will not generate an index for each individual column; it will generate one composite index. The index is built first on LXI, then on the next column, and so on - a tree structure. If you search on the first column of the primary key, the search will use the index; to use the index for the second column, you have to combine it with the first:
select ... where LXI = ?                 -- will use the PK index
select ... where LXI = ? and VCODE = ?   -- will also use the PK index
select ... where VCODE = ?               -- will not use it (without LXI)

Will adding a clustered index to an existing table improve performance?

I have a SQL Server 2005 database I've inherited, with a table that has grown to about 17 million records over the course of about 15 years, and it is now horribly slow.
The table layout looks about like this:
id_column = nvarchar(20), indexed, not unique
column2 = nvarchar(20), indexed, not unique
column3 = nvarchar(10), indexed, not unique
column4 = nvarchar(10), indexed, not unique
column5 = numeric(8,0), indexed, not unique
column6 = numeric(8,0), indexed, not unique
column7 = nvarchar(20), indexed, not unique
column8 = nvarchar(10), indexed, not unique
(and about 5 more columns that look pretty much the same, not indexed)
The 'id' field is a value entered in a front-end application by the end-user.
There are no defined primary keys, and no columns that can be combined to make a unique row (unless all columns are combined). The table actually is a 'details' table to another table, but there are no constraints ensuring referential integrity.
Every column is heavily used in 'where' clauses in queries, which is why I assume there's an index on every one, or perhaps a desperate attempt to speed things up by another DBA.
Having said all that, my question is: could adding a clustered index do me any good at this point?
If I did add a clustered index, I assume it would have to be a new column, ie., an identity column? Basically, is it worth the trouble?
Appreciate any advice.
I would say only add the clustered index if there is a reason for needing it. So ask these questions:
Does the order of the data make sense?
Is there sequential value to the way the data is inserted?
Do I need to use a feature that requires it have a clustered index, such as full text index?
If the answer to all of these questions is "No", then a clustered index might not be of any additional help over a good non-clustered index strategy. Instead, you might want to consider how and when you update statistics, when you rebuild the indexes, and whether filtered indexes make sense in your situation. Looking at the table you have as an example, it's tough to say, but maybe it makes sense to normalize the table further and use numeric keys instead of nvarchar.
http://www.mssqltips.com/sqlservertip/3041/when-sql-server-nonclustered-indexes-are-faster-than-clustered-indexes/
The article is a great example of when non-clustered indexes might make more sense.
I would recommend adding a clustered index, even if it's on a new identity column, for 3 reasons:
Assuming that your existing queries have to go through the entire table every time, a clustered index scan is still faster than a table scan.
The table is a child of some other table. With some extra work, you can use the new child_id to join against the parent table. This enables clustered index seeks, which are a lot faster than scans in some cases.
Depending on how they are set up, the existing indices may not do much good. I've come across some terrible indices that cover one column each, or indices that don't include the appropriate columns, causing costly Key Lookup operations. Check your index usage stats to see if they are being used at all (one way to do that is sketched below).
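One way to check that last point on SQL Server 2005 and later, sketched with a placeholder table name (dbo.YourDetailsTable is invented):
SELECT i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.YourDetailsTable');
-- Indexes with no seeks/scans since the last restart are candidates for
-- removal; heavy user_updates with no reads at all are pure overhead.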

Proper way of creating indexes for same column when "include" columns are different

Let's say I have 2 stored procedures and 1 table.
Table name: Table_A
Procedure names: proc1 and proc2
When I run proc1 with the actual execution plan, it suggests creating an index on Table_A for the tblID column (which is NOT a primary key), including column_A and column_B.
And for proc2 it suggests creating an index on Table_A again, for the tblID column, but this time including column_B and column_C (it suggests column_C instead of column_A for this procedure).
So my question is: if I create one index that includes all the suggested columns, like:
CREATE NONCLUSTERED INDEX indexTest
ON [dbo].[Table_A] ([tblID])
INCLUDE ([column_A],[column_B],[column_C])
Does that cause any performance issues?
Is there any disadvantage to combining the INCLUDE columns?
Or should I create 2 different indexes as:
CREATE NONCLUSTERED INDEX indexTest_1
ON [dbo].[Table_A] ([tblID])
INCLUDE ([column_A],[column_B])
CREATE NONCLUSTERED INDEX indexTest_2
ON [dbo].[Table_A] ([tblID])
INCLUDE ([column_B],[column_C])
UPDATE: I would like to add one more thing to this question.
What if I do the same thing for the key columns as well?
I mean, proc1 suggested creating an index on the tblID field, and proc2 suggested creating an index on tblID and column_A.
If I combine them as:
CREATE NONCLUSTERED INDEX indexTest_3
ON [dbo].[Table_A] ([tblID],[column_A])
INCLUDE ([column_B])
Does that cause a performance issue? Or should I create 2 separate indexes for the suggested key columns?
Definitely create one index that includes all three columns!
The fewer indexes you have, the better - index maintenance is a cost factor - more indices require more maintenance.
And the included columns are stored only at the leaf level of the index - they have only a very marginal impact on performance.
Update: if you have a single index on (tblID, column_A), then you can use this for queries that use only tblID in their WHERE clause, or you can use it for queries that use both columns in their WHERE clause.
HOWEVER: this index is useless for queries that use only column_A in their WHERE clause. A compound index (index made up from multiple columns) is only ever useful if a given query uses the n left-most columns as specified in the index.
So in your case, one query needs only tblID, while the other needs (tblID, column_A) - so yes, in this case a single index on (tblID, column_A) would work for both queries.
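To illustrate the left-most-prefix rule with the combined index from the update (variable names are invented):
-- Can seek on indexTest_3 (tblID, column_A) INCLUDE (column_B):
SELECT column_B FROM dbo.Table_A WHERE tblID = @id;
SELECT column_B FROM dbo.Table_A WHERE tblID = @id AND column_A = @a;
-- Cannot seek - column_A is not the leading column of the index:
SELECT column_B FROM dbo.Table_A WHERE column_A = @a;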
It sounds like you're looking at the missing index DMVs. There are a couple of things to realize here. The DMVs are really telling you about specific queries, or groups of similar queries, where a specific index might help.
In this sense, you are right to combine the indexes. This is the right idea.
However, also remember that indexes have a cost, and it's not the job of this DMV to weigh that cost. You definitely don't want to automatically create an index to cover every recommendation. You also want to examine these indexes: once you include columns A, B, and C, are you keeping the entire table (or nearly so) in the index? Could you perhaps get better results by changing the primary key to match this index? Be careful evaluating that last part, because changing the primary key could then leave the prior key as an even more important missing index.
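For reference, a sketch of reading those missing-index DMVs directly, ordered by a rough impact estimate:
SELECT d.statement AS table_name,
       d.equality_columns, d.inequality_columns, d.included_columns,
       gs.user_seeks, gs.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
     ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS gs
     ON gs.group_handle = g.index_group_handle
ORDER BY gs.user_seeks * gs.avg_user_impact DESC;  -- crude priority score
-- Treat the output as hints to investigate, not as a to-do list.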
