I have a use case where I want to create a composite index. I have two options for creating the composite index.
Option 1: STATUS_CODE (varchar(1)) & TICKET_ID (number)
Option 2: STAUS (varchar(50)) & TICKET_ID (number)
For STATUS and STATUS_CODE I have only 5 possible values.
Which one will be better option for my composite index? Will there be any performance difference between these two indexes?
If everything else is the same, then the first option is better. Each row will be represented by less data in the index, which will allow the index to take less space on disk. This way, whenever the index is used, more data (from the index) can be retrieved from disk in a single physical read operation. How much this affects performance, though, will depend on how much data you have in the first place, and on how long the STATUS values are. (If they are all close to 50 characters, then it will make a bigger difference; not so much if most of them are very short.)
It seems, though, that your table design violates Third Normal Form. If STATUS_CODE and STATUS determine each other, this means that you should have a separate table just for "status" (showing codes and descriptions). Your big table should only have the STATUS_CODE column, not the STATUS column.
Related
I have a RecyclerView list of items that uses an SQLite database to store user input data. I use the traditional _id column as INTEGER PRIMARY KEY AUTOINCREMENT. If I understand correctly, newly inserted rows in the database are added below existing rows and the new ROWID takes the largest existing ROWID and increments it by +1. Therefore, a cursor search for the latest insert will have to scan down the entire set of rows to reach the bottom of the database. For example, after 10 inserts, the cursor has to search down from 1, 2, 3,... until it gets to row 10.
To avoid a lengthy search of the entire set of ROWIDs, is there any way to have new inserts be added to the top of the database and not the bottom? That way a cursor search for the latest insert using moveToFirst() will be very fast since the cursor will stop at the first row it searches, the top of the database. The cursor would search 10, 9, 8,...3,2,1 and therefore the search would be very fast since it would stop at 10, the first row at the top of the database.
You are thinking too much about the database internals. Indexes are designed for this kind of optimisation.
Make a new numeric column where you put your wished ordering as a value and use order by in selects. Do not forget to make an index on this column and verify your selects do use the indexes. (explain)
First, if you are concerned about overheads then use the recommended INTEGER PRIMARY KEY as opposed to INTEGER PRIMARY KEY AUTOINCREMENT. Both will result in a unique id, the latter has overheads as per :-
The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and
disk I/O overhead and should be avoided if not strictly needed. It is
usually not needed.
SQLite Autoincrement
If I understand correctly, newly inserted rows in the database are
added below existing rows and the new ROWID takes the largest existing
ROWID and increments it by +1.
Generally BUT not necessarily, there is no guarantee that the value will increment by 1.
AUTOINCREMENT utilises a table called sqlite_seqeunce that has a single row per table that stores the highest last used sequence number along with the table name. The next sequence number will be that value + probably 1 UNLESS the highest rowid is greater than the value in the sqlite_sequence table.
Without AUTOINCREMENT then the next sequence is the highest rowid + probably 1.
AUTOINCREMENT guarantees a higher number. Without AUOINCREMENT can use a lower number (BUT not until the number would be greater than 9223372036854775807). If AUTOINCREMENT would use a number higher that this then an SQLITE_FULL exception will happen.
Again with regard to rowid's and searching :-
The data for rowid tables is stored as a B-Tree structure containing
one entry for each table row, using the rowid value as the key. This
means that retrieving or sorting records by rowid is fast. Searching
for a record with a specific rowid, or for all records with rowids
within a specified range is around twice as fast as a similar search
made by specifying any other PRIMARY KEY or indexed value. ROWIDs and the INTEGER PRIMARY KEY
To avoid a lengthy search of the entire set of ROWIDs, is there any
way to have new inserts be added to the top of the database and not
the bottom?
Yes there is, simply specify the value for the rowid or typically the alias when inserting (but beware using an already used value and good luck with managing the numbering). However, I doubt that doing so would result in a faster search. Tables have a rowid by default largely due to the rowid being optimised for searching by rowid.
I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUserId) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
#offset INT,
#limit INT,
#input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET #offset ROWS FETCH NEXT #limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows, and gets me good results when using few userids as input, but very bad (30+ seconds) results when providing hundreds or a couple thousand userids as input. The main problem seems to be that it uses 63% of the effort on sorting.
What index am I missing? this seems to be a pretty straight forward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UseID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: cannot guarantee they will always use all upper-case or all lower-case), then best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries using the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While #srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an Integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
#Paparazzi suggested I added an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was, that I also needed to also index RelatedUser as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly 1-second or a couple of seconds querytimes.
I think what gave me the revelation was #srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my groupby and orderby columns in the index.
So while I can't mark either of the above posters post as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest and index on CreationTime
Or try an index on RelatedStory, CreationTime
My Question is almost the same like Changing newid() to newsequentialid() on an existing table
if I change to newsequentialid() than the index should be more compact, correct?
And what happens if the sequence is hitting exists ID's while inserting new records? Will the database check that before?
It will have less fragmentation so you could say its' more compact. But it will still be the same size per key (16 bytes + overhead) for a guid key. The benefit of using a sequential guid vs a nonsequential guid is that you have less chances for a page split. A page split is where a logical page has to have a record inserted, but would be more than the page is allowed to hold, so the page is "split"; half to one page and half to another. Sometimes a page split causes another page to split, and theoretically you can have a cascading and costly page split by just inserting one new record. When you use a sequential key, it's less likely that you'll randomly triggger a page split somewhere in the middle of your index, so you reduce the likelihood of those happening. Using a sequential guid also helps optimize range scans (e.g selecting between one value and another value) but with a GUID, it's very unlikely that you'll end up doing many range based scans, since the value is basically meaningless.
What happens when the sequence hits an existing iD? You get a PK violation. SQL doesn't ensure that a GUID can only be used once. Sequential ID's start at a new seed every time the server is restarted so, in theory, you could skip back in the sequence and then wind up covering the same value twice. However, as with GUIDs in general, the liklihood of this happening is so astronomically small as to be statistically insignificant.
As with everything, the cost and benefits depend on your specific scenario. If you're looking to replace a GUID key with a sequential key, see if it's possible to use an int or bigint surrogate key instead of a GUID, because generally, all things being equal, an integer will outperform a guid in every case. 4 Million records will trivially fit into an INT data type and even more trivially into a bigint.
Hope this helps.
I've been researching best practices for creating clustered indexes and I'm just trying to totally understand these two suggestions that's listed with pretty much every BLOG or article on the matter
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contrary or I'm guessing maybe it just depends on how you're accessing the table.. Or my interpretation of what "large result sets" mean is wrong....
Unless you're doing range queries over the clustered column it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered indexes on the PK you're rarely going to fulfill the large result set suggestion but of course it does the large number of distinct values..
To give the question a little more context. This quetion stems from a vertical auditing table we have that has a column for TABLE.... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non distinct... Each result set of tablenames is rather large which seems to fulfill that second conditon but it's definitely not largerly unique.... Which means all that other stuff happens with having to add the 4 byte Uniquifer (sp?) which makes the table a lot larger etc...
This situation has come up a few times for me when I've come upon DBs that have say all the contact or some accounts normalized into a single table and they are only separated by a TYPE parameter. Which is on every query....
In the case of the audit table the queries are typically not that exciting either they are just sorted by date modified, sometimes filtered by column, user that made the change etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP so that inserting is fast so there's not contention between tables being audited and then to generate indexed views over the data ...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable, and is just wasted space given that the order of your rows within the non-unique groupings is not apparently obvious so trying to narrow down to a single row is still a range.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also being ordering on the Date so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly but the rows needed to satisfy the query are likely spread out across the datafile(s) which would require more I/O to get. AND, more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to inculde the Date field in all queries as any queries using only TableName (and possibly other filters but NOT using the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName being first.
I would be weary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.
I have a table with several indexes. All of them contain an specific integer column.
I'm moving to mysql 5.1 and about to partition the table by this column.
Do I still have to keep this column as key in my indexes or I can remove it since partitioning will take care of searching only in the relevant keys data efficiently without need to specify it as key?
Partition field must be part of index so the answer is that I kave to keep the partitioning column in my index.
Partitioning will only slice the values/ranges of that index into separate partitions according to how you set it up. You'd still want to have indexes on that column so the index can be used after partition pruning has been done.
Keep in mind there's a big impact on how many partitions you can have, if you have an integer column with only 4 distinct values in it, you might create 4 partitions, and an index would likely not benefit you much depending on your queries.
If you got 10000 distinct values in your integer column, you hit system limits if you try to create 10k partitions - you'll have to partition on large ranges (e.g. 0-1000,1001-2000, etc.) in such a case you'll benefit from an index (again depending on how you query the tables)