I have a table which holds the guid for an user and their actual name as a string. I would like to grab some information based on an user. But which field should I use? Should my code say:
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and userid = '<guid here>'
or
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and username = "FirstName LastName"
The biggest difference is if one field has an index that the database can use, and the other doesn't. If the database has to read all the data in the table to scan for the value, the disk access takes so much resources that the difference in data type is not relevant.
If both fields have indexes, then the index that is smaller would be somewhat faster, because it loads faster, and it's more likely that it remains in the cache.
Ideally you would have an index for all the fields in the condition, which has the fields that you want to return as included fields. That way the query can produce the result from only the index, and doesn't have to read from the actual table at all. You should of course not use select *, but specify the fields that you actually need to return.
Other than that, it would be somewhat faster to compare GUID values because it's a simple numeric comparison and doesn't have to consider lexical rules.
See the query plan and you can see it for yourself.
But the unique identifier usually has an index and the string (username) might not have. If so, and if there are many records, prolly the unique identifier would be faster!
To the the query plan, check THIS article.
GUID will be good enough.
1. GUID will produce unique values in the table.
2. Create Non Clustered Index on this column.
Reference - Non-clustered indexes are particularly handy when we want to return a single row from a table.
Are you completely married to the GUID? You should use a GUID when you need a primary key that will be unique across multiple systems. I would suggest skipping the GUID and using a composite key. For example, you could use an identity plus a GETDATE() as a composite key. This would give you an easy way to query your data (try to remember a GUID over an integer). This will also perform much, much better than GUID. Probably twice as fast.
If userid is a primary key, you should use that. If you use first and last name, you could have two John Smith entries, for example, and that could create an issue for you. Using the PK should be safer
On the performance side, it's a good idea to become familiar with explain plan (execution path?) of the query. I'd expect using the userid would be faster, but checking the plan should tell you for certain.
Related
I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUserId) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
#offset INT,
#limit INT,
#input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET #offset ROWS FETCH NEXT #limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows, and gets me good results when using few userids as input, but very bad (30+ seconds) results when providing hundreds or a couple thousand userids as input. The main problem seems to be that it uses 63% of the effort on sorting.
What index am I missing? this seems to be a pretty straight forward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UseID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: cannot guarantee they will always use all upper-case or all lower-case), then best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries using the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While #srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an Integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
#Paparazzi suggested I added an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was, that I also needed to also index RelatedUser as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly 1-second or a couple of seconds querytimes.
I think what gave me the revelation was #srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my groupby and orderby columns in the index.
So while I can't mark either of the above posters post as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest and index on CreationTime
Or try an index on RelatedStory, CreationTime
Which is better distinct or unique constraint for table in SQL Server Database ?
Should I use
distinct
for getting records from the large table or put
unique constraint
to the field so no duplicate entry happened ?
My ultimate goal is only that , get unique data, and i know both will give me this, But if i use unique constraint to field , then It will give me sql error at a time i insert duplicate data. Is it ok ? Is it affect to server or Databases ? I am using SQL Server for this process.
They're totally different use cases.
A unique constraint is what you use if the column itself (or set of columns) must be unique according to the schema details (the data). In other words, if the data is required to be unique on that column (or column set), use a unique constraint.
For example, if you're maintaining a membership table, the member ID should be unique.
The database must protect itself from dodgy data, this is not something that should be left to well-behaved applications, since the first non-well-behaved application that comes along is going to destroy your universe.
If the data is not required to be unique (such as the town each member lives in), then you can decide to "uniquify" it in a select statement, depending on your needs:
-- Get all towns.
select distinct town from members
So, here's your solution matrix, in decreasing priority:
Does the actual data need to be unique on that column? If so, a unique constraint must be used. Otherwise, a unique constraint should not be used.
If the data does not need to be uniques, do you need to only get one row for each possible value for that data? If so, use select distinct. If not, use select on its own.
Depends.
With distinct you pay at query time, but it's simpler for the user.
With unique constraint, you pay at insert time, and the app now has to handle exceptions on duplicates, but the query is faster.
Without more info, I would go with distinct, because life is simpler and you don't lock in behaviour (next week you may need the duplicates).
UNIQUE: always take part in data INSERTION (in brief)
DISTINCT: always concern on data retrieval (in brief)
Maybe this will help you.
When performing a query where the attributes selected make up the components of an index does that result in a faster query? I would imagine that the query planner/optimizer could see that the requested columns could be satisfied completely by the index scan.
Trivial Example
CREATE TABLE "liked" (
"id" BIGINT NOT NULL DEFAULT nextval('liked_id_seq'),
"userid" BIGINT NOT NULL,
"storyid" BIGINT NOT NULL,
"notes" TEXT,
PRIMARY KEY ("id")
);
CREATE INDEX "liked_user" ON "liked" (
"userid",
"storyid"
);
ALTER TABLE "liked" ADD FOREIGN KEY ("userid") REFERENCES "users" ("id") ON DELETE CASCADE;
ALTER TABLE "liked" ADD FOREIGN KEY ("storyid") REFERENCES "story" ("id") ON DELETE CASCADE;
SELECT storyid from liked where userid = 1;
With the query above there isn't any data external to what is already contained in the liked_user index so I would imagine there would be less actions if the query optimizer could infer that the resulting tuples could be satisfied by the index alone.
It's called a "covering index", and it improves the efficiency somewhat, by varying amounts depending on which DBMS you are using (and if you are using MySQL, which flavor of storage).
Try giving an example of a specific situation, if you have one.
In general, yes, but it depends on how you are accessing them. Using LIKE to key off a match in the middle out of a string field isn't going to be any faster with an index.
Not so much, in my experience. You speed up queries by optimizing their conditions, and trying to use the best possible index in those conditions. There are many ways to slow down a query based on what you are selecting, such as using subqueries, perhaps some UDFs--and of course you can slow down queries using some less-than-optimal joins.
It can do, but there are some caveats. These comments are based on Oracle, btw.
For example, SELECT COL1 FROM MY_TABLE might be able to use an index, but if all the columns of the index are nullable then there might be rows not included in a regular btree index, so the index might not be used.
It's also possible that an index might be larger (and therefore more costly to ful scan) than the underlying table (for example where the table only has a single column) because the index has to include a rowid for every entry as well as the column values. In that case, unless the query can leverage the index information in some special way (for example you are including an ORDER BY clause that the index can supply without the need for a sort) then the index might not be used.
You ought to also look into the various index access methods that the RDBMS can use in order to understand their strengths and weaknesses. In Oracle these would generally be INDEX RANGE SCAN, FULL INDEX SCAN, FAST FULL INDEX SCAN, and INDEX SKIP SCAN. This knowledge will help you understand whether an index could be used and in what way.
Are there any best practices to column ordering when designing a database? Will order effect performance, space, or the ORM layer?
I am aware of SQL Server - Does column order matter?. I am looking for more general advice.
I don't believe that the column order will necessarily affect performance nor space. To improve performance, you can create indexes on the table, and the order of the columns defined in the index will effect performance.
I've seen tables have their fields ordered alphabetically, as well as "logically" (in a way that makes sense for the data that is being represented). All in all, I can see benefits in both, but I would tend to go for the "logically" method.
I try to stick with the most important columns first. Typically I always keep my ID column as the first in any table. Then whatever information is important and is updated frequently usually follows, then the rest which may or may not be updated frequently.
I don't think it will affect performance, but from a developer stance, it's easier to read the first few columns which will be updated frequently than try and scan the hole table for that one field at the end.
In Oracle there can be significant storage space savings if your table has a number of NULLable columns and you place the NULLable columns at the end of the list. NULL values on the end of a row take up no space.
e.g. imagine this table: (id NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100, date_created DATE NOT NULL)
the row (100, NULL, NULL, NULL, '10-JAN-2000') will require storage for the values 100, some space for the three NULLs, followed by the date.
Alternatively, the same table but with different ordering: (id NOT NULL, date_created DATE NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100))
the row (100, '10-JAN-2000', NULL, NULL, NULL) will only require storage for the values 100 and the date - the trailing NULLs are omitted entirely.
Normally this makes little difference but for very large tables with many NULLable columns, significant savings may be made - less space used can translate to more rows per block, meaning less IO and CPU required to query the table.
I think the answer is no.
RDBMS servers optimise these kinds of things internally for queries so I suspect it's unimportant.
column order only matters in a composite index
If your index is on ( Lastname, firstname) and you always search for last name then you are good to go even if you don't include first name
if your index looks like this (Firstname, Lastname) and your where clause is
where lastname like 'smith%'
then you have to scan the whole index
More general advice isn't really available since you're asking for implementation details rather than the SQL standard.
Different DBMS will implement these things differently.
However, a clever DBMS would implement the internals such that the column ordering is not of consequence.
Therefore, I would order my columns to be intuitive for human readers.
In designing a database, I would probably put the most important columns first in a logical order (idfield, firstname, middlename, lastname for instance). It does make it easier to see them when you are looking for the columns you need the most out of a long column list.
I would however not rearrange the columns later on to support a more logical grouping.
As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...
Indexes can play an important role in query optimization and searching the results speedily from tables. The most important step is to select which columns are to be indexed. There are two major places where we can consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, such columns should be indexed against which you are required to search particular records. Suppose, we have a table named buyers where the SELECT query uses indexes like below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. The below is another example little different from the above one:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries first_name, last_name columns can be indexed as they are located in the WHERE clause. Also an additional field, country_id from country table, can be considered for indexing because it is in a JOIN clause. So indexing can be considered on every field in the WHERE clause or a JOIN clause.
The following list also offers a few tips that you should always keep in mind when intend to create indexes into your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take benefit of "index prefix" or "multi-columns index" feature of MySQL. If you create an index such as INDEX(first_name, last_name), don’t create INDEX(first_name). However, "index prefix" or "multi-columns index" is not recommended in all search cases.
Use the NOT NULL attribute for those columns in which you consider the indexing, so that NULL values will never be stored.
Use the --log-long-format option to log queries that aren’t using indexes. In this way, you can examine this log file and adjust your queries accordingly.
The EXPLAIN statement helps you to reveal that how MySQL will execute a query. It shows how and in what order tables are joined. This can be much useful for determining how to write optimized queries, and whether the columns are needed to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
Depending on your indexes (number of indexes and type), result is searched. If your search time is gonna increase because of index then that's bad index.
Likely in any book, "Index Page" could have chapter start page, topic page number starts, also sub topic page starts. Some clarification in Index page helps but more detailed index might confuse you or scare you. Indexes are also having memory.
Index selection should be wise. Keep in mind not all columns would require index.
Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
Good luck!
It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for
the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.
Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like "O'Brian%"
The columns do not need to be unique. In fact you can get really good performance from a binary index when searching for exceptions.
select * from tblOrder where paidYN='N'
In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
My suggestion would be to not use any indexes initially and profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevate attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)
It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being used against your database and can tell you what indexes would have improved the performace.
Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER by frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.
Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005 and set on using uniqueidentifiers (guids), and do NOT need them to be of a random nature, check out the sequential uniqueidentifier type.
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.
A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, ie sorted (integer, date etc).
It does not matter if the data in a column is generally increasing. If you create an index on the column, the index will create it's own data structure that will simply reference the actual items in your table without concern for stored order (a non-clustered index). Then for example a binary search can be performed over your index data structure to provide fast retrieval.
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.
The ol' rule of thumb was columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seemed to be used in joins frequently. Keep in mind I'm referring to indexes, NOT Primary Key
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data
It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index(binary search, you can find the physical location of the record you are looking for in O( lg n) time, instead of searching sequentially O(n) time. This is because you dont know what records you have in you table.
Best index depends on the contents of the table and what you are trying to accomplish.
Taken an example A member database with a Primary Key of the Members Social Security Numnber. We choose the S.S. because the application priamry referes to the individual in this way but you also want to create a search function that will utilize the members first and last name. I would then suggest creating a index over those two fields.
You should first find out what data you will be querying and then make the determination of which data you need indexed.