I have a table called Products.
This table contains over 3 million entries. Every day approximately 5,000 new entries are added, all during a 2-minute window at night.
The table also gets queried every night, maybe over 20,000 times, with this query:
SELECT Price
FROM Products
WHERE Code = #code
AND Company = #company
AND CreatedDate = #createdDate
Table structure:
Code nvarchar(50)
Company nvarchar(10)
CreatedDate datetime
I can see that this query takes about a second to return a result from the Products table.
There is no productId column in the table as it is not needed. So there is no primary key in the table.
I would like to somehow improve this query to return the result faster.
I have never used indexes before. What would be the best way to use indexes on this table?
If I add a primary key, do you think it would speed up the query? Keep in mind that I will still have to query the table with these 3 parameters:
WHERE Code = #code
AND Company = #company
AND CreatedDate = #createdDate.
This is mandatory.
As I mentioned, the table only gets new entries during a 2-minute window every night. How would this affect the indexes?
If I use indexes, which columns would be best to use, and should I use a clustered or non-clustered index?
The best thing to do would depend on what other fields the table has and what other queries run against that table.
Without more details, a non-clustered index on (Code, Company, CreatedDate) that includes the Price column will certainly improve performance.
CREATE NONCLUSTERED INDEX IX_code_company_createddate
ON Products(code, company, createddate)
INCLUDE (price);
That's because, with that index in place, SQL Server will not need to touch the base table at all when running the query: it can find all rows with a given (Code, Company, CreatedDate) directly in the index, which is exactly what an index is built for when you search on its key columns, and the included Price value is stored alongside each index entry.
Regarding the inserts: for each row added, SQL Server will have to add an entry to the index as well, so insert performance will be impacted. I think the gains in SELECT performance will outweigh the impact on the inserts, but you should test that.
Also, you will use more space, as the index stores all those fields for each row in addition to the space used by the original table.
As others have noted in the comments, adding a PK to your table (even if that means adding a ProductId column you don't actually need) might be a good idea as well.
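If you do go that route, a minimal sketch could look like this (the ProductId column name is just an assumption; the covering index above stays non-clustered):
-- Add a surrogate identity column and make it the clustered primary key.
ALTER TABLE Products
    ADD ProductId INT IDENTITY(1,1) NOT NULL;
ALTER TABLE Products
    ADD CONSTRAINT PK_Products_ProductId PRIMARY KEY CLUSTERED (ProductId);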
In my SQL Server database I have a table of Requests with requestID (int) as Identity, PK and Clustered index. There are approximately 30 other columns in the table.
I am using Entity Framework to access the DB.
There is a function called GetRequestByID(int requestID) that pulls all the columns from the Requests table and columns from related tables using inner joins.
Recently, to reduce the amount of data pulled where it isn't needed, I created two additional functions, GetRequestByID_Lite and GetRequestByID_EvenLiter, that return a smaller number of columns, and replaced all the relevant calls in the code.
For each of those functions I created a corresponding non-clustered index by requestID and including only the columns each function needs.
After one hour, first thing I see is that the memory consumed by the process decreased dramatically.
When I query sys.dm_db_index_usage_stats, I see the following for the new indexes:
_index_for_GetRequestByID_Lite - 0 seeks, 422 scans, 0 lookups, 49 updates
_index_for_GetRequestByID_EvenLiter - 0 seeks, 0 scans, 0 lookups, 51 updates
My question is why so many scans and no seeks for _index_for_GetRequestByID_Lite?
If the index doesn't contain all the columns required, then why doesn't SQL Server just use the clustered index?
And why is _index_for_GetRequestByID_EvenLiter not being used at all (there is no doubt the function GetRequestByID_EvenLiter is called a lot)?
Also, when I run a SQL query equivalent to GetRequestByID_EvenLiter, the clustered index is used in the execution plan instead of _index_for_GetRequestByID_EvenLiter.
Thank You.
SQL Server might not have found your index cost-effective.
See the example below:
create table test
(
col1 int primary key,
col2 int,
col3 int,
col4 varchar(10),
col5 datetime
)
-- assumes a helper 'numbers' table with an integer column called number
insert into test
select number, number+1, number+2, number+5, dateadd(day, number, getdate())
from numbers
Let's create an index
create index nc_Col2 on test(col2)
include(Col3,col4)
Now if we run a query like below
select * from test
where col2>4
and see execution plan cost...
You might have thought SQL Server should have used the above index, but it didn't. Now let's observe the cost when we force SQL Server to use that index:
select * from test with (index (nc_col2))
where col2>4
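By contrast (a sketch against the same test table), if the query only asks for columns that nc_Col2 actually contains, the optimizer is much more likely to pick the index on its own, with no hint needed:
-- covered query: col2, col3 and col4 all live in nc_Col2,
-- so no lookups into the base table are needed and the index becomes the cheap option
select col2, col3, col4
from test
where col2 > 4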
In summary, the reasons your index might not be used may be:
It is not cost-effective compared to the other existing possibilities.
Your index is not efficient, as shown in my example (I am selecting * and the index only covers three columns).
There are also some further concepts like allocation scans and sequential scans, but in short, SQL Server has to believe your index costs less. Check out the links below to see how to improve costing.
Further reading:
Inside the Optimizer: Plan Costing
https://dba.stackexchange.com/a/23716/31995
I am trying to confirm that my table needs a primary key, even though it would double the row size, or figure out what an appropriate indexing strategy would be. We are using SQL Server 2008 R2.
I have a Testscores table with just over 2 billion rows, and each row only contains 10 bytes of data of the following form:
(ItemID INT, ProjectID SMALLINT, DepartmentID SMALLINT, Score REAL).
No column is unique, but we have approximately 100 million ItemIDs, 500 ProjectIDs and 300 DepartmentIDs.
I have a lookup table of Projects with ~500 rows in the following form
(ID SMALLINT, ProjectName varchar, State Char(2), year INT)
Originally this table was denormalized and approximately 600 GB. My goal is to be able to query the Projects table on ProjectName, State, or Year (sometimes one of those, sometimes two, sometimes all three). I would then join the Testscores table on ProjectID to return all test scores from matching projects (somewhere between 5 million and 20 million results).
After rebuilding the tables (stupid of me, I should have figured this out first), I have come to learn that without a clustered index, every query will have to use a table scan, even if I build a nonclustered index on ProjectID.
My current row size is 10 bytes, and adding a BIGINT (needed; already at 2 billion rows and adding more) would add 8 bytes to each row, essentially doubling my database. Building a nonclustered index on ProjectID would essentially require 8 bytes for the uniqueifier (4 for the value, 4 more because it's the first variable-length column in the row).
Any ideas? Did I screw something up in my database design? I don't mind rebuilding it again, I just want to do it right.
PS: I've haunted this site for about a decade, and this is the first question I've had that I couldn't answer through searching. You all rock!
Edit: When I loaded the data into the table, it was presorted on ProjectID ASC, ItemID ASC, if that makes any difference.
With a record size of 8 bytes, SQL Server is putting about 1,000 rows on each page. That means that any query that selects more than 0.1% of the data is quite likely to hit all or almost all pages. Under these circumstances, the engine generally opts for a full table scan rather than using an index.
Given that your queries are returning at least 5 million rows, I speculate that it would be hard to avoid a full table scan. A clustered index might help for some queries (through some miracles, perhaps), but not for all.
One thing that might help is partitioning the table; however, you would need to denormalize the data for effective partitioning.
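As a very rough sketch of what that could look like (this assumes Enterprise Edition; the denormalized Year column, the boundary values, and the object names are all assumptions, not a recommendation for your exact data):
-- Denormalize the project year into Testscores so it can drive partitioning.
ALTER TABLE Testscores ADD [Year] SMALLINT NULL; -- populate from Projects during the load

CREATE PARTITION FUNCTION pfScoreYear (SMALLINT)
    AS RANGE RIGHT FOR VALUES (2006, 2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME psScoreYear
    AS PARTITION pfScoreYear ALL TO ([PRIMARY]);

-- Creating the clustered index on the partition scheme physically partitions the table.
CREATE CLUSTERED INDEX CIX_Testscores_Year_ProjectID
    ON Testscores ([Year], ProjectID, ItemID)
    ON psScoreYear ([Year]);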
I have a SQL 2005 database I've inherited, with a table that has grown to about 17 million records over the course of about 15 years, and is now horribly slow.
The table layout looks about like this:
id_column = nvarchar(20),indexed, not unique
column2 = nvarchar(20), indexed, not unique
column3 = nvarchar(10), indexed, not unique
column4 = nvarchar(10), indexed, not unique
column5 = numeric(8,0), indexed, not unique
column6 = numeric(8,0), indexed, not unique
column7 = nvarchar(20), indexed, not unique
column8 = nvarchar(10), indexed, not unique
(and about 5 more columns that look pretty much the same, not indexed)
The 'id' field is a value entered in a front-end application by the end-user.
There are no defined primary keys, and no columns that can be combined to make a unique row (unless all columns are combined). The table actually is a 'details' table to another table, but there are no constraints ensuring referential integrity.
Every column is heavily used in 'where' clauses in queries, which is why I assume there's an index on every one, or perhaps a desperate attempt to speed things up by another DBA.
Having said all that, my question is: could adding a clustered index do me any good at this point?
If I did add a clustered index, I assume it would have to be a new column, ie., an identity column? Basically, is it worth the trouble?
Appreciate any advice.
I would say only add a clustered index if there is a reason for needing it. So ask these questions:
Does the order of the data make sense?
Is there sequential value to the way the data is inserted?
Do I need to use a feature that requires it have a clustered index, such as full text index?
If the answer to all of these questions is "No", then a clustered index might not be of any additional help over a good non-clustered index strategy. Instead, you might want to consider how and when you update statistics, when you refresh the indexes, and whether filtered indexes make sense in your situation. Looking at the table you have as an example, it's tough to say, but maybe it makes sense to normalize the table further and use numeric keys instead of nvarchar.
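For what it's worth, here is a rough sketch of those maintenance options (the table name and filter predicate are placeholders, and note that filtered indexes require SQL Server 2008 or later):
-- Refresh statistics on the table (placeholder name).
UPDATE STATISTICS dbo.DetailsTable WITH FULLSCAN;

-- Reorganize (or REBUILD) the indexes to keep fragmentation down.
ALTER INDEX ALL ON dbo.DetailsTable REORGANIZE;

-- A filtered index only indexes the rows a query actually cares about
-- (SQL Server 2008+; the predicate here is purely illustrative).
CREATE NONCLUSTERED INDEX IX_DetailsTable_Column5_Filtered
    ON dbo.DetailsTable (column5)
    WHERE column6 IS NOT NULL;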
http://www.mssqltips.com/sqlservertip/3041/when-sql-server-nonclustered-indexes-are-faster-than-clustered-indexes/
The article is a great example of when non-clustered indexes might make more sense.
I would recommend adding a clustered index, even if it's just an identity column, for 3 reasons:
Assuming that your existing queries have to go through the entire table every time, a clustered index scan is still faster than a table scan.
The table is a child of some other table. With some extra work, you can use the new child_id to join against the parent table. This enables a clustered index seek, which is a lot faster than a scan in some cases.
Depending on how they are set up, the existing indexes may not do much good. I've come across terrible indexes that cover one column each, or indexes that don't include the appropriate columns, causing costly Key Lookup operations. Check your index usage stats to see if they are being used at all.
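To do that last check, here is a quick sketch against the usage-stats DMV (the table name is a placeholder):
-- Which indexes on the table are actually read, versus only maintained on writes?
SELECT i.name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.DetailsTable');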
I have a db table with about 10 or so columns, two of which are month and year. The table has about 250k rows now, and we expect it to grow by about 100-150k records a month. A lot of queries involve the month and year columns (e.g., all records from March 2010), so we frequently need to get the available month and year combinations (i.e., do we have records for April 2010?).
A coworker thinks that we should have a separate table from our main one that only contains the months and years we have data for. We only add records to our main table once a month, so it would just be a small update on the end of our scripts to add the new entry to this second table. This second table would be queried whenever we need to find the available month/year entries on the first table. This solution feels kludgy to me and a violation of DRY.
What do you think is the correct way of solving this problem? Is there a better way than having two tables?
Using a simple index on the required columns (Year and Month) should greatly improve either a DISTINCT or a GROUP BY query.
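A minimal sketch, assuming placeholder table and column names (adjust to your schema):
-- Composite index so DISTINCT / GROUP BY on (Year, Month) can be answered
-- from this small index instead of scanning the whole table.
CREATE NONCLUSTERED INDEX IX_YourTable_Year_Month
    ON dbo.YourTable ([Year], [Month]);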
I would not go with a secondary table, as it adds extra overhead (inserts, updates, and deletes will require that you keep the secondary table in sync).
EDIT:
You might even want to consider using Improving Performance with SQL Server 2005 Indexed Views
Make sure to have a clustered index on those columns,
and partition your table on these date columns and place the data files on different disk drives.
I believe keeping your index fragmentation low is your best shot.
I also believe having a physical (indexed) view with the desired select is not a good idea,
because it adds insert/update overhead.
At 100-150k new records a month, that works out to roughly 3.5 inserts per minute on average,
or about 17 seconds between each insert (on average; please correct me if I'm wrong).
The question is: are you selecting more often than once every 17 seconds?
That's the key thought.
Hope it helped.
Use a 'Materialized View', also called an 'Indexed View with Schema Binding', and then index this view. When you do this SQL server will essentially create and maintain the data in a secondary table behind the scenes and choose to use the index on this table when appropriate.
This is similar to what your co-worker suggested; the advantage is that you won't need to add any logic to your query to take advantage of it. SQL Server will do this when it creates a query plan, and it will also automatically maintain the data in the indexed view.
Here is how you would accomplish this: create a view that returns the distinct [year]/[month] values, and then index [year], [month] on the view. Again, SQL Server will use the tiny index on the view and avoid the table scan on the big table.
Because SQL Server will not let you index a view that uses the DISTINCT keyword, use GROUP BY [year], [month] instead and put COUNT_BIG(*) in the SELECT. It will look something like this:
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
ON [dbo].[vwMonthYear](Year,Month)
Now when you SELECT DISTINCT [Year],[Month] on the big table, the query optimizer will scan the tiny index on the view instead of scanning millions of records on the big table.
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead on this will be that of maintaining an additional index, so each time the large table is updated the index on the view will also be updated.
If you find this index is substantially slowing down your load times, drop the index, perform your data load, and then recreate it.
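Using the view and index names from this answer, that load pattern would look roughly like this:
-- Drop the view's clustered index before a heavy load (the view stops being materialized)...
DROP INDEX ICU_vwMonthYear_Year_Month ON dbo.vwMonthYear;

-- ...run the bulk load into the base table here...

-- ...then recreate the index so the view is materialized and maintained again.
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
    ON dbo.vwMonthYear ([Year], [Month]);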
Full working example:
CREATE TABLE YourBigTable(
YourBigTableID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
[Year] INT,
[Month] INT)
GO
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month ON [dbo].[vwMonthYear](Year,Month)
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
-- Actual execution plan shows SQL Server scanning ICU_vwMonthYear_Year_Month
create a materialized indexed view of:
SELECT DISTINCT
MonthCol, YearCol
FROM YourTable
you will now get access to the pre-computed distinct values without going through the work every time.
Make the date the first column in the table's clustered index key. This is very typical for historic data, because most, if not all, queries are interested in specific ranges, and a clustered index on time can address this. Queries like 'month of May' need to be addressed as ranges, e.g. WHERE DATECOLKEY BETWEEN '05/01/2010' AND '06/01/2010'. Answering a question like 'are there any records in May' will involve a simple seek into the clustered index.
While this may seem complicated to a programmer's mind, it is the optimal way to approach this kind of database design problem.
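A sketch of that layout, reusing the DATECOLKEY name from above (the table and secondary key column names are assumptions):
-- The date column leads the clustered index key, so date-range queries become seeks.
CREATE CLUSTERED INDEX CIX_History_DateColKey
    ON dbo.HistoryTable (DateColKey, SomeOtherKeyColumn);

-- "Are there any records in May 2010?" resolves to a narrow clustered index seek.
SELECT TOP (1) 1
FROM dbo.HistoryTable
WHERE DateColKey >= '20100501'
  AND DateColKey < '20100601';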
I have a table (SQL 2000) with over 10,000,000 records. Records get added at a rate of approximately 80,000-100,000 per week. Once a week a few reports get generated from the data. The reports are typically fairly slow to run because there are few indexes (presumably to speed up the INSERTs). One new report could really benefit from an additional index on a particular "char(3)" column.
I've added the index using Enterprise Manager (Manage Indexes -> New -> select column, OK), and even rebuilt the indexes on the table, but the SELECT query has not sped up at all. Any ideas?
Update:
Table definition:
ID, int, PK
Source, char(3) <--- column I want indexed
...
About 20 different varchar fields
...
CreatedDate, datetime
Status, tinyint
ExternalID, uniqueidentifier
My test query is just:
select top 10000 [field list] from [table] where Source = 'abc'
You need to look at the query plan and see if it is using that new index. If it isn't, there are a couple of possibilities. One: it could be reusing a cached query plan that has not been invalidated since the new index was created. If that is not the case, you can also try an index hint [ WITH (INDEX (yourindexname)) ].
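For example, forcing your test query onto the new index would look something like this (dbo.YourTable and IX_Source are placeholders for your actual table and index names):
-- Hint the new index on Source so you can compare the plan and timings against the unhinted query.
SELECT TOP 10000 *
FROM dbo.YourTable WITH (INDEX (IX_Source))
WHERE Source = 'abc'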
10,000,000 rows is not unheard of; it should read that out pretty fast.
Use the Show Execution Plan in SQL Query Analyzer to see if the index is used.
You could also try making it a clustered index if it isn't already.
For a table of that size your best bet is probably going to be partitioning your table and indexes.
select top 10000
How unique are your sources? Indexes on fields that have very few distinct values are usually ignored by the SQL engine, and they can make queries slower. You might want to remove that index and see if the query is faster if your Source field only has a handful of values.
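A quick sketch of how to check that (the table name is a placeholder):
-- How many distinct Source values exist, and how skewed is the distribution?
SELECT Source, COUNT(*) AS RowsPerSource
FROM dbo.YourTable
GROUP BY Source
ORDER BY COUNT(*) DESC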