Best way to get distinct values from large table - sql-server

I have a db table with about 10 or so columns, two of which are month and year. The table has about 250k rows now, and we expect it to grow by about 100-150k records a month. A lot of queries involve the month and year columns (e.g., all records from March 2010), so we frequently need to get the available month and year combinations (i.e., do we have records for April 2010?).
A coworker thinks that we should have a separate table from our main one that only contains the months and years we have data for. We only add records to our main table once a month, so it would just be a small update on the end of our scripts to add the new entry to this second table. This second table would be queried whenever we need to find the available month/year entries on the first table. This solution feels kludgy to me and a violation of DRY.
What do you think is the correct way of solving this problem? Is there a better way than having two tables?

A simple index on the required columns (Year and Month) should greatly improve either a DISTINCT or a GROUP BY query.
I would not go with a secondary table, as it adds the overhead of maintaining that table (inserts, updates, and deletes would all require you to keep the secondary table in sync).
EDIT:
You might even want to consider indexed views; see the article "Improving Performance with SQL Server 2005 Indexed Views".
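A minimal sketch of the plain-index approach. The table name dbo.YourBigTable and the index name are placeholders, not names from the question, and [Year] and [Month] are assumed to be integer columns.

```sql
-- Hypothetical names: dbo.YourBigTable, IX_YourBigTable_Year_Month.
CREATE NONCLUSTERED INDEX IX_YourBigTable_Year_Month
    ON dbo.YourBigTable ([Year], [Month]);

-- List the available month/year combinations; the narrow index can
-- be scanned instead of the full table.
SELECT [Year], [Month]
FROM dbo.YourBigTable
GROUP BY [Year], [Month]
ORDER BY [Year], [Month];

-- "Do we have records for April 2010?" can be answered by an index seek.
SELECT CASE WHEN EXISTS (
           SELECT 1
           FROM dbo.YourBigTable
           WHERE [Year] = 2010 AND [Month] = 4
       ) THEN 1 ELSE 0 END AS HasApril2010;
```

The EXISTS form stops at the first matching row, so it stays cheap no matter how many rows a month contains.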

Make sure to have a clustered index on those columns,
and partition your table on these date columns and place the data files on different disk drives.
I believe keeping your index fragmentation low is your best shot.
I also believe having a physical view with the desired select is not a good idea,
because it adds insert/update overhead.
At 150k records a month there are, on average, about 3.5 inserts per minute,
or about 17 seconds between inserts (please correct me if I'm wrong).
The question is: are you selecting more often than every 17 seconds?
That's the key thought.
Hope it helped.
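The averages above can be checked with a quick calculation, assuming the upper estimate of 150,000 new rows per month and a 30-day month. Note that the question says rows are actually loaded in one batch per month, so this is an average rate, not the real arrival pattern.

```sql
-- Assumes 150,000 rows per month and a 30-day month (43,200 minutes).
SELECT
    150000.0 / (30 * 24 * 60)          AS InsertsPerMinute,      -- roughly 3.5
    60.0 / (150000.0 / (30 * 24 * 60)) AS SecondsBetweenInserts; -- roughly 17
```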

Use a 'Materialized View', also called an 'Indexed View with Schema Binding', and then index this view. When you do this, SQL Server will essentially create and maintain the data in a secondary table behind the scenes and choose to use the index on this table when appropriate.
This is similar to what your co-worker suggested. The advantage is that you won't need to add logic to your query to take advantage of it: SQL Server will do this when it creates a query plan, and it will also automatically maintain the data in the indexed view.
Here is how you would accomplish this: create a view that returns the distinct [month] and [year] values, and then index [year], [month] on the view. Again, SQL Server will use the tiny index on the view and avoid the table scan on the big table.
Because SQL Server will not let you index a view that uses the DISTINCT keyword, use GROUP BY [year],[month] instead and include COUNT_BIG(*) in the SELECT. It will look something like this:
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
ON [dbo].[vwMonthYear](Year,Month)
Now when you SELECT DISTINCT [Year],[Month] on the big table, the query optimizer will scan the tiny index on the view instead of scanning millions of records on the big table.
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead on this will be that of maintaining an additional index, so each time the large table is updated the index on the view will also be updated.
If you find this index is substantially slowing down your load times, drop the index, perform your data load, and then recreate it.
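A rough sketch of that drop, load, and recreate sequence, using the view and index names from this answer. The load step itself is only a placeholder comment, since the question does not show the load script.

```sql
-- Drop the view's clustered index before the monthly load.
DROP INDEX ICU_vwMonthYear_Year_Month ON dbo.vwMonthYear;
GO

-- ... run the monthly load into dbo.YourBigTable here ...

-- Recreate the index; this re-materializes the view in one pass.
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
    ON dbo.vwMonthYear ([Year], [Month]);
GO
```

While the index is dropped, the view is an ordinary view again, so month/year lookups fall back to reading the big table until the index is recreated.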
Full working example:
CREATE TABLE YourBigTable(
YourBigTableID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
[Year] INT,
[Month] INT)
GO
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month ON [dbo].[vwMonthYear](Year,Month)
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
-- Actual execution plan shows SQL Server scanning ICU_vwMonthYear_Year_Month
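One caveat: automatic matching of a base-table query to an indexed view is an Enterprise edition feature. On other editions, the usual approach is to query the view directly with the NOEXPAND hint. A minimal sketch, assuming the view and index from the example above already exist:

```sql
-- Read the pre-aggregated rows straight from the view's clustered index.
SELECT [Year], [Month]
FROM dbo.vwMonthYear WITH (NOEXPAND)
ORDER BY [Year], [Month];
```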

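Create a materialized (indexed) view of the distinct values. Since an indexed view cannot use DISTINCT, express it with GROUP BY:
SELECT
MonthCol, YearCol, COUNT_BIG(*) AS RowCnt
FROM dbo.YourTable
GROUP BY MonthCol, YearCol
You will now get access to the pre-computed distinct values without going through the work every time.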

Make the date the first column in the table's clustered index key. This is very typical for historical data, because most, if not all, queries are interested in specific ranges, and a clustered index on time can address this. All queries like 'month of May' need to be expressed as ranges, e.g. WHERE DATECOLKEY >= '20100501' AND DATECOLKEY < '20100601'. Answering a question like 'are there any records in May' then involves a simple seek into the clustered index.
While this may seem complicated to a programmer's mind, it is the optimal way to approach a database design problem.
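A minimal sketch of this design, assuming a hypothetical table dbo.YourHistoryTable that has a datetime column DATECOLKEY and no clustered index yet. The range is written half-open (>= first day, < first day of the next month) so that rows stamped at midnight on June 1 are not counted as May.

```sql
-- Hypothetical table and index names.
CREATE CLUSTERED INDEX CIX_YourHistoryTable_DateColKey
    ON dbo.YourHistoryTable (DATECOLKEY);

-- "Are there any records in May 2010?"
-- EXISTS stops at the first row the range seek finds.
IF EXISTS (
    SELECT 1
    FROM dbo.YourHistoryTable
    WHERE DATECOLKEY >= '20100501'
      AND DATECOLKEY <  '20100601'
)
    PRINT 'May 2010 has data';
ELSE
    PRINT 'No data for May 2010';
```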

Related

SQL Server Indexed views used instead of tables

I am a little bit confused about using indexed views in SQL Server 2016.
Here is my issue. If I have a fact table with a lot of columns and I create an indexed view named IV_Sales as
select
year,
customer,
sum(sales)
from F_Sales
group by year, customer
I would aggregate all sales for year and customer.
After that, when a user runs a query from the F_sales like
Select
year, customer,
sum(sales)
from F_sales
group by year, customer
will the Optimizer (in SQL Server Enterprise Edition) automatically use the indexed view IV_sales instead of table scan of F_sales?
I have the Standard Edition and when I add
Select
year,
customer,
sum(sales)
from F_sales WITH (NOEXPAND)
group by year, customer
I get an error since there is no clustered index like the one I created on the indexed view. Is there a way to force using index views instead of the table in Standard Edition?
My real world issue is that I have a Cognos Framework model pointing to the table F_sales and when a report is executed using Year, customer and sum of sales for performance reasons I want it to use the indexed view automatically instead of the table.
I hope I'm being clear about my issue. Many thanks in advance.
If you have a performance issue, indexed views are probably the last thing you want to try.
You should exhaust all other avenues, like standard indexes, first.
For example, if you know for sure that you are doing a table scan, the simple solution is to add a nonclustered index that satisfies the query, so it does an index scan or seek instead. If the query still doesn't use it, you need to continue your performance tuning and work out why not (non-sargable expressions? stale statistics?).
Your indexed view will automatically be used (without explicit mention of the indexed view) in a very limited number of cases. You'll see it in the query plan.
If your query very closely matches the indexed view definition, it will use your indexed view.
Make a very small change to your SQL (like joining to another table) and it won't throw an error; it will just fall back to not using the indexed view.
Automatic SQL writing tools like Cognos will very quickly make the SQL unrecognisable to the query planner, which then won't use the indexed view.
This is all very easily verifiable if you just crack open SSMS and do some experiments.
So in short: start your optimisation with standard indexes, filtered indexes, or even columnstore indexes (which are particularly good for fact tables, or so I hear).
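On the NOEXPAND error in the question: the hint applies to an indexed view, not to the base table, so it has to be attached to IV_Sales in the FROM clause. A minimal sketch, assuming F_Sales lives in the dbo schema and sales is declared NOT NULL (SUM over a nullable expression is not allowed in an indexed view). The column aliases and the index name are made up for illustration.

```sql
CREATE VIEW dbo.IV_Sales WITH SCHEMABINDING
AS
SELECT
    [year],
    customer,
    SUM(sales)   AS total_sales,  -- every column of an indexed view needs a name
    COUNT_BIG(*) AS sales_rows    -- required when an indexed view uses GROUP BY
FROM dbo.F_Sales
GROUP BY [year], customer;
GO

CREATE UNIQUE CLUSTERED INDEX IXC_IV_Sales_Year_Customer
    ON dbo.IV_Sales ([year], customer);
GO

-- Query the view itself. NOEXPAND tells the optimizer to use the view's
-- index instead of expanding the view back to the base table.
SELECT [year], customer, total_sales
FROM dbo.IV_Sales WITH (NOEXPAND);
```

For the Cognos case this would mean pointing the model at IV_Sales rather than at F_sales, since on Standard Edition the optimizer will not substitute the view on its own.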

Speeding up a SQL query with indexes

I have a table called Products.
This table contains over 3 million entries. Every day approximately 5000 new entries are added; this happens only during the night and takes about 2 minutes.
But this table gets queried every night maybe over 20 000 times with this query.
SELECT Price
FROM Products
WHERE Code = @code
AND Company = @company
AND CreatedDate = @createdDate
Table structure:
Code nvarchar(50)
Company nvarchar(10)
CreatedDate datetime
I can see that this query takes about a second to return a result from Products table.
There is no productId column in the table as it is not needed. So there is no primary key in the table.
I would like to somehow improve this query to return the result faster.
I have never used indexes before. What would be the best way to use indexes on this table?
If I provide a primary key do you think it would speed up the query result? Keep in mind that I will still have to query the table by providing 3 parameters as
WHERE Code = @code
AND Company = @company
AND CreatedDate = @createdDate.
This is mandatory.
As I mentioned that the table gets new entries in 2 minutes every day during the night. How would this affect the indexes?
If I use indexes, which column would be the best to use and whether I should use clustered or non-clustered indexes?
The best thing to do would depend on what other fields the table has and what other queries run against that table.
Without more details, a non-clustered index on (code, company, createddate) that included the "price" column will certainly improve performance.
CREATE NONCLUSTERED INDEX IX_code_company_createddate
ON Products(code, company, createddate)
INCLUDE (price);
That's because, with that index in place, SQL Server will not access the actual table at all when running the query. It can find all rows with a given (code, company, createddate) in the index very quickly, since those are the fields that define the index key, and the index also carries the "price" value for each row.
Regarding the inserts: for each row added, SQL Server will have to add it to the index as well, so insert performance will be impacted. I think you should expect the gains on SELECT performance to outweigh the impact on the inserts, but you should test that.
Also, you will be using more space as the index will store all those fields for each row besides the space used by the original table.
As others have noted in the comments, adding a PK to your table (even if that means adding a ProductId column you don't actually need) might be a good idea as well.
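One way to test the effect is to compare logical reads and elapsed time before and after creating the index. A minimal sketch; the parameter values are invented for illustration, and the inline DECLARE initializers assume SQL Server 2008 or later.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Hypothetical parameter values, for illustration only.
DECLARE @code nvarchar(50)    = N'ABC123';
DECLARE @company nvarchar(10) = N'ACME';
DECLARE @createdDate datetime = '20240101';

SELECT Price
FROM Products
WHERE Code = @code
  AND Company = @company
  AND CreatedDate = @createdDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The Messages tab then reports logical reads for Products. With the covering index in place, that number should drop sharply and the plan should show an index seek rather than a table scan.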

Simple select query taking too long

I have a table with around 18k rows for a certain week and another week has 22k rows.
I'm using view and indexes to retrieve the data like so
SELECT TOP 100 * FROM my_view
WHERE timestamp BETWEEN @date1 AND @date2
But somehow, the week with 22k rows retrieves data faster (around 3-5 sec) while the other takes at least a minute. This causes my WCF service to time out. What am I missing?
Apply an index on the timestamp field.
If you already have an index on timestamp, check in the execution plan which index is being used for this query.
The index hint will only come into play where your query involves joining tables, and where the columns being used to join to the other table matches more than one index. In that case the database engine may choose to use one index to make the join, and from investigation you may know that if it uses another index the query will perform better. In that case you provide the index hint telling the database engine which index to use.
Sample code using an index hint:
select [Order].[OrgId], [OrderDetail].[ProductId]
from [Order]
inner join [OrderDetail] with (index(IX_OrderDetail_OrderId)) on [Order].[OrderId] = [OrderDetail].[OrderId]

Is there a SQL Server 2008 method to group rows in a table so as to behave as a nested table?

This could turn out to be the dumbest question ever.
I want to track groups and group members via SQL.
Let's say I have 3 groups and 6 people.
I could have a table such as:
Then if I wanted to have find which personIDs are in groupID 1, I would just do
select * from Table where GroupID=1
(Everyone knows that)
My problem is that I have millions of rows added to this table, and I would like it to do some sort of presorting by GroupID to make lookups as fast as possible.
I'm thinking of a scenario where it would have nested tables, where each sub table would contain a groupID's members. (Illustrated below)
This way, when I wanted to select each group's members, the structure in SQL would already be nested and the lookup would not be as expensive as trawling through rows.
Does such a structure exist, in essence a table that would pivot around the groupID? Is indexing the table on groupID the best/only option?
Perhaps you see it otherwise at the moment, but what you ask is nothing else but an index on GroupId. But there are many more shades of gray, a lot depends on how you plan to use the table (the actual queries you're going to run) and the cardinality of expected data.
Should the table be clustered by (PersonID) with a non clustered index on (GroupId)?
Should it be a clustered index on (GroupId, PersonID) with a non clustered index on (PersonId)?
Or should it be clustered by (PersonId, GroupId) with a non clustered index on (GroupId, PersonId)?
...
All are valid choices, depending on your requirements, and the choice you make is pretty much going to make or break your application.
Approaching this problem from the point of view of what EF or another ORM layer gives you will likely result in a bad database design. Ultimately your whole app, as fancy and carefully coded as it is, is nothing but a thin shell around the database. Consider approaching this from a sound data modeling point of view: create a good table schema design, and then write your code on top of it, not the other way around. I understand this goes against everything the preachers on the street recommend today, but I've seen too many applications designed in the various Visual Studio data context editors fail in deployment...
If the inserts will typically be incremental (in other words, when you add a row you will typically add a groupid + personid that are greater than the last row) you can create a clustered index on groupid + personid and that will make SQL physically store the rows in that order and it makes a lookup on that key very fast.
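A sketch of one of the layouts discussed above: clustered on (GroupID, PersonID) with a nonclustered index on (PersonID). The table and constraint names are hypothetical; only GroupID and PersonID come from the question. Whether this is the right choice still depends on the actual queries.

```sql
-- Hypothetical table name; GroupID and PersonID come from the question.
CREATE TABLE dbo.GroupMember (
    GroupID  int NOT NULL,
    PersonID int NOT NULL,
    -- Clustered primary key: rows are kept in (GroupID, PersonID) key order,
    -- so all members of one group sit next to each other.
    CONSTRAINT PK_GroupMember PRIMARY KEY CLUSTERED (GroupID, PersonID)
);

-- Supports the reverse lookup: which groups does a person belong to?
CREATE NONCLUSTERED INDEX IX_GroupMember_PersonID
    ON dbo.GroupMember (PersonID);

-- All members of group 1: a range seek on the clustered index.
SELECT PersonID
FROM dbo.GroupMember
WHERE GroupID = 1;
```

This gives the "presorted by group" behaviour the question asks for without any nested-table structure.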

SQL Server Indexes Aren't Helping

I have a table (SQL 2000) with over 10,000,000 records. Records get added at a rate of approximately 80,000-100,000 per week. Once a week a few reports get generated from the data. The reports are typically fairly slow to run because there are few indexes (presumably to speed up the INSERTs). One new report could really benefit from an additional index on a particular "char(3)" column.
I've added the index using Enterprise Manager (Manage Indexes -> New -> select column, OK), and even rebuilt the indexes on the table, but the SELECT query has not sped up at all. Any ideas?
Update:
Table definition:
ID, int, PK
Source, char(3) <--- column I want indexed
...
About 20 different varchar fields
...
CreatedDate, datetime
Status, tinyint
ExternalID, uniqueidentifier
My test query is just:
select top 10000 [field list] where Source = 'abc'
You need to look at the query plan and see if it is using that new index. If it isn't, there are a couple of things to check. First, it could be using a cached query plan that has not been invalidated since the new index was created. If that is not the case, you can also try an index hint [ WITH (INDEX (yourindexname)) ].
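A sketch of such a hinted query, for diagnosis only. The table name dbo.YourTable and the index name IX_YourTable_Source are placeholders, since the question does not give them; the column list uses columns from the posted table definition.

```sql
-- Hypothetical table and index names. Compare the plan and timing of this
-- hinted query against the same query without the WITH (...) clause.
SELECT TOP 10000 ID, Source, CreatedDate, Status
FROM dbo.YourTable WITH (INDEX (IX_YourTable_Source))
WHERE Source = 'abc';
```

If the hinted version is slower, the optimizer was probably right to skip the index, which usually points to low selectivity on the Source column.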
10,000,000 rows is not unheard of, it should read that out pretty fast.
Use the Show Execution Plan in SQL Query Analyzer to see if the index is used.
You could also try making it a clustered index if it isn't already.
For a table of that size your best bet is probably going to be partitioning your table and indexes.
select top 10000
How unique are your sources? Indexes on fields that have very few distinct values are usually ignored by the SQL engine, because using them can make queries slower than a scan. You might want to remove that index and see if the query is faster, if your Source field only has a handful of values.
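A quick way to answer the "how unique" question is to count rows per Source value. The table name below is a placeholder; Source is the char(3) column from the question.

```sql
-- Hypothetical table name. A handful of distinct values, each covering a
-- large share of the 10,000,000 rows, means the index is unlikely to help.
SELECT Source, COUNT(*) AS RowsPerSource
FROM dbo.YourTable
GROUP BY Source
ORDER BY COUNT(*) DESC;
```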

Resources