I have a table with around 30 columns and around 2 million records. I am running a select query and I am getting a key lookup in the execution plan. Is it good practice to have all thirty columns in the INCLUDE clause, or is there another way to solve the issue? I know include columns store their data at the leaf-level nodes to satisfy the query. What is that 900 bytes? Does it mean that the combined length of all include columns should not be more than 900 bytes?
Creating an index with the leading key being the main predicate of the query and including all other columns will speed up the query and replace the key lookup with an index seek. BUT, and it's a big but, you are then storing the data twice, just in a different order, so a 100MB table becomes a 200MB table. If this is the single most important query of your application then you might be able to justify that.
Brent Ozar has a general rule of thumb of '5 in 5': no more than 5 indexes per table, with no more than 5 columns each. I say general because obviously there are situations where that doesn't apply.
If you need all 30 columns then a key lookup in the execution plan is going to be the best bet: have an index on the main predicate and let SQL Server use the clustered key stored in that index to do a key lookup for the other columns.
900 bytes is the maximum size of an index key (the combined key columns), so whether you hit it depends on the data types of the key columns. INCLUDE columns are stored only at the leaf level and do not count toward that limit.
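To make the trade-off concrete, here is a minimal sketch of the two options discussed above; all table, index, and column names are hypothetical.

-- Option 1: index only the main predicate and accept the key lookup
CREATE NONCLUSTERED INDEX IX_MyTable_MainPredicate
ON dbo.MyTable (MainPredicateColumn);

-- Option 2: cover the query by including the remaining columns
-- (roughly doubles the storage for this table's data)
CREATE NONCLUSTERED INDEX IX_MyTable_MainPredicate_Covering
ON dbo.MyTable (MainPredicateColumn)
INCLUDE (Col01, Col02, Col03);  -- list the rest of the 30 columns here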
I have an SQL database where, among other things, I have a prices table with one price per product per store.
There are 50 stores and over 500,000 products, so this table will easily have 25 to 30 million records.
This table is fed daily overnight with price updates, and has huge read operations during the day. Reads are made with read-only intent.
All queries contain storeid as part of identifying the record to update or read.
I'm not yet able to determine how this will behave, since I'm still waiting on the external supply of prices, but I'm expecting performance issues at least on read operations, even though indexes are in place for now...
My question is whether I should consider partitioning the table by store, since it is always part of the queries. But then I have indexes where storeid is not the only column that is part of the index.
Based on this scenario, would you recommend partitioning? The alternative I see is having 50 tables, one per store, but that seems painful and I'd rather avoid it if possible.
whether I should consider partitioning the table by store, since it is always part of the queries
Yes. That sounds promising.
But then I have indexes where storeid is not the only column that is part of the index.
That's fine. So long as the partitioning column is one of the clustered index columns, you can partition by it. In fact with partitioning, you can get partition elimination for a trailing column of the clustered index, then a clustered index seek within the target partition.
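As a rough sketch of what that could look like (all object names here are hypothetical, and it assumes dbo.Prices does not already have a clustered index under a different definition):

-- One boundary per store; extend the list up to 50 stores
CREATE PARTITION FUNCTION pfStore (int)
AS RANGE RIGHT FOR VALUES (2, 3, 4, 5);

CREATE PARTITION SCHEME psStore
AS PARTITION pfStore ALL TO ([PRIMARY]);

-- storeid is one of the clustered index columns, so the table can be
-- partitioned by it; queries filtering on storeid get partition elimination
-- and then a clustered index seek within the target partition
CREATE CLUSTERED INDEX CIX_Prices_Product_Store
ON dbo.Prices (productid, storeid)
ON psStore (storeid);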
Hi all and thank you for your replies.
I was able to gather significant information in a contained environment, where I confirmed that I can achieve excellent performance indicators by using only the appropriate indexes.
So for now we will keep it "as is" and have the partition strategy on hand just in case.
Thanks again, nice tips guys
I have over 500 million records in Azure SQL Data Warehouse.
I'm trying to do some benchmarking in order to understand which way to keep the records: rowstore or columnstore.
I won't join this table with other tables; it's not an analytical fact table.
Both tables are distributed as round robin, both of them contain 17 partitions, and both of them have 45 columns.
When I query to sum two columns, I expect the columnstore table to perform much better than the rowstore; however, the reality is that I get my sum result from the rowstore in somewhere around 2.5 min and from the columnstore in around 10 min. I don't use any filter or group by.
On the other hand, when I query count(*), the columnstore table performs much, much better than the rowstore.
EDIT
Though I can't share all the details with you because they're private, here are some of them, just to give an understanding of what's going on.
I run queries on smallrc and 100DWU.
The table is loaded with one CTAS and contains pre-joined information from several tables; it is going to serve queries over a custom-defined protocol (sort/group/filter/paging) from our internal application.
The domain is gambling, and of the 45 columns, 43 could be used as filters. The output set usually contains 3 to 4 columns plus two sum columns, with no more than 1000 rows per query.
I partitioned both tables monthly via EventDate, so each month gets a new partition. Most of my queries contain EventDate as a filter.
My rowstore table has EventDate as a clustered index, in addition to the same partitions as the columnstore.
Adding EventDate as a secondary index to the columnstore gave some improvement, but performance is still far behind the rowstore.
EventDate is in int format and the values follow the pattern yyyyMMdd (e.g. 20180101).
Every DW optimized for elasticity has 60 distributions, while the lower SKUs of DW optimized for compute also have 60 distributions.
SQL Server's columnstore creates row groups based on row count (as opposed to Parquet, for example, where row groups are created based on disk size). Row groups should ideally have 1M rows (see the link that @GregGalloway added), but row groups can get COMPRESSED if they have at least ~100k rows loaded in a single bulk load. When a row group is not compressed, it is stored in row format in delta stores (these are regular B-trees, with a metadata/access overhead since they are part of the columnstore index; note that you cannot specify their indexing, since they are part of the clustered columnstore index).
I assume that you have 500M rows in 60 distributions, that is 8.3M rows per distribution; assuming your partitioning is homogeneous with 17 partitions you'd have ~490k rows per partition.
When bulk loading into a partitioned table you need to be careful about the memory requirements / the resource class you're loading with, since the sort iterator on top of the bulk load does not spill, so it will feed the bulk load only as many rows as it can sort.
Ensure that your index has good quality. If you'll do only aggregates over the table without much filtering, then 1 partition is ideal; even if you do filtering, remember that columnstore does segment elimination, so if your data is loaded in the right order you'll be fine.
You should ensure that you have at least a few million rows per partition and that you have COMPRESSED row groups to have good performance. Given your scan results you have most if not all of your columnstore data in OPEN row groups (delta stores).
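One way to check this is to look at the row group states. This is a sketch against SQL Server's standard DMV, with a hypothetical table name; on Azure SQL Data Warehouse the pdw_nodes catalog views expose similar per-distribution information.

SELECT partition_number,
       state_desc,                 -- OPEN / CLOSED / COMPRESSED / TOMBSTONE
       COUNT(*)          AS row_groups,
       SUM(total_rows)   AS total_rows,
       SUM(deleted_rows) AS deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.FactTable')   -- hypothetical table name
GROUP BY partition_number, state_desc
ORDER BY partition_number, state_desc;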
What do you mean by much better performance in case of count(*)?
Also, were these runs cold or warm? If it's a warm run, for count(*) the columnstore might just be grabbing the row group metadata and adding up the row counts - though in both cases the compiled plans show a full table scan.
After running the following query:
SELECT [hour], COUNT(*) AS hits, AVG(elapsed)
FROM myTable
WHERE [url] IS NOT NULL AND floordate >= '2017-05-01'
GROUP BY [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: URL has an index on it (a regular index, because I'm always searching for an exact match), and floordate also has an index...
Why are they not being used? How can I speed up this query?
PS: the table is 70M rows and this query takes about 9 minutes to run
Edit 1
If I don't use (select or filter on) a column of my index, will the index still be used? Usually I also filter on / group by clientId (approx. 300 unique values across the db) and hour (24 unique values)...
In this scenario, two things affect how SQL Server will choose an index.
How selective the index is. Higher selectivity is better. NULL/NOT NULL filters generally have very low selectivity.
Whether all of the columns needed by the query are in the index, also known as a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date range scans are generally more selective than a NULL/NOT NULL test. Moving floordate to the front may make this index more desirable for this query. If SQL determines the query is a good fit for floordate and URL, the hour column can be used for the GROUP BY. Since elapsed is included, this index can cover the query completely.
You can include ClientID after hour to see if that helps your other query as well.
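A minimal sketch of that suggested index, using the table and column names from the query above (the index names are made up):

CREATE NONCLUSTERED INDEX IX_myTable_floordate_url_hour
ON myTable (floordate, [url], [hour])
INCLUDE (elapsed);

-- Variant for the clientId query mentioned above:
-- CREATE NONCLUSTERED INDEX IX_myTable_floordate_url_hour_client
-- ON myTable (floordate, [url], [hour], clientId)
-- INCLUDE (elapsed);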
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Doing some testing shows that querying for the 5 columns is way faster than querying for all 30 columns. Is this just the natural overhead of selecting 6x as many columns, or is it because the indexed view does not store the non-indexed columns, and therefore needs to perform some extra steps to gather the missing columns (joins on the base tables, I guess)?
If the latter, what are some steps to prevent this? Well, even if the former... what are some ways around this!
Edit: for comparison purposes, a select on the indexed view with just the 5 columns is about 10x faster than the same query on the base tables. But a select on all columns is basically equivalent in speed to the query on the base tables.
A clustered index, by definition, contains every field in every row in the table. It basically is a recreation of the table, but with the physical data pages in order by the clustered index, with b-tree sorting to allow quick access to specified values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way to the base table. For all purposes:
- you now have two copies of the data
- SQL Server is smart enough to see that the view and table are aliases of each other for queries that involve only the columns in the indexed view
- if the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried
- the indexed view can be used as just another index for the base table
When you select only the 5 columns from tbl (which has an indexed view ivw):
- SQL Server completely ignores your table and just gives you data from ivw
- because the data pages are shorter (5 columns only), more records can be grabbed into memory with each page retrieval, so you get a 5x increase in speed
When you select all 30 columns - there is no way for the indexed view to be helpful. The query completely ignores the view, and just selects data from the base table.
IF you select data from all 30 columns, but the query filters on the first 4 columns of the indexed view, and the filter is very selective (it will result in a very small subset of records), then SQL Server can use the indexed view (scanning/seeking) to quickly generate a small result set, which it can then use to JOIN back to the base table to get the rest of the data.
However, similarly to regular indexes, an index on (a,b,c,d,e) or in this case clustered indexed view on (a,b,c,d,e) does NOT help a query that searches on (b,d,e) because they are not the first columns in the index.
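For reference, a minimal sketch of how such an indexed view is defined (hypothetical object names; the view must be schema-bound and the clustered index must be unique):

CREATE VIEW dbo.vw_Example
WITH SCHEMABINDING
AS
SELECT a, b, c, d, e,   -- the 5 columns that will form the clustered key
       f, g             -- ...plus the remaining columns exposed by the view
FROM dbo.BaseTable;
GO

-- Materializes the view: every column of the view is stored at the leaf
-- level of this clustered index, ordered by (a, b, c, d, e)
CREATE UNIQUE CLUSTERED INDEX IX_vw_Example
ON dbo.vw_Example (a, b, c, d, e);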
I have a table with several indexes. All of them contain a specific integer column.
I'm moving to MySQL 5.1 and am about to partition the table by this column.
Do I still have to keep this column as a key in my indexes, or can I remove it, since partitioning will take care of searching only the relevant partition's data efficiently without needing to specify it as a key?
The partitioning field must be part of every unique key, so the answer is that I have to keep the partitioning column in my indexes.
Partitioning will only slice the values/ranges of that index into separate partitions according to how you set it up. You'd still want to have indexes on that column so the index can be used after partition pruning has been done.
Keep in mind that this has a big impact on how many partitions you can have: if you have an integer column with only 4 distinct values in it, you might create 4 partitions, and an index would likely not benefit you much, depending on your queries.
If you have 10,000 distinct values in your integer column, you'll hit system limits if you try to create 10k partitions, so you'll have to partition on larger ranges (e.g. 0-1000, 1001-2000, etc.); in such a case you'll benefit from an index (again depending on how you query the tables).
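As an illustration of that last point, here is a minimal MySQL sketch (table and column names are made up) that partitions on larger ranges of the integer column while keeping it in the keys:

CREATE TABLE my_table (
    id INT NOT NULL,
    category_id INT NOT NULL,          -- the integer column used for partitioning
    payload VARCHAR(100),
    PRIMARY KEY (id, category_id),     -- unique keys must include the partitioning column
    KEY idx_category (category_id, payload)
)
PARTITION BY RANGE (category_id) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN (2000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);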