I have over 500 million records in Azure SQL Data Warehouse.
I'm trying to run some benchmarks to understand which way to keep the records: rowstore or columnstore.
I won't be joining this table with other tables; it's not an analytical fact table.
Both tables are distributed round robin, both contain 17 partitions, and both have 45 columns.
When I query the sum of two columns, I expect the columnstore table to perform much better than the rowstore. The reality, however, is that I get my sum result from the rowstore in around 2.5 minutes and from the columnstore in around 10 minutes. I don't use any filter or GROUP BY.
On the other hand, when I query count(*), the columnstore table performs much, much better than the rowstore.
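For reference, the two benchmark queries have roughly this shape (table and column names here are hypothetical; the real table has 45 columns):

-- Sum of two columns over the whole table, no filter, no GROUP BY
SELECT SUM(Amount1) AS TotalAmount1, SUM(Amount2) AS TotalAmount2
FROM dbo.FactEvents;

-- Plain row count over the whole table
SELECT COUNT(*) AS TotalRows
FROM dbo.FactEvents;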
EDIT
Though I can't share all the details with you because they're private,
here are some of them, just to give an understanding of what's going on.
I run the queries on smallrc at DWU 100.
The table is loaded with one CTAS and contains pre-joined information from several tables; it is going to serve queries over a custom-defined protocol (sort/group/filter/paging) from our internal application.
The domain is gambling, and of the 45 columns, 43 can be used as filters. The output set usually contains 3 to 4 columns plus two sum columns, with no more than 1000 rows per query.
I partitioned both tables monthly on EventDate, so each month gets a new partition. Most of my queries contain EventDate as a filter.
My rowstore table has a clustered index on EventDate, in addition to the partitions, which are the same as for the columnstore.
Adding EventDate as a secondary index on the columnstore gave some improvement, but performance is still far behind the rowstore.
EventDate is an int whose values follow the yyyyMMdd pattern (e.g. 20180101).
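For illustration, the two table shapes being compared would look roughly like this in Azure SQL Data Warehouse CTAS syntax (table names and partition boundaries are hypothetical; the real tables have 45 columns and 17 monthly partitions):

-- Columnstore variant: round-robin distribution, clustered columnstore index, monthly partitions on EventDate
CREATE TABLE dbo.FactEvents_CS
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (EventDate RANGE RIGHT FOR VALUES (20180101, 20180201, 20180301 /* ...one boundary per month... */))
)
AS SELECT * FROM dbo.StagedEvents;

-- Rowstore variant: same distribution and partitioning, clustered index on EventDate
CREATE TABLE dbo.FactEvents_RS
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (EventDate),
    PARTITION (EventDate RANGE RIGHT FOR VALUES (20180101, 20180201, 20180301 /* ...one boundary per month... */))
)
AS SELECT * FROM dbo.StagedEvents;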
Every DW optimized for elasticity has 60 distributions, and the lower SKUs of DW optimized for compute also have 60 distributions.
SQL Server's columnstore creates row groups based on row count (as opposed to Parquet, for example, where row groups are created based on disk size). Row groups should ideally have 1M rows (see the link that @GregGalloway added), but row groups can get COMPRESSED if they have at least 100k rows loaded in a single bulk load. When a row group is not compressed, its rows are stored in row format in delta stores (these are regular B-trees, with a metadata/access overhead since they are part of the clustered columnstore index; note that you cannot specify their indexing).
I assume that you have 500M rows in 60 distributions, that is 8.3M rows per distribution; assuming your partitioning is homogeneous with 17 partitions you'd have ~490k rows per partition.
When bulk loading into a partitioned table you need to be careful about the memory requirements/resource class you're loading with, since the sort iterator on top of the bulk load does not spill, so it will feed the bulk load only as many rows as it can sort in memory.
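For example, one way to load under a bigger dynamic resource class is to add the loading user to the corresponding role (the user name below is hypothetical):

-- Sessions of this user now get the larger memory grant of the largerc resource class
EXEC sp_addrolemember 'largerc', 'load_user';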
Ensure that your index has good quality. If you'll only do aggregates over the table without much filtering, then 1 partition is ideal. Even if you do filtering, remember that columnstore does segment elimination, so if your data is loaded in the right order you'll be fine.
You should ensure that you have at least a few million rows per partition and that your row groups are COMPRESSED in order to get good performance. Given your scan results, you have most if not all of your columnstore data in OPEN row groups (delta stores).
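If that is the case, rebuilding the columnstore index (ideally from a session running under a larger resource class) will compress the open row groups into columnar format; a minimal sketch with a hypothetical table name:

-- Recompress delta-store rows into COMPRESSED columnstore row groups
ALTER INDEX ALL ON dbo.FactEvents_CS REBUILD;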
What do you mean by much better performance in case of count(*)?
Also, were these runs cold or warm? If it's a warm run, for count(*) the columnstore might just be grabbing the row group metadata and adding up the row counts - though in both cases the compiled plans show a full table scan.
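On a test box (plain SQL Server; support on Azure SQL Data Warehouse may differ) you can approximate a cold run by flushing the buffer pool first - never do this on a production system:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;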
Related
I have a table with around 30 columns and around 2 million records in it. I am running a select query and I am getting a key lookup in the execution plan. Is it good practice to have those thirty columns in the INCLUDE clause, or is there another way to solve the issue? I know INCLUDE columns store their data at the leaf-level nodes to satisfy the query. What is that 900 bytes? Does it mean that the length of all INCLUDE columns should not be more than 900 characters?
Creating an index with the leading key being the main predicate of the query and including all other columns will speed up the query and replace the key lookup with an index seek. BUT, and it's a big but, you are then storing the data twice, just in a different order, so a 100MB table becomes a 200MB table. If this is the single most important query of your application then you might be able to justify that.
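A minimal sketch of such a covering index, with hypothetical table and column names:

-- Leading key on the main predicate; the remaining selected columns are carried at the leaf level
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, Status, Amount /* ...the other columns the query selects... */);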
Brent Ozar has a general rule of thumb of '5 in 5': no more than 5 indexes with no more than 5 columns per table. I say general because obviously there are situations when that doesn't apply.
If you need all 30 columns then a key lookup in the execution plan is going to be the best bet: have an index on the main predicate and let SQL Server use the hidden PK on the index to do a key lookup for the other columns.
900 bytes is the maximum size of an index key (the key columns, not the INCLUDE columns), so that will depend on what data types the key columns are.
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, as you don't have to send the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (excluding BLOBs) are stored in the same set of pages, so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you had a nonclustered index on the column list you are after, as then those columns are saved in their own structure of data pages (so there would be less to read).
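Using the column names from the question, a rough sketch of such an index would be:

-- A narrow index that covers the two selected columns, so the wide base table is never touched
CREATE NONCLUSTERED INDEX IX_table_IndexedColumn_column1
ON dbo.[table] (IndexedColumn)
INCLUDE (column1);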
Assuming that you are using the default clustered index created by SQL Server when defining the primary key on the table in both scenarios, then no, there shouldn't be any performance difference between these two scenarios. Maybe worth just checking it out and generating an actual execution plan to see for yourself? -- Actually, I'm not sure the above is true: given this is rowstore, the first table won't be able to fit as many rows onto each page, so it will suffer more of an IO/disk overhead when reading data.
After running the following query:
SELECT [hour], count(*) as hits, avg(elapsed)
FROM myTable
WHERE [url] IS NOT NULL and floordate >= '2017-05-01'
group by [hour]
the execution plan is basically a clustered Index Scan on the PK (int, auto-increment, 97% of the work)
The thing is: URL has an index on it (a regular index, because I'm always searching for an exact match), and floordate also has an index...
Why are they not being used? How can I speed up this query?
PS: the table is 70M rows long and this query takes about 9 min to run.
Edit 1
If I don't use (select or filter on) a column of my index, will the index still be used? Usually I also filter-for/group-by clientId (approx. 300 unique values across the db) and hour (24 unique values)...
In this scenario, two things affect how SQL Server will choose an index.
How selective the index is. Higher selectivity is better. NULL/NOT NULL filters generally have very low selectivity.
Whether all of the columns needed by the query are in the index, also known as a covering index.
In your example, if the index cannot cover the query, SQL will have to look up the other column values against the base table. If your URL/Floordate combination is not selective enough, SQL may determine it is cheaper to scan the base table rather than do an expensive lookup from the non-clustered index to the base table for a large number of rows.
Without knowing anything else about your schema, I'd recommend an index with the following columns:
floordate, url, hour; include elapsed
Date range scans are generally more selective than a NULL/NOT NULL test. Moving Floordate to the front may make this index more desirable for this query. If SQL determines the query is good for Floordate and URL, the Hour column can be used for the Group By action. Since Elapsed is included, this index can cover the query completely.
You can include ClientID after hour to see if that helps your other query as well.
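A minimal sketch of that recommendation, assuming the myTable from the question:

-- floordate leads for the range scan, url supports the NOT NULL filter, hour feeds the GROUP BY, elapsed covers the AVG
CREATE NONCLUSTERED INDEX IX_myTable_floordate_url_hour
ON dbo.myTable (floordate, [url], [hour])
INCLUDE (elapsed);
-- To also help the clientId query, a variant could add clientId to the key after [hour].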
As long as an index contains all of the columns to resolve the query, it is a candidate for use, even if there is no filtering needed. Generally speaking, a non-clustered index is skinnier than the base table, requiring less IO than scanning the full width base table.
I am new to partitioning.
Would there be a difference in performance between
select * from my_partitionedData where date = '20110523'
and
select * from my_Data where date = '20110523'
where my_partitionedData is a table partitioned by date by 1 day and my_Data is a table which has only data for '20110523' and both tables have same structure?
The other question - would there be a difference in performance in running these selects if all the partitions of my_partitionedData are in the same filegroup? (Note - the select is always for 1 day.)
Like everything else in SQL, you will need to test to be sure.
That being said, I think you should get identical performance.
Behind the scenes, a partitioned table is basically a lot of smaller tables logically unioned together. If you are partitioning by day in your partitioned table, and your non-partitioned table has only one day of data, the execution plan and performance should be pretty much identical.
If they return the same data set, a partitioned and a non-partitioned table will return the data with the same IO. If the partitioned table has less fragmentation there would be a reduction in the IO delay from random seeks of the disk heads to retrieve the pages, but all in all 100k of data is 100k of data.
You did not mention if you were considering partitioning the index. Partitioning indexes is an excellent way to reduce the number of levels which must be traversed to find the location of the data row. Partitioning indexes and tables with the same function is the optimal solution.
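A rough sketch of aligning the table and its index on the same daily partition function (function, scheme, and index names are hypothetical):

-- One boundary per day; RANGE RIGHT puts each boundary date at the start of its own partition
CREATE PARTITION FUNCTION pfDaily (date)
AS RANGE RIGHT FOR VALUES ('20110522', '20110523', '20110524');

CREATE PARTITION SCHEME psDaily
AS PARTITION pfDaily ALL TO ([PRIMARY]);

-- Creating the clustered index on the same scheme keeps the index aligned with the table's partitions
CREATE CLUSTERED INDEX IX_myPartitionedData_date
ON dbo.my_partitionedData ([date])
ON psDaily ([date]);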
where my_partitionedData is a table partitioned by date by 1 day and my_Data is a table which has only data for '20110523' and both tables have same structure?
The latter one will have less access time.
The other question - would there be a difference in performance in running these selects if all the partitions of the my_partitionedData are in the same file group? (note - the select is always for 1 day)
The access time will be more in this case despite there being only 1 day of data.
Partitioning is required to improve the scalability and manageability of large tables and tables that have varying access patterns.
You created two tables to store information about each day's records. A single table for each day's data is the easiest to design and understand, but such tables are not necessarily optimized for performance, scalability, and manageability, particularly as the table grows larger.
I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Some testing shows that querying for the 5 columns is way faster than querying for all 30 columns. Is this just natural overhead from selecting 6x as many columns, or is it because the indexed view is not storing the non-indexed columns, and therefore needs to perform some extra steps to gather the missing columns (joins on the base tables, I guess)?
If the latter, what are some steps to prevent this? Well, even if the former... what are some ways around this?
Edit: for comparison purposes, a select on the indexed view with just the 5 columns is about 10x faster than the same query on the base tables. But a select on all columns is basically equivalent in speed to the query on the base tables.
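For context, such a setup would be defined roughly like this (hypothetical view, table, and column names; the real view exposes 30 columns):

-- The view must be schema-bound before it can be indexed
CREATE VIEW dbo.vWide
WITH SCHEMABINDING
AS
SELECT t.Key1, t.Key2, t.Key3, t.Key4, t.Key5,
       t.Col06, t.Col07 /* ...through the remaining 30 columns... */
FROM dbo.BaseTable AS t;
GO

-- Unique clustered index keyed on 5 of the view's 30 columns
CREATE UNIQUE CLUSTERED INDEX IX_vWide
ON dbo.vWide (Key1, Key2, Key3, Key4, Key5);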
A clustered index, by definition, contains every field in every row in the table. It basically is a recreation of the table, but with the physical data pages in order by the clustered index, with b-tree sorting to allow quick access to specified values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way to the base table. For all purposes:
- you now have two copies of the data
- SQL Server is smart enough to see that the view and table are aliases of each other, for queries that involve only the columns in the indexed view
- if the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried
- the indexed view can be used as just another index for the base table
When you select only the 5 columns from tbl (which has an indexed view ivw), SQL Server completely ignores your table and just gives you data from ivw. Because the data pages are shorter (5 columns only), more records can be grabbed into memory in each page retrieval, so you get a 5x increase in speed.
When you select all 30 columns - there is no way for the indexed view to be helpful. The query completely ignores the view, and just selects data from the base table.
IF you select data from all 30 columns, but the query filters on the first 4 columns of the indexed view, and the filter is very selective (will result in a very small subset of records), SQL Server can use the indexed view (scanning/seeking) to quickly generate a small result set, which it can then use to JOIN back to the base table to get the rest of the data.
However, similarly to regular indexes, an index on (a,b,c,d,e) or in this case clustered indexed view on (a,b,c,d,e) does NOT help a query that searches on (b,d,e) because they are not the first columns in the index.