group by on clustering key is not reading from metadata - snowflake-cloud-data-platform

I have defined a clustering key on one of the columns, "time_period". When I use a WHERE clause on it, the query is answered from metadata, which I can see in the query profile of:
select count(*) from table where time_period = 'Jan 2021'
But when I use GROUP BY to get the count for each month, it scans all the partitions:
select time_period, count(*) from table group by time_period
Why is the second query not a metadata-only operation?

select time_period , count(*) from table group by time_period;
is a full table scan.
select count(*) from table where time_period = 'Jan 2021'
is a scan restricted to the partitions where time_period equals one value, so the metadata is searched to find the matching partitions, hence the pruning.
If your table has values from 'Jan 2020' to 'Jan 2021', and assuming those are dates rather than strings (which would be very bad for performance), and assuming your data is clustered on time_period (or naturally inserted in "months"), then
select time_period, count(*)
from table
where time_period >= '2021-06-01'
group by 1 order by 1;
should only read ~50% of your partitions, as the assumed order of the data means only half the table needs to be read.
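As a way to check how much pruning is possible, Snowflake exposes clustering metadata for a table; a minimal sketch, with my_table standing in for your table name and time_period as the clustering column:
-- How well is the table clustered on time_period?
select system$clustering_information('my_table', '(time_period)');
-- Returns JSON with total_partition_count, average_depth, etc.;
-- a low average depth means a WHERE filter on time_period can prune most partitions.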

Answering the "meta-data" vs "scanning" question. This is based on years of working with query optimization, and is "very well educated speculation".
There is big difference between "COUNT()" and "COUNT() ... GROUP BY". The latter is much more complex and handles much more complex queries.
Optimizers evolve over time to handle special cases, but they start out focusing on more common types of queries.
The non-GROUP query against a non-keyed but well clustered table with use a scan. It's a specialized optimization, meaningful, optimization for a special case.
But the same specialization is not present in the GROUP BY, which addresses a much broader class of queries, with GROUP BY and WHERE clauses for multiple non-cluser-key columns.
The COUNT() GROUP BY would need to add a special check for this particular query form; once anything else is added, the meta-data would not be sufficient.
So no specialized optimization for this specific case in COUNT(), GROUP BY

Related

How can I speed up this sql server query?

-- Holds last 30 valdates
create table #valdates(
date int
)
insert into #valdates
select distinct top (30) valuation_date
from tbsm.tbl_key_rates_summary
where valuation_date <= 20150529
order by valuation_date desc
select
sum(fv_change), sc_group, valuation_date
from
(select *
from tbsm.tbl_security_scorecards_summary
where valuation_date in (select date from #valdates)) as fact
join
(select *
from tbsm.tbl_security_classification
where sc_book = 'UC' ) as dim on fact.classification_id = dim.classification_id
group by
valuation_date, sc_group
drop table #valdates
This query takes around 40 seconds to return because the fact table has almost 13 million rows. Can I do anything about this?
Based on the fact that there's no proper index that supports the fetch, adding one is probably the easiest (or only) option to really improve the performance. Most likely an index like this would improve the situation a lot:
create index idx_security_scorecards_summary_1 on
tbl_security_scorecards_summary (valuation_date, classification_id)
include (fv_change)
Everything depends of course on how good the selectivity of the valuation_date and classification_id fields is (i.e., how big a portion of the table needs to be fetched), and the index might work better with the fields in the opposite order. The field fv_change is in the INCLUDE section so that it's part of the index structure and there's no need to fetch it from the base table.
Included fields help if the SQL has to fetch a lot of rows from the table. If the number of rows this touches is small, they might not help at all. As always with indexing, this slows down inserts / updates, is optimized for this case only, and you should of course look at the bigger picture too.
The select is written in a slightly strange way; I'm not sure if that makes any difference, but you could also try the normal way to do this:
select
sum(fact.fv_change), dim.sc_group, fact.valuation_date
from
tbsm.tbl_security_scorecards_summary fact
join tbsm.tbl_security_classification dim
on fact.classification_id = dim.classification_id
where
fact.valuation_date in (select date from #valdates) and
dim.sc_book = 'UC'
group by
fact.valuation_date,
dim.sc_group
Looking at "statistics io" output should give you a good idea which table is causing the slowness, and looking at query plan to see if there's any strange operators might help to understand the situation better.

SQL Server Performance With Large Query

Hi everyone, I have a couple of queries for some reports in which each query pulls data from 35+ tables. Each table has almost 100K records. All the queries are UNION ALLs, for example:
;With CTE
AS
(
Select col1, col2, col3 FROM Table1 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table3 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table4 WHERE Some_Condition
.
.
. And so on
)
SELECT col1, col2, col3 FROM CTE
ORDER BY col3 DESC
So far I have only tested this query on the dev server and I can see it takes its time to get the results. These 35+ tables are not related to each other, and this is the only way I can think of to get all the desired data into the result set.
Is there a better way to do this kind of query?
If this is the only way to go, how can I improve the performance of this query by making any changes, if possible?
My Opinion
I don't mind having a few dirty reads in this report. I was thinking of using query hints with NOLOCK or setting the transaction isolation level to READ UNCOMMITTED.
Will any of this help?
Edit
Every table has 5-10 bit columns and a corresponding date column for each bit column, and my condition for each SELECT statement is something like
WHERE BitColumn = 1 AND DateColumn IS NULL
Suggestion By Peers
Filtered Index
CREATE NONCLUSTERED INDEX IX_Table_Column
ON TableName(BitColumn)
WHERE BitColumn = 1
Filtered Index with Included Column
CREATE NONCLUSTERED INDEX fIX_IX_Table_Column
ON TableName(BitColumn)
INCLUDE (DateColumn)
WHERE DateColumn IS NULL
Is this the best way to go? Or any suggestions, please?
There are lots of things that can be done to make it faster.
If I assume you need to do these UNIONs, then you can speed up the query by:
Caching the results, for example,
Can you create an indexed view from the whole statement? Or are there lots of different WHERE conditions, so there'd be lots of indexed views? But know that this will slow down modifications (INSERT, etc.) for those tables.
Can you cache it in a different way? Maybe in the middle layer?
Can it be recalculated in advance?
Make a covering index. The leading columns are the columns from WHERE, and then all other columns from the query go in as included columns (a sketch follows after this list).
Note that a covering index can also be filtered, but a filtered index isn't used if the WHERE in the query has variables / parameters that can potentially take a value not covered by the filtered index (i.e., the row isn't covered).
ORDER BY will cause a sort:
If you can cache it, then it's fine - no sort will be needed (it's cached sorted).
Otherwise, the sort is CPU bound (and I/O bound if it doesn't fit in memory). To speed it up, do you use a fast collation? The performance difference between the slowest and fastest collation can be as much as 3x. For example, SQL_EBCDIC280_CP1_CS_AS, SQL_Latin1_General_CP1251_CS_AS, and SQL_Latin1_General_CP1_CI_AS are among the fastest collations. However, it's hard to make recommendations if I don't know the collation characteristics you need.
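A minimal sketch of the covering / filtered index idea above, assuming the BitColumn / DateColumn / col1-col3 names from the question (the index name is made up):
-- Covering index: the WHERE columns lead, the selected columns are included.
-- The filter matches the fixed predicate used by every SELECT; it won't be
-- used if the predicate is parameterized with values outside the filter.
CREATE NONCLUSTERED INDEX IX_Table1_Bit_Date_Covering
ON Table1 (BitColumn, DateColumn)
INCLUDE (col1, col2, col3)
WHERE BitColumn = 1 AND DateColumn IS NULL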
Network
The 'network packet size' for the connection that does the SELECT should be the maximum value possible, 32,767 bytes, if the result set (number of rows) will be big. This can be set on the client side, e.g., in the connection string if you use .NET and SqlConnection. This will minimize CPU overhead when sending data from SQL Server and will improve performance on both sides, client and server. It can boost performance by tens of percent if the network was the bottleneck.
Use the shared memory endpoint if the client is on the same machine as SQL Server; otherwise use TCP/IP for the best performance.
General things
As you said, using the READ UNCOMMITTED isolation level will improve the performance
...
You probably can't make changes beyond rewriting the query, etc., but just in case: adding more memory if it isn't sufficient now, or using the SQL Server 2014 in-memory features :-), ... would surely help.
There are way too many things that could be tuned but it's hard to point out the key ones if the question isn't very specific.
Hope this helps a bit
Well, you haven't given any statistics or sample run times of any execution, so it is not possible to guess what is slow and whether it is really slow. How much data is in the result set? It might simply be that retrieving 100K result rows takes its time. If a result set of 10,000 rows is taking 5 minutes, then yes, definitely something can be looked at. So if you have a sample query, the number of rows in the result, and the time taken for a couple of executions with different WHERE conditions, post that; it will help us compare results.
BTW, do not use a CTE; just use a regular inner and outer query SELECT. Make sure tempdb is configured properly; do not leave the LDF and MDF at the default 10% growth increment. By some trial and error you will come to know how much the log and tempdb grow for a variety of range queries, and based on that you should set the initial and increment sizes of the MDF and LDF of tempdb. For the covered filtered index, the included columns should be col1, col2 and col3, not the date column, unless the date is also in the select list.
How frequently does the data in the original 35 tables get updated? If at most once per day, or if they all get updated at almost the same time, then indexed views can be a possible solution. But if the original tables get updated more than once a day, or they can get updated at any time and not on the same schedule, then do not think about indexed views.
If disk space is not an issue, as a last resort try and test performance using a trigger on each of the 35 tables. Create a new table to hold the final results you expect from this select query. Create an insert/update/delete trigger on each of the 35 tables that checks the conditions inside the trigger and, only if they match, copies the same insert/update/delete to the new table. Yes, you will need a column in the new table that identifies which data comes from which table. Because the date is a nullable column, you do not get the full advantage of an index on that column, as mostly you are looking for WHERE Date IS NULL.
If the only query you ever run against the new table is WHERE Date IS NULL, then do not even bother to create that column; just create the bit columns and the other columns col1, col2, col3, etc. If you give a real example of your query and explain the actual tables, other details can be worked out later.
The query hints or the isolation level are only going to help you if any blocking occurs.
If you don't mind dirty reads and there are locks during the execution, it could be a good idea.
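For reference, the two options mentioned in the question look like this in T-SQL (standard syntax, with Table1 standing in for any of the source tables):
-- Option 1: NOLOCK table hint on each source table
SELECT col1, col2, col3
FROM Table1 WITH (NOLOCK)
WHERE BitColumn = 1 AND DateColumn IS NULL;
-- Option 2: set the isolation level once for the whole batch
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT col1, col2, col3
FROM Table1
WHERE BitColumn = 1 AND DateColumn IS NULL;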
The key question is how much data fits the WHERE clause you need to use (WHERE BitColumn = 1 AND DateColumn IS NULL).
If the subset filtered by that is small compared with the total number of rows, then use an index on both columns, BitColumn and DateColumn, including the columns from the select clause to avoid "Page Lookup" operations in your query plan.
CREATE NONCLUSTERED INDEX IX_[Choose an IndexName]
ON TableName(BitColumn, DateColumn)
INCLUDE (col1, col2, col3)
Of course the space needed for that covered-filtered index depends on the datatype of the fields involved and the number of rows that satisfy WHERE BitColumn = 1 AND DateColumn IS NULL.
After that I recommend using a view instead of a CTE:
CREATE VIEW [Choose a ViewName]
AS
(
Select col1, col2, col3 FROM Table1 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE Some_Condition
.
.
.
)
By doing that, your query plan should look like 35 small index scans, but if most of the data satisfies the WHERE clause of your index, the performance is going to be similar to scanning the 35 source tables and the solution won't be worth it.
But you say "Every table has 5-10 bit columns and a corresponding date column...", so I think it is not going to be a good idea to make an index per bit column.
If you need to filter by using different BitColumns and different DateColumns, use a computed column in your table:
ALTER TABLE Table1 ADD ComputedFilterFlag AS
CAST(
CASE WHEN BitColumn1 = 1 AND DateColumn1 IS NULL THEN 1 ELSE 0 END +
CASE WHEN BitColumn2 = 1 AND DateColumn2 IS NULL THEN 2 ELSE 0 END +
CASE WHEN BitColumn3 = 1 AND DateColumn3 IS NULL THEN 4 ELSE 0 END
AS tinyint)
I recommend you use the value 2^(X-1) for condition X (BitColumnX = 1 AND DateColumnX IS NULL). That allows you to filter by using any combination of those criteria.
By using the value 3 you can locate all rows that satisfy both condition 1 (Bit1, Date1) and condition 2 (Bit2, Date2). Any combination of conditions has its corresponding ComputedFilterFlag value, because ComputedFilterFlag acts as a bitmap of conditions.
If you have fewer than 8 different filters you should use tinyint to save space in the index and decrease the I/O operations needed.
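As a small worked example of the bitmap idea, using the ComputedFilterFlag column defined above: condition 1 contributes 1, condition 2 contributes 2, and condition 3 contributes 4, so queries like the following find the matching rows (Table1 and col1-col3 are the placeholder names from the question):
-- Rows where condition 1 AND condition 2 hold (condition 3 may or may not):
SELECT col1, col2, col3
FROM Table1
WHERE ComputedFilterFlag IN (3, 7);
-- Rows where only condition 1 holds and nothing else:
SELECT col1, col2, col3
FROM Table1
WHERE ComputedFilterFlag = 1;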
Then use an index over the ComputedFilterFlag column:
CREATE NONCLUSTERED INDEX IX_[Choose an IndexName]
ON TableName(ComputedFilterFlag)
INCLUDE (col1, col2, col3)
And create the view:
CREATE VIEW [Choose a ViewName]
AS
(
Select col1, col2, col3 FROM Table1 WHERE ComputedFilterFlag IN [Choose the Target Filter Value set]--(1, 3, 5, 7)
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE ComputedFilterFlag IN [Choose the Target Filter Value set]--(1, 3, 5, 7)
.
.
.
)
By doing that, your index covers all the conditions and your query plan should look like 35 small index seeks.
But this is a tricky solution; maybe a refactoring of your table schema could produce simpler and faster results.
You'll never get real-time results from a UNION ALL query over many tables, but I can tell you how I got a little speed out of a similar situation. Hopefully this will help you out.
You can actually run all of them at once with a little bit of coding and ingenuity.
You create a global temporary table instead of a common table expression, and don't put any keys on the global temporary table; they will just slow things down. Then you start all the individual queries, which insert into the global temporary table, in parallel. I've done this a hundred or so times manually and it's faster than a union query because you get a query running on each CPU core. The tricky part is the mechanism to determine when the individual queries have finished; you're on your own for that piece, hence I do these manually.
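A minimal sketch of that approach, assuming the col1/col2/col3 columns from the question (the ##Results name, the SourceTable tag column and the column types are made up); each INSERT would be started from its own connection so they can run concurrently:
-- Global temp table, visible to all sessions; no keys or indexes on purpose
CREATE TABLE ##Results (
    SourceTable sysname,   -- which of the 35 tables the row came from
    col1 int,
    col2 int,
    col3 int
);
-- Run one of these per source table, each from its own session
INSERT INTO ##Results (SourceTable, col1, col2, col3)
SELECT 'Table1', col1, col2, col3
FROM Table1
WHERE BitColumn = 1 AND DateColumn IS NULL;
-- Once all loaders have finished, read the combined result
SELECT col1, col2, col3
FROM ##Results
ORDER BY col3 DESC;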

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is, this takes ages to get the data. Is there a quicker way to achieve the same in SQL?
Edit, just to be specific about the time: It takes just over 3 minutes, which is far too long for a query that is destined for a php intranet page.
Edit (11/09/2011): I've found out that the problem is the customers table. It's actually a view. It previously had good performance but now all of a sudden its performance is terrible; a simple query on it takes almost as long as the above query pair. I have therefore chosen an alternative table of customer data (that actually is a table, and not a view) and now the query pair takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
c.firstname,
c.lastname,
c.email,
a.sumhandle
from customers as c
inner join (
select accountnumber,
sum(handle) as sumhandle
from activity
where _date >= '2010-09-10' and
transaction_type = 'bet' and
accountnumber not in (
select accountnumber
from activity
where _date >= '2011-08-10' and
transaction_type = 'bet'
)
group by accountnumber
) as a
on c.accountno = a.accountnumber
I also included your first query as a sub-query instead. I'm not sure what that will do for performance. It could be better, it could be worse, you have to test on your data.
I don't know your exact business need, but rarely will someone need access to inactive accounts over several months at a moment's notice. Depending on when you purge data, this may get worse.
You could create an indexed view that contains the last transaction date for each account, something like:
select accountnumber, max(_date) as RecentTransaction
from activity
group by accountnumber
If this table gets too large, it could be partitioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
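If it helps, an index along those lines might look like the following (the index name is made up, and you would want to check the actual plan first):
create index idx_activity_date_type
on activity (_date, transaction_type)
include (accountnumber, handle)
Depending on selectivity, putting transaction_type first (the equality predicate) and _date second (the range predicate) may prune better; that is worth testing against your data.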
Answering my own question, as the problem wasn't the structure of the query but one of the tables being used. It was a view and its performance was terrible. I changed to an actual table with customer data in it and reduced the execution time to about 15 seconds.

Indexing on DateTime and VARCHAR fields in SQL Server 2000, which one is more efficient?

We have a CallLog table in Microsoft SQL Server 2000. The table contains a CallEndTime field whose type is DATETIME, and it is an indexed column.
We usually delete free-of-charge calls and generate a monthly fee statistics report and a call detail record report; all the SQL statements use CallEndTime as the query condition in the WHERE clause. Because a lot of records exist in the CallLog table, the queries are slow, so we want to optimize them, starting with indexing.
Question
Would it be more efficient to query on an extra indexed VARCHAR column such as CallEndDate?
Such as
-- DATETIME based query
SELECT COUNT(*) FROM CallLog WHERE CallEndTime BETWEEN '2011-06-01 00:00:00' AND '2011-06-30 23:59:59'
-- VARCHAR based queries
SELECT COUNT(*) FROM CallLog WHERE CallEndDate BETWEEN '2011-06-01' AND '2011-06-30'
SELECT COUNT(*) FROM CallLog WHERE CallEndDate LIKE '2011-06%'
SELECT COUNT(*) FROM CallLog WHERE CallEndMonth = '2011-06'
It has to be the datetime. Dates are essentially stored as a number in the database so it is relatively quick to see if the value is between two numbers.
If I were you, I'd consider splitting the data over multiple tables (by month, year or whatever) and creating a view to combine the data from all those tables. That way, any functionality which needs the entire data set can use the view, and anything which only needs a month's worth of data can access the specific table, which will be a lot quicker as it will contain much less data.
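A minimal sketch of that split, with made-up monthly table names; on SQL Server 2000 this is essentially the classic partitioned-view pattern:
-- One table per month, all with the same structure as CallLog
CREATE VIEW CallLogAll
AS
SELECT * FROM CallLog_201105
UNION ALL
SELECT * FROM CallLog_201106
UNION ALL
SELECT * FROM CallLog_201107
Monthly reports that only need June 2011 can then query CallLog_201106 directly instead of the view.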
I think comparing DATETIME values is much faster than using the LIKE operator.
I agree with DoctorMick on splitting your DATETIME into persisted columns Year, Month, Day.
For your query which selects COUNT(*), check whether there is a Table Lookup node in the execution plan. If so, this might be because your CallEndTime column is nullable (you said you have a [nonclustered] index on the CallEndTime column). If you make the column NOT NULL and rebuild that index, the count can be satisfied by an INDEX SCAN, which is not so slow, and I think you will get much faster results.
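A sketch of that last suggestion in SQL Server 2000 syntax; the index name IX_CallLog_CallEndTime is made up, and this assumes no existing rows have a NULL CallEndTime:
-- Make the column NOT NULL (fails if NULL values are still present)
ALTER TABLE CallLog ALTER COLUMN CallEndTime DATETIME NOT NULL
-- Rebuild the nonclustered index on CallEndTime
DBCC DBREINDEX ('CallLog', 'IX_CallLog_CallEndTime')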

Microsoft SQL Server Paging

There are a number of SQL Server paging questions on Stack Overflow and many of them talk about using ROW_NUMBER() OVER (ORDER BY ...) and a CTE. Once you get into the hundreds of thousands of rows and start adding sorting on non-primary-key values and adding custom WHERE clauses, these methods become very inefficient. I have a dataset of several million rows I am trying to page through with custom sorting and filtering, but I am getting poor performance, even with indexes on all the fields that I sort by and filter by. I even went as far as to include my SELECT columns in each of the indexes, but this barely helped and severely bloated my database.
I noticed that Stack Overflow's paging only takes about 500 milliseconds no matter what sorting criteria or page number you click on. Does anyone know how to make paging work efficiently in SQL Server 2008 with millions of rows? This would include getting the total row count as efficiently as possible.
My current query has the exact same logic as this stackoverflow question about paging:
Best paging solution using SQL Server 2005?
Anyone know how to make paging work efficiently in SQL Server 2008 with millions of rows?
If you want accurate perfect paging, there is no substitute for building an index key (position row number) for each record. However, there are alternatives.
(1) total number of pages (records)
You can use an approximation from sysindexes.rows (almost instant) assuming the rate of change is small.
You can use triggers to maintain a completely accurate, to the second, table row count
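For the sysindexes.rows approximation mentioned above, a minimal sketch (sysindexes still works on SQL Server 2008 as a compatibility view; sys.partitions is the newer equivalent):
-- Approximate row count, nearly instant
SELECT rows
FROM sysindexes
WHERE id = OBJECT_ID('dbo.tbl') AND indid < 2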
(2) paging
(a)
You can show page jumps within, say, the next five pages to either side of a record. These need to scan at most {page size} x 5 rows on each side. If your underlying query lends itself to travelling along the sort order quickly, this should not be slow. So given a record X, you can go to the previous page using the following (assuming the sort order is a asc, b desc):
select top(#pagesize) t.*
from tbl x
inner join tbl t on (t.a = x.a and t.b > x.b) OR
                    (t.a < x.a)
where x.id = #X
order by t.a desc, t.b asc -- reversed sort so the rows nearest to X come first; re-sort ascending for display
(i.e. the last {page size} of records prior to X)
To go five pages back, you increase it to TOP(#pagesize*5) and then take a further TOP(#pagesize) from that subquery.
Downside: with this option you cannot jump directly to a particular location; your options are only FIRST (easy), LAST (easy), NEXT/PRIOR, and up to 5 pages to either side.
(b)
If the paging is always going to be quite specific and predictable, maintain an INDEXED view or trigger-updated table that does not contain gaps in the row number. This may be an option if the tables normally only see updates at one end of the spectrum, with gaps from deletes quickly filled by shifting not-so-many records.
This approach gives you a rowcount (last row) and also direct access to any page.
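A sketch of paging against such a gap-free helper table; PagingIndex(RowNum, Id) is a made-up trigger-maintained table whose RowNum has no gaps, and #page / #pagesize follow the placeholder style used above:
select t.*
from PagingIndex p
inner join tbl t on t.id = p.Id
where p.RowNum between ((#page - 1) * #pagesize) + 1 and #page * #pagesize
order by p.RowNum
The last RowNum in PagingIndex also gives you the total row count directly.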
Try this; let's say you have a Country table as below:
DECLARE @pageIndex INT = 0;
DECLARE @pageSize INT = 10;
DECLARE @sortByColumn NVARCHAR(200) = 'Code';
DECLARE @sortByDesc BIT = 0;
;WITH tbl AS (
SELECT COUNT(c.Id) OVER() AS [RowTotal], c.Id, c.Code, c.Name
FROM dbo.[Country] c
ORDER BY
CASE WHEN @sortByColumn='Code' AND @sortByDesc=0 THEN c.Code END ASC,
CASE WHEN @sortByColumn='Code' AND @sortByDesc<>0 THEN c.Code END DESC,
CASE WHEN @sortByColumn='Name' AND @sortByDesc=0 THEN c.Name END ASC,
CASE WHEN @sortByColumn='Name' AND @sortByDesc<>0 THEN c.Name END DESC,
c.Name ASC --DEFAULT SORTING ORDER
OFFSET @pageIndex*@pageSize ROWS
FETCH NEXT @pageSize ROWS ONLY
) SELECT (@pageIndex*@pageSize) + ROW_NUMBER() OVER(ORDER BY Id) AS [RowNo], * FROM tbl;
