COUNT(*) vs COUNT(column) Performance in Snowflake

COUNT(*) vs COUNT(column) Performance in Snowflake - snowflake-cloud-data-platform

Since SnowFlake is a columnar database, does it impact performance when you use COUNT(*) vs COUNT(column)? And this is assuming that the column that you're referencing does NOT have any NULLs

As a_horse_with_no_name explained these two functions are different but you already mentioned that the column has no NULL values. So they should return the same result in your case.
More important thing is, Snowflake has a special optimization for the COUNT function. As far I see, it does NOT impact performance if you use COUNT(*) or COUNT(column), even when the column contains NULL values! For both of them, Snowflake uses METADATA statistics, so it does not actually count rows.
You can test it with SNOWFLAKE_SAMPLE_DATA:
select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709
select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709
Both queries will return a result immediately although the table size is about 170G, and contain more than 5G rows.
I have to add this extra information because of the conversation between Niru and a_horse_with_no_name. a_horse_with_no_name said:
Even if all columns of a row are NULL, count(*) should include that row in the result. If it doesn't this is a clear violation of the SQL standard
I'm not sure about the SQL standard but when you use COUNT(*), Snowflake doesn't check if the columns are NULL or not (as you expected). I can see why Niru misunderstood the documents, the docs and the samples should be improved.
If you run my sample queries, you will see that they are completed in milliseconds. We are talking about counting almost 6 billion rows:
select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds
select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds
But if I do a little modification on the query, it takes about 3 minutes on the same warehouse (XSMALL):
select count(t.*) from sample_data.TPCH_SF1000.LINEITEM t;
-- completes in 3 minutes!?
Here is the trick:
Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs.
https://docs.snowflake.com/en/sql-reference/functions/count.html#arguments
Only if you use alias.* (like I used t.* in my sample), Snowflake will check if all columns are null when producing the count. This is why it is much slower, and this is why there shouldn't be any performance issues when you are running COUNT(XYZ) or COUNT(*) on a table.

Here is the snowflake doc.. hope it helps
https://docs.snowflake.com/en/sql-reference/functions/count.html Please refer to snowflake document.. it does effect count(alias.*) will check the each column in the row where as count(column) do null check only on that column..

Related

How to force reasonable execution plan for query with LIKE statement?

When creating ad-hoc queries to look for information in a table I have run into this issue over and over.
Let's say I have a table with a million records with fields id - int, createddatetime - timestamp, category - varchar(50) and content - varchar(max). I want to find all records in the last day that have a certain string in the content field. If I create a query like this...
select *
from table
where createddatetime > '2018-1-31'
and content like '%something%'
it may complete in a second because in the last day there may only be 100 records so the LIKE clause is only operating on a small number of records
However if I add one more item to the where clause...
select *
from table
where createddatetime > '2018-1-31'
and content like '%something%'
and category = 'testing'
then it could take many minutes to complete while locking up the table.
It appears to be changing from performing all the straight forward WHERE clause items first and then the LIKE on the limited set of records, over to having the LIKE clause first. There are even times where there are multiple LIKE statements and adding one more causes the query to go from a split second to minutes.
The only solutions I've found are to either generate an intermediate table (maybe temp tables would work), insert records based on the basic WHERE clause items, then run a separate query to filter by one or more LIKE statements. I've tried various JOIN and CTE approaches which usually have no improvement. Alternatively CHARINDEX also appears to work though difficult to use if trying to convert the logic of multiple LIKE statements.
Is there any hint or something that can be placed in the query statement to tell sql server to wait until records are filtered by the basic WHERE clause items before filtering by the LIKE?
I actually just tried this approach and it had the same issue...
select *
from (
select *, charindex('something', content) as found
from bounce
where createddatetime > '2018-1-31'
) t
where found > 0
while the subquery independently returns in a couple seconds, the overall query just never returns. Why is this so bad

Not fancy, but I've had better luck with temp tables than nested select statements... It will isolate the first data set, and then you can select just from that. If you're looking for quick and dirty, which usually serves my purposes for ad-hoc, this may help. If this is a permanent stored proc, the indexing suggestions may serve you better in the long run.
select *
into #like
from table
where createddatetime > '2018-1-31'
and content like '%something%'
select *
from #like
where category = 'testing'

SQL Server Performance With Large Query

Hi everyone I have a couple of queries for some reports in which each query is pulling Data from 35+ tables. Each Table has almost 100K records. All the Queries are Union ALL for Example
;With CTE
AS
(
Select col1, col2, col3 FROM Table1 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table3 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table4 WHERE Some_Condition
.
.
. And so on
)
SELECT col1, col2, col3 FROM CTE
ORDER BY col3 DESC
So far I have only tested this query on Dev Server and I can see It takes its time to get the results. All of these 35+ tables are not related with each other and this is the only way I can think of to get all the Desired Data in result set.
Is there a better way to do this kind of query ??
If this is the only way to go for this kind of query how can I
improve the performance for this Query by making any changes if
possible??
My Opinion
I Dont mind having a few dirty-reads in this report. I was thinking of using Query hints with nolock or Transaction Isolation Level set to READ UNCOMMITED.
Will any of this help ???
Edit
Every Table has 5-10 Bit columns and a Corresponding Date column to each Bit Column and my condition for each SELECT Statement is something like
WHERE BitColumn = 1 AND DateColumn IS NULL
Suggestion By Peers
Filtered Index
CREATE NONCLUSTERED INDEX IX_Table_Column
ON TableName(BitColumn)
WHERE BitColum = 1
Filtered Index with Included Column
CREATE NONCLUSTERED INDEX fIX_IX_Table_Column
ON TableName(BitColumn)
INCLUDE (DateColumn)
WHERE DateColumn IS NULL
Is this the best way to go ? or any suggestions please ???

There are lots of things that can be done to make it faster.
If I assume you need to do these UNIONs, then you can speed up the query by :
Caching the results, for example,
Can you create an indexed view from the whole statement ? Or there are lots of different WHERE conditions, so there'd be lots of indexed views ? But know that this will slow down modifications (INSERT, etc.) for those tables
Can you cache it in a different way ? Maybe in the mid layer ?
Can it be recalculated in advance ?
Make a covering index. Leading columns are columns form WHERE and then all other columns from the query as included columns
Note that a covering index can be also filtered but filtered index isn't used if the WHERE in the query will have variables / parameters and they can potentially have the value that is not covered by the filtered index (i.e., the row isn't covered)
ORDER BY will cause sort
If you can cache it, then it's fine - no sort will be needed (it's cached sorted)
Otherwise, sort is CPU bound (and I/O bound if not in memory). To speed it up, do you use fast collation ? The performance difference between the slowest and fastest collation can be even 3 times. For example, SQL_EBCDIC280_CP1_CS_AS, SQL_Latin1_General_CP1251_CS_AS, SQL_Latin1_General_CP1_CI_AS are one of the fastest collations. However, it's hard to make recommendations if I don't know the collation characteristics you need
Network
'network packet size' for the connection that does the SELECT should be the maximum value possible - 32,767 bytes if the result set (number of rows) will be big. This can be set on the client side, e.g., if you use .NET and SqlConnection in the connection string. This will minimize CPU overhead when sending data from the SQL Server and will improve performance on both side - client and server. This can boost performance even by tens of percents if the network was the bottleneck
Use shared memory endpoint if the client is on the SQL Server; otherwise TCP/IP for the best performance
General things
As you said, using isolation level read uncommmitted will improve the performance
...
Probably you can't do changes beyond rewriting the query, etc. but just in case, adding more memory in case it isn't sufficient now, or using SQL Server 2014 in memory features :-), ... would surely help.
There are way too many things that could be tuned but it's hard to point out the key ones if the question isn't very specific.
Hope this helps a bit

well you haven't give any statistics or sample run time of any execution so it is not possible to guess what is slow and is it really slow. how much data is in the result set? it might be just retrieving 100K rows as in result is just taking its time. if the result set of 10000 rows is taking 5 minute, yes definitely something can be looked at. so if you have sample query, number of rows in result and how much time it took for couple of execution with different where conditions, post that. it will help us to compare results.
BTW, do not use CTE just use regular inner and outer query select. make sure the Temp DB is configured properly. LDF and MDF is not default configured for 10% increase. by certain try and error you will come to know how much log and temp DB is increased for verity of range queries and based on that you should set the initial and increment size of the MDF and LDF of temp DB. for the Covered filter index the include column should be col1, col2 and co3 not column Date unless Date is also in select list.
how frequently the data in original 35 tables get updated? if max once per day or if they all get updates almost same time then Indexed-Views can be a possible solution. but if original tables getting updates more than once a day or they gets updates anytime and no where they are in same line then do no think about Indexed-View.
if disk space is not an issue as a last resort try and test performance using trigger on each 35 table. create new table to hold final results as you are expecting from this select query. create insert/update/delete trigger on each 35 table where you check the conditions inside trigger and if yes then only copy the same insert/update/delete to new table. yes you will need a column in new table that identifies which data coming from which table. because Date is Null-Able column you do not get full advantage of Index on that Column as "mostly you are looking for WHERE Date is NULL".
in the new Table only query you always do is where Date is NULL then do not even bother to create that column just create BIT columns and other col1, col2, col3 etc... if you give real example of your query and explain the actual tables, other details can be workout later.

The query hints or the Isolation Level are only going to help you in case of any blocking occurs.
If you dont mind dirty reads and there are locks during the execution it could be a good idea.
The key question is how many data fits the Where clausule you need to use (WHERE BitColumn = 1 AND DateColumn IS NULL)
If the subset filtered by that is small compared with the total number of rows, then use an index on both columns, BitColum and DateColumn, including the columns in the select clausule to avoid "Page Lookup" operations in your query plan.
CREATE NONCLUSTERED INDEX IX_[Choose an IndexName]
ON TableName(BitColumn, DateColumn)
INCLUDE (col1, col2, col3)
Of course the space needed for that covered-filtered index depends on the datatype of the fields involved and the number of rows that satisfy WHERE BitColumn = 1 AND DateColumn IS NULL.
After that I recomend to use a View instead of a CTE:
CREATE VIEW [Choose a ViewName]
AS
(
Select col1, col2, col3 FROM Table1 WHERE Some_Condition
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE Some_Condition
.
.
.
)
By doing that, your query plan should look like 35 small index scans, but if most of the data satisfies the where clausule of your index, the performance is going to be similar to scan the 35 source tables and the solution won't worth it.
But You say "Every Table has 5-10 Bit columns and a Corresponding Date column.." then I think is not going to be a good idea to make an index per bit colum.
If you need to filter by using diferent BitColums and Different DateColums, use a compute column in your table:
ALTER TABLE Table1 ADD ComputedFilterFlag AS
CAST(
CASE WHEN BitColum1 = 1 AND DateColumn1 IS NULL THEN 1 ELSE 0 END +
CASE WHEN BitColum2 = 1 AND DateColumn2 IS NULL THEN 2 ELSE 0 END +
CASE WHEN BitColum3 = 1 AND DateColumn3 IS NULL THEN 4 ELSE 0 END
AS tinyint)
I recomend you use the value 2^(X-1) for conditionX(BitColumnX=1 and DateColumnX IS NOT NULL). It is going to allow you to filter by using any combination of that criteria.
By using value 3 you can locate all rows that accomplish: Bit1, Date1 and Bit2, Date2 condition. Any condition combination has its corresponding ComputedFilterFlag value because the ComputedFilterFlag acts as a bitmap of conditions.
If you heve less than 8 diferents filters you should use tinyint to save space in the index and decrease the IO operations needed.
Then use an Index over ComputedFilterFlag colum:
CREATE NONCLUSTERED INDEX IX_[Choose an IndexName]
ON TableName(ComputedFilterFlag)
INCLUDE (col1, col2, col3)
And create the view:
CREATE VIEW [Choose a ViewName]
AS
(
Select col1, col2, col3 FROM Table1 WHERE ComputedFilterFlag IN [Choose the Target Filter Value set]--(1, 3, 5, 7)
UNION ALL
Select col1, col2, col3 FROM Table2 WHERE ComputedFilterFlag IN [Choose the Target Filter Value set]--(1, 3, 5, 7)
.
.
.
)
By doing that, your index coveres all the conditions and your query plan should look like 35 small index seeks.
But this is a tricky solution, may be a refactoring in your table schema could produce simpler and faster results.

You'll never get real time results from a union all query over many tables but I can tell you how I got a little speed out of a similar situation. Hopefully this will help you out.
You can actually run all of them at once with a little bit coding and ingenuity.
You create a global temporary table instead of a common table expression and don't put any keys on the global temporary table it will just slow things down. Then you start all the individual queries which insert into the global temporary table. I've done this a hundred or so times manually and it's faster than a union query because you get a query running on each cpu core. The tricky part is the mechanism to determine when the individual queries have finished your on your own for that piece hence I do these manually.

Mixing indexed and calculated fields in a table-valued function

I work with SQL Server 2008, but can use a later version if it would matter.
I have 2 tables with pretty similar data about some people but in different formats (no intersections between these 2 sets of people).
Table 1:
int personID
bit IsOldPerson //this field is indexed
Table 2:
int PersonID
int Age
I want to have a combined view that has the same structure as the Table 1. So I write the following script (a simplified version):
CREATE FUNCTION CombinedView(#date date)
RETURNS TABLE
AS
RETURN
select personID as PID, IsOldPerson as IOP
from Table1
union all
select personID as PID, dbo.CheckIfOld(Age,#date) as IOP
from Table2
GO
The function "CheckIfOld" returns yes/no depending on the input age at the date #date.
So I have 2 questions here:
A. if I try select * from CombinedView(TODAY) where IOP=true, whether the SQL Server will do the following separately: 1) for the Table 1 use the index for the field IsOldPerson and do a "clever" index-based selection of results; 2) for the Table 2 calculate CheckIfOld for all the rows and during the calculation pick up or rejecting rows on the row-by-row basis ?
B. how can I check the execution plan in this particular case to understand whether my guess in the question (A) is correct or not?
Any help is greatly appreciated! Thanks!

Yes, if the query isn't too complex, the query optimizer should "see through" the view into its constituent UNION-ed SELECT statements, evaluate them separately, and concatenate the results. If there is an index on Table1, it should be able to use it. I tested this using tables we had and the same function concepts you presented. I reviewed the query plans of the raw SELECT to Table1 and the SELECT to the inline table-valued function with the UNION and the portion of the query plan relevant to Table1 was the same-- and it used the index.
Now if performance is a concern, I suggest you do one of two things:
If (a) Table2 is read-heavy rather than write-heavy, (b) you have the space, and (c) you can write CheckIfOld as a single CASE statement (as its name and context in your question implies), then you should consider creating a persisted calculated field in Table2 with the calculation from IsOldPerson and applying an index to it.
If Table2 is write-heavy, or you have no space for additional fields, you should at least consider converting CheckIfOld into an inline function. You will likely reap performance gains, depending on how it is used. In your case, it would be used like this:
select personID as PID, IOP.IsOldPerson from Table2 CROSS APPLY dbo.CheckIfOld(Age,#date) AS IOP

MAX keyword taking a lot of time to select a value from a column

Well, I have a table which is 40,000,000+ records but when I try to execute a simple query, it takes ~3 min to finish execution. Since I am using the same query in my c# solution, which it needs to execute over 100+ times, the overall performance of the solution is deeply hit.
This is the query that I am using in a proc
DECLARE #Id bigint
SELECT #Id = MAX(ExecutionID) from ExecutionLog where TestID=50881
select #Id
Any help to improve the performance would be great. Thanks.

What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query is filtering by TestID, so this needs to be the primary column in the composite index: if you have no indexes on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as those you'd find in the back of a big book that are hierarchial and multi-level. If you were looking for something, then you'd manually look under 'T' for TestID then there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes, then you'll find it useful to review all the queries that hit the table, and ensure that one of those indexes is a clustered index (rather than non-clustered).

Try to re-work everything into something that works in a set based manner.
So, for instance, you could write a select statement like this:
;With OrderedLogs as (
Select ExecutionID,TestID,
ROW_NUMBER() OVER (PARTITION BY TestID ORDER By ExecutionID desc) as rn
from ExecutionLog
)
select * from OrderedLogs where rn = 1 and TestID in (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable/temp table, but hopefully, instead, you can continue building up a larger, single, query, that processes all of the results in parallel.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.

Use of SET ROWCOUNT in SQL Server - Limiting result set

I have a sql statement that consists of multiple SELECT statements. I want to limit the total number of rows coming back to let's say 1000 rows. I thought that using the SET ROWCOUNT 1000 directive would do this...but it does not. For example:
SET ROWCOUNT 1000
select orderId from TableA
select name from TableB
My initial thought was that SET ROWCOUNT would apply to the entire batch, not the individual statements within it. The behavior I'm seeing is it will limit the first select to 1000 and then the second one to 1000 for a total of 2000 rows returned. Is there any way to have the 1000 limit applied to the batch as a whole?

Not in one statement. You're going to have to subtract ##ROWCOUNT from the total rows you want after each statement, and use a variable (say, "#RowsLeft") to store the remaining rows you want. You can then SELECT TOP #RowsLeft from each individual query...

And how would you ever see any records from the second query if the first always returns more than 1000 if you were able to do this in a batch?
If the queries are simliar enough you could try to do this through a union and use the rowcount on that as it would only be one query at that point. If the queries are differnt in the columns returned I'm not sure what you would get by limiting the entire group to 1000 rows because the meanings would be different. From a user perspective I'd rather consistently get 500 orders and 500 customer names than 998 orers and 2 names one day and 210 orders and 790 names the next. It would be impossible to use the application especially if you happened to be most interested in the information in the second query.

Use TOP not ROWCOUNT
http://msdn.microsoft.com/en-us/library/ms189463.aspx
You trying to get 1000 rows MAX from all tables right?
I think other methods may fill up with from the top queries first, and you may never get results from the lower ones.

The requirement sounds odd. Unless you are unioning or joining the data from the two selects, to consider them as one so that you apply a max rows simply does not make sense, since they are unrelated queries at that point. If you really need to do this, try:
select top 1000 from (
select orderId, null as name, 'TableA' as Source from TableA
union all
select null as orderID, name, 'TableB' as Source from TableB
) a order by Source

SET ROWCOUNT applies to each individual query. In your given example, it's applied twice, once to each SELECT statement, since each statement is its own batch (they're not grouped or unioned or anything, and so execute completely separately).
#RedFilter's approach seems the most likely to give you what you want.

Untested and doesn't make use of ROWCOUNT, but could give you an idea?
Assumes col1 in TableA and TableB are the same type.
SELECT TOP 1000 *
FROM (select orderId
from TableA
UNION ALL
select name from TableB) t

The following worked for me:
CREATE PROCEDURE selectTopN
(
#numberOfRecords int
)
AS
SELECT TOP (#numberOfRecords) * FROM Customers
GO

this is your solution :
TOP (Transact-SQL)
and about ##RowCount you can read this Link :
SET ROWCOUNT (Transact-SQL)
Important
Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL).
I think two way will work.!

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight