Snowflake query pruning by Column - snowflake-cloud-data-platform

in the Snowflake Docs it says:
First, prune micro-partitions that are not needed for the query.
Then, prune by column within the remaining micro-partitions.
What is meant with the second step?
Let's take the example table t1 shown in the link. In this example table I use the following query:
SELECT * FROM t1
WHERE
Date = ‚11/3‘ AND
Name = ‚C‘
Because of the Date = ‚11/3‘ it would only scan micro partitions 2, 3 and 4. Because of the Name = 'C' it can prune even more and only scan micro-partions 2 and 4.
So in the end only micro-partitions 2 and 4 would be scanned.
But where does the second step come into play? What is meant with prune by column within the remaining micro partitions?
Does it mean, that only rows 4, 5 and 6 on micro-partition 2 and row 1 on micro-partition 4 are scanned, because date is my clustering key and is sorted so you can prune even further with the date?
So in the end only 4 rows would be scanned?

But where does the second step come into play? What is meant with prune by column within the remaining micro partitions?
Benefits of Micro-partitioning:
Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
It is recommended to avoid SELECT * and specify required columns explicitly.

It simply means to only select the columns that are required for the query. So in your example it would be:
SELECT col_1, col_2 FROM t1
WHERE
Date = ‚11/3‘ AND
Name = ‚C‘

Related

SQL join running slow

I have 2 sql queries doing the same thing, first query takes 13 sec to execute while second takes 1 sec to execute. Any reason why ?
Not necessary all the ids in ProcessMessages will have data in ProcessMessageDetails
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Where Id = 4 and Isdone = 0 )
I have clusterd index on t1.processMessageId(Pk) and non clusterd index on t2.processMessageId (FK)
I would need the actual execution plans to tell you exactly what SqlServer is doing behind the scenes. I can tell you these queries aren't doing the exact same thing.
The first query is going through and finding all of the items that meet the conditions for t1 and finding all of the items for t2 and then finding which ones match and joining them together.
The second one is saying first find all of the items that are meet my criteria from t1, and then find the items in t2 that have one of these IDs.
Depending on your statistics, available indexes, hardware, table sizes: Sql Server may decide to do different types of scans or seeks to pick data for each part of the query, and it also may decide to join together data in a certain way.
The answer to your question is really simple the first query which have used will generate more number of rows as compared to the second query so it will take more time to search those many rows that's the reason your first query took 13 seconds and the second one to only one second
So it is generally suggested that you should apply your conditions before making your join or else your number of rows will increase and then you will require more time to search those many rows when joined.

Is "offset-fetch and order by" ordering the whole table or partial table?

In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server work? What strategy does SQL server use?
1. Do a sorting on the whole table first and then select the 1 million rows we need
2. Do a sorting on partial table and then return the 1 million rows we need.
I assume it is 2nd option. If so, how does SQL server decide which range of the table to be sorted?
Edit 1:
I am asking this question to understand what could cause the query slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found the second query can take me about 30 minutes to finish while the first takes almost 0 second.
So I am curious on what causes this difference? If the two have same time used for order by (or does it even really do a sorting on the whole table? The id is the clustered indexed column of the table. I cannot imagine that it takes 0 second to finish sorting on a terabyte table.)
Then if the sorting takes same time, only difference would be the clustered-index scan. For first query, it only needs to scan first 1 or 10 (a small number) of rows. While for the second query, it needs to scan a much bigger number of rows ( >1000000000 ). But I am not quite sure if this is correct.
Thank you for your help!
Let me take a simple example..
order by id
offset 50 rows fetch 25 rows only
For the above query,the steps would be
1.Table should be sorted by id (if not pay penalty of sort,there is no partial sort,always a full sort)
2.Then scan 50+25 rows(paying cost of 75 rows) and return 25 rows only..
Below is an example of orders table i have(orderid is Pk,so sorted),you can see even though, we are getting only 20 rows ,you are paying cost of 120 rows...
Coming to your question,there is no partial sort (Which implies first option regarding sort only),even you try to return one row like below..
select top 1* from table
order by orderid

How to sort comma delimited column in SQL

I have a database table with a Value column that contains comma delimited strings and I need to order on the Value column with a SQL query:
Column ID Value
1 ff, yy, bb, ii
2 kk, aa, ee
3 dd
4 cc, zz
If I use a simple query and order by Value ASC, it would result in an order of Column ID: 4, 3, 1, 2, but the desired result is Column ID: 2, 1, 4, 3, since Column ID '2' contains aa, '1' contains bb, and etc.
By the same token, order by Value DESC would result in an order of Column ID: 2, 1, 3, 4, but the desired result is Column ID: 4, 1, 2, 3.
My initial thought is to have two additional columns, 'Lowest Value' and 'Highest Value', and within the query I can order on either 'Lowest Value' and 'Highest value' depending on the sort order. But I am not quite sure how to sort the highest and lowest in each row and insert them into the appropriate columns. Or is there another solution without the use of those two additional columns within the sql statement? I'm not that proficient in sql query, so thanks for your assistance.
Best solution is not to store a single comma separated value at all. Instead have a detail table Values which can have multiple rows per ID, with one value each.
If you have the possibility to alter the data structure (and by your own suggestion of adding columns, it seems you have), I would choose that solution.
To sort by lowest value, you could then write a query similar to the one below.
select
t.ID
from
YourTable t
left join ValueTable v on v.ID = t.ID
group by
t.ID
order by
min(v.Value)
But this structure also allows you to write other, more advanced queries. For instance, this structure makes it easier and more efficient to check if a row matches a specific value, because you don't have to parse the list of values every time, and separate values can be indexed better.
string / array splitting (and creation, for that matter) is covered quite extensively. You might want to have a read of this, one of the best articles out there covering a comparison of the popular methods. Once you have the values the rest is easy.
http://sqlperformance.com/2012/07/t-sql-queries/split-strings
Funnily enough I did something like this just the other week in a cross-applied table function to do some data cleansing, improving performance 8 fold over the looped version in place.

inserting an artificial column in mdx query

from some reasons I need to insert an artificial(dummy) column into a mdx expression. (the reason is that i need to obtain a query with specific number of columns )
to ilustrate, this is my sample query:
SELECT {[Measures].[AFR],[Measures].[IB],[Measures].[IC All],[Measures].[IC_without_material],[Measures].[Nonconformance_PO],[Measures].[Nonconformance_GPT],[Measures].[PM_GPT_Weighted_Targets],[Measures].[PM_PO_Weighted_Targets], [Measures].[AVG_LC_Costs],[Measures].[AVG_MC_Costs]} ON COLUMNS,
([dim_ProductModel].[PLA].&[SME])
* ORDER( {([dim_ProductModel].[Warranty Group].children)} , ([Measures].[Nonconformance_GPT],[Dim_Date].[Date Full].&[2014-01-01]) ,desc)
* ([dim_ProductModel].[PLA Text].members - [dim_ProductModel].[PLA Text].[All])
* {[Dim_Date].[Date Full].&[2013-01-01]:[Dim_Date].[Date Full].&[2014-01-01]} ON ROWS
FROM [cub_dashboard_spares]
it is not very important, just some measures and crossjoined dimensions. Now I would need to add f.e. 2 extra columns, I don't care whether this would be a measure with null/0 values or another crossjoined dimension. Can I do this in some easy way without inserting any data into my cube?
In sql I can just write Select 0 or select "dummy1", but here it is not possible neither in ON ROWS nor in ON COLUMNS part of the query.
Thank you very much for your help,
Regards,
Peter
ps: so far I could just insert some measure more times, but I am interested whether there is a possibility to insert really "dummy" column
Your query just has the measures dimension on columns. The easiest way to extend it by some columns would be to repeat the last measure as many times that you get the correct number of columns.
Another possibility, which may be more efficient in case the last measure is complex to calculate would be to use
WITH member Measures.dummy as NULL
SELECT {[Measures].[AFR],[Measures].[IB],[Measures].[IC All],[Measures].[IC_without_material],[Measures].[Nonconformance_PO],[Measures].[Nonconformance_GPT],[Measures].[PM_GPT_Weighted_Targets],[Measures].[PM_PO_Weighted_Targets], [Measures].[AVG_LC_Costs],[Measures].[AVG_MC_Costs],
Measures.dummy, Measures.dummy, Measures.dummy
}
ON COLUMNS,
([dim_ProductModel].[PLA].&[SME])
* ORDER( {([dim_ProductModel].[Warranty Group].children)} , ([Measures].[Nonconformance_GPT],[Dim_Date].[Date Full].&[2014-01-01]) ,desc)
* ([dim_ProductModel].[PLA Text].members - [dim_ProductModel].[PLA Text].[All])
* {[Dim_Date].[Date Full].&[2013-01-01]:[Dim_Date].[Date Full].&[2014-01-01]}
ON ROWS
FROM [cub_dashboard_spares]
i. e. adding a dummy measure that should not need much computation as many times as you need it to the end of the columns.

Using a sort order column in a database table

Let's say I have a Product table in a shopping site's database to keep description, price, etc of store's products. What is the most efficient way to make my client able to re-order these products?
I create an Order column (integer) to use for sorting records but that gives me some headaches regarding performance due to the primitive methods I use to change the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was creating a procedure which checks whether there is a record in the target order (3) and updates the order of the row (ID=26) if not. If there is a record in target order the procedure executes itself sending that row's ID with target order + 1 as parameters.
That causes to update every single record after the one I want to change to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the order column of an item to be enough for sorting with no secondary keys involved. Order column alone must specify a unique place for its record.
In addition to all, I wonder if I can implement something like of a linked list: A 'Next' column instead of an 'Order' column to keep the next items ID. But I have no idea how to write the query that retrieves the records with correct order. If anyone has an idea about this approach as well, please share.
Update product set order = order+1 where order >= #value changed
Though over time you'll get larger and larger "spaces" in your order but it will still "sort"
This will add 1 to the value being changed and every value after it in one statement, but the above statement is still true. larger and larger "spaces" will form in your order possibly getting to the point of exceeding an INT value.
Alternate solution given desire for no spaces:
Imagine a procedure for: UpdateSortOrder with parameters of #NewOrderVal, #IDToChange,#OriginalOrderVal
Two step process depending if new/old order is moving up or down the sort.
If #NewOrderVal < #OriginalOrderVal --Moving down chain
--Create space for the movement; no point in changing the original
Update product set order = order+1
where order BETWEEN #NewOrderVal and #OriginalOrderVal-1;
end if
If #NewOrderVal > #OriginalOrderVal --Moving up chain
--Create space for the momvement; no point in changing the original
Update product set order = order-1
where order between #OriginalOrderVal+1 and #NewOrderVal
end if
--Finally update the one we moved to correct value
update product set order = #newOrderVal where ID=#IDToChange;
Regarding best practice; most environments I've been in typically want something grouped by category and sorted alphabetically or based on "popularity on sale" thus negating the need to provide a user defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE #id int = 26
DECLARE #sortOrder int = 3
;WITH Sorted AS (
SELECT Id,
ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
FROM Product
WHERE Id <> #id
)
UPDATE p
SET p.SortOrder =
(CASE
WHEN p.Id = #id THEN #sortOrder
WHEN s.RowNumber >= #sortOrder THEN s.RowNumber + 1
ELSE s.RowNumber
END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
It is very simple. You need to have "cardinality hole".
Structure: you need to have 2 columns:
pk = 32bit int
order = 64bit bigint (BIGINT, NOT DOUBLE!!!)
Insert/UpdateL
When you insert first new record you must set order = round(max_bigint / 2).
If you insert at the beginning of the table, you must set order = round("order of first record" / 2)
If you insert at the end of the table, you must set order = round("max_bigint - order of last record" / 2)
If you insert in the middle, you must set order = round("order of record before - order of record after" / 2)
This method has a very big cardinality. If you have constraint error or if you think what you have small cardinality you can rebuild order column (normalize).
In maximality situation with normalization (with this structure) you can have "cardinality hole" in 32 bit.
It is very simple and fast!
Remember NO DOUBLE!!! Only INT - order is precision value!
One solution I have used in the past, with some success, is to use a 'weight' instead of 'order'. Weight being the obvious, the heavier an item (ie: the lower the number) sinks to the bottom, the lighter (higher the number) rises to the top.
In the event I have multiple items with the same weight, I assume they are of the same importance and I order them alphabetically.
This means your SQL will look something like this:
ORDER BY 'weight', 'itemName'
hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a link-list kind of method that will be ordered on the client (not the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing

Resources