Snowflake zero copy clone billing - snowflake-cloud-data-platform

From the Snowflake documentation on clone storage usage:
Because every table within a clone group has an independent life-cycle, ownership of the storage within these tables sometimes needs to be transferred to a different table within the clone group. For example, consider a clone group that consists of:
T1 >> T2 >> T3
T1 has 10M data, (p0=5M + p1=5M)
T1 and T2 share 5M data (Partition1 -> p1)
T2 has 15M data, (p1 + p2=5M + p3=5M)
T2 and T3 share 10M (p1 + p2) data
T3 has 12M data, (p1 + p2 + p4=2M)
T3 and T2 share 10M data in total. (p1 + p2)
Assume the Time Travel window is zero and T2 is dropped:
Is p1 still under T1's ownership and referenced by T3?
Is ownership of p2 transferred from T2 to T3?
And is the total storage usage after T2 is dropped
p0 + p1 + p2 + p4 = 5 + 5 + 5 + 2 = 17M?

EDIT: Updated the answer after further research.
It does seem that ownership of micro-partitions is transferred once the Time Travel window (0 in this case) has passed, so I believe you are correct:
p0 and p1 will still be owned by T1
p2 and p4 will be owned by T3
Your storage math looks correct; the total should be 17M.
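The arithmetic above can be checked with a small sketch. It only models the reading given in this answer: each micro-partition is billed once, to the oldest surviving table that references it. That rule, the helper names `owners` and `total_storage`, and the 5M size assumed for p3 (from the question's T2 listing) are illustrative assumptions, not Snowflake APIs.

```python
# Partition sizes in M, taken from the question (p3 is T2's private partition).
sizes = {"p0": 5, "p1": 5, "p2": 5, "p3": 5, "p4": 2}

# Tables in clone order (T1 >> T2 >> T3) with the partitions each references.
tables = {
    "T1": ["p0", "p1"],
    "T2": ["p1", "p2", "p3"],
    "T3": ["p1", "p2", "p4"],
}

def owners(clone_group):
    """Assign each partition to the oldest table that still references it."""
    owner = {}
    for name, parts in clone_group.items():  # dicts keep insertion (clone) order
        for part in parts:
            owner.setdefault(part, name)
    return owner

def total_storage(clone_group):
    """Each partition is counted once, however many tables reference it."""
    return sum(sizes[part] for part in owners(clone_group))

before = total_storage(tables)

# Drop T2 with a zero Time Travel window: its references disappear at once.
after_drop = {name: parts for name, parts in tables.items() if name != "T2"}
owner_after = owners(after_drop)
after = total_storage(after_drop)

print(before, after, owner_after)
```

In this model p3 is released because no surviving table references it, and p2 moves to T3 because T3 is the only table left that still points at it.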
See Owned Storage vs Referenced Storage documentation
Related: Storage metrics don't appear to change and will continue to point at the original table even if it gets dropped or renamed.
See Table Storage Metrics usage notes
Storage bytes are always owned by, and therefore billed to, the table
where the bytes were initially added. If the table is then cloned,
storage metrics for these initial bytes never transfer to the clones,
even if the bytes are deleted from the source table.

Related

Is there a way to remove allocated pages that are not associated with a table in SQL Server?

We have an "mdf" file for a database that has grown enough to fill up a 200GB disk on our SQL Server Windows machine. I used the query shown below to list the tables in that database with their "reserved" sizes. The result with the largest size is the row whose table and schema names are both "NULL".
How can I access, review, and eventually remove these allocated pages from the database in order to shrink the "mdf" file's footprint on the disk?
Query (thanks to https://dba.stackexchange.com/users/30859/solomon-rutzky):
use temp_db;
SELECT sch.[name], obj.[name], ISNULL(obj.[type_desc], N'TOTAL:') AS [type_desc],
COUNT(*) AS [ReservedPages],
(COUNT(*) * 8) AS [ReservedKB],
(COUNT(*) * 8) / 1024.0 AS [ReservedMB],
(COUNT(*) * 8) / 1024.0 / 1024.0 AS [ReservedGB]
FROM sys.dm_db_database_page_allocations(DB_ID(), NULL, NULL, NULL, DEFAULT) pa
INNER JOIN sys.all_objects obj
ON obj.[object_id] = pa.[object_id]
INNER JOIN sys.schemas sch
ON sch.[schema_id] = obj.[schema_id]
GROUP BY GROUPING SETS ((sch.[name], obj.[name], obj.[type_desc]), ())
ORDER BY [ReservedPages] DESC;
Results (top 3 rows plus header, with big "ReservedGB" highlighted):
| name | name | type_desc | ReservedPages | ReservedKB | ReservedMB | ReservedGB |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| NULL | NULL | TOTAL: | 15414665 | 123317320 | 120427.070312 | 117.604560851562 |
| data | crm_isc_sales_work_oppor_1b22e | USER_TABLE | 4451592 | 35612736 | 34778.062500 | 33.962951660156 |
| data | pr_data_ih_dim_action | USER_TABLE | 2705708 | 21645664 | 21138.343750 | 20.642913818359 |
I've tried restarting the SQL Server Database Engine (and Agent) as described here:
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/start-stop-pause-resume-restart-sql-server-services?view=sql-server-ver16
That restart did not free up those allocated pages.
For non-table usage, I think shrinking is your friend, but it is not without downsides.
SSMS has a handy report that shows the reserved and actual size of each table: right-click the database you're interested in and select Reports -> Standard Reports -> Disk Usage by Top Tables.
If there is a lot of "air" between reserved and used space, you could rebuild your indexes, especially the clustered ones (ALTER INDEX xxx ON yourtable REBUILD), which may get some space back.
Restarting the server shouldn't make any difference, I think.
You could shrink your database as well, but that usually causes index fragmentation.
If you have very large tables, a rebuild will sometimes take forever or not succeed; in that case you might want to copy the data to another table and "rebuild" it manually by truncating the original and copying the data back.
Your mileage may vary.

Does two phase locking actually prevent lost updates?

Two-phase locking (2PL) is claimed to be a solution for ensuring serializable execution. However, I'm not sure how it adequately solves the lost update problem during a read-modify-write cycle. I may be overlooking or misunderstanding the locking mechanism here!
For example, assuming we have a database running with 2PL:
Given a SQL table accounts with an integer column email_count, let's assume we have the following record in our database:
| ID | email_count |
| ----- | ----- |
| 1 | 0 |
Now let's assume we have two concurrently executing transactions, T1 and T2. Both transactions will read email_count from accounts where ID = 1, increment the count value by 1, and write back the result.
Here's one scenario in which 2PL does not seem to address the lost update problem (T1 represents transaction 1):
T1 -> Obtains a non-exclusive, shared read lock. Reads email_count for ID = 1. Gets the result 0. Application sets a new value (0 + 1 = 1) for a later write.
T2 -> Also obtains a non-exclusive, shared read lock. Reads email_count for ID = 1. Gets the result 0. Application also sets a new value (using a now stale pre-condition), which is 1 (0 + 1 = 1).
T1 -> Obtains an exclusive write lock and writes the new value (1) to our record. This will block T2 from writing.
T2 -> Attempts to obtain write lock so it can write the value 1, but is forced to wait for T1 to complete its transaction and release all of T1's own locks.
Now here's my question:
Once T1 completes and releases its locks (during the "shrink" phase of our 2PL), T2 still has a stale email_count value of 1! So when T1 completes and T2 proceeds with its write (with email_count = 1), we'll "lose" the original update from T1.
If T2 holds a shared read lock, T1 cannot acquire an exclusive lock until T2 releases that read lock. Thus, the execution sequence you describe cannot happen: T1 would be denied the write lock while T2's transaction is still in progress.
T1 -> Obtains an exclusive write lock and writes the new value (1) to
our record. This will block T2 from writing.
The step above cannot happen, because T2 already holds a shared read lock, so T1 has to wait until T2 releases it.
But T2 can't release its shared read lock yet.
Under 2PL a transaction may not acquire any lock after it has released one, so T2 must obtain its write lock before it releases the read lock.
And T2 can't get the write lock, because T1 still holds a shared read lock.
So yes, it's a deadlock. That is how 2PL prevents the lost update, even though it may produce a deadlock: the database aborts one of the two transactions, and the retried transaction reads the committed value.
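The sequence described in these answers can be traced with a toy lock table. This is a minimal single-threaded sketch, not how any real database implements 2PL: the `RowLock` class and its methods are hypothetical, and resolving the deadlock by aborting T2 is an assumption (a real engine picks its own victim).

```python
class RowLock:
    """Toy shared/exclusive lock for a single row (hypothetical helper)."""

    def __init__(self):
        self.readers = set()  # transactions holding a shared lock
        self.writer = None    # transaction holding the exclusive lock

    def acquire_shared(self, txn):
        if self.writer not in (None, txn):
            return False
        self.readers.add(txn)
        return True

    def acquire_exclusive(self, txn):
        # Granted only if no OTHER transaction holds any lock on the row.
        if self.writer not in (None, txn) or (self.readers - {txn}):
            return False
        self.writer = txn
        return True

    def release_all(self, txn):
        # Locks are released together at commit or abort (the shrink phase).
        self.readers.discard(txn)
        if self.writer == txn:
            self.writer = None


row = {"email_count": 0}
lock = RowLock()

# Both transactions take a shared lock and read 0.
lock.acquire_shared("T1")
t1_seen = row["email_count"]
lock.acquire_shared("T2")
t2_seen = row["email_count"]

# Each now needs the exclusive lock, but the other still holds a shared lock.
t1_blocked = not lock.acquire_exclusive("T1")
t2_blocked = not lock.acquire_exclusive("T2")
deadlock = t1_blocked and t2_blocked

# Assumed resolution: T2 is aborted and drops its locks; T1 can proceed.
lock.release_all("T2")
lock.acquire_exclusive("T1")
row["email_count"] = t1_seen + 1
lock.release_all("T1")  # T1 commits

# T2 retries from the start, so it re-reads the committed value.
lock.acquire_shared("T2")
t2_seen = row["email_count"]
lock.acquire_exclusive("T2")
row["email_count"] = t2_seen + 1
lock.release_all("T2")

print(deadlock, row["email_count"])
```

The point of the sketch is that neither stale write ever reaches the row: the write with the stale value is never granted its lock, and the retried transaction works from the value T1 committed.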

How to display the total number of products in 2 warehouses?

I have this database:
My problem is how to show the total number of each product across the 2 warehouses when some products in warehouse1 do not exist in warehouse2 and vice versa.
I have tried this:
select warehouse1.good_count+warehouse2.good_count
from warehouse1, warehouse2
join goods on (warehouse1.good_id = goods.id)
and (warehouse2.good_id=goods.id);
but it doesn't work
Either you misunderstood what your teacher wants, or your teacher asked the question in a wrong manner.
There should be only one WAREHOUSE table with additional column (e.g. WAREHOUSE_TYPE_ID whose values would then be "remote" or "nearby" or whichever comes next), possibly making a foreign key constraint to WAREHOUSE_TYPE table.
You'd then simply
select t.warehouse_type_name,
sum(w.good_count) summary
from warehouse w join warehouse_type t on t.warehouse_type_id = w.warehouse_type_id
group by t.warehouse_type_name;
I read the comments you wrote. The fact that these are physically different warehouses (one in the city and one on its outskirts) doesn't change the fact that the data model is wrong. The data they represent should be contained in a single table (as I explained earlier).
After all, what will you do when the company builds yet another warehouse in some other city? Add another table? WRONG!
Anyway, if you insist, something like this might be what you're looking for in this wrong data model of yours:
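select g.name,
-- coalesce: a product missing from one warehouse counts as 0 instead of NULL
sum(coalesce(a.good_count, 0) + coalesce(b.good_count, 0)) total_count
from goods g left join warehouse1 a on a.good_id = g.id
left join warehouse2 b on b.good_id = g.id
group by g.name;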
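One caveat with the left-join approach: when a product is missing from one warehouse, that side of the join is NULL, and NULL plus a number is NULL, so the counts need to be wrapped in COALESCE. The sketch below checks this with Python's sqlite3 module. The table and column names come from the question and answer; the sample rows are made up for illustration, and each product is assumed to appear at most once per warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE goods (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE warehouse1 (good_id INTEGER, good_count INTEGER);
CREATE TABLE warehouse2 (good_id INTEGER, good_count INTEGER);
-- made-up sample data: 'nut' is only in warehouse1, 'washer' only in warehouse2
INSERT INTO goods VALUES (1, 'bolt'), (2, 'nut'), (3, 'washer');
INSERT INTO warehouse1 VALUES (1, 10), (2, 5);
INSERT INTO warehouse2 VALUES (1, 7), (3, 4);
""")

query = """
SELECT g.name, SUM({expr}) AS total_count
FROM goods g
LEFT JOIN warehouse1 a ON a.good_id = g.id
LEFT JOIN warehouse2 b ON b.good_id = g.id
GROUP BY g.name
"""

# Without COALESCE, a missing side turns the whole sum into NULL (None).
naive = dict(conn.execute(
    query.format(expr="a.good_count + b.good_count")).fetchall())

# With COALESCE, a missing side counts as 0.
totals = dict(conn.execute(
    query.format(expr="COALESCE(a.good_count, 0) + COALESCE(b.good_count, 0)")
).fetchall())

print(naive)
print(totals)
```

If a product can have several rows per warehouse, the two joins would multiply rows; aggregating each warehouse in a subquery before joining would be the safer variant in that case.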

Snowflake Joining 2 table doesn't load both table at once

Running a query like:
select *
from (select * from tableA where date = '2020-07-01') as prev
join
(select * from tableB where di_data_dt = '2020-08-01') as cur
on prev.ID = cur.ID;
Query Profile shows:
Question:
Why does Snowflake load the first table, then the second table, and only then join? Why can't it load both together and save time?
P.S.: I am using an XL warehouse, and the tables are not so massive that Snowflake can't handle them together.
Snowflake clusters utilize all the available bandwidth for remote IO and transfer is already distributed across the cluster. If one table can be retrieved at a rate of X, then two tables would be retrieved at a rate of approximately 0.5 * X, so would take the same total time.
@mike-walton points out that doing the scans sequentially can result in faster queries, due to the partition pruning that join filters make possible.
Snowflake makes a plan for the entire query, so if there is one step that needs Partitions A & B from Table A, and a later sub-select that needs Partitions B & C, then A, B and C would be retrieved during one tablescan.
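The bandwidth argument can be put into numbers. This is a back-of-the-envelope sketch, assuming a fixed total bandwidth that is split evenly between concurrent scans; the sizes and the rate are made-up values, and the function names are hypothetical.

```python
def sequential_time(size_a, size_b, bandwidth):
    """Scan table A at full bandwidth, then table B at full bandwidth."""
    return size_a / bandwidth + size_b / bandwidth

def concurrent_time(size_a, size_b, bandwidth):
    """Scan both at once, each getting half the bandwidth until one finishes."""
    small, large = sorted((size_a, size_b))
    shared_phase = small / (bandwidth / 2)    # both scans running together
    solo_phase = (large - small) / bandwidth  # the larger scan finishes alone
    return shared_phase + solo_phase

# Made-up numbers: 40 GB and 60 GB tables, 10 GB/s of total remote IO.
seq = sequential_time(40, 60, 10)
con = concurrent_time(40, 60, 10)
print(seq, con)
```

Under that assumption both orders move the same number of bytes over the same pipe, so the total time is identical; only the sequential order leaves room for the join filter to skip partitions of the second table.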

Join tables to get the Id from table # 2 for each relevant/multiple row in table # 1 in SQL server

There are 3 SQL Server tables:
1. Table Account - all types of accounts and their attributes. Rows: A = 123, B = 456, C = 789. There are roughly 3 accounts per customer, and one customer can have multiple Bs and Cs.
2. Table FlattenedHierarchy - one column per account type (A, B, C) to detail the relationship. Columns: A = 123, B = 456, C = 789.
3. Table Subscriptions - subscriptions and their attributes, for only one type of account, i.e. C.
I want the list of all accounts of type B that belong to customers whose corresponding Cs have active subscriptions. Each B in the list should also show its corresponding A; the same customer can have two Bs, but both B rows should bring up the same A.
Sample Tables and expected sample
How do I achieve this?
If I understood correctly, this should work:
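SELECT *
FROM Account t1
LEFT JOIN FlattenedHierarchy t2 ON t1.AcctId = t2.B
LEFT JOIN Subscriptions t3 ON t2.C = t3.Acct
WHERE t1.AcctType = 'B'
-- assumes [End Date] is stored as text in m/d/yyyy form (style 101)
-- and that "active" means the end date has not passed yet
AND TRY_CONVERT(date, t3.[End Date], 101) >= CAST(GETDATE() AS date)
AND t3.Acct IS NOT NULL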
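The join can be tried out on a few rows with Python's sqlite3 module. This is a sketch under several assumptions: the column names follow the answer above, the end date is stored as ISO text in a hypothetical `EndDate` column (so the string comparison works in SQLite), "active" is taken to mean the end date has not passed, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (AcctId INTEGER, AcctType TEXT);
CREATE TABLE FlattenedHierarchy (A INTEGER, B INTEGER, C INTEGER);
CREATE TABLE Subscriptions (Acct INTEGER, EndDate TEXT);
-- invented data: one customer (A = 123) with three B accounts
INSERT INTO Account VALUES
  (123, 'A'), (456, 'B'), (457, 'B'), (458, 'B'),
  (789, 'C'), (790, 'C'), (791, 'C');
INSERT INTO FlattenedHierarchy VALUES
  (123, 456, 789), (123, 457, 790), (123, 458, 791);
-- 789 and 791 are still active, 790 expired long ago
INSERT INTO Subscriptions VALUES
  (789, '2999-12-31'), (790, '2000-01-01'), (791, '2999-12-31');
""")

rows = conn.execute("""
SELECT DISTINCT h.A, h.B
FROM Account a
JOIN FlattenedHierarchy h ON h.B = a.AcctId
JOIN Subscriptions s ON s.Acct = h.C
WHERE a.AcctType = 'B'
  AND s.EndDate >= date('now')   -- active: end date not yet passed
ORDER BY h.B
""").fetchall()

print(rows)
```

Inner joins are used here because filtering on the subscription columns already discards the unmatched rows a left join would keep, and DISTINCT collapses a B that has several active Cs into one row.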
