How to copy from a big table to another table in Snowflake? - snowflake-cloud-data-platform

I have a roughly 7 TB table in Snowflake and I want to copy about half of it to a new table, for example with a country filter. What technique would you recommend: an INSERT INTO ... SELECT * FROM TABLE WHERE COUNTRY = 'A', or using Snowpipe to send the data to S3 in Parquet format and then COPY INTO the Snowflake target table?
I tried the first option; 5 hours later the process was only at 35%. I read a post where someone had to scale the cluster up to an XL instance; he had read another post saying Snowpipe was the better option. My cluster is only an XS :(
By the way, I have a cluster key, and the mission is to segment the data by country per company policy.
The original table holds events from devices that have the app installed, around 30 events per session minute - think an app like Uber or Lyft.
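For what it's worth, a minimal sketch of the first option as a one-off copy, assuming hypothetical names (EVENTS, EVENTS_COUNTRY_A, MY_WH) and that you are allowed to resize the warehouse temporarily:

-- Temporarily scale up so the one-off copy finishes in hours, not days
-- (assumes you have the privilege to resize this warehouse).
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XLARGE';

-- CTAS materializes the filtered copy in a single statement;
-- table and column names here are placeholders.
CREATE OR REPLACE TABLE EVENTS_COUNTRY_A
  CLUSTER BY (COUNTRY)   -- keep clustering if the target also needs it
AS
SELECT *
FROM EVENTS
WHERE COUNTRY = 'A';

-- Scale back down once the copy is done.
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XSMALL';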

An MV (materialized view) will definitely be more performant than a standard view, but there is an extra cost associated with that, as Snowflake has to keep the MV in sync with the table. It sounds like the table will be changing rapidly, so this cost will be continuous.
Another option is to create a stream on the source table and use a task to merge the stream data into the target table. Tasks require a running warehouse, but I've found that an XS warehouse is very capable, so with the minimum (XS) size you're talking at most about 24 credits per day. Tasks also have a minimum 1-minute interval, so if you need bleeding-edge latency, that might rule this option out.
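If you go the stream-and-task route, a rough sketch could look like the following; the table, stream, task, and warehouse names are all hypothetical, and it only handles inserts/updates from the stream:

-- Capture changes on the big source table.
create or replace stream EVENTS_STREAM on table EVENTS;

-- Periodically merge new/changed rows for the target country into the copy.
create or replace task SYNC_COUNTRY_A
  warehouse = MY_XS_WH
  schedule = '5 minute'
when
  system$stream_has_data('EVENTS_STREAM')
as
  merge into EVENTS_COUNTRY_A t
  using (
    select EVENT_ID, COUNTRY, PAYLOAD
    from EVENTS_STREAM
    where COUNTRY = 'A'
  ) s
  on t.EVENT_ID = s.EVENT_ID
  when matched then update set t.PAYLOAD = s.PAYLOAD
  when not matched then insert (EVENT_ID, COUNTRY, PAYLOAD)
    values (s.EVENT_ID, s.COUNTRY, s.PAYLOAD);

-- Tasks are created suspended; resume to start the schedule.
alter task SYNC_COUNTRY_A resume;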

Related

Can I perform transformations using Snowflake Streams?

Currently I have a Snowflake table being updated from a Kafka connector in near real time. I then want to be able, in near real time, to take these new data entries through something such as Snowflake CDC / Snowflake Streams and append some additional fields. Some of these will track max values within a certain time period (probably a window function), and others will pull values from static tables based on static_table.id = realtime_table.id.
The final goal is to perform these transformations and transfer the results to a new presentation-level table, so I have both a source table and a presentation-level table with little latency between the two.
Is this possible with Snowflake Streams? Or is there a combination of tools Snowflake offers that can be used to achieve this goal? Due to a number of outside constraints, it is important that this can be done within the Snowflake infrastructure.
Any help would be much appreciated :).
I have considered the use of a materialised view, but am concerned about cost and latency.
The goal of Streams - together with Tasks - is to accomplish exactly the transformations you are asking for.
This quickstart is a good place to start growing your Streams and Tasks skills:
https://quickstarts.snowflake.com/guide/getting_started_with_streams_and_tasks/
On the 6th step you can see a task that would transform the data as it arrives:
create or replace task REFINE_TASK
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '4 minute'
  COMMENT = '2. ELT Process New Transactions in Landing/Staging Table into a more Normalized/Refined Table (flattens JSON payloads)'
when
  SYSTEM$STREAM_HAS_DATA('CC_TRANS_STAGING_VIEW_STREAM')
as
insert into CC_TRANS_ALL (
  select
    card_id, merchant_id, transaction_id, amount, currency, approved, type, timestamp
  from CC_TRANS_STAGING_VIEW_STREAM
);
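Adapting that pattern to the question above, a rough sketch of a task that enriches stream rows from a static table and writes them to a presentation table might look like this; all table, stream, and column names are hypothetical, and the window function only sees the rows in the current stream batch:

create or replace task REFRESH_PRESENTATION
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '1 minute'
when
  SYSTEM$STREAM_HAS_DATA('REALTIME_TABLE_STREAM')
as
  insert into PRESENTATION_TABLE
  select
    s.id,
    s.value,
    st.category,                              -- field pulled from the static table
    max(s.value) over (partition by s.id)     -- example windowed aggregate (current batch only)
  from REALTIME_TABLE_STREAM s
  join STATIC_TABLE st
    on st.id = s.id;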

How many versions are created in a delta table in a Data lake on Azure

I have a clarification question. From what I have read, Delta tables create version 0 (the original data) and version 1 (the updated data) of a row in a table.
So do we have just two versions of the data in Delta tables, or is this configurable? And what happens when we update the same row multiple times - does the Delta table simply keep the latest version of the updates?
Thanks in advance.
Delta will create a new version for each operation - insert/update/delete - and also for additional operations such as changing table properties, OPTIMIZE, VACUUM, etc., although some operations don't create new data files (e.g. updating table properties) and some even delete unused files (VACUUM).
Please take into account that data files in Delta aren't mutable: when you update or delete data, Delta identifies which files contain the affected rows and creates new files with the modified data. That's why it's important to run VACUUM periodically, so you can get rid of the old files (although doing so limits your ability to time travel to the retention period - one week by default).
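To see this in practice, a quick Spark SQL sketch (my_table is a placeholder Delta table, and VERSION AS OF requires Databricks or a recent Spark/Delta release):

-- Every insert/update/delete/OPTIMIZE/etc. shows up as its own version here.
DESCRIBE HISTORY my_table;

-- Time travel to an earlier version while its files still exist.
SELECT * FROM my_table VERSION AS OF 3;

-- Remove files no longer referenced and older than the retention window
-- (7 days / 168 hours by default); after this, time travel is limited to that window.
VACUUM my_table RETAIN 168 HOURS;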

Azure SQL Database Partitioning

I currently have an Azure SQL Database (Standard, 100 DTUs, S3) and I want to create partitions on a large table, splitting a datetime2 value into YYYYMM buckets. Each table has at least the following columns:
Guid (uniqueidentifier type)
MsgTimestamp (datetime2 type) << partition using this.
I've been looking through the Azure documentation and SO but can't find anything that clearly says how to create a partition on a datetime2 column in the desired format, or even whether it's supported on this SQL Database tier.
Another example is the link below, but I cannot find the Storage menu option in SQL Server Management Studio for creating a partition.
https://www.sqlshack.com/database-table-partitioning-sql-server/
In addition, would these tables have to be created daily as the clock goes past 12am or is this done automatically?
UPDATE
I suspect I may have to manually create the partitions using the first link below and then, at the beginning of each month, use the second link to create the next month's partition in advance.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-partition-function-transact-sql?view=sql-server-ver15
Context
I currently connect to a real-time feed that delivers up to 600 rows a minute, and I have a backlog of around 370 million rows covering 3 years' worth of data.
Correct.
You can create partitions based upon datetime2 columns. Generally, you'd just do that on the start of month date, and you'd use a RANGE RIGHT (so that the start of the month is included in the partition).
And yes, at the end of every month, the normal action is to:
Split the partition function to add a new partition boundary.
Switch out the oldest monthly partition into a separate table for archiving purposes (presuming you want to have a rolling period of months)
And another yes, we all wish the product had options to do this for you automatically.
I was one of the tech reviewers on the following whitepaper by Ron Talmage, back in 2008 and 99% of the advice in it is still current:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd578580(v=sql.100)
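For illustration, a rough T-SQL sketch of a monthly RANGE RIGHT setup on a datetime2 column; the function, scheme, table, and boundary dates are placeholders:

-- Monthly boundaries; with RANGE RIGHT each boundary date belongs to the partition
-- on its right, so the 1st of the month starts a new partition.
CREATE PARTITION FUNCTION pfMonthly (datetime2(7))
AS RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01');

-- Azure SQL Database only has the PRIMARY filegroup, so map everything to it.
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Create (or rebuild) the table on the scheme, partitioned by MsgTimestamp.
CREATE TABLE dbo.Readings
(
    Guid         uniqueidentifier NOT NULL,
    MsgTimestamp datetime2(7)     NOT NULL
) ON psMonthly (MsgTimestamp);

-- At the end of each month, add the next boundary in advance.
ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE ('2021-04-01');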

Best method updating sandbox tables with production tables/views

Using SQL, it is taking over 4 hours every evening to pull all the data from the twelve production database tables and views needed for our Sandbox database. There has to be a significantly more efficient and effective way to get this data into our Sandbox.
Currently, I'm creating a UID (Unique ID) by concatenating the views Primary Keys and system date fields.
The UID is used in two steps:
Step 1.
INSERT INTO Sandbox
WHERE UID IS NULL,
looking back only over the last 30 days based on the system date
(using a LEFT JOIN from the Production table/view UID to the existing Sandbox table/view UID).
Step 2.
UPDATE Sandbox
WHERE Production.UID = Sandbox.UID
(using an INNER JOIN of the Production table/view UID to the existing Sandbox table/view UID).
I've cut the 4 hour time down to 2 hours, but it feels like this process I've created is missing a (big) step.
How can I cut this time down? Should I put a 30 day filter on my UPDATE statement as well?
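For context, the two steps above translate roughly into the following; the database, table, and column names are hypothetical, and the 30-day filter is shown on the UPDATE as well, as you're considering:

-- Step 1: insert rows that don't exist in Sandbox yet, limited to the last 30 days.
INSERT INTO Sandbox.dbo.Orders (UID, Col1, Col2, SystemDate)
SELECT p.UID, p.Col1, p.Col2, p.SystemDate
FROM Production.dbo.Orders AS p
LEFT JOIN Sandbox.dbo.Orders AS s
    ON s.UID = p.UID
WHERE s.UID IS NULL
  AND p.SystemDate >= DATEADD(DAY, -30, GETDATE());

-- Step 2: refresh rows that already exist, with the same 30-day filter.
UPDATE s
SET s.Col1 = p.Col1,
    s.Col2 = p.Col2
FROM Sandbox.dbo.Orders AS s
INNER JOIN Production.dbo.Orders AS p
    ON p.UID = s.UID
WHERE p.SystemDate >= DATEADD(DAY, -30, GETDATE());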
Assuming you're not moving billions of rows into your development environment, I would just create a simple ETL strategy to truncate the dev environment and do a full load from production. If you don't want the full dataset, add a filter to the source queries for your ETL. Just make sure that doesn't have any effect on the integrity of the data.
If your data is in the billions, you likely have an enterprise storage solution in place. Many of those can handle snapshotting the data files to another location. There are some security aspects to that approach that you'll need to consider as well.
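As a minimal sketch of that truncate-and-reload approach (the names and the optional filter are hypothetical):

-- Full refresh: wipe the sandbox copy and reload from production.
TRUNCATE TABLE Sandbox.dbo.Orders;

INSERT INTO Sandbox.dbo.Orders (UID, Col1, Col2, SystemDate)
SELECT UID, Col1, Col2, SystemDate
FROM Production.dbo.Orders
WHERE SystemDate >= DATEADD(YEAR, -1, GETDATE());   -- optional filter if the full set isn't needed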
I found an answer that is in two parts. It may not be the best solution, but it seems to be working for the moment.
I can use primary keys as my UID from the production database tables (for the most part), updating them with a 30-90 day filter.
The views are a bit trickier, as they union two identical tables and so contain duplicate primary keys. So I created my own UID by concatenating multiple primary key fields and update with the same 30-90 day filter.
The previous process would take up to 4+ hours to complete. The new process runs in an hour and seems to be working for the moment.

MS SQL reporting from busy table

I've tried to search for some ideas but can't find anything that's very suitable for my scenario.
I have a table which I write and update data to from multiple sites, maybe a row per second during specific hours of the day, with around 50k records added daily on average. Separate to this, I have dashboards where people can query this data, but some of the queries may be quite complex and take a number of seconds to complete.
I can't afford for my writes/updates to slow down.
Although the dashboards don't need to be real time, it would be a bonus.
I'm hosting on an Azure SQL Database S2. What options are available?
My current idea is to use an 'active' table for writes/updates and flush the data to the full table every x minutes. My only concern is that I have a seeded bigint as the PK on the main table, and because I also save other data linked to that ID, I'd have problems linking to it until I commit to the main table. An option would be to reseed the active table and use IDENTITY_INSERT on the main table to populate the IDs myself, but I'm not 100% happy with this.
Just looking for suggestions until I go ahead with my current idea! Thanks
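To make the 'active table' idea concrete, a rough T-SQL sketch; the table names, the identity seed, and the scheduling mechanism (an Elastic Job or an app-side timer, for example) are all assumptions:

-- Small 'active' table that takes the fast writes, with its own identity
-- seeded into a range the main table's identity won't reach (the reseed idea).
CREATE TABLE dbo.ReadingsActive
(
    Id        bigint IDENTITY(1000000001, 1) NOT NULL PRIMARY KEY,
    SiteId    int           NOT NULL,
    Payload   nvarchar(max) NULL,
    CreatedAt datetime2(7)  NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Flush job, run every x minutes: copy the captured rows into the main table
-- (which keeps its seeded bigint PK) and clear only what was copied.
DECLARE @maxId bigint = (SELECT MAX(Id) FROM dbo.ReadingsActive);

BEGIN TRANSACTION;

SET IDENTITY_INSERT dbo.Readings ON;

INSERT INTO dbo.Readings (Id, SiteId, Payload, CreatedAt)
SELECT Id, SiteId, Payload, CreatedAt
FROM dbo.ReadingsActive
WHERE Id <= @maxId;

SET IDENTITY_INSERT dbo.Readings OFF;

DELETE FROM dbo.ReadingsActive
WHERE Id <= @maxId;

COMMIT TRANSACTION;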
