- Unloading data from Snowflake - each row to a separate file - snowflake-cloud-data-platform

Please share your experiences with unloading data from Snowflake.
The table has a million rows and each row is around 16 MB of data.
The "COPY INTO '@ext_stg/path/file_name'
FROM schema.table" statement has to generate a separate file for each row.
The intent is to generate a million files in S3.
COPY INTO is designed to write bulk data at once.
Using COPY INTO to generate a separate file for each row is extremely slow.
Thanks!

Snowflake's COPY INTO <location> statement can write newline-delimited JSON (ndjson), which already makes it very simple to split the records apart with a little local processing.
It appears you've already tried a row-by-row iteration to perform such single-row exports and found it, as expected, slow. It may still be a viable option if this is only a one-time operation.
Snowflake does not offer a parallel, per-row export mechanism (that I am aware of), so it may be simpler to export the entire table normally and then use a downstream parallel processing framework (such as a Spark job) to divide the output into individual per-record files. The ndjson format's ready-to-be-split nature makes the files easy to process in distributed frameworks.
P.S. Setting the MAX_FILE_SIZE copy option to a very low value (below the lower bound of your row size) will not guarantee a single file per row, as writes are performed over sets of rows read together from the table.
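For reference, a minimal bulk-unload sketch (assuming an external stage named @ext_stg and that each row can be serialized with OBJECT_CONSTRUCT; the path and names are illustrative only):
copy into @ext_stg/path/export_
from (select object_construct(*) from schema.table)  -- one JSON object per row
file_format = (type = json)                           -- newline-delimited JSON output
max_file_size = 256000000;                            -- chunked output files; do not rely on this for per-row splitting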

You can achieve this through scripting in Python or even with a Snowflake JavaScript stored procedure.
Pseudocode (run_query stands in for whatever execute call your client library provides) would look like this:
var_filter_list = run_query("select primary_key from schema.table")  -- primary key or unique identifier
for idx, pk_val in enumerate(var_filter_list):  -- one COPY INTO per row
    var_file_name = concat(file_name, idx)
    run_query("copy into @ext_stg/path/" + var_file_name +
              " from (select * from schema.table where primary_key = " + pk_val + ")")

Related

Snowflake query performance is unexpectedly slower for external Parquet tables vs. internal tables

When I run queries on external Parquet tables in Snowflake, the queries are orders of magnitude slower than on the same tables copied into Snowflake or with any other cloud data warehouse I have tested on the same files.
Context:
I have tables belonging to the 10TB TPC-DS dataset in Parquet format on GCS and a Snowflake account in the same region (US Central). I have loaded those tables into Snowflake using CREATE TABLE AS SELECT. I can run TPC-DS queries (here, query #28) on these internal tables with excellent performance. I was also able to query those files on GCS directly with data lake engines with excellent performance, as the files are "optimally" sized and internally sorted. However, when I query the same external tables on Snowflake, the query does not seem to finish in reasonable time (>4 minutes and counting, as opposed to 30 seconds, on the same virtual warehouse). Looking at the query profile, it seems that the number of records read in the table scans keeps growing indefinitely, resulting in a proportional amount of spilling to disk.
The table happens to be partitioned, but that does not matter for the query of interest (which I tested with other engines).
What I would expect:
Assuming proper data "formatting", I would expect no major performance degradation compared to internal tables, as the setup is technically the same - data stored in columnar format in a cloud object store - and it is advertised as such by Snowflake. For example, I saw no performance degradation with BigQuery in the exact same experiment.
Other than double-checking my setup, I don't see many things to try...
This is what the "in progress" part of the plan looks like 4 minutes into execution on the external table. All other operators are at 0% progress. You can see that external bytes scanned = bytes spilled, and 26G(!) rows are produced. And this is what it looked like for a finished execution against the internal table, which completed in ~20 seconds. You can see that the left-most table scan should produce 1.4G rows but had produced 23G rows with the external table.
This is a sample of the DDL I used (I also tested without defining the partitioning column):
create or replace external table tpc_db.tpc_ds.store_sales (
ss_sold_date_sk bigint as
cast(split_part(split_part(metadata$filename, '/', 4), '=', 2) as bigint),
ss_sold_time_sk bigint as (value:ss_sold_time_sk::bigint),
ss_item_sk bigint as (value:ss_item_sk::bigint),
ss_customer_sk bigint as (value:ss_customer_sk::bigint),
ss_cdemo_sk bigint as (value:ss_cdemo_sk::bigint),
ss_hdemo_sk bigint as (value:ss_hdemo_sk::bigint),
ss_addr_sk bigint as (value:ss_addr_sk::bigint),
ss_store_sk bigint as (value:ss_store_sk::bigint),
ss_promo_sk bigint as (value:ss_promo_sk::bigint),
ss_ticket_number bigint as (value:ss_ticket_number::bigint),
ss_quantity bigint as (value:ss_quantity::bigint),
ss_wholesale_cost double as (value:ss_wholesale_cost::double),
ss_list_price double as (value:ss_list_price::double),
ss_sales_price double as (value:ss_sales_price::double),
ss_ext_discount_amt double as (value:ss_ext_discount_amt::double),
ss_ext_sales_price double as (value:ss_ext_sales_price::double),
ss_ext_wholesale_cost double as (value:ss_ext_wholesale_cost::double),
ss_ext_list_price double as (value:ss_ext_list_price::double),
ss_ext_tax double as (value:ss_ext_tax::double),
ss_coupon_amt double as (value:ss_coupon_amt::double),
ss_net_paid double as (value:ss_net_paid::double),
ss_net_paid_inc_tax double as (value:ss_net_paid_inc_tax::double),
ss_net_profit double as (value:ss_net_profit::double)
)
partition by (ss_sold_date_sk)
with location = @tpc_ds/store_sales/
file_format = (type = parquet)
auto_refresh = false
pattern = '.*sales.*[.]parquet';
Snowflake didn't originally have external file queries, nor did it originally have Parquet support. I seem to remember that when external queries arrived, the approach was simply to read 100% of all the files into the system and then start processing, which aligns with what you are seeing. This was a blessing, because the prior state was having to load all files into a staging table and then run a filter on that; sometimes, if it was a one-off query (but it almost never is in the end), executing SQL against the raw files was rather helpful.
Yes, it should be possible to optimize Parquet file reads, gather the metadata, and then eliminate wasteful partition reads, but that is not the order of evolution. So I am not surprised by your findings.
I would never suggest using an external data model for general day-to-day Snowflake operations, as it is not presently optimized for that. Two reasons to load the data instead: the disk cost of storing it in Snowflake is the same as storing it in S3, and Snowflake then has complete control over metadata and read/write synchronization between nodes, all of which amounts to performance.
Also, spilling to local storage is not bad per se; spilling to remote storage is the worst kind of spilling. But it does appear that you are getting the effective result of a full file import followed by processing.
Probably the Snowflake plan assumes it must read every Parquet file because it cannot tell beforehand whether the files are sorted, or what the number of unique values, the nulls, and the minimum and maximum values for each column are.
This information is stored as an optional field in Parquet, but you'd need to read the Parquet metadata first to find out.
When Snowflake uses internal tables, it has full control over storage, has information about indexes (if any) and column stats, and knows how to optimize a query from both a logical and a physical perspective.
I don't know what your query looks like, but there is also a small chance that you suffer from a known issue where Snowflake interprets a function in the partition filter as dynamic and thus scans all the data; see the details in Joao Marques' blog on Medium: "Using Snowflake External Tables? You must read this!".
Example of how not to do it from the blog
SELECT COUNT(*)
FROM EXTERNAL_TABLE
WHERE PARTITION_KEY = dateadd(day, -1, current_date)
Example of how to do it from the blog
SET my_var = (select dateadd(day, -1, current_date));
SELECT COUNT(*)
FROM EXTERNAL_TABLE
WHERE PARTITION_KEY = $my_var
All credits to the blog author, I have merely stumbled across this issue myself and found his blog.

UPDATE millions of rows, or DELETE/INSERT?

Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row and then select the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (Updating a couple million rows within a table of a couple billion rows.)
Should I write an UPDATE statement with a join?
Would it be better to DELETE these million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. a traditional RDBMS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only selected bytes of an existing object; the entire file object must be uploaded and replaced. So, while in, say, SQL Server data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan of the table file, creating an immutable copy of the original with the applicable rows filtered out (deleted) or modified (updated), which then replaces the file just scanned.
So, whether you update 1 row or 1M rows, at minimum the entirety of the micro-partitions that the modified data lives in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time Travel costs down versus constantly wiping and rewriting tables. Another consideration is that since Snowflake is column-oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would replace all S3 files for all columns, which would lower performance.
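A minimal MERGE sketch (the names extended_fact, recomputed_props, fact_id, and derived_prop are hypothetical; the idea is to stage the few million recomputed rows and apply them as a differential):
merge into extended_fact t
using recomputed_props s
  on t.fact_id = s.fact_id
when matched then
  update set derived_prop = s.derived_prop   -- modify rows that already exist
when not matched then
  insert (fact_id, derived_prop) values (s.fact_id, s.derived_prop);  -- add any new rows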

Large file, SSIS to Small chunks, parallel enrichment

I have a pretty large file, 10 GB in size, and need to load its records into the DB.
I want to have two additional columns:
LoadId, which is a constant (this indicates the file's unique load number)
ChunkNumber, which indicates which chunk of the given batch size the row belongs to.
So if I have a batch size of 10,000 records I want
LoadId = {GUID}
ChunkNumber = 1
for the next 10,000 records I want
LoadId = {GUID}
ChunkNumber = 2
Is this possible in SSIS? I suppose I could write a custom component for this, but there should be a built-in ID I could use, as SSIS is already running things in batches of 10,000.
Can someone help me figure out whether this parameter exists and whether it can be used?
OK, a little more detail on the background of what and why.
Once we get the data into slices of 10,000 records, we can start calling the stored procedures to enrich the data in chunks. All I am trying to find out is whether SSIS can help here by assigning a chunk number and a GUID.
This helps the stored proc move the data in chunks. Although I could do this after the fact with a row number, the SELECT would have to travel through the whole set again and update the chunk numbers; it's double the effort.
A GUID will represent the complete dataset, and the individual chunks are related to it.
Some more insight: there is a WorkingTable we import this large file into, and if we start enriching all the data at once, the transaction log gets used up. It is more manageable if we can get the data into chunks, so that the transaction log does not blow up and we can also parallelize the enrichment process.
The data moves from a de-normalized format to a normalized format from here. A stored procedure is more maintainable in terms of release and day-to-day management, so any help is appreciated.
Or is there another, better way of dealing with this?
For the LoadID, you could use the SSIS variable
System::ExecutionInstanceGUID
which is generated by SSIS when the package runs.
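For the ChunkNumber, if no built-in batch counter turns out to be exposed, one fallback (only a sketch, using a hypothetical staging table dbo.WorkingTable with an IDENTITY column Id and the two new columns already added; the asker noted this after-the-fact pass is extra work) is to derive it in T-SQL once the rows are loaded:
WITH Numbered AS (
    SELECT ChunkNumber, ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM dbo.WorkingTable
)
UPDATE Numbered
SET ChunkNumber = (rn - 1) / 10000 + 1;  -- 1-based chunk number per 10,000 rows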

How to bulk insert and validate data against existing database data

Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been adding the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but this is by no means a quick solution, so SSIS would be preferable
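A hedged, set-based variant of the same idea that works from the staging table in one pass (all names here are invented for illustration: dbo.ProductStaging holds the validated rows with an IsValid flag, dbo.Product is the live table keyed on ProductCode):
-- update rows that already exist
UPDATE p
SET p.Price = s.Price
FROM dbo.Product AS p
JOIN dbo.ProductStaging AS s ON s.ProductCode = p.ProductCode
WHERE s.IsValid = 1;

-- insert rows that do not exist yet
INSERT INTO dbo.Product (ProductCode, Price)
SELECT s.ProductCode, s.Price
FROM dbo.ProductStaging AS s
WHERE s.IsValid = 1
  AND NOT EXISTS (SELECT 1 FROM dbo.Product AS p WHERE p.ProductCode = s.ProductCode);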

What is a good approach to preloading data?

Are there best practices out there for loading data into a database, to be used with a new installation of an application? For example, for application foo to run, it needs some basic data before it can even be started. I've used a couple options in the past:
TSQL for every row that needs to be preloaded:
IF NOT EXISTS (SELECT * FROM Master.Site WHERE Name = @SiteName)
INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
VALUES (@EnterpriseId, @SiteName, GETDATE(), @LastModifiedUser)
Another option is a spreadsheet. Each tab represents a table, and data is entered into the spreadsheet as we realize we need it. Then, a program can read this spreadsheet and populate the DB.
There are complicating factors, including the relationships between tables. So, it's not as simple as loading tables by themselves. For example, if we create Security.Member rows, then we want to add those members to Security.Role, we need a way of maintaining that relationship.
Another factor is that not all databases will be missing this data. Some locations will already have most of the data, and others (that may be new locations around the world), will start from scratch.
Any ideas are appreciated.
If it's not a lot of data - just the bare initialization of configuration data - we typically script it along with any database creation/modification.
With scripts you have a lot of control, so you can insert only missing rows, remove rows which are known to be obsolete, not override certain columns which have been customized, etc.
If it's a lot of data, then you probably want to have an external file(s) - I would avoid a spreadsheet, and use a plain text file(s) instead (BULK INSERT). You could load this into a staging area and still use techniques like you might use in a script to ensure you don't clobber any special customization in the destination. And because it's under script control, you've got control of the order of operations to ensure referential integrity.
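For the plain-text route, a minimal BULK INSERT sketch (the file path, terminators, and staging table dbo.SiteStaging are made-up placeholders):
BULK INSERT dbo.SiteStaging
FROM 'C:\seed\site.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK);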
I'd recommend a combination of the 2 approaches indicated by Cade's answer.
Step 1. Load all the needed data into temp tables (on Sybase, for example, load data for table "db1..table1" into "temp..db1_table1"). In order to handle large datasets, use a bulk copy mechanism (whichever one your DB server supports) without writing to the transaction log.
Step 2. Run a script whose main step iterates over each table to be loaded, creates indexes on the newly created temp table if needed, compares the data in the temp table to the main table, and inserts/updates/deletes the differences. Then, as needed, the script can do auxiliary tasks like the security role setup you mentioned.
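A sketch of the Step 2 compare-and-apply, under assumed names (SQL Server syntax; #db1_Site is the bulk-loaded temp table, dbo.Site the destination keyed on Name, and IsCustomized is an invented flag standing in for whatever protects locally customized rows):
-- insert rows missing from the destination
INSERT INTO dbo.Site (Name, EnterpriseID)
SELECT t.Name, t.EnterpriseID
FROM #db1_Site AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.Site AS d WHERE d.Name = t.Name);

-- remove rows known to be obsolete, leaving customized rows alone
DELETE d
FROM dbo.Site AS d
WHERE NOT EXISTS (SELECT 1 FROM #db1_Site AS t WHERE t.Name = d.Name)
  AND d.IsCustomized = 0;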
