Snowflake transformations

We will have flat files loaded into Snowflake tables in a staging schema from AWS S3. Now we need to perform simple transformations like aggregations, mappings, and calculations. I know we could use Informatica or other tools, but our transformations are not big enough to justify a third-party tool.
We have to load the flat files as-is from AWS S3 into Snowflake, so we can't use transformations in the COPY command.
What is the easiest way, and the best practice, to do simple, basic transformations in Snowflake?
Thanks

If you are loading CSV files you can also apply some very simple transformations during your COPY command. According to the docs, the supported simple transformations are column reordering, column omission, and casts using a SELECT statement.
See here: https://docs.snowflake.com/en/user-guide/data-load-transform.html
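For reference, a COPY-time transformation might look something like this (the stage, table, and column positions are made up):
-- Reorder, omit, and cast columns while loading (illustrative names).
COPY INTO staging.customers (customer_id, signup_date)
FROM (
  SELECT $2::NUMBER,   -- cast the second CSV column
         $1::DATE      -- reorder: the first CSV column loads second
  FROM @staging.s3_stage/customers/
)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);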
If you want to do simple aggregations, mappings and/or calculations, I can recommend two ways:
Using Views: https://docs.snowflake.com/en/user-guide/views-introduction.html
Using Stored Procedures: https://docs.snowflake.com/en/sql-reference/stored-procedures.html
The easiest way is to develop both (views and stored procedures) directly in the Snowflake web UI.
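As an illustration, a simple aggregation view over a staging table could look like this (all names are placeholders):
-- Aggregate the raw staged data into a reporting-friendly shape.
CREATE OR REPLACE VIEW analytics.daily_sales AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   staging.sales_raw
GROUP BY order_date, region;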

Take a look at Snowflake's Tasks and Streams. These would allow you to move incremental data from staging to target tables automatically every time a new set of files is loaded. Might be useful for you.
Tasks: https://docs.snowflake.com/en/user-guide/tasks-intro.html
Streams: https://docs.snowflake.com/en/user-guide/streams.html
(Links are to Snowflake documentation and won't "go dead")
The general idea is that a stream lets you see what has changed in a table. A task can then monitor the stream and, when it contains changes, execute a MERGE or a stored procedure that takes the incremental data, transforms it, and loads it into the target table, all fully automated. That could be useful in the simple transformation scenario you have described.
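A minimal sketch of that pattern, with made-up warehouse, table, and column names:
-- Stream over the staging table captures newly loaded rows.
CREATE OR REPLACE STREAM staging.sales_raw_stream ON TABLE staging.sales_raw;

-- Task runs on a schedule, but only does work when the stream has data,
-- merging the increment into the target table.
CREATE OR REPLACE TASK staging.load_sales
  WAREHOUSE = transform_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STAGING.SALES_RAW_STREAM')
AS
  MERGE INTO analytics.sales t
  USING staging.sales_raw_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

ALTER TASK staging.load_sales RESUME;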

Related

Multiple Tables Load from Snowflake to Snowflake using ADF

I have source tables in Snowflake and destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load the data using a single pipeline for all the tables.
E.g., suppose I have 40 tables in the source; I need to create a single pipeline that loads all 40 tables' data into the destination tables at once.
Can anyone help me in achieving this?
Thanks,
P.
This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring tool, but managing the actual copying and maneuvering of data in Snowflake with it can be tricky. My high-level recommendation is to write your logic and loading code in Snowflake stored procedures,
then use ADF to orchestrate by simply calling those stored procedures. You get the benefits of ADF for what it is good at, and you let Snowflake do the heavy lifting, which is what it is good at.
Hopefully you can parameterize the procedures so that one procedure (or a few) takes a table name and dynamically figures out column names and the like to run your loading process; a sketch follows.
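A minimal sketch of such a parameterized procedure using Snowflake Scripting (a JavaScript procedure works just as well); the database, schema, and table names are illustrative:
CREATE OR REPLACE PROCEDURE util.load_table(table_name STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  stmt STRING;
BEGIN
  -- Build and run a dynamic INSERT from staging to target for the given table.
  stmt := 'INSERT INTO target_db.curated.' || table_name ||
          ' SELECT * FROM staging_db.staging.' || table_name;
  EXECUTE IMMEDIATE :stmt;
  RETURN 'Loaded ' || table_name;
END;
$$;

-- ADF would then simply call it once per table:
CALL util.load_table('CUSTOMERS');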
Assorted Notes on implementation.
ADF does have a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector. You don't need to do this. Use the native connector with the auto-resolve integration runtime and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
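For instance, the Lookup activity's source query might look something like this (the parameter columns and schema name are assumptions):
-- One row per table to process; the extra columns become parameters
-- that the foreach loop passes on to the stored procedure or child pipeline.
SELECT table_name,
       'UTIL.LOAD_TABLE' AS procedure_name,
       'UPDATED_AT'      AS order_by_column
FROM   staging_db.information_schema.tables
WHERE  table_schema = 'STAGING';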
Foreach loops are a little limited in that there are some things you can't nest inside a loop (like conditionals). If you need extra functionality, you can have the foreach loop call a child ADF pipeline (passing in those parameters) and have the child pipeline manage your table-processing logic.
Snowflake has pretty good options for querying metadata based on a table name; see INFORMATION_SCHEMA. Between that and just a tiny bit of JavaScript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided table name).
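For example, a sketch of pulling a column list for one table (database, schema, and table names are placeholders):
-- Comma-separated column list, ready to splice into a dynamic statement.
SELECT LISTAGG(column_name, ', ')
         WITHIN GROUP (ORDER BY ordinal_position) AS column_list
FROM   staging_db.information_schema.columns
WHERE  table_schema = 'STAGING'
  AND  table_name   = 'CUSTOMERS';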
If you do want to use ADF's Copy activities, I think you'll need to set up an intermediary Azure Storage account connection. I believe this is because the connector uses COPY INTO under the hood, which requires external storage.
ADF doesn't have many good options for preventing the same pipeline from running multiple times at once. Either make sure your code can handle that edge case, or make sure your scheduling and timeouts won't let a long-running pipeline overlap with the next run.
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )

How to get a list of files names from Snowflake external S3 stage?

I am looking for the best way to automatically detect new files in a S3 bucket and then to load the data into a Snowflake table.
I know this can be achieved using Snowpipe and SNS, SQS notifications set up in AWS but I would like to have a self-contained solution within Snowflake which can be used for multiple data sources.
I want to have a table which is updated with the file names from a S3 bucket and then to load files which have not already been loaded from S3 into Snowflake.
The only way I have found to automatically detect new files from an external S3 stage in Snowflake so far is to use the code below and a task on a set schedule. This lists the file names and then uses result_scan to display the last query as a table.
list @STAGE_NAME;
set qid=last_query_id();
select "name" from table(result_scan($qid));
Does anyone know a better way to automatically detect new files in an external stage from Snowflake? Any help is much appreciated.
Not necessarily better than the way you've already found, but there is an alternative approach to listing the files in an S3 bucket.
If you create an EXTERNAL TABLE over the data in S3, you can then use the METADATA$FILENAME property in a query. If you have a record of which files have already been loaded into Snowflake then you can compare and select the names of the new files and process them.
e.g.
ALTER EXTERNAL TABLE MYSCHEMA.MYEXTERNALTABLE REFRESH;
SELECT DISTINCT
METADATA$FILENAME as filename
FROM
MYSCHEMA.MYEXTERNALTABLE;
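Then, assuming you keep a tracking table of already-loaded files (the LOADED_FILES table here is hypothetical), finding the new files is just an anti-join:
-- Files visible via the external table's metadata but not yet loaded.
SELECT DISTINCT t.METADATA$FILENAME AS filename
FROM MYSCHEMA.MYEXTERNALTABLE t
WHERE t.METADATA$FILENAME NOT IN (SELECT filename FROM MYSCHEMA.LOADED_FILES);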
Short Run:
Your approach
You've already found a viable solution, and your concern about the reliability of LAST_QUERY_ID() is understandable. A stored procedure's session is isolated, so within a procedure LAST_QUERY_ID() only sees the statements executed inside that procedure. Using a procedure might be unnecessary, but I personally like that procedures let you create reusable abstractions.
Another approach
An alternative, if you don't like the approach you're using, would be to create a single table with a single VARIANT data column plus the stage metadata columns, maintained by one big pipe. You could then maintain a set of materialized views over that table which filter, project variant fields into columns, and sanitize as appropriate.
There are some benefits:
simpler: integrating new prefixes for a stage requires only an additional materialized view, not an additional pipe + task
more control: you'd be able to operate directly and automatically on the data in raw form, rather than needing to load into a table and then check it. This means you can perform data quality checks, metadata checks, and sanitization.
maintainable: the use of materialized views over an immutable source means you can at any time change the logic and perform a full backfill with little effort.
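A rough sketch of that pattern, assuming JSON files for illustration; the stage, schemas, and field names are all made up:
-- Single landing table: raw VARIANT payload plus file metadata.
CREATE TABLE raw.landing (
  src       VARIANT,
  file_name STRING
);

-- One pipe (auto-ingest or REST-triggered) keeps the landing table fed from the stage.
CREATE PIPE raw.landing_pipe AS
  COPY INTO raw.landing
  FROM (SELECT $1, METADATA$FILENAME FROM @raw.s3_stage)
  FILE_FORMAT = (TYPE = JSON);

-- One materialized view per prefix/entity, projecting variant fields into columns.
CREATE MATERIALIZED VIEW curated.orders AS
  SELECT src:order_id::NUMBER     AS order_id,
         src:amount::NUMBER(10,2) AS amount,
         file_name
  FROM   raw.landing
  WHERE  file_name LIKE 'orders/%';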
Long Run:
Notification integrations enable Snowflake to listen (and possibly notify in the future, roadmap-gods willing) to external messaging systems. At the moment only Azure is supported, so it won't work for your case, but keep an eye out over the next few months. I think it's safe to speculate that this feature will grow to support AWS, and a more direct and concise way to implement your original solution will eventually become available.

What is the best way to get XML data to multiple SSIS Data Flow Tasks?

This is a question on how to structure an SSIS package to solve a very specific problem (I'm new to SSIS and have not found anything on the correct approach).
My Problem: I have an SSIS package that reads a very simple XML file. The XML Source sees the information as a single table. One of the table columns is a qualifier that affects the way a record is processed. Rather than having the processing for all of the qualifiers in a single task, I'd like to have a separate task for each qualifier (for modularity). I could have the task for each qualifier read, shred, and process the XML file, but reading and shredding the XML file multiple times seems inefficient. I'd think it would be better to have a task with an XML Source that persists the data, and then have that data used by a number of other tasks that process it.
A Possible Solution: From what I’ve read, the correct approach is to save the data into a Raw File Destination, and then to have the various tasks use a Raw File Source. This seems too much like a global variable to me. Is there a better way? I can figure out the specifics, so I don’t need a detailed answer, just the best approach.
Thanks
I would use the SSIS Conditional Split transformation for this. It can evaluate your "Qualifier" column and send specified instances down different paths within that Data Flow Task.
https://msdn.microsoft.com/en-us/library/ms137886.aspx
It doesn't seem that there is a way of factoring a DFT (Data Flow Task) designed into SSIS. SSIS is structured so that each DFT provides a complete ETL function. The only way to "factor" a DFT seems to be to create an ad hoc data flow in the Control Flow by using Raw File sources and destinations to fake DFT input and output parameters. This means creating and managing lots of files (and variables for the file names). Recordset sources and destinations can also be used, but the coding overhead might be higher (I haven't tried using Recordsets, only Raw Files).
Not being able to factor DFTs makes creating and validating complex SSIS packages extremely difficult. Microsoft really needs to come up with a solution. I found another use case in a web forum: being able to recover a task without having to go all the way back to the beginning. I'll add a link to that post here if I can find it again.
One solution might be to allow a DFT to execute another DFT, just as a package can execute another package in the control flow. This breaks the convention that each DFT provides a complete ETL function, but if it is the best way forward, the benefit of being able to factor DFTs outweighs any added conceptual complexity.
Disclaimer: I am an experienced LabVIEW programmer, so my view of dataflow is likely biased. I could be missing an obvious solution.

Large tables of static data with DBGhost

We are thinking of restructuring our database development and deployment processes by using DBGhost; we want to move away from the central development database and bring the database under source control.
One of the problems we have is a big table with static data (containing translated language strings), it has close to 200K rows.
I know that our best solution is to move these strings into resource files, but until we implement that, will DBGhost be able to maintain all this static data and generate our development and deployment databases in a short time? And if not, is there a good alternative for populating this table whenever we need to?
This is an older question with an accepted answer, but I have some different input into this.
We use DBGhost and we have lots of static table data, although the largest is only about 20K rows, rather than 200K rows.
DBGhost has a feature to script data (as a series of insert statements). We used that to export our static data into scripts and put those scripts under version control. We tweaked those scripts to clear the data before adding the data back in, so we can use a single script to "reset" the static data for a table. This addition was for our specific needs, and is not the only way that you could handle static data with DBGhost.
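For illustration, the general shape of such a generated "reset" script might be (the table and columns here are hypothetical):
-- Clear, then repopulate, the static data for one table.
DELETE FROM dbo.LanguageStrings;

INSERT INTO dbo.LanguageStrings (StringId, LanguageCode, StringText)
VALUES (1, 'en', 'Save'),
       (1, 'fr', 'Enregistrer');
-- ...further INSERT batches as generated by DBGhost's data-scripting feature.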
The "build from scripts" and "sync" processes both support runnning ad-hoc scripts before and after the process. We added the static data scripts as ad-hoc scripts to run after the build/sync.
DBGhost also supports data synchronization as part of its sync process, which can be configured to synchronize the data in selected tables. Using this technique, your build process adds the data via the scripts and the sync process then keeps the data for those tables in sync automatically, so you would not need to change the scripts as we did.
Would you be able to take a look at SQL Source Control? We've just added static data support and are looking for feedback prior to the full release.
http://www.red-gate.com/MessageBoard/viewtopic.php?t=12298
Would you be able to explain why you're moving away from a central database development model?
DBG is not really designed for moving massive amounts of data
That's from an email received from Innovartis regarding the same question as yours. You've probably found this out by now though!
Maybe when you asked this they didn't have an evaluation version, though I'm not sure that's true. The only way you will know is to test it and see how it works.
http://www.innovartis.co.uk/evaluation.aspx

What strategy to migrate data from a spreadsheet to an RDBMS?

This is linked to my other question, when to move from a spreadsheet to an RDBMS.
Having decided to move from an Excel workbook to an RDBMS, here is what I propose to do.
The existing data is loosely structured across two sheets in a workbook. The first sheet contains the main records; the second sheet holds additional data.
My target DBMS is MySQL, but I'm open to suggestions.
Define the RDBMS schema.
Define, say, web services to interface with the database so the same interface can be used for both the UI and the migration.
Define a migration script to:
read each group of affiliated rows from the spreadsheet,
apply validation/constraints,
write to the RDBMS using the web service.
Define macros/functions/modules in the spreadsheet to enforce validation where possible. This will allow use of the existing system while the new one comes up; at the same time, (I hope) it will reduce migration failures when the move is eventually made.
What strategy would you follow?
There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema", but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigorously defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
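For instance, the duplicate-key check can be as simple as this (the staging table and column names are placeholders):
-- Candidate-key values that appear more than once in the first sheet's staging table.
SELECT candidate_key, COUNT(*) AS occurrences
FROM staging_sheet1
GROUP BY candidate_key
HAVING COUNT(*) > 1;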
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this. Use it (rather than doing Save As...). If the spreadsheets contain any free text at all the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMS tools have a facility to import data from CSV files. PostgreSQL and MySQL are the obvious options for an NGO (I presume cost is a consideration), but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables which lets us define a table whose data is held in a CSV file, removing the need for staging tables.
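As a sketch, loading one of the exported CSVs into a MySQL staging table might look like this (the file path, table name, and separator are assumptions based on the suggestions above):
-- Load the exported sheet into a loosely typed staging table.
LOAD DATA LOCAL INFILE '/tmp/sheet1.csv'
INTO TABLE staging_sheet1
FIELDS TERMINATED BY '^^~'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row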
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.
Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a Groovy script which reads the .xls (using the Apache POI library), validates the data, and saves it in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited to scripts because they're concise and extremely flexible, while things like performance and code-base scalability are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into, in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results using a visual data transfer tool like Kettle: you just define the .xls as your source and the database table as the target, add some validation/filter rules if needed, and trigger the loading process. Quite painless.
If you'd rather use a tool than roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, select the tables into a sheet, then edit or insert the records and mark them for the appropriate action (e.g. update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer: I'm a co-founder.
Hope that helps!
You might be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing these files directly into tables.
This could allow you to skip writing web service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.
