Multiple Tables Load from Snowflake to Snowflake using ADF - snowflake-cloud-data-platform

I have source tables in Snowflake and Destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load the data for all tables using a single pipeline.
E.g.: Suppose I have 40 tables in the source; I need a single pipeline that loads all 40 tables' data into the destination tables at once.
Can anyone help me achieve this?
Thanks,
P.

This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring process, but it can be tricky to manage the actual copying and maneuvering of data in Snowflake. My high-level recommendation is to write your logic and loading code in Snowflake stored procedures.
You can then use ADF to orchestrate by simply calling those stored procedures. You get the benefits of using ADF for what it is good at, and you allow Snowflake to do the heavy lifting, which is what it is good at.
Ideally you'd be able to parameterize the procedures so that you have one procedure (or a few) that takes a table name and dynamically figures out column names and the like to run your loading process.
Assorted Notes on implementation.
ADF does have a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector. You don't need to do this. Use the native connector and the AutoResolve integration runtime, and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
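As a rough sketch (the control table and its columns are hypothetical), the lookup query could look something like:
-- Hypothetical control table listing the tables to process and their parameters
SELECT table_name,
       primary_key_column,
       order_by_column,
       procedure_name
FROM   etl_control.table_load_config   -- assumed control table; adapt to your setup
WHERE  is_enabled = TRUE;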
ForEach loops are a little limited in that there are some things you can't nest inside a loop (like conditionals). If you need extra functionality, you can have the ForEach loop call a child ADF pipeline (passing in those parameters) and have the child pipeline manage your table-processing logic.
Snowflake has pretty good options for querying metadata based on a table name. See INFORMATION_SCHEMA. Between that and just a tiny bit of JavaScript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided table name).
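For example, something along these lines (database, schema, and table names are placeholders) returns a ready-made column list that a procedure can splice into a dynamic statement:
-- Build a comma-separated column list for a given table
SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position) AS column_list
FROM   my_database.information_schema.columns
WHERE  table_schema = 'MY_SCHEMA'
AND    table_name   = 'MY_TABLE';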
If you do want to use ADF's Copy activities, I think you'll need to set up an intermediary Azure Storage account connection. I believe this is because ADF uses COPY INTO under the hood, which requires external storage.
ADF doesn't have many good options for preventing the same pipeline from running multiple times at once. Either make sure your code can handle that edge case, or make sure your scheduling/timeouts won't allow a long-running pipeline to overlap with the next run.
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )

Related

Which one is better, iterate and sort data in backend or let the database handle it?

I'm trying to design a database schema for a Django REST Framework web application.
At some point, I have two choices:
1- Choose a schema in which, in one or several APIs, I have to get a queryset from the database and iterate and order it with Python. (For example, I can store some data in an array-typed column, get it from the database, and sort it with Python.)
2- Store the data in another table and insert a fairly large number of rows with each insert. This way, I can get the data in my preferred format with far fewer lines of ORM code.
I tried some basic tests and benchmarking to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means of setting up a more realistic situation, and here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests from other APIs and clients each second?
Is the database (and ORM) usually faster and more reliable than the backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system and so it has access to more resources.
Databases are designed to handle large data sets, so they are not limited to what fits in the memory of a single application process.
When this question comes up, often more data needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something (see the sketch after this list).
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
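To illustrate the top-10 point with a hypothetical orders table: sorting and limiting in the database returns ten rows, whereas doing it in the application means transferring every row first.
-- Let the database sort and limit: only ten rows are returned to the application
SELECT customer_id, total_amount
FROM   orders
ORDER  BY total_amount DESC
LIMIT  10;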
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.

Snowflake transformations

We will have flat files loaded into Snowflake tables in a staging schema from AWS S3. Now we need to perform simple transformations like aggregations, mapping, calculations, etc. I know we could use Informatica or other tools, but our transformations really aren't big enough to justify a third-party tool.
We have to load the flat files as-is from AWS S3 to Snowflake, so we can't use transformations in the COPY command.
What is the easiest way, and what are the best practices, for doing simple, basic transformations in Snowflake?
Thanks
If you are loading CSV files, you can also apply some very simple transformations during your COPY command. According to the docs, the supported transformations are column reordering, column omission, and casts using a SELECT statement.
See here: https://docs.snowflake.com/en/user-guide/data-load-transform.html
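As a rough illustration (stage, table, and column positions are placeholders), such a COPY can reorder, omit, and cast columns like this:
-- Reorder, omit, and cast columns while loading from a stage
COPY INTO staging.orders (customer_id, order_date, amount)
FROM (
    SELECT $2::NUMBER,        -- reorder: take the second file column first
           $1::DATE,          -- cast a string column to DATE
           $4::NUMBER(10,2)   -- omit the third file column entirely
    FROM @my_stage/orders/
)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);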
If you want to do simple aggregations, mappings and/or calculations, I can recommend two ways:
Using Views: https://docs.snowflake.com/en/user-guide/views-introduction.html
Using Stored Procedures: https://docs.snowflake.com/en/sql-reference/stored-procedures.html
The easiest way is to develop both (views and stored procedures) directly in the Snowflake web UI.
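For instance, a simple aggregation exposed as a view might look like this (table and column names are assumptions):
-- Aggregate staged rows into a reporting-friendly shape
CREATE OR REPLACE VIEW analytics.daily_sales AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   staging.sales_raw
GROUP  BY order_date, region;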
Take a look at Snowflake's Tasks and Streams. These would allow you to move incremental data from stage to target tables automatically every time a new set of files are loaded. Might be useful for you.
Tasks: https://docs.snowflake.com/en/user-guide/tasks-intro.html
Streams: https://docs.snowflake.com/en/user-guide/streams.html
(Links are to Snowflake documentation and won't "go dead")
General context is that a stream allows you to see what has changed in a table. A task can then monitor the stream and, when there are changes, execute a MERGE or a stored procedure to take that incremental data, transform it, and load it into a completed target table. All fully automated. Might be useful in the simple transformation scenario you have described.
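A minimal sketch of that pattern, with placeholder object names, could look like this:
-- Track new rows arriving in the staging table
CREATE OR REPLACE STREAM staging.sales_raw_stream ON TABLE staging.sales_raw;
-- Every 15 minutes, merge any new rows into the target table
CREATE OR REPLACE TASK staging.load_sales_task
  WAREHOUSE = my_wh                       -- placeholder warehouse
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STAGING.SALES_RAW_STREAM')
AS
  MERGE INTO target.sales t
  USING staging.sales_raw_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);
-- Tasks are created suspended; resume to start the schedule
ALTER TASK staging.load_sales_task RESUME;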

How to get a list of files names from Snowflake external S3 stage?

I am looking for the best way to automatically detect new files in a S3 bucket and then to load the data into a Snowflake table.
I know this can be achieved using Snowpipe with SNS/SQS notifications set up in AWS, but I would like to have a self-contained solution within Snowflake which can be used for multiple data sources.
I want to have a table which is updated with the file names from a S3 bucket and then to load files which have not already been loaded from S3 into Snowflake.
The only way I have found to automatically detect new files from an external S3 stage in Snowflake so far is to use the code below and a task on a set schedule. This lists the file names and then uses result_scan to display the last query as a table.
list @STAGE_NAME;
set qid = last_query_id();
select "name" from table(result_scan($qid));
Does anyone know a better way to automatically detect new files in an external stage from Snowflake? Any help is much appreciated.
Not necessarily better than the way you've already found, but there is an alternative approach to listing the files in an S3 bucket.
If you create an EXTERNAL TABLE over the data in S3, you can then use the METADATA$FILENAME property in a query. If you have a record of which files have already been loaded into Snowflake then you can compare and select the names of the new files and process them.
e.g.
ALTER EXTERNAL TABLE MYSCHEMA.MYEXTERNALTABLE REFRESH;
SELECT DISTINCT
METADATA$FILENAME as filename
FROM
MYSCHEMA.MYEXTERNALTABLE;
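If you keep a record of loaded files in a small control table (the LOAD_HISTORY table below is hypothetical), the comparison can be a simple anti-join:
-- Files present in the stage that have not been loaded yet
SELECT DISTINCT METADATA$FILENAME AS filename
FROM   MYSCHEMA.MYEXTERNALTABLE
WHERE  METADATA$FILENAME NOT IN (
    SELECT filename FROM MYSCHEMA.LOAD_HISTORY   -- hypothetical tracking table
);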
Short Run:
Your approach
You've already found a viable solution, and your concern about the reliability of the last query id function is understandable. Procedures' sessions are isolated and so the last_query_id() function will be isolated to only the statements executed within that procedure. It might be unnecessary to use a procedure, but I personally like that they let you create reusable abstractions.
Another approach
An alternative, if you don't like the approach you're using, would be to create a single table with a single VARIANT data column plus the stage metadata columns, maintained by a single giant pipe. You could then maintain a set of materialized views over that table which filter, convert variant fields to columns, and sanitize as appropriate (see the sketch after the list below).
There are some benefits:
simpler: integrating new prefixes for a stage requires only an additional materialized view, not an additional pipe + task
more control: you'd be able to operate directly and automatically on the data in raw form, rather than needing to load into a table and then check it. This means you can perform data quality checks, metadata checks, and sanitization.
maintainable: the use of materialized views over an immutable source means you can at any time change the logic and perform a full backfill with little effort.
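A minimal sketch of that pattern, with placeholder names (note that materialized views require Enterprise Edition; ordinary views work similarly otherwise):
-- Landing table: one VARIANT column plus stage metadata
CREATE OR REPLACE TABLE raw.landing (
    v         VARIANT,
    filename  STRING
);
-- One pipe keeps the landing table current (stage name is a placeholder)
CREATE OR REPLACE PIPE raw.landing_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw.landing (v, filename)
  FROM (SELECT $1, METADATA$FILENAME FROM @raw.my_s3_stage)
  FILE_FORMAT = (TYPE = 'JSON');
-- One materialized view per prefix projects variant fields into typed columns
CREATE OR REPLACE MATERIALIZED VIEW curated.orders AS
  SELECT v:order_id::NUMBER       AS order_id,
         v:amount::NUMBER(10,2)   AS amount,
         filename
  FROM   raw.landing
  WHERE  filename LIKE 'orders/%';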
Long Run:
Notification integrations enable Snowflake to listen (and possibly notify in the future, roadmap-gods willing) to external messaging systems. At the moment only Azure is supported, so it won't work for your case, but keep an eye out over the next few months -- I think it's safe to speculate that we will see this feature grow to support AWS, and a more direct and concise way to implement your original solution will eventually become available.

Multiple tables vs one big table with JSON serialized data

Here is my situation,
I have an application in which I need to store information about the results of different tests made on blood samples. I am currently using ASP.Net core for the web application and SQL Server for the database. (Might switch to Postgres as I will surely host on Linux and SQL Server for Linux is not totally available yet)
All the tests have some information in common, who performed it, at what time, any other related information for tracking purposes. But then all of them also have specific variables that I need to save for reporting/further calculations.
As of now I have about 20 different types of tests we perform on the samples we receive. The question I have is what would be the best way to save that data?
The two options I see are the following:
Have 20 different tables, all containing the general sample tracking info plus the specific test variables. This way, when I need to fetch the info, everything for a specific type of test is easily accessible. But then I need to query all these tables with join queries whenever I want to generate a report or modify sample results information (as all the test results/variables entry forms are on a single page). There are very few moments where I need to query only a specific type of test; most of the time I need to retrieve them all at once, which means that I will (almost) always query the 20+ tables every time I need to access sample data.
Have one big table containing all the results for the different tests performed and serialize (JSON format) only the specific test variables. So I would have all tracking information available (queryable, searchable, etc....) but the variables and results of each test would be in a single serialized column.
It is important to know that the variables/results won't be queried directly, I don't need to filter by them or anything like that (yet at the very least).
Now I wonder what would give me the best performance in the long term between using the multiple tables with join queries vs using serialization/deserialization that needs to take place whenever I access the data.
Also, I am aware that by serializing the test results/variables, I lose the ability to query by the information they contain (except in SQL Server 2016, which now includes a way to query JSON data, if I'm not mistaken).
I also try to follow best practices by normalizing the database, but I'm not a pro and I don't know which of my two options would be the best approach (or any other option, if there is a better alternative; I'm totally open to better ideas).
So what would be the best approach and why?
Usage estimate
There might be around 15 to 30 million tests performed every year, of which I would say two thirds would be of 5 different blood tests, and the other third would be all the other tests performed.
A different table for each test type is a good idea to work with.
Reason 1: If only some of the tests are performed on a sample, the remaining columns would just waste DB space.
Reason 2: Creating reports per sample will be easier in the future.
Reason 3: Filtering of data will be easy.
Reason 4: Maintenance will be easy.
If every test is mandatory for every sample, go with 1 table.
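As a rough sketch of the two options in SQL Server DDL (all names and columns are illustrative):
-- Option 1: one table per test type, each repeating the common tracking columns
CREATE TABLE dbo.GlucoseTest (
    SampleId     INT            NOT NULL,
    PerformedBy  NVARCHAR(100)  NOT NULL,
    PerformedAt  DATETIME2      NOT NULL,
    GlucoseLevel DECIMAL(6,2)   NOT NULL
);
-- Option 2: one table for all tests, specific variables serialized as JSON
CREATE TABLE dbo.TestResult (
    SampleId     INT            NOT NULL,
    TestType     NVARCHAR(50)   NOT NULL,
    PerformedBy  NVARCHAR(100)  NOT NULL,
    PerformedAt  DATETIME2      NOT NULL,
    ResultJson   NVARCHAR(MAX)  NOT NULL   -- queryable with JSON_VALUE on SQL Server 2016+
);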

What strategy to migrate data from a spreadsheet to an RDBMS?

This is linked to my other question when to move from a spreadsheet to RDBMS
Having decided to move to an RDBMS from an excel book, here is what I propose to do.
The existing data is loosely structured across two sheets in a workbook. The first sheet contains the main records. The second sheet holds additional data.
My target DBMS is MySQL, but I'm open to suggestions.
Define RDBMS schema
Define, say, web services to interface with the database, so the same interface can be used for both the UI and the migration.
Define a migration script to
Read each group of affiliated rows from the spreadsheet
Apply validation/constraints
Write to RDBMS using the web-service
Define macros/functions/modules in the spreadsheet to enforce validation where possible. This will allow use of the existing system while the new one comes up. At the same time, I hope it will reduce migration failures when the move is eventually made.
What strategy would you follow?
There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema", but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigorously defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings, basically) so that it is easy to load the spreadsheets' data; a sketch of this follows the list below. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
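A loosely typed staging table and a couple of those quality checks might look like this (MySQL syntax, placeholder names):
-- Loosely typed staging tables: everything as text, load first, clean later
CREATE TABLE staging_main (
    record_key  VARCHAR(255),
    col_a       VARCHAR(1000),
    col_b       VARCHAR(1000)
);
CREATE TABLE staging_detail (
    record_key  VARCHAR(255),
    detail_a    VARCHAR(1000)
);
-- How many duplicate keys in the main sheet?
SELECT record_key, COUNT(*) AS occurrences
FROM   staging_main
GROUP  BY record_key
HAVING COUNT(*) > 1;
-- Rows in the second sheet with no parent record in the first
SELECT d.record_key
FROM   staging_detail d
LEFT   JOIN staging_main m ON m.record_key = d.record_key
WHERE  m.record_key IS NULL;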
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this. Use it (rather than doing Save As...). If the spreadsheets contain any free text at all the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMS tools have a facility to import data from CSV files. Postgresql and Mysql are the obvious options for an NGO (I presume cost is a consideration) but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables which allow us to define a table where the data is held in a CSV file, removing the need for staging tables.
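For MySQL, importing the CSV export with the safe separator suggested above could be a single LOAD DATA statement (file path, table, and columns are placeholders):
-- Load the exported CSV into a staging table, using the ^^~ separator
LOAD DATA LOCAL INFILE '/tmp/main_sheet.csv'
INTO TABLE staging_main
FIELDS TERMINATED BY '^^~'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(record_key, col_a, col_b);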
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.
Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a Groovy script which reads the .xls (using the Apache POI library), validates it, and saves the data in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited for scripts because they're concise and extremely flexible while things like performance, code base scalability and such are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results with a visual data transfer tool like Kettle: you just define the .xls as your source and the database table as your target, add some validation/filter rules if needed, and trigger the loading process. Quite painless.
If you'd rather use a tool than roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, select the tables into a Sheet, then edit or insert the records and mark them for the appropriate action (e.g., update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer: I'm a co-founder.
Hope that helps!
You might be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing these files directly into tables.
This could allow you to skip writing web-service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.
