I am looking for a way in Informatica to pull data from a table in a database, load it in Snowflake, and then move on to the next table in that same DB and repeating that for the remaining tables in the database.
We currently have this set up running in Matillion where there is an orchestration that grabs all of the names of a table of a database, and then loops through each of the tables in that database to send the data into Snowflake.
My team and I have tried to ask Informatica Global Support, but they have not been very helpful for us to figure out how to accomplish this. They have suggested things like Dynamic Mapping, which I do not think will work for our particular case since we are in essence trying to get data from one database to a Snowflake database and do not need to do any other transformations.
Please let me know if any additional clarification is needed.
Dynamic Mapping Task is your answer. You create one mapping. With, or without any transformations - as you need. Then you set up Dynamic Mapping Task to execute the mapping across whole set of your 60+ different sources and targets.
Please note that this is available as part of Cloud Data Integration module of IICS. It's not available in PowerCenter.
Related
I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS/S3 onto snowflake and then perform some modeling by DataRobot
We have some tables that contain PII data and we would like to hide those columns from Datarobot, what suggestion do you have for this problem?
The schema in AWS needs to match the schema in Snowflake for the copying process.
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
As for hiring the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use 2 different roles, one would be say data_owner(the role that will create the table and load the data in it) and another say data_modelling (for using data robot)
Create masking policies using the data owner such that the data robot cannot see the column data.
About your question on copying the data, there is no requirement that AWS S3 folder need to be in sync with Snowflake. you can create the external stage with any name and point it to any S3 folder.
Snowflake documentation has good example which helps to get some hands on :
https://docs.snowflake.com/en/user-guide/data-load-s3.html
good Day.
I need help. I want to transfer the data in Snowflake from Staging tables to Fact tables automatically, when data is available in Stage table. While moving data from Staging table to Fact tables, I have couple of Custom validations on each column and row.
Any idea how to do this in Snowflake.
If any one knows could you please suggest me...!
Thanks in Advance...!
There are many ways to do this and how you go about it depends on what tools you have available. The simplest way to do this without using tools outside of the Snowflake ecosystem would be:
On each of the staging tables you have, set up a stream on these tables (here is the Snowflake documentation on streams)
Create a task that runs on a schedule (here is the Snowflake doc on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. Here is some more documentation on building SCD type 2 dimensions also written by someone at Snowflake
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.
I have to connect multiple tables that are part of single or multiple databases. Approximately 10-15 tables in each query have to be connected to generate data for the analysis in SQL Server 2014.
I don't have access to the database diagram or architecture and these reports are to be sent out weekly. I want to understand the approach on how to begin writing these kind of queries which are of basic and advanced level and identify the relationship between tables and what kind of advanced level queries I can learn or utilize like CTE, Rank Partition, Subqueries etc.
Anybody who can provide a rough flow diagram or structure about the approach will be really helpful.
It's very unlikely that owners of those source systems want to be directly queried every time someone runs a report. Since you already have access to SQL Server, I would suggest building a data warehouse with that.
You haven't provided a whole lot of information to go on, but SSIS packages could be created to connect to the source systems and load into your data warehouse. And furthermore, those packages can be scheduled through Agent.
As for modeling... Again it is difficult with the lack of information, but generally the star model works great for reporting, which is a fact table surrounded by dimension (or attribute) tables.
As for figuring out relationships without a diagram, this will have to be done via experimentation and tieing to existing reports to make sure your joins aren't dropping records or cascading.
Good luck.
I have more than 200 MSSQL tables, and want to transfer data to Azure Data Lake Storage.
One approach I consider is to use SSIS with dynamic data flows, i.e create table name variable and do foreach loop over table names and for each table run dataflow. However this approach seems wrong ,though files are created in Data Lake storage with correct schemes data is not transferred due to wrong mappings.
Is there any generic way to create one dynamic data flow and transfer huge number of table's data?
The scenario you are describing can now be achieved in ADF V2 - V2 added a set of rich control flow enhancements including the lookup activity, parameter passing, and foreach looping. You can see a tutorial of how to accomplish this here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy
Is there any handy tool that can make updating tables easier? Usually I got an Excel file with the original value in one column and new value in another column. Then I write a formula in Excel to create the 'update' statement. Is there any way to simplify the updating task?
I believe the approach in SQL server 2000 and 2005 would be different, so could we discuss them both? Thanks.
In addition, these updates usually request by "non-programmer" (which means they don't understand SQL, so it may not feasible to let them do query), is there any tool that can let them update the table directly without having DBAs do this task? Also, that tool needs to limit the privilege to only modify certain tables. And better has a way rollback the change.
Create a DTS package that will import a csv file, make the updates and then archives the file. The user can drop the file in a specific folder designated for the task or this can be done by an ops person. Schedule the DTS to run every hour, day, etc.
In case your users would insist that they keep using Excel, you've got several different possibilities of getting the data transferred to SQL Server. My preferred one would be to use DTS/SSIS, as mentioned by buckbova.
However, another method is by using OPENROWSET(), which makes it possible to query your Excel file as if it was a table. I wrote a small article about it here: http://blog.hoegaerden.be/2010/03/29/retrieving-data-from-excel/
Another approach that hasn't been mentioned yet (I'm not a big fan of letting regular users edit data directly in the DB), any possibility of creating a small custom application for them?
There you go, a couple more possible solutions :-)
Valentino.
I think the best approach is to expose a view on your data accessible to users who are allowed to do updates, and set up triggers on the view to perform the actual updates on the underlying data. Restrict change to only the columns they should be changing.
This technique can work on SQL Server 2000 and 2005.
I would add audit triggers on the underlying tables so you can always track changes.
You'll have complete control, and they can connect to it with Access or whatever and perform their maintenance.
You could create some accounts in SQL Server for these users and limit their access to only certain tables and columns along with onlu select / update / insert privileges. Then you could create an access database with linked tables to these.