Tableau incremental refresh from Snowflake - snowflake-cloud-data-platform

I have a question regarding incremental refresh from Snowflake to Tableau. I know the feature for Incremental refresh/Incremental extract is available in Tableau but can it be used for incremental loads from Snowflake? And how does it work?
The reason for me asking is because I know that query folding which other BI-tools on the market uses for incremental refreshes, isn't possible in Snowflake.
Thanks!
/P

Tableau incremental refreshes work the same for Snowflake as it does for other databases.
"Query Folding" looks like a Microsoft (and specifically PowerBI) term. According to this article https://exceleratorbi.com.au/how-query-folding-works/ "query folding" is the process of pushing the work load down to the database, which is what Tableau does when querying Snowflake tables directly.
With Snowflake I would recommend querying the tables directly as they are already setup in columnar format, and you can avoid moving the data to a Tableau Server and waiting on refreshes. Snowflake has unlimited storage whereas you might be limited by your Tableau Server.
If you need the tables in Snowflake to only show data as of a point in time, there are different ways you could accomplish this including:
Preset date filters (or parameters as filter within Tableau) that are pushed down to Snowflake
Using Tasks in Snowflake to run at a specific time to:
Clone your tables, and use the clones for reporting
Update existing reporting tables

I agree with Chris' answer accept for avoiding the extracts on Tableau Server. There can be a lot of performance gains had by using Tableau to extract the data. We run extracts out of Snowflake for most of our data sources. We also test both live connections and extracts for each to see which performs best. If timing is an issue, extracts can be set to refresh every 15 minutes at the most.
To get extracts loaded and refreshing use the following steps.
Switch your data source to an extract in Tableau Desktop
This will create a local copy of the data to be used to publish next.
Select Server/Publish Workbook
In the Publish settings, choose your refresh schedule and publish to Tableau Server. The workbook and data source will be loaded to Server.
You can also update the refresh schedules directly in Server by navigating to the new data source and going to the Extract Refreshes tab.
If you don't have the correct schedule available, you can create one in the Schedules menu for the site.

Related

Looping Through Tables in a DB in Informatica

I am looking for a way in Informatica to pull data from a table in a database, load it in Snowflake, and then move on to the next table in that same DB and repeating that for the remaining tables in the database.
We currently have this set up running in Matillion where there is an orchestration that grabs all of the names of a table of a database, and then loops through each of the tables in that database to send the data into Snowflake.
My team and I have tried to ask Informatica Global Support, but they have not been very helpful for us to figure out how to accomplish this. They have suggested things like Dynamic Mapping, which I do not think will work for our particular case since we are in essence trying to get data from one database to a Snowflake database and do not need to do any other transformations.
Please let me know if any additional clarification is needed.
Dynamic Mapping Task is your answer. You create one mapping. With, or without any transformations - as you need. Then you set up Dynamic Mapping Task to execute the mapping across whole set of your 60+ different sources and targets.
Please note that this is available as part of Cloud Data Integration module of IICS. It's not available in PowerCenter.

How to push data from a on-premises database to tableau crm

We have an on-premises oracle database installed on a server. We have to create some Charts/Dashboards with Tableau CRM on those data on-premises. Note that, tableau CRM is not Tableau Online, it is a Tableau version for the Salesforce ecosystem.
Tableau CRM has APIs, so we can push data to it or can upload CSV programmatically to it.
So, what can be done are,
Run a nodeJS app on the on-premise server, pull data from Oracle DB, and then push to Tableau CRM via the TCRM API.
Run a nodeJS app on the on-premise server, pull data from Oracle DB, create CSV, push the CSV via TCRM API
I have tested with the 2nd option and it is working fine.
But, you all know, it is not efficient. Because I have to run a cronJob and schedule the process multiple times in a day. I have to query the full table all the time.
I am looking for a better approach. Any other tools/technology you know to have a smooth sync process?
Thanks
The second method you described in the questions is a good solution. However, you can optimize it a bit.
I have to query the full table all the time.
This is can be avoided. If you take a look at the documentation of SObject InsightsExternalData you can see that it has a field by name Operation which takes one of these values Append, Delete, Overwrite, Upsert
what you will have to do is when you push data to Tableau CRM you can use the Append operator and push the records that don't exist in TCRM. That way you only query the delta records from your database. This reduces the size of the CSV you will have to push and since the size is less it takes less time to get uploaded into TCRM.
However, to implement this solution you need two things on the database side.
A unique identifier that uniquely identifies every record in the database
A DateTime field
Once you have these two, you have to write a query that sorts all the records in ascending order of the DateTime field and take only the files that fall below the last UniqueId you pushed into TCRM. That way your result set only contains delta records that you don't have on TCRM. After that you can use the same pipeline you built to push data.

Extracting Data from SAP to SQL Server

I am using SSIS packages to extract data from SAP database tables into SQL Server tables. I am using OLEDB source/destination connections to achieve this.
The problem now is that a table in SAP has 5 Million records and its taking around 2 hours to extract this data into my SQL Server table. I have used the trunc-dump method (truncating the table in sql server and dumping data into it from SAP table) and also tried using Multiple Hash key to bring in the updated/new records.
The problem with Hash key is that it still has to scan the entire table to look for changed/new records and hence takes almost the same time as the trunc-dump method.
I am looking for a new way or changing the existing way to reduce the time taken to complete this extraction.
As you mentioned you were using OLEDB source connection to access SAP, if that means you were accessing SAP's underlying database directly, you should pause doing that for three reasons till there are explicit IT approvals:
You skipped SAP's application layer security. There can be an enterprise security compliance issue;
Your company's SAP license may not allow you to do that. If your company only has SAP indirect access license, then you may have to stay on application layer;
You will not get SAP's official support by accessing the underlying database directly.
You have multiple options to fetch data using SSIS through SAP application layer:
Use commercial SSIS custom components for this job (disclaimer: AecorSoft is one of the leading vendors offering such connectivity components);
Look into SAP's own OData Gateway interface to consume data.
Request your SAP ABAP team to write custom ABAP programs to dump SAP data into CSV files, and then use SSIS to fetch them.
Let's now look at the performance side:
SAP ETL Performance depends on many factors, but in general, even for the SAP transactional tables with 100+ columns, it's considered very slow to extract 5 millions rows per a couple of hours. For example, we've seen cases of extracting standard SAP General Ledger header table BKPF (almost 100 columns) at consistent performance of 1M rows every 1-2 minutes. Of course such performance is achieved through commercial component and SSIS, but you should expect at least 1M per 10 minutes even for the #3 option above, going through an intermediate CSV file. Under the hood, through SAP application layer, all the 3 options would leverage SAP Open SQL (in contrast to the "Native SQL" which the underlying database offers) to access SAP tables, therefore, if you experience application layer performance issue, you can analyze the Open SQL side.
As you also mentioned about update/new records scenario, it's a typical delta extraction problem. Normally, in SAP transactional tables, there are Create Date and Changed Date fields which can help you capture delta. In this case, in order to avoid full table scan, apply indices through SAP application layer on those "delta fields". For example, if you need to extract Sales Document Header VBAK table, you can filter by ERDAT (Created on) and AEDAT (Changed on). Delta is a complex subject in SAP. There is no simple statement to describe the delta solution, as SAP data models are complex and very different across functional modules. The delta analysis is always a case-by-case effort. Some people may also simply recommend using "delta extractors", but don't treat that as silver bullet, because extractor has its own problem. In short, if you look into table based extraction, focus on that, and try to work with your SAP functional team to determine the suitable delta fields. Try avoiding doing full table scan and hashing. Do incremental load with some optional overlap of previous extract (e.g. loading today and yesterday's records), and do MERGE to absorb the changes.
There are few cases you may not be able to find any delta field, and it is not practical to do full load all the time. One great example is the Address Master data table ADRC. In this case, if you are required to do delta load on such table, you ether have to request your SAP function team to figure out delta for you (meaning they inject custom logic to every place where Address master can be created, updated, or deleted), or you have to request your SAP Basis team to create DB trigger on the underlying database table, and expose the trigger table at application layer. This way, you can create an application layer view on the main table and the trigger table to do delta. Still, there is no direct database access through your solution. The DB layer trigger is fully managed and controlled by your SAP Basis team who also supports the database.
Hope this helps!

AWS Glue: SQL Server multiple partitioned databases ETL into Redshift

Our team is trying to create an ETL into Redshift to be our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned out our database into 40+ datasources. We are looking for a way to be able to pipe the data from all of these identical data sources into 1 Redshift DB.
Looking at AWS Glue it doesn't seem possible to achieve this. Since they open up the job script to be edited by developers, I was wondering if anyone else has had experience with looping through multiple databases and transfering the same table into a single data warehouse. We are trying to prevent ourselves from having to create a job for each database... Unless we can programmatically loop through and create multiple jobs for each database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to redshift, but it doesn't seem like it would work for the multiple partitioned datasource issue as well.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star-schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
set up and configure DMS instance
set up target endpoint for redshift
set up source endpoints for each sql server instance see
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
set up a task for each sql server source, you can specify the tables
to copy/synchronise and you can use a transformation to specify
which schema name(s) on redshift you want to write to.
You will then have all of the data in identical schemas on redshift.
If you want to query all those together, you can do that by wither running some transformation code inside redsshift to combine and make new tables. Or you may be able to use views.

Tools to update tables in SQL server 2000/2005

Is there any handy tool that can make updating tables easier? Usually I got an Excel file with the original value in one column and new value in another column. Then I write a formula in Excel to create the 'update' statement. Is there any way to simplify the updating task?
I believe the approach in SQL server 2000 and 2005 would be different, so could we discuss them both? Thanks.
In addition, these updates usually request by "non-programmer" (which means they don't understand SQL, so it may not feasible to let them do query), is there any tool that can let them update the table directly without having DBAs do this task? Also, that tool needs to limit the privilege to only modify certain tables. And better has a way rollback the change.
Create a DTS package that will import a csv file, make the updates and then archives the file. The user can drop the file in a specific folder designated for the task or this can be done by an ops person. Schedule the DTS to run every hour, day, etc.
In case your users would insist that they keep using Excel, you've got several different possibilities of getting the data transferred to SQL Server. My preferred one would be to use DTS/SSIS, as mentioned by buckbova.
However, another method is by using OPENROWSET(), which makes it possible to query your Excel file as if it was a table. I wrote a small article about it here: http://blog.hoegaerden.be/2010/03/29/retrieving-data-from-excel/
Another approach that hasn't been mentioned yet (I'm not a big fan of letting regular users edit data directly in the DB), any possibility of creating a small custom application for them?
There you go, a couple more possible solutions :-)
Valentino.
I think the best approach is to expose a view on your data accessible to users who are allowed to do updates, and set up triggers on the view to perform the actual updates on the underlying data. Restrict change to only the columns they should be changing.
This technique can work on SQL Server 2000 and 2005.
I would add audit triggers on the underlying tables so you can always track changes.
You'll have complete control, and they can connect to it with Access or whatever and perform their maintenance.
You could create some accounts in SQL Server for these users and limit their access to only certain tables and columns along with onlu select / update / insert privileges. Then you could create an access database with linked tables to these.

Resources