Power BI Incremental Refresh with Snowflake - snowflake-cloud-data-platform

Has anyone successfully used the PBI incremental refresh with Snowflake as a data source? A full refresh of my dataset (without incremental refresh) takes approximately 20 minutes, but with incremental refresh turned on, the data refresh times out because it takes longer than 120 minutes. When looking at the query history in Snowflake, it looks like a 'SELECT *' query is being done again and again until it times out.
I've seen some posts that say 'query folding' is not supported by Snowflake, while others say it's partially supported.
Any clarity would be appreciated!

We also tried several options to check whether incremental refresh could be enabled for the Snowflake and Power BI combination. We used two things to verify the details:
The query history in Snowflake, for the queries sent from Power BI
The diagnostics feature in Power BI Desktop, which shows whether a source query was generated
Both indicated that query folding was not working, and hence neither was incremental refresh. We also explored whether Power BI Dataflows could be leveraged for incremental refresh, but this was not supported directly either.
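The query-history check described above can be sketched roughly as follows. This is only an illustration, not a Power BI or Snowflake API: it assumes the query text has already been exported from Snowflake's query history into a list of strings, and that the table is filtered on a date column (hypothetically named ORDER_DATE here). A folded incremental-refresh query should carry a range filter on that column, while an unfolded one tends to be a bare SELECT with no such filter.

```python
import re

def looks_folded(query_text: str, date_column: str) -> bool:
    """Heuristic: True if the query filters on date_column after a WHERE."""
    # Collapse whitespace and upper-case so the matching stays simple.
    text = " ".join(query_text.split()).upper()
    where_pos = text.find(" WHERE ")
    if where_pos == -1:
        return False
    # Column name (optionally quoted) followed by a range comparison.
    pattern = re.escape(date_column.upper()) + r'"?\s*(>=|<=|>|<|BETWEEN)'
    return re.search(pattern, text[where_pos:]) is not None

def summarize(history: list[str], date_column: str) -> dict[str, int]:
    """Count how many queries in the exported history look folded."""
    folded = sum(1 for q in history if looks_folded(q, date_column))
    return {"folded": folded, "unfolded": len(history) - folded}

# Hypothetical exported history: two unfiltered scans and one range query.
sample_history = [
    'select * from "SALES"',
    'select * from "SALES"',
    'select * from "SALES" where "ORDER_DATE" >= \'2020-01-01\' '
    'and "ORDER_DATE" < \'2020-02-01\'',
]
print(summarize(sample_history, "ORDER_DATE"))
```

If every query in the history comes back as unfolded, that matches the repeated 'SELECT *' pattern described in the question.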
We are also planning to try one more "long cut" that might let us implement incremental refresh:
Bring in Azure ADLS Gen2 storage between Power BI and Snowflake
Land the data that needs to be incrementally loaded in ADLS
Use Power BI Dataflows to do the incremental refresh of the Power BI datasets from ADLS
Not sure how well this will suit you. All the best.
Thanks,
Prasanth

As of Aug 2020, incremental refresh with Snowflake works in both datasets and dataflows. Verified with the query history in Snowflake.

Related

Scheduled refreshing problem on Power BI dataset with incremental refresh

I have a large dataset with incremental refresh and scheduled refresh set up.
In Power BI Desktop the refresh works pretty fast and I can publish it to the service, but the scheduled refresh throws an error:
The operation was throttled by the Power BI Premium because of insufficient memory. Please try again later.
I don't know if it's a problem with dataset size or if I implemented incremental refresh wrong.
I set up the RangeStart and RangeEnd parameters, store 3 years of data, and refresh the last 5 days.
Could it be a problem with the size of the database tables rather than with Power BI?
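To make the policy in the question concrete, here is a rough sketch of the date windows that "store 3 years, refresh the last 5 days" implies. It is a simplification under stated assumptions: the function name and return shape are made up for illustration, and the real service manages its own partitions and aligns them to whole periods, which this does not model.

```python
from datetime import date, timedelta

def policy_windows(today: date, store_years: int, refresh_days: int) -> dict:
    """Return (start, end) date pairs for the stored and refreshed ranges."""
    try:
        archive_start = today.replace(year=today.year - store_years)
    except ValueError:
        # 29 Feb has no counterpart in a non-leap year; fall back to 28 Feb.
        archive_start = today.replace(year=today.year - store_years, day=28)
    refresh_start = today - timedelta(days=refresh_days)
    return {
        # Historical rows that are kept but not re-queried on each refresh.
        "archive": (archive_start, refresh_start),
        # Rows re-queried on each refresh (the RangeStart/RangeEnd window).
        "refresh": (refresh_start, today),
    }

windows = policy_windows(date(2021, 3, 10), store_years=3, refresh_days=5)
print(windows["archive"], windows["refresh"])
```

The point of the sketch is the size difference between the two windows: a routine refresh touches only the short one, whereas any refresh that has to populate the archive range moves far more data.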

Given a query ID for a SQL query on a SQL database in Azure, is there a way to go back and trace who initiated the query?

Assuming query store is enabled, you can use query performance insights to see a query ID, CPU usage, execution time, and more. The problem I'm tasked with is attributing these queries to departments which share databases. How would you recommend tracing who initiated a query?
Query Store aggregates data across all users and does not try to give you a per-user view of what is happening. (That's not its job; it is about performance management and troubleshooting.) If you want an audit trail of who executed every query in the system, then running an XEvent (Extended Events) session is the right model: track statement-completed and login events so you can stitch together who did what when you want to link things later.
Making Query Store track per-user operations would have made it too expensive to leave on all the time in every application.
You can enable Auditing for the Azure SQL database and then check both the query that was executed and the user who ran it.
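The "stitching" of login and statement-completed events mentioned above could look something like this once the events have been exported. This is a sketch under assumptions: the dictionary keys (timestamp, name, session_id, username, statement) are a hypothetical export shape, not the actual XEvent output format, and integer timestamps stand in for real ones.

```python
def attribute_statements(events: list[dict]) -> list[tuple[str, str]]:
    """Pair each completed statement with the login that opened its session."""
    login_by_session: dict[int, str] = {}
    attributed: list[tuple[str, str]] = []
    # Process in time order so a login is seen before its statements.
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["name"] == "login":
            login_by_session[ev["session_id"]] = ev["username"]
        elif ev["name"] == "sql_statement_completed":
            user = login_by_session.get(ev["session_id"], "<unknown>")
            attributed.append((user, ev["statement"]))
    return attributed

# Hypothetical exported events; session 77 has no captured login.
sample_events = [
    {"timestamp": 2, "name": "sql_statement_completed",
     "session_id": 51, "statement": "SELECT 1"},
    {"timestamp": 1, "name": "login",
     "session_id": 51, "username": "finance_app"},
    {"timestamp": 3, "name": "sql_statement_completed",
     "session_id": 77, "statement": "SELECT 2"},
]
print(attribute_statements(sample_events))
```

Mapping usernames to departments would then be a separate lookup on top of this result.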

Power BI dealing with 16gb CSV file

I have a 16GB CSV that I have imported into Power BI desktop. The workstation I am using is an Azure VM running Windows Server 2016 (64GB Memory). The import of the file takes a few seconds, however, when I try to filter the data set in query editor to a specific date range, it takes a fairly long time (it is still running and has been around 30 minutes so far). The source file (16GB CSV) is being read from a RAM disk that has been created on the VM.
What is the best approach or practice when working with data sets of this size? Would I get better performance by importing the CSV into SQL Server and then using DirectQuery when filtering the data set to a date range? I would have thought it would run fairly quickly with my current setup, as I have 64GB of memory available on that VM.
When the data size is significant, you also need appropriate computing power to process it. When you import these rows into Power BI, Power BI itself needs this computing power. If you import the data into SQL Server (or Analysis Services, or another engine) and use DirectQuery or a Live Connection, you can delegate computations to the database engine. With a Live Connection all your modeling is done on the database engine, while with DirectQuery modeling is also done in Power BI, and you can add computed columns and measures. So if you use DirectQuery, you still must be careful about what is computed where.
You ask for "the best", which is always a bit vague. You must decide for yourself depending on many other factors. Power BI runs Analysis Services internally (when you run Power BI Desktop you can see the Microsoft SQL Server Analysis Services child process running), so importing the data into Power BI should give you performance similar to importing it into SSAS. To improve performance in this case, you need to tune your model. If you import the data into SQL Server, you need to tune the database (proper indexing and modeling).
So to reach a final decision you must test these solutions, consider pricing and hardware requirements, and decide on that basis what is best for your case.
Recently, Microsoft made a demo with 1 trillion rows of data. You may want to take a look at it. I would also recommend looking at aggregations, which could help you improve the performance of your model.

Alteryx - bulk copy from SQL Server to Greenplum - need tips to increase performance

Need advice here: using Alteryx Designer, I'm pulling a large dataset from SQL Server (10M rows) and need to move it into a Greenplum DB.
I tried both connecting with Input Data (SQL Server) and Output Data (GP), and with Connect In-DB (SQL Server) and Write Data In-DB (GP).
Either approach takes forever to complete, to the point that I have to cancel the process (to give an idea, over the weekend it ran for 18 hours and advanced no further than 1%).
Any good advice or tricks to speed up this sort of massive bulk data load would be very highly appreciated!
I can control or make modifications on SQL Server and Alteryx to increase performance, but not on Greenplum.
Thanks in advance.
Regards,
Erick
I'll break down the approaches that you're taking.
You won't be able to use the In-DB tools because the databases are different, so you can't push the processing down to the DB.
Using the standard Alteryx tools, you are bringing the whole table onto your machine and then pushing it out again. There are multiple ways to do this, depending on where your bottleneck is.
Looking first at the extract from SQL Server: 10M rows isn't that much, so you could split the process and write the extract as a yxdb file. If that fails or takes several hours, then you will need to look at the connection to the SQL Server or the resources available on it.
Then, for the push into Greenplum: there is no PostgreSQL bulk loader at present, so you can either just try to write the whole table, or write segments of the table into temp tables in Greenplum and then execute a command to combine those tables.
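The "segments into temp tables, then combine" idea can be sketched as below. This only plans the row ranges and builds the final combine statement as text; it does not connect to anything. The table names (sales, sales_stage_N) and the segment size are hypothetical, and how each segment actually gets written is left to whatever tool you use.

```python
def segment_ranges(total_rows: int, segment_size: int) -> list[tuple[int, int]]:
    """Split [0, total_rows) into half-open (start, end) row ranges."""
    return [
        (start, min(start + segment_size, total_rows))
        for start in range(0, total_rows, segment_size)
    ]

def combine_sql(target: str, staging_tables: list[str]) -> str:
    """Build one INSERT that appends every staging table to the target."""
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in staging_tables)
    return f"INSERT INTO {target}\n{selects};"

# 10M rows in 2.5M-row segments gives four staging tables.
ranges = segment_ranges(10_000_000, 2_500_000)
staging = [f"sales_stage_{i}" for i in range(len(ranges))]
print(ranges)
print(combine_sql("sales", staging))
```

Smaller segments also make it easier to see where the time goes, since a stalled segment can be retried on its own.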
We pull millions of rows daily from SQL Server into Greenplum using an open source tool called Outsourcer. It's a great tool and takes care of cleansing and more. We have been using it for the past 3.5 years with no issues so far. It takes care of all the parallelism, and millions of rows are loaded within minutes.
It supports incremental or full loads. If you need support, Jon Robert, the owner of Outsourcer, will respond to your email within minutes. Here is the link for the tool:
https://www.pivotalguru.com/

SQL Server Analysis Services still needed if using Power BI?

I have a project that requires using SQL Server Analysis Services, but we've also started looking at PowerBI.
I'm not entirely clear on how PowerBI functions, and where the computations/data storage takes place. If we use PowerBI for generating the analytics, is there still a benefit to having an Analysis Services layer?
To Use Analysis Services or Not?
It depends. If you already have an Analysis Services (SSAS) model, as Caio mentions, then I wouldn't get rid of it. Power BI works very well with Analysis Services and Analysis Services is going to offer a lot of enterprise-grade options that Power BI isn't going to improve upon (such as the ability to handle millions of new rows each day).
However, if you don't have an Analysis Services model already, SSAS isn't a prerequisite for using Power BI. As Mike mentions, Power BI is fully featured by itself and can easily handle most needs (importing data, modeling the data, and then visualizing the data).
To answer your question about computation and storage, Power BI has a number of layers:
An ETL layer (M). This is how data is brought into your model.
A modeling layer (DAX). This is where the data is stored, and where calculations run.
A visualization layer.
When you use Power BI with Analysis Services in Direct Query mode, then the ETL & data modeling side of things are handled by SSAS. All computation & data storage happens in Analysis Services and Power BI becomes a visualization layer only, sending queries to Analysis Services as needed for your reports.
When you don't have Analysis Services (and are using Data Import mode), then the data is stored in Power BI and all the computations run inside Power BI too.
Pros & Cons of Each Option
The advantage of using Power BI without SSAS is speed of delivery. Everything is handled in one file by one person. If you need to change your data model to make a report work, you can do that within Power BI. When you have an SSAS model, making changes to your data model can be cumbersome (partly because you have to use another tool and partly because any changes will affect all users).
The advantage of using Power BI with SSAS is scalability. Configured correctly, a single Analysis Services model can grow to handle hundreds of gigabytes, hundreds of reports, and hundreds of users with no issue. Analysis Services offers a level of enterprise robustness that goes beyond what you'd want a Power BI file to handle.
That said, introducing Analysis Services brings a number of disadvantages: most importantly, licensing & maintaining a SQL Server & keeping that server up-to-date. Power BI Desktop is updated monthly and is a quick download to get the latest & greatest DAX features. Using SSAS means you have to wait for new releases of SQL Server that include the same DAX features, then test & install them.
Conclusion
If you're not dealing with vast amounts of data (e.g. millions of new rows each month), one way to know whether you need the enterprise-grade features of Analysis Services is to think about the reports needed at the end of the project. If there are a dozen or fewer reports and you plan to build them all yourself, then Power BI alone offers a lot of advantages. If, on the other hand, there's a whole department of report writers waiting for you to build a data model, then Analysis Services is the way to go.
Sidenote
What's more important than Analysis Services vs. Power BI for ETL/modeling is getting your data model right. A poor data model will be slow using either tool. A well-designed data model will be fast using either option. Make sure to spend plenty of time understanding best practices when it comes to modeling your data. "Analyzing Data with Power BI and Power Pivot for Excel" by Alberto Ferrari & Marco Russo is well worth picking up if you're new to data modeling & BI in general. (Not saying you are.)
Yes, you absolutely need to keep your Analysis Services layer (and other data sources you might have). Power BI is a reporting tool and should receive data pre-aggregated as much as possible, enough to be able to plot charts, display tables, apply filters, etc. The heavy lifting is done at the data source level.
There are a number of limitations in Power BI, and you should plan for that.
For instance:
There is a 1 million row limit for returning data when using DirectQuery. This does not affect aggregations or calculations used to create the dataset returned using DirectQuery, only the rows returned. For example, you can aggregate 10 million rows with your query that runs on the data source, and accurately return the results of that aggregation to Power BI using DirectQuery as long as the data returned to Power BI is less than 1 million rows. If more than 1 million rows would be returned from DirectQuery, Power BI returns an error.
https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-use-directquery/
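The distinction in the quoted passage, rows scanned at the source versus rows returned, can be illustrated with a toy example. This is plain Python standing in for the data source, not Power BI behaviour: the function names are made up, the limit constant simply restates the figure quoted above, and the row count is scaled down so the example runs quickly.

```python
from collections import defaultdict

DIRECTQUERY_ROW_LIMIT = 1_000_000  # figure taken from the quoted passage

def aggregate_by_key(rows) -> dict:
    """Sum amounts per key, as a GROUP BY at the data source would."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

def within_limit(returned_rows: int, limit: int = DIRECTQUERY_ROW_LIMIT) -> bool:
    """True when the returned row count is below the quoted limit."""
    return returned_rows < limit

# 120,000 detail rows (scaled down from millions) collapse to 12 groups.
detail_rows = ((i % 12, 1.0) for i in range(120_000))
monthly_totals = aggregate_by_key(detail_rows)
print(len(monthly_totals), within_limit(len(monthly_totals)))
```

What counts against the limit is the size of the result that comes back, so the same source table can be fine for an aggregated visual and fail for one that asks for every detail row.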
You probably don't need a separate Analysis Services instance; that is only necessary for very large models. In the default Import mode you are limited only by a model size of 1GB for Free or Pro accounts. Thanks to effective data compression, this can be many millions of rows. A rough basis for estimation would be 50m rows in 1GB. Performance is excellent.
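As a back-of-the-envelope check, the 50m-rows-per-1GB rule of thumb above translates into a one-line estimate. The function name is made up, and the ratio is only the rough figure from this answer; real compression depends heavily on column count, data types, and cardinality.

```python
ROWS_PER_GB = 50_000_000  # rough rule of thumb from the answer above
MODEL_LIMIT_GB = 1.0      # Free/Pro model size limit from the answer above

def estimate_model_size_gb(row_count: int, rows_per_gb: int = ROWS_PER_GB) -> float:
    """Very rough imported-model size estimate in GB."""
    return row_count / rows_per_gb

for rows in (10_000_000, 50_000_000, 200_000_000):
    size = estimate_model_size_gb(rows)
    print(rows, round(size, 2), "fits" if size <= MODEL_LIMIT_GB else "too large")
```

Treat the result as a first guess only; importing a representative sample and checking the actual file size is a more reliable test.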
Power BI actually spins up an internal Analysis Services instance when a model is in use, which handles all the analytic and calculation requirements. When using Power BI Desktop this runs on your PC (you can watch it in the Task Manager). When using the web service it runs in the cloud. With Power BI Report Server it runs on an on-premises server. You can connect to any of those using Excel PivotTables, etc., just as you would with regular Analysis Services.

Resources