I have started to build a Power BI dashboard to display various information about a lot of customers.
All my information is stored in a MS SQL Server database, and my main table has 30+ million rows (growing by roughly 1 million rows each month). The data is a mix of financial and dimension-type information.
Previously I created a monthly report in PowerPoint, but I want more flexibility and have therefore started to build the same “setup” in Power BI.
My issue is that performance is relatively slow, which I think it shouldn’t be, especially if I am to distribute this to senior stakeholders in my company. Ideally I would like the numbers to “appear” in an instant, rather than having to wait 10, 15, 30 or 60 seconds for things to load every time you change a filter.
Currently I am using DirectQuery.
I have now experimented with creating a bunch of aggregated tables in my database, so that I lose a lot of detail but can load data much, much faster (even with DirectQuery), almost instantly. My idea is to create a stored procedure that fires once a month when new data is loaded, to update each of these aggregated tables with the newest data.
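For reference, a rough sketch of the kind of refresh I have in mind (table and column names are simplified placeholders, not my real schema):

    -- Rough sketch only: names are placeholders
    CREATE TABLE dbo.AggSalesMonthly (
        CustomerKey      int            NOT NULL,
        ProductLine      nvarchar(50)   NOT NULL,
        MonthStart       date           NOT NULL,
        Revenue          decimal(18, 2) NOT NULL,
        TransactionCount int            NOT NULL,
        CONSTRAINT PK_AggSalesMonthly PRIMARY KEY (CustomerKey, ProductLine, MonthStart)
    );
    GO

    CREATE PROCEDURE dbo.RefreshAggSalesMonthly
        @MonthStart date  -- first day of the month that has just been loaded
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Re-aggregate just the newly loaded month
        DELETE FROM dbo.AggSalesMonthly WHERE MonthStart = @MonthStart;

        INSERT INTO dbo.AggSalesMonthly (CustomerKey, ProductLine, MonthStart, Revenue, TransactionCount)
        SELECT f.CustomerKey,
               f.ProductLine,
               @MonthStart,
               SUM(f.Amount),
               COUNT(*)
        FROM dbo.FactTransactions AS f        -- the 30M+ row detail table
        WHERE f.TransactionDate >= @MonthStart
          AND f.TransactionDate <  DATEADD(MONTH, 1, @MonthStart)
        GROUP BY f.CustomerKey, f.ProductLine;
    END;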
However, I feel this somewhat defeats the purpose of having a BI tool, as I lose the ability to drill down and do ad hoc investigations into the underlying parameters of each metric.
I have created some relationships in Power BI that give me some drill functionality, but not nearly as detailed as I would like.
What is best practice in relation to this? Is it common to build aggregated tables, even at the expense of drill-downs and detail? Or should I consider some other approach?
I have other datasets with even more detail (hundreds of millions of rows), so I am looking for best practice for when I get around to these more detailed datasets.
Related
I am able to connect Tableau with my database, but the table size is really large here. Every time I try to load the table into Tableau, it crashes and I am not able to find any workaround. The table size varies from 10 million to 400 million rows. How should I approach this issue? Any suggestions?
You don't "load" data into Tableau, you point Tableau at an external data source. Then Tableau sends a query to the external data source requesting only the summary info (aka query results) needed to create the visualization you designed.
So, for an extreme example, if you place CNT(Number of Records) on the Columns shelf, Tableau will send a simple short query to the external database asking it to report the number of records. Something along the lines of "select count(*) from xxx".
So even if there are billions of rows in the external database, Tableau will send a small amount of information to the database (a query) and receive back a small amount of information (the query results) to display. This allows Tableau to be very fast on its end, and performance depends on how fast the external database can respond to the query. Tuning your database depends on all kinds of factors: type and amount of memory and disk, how indices are set up, etc.
So the first step is to make sure that the database can perform as needed, regardless of Tableau.
That's the purist response. Now for a few messy details. It is possible to design a very complex visualization in Tableau that sends a complex query asking for a very large result set. For instance, you can design a dashboard that draws a dot on the map for every row in the database, and then refreshes a large volume of data every time you wave the mouse over the marks on the map.
If you have millions or billions of data rows and you want high performance, then don't do that. No user can read 60 million dots anyway, and they certainly don't want to wait for them to be sent over the wire. Instead, first plot aggregate values (min, max, sum, avg, etc.) and then drill down into more detail on demand.
As others suggest, you can use a Tableau extract to offload workload and cache data in a form for fast use by Tableau. An extract is similar to an optimized materialized view stored in Tableau. Extracts are very helpful with speeding up Tableau, but if you want high performance, filter and aggregate your extracts to contain only the data and level of detail needed to support your views. If you blindly make an extract of your entire database, you are simply copying all your data from one form of database to another.
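As an illustration of that point, the custom SQL behind an aggregated extract might look roughly like this (table and column names are invented), so the extract only carries the grain the dashboard actually needs:

    -- One row per customer per day instead of raw transaction rows
    SELECT CustomerId,
           CAST(EventTime AS date) AS EventDate,
           COUNT(*)                AS EventCount,
           SUM(Amount)             AS TotalAmount
    FROM dbo.Events
    GROUP BY CustomerId, CAST(EventTime AS date);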
I found a simple solution for optimising Tableau to work with very large datasets (1 billion+ rows): Google BigQuery, which is essentially a managed data warehouse.
Upload data to BigQuery (you can append multiple files into a single table).
Link that table to Tableau as an external data source.
Tableau then sends SQL-like commands to BigQuery whenever a new 'view' is requested. The queries are processed quickly on Google's computing hardware, which then sends a small amount of information back to Tableau.
This method allowed me to visualise a 100 GB mobile call record dataset with ~1 billion rows on a MacBook.
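For reference, those "SQL-like commands" are ordinary SQL; a query Tableau might push down to such a BigQuery table could look roughly like this (project, dataset and column names are invented):

    -- Aggregate ~1 billion call records down to one row per day and call type
    SELECT DATE(call_start)  AS call_date,
           call_type,
           COUNT(*)          AS calls,
           SUM(duration_sec) AS total_duration_sec
    FROM `my_project.telco.call_records`
    GROUP BY call_date, call_type
    ORDER BY call_date;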
There are two ways to interpret this question:
The data source (which might be a single table, a view, etc.) has 10M to 400M rows and Tableau is crashing at some point in the load process. In that case, I suggest you contact Tableau tech support. They really like to hear about situations like that and help people through them.
You are trying to create a visualization (such as a text table or crosstab) that has N records resulting in 10M to 400M displayed rows. In that case, you're in territory that Tableau isn't designed for. A text table with 10M rows is not going to be useful for much of anything other than exporting to something else, and for that there are better tools than Tableau (such as the export/import tools built into most databases).
Not really sure what your use case is, but I find it unlikely that you need all that data for one Tableau view.
You can pare down / aggregate the data using a view in the database or custom SQL in your Tableau connection. Also, try to use extracts rather than live database connections, as they will perform faster.
I like to use views in the database and then use those views to refresh my Tableau extracts on Tableau Server.
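As a sketch of that pattern (names are illustrative): the view lives in the database at the grain the dashboard needs, and the Tableau extract is refreshed from the view rather than from the raw table.

    -- Daily-grain reporting view; the Tableau Server extract refreshes from this
    CREATE VIEW dbo.vw_DailyOrderSummary AS
    SELECT CAST(OrderDate AS date) AS OrderDay,
           Region,
           COUNT(*)        AS OrderCount,
           SUM(OrderTotal) AS Revenue
    FROM dbo.Orders
    GROUP BY CAST(OrderDate AS date), Region;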
I need a general piece of advice; for the record, I use JPA.
I need to generate usage data statistics, e.g. a breakdown of user purchases per product, etc. I see three possible strategies: 1) generate the stats on the fly each time they are viewed, 2) create a specific stats table that I would update each time there is a change, 3) do offline processing at regular intervals.
All have issues and advantages, e.g. cost vs. not-up-to-date data, and I was wondering if anyone with experience in this field could provide some advice. I am aware the question is pretty broad; I can refine my use case if needed.
I've done a lot of reporting and the first question I always want to know is if the stakeholder needs the data in real time or not. This definitely shifts how you think and how you'll design a reporting system.
Based on the size of your data, I think it's possible to do real time reporting. If you had data in the millions, then maybe you'd need to do some pre-processing or data warehousing (your options 2/3).
Some general recommendations:
If you want to do real time reporting, think about making a copy of the database so you aren't running reports against production data. Some reports can use queries that are heavy, so it's worth looking into replicating production data to some other server where you can run reports.
Use intermediate structures a lot for reports. Write views, stored procedures, etc. so every report isn't just some huge complex query (see the sketch after this list).
If the reports start to get too complex for doing at the database level, make sure you move the report logic into the application layer. I've been bitten by this many times. I start writing a report with queries purely from the database and eventually it gets too complex and I have to jump through hoops to make it work.
Shoot for real time and then go to stale data if necessary. Databases are capable of doing a lot more than you'd think. Quite often you can make changes to your database structures that will give you a big yield in performance.
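To make the "intermediate structures" point concrete, here is a rough sketch (all names invented) of a reporting view for something like a breakdown of purchases per product, so the report queries a named structure rather than one huge ad hoc query:

    -- Reusable reporting view: purchases per user, product and day
    CREATE VIEW report_user_product_daily AS
    SELECT p.user_id,
           p.product_id,
           CAST(p.purchased_at AS date) AS purchase_date,
           COUNT(*)      AS purchase_count,
           SUM(p.amount) AS total_amount
    FROM purchases AS p
    GROUP BY p.user_id, p.product_id, CAST(p.purchased_at AS date);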
I have a set of views set up in SQL Server which output exactly the results that I would like to include in a SQL Server Analysis Services cube, including the calculation of a number of dimensions (such as Age using DATEDIFF, business quarter using DATENAME etc.). What I would like to know is whether it makes sense to use these views as the data source for a cube, or whether I should use the underlying tables to reproduce the logic in SSAS. What are the implications of going either route?
My concerns are:
the datasets are massive, but we need quick access to the results, so I would like as many of the calculations done in the views as possible to be persisted within the SSAS data warehouse
Again, because the datasets are massive, I want the recalculation of any cubes to be as fast as possible
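For illustration, a simplified version of the kind of view described above (table and column names are placeholders) might look like this:

    CREATE VIEW dbo.vw_FactSalesWithDims AS
    SELECT f.SalesKey,
           f.CustomerKey,
           f.Amount,
           DATEDIFF(YEAR, c.BirthDate, f.SaleDate) AS Age,             -- age dimension
           'Q' + DATENAME(QUARTER, f.SaleDate)     AS BusinessQuarter  -- business quarter dimension
    FROM dbo.FactSales AS f
    JOIN dbo.DimCustomer AS c
      ON c.CustomerKey = f.CustomerKey;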
Many experts actually recommend using views in your data source view (DSV) in SSAS. John Welch (Pragmatic Works, Microsoft MVP, Analysis Services Maestro) spoke about how he prefers using views in the DSV this year at SQL Rally Dallas. The reason is that it creates a layer between the cube and the physical tables.
Calculating columns in the view will take a little extra time and resources during cube processing. If processing time is ok, leave the computations in the view. If it's an issue, you can always add a persisted computed column directly to the fact table so that the calculation is done during the insert / update of the fact table. The disadvantage of this is that you'll have to physically store the columns in the fact table. The advantage is that they don't have to be computed every time the cube gets processed. These are the tradeoffs that you'll need to weigh to decide which way to go.
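As a sketch of that second option (column names are illustrative, and it assumes the needed inputs already live on the fact table, since a computed column cannot reference other tables):

    -- Persisted computed column: calculated on insert/update, stored physically
    ALTER TABLE dbo.FactSales
    ADD AgeAtSale AS DATEDIFF(YEAR, BirthDate, SaleDate) PERSISTED;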
Just make sure you tune the source queries to be as efficient as possible. Views are fine for DSVs.
Views, always! The only advantage of using tables in the DSV is that it will map your keys automatically :) which saves you five minutes of development time, haha.
Also, by "use the underlying tables to reproduce the logic in SSAS" do you mean creating calculated columns in your SSAS DSV? That is an option too, but I'd rather add the calculations to the views because, in case I have to update them, it is MUCH easier (and less subject to failure) to redeploy a view than to redeploy a full cube.
What techniques/tips can you give in regards to summarizing report data points so you don't have to store the raw data in the database?
For example, if I were storing page view traffic for a website and my reports were accurate to the hour, I could roll up all database rows by the hour, and then possibly even create further summary tables at various increments like per day or per month.
Any other tricks/tips along these lines?
You are talking about data warehousing / data mining. You still need the OLTP ("raw") data in a database, but you'd create an additional OLAP data warehouse with "pre-crunched" numbers for faster report access. However, this is an expensive venture in dollars and time - definitely not suitable for web site statistics unless you are Google or Amazon. So you are better off keeping the setup that you have and using your queries to summarize data.
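As a sketch of the roll-up idea from the question (hypothetical schema), an hourly summary table could be filled like this, and the daily and monthly tables built from the hourly one the same way:

    -- Hourly roll-up of raw page views
    CREATE TABLE page_views_hourly (
        page_id    int      NOT NULL,
        view_hour  datetime NOT NULL,  -- timestamp truncated to the hour
        view_count int      NOT NULL,
        PRIMARY KEY (page_id, view_hour)
    );

    INSERT INTO page_views_hourly (page_id, view_hour, view_count)
    SELECT page_id,
           DATEADD(HOUR, DATEDIFF(HOUR, 0, viewed_at), 0) AS view_hour,  -- truncate to the hour
           COUNT(*)
    FROM page_views_raw
    GROUP BY page_id, DATEADD(HOUR, DATEDIFF(HOUR, 0, viewed_at), 0);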
My app requires daily reports based on various user activities. My current design does not sum the daily totals in the database, which means I must compute them every time.
For example, a report that shows the top 100 users based on the number of submissions they have made on a given day.
For such a report, if I have 50,000 users, what is the best way to create a daily report?
How do I create monthly and yearly reports with such data?
If this is not a good design, then how do I deal with such a design decision when the metrics of the report are not clear during database design, and by the time they are clear we already have huge amounts of data with limited parameters (fields)?
Please advise.
Ideally, I would advise you to create your data model in such a way that all of the items that need to be reported can be precomputed, in order to minimize the amount of querying that has to be done on the database. It sounds like you might not be able to do that, and in any case it is an approach that can be brittle and resistant to change.
With the release of the 1.3.1 version of the SDK, you now have access to query cursors, which makes it a good deal easier to generate reports based on a large number of users. You could use App Engine cron jobs to put a job on a task queue to compute the numbers for the report.
Since any given invocation of your task is unlikely to complete in the time that AppEngine allows it to run, you'll have to pass the query cursor from one instance to the next until it finishes.
This approach allows you to adapt to changes to your database and reporting needs, as you can rework the task that computes the report values fairly easily.