I'm using Tableau with a BigQuery data source that has 500M rows and 30 columns. In order to have this BQ data available to my workbooks, I refresh a Hyper extract every day.
In my workbooks I have 6 parameters and one filter that is a user filter.
I notice that the workbook loading time is slow. It also gets slow when I change parameter values.
When using performance recording, I see roughly 40 s per query,
while rendering time is on the order of milliseconds.
Is this normal even though I'm using extracts with fewer quick filters? How can I improve query performance?
Tableau Server info: I'm using Tableau 2022.1 on a 2-node server with 256 GB RAM.
How many marks are you pulling into the dashboard? If it is a summary query/dashboard, it shouldn't take this long. The more marks you pull, the longer it will take. User filters could also be a problem; try removing them to see the impact.
I'm working on a report where I have to get 3 different values making use of 3 queries:
Count of all our users
Users who purchased certain products in the last year
Users who have a certain attribute and have sent documents for signing
All these queries use tables from the same database. The first query takes under a second, the second one takes around 2 minutes, and the third takes around 1 minute.
I can create the report in two ways:
Create one data set and get all the required values.
Create one data set for each of these queries and let them run in parallel.
If my understanding is correct, the first method would take around 3 minutes to execute whereas the second one would take around 2 minutes. In this case, I think creating multiple datasets is much better for performance than creating one single dataset. What do you think? Does running queries in parallel improve performance in this case? What is good practice in SSRS in these kinds of scenarios? Thanks!
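For comparison, depending on how related the three queries are, they can sometimes be folded into a single dataset with subqueries or conditional aggregation, so SSRS makes only one round trip. A rough sketch, where every table and column name (users, purchases, documents, and so on) is a made-up placeholder rather than your actual schema:

-- One result set returning all three values; all object names are illustrative placeholders.
SELECT
    (SELECT COUNT(*) FROM dbo.users) AS total_users,
    (SELECT COUNT(DISTINCT p.user_id)
       FROM dbo.purchases p
      WHERE p.product_id IN (1, 2, 3)                        -- the "certain products"
        AND p.purchase_date >= DATEADD(YEAR, -1, GETDATE())) AS buyers_last_year,
    (SELECT COUNT(DISTINCT d.user_id)
       FROM dbo.documents d
       JOIN dbo.users u ON u.user_id = d.user_id
      WHERE u.has_attribute = 1                              -- the "certain attribute"
        AND d.sent_for_signing = 1)                          AS senders_with_attribute;

That said, the 2-minute query will dominate the total time however you split the datasets, so tuning that query (indexes, execution plan) usually buys more than the choice between one dataset and three parallel ones.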
Summary
I have a simple Power BI report with 3 KPIs that takes 0.3 seconds to refresh. Once I move the data to an Azure Dedicated SQL Pool, the same report takes 6 seconds to refresh. The goal is to have a report that loads in less than 5 seconds with 10-15 KPIs and up to 10 filters applied, connected to much larger datasets (several million rows).
How can I improve the performance of Power BI connected via DirectQuery to an Azure Dedicated SQL pool?
More details
Setup: Dedicated SQL pool > Power BI (Direct Query)
Data: 3 tables (rows x columns: 4x136 user table, 4x3,000 date table, 8x270,000 main table)
Data model: Main table to Date table connection - many-to-one based on “date” type field. Main to User connection - many-to-one based on user_id (varchar(10)). Other data types used: date, integers, varchars (50 and less).
Dedicated SQL pool performance level is Gen2: DW200c. However, the number of DWUs used never exceeds 50, most often it stays under 15.
Report:
The report includes 3 KPIs that are calculated based on the same formula (just with “5 symbols text” being different):
KPI 1 :=
CALCULATE(
    COUNTA('Table Main'[id]),                                             // count of non-blank ids
    CONTAINSSTRING('Table Main'[varchar 50 field 0], "5 symbols text"),   // only this text differs between the KPIs
    ISBLANK('Table Main'[varchar 50 field 1]),
    ISBLANK('Table Main'[varchar 50 field 2])
)
The report includes 6 filters: 5 “basic filtering” with 1 or several values chosen, 1 “string contains” filter.
Problem:
The visual displaying the 3 KPIs (just 3 numbers) takes 6-8 seconds to load, which is a lot, especially given the need to add more KPIs and filters (compared to 0.3 seconds if all the data is loaded in the .pbix file). It is a problem because adding more KPIs to the page (10-15) increases the load time proportionally, or even more for KPIs that are more complicated to calculate, and the reports become unusable.
Question:
Is it possible to significantly improve the performance of the report/AAS/SQL pool (2-10 times faster)? How?
If not, is it possible to somehow cache the calculated KPIs / visual contents in the report or AAS without querying the data every time and without keeping the data in pbix or AAS model?
Solutions tried and not working:
Sole use and different combinations of clustered columnstore, clustered rowstore, and non-clustered rowstore indexes. Automated statistics on/off. Automated indexes and stats do give an improvement of 10-20%, but it is definitely not enough.
A simple list of values (1 column from any table) takes 1.5 to 4 seconds to load.
What I have tried
Moving the SQL pool from West Europe to France and back. No improvement.
Applying indexes: row- and columnstore, clustered and non-clustered, manual and automatically defined (incl. statistics) - gives a 10-20% improvement in performance, which does not resolve the issue.
Changing resource classes: smallrc, largerc, xlargerc. The % of DWUs used still never exceeds 50 (out of 200). No improvement.
Shrinking data formats and removing excessive data: the smallest nvarchar(n) possible (the biggest is nvarchar(50)); all excessive columns have been removed. No improvement.
Isolating the model: I have a larger data model; for testing purposes I isolated the 3 tables into a separate model to make sure other parts do not affect the performance. No improvement.
Reducing the number of KPIs and filters
With only 2 report filters left (main table fields only), the visual takes 2 seconds to load; with 2 more filters on the connected date table, 2.5 sec; with 2 more filters on the user table, 6 sec. This means I could only use 1-2 filters per report, which is not acceptable.
It's a bit of a trial-and-error process, unfortunately. Here are a few things that are not in your list already:
many-to-one based on user_id (varchar(10)) -> Add a numeric column which is a hash of user_id column and use that to join instead of varchar column.
Ensure your statistics are up to date.
Try Dual mode. Load smaller dimension tables in-memory and keep fact tables in DB.
Use aggregations so that, unless a user is drilling down, the report is populated without querying the DB.
Partition your fact table by appropriate column.
Make sure you're using the right distribution for your fact table and have chosen the right column.
Be careful with partitioning and distribution. Synapse is designed a little differently from traditional relational databases (like MySQL); it's closer to NoSQL DBs to some extent, but not completely. So understand how these concepts work in Synapse before using them, or you might get worse performance! (A rough DDL sketch follows below.)
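To make the distribution and partitioning points concrete, here is a minimal DDL sketch for a dedicated SQL pool; the table name, columns, and partition boundaries are assumptions for illustration, not taken from your model:

-- Hypothetical fact table: hash-distributed on a numeric join key, partitioned by date.
CREATE TABLE dbo.FactMain
(
    id              INT          NOT NULL,
    user_id_hash    BIGINT       NOT NULL,   -- numeric hash/surrogate of the varchar(10) user_id
    event_date      DATE         NOT NULL,
    varchar_field_0 NVARCHAR(50) NULL
)
WITH
(
    DISTRIBUTION = HASH(user_id_hash),       -- co-locates rows that join on the same key
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (event_date RANGE RIGHT FOR VALUES ('2022-01-01', '2022-07-01', '2023-01-01'))
);

-- Keep statistics current on columns used in joins and filters.
CREATE STATISTICS st_FactMain_event_date ON dbo.FactMain (event_date);
UPDATE STATISTICS dbo.FactMain;

The warning above applies directly here: a dedicated SQL pool already spreads every table over 60 distributions, so with only ~270,000 rows a clustered columnstore plus partitions can leave rowgroups far too small to compress well, which is one way partitioning can make things slower rather than faster on a table this size.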
TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. On every 1st day of the month, each client needs to receive a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats on the requests they've made, grouped by type of request.
So at the end of the month, a client receives a report with their requests summarized by type.
I've already come up with two solutions, but I'm still not convinced by either of them.
1st solution: every last day of the month I increase the READ throughput capacity and then run a MapReduce job. When the job is done, I decrease the capacity back to the original value.
Cons: not fully automated; risk of the DynamoDB capacity not being available when the job starts.
2nd solution: I can break the generation of CSVs + statistics into small jobs on a daily or hourly schedule. I could store partial CSVs on S3, and on every 1st day of the month I could join those files and generate a new one. The statistics would be much easier to generate, just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself before, what I used, and now recommend to you, is to process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable - the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server into lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
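To illustrate how light the report-time work becomes, the monthly query against the pre-processed warehouse table can be a plain grouped aggregate. This is only a sketch; the table and column names are invented for the example:

-- Per-client monthly summary over already-denormalized, pre-calculated rows.
SELECT
    client_id,
    request_type,
    COUNT(*)        AS request_count,
    SUM(billed_fee) AS total_fee
FROM warehouse.client_requests
WHERE request_ts >= DATE '2015-01-01'   -- first day of the reported month
  AND request_ts <  DATE '2015-02-01'   -- first day of the next month
GROUP BY client_id, request_type
ORDER BY client_id, request_type;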
I would use Kinesis to produce daily, near-real-time billing.
For this purpose I would create a special DynamoDB table just for the calculated data
(another option is to run it on flat files).
Then I would add a process that sends events to Kinesis just after you update the regular DynamoDB table.
Thus, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already-calculated table.
I hope that helps.
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.
I have created a Windows Forms application that uses a SQL Server database. The form contains a search grid which brings in all the bank account information for a person. The search grid contains a special field, "Number of Account", which displays the number of accounts a person has with a bank.
There are more than 100,000 records in the table from which the data is fetched. I just want to know how I should decrease the response or search time when getting the data from the table into the search grid.
When I run the page, it takes a very long time for the records to be displayed in the search grid. Moreover, it does not fetch any data unless I provide search criteria (to and from dates for searching).
Is there any way to decrease the search time so that the data gets displayed in the grid?
There are a few things that you can do:
Only fetch the minimum amount of data that you need for your results - this means only select the needed columns and limit the number of rows.
In addition to the above, consider using paging on the UI, so you can further limit the amount of data returned. There is no point in showing a user 100,000 rows.
If you haven't done so already, add indexes to the table (though at 100,000 rows, things shouldn't be that slow anyway). A rough sketch of these points follows below.
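As a rough T-SQL sketch of the above; the table, column, and parameter names are placeholders rather than your actual schema:

-- Fetch only the columns the grid shows, only for the requested date range,
-- and only one page at a time.
SELECT  account_id,
        account_holder,
        number_of_accounts,
        opened_date
FROM    dbo.BankAccounts
WHERE   opened_date >= @FromDate
  AND   opened_date <  @ToDate
ORDER BY opened_date
OFFSET  (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;          -- OFFSET/FETCH requires SQL Server 2012 or later

-- A supporting index so the date-range filter doesn't scan the whole table.
CREATE INDEX IX_BankAccounts_OpenedDate
    ON dbo.BankAccounts (opened_date)
    INCLUDE (account_id, account_holder, number_of_accounts);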
Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns' worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google visualization API that powers Google Finance. With regards to time scaling, i.e. the longer the time period the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user request, i.e. if the user asks for a month of data the interface will request an avg of every 86400 rows, which would return ~30 data points, whereas if the user asks for a day of data the interface will request an avg of every 2880 rows, which will also return 30 data points but with more granularity. (A sketch of what such a query could look like follows after this list.)
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
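For reference, the Option A style of averaging can be done in a single PostgreSQL query by grouping on fixed-size time buckets. The columns match the SELECT in the question; the bucket width (86400 s here, i.e. daily points) and the :startTime/:endTime placeholders for the bound parameters are assumptions:

-- Average each 86400-second bucket inside the requested window (~30 points for a month).
SELECT
    to_timestamp(floor(extract(epoch FROM time_stamp) / 86400) * 86400) AS bucket_start,
    avg(pressure)    AS avg_pressure,
    avg(temperature) AS avg_temperature,
    avg(speed)       AS avg_speed
FROM data
WHERE time_stamp BETWEEN :startTime AND :endTime
GROUP BY 1
ORDER BY 1;

The interface would substitute a different divisor (2880 for a day-long window, and so on) to keep roughly 30 points per request.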
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably start out slower than A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement it one way or another for best performance and scalability. PostgreSQL doesn't yet support declarative materialized views (I'm working on that this year, personally), but there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do instead is periodically summarize the detail into summary tables from crontab jobs (or similar). You might also want to create views that show summary data by combining the summary tables with the detail table for the ranges where summary rows don't exist yet.
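A minimal sketch of that summary-table idea, meant to be run from cron; the hourly table, the 2-hour refresh window, and the restriction to the three plotted columns are all assumptions:

-- Re-summarize the most recent hours of raw data into one row per hour.
CREATE TABLE IF NOT EXISTS data_hourly (
    bucket_start     timestamptz PRIMARY KEY,
    avg_pressure     double precision,
    avg_temperature  double precision,
    avg_speed        double precision,
    sample_count     bigint
);

BEGIN;
DELETE FROM data_hourly
 WHERE bucket_start >= date_trunc('hour', now() - interval '2 hours');
INSERT INTO data_hourly
SELECT date_trunc('hour', time_stamp) AS bucket_start,
       avg(pressure), avg(temperature), avg(speed), count(*)
FROM   data
WHERE  time_stamp >= date_trunc('hour', now() - interval '2 hours')
GROUP  BY 1;
COMMIT;

Longer-grained tables (daily, and so on) can then be built from data_hourly rather than from the raw rows.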
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
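A sketch of that date-range partitioning, assuming a PostgreSQL version with declarative partitioning (10 or later); on the version the linked page describes, the same effect is achieved with table inheritance, CHECK constraints, and an insert trigger. Only a few of the ~130 columns are shown:

-- Parent table partitioned by month on the timestamp column.
CREATE TABLE data (
    time_stamp   timestamptz NOT NULL,
    pressure     double precision,
    temperature  double precision,
    speed        double precision
) PARTITION BY RANGE (time_stamp);

-- One child table per month; create upcoming partitions ahead of time (e.g. from cron).
CREATE TABLE data_2012_06 PARTITION OF data
    FOR VALUES FROM ('2012-06-01') TO ('2012-07-01');
CREATE TABLE data_2012_07 PARTITION OF data
    FOR VALUES FROM ('2012-07-01') TO ('2012-08-01');

Queries constrained on time_stamp (like the BETWEEN query above) then only touch the partitions that overlap the requested window, and old months can be summarized or detached without rewriting the whole table.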