Does PowerQuery Data Preview work differently than Data Loading in PowerBI? - sql-server

PowerQuery tends to load previews quickly, but the very same data takes much longer to load into the PowerBI model when we click 'Close and Apply'. Is there some difference in the way these two things are done? Both depend only on a SQL Server stored procedure, and I cannot share any screenshots here due to the confidentiality of company data. I am hoping someone else has had this issue and/or understands how PowerBI data loads work and can explain this difference.
I tried multiple data loads and varied the timeout period. I expected that lengthening the timeout would make a difference, but the load still failed.
** I posted this question earlier today and got a pretty hostile reply and a down-vote, so I deleted that one and tried to rephrase it and repost.

PQ can work differently depending on the circumstances (flat files, query folding, the transformations involved). PQ typically streams data rather than loading the entire dataset, in order to be more efficient with memory. Since the preview window only shows about 1,000 records, if no aggregations or sorts are involved, only those 1,000 records need to be streamed to give you a quick preview. When loading the entire dataset, every record has to go through every transformation step, not just the first 1,000 (see the sketch below). This is a really in-depth topic, and videos like the following give you some insight: https://www.youtube.com/watch?v=AIvneMAE50o
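To make that concrete: when the source is foldable, the preview and the full load can send very different statements to SQL Server. A stored procedure call generally cannot be folded this way, but the same idea applies - the preview only reads the first rows off the result stream, while 'Close and Apply' reads and transforms everything. A purely hypothetical illustration (dbo.Orders and its columns are invented names):

    -- Roughly what the preview asks the server for when query folding applies:
    SELECT TOP (1000) OrderID, OrderDate, Amount
    FROM dbo.Orders;

    -- Roughly what 'Close and Apply' asks for: every row, which then flows
    -- through every transformation step in the query.
    SELECT OrderID, OrderDate, Amount
    FROM dbo.Orders;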
The following articles also explain this in detail:
https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
https://bengribaudo.com/blog/2019/12/10/4778/power-query-m-primer-part12-tables-table-think-i
https://bengribaudo.com/blog/2019/12/20/4805/power-query-m-primer-part13-tables-table-think-ii

Related

ADF Calendar performance leak

I am using JDeveloper 11.1.2.3.0
I have implemented af:calendar functionality in my application. My calendar is based on a ViewObject that queries a database table with a large number of records (500-1000). Performing the selection through a SELECT query against the table is very fast, only a few ms. The problem is that my af:calendar takes too long to load: more than 5 seconds. If I just want to change the month, or the calendar view, I have to wait approximately that long again. I have searched a lot on the net but found no explanation for this. Can anyone please explain why it takes so long? Has anyone ever faced this issue?
PS: I have tested with JDeveloper 12 as well and the problem is exactly the same.
You should look into the ViewObject tuning properties to see how many records you fetch in a single network access, and do the same check for the executable that populates your calendar.
Also try using the HTTP Analyzer to see what network traffic is going on, and the ADF Logger to check what SQL is being sent to the DB.
https://blogs.oracle.com/shay/entry/monitoring_adf_pages_round_trips

Querying a huge database using cfquery

I am going to query about 4 GB of data using a cfquery. Querying the whole database is going to be painful because it takes a very long time to get the data back. I tried a stored procedure when the data was 2 GB and it wasn't really fast then either. The data will be pulled based on a date range the user selects on an HTML page. I have been advised to archive data in order to speed up querying the database.
Do you think I'll have to create a separate table with only the required fields and then query this newly created table?
The size of the current table is 4 GB, but it is increasing day by day; basically, it's a response database (it stores information received from somewhere else). After doing some research, I am wondering whether writing a trigger could be an option. If I do this, then as soon as a new row is added to the current 4 GB table, the trigger would run a SQL statement that copies the contents of the required fields into the newly created table. This would keep happening as long as new values arrive in my original 4 GB database.
Does the above approach sound good enough to tackle my problem? I have one more concern: even though I am copying only the fields required for querying into a new table, at some point the size of the new table will also grow, and that could also slow down querying it.
Please correct me if I am wrong somewhere.
Thanks
More Information:
I am using SQL Server. Indexing is currently done but it's not effective.
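To make the trigger idea concrete, this is roughly what I have in mind (the table and column names here are made up):

    CREATE TRIGGER trg_CopyToReportTable
    ON dbo.ResponseData            -- the big, growing table
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Copy only the fields needed for date-range reporting into the smaller table.
        INSERT INTO dbo.ResponseDataReport (response_id, response_date, field_a, field_b)
        SELECT i.response_id, i.response_date, i.field_a, i.field_b
        FROM inserted AS i;        -- 'inserted' holds the new row(s)
    END;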
Archiving the data will be farting against thunder. The data has to travel from your database to your application. Then your application has to process it to build the chart. The more data you have, the longer that will take.
If it is really necessary to chart that much data, you might want to simply acknowledge that your app will be slow and do things to deal with it. This includes things like code to prevent duplicate page requests, status displays for the user, and so on.

Memory usage in WPF application - how to track and manage? (BEGINNER SCOPE)

I have a small-scale WPF application using VB.NET as the code-behind and I want to add certain features, but I'm concerned about performance. I REALLY appreciate any responses, especially if you can include beginner-friendly articles on this, but please help me so I can be at ease...
1) My app interacts with a third-party database to display "realtime" data to the user. My proposed method is to create a background worker that queries the database every 30 seconds and displays the data. I query about 2,000 records, all long integer type, store them in a dataset, and then use LINQ to create subsets of observable collections which the WPF controls are bound to.
Is this too intensive? How much memory am I using for 2,000 records of long int? Is the background worker querying every 30 seconds too taxing? Will it crash eventually? Will it interfere with the user's other daily work (Excel, email, etc.)?
2) If an application is constantly reading/writing from text files can that somehow be a detriment to the user if they are doing day to day work? I want the app to read/write text files, but I don't want it to somehow interfere with something else the person is doing since this app will be more of a "run in the background check it when I need to" app.
3) Is there a way to quantify how taxing a certain block of code, variable storage, or data storage will be to the end user? What is acceptable?
4) I have several List(Of T) collections that I use as "global" lists, which I can hit from any window in my application to display data. Is there a way to quantify how much memory these lists take up? The lists range from lists of integers to lists of objects with dozens of properties. Can I somehow quantify how taxing this is on the app or the end user?
Thank you for any help and I will continue to search for articles to answer my questions
If you really want/need to get into the details of memory usage of an application, you should use a memory profiler:
http://memprofiler.com/ (commercial)
http://www.red-gate.com/products/dotnet-development/ants-memory-profiler/ (commercial)
http://www.jetbrains.com/profiler/ (commercial)
http://www.microsoft.com/download/en/details.aspx?id=16273 (free)
http://www.scitech.se/blog/ (commercial)
Your other questions are hard to answer since all relevant aspects are rather unknown:
what DB is used ?
how powerful is the machine running the DB server ?
how many users ?
etc.
In some areas a performance profiler can help - for example, the memory profilers mentioned above (esp. from Red Gate / JetBrains etc.) are usually available packaged together with a performance profiler...
I will just try a few. A byte integer uses a byte of memory. An Int32 uses 4 bytes, so 2,000 Int32 values would use about 8 KB. If you have a query you need to run a lot and it takes 5-10 seconds, you need to look closely at that query and add any missing indexes. If this is dynamic data, then WITH (NOLOCK) may be OK and faster, with less (no) locking. If the query returns the same data for all users, then I hope you don't have all users running the same query; you should have a two-tier application where the server runs the query every x seconds and sends that answer to the multiple clients that request it. As for the size of an object, just add it up - a byte is a byte. You can put your app in debug and get a feel for which statements are fast and slow.
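If the periodic query turns out to be the slow part, the kind of change I mean looks something like this (table, column and index names are invented for the example):

    -- A covering index so the 30-second poll doesn't scan the whole table.
    CREATE NONCLUSTERED INDEX IX_Readings_ReadingTime
        ON dbo.Readings (reading_time)
        INCLUDE (sensor_id, reading_value);

    -- For display-only data that refreshes constantly, a dirty read may be acceptable.
    SELECT sensor_id, reading_value
    FROM dbo.Readings WITH (NOLOCK)
    WHERE reading_time >= DATEADD(SECOND, -30, SYSUTCDATETIME());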

Versioning a dataset in an RDBMS using initials and deltas

I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass deletes anything preexisting and mass inserts the fresh data. When a delta comes in, the system does a bunch of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.
Unfortunately the current solution isn't scaling very well. The transactions are so large and long running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail for how the deltas are applied, making it difficult to troubleshoot issues causing the local and remote versions of the dataset to get out of sync.
One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
This just leaves the issue of how exactly to compose a view of a dataset up to a given version from the initial and deltas. (Apple's TimeMachine does something similar, using hard links on the file system to create "view" of a certain point in time.)
Does anyone have experience solving this kind of problem or implementing this particular solution?
Thanks!
Have one writer database and several reader databases. You send writes to the one writer database and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent, and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day; it is very scalable. You can even put a hardware router in front of all the read databases to load-balance them.
Thanks to those who tried.
For anyone else who ends up here, I'm benchmarking a solution that adds a "dataset_version_id" and "dataset_version_verb" column to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when retrieving specific records. If the latest version of the record has dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
This approach has an average ~ 80% performance hit so far, which may be acceptable for our purposes.
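For anyone curious, the query shape inside the stored procedure is roughly the following (table and column names simplified; @requested_version is the dataset version the caller asks for):

    SELECT t.*
    FROM dbo.SomeDatasetTable AS t
    WHERE t.dataset_version_id = (
            -- latest version of this record at or before the requested version
            SELECT MAX(t2.dataset_version_id)
            FROM dbo.SomeDatasetTable AS t2
            WHERE t2.record_key = t.record_key
              AND t2.dataset_version_id <= @requested_version
          )
      AND t.dataset_version_verb <> 'delete';   -- hide records whose latest version is a delete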

Retrieving information from aggregated weblogs data, how to do it?

I would like to know how to retrieve data from aggregated logs? This is what I have:
- about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB)
This is my idea:
- each night this data is processed with Pig
- logs are read and split, and a custom UDF retrieves data like timestamp, url, and user_id (let's say this is all I need) from each log entry and loads it into HBase (log data will be stored indefinitely)
Then if I want to know which users saw a particular page within a given time range, I can quickly query HBase without scanning the whole log data on each query (and I want fast answers - minutes are acceptable). And there will be multiple queries taking place simultaneously.
What do you think about this workflow? Do you think, that loading this information into HBase would make sense? What are other options and how do they compare to my solution?
I appreciate all comments/questions and answers. Thank you in advance.
With Hadoop you are always doing one of two things (either processing or querying).
For what you are looking to do, I would suggest using Hive: http://hadoop.apache.org/hive/. You can take your data and create an M/R job to process it and push it into Hive tables however you like. From there you can even partition the data, which can help query speed by not looking at data that isn't required, as you say, and then query out your results as you like. Here is a very good online tutorial: http://www.cloudera.com/videos/hive_tutorial
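As a rough illustration of the partitioning idea (the table layout and names are only an example):

    -- One partition per day, so a date-range query only reads the relevant days.
    CREATE TABLE page_views (
        view_time  BIGINT,
        url        STRING,
        user_id    STRING
    )
    PARTITIONED BY (dt STRING);

    -- "Which users saw this page last week?" touches only seven partitions.
    SELECT DISTINCT user_id
    FROM page_views
    WHERE dt BETWEEN '2010-07-01' AND '2010-07-07'
      AND url = 'http://example.com/some/page';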
There are lots of ways to solve this, but it sounds like HBase is a bit overkill unless you want to set up all the servers required for it to run as an exercise to learn it. HBase would be good if you have thousands of people simultaneously looking to get at the information.
You might also want to look into Flume, which is a new data-loading tool from Cloudera. It will get your files from some place straight into HDFS: http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/
