Finding bottlenecks in ETL and cube processing - sql-server

I have ETL and cube solutions, which I process one after another in a SQL Agent job.
In the ETL, I run one parent package, which in turn runs all the other packages one by one.
The whole processing takes 10 hours.
For ETL:
How can I find out how long each package takes to run within that one parent package, other than opening the solution and recording the times manually?
For the cube:
Here the dimensions process quickly. What should I measure in order to find out which part takes so long? The measure groups, perhaps? How can I track the processing time of a particular measure group?
Maybe SQL Profiler will help? If so, is there a good article describing which metrics I should pay attention to?

To gather statistics about SSIS execution times, you can enable logging:
For the package deployment model, you'll have to turn on logging in each package: go to SSIS > Logging. In the dialog, choose the OnPreExecute and OnPostExecute events. Use the SQL Server log provider, which logs to a system table called dbo.sysssislog. You'll need to join the pre and post events on the execution id.
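For example, a query along these lines (assuming the SQL Server log provider writes to the default dbo.sysssislog table; on SQL Server 2005 the table is dbo.sysdtslog90) gives a rough duration per package or task:
-- Rough duration per executable, pairing OnPreExecute with OnPostExecute.
SELECT
    pre.source                                    AS package_or_task,
    pre.starttime                                 AS started_at,
    post.endtime                                  AS ended_at,
    DATEDIFF(SECOND, pre.starttime, post.endtime) AS duration_seconds
FROM dbo.sysssislog AS pre
JOIN dbo.sysssislog AS post
      ON  post.executionid = pre.executionid
      AND post.sourceid    = pre.sourceid
      AND post.event       = 'OnPostExecute'
WHERE pre.event = 'OnPreExecute'
ORDER BY duration_seconds DESC;
(If a task runs several times inside a loop, the pre/post pairs will cross-match, but this is usually good enough for a first look.)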
For the project deployment model, it's probably already on. This can be configured in SSMS under Integration Services Catalogs > SSISDB: right-click and choose Properties. Once you've executed the package, you can see the results in the standard reports: right-click the master package and choose Reports > Standard Reports > All Executions.
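With the project deployment model you can also query the SSISDB catalog views directly. A sketch, assuming the master package is called Master.dtsx (adjust the name to yours):
-- Per-executable durations for the latest execution of the master package.
SELECT
    e.package_name,
    e.executable_name,
    es.execution_duration / 1000.0 AS duration_seconds
FROM SSISDB.catalog.executions AS ex
JOIN SSISDB.catalog.executables AS e
      ON  e.execution_id = ex.execution_id
JOIN SSISDB.catalog.executable_statistics AS es
      ON  es.execution_id  = ex.execution_id
      AND es.executable_id = e.executable_id
WHERE ex.execution_id = (SELECT MAX(execution_id)
                         FROM SSISDB.catalog.executions
                         WHERE package_name = 'Master.dtsx')  -- assumed name
ORDER BY es.execution_duration DESC;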
Lots more details on SSIS logging here: https://learn.microsoft.com/en-us/sql/integration-services/performance/integration-services-ssis-logging
For SSAS, I always tested this manually: connect in SSMS, right-click each measure group and do a Process Full (this assumes the dimensions have just been freshly processed). The measure groups are more likely to be the cause of an issue because of the amount of data.
Once you know which measure group is slow, you can look at tuning the source query, if it has any complexity to it, or at partitioning the measure group and loading it incrementally. Full processing could still be scheduled periodically.
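If you'd rather script that test than click through SSMS, one option is to send the Process Full as XMLA from T-SQL and time it. This is only a sketch: it assumes a linked server named SSAS_LINKED to the SSAS instance (with RPC Out enabled), and the object IDs below (IDs, not display names) are illustrative:
-- Time a Process Full of a single measure group via a linked server to SSAS.
DECLARE @start DATETIME2 = SYSDATETIME();
EXEC (N'
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>SalesOlap</DatabaseID>
    <CubeID>Sales</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
  </Object>
  <Type>ProcessFull</Type>
</Process>') AT [SSAS_LINKED];
SELECT DATEDIFF(SECOND, @start, SYSDATETIME()) AS process_seconds;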

Related

How will I know specifically when the tasks in my packages have run successfully in SSIS, and how can I show that in a log table in SQL Server?

Let's say there is a master package and several tasks run in it on a daily basis. I want to determine specifically when those tasks have finished, like "table load completed" and "cube load completed". These steps run daily, but I have to show in a SQL table that on this particular day the table load started at this time and ended at that time, etc.
SSIS event handlers are the simplest means of turning an SSIS script into a reliable system that is auditable, reacts appropriately to error conditions, reports progress, and allows instrumentation and monitoring of your SSIS packages. They are easy to implement, and provide a great deal of flexibility. Rob Sheldon once again provides the easy, clear introduction.
You can use the OnPostExecute event handler to log when a task finishes successfully:
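A minimal sketch, assuming a logging table you create yourself (all names are illustrative): put an Execute SQL Task inside the master package's OnPostExecute event handler, map the ? parameters to System::PackageName and System::SourceName on the Parameter Mapping page, and copy the same task into OnPreExecute with the literal changed:
-- Illustrative audit table for task start/end times.
CREATE TABLE dbo.PackageEventLog
(
    LogId       INT IDENTITY(1,1) PRIMARY KEY,
    PackageName NVARCHAR(260) NOT NULL,
    TaskName    NVARCHAR(260) NOT NULL,
    EventName   NVARCHAR(50)  NOT NULL,   -- 'OnPreExecute' / 'OnPostExecute'
    EventTime   DATETIME      NOT NULL DEFAULT (GETDATE())
);
-- Statement for the Execute SQL Task in the OnPostExecute event handler;
-- the ? parameters map to System::PackageName and System::SourceName.
INSERT INTO dbo.PackageEventLog (PackageName, TaskName, EventName)
VALUES (?, ?, 'OnPostExecute');
Joining each task's OnPreExecute row to its OnPostExecute row then gives the start and end times per day.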
Or you can group tasks in containers and then use precedence constraints with Success and Failure conditions.

SQL Server Agent Job Monitoring (did a Job even start?)

I am a BI developer and was entrusted with a DBA task. It's about monitoring SQL Server Agent jobs and gathering information about the following key figures, which are then to be presented in a frontend:
Failed SQL Server Agent jobs and job steps
Run times of jobs and job steps
Find out whether a job has even started
Storage allocation: evaluation and monitoring of the volumes of data moved and stored
The goal is to monitor and, if necessary, provide an indication of any discrepancies in the key figures.
I've covered the first two points. In each case I show in a table whether a job or its steps are enabled/disabled. I have also recorded the run times of each step, and thresholds can be used to warn when critical ranges are reached.
The biggest problem for me is the jobs that did not even start (point 3). To my knowledge, these are not recorded in the msdb tables. I would like to know when a job or job step has not even started. Can you tell me how to get this information? Maybe someone already has a suitable script for this problem?
On the subject of storage usage, I'm interested in how much free space is left on the disk, how big the partition is and how consumption changes over time.
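For point 4, one hedged option (it requires SQL Server 2008 R2 SP1 or later) is to snapshot free space per volume with sys.dm_os_volume_stats on a schedule and store the results in a table of your own, so you can see how consumption changes over time:
-- Free and total space per volume that hosts database files.
SELECT DISTINCT
    vs.volume_mount_point,
    vs.total_bytes     / 1073741824.0 AS total_gb,
    vs.available_bytes / 1073741824.0 AS free_gb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;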
On the internet, I could find nothing on points 3 and 4. I would be very grateful for your help! (And forgive my bad English :) )
As a result, I get when the next run of job xy is planned. It does not appear in the table.
However, it does appear in the SQL Server Agent job history; there, though, the information about when it should run next is missing.
My plan was to take the next_scheduled_run_date column from the sysjobactivity table and then, on a later run, compare it to the run_requested_date column.
But apparently some records are missing from the sysjobactivity table. The other table that contains target start times is sysjobschedules.
Unfortunately, this only holds the currently scheduled date. I have not found another table that contains a history of the target run dates.
Of course, one could maintain manual tables (analogous to the target values), but that would be too much effort.
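For what it's worth, a minimal sketch of the plan described above: periodically snapshot next_scheduled_run_date into a table of your own, then later flag scheduled runs for which no matching run was ever requested (the table name and schedule are illustrative):
-- Snapshot table you maintain yourself.
CREATE TABLE dbo.JobScheduleSnapshot
(
    SnapshotTime            DATETIME         NOT NULL DEFAULT (GETDATE()),
    job_id                  UNIQUEIDENTIFIER NOT NULL,
    job_name                SYSNAME          NOT NULL,
    next_scheduled_run_date DATETIME         NULL
);
-- Run this on a schedule (e.g. hourly) to record what Agent plans to run next.
INSERT INTO dbo.JobScheduleSnapshot (job_id, job_name, next_scheduled_run_date)
SELECT ja.job_id, j.name, ja.next_scheduled_run_date
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs        AS j ON j.job_id = ja.job_id
WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions);
-- Later: scheduled runs that are in the past but were never requested.
SELECT s.job_name, s.next_scheduled_run_date
FROM dbo.JobScheduleSnapshot AS s
WHERE s.next_scheduled_run_date < GETDATE()
  AND NOT EXISTS (SELECT 1
                  FROM msdb.dbo.sysjobactivity AS ja
                  WHERE ja.job_id = s.job_id
                    AND ja.run_requested_date >= s.next_scheduled_run_date);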

How many SQL jobs can a SQL Server handle?

I am creating a medical database system and have reached the point where I am building a notification feature using SQL jobs. The SQL job's responsibility is to check some tables; the entities that need to be notified of a change in certain data will have their IDs put into an entity called Notification, and a trigger will be fired so the app checks that table and sends the notification.
What I want to ask is: how many SQL jobs can a SQL Server handle?
Does the number of SQL jobs running in the background affect the performance of my application or of the database in one way or another?
NOTE: the SQL job will run every 10 seconds
I couldn't find any useful information online.
thanks in advance.
This question really doesn't have enough background to get a definitive answer. What are the considerations?
Do the queries in your ten-second job actually complete in ten seconds, even when your DBMS is under its peak transactional workload? Obviously, if the job routinely doesn't complete in ten seconds, you'll get jobs piling up.
Do the queries in your job lock up tables and/or indexes so the transactional load can't run efficiently? (You should use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; as much as you can so database reads won't take locks unnecessarily.)
Do the queries in your job do a lot of rows' worth of inserts and updates, and so swamp the SQL Server transaction logs?
How big is your server? (CPU cores? RAM? IO capacity?) How big is your database?
If your project succeeds and you get many users, will your answers to the above questions remain the same? (Hint: no.)
You should spend some time on the execution plans for the queries in your job, and try to make them as efficient as possible. Add the necessary indexes. If necessary refactor the queries to make them more efficient. SSMS will show you the execution plans and suggest appropriate indexes.
If your job is doing things like deleting expired rows, you may want to build the expiration into your data model. For example, suppose your job does
DELETE FROM readings WHERE expiration_date <= GETDATE()
and your application does this, relying on your job having already removed the expired readings:
SELECT something FROM readings
You can refactor your application query to say
SELECT something FROM readings WHERE expiration_date >= GETDATE()
and then run your job overnight, at a quiet time, rather than every ten seconds.
A ten-second job is not the greatest idea in the world. If you can rework your application so it will function correctly with a ten-second, ten-minute, or twelve-hour job, you'll have a more resilient production system. At any rate if something goes wrong with the job when your system is very busy you'll have more than ten seconds to fix it.

In this circumstance, is it better to use an SSIS package, or just script out the job?

Forewarning: I wasn't entirely sure whether this question belongs here (SO) or on The Workplace, because it isn't so much about programming as it is about convincing my co-worker that their method is bad. But it's still programming related, so mods, please feel free to relocate this question to The Workplace. Anyway...
At work we have large SSAS cubes that have been split into multiple partitions. The individual who set up these partitions scheduled every partition to be processed every day. But in hindsight, because the data in these partitions is historic, there is no need to process each partition every day; only the current partition should be processed after the latest data has been added to the cube's data source.
I gave my coworker the task of automating this process. I figured all they need to do is get the current date, and then process the partition corresponding to that date range. Easily scriptable.
My coworker created an SSIS package to do this...
Cons:
the SSIS package is hard to source control
the SSIS package will be hard to test
the SSIS package is a pain in the ass to debug
the SSIS package requires Visual Studio and Visual Studio Data Tools to even open
lastly, I feel SSIS packages lead to heavy technical debt
Pros:
it's easier for my coworker to do (maybe)
Correct me if I'm wrong on any of those, but just the first reason is enough for me to scrap all of their work.
Needless to say, I'm extremely biased against anything done in SSIS. But processing a cube can be scripted out in XMLA (ref: link), and then, using a SQL Server Agent job, you can schedule that script to run at specific times. The only tricky part would be swapping out the partition name that is processed within the script. Furthermore, the script/job can be kept in source control and then deployed to the MSSQL server whenever a change is made.
Am I being too critical here? I'm just trying to keep the next developers from ripping their hair out.
What you can do is have two SQL Agent jobs:
1) Full processing + repartitioning
2) Incremental processing (processing only the last (current) partition).
You don't need SSIS for either (1) or (2).
For (2) the script is fixed - you just make a call to process one partition, plus incremental processing of the dimensions (if required). The current partition must have a condition of the form WHERE >= ... (not BETWEEN), so that it also covers future dates if a new partition has not been created yet.
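For example, a sketch only: it assumes a linked server named SSAS_LINKED to the SSAS instance (RPC Out enabled) and monthly partitions whose IDs end in YYYYMM; the database, cube, measure group and partition IDs are illustrative and need adjusting to your cube:
-- Process only the current (monthly) partition via the linked server to SSAS.
DECLARE @xmla NVARCHAR(MAX) = N'
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>SalesOlap</DatabaseID>
    <CubeID>Sales</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
    <PartitionID>FactSales_' + LEFT(CONVERT(CHAR(8), GETDATE(), 112), 6) + N'</PartitionID>
  </Object>
  <Type>ProcessFull</Type>
</Process>';
EXEC (@xmla) AT [SSAS_LINKED];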
For (1), you can write T-SQL code that creates a new partition for the new period and reprocesses the cube. It can be scheduled to run over the weekend when the server is idle, or once per month.
(1) does the following:
Back up the existing cube (Prod) via an SSAS Command step in SQL Agent
Restore the backup as TempCube via an SSAS Command step with AllowOverwrite (in case the temp cube was not deleted earlier)
Delete all partitions in TempCube via T-SQL + a linked server to SSAS
Re-create the partitions and process the cube (full) via T-SQL + the linked server to SSAS
Back up TempCube
Delete TempCube
Restore the backup of TempCube as the production cube.
As you can see, the process is crash-safe and you don't need SSIS. Even if the job that creates a new partition wasn't run for some reason, the cube still has the new data; the data will be split when the new partition structure is created by (1).
I think you are looking at this the wrong way. To be honest, your list of cons is pretty weak and is just a reflection of your opinion of SSIS. There is always more than one tool in the toolbox for any job; the proper tool to use will vary from shop to shop.
If the skill set of the party responsible for developing and maintaining this automated process is SSIS, then you should really have a better reason than personal preference to rewrite the process with a different tool. A couple of reasons I can think of are company standards and the skill set of the team.
If a company standard dictates the tool, then follow that; you should have staff capable of using the tools the company mandates. Otherwise, assess the skill set of your team. If you have a team of SSIS developers, don't force them to use something else because of your personal preference.
Your task of dynamic SSAS partition processing can be automated with or without SSIS. SSIS is just an environment to execute tasks and do data manipulation. On the pro side, it has built-in components that execute an XMLA script from a variable and capture error messages. In pure .NET you have to do that yourself, but it is not too complex.
Several sample approaches to your task:
Create the XMLA and execute it with SSIS.
Generate the XMLA with the AMO library and execute it in .NET. You need to look at chapter 4d) Process All Dims. The provided sample does more than that, and the steps are put into an SSIS package as well.
I personally used SSIS in a similar situation, probably because the other 99% of the ETL logic and data manipulation is in SSIS. As said before, SSIS offers no significant advantage here; the second example shows how to do it in pure .NET.

SQL Server job (stored proc) trace

I need your suggestions on tracing this issue.
We run data load jobs early in the morning, loading data from an Excel file into a SQL Server 2005 database. When the job runs on the production server, it often takes 2 to 3 hours to complete. We could drill down to one job step which takes 99% of the total time to finish.
While running that job step (stored procs) in the staging environment (with the same production database restored) takes 9 to 10 minutes, it takes hours on the production server when it runs early in the morning as part of the job. The production server always gets stuck on that very job step.
I would like to run a trace on that job step (around 10 stored procs run for each user in a while loop within the job step) and collect the information needed to figure out the issue.
What are the ways available in SQL Server 2005 to achieve this? I want to run the trace only for these SPs, and not for a whole time period on the production server, as a trace gives lots of information and it becomes very difficult for me (not being a DBA) to analyze that much trace data and figure out the issue. So I want to collect information about these specific SPs only.
Let me know what you suggest.
Appreciate your time and help.
Thanks.
Use SQL Profiler. It allows you to trace plenty of events, including stored procedures, and even apply filters to the trace.
Create a new trace
Select just stored procedures (RPC:Completed)
Check "TextData" for columns to read
Click the "Column Filters" button
Select "TextData" from the left hand nav
Expand the "Like" tree view, and type in your procedure name
Check "Exclude Rows that Do Not Contain Values"
Click "OK", then "Run"
What else is happening on the server at that time matters when the step is fast on other servers but not on prod. Maybe you are running into the daily backup, or statistics or index maintenance jobs?
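One quick way to check is to pull the Agent history for that window from msdb (run_date, run_time and run_duration are integers in YYYYMMDD / HHMMSS form; the date below is just an example):
-- Everything SQL Agent ran during the slow early-morning window.
SELECT j.name        AS job_name,
       h.step_name,
       h.run_date,                        -- int, e.g. 20240115
       h.run_time,                        -- int HHMMSS, e.g. 23000 = 02:30:00
       h.run_duration                     -- int HHMMSS, e.g. 10523 = 01:05:23
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
WHERE h.run_date = 20240115               -- the morning in question (example)
  AND h.run_time BETWEEN 20000 AND 60000  -- roughly 02:00 to 06:00
ORDER BY h.run_time;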
