My SSISDB is writing a large number of entries, especially to [internal].[event_messages] and [internal].[operation_messages].
I have already set both the number of versions to keep and the log retention period to 5. After running the maintenance job, selecting the distinct dates in both of those tables shows that there are only 6 dates left (including today), as one would expect. Still, the two tables mentioned above hold about 6.5 million entries each, and the database is 35 GB in total (again, with a retention period of 5 days).
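The check I ran was roughly the following (a sketch of it; message_time is the timestamp column in both tables):
SELECT DISTINCT CONVERT(date, message_time) FROM [internal].[event_messages];
SELECT DISTINCT CONVERT(date, message_time) FROM [internal].[operation_messages];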
In this particular package, I am using a lot of loops and I suspect that they are causing this rapid expansion.
Does anyone have an idea of how to reduce the number of operational and event messages written by the individual components? Or do you have an idea of what else might be causing this rate of growth? I have packages running on other machines for over a year with a retention period of 150 days and the size of the SSISDB is just about 1 GB.
Thank You!
We are seeing an SSIS process taking too long overnight, and running late into the morning.
The process queries a database table to return a list of files that have been changed since the last time the process ran (it runs daily), and this can be anything from 3 (over a weekend) to 40 during the working week. There are possibly 258 Excel files (all based on the same template) that could be imported.
As I stated, the process seems to take too long on some days (it's not on a dedicated server), so we looked at performance improvement suggestions, e.g. increasing DefaultBufferMaxRows and DefaultBufferSize to 50,000 and 50 MB respectively for each Data Flow Task in the project. The other main suggestion was to always use a SQL command rather than a Table or View. Each of my nine Data Flow Tasks relies on a range name from the spreadsheet, so if it might help with performance, what I want to know is: is it possible to select from an Excel worksheet with a WHERE condition?
The nine import ranges vary from a single cell to a range of 10,000 rows by 230 columns. Each of these is imported into a staging table and then merged into the appropriate main table, but we have had issues with the import not correctly inferring data types (even with IMEX=1). It seems I might get a better import if I could select the data differently and restrict it to only the rows I'm interested in (rather than importing all 10,000 and then filtering them as part of the task), i.e. all rows where a specific column is not blank.
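With the ACE/Jet provider behind the Excel connection manager, a SQL command along these lines is what I have in mind, using either a named range such as [ImportRange] or an explicit range such as [Sheet1$A1:HV10000] (a sketch; the range and column names here are placeholders):
SELECT *
FROM [ImportRange]
WHERE [KeyColumn] IS NOT NULL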
This is just initially an exercise to look into performance improvement, but also it's going to help me going forward with maintaining and improving the package as it's an important process for the business.
Whilst testing the merge stored procedure to look at other ways to improve the process, it came to light that one of the temporary tables populated via SSIS was corrupted. It was taking 27 seconds to query and return 500 records, and when looking at the table statistics, the table was much bigger than it should have been. After conversations with our DBA, we dropped and recreated the table, and now the process is running at its previous speed, i.e. for 5 spreadsheets the process takes 1 minute 13 seconds, and for 43 spreadsheets it's around 7 minutes!
Whilst I will still be revisiting the process at some point, everyone agrees to leave it unchanged for the moment.
I am trying to figure out if the current redo log size I have right now is optimal. Here is what I have done:
I used the Oracle documentation to find most of this information:
http://www.oracle.com/technetwork/database/availability/async-2587521.pdf
I ran the following query:
select thread#, sequence#,
       blocks*block_size/1024/1024 "MB",
       (next_time-first_time)*86400 "sec",
       (blocks*block_size/1024/1024)/((next_time-first_time)*86400) "MB/s"
from V$ARCHIVED_LOG
where ((next_time-first_time)*86400 <> 0)
  and first_time between to_date('2020/03/28 08:00:00','YYYY/MM/DD HH24:MI:SS')
                      and to_date('2020/05/28 11:00:00','YYYY/MM/DD HH24:MI:SS')
  and dest_id = 3
order by first_time
From the results, I calculated the average redo rate, which is 7.67 MB/s, and the maximum, which is 245 MB/s.
The Oracle documentation includes a table of recommended redo log group sizes based on peak redo rate (see the attached picture).
Using this query
select * from V$LOGFILE a, V$LOG b where a.GROUP# = b.GROUP#
I discovered that I have 15 groups of 2 GB, so the redo log group size is 30 GB.
Oracle says that "In general we
recommend adding an additional 30% on top of the peak rate", so that would mean I am expected to have 245mb/s*1.3 = 318.5 MB/S. Then here is where I get a little lost. Do I use the table in the picture I attached? If so, I would be expected to have a redo log group size of 64GB? Or am I making a connection where there should not be one?
Finally, I also ran
select optimal_logfile_size from v$instance_recovery
and that returns 14 GB.
I am having trouble making all the connections and confirming that my redo log size is adequate.
If you have 15 groups of 2GB each, then your group size is 2GB, not 30GB.
The idea is not to switch logs too often - no more often than every 20 minutes or so. So look at how often your log switches are happening. If you are still going more than 20 minutes between switches, then you are probably fine. If you are ever switching more frequently than that, then you might need bigger logs.
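For example, a quick sketch against V$LOG_HISTORY to count switches per hour (adjust the time window to suit):
select to_char(first_time, 'YYYY-MM-DD HH24') "Hour",
       count(*) "Switches"
from v$log_history
where first_time > sysdate - 7
group by to_char(first_time, 'YYYY-MM-DD HH24')
order by 1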
Based on the calculations you performed, a max rate of ~319MB/s would indicate that individual redo log files should be 64GB, and you want a minimum (per best practice) of three redo log groups. That said - how much of your time is spent at peak load? If only a small amount of time per day (your average transaction rate is much lower) then this may be overkill. You don't want log switches to happen too far apart, either, or your ability to do point-in-time recovery after a redo log failure could be compromised.
It may make more sense for you to have log files that are 16GB and maintain a steady switch rate on average and accept a higher switch rate during peak load. You might need more individual log files that way to handle the same total transactions per minute without waiting for incomplete log switches: say three groups of 64GB each vs. 12 groups of 16GB each. The same total log capacity but in smaller chunks for switches and archive logging. That's probably why you have 15 groups of 2GB each configured now...
Ideally there should not be frequent log switches within an hour. Check the redo log switch frequency and increase the redo log size accordingly.
A useful link for getting all the redo-log-related details:
Find Redo Log Size / Switch Frequency / Location in Oracle
Problem: I am developing a reporting engine that displays data about how many bees a farm detected (bees are just an example here).
I have 100 devices that each minute count how many bees were detected on the farm. Here is what the DB looks like:
So there can be hundreds of thousands of rows in a given week.
The farmer wants a report that shows, for a given day, how many bees came in each hour. I developed two ways to do this:
The server takes all 100,000 rows for that day from the DB and filters them down. The server uses a large amount of memory to do this, and I feel this is a brute-force solution.
I have a stored procedure that returns a temporary table with the number of bees collected per hour, totalled for each device. The server takes this table and doesn't need to process 100,000 rows.
This returns (24 * 100) rows. However, it takes much longer than I expected to do this ~
What are some good candidate solutions for consolidating and summing this data without taking 30 seconds just to sum a day of data (given that I may need a month's worth, divided between days)?
If performance is your primary concern here, there's probably quite a bit you can do directly on the database. I would try indexing the table on time_collected_bees so it can filter down to the 100K rows faster. I would guess that that's where your slowdown is happening, if the database is scanning the whole table to find the relevant entries.
If you're using SQL Server, you can try looking at your execution plan to see what's actually slowing things down.
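For example, assuming a table called bee_counts with columns device_id, bee_count and time_collected_bees (only the last of those is named here; the rest are placeholders), something like this:
CREATE INDEX IX_bee_counts_time
    ON bee_counts (time_collected_bees)
    INCLUDE (device_id, bee_count);

SELECT DATEPART(hour, time_collected_bees) AS hour_of_day,
       SUM(bee_count)                      AS total_bees
FROM   bee_counts
WHERE  time_collected_bees >= '2020-06-01'
  AND  time_collected_bees <  '2020-06-02'
GROUP BY DATEPART(hour, time_collected_bees)
ORDER BY hour_of_day;
With the index in place, the date-range filter becomes a seek rather than a full scan, and the aggregation happens on at most 100K rows inside the database.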
Give database optimization more of a look before you architect something really complex and hard to maintain.
We have a database that is currently 1.5 TB in size and grows by about a gigabyte of data (a text file of roughly 5 million records) every day.
It has many columns, but a notable one is START_TIME, which holds the date and time.
We run many queries against a date range.
We keep 90 days' worth of records in one table, and we have a larger table which has ALL of the records.
Queries run against the 90 days' worth of records are pretty fast, but queries run against ALL of the data are slow.
I am looking for some very high-level answers and best practices.
We are THINKING about upgrading to SQL Server Enterprise and using table partitioning, splitting the partition based on month (12) or day (31).
What's the best way to do this?
Virtual or physical, a SAN, how many disks, how many partitions, etc.?
Sas
You don't want to split by day, because you will touch all partitions every month. Partitioning allows you not to touch certain data.
Why do you want to partition? Can you clearly articulate why? If not (which I assume), you shouldn't do it. Partitioning does not improve performance per se. It improves performance in some scenarios and costs performance in others.
You need to understand what you gain and what you lose. Here is what you gain:
Fast deletion of whole partitions
Read-Only partitions can run on a different backup-schedule
Here is what you lose:
Productivity
Standard Edition
Lower performance for non-aligned queries (in general)
Here is what stays the same:
Performance for partition-aligned queries and indexes
If you want to partition, you will probably want to do it on date or month, but in a continuous way. So don't make your key month(date). Make it (year(date) + '-' + month(date)). Never touch old partitions again.
If your old partitions are truly read-only, put each of them in a read-only file-group and exclude it from backup. That will give you really fast backup and smaller backups.
Because you only keep 90 days of data you probably want to have one partition per day. Every day at midnight you kill the last partition and alter the partition function to make room for a new day.
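A minimal sketch of what that sliding window can look like in T-SQL (the function and scheme names, the filegroup and the boundary dates are illustrative only):
CREATE PARTITION FUNCTION pf_StartTime (datetime)
AS RANGE RIGHT FOR VALUES ('2020-05-01', '2020-05-02', '2020-05-03');

CREATE PARTITION SCHEME ps_StartTime
AS PARTITION pf_StartTime ALL TO ([PRIMARY]);

-- Nightly maintenance: drop the oldest day and make room for a new one.
ALTER PARTITION FUNCTION pf_StartTime() MERGE RANGE ('2020-05-01');
ALTER PARTITION SCHEME ps_StartTime NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_StartTime() SPLIT RANGE ('2020-05-04');
In practice you would SWITCH the oldest partition out to a staging table and truncate it before the MERGE, so "killing" a day is a metadata-only operation.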
There is not enough information here to answer anything about hardware.
EDIT
This question has been closed on SO and reposted on ServerFault
https://serverfault.com/questions/333168/how-can-i-make-my-ssis-process-consume-more-resources-and-run-faster
I have a daily ETL process in SSIS that builds my warehouse so we can provide day-over-day reports.
I have two servers - one for SSIS and the other for the SQL Server database. The SSIS server (SSIS-Server01) is an 8-CPU, 32 GB RAM box. The SQL Server database server (DB-Server) is another 8-CPU, 32 GB RAM box. Both are VMware virtual machines.
In its oversimplified form, the SSIS reads 17 Million rows (about 9GB) from a single table on the DB-Server, unpivots them to 408M rows, does a few lookups and a ton of calculations, and then aggregates it back to about 8M rows that are written to a brand new table on the same DB-Server every time (this table will then be moved into a partition to provide day-over-day reports).
I have a loop that processes 18 months worth of data at a time - a grand total of 10 years of data. I chose 18 months based on my observation of RAM Usage on SSIS-Server - at 18 months it consumes 27GB of RAM. Any higher than that, and SSIS starts buffering to disk and the performance nosedives.
I am using Microsoft's Balanced Data Distributor to send data down 8 parallel paths to maximize resource usage. I do a union before starting work on my aggregations.
Here is the task manager graph from the SSIS server
Here is another graph showing the 8 individual CPUs
As you can see from these images, the memory usage slowly increases to about 27 GB as more and more rows are read and processed. However, the CPU usage is constant at around 40%.
The second graph shows that we are only using 4 (sometimes 5) CPUs out of 8.
I am trying to make the process run faster (it is only using 40% of the available CPU).
How do I go about making this process run more efficiently (least time, most resources)?
At the end of the day, all processing is bound by one of four factors
Memory
CPU
Disk
Network
The first step is to identify what the limiting factor is and then determine whether you can influence it (acquire more of or reduce usage of)
Component choices
The reason your server's memory runs out when you process more than 18 months at a time is related to why it takes so long to process. The Unpivot and Aggregate transformations are asynchronous components. Every row coming in from the source component has N bytes of memory allocated to it. That same bucket of data visits all the transformations, has their operations applied, and is emptied at the destination. That memory bucket is reused over and over again.
When an async component enters the arena, the pipeline is split. The bucket that was transporting that row of data must now be emptied into a new bucket to complete the pipeline. That copying of data between execution trees is an expensive operation in terms of execution time and memory (it could double it). It also reduces the opportunity for the engine to parallelize some of the execution, as it is waiting on the async operations to complete. A further slowdown comes from the nature of the transformations. The Aggregate is a fully blocking component, so all the data has to arrive and be processed before the transformation will release a single row to the downstream transformations.
If it's possible, can you push the pivot and/or the aggregate onto the server? That should decrease the time spent in the data flow as well as the resources consumed.
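To illustrate, pushing that work into the source query might look roughly like this (table and column names are made up; only the shape of the UNPIVOT plus GROUP BY matters):
SELECT   AccountId,
         MeasureName,
         SUM(MeasureValue) AS MeasureTotal
FROM     dbo.SourceTable
UNPIVOT (MeasureValue FOR MeasureName IN (Col01, Col02, Col03)) AS u
GROUP BY AccountId, MeasureName;
That way the engine returns already-aggregated rows instead of shipping 17 million wide rows across the wire to be unpivoted in the data flow.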
You can try increasing the number of parallel operations the engine can choose. See Jamie's article and SQL CAT's article.
If you really want to know where your time is being spent in the data flow, log the OnPipelineRowsSent event for an execution. Then you can use this query to rip it apart (after substituting sysssislog for sysdtslog90).
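If you just want a quick look before running the full query, something as simple as this against the log table lists the raw events (a sketch; the OnPipelineRowsSent message text itself contains the path name and row count):
SELECT source, starttime, message
FROM   dbo.sysssislog
WHERE  event = 'OnPipelineRowsSent'
ORDER  BY starttime;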
Network transfer
Based on your graphs, it doesn't appear that CPU or memory is taxed on either box. I believe you have indicated that the source and destination are on a single box but the SSIS package is hosted and processed on another box. You're paying a not-insignificant cost to transfer that data over the wire and back again. Is it possible to process the data on the source server? You'd need to allocate more resources to that box, and I'm crossing my fingers that it's a big beefy VM and that's not a problem.
If that's not an option, try setting the Packet Size property of the connection manager to 32767 and talk to network ops about whether jumbo frames are right for you. Both of those tips are in the Tune your Network section.
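For reference, in the connection manager's connection string that property ends up looking something like this (the database name and provider version here are placeholders):
Data Source=DB-Server;Initial Catalog=Warehouse;Provider=SQLNCLI11.1;Integrated Security=SSPI;Packet Size=32767;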
I suck at disk counters but you should be able to see if the wait types are disk related.
Have you previously tried breaking the 18-month batch further into 2 or 3 smaller batches? Unless, of course, your partitioning scheme requires all 18 months together in one partition - but then it would become a curious matter to see how and why you're partitioning the data with that scheme. It would still be okay to break it into batches if you have validations in place when you recreate your indexes/constraints.
In my experience, I once had to create a job that would process between 50 and 60 million records, and although the source was data files and the destination was a table on the server, breaking the load into batches proved to be faster than going all out at once.
Are you worried about the performance nosedive because this is a highly transactional database? If so, do you happen to have any data redundancy in place at your disposal?
[edit#01]
Re: Comment #01: Sorry if I'm being confusing; I meant that on the scheduled day for processing the records, it would be good to have a scheduled job run your SSIS package at certain intervals (so test how long one month takes to process, take the average, and give it a time buffer), handling a month or two at a time (if possible), and then just add a task at the top to compute/determine which month is to be processed.
Just an example:
< only assuming that two months take less than an hour to finish >
[scheduled run] : 01:00
[ssis task 01] get hour value of current time. if hour = 1 then set monthtoprocessstart = 1 and monthtoprocessend = 2
[ssis task 02 and so on] : work with data whose months are in the range (monthtoprocessstart and end for the year you are processing)
If this is more confusing just let me know so I can remove the edit.. Thanks..