SSIS: execute tasks based on value in column in source - sql-server

We are moving data from Oracle to SQL Server via SSIS. The first steps are filling some (about 8 right now) Oracle tables with selections from the existing Oracle application tables.
The 8 Oracle tables must be filled sequentially, one after the other, but the filling has to be startable from SSIS either one table at a time (it's a lot of data and things go wrong!) or all in one go (again, one after the other). So we made 9 SSIS packages for this: 8 that fill the Oracle tables and 1 that calls those 8 packages sequentially. All 8 packages call Oracle procedures, contained in a single Oracle package, to fill the tables.
For 1 of the 8 Oracle tables we created parallel Oracle jobs (via the Oracle job scheduler) to fill the table, for performance's sake. The drawback of parallel execution through the Oracle job scheduler is that the 'mother' job finishes as soon as all child jobs are submitted (not completed). So another Oracle procedure records whether all child jobs have finished in an Oracle table/column/field (e.g. FINISHED = 1). Before the next SSIS package may run (to fill the next Oracle table), that field has to reach the expected value (e.g. FINISHED = 1).
I have searched quite a bit, but cannot find an A-to-Z solution for doing this in SSIS. Some say to use a Script Task, but they don't provide screenshots or an example. I am by no means an SSIS expert; I only know some basic stuff.
Thank you in advance!
PS: I am looking for a loop construct, because we don't know how long the Oracle scheduler jobs will take to complete (we test with 1,000 rows, up to 20 million).
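For what it's worth, the usual pattern here (an assumption on my part, not something stated in the question) is a For Loop container whose EvalExpression keeps looping while an SSIS variable such as User::Finished is still 0, with an Execute SQL Task inside the loop reading the flag. A minimal sketch of the status query, assuming a hypothetical Oracle status table JOB_STATUS:

    -- Hypothetical table/column names; adjust to whatever your Oracle procedure writes.
    -- Run by an Execute SQL Task inside a For Loop container; map the single-row
    -- result to an SSIS variable (e.g. User::Finished) and let the container's
    -- EvalExpression keep looping while @[User::Finished] == 0.
    SELECT NVL(MAX(FINISHED), 0) AS FINISHED
    FROM   JOB_STATUS
    WHERE  JOB_GROUP = 'FILL_TABLE_3';

The delay between polls can come from a second Execute SQL Task against SQL Server running WAITFOR DELAY '00:00:30', or from a one-line sleep in a Script Task.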

Related

SSIS locking table while updating it

I have an SSIS package which, when it runs, updates a table. It uses a staging table and subsequently a slowly changing dimension to load the data into the warehouse. We have set it up as a SQL Agent job and it runs every two hours.
The isolation level of the package is serializable. The database isolation level is read committed.
The issue is that when this job runs, it blocks the table, and therefore clients cannot run any reports; the reports just come back blank.
So what would be the best option for me to avoid this? Clients need to see the data, but meanwhile we need to update the table every two hours.
Using Microsoft SQL Server 2012 (SP3-GDR) (KB4019092) - 11.0.6251.0 (X64)
Thanks.
You're getting "lock escalation". It's a feature, not a bug. 8-)
SQL Server combines large numbers of smaller locks into a table lock to improve performance.
If INSERT performance isn't an issue, you can do your data load in smaller chunks inside of transactions and commit after each chunk.
https://support.microsoft.com/en-us/help/323630/how-to-resolve-blocking-problems-that-are-caused-by-lock-escalation-in
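As a rough illustration of the chunked approach (table and column names below are made up), the load might look like this:

    -- Load in chunks of 1000 rows, committing after each chunk, so locks are
    -- released regularly and escalation to a table lock is less likely.
    -- dbo.Staging, dbo.Warehouse and the Id column are hypothetical names.
    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        BEGIN TRANSACTION;

        INSERT INTO dbo.Warehouse (Id, Col1, Col2)
        SELECT TOP (1000) s.Id, s.Col1, s.Col2
        FROM   dbo.Staging AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Warehouse AS w WHERE w.Id = s.Id);

        SET @rows = @@ROWCOUNT;

        COMMIT TRANSACTION;
    END;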
Another option is to give your clients/reports access to a clone of your warehouse table.
Do your ETL into a table that no one else can read from, and when it is finished, switch the table with the clone.
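A minimal sketch of that swap (hypothetical table names), run in a short transaction once the ETL into the clone has finished:

    -- dbo.Report is what the clients read; dbo.Report_Clone was just loaded by the ETL.
    -- The renames are metadata-only, so readers are blocked only very briefly.
    BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.Report', 'Report_Old';
    EXEC sp_rename 'dbo.Report_Clone', 'Report';
    COMMIT TRANSACTION;

Repointing a synonym at the freshly loaded table is another variant of the same idea.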

Parallel operations in SQL Server running "sequentially"

There are n databases in our Windows application, all with the same schema.
We have a requirement to export the databases as mirrors, but only a subset of the data should be copied.
For this we follow these steps:
Create a new empty DB.
Run a script which inserts selected data into the DB based on a key.
After the insert we detach the DB.
Steps 1-3 are run for each database that needs to be shipped.
The inserts run in batches. Our plan is to run the copy in parallel for each database that needs to be exported.
When we run it on large data, it takes 6 minutes per database.
If I run the same script for 5 different databases in 5 different sessions simultaneously, the last script finishes after 30-32 minutes, which means that even though they run in parallel, the total time is the same as running them sequentially.
We disable all indexes and rebuild them at the end.
The databases are normally in the SIMPLE recovery model; we switch them to BULK_LOGGED for the load and back to SIMPLE afterwards.
We tried MAXDOP 0 and 1 -- no change.
The NOLOCK hint is used in all SELECT queries.
I want to understand what I should do to get the best performance, since these are 5 different databases being copied to 5 new databases and all operations should be independent of each other.
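Not an answer to the timing question, but to make the described flow concrete, here is the kind of per-database copy script I am picturing (every object name and the batch size are hypothetical placeholders):

    -- Hypothetical sketch of one per-database copy; all names are placeholders.
    USE TargetDb;

    ALTER DATABASE TargetDb SET RECOVERY BULK_LOGGED;           -- minimally logged load

    -- Disable only nonclustered indexes: disabling the clustered index would
    -- make the table unwritable.
    ALTER INDEX IX_BigTable_GroupKey ON dbo.BigTable DISABLE;

    DECLARE @GroupKey INT = 42;     -- the key that selects the subset to ship
    DECLARE @rows     INT = 1;

    WHILE @rows > 0
    BEGIN
        INSERT INTO dbo.BigTable (Id, GroupKey, Col1)
        SELECT TOP (50000) s.Id, s.GroupKey, s.Col1
        FROM   SourceDb.dbo.BigTable AS s
        WHERE  s.GroupKey = @GroupKey
          AND  NOT EXISTS (SELECT 1 FROM dbo.BigTable AS t WHERE t.Id = s.Id);

        SET @rows = @@ROWCOUNT;
    END;

    ALTER INDEX ALL ON dbo.BigTable REBUILD;                    -- rebuild re-enables the index

    ALTER DATABASE TargetDb SET RECOVERY SIMPLE;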

Using temporary tables in SSIS flow fails

I have an ETL process which extracts ~40 tables from a source database (Oracle 10g) to a SQL Server (2014 developer edition) Staging environment. My process for extraction:
Determine newest row in staging
Select all newer rows from source
Insert results into #TEMPTABLE
Merge results from #TEMPTABLE to Staging
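A minimal sketch of steps 3 and 4 (all names are hypothetical; in the real package the rows from Oracle arrive through a data flow, so the temp table is simply assumed to be filled here):

    -- Hypothetical shape of the delta; matches whatever the Oracle source query returns.
    CREATE TABLE #TEMPTABLE
    (
        Id           INT           NOT NULL,
        Col1         NVARCHAR(100) NULL,
        LastModified DATETIME2     NOT NULL
    );

    -- ... the data flow inserts the rows newer than MAX(LastModified) in staging ...

    -- Step 4: merge the delta into staging on the business key.
    MERGE dbo.Staging AS tgt
    USING #TEMPTABLE  AS src
        ON tgt.Id = src.Id
    WHEN MATCHED THEN
        UPDATE SET tgt.Col1         = src.Col1,
                   tgt.LastModified = src.LastModified
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Col1, LastModified)
        VALUES (src.Id, src.Col1, src.LastModified);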
This works on a package by package basis both from Visual Studio locally and executing from SSISDB on the SQL Server.
However, I am grouping my extract jobs into one master package for ease of execution and flow into the transform stage. Only about 5 of my packages use temporary tables; the others are all truncate-and-load, but I wanted to move more of them to this method. When I run the master package, anything using a temporary table fails. Because of the pretty large log files it's hard to pinpoint the actual error, but so far all it tells me is that the #TEMPTABLE can't be found and/or the status is VS_ISBROKEN.
Things I have tried:
Set all relevant components to delay validation = false
Master package has ExecuteOutOfProcess = true
Increased my tempdb capacity far exceeding my needs
A thought I had was the RetainSameConnection = true on my staging database connection - could this be the cause? I would try to create separate connections for each, but assumed that ExecuteOutOfProcess would take care of this for me.
EDIT
I created the following scenario:
Package A (Master package containing Execute Package Task references only)
Package B (Uses temp tables)
Package C (No temp tables)
Executing Package B on its own completes successfully. All temp table usage is contained within this package - there is no requirement for Package C to see the temp table created by Package B.
Executing Package C completes successfully.
Executing Package A: C completes successfully, B fails.
UPDATE
The workaround was to create a package-level connection for each package that uses temporary tables, thus ensuring that each package held its own connection. I have raised a Connect issue with Microsoft, as I believe that since the parent package opens the connection, it should be inherited and retained throughout any child packages.
Several suggestions for your case.
Set RetainSameConnection=true. This will allow you to work safely with temp tables in SSIS packages.
I would not use ExecuteOutOfProcess; it will increase your RAM footprint, since every child package starts in its own process, and decrease performance by adding process start-up lag. It was used in 32-bit environments to overcome the 2 GB limit, but on x64 it is no longer necessary.
Child package execution does not inherit connection object instances from its parent, so the same connection will not be shared across all of your child packages.
SSIS packages with temp table operations are more difficult to debug (less obvious), so pay attention to testing.
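To make the RetainSameConnection point concrete: with that property set to true on the connection manager, every task in the package reuses one session, so a local temp table created by one task is still visible to later tasks on the same connection. A sketch with made-up names (both statements run against the same connection manager; set DelayValidation = true on the downstream tasks so they don't validate before the table exists):

    -- Execute SQL Task 1 (connection manager has RetainSameConnection = true):
    CREATE TABLE #TEMPTABLE (Id INT NOT NULL, Col1 NVARCHAR(50) NULL);

    -- A later Execute SQL Task or data flow destination on the SAME connection
    -- manager: the temp table still exists because the session was never closed.
    INSERT INTO #TEMPTABLE (Id, Col1) VALUES (1, N'example');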

SSIS Package Hangs Randomly on Execution

I'm working with an SSIS package that itself calls multiple SSIS packages and hangs periodically during execution.
This is a once-a-day package that runs every evening and collects new and changed records from our census databases and migrates them into the staging tables of our data warehouse. Each dimension has its own package that we call through this package.
So, the package looks like
Get current change version
Load last change version
Identify changed values
a-z - Move changed records to staging tables (Separate packages)
Save change version for future use
All of those are Execute SQL Tasks except for the record moves, which are twenty-some Execute Package Tasks (data move tasks) executed somewhat in parallel (max four at a time).
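If the "change version" steps are based on SQL Server change tracking (an assumption on my part; the question doesn't say which mechanism is used), steps 1 and 3 roughly correspond to queries like these (table and key names are hypothetical):

    -- Step 1: capture the current change version before extracting.
    DECLARE @current_version BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

    -- Step 2: in the real package this is loaded from wherever step 5 saved it last run.
    DECLARE @last_version BIGINT = 0;

    -- Step 3: identify rows changed since the previously saved version.
    SELECT ct.PersonId, ct.SYS_CHANGE_OPERATION, p.*
    FROM   CHANGETABLE(CHANGES dbo.Person, @last_version) AS ct
    LEFT JOIN dbo.Person AS p
           ON p.PersonId = ct.PersonId;

    -- Step 5 (after the moves succeed): persist @current_version for the next run.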
The strange part is that it almost always hangs when executed by SQL Server Agent (using a proxy user) or dtexec, but never when I run the package through Visual Studio. I've added logging so that I can see where it stops, but it's inconsistent.
We didn't see any of this while working in our development / training environments, but the volume of data is considerably smaller. I wonder if we're just doing too much at once.
I may - to test - execute the tasks serially through SQL Server Agent to see if it's a problem with a package calling a package, but I'd rather not, because we have a relatively short window in the evening to do this for seven database servers.
I'm slightly new to SSIS, so any advice would be appreciated.
Justin

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data under one group key is totally independent from the data under other group keys. Reports are built for only 1 group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively by parallel processes which add data to different groups, but I don't need real-time reports. One update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is reduce the time it takes to create a report without significantly changing the technologies or the overall design of the solution.
Possible solutions
What I was thinking about this is:
Splitting the one database into several databases, 1 database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a read-only slave database. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes (can I actually do that, or do I have to write custom code?).
I don't want to change the database to NoSQL, because I would have to rewrite all the SQL queries and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but I could get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process -- a db function (or script) run via your job scheduler of choice -- that refreshes the data every N minutes.
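In PostgreSQL (9.3+) the lightest-weight form of this is a materialized view refreshed on a schedule; a sketch with hypothetical table and column names:

    -- Precomputed counts/sums per group key and dimension (all names hypothetical).
    CREATE MATERIALIZED VIEW report_summary AS
    SELECT groupkey,
           dimension_id,
           COUNT(*)    AS row_count,
           SUM(amount) AS total_amount
    FROM   facts
    GROUP  BY groupkey, dimension_id;

    CREATE INDEX ON report_summary (groupkey);

    -- Run every N minutes from your scheduler of choice (cron, pgAgent, ...).
    -- On newer versions, REFRESH MATERIALIZED VIEW CONCURRENTLY (which requires
    -- a unique index on the view) lets reports keep reading during the refresh.
    REFRESH MATERIALIZED VIEW report_summary;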
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.
Tune the report query. Use EXPLAIN. Avoid procedures when you can do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade your Postgres version.
Run VACUUM.
Out of these 4, only 1 will require significant changes in the application.
