I have a SQL query of more than 200 lines of code that follows the steps below. I need to run it every day to generate Table_A. I have a new requirement to build an SSIS package that performs the same process and creates Table_A. The current SQL process is, roughly:
DROP TABLE Table_A;
SELECT ... INTO Table_A FROM (SELECT ... FROM Table_B UNION ALL SELECT ... FROM Table_C UNION ALL SELECT ... FROM Table_D) AS src;
Key factors: Table_B, Table_C, Table_D - I need to pull 20 of the 40 columns from each of these three tables. The column names vary between the tables, so I need to
rename and standardise the column names and certain data types so that each maps to a single column in Table_A.
This is already set up in the SQL query, but I need to know what the best practice is and how to translate it into SSIS. Should I use an "Execute SQL Task" in the control flow,
or a Data Flow Task with an OLE DB source and OLE DB destination?
Execute SQL Task is what you're going to want. The Execute SQL Task is designed to run an arbitrary query that may or may not return a result set. You've already done the hard work of getting your code working correctly so all you need to do is define a Connection Manager (likely an OLE DB) and paste in your code.
In this case, SSIS is going to be nothing more than a coordinator/execution framework for your existing SQL process. And, speaking as someone who's written more than a few SSIS packages, that's perfectly acceptable.
A Data Flow Task, I find, is more appropriate when you need to move the data from tables B, C, and D into a remote database or you need to perform transformation logic on them that isn't easily done in TSQL.
A Data Flow Task will not support creating the table at run-time. All SSIS tasks perform a validation check - either on package start or it can be delayed until the specific task begins. One of the checks a Data Flow Task is going to perform is "does the target table exist (and does the structure match my cached copy)?"
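For illustration, the statement pasted into the Execute SQL Task keeps the same shape as your current script. A minimal sketch, with made-up column names standing in for the 20 renamed/standardised columns:

IF OBJECT_ID('dbo.Table_A') IS NOT NULL
    DROP TABLE dbo.Table_A;

SELECT  src.BusinessKey,
        CAST(src.Amount AS decimal(18, 2)) AS Amount,   -- standardise data types here
        src.LoadDate
INTO    dbo.Table_A
FROM
(
    SELECT B_Key AS BusinessKey, B_Amount AS Amount, B_Date AS LoadDate FROM dbo.Table_B
    UNION ALL
    SELECT C_Key,                C_Amount,           C_Date             FROM dbo.Table_C
    UNION ALL
    SELECT D_Key,                D_Amount,           D_Date             FROM dbo.Table_D
) AS src;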
Related
I need to load a flat file with an SSIS package executed in a scheduled job in SQL Server 2016, but it's taking TOO MUCH TIME (2-3 hours) just to load the data from the source, then it needs an extra 2-3 hours for the sort and filter, and then a similar amount of time to load the data into the target. The file only has about a million rows and is roughly 3 GB. This is driving me crazy, because it is affecting the performance of my server.
SSIS package:
-My package is just a Data Flow Task with a Flat File Source and an OLE DB Destination, that's all.
-The data access mode is set to FAST LOAD.
-There is just one index on the table. My destination table has 32 columns.
Input file:
The input text file has more than 32 columns; the surrogate key data may not be unique and the referenced columns' data may not be unique, so I need to filter them.
I face two problems: the SSIS Flat File Source takes a huge amount of time to load the data, and the sort and filter take just as long. What should I do?
If you want it to run fast use this pattern:
Load the data exactly as-is into a staging table
Optionally add indexes to the staging table afterwards
Use SQL to perform whatever processing you need (e.g. SELECT DISTINCT, GROUP BY into the final table)
You can do this kind of thing in SSIS, but you need to tune it properly, etc.; it's just easier to do it inside a database, which is already well optimised for this.
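A rough sketch of that pattern, assuming made-up staging/final table names and treating SurrogateKey as the dedupe key:

-- Data Flow Task: Flat File Source -> OLE DB Destination (fast load) into dbo.Staging_File, with no Sort component.
-- Optional Execute SQL Task after the load:
CREATE INDEX IX_Staging_File_Key ON dbo.Staging_File (SurrogateKey);

-- Execute SQL Task that does the filtering set-based instead of in the data flow:
INSERT INTO dbo.FinalTable (SurrogateKey, Col1, Col2)
SELECT SurrogateKey, MIN(Col1) AS Col1, MIN(Col2) AS Col2
FROM   dbo.Staging_File
GROUP BY SurrogateKey;   -- GROUP BY / SELECT DISTINCT here replaces the SSIS Sort + filter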
Some Suggestions
1. DROP and Recreate indexes
Add two Execute SQL Tasks, one before and one after the Data Flow Task: the first drops the index and the second recreates it once the Data Flow Task has executed successfully.
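For example (the index, table, and column names here are placeholders):

-- First Execute SQL Task (before the Data Flow Task):
DROP INDEX IX_Destination ON dbo.DestinationTable;

-- Second Execute SQL Task (after the Data Flow Task completes successfully):
CREATE NONCLUSTERED INDEX IX_Destination ON dbo.DestinationTable (SomeColumn);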
2. Adjust the buffer size
You can read more about buffer sizing (the DefaultBufferSize and DefaultBufferMaxRows properties of the Data Flow Task) in the Technet articles listed under the helpful links below
3. Remove duplicates in SQL server instead of Sort Components
Try to remove the Sort components, and add an Execute SQL Task after the Data Flow Task which runs a query similar to this:
;WITH x AS
(
SELECT col1, col2, col3, rn = ROW_NUMBER() OVER
(PARTITION BY col1, col2, col3 ORDER BY id)
FROM dbo.tbl
)
DELETE x WHERE rn > 1;
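(The columns in the PARTITION BY should be whatever defines a duplicate in your file; the ORDER BY id assumes the staging table has an identity or similar tiebreaker column that decides which copy survives.)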
4. Use a script component instead of Sort
You have to implement logic similar to the following answer:
SSIS: Flat File Source to SQL without Duplicate Rows
Some helpful Links
Integration Services: Performance Tuning Techniques
Data Flow Performance Features
Can I delete database duplicates based on multiple columns?
Removing duplicate rows (based on values from multiple columns) from SQL table
Deleting duplicates based on multiple columns
Remove duplicates based on two columns
I am running an SSIS package that contains many (7) reads from a single flat file uploaded from an external source. There is consistently a deadlock in every environment (Test, Pre-Production, and Production) on one of the data flows that uses a Slowly Changing Dimension to update an existing SQL table with both new and changed rows.
I have three groups coming off the SCD:
-Inferred Member Updates Output goes directly to an OLE DB Update command.
-Historical Attribute goes to a derived column box that sets a delete date, then to an update OLE DB command, then to a union box where it unions with the last group, New Output.
-New Output goes into a union box along with the Historical output, then to a derived column box that adds an update/create date, then inserts the values into the same SQL table as the Inferred Member Updates OLE DB command.
The only error I am getting in my log looks like this:
"Transaction (Process ID 170) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
I could put a (NOLOCK) hint into the OLE DB commands, but I have read that this isn't the way to go.
I am using SQL Server 2012 Data Tools to investigate and edit the Package, but I am unsure where to go from here to find the issue.
I want to get it out there that I am a novice in terms of SSIS programming... with that out of the way, any help would be greatly appreciated, even if it is just pointing me to a place I haven't looked for help.
Adding an index on the column used in the WHERE condition may resolve your issue. With the index in place, the transactions execute faster, which reduces the chance of a deadlock.
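A minimal sketch, using a placeholder dimension table and business key column (use whatever column your SCD update commands actually filter on):

CREATE NONCLUSTERED INDEX IX_Dim_BusinessKey
    ON dbo.MyDimension (BusinessKey);   -- the column in the WHERE clause of the OLE DB update commands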
I am creating a process that automates testing the consistency in database tables across servers.
I have a test_master table which contains following columns:
test_id, test_name, test_status
and Job_master table which contains following columns:
jid, test_id, job_name, job_type, job_path, job_status,
server_ip, db, error_description, op_table, test_table,
copy_status, check_status
There can be multiple jobs for a particular test. The jobs are logical jobs (not SQL Agent jobs); they can be a script, a procedure, or an SSIS package.
So I have made an ssis package :
In pre-execute, it picks up the tests which aren't done yet.
Each job runs and writes the name of the live table into the op_table field.
In post-execute, the live tables are copied to a test database environment and the table name is put into test_table; testing will be performed there only.
Right now the jobs run in a loop... Is there a way to let the jobs run in parallel, since they are independent of each other?
Can I write a SQL procedure for this inside the loop, or is there some other way I can do this?
Any new ideas are welcome...
Thank you very much.. :)
Very roughly, I would put the approach as below:
SQL bits
Wrap whatever SQL code is part of a "job" into a stored proc. Inside this proc, populate a variable that holds the SQL bit and execute it using dynamic SQL. Update the job status in the same proc, and use a TRY-CATCH-THROW construct.
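A minimal sketch of such a proc, using the Job_master columns from the question and assuming job_path holds the statement to run (that mapping is an assumption):

CREATE PROCEDURE dbo.usp_RunJob
    @jid INT
AS
BEGIN
    DECLARE @sql NVARCHAR(MAX);

    BEGIN TRY
        -- Build/fetch the SQL for this job; here it is simply read from Job_master.job_path.
        SELECT @sql = job_path
        FROM   dbo.Job_master
        WHERE  jid = @jid;

        EXEC sys.sp_executesql @sql;

        UPDATE dbo.Job_master SET job_status = 'SUCCESS' WHERE jid = @jid;
    END TRY
    BEGIN CATCH
        UPDATE dbo.Job_master
        SET    job_status = 'FAILED',
               error_description = ERROR_MESSAGE()
        WHERE  jid = @jid;

        THROW;   -- re-raise so the caller (the SSIS package) also sees the failure
    END CATCH
END;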
Packages
Populate the names of the packages in an SSIS string variable in delimited fashion (or use an object variable, whatever suits you). Then, in a Script Task, iterate through the list of packages and fire them using the dtexec command. To update the job status, it's best to have the invoked packages take care of it themselves. If that is not an option, use a try-catch construct and update the job statuses accordingly. This is a helpful link.
Do a check on the job_type variable at the top of the SSIS package (using precedence constraints) and route each job into the correct 'block'.
I'm writing an SSIS package that imports data from an Oracle database. There's a chance some rows are already in the destination table, and seeing that there's no task in SSIS 2008 that allows me to check this, my idea is to create a temporary table and use a field in the destination table to filter the temporary table's rows that I can actually insert.
I understand that local and global temporary tables vanish when they go out of scope. So, my question is, when my SSIS package goes on to the next task, will my temporary tables disappear?
In your connection manager(s), you need to set RetainSameConnection = True. This ensures that your connection doesn't close throughout the execution of the package - it keeps using the same spid.
You'll also need to set DelayValidation = True for many of your tasks (without seeing your package, I can't tell you all of them, but you can experiment). By default, SSIS tries to pre-validate all the steps in your package before executing. When you're using temp tables, if the temp table doesn't exist, this pre-validation will fail. By setting DelayValidation = True, you avoid the pre-validation.
Finally, you may need to do some weird stuff to get the package to recognize your temp table at design time - e.g. Execute the task that creates your temp table, then try to map your fields (assuming you're going to follow good practice and drop the temp table at the end of your package).
This article provides a good overview: http://www.mssqltips.com/sqlservertip/2826/how-to-create-and-use-temp-tables-in-ssis/
You should be able to accomplish this with either global or local temp tables.
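For instance, the first Execute SQL Task might run something along these lines (a minimal sketch; the global temp table name and columns are made up). With RetainSameConnection = True, the table is still available to the next task on the same spid:

IF OBJECT_ID('tempdb..##StagingRows') IS NOT NULL
    DROP TABLE ##StagingRows;

CREATE TABLE ##StagingRows
(
    SourceKey   INT           NOT NULL,
    SourceValue NVARCHAR(50)  NULL
);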
What you need is two OLE DB sources in the package: one for the Oracle table and one for the destination table. Then you sort each and add a Merge Join on the keys of the two sources. In the Merge Join, you do a left outer join, then add a Conditional Split. The Conditional Split will have "Existing" and "New" outputs.
The Existing condition will be !ISNULL(DestTableID). The New condition will be ISNULL(DestTableID).
From the Conditional Split, the Existing output goes to an OLE DB Command, which updates the destination table.
From the Conditional Split, the New output goes to an OLE DB Destination pointing at the destination table, which adds the new rows to it.
In SSIS 2008 I have a Script Task that checks if a table exists in a database and sets a boolean variable.
In my Data Flow I do a Conditional Split based on that variable, so that I can do the appropriate OLE DB Commands based on whether that table exists or not.
If the table does exist, the package runs correctly. But if the table doesn't exist, SSIS checks the metadata on the OLE DB Command that isn't being run, determines the table isn't there, and fails with an error before doing anything.
There doesn't seem to be any way to catch or ignore that error (e.g. I tried increasing MaximumErrorCount and various different ErrorRowDescription settings), or to stop it ever validating the command (ValidateExternalMetadata only seems to affect the designer, by design).
I don't have access to create stored procedures to wrap this kind of test, and OLE DB Commands do not let you use IF OBJECT_ID('') IS NOT NULL prefixes on any statements you're doing (in this case, a DELETE FROM TableName WHERE X = ?).
Is there any other way around this, short of using a script component to fire off the DELETE command row-by-row manually?
You can use a Script Component to execute a DELETE statement for each row in the input path, but that might be very slow depending on the number of rows to be deleted.
You can:
Store the PKs of the records that should be deleted in a database table (for instance: TBL_TO_DEL)
Add an Execute SQL Task with a SQL query that deletes records by joining TBL_TO_DEL to the table you want to delete records from
Put a precedence constraint on the path between your data flow and the Execute SQL Task (a constraint based on your variable)
This solution is much faster than deleting row by row; a rough sketch of the delete is below.
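A minimal sketch of that join delete, assuming TBL_TO_DEL holds the X values used in the question's DELETE statement:

DELETE t
FROM   dbo.TableName AS t
       INNER JOIN dbo.TBL_TO_DEL AS d
               ON d.X = t.X;

TRUNCATE TABLE dbo.TBL_TO_DEL;   -- clear the staged keys for the next run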
If for some reason you can't create a new table, check my answer on SSIS Pass Datasource Between Control Flow Tasks to see other ways to pass data to a following data flow, where you can use an OLE DB source and OLE DB command. Whichever way you choose, the key is the constraint that will or will not execute the following task (Execute SQL Task or data flow) depending on the value in the variable.
Note that the Execute SQL Task will not validate the query, and as such will fail at runtime if the constraint is satisfied and the table doesn't exist. If you use another data flow instead of the Execute SQL Task, set its DelayValidation property to true. This means the task will be validated just before it executes, not any earlier.