I have several processes that run nightly that import data from an AS400 into SQL Server using a linked server. Here is a sample:
truncate table TABLENAME

insert into TABLENAME
(
    BT_TID,
    BT_SEQ,
    BT_DES,
    BT_HRS,
    BT_MOD,
    BT_MSN,
    BT_STK
)
select
    BT_TID,
    BT_SEQ,
    BT_DES,
    BT_HRS,
    BT_MOD,
    BT_MSN,
    BT_STK
FROM OPENQUERY([ODBCSOURCE], 'select
    BT_TID,
    BT_SEQ,
    BT_DES,
    BT_HRS,
    BT_MOD,
    BT_MSN,
    BT_STK
from XXXX.XXXX.TABLENAME')
Some of these processes take HOURS to run.
Is there a better way of doing this? I looked into BCP, but didn't understand it.
According to Microsoft, linked servers are slow by nature (due to processing and network limitations).
If SSIS is installed on your SQL Server, it should be a better option.
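If SSIS is not an option right away, one incremental tweak is a TABLOCK hint on the target table so the insert can be minimally logged. This is only a sketch of that idea, and it assumes the target is a heap and the database is in the SIMPLE or BULK_LOGGED recovery model; the linked-server round trip itself is still the main cost:

-- after the truncate, add a TABLOCK hint on the insert target
insert into TABLENAME with (tablock)
    (BT_TID, BT_SEQ, BT_DES, BT_HRS, BT_MOD, BT_MSN, BT_STK)
select BT_TID, BT_SEQ, BT_DES, BT_HRS, BT_MOD, BT_MSN, BT_STK
FROM OPENQUERY([ODBCSOURCE], 'select BT_TID, BT_SEQ, BT_DES, BT_HRS, BT_MOD, BT_MSN, BT_STK
    from XXXX.XXXX.TABLENAME')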
I have a data table in Oracle that has the following columns:
Record_ID, Run_ID, PO_Type, PO_NUM, DateTime
When a PO is created, all the columns are populated except for Run_ID:
Record_ID, Run_ID, PO_Type, PO_Num, DateTime
---------------------------------------------------
1374, , NEW_PO , 12345 , 20211117123456
1375, , NEW_PO , 12346 , 20211117123545
These records are currently exported out of our system via SSIS and imported into a SQL Server database. This is where they are assigned a Run_ID that is unique to each export run (everything exported at one time gets the same Run_ID):
RECORDID, SYSTEM, RUN_ID, PO_TYPE, PO_NUM, DATETIME
---------------------------------------------------------
1374, ORDER , 5078 , NEW_PO , 12345 , 20211117123456
1375, ORDER , 5078 , NEW_PO , 12346 , 20211117123545
I then need to write this Run_ID back to the Oracle database and update the PO_Type from NEW_PO to Processed_PO, so my Oracle table would then look like this:
Record_ID, Run_ID, PO_Type , PO_Num, DateTime
--------------------------------------------------------
1374, 5078 , Processed_PO , 12345 , 20211117123456
1375, 5078 , Processed_PO , 12346 , 20211117123545
The problem I am having is that this all needs to happen within the same SSIS pull, as it is the only tool available to me. I don't know how to begin to tackle this, so any advice would be greatly appreciated.
Given your helpful additional information, I understand now that your concern is mostly about making sure that only the rows you extract are the ones you later update with the RUN_ID.
The simplest way I can see of doing this is to use the PO_Type column and introduce a new interim status such as 'In_Transit_PO'. I don't know your environment / data model, so this may or may not be feasible - maybe you have limitations on what you can enter here - but the SSIS package steps would then look something like this:
Update the Oracle rows you want:
update oracle_table set po_type = 'In_Transit_PO' where <your criteria>
Perform your extract using this status as the selection criteria.
Load the data into SQL Server.
Store the new RUN_ID in a user variable in the package.
Use the user variable to update the SQL Server rows ('?' maps to your defined package variable):
update SQL_PO set Run_ID = (?) where <your criteria>
Update RUN_ID_TRACKER to increment the next RUN_ID.
Use the user variable to update the Oracle rows by mapping it (exact syntax may be slightly different depending on which provider your package is configured to use):
update oracle_table set PO_Type = 'Processed_PO', RUN_ID = ? where PO_Type = 'In_Transit_PO'
Done this way, you allow new POs to be generated on the Oracle side while the load is running, but you ensure that only the rows you extracted are the ones you update with the RUN_ID. There are a couple of extra steps in the package, but they are each very simple. Not only that, in the event of errors in the process, you have a record of exactly which subset of records it was trying to process, making debugging easier.
Come to think of it, you could reduce the steps a bit by obtaining the run_id value and putting it in the variable before your load step; then you already have the value to include when you insert the rows - no need to do a secondary update on the SQL_PO table.
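As a rough illustration of that idea, here is a minimal T-SQL sketch of incrementing and capturing the next RUN_ID in one statement before the load. The Next_Run_ID column name is an assumption about your RUN_ID_TRACKER table; map the result to your package variable in an Execute SQL Task:

declare @RunID int;

-- increment and capture the next RUN_ID atomically in a single statement
update dbo.RUN_ID_TRACKER
set @RunID = Next_Run_ID = Next_Run_ID + 1;

select @RunID as Run_ID; -- single-row result set mapped to the package variable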
Long story short: we import data from a Firebird database into MS SQL. The data is loaded into a backup database, checks are done to ensure all is well, and then the names are switched so it becomes the live database.
To do this we set the database to single-user mode, run sp_renamedb, then set it back to multi-user.
Periodically this can fail if there is an open connection to either of the databases. I have adapted the script below to find any connections to the databases in question.
I just want to make sure this is a reliable way to get any connections. Once I have them, I can decide what to do with them.
SELECT DB_NAME(SP.dbid) AS Database_name ,
SP.spid AS SPIDS ,
SP.hostname AS HostName ,
SP.loginame AS LoginID
FROM sys.sysprocesses AS SP
WHERE (
DB_NAME(SP.dbid) = 'Database1'
OR DB_NAME(SP.dbid) = 'Database2'
)
GROUP BY DB_NAME(SP.dbid) ,
SP.spid ,
SP.hostname ,
SP.loginame
ORDER BY DB_NAME(SP.dbid);
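For reference, here is a rough sketch of what could be done with the results - generate KILL statements to review, or simply force connections off as part of the single-user switch (the database names are the same placeholders as above):

-- build KILL statements to review rather than running them blindly
SELECT 'KILL ' + CAST(SP.spid AS varchar(10)) + ';' AS KillCommand
FROM sys.sysprocesses AS SP
WHERE DB_NAME(SP.dbid) IN ('Database1', 'Database2');

-- or drop any existing connections while switching to single-user mode
ALTER DATABASE Database1 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;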
Cheers
I've inherited a database recently which contains thousands of stored procedures and functions, however most of them are deprecated and no longer in use.
I've started adding a piece of code to the stored procedures one at a time to notify me when they run, but this process is really quite manual.
Is there any way to start an audit, and see which stored procedures run in the next month or two without adding a piece of code to each stored procedure manually?
Thanks,
Eric
I believe you need to be on SQL Server 2005 SP2 or higher. In prior versions of SQL Server, the OBJECT_NAME function only accepts the object_id parameter (there is no optional database_id parameter).
Hopefully this should work for you:
SELECT DB_NAME(dest.[dbid]) AS 'databaseName'
, OBJECT_NAME(dest.objectid) AS 'procName'
, MAX(deqs.last_execution_time) AS 'last_execution'
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
WHERE dest.[TEXT] LIKE '%yourTableName%' -- replace
And dest.[dbid] = DB_ID() -- exclude ad-hocs
GROUP BY DB_NAME(dest.[dbid])
, OBJECT_NAME(dest.objectid)
ORDER BY databaseName
, procName
OPTION (MaxDop 1);
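Keep in mind that anything based on the plan cache only reflects what is still cached; plans that have been evicted (or a server restart) will drop off the list, so you may want to snapshot the results into a table on a schedule to cover your month or two. On SQL Server 2008 and later, sys.dm_exec_procedure_stats is another cache-based option that reports execution counts and last execution times directly - a minimal sketch:

SELECT OBJECT_NAME(ps.[object_id], ps.database_id) AS procName
    , ps.execution_count
    , ps.last_execution_time
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.last_execution_time DESC;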
I have an SSIS package--two data flow tasks, 8 components each, reading from two flat files, nothing spectacular. If I run it in BIDS, it takes reliably about 60 seconds. I have a sandbox DB server with the package running in a job which also takes reliably 30-60 seconds. On my production server, the same job with the same package takes anywhere from 30 seconds to 12 hours.
With logging enabled on the package, it looks like it bogs down--initially at least--in the pre-execute phase of one or the other (or both) data flow tasks. But I can also see the data coming in--slowly, in chunks, so I think it does move on from there later. The IO subsystem gets pounded, and SSIS generates many large temp files (about 150MB worth--my input data files are only about 24MB put together) and is reading and writing vigorously from those files (thrashing?).
Of note, if I point my BIDS instance of the package at the production server, it still only takes about 60 seconds to run! So it must be something with running dtexec there, not the DB itself.
I've already tried to optimize my package, reducing input row byte size, and I made the two data flow tasks run in series rather than in parallel--to no avail.
Both DB servers are running MSSQL 2008 R2 64-bit, same patch level. Both servers are VMs on the same host, with the same resource allocation. Load on the production server should not be that much higher than on the sandbox server right now. The only difference I can see is that the production server is running Windows Server 2008, while the sandbox is on Windows Server 2008 R2.
Help!!! Any ideas to try are welcome, what could be causing this huge discrepancy?
Appendix A
Here's what my package looks like…
The control flow is extremely simple:
The data flow looks like this:
The second data flow task is exactly the same, just with a different source file and destination table.
Notes
The completion constraint in the Control Flow is only there to make the tasks run serially to try and cut down on resources needed concurrently (not that it helped solve the problem)…there is no actual dependency between the two tasks.
I'm aware of potential issues with blocking and partially-blocking transforms (can't say I understand them completely, but somewhat at least) and I know the aggregate and merge join are blocking and could cause problems. However, again, this all runs fine and quickly in every other environment except the production server…so what gives?
The reason for the Merge Join is to make the task wait for both branches of the Multicast to complete. The right branch finds the minimum datetime in the input and deletes all records in the table after that date, while the left branch carries the new input records for insertion--so if the right branch proceeds before the aggregate and deletion, the new records will get deleted (this happened). I'm unaware of a better way to manage this.
The error output from "Delete records" is always empty--this is deliberate, as I don't actually want any rows from that branch in the merge (the merge is only there to synchronize completion as explained above).
See comment below about the warning icons.
If you have logging turned on, preferably to SQL Server, add the OnPipelineRowsSent event. You can then determine where it is spending all of its time. See this post. Your IO subsystem getting slammed and all of those temp files being generated is happening because you can no longer keep all the information in memory (due to your async transformations).
The relevant query from the linked article is the following. It looks at events in the sysdtslog90 table (SQL Server 2008+ users should substitute sysssislog) and performs some time analysis on them.
;
WITH PACKAGE_START AS
(
SELECT DISTINCT
Source
, ExecutionID
, Row_Number() Over (Order By StartTime) As RunNumber
FROM
dbo.sysdtslog90 AS L
WHERE
L.event = 'PackageStart'
)
, EVENTS AS
(
SELECT
SourceID
, ExecutionID
, StartTime
, EndTime
, Left(SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)), CharIndex(':', SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)) ) - 2) As DataFlowSource
, Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount
FROM
dbo.sysdtslog90 AS L
WHERE
L.event = 'OnPipelineRowsSent'
)
, FANCY_EVENTS AS
(
SELECT
SourceID
, ExecutionID
, DataFlowSource
, Sum(RecordCount) RecordCount
, Min(StartTime) StartTime
, (
Cast(Sum(RecordCount) as real) /
Case
When DateDiff(ms, Min(StartTime), Max(EndTime)) = 0
Then 1
Else DateDiff(ms, Min(StartTime), Max(EndTime))
End
) * 1000 As RecordsPerSec
FROM
EVENTS DF_Events
GROUP BY
SourceID
, ExecutionID
, DataFlowSource
)
SELECT
'Run ' + Cast(RunNumber As varchar) As RunName
, S.Source
, DF.DataFlowSource
, DF.RecordCount
, DF.RecordsPerSec
, Min(S.StartTime) StartTime
, Max(S.EndTime) EndTime
, DateDiff(ms, Min(S.StartTime)
, Max(S.EndTime)) Duration
FROM
dbo.sysdtslog90 AS S
INNER JOIN
PACKAGE_START P
ON S.ExecutionID = P.ExecutionID
LEFT OUTER JOIN
FANCY_EVENTS DF
ON S.SourceID = DF.SourceID
AND S.ExecutionID = DF.ExecutionID
WHERE
S.message <> 'Validating'
GROUP BY
RunNumber
, S.Source
, DataFlowSource
, RecordCount
, DF.StartTime
, RecordsPerSec
, Case When S.Source = P.Source Then 1 Else 0 End
ORDER BY
RunNumber
, Case When S.Source = P.Source Then 1 Else 0 End Desc
, DF.StartTime
, Min(S.StartTime);
You were able to use this query to discern that the Merge Join component was the lagging component. Why it performs differently between the two servers, I can't say at this point.
If you have the ability to create a table in your destination system, you could modify your process to have two data flows (and eliminate the costly async components).
The first data flow would take the Flat file and Derived columns and land that into a staging table.
You then have an Execute SQL Task fire off to handle the Get Min Date + Delete logic.
Then you have your second data flow querying from your staging table and snapping it right into your destination.
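As a rough sketch of what the Execute SQL Task could run in that version - the table and column names here (dbo.StagingTable, dbo.DestinationTable, LoadDate) are hypothetical placeholders:

-- delete everything in the destination on or after the earliest date in this load
DELETE d
FROM dbo.DestinationTable AS d
WHERE d.LoadDate >= (SELECT MIN(s.LoadDate) FROM dbo.StagingTable AS s);

The second data flow then simply copies the staged rows into the destination, with no blocking transforms in the pipeline.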
The steps below will help improve your SSIS performance.
Ensure connection managers are all set to DelayValidation (= True).
Ensure that ValidateExternalMetadata is set to False.
Set DefaultBufferMaxRows and DefaultBufferSize to correspond to the table's row sizes.
Use query hints wherever possible: http://technet.microsoft.com/en-us/library/ms181714.aspx
Problem Conditions
I have a very simple Oracle (11g) Stored Procedure that is declared like so:
CREATE OR REPLACE PROCEDURE pr_myproc(L_CURSOR out SYS_REFCURSOR)
is
BEGIN
OPEN L_CURSOR FOR
SELECT * FROM MyTable;
END;
This compiles correctly. The cursor contains col1, col2 and col3.
In SSRS, I have a Shared Data Source that uses the Oracle OLE DB Provider for Oracle 11g:
Provider=OraOLEDB.Oracle.1;Data Source=LIFEDEV
(Plus the user credentials).
What Works OK:
The stored procedure executes correctly in PL/SQL Developer.
The 'test connect' works fine in SSRS.
A query string of SELECT * FROM MyTable; with a Command Type of 'text' produces the correct fields in the SSRS report.
Using the .NET Oracle Provider instead of the Oracle OLE DB Provider.
What Fails:
If I change the Command Type to 'Stored Procedure' and enter 'pr_myproc', Visual Studio 2005 (Service Pack 2) simply hangs/crashes when I click 'OK'.
Does anyone have any knowledge/experience of this?
Any help would be most appreciated. Thanks.
FURTHER INFORMATION
I've modified the provider from the Oracle OLE DB Provider to the .NET Oracle Provider, and, magically, it works.
This would seem to indicate an issue with the Oracle provider.
Any more thoughts?
We got to the bottom of this.
On the environment where the procedure resided, we have a substantial data dictionary. The two providers use two different queries when looking up metadata.
Here is the one the Oracle OLE DB provider used, which took 10+ minutes:
select * from (select null PROCEDURE_CATALOG
, owner PROCEDURE_SCHEMA
, object_name PROCEDURE_NAME
, decode (object_type, 'PROCEDURE', 2, 'FUNCTION', 3, 1) PROCEDURE_TYPE
, null PROCEDURE_DEFINITION
, null DESCRIPTION
, created DATE_CREATED
, last_ddl_time DATE_MODIFIED
from all_objects where object_type in ('PROCEDURE','FUNCTION')
union all
select null PROCEDURE_CATALOG
, arg.owner PROCEDURE_SCHEMA
, arg.package_name||'.'||arg.object_name PROCEDURE_NAME
, decode(min(arg.position), 0, 3, 2) PROCEDURE_TYPE
, null PROCEDURE_DEFINITION
, decode(arg.overload, '', '', 'OVERLOAD') DESCRIPTION
, min(obj.created) DATE_CREATED
, max(obj.last_ddl_time) DATE_MODIFIED
from all_objects obj, all_arguments arg
where arg.package_name is not null
and arg.owner = obj.owner
and arg.object_id = obj.object_id
group by arg.owner, arg.package_name, arg.object_name, arg.overload ) PROCEDURES
WHERE PROCEDURE_NAME = '[MY_PROCEDURE_NAME]' order by 2, 3
More info can be found here