Monitor SSIS dataflow with equivalent of SET STATISTICS - sql-server

I'm trying to compare statistics between SQL Server T-SQL and SSIS.
Say I have the following script:
INSERT INTO [myDB].dbo.finalTable WITH (TABLOCK)
(id, description, value)
SELECT a.id, a.description, b.value
FROM [anotherDB].dbo.sourceA a
INNER JOIN [anotherDB].dbo.sourceB b ON a.id = b.id
So, it just joins a couple of tables from a separate database and writes some data to finalTable.
If I want to look at scans, reads, writes, CPU time, elapsed time and IO I can just use:
SET STATISTICS IO ON
SET STATISTICS TIME ON
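i.e. I just wrap the statement, something like this (the IO and TIME figures then show up on the Messages tab):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

INSERT INTO [myDB].dbo.finalTable WITH (TABLOCK) (id, description, value)
SELECT a.id, a.description, b.value
FROM [anotherDB].dbo.sourceA a
INNER JOIN [anotherDB].dbo.sourceB b ON a.id = b.id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;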
Now suppose I take an entirely different approach and create an SSIS package with a Data Flow Task.
Then I add the source ([anotherDB]) and destination ([myDB]) connections.
Then I just use the same T-SQL as the source query and map everything.
How do I monitor the same statistics?
Thanks

Related

upsert in multiple large tables using ssis

I have 40 tables with different structures in one DB, on one server, that is being updated by a data provider.
I want to create an SSIS package that pulls data from that data provider DB and inserts, updates, or deletes (merges) the data into the Development, Test, UAT and Prod DBs.
The tables have 1m-3m rows and 20-30 columns each, and all the DBs are on the SQL Server platform on different servers.
The business requirement is to load the data every day at a particular time, and SSIS has to be used for this. I am new to SSIS and would like your suggestions for a better design.
I don't know about SSIS.
There are packaged solutions to sync databases.
In general, with just T-SQL, you do a delete, then an update, then an insert:
delete a
from TableA a
where not exists (select 1 from TableB b where b.PK = a.PK)

update a
set ...
from TableA a
join TableB b
on a.PK = b.PK

insert into TableA (columns)
select columns
from TableB b
where not exists (select 1 from TableA a where b.PK = a.PK)
It's a very broad question, so I can only give you pointers. Follow them and ask questions when you get stuck. I'll describe the process for one table; you'll have to do the same, in parallel, for the others:
Create a source OLE DB connection and a destination OLE DB connection. These will be used to copy data from the source into the staging tables where the actual data warehouse sits.
Create a Data Flow Task that simply copies the source DB into the staging tables. You'll have to implement incremental-loading logic; for instance, store the last source Id and load data from that Id onwards to the latest.
Once you have data in staging, create another Data Flow Task where you apply a Lookup transformation to insert and update data while loading into the destination table.
Deletion won't work there, so you'll have to apply deletions in a following step (preferably via an Execute SQL Task).
The above steps are guidelines; a rough sketch of the incremental source query and the deletion step follows below. You'll end up with multiple sequence containers working in parallel, each containing the above DFTs and working on separate tables.
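As a rough sketch only (the table and column names below are hypothetical, and Id is assumed to be an ever-increasing key), the incremental source query for the Data Flow Task and the deletion run by the Execute SQL Task could look something like this:
-- Hypothetical sketch: dbo.SourceTable, dbo.StagingTable and dbo.DestTable stand in for your real tables.
-- Source query for the incremental Data Flow Task; the ? parameter is mapped to
-- the last Id already loaded (kept in a control table or a package variable).
SELECT Id, Col1, Col2
FROM dbo.SourceTable
WHERE Id > ?;

-- Deletion step, run afterwards from an Execute SQL Task: remove destination rows
-- that no longer exist in the freshly loaded staging table.
DELETE d
FROM dbo.DestTable d
WHERE NOT EXISTS (SELECT 1 FROM dbo.StagingTable s WHERE s.Id = d.Id);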

Joining tables from two Oracle databases in SAS

I am joining two tables together that are located in two separate Oracle databases.
I am currently doing this in SAS by creating a libname connection to each database and then simply using something like the below.
libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;
proc sql;
create table t1 as
select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;
However the query is painfully slow. Can you suggest any better options to speed up such a query (short of going down the path of creating a database link)?
Many thanks for looking at this.
If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:
proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
select a.*, b.*
from dbase_a.schema_a.table1 a
inner join dbase_b.schema_b.table2 b
on a.id = b.id
);
disconnect from oracle;
quit;
I think that, in most cases, SAS attempts as much as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when the query references tables that are on different servers, or different databases on a system that does not allow cross-database queries, or when it contains SAS-specific functions that SAS cannot translate into something valid on the DBMS, SAS will indeed resort to 'downloading' the complete tables and processing the query locally, which can evidently be painfully inefficient.
The select is for all columns from each table, and the inner join is on the id values only. Because the join criteria evaluation is for data coming from disparate sources, the baggage of all columns could be a big factor in the timing because even non-match rows must be downloaded (by the libname engine, within the SQL execution context) during the ON evaluation.
One approach would be to:
Select only the id from each table
Find the intersection
Upload the intersection to each server (as a scratch table)
Utilize the intersection on each server as pass through selection criteria within the final join in SAS
There are a couple of variations depending on the expected number of id matches, the number of distinct ids in each table, or knowing which of table-1 and table-2 is SMALL and which is BIG. For a large number of id matches that need to be transferred back to a server you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of a SQL statement could be limited by the database, the SAS/ACCESS to Oracle engine, or the SAS macro system.
Consider a data scenario in which it has been determined that the potential number of matching ids would be too large for a construct IN (id-1, ..., id-n). In such a case the list of matching ids is dealt with in a tabular manner:
libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;
libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;
proc sql;
connect using SOURCE1 as PASS1;
connect using SOURCE2 as PASS2;
* compute intersection from only id data sent to SAS;
create table INTERSECTION as
(select id from connection to PASS1 (select id from table1))
intersect
(select id from connection to PASS2 (select id from table2))
;
* upload intersection to each server;
create table SCRATCH1.ids as select id from INTERSECTION;
create table SCRATCH2.ids as select id from INTERSECTION;
* compute inner join from only data that matches intersection;
create table INNERJOIN as select ONE.*, TWO.* from
(select * from connection to PASS1 (
select * from oracle-path-to-schema.table1
where id in (select id from oracle-path-to-scratch.ids)
)) as ONE
JOIN
(select * from connection to PASS2 (
select * from oracle-path-to-schema.table2
where id in (select id from oracle-path-to-scratch.ids)
)) as TWO
on ONE.id = TWO.id;
...
For the case of both table-1 and table-2 having very large numbers of ids that exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of ids. Techniques for determining the range criteria for each iteration are a tale for another day.

SQL Query too slow on second pc

We have a huge database with over 100 tables and millions of rows.
I created a stored procedure for a job, tested it locally and got 500'000 results in less than 10 seconds. I tested the same query on a second PC and waited about 1 hour for the same result.
The simple version of the query is:
select * from Table1
inner join Table2 on Table1.Table2Id = Table2.Id
where Table1.Segment = @segment
Table1 38'553'864 Rows
Table2 10'647'167 Rows
I looked at the execution plan on both machines and got different results (screenshots omitted; I could send the whole execution plans if needed).
The second PC is a virtual server (a test system). It has a lot more memory and more space. I also stopped every job on the server and ran only the SQL query, but got the same result, so there aren't any other SQL queries blocking the tables.
Later I created an index on the foreign key of Table1 and tried to use it, but it didn't improve the query.
Does anyone have an idea where the problem could be and how I could solve it?
It would take a while to create an execution plan for both queries, but here are a few steps that already helped a lot. Thanks guys for your help.
The statistics on the tables on the second PC were from September of last year; we don't use the query that much on that server. Updating the statistics is a good first step.
https://msdn.microsoft.com/en-us/library/ms190397(v=sql.120).aspx
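For example, a minimal sketch of that step, assuming the dbo schema and that a full scan is affordable on the test server:
-- Sketch only: refresh the statistics on the two joined tables from the question.
UPDATE STATISTICS dbo.Table1 WITH FULLSCAN;
UPDATE STATISTICS dbo.Table2 WITH FULLSCAN;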
Another thing was to improve the SQL query itself. I removed the WHERE clause and added it as a condition on the first inner join, so it filters the rows of the first table before joining the huge number of rows from the second. (The WHERE filters out about 90% of the first table, and Table3 is really small.)
select * from Table1
inner join Table3 on Table1.Segment = @segment
and Table1.Table3Id = Table3.Id
inner join Table2 on Table1.Table2Id = Table2.Id
The next step: I created a SQL Server Agent job which rebuilds all the indexes, so they are up to date.
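A minimal sketch of what such a job can run per table, assuming offline rebuilds are acceptable on the test system:
-- Sketch only: rebuild all indexes on the tables involved in the join.
ALTER INDEX ALL ON dbo.Table1 REBUILD;
ALTER INDEX ALL ON dbo.Table2 REBUILD;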
It's already a lot better, but I'm still open to other inputs.

How can I update statistics in SQL Server 2012 without using sp_updatestats?

I have read that if I use the command:
EXEC sp_updatestats
That this creates statistics based on an estimated 20,000 rows per table. I am not sure what this means, as I have many tables with fewer than 20 rows.
Can someone give me advice on whether there's another, more accurate way to update statistics that does not involve entering a command for every table?
Why do you want to collect statistics in a different way than Microsoft recommends?
more accurate way to update statistics that will not involve my entering a command for every table
This command updates statistics for all tables in the current database, so you don't need to enter a command for every table:
EXEC sp_updatestats;
You can also use the UPDATE STATISTICS command. I don't know the advantages of UPDATE STATISTICS over sp_updatestats, and I think you can use either of them.
It is a good way to collect up-to-date statistics for your data, but be aware that it can be a heavy operation and can require a lot of server resources. If possible, I recommend collecting statistics when most users are not working with the data.
Other maintenance solutions (like rebuilding and reorganizing indexes) can be found in this post.
The following query shows the list of tables whose statistics need updating. You can use a cursor over the result of this query and update the statistics of each table that needs it; a sketch of such a cursor follows after the explanation below.
SELECT SchemaName, ObjectName, StatisticName, [RowCount], UpdatedCount
FROM (
SELECT SCHEMA_NAME(o.schema_id) AS SchemaName,
OBJECT_NAME(o.object_id) AS ObjectName,
s.Name AS StatisticName,
rows AS [RowCount],
modification_counter AS UpdatedCount,
modification_counter * 100.0 / rows AS UpdatePercent,
rows * -0.00001846153+19.538461 AS threshold
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) b
INNER JOIN sys.objects o ON o.object_id = s.object_id
WHERE OBJECTPROPERTY(o.object_id,'IsUserTable')=1 AND rows > 0
)z
WHERE z.UpdatePercent > 20
OR (z.[RowCount]>=25000 AND z.[RowCount]<=1000000 AND z.UpdatePercent>2 AND z.UpdatePercent > z.threshold)
OR (z.[RowCount]>1000000 AND z.[RowCount]<=10000000 AND z.UpdatePercent>1)
OR (z.[RowCount]>10000000 AND z.[RowCount]<=20000000 AND z.UpdatePercent>0.5)
OR (z.[RowCount]>20000000 AND z.[RowCount]<=30000000 AND z.UpdatePercent>0.25)
ORDER BY z.UpdatePercent
When you set Auto Update Statistics (ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS ON), SQL Server automatically updates a table's statistics once roughly 20% of its rows have been modified. That seems fine at first glance, but 20% of a small table is very different from 20% of a large table. In other words, if your table has 100 rows, the statistics are refreshed after about 20 rows change; but if your table has 100,000,000 rows, SQL Server only refreshes the statistics after about 20,000,000 rows have changed, and accumulating that many changes takes a very long time. A small table can wait until 20% of its rows have changed, but a large table really needs its statistics updated when only about 1% of its rows have changed. The query above lists the tables that need their statistics updated according to their row counts and the number of updated rows.
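As mentioned above, a cursor can drive UPDATE STATISTICS from a query like this. Below is a minimal sketch; the staleness filter is simplified to "any modifications at all", so substitute the threshold logic from the query above as needed:
-- Sketch only: loop over user tables whose statistics look stale and refresh them.
DECLARE @schema sysname, @table sysname, @sql nvarchar(max);

DECLARE stats_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT SCHEMA_NAME(o.schema_id), OBJECT_NAME(o.object_id)
    FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) p
    INNER JOIN sys.objects o ON o.object_id = s.object_id
    WHERE OBJECTPROPERTY(o.object_id, 'IsUserTable') = 1
      AND p.modification_counter > 0;   -- simplified staleness test

OPEN stats_cur;
FETCH NEXT FROM stats_cur INTO @schema, @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'UPDATE STATISTICS ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N';';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM stats_cur INTO @schema, @table;
END

CLOSE stats_cur;
DEALLOCATE stats_cur;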

Nested pass-through queries?

I have an ODBC connection to a SQL Server database, and because I'm returning large record sets with my queries, I've found that it's faster to run pass-through queries than native Access queries.
But I'm finding it hard to write and organize my queries because, as far as I know, I can't save several different pass-through queries and join them in another pass-through query. I have read-only access to this database, so I can't save stored procedures in SQL Server and then reference them in the pass-through.
For example, suppose I want to get only those entries with the maximum value of o_version from the following query:
select d.o_filename,d.o_version,parent.o_projectname
from dms_doc d
left join
dms_proj p
on
d.o_projectno=p.o_projectno
left join
dms_proj parent
on
p.o_parentno=parent.o_projectno
where
p.o_projectname='ABC'
and
lower(left(right(d.o_filename,4),3))='xls'
and
charindex('xyz',lower(d.o_filename))=0
I want to get only those entries with the maximum value of d.o_version. Ordinarily I would save this as a query called, e.g., abc, and then write another query abcMax:
select * from abc
inner join
(select o_filename,o_projectname,max(o_version) as maxVersion from abc
group by o_filename,o_projectname) abc2
on
abc.o_filename=abc2.o_filename
and
abc.o_projectname=abc2.o_projectname
where
abc.o_version=abc2.maxVersion
But if I can't store abc as a query that can be used in the pass-through query abcMax, then not only do I have to copy the entire body of abc into abcMax several times, but if I make any changes to the content of abc, then I need to make them to every copy that's embedded in abcMax.
The alternative is to write abcMax as a regular Access query that calls abc, but that will reduce the performance because the query is now being handled by ACE instead of SQL Server.
Is there any way to nest stored pass-through queries in Access? Or is creating stored procedures in SQL Server the only way to accomplish this?
If you have (or can get) permission to create temporary tables on the SQL Server then you might be able to use them to some advantage. For example, you could run one pass-through query to create a temporary table with the results from the first query (vastly simplified, in this example):
CREATE TABLE #abc (o_filename NVARCHAR(50), o_version INT, o_projectname NVARCHAR(50));
INSERT INTO #abc SELECT o_filename, o_version, o_projectname FROM dms_doc;
and then your second pass-through query could just reference the temporary table
select * from #abc
inner join
(select o_filename,o_projectname,max(o_version) as maxVersion from #abc
group by o_filename,o_projectname) abc2
on
#abc.o_filename=abc2.o_filename
and
#abc.o_projectname=abc2.o_projectname
where
#abc.o_version=abc2.maxVersion
When you're finished you can run a pass-through query to explicitly delete the temporary table
DROP TABLE #abc
or SQL Server will delete it for you automatically when your connection to the SQL Server closes.
For anyone still needing this info:
Pass-through queries allow the use of CTE (WITH) queries, just as with Oracle SQL. This is similar to creating multiple select queries, but much faster and more efficient, and without the clutter and confusion of “stacked” SELECT queries, since you can see all the underlying queries in one view.
Example:
With Prep AS (
SELECT A.name,A.city
FROM Customers AS A
)
SELECT P.city, COUNT(P.name) AS clients_per_city
FROM Prep AS P
GROUP BY P.city
