There is a pair of structurally identical databases with 80 tables each: the source (Oracle) and the target (SQL Server).
There is a PowerShell 7.x script that processes the data: it reads a table from the source, does some simple processing in a local DataTable variable, and writes the results to the target. Rinse, repeat, times 80.
The source is read using System.Data.OracleClient.OracleConnection and the target is populated with System.Data.SqlClient.SqlBulkCopy.
79 of the tables are relatively small and fit entirely in memory, so I am loading them using the System.Data.DataTable.Load() method.
One table, however, is very wide (each record contains an XML CLOB of a dozen or so megabytes) and needs to be processed row by row, i.e. a single row is loaded from the source, processed, and written to the target.
The question is: how can I loop through individual records coming from a System.Data.OracleClient.OracleConnection and pull them into a local DataTable object? I was looking at the System.Data.DataTable.LoadDataRow() method, but it doesn't seem to do what I need here.
Instead of loading a DataTable, just call ExecuteReader and pass the resulting DataReader to SqlBulkCopy.WriteToServer.
If you want to use LoadDataRow, still use ExecuteReader; inside the while ($rdr.Read()) loop, use $rdr.GetValues to copy the current row into an object array, which you can then pass to LoadDataRow.
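A minimal sketch of the streaming approach, assuming the providers named in the question; the connection strings and the $tableName variable are placeholders, not names from the original script:

$srcConn = [System.Data.OracleClient.OracleConnection]::new($oracleConnectionString)
$srcConn.Open()
$cmd = $srcConn.CreateCommand()
$cmd.CommandText = "SELECT * FROM $tableName"
$rdr = $cmd.ExecuteReader()     # yields one row at a time, nothing is buffered locally

$bulk = [System.Data.SqlClient.SqlBulkCopy]::new($sqlConnectionString)
$bulk.DestinationTableName = $tableName
$bulk.BulkCopyTimeout = 0       # long-running copy; disable the timeout
$bulk.EnableStreaming = $true   # stream large values instead of buffering them
$bulk.WriteToServer($rdr)       # WriteToServer accepts an IDataReader directly

$rdr.Close()
$bulk.Close()
$srcConn.Close()

If each row needs processing before it is written, replace the single WriteToServer($rdr) call with a loop that keeps a one-row DataTable in flight:

$dt = [System.Data.DataTable]::new()
for ($i = 0; $i -lt $rdr.FieldCount; $i++) {
    [void]$dt.Columns.Add($rdr.GetName($i), $rdr.GetFieldType($i))  # clone the reader's schema
}
$values = [object[]]::new($rdr.FieldCount)
while ($rdr.Read()) {
    [void]$rdr.GetValues($values)          # copy the current row into the array
    [void]$dt.LoadDataRow($values, $true)  # append it to the single-row table
    # ... process the row in $dt here ...
    $bulk.WriteToServer($dt)
    $dt.Clear()                            # keep memory flat
}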
In my SSIS Data Flow I create a derived column (DT_WSTR) based on a concatenation of two other columns. I want to save the maximum length of the values in this column in a variable (in SQL it would be a MAX(LEN(COLUMN))). How is this done?
Add another Derived Column after your Derived Column that calculates the length of the computed column. Let's call it ColumnLength:
LEN(COLUMN)
Now add a Multicast transformation. One path from here will go on to the "rest" of your data flow. A new path will lead to an Aggregate transformation. There, specify you want the maximum.
Now - what do you want to do with that information?
Write to a table -> OLE DB Destination
Report to the log -> Script Task that fires an Information event
Use elsewhere in the package -> Recordset destination, and then a Foreach Loop to pop the one row, one value out of it
Something else?
Here is a sample data flow, assuming you chose option 3 (Recordset destination).
I need to create two variables in my SSIS package: objRecordset of type Object and MaxColumnLength of type Int32.
When the data flow finishes, all my data will have arrived in my table (represented by the Script Component) and my aggregated maximum length will flow into the Recordset destination, which uses my variable objRecordset.
To get the value out of the ADO.NET recordset and into our single variable, we need to "shred the recordset". Google that term and you'll find many, many examples.
My control flow will look something like this:
The Foreach Loop Container (ADO enumerator) consumes every row in our recordset, and we specify that our variable MaxColumnLength maps to the 0th element of each row.
Finally, I put a Sequence Container in there so I can get a breakpoint to fire. We see the value of my max-length variable is 15, which matches my source query:
SELECT 'a' As [COLUMN]
UNION ALL SELECT 'ZZZZZZZZZZZZZZZ'
I believe this addresses the problem you have asked.
As a data warehouse practitioner, I would encourage you to rethink your approach to the lookups. Yes, the 400-character column is going to wreak havoc on your memory, so "math it out": use the cryptographic functions available to you to compute a fixed-width, effectively unique key for that column, and then work only with that key.
SELECT
CONVERT(binary(20), HASHBYTES('SHA1', MyBusinessKeys)) AS BusHashKey
FROM
dbo.MyDimension;
Now you always have exactly 20 bytes, and SHA1 is extremely unlikely to generate duplicate values.
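If you ever need to compute the same key outside the database (for example in a script that prepares the lookup data), here is a minimal sketch; $myBusinessKeys is a placeholder, and the Unicode encoding below assumes the column is NVARCHAR (for VARCHAR you would use the matching single-byte encoding instead):

$sha1 = [System.Security.Cryptography.SHA1]::Create()
$keyBytes = [System.Text.Encoding]::Unicode.GetBytes($myBusinessKeys)  # NVARCHAR = UTF-16LE
$busHashKey = $sha1.ComputeHash($keyBytes)  # 20 bytes, same as HASHBYTES('SHA1', N'...')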
I have been searching the internet for a solution to my problem but cannot seem to find any info. I have a single large text file (10 million rows), and I need to create an SSIS package to load these records into different tables based on the transaction group assigned to each record: Tx_Grp1 records go into the Tx_Grp1 table, Tx_Grp2 records into the Tx_Grp2 table, and so forth. There are 37 different transaction groups in the single delimited text file, and records appear in the file in the order in which they actually occurred (by time). Also, each transaction group has a different number of fields.
Sample data file
date|tx_grp1|field1|field2|field3
date|tx_grp2|field1|field2|field3|field4
date|tx_grp10|field1|field2
.......
Any suggestion on how to proceed would be greatly appreciated.
This task can be solved with SSIS; it just takes some experience. Here are the main steps and discussion:
Define a Flat File data source for your file, describing all columns. A possible problem here: the data types of the fields differ depending on the tx_grp value. If this is the case, I would declare all fields as sufficiently long strings and convert their types later in the dataflow.
Create an OLE DB connection manager for the DB where you will store the results.
Create a main dataflow where you will process the file, and add a Flat File Source.
Add a Conditional Split to the output of the Flat File Source, and define as many filters and outputs as you have transaction groups (a sketch of the dispatch logic follows these steps).
For each transaction group output, add a Data Conversion for the fields if necessary. Note that you cannot change the data type of an existing column; if you need to cast a string to an int, create a new column.
Add an OLE DB Destination for each destination table. Connect it to the proper transaction group output, and map the fields.
Basically, you are done. Test the package thoroughly on a test DB before using it on a production DB.
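Purely as an illustration of what the Conditional Split does (SSIS implements this inside the dataflow, and the file name transactions.txt is made up), the dispatch logic boils down to:

$groups = @{}   # tx_grp value -> rows destined for the table of the same name
foreach ($line in [System.IO.File]::ReadLines('transactions.txt')) {
    $fields = $line -split '\|'
    $grp = $fields[1]   # the second field carries the group, per the sample data
    if (-not $groups.ContainsKey($grp)) {
        $groups[$grp] = [System.Collections.Generic.List[object]]::new()
    }
    $groups[$grp].Add($fields)
}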
I am loading a large set (tens of thousands) of CSV files into a single staging SQL Server table, using the standard SSIS approach.
The vast majority of the source CSV files have an identical column structure (order, set of columns, data types). There are around 140 columns altogether.
However, in certain (<1%) cases a source file will be lacking some columns (I know exactly which columns they are, and there are three possible combinations of missing columns). This is by design i.e. this is a valid business scenario (meh).
Can I somehow create a "virtual" column (filled with NULL/empty/blank values) for a source CSV connection if (and only if) that column does not exist in the physical source CSV file?
I know I can read the CSV header with a C# scripting component and create multiple source connections, re-directing to the right data flow based on the existence (or lack) of certain columns, but I am hoping for a more "elegant" solution, with just a single CSV data source "smart" enough to "artificially" add the blank columns that are missing from the source file.
For simplicity let's assume that the full column set is:
ID;C1;C2;C3
And that C3 is missing occasionally i.e. some CSV files are:
ID;C1;C2
Any hints welcome.
No, there is no "smart" CSV data source built in to SSIS.
You are certainly going to need a script component, but instead of a Script Task outside the dataflow that directs the control flow to the correct dataflow, you can simply create one dataflow that has a Script Component as its data source. The script component reads the CSV currently being imported and, if the column in question is missing, supplies it with NULL or default values (the padding logic is sketched below).
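A minimal sketch of that padding logic, using the ID;C1;C2;C3 layout from the question (the file name input.csv is a placeholder; in the real package this would live inside the Script Component rather than in a standalone script):

$fullColumnSet = 'ID', 'C1', 'C2', 'C3'
$lines = Get-Content 'input.csv'
$header = $lines[0] -split ';'   # whatever columns this particular file has

foreach ($line in $lines | Select-Object -Skip 1) {
    $values = $line -split ';'
    $row = @{}
    for ($i = 0; $i -lt $header.Count; $i++) { $row[$header[$i]] = $values[$i] }
    foreach ($col in $fullColumnSet) {
        if (-not $row.ContainsKey($col)) { $row[$col] = $null }   # the "virtual" NULL column
    }
    # $row now always exposes ID, C1, C2 and C3 -> hand it to the output buffer
}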
Do you know how to transfer only new records between two different databases (e.g. Oracle and MSSQL) using SSIS? There is no problem transferring new data between two tables in the same database and server, but is it possible to do such an operation between completely different servers and databases?
PS: I know about the solution using a Lookup, but it is not very efficient if you need to check and add a lot of records (50k and more) several times per day. I would like to operate on new data only.
You have several options:
Timestamp based solution
If you have a column which stores the insertion time in the source system, you can select only the records created since the last load. With the same logic you can transfer modified records too; just update that timestamp column whenever a record changes.
Sequence based solution
If there is a sequence in the source table, you can load the new records based on that sequence: query the last value from the destination system, then load everything larger than that value (see the sketch after this summary).
CDC based solution
If you have CDC (Change Data Capture) in your source system, you can track the changes and you can load them based on the CDC entries.
Full load
This is the most resource-hungry solution: you have to copy all the data from the source to the destination. If you do not have any column which marks the new records, you will have to use this solution.
You have several options to achieve this:
TRUNCATE the destination table and reload it from source
Use a Lookup component to determine which records are missing
Load all data from source to a temporary table and write a query which retrieves the new/changed records.
Summary
If you have at least one column which marks new/modified records, you can use it to implement a differential/incremental load with SSIS. If you have no way to tell which columns/rows have changed, you have to load (or at least query) all of them.
There is no single-query (INSERT ... SELECT) solution across multiple servers that avoids transferring all the data. (Please note that a multi-server query using Linked Servers still transfers the data from the source system.)
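For the sequence-based option, here is a minimal cross-server sketch. The connection strings and table names are made up, SQL Server is assumed on both ends for brevity (with an Oracle source you would use the Oracle provider and its :Last bind syntax), and in SSIS the same thing is done with an Execute SQL Task that stores the value in a package variable used to parameterize the source query:

# Step 1: ask the destination for the highest Id it already has.
$dst = [System.Data.SqlClient.SqlConnection]::new($destinationConnectionString)
$dst.Open()
$cmd = $dst.CreateCommand()
$cmd.CommandText = 'SELECT ISNULL(MAX(Id), 0) FROM dbo.TargetTable'
$last = [int]$cmd.ExecuteScalar()
$dst.Close()

# Step 2: pull only the newer rows from the source, parameterized.
$src = [System.Data.SqlClient.SqlConnection]::new($sourceConnectionString)
$src.Open()
$cmd = $src.CreateCommand()
$cmd.CommandText = 'SELECT * FROM dbo.SourceTable WHERE Id > @Last'
[void]$cmd.Parameters.AddWithValue('@Last', $last)
$rdr = $cmd.ExecuteReader()
# ... feed $rdr to SqlBulkCopy against the destination, as in the first answer ...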
What about variables? Is it possible to share the same variable between different databases and servers in SSIS?
I would like to take the last id number from a destination table and pass it to a query against the source table (on a different server!).
I can set a variable in a database scope like this:
DECLARE @Last int
SET @Last = (SELECT TOP 1 Id FROM dbo.Table_1 ORDER BY Id DESC)
SELECT *
FROM dbo.Table_2
WHERE ID > @Last;
However, it works only between two tables in the same database (as a single SQL command). I can create a variable for the entire SSIS package via Variables --> Add Variable, but I don't know whether it is possible to use that variable in a similar way to the above: to hold the last id from the destination table and pass it to the source server as the lower limit for the data.
I have a CSV file where there is a header row and data rows in the same file.
I want to get information from both rows during the same load.
What is the easiest way to do this?
i.e. file example - Import.CSV:
2,11-Jul-2011
Mr,Bob,Smith,1-Jan-1984
Ms,Jane,Doe,23-Apr-1981
In the first row there is a count of the number of rows and the date of transmission.
The second and subsequent rows contain the actual data, in this case Title, FirstName, LastName, Birthdate.
SQL Server Integration Services Conditional Split Transformation should do it.
I wonder what you would do with that info in the pipeline. However, there is only one way to read it in a single pass (take a look at the notes/limitations at the end, and the sketch after them):
Create a data flow
Add a Flat File Source component and set it up the way you want
Add a Script Component to count the rows (incrementing a mycounter column)
Add a Conditional Split transformation where the condition is mycounter == 0
One path from the Conditional Split will carry the first row of the file (mycounter == 0) and the other path will carry the rest of the rows (2 in your example).
Note #1: the file source can set only one (metadata) data type for each column. This means that if the first column of your data is a string (Mr, Ms, ...) then you have to set it as a string data type in the source. Otherwise, if you set it as an integer (DT_Ix), it will fail as soon as it encounters a row with string data (Mr, Ms, ...) in the first column of the file. This applies to all columns, not just the first one.
Note #2: SSIS will see only the number of columns you told it to expect. This means that you have to have the same number of columns in EACH row. Otherwise, you have a ragged CSV file and need to take a different approach (search the Internet), and those solutions also require a different CSV layout.
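Outside SSIS, the one-pass logic amounts to the following (a sketch only, using the Import.CSV sample from the question):

$counter = 0
foreach ($line in [System.IO.File]::ReadLines('Import.CSV')) {
    $fields = $line -split ','
    if ($counter -eq 0) {
        $rowCount, $sentDate = $fields            # header row: row count + transmission date
    } else {
        $title, $first, $last, $birth = $fields   # detail row: Title, FirstName, LastName, Birthdate
    }
    $counter++
}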
Answers in the following links explain how to load parent-child data from a flat file into an SQL Server database when both parent and child rows exist in the same file next to each other.
How do I split flat file data and load into parent-child tables in database?
How to load a flat file with header and detail data into a database using SSIS package?