How to split source table records in SSIS - sql-server

The source and target table structures, with data, are given below. The source table has a transaction type column, and the target table rows are defined based on this column. Suppose the source table data are as follows:
On the first row, SalesTranId=1 and TranType=monthly. Because this is a monthly transaction, the target table will be filled with 30 rows, each with the value 500/30 ≈ 16.67, as shown below.
When the source TranType=Yearly, the target table must have 365 rows based on that one source row.
How can this be done in an SSIS package?
Source Table:
Target table:
SSIS package:

I agree with Tab and Nick on this, but you are adamant about doing it in SSIS.
You have to make some assumptions in order to make my logic work:
Monthly translates to 30, quarterly to 90, and yearly to 365.
Import your source.
Add a Script Component and create a second output whose columns match your destination.
Add the following code:
// Determine the divisor based on the transaction type
int div = 0;
switch (Row.TranType.ToLower())
{
    case "monthly":
        div = 30;
        break;
    case "quarterly":
        div = 90;
        break;
    case "yearly":
        div = 365;
        break;
}

// Emit one output row per day, splitting the amount evenly across them
for (int i = 1; i <= div; i++)
{
    destBuffer.AddRow();
    destBuffer.SalesTranID = Row.SalesTranID;
    destBuffer.TranType = Row.TranType;
    destBuffer.TranAmt = Row.TranAmt / div;
}

If you have to do this in the data flow, you will need to use a Script Component.
Personally, I would send the data as-is to a staging table and do the splitting in a stored procedure as you move it from the staging table to the final destination table. It will perform faster.
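For example, a minimal T-SQL sketch of that staging-table split (the staging and destination table names here are assumptions, not objects from the question):

-- Expand each staged transaction into one row per day, splitting the amount evenly
INSERT INTO dbo.TargetTable (SalesTranID, TranType, TranAmt)
SELECT  s.SalesTranID,
        s.TranType,
        s.TranAmt / d.DayCount AS TranAmt
FROM    dbo.StagingTable AS s
CROSS APPLY (SELECT CASE LOWER(s.TranType)
                        WHEN 'monthly'   THEN 30
                        WHEN 'quarterly' THEN 90
                        WHEN 'yearly'    THEN 365
                    END AS DayCount) AS d
JOIN    (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
         FROM sys.all_objects) AS nums
        ON nums.n <= d.DayCount;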

Related

Presto: How to read an entire S3 bucket that is partitioned into sub-folders?

I need to read, using Presto, an entire dataset from S3 that sits in "bucket-a". But inside the bucket, the data was saved in sub-folders by year. So I have a bucket that looks like this:
Bucket-a>2017>data
Bucket-a>2018>more data
Bucket-a>2019>more data
All of the above data belongs to the same table but is saved this way in S3. Notice that there is no data in bucket-a itself, only inside each folder.
What I have to do is read all the data from the bucket as a single table, adding the year as a column or partition.
I tried it this way, but it didn't work:
CREATE TABLE hive.default.mytable (
    col1 int,
    col2 varchar,
    year int
)
WITH (
    format = 'json',
    partitioned_by = ARRAY['year'],
    external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
and also
CREATE TABLE hive.default.mytable (
    col1 int,
    col2 varchar,
    year int
)
WITH (
    format = 'json',
    bucketed_by = ARRAY['year'],
    bucket_count = 3,
    external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
Neither of the above worked.
I have seen people write to S3 with partitions using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders, as a single table.
Thanks.
If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
As it is, your folders do not follow the convention, so you need to register each one individually as a partition using the system.register_partition procedure (which will be available in Presto 330, about to be released). (The alternative to register_partition is to run the appropriate ADD PARTITION in the Hive CLI.)
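A rough sketch of what those calls look like (the schema name and locations below are assumptions based on the question; check the Hive connector documentation for the exact signatures in your version):

-- Register each year folder as a partition of the (partitioned) table
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2018'], 's3://bucket-a/2018/');
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2019'], 's3://bucket-a/2019/');

-- If the folders were instead named year=2017/ etc., a single call would discover them
CALL system.sync_partition_metadata('default', 'mytable', 'ADD');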

Table insertions using stored procedures?

(Submitting for a Snowflake User, hoping to receive additional assistance)
Is there a faster way to perform table insertions using a stored procedure?
I started building a stored procedure with the aim of inserting a million or so rows of test data into a table for load testing.
I got to the stage shown below and set the iteration value to 10,000.
It took over 10 minutes to run 10,000 iterations, inserting a single integer into the table on each iteration.
Yes, I am using an XS warehouse, but even if this were increased to the maximum size, this is far too slow to be of any use.
--build a test table
CREATE OR REPLACE TABLE myTable
(
    myInt NUMERIC(18,0)
);

--test a JavaScript stored procedure using a while loop, with the intention of inserting multiple rows (millions) into a table for load testing
CREATE OR REPLACE PROCEDURE usp_LoadTable_test()
RETURNS float
LANGUAGE javascript
EXECUTE AS OWNER
AS
$$
    // set the number of iterations
    var maxLoops = 10;
    // set the row pointer
    var rowPointer = 1;
    // set the INSERT sql statement
    var sql_insert = 'INSERT INTO myTable VALUES(:1);';
    // insert the first value
    sf_startInt = rowPointer + 1000;
    resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
    // loop through to insert all other values
    while (rowPointer < maxLoops)
    {
        rowPointer += 1;
        sf_startInt = rowPointer + 1000;
        resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
    }
    return rowPointer;
$$;

CALL usp_LoadTable_test();
So far, I've received the following recommendations:
Recommendation #1
One thing you can do is use a "feeder table" containing 1,000 or more rows instead of INSERT ... VALUES, e.g.:
INSERT INTO myTable SELECT <some transformation of columns> FROM "feeder table"
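A minimal sketch of that pattern (the feeder table name and the value expression are illustrative assumptions):

-- Build a feeder table of 1,000 rows once
CREATE OR REPLACE TABLE feeder AS
SELECT ROW_NUMBER() OVER (ORDER BY SEQ8()) AS n
FROM TABLE(GENERATOR(ROWCOUNT => 1000));

-- Each INSERT now adds 1,000 rows in a single statement
INSERT INTO myTable (myInt)
SELECT n + 1000 FROM feeder;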
Recommendation #2
When you perform a million single row inserts, you consume one million micropartitions - each 16MB.
That 16 TB chunk of storage might be visible on your Snowflake bill ... Normal tables are retained for 7 days minimum after drop.
To optimize storage, you could define a clustering key and load the table in ascending order with each chunk filling up as much of a micropartition as possible.
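For instance, a clustering key on the example table could be declared like this (a sketch only; whether it pays off depends on the load pattern and table size):

-- Cluster the table on the integer column so ordered loads pack micro-partitions tightly
ALTER TABLE myTable CLUSTER BY (myInt);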
Recommendation #3
Use data generation functions that work very fast if you need sequential integers: https://docs.snowflake.net/manuals/sql-reference/functions/seq1.html
Any other ideas?
This question was also asked at the Snowflake Lodge some weeks ago.
If, given the answers you received there, you still feel it is unanswered, then perhaps hint at why.
If you just want a table with a single column of sequence numbers, use GENERATOR() as in #3 above. Otherwise, if you want more advice, share your specific requirements.
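For illustration, a minimal sketch of the GENERATOR() approach applied to the table above (the row count and starting value are assumptions):

-- Insert one million sequential integers in a single set-based statement
INSERT INTO myTable (myInt)
SELECT ROW_NUMBER() OVER (ORDER BY SEQ8()) + 1000
FROM TABLE(GENERATOR(ROWCOUNT => 1000000));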

SSIS package: how to do a foreach or for loop to auto-increment a value for a field insert?

First of all, I have never attempted something like this in SSIS, and I am very new to SSIS package development.
I need to build a component in my package that will run through a table of data (say 80 rows) and set a field titled DisplayOrder to an auto-incremented number. The catch is that one of the records HAS to be set to 0 and the rest of the records set to the auto-incremented number.
As for code, I am not even sure what code or screenshots to attach to this question.
I finally figured it out and there is no need for a loop.
Create a SQL Task to clear the linked Table.
Script Used
DELETE FROM [Currency].[ExchangeRates]
Create a SQL Task to clear the main table.
Script Used
DELETE FROM [Currency].[CurrencyList]
Load the values into the main table.
Actions Used
Load values from XML Source
Dump values to [ExchangeRates] Table
Create a SQL Task to load the Values from the main table to the linked table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT  [er].[TargetCurrency] AS [CurrencyCode],
        [er].[TargetName] AS [CurrencyName],
        [er].[ID] AS [ExchangeRateID],
        ROW_NUMBER() OVER (ORDER BY [er].[TargetName]) AS [DisplayOrder]
FROM    [Currency].[ExchangeRates] AS [er]
ORDER BY [CurrencyName]
Create a SQL Task to load a new record to the main table for use as DisplayOrder 0.
Script Used
INSERT INTO [Currency].[ExchangeRates] ([Title], [Link], [Description], [PubDate], [BaseCurrency], [TargetCurrency], [TargetName], [ExchangeRate])
VALUES ('1 USD = 1 USD',
        'http://www.floatrates.com/usd/usd/',
        '1 U.S. Dollar = 1 U.S. Dollar',
        (SELECT TOP 1 [PubDate] FROM [Currency].[ExchangeRates]),
        'USD',
        'USD',
        'United States Dollar',
        '1')
Create a SQL Task to reference the newly created record from the main table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT  [er].[TargetCurrency] AS [CurrencyCode],
        [er].[TargetName] AS [CurrencyName],
        [er].[ID] AS [ExchangeRateID],
        0 AS [DisplayOrder]
FROM    [Currency].[ExchangeRates] AS [er]
WHERE   [er].[TargetCurrency] = 'USD'

SQLBulkCopy: Does column count make a difference?

I tried to search but didn't find an answer to a relatively simple thing. I have a CSV that doesn't have all the columns that exist in my database table, and the auto-increment primary key is missing from the CSV as well.
All I did was read the CSV into a DataSet and then run traditional SQLBulkCopy code to copy the first table of the DataSet into the database table. But it gives me the following error:
The given ColumnMapping does not match up with any column in the source or destination.
My code for the bulk copy is:
using (SqlBulkCopy blkcopy = new SqlBulkCopy(DBUtility.ConnectionString))
{
    blkcopy.EnableStreaming = true;
    blkcopy.DestinationTableName = "Project_" + this.ProjectID.ToString() + "_Data";
    blkcopy.BatchSize = 100;

    // Map each source column to the destination column of the same name
    foreach (DataColumn c in ds.Tables[0].Columns)
    {
        blkcopy.ColumnMappings.Add(c.ColumnName, c.ColumnName);
    }

    blkcopy.WriteToServer(ds.Tables[0]);
    blkcopy.Close();
}
I added the mappings as a test, but removing the mapping part makes no difference. If the mappings are removed, it tries to match columns by ordinal position, and since the column counts differ, the data types and values end up mismatched. And yes, the column names in the CSV do match those in the table, and they are in the same case.
EDIT: I changed the mapping code to compare the column names against the live DB. For this I simply run a SQL SELECT query to fetch one record from the database table and then do the following:
foreach (DataColumn c in ds.Tables[0].Columns)
{
    if (LiveDT.Columns.Contains(c.ColumnName))
    {
        blkcopy.ColumnMappings.Add(c.ColumnName, c.ColumnName);
    }
    else
    {
        log.WriteLine(c.ColumnName + " doesn't exist in final table");
    }
}
I would dump the CSV results into a staging SQL table... and then do a simple insert from the staging table to the production table.
Also try a simple import of the CSV into a SQL table; maybe there are some empty/invalid columns within the CSV file.
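A minimal sketch of that staging-to-production insert (the table and column names are placeholders, not the poster's schema), listing only the columns the CSV actually supplies so the identity primary key fills itself in:

-- Insert only the columns present in the CSV; the identity PK is generated automatically
INSERT INTO dbo.ProductionTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM dbo.StagingTable;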
I once had this problem and the cause was a difference in the case of the column names. One of the columns was "Id", but in the DB it was "id".

Correct method of deleting over 2100 rows (by ID) with Dapper

I am trying to use Dapper to support my data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK values of the rows to delete, so I want to do a delete from myTable where Id in @listToDelete.
The problem is that if my server goes down for even 6 minutes, then I have over 2,100 rows to delete.
Since Dapper takes my @listToDelete and turns each item into a parameter, my call to delete fails. (Causing my data purging to get even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at table-valued parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system and it needs to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be dropped automatically if you close the session)
(Assuming Dapper) code example:
conn.Open();

var columnName = "ID";

// Create the temp table that will hold the keys to delete
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));

// Bulk load all of the IDs into the temp table in one round trip
using (var bulkCopy = new SqlBulkCopy(conn))
{
    bulkCopy.BatchSize = ids.Count;
    bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);

    var table = new DataTable();
    table.Columns.Add(columnName, typeof (int));
    bulkCopy.ColumnMappings.Add(columnName, columnName);

    foreach (var id in ids)
    {
        table.Rows.Add(id);
    }

    bulkCopy.WriteToServer(table);
}

// ...or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable WHERE Id IN
                             (SELECT {0} FROM #{0}s)", columnName));

conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
To get this code working, I went dark side.
Since Dapper makes my list into parameters, and SQL Server can't handle a lot of parameters (I have never needed even double-digit parameter counts before), I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper expects a list of objects with the parameter as a property, so in the above case a list of objects with Id as a property will work.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i => new { Id = i }).ToList());
This will work.
