I am using SSIS Data Tools to create data extracts from a legacy system.
Our new system needs the files that it imports to be split into 5MB files.
Is there any way that I can split the output into separate files?
I'm thinking that because the data is already in the database, I can use a loop, or something similar, that will select a certain number of records at a time.
Any input appreciated!
If your source is SQL, use the Row_Number function against the table key to allocate a number per row e.g.
Row_number() OVER (Order by Customer_Id) as RowNumber
and then wrap your query in a CTE or make it a subquery with a WHERE clause that gives you the number of rows that will equate to a 5MB file, e.g.
WHERE RowNumber >= 5000 and RowNumber <10000
You will need to call this source query several times (with different RowStart and RowEnd values; a sketch of the full parameterized query follows the steps below), so it's probably best to:
Find number of total records in control flow and set a TotalRows parameter
Create a loop in your control flow
Set 3 parameters in your control flow to iterate through each set of records and store the data in separate files, e.g. the first loop would set
RowStart = 0
RowEnd = 5000
FileName = MyFile_[date]_0_to_4999
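As a sketch, the parameterized source query could look like this (assuming an OLE DB source, a Customers table keyed on Customer_Id, and the two ? placeholders mapped to the RowStart and RowEnd variables):
WITH Numbered AS (
    SELECT c.*,
           ROW_NUMBER() OVER (ORDER BY c.Customer_Id) AS RowNumber
    FROM dbo.Customers AS c
)
SELECT *
FROM Numbered
WHERE RowNumber >= ?   -- RowStart
  AND RowNumber < ?;   -- RowEnd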
I'm using SSIS with SQL Server 2016 to produce both text and Excel files (also version 2016). Some data flow tasks return more than a million rows of results. Writing to text is not an issue. However, Excel is limited to 1 million rows per sheet.
How do I configure the Excel Destination, or other component, to ensure more than a million rows are saved in the target workbook using multiple sheets as needed?
Sometimes, you need to push back to whoever has requested "exporting all the data to Excel" as it's just not an option. Is an analyst really going to be able to do anything with a spreadsheet with multi-million rows? No, no they will not.
Purely as an exercise of "how could I do this really bad thing"
(excluded from this exercise is a custom Script Destination as I didn't feel like writing code)
You must determine ahead of time what a reasonable upper bound on the number of sheets to be created is. You can either limit your worksheets to a million rows or hit the actual limit of 1,048,576 rows per sheet. Possibly 1,048,575 rows, because you want to repeat your header across sheets.
Whatever that maximum number of sheets is, N, you will need to create N Excel Destinations.
You'll need a ROW_NUMBER() function applied to your source data, so you'll have to use a custom query there
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum FROM dbo.MyTable;
The Conditional Split will use the modulo function to assign rows to their paths
RowNum % N == 1
RowNum % N == 2
...
RowNum % N == (N-1)
and the default output path is for == 0
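If you would rather have each sheet hold a contiguous block of rows instead of a round-robin spread, a variation (just a sketch, assuming the same dbo.MyTable source and a 1,048,575-row cap per sheet) is to bucket by integer division and split on that value instead:
SELECT *, (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) / 1048575 AS SheetNum FROM dbo.MyTable;
The Conditional Split conditions then become SheetNum == 0, SheetNum == 1, and so on.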
I have this data and I need to generate a query that will give the output below
You can do this kind of grouping of rows with 2 separate ROW_NUMBER()s: one over all the data, ordered by date, and a second one ordered by code and date. To separate the groups from the data, use the difference between these 2 row numbers. When it changes, it's a new block of data. You can then use that number in a GROUP BY and take the minimum / maximum dates for each block.
For the final layout you can use PIVOT or SUM + CASE; most likely you'll want a new ROW_NUMBER() for getting the rows aligned properly. Depending on whether you can have missing / non-matching data, you will probably need additional checks.
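A rough sketch of the grouping step (the table name and the Code / EventDate column names are assumptions, since the sample data isn't shown):
WITH Blocks AS (
    SELECT Code, EventDate,
           ROW_NUMBER() OVER (ORDER BY EventDate)
         - ROW_NUMBER() OVER (PARTITION BY Code ORDER BY EventDate) AS Grp
    FROM dbo.MyData
)
SELECT Code, MIN(EventDate) AS StartDate, MAX(EventDate) AS EndDate
FROM Blocks
GROUP BY Code, Grp
ORDER BY StartDate;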
I have a table called customers which contains around 1,000,000 records. I need to transfer all the records to 8 different flat files whose filenames increment a number, e.g. cust01, cust02, cust03, cust04, etc.
I've been told this can be done using a for loop in SSIS. Please can someone give me a guide to help me accomplish this.
The logic behind this should be something like "count number of rows", "divide by 8", "export that amount of rows to each of the 8 files".
To me, it will be more complex to create a package that loops through and calculates the amount of data and then queries the top N segments or whatever.
Instead, I'd just create a package with 9 total connection managers: one to your source database and then 8 identical Flat File Connection Managers using the pattern FileName1, FileName2, etc. After defining the first FFCM, just copy, paste and edit the actual file name.
Drag a Data Flow Task onto your Control Flow and, inside it, add an OLE DB/ADO.NET/ODBC source. Use a query, don't select the table, as you'll need something to partition the data on. I'm assuming your underlying RDBMS supports the ROW_NUMBER() function. Your source query will be
SELECT
MT.*
, (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 8 AS bucket
FROM
MyTable AS MT;
That query will pull back all of your data plus assign a monotonically increasing number from 1 to the row count, to which we then apply the modulo (remainder after dividing) operator. Modding the generated value by 8 guarantees that we will only get values from 0 to 7, endpoints inclusive.
You might start to get twitchy about the different number bases (base 0, base 1) being used here, I know I am.
Connect your source to a Conditional Split. Use the bucket column to segment your data into different streams. I would propose that you map bucket 1 to File 1, bucket 2 to File 2... finally with bucket 0 to file 8. That way, instead of everything being a stair step off, I only have to deal with end point alignment.
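With the query above, the split conditions are simply (one expression per output, letting the default output catch bucket 0 for File 8):
bucket == 1
bucket == 2
...
bucket == 7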
Connect each stream to a Flat File Destination and boom goes the dynamite.
You could create a rownumber with a Script Component (don't worry, it's very easy): http://microsoft-ssis.blogspot.com/2010/01/create-row-id.html
or you could use a rownumber component like http://microsoft-ssis.blogspot.com/2012/03/custom-ssis-component-rownumber.html or http://www.sqlis.com/post/Row-Number-Transformation.aspx
For dividing it in 8 files you could use the Balanced Data Distributor or the Conditional Split with a modulo expression (using your new rownumber column):
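For example, assuming the new rownumber column is called RowNumber and you want 8 files, the split expressions could look like this (with the default output catching the last remainder):
RowNumber % 8 == 0
RowNumber % 8 == 1
...
RowNumber % 8 == 6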
I am working on a project for a client using a classic ASP application I am very familiar with, but in his environment it is performing more slowly than I have ever seen in a wide variety of other environments. I'm working on that with a number of solutions; however, the sluggishness has got me to look at something I've never had to look at before -- it's more of an "academic" question.
I am curious to understand why a category page with, say, 1800 product records takes roughly 3 times as long to load as a category page with, say, 54 records, when both are set to display 50 products per page. That is, when the number of items to loop through is the same, why does the total number of records make a difference in the time taken to display a constant number of products?
Here are the methods used, abstracted to the essential aspects:
SELECT {tableA.fields} FROM tableA, tableB WHERE tableA.key = tableB.key AND {other refining criteria};
set rs=Server.CreateObject("ADODB.Recordset")
rs.CacheSize=iPageSize
rs.PageSize=iPageSize
pcv_strPageSize=iPageSize
rs.Open query, connObj, adOpenStatic, adLockReadOnly, adCmdText
dim iPageCount, pcv_intProductCount
iPageCount=rs.PageCount
If Cint(iPageCurrent) > Cint(iPageCount) Then iPageCurrent=Cint(iPageCount)
If Cint(iPageCurrent) < 1 Then iPageCurrent=1
if NOT rs.eof then
rs.AbsolutePage=Cint(iPageCurrent)
pcArray_Products = rs.getRows()
pcv_intProductCount = UBound(pcArray_Products,2)+1
end if
set rs = nothing
tCnt=Cint(0)
do while (tCnt < pcv_intProductCount) and (count < pcv_strPageSize)
{display stuff}
count=count + 1
loop
The record set is converted to an array via GetRows() and then destroyed; records displayed will always be iPageSize or fewer.
Here's the big question:
Why, on the initial page load for the larger record set (~1800 records), does it take significantly longer to loop through the page of 50 records than it does for the smaller record set (~54 records)? It's running through indexes 0 to 49 either way, but takes much longer the larger the initial record set / GetRows() array is. In other words, why would looping through the first 50 records be slower when the array is larger, given that the loop still exits after the same number of rows?
Running MS SQL Server 2008 R2 Web edition
You are not actually limiting the number of records returned. It will take longer to load 36 times more records. You should change your query to limit the records directly rather than retrieving all of them and terminating your loop after the first 50.
Try this:
SELECT *
FROM
(SELECT tableA.*, ROW_NUMBER() OVER(ORDER BY tableA.Key) AS RowNum
FROM tableA
INNER JOIN tableB
ON tableA.key = tableB.key
WHERE {other refining criteria}) AS ResultSet
WHERE RowNum BETWEEN 1 AND 50
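To fetch later pages, the same pattern generalizes; a sketch, with @PageNumber and @PageSize standing in for whatever paging values the ASP code passes in:
DECLARE @PageNumber int = 2, @PageSize int = 50;  -- hypothetical values

SELECT *
FROM
(SELECT tableA.*, ROW_NUMBER() OVER(ORDER BY tableA.Key) AS RowNum
FROM tableA
INNER JOIN tableB
ON tableA.key = tableB.key
WHERE {other refining criteria}) AS ResultSet
WHERE RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1
              AND @PageNumber * @PageSize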
Also make sure the columns you are using to join are indexed.
Let's say I have a Product table in a shopping site's database to keep description, price, etc of store's products. What is the most efficient way to make my client able to re-order these products?
I created an Order column (integer) to use for sorting records, but that gives me some performance headaches due to the primitive method I use: changing the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was create a procedure which checks whether there is already a record in the target order (3) and, if not, updates the order of the row (ID=26). If there is a record in the target order, the procedure calls itself, passing that row's ID and target order + 1 as parameters.
That causes every single record after the one I want to change to be updated to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the Order column of an item to be enough for sorting, with no secondary keys involved. The Order column alone must specify a unique place for its record.
In addition to all this, I wonder if I can implement something like a linked list: a 'Next' column instead of an 'Order' column that keeps the next item's ID. But I have no idea how to write the query that retrieves the records in the correct order. If anyone has an idea about this approach as well, please share.
UPDATE product SET [order] = [order] + 1 WHERE [order] >= @NewOrderVal
Though over time you'll get larger and larger "spaces" in your order, it will still "sort".
This will add 1 to the value being changed and to every value after it in one statement, but the caveat above still holds: larger and larger "spaces" will form in your order, possibly getting to the point of exceeding an INT value.
Alternate solution, given the desire for no gaps:
Imagine a procedure UpdateSortOrder with parameters @NewOrderVal, @IDToChange, @OriginalOrderVal.
It's a two-step process, depending on whether the row is moving up or down the sort.
IF @NewOrderVal < @OriginalOrderVal   -- moving down the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE product SET [order] = [order] + 1
    WHERE [order] BETWEEN @NewOrderVal AND @OriginalOrderVal - 1;
END

IF @NewOrderVal > @OriginalOrderVal   -- moving up the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE product SET [order] = [order] - 1
    WHERE [order] BETWEEN @OriginalOrderVal + 1 AND @NewOrderVal;
END

-- Finally, update the one we moved to the correct value
UPDATE product SET [order] = @NewOrderVal WHERE ID = @IDToChange;
Regarding best practice: most environments I've been in typically want something grouped by category and sorted alphabetically or by "popularity on sale", thus negating the need for a user-defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
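A sketch of what the (re)numbering could look like (the Product / SortOrder names are assumptions), spacing existing rows out in steps of 10 so single-row moves rarely force a renumber:
UPDATE p
SET p.SortOrder = s.rn * 10
FROM Product p
JOIN (SELECT Id, ROW_NUMBER() OVER (ORDER BY SortOrder) AS rn
      FROM Product) s ON p.Id = s.Id;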
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE @id int = 26
DECLARE @sortOrder int = 3

;WITH Sorted AS (
    SELECT Id,
           ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
    FROM Product
    WHERE Id <> @id
)
UPDATE p
SET p.SortOrder =
    (CASE
        WHEN p.Id = @id THEN @sortOrder
        WHEN s.RowNumber >= @sortOrder THEN s.RowNumber + 1
        ELSE s.RowNumber
     END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
It is very simple. You need to leave a "cardinality hole" - large gaps between order values.
Structure: you need to have 2 columns:
pk = 32-bit int
order = 64-bit bigint (BIGINT, NOT DOUBLE!!!)
Insert/Update:
When you insert the first record, set order = round(max_bigint / 2).
If you insert at the beginning of the table, set order = round(order of first record / 2).
If you insert at the end of the table, set order = order of last record + round((max_bigint - order of last record) / 2).
If you insert in the middle, set order = order of previous record + round((order of next record - order of previous record) / 2).
This method leaves very large gaps. If you hit a constraint violation, or you think the gaps have become too small, you can rebuild (normalize) the order column.
Even in the worst case, after normalization, this structure still leaves a "cardinality hole" of about 32 bits between adjacent records.
It is very simple and fast!
Remember: NO DOUBLE!!! Only integer types - order must be an exact value!
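For instance, the "insert in the middle" case could look like this (the table and column names are assumptions):
DECLARE @before bigint = 1000000,  -- order of the record before the insert point
        @after  bigint = 2000000;  -- order of the record after the insert point

INSERT INTO Product (Name, [order])
VALUES ('New product', @before + (@after - @before) / 2);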
One solution I have used in the past, with some success, is to use a 'weight' instead of an 'order'. Weight works the obvious way: the heavier an item (i.e. the lower the number), the further it sinks to the bottom; the lighter (the higher the number), the further it rises to the top.
In the event I have multiple items with the same weight, I assume they are of the same importance and I order them alphabetically.
This means your SQL will look something like this:
ORDER BY weight, itemName
hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a linked-list kind of method that is ordered on the client (not in the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing