I've been working with SSIS for a few months, and I'm trying to implement a data flow to replace a sequence of Execute SQL Tasks used for some data transformation.
Data flow description:
Source:
Each row gives information about energy consumed (X) during a number of days (Y).
Destination:
Energy consumed in Day 1 (X/Y), Day 2 (X/Y), Day 3 (X/Y), ...
Any ideas about how to implement this logic in a single data flow?
Thanks a lot.
Yacine.
If I understand what you're doing correctly, you would like one data flow task that performs the necessary calculations before storing the data. Meaning that, as only an example, we may have an energy amount and a number of days, and we'd like to store the energy amount divided by the number of days.
One approach would be a Derived Column transformation between the source and destination, where we can apply mathematical expressions to the existing data. I've done this with weather data before, where an additional column was added that calculated a value to be stored along with the specific day, temperature and forecast, based on the latter two.
Another possible approach is an OLE DB Command transformation, but note (from the SSIS documentation):
Runs [a] SQL statement for each row in a data flow. For example, call a 'new employee setup' stored procedure for each row in the 'new employees' table. Note: running [a] SQL statement for each row of a large data flow may take a long time.
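For illustration only, here is roughly what that per-day split could look like if expressed in T-SQL against a hypothetical source table, reading the requirement as one output row per day (all names below are assumptions); in the data flow itself the same arithmetic would live in the Derived Column expression or the per-row command:

-- Hypothetical source table: EnergyReadings(ReadingId, TotalEnergy, NumDays)
-- Each reading is exploded into one row per day, each carrying TotalEnergy / NumDays.
-- (Cast to DECIMAL first if both columns are integers, to avoid integer division.)
SELECT r.ReadingId,
       n.DayNumber,
       r.TotalEnergy / r.NumDays AS EnergyPerDay
FROM EnergyReadings AS r
JOIN (SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS DayNumber
      FROM sys.all_objects) AS n
  ON n.DayNumber <= r.NumDays;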
I am trying to set up an ELT pipeline into Snowflake, and it involves a transformation after loading.
This transformation will currently create or replace a Table using data queried from a source table in Snowflake after performing some manipulations of JSON data.
My question is: is this the proper way of doing it, via CREATE OR REPLACE TABLE every time the transformation runs, or is there a way to update the data in the transformed table incrementally?
Any advice would be greatly appreciated!
Thanks!
You can insert into the load (source) table and put a stream on it; then you know the rows, or ranges of rows, that need to be "reviewed" and can upsert them into the output transform table.
That is, if you are doing something like "daily aggregates" and this batch has data for the last 4 days, you read the "last four days" of data from the source (instead of a full read), then aggregate and upsert via the MERGE command. With this model you save on reads/aggregation/writes.
We have also used high-water-mark tables to know the last data seen and/or the lowest value in the current batch.
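A minimal sketch of that pattern in Snowflake SQL, assuming a load table raw_load(event_date, amount) and an aggregate table daily_agg(event_date, total_amount); all names are illustrative:

-- Track newly inserted rows on the load table.
CREATE OR REPLACE STREAM raw_load_stream ON TABLE raw_load;

-- On each transformation run: find which days the new batch touches,
-- re-aggregate just those days from the source, and upsert the results.
-- Consuming the stream inside this DML advances its offset once the transaction commits.
MERGE INTO daily_agg AS t
USING (
    SELECT event_date, SUM(amount) AS total_amount
    FROM raw_load
    WHERE event_date IN (SELECT DISTINCT event_date FROM raw_load_stream)
    GROUP BY event_date
) AS s
ON t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET t.total_amount = s.total_amount
WHEN NOT MATCHED THEN INSERT (event_date, total_amount) VALUES (s.event_date, s.total_amount);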
I need to optimize a database for high-volume inserts/reads.
I have a Postgres table of raw data with around 9 million rows.
Each row has 2 columns: a timestamp and a string value; the data is sorted ascending by time.
The only query I run against this data is:
select * from <raw_table> where unixtimestamp between 1600000 and 1700000
After getting the data from the raw table, I apply a function to the result and cache the processed data in another table for faster future queries.
I tried MongoDB with a time-series collection; it was better, but still ~25 sec to fetch, and inserts take even longer.
Postgres had the best fetch time, but trying to insert 500k rows at once takes forever.
So far the best way I've found is to parse the CSV files as strings and use binary search to select rows.
I think my indexes are not right, and that's the main problem.
So what are your suggestions for maintaining time-series data where the only operation I need is to fetch a range?
EDIT 1 (to clarify the question):
I have raw data of 9 million rows.
A user can request data from the API using one of 20+ aggregation formulas,
for example: api/get?formula=1&from=2019&to=2021
So I check whether I already have a cache for formula=1 within this date range.
If not, I need to load the raw data for this date range, apply the formula the user requested (the result is usually ~1k rows for every 1 million rows of raw data), and then cache the aggregated data for subsequent requests.
I cannot pre-process the data for all formulas in the system, because I have over 2,000 sensors (each with its own 9 million raw rows); the result would be more than 2 TB of space.
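For reference, a minimal Postgres sketch of this access pattern; the table, column, and index names are illustrative, not from the post, and the BRIN index is just one commonly used option for append-only, time-ordered data:

-- Illustrative schema matching the description (2 columns, rows arriving in time order).
CREATE TABLE raw_data (
    unixtimestamp BIGINT NOT NULL,
    value         TEXT   NOT NULL
);

-- For append-only, time-ordered data filtered only by a timestamp range,
-- a BRIN index stays tiny and cheap to maintain; a plain btree also works but is larger.
CREATE INDEX raw_data_ts_brin ON raw_data USING brin (unixtimestamp);

-- The range query is then:
SELECT unixtimestamp, value
FROM raw_data
WHERE unixtimestamp BETWEEN 1600000 AND 1700000;

-- Bulk loads are much faster with COPY (or batched multi-row INSERTs) than one INSERT per row:
-- COPY raw_data (unixtimestamp, value) FROM '/path/to/batch.csv' WITH (FORMAT csv);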
In this dataset I am trying to develop a column or a measure based upon the Hours column. I am trying to determine the difference between the first and second hour rows, the second and third hour rows, and so on through the entirety of the data.
Note: there are multiple serial numbers in this table; I just used this serial as an example.
I'm not sure this should be tagged with sql-server unless you can change the SQL that sources the data. If so, you could pre-calculate this inside SQL Server.
If you can change the Power Query that brings the data into the data model, you can add an Index column as the data comes in and use that.
Please see:
How to Compare the Current Row to the Previous Row Using DAX
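If the SQL Server route is open, the pre-calculation mentioned above is typically done with the LAG window function. A minimal sketch, with assumed table and column names:

-- Assumed names: a readings table with SerialNumber, ReadingDate and Hours columns.
-- LAG fetches the previous row's Hours within the same serial number, ordered by date,
-- so the difference between consecutive rows is computed in the source query.
SELECT SerialNumber,
       ReadingDate,
       Hours,
       Hours - LAG(Hours) OVER (PARTITION BY SerialNumber ORDER BY ReadingDate) AS HoursDelta
FROM dbo.Readings;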
I've been banging my head against this on and off for about a year, and I've just hit crunch time.
Business Issue: We use a piece of software called Compeat Advantage (a General Ledger system), and it comes with an Excel add-in that provides a function to retrieve data from the Microsoft SQL database. The problem is that it must make a call to the database for each cell containing that function. On average it takes about 0.2 seconds to make the call and retrieve the data. Not bad, except when a report uses these in volume: our standard report built with it has ~1,000 calls, so by simple math it takes just over 3 minutes to produce the report.
Again, in and of itself not a bad amount of time for a fully custom report. The issue is that this is one of the smaller reports we run, AND in some cases we have to produce 30 variants of the same report, unique per location.
The function's arguments are: Unit(s) [String], Account(s) [String], Start Date, End Date. Everything is rolled up in a SUM() so that a single [Double] is returned.
SELECT SUM(acctvalue)
FROM acctingtbl
WHERE DATE BETWEEN startDate AND endDate AND storeCode = Unit(s) AND Acct = Account(s)
Solution Sought: For the standard report there are only three variations of the data retrieved (Current Year, Prior Year, and Budget), and if they were all retrieved in bulk into detailed tables/arrays, the 3-minute report would drop to less than a second to produce.
I want to find a way to retrieve the data in line-item detail and store it locally, so it can be accessed without the need to create an ODBC call for every single function call on the sheet.
SELECT Unit, Account, SUM(acctvalue)
FROM acctingtbl
WHERE date BETWEEN startDate AND endDate
GROUP BY Unit, Account
Help: I am failing to find a workable way to do this. The largest problem I have is the scope/persistence of the data. It is easy to call for all the data I need from the database, but keeping it around for use is killing me. Since these are spreadsheet functions, the data in the variables is released after each call, so I end up in the same spot: each function call on the sheet takes 0.2 seconds.
I have tried storing the data in a CSV file but continue to have data-handling issues insofar as moving it from the CSV into an array to search and sum the data. I don't want to manipulate the registry to store the info.
I am coming to the conclusion that if I want this to work I will need to call the database, store the data in a very hidden tab, and then pull it forward from there.
Any thoughts would be much appreciated on what process I should use.
Okay!
After some lucky Google-fu I found a passable workaround.
VBA - Update Other Cells via User-Defined Function. This provided many of the answers.
Beyond his code, I had to add code to make that sheet recalculate every time the UDF was called, to check the trigger. I did that with a simple cell + cell formula that has a random number placed in it every time the workbook calculates.
I am expanding the code in the Workbook section now to fill in the holes.
Should solve the issue!
I have four dimension tables (titles, publishers, stores, period) and I'd like to load data into a fact table, but I don't know how! By the way, I'm using SSIS.
In the measures of this fact table I have Qty and turnover (revenue) that I need to calculate, but I don't know how to. Also, in my source data I have a qty for every date.
I want to know which SSIS tools I need to use to achieve that, plus how to calculate the measures.
I calculate the qty from the sources, but it doesn't give me a qty for every row; it gives me the qty of everything in one row!!
Your question is very basic, so I suggest you take your time and look at some SSIS tutorials on YouTube, or:
https://learn.microsoft.com/en-us/sql/integration-services/lesson-1-create-a-project-and-basic-package-with-ssis
To answer your question, though: first add a Data Flow Task to your Control Flow. Double-click on the DFT, and in the Data Flow add an Excel Source to connect to your source. If you want to do some calculations, use a Derived Column transformation.
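If the source can also be queried with SQL (for example through an OLE DB Source), the measure calculation usually boils down to an aggregate grouped by the dimension keys; getting one total row, as described above, is the classic symptom of a missing GROUP BY. A sketch where all table and column names are assumptions, and turnover is assumed to be qty times unit price:

-- Hypothetical source and dimension tables; adjust joins to your actual keys.
SELECT t.TitleKey,
       p.PublisherKey,
       s.StoreKey,
       d.PeriodKey,
       SUM(src.Qty)                 AS Qty,
       SUM(src.Qty * src.UnitPrice) AS Turnover
FROM SourceSales AS src
JOIN DimTitle     AS t ON t.TitleId     = src.TitleId
JOIN DimPublisher AS p ON p.PublisherId = src.PublisherId
JOIN DimStore     AS s ON s.StoreId     = src.StoreId
JOIN DimPeriod    AS d ON d.DateValue   = src.SaleDate
GROUP BY t.TitleKey, p.PublisherKey, s.StoreKey, d.PeriodKey;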