How can I use variables and SQL code within an SSIS package? - sql-server

I have an SSIS package I am building to take data from a .CSV file and load it into a table in a SQL Server database. The .CSV file has more columns than my table and I'm looking to filter out the data based on some of these columns that are not being inserted into the table.
I have year, kind, type, and dollars as the column names in the .CSV file, but I'm only pulling type and dollars into the DB. However, I can only pull rows where kind = "L" and year is the current year, with one major caveat: if the process is running in the first quarter of a given year (so month <= 3), it needs to use the previous year as the qualifier for which rows it pulls from the .CSV file. For instance, if it is February 2015 when this package runs, it should pull only rows with a year of 2014 and kind = "L". If it is September 2015, it should pull rows with a year of 2015 and kind = "L".
Any idea what the best way of doing this is? Right now I have a Conditional Split in my package, but I can only get it to say year == YEAR(GETDATE()), and this will not work for the first quarter. I'd need some sort of variable logic to say something like IF (currentmonth <= 3) THEN #year = currentyear - 1 ELSE #year = currentyear, and then use the #year variable in the conditional split. Is this possible?
Any help is much appreciated!

Normally for this kind of workflow, I will import the entire CSV into a temporary table, and then have a separate SQL script or view which reads from the temporary table and applies whatever business logic is needed for the final view.
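For example, a rough sketch of that pattern for this question's file (the staging table and view names are made up; the quarter rule from the question goes in the view):
CREATE TABLE dbo.Stage_SalesFile (
    [year]  int,
    kind    char(1),
    [type]  varchar(50),
    dollars decimal(18, 2)
);
GO
-- the data flow loads the whole CSV into the staging table;
-- the view applies the kind/year business rule
CREATE VIEW dbo.FilteredSales
AS
SELECT [type], dollars
FROM dbo.Stage_SalesFile
WHERE kind = 'L'
  AND [year] = CASE WHEN MONTH(GETDATE()) <= 3
                    THEN YEAR(GETDATE()) - 1
                    ELSE YEAR(GETDATE())
               END;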

If you want the logic to be in the SSIS package, you can use a Derived Column component to declare a new boolean field, for example IncludeRowInOutput, and set it to something like (in SSIS expression syntax)
(MONTH(GETDATE()) <= 3 ? YEAR(GETDATE()) - 1 : YEAR(GETDATE())) == year && kind == "L"
Note that in the first quarter the previous-year test has to replace the current-year test, not sit alongside it, or first-quarter runs will still let current-year rows through.
Then you can do the conditional split based on the IncludeRowInOutput field.
I'd normally be wary of using too many script components; I find they are harder to debug and make the data flow harder to understand.

Related

Database/Datawarehouse design suggestions

We have a legacy ERP system that stores data in flat files. We have replicated these flat files in a SQL Server database pretty much as they are.
Some of the sales tables store historical data in multiple columns without storing any dates with them. The name of the column tells us which month the sales data belongs to: Sales01 is the current month, Sales02 is the previous month, Sales03 the month before that, and so on. The same goes for sales quantity and margin, i.e. Qty01, Qty02, Margin01, Margin02, and so on. This is repeated for each customer and each item sold.
Now, I am working on a small project where I have to design a small DB for reporting with some tables that will be fed by this main database.
I want to load this data in such a way that the values for each month are stored in rows, with a month-year (or a date for the first day of the month) in another column, so I can use a WHERE clause with dates.
I'm not sure what the best way to go would be. I have written a stored proc in the past to load this data this way, but I wonder if there is a better way.
Could I somehow use the SSIS Pivot transformation?
Or just use PIVOT or a similar statement to do this in a stored procedure?
I will most likely be using this practice to build a data warehouse in the future.
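For what it's worth, a minimal sketch of the stored-procedure idea, assuming a hypothetical wide table dbo.LegacySales with columns CustomerId, ItemId, Sales01-Sales03, Qty01-Qty03, and Margin01-Margin03; CROSS APPLY (VALUES ...) is used rather than UNPIVOT because it handles several measures per month more easily:
DECLARE @SnapshotDate date = GETDATE();

SELECT
    s.CustomerId,
    s.ItemId,
    -- first day of the month each column represents
    -- (Sales01 = current month, Sales02 = previous month, ...)
    DATEADD(month, DATEDIFF(month, 0, DATEADD(month, 1 - v.MonthOffset, @SnapshotDate)), 0) AS SalesMonth,
    v.Sales,
    v.Qty,
    v.Margin
FROM dbo.LegacySales AS s
CROSS APPLY (VALUES
    (1, s.Sales01, s.Qty01, s.Margin01),
    (2, s.Sales02, s.Qty02, s.Margin02),
    (3, s.Sales03, s.Qty03, s.Margin03)
) AS v(MonthOffset, Sales, Qty, Margin);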

SSIS - Excel to SQL Server with changing column names

I have an Excel sheet whose column names change based on the year and the current week of the year, so for example 201901 would be the first week of 2019.
The Excel sheet that is sent to us daily automatically adjusts the column names based on the current date (up to 6 months ahead), so currently (31/07/2019) the year and week columns run 201931 - 202011:
So the N column next week will be 201932 (the columns shift left, basically).
I have tried changing the Excel source columns to different aliases of just 1, 2, 3, 4, etc. in the hope of just getting the data into SQL Server and then scripting a trigger in SQL Server to change the column names, but that doesn't work due to the mapping SSIS requires.
It works fine until the columns change the next week.
A simple method would be to drop the table and just dump the file into a new table with the same name, but I can't see how to set that up in SSIS, as you need to map the column names (which unfortunately change).
Here is how the dataflow looks:
Ideally, for me, something like this would be perfect:
But I'm not sure how to achieve this outcome in SSIS?
I would suggest transforming the data. Currently, you have a "cross-table" (wide) format.
How about putting the Excel data in the form (RAG_week, CalendarWeek, Value_of_CalendarWeek)? To do this you can use an Excel macro which fills a new sheet in the Excel file (each cell is transformed into one record, becoming a row of its own). Next, you create a matching table on the SQL Server side. Then you can create an SSIS package with a constant column assignment, simply appending the new data each week.
This impacts the further evaluation of your data, but it seems a far more stable approach.
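In SQL Server, the matching table could look something like this (column names follow the suggestion above; the types are guesses):
CREATE TABLE dbo.WeeklyValues (
    RAG_week              int            NOT NULL, -- e.g. 201931
    CalendarWeek          int            NOT NULL, -- e.g. 201935
    Value_of_CalendarWeek decimal(18, 2) NULL
);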

Date format in Excel file to load to SQL Server

Our business would be providing us with a .csv file. One of the columns in the file is in a date format. Now, as we know, Excel supports many date formats. The problem is that we need to check whether the date provided is a correct date. It could be in any format, like ddmmyyyy, yyyymmdd, dd-mon-yyyy, etc., basically any format that Excel supports.
We are planning to first load the data into a staging area, where the date field would be defined as varchar so that it can accept any data.
Now, either using SSIS or via T-SQL, I need to check whether the date provided is actually a date, and if it is, I need to load it into a different table in YYYYMMDD format.
How do I go about doing the above?
Considering you have your Excel data already loaded into a SQL Server table as varchar (you can easily do this using SSIS), something like this would work:
SELECT
    CASE WHEN ISDATE(YOUR_DATE) = 1
         THEN CONVERT(char(8), CONVERT(datetime, YOUR_DATE), 112) -- style 112 = yyyymmdd
         ELSE NULL
    END AS MyDate
FROM
    YOUR_TABLE
Note the two-step conversion: the string is first converted to datetime and only then formatted as yyyymmdd (a single CONVERT with style 112 would not interpret the varchar as a date).
I don't have access to a SQL Server instance at the moment and can't test the code above, so you may need to adapt it to your needs, but this is the general idea.
You can also do further research on ISDATE and CONVERT functions in SQL Server. You should be able to achieve what you need combining them together.
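One caveat: ISDATE and plain CONVERT depend on the session's DATEFORMAT/language settings, and CONVERT raises an error on values that don't parse. On SQL Server 2012 and later, TRY_CONVERT returns NULL instead of failing; a sketch using the same placeholder names as above:
SELECT
    CONVERT(char(8), TRY_CONVERT(datetime, YOUR_DATE), 112) AS MyDate -- NULL when not a valid date
FROM
    YOUR_TABLE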

EXCEL datetime row changed to "42507" after import into SQL Server

I've got a problem: I have imported this Excel sheet into SQL Server several times before, and it worked fine.
But suddenly there are 2 rows (datetime) with invalid data. In Excel, the datetime values have all been changed to 2016/12/12.
But when the data is imported into SQL Server, some values change to something like 42507 and can't be used in calculations with DATEDIFF.
I am quite confused by this; can anyone help? Any ideas are greatly appreciated.
Thanks in advance.
Excel stores dates as integers: the number of days since 1899-12-30. You can use =TEXT(A1,"yyyy-mm-dd hh:mm:ss") in Excel to store the text value for easy import, but if you already have the integers in SQL Server you can use DATEADD(day, yourDate, '1899-12-30') to convert them to proper dates.
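As a quick sanity check on the number from the question:
-- 42507 days after 1899-12-30 is 2016-05-17
SELECT DATEADD(day, 42507, '1899-12-30') AS ExcelSerialAsDate;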
.xlsx (and .docx, .pptx, etc.) files are just archives, and the meat of your documents is stored in XML files. You can change the extension to .zip and open the archive to explore how the data is actually stored; in most if not all cases, cell formatting doesn't affect the underlying values.
Make sure the field in Excel is set to a date format and then import the sheet.
You can also convert the 5-digit serials in place. Note that casting the varchar value straight to datetime will fail, and going via int lands two days off (SQL Server counts from 1900-01-01, and Excel has a phantom 1900 leap day), so use DATEADD with the 1899-12-30 base:
UPDATE <yourTable>
SET <dateColumn> = DATEADD(day, CAST(<dateColumn> AS int), '1899-12-30')
WHERE LEN(<dateColumn>) = 5

Recommended way of adding daily data to database

I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.
Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a CreateDate column in your tables with a default value of GETDATE(), so that it automatically gets populated with the current date when the row is inserted.
You may also want to have another column to store the data filename that the row was imported from, but if you're already storing the value of the date column and the date that the row was inserted, this shouldn't really be necessary.
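A minimal sketch of such a table (the names and types are illustrative):
CREATE TABLE dbo.DailyData (
    DailyDataId int IDENTITY(1, 1) PRIMARY KEY,
    DataDate    date         NOT NULL,                    -- the date column from the incoming file
    SourceFile  varchar(260) NULL,                        -- optional: which file the row came from
    CreateDate  datetime     NOT NULL DEFAULT (GETDATE()) -- when the row was inserted
);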
In the past, when doing this type of activity using a custom data loader application, I've also found it useful to create log files for success/error/warning messages, including some type of unique key for the source data and the target database - i.e. if coming from an Excel file and going into a database column, you could store the row index from Excel and the primary key of the inserted row. This helps with tracking down any problems later on.
You might want to consider having a look at SSIS (SQL Server Integration Services). It's the SQL Server tool for doing ETL activities.
Yes, append each day's data to the tables; one set of tables for all data.
Yes, use a date column to identify the day that the data was loaded.
Maybe have another table with a date column and a large text column (a CLOB, i.e. varchar(max) in SQL Server): the date to contain the load date, and the text column to contain the file that you imported.
Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.
I would have the data load into a staging table regardless, and append to the main tables afterwards. Once a week I would then refresh all data in the main table to ensure that the data remains correct as per the source.
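A minimal sketch of that stage-then-append pattern (the table and file names are hypothetical; the main table matches the sketch above):
-- reload the staging table with today's file
TRUNCATE TABLE dbo.Stage_DailyData;

BULK INSERT dbo.Stage_DailyData
FROM 'C:\incoming\daily_20190731.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- append to the main table
INSERT INTO dbo.DailyData (DataDate, SourceFile)
SELECT DataDate, 'daily_20190731.csv'
FROM dbo.Stage_DailyData;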
Marcus
