Recommended way of adding daily data to database - sql-server

I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.

Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a 'CreateDate' column in your tables, with a default value of 'GetDate()' so that it automatically gets populated with the current date when the row is inserted.
You may also want to have another column to store the data filename that the row was imported from, but if you're already storing the value of the date column and the date that the row was inserted, this shouldn't really be necessary.
In the past, when doing this type of activity using a custom data loader application, I've also found it useful to create log files that record success/error/warning messages, including some type of unique key for both the source data and the target database. For example, if the data comes from an Excel file and goes into a database table, you could store the row index from Excel and the primary key of the inserted row. This helps with tracking down any problems later on.
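A minimal sketch of the table layout described above. The table name, column names, and types are hypothetical placeholders (only 'CreateDate' and 'GetDate()' come from the answer), and the `date` type assumes SQL Server 2008 or later:

```sql
-- Hypothetical table; replace the columns with those of your own data files.
CREATE TABLE dbo.DailyData (
    DailyDataId    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DataDate       date          NOT NULL,  -- the date column from the incoming file
    Amount         decimal(18,2) NULL,      -- stand-in for the real data columns
    SourceFileName nvarchar(260) NULL,      -- optional: the file the row came from
    CreateDate     datetime      NOT NULL
        CONSTRAINT DF_DailyData_CreateDate DEFAULT (GETDATE())
);

-- CreateDate is left out of the column list, so the default fills it in.
INSERT INTO dbo.DailyData (DataDate, Amount, SourceFileName)
VALUES ('2024-01-15', 125.50, N'data_2024-01-15.csv');
```

Every later import appends to the same table; the row records both the day the data refers to (DataDate) and the moment it was loaded (CreateDate).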

You might want to consider having a look at SSIS (SQL Server Integration Services). It's the SQL Server tool for doing ETL activities.

Yes, append each day's data to the tables; use one set of tables for all data.
Yes, use a date column to identify the day that the data was loaded.
Maybe have another table with a date column and a CLOB column (VARCHAR(MAX) in SQL Server). The date holds the load date and the CLOB holds the contents of the file that you imported.

Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
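As a rough illustration of the point above, here are a one-day query and a monthly report against a single table. The table dbo.DailyData with columns DataDate (date) and Amount (decimal) is hypothetical and assumed to exist already:

```sql
-- Assumes a hypothetical table dbo.DailyData (DataDate date, Amount decimal(18,2)).

-- One particular day:
SELECT DataDate, Amount
FROM dbo.DailyData
WHERE DataDate = '20240115';

-- A whole month, totalled per day. With one table per day this would need
-- roughly thirty queries glued together with UNION ALL.
SELECT DataDate, SUM(Amount) AS TotalAmount
FROM dbo.DailyData
WHERE DataDate >= '20240101' AND DataDate < '20240201'
GROUP BY DataDate
ORDER BY DataDate;
```

A quarterly report only changes the two boundary dates in the WHERE clause.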
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.

I would load the data into a staging table regardless, and append to the main tables afterwards. Once a week I would then refresh all data in the main table to ensure that it remains consistent with the source.
Marcus
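The staging approach could look something like the sketch below. The table and column names are hypothetical, and the "delete the dates being reloaded first" step is one possible way to keep re-runs from duplicating rows, not something the answer prescribes:

```sql
-- Hypothetical tables: the staging table mirrors the main table but is more
-- permissive, so a bad file fails during validation rather than mid-load.
CREATE TABLE dbo.Sales       (SaleDate date NOT NULL, Amount decimal(18,2) NOT NULL);
CREATE TABLE dbo.Sales_Stage (SaleDate date NULL,     Amount decimal(18,2) NULL);

-- ... bulk load the day's file into dbo.Sales_Stage here ...

BEGIN TRANSACTION;
    -- Remove any earlier copy of the dates being loaded, so that re-running a
    -- load or loading a corrected file replaces rows instead of duplicating them.
    DELETE FROM dbo.Sales
    WHERE SaleDate IN (SELECT SaleDate FROM dbo.Sales_Stage);

    -- Append only the staging rows that pass basic validation.
    INSERT INTO dbo.Sales (SaleDate, Amount)
    SELECT SaleDate, Amount
    FROM dbo.Sales_Stage
    WHERE SaleDate IS NOT NULL AND Amount IS NOT NULL;
COMMIT TRANSACTION;

-- Empty the staging table for the next import.
TRUNCATE TABLE dbo.Sales_Stage;
```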

Related

Using SSIS to insert records without inserting preexisting records

I have a source data set of 290 million records, and every day I get a download of 12 million records that includes data from previous days' downloads. I am having trouble inserting the daily records into the source while excluding the records I already have. Some of the new records may not be from the previous day; they could be from several days back, so a date restriction won't work. Please help.
I just had this exact same issue. In the Data Flow of your SSIS package, add a Lookup. Have it match the incoming data against the existing data based on the PK. Then you can split the data from there: choose Redirect Rows to No Match Output. The green arrow will then carry all the rows that are not already present.
Use a Lookup component on a key field and, with the no match output, do an insert (you could also do an update with the match output, though 290 million rows IS going to take A WHILE)...
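For comparison, the same "no match" logic can be written in plain T-SQL if the daily download is first bulk loaded into a staging table. This is an alternative to the SSIS Lookup rather than what the answers above describe, and all table and column names are hypothetical:

```sql
-- Hypothetical tables: dbo.Records is the 290-million-row set,
-- dbo.Records_Daily holds the latest 12-million-row download.
CREATE TABLE dbo.Records       (RecordKey bigint NOT NULL PRIMARY KEY, Payload varchar(100) NULL);
CREATE TABLE dbo.Records_Daily (RecordKey bigint NOT NULL PRIMARY KEY, Payload varchar(100) NULL);

-- Insert only the rows whose key is not already present (the "no match" rows).
INSERT INTO dbo.Records (RecordKey, Payload)
SELECT d.RecordKey, d.Payload
FROM dbo.Records_Daily AS d
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Records AS r
    WHERE r.RecordKey = d.RecordKey
);
```

The primary key on RecordKey lets each existence check use an index seek instead of scanning the large table.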

How to transfer only new records between two different databases (ie. Oracle and MSSQL) using SSIS?

Do you know how to transfer only new records between two different databases (i.e. Oracle and MSSQL) using SSIS? There is no problem transferring only new data between two tables in the same database and server, but is it possible to do this between completely different servers and databases?
PS. I know about the solution using Lookup, but it is not very efficient if you need to check and add a lot of records (50k and more) several times per day. I would like to operate on new data only.
You have several options:
Timestamp based solution
If you have a column which stores the insertion time in the source system, you can select only the new records created since the last load. With the same logic you can transfer modified records too: just update the timestamp on a record whenever it changes.
Sequence based solution
If there is a sequence in the source table, you can load the new records based on that sequence. Query the last value from the destination system, then load everything which is larger than that value.
CDC based solution
If you have CDC (Change Data Capture) in your source system, you can track the changes and you can load them based on the CDC entries.
Full load
This is the most resource hungry solution: you have to copy all data from the source to the destination. If you do not have any column which marks the new records, you should use this solution.
You have several options to achieve this:
TRUNCATE the destination table and reload it from source
Use a Lookup component to determine which records are missing
Load all data from source to a temporary table and write a query which retrieves the new/changed records.
Summary
If you have at least one column which marks the new/modified records, you can use it to implement a differential/incremental load with SSIS. If you do not have any clue which columns/rows have changed, you have to load (or at least query) all of them.
There is no way to do this with a single query (INSERT .. SELECT) across multiple servers without transferring all the data. (Please note that a multi-server query using Linked Servers also transfers the data from the source system.)
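The timestamp based option boils down to two queries, sketched below. The table and column names are hypothetical, and both statements are shown in one batch only to make the logic visible. Across two servers, one common arrangement is to run the first query against the destination in an Execute SQL Task, store the result in a package variable, and pass that variable as a parameter to the source query:

```sql
-- Step 1, run against the destination: find the newest timestamp already loaded.
-- Falls back to an early date when the destination table is still empty.
DECLARE @LastLoaded datetime;
SELECT @LastLoaded = ISNULL(MAX(ModifiedAt), '19000101')
FROM dbo.Orders_Dest;

-- Step 2, run against the source: fetch only rows created or changed since then.
SELECT OrderId, Amount, ModifiedAt
FROM dbo.Orders_Src
WHERE ModifiedAt > @LastLoaded;
```

The sequence based option has the same shape, with MAX(Id) in place of MAX(ModifiedAt).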
What about variables? Is it possible to use the same variable between different databases and servers in SSIS?
I would like to read the last id number from the destination table and pass it to the query on the source table (different server!).
I can set a variable in a database scope like this:
DECLARE @Last int
SET @Last = (SELECT TOP 1 Id FROM dbo.Table_1 ORDER BY Id DESC)
SELECT *
FROM dbo.Table_2
WHERE ID > @Last;
However, this only works between two tables in the same database (as a SQL command). I can create a variable for an entire SSIS package in Variables --> Add variable, but I don't know whether it is possible to use that variable in a similar way: to hold the last id from the destination table and pass it to a query on the source server as a lower bound.

How to find new fields in the Destination table in SQL Server?

I have a table in my source database with 100 columns. I use an SSIS package to load data from the source database into a destination database table.
The source and destination tables have the same structure. But sometimes address fields in the destination table are changed, or columns with new data types are added.
So how can I find the columns that are new or whose data types have changed in the destination table, compared with the source table?
Is there any stored procedure to find missing columns, changed address fields, and changed data types?
I can check manually, but it takes too much time: I have 50 tables, each with 100 to 200 columns.
Can someone please help me with this?
You can grab all the table names from sys.tables and all the column names from sys.columns.
You can write code to perform this comparison on the destination and source databases; however, you might waste a lot of time reinventing the wheel.
http://www.red-gate.com/products/sql-development/sql-compare/
Just get a trial version of SQL Compare from the sales team at Red Gate. If you like it, just buy it. It only costs $495.
Probably a lot less money than trying to write something yourself!
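For anyone who does want to try the catalog-view route first, a comparison query could look roughly like this. SourceDb and DestDb are hypothetical database names, and the sketch assumes both databases sit on the same instance and use the same collation (otherwise the name comparisons may need a COLLATE clause):

```sql
-- SourceDb and DestDb are placeholder database names on the same instance.
WITH src AS (
    SELECT s.name AS sch, t.name AS tbl, c.name AS col,
           ty.name AS type_name, c.max_length, c.[precision], c.scale
    FROM SourceDb.sys.tables  AS t
    JOIN SourceDb.sys.schemas AS s  ON s.schema_id = t.schema_id
    JOIN SourceDb.sys.columns AS c  ON c.object_id = t.object_id
    JOIN SourceDb.sys.types   AS ty ON ty.user_type_id = c.user_type_id
),
dst AS (
    SELECT s.name AS sch, t.name AS tbl, c.name AS col,
           ty.name AS type_name, c.max_length, c.[precision], c.scale
    FROM DestDb.sys.tables  AS t
    JOIN DestDb.sys.schemas AS s  ON s.schema_id = t.schema_id
    JOIN DestDb.sys.columns AS c  ON c.object_id = t.object_id
    JOIN DestDb.sys.types   AS ty ON ty.user_type_id = c.user_type_id
)
SELECT COALESCE(src.sch, dst.sch) AS schema_name,
       COALESCE(src.tbl, dst.tbl) AS table_name,
       COALESCE(src.col, dst.col) AS column_name,
       src.type_name AS source_type,
       dst.type_name AS dest_type,
       CASE WHEN src.col IS NULL THEN 'only in destination'
            WHEN dst.col IS NULL THEN 'only in source'
            ELSE 'definition differs'
       END AS difference
FROM src
FULL OUTER JOIN dst
    ON dst.sch = src.sch AND dst.tbl = src.tbl AND dst.col = src.col
-- Keep only columns that are missing on one side or defined differently.
WHERE src.col IS NULL
   OR dst.col IS NULL
   OR src.type_name   <> dst.type_name
   OR src.max_length  <> dst.max_length
   OR src.[precision] <> dst.[precision]
   OR src.scale       <> dst.scale
ORDER BY schema_name, table_name, column_name;
```

An empty result means the two databases have identical column definitions for every table that exists in either one.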

Load data from multiple source into a destination

I have a desktop application through which data is entered and captured in an MS Access DB. The application is used by multiple users (at different locations). The idea is to download the data entered for a particular day into an Excel sheet and load it into a centralized server, which is an MSSQL server instance.
That is, data (in the form of Excel sheets) will come from multiple locations and be saved into a shared folder on the server, from where it needs to be loaded into SQL Server.
There is an ID column with IDENTITY in the MSSQL server table, which is the primary key column, and no other column in the table contains unique values. Though the data is coming from multiple sources, we need to maintain a single auto-incrementing series (IDENTITY).
Suppose, if there are 2 sources,
Source1: Has 100 records entered for the day.
Source2: Has 200 records entered for the day.
When they get loaded into Destination(SQL Server), table should have 300 records, with ID column values from 1 to 300.
Also, for the next day, when the data comes from the sources, the Destination has to continue loading from ID 301.
The issue is that there may be requests to change data at the Source that has already been loaded into the central server. How can I update that row in the central server, given that the ID column value will not be the same in Source and Destination? As mentioned earlier, ID is the only column with unique values in the table.
Please suggest some ideas for doing this, or tell me whether I have to take a different approach to accomplish this task.
Thanks in advance!
Krishna
Okay so first I would suggest .NET and doing it through a File Stream Reader, dumping it to the disconnected layer of ADO.NET in a DataSet with multiple DataTables from the different sources. But... you mentioned SSIS so I will go that route.
Create an SSIS project in Business Intelligence Development Studio(BIDS).
If you know for a fact you are just importing a bunch of Excel files, I would create either many 'Data Flow Task's or many Source to Destination flows in a single 'Data Flow Task'; that is up to you.
a. Personally I would create tables in a database for each location of an excel file and have their columns map up. I will explain why later.
b. In a data flow task, select 'Excel Source' as the source file. Put in the appropriate location of 'new connection' by double clicking the Excel Source
c. Choose an ADO Net Destination, drag the blue line from the Excel Source to this endpoint.
d. Map your destination to be the table you map to from SQL.
e. Repeat as needed for each Excel destination
Set up the SSIS package to run automatically from SQL Server through SQL Server Management Studio. Remember to connect to an Integration Services instance, not a database instance.
Okay, now you have a bunch of tables instead of one big one, right? I did that for a reason: these should be entry points, and I would leave the logic that determines dupes and records the import time to another table.
I would set up another two tables for the combination of logic and for auditing later.
a. Create a table like 'Imports' or similar, with the same columns plus two more: 'ExcelFileLocation' and 'DateImported'. Create an 'identity' column as the first column, seeded with the default of (1,1), and make it the primary key.
b. Create a second table like 'ImportDupes' or similar, repeating the process above for the columns.
c. Create a unique constraint on the first table on either a value or a set of values that makes each imported row unique.
d. Write a 'procedure' in SQL to do inserts from the MANY tables that match up to the Excel files into the ONE 'Imports' table. For each of the many inserts, do something similar to:
BEGIN TRY
    -- '(location of file)' is a placeholder for the real path of the Excel file
    INSERT INTO Imports (datacol1, datacol2, ExcelFileLocation, DateImported)
    SELECT datacol1, datacol2, '(location of file)', GETDATE()
    FROM TableExcel1;
END TRY
BEGIN CATCH
    -- if the insert breaks the unique constraint, put the rows into the second table
    -- (one violating row fails the whole INSERT, so the full batch lands here)
    INSERT INTO ImportDupes (datacol1, datacol2, ExcelFileLocation, DateImported)
    SELECT datacol1, datacol2, '(location of file)', GETDATE()
    FROM TableExcel1;
END CATCH;
-- repeat above for EACH excel table
-- clean up the individual staging tables for the next import cycle for EACH excel table
TRUNCATE TABLE TableExcel1;
e. Automate the procedure to run on a schedule.
You now have two tables, one for successful imports and one for duplicates.
The reason I did what I did is two fold:
You often need to know more than just the data itself: when it came in, what source it came from, and whether it was a duplicate. And if you do this for millions of rows, can it be indexed easily?
This model is easier to take apart and automate. It may be more work to set up but if a piece breaks you can see where and easily stop the import for one location by turning off the code in a section.
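The 'Imports' table from the steps above might be declared along these lines. The data column types and the choice of (datacol1, datacol2) as the unique key are assumptions for illustration; only the table name, the two audit columns, and the (1,1) identity come from the answer:

```sql
-- Sketch of the combined import table; datacol1/datacol2 stand in for the real
-- columns of the Excel sheets, and their types here are guesses.
CREATE TABLE dbo.Imports (
    ImportId          int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    datacol1          varchar(50)   NOT NULL,
    datacol2          varchar(50)   NOT NULL,
    ExcelFileLocation nvarchar(260) NOT NULL,  -- which file the row came from
    DateImported      datetime      NOT NULL,  -- when the row was imported
    -- The set of values that makes an imported row unique (an assumption here).
    CONSTRAINT UQ_Imports_Row UNIQUE (datacol1, datacol2)
);
```

'ImportDupes' would have the same columns without the unique constraint, so rejected rows can always be stored.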

SQLBulkCopy and Dates (1/1/1753)

I've got an application which has been working fine for quite a while, but there is an annoying item that continues to get in the way on occasion.
Let's say that I use an object such as OracleDataReader or MySQLDataReader to pass the data to the SqlBulkCopy object for insert. Let's assume that all the columns map just fine and, for the most part, it all works well.
Granted, I don't have control over the source application or database (which is either MySQL or Oracle). So some goof goes into a different application and puts in a date on the invoice table of 5/31/0210. He really meant to put in 5/31/2010, but the application he's using is not validating the data very tightly and the Oracle database accepts it. For all intents and purposes, the date 5/31/0210 is a valid date for the Oracle db. It might be stupid in terms of data entry, but it is what it is at this point.
Now our OracleDataReader comes along and is transferring this invoice table over to SQL Server via SqlBulkCopy. It is passing the data to a perfectly matched table with the right column names and data types. You can see what is going to happen. This date of 05/31/0210 from Oracle is not accepted by the SQL Server db engine, as the DATETIME field only allows dates from 1/1/1753 to 12/31/9999.
When it encounters this record, it simply fails and gives an overflow error. It doesn't skip the record, it kills the feed. So if it happens a thousand records in on a million record table, you don't get the remaining 999,000 records.
Is there any way to get around this issue so that the feed will continue?
Ideally, I'd like to move the receiving SQL Server DB to 2008 and use DATETIME2, which would allow for these goofy dates, but unfortunately not all my clients are ready to move to this version yet, so I'm stuck with DATETIME in SQL 2000/2005/2008.
Any ideas on how to get around this without changing the SQL? Ideally, I wouldn't mind if it just skipped the record. I know that I could do this in the SQL for the data reader, but this would be extremely complicated when you have twenty date fields in a single query. It would be a maintenance nightmare.
Any thoughts would be appreciated.
One option would be to change the datetime column type to varchar. Then add a derived column for converting the string to datetime. The trick would be to use a function in the derived column to validate the date and substitute an arbitrary datetime if the conversion would fail. If you do heavy date comparisons, persist the computed column and/or index it.
I say all of this under the impression that sqlbulkcopy is not able to do transforms. Maybe you can. Hopefully, someone will chime in with a way to.
SSIS would be great in this situation, as you could do the transform and also get the performance benefits of the bulk update lock.

Resources