Use SSIS to migrate and normalize database - sql-server

We have an MS Access database that we want to migrate to a SQL Server Database with a new DB design. A part of the application that uses the SQL Server DB is already written.
I looked around to find the easiest way to do the migration step and started with Microsoft's SQL Server Integration Services (SSIS). I have now reached the point where I want to split a table vertically for normalization reasons.
A made-up example looks like this:
MS Access table person: ID, Name, Street
SQL Server table person: id, name
SQL Server table address: id, person_id, street
How can I complete this task best with SSIS? The id columns are identity (autoincrement) columns, so I cannot insert the old ID. How can I put the correct person_id foreign key in the address table?
There might even be a table that has to be broken up into three tables, where a row in table2 belongs to a row in table1 and a row in table3 belongs to a row in table2.
Is SSIS the appropriate means for this?
EDIT
Although this is a one-time migration, we need to have an automated and repeatable process, because the production database is under heavy usage and we are working on the migration in our development environment with recent, but not up-to-date data. We plan for one test run of the migration and have the customer review the behaviour. If everything is fine, we will go for the real migration.
Most of the given solutions include lots of manual steps and are thus not appropriate.

Use the execute SQL Task and write the statement yourself.
For the parent table, select from the old table into the new one (an INSERT ... SELECT, since the target tables already exist), then do the same for the remaining tables as you progress. Make sure you set IDENTITY_INSERT to ON for the parent table and reuse your old IDs. That will help you keep your data integrity.
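A minimal sketch of this approach, applied to the person/address example from the question. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the table name access_person and the helper split_person are illustrative assumptions. SQLite accepts explicit values for an autoincrement key directly; on SQL Server the first insert would have to be wrapped in SET IDENTITY_INSERT person ON / OFF.

```python
import sqlite3

def split_person(conn):
    """Split access_person into person and address, reusing the old IDs."""
    conn.executescript("""
        CREATE TABLE person  (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
        CREATE TABLE address (id INTEGER PRIMARY KEY AUTOINCREMENT,
                              person_id INTEGER REFERENCES person(id),
                              street TEXT);
        -- Explicit id values are written into the identity column.
        -- On SQL Server this statement needs IDENTITY_INSERT switched ON.
        INSERT INTO person (id, name)
            SELECT ID, Name FROM access_person;
        -- The old ID doubles as the foreign key, so no lookup is needed.
        INSERT INTO address (person_id, street)
            SELECT ID, Street FROM access_person;
    """)

conn = sqlite3.connect(":memory:")
# Hypothetical copy of the old Access table.
conn.execute("CREATE TABLE access_person (ID INTEGER, Name TEXT, Street TEXT)")
conn.executemany("INSERT INTO access_person VALUES (?, ?, ?)",
                 [(7, "Ann", "Main St 1"), (12, "Bob", "Oak Ave 5")])
split_person(conn)

rows = conn.execute("""
    SELECT p.id, p.name, a.street
    FROM person p JOIN address a ON a.person_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)
```

Because the statements are plain SQL, the same script can be rerun against a fresh copy of the data, which fits the requirement for a repeatable process.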

For migrating your Access tables into SQL Server, use SSMA, not the Upsizing Wizard from Access.
You'll get a lot more tools at your disposal.
You can then break up your tables one by one from within SQL Server.
I'm not sure whether there are any tools that can help you split your tables automatically; at least I couldn't find any. It's not too difficult to do manually, although how much work is required depends on how you used the original tables in your VBA code and forms in the first place.
A side note
Regarding normalization, don't go overboard with it: I know your example was just that but normalizing customer addresses is not always (rarely?) needed.
How many addresses can a person have?
If you count a home address, business address, delivery address, billing address, that's probably the most you'll ever need.
In that case, it's better to just keep them in the same table. Normalizing that data will just require more work to recombine and offers no benefit.
Of course, there are cases where it would make sense to normalise but I've seen people going overboard with the notion (I've been guilty of it as well) and then find themselves struggling to build more complex queries to join all that split data, making development and maintenance harder and often suffering a performance penalty in the process.

Access is so user-friendly, why not normalize your tables in Access, and then upsize the finished structure from there?

I found a different solution which was not mentioned yet and allows us to use all the comfort and options of the dataflow task:
If the destination database is on a local SQL Server, you can use a dataflow task with SQL Server destination instead of an OLE DB destination.
For a SQL Server destination you can mark the "keep identities" option. (I do not know if the English names are correct, because we have a German version.) With this you can write into identity columns.
We found that we cannot use the old primary keys everywhere, because we have some tables that take a union of records from multiple tables.
We start the process by building a temporary mapping table with the following columns:
new_id (identity)
old_id (int)
old_tablename (string)
We first fill in all the old_id values for every table that is referenced by a foreign key in the new schema. The new_id values are generated automatically by SQL Server.
So we can use a join to translate from old_id to new_id where needed. We use the new_id values to fill the identity (primary key) columns in the new tables with the "keep identities" option and can simply look them up in our mapping table for the foreign keys by a join.
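The mapping-table idea can be sketched as follows. This is not the poster's actual package: it uses Python's built-in sqlite3 module in place of SQL Server and SSIS, and the names old_customer, old_supplier and id_map are invented for illustration. Writing new_id into person.id corresponds to the "keep identities" option described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical old tables whose rows are merged into one new table.
    CREATE TABLE old_customer (ID INTEGER, Name TEXT, Street TEXT);
    CREATE TABLE old_supplier (ID INTEGER, Name TEXT, Street TEXT);
    INSERT INTO old_customer VALUES (1, 'Ann', 'Main St 1');
    INSERT INTO old_supplier VALUES (1, 'Acme', 'Dock Rd 9');

    -- New schema.
    CREATE TABLE person  (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    CREATE TABLE address (id INTEGER PRIMARY KEY AUTOINCREMENT,
                          person_id INTEGER REFERENCES person(id),
                          street TEXT);

    -- Mapping table: new_id is generated, (old_tablename, old_id) is the old key.
    CREATE TABLE id_map (new_id INTEGER PRIMARY KEY AUTOINCREMENT,
                         old_id INTEGER, old_tablename TEXT);
    INSERT INTO id_map (old_id, old_tablename)
        SELECT ID, 'old_customer' FROM old_customer;
    INSERT INTO id_map (old_id, old_tablename)
        SELECT ID, 'old_supplier' FROM old_supplier;
""")

# Table names come from this fixed tuple, never from user input.
for table in ("old_customer", "old_supplier"):
    # Fill the primary key of person with the pre-generated new_id.
    conn.execute(f"""
        INSERT INTO person (id, name)
        SELECT m.new_id, o.Name
        FROM {table} o
        JOIN id_map m ON m.old_id = o.ID AND m.old_tablename = ?
    """, (table,))
    # Translate the old ID into the foreign key with the same join.
    conn.execute(f"""
        INSERT INTO address (person_id, street)
        SELECT m.new_id, o.Street
        FROM {table} o
        JOIN id_map m ON m.old_id = o.ID AND m.old_tablename = ?
    """, (table,))

person_rows = conn.execute("""
    SELECT p.id, p.name, a.street
    FROM person p JOIN address a ON a.person_id = p.id
    ORDER BY p.id
""").fetchall()
print(person_rows)
```

Both old tables contain a row with ID 1, which is exactly the collision that rules out reusing the old primary keys and that the old_tablename column resolves.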

You might also look at Jamie Thomson's SSIS Normalizer component. I just found out about it today (haven't actually tried it yet). The example he posts looks a lot like the one in your question.

Related

Data Warehousing GUID to Int PrimaryKeys

I'm a (very) junior Analyst responsible for setting up an mssql DWH which hosts data from our CRM for reporting purposes.
The current CRM uses uniqueidentifiers in its mssql database for all keys, and some of the tables have 8m+ rows. In our reporting software (Qlikview) I can swap the GUIDs for ints and take an 800mb data file down to 90mb which is excellent, however I'd like to perform this logic in the DWH if possible to make it faster and a little cleaner.
My issue is I have no idea how to do so while maintaining FK links to other tables. I have considered maintaining a staging table of GUIDs and associated numeric IDs however this seems inefficient and poses a problem of then trying to write some arbitrary numeric ID to the PK column of the destination table which I'm sure is a terrible idea.
The DWH import works as follows: I have USPs on the source DB performing SELECTs, which are executed by an SSIS package; their output is placed in tables of the same name in the [Staging] schema of the DWH. From there, the transform is performed by USPs on the DWH, also executed by the same SSIS package, which handles execution order and multi-threading. Whatever implementation I come up with will need to be compatible with this architecture (done within USPs that potentially run asynchronously).
I'm very much a SQL noob so I do ask to please link documentation if necessary or at least describe answers in a google-friendly way.
1. Is the removal of the GUIDs the major cause of the possible shrink to 90 MB? Do you not need the GUIDs to process the report?
2. Do you strip the relationships and join almost all tables into as few tables as possible when creating the staging table?
If the answer to both 1 and 2 is yes, then you do not need the GUIDs and simply need a unique int column.
I suggest that, in the SELECT that creates or fills the staging table, you use ROW_NUMBER to replace the GUID column with a unique int column. This only works if you recreate the staging table each time the SSIS package runs.
If you are simply inserting data into an already existing staging table when the SSIS package runs, you can just create an auto-increment primary key column. When you insert data into the staging table, do not insert into that column, so that it generates a unique int value automatically.
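One way to keep foreign-key links while swapping GUIDs for ints is to apply the auto-increment idea to a small key-map table, which is the staging table of GUIDs the question already considers. The sketch below is an illustration under assumptions rather than a tested DWH design: it uses Python's built-in sqlite3 module instead of SQL Server, and staging_account, staging_contact, account_key and assign_keys are made-up names.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_account (account_guid TEXT, name TEXT);
    CREATE TABLE staging_contact (contact_guid TEXT, account_guid TEXT, name TEXT);
    -- Persistent key map: each GUID keeps the same integer across loads.
    CREATE TABLE account_key (account_id INTEGER PRIMARY KEY AUTOINCREMENT,
                              account_guid TEXT UNIQUE);
""")

g1, g2 = str(uuid.uuid4()), str(uuid.uuid4())
conn.executemany("INSERT INTO staging_account VALUES (?, ?)",
                 [(g1, "Acme"), (g2, "Globex")])
conn.executemany("INSERT INTO staging_contact VALUES (?, ?, ?)",
                 [(str(uuid.uuid4()), g2, "Hank"),
                  (str(uuid.uuid4()), g1, "Ivy")])

def assign_keys(conn):
    """Give every GUID that has no integer yet the next auto-increment value."""
    conn.execute("""
        INSERT INTO account_key (account_guid)
        SELECT s.account_guid
        FROM staging_account s
        WHERE NOT EXISTS (SELECT 1 FROM account_key k
                          WHERE k.account_guid = s.account_guid)
    """)

assign_keys(conn)
assign_keys(conn)  # a second load adds nothing, existing keys stay stable

# Child rows pick up the integer key through a join on the GUID.
contacts = conn.execute("""
    SELECT c.name, k.account_id
    FROM staging_contact c
    JOIN account_key k ON k.account_guid = c.account_guid
    ORDER BY c.name
""").fetchall()
key_count = conn.execute("SELECT COUNT(*) FROM account_key").fetchone()[0]
print(key_count, contacts)
```

Unlike ROW_NUMBER over a freshly rebuilt staging table, the key map survives between runs, so an account keeps its integer even when the staging data is reloaded.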

Change Primary keys from int to guid in existing database

I have taken over a project with an existing SQL server installation. The client wants to move everything to the azure SQL and make several on premises databases sync to azure.
The PKs in the tables are ints, and for Azure Data Sync to work the PKs need to be GUIDs. The database consists of several related tables.
My question is therefore: what is the best way to change the PKs to GUIDs and at the same time update the FKs accordingly in the existing tables?
The process as far as I see it:
1. Create a new GUID column.
2. Fill it with new GUID values.
3. Change the PK to the GUID column.
4. Update the data in the FK tables to the new GUIDs.
Is there an easy scriptable way to make this magically happen?
No, there is nothing built into SQL Server that makes this any easier than the process you have already described.
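The data-moving part of that process (steps 1, 2 and 4) can at least be scripted. The following is a rough sketch using Python's built-in sqlite3 module as a stand-in; the customer and orders tables are invented for illustration. Step 3, switching the primary key and foreign key constraints, is DDL that differs between SQLite and SQL Server and is deliberately left out.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical parent and child tables with int keys.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id),
                           total REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0), (12, 1, 3.0);

    -- Step 1: add the new GUID columns next to the existing int keys.
    ALTER TABLE customer ADD COLUMN guid TEXT;
    ALTER TABLE orders   ADD COLUMN customer_guid TEXT;
""")

# Step 2: back-fill a fresh GUID for every parent row.
for (customer_id,) in conn.execute("SELECT id FROM customer").fetchall():
    conn.execute("UPDATE customer SET guid = ? WHERE id = ?",
                 (str(uuid.uuid4()), customer_id))

# Step 4: copy the parent's GUID into each child row via the old int key.
conn.execute("""
    UPDATE orders
    SET customer_guid = (SELECT c.guid FROM customer c
                         WHERE c.id = orders.customer_id)
""")

# Sanity checks before any constraint is switched over.
unresolved = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_guid IS NULL").fetchone()[0]
mismatches = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN customer c ON c.guid = o.customer_guid
    WHERE c.id <> o.customer_id
""").fetchone()[0]
print(unresolved, mismatches)
```

Keeping the int columns in place until both checks return zero means the old relationships remain available as a fallback.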

Performing Calculations in a table's field with Transactions Table

One thing I want to do is build a personal database for myself at home to use a financial database (transaction log, checking/savings account tables, etc), and I want to do this mainly to learn more about developing databases. I am pretty familiar with MS Access, though not put to use in this context, but what I am really trying to learn is SQL Server.
So, that being said, the first question that popped into my mind is this: if I have a transactions table that I want to use as a ledger, is there some way to have the table automatically calculate one field (balance) based on other fields (expense, revenue), similar to what someone might do in Excel?
Or is this something I would have to do with an unbound form and an UPDATE statement kind of approach? If a table constraint exists for this type of idea, I would like to learn it.
I mentioned MS Access in the title, but a SQL Server answer is also most appreciated. Thanks for the help!
Derived data should not be stored except if it needs to be indexed -- you calculate the values in your SQL statements, or in the presentation layer.
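As an illustration of calculating the balance in the SQL statement instead of storing it, here is a small sketch using Python's built-in sqlite3 module. The ledger table and its columns are assumptions made up for this example, and amounts are kept as whole numbers to avoid floating-point noise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical ledger: only the raw amounts are stored.
    CREATE TABLE ledger (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         memo TEXT,
                         revenue INTEGER NOT NULL DEFAULT 0,
                         expense INTEGER NOT NULL DEFAULT 0);
    INSERT INTO ledger (memo, revenue, expense) VALUES
        ('Salary',    2000,   0),
        ('Rent',         0, 800),
        ('Groceries',    0, 150);
""")

# The running balance is derived at query time from all rows up to this one.
balances = conn.execute("""
    SELECT t.memo,
           (SELECT SUM(p.revenue - p.expense)
            FROM ledger p
            WHERE p.id <= t.id) AS balance
    FROM ledger t
    ORDER BY t.id
""").fetchall()
print(balances)  # [('Salary', 2000), ('Rent', 1200), ('Groceries', 1050)]
```

Because nothing derived is stored, correcting an old transaction changes every later balance the next time the query runs, with no UPDATE needed.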
In addition to computed columns in SQL Server tables, you can have them in views, and you can index them. The term is "indexed view": when you create one, the view's result set is materialized and stored like a table, and it is updated on the fly whenever the data the view is derived from changes. You can read about it under the "Types of Views" topic in the same link cited in @Roland Bouman's answer.
Last of all, it's not clear to me why you mention Access at all if you're using SQL Server as your back end. Are you developing your front end in Access?
In MS SQL server, you can use computed columns for this: http://msdn.microsoft.com/en-us/library/ms191250.aspx

Copy Database Data from Many DBs to One. Data Replication (sort of)

This involves data replication, kind of:
We have many sites with SQL Express installed, there is an 'audit' database on each site that has one table in 1st normal form (to make life simple :)
Now I need to take this table from each site and copy its contents (say, rows with a DateTime value > 1/1/200 00:00, though this will obviously change) into a big 'super table' in SQL Server proper, whose primary key consists of the Site Name (which needs injecting) plus the current primary key from the SQL Express table.
e.g. Many SQL Express DBs with the following table columns
ID, Definition Name, Definition Type, DateTime, Success, NvarChar1, NvarChar2 etc etc etc
And the big super table needs to have:
SiteName, ID, Definition Name, Definition Type, DateTime, Success, NvarChar1, NvarChar2 etc etc etc
Here SiteName and ID together form the primary key (they were shown in bold in the original post).
Is there already a Microsoft (or, I suppose, non-MS) app/tool/thing to manage copying all this data across, or do we need to write our own?
Many thanks.
You can use SSIS (which comes with SQL Server) to populate it; a package can be set up with variables that change the connection string to point at the various databases. I have one that loops through a whole list and runs the same process on three different files from three different vendors. You could do something similar to loop through the different site databases. Put the whole list of databases you want to copy the audit data from in a table and loop through it, changing the connection string each time.
However, why on earth would you want one mega audit table per site? If every table in the database populates the audit table as changes happen, then the audit table eventually becomes a huge problem for performance. Every insert, update and delete has to hit this table and then you are proposing to add an export on top of that. This seems to me to be a guaranteed structure for locking and deadlocks and all sorts of nastiness. Do yourself a favor and limit each audit table to the table it is auditing.
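The loop over site databases suggested in this answer could look roughly like the sketch below. It is not an SSIS package: in-memory SQLite databases from Python's standard library stand in for the SQL Express instances and the central server, the sites dictionary stands in for the table of databases, and copy_audit_rows, audit_all and the shortened column list are illustrative assumptions.

```python
import sqlite3

def make_site(rows):
    """Create a stand-in site database holding one audit table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE audit (ID INTEGER PRIMARY KEY,
                                        DefinitionName TEXT,
                                        DateTime TEXT,
                                        Success INTEGER)""")
    conn.executemany("INSERT INTO audit VALUES (?, ?, ?, ?)", rows)
    return conn

def copy_audit_rows(site_name, site_conn, central_conn, since):
    """Copy rows newer than `since`, injecting the site name; returns rows read."""
    rows = site_conn.execute(
        "SELECT ID, DefinitionName, DateTime, Success FROM audit "
        "WHERE DateTime > ?", (since,)).fetchall()
    # OR IGNORE (SQLite syntax) skips rows already copied by an earlier run.
    central_conn.executemany(
        "INSERT OR IGNORE INTO audit_all "
        "(SiteName, ID, DefinitionName, DateTime, Success) "
        "VALUES (?, ?, ?, ?, ?)",
        [(site_name, *row) for row in rows])
    return len(rows)

central = sqlite3.connect(":memory:")
central.execute("""CREATE TABLE audit_all (SiteName TEXT, ID INTEGER,
                                           DefinitionName TEXT, DateTime TEXT,
                                           Success INTEGER,
                                           PRIMARY KEY (SiteName, ID))""")

# In a real setup this list would hold connection strings read from a table.
sites = {
    "north": make_site([(1, "Login", "2024-01-05 10:00", 1),
                        (2, "Backup", "2023-12-30 02:00", 0)]),
    "south": make_site([(1, "Login", "2024-02-01 08:30", 1)]),
}

for _ in range(2):  # running the copy twice must not duplicate rows
    for name, site_conn in sites.items():
        copy_audit_rows(name, site_conn, central, since="2024-01-01 00:00")

copied = central.execute(
    "SELECT SiteName, ID FROM audit_all ORDER BY SiteName, ID").fetchall()
print(copied)
```

Both sites use ID 1, which is why the composite key of SiteName and ID is needed in the central table.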
Things to consider:
Linked servers and sp_msforeachdb as part of a do-it-yourself solution.
SQL Server Replication (by Microsoft) (which I believe can pull data from SQL Server Express)
SQL Server Integration Services which can pull data from SQL Server Express instances.
Personally, I would investigate Integration Services first.
Good luck.
You could do this with SymmetricDS. SymmetricDS is open source, web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.
As of right now, however, you would need to implement a custom IDataLoaderFilter extension point (in Java) to add the extra column. The metadata would be available though because your SiteName would be the external_id.

Moving client data from one database to a new one

Our application architecture allows us to host multiple clients in a single database, and also to host multiple databases. This allows us to scale out by distributing clients across multiple databases. For example, 20 clients can be in database A, and another 15 could be in database B. We use a ClientID field in almost every table to partition client data. All our tables' primary keys are INT identity TableID fields.
I'm looking for a tool/script that would help me extract client data from one database, and move it to a brand new database (so the PKs can stay the same). I'm hoping this exists already so we don't have to build our own. Pretty flexible in how this could work, but ideally it just generates a large .sql file with all the necessary INSERTS in the right order to move the data, and another sql file with all the necessary DELETES to erase the data from the source.
If it makes any difference we are on SQL Server 2008.
If you have standard or enterprise, you do have SSIS. Although it may not qualify as a "tool", it is fairly easy to implement in this scenario.
I can recommend Redgate SQL Data Compare for this. We use it for syncing data, and we use their SQL Compare to sync the database schema.
Both tools can either output SQL scripts that you execute yourself, or execute the scripts themselves.
They also have command-line versions of the tools, so you could use them in a deployment script, though I haven't tried this.
They both work really well, and are no doubt worth the price.
This may not be the answer you are looking for, but you should consider using a GUID as a key. This will ensure that you have some type of unique identifier for all your records and that you can avoid collisions between identity keys / integer-based indexes. It would add another degree of traceability should something go wrong when you migrate between databases.
SplendidCRM uses this technique when importing data from other DB systems.
Update:
My assumption was that transferring data between databases was not that frequent and that you needed a database architecture for that task. I would use the GUID as a lookup key, specifically for validating the transfer of data, but I would NOT use it as a primary key for joins in standard operations such as URLs. Although GUIDs are unique across databases, the trade-off is that they are slow.
In other words, the GUIDs would be in addition to your existing primary keys and would act as a means of validation should something go wrong. If you need a ClientID in database A to retain the same value in database B, then using an identity column as that identifier will be an issue. You may have to create another identifier that is not auto-generated. This could be something other than a GUID, but my instinct is that integers alone will not be enough. Maybe you can create a column that is a hash of the identity key, customer name and database name, or, more simply, just concatenate those columns into a varchar column.
