Data Warehousing GUID to Int PrimaryKeys

I'm a (very) junior Analyst responsible for setting up an mssql DWH which hosts data from our CRM for reporting purposes.
The current CRM uses uniqueidentifiers for all keys in its MSSQL database, and some of the tables have 8m+ rows. In our reporting software (QlikView) I can swap the GUIDs for ints and take an 800 MB data file down to 90 MB, which is excellent. However, I'd like to perform this logic in the DWH if possible, to make it faster and a little cleaner.
My issue is that I have no idea how to do so while maintaining FK links to other tables. I have considered maintaining a staging table of GUIDs and associated numeric IDs, but this seems inefficient, and it poses the problem of then trying to write some arbitrary numeric ID to the PK column of the destination table, which I'm sure is a terrible idea.
The DWH import works as follows: I have USPs on the source db performing SELECTs, which are executed by an SSIS package, and the output is placed in tables of the same name on the [Staging] schema of the DWH. From there, the transform is performed by USPs on the DWH, also executed by the same SSIS package, which handles execution order and multi-threading. Whatever implementation I come up with will need to be compatible with this architecture (done within USPs that potentially run asynchronously).
I'm very much a SQL noob so I do ask to please link documentation if necessary or at least describe answers in a google-friendly way.

1. Is removing the GUIDs the main reason the file shrinks to 90 MB? Do you not need the GUIDs to produce the reports?
2. Do you strip the relationships and join almost all tables into as few tables as possible when creating the staging tables?
If the answer to both 1 and 2 is yes, then you do not need the GUIDs and simply need a unique int column.
I suggest that, in the SELECT that creates or fills the staging table, you use ROW_NUMBER to replace the GUID column with a unique int column. This only works if you recreate the staging table each time the SSIS package runs.
If you are simply inserting data into an already existing staging table when the SSIS package runs, then you can just create an auto-increment (IDENTITY) primary key column. When you insert data into the staging table, leave that column out of the INSERT so that it generates a unique int value automatically.
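One way to keep the FK links while swapping GUIDs for ints is a small key table per entity, sketched below. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for SQL Server (AUTOINCREMENT plays the role of an IDENTITY column), and every table and column name (staging_account, account_key and so on) is hypothetical rather than taken from the question.

```python
import sqlite3
import uuid

# SQLite stands in for SQL Server; all table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_account (account_guid TEXT, name TEXT);
    CREATE TABLE staging_contact (contact_guid TEXT, account_guid TEXT, name TEXT);

    -- One key table per entity: the auto-increment column hands out the ints.
    CREATE TABLE account_key (account_id INTEGER PRIMARY KEY AUTOINCREMENT,
                              account_guid TEXT NOT NULL UNIQUE);
    CREATE TABLE contact_key (contact_id INTEGER PRIMARY KEY AUTOINCREMENT,
                              contact_guid TEXT NOT NULL UNIQUE);
""")

acme, globex = str(uuid.uuid4()), str(uuid.uuid4())
conn.executemany("INSERT INTO staging_account VALUES (?, ?)",
                 [(acme, "Acme"), (globex, "Globex")])
conn.executemany("INSERT INTO staging_contact VALUES (?, ?, ?)",
                 [(str(uuid.uuid4()), acme, "Ann"),
                  (str(uuid.uuid4()), globex, "Bob"),
                  (str(uuid.uuid4()), acme, "Cy")])


def refresh_keys(conn):
    """Give an int to every GUID not seen before; existing ints never change."""
    for entity in ("account", "contact"):  # fixed names, safe to format into SQL
        conn.execute(f"""
            INSERT INTO {entity}_key ({entity}_guid)
            SELECT s.{entity}_guid FROM staging_{entity} s
            WHERE NOT EXISTS (SELECT 1 FROM {entity}_key k
                              WHERE k.{entity}_guid = s.{entity}_guid)
        """)


refresh_keys(conn)
refresh_keys(conn)  # a second load adds no new keys

# Translate both the PK and the FK by joining through the key tables.
rows = conn.execute("""
    SELECT ck.contact_id, ak.account_id, c.name
    FROM staging_contact c
    JOIN contact_key ck ON ck.contact_guid = c.contact_guid
    JOIN account_key ak ON ak.account_guid = c.account_guid
    ORDER BY ck.contact_id
""").fetchall()
print(rows)
```

Because the key tables persist between loads, a GUID keeps the same int on every run, which ROW_NUMBER alone does not guarantee. In SQL Server the same NOT EXISTS insert and joins could live in the transform USPs, though the syntax for the identity column differs.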

Related

IntelliJ database tool keeps indexing all tables when I modify one table

I am using the database tool in IntelliJ Idea 2016.2.1.
I have a database schema that has about 100 tables using DB2 server. Most of the tables are not related.
During development, I often need to add a new column to a table. However, each time, IntelliJ starts indexing all the tables, which takes a very long time, even though the table I modified is not related to any other table. The table I modify is also a trivial table used only for testing.
Question: Is there a way to avoid re-indexing all tables when not necessary?

Design for importing definition data from Excel into SQL Server

We have a Restaurant Inventory Control system that uses SQL Server 2008 R2.
It takes a very long time to add all the definition data: stock items, yields, packsizes, recipes, categories etc. So, our clients have asked if they can upload it from Excel.
Before I just jump in and start, I want to find out if there is a best practice way to do this.
I know all the tools: SSIS, stored procedures etc. But I'm looking for advice/resources that can help with the design process: how best to set up the spreadsheet, validate the data, create the child/parent relationships etc.
This must be a fairly common project -- so it must have a standard design/approach and that's what I'm looking for.

I think the design will depend on the technologies you're most comfortable with. If you're comfortable with SSIS and stored procedures, this is the general pattern I would use:
Excel Template - I wouldn't spend too much time on this, add the headers and sheets necessary for the tables. You can lock down certain things and/or implement rules, but most of your validation would be done in stored procs.
SSIS - Have a package that loads the Excel data into staging tables. Rows with errors get added to an error log, to be presented to the user along with the validation issues from the stored procedures.
Staging Tables - Have one staging table per sheet/production table, with an ExecutionId column in each staging table to allow parallel processing. Either allow all columns to be NULL so you can always get the data into the staging tables, or set the proper NULL conditions and have SSIS redirect the failing rows on error. Don't have any primary key / foreign key relationships in the staging tables; these can be validated in the stored procedures.
Stored Procedures - Validate the staging data, any issues found would be added to the error log to be presented to the user or person performing the import. If there are no issues, import the data into the production tables. If there is existing data in the production tables, you could do a comparison and update if applicable.
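The staging/validate/import pattern above can be sketched roughly as follows. This is a minimal illustration, not the answerer's implementation: Python's sqlite3 stands in for SQL Server and the stored procedure, and the table names (staging_stock_item, import_error, category, stock_item) and the two validation rules are assumptions made up for the example.

```python
import sqlite3

# SQLite stands in for SQL Server; all names and rules here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging: every column nullable, no PK/FK constraints, tagged by run.
    CREATE TABLE staging_category   (execution_id INTEGER, name TEXT);
    CREATE TABLE staging_stock_item (execution_id INTEGER, name TEXT, category_name TEXT);
    CREATE TABLE import_error (execution_id INTEGER, sheet TEXT, item TEXT, message TEXT);

    -- Production tables carry the real constraints.
    CREATE TABLE category   (category_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE stock_item (stock_item_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                             category_id INTEGER NOT NULL REFERENCES category(category_id));
""")


def import_batch(conn, execution_id):
    """Validate one upload; import it only if no errors were logged."""
    run = {"run": execution_id}
    # Rule 1: every stock item needs a name.
    conn.execute("""
        INSERT INTO import_error (execution_id, sheet, item, message)
        SELECT execution_id, 'StockItems', name, 'stock item name is blank'
        FROM staging_stock_item
        WHERE execution_id = :run AND (name IS NULL OR TRIM(name) = '')
    """, run)
    # Rule 2: the parent category must be in this upload or already in production.
    conn.execute("""
        INSERT INTO import_error (execution_id, sheet, item, message)
        SELECT s.execution_id, 'StockItems', s.name,
               'unknown category: ' || COALESCE(s.category_name, '(blank)')
        FROM staging_stock_item s
        WHERE s.execution_id = :run
          AND NOT EXISTS (SELECT 1 FROM staging_category c
                          WHERE c.execution_id = :run AND c.name = s.category_name)
          AND NOT EXISTS (SELECT 1 FROM category p WHERE p.name = s.category_name)
    """, run)
    errors = conn.execute(
        "SELECT COUNT(*) FROM import_error WHERE execution_id = :run", run
    ).fetchone()[0]
    if errors == 0:
        # Parents first, then children resolved to the parent's generated key.
        conn.execute("""
            INSERT INTO category (name)
            SELECT DISTINCT c.name FROM staging_category c
            WHERE c.execution_id = :run AND c.name IS NOT NULL
              AND NOT EXISTS (SELECT 1 FROM category p WHERE p.name = c.name)
        """, run)
        conn.execute("""
            INSERT INTO stock_item (name, category_id)
            SELECT s.name, p.category_id
            FROM staging_stock_item s
            JOIN category p ON p.name = s.category_name
            WHERE s.execution_id = :run
        """, run)
    return errors


conn.execute("INSERT INTO staging_category VALUES (1, 'Dairy')")
conn.executemany("INSERT INTO staging_stock_item VALUES (?, ?, ?)",
                 [(1, "Milk", "Dairy"), (1, "Cheese", "Dairy"),
                  (2, "Nails", "Hardware")])  # run 2 references a missing parent

errors_run_1 = import_batch(conn, 1)
errors_run_2 = import_batch(conn, 2)
print(errors_run_1, errors_run_2)
```

Filtering every statement by execution_id is what lets two uploads sit in the same staging tables at once without seeing each other's rows.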

Vs2010 Data Generation Plan fails with "Data generation failed because of the following exception: Column "xyz" does not allow DBNull.Value"

I'm fairly new to the VS data capabilities, and this is my first data generation plan. I have implemented a database using a VS2010 database project, and used it to deploy to a SQL Server Express 2008 database. All the tables use identity columns as their primary keys, and they're related to one another with foreign keys.
I set up a data generation plan, but when I try to generate data with it, the tables are simply populated in alphabetical order, which is of course going to fail. The only tables that populate correctly are the lookup tables and other sorts of independent entities with no FK constraints. The rest are skipped after the first table fails.
Supposedly the generation plan determines the population order based on FK dependencies. What happened?
edit: someone with the rep for it should make a visual-studio-data-tools tag, since DBPro is no longer (nor really ever was) a product name.

So apparently according to this thread the data generation plan blows up when you have a table containing only a primary key and no other columns. It turns out that one of my independent entities, whose only purpose is to serve as a joinder to one of my other tables, fit this description. After adding a harmless Description column, I was able to proceed fixing other problems until the generation plan completed successfully.

Moving client data from one database to a new one

Our application architecture allows us to host multiple clients in a single database, and also to host multiple databases. This allows us to scale out by distributing clients across multiple databases. For example, 20 clients can be in database A, and another 15 could be in database B. We use a ClientID field in almost every table to partition client data. All our tables' primary keys are INT identity TableID fields.
I'm looking for a tool/script that would help me extract client data from one database, and move it to a brand new database (so the PKs can stay the same). I'm hoping this exists already so we don't have to build our own. Pretty flexible in how this could work, but ideally it just generates a large .sql file with all the necessary INSERTS in the right order to move the data, and another sql file with all the necessary DELETES to erase the data from the source.
If it makes any difference we are on SQL Server 2008.

If you have standard or enterprise, you do have SSIS. Although it may not qualify as a "tool", it is fairly easy to implement in this scenario.

I can recommend Redgate SQL Data Compare for this. We use it for syncing data, and use their SQL Compare to sync the database schema.
Both tools can either output SQL that you execute yourself, or execute the SQL scripts themselves.
They have command-line versions of the tools too, so you could use them in a deployment script, though I haven't tried this.
They both work really well, and are no doubt worth the price.

Not the answer you may be looking for, but you should consider using a GUID as a key. This will ensure that you have some type of unique identifier for all your records and that you can avoid collisions with identity keys / integer-based indexes. It would add another degree of traceability should something go wrong when you migrate between databases.
SplendidCRM uses this technique when importing data from other DB systems.
Update:
My assumption was that transferring data between databases was not that frequent and that you needed some database architecture to support that task. I would use the GUID as a lookup key specifically to validate the transfer of data, but I would NOT use it as a primary key for joins or in standard operations such as URLs. Although GUIDs are unique across databases, the trade-off is that they are slow.
In other words, the GUIDs would be in addition to your existing primary keys and act as a means of validation should something go wrong. If you need ClientID in Database A to retain the same value in Database B, then an identity column as that identifier will be an issue. You may have to create another identifier that is not "auto-generated". This could be something other than the GUID, but my instinct is that integers alone will not be enough. Maybe you can create a column that is a hash of the identity key, customer name and database name or, more simply, just concatenate those columns into a varchar column.
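As a rough illustration of the hash-or-concatenate idea, the sketch below builds such an identifier in Python. The function name transfer_key, the "|" separator and the choice of SHA-256 are all assumptions made for the example; the answer does not prescribe any of them.

```python
import hashlib


def transfer_key(database_name: str, identity_key: int, customer_name: str) -> str:
    """Hypothetical helper: concatenate the three values, then hash the result."""
    # The separator keeps ("A", 12) and ("A1", 2) from producing the same string;
    # it assumes "|" never appears in the values themselves.
    raw = "|".join([database_name, str(identity_key), customer_name])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# The same identity value in two databases yields two different keys.
key_a = transfer_key("DatabaseA", 42, "Acme Ltd")
key_b = transfer_key("DatabaseB", 42, "Acme Ltd")
print(key_a != key_b)
```

Storing the plain concatenated string instead of the hash is the simpler variant mentioned above; it is easier to debug but takes a variable amount of space.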

Use SSIS to migrate and normalize database

We have an MS Access database that we want to migrate to a SQL Server Database with a new DB design. A part of the application that uses the SQL Server DB is already written.
I looked around to find out how to do the migration step most easily and started with Microsoft's SQL Server Integration Services (SSIS). Now I have gotten to the point where I want to split a table vertically for normalization reasons.
A made-up example looks like this:
MS Access table person: ID, Name, Street
SQL Server table person: id, name
SQL Server table address: id, person_id, street
How can I complete this task best with SSIS? The id columns are identity (autoincrement) columns, so I cannot insert the old ID. How can I put the correct person_id foreign key in the address table?
There might even be a table which has to be broken up into three tables, where a row in table2 belongs to a row in table1 and a row in table3 belongs to a row in table2.
Is SSIS the appropriate means for this?
EDIT
Although this is a one-time migration, we need to have an automated and repeatable process, because the production database is under heavy usage and we are working on the migration in our development environment with recent, but not up-to-date data. We plan for one test run of the migration and have the customer review the behaviour. If everything is fine, we will go for the real migration.
Most of the given solutions include lots of manual steps and are thus not appropriate.

Use the execute SQL Task and write the statement yourself.
For the parent table, select from the old table into the new table, then do the same for the rest as you progress. Make sure you set IDENTITY_INSERT to ON for the parent table and reuse your old IDs. That will help you keep your data integrity.

For migrating your Access tables into SQL Server, use SSMA, not the Upsizing Wizard from Access.
You'll get a lot more tools at your disposal.
You can then break up your tables one by one from within SQL Server.
I'm not sure if there are any tools that can help you split your tables automatically; at least I couldn't find any. It's not too difficult to do manually, although how much work is required depends on how you used the original tables in your VBA code and forms in the first place.
A side note
Regarding normalization, don't go overboard with it: I know your example was just that but normalizing customer addresses is not always (rarely?) needed.
How many addresses can a person have?
If you count a home address, business address, delivery address, billing address, that's probably the most you'll ever need.
In that case, it's better to just keep them in the same table. Normalizing that data will just require more work to recombine and offers no benefit.
Of course, there are cases where it would make sense to normalise but I've seen people going overboard with the notion (I've been guilty of it as well) and then find themselves struggling to build more complex queries to join all that split data, making development and maintenance harder and often suffering a performance penalty in the process.

Access is so user-friendly, why not normalize your tables in Access, and then upsize the finished structure from there?

I found a different solution which was not mentioned yet and allows us to use all the comfort and options of the dataflow task:
If the destination database is on a local SQL Server, you can use a dataflow task with SQL Server destination instead of an OLE DB destination.
For a SQL Server destination you can mark the "keep identities" option. (I do not know if the English names are correct, because we have a German version.) With this you can write into identity columns.
We found that we cannot use the old primary keys everywhere, because we have some tables that take a union of records from multiple tables.
We start the process by building a temporary mapping table with columns
new_id (identity)
old_id (int)
old_tablename (string)
We first fill in the old_id values for every table that is referenced by a foreign key in the new schema. The new_id values are generated automatically by SQL Server.
So we can use a join to translate from old_id to new_id where needed. We use the new_id values to fill the identity (primary key) columns in the new tables with the "keep identities" option and can simply look them up in our mapping table for the foreign keys by a join.
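The mapping-table approach described above might look roughly like this. It is a sketch under assumptions: Python's sqlite3 stands in for SQL Server and the SSIS data flow, the second source table old_employee is invented to show colliding old IDs, and inserting an explicit id into person plays the part of the "keep identities" option.

```python
import sqlite3

# SQLite stands in for SQL Server; old_employee and id_map are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Old tables whose IDs collide once their rows are merged.
    CREATE TABLE old_person   (ID INTEGER, Name TEXT, Street TEXT);
    CREATE TABLE old_employee (ID INTEGER, Name TEXT, Street TEXT);
    INSERT INTO old_person   VALUES (1, 'Ann', 'Oak St'), (2, 'Bob', 'Elm St');
    INSERT INTO old_employee VALUES (1, 'Cy', 'Pine St');

    -- Temporary mapping table: new_id is generated automatically.
    CREATE TABLE id_map (new_id INTEGER PRIMARY KEY AUTOINCREMENT,
                         old_id INTEGER NOT NULL,
                         old_tablename TEXT NOT NULL,
                         UNIQUE (old_tablename, old_id));

    -- New normalized schema.
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (id INTEGER PRIMARY KEY AUTOINCREMENT,
                          person_id INTEGER NOT NULL REFERENCES person(id),
                          street TEXT);

    -- Step 1: register every old row that something will reference.
    INSERT INTO id_map (old_id, old_tablename) SELECT ID, 'old_person'   FROM old_person;
    INSERT INTO id_map (old_id, old_tablename) SELECT ID, 'old_employee' FROM old_employee;
""")

for table in ("old_person", "old_employee"):  # fixed names, safe to format into SQL
    # Step 2: write new_id straight into the primary key column.
    conn.execute(f"""
        INSERT INTO person (id, name)
        SELECT m.new_id, o.Name
        FROM {table} o
        JOIN id_map m ON m.old_id = o.ID AND m.old_tablename = ?
    """, (table,))
    # Step 3: look up the same new_id for the foreign key.
    conn.execute(f"""
        INSERT INTO address (person_id, street)
        SELECT m.new_id, o.Street
        FROM {table} o
        JOIN id_map m ON m.old_id = o.ID AND m.old_tablename = ?
    """, (table,))

result = conn.execute("""
    SELECT p.name, a.street
    FROM person p JOIN address a ON a.person_id = p.id
    ORDER BY p.name
""").fetchall()
print(result)  # [('Ann', 'Oak St'), ('Bob', 'Elm St'), ('Cy', 'Pine St')]
```

Keying the map on (old_tablename, old_id) is what lets rows from several old tables share one new table without their IDs clashing.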

You might also look at Jamie Thomson's SSIS Normalizer component. I just found out about it today (haven't actually tried it yet). The example he posts looks a lot like the one in your question.