I have never done a conversion from a relational database to a data warehouse before. I have the data warehouse tables and models created and am currently creating staging tables. How do I go about actually populating the data warehouse tables? For instance, I am populating a portion of a fact table, but a primary key cannot contain any null data, so transferring the data over violates that constraint. I'm assuming primary and foreign key constraints are created prior to migration. Is this correct? Any help is appreciated; I may be missing some simple database logic here.
You can disable the constraints while you are waiting for the remaining data, or remove them entirely and enforce them only in daily validation checks to be sure the data is valid.
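For illustration, a minimal T-SQL sketch of that approach; the table and constraint names (dbo.FactSales, FK_FactSales_DimDate) are hypothetical:

    -- Disable the constraint while the load is in progress.
    ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimDate;

    -- ... run the ETL load here ...

    -- Re-enable it; WITH CHECK re-validates the existing rows, so the
    -- constraint is trusted again by the query optimizer afterwards.
    ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimDate;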
I'm a (very) junior Analyst responsible for setting up an MSSQL DWH which hosts data from our CRM for reporting purposes.
The current CRM uses uniqueidentifiers in its MSSQL database for all keys, and some of the tables have 8m+ rows. In our reporting software (QlikView) I can swap the GUIDs for ints and take an 800 MB data file down to 90 MB, which is excellent; however, I'd like to perform this logic in the DWH if possible to make it faster and a little cleaner.
My issue is that I have no idea how to do so while maintaining the FK links to other tables. I have considered maintaining a staging table of GUIDs and associated numeric IDs, but this seems inefficient and poses the problem of then trying to write some arbitrary numeric ID to the PK column of the destination table, which I'm sure is a terrible idea.
The DWH import works as follows: I have USPs on the source DB performing SELECTs which are executed by an SSIS package, the outputs of which are placed in tables of the same name on the [Staging] schema of the DWH. From there, the transform is performed by USPs on the DWH, also executed by the same SSIS package, which handles execution order and multi-threading. Whatever implementation I come up with will need to be compatible with this architecture (done within USPs that potentially run asynchronously).
I'm very much a SQL noob, so please link documentation where necessary, or at least describe answers in a Google-friendly way.
1. Is the removal of the GUIDs the major cause of the possible shrink to 90 MB? Do you not need the GUIDs to build the report?
2. Do you strip the relationships and join almost all the tables into as few tables as possible when creating the staging tables?

If the answer to both 1 and 2 is yes, then you do not need the GUIDs and simply need a unique int column.
I suggest that in the SELECT statement that creates/inserts the staging table you use ROW_NUMBER() to replace the GUID column with a unique int column. This only works if you recreate the staging table each time the SSIS script runs.
If you are simply inserting data into an already existing staging table when the SSIS script runs, then you can just create an auto-increment (IDENTITY) primary key column. When you insert data into the staging table, do not insert into that column, and it will automatically generate unique int values. A sketch of both variants follows.
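A minimal T-SQL sketch of both variants; the table and column names (Staging.Contact, Source.Contact, ContactGuid) are hypothetical:

    -- Variant 1: recreate the staging table on every run, replacing the
    -- GUID key with a ROW_NUMBER()-generated int key.
    IF OBJECT_ID('Staging.Contact') IS NOT NULL
        DROP TABLE Staging.Contact;

    SELECT
        ROW_NUMBER() OVER (ORDER BY c.ContactGuid) AS ContactKey,
        c.FirstName,
        c.LastName
    INTO Staging.Contact
    FROM Source.Contact AS c;

    -- Variant 2: keep the staging table and let an IDENTITY column
    -- generate the int key; simply omit it from the insert column list.
    CREATE TABLE Staging.Contact2
    (
        ContactKey int IDENTITY(1, 1) PRIMARY KEY,
        FirstName  nvarchar(100),
        LastName   nvarchar(100)
    );

    INSERT INTO Staging.Contact2 (FirstName, LastName)
    SELECT FirstName, LastName
    FROM Source.Contact;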
I was recently developing a .NET MVC web application and sent the database schema to a DBA I work with to get the database built on a production DB server. The DBA asked me if I needed primary keys in all my tables. I said yes, primarily because it is good DB design practice. When I asked why, the DBA told me that our organization prefers to minimize the number of tables with primary keys on its database servers to conserve resources. Is there some sort of detriment to having primary keys in data tables?
When you make a 'join table', the primary keys from the contributing tables form a composite key for the join table. It is then quite possible that this composite key is indexed.
Inefficient indexing strategies can degrade performance.
An example is the InnoDB engine for MySQL, which is one I work with a lot. With InnoDB, every secondary index entry is concatenated with the value of the corresponding primary key. When a query reads a record via a secondary index, that primary key value is used to find the actual record.
So the primary key can affect performance, especially if it is something big like a Java UUID (128 bits, commonly stored as a 36-character string).
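A sketch of why this matters, using hypothetical MySQL table definitions: in the first table, every entry of idx_email silently carries a copy of the 36-byte CHAR primary key; in the second, only a 4-byte INT.

    -- Hypothetical InnoDB tables illustrating the hidden cost:
    -- every secondary index entry stores a copy of the primary key.

    CREATE TABLE user_uuid (
        id    CHAR(36) PRIMARY KEY,    -- 36 bytes appended to each idx_email entry
        email VARCHAR(255),
        INDEX idx_email (email)
    ) ENGINE = InnoDB;

    CREATE TABLE user_int (
        id    INT PRIMARY KEY,         -- only 4 bytes appended to each idx_email entry
        email VARCHAR(255),
        INDEX idx_email (email)
    ) ENGINE = InnoDB;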
I have several tables in my database A which are interconnected via foreign keys and contain values. These values need to be transferred to another database B; all dependencies must be preserved, but the actual (numeric) values of the primary and foreign keys are, of course, of no importance.
What would be the easiest way to fulfill this task using SSIS?
Here are the approaches I tried, without much success:

I implemented a very sophisticated view with flattened data and a lot of redundancy, and ran into the problem of how to split the data from this flattened view back into several tables connected via foreign keys. This might be a solution, but I would personally prefer to avoid the data flattening step if possible.

I tried to copy the tables one-to-one, using the NOCHECK option to lift the constraint checks and perform insertion into the PK and FK fields (sketched below). This, however, confines my transfer to a completely fresh import; I cannot just "add" some new data to an existing set of data, which would be nice.
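For reference, a minimal sketch of that second approach, assuming both databases live on the same instance and hypothetical tables dbo.Parent and dbo.Child:

    -- Lift constraint checking on the target table.
    ALTER TABLE B.dbo.Child NOCHECK CONSTRAINT ALL;

    -- Copy rows verbatim, keeping the original key values.
    SET IDENTITY_INSERT B.dbo.Parent ON;
    INSERT INTO B.dbo.Parent (ParentId, Name)
    SELECT ParentId, Name FROM A.dbo.Parent;
    SET IDENTITY_INSERT B.dbo.Parent OFF;

    SET IDENTITY_INSERT B.dbo.Child ON;
    INSERT INTO B.dbo.Child (ChildId, ParentId, Value)
    SELECT ChildId, ParentId, Value FROM A.dbo.Child;
    SET IDENTITY_INSERT B.dbo.Child OFF;

    -- Re-enable and re-validate the constraints.
    ALTER TABLE B.dbo.Child WITH CHECK CHECK CONSTRAINT ALL;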
Any other suggestions?
Integration Services has Control Flow tasks called Transfer Database Task and Transfer SQL Server Objects Task designed for exactly what you need.
Here is a tutorial for what you need: LINK.
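If you need incremental loads rather than a full copy, another pattern (a sketch under assumed names, for SQL Server 2008 or later) is to let database B generate new surrogate keys and translate the foreign keys through a mapping table; the MERGE ... ON 1 = 0 construct is used only because, unlike a plain INSERT, it lets the OUTPUT clause capture source columns alongside the newly generated keys:

    -- Map old parent keys to the newly generated ones.
    DECLARE @KeyMap TABLE (OldId int, NewId int);

    MERGE INTO B.dbo.Parent AS tgt
    USING A.dbo.Parent AS src
        ON 1 = 0                      -- never matches: everything is inserted
    WHEN NOT MATCHED THEN
        INSERT (Name) VALUES (src.Name)
    OUTPUT src.ParentId, inserted.ParentId INTO @KeyMap (OldId, NewId);

    -- Insert the children, translating the foreign key through the map.
    INSERT INTO B.dbo.Child (ParentId, Value)
    SELECT m.NewId, c.Value
    FROM A.dbo.Child AS c
    JOIN @KeyMap AS m ON m.OldId = c.ParentId;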
How do I store data that is shared between databases?
Suppose a database for a contact management system. Each user is given a separate database, and users can store their contacts' education information.
Currently there's a table called School in every database, where the name of every school in the country is stored. The School table is referenced as a FK by the Contact table.
School table gets updated every year or so, as new schools get added or existing schools change names.
As the school information is common across all user databases, moving it into a separate common database seems like a better idea. But when it's moved to a separate database, you cannot create a FK constraint between School and Contact.
What is the best practice for this kind of situation?
(p.s. I'm using SQL Server if that is relevant)
Things to consider
A database is a unit of backup/restore.
It may not be possible to restore two databases to the same point in time.
Foreign keys are not supported across databases.
Hence, I would suggest managing the School -- and any other common table -- in one reference DB and then replicating those tables to other DBs.
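If full replication is more than you need, a scheduled job per user database can serve the same purpose; a minimal sketch, assuming a reference database named RefDB and hypothetical column names:

    -- Refresh the local School copy from the reference database.
    MERGE INTO dbo.School AS tgt
    USING RefDB.dbo.School AS src
        ON tgt.SchoolId = src.SchoolId
    WHEN MATCHED AND tgt.Name <> src.Name THEN
        UPDATE SET tgt.Name = src.Name
    WHEN NOT MATCHED THEN
        INSERT (SchoolId, Name) VALUES (src.SchoolId, src.Name);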
Just straight out of the box, foreign key constraints aren't going to help you. You could look into replicating the individual School table.
Based on the fact that you won't query tables with the SchoolID column very often, I'll assume that inserts/updates to these tables will be really rare... In this case you could create a constraint on the table in which you need the FKs that checks for the existence of that SchoolID in the School table.
Note that every insert/update to the table with the SchoolID column will literally perform a query against another DB, so the distance between the databases, the way they connect to each other, and many other factors may impact the performance of the insert/update statements.
Still, if they're on the same server and you have your indexes and primary keys all set up, the query should be fairly fast.
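A minimal sketch of that cross-database check, assuming the shared database is named CommonDB and the referencing table is dbo.Contact:

    -- A scalar function that looks up the school in the shared database.
    CREATE FUNCTION dbo.SchoolExists (@SchoolId int)
    RETURNS bit
    AS
    BEGIN
        RETURN CASE WHEN EXISTS (SELECT 1
                                 FROM CommonDB.dbo.School
                                 WHERE SchoolId = @SchoolId)
                    THEN 1 ELSE 0 END;
    END;
    GO

    -- Enforce it with a CHECK constraint on the referencing table.
    ALTER TABLE dbo.Contact WITH CHECK
        ADD CONSTRAINT CK_Contact_SchoolExists
        CHECK (dbo.SchoolExists(SchoolId) = 1);

Note that, unlike a real FK, this only fires on inserts and updates to Contact; rows deleted from CommonDB.dbo.School are not caught.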
I've just come across something disturbing. I was trying to implement transactional replication from a database whose design is not under our control; the replication was intended to allow reporting without taxing the system too much. Upon trying the replication, only some of the tables went across.

On investigation, the tables were not selected for replication because they don't have a primary key. I thought this couldn't be it: the key is even shown as a primary key if I use ODBC and MS Access, but not in Management Studio. Also, the queries are not ridiculously slow.

I tried inserting a duplicate record and it failed, complaining about a unique index (not a primary key). It seems the tables have been implemented using a unique index as opposed to a primary key. Why, I do not know; I could scream.

Is there any way to perform transactional replication, or an alternative? It needs to be live (within the last minute or two). The main DB server is currently SQL 2000 SP3a and the reporting server is 2005.

The only thing I have currently thought of trying is setting the replication up as if it were another type of database. I believe replication to, say, Oracle is possible; would this force the use of an ODBC driver, like I assume Access is using, and hence show a primary key? I don't know if that is accurate; I'm out of my depth on this.
As MSDN states, it is not possible to create transactional replication on tables without primary keys. You could use merge replication (one-way), which doesn't require a primary key and automatically creates a rowguid column if one doesn't exist:
Merge replication uses a globally unique identifier (GUID) column to identify each row during the merge replication process. If a published table does not have a uniqueidentifier column with the ROWGUIDCOL property and a unique index, replication adds one. Ensure that any SELECT and INSERT statements that reference published tables use column lists. If a table is no longer published and replication added the column, the column is removed; if the column already existed, it is not removed.
Unfortunately, you will pay a performance penalty if you use merge replication.

If you need replication for reporting only, and you don't need the data to be exactly the same as on the publisher, then you could also consider snapshot replication.
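For completeness: if schema changes on the source were ever permitted, promoting a unique index to a primary key would make the table eligible for transactional replication. A sketch with hypothetical names (SQL 2000 DROP INDEX syntax; the key columns must be NOT NULL):

    -- Replace the unique index with a primary key on the same column(s).
    DROP INDEX SomeTable.UQ_SomeTable_Code;

    ALTER TABLE dbo.SomeTable
        ADD CONSTRAINT PK_SomeTable PRIMARY KEY (Code);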