Adding new dimensions to data warehouse (adding new columns to fact table) - sql-server

I am building an OLAP database and am running into some difficulty. I have already set up a fact table that includes columns for sales data, like quantity, sales, cost, profit, etc. The current dimensions I have are Date, Location, and Product. This means I have the foreign key columns for these dimension tables included in the fact table as well. I have loaded the fact table with this data.
I am now trying to add a dimension for salesperson. I have created the dimension, which has the salesperson's ID and their name and location. However, I can't edit the fact table to add the new column that will act as a foreign key to the salesperson dimension.
I want to use SSIS to do this, by using a lookup against the sales database the fact table is based on to resolve the salesperson ID, but I first need to add the Salesperson column to my fact table. When I try to do so, I get an error saying that the column can't be created because it would be populated with NULLs.

I'm going to take a guess as to the problem you're having, but this is just a guess: your question is a little difficult to understand.
I'm going to make the assumption that you have created a Fact table with x columns, including links to the Date, Location, and Product dimensions. You have then loaded that fact table with data.
You are now trying to add a new column, SalesPerson_SK (or ID), to that table. You do not wish to allow NULL values in the database, so you clear the 'allow NULL' checkbox. However, when you attempt to save your work, the save fails with an error stating that it cannot insert NULL into the SalesPerson_SK column.
There are a few ways around this limitation. One, which is probably the best if you are still in the development stage, is to issue the following command:
TRUNCATE TABLE dbo.FactMyFact
which will remove all data from the table, allowing you to make your changes and reload the table with the new column included.
If, for some reason, you cannot do so, you can alter the table to add the column but include a default constraint that will put a default value into your fact table, essentially pointing at a dummy record that says, "I don't know what this is":
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NOT NULL
CONSTRAINT DF_FactMyFact_SalesPersonSK DEFAULT 0
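If you go the default-constraint route, the dimension itself needs a matching "unknown" member with surrogate key 0 for the foreign key to resolve. A minimal sketch, assuming a DimSalesperson dimension with an identity surrogate key (all names here are illustrative, not taken from your schema):

SET IDENTITY_INSERT dbo.DimSalesperson ON;

-- the "I don't know what this is" row the default constraint points at
INSERT INTO dbo.DimSalesperson (Salesperson_SK, SalespersonID, Name, Location)
VALUES (0, 'N/A', 'Unknown', 'Unknown');

SET IDENTITY_INSERT dbo.DimSalesperson OFF;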
If you do not wish to put a default value into the table, simply create the column and allow NULL values, either by checking the box on the design page or by issuing the following command:
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NULL
This answer has been given based on what I think your problem is: let me know if it helps.

Inner join the dimensions with the fact table's source data, get the surrogate key values from the dimensions and insert them into the fact...
or else go the factless fact table route
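A set-based sketch of the first suggestion (every name here is an assumption, including a degenerate SalesID column on the fact table linking back to the source sales rows):

-- populate the new surrogate key by looking the salesperson up in the source
UPDATE f
SET    f.Salesperson_SK = d.Salesperson_SK
FROM   dbo.FactMyFact     AS f
JOIN   dbo.Sales          AS s ON s.SalesID = f.SalesID
JOIN   dbo.DimSalesperson AS d ON d.SalespersonID = s.SalespersonID;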

Related

SQL Server 2008 - Database Design Query

I have to load the data shown in the image below into my database.
For a particular row, either field PartID will be NULL or field GroupID will be NULL, and the other available columns refer to the NON-NULL entity. I have the following three options:
To use one database table, which will have one unified column say ID, which will have PartID and GroupID data. But, in this case I won't be able to apply foreign key constraint, as this column will be containing both entities' data.
To use one database table, which will have columns for both PartID and GroupID, which will contain the respective data. For each row, one of them will be NULL, But in this case I will be able to apply foreign key constraint.
To use two database tables, which will have similar structure, the only difference will be the column PartID and GroupID. In this case I will be able to apply foreign key constraint.
One thing to note here is that the table(s) will be used in import processes to import about 30,000 rows in one go and will also be heavily used in data retrieval operations. Also, the other columns will be used as pivot columns.
Can someone please suggest what the best approach would be to achieve this?
I would use option 2 and add a constraint that only one can be non-null and the other must be null (just to be safe). I would not use option 1 because of the lack of a FK and the possibility of linking to the wrong table when not obeying the type identifier in the join.
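A minimal sketch of that constraint, with illustrative table and column names around the PartID/GroupID pair from the question:

CREATE TABLE dbo.ItemData (
    ItemDataID INT IDENTITY(1,1) PRIMARY KEY,
    PartID     INT NULL REFERENCES dbo.Parts(PartID),
    GroupID    INT NULL REFERENCES dbo.Groups(GroupID),
    -- exactly one of the two keys must be populated
    CONSTRAINT CK_ItemData_ExactlyOneKey CHECK (
        (PartID IS NOT NULL AND GroupID IS NULL) OR
        (PartID IS NULL AND GroupID IS NOT NULL))
);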
There is a 4th option, which is to normalize them as "items" with another (surrogate) key and two link tables which link items to either parts or groups. This eliminates NULLs. There are further problems with that approach (items might be in both again or neither without any simple constraint), so unless that is necessary for other reasons, I wouldn't generally go down that path.
Option 3 could be fine - it really depends if these rows are a relation - i.e. data associated with a primary key. That's one huge problem I see with the data presented, the lack of a candidate key - I think you need to address that first.
IMO option 2 is the best - it's not perfectly normalized but will be the easiest to work with. 30K rows is not a lot of rows to import.
I would modify the table so it has one ID column and then add an IDType column that is either "G" for Group or "P" for Part.

How to uniquely identify rows in a table without a primary key

I'm importing more than 600,000,000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to resume the process from where it stopped for any reason, like an error or network error. As this table doesn't have a primary key, I can't check if a row was already imported or not. Does anyone know how to identify each row so I can check whether it was already imported? This table has duplicate rows; I already tried to compute a hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
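A sketch of that de-duplication step, assuming the staging table was given an identity column (StageID) on load and that Col1..Col3 stand in for the real data columns:

WITH Ranked AS (
    SELECT StageID,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY StageID) AS rn
    FROM dbo.Staging
)
DELETE FROM Ranked
WHERE rn > 1;  -- keeps exactly one copy of each duplicate group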
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunks you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on when interrupted, and resume work with that chunk.
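The bookkeeping can be as small as a tracking table you create yourself (names illustrative): record each chunk inside the same transaction that commits its rows, and on restart skip every chunk already recorded.

CREATE TABLE dbo.LoadedChunks (
    ChunkNo     INT      NOT NULL PRIMARY KEY,
    CompletedAt DATETIME NOT NULL DEFAULT GETDATE()
);

-- after chunk 42's rows commit, in the same transaction:
INSERT INTO dbo.LoadedChunks (ChunkNo) VALUES (42);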
With duplicate rows, even row_number() is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new identity column onto the existing table (alter table oldTbl add NewId int identity(1,1)).
You could use row_number() and then back out the last n rows if they have more than the count in the new database for them, but it would be more straightforward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat-unique field combination (duplicates are allowed) and join over a hash of the rest of the fields, which you store in the destination table.
Assume the following tables:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))

-- source rows that are not yet present in the destination
select * from t_x x
where not exists(select *
                 from t_y y
                 where x.id = y.id
                   and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to try to join on as many fields as possible is to reduce the chance of hash collisions, which are a real possibility on a dataset with 600,000,000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, you should do the following steps (sketched after the list):
Alter the table and add a uniqueidentifier or int field.
Update the table with the newsequentialid() function or a row_number().
Create an index on this field.
Add the id field to your destination table.
Once all the data is moved over, the field can be dropped.
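A sketch of the int variant (table name illustrative): adding an int identity column populates every existing row as part of the ALTER, so the separate update step falls away. Be warned that on 600,000,000 rows both statements will take a while.

ALTER TABLE dbo.BigTable ADD RowId INT IDENTITY(1,1) NOT NULL;

-- makes the "was this row already imported?" lookup cheap
CREATE UNIQUE INDEX IX_BigTable_RowId ON dbo.BigTable (RowId);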

SQL Server Add Column

I want to add a column to one of my tables in SQL Server. I don't want it to be at the end of the column listing in the table... I actually want it to be somewhere else (location-wise) in the table. Is there another option besides dropping and rebuilding (repopulating) the table to accomplish this? I obviously don't want to lose any of my data, but I would prefer that the column not have to sit at the end of the table definition.
Thanks,
S
Column order in the table is irrelevant -- it's purely cosmetic.
There is no extension to ALTER TABLE that allows you to specify the ordinal position of a new column (either for adding a new column or moving an existing column).
For more on the subject, see:
Change Order of Column In Database Tables
Short answer:
I agree with OMG Ponies, column order isn't important. If you don't have a clustered index, rather drop and recreate the table than run an ALTER TABLE x ADD col.
Long answer:
If your table has a fair bit of data (50 MB comes to mind) then you will be better off recreating the table rather than running ALTER TABLE x ADD col. The data page allocation plan for the table is calculated at table creation time, so when you add a column, SQL Server will typically put your new column's data in separate pages and put forward pointers from your existing data pages to the new data pages for the column you added. If you're going to use the new column extensively then your table IO will be quite poor, since reading even 1 row will require reading at least 2 pages. Table scans will also perform poorly, since forward pointers will always be followed, causing table scans that are normally sequential to jump back and forth on your disk during a read.
In this case it's better to rename the existing table, recreate your table with the new column, insert into table_name select col1, col2, 'null or default for new col', col3 from temp_renamed_table, and finally drop the old table that you renamed. The data pages will be much better organised and your IO will be faster, despite looking the same from a SQL developer's point of view as when ALTER TABLE is used. If you have a clustered index, the table will be reorganised when you add the column and page splits are less likely. You could also run ALTER TABLE x REBUILD if you have SQL Server 2008, don't have a clustered index, and lots of time when users aren't using your table. It's hard to comment on your indexing strategy without knowing much more.
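A sketch of that rename-and-rebuild sequence (table and column names illustrative; any indexes, constraints and FKs would need to be recreated as well):

EXEC sp_rename 'dbo.MyTable', 'MyTable_Old';

CREATE TABLE dbo.MyTable (
    Col1   INT         NOT NULL,
    Col2   VARCHAR(50) NULL,
    NewCol INT         NULL,  -- the added column, wherever you want it
    Col3   DATETIME    NULL
);

INSERT INTO dbo.MyTable (Col1, Col2, NewCol, Col3)
SELECT Col1, Col2, NULL, Col3  -- NULL or a default for the new column
FROM   dbo.MyTable_Old;

DROP TABLE dbo.MyTable_Old;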
This is a much better reason for recreating the table than something cosmetic like column order.
It is an extremely bad practice to rearrange columns in a table. Do not even consider trying to do such a thing. The column order is irrelevant if you have used correct coding practices (such as never, and I do mean never, using select *).
If you have used select * and you change the order of the columns, you are even more at risk of breaking code, because the query may not be expecting Price as the third column but as the second column, and that could seriously mess up a lot of things.
Further, the only way to do this is to create another table, move your data, and then drop the old table and rename the first one. Of course, if you have FKs, they too have to be dropped and recreated. This takes a lot of time if you have a large data set and could cause problems for users.
There is no circumstance where you would ever consider doing this for a table that is in production, as it is just too risky. If you are in the early stages of design, you could consider doing it.
ALTER TABLE my_table ADD COLUMN column_name VARCHAR(50) AFTER col_name;
substituting whatever def you want for VARCHAR(50)
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
Edit: This is, of course, the right answer for a MySQL server... but this is not what the OP wants.

How to prevent updating duplicate rows in SQLite Database?

I'm inserting new rows into a SQLite table, but I don't want to insert duplicate rows.
I also don't want to specify every column in the database if possible.
I don't even know if this is possible.
I should be able to take my values and create a new row with them, but if they duplicate another row they should either overwrite the existing row or do nothing.
This is one of the very first steps in database design and normalization. You have to be able to explicitly define what you mean by a duplicate row, and then place a primary key constraint, (or a unique constraint), on the columns in your table that represent that definition.
Before you can define what duplicate means, you have to define (or decide) exactly what the table is to contain, i.e., what real-world business domain entity or abstraction each row in the table represents, or will hold data for...
Once you have done this, the PK or unique constraint will stop you from inserting duplicate rows... The same PK will help you find the duplicate row when it does exist, and update it with the values of the non-duplicate-defining (non-PK) columns that are different from the values in the existing duplicate row. Only after all this has been done can an insert or replace (as defined by SQLite) process help. This command checks whether a duplicate row (as defined by your PK constraint) exists, and if it does, instead of inserting a new row, it updates the non-PK columns in that row with the values supplied by your replace query.
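A minimal SQLite sketch of that sequence, with illustrative names; the unique constraint is the formal definition of "duplicate", and insert or replace then handles the conflict:

CREATE TABLE contact (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,  -- our definition of a duplicate
    name  TEXT
);

INSERT OR REPLACE INTO contact (email, name)
VALUES ('a@example.com', 'Alice');        -- no conflict: inserts

INSERT OR REPLACE INTO contact (email, name)
VALUES ('a@example.com', 'Alice Smith');  -- conflict: replaces the row

Note that insert or replace actually deletes the conflicting row and inserts a fresh one (so the id can change); insert or ignore is the "do nothing" variant.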
Your desires appear mutually contradictory. While Andrey's insert or replace answer will get you close to what you say you want, you should probably clarify for yourself what you really want.
If you don't want to specify every column, and you want a (presumably) partial row to update rather than insert, you should probably look at the unique constraint, and know that the ambiguity in your requirements was also faced by the SQL92 Committee.
http://www.sqlite.org/lang_insert.html
insert or replace might interest you

Foreign key referencing composite table

I've got a table structure I'm not really certain of how to create the best way.
Basically I have two tables, tblSystemItems and tblClientItems. I have a third table that has a column that references an 'Item'. The problem is, this column needs to reference either a system item or a client item - it does not matter which. System items have keys in the 1..2^31 range while client items have keys in the range -1..-2^31, thus there will never be any collisions.
Whenever I query the items, I'm doing it through a view that does a UNION ALL between the contents of the two tables.
Thus, optimally, I'd like to make a foreign key reference the result of the view, since the view will always be the union of the two tables - while still keeping IDs unique. But I can't do this as I can't reference a view.
Now, I can just drop the foreign key, and all is well. However, I'd really like to have some referential checking and cascading delete/set null functionality. Is there any way to do this, besides triggers?
sorry for the late answer, I've been struck with a serious case of weekenditis.
As for utilizing a third table to include PKs from both client and system tables - I don't like that as that just overly complicates synchronization and still requires my app to know of the third table.
Another issue that has arisen is that I have a third table that needs to reference an item - either system or client, it doesn't matter. Having the tables separated basically means I need to have two columns, a ClientItemID and a SystemItemID, each having a constraint for each of their tables with nullability - rather ugly.
I ended up choosing a different solution. The whole issue was with easily synchronizing new system items into the tables without messing with client items, avoiding collisions and so forth.
I ended up creating just a single table, Items. Items has a bit column named "SystemItem" that defines, well, the obvious. In my development / system database, I've got the PK as an int identity(1,1). After the table has been created in the client database, the identity key is changed to (-1,-1). That means client items go in the negative while system items go in the positive.
For synchronizations I basically ignore anything with (SystemItem = 1) while synchronizing the rest using SET IDENTITY_INSERT ON. Thus I'm able to synchronize while completely ignoring client items and avoiding collisions. I'm also able to reference just one "Items" table which covers both client and system items. The only thing to keep in mind is to fix the standard clustered key so it's descending, to avoid all kinds of page restructuring when the client inserts new items (client updates vs system updates are like 99%/1%).
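A sketch of the client-side table (columns beyond the ones described are illustrative):

CREATE TABLE dbo.Items (
    ItemID     INT IDENTITY(-1,-1) NOT NULL,  -- client rows go negative
    SystemItem BIT NOT NULL DEFAULT 0,
    Name       NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Items PRIMARY KEY CLUSTERED (ItemID DESC)  -- descending, per above
);

-- synchronizing a system item from the master database, keeping its positive key
SET IDENTITY_INSERT dbo.Items ON;
INSERT INTO dbo.Items (ItemID, SystemItem, Name)
VALUES (1, 1, N'Some system item');
SET IDENTITY_INSERT dbo.Items OFF;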
You can create a unique id (db-generated - sequence, autoinc, etc.) for the table that references items, and create two additional columns (tblSystemItemsFK and tblClientItemsFK) where you reference the system items and client items respectively - some databases allow you to have a foreign key that is nullable.
If you're using an ORM you can even easily distinguish client items and system items (this way you don't need negative identifiers to prevent ID overlap) based on column information only.
With a little more background/context it is probably easier to determine an optimal solution.
You probably need a table, say tblItems, that simply stores all the primary keys of the two tables. Inserting items would require two steps to ensure that when an item is entered into the tblSystemItems table, the PK is also entered into the tblItems table.
The third table then has a FK to tblItems. In a way tblItems is a parent of the other two items tables. To query for an Item it would be necessary to create a JOIN between tblItems, tblSystemItems and tblClientItems.
[EDIT-for comment below] If the tblSystemItems and tblClientItems control their own PK then you can still let them. You would probably insert into tblSystemItems first then insert into tblItems. When you implement an inheritance structure using a tool like Hibernate you end up with something like this.
Add a table called Items with a PK ItemId and a single column called ItemType = "System" or "Client", then have the ClientItems table PK (named ClientItemId) and the SystemItems PK (named SystemItemId) both also be FKs to Items.ItemId. (These relationships are zero-to-one (0..1) relationships.)
Then, in your third table that references an item, just have its FK constraint reference the ItemId in this extra (Items) table...
If you are using stored procedures to implement inserts, just have the stored proc that inserts items insert a new record into the Items table first, and then, using the auto-generated (identity) PK value that the system inserted into the Items table ItemId column, insert the actual data record into either SystemItems or ClientItems (depending on which it is) as part of the same stored proc call.
This is called "subclassing".
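A sketch of that supertype/subtype layout (all names illustrative):

CREATE TABLE dbo.Items (
    ItemID   INT IDENTITY(1,1) PRIMARY KEY,
    ItemType CHAR(6) NOT NULL CHECK (ItemType IN ('System', 'Client'))
);

CREATE TABLE dbo.SystemItems (
    SystemItemID INT PRIMARY KEY REFERENCES dbo.Items(ItemID),
    Name         NVARCHAR(100) NOT NULL  -- plus the system-item columns
);

CREATE TABLE dbo.ClientItems (
    ClientItemID INT PRIMARY KEY REFERENCES dbo.Items(ItemID),
    Name         NVARCHAR(100) NOT NULL  -- plus the client-item columns
);

-- the third table now needs only a single FK
CREATE TABLE dbo.ItemDetails (
    DetailID INT IDENTITY(1,1) PRIMARY KEY,
    ItemID   INT NOT NULL REFERENCES dbo.Items(ItemID)
);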
I've been puzzling over your table design. I'm not certain that it is right. I realise that the third table may just be providing detail information, but I can't help thinking that the primary key is actually the one in your ITEM table and the FOREIGN keys are the ones in your system and client item tables. You'd then just need to do right outer joins from Item to the system and client item tables, and all constraints would work fine.
I have a similar situation in a database I'm using. I have a "candidate key" on each table that I call EntityID. Then, if there's a table that needs to refer to items in more than one of the other tables, I use EntityID to refer to that row. I do have an Entity table to cross reference everything (so that EntityID is the primary key of the Entity table, and all other EntityID's are FKs), but I don't find myself using the Entity table very often.
