PostgreSQL - How do you create a "dimension" table from a 'select distinct' query and create a primary and foreign key?

I have a fact table with many entries, and it has a set of 'ship to' columns that are very closely related, but no one column is always unique. I would like to move these into a dimension table and reference the new dimension rows from the fact table using a key.
I can create the new dimension table with a CREATE TABLE AS SELECT DISTINCT, and I can add a row-number primary key to it, but I'm not sure how to put the matching foreign key into the fact table.
I could easily add a new foreign key column and fill it with a WHERE that matches the old rows in the fact table to the distinct rows in the dimension table, but there is no single column to match on (since there is no key yet). So do I need to write a WHERE that matches all of the columns together and then assigns the primary key from the dimension table?
I may just be lazy and not want to research how to write ALTER statements and complex WHERE matching, but this seems like a pretty common task in database management, so I feel like an answer might help others.

I guess one way to do it is to concatenate all of the values in all of the columns for each row of the new dimension to make a unique identifier from the data, then do the same thing for the columns in the fact table. Now there is a unique key between the two, and it can be converted to an integer id: add a new sequence column to the dimension, then create a new column in the fact table and set it to the dimension's integer id where the concatenated id is the same.
This is obviously very inefficient, as the whole contents of the new dimension have to be duplicated, and duplicated again in the fact table, just to create a link.
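For what it's worth, the backfill can be done without any concatenated key by joining on all the columns directly. A minimal PostgreSQL sketch, assuming hypothetical names fact_orders / dim_ship_to and three ship-to columns; IS NOT DISTINCT FROM is used instead of = so that NULLs match NULLs:
CREATE TABLE dim_ship_to AS
SELECT DISTINCT ship_name, ship_addr, ship_city FROM fact_orders;
ALTER TABLE dim_ship_to ADD COLUMN ship_to_id SERIAL PRIMARY KEY;
ALTER TABLE fact_orders ADD COLUMN ship_to_id INT;
UPDATE fact_orders f
SET ship_to_id = d.ship_to_id
FROM dim_ship_to d
WHERE f.ship_name IS NOT DISTINCT FROM d.ship_name
  AND f.ship_addr IS NOT DISTINCT FROM d.ship_addr
  AND f.ship_city IS NOT DISTINCT FROM d.ship_city;
ALTER TABLE fact_orders
    ADD CONSTRAINT fk_ship_to FOREIGN KEY (ship_to_id) REFERENCES dim_ship_to;
Once the backfill is verified, the original ship-to columns can be dropped from the fact table.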

Related

Normalization and primary keys

In a given table, if there is no primary key and it is even impossible to create a composite primary key, then what is the normal form of that table?
If it's zero (0NF), will adding a new column and making it the primary key convert this table to 1NF?
Normal forms apply to relations, which are mathematical structures. Tables can be used to represent relations, but this requires some rules to ensure that the table doesn't contain more or less information than the corresponding relation.
In order for a table to represent a relation:
all rows and columns must be unique
the order they're in mustn't matter
all significant information must be represented as values in cells (i.e. fonts, highlighting, etc, mustn't matter)
every cell must contain one value (doesn't matter how simple or complex that value is)
Also, the relational model cares about candidate keys, not primary keys. A relation can have multiple candidate keys. A primary key is just a selected candidate key that is used by some disciplines (e.g. the entity-relationship model) or by some database management systems (e.g. for physical record ordering).
With all that said, I can now answer your question. If your table follows the rules and specifically the rows are all unique, then there will be at least one candidate key, on all the columns together at worst. If your table's rows aren't unique, then the table doesn't represent a relation and the normal forms don't apply. A surrogate key (like an auto-increment column) can be added to identify rows uniquely, but that isn't necessarily sufficient on its own to make a table represent a relation (1NF).
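For instance, in PostgreSQL such a surrogate key can be bolted onto an existing table in one statement (table name hypothetical):
ALTER TABLE my_table ADD COLUMN row_id BIGSERIAL PRIMARY KEY;
Every existing row gets a distinct sequence value, so the rows become unique, even though that alone says nothing about the other rules above.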
BTW, I suggest you avoid using "0NF" or "UNF". Non-relational tables don't have a level of normalization, so attaching any kind of "NF" to them is misleading.
As long as you are talking about tables, there is one further case that needs to be covered: duplicate rows.
Duplicate rows are rows that are identical in appearance but not in row number. Such a table cannot have a primary key. Sometimes duplicate rows represent the same information; sometimes not.
For example, consider a table with just four columns: customerid, productid, quantity, price. If a customer orders the same product twice, we'll have two identical rows representing different information. This is not good.
Note that the corresponding thing cannot happen with relations. If two tuples in a relation have the same appearance, then they are the same tuple.
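To make the example concrete (names made up), here is a PostgreSQL sketch:
CREATE TABLE order_lines (customerid INT, productid INT, quantity INT, price NUMERIC);
INSERT INTO order_lines VALUES (1, 42, 1, 9.99);
INSERT INTO order_lines VALUES (1, 42, 1, 9.99); -- the same product ordered twice
-- The two rows are indistinguishable, so no primary key is possible as-is.
-- A surrogate line number restores uniqueness:
ALTER TABLE order_lines ADD COLUMN line_id BIGSERIAL PRIMARY KEY;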
As to the other points, they are covered by excellent earlier answers.
Before you check for normalization, your table must have a primary key (the primary key plays a lead role in a relational DB).
1NF says that all of your table's attributes must be single-valued.
Answer to question 1 (in a given table, if there is no primary key and it is even impossible to create a composite primary key, then what is the normal form of that table?):
If there is no primary key in the relation, and it is impossible to create a composite primary key (as I read your question, even combining all the columns of a row into a candidate key will not identify rows uniquely, because duplicate rows are there), then it is not in any normal form.
Answer to question 2:
If you add some column having unique values in it, and every cell contains only one value, then it is in 1NF.
If you still need some clarification, you can ask in the comment box.
0NF is not any form of normalization; refer to C.J. Date or Henry Korth (database management system books).
Hope this helps.

Create primary key in SQL Server

I have a table in SQL Server with duplicate IDs, but I cannot delete those duplicate records. The requirement is now to create the primary key on the column that has the duplicate data. Is there any way to create the primary key without changing the data?
No, there is no way to add a PRIMARY KEY constraint to a column that already has duplicate values.
Creating and Modifying PRIMARY KEY Constraints:
When a PRIMARY KEY constraint is added to an existing column or columns in the table, the Database Engine examines the existing column data and metadata to make sure that the following rules for primary keys are met:
The columns cannot allow for null values.
There can be no duplicate values.
If a PRIMARY KEY constraint is added to a column that has duplicate values or allows for null values, the Database Engine returns an error and does not add the constraint.
If the ID column is incremental, a possible workaround is to add a unique filtered index:
CREATE UNIQUE INDEX AK_MyUniqueIndex ON dbo.MyTable (ID)
WHERE ID > ... -- max value of existing ID here
This way, uniqueness will be enforced only for newly added records.
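If you want that boundary filled in automatically, one possibility (a sketch, reusing dbo.MyTable from above) is dynamic SQL, since the predicate of a filtered index can't reference a variable:
DECLARE @sql NVARCHAR(MAX) =
    N'CREATE UNIQUE INDEX AK_MyUniqueIndex ON dbo.MyTable (ID) WHERE ID > '
    + CAST((SELECT MAX(ID) FROM dbo.MyTable) AS NVARCHAR(20)) + N';';
EXEC sys.sp_executesql @sql; -- builds the index with the current MAX(ID) baked in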
I know this is old, but I had an idea that I wanted to share:
Step 1: Add a non-nullable int column with a default value of 0.
Optional step: Update that column to 1 in the existing records, so you are able to identify them afterwards.
Step 2: In all existing rows that have duplicates, update the new column with a standard ROW_NUMBER() over a combination of unique columns, or over all columns.
Step 3: Define the primary key with your ID column first (so it is indexed first), then the column from step 1. (All three steps are sketched below.)
You are done, and you have a special column that helps identify the duplicates easily, while new records will all be marked with 0. Best practice, though, would be to add a character or number to all IDs if possible and standardize them (this approach helps to do that afterwards), or to use something like a by-year sequence, etc.
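A T-SQL sketch of those steps, assuming a hypothetical dbo.MyTable whose ID column contains duplicates and is already NOT NULL (step 2 is simplified to number every existing row, not only the duplicated ones):
-- Step 1: add the discriminator column; future rows default to 0
ALTER TABLE dbo.MyTable
    ADD Dup INT NOT NULL CONSTRAINT DF_MyTable_Dup DEFAULT 0;
GO
-- Step 2: number the rows within each ID so every (ID, Dup) pair is unique
WITH numbered AS (
    SELECT Dup, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable
)
UPDATE numbered SET Dup = rn;
GO
-- Step 3: composite primary key, ID first so it leads the index
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY (ID, Dup);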

Adding new dimensions to data warehouse (adding new columns to fact table)

I am building an OLAP database and am running into some difficulty. I have already setup a fact table that includes columns for sales data, like quantity, sales, cost, profit, etc. The current dimensions I have are Date, Location, and Product. This means I have the foreign key columns for these dimension tables included in the fact table as well. I have loaded the fact table with this data.
I am now trying to add a dimension for salesperson. I have created the dimension, which has the salesperson's ID and their name and location. However, I can't edit the fact table to add the new column that will act as a foreign key to the salesperson dimension.
I want to use SSIS to do this, by using a lookup against the sales database the fact table is based on, keyed on the salesperson ID, but I first need to add the Salesperson column to my fact table. When I try to do that, I get an error saying that it can't create the new column because it will be populated with NULLs.
I'm going to take a guess as to the problem you're having, but this is just a guess: your question is a little difficult to understand.
I'm going to make the assumption that you have created a Fact table with x columns, including links to the Date, Location, and Product dimensions. You have then loaded that fact table with data.
You are now trying to add a new column, SalesPerson_SK (or ID), to that table. You do not wish to allow NULL values in the database, so you clear the 'allow NULL' checkbox. However, when you attempt to save your work, the table errors out with the objection that it cannot insert NULL into the SalesPerson_SK column.
There are a few ways around this limitation. One, which is probably the best if you are still in the development stage, is to issue the following command:
TRUNCATE TABLE dbo.FactMyFact
which will remove all data from the table, allowing you to make your changes and reload the table with the new column included.
If, for some reason, you cannot do so, you can alter the table to add the column with a default constraint that puts a default value into your fact table - essentially pointing every existing row at a dummy dimension record that says, "I don't know what this is":
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NOT NULL
CONSTRAINT DF_FactMyFact_SalesPersonSK DEFAULT 0
If you do not wish to put a default value into the table, simply create the column and allow NULL values, either by checking the box on the design page or by issuing the following command:
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NULL
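Whichever variant you choose, the new key still has to be back-filled. If the fact rows kept a natural key from the source sales system, the back-fill is a plain UPDATE ... FROM; a sketch, where SourceSales, SalesOrderID, and DimSalesperson are hypothetical names:
UPDATE f
SET f.Salesperson_SK = d.Salesperson_SK
FROM dbo.FactMyFact AS f
JOIN dbo.SourceSales AS s ON s.SalesOrderID = f.SalesOrderID
JOIN dbo.DimSalesperson AS d ON d.SalespersonID = s.SalespersonID;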
This answer has been given based on what I think your problem is: let me know if it helps.
Inner join the dimension with the fact table, pull the values from the dimensions, and insert them into the fact...
or else build it the factless fact table way.

SQL Server creating multiple nonclustered indexes for one column vs having multiple columns in just one index

Suppose I have the following table
UserID (Identity) PK
UserName - unique non null
UserEmail - unique non null
What is recommended for the best performance?
creating non-clustered indexes for UserName and UserEmail separately
OR
just one index including both columns
Please do share your thoughts on why one is preferable over the other.
Another important point to consider is this: a compound index (made up of multiple columns) will only be used if the n left-most columns are being referenced (e.g. in a WHERE clause).
So if you have a single compound index on
(UserID, UserName, UserEmail)
then this index might be used in the following scenarios:
when you're searching for UserID alone (using just the 1 left-most column - UserID)
when you're searching for UserID and UserName (using the 2 left-most columns)
when you're searching for all three columns
But this single compound index will never be usable for searches on
just UserName - it's the second column in the index, so this index cannot be used for that search
just UserEmail - it's the third column in the index, so this index cannot be used for that search either
Just remember this: just because a column is part of an index doesn't necessarily mean that searching on that single column alone will be supported and sped up by that index!
So if your usage patterns and your application really need to search on UserName and/or UserEmail alone (without providing other search values), then you must create separate indices on these columns - just a single compound one will not have any benefit at all.
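To illustrate with a sketch (table name and query shapes hypothetical):
CREATE NONCLUSTERED INDEX IX_Users_Compound
    ON dbo.Users (UserID, UserName, UserEmail);
-- Can seek on the index (a left-most prefix is present):
SELECT * FROM dbo.Users WHERE UserID = 42;
SELECT * FROM dbo.Users WHERE UserID = 42 AND UserName = N'alice';
-- Cannot seek on the index (no left-most column in the filter):
SELECT * FROM dbo.Users WHERE UserName = N'alice';
SELECT * FROM dbo.Users WHERE UserEmail = N'alice@example.com';
-- Stand-alone searches on those columns need their own indices:
CREATE NONCLUSTERED INDEX IX_Users_UserName ON dbo.Users (UserName);
CREATE NONCLUSTERED INDEX IX_Users_UserEmail ON dbo.Users (UserEmail);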
The best way to define indexes depends entirely on how you will use the table. There is no sensible way of choosing indexes just by looking at the table definition.
If your code searches your table by username, or joins it with another table through username, then it would be wise to define an index on that column. If your code joins the table with another table using two columns (username and usermail), then it would be wise to define an index on those two columns. Since all your columns are defined to be unique, I hardly believe this will be the case, so you will not need multiple-column indexes on that table.
One additional point on multiple-column indexes: they are also used for filters that fit the index partially, with conditions.
Example:
If you define a two-column index on username and usermail (in that order), you will gain performance in searches that filter on both columns (username and usermail). With that index you will also gain performance in filters that use username only, because that is the first column of the index, but not in searches on usermail alone, because the second column of an index cannot be used by itself.
The rule is: an index can be used for filtering when the filtered columns match the index columns exactly, or when they match a leading prefix of the columns in the index definition.
Please do share your thoughts on why one is preferable over the other.
It depends on what you do.
See, an index is only used "left to right". So an index on UserID, UserName is useless if I filter by UserName ONLY.
Generally, I would assume three indices here:
Unique index, clustered, on UserID, as primary key.
Unique index on UserName, non-clustered.
Unique index on UserEmail, non-clustered.
The reason is not performance at all, but:
You will need the first as the primary key for foreign key relationships.
You need the other two to handle the unique constraints properly - there is no way to do that without indices.
In addition, you need the flexibility to seek by UserName AND by UserEmail, which means it is not possible to only combine them into one index.
Performance really enters last here - for performance reasons all of these indices may contain all additional fields (not as part of the index key but as included columns). But really, there is no other sensible way to make this table work unless you allow multiple registrations for the same user.
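A sketch of that three-index layout (table definition hypothetical):
CREATE TABLE dbo.Users (
    UserID    INT IDENTITY NOT NULL,
    UserName  NVARCHAR(100) NOT NULL,
    UserEmail NVARCHAR(256) NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserID)
);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Users_UserName ON dbo.Users (UserName);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Users_UserEmail ON dbo.Users (UserEmail);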

SQL Server 2008 - Database Design Query

I have to load the data shown in the below image into my database.
For a particular row, either field PartID or field GroupID will be NULL, and the other available columns refer to the non-NULL entity. I have the following three options:
Use one database table with a single unified column, say ID, holding both PartID and GroupID data. In this case I won't be able to apply a foreign key constraint, as this column will contain both entities' data.
Use one database table with columns for both PartID and GroupID, each containing the respective data. For each row, one of them will be NULL, but in this case I will be able to apply foreign key constraints.
Use two database tables with a similar structure, the only difference being the column PartID vs. GroupID. In this case I will be able to apply foreign key constraints.
One thing to note here is that the table(s) will be used in import processes that load about 30000 rows in one go, and will also be heavily used in data retrieval operations. The other columns will also be used as pivot columns.
Can someone please suggest the best approach to achieve this?
I would use option 2 and add a constraint that only one can be non-null and the other must be null (just to be safe). I would not use option 1 because of the lack of a FK and the possibility of linking to the wrong table when not obeying the type identifier in the join.
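For what it's worth, that "exactly one is non-null" rule from option 2 can be enforced with a simple CHECK constraint - a sketch with hypothetical table/column names:
ALTER TABLE dbo.PartGroupFact ADD CONSTRAINT CK_PartXorGroup CHECK (
    (PartID IS NOT NULL AND GroupID IS NULL) OR
    (PartID IS NULL AND GroupID IS NOT NULL)
);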
There is a 4th option, which is to normalize them as "items" with another (surrogate) key and two link tables which link items to either parts or groups. This eliminates NULLs. There are further problems with that approach (items might be in both again or neither without any simple constraint), so unless that is necessary for other reasons, I wouldn't generally go down that path.
Option 3 could be fine - it really depends if these rows are a relation - i.e. data associated with a primary key. That's one huge problem I see with the data presented, the lack of a candidate key - I think you need to address that first.
IMO option 2 is the best - it's not perfectly normalized but will be the easiest to work with. 30K rows is not a lot of rows to import.
I would modify the table so it has one ID column and then add an IDType that is either "G" for Group or "P" for Part.
