I have a dimension in my DW in which 4 columns together form the business key, so there is no single BK. To insert into the fact table (staging area) I need a FK, so should the fact table carry the combination of these 4 columns (4 FKs) for this specific dimension, or should I create an "ID int identity(1,1)" column in the source table? Note: this is the staging area I am creating in SSIS, so I assume surrogate keys should only be introduced in the DW.
Related
I'm taking data from an Excel file into a staging table using SSIS, and I have 8 different tables. From the staging table I'm inserting data into those tables.
Now I have one table that consists of some data from the staging table, plus I need to insert the foreign keys of all the other tables into it (it acts like a fact table).
I'm able to populate data into all the tables, but I'm not able to populate the foreign keys of the other tables into the fact table.
How do I get the foreign keys and insert them?
I'm expecting to insert the foreign keys into the fact table.
Populate the dimension tables first so that foreign keys will be generated, and then join to them on the key-values while populating the fact table to get the foreign keys.
For example if you are importing an employee named John Smith from your Excel source, first insert John Smith into the Employee table.
Then when you are inserting John Smith into the fact table, join to the Employee table on EmployeeName='John Smith' to get his EmployeeID to insert into the fact table.
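As a rough illustration (the table and column names below are placeholders, not from the original question), the fact-table load would join the staging rows to the dimension on the business key:

-- Look up the generated surrogate key from the Employee dimension
-- while inserting staging rows into the fact table.
insert into FactSales (EmployeeID, SaleDate, Amount)
select e.EmployeeID, s.SaleDate, s.Amount
from StagingSales s
join Employee e on e.EmployeeName = s.EmployeeName;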
I am using SQL Server 2012 and am creating a table that will have 8 columns, with the types below:
datetime
varchar(12)
varchar(6)
varchar(100)
float
float
int
datetime
Once a day (normally) there will be an upload of approximately 10,000 rows of data. Going forward it's possible it could be 100,000.
The rows will be unique if I group on the first three columns listed above. I have read that I can put a unique constraint on multiple columns, which will guarantee the rows are unique.
I think I'm correct in saying that a unique constraint by default sets up a non-clustered index. Would a clustered index be better, and can I assume that once the table contains millions of rows this won't cause any issues?
My last question: am I right to say that with the unique constraint applied, querying the data will be quicker than without it (because of the non-clustered or clustered index), and that uploading the data will be slower (which is fine)?
A unique index can be non-clustered.
A primary key is unique and can be clustered.
A clustered index is not unique by default.
A unique clustered index is unique :)
More information is available in this guide.
So, we should separate uniqueness from index keys.
If you need to keep data unique by some column, create a unique constraint (a unique index). You'll protect your data.
You can also create a primary key (PK) on those columns; they will be unique as well. But there is a difference: all other indexes use the PK for referencing, so the PK must be as short as possible. So my advice is to create an identity column (int or bigint) and put the PK on it, and create a unique index on your unique columns.
Querying data may become faster if you query on your unique columns; if you query on other columns, you need to create other, specific indexes.
So: unique keys for data consistency, indexes for queries.
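A minimal sketch of that advice against the 8-column table from the question; the table and column names here are placeholders, not from the original post:

-- Surrogate identity PK plus a unique constraint on the three natural-key columns.
create table dbo.DailyUpload (
    UploadID bigint identity(1,1) not null,
    UploadDate datetime not null,
    Code varchar(12) not null,
    SubCode varchar(6) not null,
    Description varchar(100) null,
    Value1 float null,
    Value2 float null,
    Quantity int null,
    LoadedAt datetime not null,
    constraint PK_DailyUpload primary key clustered (UploadID),
    constraint UQ_DailyUpload_NaturalKey unique (UploadDate, Code, SubCode)
);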
I think I'm correct in saying that a unique constraint by default sets up a non-clustered index.
TRUE
Would a clustered index be better, and can I assume that once the table contains millions of rows this won't cause any issues?
(1) If you need to make (datetime, varchar(12), varchar(6)) unique, and
(2) if your application will access rows using datetime, or datetime + varchar(12), or datetime + varchar(12) + varchar(6) in the WHERE condition all the time,
then put the primary key on (datetime, varchar(12), varchar(6)).
By default this gives you uniqueness and a clustered index on all three columns.
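For contrast, a minimal sketch of this composite-key option, reusing the placeholder names from the earlier sketch:

-- Composite primary key on the three natural-key columns;
-- SQL Server makes a primary key clustered by default.
create table dbo.DailyUpload (
    UploadDate datetime not null,
    Code varchar(12) not null,
    SubCode varchar(6) not null,
    -- remaining columns as listed in the question...
    constraint PK_DailyUpload primary key (UploadDate, Code, SubCode)
);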
But as you commented above:
the queries will vary to be honest. I imagine most queries will make use of the first datetime column
Since you will deal with huge data and might join this table with other tables, it is better to have a surrogate key (an ever-increasing unique identifier) in the table and, to satisfy your SELECTs, non-clustered indexes.
Surrogate Key vs Business Key
NON-CLUSTERED INDEX
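A rough sketch of the surrogate-key plus non-clustered index approach described above, again with placeholder names:

-- Ever-increasing surrogate key as the clustered primary key,
-- plus a non-clustered index on the datetime column most queries filter on.
create table dbo.DailyUpload (
    UploadID bigint identity(1,1) not null primary key clustered,
    UploadDate datetime not null,
    Code varchar(12) not null,
    SubCode varchar(6) not null
    -- remaining columns as listed in the question...
);

create nonclustered index IX_DailyUpload_UploadDate
    on dbo.DailyUpload (UploadDate);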
I am trying to model the following in a Postgres DB.
I have N 'datasets'. These datasets are things like survey results, national statistics, aggregated data, etc. They each have a name, a source institution, a method, etc. This is the metadata of a dataset, and I have tables created for this and tables for codifying the research methods etc. The 'root' metadata table is called 'Datasets'. Each row represents one dataset.
I then need to store and access the actual data associated with this dataset. So I need to create a table that contains that data. How do I represent the relationship between this table and its corresponding row in the 'Datasets' table?
An example:
'hea' is a set of survey responses. It is unaggregated, so each row is one survey response. I create a table called 'HeaData' that contains this data.
'cso' is a set of aggregated employment data. Each row is an economic sector. I create a table called 'CsoData' that contains this data.
I create a row for each of these in the 'Datasets' table with the relevant metadata, and they have ids of 1 and 2 respectively.
What is the best way to relate 1 to the HeaData table and 2 to the CsoData table?
I will eventually be accessing this data with Scala Slick, so if the database design could just 'plug and play' with Slick that would be ideal.
Add a column to the Datasets table which designates which type of dataset it represents. Then a 1 may mean HEA and 2 may mean CSO. A check constraint would limit the field to one of the two values. If new types of datasets are added later, the only change needed is to change the constraint. If it is defined as a foreign key to a "type of dataset" table, you just need to add the new type of dataset there.
Form a unique index on the PK and the new field.
Add the same field to each of the subtables, but with a check constraint that limits the value in the HEA table to only the HEA value and in the CSO table to only the CSO value. Then make the ID field together with the new field the FK back to the Datasets table.
This limits the ID value to only one of the subtables and it must be the one defined in the Datasets table. That is, if you define a HEA dataset entry with an ID value of 1000 and the HEA type value, the only subtable that can contain an ID value of 1000 is the HEA table.
create table Datasets(
    ID int identity, -- or an auto-generated/serial column, depending on the DBMS
    DSType char( 3 ) check( DSType in( 'HEA', 'CSO' ) ),
    [everything else],
    constraint PK_Datasets primary key( ID ),
    constraint UQ_Datasets_Type unique( ID, DSType ) -- needed for references
);

create table HEA(
    ID int not null,
    DSType char( 3 ) check( DSType = 'HEA' ), -- making this a constant value
    [other HEA data],
    constraint PK_HEA primary key( ID ),
    constraint FK_HEA_Datasets_PK foreign key( ID )
        references Datasets( ID ),
    constraint FK_HEA_Datasets_Type foreign key( ID, DSType )
        references Datasets( ID, DSType )
);
The same idea with the CSO subtable.
I would recommend an HEA and CSO view that would show the complete dataset rows, metadata and type-specific data, joined together. With triggers on those views, they can be the DML points for the application code. Then the apps don't have to keep track of how that data is laid out in the database, making it a lot easier to make improvements should the opportunity present itself.
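As a rough illustration of that view idea (the non-key column names here are made up, and the triggers that would make the view updatable are omitted):

-- Joins the shared metadata row to its HEA-specific row.
create view HEA_Complete as
select d.ID, d.DSType, d.Name, d.SourceInstitution,
       h.RespondentAge, h.ResponseText
from Datasets d
join HEA h on h.ID = d.ID;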
Old Version
I have a Person table and a Company table.
Both tables have an Id column (identity).
The Company table has Ids 1 to 165.
The Person table has Ids 1 to 2029.
New Version
In the new version of the system, an Entity table was created.
This table contains the records of both Companies and People.
The Company and Person tables will be kept, referring to the Entity table.
The Id in the Entity table will be the same as the Id in the Company or Person table.
Question
Both tables have multiple relationships with other tables.
Table Entity (as well as others) has a column ID (identity).
The problem is that the Ids are repeated when the two tables are merged (as was to be expected).
How do I import the data without losing relationships?
Attempts
I thought of changing the Ids in the Company table so they start from 2030.
That way the Ids would not be duplicated when the two tables are merged.
But this raises other questions.
How do I do this without losing existing relationships?
How do I change the Id of a row and have the change reflected in all tables to which it relates?
I would like to do this using only DDL (SQL Server).
I thought of changing the Ids in the Company table so they start from 2030. That way the Ids would not be duplicated when the two tables are merged.
Create foreign key constraints from all the related tables to the Person table (or alter the existing foreign key constraints) with ON UPDATE CASCADE. Then update the Person table and change the values of the id column; these changes will cascade to the related tables.
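A minimal sketch of that approach; Orders here is a hypothetical table referencing Person, and it assumes the Id column is a plain int (an identity column cannot be updated directly in SQL Server):

-- Recreate the child table's foreign key so that key updates cascade.
alter table Orders drop constraint FK_Orders_Person;

alter table Orders add constraint FK_Orders_Person
    foreign key (PersonId) references Person(Id)
    on update cascade;

-- Shift the ids by an example offset; the new values cascade into Orders.PersonId.
update Person set Id = Id + 2029;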
To stop further problems, maybe change the identity columns in Person and Company to something like identity( 1000, 3 ) and identity (1001, 3) respectively.
However, I think the best idea is to have a different EntityID column in the Entity table, unrelated to PersonID and CompanyID. The Entity table would also have a column called AltEntityID or BusinessKey that contains the id from the other table and that does not have a unique constraint or a foreign key constraint.
Alternatively, make a small modification to your attempt: add a new column, say newId, to both Company and Person to manage the relation with Entity, and leave the id columns as they are. Why is this the simplest way? On the one hand, the new columns need not be identity columns; on the other hand, you can leave all the logic relating other tables to Company and Person intact.
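A minimal sketch of that newId variant:

-- Non-identity columns that point at the new Entity rows;
-- the existing Id columns and their relationships stay untouched.
alter table Company add newId int null
    constraint FK_Company_Entity references Entity(ID);

alter table Person add newId int null
    constraint FK_Person_Entity references Entity(ID);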
I am developing a system in which I have a table Employees with multiple columns related to employees. I have a column for the JobTitle and another column for Department.
In my current design, the JobTitle and Department columns form a compound foreign key in the Employees table, linked to the Groups table, which has a 2-column compound primary key (JobTitle and Department) plus an extra column for the job description.
I am not happy about this design because I think that linking 2 tables using 2 compound varchar columns is not good for performance, and I think it would be better to have an integer (autonumber) JobTitleID column used as the primary key in the Groups table and as a foreign key in the Employees table, instead of the textual JobTitle and Department columns.
But I had to do this because when I import the employees list (Excel) into my Employees table it can be mapped directly (JobTitle --> JobTitle and Department --> Department). Otherwise, if I were using an integer index as the primary key, I would have to manually replace the textual JobTitle column in the Excel sheet with a number based on the generated keys from the Groups table before importing.
Is it fine to keep my database design like this (textual compound primary key linked to a textual compound foreign key)? If not, and I use an integer column in the Groups table as the primary key and as a foreign key in the Employees table, how can I import the employees list from Excel directly into the Employees table?
Is it possible to import the list from Excel to SQL Server in a way that the textual JobTitle from the Excel sheet will be automatically translated to the corresponding JobTitleID from the Groups table? This would be the best solution; I could then add a JobTitleID column in the Groups table as a primary key and as a foreign key in the Employees table.
It sounds like you are trying to make the database table design fit the import of the Excel file, which is not such a good idea. Forget the Excel file and design your DB tables first with correct primary keys and relationships. This means either int, bigint or GUIDs for primary keys. This will keep you out of trouble, unless you absolutely know a natural key is unique, such as an SSN. Then when you import, populate the departments and job titles into their respective tables, creating their primary keys. Once they are populated, add those keys to the Excel file so it can be imported into the Employees table.
This is just an example of how I would solve this problem. It is not wrong to use multiple columns as the key, but it will definitely keep you out of harm's way if you stick with int, bigint or GUIDs for your primary keys.
Look at the answer in this post: how-to-use-bulk-insert...
I would create a simple stored procedure that imports your Excel data into a temporary, unrestricted STAGING table, then does the INSERT into your real table by joining to the corresponding tables to get the right foreign keys, and dumps the rows that failed to import into an IMPORT FAIL table. Just some thoughts...
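A rough sketch of that idea; StagingEmployees, ImportFailures and EmployeeName are made-up names, while Groups, JobTitle, Department and JobTitleID come from the question:

-- Rows whose JobTitle/Department pair matches a Groups row get the surrogate key.
insert into Employees (EmployeeName, JobTitleID)
select s.EmployeeName, g.JobTitleID
from StagingEmployees s
join Groups g
    on g.JobTitle = s.JobTitle
   and g.Department = s.Department;

-- Rows with no matching Groups entry go to a failure table for review.
insert into ImportFailures (EmployeeName, JobTitle, Department)
select s.EmployeeName, s.JobTitle, s.Department
from StagingEmployees s
left join Groups g
    on g.JobTitle = s.JobTitle
   and g.Department = s.Department
where g.JobTitleID is null;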