Table Structure for Multiple Histories - sql-server

I want to create a table to keep a history of the amendments as well as a history of the object itself.
For that I have created a two-column primary key (Id & UpdateDate).
I have three more date columns to maintain history, plus a Status column for the actual object history:
Status, StatusFrom, StatusTo, UpdateDate & NextUpdateDate
UpdateDate & NextUpdateDate are for maintaining the amendment history.
Is there any better way to maintain both the actual history of the record and the amendment history of the record?

You're creating what is known as an "audit table". There are many ways to do this; a couple of them are:
Create a table with appropriate key fields and before/after fields for all columns that you're interested in on the source table, along with a timestamp so you know when the change was made.
Create a table with appropriate key fields, a modification timestamp, a field name, and before/after columns.
Method (1) has the problem that you end up with a lot of fields in the audit table - basically two for every field in your source table. In addition, if only one or two fields on the source table change then most of the fields on the audit table will be NULL which may waste space (depending on your database). It also requires a lot of special-purpose code to figure out which field changed when you go back to process the audit table.
Method (2) has the problem that you end up creating a separate row in the table for each field that is changed on your source table, which can result in a lot of rows in the audit table (one row for each field changed). Because each field change results in a new row being written to the audit table you also have the same key values in multiple rows which can use up a bunch of space just for the keys.
Regardless of how the audit table is structured, it's usual to use a trigger to maintain it.
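For illustration, here is a minimal sketch of both layouts against a hypothetical source table keyed by Id with a Status column (all names below are placeholders, not part of the question's schema):

-- Method (1): a before/after pair for every audited column
CREATE TABLE dbo.MyObject_Audit (
    AuditId       INT IDENTITY(1,1) PRIMARY KEY,
    Id            INT         NOT NULL,            -- key of the source row
    ChangedAt     DATETIME2   NOT NULL DEFAULT SYSUTCDATETIME(),
    Status_Before VARCHAR(50) NULL,
    Status_After  VARCHAR(50) NULL
    -- ... two columns here for every field you want to audit
);

-- Method (2): one audit row per changed column
CREATE TABLE dbo.MyObject_AuditByField (
    AuditId     INT IDENTITY(1,1) PRIMARY KEY,
    Id          INT           NOT NULL,            -- key of the source row
    ChangedAt   DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    FieldName   SYSNAME       NOT NULL,            -- which column changed
    ValueBefore NVARCHAR(MAX) NULL,
    ValueAfter  NVARCHAR(MAX) NULL
);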
I hope this helps.

Related

SSIS - Insert records and update records with lookup transformation

I am new to SSIS and have been tasked with taking records from a source table and inserting new records or updating existing records in the target. I estimate in the region of 10-15 records per day at maximum.
I have investigated the Lookup Transformation object and this looks like it will do the job.
Both source and target table columns are identical. However, there is no unique key/value in target or source to perform the lookup on. The existing fields I have to work with and do the lookup on are Date or Load_Date...
How can I achieve only adding new records or updating existing records in the target without a key/ID field? Do I really need to add another ID column to compare source and target with? And if so, can someone tell me how to do this? The target table must not have a key/ID field, so if one were used, it would need to be dropped after any inserts/updates are done.
Could this be achieved using Load_Date? Currently all Load_Date values are NULL in the target table, so would it be as simple as matching on Load_Date: if Load_Date is already populated, do not load, and if Load_Date is NULL, load the record?
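Purely to illustrate that Load_Date idea in plain T-SQL (this is not the SSIS implementation; the table and column names here are made up):

-- Insert only source rows whose Load_Date is not already present in the target
INSERT INTO dbo.TargetTable (SomeColumn, Load_Date)
SELECT s.SomeColumn, s.Load_Date
FROM dbo.SourceTable AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.TargetTable AS t
                  WHERE t.Load_Date = s.Load_Date);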
Thanks.

Data Versioning/Auditing in SQL Database best patterns

I have a Job table where I post the job description, posted date, qualifications, etc., with the below schema:
Job(Id ##Identity PK, Description varchar (200), PostedOn DateTime, Skills Varchar(50))
Other attributes of jobs we would like to track, such as Department, Team, etc., will be stored in another table as attributes of the Job:
JobAttributesList(Id ##Identity PK, AttributeName varchar(50))
JobAttributes(JobID ##Identity PK, AttributeID FK REFERENCES JobAttributesList.Id, AttributeValue varchar(50))
Now if a job description has changed, we do not want to lose the old one, and hence we keep track of versioning. What are the best practices? We may have to scale later by adding more versioning tables.
A strategy would be to use a History table for each table we want to enable versioning on, but that would add more and more tables as we add versioning requirements, and it feels like schema duplication.
There is a difference between versioning and auditing. Versioning only requires that you keep the old versions of the data somewhere. Auditing typically requires that you also know who made a change.
If you want to keep the old versions in the database, do create an "old versions" table for each table you want to version, but don't create a new table for every different column change you want to audit.
I mean, you can create a new table for every column, whose only columns are audit_id, key, old_column_value, created_datetime, and it can save disk space if the original table is very wide, but it makes reconstructing the complete row for a given date and time extraordinarily expensive.
You could also keep the old data in the same table, and always do inserts, but over time that becomes a performance problem as your OLTP table gets way, way too big.
Just have a single history table with all the columns of the original table, which you always insert into from inside an UPDATE/DELETE trigger on the original table. You can tell which columns have changed either by adding a bit flag for every column, or by determining that at select time, comparing the data in one row with the data in the previously audited row for the given key.
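A minimal sketch of that single-history-table pattern, reusing the Job table from the question (the history table and trigger names are my own invention):

-- History table: same columns as Job, plus audit metadata
CREATE TABLE dbo.JobHistory (
    JobHistoryId INT IDENTITY(1,1) PRIMARY KEY,
    Id           INT          NOT NULL,
    Description  VARCHAR(200) NULL,
    PostedOn     DATETIME     NULL,
    Skills       VARCHAR(50)  NULL,
    AuditAction  CHAR(1)      NOT NULL,   -- 'U' = update, 'D' = delete
    AuditedAt    DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Copy the old version of each row whenever Job is updated or deleted
CREATE TRIGGER dbo.trg_Job_History
ON dbo.Job
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.JobHistory (Id, Description, PostedOn, Skills, AuditAction)
    SELECT d.Id, d.Description, d.PostedOn, d.Skills,
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END
    FROM deleted AS d;
END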
I would absolutely not recommend creating a trigger which concatenates all of the values cast to varchar and dumps it all into a single, universal audit table with an "audited_data" column. It will be slow to write, and impossible to usefully read.
If you want to use this for actual auditing, and not just versioning, then either the user making the change must be captured in the original table so it is available to the trigger, or you need people to be connecting with specific logins, in which case you can use transport information like original_login(), or you need to set a value like context_info or session_context on the client side.
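For instance, on SQL Server 2016 and later the application can stamp the user into the session and the trigger can read it back (the key name 'app_user' is just an example):

-- Set once per connection, from the application
EXEC sp_set_session_context @key = N'app_user', @value = N'jane.doe';

-- Inside the audit trigger, record who made the change
SELECT CAST(SESSION_CONTEXT(N'app_user') AS NVARCHAR(128)) AS AppUser,
       ORIGINAL_LOGIN()                                    AS SqlLogin;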

An extra column that explains when a value in the same row was updated

In my database table, I have a LastUpdated column that describes when the current row was last updated.
What the customer has now asked for is to have a few more DateTime columns in the table to keep track of individual values within the same row.
E.g. there's a column called Address and they would like to have an extra column AddressLastUpdated to know when it was last changed.
For some reason, this does not look like a good solution to me. It is certainly doable, but I am wondering if there's a better way of implementing this, since if we have this in place for one column, chances are they are going to want a LastUpdated column for every column in the table.
Keeping a bridge table with the below structure will help.
Structure :
Key Column of the table (e.g. Customer Key / Customer No)
Updated Column Name
Last Updated Date / DateTime
The above solution will help in two ways:
Keep the existing table structure intact.
All the future such requests can be easily managed.
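A minimal sketch of that bridge table, assuming a main table keyed by CustomerNo (all names are placeholders):

CREATE TABLE dbo.ColumnUpdateLog (
    CustomerNo    INT       NOT NULL,  -- key of the row in the main table
    UpdatedColumn SYSNAME   NOT NULL,  -- e.g. 'Address'
    LastUpdated   DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_ColumnUpdateLog PRIMARY KEY (CustomerNo, UpdatedColumn)
);

With the primary key on (CustomerNo, UpdatedColumn) the table keeps only the latest update per column; include the date in the key instead if a full per-column history is needed.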

Adding new dimensions to data warehouse (adding new columns to fact table)

I am building an OLAP database and am running into some difficulty. I have already setup a fact table that includes columns for sales data, like quantity, sales, cost, profit, etc. The current dimensions I have are Date, Location, and Product. This means I have the foreign key columns for these dimension tables included in the fact table as well. I have loaded the fact table with this data.
I am now trying to add a dimension for salesperson. I have created the dimension, which has the salesperson's ID and their name and location. However, I can't edit the fact table to add the new column that will act as a foreign key to the salesperson dimension.
I want to use SSIS to do this, by using a lookup on the sales database that the fact table is based on and on the salesperson ID, but I first need to add the Salesperson column to my fact table. When I try to do it, I get an error saying that it can't create a new column because it will be populated with NULLs.
I'm going to take a guess as to the problem you're having, but this is just a guess: your question is a little difficult to understand.
I'm going to make the assumption that you have created a Fact table with x columns, including links to the Date, Location, and Product dimensions. You have then loaded that fact table with data.
You are now trying to add a new column, SalesPerson_SK (or ID), to that table. You do not wish to allow NULL values in the database, so you clear the 'allow NULL' checkbox. However, when you attempt to save your work, the table errors out with the objection that it cannot insert NULL into the SalesPerson_SK column.
There are a few ways around this limitation. One, which is probably the best if you are still in the development stage, is to issue the following command:
TRUNCATE TABLE dbo.FactMyFact
which will remove all data from the table, allowing you to make your changes and reload the table with the new column included.
If, for some reason, you cannot do so, you can alter the table to add the column but include a default constraint that will put a default value into your fact table, essentially a dummy record that says, "I don't know what this is"
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NOT NULL
CONSTRAINT DF_FactMyFact_SalesPersonSK DEFAULT 0
If you do not wish to put a default value into the table, simply create the column and allow NULL values, either by checking the box on the design page or by issuing the following command:
ALTER TABLE FactMyFact
ADD Salesperson_SK INT NULL
This answer has been given based on what I think your problem is: let me know if it helps.
Inner join the dimension with the fact table, get the key values from the dimensions and insert them into the fact...
or else create it the factless-fact way.
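For example, once the Salesperson_SK column exists on the fact table, it could be back-filled with an UPDATE ... FROM join to the salesperson dimension, assuming the fact rows still carry (or can be joined back to) the salesperson's business key; every name below is an assumption:

UPDATE f
SET    f.Salesperson_SK = d.Salesperson_SK
FROM   dbo.FactMyFact     AS f
JOIN   dbo.DimSalesperson AS d
    ON d.SalespersonID = f.SalespersonID;   -- business key carried over from the source sales data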

How to uniquely identify rows in a table without a primary key

I'm importing more than 600.000.000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from where it stopped for any reason, like an error or a network error. As this table doesn't have a primary key, I can't check if a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported or not? This table has duplicated rows; I already tried to compute the hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
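A rough sketch of that de-duplication step, assuming a staging table dbo.Staging with an added identity column StagingId and source columns col1..col3 (all names invented):

;WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY StagingId) AS rn
    FROM dbo.Staging
)
DELETE FROM numbered
WHERE rn > 1;   -- keep one copy from each group of identical rows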
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on and resume work with that chunk.
With duplicate rows, even row_number() is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new identity column onto the existing table (alter table oldTbl add NewId int identity(1,1)).
You could use row_number() and then back out the last n rows if they appear more times than the count already in the new database for them, but it would be more straightforward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat unique field combination (duplicates are allowed) and join on a hash of the rest of the fields, which you store in the destination table.
Assume these tables:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))

select *
from t_x x
where not exists (select *
                  from t_y y
                  where x.id = y.id
                    and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to try to join on as many fields as possible is to reduce the chance of hash collisions, which are a real concern on a dataset with 600.000.000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, you should do the following steps (a sketch in T-SQL follows after the list):
Alter the table and add a uniqueidentifier or int field
populate the new column, e.g. via a default constraint using the newsequentialid() function, or with an update based on row_number()
create an index on this field
add the id field to your destination table.
once all the data is moved over, the field can be dropped.
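One way to realize those steps, using an int identity column (which numbers the existing rows as it is added) rather than newsequentialid(); every object name below is a placeholder:

-- 1. Add a surrogate id to the big source table
ALTER TABLE dbo.BigOldTable ADD RowId INT IDENTITY(1,1) NOT NULL;

-- 2. Index it so the resume check stays cheap
CREATE UNIQUE INDEX IX_BigOldTable_RowId ON dbo.BigOldTable (RowId);

-- 3. Carry the id into the destination and resume from the highest id already copied
DECLARE @lastLoaded INT = (SELECT ISNULL(MAX(SourceRowId), 0) FROM dbo.NewTable);

INSERT INTO dbo.NewTable (SourceRowId, col1, col2)
SELECT RowId, col1, col2
FROM dbo.BigOldTable
WHERE RowId > @lastLoaded;

-- 4. Once everything is moved over, drop the index and then the helper column
-- DROP INDEX IX_BigOldTable_RowId ON dbo.BigOldTable;
-- ALTER TABLE dbo.BigOldTable DROP COLUMN RowId;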
