data synchronization from unreliable data source to SQL table - database

I am looking for a pattern, framework or best practice to handle a generic problem of application-level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable data source for a product catalog. Data can occasionally be unavailable, incomplete or inconsistent (the issue might come from a manual data entry error, an ETL failure...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement a safety mechanism when updating the MySQL table to "synchronize" it with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the data source => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment.
check for integrity issue => (standard problem, no point discussing approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for is a best practice and a pattern/tool to handle such a problem. If you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
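The per-column rules mentioned in the list above ("should never change", "should only increment") could be expressed as small predicates that are checked before a row is accepted. This is only a sketch: the column names `sku` and `revision`, the `RULES` mapping and the `violations` helper are all hypothetical and not part of any framework.

```python
from typing import Callable, Dict, List

# Hypothetical per-column rules: each takes (old_value, new_value)
# and returns True when the change is acceptable.
RULES: Dict[str, Callable[[object, object], bool]] = {
    "sku": lambda old, new: old == new,        # should never change
    "revision": lambda old, new: new >= old,   # should only increment
}


def violations(old_row: dict, new_row: dict) -> List[str]:
    """Return the names of the columns whose change breaks a rule."""
    bad = []
    for column, rule in RULES.items():
        if column in old_row and column in new_row:
            if not rule(old_row[column], new_row[column]):
                bad.append(column)
    return bad


current = {"sku": "A-1", "revision": 3, "price": 10.0}
incoming = {"sku": "A-1", "revision": 2, "price": 12.0}
print(violations(current, incoming))  # ['revision']
```

A sync job could skip or quarantine any incoming row for which `violations` returns a non-empty list, instead of writing it to the live table.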

We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original, but has an additional column "superseded_id" which holds the id of the import that changed the row (a deletion counts as a change); the primary key is (row_id, superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
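The history-table scheme and the rollback steps above can be sketched as follows. This uses an in-memory SQLite database purely so the example runs on its own (the answer itself is about MySQL); the `price` column and the helper names `update_price`, `delete_row` and `rollback` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (row_id INTEGER PRIMARY KEY, price REAL);
CREATE TABLE history_table (
    row_id INTEGER,
    price REAL,
    superseded_id INTEGER,              -- id of the import that changed the row
    PRIMARY KEY (row_id, superseded_id)
);
INSERT INTO main_table VALUES (1, 10.0), (2, 20.0);
""")

COPY_TO_HISTORY = (
    "INSERT INTO history_table "
    "SELECT row_id, price, ? FROM main_table WHERE row_id = ?"
)


def update_price(import_id, row_id, price):
    # Every UPDATE copies the row to the history table before changing it.
    conn.execute(COPY_TO_HISTORY, (import_id, row_id))
    conn.execute("UPDATE main_table SET price = ? WHERE row_id = ?", (price, row_id))


def delete_row(import_id, row_id):
    # Every DELETE moves the row to the history table.
    conn.execute(COPY_TO_HISTORY, (import_id, row_id))
    conn.execute("DELETE FROM main_table WHERE row_id = ?", (row_id,))


def rollback(bad_import_id):
    # Restore the rows the bad import superseded, then drop its history.
    conn.execute(
        "REPLACE INTO main_table "
        "SELECT row_id, price FROM history_table WHERE superseded_id = ?",
        (bad_import_id,),
    )
    conn.execute("DELETE FROM history_table WHERE superseded_id >= ?", (bad_import_id,))


update_price(7, 1, 99.0)   # import 7 turns out to be bad
delete_row(7, 2)
rollback(7)
rows = conn.execute("SELECT row_id, price FROM main_table ORDER BY row_id").fetchall()
print(rows)  # [(1, 10.0), (2, 20.0)]
```

Note that these two statements only restore rows that were changed or deleted; rows newly inserted by the bad import would need separate handling, which the steps above do not describe.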
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the main table, once it is found to be good, to the production database as a new table main_table_$id, with $id being the highest import id. main_table itself is a trivial view, SELECT * FROM main_table_$someid. By redefining the view to SELECT * FROM main_table_$newid we can switch the table atomically.
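A rough sketch of the view switch, again on in-memory SQLite so it is runnable as-is. SQLite has no CREATE OR REPLACE VIEW, so the sketch drops and recreates the view inside one explicit transaction; on MySQL the equivalent would be a single view redefinition. The import ids 41 and 42 and the `switch_to` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table_41 (row_id INTEGER PRIMARY KEY, price REAL);
INSERT INTO main_table_41 VALUES (1, 10.0);
CREATE VIEW main_table AS SELECT * FROM main_table_41;

-- The verified copy produced by the next import.
CREATE TABLE main_table_42 (row_id INTEGER PRIMARY KEY, price REAL);
INSERT INTO main_table_42 VALUES (1, 12.0);
""")


def switch_to(import_id: int) -> None:
    """Repoint the main_table view at main_table_<import_id>."""
    table = f"main_table_{int(import_id)}"  # int() keeps the name safe to interpolate
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS main_table")
        conn.execute(f"CREATE VIEW main_table AS SELECT * FROM {table}")
        conn.commit()
    except Exception:
        conn.rollback()
        raise


before = conn.execute("SELECT price FROM main_table").fetchall()
switch_to(42)
after = conn.execute("SELECT price FROM main_table").fetchall()
print(before, after)  # [(10.0,)] [(12.0,)]
```

Readers keep querying `main_table` throughout; only the definition behind the name changes.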

I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...."
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
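As an illustration of the versioned model, here is a small sketch on in-memory SQLite. It assumes a composite key of (product_id, valid_from), since product_id alone can no longer be unique once several versions of a product coexist; the helper names `new_version` and `as_of` and the ISO date strings are also assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE products (
    product_id   INTEGER,
    valid_from   TEXT NOT NULL,
    valid_until  TEXT,                 -- NULL marks the open (current) version
    deleted_flag INTEGER NOT NULL DEFAULT 0,
    price        REAL,
    PRIMARY KEY (product_id, valid_from)   -- assumed composite key
)""")


def new_version(product_id, now, price, deleted=0):
    # Close the open version, then insert the new one.
    conn.execute(
        "UPDATE products SET valid_until = ? "
        "WHERE product_id = ? AND valid_until IS NULL",
        (now, product_id),
    )
    conn.execute(
        "INSERT INTO products VALUES (?, ?, NULL, ?, ?)",
        (product_id, now, deleted, price),
    )


def as_of(product_id, when):
    # Return (price, deleted_flag) of the version valid at the given date.
    return conn.execute(
        "SELECT price, deleted_flag FROM products "
        "WHERE product_id = ? AND valid_from <= ? "
        "AND (valid_until IS NULL OR valid_until > ?)",
        (product_id, when, when),
    ).fetchone()


new_version(1, "2012-01-01", 10.0)             # initial load
new_version(1, "2012-02-01", 12.0)             # price change
new_version(1, "2012-03-01", 12.0, deleted=1)  # product disappears from the feed

print(as_of(1, "2012-01-15"))  # (10.0, 0)
print(as_of(1, "2012-03-15"))  # (12.0, 1)
```

Rolling back a bad import then amounts to removing the versions it created and reopening the versions it closed.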
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transactions" table:
product_stock_transactions
product_id (FK) | transaction_date | transaction_quantity | transaction_source
1               | 1 Jan 2012       | 100                  | product_feed
1               | 2 Jan 2012       | -3                   | stock_adjust_feed
1               | 3 Jan 2012       | 10                   | product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
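The ledger idea can be tried out with the three transactions shown above. A minimal sketch on in-memory SQLite, with dates stored as ISO strings and a hypothetical `stock_on` helper that sums the ledger up to a given day:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_stock_transactions (
    product_id           INTEGER,
    transaction_date     TEXT,
    transaction_quantity INTEGER,
    transaction_source   TEXT
);
INSERT INTO product_stock_transactions VALUES
    (1, '2012-01-01', 100, 'product_feed'),
    (1, '2012-01-02',  -3, 'stock_adjust_feed'),
    (1, '2012-01-03',  10, 'product_feed');
""")


def stock_on(product_id, day):
    """Quantity in stock at the end of the given day."""
    row = conn.execute(
        "SELECT COALESCE(SUM(transaction_quantity), 0) "
        "FROM product_stock_transactions "
        "WHERE product_id = ? AND transaction_date <= ?",
        (product_id, day),
    ).fetchone()
    return row[0]


print(stock_on(1, "2012-01-02"))  # 97
print(stock_on(1, "2012-01-03"))  # 107
```

Undoing a bad feed is then a matter of deleting (or reversing) the transactions from that source and date, rather than guessing what the previous total was.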
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels, and deleting (or archiving) old records.

Related

Loading Fact tables from SCD1 and SCD2 Dimension in SSIS

I am finding it difficult to understand how to get history data when a fact table joins to a Dimension that has Type 2 and Type 1 attributes for historic records that have changed. Currently I have a Surrogate Key and a Business Key in the Dim. The Fact table holds the Surrogate Key, and I am currently using the SSIS Lookup component to bring back the row that has the CurrentFlag set to Yes.
However, I am joining on the Business Key in the Lookup and returning the Surrogate Key, which I know is the main reason I can't get history. Yet if I join on the Business Key as I am currently doing and return the Business Key as well, the SSIS component will only bring back one row, regardless of how many versions of history there are against that Business Key.
I have been told to use lookups to populate fact tables, but this doesn't seem to give me the history, as it only returns one row regardless. So I just want to know how to return historic data between a fact and a dimension in SSIS.
Thank you
There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting, and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to inform you what they want to see. In case you are not tracking history whatsoever, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C. placed an order for $145 in January.
If you implement proper history tracking, and implement logic to identify the start- and end-date of a row in a dimension, you would join the fact table to the dimension using the natural key as well as the proper date interval. This should return a single dimension row for every row in the fact table. If it returns more, you have overlapping dates.
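The natural-key-plus-date-interval join can be sketched with the Customer A scenario from above. This runs on in-memory SQLite rather than SSIS; the table and column names (`dim_customer`, `staged_orders`, `valid_from`, `valid_to`), the exact move date of 1 May 2017 and the half-open interval convention are all assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,   -- surrogate key
    customer_bk TEXT,                  -- business (natural) key
    city        TEXT,
    valid_from  TEXT,
    valid_to    TEXT                   -- exclusive end of the interval
);
INSERT INTO dim_customer VALUES
    (1, 'A', 'Las Vegas',       '2017-01-01', '2017-05-01'),
    (2, 'A', 'Washington D.C.', '2017-05-01', '9999-12-31');

CREATE TABLE staged_orders (
    order_id INTEGER, customer_bk TEXT, order_date TEXT, amount REAL
);
INSERT INTO staged_orders VALUES (100, 'A', '2017-01-15', 125.0);
""")

# Look up the surrogate key of the dimension version that was valid
# on the order date, not the one flagged as current.
LOOKUP = """
SELECT o.order_id, d.customer_sk, d.city
FROM staged_orders o
JOIN dim_customer d
  ON d.customer_bk = o.customer_bk
 AND o.order_date >= d.valid_from
 AND o.order_date <  d.valid_to
"""
matches = conn.execute(LOOKUP).fetchall()
print(matches)  # [(100, 1, 'Las Vegas')]
```

The January order resolves to the Las Vegas version of the customer; a lookup filtered on the current flag would have returned surrogate key 2 instead.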
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?

Audit fields(CreatedBy, UpdatedBy) in tables. Is it good idea?

I was working on a product where almost every table had those columns. As developers we constantly had to join to the Users table to get the Id of whoever created a record, and it made a mess of the code.
I'm designing a new product and thinking about this again. Does it have to be like this? Obviously, it is good to know who created a record and when. But having 300+ tables reference the same User table doesn't seem to be very good.
How do you handle things like this? Should I create a CreatedBy column only on major entities where it's most likely needed in the UI and then deal with joining? Or should I go and put it everywhere? Or maybe have another "Audit" table where I store all this and look it up only on demand (not every time an entity is displayed in the UI)?
I'm just worried about the performance aspect, where every UI query will hit the User table.
EDIT: This is going to be SQL Server 2008 R2 database
The problem with that approach is that you only know who created the row and who changed the row last. What if the last person to update the row was correcting the previous updater's mistake?
If you're interested in doing full auditing for compliance or accountability reasons, you should probably look into SQL Server Audit. You can dictate which tables you're auditing, can change those on the fly without having to mess with your schema, and you can write queries against this data specifically instead of mixing the auditing logic with your normal application query logic (never mind widening every row of the table itself). This will also allow you to audit SELECT queries, which other potential solutions (triggers, CDC, Change Tracking - all of which are either more work or not complete for true auditing purposes) won't let you do.
I know that this is an older post, but one way to avoid the lookup on the user table is to de-normalize the audit fields.
So instead of a userid in the CreatedBy field you insert the username itself. This allows the table to be reviewed without the user lookup, and it also means that later changes in your user table, such as deleted users, are not reflected in the audit fields.
I usually add the following to the end of a table
IsDeleted bit default 0
CreatedBy varchar(20)
CreatedOn datetime2 default getdate()
UpdatedBy varchar(20)
UpdatedOn datetime2 default getdate()

What is the best way to keep changes history to database fields?

For example I have a table which stores details about properties. Which could have owners, value etc.
Is there a good design to keep the history of every change to owner and value. I want to do this for many tables. Kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert) etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
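A minimal sketch of the field-based layout, using in-memory SQLite. The column is called `changed_at` here instead of `datetime`, values are stored as text, and the `audit_update` and `field_history` helpers are hypothetical names for what the application (or a trigger) would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE audit_field (
    table_name  TEXT,
    row_id      INTEGER,
    field_name  TEXT,
    field_value TEXT,      -- every value is stored as text in this sketch
    changed_at  TEXT
)""")


def audit_update(table, row_id, old, new, when):
    """Write one audit row per field whose value actually changed."""
    for field, value in new.items():
        if old.get(field) != value:
            conn.execute(
                "INSERT INTO audit_field VALUES (?, ?, ?, ?, ?)",
                (table, row_id, field, str(value), when),
            )


def field_history(table, row_id, field):
    return conn.execute(
        "SELECT field_value, changed_at FROM audit_field "
        "WHERE table_name = ? AND row_id = ? AND field_name = ? "
        "ORDER BY changed_at",
        (table, row_id, field),
    ).fetchall()


audit_update("properties", 1, {}, {"owner": "Ann", "value": 100}, "2024-01-01")
audit_update("properties", 1, {"owner": "Ann", "value": 100},
             {"owner": "Bob", "value": 100}, "2024-02-01")
print(field_history("properties", 1, "owner"))
```

Because `value` did not change in the second call, only the `owner` field gets a second audit row, which is what keeps this layout compact but also why it has to be reassembled before a full historical record can be read.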
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database create a generalized table that has all the fields as the original record, plus a versioning field (additional meta data again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or have no indexes at all and no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, but of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity into INSERT activity, and to change DELETE activity into UPDATing the IS_DELETED boolean.
Unreason:
You are correct that this solution is similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required)
RecordID - points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
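The trigger variant of this could look roughly like the sketch below, written for in-memory SQLite so it can be run directly (trigger syntax differs between database engines). The trigger name `properties_before_update` and the column name `changed_at` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (ID INTEGER PRIMARY KEY, value1 TEXT, value2 TEXT);

CREATE TABLE properties_audit (
    ID         INTEGER PRIMARY KEY AUTOINCREMENT,  -- id of the history record
    RecordID   INTEGER,                            -- points to properties.ID
    changed_at TEXT,
    value1     TEXT,
    value2     TEXT
);

-- Before each update, store the previous values of the row.
CREATE TRIGGER properties_before_update
BEFORE UPDATE ON properties
BEGIN
    INSERT INTO properties_audit (RecordID, changed_at, value1, value2)
    VALUES (OLD.ID, datetime('now'), OLD.value1, OLD.value2);
END;

INSERT INTO properties VALUES (1, 'Ann', '100');
UPDATE properties SET value1 = 'Bob' WHERE ID = 1;
UPDATE properties SET value2 = '120' WHERE ID = 1;
""")

latest = conn.execute("SELECT value1, value2 FROM properties WHERE ID = 1").fetchone()
history = conn.execute(
    "SELECT RecordID, value1, value2 FROM properties_audit ORDER BY ID"
).fetchall()
print(latest)   # ('Bob', '120')
print(history)  # [(1, 'Ann', '100'), (1, 'Bob', '100')]
```

The main table only ever holds the latest values, while each audit row is a snapshot of the record as it was just before a change.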
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save current and previous values in the audit tables. When you make a change to any of the fields you just have to add a row in the audit table with the changed value. This way you can always sort the audit table on time and know what was the previous value in the field prior to your change.

should i consolidate these database tables .

i have an event calendar application with a sql database behind it and right now i have 3 tables to represent the events:
Table 1: Holiday
Columns: ID, Date, Name, Location, CalendarID
Table 2: Vacation
Columns: Id, Date, Name, PersonId, WorkflowStatus
Table 3: Event
Columns: Id, Date, Name, CalendarID
So i have "generic events" which go into the event table and special events like holidays and vacation that go into these separate tables. I am debating consolidating these into a single table and just having columns like location and personid blank for the generic events.
Table 1: Event:
Columns : Id, Date, Name, Location, PersonId, WorkflowStatus
does anyone see any strong positives or negatives to each option? Obviously there will be records that have columns that don't necessarily apply, but there is overlap between these three tables.
Either way you construct it, the application will have to cope with variant types. In such a situation I recommend that you use a single representation in the DBMS because the alternative is to require a multiplicity of queries.
So it becomes a question of where you stick the complexity and even in a huge organization, it's really hard to generate enough events to worry about DBMS optimization. Application code is more flexible than hardwired schemata. This is a matter of preference.
If it were my decision, i'd condense them into one table. I'd add a column called "EventType" and update that as you import the data into the new table to specify the type of event.
That way, you only need to index one table instead of three (if you feel indexes are required), the data is all in one table, and the queries to get the data out would be a little more concise because you wouldn't need to union all three tables together to see what one person has done. I don't see any downside to having it all in one table (although there will probably be one that someone will bring up that i haven't thought of).
How about sub-typing special events to an Event supertype? This way it is easy to later add any new special events.
Data integrity is the biggest downside of putting them in one table. Since these all appear to be fields that would be required, you lose the ability to require them all by default and would have to write a trigger to make sure that data integrity was maintained properly (Yes, this must be maintained in the database and not, as some people believe, by the application. Unless of course you want to have data integrity problems.)
Another issue is that these are the events you need now, and there may be more and more specialized events in the future; breaking code for one type of event because you added another specialized field that only applies to something else is a big risk. When you make a change to add some required vacation information, will you be sure to check that it doesn't break the application concerning holidays? Or worse, that it doesn't error out but shows information you didn't want? Are you going to look at the actual screen every time? Unit testing just of code may not pick up this type of thing, especially if someone was foolish enough to use select * or fail to specify columns in an insert. And frankly not every organization actually has a really thorough automated test process in place (it could be less of a risk if you do).
I personally would tend to go with Damir Sudarevic's solution: an event table for all the common fields (making it easy to at least get a list of all events) and specialized tables for the fields not held in common, making it simpler to write code that affects only one event and allowing the database to maintain its integrity.
Keep them in 3 separate tables and do a UNION ALL in a view if you need to merge the data into one resultset for consumption. How you store the data on disk need not be identical to how you need to consume the data so long as the performance is adequate.
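To make the UNION ALL view concrete, here is a sketch on in-memory SQLite using the three tables from the question. The view name `AllEvents`, the added `EventType` discriminator column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Holiday  (ID INTEGER PRIMARY KEY, Date TEXT, Name TEXT,
                       Location TEXT, CalendarID INTEGER);
CREATE TABLE Vacation (Id INTEGER PRIMARY KEY, Date TEXT, Name TEXT,
                       PersonId INTEGER, WorkflowStatus TEXT);
CREATE TABLE Event    (Id INTEGER PRIMARY KEY, Date TEXT, Name TEXT,
                       CalendarID INTEGER);

-- One read-only view over the columns the three tables share.
CREATE VIEW AllEvents AS
    SELECT 'holiday' AS EventType, ID AS Id, Date, Name FROM Holiday
    UNION ALL
    SELECT 'vacation', Id, Date, Name FROM Vacation
    UNION ALL
    SELECT 'event', Id, Date, Name FROM Event;

INSERT INTO Holiday  VALUES (1, '2024-12-25', 'Christmas', 'Office', 1);
INSERT INTO Vacation VALUES (1, '2024-08-01', 'Summer break', 7, 'approved');
INSERT INTO Event    VALUES (1, '2024-03-10', 'Team meeting', 1);
""")

calendar = conn.execute(
    "SELECT EventType, Name FROM AllEvents ORDER BY Date"
).fetchall()
print(calendar)
# [('event', 'Team meeting'), ('vacation', 'Summer break'), ('holiday', 'Christmas')]
```

Storage stays split per event type, while code that only needs a combined calendar reads from the view.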
As you have it now there are no columns that do not apply for any of the presented entities. If you were to merge the 3 tables into one you'd have to add a field at the very least to know which columns to expect to be populated and reduce your performance. Now when you query for a holiday alone you go to a subset of the data that you would have to sift through / index to get at the same data in a merged storage table.
If you did not already have these tables defined you could consider creating one table with the following signature...
create table EventBase (
Id int PRIMARY KEY,
Date date,
Name varchar(50)
)
...and, say, the holiday table with the following signature.
create table holiday (
Id int PRIMARY KEY,
EventId int,
Location varchar(50),
CalendarId int
)
...and join the two when you needed to do so. Choosing between this and the 3 separate tables you already have depends on how you plan on using the tables and volume but I would definitely not throw all into a single table as is and make things less clear to someone looking at the table definition with no other initiation.
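A quick sketch of that base-table-plus-holiday layout on in-memory SQLite. The foreign key from holiday.EventId to EventBase.Id is an addition made here (the definitions above do not declare it), and the sample row is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE EventBase (
    Id   INTEGER PRIMARY KEY,
    Date TEXT,
    Name TEXT
);
CREATE TABLE holiday (
    Id         INTEGER PRIMARY KEY,
    EventId    INTEGER NOT NULL REFERENCES EventBase(Id),  -- assumed foreign key
    Location   TEXT,
    CalendarId INTEGER
);
INSERT INTO EventBase VALUES (1, '2024-12-25', 'Christmas');
INSERT INTO holiday   VALUES (1, 1, 'Office', 1);
""")

# Join the common fields to the holiday-specific ones.
holidays = conn.execute("""
    SELECT e.Name, e.Date, h.Location
    FROM holiday h
    JOIN EventBase e ON e.Id = h.EventId
""").fetchall()
print(holidays)  # [('Christmas', '2024-12-25', 'Office')]
```

Listing every event regardless of type is then a plain query on EventBase, with no union needed.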
Or combine the common fields and separate out the unique ones:
Table 1: EventCommon
Columns: EventCommonID, Date, Name
Table 2: EventOrHoliday
Columns: EventCommonID, CalendarID, isHoliday
Table 3: Vacation
Columns: EventCommonID, PersonId, WorkflowStatus
with 1->many relationships between EventCommon and the other 2.

Database versioning

I have made a few projects (CMS and EC systems) that required some data to be versioned.
Usually I come up with this kind of schema
+--------------+
+ foobar +
+--------------+
+ foobar_id +
+ version +
+--------------+
It worked great, but I am wondering if there is a better way to do it. The main problem with that solution is that you always have to use a subquery to get the latest version.
i.e.:
SELECT * FROM foobar WHERE foobar_id = 2 and version = (SELECT MAX(version) FROM foobar f2 WHERE f2.foobar_id = 2)
This makes most of the queries more complicated and also has some performance drawbacks.
So it would be nice if you could share your experience creating versioned tables and the pros and cons of each method.
Thanks
I prefer to have historical data in another table. I would make foobar_history or something similar and make a FK to foobar_id. This will stop you from having to use a subquery altogether. This has the added advantage of not polluting your primary data table with the tons of historical data you probably don't want to see 99% of the time you're accessing it.
You will likely want to make a trigger for updating this data though, as it would require you to copy the current data in to _history and then do the update.
The cleanest solution in my opinion would be to have a History table for each table that requires versioning. In other words, have a foobar table, and then a foobar_History table, with a trigger on foobar that will write existing data to the History table with a timestamp and the user that changed the data. Older data is easily queryable, sorted by timestamp descending, and you know that the data in the main table is always the latest version.
I used to work on a system with historical data, and we had a boolean to indicate which one was the latest version of the data. Of course you need to maintain the consistency of the flag at the application level. Then you can create indexes that use the flag, and if you provide it in the where clause it's fast.
Pro:
easy to understand
does not require major change to your (existing) database schema
no need to copy old data in another table, only flag is updated.
Cons:
flag needs to be maintained at the application level
Otherwise, you can rely on a separate history table, as suggested in several answers.
Pro:
clean separation of history from actual data
possible to have a db-level cascade delete between actual data and its history, in case the entity is removed
Cons:
need 2 queries (or a union) if you want the complete history (that is, old data + current data)
the row that corresponds to the latest version of the data will be updated. I heard that updates are slower than inserts, depending on the "size" of the data that changed.
What is best will depend on your use case. I had to deal with a document management system where we wanted to be able to version documents. But we also had features like reverting to an old version. It was easier to use a boolean to speed up just the operations that required the latest one. If you have real historical data (which never changes), a dedicated history table is probably better.
Does the concept of history fit in your domain model? If no, then you have a db schema that differs from your conceptual domain model. If at the domain level, the actual data and the old data need to be handled the same way, having two tables complicates the design. Just consider the case you need to return the complete history (old + new). The easiest solution would be to have one class for each table, but then you can't return a list as easily as if you have only one table. But if these are two distinct concepts, then it's fine to have history be first-class in your design.
I would also recommend this article by M. Fowler, which is interesting when it comes to dealing with temporal data: Patterns for things that change with time
You can simplify the query by using a view over your table which filters to the latest version. This only makes the queries look nicer; you still have the performance overhead.
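Such a view could look like the following sketch, run against in-memory SQLite. The view name `foobar_latest`, the `payload` column and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foobar (
    foobar_id INTEGER,
    version   INTEGER,
    payload   TEXT,
    PRIMARY KEY (foobar_id, version)
);

-- The MAX(version) subquery is written once, inside the view.
CREATE VIEW foobar_latest AS
    SELECT f.*
    FROM foobar f
    WHERE f.version = (SELECT MAX(f2.version)
                       FROM foobar f2
                       WHERE f2.foobar_id = f.foobar_id);

INSERT INTO foobar VALUES (2, 1, 'draft'), (2, 2, 'final'), (3, 1, 'only');
""")

latest_rows = conn.execute(
    "SELECT foobar_id, version, payload FROM foobar_latest ORDER BY foobar_id"
).fetchall()
print(latest_rows)  # [(2, 2, 'final'), (3, 1, 'only')]
```

Callers query `foobar_latest` as if it were a plain table; the database still evaluates the subquery for them.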
A common technique is to add a column version_status for current/expired. Also note that if you keep new and old records in the same table, you should have a business (natural) key for your entity, something like name + pin, because the primary key will change (increment) with each row.
TABLE foobar(foobar_id PK, business_key, version, version_status, .....)
SELECT *
FROM foobar
WHERE business_key = 'myFoobar3' AND version_status = 'current'
When deciding to keep the record history in the same table -- or move it to a separate one -- consider other tables which have the foobar_id as a foreign key. When issuing a new version, should existing foreign keys point to the new PK or to the old PK? If you want to keep history of relationships, you would probably want to keep everything in the same table. If only the new version is important, you may consider to move expired rows to another table -- though it is not necessary.
If you had used Oracle you could use analytic functions
select * from (
SELECT a.*
, row_number() over (partition by foobar_id order by version desc) rn
FROM foobar a
WHERE foobar_id = 2
) where rn = 1
It depends on how many of your tables require versioning, and whether you've got a transactional or reporting system.
If just a few transactional tables - the way that you're doing it is fine as long as the performance issues aren't too significant. You can make the querying easier by adding a column for current_row and a trigger that updates the prior row to make it non-current.
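The current_row column plus trigger could be sketched like this on in-memory SQLite (trigger syntax will differ on other engines). The trigger name `foobar_new_version` and the `payload` column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foobar (
    foobar_id   INTEGER,
    version     INTEGER,
    payload     TEXT,
    current_row INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (foobar_id, version)
);

-- When a new version arrives, mark the prior rows as non-current.
CREATE TRIGGER foobar_new_version
AFTER INSERT ON foobar
BEGIN
    UPDATE foobar
    SET current_row = 0
    WHERE foobar_id = NEW.foobar_id AND version < NEW.version;
END;

INSERT INTO foobar (foobar_id, version, payload) VALUES (2, 1, 'draft');
INSERT INTO foobar (foobar_id, version, payload) VALUES (2, 2, 'final');
""")

# No subquery needed to find the latest version any more.
current = conn.execute(
    "SELECT version, payload FROM foobar WHERE foobar_id = 2 AND current_row = 1"
).fetchall()
print(current)  # [(2, 'final')]
```

The flag is maintained by the database rather than by application code, so every writer gets the same behaviour.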
But if you've got a lot of tables or the extra rows are slowing down some of your queries then I'd do as others suggest and use history tables as well as history triggers. Note that you can generate that code to make it easier to develop & maintain.
If you're in the reporting world then there are a lot of other options I won't address here. You can find the options described in detail in data warehousing data modeling books.
