I was going to add 'status', 'date_created', and 'date_updated' columns to every table in the database.
'status' is for soft deletion of rows.
Then I noticed that a few people also add 'user_created' and 'user_updated' columns to each table.
If I add those columns too, then I will have at least 5 columns for every table.
Will this be too much overhead?
Do you think it's a good idea to have those five columns?
Also, does the 'user' in 'user_created' mean database user? or application user?
As per comments above, would advise adding auditing only to those tables actually requiring it.
You generally want to audit the application user - in many instances, applications (such as Web or SOA) may be connecting all users with the same credential, so storing the DB login is pointless.
IMHO, the date created / last date updated / lastupdateby patterns never give the full picture, as you will only be able to see who made the last change and not see what was changed. If you are doing auditing, I would suggest that instead you do a full change audit using patterns such as an audit trigger. You can also avoid using triggers if your inserts / updates / deletes to your tables are encapsulated e.g. via Stored Procedures. True, the audit tables will grow very large, but they will generally not be queried much (generally just in witch-hunts), and can be archived, easily partitioned by date (and can be made readonly). With a separated audit table, you won't need a DateCreated or LastDateUpdated column, as this can be derived. You will generally still need the last change user however, as SQL will not be able to derive the application user.
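As a rough illustration of the audit trigger pattern, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever DBMS you are on. The `customer` table, its columns, and the trigger names are all hypothetical, and trigger syntax differs between SQL Server, MySQL and SQLite, so treat it as a shape rather than something to paste in. A DELETE trigger is left out because the row being deleted only knows who last changed it, not who deleted it.

```python
import sqlite3

def make_audited_db():
    """In-memory database with a hypothetical customer table and its audit trail."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (
            id              INTEGER PRIMARY KEY,
            name            TEXT NOT NULL,
            last_changed_by TEXT NOT NULL  -- application user, supplied by the app
        );
        CREATE TABLE customer_audit (
            audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            customer_id INTEGER NOT NULL,
            action      TEXT NOT NULL,
            old_name    TEXT,
            new_name    TEXT,
            changed_by  TEXT NOT NULL,
            changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TRIGGER customer_audit_insert AFTER INSERT ON customer
        BEGIN
            INSERT INTO customer_audit (customer_id, action, old_name, new_name, changed_by)
            VALUES (NEW.id, 'INSERT', NULL, NEW.name, NEW.last_changed_by);
        END;
        CREATE TRIGGER customer_audit_update AFTER UPDATE ON customer
        BEGIN
            INSERT INTO customer_audit (customer_id, action, old_name, new_name, changed_by)
            VALUES (NEW.id, 'UPDATE', OLD.name, NEW.name, NEW.last_changed_by);
        END;
    """)
    return conn

def created_and_last_updated(conn, customer_id):
    """Derive both timestamps from the audit trail instead of storing them."""
    return conn.execute(
        "SELECT MIN(changed_at), MAX(changed_at) FROM customer_audit WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
```

Note that the main table still carries `last_changed_by`, since the database cannot work out the application user on its own.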
If you do decide on logical deletes, I would avoid using 'status' as the field indicating logical deletes, as it is likely you have tables which do model a process state (e.g. Payment Status etc.). A bit or char field such as ActiveYN or IsActive is common for logical deletes.
Logical deletes can be cumbersome, as all your queries will need to filter out Active=N records, and keeping deleted records in your transaction tables can make these tables larger than necessary, especially on Many : Many / junction tables. Performance can also be impacted, as a 2-state field is unlikely to be selective enough to be useful in indexes. In this case, physical deletes with the full audit might make better sense.
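To make the filtering burden concrete, here is a small sketch of a logical delete flag, again using sqlite3 purely for illustration. The `product` table, the `is_active` column and the `active_product` view are made-up names. A view is one way to keep the filter in a single place, but every query still has to remember to go through it.

```python
import sqlite3

def open_product_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            is_active INTEGER NOT NULL DEFAULT 1 CHECK (is_active IN (0, 1))
        );
        -- Queries that should not see deleted rows read from this view.
        CREATE VIEW active_product AS
            SELECT id, name FROM product WHERE is_active = 1;
    """)
    return conn

def soft_delete(conn, product_id):
    """Mark the row as deleted; the row itself stays in the table."""
    with conn:
        conn.execute("UPDATE product SET is_active = 0 WHERE id = ?", (product_id,))
```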
I've used all five before, sure. When I want to track who, through a web app, is creating and (last) editing records, and when that happens, I include timestamps and the logged-in user (but not the DB user, that's not how my system is setup; we use one account for all DB interaction).
Likewise, status can also be useful if users are changing a record's, well, status. If it goes from being "Online" to "Offline" to "Archive", that record can reflect that.
However, I don't use these for every table, nor should you. Sometimes I have tables that are meant only to store parts of a record (normalized), or just don't have a value as far as needing a status or time created by who.
What you should be considering for every table is a Primary Key field. Unless you are more sophisticated in your approach than you sound, you will almost always want one. Some things don't necessarily need one (a states list, for instance, could Unique the abbreviation). But this is more important to most of your tables than a series of timestamp and status fields.
Simple answer - only put in your database what you need in your database.
(table: order_items)
I'm not sure if this is the correct way to implement an order history table in my database. Normally, I'm trying to reduce the redundancy. But because the user can change data in his/her offer, I need to save the minimum information of the order.
Goal: Buyer can see his/her old orders with correct title/pictures/origin path/allergens (long story...)
What speaks against my approach?
The only "fear" is that the table is going to be bloated with a lot of redundant information.
This started out as a comment but it's getting too long, so...
What database are you working with?
SQL Server, for instance, introduced the concept of temporal tables in the 2016 version. Basically you have two tables identical in structure, where one is the main table where you can use DML just as you would with a normal table, and the other is a read-only table that stores the historical data - so when you update a record in the main table, what actually happens is that the record gets copied into the history table first, and updated afterwards.
Something similar might exist in other databases as well, and it can also be implemented manually quite easily using triggers in case your database does not provide it out of the box.
Of course, you could use the technique called "soft delete", where instead of actually deleting the data you simply mark it as deleted, and instead of updating the data you create a new record with the updated data, and change the status of the existing record to Inactive.
The major advantage of this approach over temporal tables is that you still only have one table for your entity instead of two - but on the other hand, the advantage of temporal tables is that the active data is kept in a separate table from the historical data, so the active data is stored in a relatively small table and, as a result, all CRUD operations are more efficient.
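For the single-table variant, where an update becomes "mark the old row Inactive and insert a new one", a minimal sketch might look like this. It uses sqlite3 for illustration; the `offer_version` table and the `save_offer` helper are hypothetical names, not something from the question.

```python
import sqlite3

def open_offer_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE offer_version (
            version_id INTEGER PRIMARY KEY AUTOINCREMENT,
            offer_id   INTEGER NOT NULL,
            title      TEXT NOT NULL,
            status     TEXT NOT NULL DEFAULT 'Active'
                       CHECK (status IN ('Active', 'Inactive'))
        )
    """)
    return conn

def save_offer(conn, offer_id, title):
    """Never update in place: retire the current version, then add a new one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE offer_version SET status = 'Inactive' "
            "WHERE offer_id = ? AND status = 'Active'",
            (offer_id,),
        )
        conn.execute(
            "INSERT INTO offer_version (offer_id, title) VALUES (?, ?)",
            (offer_id, title),
        )
```

An order line can then point at a specific `version_id`, so old orders keep showing the title as it was when the order was placed.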
The "fear" of having a bloated table in this day and age, when memory and storage are so cheap, seems a bit strange to me.
Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen changes made to each record.
Right now we need to add the following functionality to our system: every time a user makes a change to a record in one of our tables, the system needs to keep track of it - we need to have a complete history of changes and also be able to restore row data to a selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and we also keep track of which table the data originates from. But the original data row is stored in a large text column as JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add new tracked table,
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
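To get a feel for the second option and its backward compatibility problem, here is a rough sketch with sqlite3 standing in for MySQL. The `article` and `revision` tables, the `TRACKED_TABLES` whitelist and both helpers are assumptions made for the example. On restore, saved keys that no longer match a column are skipped, and columns added after the snapshot keep their current value; whether that is the right policy depends on your data.

```python
import json
import sqlite3

TRACKED_TABLES = {"article"}  # table names are interpolated into SQL, so whitelist them

def open_history_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
        CREATE TABLE revision (
            revision_id INTEGER PRIMARY KEY AUTOINCREMENT,
            table_name  TEXT NOT NULL,
            row_id      INTEGER NOT NULL,
            saved_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            row_json    TEXT NOT NULL
        );
    """)
    return conn

def snapshot(conn, table, row_id):
    """Store the current row as JSON and return the new revision id."""
    if table not in TRACKED_TABLES:
        raise ValueError(f"untracked table: {table}")
    cur = conn.execute(f"SELECT * FROM {table} WHERE id = ?", (row_id,))
    row = cur.fetchone()
    columns = [d[0] for d in cur.description]
    cur = conn.execute(
        "INSERT INTO revision (table_name, row_id, row_json) VALUES (?, ?, ?)",
        (table, row_id, json.dumps(dict(zip(columns, row)))),
    )
    return cur.lastrowid

def restore(conn, revision_id):
    """Write a saved revision back, ignoring columns that no longer exist."""
    table, row_id, row_json = conn.execute(
        "SELECT table_name, row_id, row_json FROM revision WHERE revision_id = ?",
        (revision_id,),
    ).fetchone()
    if table not in TRACKED_TABLES:
        raise ValueError(f"untracked table: {table}")
    saved = json.loads(row_json)
    current = {info[1] for info in conn.execute(f"PRAGMA table_info({table})")}
    usable = {c: v for c, v in saved.items() if c in current and c != "id"}
    if not usable:
        return
    assignments = ", ".join(f"{c} = ?" for c in usable)
    conn.execute(f"UPDATE {table} SET {assignments} WHERE id = ?",
                 [*usable.values(), row_id])
```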
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using laravel 5.3 and mysql database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere differently, even possibly a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage of the backups.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just back up the MySQL database on a regular basis and store the backups elsewhere?
Background
I am designing a Data Warehouse with SQL Server 2012 and SSIS. The source system handles hotel reservations. The reservations are split between two tables, header and header line item. The Fact table would be at the line item level with some data from the header.
The issue
The challenge I have is that the reservation (and its line items) can change over time.
An example would be:
The booking is created.
A room is added to the booking (as a header line item).
The customer arrives and adds food/drinks to their reservation (more line items).
A payment is added to the reservation (as a line item).
A room could be subsequently cancelled and removed from the booking (a line item is deleted).
The number of people in a room can change, affecting that line item.
The booking status changes from "Provisional" to "Confirmed" at a point in its life cycle.
Those last three points are key, I'm not sure how I would keep that line updated without looking up the record and updating it. The business would like to keep track of the updates and deletions.
I'm resisting updating because:
From what I've read about Fact tables, it's not good practice to revisit rows once they've been written into the table.
I could do this with a look-up component but with upward of 45 million rows, is that the best approach?
The questions
What type of Fact table or loading solution should I go for?
Should I be updating the records, if so how can I best do that?
I'm open to any suggestions!
Additional Questions (following answer from ElectricLlama):
The fact does have a 1:1 relationship with the source. You talk about possible constraints on future development. Would you be able to elaborate on the type of constraints I would face?
Each line item will have a modified (and created date). Are you saying that I should delete all records from the fact table which have been modified since the last import and add them again (sounds logical)?
If the answer to 2 is "yes" then for auditing purposes would I write the current fact records to a separate table before deleting them?
In point one, you mention deleting/inserting the last x days bookings based on reservation date. I can understand inserting new bookings. I'm just trying to understand why I would delete?
If you effectively have a 1:1 between the source line and the fact, and you store a source system booking code in the fact (no dimensional modelling rules against that) then I suggest you have a two step load process.
delete/insert the last x days bookings based on reservation date (or whatever you consider to be the primary fact date),
delete/insert based on all source booking codes that have changed (you will of course have to know this beforehand)
You just need to consider what constraints this puts on future development, i.e. when you get additional source systems to add, you'll need to maintain the 1:1 fact to source line relationship to keep your load process consistent.
I've never updated a fact record in a data load process, but always delete/insert a certain data domain (i.e. that domain might be the trailing 20 days or a source system booking code). This is effectively the same as an update but also takes care of deletes.
With regards to auditing changes in the source, I suggest you write that to a different table altogether, not the main fact, as its purpose will be audit, not analysis.
The requirement to identify changed records in the source (for data loads and auditing) implies you will need to create triggers and log tables in the source or enable native SQL Server CDC if possible.
At all costs avoid using the SSIS lookup component as it is totally ineffective and would certainly be unable to operate on 45 million rows.
Stick with the 'insert/delete a data portion' approach as it lends itself to SSIS ability to insert/delete (and its inability to efficiently update or lookup)
In answer to the follow up questions:
1:1 relationship in fact
What I'm getting at is that you have no visibility on any future systems that need to be integrated, or on what future upgrades to your existing source system might do. This 1:1 mapping introduces a design constraint (it's not really a constraint, more a framework). Thinking about it, a new system does not need to follow this particular load design, as long as its data arrives in the fact consistently. I think implementing this 1:1 design is a good idea; I'm just trying to consider any downside.
If your source has a reliable modified date then you're in luck as you can do a differential load - only load changed records. I suggest you:
Load all recently modified records (last 5 days?) into a staging table
Do a DELETE/INSERT based on the record key. Do the delete inside SSIS in an execute SQL task, don't mess about with feeding data flows into row-by-row delete statements.
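The set-based DELETE/INSERT from staging could look roughly like the two statements below, which is what would go into the execute SQL task. This is sketched with sqlite3 so it can be run anywhere; the `fact_line` and `staging_line` tables and their columns are hypothetical, and a real fact table would carry dimension keys as well.

```python
import sqlite3

def open_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fact_line (
            line_id       INTEGER PRIMARY KEY,  -- source system record key
            booking_code  TEXT NOT NULL,
            amount        INTEGER NOT NULL,
            modified_date TEXT NOT NULL
        );
        CREATE TABLE staging_line (
            line_id       INTEGER PRIMARY KEY,
            booking_code  TEXT NOT NULL,
            amount        INTEGER NOT NULL,
            modified_date TEXT NOT NULL
        );
    """)
    return conn

def apply_staging(conn):
    """Replace every fact row that has a staged version, then add the new ones."""
    with conn:  # one transaction, so readers never see the gap
        conn.execute(
            "DELETE FROM fact_line WHERE line_id IN (SELECT line_id FROM staging_line)"
        )
        conn.execute(
            "INSERT INTO fact_line (line_id, booking_code, amount, modified_date) "
            "SELECT line_id, booking_code, amount, modified_date FROM staging_line"
        )
        conn.execute("DELETE FROM staging_line")
```

Both statements work on whole sets, so there is no row-by-row lookup or update involved.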
Audit table:
The simplest and most accurate way to do this is simply implement triggers and logs in the source system and keep it totally separate to your star schema.
If you do want this captured as part of your load process, I suggest you do a comparison between your staging table and the existing audit table and only write new audit rows, i.e. reservation X last modified date in the audit table is 2 Apr, but reservation X last modified date in the staging table is 4 Apr, so write this change as a new record to the audit table. Note that if you do a daily load, any changes in between won't be recorded, that's why I suggest triggers and logs in the source.
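If you go the load-time route, the staging-versus-audit comparison could be a single INSERT ... SELECT along these lines. This is only a sketch in sqlite3 with made-up table and column names, dates are stored as ISO strings so that plain string comparison orders them correctly, and the year in any sample data is arbitrary.

```python
import sqlite3

def open_audit_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging_reservation (
            reservation_id TEXT PRIMARY KEY,
            modified_date  TEXT NOT NULL,   -- ISO date, e.g. '2024-04-04'
            status         TEXT NOT NULL
        );
        CREATE TABLE reservation_audit (
            reservation_id TEXT NOT NULL,
            modified_date  TEXT NOT NULL,
            status         TEXT NOT NULL,
            PRIMARY KEY (reservation_id, modified_date)
        );
    """)
    return conn

def record_new_audit_rows(conn):
    """Copy a staged row into the audit table only if it is newer than
    the latest audit row for the same reservation (or there is none)."""
    with conn:
        conn.execute("""
            INSERT INTO reservation_audit (reservation_id, modified_date, status)
            SELECT s.reservation_id, s.modified_date, s.status
            FROM staging_reservation AS s
            WHERE s.modified_date > COALESCE(
                (SELECT MAX(a.modified_date)
                 FROM reservation_audit AS a
                 WHERE a.reservation_id = s.reservation_id),
                '')
        """)
```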
DELETE/INSERT records in Fact
This is more about ensuring you have an overlapping window in your load process, so that if the process fails for a couple of days (as they always do), you have some contingency there, and it will seamlessly pick the process back up once it's working again. This is not so important in your case as you have a modified date to identify differential changes, but normally, for example, I would pick a transaction date and delete, say, 7 trailing days. This means that my load process can be broken for 6 days, and if I fix it by the seventh day everything will reload properly without needing extra intervention to load the intermediate days.
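A trailing-window reload might be sketched like this. It uses sqlite3 and a hypothetical `fact_sale` table; `source_rows` stands in for whatever the extract from the source system returns, as `(sale_id, transaction_date, amount)` tuples with ISO date strings.

```python
import sqlite3
from datetime import date, timedelta

def open_sales_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE fact_sale (
            sale_id          INTEGER PRIMARY KEY,
            transaction_date TEXT NOT NULL,   -- ISO date string
            amount           INTEGER NOT NULL
        )
    """)
    return conn

def reload_trailing_window(conn, source_rows, today, days=7):
    """Delete the last `days` days from the fact and reload them from source.

    Rows older than the window are left alone. Rows inside the window that
    no longer exist in the source disappear, which is how deletes are handled.
    Returns the number of rows reloaded.
    """
    cutoff = (today - timedelta(days=days)).isoformat()
    recent = [row for row in source_rows if row[1] >= cutoff]
    with conn:
        conn.execute("DELETE FROM fact_sale WHERE transaction_date >= ?", (cutoff,))
        conn.executemany(
            "INSERT INTO fact_sale (sale_id, transaction_date, amount) VALUES (?, ?, ?)",
            recent,
        )
    return len(recent)
```

Because each run rewrites the whole window, a run that was missed yesterday is repaired by today's run without anyone having to intervene.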
I would suggest having a deleted flag and update that instead of deleting. Your performance will also be better.
This will enable you to perform an analysis on how the reservations are changing over a period of time. You will need to ensure that this flag is used in all the analysis to ensure that there is no confusion.
While working on a content management system, I've hit a bit of a wall. Coming back to my data model, I've noticed some issues that could become more prevalent with time.
Namely, I want to maintain an audit trail (change log) of record modification by user (even user record modifications would be logged). Due to the inclusion of an arbitrary number of modules, I cannot use a by-table auto incrementation field for my primary keys, as it will inevitably cause conflicts while attempting to maintain their keys in a single table.
The audit trail would keep records of user_id, record_id, timestamp, action (INSERT/UPDATE/DELETE), and archive (a serialized copy of the old record)
I've considered a few possible solutions to the issue, such as generating a UUID primary key in application logic (to ensure cross database platform compatibility).
Another option I've considered (and I'm sure the consensus will be negative for even considering this method) is, creating a RecordKey table, to maintain a globally auto-incremented key. However, I'm sure there are far better methods to achieve this.
Ultimately, I'm curious to know of what options I should consider in attempting to implement this. For example, I intend on permitting (to start at least) options for MySQL and SQLite3 storage, but I'm concerned about how each database would handle UUIDs.
Edit to make my question less vague: Would using global IDs be a recommended solution for my problem? If so, using a 128 bit UUID (application or database generated) what can I do in my table design that would help maximize query efficiency?
Ok, you've hit a brick wall. And you realise that actually the db design has problems. And you are going to keep hitting this same brick wall many times in the future. And your future is not looking bright. And you want to change that. Good.
But what you have not yet done is, figure what the actual cause of this is. You cannot escape from the predictable future until you do that. And if you do that properly, there will not be a brick wall, at least not this particular brick wall.
First, you went and stuck Idiot columns on all the tables to force uniqueness, without really understanding the Identifiers and keys that are used naturally to find the data. Those are the bricks that the wall is made from. That was an unconsidered knee-jerk reaction to a problem that demanded consideration. That is what you will have to re-visit.
Do not repeat the same mistake again. Whacking GUIDs or UUIDs, or 32-byte Idiot columns to fix your NUMERIC(10,0) Idiot columns will not do anything, except make the db much fatter, and all accesses, especially joins, much slower. The wall will be made of concrete blocks and it will hit you every hour.
Go back and look at the tables, and design them with a view to being tables, in a database. That means your starting point is No Surrogate Keys, no Idiot columns. When you are done, you will have very few Id columns. Not zero, not all tables, but very few. Therefore you have very few bricks in the wall. I have recently posted a detailed set of steps required, so please refer to:
Link to Answer re Identifiers
What is the justification of having one audit table containing the audit "records" of all tables ? Do you enjoy meeting brick walls ? Do you want the concurrency and the speed of the db to be bottlenecked on the Insert hot-spot in one file ?
Audit requirements have been implemented in dbs for over 40 years, so the chances of your users having some other requirement that will not change is not very high. May as well do it properly. The only correct method (for a Rdb) for audit tables, is to have one audit table per auditable real table. The PK will be the original table PK plus DateTime (Compound keys are normal in a modern database). Additional columns will be UserId and Action. The row itself will be the before image (the new image is the single current row in the main table). Use the exact same column names. Do not pack it into one gigantic string.
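For illustration only, one audit table per auditable table with that key layout could be sketched as follows. It is written against sqlite3, the `part` table and its natural key `part_no` are invented for the example, and the before image is written from an application-level function rather than a trigger. The timestamp is passed in so that the compound key is explicit.

```python
import sqlite3

def open_parts_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE part (
            part_no     TEXT PRIMARY KEY,   -- natural key, no surrogate Id
            description TEXT NOT NULL,
            price_cents INTEGER NOT NULL
        );
        CREATE TABLE part_audit (
            part_no     TEXT NOT NULL,      -- original PK ...
            audited_at  TEXT NOT NULL,      -- ... plus DateTime
            user_id     TEXT NOT NULL,
            action      TEXT NOT NULL CHECK (action IN ('UPDATE', 'DELETE')),
            description TEXT NOT NULL,      -- same column names as part
            price_cents INTEGER NOT NULL,
            PRIMARY KEY (part_no, audited_at)
        );
    """)
    return conn

def update_part(conn, part_no, description, price_cents, user_id, at):
    """Save the before image, then apply the change, in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO part_audit "
            "(part_no, audited_at, user_id, action, description, price_cents) "
            "SELECT part_no, ?, ?, 'UPDATE', description, price_cents "
            "FROM part WHERE part_no = ?",
            (at, user_id, part_no),
        )
        conn.execute(
            "UPDATE part SET description = ?, price_cents = ? WHERE part_no = ?",
            (description, price_cents, part_no),
        )
```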
If you do not need the data (before image), then stop recording it. It is very silly to be recording all that volume for no reason. Recovery can be obtained from the backups.
Yes, a single RecordKey table is a monstrosity. And yet another guaranteed method of single-threading the database.
Do not react to my post, I can already see from your comments that you have all the "right" reasons for doing the wrong thing, and keeping your brick walls intact. I am trying to help you destroy them. Consider it carefully for a few days before responding.
How about keeping all the record_id local to each table, and adding another column table_name (to the audit table) to make for a composite key?
This way you can also easily filter your audit log by table_name (which will be tricky with arbitrary UUID or sequence numbers). So even if you do not go with this solution, consider adding the table_name column anyway for the sake of querying the log later.
In order to fit the record_id from all tables into the same column, you would still need to enforce that all tables use the same data type for their ids (but it seems like you were planning to do that anyway).
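Here is a quick sketch of what that could look like, using sqlite3 since SQLite is one of the target platforms. The `audit_log` layout follows the columns listed in the question plus `table_name`; the helper names are made up. The pair (table_name, record_id) identifies the audited record, but it is not unique within the log because a record can change many times, so it is indexed rather than declared as the primary key.

```python
import sqlite3

def open_audit_log_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE audit_log (
            audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
            table_name TEXT NOT NULL,
            record_id  INTEGER NOT NULL,   -- local to table_name
            user_id    INTEGER NOT NULL,
            action     TEXT NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
            logged_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            archive    TEXT                -- serialized copy of the old record
        );
        CREATE INDEX audit_log_record ON audit_log (table_name, record_id);
    """)
    return conn

def log_change(conn, table_name, record_id, user_id, action, archive=None):
    with conn:
        conn.execute(
            "INSERT INTO audit_log (table_name, record_id, user_id, action, archive) "
            "VALUES (?, ?, ?, ?, ?)",
            (table_name, record_id, user_id, action, archive),
        )

def history_for(conn, table_name, record_id):
    """All logged actions for one record, oldest first."""
    return conn.execute(
        "SELECT action, user_id FROM audit_log "
        "WHERE table_name = ? AND record_id = ? ORDER BY audit_id",
        (table_name, record_id),
    ).fetchall()
```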
A more powerful scheme is to create an audit table that mirrors the structure of each table rather than put all the audit trail into one place. The "shadow" table model makes it easier to query the audit trail.
I have about 200 settings for the user. These include notice settings and tracking settings from user activities on objects. The problem is how to store it in the DB? Should each setting be a row or a column? If column then table will have 200 columns. If row then about 3 columns but 200 rows per user x even 10 million users = not good.
So how else can I store all these settings?
These settings are a mix of text entry and FK lookups to other tables.
Serializing the data almost always turns out to be a bad idea, because in doing so you cripple the dbms. All of the man years that went into producing an efficient dbms will be wasted on a serialized bucket of bits.
If you have application logic hooked up against each setting, I think you should implement it as either:
1 column per setting in the settings table.
This makes it easier to leverage the power of your dbms, with constraint checking, referential integrity, correct data type for your values, plenty of information to the optimizer. The downside is that row size grows.
or
1 table per setting (or group of related settings).
This has all of the benefits of the above, but trades rowsize for a performance penalty when you need to fetch most or all of the settings at once. When settings are optional, this alternative will be significantly smaller if the actual data is sparse.
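To show what "leveraging the dbms" buys you, here is a small sketch with three made-up settings as typed columns, using sqlite3. The setting names, the `language` lookup table and the `is_rejected` helper are all assumptions for the example. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

def open_typed_settings_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE language (code TEXT PRIMARY KEY);
        INSERT INTO language (code) VALUES ('en'), ('de');
        CREATE TABLE user_settings (
            user_id          INTEGER PRIMARY KEY,
            notify_by_email  INTEGER NOT NULL DEFAULT 1
                             CHECK (notify_by_email IN (0, 1)),
            digest_frequency TEXT NOT NULL DEFAULT 'weekly'
                             CHECK (digest_frequency IN ('daily', 'weekly', 'never')),
            language_code    TEXT NOT NULL DEFAULT 'en'
                             REFERENCES language (code)
        );
    """)
    return conn

def is_rejected(conn, sql, params=()):
    """Run a statement and report whether the database refused it."""
    try:
        with conn:
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False
```

With this layout a bad value such as an unknown language code is stopped by the database itself, with no application code involved.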
Also, lots of columns is often a "smell", that suggests you haven't normalized your data correctly, but it doesn't have to be that way. Only you know your data.
That you have 200 settings to track suggests a flexible database schema. As such I'd suggest a hybrid approach:
Users: table with row per user for properties that a user will likely always have, such as username and password. This table may also keep foreign keys, but this is a heuristic and also depends on whether the relationship is zero-to-1 or zero-to-many; the latter requires a separate table.
Features: two options
table with row per feature, effectively a hashtable, with userId, name and value columns. This could also be a spot for foreign relationships, but you would not be able to enforce data integrity in this setup.
XML, but only with a database that has features that allow you to query the data or only for data that you do not need to query, but only work with on your application server.
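The row-per-feature option could be sketched like this, again with sqlite3 and hypothetical names. Every value ends up as text, which is exactly why the database cannot enforce types or foreign keys here; any validation has to live in the application.

```python
import sqlite3

def open_settings_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE user_setting (
            user_id INTEGER NOT NULL,
            name    TEXT NOT NULL,
            value   TEXT,
            PRIMARY KEY (user_id, name)   -- one row per user per setting
        )
    """)
    return conn

def set_setting(conn, user_id, name, value):
    """Insert the setting, or overwrite it if the user already has one."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO user_setting (user_id, name, value) VALUES (?, ?, ?)",
            (user_id, name, str(value)),
        )

def get_setting(conn, user_id, name, default=None):
    """Return the stored text value, or `default` when no row exists."""
    row = conn.execute(
        "SELECT value FROM user_setting WHERE user_id = ? AND name = ?",
        (user_id, name),
    ).fetchone()
    return row[0] if row else default
```

Only settings that differ from the default need a row, so sparse data stays small.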
I think the bigger answer is you are not going to arrive at one solution from your original question, but instead need to use both to suit the data.
I think 200 columns is definitely not a good idea, because of the difficulty in writing stored procs, manually viewing data or extending to more settings later.
You could try XML for all these 200 settings, and then you will have only 1 row per user: the username and the corresponding settings XML. It will limit your querying capabilities, though DBs now support XML. You can also check out XML DBs specifically.