Many to many relations and history tables - database

Suppose I have Item and Tag tables, each of which has only an id and a name column, and an Item_Tag_Map table whose primary key is the composite (Item.id, Tag.id).
If I want to implement a history table for Item and Tag, this seems relatively straightforward: I can add a third column, revision, and a trigger that copies each row into an ItemHistory or TagHistory table with (id, revision) as the primary key and an operation column ("INSERT", "UPDATE", etc.). Since I may want to "delete" items, I can go about this in one of two ways:
Add another column, is_active, on Item or Tag, and never actually delete any rows.
Delete rows, but record the deletion in the history table as a delete operation; on an Item or Tag insert, look up the latest revision number for that item in the ItemHistory or TagHistory table and set the new row's revision from it.
The second option leaves a bad taste in my mouth, so I am fine with using the first. After all, why should I really ever need to delete an item when I can just modify it or change its active status?
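For reference, here is a minimal sketch of the first option with a trigger-fed history table. It uses Python's built-in sqlite3 module only so that it can be run as-is; the column types, the trigger names and the convention that the caller bumps revision on every update are assumptions, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    revision  INTEGER NOT NULL DEFAULT 1,
    is_active INTEGER NOT NULL DEFAULT 1   -- soft delete instead of DELETE
);
CREATE TABLE ItemHistory (
    id        INTEGER NOT NULL,
    revision  INTEGER NOT NULL,
    name      TEXT NOT NULL,
    is_active INTEGER NOT NULL,
    operation TEXT NOT NULL,
    PRIMARY KEY (id, revision)
);
-- Copy every new row into the history table.
CREATE TRIGGER item_insert_history AFTER INSERT ON Item
BEGIN
    INSERT INTO ItemHistory
    VALUES (NEW.id, NEW.revision, NEW.name, NEW.is_active, 'INSERT');
END;
-- Copy every updated row; the caller is expected to bump revision.
CREATE TRIGGER item_update_history AFTER UPDATE ON Item
BEGIN
    INSERT INTO ItemHistory
    VALUES (NEW.id, NEW.revision, NEW.name, NEW.is_active, 'UPDATE');
END;
""")

conn.execute("INSERT INTO Item (id, name) VALUES (1, 'widget')")
conn.execute("UPDATE Item SET name = 'gadget', revision = revision + 1 WHERE id = 1")
# "Deleting" the item is just another update that clears is_active.
conn.execute("UPDATE Item SET is_active = 0, revision = revision + 1 WHERE id = 1")

history = conn.execute(
    "SELECT revision, name, is_active, operation "
    "FROM ItemHistory WHERE id = 1 ORDER BY revision"
).fetchall()
```

Because (id, revision) is the history table's primary key, an update that forgets to bump revision fails with a constraint error in this sketch instead of silently overwriting history.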
Now, I've run into the same problem for the history table on the Item_Tag_Map table, but this time, neither option seems all that attractive. If I choose to add an is_active for the Item_Tag_Map, the logic of finding out whether a tag is mapped to an item changes from:
Get ALL tag_mapping for THESE items
to
Get ALL tag_mapping for THESE items WHERE is_active
The implicit idea that the presence of a mapping means that the mapping exists goes away. The set of unmapped item-tags not only includes all the ones that are not present in the table, but also the ones where is_active is false.
On the other hand, if I choose the second option, it's still rather ugly.
I'm sure people have run into this problem many times before, and I am interested in learning how you have dealt with it.

My answer depends on a few things, so I'll try to state my assumptions.
No matter what, I think is_active on Item and Tag is OK. If those two tables grow very quickly, consider running a nightly job that moves the inactive records to archive versions of the tables. The archive can be used for reporting or auditing later. You can also write a script to restore records if you need to, but the idea is that your real-time tables stay fast and free of deleted data.
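A rough sketch of what such an archiving job could look like, written against SQLite through Python's sqlite3 module so it is runnable. The ItemArchive table and the archive_inactive_items function are hypothetical names, and scheduling (cron or similar) is left out.

```python
import sqlite3

def archive_inactive_items(conn: sqlite3.Connection) -> int:
    """Move inactive rows from Item to ItemArchive; return how many were moved."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO ItemArchive (id, name) "
            "SELECT id, name FROM Item WHERE is_active = 0"
        )
        moved = conn.execute("DELETE FROM Item WHERE is_active = 0").rowcount
    return moved

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE ItemArchive (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO Item (id, name, is_active)
VALUES (1, 'live', 1), (2, 'retired', 0), (3, 'old', 0);
""")

moved = archive_inactive_items(conn)
live_ids = [row[0] for row in conn.execute("SELECT id FROM Item ORDER BY id")]
archived_ids = [row[0] for row in conn.execute("SELECT id FROM ItemArchive ORDER BY id")]
```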
If you allow the user to add/update/delete mappings, then I would consider the table the same as Item and Tag. Add the flag and use it in your queries. It doesn't seem ugly to me - I've seen it before.
If the mapping table isn't under user control, then I would guess you would use the is_active flag on either Item or Tag to determine whether or not a query could be run.
Just know that once you add that flag, people will forget to use it. I know I've done it many times ("Why did I get so many records? What am I missing? Oh yeah, is_active...").

Related

.NET Storing "Checked" items into the database

Wondering if someone could assist with the best way to handle storing "checked" items in an MSSQL database.
On my form, I have a list of fields (name, address, etc.) and then a listbox in which the user can check items, e.g. favourite colors.
In my database I would have a table for user details (tbl_userdetails - [UserID, Address...]) and a table for colors (tbl_colors - [ColorID, ColorName, ColorCode]). I would also need a table for user colors (tbl_userColors - [userID, ColorID]). These would be linked via a "userID".
Normally, to save the user details, I have a SQL string "UPDATE tbl_userdetails SET... WHERE userID = @userID". What is the best way to save the changed checked items into the next table?
My thoughts are:
Delete all the colors for the UserID in tbl_userColors and then loop over the checked items, issuing an "INSERT" statement for each.
Loop through each item that exists in the list, create a datatable and then "merge" the data (on match, insert; on not matched, delete).
Any other thoughts? What is the best way to build the INSERT statement?
Cheers
The DELETE and INSERT strategy works well as long as nothing is tied to those records. If you ever have any tables that reference tbl_userColors on the "one" side of the relationship, then you will have headaches.
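To make the DELETE and INSERT strategy concrete, here is a small sketch. It uses Python and SQLite rather than .NET and MSSQL purely so that it runs stand-alone; replace_user_colors is a hypothetical helper name. The points that carry over are the single transaction and the parameterized INSERT instead of a hand-built SQL string.

```python
import sqlite3

def replace_user_colors(conn, user_id, checked_color_ids):
    """Replace a user's saved colors with the currently checked ones."""
    with conn:  # both statements commit together, or roll back together
        conn.execute("DELETE FROM tbl_userColors WHERE userID = ?", (user_id,))
        conn.executemany(
            "INSERT INTO tbl_userColors (userID, ColorID) VALUES (?, ?)",
            [(user_id, color_id) for color_id in checked_color_ids],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_userColors ("
    " userID INTEGER, ColorID INTEGER,"
    " PRIMARY KEY (userID, ColorID))"
)

replace_user_colors(conn, 7, [1, 2])   # first save: colors 1 and 2 checked
replace_user_colors(conn, 7, [2, 3])   # later save: 1 unchecked, 3 checked
saved = [
    row[0]
    for row in conn.execute(
        "SELECT ColorID FROM tbl_userColors WHERE userID = 7 ORDER BY ColorID"
    )
]
```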
The MERGE strategy is decent, usually. One possibly unfortunate consequence is that a MISSING record is the same as a FALSE record. For instance, you have your list of colors, {red, green, blue}, and your users are making their selections. Six months later you get crazy and add orange. Now you have no idea who declined orange and who simply was never presented with orange as an option.
A third option is to place an Enabled BIT field on the tbl_userColors table. This allows you to determine if a user was presented with a color option and they declined it vs. if the user never saw a particular color option.
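A sketch of how the Enabled column could be written, again in Python with SQLite so it runs on its own. save_color_choices and its parameters are hypothetical, and INSERT OR REPLACE is SQLite's upsert shorthand; on SQL Server the same effect would need a MERGE or an UPDATE followed by an INSERT.

```python
import sqlite3

def save_color_choices(conn, user_id, presented_color_ids, checked_color_ids):
    """Store one row per color shown to the user; Enabled marks the checked ones."""
    checked = set(checked_color_ids)
    rows = [
        (user_id, color_id, 1 if color_id in checked else 0)
        for color_id in presented_color_ids
    ]
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO tbl_userColors (userID, ColorID, Enabled) "
            "VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_userColors ("
    " userID INTEGER, ColorID INTEGER, Enabled INTEGER NOT NULL,"
    " PRIMARY KEY (userID, ColorID))"
)

# Colors 1-3 were on the form and the user checked only color 2.
# Color 4 is added to the palette later, so this user never saw it.
save_color_choices(conn, 7, presented_color_ids=[1, 2, 3], checked_color_ids=[2])
choices = dict(
    conn.execute("SELECT ColorID, Enabled FROM tbl_userColors WHERE userID = 7")
)
```

A row with Enabled = 0 means "shown and declined", while a missing row (color 4 here) means "never shown".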
Speaking of Enabled BITs: your tbl_colors table should really have an Enabled BIT as well, or some other mechanism for removing a color from the UI without removing its database record. At some point you will realize you no longer want to offer blue to your users, but you also don't want to lose the historical data.
And a small aside: your table names are horrific. You should really consider dropping the Hungarian notation. I'm a big fan of Pascal-case (upper camel case) table names: Users, Colors, and UserColors.
Simple, really. Each item reports whether it is checked as a boolean. When you get ready to save to the database, loop through the colors and add each checked one (that is, each item whose checked value is true) to a list that you can later loop through to save all of the values.

What is an efficient way of implementing hellbanning?

I was working on a forum component for a bigger project and considered adding in a hellban feature where a mod may prevent a user's posts from being viewed by anyone but that user. This basically enforces the "don't feed the troll" rule, forcing everyone to ignore the troublemaker. Meanwhile the troublemaker likely becomes bored as he doesn't succeed in getting a rise out of anyone and hopefully moves on.
My first thought was to add a "hellbanned" column to the post table and create a "hellbanned" table. A hellbanned user would have their user_id added as a record to the hellbanned table, and henceforth all their future posts would have their hellbanned column set to true.
So a query showing all a topic's posts would simply show all posts where 'hellbanned = False'. And, a post operation would check if the user was in the hellban table, and if so, set the post's 'hellbanned' column to True.
I can't help but think there is a better way to do this; I'd really appreciate some suggestions.
Hellbanning exists at the level of the user, not individual posts, so you don't have to keep the flag at the level of the posts table at all - in fact, doing that would open you to data inconsistencies (e.g. an application bug may lead to an "incompletely" hellbanned user).
Instead, put the hellbanned user IDs in a separate table (and if your DBMS supports it, cluster it to avoid an "unnecessary" table heap)...
CREATE TABLE HELLBANNED_USER (
    USER_ID INT PRIMARY KEY,
    -- USER is a reserved word in several DBMSs; quote or rename it if needed
    FOREIGN KEY (USER_ID) REFERENCES USER (USER_ID)
);
...and when the time comes to exclude the hellbanned user's posts, do it similarly to this:
SELECT * FROM POST
WHERE USER_ID NOT IN (
    SELECT USER_ID FROM HELLBANNED_USER
);
This should perform nicely due to the index on HELLBANNED_USER.USER_ID.
The hellbanned users are still in the regular USER table, so everything else can keep working for them without significant changes to your code.
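Here is a runnable sketch of the separate-table approach, using Python's sqlite3 module. Two things in it are assumptions rather than part of the answer: the user table is called APP_USER because USER is a reserved word in several DBMSs, and the query has an extra OR branch so that a hellbanned user still sees their own posts, which is what the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE APP_USER (USER_ID INTEGER PRIMARY KEY);
CREATE TABLE HELLBANNED_USER (
    USER_ID INTEGER PRIMARY KEY REFERENCES APP_USER (USER_ID)
);
CREATE TABLE POST (
    POST_ID INTEGER PRIMARY KEY,
    USER_ID INTEGER NOT NULL REFERENCES APP_USER (USER_ID),
    BODY    TEXT NOT NULL
);
INSERT INTO APP_USER VALUES (1), (2);
INSERT INTO HELLBANNED_USER VALUES (2);
INSERT INTO POST VALUES (10, 1, 'hello'), (11, 2, 'flame bait');
""")

def visible_post_ids(conn, viewer_id):
    """Posts the viewer may see: everything except hellbanned users' posts,
    plus the viewer's own posts so a hellbanned user notices nothing."""
    rows = conn.execute(
        """
        SELECT POST_ID FROM POST
        WHERE USER_ID = :viewer
           OR USER_ID NOT IN (SELECT USER_ID FROM HELLBANNED_USER)
        ORDER BY POST_ID
        """,
        {"viewer": viewer_id},
    )
    return [row[0] for row in rows]

seen_by_regular_user = visible_post_ids(conn, 1)
seen_by_banned_user = visible_post_ids(conn, 2)
```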
Obviously, once the user is hellbanned as above, all of their posts (even those made before the hellbanning) become invisible. If you don't want that, add a HELLBANNED_DATE field to the hellbanned table and then hide only the posts made after the hellbanning, similarly to...
SELECT * FROM POST
WHERE NOT EXISTS (
    SELECT * FROM HELLBANNED_USER
    WHERE POST.USER_ID = HELLBANNED_USER.USER_ID
        AND POST_DATE >= HELLBANNED_DATE
);
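The date-based variant can be tried out with a few rows. This sketch uses Python's sqlite3 module; the sample data and the choice to store dates as ISO-8601 text (which compares correctly as plain strings) are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HELLBANNED_USER (
    USER_ID         INTEGER PRIMARY KEY,
    HELLBANNED_DATE TEXT NOT NULL          -- ISO-8601 text sorts chronologically
);
CREATE TABLE POST (
    POST_ID   INTEGER PRIMARY KEY,
    USER_ID   INTEGER NOT NULL,
    POST_DATE TEXT NOT NULL
);
INSERT INTO HELLBANNED_USER VALUES (2, '2024-06-01');
INSERT INTO POST VALUES
    (10, 1, '2024-06-15'),   -- ordinary user
    (11, 2, '2024-05-20'),   -- written before the ban: stays visible
    (12, 2, '2024-06-02');   -- written after the ban: hidden
""")

visible = [
    row[0]
    for row in conn.execute("""
        SELECT POST_ID FROM POST
        WHERE NOT EXISTS (
            SELECT 1 FROM HELLBANNED_USER
            WHERE POST.USER_ID = HELLBANNED_USER.USER_ID
              AND POST.POST_DATE >= HELLBANNED_USER.HELLBANNED_DATE
        )
        ORDER BY POST_ID
    """)
]
```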
Alternatively, you could just keep the HELLBANNED flag (and/or HELLBANNED_DATE) in the USER table, but you'd need to be careful to index it properly for good performance.
This might actually be a better solution than the HELLBANNED_USER, if you need to JOIN with the USER anyway (to display additional user information for each post), so the flag is readily reachable without doing the additional search through the HELLBANNED_USER table.
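A sketch of the flag-on-the-user-table alternative, in Python with SQLite. The APP_USER name (USER being reserved in several DBMSs), the NAME column and the sample rows are assumptions. The point is only that the JOIN already needed to show the author also gives access to the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE APP_USER (
    USER_ID    INTEGER PRIMARY KEY,
    NAME       TEXT NOT NULL,
    HELLBANNED INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE POST (
    POST_ID INTEGER PRIMARY KEY,
    USER_ID INTEGER NOT NULL REFERENCES APP_USER (USER_ID)
);
INSERT INTO APP_USER VALUES (1, 'alice', 0), (2, 'troll', 1);
INSERT INTO POST VALUES (10, 1), (11, 2);
""")

# The JOIN that fetches the author's name also exposes the flag,
# so no extra lookup in a separate hellban table is needed.
posts_with_author = conn.execute("""
    SELECT POST.POST_ID, APP_USER.NAME
    FROM POST
    JOIN APP_USER ON APP_USER.USER_ID = POST.USER_ID
    WHERE APP_USER.HELLBANNED = 0
    ORDER BY POST.POST_ID
""").fetchall()
```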

What are the pros and cons of having history tables?

What is the better practice?
Keep history records in a separate history table
Keep history records in the active table with a different status
In my opinion, I would rather keep a separate table, to avoid creating one huge table with duplicate records, which may cause unwanted lag when querying the table.
My preference has always been to have a separate table with history, purely because it removes the need to have a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" to get latest record (I won't get into one design that required an inline query to get max(version) to get the latest). It should mean that the current records table should remain smaller and access times may be improved, etc. In the worst case scenario, I've seen an ad-hoc query against a table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you could shard the data, so all history from one year is in one table/db and all history from another is in another table/db and so on.
Pro:
If you keep the history in a separate table, that data is accessed only when you need to search for something from the past. Most of the time the main table will be used far more than the historical one, and because it stays smaller, queries against it return faster.
Con:
On a project I worked on, I had one table with 350 columns (don't ask why.....). This table became very large as data was entered. At a certain point, records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized that it would be slower: in a lot of queries I would have had to use unions....
As a final opinion, I think it depends on the case, but I will always consider the separate table first.
I prefer to use one table and partitioning. I would also set up a view for the active records and use that instead of the base table when querying active records.
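The view half of that suggestion is easy to sketch. This example uses Python's sqlite3 module; SQLite has no table partitioning, so only the view is shown, and the Orders table and its status values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL CHECK (status IN ('active', 'closed'))
);
-- Queries for live data go through the view, so nobody has to
-- remember the status filter.
CREATE VIEW ActiveOrders AS
    SELECT id, status FROM Orders WHERE status = 'active';
INSERT INTO Orders VALUES (1, 'active'), (2, 'closed'), (3, 'active');
""")

active_ids = [row[0] for row in conn.execute("SELECT id FROM ActiveOrders ORDER BY id")]
```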
I would go for a separate table; otherwise, setting up UNIQUE and FK constraints may still be doable, but too involved.

Are created and modified the two fields every database table should have?

I recently realized that I add some form of row creation timestamp and possibly an "updated on" field to most of my tables. Suddenly I started thinking that perhaps every table in the database should have a created and modified field that are set in the model behind the scenes.
Does this sound correct? Are there any types of high-load tables (like sessions) or massive sized tables that this wouldn't be a good idea for?
I wouldn't put those fields (which I generally call audit fields) on every database table. If it's a low-traffic, high-value table (like Users, for instance), it goes on, no question. I'd also add creator and modifier. If it's a table that gets hit a lot (an operation history table, say), then maybe the benefit isn't worth the cost of increased insert time and storage space.
It's a call you'll need to make separately for each table.
Obviously, there isn't a single rule.
Most of my tables have date-related things, DateCreated, DateModified, and occasionally a Revision to track changes and so on. Do whatever makes sense. Clearly, you can invent cases where it's appropriate and cases where it is not. If you're asking whether you should add them "by default" to most tables, I'd say "probably".
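For tables where the audit fields are worth having, one way to maintain them is in the database itself rather than in the model. The sketch below does that with a column default and a trigger in SQLite (via Python's sqlite3 module). The table, column and trigger names are assumptions, and setting the fields in application code as the question describes is just as valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    created  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    modified TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Refresh modified whenever the name column is updated.
CREATE TRIGGER users_touch_modified AFTER UPDATE OF name ON Users
BEGIN
    UPDATE Users SET modified = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
""")

# Backdate the row so the effect of the trigger is visible.
conn.execute(
    "INSERT INTO Users (id, name, created, modified) "
    "VALUES (1, 'ann', '2000-01-01 00:00:00', '2000-01-01 00:00:00')"
)
conn.execute("UPDATE Users SET name = 'anne' WHERE id = 1")
created, modified = conn.execute(
    "SELECT created, modified FROM Users WHERE id = 1"
).fetchone()
```

The trigger is the "cost of increased insert time" in miniature: every update to the row now performs a second write.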

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0'; the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
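A sketch of the deleted-flag approach with a revision number doubling as an optimistic lock, in Python with SQLite. The staff table layout and the revise_last_name helper are hypothetical. A revision is only accepted if the caller's expected_revision still matches the active row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (
    record_id INTEGER PRIMARY KEY AUTOINCREMENT,
    staff_id  INTEGER NOT NULL,
    last_name TEXT NOT NULL,
    revision  INTEGER NOT NULL,
    deleted   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX staff_deleted ON staff (deleted);
INSERT INTO staff (staff_id, last_name, revision) VALUES (1, 'Smith', 1);
""")

def revise_last_name(conn, staff_id, expected_revision, new_last_name):
    """Retire the current row and insert the next revision.
    Returns False if someone else already revised the record."""
    with conn:
        retired = conn.execute(
            "UPDATE staff SET deleted = 1 "
            "WHERE staff_id = ? AND revision = ? AND deleted = 0",
            (staff_id, expected_revision),
        ).rowcount
        if retired != 1:
            return False  # stale revision: the optimistic lock failed
        conn.execute(
            "INSERT INTO staff (staff_id, last_name, revision) VALUES (?, ?, ?)",
            (staff_id, new_last_name, expected_revision + 1),
        )
    return True

first = revise_last_name(conn, 1, 1, "James")    # succeeds
second = revise_last_name(conn, 1, 1, "Jones")   # stale revision, rejected
current = conn.execute(
    "SELECT last_name, revision FROM staff WHERE staff_id = 1 AND deleted = 0"
).fetchall()
```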
I would use the separate table, because then the main table keeps a unique identifier, which is also its PK, that all the other child records point to; I think that makes data integrity issues less likely. For instance, you have Mary Jones, who has records in the address table, the email table, the performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
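A sketch of the separate-history-table layout applied to the name-change example from the question, in Python with SQLite. The table and column names and the find_current_by_last_name helper are assumptions. The current row keeps its primary key, and the search also looks at superseded values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (
    staff_id   INTEGER PRIMARY KEY,     -- stable key that child tables reference
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE staff_history (
    staff_id   INTEGER NOT NULL REFERENCES staff (staff_id),
    revision   INTEGER NOT NULL,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    PRIMARY KEY (staff_id, revision)
);
INSERT INTO staff VALUES (1, 'Jenny', 'James');
INSERT INTO staff_history VALUES (1, 1, 'Jenny', 'Smith');   -- superseded values
""")

def find_current_by_last_name(conn, last_name):
    """Return current records whose current or past last name matches."""
    return conn.execute(
        """
        SELECT staff_id, first_name, last_name FROM staff
        WHERE last_name = :name
           OR staff_id IN (
               SELECT staff_id FROM staff_history WHERE last_name = :name
           )
        ORDER BY staff_id
        """,
        {"name": last_name},
    ).fetchall()

by_old_name = find_current_by_last_name(conn, "Smith")
by_new_name = find_current_by_last_name(conn, "James")
no_match = find_current_by_last_name(conn, "Jones")
```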
With a deleted field in one table, you then have to have a non-autogenerated person id and an autogenerated record id.
You also have the possibility of people forgetting the where deleted = 0 clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set up a view with where deleted = 0, and require developers to use the view in queries, not the original table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.
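As an alternative to a trigger, some DBMSs can enforce part of this rule with a partial (filtered) unique index. The sketch below shows that substitute technique in SQLite via Python. Note that it guarantees at most one active row per person, not exactly one, and the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (
    record_id INTEGER PRIMARY KEY,
    staff_id  INTEGER NOT NULL,
    last_name TEXT NOT NULL,
    deleted   INTEGER NOT NULL DEFAULT 0
);
-- At most one non-deleted row per staff member.
CREATE UNIQUE INDEX one_active_row_per_staff
    ON staff (staff_id) WHERE deleted = 0;
INSERT INTO staff (staff_id, last_name, deleted)
VALUES (1, 'Smith', 1), (1, 'James', 0);
""")

try:
    # A second active row for the same person violates the index.
    conn.execute("INSERT INTO staff (staff_id, last_name) VALUES (1, 'Jones')")
    second_active_row_rejected = False
except sqlite3.IntegrityError:
    second_active_row_rejected = True

active_rows = conn.execute(
    "SELECT last_name FROM staff WHERE staff_id = 1 AND deleted = 0"
).fetchall()
```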
@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!
