How to store row modification data in Postgres?

So I need to find a way to track modifications made to database rows. I'm thinking the best way to do this is to store a delta object for each row after it has been modified. So if I had this as a table:
CREATE TABLE global.users
(
  id serial NOT NULL,
  username text,
  password text,
  email text,
  admin boolean,
  CONSTRAINT users_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE global.users
  OWNER TO postgres;
So say I updated the username field from "support" to "admin"; I'd then save a JSON object like:
{
  "username": ["support", "admin"]
}
associated with a specific row ID.
So my question is: what's a nice way to organize these objects in Postgres? I'm currently debating between a) having a deltas table for every existing table in the database (so this object would go into a table called global.users_delta or similar) and b) having a global deltas table which holds all deltas for all objects and tracks which table each is associated with.
I haven't really been able to identify any best practices for doing this sort of thing as yet, so even some direction towards preexisting documentation would be nice.
EDIT: An added requirement is how to deal with related data. Say something belongs to a category that is stored in another table. Usually that would be referenced by ID, so the delta would track the change in the category ID's numeric value. That value needs to be labeled somehow, but the label can't necessarily be applied retroactively (the linked value may itself change later). Should the labeled value be stored, the raw value, or both?
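A minimal sketch of option (b), assuming Postgres 9.4+ so the delta can live in a jsonb column (the table name global.deltas and all columns here are illustrative, not an established convention):
CREATE TABLE global.deltas
(
  id bigserial NOT NULL,
  table_name text NOT NULL,        -- e.g. 'global.users'
  row_id integer NOT NULL,         -- primary key of the modified row
  delta jsonb NOT NULL,            -- {"column": [old, new], ...}
  changed_at timestamptz NOT NULL DEFAULT now(),
  changed_by text,
  CONSTRAINT deltas_pkey PRIMARY KEY (id)
);

-- Recording the username change from the example above:
INSERT INTO global.deltas (table_name, row_id, delta, changed_by)
VALUES ('global.users', 1, '{"username": ["support", "admin"]}', current_user);

-- Replaying the full history of one row:
SELECT changed_at, delta
FROM global.deltas
WHERE table_name = 'global.users' AND row_id = 1
ORDER BY changed_at;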

Related

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has account settings, privacy settings and lots of other properties to set. The number of those properties has started to grow, and I could end up with 30 properties or so.
Until now I kept them in a "UserInfo" table, with User and UserInfo related one-to-many (keeping a log of all changes). Putting everything in a single "UserInfo" table doesn't sound nice and, at least in the database model, it looks messy. What's the solution?
Separating privacy settings, account settings and other "groups" of settings into separate tables with 1-1 relations between UserInfo and each settings table is one solution, but would that be too slow (or much slower) when retrieving the data? I guess not all the data would be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping a log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will keep inventing new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns may become time-consuming once you have lots of rows.
For an alternative solution check out this blog for a nifty solution for storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
  user_id SERIAL PRIMARY KEY,
  user_properties TEXT
);
This makes it hard to search for a given value in a given property. So in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: the value of the given property, and a foreign key back to the main table row where that value is found. Now you can index the value column so lookups are quick.
CREATE TABLE UserBirthdate (
  user_id BIGINT UNSIGNED PRIMARY KEY,
  birthdate DATE NOT NULL,
  FOREIGN KEY (user_id) REFERENCES Users(user_id),
  KEY (birthdate)
);

SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
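As a rough illustration of that chore (MySQL flavor, reusing the hypothetical Users and UserBirthdate tables above), every write has to touch both places, ideally inside one transaction:
START TRANSACTION;
INSERT INTO Users (user_properties)
VALUES ('{"birthdate": "2001-01-01", "nickname": "Al"}');
-- keep the searchable copy in sync with the serialized blob
INSERT INTO UserBirthdate (user_id, birthdate)
VALUES (LAST_INSERT_ID(), '2001-01-01');
COMMIT;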

What is the best way to keep this schema clear?

Currently I'm working on an RFID project where each tag is attached to an object. An object could be a person, a computer, a pencil, a box, or whatever comes to my boss's mind.
And of course each object has different attributes.
So I'm trying to have a tags table where I can keep a register of each tag in the system (registration of the tag), and other tables where I can relate a tag to an object and describe some other attributes. This is what I have done. (No real schema, just a simplified version.)
Suddenly, I realized that this schema could have the same tag in several tables.
For example, tag 123 could be in C and B at the same time, which is impossible because each tag can only be attached to a single object.
To put it simply, I want each tag to appear no more than once in the database.
My current approach: (schema diagram not included)
What I really want: (schema diagram not included)
Update:
Yeah, the TagID is chosen by the end user. Moreover, the TagID is read by a tag reader, and it is a 128-bit number.
New Update:
The objects until now are:
-- Medicament(TagID, comercial_name, generic_name, amount, ...)
-- Machine(TagID, name, description, model, manufacturer, ...)
-- Patient(TagID, firstName, lastName, birthday, ...)
All the attributes (columns, or whatever you call them) are very different.
Update after update
I'm working on a system with RFID tags for a hospital. Each RFID tag is attached to an object in order to keep track of it, and unfortunately each object has a lot of different attributes.
An object could be a person, a machine or a medicine, or maybe a new object type with other attributes.
So I just want a flexible and clever schema that allows me to introduce new object types and also lets me easily add new attributes to an object, keeping in mind that this system could grow very large.
Examples:
Tag(TagID)
Medicine(generic_name, comercial_name, expiration_date, dose, price, laboratory, ...)
Machine(model, name, description, price, buy_date, ...)
Patient(PatientID, first_name, last_name, birthday, ...)
We must relate exactly one tag to exactly one object.
Note: I don't really speak (or write) English well :P sorry for that. Not a native speaker here.
You can enforce these rules using relational constraints. Check out the use of a persisted computed column to enforce the constraint Tag:{Pencil or Computer}. This model gives you great flexibility to model each child table (Person, Machine, Pencil, etc.) and at the same time prevents any conflict between tags. Also good: we don't have to resort to triggers or UDFs via check constraints to enforce the relation. The relation is built into the model.
create table dbo.TagType (TagTypeID int primary key, TagTypeName varchar(10));
insert into dbo.TagType
values(1, 'Computer'), (2, 'Pencil');
create table dbo.Tag
(   TagId int primary key,
    TagTypeId int references TagType(TagTypeId),
    TagName varchar(10),
    TagDate datetime,
    constraint UX_Tag unique (TagId, TagTypeId)
)
go
create table dbo.Computer
(   TagId int primary key,
    TagTypeID as 1 persisted,
    CPUType varchar(25),
    CPUSpeed varchar(25),
    foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeID)
)
go
create table dbo.Pencil
(   TagId int primary key,
    TagTypeId as 2 persisted,
    isSharp bit,
    Color varchar(25),
    foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeId)
)
go
go
-----------------------------------------------------------
-- create a new tag of type Pencil:
-----------------------------------------------------------
insert into dbo.Tag(TagId, TagTypeId, TagName, TagDate)
values(1, 2, 'Tag1', getdate());
insert into dbo.Pencil(TagId, isSharp, Color)
values(1, 1, 'Yellow');
-----------------------------------------------------------
-- try to make it a Computer too (fails FK)
-----------------------------------------------------------
insert into dbo.Computer(TagId, CPUType, CPUSpeed)
values(1, 'Intel', '2.66ghz')
Have a Tag table whose TagID is an identity primary key.
This will ensure that each TagID only shows up once no matter what...
Then in the Tag table have a TagType column that can either be free-form (a table name) or, better yet, have a TagType table with entries A, B, C and a FK in Tag pointing to TagType.
I would move the tag attributes into tables A, B, C to minimize extra data in Tag, or have a series of junction tables between Tag and A, B, and C.
EDIT:
Assuming the TagID is created when the object is created, this will work fine (insert into Tag first and capture the generated TagID using SCOPE_IDENTITY()).
This assumes users cannot edit the TagID itself.
If users can choose the TagID, then still use a Tag table with the TagID, but have another field called DisplayID where the user can type in a number. Just put a unique constraint on Tag.DisplayID....
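A minimal sketch of that layout (T-SQL; names are illustrative):
create table dbo.Tag
(   TagID int identity(1,1) primary key,   -- internal id, never edited by users
    DisplayID varchar(40) not null,        -- the user-chosen identifier
    TagType char(1) not null,              -- 'A', 'B', or 'C'
    constraint UQ_Tag_DisplayID unique (DisplayID)
);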
EDIT:
What attributes do you need, and are they nullable? If they are different for A, B, and C, then it is cleaner to put them in A, B, and C, especially if there might be some for A and B but not C...
Talked with Raz to clear up what he's trying to do. What he wants is a flexible way to store attributes related to tags. A tag can be one of multiple types of objects, and each object has a specific list of attributes. He also wants to be able to add objects/attributes without having to change the schema. Here's the model I came up with: (diagram not included)
If each tag can only be in A, B, or C once, I'd just combine A, B, and C into one table. It'd be easier to give you a better idea of how to build your schema if you gave an example of exactly what you want to collect.
To me, from what I've read, it sounds like you have a list of tags and a list of objects, and you need to assign a tag to an object. If that is the case, I'd have a Tags table, an Objects table, and an ObjectTag table. In the ObjectTag table you would have a foreign key to the Tags table and a foreign key to the Objects table. Then you make a unique index on the tag foreign key, and now you've enforced your requirement of only using a tag once.
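A rough sketch of that ObjectTag layout (names invented for illustration):
create table Tags (TagID int primary key);
create table Objects (ObjectID int primary key);
create table ObjectTag
(   ObjectID int not null references Objects (ObjectID),
    TagID int not null references Tags (TagID),
    constraint UQ_ObjectTag_TagID unique (TagID)   -- each tag assigned at most once
);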
I would tackle this using your original structures. Relational databases are a lot better at aggregating/combining atomic data than they are at parsing complex data structures.
Keep the design of each "tag-able" object type in its own table. Data types, check constraints, default values, etc. are still easily implemented this way. Also, continue to define a FK from each object table to the Tags table.
I'm assuming you already have this in place, but if you place a unique constraint on the TagId column in each of the object tables (A, B, C, etc.) then you can guarantee uniqueness within that object type.
There are no built-in SQL Server constraints to guarantee uniqueness among all the object types, if implemented as separate tables. So, you will have to make your own validation. An INSTEAD OF trigger on your object tables can do this cleanly.
First, create a view to access the TagId list across all your object tables.
CREATE VIEW TagsInUse AS
    SELECT A.TagId FROM A
    UNION
    SELECT B.TagId FROM B
    UNION
    SELECT C.TagId FROM C;
Then, for each of your object tables, define an INSTEAD OF trigger to test your TagId.
CREATE TRIGGER dbo.T_IO_Insert_TableA ON dbo.A
INSTEAD OF INSERT
AS
IF EXISTS (SELECT 0 FROM dbo.TagsInUse t
           INNER JOIN inserted i ON i.TagId = t.TagId)
BEGIN
    --The tag(s) is/are already in use. Create the necessary notification(s).
    RAISERROR ('You attempted to re-use a TagId. This is not allowed.', 16, 1);
    ROLLBACK;
END
ELSE
BEGIN
    --The tag(s) is/are available, so proceed with the INSERT.
    INSERT INTO dbo.A (TagId, Attribute1, Attribute2, Attribute3)
    SELECT i.TagId, i.Attribute1, i.Attribute2, i.Attribute3
    FROM inserted AS i;
END
GO
Keep in mind that you can also (and probably should) encapsulate that IF EXISTS test in a T-SQL function for maintenance and performance reasons.
You can write supplementary stored procedures for doing things like finding what object type a TagId is associated with.
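For example, such a helper might look like this (a sketch only, against the hypothetical A/B/C tables above; the function name is made up):
CREATE FUNCTION dbo.fn_TagObjectType (@TagId int)
RETURNS varchar(10)
AS
BEGIN
    IF EXISTS (SELECT 0 FROM dbo.A WHERE TagId = @TagId) RETURN 'A';
    IF EXISTS (SELECT 0 FROM dbo.B WHERE TagId = @TagId) RETURN 'B';
    IF EXISTS (SELECT 0 FROM dbo.C WHERE TagId = @TagId) RETURN 'C';
    RETURN NULL;
END
GO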
Pros
You are still taking advantage of SQL Server's data integrity features, which are all quite fast and self-documenting. Don't underestimate the usefulness of data types.
The view is an encapsulation of the domain that must be unique without combining the underlying sets of attributes. Now you won't have to write any messy code to decipher the object's type; you can base that determination on which table contains the matching tag.
Your options remain open...
Because you didn't store everything in an EAV-friendly nvarchar(300) column, you can tweak the data types for whatever makes the most sense for each attribute.
If you run into any performance issues, you can index the view.
You (or your DBA) can move the object tables to different file groups on different disks if you need to balance things out and help with parallel disk I/O. Think of it as a form of horizontal partitioning. For example, if you have 8 times as many RFID tags applied to medicine containers as you have for patients, you can place the medicine table on a different disk without having to create the partitioning function that you would need for a monolithic table (one table for all types).
If you need to eventually partition your tables vertically (for archiving data onto a read-only partition), you can more easily create a partitioning function for each object type. This would be useful where the business rules do
Most importantly, implementing different business rules based on object type is much simpler. You don't have to implement any nasty conditional logic like "IF type = 'needle' THEN ... ELSE IF type = 'patient' THEN ... ELSE IF....". If you need to apply different rules, then apply them to the relevant object table without having to test a "type" value.
Cons
Triggers have to be maintained. However, this validation would have to be done in your application anyway, so you are simply performing the same data-integrity checking at the database instead. That means no extra network overhead, and the check is available to any application that uses this database.
What you're describing is a classical "table-per-type" ORM mapping. Entity Framework has built-in support of this, which you should look into.
Otherwise, I don't think most databases have easy integrity constraints that are enforced over primary keys of multiple tables.
However, is there any reason why you can't just use a single tags table to hold all the fields? Use a type field to hold the type of object, and NULL all the irrelevant fields -- that way they don't consume disk space. You'll end up with far fewer tables (only one) that you can maintain as one single coherent object; it also means far fewer SQL queries for work that spans multiple object types.
Implementing it as one single table also saves you disk space because you can implement tiers of inheritance -- for example, "patient", "doctor" and "nurse" can be three different object types, each having similar fields (e.g. firstname, lastname, etc.) and some unique fields. With separate tables you'd need three tables with duplicated fields.
It is also simpler when you add an object type. Before, you needed to add a new table and duplicate some SQL statements that span multiple object types. Now you only need to add new fields to the same table (maybe reusing some). The SQL you need to change is far less.
The only reason not to go with one single table is when the number of fields makes a row too large to fit inside a SQL Server page (which I believe is 8K). Then SQL Server will complain and won't allow you to add any more fields. The solution, in this case, is to adopt an ORM tool (like Entity Framework) and "reuse" fields. For example, if "Field1" is only used by object type #1, there is no reason why object type #3 can't use it to store something as well. You only need to be able to distinguish it in your programs.
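A bare-bones sketch of the single-table idea (column names invented; only a few attributes shown):
CREATE TABLE Tags (
    TagID varchar(32) PRIMARY KEY,   -- the 128-bit RFID value, e.g. hex-encoded
    ObjectType varchar(10) NOT NULL, -- 'Patient', 'Machine', 'Medicine', ...
    -- type-specific fields, NULL where not applicable
    FirstName varchar(50) NULL,      -- Patient
    LastName varchar(50) NULL,       -- Patient
    Model varchar(50) NULL,          -- Machine
    GenericName varchar(100) NULL,   -- Medicine
    ExpirationDate date NULL         -- Medicine
);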
You could have the Tags table such that it can have a pointer to any of those tables, and include a Type column that tells you which of the tables it is:
Tags
-
ID
Type (A, B, or C)
A (nullable)
B (nullable)
C (nullable)

A
-
ID
(other attributes)

What is the best way to keep a history of changes to database fields?

For example, I have a table which stores details about properties, which could have owners, a value, etc.
Is there a good design for keeping the history of every change to owner and value? I want to do this for many tables. Kind of like an audit of the table.
What I thought of was keeping a single table with fields:
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table per type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert), etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
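A minimal sketch of the field-based variant (PostgreSQL flavor; every name here is illustrative):
CREATE TABLE audit_field (
    id bigserial PRIMARY KEY,
    table_name text NOT NULL,
    record_id bigint NOT NULL,        -- id of the row that changed
    field_name text NOT NULL,
    field_value text,                 -- everything cast to text
    action char(1) NOT NULL,          -- 'I', 'U' or 'D'
    changed_at timestamptz NOT NULL DEFAULT now(),
    changed_by text
);

-- e.g. recording an update of users.username on row 42:
INSERT INTO audit_field (table_name, record_id, field_name, field_value, action, changed_by)
VALUES ('users', 42, 'username', 'admin', 'U', current_user);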
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database, create a generalized table that has all the fields of the original record, plus a versioning field (additional metadata is again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach gives you a semantically rich structure very similar to the main data structure, so the tools used to analyze and process the original data can easily be used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or not indexed at all, with no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, though of course functionality is greatly reduced. (Basically it depends on whether you want an actual audit/log that will be analyzed by some other system, or the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
  my_table_id number not null primary key,
  attr1 varchar2(10) not null,
  attr2 number null,
  constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
  my_table_id number not null,
  attr1 varchar2(10) not null,
  attr2 number null,
  effective_date date not null,
  is_deleted number(1,0) default 0 not null,
  constraint my_table_ak unique (attr1, attr2, effective_date),
  constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity and turn it into INSERT activity, and to change DELETE activity into an UPDATE of the IS_DELETED boolean.
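As a sketch of that paradigm change (Oracle flavor to match the DDL above; purely illustrative), instead of updating in place, the application inserts a new version of the row:
-- instead of: UPDATE my_table SET attr2 = 42 WHERE my_table_id = 1;
INSERT INTO my_table (my_table_id, attr1, attr2, effective_date, is_deleted)
SELECT my_table_id, attr1, 42, SYSDATE, 0
FROM my_table
WHERE my_table_id = 1
  AND effective_date = (SELECT MAX(effective_date)
                        FROM my_table
                        WHERE my_table_id = 1);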
Unreason:
You are correct that this solution is similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between time-dimensioning the table and record-based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table" and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required)
RecordID - points to the record in the original properties table
When you update the properties table, you add a new record to properties_audit containing the previous values of the record being updated. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
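If you go the trigger route, a minimal sketch in PostgreSQL (the properties/properties_audit names are from the example above; the function name and the audit table's column names are assumptions):
CREATE OR REPLACE FUNCTION properties_audit_fn() RETURNS trigger AS $$
BEGIN
    -- copy the pre-update values into the audit table
    INSERT INTO properties_audit (RecordID, changed_at, value1, value2)
    VALUES (OLD.ID, now(), OLD.value1, OLD.value2);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER properties_audit_trg
BEFORE UPDATE ON properties
FOR EACH ROW EXECUTE PROCEDURE properties_audit_fn();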
I think a simpler schema would be:
table_name, field_name, value, time, userId
There's no need to save both current and previous values in the audit table. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table on time and know what the previous value of a field was prior to your change.

Is using multiple tables an advisable solution for dealing with user-defined fields?

I am looking at a problem which would involve users uploading lists of records with various field structures into an application. The second part would be to allow those users to also specify fields to capture additional information.
This is a step beyond anything I've done up to this point, where I would have designed a static RDBMS structure myself. In some respects all records will be treated the same, so there will be some common fields required for each, and almost all queries will be run on these common fields.
My first thought would be to dynamically generate a new table for each import, and another for each data-capture field spec. Then have a master table with a GUID for every record in the application along with the common fields, plus fields that name the table the data was imported to and the table holding the data-capture fields.
Further information (metadata?) about the fields in the dynamically generated tables could be stored in XML or in a 'property' table.
This would mean that as users log into the application I would be dynamically choosing which table of data to present to them, and there would be a large number of tables in the database if it were not only multi-user but multi-tenant.
My question is: are there other methods for solving this kind of variable-field issue, or am I going down an ill-advised path here?
I believe that EAV would require me to have a table defining the fields for each import / data-capture spec, and then another table holding the import-field-value data, and that seems impractical.
I hate storing XML in the database, but this is a perfect example of when it makes sense. Store the user imports in XML initially. As your data schema matures, you can later decide which tables to persist for your larger clients. When the users pick which fields they want to query, that's when you come back and build a solid schema.
What kind is each field? Could the type of a field be different for each record?
I am working on a program now that does this, sort of, and the way we handle it is basically a record table which points to a recordfield table. The recordfield table contains all of the fields along with the field name of the actual field in the database (the column name). We then have a recorddata table which is where all the data goes for each record. We also store a record_id telling it which record it is holding.
This is how we do it: if each column for the record is the same type, then we don't need to add new columns to the table, and if it has more fields or fields of a different type, then we add fields as appropriate to the data table.
I think this is what you are talking about... correct me if I'm wrong.
I think that one additional table for each type of user-defined field, attached to the table the user can add fields to, is a good way to go.
Say you load your records into user_records(id); that table would have an id column which is a foreign key in the user-defined field tables.
User-defined string fields would go in user_records_string(id, name), where id is a foreign key to user_records(id), and name is a string, or a foreign key to a list of user-defined string field names.
Searching on them requires joining them into the base table, probably with a sub-select to filter down to one field based on the user metadata, so that the right field can be added to the query.
To simulate the user creating multiple tables, you can have a foreign key in the user_records table that points at a table list, and filter on that when querying for a single table.
This would allow your schema to be static while allowing the user to arbitrarily add fields and tables.
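One reading of that layout as a sketch (PostgreSQL flavor; the value column is made explicit here, which the answer leaves implicit, and all names are illustrative):
CREATE TABLE user_records (
    id serial PRIMARY KEY
    -- common fields shared by every record go here
);

CREATE TABLE user_records_string (
    id integer NOT NULL REFERENCES user_records (id),
    name text NOT NULL,   -- which user-defined field this is
    value text
);

-- finding records whose user-defined field 'colour' equals 'blue':
SELECT r.*
FROM user_records r
JOIN user_records_string s ON s.id = r.id
WHERE s.name = 'colour' AND s.value = 'blue';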

What are the best practices in DB design when I want to store a value that is either selected from a dropdown list or user-entered?

I am trying to find the best way to design the database in order to allow the following scenario:
The user is presented with a dropdown list of Universities (for example)
The user selects his/her university from the list if it exists
If the university does not exist, he should enter his own university in a text box (sort of like Other: [___________])
How should I design the database to handle such a situation, given that I might want to sort using the university ID, for example (probably only for the built-in universities and not the ones entered by users)?
Thanks!
I just want to make it similar to how Facebook handles this situation. If the user selects his education (by actually typing in the combobox, which is not my concern) and chooses one of the returned values, what would Facebook do?
My guess is that it would insert the UserID and the EducationID into a many-to-many table. Now, what if what the user entered is not in the database at all? It is still stored in his profile, but where?
CREATE TABLE university
(
  id smallint NOT NULL,
  name text,
  public smallint,
  CONSTRAINT university_pk PRIMARY KEY (id)
);

CREATE TABLE person
(
  id smallint NOT NULL,
  university smallint,
  -- more columns here...
  CONSTRAINT person_pk PRIMARY KEY (id),
  CONSTRAINT person_university_fk FOREIGN KEY (university)
      REFERENCES university (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);
public is set to 1 for the unis built into the system, and 0 for user-entered unis.
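The dropdown would then be populated from the flagged rows only, e.g.:
SELECT id, name FROM university WHERE public = 1 ORDER BY name;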
You could cheat: if you're not worried about the referential integrity of this field (i.e. it's just there to show up in a user's profile and isn't required for strictly enforced business rules), store it as a simple VARCHAR column.
For your dropdown, use a query like:
SELECT DISTINCT University FROM Profiles
If you want to filter out typos or one-offs, try:
SELECT University FROM Profiles
GROUP BY University
HAVING COUNT(University) > 10 -- where 10 is an arbitrary threshold you can tweak
We use this code in one of our databases for storing the trade descriptions of contractor companies; since this is informational only (there's a separate "Category" field for enforcing business rules) it's an acceptable solution.
Keep a flag for the rows entered through user input in the same table as you have your other data points. Then you can sort using the flag.
One way this was solved in a previous company I worked at:
Create two columns in your table:
1) a nullable ID referencing the system-supplied string (stored in a separate table)
2) the user-supplied string
Only one of these is populated. A constraint can enforce this (and additionally that at least one of the two columns is populated, if appropriate).
It should be noted that the problem we were solving with this was a true "Other:" situation. It was a textual description of an item with some preset defaults. Your situation sounds like an actual entity that isn't in the list, such that more than one user might want to input the same university.
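The constraint described above could be sketched like this (PostgreSQL syntax; the person_university table and its columns are invented for illustration, reusing the university table from the earlier answer):
CREATE TABLE person_university
(
  person_id integer PRIMARY KEY,
  university_id integer REFERENCES university (id),  -- system-supplied choice
  university_other text,                             -- user-supplied string
  CONSTRAINT one_of_two CHECK (
      (university_id IS NOT NULL AND university_other IS NULL) OR
      (university_id IS NULL AND university_other IS NOT NULL))
);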
This isn't a database design issue. It's a UI issue.
The dropdown list of universities is based on rows in a table. That table must have a new row inserted when the user types a new university into the text box.
If you want to separate the list you provided from the ones added by users, you can have a column in the University table with origin (or provenance) of the data.
I'm not sure if the question is very clear here.
I've done this quite a few times at work: just select between either the dropdown list or a text box. If the data is entered in the text box, then I first insert it into the database and then use SCOPE_IDENTITY() to get the unique identifier of the inserted row for further queries.
INSERT INTO MyTable (Name) VALUES ('myval'); SELECT SCOPE_IDENTITY();
This is against MS SQL 2008, though. I'm not sure if SCOPE_IDENTITY() exists in other database systems, but I'm sure there are equivalents.
