I do not want Auditing or History tracking.
I have an application that pulls its data from an external source.
We mirror the external tables in our own DB. We allow our users to update the data in our DB.
Next time the system syncs with the external data, we only override fields that we have not changed locally.
Off the top of my head I can think of 2 ways to do this:
1) Store two rows for each object. The first is the external version; the second row links to the external version but only has data in a field if that field has been changed locally.
e.g.
id | parentId | field1 | field2
1 | null | foo | bar
2 | 1 | new foo | null
This illustrates what the data would look like when a local user changed field1.
If no change occurred there would only be the first row.
2) Double the number of columns in the table.
e.g.
name_external
name_internal
I like option 1 better as it seems like it would provide better separation and make it easier to query and to do in-code comparisons between the two objects. The only downside is that it will result in more rows, but the DB will be pretty small.
Are there any other patterns I could use? Or a reason I shouldn't go with option 1?
I will be using .NET 4 WCF services.
Solution
I will go with the two-table answer provided below, and use the following SQL to return a row that has the locally changed fields merged with the original values:
SELECT
    a.[ID],
    ISNULL(b.[firstName], a.[firstName]) AS [firstName],
    ISNULL(b.[lastName],  a.[lastName])  AS [lastName],
    ISNULL(b.[dob],       a.[dob])       AS [dob],
    ISNULL(b.[active],    a.[active])    AS [active]
FROM tbl1 a
LEFT JOIN tbl2 b ON a.[ID] = b.[ID]
In my case the DB will only ever be updated via the UI system, and I can ensure people are not allowed to enter NULL as a value; instead I force them to enter a blank string. This lets me overcome the issue of what happens if the user updates a value to NULL.
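A minimal sketch of how the UI layer could write a local override into tbl2 when a user edits a field, assuming a hypothetical @ID/@firstName parameter pair (names are illustrative); only the changed column is written, every other column in tbl2 stays NULL so the merge query above falls back to the external value:
-- Hypothetical upsert for a single locally-edited field
IF EXISTS (SELECT 1 FROM tbl2 WHERE [ID] = @ID)
    UPDATE tbl2 SET [firstName] = @firstName WHERE [ID] = @ID
ELSE
    INSERT INTO tbl2 ([ID], [firstName]) VALUES (@ID, @firstName)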
There are two issues I can think of if you choose option 1.
Allowing users to update the data means you will either have to write procedures that perform the insert/update/delete statements for them, managing the double-row structure, or you have to train all the users to update the table correctly.
The other problem is modelling fields in your table which can be NULL. If you are using NULL to represent the field has not changed how can you represent a field changing to NULL?
The second option of doubling the number of columns avoids the complexity of updates and allows you to store NULL values, but you may see performance decrease due to the increased row size and therefore the amount of data the server has to move around (without testing it I realise this claim is unsubstantiated, but I thought it worth mentioning).
Another suggestion would be to duplicate the tables, perhaps putting them in another schema, which hold a snapshot of the data just after the sync with the external data source is performed.
I know you are not wanting a full audit; this would just be a copy of the data at a point in time. You can then allow users to modify the data as they wish, without the complexity of the double row/column, and when you come to re-sync with the external source you can find out which fields have changed.
I found a great article describing such a comparison: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION!
Consider having an update trigger that maintains an UpdatedBy field. We have a dedicated user for our imports that no other process is allowed to use, and the field records that user when the change was made by the import. So if any value other than user -1 was the last to update a row, you can easily tell it was changed locally. Then you can use the MERGE statement to insert/delete/update.
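A minimal sketch of that kind of trigger, assuming a hypothetical Person table with ID and UpdatedBy columns and a dedicated 'import_user' login that only the sync process may use (all names here are illustrative):
CREATE TRIGGER trg_Person_UpdatedBy ON dbo.Person
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- -1 marks changes made by the import; any other value means a local edit.
    -- Assumes recursive triggers are disabled (the SQL Server default).
    UPDATE p
    SET    p.UpdatedBy = CASE WHEN ORIGINAL_LOGIN() = 'import_user' THEN -1 ELSE USER_ID() END
    FROM   dbo.Person p
    JOIN   inserted i ON i.ID = p.ID;
END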
Related
I have an API that I'm trying to read that gives me just the updated fields. I'm trying to take that and update my tables using a stored procedure. So far the only way I have been able to figure out how to do this is with dynamic SQL, but I would prefer not to do that if there is a way to avoid it.
If it were just a couple of columns, I'd just write a proc for each, but we are talking about 100 fields and any of them could be updated together. One ticket might just need a timestamp updated, the next might need a timestamp plus who modified it, while the one after that might just need a note.
Everything I've read and been taught has told me that dynamic SQL is bad, and while I'll write it if I have to, I'd prefer to have a proc.
You could perhaps do something like this:
-- Update only the rows whose values actually differ
UPDATE o
SET    o.OLDRECORDS = n.NEWRECORDS
FROM   OLDTABLE o
JOIN   NEWTABLE n ON n.PRIMARYKEY = o.PRIMARYKEY
WHERE  o.OLDRECORDS <> n.NEWRECORDS
The best way to solve your problem is using MERGE:
Performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
As you can see your update could be more complex but more efficient as well. Using MERGE requires some proficiency, but when you start to use it you'll use it with pleasure again and again.
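A minimal sketch of that kind of MERGE, assuming hypothetical Tickets and TicketUpdates tables keyed on ID (names and columns are illustrative); COALESCE keeps the current value whenever the source did not supply one:
MERGE dbo.Tickets AS t
USING dbo.TicketUpdates AS s
    ON t.ID = s.ID
WHEN MATCHED THEN
    UPDATE SET t.ModifiedAt = COALESCE(s.ModifiedAt, t.ModifiedAt),
               t.ModifiedBy = COALESCE(s.ModifiedBy, t.ModifiedBy),
               t.Note       = COALESCE(s.Note,       t.Note)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, ModifiedAt, ModifiedBy, Note)
    VALUES (s.ID, s.ModifiedAt, s.ModifiedBy, s.Note);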
I am not sure how your business logic works that determines what columns are updated at what time. If there are separate business functions that require updating different but consistent columns per function, you will probably want to have individual update statements for each function. This will ensure that each process updates only the columns that it needs to update.
On the other hand, if your API is such that you really don't know ahead of time what needs to be updated, then building a dynamic SQL query is a good idea.
Another option is to build a save proc that sets every user-configurable field. As long as the calling process has all of that data, it can call the save procedure and pass every updateable column. There is no harm in having an UPDATE MyTable SET MyCol = @MyCol with the same value on each side.
Note that even if all of the values are the same, the rowversion (or timestamp) columns will still be updated, if present.
With our software, the tables that users can edit have a widely varying range of columns. We chose to create a single save procedure for each table that has all of the update-able columns as parameters. The calling processes (our web servers) have all the required columns in memory. They pass all of the columns on every call. This performs fine for our purposes.
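A minimal sketch of such a save procedure, assuming a hypothetical Ticket table with only a handful of updateable columns (the real one would list all of them):
CREATE PROCEDURE dbo.SaveTicket
    @ID         int,
    @ModifiedAt datetime2,
    @ModifiedBy int,
    @Note       nvarchar(max)
AS
BEGIN
    SET NOCOUNT ON;
    -- Every column is set on every call; unchanged values are simply rewritten
    UPDATE dbo.Ticket
    SET    ModifiedAt = @ModifiedAt,
           ModifiedBy = @ModifiedBy,
           Note       = @Note
    WHERE  ID = @ID;
END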
I use JPA 2 with Hibernate 4 to persist/update objects to the database. To view these objects a web page with JSF 2 is used, which all works great.
I'm having the following problem: when I update any of the objects, the database (Postgres) puts the updated object on the last row of the result. I'm wondering why this is happening? Can't the database (or is it JPA or Hibernate?) just update the row and leave it where it is? The id is of course never changed.
INSERT INTO sector (id, code) VALUES(1, 'PRIVAAT');
INSERT INTO sector (id, code) VALUES(2, 'PUBLIEK');
Select * from sector;
returns, in every database query tool:
1 privaat
2 publiek
Now I update the first row:
update sector set code = 'privaat' where id = 1;
Now it returns, in every tool:
2 publiek
1 privaat
This is also the case when using our application, which in turn uses Postgres as DB.
However, our regression tests, which use the in-memory DB HSQLDB, always return:
1 privaat
2 publiek
The regression tests reflect what we expect from databases, and we didn't see this behavior before with other databases.
The problem is that in the web UI we want to preserve the order so that users don't have to search through the data: it is always in a fixed position. Especially when working with a lot of referenced data it is very annoying for the user. The tests confirm this, but the application (which uses a different DB) behaves differently.
The only solution we have now is to put ORDER BY clauses on the id everywhere, but we would like a clean way of preserving the order. Adding ORDER BY everywhere also has a small performance cost...
So my question is, is this due to how specific databases work (vendor-dependent)? Or is this due to JPA mapping queries in a different way depending on the dialect? If this is the second case, is there a property which can preserve the order in JPA (or maybe hibernate)?
There is no order there. You're just getting the semi-random order that rows happen to be read from the heap in.
Use ORDER BY if you want an ordering.
Some database implementations have a natural ordering because of how the data is structured (say, as a b-tree table). PostgreSQL uses a heap, so there is no order unless you ask for one.
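For example, using the sector table from the question:
SELECT id, code FROM sector ORDER BY id;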
We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the DB? Save it to a file on the application server.
You could either update the column in the table as you suggested, but if it was me I'd log the event to another table, i.e. id of the record, datetime, userid (maybe ip address etc, browser version etc), just about anything else I could capture and that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage pattern is etc).
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it is impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
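A minimal sketch of the single-round-trip procedure suggested above, assuming a hypothetical MyTable and an AccessLog(RecordId, AccessedAt, UserId) logging table (all names are illustrative):
CREATE PROCEDURE dbo.GetRecord
    @RecordId int,
    @UserId   int
AS
BEGIN
    SET NOCOUNT ON;
    -- Log the access first, then return the row in the same call
    INSERT INTO dbo.AccessLog (RecordId, AccessedAt, UserId)
    VALUES (@RecordId, GETDATE(), @UserId);

    SELECT *
    FROM   dbo.MyTable
    WHERE  Id = @RecordId;
END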
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and performance is one of the major concerns in database server administration.
Instead you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
Search for database auditing for SELECT statements to find more information.
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query in the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue'  # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1
The ON DUPLICATE KEY syntax is MySQL, off the top of my head. For conditional insert-or-update in MSSQL you would need to use MERGE instead.
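A rough SQL Server equivalent of the statement above using MERGE (a sketch against the same hypothetical countTable/myTable):
MERGE countTable AS c
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS s
    ON c.id_selected = s.id
WHEN MATCHED THEN
    UPDATE SET c.times = c.times + 1
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id_selected, times) VALUES (s.id, 1);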
I have been dancing around this issue for a while but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as an NVARCHAR(150), and then we get a ticket asking to expand the field to 250, then 1000, etc., etc...
This cycle is repeated on every "note" and/or "description" field we add to most tables. Of course the concern for me is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am, faced with the same 2 options that have been staring me in the face (others are welcome); please lend me your opinions.
1) Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields in all listings. Basically, never do a SELECT * FROM [TableName] unless it is only retrieving one record.
2) Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
    [NoteId]   [int] NOT NULL PRIMARY KEY,
    [NoteText] [NVARCHAR](MAX) NOT NULL )
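And, just as a sketch of the reference itself, a hypothetical parent table (the Customer name is only an example) would then carry the key:
ALTER TABLE [dbo].[Customer]
    ADD [NoteId] int NULL
        CONSTRAINT [FK_Customer_Notes] REFERENCES [dbo].[Notes] ([NoteId]);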
Obviously I would prefer to use option 1, because option 2 would change so much in our system. However, if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and I have done the homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is roughly twice that of option 2. If I select a large number of records (1000 or more), option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows the lines blur more. This is a web app where page sizes of 50 or so are the norm, so option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
1) When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
2) As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
3) Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
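A minimal sketch of such a comparison, assuming hypothetical test tables for the two designs (one with the notes inline, one with the notes split out):
SET STATISTICS IO ON;
SELECT TOP (1000) [Id], [Name] FROM [dbo].[Orders_NotesInline];  -- option 1: notes stored in the row
SELECT TOP (1000) [Id], [Name] FROM [dbo].[Orders_NotesSplit];   -- option 2: notes in a separate table
SET STATISTICS IO OFF;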
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types.
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
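A minimal sketch of such a transition view, assuming hypothetical MyTable and Notes tables joined on a NoteId column (names and columns are illustrative):
CREATE VIEW dbo.MyTableWithNotes
AS
SELECT t.[Id], t.[Name], n.[NoteText]
FROM   dbo.MyTable t
LEFT JOIN dbo.Notes n ON n.NoteId = t.NoteId;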
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, it stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on the MyTable view which saves the Notes column into the Notes table (if NULL, delete any existing Notes row; if NOT NULL, insert if not found, otherwise update) and performs the appropriate action on the MyTable_V2 table (see the sketch after these steps)
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSTEAD OF trigger on MyTable log the fact that it was called (it can do this minimally: UPDATE a pre-existing log table row with GetDate() only if the existing row's date is more than 24 hours old, so it will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
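A minimal sketch of the INSTEAD OF trigger described in the steps above, assuming the MyTable view exposes hypothetical ID, Col1 and Notes columns over MyTable_V2(ID, Col1) and Notes(ID, Notes):
CREATE TRIGGER trg_MyTable_Update ON dbo.MyTable   -- MyTable is now the view
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Write the ordinary columns back to the base table
    UPDATE t
    SET    t.Col1 = i.Col1
    FROM   dbo.MyTable_V2 t
    JOIN   inserted i ON i.ID = t.ID;

    -- NULL note: remove any existing Notes row
    DELETE n
    FROM   dbo.Notes n
    JOIN   inserted i ON i.ID = n.ID
    WHERE  i.Notes IS NULL;

    -- Non-NULL note: update if present, insert if not
    UPDATE n
    SET    n.Notes = i.Notes
    FROM   dbo.Notes n
    JOIN   inserted i ON i.ID = n.ID
    WHERE  i.Notes IS NOT NULL;

    INSERT dbo.Notes (ID, Notes)
    SELECT i.ID, i.Notes
    FROM   inserted i
    WHERE  i.Notes IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.Notes n WHERE n.ID = i.ID);
END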
Huge amount of hassle to implement, as you surmised.
Alternatively trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and not create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now-deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so my confidence of finding all instances with a simple FIND would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance went downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop though!
I am interested in which methods of logging are common in an Oracle database.
Our method is the following:
We create a log table for the table to be logged. The log table contains all the columns of the original table plus some special fields including timestamp, modification type (insert, update, delete), modifier's id. A trigger on the original table creates one log row for each insertion and deletion, and two rows for a modification. Log rows contain the data before and after the alteration of the original one.
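A minimal sketch of such a log trigger in Oracle PL/SQL, assuming a hypothetical source table EMP(EMPNO, SAL) and a log table EMP_LOG with the same columns plus the special fields (all names are illustrative):
CREATE OR REPLACE TRIGGER emp_log_trg
AFTER INSERT OR UPDATE OR DELETE ON emp
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO emp_log (empno, sal, log_ts, log_action, log_user)
    VALUES (:NEW.empno, :NEW.sal, SYSTIMESTAMP, 'INSERT', USER);
  ELSIF UPDATING THEN
    -- two rows for a modification: the before image and the after image
    INSERT INTO emp_log (empno, sal, log_ts, log_action, log_user)
    VALUES (:OLD.empno, :OLD.sal, SYSTIMESTAMP, 'UPDATE-OLD', USER);
    INSERT INTO emp_log (empno, sal, log_ts, log_action, log_user)
    VALUES (:NEW.empno, :NEW.sal, SYSTIMESTAMP, 'UPDATE-NEW', USER);
  ELSE
    INSERT INTO emp_log (empno, sal, log_ts, log_action, log_user)
    VALUES (:OLD.empno, :OLD.sal, SYSTIMESTAMP, 'DELETE', USER);
  END IF;
END;
/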
Although state of the records can be mined back in time using this method, it has some drawbacks:
Introducing a new column in the original table does not automatically involve modifying the log.
Modifying the log affects both the log table and the trigger, and it is easy to mess up.
State of a record at a specific past time cannot be determined in a straightforward way.
...
What other possibilities exist?
What kind of tools can be used to solve this problem?
I only know of log4plsql. What are the pros/cons of this tool?
Edit: Based on Brian's answer I have found the following reference that explains standard and fine-grained auditing.
It sounds like you are after 'auditing'. Oracle has a built-in feature called Fine-Grained Auditing (FGA). In a nutshell you can audit everything or only under specific conditions. What is really cool is you can 'audit' selects as well as transactions. A simple command to get started with auditing:
audit UPDATE on SCOTT.EMP by access;
Think of it as a 'trigger' for select statements. For example, you create policies:
begin
dbms_fga.add_policy (
object_schema=>'BANK',
object_name=>'ACCOUNTS',
policy_name=>'ACCOUNTS_ACCESS'
);
end;
After you have defined the policy, when a user queries the table in the usual way, as follows:
select * from bank.accounts;
the audit trail records this action. You can see the trail by issuing:
select timestamp,
db_user,
os_user,
object_schema,
object_name,
sql_text
from dba_fga_audit_trail;
TIMESTAMP DB_USER OS_USER OBJECT_ OBJECT_N SQL_TEXT
--------- ------- ------- ------- -------- ----------------------
22-OCT-08 BANK ananda BANK ACCOUNTS select * from accounts
Judging from your description, I wonder if what you really need is not a logging mechanism, but rather some sort of historical record of the table. If that is the case, then maybe you are better off using some kind of Temporal Database design (with VALID_FROM and VALID_TO fields). You can also track changes in the database using the Oracle LogMiner tool.
As for your scenario, I would rather store the change data in this kind of schema:
+-------------+----------------------------------------------------------+
| Column Name | Function                                                 |
+-------------+----------------------------------------------------------+
| Id          | PRIMARY_KEY value of the SOURCE table                    |
| TimeStamp   | Time stamp of the action                                 |
| User        | User who made the action                                 |
| ActionType  | INSERT, UPDATE, or DELETE                                |
| OldValues   | All field values from the source table, separated by '|' |
| NewValues   | All field values from the source table, separated by '|' |
+-------------+----------------------------------------------------------+
With this type of logging table, you can easily determine:
The history of change actions for a particular record (using Id)
The state of a specific record at some point in time
Of course this kind of logging cannot easily determine all valid values of the table at a specific point in time. For that, you need to change your table design to a Temporal Database design.
In the similar question (How to Audit Database Activity without Performance and Scalability Issues?) the accepted answer mentions the monitoring of database traffic using a network traffic sniffer as an interesting alternative.
log4plsql is a completely different thing; it's for logging debug info from PL/SQL.
For what you want, you need one of the following:
Set up a trigger
Set up a PL/SQL interface around the tables; CRUD operations happen via this interface, and the interface ensures the log tables are updated
Set up an interface in your application layer; the same as the PL/SQL interface, just higher up
Oracle 11g has versioned tables; I have not used them at all though, so I can make no real comment.
If you are just interested in knowing what the data looked like in the recent past, you could use Oracle's Flashback Query functionality to query the data as of a specific time in the past. How far back you can go depends on how much disk space you have and how much database activity there is. The bright side of this solution is that new columns are handled automatically. The downside is that you can't flashback query past DDL operations.
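For example, assuming a hypothetical EMP table and sufficient undo retention, a flashback query looks like this:
SELECT *
FROM   emp AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR)
WHERE  empno = 7369;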