How to log changes in an Oracle database?

I am interested in which methods of logging are common in an Oracle database.
Our method is the following:
We create a log table for each table to be logged. The log table contains all the columns of the original table plus some special fields, including a timestamp, the modification type (insert, update, delete), and the modifier's id. A trigger on the original table creates one log row for each insert and delete, and two rows for an update, so the log rows contain the data before and after the alteration of the original row. A rough sketch of such a trigger is shown below.
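For illustration, a minimal sketch of that kind of trigger (the table T, its column COL1, and the log table T_LOG are hypothetical; a real trigger would list every column of the source table, and would probably also flag old vs. new rows):
CREATE OR REPLACE TRIGGER t_log_trg
AFTER INSERT OR UPDATE OR DELETE ON t
FOR EACH ROW
DECLARE
  v_type VARCHAR2(1);
BEGIN
  v_type := CASE WHEN INSERTING THEN 'I' WHEN UPDATING THEN 'U' ELSE 'D' END;
  IF INSERTING OR UPDATING THEN
    -- the "after" image of the row
    INSERT INTO t_log (id, col1, log_ts, mod_type, modifier)
    VALUES (:NEW.id, :NEW.col1, SYSTIMESTAMP, v_type, USER);
  END IF;
  IF UPDATING OR DELETING THEN
    -- the "before" image of the row
    INSERT INTO t_log (id, col1, log_ts, mod_type, modifier)
    VALUES (:OLD.id, :OLD.col1, SYSTIMESTAMP, v_type, USER);
  END IF;
END;
/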
Although the state of the records can be mined back in time using this method, it has some drawbacks:
Introducing a new column in the original table does not automatically involve a change to the log table.
Changing the logging affects both the log table and the trigger, and it is easy to mess up.
The state of a record at a specific past time cannot be determined in a straightforward way.
...
What other possibilities exist?
What kind of tools can be used to solve this problem?
I only know of log4plsql. What are the pros/cons of this tool?
Edit: based on Brian's answer I have found the following reference, which explains standard and fine-grained auditing.

It sounds like you are after 'auditing'. Oracle has a built-in feature called Fine-Grained Auditing (FGA). In a nutshell, you can audit everything or only specific conditions. What is really cool is that you can 'audit' selects as well as transactions. A simple command to get started with auditing:
audit UPDATE on SCOTT.EMP by access;
Think of it as a 'trigger' for select statements. For example, you create policies:
begin
  dbms_fga.add_policy (
    object_schema => 'BANK',
    object_name   => 'ACCOUNTS',
    policy_name   => 'ACCOUNTS_ACCESS'
  );
end;
/
After you have defined the policy, when a user queries the table in the usual way, as follows:
select * from bank.accounts;
the audit trail records this action. You can see the trail by issuing:
select timestamp,
db_user,
os_user,
object_schema,
object_name,
sql_text
from dba_fga_audit_trail;
TIMESTAMP  DB_USER  OS_USER  OBJECT_SCHEMA  OBJECT_NAME  SQL_TEXT
---------  -------  -------  -------------  -----------  ----------------------
22-OCT-08  BANK     ananda   BANK           ACCOUNTS     select * from accounts
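The policy above audits every query against BANK.ACCOUNTS. DBMS_FGA.ADD_POLICY also accepts parameters such as audit_condition, audit_column, and statement_types to narrow the audit down; for example (the BALANCE column here is just an assumption about the table):
begin
  dbms_fga.add_policy (
    object_schema   => 'BANK',
    object_name     => 'ACCOUNTS',
    policy_name     => 'ACCOUNTS_LARGE',
    audit_condition => 'BALANCE >= 10000',   -- only audit rows matching this condition
    audit_column    => 'BALANCE',            -- only when this column is referenced
    statement_types => 'SELECT,UPDATE'
  );
end;
/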

Judging from your description, I wonder if what you really need is not a logging mechanism, but rather some sort of historical view of a table. If that is the case, then maybe you would be better off with some kind of temporal database design (using VALID_FROM and VALID_TO fields). You can also track changes in the database using the Oracle LogMiner tool.
As for your scenario, I would rather store the change data in this kind of schema:
+-------------+-----------------------------------------------------------+
| Column Name | Function                                                  |
+-------------+-----------------------------------------------------------+
| Id          | PRIMARY KEY value of the SOURCE table                     |
| TimeStamp   | Time stamp of the action                                  |
| User        | User who made the action                                  |
| ActionType  | INSERT, UPDATE, or DELETE                                 |
| OldValues   | All field values from the source table, separated by '|'  |
| NewValues   | All field values from the source table, separated by '|'  |
+-------------+-----------------------------------------------------------+
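A possible Oracle DDL for such a log table (names, types, and sizes are assumptions; User and TimeStamp are renamed to avoid clashing with reserved words):
CREATE TABLE change_log (
  id          NUMBER        NOT NULL,                       -- primary key value of the source row
  log_ts      TIMESTAMP     DEFAULT SYSTIMESTAMP NOT NULL,  -- time stamp of the action
  log_user    VARCHAR2(30)  NOT NULL,                       -- user who made the action
  action_type VARCHAR2(6)   NOT NULL,                       -- INSERT, UPDATE or DELETE
  old_values  CLOB,                                         -- old field values, separated by '|'
  new_values  CLOB                                          -- new field values, separated by '|'
);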
With this type of logging table, you can easily determine:
the history of change actions for a particular record (using Id);
the state of a specific record at some point in time.
Of course, this kind of logging cannot easily give you all valid values of the table at a specific point in time. For that, you need to change your table design to a temporal database design.

In a similar question (How to Audit Database Activity without Performance and Scalability Issues?) the accepted answer mentions monitoring database traffic with a network traffic sniffer as an interesting alternative.

log4plsql is a completely different thing; it's for logging debug info from PL/SQL.
For what you want, you need to do one of the following:
Set up a trigger.
Set up a PL/SQL interface around the tables, so CRUD operations happen via this interface and the interface ensures the log tables are updated (a rough sketch follows below).
Set up an interface in your application layer; same idea as the PL/SQL interface, just higher up.
Oracle 11g contains versioned tables; I have not used this at all though, so I can make no real comment.
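A minimal sketch of the PL/SQL interface idea, assuming a hypothetical EMP table and EMP_LOG log table; applications call the package instead of issuing DML directly:
CREATE OR REPLACE PACKAGE emp_api AS
  PROCEDURE update_salary(p_empno IN NUMBER, p_sal IN NUMBER);
END emp_api;
/
CREATE OR REPLACE PACKAGE BODY emp_api AS
  PROCEDURE update_salary(p_empno IN NUMBER, p_sal IN NUMBER) IS
  BEGIN
    UPDATE emp SET sal = p_sal WHERE empno = p_empno;
    -- the interface guarantees the log row is written with the change
    INSERT INTO emp_log (empno, sal, log_ts, mod_type, modifier)
    VALUES (p_empno, p_sal, SYSTIMESTAMP, 'U', USER);
  END update_salary;
END emp_api;
/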

If you are just interested in what the data looked like in the recent past, you could use Oracle's flashback query functionality to query the data as of a specific time in the past. How far back you can go depends on how much undo/disk space you have and how much database activity there is. The bright side of this solution is that new columns are picked up automatically. The downside is that you can't flashback query past DDL operations.
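For example, something like this (the table name and the one-hour window are just placeholders):
SELECT *
FROM   emp AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR)
WHERE  empno = 7369;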

Related

Subscriber-only trigger

There are two servers. The first is the production ERP system. The second is a BI server for heavy analytical queries. We update the BI server on a daily basis via backups. However, that's not enough; some users want to see their data changes more often than the next day. I have no access to the ERP server and can't do anything except ask for backups or replication.
Before starting to ask for replication, I want to understand whether it's possible to use subscriber-side triggers in order to process only the changed data rather than all of it. There is an ETL process to make some queries faster (indexing, transformation, etc.). Triggers should do the trick, but I can't find a way to use them on the subscriber side only. The ERP system doesn't allow any changes at the DB level, so the subscriber database seems to be the right place for triggers (they don't affect ERP server performance). Nonetheless, I can't find a way to set them up. Processing all the data is an insane overhead.
Use case: Simplified example, say, we have two replicated tables:
+------------+-------------+--------+
| dt         | customer_id | amount |
+------------+-------------+--------+
| 2017-01-01 |           1 |    234 |
| 2017-01-02 |           1 |    123 |
+------------+-------------+--------+
+------------+-------------+------------+------------+
| manager_id | customer_id | date_from  | date_to    |
+------------+-------------+------------+------------+
|          1 |           1 | 2017-01-01 | 2017-01-02 |
|          2 |           1 | 2017-01-02 | null       |
+------------+-------------+------------+------------+
I need to transform them into the following indexed table:
+----------+-------------+------------+--------+
| dt_id    | customer_id | manager_id | amount |
+----------+-------------+------------+--------+
| 20170101 |           1 |          1 |    234 |
| 20170102 |           1 |          2 |    123 |
+----------+-------------+------------+--------+
So, I created yet another database where I store the table above. Now, in order to update the table, I have to truncate it and reinsert all the data again. I could join the source tables to check for diffs, but that is too heavy for big tables as well. A trigger would help track only the changing records. The first input table could use a trigger like this:
create trigger SaleInsert
on Table1
after insert
as
begin
    insert into NewDB..IndexedTable
    select
        -- some transformation of the inserted rows
    from inserted i
    left join Table2 t2
           on i.customer_id = t2.customer_id
          and i.dt >= t2.date_from
          and (i.dt < t2.date_to or t2.date_to is null)  -- date_to is null for the current manager
end
The same idea applies for update and delete, and a similar approach works for the second table. That way I could get an automatically updated DWH with only a small lag. Yes, I expect performance overhead on highly loaded databases, but theoretically it should work smoothly on servers with the same configuration.
But, again, I can't find a way to have triggers on the subscriber side only. Any ideas or alternatives?
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table and then lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which fields might have been modified. Then you can query the tables, joining on those keys, to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise edition.
Change Data Capture
The general process is:
Get the current sync version
Get the last sync version you used to get changes
Get all the primary keys of the tables that have changed since that last version (inserts, updates, and deletes).
Join the keys with the data and pull down the data (if not using change capture).
Save the data and the current sync version (from the first step).
Then you repeat this process whenever you want to subscribe to the next set of changes. SQL Server does all the magic behind the scenes for you of storing the changes and the versioning. You also might want to look into Snapshot Isolation... it works well with it. The linked article has more information about that.
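A rough sketch of those steps in T-SQL (the database and table names are assumptions; change tracking requires a primary key on the tracked table, assumed here to be (dt, customer_id)):
-- enable at the database and table level (SQL Server 2008+)
ALTER DATABASE ErpReplica
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Sales
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- step 1: the current sync version
DECLARE @current_version bigint = CHANGE_TRACKING_CURRENT_VERSION();

-- step 2: the last sync version you persisted (0 on the first run)
DECLARE @last_sync_version bigint = 0;

-- step 3: keys of rows changed since then, plus the operation type
SELECT ct.dt, ct.customer_id, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.Sales, @last_sync_version) AS ct;

-- steps 4-5: join those keys back to dbo.Sales to pull the data,
-- then persist @current_version for the next run.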
This answer is somewhat roundabout, but given your tight limitations perhaps you'll consider it.
First, go for replication, as you seem to have decided. You mentioned creating yet another database but were stuck on how to create triggers to populate it. The answer lies in the ability to run post-snapshot scripts: when creating the replication publication, the DBA can specify a script to run on the Subscriber after the snapshot.
You can have the script create all the triggers you require.
Also, to prevent replication from overwriting your triggers with "no trigger" (as defined in the ERP database), the DBA will need to verify that, for each table on which you have triggers, the article property Copy user triggers is set to False.
You cannot get updates in the ERP database until you implement triggers or change tracking on the ERP database. If you have no access to the ERP server and can't do anything except ask for backups or replication, the best way is to go for replication.
If the ERP system database is using the full recovery model, you could use log shipping to accomplish this. This would still have some level of delay between the production system and the reporting system; if some delay between DML statements being issued on the ERP system and appearing on the reporting system is acceptable, this solution would work. If you need nearly instant access to the data in the reporting system, replication and its associated overhead is your only good option. For more information on configuring log shipping:
https://learn.microsoft.com/en-us/sql/database-engine/log-shipping/about-log-shipping-sql-server

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive. (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the select, are you? What about rollbacks? So you need some manager that first runs the select and then writes the tracking information. And what is the point of saving tracking information together with the entity data by sending it back to the DB? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table, i.e. the id of the record, the datetime, the userid (maybe IP address, browser version, etc.), just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage occurs, etc.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll it up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never wish you didn't have it available down the road, and it is impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
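A minimal sketch of that idea (the procedure, table, and column names are hypothetical):
CREATE PROCEDURE dbo.GetRecord
    @Id int,
    @UserId int
AS
BEGIN
    SET NOCOUNT ON;

    -- log the access in the same round trip
    INSERT INTO dbo.RecordAccessLog (RecordId, AccessedAt, UserId)
    VALUES (@Id, GETDATE(), @UserId);

    -- return the data
    SELECT *
    FROM dbo.Records
    WHERE Id = @Id;
END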
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
So here you can use a very good database feature called auditing; it is easy and puts less stress on the database.
Find more info: Here or From Here
Or search for database auditing for SELECT statements.
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1 = 'somevalue';

INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue'  # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY syntax is right off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead; a sketch is below.
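A rough sketch of the MSSQL MERGE equivalent, reusing the example table and column names above:
MERGE countTable AS target
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
      ON target.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET times = target.times + 1      -- already counted before: increment
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);   -- first time this id is selected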

SQL 2005/2008 Database track field change

I do not want Auditing or History tracking.
I have an application that pulls its data from a external source.
We mirror the external tables in our own DB. We allow our users to update the data in our DB.
Next time the system syncs with the external data, we only override fields that we have not changed locally.
Off the top of my head I can think of 2 ways to do this:
1) Store 2 rows for each object. The first is the external version; the 2nd row links to the external version but will only have data in a field if that field has been changed.
e.g.
id | parentId | field1  | field2
 1 | null     | foo     | bar
 2 | 1        | new foo | null
This illustrates what the data would look like when a local user changed field1.
If no change occurred there would only be the first row.
2) Double the number of columns in the table.
e.g
name_external
name_internal
I like option 1 better, as it seems like it would provide better separation and make it easier to query and to do in-code comparisons between the 2 objects. The only downside is that it will result in more rows, but the DB will be pretty small.
Are there any other patterns I could use? Or a reason I shouldn't go with option 1?
I will be using .NET 4 WCF services
Solution
I will go with the two-table answer provided below, and use the following SQL to return a row that has the fields which have changed locally merged with the original values:
SELECT
    a.[ID],
    isnull(b.[firstName], a.[firstName]) AS [firstName],
    isnull(b.[lastName], a.[lastName]) AS [lastName],
    isnull(b.[dob], a.[dob]) AS [dob],
    isnull(b.active, a.active) AS active
FROM tbl1 a
LEFT JOIN tbl2 b ON a.[ID] = b.[ID]
In my case the DB will only ever be updated via the UI system, and I can ensure people are not allowed to enter NULL as a value; instead I force them to enter a blank string. This lets me get around the issue of what happens if the user updates a value to NULL.
There are two issues I can think of if you choose option 1.
Allowing users to update the data means you will either have to write procedures to perform the insert/update/delete statements for them, managing the double row structure, or you have to train all the users to update the table correctly.
The other problem is modelling fields in your table which can be NULL. If you are using NULL to represent that a field has not changed, how can you represent a field changing to NULL?
The second option of doubling the number of columns avoids the complexity of updates and allows you to store NULL values, but you may see performance decrease due to the increased row size and therefore the amount of data the server has to move around (without testing it I realise this claim is unsubstantiated, but I thought it worth mentioning).
Another suggestion would be to duplicate the tables, perhaps putting them in another schema, which hold a snapshot of the data just after the sync with the external data source is performed.
I know you are not wanting a full audit, this would be a copy of the data at a point in time. You can then allow users to modify the data as they wish, without the complexity of the double row/column, and when you come to re-sync with the external source you can find out which fields have changed.
I found a great article: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION! describing such a comparison.
Consider having an update trigger that stamps an UpdatedBy field. We have a dedicated user for our imports that no other process is allowed to use, and the field records whether a change was made by the import. So if anything other than user -1 was the last to update a row, you can easily tell it was changed locally. Then you can use the MERGE statement to insert/delete/update.
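A hedged sketch of such a trigger (table and column names are hypothetical; it stamps the login name, whereas the scheme above compares a numeric user id, and it assumes RECURSIVE_TRIGGERS is off, which is the default, so the trigger's own UPDATE does not re-fire it):
CREATE TRIGGER tr_Customer_Stamp
ON dbo.Customer
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- record who last touched each updated row
    UPDATE c
    SET UpdatedBy = SUSER_SNAME()
    FROM dbo.Customer AS c
    JOIN inserted AS i ON c.ID = i.ID;
END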

Advice on how to scale and improve execution times of a "pivot-based query" on a billion rows table, increasing one million a day

Our company is developing an internal project to parse text files.
Those text files are composed of metadata which is extracted using regular expressions.
Ten computers are 24/7 parsing the text files and feeding a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.
The simplified database schema looks like this:
Items
| Id | Name |
|----|--------|
| 1 | Sample |
Items_Attributes
| ItemId | AttributeId |
|--------|-------------|
| 1 | 1 |
| 1 | 2 |
Attributes
| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1 | 1 | 500mB |
| 2 | 2 | 1.0.0 |
AttributeTypes
| Id | Name |
|----|---------|
| 1 | Size |
| 2 | Version |
There are many distinct text file types with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.
Items_Attributes allows us to avoid duplicate Attribute values, which keeps the database size from blowing up.
This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter which internal structure they have.
Additionally, this allows us to filter the data and to obtain dynamic reports based on user criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-sql query
SELECT FROM Items WHERE Size = #A AND Version = #B
would return a pivoted table like this
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
The application has been running for months and performance has decreased terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week.
Everything is properly indexed and we have spent considerable time analyzing and optimizing query execution plans.
So my question is: how would you scale this in order to decrease report execution times?
We came up with these possible solutions:
Buy more hardware and set up a SQL Server cluster. (we need advice on the proper "clustering" strategy)
Use a key/value database like HBase (we don't really know if it would solve our problem)
Use an ODBMS rather than an RDBMS (we have been considering db4o)
Move our software to the cloud (we have zero experience)
Statically generate reports at runtime. (we don't really want to)
Static indexed views for common reports (performance is almost the same)
De-normalize the schema (some of our reports involve up to 50 tables in a single query)
Perhaps this white paper by the SQL Server CAT team on the pitfalls of the Entity-Attribute-Value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx
I'd start by posting the exact table metadata (along with indexing details), the exact query text, and the execution plan.
With your current table layout, a query similar to this:
SELECT FROM Items WHERE Size = #A AND Version = #B
cannot benefit from using a composite index on (Size, Version), since it's impossible to build such an index.
You cannot even build an indexed view, since it would contain a self-join on attributes.
Probably the best decision would be to denormalize the table like this:
Items (id, name, size, version)
and create an index on (size, version).
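A rough sketch of that denormalized layout (column types and sizes are assumptions; the filter values are taken from the sample data in the question):
CREATE TABLE ItemsFlat (
    Id      int           NOT NULL PRIMARY KEY,
    Name    nvarchar(255) NOT NULL,
    Size    nvarchar(50)  NULL,
    Version nvarchar(50)  NULL
);

CREATE INDEX IX_ItemsFlat_Size_Version ON ItemsFlat (Size, Version);

-- the report query can then seek on the composite index
SELECT Id, Name, Size, Version
FROM ItemsFlat
WHERE Size = '500mB' AND Version = '1.0.0';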
I have worked with such schemas a lot. They never perform well.
The best thing is to just store the data as you need it, in the form:
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
Then you don't need to pivot. And by the way, please do not call your original EAV schema "normalized" - it is not normalized.
This looks to me like issuing OLAP queries against a database optimized for OLTP transactions. Not knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (if possible), denormalization, and having a database which is a day old or so. You would incrementally update the data each day, or at any interval you wish.
Please post the exact DDL and indexes. If you have indexes on the ID columns, then your query will result in a scan.
instead of something like this
SELECT FROM Items WHERE Size = #A AND Version = #B
you need to do this
SELECT FROM Items WHERE ID = 1
In other words, you need to grab the text values, find the ids that you are indexing on, and then use those ids in your query to return the results.
It is probably also a good idea to look at partition functions to distribute your data.
Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course there is also active-active clustering, but that is another story.
A short term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.
That's assuming you are not trying to report across all ItemIds at once.
You mention 50 tables in a single query. Whilst SQL server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.
If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables.
This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too.
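For example, a rough sketch against the schema posted in the question (the attribute type ids and values come from the sample data; the real report will obviously differ):
-- materialise the most selective step first
SELECT ia.ItemId
INTO #SizeMatches
FROM Items_Attributes ia
JOIN Attributes a ON a.Id = ia.AttributeId
WHERE a.AttributeTypeId = 1      -- Size
  AND a.Value = '500mB';

-- then join the small intermediate result to the rest
SELECT i.Name AS ItemName
FROM #SizeMatches s
JOIN Items i             ON i.Id = s.ItemId
JOIN Items_Attributes ia ON ia.ItemId = i.Id
JOIN Attributes a        ON a.Id = ia.AttributeId
WHERE a.AttributeTypeId = 2      -- Version
  AND a.Value = '1.0.0';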
Also (a bit of a long shot, this) you don't say which SQL server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL server is patched to at least SP2.
I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

LINQ to SQL object versioning

I'm trying to create a LINQ to SQL class that represents the "latest" version of itself.
Right now, the table that this entity represents has a single auto-incrementing ID, and I was thinking that I would add a version number to the primary key. I've never done anything like this, so I'm not sure how to proceed. I would like to be able to abstract the idea of the object's version away from whoever is using it. In other words, you have an instance of this entity that represents the most current version, and whenever any changes are submitted, a new copy of the object is stored with an incremented version number.
How should I proceed with this?
If you can avoid keeping a history, do. It's a pain.
If a complete history is unavoidable (regulated financial and medical data or the like), consider adding history tables. Use a trigger to 'version' into the history tables. That way, you're not dependent on your application to ensure a version is recorded - all inserts/updates/deletes are captured regardless of the source.
If your app needs to interact with historical data, make sure it's readonly. There's no sense capturing transaction histories if someone can simply change them.
If your concern is concurrent updates, consider using a record change timestamp. When both User A and User B view a record at noon, they fetch the record's timestamp. When User A updates the record, her timestamp matches the record's so the update goes through and the timestamp is updated as well. When User B updates the record five minutes later, his timestamp doesn't match the record's so he's warned that the record has changed since he last viewed it. Maybe it's automatically reloaded...
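If concurrent updates are the concern, SQL Server's rowversion column type gives you this check almost for free; a rough sketch (the table, column, and parameter names are hypothetical, and the parameters stand for values your app supplies):
-- the column changes automatically on every insert/update
ALTER TABLE dbo.Widget ADD RowVer rowversion;

-- optimistic update: only succeeds if nobody changed the row since it was read
UPDATE dbo.Widget
SET Name = @NewName
WHERE Id = @Id
  AND RowVer = @RowVerWhenRead;
-- if @@ROWCOUNT = 0, warn the user that the record changed since they viewed it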
Whatever you decide, I would avoid inter-mingling current and historic data.
Trigger resources per comments:
MSDN
A SQL Team Introduction
Stackoverflow's Jon Galloway describes a general data-change logging trigger
The keys to an auditing trigger are the virtual tables 'inserted' and 'deleted'. These tables contain the rows affected by an INSERT, UPDATE, or DELETE. You can use them to audit changes. Something like:
CREATE TRIGGER tr_TheTrigger
ON [YourTable]
FOR INSERT, UPDATE, DELETE
AS
    IF EXISTS (SELECT * FROM inserted)
    BEGIN
        -- this is an insert or update
        -- your actual action will vary, but something like this
        INSERT INTO [YourTable_Audit]
        SELECT * FROM inserted
    END

    IF EXISTS (SELECT * FROM deleted)
    BEGIN
        -- this is a delete; mark [YourTable_Audit] as required
    END
GO
The best way to proceed is to stop and seriously rethink your approach.
If you are going to keep different versions of the "object" around, then you are better off serializing it into an xml format and storing that in an XML column with a field for the version number.
There are serious considerations when trying to maintain versioned data in sql server revolving around application maintenance.
UPDATE per comment:
Those considerations include: the inability to remove a field or change the data type of a field in future "versions". New fields are required to be nullable or, at the very least, to have a default value stored in the DB for them. As such, you will not be able to use them in a unique index or as part of the primary key.
In short, the only thing your application can do is expand, provided the expansion can be ignored by previous layers of code.
This is the classic problem of backwards compatibility, which desktop software makers have struggled with for years, and it is the reason you might want to stay away from this approach.
