Question
I often see it stated that rules should be avoided and triggers used instead. I can see the danger in the rule system, but certainly there are valid uses for rules, right? What are they?
I'm asking this out of general interest; I'm not very seasoned with databases.
Example of what might be a valid use
For instance, in the past I've needed to lock down certain data, so I've done something like this:
CREATE OR REPLACE RULE protect_data AS
ON UPDATE TO exampletable -- another similar rule for DELETE
WHERE OLD.type = 'protected'
DO INSTEAD NOTHING;
Then if I want to edit the protected data:
START TRANSACTION;
ALTER TABLE exampletable DISABLE RULE protect_data;
-- edit data as I like
ALTER TABLE exampletable ENABLE RULE protect_data;
COMMIT;
I agree this is hacky, but I couldn't change the application(s) accessing the database in this case (or even throw errors at it). So bonus points for finding a reason why this is a dangerous/invalid use of the rule system, but not for why this is bad design.
One of the use cases for RULEs is updatable views (although that changed in 9.1, which introduced INSTEAD OF triggers for views)
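As a minimal sketch of that pre-9.1 pattern (the table, view, and column names here are made up for illustration), a view can be made insertable with a rule that rewrites the insert onto the base table:

```sql
-- Hypothetical schema: expose "person" through a writable view.
CREATE TABLE person (id serial PRIMARY KEY, name text);
CREATE VIEW person_view AS SELECT id, name FROM person;

-- Rewrite inserts on the view into inserts on the base table.
CREATE RULE person_view_insert AS
ON INSERT TO person_view
DO INSTEAD INSERT INTO person (name) VALUES (NEW.name);
```

On 9.1 and later, an INSTEAD OF trigger on the view does the same job with per-row semantics, which is usually easier to reason about.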
Another good explanation can be found in the manual:
For the things that can be implemented by both, which is best depends on the usage of the database. A trigger is fired for any affected row once. A rule manipulates the query or generates an additional query. So if many rows are affected in one statement, a rule issuing one extra command is likely to be faster than a trigger that is called for every single row and must execute its operations many times. However, the trigger approach is conceptually far simpler than the rule approach, and is easier for novices to get right.
(Taken from: http://www.postgresql.org/docs/current/static/rules-triggers.html)
Some problems with rules are shown here: http://www.depesz.com/index.php/2010/06/15/to-rule-or-not-to-rule-that-is-the-question/ (for instance, if a random() is included in a query, it might get executed twice and return different values).
The biggest drawback of rules is that people don't understand them.
For example, one might think that having the rule:
CREATE OR REPLACE RULE protect_data AS
ON UPDATE TO exampletable -- another similar rule for DELETE
WHERE OLD.type = 'protected'
DO INSTEAD NOTHING;
will mean that if I issue:
update exampletable set whatever = whatever + 1 where type = 'protected'
no query will be run. That is not true. The query will be run, just in a modified version, with an added condition.
What's more, rules break a very useful feature: the RETURNING clause:
$ update exampletable set whatever = whatever + 1 where type = 'normal' returning *;
ERROR: cannot perform UPDATE RETURNING on relation "exampletable"
HINT: You need an unconditional ON UPDATE DO INSTEAD rule with a RETURNING clause.
To wrap up: if you really, really, positively have to use writable views, and you're using pre-9.1 PostgreSQL, you might have a valid reason to use rules.
In all other cases you're most likely shooting yourself in the foot, even if you don't see it immediately.
I've had a few bitter experiences with rules when dealing with volatile functions (if memory serves, depesz' blog post highlights some of them).
I've also broken referential integrity when using them because of the timing in which fkey triggers get fired:
CREATE OR REPLACE RULE protected_example AS
ON DELETE TO example
WHERE OLD.protected
DO INSTEAD NOTHING;
... then add another table, and make example reference that table with an ON DELETE CASCADE foreign key. Then delete everything from that table... and recoil in horror.
I reported the above issue as a bug, and it was dismissed as a feature/necessary edge case. Only months later did I make sense of why that might be: the fkey trigger does its job, the rule then kicks in and does its own, but the fkey trigger won't check that its job was done properly, for performance reasons.
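A sketch of the setup described above, with hypothetical table names (the exact behavior depends on your Postgres version; this mirrors the account in the answer, not a verified reproduction):

```sql
CREATE TABLE parent (id int PRIMARY KEY);

CREATE TABLE example (
    id        int PRIMARY KEY,
    protected boolean NOT NULL DEFAULT false,
    parent_id int REFERENCES parent (id) ON DELETE CASCADE
);

CREATE RULE protected_example AS
ON DELETE TO example
WHERE OLD.protected
DO INSTEAD NOTHING;

-- Per the account above: deleting from parent cascades into example,
-- but the rule suppresses the delete for protected rows, leaving them
-- referencing parents that no longer exist.
DELETE FROM parent;
```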
The practical use case where I still use rules is when a BEFORE trigger that pre-manipulates data (which the SQL standard says is not allowed, but Postgres will happily oblige) ends up modifying the affected rows and thus changing their ctid (i.e. a row gets updated twice, or doesn't get deleted because an update invalidated the delete).
This results in Postgres returning an incorrect number of affected rows, which is no big deal until you monitor that number before issuing subsequent statements.
In this case, I've found that a strategically placed rule or two can pre-emptively execute the offending statement(s), so that Postgres reports the correct number of affected rows.
How about this: You have a table that needs to be changed into a view. In order to support legacy applications that insert into said table, a rule is created that maps "inserts" to the new view to the underlying table(s).
Related
Is there a way in a INSTEAD OF trigger to just cause the default action?
For instance write a query like this:
BEGIN
    IF <some rule>
    BEGIN
        ROLLBACK
        EXIT
    END

    /* Long, boring, DDL-dependent query:
       INSERT INTO
       ...
       ... ... ... -_-'
    */

    -- A simple statement that does the job!!
    <default insertion>
END
My goal is to check some business rule and, only if the check passes, insert (or update/delete) without having to rewrite the whole statement, which would break if the table's structure changed.
Generally, no. In some vanilla cases you can get away with using * instead of listing the fields, but you probably shouldn't do that anyway.
SQL is not an OO language or environment and does not have the same level of code re-usability as they do. Consequently, there are many common cases where the right thing to do is repetitive and even redundant. In short, DRY does not always apply in SQL.
In particular, it almost never applies with respect to the impact of DDL changes. So for DDL changes, it is best to accept that the Edit Sensitivity of correct code is always going to be greater than one. You are just going to have to change things in more than one place.
According to MSDN, @@IDENTITY returns the last identity value generated for any table in the current session, across all scopes.
Has anyone come across a situation where this functionality was useful? I can't think of a situation when you would want the last ID generated for any table across all scopes or how you would use it.
UPDATE:
Not sure what all the downvotes are about, but I figured I'd try to clarify what I'm asking.
First off, I know when to use SCOPE_IDENTITY and IDENT_CURRENT. What I'm wondering is when it would be better to use @@IDENTITY as opposed to these other options. I have yet to find a place to use it in my day-to-day work, and I'm wondering if someone can describe a situation where it is the best option.
Most of the time when I see it, it is because someone doesn't understand what they were doing, but I assume Microsoft included it for a reason.
In general, it shouldn't be used. SCOPE_IDENTITY() is far safer to use (as long as we're talking about single-row inserts, as highlighted in a comment above), except in the following scenario, where @@IDENTITY is one approach that can be used (SCOPE_IDENTITY() cannot in this case):
your code knowingly fires a trigger
the trigger inserts into a table with an identity column
the calling code needs the identity value generated inside the trigger
the calling code can guarantee that the @@IDENTITY value generated within the trigger will never be changed to reflect a different table (e.g. someone adds logging to the trigger after the insert you were relying on)
This is an odd use case, but feasible.
Keep in mind this won't work if you are inserting multiple rows and need multiple IDENTITY values back. There are other ways, but they also require the option to allow result sets from triggers, and IIRC this option is being deprecated.
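A sketch of the trigger scenario described above (all table, column, and trigger names are hypothetical):

```sql
-- Two tables with separate identity ranges, so the two values differ visibly.
CREATE TABLE orders    (id int IDENTITY(1,1)    PRIMARY KEY, amount money);
CREATE TABLE audit_log (id int IDENTITY(1000,1) PRIMARY KEY, order_id int);
GO
-- The trigger inserts into a second identity table behind the caller's back.
CREATE TRIGGER trg_orders ON orders AFTER INSERT AS
    INSERT INTO audit_log (order_id) SELECT id FROM inserted;
GO
INSERT INTO orders (amount) VALUES (9.99);
SELECT SCOPE_IDENTITY();  -- identity generated in orders (current scope)
SELECT @@IDENTITY;        -- identity generated in audit_log, inside the trigger
```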
@@IDENTITY is fantastic for identifying individual rows based on an ID column.

ID | Name | Age
---+------+----
 1 | AA   | 20
 2 | AB   | 30

etc...

In this case the ID column would be reliant on the @@IDENTITY property.
Is there a way to keep 2 sequences synchronized in Postgres?
I mean if I have:
table_A_id_seq = 1
table_B_id_seq = 1
if I execute SELECT nextval('table_A_id_seq'::regclass)
I want table_B_id_seq to take the same value as table_A_id_seq,
and obviously it must work the same way in the other direction.
I need 2 different sequences because I have to hack some constraints I have in Django (and that I cannot solve there).
The two tables must be related in some way? I would encapsulate that relationship in a lookup table containing the sequence and then replace the two tables you expect to be handling with views that use the lookup table.
Just use one sequence for both tables. You can't keep them in sync unless you sync them again over and over. Sequences are not transaction safe; they always roll forward, never backward, not even on ROLLBACK.
Edit: a single sequence is also not going to work; it doesn't give you the same number for both tables. Use a subquery to get the correct number and use just a single sequence for a single table. The other table has to use the subquery.
My first thought when seeing this is: why do you really want to do this? This smells a little spoiled, kind of like milk a few days past its date.
What is the scenario that requires that these two seq stay at the same value?
Ignoring the "this seems a bit odd" feelings I'm getting in my stomach you could try this:
Put a trigger on table_a that does this on insert.
-- set b's sequence to the current value of a's.
select setval('table_b_seq', currval('table_a_seq'));
The problem with this approach is that it assumes only an insert into table_a will change the table_a_seq value, and nothing else will increment table_a_seq. If you can live with that, this may work in a really hackish fashion that I wouldn't release to production if it were my call.
If you really need this, make it more robust by creating a single interface to increment table_a_seq, such as a function, and only allow manipulation of table_a_seq via this function. That way there is one interface to increment table_a_seq, and you should also put
select setval('table_b_seq', currval('table_a_seq')); into that function. That way, no matter what, table_b_seq will always be set equal to table_a_seq. That means revoking the users' grants on table_a_seq and granting them only EXECUTE on the new function.
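A minimal sketch of that single-interface function (sequence names taken from the question, lowercased as Postgres stores them; the concurrency and rollback caveats above still apply):

```sql
-- Advance table_a_id_seq and mirror the new value into table_b_id_seq.
-- setval() returns the value it set, so callers get the new id back.
CREATE OR REPLACE FUNCTION next_shared_id() RETURNS bigint AS $$
    SELECT setval('table_b_id_seq', nextval('table_a_id_seq'));
$$ LANGUAGE sql;

-- Usage: SELECT next_shared_id();
```

Revoke USAGE on both sequences from application roles and grant EXECUTE on this function instead, so nothing can advance one sequence without the other.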
You could put an INSERT trigger on Table_A that executes some code that increases Table_B's sequence. Now, every time you insert a new row into Table_A, it will fire off that trigger.
I am creating a self-related table:
Table Item columns:
ItemId int - PK;
Amount money - not null;
Price money - a computed column using a UDF that retrieves a value according to the item's ancestors' Amount.
ParentItemId int - nullable, reference to another ItemId in this table.
I need to avoid a loop; that is, an item cannot become an ancestor of its own ancestors. For example, if ItemId = 2 has ParentItemId = 1, then setting ItemId 1's ParentItemId to 2 shouldn't be allowed.
I don't know what should be the best practice in this situation.
I think I should add a CHECK constraint that gets a scalar value from a UDF, or something along those lines.
EDIT:
Another option is to create an INSTEAD OF trigger that, in one transaction, updates the ParentItemId field and selects the Price field from the @@RowIdentity; if that fails, cancel the transaction. But I would prefer a validating UDF.
Any ideas are sincerely welcomed.
Does this definitely need to be enforced at the database level?
I'm only asking as I have databases like this (where the table similar to this is like a folder) and I only make sure that the correct parent/child relationships are set up in the application.
Checks like this are not easy to implement, and the possible solutions can cause a lot of bugs and problems that may be worse than the original one. Usually it is enough to validate the user's input and to guard against infinite loops when reading the data.
If your application uses stored procedures and no ORM, then I would choose to implement this logic in a stored procedure. Otherwise, handle it in another layer, not in the DB.
How big a problem is this in real life? It can be expensive to detect these situations (using a trigger, perhaps). In fact, it's likely going to cost you a lot of effort on each transaction, when only a tiny subset of all your transactions would ever cause this problem.
Think about it first.
A simple trick is to force the ParentItemId to be less than the ItemId. This prevents loop closure in this simple context.
However, there's a downside: if you need for some reason to delete and re-insert a parent, you may need to delete and re-insert all of its children, in order, as well.
Equally, hierarchies need to be inserted in order, and you may not be able to reassign a parent.
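That trick can be enforced with a plain CHECK constraint (table and column names taken from the question; the constraint name is made up):

```sql
-- A parent must always have a smaller id than its children, which makes
-- a cycle impossible: following parents strictly decreases the id.
ALTER TABLE Item ADD CONSTRAINT CK_Item_ParentBeforeChild
    CHECK (ParentItemId IS NULL OR ParentItemId < ItemId);
```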
Tested and works just great:
CREATE TRIGGER Item_UPDATE
ON Item
FOR INSERT, UPDATE
AS
BEGIN
    BEGIN TRY
        -- Evaluating the computed Price column forces its UDF to walk the
        -- ancestor chain; a cycle makes that evaluation fail.
        SELECT Price FROM INSERTED;
    END TRY
    BEGIN CATCH
        RAISERROR('This item cannot be specified with this parent.', 16, 1);
        ROLLBACK TRANSACTION;
    END CATCH
END
GO
A few months back, I started using a CRUD script generator for SQL Server. The default insert statement that this generator produces SELECTs the inserted row at the end of the stored procedure. It does the same for the UPDATE too.
The previous way (and the only other way I have seen online) is to just return the newly inserted Id back to the business object, and then have the business object update the Id of the record.
Having an extra SELECT is obviously an additional database call, and more data is being returned to the application. However, it allows additional flexibility within the stored procedure, and allows the application to reflect the actual data in the table.
The additional SELECT also increases the complexity when wanting to wrap the insert/update statements in a transaction.
I am wondering what people think is better way to do it, and I don't mean the implementation of either method. Just which is better, return just the Id, or return the whole row?
We always return the whole row on both an Insert and Update. We always want to make sure our client apps have a fresh copy of the row that was just inserted or updated. Since triggers and other processes might modify values in columns outside of the actual insert/update statement, and since the client usually needs the new primary key value (assuming it was auto generated), we've found it's best to return the whole row.
The SELECT statement will have some sort of advantage only if the data is generated in the procedure. Otherwise, the data you have inserted is generally available to you already, so there's no point selecting and returning it again, IMHO. If it's just the id you need, you can get it with SCOPE_IDENTITY(), which returns the last identity value created in the current scope.
Based on my prior experience, my knee-jerk reaction is to just return the freshly generated identity value. Everything else the application is inserting, it already knows--names, dollars, whatever. But a few minutes reflection and reading the prior 6 (hmm, make that 5) replies, leads to a number of “it depends” situations:
At the most basic level, what you inserted is what you’d get – you pass in values, they get written to a row in the table, and you’re done.
Slightly more complex than that is when there are simple default values assigned during an insert statement. “DateCreated” columns that default to the current datetime, or “CreatedBy” that default to the current SQL login, are a prime example. I’d include identity columns here, since not every table will (or should) contain them. These values are generated by the database upon table insertion, so the calling application cannot know what they are. (It is not unknown for web server clocks to not be synchronized with database server clocks. Fun times…) If the application needs to know the values just generated, then yes, you’d need to pass those back.
And then there are situations where additional processing is done within the database before data is inserted into the table. Such work might be done within stored procedures or triggers. Once again, if the application needs to know the results of such calculations, then the data would need to be returned.
With that said, it seems to me the main issue underlying your decision is: how much control/understanding do you have over the database? You say you are using a tool to automatically generate your CRUD procedures. Ok, that means that you do not have any elaborate processing going on within them, you’re just taking data and loading it on in. Next question: are there triggers (of any kind) present that might modify the data as it is being written to the tables? Extend that to: do you know whether or not such triggers exist? If they’re there and they matter, plan accordingly; if you do not or cannot know, then you might need to “follow up” on the insert to see if changes occurred. Lastly: does the application care? Does it need to be informed of the results of the insert action it just requested, and if so, how much does it need to know? (New identity value, date time it was added, whether or not something changed the Name from “Widget” to “Widget_201001270901”.)
If you have complete understanding and control over the system you are building, I would only put in as much as you need, as extra code that performs no useful function impacts performance and maintainability. On the flip side, if I were writing a tool to be used by others, I’d try to build something that did everything (so as to increase my market share). And if you are building code where you don't really know how and why it will be used (application purpose), or what it will in turn be working with (database design), then I guess you'd have to be paranoid and try to program for everything. (I strongly recommend not doing that. Pare down to do only what needs to be done.)
Quite often the database will have a way to give you the ID of the last inserted item without having to do an additional select. For example, MS SQL Server has the @@IDENTITY function (see here). You can pass this back to your application as an output parameter of your stored procedure and use it to update your data with the new ID. MySQL has something similar (LAST_INSERT_ID()).
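A sketch of that output-parameter pattern in T-SQL (the procedure and table names here are made up; SCOPE_IDENTITY() is used rather than @@IDENTITY, since it is not affected by inserts that triggers perform):

```sql
-- Hypothetical table and procedure illustrating the output-parameter pattern.
CREATE PROCEDURE InsertWidget
    @name   nvarchar(50),
    @new_id int OUTPUT
AS
BEGIN
    INSERT INTO widgets (name) VALUES (@name);
    -- Hand the generated identity back to the caller without a second round trip.
    SET @new_id = SCOPE_IDENTITY();
END
```

The application then reads @new_id from the output parameter and patches it onto the business object, instead of re-selecting the row.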
SQL Server's OUTPUT clause lets the statement itself return the inserted row:
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
With this clause, returning the whole row does not require an extra SELECT, and performance-wise it is the same as returning only the id.
"Which is better" totally depends on your application needs. If you need the whole row, return the whole row, if you need only the id, return only the id.
You may add an extra setting to your business object which can trigger this option and return the whole row only if the object needs it:
IF @return_whole_row = 1
    INSERT
    INTO mytable (col1, col2)
    OUTPUT INSERTED.*
    VALUES ('value1', 'value2')
ELSE
    INSERT
    INTO mytable (col1, col2)
    OUTPUT INSERTED.id
    VALUES ('value1', 'value2')
I don't think I would in general return an entire row, but it could be a useful technique.
If you are code-generating, you could generate two procs (one which calls the other, perhaps) or parametrize a single proc to determine whether to return it over the wire or not. I doubt the DB overhead is significant (single-row, got to have a PK lookup), but the data on the wire from DB to client could be significant when all added up, and if it's just discarded in 99% of the cases, I see little value. Having an SP which returns different things with different parameters is a potential problem for clients, of course.
I can see where it would be useful if you have logic in triggers or calculated columns which are managed by the database, in which case, a SELECT is really the only way to get that data back without duplicating the logic in your client or the SP itself. Of course, the place to put any logic should be well thought out.
Putting ANY logic in the database is usually a carefully-thought-out tradeoff which starts with the minimally invasive and maximally useful things like constraints, unique constraints, referential integrity, etc and growing to the more invasive and marginally useful tools like triggers.
Typically, I like logic in the database when you have multi-modal access to the database itself, and you can't force people through your client assemblies, say. In this case, I would still try to force people through views or SPs which minimize the chance of errors, duplication, logic sync issues or misinterpretation of data, thereby providing as clean, consistent and coherent a perimeter as possible.