Postgresql: keep 2 sequences synchronized - database

Is there a way to keep 2 sequences synchronized in Postgres?
I mean if I have:
table_A_id_seq = 1
table_B_id_seq = 1
if I execute SELECT nextval('table_A_id_seq'::regclass)
I want that table_B_id_seq takes the same value of table_A_id_seq
and obviously it must be the same on the other side.
I need 2 different sequences because I have to hack some constraints I have in Django (and that I cannot solve there).

The two tables must be related in some way? I would encapsulate that relationship in a lookup table containing the sequence and then replace the two tables you expect to be handling with views that use the lookup table.

Just use one sequence for both tables. You can't keep them in sync unless you always sync them again and over again. Sequences are not transaction safe, they always roll forwards, never backwards, not even by ROLLBACK.
Edit: one sequence is also not going to work, doesn't give you the same number for both tables. Use a subquery to get the correct number and use just a single sequence for a single table. The other table has to use the subquery.

My first thought when seeing this is why do you really want to do this? This smells a little spoiled, kinda like milk does after being a few days expired.
What is the scenario that requires that these two seq stay at the same value?
Ignoring the "this seems a bit odd" feelings I'm getting in my stomach you could try this:
Put a trigger on table_a that does this on insert.
--set b seq to the value of a.
select setval('table_b_seq',currval('table_a_seq'));
The problem with this approach is that is assumes only a insert into table_a will change the table_a_seq value and nothing else will be incrementing table_a_seq. If you can live with that this may work in a really hackish fashion that I wouldn't release to production if it was my call.
If you really need this, to make it more robust make a single interface to increment table_a_seq such as a function. And only allow manipulation of table_a_seq via this function. That way there is one interface to increment table_a_seq and you should also put
select setval('table_b_seq',currval('table_a_seq')); into that function. That way no matter what, table_b_seq will always be set to be equal to table_a_seq. That means removing any grants to the users to table_a_seq and only granting them execute grant on the new function.

You could put an INSERT trigger on Table_A that executes some code that increases Table_B's sequence. Now, every time you insert a new row into Table_A, it will fire off that trigger.

Related

Way to persist function result as a constant

I needed to create a function today which will always return the exact same value on the specific database it's executed on. It may / may not be the same across databases which is why it has to be able to load it from a table the first time it's required.
CREATE FUNCTION [dbo].[PAGECODEGET] ()
RETURNS nvarchar(6)
AS
BEGIN
DECLARE #PageCode nvarchar(6) = ( SELECT PCO_IDENTITY FROM PAGECODES WHERE PCO_PAGE = 'SWTE' AND PCO_TAB = 'RECORD' )
RETURN #PageCode
END
The PCO_IDENTITY field is a sql identity field, so once the record is inserted for the first time, it's always going to return the same result thereafter.
My question is, is there any way to persist this value to something equivalent to a C# readonly variable?
From a perfomance point of view I know sql will optimise the plan etc, but from a best practice point of view I'm thinking there may possibly be a better way of doing it.
We use a mix of SQL Servers, but the lowest is 2008 R2 in case there's a version specific solution.
I'm afraid there's no such thing as a global variable like you suggest in SQL Server.
As you've pointed out, the function will potentially return different results on another database, depending on a variety of factors, such as when the row was inserted, what other values exist in the table already etc. - basically, the PCO_IDENTITY value for this row cannot be relied upon to be consistent.
A few observations:
I don't see how getting this value occasionally is really going to be a performance bottleneck. I don't think best practices cover this, as selecting a value from a table is as basic as you can get.
If this is part of another larger query, you will probably get better performance by using a join to the PAGECODES table directly, rather than potentially running this function for every row
However, if you are really worried:
There are objects in the database which are persistant - tables. When you first insert this value, retrieve the PCO_IDENTITY value, and create a new table with just that in, that you may join to in your queries. Seems a bit of a waste for one value, doesn't it? (Note you could also make a view, but how would that be any better performing than the function you started with?)
You could force these values into a row with a specific PCO_IDENTITY value, using IDENTITY_INSERT. That way the value is consistent, and you know what it is - you could hard code it in your queries. (NB: Turn IDENTITY_INSERT off again afterwards, and other rows inserted into this table will continue to be automatically generated again)
TL;DR: How you are doing it is probably fine. I suspect you are trying to optimise something that isn't a problem. As always - if in doubt, try out a few approaches and measure.

What are PostgreSQL RULEs good for?

Question
I often see it stated that rules should be avoided and triggers used instead. I can see the danger in the rule system, but certainly there are valid uses for rules, right? What are they?
I'm asking this out of general interest; I'm not very seasoned with databases.
Example of what might be a valid use
For instance, in the past I've needed to lock down certain data, so I've done something like this:
CREATE OR REPLACE RULE protect_data AS
ON UPDATE TO exampletable -- another similar rule for DELETE
WHERE OLD.type = 'protected'
DO INSTEAD NOTHING;
Then if I want to edit the protected data:
START TRANSACTION;
ALTER TABLE exampletable DISABLE RULE protect_data;
-- edit data as I like
ALTER TABLE exampletable ENABLE RULE protect_data;
COMMIT;
I agree this is hacky, but I couldn't change the application(s) accessing the database in this case (or even throw errors at it). So bonus points for finding a reason why this is a dangerous/invalid use of the rule system, but not for why this is bad design.
One of the use cases for RULES are updateable views (although that changes in 9.1 as that version introduces INSTEAD OF triggers for views)
Another good explanation can be found in the manual:
For the things that can be implemented by both, which is best depends on the usage of the database. A trigger is fired for any affected row once. A rule manipulates the query or generates an additional query. So if many rows are affected in one statement, a rule issuing one extra command is likely to be faster than a trigger that is called for every single row and must execute its operations many times. However, the trigger approach is conceptually far simpler than the rule approach, and is easier for novices to get right.
(Taken from: http://www.postgresql.org/docs/current/static/rules-triggers.html)
Some problems with rules are shown here: http://www.depesz.com/index.php/2010/06/15/to-rule-or-not-to-rule-that-is-the-question/ (for instance, if a random() is included in a query, it might get executed twice and return different values).
The biggest drawback of rules is that people don't understand them.
For example, one might think that having the rule:
CREATE OR REPLACE RULE protect_data AS
ON UPDATE TO exampletable -- another similar rule for DELETE
WHERE OLD.type = 'protected'
DO INSTEAD NOTHING;
Will mean that if I'll issue:
update exampletable set whatever = whatever + 1 where type = 'protected'
No query will be ran. Which is not true. The query will be run, but it will be ran in modified version - with added condition.
What's more - rules break very useful thing, that is returning clause:
$ update exampletable set whatever = whatever + 1 where type = 'normal' returning *;
ERROR: cannot perform UPDATE RETURNING on relation "exampletable"
HINT: You need an unconditional ON UPDATE DO INSTEAD rule with a RETURNING clause.
To wrap it - if you really, really, positively have to use writable views, and you're using pre 9.1 PostgreSQL - you might have a valid reason to use rules.
In all other cases - you're most likely shooting yourself in a foot, even if you don't immediately see it.
I've had a few bitter experiences with rules when dealing with volatile functions (if memory serves, depesz' blog post highlights some of them).
I've also broken referential integrity when using them because of the timing in which fkey triggers get fired:
CREATE OR REPLACE RULE protected_example AS
ON DELETE TO example
WHERE OLD.protected
DO INSTEAD NOTHING;
... then add another table, and make example reference that table with an on delete cascade foreign key. Then, delete * from that table... and recoil in horror.
I reported the above issue as a bug, which got dismissed as a feature/necessary edge case. It's only months later that I made sense of why that might be, i.e. the fkey trigger does its job, and the rule then kicks in and does its own, but the fkey trigger won't check that its job was done properly for performance reasons.
The practical use-case where I still use rules is when a BEFORE trigger that pre-manipulates data (the SQL standard says is not allowed, but Postgres will happily oblige) can result in pre-manipulating the affected rows and thus changing their ctid (i.e. it gets updated twice, or doesn't get deleted because an update invalidated the delete).
This results in Postgres returning an incorrect number of affected rows, which is no big deal until you monitor that number before issuing subsequent statements.
In this case, I've found that using a strategically placed rule or two can allow to pre-emptively execute the offending statement(s), resulting in Postgres returning the correct number of affected rows.
How about this: You have a table that needs to be changed into a view. In order to support legacy applications that insert into said table, a rule is created that maps "inserts" to the new view to the underlying table(s).

How to avoid circular relationship in SQL-Server?

I am creating a self-related table:
Table Item columns:
ItemId int - PK;
Amount money - not null;
Price money - a computed column using a UDF that retrieves value according to the items ancestors' Amount.
ParentItemId int - nullable, reference to another ItemId in this table.
I need to avoid a loop, meaning, a sibling cannot become an ancestor of his ancestors, meaning, if ItemId=2 ParentItemId = 1, then ItemId 1 ParentItemId = 2 shouldn't be allowed.
I don't know what should be the best practice in this situation.
I think I should add a CK that gets a Scalar value from a UDF or whatever else.
EDIT:
Another option is to create an INSTEAD OF trigger and put in 1 transaction the update of the ParentItemId field and selecting the Price field from the ##RowIdentity, if it fails cancel transaction, but I would prefer a UDF validating.
Any ideas are sincerely welcomed.
Does this definitely need to be enforced at the database level?
I'm only asking as I have databases like this (where the table similar to this is like a folder) and I only make sure that the correct parent/child relationships are set up in the application.
Checks like this is not easy to implement, and possible solutions could cause a lot of bugs and problems may be harder then initial one. Usually it is enough to add control for user's input and prevent infinite loop on read data.
If your application uses stored procedures, no ORM, than I would choose to implement this logic in SP. Otherwise - handle it in other layers, not in DB
How big of a problem is this, in real life? It can be expensive to detect these situations (using a trigger, perhaps). In fact, it's likely going to cost you a lot of effort, on each transaction, when only a tiny subset of all your transactions would ever cause this problem.
Think about it first.
A simple trick is to force the ParentItemId to be less than the ItemId. This prevents loop closure in this simple context.
However, there's a down side - if you need for some reason to delete/insert a parent, you may need to delete/insert all of its children in order as well.
Equally, hierarchies need to be inserted in order, and you may not be able to reassign a parent.
Tested and works just great:
CREATE TRIGGER Item_UPDATE
ON Item
FOR INSERT, UPDATE
AS
BEGIN
BEGIN TRY
SELECT Price FROM INSERTED
END TRY
BEGIN CATCH
RAISERROR('This item cannot be specified with this parent.', 16, 1)
ROLLBACK TRANSACTION;
END CATCH
END
GO

What should be returned when inserting into SQL?

A few months back, I started using a CRUD script generator for SQL Server. The default insert statement that this generator produces, SELECTs the inserted row at the end of the stored procedure. It does the same for the UPDATE too.
The previous way (and the only other way I have seen online) is to just return the newly inserted Id back to the business object, and then have the business object update the Id of the record.
Having an extra SELECT is obviously an additional database call, and more data is being returned to the application. However, it allows additional flexibility within the stored procedure, and allows the application to reflect the actual data in the table.
The additional SELECT also increases the complexity when wanting to wrap the insert/update statements in a transaction.
I am wondering what people think is better way to do it, and I don't mean the implementation of either method. Just which is better, return just the Id, or return the whole row?
We always return the whole row on both an Insert and Update. We always want to make sure our client apps have a fresh copy of the row that was just inserted or updated. Since triggers and other processes might modify values in columns outside of the actual insert/update statement, and since the client usually needs the new primary key value (assuming it was auto generated), we've found it's best to return the whole row.
The select statement will have some sort of an advantage only if the data is generated in the procedure. Otherwise the data that you have inserted is generally available to you already so no point in selecting and returning again, IMHO. if its for the id then you can have it with SCOPE_IDENTITY(), that will return the last identity value created in the current session for the insert.
Based on my prior experience, my knee-jerk reaction is to just return the freshly generated identity value. Everything else the application is inserting, it already knows--names, dollars, whatever. But a few minutes reflection and reading the prior 6 (hmm, make that 5) replies, leads to a number of “it depends” situations:
At the most basic level, what you inserted is what you’d get – you pass in values, they get written to a row in the table, and you’re done.
Slightly more complex that that is when there are simple default values assigned during an insert statement. “DateCreated” columns that default to the current datetime, or “CreatedBy” that default to the current SQL login, are a prime example. I’d include identity columns here, since not every table will (or should) contain them. These values are generated by the database upon table insertion, so the calling application cannot know what they are. (It is not unknown for web server clocks to not be synchronized with database server clocks. Fun times…) If the application needs to know the values just generated, then yes, you’d need to pass those back.
And then there are are situations where additional processing is done within the database before data is inserted into the table. Such work might be done within stored procedures or triggers. Once again, if the application needs to know the results of such calculations, then the data would need to be returned.
With that said, it seems to me the main issue underlying your decision is: how much control/understanding do you have over the database? You say you are using a tool to automatically generate your CRUD procedures. Ok, that means that you do not have any elaborate processing going on within them, you’re just taking data and loading it on in. Next question: are there triggers (of any kind) present that might modify the data as it is being written to the tables? Extend that to: do you know whether or not such triggers exists? If they’re there and they matter, plan accordingly; if you do not or cannot know, then you might need to “follow up” on the insert to see if changes occurred. Lastly: does the application care? Does it need to be informed of the results of the insert action it just requested, and if so, how much does it need to know? (New identity value, date time it was added, whether or not something changed the Name from “Widget” to “Widget_201001270901”.)
If you have complete understanding and control over the system you are building, I would only put in as much as you need, as extra code that performs no useful function impacts performance and maintainability. On the flip side, if I were writing a tool to be used by others, I’d try to build something that did everything (so as to increase my market share). And if you are building code where you don't really know how and why it will be used (application purpose), or what it will in turn be working with (database design), then I guess you'd have to be paranoid and try to program for everything. (I strongly recommend not doing that. Pare down to do only what needs to be done.)
Quite often the database will have a property that gives you the ID of the last inserted item without having to do an additional select. For example, MS SQL Server has the ##Identity property (see here). You can pass this back to your application as an output parameter of your stored procedure and use it to update your data with the new ID. MySQL has something similar.
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
With this clause, returning the whole row does not require an extra SELECT and performance-wise is the same as returning only the id.
"Which is better" totally depends on your application needs. If you need the whole row, return the whole row, if you need only the id, return only the id.
You may add an extra setting to your business object which can trigger this option and return the whole row only if the object needs it:
IF #return_whole_row = 1
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
ELSE
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.id
VALUES ('value1', 'value2')
FI
I don't think I would in general return an entire row, but it could be a useful technique.
If you are code-generating, you could generate two procs (one which calls the other, perhaps) or parametrize a single proc to determine whther to return it over the wire or not. I doubt the DB overhead is significant (single-row, got to have a PK lookup), but the data on the wire from DB to client could be significant when all added up and if it's just discarded in 99% of the cases, I see little value. Having an SP which returns different things with different parameters is a potential problem for clients, of course.
I can see where it would be useful if you have logic in triggers or calculated columns which are managed by the database, in which case, a SELECT is really the only way to get that data back without duplicating the logic in your client or the SP itself. Of course, the place to put any logic should be well thought out.
Putting ANY logic in the database is usually a carefully-thought-out tradeoff which starts with the minimally invasive and maximally useful things like constraints, unique constraints, referential integrity, etc and growing to the more invasive and marginally useful tools like triggers.
Typically, I like logic in the database when you have multi-modal access to the database itself, and you can't force people through your client assemblies, say. In this case, I would still try to force people through views or SPs which minimize the chance of errors, duplication, logic sync issues or misinterpretation of data, thereby providing as clean, consistent and coherent a perimeter as possible.

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to see they're in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .net app that loads the source records, then runs through and searches on a like (using a tsql function) to determine if there are any records. If yes, the source table is updated with a positive. If not, the record is left alone.
the app processes about 1000 records an hour. When I did this as a cursor sproc on sql server, I pretty much got the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE source_table s SET some_field = true WHERE EXISTS
(
SELECT target_join_field FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one. you can use a trigger to extract the data into the column at the time it is inserted.
If you have data you can match on you need neither like '%somestuff%' which can't use indexes or a cursor both of which are performance killers. This should bea set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using t-SQl and I would attempt the regular expression route. Not knowing how many different prharses and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance especially when you start taking about that size of record. Learn to think in sets not record by record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query becomes cancelled) will be huge. This will drastically slow down your query. As such it is probably better to proces them in blocks of 1,000 rows or such like.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern) This would simplify the query condition to (b.field like '%' + a.field)
Still, an index wouldn't help here either though as the important characters are at the end. So, bizarrely, it could well be worth while storing the characters of both tables in reverse order. The index on you temporary table would then come in to use.
It may seem very wastefull to spent that much time, but in this case a small benefit would yield a greate reward. (A few hours work to halve the comparisons from 750billion to 375billion, for example. And if you can get the index in to play you could reduce this a thousand fold thanks to index being tree searches, not just ordered tables...)
Second Idea
Assuming you do copy the target table into a temp table, you may benefit extra from processing them in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table. Such that after all 750,000 records have been checked, the target table is now [for example] half the size that it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + #source_field + '%'
IF (##row_count = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
Try this --
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation, or a regular task that needs to be completed regularly.
Not knowing all the details of you problem, if it was me, at this was a one-time (or infrequent operation, which it sounds like it is), I'd probably extract out just the pertinent fields from the two tables including the primary key from the source table and export them down to a local machine as text files. The files sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like 'C'/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into the sql server and use it as a basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of you sql, you may be trying to do 750,000 table scans against a multi-million records table.
Tell us more about the problem.
Holy smoke, what great responses!
system is on disconnected network, so I can't copy paste, but here's the retype
Current UDF:
Create function CountInTrim
(#caseno varchar255)
returns int
as
Begin
declare #reccount int
select #reccount = count(recId) from targettable where title like '%' + #caseNo +'%'
return #reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, developers of either system are no longer available, and while I have some sql experience, I am by no means an expert.
I parsed the case numbers from the crazy way the old system had to make the source table, and that's the only thing in common with the new system, the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.

Resources