Multiple instances of a service processing rows from a single table: what's the best way to prevent collisions?

We are trying to build a system with multiple instances of a service on different machines that share the load of processing.
Each of these will check a table; if there are rows to be processed on that table, it will pick the first, mark it as processing, process it, then mark it done. Rinse, repeat.
What is the best way to prevent a race condition where two instances A and B do the following:
A (1) reads the table, finds row 1 to process,
B (1) reads the table, finds row 1 to process,
A (2) marks row 1 as processing,
B (2) marks row 1 as processing.
In a single app we could use locks or mutexes.
I can just put A(1) and A(2) in a single transaction; is it that simple, or is there a better, faster way to do this?
Should I just turn it on its head so that the steps are:
A (1) Mark the next row as mine to process
A (2) Return it to me for processing.
I figure this has to have been solved many times before, so I'm looking for the "standard" solutions, and if there are more than one, benefits and disadvantages.

Transactions are a nice simple answer, with two possible drawbacks:
1) You might want to check the fine print of your database. Sometimes the default consistency settings don't guarantee absolute consistency in every possible circumstance.
2) Sometimes the pattern of accesses associated with using a database to queue and distribute work is hard on a database that isn't expecting it.
One possibility is to look at reliable message queuing systems, which seem to be a pretty good match for what you are looking for - worker machines would just read work from a shared queue. Possible jumping-off points are http://en.wikipedia.org/wiki/Message_queue and http://docs.oracle.com/cd/B10500_01/appdev.920/a96587/qintro.htm
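If you do stay with the database-as-queue approach, the "turn it on its head" version can often be done in a single statement on databases that support skipping locked rows. Here is a minimal sketch in PostgreSQL-flavoured SQL; the table and column names are assumptions, not from the question:

-- claim the next pending row atomically and return it;
-- SKIP LOCKED makes other workers pass over rows another instance is claiming
UPDATE work_queue
SET status = 'processing', claimed_by = 'instance-A'
WHERE id = (
    SELECT id
    FROM work_queue
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

On databases without SKIP LOCKED, a similar effect can be approximated with an UPDATE that checks the status in its WHERE clause and inspects the affected-row count to see whether the claim succeeded.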

Related

Are there best practices for deduplicating records when using auto-ingest Snowpipes?

Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is around how best to safely perform this MERGE without missing any records? At the moment, we are performing a single data extraction job per-day so there is normally a point where the Snowpipe queue is empty which we use as an indicator that it is safe to proceed, however we are looking to move to more frequent extractions where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link); however, we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table (the swap statement itself is sketched just below). The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At this point the MERGE and TRUNCATE would work off the processing table and the landing table would continue to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
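For reference, the table swap in the third option is a single atomic DDL statement in Snowflake (table names assumed):

-- atomically exchange the two tables' contents and metadata
ALTER TABLE landing SWAP WITH processing;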
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your merge statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where in some circumstances, we were missing some rows. Our task was set to 1min intervals, which may be partly the reason. However I never did get to the end of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
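A minimal sketch of that stream-plus-task setup; the table, stream, task, warehouse and column names are all assumed for illustration:

-- append-only stream on the Snowpipe landing table
CREATE OR REPLACE STREAM landing_stream ON TABLE landing APPEND_ONLY = TRUE;

-- task that merges new rows into the final table whenever the stream has data;
-- consuming the stream inside the MERGE advances its offset
CREATE OR REPLACE TASK merge_landing
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM')
AS
  MERGE INTO final_table f
  USING landing_stream s ON f.id = s.id
  WHEN MATCHED THEN UPDATE SET f.payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

ALTER TASK merge_landing RESUME;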

How to get all results from the result set in the Pentaho Kettle Table input step?

I have a simple transformation consisting of 2 steps. Step 1 (Table input) runs a query against the DB and step 2 (Java class) processes the results. Step 2 takes a long time (which is normal in my case), but after 1 hour I get an error about a closed result set:
Server has closed the connection. If result set contain huge amount of data, Server expects client to read off the result set relatively fast. In this case, please consider increasing net_wait_timeout session variable. / processing your result set faster (check Streaming result sets documentation for more information)
2017/10/02 13:12:06 - Getting of data cells .0 -
I think there should be some intermediate step (or some other option) to fetch all the results from step 1 relatively fast. Could you help me with that?
I guess your step 2 is locking the same table as the one in step 1.
That's one of the drawbacks of the otherwise efficient architecture of PDI. All the steps start up at the same time, and the quickest to produce results hand rows over to the next steps. With this strategy of "do the quickest first", you sometimes beat the SQL optimizer itself when there are lots of joins on sums or averages (pro rata).
The main pitfall in this respect is to read a table, make some transformation and rewrite the result to the same table with the "Truncate table" option checked. In that case, the truncate is done a few milliseconds before the select of the Table input step, which starts an infinite deadlock. After a long time you decide to kill the ETL, but by then the data has been lost.
Solutions:
The best practice is to rewrite step 2 using PDI steps rather than a ready-made Java class. That is what I strongly recommend in the long run, but you may have a reason not to follow it.
If your table is small, you can put a blocking step between the input and output.
If your table is big, you can use a Sort rows step instead of the blocking step. You do not really want to sort, but PDI needs to look at the last row to be sure the sort is complete before giving results to the next step. The sort will cut the data into temporary chunks on the hard disk, and you have a certain amount of control over where and how the tmp data is stored.
You can copy your table into a tmp table (or file), process it, and delete it afterwards. Use a job to do that, because in a job, unlike in a transformation, the steps run sequentially.
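For the last option, the copy itself can be a plain SQL job entry; a minimal sketch, with table names assumed:

-- snapshot the source so the long-running transformation reads from a table
-- nothing else is writing to; drop the snapshot in a later job entry
CREATE TABLE tmp_source AS SELECT * FROM source_table;
-- ... run the transformation against tmp_source ...
DROP TABLE tmp_source;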

PostgreSQL increments a sequence even if the insert fails due to a unique constraint - is this normal behavior? [duplicate]

I have to create unique IDs for invoices. I have a table with an id and another column for this unique number. I use the serializable isolation level. Using
var seq = @"SELECT invoice_serial + 1 FROM invoice WHERE ""type""=@type ORDER BY invoice_serial DESC LIMIT 1";
doesn't help, because even using FOR UPDATE it won't read the correct value under the serializable isolation level.
The only solution seems to be to add some retry code.
Sequences do not generate gap-free sets of numbers, and there's really no way of making them do that because a rollback or error will "use" the sequence number.
I wrote up an article on this a while ago. It's directed at Oracle but is really about the fundamental principles of gap-free numbers, and I think the same applies here.
Well, it’s happened again. Someone has asked how to implement a requirement to generate a gap-free series of numbers and a swarm of nay-sayers have descended on them to say (and here I paraphrase slightly) that this will kill system performance, that it’s rarely a valid requirement, that whoever wrote the requirement is an idiot blah blah blah.
As I point out on the thread, it is sometimes a genuine legal requirement to generate gap-free series of numbers. Invoice numbers for the 2,000,000+ organisations in the UK that are VAT (sales tax) registered have such a requirement, and the reason for this is rather obvious: that it makes it more difficult to hide the generation of revenue from tax authorities. I’ve seen comments that it is a requirement in Spain and Portugal, and I’d not be surprised if it was not a requirement in many other countries.
So, if we accept that it is a valid requirement, under what circumstances are gap-free series* of numbers a problem? Group-think would often have you believe that it always is, but in fact it is only a potential problem under very particular circumstances.
The series of numbers must have no gaps.
Multiple processes create the entities to which the number is associated (eg. invoices).
The numbers must be generated at the time that the entity is created.
If all of these requirements must be met then you have a point of serialisation in your application, and we’ll discuss that in a moment.
First let’s talk about methods of implementing a series-of-numbers requirement if you can drop any one of those requirements.
If your series of numbers can have gaps (and you have multiple processes requiring instant generation of the number) then use an Oracle Sequence object. They are very high performance and the situations in which gaps can be expected have been very well discussed. It is not too challenging to minimise the amount of numbers skipped by making design efforts to minimise the chance of a process failure between generation of the number and committing the transaction, if that is important.
If you do not have multiple processes creating the entities (and you need a gap-free series of numbers that must be instantly generated), as might be the case with the batch generation of invoices, then you already have a point of serialisation. That in itself may not be a problem, and may be an efficient way of performing the required operation. Generating the gap-free numbers is rather trivial in this case. You can read the current maximum value and apply an incrementing value to every entity with a number of techniques. For example if you are inserting a new batch of invoices into your invoice table from a temporary working table you might:
insert into
  invoices
  ( invoice#,
    ... )
with curr as (
  select Coalesce(Max(invoice#), 0) max_invoice#
  from   invoices)
select
  curr.max_invoice# + rownum,
  ...
from
  curr,
  tmp_invoice
  ...
Of course you would protect your process so that only one instance can run at a time (probably with DBMS_Lock if you're using Oracle), and protect the invoice# with a unique key constraint, and probably check for missing values with separate code if you really, really care.
If you do not need instant generation of the numbers (but you need them gap-free and multiple processes generate the entities) then you can allow the entities to be generated and the transaction commited, and then leave generation of the number to a single batch job. An update on the entity table, or an insert into a separate table.
So what if we need the trifecta of instant generation of a gap-free series of numbers by multiple processes? All we can do is try to minimise the period of serialisation in the process, and I offer the following advice, and welcome any additional advice (or counter-advice of course).
Store your current values in a dedicated table. DO NOT use a sequence.
Ensure that all processes use the same code to generate new numbers by encapsulating it in a function or procedure.
Serialise access to the number generator with DBMS_Lock, making sure that each series has its own dedicated lock.
Hold the lock in the series generator until your entity-creation transaction is complete, releasing the lock on commit.
Delay the generation of the number until the last possible moment.
Consider the impact of an unexpected error after generating the number and before the commit is completed: will the application roll back gracefully and release the lock, or will it hold the lock on the series generator until the session disconnects later? Whatever method is used, if the transaction fails then the series number(s) must be “returned to the pool”.
Can you encapsulate the whole thing in a trigger on the entity’s table? Can you encapsulate it in a table or other API call that inserts the row and commits the insert automatically?
Original article
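For the PostgreSQL question above, the "dedicated table, not a sequence" advice translates into letting a row lock on a counter row provide the serialisation. A minimal sketch, with table and column names assumed:

CREATE TABLE invoice_counter (
    type       text PRIMARY KEY,
    last_value bigint NOT NULL
);

-- inside the same transaction that inserts the invoice:
-- the UPDATE takes a row lock, so concurrent transactions for the same type
-- queue up behind it; if the transaction rolls back, the increment is rolled
-- back too, so no number is ever skipped
UPDATE invoice_counter
SET last_value = last_value + 1
WHERE type = 'standard'
RETURNING last_value;

The trade-off is exactly the serialisation point described in the article: only one invoice of a given type can be in flight at a time.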
You could create a sequence with no cache, then get the next value from the sequence and use that as your counter.
CREATE SEQUENCE invoice_serial_seq START 101 CACHE 1;
SELECT nextval('invoice_serial_seq');
More info here
You either lock the table against inserts, and/or need to have retry code. There's no other option available. If you stop to think about what can happen with:
parallel processes rolling back
locks timing out
you'll see why.
In 2006, someone posted a gapless-sequence solution to the PostgreSQL mailing list: http://www.postgresql.org/message-id/44E376F6.7010802#seaworthysys.com

Splitting Trigger in SQL Server

Greetings,
Recently I've started to work on an application where 8 different modules use the same table at some point in the workflow. This table has an INSTEAD OF trigger which is 5,000 lines long (the first 500 and last 500 lines are common to all modules, and then each module has its own 500 lines of code).
Since the number of modules is going to grow and I want to keep things as clear (and separate) as possible, I was wondering: is there some sort of best practice for splitting a trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR1 fashion, to be able to pass scalar parameters to the stored procedures, or,
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate (see the sketch after the footnotes below).
I'd expect either approach to impose some2 performance overhead, just from the copying.
That being said, it sounds like too much is happening inside the triggers themselves - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests or Service Broker, say) in which to place information on work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send appropriate messages to dedicated endpoints for each of your modules, as appropriate.
1 Row By Agonizing Row - using either a cursor or something else to simulate one to access each row in turn - usually frowned upon in a set-based language like SQL.
2 How much is impossible to know without getting into the specifics of your code and probably trying all possible approaches and measuring the result.
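A minimal sketch of the second option, using a user-defined table type so the copied rows can be passed to one procedure per module; every object name here is hypothetical:

CREATE TYPE dbo.ChangedRows AS TABLE (Id int, Amount money);
GO
CREATE PROCEDURE dbo.ModuleA_Process @rows dbo.ChangedRows READONLY
AS
BEGIN
    -- module A's 500 lines would live here instead of in the trigger
    DECLARE @n int = (SELECT COUNT(*) FROM @rows);
END;
GO
CREATE TRIGGER trg_Orders ON dbo.Orders
INSTEAD OF INSERT
AS
BEGIN
    DECLARE @rows dbo.ChangedRows;
    INSERT INTO @rows (Id, Amount)
    SELECT Id, Amount FROM inserted;      -- copy the pseudo-table once

    INSERT INTO dbo.Orders (Id, Amount)   -- common part: perform the actual insert
    SELECT Id, Amount FROM @rows;

    EXEC dbo.ModuleA_Process @rows;       -- hand off to each module's own procedure
    -- EXEC dbo.ModuleB_Process @rows;    -- ...one call per module
END;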
I don't think there is a meaningful performance penalty in this case.
Anyway, it is bad practice to write it all inside the trigger (when it is 5,000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it into several stored procedures.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!
