Say I have a Flink job processing a data flow like 1, 2, control_flag, 3...
When control_flag is met, the job should be stopped with a savepoint, and the following messages 3... should neither be processed nor dropped. When certain actions have been taken outside Flink and the job is restarted from the savepoint, it should go on processing the following messages.
However, if the job hangs in a sleeping loop inside the process operator to prevent the following messages from being processed, it cannot be stopped with a savepoint through the Flink API. So how do I stop the job at the position of control_flag and let it be restarted from the position right after it?
Some suggestions can be found here.
There are a few possible ways that it can be done, but I think since you want to keep state between the runs, the best idea would be to have an operator that:
If the flag stop_execution is false, processes data and emits it to the downstream operators.
If the flag stop_execution is true, adds the data it receives to list state.
If it receives the control_flag, emits a side output signalling that the job should be stopped.
Now it's up to you to listen to the side output; this can be an external service that reads the data from Kafka and executes the correct REST calls to stop the given job, or anything else you want.
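A minimal sketch of such an operator, assuming String events where the literal "control_flag" is the stop signal (the class name, the output tag and the state name below are purely illustrative):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class StoppableGate extends ProcessFunction<String, String>
        implements CheckpointedFunction {

    // Side output used to tell the outside world that the job should be stopped.
    public static final OutputTag<String> STOP_SIGNAL =
            new OutputTag<String>("stop-signal") {};

    private transient ListState<String> buffered;   // elements received after the flag
    private boolean stopExecution = false;          // the stop_execution flag

    @Override
    public void processElement(String value, Context ctx, Collector<String> out)
            throws Exception {
        if ("control_flag".equals(value)) {
            stopExecution = true;
            ctx.output(STOP_SIGNAL, value);   // signal that the job should be stopped
        } else if (stopExecution) {
            buffered.add(value);              // park the element in list state
        } else {
            out.collect(value);               // normal processing path
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // buffered is Flink-managed operator state, nothing extra to snapshot here
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        buffered = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-elements", String.class));
        // After a restart from the savepoint, stopExecution starts out as false again,
        // so processing resumes; draining/re-emitting the buffered elements is omitted
        // here to keep the sketch short.
    }
}

The side output stream can then be written to Kafka (or wherever the external service listens), and that service performs the stop-with-savepoint call.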
I am trying to implement a way to wait until the table kafka-connector has caught up with the latest offset.
I've already tried implementing a session gap window operator on that table (kafka-connector to a compacted topic) to wait for inactivity on the table and then collect the result locally (in an attempt to close the operator/task), but it became a global operator and the rest of my topology doesn't get executed (because this window operator creates a disjoint topology).
I have a Flink job running on a Kinesis Data Analytics application, which uses Flink's DataSet API to read data into two DataSet objects. Since I need the number of tuples in each DataSet, I call the count() method on each DataSet, but I keep seeing this error when run through my AWS Console:
org.apache.flink.api.common.InvalidProgramException: The main method caused an error: Job was submitted in detached mode. Results of job execution, such as accumulators, runtime, etc. are not available. Please make sure your program doesn't call an eager execution function [collect, print, printToErr, count].
For context, this is roughly the code that is causing the exception:
DataSet<String> dataset = executionEnvironment.readTextFile(file);
log.info("Number of records: " + dataset.count());
Is there any way to change the execution mode from detached mode to another mode that would allow calling count() and the other eager execution functions?
There is an SQL Agent Job containing a complex Integration Services Package performing some ETL Jobs. It takes between 1 and 4 hours to run, depending on our data sources.
The Job currently runs daily, without problems. What I would like to do now is to let it run in an endless loop, which means: When it's done, start over again.
The scheduler doesn't seem to provide this option. I found that it would be possible to use the steps interface to go to step one after the last step has finished, but there's a problem with that method: if I need to stop the job, I would have to do it forcefully. However, I would like to be able to let the job stop after the next iteration. How can I do that?
Thanks in advance for any help!
Since neither Martin nor Remus created an answer, here is one so the question can be accepted.
The best way is to simply set the run frequency to a very low value, like one minute. If it is already running, a second instance will not be created. If you want to stop the job after the current run, simply disable the schedule.
Thanks!
So you want that, when you decide to stop the job, it should stop after the currently running iteration - if I am getting you correctly.
You can do one thing here.
Have one configuration table which holds a boolean value.
Add one step to the job, i.e. before each iteration, check the value from the table. Only if it's true, run the ETL packages.
So, each time it finds the value true, it'll continue the endless loop.
When you want to stop the job, set that value in the table to false.
When the current job iteration completes, it'll go and read the value from your table, find it false, and the iteration will stop.
You can always set the "on success" action to go to step one, creating an endless loop, but as you said, if you want to stop the job you'll have to force it.
Other than that, a simple control table on the database with a status, and a second job that queries this table and fires your main job depending on the status. Couple of possible architectures here, just pick the one that suits you better.
You could use service broker within the database. The job you need to run can be started by queuing a 'start' message and when it finishes it can send itself a message to start again.
To pause the process you can just deactivate the queue processor.
I have a queue that stops without any apparent reason. In this queue I have implemented poison message handling, and during processing it records and discards any poison messages.
It has worked fine for more than a year without stopping. But recently (the problem began four weeks ago), it stops once or twice a week, and this week alone it has stopped twice.
And when I check the table with the new poisoned messages, there are none!! And when I enable the queue, processing resumes successfully and the 'poison message' situation does not reproduce.
About the task of the queue: it receives about 2,000-3,000 messages per day and is used to run stored procedures outside the transaction. Each message can take a while to be processed (doing a lot of selects, inserts, updates).
Let me explain this point: the database has triggers that are fired inside a transaction, and the trigger sends a message to run some code outside the trigger. The asynchronous behavior prevents degrading the performance of the database.
I have detected that even when a deadlock occurs while processing the messages, the queue treats the message as poisoned. So in principle it shouldn't be a performance problem. But could it be? Maybe the database is growing and it takes too long to process a message?
But how can I find it out, if it is not detected as poisoned?
For what other reason would a queue stop?
How can I record when, and with which message, the queue got disabled?
Does anybody have any idea how I can do any forensic analysis?
Any idea?
UPDATE EXPOSING A PSEUDO-SOLUTION:
According to Remus' post, I've tried using an event notification to get the exact moment when the queue stops.
CREATE EVENT NOTIFICATION [QueueDisabledEN]
ON QUEUE [dbo].[ProcessQueue]
FOR BROKER_QUEUE_DISABLED
TO SERVICE 'Queue Watch Service', 'current database';
And then checking the event log:
select * from sys.event_notifications
But since it is difficult to know the environment in which the event occurred (what else was running at the moment?), forensic analysis ends there. Fortunately my broker service implementation stores the messages with the date of shipment, the date of receipt, the date of processing, ... This has helped me to detect that within 3 seconds the queue is flooded with hundreds of messages that take too long to be processed.
While I look for a real solution, the only temporary workaround is an agent job that checks the status of the queue every x minutes and re-enables it:
IF (EXISTS(SELECT * FROM sys.service_queues WHERE name like 'ProcessQueue' AND (is_receive_enabled = 0 OR is_enqueue_enabled = 0))) BEGIN
    -- the queue has been disabled: log the moment and turn it back on
    PRINT convert(nvarchar, getdate(), 121) + ': Re-enabling the ProcessQueue queue'
    ALTER QUEUE ProcessQueue WITH STATUS = ON
END
Thanks Remus!
When you find the queue in a disabled state and you re-enable it, I assume that the processing resumes successfully and the 'poison message' situation does not reproduce. This would indicate that the cause is transient or time related. It could be a SQL Agent job that is running and causes deadlocks with the queue processing, forcing the queue processing to roll back. Deadlocks are in my experience the most typical poison message cause. Your best forensics tool is the system event log, as the activated procedure does output errors into the ERRORLOG and hence into the system Event Log.
Whenever a queue is disabled by the poison message trigger (5 consecutive rollbacks), an event notification of type BROKER_QUEUE_DISABLED is fired. You can capture more forensic information when handling this event, as it will run shortly after the moment the queue was disabled.
As a side note, you can never have true 'poison message handling'. Whenever you enhance the processing to handle some error cases, the definition of the 'poison message' changes to be the message capable of disabling the new error handling.
I have a server application, and a database. Multiple instances of the server can run at the same time, but all data comes from the same database (on some servers it is PostgreSQL, in other cases MS SQL Server).
In my application, there is a process that is performed which can take hours. I need to ensure that this process is only executed by one instance at a time. If one server is processing, no other server instance can process until the first one has completed.
The process depends on one table (let's call it 'ProcessTable'). What I do is, before any server starts the hour-long process, I set a boolean flag in the ProcessTable which indicates that this record is 'locked' and is being processed (not all records in this table are processed / locked, so I need to specifically mark each record which is needed by the process). So when the next server instance comes along while the previous instance is still processing, it sees the boolean flags and throws an exception.
The problem is that 2 server instances might both be activated at nearly the same time, and when both check the ProcessTable, there may not be any flags set; both servers are actually in the process of 'setting' the flags, but since the transaction hasn't yet committed for either process, neither process will see the locking done by the other process. This is because the locking mechanism itself may take a few seconds, so there is that window of opportunity where 2 servers might still be able to process at the same time.
It appears that what I need is a single record in my 'Settings' table which should store a boolean flag called 'LockInProgress'. So before even a server can lock the needed records in the ProcessTable, it first must make sure that it has full rights to do the locking by checking the 'LockInProgress' column in the Settings table.
So my question is, how do I prevent two servers from both modifying that LockInProgress column in the settings table, at the same time... or am I going about this in the wrong manner?
Please note that I need to support both PostgreSQL and MS SQL Server, as some servers use one database and some servers use the other.
Thanks in advance...
How about obtaining a lock on the record first, and then updating the record to show "locked"? This would prevent the 2nd instance from getting the lock successfully, and thereby its update of the record would fail.
The point is to make the lock and the update one atomic step.
Make a stored procedure that hands out the lock, and run it under 'serializable' isolation. This will guarantee that one and only one process can get at the resource at any given time.
Note that this means that the second process trying to get at the lock will block until the first process releases it. Also, if you have to get multiple locks in this manner, make sure that the design of the process guarantees that the locks will be acquired and released in the same order. This will avoid deadlock situations where two processes hold resources while waiting for each other to release locks.
Unless you can't deal with your other processes blocking, this would probably be easier to implement and more robust than attempting to implement 'test and set' semantics.
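For illustration only, here is a rough sketch of the same idea driven from the application over JDBC instead of a stored procedure (the Settings table and LockInProgress column are taken from the question; the class name, the method name and the 'ProcessLock' key are assumptions). Because SERIALIZABLE is set through the standard JDBC isolation constant, the same code works against both PostgreSQL and SQL Server:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LockGate {

    // Tries to acquire the global lock; returns true if this instance now owns it.
    static boolean tryAcquireLock(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        try (Statement st = conn.createStatement()) {
            // Read the flag inside the serializable transaction...
            try (ResultSet rs = st.executeQuery(
                    "SELECT LockInProgress FROM Settings WHERE SettingsKey = 'ProcessLock'")) {
                if (rs.next() && rs.getBoolean(1)) {
                    conn.rollback();   // somebody else already holds the lock
                    return false;
                }
            }
            // ...and set it in the same transaction. Two instances racing here will not
            // both succeed: one blocks or fails, and at most one commit goes through.
            // Adjust the literal to your column type (1 for a SQL Server BIT column,
            // TRUE for a PostgreSQL BOOLEAN column).
            st.executeUpdate(
                    "UPDATE Settings SET LockInProgress = 1 WHERE SettingsKey = 'ProcessLock'");
            conn.commit();
            return true;
        } catch (SQLException raceLost) {
            // Deadlock victim (SQL Server) or serialization failure (PostgreSQL):
            // treat it as "lock not acquired"; real code should inspect the error code.
            conn.rollback();
            return false;
        }
    }
}

Releasing the lock is the symmetric UPDATE back to 0/false, done in its own short transaction.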
I've been thinking about this, and I think this is the simplest way of doing things; I just execute a command like this:
update settings set settingsValue = '333' where settingsKey = 'ProcessLock' and settingsValue = '0'
'333' would be a unique value which each server process gets (based on date/time, server name, + random value etc).
If no other process has locked the table, then the settingsValue would be equal to '0', and that statement would update the settingsValue.
If another process has already locked the table, then that statement becomes a no-op, and nothing gets modified.
I then immediately commit the transaction.
Finally, I requery the table for the settingsValue, and if it is the correct value, then our lock succeeded and we continue on, otherwise an exception is thrown, etc. When we're done with the lock, we reset the value back down to 0.
Since I'm using the SERIALIZABLE transaction isolation level, I can't see this causing any issues... please correct me if I'm wrong.
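For what it's worth, a rough JDBC version of that sequence (the settings table and column names come from the question; the token generation and checking the update count instead of re-querying are assumptions on my part):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class ProcessLock {

    // Returns the lock token if the lock was acquired, or null if another instance holds it.
    static String tryAcquire(Connection conn) throws SQLException {
        String token = UUID.randomUUID().toString();   // unique value per server instance
        String sql = "UPDATE settings SET settingsValue = ? "
                   + "WHERE settingsKey = 'ProcessLock' AND settingsValue = '0'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, token);
            int rows = ps.executeUpdate();   // 1 if we won the race, 0 otherwise
            conn.commit();                   // assumes auto-commit is off
            return rows == 1 ? token : null;
        }
    }

    // Releases the lock, but only if this instance still owns it.
    static void release(Connection conn, String token) throws SQLException {
        String sql = "UPDATE settings SET settingsValue = '0' "
                   + "WHERE settingsKey = 'ProcessLock' AND settingsValue = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, token);
            ps.executeUpdate();
            conn.commit();
        }
    }
}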