Flink: Run sinks separately

I am consuming data from Kafka and need to write the stream both to a local file and to a port that Flume listens on.
The following code runs as expected:
streamSource.writeToSocket("192.168.95.11", 9158, new SimpleStringSchema());
streamSource.writeAsText("/tmp/flink_output.txt").setParallelism(1);
However, when the port is closed, the whole Flink job stops. Is there any way to make the two sinks run independently, so that when the port is closed, writeToSocket keeps retrying while writeAsText runs as usual?

For that to happen while the same source feeds both sinks, the data read from the source that should have been written to the socket would have to be either buffered somewhere capable of absorbing an unbounded amount of data, or dropped.
Or you could separate the two pipelines and make them completely independent of one another. This could be done by having two separate jobs, or one job with fully decoupled pipelines, in which case a failure of one will not restart the other. See failover strategies from the Flink 1.9 release notes and pipelined region failover strategy.
You would need to do something like this:
source1.writeToSocket(...)
source2.writeAsText(...)
Here source1 and source2 are two separate Kafka consumers, reading independently from the same topic(s).
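The buffer-or-drop trade-off mentioned at the start of this answer can be illustrated without Flink. The following is a minimal plain-Java sketch, not Flink API: DropOldestBuffer is a hypothetical helper that holds records while the socket is unreachable and, because its capacity is bounded, discards the oldest record once it is full.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Flink): a bounded buffer for records
// that could not be delivered to the socket yet.
class DropOldestBuffer {
    private final ArrayDeque<String> pending = new ArrayDeque<>();
    private final int capacity;
    private long dropped = 0;

    DropOldestBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Called for every record that arrives while the socket is unavailable.
    void add(String record) {
        if (pending.size() == capacity) {
            pending.removeFirst(); // buffer is full: give up the oldest record
            dropped++;
        }
        pending.addLast(record);
    }

    // Called once the socket is reachable again: returns the surviving
    // records in arrival order and empties the buffer.
    List<String> drain() {
        List<String> out = new ArrayList<>(pending);
        pending.clear();
        return out;
    }

    long droppedCount() {
        return dropped;
    }
}
```

With an unbounded buffer nothing is dropped, but memory grows for as long as the port stays closed, which is why fully separating the two pipelines is usually the simpler option.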

Related

Need advice on migrating from Flink DataStream Job to Flink Stateful Functions 3.1

I have a working Flink job built on the Flink DataStream API. I want to REWRITE the entire job on top of Flink Stateful Functions 3.1.
The functions of my current Flink job are:
Read messages from Kafka
Each message is a slice of a data packet, e.g. (s for slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges slices into data packets and then sinks the packets to HBase
Window functions are applied to handle out-of-order slice arrival
My Objectives
I already have a Flink Stateful Functions demo running on my k8s cluster. I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the doc and got some ideas. My plans are:
There's no need to deal with Kafka directly anymore; the Kafka Ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it
Rewrite my job based on the Java SDK. Merging is straightforward, but what about window functions?
Maybe I should use persistent state with TTL to mimic window function behaviors
Egress for MinIO is not in the list of default Flink I/O Connectors, therefore I need to write my custom Flink I/O Connector for MinIO myself, according to https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the Embedded module because it prevents scaling. Auto scaling is the key reason why I want to migrate to Flink Stateful Functions
My Questions
I don't feel confident with my plan. Is there anything wrong with my understandings/plan?
Are there any best practices I should refer to?
Update:
Windows were used to assemble results:
get a slice, inspect its metadata, and learn that it is the last one of the packet
also learn from the metadata that the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g. 10 minutes) and then either merge or record a packet error.
I want to get rid of windows during the rewrite, but I don't know how

Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer:
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)
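The steps above can be sketched in plain Java to show the control flow. This is a simulation under stated assumptions, not StateFun SDK code: PacketAssembler and its method names are hypothetical, the in-memory map stands in for persisted state scoped to the function id, and timeoutsRequested records where a real function would send itself a delayed message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the per-packet logic (not StateFun API).
class PacketAssembler {
    // Slices received so far, per packet id, ordered by slice index.
    private final Map<String, TreeMap<Integer, String>> slices = new HashMap<>();
    // Packet ids for which a timeout (delayed message) would be scheduled.
    final List<String> timeoutsRequested = new ArrayList<>();
    // Packets sent downstream, as "id:status:mergedPayload".
    final List<String> emitted = new ArrayList<>();

    void onSlice(String packetId, int index, String payload, int expectedCount) {
        TreeMap<Integer, String> received = slices.get(packetId);
        if (received == null) {
            // First slice of this packet: start assembling and request a timeout.
            received = new TreeMap<>();
            slices.put(packetId, received);
            timeoutsRequested.add(packetId);
        }
        received.put(index, payload);
        if (received.size() == expectedCount) {
            emit(packetId, "complete");
        }
    }

    // Invoked when the delayed message for this packet arrives.
    void onTimeout(String packetId) {
        if (slices.containsKey(packetId)) {
            emit(packetId, "partial"); // still incomplete: send what we have
        }
        // Otherwise the packet was already completed; ignore the timeout.
    }

    private void emit(String packetId, String status) {
        String merged = String.join("", slices.remove(packetId).values());
        emitted.add(packetId + ":" + status + ":" + merged);
    }
}
```

One simplification to be aware of: a slice that arrives after its packet has been emitted would start a new assembly here, so a real implementation would need to decide how to treat such late slices.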

Flink StreamSink and Checkpoint Understanding

I have written a job with 5 different sources and sinks in a single application. I am writing the data in Parquet format using a stream sink. The Parquet sink writes data on checkpoint. If one of the sources receives malformed records, I get an exception in the sink.
That causes all of my consumers to stop, and the other sinks cannot write any data either.
Example:
source1 (kafka) --- sink1 (s3)
source2 (kafka) --- sink2 (s3)
source3 (kafka) --- sink3 (s3)
I need to understand why the failure of one sink causes all the consumers to stop, so that no data is written to S3. Can somebody please help me understand this, or am I missing something?

The application needs to fail; otherwise ordering and consistency guarantees can no longer hold. This is completely independent of checkpointing.
If just one task fails, all other tasks in the application need to fail as well, because Flink cannot know which tasks are relevant.
In your case, you actually seem to have 3 independent applications. So you have three options:
If they should fail together, you put them all in the same StreamExecutionEnvironment as you have done.
If all applications should run independently, you need to start the job 3 times with different parameters. The three deployments can then be restarted independently.
If you would still like to deploy only once, you could create 3 StreamExecutionEnvironments and let them run in parallel in different threads. The main thread should then join on these threads.
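The threading part of the third option can be sketched in plain Java. This is only an illustration under assumptions: IndependentPipelines and runAll are hypothetical names, and each Callable is a stand-in for code that builds its own StreamExecutionEnvironment and calls execute() on it; no Flink classes are used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical launcher: one thread per pipeline, joined by the caller.
class IndependentPipelines {
    // Runs every pipeline in its own thread, waits for all of them, and
    // reports per pipeline whether it finished or failed.
    static Map<String, String> runAll(Map<String, Callable<Void>> pipelines) {
        Map<String, String> outcomes = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (Map.Entry<String, Callable<Void>> entry : pipelines.entrySet()) {
            Thread t = new Thread(() -> {
                try {
                    entry.getValue().call(); // stand-in for env.execute(...)
                    outcomes.put(entry.getKey(), "finished");
                } catch (Exception ex) {
                    // A failure stays confined to this pipeline's thread.
                    outcomes.put(entry.getKey(), "failed: " + ex.getMessage());
                }
            }, entry.getKey());
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // the main thread waits for every pipeline
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
        }
        return outcomes;
    }
}
```

In this sketch a pipeline that throws only marks its own entry as failed; the other threads keep running until they complete.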

Data/event exchange between jobs

Is it possible in Apache Flink to create an application that consists of multiple jobs which together form a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and point me to a possible solution?
Thank you!

If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, where the upstream job generates data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.

An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, JavaScript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
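The idea of a static topology with swappable logic can be sketched in plain Java. This is a simplified stand-in, not Flink's broadcast state API: DynamicLogicOperator, onControl, onElement, and the rule names are all hypothetical. onControl plays the role of the broadcast side and onElement the role of the regular data side.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical operator whose processing logic can be replaced at runtime.
class DynamicLogicOperator {
    // Logic that can be selected by name; a real system might instead load
    // a DSL expression, a script, or a model here.
    private final Map<String, Function<String, String>> rules = new HashMap<>();
    private Function<String, String> current = Function.identity();

    DynamicLogicOperator() {
        rules.put("upper", s -> s.toUpperCase(Locale.ROOT));
        rules.put("reverse", s -> new StringBuilder(s).reverse().toString());
    }

    // Control (broadcast) side: switch the logic applied to later elements.
    void onControl(String ruleName) {
        Function<String, String> next = rules.get(ruleName);
        if (next == null) {
            throw new IllegalArgumentException("unknown rule: " + ruleName);
        }
        current = next;
    }

    // Data side: apply whatever logic is currently active.
    String onElement(String value) {
        return current.apply(value);
    }
}
```

The shape of the operator never changes; only the function held in current does, which is what allows the logic to be updated while the job keeps running.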

Does Flink support suspending a job?

I am just beginning to learn Apache Flink and have run into the following problem:
How can I suspend a Flink job and then resume it?
Does Flink support suspending a job from the command line?

Yes, you certainly can do this with Flink. You want to read about savepoints, which can be triggered from the command line or from the REST API.
Updated
Normally the goal of a stream processor is to do continuous, immediate processing of new elements as they become available. If you want to suspend processing, then I guess this might be with the goal of ignoring the source(s) for a while and dropping the arriving events, or with a desire to conserve computing resources for a time and to later resume without losing any input.
RichCoFlatMapFunction and CoProcessFunction are building blocks you might find useful. You could set up a control stream connected to a socket (for example), and when you want to "suspend" the primary stream, send an event that causes the primary stream to either start dropping its input, or do a blocking read, or sleep, for example.
Or you might think about adding your own layer of abstraction on top of jobs, and cope with the fact that the job IDs will change. Note that jobs can have names that remain unchanged across savepoints/restarts.
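The control-stream idea above (the variant that drops input while suspended) can be sketched in plain Java. This is not Flink code: SuspendableGate and the "suspend"/"resume" commands are hypothetical, with onControl and onData playing the roles of the two inputs of a connected stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-input operator: a control input toggles whether events
// from the primary input are forwarded or dropped.
class SuspendableGate {
    private boolean suspended = false;
    private long droppedCount = 0;

    // Control input: "suspend" starts dropping, "resume" stops dropping.
    void onControl(String command) {
        if ("suspend".equals(command)) {
            suspended = true;
        } else if ("resume".equals(command)) {
            suspended = false;
        } else {
            throw new IllegalArgumentException("unknown command: " + command);
        }
    }

    // Primary input: returns the events to emit downstream (zero or one).
    List<String> onData(String event) {
        List<String> out = new ArrayList<>();
        if (suspended) {
            droppedCount++; // input arriving while suspended is discarded
        } else {
            out.add(event);
        }
        return out;
    }

    long dropped() {
        return droppedCount;
    }
}
```

Note that this variant loses the events that arrive while suspended; resuming without losing input is what savepoints are for.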

Multiple BizTalk host instances writing to single file

We have four BizTalk servers in our production environment. The send port is configured to write incoming messages to one text file. This port receives thousands of messages a day, so multiple host instances try to write to the file at the same time. Before one instance finishes writing a complete record, another instance starts writing a new record, which leaves data scattered all over the file.
What can we do to resolve this issue?

...before one instance finishes writing a complete record, another instance starts writing a new record, which leaves data scattered all over the file.
What can we do to resolve this issue?
The easy way is to only use a single Host Instance to write data to the file, however you may then start to experience throttling issues. Alternatively, you could explore using the 'Allow Cache on write' option on the File Adapter which may offer some improvements.
However, I think your approach is wrong. You cannot expect four separate and totally disconnected processes (across 4 servers, no less) to reliably append to a single file - IN ORDER.
I therefore think you should look at re-architecting this solution:
As each message is received, write the contents of the message to a database table (a simple INSERT) with an 'unprocessed' flag. You can reliably have four Host Instances banging data into SQL without fear of them tripping over each other.
At a scheduled time, have BizTalk extract all of the records that are marked as unprocessed in that SQL Table (the WCF-SQL Adapter can help you here). Once you have polled the records, mark them as 'in-process'.
You should now have a single message containing all of the currently unprocessed records (as retrieved from SQL). Using a single (or multiple) Host Instance/s, write the message to disk, appending each of the records to the file in a single write. The key here is that you are only writing a single message to the one file, not lots and lots and lots :-)
If the write is successful, update each of the records in the SQL table with a 'processed' flag so they are not picked-up again on the next poll.
You might want to consider a singleton orchestration for this piece to ensure that there is only ever one poll-write-update process taking place at the same time.

If FIFO is important, BizTalk has an ordered delivery mechanism (supported by the FILE adapter), but it comes at a performance cost.
A better solution would be to let each instance write to its own file and then have another scheduled process (or orchestration) combine them into one file. You can enforce FIFO using timestamps. This would provide better performance and resource utilization than the singleton orchestration mentioned earlier. Another option may be to use any suitable implementation of a queue.
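The combine step can be illustrated with a small sketch. It is written in Java purely for illustration (a BizTalk solution would be built differently), and TimestampedRecord and PerInstanceFileMerger are hypothetical names. It assumes every record carries a timestamp and that each inner list holds the records one host instance wrote to its own file.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical record: one line plus the time at which it was written.
class TimestampedRecord {
    final long timestampMillis;
    final String line;

    TimestampedRecord(long timestampMillis, String line) {
        this.timestampMillis = timestampMillis;
        this.line = line;
    }
}

// Hypothetical merger for the scheduled combine step.
class PerInstanceFileMerger {
    // Combines the records of all per-instance files into one list of
    // lines ordered by timestamp.
    static List<String> merge(List<List<TimestampedRecord>> perInstanceFiles) {
        List<TimestampedRecord> all = new ArrayList<>();
        for (List<TimestampedRecord> file : perInstanceFiles) {
            all.addAll(file);
        }
        // List.sort is stable, so records with equal timestamps keep the
        // order in which their files were passed in.
        all.sort(Comparator.comparingLong((TimestampedRecord r) -> r.timestampMillis));
        List<String> lines = new ArrayList<>();
        for (TimestampedRecord r : all) {
            lines.add(r.line);
        }
        return lines;
    }
}
```

Because only the merger writes the combined file, whole records can never interleave, whatever the number of host instances producing input.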

You can move to a database system instead of a file. That would be a very simple solution and also very efficient.
If you don't want to go that way, you must implement file locking or a semaphore inside of your application so the new threads will wait for other threads to finish writing.