Flink Sorting A Global Window On A Bounded Stream - apache-flink

I've built a Flink application to consume data directly from Kafka, but in the event of a system failure or a need to re-process this data, I need to instead consume the data from a series of files in S3. The order in which messages are processed is very important, so I'm trying to figure out how I can sort this bounded stream before pushing these messages through my existing application.
I've tried inserting the stream into a temporary table using the Table API, but the sort operator always uses a maximum parallelism of 1 despite sorting on two keys. Can I leverage these keys somehow to increase this parallelism?
I've been thinking of using a keyed global window, but I'm not sure how to trigger on a bounded stream and sort the window. Is Flink a good choice for this kind of batch processing, and would it be a good idea to write this using the old DataSet API?
Edit
After some experimentation, I've decided that Flink isn't the correct solution and Spark is just more feature-rich for this particular use case. I'm trying to consume and sort over 1.5 TB of data in each job. Unfortunately, some of these partitions contain 100 GB or more, and everything must be in order before I can break those groups up further, which makes sorting this data in the operators difficult.
My requirements are simple: ingest the data from S3 and sort it by channel ID before flushing it to disk. Having to think about windows and timestamp assigners just complicates a relatively simple task that can be achieved in four lines of Spark code.

Have you considered using the HybridSource for your use case, since this is exactly what it was designed for? https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/
The DataSet API is deprecated, and I would not recommend using it.
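For illustration, here is a minimal sketch of what a HybridSource setup could look like, assuming Flink 1.15+ with the file and Kafka connectors on the classpath (in 1.14 the text-line format class is named TextLineFormat rather than TextLineInputFormat). The S3 path, topic, and broker address are placeholders, not taken from your setup:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded backfill source reading the historical files from S3.
        FileSource<String> fileSource = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/backfill/"))
                .build();

        // Unbounded source that takes over once the files are exhausted.
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read the file source to completion, then switch to Kafka.
        HybridSource<String> source = HybridSource.builder(fileSource)
                .addSource(kafkaSource)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-then-kafka")
                .print();

        env.execute("HybridSource sketch");
    }
}
```

The file source is read to completion first, and the Kafka source only starts once the bounded backfill is exhausted, so the downstream pipeline can stay unchanged.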

Related

can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge-triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system but only in batch mode, and I think that it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is that, in batch mode, the PCollections are all "complete" in the sense that I can always join all records in one table with all records in another table. But with streaming, once I process a record, it gets removed from its PCollection and can't be joined against future updates that arrive later on and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window where only outdated versions of items in a PCollection (e.g. versions of them which have been overwritten by a new version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.
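As a rough illustration of the streaming-join approach, the sketch below uses the Table API with two CDC-backed tables. The table names, columns, and mysql-cdc connector options are assumptions for the example, not taken from your schema:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // CDC-backed source tables: the initial snapshot bootstraps the state,
        // and the change log keeps it up to date afterwards.
        tEnv.executeSql(
                "CREATE TABLE hosts (" +
                "  host_id BIGINT," +
                "  name STRING," +
                "  PRIMARY KEY (host_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'db-host', 'port' = '3306'," +
                "  'username' = 'flink', 'password' = 'secret'," +
                "  'database-name' = 'inventory', 'table-name' = 'hosts')");

        tEnv.executeSql(
                "CREATE TABLE host_objects (" +
                "  host_id BIGINT," +
                "  object_id BIGINT," +
                "  PRIMARY KEY (host_id, object_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'db-host', 'port' = '3306'," +
                "  'username' = 'flink', 'password' = 'secret'," +
                "  'database-name' = 'inventory', 'table-name' = 'host_objects')");

        // A regular (unbounded) streaming join: both sides are materialized in
        // Flink state, and every change on either side emits new or updated results.
        tEnv.executeSql(
                "SELECT h.host_id, h.name, o.object_id " +
                "FROM hosts AS h " +
                "JOIN host_objects AS o ON h.host_id = o.host_id")
            .print();
    }
}
```

The initial CDC snapshot bootstraps the Flink state from the existing database contents, and the change stream then keeps both sides of the join current, which also covers the bootstrapping question.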

Flink - take advantage of input partitioning to avoid inter task-manager communications

We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey
I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but this can only be applied in situations where the existing partitioning exactly matches what keyBy would do -- and I see no reason to think that would apply here.
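For completeness, here is a minimal sketch of reinterpretAsKeyedStream, with a toy in-memory source standing in for a stream that is assumed to be already partitioned exactly as keyBy on the client id would partition it (a strong assumption in practice, as noted above):

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReinterpretSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a source whose partitioning already matches keyBy(clientId).
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("client-1", 10L),
                Tuple2.of("client-2", 20L),
                Tuple2.of("client-1", 5L));

        // Reinterpret the pre-partitioned stream as a KeyedStream: no network
        // shuffle is inserted, yet keyed state and keyed windows become available.
        KeyedStream<Tuple2<String, Long>, String> keyed =
                DataStreamUtils.reinterpretAsKeyedStream(
                        events,
                        new KeySelector<Tuple2<String, Long>, String>() {
                            @Override
                            public String getKey(Tuple2<String, Long> value) {
                                return value.f0;
                            }
                        });

        // Rolling per-client sum; a keyed window with a Combine-style
        // aggregation could be applied here instead.
        keyed.sum(1).print();

        env.execute("reinterpretAsKeyedStream sketch");
    }
}
```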

Is it possible to have a window per sub-task/partition

I am working with Flink using data from a Kafka topic that has multiple partitions. Is it possible to have a window on each parallel sub-task/partition without having to use keyBy (as I want to avoid the shuffle)? Based on the documentation, I can only choose between keyed windows (which require a shuffle) or global windows (which reduce parallelism to 1).
The motivation is that I want to use a CountWindow to batch the messages with a custom trigger that also fires after a set amount of processing time. So per Kafka partition, I want to batch N records together or wait X amount of processing time before sending the batch downstream.
Thanks!
There's no good way to do that.
One workaround would be to implement the batching and timeout logic in a custom sink. You'd want to implement the CheckpointedFunction interface to make your solution fault tolerant, and you could use the Sink.ProcessingTimeService.ProcessingTimeCallback interface for the timeouts.
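Here is a minimal sketch of that kind of batching sink for String records, covering the size-based flush and the checkpointed pending batch; the processing-time timeout would be registered on top of this via the processing-time service mentioned above, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class BatchingSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private final int maxBatchSize;
    private final List<String> batch = new ArrayList<>();
    private transient ListState<String> checkpointedBatch;

    public BatchingSink(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    @Override
    public void invoke(String value, Context context) {
        batch.add(value);
        if (batch.size() >= maxBatchSize) {
            flush();
        }
    }

    private void flush() {
        // Hand the batch to the external system / downstream here.
        System.out.println("flushing " + batch.size() + " records");
        batch.clear();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Persist whatever has not been flushed yet.
        checkpointedBatch.clear();
        checkpointedBatch.addAll(batch);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedBatch = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending-batch", String.class));
        if (context.isRestored()) {
            checkpointedBatch.get().forEach(batch::add);
        }
    }
}
```

You would then attach it with something like stream.addSink(new BatchingSink(500)), with the batch size matching the N you have in mind.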
UPDATE:
Just thought of another solution, similar to the one in your comment below. You could implement a custom source that sends a periodic heartbeat, and broadcast that to a BroadcastProcessFunction.

Control stream in Flink SQL

With the DataStream API, I can write a RichCoFlatMapFunction that accepts a control stream and a data stream. The control stream contains elements that start or stop the calculation or change its parameters, and I know I can store the current control settings in state and check that value when processing the data stream.
But what is the way to do a similar thing with Flink SQL?
I cannot use a join, as the data stream and the control stream cannot be joined together.
The solution we came up with is to have the application itself store the control settings.
The idea is:
Broadcast the control stream to a map operator and store the control settings in a Java singleton object in its map() method. Since the map operator runs with the default parallelism, we assume it will run on every JVM for that job, so that every JVM initializes and keeps updating the control settings in the singleton object.
With SQL, every UDAF or UDF can then access the control settings by reading the Java singleton object.
But I am not sure whether my assumption is correct and whether this is a feasible solution.
I don't think that is a good idea. SQL was not designed for such use cases. Instead, a SQL query is optimized and executed as specified; changing the behavior of a query is not intended. Besides the design perspective, it would also not perform well, because you would need to do remote state look-ups to distributed queryable state for each record that you process, which of course adds latency.
To me, your use case sounds more like an application than a SQL query. For that, the DataStream API would be the right choice. What you can do is embed SQL (or Table API) queries into an application, i.e., do the pre- and post-processing with SQL and have an operator with a control/data stream pattern in the middle.
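A minimal sketch of that pattern, assuming Flink 1.13+: SQL pre-processing, a broadcast control/data operator in the middle (using the broadcast state pattern rather than a RichCoFlatMapFunction), and room for SQL post-processing afterwards. The datagen table, the threshold setting, and all names are illustrative assumptions:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

public class ControlStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Pre-processing with SQL (datagen stands in for the real data source).
        tEnv.executeSql("CREATE TABLE measurements (id STRING, val DOUBLE) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");
        Table filtered = tEnv.sqlQuery("SELECT id, val FROM measurements");
        DataStream<Row> data = tEnv.toDataStream(filtered);

        // Control stream carrying the current threshold; in practice this would
        // come from its own Kafka topic rather than a single static element.
        DataStream<Double> control = env.fromElements(0.5);

        MapStateDescriptor<String, Double> settings =
                new MapStateDescriptor<>("settings", Types.STRING, Types.DOUBLE);
        BroadcastStream<Double> broadcastControl = control.broadcast(settings);

        // The control/data operator in the middle: drops records below the
        // most recently broadcast threshold.
        DataStream<Row> gated = data
                .connect(broadcastControl)
                .process(new BroadcastProcessFunction<Row, Double, Row>() {
                    @Override
                    public void processElement(Row row, ReadOnlyContext ctx, Collector<Row> out) throws Exception {
                        Double threshold = ctx.getBroadcastState(settings).get("threshold");
                        if (threshold == null || (double) row.getField("val") >= threshold) {
                            out.collect(row);
                        }
                    }

                    @Override
                    public void processBroadcastElement(Double newThreshold, Context ctx, Collector<Row> out) throws Exception {
                        ctx.getBroadcastState(settings).put("threshold", newThreshold);
                    }
                })
                .returns(data.getType()); // keep the row schema for further Table API use

        // Post-processing could register the gated stream again, e.g. with
        // tEnv.createTemporaryView("gated", gated), and run more SQL on it.
        gated.print();

        env.execute("Control stream sketch");
    }
}
```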

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: as for the streaming APIs, I'd only say that they're brilliant and over the top.
Batch APIs: the processing graphs are very powerful, optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, and perfectly optimising very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that read their input from HDFS.
Unfortunately, it's not a minor one, because in 90% of use cases you have big, partitioned data storage on HDFS and you usually do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the same node that holds the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks, so e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach to local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled to a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
