Everywhere in the Flink docs I see that state is local to a map function and a worker. This seems powerful in a standalone setup, but what if Flink runs in a cluster? Can Flink handle a global state that all workers could add data to and query?
From the Flink article on state:
For high throughput and low latency in this setting, network communications among tasks must be minimized. In Flink, network communication for stream processing only happens along the logical edges in the job’s operator graph (vertically), so that the stream data can be transferred from upstream to downstream operators.
However, there is no communication between the parallel instances of an operator (horizontally). To avoid such network communication, data locality is a key principle in Flink and strongly affects how state is stored and accessed.
I think that Flink only supports operator state and keyed state. If you need some kind of global state, you have to store and retrieve the data in some kind of database, file system, or shared memory and combine that data with your stream.
Anyway, in my experience, with a good processing-pipeline design and by partitioning your data the right way, in most cases you should be able to apply divide-and-conquer algorithms or MapReduce strategies to achieve what you need.
If you introduce some kind of global state into your system, that global state can become a serious bottleneck, so try to avoid it at all costs.
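To make the keyed-state alternative concrete, here is a minimal sketch of a KeyedProcessFunction keeping a running total per key; the Event POJO and its fields are made up for the example, and the open(Configuration) signature is the pre-2.0 one, so adjust for your Flink version:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Running total per key, used as: events.keyBy(e -> e.userId).process(new RunningTotalPerKey())
public class RunningTotalPerKey
        extends KeyedProcessFunction<String, RunningTotalPerKey.Event, Long> {

    // Hypothetical event type used only for this sketch
    public static class Event {
        public String userId;
        public long amount;
    }

    private transient ValueState<Long> total;

    @Override
    public void open(Configuration parameters) {
        // Keyed state: Flink keeps one value per key, local to the subtask that owns that key
        total = getRuntimeContext().getState(
                new ValueStateDescriptor<>("total", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Long> out) throws Exception {
        Long current = total.value();
        long updated = (current == null ? 0L : current) + event.amount;
        total.update(updated);
        out.collect(updated);
    }
}
```

Because the state is scoped to the key, each partition of the data is handled independently, which is exactly the divide-and-conquer shape described above.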
I want to load a large amount of data into Flink's state backend (RocksDB) and process events using this data (for example in a CoProcessFunction). The data is much larger than the memory of a TaskManager.
The data blocks for a key can be very large, which has a negative impact on latency if the data needs to be loaded from the state backend each time. Therefore I would like to keep the data for frequent keys locally in the CoProcessFunction.
However, the total data in the state backend is larger than the memory of a TaskManager so it is not possible to keep the corresponding data block from the state backend locally for each key.
To solve this problem I would need to know the current memory usage of a SubTask to decide whether a data block for a key can be kept locally or whether something needs to be evicted. So here is my question: since keys are not explicitly assigned to a subtask, is there a way to access memory-related subtask information or custom metrics related to subtasks in a KeyedStream? Or is there another way to solve this problem? (External access via Async I/O is not an option.)
The RocksDB block cache is already doing roughly what you describe. Rather than implementing your own caching layer, you should be able to get good results by tuning RocksDB, including giving it plenty of memory to work with.
Using RocksDB State Backend in Apache Flink: When and How is a good starting point, and includes pointers to where you can learn more about the native RocksDB metrics, memory management, etc. I also recommend reading The Impact of Disks on RocksDB State Backend in Flink: A Case Study.
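As a rough sketch of the kind of tuning meant here (the exact config keys, and whether you set them in flink-conf.yaml or programmatically, depend on your Flink version and deployment), the knobs are roughly these:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use the embedded RocksDB state backend
        conf.setString("state.backend", "rocksdb");
        // Let Flink size the RocksDB block cache and write buffers from managed memory
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        // Give managed memory (and therefore the block cache) a bigger share of the TaskManager
        conf.setString("taskmanager.memory.managed.fraction", "0.6");
        // Expose a native RocksDB metric so you can watch how well the cache is doing
        conf.setString("state.backend.rocksdb.metrics.block-cache-usage", "true");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define sources, the CoProcessFunction, sinks ...
        env.execute("rocksdb-tuned-job");
    }
}
```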
As for your question about accessing subtask metrics from within your Flink job -- I don't know of any way to do that locally. You could, I suppose, implement a Flink source connector that fetches them and streams them into the job as another data source.
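If you did want to go that route, a very rough sketch would be a polling source like the one below. SourceFunction is the legacy source interface, and the metrics URL is a placeholder that you would have to point at whatever actually exposes the numbers you need (for example the JobManager REST API or your metrics reporter):

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Polls a metrics endpoint and emits the raw JSON body as a stream element.
public class MetricsPollingSource implements SourceFunction<String> {

    private final String metricsUrl; // placeholder, e.g. your JobManager's REST endpoint
    private volatile boolean running = true;

    public MetricsPollingSource(String metricsUrl) {
        this.metricsUrl = metricsUrl;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            HttpURLConnection conn = (HttpURLConnection) new URL(metricsUrl).openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String body = reader.lines().collect(Collectors.joining());
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(body);
                }
            } finally {
                conn.disconnect();
            }
            Thread.sleep(10_000); // poll every 10 seconds
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

You would then connect this stream to your keyed stream, but note it gives you job-level metrics as data, not direct local access to the current subtask's memory usage.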
We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically or will reshuffle the keys? And if the latter is true - can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey
I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but this can only be applied in situations where the existing partitioning exactly matches what keyBy would do -- and I see no reason to think that would apply here.
I have a Flink application running in Amazon's Kinesis Data Analytics Service (managed Flink cluster). In the app, I read in user data from a Kinesis stream, keyBy userId, and then aggregate some user information. After asking this question, I learned that Flink will split the reading of a stream across physical hosts in a cluster. Flink will then forward incoming events to the host that has the aggregator task assigned to the key space that corresponds to the given event.
With this in mind, I am trying to decide what to use as a partition key for the Kinesis stream that my Flink application reads from. My goal is to limit network traffic between hosts in the Flink cluster in order to optimize performance of my Flink application. I can either partition randomly, so the events are evenly distributed across the shards, or I can partition my shards by userId.
The decision depends on how Flink works internally. Is Flink smart enough to assign the local aggregator tasks on a host a key space that will correspond to the key space of the shard(s) the Kinesis consumer task on the same host is reading from? If this is the case, then sharding by userId would result in ZERO network traffic, since each event is streamed by the host that will aggregate it. It seems like Flink would not have a clear way of doing this, since it does not know how the Kinesis streams are sharded.
OR, does Flink randomly assign each Flink consumer task a subset of shards to read and randomly assign aggregator tasks a portion of the key space? If this is the case, then it seems a random partitioning of shards would result in the least amount of network traffic since at least some events will be read by a Flink consumer that is on the same host as the event's aggregator task. This would be better than partitioning by userId and then having to forward all events over the network because the keySpace of the shards did not align with the assigned key spaces of the local aggregators.
Ten years ago, it was really important that as little data as possible be shipped over the network. In the last five years, networks have become so incredibly fast that you notice little difference between accessing a chunk of data over the network and from memory (random access is of course still much faster), so I wouldn't sweat too much about the additional traffic (unless you have to pay for it). Anecdotally, Google Datastream started to stream all data to a central shuffle server between two tasks, effectively doubling the traffic; but they still see tremendous speedups on their petabyte network.
So with that in mind, let's move to Flink. Flink currently has no way to dynamically adjust to shards, which can come and go over time. In half a year, with FLIP-27, it could be different.
For now, there is a workaround that is currently mostly used in Kafka-land (static partitioning): DataStreamUtils#reinterpretAsKeyedStream allows you to specify a logical keyBy without a physical shuffle. Of course, you are responsible for making sure that the provided partitioning corresponds to reality, or else you will get incorrect results.
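A minimal sketch of what that looks like, assuming a hypothetical Event POJO with a userId field and assuming the upstream partitioning really does match what keyBy on userId would produce:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class ReinterpretExample {

    // Hypothetical event type with a userId field.
    public static class Event {
        public String userId;
        public long value;
    }

    /**
     * Treats an already-partitioned stream as keyed without a network shuffle.
     * Only safe if the incoming partitioning matches exactly what keyBy(userId) would produce.
     */
    public static KeyedStream<Event, String> asKeyedByUser(DataStream<Event> prePartitioned) {
        return DataStreamUtils.reinterpretAsKeyedStream(
                prePartitioned,
                (KeySelector<Event, String>) e -> e.userId);
    }
}
```

From the returned KeyedStream you can use keyed windows and keyed state as usual, without the shuffle that a plain keyBy would introduce.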
I am trying to build a data management (DM) solution involving high-volume data ingestion, passing the data through some domain rules, substitution (enrichment), and flagging erroneous data before sending it off to a downstream system. The rules checking and value replacement can be something simple, like permissible numeric thresholds that the data elements should satisfy, or something more complex, like a lookup against master data for the domain pool of values.
Do you think that Apache Flink can be a good candidate for such processing? Can Flink operators be defined to do a lookup (against the master data) for each tuple flowing through them? I think there are some drawbacks to employing Apache Flink for the latter: 1) the lookup could be a blocking operation that would slow down the throughput, and 2) checkpointing and persisting the operator state cannot be done if the operator functions have to fetch master data from elsewhere.
What are the thoughts? Is there some other tool best at the above use case?
Thanks
The short answer is 'yes'. You can use Flink for all the things you mentioned, including data lookups and enrichment, with the caveat that you won't have at-most-once or exactly-once guarantees on side effects caused by your operators (like updating external state). You can work around the added latency of external lookups with higher parallelism on that particular operator.
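As a sketch of the lookup-and-enrich pattern (the master data here is a hard-coded map standing in for whatever database or service you would actually load or query in open()):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

import java.util.HashMap;
import java.util.Map;

// Enriches each (code, rawValue) tuple against master data and flags values outside the domain pool.
// In practice open() would fetch the master data from a database, file, or service, and you would
// run this operator with higher parallelism to hide lookup latency, e.g.
//   stream.map(new EnrichWithMasterData()).setParallelism(32)
public class EnrichWithMasterData
        extends RichMapFunction<Tuple2<String, Double>, Tuple2<String, String>> {

    private transient Map<String, String> masterData;

    @Override
    public void open(Configuration parameters) {
        masterData = new HashMap<>();
        masterData.put("DE", "Germany");   // placeholder domain values
        masterData.put("FR", "France");
    }

    @Override
    public Tuple2<String, String> map(Tuple2<String, Double> record) {
        // Records whose key is outside the permissible domain pool are flagged as UNKNOWN
        String domainValue = masterData.getOrDefault(record.f0, "UNKNOWN");
        return Tuple2.of(record.f0, domainValue);
    }
}
```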
It's impossible to give a precise answer without more information, such as what exactly constitutes 'high volume data' in your case, what your per-event latency requirements are, what other constraints you have, etc. However, in a general sense, before you commit to using Flink you should take a look at both Spark Streaming and Apache Storm and compare. Both Spark and Storm have larger communities and more documentation, so it might save you some pain in the long run. Tags on StackOverflow at the time of writing: spark-streaming x 1746, apache-storm x 1720, apache-flink x 421
More importantly, Spark Streaming has similar semantics to Flink but will likely give you better bulk data throughput. Alternatively, Storm is conceptually similar to Flink (spouts/bolts vs. operators); it actually has lower performance/throughput in most cases, but it is a more established framework.
I have a static field (a counter) in one of my bolts. If I run the topology in a cluster with multiple workers, each JVM is going to have its own copy of the static field. But I want a field that can be shared across workers. How can I accomplish that?
I know I can persist the counter somewhere, read it in each JVM, and update it (synchronously), but that would be a performance issue. Is there any way to do this through Storm?
The default Storm API only provides an at-least-once processing guarantee. This may or may not be what you want, depending on whether the accuracy of the counter matters (e.g., when a tuple is reprocessed due to a worker failure, the counter is incremented incorrectly).
I think you can look into Trident, a high-level API for Storm that can provide an exactly-once processing guarantee and abstractions (e.g. Trident State) for persisting state in databases or in-memory stores. Cassandra has a counter column type that might be suitable for your use case.
Persisting trident state in memcached
https://github.com/nathanmarz/trident-memcached
Persisting trident state in Cassandra
https://github.com/Frostman/trident-cassandra
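To make the Trident suggestion concrete, here is a minimal sketch of a grouped, persistent count. MemoryMapState is the in-memory test state; swap in the memcached or Cassandra state factories from the links above for a counter that is shared across workers. The package names are those of recent Apache Storm releases (older releases used storm.trident.*):

```java
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class SharedCounterTopology {

    // `eventSpout` is assumed to emit tuples with an "eventType" field.
    public static TridentState buildCounters(TridentTopology topology, IBatchSpout eventSpout) {
        return topology
                .newStream("events", eventSpout)
                .groupBy(new Fields("eventType"))
                // persistentAggregate keeps one counter per group in a Trident State,
                // with Trident's exactly-once update semantics on top of replayed batches.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}
```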