I have a Flink app with a RichAsyncFunction that runs asynchronous queries against an external database. The function uses a Guava cache. This works perfectly, but the cache is currently not included in checkpoints/savepoints. Is there a way to get the cache data included in checkpoints/savepoints?
I notice that RichAsyncFunction doesn't support the state functionality at all. Does this mean I can't serialize my cache to checkpoints/savepoints?
There is one Guava cache for the entire Flink app, which might make this scenario simpler.
Is there a recommended way to handle this situation?
FYI, I need lock-free concurrency support including check-and-set, which is offered by both java.util.concurrent.ConcurrentMap and Guava's cache, but not Flink's MapState. Is it in line with Flink best practices to use a Guava cache?
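To make the check-and-set requirement concrete, here is a minimal sketch that uses only the JDK. It is an illustration, not Flink code: the class CheckAndSetExample and its methods are hypothetical names, and it relies only on ConcurrentMap.putIfAbsent and the three-argument ConcurrentMap.replace. A Guava Cache offers the same ConcurrentMap operations through its asMap() view.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical example class; not part of Flink or Guava.
class CheckAndSetExample {
    private final ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();

    // Atomically increments the counter for a key without external locking.
    // Retries when another thread changed the value between read and write.
    long increment(String key) {
        while (true) {
            Long current = counts.putIfAbsent(key, 1L);
            if (current == null) {
                return 1L; // key was absent, our initial value was stored
            }
            long next = current + 1;
            // replace(key, expected, update) succeeds only if the value is still 'current'
            if (counts.replace(key, current, next)) {
                return next;
            }
        }
    }

    // Plain check-and-set: stores 'update' only if the current value equals 'expected'.
    boolean compareAndSet(String key, long expected, long update) {
        return counts.replace(key, expected, update);
    }

    Long get(String key) {
        return counts.get(key);
    }

    public static void main(String[] args) {
        CheckAndSetExample example = new CheckAndSetExample();
        example.increment("user-1");
        example.increment("user-1");
        System.out.println(example.get("user-1"));
        System.out.println(example.compareAndSet("user-1", 2L, 10L));
    }
}
```

MapState has no equivalent of the three-argument replace, which is the gap described above.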
I have conducted performance testing on an e-commerce website hosted on Azure, and I am checking the Azure logs for the test duration to find scaling issues. In the logs I saw a lot of "InProc" dependency failures, and also a lot of "Technical exception" entries with the message "Cart not recalculated for remove shipping methods". I would like to know whether this indicates a scaling issue, and what else I should check for scaling issues, for example slow database queries. I am very new to performance testing and Azure, so any help will be much appreciated. Thanks!!
Performance can be improved using the Cache-Aside pattern: Caching Can Improve Application Performance
Data from a data store can be loaded into a cache on demand. This can help increase performance while also helping to keep the data in the cache consistent with the underlying data store.
Read-through and write-through/write-behind operations are available in many commercial caching systems. An application in these systems retrieves data by referring to the cache. If the data isn't already in the cache, it's fetched from the data store and added to the cache. Any changes to the data in the cache are also automatically written back to the data store.
If the cache does not provide this feature, it is the responsibility of the applications that use the cache to maintain the data.
The cache-aside technique allows an application to mimic the capabilities of read-through caching. This technique loads data into the cache only when it is needed. The following diagram shows how to use the cache.
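The read and write paths of cache-aside can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions: the class CacheAsideExample is hypothetical, both the cache and the data store are stand-in in-memory maps (a real application would use something like a distributed cache and a database), and invalidating the cache entry on write is one common choice rather than the only one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; both maps are in-memory stand-ins.
class CacheAsideExample {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> store = new ConcurrentHashMap<>(); // stand-in for the data store
    private final AtomicInteger storeReads = new AtomicInteger();        // counts trips to the store

    // Read path: try the cache first, fall back to the store on a miss,
    // then populate the cache so the next read is served from memory.
    String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        storeReads.incrementAndGet();
        String loaded = store.get(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    // Write path: update the store, then invalidate the cached copy
    // so the next read reloads the fresh value.
    void write(String key, String value) {
        store.put(key, value);
        cache.remove(key);
    }

    int storeReads() {
        return storeReads.get();
    }

    public static void main(String[] args) {
        CacheAsideExample example = new CacheAsideExample();
        example.write("cart-1", "2 items");
        System.out.println(example.read("cart-1")); // miss: loaded from the store
        System.out.println(example.read("cart-1")); // hit: served from the cache
        System.out.println("store reads: " + example.storeReads());
    }
}
```

Only the first read of a key after a write reaches the data store, which is where the performance gain comes from.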
I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), and then scoring each transaction with a machine learning model.
For now, features are all kept in Flink states and the same job is scoring the enriched transaction. But I'd like to separate the features computation job from the scoring job and I'm not sure how to do this.
Queryable state doesn't seem to fit this, as the job ID is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST API to determine the job ID.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.
The Flink documentation suggests that Ceph can be used as persistent storage for state. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/checkpointing.html
Considering that Ceph is a transactional database, wouldn't it have an adverse effect on Flink's performance?
Ceph describes itself as a "unified, distributed storage system" and provides a network file system API. As such, it should work seamlessly with Flink's state backends that persist checkpoints to a remote file system.
I'm not aware of people using Ceph (HDFS and S3 are more commonly used) and have no information about the performance. However, note that Flink is able to write checkpoints asynchronously, so that the performance of the storage system does not affect the processing speed of a Flink application. It might, however, constrain the interval at which checkpoints are taken.
Update:
(Feb. 2018) I noticed that multiple users reported on Flink's user mailing list that they are using Ceph with Flink.
Update 2:
Flink works fine with Ceph's S3 protocol, and both of Flink's S3 FileSystem plugins (Presto and Hadoop) work fine with it.
Both are streaming frameworks that process one event at a time. What are the core architectural differences between these two streaming frameworks (Apache Apex and Apache Flink)?
Also, what are some particular use cases where one is more appropriate than the other?
As you mentioned, both are streaming platforms that do in-memory computation in real time. But there are some architectural differences when you take a closer look.
Apex has a YARN-native architecture: it fully utilises YARN for scheduling, security and multi-tenancy, whereas Flink integrates with YARN. Apex can do resource allocation at the operator (container) level with YARN.
Partitioning: Apex supports several sophisticated stream partitioning schemes and also allows controlling operator locality & stream locality. Flink supports simple hash partitions and custom partitions.
Apex allows dynamic changes to topology without having to take down the application. Apex allows the application to be updated at runtime so you can add and remove operators, update properties of operators, or automatically scale the application at runtime. Apache Flink does not support any of these capabilities.
Buffer Server: There is a message bus called the buffer server between operators. Subscribers can connect to the buffer server and fetch data from particular offsets. It is window-aware and holds data until no subscriber needs it any more.
Fault tolerance: Apex has an incremental recovery model: on failure, only part of the topology needs to be restarted, with no need to go back to the source, whereas Flink goes back to the source.
Apex has a high-level API as well as a low-level API. Flink only has a high-level API.
Apex has a library called Apache Malhar, which has a vast variety of well-tested connectors and processing operators that can be reused easily.
Lastly, Apex is more focused on productizing big data applications, so it has many features that help make applications easy to develop and maintain.
Note: I am a committer to Apache Apex, so I might sound biased to Apex :)
I have a stream processing application written in Flink and I want to use its internal key-value store from the state backend to compute streaming aggregates. Because I am dealing with a lot of aggregates, I would like to avoid maintaining them on-heap inside the Flink application, as the memory-backed and file-backed implementations currently do. Instead, I would like to maintain a cache of the state in Apache Ignite, which in turn could use its write-through and read-through features to provide a more reliable backup in HBase.
Ideally, I would have a single local Ignite cache on every physical node that handles the state for all long-running Flink operators on that node. E.g. each node has a single Ignite node in an 8 GB container available, whether it is running 1 or 10 Flink operators.
The problem is that I want both Flink and Ignite to run on YARN. Through consistent partitioning, I can ensure that the data in general is sent to the correct cache, and in case of failures etc., it can be refilled from HBase. The problem I'm facing, though, is that Ignite seems to request containers from YARN randomly, meaning I have no guarantee that a local cache is in fact available, even if I set the number of Ignite nodes to exactly the number of physical nodes.
Any suggestions on how to achieve a setup with one Ignite node per physical node?
There is a ticket to enhance resource allocation on YARN: https://issues.apache.org/jira/browse/IGNITE-3214. Someone in the community will pick it up and fix it.