#Apache-flink: use case for Data Management

I am trying to build a data management (DM) solution that ingests high-volume data, passes it through data domain rules, applies substitution (enrichment), and flags erroneous data before sending it off to a downstream system. The rule checks and value replacements can range from something simple, like permissible numeric thresholds that the data elements must satisfy, to something more complex, like a lookup against master data for a domain pool of values.
Do you think that Apache Flink is a good candidate for such processing? Can Flink operators be defined to do a lookup (against the master data) for each tuple flowing through them? I see some drawbacks to employing Apache Flink for the latter: 1) the lookup could be a blocking operation that would slow down throughput, 2) checkpointing and persisting the operator state cannot be done if the operator functions have to fetch master data from elsewhere.
What are your thoughts? Is there some other tool better suited to the above use case?
Thanks

The short answer is 'yes'. You can use Flink for all the things you mentioned, including data lookups and enrichment, with the caveat that you won't have at-most-once or exactly-once guarantees on side effects caused by your operators (like updating external state). You can work around the added latency of external lookups with higher parallelism on that particular operator.
It's impossible to give a precise answer without more information, such as what exactly constitutes 'high volume data' in your case, what your per-event latency requirements are, what other constraints you have, etc. However, in the general sense, before you commit to using Flink you should take a look at both Spark Streaming and Apache Storm and compare. Both Spark and Storm have larger communities and more documentation, so it might save you some pain in the long run. Tags on StackOverflow at the time of writing: spark-streaming x 1746, apache-storm x 1720, apache-flink x 421.
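As a rough illustration of the kind of per-record logic described in the question, here is a minimal sketch in plain Python (not Flink API). The `MASTER_DATA` table, the `Record` fields, the error labels and the `max_amount` threshold are all hypothetical; the cache simply stands in for whatever would keep repeated master data lookups off the hot path.

```python
from dataclasses import dataclass, field
from functools import lru_cache
from typing import List, Optional

# Hypothetical master data; in practice this would live in an external store.
MASTER_DATA = {"DE": "Germany", "FR": "France"}


@lru_cache(maxsize=10_000)
def lookup_country(code: str) -> Optional[str]:
    """Stand-in for a (possibly slow) master data lookup, cached per worker."""
    return MASTER_DATA.get(code)


@dataclass
class Record:
    country_code: str
    amount: float
    country_name: Optional[str] = None
    errors: List[str] = field(default_factory=list)


def apply_rules(record: Record, max_amount: float = 1000.0) -> Record:
    """Threshold check followed by enrichment; bad records are flagged, not dropped."""
    if not 0 <= record.amount <= max_amount:
        record.errors.append("amount_out_of_range")
    name = lookup_country(record.country_code)
    if name is None:
        record.errors.append("unknown_country_code")
    else:
        record.country_name = name
    return record


def process(stream):
    """Apply the rules to every record in an iterable of records."""
    return [apply_rules(r) for r in stream]
```

In a streaming job the same function body would sit inside a map-style operator, and flagged records could be routed to a separate output instead of being returned in the same list.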
More importantly, Spark Streaming has similar semantics to Flink, but will likely give you better bulk data throughput. Alternatively, Storm is conceptually similar to Flink (spouts/bolts vs operators) and actually has lower performance/throughput in most cases, but is just a more established framework.

Related

Extensive benchmark for Flink in stream processing

I am using Flink at my company and I am planning to run several scenarios to measure the performance of each case.
Below are the scenarios that I will work on.
Experiments
End-to-end
Exactly-once or at-least-once
source: Kafka
sink: MySQL and Redis
logic: simple counting logic
For exactly-once, I will use the TwoPhaseCommitSink.
Before running the experiments, I have a few questions, as below.
The performance of the sink
As you can see, I will use MySQL (an RDB) for the sink. Are there any descriptive benchmark results for using an RDB with at-least-once or exactly-once? I think that when the sink is a database, throughput will suffer because it takes time to connect to and communicate with the database. But I cannot find any documents or technical blogs showing detailed benchmark results for Flink with an RDB sink.
In particular, I am wondering whether exactly-once will degrade performance more than at-least-once, and whether it will be too slow to use for commercial purposes.
So my questions are as below.
Are there any informative results for the two semantics modes (at-least-once, exactly-once) using a database sink (MySQL or Redis)?
Will exactly-once end-to-end semantics be very slow when using the MySQL sink? I will use the TwoPhaseCommitSink.
Thanks.

A few reactions:
Simple, generic Flink benchmarks are pretty useless as predictors of specific application performance. So much depends on what a specific job is doing, and there's a lot of room for optimization.
Exactly-once with two-phase commit sinks is costly in terms of latency, but not so bad with respect to throughput. The issue is that the commit has to be done in concert with a checkpoint. If you need to checkpoint frequently in order to reduce the latency, then that will more significantly harm the throughput.
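To make the latency point concrete, here is a minimal sketch of the two-phase commit idea in plain Python. It is not Flink's sink API: the class and method names only loosely mirror the protocol steps, and the in-memory lists stand in for a real transactional database.

```python
class TwoPhaseCommitSketch:
    """Toy sink: records become visible only when their checkpoint is committed."""

    def __init__(self):
        self.committed = []      # what downstream readers can see
        self.open_txn = []       # records written since the last checkpoint
        self.pre_committed = {}  # checkpoint id -> records awaiting commit

    def invoke(self, record):
        # Normal processing: write into the currently open transaction.
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id):
        # Checkpoint started: close the open transaction and start a new one.
        self.pre_committed[checkpoint_id] = self.open_txn
        self.open_txn = []

    def commit(self, checkpoint_id):
        # Checkpoint completed: only now do the records become visible.
        self.committed.extend(self.pre_committed.pop(checkpoint_id))

    def abort(self):
        # Failure before pre-commit: discard uncommitted writes.
        self.open_txn = []
```

A record written just after a checkpoint waits for the whole next checkpoint interval before it is visible, which is why the cost shows up mainly as latency.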
Unaligned checkpoints and the changelog state backend can make a big difference for some use cases. If you want to include these in your testing, be sure to use Flink 1.16, which saw significant improvements in these areas.
The Flink project has invested quite a bit in having a suite of benchmarks that run on every commit. See https://github.com/apache/flink-benchmarks and http://codespeed.dak8s.net:8000/ for more info.

Flink - take advantage of input partitioning to avoid inter task-manager communications

We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey

I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but this can only be applied in situations where the existing partitioning exactly matches what keyBy would do -- and I see no reason to think that would apply here.
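The mismatch is easy to see with a small sketch. Both hash functions below are stand-ins chosen for illustration (they are not the ones Kafka or Flink use), and `source_partition` and `keyed_subtask` are hypothetical names; the point is only that two independently designed partitioning schemes will rarely agree on where a key belongs.

```python
import hashlib
import zlib


def source_partition(key: str, num_partitions: int) -> int:
    """Hypothetical producer-side partitioner: hash of the key modulo partitions."""
    return zlib.crc32(key.encode()) % num_partitions


def keyed_subtask(key: str, parallelism: int, max_parallelism: int = 128) -> int:
    """Key-group style assignment: key -> key group -> subtask index.

    The hash is a stand-in for illustration, not the one Flink uses.
    """
    digest = hashlib.md5(key.encode()).digest()
    key_group = int.from_bytes(digest[:4], "big") % max_parallelism
    return key_group * parallelism // max_parallelism


def count_mismatches(keys, parallelism: int) -> int:
    """Count keys that the two schemes would place on different workers."""
    return sum(
        1
        for k in keys
        if source_partition(k, parallelism) != keyed_subtask(k, parallelism)
    )


keys = [f"client-{i}" for i in range(1000)]
mismatches = count_mismatches(keys, parallelism=4)
```

Any key counted in `mismatches` would be read by one subtask but owned by another, which is exactly the case where skipping the shuffle would be incorrect.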

Is it possible to achieve Exactly Once Semantics using a BASE-fashioned database?

In stream processing applications (e.g. based on Apache Flink or Apache Spark Streaming) it is sometimes necessary to process data exactly once.
In the database world something equivalent can be achieved by using databases that follow the ACID criteria (correct me if I'm wrong here).
However there are a lot of (non relational) databases that do not follow ACID but BASE.
Now my question is: If I'm going to integrate such a BASE database into a stream processing application (exactly once), can I still guarantee exactly once processing for the whole pipeline? And if this is possible, under what circumstances?

Exactly Once Semantics means the processing framework, such as Flink, can guarantee that each incoming record (event) will be processed exactly one time even if the pipeline fails in any way.
This is done by taking checkpoints along the pipeline, so that when the application recovers from a failure, operations that already completed successfully are not executed again.
It depends on what kind of operations you are trying to do with the database. In most cases databases are used as sinks that the processing results are written into. In that case the operation involving the database is just a simple insert, and it will not be executed again after one successful run, so it is still exactly-once regardless of ACID support.
You might be tempted to group operations together for databases that support ACID, but that is bad practice in a parallel streaming pipeline, since it creates multiple transactions and the locks might block the whole process. Instead, a BASE (NoSQL) database that is fast under intensive reads and updates is preferable. You just need to make your operations idempotent, so that partially re-executed statements (if they fail halfway through, then after recovery they might all be executed again) won't result in incorrect data.
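A minimal sketch of what "idempotent" means for a sink write, using Python's built-in `sqlite3` purely as a convenient stand-in for the target store (it is not a BASE database). The table name, key format and values are made up for illustration.

```python
import sqlite3


def write_results(conn, results):
    """Idempotent write: the primary key makes a replay overwrite, not duplicate."""
    conn.executemany(
        "INSERT OR REPLACE INTO counts (window_key, total) VALUES (?, ?)",
        results,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (window_key TEXT PRIMARY KEY, total INTEGER)")

batch = [("client-1|10:00", 3), ("client-2|10:00", 7)]
write_results(conn, batch)
write_results(conn, batch)  # simulated replay of the same results after recovery

rows = conn.execute(
    "SELECT window_key, total FROM counts ORDER BY window_key"
).fetchall()
```

The design choice that matters is the deterministic key: as long as a replayed result maps to the same key and carries the same value, writing it twice leaves the store in the same state as writing it once. A blind append or an in-place increment would not have this property.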

How can I access state computed by an external Flink job (without knowing its id)?

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring each transaction with a machine learning model.
For now, features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!

Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
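The idea behind asynchronous calls to an external scorer can be sketched with plain `asyncio` (this is not Flink's Async I/O API). `score_remote` is a hypothetical stand-in for the remote service, and the scoring formula and `max_in_flight` limit are invented for the example.

```python
import asyncio


async def score_remote(txn):
    """Hypothetical stand-in for a call to an external scoring service."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {**txn, "score": min(1.0, txn["amount"] / 1000.0)}


async def score_all(txns, max_in_flight=10):
    """Keep up to max_in_flight requests pending; results keep the input order."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(txn):
        async with limit:
            return await score_remote(txn)

    return await asyncio.gather(*(bounded(t) for t in txns))


transactions = [{"id": i, "amount": 100.0 * i} for i in range(20)]
scored = asyncio.run(score_all(transactions))
```

Because the waits overlap, the total time is closer to a couple of round trips than to twenty, without needing twenty parallel operator instances.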
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
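A rough sketch of that lookup follows. It assumes the JobManager exposes a job overview endpoint at `/jobs/overview` that returns a `jobs` array with `jid`, `name` and `state` fields; the path and field names are assumptions here and should be checked against the REST API documentation for the Flink version in use. The job name is hypothetical.

```python
import json
from urllib.request import urlopen


def find_job_id(overview, job_name):
    """Return the id of the first RUNNING job with the given name, else None."""
    for job in overview.get("jobs", []):
        if job.get("name") == job_name and job.get("state") == "RUNNING":
            return job.get("jid")
    return None


def fetch_overview(base_url):
    """Fetch the job overview; the endpoint path is an assumption (see above)."""
    with urlopen(f"{base_url}/jobs/overview") as resp:
        return json.load(resp)


# Example payload in the assumed shape, so the lookup can be tried offline.
sample = {
    "jobs": [
        {"jid": "a1b2", "name": "feature-job", "state": "RUNNING"},
        {"jid": "c3d4", "name": "old-feature-job", "state": "CANCELED"},
    ]
}
job_id = find_job_id(sample, "feature-job")
```

Against a live cluster the call would look like `find_job_id(fetch_overview(base_url), "feature-job")`, with `base_url` pointing at the JobManager's REST address.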
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

How real-time are Elasticsearch, Solr and DSE real-time search?

For the last couple of weeks, I have been working with Elasticsearch and Solr, trying to do OLTP processing in real time. However, what strikes me is that they (especially ES) claim to be real time. The meaning of real time looks quite fuzzy to me.
If we go deeper into it, both ES and Solr define a refresh rate or a soft-commit rate, after which newly indexed documents become available for search, effectively providing only near-real-time capabilities.
It looks like "real-time search" is either a marketing statement, or they blur the term by talking about real-time search as opposed to batch or analytical processing.
Am I correct, or is real-time search actually possible in a typical OLTP system, where every transaction has search visibility of the last document? Correct me if I am wrong.

Elasticsearch is Near Real Time for search. Elasticsearch is Real Time for operations like Create, Update, Delete and Get.
By default, the refresh interval is 1 second. In some use cases, it could appear as real time. For example, I was working for a French government service and we were producing statistics per day. So for our use case, it was effectively real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
If you really need real time, then you probably want to use a SQL database.
My 2 cents.
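The difference between real-time Get and near-real-time search can be pictured with a toy model in plain Python. This is only a conceptual sketch, not how Elasticsearch is implemented; the class and its two dictionaries are invented for illustration.

```python
class NearRealTimeIndex:
    """Toy model: get() sees every write at once, search() only after refresh()."""

    def __init__(self):
        self.store = {}       # latest version of every document, by id
        self.searchable = {}  # snapshot visible to search, updated on refresh

    def index(self, doc_id, doc):
        self.store[doc_id] = doc

    def get(self, doc_id):
        # Lookup by id is served from the latest written state.
        return self.store.get(doc_id)

    def refresh(self):
        # In the real system this happens periodically (1 second by default).
        self.searchable = dict(self.store)

    def search(self, term):
        return sorted(
            doc_id
            for doc_id, doc in self.searchable.items()
            if term in doc["text"]
        )
```

A document indexed between two refreshes can be fetched by id immediately but will not show up in search results until the next refresh, which is the window the question is asking about.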

Yes, DSE Search is indeed Near real-time and has not yet achieved the mythical goal of absolute zero latency. But... even traditional Real real-time is not real-time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented, or even if the actual update operation is not batched, there is likely to be some human process that delays the start of the database update from the original source of a data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than push you back towards SQL if you want real-time, I would challenge you to fully justify the true latency requirements of the app. For example, with complex distributed applications you need to be prepared for occasional resource outages, such as network delays, so that it is usually much better to design a modern distributed application to be a lot more flexible and asynchronous than a traditional, synchronous, fragile (think HealthCare.gov) app architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.
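Such a polling strategy might look like the following sketch. The helper name, the timeout and the interval are arbitrary choices for illustration; `check` is any caller-supplied function that returns True once the update is visible.

```python
import time


def wait_until_visible(check, timeout_s=5.0, interval_s=0.1):
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The caller still has to decide what to do when the helper returns False, which is the kind of decoupling the answer argues an application should be designed for anyway.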

Elasticsearch has real-time features for CRUD operations. On GET operations, it checks the transaction log for any uncommitted changes and returns the most recent version of the document.
The Percolator feature enables real time in search queries as well. It allows you to register queries (percolation) that are then used at indexing time to report which of those predefined queries match the document being indexed.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
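The three steps above can be mimicked with a toy model in plain Python. This is not the Elasticsearch API: the class, the "all terms must appear" matching rule and the query ids are assumptions made to keep the sketch short.

```python
class PercolatorSketch:
    """Toy model: queries are stored first, documents are matched as they arrive."""

    def __init__(self):
        self.queries = {}  # query id -> set of terms that must all be present
        self.docs = []

    def register(self, query_id, terms):
        # Step 1: register a query ahead of time.
        self.queries[query_id] = {t.lower() for t in terms}

    def index(self, text):
        # Step 2: index new content and percolate it against stored queries.
        self.docs.append(text)
        words = set(text.lower().split())
        # Step 3: the response lists the ids of the matching queries.
        return sorted(q for q, terms in self.queries.items() if terms <= words)
```

The search direction is inverted compared to a normal query: the document is the input and the stored queries are what gets matched, so matches are known at indexing time rather than after a refresh.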
A very good blog with live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator