Extensive benchamark for Flink in stream processing - apache-flink

I am using Flink at my company and I am considering to apply several scenarios to see the performance of each case.
Below is the scenarios that I will work on
Experiments
End-to-End
Exactly-At-Once or At-least-once
source : kafka
sink : Mysql and Redis
logic : simple counting logic
For the Exactly-At-Once, I will use the TwoPhaseCommitSink for achieving the case.
Before doing experiment, I am wondering some issues as below.
The performance speed of the sink
As you can see, I will use the mysql (RDB) for the sink. Is there any descriptive benchmarks result when we use the RDB for at-least-once or exactly-at-once? I think that when the sink uses the database, the throughput will be influenced because it takes some time to connect and communicate with database. But I cannot find any documents or technical blogs showing the detailed results of benchmark of Flink when using the Sink for RDB.
Especially, I am also wondering that the Exactly-at-once will have more degraded performance than the at-least-once and it is hard to use the commercial purpose because of its slow processing.
So my question is as below.
Is there any informative results for the two semantics mode (at least once, exactly at once) using the database sink (mysql or redis)?
Exactly-at-once semantics for end-to-end will be very slow when using the mysql sink? I will apply the twophasecommitsink.
Thanks.

A few reactions:
Simple, generic Flink benchmarks are pretty useless as predictors of specific application performance. So much depends on what a specific job is doing, and there's a lot of room for optimization.
Exactly-once with two-phase commit sinks is costly in terms of latency, but not so bad with respect to throughput. The issue is that the commit has to be done in concert with a checkpoint. If you need to checkpoint frequently in order to reduce the latency, then that will more significantly harm the throughput.
Unaligned checkpoints and the changelog state backend can make a big difference for some use cases. If you want to include these in your testing, be sure to use Flink 1.16, which saw significant improvements in these areas.
The Flink project has invested quite a bit in having a suite of benchmarks that run on every commit. See https://github.com/apache/flink-benchmarks and http://codespeed.dak8s.net:8000/ for more info.

Related

Is it possible to achieve Exacly Once Semantics using a BASE-fashioned database?

In Stream Processing applications (f. e. based on Apache Flink or Apache Spark Streaming) it is sometimes necessary to process data exactly once.
In the database world something equal be achieved by using databases that follow the ACID criteria (correct me if I'm wrong here).
However there are a lot of (non relational) databases that do not follow ACID but BASE.
Now my question is: If I'm going to integrate such a BASE database into a stream processing application (exactly once), can I still guarantee exactly once processing for the whole pipeline? And if this is possible, under what circumstances?
Exactly Once Semantics means the processing framework such as flink can guarantee each incoming record(event) will be processed exactly one time even if the pineline fails in any way.
This is done by having checkpoints after each operation in the pineline, so that when the application recovers from failure, successful operation will not be executed again.
Depends on what kind of operations you are trying to do with databases, most cases databases are used as sinks for processing result to write into. In that case the operation involving database is just a simple insert and it will not be executed again after one successful run therefore it's still exactly-once regardless of its ACID support.
You might be tempted to group operations together for databases that support ACID but it will be a bad practice in a parallel streaming pineline since they created mutilple transactions and the locks might block the whole process. Instead, use BASE (NoSQL) database that are fast with intensive read and update performance is preferable, you just need to make your operations to be idempotent so that partially re-executed statements (if they failed half way through then after recovery they might be executed all again) won't result in incorrect data.

When configure flink checkpoint, should I choose exactly-once or at-least-once?

When I use flink to log data pipeline, data volume about 100G every day,
the flink checkpoint default config is exactly-once, but I worry about this will affect latency.
exactly-once it is necessary? how about at-least-once?
If the computation you perform downstream is idempotent (ie duplicates do not change the result of your computation), then at-least-once will be more performant (as it requires less synchronisation).
For more info see here.
Is duplicated delivery a problem for your application? Then you may want to use exactly-once. Else at-least-once will offer better performance.
From experience: we started with exactly-once, but soon switched to at-least-once. You may be able to control duplication within Flink, but it is almost impossible to avoid throughout the system. Thus, we built all of our actions to be (testably) idempotent and turned exactly-once off.
Checkpoints in Flink are implemented via a variant of the Chandy/Lamport asynchronous barrier snapshotting algorithm. Docs.
Before Flink 1.11, the only difference between "exactly-once" and "at-least-once" has been that exactly-once required barrier alignment on any operator with multiple inputs. In general this tends to increase latency; how much it increases depends on the job.
Flink 1.11 has introduced the option of unaligned checkpoints. This alternative implementation of exactly-once helps in some cases, sometimes by a lot. Docs.
If you aren't happy with the thought of coping with some duplication during recovery, then you may want to so some benchmarking with your app.

How can I access states computed from an external Flink job ? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a usecase consisting in enriching transactions coming from Kafka with a lot of historical features (e.g number of past transactions between same source and same target), then score this transaction with a machine learning model.
For now, features are all kept in Flink states and the same job is scoring the enriched transaction. But I'd like to separate the features computation job from the scoring job and I'm not sure how to do this.
The queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong !
I've thought about querying directly RocksDB but maybe there's a more simple way ?
Is the separation in two jobs for this task a bad idea with Flink ? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check if it has any positive impact on latency)
Some extra information : I'm using Flink 1.3 (but willing to upgrade if it's needed) and the code is written in Scala
Thanks in advance for your help !
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Reason why Odoo being slow when there is huge data inside the database

We have observed one problem in Postgresql as it doesn't uses multi core of CPU for single query. For example, I have 8 cores in cpu. We are having 40 Million entries in stock.move table. When we apply massive query in single database connection to generate reporting & observe at backend side, we see only one core is 100% used, where as all other 7 are free. Due to that query execution time takes so longer and our odoo system being slow. Whereas problem is inside postgresql core. If by anyhow we can share a query between two or more cores than we can get performance boost in postgresql query execution.
I am sure by solving parallel query execution, we can make Odoo performance even faster. Anyone has any kind of suggestions regarding this ??
----------- * Editing this question to show you answer from Postgresql Core committee *---------
Here I am posting the answer which I got from one of top contributor of Postgresql database. ( I hope this information will be useful)
Hello Hiren,
It is expected behave. PostgreSQL doesn't support parallel CPU for
single query. This topic is under high development, and probably, this
feature will be in planned release 9.6 ~ September 2016. But table
with 40M rows isn't too big, so probably more CPU should not too help
to you (there is some overhead with start and processing multi CPU
query). You have to use some usual tricks like materialized view,
preagregations, ... the main idea of these tricks - don't try to
repeat often same calculation. Check health of PostgreSQL - indexes,
vacuum processing, statistics,.. Check hw - speed of IO. Check
PostgreSQL configuration - shared_buffers, work_mem. Some queries can
be slow due bad estimations - check a explain of slow queries. There
are some tools that can breaks some query to more queries and start
parallel execution, but I didn't use it. https://launchpad.net/stado
http://www.pgpool.net/docs/latest/tutorial-en.html#parallel
Regards Pavel Stehule
Well, I think you have your answer there -- PostgreSQL does not currently support parallel query yet. The general advice towards performance is very apt, and you might also consider partitioning, which might allow you to truncate partitions instead of deleting parts of a table, or increasing memory allocation. It's impossible to give good advice on that without knowing more about the query.
Having had experience with this sort of issue on non-parallel query Oracle systems, I suggest that you also consider what hardware you're using.
The modern trend towards CPUs with very many cores is a great help for web servers or other multi-process systems with many short-lived transactions, but you have a data processing system with few, large transactions. You need the correct hardware to support that. CPUs with fewer, more powerful cores are a better choice, and you have to pay attention to bandwidth to memory and storage.
This is why engineered systems have been popular with big data and data warehousing.

How much RealTime is Elasticsearch, Solr and DSE realtime search?

From some last couple of weeks, I have been working around Elasticsearch and Solr, and trying to do OLTP processing in real time. However, what comes to me is they claims(especially ES) to be real time. The meaning of real time looks a lot fuzzy to me.
If we go deep into it, both ES and Solr, defines a refresh rate or a soft-commit rate, after which the newly indexed documents would be available for search, effectively providing only Near-Real time capabilities.
It looks like by Real time search, it is either a marketing statement to call it real time, or they make the word fuzzy by talking about Real Time Search rather than batch or analytical processing.
Am I correct, or correct me if I am wrong, and there is a real-time search possible in a typical OLTP system, where every transaction has search visibility to last document ?
Elasticsearch is a Near Real Time search engine for search. Elasticsearch is Real Time for operations like Create, Update, Delete and Get.
By default, refresh is 1 second. In some use cases, it could appear as real time. For example, I was working for a french gov service and we were producing statistics per day. So for our use case, it was somehow real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
If you really need real time, then you probably want to use a SQL database.
My 2 cents.
Yes, DSE Search is indeed Near real-time and has not yet achieved the mythical goal of absolute zero latency. But... even traditional Real real-time is not real-time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented, or even if the actual update operation is not batched, there is likely to be some human process that delays the start of the database update from the original source of a data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than push you back towards SQL if you want real-time, I would challenge you to fully justify the true latency requirements of the app. For example, with complex distributed applications you need to be prepared for occasional resource outages, such as network delays, so that it is usually much better to design a modern distributed application to be a lot more flexible and asynchronous than a traditional, synchronous, fragile (think HealthCare.gov) app architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.
ElasticSearch has real time features for CRUD operations. On GET operations, it checks the Transaction log, to look for any uncommitted changes and return the most relevant document.
The Percolator feature enables realtime in search queries as well. It allows you to register queries (percolation), that will be used at indexing time to return matching documents to those predefined queries.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
A very good blog with live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator

Resources