I'm considering using Apache Flink to process some streaming data in my project.
However, a friend told me that Flink may need a lot of RAM, and I've found a source saying the same thing: https://www.quora.com/What-is-the-difference-between-Apache-Flink-and-Apache-Spark
I haven't learned much about Flink yet; I've only managed to install it and run the WordCount example.
So I'm wondering why Flink needs so much RAM. What is the main reason? Is it a drawback of Flink itself, the need to keep historical data, or something else?
Can I use something like Redis to avoid this issue?
That answer on Quora is rather old, and lacks specifics.
It all depends on what you mean by "a lot of memory". I've seen Flink running on a cluster of Raspberry Pis -- see https://hal.inria.fr/hal-02463206/document. For another take on this, see also "Extend Flink to edge computing with much lower footprint".
The out-of-the-box configuration is designed to work pretty well across a wide set of use cases. So there is some room for optimization if you need to squeeze Flink down into a more resource-constrained environment.
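For example, the TaskManager memory budget can be shrunk explicitly. This is only a minimal sketch with illustrative numbers, not a recommendation; on a real cluster the same option keys go into flink-conf.yaml:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SmallFootprintWordCount {
    public static void main(String[] args) throws Exception {
        // Shrink the TaskManager memory budget well below the defaults.
        // These values are illustrative only; tune them for your workload.
        Configuration conf = new Configuration();
        conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("512m"));
        conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("64m"));

        // On a real cluster the same keys (taskmanager.memory.*) live in
        // flink-conf.yaml; a local environment just makes them easy to try out.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);

        env.fromElements("to", "be", "or", "not", "to", "be").print();
        env.execute("word-count-on-a-budget");
    }
}
```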
Related
I am using Flink at my company and I am planning to run several scenarios to compare their performance.
Below are the scenarios I will work on:
Experiments
End-to-end
Exactly-once or at-least-once
source: Kafka
sink: MySQL and Redis
logic: simple counting logic
For exactly-once, I will use Flink's TwoPhaseCommitSinkFunction to achieve it.
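For reference, this is roughly the sink skeleton I have in mind. It is a simplified sketch only: the table name, the JDBC URL and the buffer-then-upsert approach are placeholders for the experiment, not a finished implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

/** Buffers (word, count) records per checkpoint and upserts them to MySQL on commit. */
public class MySqlUpsertTwoPhaseSink
        extends TwoPhaseCommitSinkFunction<Tuple2<String, Long>, ArrayList<Tuple2<String, Long>>, Void> {

    private final String jdbcUrl; // e.g. "jdbc:mysql://localhost:3306/test" -- placeholder

    public MySqlUpsertTwoPhaseSink(String jdbcUrl) {
        super(bufferSerializer(), VoidSerializer.INSTANCE);
        this.jdbcUrl = jdbcUrl;
    }

    private static TypeSerializer<ArrayList<Tuple2<String, Long>>> bufferSerializer() {
        return TypeInformation.of(new TypeHint<ArrayList<Tuple2<String, Long>>>() {})
                .createSerializer(new ExecutionConfig());
    }

    @Override
    protected ArrayList<Tuple2<String, Long>> beginTransaction() {
        return new ArrayList<>(); // a "transaction" here is just a buffer of rows
    }

    @Override
    protected void invoke(ArrayList<Tuple2<String, Long>> txn, Tuple2<String, Long> value, Context ctx) {
        txn.add(value); // collect records between checkpoints
    }

    @Override
    protected void preCommit(ArrayList<Tuple2<String, Long>> txn) {
        // Nothing to flush yet; the buffer is snapshotted together with the checkpoint.
    }

    @Override
    protected void commit(ArrayList<Tuple2<String, Long>> txn) {
        // Upserts keyed by word are idempotent, so a retried commit after recovery is harmless.
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO word_count (word, cnt) VALUES (?, ?) "
                             + "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)")) {
            for (Tuple2<String, Long> row : txn) {
                ps.setString(1, row.f0);
                ps.setLong(2, row.f1);
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (Exception e) {
            throw new RuntimeException("Commit to MySQL failed", e);
        }
    }

    @Override
    protected void abort(ArrayList<Tuple2<String, Long>> txn) {
        txn.clear(); // nothing was written yet, so just drop the buffer
    }
}
```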
Before running the experiments, I am wondering about a few issues, listed below.
The performance of the sink
As you can see, I will use MySQL (an RDB) for the sink. Are there any published benchmark results for using an RDB sink with at-least-once or exactly-once semantics? I expect that a database sink will affect throughput, because connecting to and communicating with the database takes time, but I cannot find any documents or technical blogs showing detailed Flink benchmark results with an RDB sink.
In particular, I am wondering whether exactly-once will perform significantly worse than at-least-once, to the point of being too slow for commercial use.
So my questions are:
Are there any informative results for the two semantics (at-least-once, exactly-once) with a database sink (MySQL or Redis)?
Will end-to-end exactly-once semantics be very slow with the MySQL sink? I will be applying the TwoPhaseCommitSinkFunction.
Thanks.
A few reactions:
Simple, generic Flink benchmarks are pretty useless as predictors of specific application performance. So much depends on what a specific job is doing, and there's a lot of room for optimization.
Exactly-once with two-phase commit sinks is costly in terms of latency, but not so bad with respect to throughput. The issue is that the commit has to be done in concert with a checkpoint, so if you need to checkpoint frequently in order to reduce latency, it is the frequent checkpointing that will hurt throughput (see the configuration sketch after these points).
Unaligned checkpoints and the changelog state backend can make a big difference for some use cases; both can be switched on alongside the checkpoint settings in the sketch below. If you want to include these in your testing, be sure to use Flink 1.16, which saw significant improvements in these areas.
The Flink project has invested quite a bit in having a suite of benchmarks that run on every commit. See https://github.com/apache/flink-benchmarks and http://codespeed.dak8s.net:8000/ for more info.
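For reference, here is a minimal sketch of the checkpointing knobs mentioned above. The 10-second interval and the placeholder pipeline are illustrative; wire in your Kafka source and MySQL/Redis sinks instead.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExperiment {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A two-phase-commit sink only commits when a checkpoint completes, so the
        // checkpoint interval is the main latency/throughput trade-off to measure.
        env.enableCheckpointing(10_000); // every 10 seconds; illustrative value
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Unaligned checkpoints help when backpressure delays barrier alignment.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // The changelog state backend can shorten checkpoint times for larger state.
        env.enableChangelogStateBackend(true);

        // Placeholder pipeline; replace with Kafka -> counting -> MySQL/Redis sink.
        env.fromElements("to", "be", "or", "not", "to", "be").print();

        env.execute("checkpoint-tuning-experiment");
    }
}
```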
I have a particular use case for multiple in-memory key-value maps that need very fast lookup times. They are set just once a day, so they can be considered immutable for all practical purposes. Redis is not an option since it gets CPU-throttled when multiple threads access it, and multi-instance Redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts: around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client-server architecture with multiple readers connecting to a server to read from shared-memory maps. However, I wonder whether such an architecture already exists and has been thoroughly tested for this use case, in which case I should not reinvent the wheel.
So, to sum up: what is my best alternative? TIA.
It might not be suitable for you, but you could try RBLDNSD and store your values in DNS. It's high-performance, results will be cached, and it's easy to read the values from pretty much any programming environment. To write values you'll need to write directly to its zone files, but the format is simple and easy to generate.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in memory key value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh its maps (e.g. Redis PUBLISH, or any other pub/sub framework).
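A rough sketch of that refresh mechanism, assuming the Jedis client and a channel named "maps-refresh" (both are just examples):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class MapRefreshWorker {
    // Each worker keeps its own in-process copy, so reads are plain map lookups.
    private static final AtomicReference<Map<String, String>> LOCAL_MAP =
            new AtomicReference<>(loadDailyMap());

    // The hot read path: no network hop, no locks, just a local lookup.
    static String lookup(String key) {
        return LOCAL_MAP.get().get(key);
    }

    public static void main(String[] args) {
        // subscribe() blocks, so in a real worker run this on a dedicated thread.
        try (Jedis jedis = new Jedis("localhost", 6379)) { // host/port are placeholders
            jedis.subscribe(new JedisPubSub() {
                @Override
                public void onMessage(String channel, String message) {
                    // The daily loader does: PUBLISH maps-refresh <new-version>
                    LOCAL_MAP.set(loadDailyMap()); // swap in the fresh copy atomically
                }
            }, "maps-refresh");
        }
    }

    private static Map<String, String> loadDailyMap() {
        // Placeholder: load the day's key-value data from wherever it is produced.
        return new HashMap<>();
    }
}
```

With this approach the hot read path never touches Redis at all, which sidesteps the CPU throttling and the replication memory cost you mention.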
At the risk of running afoul of the Stack Overflow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency, so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.
I deployed an instance of Solr onto an Ubuntu machine with Tomcat. I have a single-threaded client program that reads data and injects it into Solr. Observing memory and CPU usage, I realized that I still have a lot of resources (in terms of memory and CPU) to spare. Should I change my indexing code to use multiple threads to inject into Solr? Indexing 20 million documents with the current single-threaded program takes about 14 hours, which is why I'm wondering whether I should switch to multi-threading. Thanks in advance for your suggestions and help! :)
Multi-threading while indexing in Solr is widely used.
It's not clear from what you say whether you can also multi-thread the reading from your source, but I think that is the way to go.
I suggest you try it, but first profile your code to see which part is the slowest, and make sure that part is covered by the multi-threading.
Also keep an eye on your commit strategy.
From the Solr documentation (http://wiki.apache.org/solr/SolrPerformanceFactors):
"In general, adding many documents per update request is faster than one per update request. ...
Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection."
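Putting both suggestions together: if you are indexing through SolrJ, something along these lines is a common pattern. The client class, URL, batch size and thread counts below are assumptions to adapt to your SolrJ version and schema, not recommendations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // ConcurrentUpdateSolrClient buffers documents and streams update requests
        // from its own background threads; URL, queue size and thread count are illustrative.
        SolrClient solr = new ConcurrentUpdateSolrClient.Builder("http://localhost:8080/solr/collection1")
                .withQueueSize(10_000)
                .withThreadCount(4)
                .build();

        // Several reader threads pull records from the source and add documents
        // in batches instead of one add() per document.
        ExecutorService readers = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            readers.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 1_000; i++) { // stand-in for reading the real source
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", UUID.randomUUID().toString());
                    doc.addField("text", "example document " + i);
                    batch.add(doc);
                    if (batch.size() == 500) {
                        solr.add(batch); // many documents per update request
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    solr.add(batch);
                }
                return null;
            });
        }
        readers.shutdown();
        readers.awaitTermination(1, TimeUnit.HOURS);

        solr.commit(); // one explicit commit at the end instead of frequent auto-commits
        solr.close();
    }
}
```

The key points are many documents per add() call and a single commit at the end, which matches the documentation quoted above.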
Lighttpd, nginx and others use a range of techniques to provide maximum application performance, such as AIO, sendfile, MMIO, caching, epoll, and lock-free data structures.
My colleague and I have written a little application server which uses many of these techniques and can also serve static files. We tested it with ApacheBench, compared it with lighttpd and nginx, and at least matched their performance for static files from 100 bytes to 1 KB.
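(To be concrete about what I mean by sendfile-style zero copy, here is a rough Java illustration of the idea; the port and file name are placeholders, and a real server would of course write HTTP response headers before the body.)

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopyFileServer {
    public static void main(String[] args) throws IOException {
        // Minimal illustration of the sendfile idea: the file bytes go from the
        // page cache to the socket without being copied through user space.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(8080));
            while (true) {
                try (SocketChannel client = server.accept();
                     FileChannel file = FileChannel.open(Paths.get("index.html"),
                                                         StandardOpenOption.READ)) {
                    long pos = 0, size = file.size();
                    while (pos < size) {
                        pos += file.transferTo(pos, size - pos, client); // sendfile(2) under the hood
                    }
                }
            }
        }
    }
}
```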
However, when we compare the transaction rate over the same static files to that of G-WAN, G-WAN is miles ahead.
I know this question may be a little subjective, but what techniques, apart from the obvious ones I've mentioned, might Pierre Gauthier be using in G-WAN that would enable him to achieve such astounding performance?
Having followed the G-WAN server for years, I have read the (many) discussions covering this question on the old G-WAN forum.
From what I can remember, the points that were repeatedly addressed were the program's:
architecture (specific comparisons were made with nginx, lighty and cherokee)
implementation (how overall branching, request parsing and response building were done)
lean common path (the path followed by all types of requests: dynamic, static, handlers)
Pierre often mentioned other servers to explain what in their specific architecture and implementation was slowing them down.
As time goes on and G-WAN stacks up more and more features (C# script support, a reverse proxy and a load balancer are expected in the next version), those three points above seem to matter more and more.
This is probably why each new release of G-WAN aims to be faster than the previous one: the more work you do, the more extra fat must be eliminated, because its cost gets higher. And, as with a race car or a plane, this is an incremental process, each step calling for more of the other.
If you are looking for the 'secret' of G-WAN's speed, then I guess this is the key point. If you want more details, you should talk directly to the G-WAN author.
Check out G-WAN's timeline. An update on August 8, 2011 might give you an idea of what he is using.
G-WAN Timeline
Pierre mentioned that G-WAN uses its wait-free key-value store heavily in its core functions, which gives it more speed since no locks are used.
He also uses a technique inspired by the Lorenz waterwheel to handle threads. I am not sure how it works, but he said that it allows G-WAN to run faster in every possible case.
I've been running a Cassandra instance without a restart for a few days, for a simple task of storing tweets at 1-2 saves per second. Then Cassandra got really slow and I had to kill and restart it.
Is this Cassandra's expected stability these days? Would it be a good solution to write a daemon to kill/restart it every day or two?
No. Cassandra is widely expected to be more stable than that. If it is not stable, there is a substantial chance you have configured it wrong. It may be attempting to use more memory than you expect, for instance. If you have encountered a bug or defect in Cassandra, it is not one which is afflicting the majority of users.
As for your "restart daemon" plan, I'm going to go with "that's a horrible solution for pretty much everything, and especially so for something that you're trusting with any data you actually care about."
From https://cassandra.apache.org/
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines.
It was (is?) heavily used at Facebook too. I would say it's stable. :)
And btw, I don't think it's meant to be restarted every 1-2 days at all: you use it if you have huge data sets with high availability (HA) requirements, and going down every 2 days is not HA.