I have a Flink application running in Amazon's Kinesis Data Analytics Service (managed Flink cluster). In the app, I read in user data from a Kinesis stream, keyBy userId, and then aggregate some user information. After asking this question, I learned that Flink will split the reading of a stream across physical hosts in a cluster. Flink will then forward incoming events to the host that has the aggregator task assigned to the key space that corresponds to the given event.
With this in mind, I am trying to decide what to use as a partition key for the Kinesis stream that my Flink application reads from. My goal is to limit network traffic between hosts in the Flink cluster in order to optimize performance of my Flink application. I can either partition randomly, so the events are evenly distributed across the shards, or I can partition my shards by userId.
The decision depends on how Flink works internally. Is Flink smart enough to assign the local aggregator tasks on a host a key space that will correspond to the key space of the shard(s) the Kinesis consumer task on the same host is reading from? If this is the case, then sharding by userId would result in ZERO network traffic, since each event is streamed by the host that will aggregate it. It seems like Flink would not have a clear way of doing this, since it does not know how the Kinesis streams are sharded.
OR, does Flink randomly assign each Flink consumer task a subset of shards to read and randomly assign aggregator tasks a portion of the key space? If this is the case, then it seems a random partitioning of shards would result in the least amount of network traffic since at least some events will be read by a Flink consumer that is on the same host as the event's aggregator task. This would be better than partitioning by userId and then having to forward all events over the network because the keySpace of the shards did not align with the assigned key spaces of the local aggregators.
Ten years ago, it was really important to ship as little data as possible over the network. Over the last five years, networks have become so incredibly fast that you notice little difference between accessing a chunk of data over the network or from memory (random access is of course still much faster), so I wouldn't sweat too much about the additional traffic (unless you have to pay for it). Anecdotally, Google Dataflow started to stream all data to a central shuffle server between two tasks, effectively doubling the traffic; but they still experience tremendous speedups on their petabyte network.
So with that in mind, let's move to Flink. Flink currently has no way to dynamically adjust to shards as they can come and go over time. In half a year with FLIP-27, it could be different.
For now, there is a workaround, currently mostly used in Kafka-land (static partitions). DataStreamUtils#reinterpretAsKeyedStream allows you to specify a logical keyBy without a physical shuffle. Of course, you are responsible for ensuring that the provided partitioning corresponds to reality, or else you will get incorrect results.
I have 5 different jobs running in 5 task slots. They all read from Kafka and sink back to Kafka. Kafka load is about 200K messages/sec.
I have another job, let's say job 6, which needs to get some information from these 5 jobs. For each device we make some calculations in those 5 jobs, and depending on the results of these calculations, I need to do something more in job 6.
As a first solution, I used side outputs in these 5 jobs and sent this additional info to a Kafka topic. Then job 6 subscribed to it. But as the workload on Kafka was already very high, this solution doubled the workload on Kafka.
As all task slots run in the same task manager JVM, what I have in mind is developing custom RichSink and RichSource functions which use the same static/singleton Java object. As it will be static, I believe all tasks will have access to the same object. This object will hold a queue (a Java BlockingQueue). Instead of feeding data to Kafka, I will feed this queue from all tasks, and job 6 will process the data received from this queue.
Please let me know if this is a good idea for a big distributed system. I assume clusters will not be a problem because after reading data from the shared queue, I will call keyBy(), so I hope Flink will handle that part. Also please let me know about any dangerous points, and share tips if you have any.
You essentially have an in-memory data store for bridging between two jobs. One of several issues here is that if the Task Manager crashes, you lose this data, thus eliminating one of the key benefits of Flink (guaranteed at-least-once or exactly-once processing).
You'd also have to ensure that you've got at least one of your job 6 source operators running in a slot on every TM instance. Flink doesn't yet support the ability to easily control which sub-tasks run in what slots, though if you set the downstream job's parallelism == the number of slots then you can work around that issue.
I'm sure there are other issues, I just haven't spent much time thinking about it :)
Depending on the version of Flink you're using, I wonder if Flink's new Table Store would be an option for you.
The GlobalAggregateManager in Flink may be helpful.
This can be used to share the state amongst parallel tasks in a job. However, performance may be poor in high-throughput scenarios.
Here are some demos of these projects: Arctic, Flink.
I am in the process of creating an ETL and fraud management module using Flink to analyze a sequence of real-time credit card transactions.
All transactions are received by an exposed API that pushes the data into a Kafka topic.
First, the received data needs to be checked and cleaned, and then stored in a database.
The next step is a fraud analysis of these transactions.
In this first step, with Flink, I have to check in the card database that the card is known before continuing. The problem is that there are around a billion cards in this database, and new cards can be added over time.
So I'm not sure whether I can cache all the card numbers in memory, or how to handle this check effectively: is Flink able to handle some kind of sliding cache to check cards for existence in batches?
What you might do is mirror the card database into Flink's key-partitioned state, either on-heap, or using RocksDB if you want this to spill to disk. Key-partitioned state is sharded across the cluster, so if you do want to keep the entire card database in memory, you can scale up the cluster until that's feasible.
To keep only recently seen values, you could rely on state TTL to expire records that haven't been accessed recently.
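To make the expiry idea concrete, here is a minimal sketch in plain Python. It is not Flink's API (in Flink you would configure this through StateTtlConfig on the state descriptor); the class name `TtlSeenSet` and its methods are hypothetical, and it assumes the TTL is refreshed on both reads and writes.

```python
class TtlSeenSet:
    """Toy stand-in for keyed state with a time-to-live (not a Flink class)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._last_access = {}  # key -> timestamp of the last read or write

    def touch(self, key: str, now: float) -> None:
        # Record (or refresh) the key, e.g. when a card is first seen.
        self._last_access[key] = now

    def contains(self, key: str, now: float) -> bool:
        last = self._last_access.get(key)
        if last is None:
            return False
        if now - last > self.ttl:
            # Entry is older than the TTL: drop it and report a miss.
            del self._last_access[key]
            return False
        # A successful read counts as an access and refreshes the TTL.
        self._last_access[key] = now
        return True


seen = TtlSeenSet(ttl_seconds=10)
seen.touch("card-1", now=0)
print(seen.contains("card-1", now=5))   # still within the TTL
print(seen.contains("card-1", now=30))  # expired, so treated as unknown
```

On a miss you would still have to fall back to the card database, so this only bounds the state size; it does not replace the source of truth.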
An alternative: Flink SQL has support for doing streaming lookup joins against JDBC databases, and you can configure caching for that.
I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my Kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application.
My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or, do I need to shard the Kinesis stream by userId before it is consumed by Flink?
... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?
The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.
Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?
Yes, that's right.
If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.
Assuming there are more than 64 shards available to read from, then in each of the 64 tasks, the source will read from one or more shards, and then distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, each source instance will find that a few of the events it reads are for userIds it has been assigned to handle, and these go to the local aggregator. The rest of the events will each need to be sent out to one of the other 63 aggregators, depending on which worker is responsible for each userId.
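A rough way to see how few events stay local is to simulate it. The sketch below is plain Python, not Flink code. It assumes each event is read by a uniformly random source subtask, and it uses `user_id % num_subtasks` as a stand-in for Flink's real key assignment (which hashes keys into key groups and maps ranges of key groups to subtasks). The function name `local_fraction` is made up for illustration.

```python
import random


def local_fraction(num_subtasks: int, num_events: int, seed: int = 0) -> float:
    """Estimate the share of events whose reader subtask also owns their key."""
    rng = random.Random(seed)
    local = 0
    for _ in range(num_events):
        user_id = rng.randrange(1_000_000)
        # Subtask whose source happened to read this event (random sharding).
        reader = rng.randrange(num_subtasks)
        # Subtask responsible for this userId (simplified stand-in for keyBy).
        owner = user_id % num_subtasks
        if reader == owner:
            local += 1
    return local / num_events


# With 64 subtasks, roughly 1 event in 64 stays with the subtask that read it.
print(local_fraction(num_subtasks=64, num_events=200_000))
```

Under these assumptions about 1/64 of events (around 1.6%) stay with the subtask that read them, and the rest are forwarded to another subtask, which may or may not sit on the same host.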
My problem is very similar to this one, except that backpressure in my application is reported as "OK".
I thought the problem was that my local machine was not powerful enough, so I created a 72-core Windows machine, where I am reading data from Kafka, processing it in Flink and then writing the output back to Kafka. I have checked that writing to the Kafka sink is not causing any issues.
All I am looking for are the areas that may cause throughput to be split among task slots, rather than increased, when I increase the parallelism.
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
Working of application: Data is coming from Kafka (1 partition) which is deserialized by Flink (throughput here is 5k/sec). Then the deserialized message is passed through basic schema validation (Throughput here is 2k/sec).
Even after increasing the parallelism to 2, throughput at Level 1 (deserializing stage) remains same and doesn't increase two fold as per expectation.
I understand, without the code, it is difficult to debug so I am asking for the points which you can suggest for this problem, so that I can go back to my code and try that.
We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements to read data in parallel. The most important one is that the source is able to actually split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions as your maximum consumer parallelism.
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also calculate the number of partitions by the maximum throughput you want to achieve. If you need to achieve 50k/s, you need at least 10 partitions. You should use more to also catch up in case of reprocessing or failure recovery.
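The sizing rule above is simple arithmetic; a small sketch in Python makes it explicit. The function name and the `headroom` factor are illustrative assumptions, not anything Kafka or Flink provides.

```python
import math


def partitions_needed(target_rate: float, per_partition_rate: float,
                      headroom: float = 1.0) -> int:
    """Minimum partition count to sustain target_rate (records per second).

    headroom > 1.0 reserves spare capacity for catching up after
    reprocessing or failure recovery.
    """
    return math.ceil(target_rate * headroom / per_partition_rate)


print(partitions_needed(50_000, 5_000))                # 10
print(partitions_needed(50_000, 5_000, headroom=1.5))  # 15
```

How much headroom to reserve is a judgment call that depends on how quickly you need to catch up after an outage.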
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2kb/second) and sending them off to a remote server once per minute, resulting in ~18 mil records and 200MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period, guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable.
Was thinking about using Elasticsearch (then maybe use X-Pack or Sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
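For reference, the numbers can be checked with a few lines of Python. The sensor count of 5 is the assumption made in this answer, not a figure from the question.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_records(records_per_second: int, sensors: int = 1) -> int:
    """Records produced per day at a steady per-sensor rate."""
    return records_per_second * SECONDS_PER_DAY * sensors


def average_rate(records_per_day: int) -> float:
    """Average records per second for a given daily volume."""
    return records_per_day / SECONDS_PER_DAY


print(daily_records(200))             # 17280000 per sensor per day
print(daily_records(200, sensors=5))  # 86400000 per day for 5 sensors
print(average_rate(90_000_000))       # a little over 1000 per second
```

The exact figures (about 17.3M records per sensor per day, 1000 records per second for 5 sensors) are slightly below the rounded ones used above, but the conclusion is the same order of magnitude.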
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or Elasticsearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends on 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.