A couple of points I'll volunteer up front:
I'm new to Flink (I've been working with it for about a month now).
I'm using Kinesis Analytics (the AWS-hosted Flink solution). By all accounts this doesn't really limit the versatility of Flink or the options for fault tolerance, but I'll call it out anyway.
We have a fairly straightforward sliding-window application. A keyed stream organizes events by a particular key, an IP address for example, and then processes them in a ProcessFunction. We mostly use this to keep track of counts of things: for example, how many logins there have been for a particular IP address in the last 24 hours. Every 30 seconds we count the events in the window, per key, and save that value to an external data store. State is also updated to reflect the events in that window, so that old events expire and don't take up memory.
Interestingly enough, cardinality is not an issue. If we have 200k distinct users logging in over a 24-hour period, everything is perfect. Things start to get hairy when one IP logs in 200k times in 24 hours. At that point, checkpoints start to take longer and longer. An average checkpoint takes 2-3 seconds, but with this user behaviour checkpoints start to take 5 minutes, then 10, then 15, then 30, then 40, and so on.
Surprisingly, the application can run smoothly in this condition for a while, perhaps 10 or 12 hours. But sooner or later checkpoints fail completely, and then our max iterator age starts to spike and no new events are processed.
I've tried a few things at this point:
1. Throwing more metal at the problem (auto-scaling turned on as well)
2. Fussing with CheckpointingInterval and MinimumPauseBetweenCheckpoints (https://docs.aws.amazon.com/kinesisanalytics/latest/apiv2/API_CheckpointConfiguration.html)
3. Refactoring to reduce the footprint of the state we store
(1) didn't really do much.
(2) This appeared to help, but then another traffic spike, much larger than anything we'd seen before, squashed any of the benefits.
(3) It's unclear if this helped. I think our application's memory footprint is fairly small compared to what you'd see at a Yelp or an Airbnb, both of which run Flink clusters for massive applications, so I can't imagine that my state is really problematic.
I'll say that I'm hoping we don't have to deeply change what the application outputs. This sliding window is a really valuable piece of data.
EDIT: Somebody asked what my state looks like. It's a ValueState[FooState]:
case class FooState(
  entityType: String,
  entityID: String,
  events: List[BarStateEvent],
  tableName: String,
  baseFeatureName: String
)
case class BarStateEvent(target: Double, eventID: String, timestamp: Long)
EDIT: I want to highlight something that user David Anderson said in the comments:
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events.
This was essential. For anybody else trying to walk this path, I couldn't find a workable solution that didn't bucket events into some time slice. My final solution buckets events into batches of 30 seconds and then writes those into map state as David suggested. This seems to do the trick: during our periods of high load, checkpoints stay at 3 MB and always finish in under a second.
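For anyone who wants the shape of that solution, here is a minimal sketch. To be clear, this is my simplification, not the exact production code: it keeps only a per-slice count in MapState rather than lists of events, assumes event time with watermarks, and emits the per-key total instead of writing to an external store. The class and variable names are made up.

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class SliceCounter extends KeyedProcessFunction[String, BarStateEvent, (String, Long)] {
  private val sliceMillis = 30 * 1000L                // 30-second slices
  private val retentionMillis = 24 * 60 * 60 * 1000L  // 24-hour sliding window

  // One MapState entry per 30-second slice, so an update touches one small
  // value instead of deserializing 24 hours' worth of events for the key.
  private lazy val slices: MapState[java.lang.Long, java.lang.Long] =
    getRuntimeContext.getMapState(
      new MapStateDescriptor("slices", classOf[java.lang.Long], classOf[java.lang.Long]))

  override def processElement(
      event: BarStateEvent,
      ctx: KeyedProcessFunction[String, BarStateEvent, (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val slice = event.timestamp - (event.timestamp % sliceMillis)
    val count = Option(slices.get(slice)).map(_.toLong).getOrElse(0L)
    slices.put(slice, count + 1)
    // Fire when this slice closes, so totals are emitted every 30 seconds.
    ctx.timerService.registerEventTimeTimer(slice + sliceMillis)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, BarStateEvent, (String, Long)]#OnTimerContext,
      out: Collector[(String, Long)]): Unit = {
    val cutoff = timestamp - retentionMillis
    var total = 0L
    val expired = List.newBuilder[java.lang.Long]
    for (entry <- slices.entries.asScala) {
      if (entry.getKey < cutoff) expired += entry.getKey
      else total += entry.getValue
    }
    expired.result().foreach(slices.remove)  // slices older than 24 hours fall out of state
    out.collect((ctx.getCurrentKey, total))
  }
}

The property that matters is that every state access touches a single slice, so the RocksDB backend never has to deserialize a hot key's entire 24 hours of data.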
If you have a sliding window that is 24 hours long, and it slides by 30 seconds, then every login is assigned to each of 2880 separate windows. That's right, Flink's sliding windows make copies. In this case 24 * 60 * 2 copies.
If you are simply counting login events, then there is no need to actually buffer the login events until the windows close. You can instead use a ReduceFunction to perform incremental aggregation.
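For example (a sketch: logins and its ip field stand in for whatever your stream actually contains):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Map each login to an (ip, 1) pair and let the window keep a running sum.
val loginCounts = logins
  .map(login => (login.ip, 1L))
  .keyBy(_._1)
  .window(SlidingEventTimeWindows.of(Time.hours(24), Time.seconds(30)))
  .reduce((a, b) => (a._1, a._2 + b._2))

Each of the 2880 overlapping windows then holds a single (ip, count) pair rather than a copy of every login event.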
My guess is that you aren't taking advantage of this optimization, and thus when you have a hot key (ip address), then the instance handling that hot key has a disproportionate amount of data, and takes a long time to checkpoint.
On the other hand, if you are already doing incremental aggregation, and the checkpoints are as problematic as you describe, then it's worth looking more deeply to try to understand why.
One possible remediation would be to implement your own sliding windows using a ProcessFunction. By doing so you could avoid maintaining 2880 separate windows, and use a more efficient data structure.
EDIT (based on the updated question):
I think the issue is this: When using the RocksDB state backend, state lives as serialized bytes. Every state access and update has to go through ser/de. This means that your List[BarStateEvent] is being deserialized and then re-serialized every time you modify it. For an IP address with 200k events in the list, that's going to be very expensive.
What you should do instead is to use either ListState or MapState. These state types are optimized for RocksDB. The RocksDB state backend can append to ListState without deserializing the list. And with MapState, each key/value pair in the map is a separate RocksDB object, allowing for efficient lookups and modifications.
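As a sketch of what the ListState variant looks like (the surrounding function is illustrative and would run on a keyed stream):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.util.Collector

class EventBuffer extends RichFlatMapFunction[BarStateEvent, Long] {
  private lazy val events: ListState[BarStateEvent] =
    getRuntimeContext.getListState(
      new ListStateDescriptor("events", classOf[BarStateEvent]))

  override def flatMap(event: BarStateEvent, out: Collector[Long]): Unit = {
    // With RocksDB this is an append (a merge), not a read-modify-write
    // of the entire serialized list.
    events.add(event)
  }
}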
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events. There's an example of doing something similar (but with tumbling windows) in the Flink docs.
Or, if your state can fit into memory, you could use the FsStateBackend. Then all of your state will be objects on the JVM heap, and ser/de will only come into play during checkpointing and recovery.
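Configuring that looks roughly like this (the checkpoint URI is illustrative, and note that on a managed service like Kinesis Data Analytics you may not get to choose the state backend yourself):

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Working state stays as objects on the JVM heap; only checkpoints are
// serialized and written to the configured location.
env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"))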
Related
Could I set a DataStream time window to a large value, like 24 hours? The reason for the requirement is that I want to compute statistics over the latest 24 hours of client traffic to the web site. That way, I can check for security violations.
For example, check whether a user account used multiple source IPs to log on to the web site, or check how many unique pages a certain IP accessed in the latest 24 hours. If a security violation is detected, the configured action will be taken in real time, such as blocking the source IP or locking the relevant user account.
The throughput of the web site is around 200 Mb/s. I think setting the time window to a large value will cause memory issues. Should I instead store the statistics results of each smaller time window, say 5 minutes, in a database, and then compute the statistics with database queries over the data generated in the latest 24 hours?
I don't have any experience with big data analysis. Any advice will be appreciated.
It depends on what type of window and aggregations we're talking about:
Window where no eviction is used: in this case Flink only saves one accumulated result per physical window. This means that for a sliding window of 10 hours with a 1-hour slide that computes a sum, it has to keep that number 10 times, once per physical window; a tumbling window (regardless of its parameters) only saves the result of the aggregation once. However, this is not the whole story: because state is keyed, you have to multiply all of this by the number of distinct values of the field used in the group by. For example, that 10-hour window sliding by 1 hour with 1,000 distinct keys needs 10 × 1,000 accumulators.
Window with eviction: saves all events that were processed but haven't yet been evicted.
In short, memory consumption is generally not tied to how many events you processed or to the window's duration, but to:
The number of windows (considering that one sliding window actually maps to several physical windows).
The cardinality of the field you're using in the group by.
All things considered, I'd say a simple 24-hour window has an almost nonexistent memory footprint.
You can check the relevant code here.
I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck is the rate at which the query returns entities, and I wonder how to improve the query's performance.
What I do in general:
everything works in a task queue, so we have plenty of time (the 10-minute deadline).
I run a query over the ndb entities in order to select which entities need to be processed.
as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
the stored data is going to be large (say 500K-1M entities), and there is a chance that the 10-minute deadline is not enough. Therefore, when the task approaches the task queue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward:
cursor = None
has_more = True
while has_more and theres_more_time:
    # The cursor must be passed back in, or every iteration refetches page one.
    entities, cursor, more = query.fetch_page(1000, start_cursor=cursor)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_page_async() calls. In all cases the overall time remained the same. The more parallel fetch_page_async() calls there are, the longer each one takes to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have any impact either.
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not entirely unsuitable), as the instance may run out of memory, plus I don't get a cursor in case I need to spawn another task (in fact I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity to see how it performs, and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approximately 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is much like the first approach. It makes use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why fetch(), fetch_page() and iter() behave so differently, and how to make either fetch_page() or iter() perform as well as fetch(). Another critical question is whether these throughputs (20-60K entities per minute, depending on the API call) are the best we can do in GAE.
I'm aware of the MapReduce API, but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries, and I don't want to scan all the Datastore entities (that would be too costly and slow; the query may return only a few results). Last but not least, I have to stick with GAE; resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?
In case anyone is interested, I was able to significantly increase the throughput of the data processing by redesigning the component; it was suggested that I change the data models, but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task, instead of calling multiple fetch_page_async() from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially, utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40-60 seconds!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.
The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).
Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.
I need a sequencer for the entire application's data.
Using a counter entity is a bad idea (the 5 writes per second limit), and sharded counters are not an option.
A GMT timestamp seems unsafe due to clock variances among servers, plus the possibility of a server's time being set or reset.
Any idea?
How do I get an entity property which I can query to find all entities changed since a given value?
TIA
Distributed datastores such as the App Engine datastore don't have a global sequence: there's literally no way to determine whether entity A was written to server A' before entity B was written to server B' if those events occur sufficiently close together, unless you have a single machine mediating all transactions and serializing them, which places a hard upper bound on how scalable your system can be.
For your actual practical problem, the easiest solution would be to assign a modification timestamp to each record, and each time you need to sync, look for records newer than (that timestamp) - (epsilon), where epsilon is a short time interval that is longer than the expected difference in time synchronization between servers (something like 10 seconds should be ample). Your client can then discard any duplicate records it receives.
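A minimal, language-agnostic sketch of that rule (the record shape and field names are made up; in practice the filter would be your datastore query):

case class Record(id: String, updatedMillis: Long)

// Epsilon must be longer than the worst expected clock skew between servers.
val epsilonMillis = 10 * 1000L

def changedSince(records: Seq[Record], lastSyncMillis: Long): Seq[Record] =
  records.filter(_.updatedMillis > lastSyncMillis - epsilonMillis)

Because of the epsilon overlap, a record near the boundary can show up in two consecutive syncs, which is why the client has to discard duplicates (e.g. by id).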
I hold messages in a map for each user in the datastore. It's held as an unindexed serialized value keyed by a unique name. A user can message many users at once. Currently I execute a batch get for the (e.g.) 20 targets, update the serialized value in each, then execute a batch put. The serialized message size is small enough to be unimportant, around 1KB.
This is quick for the user; the real time shown in Appstats is 90 ms. However, the CPU-time cost is 918 ms. This causes warnings and may become expensive with high usage, or cause trouble if I wish to message 50 users. Is there any way to reduce this CPU-time cost, either with datastore tweaks or an obvious change to the architecture I've missed? A task queue solution would remove the warnings but would really only redistribute the cost.
EDIT: The datastore key is the username of the receiver; the value is the messages stored as a serialized Map where the key is the username of the sender and Message is a simple object holding two ints. There are two types of request. The 'update' type described above, where the message map is retrieved, the new message is added to the map, and the map is stored. The 'get' type is the inbox owner reading the messages, which is a simple get based on key. My thinking was that even if this were split out into a multi-value relationship or similar, it might improve the fidelity (allowing two updates at once), but the amount of put work would still be the same provided it's a simple key-value approach.
It sounds like you're already doing things fairly efficiently. It's not likely you're going to be able to reduce this substantially. Less than 1000 cpu milliseconds per request is a fairly reasonable amount anyway.
There are two things you might gain by splitting entities up: if your lists are long, you save the CPU cost of reading and writing large entities when you only need to read or modify some small part of them, and you save on transaction collisions. That is, if several tasks need to add items to the queue simultaneously, you can do it without transaction retries, saving you CPU time.
I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if it times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably by assigning another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay, that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: how do you prevent starvation? You could assign another field on your model to the time of insertion, call it add_time. Your query would then be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low-throughput queues.
What if your work-item write rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This results in a Bigtable row prefix for your entities like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work-item insert hits the same Bigtable tablet server, the server that currently owns the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
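To make the scattering concrete, here is a sketch of building such a key (the hash choice and the key format are illustrative):

import java.security.MessageDigest

// Prefixing the row key with a hash of the sequence number spreads
// consecutive enqueues across the lexically sorted keyspace, instead of
// stacking them all at the head of a single descending index.
def workIndexKey(sequenceNumber: Long): String = {
  val md5 = MessageDigest.getInstance("MD5")
  val prefix = md5.digest(sequenceNumber.toString.getBytes("UTF-8"))
    .take(4)
    .map(b => f"$b%02x")
    .mkString
  s"$prefix|$sequenceNumber"
}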
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!