We have a keyed process function that uses state, with a keyBy applied immediately before it. The keyBy attribute involves transactional values, so we expect many keys to be created, but these keys will be short-lived and we don't expect them to last for more than a day. Is there any way to manually delete all the state associated with a key, and the key itself, from within the keyed process function?
Will simply setting the value of the associated state variables to null enable Flink to clean it up?
We are worried that even a very small amount of residual data left behind for every key would accumulate and contribute to a huge state size.
One solution would be to configure state TTL so that the state is automatically deleted after some period of not being used. Or you can register a keyed timer in your keyed process function, and call clear() in the onTimer method to delete the state when the timer fires.
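For illustration, here is a minimal sketch of the timer-based approach (the String event/result types, the "some-state" name, and the one-day delay are placeholders, not from the question; it assumes event-time timestamps are assigned):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ExpiringStateFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> someState;

    @Override
    public void open(Configuration parameters) {
        someState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("some-state", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        // ... business logic that reads/updates someState ...

        // schedule cleanup one day after this event's timestamp
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 24 * 60 * 60 * 1000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // clearing every piece of state registered for this key also removes
        // the key's entry from the state backend
        someState.clear();
    }
}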
I have a use case where I don't want the window for each key to keep on growing with each element. The requirement is to have a new window for each key that starts with the first element and ends exactly 1 minute later.
I assume that it's your intention to collect all of the events that arrive for the same key within a minute of the first event, and then produce some result at the end of that minute.
This can be implemented in a pretty straightforward way with a KeyedProcessFunction. You can maintain ListState for each key, and append arriving records to the list. You can use a timer to trigger your end-of-window logic.
See the tutorial on process functions from the Flink documentation for more.
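A rough sketch of that approach (the event type is simplified to String and the names are illustrative, not a definitive implementation):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class OneMinuteCollector extends KeyedProcessFunction<String, String, List<String>> {

    private transient ListState<String> buffer;
    private transient ValueState<Long> windowEnd;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", String.class));
        windowEnd = getRuntimeContext().getState(
                new ValueStateDescriptor<>("window-end", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<List<String>> out) throws Exception {
        if (windowEnd.value() == null) {
            // first element for this key: start a 1-minute window from its timestamp
            long end = ctx.timestamp() + 60_000L;
            windowEnd.update(end);
            ctx.timerService().registerEventTimeTimer(end);
        }
        buffer.add(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<String>> out) throws Exception {
        List<String> result = new ArrayList<>();
        for (String e : buffer.get()) {
            result.add(e);
        }
        out.collect(result);
        // clear all state so nothing lingers for this key
        buffer.clear();
        windowEnd.clear();
    }
}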
I have a streaming job which listens to events and does operations on them using CEP.
The flow is:
stream = source
    .assignTimestampsAndWatermarks(...)
    .filter(...);

CEP.pattern(stream.keyBy(e -> e.getId()), pattern)
    .process(new PatternMatchProcessFunction())
    .addSink(...);
The keys are all short-lived, and the process function doesn't contain any state of its own, so there is nothing we could clean up by setting a state TTL. We are using event-time characteristics.
My question: how does Flink handle the expired keys, and does this have any impact on GC?
If Flink removes the keys itself, at what frequency does this happen?
We are facing GC issues; the job gets stuck about 3 hours after being deployed.
We are doing memory tuning, but want to rule this case out.
FsStateBackend will hold the state in-memory for your CEP operator.
What Flink does for CEP is it buffers the elements in a MapState[Long, List[T]] which maps a timestamp to all elements that arrived for that time. Once a watermark occurs, Flink will process the buffered events as follows:
// 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in event time order and custom comparator if exists by feeding them in the NFA
// 3) advance the time to the current watermark, so that expired patterns are discarded.
// 4) update the stored state for the key, by only storing the new NFA and MapState iff they have state to be used later.
// 5) update the last seen watermark.
Once the events have been processed, Flink will advance the watermark, which will cause old entries in the state to be expired (you can see this inside NFA.advanceTime). This means that the eviction of elements in your state depends on how often watermarks are created and pushed through your stream.
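If you need to influence that frequency, the knob for periodic watermark generation looks roughly like the fragment below (Flink 1.11+ WatermarkStrategy API; the Event type with a getTimestamp() accessor and the 5-second out-of-orderness bound are assumptions for the sketch):

// how often periodic watermarks are emitted (200 ms is the default)
env.getConfig().setAutoWatermarkInterval(200L);

DataStream<Event> withWatermarks = source
        .assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.getTimestamp()));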
When thinking about the act of keying by something I traditionally think of the analogy of throwing all the events that match the key into the same bucket. As you can imagine, when the Flink application starts handling lots of data what you opt to key by starts to become important because you want to make sure you clean up state well. This leads me to my question, how exactly does Flink clean up these "buckets"? If the bucket is empty (all the MapStates and ValueStates are empty) does Flink close that area of the key space and delete the bucket?
Example:
Incoming Data Format: {userId, computerId, amountOfTimeLoggedOn}
Key: UserId/ComputerId
Current Key Space:
Alice, Computer 10: Has 2 events in it. Both events are stored in state.
Bob, Computer 11: Has no events in it. Nothing is stored in state.
Will Flink come and remove Bob, Computer 11 from the Key Space eventually or does it just live on forever because at one point it had an event in it?
Flink does not store any data for state keys which do not have any user value associated with them, at least in the existing state backends: Heap (in memory) or RocksDB.
The Key Space is virtual in Flink; Flink does not make any assumptions about which concrete keys can potentially exist. There are no pre-allocated buckets per key or subset of keys. Only once the user application writes some value for some key does it occupy storage.
The general idea is that all records with the same key are processed on the same machine (somewhat like being in the same bucket as you say). The local state for a certain key is also always kept on the same machine (if stored at all). This is not related to checkpoints though.
For your example, if some value was written for [Bob, Computer 11] at some point in time and then subsequently removed, Flink will remove the value completely, along with the key.
Short Answer
It cleans up with the help of the Time To Live (TTL) feature of Flink state and the Java Garbage Collector (GC). The TTL feature removes any reference to the state entry, and the GC takes back the allocated memory.
Long Answer
Your question can be divided into 3 sub-questions:
I will try to be as brief as possible.
How does Flink partition the data based on Key?
For an operator over a keyed stream, Flink partitions the data on a key with the help of a consistent hashing algorithm. It creates max_parallelism buckets (key groups). Each operator instance is assigned one or more of these buckets. Whenever a datum is to be sent downstream, the key is assigned to one of those buckets and the datum is consequently sent to the concerned operator instance. No key is stored here because the ranges are calculated mathematically. Hence no area is ever cleared and no bucket is ever deleted. You can create any type of key you want; it won't affect the memory in terms of key space or ranges. A simplified sketch of the assignment is shown below.
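This is only a simplification of Flink's KeyGroupRangeAssignment logic; the real code additionally applies murmur hashing on top of hashCode(), omitted here to keep the sketch short:

static int assignToKeyGroup(Object key, int maxParallelism) {
    // mask to keep the result non-negative; Flink murmur-hashes the hashCode first
    return (key.hashCode() & 0x7fffffff) % maxParallelism;
}

static int operatorIndexForKeyGroup(int keyGroupId, int maxParallelism, int parallelism) {
    // each operator instance owns a contiguous range of key groups
    return keyGroupId * parallelism / maxParallelism;
}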
How does Flink store state with a Key?
All operator instances have an instance-level state store. This store defines the state context of that operator instance and it can store multiple named-state-storages e.g. "count", "sum", "some-name" etc. These named-state-storages are Key-Value stores that can store values based on the key of the data.
These KV stores are created when we initialize the state with a state descriptor in the open() method of an operator, e.g. getRuntimeContext().getState(descriptor) for a ValueState.
These KV stores will store data only when something is needed to be stored in the state. (like HashMap.put(k,v)). Thus no key or value is stored unless state update methods (like update, add, put) are called.
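For instance, a minimal sketch of that flow inside a keyed function (the "count" state name and the String/Long types are arbitrary choices, not from the question):

private transient ValueState<Long> count;

@Override
public void open(Configuration parameters) {
    // creates the named KV store for this operator instance; nothing is stored per key yet
    count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
}

@Override
public void processElement(String event, Context ctx, Collector<Long> out) throws Exception {
    Long current = count.value();              // null until something is written for this key
    long updated = (current == null ? 0L : current) + 1;
    count.update(updated);                     // only now does this key get an entry in the store
    out.collect(updated);
}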
So,
If Flink hasn't seen a key, nothing is stored for that key.
If Flink has seen the key but didn't call the state update methods, nothing is stored for that key.
If a state update method is called for a key, the key-value pair will be stored in the KV store.
How does Flink clean up the state for a Key?
Flink does not delete state unless the user requests it or deletes it manually. As mentioned earlier, Flink has the TTL feature for state. The TTL marks the state entry as expired, and the entry is removed when a cleanup strategy is invoked. These cleanup strategies vary with the backend type and the time of cleanup. For the heap state backend, the entry is removed from the state table, i.e. any reference to the entry is dropped, and the memory occupied by the now-unreferenced entry is reclaimed by the Java GC. For the RocksDB state backend, Flink simply calls the native delete method of RocksDB.
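As a concrete illustration, enabling TTL with explicit cleanup strategies looks roughly like the fragment below (the one-day TTL and the "count" state name are arbitrary choices for the sketch):

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupIncrementally(10, false)          // heap backend: check a few entries per state access
        .cleanupInRocksdbCompactFilter(1000L)     // RocksDB backend: drop expired entries during compaction
        .build();

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("count", Long.class);
descriptor.enableTimeToLive(ttlConfig);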
I have two tables in DynamoDB. One has data about homes, one has data about businesses. The homes table has a list of the closest businesses to it, with walking times to each of them. That is, the homes table has a list of IDs which refer to items in the businesses table. Since businesses are constantly opening and closing, both these tables need to be updated frequently.
The problem I'm facing is that, when either one of the tables is updated, the other table will have incorrect data until it is updated itself. To make this clearer: let's say one business closes and another one opens. I could update the businesses table first to remove the old business and add the new one, but the homes table would then still refer to the now-removed business. Similarly, if I updated the homes table first to refer to the new business, the businesses table would not yet have that new business's data. Whichever table I update first, there will always be a period of time where the two tables are not in sync.
What's the best way to deal with this problem? One way I've considered is to do all the updates to a secondary database and then swap it with my primary database, but I'm wondering if there's a better way.
Thanks!
Dynamo only offers atomic operations at the item level, not at the transaction level, but you can get something similar to an atomic transaction by enforcing some rules in your application.
Let's say you need to run a transaction with two operations:
Delete Business(id=123) from the table.
Update Home(id=456) to remove association with Business(id=123) from the home.businesses array.
Here's what you can do to mimic a transaction:
Generate a timestamp for locking the items
Let's say our current timestamp is 1234567890. Using a timestamp will allow you to clean up failed transactions (I'll explain later).
Lock the two items
Update both Business-123 and Home-456 and set an attribute lock=1234567890.
Do not change any other attributes yet on this update operation!
Use a ConditionExpression (check the Developer Guide and API) to verify that attribute_not_exists(lock) holds before updating. This way you're sure no other process is using the same items.
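A hedged sketch of that locking update using the AWS SDK for Java v2 (the "Homes" table, the "id" key attribute, and the method name are assumptions for the example):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

// Try to place the lock on a home item; returns false if another process already holds it.
static boolean tryLockHome(DynamoDbClient dynamo, String homeId, long lockTimestamp) {
    try {
        dynamo.updateItem(UpdateItemRequest.builder()
                .tableName("Homes")                                    // assumed table name
                .key(Map.of("id", AttributeValue.builder().s(homeId).build()))
                .updateExpression("SET #lock = :ts")
                .conditionExpression("attribute_not_exists(#lock)")    // fail if already locked
                .expressionAttributeNames(Map.of("#lock", "lock"))
                .expressionAttributeValues(Map.of(
                        ":ts", AttributeValue.builder().n(Long.toString(lockTimestamp)).build()))
                .build());
        // lock acquired; perform the real change with condition "#lock = :ts",
        // then release the lock with an update expression "REMOVE #lock"
        return true;
    } catch (ConditionalCheckFailedException e) {
        return false;   // another process holds the lock; back off and retry
    }
}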
Handle update lock responses
Check if both updates succeeded to Home and Business. If yes to both, it means you can proceed with the actual changes you need to make: delete the Business-123 and update the Home-456 removing the Business association.
For extra care, also use a ConditionExpression in both updates again, but now ensuring that lock == 1234567890. This way you're extra sure no other process overwrote your lock.
If both updates succeed again, you can consider the two items updated and consistent to be read by other processes. To do this, run a third update removing the lock attribute from both items.
When one of the operations fails, you may try again X times, for example. If it fails all X times, make sure the process cleans up the other lock that succeeded previously.
Enforce the transaction lock throughout your code
Always use a ConditionExpression in any part of your code that may update/delete Home and Business items. This is crucial for the solution to work.
When reading Home and Business items, you'll need to do the following (this may not be necessary for all reads; you'll have to decide whether you need to ensure consistency from start to finish while working with an item read from the DB):
Retrieve the item you want to read
Generate a lock timestamp
Update the item with lock=timestamp using a ConditionExpression
If the update succeeds, continue using the item normally; if not, wait one or two seconds and try again;
When you're done, update the item removing the lock
Regularly clean up failed transactions
Every minute or so, run a background process to look for potentially failed transactions. If your processes take at most 60 seconds to finish and there's an item with a lock value older than, say, 5 minutes (remember the lock value is the time the transaction started), it's safe to say that this transaction failed at some point and whatever process was running it didn't properly clean up the locks.
This background job would ensure that no items stay locked forever.
Beware that this implementation does not ensure a real atomic and consistent transaction in the sense traditional ACID DBs do. If this is mission-critical for you (e.g. you're dealing with financial transactions), do not attempt to implement this. Since you said you're OK if atomicity is broken on rare failure occasions, you may live with it happily. ;)
Hope this helps!
In my application I run a cron job to loop over all users (2,500 users) and choose an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one.
To achieve this I have to run this cron job and loop over the users one by one sequentially, pick the item for each, remove it from the list (so it won't be chosen by the next user(s)), then move on to the next user.
The number of users/items in my system is getting bigger and bigger every single day, and this cron job now takes 2 hours to assign items to all users.
I need to improve this. One of the things I've thought about is using threads, but I can't do that since I'm using automatic scaling, so I started thinking about push queues: when the cron job runs, it would loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem pushes a task to a servlet that handles it and chooses the best item for this person based on his data.
Let's say I start doing that; what would be the best/most robust solution to avoid assigning an item to more than one user?
Since I'm using basic scaling and 8 instances, I can't rely on static variables.
One of the things that came to my mind is to create a table in the DB that accepts only unique items and then insert the taken items into it; if the insertion succeeds, it means nobody else took this item, so I can just assign it to that person. But this will lower performance a bit because I'd need to make a DB write operation with every call (which I want to avoid).
I also thought about Memcache. It's really fast but not robust enough: if I save a Set of items into it (which will accept only unique items) and more than one thread tries to access this Set at the same time to update it, only one thread will be able to save its data and the other threads' data might be overwritten and lost.
I hope you guys can help to find a solution for this problem, thanks in advance :)
First, I would advise against relying solely on memcache for such an algorithm; the key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.
I'd suggest adding a property to the item entities, let's say called assigned, unset by default (or set to null/None) and, when the item is assigned to a user, set to that user's key or key ID. This allows you:
to query for unassigned items when you want to make assignments
to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
to be certain that an item can uniquely be assigned to only a single user
to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
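A rough sketch of such an assignment using the low-level App Engine Datastore API (the Item kind and assigned property follow the suggestion above; the method name and batch size are illustrative):

import java.util.ConcurrentModificationException;
import java.util.List;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Transaction;

// Assign one free item to the given user; returns the item key, or null if no free item was found.
static Key assignItemTo(long userId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // candidates from an (eventually consistent) query for unassigned items
    Query q = new Query("Item")
            .setFilter(new Query.FilterPredicate("assigned", Query.FilterOperator.EQUAL, null));
    List<Entity> candidates = ds.prepare(q).asList(FetchOptions.Builder.withLimit(10));

    for (Entity candidate : candidates) {
        Transaction txn = ds.beginTransaction();
        try {
            // re-read inside the transaction for a strongly consistent check
            Entity item = ds.get(txn, candidate.getKey());
            if (item.getProperty("assigned") == null) {
                item.setProperty("assigned", userId);
                ds.put(txn, item);
                txn.commit();                     // fails if a concurrent transaction touched the item
                return item.getKey();
            }
        } catch (EntityNotFoundException | ConcurrentModificationException e) {
            // item vanished or another request won the race; try the next candidate
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
    return null;
}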
As for the growing execution time of the cron job: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch-processing task, which would then enqueue an additional such task if there are remaining batches to process.
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's Python, but the general idea is the same).
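A rough sketch of such batch chaining with query cursors and push tasks (the User kind, the /tasks/assign-items URL, and the batch size of 100 are assumptions):

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Handles one batch of users, then enqueues a task for the next batch if more remain.
static void processBatch(String webSafeCursor) {
    final int batchSize = 100;
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(batchSize);
    if (webSafeCursor != null) {
        fetchOptions.startCursor(Cursor.fromWebSafeString(webSafeCursor));
    }

    QueryResultList<Entity> users =
            ds.prepare(new Query("User")).asQueryResultList(fetchOptions);

    for (Entity user : users) {
        // per-user work, e.g. the assignItemTo(...) sketch above
    }

    if (users.size() == batchSize) {    // probably more users left; chain the next batch
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/assign-items")
                .param("cursor", users.getCursor().toWebSafeString()));
    }
}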
If you are planning to push jobs from inside a cron job and you want the jobs to update key-value pairs as an add-on to improve speed and performance, you can split the users and items into multiple key-(list of values) pairs. Each push job would then pick a key at random (you'd write the logic to pick one key out of the 4 or 5 keys), remove an item from that key's list of items, and update the key again. Try to have a lock in place before working on the above. Example of key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]