Create a time-based map in C

So I have a map from Key -> Struct.
My key will be a device's IP address, and the value (a struct) will hold the device's IP address and a duration; once that amount of time has elapsed, the key-value pair should expire and be deleted from the map.
I am fairly new to this, so I was wondering what would be a good way to go about it.
I have googled around, but I only seem to find material on time-based maps in Java.
EDIT
After coming across this, I think I may have to create a map with items in it, and then keep a deque in parallel with references to each element. Then periodically call a clean function and, if an element has been in the map longer than x amount of time, delete it.
Is this correct, or can anyone suggest a more optimal way of doing it?

I've used three approaches to solve a problem like this.
Use a periodic timer. Once every time quantum, get all the expiring elements and expire them. Keep the elements in timer wheels; see scheme 7 in this paper for ideas. The overhead here is that the periodic timer will kick in even when it has nothing to do, and the buckets have a constant memory overhead, but this is the most efficient thing you can do if you add and remove things from the map much more often than you expire elements from it.
Check all elements for the shortest expiry time. Schedule a timer to kick in after that amount of time. In the timer, remove the expired element and schedule the next timer. Reschedule the timer every time a new element is added if its expiration time is shorter than the currently scheduled timer. Keep the elements in a heap for fast lookup of who needs to expire first. This has quite a large insertion and deletion overhead, but is pretty efficient when the most common deletion from the map is through expiry.
Every time you access the map, check if the element you're accessing has expired. If it has, just throw it away and pretend it wasn't there in the first place (sketched below). This could be quite inefficient because of the timestamp check on every access, and it doesn't work if you need to perform some action on expiry.
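The question asks for C, but the core of that last approach is small enough to sketch: store an absolute expiry timestamp next to each value and check it on every lookup. The sketch below is in Java with illustrative names and a plain hash map; the same layout carries over to a C struct plus whichever hash table you use.

import java.util.HashMap;
import java.util.Map;

// Minimal expire-on-access map: each entry carries an absolute expiry time,
// and lookups discard entries whose time has already passed.
public class ExpiringMap<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;  // absolute expiry time, not a time-to-live

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();

    public void put(K key, V value, long ttlMillis) {
        // Store the expiry time itself, so lookups only need one comparison.
        entries.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (e.expiresAtMillis <= System.currentTimeMillis()) {
            entries.remove(key);  // lazily delete the expired entry
            return null;
        }
        return e.value;
    }
}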

Related

Session window in Flink with fixed time

I have a use case where I don't want the window for each key to keep on growing with each element. The requirement is to have a new window for each key that starts with the first element and ends exactly 1 minute later.
I assume that it's your intention to collect all of the events that arrive for the same key within a minute of the first event, and then produce some result at the end of that minute.
This can be implemented in a pretty straightforward way with a KeyedProcessFunction. You can maintain ListState for each key, and append arriving records to the list. You can use a timer to trigger your end-of-window logic.
See the tutorial on process functions from the Flink documentation for more.
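A rough sketch of that pattern, assuming an input type Event, a one-minute window measured in processing time, and illustrative names throughout:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Collects all events for a key that arrive within one minute of the first
// event, then emits them as one batch and clears the state for that key.
// `Event` stands in for whatever record type flows through the keyed stream.
public class OneMinuteFromFirstEvent extends KeyedProcessFunction<String, Event, List<Event>> {

    private transient ListState<Event> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<List<Event>> out) throws Exception {
        Iterable<Event> current = buffer.get();
        boolean firstEvent = current == null || !current.iterator().hasNext();
        buffer.add(event);
        if (firstEvent) {
            // The end-of-window timer fires exactly one minute after the first event.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60_000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<Event>> out) throws Exception {
        List<Event> window = new ArrayList<>();
        Iterable<Event> current = buffer.get();
        if (current != null) {
            for (Event e : current) {
                window.add(e);
            }
        }
        out.collect(window);
        buffer.clear();  // the next event for this key starts a fresh window
    }
}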

How can I run a function after n seconds in C?

My project is to create a program like Memcached. The program stores a list of keys, values, and the expiry time of each cached item. That means after n seconds the data should be removed. I think I can use a struct to store the keys and values, but I cannot work out how to remove the data after n seconds. Can you give me some solutions? Thanks for all.
Just do the delete lazily.
You don't need to delete expired data immediately. In order to maintain the semantics of the data store, you only need to do two things:
Not return expired data to a query. (But see below.)
Not allow the datastore to fill up with expired data.
In other words, it is sufficient to delete expired data when you happen to encounter it, either because it showed up as the response to a query or because it occupies a slot that you need in order to store an update.
To simplify detection of expired data, you should store the actual expiry time in the structure, not the time to live. Then it's easy to see whether a key/value pair has expired: you just compare the expiry time to the current time.
If you use a chained hash, you can edit the hash chain (by removing expired entries) during a search of that chain. If you use open addressing, you can replace expired entries with a tombstone (or you can use expiry as a tombstone). In both cases, if you find the key you are looking for but the entry is expired, you can terminate the search, either by returning "key not present" if it's a query or by overwriting the data (and expiry time) with the new data if it's an update.
Note:
The data store cannot really guarantee that expired data will never be returned, since it does not control the network latency for responses. It is quite possible that the data it returns had not expired at the moment that it was despatched from the server, but has expired by the time it arrives at the client. So the data store can only offer "best effort", and it is up to the client to decide whether or not to use the data returned (and it is important that the server return the expiry time along with the data as a response to a query).
Since the client must check the expiry time anyway, it would not be a definitive technical violation of the contract if the data store only checked expiry dates when it was updating an entry. But since the cost of not sending definitely expired data is so small that it's hardly worth worrying about, it seems reasonable to include the check in queries as well as updates.
What about using a timer?
You can use time_t and the time() function from time.h.
Store the creation (or expiry) time as a time_t obtained from time(), and check the elapsed time by comparing the stored time with the current time (difftime() gives the difference in seconds). Note that clock() measures processor time used by your program rather than wall-clock time, so time() is the better fit for expiring cached items.

How to stop high load from leading to cascading Flink checkpoint failures

A couple of points I'll volunteer up front:
I'm new to Flink (working with it for about a month now).
I'm using Kinesis Analytics (AWS hosted Flink solution). By all accounts this doesn't really limit the versatility of Flink or the options for fault tolerance, but I'll call it out anyway.
We have a fairly straightforward sliding window application. A keyed stream organizes events by a particular key, an IP address for example, and then processes them in a ProcessFunction. We mostly use this to keep track of counts of things, for example how many logins for a particular IP address in the last 24 hours. Every 30 seconds we count the events in the window, per key, and save that value to an external data store. State is also updated to reflect the events in that window so that old events expire and aren't taking up memory.
Interestingly enough, cardinality is not an issue. If we have 200k folks logging in, in a 24 hour period, everything is perfect. Things start to get hairy when one IP logs in 200k times in 24 hours. At this point, checkpoints start to take longer and longer. An average checkpoint takes 2-3 seconds, but with this user behaviour, the checkpoints start to take 5 minutes, then 10, then 15, then 30, then 40, etc etc.
The application can run smoothly in this condition for a while, surprisingly. Perhaps 10 or 12 hours. But, sooner or later checkpoints completely fail and then our max iterator age starts to spike, and no new events are processed etc etc.
I've tried a few things at this point:
Throwing more metal at the problem (auto scaling turned on as well)
Fussing with CheckpointingInterval and MinimumPauseBetweenCheckpoints https://docs.aws.amazon.com/kinesisanalytics/latest/apiv2/API_CheckpointConfiguration.html
Refactoring to reduce the footprint of the state we store
(1) didn't really do much.
(2) This appeared to help, but then another much larger traffic spike than what we'd seen before squashed any of the benefits.
(3) It's unclear if this helped. I think our application memory footprint is fairly small compared to what you'd imagine from a Yelp or an Airbnb who both use Flink clusters for massive applications so I can't imagine that my state is really problematic.
I'll say I'm hoping we don't have to deeply change the expectations of the application output. This sliding window is a really valuable piece of data.
EDIT: Somebody asked what my state looks like. It's a ValueState[FooState]:
case class FooState(
  entityType: String,
  entityID: String,
  events: List[BarStateEvent],
  tableName: String,
  baseFeatureName: String,
)
case class BarStateEvent(target: Double, eventID: String, timestamp: Long)
EDIT:
I want to highlight something that user David Anderson said in the comments:
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events.
This was essential. For anybody else trying to walk this path, I couldn't find a workable solution that didn't bucket events into some time slice. My final solution involves bucketing events into batches of 30 seconds and then writing those into map state as David suggested. This seems to do the trick. For our high periods of load, checkpoints remain at 3mb and they always finish in under a second.
If you have a sliding window that is 24 hours long, and it slides by 30 seconds, then every login is assigned to each of 2880 separate windows. That's right, Flink's sliding windows make copies. In this case 24 * 60 * 2 copies.
If you are simply counting login events, then there is no need to actually buffer the login events until the windows close. You can instead use a ReduceFunction to perform incremental aggregation.
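For example, if each login is first mapped to an (ipAddress, 1) pair, the 24-hour window sliding by 30 seconds can be aggregated incrementally, so each window only ever holds one partial sum per key rather than every buffered event. A sketch under those assumptions:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LoginCounts {

    // logins: a stream of (ipAddress, 1L) pairs, with timestamps and watermarks assigned upstream.
    public static DataStream<Tuple2<String, Long>> countsPerIp(DataStream<Tuple2<String, Long>> logins) {
        return logins
                .keyBy(value -> value.f0)
                .window(SlidingEventTimeWindows.of(Time.hours(24), Time.seconds(30)))
                // Incremental aggregation: each window stores only its running sum,
                // not the individual login events.
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}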
My guess is that you aren't taking advantage of this optimization, and thus when you have a hot key (ip address), then the instance handling that hot key has a disproportionate amount of data, and takes a long time to checkpoint.
On the other hand, if you are already doing incremental aggregation, and the checkpoints are as problematic as you describe, then it's worth looking more deeply to try to understand why.
One possible remediation would be to implement your own sliding windows using a ProcessFunction. By doing so you could avoid maintaining 2880 separate windows, and use a more efficient data structure.
EDIT (based on the updated question):
I think the issue is this: When using the RocksDB state backend, state lives as serialized bytes. Every state access and update has to go through ser/de. This means that your List[BarStateEvent] is being deserialized and then re-serialized every time you modify it. For an IP address with 200k events in the list, that's going to be very expensive.
What you should do instead is to use either ListState or MapState. These state types are optimized for RocksDB. The RocksDB state backend can append to ListState without deserializing the list. And with MapState, each key/value pair in the map is a separate RocksDB object, allowing for efficient lookups and modifications.
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events. There's an example of doing something similar (but with tumbling windows) in the Flink docs.
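Here is a sketch of that shape, simplified to keep a running count per 30-second slice rather than the full list of events; LoginEvent (with a timestamp field) and the timer handling are assumptions, and if you need the raw events the map values can be lists instead, as in the suggestion above.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hand-rolled 24-hour sliding count, bucketed into 30-second slices kept in MapState.
// Each slice is a separate RocksDB entry, so updating one slice never forces
// ser/de of the whole 24 hours of state for a hot key.
public class SlicedSlidingCount extends KeyedProcessFunction<String, LoginEvent, Long> {

    private static final long SLICE_MS = 30_000L;
    private static final long WINDOW_MS = 24L * 60L * 60L * 1000L;

    private transient MapState<Long, Long> countPerSlice;

    @Override
    public void open(Configuration parameters) {
        countPerSlice = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("countPerSlice", Types.LONG, Types.LONG));
    }

    @Override
    public void processElement(LoginEvent event, Context ctx, Collector<Long> out) throws Exception {
        long slice = event.timestamp - (event.timestamp % SLICE_MS);
        Long current = countPerSlice.get(slice);
        countPerSlice.put(slice, current == null ? 1L : current + 1L);

        // One timer per 30-second boundary; timers registered for the same timestamp are deduplicated.
        long now = ctx.timerService().currentProcessingTime();
        ctx.timerService().registerProcessingTimeTimer(now - (now % SLICE_MS) + SLICE_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        long cutoff = timestamp - WINDOW_MS;
        long total = 0L;
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> slice : countPerSlice.entries()) {
            if (slice.getKey() < cutoff) {
                expired.add(slice.getKey());  // this slice has slid out of the 24h window
            } else {
                total += slice.getValue();
            }
        }
        for (Long key : expired) {
            countPerSlice.remove(key);
        }
        out.collect(total);  // rolling 24-hour count for this key
    }
}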
Or, if your state can fit into memory, you could use the FsStateBackend. Then all of your state will be objects on the JVM heap, and ser/de will only come into play during checkpointing and recovery.

How to use Google Cloud Memcache to save/update unique items

In my application I run a cron job to loop over all users (2,500 users) to choose an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one.
To achieve this I have to run this cron job and loop over the users one by one sequentially, pick an item for each, remove it from the list (so it can't be chosen by the next user(s)), then move on to the next user.
Actually, in my system the number of users/items is getting bigger every single day, and this cron job now takes 2 hours to assign items to all users.
I need to improve this. One of the things I've thought about is using threads, but I can't do that since I'm using automatic scaling, so I started thinking about push queues: when the cron job runs, it will make a loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem will push a task to a servlet to handle it and choose the best item for this person based on his data.
Let's say I start doing that: what would be the best/most robust solution to avoid assigning an item to more than one user?
Since I'm using basic scaling and 8 instances, I can't rely on static variables.
One of the things that came to my mind is to create a table in the DB that accepts only unique items, and then insert the taken items into it; if the insertion succeeds it means nobody else took this item, so I can just assign it to that person. But this will lower performance a bit, because I would need to make a DB write with every call (I want to avoid that).
I also thought about memcache. It's really fast but not robust enough: if I save a Set of items into it (which will accept only unique items), and more than one thread tries to access this Set at the same time to update it, only one thread will be able to save its data, and the other threads' data might be overwritten and lost.
I hope you guys can help find a solution for this problem, thanks in advance :)
First, I would advise against using solely memcache for such an algorithm. The key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
to query for unassigned items when you want to make assignments
to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
to be certain that an item can uniquely be assigned to only a single user
to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
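A sketch of what the claim step could look like with the low-level Datastore API; the Item kind, the assigned property name, and the surrounding wiring are assumptions, and the transactional re-check is what guarantees an item goes to only one user even across instances:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import com.google.appengine.api.datastore.Transaction;

public class ItemAssigner {

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Picks an unassigned Item and claims it for the given user.
    // The transaction re-reads "assigned" so two concurrent requests
    // cannot claim the same item.
    public Key assignItemTo(String userId) {
        Query query = new Query("Item")
                .setFilter(new FilterPredicate("assigned", FilterOperator.EQUAL, null));
        Iterable<Entity> candidates =
                datastore.prepare(query).asIterable(FetchOptions.Builder.withLimit(10));

        for (Entity candidate : candidates) {
            Transaction txn = datastore.beginTransaction();
            try {
                Entity fresh = datastore.get(txn, candidate.getKey());
                if (fresh.getProperty("assigned") == null) {
                    fresh.setProperty("assigned", userId);
                    datastore.put(txn, fresh);
                    txn.commit();
                    return fresh.getKey();
                }
                txn.rollback();  // someone claimed it since the query ran; try the next candidate
            } catch (EntityNotFoundException | java.util.ConcurrentModificationException e) {
                if (txn.isActive()) {
                    txn.rollback();
                }
            }
        }
        return null;  // no unassigned item could be claimed
    }
}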
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's Python, but the general idea is the same).
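A sketch of the batch-chaining piece using query cursors and the task queue; the kind name, batch size, task URL, and handler wiring are all assumptions:

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class AssignBatchTask {

    private static final int BATCH_SIZE = 100;

    // Processes one batch of users, then enqueues a task for the next batch.
    // The cron job only enqueues the first task (with no cursor).
    public void processBatch(String webSafeCursor) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        FetchOptions fetchOptions = FetchOptions.Builder.withLimit(BATCH_SIZE);
        if (webSafeCursor != null) {
            fetchOptions.startCursor(Cursor.fromWebSafeString(webSafeCursor));
        }

        QueryResultList<Entity> users =
                datastore.prepare(new Query("User")).asQueryResultList(fetchOptions);

        for (Entity user : users) {
            // assign an item to this user, e.g. via the transactional helper sketched above
        }

        if (users.size() == BATCH_SIZE) {
            // More users may remain: chain the next batch as a separate push task.
            QueueFactory.getDefaultQueue().add(TaskOptions.Builder
                    .withUrl("/tasks/assign-items")
                    .param("cursor", users.getCursor().toWebSafeString()));
        }
    }
}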
If you are planning to run push jobs from a cron and you want the jobs to update key-value pairs to improve speed and performance, you can split the users and the items into multiple key-(list of values) pairs, so that each push job picks a key at random (with some logic to pick one key out of the 4 or 5 keys), removes an item from that key's list of items, and then updates the key again; use some form of locking before working on that part. Example of key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]

Track image views in a cost-effective manner

I'm looking for a solution to implement sponsored images on one of my GAE apps.
We've got about 5000 users using the app, and these sponsored images need to be tracked every time they are viewed and every time somebody clicks on them.
Somebody suggested having multiple entries for counters and then randomly incrementing these counters in order to get past the datastore write limit, but if you happen to have two views at exactly the same time and both try to write to the datastore at the same time, the second write will overwrite the first, meaning you lose one view.
At the moment we're creating a new datastore entry for every view and every click, and we have a scheduler passing it to a queue that adds up all the views and clicks, saving the count in a stats entity - not very efficient.
Posting this as an answer :)
You can use a queue with a throughput rate of one task at a time, and send the count operations to that queue. That way you will know that only one count operation is performed on the counter at a time.
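For example, each view could enqueue a small task onto a dedicated queue that is configured (in queue.xml) to run one task at a time, so the counter updates are applied serially; the queue name and task URL below are illustrative:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class ViewCounter {

    // Enqueue one "count" task per view. The "image-counters" queue is assumed
    // to be configured in queue.xml with max-concurrent-requests set to 1, so
    // only one counter update runs at a time and no increments are lost.
    public static void recordView(String imageId) {
        Queue queue = QueueFactory.getQueue("image-counters");
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/increment-view-count")
                .param("imageId", imageId));
    }
}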
