My project is to create a program like Memcached. The program stores a list of keys and values together with the expiry time of each cached item, meaning that after n seconds an item is removed. I think I can use a struct to store the keys and values, but I don't know how to remove an item after n seconds. Can you suggest a solution? Thanks.
Just do the delete lazily.
You don't need to delete expired data immediately. In order to maintain the semantics of the data store, you only need to do two things:
Not return expired data to a query. (But see below.)
Not allow the datastore to fill up with expired data.
In other words, it is sufficient to delete expired data when you happen to encounter it, either because it showed up as the response to a query or because it occupies a slot which you need to store an update.
To simplify detection of expired data, you should store the actual expiry time in the structure, not the time to live. Then it's easy to see whether a key/value pair has expired: you just compare the expiry time to the current time.
If you use a chained hash, you can edit the hash chain (by removing expired entries) during a search of that chain. If you use open addressing, you can replace expired entries with a tombstone (or you can use expiry as a tombstone). In both cases, if you find the key you are looking for but the entry is expired, you can terminate the search, either returning "key not present" if it's a query or overwriting the data (and expiry time) with the new data if it's an update.
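As a rough illustration of the lazy approach, here is a minimal Python sketch. The class name `LazyExpiringStore` and its methods are made up for this example, and a Python dict stands in for the hash table, so the chain editing and tombstones are not shown. It stores the absolute expiry time and deletes an entry only when a query runs into it.

```python
import time

class LazyExpiringStore:
    """Key/value store that deletes expired entries only when it encounters them."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable clock (an assumption, to make testing easy)
        self._data = {}       # key -> (value, absolute expiry time)

    def set(self, key, value, ttl_seconds):
        # Store the absolute expiry time, not the time to live.
        # Overwriting an expired entry simply replaces it.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # lazy delete: found, but expired
            return None           # behave as "key not present"
        return value
```

A real server would also need a policy for reclaiming space when the store fills up with entries that are never queried again.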
Note:
The data store cannot really guarantee that expired data will never be returned, since it does not control the network latency for responses. It is quite possible that the data it returns had not expired at the moment that it was despatched from the server, but has expired by the time it arrives at the client. So the data store can only offer "best effort", and it is up to the client to decide whether or not to use the data returned (and it is important that the server return the expiry time along with the data as a response to a query).
Since the client must check the expiry time anyway, it would not be a definitive technical violation of the contract if the data store only checked expiry dates when it was updating an entry. But since the cost of not sending definitely expired data is so small that it's hardly worth worrying about, it seems reasonable to include the check in queries as well as updates.
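What about using a timer?
You can use time_t and the time() function from time.h. (clock() is not suitable here: it returns a clock_t measuring the processor time used by the program, not wall-clock time.)
Store the start time in a time_t by calling time(), and check the elapsed time by comparing the stored time with the current time, for example with difftime().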
Below is the structure of my table:
table
UUID key - let's call this **EntryKey**
HistoryLog - this also serves as a version number
Map<UUID (let's call this **EntryChildKey**), BYTE> value
version - for **optimistic locking**
Let's assume the map has around 10k entries, each mapping a UUID to some value.
My problem is that once in a while I get requests to update the values of all 10k EntryChildKey entries in the map, and all of these requests bombard the DB at the same time. Because every update hits the same EntryKey row, I run into a lot of concurrency errors: the version changes on every update, so I have to retry, and the EntryChildKey updates thrash each other, resulting in DynamoDB throttling my requests.
I could get out of this problem by splitting this into 2 tables as below, but we have to maintain the HistoryLog version at the EntryKey level, and there are some other problems too, so I can't take this route.
Table1
UUID EntryChildKey
BYTE value
Table2
UUID EntryKey
List<UUID> EntryChildKey
Another approach I am considering is something like a write-ahead log: I would update the version and record the intent to update the table, but not update the record itself. Instead I would keep the pending updates as a list in the table and then apply the EntryChildKey values sequentially. But I don't know whether DynamoDB offers this or anything similar.
I'd also appreciate any other approach that could help solve this problem.
If you really do need to have a version attribute be updated on a single key each time any one of the 10k EntryChild items is updated then your only option is to decouple the table from the update source.
DynamoDB has a hard limit of up to 1000 writes/second to any item at all times. There is simply no way to increase that, for a single item. It doesn't matter what size table you have, how many partitions, or how much total write capacity you allocate to your table, a single item will never be able to be updated more than 1000 times per second.
So, if your requirement to update an attribute (the HistoryLog in your example) on the "master" entry item is really firm, then to use DynamoDB your best bet is to introduce a queue and batching to pre-process the updates before writing to Dynamo.
You could create an SQS queue and use a lambda function to read from the queue and write to Dynamo.
In a naive approach, you could simply read from the queue and then write to the table as much as you can, based on the DynamoDB throttling. For 10k updates to the same "master" key this will take at least 10 seconds, though in reality it will likely take longer.
A better option though, would be to run the lambda on a schedule, say once a second, and have it read all the messages available in the queue and combine all updates to the same "master" key into a single update. That way, you only write to the same item at most once every second.
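The combining step could look roughly like the Python sketch below. It makes no AWS calls. The message shape `(entry_key, child_key, value)` and the function name `coalesce_updates` are assumptions made for illustration, and it assumes that the last message read for a child key wins, which is only safe if ordering does not matter for your data.

```python
from collections import defaultdict

def coalesce_updates(messages):
    """Group queued updates by master key so each master item is written once.

    Each message is a hypothetical (entry_key, child_key, value) tuple.
    Later messages for the same child key overwrite earlier ones.
    """
    merged = defaultdict(dict)
    for entry_key, child_key, value in messages:
        merged[entry_key][child_key] = value
    return dict(merged)

batch = [
    ("entry-1", "child-a", 1),
    ("entry-1", "child-b", 2),
    ("entry-1", "child-a", 3),   # newer value (or a duplicate delivery) for child-a
    ("entry-2", "child-c", 4),
]
writes = coalesce_updates(batch)
# writes now holds one combined update per master key:
# {"entry-1": {"child-a": 3, "child-b": 2}, "entry-2": {"child-c": 4}}
```

Each entry in `writes` would then become a single write to the corresponding item, bumping the version once per batch rather than once per child update.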
The big challenge with a normal SQS queue is that it does not offer exactly-once semantics, meaning there will be items in the queue that are received multiple times. If you can design a system where you can safely discard duplicate updates then this approach will work wonderfully. If not, then things get more complicated.
In my application I run a cron job that loops over all users (2,500 users) and chooses an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one.
To achieve this I have to run the cron job and loop over the users one by one, sequentially: pick the item for each user, remove it from the list (so it is not chosen for later users), then move on to the next user.
The number of users and items in my system is growing every day, and this cron job now takes 2 hours to assign items to all users.
I need to improve this. One thing I thought about is using threads, but I can't do that since I'm using automatic scaling, so I started thinking about push queues. When the cron job runs, it would loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem pushes a task to a servlet, which handles it and chooses the best item for this person based on their data.
Suppose I do that: what would be the best and most robust way to avoid assigning an item to more than one user?
Since I'm using basic scaling with 8 instances, I can't rely on static variables.
One thing that came to mind is to create a DB table that accepts only unique items and insert the taken items into it. If the insertion succeeds, nobody else has taken the item, so I can assign it to that person. But this will lower performance, because I need a DB write with every call (I want to avoid that).
I also thought about memcache. It's really fast but not robust enough: if I store a Set of items in it (which accepts only unique items) and more than one thread tries to update this Set at the same time, only one thread will manage to save its data, and the other threads' data might be overwritten and lost.
I hope you can help me find a solution for this problem. Thanks in advance :)
First - I would advise against using solely memcache for such an algorithm - the key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the
cache's LRU policy. Changes in the cache configuration or datacenter
maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache
prior to expiration for reasons other than memory pressure. While
memcache is resilient to server failures, memcache values are not
saved to disk, so a service failure can cause values to become
unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
to query for unassigned items when you want to make assignments
to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
to be certain that an item can uniquely be assigned to only a single user
to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
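To make the idea concrete, here is a small Python sketch of the check-then-set on the assigned property. It is an in-memory stand-in, not datastore code: `ItemStore`, `try_assign` and `assign_first_free` are hypothetical names, and the `threading.Lock` plays the role that a datastore transaction would play in the real application.

```python
import threading

class ItemStore:
    """In-memory stand-in for the datastore; the lock stands in for a transaction."""

    def __init__(self, item_ids):
        self._assigned = {item_id: None for item_id in item_ids}
        self._lock = threading.Lock()

    def unassigned(self):
        # The result may be stale by the time the caller acts on it,
        # much like an eventually consistent query.
        return [i for i, user in self._assigned.items() if user is None]

    def try_assign(self, item_id, user_id):
        """Assign the item only if nobody holds it; return True on success."""
        with self._lock:
            if self._assigned[item_id] is not None:
                return False          # someone else got there first
            self._assigned[item_id] = user_id
            return True

def assign_first_free(store, user_id):
    """Walk the (possibly stale) candidates and claim the first one still free."""
    for item_id in store.unassigned():
        if store.try_assign(item_id, user_id):
            return item_id
    return None
```

The selection logic based on user info is left out; the point is only that the final claim re-checks the property before setting it, so a stale query result cannot lead to a double assignment.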
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
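The batch chaining described above can be simulated in a few lines of Python. This is only a sketch: an integer offset stands in for a query cursor, a `deque` stands in for the push queue, and `BATCH_SIZE`, `process_batch` and `run_cron` are names invented for the example.

```python
from collections import deque

BATCH_SIZE = 3  # hypothetical; tune to what one request can finish in time

def process_batch(users, cursor, handle_user, task_queue):
    """Handle one fixed-size batch, then enqueue the next batch if any remain."""
    batch = users[cursor:cursor + BATCH_SIZE]
    for user in batch:
        handle_user(user)
    next_cursor = cursor + len(batch)
    if next_cursor < len(users):
        task_queue.append(next_cursor)   # stand-in for enqueueing a push task

def run_cron(users, handle_user):
    """The cron job only enqueues the first batch; the loop simulates the queue."""
    task_queue = deque([0])
    batches = 0
    while task_queue:
        process_batch(users, task_queue.popleft(), handle_user, task_queue)
        batches += 1
    return batches
```

With 8 users and a batch size of 3, `run_cron` processes three batches, each of which would be a separate short request in the real system.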
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's python, but the general idea is the same).
If you are planning to run push jobs from a cron job and want those jobs to update key-value pairs to improve speed and performance, you can split the users and the items into multiple key-(list of values) pairs. Each push job then picks a key at random (you would write the logic to pick one key out of 4 or 5), removes an item from that key's list of items and writes the key back. Take a lock before doing this. Example of key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]
I'm developing a web application which displays a list of, let's say, "threads". The list can be sorted by the number of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10 times per second. Furthermore, the application is distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list, and I can't transmit the whole list sorted by likes to a user at once.
As soon as a user goes to page 2 of this list, the list has likely changed, and page 2 may contain threads already listed on page one.
Solutions which don't work:
Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshotting the list in a temporary database table (the data changes too frequently and needs to be current)
(If it matters I'm using MongoDB+c#)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-03 07:30 UTC". That makes your query super easy to implement. Then, when you pull down all the next elements, you just look for anything that occurred before the last element on the page.
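A minimal Python sketch of this kind of timestamp cursor is shown below. The function name `page_before` and the `created` field are assumptions for illustration, the in-memory list stands in for a database query, and the sketch assumes creation timestamps are unique (ties would need a tie-breaker such as the thread id).

```python
def page_before(threads, before_ts, page_size):
    """Return up to page_size threads created strictly before before_ts,
    newest first, plus the timestamp cursor for the next page
    (None when nothing older is left)."""
    older = sorted(
        (t for t in threads if t["created"] < before_ts),
        key=lambda t: t["created"],
        reverse=True,
    )
    page = older[:page_size]
    next_cursor = page[-1]["created"] if len(older) > page_size else None
    return page, next_cursor
```

Because each page is defined by "older than the last element I saw", threads created while the user is browsing never shift the later pages.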
The downside of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
    timestamp long (PK)
    created datetime
    last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
    id long (PK)
    identifier long
    last_modified datetime
    title string
    body string
    score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
    timestamp long (FK -> Time.timestamp)
    thread_id long (FK -> Threads.id)
    thread_identifier long (matches Threads[thread_id].identifier)
    (timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
  AND Threads.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Threads.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
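The write path can be sketched with Python's built-in sqlite3 module. This is a simplified illustration under several assumptions: the tables are trimmed to a few columns, the caller supplies the new snapshot timestamp, and `write_thread` and `snapshot` are hypothetical helper names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Time (timestamp INTEGER PRIMARY KEY);
    CREATE TABLE Threads (id INTEGER PRIMARY KEY AUTOINCREMENT,
                          identifier INTEGER, title TEXT, score INTEGER);
    CREATE TABLE ThreadStar (timestamp INTEGER, thread_id INTEGER,
                             thread_identifier INTEGER,
                             UNIQUE (timestamp, thread_identifier));
""")

def write_thread(db, new_ts, identifier, title, score):
    """Insert a new version of a thread and build the snapshot for new_ts."""
    prev_ts = db.execute("SELECT MAX(timestamp) FROM Time").fetchone()[0]
    db.execute("INSERT INTO Time (timestamp) VALUES (?)", (new_ts,))
    cur = db.execute(
        "INSERT INTO Threads (identifier, title, score) VALUES (?, ?, ?)",
        (identifier, title, score))
    # Copy the previous snapshot, leaving out the thread being replaced...
    db.execute(
        "INSERT INTO ThreadStar "
        "SELECT ?, thread_id, thread_identifier FROM ThreadStar "
        "WHERE timestamp = ? AND thread_identifier != ?",
        (new_ts, prev_ts, identifier))
    # ...then point the new snapshot at the row that was just inserted.
    db.execute("INSERT INTO ThreadStar VALUES (?, ?, ?)",
               (new_ts, cur.lastrowid, identifier))

def snapshot(db, ts):
    """Return (identifier, score) for every thread as of snapshot ts, best first."""
    return db.execute(
        "SELECT Threads.identifier, Threads.score FROM Threads "
        "JOIN ThreadStar ON Threads.id = ThreadStar.thread_id "
        "WHERE ThreadStar.timestamp = ? ORDER BY Threads.score DESC",
        (ts,)).fetchall()
```

After a thread's score changes, older snapshots still return the old score and ordering, which is what keeps paging stable.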
You also don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budget, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes I would use a two-stage approach. For the frequently changing data I would use an in-memory DB to keep up with the change rate and flush it periodically to the "real" DB.
Once you have that, the query for constantly changing data is easy:
1. Query the DB.
2. Query the in-memory DB.
3. Merge the frequently changed data from the in-memory DB with the "slow" DB data.
4. Remember which results you have already displayed, so that pressing the next button does not display a value a second time on a different page because its rank has changed.
If many people look at the same data, it might help to cache the results of step 3 to reduce the load on the real DB even further.
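A rough Python sketch of the merge and the "already displayed" bookkeeping follows. Plain lists, dicts and sets stand in for the two databases and the per-user state, and `merged_page` is a name made up for this example.

```python
def merged_page(db_rows, live_likes, already_shown, page_size):
    """Overlay fresh like counts on the slower database rows, re-rank,
    and skip threads this user has already been shown.

    db_rows:       list of (thread_id, likes) from the "real" DB
    live_likes:    dict thread_id -> likes from the in-memory DB
    already_shown: set of thread ids, updated in place
    """
    current = {thread_id: likes for thread_id, likes in db_rows}
    current.update(live_likes)   # in-memory counts win over stale ones
    ranked = sorted(current.items(), key=lambda kv: kv[1], reverse=True)
    page = [(tid, likes) for tid, likes in ranked
            if tid not in already_shown][:page_size]
    already_shown.update(tid for tid, _ in page)
    return page
```

Calling it again with the same `already_shown` set yields the next page, even if the ranking has changed in between.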
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all 'thread' results on the server the first time the user hits the database, then return the first page of data to the user. For each subsequent next-page call I'd return cached results.
To minimize memory usage you can cache only record ids and fetch the full data when the user requests it.
The cache can be evicted when the user leaves the current page. If it isn't a ton of data I would stick with this solution, because the user won't be annoyed by constantly changing data.
I'm looking for a solution to an edge case scenario where a client continually asking the server for what's new will fail due to timestamps.
In this example, I'm not using sequence numbers because of another edge case problem. You can see that problem here: A Client Walks Into a Server And Asks "What's New?" – Problems With Sequence Numbers
Assume we're using timestamps. Every row update adds a timestamp of the server time. Clients continually ask what's new since the timestamp of the last item they received. Simple? Yes, but...
Failure scenario:
The times below are arbitrary for readability. Assume milliseconds in the real world.
2:50 Client C checks for updates.
2:59 Client A starts update on a row. (Sets lastModified to 2:59)
2:59 Client B starts update on a row. (Sets lastModified to 2:59)
3:00 Client A Row update becomes visible on DB. (lastModified still at 2:59)
3:00 Client C checks for updates >2:50. Gets A’s update. Good.
3:01 Client B Row update becomes visible on DB. (lastModified still at 2:59)
3:10 Client C checks for updates >2:59. Gets nothing. Misses B's update. Bad.
This assumes that the lastModified can't be set atomically and there may be a delay between its setting and the row becoming available in the database. If the database were sharded, this delay could be much larger.
We could set the check for update to arbitrarily ask for an early time causing overlap. This is inefficient due to potentially duplicate data being retrieved but not fatal. However, is it possible to know how much overlap is needed for all cases? Could a sharded database rarely delay displaying an update by seconds? Minutes?
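For illustration, the overlap idea could be sketched in Python as below. This does not answer how large the overlap must be; `overlap` is simply a parameter, `poll` is a hypothetical name, and times are plain numbers. Duplicates are filtered on the client by remembering which (row id, lastModified) pairs were already delivered.

```python
def poll(rows, last_seen_ts, overlap, seen):
    """Ask for rows modified after (last_seen_ts - overlap) and drop the ones
    already delivered.

    rows: currently visible (row_id, last_modified) pairs
    seen: set of (row_id, last_modified) pairs already delivered, updated in place
    """
    since = last_seen_ts - overlap
    fresh = [(rid, ts) for rid, ts in rows
             if ts > since and (rid, ts) not in seen]
    seen.update(fresh)
    new_last = max([last_seen_ts] + [ts for _, ts in fresh])
    return fresh, new_last

# Replaying the failure scenario with minutes as integers (2:50 -> 170, 2:59 -> 179):
seen = set()
first, last = poll([("A", 179)], 170, 5, seen)                # the 3:00 check
second, last = poll([("A", 179), ("B", 179)], last, 5, seen)  # the 3:10 check
# second contains B's update even though its lastModified is not newer than `last`
```

In practice the `seen` set would need pruning, for example by dropping pairs older than the overlap window.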
Having clients ask "what's new" repeatedly seems like a common use case and I find it surprising not to find a better wealth of best practices on this.
Any ideas on solving this scenario or recommending a better, preferably platform agnostic, solution for asking for changes?
So I have a map from Key -> Struct.
My key will be a device's IP address, and the value (a struct) will hold the device's IP address and a time; once that amount of time has elapsed, the key-value pair expires and is deleted from the map.
I am fairly new at this, so I was wondering what would be a good way to go about it.
I have googled around and seem to find a lot on time-based maps, but only in Java.
EDIT
After coming across this, I think I may have to create a map with the items in it, and a deque in parallel holding references to each element. Then I would periodically call clean, and delete any element that has been in there longer than x amount of time.
Is this correct, or can anyone suggest a better way of doing it?
I've used three approaches to solve a problem like this.
1. Use a periodic timer. Once every time quantum, get all the expiring elements and expire them. Keep the elements in timer wheels, see scheme 7 in this paper for ideas. The overhead here is that the periodic timer will kick in when it has nothing to do and buckets have a constant memory overhead, but this is the most efficient thing you can do if you add and remove things from the map much more often than you expire elements from it.
2. Check all elements for the shortest expiry time. Schedule a timer to kick in after that amount of time. In the timer, remove the expired element and schedule the next timer. Reschedule the timer every time a new element is added if its expiration time is shorter than the currently scheduled timer. Keep the elements in a heap for fast lookup of who needs to expire first. This has a quite large insertion and deletion overhead, but is pretty efficient when the most common deletion from the map is through expiry.
3. Every time you access the map, check if the element you're accessing is expired. If it is, just throw it away and pretend it wasn't there in the first place. This could be quite inefficient because of all the calls to check timestamp on every access and doesn't work if you need to perform some action on expiry.
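As an illustration of the heap-based approach, here is a minimal Python sketch. The class `HeapExpiringMap` and its methods are invented for this example, the timer itself is left out (the caller is expected to call `expire(now)` when `next_expiry()` is reached), and stale heap records left behind by overwritten keys are skipped rather than removed eagerly.

```python
import heapq

class HeapExpiringMap:
    """Map whose entries expire; a min-heap orders keys by expiry time."""

    def __init__(self):
        self._data = {}   # key -> (value, expires_at)
        self._heap = []   # (expires_at, key); may contain stale records

    def put(self, key, value, expires_at):
        self._data[key] = (value, expires_at)
        heapq.heappush(self._heap, (expires_at, key))

    def get(self, key):
        entry = self._data.get(key)
        return entry[0] if entry else None

    def _is_live(self, expires_at, key):
        entry = self._data.get(key)
        return entry is not None and entry[1] == expires_at

    def next_expiry(self):
        """Time at which the timer should next fire, or None if the map is empty."""
        while self._heap and not self._is_live(*self._heap[0]):
            heapq.heappop(self._heap)   # discard stale records
        return self._heap[0][0] if self._heap else None

    def expire(self, now):
        """Remove and return the keys whose expiry time is <= now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expires_at, key = heapq.heappop(self._heap)
            if self._is_live(expires_at, key):
                del self._data[key]
                expired.append(key)
        return expired
```

Since `expire` returns the removed keys, this variant also leaves room for running some action on expiry, which the lazy approach cannot do.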