Sum of concurrently changing variable - database

I want to keep track of a sum of user-controlled variables.
Each user can add/remove/update his/her own variables.
Users should be able to see the sum change after their own update.
As the number of users scales up, the system will be distributed and updates will happen concurrently.
I want to avoid a bottleneck for updating the sum.
What is the best way to keep track of this sum?
Is there an existing database that can handle this, or do I need to implement something myself?

So generally, if you know by how much each value has changed you know how the sum has changed and you can use these incremental changes to update the sum.
In a centralised way you could, for instance, use any SQL database that supports triggers and transactions. You'd have one table describing the different clients / numbers and their values, and another table to cache the sum. The idea is that a trigger runs on every update / delete / insert and just adjusts the cached sum by the change. This is much faster for huge amounts of data, but also more error-prone than recomputing. (You could also just re-sum all the values in the trigger and store the result in the cache; that would easily work for a few thousand values.)
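A minimal sketch of the trigger idea, using Python's built-in sqlite3 module as a stand-in for "any SQL database". The table names (user_values, cached_sum) and trigger names are made up for illustration.

```python
import sqlite3

def make_db():
    """Create an in-memory database whose triggers keep a cached sum current."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Hypothetical schema: one row per user-controlled value.
        CREATE TABLE user_values (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            value   INTEGER NOT NULL
        );
        -- Single-row table that caches the running total.
        CREATE TABLE cached_sum (
            id    INTEGER PRIMARY KEY CHECK (id = 1),
            total INTEGER NOT NULL
        );
        INSERT INTO cached_sum (id, total) VALUES (1, 0);

        CREATE TRIGGER values_ins AFTER INSERT ON user_values BEGIN
            UPDATE cached_sum SET total = total + NEW.value WHERE id = 1;
        END;
        CREATE TRIGGER values_upd AFTER UPDATE OF value ON user_values BEGIN
            UPDATE cached_sum SET total = total + NEW.value - OLD.value WHERE id = 1;
        END;
        CREATE TRIGGER values_del AFTER DELETE ON user_values BEGIN
            UPDATE cached_sum SET total = total - OLD.value WHERE id = 1;
        END;
    """)
    return db

def cached_total(db):
    """Read the cached sum without scanning user_values."""
    return db.execute("SELECT total FROM cached_sum WHERE id = 1").fetchone()[0]
```

Reading the total is then a single-row lookup, at the price of every write also touching the one cached_sum row, which is exactly the contention point the question is worried about.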
In a decentralised system you can do something similar. Here you can either share all the values between all the clients or (as this might be too much) just the changes. Every client is responsible for some values, and on every change it announces by how much the total sum has changed. Example: if the user modifies a value from 5 to 3, the client broadcasts -2. You assume an initial state of 0 and just sum up all the numbers from the clients as they come in. The order doesn't matter, because addition is commutative. You only need to make sure that everyone receives the data, which you can achieve via reliable multicast.
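To illustrate why the order of delivery does not matter, here is a small sketch in Python. The class and function names are invented for the example, and the network is replaced by plain lists.

```python
import random

class SumReplica:
    """One node's view of the global sum, built only from received deltas."""

    def __init__(self):
        self.total = 0  # assumed initial state of 0

    def receive(self, delta):
        self.total += delta

def delta_for_change(old_value, new_value):
    """What a client broadcasts when one of its values changes."""
    return new_value - old_value

# Three changes: a new value 5, that value edited from 5 to 3, a new value 10.
deltas = [
    delta_for_change(0, 5),
    delta_for_change(5, 3),
    delta_for_change(0, 10),
]

# Replica a receives the deltas in order, replica b in a random order.
a, b = SumReplica(), SumReplica()
for d in deltas:
    a.receive(d)
shuffled = deltas[:]
random.shuffle(shuffled)
for d in shuffled:
    b.receive(d)
```

Both replicas end up with the same total. Note that this only holds if every delta is delivered exactly once; a delta that is lost or applied twice leaves that replica permanently off.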

Related

DynamoDB concurrent write, result in throttling

Below is the structure of my table:
table
UUID key - let's call this **EntryKey**
HistoryLog - this is also a version number
Map<UUID, BYTE> value - let's call the UUID map key **EntryChildKey**
version - for **optimistic locking**
Let's assume the map has around 10k entries, each mapping a UUID to some value.
So, my problem is that once in a while I get requests to update the values of all 10k EntryChildKey map entries, and all of these requests hit the db at the same time. Because every request touches the same EntryKey row, I run into a lot of concurrency errors: the version changes on every write, I have to retry, and the EntryChildKey updates keep thrashing each other, which results in DynamoDB throttling my requests.
I can get out of this problem if I separate this in to 2 tables as below, but we have to maintain HistoryLog version changing at EntryKey level and also there are some other problem so I can’t take this route
Table1: UUID EntryChildKey, BYTE value
Table2: UUID EntryKey, List<UUID> EntryChildKey
So, another approach I am thinking of is something like a write-ahead log: I'd update the version and record the intent to update the table, but not update the record itself. Instead I'd keep the intents as a list in the table and then apply the EntryChildKey updates sequentially. But I don't know whether there is something like this, or anything similar, that I can do with DynamoDB.
I'd also appreciate any other approach that could help solve this problem.
If you really do need to have a version attribute be updated on a single key each time any one of the 10k EntryChild items is updated then your only option is to decouple the table from the update source.
DynamoDB has a hard limit of up to 1000 writes/second to any item at all times. There is simply no way to increase that, for a single item. It doesn't matter what size table you have, how many partitions, or how much total write capacity you allocate to your table, a single item will never be able to be updated more than 1000 times per second.
So, if your requirement to update an attribute (the HistoryLog in your example) on the "master" entry item is really firm, then to use DynamoDB your best bet is to introduce a queue and batching to pre-process the updates before writing to Dynamo.
You could create an SQS queue and use a lambda function to read from the queue and write to Dynamo.
In a naive approach, you could simply read from the queue and then write to the table as fast as you can, limited by the DynamoDB throttling. For 10k updates to the same "master" key this will take at least 10 seconds, though in reality it will likely take longer.
A better option though, would be to run the lambda on a schedule, say once a second, and have it read all the messages available in the queue and combine all updates to the same "master" key into a single update. That way, you only write to the same item at most once every second.
The big challenge with a normal SQS queue is that it does not offer exactly-once semantics, meaning some items in the queue will be received multiple times. If you can design a system where you can safely discard duplicate updates, then this approach will work wonderfully. If not, then things get more complicated.
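The batching step can be sketched in plain Python, independent of any AWS SDK. The message layout (id, entry_key, child_key, value) and the function name are assumptions made for this example; the sketch shows only how one batch of queue messages might be collapsed into a single pending write per "master" key, with duplicate deliveries dropped by message id.

```python
from collections import defaultdict

def coalesce(messages, seen_ids):
    """Merge queued updates into one pending write per master key.

    messages: iterable of dicts with the hypothetical keys
              "id", "entry_key", "child_key" and "value".
    seen_ids: set of message ids that were already processed; it is
              updated in place so later batches can skip redeliveries.
    Returns {entry_key: {child_key: value}}.
    """
    pending = defaultdict(dict)
    for msg in messages:
        if msg["id"] in seen_ids:
            continue  # duplicate delivery, safe to discard
        seen_ids.add(msg["id"])
        # A later message for the same child key overwrites an earlier one.
        pending[msg["entry_key"]][msg["child_key"]] = msg["value"]
    return dict(pending)
```

Each key of the result then needs only one write, and one version bump, per batch. In a real deployment the set of seen ids would have to live somewhere durable and be trimmed over time, which this sketch leaves out.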

Store calculated value in database in this scenario?

I store readings in a database from sensors for a Temperature monitoring system.
There are 2 types of reading: air and product. The product temperature represents the slow temperature change of an item of food, as opposed to the actual air temperature.
The 2 temperatures are taken from different sensors (different locations within the environment, usually a large controlled environment), so they are not related (i.e. I cannot derive the product temperature from the air temperature).
Initially the product temperature I was provided with was already damped by the sensor, however whoever wrote the firmware made a mistake so the damped value is incorrect, and now I instead have to take the un-damped reading from the product sensor and apply the damping myself based on the last few readings in the database.
When a new reading comes in, I look at the last few undamped readings, and the last damped reading, and determine a new damped reading from that.
My question is: Should I store this calculated reading as well as the undamped reading, or should I calculate it in a view leaving all physically stored readings undamped?
One thing that might influence this: the readings are critical. Alarm rows are generated against the readings when they go out of tolerance; this is to prevent food poisoning, and people can lose their jobs over it. People sign off the values they see, so those values must never change.
Normally I would use a view and put the calculation in the view, but I'm a little nervous about doing that this time. If the calculation gets "tweaked", I then have to make the view more complicated to use the old calculation before a certain timestamp, etc. (which is fine; I just have to be careful wherever I query the reading values. I don't like nesting views in other views, as it can sometimes slow the query).
What would you do in this case?
Thanks!
The underlying idea from the relational model is "logical data independence". Among other things, SQL views implement logical data independence.
So you can start by putting the calculation in a view. Later, when it becomes too complex to maintain that way, you can move the calculation to a SQL function or SQL stored procedure, or you can move the calculation to application code. You can store the results in a base table if you want to. Then update the view definition.
The view's clients should continue to work as if nothing had changed.
Here's one problem with storing this calculated value in a base table: you probably can't write a CHECK constraint to guarantee it was calculated correctly. This is a problem regardless of whether you display the value in a view. That means you might need some kind of administrative procedure to periodically validate the data.
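As a rough sketch of what such a periodic validation could look like, the Python below recomputes each stored damped value and reports the rows that disagree. The damping formula, the window size and the formula_version field are all assumptions for illustration; the real damping calculation is not given in the question.

```python
from statistics import fmean

WINDOW = 3  # assumed number of recent undamped readings used per calculation

def damp_v1(recent_raw, last_damped):
    # Hypothetical damping formula: move the last damped value a fraction
    # of the way towards the mean of the recent undamped readings.
    return last_damped + 0.2 * (fmean(recent_raw) - last_damped)

# Each stored row records which formula version produced it, so that a later
# "tweak" becomes a new entry here and old rows still validate.
FORMULAS = {1: damp_v1}

def find_mismatches(readings, tolerance=1e-9):
    """Return indexes of rows whose stored damped value cannot be reproduced.

    readings: time-ordered list of dicts with the hypothetical keys
              "raw", "damped" and "formula_version". Row 0 is the seed.
    """
    bad = []
    for i in range(1, len(readings)):
        row = readings[i]
        recent_raw = [r["raw"] for r in readings[max(0, i - WINDOW + 1): i + 1]]
        formula = FORMULAS[row["formula_version"]]
        expected = formula(recent_raw, readings[i - 1]["damped"])
        if abs(expected - row["damped"]) > tolerance:
            bad.append(i)
    return bad
```

Because each damped value depends on the previous one, a single altered row also makes the row after it fail validation, so the first reported index is the one to inspect.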

Paging of frequently changing data

I'm developing a web application which displays a list of, let's say, "threads". The list can be sorted by the number of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10x in a second. The application furthermore is distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list. And I can't transmit the whole sorted list by likes to a user at once.
As soon as a user goes to page 2 of this list, the list has likely changed and may contain threads already shown on page 1.
Solutions which don't work:
Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshotting the list in a temp database table (the data changes too frequently and needs to be current)
(If it matters I'm using MongoDB+c#)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-03 07:30 UTC". That makes your query super easy to implement. Then, when you pull down all the next elements, you just look for anything that occurred before the last element on the page.
The downfall of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
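A minimal sketch of this timestamp-based paging in Python, with an in-memory list standing in for the database. The field name created and the function name are assumptions for the example.

```python
def page_before(threads, before_ts, page_size):
    """Return one page of threads created strictly before before_ts.

    threads:   list of dicts with a hypothetical "created" Unix timestamp.
    before_ts: the cursor taken from the "next page" link.
    Returns (page, next_cursor); next_cursor is None when nothing is left.
    """
    older = sorted(
        (t for t in threads if t["created"] < before_ts),
        key=lambda t: t["created"],
        reverse=True,  # newest first
    )
    page = older[:page_size]
    next_cursor = page[-1]["created"] if page else None
    return page, next_cursor
```

Threads created after the user started browsing never shift the later pages, because each page is defined by the cursor and not by an offset. The sketch assumes creation timestamps are unique; with ties, a thread sharing the cursor's timestamp would be skipped, so a real implementation would need a tie-breaker such as the thread id.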
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
timestamp long (PK)
created datetime
last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
id long (PK)
identifier long
last_modified datetime
title string
body string
score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
timestamp long (FK -> Time.timestamp)
thread_id long (FK -> Threads.id)
thread_identifier long (matches Threads[thread_id].identifier)
(timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
AND Threads.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Threads.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
You also don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
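The write path described above can be sketched with Python's built-in sqlite3 module. This is a simplified illustration, not a production design: the tables are trimmed to a few columns, and the snapshot timestamp is a simple counter rather than a real clock value.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Time (timestamp INTEGER PRIMARY KEY);
    CREATE TABLE Threads (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        identifier INTEGER,
        title      TEXT,
        score      INTEGER
    );
    CREATE TABLE ThreadStar (
        timestamp         INTEGER REFERENCES Time(timestamp),
        thread_id         INTEGER REFERENCES Threads(id),
        thread_identifier INTEGER,
        UNIQUE (timestamp, thread_identifier)
    );
""")

def write_thread(identifier, title, score):
    """Create or update a thread: insert a new row version and a new snapshot."""
    prev = db.execute("SELECT MAX(timestamp) FROM Time").fetchone()[0]
    new_ts = (prev or 0) + 1  # assumption: a counter stands in for the clock
    db.execute("INSERT INTO Time (timestamp) VALUES (?)", (new_ts,))
    new_id = db.execute(
        "INSERT INTO Threads (identifier, title, score) VALUES (?, ?, ?)",
        (identifier, title, score),
    ).lastrowid
    if prev is not None:
        # Copy the previous snapshot, leaving out the thread being replaced.
        db.execute(
            """INSERT INTO ThreadStar (timestamp, thread_id, thread_identifier)
               SELECT ?, thread_id, thread_identifier
               FROM ThreadStar
               WHERE timestamp = ? AND thread_identifier != ?""",
            (new_ts, prev, identifier),
        )
    db.execute("INSERT INTO ThreadStar VALUES (?, ?, ?)", (new_ts, new_id, identifier))
    return new_ts

def threads_at(timestamp):
    """List (identifier, score) as of one snapshot, highest score first."""
    return db.execute(
        """SELECT Threads.identifier, Threads.score
           FROM Threads
           JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
           WHERE ThreadStar.timestamp = ?
           ORDER BY Threads.score DESC""",
        (timestamp,),
    ).fetchall()
```

A reader who started paging at an older snapshot keeps seeing that snapshot's ordering, even after later writes change the scores.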
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budget, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes, I would use a two-stage approach. For the frequently changing data I would use an in-memory DB to keep up with the change rates and flush it periodically to the "real" db.
Once you have that, the query for constantly changing data is easy:
1. Query the db.
2. Query the in-memory db.
3. Merge the frequently changed data from the in-memory db with the "slow" db data.
4. Remember which results you have already displayed, so that pressing the next button will not display an already displayed value a second time on a different page because its rank has changed.
If many people look at the same data, it might help to cache the results of step 3 to reduce the load on the real db even further.
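The steps above can be sketched in Python, with a dict standing in for the "real" db and a Counter standing in for the in-memory db. All names are invented for the example.

```python
from collections import Counter

class LikeStore:
    """Two-stage like counts: persisted values plus unflushed in-memory deltas."""

    def __init__(self, persisted):
        self.persisted = dict(persisted)  # stand-in for the "real" db
        self.pending = Counter()          # stand-in for the in-memory db

    def like(self, thread_id, delta=1):
        self.pending[thread_id] += delta

    def current_likes(self):
        """Steps 1 to 3: read both stores and merge them."""
        merged = Counter(self.persisted)
        merged.update(self.pending)  # Counter.update adds the counts
        return dict(merged)

    def flush(self):
        """Periodically write the accumulated deltas to the real db."""
        for thread_id, delta in self.pending.items():
            self.persisted[thread_id] = self.persisted.get(thread_id, 0) + delta
        self.pending.clear()

def next_page(store, seen, page_size):
    """Step 4: return the best-ranked threads that were not shown before."""
    ranked = sorted(store.current_likes().items(), key=lambda kv: (-kv[1], kv[0]))
    page = [thread_id for thread_id, _ in ranked if thread_id not in seen][:page_size]
    seen.update(page)
    return page
```

The seen set is per user, which is the storage cost the question wanted to avoid; the sketch only shows that the merge and the de-duplication themselves are simple.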
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all 'thread' results on the server the first time the user hits the database. Then I'd return the first page of data to the user, and for each subsequent next-page call I'd return cached results.
To minimize memory usage you can cache only records ids and fetch whole data when user requests it.
Cache can be evicted each time user exits current page. If it isn't a ton of data I would stick to this solution because user won't get annoyed of data constantly changing.
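A small sketch of the id-only cache in Python. The class name and the fetch_by_id callback are assumptions; the callback stands for whatever loads a full thread record from the database.

```python
class PagingSession:
    """Per-user snapshot of the ranking that stores only thread ids."""

    def __init__(self, threads):
        # Rank once, when the user first opens the list, and keep only ids.
        ranked = sorted(threads, key=lambda t: t["likes"], reverse=True)
        self.ids = [t["id"] for t in ranked]

    def page(self, number, page_size, fetch_by_id):
        """Return page `number` (1-based), loading full records on demand."""
        start = (number - 1) * page_size
        return [fetch_by_id(i) for i in self.ids[start:start + page_size]]
```

The order of the pages is frozen for the lifetime of the session, while the like counts shown on each page are whatever fetch_by_id returns at that moment.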

Where should I handle concurrent operations on the same data? In the application or the database?

So I am building a Java webapp with Spring and Hibernate. In the application users can add points to an object, and I'd like to count the points given in order to rank my objects. The objects are also stored in the database. And hopefully hundreds of people will give points to the objects at the same time.
But how do I count the points and save them in the database at the same time? Usually I would just have a property on my object and just increase the points. But that would mean that I have to lock the data in the database with a pessimistic transaction in order to prevent concurrency issues (reading the amount of points while another thread is half way through changing it already). That would possibly make my app much slower (at least I imagine it would).
The other solution would be to store the amount of given points in an associated object and store them separately in the database while counting the points in memory within a "small" synchronized block or something.
Which solution has the least performance impact when handling many concurrent operations on the same objects. Or are there any other fitting solutions?
If you would like the values to be persisted, then you should persist them in your database.
Given that, try the following:
create a very narrow row, like just OBJ_ID and POINTS.
create an index only on OBJ_ID, so not a lot of time is spent updating indexes when values are inserted, updated or deleted.
use InnoDB, which has row-level locking, so the locks will be smaller
MySQL will give you the last committed (consistent) value
That's all pretty simple. Give it a whirl! Set up a test case that mimics your expected load and see how it performs. Post back if you get stuck.
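One way to use such a narrow table is to let the database do the increment in a single statement, so the application never reads the old value first. The sketch below uses Python's sqlite3 module purely as a stand-in (the answer above is about MySQL/InnoDB, and the question uses Java); the table name object_points is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The narrow row from the list above: just OBJ_ID and POINTS.
db.execute(
    "CREATE TABLE object_points ("
    " obj_id INTEGER PRIMARY KEY,"
    " points INTEGER NOT NULL DEFAULT 0)"
)
db.execute("INSERT INTO object_points (obj_id) VALUES (42)")

def add_points(obj_id, amount):
    """Increment inside the database instead of read-modify-write in the app."""
    with db:  # commits on success, rolls back on an exception
        db.execute(
            "UPDATE object_points SET points = points + ? WHERE obj_id = ?",
            (amount, obj_id),
        )

def get_points(obj_id):
    row = db.execute(
        "SELECT points FROM object_points WHERE obj_id = ?", (obj_id,)
    ).fetchone()
    return row[0]
```

Because the new value is computed by the UPDATE itself, two concurrent increments cannot overwrite each other the way two "read, add, write back" sequences in application code can.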
Good luck.

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem somehow, but I can't see how. So I'm looking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25.000 users, which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not be an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
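A sketch of this "calculate on write, sort in the database" idea, using Python's sqlite3 module as a stand-in database. The scoring function here is a placeholder; the question says the real calculation has to run in application code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE teams (
        team_id INTEGER PRIMARY KEY,
        score   INTEGER NOT NULL
    );
    -- Index on the calculated value so ORDER BY score can use it.
    CREATE INDEX teams_by_score ON teams (score);
""")

def calculate_score(data_points):
    # Hypothetical stand-in for the game's real scoring code.
    return sum(data_points)

def save_team_data(team_id, data_points):
    """Recalculate and store the team's score whenever its data changes."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO teams (team_id, score) VALUES (?, ?)",
            (team_id, calculate_score(data_points)),
        )

def top_teams(n):
    """Let the database rank the teams; only n rows cross the network."""
    return db.execute(
        "SELECT team_id, score FROM teams ORDER BY score DESC LIMIT ?", (n,)
    ).fetchall()
```

The cost of scoring is spread over the individual writes, and the highscore itself becomes an ordinary indexed query instead of a batch job.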
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
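The "cache the totals and add only the difference" step can be sketched in Python as follows. The record layout (record_id, team_id, points) and the assumption that record ids only grow are made up for the example, and it assumes scores are additive, which may not hold for the real scoring rules.

```python
def refresh_totals(cached_totals, last_seen_id, records):
    """Add only the records newer than last_seen_id to the cached totals.

    cached_totals: {team_id: total} from the previous calculation.
    last_seen_id:  highest record id included in cached_totals.
    records:       iterable of (record_id, team_id, points) tuples; in a
                   real system this would be a query for ids > last_seen_id.
    Returns (new_totals, new_last_seen_id) without mutating the inputs.
    """
    totals = dict(cached_totals)
    newest = last_seen_id
    for record_id, team_id, points in records:
        if record_id <= last_seen_id:
            continue  # already counted in the cached totals
        totals[team_id] = totals.get(team_id, 0) + points
        newest = max(newest, record_id)
    return totals, newest
```

Each refresh then touches only the rows inserted since the previous run instead of all 25-30 million.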
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straight forward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
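The top-100 scan mentioned above can be sketched in Python with the standard bisect module. The function name is made up; the kept list is stored in ascending order, so its first element is the lowest score that currently makes the cut.

```python
import bisect

def top_scores(scores, limit=100):
    """Return the `limit` highest scores, highest first, in a single pass."""
    top = []  # ascending order; top[0] is the lowest score currently kept
    for score in scores:
        if len(top) < limit:
            bisect.insort(top, score)
        elif score > top[0]:
            bisect.insort(top, score)  # insert in sorted position
            del top[0]                 # drop the score that fell out
        # Otherwise the score does not beat the lowest kept one,
        # which is the common case: nothing to do.
    return top[::-1]
```

For most elements the only work is one comparison against top[0], which is why the scan stays cheap even over half a million scores.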

Resources