I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". So a good strategy/pattern to use here is a "thread pool", with each of N dedicated threads holding a connection and picking up write requests as they arrive and as each thread finishes the previous one it was working on. The number of threads in the pool for best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and from which the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data -- maybe with a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)
Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatedly reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it ever possible for the read process to read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually, the write will be fully propagated and the read process will only read 1 values, but I'm unsure if it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually-consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes when two out of three of the copies (including one designated as the "leader") have been updated. It is possible that the third copy will take some time (usually a very short time) to get updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short window where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might go at random to the non-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guaranteed monotonic reads, and discovered that it did not by actually observing monotonic-read violations.
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not...) use the same HTTP connections to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica - despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice - but they may still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic read property to hold.
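If you wanted to look for such violations yourself, a rough test might look like the sketch below (boto3; the table name, key name and 'val' attribute are made up, and as noted above connection reuse may mask the effect). One process writes an ever-increasing counter; this process only reads with eventually consistent reads and checks that the value never goes backward:

import boto3

dynamodb = boto3.client('dynamodb')

def read_val():
    resp = dynamodb.get_item(
        TableName='monotonic-test',
        Key={'pk': {'S': 'test'}},
        ConsistentRead=False)            # eventually consistent read
    return int(resp['Item']['val']['N'])

last = read_val()
for _ in range(100000):
    current = read_val()
    if current < last:
        print('monotonic-read violation: saw', current, 'after', last)
    last = max(last, current)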
Yes it is possible. Requests are stateless so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) not ever get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads

When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
But more importantly, every item in DDB is stored on three storage nodes. A write to DDB doesn't return 200 - Success until that data is written to 2 of the 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest.
See Amazon DynamoDB Under the Hood
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections by default to the DDB request router, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure, and they haven't said.
Assuming that belief is correct:
curl in a shell script loop - more likely to see the old data again
a loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
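For completeness, in boto3 that is just a flag on the read call (a short sketch; the table and key names are assumptions):

import boto3

table = boto3.resource('dynamodb').Table('my-table')

# Strongly consistent GetItem: never returns stale data, at roughly double
# the read-capacity cost and slightly higher latency.
item = table.get_item(Key={'pk': 'some-key'}, ConsistentRead=True).get('Item')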
Another option might be to stick DynamoDB Accelerator (DAX) in front of your DDB, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it does not come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.
I'm building a web service which reserves unique items to users.
The service is required to handle a high volume of concurrent requests that should avoid blocking each other as much as possible. Each incoming request must reserve n unique items of the desired type, and then either process them successfully or release them back to the reservables list so they can be reserved by another request. Successful processing involves multiple steps, like communicating with integrated services and other time-consuming work, so keeping the items reserved with a DB transaction until the end would not be an efficient solution.
Currently I've implemented a solution where reservable items are stored in a buffer DB table, where items are locked and deleted by incoming requests with SELECT FOR UPDATE SKIP LOCKED. As the service must support multiple item types, this buffer table contains only n items per type at a time, because the table size would otherwise grow too big (there are about ten thousand different types). When all items of a certain type have been reserved (selected and removed), the request locks the item type and adds more reservable items into the buffer. This fill operation requires integration calls and may take some time.
During the fill, all other operations need to wait until the filling operation finishes and items become available. This is where the problem arises: when thousands of requests wait for the same item type to become available in the buffer, each needs to poll this information somehow.
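For reference, the reservation step described above might look roughly like this (a sketch only; psycopg2 plus table and column names are my assumptions, not the actual schema):

import psycopg2

def reserve_items(conn, item_type, n):
    # Claim up to n reservable items of one type, skipping rows that
    # other transactions currently hold locks on.
    with conn:
        with conn.cursor() as cur:
            cur.execute("""
                DELETE FROM reservable_items
                WHERE id IN (
                    SELECT id FROM reservable_items
                    WHERE item_type = %s
                    ORDER BY id
                    LIMIT %s
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING id, payload
            """, (item_type, n))
            return cur.fetchall()   # fewer than n rows means the buffer ran dry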
What could be an efficient solution for this kind of polling?
I think the "real" answer is to start the refill process when the stock gets low, rather than when it is completely depleted. Then it would already be refilled by the time anyone needs to block on it. Or perhaps you could make the refill process work asynchronously, so that the new rows are generated near-instantly and then the integrations are called later. So you would enqueue the integrations, rather than the consumers.
But barring that, it seems like you want the waiters to lock the "item type" in a mode incompatible with how the refiller locks it. Then they will naturally block, and be released once the refiller is done. The problem is that if you want to assemble an order of 50 things and the 47th is depleted, do you want to maintain the reservation on the previous 46 things while you wait?
Presumably your reservation is not blocking anyone else, unless the one you have reserved is the last one available. In which case you are not really blocking them, just forcing them to go through the refill process, which would have had to be done eventually anyway.
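If you go the blocking route, one concrete way to let the waiters block on the refill (rather than poll) is a transaction-level advisory lock keyed by the item type: the refiller holds it for the duration of the fill, and anyone who finds the buffer empty takes the same lock briefly before retrying. A sketch, with the same assumed names as above and assuming an integer type id:

def refill_item_type(conn, type_id, fetch_new_items):
    # Refiller: hold the advisory lock for this type while inserting new rows.
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (type_id,))
            rows = fetch_new_items(type_id)          # the slow integration calls
            cur.executemany(
                "INSERT INTO reservable_items (item_type, payload) VALUES (%s, %s)",
                [(type_id, payload) for payload in rows])
    # the lock is released automatically when the transaction commits

def wait_for_refill(conn, type_id):
    # Waiter: blocks inside PostgreSQL until no refill is in progress.
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (type_id,))
    # lock released on commit; the caller can now retry the reservation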
Let us say we have two users running a query against the same table in PostgreSQL. So,
User 1: SELECT * FROM table WHERE year = '2020' and
User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time as opposed to executing one after the other?
I would expect that if I have 2 processors, I can run both at the same time. But I am thinking that matters become far more complicated depending on where the data is located (e.g. disk) given that it is the same table, whether there is partitioning, configurations, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get my desired behaviour and under which circumstances will I not?
EDIT: I have found this other question which is very close to what I was asking - https://dba.stackexchange.com/questions/72325/postgresql-if-i-run-multiple-queries-concurrently-under-what-circumstances-wo. It is a bit old and doesn't have many answers; I would appreciate a fresh outlook on it.
If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned?
You should always get the correct answer to your query (or possibly some kind of ERROR indicating a failure to serialize if you are using the highest, and non-default, isolation level; but that doesn't seem to be a risk if each of those queries is run in a single-statement transaction).
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.
I have a bit of a strange problem. I have a module running on gae that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb module. Each task accesses a bunch of data from a few different tables then calls put.
The first few tasks work fine but as time continues I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put with a try and put in a randomised timeout so it retries a couple of times. This mitigated the problem a little, it just happens later on.
Here is some pseudocode for my task:
import logging
import random
import time

logger = logging.getLogger(__name__)

def my_task(request):
    stuff = get_ndb_instances()      # this accesses a few things from different tables
    better_stuff = process(stuff)    # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        return oInstance.put()       # out of retries: let any error propagate
    try:
        return oInstance.put()
    except Exception:                # contention shows up as TransactionFailedError
        logger.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through until it stops working. With the try_put it gets further, like 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously and they access ndb in a way that can cause contention. If contention is detected then a randomly timed retry happens, and this eliminates contention perfectly well - for a little while. Tasks keep running and completing, and the more that successfully return, the more contention happens, even though the processes using the contended-upon data should be finished. Is there something going on that's holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy where we have something like this (the direction of the arrows specifies parent child relationships, ie: Type has a bunch of child Instances etc)
Type->Instance->Position
The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
Contention will get worse over time if you continually exceed the limit of 1 write/transaction per entity group per second. The answer is in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is not the right way to handle these error responses. You should instead look into exponential back-off with jitter (example); a rough sketch is below.
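For instance, a backoff-with-jitter retry wrapper around the put might look like this (a sketch only; I'm assuming TransactionFailedError from datastore_errors is the exception ndb surfaces here, per the error in the question - adjust to whatever you actually see):

import random
import time

from google.appengine.api import datastore_errors

def put_with_backoff(entity, max_retries=8, base=0.1, cap=30.0):
    # Full-jitter exponential backoff around entity.put().
    for attempt in range(max_retries):
        try:
            return entity.put()
        except datastore_errors.TransactionFailedError:
            if attempt == max_retries - 1:
                raise
            # sleep a random amount up to min(cap, base * 2**attempt)
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))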
Similarly, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (removing it if not), or whether you should shard the entity group in some manner that makes sense according to your queries and consistency requirements.
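As one illustration of the sharding idea (invented names, not your actual models, and only if your queries and consistency requirements can tolerate it): instead of hanging every Position off a single Instance parent, derive a small fixed number of shard parents per Instance so that different Position names land in different entity groups:

import zlib

from google.appengine.ext import ndb

NUM_SHARDS = 16  # tune to your sustained write rate

class Position(ndb.Model):
    value = ndb.FloatProperty()

def shard_parent(instance_id, position_name):
    # Stable mapping: the same position name always uses the same shard,
    # but different names spread across NUM_SHARDS entity groups.
    shard = zlib.crc32(position_name) % NUM_SHARDS
    return ndb.Key('Instance', '%s-shard-%d' % (instance_id, shard))

def put_position(instance_id, position_name, value):
    Position(parent=shard_parent(instance_id, position_name),
             id=position_name, value=value).put()

The trade-off is that a strongly consistent ancestor query over "all Positions of an Instance" now has to fan out across the shards, so this only makes sense if your reads can live with that.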
I want to build a "TRAP AGENT" library. The trap agent library keeps track of various parameters of the client system. If a parameter of the client system rises above a threshold, the trap agent library on the client side notifies the server about that parameter. For example, if CPU usage exceeds the threshold, it will notify the server that CPU usage is exceeded. I have to measure 50-100 parameters (like memory usage, network usage, etc.) on the client side.
Now I have the basic idea about the design, but I am stuck with the entire library design.
I have thought of below solutions:
I can create a thread for each parameter (i.e. each thread will monitor a single parameter).
I can create a process for each parameter (i.e. each process will monitor a single parameter).
I can classify the various parameters into various groups, like the data-usage parameter falling into the network group and the CPU and memory usage parameters falling into the system group, and then create a thread for each group.
Now the 1st solution looks good compared to the 2nd. But if I adopt the 1st solution, it may fail when I want to upgrade my library from 100 to 1000 parameters, because I would have to create 1000 threads at that time, which is not good design (I think so; if I am wrong, correct me).
The 3rd solution is good, but response time will be high since many parameters will be monitored in a single thread.
Is there any better approach?
In general, it's a bad idea to spawn threads 1-to-1 for any logical mapping in your code. You can quickly exhaust the available threads of the system.
In .NET this is very elegantly handled using thread pools:
Thread vs ThreadPool
Here is a C++ discussion, but the concept is the same:
Thread pooling in C++11
Processes are also high overhead on Windows. Both designs sound like they would ironically be quite taxing on the very resources you are trying to monitor.
Threads (and processes) give you parallelism where you need it. For example, letting the GUI be responsive while some background task is running. But if you are just monitoring in the background and reporting to a server, why require so much parallelism?
You could just run each check, one after the other, in a tight event loop in one single thread. If you are worried about not sampling the values as often, I'd say that's actually a benefit. It does not help to consume 50% of the CPU just to monitor your CPU. If you are spot-checking values once every few seconds, that is probably fine resolution.
In fact, high resolution is of no help if you are reporting to a server. You don't want to denial-of-service-attack your server by making an HTTP call to it multiple times a second once some value triggers a trap.
NOTE: this doesn't mean you can't have a pluggable architecture. You could create some base class that represents checking a resource and then create subclasses for each specific type. Your event loop could iterate over an array or list of objects, calling each one successively and aggregating the results. At the end of the loop you report back to the server if any are out of range.
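A bare-bones sketch of that shape (class and function names are invented; the actual sampling calls are platform-specific and assumed to exist elsewhere):

import time

class Check(object):
    # One subclass per monitored parameter.
    name = 'check'
    threshold = None

    def sample(self):
        raise NotImplementedError

    def is_tripped(self):
        return self.sample() > self.threshold

class CpuCheck(Check):
    name = 'cpu'
    threshold = 90.0
    def sample(self):
        return read_cpu_percent()        # assumed platform-specific helper

class MemoryCheck(Check):
    name = 'memory'
    threshold = 80.0
    def sample(self):
        return read_memory_percent()     # assumed platform-specific helper

def monitor_loop(checks, report_to_server, interval=5.0):
    # Single thread: run every check in turn, then report any trips in one call.
    while True:
        tripped = [c.name for c in checks if c.is_tripped()]
        if tripped:
            report_to_server(tripped)
        time.sleep(interval)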
You may want to add logic to stop checking (or at least stop reporting back to the server) for some "cool down period" once a trap hits. You don't want to tax your server or spam your logs.
You can follow the methodology below:
1. You can have two threads: one thread is dedicated to measuring the emergency parameters, and the second thread monitors the non-emergency parameters. Hence the response time for emergency parameters will be lower.
2. You can define 3 threads. The first thread will monitor the high-priority (emergency) parameters, the second thread will monitor the intermediate-priority parameters, and the last thread will monitor the lowest-priority parameters. So the overall response time will be improved compared to the first solution.
3. If response time is not a concern, you can monitor all the parameters in a single thread. But in this case response time becomes worst when you upgrade your library to monitor 100 to 1000 parameters.
So in the 1st case there will be more response time for non-emergency parameters, while in the 3rd case there will definitely be very high response time.
So solution 2 is better.