How to dispatch thousands of SQL requests asynchronously - sql-server

We are writing a simple application:
build thousands of SQL select statements
run each select using BeginExecuteReader
put the results into another database
We've tried a few things that either leave connections in a SUSPENDED state (as verified by sp_who2), or take a much longer time to complete than just the SQL query itself (maybe some kind of deadlocking?).
We are:
calling EndExecuteReader in the callback handler.
calling conn.Close() and conn.Dispose()
recursively starting another call
public static void StartQuery() {
    // build the query text for array[i] into queryText
    // ...
    SqlConnection conn = new SqlConnection(AsyncConnectionString);
    conn.Open();
    SqlCommand cmd = new SqlCommand(queryText, conn);
    cmd.BeginExecuteReader(CallbackHandler, cmd);
    i++;
}
public static void CallbackHandler(IAsyncResult ar) {
    // unpack the cmd passed as the async state
    SqlCommand cmd = (SqlCommand)ar.AsyncState;
    using (SqlDataReader reader = cmd.EndExecuteReader(ar)) {
        // read some stuff to a DataTable...
        // SqlBulkCopy to another database (synchronously)
    }
    cmd.Connection.Close();
    cmd.Connection.Dispose();
    StartQuery();
}
Does anyone have recommendations or links on a robust pattern to solve this type of problem?
Thanks!

I assume you did set AsynchronousProcessing on the connection string. Thousands of BeginExecute queries pooled in the CLR are a recipe for disaster:
you'll quickly be capped by the max worker threads in the SQL Server and start experiencing long connection open times and frequent timeouts.
running 1000 loads in parallel is guaranteed to be much slower than running 1000 loads sequentially on N connections, where N is given by the number of cores on the Server. Thousands of parallel requests will simply create excessive contention on shared resources and slow each other down.
You have absolutely no reliability with thousands of requests queued up in the CLR. If the process crashes, you lose all the work without any trace.
A much better approach is to use a queue from which a pool of workers dequeue loads and execute them: a typical producer-consumer. The number of workers (consumers) will be tuned by the SQL Server resources (CPU cores, memory, IO pattern of the loads), but a safe number is 2 times the number of server cores. Each worker uses a dedicated connection for its work. The role of the workers and of the queue is not to speed up the work; on the contrary, they act as a throttling mechanism to prevent you from swamping the server.
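For illustration only, here is a minimal sketch of such a producer-consumer throttle (the class name LoadRunner and the bounded capacity of 1000 are just placeholders; each worker owns one dedicated connection and the worker count would be tuned as described above):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

public static class LoadRunner
{
    // Drains a queue of SELECT statements with a fixed pool of workers.
    public static void Run(IEnumerable<string> selectStatements, string connectionString, int workerCount)
    {
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Run(() =>
                {
                    using (var conn = new SqlConnection(connectionString))
                    {
                        conn.Open();
                        foreach (string sql in queue.GetConsumingEnumerable())
                        {
                            using (var cmd = new SqlCommand(sql, conn))
                            using (SqlDataReader reader = cmd.ExecuteReader())
                            {
                                // read into a DataTable / SqlBulkCopy to the target DB here
                            }
                        }
                    }
                }))
                .ToArray();

            foreach (string sql in selectStatements)
                queue.Add(sql);          // producer: blocks once the bounded queue is full
            queue.CompleteAdding();      // signal no more work; workers exit when drained

            Task.WaitAll(workers);
        }
    }
}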
An even better approach is to have the queue persisted in the database, as a means to recover from a crash. See Using Tables as Queues for the proper way of doing it, since table based queuing is notoriously error prone.
And finally, you can just let SQL Server handle everything, the queueing, the throttling and the processing itself via Activation. See Asynchronous Procedure Execution and the follow up article Passing Parameters to a Background Procedure.
Which one is the proper solution depends on lots of factors that you know about your problem but I don't, so I can't recommend which way you should go.

Related

NullPointerException happening intermittently with Hibernate criteriaQuery

I am working on a Spring Boot application in a multi-threaded environment where I am using Hibernate with javax.persistence.EntityManager to access the database. I have separate HikariCP pools for read and write queries.
Here, multiple threads performing read operations against the database (all the read queries) share a single read connection (since I have autowired the EntityManager rather than using @PersistenceContext). Similarly, multiple threads write to the database through the writeEntityManager, where a single connection is shared by all of the threads.
I am facing an issue with AbstractLockUpgradeEventListener.upgradeLock. It happens intermittently and I could not find the exact root cause.
A few assumptions:
DB utilization touches 100% (that might contribute to the issue).
A lock is applied before executing any write query, and threads are starved if one thread takes too long.
Can anyone suggest something here w.r.t. design or implementation strategy, or what the actual root cause could be?
This only happens once in a while
The Hibernate EntityManager is not thread-safe; you must not use it from multiple threads.
In fact, the EntityManager AND the objects it loads must not be used from multiple threads.
https://discourse.hibernate.org/t/hibernate-and-multithreading/289

Snowflake as backend for high demand API

My team and I have been using Snowflake daily for the past eight months to transform/enrich our data (with DBT) and make it available in other tools.
While the platform seems great for heavy/long-running queries on large datasets and for powering analytics tools such as Metabase and Mode, it just doesn't seem to behave well when we need to run really small queries ("grab me one line of table A") behind a high-demand API. What I mean by that is that SF sometimes takes as much as 100ms or even 300ms on an XLARGE-2XLARGE warehouse to fetch one row from a fairly small table (200k computed records/aggregates), which, added to the network latency, makes for a very poor setup when we want to use it as a backend to power a high-demand analytics API.
We've tested multiple setups with Node.js + Fastify, as well as Python + FastAPI, with connection pooling (10-20-50-100) and without connection pooling (one connection per request, not ideal at all), deployed in the same AWS region as our SF deployment, yet we weren't able to sustain anything close to 50-100 requests/sec with 1s latency (acceptable); instead we only got 10-20 requests/sec with latency as high as 15-30s. Both languages/frameworks behave well on their own, or even when just acquiring/releasing connections; what actually takes the longest and demands a lot of IO is the actual running of queries and waiting for a response. We've yet to try a Golang setup, but it all seems to boil down to how quickly Snowflake can return results for such queries.
We'd really like to use Snowflake as the database powering a read-only REST API that is expected to receive something like 300 requests/second, while keeping response times in the neighborhood of 1s. (But we are also ready to accept that it was just not meant for that.)
Is anyone using Snowflake in a similar setup? What is the best tool/config to get the most out of Snowflake in such conditions? Should we spin up many servers and hope that we'll get to a decent request rate? Or should we just copy transformed data over to something like Postgres to be able to have better response times?
I don't claim to be the authoritative answer on this, so people can feel free to correct me, but:
At the end of the day, you're trying to use Snowflake for something it's not optimized for. First, I'm going to run SELECT 1; to demonstrate the lower-bound of latency you can ever expect to receive. The result takes 40ms to return. Looking at the breakdown that is 21ms for the query compiler and 19ms to execute it. The compiler is designed to come up with really smart ways to process huge complex queries; not to compile small simple queries quickly.
After it has its query plan, it must find worker node(s) to execute it on. A virtual warehouse is a collection of worker nodes (servers/cloud VMs), with each warehouse size being a function of how many worker nodes it has, not necessarily the VM size of each worker (e.g. EC2 instance size). So now the compiled query gets sent off to a different machine to be run, where a worker process is spun up. Similar to the query planner, the worker process is not likely optimized to run small queries quickly, so the spin-up and tear-down of that process might be involved (at least relative to, say, a PostgreSQL worker process).
Putting my SELECT 1; example aside in favor of a "real" query, let's talk caching. First, Snowflake does not buffer tables in memory the way a typical RDBMS does. RAM is reserved for computation resources. This makes sense, since in traditional usage you're dealing with tables many GBs to TBs in size, so there would be no point: a typical LRU cache would purge that data before it was ever accessed again anyway. This means that a trip to an SSD disk must occur. This is where your performance will start to depend on how homogeneous/heterogeneous your API queries are. If you're lucky you get a cache hit on SSD, otherwise it's off to S3 to get your tables. Table files are not redundantly cached across all worker nodes, so while the query planner will try to schedule a computation on a node most likely to have the needed files in cache, there is no guarantee that a subsequent query will benefit from the cache left by the first query if it is assigned to a different worker node. The likelihood of this happening increases if you're firing hundreds of queries at the warehouse per second.
Lastly (and this could be the bulk of your problem, but I have saved it for last since I am the least certain about it): a small query can run on a subset of the workers in a virtual warehouse, in which case the warehouse can run different queries concurrently on different nodes. BUT, I am not sure whether a given worker node can process more than one query at once. If it cannot, your concurrency will be limited by the number of nodes in the warehouse, e.g. a warehouse with 10 worker nodes can run at most 10 queries in parallel, and what you're seeing are queries piling up at the query planner stage while it waits for worker nodes to free up.
Maybe for this type of workload the new Snowflake feature Search Optimization Service could help you speed up performance (https://docs.snowflake.com/en/user-guide/search-optimization-service.html).
I have to agree with Danny C that Snowflake is NOT designed for very low (sub-second) latency on single queries.
To demonstrate this consider the following SQL statements (which you can execute yourself):
create or replace table customer as
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
limit 500000;
-- Execution time 840ms
create or replace table customer_ten as
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
limit 10;
-- Execution time 431ms
I just ran this on an XSMALL warehouse and it demonstrates that currently (November 2022) Snowflake can copy HALF A MILLION rows in 840 milliseconds, yet takes 431 ms to copy just 10 rows.
Why is Snowflake so slow compared to (for example) Oracle 11g on premises?
Well, here's what Snowflake has to complete:
Compile the query and produce an efficient execution plan (plans are not currently cached as they often lead to a sub-optimal plan being executed on data which has significantly increased in volume)
Resume a virtual warehouse (if suspended)
Execute the query and write results to cloud storage
Synchronously replicate the data to two other data centres (typically a few miles apart)
Return OK to the user
Oracle, on the other hand, needs to:
Compile the query (if the query plan is not already cached)
Execute the query
Write results to local disk
If you REALLY want sub-second query performance on SELECT, INSERT, UPDATE and DELETE on Snowflake - it's coming soon. Just check out Snowflake Unistore and Hybrid Tables Explained
Hope this helps.

What is the optimal database connection strategy?

I have an ASP.NET MVC website which runs a number of queries for each page. Should I open a single connection, or open and close a connection for each query?
It really doesn't matter. When you use ADO.NET (which includes LINQ to SQL, NHibernate and any of the other ORMs), the library employs connection pooling. You can "close" and "reopen" a logical connection a dozen times, but the same physical connection will remain open the whole time. So don't concern yourself too much with whether the connection is open or closed.
Instead, you should be trying to limit the number of queries you have to run per page, because every round-trip incurs a significant overhead. If you're displaying the same data on every page, cache the results, and set up a cache dependency or expiration if it changes infrequently. Also try to re-use query data by using appropriate joins and/or eager loading (if you're using an ORM that lazy-loads).
Even if the data will always be completely different on every page load, you'll get better performance by using a single stored procedure that returns multiple result sets, than you would by running each query separately.
Bottom line: Forget about the connection strategy and start worrying about the query strategy. Any more than 3-5 queries per page and you're liable to run into serious scale issues.
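As a rough illustration of the multiple-result-sets point above, a single stored procedure call can feed several parts of a page in one round trip (the procedure name dbo.GetPageData and its result sets are hypothetical):

using System.Data;
using System.Data.SqlClient;

public static class PageQueries
{
    // One round trip returning several result sets, consumed with NextResult().
    public static void LoadPageData(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.GetPageData", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read()) { /* first result set, e.g. the page content */ }
                if (reader.NextResult())
                {
                    while (reader.Read()) { /* second result set, e.g. sidebar data */ }
                }
            }
        }
    }
}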
If you are running multiple queries on a page in regular ADO.NET, then they are run in sequence and connection pooling is going to mean it doesn't matter. Best practice is to open connections on demand and close them immediately - even for multiple queries in the same page. Connection pooling makes this fairly efficient.
When you are using multiple queries, your performance could improve significantly by opening multiple connections simultaneously and use asynchronous ADO, to ensure that all the requests are running at the same time in multiple threads. In this case, you need a connection for each query. But the overall connection time will be reduced.
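With async/await in current ADO.NET, that multiple-connections idea looks roughly like this (the table names are made up; each query gets its own pooled connection and both are in flight at the same time):

using System.Data.SqlClient;
using System.Threading.Tasks;

public static class ParallelQueries
{
    // Runs two independent queries concurrently, each on its own pooled connection.
    public static async Task<(int users, int orders)> LoadCountsAsync(string connectionString)
    {
        Task<int> users  = CountAsync(connectionString, "SELECT COUNT(*) FROM dbo.Users");
        Task<int> orders = CountAsync(connectionString, "SELECT COUNT(*) FROM dbo.Orders");
        return (await users, await orders);
    }

    private static async Task<int> CountAsync(string connectionString, string sql)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            await conn.OpenAsync();
            return (int)await cmd.ExecuteScalarAsync();
        }
    }
}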
There is also the potential to use MARS on a single connection, but I'm not a big proponent of that, and it's a lot more limited in functionality.
If you are fairly sure that the transactions will finish quickly, then use a single connection.
Be sure to check all return results and wrap everything in exception handling where possible.
To avoid unnecessary overhead it's better to use a single connection. But be sure to run the queries in a "try" block and close the connections in a "finally" block to be sure not to leave connections hanging.
try-finally
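A minimal sketch of that try/finally pattern (in current C# a using block achieves the same thing):

using System.Data.SqlClient;

public static class Db
{
    public static void RunQuery(string connectionString, string sql)
    {
        SqlConnection conn = new SqlConnection(connectionString);
        try
        {
            conn.Open();
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
        finally
        {
            conn.Dispose();   // always returns the connection to the pool, even on error
        }
    }
}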
Unit of Work? This is a great strategy to employ. NHibernate and many others use this pattern.
Give it a google for specific details relevant to your needs.
jim

How do you best offload a database insert, so a web response is returned quicker?

Setup
I have a web service that takes its inputs through a REST interface. The REST call does not return any meaningful data, so whatever is passed to the web service is just recorded in the database and that is it. It is an analytics service which my company is using internally to do some special processing on web requests that are received on their web page. So it is very important that the response takes as little time as possible to return.
I have pretty much optimized the code as much as possible to make the response fast. However, the database insert still keeps the request open for longer than I want before a response is sent back to the web client.
The code looks basically like this, by the way it is ASP.NET MVC, using Entity Framework, running on IIS 7, if that matters.
public ActionResult Add(/*..bunch of parameters..*/) {
    using (var db = new Entities()) {
        var log = new Log {
            // populate Log from parameters
        };
        db.AddToLogs(log);
        db.SaveChanges();
    }
    return File(pixelImage, "image/gif");
}
Question
Is there a way to off load the database insert in to another process, so the response to the client is returned almost instantly?
I was thinking about wrapping everything in the using block in another thread, to make the database insert asynchronous, but didn't know if that was the best way to free up the response back to the client.
What would you recommend if you were trying to accomplish this goal?
If the request has to be reliable then you need to write it into the database. E.g. if your return means 'I have paid the merchant' then you can't return before you actually commit in the database. If the processing is long then there are database-based asynchronous patterns, using a table as a queue or using built-in queuing like Asynchronous procedure execution. But these apply when heavy and lengthy processing is needed, not for a simple log insert.
When you want just to insert a log record (visitor/url tracking stuff) then the simplest solution is to use CLR's thread pools and just queue the work, something like:
...
var log = new Log { /* populate Log from parameters */ };
ThreadPool.QueueUserWorkItem(stateInfo =>
{
    var queuedLog = (Log)stateInfo;
    using (var db = new Entities())
    {
        db.AddToLogs(queuedLog);
        db.SaveChanges();
    }
}, log);
...
This is quick and easy and it frees the ASP handler thread to return the response as soon as possible. But it has some drawbacks:
If the incoming rate of requests exceeds the thread pool's processing rate, then the in-memory queue will grow until it triggers an app pool 'recycle', thus losing all items 'in progress' (as well as warm caches and other goodies).
The order of requests is not preserved (may or may not be important)
It consumes a CLR pool thread doing nothing but waiting for a response from the DB
The last concern can be addressed by using a true asynchronous database call, via SqlCommand.BeginExecuteXXX and setting AsynchronousProcessing to true on the connection. Unfortunately, AFAIK EF doesn't yet have true asynchronous execution, so you would have to resort to the SqlClient layer (SqlConnection, SqlCommand). But this solution does not address the first concern, when the rate of page hits is so high that this logging (= writes on every page hit) becomes a critical bottleneck.
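A rough sketch of that SqlClient-level approach (the table and column names here are made up; on pre-.NET 4.5 frameworks the connection string needs Asynchronous Processing=true):

using System;
using System.Data.SqlClient;

public static class AsyncLogger
{
    // Fire-and-forget insert that does not block a thread while the database works.
    public static void LogAsync(string connectionString, string url)
    {
        var conn = new SqlConnection(connectionString);
        var cmd = new SqlCommand(
            "INSERT INTO dbo.Logs(Url, LoggedAt) VALUES (@url, GETUTCDATE())", conn);
        cmd.Parameters.AddWithValue("@url", url);
        conn.Open();

        cmd.BeginExecuteNonQuery(ar =>
        {
            var command = (SqlCommand)ar.AsyncState;
            try
            {
                command.EndExecuteNonQuery(ar);   // completes the asynchronous insert
            }
            catch (Exception)
            {
                // log and swallow: fire-and-forget
            }
            finally
            {
                command.Connection.Dispose();     // return the connection to the pool
                command.Dispose();
            }
        }, cmd);
    }
}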
If the first concern is real, then no threading and/or producer/consumer wizardry can alleviate it. If you truly have an incoming rate vs. write rate scalability concern (the 'pending' queue grows in memory), you have to either make the writes faster in the DB layer (faster IO, special log flush IO) and/or aggregate the writes. Instead of logging every request, just increment in-memory counters and write them out periodically as aggregates.
I've been working on multi-tier solutions mostly for the last year or so that require this sort of functionality, and that's exactly how I've been doing it.
I have a singleton that takes care of running tasks in the background based on an ITask interface. Then I just register a new ITask with my singleton and pass control from my main thread back to the client.
Create a separate thread that monitors a global, in-memory queue. Have your request put its information on the queue and return; the thread then takes items off the queue and posts them to the DB.
Under heavy load, if the thread lags behind the requests, your queue will grow.
Also, if you lose the machine, you will lose any unprocessed queue entries.
Whether these limitations are acceptable is something you'd need to decide.
A more formal mechanism is using some actual middleware messaging system (JMS in Java land, dunno the equivalent in .NET, but there's certainly something).
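A minimal sketch of the queue-plus-dedicated-thread idea described above (the name LogQueue is made up; Entities and Log are the types from the question):

using System.Collections.Concurrent;
using System.Threading;

public static class LogQueue
{
    private static readonly BlockingCollection<Log> Pending = new BlockingCollection<Log>();

    static LogQueue()
    {
        // One dedicated background thread drains the queue and writes to the DB.
        var worker = new Thread(() =>
        {
            foreach (Log log in Pending.GetConsumingEnumerable())
            {
                using (var db = new Entities())
                {
                    db.AddToLogs(log);
                    db.SaveChanges();
                }
            }
        }) { IsBackground = true };
        worker.Start();
    }

    // Called from the controller: enqueue and return immediately.
    public static void Enqueue(Log log) => Pending.Add(log);
}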
It depends: When you return to the client do you need to be 100% sure that the data is stored in the database?
Take this scenario:
Request comes in
A thread is started to save to the database
Response is sent to the client
Server crashes
Data was not saved to the database
You also need to check how many milliseconds you save by starting a new thread instead of saving to the database.
The added complexity and maintenance cost is probably too high compared with the savings in response time. And the savings in response time are probably so low that they will not be noticed.
Before I spent a lot of time on the optimization I'd be sure of where the time is going. Connections like these have significant latency overhead (check this out). Just for grins, make your service a NOP and see how it performs.
It seems to me that the 'async-ness' needs to be on the client - it should fire off the call to your service and move on, especially since it doesn't care about the result?
I also suspect that if the NOP performance is good-to-tolerable on your LAN it will be a different story in the wild.

Why is it bad practice to make multiple database connections in one request?

A discussion about Singletons in PHP has me thinking about this issue more and more. Most people instruct that you shouldn't make a bunch of DB connections in one request, and I'm just curious as to what your reasoning is. My first thought is the expense to your script of making that many requests to the DB, but then I counter myself with the question: wouldn't multiple connections make concurrent querying more efficient?
How about some answers (with evidence, folks) from some people in the know?
Database connections are a limited resource. Some DBs have a very low connection limit, and wasting connections is a major problem. By consuming many connections, you may be blocking others from using the database.
Additionally, throwing a ton of extra connections at the DB doesn't help anything unless there are resources on the DB server sitting idle. If you've got 8 cores and only one is being used to satisfy a query, then sure, making another connection might help. More likely, though, you are already using all the available cores. You're also likely hitting the same hard drive for every DB request, and adding additional lock contention.
If your DB has anything resembling high utilization, adding extra connections won't help. That'd be like spawning extra threads in an application with the blind hope that the extra concurrency will make processing faster. It might in some certain circumstances, but in other cases it'll just slow you down as you thrash the hard drive, waste time task-switching, and introduce synchronization overhead.
It is the cost of setting up the connection, transferring the data and then tearing it down. It will eat up your performance.
Evidence is harder to come by but consider the following...
Let's say it takes x microseconds to make a connection.
Now you want to make several requests and get data back and forth. Let's say that the difference in transport time is negligible between one connection and many (just for the sake of argument).
Now let's say it takes y microseconds to close the connection.
Opening one connection will take x+y microseconds of overhead. Opening many will take n * (x+y). That will delay your execution.
Setting up a DB connection is usually quite heavy. A lot of things are going on backstage (DNS resolution/TCP connection/Handshake/Authentication/Actual Query).
I once had an issue with some weird DNS configuration that made every TCP connection take a few seconds to come up. My login procedure (because of a complex architecture) required 3 different DB connections to complete. With that issue, it took forever to log in. We then refactored the code to go through one connection only.
We access Informix from .NET and use multiple connections. Unless we're starting a transaction on each connection, it is often handled by the connection pool. I know that's very brand-specific, but most(?) database systems' client access will pool connections to the best of its ability.
As an aside, we did have a problem with connection count because of cross-database connections. Informix supports synonyms, so we synonymed the common offenders and the multiple connections were handled server-side, saving a lot in transfer time, connection creation overhead, and (the real crux of our situation) license fees.
I would assume that it is because your requests are not being sent asynchronously. Since your requests are done iteratively on the server, blocking each time, you have to pay the overhead of creating a connection each time, when you only have to do it once...
In Flex, all web service calls are automatically made asynchronously, so it is common to see multiple connections, or queued-up requests on the same connection.
Asynchronous requests mitigate the connection cost through faster request/response time... because you cannot easily achieve this in PHP without some threading, the performance hit is greater than simply reusing the same connection.
that's my 2 cents...
