I'm hoping someone can confirm what is actually happening here with TPL and SQL connections.
Basically, I have a large application which, in essence, reads a table from SQL Server and then processes each row - serially. The processing of each row can take quite some time. So, I thought I'd change this to use the Task Parallel Library, with a "Parallel.ForEach" across the rows in the DataTable. This seems to work for a little while (minutes), then it all goes pear-shaped with...
"The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
Now, I surmised the following (which may of course be entirely wrong).
The "ForEach" creates tasks for each row, up to some limit based on the number of cores (or whatever). Lets say 4 for want of a better idea. Each of the four tasks gets a row, and goes off to process it. TPL waits until the machine is not too busy, and fires up some more. I'm expecting a max of four.
But that's not what I observe - and not what I think is happening.
So... I wrote a quick test (see below):
Sub Main()
    Dim tbl As New DataTable()
    FillTable(tbl)
    Parallel.ForEach(tbl.AsEnumerable(), AddressOf ProcessRow)
End Sub
Private n As Integer = 0
Sub ProcessRow(row As DataRow, state As ParallelLoopState)
    n += 1 ' I know... not thread safe
    Console.WriteLine("Starting thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    Using cnx As SqlConnection = New SqlConnection(My.Settings.ConnectionString)
        cnx.Open()
        Thread.Sleep(TimeSpan.FromMinutes(5))
        cnx.Close()
    End Using
    Console.WriteLine("Closing thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    n -= 1
End Sub
This creates way more than my guess at the number of tasks. So, I surmise that TPL fires up tasks to the limit it thinks will keep my machine busy, but hey, what's this, we're not very busy here, so let's start some more. Still not very busy, so... etc. (it seems like roughly one new task a second).
This is reasonable-ish, but I expect it to go pop 30 seconds (the SQL connection timeout) after it reaches 100 open SQL connections - the default connection pool size - which it doesn't.
So, to scale it back a bit, I change my connection string to limit the max pool size.
Sub Main()
    Dim tbl As New DataTable()
    Dim csb As New SqlConnectionStringBuilder(My.Settings.ConnectionString)
    csb.MaxPoolSize = 10
    csb.ApplicationName = "Test 1"
    My.Settings("ConnectionString") = csb.ToString()
    FillTable(tbl)
    Parallel.ForEach(tbl.AsEnumerable(), AddressOf ProcessRow)
End Sub
I count the real number of connections to the SQL server, and as expected, it's 10. But my application has fired up 26 tasks - and then hangs. So, setting the max pool size for SQL somehow limited the number of tasks to 26, but why not 27, and especially, why doesn't it fall over at task 11 because the pool is full?
Obviously, somewhere along the line I'm asking for more work than my machine can do, and I can add "MaxDegreeOfParallelism" to the ForEach, but I'm interested in what's actually going on here.
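For reference, capping it would look something like this (just a sketch; the limit of 4 is only an example value):
Parallel.ForEach(tbl.AsEnumerable(),
                 New ParallelOptions With {.MaxDegreeOfParallelism = 4},
                 AddressOf ProcessRow)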
PS.
Actually, after sitting with 26 tasks for (I'm guessing) 5 minutes, it does fall over with the original (max pool size reached) error. Huh?
Thanks.
Edit 1:
Actually, what I now think happens in the tasks (my "ProcessRow" method) is that after 10 successful connections/tasks, the 11th does block for the connection timeout, and then does get the original exception - as do any subsequent tasks.
So... I conclude that the TPL is creating tasks at about one a second, and it gets enough time to create about 26/27 before task 11 throws an exception. All subsequent tasks then also throw exceptions (about a second apart) and the TPL stops creating new tasks (because it gets unhandled exceptions in one or more tasks?).
For some reason (as yet undetermined), the ForEach then hangs for a while. If I modify my ProcessRow method to use the state to say "stop", it appears to have no effect.
Sub ProcessRow(row As DataRow, state As ParallelLoopState)
    n += 1
    Console.WriteLine("Starting thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    Try
        Using cnx As SqlConnection = fnNewConnection()
            Thread.Sleep(TimeSpan.FromMinutes(5))
        End Using
    Catch ex As Exception
        Console.WriteLine("Exception on thread {0}", Thread.CurrentThread.ManagedThreadId)
        state.Stop()
        Throw
    End Try
    Console.WriteLine("Closing thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    n -= 1
End Sub
Edit 2:
Dur... The reason for the long delay is that, while tasks 11 onwards all crash and burn, tasks 1 to 10 don't, and all sit there sleeping for 5 minutes. The TPL has stopped creating new tasks (because of the unhandled exception in one or more of the tasks it has created), and then waits for the un-crashed tasks to complete.
The edits to the original question add more detail and, eventually, the answer becomes apparent.
TPL creates tasks repeatedly because the tasks it has created are (basically) idle. This is fine until the connection pool is exhausted, at which point the tasks which want a new connection wait for one to become available, and timeout. In the meantime, the TPL is still creating more tasks, all doomed to fail. After the connection timeout, the tasks start failing, and the ensuing exception(s) cause the TPL to stop creating new tasks. The TPL then waits for the tasks that did get connections to complete, before an AggregateException is thrown.
The TPL is not made for IO-bound work. It uses heuristics to steer the number of active threads. These heuristics fail for long-running and/or IO-bound tasks, causing it to inject more and more threads without any practical limit.
Use PLINQ to set a fixed number of threads using WithDegreeOfParallelism. You should probably test different values. I have written much more about this topic on SO, but I can't find it at the moment. It could look like this:
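(A minimal sketch, reusing the DataTable from the question; ProcessSingleRow stands in for a hypothetical variant of ProcessRow without the ParallelLoopState parameter, which PLINQ's ForAll does not supply.)
Dim degree As Integer = 8   ' experiment with this value
tbl.AsEnumerable() _
    .AsParallel() _
    .WithDegreeOfParallelism(degree) _
    .ForAll(Sub(row) ProcessSingleRow(row))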
I have no idea why you are seeing exactly 26 threads in your example. Note that when the pool is depleted, a request to take a connection only fails after a timeout. This entire system is very non-deterministic and I'd consider any number of threads plausible.
Related
I want to buffer a datastream in Flink. My initial idea is to cache 100 pieces of data in a list or tuple and then use insert into values (???) to insert the data into ClickHouse in bulk. Do you have better ways to do this?
The first solution that you post works, but it is flaky. It can lead to starvation due to its simplistic logic. For instance, let's say that you have a counter of 100 to create a batch. It is possible that your stream never receives 100 events, or that it takes hours to receive the 100th event. Then your basic and working solution can have events stuck in the window batch because it is a count window. In other words, your batch can generate windows of 30 seconds under high throughput, or windows of 1 hour when your throughput is very low.
DataStream<User> stream = ...;
DataStream<Tuple2<User, Long>> stream1 = stream
    .countWindowAll(100)
    .process(new MyProcessWindowFunction());
In general, it depends on your use case. However, I would use a time window to make sure that my job always flushes the batch, even when there are few or no events in the window.
DataStream<Tuple2<User, Long>> stream1 = stream
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .process(new MyProcessWindowFunction());
Thanks for all the answers. I used a window function to solve this problem.
SingleOutputStreamOperator<ArrayList<User>> stream2 =
    stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());
Then I override the process function, in which a batch of data is buffered in an ArrayList.
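Roughly like this (a sketch; the User element type and the class name are assumptions based on the snippets above):
import java.util.ArrayList;

import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, ArrayList<User>, GlobalWindow> {

    @Override
    public void process(Context context, Iterable<User> elements, Collector<ArrayList<User>> out) {
        // Collect one count window's worth of elements into a single list...
        ArrayList<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        // ...and emit it as one record, so the sink can perform one bulk insert per window.
        out.collect(batch);
    }
}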
If you want to import data into the database in batches, you can use a window (countWindow or timeWindow) to aggregate the data.
I'm trying to execute a batch test method on 100 records and I get a CPU Runtime Limit error.
I placed the Limits.getCpuTime() method in my code and noticed that my code without the workflow segment takes 3148 ms to complete. However, when I activate two workflows that send emails to one user each, I get the CPU runtime limit error. In total, my process without those two workflows takes around 10 seconds to complete, while with them activated it takes around 20 seconds.
@IsTest
static void returnIncClientAddress(){
    //Select Required Records
    User incidentClient = [SELECT Id FROM User WHERE Username = 'bbaggins@shire.qa.com' LIMIT 1];
    BMCServiceDesk__Category__c category = [SELECT Id FROM BMCServiceDesk__Category__c WHERE Name = 'TestCategory'];
    BMCServiceDesk__BMC_BaseElement__c service = [SELECT Id FROM BMCServiceDesk__BMC_BaseElement__c WHERE Name = 'TestService'];
    BMCServiceDesk__BMC_BaseElement__c serviceOffering = [SELECT Id FROM BMCServiceDesk__BMC_BaseElement__c WHERE Name = 'TestServiceOffering'];
    //Create Incidents
    List<BMCServiceDesk__Incident__c> incidents = new List<BMCServiceDesk__Incident__c>();
    for(Integer i = 0; i < 100; i++){
        BMCServiceDesk__Incident__c incident = new BMCServiceDesk__Incident__c(
            BMCServiceDesk__FKClient__c = incidentClient.Id,
            BMCServiceDesk__FKCategory__c = category.Id,
            BMCServiceDesk__FKServiceOffering__c = serviceOffering.Id,
            BMCServiceDesk__FKBusinessService__c = service.Id,
            BMCServiceDesk__FKStatus__c = awaiting_for_handling
        );
        incidents.add(incident);
    }
    Test.startTest();
    insert incidents;
    Test.stopTest();
}
I expected the email workflows and alerts to be processed in batch and sent without being so expensive in CPU time, but it seems that Salesforce takes a lot of time both checking the workflow rules and executing them when needed. The majority of the process's time seems to be spent on sending the workflows' emails (which it doesn't actually do, because it's a test method).
There's not much you can do to control the execution time of Workflow Rules. You could try converting them into Apex and benchmarking to see whether that results in improvement in time consumed, but I suspect the real solution is that you're going to have to dial down your bulk test.
The CPU limit for a transaction is 10 seconds. If your unit test code is already taking approximately 10 seconds to complete without Workflows (I'm not sure exactly what bounds your 3148 ms and 10 s refer to), you've really got only two choices:
Make the sum total of automation running on insert of this object faster;
Reduce the quantity of data you're processing in this unit test.
It's not clear what you're actually testing here, but if it's an Apex trigger, you should make sure that it's properly bulkified and does not consume unnecessary CPU time, including through trigger recursion. Reviewing the call stack in your logs (or simply adding System.debug() statements) may help with that.
Lastly - make sure you write assertions in your test method. Test methods without assertions are close to worthless.
Are there triggers on BMCServiceDesk__Incident__c or on objects modified by the Workflow? Triggers on updates could possibly cause the code to execute multiple times in the same execution context, causing you to hit the CPU limit. Consider preventing re-entry into triggers, or performing a check to only run triggers if specific criteria are met.
Otherwise, consider refactoring the code, if possible, to have work executed within the same loop, as loops (especially nested loops) drive up your CPU usage. Usually workflows on their own don't drive up the CPU limit unless triggers are executed due to workflow updates.
When I add a task to the task queue, sometimes the task goes missing. I don't get any errors, but I just don't find the tasks in my logs. Suppose I add n tasks. The computation cannot go forward without these n tasks finishing. However, I find that one or more of these n tasks just went missing after they were added, and my whole algorithm stops in the middle.
What could be the reason?
I keep a variable w to check the number of times the task was added. I observe w = n even though some tasks were not created.
def addtask_whx(index, user, seqlen, vp_compress, iseq_compress):
    global w
    while True:
        timeout_ms = 100
        taskq_name = 'whx' + '--' + str(index[0]) + '-' + str(index[1]) + '-' + str(index[2]) + '-' + str(index[3]) + '-' + str(index[5]) + '--' + user
        try:
            taskqueue.add(name=taskq_name + str(timeout_ms),
                          queue_name='whx',
                          url='/whx',
                          params={'m': index[0], 'n': index[1], 'o': index[2], 'p': index[3],
                                  'q': 0, 'r': index[5], 'user': user, 'seqlen': seqlen,
                                  'vp': vp_compress, 'iseq': iseq_compress})
            w = w + 1
            break
        except DeadlineExceededError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Timeout Retrying")
        except TransientError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Transient Error Retrying")
        except TombstonedTaskError:
            logging.error("WHX Task Queue Tombstoned Error")
            break
Disclaimer: this is not the answer you're looking for, but I hope it will help you nonetheless.
The computation cannot go forward
without these n tasks finishing
It sounds like you are using the task queue for something it was not designed to do. You should read: http://code.google.com/appengine/docs/java/taskqueue/overview.html#Queue_Concepts
Tasks are not guaranteed to be executed in the order they arrive, and they are not guaranteed to be executed exactly once. In some cases, a single task may be executed more than once or not at all. Further, a task can be cancelled and re-queued at the discretion of App Engine based on available resources. For example, your timeout_ms = 100 is very low; if a new JVM has to be started, which could take several seconds, tasks n+1 and n+2 may be executed before task n.
In short, the task queue is not a reliable mechanism for performing strictly sequential computation. You've been warned.
-tjw
My profiler trace shows that exec sp_reset_connection is being called between every sql batch or procedure call. There are reasons for it, but can I prevent it from being called, if I'm confident that it's unnecessary, to improve performance?
UPDATE:
The reason I imagine this could improve performance is twofold:
SQL Server doesn't need to reset the connection state. I think this would be a relatively negligible improvement.
Reduced network latency, because the client doesn't need to send down an exec sp_reset_connection, wait for a response, then send whatever SQL it really wants to execute.
The second benefit is the one I'm interested in, because in my architecture the clients are sometimes some distance from the database. If every SQL batch or RPC requires a double round-trip, this doubles the impact of any network latency. Eliminating such double calls could potentially improve performance.
Yes, there are lots of other things I could do to improve performance, like re-architect the app, and I'm a big fan of solving the root cause of problems, but in this case I just want to know if it's possible to prevent sp_reset_connection from being called. Then I can test whether there is any performance improvement and properly assess the risks of not calling it.
This prompts another question: does the network communication with sp_reset_connection really occur like I outlined above? i.e. does the client send exec sp_reset_connection, wait for a response, then send the real SQL? Or does it all go in one chunk?
If you're using .NET to connect to SQL Server, the ability to disable the extra reset call was removed as of .NET 3.5 -- see here. (The property remains, but it does nothing.)
I guess Microsoft realized (as someone did experimentally here) that opening the door to skipping the reset was far more dangerous than the (likely) small performance gain was worth. Can't say I blame them.
Does the client send exec sp_reset_connection, wait for a response, then send the real sql?
EDIT: I was wrong -- see here -- the answer is no.
Summary: there is a special bit set in a TDS message that specifies that the connection should be reset, and SQL Server executes sp_reset_connection automatically. It appears as a separate batch in Profiler and would always be executed before the actual query you wanted to execute, so my test was invalid.
Yes, it's sent in a separate batch.
I put together a little C# test program to demonstrate this because I was curious:
using System.Data.SqlClient;
(...)
private void Form1_Load(object sender, EventArgs e)
{
    SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder();
    csb.DataSource = @"MyInstanceName";
    csb.IntegratedSecurity = true;
    csb.InitialCatalog = "master";
    csb.ApplicationName = "blarg";

    for (int i = 0; i < 2; i++)
        _RunQuery(csb);
}

private void _RunQuery(SqlConnectionStringBuilder csb)
{
    using (SqlConnection conn = new SqlConnection(csb.ToString()))
    {
        conn.Open();
        SqlCommand cmd = new SqlCommand("WAITFOR DELAY '00:00:05'", conn);
        cmd.ExecuteNonQuery();
    }
}
Start Profiler and attach it to your instance of choice, filtering on the dummy application name I provided. Then, put a breakpoint on the cmd.ExecuteNonQuery(); line and run the program.
The first time you step over, just the query runs, and all you get is the SQL:BatchCompleted event after the 5 second wait. When the breakpoint hits the second time, all you see in profiler is still just the one event. When you step over again, you immediately see the exec sp_reset_connection event, and then the SQL:BatchCompleted event shows up after the delay.
The only way to get rid of the exec sp_reset_connection call (which may or may not be a legitimate performance problem for you) would be to turn off .NET's connection pooling. And if you're planning to do that, you'd likely want to build your own connection pooling mechanism, because just turning it off and doing nothing else will probably hurt more overall than taking the hit of the extra roundtrip, and you will have to deal with the correctness issues manually.
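For what it's worth, turning pooling off is just a connection string setting; a minimal sketch (the instance name and query are placeholders):
using System.Data.SqlClient;

// With Pooling = false, every SqlConnection.Open() creates a brand-new physical
// connection, so there is no pooled connection to reset and no exec
// sp_reset_connection -- but you pay the full connection cost on every Open().
var csb = new SqlConnectionStringBuilder();
csb.DataSource = @"MyInstanceName";
csb.IntegratedSecurity = true;
csb.Pooling = false;

using (var conn = new SqlConnection(csb.ToString()))
{
    conn.Open();
    using (var cmd = new SqlCommand("SELECT 1", conn))
    {
        cmd.ExecuteScalar();
    }
}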
This Q/A could be helpful:
What does "exec sp_reset_connection" mean in Sql Server Profiler?
However, I did a quick test using Entity Framework and MS-SQL 2008 R2. It shows that "exec sp_reset_connection" isn't time consuming after the first call:
for (int i = 0; i < n; i++)
{
    using (ObjectContext context = new myEF())
    {
        DateTime timeStartOpenConnection = DateTime.Now;
        context.Connection.Open();
        Console.WriteLine();
        Console.WriteLine("Opening connection time waste: {0} ticks.", (DateTime.Now - timeStartOpenConnection).Ticks);

        ObjectSet<myEntity> query = context.CreateObjectSet<myEntity>();
        DateTime timeStart = DateTime.Now;
        myEntity e = query.OrderByDescending(x => x.EventDate).Skip(i).Take(1).SingleOrDefault<myEntity>();
        Console.Write("{0}. Created By {1} on {2}... ", e.ID, e.CreatedBy, e.EventDate);
        Console.WriteLine("({0} ticks).", (DateTime.Now - timeStart).Ticks);

        DateTime timeStartCloseConnection = DateTime.Now;
        context.Connection.Close();
        context.Connection.Dispose();
        Console.WriteLine("Closing connection time waste: {0} ticks.", (DateTime.Now - timeStartCloseConnection).Ticks);
        Console.WriteLine();
    }
}
And output was this:
Opening connection time waste: 5390101 ticks.
585. Created By sa on 12/20/2011 2:18:23 PM... (2560183 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
584. Created By sa on 12/20/2011 2:18:20 PM... (1730173 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
583. Created By sa on 12/20/2011 2:18:17 PM... (710071 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
582. Created By sa on 12/20/2011 2:18:14 PM... (720072 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
581. Created By sa on 12/20/2011 2:18:09 PM... (740074 ticks).
Closing connection time waste: 0 ticks.
So, the final conclusion is: Don't worry about "exec sp_reset_connection"! It wastes nothing.
Personally, I'd leave it.
Given what it does, I want to make sure I have no temp tables in scope or transactions left open.
To be fair, you will gain a bigger performance boost by not running Profiler against your production database. And do you have any numbers, articles, or recommendations about what you can gain from this, please?
Just keep the connection open instead of returning it to the pool, and execute all commands on that one connection.
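A rough sketch of that idea (the connection string and queries are placeholders):
using System.Data.SqlClient;

// Open the connection once and reuse it; because it is never returned to the
// pool between commands, no sp_reset_connection is issued between them.
var conn = new SqlConnection(connectionString);   // connectionString assumed defined elsewhere
conn.Open();

using (var cmd = new SqlCommand("SELECT COUNT(*) FROM dbo.SomeTable", conn))
{
    cmd.ExecuteScalar();
}
using (var cmd = new SqlCommand("UPDATE dbo.SomeTable SET Touched = 1", conn))
{
    cmd.ExecuteNonQuery();
}

// Only dispose the connection when the application is completely done with it.
conn.Dispose();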
In the video/PDF from "Data pipelines with Google App Engine" Brett puts "now / 30" into the task name noting that he will explain the reason later, but somehow he never does. :)
http://www.youtube.com/watch?v=zSDC_TU7rtc#t=41m35
task_name = '%s-%d-%d' % (sum_name, int(now / 30), index)
Do you have any idea about the reason? Does it have anything to do with the 7 day period in which one can't re-use task names?
Link to the session page
Brett Slatkin's own explanation
[Brett]
Hey all,
The int(time.time()/30) part of the task name is to prevent queue stalls. When memcache gets evicted the work index counter will be reset to zero. That means new fork-join work items may insert tasks that are named the same as tasks that were already inserted. By including a time window of ~30 seconds in the task name, we ensure that this problem can only last for about thirty seconds. This is also why you should raise an exception when you see a TombstonedTaskError exception.
Worst-case scenario if the clocks are wonky is that two tasks are run to do the fan-in work instead of just one, which is an acceptable trade-off in many cases and a fundamental possibility when using the task queue API. This can be mitigated using pigeon-hole acknowledgment entities, like I use in my materialized view example.
Hope that helps,
[/Brett]
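For reference, the naming scheme Brett describes might look something like this (a sketch only; the helper name, queue URL and params are assumptions, not from the talk):
import time
from google.appengine.api import taskqueue

def insert_fan_in_task(sum_name, index, params):
    # int(now / 30) puts a ~30 second window into the name, so a reset work
    # index counter can only collide with names from the current window.
    task_name = '%s-%d-%d' % (sum_name, int(time.time() / 30), index)
    try:
        taskqueue.add(name=task_name, url='/fan-in', params=params)
    except taskqueue.TombstonedTaskError:
        # The name was already used (and tombstoned): this window's fan-in work
        # has been claimed elsewhere, so re-raise rather than continue silently.
        raise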