A well-known issue when many clients query an SQLite database: "database is locked".
I would like to increase the delay (in ms) to wait for lock release on Linux, to get rid of this error.
From the sqlite3 command-line shell, I can use, for example (4 seconds):
sqlite> .timeout 4000
sqlite>
I've started many processes that run select/insert/delete statements, and if I don't set this value in the sqlite3 shell, I sometimes get:
sqlite> select * from items where code like '%30';
Error: database is locked
sqlite>
So what is the default value for .timeout?
In Perl 5.10 programs, I also sometimes get this error, even though the default value seems to be 30,000 (so 30 seconds; this isn't documented).
Did the programs actually wait for 30 seconds before raising this error? If so, that seems crazy: there must be at least a brief moment when the database is free, even with many other processes running against it.
my $dbh = DBI->connect($database,"","") or die "cannot connect $DBI::errstr";
my $to = $dbh->sqlite_busy_timeout(); # $to get the value 30000
Thanks!
The default busy timeout for DBD::SQLite is defined in dbdimp.h as 30000 milliseconds. You can change it with $dbh->sqlite_busy_timeout($ms);.
The sqlite3 command-line shell has the normal SQLite default of 0; that is to say, no timeout. If the database is locked, it errors out right away. You can change it with .timeout ms or PRAGMA busy_timeout = ms;.
The timeout works as follows:
The handler will sleep multiple times until at least "ms" milliseconds of sleeping have accumulated. After at least "ms" milliseconds of sleeping, the handler returns 0 which causes sqlite3_step() to return SQLITE_BUSY.
If you get a busy database error even with a 30 second timeout, you just got unlucky as to when attempts to acquire a lock were made on a heavily used database file (or something is running a really slow query). You might look into WAL mode if not already using it.
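For reference, a rough equivalent of those settings expressed as pragmas, as you might issue them from any connection (the 30000 ms value simply mirrors the DBD::SQLite default; WAL mode is optional, but it lets readers and a single writer run concurrently):
PRAGMA busy_timeout = 30000;   -- wait up to 30 s for a lock instead of failing immediately
PRAGMA journal_mode = WAL;     -- optional: readers no longer block the writer, and vice versa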
That's a title and a half, but it pretty much summarises my "problem".
I have an Azure Databricks workspace and an Azure Virtual Machine running SQL Server 2019 Developer. They're on the same VNET, and they can communicate nicely with each other. I can select rows very happily from the SQL Server, and some insert scenarios work really nicely too.
My scenario:
I have a spark table foo, containing any number of rows. Could be 1, could be 20m.
foo contains 19 fields.
The contents of foo need to be inserted into a table on the SQL Server, also called foo, in a database called bar, meaning my destination is bar.dbo.foo.
I've got the com.microsoft.sqlserver.jdbc.spark connector configured on the cluster, and I connect using an IP, port, username and password.
My notebook cell of relevance:
df = spark.table("foo")

try:
    url = "jdbc:sqlserver://ip:port"
    table_name = "bar.dbo.foo"
    username = "user"
    password = "password"

    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("truncate", True) \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .option("queryTimeout", 120) \
        .option("tableLock", True) \
        .option("numPartitions", 1) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
If I prepare foo to contain 10,000 rows, I can run this script time and time again, and it succeeds every time.
As the row counts drop, the executor occasionally tries to process 4,096 rows in a task. As soon as it tries to do 4,096 in a task, weird things happen.
For example, having created foo to contain 5,000 rows and executing the code, this is the task information:
Index Task Id Attempt Status Executor ID Host Duration Input Size/Records Errors
0 660 0 FAILED 0 10.139.64.6 40s 261.3 KiB / 4096 com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0 661 1 FAILED 3 10.139.64.8 40s 261.3 KiB / 4096 com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0 662 2 FAILED 3 10.139.64.8 40s 261.3 KiB / 4096 com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0 663 3 SUCCESS 1 10.139.64.5 0.4s 261.3 KiB / 5000
I don't fully understand why it fails after 40 seconds. Our timeouts are set to 600 seconds on the SQL box, and the query timeout in the script is 120 seconds.
Every time the executor does more than 4,096 rows, it succeeds. This is true regardless of the size of the dataset. Sometimes it tries to do 4,096 rows out of a 100k-row set, fails, then changes the number of records in the task to the full 100k and immediately succeeds.
When the set is smaller than 4,096, the execution will typically generate one message:
com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed
and then immediately works successfully, having moved on to the next executor.
On the SQL Server itself, I see ASYNC_NETWORK_IO as the wait, using Adam Machanic's sp_whoisactive. This wait persists for the full duration of the 40s attempt. It looks like at 40s the attempt is immediately abandoned and a new connection is created - consistent with the messages I see in the task information.
Additionally, when looking at the statements, I note that it's doing ROWS_PER_BATCH = 1000 regardless of the original number of rows. I can't see any way of changing that in the docs; I tried rowsPerBatch in the options for the df, but it didn't appear to make a difference - it still shows the 1000 value.
I've been running this with lots of different row counts in foo, and when the total number of rows is greater than 4,096, my testing suggests that the Spark executor succeeds whenever it tries a number of records that exceeds 4,096. If I remove numPartitions, there are more attempts of 4,096 records, and so I see more failures.
Weirdly, if I cancel a query that appears to have been running for longer than 10s and immediately retry it, it seems to succeed every time as long as the number of rows in foo is != 4,096. My sample is obviously pretty small - tens of attempts.
Is there a limitation I'm not familiar with here? What's the magic of 4,096?
In discussing this with my friend, we're wondering whether some form of implicit type conversion is happening in the arrays when they hold fewer than 4,096 records, which somehow causes delays.
I'm at quite a loss on this one, and wondering whether I just need to check the length of the DF before attempting the transfer - using an iterative cursor in pyodbc for fewer rows and sticking to the JDBC connector for larger numbers of rows. It seems like that shouldn't be needed!
Many thanks,
Johan
In SQL Server Extended Events, for the sp/rpc/stmt completed events, it would be great if we could include wait types like IO/network waits etc. Just as we can see reads/CPU/duration, if we could also get the other resource waits, we would have a good idea of why SQL is slow in scenarios where the duration/CPU is high and reads are low.
You can actually track the wait statistics of a query but the other way around - by tracking the wait statistics themselves. Take a look at the following snippet and the result image:
CREATE EVENT SESSION [Wait Statistics] ON SERVER
ADD EVENT sqlos.wait_info
(
ACTION(sqlserver.database_name,sqlserver.session_id,sqlserver.sql_text)
WHERE (
opcode = N'End'
AND duration > 0
)
),
ADD EVENT sqlos.wait_info_external
(
ACTION(sqlserver.database_name,sqlserver.session_id,sqlserver.sql_text)
WHERE (
opcode = N'End'
AND duration > 0
)
)
Here we capture the end of every wait that has a duration greater than 0 (we capture the end because at that point SQL Server knows the duration of the wait, so we can include it in the result). In the ACTION part we retrieve the database name, the text of the query that caused the wait, and the session id of that query.
Beware, though. Tracking wait statistics through Extended Events (as opposed to sys.dm_os_wait_stats, which collects aggregated data) can generate a ton of data overwhelmingly fast. Should you choose this method, define very carefully which wait types you want to track and from what duration onward a wait actually causes you a problem.
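To actually collect and view these events you also need a target, and you need to start the session. A minimal sketch, assuming the session above was created as-is (the ring_buffer target is just one choice; an event_file target is usually better for anything long-running):
ALTER EVENT SESSION [Wait Statistics] ON SERVER
    ADD TARGET package0.ring_buffer;
ALTER EVENT SESSION [Wait Statistics] ON SERVER
    STATE = START;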
In my program, I've got several threads in a pool that each try to write to the DB. The number of threads created is dynamic. When only one thread is created, all works fine. However, when multiple threads are executing, I get the error:
org.apache.ddlutils.DatabaseOperationException: org.postgresql.util.PSQLException: Cannot commit when autoCommit is enabled.
I'm guessing that, since the threads execute in parallel, two threads are trying to write at the same time, and that this causes the error.
Do you think this is the case? If not, what could be causing this error?
Otherwise, if what I described is the problem, what can I do to fix it?
In your JDBC code, you should turn off autocommit as soon as you fetch the connection. Something like this:
DataSource datasource = getDatasource(); // fetch your datasource somehow
Connection c = null;
try {
    c = datasource.getConnection();
    c.setAutoCommit(false);   // disable autocommit before doing any writes
    // ... perform your inserts/updates here, then:
    c.commit();
} finally {
    if (c != null) c.close();
}
My profiler trace shows that exec sp_reset_connection is being called between every sql batch or procedure call. There are reasons for it, but can I prevent it from being called, if I'm confident that it's unnecessary, to improve performance?
UPDATE:
The reason I imagine this could improve performance is twofold:
SQL Server doesn't need to reset the connection state. I think this would be a relatively negligible improvement.
Reduced network latency, because the client doesn't need to send an exec sp_reset_connection, wait for the response, and then send whatever SQL it really wants to execute.
The second benefit is the one I'm interested in, because in my architecture the clients are sometimes some distance from the database. If every sql batch or rpc requires a double round-trip this doubles the impact of any network latency. Eliminating such double calls could potentially improve performance.
Yes, there are lots of other things I could do to improve performance, like re-architecting the app, and I'm a big fan of solving the root cause of problems, but in this case I just want to know whether it's possible to prevent sp_reset_connection from being called. Then I can test whether there is any performance improvement and properly assess the risks of not calling it.
This prompts another question: does the network communication with sp_reset_connection really occur like I outlined above? That is, does the client send exec sp_reset_connection, wait for a response, and then send the real SQL? Or does it all go in one chunk?
If you're using .NET to connect to SQL Server, the ability to disable the extra reset call was itself removed as of .NET 3.5 -- see here. (The property remains, but it does nothing.)
I guess Microsoft realized (as someone found experimentally here) that opening the door to skipping the reset was far more dangerous than the (likely) small performance gain was worth. Can't say I blame them.
Does the client send exec sp_reset_connection, wait for a response, then send the real sql?
EDIT: I was wrong -- see here -- the answer is no.
Summary: there is a special bit set in a TDS message that specifies that the connection should be reset, and SQL Server executes sp_reset_connection automatically. It appears as a separate batch in Profiler and would always be executed before the actual query you wanted to execute, so my test was invalid.
Yes, it's sent in a separate batch.
I put together a little C# test program to demonstrate this because I was curious:
using System.Data.SqlClient;

(...)

private void Form1_Load(object sender, EventArgs e)
{
    SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder();
    csb.DataSource = @"MyInstanceName";
    csb.IntegratedSecurity = true;
    csb.InitialCatalog = "master";
    csb.ApplicationName = "blarg";

    for (int i = 0; i < 2; i++)
        _RunQuery(csb);
}

private void _RunQuery(SqlConnectionStringBuilder csb)
{
    using (SqlConnection conn = new SqlConnection(csb.ToString()))
    {
        conn.Open();
        SqlCommand cmd = new SqlCommand("WAITFOR DELAY '00:00:05'", conn);
        cmd.ExecuteNonQuery();
    }
}
Start Profiler and attach it to your instance of choice, filtering on the dummy application name I provided. Then, put a breakpoint on the cmd.ExecuteNonQuery(); line and run the program.
The first time you step over, just the query runs, and all you get is the SQL:BatchCompleted event after the 5 second wait. When the breakpoint hits the second time, all you see in profiler is still just the one event. When you step over again, you immediately see the exec sp_reset_connection event, and then the SQL:BatchCompleted event shows up after the delay.
The only way to get rid of the exec sp_reset_connection call (which may or may not be a legitimate performance problem for you) would be to turn off .NET's connection pooling. And if you're planning to do that, you'd likely want to build your own connection pooling mechanism, because just turning it off and doing nothing else will probably hurt more overall than taking the hit of the extra roundtrip, and you will have to deal with the correctness issues manually.
This Q/A could be helpful:
What does "exec sp_reset_connection" mean in Sql Server Profiler?
However, I did a quick test using Entity Framework and MS-SQL 2008 R2. It shows that "exec sp_reset_connection" isn't time consuming after the first call:
for (int i = 0; i < n; i++)
{
    using (ObjectContext context = new myEF())
    {
        DateTime timeStartOpenConnection = DateTime.Now;
        context.Connection.Open();
        Console.WriteLine();
        Console.WriteLine("Opening connection time waste: {0} ticks.", (DateTime.Now - timeStartOpenConnection).Ticks);

        ObjectSet<myEntity> query = context.CreateObjectSet<myEntity>();
        DateTime timeStart = DateTime.Now;
        myEntity e = query.OrderByDescending(x => x.EventDate).Skip(i).Take(1).SingleOrDefault<myEntity>();
        Console.Write("{0}. Created By {1} on {2}... ", e.ID, e.CreatedBy, e.EventDate);
        Console.WriteLine("({0} ticks).", (DateTime.Now - timeStart).Ticks);

        DateTime timeStartCloseConnection = DateTime.Now;
        context.Connection.Close();
        context.Connection.Dispose();
        Console.WriteLine("Closing connection time waste: {0} ticks.", (DateTime.Now - timeStartCloseConnection).Ticks);

        Console.WriteLine();
    }
}
And output was this:
Opening connection time waste: 5390101 ticks.
585. Created By sa on 12/20/2011 2:18:23 PM... (2560183 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
584. Created By sa on 12/20/2011 2:18:20 PM... (1730173 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
583. Created By sa on 12/20/2011 2:18:17 PM... (710071 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
582. Created By sa on 12/20/2011 2:18:14 PM... (720072 ticks).
Closing connection time waste: 0 ticks.
Opening connection time waste: 0 ticks.
581. Created By sa on 12/20/2011 2:18:09 PM... (740074 ticks).
Closing connection time waste: 0 ticks.
So, the final conclusion is: Don't worry about "exec sp_reset_connection"! It wastes nothing.
Personally, I'd leave it.
Given what it does, I want to make sure I have no temp tables in scope or transactions left open.
To be fair, you will gain a bigger performance boost by not running Profiler against your production database. And do you have any numbers, articles, or recommendations about what you can gain from this, please?
Just keep the connection open instead of returning it to the pool, and execute all commands on that one connection.
Is there a concise list of SQL Server stored procedure errors that make sense to automatically retry? Obviously, retrying a "login failed" error doesn't make sense, but retrying "timeout" does. I'm thinking it might be easier to specify which errors to retry than to specify which errors not to retry.
So, besides "timeout" errors, what other errors would be good candidates for automatic retrying?
Thanks!
You should retry (re-run) the entire transaction, not just a single query/SP.
As for the errors to retry, I've been using the following list:
DeadlockVictim = 1205,
SnapshotUpdateConflict = 3960,
// I haven't encountered the following 4 errors in practice
// so I've removed these from my own code:
LockRequestTimeout = 1222,
OutOfMemory = 701,
OutOfLocks = 1204,
TimeoutWaitingForMemoryResource = 8645,
The most important one is of course the "deadlock victim" error 1205.
I would extend that list. If you want an absolutely complete list, use the query below and filter the result.
select * from master.dbo.sysmessages where description like '%memory%'
int[] errorNums = new int[]
{
701, // Out of Memory
1204, // Lock Issue
1205, // Deadlock Victim
1222, // Lock request time out period exceeded.
7214, // Remote procedure time out of %d seconds exceeded. Remote procedure '%.*ls' is canceled.
7604, // Full-text operation failed due to a time out.
7618, // %d is not a valid value for a full-text connection time out.
8628, // A time out occurred while waiting to optimize the query. Rerun the query.
8645, // A time out occurred while waiting for memory resources to execute the query. Rerun the query.
8651, // Low memory condition
};
You can use a SQL query to look for errors explicitly requesting a retry (trying to exclude those that require another action too).
SELECT error, description
FROM master.dbo.sysmessages
WHERE msglangid = 1033
AND (description LIKE '%try%later.' OR description LIKE '%. rerun the%')
AND description NOT LIKE '%resolve%'
AND description NOT LIKE '%and try%'
AND description NOT LIKE '%and retry%'
Here's the list of error codes:
539,
617,
952,
956,
983,
1205,
1807,
3055,
5034,
5059,
5061,
5065,
8628,
8645,
8675,
10922,
14258,
20689,
25003,
27118,
30024,
30026,
30085,
33115,
33116,
40602,
40642,
40648
You can tweak the query to look for other conditions like timeouts or memory problems, but I'd recommend configuring your timeout length correctly up front, and then backing off slightly in these scenarios.
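For example, a variant of the same query that surfaces timeout- and memory-related messages (you would still want to review the results by hand before treating any of them as safely retryable):
SELECT error, description
FROM master.dbo.sysmessages
WHERE msglangid = 1033
  AND (description LIKE '%time%out%' OR description LIKE '%memory%')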
I'm not sure about a full listing of these errors, but I can warn you to be VERY careful about retrying queries. Often there's a larger problem afoot when you get errors from SQL, and simply re-running queries will only compound the issue. For instance, with the timeout error, you typically have either a network bottleneck, poorly indexed tables, or something along those lines, and re-running the same query will add to the latency of other queries that are already obviously struggling to execute.
The one SQL Server error that you should always catch on inserts and updates (and it is quite often missed) is the deadlock error, no. 1205.
Appropriate action is to retry the INSERT/UPDATE a small number of times.
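A rough sketch of that advice in T-SQL (SQL Server 2012+ for THROW; the statement inside the transaction is just a placeholder, and a client-side loop around the whole transaction works the same way):
DECLARE @retry int = 3;
WHILE @retry > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        -- the INSERT/UPDATE you want to protect goes here, e.g.:
        -- UPDATE dbo.SomeTable SET SomeCol += 1 WHERE Id = 42;
        COMMIT TRANSACTION;
        BREAK;  -- success, leave the retry loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
        SET @retry -= 1;
        IF ERROR_NUMBER() <> 1205 OR @retry = 0
            THROW;  -- not a deadlock, or out of retries: re-raise
    END CATCH;
END;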