How to execute huge SQL queries rapidly with EF Core? - sql-server

Here are two simplified examples of SQL queries I recently ran on my SQL database:
insert into new_entity
select top 10000 field1,
field2
from old_entity
&
update top (1000) e
set field1 = 13
from entity e
where field1 is null
Their execution was so fast it was barely noticeable.
However, if I want to perform the same operation using EF, the only way I know is to iterate over each object:
using (var db = new myDbContext())
{
    var new_objs = db.old_entity.Take(10000).Select(ot => new new_entity() { ... });
    db.new_entity.AddRange(new_objs);
    db.SaveChanges();
}
&
using (var db = new myDbContext())
{
    var objs = db.entity.Where(e => e.field1 == null).Take(1000).ToList();
    objs.ForEach(e => e.field1 = 13);
    db.SaveChanges();
}
This took hours, which is unacceptable.
I can execute a raw SQL query from within the app, but for real-life complex objects, it's a drudgery.
Is there a way to code such operations using the EF model with the performance of a directly written query?

EF Core 7 has introduced the new ExecuteUpdate method, so your update query can be written in the following way:
using (var db = new myDbContext())
{
    var objs = db.entity.Where(e => e.field1 == null).Take(1000);
    objs.ExecuteUpdate(b => b.SetProperty(x => x.field1, x => 13));
}
INSERT ... SELECT is not supported by any version of EF Core.
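If you stay on plain EF Core, the built-in fallback for the insert is the raw-SQL route the question already mentions; a minimal sketch, reusing the question's table and column names:
using (var db = new myDbContext())
{
    // Runs server-side as a single set-based statement; nothing is loaded into the change tracker.
    db.Database.ExecuteSqlRaw(
        @"insert into new_entity (field1, field2)
          select top 10000 field1, field2
          from old_entity");
}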
Alternatively, you can install the third-party extension linq2db.EntityFrameworkCore (note that I'm one of its creators).
Insert
using (var db = new myDbContext())
{
    var new_objs = db.old_entity.Take(10000).Select(ot => new new_entity() { ... });
    new_objs.Insert(db.new_entity.ToLinqToDBTable());
}
Update
using (var db = new myDbContext())
{
    var objs = db.entity.Where(e => e.field1 == null).Take(1000);
    objs
        .Set(x => x.field1, x => 13)
        .Update();
}

Related

How to write SQL to get data when using PostgreSQL and EF?

I have the below C# and Entity Framework 6.4 code to get data from a PostgreSQL function:
return Database.SqlQuery<TEntity>(sql, parameters).ToArrayAsync();
I tried
var sql = $@"select * from GetLogs(?,?,?,?,?,?,?,?,?,?,?,?,?)"; or
var sql = $@"select * from GetLogs($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13)";
but both failed. Does anybody know about this?
I also tried
var sql = "select * from GetLogs(TO_TIMESTAMP({0},'YYYY-MM-DD HH24:MI:SS'), TO_TIMESTAMP({1},'YYYY-MM-DD HH24:MI:SS'),{2},'{3}','{4}',{5},'{6}','{7}',{8},{9},{10})";
var from = request.From.ToString("yyyy-MM-dd HH:mm:ss");
var to = request.To.ToString("yyyy-MM-dd HH:mm:ss");
return Database.SqlQuery<AdminLogMasterModel>(sql,
from,
to,
request.HttpStatusCode,
request.ServerName,
request.Page,
request.TrackingId,
request.Content,
request.SortBy,
request.OrderBy == SortOrder.Ascending ? 0 : 1,
request.Page,
request.Rows).ToArrayAsync();
This got no error but no data returned. It seems PostgreSQL didn't receive the parameters from EF.
I found that the below code works:
var sql = "select * from GetLogs(TO_TIMESTAMP(#p0,'YYYY-MM-DD HH24:MI:SS'),TO_TIMESTAMP(#p1,'YYYY-MM-DD HH24:MI:SS'),#p2,#p3,#p4,#p5,#p6,#p7,#p8,#p9,#p10)";
var para = new NpgsqlParameter[]
{
new NpgsqlParameter("#p0", request.From.ToString("yyyy-MM-dd HH:mm:ss")),
new NpgsqlParameter("#p1", request.To.ToString("yyyy-MM-dd HH:mm:ss")),
new NpgsqlParameter("#p2", CreateRequestValueifNull(request.HttpStatusCode)),
new NpgsqlParameter("#p3", CreateRequestValueifNull(request.ServerName)),
new NpgsqlParameter("#p4", CreateRequestValueifNull(request.Path)),
new NpgsqlParameter("#p5", CreateRequestValueifNull(request.TrackingId)),
new NpgsqlParameter("#p6", CreateRequestValueifNull(request.Content)),
new NpgsqlParameter("#p7", request.SortBy),
new NpgsqlParameter("#p8", request.OrderBy == SortOrder.Ascending ? 0 : 1),
new NpgsqlParameter("#p9", request.Page),
new NpgsqlParameter("#p10", request.Rows),
};
return Database.SqlQuery<LogModel>(sql, para).ToArrayAsync();

Best solution for multiple insert/update statements

I'm struggling with understanding C# & Npgsql as a beginner. Here is a code example:
// Insert some data
using (var cmd = new NpgsqlCommand())
{
cmd.Connection = conn;
cmd.CommandText = "INSERT INTO data (some_field) VALUES (@p)";
cmd.Parameters.AddWithValue("p", "Hello world");
cmd.ExecuteNonQuery();
}
The syntax for more than one insert & update statement like this is clear so far:
cmd.CommandText = "INSERT INTO data (some_field) VALUES (@p);INSERT INTO data1...;INSERT INTO data2... and so on";
But what is the right solution for a loop that executes one statement per iteration?
This does not work:
// Insert some data
using (var cmd = new NpgsqlCommand())
{
foreach (var s in SomeStringCollectionOrWhatever)
{
cmd.Connection = conn;
cmd.CommandText = "INSERT INTO data (some_field) VALUES (@p)";
cmd.Parameters.AddWithValue("p", s);
cmd.ExecuteNonQuery();
}
}
It seems the parameters get "concatenated" or remembered; I cannot see any way to "clear" the existing cmd object.
My second idea would be to wrap the whole "using" block in the loop, but every cycle would create a new object. That seems ugly to me.
So what is the best solution for my problem?
To insert lots of rows efficiently, take a look at Npgsql's bulk copy feature - the API is more suitable (and more efficient) for inserting large numbers of rows than concatenating INSERT statements into a batch like you're trying to do.
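For reference, a minimal sketch of the binary COPY API (assuming the same data table and an already-open NpgsqlConnection named conn as in the snippets above):
using (var importer = conn.BeginBinaryImport(
    "COPY data (some_field) FROM STDIN (FORMAT BINARY)"))
{
    foreach (var s in SomeStringCollectionOrWhatever)
    {
        importer.StartRow();
        importer.Write(s);   // one Write call per column in the COPY column list
    }
    importer.Complete();     // commits the import; disposing without Complete() rolls it back
}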
If you want to rerun the same SQL with changing parameter values, you can do the following:
using (var cmd = new NpgsqlCommand("INSERT INTO data (some_field) VALUES (#p)", conn))
{
var p = new NpgsqlParameter("p", DbType.String); // Adjust DbType according to type
cmd.Parameters.Add(p);
cmd.Prepare(); // This is optional but will optimize the statement for repeated use
foreach(var s in SomeStringCollectionOrWhatever)
{
p.Value = s;
cmd.ExecuteNonQuery();
}
}
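On Npgsql 6.0 or later there is also the ADO.NET batching API, which sends all of the parameterized statements to the server in a single round trip; a sketch under the same assumptions (open connection conn, a collection of strings):
var batch = new NpgsqlBatch(conn);
foreach (var s in SomeStringCollectionOrWhatever)
{
    // One NpgsqlBatchCommand per INSERT, each carrying its own parameter value.
    var batchCommand = new NpgsqlBatchCommand("INSERT INTO data (some_field) VALUES (@p)");
    batchCommand.Parameters.AddWithValue("p", s);
    batch.BatchCommands.Add(batchCommand);
}
batch.ExecuteNonQuery();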
If you need to insert lots of rows and performance is key, then I would recommend Npgsql's bulk copy capability, as @Shay mentioned. But if you are looking for a quick way to do this without bulk copy, I would recommend using Dapper.
Consider the example below.
Let's say you have a class called Event and a list of events to add.
List<Event> eventsToInsert = new List<Event>
{
new Event() { EventId = 1, EventName = "Bday1" },
new Event() { EventId = 2, EventName = "Bday2" },
new Event() { EventId = 3, EventName = "Bday3" }
};
The snippet that would add the list to the DB shown below.
var sqlInsert = "Insert into events( eventid, eventname ) values (@EventId, @EventName)";
using (IDbConnection conn = new NpgsqlConnection(cs))
{
conn.Open();
// Execute is an extension method supplied by Dapper
// This code will add all the entries in the eventsToInsert List and match up the values based on property name. Only caveat is that the property names of the POCO should match the placeholder names in the SQL Statement.
conn.Execute(sqlInsert, eventsToInsert);
// If we want to retrieve the data back into the list
List<Event> eventsAdded;
// This Dapper extension will return an IEnumerable, so I cast it to a List.
eventsAdded = conn.Query<Event>("Select * from events").ToList();
foreach( var row in eventsAdded)
{
Console.WriteLine($"{row.EventId} {row.EventName} was added");
}
}
-HTH

NUnit database rollback

I'm fairly new to using NUnit as a test framework and have come across something I don't really understand.
I am writing an integration test which inserts a row into a database table. As I want to run this test repeatedly, I wanted to delete the row once the test has completed. The test runs fine; however, when I look in the database table the row in question is still there, even though the delete command ran. I can even see it run when profiling the database whilst running the test.
Does NUnit somehow roll back database transactions? If anyone has any ideas why I am seeing this happen, please let me know. I have been unable to find any information about NUnit rolling back transactions, which makes me think I'm doing something wrong. The test runs and the row appears in the database; it is just not being deleted afterwards, even though the profiler shows the command being run!
Here is the test code I am running:
[Test]
[Category("Consultant split integration test")]
public void Should_Save_Splits()
{
var splitList = new List<Split>();
splitList.Add(new Split()
{
UnitGroupId = 69,
ConsultantUserId = 1,
CreatedByUserId = 1,
CreatedOn = DateTime.Now,
UpdatedByUserId = 1,
UpdatedOn = DateTime.Now,
Name = "Consultant1",
Unit = "Unit1",
Percentage = 100,
PlacementId = 47
});
var connection = Helpers.GetCpeDevDatabaseConnection();
var repository = new Placements.ConsultantSplit.DAL.PronetRepository();
var notebookManager = Helpers.GetNotebookManager(connection);
var userManager = Helpers.GetUserManager(connection);
var placementManager = Helpers.GetPlacementManager(connection);
var sut = new Placements.ConsultantSplit.ConsultantSplitManager(repository, connection, notebookManager, userManager, placementManager);
IEnumerable<string> errors;
sut.SaveSplits(splitList, out errors);
try
{
using (connection.Connection)
{
using (
var cmd = new SqlCommand("Delete from ConsultantSplit where placementid=47",
connection.Connection))
{
connection.Open();
cmd.CommandType = CommandType.Text;
connection.UseTransaction = false;
cmd.Transaction = connection.Transaction;
cmd.ExecuteNonQuery();
connection.Close();
}
}
}
catch (Exception exp)
{
throw new Exception(exp.Message);
}
}

Is using a SqlCacheDependency per user a good idea?

I'm thinking of caching permissions for every user on our application server. Is it a good idea to use a SqlCacheDependency for every user?
The query would look like this
SELECT PermissionId, PermissionName From Permissions Where UserId = @UserId
That way I know if any of those records change then to purge my cache for that user.
If you read how Query Notifications work, you'll see why creating many dependency requests with a single query template is good practice. For a web app, which is implied by the fact that you use SqlCacheDependency and not SqlDependency, what you plan to do should be OK. If you use Linq2Sql you can also try LinqToCache:
var queryUsers = from u in repository.Users
                 where u.UserId == currentUserId
                 select u;
var user = queryUsers.AsCached("Users:" + currentUserId.ToString());
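If you are not using Linq2Sql, the per-user dependency from the question can be wired up directly against the ASP.NET cache; a sketch (LoadPermissions is a hypothetical helper that executes the command and materializes the rows, and note that query notifications require an explicit column list and two-part table names such as dbo.Permissions):
// SqlDependency.Start(connectionString) must have been called once at application start-up.
var cmd = new SqlCommand(
    "SELECT PermissionId, PermissionName FROM dbo.Permissions WHERE UserId = @UserId", conn);
cmd.Parameters.AddWithValue("@UserId", currentUserId);

var dependency = new SqlCacheDependency(cmd);   // register the notification for this command
var permissions = LoadPermissions(cmd);         // hypothetical helper; execute after creating the dependency

// The cache entry is purged when the query's result set changes for this user.
HttpRuntime.Cache.Insert("Permissions:" + currentUserId, permissions, dependency);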
For a fat client app it would not be OK. Not because of the query per-se, but because SqlDependency in general is problematic with a large number of clients connected (it blocks a worker thread per app domain connected):
SqlDependency was designed to be used in ASP.NET or middle-tier
services where there is a relatively small number of servers having
dependencies active against the database. It was not designed for use
in client applications, where hundreds or thousands of client
computers would have SqlDependency objects set up for a single
database server.
Updated
Here is the same test as @usr did in his post. Full C# code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.SqlClient;
using DependencyMassTest.Properties;
using System.Threading.Tasks;
using System.Threading;
namespace DependencyMassTest
{
class Program
{
static volatile int goal = 50000;
static volatile int running = 0;
static volatile int notified = 0;
static int workers = 50;
static SqlConnectionStringBuilder scsb;
static AutoResetEvent done = new AutoResetEvent(false);
static void Main(string[] args)
{
scsb = new SqlConnectionStringBuilder(Settings.Default.ConnString);
scsb.AsynchronousProcessing = true;
scsb.Pooling = true;
try
{
SqlDependency.Start(scsb.ConnectionString);
using (var conn = new SqlConnection(scsb.ConnectionString))
{
conn.Open();
using (SqlCommand cmd = new SqlCommand(@"
if object_id('SqlDependencyTest') is not null
drop table SqlDependencyTest
create table SqlDependencyTest (
ID int not null identity,
SomeValue nvarchar(400),
primary key(ID)
)
", conn))
{
cmd.ExecuteNonQuery();
}
}
for (int i = 0; i < workers; ++i)
{
Task.Factory.StartNew(
() =>
{
RunTask();
});
}
done.WaitOne();
Console.WriteLine("All dependencies subscribed. Waiting...");
Console.ReadKey();
}
catch (Exception e)
{
Console.Error.WriteLine(e);
}
finally
{
SqlDependency.Stop(scsb.ConnectionString);
}
}
static void RunTask()
{
Random rand = new Random();
SqlConnection conn = new SqlConnection(scsb.ConnectionString);
conn.Open();
SqlCommand cmd = new SqlCommand(
#"select SomeValue
from dbo.SqlDependencyTest
where ID = #id", conn);
cmd.Parameters.AddWithValue("#id", rand.Next(50000));
SqlDependency dep = new SqlDependency(cmd);
dep.OnChange += new OnChangeEventHandler((ob, qnArgs) =>
{
Console.WriteLine("Notified {3}: Info:{0}, Source:{1}, Type:{2}", qnArgs.Info, qnArgs.Source, qnArgs.Type, Interlocked.Increment(ref notified));
});
cmd.BeginExecuteReader(
(ar) =>
{
try
{
int crt = Interlocked.Increment(ref running);
if (crt % 1000 == 0)
{
Console.WriteLine("{0} running...", crt);
}
using (SqlDataReader rdr = cmd.EndExecuteReader(ar))
{
while (rdr.Read())
{
}
}
}
catch (Exception e)
{
Console.Error.WriteLine(e.Message);
}
finally
{
conn.Close();
int left = Interlocked.Decrement(ref goal);
if (0 == left)
{
done.Set();
}
else if (left > 0)
{
RunTask();
}
}
}, null);
}
}
}
After 50k subscriptions are set up (takes about 5 min), here are the time statistics of a single insert:
set statistics time on
insert into Test..SqlDependencyTest (SomeValue) values ('Foo');
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 16 ms.
Inserting 1000 rows takes about 7 seconds, which includes firing several hundred notifications. CPU utilization is about 11%. All this is on my T420s ThinkPad.
set nocount on;
go
begin transaction
go
insert into Test..SqlDependencyTest (SomeValue) values ('Foo');
go 1000
commit
go
The documentation says:
SqlDependency was designed to be used in ASP.NET or middle-tier
services where there is a relatively small number of servers having
dependencies active against the database. It was not designed for use
in client applications, where hundreds or thousands of client
computers would have SqlDependency objects set up for a single
database server.
It tells us not to open thousands of cache dependencies. That is likely to cause resource problems on the SQL Server.
There are a few alternatives:
Have a dependency per table
Have 100 dependencies per table, one for every percent of rows. This should be an acceptable number for SQL Server yet you only need to invalidate 1% of the cache.
Have a trigger output the ID of all changed rows into a logging table. Create a dependency on that table and read the IDs. This will tell you exactly which rows have changed.
In order to find out if SqlDependency is suitable for mass usage I did a benchmark:
static void SqlDependencyMassTest()
{
var connectionString = "Data Source=(local); Initial Catalog=Test; Integrated Security=true;";
using (var dependencyConnection = new SqlConnection(connectionString))
{
dependencyConnection.EnsureIsOpen();
dependencyConnection.ExecuteNonQuery(@"
if object_id('SqlDependencyTest') is not null
drop table SqlDependencyTest
create table SqlDependencyTest (
ID int not null identity,
SomeValue nvarchar(400),
primary key(ID)
)
--ALTER DATABASE Test SET ENABLE_BROKER with rollback immediate
");
SqlDependency.Start(connectionString);
for (int i = 0; i < 1000 * 1000; i++)
{
using (var sqlCommand = new SqlCommand("select ID from dbo.SqlDependencyTest where ID = @id", dependencyConnection))
{
sqlCommand.AddCommandParameters(new { id = StaticRandom.ThreadLocal.GetInt32() });
CreateSqlDependency(sqlCommand, args =>
{
});
}
if (i % 1000 == 0)
Console.WriteLine(i);
}
}
}
You can see the amount of dependencies created scroll through the console. It gets slow very quickly. I did not do a formal measurement because it was not necessary to prove the point.
Also, the execution plan for a simple insert into the table shows 99% of the cost being associated with maintaining the 50k dependencies.
Conclusion: Does not work at all for production use. After 30min I have 55k dependencies created. Machine at 100% CPU all the time.

SQL Server: better to batch statements or foreach?

Hypothetically, is it better to send N statements to SQL Server (2008), or is it better to send 1 command comprising N statements to SQL Server? In either case, I am running the same statement over a list of objects, and in both cases I would be using named parameters. Suppose my use case is dumping a cache of log items every few hours.
foreach example
var sql = "update blah blah blah where id = #id";
using(var conn = GetConnection())
{
foreach(var obj in myList)
{
var cmd = new SqlCommand()
{CommandText = sql, Connection = conn};
//add params from obj
cmd.ExecuteNonQuery();
}
}
batch example
var sql = @"
update blah blah blah where id = @id1
update blah blah blah where id = @id2
update blah blah blah where id = @id3
-etc";
using (var conn = GetConnection())
{
var cmd = new SqlCommand
{ CommandText = sql, Connection = conn};
for(int i=0; i<myList.Count; i++)
{
//add params: "id" + i from myList[i]
}
cmd.ExecuteNonQuery();
}
In time tests, the batch version took 15% longer than the foreach version for large inputs. I figure the batch version takes longer to execute because the server has to parse a huge statement and bind up to 2000 parameters. Supposing SQL Server is on the LAN, is there any advantage to using the batch method?
Your tests would seem to have given you the answer; however, let me add another. It is preferable to encapsulate the update in a separate function and call that using a foreach:
private void UpdateFoo(int id)
{
const string sql = "Update Foo Where Id = @Id";
using (var conn = GetConnection())
{
using (var cmd = new SqlCommand(sql, conn))
{
cmd.Parameters.AddWithValue("@Id", id);
cmd.ExecuteNonQuery();
}
}
}
private void UpdateLotsOfFoo()
{
foreach (var foo in myList)
{
UpdateFoo(foo.Id);
}
}
In this setup you are leveraging connection pooling, which mitigates the cost of opening and closing connections.
@Thomas - this design can increase the overhead of opening/closing connections in a loop. This is not a preferred practice and should be avoided. The code below iterates over the statements while using one connection and will be easier on resources (both client- and server-side).
private void UpdateFoos()
{
const string sql = "Update Foo Where Id = @Id";
using (var conn = GetConnection())
{
conn.Open();
foreach (var foo in myList)
{
using (var cmd = new SqlCommand(sql, conn))
{
cmd.Parameters.AddWithValue("@Id", foo.Id);
cmd.ExecuteNonQuery();
}
}
conn.Close();
}
}
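A further refinement, along the lines of the prepared-statement Npgsql example earlier on this page, is to create the command and its parameter once and only change the value inside the loop; a sketch (UpdateFoosReusingCommand is a hypothetical name, GetConnection and myList are the same assumed helpers as above, and SqlDbType comes from System.Data):
private void UpdateFoosReusingCommand()
{
    const string sql = "Update Foo Where Id = @Id";   // same placeholder statement as above
    using (var conn = GetConnection())
    {
        conn.Open();
        using (var cmd = new SqlCommand(sql, conn))
        {
            var idParam = cmd.Parameters.Add("@Id", SqlDbType.Int);
            foreach (var foo in myList)
            {
                idParam.Value = foo.Id;   // only the value changes per iteration
                cmd.ExecuteNonQuery();
            }
        }
    }
}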
