linq2db - server side bulkcopy - sql-server

I'm trying to do a "database side" bulk copy (i.e. SELECT INTO/INSERT INTO) using linq2db. However, my code is trying to bring the dataset over the wire which is not possible given the size of the DB in question.
My code looks like this:
using (var db = new MyDb()) {
    var list = db.SourceTable.
        Where(s => s.Year > 2012).
        GroupBy(s => new { s.Column1, s.Column2 }).
        Select(g => new DestinationTable {
            Property1 = "Constant Value",
            Property2 = g.First().Column1,
            Property3 = g.First().Column2,
            Property4 = g.Count(s => s.Column3 == 'Y')
        });
    db.Execute("TRUNCATE TABLE DESTINATION_TABLE");
    db.BulkCopy(new BulkCopyOptions {
        BulkCopyType = BulkCopyType.MultipleRows
    }, list);
}
The generated SQL looks like this:
BeforeExecute
-- DBNAME SqlServer.2017
TRUNCATE TABLE DESTINATION_TABLE
DataConnection
Query Execution Time (AfterExecute): 00:00:00.0361209. Records Affected: -1.
DataConnection
BeforeExecute
-- DBNAME SqlServer.2017
DECLARE @take Int -- Int32
SET @take = 1
DECLARE @take_1 Int -- Int32
SET @take_1 = 1
DECLARE @take_2 Int -- Int32
...
SELECT
(
SELECT TOP (@take)
[p].[YEAR]
FROM
[dbo].[SOURCE_TABLE] [p]
WHERE
(([p_16].[YEAR] = [p].[YEAR] OR [p_16].[YEAR] IS NULL AND [p].[YEAR] IS NULL) AND ...
...)
FROM SOURCE_TABLE p_16
WHERE p_16.YEAR > 2012
GROUP BY
...
DataConnection
That is all that is logged as the bulkcopy fails with a timeout, i.e. SqlException "Execution Timeout Expired".
Please note that running this query as an INSERT INTO statement takes less than 1 second directly in the DB.
PS: Does anyone have recommendations for good code-based ETL tools for large databases (1 TB+)? Given the DB size, I need things to run in the database and not bring data over the wire. I've tried pyspark, python bonobo, and c# etlbox, and they all move too much data around. I thought linq2db had potential, i.e. it would basically act as a C#-to-SQL transpiler, but it is also trying to move data around.

I would suggest rewriting your query, because GROUP BY cannot return the first element of a group. Also, Truncate is part of the library.
var sourceQuery =
    from s in db.SourceTable
    where s.Year > 2012
    select new
    {
        Source = s,
        // Counts only the rows where Column3 == 'Y' (COUNT ignores NULLs)
        Count = Sql.Ext.Count(s.Column3 == 'Y' ? 1 : (int?)null).Over()
            .PartitionBy(s.Column1, s.Column2).ToValue(),
        // Numbers the rows within each (Column1, Column2) group
        RN = Sql.Ext.RowNumber().Over()
            .PartitionBy(s.Column1, s.Column2).OrderByDesc(s.Year).ToValue()
    };
db.DestinationTable.Truncate();
sourceQuery.Where(s => s.RN == 1)
    .Insert(db.DestinationTable,
        e => new DestinationTable
        {
            Property1 = "Constant Value",
            Property2 = e.Source.Column1,
            Property3 = e.Source.Column2,
            Property4 = e.Count
        });

After some investigation I stumbled onto this issue, which led me to the solution. The code above needs to change to:
db.Execute("TRUNCATE TABLE DESTINATION_TABLE");
db.SourceTable.
    Where(s => s.Year > 2012).
    GroupBy(s => new { s.Column1, s.Column2 }).
    Select(g => new DestinationTable {
        Property1 = "Constant Value",
        Property2 = g.First().Column1,
        Property3 = g.First().Column2,
        Property4 = g.Count(s => s.Column3 == 'Y')
    }).Insert(db.DestinationTable, e => e);
The documentation of the linq2db project leaves a bit to be desired. However, in terms of functionality it is looking like a great project for ETLs (without horrible copy/pasted SQL/SSIS scripts thousands of lines long).

Related

Select specific columns with repository in Entity Framework Core

I have a big table with a binary column for a picture. I need to show the contents of this table in a view and make it searchable. I have tried selecting only the subset of columns that the view needs. However, the generated SQL always includes all the columns of the table.
public IQueryable<ApplicantDTO> GetApplicantQueryable()
{
    return DataContext.Applicants
        .Include(a => a.Nationality)
        .Select(x => new ApplicantDTO
        {
            Id = x.Id,
            BirthDate = x.BirthDate,
            Gender = x.Gender,
            FirstName = x.FirstName,
            LastName = x.LastName,
            OtherName = x.OtherName,
            MobileNo = x.MobileNo,
            Age = x.Age,
            Nationality = x.Nationality.National,
            Admitted = x.admitted,
            Completed = x.Completed
        })
        .Where(a => a.Admitted == false && a.Completed == true)
        .OrderBy(a => a.LastName)
        .AsNoTracking();
}
But instead of selecting just the columns specified above, the generated SQL from the profiler is
SELECT
[a].[Id], [a].[BirthDate], [a].[BirthPlace], [a].[CompleteDate],
[a].[Completed], [a].[ContentType], [a].[CreateDate],
[a].[Denomination], [a].[Disability], [a].[Email],
[a].[FirstName], [a].[Gender], [a].[HomeTown], [a].[LastName],
[a].[MarryStatus], [a].[MatureApp], [a].[MobileNo], [a].[NationalityID],
[a].[OtherName], [a].[Passport], [a].[Pin], [a].[PostalAddress],
[a].[Region], [a].[Religion], [a].[ResAddress], [a].[SerialNo],
[a].[Title], [a].[VoucherID], [a].[admitted], [a.Nationality].[National]
FROM
[Applicants] AS [a]
INNER JOIN
[Nationality] AS [a.Nationality] ON [a].[NationalityID] = [a.Nationality].[Id]
WHERE
([a].[admitted] = 0)
AND ([a].[Completed] = 1)
ORDER BY
[a].[LastName]
with all the underlying columns included in the query.
I tried putting it in an anonymous type before casting it to the ApplicantDTO but still the same effect.
What's wrong?

Linq to entities (EF6) return latest records of each group using Row number in T-SQL server

I'm using LINQ to Entities to get the most recently updated record of each group. However, when I checked in SQL Profiler, my LINQ query generated many subqueries, so it takes far too long to complete. To solve this performance problem I wrote the native T-SQL shown below, and I'm now looking for a way to write a LINQ query for which Entity Framework generates the same SQL (using ROW_NUMBER() OVER (PARTITION BY ...)). Below is my sample data:
Parent and Child tables:
Sample data of Parent and Child tables:
Below is my query result:
TSQL query:
WITH summary AS (
SELECT a.ParentId
,a.Name
,a.Email
,p.Created
,p.[Status],
ROW_NUMBER() OVER(PARTITION BY p.ParentId
ORDER BY p.Created DESC) AS rk
FROM Parent a
LEFT JOIN Child p
ON a.ParentId = P.ParentId
)
SELECT s.*
FROM summary s
WHERE s.rk = 1
My sample C# using Linq:
using (DbContext context = new DbContext())
{
    return context.Parents.Where(p => p.ParentId == parentId)
        .Include(a => a.Childs)
        .Select(x => new ObjectDto()
        {
            ParentId = x.ParentId,
            Status = x.Childs.OrderByDescending(a => a.Created).FirstOrDefault(p => p.ParentId).Status,
            ChildName = x.Childs.OrderByDescending(a => a.Created).FirstOrDefault(p => p.ParentId).ChildName
        })
        .ToList();
}
There are a couple of things to improve your C# query:
using (DbContext context = new DbContext())
{
    return context.Parents.Where(p => p.ParentId == parentId)
        .Include(a => a.Childs)
        .Select(x => new ObjectDto()
        {
            ParentId = x.ParentId,
            Status = x.Childs.OrderByDescending(a => a.Created).FirstOrDefault(p => p.ParentId).Status,
            ChildName = x.Childs.OrderByDescending(a => a.ChildName).FirstOrDefault(p => p.ParentId).ChildName
        })
        .ToList();
}
Firstly the Include call does nothing here as you're not just returning EF entities (and thus the lazy loading semantics don't apply).
Secondly: avoid the repeated subqueries with a let clause.
(Also the lambda passed to FirstOrDefault must be an error as it takes a Func<T, bool> which that isn't.)
Thus
using (DbContext context = new DbContext()) {
    return await (from p in context.Parents
                  where p.ParentId == parentId
                  let cs = p.Childs.OrderByDescending(a => a.Created).FirstOrDefault()
                  select new ObjectDto {
                      ParentId = p.ParentId,
                      Status = cs.Status,
                      ChildName = cs.ChildName
                  }).ToListAsync();
}
Otherwise it looks reasonable. You would need to look at the generated query plan and see what you can do with respect to indexing.
If that doesn't work, then use a stored procedure where you have full control. (Mechanical generation of code – without a lot of work in the code generator and optimiser – can always be beaten by hand written code).

Cannot understand how Entity Framework will generate a SQL statement for an Update operation using timestamp?

I have the following method inside my asp.net mvc web application :
var rack = IT.ITRacks.Where(a => !a.Technology.IsDeleted && a.Technology.IsCompleted);
foreach (var r in rack)
{
    long? it360id = technology[r.ITRackID];
    if (it360resource.ContainsKey(it360id.Value))
    {
        long? CurrentIT360siteid = it360resource[it360id.Value];
        if (CurrentIT360siteid != r.IT360SiteID)
        {
            r.IT360SiteID = CurrentIT360siteid.Value;
            IT.Entry(r).State = EntityState.Modified;
            count = count + 1;
        }
    }
    IT.SaveChanges();
}
When I checked SQL Server Profiler, I noticed that EF generates the following SQL statement:
exec sp_executesql N'update [dbo].[ITSwitches]
set [ModelID] = @0, [Spec] = null, [RackID] = @1, [ConsoleServerID] = null, [Description] = null, [IT360SiteID] = @2, [ConsoleServerPort] = null
where (([SwitchID] = @3) and ([timestamp] = @4))
select [timestamp]
from [dbo].[ITSwitches]
where @@ROWCOUNT > 0 and [SwitchID] = @3',N'@0 int,@1 int,@2 bigint,@3 int,@4 binary(8)',@0=1,@1=539,@2=1502,@3=1484,@4=0x00000000000EDCB2
I cannot understand the purpose of the following section:
select [timestamp]
from [dbo].[ITSwitches]
where @@ROWCOUNT > 0 and [SwitchID] = @3',N'@0 int,@1 int,@2 bigint,@3 int,@4 binary(8)',@0=1,@1=539,@2=1502,@3=1484,@4=0x00000000000EDCB2
Can anyone advise?
Entity Framework uses timestamps to check whether a row has changed. If the row has changed since the last time EF retrieved it, then it knows it has a concurrency problem.
Here's an explanation:
http://www.remondo.net/entity-framework-concurrency-checking-with-timestamp/
This is because EF (and you) want to refresh the client-side object with the newly generated rowversion value.
First the update is executed. If this succeeds (because the rowversion is still the one you had in the client), a new rowversion is generated by the database and EF retrieves that value. Suppose you immediately wanted to make a second update. That would be impossible if you didn't have the new rowversion.
This happens with all properties that are marked as identity or computed (via DatabaseGeneratedOption).
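The update-then-reselect round trip described above can be mimicked without a database. The following is a minimal in-memory sketch, not EF's actual implementation: FakeRow, FakeTable, Insert and TryUpdate are hypothetical names, and a simple counter stands in for SQL Server's rowversion.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a table row; Version plays the role of [timestamp].
public sealed class FakeRow
{
    public long SiteId;
    public long Version;
}

// Hypothetical in-memory "table" keyed by primary key.
public static class FakeTable
{
    private static readonly Dictionary<int, FakeRow> Rows = new Dictionary<int, FakeRow>();
    private static long _nextVersion = 1;

    // Inserts a row and returns the version the "database" assigned to it.
    public static long Insert(int id, long siteId)
    {
        var row = new FakeRow { SiteId = siteId, Version = _nextVersion++ };
        Rows[id] = row;
        return row.Version;
    }

    // Mirrors: UPDATE ... WHERE id = @id AND [timestamp] = @expected;
    //          SELECT [timestamp] ... WHERE @@ROWCOUNT > 0 AND id = @id
    // Returns the new version, or null when no row matched (a concurrency conflict).
    public static long? TryUpdate(int id, long newSiteId, long expectedVersion)
    {
        FakeRow row;
        if (!Rows.TryGetValue(id, out row) || row.Version != expectedVersion)
            return null;

        row.SiteId = newSiteId;
        row.Version = _nextVersion++; // the "database" generates a fresh version
        return row.Version;
    }
}
```

In this sketch, a caller that keeps the value returned by TryUpdate can issue a second update straight away, while a caller still holding the old version gets null back. That is the reason for the trailing SELECT: without it the client would only have the stale value.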

SqlCeException-Expressions in the ORDER BY list cannot contain aggregate functions

I have the following code which counts vehicles grouped by vehicle type.
using (var uow = new UnitOfWork(new NHibernateHelper().SessionFactory)) {
    var repo = new Repository<Vehicle>(uow.Session);
    var vtSummary = repo.ListAll()
        .GroupBy(v => v.VehicleType.Name)
        .Select(v => new NameCount {
            EntityDescription = v.First().VehicleType.Name,
            QtyCount = v.Count() })
        .OrderByDescending(v => v.QtyCount).ToList();
    uow.Commit();
    return vtSummary;
}
The above produces the following sql code:
SELECT VehicleType.Name as col_0_0_,
CAST(COUNT(*) AS INT) as col_1_0_
FROM Vehicle vehicle0_
LEFT OUTER JOIN VehicleType vehicletype1_
ON vehicle0_.VehicleTypeId= VehicleType.Id
GROUP BY VehicleType.Name
ORDER BY CAST(COUNT(*) AS INT) DESC
The SQL code runs perfectly well under MS SQL Server, but under SQL CE it produces the following error:
System.Data.SqlServerCe.SqlCeException : Expressions in the ORDER BY list cannot contain aggregate functions.
A solution from Sub query in Sql server CE is to specify an alias for the column and use the alias in the order by clause.
Is there any way to provide an alias in the LINQ expression I am using, so that it runs without raising errors?
You can perform the OrderBy in memory, with LINQ to objects:
var vtSummary = repo.ListAll()
    .GroupBy(v => v.VehicleType.Name)
    .Select(v => new NameCount {
        EntityDescription = v.First().VehicleType.Name,
        QtyCount = v.Count() })
    .AsEnumerable() //SQL executes at this point
    .OrderByDescending(v => v.QtyCount).ToList();
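The split between provider-side and in-memory work can be tried out without NHibernate. This is a minimal sketch under assumptions: VehicleSummary and Summarize are hypothetical names, the input is a plain list of vehicle type names wrapped with AsQueryable() instead of a real session, and g.Key is used in place of v.First().VehicleType.Name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class NameCount
{
    public string EntityDescription;
    public int QtyCount;
}

public static class VehicleSummary
{
    // vehicleTypeNames is a hypothetical stand-in for the Vehicle -> VehicleType.Name data.
    public static List<NameCount> Summarize(IQueryable<string> vehicleTypeNames)
    {
        return vehicleTypeNames
            .GroupBy(name => name)
            .Select(g => new NameCount { EntityDescription = g.Key, QtyCount = g.Count() })
            .AsEnumerable()                       // operators above this line go to the query provider
            .OrderByDescending(nc => nc.QtyCount) // operators below it run in memory (LINQ to Objects)
            .ToList();
    }
}
```

Because the grouping still happens before AsEnumerable(), only one row per vehicle type would cross the wire with a real provider, so sorting that small result in memory should be cheap.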

Strange behaviour execute SQL Server from .NET to CREATE tables or columns

I have a big SQL text file where I have a lot of SQL commands to create tables, columns, etc.
Example row(s):
IF NOT EXISTS ( SELECT * FROM dbo.sysobjects WHERE id = object_id(N'xcal_views') AND OBJECTPROPERTY(id, N'IsUserTable') = 1)
CREATE TABLE xcal_views (lid INT NOT NULL);
GO
IF NOT EXISTS ( SELECT * FROM dbo.sysobjects WHERE id = object_id(N'xcal_views_actors') AND OBJECTPROPERTY(id, N'IsUserTable') = 1)
CREATE TABLE xcal_views_actors (lid INT NOT NULL);
GO
IF NOT EXISTS ( SELECT * FROM dbo.syscolumns, dbo.sysobjects WHERE [dbo].[syscolumns].[name] = 'xlactor' AND [dbo].[sysobjects].[id] = [dbo].[syscolumns].[id] AND [dbo].[sysobjects].[id] = object_id(N'xcal_views_actors') AND OBJECTPROPERTY([dbo].[sysobjects].[id], N'IsUserTable') = 1 )
ALTER TABLE [dbo].[xcal_views_actors] ADD xlactor INT NULL;
GO
IF NOT EXISTS ( SELECT * FROM dbo.syscolumns, dbo.sysobjects WHERE [dbo].[syscolumns].[name] = 'lparentid' AND [dbo].[sysobjects].[id] = [dbo].[syscolumns].[id] AND [dbo].[sysobjects].[id] = object_id(N'xcal_views_actors') AND OBJECTPROPERTY([dbo].[sysobjects].[id], N'IsUserTable') = 1 )
ALTER TABLE [dbo].[xcal_views_actors] ADD lparentid INT NULL;
GO
IF NOT EXISTS ( SELECT * FROM dbo.sysobjects WHERE parent_obj = (SELECT id FROM dbo.sysobjects WHERE id = object_id(N'xcal_views_actors') AND OBJECTPROPERTY(id, N'IsUserTable') = 1) AND OBJECTPROPERTY(id, N'IsPrimaryKey') = 1)
ALTER TABLE [dbo].[xcal_views_actors]
ADD CONSTRAINT [CT_00000501] PRIMARY KEY CLUSTERED (lid ASC);
GO
IF NOT EXISTS ( SELECT * FROM [sys].[indexes] i INNER JOIN [sys].[objects] o ON o.object_id = i.object_id AND o.name = 'xcal_views_actors' WHERE i.name = 'parent_id' )
CREATE INDEX parent_id ON xcal_views_actors (lparentid ASC)
GO
Between each command I have a GO on its own line to separate the commands.
If I run the whole patch.sql file from SQL Server Management Studio, all commands are executed and work fine.
In .NET I read the whole text file, split it on 'GO', and execute each SQL command against the database.
Now the strange thing: some of the commands don't get executed, and I can't find out why.
This is the method that does the job:
private static void patchDatabase(string connection, string sqlfile)
{
    var defaultEncoding = Encoding.Default;
    using (FileStream fs = File.OpenRead(sqlfile))
    {
        defaultEncoding = TextFileEncodingDetector.DetectTextFileEncoding(fs, defaultEncoding, 1024);
    }
    //Console.WriteLine(string.Format("File {0} using encoding: {1}",sqlfile, defaultEncoding));
    var dbPatch = new StreamReader(sqlfile, defaultEncoding);
    string sqlPatch = dbPatch.ReadToEnd();
    dbPatch.Close();
    string[] stringSeparators = new[] {"GO"};
    string[] sqlPatches = sqlPatch.Split(stringSeparators, StringSplitOptions.None);
    if (connection != null && sqlPatch.Length > 0)
    {
        Console.WriteLine(string.Format("Executing {0} statements from {1}", sqlPatches.Length, sqlfile));
        using (var cnn = new SqlConnection(connection))
        {
            cnn.Open();
            foreach (var sql in sqlPatches)
            {
                if (String.IsNullOrEmpty(sql))
                    continue; // Not a real sql statement, use next
                using (var cmd = new SqlCommand(sql, cnn))
                {
                    try
                    {
                        cmd.CommandTimeout = 120;
                        cmd.ExecuteNonQuery();
                        //int results = cmd.ExecuteNonQuery();
                        //if (results < 1)
                        //    Console.WriteLine(String.Format("Failed:\nResult: {0}\n{1}",results, sql));
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine("Execution error!\n\n" + sql + "\n\n\n" + ex);
                    }
                }
            }
            cnn.Close();
        }
    }
}
It looks like my function splutters...
My current text file has around 6,000+ lines.
Any idea what I am doing wrong?
It looks like it wasn't a coding problem at all.
In fact I had produced the following situation:
I have a database "test_db".
I create a user "test_db_user" and make him db_owner.
He is added at the SQL Server level under "security - roles", with object "test_db", as role db_owner.
He also gets added as a user in the database under "test_db" - security - user.
Now here it comes: I restored the database again after some tests.
The user is no longer listed in "test_db" - security - user
but is still configured as db_owner at the server level.
Somehow the connection still works, but it doesn't have full access to the DB all the time. I can't really tell what is going wrong.
From Management Studio I always used an admin account to run the SQL batches, which is why it worked there all the time.
So the solution: make sure the security settings are 100% correct and then it works :S
Another problem was my own stupidity :S!!! I had some CREATE TABLE statements without the dbo. schema prefix before the table name, so the different user created the table under a different schema name.
Thanks to all for the feedback.
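Separately from the permissions issue, splitting on the bare string "GO" as in the method above also cuts any statement that happens to contain those two letters (for example a table named CATEGORY). A minimal sketch of a stricter splitter follows, assuming GO always sits alone on its own line; SqlBatchSplitter and Split are hypothetical names. It does not handle "GO 5" style repeat counts, or a GO line inside a multi-line comment or string literal.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SqlBatchSplitter
{
    // Matches a line that consists only of GO (any case), allowing spaces or tabs
    // around it and an optional carriage return from Windows line endings.
    private static readonly Regex GoLine =
        new Regex(@"^[ \t]*GO[ \t]*\r?$", RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Splits a script into batches and drops the empty ones.
    public static List<string> Split(string script)
    {
        var batches = new List<string>();
        foreach (var part in GoLine.Split(script))
        {
            var batch = part.Trim();
            if (batch.Length > 0)
                batches.Add(batch);
        }
        return batches;
    }
}
```

The result of SqlBatchSplitter.Split(sqlPatch) could replace the sqlPatches array in the loop above. Trimming each batch also makes the String.IsNullOrEmpty check unnecessary, since whitespace-only fragments are never returned.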

Resources