is there something faster than Enumerable.Except<TSource> Method? - sql-server

I have a program that downloads data from server database to client database. server database keeps growing recently.
in that program, there is an option to select download all data OR download data for a specific time period (can select backward days from today). if the user selects all, I wrote the program to truncate client database table and insert all data using bulk copy. that part is ok.
but the problem is when user select a specific time period (each recode has created data time ) program has to compare two tables and divide recodes (server data) in two tables. one is, not exist data and the second one is not existing data. and what I'm going to do is,
not existing data directly insert into client DB (i'm using bulk insert) and Existing data inserting into a tempory table using bulkcopy and after update client's table using the above tempory table. My actual problem occurs when dividing server's table. this is how I did it
updateTable = (From c In dt_from_server.AsEnumerable()
Join o In Dt_from_client.AsEnumerable()
On c.Field(Of String)("BARCODE").Trim() Equals o.Field(Of String)("BARCODE").Trim()
And c.Field(Of String)("ITEM_CODE").Trim() Equals o.Field(Of String)("ITEM_CODE").Trim()
Select c).CopyToDataTable()
insertTable = dt_server.AsEnumerable()
.Except(updateTable.AsEnumerable(), DataRowComparer.Default)
.CopyToDataTable()
(normally there is over 1M recodes in the server table )
when there is over 1 Milion recodes, Update part taking acceptable time like 10 minutes (Yes it taking 5GB space from Ram - in this case, it's ok when considering performance )
but insert part seams taking days, just to assing the insertTable(datatable). this is the issue.
AsEnumerable().Except() part taking long time and I couldn't find a solution speedup this process. I'm not sure I explained this correctly. Could anyone can give me some advice for this?

Since you have commented that dt_from_server and dt_server are actually the same DataTable you don't need to compare all values of all DataRows with each other, which is what DataRowComparer.Default does. You can use Except without second parameter for the comparer, then only references are compared which is much faster.
You also don't need two CopyToDataTable which creates two additonal big DataTables in memory, process the rows one after the other.
Here is a different approach using Linq's left-outer join, which is more efficient:
Dim query = from rServ in dt_from_server.AsEnumerable()
group join rClient in Dt_from_client.AsEnumerable()
On New With{
Key .BarCode = rServ.Field(Of String)("BARCODE").Trim(),
Key .ItemCode = rServ.Field(Of String)("ITEM_CODE").Trim()
} Equals New With{
Key .BarCode = rClient.Field(Of String)("BARCODE").Trim(),
Key .ItemCode = rClient.Field(Of String)("ITEM_CODE").Trim()
} into Group
From client In Group.DefaultIfEmpty()
Select new With { .ServerRow = rServ, .InsertRow = client is Nothing }
Dim insertOrUpdateRows = query.ToLookup(Function(x) x.InsertRow, Function(x) x.ServerRow)
Dim insertRows = insertOrUpdateRows(true).CopyToDataTable() 'CopyToDataTable redundant if you process rows immediately now'
Dim updateRows = insertOrUpdateRows(false).CopyToDataTable() 'CopyToDataTable redundant if you process rows immediately now'
But in general the most scalable and efficient approach would be to not load all into memory at once and then process all, but to use database paging(or a stored-procedure) to process only parts of it in memory, otherwise it's likely that you will encounter a OutOfMemoryException sooner or later.
C# as requested:
var query = from rServ in dt_from_server.AsEnumerable()
join rClient in Dt_from_client.AsEnumerable()
on new { BarCode = rServ.Field<string>("BARCODE").Trim(), ItemCode = rServ.Field<string>("ITEM_CODE").Trim() }
equals new { BarCode = rClient.Field<string>("BARCODE").Trim(), ItemCode = rClient.Field<string>("ITEM_CODE").Trim() }
into clientGroup
from client in clientGroup.DefaultIfEmpty()
select new { ServerRow = rServ, InsertRow = client == null };
var insertOrUpdateRows = query.ToLookup(x => x.InsertRow, x => x.ServerRow);
var insertRows = insertOrUpdateRows[true].CopyToDataTable(); // CopyToDataTable redundant if you process rows immediately now
var updateRows = insertOrUpdateRows[false].CopyToDataTable(); // CopyToDataTable redundant if you process rows immediately now

Related

Streams + tasks missing inserts?

We've setup a stream on a table that is continuously loaded via snowpipe.
We're consuming this data with a task that runs every minute where we merge into another table. There is a possibility of duplicate keys so we use a ROW_NUMBER() window function, ordered by the file created timestamp descending where row_num=1. This way we always get the latest insert
Initially we used a standard task with the merge statement but we noticed that in some instances, since snowpipe does not guarantee loading in order of when the files were staged, we were updating rows with older data. As such, on the WHEN MATCHED section we added a condition so only when the file created ts > existing, to update the row
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run where some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows were caught up in the middle and the offset changed before they could be consumed
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However even with this we can see that new inserts are still missing. We're talking very small numbers e.g. 8 out of 100,000s
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but the file created time related recommendation is given by Snowflake somewhere. It suggest that the file created timestamp is calculated in cloud service and it may be bit different than you think. There is another recommendation related to snowpipe and data ingestion. The queue service takes a min to consume the data from pipe and if you have lot of data being flown inside with in a min, you may end up this issue. Look you implementation and simulate if pushing data in 1min interval solve that issue and don't rely on file create time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating row when nothing has changed.

Concurrent updates on a single staging table

I am developing a service application (VB.NET) which pulls information from a source and imports it to a SQL Server database
The process can involve one or more “batches” of information at a time (the number and size of batches in any given “run” is arbitrary based on a queue maintained elsewhere)
Each batch is assigned an identifier (BatchID) so that the set of records in the staging table which belong to that batch can be easily identified
The ETL process for each batch is sequential in nature; the raw data is bulk inserted to a staging table and then a series of stored procedures perform updates on a number of columns until the data is ready for import
These stored procedures are called in sequence by the service and are generally simple UPDATE commands
Each SP takes the BatchID as an input parameter and specifies this as the criteria for inclusion in each UPDATE, á la :
UPDATE dbo.stgTable
SET FieldOne = (CASE
WHEN S.[FieldOne] IS NULL
THEN T1.FieldOne
ELSE
S.[FieldOne]
END
)
, FieldTwo = (CASE
WHEN S.[FieldTwo] IS NULL
THEN T2.FieldTwo
ELSE
S.[FieldTwo]
END
)
FROM dbo.stgTable AS S
LEFT JOIN dbo.someTable T1 ON S.[SomeField] = T1.[SomeField]
LEFT JOIN dbo.someOtherTable T2 ON S.[SomeOtherField] = T2.[SomeOtherField]
WHERE S.BatchID = #BatchID
Some of the SP’s also refer to functions (both scalar and table-valued) and all incorporate a TRY / CATCH structure so I can tell from the output parameters if a particular SP has failed
The final SP is a MERGE operation to move the enriched data from the staging table into the production table (again, specific to the provided BatchID)
I would like to thread this process in the service so that a large batch doesn’t hold up smaller batches in the same run
I figured there should be no issue with this as no thread could ever attempt to process records in the staging table that could be targeted by another thread (no race conditions)
However, I’ve noticed that, when I do thread the process, arbitrary steps on arbitrary batches seem to fail (but no error is recorded from the output of the SP)
The failures are inconsistent; e.g. sometimes batches 2, 3 & 5 will fail (on SP’s 3, 5 & 7 respectively), other times it will be different batches, each at different steps in the sequence
When I import the batches sequentially, they all import perfectly fine – always!
I can’t figure out if this is an issue on the service side (VB.NET) – e.g. is each thread opening an independent connection to the DB or could they be sharing the same one (I’ve set it up that each one should be independent…)
Or if the issue is on the SQL Server side – e.g. is it not feasible for concurrent SP calls to manipulate data on the same table, even though, as described above, no thread/batch will ever touch records belonging to another thread/batch
(On this point – I tried using CTE’s to create subsets of data from the staging table based on the BatchID and apply the UPDATE’s to those instead but the exact same behaviour occurred)
WITH CTE AS (
SELECT *
FROM dbo.stgTable
WHERE BatchID = #BatchID
)
UPDATE CTE...
Or maybe the problem is that multiple SP’s are calling the same function at the same time and that is why one or more of them are failing (I don’t see why that would be a problem though?)
Any suggestions would be very gratefully received – I’ve been playing around with this all week and I can’t for the life of me determine precisely what the problem might be!
Update to include sample service code
This is the code in the service class where the threading is initiated
For Each ItemInScope In ScopedItems
With ItemInScope
_batches(_batchCount) = New Batch(.Parameter1, .Parameter2, .ParameterX)
With _batches(_batchCount)
If .Initiate() Then
_doneEvents(_batchCount) = New ManualResetEvent(False)
Dim _batchWriter = New BatchWriter(_batches(_batchCount), _doneEvents(_batchCount))
ThreadPool.QueueUserWorkItem(AddressOf _batchWriter.ThreadPoolCallBack, _batchCount)
Else
_doneEvents(_batchCount) = New ManualResetEvent(True)
End If
End With
End With
_batchCount += 1
Next
WaitHandle.WaitAll(_doneEvents)
Here is the BatchWriter class
Public Class BatchWriter
Private _batch As Batch
Private _doneEvent As ManualResetEvent
Public Sub New(ByRef batch As Batch, ByVal doneEvent As ManualResetEvent)
_batch = batch
_doneEvent = doneEvent
End Sub
Public Sub ThreadPoolCallBack(ByVal threadContext As Object)
Dim threadIndex As Integer = CType(threadContext, Integer)
With _batch
If .PrepareBatch() Then
If .WriteTextOutput() Then
.ProcessBatch()
End If
End If
End With
_doneEvent.Set()
End Sub
End Class
The PrepareBatch and WriteTextOutput functions of the Batch class are entirely contained within the service application - it is only the ProcessBatch function where the service starts to interact with the database (via Entity Framework)
Here is that function
Public Sub ProcessScan()
' Confirm that a file is ready for import
If My.Computer.FileSystem.FileExists(_filePath) Then
Dim dbModel As New DatabaseModel
With dbModel
' Pass the batch to the staging table in the database
If .StageBatch(_batchID, _filePath) Then
' First update (results recorded for event log)
If .UpdateOne(_batchID) Then
_stepOneUpdates = .RetUpdates.Value
' Second update (results recorded for event log)
If .UpdateTwo(_batchID) Then
_stepTwoUpdates = .RetUpdates.Value
' Third update (results recorded for event log)
If .UpdateThree(_batchID) Then
_stepThreeUpdates = .RetUpdates.Value
....
End Sub

SQL CLR Trigger - get source table

I am creating a DB synchronization engine using SQL CLR Triggers in Microsoft SQL Server 2012. These triggers do not call a stored procedure or function (and thereby have access to the INSERTED and DELETED pseudo-tables but do not have access to the ##procid).
Differences here, for reference.
This "sync engine" uses mapping tables to determine what the table and field maps are for this sync job. In order to determine the target table and fields (from my mapping table) I need to get the source table name from the trigger itself. I have come across many answers on Stack Overflow and other sites that say that this isn't possible. But, I've found one website that provides a clue:
Potential Solution:
using (SqlConnection lConnection = new SqlConnection(#"context connection=true")) {
SqlCommand cmd = new SqlCommand("SELECT object_name(resource_associated_entity_id) FROM sys.dm_tran_locks WHERE request_session_id = ##spid and resource_type = 'OBJECT'", lConnection);
cmd.CommandType = CommandType.Text;
var obj = cmd.ExecuteScalar();
}
This does in fact return the correct table name.
Question:
My question is, how reliable is this potential solution? Is the ##spid actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id? Will it stand up to multiple executions of the same and/or different triggers within the database?
From these sites, it seems the process Id is in fact limited to the open connection, which doesn't overlap: here, here, and here.
Will this be a safe method to get my source table?
Why?
As I've noticed similar questions, but all without a valid answer for my specific situation (except that one). Most of the comments on those sites ask "Why?", and in order to preempt that, here is why:
This synchronization engine operates on a single DB and can push changes to target tables, transforming the data with user-defined transformations, automatic source-to-target type casting and parsing and can even use the CSharpCodeProvider to execute methods also stored in those mapping tables for transforming data. It is already built, quite robust and has good performance metrics for what we are doing. I'm now trying to build it out to allow for 1:n table changes (including extension tables requiring the same Id as the 'master' table) and am trying to "genericise" the code. Previously each trigger had a "target table" definition hard coded in it and I was using my mapping tables to determine the source. Now I'd like to get the source table and use my mapping tables to determine all the target tables. This is used in a medium-load environment and pushes changes to a "Change Order Book" which a separate server process picks up to finish the CRUD operation.
Edit
As mentioned in the comments, the query listed above is quite "iffy". It will often (after a SQL Server restart, for example) return system objects like syscolpars or sysidxstats. But, it seems that in the dm_tran_locks table there's always an associated resource_type of 'RID' (Row ID) with the same object_name. My current query which works reliably so far is the following (will update if this changes or doesn't work under high load testing):
select t1.ObjectName FROM (
SELECT object_name(resource_associated_entity_id) as ObjectName
FROM sys.dm_tran_locks WHERE resource_type = 'OBJECT' and request_session_id = ##spid
) t1 inner join (
SELECT OBJECT_NAME(partitions.OBJECT_ID) as ObjectName
FROM sys.dm_tran_locks
INNER JOIN sys.partitions ON partitions.hobt_id = dm_tran_locks.resource_associated_entity_id
WHERE resource_type = 'RID'
) t2 on t1.ObjectName = t2.ObjectName
If this is always the case, I'll have to find that out during testing.
How reliable is this potential solution?
While I do not have time to set up a test case to show it not working, I find this approach (even taking into account the query in the Edit section) "iffy" (i.e. not guaranteed to always be reliable).
The main concerns are:
cascading (whether recursive or not) Trigger executions
User (i.e. Explicit / Implicit) transactions
Sub-processes (i.e. EXEC and sp_executesql)
These scenarios allow for multiple objects to be locked, all at the same time.
Is the ##SPID actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id?
and (from a comment on the question):
I think I can join my query up with the sys.partitions and get a dm_trans_lock that has a type of 'RID' with an object name that will match up to the one in my original query.
And here is why it shouldn't be entirely reliable: the Session ID (i.e. ##SPID) is constant for all of the requests on that Connection). So all sub-processes (i.e. EXEC calls, sp_executesql, Triggers, etc) will all be on the same ##SPID / session_id. So, between sub-processes and User Transactions, you can very easily get locks on multiple resources, all on the same Session ID.
The reason I say "resources" instead of "OBJECT" or even "RID" is that locks can occur on: rows, pages, keys, tables, schemas, stored procedures, the database itself, etc. More than one thing can be considered an "OBJECT", and it is possible that you will have page locks instead of row locks.
Will it stand up to multiple executions of the same and/or different triggers within the database?
As long as these executions occur in different Sessions, then they are a non-issue.
ALL THAT BEING SAID, I can see where simple testing would show that your current method is reliable. However, it should also be easy enough to add more detailed tests that include an explicit transaction that first does some DML on another table, or have a trigger on one table do some DML on one of these tables, etc.
Unfortunately, there is no built-in mechanism that provides the same functionality that ##PROCID does for T-SQL Triggers. I have come up with a scheme that should allow for getting the parent table for a SQLCLR Trigger (that takes into account these various issues), but haven't had a chance to test it out. It requires using a T-SQL trigger, set as the "first" trigger, to set info that can be discovered by the SQLCLR Trigger.
A simpler form can be constructed using CONTEXT_INFO, if you are not already using it for something else (and if you don't already have a "first" Trigger set). In this approach you would still create a T-SQL Trigger, and then set it as the "first" Trigger using sp_settriggerorder. In this Trigger you SET CONTEXT_INFO to the table name that is the parent of ##PROCID. You can then read CONTEXT_INFO() on a Context Connection in a SQLCLR Trigger. If there are multiple levels of Triggers then the value of CONTEXT INFO will get overwritten, so reading that value must be the first thing you do in each SQLCLR Trigger.
This is an old thread, but it is an FAQ and I think I have a better solution. Essentially it uses the schema of the inserted or deleted table to find the base table by doing a hash of the column names and comparing the hash with the hashes of tables with a CLR trigger on them.
Code snippet below - at some point I will probably put the whole solution on Git (it sends a message to Azure Service Bus when the trigger fires).
private const string colqry = "select top 1 * from inserted union all select top 1 * from deleted";
private const string hashqry = "WITH cols as ( "+
"select top 100000 c.object_id, column_id, c.[name] "+
"from sys.columns c "+
"JOIN sys.objects ot on (c.object_id= ot.parent_object_id and ot.type= 'TA') " +
"order by c.object_id, column_id ) "+
"SELECT s.[name] + '.' + o.[name] as 'TableName', CONVERT(NCHAR(32), HASHBYTES('MD5',STRING_AGG(CONVERT(NCHAR(32), HASHBYTES('MD5', cols.[name]), 2), '|')),2) as 'MD5Hash' " +
"FROM cols "+
"JOIN sys.objects o on (cols.object_id= o.object_id) "+
"JOIN sys.schemas s on (o.schema_id= s.schema_id) "+
"WHERE o.is_ms_shipped = 0 "+
"GROUP BY s.[name], o.[name]";
public static void trgSendSBMsg()
{
string table = "";
SqlCommand cmd;
SqlDataReader rdr;
SqlTriggerContext trigContxt = SqlContext.TriggerContext;
SqlPipe p = SqlContext.Pipe;
using (SqlConnection con = new SqlConnection("context connection=true"))
{
try
{
con.Open();
string tblhash = "";
using (cmd = new SqlCommand(colqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
if (rdr.Read())
{
MD5 hash = MD5.Create();
StringBuilder hashstr = new StringBuilder(250);
for (int i=0; i < rdr.FieldCount; i++)
{
if (i > 0) hashstr.Append("|");
hashstr.Append(GetMD5Hash(hash, rdr.GetName(i)));
}
tblhash = GetMD5Hash(hash, hashstr.ToString().ToUpper()).ToUpper();
}
rdr.Close();
}
}
using (cmd = new SqlCommand(hashqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
while (rdr.Read())
{
string hash = rdr.GetString(1).ToUpper();
if (hash == tblhash)
{
table = rdr.GetString(0);
break;
}
}
rdr.Close();
}
}
if (table.Length == 0)
{
p.Send("Error: Unable to find table that CLR trigger is on. Message not sent!");
return;
}
….
HTH

Is it possible to do data paging in SQL Server (2012) in constant time?

I've been researching a solution to a performance problem with a system I'm responsible for, and I think at least part of the problem is due to database query performance. We use stored procedures to query "pages" of data in a pretty standard way. However, this paging appears to be more costly when the datasets get large.
Given this simple table populated with sample data:
create table Data (
Value uniqueidentifier not null,
constraint PK_Data primary key clustered (Value)
)
insert into Data
-- SeedTable has ~2M rows
select newid() from SeedTable
And this stored procedure to return paged data: (this requires Sql2012 apparently, though the Sql2008 style of using ROW_NUMBER() behaves the same):
create proc
GetDataPage #Offset int, #Count int
as
select Value
from Data
order by Value
offset #Offset rows
fetch next #Count rows only
I then test the performance of this sproc with this C# code:
const int PageSize = 50;
const int MaxCount = 50000;
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=TestDB;Integrated Security=true;")) {
conn.Open();
int a = 0;
for (int i = 0; ; i += PageSize) {
using (var cmd = conn.CreateCommand()) {
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.CommandText = "GetDataPage";
var oid = cmd.CreateParameter();
var offset = cmd.CreateParameter();
offset.Value = i;
offset.ParameterName = "Offset";
cmd.Parameters.Add(offset);
var count = cmd.CreateParameter();
count.Value = PageSize;
count.ParameterName = "Count";
cmd.Parameters.Add(count);
var sw = Stopwatch.StartNew();
int c = 0;
using(var reader = cmd.ExecuteReader()) {
while (reader.Read()) {
c++;
}
}
a += c;
sw.Stop();
Console.WriteLine(sw.ElapsedTicks + "\t" + a);
if (c < PageSize || a >= MaxCount)
break;
}
}
}
When I chart the output of this code I get the following:
I would have expected that paging like this in SQL would have constant time performance, or perhaps logarithmic at worst, but it is pretty clear from the chart that performance is linear.
Are there any special tricks (hints) to make this work better?
Is there another approach to this that might be faster?
Do other databases behave the same way?
Changing the experimental code to use the "page from" technique, that Kevin Suchlicki suggests, results in the following:
Very impressive. This performance looks more like what I would expect/want. Now I just need to figure out if I can apply this to my real problem. The potential issue being that it doesn't allow for "random access" of the data, but a forward only cursor-like access. I'm aware that it must look like what I'm doing violates every notion of good database design.
The most obvious possibility is in the app design itself. Offer your users filter criteria. Users usually have some idea what they are looking for and would rather not page thru 1000 pages of returned results. How often do you pass page 10 on a google search?
Having said that, you could try storing the id (clustered index value) of the last row returned on the previous page and use that in your SQL where clause. If you need to allow sorting on different keys (e.g. last name), then store the clustered index id value and the final last name of the previous page. Then write your SQL like this (you always need to order on your key field and clustered id value in order to deterministically order the records in the case of duplicate key values):
select top (#count) Id, LastName, FirstName
from Data
where LastName >= #previousLastName and Id > #previousId
order by LastName, Id
You would also want to index all the fields that could be sort keys. Not sure how the above would perform but I would expect the search on indexed fields would perform O(log n).
Another option might be to persist the full list, in order, with row value, every time the source data changes, behind the scenes, and have the app pull from the persisted table.
Good question... Let us know how it turns out please!

Correct method of deleting over 2100 rows (by ID) with Dapper

I am trying to use Dapper support my data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK value of the rows to delete. So I want to do a delete from myTable where Id in #listToDelete
Problem is that if my server goes down for even 6 mintues, then I have over 2100 rows to delete.
Since Dapper takes my #listToDelete and turns each one into a parameter, my call to delete fails. (Causing my data purging to get even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at Tabled Valued Parameters but from what I can see, they are not very performant. This piece of my architecture is the bottle neck of my system and I need to be very very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelte (the table will also be automatically dropped if you close the session)
(Assuming Dapper) code example:
conn.Open();
var columnName = "ID";
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));
using (var bulkCopy = new SqlBulkCopy(conn))
{
bulkCopy.BatchSize = ids.Count;
bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);
var table = new DataTable();
table.Columns.Add(columnName, typeof (int));
bulkCopy.ColumnMappings.Add(columnName, columnName);
foreach (var id in ids)
{
table.Rows.Add(id);
}
bulkCopy.WriteToServer(table);
}
//or do other things with your table instead of deleting here
conn.Execute(string.Format(#"DELETE FROM myTable where Id IN
(SELECT {0} FROM #{0}s", columnName));
conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
To get this code working, I went dark side.
Since Dapper makes my list into parameters. And SQL Server can't handle a lot of parameters. (I have never needed even double digit parameters before). I had to go with Dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs the their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper request the List of object having parameter as a property so in above case a list of object having Id as property will work.
connection.Execute("delete from myTable where Id in (#Id)", listOfIds.AsEnumerable().Select(i=> new { Id = i }).ToList());
This will work.

Resources