(Submitting for a Snowflake User, hoping to receive additional assistance)
Is there a faster way to insert rows into a table from a stored procedure?
I started building a usp to insert a million or so rows of test data into a table for load testing.
I got to the stage shown below and set the iteration value to 10,000.
It took over 10 minutes to iterate 10,000 times, inserting a single integer into the table on each iteration.
Yes, I am using an XS warehouse, but even if this is increased to the maximum size, this is way too slow to be of any use.
--build a test table
CREATE OR REPLACE TABLE myTable
(
myInt NUMERIC(18,0)
);
--testing a js usp using a while statement with the intention to insert multiple rows into a table (Millions) for load testing
CREATE OR REPLACE PROCEDURE usp_LoadTable_test()
RETURNS float
LANGUAGE javascript
EXECUTE AS OWNER
AS
$$
//set the number of iterations
var maxLoops = 10;
//set the row Pointer
var rowPointer = 1;
//set the Insert sql statement
var sql_insert = 'INSERT INTO myTable VALUES(:1);';
//Insert the first value
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
//Loop through to insert all other values
while (rowPointer < maxLoops)
{
rowPointer += 1;
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
}
return rowPointer;
$$;
CALL usp_LoadTable_test();
So far, I've received the following recommendations:
Recommendation #1
One thing you can do is to use a "feeder table" containing 1000 or more rows instead of INSERT ... VALUES, eg:
INSERT INTO myTable SELECT <some transformation of columns> FROM "feeder table"
Recommendation #2
When you perform a million single row inserts, you consume one million micropartitions - each 16MB.
That 16 TB chunk of storage might be visible on your Snowflake bill ... Normal tables are retained for 7 days minimum after drop.
To optimize storage, you could define a clustering key and load the table in ascending order with each chunk filling up as much of a micropartition as possible.
Recommendation #3
Use data generation functions that work very fast if you need sequential integers: https://docs.snowflake.net/manuals/sql-reference/functions/seq1.html
Any other ideas?
This question was also asked at the Snowflake Lodge some weeks ago.
Given the answers you received, does the question still feel unanswered? If so, maybe hint at why.
If you just want a table with a single column of sequence numbers, use GENERATOR() as in #3 above. Otherwise, if you want more advice, share your specific requirements.
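The recommendations above share one idea: let the database produce all the rows in a single set-based statement instead of issuing one INSERT per loop iteration. A minimal sketch of that idea follows, using Python's built-in sqlite3 module as a stand-in because it runs anywhere. The recursive CTE plays the role that a row generator would play in Snowflake, and the helper name `load_sequence` is made up for illustration.

```python
import sqlite3


def load_sequence(conn, start, count):
    """Insert `count` consecutive integers, beginning at `start`, in one statement."""
    if count < 1:
        raise ValueError("count must be at least 1")
    conn.execute("CREATE TABLE IF NOT EXISTS myTable (myInt NUMERIC)")
    # The recursive CTE acts as the row generator: the database produces
    # every row itself, so there is no per-row round trip from the caller.
    conn.execute(
        """
        WITH RECURSIVE seq(n) AS (
            SELECT ?
            UNION ALL
            SELECT n + 1 FROM seq WHERE n < ?
        )
        INSERT INTO myTable (myInt) SELECT n FROM seq
        """,
        (start, start + count - 1),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM myTable").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows_loaded = load_sequence(conn, 1001, 10_000)
```

The same 10,000 values that the loop inserts one at a time (1001 upward) are written here by a single INSERT ... SELECT.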
Related
I have a trigger that gets the last number inserted (folio) for a given pair of columns (service_type and prefix) and increments it by one. Everything works fine, but when a lot of INSERT statements run at the same time, the function sometimes reads the same last value in several of them and duplicates the folio column.
CREATE OR REPLACE FUNCTION public.insert_folio()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
DECLARE incremental INTEGER;
BEGIN
SELECT max(f.folio) INTO incremental FROM "folios" f
WHERE f.prefix = NEW."prefix" AND service_type = NEW."service_type";
NEW."folio" = incremental + 1;
IF NEW."folio" IS NULL THEN NEW."folio" = 1;
END IF;
RETURN NEW;
END;
$function$;
CREATE TRIGGER insert_folio BEFORE INSERT ON "folios" FOR EACH ROW EXECUTE PROCEDURE insert_folio();
sample:
folio                 service_type  prefix
1                     DOCUMENT      DC
1                     IMAGE         IMG
1                     IMAGE         O
2                     IMAGE         O
2 (This should be 3)  IMAGE         O
Any ideas?
Thanks!
It is because you have concurrent transactions that both see the same data. The second transaction does not see the record inserted by the first transaction until the first commits.
To prevent this behaviour you will have to lock the whole table when inserting to prevent concurrent writes or use advisory locks to prevent concurrent insertions of the same service_type and prefix.
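To see why a lock scoped to one (service_type, prefix) pair is enough, here is a small in-process analogy in Python. It is not PostgreSQL code: a `threading.Lock` per key stands in for an advisory lock (in PostgreSQL the trigger would presumably take one with `pg_advisory_xact_lock` before reading the maximum), and the class `FolioAllocator` is hypothetical.

```python
import threading


class FolioAllocator:
    """In-process stand-in for 'one advisory lock per (service_type, prefix)'."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._per_key = {}  # (service_type, prefix) -> (lock, list of folios)

    def _slot(self, key):
        # Creating the per-key lock must itself be race-free.
        with self._registry_lock:
            if key not in self._per_key:
                self._per_key[key] = (threading.Lock(), [])
            return self._per_key[key]

    def insert(self, service_type, prefix):
        lock, folios = self._slot((service_type, prefix))
        # Read-max and write happen under the same lock, so two inserts for
        # the same key can never both observe the same maximum.
        with lock:
            folio = max(folios, default=0) + 1
            folios.append(folio)
            return folio

    def folios(self, service_type, prefix):
        lock, folios = self._slot((service_type, prefix))
        with lock:
            return list(folios)


allocator = FolioAllocator()


def worker():
    for _ in range(50):
        allocator.insert("IMAGE", "O")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Inserts for different keys never wait on each other, which is the advantage over locking the whole table.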
I have a program that downloads data from a server database to a client database. The server database has been growing recently.
The program has an option to download all data or only data for a specific time period (a selectable number of days back from today). If the user selects all, the program truncates the client database table and inserts all data using bulk copy. That part is OK.
But the problem arises when the user selects a specific time period (each record has a created date/time). The program has to compare the two tables and split the server records into two sets: records that already exist on the client and records that do not. What I'm going to do is:
insert the non-existing records directly into the client DB (I'm using bulk insert), and bulk-copy the existing records into a temporary table, then update the client's table from that temporary table. My actual problem occurs when splitting the server's table. This is how I did it:
updateTable = (From c In dt_from_server.AsEnumerable()
Join o In Dt_from_client.AsEnumerable()
On c.Field(Of String)("BARCODE").Trim() Equals o.Field(Of String)("BARCODE").Trim()
And c.Field(Of String)("ITEM_CODE").Trim() Equals o.Field(Of String)("ITEM_CODE").Trim()
Select c).CopyToDataTable()
insertTable = dt_server.AsEnumerable()
.Except(updateTable.AsEnumerable(), DataRowComparer.Default)
.CopyToDataTable()
(Normally there are over 1M records in the server table.)
When there are over 1 million records, the update part takes an acceptable time, around 10 minutes. (Yes, it takes 5 GB of RAM; in this case that is OK considering performance.)
But the insert part seems to take days just to assign insertTable (a DataTable). This is the issue.
The AsEnumerable().Except() part takes a long time, and I couldn't find a way to speed it up. I'm not sure I explained this correctly. Could anyone give me some advice?
Since you have commented that dt_from_server and dt_server are actually the same DataTable, you don't need to compare all values of all DataRows with each other, which is what DataRowComparer.Default does. You can use Except without the second (comparer) parameter; then only references are compared, which is much faster.
You also don't need the two CopyToDataTable calls, which create two additional big DataTables in memory. Process the rows one after the other instead.
Here is a different approach using Linq's left-outer join, which is more efficient:
Dim query = from rServ in dt_from_server.AsEnumerable()
group join rClient in Dt_from_client.AsEnumerable()
On New With{
Key .BarCode = rServ.Field(Of String)("BARCODE").Trim(),
Key .ItemCode = rServ.Field(Of String)("ITEM_CODE").Trim()
} Equals New With{
Key .BarCode = rClient.Field(Of String)("BARCODE").Trim(),
Key .ItemCode = rClient.Field(Of String)("ITEM_CODE").Trim()
} into Group
From client In Group.DefaultIfEmpty()
Select new With { .ServerRow = rServ, .InsertRow = client is Nothing }
Dim insertOrUpdateRows = query.ToLookup(Function(x) x.InsertRow, Function(x) x.ServerRow)
Dim insertRows = insertOrUpdateRows(true).CopyToDataTable() 'CopyToDataTable redundant if you process rows immediately now'
Dim updateRows = insertOrUpdateRows(false).CopyToDataTable() 'CopyToDataTable redundant if you process rows immediately now'
But in general the most scalable and efficient approach is not to load everything into memory at once and then process it all, but to use database paging (or a stored procedure) to process only part of it in memory at a time; otherwise it's likely that you will encounter an OutOfMemoryException sooner or later.
C# as requested:
var query = from rServ in dt_from_server.AsEnumerable()
join rClient in Dt_from_client.AsEnumerable()
on new { BarCode = rServ.Field<string>("BARCODE").Trim(), ItemCode = rServ.Field<string>("ITEM_CODE").Trim() }
equals new { BarCode = rClient.Field<string>("BARCODE").Trim(), ItemCode = rClient.Field<string>("ITEM_CODE").Trim() }
into clientGroup
from client in clientGroup.DefaultIfEmpty()
select new { ServerRow = rServ, InsertRow = client == null };
var insertOrUpdateRows = query.ToLookup(x => x.InsertRow, x => x.ServerRow);
var insertRows = insertOrUpdateRows[true].CopyToDataTable(); // CopyToDataTable redundant if you process rows immediately now
var updateRows = insertOrUpdateRows[false].CopyToDataTable(); // CopyToDataTable redundant if you process rows immediately now
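The underlying idea, independent of LINQ, is to build a hash set of the client keys once and then classify every server row with a constant-time lookup, rather than comparing rows pairwise. A rough Python sketch of that idea, with rows modelled as plain dicts and a hypothetical helper `split_rows`:

```python
def split_rows(server_rows, client_rows):
    """Return (insert_rows, update_rows) for server_rows, keyed on BARCODE + ITEM_CODE."""

    def key(row):
        return (row["BARCODE"].strip(), row["ITEM_CODE"].strip())

    # One pass over the client table builds the lookup set.
    client_keys = {key(row) for row in client_rows}

    insert_rows, update_rows = [], []
    # One pass over the server table classifies each row with an O(1) lookup.
    for row in server_rows:
        if key(row) in client_keys:
            update_rows.append(row)
        else:
            insert_rows.append(row)
    return insert_rows, update_rows


server = [
    {"BARCODE": "111 ", "ITEM_CODE": "A"},
    {"BARCODE": "222", "ITEM_CODE": "B"},
    {"BARCODE": "333", "ITEM_CODE": " C"},
]
client = [
    {"BARCODE": "111", "ITEM_CODE": "A "},
    {"BARCODE": "333", "ITEM_CODE": "C"},
]
to_insert, to_update = split_rows(server, client)
```

Unlike the join, a set lookup yields each server row exactly once even if the client table contains the same key more than once.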
I have a problem applying cached updates when the delta includes fields that have a UNIQUE constraint in the database. I have a database with the following DDL schema (an in-memory SQLite database can be used to reproduce it):
create table FOO
(
ID integer primary key,
DESC char(2) UNIQUE
);
The initial database table contains one record with ID = 1 and DESC = R1
Accessing this table with a TFDQuery (select * from FOO), if the following steps are performed, the generated delta is applied correctly by ApplyUpdates:
Update record ID = 1 to DESC = R2
Append a new record ID = 2 with DESC = R1
Delta includes the following:
R2
R1
No error is generated by ApplyUpdates, because the first operation in the delta is an update and the second is an insert. As record 1 is now R2, the insert can go ahead because the unique constraint is not violated within this transaction.
Now, performing the following steps generates exactly the same delta (look at the FDQuery.Delta property), but a UNIQUE constraint violation is raised.
Append a new temporary record ID = 2 with DESC = TT
Update the first record ID = 1 to DESC = R2
Update the temporary record 2 - TT to DESC = R1
Delta includes the following:
R2
R1
Note that FireDAC generates the same delta in both scenarios; this can be seen through the FDQuery's Delta property.
These steps can be used to reproduce the error:
File > New VCL Forms Application; drop an FDConnection and an FDQuery on the form; set the FDConnection to use the SQLite driver (with an in-memory database); drop two buttons on the form, one to reproduce the correct behavior and another to reproduce the error, as follows:
Button OK:
procedure TFrmMain.btnOkClick(Sender: TObject);
begin
// create the default database with a FOO table
con.Open();
con.ExecSQL('create table FOO' + '(ID integer primary key, DESC char(2) UNIQUE)');
// insert a default record
con.ExecSQL('insert into FOO values (1,''R1'')');
qry.CachedUpdates := true;
qry.Open('select * from FOO');
// update the first record to R2
qry.First();
qry.Edit();
qry.Fields[1].AsString := 'R2';
qry.Post();
// append the second record as R1
qry.Append();
qry.Fields[0].AsInteger := 2;
qry.Fields[1].AsString := 'R1';
qry.Post();
// apply will not generate a unique constraint violation
qry.ApplyUpdates();
end;
Button Error:
// create the default database with a FOO table
con.Open();
con.ExecSQL('create table FOO' + '(ID integer primary key, DESC char(2) UNIQUE)');
// insert a default record
con.ExecSQL('insert into FOO values (1,''R1'')');
qry.CachedUpdates := true;
qry.Open('select * from FOO');
// append a temporary record (TT)
qry.Append();
qry.Fields[0].AsInteger := 2;
qry.Fields[1].AsString := 'TT';
qry.Post();
// update R1 to R2
qry.First();
qry.Edit();
qry.Fields[1].AsString := 'R2';
qry.Post();
qry.Next();
// update TT to R1
qry.Edit();
qry.Fields[1].AsString := 'R1';
qry.Post();
// apply will generate a unique constraint violation
qry.ApplyUpdates();
Update: Since writing the original version of this answer, I've done some more investigation and am beginning to think that either there is a problem with ApplyUpdates, etc., in FireDAC's support for Sqlite (in Seattle, at least), or we are not using the FD components correctly. It would need FireDAC's author (who is a contributor here) to say which it is.
Leaving aside the ApplyUpdates business for a moment, there are a number of other problems with your code, namely your dataset navigation makes assumptions about the ordering on the rows in qry and the numbering of its Fields.
The test case I have used is to start (before execution of the application) with the Foo table containing the single row
(1, 'R1')
Then, I execute the following Delphi code, at the same time as monitoring the contents of Foo using an external application (the Sqlite Manager plug-in for FireFox). The code executes without an error being reported in the application, but notice that it does not call ApplyUpdates.
Con.Open();
Con.StartTransaction;
qry.Open('select * from FOO');
qry.InsertRecord([2, 'TT']);
assert(qry.Locate('ID', 1, []));
qry.Edit;
qry.FieldByName('DESC').AsString := 'R2';
qry.Post;
assert(qry.Locate('ID', 2, []));
qry.Edit;
qry.FieldByName('DESC').AsString := 'R1';
qry.Post;
Con.Commit;
qry.Close;
Con.Close;
The added row (ID = 2) is not visible to the external application until after Con.Close has executed, which I find puzzling. Once Con.Close has been called, the external application shows Foo as containing
(1, 'R2')
(2, 'R1')
However, I have been unable to avoid the constraint violation error if I call ApplyUpdates, regardless of any other changes I make to the code, including adding a call to ApplyUpdates after the first Post.
So, it seems to me that either the operation of ApplyUpdates is flawed or it is not being used correctly.
I mentioned FireDAC's author. His name is Dmitry Arefiev and he has answered a lot of FD qs on SO, though I haven't noticed him here in the past couple of months or so. You might try catching his attention by posting in EMBA's FireDAC NG forum, https://forums.embarcadero.com/forum.jspa?forumID=502.
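Independently of FireDAC, the effect of statement order on a UNIQUE constraint can be reproduced directly against SQLite. The sketch below uses Python's sqlite3 module and replays the same two net changes in both orders. It rests on an assumption that the thread does not confirm: that the cached changes of the failing scenario reach the database insert-first, because record 2 was the first one touched.

```python
import sqlite3


def replay(statements):
    """Run the statements against a fresh FOO table; return final rows or None on a UNIQUE violation."""
    conn = sqlite3.connect(":memory:")
    conn.execute('create table FOO (ID integer primary key, "DESC" char(2) UNIQUE)')
    conn.execute("insert into FOO values (1, 'R1')")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.commit()
        return conn.execute('select ID, "DESC" from FOO order by ID').fetchall()
    except sqlite3.IntegrityError:
        return None
    finally:
        conn.close()


update_row_1 = """update FOO set "DESC" = 'R2' where ID = 1"""
insert_row_2 = "insert into FOO values (2, 'R1')"

# Same net changes, two different orders.
update_first = replay([update_row_1, insert_row_2])
insert_first = replay([insert_row_2, update_row_1])
```

With the update first, 'R1' is free by the time row 2 is inserted; with the insert first, row 1 still holds 'R1' and the constraint fires.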
I've been researching a solution to a performance problem with a system I'm responsible for, and I think at least part of the problem is due to database query performance. We use stored procedures to query "pages" of data in a pretty standard way. However, this paging appears to be more costly when the datasets get large.
Given this simple table populated with sample data:
create table Data (
Value uniqueidentifier not null,
constraint PK_Data primary key clustered (Value)
)
insert into Data
-- SeedTable has ~2M rows
select newid() from SeedTable
And this stored procedure to return paged data: (this requires Sql2012 apparently, though the Sql2008 style of using ROW_NUMBER() behaves the same):
create proc GetDataPage @Offset int, @Count int
as
select Value
from Data
order by Value
offset @Offset rows
fetch next @Count rows only
I then test the performance of this sproc with this C# code:
const int PageSize = 50;
const int MaxCount = 50000;
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=TestDB;Integrated Security=true;")) {
conn.Open();
int a = 0;
for (int i = 0; ; i += PageSize) {
using (var cmd = conn.CreateCommand()) {
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.CommandText = "GetDataPage";
var oid = cmd.CreateParameter();
var offset = cmd.CreateParameter();
offset.Value = i;
offset.ParameterName = "Offset";
cmd.Parameters.Add(offset);
var count = cmd.CreateParameter();
count.Value = PageSize;
count.ParameterName = "Count";
cmd.Parameters.Add(count);
var sw = Stopwatch.StartNew();
int c = 0;
using(var reader = cmd.ExecuteReader()) {
while (reader.Read()) {
c++;
}
}
a += c;
sw.Stop();
Console.WriteLine(sw.ElapsedTicks + "\t" + a);
if (c < PageSize || a >= MaxCount)
break;
}
}
}
When I chart the output of this code I get the following:
I would have expected that paging like this in SQL would have constant time performance, or perhaps logarithmic at worst, but it is pretty clear from the chart that performance is linear.
Are there any special tricks (hints) to make this work better?
Is there another approach to this that might be faster?
Do other databases behave the same way?
Changing the experimental code to use the "page from" technique, that Kevin Suchlicki suggests, results in the following:
Very impressive. This performance looks more like what I would expect/want. Now I just need to figure out if I can apply this to my real problem. The potential issue being that it doesn't allow for "random access" of the data, but a forward only cursor-like access. I'm aware that it must look like what I'm doing violates every notion of good database design.
The most obvious possibility is in the app design itself. Offer your users filter criteria. Users usually have some idea what they are looking for and would rather not page thru 1000 pages of returned results. How often do you pass page 10 on a google search?
Having said that, you could try storing the id (clustered index value) of the last row returned on the previous page and use that in your SQL where clause. If you need to allow sorting on different keys (e.g. last name), then store the clustered index id value and the final last name of the previous page. Then write your SQL like this (you always need to order on your key field and clustered id value in order to deterministically order the records in the case of duplicate key values):
select top (@count) Id, LastName, FirstName
from Data
where LastName > @previousLastName
   or (LastName = @previousLastName and Id > @previousId)
order by LastName, Id
You would also want to index all the fields that could be sort keys. Not sure how the above would perform but I would expect the search on indexed fields would perform O(log n).
Another option might be to persist the full list, in order, with row value, every time the source data changes, behind the scenes, and have the app pull from the persisted table.
Good question... Let us know how it turns out please!
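The "page from the last key" technique described above can be tried out end to end with a small script. This is a sketch under assumptions: it uses Python's sqlite3 module rather than SQL Server, so LIMIT replaces TOP, and the table contents, index name and helper `fetch_page` are invented for the example.

```python
import sqlite3


def fetch_page(conn, page_size, after=None):
    """Return the next page ordered by (LastName, Id), starting after the `after` key."""
    if after is None:
        sql = "SELECT Id, LastName FROM Data ORDER BY LastName, Id LIMIT ?"
        params = (page_size,)
    else:
        prev_last_name, prev_id = after
        # Rows strictly after the previous page's last (LastName, Id) pair.
        sql = """
            SELECT Id, LastName FROM Data
            WHERE LastName > ? OR (LastName = ? AND Id > ?)
            ORDER BY LastName, Id
            LIMIT ?
        """
        params = (prev_last_name, prev_last_name, prev_id, page_size)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Id INTEGER PRIMARY KEY, LastName TEXT NOT NULL)")
conn.execute("CREATE INDEX IX_Data_LastName ON Data (LastName, Id)")
names = ["Smith", "Jones", "Smith", "Adams", "Jones", "Smith", "Brown"]
conn.executemany(
    "INSERT INTO Data (Id, LastName) VALUES (?, ?)",
    list(enumerate(names, start=1)),
)

pages, after = [], None
while True:
    page = fetch_page(conn, 3, after)
    if not page:
        break
    pages.append(page)
    last_id, last_name = page[-1]
    after = (last_name, last_id)  # remember where this page ended
```

Each call seeks to the stored key through the index instead of counting past all earlier rows, which is why the cost should not grow with the page number. The trade-off noted in the question remains: pages can only be visited in order.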
I am trying to use Dapper to support the data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK values of the rows to delete. So I want to do a delete from myTable where Id in @listToDelete
The problem is that if my server goes down for even 6 minutes, then I have over 2100 rows to delete.
Since Dapper takes my @listToDelete and turns each item into a parameter, my call to delete fails. (Causing my data purging to get even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at Table-Valued Parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system, and it needs to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be automatically dropped if you close the session)
(Assuming Dapper) code example:
conn.Open();
var columnName = "ID";
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));
using (var bulkCopy = new SqlBulkCopy(conn))
{
bulkCopy.BatchSize = ids.Count;
bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);
var table = new DataTable();
table.Columns.Add(columnName, typeof (int));
bulkCopy.ColumnMappings.Add(columnName, columnName);
foreach (var id in ids)
{
table.Rows.Add(id);
}
bulkCopy.WriteToServer(table);
}
//or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable where Id IN
(SELECT {0} FROM #{0}s)", columnName));
conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
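For readers who want to try the pattern without a SQL Server instance, the same four steps can be sketched with Python's sqlite3 module. This is only an analogy: a TEMP table stands in for #RowsToDelete, `executemany` stands in for the bulk load, and the helper name `delete_by_ids` is made up.

```python
import sqlite3


def delete_by_ids(conn, ids):
    """Delete rows of myTable whose Id is in `ids` by joining against a temp table."""
    # Step 1: a session-scoped staging table for the keys.
    conn.execute("CREATE TEMP TABLE RowsToDelete (ID INTEGER PRIMARY KEY)")
    try:
        # Step 2: load all keys in one batch (stand-in for a bulk load).
        conn.executemany("INSERT INTO RowsToDelete (ID) VALUES (?)", ((i,) for i in ids))
        # Step 3: one set-based delete, with no per-id parameters in the statement.
        cur = conn.execute("DELETE FROM myTable WHERE Id IN (SELECT ID FROM RowsToDelete)")
        deleted = cur.rowcount
        conn.commit()
        return deleted
    finally:
        # Step 4: clean up the staging table.
        conn.execute("DROP TABLE RowsToDelete")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO myTable (Id) VALUES (?)", ((i,) for i in range(1, 5001)))
conn.commit()

# 3000 ids: more than the 2100 rows that broke the parameterised delete.
deleted = delete_by_ids(conn, range(1, 3001))
```

The DELETE statement has the same shape whether the list holds ten ids or ten thousand, so the number of ids is no longer bounded by a parameter limit.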
To get this code working, I went dark side.
Dapper turns my list into parameters, and SQL Server can't handle that many parameters. (I have never needed even double-digit parameters before.) So I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper expects a list of objects that expose the parameter as a property, so in the case above a list of objects with an Id property will work. Dapper then executes the statement once for each element of the list.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i=> new { Id = i }).ToList());
This will work.