I'm making frequent inserts and updates in large batches from C# code, and I need to do it as fast as possible. Please help me find all the ways to speed up this process.
Build command text using StringBuilder, separate statements with ;
Don't use String.Format or StringBuilder.AppendFormat; it's slower than multiple StringBuilder.Append calls
Reuse SqlCommand and SqlConnection
Don't use SqlParameters (the 2,100-parameter limit per command caps the batch size)
Use the INSERT INTO table VALUES (...), (...), (...) syntax (up to 1000 rows per statement; see the sketch after this list)
Use as few indexes and constraints as possible
Use simple recovery model if possible
?
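For item 6, here is a minimal sketch of what I mean (the table and column names are just placeholders):

INSERT INTO dbo.TargetTable (Col1, Col2)
VALUES (1, 'a'),
       (2, 'b'),
       (3, 'c');  -- up to 1000 row constructors per INSERT statement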
Here are some questions to help update the list above:
What is the optimal number of statements per command (per ExecuteNonQuery() call)?
Is it good to have inserts and updates in the same batch, or is it better to execute them separately?
My data is received over TCP, so please don't suggest any BULK INSERT commands that involve reading data from a file or an external table.
The ratio of insert to update statements is about 10:3.
Use table-valued parameters. They can scale really well when using large numbers of rows, and you can get performance that approaches BCP level. I blogged about a way of making that process pretty simple from the C# side here. Everything else you need to know is on the MSDN site here. You will get far better performance doing things this way rather than making little tweaks around normal SQL batches.
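For reference, here is a minimal sketch of the SQL side of a table-valued parameter: a user-defined table type plus a procedure that accepts it (the names below are hypothetical). On the C# side you then pass a DataTable or DbDataReader as a SqlParameter with SqlDbType.Structured.

CREATE TYPE dbo.RowBatch AS TABLE
(
    Id    int           NOT NULL,
    Value nvarchar(100) NOT NULL
);
GO

CREATE PROCEDURE dbo.InsertRows
    @Rows dbo.RowBatch READONLY   -- TVP parameters must be declared READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.TargetTable (Id, Value)
    SELECT Id, Value
    FROM @Rows;
END
GO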
As of SQL Server 2008, table-valued parameters are the way to go. See this article (step four):
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
I combined this with parallelizing the insertion process. I think that helped as well, but I would have to check ;-)
Use SqlBulkCopy into a temp table and then use the MERGE SQL command to merge the data.
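A minimal sketch of the second step, assuming the rows have already been bulk-copied into a temp table #Staging with the same columns as the target (all names here are hypothetical):

MERGE dbo.TargetTable AS t
USING #Staging AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Value = s.Value
WHEN NOT MATCHED THEN
    INSERT (Id, Value)
    VALUES (s.Id, s.Value);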
I have a performance question. I have two tables in Cassandra, both with exactly the same structure, and I need to save incoming data to both of them. The problem is deciding which would be the better solution:
Create two repositories, each of which opens a Cassandra session, and save the data to both tables separately (all in code).
Save the data to one table and have a trigger on that table copy the incoming data to the other one.
Any other solution?
I think the first two are OK, but I am not sure whether the first one is good enough. Can someone explain it to me?
This sounds like a good use case for BATCH. Essentially, you can assemble two write statements and execute them in a BATCH to ensure atomicity. That should keep the two tables in sync. The example below is from the DataStax docs (URL).
cqlsh> BEGIN LOGGED BATCH
INSERT INTO cycling.cyclist_names (cyclist_name, race_id) VALUES ('Vera ADRIAN', 100);
INSERT INTO cycling.cyclist_by_id (race_id, cyclist_name) VALUES (100, 'Vera ADRIAN');
APPLY BATCH;
+1 to Aaron's response about using BATCH statements but the quoted example is specific to CQL. It's a bit more nuanced when implementing it in your app.
If you're using the Java driver, a typical INSERT statement would look like:
SimpleStatement simpleInsertUser = SimpleStatement.newInstance(
    "INSERT INTO users (...) VALUES (...)");
And here's a prepared statement:
PreparedStatement psInsertUserByMobile = session.prepare(
    "INSERT INTO users_by_mobile (...) VALUES (...)");
If you were to batch these 2 statements:
BatchStatement batch = BatchStatement.newInstance(
    DefaultBatchType.LOGGED,
    simpleInsertUser,
    psInsertUserByMobile.bind(...));
session.execute(batch);
For item 2 in your list, I don't know of any companies that use Cassandra TRIGGERs in production. They were experimental for a while, and I don't have enough experience with them to recommend them for production use.
For item 3, this is the use case that Materialized Views are trying to solve. They are certainly a lot simpler from a dev point of view, since the table updates are done server-side instead of client-side.
They are OK to use if you don't have a lot of tables, but be aware that the updates to the views happen asynchronously (not at the same time as the mutations on the base table). There is also a risk that the views get so out of sync with the base table that the only solution is to drop and recreate the MV.
If you prefer not to use BATCH statements, just make sure you're fully aware of the tradeoffs with using MVs. If you're interested, I've explained it in a bit more detail in https://community.datastax.com/articles/2774/. Cheers!
[Screenshot: One Destination - All Merge Join Rows]
[Screenshot: Two Destinations - Fewer Merge Join Rows]
Can anyone please explain this behavior for me?
I am generating a count field and then feeding it back into the main stream with the Merge Join, then performing a Conditional Split based on the count. It works fine without the update statement, but I get different results when I run it with an UPDATE statement from the Conditional Split. It may also be worth mentioning that there are no NULLs in the data, and both pictures come from the same file. Any thoughts are appreciated. Thanks.
While the OLE DB Command hasn't finished executing the current batch of rows, its preceding component (the Conditional Split) will not send more rows until it finishes processing, and so on up the chain. It also depends on the data flow's DefaultBufferSize and DefaultBufferMaxRows settings.
Read more on Data Flow Performance Features
I figured I'd update with what I learned. It appears that the issue with how many rows were loaded (49430 versus 52220) was due to the DefaultBufferSize and DefaultBufferMaxRows settings in SSIS. Changing them did not improve performance, only how many records were loaded into memory.
As Martin suggested above, the delay in processing the update was due to the inefficiency of running the update row by row. For anyone wanting to know what a staging table is: it's just a generic term for a table you create in your database (or create from SSIS with a SQL command task), load the data into, and then update the real table from by running a single UPDATE statement from a SQL task. You can drop the staging table in an SSIS task after the update if you want. I cannot overstate how much of a performance increase this gives you for large updates.
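A minimal sketch of that set-based update from a staging table (table and column names are hypothetical):

UPDATE t
SET    t.SomeColumn = s.SomeColumn
FROM   dbo.TargetTable  AS t
JOIN   dbo.StagingTable AS s
    ON s.BusinessKey = t.BusinessKey;

DROP TABLE dbo.StagingTable;  -- optional cleanup once the update is done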
PerformanceReview
prID
reviewDate
passed
notes
successStrategy
empID
nextReviewDate
Above is the table I am working with. My goal is to check whether nextReviewDate is within 7 days of the current date (I will do this using DATEDIFF()) and send an email to a specified email address if this condition is true.
My question is: how do I make my SQL job perform this task for each performance review row in the table? I have researched and found information saying that CURSORs and WHILE loops are slow and inefficient for this task. Any help is appreciated, as I am in the final stage of development :)
If you are in a SQL Server context and you want to send the mails using sp_send_dbmail, using a CURSOR to loop through the rows and call sp_send_dbmail is just fine. It may not be the fastest approach, but in this case that won't matter all that much; you are not looking to shave off milliseconds in this sort of process.
It would be a lot more of a hassle to formulate a set-based approach, which would involve building a dynamic SQL statement containing all the sp_send_dbmail calls in one batch, and the gain would be marginal.
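A minimal sketch of the cursor approach, assuming Database Mail is set up with a profile named 'DefaultProfile' and a fixed recipient address (both the profile name and the address are assumptions, not taken from your schema):

DECLARE @prID int, @empID int, @nextReviewDate date;

DECLARE review_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT prID, empID, nextReviewDate
    FROM dbo.PerformanceReview
    WHERE DATEDIFF(day, GETDATE(), nextReviewDate) BETWEEN 0 AND 7;

OPEN review_cursor;
FETCH NEXT FROM review_cursor INTO @prID, @empID, @nextReviewDate;

WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DefaultProfile',       -- assumed profile name
        @recipients   = 'reviews@example.com',  -- assumed recipient
        @subject      = 'Performance review due within 7 days',
        @body         = 'A performance review is coming up within the next 7 days.';

    FETCH NEXT FROM review_cursor INTO @prID, @empID, @nextReviewDate;
END

CLOSE review_cursor;
DEALLOCATE review_cursor;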
Apologies if I am irritating the forum with a repetitive question; I couldn't find the right solution here, hence posting it.
I need to quickly fetch 129,991,763 rows into a cursor, temp table, or staging table and process them into another table, and the destination table is also huge.
Currently I am using an INSERT ... SELECT statement (the SELECT is nested 4 levels deep) with hints like OPTION (FAST 1000), MAXDOP 1, RECOMPILE, etc.
The procedure consumes a lot of time and shows no results, or never completes at all.
Previously I used a cursor with the same hints, but as it also ran for more than 22 hours, I switched to INSERT ... SELECT.
In both cases I literally had to stop the execution.
To be honest, I am a beginner with SQL Server.
Even if I specifically filter the records in the SELECT based on criteria, the process still needs to be broken into 4 or 5 chunks, and those chunks are also taking more than 4-5 hours to complete.
Please help.
Thanks
Pradyumna
In the past I've used BULK INSERT with reasonable success, but I suspect the suggestion of breaking it into chunks and dropping indexes would still be wise. You can find some details on it here
https://msdn.microsoft.com/en-GB/library/ms188365.aspx
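For the chunking idea, here is a minimal sketch in T-SQL, assuming the source table has an increasing integer key column named id (all table and column names here are hypothetical):

DECLARE @batchSize bigint = 1000000;
DECLARE @lastId    bigint, @maxId bigint;

SELECT @lastId = MIN(id) - 1, @maxId = MAX(id) FROM dbo.SourceTable;

WHILE @lastId < @maxId
BEGIN
    -- Copy one key range at a time so each transaction stays small,
    -- ideally with indexes on the destination dropped or disabled first.
    INSERT INTO dbo.DestTable (id, col1, col2)
    SELECT id, col1, col2
    FROM dbo.SourceTable
    WHERE id >  @lastId
      AND id <= @lastId + @batchSize;

    SET @lastId = @lastId + @batchSize;
END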
Hope it helps, good luck.
Apologies, you will probably be best off using an SSIS package to pull it across. With this you can also transform the data if needed. I would still recommend keeping indexes off the table you are inserting the data into where possible. You'll need to do a bit of reading, as it's hard to explain on here because of the GUI involved.
Good luck
I have two Oracle tables, an old one and a new one.
The old one was poorly designed (more so than mine, mind you) but there is a lot of current data that needs to be migrated into the new table that I created.
The new table has new columns, different columns.
I thought of just writing a PHP script or something with a whole bunch of string replacement... clearly that's a stupid way to do it though.
I would really like to be able to clean up the data a bit along the way as well. Some of it was stored with markup in it (e.g. "First Name"), lots of blank space, etc., so I would really like to fix all that before putting it into the new table.
Does anyone have any experience doing something like this? What should I do?
Thanks :)
I do this quite a bit - you can migrate with a simple SELECT statement:
create table newtable as
select
    field1,
    trim(oldfield2) as field3,
    cast(field3 as number(6)) as field4,
    (select pk from lookuptable where value = field5) as field5
    -- etc.
from oldtable;
There's really very little you could do in an intermediate language like PHP that you can't do in native SQL when it comes to cleaning and transforming data.
For more complex cleanup, you can always create a SQL function that does the heavy lifting, but I have cleaned up some pretty horrible data without resorting to that. Don't forget that in Oracle you have DECODE, CASE statements, etc.
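As a minimal sketch of such a helper function, assuming the cleanup needed is stripping simple markup and trimming whitespace (the function name and the exact rules are hypothetical):

-- Strips anything that looks like an HTML/markup tag and trims blanks.
create or replace function clean_text(p_value in varchar2) return varchar2
    deterministic
is
begin
    return trim(regexp_replace(p_value, '<[^>]+>', ''));
end clean_text;
/

-- Then use it in the migration query, e.g.: clean_text(oldfield2) as field2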
I'd check out an ETL tool like Pentaho Kettle. You'll be able to query the data from the old table, transform and clean it up, and re-insert it into the new table, all with a nice WYSIWYG tool.
Here's a previous question I answered regarding data migration and manipulation with Kettle:
Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?
If the data volumes aren't massive and if you are only going to do this once, then it will be hard to beat a roll-it-yourself program. Especially if you have some custom logic you need implemented.
The time taken to download, learn, and use a tool (such as Pentaho, etc.) will probably not be worth your while.
Coding a SELECT *, updating the columns in memory, and doing an INSERT can be done quickly in PHP or any other programming language.
That being said, if you find yourself doing this often, then an ETL tool might be worth learning.
I'm working on a similar project myself - migrating data from one model containing a couple of dozen tables to a somewhat different model of similar number of tables.
I've taken the approach of creating a MERGE statement for each target table. The source query gets all the data it needs, formats it as required, then the merge works out if the row already exists and updates/inserts as required. This way, I can run the statement multiple times as I develop the solution.
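A minimal sketch of that pattern (the table and column names are hypothetical):

merge into newtable t
using (
    select old_id,
           trim(old_name) as name   -- the source query does the formatting/cleanup
    from   oldtable
) s
on (t.id = s.old_id)
when matched then
    update set t.name = s.name
when not matched then
    insert (id, name)
    values (s.old_id, s.name);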
Depends on how complex the conversion process is. If it is easy enough to express in a single SQL statement, you're all set; just create the SELECT statement and then do the CREATE TABLE / INSERT statement. However, if you need to perform some complex transformation or (shudder) split or merge any of the rows to convert them properly, you should use a pipelined table function. It doesn't sound like that is the case, though; try to stick to the single statement as the other Chris suggested above. You definitely do not want to pull the data out of the database to do the transform as the transfer in and out of Oracle will always be slower than keeping it all in the database.
A couple more tips:
If the table already exists and you are doing an INSERT ... SELECT statement, use the /*+ APPEND */ hint on the insert so that you are doing a bulk (direct-path) operation. Note that CREATE TABLE ... AS SELECT does this by default (as long as it's possible; you cannot perform bulk operations under certain conditions, e.g. if the new table is an index-organized table, has triggers, etc.). See the combined sketch after these two tips.
If you are on 10.2 or later, you should also consider using the LOG ERRORS INTO clause to log rejected records to an error table. That way, you won't lose the whole operation if one record has an error you didn't expect.
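A minimal sketch combining both tips, assuming the target is the newtable from above and using DBMS_ERRLOG to create the error table (the error-log tag is made up):

-- Create an error log table for NEWTABLE (named ERR$_NEWTABLE by default).
exec dbms_errlog.create_error_log('NEWTABLE')

insert /*+ APPEND */ into newtable (id, name)
select old_id, trim(old_name)
from   oldtable
log errors into err$_newtable ('migration run 1') reject limit unlimited;

commit;  -- a direct-path insert must be committed before the table is queried again in the same session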