data migration takes a long time - sql-server

I have written a C# console app to migrate data.
The record counts are not high; each table has almost 100 hundred records, but the structure of the data and the business logic are complicated, with almost 200 tables.
The migration performs all types of actions: delete, update, insert, and select.
The delete and update operations are used only for data correction in the source database.
Right now the migration takes a very long time: almost three days or more!
Some actions I have taken for improvement:
1- First, set NOCHECK CONSTRAINT in the source database before the delete, update, and insert operations run.
2- Then add some indexes to speed up fetching data from the source database.
3- Disable all indexes and constraints in the destination database while inserting data (a sketch of this is shown below).
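For point 3, the per-table statements look roughly like this (dbo.Orders and IX_Orders_CustomerId are placeholder names):

ALTER TABLE dbo.Orders NOCHECK CONSTRAINT ALL;           -- stop validating FK/CHECK constraints during the load
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders DISABLE;  -- disable a nonclustered index (not the clustered index, or the table becomes unreadable)

-- ... run the delete/update/insert operations ...

ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;  -- rebuild the index once the data is loaded
ALTER TABLE dbo.Orders WITH CHECK CHECK CONSTRAINT ALL;  -- re-enable and re-validate the constraints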
Can anyone suggest a way to further reduce the duration?
It should be noted that at this phase of the project I cannot switch to another solution such as SSIS; I have to improve this console app.
I am using EF Core 2.2 with raw SQL queries to transfer the data.
Thanks a lot.

I had a very significant performance improvement by switching the recovery model from Full to Simple. Obviously, there are good reasons to use Full; but depending on the changes that your migration is doing, the performance improvement may be an order of magnitude!
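For example, something along these lines (the database name is a placeholder):

-- Switch to SIMPLE recovery for the duration of the migration (much less log overhead)
ALTER DATABASE MigrationTarget SET RECOVERY SIMPLE;

-- ... run the migration ...

-- Switch back to FULL afterwards and take a full backup to restart the log backup chain
ALTER DATABASE MigrationTarget SET RECOVERY FULL;
BACKUP DATABASE MigrationTarget TO DISK = N'D:\Backups\MigrationTarget.bak';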
Sorry to bring it up, but it is very difficult to understand what you are trying to say (I suppose it is due to poor English). Maybe run a grammar checker to improve clarity! Is your question about EF Migrations? Or are you using a custom query to read data from one DB and write it into another? It seems to be the latter; in that case you probably need to look at SQL Server Extended Events to identify poorly written queries before you start tuning the database instance! Nothing improves performance like tuning the SQL!

Related

improve database querying in ms sql

What's a fast way to query large amounts of data (between 10,000 and 100,000 rows; it will get bigger in the future, maybe 1,000,000+) spread across multiple tables (20+) that involves left joins and aggregate functions (SUM, MAX, COUNT, etc.)?
My solution would be to make one table that contains all the data I need and have triggers that update this table whenever one of the other tables gets updated. I know that triggers aren't really recommended, but this way I take the load off the querying. Or do one big update every night.
I've also tried views, but once they start involving left joins and calculations they're way too slow and time out.
Since your question is too general, here's a general answer...
The path you're taking right now is optimizing a single query/single issue. Sure, it might solve the issue you have right now, but it's usually not very good in the long run (not to mention the cumulative cost of maintenance of such a thing).
The common path to take is to create an 'analytics' database: a real-time copy of your production database that you query for all your reports. This analytics database can eventually grow into a full-blown DWH, but you're probably going to start with a simple real-time replication (or replicate nightly or whatever) and work from there.
As I said, the question/problem is too broad to be answered in a couple of paragraphs; these are only some of the guidelines.
I need a bit more detail, but I can already suggest this:
Use WITH (NOLOCK); this will slightly improve read speed, at the cost of allowing dirty reads.
Reference: Effect of NOLOCK hint in SELECT statements
Add indexes on the table fields you filter and join on so that data is fetched quickly. (A combined sketch of both suggestions follows the references.)
Reference: sql query to select millions record very fast
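A rough sketch of both suggestions together (all table, column, and index names are made up):

-- Index the columns you filter, join, and aggregate on
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_ProductId
    ON dbo.Sales (SaleDate, ProductId)
    INCLUDE (Amount);

-- NOLOCK avoids blocking but allows dirty reads; use it only where approximate results are acceptable
SELECT s.ProductId, SUM(s.Amount) AS Total
FROM dbo.Sales AS s WITH (NOLOCK)
WHERE s.SaleDate >= '20240101'
GROUP BY s.ProductId;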

How can I store temporary, self-destroying data into a t-sql table?

Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration date-time for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convinces me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that holds "volatile" data itself, in a way that it automagically removes data a certain amount of time after its insertion.
And last but not least, if there's no built-in way to do that, would I be able to implement this functionality in SQL Server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
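A rough sketch of the DELETE-plus-index approach (table and column names are made up); the Agent job, trigger, or insert procedure would run the DELETE at the end:

CREATE TABLE dbo.TempData
(
    Id         int IDENTITY(1,1) PRIMARY KEY,
    Payload    nvarchar(max) NOT NULL,
    InsertedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Index the insertion time so the cleanup DELETE doesn't scan the whole table
CREATE NONCLUSTERED INDEX IX_TempData_InsertedAt ON dbo.TempData (InsertedAt);

-- Cleanup: remove rows older than 30 minutes
DELETE FROM dbo.TempData
WHERE InsertedAt < DATEADD(MINUTE, -30, SYSUTCDATETIME());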
See @Yuriy's comment; that's relevant.
If you really need to implement it DB side....
TRUNCATE TABLE is a fast way to get rid of records.
If all you need is ONE table that you fill with data, use, and dispose of as soon as possible, you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario becomes more complicated if you are running concurrent threads/jobs and each is handling its own data.
If that data exists only for a single "job"/context, you can consider using #TEMP tables; they are a bit volatile and may be what you are looking for (see the sketch below).
You could also use table variables; they are even more volatile than temporary tables, but it depends on details you haven't posted, so I can't say which is really better.
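A quick sketch of both options (names are made up):

-- Local temporary table: visible only to the current session, dropped automatically when the session closes
CREATE TABLE #WorkItems (Id int PRIMARY KEY, Payload nvarchar(200));
INSERT INTO #WorkItems (Id, Payload) VALUES (1, N'example');

-- Table variable: scoped to the current batch/procedure, even shorter-lived
DECLARE @WorkItems TABLE (Id int PRIMARY KEY, Payload nvarchar(200));
INSERT INTO @WorkItems (Id, Payload) VALUES (1, N'example');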

Need recommendations on pushing the envelope with SqlBulkCopy on SQL Server

I am designing an application, one aspect of which is that it is supposed to be able to receive massive amounts of data into a SQL database. I designed the database structure as a single table with a bigint identity, something like this:
CREATE TABLE MainTable
(
    _id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    field1, field2, ...
)
I will omit how I intend to perform queries, since that is irrelevant to the question I have.
I have written a prototype which inserts data into this table using SqlBulkCopy. It seemed to work very well in the lab. I was able to insert tens of millions of records at a rate of ~3K records/sec (the full record itself is rather large, ~4K). Since the only index on this table is an autoincrementing bigint, I have not seen a slowdown even after a significant number of rows had been pushed.
Considering that the lab SQL Server was a virtual machine with a relatively weak configuration (4 GB RAM, disk subsystem shared with other VMs), I was expecting to get significantly better throughput on a physical machine, but it didn't happen; or let's say the performance increase was negligible. I could maybe get 25% faster inserts on the physical machine. Even after I configured a 3-drive RAID0, which performed 3 times faster than a single drive (measured by benchmarking software), I got no improvement. Basically: a faster drive subsystem, a dedicated physical CPU, and double the RAM almost didn't translate into any performance gain.
I then repeated the test using the biggest instance on Azure (8 cores, 16 GB), and I got the same result. So, adding more cores did not change insert speed.
At this time I have played around with the following software parameters without any significant performance gain:
Modifying the SqlBulkCopy.BatchSize parameter
Inserting from multiple threads simultaneously, and adjusting the number of threads
Using the table lock option on SqlBulkCopy
Eliminating network latency by inserting from a local process using the shared memory driver
I am trying to increase performance at least 2-3 times, and my original idea was that throwing more hardware at it would get things done, but so far it doesn't.
So, can someone recommend:
What resource could be suspected as the bottleneck here? How do I confirm it?
Is there a methodology I could try to get reliably scalable bulk insert improvement, considering there is a single SQL Server system?
UPDATE I am certain that the load app is not the problem. It creates records in a temporary queue in a separate thread, so when there is an insert it goes like this (simplified):
===> start logging time
int batchCount = (queue.Count - 1) / targetBatchSize + 1;
Enumerable.Range(0, batchCount).AsParallel()
    .WithDegreeOfParallelism(MAX_DEGREE_OF_PARALLELISM)
    .ForAll(i =>
    {
        // take the i-th slice of the queued records
        var batch = queue.Skip(i * targetBatchSize).Take(targetBatchSize);
        // convert the slice to a DataTable and bulk copy it
        var data = MYRECORDTYPE.MakeDataTable(batch);
        var bcp = GetBulkCopy();
        bcp.WriteToServer(data);
    });
===> end logging time
Timings are logged, and the part that creates the queue never takes any significant chunk of the time.
UPDATE2 I have implemented timing of how long each operation in that cycle takes, and the breakdown is as follows:
queue.Skip().Take() - negligible
MakeDataTable(batch) - 10%
GetBulkCopy() - negligible
WriteToServer(data) - 90%
UPDATE3 I am designing for the Standard edition of SQL Server, so I cannot rely on partitioning, since it's only available in the Enterprise edition. But I tried a variant of a partitioning scheme:
created 16 filegroups (G0 to G15),
made 16 tables for insertion only (T0 to T15), each bound to its individual filegroup. The tables have no indexes at all, not even a clustered int identity.
threads that insert data cycle through all 16 tables. This makes it almost a guarantee that each bulk insert operation uses its own table.
That did yield ~20% improvement in bulk insert speed. CPU cores, the LAN interface, and drive I/O were not maxed out; each was used at around 25% of capacity.
UPDATE4 I think it is now as good as it gets. I was able to push inserts to a reasonable speed using the following techniques:
Each bulk insert goes into its own table, then the results are merged into the main one (see the sketch after this list)
Tables are recreated fresh for every bulk insert, and table locks are used
Used an IDataReader implementation from here instead of DataTable.
Bulk inserts are done from multiple clients
Each client accesses SQL using an individual gigabit VLAN
Side processes accessing the main table use the NOLOCK option
I examined sys.dm_os_wait_stats and sys.dm_os_latch_stats to eliminate contention
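Roughly, the SQL side of the first two items looks like this per load (names and column types are placeholders):

-- A fresh staging heap per bulk insert (no indexes, so the bulk copy stays minimally logged)
CREATE TABLE dbo.MainTable_Stage01 (field1 int, field2 nvarchar(100) /* ... */);

-- SqlBulkCopy writes into dbo.MainTable_Stage01 with the table lock option, then the rows are merged:
INSERT INTO dbo.MainTable WITH (TABLOCK) (field1, field2)
SELECT field1, field2 FROM dbo.MainTable_Stage01;

DROP TABLE dbo.MainTable_Stage01;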
I have a hard time deciding at this point who gets credit for the answered question. To those of you who don't get an "answered", I apologize; it was a really tough decision, and I thank you all.
UPDATE5: The following item could use some optimization:
Used an IDataReader implementation from here instead of DataTable.
Unless you run your program on a machine with a massive CPU core count, it could use some refactoring. Since it uses reflection to generate get/set methods, that becomes a major load on the CPUs. If performance is key, you gain a lot by coding the IDataReader manually, so that it is compiled, instead of relying on reflection.
For recommendations on tuning SQL Server for bulk loads, see the Data Loading and Performance Guide paper from MS, and also Guidelines for Optimising Bulk Import in Books Online. Although they focus on bulk loading from SQL Server, most of the advice applies to bulk loading using the client API. These papers apply to SQL 2008; you don't say which SQL Server version you're targeting.
Both have quite a lot of information that is worth going through in detail. However, some highlights:
Minimally log the bulk operation. Use bulk-logged or simple recovery.
You may need to enable trace flag 610 (but see the caveats on doing this)
Tune the batch size
Consider partitioning the target table
Consider dropping indexes during bulk load
This is nicely summarised in a flow chart in the Data Loading and Performance Guide.
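As a sketch of the first two bullets in T-SQL (the database name is a placeholder; trace flag 610 applies to the SQL 2008-era engines the guide targets):

-- Minimize logging during the load window
ALTER DATABASE TargetDb SET RECOVERY BULK_LOGGED;

-- Optionally allow minimal logging for inserts into indexed tables (read the caveats in the guide first)
DBCC TRACEON (610, -1);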
As others have said, you need to capture some performance counters to establish the source of the bottleneck, since your experiments suggest that IO might not be the limitation.
Data Loading and Performance Guide includes a list of SQL wait types and performance counters to monitor (there are no anchors in the document to link to but this is about 75% through the document, in the section "Optimizing Bulk Load")
UPDATE
It took me a while to find the link, but this SQLBits talk by Thomas Kejser is also well worth watching - the slides are available if you don't have time to watch the whole thing. It repeats some of the material linked here but also covers a couple of other suggestions for how to deal with high incidences of particular performance counters.
It seems you have done a lot; however, I am not sure whether you have had a chance to study Alberto Ferrari's SqlBulkCopy Performance Analysis report, which describes several factors affecting SqlBulkCopy performance. I would say many of the things discussed in that paper are still worth trying first.
I am not sure why you are not getting 100% utilization on CPU, IO, or memory. But if you simply want to improve your bulk load speeds, here is something to consider:
Partition your data file into different files. Or, if the data is coming from different sources, simply create different data files.
Then run multiple bulk inserts simultaneously (see the sketch below).
Depending on your situation the above may not be feasible, but if you can do it, I am sure it should improve your load speeds.
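For instance (file paths, batch size, and table names are only illustrative), each split file gets its own BULK INSERT, run from separate sessions so the loads overlap:

-- Run each statement from its own session/connection at the same time.
-- With a clustered index on the target, avoid TABLOCK or the sessions will block each other.
BULK INSERT dbo.MainTable FROM 'D:\load\part01.dat' WITH (BATCHSIZE = 100000);
BULK INSERT dbo.MainTable FROM 'D:\load\part02.dat' WITH (BATCHSIZE = 100000);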

Complex processing in Stored procedures Vs .net application

We are building a new application in .NET 3.5 with a SQL Server database. The database is fairly large, having around 60 tables with loads of data. The .NET application has functionality to bring data into this database from data entry and from third-party systems.
After all the data is available in the database, the system has to do lots of calculations. The calculation logic is pretty complex. All the data required for the calculations is in the database, and the output also needs to be stored in the database. The data gathering will happen every week, and the calculations need to be done every week to generate the required reports.
Because of the above scenario, I was thinking of doing all these calculations using stored procedures. The problem is that we also need database independence, and stored procedures will not give us that. But if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly.
For example, I need to query one table which will return 2000 rows; then for each row I need to query another table which will return 300 results; then for each row of this I need to query multiple tables (around 10) to get the required data, do the calculation, and store the output in another table.
Now my question: should I go ahead with the stored-procedure solution and forget about database independence, since performance is important? I also think development time will be much less if we use the stored-procedure solution. If any client wants this solution on, say, an Oracle database (because they don't want to maintain another database), then we port the stored procedures to Oracle and maintain two versions for any future changes/enhancements. Similarly, other clients may ask for other databases.
The 2000 rows which I mentioned above are product SKUs. The 300 rows I mentioned are the different attributes which we want to calculate, e.g. handling cost, transport cost, etc. The 10 tables I mentioned have information about currency conversion, unit conversion, network, area, company, sell price, number sold per day, etc. The resulting table stores all the information as a star schema for analysis and reporting purposes. The goal is to get any minute detail about the product, so that one knows which attribute of a product sale is costing us money and where we can improve.
I wouldn't consider doing the data manipulation anywhere other than in the database.
Most people try to work with database data using looping algorithms. If you need real speed, think of your data as a SET of rows; you can update thousands of rows within a single update. I have rewritten so many cursor loops written by novice programmers into single update statements where the execution time was massively improved.
You say:
I need to query one table which will return me 2000 rows then for each row I need to query another table which will return me 300 results than for each row of this I need to query multiple tables (around 10) to get required data
From your question it looks like you are not using joins, and you are already thinking in loops. Even if you do intend to loop, it is much better to write a query to join in all the data necessary and then loop over it. Remember that update and insert statements can have massively complex queries driving them. Throw in CASE statements, derived tables, and conditional joins (LEFT OUTER JOIN), and you can solve just about any problem in a single update/insert.
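A sketch of what that can look like for the kind of calculation described in the question (all table and column names are invented for illustration):

-- One set-based statement instead of nested row-by-row loops:
-- join the SKUs, their attributes, and the lookup tables once, then write all results together
INSERT INTO dbo.SkuAttributeCost (SkuId, AttributeId, Cost)
SELECT  s.SkuId,
        a.AttributeId,
        s.UnitsSoldPerDay * a.UnitCost * COALESCE(fx.Rate, 1) AS Cost
FROM dbo.Sku AS s
JOIN dbo.SkuAttribute AS a ON a.SkuId = s.SkuId
LEFT JOIN dbo.CurrencyRate AS fx ON fx.CurrencyCode = a.CurrencyCode;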
Well, without any specific details of what data you have in these tables, just a back-of-the-napkin calculation shows that you're talking about processing over 6 million rows of information in the example you provided (2,000 rows * 300 rows * (1 row * 10 tables)).
Are all of these rows distinct, or are the 10 tables lookup information that has a relatively low cardinality? In other words, would it be possible to make a program that has the information from the 10 lookup tables in memory, and then just process the 300 row result set in memory to perform the calculations?
Also, I would be concerned about scalability -- if you do this in a stored procedure, it is guaranteed to be a serial process limited by the speed of the single database server. If you have the possibility of multiple copies of a client program, each processing a chunk of the 2,000-row initial record set, then you can perform some of the calculations in parallel, perhaps speeding up your overall processing time, as well as making it scalable for when your initial record set is 10 times larger.
Programming things like calculation code tends to be easier and more maintainable in C#. Also, keeping processing on the SQL Server to a minimum is normally good practice, since the database is the hardest part to scale.
Having said that, from your description it sounds like the stored procedure approach is the way to go. When calculation code depends on large volumes of data, it's going to be more expensive to move the data off the server for calculation. So unless you have reasonable ways of optimizing the dependent data (such as caching lookup tables?), you are most likely going to find it more painful than it's worth not to use a stored proc.
Stored procedures every time, but as KM said, within those stored procedures keep iteration to a minimum; that is to say, use joins in your SQL. Relational databases are soooooo good at joining.
Database scalability will be a small issue, especially as it sounds like you'd be performing these calculations in a batch process.
Database independence doesn't really exist except for the most trivial of CRUD applications, so if your initial requirement is to get this all working with SQL Server, then leverage the tools that the RDBMS provides (after all, your client will have spent a great deal of money on it). If (and it's a big if) a subsequent client really, really doesn't want to use SQL Server, then you'll have to bite the bullet and code it up in another flavour of stored procedure. But then, as you identified ("if I do all this in .net by query database all the time, I don't think it will be able to finish the work quickly"), you've deferred the expense of doing it until if and when it is required.
I would consider doing this in SQL Server Integration Services (SSIS). I'd put the calculations into SSIS, but leave the queries as stored procedures. This would provide you database independence - SSIS can process data from any database with an ODBC connection - as well as high performance. Only the simple SELECT statements would be in stored procedures, and those are the parts of the SQL standard most likely to be identical across multiple database products (assuming you stick to standard forms of query).

How to gain performance when maintaining historical and current data?

I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data. When I do this short-term analysis, it takes a long time to complete the operation.
To overcome this I created another table to hold the current year's data alone. When I perform the analysis from this table it is 20 times faster than the previous one.
Now my question is:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on nearly the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year, though; it gives you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date). In your postgresql.conf you'll need to turn constraint_exclusion = on to really get the benefit. The additional benefit here is that you can index only the exact tables you really want to pull information from. If you're batch-importing large amounts of data into this table, you may get slightly better results with a Rule vs. a Trigger, and for partitioning I find rules easier to maintain. But for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance (see the sketch below).
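A minimal sketch of that inheritance-based setup (table and column names are made up; you still need a rule or trigger to route inserts into the right child table):

CREATE TABLE stock_quotes (
    symbol     text          NOT NULL,
    quote_date date          NOT NULL,
    price      numeric(12,4)
);

-- One child table per month, constrained so the planner can skip it when the dates don't match
CREATE TABLE stock_quotes_2009_01 (
    CHECK (quote_date >= DATE '2009-01-01' AND quote_date < DATE '2009-02-01')
) INHERITS (stock_quotes);

CREATE INDEX ON stock_quotes_2009_01 (quote_date);

-- With constraint_exclusion = on, this query touches only the January child table
SELECT avg(price)
FROM stock_quotes
WHERE quote_date >= DATE '2009-01-01' AND quote_date < DATE '2009-02-01';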
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes, partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in data warehousing, and specifically in your case, stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it's not simple to write cross-database queries.
Automatic updates: that's a task for a cron job.
You can use partial indexes for such things; they do a wonderful job (for example, see below).
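For example (names are made up; the predicate must be a fixed expression):

-- Partial index covering only the recent rows that the short-term analysis touches
CREATE INDEX stock_quotes_recent_idx
    ON stock_quotes (quote_date)
    WHERE quote_date >= DATE '2009-01-01';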
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions), and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partitioning come after that...
