Efficient use of SQL Transactions - sql-server

My application currently needs to upload a large amount of data to a database server (SQL Server) and locally to a SQLite database (a local cache).
I have always wrapped inserts in transactions for speed. But now that I am working with something like 20k rows or more per insert batch, I am worried that transactions might cause issues. Basically, what I don't know is whether transactions have a limit on how much data you can insert under them.
What is the correct way to use transactions with large amounts of rows to be inserted in a database? Do you for instance begin/commit every 1000 rows?

No, there is no such limit. Contrary to what you might believe, SQLite writes pending transactions into the database file, not into RAM, so you should not run into any limit on the amount of data you can write under a transaction.
See the SQLite docs for more information: http://sqlite.org/docs.html
Follow the link "Limits in SQLite" for implementation limits like these.
Follow the link "How SQLite Implements Atomic Commit" for how transactions work.

I don't see any problems doing this, but if there are any constraint or referential integrity errors you will probably have to insert them all again, and the table stays locked until the transaction is committed. Breaking the load into smaller batches, and logging the activity of each batch, will help.
A better option would be to BCP the rows into the target when you are dealing with that many rows, or to use an SSIS package to do the load.
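If you do go the smaller-batches route, here is a minimal sketch of the idea in T-SQL, assuming a hypothetical staging table that feeds the real one (the table and column names are made up; the same begin/commit-every-N-rows pattern works from client code as well):

DECLARE @BatchSize int = 1000, @Moved int = 1;

WHILE @Moved > 0
BEGIN
    BEGIN TRANSACTION;

    -- Copy one batch of rows that have not been moved yet.
    INSERT INTO dbo.TargetRows (Id, Payload)
    SELECT TOP (@BatchSize) s.Id, s.Payload
    FROM dbo.StagingRows AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetRows AS t WHERE t.Id = s.Id)
    ORDER BY s.Id;

    SET @Moved = @@ROWCOUNT;   -- capture before COMMIT resets it to 0

    COMMIT TRANSACTION;
END;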

Related

Disable transactions on SQL Server

I need some light here. I am working with SQL Server 2008.
I have a database for my application. Each table has a trigger that stores all changes in another database (on the same server), in one single table, 'tbSysMasterLog'. Yes, the log of the application is stored in another database.
The problem is that before any insert/update/delete command on the application database, a transaction is started, and therefore the table in the log database is locked until the transaction is committed or rolled back. So anyone else who tries to write to any other table of the application is blocked.
So... is there any way to disable transactions on a particular database or on a particular table?
You cannot turn off the log. Everything gets logged. You can set the recovery model to "Simple", which will limit the amount of log data kept once the records are committed.
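For reference, switching to the simple recovery model is a one-liner (the database name below is a placeholder):

-- Placeholder database name; SIMPLE keeps the log small by truncating it
-- at checkpoints instead of retaining it for log backups.
ALTER DATABASE MyAppDb SET RECOVERY SIMPLE;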
" the table of the log database is locked": why that?
Normally you log changes by inserting records. Inserting records should not lock the complete table; normally there should not be any contention on inserts.
If you do more than inserts, perhaps you should consider changing that. Also look at the indexes defined on the log table; perhaps you can drop some of them.
It sounds from the question that you have a BEGIN TRANSACTION at the start of your triggers, and that you are logging to the other database before the COMMIT TRANSACTION.
Normally you do not need explicit transactions in SQL Server.
If you do need explicit transactions, you could put the data to be logged into variables, commit the transaction, and then insert the data into your log table.
Normally inserts are fast and can happen in parallel without locking. There are certain things, like identity columns, that require ordering, but they are very lightweight structures and can be avoided by generating GUIDs, so inserts are non-blocking. For something like your log table, though, a primary key identity column gives you a clear sequence, which is probably helpful in working out the order of events.
Obviously, if you log after the transaction, the log may not be in the same order in which the transactions occurred, because transactions take different amounts of time to commit.
We normally log into individual tables with a name similar to the master table, e.g. FooHistory or AuditFoo.
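As a rough sketch of that pattern (the table, column, and trigger names here are all made up), a per-table audit trigger that just inserts into its history table looks something like this:

-- Hypothetical example: audit changes to dbo.Foo into dbo.FooHistory.
CREATE TRIGGER trgFooAudit ON dbo.Foo
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- New or updated versions of the rows.
    INSERT INTO dbo.FooHistory (FooId, Name, ChangeType, ChangedAt)
    SELECT i.FooId, i.Name, 'INSERT/UPDATE', SYSUTCDATETIME()
    FROM inserted AS i;

    -- Deleted rows (those with no matching row in "inserted").
    INSERT INTO dbo.FooHistory (FooId, Name, ChangeType, ChangedAt)
    SELECT d.FooId, d.Name, 'DELETE', SYSUTCDATETIME()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.FooId = d.FooId);
END;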
There are other options. A very lightweight method is to use a trace; this is what is used for performance tuning, and it gives you a copy of every statement run on the database (including those from triggers). You can log the trace to a different database server. It is a good idea to log to a different server if you are tracing a heavily used server, since the volume of data is massive if you are tracing, say, 1,000 simultaneous sessions.
https://learn.microsoft.com/en-us/sql/tools/sql-server-profiler/save-trace-results-to-a-table-sql-server-profiler?view=sql-server-ver15
You can also trace to a file and then load it into a table (better performance), and script up starting, stopping, and loading the traces.
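For the trace-to-a-file-then-load-it step, a hedged sketch using the built-in fn_trace_gettable function (the file path and target table are placeholders):

-- Read a trace file (and its rollover files) into a table for analysis.
SELECT *
INTO dbo.TraceResults
FROM sys.fn_trace_gettable(N'D:\traces\audit_trace.trc', DEFAULT);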
The load on the server that receives the trace is minimal, and I have never had a locking problem on the server receiving the trace, so I am pretty sure that something in your setup is causing the locks.

Database tables optimized for both read and write

We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
So much data arrives at the tables, and so much data is read from them, at such high velocity, that the system started to fail.
The tables have indexes on them, one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is a smaller one, with only a few million records. It is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading from and writing to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
The requirements and specifics are way too broad to give a concrete answer, but one suggestion would be to set up a second database and do log shipping over to it. The original db would be the "write" database and the new db would be the "read" database.
Cons
- Disk space
- The read db would be out of date by the length of the log transfer interval
Pros
- You could possibly drop some of the indexes on the "write" db, which could increase write performance
- You could then summarize the tables in the "read" database to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
Here are some ideas, some more complicated than others; their usefulness depends heavily on the usage patterns, which aren't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, one file for each core on your system). Even if a query is executed entirely in memory, it can still block on the number of I/O threads.
[Simple] Transaction logs on SIMPLE rather than FULL recovery
[Simple] Transaction logs written to a separate spindle from the rest of the data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try and put data which is not updated into a separate table so static data indices don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH (NOLOCK) on tables and remove row and page locks from the indices (there's a sketch of this after the list). You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove inserts with transactional locks
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
That's a fairly generic list of things I've done in the past on various DB projects. Database optimizations tend to be highly specific to your situation, which is why DBAs have jobs. Some of the 'complicated' items could be simple if your architecture already supports them.
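Here is the sketch promised for the NOLOCK / lock-granularity item, using made-up table and index names:

-- Readers tolerate slightly stale data in exchange for not blocking.
SELECT DeviceId, ReadingTime, Value
FROM dbo.Readings WITH (NOLOCK)
WHERE ReadingTime >= DATEADD(HOUR, -1, GETUTCDATE());

-- Disallow row and page locks on this index; the engine then takes
-- coarser (table-level) locks, which cuts lock management overhead.
ALTER INDEX IX_Readings_ReadingTime ON dbo.Readings
SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);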

Transaction Size Limit in SQL Server

I'm loading large amounts of data from a text file into SQL Server. Currently each record is inserted (or updated) in a separate transaction, but this leaves the DB in a bad state if a record fails.
I'd like to put it all in one big transaction. In my case, I'm looking at ~250,000 inserts or updates and maybe ~1,000,000 queries. The text file is roughly 60MB.
Is it unreasonable to put the entire operation into one transaction? What's the limiting factor?
It's not only not unreasonable to do so; it's a must if you want to preserve integrity in case any record fails, so you get an "all or nothing" import as you note. 250,000 inserts or updates will be no problem for SQL Server to handle, but I would take a look at what those million queries are. If they're not needed to perform the data modification, I would take them out of the transaction so they don't slow down the whole process.
You have to consider that when you have an open transaction (regardless of size), locks will be taken on the tables it touches, and lengthy transactions like yours might cause blocking for other users who are trying to read those tables at the same time. If you expect the import to be big and time-consuming and the system to be under load, consider doing the whole process overnight (or during any non-peak hours) to mitigate the effect.
About the size: there is no specific size limit in SQL Server; a transaction can theoretically modify any amount of data. The practical limit is really the size of the transaction log file of the target database. The DB engine stores all the temporary and modified data in this file while the transaction is in progress (so it can use it to roll the transaction back if needed), so this file will grow. It must be allowed enough free space in the DB properties, and there must be enough disk space for the file to grow. Also, the row or table locks that the engine places on the affected tables consume memory, so the server must have enough free memory for all this plumbing too. Anyway, 60MB is usually too little to worry about. 250,000 rows is considerable, but not that much either, so any decent-sized server will be able to handle it.
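As a quick, hedged way to check whether the target database's log has room before a big load (the exact output layout varies a bit by version):

-- Log size and percentage used for every database on the instance.
DBCC SQLPERF (LOGSPACE);

-- Growth settings for the current database's log file
-- (size and max_size are in 8 KB pages; max_size = -1 means unlimited growth).
SELECT name, size, max_size, growth
FROM sys.database_files
WHERE type_desc = 'LOG';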
SQL Server can handle transactions of that size. We use a single transaction to bulk load several million records.
The most expensive part of a database operation is usually the client-server connection and traffic. For inserts/updates, indexing and logging are also expensive, but you can mitigate those costs by using the correct loading techniques (see below). You really want to limit the number of connections and the amount of data transferred between client and server.
To that end, you should consider bulk loading the data using SSIS, or from C# with SqlBulkCopy. Once you bulk load everything, you can use set-based operations ON THE SERVER to update or verify your data.
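A rough sketch of the "set-based operations on the server" step, with made-up staging and target table names:

-- After bulk loading into dbo.StagingCustomer, apply the changes to
-- dbo.Customer in two set-based statements inside one transaction.
BEGIN TRANSACTION;

UPDATE tgt
SET    tgt.Name = src.Name,
       tgt.City = src.City
FROM   dbo.Customer AS tgt
JOIN   dbo.StagingCustomer AS src ON src.CustomerId = tgt.CustomerId;

INSERT INTO dbo.Customer (CustomerId, Name, City)
SELECT src.CustomerId, src.Name, src.City
FROM   dbo.StagingCustomer AS src
WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer AS tgt
                   WHERE tgt.CustomerId = src.CustomerId);

COMMIT TRANSACTION;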
Take a look at this question for more suggestions about optimizing data loads. The question is about C#, but a lot of the information is useful for SSIS or other loading methods: What's the fastest way to bulk insert a lot of data in SQL Server (C# client).
There is no issue with doing an all or nothing bulk operation, unless a complete rollback is problematic for your business. In fact, a single transaction is the default behavior for a lot of bulk insert utilities.
I would strongly advise against a single operation per row. If you want to weed out bad data, you can load the data into a staging table first, programmatically determine the "bad data", and skip those rows.
Well, personally, I never load imported data directly into my prod tables, and I weed out all the records which won't pass muster long before I ever get to the point of loading. Some kinds of errors kill the import completely, and others might just send the record to an exception table to be returned to the provider and fixed for the next load. Typically I also have logic that determines whether there are too many exceptions and kills the package.
For instance, suppose city is a required field in your database, and in a file of 1,000,000 records you have ten that have no city. It is probably best to send them to an exception table and load the rest. But suppose you have 357,894 records with no city. Then you might need to have a conversation with the data provider to get the data fixed before loading. It will certainly affect prod less if you can determine that the file is unusable before you ever touch the production tables.
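A hedged sketch of the exception-table idea, again with placeholder names (here "city is required" is the rule being enforced):

-- Route rows missing a required city to an exception table...
INSERT INTO dbo.LoadExceptions (SourceRowId, Reason)
SELECT s.SourceRowId, 'Missing city'
FROM   dbo.StagingCustomer AS s
WHERE  s.City IS NULL OR s.City = '';

-- ...then load only the clean rows.
INSERT INTO dbo.Customer (CustomerId, Name, City)
SELECT s.CustomerId, s.Name, s.City
FROM   dbo.StagingCustomer AS s
WHERE  s.City IS NOT NULL AND s.City <> '';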
Also, why are you doing this one record at a time? You can often go much faster with set-based processing, especially if you have already managed to clean the data beforehand. You may still need to work in batches, but one record at a time can be very slow.
If you really want to roll back the whole thing if any part errors, yes, you need to use transactions. If you do this in SSIS, you can put transactions on just the part of the package where you affect prod tables, and not worry about them during the staging and clean-up parts.

How to figure the read/write ratio in Sql Server?

How can I query the read/write ratio in Sql Server 2005? Are there any caveats I should be aware of?
Perhaps it can be found with a DMV query, a standard report, a custom report (i.e. the Performance Dashboard), or by examining a SQL Profiler trace. I'm not sure exactly.
Why do I care?
I'm taking time to improve the performance of my web app's data layer. It deals with millions of records and thousands of users.
One of the points I'm examining is database concurrency. Sql Server uses pessimistic concurrency by default--good for a write-heavy app. If my app is read-heavy, I might switch it to optimistic concurrency (isolation level: read committed snapshot) like Jeff Atwood did with StackOverflow.
All apps are heavily read-biased.
An UPDATE is a read for the WHERE clause followed by a write
An INSERT must check unique indexes and FKs, which are reads and why you index FK columns
At most you have 15% writes. I saw an article once discussing it, but can't find it again. More likely 1%.
I know that in our DB, with 6 million new rows per day, we still have a minimum of 95%+ reads (an estimate, of course).
Why do you need to know?
Also: How to find out SQL Server table’s read/write statistics?
Edit, based on the question update...
I would leave DB concurrency alone until you need to change it. We've not changed anything out of the box for our 6 million rows per day plus heavy reads either.
For tuning our web app, we designed it to reduce round trips (one call = one action, multiple record sets per call, etc.).
Check out sys.dm_db_index_usage_stats:
seeks, scans, lookups are all reads
updates are writes
Keep in mind that the counters are reset with each server restart; you should look at them only after a representative load has run.
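A hedged sketch of a read/write ratio query built on that DMV (reads are seeks + scans + lookups, writes are updates):

-- Rough read/write ratio per database since the last restart.
SELECT DB_NAME(database_id) AS database_name,
       SUM(user_seeks + user_scans + user_lookups) AS reads,
       SUM(user_updates) AS writes
FROM sys.dm_db_index_usage_stats
GROUP BY database_id;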
There are also some performance counters that can help you:
Batch Requests/sec: number of Transact-SQL command batches received per second.
Write Transactions/sec: number of transactions that wrote to the database and committed
Transactions/sec: number of transactions started for the database
From these rates you can get a pretty good estimate of read:write ratio of your requests.
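The same counters can also be read from a DMV. This is just a sketch: counter availability can vary by version, and the per-second counters are cumulative, so you have to sample twice and take the difference to get an actual rate.

SELECT counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Batch Requests/sec',
                       'Write Transactions/sec',
                       'Transactions/sec');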
after your update
Turning on the version store is probably the best avenue for dealing with concurrency. Rather than using snapshot isolation explicitly, I'd recommend turning on read committed snapshot:
alter database <dbname> set allow_snapshot_isolation on;
alter database <dbname> set read_committed_snapshot on;
This will make read committed reads (i.e. the default ones) use the snapshot instead, so it literally doesn't require any change in the app and can be quickly tested.
You should also investigate whether your reads are being executed under the SERIALIZABLE isolation level, which is what happens when a TransactionScope is used without explicitly specifying the isolation level.
One word of caution: the version store is not exactly free. See Row Versioning Resource Usage, and give SQL Server 2005 Row Versioning-Based Transaction Isolation a read.
How about finding the ratio of the num_of_writes and num_of_reads counters in sys.dm_io_virtual_file_stats?
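That would be a sketch along these lines; keep in mind it counts physical file I/O, so reads served from the buffer cache never show up here:

-- Physical reads vs. writes per database, aggregated across its files.
SELECT DB_NAME(database_id) AS database_name,
       SUM(num_of_reads)  AS reads,
       SUM(num_of_writes) AS writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
GROUP BY database_id;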
I did it using SQL Server Profiler. I just opened it before running the application and watched what kind of queries were executed while I was doing something in the application. But I think that is better just for making sure the queries work; I don't know if it is convenient for measuring server workload like this. Profiler can also save traces which you can analyse later, so it might work.

What is a good SQL Server 2008 solution for handling massive writes so that they don't slow down reads for users of the database?

We have large SQL Server 2008 databases. Very often we'll have to run massive data imports into the databases that take a couple hours. During that time everyone else's read and small write speeds slow down a ton.
I'm looking for a solution where maybe we set up one database server that is used for bulk writing, and then two other database servers that are set up for reads and maybe small writes. The goal is to maintain fast small reads and writes while the bulk changes are running.
Does anyone have an idea of a good way to accomplish this using SQL Server 2008?
Paul, there are two parts to your question.
First, why are writes slow?
When you say you have large databases, you may want to clarify that with some numbers. The Microsoft teams have demonstrated multi-terabyte loads in less than an hour, but of course they're using high-end gear and specialized data warehousing techniques. I've been involved with data warehousing teams that regularly loaded so much data overnight that the transaction log drives had to be over a terabyte just to handle the quick bursts, but not a terabyte per hour.
To find out why writes are slow, you'll want to compare your load methods to data warehousing techniques. For example, have you tried using staging tables? Table partitioning? Data and log files on different arrays? If you're not sure where to start, check out my Perfmon tutorial to measure your system looking for bottlenecks:
http://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/
Second, how do you scale out?
You asked how to set up multiple database servers so that one handles the bulk load while others handle reads and some writes. I would heavily, heavily caution against taking the multiple-servers-for-writes approach because it gets a lot more complicated quickly, but using multiple servers for reads is not uncommon.
The easiest way to do it is with log shipping: every X minutes, the primary server takes a transaction log backup, and that log backup is then applied to the read-only reporting server. There are some catches with this: the data is a little behind, and the restore process has to kick all connections out of the database to apply the restore. This can be a perfectly acceptable solution for things like data warehouses, where the end users want to keep running their own reports while the new day's data loads. You can simply not do transaction log restores while the data warehouse is loading, and the users can keep their connections the whole time.
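For illustration, the two halves of one log shipping cycle boil down to statements like these (the paths and database name are placeholders; the built-in log shipping jobs normally run them for you):

-- On the primary: take the periodic log backup.
BACKUP LOG SalesDb TO DISK = N'\\fileshare\logship\SalesDb_0001.trn';

-- On the reporting server: apply it, leaving the database readable
-- (all user connections must be kicked out first).
RESTORE LOG SalesDb FROM DISK = N'\\fileshare\logship\SalesDb_0001.trn'
WITH STANDBY = N'D:\logship\SalesDb_undo.dat';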
To help find out what solution is right for you, consider adding the following to your question:
The size of your database (GB/TB in size, # of millions of rows in the largest table that's having the writes)
The size of your server & storage (a box with 10 drives has different solutions available than a box hooked up to a SAN)
The method of loading data (is it single-record inserts, are you using bulk load, are you using table partitioning, etc)
Why not use memcached to eliminate the reads? I've got the same situation where I work, and we've been using memcached on Windows with great results. I was surprised how trivial it was to get my code running with it, too. There are open-source wrapper libraries for virtually every mainstream language, and using it could mean 99% of your reads never even touch the database (because you set the memcached values on the write operation to the database).
Memcached is really just a giant hash table store (and can even be clustered or run on any machine you like, since it uses sockets to read and store the hashes).
When reading the memcached value, simply check whether it's null (and return it if it's not), or else do your usual database read and return that. It can store just about anything, so long as each memcached key/value pair is less than 1MB.
The easiest way would be to slow down the rate at which the writes occur and feed them in one record at a time. The loads themselves will be slower, but things will feel faster for the other users. If the batches take "a couple hours", you can perhaps spread them out more.
This is just an idea: create a view over your "active" tables, then BCP the data into a "staging" table. When the load is done, alter the view to also include the "staging" tables. Just an idea.
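A hedged sketch of that view-switch idea, with made-up object names:

-- Readers always query the view.
CREATE VIEW dbo.vwReadings AS
SELECT DeviceId, ReadingTime, Value FROM dbo.Readings_Active;
GO

-- ... BCP / bulk insert the new data into dbo.Readings_Staging here ...

-- Once the load finishes, re-point the view to include the staged data.
ALTER VIEW dbo.vwReadings AS
SELECT DeviceId, ReadingTime, Value FROM dbo.Readings_Active
UNION ALL
SELECT DeviceId, ReadingTime, Value FROM dbo.Readings_Staging;
GO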
I'm not sure what you mean when you say everyone else's reads and writes slow down. Do they slow down when reading from and writing to the same database that the data is being imported into, or to different databases on the same server?
If it is the same database, you could always use the WITH (NOLOCK) hint to do the reads even when the table is locked for writes/inserts. However, please be aware that these can be dirty reads. I am not sure how you can make the small writes faster while the table is locked, since a write is already in progress, but you can keep each transaction small so the writes finish faster and release their locks. The other option is to have a separate database for bulk inserts and another database for reading.

Resources