Understanding database inserts

Most databases support some form of "insert into select..." statement.
insert into a
select value from b;
How is this being achieved?
My understanding: the rows that are present at the point in time when the statement starts execution qualify to be picked up, and they are inserted into table a. At the same time, new values can be inserted into table b, and they would not be "considered", since the query has already started execution.
Is my understanding close to being accurate? Any reference docs on this greatly appreciated.
Thanks!

The answer for most modern databases is multiversion concurrency control.
Basically, each row version carries a timestamp marking the instant from which it is visible. The SELECT then consults the isolation level to decide whether rows added by transactions that committed before the current statement (for READ COMMITTED isolation) or before the current transaction (for SERIALIZABLE isolation) should be visible to it.
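A rough sketch of the timeline this implies for the original INSERT ... SELECT (using the table names from the question and an illustrative value of 42; the exact mechanism for marking row versions differs per engine, e.g. PostgreSQL uses internal transaction IDs rather than literal timestamps):
-- Session 1:
BEGIN;
INSERT INTO a SELECT value FROM b;  -- the statement sees only the row versions visible at this instant

-- Session 2, while the statement above is still running:
INSERT INTO b VALUES (42);
COMMIT;

-- Session 1:
COMMIT;
-- Under READ COMMITTED the value 42 never reaches table a, because its row
-- version was created after the INSERT ... SELECT statement started.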

Since you are not talking about any engine in particular, that could be what is happening. There could also be engines that simply pick up rows one at a time as they go. It all depends on the engine and on the locks applied to the database.

"New values can be inserted" depending on your isolation level; for example if it is serializable that will not happen.

There are database-specific differences, but I can give a general answer that holds for most of them.
When performing an INSERT ... SELECT, the RDBMS first executes the SELECT. As with any other SELECT, the results are materialized in a "virtual table" in memory (each database has its own cache and RAM management). The INSERT then behaves like an ordinary multi-row INSERT, because the result set in memory behaves exactly like data supplied on the command line.
At this stage, if a new row is inserted into the "selected" table, it will not affect the INSERT statement.
Finally, if the SELECT yields too many rows, or refers to a locked table, things can change, as the RDBMS will then fetch the values differently.
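One consequence of this that is easy to try for yourself (a hedged sketch; exact multi-row INSERT syntax varies by engine) is that a statement selecting from the same table it inserts into does not see its own new rows, so it cannot loop forever:
-- Hypothetical table, just for illustration
CREATE TABLE a (value INT);
INSERT INTO a VALUES (1), (2), (3);
-- Doubles the row count exactly once: the SELECT sees only the three rows
-- that existed when the statement started, not the copies it is adding.
INSERT INTO a SELECT value FROM a;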

SQL Server selectivity (makes use of indexes, which you'll want to look at as well):
- http://blog.namwarrizvi.com/?p=157
- http://www.sqlsolutions.com/articles/articles/How_Values_with_Irregular_Selectivity_Impact_SQL_Server_Database_Performance.htm
- http://sqlserverpedia.com/blog/sql-server-bloggers/index-columns-selectivity-and-inequality-predicates/
Oracle selectivity (again, these articles refer to index selectivity):
- http://www.akadia.com/services/ora_index_selectivity.html
- http://courses.csusm.edu/cs643yo/slides/optimization.htm (talks about architecture; might be more useful for you here)

Related

Does Oracle (RDB in general?) take a snapshot of the table affected by DML?

Objective
To understand the mechanism/implementation when processing DMLs against a table. Does a database (I work on Oracle 11G R2) take snapshots (for each DML) of the table to apply the DMLs?
Background
I run a SQL statement to update the AID field of the target table, replacing its old values with the new values from the source table.
UPDATE CASES1 t
SET t.AID = (
    SELECT DISTINCT NID
    FROM REF1
    WHERE oid = t.aid
)
WHERE EXISTS (
    SELECT 1
    FROM REF1
    WHERE oid = t.aid
);
I thought the 'OLD01' could be updated twice (OLD01 -> NEW01 -> SCREWED).
However, it did not happen.
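For reference, data along these lines reproduces the scenario described above (the column types are guesses; only the OID -> NID chain matters):
CREATE TABLE CASES1 (AID VARCHAR2(20));
CREATE TABLE REF1 (OID VARCHAR2(20), NID VARCHAR2(20));
INSERT INTO CASES1 VALUES ('OLD01');
INSERT INTO REF1 VALUES ('OLD01', 'NEW01');
INSERT INTO REF1 VALUES ('NEW01', 'SCREWED');
-- After one run of the UPDATE above, AID ends up as 'NEW01', not 'SCREWED'.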
Question
For each DML statement, does a database take a snapshot of table X (call it X+1) for the first DML, then keep taking a snapshot (call it X+2) of the result (X+1) for the second DML on the table, and so on for each DML statement that is successively executed? Is this also the mechanism used to implement Rollback/Commit?
Is it an expected behaviour specified as a standard somewhere? If so, kindly suggest relevant references. I Googled, but I am not sure what the keywords should be to get the right result.
Thanks in advance for your help.
Update
Started reading Oracle Core (ISBN 9781430239543) by Jonathan Lewis and saw the diagram. So my current understanding is that UNDO records are created in the UNDO tablespace for each update, and the original data is reconstructed from there, which is what I initially thought of as snapshots.
In Oracle, if you ran that update twice in a row in the same session, with the data as you've shown, I believe you should get the results that you expected. I think you must have gone off track somewhere. (For example, if you executed the update once, then without committing you opened a second session and executed the same update again, then your result would make sense.)
Conceptually, I think the answer to your question is yes (speaking specifically about Oracle, that is). A SQL statement effectively operates on a snapshot of the tables as of the point in time that the statement starts executing. The proper term for this in Oracle is read-consistency. The mechanism for it, however, does not involve taking a snapshot of the entire table before changes are made. It is more the reverse - records of the changes are kept in undo segments, and used to revert blocks of the table to the appropriate point in time as needed.
The documentation you ought to look at to understand this in some depth is in the Oracle Concepts manual: http://docs.oracle.com/cd/E11882_01/server.112/e40540/part_txn.htm#CHDJIGBH
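As a side note (not something the question strictly needs), one place where this undo machinery is directly visible is Oracle's flashback query, which rebuilds older row versions from the same undo data, provided undo retention still covers the interval:
SELECT aid
FROM cases1 AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '5' MINUTE);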

SQL Server Trigger Isolation / Scope Documentation

I have been looking for definitive documentation regarding the isolation level (or concurrency, or scope ... I'm not sure EXACTLY what to call it) of triggers in SQL Server.
I have found the following sources which indicate that what I believe is true (which is to say that two users, executing updates to the same table --even the same rows-- will then have independent and isolated triggers executed):
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/601977fb-306c-4888-a72b-3fbab6af0cdc/effects-of-concurrent-trigger-firing-on-inserted-and-deleted-tables?forum=transactsql
https://social.msdn.microsoft.com/forums/sqlserver/en-US/b78c3e7b-6b98-48e1-ad43-3c773c79a6ff/trigger-and-inserted-table
The first question is essentially the same question I am trying to find the answer to, but the answer given doesn't provide any sources. The second question also hits near the mark, and the answer is the same, but again, no sources are provided.
Can someone point me to where the available documentation makes the same assertions?
Thanks!
Well, Isolation Level and Scope are two very different things.
Isolation Level
Triggers operate within a transaction. By default, that transaction should be using the default isolation level of READ COMMITTED. However, if the calling process has specified a different isolation level, then that would override the default. As per usual: if desired, you should be able to override that within the trigger itself.
According to the MSDN page for DML Triggers:
The trigger and the statement that fires it are treated as a single transaction, which can be rolled back from within the trigger. If a severe error is detected (for example, insufficient disk space), the entire transaction automatically rolls back.
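For example, a sketch along these lines (dbo.Orders is a hypothetical table) would raise the isolation level for the reads the trigger itself performs:
CREATE TRIGGER trg_Orders_SetIso
ON dbo.Orders
AFTER UPDATE
AS
BEGIN
    -- Override whatever isolation level the calling batch was using,
    -- for the statements inside this trigger.
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- ... rest of the trigger body ...
END;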
Scope
The context provided is:
{from you}
two users, executing updates to the same table --even the same rows
{from the first linked MSDN article in the Question that is "essentially the same question I am trying to find the answer to"}
Are the inserted and deleted tables scoped to the current session? In other words, will they only contain the inserted and deleted records for the current scope, or will they contain the records for all current update operations against the same table? Can there even be truly concurrent operations, or will locks prevent this?
Before getting into the inserted and deleted tables it should be made very clear that there will only ever be a single DML operation happening on a particular row at any given moment. Two or more requests might come in at the exact same nanosecond, but all requests will take their turn, one at a time (and yes, due to locking).
Now, regarding what is in the inserted and deleted tables: Yes, only the rows for that particular event will be (and even can be) in those two pseudo-tables. If you execute an UPDATE that will modify 5 rows, only those 5 rows will be in the inserted and deleted tables. And since you are looking for documentation, the MSDN page for Use the inserted and deleted Tables states:
The deleted table stores copies of the affected rows during DELETE and UPDATE statements. During the execution of a DELETE or UPDATE statement, rows are deleted from the trigger table and transferred to the deleted table. The deleted table and the trigger table ordinarily have no rows in common.
The inserted table stores copies of the affected rows during INSERT and UPDATE statements. During an insert or update transaction, new rows are added to both the inserted table and the trigger table. The rows in the inserted table are copies of the new rows in the trigger table.
Tying this back to the other part of the question, the part relating to the Transaction Isolation Level: the Transaction Isolation Level has absolutely no effect on the inserted and deleted tables, as they pertain specifically to that event/query. However, the net effect of that operation, which is captured in those two pseudo-tables, can still be visible to other processes if they are using the READ UNCOMMITTED Isolation Level or the NOLOCK table hint.
And just to clarify something, the MSDN page linked above regarding the inserted and deleted tables states at the very beginning that they are "in memory" but that is not exactly correct. Starting in SQL Server 2005, those two pseudo-tables are actually based in tempdb. The MSDN page for the tempdb Database states:
The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server and is used to hold the following:
...
Row versions that are generated by data modification transactions for features, such as: online index operations, Multiple Active Result Sets (MARS), and AFTER triggers.
Prior to SQL Server 2005, the inserted and deleted tables were read from the Transaction Log (I believe).
To summarize, the inserted and deleted tables:
operate within a Transaction
are static (i.e. read-only) tables
are visible to only the current Trigger
only contain rows for the specific event/operation/query that fired that instance of that Trigger
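To make that summary concrete, here is a minimal sketch (dbo.Orders and dbo.Orders_History are made-up names) of a trigger that only ever sees the rows touched by the statement that fired it; concurrent updates from other sessions fire their own trigger instance with their own inserted/deleted contents:
CREATE TRIGGER trg_Orders_History
ON dbo.Orders
AFTER UPDATE
AS
BEGIN
    -- "deleted" holds the pre-update copies of exactly the rows modified by
    -- the UPDATE that fired this trigger, and nothing else.
    INSERT INTO dbo.Orders_History (OrderID, OldStatus, ChangedAt)
    SELECT d.OrderID, d.Status, SYSDATETIME()
    FROM deleted AS d;
END;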

WITH (NOLOCK) on table in SQL Server 2008

In my SQL Server database, the tempOrder table has millions of records, and 10 triggers update it whenever another table is updated.
So I want to apply WITH (NOLOCK) on the table.
I know I can do this per statement:
SELECT * FROM temporder with(NOLOCK)
But is there any way to apply WITH (NOLOCK) directly to the table in SQL Server 2008?
The direct answer to your question is NO -- there is no option to tell SQL Server to never lock tableX. With that said, your question opens up a whole series of things that should be brought up.
Isolation Level
First, the most direct way to accomplish what you want is to use the WITH (NOLOCK) hint or SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED (aka chaos). These options apply to a single query or to the duration of the connection, respectively. If I chose this route I would combine it with a long-running SQL Profiler trace to identify any queries taking locks on TableX.
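In other words, the two scopes look like this (tempOrder being the table from the question):
-- Per statement, via a table hint:
SELECT * FROM dbo.tempOrder WITH (NOLOCK);

-- Per connection, until changed or the connection closes:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.tempOrder;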
Lock Escalation
Second, SQL Server does have a table-level LOCK_ESCALATION setting (set with ALTER TABLE ... SET (LOCK_ESCALATION = AUTO | TABLE | DISABLE)). This controls whether, and at what granularity, SQL Server attempts to consolidate many fine-grained locks into fewer coarse-grained locks on a single object (table or partition).
Overriding SQL Server's lock escalation generally isn't a good idea. As the documentation states:
In most cases, the Database Engine delivers the best performance when
operating with its default settings for locking and lock escalation.
As counterintuitive as it may seem, given the scenario you described you might have some luck with fewer, broader locks instead of NOLOCK. You'll need to test this theory with a real workload to determine whether it's worthwhile.
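If you do decide to experiment with it, the table-level setting looks like this (shown against the question's tempOrder table; TABLE and DISABLE are the other accepted values):
ALTER TABLE dbo.tempOrder SET (LOCK_ESCALATION = AUTO);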
Snapshot Isolation
You might also check out the SNAPSHOT isolation level. There isn't enough information in your question to know, but I suspect it would help.
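A minimal way to try it, assuming you can make the database-level change (MyDatabase is a placeholder), with the reading session opting in per connection:
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- In the reading session:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
SELECT * FROM dbo.tempOrder;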
Dangers of NOLOCK
With that said, as you might have picked up from #GSerg's comment, NOLOCK can be evil. No-Lock is colloquially referred to as Chaos--and for good reason. When developers first encounter NOLOCK it seems like allowing dirty reads is the only implication. There are more...
dirty data is read, giving inconsistent results (the common impression)
wrong data -- meaning data consistent with neither the pre-write nor the post-write state of your data
hard exceptions (like error 601 due to data movement) that terminate your query
blank data is returned
previously committed rows are missed
malformed bytes are returned
But don't take my word for it :
Actual Email: "NoLOCK is the epitome of evil?"
SQL Server NOLOCK hint & other poor ideas
Is the nolock hint a bad practice
This is not a table-level configuration.
If you add (NOLOCK) to the query (it is called a table hint), you are saying that when executing this (and only this) query, it won't take shared locks on the affected tables.
Of course, you can make this behaviour the default for the current connection by setting the transaction isolation level to read uncommitted, for example: set transaction isolation level read uncommitted. But again, it is valid only while that connection is open.
Perhaps if you explain in more details what you are trying to achieve, we can better help you.
You cannot change the default isolation level for a table or a database (other than the snapshot-related database options); however, you can change it for all read queries on the current connection:
set transaction isolation level read uncommitted
See msdn for more information.

syntax for nolock in sql

I have seen SQL statements using nolock and with(nolock),
e.g.:
select * from table1 nolock where column1 > 10
AND
select * from table1 with(nolock) where column1 > 10
Which of the above statements is correct and why?
Only the second statement actually applies the NOLOCK hint; the first takes locks as usual. When I tested this out just now on SQL Server 2005, in
select * from table1 nolock where column1 > 10 --INCORRECT
"nolock" became the alias, within that query, of table1.
select * from table1 with(nolock) where column1 > 10
performs the desired nolock functionality. Skeptical? In a separate window, run
BEGIN TRANSACTION
UPDATE table1
set SomeColumn = 'x' + SomeColumn
to lock the table, and then try each locking statement in its own window. The first will hang, waiting for the lock to be released, and the second will run immediately (and show the "dirty data"). Don't forget to issue
ROLLBACK
when you're done.
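If you want further evidence that nolock in the first form is nothing but an alias, reference it as one (this assumes table1 really has a column1, as in the question):
select nolock.column1 from table1 nolock where nolock.column1 > 10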
The list of deprecated features is at Deprecated Database Engine Features in SQL Server 2008:
Specifying NOLOCK or READUNCOMMITTED
in the FROM clause of an UPDATE or
DELETE statement.
Specifying table
hints without using the WITH keyword.
HOLDLOCK table hint without
parenthesis
Use of a space as a separator between table hints.
The indirect application of table hints to an invocation of a multi-statement table-valued function (TVF) through a view.
They are all in the list of features that will be removed sometime after the next release of SQL Server, meaning they'll likely be supported in the next release only under a lower database compatibility level.
That being said my 2c on the issue are as such:
Both from table nolock and from table with(nolock) are wrong. If you need dirty reads, you should use the appropriate transaction isolation level: set transaction isolation level read uncommitted. This way the isolation level used is explicitly stated and controlled from one 'knob', as opposed to being spread out through the source and subject to all the quirks of table hints (indirect application through views and TVFs, etc.).
Dirty reads are an abomination. What is needed, in 99.99% of cases, is a reduction of contention, not reading uncommitted data. Contention is reduced by writing proper queries against a well designed schema and, if necessary, by deploying snapshot isolation. The best solution, which works in almost all cases save a few extreme ones, is to enable read committed snapshot in the database and let the engine work its magic:
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON
Then remove ALL hints from the selects.
They are both technically correct, however not using the WITH keyword has been deprecated as of SQL 2005, so get used to using the WITH keyword - short answer, use the WITH keyword.
Use "WITH (NOLOCK)".
Both are syntactically correct.
NOLOCK will become the alias for table1.
WITH (NOLOCK) is often exploited as a magic way to speed up database reads, but I try to avoid using it whenever possible.
The result set can contain rows that have not yet been committed, that are often later rolled back.
The query can fail with an error, or the result set can be empty, be missing rows, or display the same row multiple times.
This is because other transactions are moving data at the same time you're reading it.
READ COMMITTED adds an additional issue where data is corrupted within a single column where multiple users change the same cell simultaneously.
There are other side-effects too, which result in sacrificing the speed increase you were hoping to gain in the first place.
Now you know, never use it again.

Long query prevents inserts

I have a query that runs each night on a table with a bunch of records (200,000+). This application simply iterates over the results (using a DbDataReader in a C# app if that's relevant) and processes each one. The processing is done outside of the database altogether. During the time that the application is iterating over the results I am unable to insert any records into the table that I am querying for. The insert statements just hang and eventually timeout. The inserts are done in completely separate applications.
Does SQL Server lock the table down while a query is being done? This seems like an overly aggressive locking policy. I could understand how there could be a conflict between the query and newly inserted records, but I would be perfectly ok if records inserted after the query started were simply not included in the results.
Any ways to avoid this?
Update:
The WITH (NOLOCK) definitely did the trick. As some of you pointed out, this isn't the cleanest approach. I can't really query everything into memory given the amount of records and some of the columns in this table are binary (some records are actually about 1MB of total data).
The other suggestion was to query for batches of records at a time. This isn't a bad idea either, but it does bring up a new issue: database-independent queries. Right now the application can work with a variety of different databases (Oracle, MySQL, Access, etc.). Each database has its own way of limiting the rows returned by a query. But maybe this is better saved for another question?
Back on topic, the "WITH (NOLOCK)" clause is certainly SQL Server specific, is there any way to keep this out of my query (and thus preventing it from working with other databases)? Maybe I could somehow specify a parameter on the DbCommand object? Or can I specify the locking policy at the database level? That is, change some properties in SQL Server itself that will prevent the table from locking like this by default?
If you're using SQL Server 2005+, then how about giving the new MVCC snapshot isolation a try. I've had good results with it:
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;
ALTER DATABASE MyDatabase SET MULTI_USER;
It will stop readers blocking writers and vice-versa. It eliminates many deadlocks, at very little cost.
It depends what Isolation Level you are using. You might try doing your selects using the With (NoLock) hint, that will prevent the read locks, but will also mean the data being read might change before the selecting transaction completes.
The first thing you could do is try to add the "WITH (NOLOCK)" to any tables you have in your query. This will "Tame down" the locking that SQL Server does. An example of using "NOLOCK" on a join is as follows...
SELECT COUNT(Users.UserID)
FROM Users WITH (NOLOCK)
JOIN UsersInUserGroups WITH (NOLOCK) ON
Users.UserID = UsersInUserGroups.UserID
Another option is to use a DataSet instead of a DataReader. A DataReader is a "fire hose" technique that stays connected to the tables while your program is processing, basically handling the table row by row through the hose. A DataSet uses a "disconnected" methodology where all the data is loaded into memory and then the connection is closed. Your program can then loop over the data in memory without having to worry about locking. However, if this is a really large amount of data, there may be memory issues.
Hope this helps.
If you add the WITH (NOLOCK) hint after a table name in the FROM clause it should make sure it doesn't lock, and it doesn't care about reading data that is locked. You might get "out of date" results if you are writing at the same time, but if you don't care about that then you should be fine.
I reckon your best way of avoiding this is to do it in SQL rather than in the application.
You can add a
WAITFOR DELAY '000:00:01'
at the end of each loop iteration to provide time for other processes to run - just make sure that you haven't initiated a TRANSACTION such that all other processes are locked out anyway
The query is performing a table lock, thus the inserts are failing.
It sounds to me like you're keeping a lock on the table while processing the results.
You should instead load them into an array or collection of some sort, and close the database connection.
Then process the array.
In addition, while you're doing your select use either:
WITH(NOLOCK) or WITH(READPAST)
I'm not a big fan of using lock hints as you could end up with dirty reads or other weirdness. A couple of other ideas:
Can you break the number of rows down so you don't grab 200k at a time? Is there a way to tell whether you've processed a row - a flag or a timestamp - that you could use in the query? Your query could be 'SELECT TOP 5000 ...', getting a different 5k each time. Shorter queries mean shorter-lived locks.
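A sketch of that batching idea, assuming the table has an ever-increasing key column (the ID column and the dbo.BigTable name here are hypothetical) and the application remembers the last value it processed:
DECLARE @lastProcessedId INT;
SET @lastProcessedId = 0;  -- carried over by the application between batches
SELECT TOP (5000) *
FROM dbo.BigTable
WHERE ID > @lastProcessedId
ORDER BY ID;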
If you can use smaller sets of rows I like the DataSet vs. IDataReader idea. You will be loading data into memory and not consuming any SQL locks, but the amount of memory can cause other problems.
-Brian
You should be able to set the isolation level at the .NET level so that you don't have to include the WITH (NOLOCK) hint.
If you want to go with the batching option, you should be able to specify the Rowcount setting from the .NET level which would tell the database to only return n number of records. By setting these settings at the .NET level they should become database independent and work across all the platforms.

Resources