I have a table in an SQL Server 2017 DB used by a lot of long running transactions that originate from multiple threads. This causes deadlocking several times a day so I am considering implementing read committed snapshot isolation. The trick is that this table has 3 VARBINARY(MAX) columns and each of them contains data between 10-1000MB (with the mean around 20 MB) beside several int and bit columns.
Now the questions:
Q1: Will SQL Server copy the entire row (including the VARBINARY(MAX) columns) into the TEMPDB?
Q2: If so, would the performance benefit from moving the VARBINARY(MAX) columns into a separate table with a 1:1 relationship to the original table?
Sql Server has to present you with consistent view on your data (e.g. T2 sees your row, including LOB, as it were before T1 started mutating transaction). Which means -- yes, it has no choice but to copy LOB with the rest of the row data. Which makes me think that yes, performance may benefit from having separate table with LOBs.
As usual, I would recommend doing simple experiment that will measure performance with both configurations. Please post your results here.
Related
We would like to synchronize data (insert, update) from Oracle (11g) to PostgreSQL (10). Our approach was the following:
A trigger on the table in Oracle updates a column with nextval from a sequence before insert and update.
PostgreSQL knows the last sequence number processed and fetches the rows from Oracle > lastSequenceNumberFetched.
We now have the following problem:
Session 1 in Oracle inserts a row, sequence number (let's say 45) is written but no COMMIT is done in Oracle.
Session 2 in Oracle inserts a row, sequence number is written (let's say 49 (because sequences in Oracle can have gaps)) and a COMMIT is done in Oracle.
Session in PostgreSQL fetches rows from Oracle with sequenceNumber > 44 (because the lastSequenceNumberFetched is 44) and gets the row with sequenceNumber 49. So this is the new lastSequenceNumberFetched.
Session 1 in Oracle makes a commit.
Session in PostgreSQL fetches rows from Oracle with sequenceNumber > 49. Problem is that the row with sequenceNumber 45 is never fetched.
Are there any better approaches for our use case avoiding our problem with missing data?
In case you don't have delete operations in your tables and the tables are not very big then I suggest to use Oracle System Change Number (SCN) on the row level which is returned by the pseudo column ORA_ROWSCN (link). This is the commit time presented by number. By default the SCN is tracked for the data block, but you can enable tracking on the row level (keyword rowdependencies). So you have to recreate your table with this keyword. At the sync procedure launch you get the current scn by the function call dbms_flashback.get_system_change_number, then scan all tables where ora_rowscn between _last_scn_value_ and _current_scn_value_. The disadvantage is that this pseudo columns is not indexed, so you will have full table scans, which is slow for big tables.
If you use delete statements then you have to track the records which were deleted. For this purpose you can use one log table having the following columns: table_name, table_id_value, operation (insert/update/delete). Table is filled by the trigger on base tables. So for your case when session 1 commits data in base table - then you have the record in log table to process. And you don't see it until the session commits. So no issues with sequence numbers that you described.
Hope that helps.
Is this purely a data project or do you have some client here. If you do have a middle tier you could use an ORM to abstract some of this and do writes to both. Do you care whether the sequences are the same? It would be possible to do something like collect all the data to synchronize since a particular timestamp (every table would have to have a UTC timestamp) and then take a hash of all the data and compare with what is in Postgres.
It might be useful to have some more of your requirements for the synchronization of data and the reasoning behind this e.g.
Do the keys need to be the same against both environments? Why?
Who views the data, is the same consumer looking at both sources.
Why wouldn't you just use an ORM to target only one db why do you need oracle and postgres?
I have seen a similar setup. An application on Postgres mostly for reporting and other secondary tasks while main app was on Oracle.
Some of the main app tables are cached in Postgres for convenience. But this setup brings in the sync problem.
The compromise solution was a mix of incremental sequence-based sync during daytime and full table copy overnight
Regarding other solutions proposed here:
Postgres fdw is slow for complex queries and it puts extra load on foreign db especially when where clause refer to both local and foreign tables.
The same query will run much faster if foreign table is cached in postgres.
Incremental/differential sync using sequence numbers -tried this and works acceptable for small tables, but the nightmare starts with child relations maybe an orm can help here
The ideal solution in my opinion would probably be to stream Oracle changes to Postgres or intermediary process that replicates changes to Postgres
I have no clue about how to do this as I understood it requires Oracle golden gate app (+ licence)
I use SSMS 2016. I have a view that has a few millions of records. The view is not indexed and should not be as it's being updated (insert, delete, update) every 5 minutes by a job on the server to then display update data sets in to the client calling application in GUI.
The view does a very heavy volume of conversion INT values to VARCHAR appending to them some string values.
The view also does some CAST operations on the NULL assigning them column names aliases. And the worst performance hit is that the view uses FOR XML PATH('') function on 20 columns.
Also the view uses two CTEs as the source as well as Subsidiaries to define a single column value.
I made sure I created the right indexes (Clustered, nonclustered,composite and Covering) that are used in the view Select,JOIN,and WHERE clauses.
Database Tuning Advisor also have not suggested anything that could substantialy improve performance.
AS a workaround I decided to create two identical physical tables with clustered indexes on each and using the Merge statement (further converted into a SP and then Into as SQL Server Agent Job) maintain them updated. And to assure there is no long locking of the view. I will then swap(rename) the tables names immediately after each merge finishes. So in this case all the heavy workload falls onto a SQL Server Agent Job keep the tables updated.
The problem is that the merge will take roughly 15 minutes considering current size of the data, which may increase in the future. So, I need to have a real time design to assure that the view has the most up-to-date information.
Any ideas?
Our team needs to insert a cruel amount of data into our SQL Server 2008 database. We're looking for a good solution. Now we came up with one, but I have doubts with it, simply because it doesn't feel right. So I'm asking here if this seems like a good solution. Extra challange is that it's a peer-to-peer replicated database over 4 servers! :)
Imagine we have 1 million rows to insert
Start transaction
Increase current ident value on a table with 1 million
Have a DataSet/DataTable ready with 1 million rows and the correct ids
BulkCopy the data into the database
Commit transaction
Is this a good solution, might we get into concurrency issues, have too large transactions, etc.
you'll only get problems (as far as I can see, so there might be things I overlook!) if the database is online and users can insert rows into that table. Increasing the identity value for new rows on the meta-level simply means that the next row inserted by the system will use that number, so if you bump it with 1 million, it means you reserved those numbers up front.
Identity columns are 'nice' but have the side effect that they're not transferable. So if you have to migrate the data to another DB, realize that you likely have to adjust the data inserted to match the database you insert it in (as that's the scope of the data which means identity fields could collide with rows already in the table).
If this is a one-time affair, it might work out. If you're planning to do this regularly, I'd look into a more higher-level migration system where you migrate the data to new identity values or use guid's with NEWSEQUENTIALID() so you get proper checked indexes and also unique, transferable id's.
I've inherited an SSIS package which loads 500K rows (about 30 columns) into a staging table.
It's been cooking now for about 120 minutes and it's not done --- this suggests it's running at less than 70 rows per second. I know that everybody's environment is different but I think this is a couple orders of magnitude off from "typical".
Oddly enough the staging table has a PK constraint on an INT (identity) column -- and now I'm thinking that it may be hampering the load performance. There are no other constraints, indexes, or triggers on the staging table.
Any suggestions?
---- Additional information ------
The source is a tab delimited file which connects to two separate Data Flow Components that add some static data (the run date, and batch ID) to the stream, which then connects to an OLE DB Destination Adapter
Access mode is OpenRowset using FastLoad
FastLoadOptions are TABLOCK,CHECK_CONSTRAINTS
Maximum insert commit size: 0
I’m not sure about the etiquette of answering my own question -- so sorry in advance if this is better suited for a comment.
The issue was the datatype of the input columns from the text file: They were all declared as “text stream [DT_TEXT]” and when I changed that to “String [DT_STR]” 2 million rows loaded in 58 seconds which is now in the realm of “typical” -- I'm not sure what the Text file source is doing when columns are declared that way, but it's behind me now!
I'd say there is a problem of some sort, I bulk insert a staging table from a file with 20 million records and more columns and an identity field in far less time than that and SSIS is supposed to be faster than SQL Server 2000 bulk insert.
Have you checked for blocking issues?
If it is running in one big transaction, that may explain things. Make sure that a commit is done every now and then.
You may also want to check processor load, memory and IO to rule out resource issues.
This is hard to say.
I there was complex ETL, I would check the max number of threads allowed in the data flows, see if some things can run in parallel.
But it sounds like it's a simple transfer.
With 500,000 rows, batching is an option, but I wouldn't think it necessary for that few rows.
The PK identity should not be an issue. Do you have any complex constraints or persisted calculated columns on the destination?
Is this pulling or pushing over a slow network link? Is it pulling or pushing from a complex SP or view? What is the data source?
What would be more efficient in storing some temp data (50k rows in one and 50k in another) to perform come calculation. I'll be doing this process once, nightly.
How do you check the efficiency when comparing something like this?
The results will vary on which will be easier to store the data, in disk (#temp) or in memory (#temp).
A few excerpts from the references below
A temporary table is created and populated on disk, in the system database tempdb.
A table variable is created in memory, and so performs slightly better than #temp tables (also because there is even less locking and logging in a table variable). A table variable might still perform I/O to tempdb (which is where the performance issues of #temp tables make themselves apparent), though the documentation is not very explicit about this.
Table variables result in fewer recompilations of a stored procedure as compared to temporary tables.
[Y]ou can create indexes on the temporary table to increase query performance.
Regarding your specific case with 50k rows:
As your data size gets larger, and/or the repeated use of the temporary data increases, you will find that the use of #temp tables makes more sense
References:
Should I use a #temp table or a #table variable?
MSKB 305977 - SQL Server 2000 - Table Variables
There can be a big performance difference between using table variables and temporary tables. In most cases, temporary tables are faster than table variables. I took the following tip from the private SQL Server MVP newsgroup and received permission from Microsoft to share it with you. One MVP noticed that although queries using table variables didn't generate parallel query plans on a large SMP box, similar queries using temporary tables (local or global) and running under the same circumstances did generate parallel plans.
More from SQL Mag (subscription required unfortunately, I'll try and find more resources momentarily)
EDIT: Here is some more in depth information from CodeProject