SQL Server Bulk Insert - Should other inserts be allowed during the operation? - sql-server

I have some exported rows that I want to import into a database that is currently live. Should I disable the app that interacts with the DB before doing the bulk insert, so that the bulk insert is the only operation being performed?
I assumed that this would be best-practice but just wanted to check with the community.
Many Thanks!

I wouldn't generally disable any other app using the database. To me, the bulk load is just another client of the database and is subject to the usual concurrency/isolation mechanisms.
There are exceptions where the bulk load is part of some long-running release process or maintenance routine, in which case it'd be run out of hours anyway.
I always use a staging table to load data, though. After processing, scrubbing, cleansing, key lookups etc., I'd flush the data to the live table in a single atomic operation. In other words, I wouldn't mix bulk loads with other table access: I'd buffer the bulk load via a staging table.
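A minimal T-SQL sketch of that pattern, with hypothetical table and column names, where the bulk load only ever touches the staging table and the live table is written in one short transaction:

```sql
-- Bulk load lands in dbo.StagingOrders first (bcp / BULK INSERT / SqlBulkCopy).
-- After scrubbing and key lookups, flush to the live table in one atomic step.
BEGIN TRANSACTION;

INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
SELECT s.OrderId, s.CustomerId, s.Amount
FROM dbo.StagingOrders AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Orders AS o WHERE o.OrderId = s.OrderId);

COMMIT TRANSACTION;

TRUNCATE TABLE dbo.StagingOrders;  -- staging is scratch space; clear it for the next load
```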

Related

What is the best way to load incremental data into a database using Spark?

I have tried multiple ways to incrementally load (upsert) data into a Postgres database (RDS) using Spark (with a Glue job) but have not found satisfactory performance.
I have tried the below ways:
1. Delete the records based on the primary key and append new records
In this approach, I delete the records whose primary key exists in the incremental data. This is done with a Python driver such as pg8000. Once the delete completes successfully, I write the data using Spark in append mode.
This approach works smoothly when the incremental load is small, but with high volumes the delete and the write take a long time.
2. Writing data into a staging table and triggering a DB stored procedure
I have also tried this alternative way to handle the upsert with a relational database. Here I write the data into a staging table using Spark and trigger a stored procedure (with pg8000). The stored procedure upserts the data from the staging table into the final table.
With this approach, we have to manage multiple stored procedures and also depend on the database (in case we do not own that system). Handling the staging-to-final upsert is tricky if the previous job fails after writing data into staging but before triggering the stored procedure. Overall I feel this approach needs good orchestration, not only of the Spark job but of the stored procedures as well. A sketch of that staging-to-final upsert is shown below.
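For reference, the core of such a stored procedure is usually a single set-based upsert. A minimal PostgreSQL sketch, with hypothetical table and column names:

```sql
-- Body of the staging-to-final upsert (PostgreSQL 9.5+), hypothetical names.
INSERT INTO final_table (id, col_a, col_b, updated_at)
SELECT id, col_a, col_b, updated_at
FROM   staging_table
ON CONFLICT (id)                      -- assumes id is the primary key of final_table
DO UPDATE SET
    col_a      = EXCLUDED.col_a,
    col_b      = EXCLUDED.col_b,
    updated_at = EXCLUDED.updated_at;

TRUNCATE staging_table;               -- reset staging for the next incremental batch
```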
Is there a way to handle upserts into a relational database (Postgres/MySQL/Oracle etc.) smoothly and efficiently using Spark?

Pulling instead of pushing data from a database

Loading data from my OLTP database (as part of ETL) via OPENQUERY or an SSIS Data Flow to another SQL Server database (the warehouse, which runs this SSIS package / OPENQUERY statement) kills it. As I checked in Performance Monitor, I use resources from the source database, not from the destination. Is it possible to reverse this resource utilization (using SQL Server 2016 or SSIS)?
The problem here is in your destination write operation. If you are using an OLE DB Destination with the fast-load access mode, try setting the rows per batch value to a non-zero value and reduce the maximum insert commit size to a value that will be easy on your memory and CPU. SSIS will then not have to wait for the default of 2,147,483,647 rows before committing to the destination table, which can have a large impact on your log file and slow your process down. Please refer to this article for more info on setting these values. All the best.
What does your export query look like? Is it just a simple data dump, or do you have some complex logic in it (e.g. doing some denormalization/aggregation with the export)?
If it's just a simple export, check which server your SSIS package runs on and what resources it uses. In any case, you need to read the data from your source system, so expect some disk read operations there.
In general it is better to get the data out of the OLTP system as quickly as possible and then apply the other operations in later steps of your ETL process, on your ETL/data warehouse server, in order to reduce the impact on your transactional system.
Hope it helps.

Is there a congruent command to truncate for re-filling a table? [duplicate]

I have an INSERT statement that is eating a hell of a lot of log space, so much so that the hard drive is actually filling up before the statement completes.
The thing is, I really don't need this to be logged as it is only an intermediate data upload step.
For argument's sake, let's say I have:
Table A: Initial upload table (populated using bcp, so no logging problems)
Table B: Populated using INSERT INTO B from A
Is there a way that I can copy between A and B without anything being written to the log?
P.S. I'm using SQL Server 2008 with simple recovery model.
From Louis Davidson, Microsoft MVP:
There is no way to insert without logging at all. SELECT INTO is the best way to minimize logging in T-SQL; using SSIS you can do the same sort of light logging using Bulk Insert.
From your requirements, I would probably use SSIS: drop all constraints, especially unique and primary key ones, load the data in, then add the constraints back. I load about 100GB in just over an hour like this, with fairly minimal overhead. I am using the BULK_LOGGED recovery model, which just logs the existence of new extents during the load, and then you can remove them later.
The key is to start with barebones tables, and it just screams. Building the indexes once at the end leaves you with no indexes to maintain during the load, just one index build per index.
If you don't want to use SSIS, the point still applies: drop all of your constraints and use the BULK_LOGGED recovery model. This greatly reduces the logging done by INSERT INTO statements and should solve your issue.
http://msdn.microsoft.com/en-us/library/ms191244.aspx
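A hedged T-SQL sketch of that advice, with hypothetical database and table names (under the simple recovery model, SELECT INTO and a TABLOCK insert into an empty heap are also minimally logged):

```sql
-- Optional: switch to BULK_LOGGED for the duration of the load.
ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

-- Minimally logged copy into a brand-new table...
SELECT * INTO dbo.B_new FROM dbo.A;

-- ...or into an existing, empty, unindexed heap using a table lock.
INSERT INTO dbo.B WITH (TABLOCK)
SELECT * FROM dbo.A;

-- Switch back (and rebuild indexes/constraints) once the load is done.
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;
```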
Upload the data into tempdb instead of your database, and do all the intermediate transformations in tempdb. Then copy only the final data into the destination database. Use batches to minimize individual transaction size. If you still have problems, look into deploying trace flag 610, see The Data Loading Performance Guide and Prerequisites for Minimal Logging in Bulk Import:
Trace Flag 610: SQL Server 2008 introduces trace flag 610, which controls minimally logged inserts into indexed tables.
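A hedged sketch of that batching idea in T-SQL, with hypothetical table and column names; each iteration commits its own small transaction, so the log never has to hold the entire load at once:

```sql
DECLARE @batch int = 100000;
DECLARE @rows  int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Copy the next slice of not-yet-copied rows from the tempdb staging table.
    INSERT INTO dbo.FinalTable (Id, Payload)
    SELECT TOP (@batch) s.Id, s.Payload
    FROM tempdb.dbo.Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.FinalTable AS f WHERE f.Id = s.Id)
    ORDER BY s.Id;

    SET @rows = @@ROWCOUNT;   -- 0 rows copied means we are done
    COMMIT TRANSACTION;
END;
```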

Efficient use of SQL Transactions

My application currently needs to upload a large amount of data to a database server (SQL Server) and locally into a SQLite database (a local cache).
I have always used transactions when inserting data into a database for speed purposes. But now that I am working with something like 20k rows or more per insert batch, I am worried that transactions might cause issues. Basically, what I don't know is whether transactions have a limit on how much data you can insert under them.
What is the correct way to use transactions with large amounts of rows to be inserted in a database? Do you for instance begin/commit every 1000 rows?
No, there is no such limit. Contrary to what you might believe, SQLite writes pending transactions into the database file, not RAM, so you should not run into any limit on the amount of data you can write under a transaction.
See the SQLite docs for this information: http://sqlite.org/docs.html
Follow the link "Limits in SQLite" for implementation limits like these.
Follow the link "How SQLite Implements Atomic Commit" for how transactions work.
I don't see any problem doing this, but if there are any constraint/referential-integrity errors you will probably have to insert them all again, and the table is locked until the transaction is committed. Breaking the load into smaller batches, while logging activity for each batch, will help.
When dealing with many rows, a better option would be to BCP them into the target, or even to use an SSIS package for this.
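On the SQLite side, the speedup comes from grouping rows into explicit transactions rather than letting each INSERT auto-commit. A trivial sketch, with a hypothetical table name:

```sql
-- SQLite: one transaction per batch of ~1000 rows instead of one implicit
-- transaction (and one fsync) per INSERT.
BEGIN TRANSACTION;
INSERT INTO local_cache (id, payload) VALUES (1, '...');
INSERT INTO local_cache (id, payload) VALUES (2, '...');
-- ... roughly 1000 rows per batch ...
COMMIT;
-- repeat BEGIN ... COMMIT for the next batch
```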

Checklist towards optimal performance in a database import application

I am faced with an application designed to import huge amounts of data into a Microsoft SQL Server 2000 database. The application seems to take an awfully long time to complete, and I suspect the application design is flawed. Someone asked me to dig into the application to find and fix serious bottlenecks, if any. I would like a structured approach to this job and have decided to prepare a checklist of potential problems to look for. I have some experience with SQL databases and have so far written down some things to look for.
But it would be very helpful with some outside inspiration as well. Can any of you point me to some good resources on checklists for good database schema design and good database application design?
I plan on developing checklists for the following main topics:
Database hardware - First, establish that the server hardware is appropriate.
Database configuration - Next, ensure the database is configured for optimal performance.
Database schema - Does the database schema have a sound design?
Database application - does the application incorporate sound algorithms?
Good start. Here are the recommended priorities.
First Principle. Import should do little or no processing other than source file reads and SQL Inserts. Other processing must be done prior to the load.
Application Design is #1. Do the apps do as much as possible on the flat files before attempting to load? This is the secret sauce in large data warehouse loads: prepare offline and then bulk load the rows.
Database Schema is #2. Do you have the right tables and the right indexes? A load doesn't require any indexes. Mostly you want to drop and rebuild the indexes.
A load had best not require any triggers. All that triggered processing can be done off-line to prepare the file for a load.
A load had best not be done as a stored procedure. You want to be using a simple utility program from Microsoft to bulk load rows.
Configuration. Matters, but much, much less than schema design and application design.
Hardware. Unless you have money to burn, you're not going far here. If -- after everything else -- you can prove that hardware is the bottleneck, then spend money.
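A hedged T-SQL sketch of the "barebones table" idea (SQL Server 2000 syntax; table, index and constraint names are hypothetical):

```sql
-- Before the load: strip the table down to a bare heap.
ALTER TABLE Sales NOCHECK CONSTRAINT ALL;   -- disable FK/CHECK constraints
DROP INDEX Sales.IX_Sales_CustomerId;       -- drop nonclustered indexes (SQL 2000 syntax)

-- ... bulk load with bcp / BULK INSERT / DTS here ...

-- After the load: one index build per index, then re-enable and re-validate constraints.
CREATE INDEX IX_Sales_CustomerId ON Sales (CustomerId);
ALTER TABLE Sales WITH CHECK CHECK CONSTRAINT ALL;
```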
Three items I would add to the list:
Bulk insert - Are you importing the data using a bulk provider (e.g. BCP or SqlBulkCopy) or via individual INSERT/UPDATE statements? (A BULK INSERT sketch follows this list.)
Indexes - Do you have indexes on your target tables? Can they be dropped prior to import and then rebuilt after?
Locking - Is there any contention occurring while you are trying to import data?
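For the first item, a minimal BULK INSERT sketch; the file path, table name and terminators are hypothetical, and TABLOCK plus BATCHSIZE are the two options that usually matter most for load speed and log growth:

```sql
BULK INSERT dbo.StagingSales
FROM 'C:\loads\sales.dat'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    TABLOCK,                  -- table-level lock enables the fastest load path
    BATCHSIZE       = 100000  -- commit every 100k rows instead of one huge transaction
);
```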
You have left out the first place I would start looking: the technique used to import the data.
If the application is inserting single rows, is there a reason for that? Is it using DTS or BULK INSERT or BCP? Is it loading to a staging table or a table with triggers? Is it loading in batches or attempting to load and commit the entire batch? Is there extensive transformation or data type conversion of rows on their way in? Is there extensive transformation of data into a different schema or model?
I wouldn't worry about 1 and 2 until I saw whether the ETL techniques used were sound, and if they are importing data into an existing schema, then you aren't going to have much room to change anything to do with 3. With regard to imports and 4, I prefer not to run much algorithmic work on the data during the load portion.
For the best performance in the most general case, load into a flat staging table with good, reliable basic type conversion and exceptions handled at that time (using SSIS or DTS). For very large data (multi-million-row daily loads, say), I load in 100,000 or 1,000,000 record batches (this is easily settable in BCP or SSIS). Any derived columns are either created at the time of the load (SSIS or DTS) or right after with an UPDATE. Run exceptions, validate the data and create constraints. Then manipulate the data into the final schema as part of one or more transactions - UPDATEs, INSERTs, DELETEs, GROUP BYs for dimensions or entities or whatever.
Obviously, there are exceptions to this and it depends a lot on the input data and the model. For instance, with EBCDIC packed data in inputs, there's no such thing as good reliable basic type conversion in the load stage, so that causes your load to be more complex and slower as the data has to be processed more significantly.
Using this overall approach, we let SQL do what it is good for, and let client applications (or SSIS) do what they are good for.
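A hedged sketch of that final step, moving from the flat staging table into a dimensional model (all table and column names are hypothetical):

```sql
-- Populate the dimension with any new entities found in the staging data...
INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName)
SELECT DISTINCT s.CustomerCode, s.CustomerName
FROM dbo.StagingSales AS s
WHERE NOT EXISTS
      (SELECT 1 FROM dbo.DimCustomer AS d WHERE d.CustomerCode = s.CustomerCode);

-- ...then load the fact rows with a key lookup against the dimension.
INSERT INTO dbo.FactSales (CustomerKey, SaleDate, Amount)
SELECT d.CustomerKey, s.SaleDate, s.Amount
FROM dbo.StagingSales AS s
JOIN dbo.DimCustomer  AS d ON d.CustomerCode = s.CustomerCode;
```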
