So BCP is very, very fast at inserting data into a SQL Server database. What is it doing that makes it so fast?
In SQL Server, BCP input is logged very differently than traditional insert statements. How SQL Server decides to handle the load depends on a number of factors, some of which most developers never even consider, such as which recovery model the database is set to use.
bcp uses the same facility as BULK INSERT and the SqlBulkCopy classes.
More details here
http://msdn.microsoft.com/en-us/library/ms188365.aspx
The bottom line is this: these bulk operations log less data than normal operations and can instruct SQL Server to skip its traditional checks and balances on the incoming data. All of those things together make it faster.
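For illustration, here is a hedged BULK INSERT sketch that uses the same bulk-load facility; the table and file names are hypothetical, and minimal logging only applies under the simple or bulk-logged recovery model when a table lock is taken:

-- Hypothetical table and file; TABLOCK lets the load be minimally logged under simple/bulk-logged recovery
BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales.dat'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,             -- table-level lock so the engine can minimally log the load
    BATCHSIZE = 100000   -- commit in batches so the log does not grow unchecked
);

Note that, by default, a load like this skips CHECK and FOREIGN KEY constraints and does not fire triggers unless CHECK_CONSTRAINTS and FIRE_TRIGGERS are specified, which is part of the "checks and balances" being bypassed.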
It cheats.
It has intimate knowledge of the internals and is able to map your input data more directly to those internals. It can skip or defer other heavyweight operations (parsing, optimization, per-statement transactions, full logging, index maintenance, isolation), and it can make assumptions that apply to every row of data that a normal INSERT statement cannot.
Basically, it's able to skip a bulk of the functionality that makes a database a database, and then clean up after itself en masse at the end.
The main difference I know of between bcp and a normal insert is that bcp doesn't need to write a separate transaction log entry for each individual row it inserts.
The speed comes from using the BCP API of the SQL Server Native Client ODBC driver. According to Microsoft:
http://technet.microsoft.com/en-us/library/aa337544.aspx
The bcp utility (Bcp.exe) is a command-line tool that uses the Bulk Copy Program (BCP) API...
Bulk Copy Functions reference:
http://technet.microsoft.com/en-us/library/ms130922.aspx
Related
I have a database ("DatabaseA") that I cannot modify in any way, but I need to detect the addition of rows to a table in it and then add a log record to a table in a separate database ("DatabaseB") along with some info about the user who added the row to DatabaseA. (So it needs to be event-driven, not merely a periodic scan of the DatabaseA table.)
I know that normally, I could add a trigger to DatabaseA and run, say, a stored procedure to add log records to the DatabaseB table. But how can I do this without modifying DatabaseA?
I have free-reign to do whatever I like in DatabaseB.
EDIT in response to questions/comments ...
Databases A and B are MS SQL 2008/R2 databases (as tagged), users are interacting with the DB via a proprietary Windows desktop application (not my own) and each user has a SQL login associated with their application session.
Any ideas?
Ok, so I have not put together a proof of concept, but this might work.
You can configure an extended events session on databaseB that watches for all the procedures on databaseA that can insert into the table or any sql statements that run against the table on databaseA (using a LIKE '%your table name here%').
This is a custom solution that writes the XE session to a table:
https://github.com/spaghettidba/XESmartTarget
You could probably mimic that functionality by copying the XE event data into a custom user table every minute or so using a SQL Server Agent job.
Your session would monitor databaseA and write the XE output to databaseB; you would then write a trigger so that on each XE output write it compares the two tables and, if there are differences, writes them to your log table. This would be a nonstop running process, but it is still kind of a periodic scan in a way: the XE session only writes when the event happens, but you are still running a check every couple of seconds.
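A hedged sketch of what such a session might look like on the instance hosting databaseA (all names, paths, and the database_id literal are placeholders; on SQL Server 2008/R2 the file target is package0.asynchronous_file_target rather than the later package0.event_file):

-- 7 stands in for databaseA's database_id from sys.databases
CREATE EVENT SESSION WatchTableA ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.username)
    WHERE sqlserver.database_id = 7
)
ADD TARGET package0.asynchronous_file_target
    (SET filename = N'C:\xe\WatchTableA.xel', metadatafile = N'C:\xe\WatchTableA.xem');

ALTER EVENT SESSION WatchTableA ON SERVER STATE = START;

-- Read the captured events and keep only statements that mention the table of interest
;WITH xe AS (
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'C:\xe\WatchTableA*.xel', N'C:\xe\WatchTableA*.xem', NULL, NULL)
)
SELECT
    event_xml.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text,
    event_xml.value('(event/action[@name="username"]/value)[1]', 'nvarchar(256)') AS user_name
FROM xe
WHERE event_xml.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') LIKE N'%YourTableName%';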
I recommend you look at a data integration tool that can mine the transaction log for Change Data Capture events. We have recently been using StreamSets Data Collector for Oracle CDC, and it also supports SQL Server CDC. There are many other competing technologies, including Oracle GoldenGate and Informatica PowerExchange (not PowerCenter). We like StreamSets because it is open source and is designed to build real-time data pipelines between databases at the schema level. Until now we have used batch ETL tools like Informatica PowerCenter and Pentaho Data Integration. I can copy all the tables in a schema in near real time in one StreamSets pipeline, provided the DDL is already deployed in the target; I use this approach between Oracle and Vertica. You can add additional columns to the target and populate them as part of the pipeline.
The only catch might be identifying which user made the change. I don't know whether that is in the SQL Server transaction log. Seems probable but I am not a SQL Server DBA.
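For reference, enabling SQL Server's built-in Change Data Capture on the watched table would look roughly like the sketch below (schema and table names are placeholders; on 2008/R2 this needs Enterprise/Developer edition and SQL Server Agent running). Note that this does modify DatabaseA, which may rule it out here:

USE DatabaseA;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'WatchedTable',   -- placeholder table name
    @role_name     = NULL;              -- no gating role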
I looked at both solutions provided by the time of writing this answer (see the answers from Dan Flippo and dfundaka) but found that the first, using Change Data Capture, required modification to the database, and the second, using Extended Events, wasn't really a complete answer, though it got me thinking of other options.
And the option that seems cleanest, and doesn't require any database modification, is to use SQL Server Dynamic Management Views. These views and functions, which live in the system database, expose server process history - in this case INSERTs and UPDATEs - through objects such as sys.dm_exec_sql_text and sys.dm_exec_query_stats, which hold records of recent database activity (and appear, in fact, to be what Extended Events is based on).
Though it's quite an involved process initially to extract the required information, the queries can be tuned and generalized to a degree.
There are restrictions on transaction history retention, etc but for the purposes of this particular exercise, this wasn't an issue.
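As a rough, hedged sketch of the kind of query involved (the table name is a placeholder, and results are limited to whatever is still in the plan cache):

SELECT
    qs.last_execution_time,
    qs.execution_count,
    st.text AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%TargetTable%'          -- placeholder table name
  AND (st.text LIKE '%INSERT%' OR st.text LIKE '%UPDATE%')
ORDER BY qs.last_execution_time DESC;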
I'm not going to select this answer as the correct one yet partly because it's a matter of preference as to how you approach the problem and also because I'm yet to provide a complete solution. Hopefully, I'll post back with that later. But if anyone cares to comment on this approach - good or bad - I'd be interested in your views.
I am not a SQL developer, but I have a bit of SQL that is longer and more complex than my usual query/update. It is not a stored proc (policy thing).
(The code reads some state into variables, selects some data into temp tables, and performs some updates and deletes. There are a few loops to perform the deletes in small batches, as not to fill the transaction log.)
If I were writing code in Java I could create some test data and step through the method to watch the data being manipulated. Are there any tools that DB developers use to debug their code in a similar fashion?
You haven't provided any details on how your SQL is being processed, eg:
is your application submitting SQL batches to ASE? perhaps as a looping construct submitting prepared statements? [if so, you'll likely have a better chance of finding a debugger for your application; otherwise, add some print/select statements, perhaps based on a debug variable being set, as in the sketch after this list]
is your application submitting a (large) SQL batch to ASE? [if so, you may be able to use ASE's sqldbgr utility to step through your SQL code; you can find more details about ASE's sqldbgr in the ASE Utility Guide]
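A minimal sketch of the debug-variable idea (all object names are illustrative):

-- Illustrative only: a @debug flag gating diagnostic selects between steps of a long batch
DECLARE @debug bit
SELECT @debug = 1

SELECT id, status
INTO #work
FROM dbo.source_table              -- placeholder source
WHERE status = 'PENDING'

IF @debug = 1
    SELECT checkpoint_name = 'after initial load', row_cnt = COUNT(*) FROM #work

DELETE FROM #work WHERE status IS NULL

IF @debug = 1
    SELECT checkpoint_name = 'after cleanup', row_cnt = COUNT(*) FROM #work

DROP TABLE #work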
I am new to SSIS and have a pair of questions
I want to transfer 1,25,000 rows from one table to another in the same database. But when I use a Data Flow Task, it takes far too much time. I tried using an ADO NET Destination as well as an OLE DB Destination, but the performance was unacceptable. When I wrote the equivalent query inside an Execute SQL Task, it gave acceptable performance. Why is there such a difference in performance?
INSERT INTO table1 select * from table2
Based on the first observation, I changed my package. It is now composed exclusively of Execute SQL Tasks, either with a direct query or with a stored procedure. If I can solve my problem using only the Execute SQL Task, then why would one use SSIS at all, as so many documents and articles recommend? I have read that it is reliable, easy to maintain, and comparatively fast.
Difference in performance
There are many things that can cause a difference in performance between a "straight" Data Flow Task and the equivalent Execute SQL Task.
Network latency. You are performing an insert into table A from table B on the same server and instance. In an Execute SQL Task, that work is performed entirely on that machine. By contrast, I could run a package on server B that queries 1.25M rows from server A, which are then streamed over the network to server B. That data is then streamed back to server A for the corresponding INSERT operation. If you have a poor network, wide data (especially binary types), or simply great distance between servers (server A is in the US, server B is in India), there will be poor performance.
Memory starvation. Assuming the package executes on the same server as the target/source database, it can still be slow because the Data Flow Task is an in-memory engine. Meaning, all of the data that flows from the source to the destination must pass through memory. The more memory SSIS can get, the faster it's going to go. However, it has to fight the OS for memory allocations, as well as SQL Server itself. Even though SSIS is SQL Server Integration Services, it does not run in the same memory space as the SQL Server database. If your server has 10GB of memory, the OS uses 2GB, and SQL Server has claimed 8GB, there is little room for SSIS to operate. It cannot ask SQL Server to give up some of its memory, so the OS will have to page while trickles of data move through a constricted data pipeline.
Shoddy destination. Depending on which version of SSIS you are using, the default access mode for an OLE DB Destination was "Table or View." This was a nice setting to try and prevent a low level lock escalating to a table lock. However, this results in row by agonizing row inserts (1.25M unique insert statements being sent). Contrast that with the set-based approach of the Execute SQL Tasks INSERT INTO. More recent versions of SSIS default the access method to the "Fast" version of the destination. This will behave much more like the set-based equivalent and yield better performance.
OLE DB Command Transformation. There is an OLE DB Destination and some folks confuse that with the OLE DB Command Transformation. Those are two very different components with different uses. The former is a destination and consumes all the data. It can go very fast. The latter is always RBAR. It will perform singleton operations for each row that flows through it.
Debugging. There is overhead to running a package in BIDS/SSDT. That package execution gets wrapped in the DTS Debugging Host, which can cause a not-insignificant slowdown of package execution. There's not much the debugger can do about an Execute SQL Task: it runs or it doesn't. With a data flow, there's a lot of memory the debugger can inspect and monitor, which reduces the amount of memory available (see point 2) and also slows things down because of the assorted checks it performs. To get a more accurate comparison, always run packages from the command line (dtexec.exe /file MyPackage.dtsx) or schedule them from SQL Server Agent.
Package design
There is nothing inherently wrong with an SSIS package that is just Execute SQL Tasks. If the problem is easily solved by running queries, then I'd forgo SSIS entirely, write the appropriate stored procedure(s), schedule them with SQL Agent, and be done.
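If you do go that route, a minimal sketch (object names are placeholders) might be:

-- Placeholder names; on newer versions the TABLOCK hint can also enable minimal logging for INSERT...SELECT
CREATE PROCEDURE dbo.usp_CopyTable2ToTable1
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.table1 WITH (TABLOCK)
    SELECT *
    FROM dbo.table2;
END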
That may well be the right call. What I still like about using SSIS even for "simple" cases like this is that it ensures a consistent deliverable. That may not sound like much, but from a maintenance perspective it can be nice to know that everything mucking with the data is contained in these source-controlled SSIS packages. I don't have to remember, or train the new person, that tasks A-C are "simple" so they are stored procs called from a SQL Agent job; tasks D-J, or was it K, are even simpler than that, so they are just "in line" queries in the Agent jobs that load data; and then we have packages for the rest of the stuff; except for the Service Broker thing and some web services, which also update the database. The older I get and the more places I get exposed to, the more value I find in a consistent, even if overkill, approach to solution delivery.
Performance isn't everything, but the SSIS team did set the ETL benchmarks using SSIS so it definitely has the capability to push some data in a hurry.
As this answer grows long, I'll simply leave it at this: the advantages of SSIS and the Data Flow over straight TSQL are native, out of the box
logging
error handling
configuration
parallelization
It's hard to beat those for my money.
If you are passing SSIS variables as parameters in the Parameter Mapping tab and assigning values to those variables via expressions, your Execute SQL Task can spend a lot of time evaluating those expressions.
Use a separate Expression Task to assign the variables instead of using an expression in the Variables window.
I have several services which dump data to a database (Oracle) after processing different input file formats (XML, flat files, etc.). I was wondering if I could have them generate SQL statements instead and log them to the file system, and then have a single SQL processor (something like Java Hibernate) process these SQL files and upload the data to the DB.
What's the fastest way to execute a huge set of SQL statements (spread over a file system, and written by multiple writers) against an Oracle DB? I was considering partitioning the DB and batching the updates. However, I want to know the best practice here. This seems like a common problem and somebody must have faced/resolved it already.
Thanks
Atanu
atanu,
the worst thing to do is to generate huge lists of insert statements. If you want speed and you know the layout of your data, use external tables to load the data into your Oracle database. This works a lot like SQL*Loader, but you can access your data as a table. In the table definition your data fields are mapped to your column names and data types.
This is by far the fastest way to do bulk loads into your database.
See Managing External Tables for some documentation.
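A hedged sketch of such an external table (the directory object, file, column, and target table names are all placeholders):

-- Assumes a DIRECTORY object data_dir pointing at the folder containing sales.csv
CREATE TABLE ext_sales (
    sale_id    NUMBER,
    cust_name  VARCHAR2(100),
    amount     NUMBER(12,2)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('sales.csv')
)
REJECT LIMIT UNLIMITED;

-- Then load the real table with one set-based, direct-path insert
INSERT /*+ APPEND */ INTO sales SELECT * FROM ext_sales;
COMMIT;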
What the best practice is rather depends on your criteria for determining "best". The approach taken in many places is to use an ETL tool, perhaps Oracle Warehouse Builder, perhaps a third-party product. This need not be an expensive product: Pentaho offers Kettle in a free "self-supported" community edition.
When it comes to rolling your own, I don't think Hibernate is the way to go. Especially if your main concern is performance. I also think changing your feeds to generate SQL statements is an overly-complicated solution. What is wrong with PL/SQL modules to read the files and execute the SQL natively?
Certainly when I have done things like this before it has been with PL/SQL. The trick is to separate the input reading layer from the data writing layer. This is because the files are likely to require a lot of bespoke coding whereas the writing stuff is often fairly generic (this obviously depends on the precise details of your application).
A dynamic metadata-driven architecture is an attractive concept, particularly if your input structures are subject to a lot of variability. However such an approach can be difficult to debug and to tune. Code generation is an alternative technique.
When it comes to performance look to use bulk processing as much as possible. This is the main reason to prefer PL/SQL over files with individual SQL statements. Find out more.
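A minimal sketch of that bulk-processing pattern in PL/SQL (table names are placeholders, and the staging and target tables are assumed to share the same structure):

-- BULK COLLECT the parsed rows in chunks and insert them with FORALL instead of row-by-row inserts
DECLARE
    TYPE t_rows IS TABLE OF staging_table%ROWTYPE;
    l_rows t_rows;
    CURSOR c_src IS SELECT * FROM staging_table;
BEGIN
    OPEN c_src;
    LOOP
        FETCH c_src BULK COLLECT INTO l_rows LIMIT 1000;  -- process 1000 rows at a time
        EXIT WHEN l_rows.COUNT = 0;

        FORALL i IN 1 .. l_rows.COUNT
            INSERT INTO target_table VALUES l_rows(i);

        COMMIT;
    END LOOP;
    CLOSE c_src;
END;
/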
The last thing you want is a bunch of insert statements... a SUPER slow approach (it doesn't matter how many processes you're running, trust me). Getting all the files into a delimited format and doing a DIRECT load into Oracle via sqlldr would be the simplest approach (and very fast).
If you want maximum performance, you don't want tons of SQL statements. Instead, have a look at Oracle Data Pump.
And don't do any preprocessing for flat files. Instead feed them directly to impdp (the Oracle Data Pump Importer).
If importing the data requires transformations, updates, etc., then best practice is to load the data into a staging table (with Data Pump), do the preprocessing on the staging table, and then merge the data into the production tables.
Preprocessing outside the database is usually very limited, since you don't have access to the already loaded data. So you cannot even check whether a record is new or an update to an existing one.
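A hedged sketch of the staging-then-merge step (table and column names are placeholders):

-- Update rows that already exist in the production table, insert the ones that don't
MERGE INTO prod_customers p
USING stg_customers s
ON (p.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET p.name  = s.name,
               p.email = s.email
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email)
    VALUES (s.customer_id, s.name, s.email);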
As others have mentioned, there are some tools you should look into if performance is your only concern.
But there are some advantages to using plain SQL statements. Many organizations have regulations, policies, and stubborn developers that will block any new tools. A simple SQL script is the universal language of your database; it's pretty much guaranteed to work anywhere.
If you decide to go with SQL statements you need to avoid scripts like this:
insert into my_table values(...);
insert into my_table values(...);
...
And replace them with a single statement that unions multiple rows:
insert into my_table
select ... from dual union all
select ... from dual union all
...
The second version will run several times faster.
However, picking the right batch size is tricky. A large number of small inserts will waste a lot of time on communication and other overhead, but Oracle parse time grows exponentially with very large statements. In my experience 100 rows is usually a good number; parsing gets really slow around a thousand. Also, use the "union all" method and avoid the multi-table insert trick. For some reason multi-table insert is much slower, and some Oracle versions have bugs that will cause your query to hang at 501 tables.
(You can also create a somewhat similar script using PL/SQL. A 1 megabyte PL/SQL procedure will compile much faster than a 1 megabyte SQL statement will parse. But creating the script is complicated; collections, dynamic sql, handling all the types correctly, creating a temporary object instead of an anonymous block because large anonymous blocks cause Diana node errors, etc. I've built a procedure like this, and it worked well, but it probably wasn't worth the effort.)
What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow doing this as fast as possible. I'm not requiring this data to be indexed during this initial population.
Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird I think) have the ability to backup all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, that seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
The fastest method is probably running an INSERT statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even from the same database, a number of times. It all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (removing instances where a COUNT with GROUP BY was less than or equal to 2).
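As a hedged T-SQL sketch of the random-selection idea (table and column names are placeholders):

-- Copy roughly 10% of the source rows at random; repeat or change the modulus to reach the target volume
INSERT INTO dbo.test_table (col1, col2)
SELECT col1, col2
FROM dbo.source_table
WHERE ABS(CHECKSUM(NEWID())) % 10 = 0;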
Also, Red Gate actually provides a utility to do just what you're asking. It's not free, and I think it's SQL Server-specific, but their tools are top notch and well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means; a decent developer should be able to knock out a similarly featured (though alpha/hardcoded) version of the app in a day or two.
You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL Server (2005) database. For SQL Server, it appears that an SSIS package is the fastest way to bulk import data into a database.
It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in a relatively short order. 10 million rows is nothing.
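A hedged sketch of that partition-by-partition pattern in Oracle (all names are placeholders, and fact_sales_ix is assumed to be a local partitioned index):

-- Direct-path insert into one partition, then rebuild only that partition's index
INSERT /*+ APPEND */ INTO fact_sales PARTITION (p_2024_01)
SELECT * FROM stg_sales_2024_01;
COMMIT;

ALTER INDEX fact_sales_ix REBUILD PARTITION p_2024_01;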
Use MySQL or MS SQL and built-in functions to generate records inside the database engine. Or generate a text file (in a CSV-like format) and then use bulk copy functionality.