I have several services which dump data to a database (Oracle) after processing different input file formats (XML, flat files, etc.). I was wondering if I could have them generate SQL statements instead and write them to a file system, and have a single SQL processor (something like Java Hibernate) process these SQL files and load them into the DB.
What's the fastest way to execute a huge set of SQL statements (spread over a file system and written by multiple writers) against an Oracle DB? I was considering partitioning the DB and batch updates. However, I want to know the best practice here. This seems like a common problem and somebody must have faced/resolved it already.
Thanks
Atanu
Atanu,
The worst thing you can do is generate huge lists of INSERT statements. If you want speed and you know the layout of your data, use external tables to load the data into your Oracle database. This works a lot like SQL*Loader, but you access your data through a table: in the table definition your data fields are mapped to your column names and data types.
This is by far the fastest way to do bulk loads into your database.
See Managing External Tables for some documentation.
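As an illustration only (the directory, file, table, and column names here are assumptions, not taken from the question), an external table definition might look roughly like this:
-- minimal sketch: expose a comma-delimited feed file as a queryable table
CREATE OR REPLACE DIRECTORY feed_dir AS '/data/feeds';
CREATE TABLE orders_ext (
  order_id    NUMBER,
  customer_id NUMBER,
  amount      NUMBER(12,2)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY feed_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('orders.csv')
)
REJECT LIMIT UNLIMITED;
-- loading then becomes a plain (optionally direct-path) insert
INSERT /*+ APPEND */ INTO orders SELECT * FROM orders_ext;
COMMIT;
The target table orders is assumed to have the same three columns; adjust the layout to your own feeds.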
What the best practice is rather depends on your criteria for determining "best". In many places the approach taken is to use an ETL tool, perhaps Oracle Warehouse Builder, perhaps a third-party product. This need not be an expensive product: Pentaho offers Kettle in a free "self-supported" community edition.
When it comes to rolling your own, I don't think Hibernate is the way to go, especially if your main concern is performance. I also think changing your feeds to generate SQL statements is an overly complicated solution. What is wrong with PL/SQL modules to read the files and execute the SQL natively?
Certainly when I have done things like this before it has been with PL/SQL. The trick is to separate the input reading layer from the data writing layer. This is because the files are likely to require a lot of bespoke coding whereas the writing stuff is often fairly generic (this obviously depends on the precise details of your application).
A dynamic metadata-driven architecture is an attractive concept, particularly if your input structures are subject to a lot of variability. However such an approach can be difficult to debug and to tune. Code generation is an alternative technique.
When it comes to performance, look to use bulk processing as much as possible. This is the main reason to prefer PL/SQL over files of individual SQL statements.
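To make the bulk-processing point concrete, here is a minimal sketch of the BULK COLLECT / FORALL pattern; the source (orders_ext), target table (orders), and their columns are illustrative assumptions:
DECLARE
  TYPE t_ids  IS TABLE OF orders.order_id%TYPE;
  TYPE t_cust IS TABLE OF orders.customer_id%TYPE;
  TYPE t_amt  IS TABLE OF orders.amount%TYPE;
  l_ids t_ids; l_cust t_cust; l_amt t_amt;
  CURSOR c_src IS
    SELECT order_id, customer_id, amount FROM orders_ext;  -- reading layer
BEGIN
  OPEN c_src;
  LOOP
    FETCH c_src BULK COLLECT INTO l_ids, l_cust, l_amt LIMIT 1000;  -- read in chunks
    EXIT WHEN l_ids.COUNT = 0;
    FORALL i IN 1 .. l_ids.COUNT  -- writing layer: one context switch per chunk
      INSERT INTO orders (order_id, customer_id, amount)
      VALUES (l_ids(i), l_cust(i), l_amt(i));
    COMMIT;
  END LOOP;
  CLOSE c_src;
END;
/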
The last thing you want is a bunch of INSERT statements... a SUPER slow approach (it doesn't matter how many processes you're running, trust me). The simplest approach (and a very fast one) is to get all the files into a delimited format and do a DIRECT load into Oracle via sqlldr.
If you want maximum performance, you don't want tons of SQL statements. Instead, have a look at Oracle Data Pump.
And don't do any preprocessing for flat files. Instead feed them directly to impdp (the Oracle Data Pump Importer).
If importing the data requires transformations, updates, etc., then best practice is to load the data into a staging table (with Data Pump), do the preprocessing on the staging table, and then merge the data into the production tables.
Preprocessing outside the database is usually very limited, since you don't have access to the already loaded data. So you cannot even check whether a record is new or an update to an existing one.
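A minimal sketch of that staging-then-merge step, with illustrative table and column names (not taken from the answer):
MERGE INTO customers tgt
USING customers_stage src
   ON (tgt.customer_id = src.customer_id)
WHEN MATCHED THEN
  UPDATE SET tgt.name  = src.name,
             tgt.email = src.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email)
  VALUES (src.customer_id, src.name, src.email);
Because the staging data is already inside the database, the MERGE can decide per row whether it is an insert or an update, which is exactly the check you cannot do outside the database.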
As others have mentioned, there are some tools you should look into if performance is your only concern.
But there are some advantages to using plain SQL statements. Many organizations have regulations, policies, and stubborn developers that will block any new tools. A simple SQL script is the universal language of your database; it's pretty much guaranteed to work anywhere.
If you decide to go with SQL statements, you need to avoid scripts like this:
insert into my_table values(...);
insert into my_table values(...);
...
And replace them with a single statement that unions multiple rows:
insert into my_table
select ... from dual union all
select ... from dual union all
...
The second version will run several times faster.
However, picking the right batch size is tricky. A large number of small inserts will waste a lot of time on communication and other overhead, but Oracle parse time grows very quickly with very large statements. In my experience 100 rows per statement is usually a good number; parsing gets really slow around a thousand. Also, stick with the "union all" method and avoid the multi-table insert trick. For some reason multi-table insert is much slower, and some Oracle versions have bugs that will cause your query to hang at 501 tables.
(You can also create a somewhat similar script using PL/SQL. A 1 megabyte PL/SQL procedure will compile much faster than a 1 megabyte SQL statement will parse. But creating the script is complicated: collections, dynamic SQL, handling all the types correctly, creating a temporary object instead of an anonymous block because large anonymous blocks cause Diana node errors, etc. I've built a procedure like this, and it worked well, but it probably wasn't worth the effort.)
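For what it's worth, a rough sketch of what such a generated procedure might look like (the table name, columns, and values are purely illustrative; a real generator would emit thousands of assignments):
CREATE OR REPLACE PROCEDURE load_my_table AS
  TYPE t_rows IS TABLE OF my_table%ROWTYPE INDEX BY PLS_INTEGER;
  v t_rows;
BEGIN
  v(1).col1 := 'a'; v(1).col2 := 1;
  v(2).col1 := 'b'; v(2).col2 := 2;
  -- ...generated assignments continue...
  FORALL i IN 1 .. v.COUNT
    INSERT INTO my_table VALUES v(i);
  COMMIT;
END load_my_table;
/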
I've found various opinions on this one, including a reference to an article indicating that SELECT INTO can run in parallel from SQL Server 2014 onwards and may or may not be more efficient than INSERT as a result.
My use case is moving data from one table to another identical table across databases on the same instance (SQL Server 2014). The inserts will be 5-10M rows-ish, and I don't care about logging, just efficiency. I need a general recommendation, not a case-by-case analysis.
I realize that there are other factors (row length, etc.) that might affect the answer, but I'm looking for the best place to start. I can always try other methods if necessary.
So what's the most efficient way to load a table in one database from an identical table in another?
Thanks in advance!
I would suggest an SSIS (SQL Server Integration Services) package that performs BULK operations, although 5M rows isn't significant in our current world.
Since "it depends", you'll have to help us understand what you're trying to save. INSERT INTO is nice only in that it is self-contained and "easy." If this is a one-time deal, you might do it this way and stop thinking about it.
If, however, you're going to be shoveling 10M records daily, you might consider a scheduled SSIS package. There is overhead to maintaining the package, but it is generally faster. If you are reloading data for testing purposes (reset to baseline), then the SSIS package is a good way to go.
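If you do go the self-contained INSERT INTO route, a minimal T-SQL sketch of the cross-database copy might look like this (database, schema, and table names are illustrative; with the target under the SIMPLE or BULK_LOGGED recovery model and a TABLOCK hint, SQL Server can often minimally log the insert, which matters at the 5-10M row scale):
INSERT INTO TargetDb.dbo.Orders WITH (TABLOCK)
       (OrderID, CustomerID, Amount, CreatedAt)
SELECT OrderID, CustomerID, Amount, CreatedAt
FROM   SourceDb.dbo.Orders;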
You might also look at this article: https://dba.stackexchange.com/questions/99367/insert-into-table-select-from-table-vs-bulk-insert
If this has been answered elsewhere, please post a link to it, yell at me, and close this question. I looked around and saw similar things, but didn't find exactly what I was looking for.
I am currently writing several stored procedures that require data from another database. That database could be on another server or the same server; it just depends on the customer's network. I want to use a synonym so that if the location of the table that I need data from changes, I can update the synonym once and not have to go back into all of the stored procedures and update their references.
What I want to know is what the best approach is with a synonym. I read a post on SO before that said there was a performance hit when using a view or table (especially across a linked server), possibly because SQL Server cannot make use of the indexes on the underlying tables when going through synonyms. I can't find that post anymore or I would post a link to it. It was suggested that the best approach is to create a synonym for a stored procedure and load the resulting data into a memory or temp table.
I may not have my facts straight on that, though, and was hoping for some clarification. From what I can tell, creating and loading data into memory tables generally accounts for a large percentage of the execution plan. Is using a stored procedure worth the extra effort of loading the data into a table over just being able to run queries against a view or table? What is the most efficient way to get data from another database using a synonym?
Thanks!
Synonyms are just defined aliases to make redirection easier; they have no performance impact worth considering. And yes, they are advised for redirection; they do make it a lot easier.
What a synonym points to on the other hand can have a significant performance impact (this has nothing to do with the synonym itself).
Using tables and views in other databases on the same server-instance has a small impact. I've heard 10% quoted and I can fairly say that I have never observed it to be higher than that. This impact is mostly from reductions in the optimizer's efficiency, as far as I can tell.
Using objects on other server-instances, whether through linked server definitions or OPENQUERY, is another story entirely. These tend to be much slower, primarily because of the combined effects of MS DTC and the optimizer deciding to do almost no optimization for the remote aspects of a query. This tends to be bearable for small queries and small remote tables, but increasingly awful the bigger the query and/or remote table is.
Most practitioners eventually decide on one of two fixes for this problem: either 1) if it is a table, just copy the remote table's rows to a local #temp table first and then query on that, or 2) if it is more complex, write a stored procedure on the remote server and then execute it with INSERT INTO ... EXECUTE AT to retrieve the remote info.
As for how to use/organize your synonyms, my advice would be to create a separate owner-schema in your database (with an appropriate name like [Remote]) and then put all of your Synonyms there. Then when you need to redirect, you can write a stored procedure that will automatically find all of the synonyms pointing to the old location and change them to the new location (this is how I do it). Makes it a lot easier to deal with location/name changes.
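A hedged sketch of that layout (the server, database, and object names here are illustrative assumptions):
CREATE SCHEMA Remote;
GO
CREATE SYNONYM Remote.Customers
    FOR [RemoteSrv].[OtherDb].[dbo].[Customers];
GO
-- Redirection is just drop-and-recreate; a helper procedure can loop over
-- sys.synonyms WHERE SCHEMA_NAME(schema_id) = 'Remote' and rebuild each entry
-- against the new location.
DROP SYNONYM Remote.Customers;
CREATE SYNONYM Remote.Customers
    FOR [NewRemoteSrv].[OtherDb].[dbo].[Customers];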
Choosing Option 1 or 2 depends on the nature of your query. If you can retrieve the data with a relatively simple SELECT with a good WHERE clause to restrict the number of rows, then Option 1 is generally the best choice. DO NOT JOIN local and remote tables: pull the remote data into a local #temp table and join the local tables to that temp table in a separate query.
If the query is more complex, with multiple joins and/or complex WHERE conditions, then retrieving the data into a local #temp table via a remote procedure call is generally the best choice. Again, DO NOT JOIN local tables to remote objects, and minimize the number/size of the parameters to the remote procedure.
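Both options follow the same shape; here is a hedged sketch with illustrative names (Option 2 also assumes the linked server [RemoteSrv] has RPC OUT enabled):
-- Option 1: pull only the rows you need into a local #temp table, then join locally.
SELECT CustomerID, Name
INTO   #RemoteCustomers
FROM   Remote.Customers            -- synonym pointing at the linked-server table
WHERE  Region = 'West';
SELECT o.OrderID, c.Name
FROM   dbo.Orders AS o
JOIN   #RemoteCustomers AS c ON c.CustomerID = o.CustomerID;
-- Option 2: let a procedure on the remote server do the complex work, then capture its result set locally.
CREATE TABLE #RemoteResults (CustomerID int, Name nvarchar(100));
INSERT INTO #RemoteResults (CustomerID, Name)
EXEC ('EXEC dbo.GetWestCustomers') AT [RemoteSrv];
SELECT o.OrderID, r.Name
FROM   dbo.Orders AS o
JOIN   #RemoteResults AS r ON r.CustomerID = o.CustomerID;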
The balance point between a "simple SELECT" and a "complex SELECT" is a matter of knowing your data and testing.
HTH :)
I'm building a .net web app that will involve validating an email input field against a list of acceptable email addresses. There could be up to 10,000 acceptable values and they will not change very often. When they do, the entire list would be replaced, not individual entries.
I'm debating the best way to implement this. We have a SQL Server database, but since these records will be relatively static and only replaced in bulk, I'm considering just referencing/searching text files containing the string values. It seems like that would make the upload process easier, and there is little benefit to having this info in an RDBMS.
Feedback appreciated.
If the database is already there, then use it. What you are talking about is exactly what databases are designed to do. If down the road you decide you need to do something slightly more complex, you will be very glad you went with the DB.
I'm going to make a few assumptions about your situation.
Your data set contains more than 1 column of usable data.
Your record set contains more rows than you will display at any one time.
Your data will need to be formatted into some kind of output view (e.g. HTML).
Here are some specific reasons why your semi-static data should stay in SQL and not in a text file (a minimal sketch follows the list).
You will not have to parse the text data every time you wish to read and process it (string parsing is a relatively heavy memory and CPU load); SQL Server stores your columns as structured, pre-parsed data.
You will not have to develop your own row-filtering or searching algorithm (or implement a library that does it for you). SQL Server is already a sophisticated engine that applies advanced caching and query optimization, with many underlying algorithms for seeks, scans, indexing, hashing, and so on.
You have the option with SQL to expand your solution's robustness and integration with other tools over time. (Putting the data into a text or XML file will limit future possibilities).
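As a minimal sketch of what this could look like (the table, column, variable names, and file path are all illustrative assumptions):
CREATE TABLE dbo.AllowedEmails (
    Email nvarchar(320) NOT NULL,
    CONSTRAINT PK_AllowedEmails PRIMARY KEY (Email)   -- indexed lookups come for free
);
-- Bulk replacement of the whole list when it changes.
BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.AllowedEmails;
    BULK INSERT dbo.AllowedEmails
    FROM 'C:\data\allowed_emails.txt'
    WITH (ROWTERMINATOR = '\n');
COMMIT;
-- Validation check the web app would run.
DECLARE @InputEmail nvarchar(320) = N'someone@example.com';
SELECT CASE WHEN EXISTS (SELECT 1 FROM dbo.AllowedEmails WHERE Email = @InputEmail)
            THEN 1 ELSE 0 END AS IsAllowed;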
Caveats:
Your particular SQL Server implementation can be affected by disk I/O and network latency. To tune disk I/O performance, a well-constructed SQL Server installation places the data files (.mdf) on fast multi-spindle disk arrays tuned for fast reads, and separates them from the log files (.ldf) on spindles tuned for fast writes.
You might find that the latency and busy-ness (not a word) of the SQL Server can affect performance. If you're running on a very busy or slow SQL Server, you might be in a situation where you'd look to a local-file alternative, in which case I would recommend a structured format such as XML. (However, if you find yourself looking for workarounds to avoid using your SQL Server, it would probably be best to invest some time/money into improving your SQL implementation.)
What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow for a way to do this as fast as possible. I'm not requiring this data to be indexed during this initial data-table population.
Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction-aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, that seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
The fastest method is probably running an INSERT statement with a SELECT FROM. I've generated test data to populate tables from other databases and even the same database a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (eliminating instances where a count with GROUP BY was less than or equal to 2).
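The simplest version of that idea is just multiplying existing rows up to the volume you need; a hedged, T-SQL-flavored sketch with illustrative table and column names:
INSERT INTO dbo.TestOrders (CustomerID, Amount, CreatedAt)
SELECT o.CustomerID, o.Amount, o.CreatedAt
FROM   dbo.Orders AS o
CROSS JOIN (SELECT TOP (100) 1 AS n
            FROM sys.all_objects) AS multiplier;   -- 100 copies of every source row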
Also, Red Gate actually provides a utility to do just what you're asking. It's not free and I think it's SQL Server-specific, but their tools are top notch. Well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means. A decent developer should be able to knock out a similarly-featured (though alpha/hardcoded) version of the app in a day or two...
You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL Server (2005) database. For SQL Server, it appears that an SSIS/DTS package is the fastest way to bulk import data into a database.
It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in relatively short order. 10 million rows is nothing.
Use MySQL or MS SQL and embedded functions to generate records inside the database engine. Or generate a text file (in a CSV-like format) and then use bulk copy functionality.
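For the "generate records inside the engine" option, a hedged T-SQL sketch (the target table and its two columns are illustrative; the cross join of seven digit sets yields 10^7 rows):
;WITH digits AS (
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(n)
),
nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM digits a, digits b, digits c, digits d2, digits e, digits f, digits g
)
INSERT INTO dbo.BigTable WITH (TABLOCK) (Id, Payload)
SELECT rn, 'row-' + CAST(rn AS varchar(12))
FROM   nums;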
This question is related to another:
Will having multiple filegroups help speed up my database?
The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.
Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.
The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 or so, simple queries like CREATE TABLE and DROP TABLE can slow down dramatically.
We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?
Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.
Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key; a dimensional model may also be relevant here. This table can be partitioned for better performance, which will also allow you to spread the table over other filegroups.
Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).
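A hedged sketch of the first suggestion, the RunID-keyed combined table with partitioning (all names, types, and boundary values are illustrative, and table partitioning on SQL Server 2005 requires Enterprise Edition):
CREATE PARTITION FUNCTION pfRunId (int)
    AS RANGE RIGHT FOR VALUES (100, 200, 300);        -- boundaries grow as runs accumulate
CREATE PARTITION SCHEME psRunId
    AS PARTITION pfRunId ALL TO ([PRIMARY]);          -- or spread across several filegroups
CREATE TABLE dbo.AnalysisResults (
    RunID    int   NOT NULL,
    MetricID int   NOT NULL,
    Value    float NOT NULL,
    CONSTRAINT PK_AnalysisResults PRIMARY KEY (RunID, MetricID)
) ON psRunId (RunID);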
CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.
I also recommend talking to Microsoft about your choice of database design.
Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.
If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)
You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.
You are going to need a combination of serious data warehousing, and data/table partitioning. Depending on how much data you want to keep and archive you may need to start de-normalizing and flattening the tables.
This would be a pretty good case where contacting Microsoft directly can be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.
We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
This approach allows the system catalog of each database to be more reasonable, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.
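A rough sketch of how such a layout might look (the names are illustrative, not our actual schema):
CREATE TABLE dbo.Databases (
    DatabaseID   int     NOT NULL PRIMARY KEY,
    DatabaseName sysname NOT NULL              -- e.g. 'RunDb_042'
);
CREATE TABLE dbo.Runs (
    RunID      int NOT NULL PRIMARY KEY,
    DatabaseID int NOT NULL REFERENCES dbo.Databases (DatabaseID)
);
-- Retrieval code looks up the database name and prefixes it onto the query.
DECLARE @db sysname, @sql nvarchar(max);
SELECT @db = d.DatabaseName
FROM   dbo.Runs AS r JOIN dbo.Databases AS d ON d.DatabaseID = r.DatabaseID
WHERE  r.RunID = 42;
SET @sql = N'SELECT * FROM ' + QUOTENAME(@db) + N'.dbo.ResultSummary WHERE RunID = 42';
EXEC sys.sp_executesql @sql;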
We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)