Fastest method to fill a database table with 10 Million rows - database

What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow this to be done as fast as possible. I'm not requiring this data to be indexed during this initial data-table population.

Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, which seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.

The fastest method is probably running an INSERT statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even from the same database, a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where the table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (I eliminated instances where a count with GROUP BY was less than or equal to 2).
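A minimal sketch of that approach in T-SQL, with purely illustrative table and column names:

-- Multiply rows from an existing table into the test table; the cross join
-- against a small values list inflates the row count tenfold per pass.
INSERT INTO test_people (first_name, last_name, city)
SELECT p.first_name, p.last_name, p.city
FROM real_people AS p
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) AS n(i);

-- Strip out values rare enough to be identifying, e.g. names occurring
-- two times or fewer.
DELETE FROM test_people
WHERE last_name IN (
    SELECT last_name
    FROM test_people
    GROUP BY last_name
    HAVING COUNT(*) <= 2
);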
Also, Red Gate actually provides a utility to do just what you're asking. It's not free and I think it's SQL Server-specific, but their tools are top notch. Well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means. A decent developer should be able to knock out a similarly-featured, though alpha/hardcoded, version of the app in a day or two...

You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL Server (2005) database. For SQL Server, it appears that an SSIS package (the successor to DTS) is the fastest way to bulk import data into a database.

It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in a relatively short order. 10 million rows is nothing.
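When the source is already queryable (for instance via an external table), a direct-path insert gives much the same benefit; a rough sketch, with illustrative object names:

-- NOLOGGING plus the APPEND hint lets Oracle write blocks directly above
-- the table's high-water mark, bypassing the conventional insert path.
ALTER TABLE big_table NOLOGGING;

INSERT /*+ APPEND */ INTO big_table
SELECT * FROM staging_table;

COMMIT;

-- Rebuild indexes afterwards (per partition if the table is partitioned).
ALTER INDEX big_table_pk REBUILD NOLOGGING;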

Use MySQL or MS SQL and built-in functions to generate records inside the database engine. Or generate a text file (in a CSV-like format) and then use bulk copy functionality.
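For the SQL Server bulk copy route, a minimal sketch assuming a pre-created target table and an illustrative file path:

-- BULK INSERT reads the file server-side; TABLOCK allows minimal logging
-- when the target is a heap or an empty table.
BULK INSERT dbo.target_table
FROM 'C:\load\rows.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK
);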


How can I link a SQL image column to an external database?

I need a bit of help with the following.
Note: in the following scenario, I do not have access to the application's source code, therefore I can only make changes at the database level.
Our database uses dbo.[BLOB] to store all kinds of files and documents. The table uses an IMAGE (yeah, obsolete) data type. Since this particular table is growing quite fast, I was thinking of implementing some archiving feature.
My idea is to move all files older than X months to a second database, and then somehow link from the dbo.[BLOB] table to the external/archiving database.
Is this even possible? The goal is to reduce the database size, in order to improve backup and query performance.
Any ideas and hints much appreciated.
Thanks.
Fabian
There are 2 features to help you with backup speed and database size in this case:
FILESTREAM will allow you to store BLOBs as files on the file system instead of in the database file. It complicates the backup scenario, since you have to back up both the database and the files, but you get a smaller database file along with faster access time to the documents. It is much faster to read a file from the filesystem than from a blob column. Additionally, FILESTREAM allows for files bigger than 2 GB.
Partitioning will split the table into smaller chunks at the physical level. This way you do not need to touch application code to change where particular rows are stored physically: you can decide which data needs to be accessed fast and put it on an SSD drive, and which can land on slower archive storage. You can also take more frequent backups of the current partition and less frequent ones of the archive.
Prior to SQL Server 2016 SP1, this feature was available in the Enterprise edition only. From SQL Server 2016 SP1 it is available in all editions.
In your case most likely you should go with filestream first.
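A minimal sketch of a FILESTREAM-enabled table (the database is assumed to already have a FILESTREAM filegroup; all names are illustrative):

-- FILESTREAM requires a ROWGUIDCOL column with a unique constraint.
CREATE TABLE dbo.BlobArchive
(
    BlobId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName NVARCHAR(260)  NOT NULL,
    Content  VARBINARY(MAX) FILESTREAM NULL
);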
Without modifying the application you can do, basically, nothing. You may try to see if changing the column type will be tolerated by the application (very unlikely, 99.99% it will break the app) and try to use FILESTREAM, but even if you succeed it does not give much benefit (the backup size will be the same, for example).
A second thing you can try is to replace the table with a view, using INSTEAD OF triggers for updates. It is still very likely to break the application (let's say 99.98%). The goal would be to have a distributed partitioned view (or cross-DB partitioned view) which presents to the application a unified view of the 'cold' and 'hot' data. It is complex and error prone, but it will reduce the size of the backups (as long as data is moved from hot to cold and the cold data is immutable, requiring few backups).
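A rough sketch of the cross-database view idea, with illustrative column and database names (the INSTEAD OF triggers are omitted, and the real table would first have to be renamed out of the way):

-- Present hot and cold rows to the application as one object.
CREATE VIEW dbo.[BLOB]
AS
SELECT BlobId, CreatedOn, Document FROM dbo.BLOB_Hot
UNION ALL
SELECT BlobId, CreatedOn, Document FROM ArchiveDB.dbo.BLOB_Cold;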
"The goal is to reduce the database size, in order to improve backup and query performance."
To reduce the backup size, as I explained above, you can do, basically, nothing. But performance you need to investigate and address appropriately, based on your findings. Saying that the database is slow 'because of BLOBs' is hand-waving.

SQL Server table or flat text file for infrequently changed data

I'm building a .net web app that will involve validating an email input field against a list of acceptable email addresses. There could be up to 10,000 acceptable values and they will not change very often. When they do, the entire list would be replaced, not individual entries.
I'm debating the best way to implement this. We have a SQL Server database, but since these records will be relatively static and only replaced in bulk, I'm considering just referencing/searching text files containing the string values. It seems like that would make the upload process easier, and there is little benefit to having this info in an RDBMS.
Feedback appreciated.
If the database is already there then use it. What you are talking about is exactly what databases are designed to do. If down the road you decide you need to do something slightly more complex, you will be very glad you went with the DB.
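A minimal sketch of the table-based approach, with illustrative names; the validation query is the shape you would run from the app, with the address supplied as a parameter:

-- One narrow, indexed column is all the lookup needs.
CREATE TABLE dbo.AllowedEmail
(
    Email NVARCHAR(320) NOT NULL PRIMARY KEY
);

-- Bulk replacement: wipe and reload the whole list in one transaction
-- (load the new rows with BULK INSERT, SqlBulkCopy, or a table-valued parameter).
BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.AllowedEmail;
    -- INSERT of the new list goes here
COMMIT;

-- Validation is a single indexed seek (@Email supplied by the application).
SELECT CASE WHEN EXISTS (SELECT 1 FROM dbo.AllowedEmail WHERE Email = @Email)
            THEN 1 ELSE 0 END AS IsAllowed;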
I'm going to make a few assumptions about your situation.
Your data set contains more than 1 column of usable data.
Your record set contains more rows than you will display at any one time.
Your data will need to be formatted into some kind of output view (e.g. HTML).
Here are some specific reasons why your semi-static data should stay in SQL and not in a text file.
You will not have to parse the text data every time you wish to read and process it. (String parsing is a relatively heavy memory and CPU load). SQL will store your columns as structured data which are pre-parsed.
You will not have to develop your own row-filtering or searching algorithm (or implement a library that does it for you). SQL Server is already a sophisticated engine that applies caching and query optimization (with many underlying algorithms for seeks, scans, indexes, hashing, etc.).
You have the option with SQL to expand your solution's robustness and integration with other tools over time. (Putting the data into a text or XML file will limit future possibilities).
Caveats:
Your particular SQL Server implementation can be affected by disk I/O and network latency. To tune disk I/O performance, a well-constructed SQL Server installation places the data files (.mdf) on fast multi-spindle disk arrays tuned for fast reads, and separates them from the log files (.ldf) on spindles tuned for fast writes.
You might find that the latency and busyness of the SQL Server can affect performance. If you're running on a very busy or slow SQL Server, you might be in a situation where you'd look to a local-file alternative, in which case I would recommend a structured format such as XML. (However, if you find yourself looking for workarounds to avoid using your SQL Server, it would probably be best to invest some time/money into improving your SQL implementation.)

What should I use for a Database?

My vb.net code calculates the growth rate of a company's stock price for every quarter from 1901 to present and stores it in a datatable. This takes a while to perform (10-15 minutes). I would like to save the information in the datatable after it is calculated so that I don't have to recalculate past growth rates every time I run the program. When I open my program I want the datatable to contain any growth rates that have already been calculated so I only have to calculate growth rates for new quarters.
Should I store my datatable in a database of some kind or is there another way to do this? My datatable is quite large. It currently has 450 columns (one for each quarter from 1901 to present) and can have thousands of rows (one for each company). Is this too big for Microsoft Access? Would Microsoft Excel be an option?
Thanks!
First of all, it's unclear you actually need a database. If you don't need things such as concurrent access, client/server operation, ACID transactions etc... you might as well just implement your cache using the file system.
If you conclude you do need a DBMS, there are many good choices, including free such as: PostgreSQL, MS SQL Server Express, Oracle Express, MySQL, Firebird, SQLite etc... or commercial such as: Oracle, MS SQL Server, IBM DB2, Sybase etc...
I suggest you make your data model flexible, so you don't have to add a new column for each new quarter:
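A minimal sketch of such a model, with illustrative names and types:

-- One row per company per quarter; new quarters become new rows, not new columns.
CREATE TABLE company_growth
(
    company_name VARCHAR(100) NOT NULL,
    fiscal_year  SMALLINT     NOT NULL,
    fiscal_qtr   TINYINT      NOT NULL,
    growth_rate  DECIMAL(9,4) NULL,
    CONSTRAINT pk_company_growth PRIMARY KEY (company_name, fiscal_year, fiscal_qtr)
);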
This model is also well suited for clustering (if your DBMS of choice supports it), so the calculations belonging to the same company are stored physically close together in the database, potentially lowering the I/O during querying. Alternatively, you may choose to cluster on year/quarter.
I would change the database design to:
ID
Quarter
Year
CompanyName
Value1
Value2
Value3
as your columns and start saving it as a vertical table.
Then, you don't have as much data as you think, so I'd recommend something free like MySQL, or even NoSQL, since you're not doing anything but storing and retrieving the data. Any text-based file (XML, CSV, XLS) that you use is going to be way slower, because the entire file needs to be loaded into memory before you can parse it.
Excel has limits on sheet size, and you shouldn't really ever use it as an explicit "database" for anything you wish to port over to different structures. It's good for spreadsheets and accounting in general, but you shouldn't use it as an absolute-truth database as understood in computing. Also, Excel has a limit on the number of records a worksheet can contain: 65,536 rows by 256 columns as of Excel 2003.
Access may work for this, but with the number of records you're looking at, you'll probably begin to experience issues with file sizes, slowdowns, and general things like that. In situations where you start having more than 3,000 records at a time, it's probably better to use one of the big RDBMSs: Oracle, MySQL, SQL Server, etc.
I think that the main problem might be the way you designed the database.
A column for each quarter doesn't sound like very good practice, especially when you have to change your DB schema every new quarter.
You could start with a MS Access database and then if you have any performance problems with it, migrate to a SQL Server database or something.
Again, I think that you should take a careful look at your database design.
I have a great deal of experience with stock data. Having tested quite a few methods, I think for a simple free method you should try SQL Server. The amount of data you are working with is just too much for Access (and I imagine these are not the only calculations you would like to run). You can use SQL Server Express for free.
For this design I would create a database within SQL Server named HistoricalGrowthRate. I would have a table for each stock symbol and store the data in there.
One way to accomplish this is to have a separate database with a table that contains all the symbols you wish to follow (if you don't have one, you can use the CompanyList.csv from Nasdaq). Loop through each symbol in that table and run a CREATE TABLE in HistoricalGrowthRate. When you wish to populate the values, simply loop again and insert your values. You could also just export from Access, whichever is faster for you.
This will decrease the load when you call for the information and provide an easy way to access it. So, if you want the historical growth rate for AAPL, you simply set the connection string to your HistoricalGrowthRate database, reference the AAPL table and extract the values.
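A rough T-SQL sketch of the table-per-symbol loop described above; the symbol-list database and all column names are illustrative:

-- Create one table per symbol in HistoricalGrowthRate; QUOTENAME guards
-- against odd characters in symbol names.
DECLARE @symbol SYSNAME, @sql NVARCHAR(MAX);

DECLARE symbol_cursor CURSOR FOR
    SELECT Symbol FROM SymbolList.dbo.CompanyList;

OPEN symbol_cursor;
FETCH NEXT FROM symbol_cursor INTO @symbol;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'CREATE TABLE HistoricalGrowthRate.dbo.' + QUOTENAME(@symbol) +
               N' (FiscalYear SMALLINT NOT NULL,
                   FiscalQtr  TINYINT  NOT NULL,
                   GrowthRate DECIMAL(9,4) NULL,
                   PRIMARY KEY (FiscalYear, FiscalQtr));';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM symbol_cursor INTO @symbol;
END;

CLOSE symbol_cursor;
DEALLOCATE symbol_cursor;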

Choosing a clever solution: SQL Server or file processing for bulk data?

We have a number of files generated from a test, with each file having almost 60,000 lines of data. The requirement is to calculate a number of parameters from the data present in these files. There could be two ways of processing the data:
Each file is read line-by-line and processed to obtain required parameters
The file data is bulk copied into the database tables and required parameters are calculated with the help of aggregate functions in the stored procedure.
I was trying to figure out the overheads related to both methods. Since a database is meant to handle such situations, I am concerned about overheads that may become a problem as the database grows larger.
Will it affect the retrieval rate from the tables, consequently making the calculations slower? Would file processing then be a better solution, taking the database size into account? Should database partitioning solve the problem for a large database?
Did you consider using map-reduce (say under Hadoop, maybe with HBase) to perform these tasks? If you're looking for high throughput with big data volumes, this is a very scalable approach. Of course, not every problem can be addressed effectively using this paradigm, and I don't know the details of your calculation.
If you set up indexes correctly you won't suffer performance issues. Additionally, there is nothing stopping you loading the files into a table and running the calculations and then moving the data into an archive table or deleting it altogether.
You can run a query directly against the text file from SQL:
SELECT * FROM OPENROWSET('MSDASQL',
'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\;',
'SELECT * FROM [text.txt];')
Ad hoc distributed queries need to be enabled to run this.
Or, as you mentioned, you can load the data into a table (using SSIS, BCP, the query above, ...). You did not mention what you mean by the database getting larger. 60k lines per table is not much (it will perform well).

high performance database update (oracle)

I have several services which dump data to a database (Oracle) after processing different input file formats (XML, flat files, etc.). I was wondering if I could have them generate SQL statements instead and log them to some file system, and have a single SQL processor (something like Java Hibernate) which will process these SQL files and upload them to the DB.
What's the fastest way to execute a huge set of SQL statements (spread over a file system and written by multiple writers) against an Oracle DB? I was considering partitioning the DB and batch updates. However, I want to know the best practice here. This seems like a common problem and somebody must have faced/resolved it already.
Thanks
Atanu
Atanu,
The worst thing to do is to generate huge lists of INSERT statements. If you want speed and you know the layout of your data, use external tables to load the data into your Oracle database. This looks a lot like using SQL*Loader, but you can access your data through a table. In the table definition your data fields are mapped to your column names and data types.
This is by far the fastest way to do bulk loads into your database.
See Managing External Tables for some documentation.
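A minimal external table sketch, assuming a comma-delimited file exposed through an Oracle directory object (all names are illustrative):

-- The directory object must point at a folder the database server can read.
CREATE DIRECTORY load_dir AS '/data/incoming';

CREATE TABLE orders_ext
(
    order_id    NUMBER,
    customer_id NUMBER,
    amount      NUMBER(12,2)
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY load_dir
    ACCESS PARAMETERS
    (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
    )
    LOCATION ('orders.csv')
)
REJECT LIMIT UNLIMITED;

-- Load into the real table with a direct-path insert.
INSERT /*+ APPEND */ INTO orders SELECT * FROM orders_ext;
COMMIT;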
What is the best practice rather depends on your criteria for determining "best". In many places the approach taken is to use an ETL tool, perhaps Oracle Warehouse Builder, perhaps a third-party product. This need not be an expensive product: Pentaho offers Kettle in a free "self-supported" community edition.
When it comes to rolling your own, I don't think Hibernate is the way to go. Especially if your main concern is performance. I also think changing your feeds to generate SQL statements is an overly-complicated solution. What is wrong with PL/SQL modules to read the files and execute the SQL natively?
Certainly when I have done things like this before it has been with PL/SQL. The trick is to separate the input reading layer from the data writing layer. This is because the files are likely to require a lot of bespoke coding whereas the writing stuff is often fairly generic (this obviously depends on the precise details of your application).
A dynamic metadata-driven architecture is an attractive concept, particularly if your input structures are subject to a lot of variability. However such an approach can be difficult to debug and to tune. Code generation is an alternative technique.
When it comes to performance, look to use bulk processing as much as possible. This is the main reason to prefer PL/SQL over files of individual SQL statements.
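A minimal sketch of the bulk pattern in PL/SQL, assuming illustrative staging and target tables with identical column layouts:

DECLARE
    TYPE t_rows IS TABLE OF target_table%ROWTYPE;
    l_rows t_rows;
    CURSOR c_src IS SELECT * FROM staging_table;
BEGIN
    OPEN c_src;
    LOOP
        -- Fetch and insert in chunks instead of row by row.
        FETCH c_src BULK COLLECT INTO l_rows LIMIT 500;
        EXIT WHEN l_rows.COUNT = 0;

        FORALL i IN 1 .. l_rows.COUNT
            INSERT INTO target_table VALUES l_rows(i);

        COMMIT;
    END LOOP;
    CLOSE c_src;
END;
/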
The last thing you want is a bunch of INSERT statements; that is a SUPER slow approach (it doesn't matter how many processes you're running, trust me). Getting all the files into a delimited format and doing a DIRECT load into Oracle via sqlldr would be the simplest approach (and very fast).
If you want maximum performance, you don't want tons of SQL statement. Instead have a look at Oracle Data Pump.
And don't do any preprocessing for flat files. Instead feed them directly to impdp (the Oracle Data Pump Importer).
If importing the data requires transformations, updates, etc., then the best practice is to load the data into a staging table (with Data Pump), do some preprocessing on the staging table, and then merge the data into the production tables.
Preprocessing outside the database is usually very limited, since you don't have access to the already loaded data. So you cannot even check whether a record is new or an update to an existing one.
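A sketch of the staging-then-merge step, with illustrative table and column names (the staging table is assumed to have been loaded already):

-- Upsert: update rows that already exist, insert the ones that don't.
MERGE INTO customers tgt
USING customers_staging src
   ON (tgt.customer_id = src.customer_id)
WHEN MATCHED THEN
    UPDATE SET tgt.name  = src.name,
               tgt.email = src.email
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email)
    VALUES (src.customer_id, src.name, src.email);

COMMIT;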
As others have mentioned, there are some tools you should look into if performance is your only concern.
But there are some advantages to using plain SQL statements. Many organizations have regulations, policies, and stubborn developers that will block any new tools. A simple SQL script is the universal language of your database; it's pretty much guaranteed to work anywhere.
If you decide to go with SQL statements you need to avoid scripts like this:
insert into my_table values(...);
insert into my_table values(...);
...
And replace them with a single statement that unions multiple rows:
insert into my_table
select ... from dual union all
select ... from dual union all
...
The second version will run several times faster.
However, picking the right batch size is tricky. A large number of small inserts will waste a lot of time on communication and other overhead. But Oracle parse time grows exponentially with very large statement sizes. In my experience 100 rows per statement is usually a good number; parsing gets really slow around a thousand. Also, use the "union all" method and avoid the multi-table insert trick. For some reason multi-table insert is much slower, and some Oracle versions have bugs that will cause your query to hang at 501 tables.
(You can also create a somewhat similar script using PL/SQL. A 1 megabyte PL/SQL procedure will compile much faster than a 1 megabyte SQL statement will parse. But creating the script is complicated; collections, dynamic sql, handling all the types correctly, creating a temporary object instead of an anonymous block because large anonymous blocks cause Diana node errors, etc. I've built a procedure like this, and it worked well, but it probably wasn't worth the effort.)
