I'm importing a huge amount of data using BCP, and I was curious about what options it has to improve the speed of this operation.
I read about parallel loading in a couple of places, and also saw that you can tell it to not bother to sort the data or check constraints (which are all viable options for me since the source is another database with good integrity).
I haven't seen examples of these options being used though (as in, I don't know what command line switch enables parallel loading or contraint checking disabling).
Does anyone know a good resource for learning this, or can someone give me a couple trivial examples? And please don't point me to the BCP parameters help page, I couldn't make heads or tails of it with regard to these specific options.
Any help is greatly appreciated!
You need to read The Data Loading Performance Guide. There is no magical command line switch 'load faster', is a very complicated balance of doing the right thing in the right context. It depends on whether you load a heap or a B-Tree, whether there is already data or the table is empty, whether it has secondary indexes, whether minimally logging is possible in the database recovery model, whether the table is partitioned or not, whether the data is pre-sorted or not and that is just the surface. The linked white paper has all the details.
It looks like the parallel loading you are talking about is just running multiple instances of the BCP utility against the same table. You would be responsible for partitioning the data before hand. You use it by specifying the TABLOCK table hint. From MSDN:
Bulk update (BU) locks allow processes to bulk copy data concurrently into the same table while preventing other processes that are not bulk copying data from accessing the table.
So it's really just a special lock for BCP.
To further increase performance, you may read further into the -a flag on the BCP parameters page.
-a allows to to specify a larger packet size (between 4096 and 65535) to increase the amount of data sent at one time to the server per network packet.
I would also suggest using the -e flag with an error if you intend to run multiple BCP processes to help keep track of any errors encountered.
Related
My boss wants me to send a large SQL Server query result. The result is around 60 Million lines. Is there any advice on how to send this? File format? Compression? Thanks for any help!
What is the row size?
Typically when I need to export large amounts of data from SQL, I use BoostSSMS scripting templates. You could also use SSIS to generate the file, especially if this is an ETL process (since you can debug and step through the process). I find SSMS 'results to file' to be totally inadequate compared to BoostSSMS in terms of speed. My preferred format is tab delimited and single quote text identifiers.
If you're sending the data somewhere, it's probably best to compress it (which will drastically reduce the file size). Which method to use is going to depend on your target and their requirements but I've never had any problems with plain old ZIP compression.
Edit - your boss says he just wants to view it. Ask him what exactly he's looking for (whether that's some specific combination of values, or doing analysis on the data). The reason that's important is because I can guarantee he isn't looking at this data line-by-line for 60 million rows. If he's looking for something specific, filter your query down using those conditions. If he's doing analysis on it, do the aggregation in your SQL and give him the results.
I'm trying to insert data into a database, but first I check if each row exists using a lookup, similar to the method suggested here:
How to prevent SSIS from importing data from a file that already exist in database?
SELECT DISTINCT VALUES // OleDb Source
|
LOOKUP // If exists
| // No Match Output
OLE DB DESTINATION // Insert new records
I'm using RetainSameConnection=True to enable transactions on my workflow. With a default buffer around 10,000 rows, as rows get passed to the OLE DB Destination, the destination INSERT will lock with the lookup SELECT.
I've tried SET READ_COMMITTED_SNAPSHOT ON, which will work, but performance during the lookup is now incredibly slow, which I believe is due to the RetainSameConnection property, and I can't tell that SSIS is even using the READ COMMITTED SNAPSHOT isolation level. I thought about ignoring failures on the destination, but I read it will cause bulk inserts to fail completely instead of by row. I've also considered using NOLOCK on all the reads, but it would turn all my lookups into SQL queries.
The source DB may read millions of rows. Is there a better way to accomplish this?
This question is getting into the area where the answers are going to be based more on preference and experience than there just being one easily defined way of doing this. There will likely be a good number of answers which will show different methodologies that might work for you, but the basics of any of these are going be based around a couple of things
Reducing the number of rows that are saved in your lookup during pre-caching of the data flow task
Generally speaking, the values for all lookups are populated before the data flow begins to execute and read data from the source based on the configuration of the lookup. If you have your SSIS configured this way, then there should not be any contention between the lookup and the insertion of rows into your table regardless of isolation level.
Based on what you said above, I'm thinking that perhaps the performance of the lookup is not changing drastically because of your configurations but more because the amount of data being cached into the lookup is increasing with each execution.
Changing the Lookup pattern to use a different pattern
The most basic implementation of the lookup is usually fine for most cases. However, if performance is a heavy concern for package execution then there may be other more ideal methods to accomplishing the same objective. One of these which I talk about in my blog refers to using a merge join as an alternative to the standard lookup. This may not be ideal for your situation as I designed this particular strategy to cover a corner case involving large data sets, but hopefully it should give you some ideas.
I have a fairly detailed walkthrough of that pattern at the link below
https://web.archive.org/web/20140819083150/http://bigdatabigdave.info/archive/2013/02/15/alternate-ssis-lookup-pattern-merge-join/
One place I remember they wrote all the insert lines to a raw file. Then there was a second dataflow (after the first dataflow finished) that inserted the raw file records into the destination table.
The project requires storing binary data into PostgreSQL (project requirement) database. For that purpose we made a table with following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data), and after that updated (concatenating additional binary data to data field).
Simple tests have shown that this yields better performance.
Depending on your inputs, we will make client use concurrent threads to insert / update data (with different DB connections), or a single thread with only one DB connection.
We haven't much experience with PostgreSQL, so could you help us with some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread.
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
database and client are on the same high performance server, but this must not matter, client must be fast no matter the machine, without additional client configuration
we will get new stream of data every 10 seconds, stream will provide new 16000 bytes per 0.5 seconds (CBR, but we can use buffering and only do inserts every 4 seconds max)
stream will last anywhere between 10 seconds and 5 minutes
It makes extremely little sense that you should get better performance inserting a row then appending to it if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE and an INSERT. When you insert the row then update it, what's happening is that the original tuple you inserted is marked as deleted and new tuple is written that contains the concatentation of the old and added data.
I question your testing methodology - can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to really say much of use. You've given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different to any other insert performance tuning in PostgreSQL. All the same advice applies: Batch work into transactions, use multiple concurrent sessions (but not too many; rule of thumb is number_of_cpus + number_of_hard_drives) to insert data, avoid having transactions use each others' data so you don't need UPDATE locks, use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache like a battery-backed RAID controller, etc.
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16kb record as it comes in into a staging table (if possible UNLOGGED or TEMPORARY if you can afford to lose incomplete records) and let Pg synchronize and group up commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
See also:
Whats the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL
We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long time, we want to utilize something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing. The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilize a single one, due to the extensive network communication and coordination between the servers.
Does anyone have suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1
You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; only, you would not be able to make the resultant hash fit in a 64bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html
Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
If you need primary keys (and you should), you will also have the problem to assign PKs during the import into the local DBs. I suggest to add the PKs to the CSV file.
[EDIT] After checking with your edits, here is what you should try:
Write a small program which extract the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.
This is a very generic question and does not take the database backend into account. Firing with 40 or 1000 machines on a database backend that can not handle the load will give you nothing. Such a problem is truly to broad to answer it in a specific way..you should get in touch with people inside your organization with enough skills on the DB level first and then come back with a more specific question.
Assuming N computers, X files at about 50GB files each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to mangae the overall process, and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases
Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
When up and running, a "ready" worker database polls the master database for a file to process. The master database dolls out files to worker systems, ensuring that no file gets processed by more than one at a time. (Have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, master database is updated with a “ready flagy” for that worker, and it goes into waiting mode.
Master database has it’s own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all one one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?
The simplest thing is to make one computer responsible for handing out new dimension item id's. You can have one for each dimension. If the dimension handling computers are on the same network, you can have them broadcast the id's. That should be fast enough.
What database did you plan on using with a 23-dimensional starscheme? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materalization issues.
You should investigate if there are highly correlating dimensions.
In general, with a 23 dimensional star scheme with large dimensions a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely bad with datawarehouse questions. In order to avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be an other option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL server). How long are you willing to wait for query results?
Rohita,
I'd suggest you eliminate a lot of the work from the load by sumarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more managable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimensions tables) and the Fact table. Just associate each new-name-seen with an incrementor-for-this-dimension, then write the Fact.
If you do this through named pipes you can push, process-remotely, and readback-back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.
On another note you could utilize Windows Hyper-V Cloud Computing addon for Windows Server:http://www.microsoft.com/virtualization/en/us/private-cloud.aspx
It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
Proper implementation on a modern single server (2x Xeon 5690 CPUs + RAM that's enough for ALL dimensions loaded in hash tables + 8GB ) should give you at least 10 times better speed i.e at least 10MB/sec.
I am investigating an issue relating to a large log expansion during an ETL process, even though the database is set in bulk logged mode (and it is not running in psuedo simple but truely bulk logged)
Using the ::fn_dblog(null,null) function to examine the transaction log operations and the context of the operation, the log expansion is pretty much entirely down to the logging of a LOP_FORMAT_PAGE operation, on a LCX_Heap context. (97% of the expansion is that operation, appearing in the log over 600k times for a single data load.)
The question is, what is the lop_format_page doing / recording that SQL has done?
Given that, I should be able to reverse the logic and understand what the cause / effect chain is that results in this and be able to alter the ETL if appropriate.
I'm not expecting many people have come across this one, the level of available detail on the operations and context is minimal to none.
You're correct that this is very thinly (AKA not!) documented. I've done a little poking around inside logs and have done a lot of log-reduction work (mostly by ensuring bulk inserts were actually being done in bulk!). So I know this can be challenging to track down.
My best guess, having seen LOP_FORMAT_PAGE used in context, is that it's clearing out a new page-- for example when splitting an index page once that page is full and another entry needs to be created. So, if this assumption is correct, you may want to track down what may be causing a whole bunch of new pages to get allocated.
Do you know which operations are going on in the ETL while you're seeing the log expansion? It would be helpful to understand this context-- please add that info to your question if possible.
Also, are you able to run and vary your ETL code in a test environment? Instead of figuring out this inscrutable log record definition, it may be easier to isolate the problem by running your ETL while commenting out some steps (or limiting the number of rows affected) and then seeing which change makes the problem go away.
I think you and Justin are onto the answer, but it is not all that complicated.
The ETL process (Extract, transform, load) is loading data into the db. Naturally, as pages fill up, new ones need to be allocated on the heap.
I thought that LOP_FORMAT_PAGE only formatting page too. But it contains either full page data if count of arrays is 1 or part of page with data (header plus records) and offsets to records from the end of page in second array.