I'm reading some data from a SQL Server 2012. The source table has some 600 million rows, and two of the headache columns are nvarchar(max).
Problem: The performance drops from 100k rows per second to 800 per second when including these blob columns. Is there some way to make things go faster with blobs in SSIS?
Max datalength for the nvarchar(max) columns is 250 bytes and the average datalength is 50 bytes. The content is XML stored as text. I could possibly cast them to nvarchar(2000), but later I need to work with a table that has bigger XML text objects.
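For what it's worth, a minimal sketch of what such a cast could look like in the source query (table and column names here are made up); SSIS should then map the columns to DT_WSTR instead of the blob type DT_NTEXT:
-- Hypothetical source query: cast the nvarchar(max) columns to a bounded length
-- so SSIS treats them as regular wide strings rather than blobs.
SELECT
    id,
    CAST(xml_col1 AS NVARCHAR(2000)) AS xml_col1,
    CAST(xml_col2 AS NVARCHAR(2000)) AS xml_col2
FROM dbo.SourceTable;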
The destination is simply a RowCount.
What I see:
SSIS works on the data in chunks. While SSIS is working, the source server is idle: no CPU, no I/O, no page lookups. I don't know what SSIS is doing internally.
I have 40 GB of RAM on the server and SSD disks, which should be enough resources.
I've tried setting the BLOB and buffer temp storage paths to an explicit folder, but I can't see anything going on there.
The working set of the SSIS package is fine, nothing going on there. It consumes only some 300-400 MB, that's it.
No paging going on.
I'm using the SQLNCLI11.1 provider, 32K packet size.
Networking is fine, I've ruled out that part. The wait stats on the source server vaguely say it's waiting on the network, but that's simply because the client app isn't consuming anything. Again, if I leave out the blob columns, the network delivers 400 Mbps constantly (1 Gb network).
It seems to me that SSIS simply doesn't handle blobs very well. It's very, very conservative (because it cannot know how much buffer space the data will occupy). From the docs I read that SSIS handles blobs differently from ordinary columns: they may be stored on disk, read back into memory and then sent down the pipeline. Whatever happens, the performance is simply a disaster.
I've played around with DefaultBufferSize and DefaultBufferMaxRows; it doesn't really make much difference. The average performance is crap whatever I choose to do.
I've googled "everything", this is my last resort. What have I missed so far?
Related
I know Oracle automatically keeps frequently accessed data in memory. I'm curious: is there any way to keep a table in memory manually, for better performance?
Yes, you can certainly do that. You need to pin the table in the KEEP buffer pool of the database cache.
For example,
ALTER TABLE table_name STORAGE (buffer_pool KEEP);
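Note that the KEEP pool is sized separately and defaults to zero, so give it memory first; the size below is only illustrative:
-- Size the KEEP buffer pool (value is an example, adjust to your workload)
ALTER SYSTEM SET DB_KEEP_CACHE_SIZE = 512M;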
By the way, from Oracle 11g onwards you can also have a look at the RESULT CACHE. It is quite useful.
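For instance, a query opting into the result cache could look like this (table, column, and bind names are illustrative):
-- First execution populates the result cache; identical queries then read from it.
SELECT /*+ RESULT_CACHE */ col1, col2
FROM   table_name
WHERE  col3 = :some_value;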
Have a look at this AskTom link https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:253415112676
The short answer is no, and you don't want to.
If you need that high a level of retrieval performance, then consider using an in-memory DB like TimesTen.
Think about what you are asking the DB to do. You are asking the DB to dedicate n amount of cache memory to a single table and hold it there indefinitely. In a busy DB this will simply kill performance to the point of the DB being useless. Let's say you have a DB with a few hundred tables in it, some of them small, some large and some very large, and throw in a few PKs and indexes.
A query comes along that asks for, say, 100K rows of data that are 1 KB each, and the index is a 100-byte-long string. The DB will allocate sufficient memory to load up the index, and then start grabbing 8K chunks of data off the disk and putting those into cache.
If you ask it to store a few gigabytes of data in RAM, permanently, you will run out of memory in a big hurry unless you have a VERY expensive machine with 512 gigs of RAM in it. You will start hitting the swap file, and at that point your performance is toast.
If you are having performance issues on queries, run explain plan and learn how to use it to discover the bottlenecks. I have a 24-core machine with 48 gigs of RAM, but I have tables with billions of rows of data, so I keep a close eye on my cache hits and execution plans.
Also consider materialized views.
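For example, a minimal Oracle sketch (names are illustrative) of a precomputed summary that is refreshed on demand rather than pinning anything in cache:
-- Hypothetical summary table maintained by the database instead of a cached base table
CREATE MATERIALIZED VIEW sales_summary_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT region, SUM(amount) AS total_amount
FROM   sales
GROUP  BY region;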
I'm doing a rather large import into a SQL database, 10^8+ items, and I am doing this with a bulk insert. I'm curious to know whether the speed at which the bulk insert runs can be improved by importing multiple rows of data as a single row and splitting them once imported.
If the time to import data is defined by the sheer volume of the data itself (i.e. 10 GB), then I'd expect that importing 10^6 rows vs 10^2 rows with the data consolidated would take about the same amount of time.
If, however, the time to import is limited more by row operations and logging each line rather than by the data itself, then I'd expect that consolidating data would have a performance benefit. I'm not sure, though, how this would carry over if one then had to break up the data in the DB later on.
Does anyone have experience with this and can shed some light on what specifically can be done to reduce bulk insert time without simply adding that time back later to split the data in the DB?
Given a 10GB import, is it better to import data on separate rows or consolidate and separate the rows in the DB?
[EDIT] I'm testing this on a quad-core 2.5 GHz machine with 8 GB of RAM and 300 MB/sec of reads/writes to disk (striped array). The files are hosted on the same array, and the average row size varies, with some rows containing large amounts of data (> 100 KB) and many under 100 B.
I've chunked my data into 100 MB files and it takes about 40 seconds to import the file. Each file has 10^6 rows in it.
Where is the data that you are importing? If it is on another server, then the network might be the bottleneck. That then depends on the number of NICs and frame sizes.
If it is on the same server, things to play with are the batch size and the recovery model, which affect the log file. In the full recovery model, everything is written to the log file. The bulk-logged recovery model has a little less logging overhead.
Since this is staging data, maybe a full backup before the process, switching the model to simple, and then importing might reduce the time. Of course, change the model back to full and do another backup afterwards.
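A rough sketch of that sequence in T-SQL (database, file, and table names are made up; BATCHSIZE controls how often the load commits):
-- Full backup before touching the recovery model
BACKUP DATABASE StagingDb TO DISK = 'D:\backup\StagingDb_pre_load.bak';

-- Switch to a minimally logged model for the duration of the load
ALTER DATABASE StagingDb SET RECOVERY BULK_LOGGED;  -- or SIMPLE

BULK INSERT dbo.StagingTable
FROM 'D:\import\datafile.csv'
WITH (BATCHSIZE = 100000, TABLOCK);

-- Restore full logging and re-establish the backup chain
ALTER DATABASE StagingDb SET RECOVERY FULL;
BACKUP DATABASE StagingDb TO DISK = 'D:\backup\StagingDb_post_load.bak';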
As for importing non-normalized data, multiple rows at a time, I usually stay away from the extra coding.
Most of the time, I use SSIS packages. More packages and threads mean a fuller NIC pipe. I usually have at least a 4 GB backbone that is seldom full.
Other things that come into play are your disks. Do you have multiple files (pathways) to the RAID 5 array? If not, you might want to think about it.
In short, it really depends on your environment.
Use a DMAIC process.
1 - Define what you want to do.
2 - Measure the current implementation.
3 - Analyze ways to improve.
4 - Implement the change.
5 - Control the environment by re-measuring.
Did the change go in the positive direction?
If not, roll back the change and try another one.
Repeat the process until the desired result (timing) is achieved.
Good luck, J
If this is a one-time thing done in an offline change window, you may want to consider putting the database in the simple recovery model prior to inserting the data.
Keep in mind, though, that this would break the log chain...
I've been playing around with database programming lately, and I noticed something a little bit alarming.
I took a binary flat file saved in a proprietary, non-compressed format that holds several different types of records, built schemas to represent the same records, and uploaded the data into a Firebird database. The original flat file was about 7 MB. The database is over 70 MB!
I can understand that there's some overhead to describe the tables themselves, and I've got a few minimal indices (mostly PKs) and FKs on various tables, and all that is going to take up some space, but a factor of 10 just seems a little bit ridiculous. Does anyone have any ideas as to what could be bloating up this database so badly, and how I could bring the size down?
From Firebird FAQ:
Many users wonder why they don't get their disk space back when they delete a lot of records from database.
The reason is that it is an expensive operation; it would require a lot of disk writes and memory, just like defragmenting a hard disk partition. The parts of the database (pages) that were used by such data are marked as empty, and Firebird will reuse them the next time it needs to write new data.
If disk space is critical for you, you can get the space back by doing a backup and then a restore. Since you're doing the backup only to restore right away, it's wise to use the "inhibit garbage collection" or "don't use garbage collection" switch (-g in gbak), which will make the backup go A LOT FASTER. Garbage collection is used to clean up your database, and as it is a maintenance task it's often done together with backup (as backup has to go through the entire database anyway). However, you're soon going to ditch that database file, so there's no need to clean it up.
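For example, a backup/restore cycle from the command line might look like this (file names and credentials are illustrative):
gbak -b -g -user SYSDBA -password masterkey mydb.fdb mydb.fbk
gbak -c -user SYSDBA -password masterkey mydb.fbk mydb_restored.fdb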
Gstat is the tool for examining table sizes etc.; maybe it will give you some hints about what's using the space.
In addition, you may also have multiple snapshots or other garbage in the database file; it depends on how you add data to the database. The database file never shrinks automatically, but a backup/restore cycle gets rid of junk and empty space.
Firebird fills pages only to a certain fill factor, not completely. E.g. a db page can contain 70% data and 30% free space to speed up future record updates and deletes without moving rows to a new db page. Here is an example from gstat output:
CONFIGREVISIONSTORE (213)
Primary pointer page: 572, Index root page: 573
Data pages: 2122, data page slots: 2122, average fill: 82%
Fill distribution:
0 - 19% = 1
20 - 39% = 0
40 - 59% = 0
60 - 79% = 79
80 - 99% = 2042
The same goes for indexes.
You can see the real database size when you do a backup and restore with the option -USE_ALL_SPACE; the database is then restored without this free space reservation.
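For example (file names illustrative):
gbak -c -use_all_space mydb.fbk mydb_restored.fdb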
You must also know that not only pages with data are allocated; some pages are preallocated (empty) for future fast use, avoiding expensive disk allocation and fragmentation.
As "Peter G." says, a database is much more than a flat file and is optimized to speed things up.
And as "Harriv" says, you can get details about the database file with gstat; use a command like gstat -
Here are details about its output.
I'm creating a database, prototyping and benchmarking first. I am using H2, an open-source, commercially free, embeddable, relational Java database. I am not currently indexing on any column.
After the database grew to about 5 GB, its batch write time doubled (i.e. the rate of writing slowed to half the original rate). I was writing roughly 25 rows per millisecond with a fresh, clean database, and now at 7 GB I'm writing roughly 7 rows/ms. My rows consist of a short, an int, a float, and a byte[5].
I do not know much about database internals or even how H2 was programmed. I would also like to note I'm not badmouthing H2, since this is a problem with other DBMSs I've tested.
What factors might slow down the database like this if there's no indexing overhead? Does it mainly have something to do with the file system structure? From my results, I assume the way Windows XP and NTFS handle files makes it slower to append data to the end of a file as the file grows.
One factor that can complicate inserts as a database grows is the number of indexes on the table, and the depth of those indexes if they are B-trees or similar. There's simply more work to do, and it may be that you're causing index nodes to split, or you may simply have moved from, say, a 5-level B-tree to a 6-level one (or, more generally, from N to N+1 levels).
Another factor could be disk space usage -- if you are using cooked files (that's the normal kind for most people most of the time; some DBMS use 'raw files' on Unix, but it is unlikely that your embedded system would do so, and you'd know if it did because you'd have to tell it to do so), it could be that your bigger tables are now fragmented across the disk, leading to worse performance.
If the problem was on SELECT performance, there could be many other factors also affecting your system's performance.
This sounds about right. Database performance usually drops significantly as the data can no longer be held in memory and operations become disk bound. If you are using normal insert operations, and want a significant performance improvement, I suggest using some sort of a bulk load API if H2 supports it (like Oracle sqlldr, Sybase BCP, Mysql 'load data infile'). This type of API writes data directly to the data-file bypassing many of the database subsystems.
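H2 does offer a simple path for this: it can read a CSV file inside a plain INSERT ... SELECT via CSVREAD, which avoids per-row statement overhead. A minimal sketch (table and file names are illustrative):
-- Bulk load a CSV file in one statement instead of row-by-row inserts
INSERT INTO my_table
SELECT * FROM CSVREAD('data.csv');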
This is most likely caused by variable width fields. I don't know if H2 allows this, but in MySQL, you have to create your table with all fixed width fields, then explicitly declare it as a fixed width field table. This allows MySQL to calculate exactly where it needs to go in the database file to do the insert. If you aren't using a fixed width table, then it has to read through the table to find the end of the last row.
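If you want to try that in MySQL, a sketch of the declaration (MyISAM only, names illustrative; every column must be a fixed-width type, so no VARCHAR/TEXT/BLOB):
-- Hypothetical fixed-width table so each row sits at a predictable file offset
CREATE TABLE fixed_rows (
  id   INT NOT NULL,
  code CHAR(16) NOT NULL
) ENGINE=MyISAM ROW_FORMAT=FIXED;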
Appending data (if done right) is an O(n) operation, where n is the length of the data to be written. It doesn't depend on the file length, there are seek operations to skip over that easily.
For most databases, appending to a database file is definitely slower than pre-growing the file and then adding rows. See if H2 supports pre-growing the file.
Another cause is whether the entire database is held in memory or if the OS has to do a lot of disk swapping to find the location to store the record.
I would blame it on I/O, especially if you're running your database on a normal PC with a normal hard disk (by that I mean not a server with super-fast hard drives, etc.).
Many database engines create an implicit integer primary key for each table, so even if you haven't declared any indexes, your table is still indexed. This may be a factor.
Using H2 for a 7 GB data file is the wrong choice from a technological point of view. As you said, it's embeddable. What kind of "embedded" application do you have if you need to store so much data?
Are you performing incremental commits? Since H2 is an ACID-compliant database, if you are not performing incremental commits, then there is some type of redo log so that, in the case of an accidental failure (say, a power outage) or a rollback, the changes can be rolled back.
In that case, your redo log may be growing large, overflowing memory buffers and needing to be written out to disk along with your actual data, adding to your I/O overhead.
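If you drive H2 through SQL, incremental commits could look roughly like this sketch (the batch size is up to you):
SET AUTOCOMMIT OFF;
-- ... run one batch of INSERTs, e.g. 10,000 rows ...
COMMIT;
-- repeat per batch so the pending transaction (and its undo/redo data) stays small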
For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing larger granularities for queries that don't require a high resolution (time-wise).
When searching for data in our backup archives, where the data is stored in bzipped text files but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe it through grep than to untar it to disk and grep the files. In fact, the untar-to-pipe was even noticeably faster than just grepping the uncompressed files (i.e. discounting the untar-to-disk).
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:
Do you think putting the data of multiple rows into a (compressed) blob field of a single row and searching for single rows on the fly during extraction could be faster than searching for the same rows via the table index?
For example, instead of having this table
CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);
I would have
CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );
with approximately 100-300 rows in data for each row in quickdata and searching for the desired timestamps on the fly during decompression and decoding of the blob field.
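For example, the lookup would then touch only one row (or a handful) per day, with the timestamp filtering done in the application after decompressing the blob; a rough sketch against the quickdata schema above:
-- Fetch the compressed day blob for one series, then decode it client-side
SELECT dayvalues
FROM   quickdata
WHERE  `source` = ? AND `type` = ? AND `day` = ?;
-- decompress dayvalues in the application and scan it for the wanted timestamps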
Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.
Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:
Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.
The question, then, is how much data do you have and how much memory can you afford?
And if the database gets really big -- as in nobody could ever afford that much memory, even in 20 years -- you need clever distributed database systems like Google's BigTable or Hadoop.
I made a similar discovery when working with a database from Python: the cost of accessing the disk is very, very high. It turned out to be much faster (i.e. nearly two orders of magnitude) to request a whole chunk of data and iterate through it in Python than it was to create seven narrower queries (one per day in question for the data).
It blew out even further when I was getting hourly data: 24x7 is a lot of queries!