SQL Server file and filegroup - sql-server

I can not think of any reasons why we need to have multiple files inside a file group. The reason why I think of this way is we can control from T-SQL (end user) level about file group, but can not control from T-SQL (end user) level about individual files of a file group. Any comments or ideas why files are still needed?
thanks in advance,
George

Having multiple files per file group is only useful for the following reasons:
Distributing disk I/O load over multiple disks for performance reasons. i.e. in cases where re-configuring the RAID configuration with additional disks is not possible, or there is no RAID.
In cases where you have a VLDB and do not wish to deal with very large single files for logistical reasons.
There is 'urban legend' that SQL Server uses only 1 thread per file, so that the number of files should match the number of CPU's. This is however false, as discussed by Microsoft here.
Historically, there is another reason. Believe it or not in the days of SQL Server 4.2 through 7 sql server was sometimes installed on FAT32 file systems which had a 4 gig file limit. The ability to chain files together (in what we now call file groups) was a way to work around file system limitations and allow DBs larger than 4gigs on FAT based installs.

old thread, i know, but here is what makes sense to me: back in the day max file size in windows filesystem FAT32 was 2GB. If your database-file got bigger you were screwed (happened to me with a MS Access-Database once). Hence they allowed to define a max filesize (like: 2GB) and You could add more files. If your database grew and the max size got exceeded the next file got filled until that was full and so on. All those files can be addressed as one filegroup. You can define a tables data-location by choosing a filegroup, but you don't see in which file within that filegroup the tabledata will end up. All You know is that your tables data can end up in any of the files within the filegroup.
By this "splitting" Your filesystem never sees a file larger than the max filesize (here: 2GB) although tables in Your Database can be many times larger.
Today setting up multiple files can be useful to have large datafiles "chopped" into smaller pieces for filebased backup (ask your network admins what they want, because during backup writing a large (like 1TB) file into a partition takes long, even in fast RAID. All other writing operations would need to wait a long time. Shorter waiting intervals let high prioritized operations come to execution quicker).
If You care for parallel access of the same table consider horizontal partitioning as in http://msdn.microsoft.com/en-us/library/ms188730%28v=sql.105%29.aspx. this allows to spread the data of a table over different harddisks, like "all sales of January on disk R:", "all sales of February on disk S:", without creating separate tables. During the procedure of partitioning of a table You can define which part shall go to what filegroup.

I could provide a long explanation but MSDN does a good job of it here. It may be that you specifically don't need to have more than one file in a file group, but that is not true of everybody.

Related

SSIS processing large amount of flat files is painfully slow

From one of our partners, I receive about 10.000 small tab delimited text files with +/- 30 records in each file. It is impossible for them to deliver it in one big file.
I process these files in a ForEach loop container. After reading a file, 4 column derivations are performed and then finally contents are stored in a SQL Server 2012 table.
This process can take up to two hours.
I already tried processing the small files into one big file and then importing this one in the same table. This process takes even more time.
Does anyone have any suggestions to speed up processing?
One thing that sounds counter intuitive is to replace your one Derived Column Transformation with 4 and have each one perform a single task. The reason this can provide performance improvement is that the engine can better parallelize operations if it can determine that these changes are independent.
Investigation: Can different combinations of components affect Dataflow performance?
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks
You might be running into network latency since you are referencing files on a remote server. Perhaps you can improve performance by copying those remote files to the local box before you being processing. The performance counters you'd be interested in are
Network Interface / Current Bandwidth
Network Interface / Bytes Total / sec
Network Interface / Transfers/sec
The other thing you can do is replace your destination and derived column with a Row Count transformation. Run the package a few times for all the files and that will determine your theoretical maximum speed. You won't be able to go any faster than that. Then add in your Derived column and re-run. That should help you understand whether the drop in performance is due to the destination, the derived column operation or the package is running as fast as the IO subsystem can go.
Do your files offer an easy way (i.e. their names) of subdividing them into even (or mostly even) groups? If so, you could run your loads in parallel.
For example, let's say you could divide them into 4 groups of 2,500 files each.
Create a Foreach Loop container for each group.
For your destination for each group, write your records to their own staging table.
Combine all recordss from all staging tables into your big table at the end.
If the files themselves don't offer an easy way to group them, consider pushing them into subfolders when your partner sends them over, or inserting the file paths into a database so you can write a query to subdivide them and use the file path field as a variable in the Data Flow task.

Blob data in huge SQL Server database

We have 20.000.000 generated textfiles every year, average size is approx 250 Kb each (35 Kb zipped).
We must put these files in some kind of archive for 10 years. No need to search inside textfiles, but we must be able to find one texfile by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing them in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
The database will be grow huge over the years; 5-10 Tb. So I think we need to partition data for example by keeping one database per year.
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs > 1 Mb?
Any other suggestions on how to manage such data volumes?
I'd say keeping the files in the filesystem would be a better idea. And you can keep file name and path in the DB. Here's a similar question.
Filestream is definitely more suited to larger blobs (750kB-1MB) as the overhead required to open the external file begins to impact read and write performance vs. vb(max) blob storage for small files. If this is not so much of an issue (ie. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable) then it's definitely an option.
I would probably suggest keeping the files directly in a vb(max) column if you can guarantee they won't get much larger in size, but have this table stored in a seperate filegroup using the TEXTIMAGE_ON option which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so the actual storage of blobs can be split over multiple filegroups either using partitions or via some multiple table scheme so you can scale to different disks if necessary in the future.
Keeping the blobs directly associated with the SQL metadata either via Filestream or direct vb(max) storage has many advantages over dealing with filesystem / SQL inconsistencies not limited to ease of backup and other management operations.
I assume by "generated" you mean something like data are being injected into document templates, and so there's much repetition of text content, i.e. "boilerplate" ?
20 million of such "generated" files per year is ~55,000 per day, ~2300 per hour!
I would manage such volume by not generating text files in the first place, and instead by creating database abstracts that contain the data that are pumped into the generated text, so that you can reconstitute the full document if necessary.
If you mean something else by "generated" could you elaborate?

How to efficiently utilize 10+ computers to import data

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long time, we want to utilize something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing. The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilize a single one, due to the extensive network communication and coordination between the servers.
Does anyone have suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1
You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; only, you would not be able to make the resultant hash fit in a 64bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html
Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
If you need primary keys (and you should), you will also have the problem to assign PKs during the import into the local DBs. I suggest to add the PKs to the CSV file.
[EDIT] After checking with your edits, here is what you should try:
Write a small program which extract the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.
This is a very generic question and does not take the database backend into account. Firing with 40 or 1000 machines on a database backend that can not handle the load will give you nothing. Such a problem is truly to broad to answer it in a specific way..you should get in touch with people inside your organization with enough skills on the DB level first and then come back with a more specific question.
Assuming N computers, X files at about 50GB files each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to mangae the overall process, and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases
Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
When up and running, a "ready" worker database polls the master database for a file to process. The master database dolls out files to worker systems, ensuring that no file gets processed by more than one at a time. (Have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, master database is updated with a “ready flagy” for that worker, and it goes into waiting mode.
Master database has it’s own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all one one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?
The simplest thing is to make one computer responsible for handing out new dimension item id's. You can have one for each dimension. If the dimension handling computers are on the same network, you can have them broadcast the id's. That should be fast enough.
What database did you plan on using with a 23-dimensional starscheme? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materalization issues.
You should investigate if there are highly correlating dimensions.
In general, with a 23 dimensional star scheme with large dimensions a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely bad with datawarehouse questions. In order to avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be an other option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL server). How long are you willing to wait for query results?
Rohita,
I'd suggest you eliminate a lot of the work from the load by sumarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more managable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimensions tables) and the Fact table. Just associate each new-name-seen with an incrementor-for-this-dimension, then write the Fact.
If you do this through named pipes you can push, process-remotely, and readback-back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.
On another note you could utilize Windows Hyper-V Cloud Computing addon for Windows Server:http://www.microsoft.com/virtualization/en/us/private-cloud.aspx
It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
Proper implementation on a modern single server (2x Xeon 5690 CPUs + RAM that's enough for ALL dimensions loaded in hash tables + 8GB ) should give you at least 10 times better speed i.e at least 10MB/sec.

Storage of many log files

I have a system which is receiving log files from different places through http (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ...
My question is : what's the best way to store them ?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database Table with text (MySQL is preferred for internal reasons) (pb with DB purge as delete can be very long !)
Database Table with one record per line of text
Database with sharding (one table per day), allowing simple data purge. (this is partitioning. However the version of mysql I have access to (ie supported internally) does not support it)
Document based DB à la couchdb or mongodb (problem could be with indexing / maturity / speed of ingestion)
Any advice ?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast, as in, it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index and field or combination of fields. It's also nice because you can randomly add more fields to logs ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
I'd pick the very first solution.
I don't see why would you need DB at all. Seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process it and then create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, the performance decreases rapidly. Check your filesystem and if it's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of producer ID as the first level directory name.
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about 1GB [corrected 10MB] a day uncompressed. This will likely compress to 100MB or less. I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which can search the compressed tarball and generate more stats.
Since you would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ... You're expecting 100,000 files a day, at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular textfiles using the following format : yyyymmdd/producerid/fileno.
At the end of the day, clear the database, and load all the textfiles for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space ,you could compress the daily folder. Since they're textfiles, they would compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
To my experience, single large table performs much faster then several linked tables if we talk about database solution. Particularly on write and delete operations. For example, splitting one table into three linked tables decreases performance 3-5 times. This is very rough, of course it depends on details, but generally this is the risk. It gets worse when data volumes get very large. Best way, IMO, to store log data is not in a flat text, but rather in a structured form, so that you can do efficient queries and formatting later. Managing log files could be pain, especially when there are lots of them and coming from many sources and locations. Check out our solution, IMO it can save you lots of development time.

What slows down growing database performance?

I'm creating a database, and prototyping and benchmarking first. I am using H2, an open-source, commercially free, embeddable, relational, java database. I am not currently indexing on any column.
After the database grew to about 5GB, its batch write speed doubled (the rate of writing was slowed 2x the original rate). I was writing roughly 25 rows per milliseconds with a fresh, clean database and now at 7GB I'm writing roughly 7 rows/ms. My rows consist of a short, an int, a float, and a byte[5].
I do not know much about database internals or even how H2 was programmed. I would also like to note I'm not badmouthing H2, since this is a problem with other DBMSs I've tested.
What factors might slow down the database like this if there's no indexing overhead? Does it mainly have something to do with the file system structure? From my results, I assume the way windows XP and ntfs handle files makes it slower to append data to the end of a file as the file grows.
One factor that can complicate inserts as a database grows is the number of indexes on the table, and the depth of those indexes if they are B-trees or similar. There's simply more work to do, and it may be that you're causing index nodes to split, or you may simply have moved from, say, a 5-level B-tree to a 6-level one (or. more generally, from N to N+1 levels).
Another factor could be disk space usage -- if you are using cooked files (that's the normal kind for most people most of the time; some DBMS use 'raw files' on Unix, but it is unlikely that your embedded system would do so, and you'd know if it did because you'd have to tell it to do so), it could be that your bigger tables are now fragmented across the disk, leading to worse performance.
If the problem was on SELECT performance, there could be many other factors also affecting your system's performance.
This sounds about right. Database performance usually drops significantly as the data can no longer be held in memory and operations become disk bound. If you are using normal insert operations, and want a significant performance improvement, I suggest using some sort of a bulk load API if H2 supports it (like Oracle sqlldr, Sybase BCP, Mysql 'load data infile'). This type of API writes data directly to the data-file bypassing many of the database subsystems.
This is most likely caused by variable width fields. I don't know if H2 allows this, but in MySQL, you have to create your table with all fixed width fields, then explicitly declare it as a fixed width field table. This allows MySQL to calculate exactly where it needs to go in the database file to do the insert. If you aren't using a fixed width table, then it has to read through the table to find the end of the last row.
Appending data (if done right) is an O(n) operation, where n is the length of the data to be written. It doesn't depend on the file length, there are seek operations to skip over that easily.
For most databases, appending to a database file is definitely slower than pre-growing the file and then adding rows. See if H2 supports pre-growing the file.
Another cause is whether the entire database is held in memory or if the OS has to do a lot of disk swapping to find the location to store the record.
I would blame it on I/O, specially if you're running your database on a normal PC with a normal hard disk (by that I mean not in server with super fast hard drives, etc).
Many database engines create an implicit integer primary key for each update, so even if you haven't declared any indexes, your table is still indexed. This may be a factor.
Using H2 for 7G datafile is a wrong choice from technological point of view. As you said, embeddable. What kind of "embedded" application do you have, if you need to store so much data.
Are you performing incremental commits? Since H2 is an ACID compliant database, if you are not performing incremental commits, then there is some type of redo log so that in the case of some accidental failure (say, power outage) or rollback, the deletes can be rolled back.
In that case, your redo log may be growing large and overflowing memory buffers and needing to write out your redo log to disk, as well as your actual data, adding to your I/O overhead.

Resources