Yesterday I asked how I should save my files.
After some research I've decided to go with storing the files "in" the database.
I've checked the difference between storing the files using filestream and storing the files in the database itself.
Each has its advantages and disadvantages. This article helped me a lot with my research:
http://www.codeproject.com/KB/database/SqlFileStream.aspx
So basically it says that saving the files using filestream is better if the files are bigger than 1 MB.
But I've discovered another problem with filestream. If you delete a record in the database, the file still exists on the filesystem.
Therefore I need your opinion.
What to use? Filestream or saving the files in the database using VARBINARY?
Grtz,
M.
The data on the filesystem will (should) be removed soon after you delete the data from the database. This is done in a separate background system thread, so the file may remain on the filesystem until the garbage collector next runs. However, anything accessed through the filestream APIs (i.e. T-SQL or streaming) is guaranteed not to return data that has been removed, whether or not the file still resides on disk (ACID is ensured with filestream).
Use Filestreams.
Databases are designed to store relational data, file systems are designed to store files.
I have often heard from Microsoft employees that storing large blobs in SQL Server 2000/2005 (can't remember which) is not a good idea.
Also consider backup: a database file is ONE file (EDIT: if you configure it so). If you're using Filestreams you can back up individual files.
In my project (similar to mediafire and rapidshare), clients can upload files to the server. I am using DB2 database and IBM WAS web server and JSP as server side scripting. I am creating my own encryption algorithm, as it is the main aim of the project.
I need a suggestion on whether the files themselves should be stored in the database or only the location of the files. Which approach is best?
There are Pros and Cons for storing BLOBs in the database.
Advantages
DBMS support for BLOBs is very good nowadays
JDBC driver support for BLOBs is very good
access to the "documents" can happen inside a transaction. No need to worry about manual cleanup or "housekeeping". If the row is deleted, so is the BLOB data
Don't have to worry about filesystem limits. Filesystems are typically not very good at storing millions of files in a single directory, so you would have to distribute your files across several directories.
Everything is backed up together. If you take a database backup you have everything, no need to worry about an additional filesystem backup (but see below)
Easily accessible through SQL (no FTP or other tools necessary). That access is already there and under control.
Same access controls as for the rest of the data. No need to set up OS user groups to limit access to the BLOB files.
Disadvantages
Not accessible from the OS directly (problem if you need to manipulate the files using commandline tools)
Cannot be served directly by e.g. a webserver (that could be a performance problem)
Database backup (and restore) is more complicated (because of size). Incremental backups are usually more efficient in the filesystem
DBMS cache considerations
Not suited for high-write scenarios
You need to judge for yourself which advantage and which disadvantage is more important for you.
I don't share the wide-spread assumption that storing BLOBs in a database is always a bad idea. It depends - as with many other decisions.
It's general knowledge that storing files in the database, especially big ones, is generally a bad idea. There are brilliant explanations in these questions:
Storing a file in a database as opposed to the file system?
Storing Images in DB - Yea or Nay?
And I'd like to highlight some points myself:
Storing files in your DBMS will make your data very big, and big databases are maintenance hell (especially backups)
Portability becomes an issue, as every DBMS vendor has its own implementation of BLOB fields
There's a performance loss on SELECT statements against BLOB fields, compared to disk access
Well, my opinion would be to store the relevant information (path, name, description, etc.) in the database and keep the file, possibly encrypted, on the filesystem. It would be cheaper to scale your system by adding a web server than by adding a database server, as web space is cheap compared with databases. All you will then need is to add an IP or server name column to your database so you can address the new web server.
One of our teams is going to be developing an application to store records in a SQL2008 database and each of these records will have an associated PDF file. There is currently about 340GB of files, with most (70%) being about 100K, but some are several Megabytes in size. Data is mostly inserted and read, but the files are updated on occasion. We are debating between the following options:
Store the files as BLOBs in the database.
Store the files outside the database and store the paths in the database.
Use SQL2008's Filestream feature to store the files.
We have read the Microsoft best practices regarding filestream data, but since the files vary in size, we are not sure which path to choose. We are leaning toward option 3 (filestream), but have some questions:
Which architecture would you choose given the amount of data and file sizes noted above?
Data access will be done using SQL authentication, not Windows authentication, and the web server will likely not be able to access the files using the Windows API. Would this make filestream perform worse than the other two options?
Since the SQL backups include the filestream data, this would lead to very large database backups. How do others handle backing up databases with a large amount of filestream data?
OK, here we go. Option 2 is a really bad idea: you end up with untestable integrity constraints and backups that are, by definition, not guaranteed to be consistent, because you cannot take point-in-time backups. That is not a problem in MOST scenarios, but it turns into one the moment you need a more complicated (point-in-time) recovery.
Options 1 and 3 are pretty equal, albeit with some implications.
Filestream can use a lot more disc space. Basically, every version has a guid, if you make updates the old files stay around until the next backup.
OTOH the files do not count toward DB size (with Express Edition they do not count against the 10 GB limit, should you use it), and access through a file share is possible further down the road. This is added flexibility.
In-database storage has the most limited options regarding access (there is no way for the web server to just open the file after getting the path from SQL Server; it has to funnel the complete file through the SQL protocol layer) but has the advantage of fewer files. Putting the blobs into a separate table, and that table on a separate set of spindles, may be strategically a good idea.
Regarding your questions:
1: I would go with in database storage. Try out both - filestream and not. As you use the same API anyway, this is a simple change in the table definition.
2: Yes, worse than direct file access, but it would be more protected than direct file access. Otherwise I do not think filestream and blob make a significant difference.
3: Where do you have a huge backup here? Sorry to ask, but your 340 GB is not exactly a large database. And you need to back it up ANYWAY. Better to do it in one consistent state, which is what you achieve with DB storage. Plus integrity (no one accidentally deleting unused documents without cleaning up the database). The DB is not significantly larger than with the split approach, and it is a simple one-place backup.
At the end, the question is db integrity and ease of backing things up. Win for SQL Server unless you get large - and this means 360 terabyte of data.
Store the files outside the database and store the paths in the database.
Because it takes too much space to store files in the database.
I would definitely recommend (3) - this is the sort of scenario that this feature is specifically built to handle, and it is handled very well in my opinion.
This white paper has lots of useful information - http://msdn.microsoft.com/en-us/library/cc949109(SQL.100).aspx - and from a security point of view mentions that...
There are two security requirements for using the FILESTREAM feature. Firstly, SQL Server must be configured for integrated security. Secondly, if remote access will be used, then the SMB port (445) must be enabled through any firewall systems.
With regard to Backups, see the accepted answer to this question - SQL Server FILESTREAM limitation
I've used an Index/Content method that you haven't listed, but it might help. You have a table of files that are stored as binary blobs, each with a unique id or row number. A second SQL table provides the index: the name of the file, the path to it, keywords, file type, file size, checksum... whatever you need. This is the best approach I have seen for working with thousands of uploaded documents. The index is required to view the file, as it would just be binary data to the user if they have no idea what the file type is. We store the data in 2 separate databases to allow the index on one server and the file store on multiple servers for easy expansion. At that point the index table/database contains the name or key of the server the file is on. If the user has access to read that particular index table, then they have access to the file.
This scenario is easy: the FILESTREAM recommendation says it is best when the files are (on average) larger than 1 MB, which is not your case. For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
Since you will be accessing the files directly from SQL Server and not from the filesystem, you should store them as BLOBs.
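To make the Index/Content layout above more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the real DBMS, and it keeps both tables in one database for brevity rather than two. The table names, column names, and the `store-01` server key are all hypothetical.

```python
import hashlib
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Content table: only the raw bytes plus an id.
    # Index table: everything needed to find and interpret a file.
    conn.executescript(
        """
        CREATE TABLE file_content (
            id   INTEGER PRIMARY KEY,
            data BLOB NOT NULL
        );
        CREATE TABLE file_index (
            content_id INTEGER NOT NULL REFERENCES file_content(id),
            name       TEXT NOT NULL,
            file_type  TEXT NOT NULL,
            size_bytes INTEGER NOT NULL,
            checksum   TEXT NOT NULL,
            server     TEXT NOT NULL  -- key of the server holding the content
        );
        """
    )


def store_file(conn, name, file_type, data, server="store-01"):
    # Insert content and index rows in a single transaction.
    with conn:
        cur = conn.execute("INSERT INTO file_content (data) VALUES (?)", (data,))
        content_id = cur.lastrowid
        conn.execute(
            "INSERT INTO file_index VALUES (?, ?, ?, ?, ?, ?)",
            (content_id, name, file_type, len(data),
             hashlib.sha256(data).hexdigest(), server),
        )
    return content_id


def load_file(conn, name):
    # Look the file up through the index, then verify the stored checksum.
    row = conn.execute(
        "SELECT i.file_type, i.checksum, c.data "
        "FROM file_index AS i JOIN file_content AS c ON c.id = i.content_id "
        "WHERE i.name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    file_type, checksum, data = row
    if hashlib.sha256(data).hexdigest() != checksum:
        raise ValueError(f"checksum mismatch for {name}")
    return file_type, data
```

Queries that only list or search files touch `file_index` alone, so the blobs are read only when a specific file is actually requested.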
Read When to Use FILESTREAM: http://technet.microsoft.com/en-us/library/bb933993%28v=sql.105%29.aspx
Have you looked at the RBS (Remote Blob Storage) solution? If you use the Filestream RBS provider, it will internally keep your blobs as Filestream files or varbinary(max) values, depending on which gives better performance for the blob size.
Remote BLOB Store Provider Library Implementation Specification
SQL Remote Blob Storage Team Blog
I have a design decision to make regarding documents uploaded to my web site: I can either store them on my file server somewhere, or I can store them as a blob in my database (MSSQL 2005). If it makes any difference to the design decision, these documents are confidential and must have a certain degree of protection.
The considerations I've thought of are:
Storing on the file server makes for HUUUUUUUGE numbers of files all dumped in a single directory, and therefore slower access, unless I can work out a reasonable semantic definition for a directory tree structure
OTOH, I'm guessing that the file server can handle compression somewhat better than the DB... or am I wrong?
My instincts tell me that the DB's security is stronger than the file server's, but I'm not sure if that's necessarily true.
Don't know how having terabytes of blobs in my DB will affect performance.
I'd very much appreciate some recommendations here. Thanks!
In SQL Server 2005, your only choices are using VARBINARY(MAX) to store the files inside the database table, or else keeping them outside.
The obvious drawback of leaving them outside the database is that the database can't really control what happens to them; they could be moved, renamed, deleted.....
SQL Server 2008 introduces the FILESTREAM attribute on VARBINARY(MAX) types, which allows you to leave the files outside the database table, but still under transactional control of the database - e.g. you cannot just delete the files from the disk, the files are an integral part of the database and thus get copied and backed up with it. Great if you need it, but it could make for some huge backups! :-)
The SQL Server 2008 launch presented some "best practices" as to when to store stuff in the database directly, and when to use FILESTREAM. These are:
if the files are typically less than 256 KB in size, the database table is the best option
if the files are typically over 1 MB in size, or could be more than 2 GB in size, then FILESTREAM (or in your case: plain old filesystem) is your best choice
no recommendation for files between those two margins
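The thresholds listed above can be written down as a small decision helper. This is only a sketch of the rule of thumb as stated here, not an official tool; the function name and the returned labels are made up for illustration.

```python
from typing import Optional

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def suggest_storage(typical_size: int, max_size: Optional[int] = None) -> str:
    """Map file sizes (in bytes) to the storage option suggested above."""
    # Anything that could exceed 2 GB goes to FILESTREAM (or the filesystem).
    if max_size is not None and max_size > 2 * GB:
        return "filestream"
    # Typically over 1 MB: FILESTREAM (or the plain filesystem).
    if typical_size > 1 * MB:
        return "filestream"
    # Typically under 256 KB: keep the data in the database table.
    if typical_size < 256 * KB:
        return "table"
    # Between the two margins there is no recommendation.
    return "no recommendation"
```

For example, `suggest_storage(100 * KB)` returns `"table"` and `suggest_storage(5 * MB)` returns `"filestream"`.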
Also, in order not to negatively impact the performance of your queries, it's often a good idea to put the large files into a separate table altogether - don't have the huge blobs be part of the regular tables which you query - but rather create a separate table, which you only ever query if you really need the megabytes of documents or images.
So that might give you an idea of where to start out from!
I strongly suggest you consider the filesystem solution. The reasons are:
you have better access to the files (precious in case of debugging), meaning that you can use regular console-based tools
you can quickly and easily take advantage of the OS to distribute the load, for example using a distributed filesystem, add redundancy via a hardware RAID etc.
you can take advantage of the OS access control lists to enforce permissions.
you don't clog your database
If you are worried about large numbers of entries in your directories, you can always create a branching scheme, for example:
filename : hello.txt
filename md5: 2e54144ba487ae25d03a3caba233da71
final filesystem position: /path/2e/54/hello.txt
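A minimal sketch of that branching scheme, assuming the directory levels are taken from the leading hex pairs of the MD5 of the file name, as in the example above. The function name and the `levels` parameter are illustrative only.

```python
import hashlib
from pathlib import PurePosixPath


def branched_path(root: str, filename: str, levels: int = 2) -> PurePosixPath:
    """Return root/<aa>/<bb>/filename, where aa and bb are the first hex pairs of md5(filename)."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Each directory level consumes two hex characters of the digest.
    parts = [digest[2 * i : 2 * i + 2] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


if __name__ == "__main__":
    print(branched_path("/path", "hello.txt"))
```

With two levels of two hex characters each, files are spread over up to 65,536 directories. Hashing the name (rather than the content) keeps the path computable from the file name alone.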
There's a LOT of "it depends" behind this popular subject. Since you say the documents are sensitive and confidential, off the cuff I'd go with storing in the database. Here are a few reasons:
Potentially better security. It is often easier to hack a file system than a database.
Better volume control. Thousands of files in one folder can strain an OS, where a database can take millions of rows in one table without blinking.
Better searching and scanning. Add categorizing columns when you load the data, or try out full text indexing to scan the actual documents.
Backups may be more efficient -- just add another database to your backup plan, and you're covered (once you work out space details, of course). And those backup files are another layer of obfuscation on anyone trying to get at your sensitive documents.
SQL Server 2008 has data compression options that may help here. That, or have the application do it? (More security through obfuscation, perhaps)
SQL Server 2008 also has the filestream data type, which may help here, but I'm not familiar enough with it to give a recommendation for your situation.
For ages I've been told not to store images on the database, or any big BLOB for that matter. While I can understand why the databases aren't/weren't efficient for that I never understood why they couldn't. If I can put a file somewhere and reference it, why couldn't the database engine do the same. I'm glad Damien Katz mentioned it on a recent Stack Overflow podcast and Joel Spolsky and Jeff Atwood, at least silently, agreed.
I've been reading hints that Microsoft SQL Server 2008 should be able to handle BLOBs efficiently. Is that true? If so, what is stopping us from just storing images there and getting rid of one problem? One thing I can think of is that while the image can be served by a static web server very quickly if it's a file somewhere, when it's in the database it has to travel from the database to the web server application (which might be slower than the static web server) and then it's served. Shouldn't caching help/solve that last issue?
Yes, it's true, SQL Server 2008 just implemented a feature like the one you mention, it's called a filestream. And it's a good argument indeed for storing blobs in a DB, if you are certain you will only want to use SQL Server for your app (or are willing to pay the price in either performance or in developing a similar layer on top of the new DB server). Although I expect similar layers will start to appear if they don't already exist for different DB servers.
As always, what the real benefits would be depends on the particular scenario. If you will serve lots of relatively static, big files, then this scenario plus caching will probably be the best option considering a performance/manageability combo.
This white paper describes the FILESTREAM feature of SQL Server 2008, which allows storage of and efficient access to BLOB data using a combination of SQL Server 2008 and the NTFS file system. It covers choices for BLOB storage, configuring Windows and SQL Server for using FILESTREAM data, considerations for combining FILESTREAM with other features, and implementation details such as partitioning and performance.
Just because you can do something doesn't mean you should.
If you care about efficiency you'll still most likely not want to do this for any sufficiently large scale file serving.
Also it looks like this topic has been heavily discussed...
Exact Duplicate: User Images: Database or filesystem storage?
Exact Duplicate: Storing images in database: Yea or nay?
Exact Duplicate: Should I store my images in the database or folders?
Exact Duplicate: Would you store binary data in database or folders?
Exact Duplicate: Store pictures as files or in the database for a web app?
Exact Duplicate: Storing a small number of images: blob or fs?
Exact Duplicate: store image in filesystem or database?
I'll try to decompose your question and address your various parts as best I can.
SQL Server 2008 and the Filestream Type - Vinko's answer above is the best one I've seen so far. The Filestream type in SQL Server 2008 is what you were looking for. Filestream is in version 1, so there are still some reasons why I wouldn't recommend using it for an enterprise application. As an example, my recollection is that you can't split the storage of the underlying physical files across multiple Windows UNC paths. Sooner or later that will become a pretty serious constraint for an enterprise app.
Storing Files in the Database - In the grander scheme of things, Damien Katz's original direction was correct. Most of the big enterprise content management (ECM) players store files on the filesystem and metadata in the RDBMS. If you go even bigger and look at Amazon's S3 service, you're looking at physical files with a non-relational database backend. Unless you're measuring your files under storage in the billions, I wouldn't recommend going this route and rolling your own.
A Bit More Detail on Files in the Database - At first glance, a lot of things speak for files in the database. One is simplicity, two is transactional integrity. Since the Windows file system cannot be enlisted in a transaction, writes that need to occur across the database and filesystem need to have transaction compensation logic built in. I didn't really see the other side of the story until I talked to DBAs. They generally don't like commingling business data and blobs (backup becomes painful) so unless you have a separate database dedicated to file storage, this option is generally not as appealing to DBAs. You're right that the database will be faster, all other things being equal. Not knowing the use case for your application, I can't say much about the caching option. Suffice it to say that in many enterprise applications, the cache hit rate on documents is just too darn low to justify caching them.
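The "transaction compensation logic" mentioned above could look roughly like the following sketch: write the file first, then record its path in the database, and delete the file again if the database write fails. sqlite3 stands in for the real DBMS, and the `documents` table and `save_document` helper are hypothetical.

```python
import os
import sqlite3
import tempfile


def save_document(conn: sqlite3.Connection, directory: str, name: str, data: bytes) -> str:
    """Write the file, then its metadata row; undo the file write if the insert fails."""
    path = os.path.join(directory, name)
    # "xb" refuses to overwrite an existing file, so a failed call
    # can never remove a file that an earlier call created.
    with open(path, "xb") as fh:
        fh.write(data)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO documents (name, path) VALUES (?, ?)", (name, path)
            )
    except sqlite3.Error:
        os.remove(path)  # compensation: the database has no row, so drop the file
        raise
    return path


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (name TEXT PRIMARY KEY, path TEXT NOT NULL)")
    with tempfile.TemporaryDirectory() as tmp:
        print(save_document(conn, tmp, "invoice.pdf", b"%PDF-1.4 demo"))
```

This only covers the simple failure case. A crash between the file write and the insert still leaves an orphaned file, which is why such designs usually also need a periodic cleanup job.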
Hope this helps.
One of the classical reasons for caution about storing blobs in databases is that the data will be stored and edited (changed) under transaction control, which means that the DBMS needs to ensure that it can rollback changes, and recover changes after a crash. This is normally done by some variation on the theme of a transaction log. If the DBMS is to record the change in a 2 GB blob, then it has to have a way of identifying what has changed. This might be simple-minded (the before image and the after image) or more sophisticated (some sort of binary delta operation) that is more computationally expensive. Even so, sometimes the net result will be gigabytes of data to be stored through the logs. This hurts the system performance. There are various ways of limiting the impact of the changes - reducing the amount of data flowing through the logs - but there are trade-offs.
The penalty for storing filenames in the database is that the DBMS has no control (in general) over when the files change - and hence again, the reproducibility of the data is compromised; you cannot guarantee that something outside the DBMS has not changed the data. (There's a very general version of that argument - you can't be sure that someone hasn't tampered with the database storage files in general. But I'm referring to storing a file name in the database referencing a file not controlled by the DBMS. Files controlled by the DBMS are protected against casual change by the unprivileged.)
The new SQL Server functionality sounds interesting. I've not explored what it does, so I can't comment on the extent to which it avoids or limits the problems alluded to above.
There are options within SQL Server to manage where it stores large blobs of data. These have been in there since at least SQL2005, so I don't know why you couldn't store large BLOBs of data. MOSS, for instance, stores all of the documents you upload to it in a SQL database.
There are of course some performance implications, as with just about anything, so you should take care that you don't retrieve the blob if you don't need it, and don't include it in indexes etc.
Our EPOS system copies data by compressing the database into a zip file, and manually copying to each till, using shared directories.
Each branch is linked to the main location using a VPN, which can be problematic but is required for the file sharing to work correctly.
Since our database system currently does not support replication, is there another solution for copying data or should we migrate our software to another database?
Replication is the "right" way to go, so if migrating to another database is an option (is it really?), that's the best route.
You might consider a utility that queries all the tables for raw data (in CSV?), sending that to files. Then at least you don't have to take the database down to do the backup.
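Such an export utility might look like the sketch below. The question does not say which database the EPOS system uses, so this uses Python's sqlite3 module as a stand-in; listing tables via `sqlite_master` is SQLite-specific and would need replacing for another DBMS. The function name is hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_tables_to_csv(conn: sqlite3.Connection, out_dir: str) -> list:
    """Dump every user table to <out_dir>/<table>.csv and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    written = []
    for table in tables:
        # Table names come from the catalog, not from user input.
        cur = conn.execute(f'SELECT * FROM "{table}"')
        path = out / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)  # stream the data rows
        written.append(path)
    return written


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tills (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO tills VALUES (1, 'till-1')")
    with tempfile.TemporaryDirectory() as tmp:
        print(export_tables_to_csv(conn, tmp))
```

The resulting CSV files can then be compressed and copied to each till in place of the zipped database file.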