As our database keeps growing, we have been asked, as part of a business requirement, to compress the data in our production database. I need to know the disadvantages of database compression in Postgres.
This is too broad for a good answer. For one, how do you want to compress the data? I'll try to give a general answer though.
With all compression, the trade-off is between size savings and extra work for the CPU. And storage space is usually the cheaper resource.
Normally it is not a good idea to compress a database completely. Instead, consider:
increasing the storage space. It is comparatively cheap.
moving old data to an archive database.
Compressing large data in columns can be a good idea, but PostgreSQL does that automatically anyway:
large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or “the best thing since sliced bread”).
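For illustration only (the table and column names here are made up), the per-column TOAST strategy can be adjusted if you want more control:

    -- hypothetical table with a large text column
    CREATE TABLE logs (id bigserial PRIMARY KEY, payload text);

    -- EXTENDED (the default) allows both compression and out-of-line storage;
    -- EXTERNAL stores out of line without compression; MAIN prefers compression
    ALTER TABLE logs ALTER COLUMN payload SET STORAGE EXTENDED;

    -- on PostgreSQL 14 or later you can also choose the compression algorithm
    ALTER TABLE logs ALTER COLUMN payload SET COMPRESSION lz4;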
Possible Duplicate:
Storing Images in DB - Yea or Nay?
Is it faster and more reliable to store images in the file system or should I store them in a database?
Let's say the images will be no larger than 200 MB. The objective is fast, reliable access.
In general, how do people decide between storing files (e.g., images, PDFs) in the file system or in the database?
Personal opinion: I ALWAYS store images on the file system, and only store a filepath in the database. In many situations, databases live on fast (read: expensive, 15k RPM or SSD) storage, while images and other files can typically be kept on slower (read: cheaper, larger, 7.2k RPM) drives.
I find this to be the best approach, since it allows the database to remain small in size. In general, databases store "data" well: they can search and retrieve small bits of data fast. File systems store "files" well: they are optimized to find and retrieve larger bits of data fast.
Obviously there are tradeoffs to both approaches, and there isn't going to be a one-size-fits-all answer. However, there are use cases where storing images in the database is a good thing: if they are all quite small, you don't anticipate very many of them, and your database is on the same storage medium as your file share, then it probably makes sense to drop the images directly into the database.
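As a rough sketch of the path-in-the-database approach (table and column names invented for illustration), the database holds only metadata plus a pointer to the file on cheaper storage:

    CREATE TABLE images (
        id         BIGINT PRIMARY KEY,
        file_path  VARCHAR(500) NOT NULL,   -- location on the file share
        mime_type  VARCHAR(100),
        byte_size  BIGINT,
        created_at TIMESTAMP
    );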
As a side note, SQL Server 2008 R2 has FILESTREAM storage for varbinary(max) columns, which can provide the best of both worlds. I have not used it yet, so I can't speak to how well it works.
Store files/images in the database if you require the following:
Access Control
Versioning
Checkin/Check out
Searching based on metadata
That has been the design of major CMSes like SharePoint.
However, if your content is mostly static and not going to change over time, you can go with the file system and enable optimizations/caching on the web server. A rough schema sketch for the database approach is shown below.
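A minimal, hypothetical schema covering those requirements (the exact blob type varies by engine: BLOB, bytea, varbinary(max), and so on):

    CREATE TABLE documents (
        id             BIGINT PRIMARY KEY,
        file_name      VARCHAR(255) NOT NULL,
        version        INT NOT NULL DEFAULT 1,   -- versioning
        owner_id       BIGINT NOT NULL,          -- access control
        checked_out_by BIGINT,                   -- check-in / check-out
        tags           VARCHAR(1000),            -- metadata for searching
        content        BLOB                      -- the file itself
    );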
With the database approach, you only have one thing to connect to. If your users are distributed, that might be simpler. Note that if the images are not accessed too frequently, the efficiency issues might not matter.
First of all, you should know that many databases limit how large a single stored value can be (around 8 MB in some configurations, I think). Leaving that aside, I think it is better to store files in the file system and only small images in the database.
I'm using a sqlite db which is very convenient and seems to meet all of my needs at this point.
Currently my db size is <50MB, but I now need to add a new table which will store large text blobs, which will cause the db to reach up to 5GB within the next year.
Would sqlite be able to deal with a 5GB db size? Any caveats to that, compared with say mysql?
I'm not a huge expert on databases, but most of the DB-related work I've done used SQLite. In my experience, making the database larger in itself shouldn't incur a large performance hit. Naturally you'll have more data, so be prepared to spend more time querying it!
Consider this thought experiment: you have a table named mydata you use all the time in the DB. Now, you add an unrelated table otherdata. Your queries for mydata don't depend on the information in otherdata. Even if you shove GBs of data into otherdata, you won't feel any real performance hit in your usage of mydata.
AFAIK, the architecture of SQLite supports this claim.
SQLite should be just fine for what you want to do. Size really isn't a concern. As long as your data file can reside on the same computer that's making the calls, you should be just fine. If you put it on the network, that's OK, but multi-user access is subject to the bugs of the operating system when it comes to locking records, etc. As for comparing it with MySQL: since you've eliminated the server, you've also eliminated the network traffic associated with data retrieval, which should speed things up.
As stated in the SQLite FAQ (see point 12), the maximum size of an SQLite database can be up to 140 TB!
I find that using indexes saves a lot of time, so give them a try!
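A minimal SQLite sketch (names invented) showing an index on the lookup key of a table that holds large text blobs:

    CREATE TABLE blobs (
        id      INTEGER PRIMARY KEY,
        doc_key TEXT NOT NULL,
        body    TEXT              -- the large text blob
    );

    CREATE INDEX idx_blobs_doc_key ON blobs(doc_key);

    -- lookups by key use the index instead of scanning the whole table
    SELECT id FROM blobs WHERE doc_key = 'example-key';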
I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) is used very often in a user-facing manner.
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on Rails. The database is MySQL, and our tables are stored using InnoDB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data using JSON blobs, while indexing only the primary keys.
Update 2
The tricky thing is that the data is actually being used both for backend processes and for user-facing features, but it is accessed far less often by the latter.
Update 3
Some people are commenting that 8 GB is a toy these days. I agree, but just throwing more memory at the database is pure LAZINESS if there is a smarter, more efficient solution.
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schema within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150 GB is not very big, and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of the active database, denormalize them into a star schema, and load them into the historical data warehouse.
If the number of columns used in the customer-facing tables is small, you can create indexes containing all the columns used in the queries. This doesn't mean that all the data stays in memory, but it can make the queries much faster. It's trading space for response time.
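For example (hypothetical table and columns), a covering index lets InnoDB answer the query from the index alone:

    CREATE INDEX idx_users_email_name ON users (email, display_name);

    -- both columns referenced here are in the index, so no row lookup is needed
    SELECT display_name FROM users WHERE email = 'someone@example.com';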
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could only cache those things you knew you wanted to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
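For instance (table names invented), a backend reporting query can opt out like this:

    SELECT SQL_NO_CACHE customer_id, SUM(amount)
    FROM orders
    WHERE created_at < '2010-01-01'
    GROUP BY customer_id;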
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a data warehouse gives you MUCH more flexibility and capability, so go down that path if possible).
UPDATE:
See documentation of SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache
Ensure query_cache_type is set appropriately for your needs.
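On MySQL versions that still include the query cache (it was removed in MySQL 8.0), you can check the current settings and, with query_cache_type = 2 (DEMAND), cache only the queries that ask for it. The table below is hypothetical:

    SHOW VARIABLES LIKE 'query_cache%';

    -- with query_cache_type = DEMAND, only queries marked SQL_CACHE are cached
    SELECT SQL_CACHE id, name FROM customers WHERE id = 42;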
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables, etc., in the InnoDB buffer pool.
So, what is the problem?
First, 150 GB is not very large today. It was 10 years ago.
Second, any non-total-crap database system will utilize your memory as a cache. If the cache is big enough (compared to the amount of data that is in use), it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8 GB of memory is VERY low for a modern server; it was low 2 years ago).
You should not have to do anything for the memory to be used efficiently. At least not on a commercial-level database; maybe MySQL sucks, but I would not assume that.
I have an application which retrieves many large log files from a system on the LAN.
Currently I put all the log files into PostgreSQL; the table has a column of type TEXT, and I don't plan to run any searches on this text column, because another external process retrieves all the files nightly and scans them for sensitive patterns.
So the column value could also be a BLOB or a CLOB, but now my question is the following:
the database already has its own compression system, but could I improve on this compression manually, as with common compressor utilities? And above all, what if I manually pre-compress the large file and then store it as binary in the table: is that pointless, given that the database system provides its internal compression?
I don't know who would compress the data more efficiently, you or the db; it depends on the algorithm used, etc. But what is sure is that if you compress it, asking the db to compress it again will be a waste of CPU. Once it is compressed, trying to compress it again yields less gain each time, until you eventually end up consuming more space.
The internal compression used in PostgreSQL is designed to err on the side of speed, particularly for decompression. Thus, if you don't actually need that, you will be able to reach higher compression ratios if you compress it in your application.
Note also that if the database does the compression, the data will travel between the database and the application server in uncompressed format - which may or may not be a problem depending on your network.
As others have mentioned, if you do this, be sure to turn off the builtin compression, or you're wasting cycles.
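In PostgreSQL, one way to do that (table and column names are invented here) is to mark the column as EXTERNAL, which keeps out-of-line TOAST storage for large values but skips PostgreSQL's own compression, so pre-compressed data isn't compressed a second time:

    CREATE TABLE log_files (
        id      bigserial PRIMARY KEY,
        gz_body bytea            -- content already compressed by the application
    );

    ALTER TABLE log_files ALTER COLUMN gz_body SET STORAGE EXTERNAL;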
The question you need to ask yourself is whether you really need more compression than the database provides, and whether you can spare the CPU cycles for this on your application server. The only way to find out how much more compression you can get on your data is to try it out. Unless there's a substantial gain, don't bother with it.
My guess here is that if you do not need any searching or querying ability on the text, you could gain a reduction in disk usage by zipping the file and then just storing the binary data directly in the database.
Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that web servers are cheap and you can easily add more to balance the load, whereas the database is usually the most expensive and hardest-to-scale part of a web architecture.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
Unless, possibly, you write desktop apps and/or know roughly how many users you will ever have; but on something as random and unpredictable as a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
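A minimal sketch of a FILESTREAM table (this assumes the database already has a FILESTREAM filegroup configured; the table and column names are invented):

    CREATE TABLE Documents (
        DocId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Name    NVARCHAR(255) NOT NULL,
        Content VARBINARY(MAX) FILESTREAM NULL
    );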
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMSes such as SQL Server 2008 have a variety of ways of dealing with BLOBs which aren't just sticking them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storage, whereas a database is not. Of course, there are some exceptions (e.g. the much-heralded next-gen MS filesystem is supposed to be built on top of SQL Server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with #ZombieSheep.
Just one more thing - I generally don't think that databases actually need to be portable, because you would miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
The overhead of having to parse a blob (image) into a byte array, write it to disk under the proper file name, and then read it back is enough of a hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything, but I think the type of 'file' you will be storing is one of the biggest determining factors. If you are essentially talking about a large text field which could be stored as a file, my preference would be for DB storage.
Interesting topic.
There is absolutely no single correct answer to this question.
There are a few key elements to consider:
What’s your database engine?
What’s the route of the file from the database to the end user and/or back?
What are the security requirements?
If files are meant for a public audience and are accessible via a website, you shouldn't even consider storing the files in the database. Use some smart indexing for the files instead.
If the files contain highly sensitive information, then it might be worth storing them in the database. But you have to implement proper secure gateways too.
If performance is crucial, it's better not to store files in the database.
Backing up, restoring, and migrating the database might become a nightmare if it grows big just because of files. If you are the DBA, you will want to kill the person who “invented” the idea of putting files into the database.
I recommend storing files in the database only as a last resort, when there is absolutely no better alternative available.