I"m looking to run PostgreSQL in RAM for performance enhancement. The database isn't more than 1GB and shouldn't ever grow to more than 5GB. Is it worth doing? Are there any benchmarks out there? Is it buggy?
My second major concern is: how easy is it to back things up when it's running purely in RAM? Is this just like using RAM as a tier-1 hard drive, or is it much more complicated?
It might be worth it if your database is I/O bound. If it's CPU-bound, a RAM drive will make no difference.
But first things first: make sure your database is properly tuned; you can get huge performance gains that way without losing any guarantees. Even a RAM-based database will perform badly if it isn't tuned. See the PostgreSQL wiki on performance tuning, mainly shared_buffers, effective_cache_size, the checkpoint_* settings, and default_statistics_target.
Second, if you want to avoid synchronizing disk buffers on every commit (as codeka explained in his comment), disable the synchronous_commit configuration option. If the machine loses power, you may lose the most recent transactions, but the database will still be 100% consistent. In this mode, RAM is used to buffer all writes, including writes to the transaction log. So with very rare checkpoints and large shared_buffers and wal_buffers, it can actually approach speeds close to those of a RAM drive.
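For illustration, a minimal postgresql.conf sketch in the spirit of the advice above; the values are examples only, not tuned recommendations for your hardware:

    shared_buffers = 1GB              # give the buffer cache a sizeable share of RAM
    effective_cache_size = 3GB        # how much caching (PG + OS) the planner may assume
    wal_buffers = 16MB                # buffer WAL writes in memory
    checkpoint_timeout = 30min        # less frequent checkpoints
    synchronous_commit = off          # don't flush WAL to disk on every commit
    default_statistics_target = 100   # detail level of planner statistics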
Hardware can also make a huge difference. 15,000 RPM disks can, in practice, be 3x as fast as cheap drives for database workloads. RAID controllers with battery-backed cache also make a significant difference.
If that's still not enough, then it may make sense to consider turning to volatile storage.
Whether to hold your database in memory depends on size and performance, as well as how robust you want it to be with writes. I assume you are writing to your database and that you want to persist the data in case of failure.
Personally, I would not worry about this optimization until I ran into performance issues. It just seems risky to me.
If you are doing a lot of reads and very few writes, a cache might serve your purpose; many ORMs come with one or more caching mechanisms.
From a performance point of view, clustering across a network to another DBMS that does all the disk writing seems a lot less efficient than just having a regular DBMS tuned to keep as much as possible in RAM.
Actually... as long as you have enough memory available, your database will already be running largely from RAM. The filesystem will buffer all the data, so a RAM drive won't make much of a difference.
But... there is of course always a bit of overhead, so you can still try running it all from a RAM drive.
As for backups, that's just like any other database. You could use the normal Postgres dump utilities to back up the system. Or, even better, let it replicate to another server as a backup.
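For example, a minimal dump-and-restore sketch, assuming a database named mydb (the name is hypothetical):

    pg_dump -Fc mydb -f mydb.dump          # compressed, custom-format dump
    pg_restore -d mydb_restored mydb.dump  # restore into another (pre-created) database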
In-memory DBMSs can be 5 to 40 times faster than disk-resident DBMSs. Check out Gartner's Magic Quadrant for Operational DBMSs 2013.
Gartner shows who is strong and, more importantly, notes severe cautions: bugs, errors, lack of support, and vendors whose products are hard to use.
I'm looking for a proper way to archive and back up my data. This data consists of photos, videos, documents and more.
There are two main things I'm afraid might cause data loss or corruption: hard drive failure and bit rot.
I'm looking for a strategy that can ensure my data's safety.
I came up with the following. One hard drive which I will regularly use to store and display data. A second hard drive which will serve as an onsite backup of the first one. And a third hard drive which will serve as an offsite backup. I am however not sure if this is sufficient.
I would prefer to use regular drives, and not network attached storage, however if it's better suited I will adapt.
One of the things I read about that might help with bit rot is ZFS. ZFS does not prevent bit rot but can detect data corruption by using checksums. This would allow me to recover a corrupted file from a different drive and copy it to the corrupted one.
I need at least 2TB of storage, but I'm considering 4TB to cover potential future needs.
What would be the best way to safely store my data and prevent data loss and corruption?
For your local system plus local backup, I think a RAID configuration / ZFS makes sense because you're just trying to handle single-disk failures / bit rot, and having a synchronous copy of the data at all times means you won't lose the data written since your last backup was taken. With two disks ZFS can do a mirror and handles bit rot well, and if you have more disks you may consider RAIDZ configurations, since they use less storage overall to provide single-disk failure recovery. I would recommend ZFS here over general RAID solutions because it has a better user interface.
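As a rough sketch of the two-disk mirror setup (pool and device names are hypothetical):

    zpool create tank mirror /dev/sda /dev/sdb   # two-disk ZFS mirror
    zpool scrub tank                             # run periodically to verify checksums and repair from the good copy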
For your offsite backup, ZFS could make sense too. If you go that route, periodically use zfs send to copy a snapshot on the source system to the destination system. Debatably, you should use mirroring or RAIDZ on the backup system to protect against bit rot there too.
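A hedged sketch of the offsite replication workflow, with hypothetical dataset, snapshot, and host names:

    zfs snapshot tank/data@2024-06-01
    zfs send tank/data@2024-06-01 | ssh backuphost zfs receive backuppool/data
    # later runs only send the delta between two snapshots
    zfs send -i tank/data@2024-06-01 tank/data@2024-07-01 | ssh backuphost zfs receive backuppool/data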
That said — there are a lot of products that will do the offsite backup for you automatically, and if you have an offsite backup, the only advantage of having an on-site backup is faster recovery if you lose your primary. Since we’re just talking about personal documents and photos, taking a little while to re-download them doesn’t seem super high stakes. If you use Dropbox / Google Drive / etc. instead, this will all be automatic and have a nice UI and support people to yell at if anything goes wrong. Also, storage at those companies will have much higher failure tolerances because they use huge numbers of disks (allowing stuff like RAIDZ with tens of parity disks and replicated across multiple geographic locations) and they have security experts to make sure that all your data is not stolen by hackers.
The only downsides are cost, and not being as intimately involved in building the system, if that part is fun for you like it is for me :).
I have been monitoring the performance of an OLTP database (approx. 150GB); the average disk sec/read and average disk sec/write values are exceeding 20 ms over a 24hr period.
I need to arrive at a clear explanation as to why the business application has no influence over the 'less-than-stellar' performance on these counters. I also need to exert some pressure to have the storage folk re-examine their configuration as it applies to the placement of the mdf, ldf and tempdb files on their SAN. At present my argument is shaky, but I am pressing my point with people who don't understand the difference between IOPS and disk latency.
Beyond the limitations of physical hardware and the placement of data files across physical disks, is there anything else that would influence these counter values? For instance: the number of transactions per second, the size of the query, poorly written queries or missing indexes? My readings say 'no' but I need a voice of authority in this debate.
There are "a lot" of factors that can affect the overall latency. To truly rule it as SAN or not, you will want to look at the "Avg. Disk sec/Read counter" and the "Avg. Desk sec/Write Counter", that you mentioned. Just make sure you are looking at the "Physical Disk" object, and not the "Logical Disk" object. The logical disk counter includes the file system overhead, and may be different, depending on different factors.
Once you have the counters for the physical disks, you will want to compare them to the latency counters for the storage unit the server is connected to. You mentioned 'storage folk', so I'm going to assume that is a different team; hopefully they will be nice and provide the info to you.
If it is a storage unit issue, then both sets of counters should match up pretty well, which indicates the storage unit really is running slow. If the storage unit counters look significantly better, then it's something in between. Depending on what type of storage network you are using, this would be the HBA/NIC/switches that connect the server and storage together. Or, if it's a VM, the host machine's stats would be useful as well.
Apart from obvious reasons such as "not enough memory for buffer pool", latency mostly depends on how your storage is actually implemented.
If your server has an external SAN, the usual problem is that it might give you stellar throughput, but it will never (again, usually) give you stellar latency. It's just the way things are. It can become a real headache for heavily loaded OLTP systems, sure.
So, if you are about to squeeze every last microsecond from your storage, most probably you will need local drives. That, and your RAID 10 should have enough spindles to cope with the load.
I am experimenting with running PostgreSQL on a ramdisk on Windows. The way I did it was to simply place the data directory on the ramdisk.
Without having done any specific benchmarks, the performance seems to be magnificent and only CPU bound. My question is what the optimal values for things like work_mem, shared_buffers etc. would be?
Even when the database is in RAM, it takes more than half a minute to run many of my queries. Therefore, I wonder whether it makes sense to create indexes on the tables. The indexes would, of course, need to stay in RAM as well. I should mention that I am using PostgreSQL for a data warehouse (a small one, though).
Edit: I should mention that I am using the RamDisk utility from DataRam.com. It only gives me a bluescreen once in a while, when I configure the ramdisk, never when it is established. I think of this as nostalgic eyecandy. ;)
I would definitely create indexes. The engine can use the information they contain for all sorts of optimizations, and they should improve your performance quite a bit. A RAM disk takes the sting out of the worst-case table-scan scenarios, but that doesn't mean a table scan is faster than a proper index lookup.
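For example (the table and column names are hypothetical):

    -- hypothetical fact table in the data warehouse
    CREATE INDEX idx_sales_order_date ON sales (order_date);
    ANALYZE sales;  -- refresh planner statistics so the new index is actually considered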
Yes, use indexes, but work_mem still depends on the size of the machine itself. Of course, one of the reasons you want a high work_mem is so sorts don't spill to disk. On a ramdisk that spill isn't nearly as costly. Just remember: you have no durability on a RAM disk.
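As an illustration, work_mem can also be raised per session for big warehouse queries (the value is arbitrary, not a recommendation):

    -- session-level override; the server-wide default lives in postgresql.conf
    SET work_mem = '256MB';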
Will the performance of a SQL server drastically degrade if the database is bigger than the RAM? Or does only the index have to fit in the memory? I know this is complex, but as a rule of thumb?
Only the working set, i.e. the commonly or currently used data, needs to fit into the buffer cache (a.k.a. data cache). This includes indexes too.
There is also the plan cache, network buffers and other stuff. MS have put a lot of work into memory management on SQL Server and it works well, IMHO.
Generally, more RAM will help but it's not essential.
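If you want to see what is actually sitting in the buffer cache, here is a hedged example against SQL Server's DMVs (each buffer descriptor represents one 8 KB page):

    SELECT DB_NAME(database_id) AS database_name,
           COUNT(*) * 8 / 1024  AS cached_mb
    FROM   sys.dm_os_buffer_descriptors
    GROUP  BY database_id
    ORDER  BY cached_mb DESC;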
Yes, when indexes can't fit in memory or when doing full table scans. Aggregate functions over data not in memory will also require many (and possibly random) disk reads.
For some benchmarks:
Query time will depend significantly on whether the affected data currently resides in memory or disk access is required. For disk-intensive operations, the characteristics of the disk's sequential and random I/O performance are also important.
http://www.sql-server-performance.com/articles/per/large_data_operations_p7.aspx
Therefore, don't expect the same performance if your DB size > RAM size.
Edit:
http://highscalability.com/ is full of examples like:
Once the database doesn't fit in RAM you hit a wall.
http://highscalability.com/blog/2010/5/3/mocospace-architecture-3-billion-mobile-page-views-a-month.html
Or here:
Even if the DB size is just 10% bigger than RAM size this test shows a 2.6 times drop in performance.
http://www.mysqlperformanceblog.com/2010/04/08/fast-ssd-or-more-memory/
Remember, though, that this applies to hot data: data that you want to query over and can't cache. If you can cache it, you can easily live with significantly less memory.
All DB write operations still have to be persisted to disk eventually; having more RAM is helpful, but not essential.
Loading the whole database into RAM is not always practical. Databases can be up to terabytes in size these days, and there is little chance anyone would buy that much RAM. I think performance can still be good even if the available RAM is only one tenth the size of the database.
What is the most efficient solution when you need to record some data on every page view in your application - should you write to a file or write to the database?
Or maybe neither - perhaps you should cache the data in memory or a file and only write it to the database (or file system if you use a memory cache) occasionally?
If it's purely recording a small amount of data with no subsequent lookups, straight file I/O is almost guaranteed to be more efficient. You're losing all the advantages of a DBMS though -- indexing, transactional integrity (really, ACID in general), concurrent access, etc.
It almost sounds like you're talking about what amounts to simple logging. If that's the case, and you don't need to do frequent complex queries on the resulting data, you're probably better off with straight file I/O if performance is a serious issue. Be careful of concurrent-write issues, though.
If the properties of an RDBMS are desirable, you might think about using SQLite, which for simple workloads will get you better performance than most RDBMSs, with less overhead, at the cost of some of the benefits (highly concurrent access and availability over the network to other machines are a couple of the "biggies"). It still wouldn't be as fast as straight file I/O in the general case, though.
Your later mention of it being for page view tracking causes me to ask: Are you incrementing a counter, rather than logging data about the page view? If so, I'd strongly suggest going with something like SQLite (doing something like UPDATE tbl SET counter = counter+1). You really don't want to get into the timing issues involved in doing this by hand -- if you don't do it right, you'll start losing counts on simultaneous access (A reads "100", B reads "100", A writes "101", B writes "101"; B should have written 102, but has no way of knowing that).
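A minimal SQLite sketch of that counter approach, with hypothetical table and column names:

    -- one row per page; the UPDATE is atomic, so concurrent requests can't lose counts
    CREATE TABLE IF NOT EXISTS page_counts (page TEXT PRIMARY KEY, views INTEGER NOT NULL DEFAULT 0);
    INSERT OR IGNORE INTO page_counts (page, views) VALUES ('/home', 0);
    UPDATE page_counts SET views = views + 1 WHERE page = '/home';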
Conceptually, writing to the database is always slower than writing to a file.
The database has to write to a file too, with the extra overhead of communication to get the data to the database, so it can write it to a file. Therefore, it must be slower.
That said, databases do disk I/O very well, probably better than you will. Don't be surprised if you find out that a simple file logger is slower than writing to a database. The database has a lot of I/O optimizations and has some tricks available that you may not (depending on your web language and environment).
Don't be surprised if the answer changes over time. When your site is small, logging to a database is very fast. As your site grows, the logging table can become a major pain: It uses a lot of disk space, makes the backups take forever, and consumes all the disk I/O when you try to query it. This is why you should benchmark both methods yourself. Then you can re-test in the future, when conditions change.
Hitting the database is most likely going to be more expensive than writing to a file.
If your page views per second are high, and if the data doesn't need to be available in the database right away, then writing to a file and periodically loading the data into the DB will be the better solution.
However, it all depends on the nature of the data you're recording per page view and how critical it is to whatever business function it serves.
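As a concrete sketch of the "append to a file, bulk-load later" approach, assuming PostgreSQL and a hypothetical pageviews table and CSV file:

    # the application appends one CSV line per page view; a periodic job streams the file in
    psql -d appdb -c "\copy pageviews (ts, path) from 'pageviews.csv' with (format csv)"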
That highly depends on your needs for data safety. If you can afford to lose some data in case of a crash then keeping the data in memory and writing it periodically to a persistent store is certainly the most efficient way to go.
Edit: You mentioned pageviews. In that case I would keep the counters in memory and periodically update a database table (like every minute or so).
That depends.
And it really does: it depends on the DBMS and/or the OS+filesystem you use. In other words, your mileage may vary.
If you just append data somewhere, a modern DBMS or OS+filesystem should handle it equally well and fast. Problems arise when you want to change existing data.
Caching also depends on what caching granularity you can afford (needing every step logged crash-safe versus the potential savings).
Use a hybrid solution like Redis; it's designed for this sort of thing.
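For instance, a minimal sketch with redis-cli (the key name is made up): counters live in Redis and can be flushed to the database by a background job.

    redis-cli INCR pageviews:/products/42   # atomic per-page counter
    redis-cli GET  pageviews:/products/42   # read it back when flushing to the DB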