I am experimenting with running PostgreSQL on a ramdisk on Windows. The way I did it was to simply place the data directory on the ramdisk.
Without having done any specific benchmarks, the performance seems to be magnificent and only CPU bound. My question is: what would the optimal values be for things like work_mem, shared_buffers, etc.?
Even when the database is in RAM, it takes more than half a minute to run many of my queries. Therefore, I wonder whether it makes sense to create indexes on the table. The indexes would, of course, need to stay in RAM. I should mention that I am using PostgreSQL for a data warehouse (a small one, though).
Edit: I should mention that I am using the RamDisk utility from DataRam.com. It only gives me a blue screen once in a while when I configure the ramdisk, never once it is established. I think of this as nostalgic eye candy. ;)
I would definitely create indexes. The engine can use the information they contain for all sorts of optimizations, and they should improve your performance quite a bit. The RAM disk solves the worst-case table-scan scenarios, but that doesn't necessarily mean a table scan is faster than doing a correct lookup.
Yes, use indexes, but work_mem still depends on the size of the machine itself. Of course, one of the reasons you want a high work_mem is so you don't hit disk to do an external ("tape") sort; on a ramdisk this isn't nearly as dangerous. Just remember: you have no data integrity on a RAM disk, so everything is gone after a crash or reboot.
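To make both points concrete, here is a minimal sketch using psycopg2; the facts table, the order_date column, the connection string, and the 256MB value are placeholder assumptions, not recommendations.

```python
# Sketch: create an index and raise work_mem for one session, then check the
# plan. Table/column names and settings are illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=dwh")   # connection details are assumptions
conn.autocommit = True
cur = conn.cursor()

# One-off: build the index. On a RAM disk it has to be rebuilt after any
# restart, since nothing survives a reboot.
cur.execute("CREATE INDEX idx_facts_order_date ON facts (order_date)")

# Per-session work_mem, so big sorts and hash joins stay in memory.
cur.execute("SET work_mem = '256MB'")

# Compare the plan with and without the index to confirm it is being used.
cur.execute("EXPLAIN ANALYZE SELECT count(*) FROM facts WHERE order_date >= '2010-01-01'")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```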
Related
We are looking for ways to improve query/index performance in Solr. We are not concerned about the storage requirements; we have a lot of storage space.
Essentially, we want to speed up Solr querying/indexing by throwing more storage at the Solr index.
I have already reviewed http://wiki.apache.org/solr/SolrPerformanceFactors. But it doesn't cover this particular scenario.
P.S. You can tell me that this is a stupid question, and I won't mind :)
Indexing side:
You could potentially speed up indexing by using a very high mergeFactor. That leaves you with a lot of Lucene index segments, because segments are merged much less often (merging is what takes most of the time). Once you are done indexing, set the mergeFactor back to something sensible, like 10, so that queries only have to read through a few segments.
Also consider SSDs; some folks have reported better performance with them.
Querying side:
It is nearly impossible to recommend anything without knowing your particular setup and use case, but do monitor cache usage. If a particular cache has a high utilization rate yet still shows evictions, get rid of them by giving it more RAM.
Will the performance of a SQL server drastically degrade if the database is bigger than the RAM? Or does only the index have to fit in the memory? I know this is complex, but as a rule of thumb?
Only the working set, i.e. the commonly or currently used data, needs to fit into the buffer cache (aka data cache). This includes indexes too.
There is also the plan cache, network buffers and other stuff. MS have put a lot of work into memory management in SQL Server and it works well, IMHO.
Generally, more RAM will help but it's not essential.
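If you want to check how much of each database (indexes included) is actually sitting in the buffer cache, something along these lines should work; the driver name and connection string are assumptions, and the query needs VIEW SERVER STATE permission.

```python
# Rough sketch: per-database buffer cache usage on SQL Server via the
# sys.dm_os_buffer_descriptors DMV (each row is one 8 KB page in memory).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes"
)
cur = conn.cursor()
cur.execute("""
    SELECT DB_NAME(database_id) AS db, COUNT(*) * 8 / 1024 AS cached_mb
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY cached_mb DESC
""")
for db, cached_mb in cur.fetchall():
    print(db, cached_mb, "MB cached")
conn.close()
```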
Yes, performance degrades when indexes can't fit in memory or when you do full table scans. Aggregate functions over data that is not in memory will also require many (and possibly random) disk reads.
For some benchmarks:
"Query time will depend significantly on whether the affected data currently resides in memory or disk access is required. For disk intensive operations, the characteristics of the disk sequential and random I/O performance are also important."
http://www.sql-server-performance.com/articles/per/large_data_operations_p7.aspx
Therefore, don't expect the same performance if your DB size > RAM size.
Edit:
http://highscalability.com/ is full of examples like:
"Once the database doesn't fit in RAM you hit a wall."
http://highscalability.com/blog/2010/5/3/mocospace-architecture-3-billion-mobile-page-views-a-month.html
Or here:
"Even if the DB size is just 10% bigger than RAM size this test shows a 2.6 times drop in performance."
http://www.mysqlperformanceblog.com/2010/04/08/fast-ssd-or-more-memory/
Remember, though, that this applies to hot data, i.e. data that you want to query over and cannot cache. If you can cache it, you can easily live with significantly less memory.
All DB operations still have to be backed by writes to disk; having more RAM is helpful, but not essential.
Loading the whole database into RAM is not practical. Databases can be up to terabytes in size these days, and there is little chance that anyone would buy that much RAM. I think performance will still be fine even if the available RAM is one tenth of the size of the database.
I"m looking to run PostgreSQL in RAM for performance enhancement. The database isn't more than 1GB and shouldn't ever grow to more than 5GB. Is it worth doing? Are there any benchmarks out there? Is it buggy?
My second major concern is: how easy is it to back things up when it's running purely in RAM? Is this just like using RAM as a tier-1 hard drive, or is it much more complicated?
It might be worth it if your database is I/O bound. If it's CPU-bound, a RAM drive will make no difference.
But first things first: you should make sure that your database is properly tuned; you can get huge performance gains that way without losing any guarantees. Even a RAM-based database will perform badly if it's not properly tuned. See the PostgreSQL wiki on this, mainly shared_buffers, effective_cache_size, checkpoint_*, and default_statistics_target.
Second, if you want to avoid synchronizing disk buffers on every commit (like codeka explained in his comment), disable the synchronous_commit configuration option. When your machine loses power, this may lose the most recent transactions, but your database will still be 100% consistent. In this mode, RAM is used to buffer all writes, including writes to the transaction log. So with very rare checkpoints and large shared_buffers and wal_buffers, it can actually approach speeds close to those of a RAM drive.
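A minimal sketch of that setup, again via psycopg2; the database name is a placeholder, and the commented postgresql.conf values are illustrative, not recommendations.

```python
# Sketch: relax durability (synchronous_commit) for one session while keeping
# the database consistent. Connection details are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True
cur = conn.cursor()

# Commits no longer wait for the WAL flush; a crash can lose the last few
# transactions, but never corrupt the database.
cur.execute("SET synchronous_commit TO off")
cur.execute("SHOW synchronous_commit")
print(cur.fetchone()[0])   # -> off

# In postgresql.conf you would pair this with (placeholder values):
#   shared_buffers = 1GB
#   wal_buffers = 16MB
#   checkpoint settings raised so checkpoints are rare
cur.close()
conn.close()
```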
Also hardware can make a huge difference. 15000 RPM disks can, in practice, be 3x as fast as cheap drives for database workloads. RAID controllers with battery-backed cache also make a significant difference.
If that's still not enough, then it may make sense to consider turning to volatile storage.
Whether to hold your database in memory depends on size and performance, as well as on how robust you want it to be with writes. I assume you are writing to your database and that you want to persist the data in case of failure.
Personally, I would not worry about this optimization until I ran into performance issues. It just seems risky to me.
If you are doing a lot of reads and very few writes, a cache might serve your purpose; many ORMs come with one or more caching mechanisms.
From a performance point of view, clustering across a network to another DBMS that does all the disk writing seems a lot less efficient than just having a regular DBMS tuned to keep as much as possible in RAM.
Actually... as long as you have enough memory available, your database will already be running fully in RAM: your filesystem will buffer all the data, so a RAM drive won't make much of a difference.
But... there is of course always a bit of overhead, so you can still try running it all from a RAM drive.
As for the backups, that's just like any other database. You could use the normal Postgres dump utilities to back up the system. Or even better, let it replicate to another server as a backup.
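For the dump-based backup, a sketch of scripting it with pg_dump; the database name and output path are placeholders.

```python
# Sketch: periodic logical backup of a RAM-resident database with pg_dump.
import subprocess
from datetime import datetime

outfile = "/backups/mydb-%s.dump" % datetime.now().strftime("%Y%m%d-%H%M")

# -Fc = custom (compressed) format, restorable later with pg_restore.
subprocess.check_call(["pg_dump", "-Fc", "-f", outfile, "mydb"])
print("wrote", outfile)
```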
An in-memory DBMS can be 5 to 40 times faster than a disk-resident one. Check out Gartner's Magic Quadrant for Operational DBMSs 2013.
Gartner shows who is strong and, more importantly, notes severe cautions: bugs, errors, lack of support, and vendors whose products are hard to use.
What is the most efficient solution when you need to record some data on every page view in your application - should you write to a file or write to the database?
Or maybe neither - perhaps you should cache the data in memory or a file and only write it to the database (or file system if you use a memory cache) occasionally?
If it's purely recording a small amount of data with no subsequent lookups, straight file I/O is almost guaranteed to be more efficient. You're losing all the advantages of a DBMS though: indexing, transactional integrity (really, ACID in general), concurrent access, etc.
It almost sounds like you're talking about what amounts to simple logging. If that's the case, and you don't need to do frequent complex queries on the resulting data, you're probably better off with straight file I/O if performance is a serious issue. Be careful of concurrent-write issues, though.
If the properties of an RDBMS are desirable, you might think about using SQLite, which for simplistic loads will get you better performance than most RDBMSs with less overhead, at the cost of some of the benefits (highly concurrent access and availability over the network to other machines are a couple of the "biggies"). It still wouldn't be as fast as straight file I/O in the general case, though.
Your later mention of it being for page view tracking causes me to ask: Are you incrementing a counter, rather than logging data about the page view? If so, I'd strongly suggest going with something like SQLite (doing something like UPDATE tbl SET counter = counter+1). You really don't want to get into the timing issues involved in doing this by hand -- if you don't do it right, you'll start losing counts on simultaneous access (A reads "100", B reads "100", A writes "101", B writes "101"; B should have written 102, but has no way of knowing that).
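A minimal sketch of that counter-in-SQLite idea using Python's built-in sqlite3 module; the table and file names are illustrative. The point is that the whole increment runs inside one transaction, so concurrent requests cannot lose counts the way the hand-rolled read/write above does.

```python
import sqlite3

conn = sqlite3.connect("pageviews.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS hits (page TEXT PRIMARY KEY, counter INTEGER NOT NULL)"
)

def record_view(page):
    # "with conn" wraps both statements in a single transaction.
    with conn:
        conn.execute("INSERT OR IGNORE INTO hits (page, counter) VALUES (?, 0)", (page,))
        conn.execute("UPDATE hits SET counter = counter + 1 WHERE page = ?", (page,))

record_view("/index.html")
print(conn.execute("SELECT counter FROM hits WHERE page = ?", ("/index.html",)).fetchone()[0])
```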
Conceptually, writing to the database is always slower than writing to a file.
The database has to write to a file too, plus there is the extra overhead of communicating the data to the database before it can write it to that file. Therefore, it must be slower.
That said, databases do disk I/O very well, probably better than you will. Don't be surprised if you find that a simple file logger is slower than writing to a database. The database has a lot of I/O optimizations and some tricks available that you may not (depending on your web language and environment).
Don't be surprised if the answer changes over time. When your site is small, logging to a database is very fast. As your site grows, the logging table can become a major pain: It uses a lot of disk space, makes the backups take forever, and consumes all the disk I/O when you try to query it. This is why you should benchmark both methods yourself. Then you can re-test in the future, when conditions change.
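If you do benchmark, even a rough comparison like the sketch below is enough to see the shape of the trade-off; SQLite stands in for "the database" here, and the absolute numbers mean nothing outside your own setup.

```python
# Crude benchmark: append N log lines to a file vs. insert them one commit at
# a time into a SQLite table. Committing per row is deliberately the worst case.
import sqlite3
import time

N = 10000
line = "GET /index.html 200 0.012s\n"

t0 = time.perf_counter()
with open("access.log", "a") as f:
    for _ in range(N):
        f.write(line)
file_secs = time.perf_counter() - t0

conn = sqlite3.connect("log.db")
conn.execute("CREATE TABLE IF NOT EXISTS log (entry TEXT)")
t0 = time.perf_counter()
for _ in range(N):
    with conn:   # one transaction (and one commit) per insert
        conn.execute("INSERT INTO log (entry) VALUES (?)", (line,))
db_secs = time.perf_counter() - t0

print("file: %.3fs  db: %.3fs" % (file_secs, db_secs))
```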
Hitting the database is most likely going to be more expensive than writing to a file.
If your page views per second are high, and if the data doesn't need to be available in the database right away, then writing to a file and periodically loading the data into the DB will be the better solution.
However it all depends on the nature of the data you're recording per page view and how critical it is to whatever business function it serves.
That highly depends on your needs for data safety. If you can afford to lose some data in case of a crash then keeping the data in memory and writing it periodically to a persistent store is certainly the most efficient way to go.
Edit: You mentioned pageviews. In that case I would keep the counters in memory and periodically update a database table (like every minute or so).
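A sketch of that count-in-memory-and-flush approach; the page_views table, the one-minute interval, and SQLite as the store are all assumptions, and anything counted since the last flush is lost if the process dies.

```python
import sqlite3
import threading
import time
from collections import Counter

counts = Counter()
lock = threading.Lock()

def record_view(page):
    # Called from the request handlers; just bumps an in-memory counter.
    with lock:
        counts[page] += 1

def flush_forever(db_path="pageviews.db", interval=60):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views (page TEXT PRIMARY KEY, counter INTEGER NOT NULL)"
    )
    while True:
        time.sleep(interval)
        with lock:
            pending = dict(counts)
            counts.clear()
        with conn:   # one transaction per flush
            for page, n in pending.items():
                conn.execute("INSERT OR IGNORE INTO page_views (page, counter) VALUES (?, 0)", (page,))
                conn.execute("UPDATE page_views SET counter = counter + ? WHERE page = ?", (n, page))

# Background flusher; web code only ever calls record_view(path).
threading.Thread(target=flush_forever, daemon=True).start()
```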
That depends.
And it really does: it depends on the DBMS and/or the OS+filesystem you use. In other words, your mileage may vary.
If you just append data somewhere, a modern DBMS or OS+filesystem should handle it equally well and fast. Problems arise when you want to change existing data.
Caching also depends on what granularity you can afford: do you need every step logged crash-safe, or can you trade that for the potential savings?
Use a hybrid solution like Redis; it's designed for this sort of thing.
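For example, with the redis-py client (host/port and key names are assumptions), INCR gives you an atomic in-memory counter that you can drain into the database later:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def record_view(page):
    r.incr("pageviews:" + page)     # atomic: no read-modify-write race

record_view("/index.html")
print(r.get("pageviews:/index.html"))   # e.g. b'1'
```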
I have run a defrag report on a SQL Server and the MDF file for the database is being reported as 99% fragmented. I asked a colleague and they said this may be normal, as it is just the way that type of file works internally.
Is it an issue, and would running a disk defrag on this file improve DB performance? It is a pain to do because it requires a lot of free disk space (it is a huge file), so I want to be sure it is necessary.
Update
Another linked question leading on from the answers: how would a striped RAID configuration affect this? If the data is striped over multiple disks, would defragmentation be less of an issue?
As long as you have determined that, in fact, you have a database response problem this would be a good candidate for improvement. Presumably you've at least determined that there is a problem to be solved, and that it's in the database. (i.e. you've tested queries which are slower than you would like; or you see locks; etc.).
Otherwise it starts to look like guessing and trying. Caching naturally compensates for a lot of this depending on your query patterns.
Short answer: yes, a defrag would help.
Clustered indexes try to organize the index such that only sequential reads will be required. This is great if the .MDF is contiguous on disk. Not so much if the file is spread out all over the disk.
Also, even though SQL Server allocates in 8 KB pages, having to move the disk head all over the place leads to much slower access times. You really want the file to be contiguous.
A good way to avoid this is to not rely on autogrow. Give the MDF room to breathe.
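If you go that route, here is a sketch of pre-sizing the data file with T-SQL (run here through pyodbc); the database name, sizes, and driver string are placeholders, and you need ALTER permission on the database.

```python
# Sketch: grow the MDF once, in one big chunk, and set a fixed growth increment
# instead of relying on many small autogrow steps.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# Look up the logical name of the data file (it rarely matches the .mdf filename).
cur.execute(
    "SELECT name FROM sys.master_files "
    "WHERE DB_NAME(database_id) = 'MyDb' AND type_desc = 'ROWS'"
)
logical_name = cur.fetchone()[0]

cur.execute(
    "ALTER DATABASE [MyDb] MODIFY FILE (NAME = N'%s', SIZE = 20GB, FILEGROWTH = 1GB)"
    % logical_name
)
conn.close()
```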