Databases are the storage layer for most applications. Our company also runs a lot of calculations and data manipulations on that data on a daily basis.
As we accumulate more and more data, data generation has become an issue because it takes too long. I think it could make sense to split the database into at least two:
for storing data with focus on read/write performance;
for calculations with focus on data aggregation performance.
Does anybody have similar experience and can tell me whether this idea is good, and what the design differences would be for the two points mentioned?
Maybe it is worth looking at a NoSQL solution for the calculations, e.g. in-memory databases?
it can make sense to separate database to at least two
If the databases are on different disks (with different spindles), it may help. Otherwise you gain nothing, because disk I/O is shared between the databases.
For best practices, read Storage Top 10 Best Practices.
Maybe it is worth to look for noSQL solution for calculating data e.g. in-memory databases?
There is no need to move to a NoSQL solution; you can use in-memory tables.
In-Memory OLTP can significantly improve the performance of transaction processing, data load and transient data scenarios.
For more details, see In-Memory OLTP (In-Memory Optimization).
Other Strategies
1) Tune tempdb
Tempdb is shared by all databases and is heavily used in calculations.
A pragmatic approach is to have a 1:1 mapping between tempdb data files and logical CPUs (cores), up to eight.
For more details, see SQL Server TempDB Usage, Performance, and Tuning Tips.
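The 1:1 rule with a cap of eight can be written down as a one-line calculation. This is only a sketch of the rule of thumb stated above; the function name `tempdb_data_file_count` is made up for illustration and is not part of any SQL Server tooling.

```python
def tempdb_data_file_count(logical_cpus: int, cap: int = 8) -> int:
    """One tempdb data file per logical CPU, capped at `cap` (eight in the rule above)."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return min(logical_cpus, cap)


for cpus in (2, 4, 8, 16, 32):
    print(f"{cpus:>2} logical CPUs -> {tempdb_data_file_count(cpus)} tempdb data files")
```

Treat the result as a starting point and adjust it after measuring contention on your own workload.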
2) Evaluate the Page Life Expectancy (PLE) counter and act on it
To evaluate the data cache, run the following query:
SELECT [object_name],
       [counter_name],
       [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
  AND [counter_name] = 'Page life expectancy';
The recommended value of the PLE counter (in seconds) is greater than the following, with the memory expressed in GB:
total_memory_dedicated_for_sql_server / 4 * 300
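As a worked example of the formula, the sketch below computes the threshold and compares it with an observed counter value. It assumes the memory figure is expressed in GB, which is the usual reading of this rule of thumb; the function names and the sample numbers are illustrative only.

```python
def recommended_min_ple_seconds(sql_server_memory_gb: float) -> float:
    """Threshold from the rule of thumb: memory (assumed to be in GB) / 4 * 300."""
    return sql_server_memory_gb / 4 * 300


def ple_is_healthy(observed_ple_seconds: float, sql_server_memory_gb: float) -> bool:
    """True when the observed PLE counter is above the recommended threshold."""
    return observed_ple_seconds > recommended_min_ple_seconds(sql_server_memory_gb)


# Hypothetical server with 64 GB dedicated to SQL Server and an observed PLE of 3000 s.
threshold = recommended_min_ple_seconds(64)
print(f"threshold: {threshold:.0f} s, healthy: {ple_is_healthy(3000, 64)}")
```

Under these assumptions, a server with 64 GB needs a PLE above 4800 seconds, so an observed value of 3000 would suggest memory pressure.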
Page Life Expectancy is the number of seconds a page is expected to stay in the buffer pool without being referenced. Put simply, the longer pages stay in the buffer pool (the in-memory data cache), the higher your PLE and the better the performance, because each request is more likely to find its data in the cache instead of reading it from disk.
If PLE isn't high enough, increase memory and tune indexes and statistics.
3) Use SSD disks
With the cost of solid-state disks (SSDs) going down, consider using SSDs as a second tier of cache.
4) Use RAID 5 for the databases and RAID 10 for the transaction logs and tempdb.
In general, the name of the game in SQL Server tuning is moving data from disk (low speed) to cache (memory, high speed).
Increase memory and improve disk I/O speed, and you gain performance.
I'm about to write an application for Android, and it will use MySQL.
I know that database access is really expensive in terms of time, and I would like to know how often applications like instant messaging and online games access their databases.
For example, in a game we would like to save the position of a player in the world while he is moving all the time.
Is database access actually not expensive, and is there a way to stay connected all the time and only make requests that are cheap?
Or is it really expensive anyway, and are there techniques such as accessing it every X interval of time and saving data locally in the meantime?
I know that my question is really general, and that it always depends on what we need and want.
My question came up because I made a really simple login application that connects and makes one request to the database, and it takes 1 second (a lot!!) to get the result. So how can online applications be so fast?
Thank you
Before answering this, I would recommend simulating the process as closely as possible and benchmarking it, so you can work towards the best solution for your use case.
E.g. if I have an application submitting data to a database, I simulate the submission so I can easily run multiple submissions at the same time and see what the bottleneck is, and see how it compares when I use caching, replication, indexes, etc.
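A simulation like this does not need much code. The sketch below is one possible harness, not a standard tool: `run_benchmark` and `fake_submit` are hypothetical names, and `fake_submit` only sleeps for a millisecond as a stand-in for a real database call that you would plug in yourself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_benchmark(submit, total_requests: int, concurrency: int) -> dict:
    """Call `submit(i)` total_requests times from `concurrency` threads, timing each call."""

    def timed_call(i: int) -> float:
        start = time.perf_counter()
        submit(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "wall_seconds": wall_seconds,
        "mean_latency": mean(latencies),
        "max_latency": max(latencies),
    }


def fake_submit(i: int) -> None:
    # Hypothetical stand-in for a real submission; replace with your own database call.
    time.sleep(0.001)


if __name__ == "__main__":
    for workers in (1, 4, 16):
        print(workers, run_benchmark(fake_submit, total_requests=200, concurrency=workers))
```

Running the same harness with and without a cache, an index or a replica in front of the real call gives numbers you can compare directly.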
Reading company blogs can also be helpful, as they often share success stories that support the use of a particular approach.
How expensive is access to database?
Accessing a database can be a pretty quick operation:
SELECT 1; -- 0.005 secs :D
However, there are situations that can lead to poor performance (slow reads, writes and updates), but there are some relatively simple ways to combat this.
Indexes
The best way to improve the performance of SELECT operations is to
create indexes on one or more of the columns that are tested in the
query. The index entries act like pointers to the table rows, allowing
the query to quickly determine which rows match a condition in the
WHERE clause, and retrieve the other column values for those rows.
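To see the effect of an index without setting up a server, the sketch below uses SQLite from the Python standard library as a stand-in for MySQL (MySQL has its own `EXPLAIN` statement with a different output format). The table, column and index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(10_000)],
)

query = "SELECT id, email FROM users WHERE name = ?"


def query_plan(sql: str, params: tuple) -> str:
    """Return SQLite's description of how it would execute `sql`, as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan detail


plan_before = query_plan(query, ("user42",))
conn.execute("CREATE INDEX idx_users_name ON users (name)")
plan_after = query_plan(query, ("user42",))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search that uses `idx_users_name`, which is the pointer-like lookup the quoted passage describes.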
Replication
spreading the load among multiple slaves to improve performance. In
this environment, all writes and updates must take place on the master
server. Reads, however, may take place on one or more slaves. This
model can improve the performance of writes (since the master is
dedicated to updates), while dramatically increasing read speed across
an increasing number of slaves.
How often do we access it?
If you are solely using a database, you will access it every time you need to save a player's position and every time you need to find out their position.
This is where you would explore options to avoid accessing the database:
Memory caches such as Redis or Memcached
Replication - only read from slaves
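For the player-position case, one way to avoid hitting the database on every move is to keep the latest positions in memory and write them out periodically. The sketch below illustrates that idea with a plain dictionary and SQLite; in practice the cache might be Redis or Memcached and the database MySQL. The class `PositionStore` and its methods are hypothetical.

```python
import sqlite3


class PositionStore:
    """Serve reads from memory and write changed positions to the database in flushes."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS positions "
            "(player_id INTEGER PRIMARY KEY, x REAL, y REAL)"
        )
        self.cache = {}     # player_id -> (x, y), answers every read
        self.dirty = set()  # players whose position changed since the last flush

    def move(self, player_id: int, x: float, y: float) -> None:
        self.cache[player_id] = (x, y)
        self.dirty.add(player_id)

    def position(self, player_id: int):
        return self.cache.get(player_id)

    def flush(self) -> int:
        """Write all changed positions in one transaction; return how many were written."""
        rows = [(pid, *self.cache[pid]) for pid in self.dirty]
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO positions (player_id, x, y) VALUES (?, ?, ?)",
                rows,
            )
        self.dirty.clear()
        return len(rows)


store = PositionStore(sqlite3.connect(":memory:"))
for step in range(1000):      # 1000 moves by the same player...
    store.move(1, float(step), 0.0)
written = store.flush()       # ...end up as a single row written to the database
print("rows written:", written, "current position:", store.position(1))
```

The trade-off is durability: anything that changed since the last flush is lost if the process crashes, so the flush interval should match how much movement you can afford to lose.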
It depends on your design and requirements.
1) Most applications use connection pools to minimize connection initialization time.
2) Most ORM frameworks have an external cache to improve read performance. So if your application does heavy data reading, don't worry about storing it locally; the cache will be effective in this case.
3) Storing data locally, in a file or some other format, also adds its own performance delay.
4) If you keep the data in primary memory, then obviously game performance will be better. That's why gamers prefer high-end graphics cards and lots of RAM.
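The connection pool from point 1 can be illustrated in a few lines. This is a deliberately minimal sketch using SQLite and a queue; real drivers and frameworks ship their own pools, and the class name `ConnectionPool` here is only illustrative.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections once and hand them out on request."""

    def __init__(self, size: int, database: str = ":memory:") -> None:
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used from another thread.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()   # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # give it back instead of closing it


pool = ConnectionPool(size=1)
with pool.connection() as first:
    answer = first.execute("SELECT 1").fetchone()[0]
with pool.connection() as second:
    reused = second is first      # the same open connection is handed out again
print("answer:", answer, "connection reused:", reused)
```

The cost of connecting is paid once at start-up, so each later request only pays for the query itself.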
Most databases offer batch insertions. Even a small overhead accumulates if you open too many connections over time, and single insertions carry more overhead than a batch. The only question is how often. You should test how often you want to insert and how much information you should store locally before doing a batch insertion.
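A quick way to run that test is to time both approaches on the same data. The sketch below uses an in-memory SQLite database, so the absolute numbers will not match a networked server; the table name, row count and batch size are arbitrary choices for the example.

```python
import sqlite3
import time

rows = [(i, f"event-{i}") for i in range(20_000)]
INSERT_SQL = "INSERT INTO events (id, payload) VALUES (?, ?)"


def fresh_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def insert_one_by_one(conn: sqlite3.Connection) -> float:
    start = time.perf_counter()
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()             # one transaction per row
    return time.perf_counter() - start


def insert_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> float:
    start = time.perf_counter()
    for offset in range(0, len(rows), batch_size):
        with conn:                # one transaction per batch
            conn.executemany(INSERT_SQL, rows[offset:offset + batch_size])
    return time.perf_counter() - start


single_conn, batch_conn = fresh_db(), fresh_db()
single_seconds = insert_one_by_one(single_conn)
batch_seconds = insert_in_batches(batch_conn)
print(f"row by row: {single_seconds:.3f} s, batched: {batch_seconds:.3f} s")
```

Varying `batch_size` against your real database shows where larger batches stop paying off, which answers the "how often" question empirically.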
We're working on an application that's going to serve thousands of users daily (90% of them will be active during the working hours, using the system constantly during their workday). The main purpose of the system is to query multiple databases and combine the information from the databases into a single response to the user. Depending on the user input, our query load could be around 500 queries per second for a system with 1000 users. 80% of those queries are read queries.
Now, I did some profiling using the SQL Server Profiler tool and I get on average ~300 logical reads for the read queries (I did not bother with the write queries yet). That would amount to 150k logical reads per second for 1k users. The full production system is expected to have ~10k users.
How do I estimate the actual read requirement on the storage for those databases? I am pretty sure that actual physical reads will amount to much less than that, but how do I estimate that? Of course, I can't do an actual run in the production environment as the production environment is not there yet, and I need to tell the hardware guys how many IOPS we're going to need for the system so that they know what to buy.
I tried the HP sizing tool suggested in the previous answers, but it only suggests HP products, without actual performance estimates. Any insight is appreciated.
EDIT: The main read-only dataset (where most of the queries will go) is a couple of GB (on the order of 4 GB) on disk. This will probably significantly affect the ratio of logical to physical reads. Any insight into how to estimate this ratio?
Disk I/O demand varies tremendously based on many factors, including:
How much data is already in RAM
Structure of your schema (indexes, row width, data types, triggers, etc)
Nature of your queries (joins, multiple single-row vs. row range, etc)
Data access methodology (ORM vs. set-oriented, single command vs. batching)
Ratio of reads vs. writes
Disk (database, table, index) fragmentation status
Use of SSDs vs. rotating media
For those reasons, the best way to estimate production disk load is usually by building a small prototype and benchmarking it. Use a copy of production data if you can; otherwise, use a data generation tool to build a similarly sized DB.
With the sample data in place, build a simple benchmark app that produces a mix of the types of queries you're expecting. Scale memory size if you need to.
Measure the results with Windows performance counters. The most useful stats are for the Physical Disk: time per transfer, transfers per second, queue depth, etc.
You can then apply some heuristics (also known as "experience") to those results and extrapolate them to a first-cut estimate for production I/O requirements.
If you absolutely can't build a prototype, then it's possible to make some educated guesses based on initial measurements, but it still takes work. For starters, turn on statistics:
SET STATISTICS IO ON
Before you run a test query, clear the RAM cache:
CHECKPOINT
DBCC DROPCLEANBUFFERS
Then, run your query, and look at physical reads + read-ahead reads to see the physical disk I/O demand. Repeat in some mix without clearing the RAM cache first to get an idea of how much caching will help.
Having said that, I would recommend against using IOPS alone as a target. I realize that SAN vendors and IT managers seem to love IOPS, but they are a very misleading measure of disk subsystem performance. As an example, there can be a 40:1 difference in deliverable IOPS when you switch from sequential I/O to random.
You certainly cannot derive your estimates from logical reads. This counter really is not that helpful because it is often unclear how much of it is physical and also the CPU cost of each of these accesses is unknown. I do not look at this number at all.
You need to gather virtual file stats which will show you the physical IO. For example: http://sqlserverio.com/2011/02/08/gather-virtual-file-statistics-using-t-sql-tsql2sday-15/
Google for "virtual file stats sql server".
Please note that you can only extrapolate I/Os from the user count if you assume that the cache hit ratio of the buffer pool will stay the same. Estimating this is much harder. Basically you need to estimate the working set of pages you will have under full load.
If you can ensure that your buffer pool can always hold all hot data, you can basically live without any physical reads. Then you only have to scale writes (for example with an SSD drive).
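The extrapolation from user count is simple arithmetic once a physical read rate has been measured. The sketch below scales a measured rate linearly, which is only valid under the assumption stated above that the cache hit ratio stays the same. The measured figure of 1200 reads per second is invented for illustration; the user counts come from the question.

```python
def extrapolate_read_iops(measured_reads_per_sec: float,
                          measured_users: int,
                          target_users: int) -> float:
    """Scale measured physical reads/s linearly with the number of users.

    Only meaningful if the buffer pool's cache hit ratio is the same at both
    user counts, which is an assumption, not something this function checks.
    """
    if measured_users <= 0:
        raise ValueError("measured_users must be positive")
    return measured_reads_per_sec * target_users / measured_users


# Hypothetical measurement: 1200 physical reads/s from virtual file stats at 1,000 users.
estimate = extrapolate_read_iops(1200, measured_users=1_000, target_users=10_000)
print(f"first-cut estimate at 10,000 users: {estimate:.0f} physical reads/s")
```

If the working set outgrows the buffer pool at the higher user count, the real figure will be worse than this linear estimate.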
Sorry that the title isn't exactly obvious, but I couldn't word it better.
We are currently using a conventional DB (Oracle) as our job queue, and these "jobs" are consumed by a number of nodes (machines). So the DB server gets hit by these nodes, and we have to pay a lot for the software and hardware for this database server.
Now, it occurred to me the other day that:
1) There are already multiple nodes in the system
2) "Jobs" must not be lost because of node failures, but there is no reason they have to sit in secondary storage (they could reside in memory, as long as they are not lost)
Given this, couldn't one keep these jobs in memory, making sure that at least n copies of each job are present in the cluster, and thereby get rid of the DB server?
Are such technologies available?
Did you take a look at GigaSpaces? At internet scale, you do not need to persist at all. You just have to know that sufficient copies are around. If you have low-latency connections to places that are not on the same power grid (or that have battery power), pushing your transactions out to the duplicates is enough.
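The "sufficient copies" idea can be modelled in a few lines. The sketch below is a toy simulation inside a single process and has nothing to do with the API of GigaSpaces or any other product: each "node" is just a dictionary, and the class `ReplicatedJobStore` with its placement scheme is invented for illustration.

```python
import hashlib


class ReplicatedJobStore:
    """Toy model: every job is held in the memory of `copies` distinct nodes."""

    def __init__(self, node_names, copies: int) -> None:
        if copies > len(node_names):
            raise ValueError("need at least as many nodes as copies")
        self.nodes = {name: {} for name in node_names}  # node -> {job_id: payload}
        self.copies = copies

    def _replica_nodes(self, job_id: str):
        # Deterministic placement: hash the job id to a start node, then take neighbours.
        ordered = sorted(self.nodes)
        start = int(hashlib.sha256(job_id.encode()).hexdigest(), 16) % len(ordered)
        return [ordered[(start + i) % len(ordered)] for i in range(self.copies)]

    def submit(self, job_id: str, payload: str) -> None:
        for node in self._replica_nodes(job_id):
            self.nodes[node][job_id] = payload

    def replica_count(self, job_id: str) -> int:
        return sum(job_id in jobs for jobs in self.nodes.values())

    def fail_node(self, name: str) -> None:
        """Remove a node, then restore the copy count from the surviving replicas."""
        lost_job_ids = set(self.nodes.pop(name))  # the payloads on that node are gone
        for job_id in lost_job_ids:
            holders = [n for n in sorted(self.nodes) if job_id in self.nodes[n]]
            if not holders:
                continue                          # every copy was lost
            payload = self.nodes[holders[0]][job_id]
            spare = [n for n in sorted(self.nodes) if n not in holders]
            for node in spare[: self.copies - len(holders)]:
                self.nodes[node][job_id] = payload


store = ReplicatedJobStore(["a", "b", "c", "d"], copies=2)
for i in range(100):
    store.submit(f"job-{i}", f"payload-{i}")
store.fail_node("a")
store.fail_node("b")
print(all(store.replica_count(f"job-{i}") == 2 for i in range(100)))
```

A real system also has to handle network partitions, simultaneous failures and consistent ordering of job consumption, which is where the products mentioned in this thread earn their keep.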
If you're only looking at storing up to a few terabytes of data, and you're looking for redundancy vs. disk recoverability, then take a look at Oracle Coherence. For example:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
It depends on how much you expect these technologies to do for you. There are loads of basic in-memory databases (SQLite, Redis, etc.), and you can use normal database replication techniques with multiple slaves in multiple data centers to pretty much ensure durability without persistence.
If you're storing everything in memory, you're likely to run out of space and need horizontal partitioning (sharding), and you may want to check out something like VoltDB if you want to stick with SQL.
Will the performance of a SQL server drastically degrade if the database is bigger than the RAM? Or does only the index have to fit in the memory? I know this is complex, but as a rule of thumb?
Only the working set or common data or currently used data needs to fit into the buffer cache (aka data cache). This includes indexes too.
There is also the plan cache, network buffers and other things. MS have put a lot of work into memory management in SQL Server and it works well, IMHO.
Generally, more RAM will help but it's not essential.
Yes, when indexes can't fit in memory or when doing full table scans. Running aggregate functions over data that is not in memory will also require many (and possibly random) disk reads.
For some benchmarks:
Query time will depend significantly
on whether the affected data currently
resides in memory or disk access is
required. For disk intensive
operations, the characteristics of the
disk sequential and random I/O
performance are also important.
http://www.sql-server-performance.com/articles/per/large_data_operations_p7.aspx
Therefore, don't expect the same performance if your DB size is larger than your RAM size.
Edit:
http://highscalability.com/ is full of examples like:
Once the database doesn't fit in RAM you hit a wall.
http://highscalability.com/blog/2010/5/3/mocospace-architecture-3-billion-mobile-page-views-a-month.html
Or here:
Even if the DB size is just 10% bigger than RAM size this test shows a 2.6 times drop in performance.
http://www.mysqlperformanceblog.com/2010/04/08/fast-ssd-or-more-memory/
Remember, though, that this applies to hot data: data that you want to query and cannot cache. If you can cache it, you can easily live with significantly less memory.
All DB operations have to be backed by writes to disk, so having more RAM is helpful but not essential.
Loading the whole database into RAM is not practical. Databases can be up to terabytes in size these days, and there is little chance that anyone would buy that much RAM. I think performance will be acceptable even if the available RAM is one tenth of the size of the database.
I'm looking to run PostgreSQL in RAM for performance enhancement. The database isn't more than 1GB and shouldn't ever grow to more than 5GB. Is it worth doing? Are there any benchmarks out there? Is it buggy?
My second major concern is: how easy is it to back things up when it's running purely in RAM? Is this just like using RAM as a tier 1 HD, or is it much more complicated?
It might be worth it if your database is I/O bound. If it's CPU-bound, a RAM drive will make no difference.
But first things first: you should make sure that your database is properly tuned. You can get huge performance gains that way without losing any guarantees. Even a RAM-based database will perform badly if it's not properly tuned. See the PostgreSQL wiki on this, mainly shared_buffers, effective_cache_size, checkpoint_*, default_statistics_target.
Second, if you want to avoid synchronizing disk buffers on every commit (like codeka explained in his comment), disable the synchronous_commit configuration option. If your machine loses power, you will lose the most recent transactions, but your database will still be 100% consistent. In this mode, RAM will be used to buffer all writes, including writes to the transaction log. So with very rare checkpoints and large shared_buffers and wal_buffers, it can actually approach the speed of a RAM drive.
Also hardware can make a huge difference. 15000 RPM disks can, in practice, be 3x as fast as cheap drives for database workloads. RAID controllers with battery-backed cache also make a significant difference.
If that's still not enough, then it may make sense to consider turning to volatile storage.
Whether to hold your database in memory depends on size and performance, as well as on how robust you want it to be with writes. I assume you are writing to your database and that you want to persist the data in case of failure.
Personally, I would not worry about this optimization until I ran into performance issues. It just seems risky to me.
If you are doing a lot of reads and very few writes, a cache might serve your purpose. Many ORMs come with one or more caching mechanisms.
From a performance point of view, clustering across a network to another DBMS that does all the disk writing seems a lot less efficient than running a regular DBMS tuned to keep as much in RAM as you want.
Actually... as long as you have enough memory available, your database will already be running fully in RAM. Your filesystem will buffer all the data, so it won't make much of a difference.
But... there is of course always a bit of overhead, so you can still try running it all from a RAM drive.
As for the backups, that's just like any other database. You could use the normal Postgres dump utilities to back up the system. Or even better, let it replicate to another server as a backup.
In-memory databases can be 5 to 40 times faster than a disk-resident DBMS. Check out Gartner's Magic Quadrant for Operational DBMSs 2013.
Gartner shows which vendors are strong and, more importantly, notes severe cautions about vendors: bugs, errors, lack of support and products that are hard to use.