The situation
We are planning to migrate our entire production DWH to a new cluster. One of the requirements is that the new cluster is at least as fast as the current one. This calls for performance testing, with the current cluster serving as a baseline.
When conducting these tests we want both environments to be near identical in terms of behaviour.
I can already clone the user behaviour from the live production cluster and execute it on the new cluster. That leaves the cache to be tackled.
The catch
Since we are going to compare this new cluster to the live production environment, I can't simply clear the cache of both servers. Clearing the cache of the new cluster would be possible, since it isn't in production yet. However, I am not going to clear the cache of the live production cluster, since it is still being used and clearing it would have a big impact on performance.
I was wondering if it would be possible to clone/mimic the cache between the two clusters.
I'm also open to an entirely different approach to this matter.
I think you are going about this the wrong way, and here is why. I assume the following:
The new cluster's hardware is of the same vendor, quality, etc., as the previous one
The cores/CPUs, RAM, etc. are as good or better on the new instance
The instance is on the same version, or an upgraded version, of SQL Server. Note that upgrading doesn't mean the queries will perform better in all cases
The storage is the same, or better (SAN / NAS configurations)
The server settings are the same (MAXDOP, etc)
If these aren't true, then I don't see the point of conducting the test anyway, since it wouldn't be comparing apples to apples. With that being said, I still don't see how the tests would be equal even if you could mirror the plan cache. You could create a brand new query to use for performance testing and run it on both instances to compare their performance (it would get a fresh plan), but here's a big catch: you aren't going to kick all the users off the production instance, so your baseline query will contend for resources. Unless you have an identical mirror of your production server that no users are touching, I don't see how you're going to get an unbiased test.
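If you do go the baseline-query route, here is a minimal sketch of a timing harness (assuming Python with pyodbc; the connection strings, table, and query are placeholders, not your actual schema):

    import time
    import pyodbc

    # Hypothetical baseline query - substitute something representative.
    QUERY = "SELECT COUNT(*) FROM dbo.SomeLargeTable WHERE SomeColumn > 42"

    def time_query(conn_str, runs=5):
        conn = pyodbc.connect(conn_str)
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            conn.cursor().execute(QUERY).fetchall()
            timings.append(time.perf_counter() - start)
        conn.close()
        return min(timings), sum(timings) / len(timings)

    for name, conn_str in [("old", "DSN=OldCluster"), ("new", "DSN=NewCluster")]:
        best, avg = time_query(conn_str)
        print("%s cluster: best %.3fs, avg %.3fs" % (name, best, avg))

Keep the caveat above in mind: on the live cluster these numbers include contention from real users, so treat them as indicative, not definitive.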
With all that being said, most often you are upgrading to faster, better hardware, so one could feel safe that it will be faster, or at least not slower, assuming equal configurations. Additionally, there are tons of performance tuning blogs out there from Pinal Dave, Paul White, Brent Ozar, Paul Randal, Aaron Bertrand, etc., ranging from optimal server settings to query tuning. That alone could make a night and day difference in performance, along with proper DB maintenance (fixing index fragmentation, fixing queries with hundreds or thousands of plans that only get used once, fixing indexing in general, etc.).
Related
My app currently connects to an RDS Multi-AZ database. I also have a Single-AZ read replica used to serve my analytics portal.
Recently there has been increasing load on my master database, and I am thinking about how to resolve this situation without having to scale up my database again. The two ways I have in mind are:
Move all the read queries from my app to the read-replica, and just scale up the read-replica, if necessary.
Implement ElastiCache Memcached.
To me these two options seem to achieve the same outcome, which is to reduce load on my master database, but I am thinking I may have misunderstood some fundamentals, because Google doesn't seem to return any results comparing the two.
In terms of load, they have the same goal, but they differ in other areas:
Up-to-dateness of data:
A read replica will continuously sync from the master, so your results will probably lag 0-3 s behind the master (depending on the load).
A cache takes the query result at a specific point in time and stores it for a certain amount of time. The longer your queries are being cached, the more lag you'll have; but your master database will experience less load. It's a trade-off you'll need to choose wisely depending on your application.
Performance / query features:
A cache can only return results for queries it has already seen. So if you run the same queries over and over again, it's a good match. Note that queries must not contain changing parts like NOW(), but must be identical in terms of the actual data to be fetched (the sketch below builds its cache key exactly this way).
If you have many different, frequently changing, or dynamic (NOW(),...) queries, a read replica will be a better match.
ElastiCache should be much faster, since it's returning values directly from RAM. However, this also limits the number of results you can store.
So you'll first need to evaluate how outdated your data can be and how cacheable your queries are. If you're using ElastiCache, you might be able to cache more than queries, like caching whole sections of a website instead of only the underlying queries, which should reduce the overall load of your application.
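To make the trade-off concrete, here is a minimal cache-aside sketch (assuming Python with the pymemcache client pointed at an ElastiCache Memcached endpoint; the hostname is a placeholder):

    import hashlib
    import json
    from pymemcache.client.base import Client

    cache = Client(("my-cache.example.com", 11211))
    TTL_SECONDS = 60  # the amount of staleness you are willing to accept

    def cached_query(db_conn, sql, params=()):
        # The key must be deterministic: same SQL text + same params = same key.
        # Anything like NOW() embedded in the SQL would defeat this.
        key = hashlib.sha1((sql + repr(params)).encode()).hexdigest()
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)  # served from RAM; the master is untouched
        cur = db_conn.cursor()
        cur.execute(sql, params)
        rows = [list(r) for r in cur.fetchall()]
        cache.set(key, json.dumps(rows), expire=TTL_SECONDS)
        return rows

Raising TTL_SECONDS reduces load on the master but increases the lag described above; that's the knob you tune per query.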
PS: Have you tuned your indexes? If your main problem is writes, that won't help. But if you are fighting reads, indexes are the #1 thing to check, and they do make a huge difference.
Using App Engine with Python and the HRD, retrieving records sequentially (via an indexed field that is an incrementing integer timestamp), we get 15,000 records returned in 30-45 seconds. (Batching and limiting are used.) I did experiment with running queries on two instances in parallel but still achieved the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.
Try splitting the records across different entity groups. That might force them onto different physical servers. Read the entity groups in parallel from multiple threads or instances; a rough sketch follows.
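A rough sketch of that idea (assuming the Python NDB API; the model and shard keys are hypothetical stand-ins for your actual records):

    from google.appengine.ext import ndb

    NUM_SHARDS = 4

    class LogRecord(ndb.Model):
        ts = ndb.IntegerProperty()  # the incrementing integer timestamp

    def shard_key(i):
        # The parent key defines the entity group.
        return ndb.Key('LogShard', 'shard-%d' % i)

    def write_record(ts_value):
        # Spread writes across entity groups so they can land on different servers.
        LogRecord(parent=shard_key(ts_value % NUM_SHARDS), ts=ts_value).put()

    def read_all_shards():
        # Issue all shard queries concurrently, then gather the results.
        futures = [LogRecord.query(ancestor=shard_key(i)).fetch_async()
                   for i in range(NUM_SHARDS)]
        records = [rec for f in futures for rec in f.get_result()]
        return sorted(records, key=lambda rec: rec.ts)  # restore global order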
Using a cache might not work well for large tables.
Maybe you can cache your records, e.g. using Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application's access. I don't think App Engine Datastore is designed for speed, but for scalability. Memcache, however, is.
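A minimal example of the pattern with App Engine's memcache (the key, TTL, and the query function are placeholders for your own code):

    from google.appengine.api import memcache

    def expensive_datastore_query():
        return []  # stand-in for your existing Datastore query

    def get_records(cache_key='recent-log-records'):
        records = memcache.get(cache_key)
        if records is None:  # cache miss: hit the Datastore once, then store
            records = expensive_datastore_query()
            memcache.set(cache_key, records, time=300)  # cache for 5 minutes
        return records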
BTW, if you are concerned about the performance GAE gives you for what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have active community support. I'm using CapeDwarf in my local environment; it is still in beta, but it works.
Move to one of the in-memory databases. If you have Oracle Database, using TimesTen will improve the throughput many times over.
We have a website that uses NHibernate and a second-level cache. We are having a debate, as one person wants to turn off the second-level cache because we are moving to a multi-webserver environment (with a load balancer in front).
One argument is to get rid of the second-level cache and focus on optimizing and tuning the DB; the other is to roll out a distributed cache as the second-level cache.
I am curious to hear folks' pros and cons here of DB tuning versus a distributed cache (factoring in effort involved, cost, complexity, etc.).
In a load-balancing scenario you have to use a distributed cache provider to get the best performance and consistency; that has nothing to do with optimizing your database. In any scenario you should optimize your database.
Both. You should have a distributed cache to prevent unnecessary calls to the database and a tuned database so that the initial calls are returned quickly. As an example, Facebook required a significant amount of caching to scale, but I'm sure it wouldn't have done much good if the initial queries took 10 minutes. :)
Two words: measure it.
Since you already have the cache implemented, you can probably measure what the impact of turning it off would be, for benchmark purposes.
I would think that a multi-webserver setup and a distributed second-level cache can (and probably should) coexist.
First of all, if we take memcached as an example, it supports distributed object storage, so if you're not using that, you could switch to it. It works.
Secondly, I'm guessing that you're introducing the web-server farm to respond to increasing web requests, which will in turn mean increasing requests for data. If you kill your caching, it won't matter how much you optimize your database; you're going to thrash it with queries. You may improve per-query execution time, but you'll still be spending your time waiting for the database to return data.
This is especially true for the case where web-node 1 requests dataset A and web-node 2 also requests dataset A: you are going to run the same query twice, while with second-level caching you only run it once.
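A toy illustration of that point (assuming Python with pymemcache; the shared cache host is a stand-in): both web nodes talk to the same memcached, so the second request is a hit and the query runs only once.

    from pymemcache.client.base import Client

    db_queries_run = [0]

    def load_dataset_a(cache):
        data = cache.get('dataset-A')
        if data is None:
            db_queries_run[0] += 1  # simulate hitting the database
            data = b'...rows for dataset A...'
            cache.set('dataset-A', data, expire=300)
        return data

    shared = Client(('shared-cache-host', 11211))  # one cache for all web nodes
    load_dataset_a(shared)  # web-node 1: miss -> the query runs
    load_dataset_a(shared)  # web-node 2: hit  -> no query
    assert db_queries_run[0] == 1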
So my recommendation is:
Don't kill your second-level cache. You have already spent resources to implement it, and by disabling it you are NOT going to improve your application's performance. Even a single node of memcached is going to be faster than having none at all.
Do optimize your database operations. This means both on the database side (indexes, views, stored procedures, functions, perhaps a cluster with read-only and write-only nodes) and on the application side (optimize your queries, profile your lazy/eager loading, don't fetch data you don't need, combine multiple queries into single round-trips via Future, MultiQuery, MultiCriteria).
Do optimize your second-level cache implementation. There are datasets with an infinite expiration date, for which you query the db only once, and there are datasets with short expiration dates, whose (probably expensive) queries are executed more frequently. Optimizing your queries and your db will improve performance for the queries themselves, but the second-level cache is going to save your skin at peak load, when the short-expiration datasets are served from the cache more frequently.
If textual queries are an everyday operation, use the database's full-text capabilities or, even better, an independent service like Lucene.NET (which can be integrated with NHibernate via NHibernate.Search).
That's a very difficult topic. In either case you need proficiency. Either a very proficient DBA, or a very proficient NHibernate / Cache administrator.
Personally, I prefer having full control over my SQL and tuning the database. Since you only have multiple webservers (and not necessarily multiple database instances), you might be better off that way, too. Modern databases have very efficient caches, so you usually do more harm with badly configured second-level caches in the application than by just letting the database cache SQL statements, cursors, data, buffers, etc. I have seen this work very well with around 15 WebLogic servers and only one database with lots of memory.
Since you already have NHibernate, though, moving away from it back to plain SQL (maybe with LINQ?) might be quite a costly task that's not worth the effort.
We use NHibernate's 2nd level cache in our multi-server environment using Microsoft AppFabric distributed cache framework (NHibernate Velocity Provider) with great success.
Having said that, using a 2nd-level cache requires a deeper understanding of the framework to prevent unexpected results. In addition, before using a distributed cache, it is important to measure its overhead.
So my answer is basically - before using 2nd-level cache, you should really test and see whether it is really needed.
I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) in scaling it:
Should I consider moving to Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes would be required if we switched to Cassandra? Would Cassandra be a better option than MySQL?
Is Hadoop a possibility? I'm not even sure.
Any other things that need to be thought of?
I found this post helpful. This blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
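As a minimal sketch of that split (written as a Python WSGI app here for brevity; the paths and max-age values are assumptions, and the equivalent headers can just as well be set from Apache or PHP):

    def app(environ, start_response):
        path = environ.get('PATH_INFO', '')
        if path.startswith('/images/') or path.startswith('/static/'):
            # Stable content: let browsers and intermediary caches keep it.
            headers = [('Content-Type', 'image/png'),
                       ('Cache-Control', 'public, max-age=86400')]
            body = b'...image bytes...'
        else:
            # Truly dynamic content: make clients revalidate every time.
            headers = [('Content-Type', 'text/html'),
                       ('Cache-Control', 'no-cache')]
            body = b'<html>...freshly generated page...</html>'
        start_response('200 OK', headers)
        return [body]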
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottleneck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries, etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottleneck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies, etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First, measure how many requests per second your most-visited pages can serve. For well-written PHP sites on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely changed data in memcached/shared memory, and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework. (A rough measurement sketch follows at the end of this answer.)
2) Second, if you are still on Apache2, you have to switch to lighttpd or nginx+apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or CDN. Make sure it is served with "Expires" headers of at least 24 hours.
4) Only after all these things might you start thinking about going to EC2/Hadoop and building multiple servers and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10,000,000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching & the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware & support expenses).
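For the measurement in step 1, a quick-and-dirty sketch like this is enough to get a ballpark figure (the URL and settings are placeholders; tools like ab or siege do the same job more thoroughly):

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = 'http://localhost/index.php'  # your most-visited page
    TOTAL, CONCURRENCY = 500, 10

    def hit(_):
        with urllib.request.urlopen(URL) as resp:
            resp.read()

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        list(pool.map(hit, range(TOTAL)))
    elapsed = time.perf_counter() - start
    print('%.1f requests/second' % (TOTAL / elapsed))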
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing where your biggest resource usage is matters when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth taking a look at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first whether that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database lots of times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues. If you do, find out where they are happening; optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you might get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just far enough to make sure you won't get the same issues elsewhere within a few months, and far enough to find out where your biggest gains are and whether you will get enough value from any scaling options. This will also allow you to come back and ask questions about specific problems, and about how these scaling options relate to them.
You should prepare by choosing a flexible framework and be aware that things are going to change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, make sure there's no other optimization you can do first.
The database might need to be redesigned, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.
Further to my previous question about the Optimal RAID setup for SQL server, could anyone suggest a quick and dirty way of benchmarking the database performance on the new and old servers to compare them? Obviously, the proper way would be to monitor our actual usage and set up all sorts of performance counters and capture the queries, etc., but we are just not at that level of sophistication yet and this isn't something we'll be able to do in a hurry. So in the meanwhile, I'm after something that would be a bit less accurate, but quick to do and still better than nothing. Just as long as it's not misleading, which would be worse than nothing. It should be SQL Server specific, not just a "synthetic" benchmark. It would be even better if we could use our actual database for this.
Measure the performance of your application itself with the new and old servers. It's not necessarily easy:
Set up a performance test environment with your application on it (depending on your architecture this may consist of several machines, some of which may be able to be VMs, but some of which may not be)
Create "driver" program(s) which give the application simulated work to do (see the sketch below)
Run batches of work under the same conditions; remember to reboot the database server between runs to nullify the effects of caching (otherwise your 2nd and subsequent runs will probably be amazingly fast)
Ensure that the performance test environment has enough hardware machines to be able to load the database heavily; this may mean swapping out some VMs for real hardware.
Remember to use production-grade hardware in your performance test environment - even if it is expensive.
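A minimal sketch of such a driver (the endpoints, weights, and think-times are assumptions to adapt to your actual workload):

    import random
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical application endpoints, weighted roughly like real usage.
    ACTIONS = [('http://test-app/search?q=widgets', 6),
               ('http://test-app/orders/recent', 3),
               ('http://test-app/report/daily', 1)]
    URLS = [url for url, weight in ACTIONS for _ in range(weight)]

    def worker(n_requests):
        latencies = []
        for _ in range(n_requests):
            t0 = time.perf_counter()
            urllib.request.urlopen(random.choice(URLS)).read()
            latencies.append(time.perf_counter() - t0)
            time.sleep(random.uniform(0.1, 0.5))  # simulated user think-time
        return latencies

    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(worker, [50] * 20))  # 20 users x 50 requests
    lat = sorted(l for r in results for l in r)
    print('median %.3fs, p95 %.3fs' % (lat[len(lat) // 2], lat[int(len(lat) * 0.95)]))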
Our database performance test cluster contains six hardware machines, several of which are production-grade, and one of which contains an expensive storage array. We also have about a dozen VMs on a seventh machine simulating other parts of the service.
You can always insert, read, and delete a couple of million rows. It's not a realistic mix of operations, but it should strain the disks nicely; a crude sketch follows.
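Something like this would do it (assuming Python with pyodbc; the DSN and table are placeholders):

    import pyodbc

    conn = pyodbc.connect('DSN=TestServer')
    cur = conn.cursor()
    cur.execute("CREATE TABLE dbo.StressTest (id INT, payload CHAR(200))")
    cur.fast_executemany = True  # send the inserts in bulk
    rows = [(i, 'x' * 200) for i in range(10000)]
    for _ in range(200):  # 200 batches x 10,000 rows = 2 million rows
        cur.executemany("INSERT INTO dbo.StressTest VALUES (?, ?)", rows)
        conn.commit()
    cur.execute("SELECT COUNT(*), MAX(payload) FROM dbo.StressTest")  # force a scan
    print(cur.fetchone())
    cur.execute("DELETE FROM dbo.StressTest")
    conn.commit()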
Find at least a couple of the queries that are taking some time, or at least that you suspect are taking time, insert a lot of data if you don't have it already, and run the queries after setting:
SET STATISTICS IO ON      -- logical and physical reads per table
SET STATISTICS TIME ON    -- parse/compile and execution times
SET STATISTICS PROFILE ON -- the executed plan with actual row counts
Those should give you a rough idea of the resources being consumed.
You can also run SQL Server Profiler to get a general idea of which queries are taking a long time and how long they take, plus other statistics. It outputs a lot of data, so try to filter it down a little, perhaps by long duration or one of the other performance statistics.