Combining cache methods - memcache/disk based - database

Here's the deal. We would have taken the completely static HTML route to solve our performance issues, but since the site will be partially dynamic, that won't work out for us.
What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.
Here are the two approaches we have thought of so far:
Using memcache on >>all<< major queries and leaving it alone to do what it does best.
Using memcache for the most commonly retrieved data, combined with a standard hard-drive-backed cache for everything else.
The major advantage of only using memcache is of course the performance, but as the number of users increases, the memory usage gets heavy. Combining the two sounds like a more natural approach to us, despite the theoretical compromise in performance.
Memcached appears to have some replication features available as well, which may come handy when it's time to increase the nodes.
What approach should we use?
- Is it stupid to compromise and combine the two methods? Or should we focus solely on memcache and simply upgrade the memory as the load increases with the number of users?
Thanks a lot!

Compromising and combining the two methods is a very clever approach, I think.
The most obvious cache management rule is the latency vs. size rule, which is also used in CPU caches. In a multi-level cache, each successive level should have more capacity to compensate for its higher latency: you accept higher latency in exchange for a higher cache hit ratio. So I would not recommend placing a disk-based cache in front of memcache; on the contrary, it should be placed behind memcache. The only exception is if your cache directory is mounted in memory (tmpfs). In that case a file-based cache can take load off memcache and may even win on latency (because of data locality).
These two storages (file-based and memcache) are not the only ones convenient for caching. You could also use almost any KV database, as they are very good at concurrency control.
Cache invalidation is a separate question that deserves your attention. There are several tricks you can use to handle cache updates on cache misses more gracefully. One of them is dog-pile effect prevention (see the sketch below): if several concurrent threads get a cache miss simultaneously, all of them go to the backend (the database). The application should allow only one of them to proceed and make the rest wait on the cache. The second is background cache updating: it is nicer to update the cache not in the web request thread but in the background, where you can control the concurrency level and update timeouts more gracefully.
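A minimal sketch of that dog-pile lock in PHP, assuming the standard Memcached extension; the key names, timeouts and rebuild callback are only illustrative:

    // On a miss, only the request that wins the lock rebuilds the value;
    // every other request briefly waits and re-reads the cache.
    function get_with_lock(Memcached $mc, $key, callable $rebuild, $ttl = 300) {
        $value = $mc->get($key);
        if ($value !== false) {
            return $value;                      // cache hit
        }
        // add() is atomic: it succeeds for exactly one concurrent client
        if ($mc->add($key . ':lock', 1, 30)) {
            $value = $rebuild();                // the expensive database query
            $mc->set($key, $value, $ttl);
            $mc->delete($key . ':lock');
            return $value;
        }
        usleep(100000);                         // someone else is rebuilding: wait a bit
        $value = $mc->get($key);
        return ($value !== false) ? $value : $rebuild();
    }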
Actually there is one cool method that lets you do tag-based cache tracking (memcached-tag, for example). Under the hood it is very simple. With every cache entry you save a vector of the versions of the tags it belongs to (for example: {directory#5: 1, user#8: 2}). When you read a cache entry you also read the current version numbers of all its tags from memcached (this can be done efficiently with a multiget). If at least one current tag version is greater than the tag version saved with the cache entry, the entry is invalid. And when you change an object (for example a directory), the corresponding tag version is incremented. It is a very simple and powerful method, but it has its own disadvantages: you cannot control eviction in this scheme, so memcached can easily drop live entries while keeping stale ones.
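A rough sketch of that tag-versioning scheme in PHP, again with the Memcached extension; the key layout and helper names are made up for illustration, and tag keys are assumed to have been initialised once (e.g. $mc->set('tag:directory:5', 1)):

    // Store the entry together with the tag versions that were current at write time.
    function tagged_set(Memcached $mc, $key, $value, array $tagKeys, $ttl = 600) {
        $versions = $mc->getMulti($tagKeys) ?: array();
        $mc->set($key, array('value' => $value, 'tags' => $versions), $ttl);
    }

    // On read, compare the saved tag versions with the current ones (one multiget).
    function tagged_get(Memcached $mc, $key) {
        $entry = $mc->get($key);
        if ($entry === false) {
            return false;
        }
        $current = $mc->getMulti(array_keys($entry['tags'])) ?: array();
        foreach ($entry['tags'] as $tag => $savedVersion) {
            if (!isset($current[$tag]) || $current[$tag] > $savedVersion) {
                return false;                   // a tag has moved on: the entry is stale
            }
        }
        return $entry['value'];
    }

    // Invalidate everything tagged with directory #5 by bumping its version:
    // $mc->increment('tag:directory:5');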
And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.

Memcached is quite a scalable system. For instance, you can replicate the cache to decrease access time for certain key buckets, or implement the Ketama algorithm (consistent hashing), which lets you add/remove Memcached instances from the pool without remapping all keys. In this way, you can easily add new machines dedicated to Memcached whenever you happen to have extra memory. Furthermore, since instances can be run with different sizes, you can bring one up by adding more RAM to an old machine. Generally, this approach is more economical and, to some extent, not inferior to the first one, especially for multiget() requests. Regarding a performance drop as the data grows: the runtime of the algorithms used in Memcached does not vary with the size of the data, so access time depends only on the number of simultaneous requests. Finally, if you want to tune your memory/performance priorities, you can set the expire time and the available-memory configuration values, which will restrict RAM usage or increase cache hits.
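If you go the Ketama route with the PHP Memcached extension, enabling consistent hashing looks roughly like this (the server addresses are placeholders):

    $mc = new Memcached();
    // Consistent hashing: adding or removing a server only remaps a small share of the keys
    $mc->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
    $mc->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);
    $mc->addServers(array(
        array('10.0.0.1', 11211),
        array('10.0.0.2', 11211),
        array('10.0.0.3', 11211),
    ));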
At the same time, when you use a hard disk, the file system can become a bottleneck for your application. Besides general I/O latency, things such as fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard-disk settings are tuned more for compatibility than for speed, so it is advisable to configure them properly before use (for instance, you can try the hdparm utility).
Thus, before adding one more integration point, I think you should tune the existing system. Usually, a properly designed database, well-configured PHP, Memcached and sensible handling of static data are enough even for a high-load web site.

I would suggest that you first use memcache for all major queries. Then test to find the queries that are least used, or the data that rarely changes, and provide a cache for those.
If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.

Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.
In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.
The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).
Also the way memcached is normally used is dependent on a low latency link from your app servers; so you wouldn't normally use the same memcached cluster in different geographical locations within your infrastructure (each DC would have its own cluster)
The process is:
Identify performance problems
Decide how much performance improvement is enough
Reproduce problems in your test lab, on production-grade hardware with necessary driver machines - this is nontrivial and you may need a lot of dedicated (even specialised) hardware to drive your app hard enough.
Test a proposed solution
If it works, release it to production, if not, try more options and start again.
You should not
Cache "everything"
Do things without measuring their actual impact.
As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.
This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
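As a sketch of such instrumentation in PHP (the counter key names and the Memcached backend are only an assumption; you might equally push these numbers into your monitoring system):

    // Wrap every cache read so each named cache keeps its own hit/miss counters.
    function instrumented_get(Memcached $mc, $cacheName, $key) {
        $value = $mc->get($key);
        $counter = ($value !== false) ? "stats:$cacheName:hits" : "stats:$cacheName:misses";
        if ($mc->increment($counter) === false) {
            $mc->set($counter, 1);              // increment() fails on a missing key
        }
        return $value;
    }

    // A cron job or admin page can then read stats:*:hits and stats:*:misses
    // and flag any cache whose hit rate has dropped below roughly 90%.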
It may also be worth having the individual caches switchable in production.
Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.

You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough).
For Solaris, you can actually even add an SSD layer in the middle; this technology is called L2ARC.
I'd recommend you to read this for a start: http://blogs.oracle.com/brendan/entry/test.

Related

Datomic: robust settings to avoid timeouts and out-of-memory errors

I'm running the same datomic-backed application on a variety of architectures with varying amounts of memory (1GB - 16GB). When I do bulk imports of data I frequently run into timeouts or out-of-memory errors.
After looking at the documentation I happened upon this helpful document (and this one) which seem to outline best practices to obtain good performance under heavy imports.
I'm not as interested in performance as I am in making imports "just work." This leads to my main question:
What is the minimum complexity configuration to ensure that an arbitrarily large import process terminates on a given machine?
I understand that this configuration may be a function of my available memory; that's fine. I also understand that it may not be maximally performant; that's also fine. But I do need to know that it will terminate.
Data Distribution
I think the critical pieces of information missing from your question are the type of data and its distribution and the available metrics for your system during the bulk imports. Why?
Datomic's transaction rate is limited by the cost of background indexing jobs and the cost of that indexing is a function of the distribution of new values and the size of your database.
What this means is that, for example, if you have indexed attributes (i.e. :db/index) and, as your bulk import goes through, the distribution of those attribute values is random, you'll put a lot of pressure on the indexing jobs as they rewrite an ever-increasing number of segments. As your database grows, indexing will dominate the transactor's work and it won't be able to catch up.
Transactor Memory
As described in the docs, the more memory you can give to object-cache-max, the better. This is especially important if your data has a lot of uniqueness constraints (i.e. :db/unique), since that will prevent the transactor from fetching some storage segments multiple times.
Depending on your data distribution, increasing the memory-index-threshold and memory-index-max settings may let your imports run longer... until the indexing job can't keep up. This seems to be what is happening to you.
Recommendations
Try reducing the memory-index-threshold and memory-index-max settings. That may seem counterintuitive, but you'll have a much better chance of having any import complete (of course it will take more time, but you can almost guarantee it will finish). The key is to make the transactor throttle your (peer) requests before it is no longer able to keep up with the indexing jobs.
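As a hedged illustration, these knobs live in the transactor properties file; the values below are placeholders to be sized against your available RAM, not recommendations:

    # transactor properties -- illustrative values only
    # a lower threshold starts indexing sooner, so individual jobs stay small
    memory-index-threshold=16m
    # a lower ceiling makes the transactor throttle peers before indexing falls behind
    memory-index-max=128m
    # as large as your transactor heap comfortably allows
    object-cache-max=256m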

What are the pros and cons of a distributed second level cache versus focusing on tuning database

We have a website that uses NHibernate and its 2nd-level cache. We are having a debate, as one person wants to turn off the second-level cache now that we are moving to a multi-webserver environment (with a load balancer in front).
One argument is to get rid of the second-level cache and focus on optimizing and tuning the DB; the other argument is to roll out a distributed cache as the second-level cache.
I am curious to hear folks' pros and cons here of DB tuning versus a distributed cache (factoring in effort involved, cost, complexity, etc.).
In a load-balancing scenario you have to use a distributed cache provider to get the best performance and consistency; that has nothing to do with optimizing your database. In any scenario you should optimize your database.
Both. You should have a distributed cache to prevent unnecessary calls to the database and a tuned database so the initial calls are returned quickly. As an example, Facebook required a significant amount of caching to scale, but I'm sure it wouldn't do much good if the initial queries took 10 minutes. :)
Two words: measure it.
Since you already have the cache implemented, you can probably measure what the impact would be of turning it off, for benchmarking purposes.
I would think that a multi-web server and a distributed second level cache can -and probably should- coexist.
First of all, if we take memcached as an example, it supports distributed object storage, so if you're not using that you could switch to it. It works.
Secondly, I'm guessing that you're introducing the web-server farm to respond to increasing web requests, which will in turn mean increasing requests for data. If you kill your caching, it won't matter how much you optimize your database; you're going to thrash it with queries. You may improve each query's execution time, but you'll still be waiting on the database for every request.
This is especially true for the case where web-node 1 requests dataset A and web-node 2 also requests dataset A: you are going to run the same query twice, while with second-level caching you only run it once.
So my recommendation is:
Don't kill your second level cache. You have already spent resources to implement it and by disabling it you are NOT going to improve your application's performance. Even a single node of memcached is going to be faster than having none at all.
Do optimize your database operations. This means both the database side (indexes, views, stored procedures, functions, perhaps a cluster with read-only and write-only nodes) and the application side (optimize your queries, profile lazy/eager loading, don't fetch data you don't need, combine multiple queries into single round-trips via Future, MultiQuery, MultiCriteria).
Do optimize your second-level cache implementation. There are datasets with an infinite expiration date, for which you query the db only once, and there are datasets with short expiration dates, whose (probably expensive) queries are executed more frequently. By optimizing your queries and your db you are going to improve the performance of the queries themselves, but the second-level cache is going to save your skin under peak load, when the short-expiration datasets are fetched from the cache more frequently.
If textual queries are an everyday operation, use the database's full-text capabilities or, even better, use an independent service like Lucene.NET (which can be integrated with NHibernate via NHibernate.Search).
That's a very difficult topic. In either case you need proficiency. Either a very proficient DBA, or a very proficient NHibernate / Cache administrator.
Personally, I prefer having full control over my SQL and tuning the database. Since you only have multiple webservers (and not necessarily multiple database instances), you might be better off that way, too. Modern databases have very efficient caches, so you usually do more harm with a badly configured second-level cache in the application than by just letting the database cache SQL statements, cursors, data, buffers, etc. I have seen this work very well with around 15 WebLogic servers and a single database with lots of memory.
Since you do have NHibernate already, though, moving away from it, back to plain SQL (maybe with LINQ?), might be quite a costly task that's not worth the effort.
We use NHibernate's 2nd level cache in our multi-server environment using Microsoft AppFabric distributed cache framework (NHibernate Velocity Provider) with great success.
Having said that, using 2nd level cache requires deeper understanding of the framework to prevent unexpected results. In addition, before using distributed caches, it is important to measure their overhead.
So my answer is basically - before using 2nd-level cache, you should really test and see whether it is really needed.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched to Cassandra from MySQL. What kind of code changes are required if we switch to Cassandra? Would Cassandra be a better option than MySQL?
Possibility of Hadoop? Not even sure.
Any other things that need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
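For the dynamic-but-stable pages, a minimal PHP sketch of such headers (the one-hour lifetime is just an example; pick whatever staleness you can tolerate):

    <?php
    // Static-ish content: let browsers and intermediate proxies cache it for an hour
    header('Cache-Control: public, max-age=3600');
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');

    // Truly dynamic content (cart, checkout, account pages): force revalidation instead
    // header('Cache-Control: no-cache, must-revalidate');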
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing - measure how many requests per second your most-visited pages can serve. For well-written PHP sites on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely changed data in memcached/shared memory, and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighttpd or nginx+apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or CDN. Make sure it is served with "Expires" headers at least 24 hours in the future.
4) Only after all these things should you start thinking about going to EC2/Hadoop, building multiple servers and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10,000,000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching & the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but expect significantly larger hardware & support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what your biggest resource usage is matters when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth taking a look at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first whether that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database lots of times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues; if you do, find out where they are happening. Optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you might get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere within a very few months. Also enough to find out where your biggest gains are, and whether you will get enough value from any of the scaling options. This will also allow you to come back and ask questions about specific problems, and about how these scaling options relate to them.
You should prepare by choosing a flexible framework and being sure that things are going to change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, be sure there's no other optimization you can do first.
The database might need to be redesigned, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

What's more costly on every page view - Database Writes or File Writes?

What is the most efficient solution when you need to record some data on every page view in your application - should you write to a file or write to the database?
Or maybe neither - perhaps you should cache the data in memory or a file and only write it to the database (or file system if you use a memory cache) occasionally?
If it's purely recording a small amount of data with no subsequent lookups, straight file I/O is almost guaranteed to be more efficient. You're losing all the advantages of a DBMS, though -- indexing, transactional integrity (really, ACID in general), concurrent access, etc.
It almost sounds like you're talking about what amounts to simple logging. If that's the case, and you don't need to do frequent complex queries on the resulting data, you're probably better off with straight file I/O if performance is a serious issue. Be careful of concurrent-write issues, though.
If the properties of an RDBMS are desirable, you might think about using SQLite, which for simplistic loads will get you better performance than most RDBMSs with less overhead, at the cost of some of the benefits (highly concurrent access and availability over the network to other machines are a couple of the "biggies"). It still wouldn't be as fast as straight file I/O in the general case, though.
Your later mention of it being for page view tracking causes me to ask: Are you incrementing a counter, rather than logging data about the page view? If so, I'd strongly suggest going with something like SQLite (doing something like UPDATE tbl SET counter = counter+1). You really don't want to get into the timing issues involved in doing this by hand -- if you don't do it right, you'll start losing counts on simultaneous access (A reads "100", B reads "100", A writes "101", B writes "101"; B should have written 102, but has no way of knowing that).
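A small sketch of that counter with PHP's PDO SQLite driver; the file path, table and column names are made up for illustration:

    $db = new PDO('sqlite:/var/data/pageviews.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS page_views (page TEXT PRIMARY KEY, counter INTEGER NOT NULL DEFAULT 0)');

    // The read-modify-write happens inside the database, so concurrent requests
    // cannot lose counts the way a hand-rolled read-then-write would.
    $page = $_SERVER['REQUEST_URI'];
    $db->prepare('INSERT OR IGNORE INTO page_views (page, counter) VALUES (:page, 0)')
       ->execute(array(':page' => $page));
    $db->prepare('UPDATE page_views SET counter = counter + 1 WHERE page = :page')
       ->execute(array(':page' => $page));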
Conceptually, writing to the database is always slower than writing to a file.
The database has to write to a file too, with the extra overhead of communication to get the data to the database, so it can write it to a file. Therefore, it must be slower.
That said, databases do disk I/O very well, probably better than you will. Don't be surprised if you find out that a simple file logger is slower than writing to a database. The database has a lot of I/O optimizations, and has some tricks available that you may not (depending on your web language and environment).
Don't be surprised if the answer changes over time. When your site is small, logging to a database is very fast. As your site grows, the logging table can become a major pain: It uses a lot of disk space, makes the backups take forever, and consumes all the disk I/O when you try to query it. This is why you should benchmark both methods yourself. Then you can re-test in the future, when conditions change.
Hitting the database is most likely going to be more expensive than writing to a file.
If your pageviews per second are high, and if the data doesn't need to be available in the database right away, then writing to a file and periodically loading the data into the DB will be a more optimal solution.
However it all depends on the nature of the data you're recording per page view and how critical it is to whatever business function it serves.
That highly depends on your needs for data safety. If you can afford to lose some data in case of a crash then keeping the data in memory and writing it periodically to a persistent store is certainly the most efficient way to go.
Edit: You mentioned pageviews. In that case I would keep the counters in memory and periodically update a database table (like every minute or so).
That depends.
And it really does: it depends on the DBMS and/or the OS+filesystem you use. In other words: your mileage may vary.
If you just append data somewhere, modern DBMSs and OS+filesystems should handle this equally well and fast. Problems arise when you want to change data.
Caching - this too depends on what caching granularity you can afford (the need to have every step logged crash-safe versus the potential savings).
Use a hybrid solution like Redis; it's designed for this sort of stuff.
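For instance, a hedged sketch with the phpredis extension: increment in memory on every hit and let a periodic job flush the totals to the database (the key names and the flush interval are arbitrary):

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // On every page view: a single atomic, in-memory increment
    $redis->incr('views:' . $_SERVER['REQUEST_URI']);

    // In a cron job (say, once a minute): move the counts into the database
    // foreach ($redis->keys('views:*') as $key) {
    //     $count = $redis->getSet($key, 0);    // read and reset atomically
    //     // ... UPDATE pages SET views = views + $count WHERE url = ... ;
    // }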

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. The "modules" developed on top of this framework can behave quite differently, but they are all pretty resource-consuming, some more than others. They all receive data as input, analyse and/or transform it, and send it on.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is simply to run each module sequentially with a well-chosen set of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? Do you care about the absolute worst case, or do you care that 90%, 95% or some other percentile gets adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
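As a rough PHP sketch of the latency-oriented variant (the module call and the input set are placeholders; a real harness for throughput would also control the rate at which inputs are fed in):

    // Feed the chosen inputs through a module, record per-item latency, then report
    // the mean and a high percentile rather than a single total time.
    $latencies = array();
    foreach ($inputs as $input) {               // $inputs: your chosen benchmark data set
        $start = microtime(true);
        process_module($input);                 // placeholder for the module under test
        $latencies[] = microtime(true) - $start;
    }

    sort($latencies);
    $n = count($latencies);
    $mean = array_sum($latencies) / $n;
    $p95  = $latencies[(int) floor(0.95 * ($n - 1))];
    printf("n=%d  mean=%.1f ms  p95=%.1f ms\n", $n, $mean * 1000, $p95 * 1000);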
Third, you do not say whether these modules are going to be CPU bound or I/O bound, or whether they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put a realistic load into the system. Meaning, you record data from a production environment and put the new hardware solution through that data. Getting this done is harder than it sounds: often it means adding all kinds of measurement points to the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the hardware-versus-performance trade-offs. One configuration may give you the best performance but cost too much; a less expensive configuration may give you adequate performance.
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization that doesn't cost too much and still gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some of the operations users (or processes) perform with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
Which functions are most often used?
How much data is transferred?
Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
Create a top ten for 1+2 and work from that.
That said: if you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty similar).
If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, microbenchmarks, where you evaluate a piece of code in isolation or see how a system deals with a narrowly defined workload: compare two sorting algorithms written in Java, or compare two web browsers on how fast each can perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you evaluate a software system under a realistic workload: compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and such like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, the CPU caches get filled with the necessary data. The same goes for disk caches. During a few subsequent runs the VM continues to warm up, meaning the JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation; plot a chart. Look at it and see how much it varies. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, some other process starting a background task (like a virus scan), or the OS deciding to move the process to a different CPU core; if you have a NUMA architecture, the effects are even more pronounced.
In the case of microbenchmarks, all of this is a problem. Kill whatever processes you can before you begin, and use a benchmarking library that can do some of this for you, such as https://github.com/google/caliper.
System benchmarking
In the case of benchmarking a system under a realistic workload, these details do not really interest you, and your problem is "only" to know what a realistic workload is, how to generate it, and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (such as how long a web page took to render) and these are I/O bound, so the code gathering the data does not slow the system down. (The page needs to be shipped to the user over the network anyway; it does not matter if we also log a few numbers in the process.)
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you the absolute time spent doing something; profiling gives you the relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred milliseconds and save a stack trace), and the instrumentation slows everything down significantly.
