Demonstrate the Power of a Linux based Load balancing cluster - benchmarking

I have to demonstrate the power of a cluster, i.e. its advantage over a normal machine.
Is there any way I can show the layman that "If this had been done on a normal workstation, it would have taken X hours, and with this cluster it takes only 1/n of those X hours"?
Please suggest some simulations/renders/computational tests that can give concrete figures to support the above argument.

Some ideas: the website, the Linux High Performance Computing website, the Top500 Related Papers (not really Linux-specific, but there is some interesting data).
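A rough way to put numbers on this: time the same job on one workstation and on the cluster, then report the ratio. The sketch below is only illustrative. The function names and the 8-hour, 1-hour and 10-node figures are made up for the example, and the Amdahl's law estimate assumes you know roughly what fraction of the job can run in parallel.

```python
def speedup_report(single_hours, cluster_hours, nodes):
    """Return (speedup, efficiency) for a measured single-machine vs cluster run."""
    speedup = single_hours / cluster_hours   # "n times faster"
    efficiency = speedup / nodes             # 1.0 would be perfect scaling
    return speedup, efficiency


def amdahl_speedup(parallel_fraction, nodes):
    """Upper bound on speedup when only part of the job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Hypothetical measurements, not real benchmark results.
speedup, efficiency = speedup_report(single_hours=8.0, cluster_hours=1.0, nodes=10)
print(f"Cluster finished {speedup:.1f}x faster ({efficiency:.0%} scaling efficiency)")
print(f"Amdahl bound at 90% parallel, 10 nodes: {amdahl_speedup(0.9, 10):.2f}x")
```

For a layman, the first line is usually the one to show; the Amdahl bound is mostly useful for explaining why adding nodes stops helping at some point.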


Running C simulations on EC2

I've been running parallel, independent simulations on a SGE cluster and would like to transition to using EC2. I've been looking at the documentation for StarCluster but, due to my inexperience, I'm still missing a few things.
1) My code is written in C and uses GSL -- do I need to install GSL on the virtual machines and compile there, or can I precompile the code? Are there any tutorials that cover this exact usage of EC2?
2) I need to run maybe 10,000 CPU hours of code, but I could easily set this up as many short instances or fewer, longer jobs. Given these requirements, is EC2 really the best choice? If so, is StarCluster the best interface for my needs?
Thanks very much.
You can create an AMI (basically an image of your virtual machine) with all your dependencies installed. Then all you need to do is configure job specific parameters at launch.
You can run as many instances as you want on EC2. You may be able to take advantage of spot instances to save money (all you need to be able to tolerate is the instance getting shut down if the price exceeds your bid).

Optimizing Solr 4 on EC2 debian instance(s)

My Solr 4 instance is slow and I don't know why.
I am attempting to modify the configurations of JVM, Tomcat6 and Solr 4 in order to optimize performance, with queries per second as the key metric.
Currently I am running on an EC2 small tier with Debian squeeze, but ready to switch to Ubuntu if needed.
There is nothing special about my use case. The index is small. Queries do include a moderate number of unions (e.g. 10), plus faceting, but I don't think that's unusual.
My understanding is that these areas could need tweaking:
Configuring the JVM Garbage collection schedule and memory allocation ("GC tuning is a precise art form", ref)
Other JVM settings
Solr's Query Result cache, Filter cache, Document cache settings
Solr's Auto-warming settings
There are a number of ways to monitor the performance of Solr:
Sematext SPM
New Relic
But none of these methods indicate which settings need to be adjusted, and there's no guide that I know of that steps through an exhaustive list of settings that could possibly improve performance. I've reviewed the following pages (one, two, three, four), and gone through some rounds of trial and error so far without improvement.
How to tell JVM to use all the 2 GB memory on the small EC2 instance?
How to debug and optimize JVM Garbage Collection?
How do I know when I/O throttling, such as the new EBS IOPS pricing, is the issue?
Using figures like the NewRelic examples below, how to detect what is problematic behavior, and how to approach solutions.
I'm looking for links to good documentation for setting up and optimizing Solr 4, from a DevOps or server admin perspective (not index or application design).
I'm looking for the top trouble spots in solrconfig.xml, solr.xml (other?) that are the most likely causes of problems.
Or any tips you think address the questions.
First, you should not focus on switching your Linux distribution. A different distribution might bring some changes, but given the information you provided, nothing proves that these changes would be significant.
You are mentioning lots of possibilities for your optimisations, which can be overwhelming. You should consider tweaking an area only once you have proven that the problem lies in that particular part of your stack.
JVM Heap Sizing
You can use the parameter -Xmx1700m to give a maximum of 1.7GB of RAM to the JVM. Hotspot might not need it, so don't be surprised if your heap capacity does not reach that number.
You should set the minimum heap size to a low value, so that Hotspot can optimise its memory usage. For instance, to set a minimum heap size of 128MB, use -Xms128m.
Garbage Collector
From what you say, you have limited hardware (1-core at 1.2GHz max, see this page)
M1 Small Instance
1.7 GiB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor
Therefore, using that low-latency GC (CMS) won't do any good. It won't be able to run concurrently with your application since you have only one core. You should switch to the Throughput GC using -XX:+UseParallelGC -XX:+UseParallelOldGC.
Is the GC really a problem?
To answer that question, you need to turn on GC logging. It is the only way to see whether GC pauses are responsible for your application response time. You should turn these on with -Xloggc:gc.log -XX:+PrintGCDetails.
But I don't think the problem lies here.
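If you do turn GC logging on, a few lines of script are enough to see how much time is going into pauses. The following is a minimal sketch, assuming the Java 6/7 -XX:+PrintGCDetails layout in which each collection record ends with ", N.NNNNNNN secs]". The sample lines are illustrative rather than taken from a real run, and the exact layout varies with JVM version and collector, so treat the regular expression as a starting point.

```python
import re

# With -XX:+PrintGCDetails, each collection record ends with its total pause,
# e.g. "..., 0.0150000 secs]". The trailing "[Times: ...]" block does not match.
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def gc_pause_summary(lines):
    """Return (collections, total_pause_seconds, longest_pause_seconds)."""
    pauses = []
    for line in lines:
        matches = PAUSE_RE.findall(line)
        if matches:
            # The last match on a line is the pause for the whole collection.
            pauses.append(float(matches[-1]))
    if not pauses:
        return 0, 0.0, 0.0
    return len(pauses), sum(pauses), max(pauses)


# Illustrative lines in the assumed log layout; not from a real server.
SAMPLE_LOG = [
    "1.234: [GC [PSYoungGen: 65536K->10720K(76288K)] 65536K->10728K(251392K), "
    "0.0150000 secs] [Times: user=0.02 sys=0.01, real=0.02 secs]",
    "9.876: [Full GC [PSYoungGen: 10720K->0K(76288K)] [ParOldGen: 8K->10500K(175104K)] "
    "10728K->10500K(251392K) [PSPermGen: 2500K->2500K(21504K)], "
    "0.0800000 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]",
]

count, total, longest = gc_pause_summary(SAMPLE_LOG)
print(f"{count} collections, {total:.3f}s paused in total, longest {longest:.3f}s")
```

If the total pause time is a small fraction of the wall-clock time covered by the log, GC is probably not what is slowing your queries down.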
Is it a hardware problem?
To answer this question, you need to monitor resource utilization (disk I/O, network I/O, memory usage, CPU usage). You have a lot of tools to do that, including top, free, vmstat, iostat, mpstat, ifstat, ...
If you find that some of these resources are saturating, then you need a bigger EC2 instance.
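As a rough illustration of what "saturating" looks like for the CPU, here is a sketch that compares two samples of the aggregate cpu line from /proc/stat (the same counters that top and vmstat report). The function names are mine, and the sample counter values are made up. On a virtualised host such as EC2 the steal fraction is worth watching, since it is time the hypervisor gave to someone else.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of 8 tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(x) for x in fields[1:9]]
    # Order: user nice system idle iowait irq softirq steal (pad on old kernels).
    return ticks + [0] * (8 - len(ticks))


def cpu_breakdown(before, after):
    """Fractions of time spent busy, waiting on I/O, and stolen between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle, iowait, steal = deltas[3], deltas[4], deltas[7]
    return {
        "busy": (total - idle - iowait - steal) / total,
        "iowait": iowait / total,
        "steal": steal / total,
    }


def read_cpu_line(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


# Made-up counter values standing in for two reads a few seconds apart.
sample_before = "cpu 100 0 50 800 50 0 0 0"
sample_after = "cpu 200 0 100 1400 100 0 0 100"
print(cpu_breakdown(sample_before, sample_after))
```

On a live machine you would call read_cpu_line() twice with a sleep in between and pass both results to cpu_breakdown().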
Is it a software problem?
In your stats, the document cache hit rate and the filter cache hit rate are healthy. However, I think the query result cache hit rate is pretty low. This implies that a lot of queries are actually being executed rather than served from the cache.
You should monitor the query execution time. Depending on that value you may want to increase the cache size or tune the queries so that they take less time.
More links
JVM options reference:
A write-up I did of an application performance audit:
Hope that helps!

Most efficient tool to store frequent temperature samples

I'm working on a system that will need to store lots of temperature data. I could potentially store 5 samples per second or more.
I've done this in the past with a relatively simple MySQL database and performance became unbearable. Inserts were not too bad, but had a noticeable load. Queries, however, could take minutes.
At that time, I had something like 50 GB of data, which is ridiculous. I can think of many ways to compress or discard data without losing critical information, but that's a completely different problem.
I'd like to pick a tool/database that is optimized for this kind of data, preferably cross-platform (at a minimum, Linux/C++).
RRD (Round Robin Database) seems built for this kind of thing, but it seems designed more for processing data than for storing it.
What other tools are available?
Edit: more info...
This will be running on an embedded system (a Raspberry Pi) so an ideal tool has low computing overhead, low memory footprint, and few library dependencies.
The storage may not necessarily be on the same device.
I suppose growth could reach as much as 500k samples per hour in a contrived, extreme case. More likely it will be about 20k samples per hour.
Internet access should not be presumed.
Looks like you're looking for a time series DB.
I know of two candidates:
If you could be a bit more specific about your requirements (req/s, daily data growth, API type, self-hosted or a fully managed solution, etc), I'd be able to go into more details or recommend other solutions.
Good luck.
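Whatever tool you end up with, it helps to know how small the raw data actually is. Here is a back-of-the-envelope sketch, assuming a fixed 10-byte record (a 64-bit millisecond timestamp plus the temperature stored as a 16-bit integer in hundredths of a degree Celsius). The record layout and function names are my own assumptions, not something any particular database uses.

```python
import struct

# Assumed layout: uint64 timestamp in milliseconds + int16 temperature in
# hundredths of a degree Celsius, little-endian, no padding (10 bytes total).
RECORD = struct.Struct("<Qh")


def pack_sample(timestamp_ms, temperature_c):
    """Encode one sample as a fixed-size binary record."""
    return RECORD.pack(timestamp_ms, round(temperature_c * 100))


def unpack_sample(blob):
    """Decode a record back into (timestamp_ms, temperature_c)."""
    timestamp_ms, centidegrees = RECORD.unpack(blob)
    return timestamp_ms, centidegrees / 100


def bytes_per_day(samples_per_hour):
    """Raw storage needed per day at a given sample rate."""
    return samples_per_hour * 24 * RECORD.size


blob = pack_sample(1_700_000_000_000, 21.37)
print(unpack_sample(blob))
print(bytes_per_day(20_000) / 1e6, "MB/day at 20k samples/hour")
print(bytes_per_day(500_000) / 1e6, "MB/day at 500k samples/hour")
```

With this layout the likely case of 20k samples per hour comes to 4.8 MB per day and the extreme case of 500k per hour to 120 MB per day, before any compression or downsampling. The int16 encoding limits the range to roughly -327 to +327 degrees Celsius, which may or may not suit your sensors.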

How should I estimate hardware requirements for SQL Server 2005 database?

We're being asked to spec out production database hardware for an ASP.NET web application that hasn't been built yet.
The specs we need to determine are:
Database CPU
Database I/O
Database RAM
Here are the metrics I'm currently looking at:
Estimated number of future hits to website - based on current IIS logs.
Estimated worst-case peak loads to
Estimated number of DB queries per page, on average.
Number of servers in web farm that will be hitting database.
Cache polling traffic from database (using SqlCacheDependency).
Estimated data cache misses.
Estimated number of daily database transactions.
Maximum acceptable page render time.
Any other metrics we should be taking into account?
Also, once we have all those metrics in place, how do they translate into hardware requirements?
What I have been doing lately for server planning is using some free tools that HP provides, which are collectively referred to as the "server sizers". These are great tools because they figure out the optimal type of RAID to use and the correct number of disk spindles to handle the load (very important when planning for a good DB server), as well as memory, processor, etc. I've provided the link below; I hope this helps.
What I am missing is a measure for the needed / required / defined level of reliability.
While you could probably spec out a big honking machine to handle all the load, depending on your reliability requirements you might rather want to invest in multiple smaller machines and in safer disk subsystems (RAID 5).
In my opinion, estimating hardware for an application that hasn't been built and designed yet is more of a political issue than a scientific issue. By the time you finish the project, current hardware capability and their price, functional requirements, expected number of concurrent users, external systems and all other things will change and this change is beyond your control.
However, this question comes up very often, since you need to put numbers in a proposal or provide a report to your manager. If it is a proposal, what you are trying to accomplish is to come up with a spec that can support the proposed software system. The only trick is to propose a system that does not raise your cost so much that you lose competitiveness, without putting yourself at risk of a low-performance system.
If you can characterize your current workload in terms of hits to pages, you can then:
1) calculate the typical type of query that will be done for each page
2) using the above 2 pieces of information, estimate the workload on the database server
You also need to determine your performance requirements - what is the max and average response time you want for your website?
Given the workload, and performance requirements, you can then calculate capacity. The best way to make this estimate is to use some existing hardware, run a simulated database workload on a database on that hardware, and then extrapolate your hardware requirements based on your data from the first steps.
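To make the extrapolation step concrete, here is a rough sketch of the arithmetic. All the input figures are hypothetical placeholders, the function names are mine, and the 70% target utilization is just an assumed safety margin; plug in your own numbers from the IIS logs and from the simulated workload.

```python
def db_queries_per_second(peak_hits_per_second, queries_per_page, cache_hit_ratio):
    """Queries that actually reach the database at peak load."""
    return peak_hits_per_second * queries_per_page * (1.0 - cache_hit_ratio)


def capacity_multiple(required_qps, measured_qps, target_utilization=0.7):
    """How many times the test machine's capacity the production box needs.

    measured_qps is what the existing test hardware sustained while still
    meeting the response-time requirement; target_utilization leaves headroom.
    """
    return required_qps / (measured_qps * target_utilization)


# Hypothetical inputs: 200 page hits/s at peak, 5 queries per page,
# 80% of data requests served from cache, test box sustained 150 queries/s.
required = db_queries_per_second(200, 5, 0.8)
print(f"Peak database load: {required:.0f} queries/s")
print(f"Needed capacity: {capacity_multiple(required, 150):.2f}x the test machine")
```

This is a linear extrapolation, which real database servers rarely follow exactly, so treat the result as a starting point for the proposal rather than a guarantee.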

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. The "modules" developed on top of this framework can behave quite differently, but they are all fairly resource-intensive, some more than others. They all receive data as input, analyse and/or transform it, and send it on.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is to simply start sequentially each module with a well chosen bunch of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? Do you care about the absolute worst case, or do you care that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
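To illustrate the difference between these goals, here is a minimal sketch that times each call, then reports throughput and latency percentiles from the same run. The function names are mine, the percentile uses the simple nearest-rank method, and process_item is a hypothetical stand-in for one of your modules.

```python
import math
import time


def measure(fn, inputs):
    """Run fn on every input; return (per-call latencies, calls per second)."""
    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, len(latencies) / elapsed


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def process_item(n):
    """Hypothetical stand-in for real module work."""
    return sum(i * i for i in range(n))


latencies, throughput = measure(process_item, [10_000] * 200)
print(f"throughput: {throughput:.0f} calls/s")
for pct in (50, 95, 99):
    print(f"p{pct} latency: {percentile(latencies, pct) * 1000:.3f} ms")
print(f"worst case: {max(latencies) * 1000:.3f} ms")
```

Two machines can have the same throughput and very different 99th percentile or worst-case latency, which is why it matters to decide up front which number you are buying hardware for.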
Third, you do not say whether these modules are going to be CPU bound or I/O bound, whether they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put realistic load into the system. That means recording data from a production environment and putting the new hardware solution through this data. Getting this done is harder than it sounds. Often it means adding all kinds of measurement points to the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
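As a sketch of that comparison, the snippet below ranks offers by price per unit of throughput after discarding the ones that cannot keep up. The three offers and their figures are invented for illustration; in practice the tps values would come from running your benchmark on each vendor's loan machine.

```python
# Hypothetical offers: purchase price and measured transactions per second.
OFFERS = [
    {"name": "A", "price": 12000.0, "tps": 900.0},
    {"name": "B", "price": 7000.0, "tps": 600.0},
    {"name": "C", "price": 4000.0, "tps": 250.0},
]


def price_per_tps(offer):
    """Cost of each transaction-per-second of capacity."""
    return offer["price"] / offer["tps"]


def best_offer(offers, required_tps):
    """Cheapest offer per unit of throughput among those that meet the requirement."""
    adequate = [o for o in offers if o["tps"] >= required_tps]
    if not adequate:
        return None
    return min(adequate, key=price_per_tps)


for offer in OFFERS:
    print(f"{offer['name']}: {price_per_tps(offer):.2f} per tps")
choice = best_offer(OFFERS, required_tps=500.0)
print("Pick:", choice["name"] if choice else "none is adequate")
```

Here the fastest configuration is not the best value, and the cheapest one is ruled out because it does not reach the required throughput.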
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization, that doesn't cost too much, that gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some operations users (or processes) are doing with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
1) Which functions are most often used?
2) How much data is transferred?
3) Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
Create a top ten for 1+2 and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, there are microbenchmarks, where you try to evaluate a piece of code in isolation or how a system deals with a narrowly defined workload. Compare two sorting algorithms written in Java. Compare how fast two web browsers can each perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload. Compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and such like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
During the first run, CPU caches are getting filled with the necessary data. The same goes for disk caches. During the next few runs the VM continues to warm up, meaning the JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation. Plot a chart, look at it, and see how much the results vary. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, other processes starting some background task (like a virus scan), and the OS deciding to move the process to a different CPU core. If you have a NUMA architecture, the effect of such a move is even more marked.
In case of microbenchmarks, all of this is a problem. Kill what processes you can before you begin. Use a benchmarking library that can do some of it for you. Like and such like.
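The warm-up and statistics advice above can be sketched as a tiny harness. This is only an illustration of the structure (discard the first runs, then collect many samples and summarise them). It is written in Python for brevity, the function name and default counts are arbitrary, and it does nothing about GC, frequency scaling or core migration, which is what a proper benchmarking library handles for you.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=30):
    """Time fn repeatedly, discarding warm-up runs, and summarise the rest."""
    for _ in range(warmup):
        fn()  # results ignored: caches and any JIT are still warming up
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return {
        "runs": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
    }


def workload():
    """Hypothetical stand-in for the code under test."""
    return sorted(range(50_000, 0, -1))


result = benchmark(workload)
print(f"median {result['median'] * 1000:.3f} ms, "
      f"stdev {result['stdev'] * 1000:.3f} ms over {result['runs']} runs")
```

A large standard deviation relative to the median is a hint that something outside your code (background processes, GC, scheduling) is influencing the measurements.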
System benchmarking
In case of benchmarking a system under a realistic workload, these details do not really interest you and your problem is "only" to know what a realistic workload is, how to generate it and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long did a web page render) and these are I/O bound so the code gathering data does not slow down the system. (The page needs to be shipped to the user over the network, it does not matter if we also log a few numbers in the process).
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you the absolute time spent doing something; profiling gives you the relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred ms and save a stack trace) and the instrumentation slows everything down significantly.
