Considering the piece of C code below, I expected the CPU utilization to go up to 100%, as the processor would try to complete the (endless, in this case) job given to it. After running the executable for 5 minutes, I found the CPU to go up to a maximum of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; compiler: GCC 4.1.
int i = 10;
while(1) {
    i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I add a printf() inside the loop, the CPU hits 88%. I understand that in this case the processor also has to write to the standard output stream, hence the sharp rise in usage.
Has this got something to do with the amount of work assigned to the processor per unit of time?
Regards,
Ven.
You have a multi-core processor and your program is single-threaded, so it will drive only one core at full throttle. Why would you expect overall processor usage to go to 100% in that situation?
Run two copies of your program at the same time. They will use both cores of your "Core 2 Duo" CPU, and overall CPU usage will go to 100%.
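If you want a single process to saturate both cores instead, the same endless work can be split across two threads. Here is a minimal sketch of mine (not from the original question), assuming a POSIX system with pthreads and exactly two cores:

#include <pthread.h>

/* Hypothetical illustration: one endless busy loop per core. */
static void *spin(void *arg) {
    volatile int i = 10;              /* volatile, so the loop is not optimized away */
    while (1) {
        i = i * 5;                    /* same endless work as the original snippet */
    }
    return NULL;                      /* never reached */
}

int main(void) {
    pthread_t workers[2];             /* two threads: one per core of a Core 2 Duo */
    for (int t = 0; t < 2; t++)
        pthread_create(&workers[t], NULL, spin, NULL);
    for (int t = 0; t < 2; t++)
        pthread_join(workers[t], NULL);   /* never returns; both cores stay busy */
    return 0;
}

Compile with cc -pthread; with both threads spinning, overall CPU usage should reach 100%, just like running two copies of the program.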
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf() sends some characters to the terminal/screen. Sending, displaying, and updating that information is handled by code outside your executable, and is likely to be executed on another thread. Displaying a few characters does not need 100% of such a thread, though. That is why you see 100% on core 1 and 76% on core 2, which results in the overall CPU usage of 88% that you observe.
I know that there are tools like top and ps for measuring CPU usage, but the way they measure the CPU usage is by measuring how much time the idle task was not running. So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied. However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
tools like top and ps for measuring CPU usage, .. measure the CPU usage is by measuring how much time the idle task was not running.
No, they don't measure the idle task directly; they just read what the kernel thinks about its CPU usage via /proc/stat (try the vmstat 1 tool too). Did you check that system-wide user + system time is accounted only against the idle task? I think the kernel just exports some scheduler statistics: it records user/system state on rescheduling, both on the system timer tick and on blocking system calls (likely in one of the callers of cpuacct_charge, such as update_curr, which updates the current task's runtime statistics).
/proc/stat example:
cat /proc/stat
cpu 2255 34 2290 22625563 6290 127 456
and decode it using http://www.linuxhowtos.org/System/procstat.htm:
The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing different kinds of work. Time units are in USER_HZ or Jiffies (typically hundredths of a second).
The meanings of the columns are as follows, from left to right:
user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs
When we see jiffies here, it means the scheduler was used to obtain the numbers, not any measurement of the idle task (top doesn't even see that task, or any task with pid 0).
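To illustrate (this is my own sketch, assuming the /proc/stat layout quoted above), a top-like utilization figure can be derived by sampling the aggregate cpu line twice and comparing the idle column against the total:

#include <stdio.h>
#include <unistd.h>

/* Read the first seven fields of the aggregate "cpu" line:
 * user nice system idle iowait irq softirq (in jiffies). */
static int read_cpu(unsigned long long v[7])
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
    fclose(f);
    return (n == 7) ? 0 : -1;
}

int main(void)
{
    unsigned long long a[7], b[7], total = 0, idle;

    if (read_cpu(a))
        return 1;
    sleep(1);                 /* sample interval, like "vmstat 1" */
    if (read_cpu(b))
        return 1;

    idle = b[3] - a[3];       /* the "idle" column; whether iowait (field 5) also
                               * counts as idle is a matter of convention */
    for (int i = 0; i < 7; i++)
        total += b[i] - a[i];

    printf("busy: %.1f%%\n", total ? 100.0 * (total - idle) / total : 0.0);
    return 0;
}

Note that a task stalled on a cache miss still accumulates user or system jiffies in this scheme, which is exactly why such tools cannot report stalls as idle.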
So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied.
And basically (when there is no SMT, like Hyper-Threading on Intel CPUs), the CPU is considered occupied while your task has a pipeline stall due to a memory access (or is taking a wrong path with out-of-order execution). The OS can't run another task, because a task switch is more expensive than waiting out that one stall.
The SMT case is different, because there is hardware which either switches between two logical tasks on a single set of execution hardware, or even (in fine-grained SMT) mixes their instructions (micro-operations) into a single stream executing on the shared hardware. There are usually SMT statistics counters to check the actual mixing.
However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
The performance monitoring unit (PMU) may have useful events for this. For example, perf stat reports some of them (here on Sandy Bridge):
$ perf stat /bin/sleep 10
Performance counter stats for '/bin/sleep 10':
0,563759 task-clock # 0,000 CPUs utilized
1 context-switches # 0,002 M/sec
0 CPU-migrations # 0,000 M/sec
175 page-faults # 0,310 M/sec
888 767 cycles # 1,577 GHz
568 603 stalled-cycles-frontend # 63,98% frontend cycles idle
445 567 stalled-cycles-backend # 50,13% backend cycles idle
593 103 instructions # 0,67 insns per cycle
# 0,96 stalled cycles per insn
115 709 branches # 205,246 M/sec
7 699 branch-misses # 6,65% of all branches
10,000933734 seconds time elapsed
So, it says that about half a jiffy (task-clock) was used by sleep 10. That is too little to be accounted for in classic rusage, and /usr/bin/time got 0 jiffies as the task's CPU usage (user + system):
$ /usr/bin/time sleep 10
0.00user 0.00system 0:10.00elapsed 0%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps
Then perf measures (counts, with the help of the PMU) the real cycles and real instructions executed by the task (and by the kernel on behalf of the task): the cycles and instructions lines. sleep used 888k cycles but only 593k useful instructions were completed, so the mean IPC was 0.6-0.7 (30-40% stalls). Around 300k cycles were lost, and on Sandy Bridge perf reports where they were lost: the stalled-cycles-* events for the frontend (the decoder: the CPU does not know what to execute, due to a branch miss or because the code was not prefetched into L1I) and for the backend (it cannot execute because an instruction needs data from memory that is not available at the right time: a memory stall).
Why do we see more stalled cycles inside the CPU than the roughly 300k cycles in which no instruction was executed? Because modern processors are superscalar and out-of-order: they can start executing several instructions on every CPU clock tick, and even reorder them. If you want to see execution-port utilization, try ocperf (a perf wrapper) from Andi Kleen's pmu-tools and some Intel manuals about their PMU counters. There is also the toplev.py script to "identify the micro-architectural bottleneck for a workload" without selecting Intel events by hand.
Restarting node B in a three-node cluster always ends with the following warning message on node B:
Back pressure is active for Index work pool with total work queue size 764 and average processing time 9242
The queue size is constantly increasing.
I'm running DSE 4.5 Solr on Ubuntu 12.04 machines with 20 GB RAM and 4 cores per server. Monitoring the system resources, everything seems "normal": 12 GB RAM usage, very low CPU usage.
OpsCenter is showing me, for this node:
a heap usage of 1.5 GB, a very low load (0.32), and a compaction task of a keyspace that has been at 100% for several hours now.
On node A, a nodetool -pr run always shows the message:
INFO [HintedHandoff:2] 2015-01-05 17:24:19,570 HintedHandOffManager.java (line 466) Timed out replaying hints to /10.0.106.200; aborting (0 delivered)
where 10.0.106.200 is node B.
Any idea what could be causing this warning?
The backpressure warning is just an indication of overloading, as the indexing process isn't able to keep up with the insertion rate according to your configured threshold; generally, it is not a serious issue unless it happens for a prolonged period of time, in which case you might want to scale either horizontally or vertically, or increase the backpressure threshold if you already have powerful machines.
As for the other problems, they look a bit more serious, but it is difficult to diagnose them with so little information.
I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. This problem occurs only in production; on dev servers everything is lightning fast.
upd9: There were two questions about this issue:
How to fix it? - Nominal Animal suggested to sync and drop everything, and it really helps: sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches' works. upd12: But indeed, sync alone is enough.
Why is this happening? - It is still open for me. I do understand that flushing dirty pages to disk consumes kernel CPU and IO time, and that is normal. But what is strange is why even a single-threaded application written in C can load ALL cores to 100% in kernel space.
Based on upd10 and upd11 below, I have an idea that echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation.
It should be enough to run sync before starting the memory-consuming application.
I will probably try this tomorrow in production and post the results here.
upd10: Lots of FS cache pages case:
I executed cat 10GB.file > /dev/null, then
ran sync to be sure there were no dirty pages (cat /proc/meminfo | grep ^Dirty showed 184 kB).
Checking cat /proc/meminfo | grep ^Cached I got: 4 GB cached.
Running int main(char**) I got normal performance (around 50 ms to initialize 32 MB of allocated data).
Cached memory was reduced to 900 MB.
Test summary: I think it is no problem for Linux to reclaim pages used as FS cache into allocated memory.
upd11: LOTS of dirty pages case.
I ran my HowMongoDdWorks example with the read part commented out, and after some time
/proc/meminfo said that 2.8 GB was Dirty and 3.6 GB was Cached.
I stopped HowMongoDdWorks and ran my int main(char**).
Here is part of the result:
init 15, time 0.00s
x 0 [try 1/part 0] time 1.11s
x 1 [try 2/part 0] time 0.04s
x 0 [try 1/part 1] time 1.04s
x 1 [try 2/part 1] time 0.05s
x 0 [try 1/part 2] time 0.42s
x 1 [try 2/part 2] time 0.04s
Summary of this test: lots of dirty pages significantly slow down the first access to allocated memory (to be fair, this starts to happen only when total application memory becomes comparable to the whole OS memory; i.e. if you have 8 of 16 GB free, it is no problem to allocate 1 GB; the slowdown starts from 3 GB or so).
Now I have managed to reproduce this situation in my dev environment, so here are new details.
Dev machine configuration:
Linux 2.6.32-220.13.1.el6.x86_64 - Scientific Linux release 6.1 (Carbon)
RAM: 15.55 GB
CPU: 1 x Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz (4 threads) (physical)
It's 99.9% certain that the problem is caused by a large amount of dirty pages in the FS cache. Here is the application which creates lots of dirty pages:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 * Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks {
    public static void main(String[] args) throws IOException {
        final long length = 10L*1024L*1024L*1024L;
        final int pageSize = 4*1024;
        final int lengthPages = (int) (length/pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file","rw");
        raf.setLength(length);
        int written = 0;
        int readed = 0;
        System.out.println("Test started");
        while(true){
            { //write.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { //read.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                readed++;
            }
            if (written % 1024==0 || readed%1024==0){
                System.out.printf("W %10d R %10d pages\n", written, readed);
            }
        }
    }
}
And here is the test application, which causes HIGH (up to 100% on all cores) CPU load in kernel space (it is the same as below, but I will copy it once again).
#include<stdlib.h>
#include<stdio.h>
#include<time.h>

int main(char** argv){
    int last = clock(); //remember the time
    for(int i=0;i<16;i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4=size/4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k=0;k<2;k++){ //initialize allocated memory twice
            for(int j=0;j<size4;j++){
                buffer[j]=k; //memory initialization (if I skip this step my test ends in 0.000s)
            }
            //printing stats
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
While the previous HowMongoDdWorks program is running, int main(char** argv) shows results like this:
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
I keep everything below this line only for historical purposes.
upd1: both the dev and production systems are big enough for this test.
upd7: it is not paging; at least I have not seen any storage IO activity during the problem time.
dev: ~4 cores, 16 GB RAM, ~8 GB free
production: ~12 cores, 24 GB RAM, ~16 GB free (8 to 10 GB of it is under FS cache, but it makes no difference; same results even if all 16 GB is totally free). This machine also carries some CPU load, but not too high, ~10%.
upd8 (ref): New test case and a potential explanation; see at the end.
Here is my test case (I also tested Java and Python, but C should be the clearest):
#include<stdlib.h>
#include<stdio.h>
#include<time.h>

int main(char** argv){
    int last = clock(); //remember the time
    for(int i=0;i<16;i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4=size/4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k=0;k<2;k++){ //initialize allocated memory twice
            for(int j=0;j<size4;j++){
                buffer[j]=k; //memory initialization (if I skip this step my test ends in 0.000s)
            }
            //printing stats
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
The output on the dev machine (partial):
x [1] 0.13 -- first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant.
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...
The output on the production machine (partial):
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
While running this test on the development machine, CPU usage barely rises above idle; all cores show less than 5% usage in htop.
But when running this test on the production machine, I see up to 100% CPU usage by ALL cores (on average the load rises up to 50% on the 12-core machine), and it is all kernel time.
upd2: all machines have the same CentOS Linux 2.6 installed; I work with them over ssh.
upd3: A: It's unlikely to be swapping; I haven't seen any disk activity during my test, and plenty of RAM is also free. (Also, the description is updated.) – Dmitry 9 mins ago
upd4: htop shows HIGH CPU utilization by the kernel, up to 100% utilization on all cores (on prod).
upd5: does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, the only thing that helps is stopping everything else before starting the new program (which is nonsense).
I have two questions:
Why is this happening?
How to fix it?
upd8: Improved test and explanation.
#include<stdlib.h>
#include<stdio.h>
#include<time.h>

int main(char** argv){
    const int partition = 8;
    int last = clock();
    for(int i=0;i<16;i++){
        int size = 256 * 1024 * 1024;
        int size4=size/4;
        int* buffer = malloc(size);
        buffer[0]=123;
        printf("init %d, time %.2fs\n",i, (clock()-last)/(double)CLOCKS_PER_SEC);
        last = clock();
        for(int p=0;p<partition;p++){
            for(int k=0;k<2;k++){
                for(int j=p*size4/partition;j<(p+1)*size4/partition;j++){
                    buffer[j]=k;
                }
                printf("x [try %d/part %d] time %.2fs\n",k+1, p, (clock()-last)/(double)CLOCKS_PER_SEC);
                last = clock();
            }
        }
    }
    return 0;
}
And the result looks like this:
init 15, time 0.00s -- the malloc call takes nothing.
x [try 1/part 0] time 0.07s -- usually the first try to fill a buffer part with values is fast enough.
x [try 2/part 0] time 0.04s -- the second try to fill a buffer part with values is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill a part of the allocated buffer with values.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...
Facts I learned from this test:
Memory allocation itself is fast.
First access to allocated memory is fast (so it is not a lazy buffer allocation problem).
I split the allocated buffer into parts (8 in the test).
Then I fill each buffer part with the value 0, and then with the value 1, printing the consumed time.
The second fill of each buffer part is always fast.
BUT the first fill of a buffer part is always a bit slower than the second fill (I believe some extra work is done by the kernel on first page access).
Sometimes it takes SIGNIFICANTLY longer to fill a buffer part with values the first time.
I tried the suggested answer and it seems to have helped. I will recheck and post results here later.
It looks like Linux maps the allocated pages onto dirty file-system cache pages, and it takes a lot of time to flush those pages to disk one by one. A full sync, however, works fast and eliminates the problem.
Run
sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'
on your dev machine. It is a safe, nondestructive way to make sure your caches are empty. (You will not lose any data by running the above command, even if you happen to save or write to disk at the exact same time. It really is safe.)
Then, make sure you don't have any Java stuff running, and re-run the above command just to be sure. You can check whether any Java is running, for example with
ps axu | sed -ne '/ sed -ne /d; /java/p'
It should output nothing. If it does, close your Java stuff first.
Now, re-run your application test. Does the same slowdown now occur on your dev machine, too?
If you care to leave the comment either way, Dmitry, I'd be happy to explore the issue further.
Edited to add: I suspect that the slowdown does occur, and is due to the large startup latency incurred by Java itself. It is a very common issue, and basically built-in to Java, a result of its architecture. For larger applications, startup latency is often a significant fraction of a second, no matter how fast the machine, simply because Java has to load and prepare the classes (mostly serially, too, so adding cores will not help).
In other words, I believe the blame should fall on Java, not Linux; quite the opposite, since Linux manages to mitigate the latency on your development machine through kernel-level caching -- and that only because you keep running those Java components practically all the time, so the kernel knows to cache them.
Edit 2: It would be very useful to see which files your Java environment accesses when your application is started. You can do this with strace:
strace -f -o trace.log -q -tt -T -e trace=open COMMAND...
which creates file trace.log containing the open() syscalls done by any of the processes started by COMMAND.... To save the output to trace.PID for each process the COMMAND... starts, use
strace -f -o trace -ff -q -tt -T -e trace=open COMMAND...
Comparing the outputs on your dev and prod installations will tell you if they are truly equivalent. One of them may have extra or missing libraries, affecting the startup time.
If an installation is old and the system partition is reasonably full, it is possible that those files have become fragmented, causing the kernel to spend more time waiting for I/O to complete. (Note that the amount of I/O stays the same; only the time it takes to complete increases if the files are fragmented.) You can use the command
LANG=C LC_ALL=C sed -ne 's|^[^"]* open("\(.*\)", O[^"]*$|\1|p' trace.* \
| LANG=C LC_ALL=C xargs -r -d '\n' filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
to check how fragmented the files used by your application are; it reports how many files use only one extent, and how many use more than one. Note that it does not include the original executable (COMMAND...), only the files it accesses.
If you just want to get the fragmentation statistics for files accessed by a single command, you can use
LANG=C LC_ALL=C strace -f -q -tt -T -e trace=open COMMAND... 2>&1 \
| LANG=C LC_ALL=C sed -ne 's|^[0-9:.]* open("\(.*\)", O[^"]*$|\1|p' \
| LANG=C LC_ALL=C xargs -r filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
If the problem is not due to caching, then I think it is most likely that the two installations are not truly equivalent. If they are, then I'd check the fragmentation. After that, I'd do a full trace (omitting the -e trace=open) on both environments to see where exactly the differences are.
I do believe I now understand your problem/situation.
On your prod environment, kernel page cache is mostly dirty, i.e. most cached stuff is stuff that is going to be written to disk.
When your application allocates new pages, the kernel only sets up the page mappings; it does not actually hand out physical RAM immediately. That only happens on the first access to each page.
On the first access, the kernel first locates a free page -- typically, a page that contains "clean" cached data, i.e. something read from the disk but not modified. Then, it clears it to zeros, to avoid information leaks between processes. (When using the C library allocation facilities like malloc() etc., instead of the direct mmap() family of functions, the library may use/reuse parts of the mapping. Although the kernel does clear the pages to zeros, the library may "dirty" them. Using mmap() to get anonymous pages you do get them zeroed out.)
If the kernel does not have suitable clean pages at hand, it must flush some of the oldest dirty pages to disk first. (There are processes inside the kernel that flush pages to disk and mark them clean, but if the server load is such that pages are continuously dirtied, it is usually desirable to have mostly dirty pages instead of mostly clean pages -- the server gets more work done that way. Unfortunately, it also means an increase in first-page-access latency, which is what you have now encountered.)
Each page is sysconf(_SC_PAGESIZE) bytes long, aligned. In other words, pointer p points to the start of a page if and only if ((long)p % sysconf(_SC_PAGESIZE)) == 0. Most kernels, I believe, do actually populate groups of pages in most cases instead of individual pages, thus increasing the latency of the first access (to each group of pages).
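To make those last two points concrete, here is a small sketch of mine (assuming Linux and glibc): an anonymous private mmap() returns page-aligned, zero-filled memory, and the physical pages are populated only when each page is first touched:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);               /* page size in bytes */
    size_t len = 256UL * 1024UL * 1024UL;            /* 256 MB, as in the test program */

    /* Anonymous private mapping: the kernel promises zero-filled pages. */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The mapping starts on a page boundary, and reads back as zero. */
    printf("page size %ld bytes, page aligned: %s, first byte: %u\n",
           page, ((long)p % page == 0) ? "yes" : "no", p[0]);

    /* Physical pages are populated only here, on the first touch of each page;
     * this is where the first-access latency described above is paid. */
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 1;

    munmap(p, len);
    return 0;
}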
Finally, there may be some compiler optimization that plays havoc with your benchmarking. I recommend you write a separate source file for a benchmarking main(), and the actual work done on each iteration in a separate file. Compile them separately, and just link them together, to make sure the compiler does not rearrange the time functions wrt. the actual work done. Basically, in benchmark.c:
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdio.h>

/* in work.c, adjust as needed */
void work_init(void);      /* Optional, allocations etc. */
void work(long iteration); /* Completely up to you, including parameters */
void work_done(void);      /* Optional, deallocations etc. */

#define PRIMING  0
#define REPEATS  100

int main(void)
{
    double          wall_seconds[REPEATS];
    struct timespec wall_start, wall_stop;
    long            iteration;

    work_init();

    /* Priming: do you want caches hot? */
    for (iteration = 0L; iteration < PRIMING; iteration++)
        work(iteration);

    /* Timed iterations */
    for (iteration = 0L; iteration < REPEATS; iteration++) {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        work(iteration);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
        wall_seconds[iteration] = (double)(wall_stop.tv_sec - wall_start.tv_sec)
                                + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    }

    work_done();

    /* TODO: wall_seconds[0] is the first iteration.
     *       Comparing to successive iterations (assuming REPEATS > 0)
     *       tells you about the initial latency.
     */

    /* TODO: Sort wall_seconds, for easier statistics.
     *       Most reliable value is the median, with half of the
     *       values larger and half smaller.
     *       Personally, I like to discard first and last 15.85%
     *       of the results, to get a "one-sigma confidence" interval.
     */

    return 0;
}
with the actual array allocation, deallocation, and filling (per repeat loop) done in the work() functions defined in work.c.
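For example, a possible work.c matching those prototypes might look like this (the 256 MB buffer mirrors your test; everything else is just my assumption):

/* work.c -- example implementation of the work functions declared in benchmark.c above. */
#include <stdlib.h>

#define WORK_BYTES (256UL * 1024UL * 1024UL)   /* 256 MB per iteration */

void work_init(void)
{
    /* Nothing to prepare in this sketch. */
}

void work(long iteration)
{
    /* Allocate, fill, and free on every repeat, so each iteration pays the
     * full cost of faulting in fresh pages (the effect being benchmarked). */
    size_t count = WORK_BYTES / sizeof (int);
    int *buffer = malloc(WORK_BYTES);
    if (!buffer)
        return;
    for (size_t j = 0; j < count; j++)
        buffer[j] = (int)iteration;
    free(buffer);
}

void work_done(void)
{
    /* Nothing to release in this sketch. */
}

Compile the two files separately and link them together (for example cc -O2 -c benchmark.c work.c && cc benchmark.o work.o -o benchmark, adding -lrt on older glibc for clock_gettime), so the compiler cannot reorder the timing calls around the work.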
When the kernel runs out of available clean pages, it must flush dirty pages to disk. Flushing a lot of dirty pages to disk looks like a high CPU load, because most kernel-side operations require one or more pages (temporarily) to work. Essentially, the kernel is waiting for I/O to complete, even though the userspace application called a non-I/O-related kernel function.
If you run a microbenchmark in parallel, say a program that just continuously dirties a very large mapping over and over and measures the CPU time (__builtin_ia32_rdtsc() if using GCC on x86 or x86-64) without calling any syscalls, you should see that it gets plenty of CPU time even when the kernel is seemingly eating "all" of the CPU time. Only when a process calls a kernel function (syscall) that internally requires some memory will that call "block", stuck waiting in the kernel for the page flushing to yield new pages.
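A minimal sketch of such a microbenchmark (my own; it assumes GCC on x86-64 Linux, and the mapping size and pass count are arbitrary) could look like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MAP_BYTES (1UL << 30)                    /* 1 GB anonymous mapping; adjust to taste */

int main(void)
{
    unsigned char *p = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    for (int pass = 0; pass < 100; pass++) {
        unsigned long long start = __builtin_ia32_rdtsc();
        for (size_t off = 0; off < MAP_BYTES; off += 4096)
            p[off] = (unsigned char)pass;        /* dirty one byte per page; no syscalls in here */
        unsigned long long cycles = __builtin_ia32_rdtsc() - start;
        printf("pass %3d: %llu cycles\n", pass, cycles);  /* the only syscall is this write, outside the timed region */
    }
    return 0;
}

If the above holds, the per-pass cycle counts should stay roughly steady even while top shows the kernel consuming nearly all CPU flushing dirty pages.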
When running benchmarks, it is usually enough to just run sudo sh -c 'sync ; echo 3 >/proc/sys/vm/drop_caches ; sync' a couple of times before running the benchmark, to make sure there should be no undue memory pressure during the benchmark. I never use it in a production environment. (While it is safe to run, i.e. does not lose data, it is like killing mosquitoes using a sledgehammer: the wrong tool for the job.)
When you find in a production environment that your latencies start to grow too large due to kernel flushing dirty pages -- which it does at maximum device speed I believe, possibly causing a hiccup in application I/O speeds too --, you can tune the kernel dirty page flushing mechanism. Basically, you can tell the kernel to flush dirty pages much sooner to disk, and make sure there won't be that many dirty pages at any point in time (if possible).
Gregory Smith has written about the theory and tuning of the flushing mechanism here. In short, /proc/sys/vm/ contains kernel tunables you can modify. They are reset to defaults at boot, but you can easily write a simple init script to echo the desired values to the files at boot. If the processes running on the production machine do heavy I/O, you might also look at the filesystem tunables. At minimum, you should mount your filesystems (see /etc/fstab) using the relatime flag, so that file access times are only updated for the first access after the file has been modified or its status changed.
Personally, I also use a low-latency pre-emptible kernel with 1000 Hz timer for multimedia workstations (and for multimedia servers if I had any right now). Such kernels run user processes in shorter slices, and usually provide much better latencies, although maximum computation capability is slightly less. If your production services are latency-sensitive, I do recommend switching your production servers to such kernels.
Many distributions do provide such kernels already, but I find it much simpler to recompile distro kernels instead, or even switch to kernel.org kernels. The procedure is simple: you need kernel development and tools installed (on Debian variants, make-kpkg is VERY useful). To upgrade a kernel, you obtain new sources, configure the kernel (usually using current configuration as the basis -- make oldconfig), build the new kernel, and install the packages, before rebooting. Most people do find that just upgrading the hardware is more cost effective than recompiling distro kernels, but I find recompiling kernels very effortless myself. I don't automatically reboot for kernel upgrades anyway, so adding a simple step (triggered by running a single script) prior to rebooting is not too much effort for me.