I am trying to run a benchmark on a Mac with OS X 10.9.5, a 2 GHz Intel Core i7 processor, and 8 GB of 1600 MHz DDR3 memory. The workload reads in sections of a large file that was mmapped into the address space. I am interested in how performance is affected if the amount of buffer cache available to my process is limited. My colleague and I were not aware of a straightforward way of doing this (e.g. a direct system call), so we created a memory-hog program that runs in the background and constantly touches every page allocated to it, to force the kernel to dedicate a certain chunk of physical memory to the hogger. Our basic problem is that we found the benchmark to run significantly faster with the hogger running in the background, and we do not know how to interpret that result.
The basic gist of the memhog code is (there is a bit of error checking in our actual code):
char *mem = malloc(size);                          // Allocate size bytes
for (size_t i = 0; ; i = (i + PAGE_SIZE) % size) { // Infinite loop, stepping one page at a time
    mem[i]++;                                      // Touch the page to keep it resident
}
This program was compiled with gcc with no optimizations enabled. By adding print statements to the loop, we verified that every iteration of the loop (except the first one, due to all the page faults that must occur the first time we touch the memory) completes quickly, which suggests that about size bytes of physical memory are actually held by this process.
The oddness occurs when we try running our benchmark while the memory hog is running. Without the memory hog, the benchmark runs in about 16 seconds. When we do run the memory hog in the background, however, the benchmark speeds up significantly! Namely, we found that when we are hogging 1 MB of memory (basically hogging almost nothing at all, just having a useless process spin in the background), the benchmark finishes in about 13 seconds consistently. Hogging more memory (4 GB), it takes 14-15 seconds to finish, and hogging 6 GB, the benchmark takes roughly 16 seconds to finish. The benchmark itself uses about 4.3 GB of memory (as measured with getrusage, by subtracting the ru_maxrss value before loading all the data from the value after loading it).
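A minimal sketch of that ru_maxrss measurement (the surrounding structure here is illustrative; note that OS X reports ru_maxrss in bytes, while Linux reports kilobytes):

#include <stdio.h>
#include <sys/resource.h>

static long max_rss(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;          /* bytes on OS X, kilobytes on Linux */
}

int main(void)
{
    long before = max_rss();
    /* ... mmap the input file and read the benchmark data here ... */
    long after = max_rss();
    printf("benchmark data: %ld bytes of additional resident memory\n", after - before);
    return 0;
}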
We found this very puzzling; while it would not be surprising if performance merely failed to deteriorate with memhog in the background, it is odd that performance improves by so much when a useless process is spinning in the background. Do you have any insights into this behavior? Thanks!
(We also have vm_stat data that we can post if it helps with the question; we didn't include it here as it is rather long.)
It's an odd title, but bear with me. Some time ago, I finished writing a program in C with Visual Studio Community 2017 that made significant use of OpenSSL's secp256k1 implementation. It was 10x faster than an equivalent program in Java, so I was happy. However, today I decided to upgrade it to use the bitcoin project's optimized libsecp256k1 library. It worked out great and I got a further 7x performance boost! Changing which library is used to do the EC multiplications is the ONLY thing I changed about the software: it still reads the same input files, computes the same things, and outputs the same results.
Input files consist of 5 million initial values, and I break that up into chunks of 50k. I then use pthreads with 6 threads to compute each chunk of 50k, save the results, and move on to the next 50k until all 5 million are done (I've also tried OpenMP with 6 threads). For some reason, when running this program on my Windows 10 4-core laptop, after exactly 16 chunks the CPU utilization drops from 75% down to 65%, after another 10 chunks down to 55%, and so on until it's only using about 25% of my CPU by the time all 5 million inputs are calculated.
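The per-chunk threading is essentially the following (a sketch only: input_t, result_t, and compute_one() are placeholders for the real secp256k1 types and the EC multiplication):

#include <pthread.h>
#include <string.h>

#define CHUNK_SIZE  50000
#define NUM_THREADS     6

typedef struct { unsigned char data[32]; } input_t;   /* placeholder */
typedef struct { unsigned char data[64]; } result_t;  /* placeholder */

static result_t compute_one(const input_t *in)
{
    result_t r;                       /* stand-in for the EC multiplication */
    memset(&r, 0, sizeof r);
    memcpy(r.data, in->data, sizeof in->data);
    return r;
}

typedef struct { const input_t *in; result_t *out; size_t begin, end; } job_t;

static void *worker(void *arg)
{
    job_t *job = arg;
    for (size_t i = job->begin; i < job->end; i++)
        job->out[i] = compute_one(&job->in[i]);
    return NULL;
}

/* For each 50k chunk: split it across 6 worker threads, join them,
   save the results, then move on to the next chunk. */
static void run_chunk(const input_t *in, result_t *out, size_t base)
{
    pthread_t tid[NUM_THREADS];
    job_t job[NUM_THREADS];
    size_t per = CHUNK_SIZE / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        job[t] = (job_t){ in, out, base + (size_t)t * per,
                          (t == NUM_THREADS - 1) ? base + CHUNK_SIZE
                                                 : base + (size_t)(t + 1) * per };
        pthread_create(&tid[t], NULL, worker, &job[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}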
The thread count (7: 1 main thread, 6 worker threads) remains the same, and the memory usage never goes over 1.5 GB (the laptop has 16 GB), yet the CPU utilization drops as if I'm dropping threads. My temperatures never go over 83 °C, and the all-core turbo stays at the maximum 3.4 GHz (base 2.8 GHz), so there is no thermal throttling happening here. The laptop is always plugged in and its power settings are set to maximum performance. There are no other CPU- or memory-intensive programs running besides this one.
Even stranger is that this problem doesn't happen on either of my two Windows 7 desktops: they both hold steady CPU utilization throughout all 5 million calculations. The old OpenSSL implementation always maintained the expected CPU utilization on all computers, so something is different, yet it only affects the Windows 10 laptop.
I'm sorry I don't have code to demonstrate this, and maybe another forum would be more appropriate, but since it's my code I thought I'd ask here. Anyone have any ideas what might be causing this or how to fix it?
Although it may sound like an ffmpeg-related issue, I believe it is not.
We have a system that processes live TV feeds by using ffmpeg's filters.
Step 1: We capture frames from a video capture card and copy them into our own data structures (decklink_cb).
Step 2: We copy the frame into ffmpeg's native structures (put_frame).
Step 3: We run the filters.
Step 4: We copy the resulting frame from ffmpeg's native structures back into our own structures (get_frame).
A single frame uses 4.15 MB of dynamically allocated memory.
Frame buffers are allocated using _aligned_malloc().
The server is a 2x Intel Xeon E5-2697 v4 box running Windows Server 2016 with 64 GB of memory.
There are 2 NUMA nodes, with one process assigned to each node.
There are 2 channels per process, for a total of 4 channels per server, each at 25 fps.
A single default process heap is used for all memory allocations.
In addition to the frame buffers, various other dynamic memory allocations take place for multiple purposes.
Each process uses 2 GB of physical memory.
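A minimal sketch of the per-frame copy in step 2 (put_frame); the frame-size constant and the 64-byte alignment are illustrative, and the real structures are simplified away:

#include <malloc.h>    /* _aligned_malloc / _aligned_free (MSVC CRT) */
#include <string.h>

#define FRAME_BYTES 4150000u   /* ~4.15 MB per frame (illustrative) */

static unsigned char *put_frame(const unsigned char *src)
{
    /* One aligned buffer per frame; released later with _aligned_free(). */
    unsigned char *dst = (unsigned char *)_aligned_malloc(FRAME_BYTES, 64);
    if (dst != NULL)
        memcpy(dst, src, FRAME_BYTES);   /* the copy into ffmpeg-side storage */
    return dst;
}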
When we run the system, everything works fine. A flat CPU load is observed.
After a while (generally a couple of hours), we start seeing a sawtooth pattern in the CPU load:
When we investigate using Intel's VTune Amplifier 2018, it shows that the memcpy() in step 2 consumes a lot of CPU time during those high-CPU periods.
Below is what we see from the VTune 2018 hotspot analysis:
**LOW CPU PERIOD (a total of 8.13 sec)**
SleepConditionVariableCS: 14.167 sec CPU time
WaitForSingleObjectEx: 14.080 sec CPU time
memcpy: 3.443 sec CPU time, with the following decomposition:
-- get_frame (step 4) -- 2.568
-- put_frame (step 2) -- 0.740
-- decklink_cb (step 1) -- 0.037
**HIGH CPU PERIOD (a total of 8.13 sec)**
memcpy: 16.812 sec CPU time, with the following decomposition:
-- put_frame (step 2) -- 10.429
-- get_frame (step 4) -- 3.692
-- decklink_cb (step 1) -- 2.236
SleepConditionVariableCS: 14.765 sec CPU time
WaitForSingleObjectEx: 13.928 sec CPU time
_aligned_free(): 3.532 sec CPU time
Below is the graph of the time it takes to perform the memcpy() operation in step 2.
Y-axis is the time in milliseconds.
Below is the graph of total processing time for a single frame.
Y-axis is the time in milliseconds.
"0" readings are frame drops due to delay and should be neglected.
When the CPU load gets higher and memcpy() starts to take longer, the handle count of our process decreases.
We logged the addresses returned by _aligned_malloc() in step 2.
This is the buffer location in memory that we copy the 4.15 MB of frame data to.
For low CPU periods, the addresses returned by _aligned_malloc() are very close to each other, i.e. the difference between the addresses returned by two consecutive allocations tends to be small.
For high CPU periods, the addresses returned by _aligned_malloc() span a very large range.
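A minimal sketch of that logging (simplified; the alignment value is illustrative, not our exact code):

#include <malloc.h>
#include <stdint.h>
#include <stdio.h>

static void *alloc_frame_logged(size_t bytes)
{
    static uintptr_t prev = 0;
    void *p = _aligned_malloc(bytes, 64);
    /* Log each address and its distance from the previous allocation:
       small, stable deltas correspond to the low CPU periods, widely
       scattered addresses to the high CPU periods. */
    printf("%p delta=%lld\n", p, prev ? (long long)((uintptr_t)p - prev) : 0LL);
    prev = (uintptr_t)p;
    return p;
}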
The sawtooth pattern we see in the total CPU load seems to be caused by the memcpy() operation in step 2.
During the high CPU periods, the buffer addresses returned by _aligned_malloc() seem to have poor locality compared to the buffer addresses returned during low CPU periods.
We believe the operating system starts doing something extra when we call memcpy(). Some additional work during page faults? Some cleanup? Is this a cache issue? We don't know.
Can anybody comment on the reason for this behavior and a possible solution?
I am running word2phrase.c on a very large (45 GB) training set. My PC has 16 GB of physical RAM and 4 GB of swap. I left it training overnight (for the second time, to be honest) and came back in the morning to see it had been "killed" without further explanation. I sat and watched it die when my RAM ran out.
I set the following in my /etc/sysctl.conf:
vm.oom-kill = 0
vm.overcommit_memory = 2
The actual source code does not appear to write the data out to a file, but rather keeps it all in memory, which is what is creating the issue.
Does the OOM killer consider total memory (RAM + swap)? For example, if I increase my swap to 32 GB, will this stop happening?
Can I force this process to use swap instead of physical RAM, at the expense of slower performance?
Q: Does the OOM killer consider total memory (RAM + swap)?
Yes.
Q: For example, if I increase my swap to 32 GB, will this stop happening?
Yes, if RAM and swap space combined (48 GB) are enough for the process.
Q: Can I force this process to use swap instead of physical RAM, at the expense of slower performance?
This is managed automatically by the operating system; all you have to do is increase the swap space.
To answer the first question: yes.
Second question:
Can I force this process to use swap instead of physical RAM?
Linux decides how the process runs and allocates memory for it as it sees fit. When the threshold is reached, Linux will start using the swap space.
Increasing the swap space may work in this case. Then again, I do not know how well Linux will cope with such a large swap; bear in mind that this could decrease performance dramatically.
The best alternative is to split the 45 GB training set into smaller chunks.
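For example, a minimal sketch of splitting the training text into chunks by line count (the file names and the 1,000,000-line chunk size are arbitrary):

#include <stdio.h>

int main(void)
{
    FILE *in = fopen("train.txt", "r");          /* the 45 GB training set */
    if (!in)
        return 1;
    char line[1 << 16];
    long long lines = 0;
    int part = 0;
    FILE *out = NULL;
    while (fgets(line, sizeof line, in)) {
        if (lines % 1000000 == 0) {              /* start a new chunk every 1M lines */
            if (out)
                fclose(out);
            char name[64];
            snprintf(name, sizeof name, "train.part%03d.txt", part++);
            out = fopen(name, "w");
            if (!out)
                return 1;
        }
        fputs(line, out);
        lines++;
    }
    if (out)
        fclose(out);
    fclose(in);
    return 0;
}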
My program allocates a chunk of 16 MB of memory and proceeds with a CPU-bound computation (A* search) that uses only that pre-allocated memory. When I run perf stat -e cache-references,cache-misses on this program, I sometimes get significantly different cache-miss ratios (e.g. 19% vs. 20%) between executions, despite the fact that the memory access pattern of these executions is identical.
I tried to warm up the cache in two ways:
Putting the main computation inside a loop.
Running various loops that access the allocated memory in various patterns.
However, I still do not get better consistency. After much effort (described in https://askubuntu.com/questions/582356/can-i-get-a-100-dedicated-cpu-to-my-critical-process and "How can I explain a slower execution when perf stat does not give a clue?"), my program suffers very few context switches, and consistency of cache behavior seems to be the only thing that separates me from getting consistent timings.
EDIT 1: One more relevant detail. When I put the computation in a loop and measure the time each iteration takes, the timings of the different iterations within the same execution are extremely consistent (within 0.02 seconds for a computation that takes about 6 seconds). However, a different execution may show a timing that differs by as much as 0.3 seconds, while again the iterations within that new execution are consistent with each other.
EDIT 2: I also forgot to mention that the 16 MB chunk is locked in memory.
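A minimal sketch of the setup described above (the warm-up pattern and the number of passes are illustrative):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_BYTES (16u * 1024u * 1024u)   /* the 16 MB working memory */

int main(void)
{
    unsigned char *pool = malloc(POOL_BYTES);
    if (pool == NULL)
        return 1;
    mlock(pool, POOL_BYTES);               /* lock the chunk in memory */

    for (int pass = 0; pass < 4; pass++)   /* warm-up passes over the whole block */
        memset(pool, pass, POOL_BYTES);

    /* ... run the A* computation over `pool` in a loop and time each iteration ... */

    munlock(pool, POOL_BYTES);
    free(pool);
    return 0;
}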
I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect a significant disturbance to be noticeable.
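A sketch of that kind of disturbance process (the block size, thread count, and PRNG are arbitrary choices, not the exact code used):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_BYTES (1ull << 30)   /* 1 GiB, much larger than the caches */
#define NUM_THREADS 4

static unsigned char *block;
static volatile unsigned char sink;   /* keeps the reads from being optimized away */

static void *reader(void *arg)
{
    uint64_t x = (uint64_t)(uintptr_t)arg | 1;    /* per-thread xorshift64 state */
    for (;;) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        sink = block[x % BLOCK_BYTES];            /* random, cache-unfriendly read */
    }
    return NULL;
}

int main(void)
{
    block = malloc(BLOCK_BYTES);
    if (block == NULL)
        return 1;
    memset(block, 1, BLOCK_BYTES);   /* back the block with real pages first */

    pthread_t tid[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, reader, (void *)(uintptr_t)(t + 1));
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);   /* readers run until the process is killed */
    return 0;
}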
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple: just use memcpy(). Copy several times the amount that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
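A minimal sketch of that approach (the buffer size is an arbitrary value chosen to be far larger than the last-level cache):

#include <stdlib.h>
#include <string.h>

#define BUF_BYTES (256u * 1024u * 1024u)   /* many times the last-level cache size */

int main(void)
{
    unsigned char *src = malloc(BUF_BYTES);
    unsigned char *dst = malloc(BUF_BYTES);
    if (src == NULL || dst == NULL)
        return 1;
    memset(src, 1, BUF_BYTES);   /* fault the pages in up front */
    memset(dst, 2, BUF_BYTES);

    for (;;)                     /* each pass moves 2 * BUF_BYTES across the memory bus */
        memcpy(dst, src, BUF_BYTES);
}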
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE non-temporal store instructions such as movntps to stream data, to avoid cache conflicts with the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and targets 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it still applies. And in the worst case, a bug in your test code could corrupt your hard drive.
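A sketch of the non-temporal store idea using SSE intrinsics (the buffer size, the unroll factor of 4, and the pass count are arbitrary):

#include <stdlib.h>
#include <xmmintrin.h>

#define FLOATS (64u * 1024u * 1024u)   /* 256 MB worth of floats */

int main(void)
{
    float *buf = _mm_malloc(FLOATS * sizeof(float), 16);
    if (buf == NULL)
        return 1;
    __m128 v = _mm_set1_ps(1.0f);

    for (int pass = 0; pass < 1000; pass++) {
        for (size_t i = 0; i < FLOATS; i += 16) {     /* 4 stores = 64 bytes per iteration */
            _mm_stream_ps(buf + i,      v);           /* non-temporal store, bypasses the caches */
            _mm_stream_ps(buf + i + 4,  v);
            _mm_stream_ps(buf + i + 8,  v);
            _mm_stream_ps(buf + i + 12, v);
        }
        _mm_sfence();   /* make the streamed stores globally visible */
    }
    _mm_free(buf);
    return 0;
}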