I am comparing a few system calls where I read/write from/to memory. Is there any API defined to measure page faults (pages in/out) in C?
I found the library libperfstat.a, but it is for AIX; I couldn't find anything for Linux.
Edit:
I am aware of the time and perf stat commands on Linux; I am just exploring whether there is anything available to use inside a C program.
If you are running on Linux, you can use the perf_event_open system call (which is what perf stat uses). It's a bit tricky to get the parameters right; see the man page http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html and the code below.
There is no glibc wrapper, so you have to call it as follows:
static long perf_event_open(struct perf_event_attr *hw_event,
                            pid_t pid,
                            int cpu,
                            int group_fd,
                            unsigned long flags) {
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}
and then to count page faults:
struct perf_event_attr pe_attr_page_faults;
memset(&pe_attr_page_faults, 0, sizeof(pe_attr_page_faults));
pe_attr_page_faults.size = sizeof(pe_attr_page_faults);
pe_attr_page_faults.type = PERF_TYPE_SOFTWARE;
pe_attr_page_faults.config = PERF_COUNT_SW_PAGE_FAULTS;
pe_attr_page_faults.disabled = 1;
pe_attr_page_faults.exclude_kernel = 1;
// pid == 0: count for this process; CPU: the core to watch (or -1 for any CPU)
int page_faults_fd = perf_event_open(&pe_attr_page_faults, 0, CPU, -1, 0);
if (page_faults_fd == -1) {
    printf("perf_event_open failed for page faults: %s\n", strerror(errno));
    return -1;
}
// Start counting
ioctl(page_faults_fd, PERF_EVENT_IOC_RESET, 0);
ioctl(page_faults_fd, PERF_EVENT_IOC_ENABLE, 0);
// Your code to be profiled here
.....
// Stop counting and read value
ioctl(page_faults_fd, PERF_EVENT_IOC_DISABLE, 0);
uint64_t page_faults_count;
read(page_faults_fd, &page_faults_count, sizeof(page_faults_count));
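The counter value can then be printed and the descriptor closed (PRIu64 comes from <inttypes.h>):
printf("page faults: %" PRIu64 "\n", page_faults_count);
close(page_faults_fd);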
There is the getrusage function (SVr4, 4.3BSD, POSIX.1-2001; though not all fields are defined by the standard). On Linux several fields are not maintained, but man getrusage lists the interesting ones:
long ru_minflt; /* page reclaims (soft page faults) */
long ru_majflt; /* page faults (hard page faults) */
long ru_inblock; /* block input operations */
long ru_oublock; /* block output operations */
The rusage is also reported by wait4 (only usable from an external program); this is what the /usr/bin/time program uses to print the minor/major page fault counts.
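A minimal sketch of the getrusage approach, sampling the counters before and after the code under test (error handling omitted):
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    /* code to be measured here */
    getrusage(RUSAGE_SELF, &after);
    printf("soft faults: %ld, hard faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    return 0;
}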
It is not an API as such; however, I have had a lot of success rolling my own and reading /proc/myPID/stat from within my C program, which includes page fault statistics for my process. This lets me monitor the counts in real time as my program runs and store them however I like.
Bear in mind that reading the file can itself cause page faults, so there will be some inaccuracy, but you will get a general idea.
See here for details of the format of the file: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Memory_allocation.html
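A minimal sketch of that approach, assuming the field positions from proc(5) (minflt is field 10, majflt is field 12) and that the comm field contains no spaces:
#include <stdio.h>

/* Read the minor/major page fault counts of the calling process. */
static int read_fault_counts(unsigned long *minflt, unsigned long *majflt) {
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f)
        return -1;
    /* pid (comm) state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt */
    int n = fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %lu %*u %lu",
                   minflt, majflt);
    fclose(f);
    return (n == 2) ? 0 : -1;
}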
I have the below code where I'm intentionally creating a page fault in one of the threads in file.c
util.c
#include "util.h"
// to use as a fence() instruction
extern inline __attribute__((always_inline))
CYCLES rdtscp(void) {
    CYCLES cycles;
    asm volatile ("rdtscp" : "=a" (cycles));
    return cycles;
}
// initialize address
void init_ram_address(char* FILE_NAME){
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if(fd == -1) {
        printf("Could not open file.\n");
        exit(0);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
    ram_address = (int *) file_address;
}
// initialize address
void init_disk_address(char* FILE_NAME){
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if(fd == -1) {
        printf("Could not open file.\n");
        exit(0);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    disk_address = (int *) file_address;
}
file.c
#include "util.h"
void *f1();
void *f2();
pthread_barrier_t barrier;
pthread_mutex_t mutex;
int main(int argc, char **argv)
{
    pthread_t t1, t2;
    // in ram
    init_ram_address(RAM_FILE_NAME);
    // in disk
    init_disk_address(DISK_FILE_NAME);
    pthread_create(&t1, NULL, &f1, NULL);
    pthread_create(&t2, NULL, &f2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

void *f1()
{
    rdtscp();
    int load = *(ram_address);
    rdtscp();
    printf("Expecting this to be run first.\n");
    return NULL;
}

void *f2()
{
    rdtscp();
    int load = *(disk_address);
    rdtscp();
    printf("Expecting this to be run second.\n");
    return NULL;
}
I've used rdtscp() in the above code for fencing purposes (to ensure that the print statement gets executed only after the load operation is done).
Since t2 will incur a page fault, I expect t1 to finish executing its print statement first.
To run both the threads on the same core, I run taskset -c 10 ./file.
I see that t2 prints its statement before t1. What could be the reason for this?
I think you're expecting t2's int load = *(disk_address); to cause a context switch to t1, and since you're pinning everything to the same CPU core, that would give t1 time to win the race to take the lock for stdout.
A soft page fault doesn't need to context-switch; it just updates the page tables with a file page from the pagecache. Even though the mapping is backed by a disk file rather than anonymous memory or copy-on-write tricks, if the file has been read or written recently it will be hot in the pagecache and not require I/O (I/O is what would make it a hard page fault).
Maybe try evicting disk cache before a test run, like with echo 3 | sudo tee /proc/sys/vm/drop_caches if this is Linux, so access to the mmap region without MAP_POPULATE will be a hard page fault (requiring I/O).
(See https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache; sync first, at least on the disk file, if it was recently written, to make sure its page(s) are clean and able to be evicted, aka dropped. Dropping caches is mainly useful for benchmarking.)
Or programmatically, you can hint the kernel with the madvise(2) system call, like madvise(MADV_DONTNEED) on a page, encouraging it to evict it from pagecache soon. (Or at least hint that your process doesn't need it; other processes might keep it hot).
In Linux kernel 5.4 and later, MADV_COLD works as a hint to evict the specified page(s) on memory pressure. ("Deactivate" probably means remove from HW page tables, so next access will at least be a soft page fault.) Or MADV_PAGEOUT is apparently supposed to get the kernel to reclaim the specified page(s) right away, I guess before the system call returns. After that, the next access should be a hard page fault.
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.
MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.
These madvise args are Linux-specific. The madvise system call itself (as opposed to posix_madvise) is not guaranteed portable, but the man page gives the impression that some other systems have their own madvise system calls supporting some standard "advice" hints to the kernel.
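For the programmatic route, a minimal sketch (MADV_PAGEOUT assumes Linux 5.4+; addr must be page-aligned, which an mmap result already is):
#include <stdio.h>
#include <sys/mman.h>

/* Hint the kernel to reclaim the mapped pages right away, so the next
 * access is likely to be a hard page fault. */
static void evict_mapping(void *addr, size_t len) {
    if (madvise(addr, len, MADV_PAGEOUT) != 0)
        perror("madvise(MADV_PAGEOUT)");
}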
You haven't shown the declaration of ram_address or disk_address.
If it's not a pointer-to-volatile like volatile int *disk_address, the loads may be optimized away at compile time. Writes to non-escaped local vars like int load don't have to respect "memory" clobbers because nothing else could possibly have a reference to them.
If you compiled without optimization or something, then yes the load will still happen even without volatile.
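For example, declaring the shared pointers as pointer-to-volatile keeps the loads in the generated code even with optimization enabled:
volatile int *ram_address;   /* loads through these cannot be optimized away */
volatile int *disk_address;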
I inherited some ALSA code that runs on a Linux embedded platform.
The existing implementation does blocking reads and writes using snd_pcm_readi() and snd_pcm_writei().
I am tasked to make this run on an ARM processor, but I find that the blocking interleaved reads push the CPU to 99%, so I am exploring non-blocking reads and writes.
I open the device as can be expected:
snd_pcm_t *handle;
const char* hwname = "plughw:0"; // example name
snd_pcm_open(&handle, hwname, SND_PCM_STREAM_CAPTURE, SND_PCM_NONBLOCK);
Other ALSA stuff then happens which I can supply on request.
Noteworthy to mention at this point that:
we set a sampling rate of 48,000 [Hz]
the sample type is signed 32 bit integer
the device always overrides our requested period size to 1024 frames
Reading the stream like so:
int32_t* buffer; // buffer set up to hold period_size samples
int actual = snd_pcm_readi(handle, buffer, period_size);
This call takes approx 15 [ms] to complete in blocking mode. Obviously, variable actual will read 1024 on return.
The problem is: in non-blocking mode, this function also takes 15 ms to complete, and actual also always reads 1024 on return.
I would expect that the function would return immediately, with actual being <=1024 and quite possibly reading "EAGAIN" (-11).
In between read attempts I plan to put the thread to sleep for a specific amount of time, yielding CPU time to other processes.
Am I misunderstanding the ALSA API? Or could it be that my code is missing a vital step?
If the function returns a value of 1024, then at least 1024 frames were available at the time of the call.
(It's possible that the 15 ms is time needed by the driver to actually start the device.)
Anyway, blocking or non-blocking mode does not make any difference regarding CPU usage. To reduce CPU usage, replace the default device with plughw or hw, but then you lose features like device sharing or sample rate/format conversion.
I solved my problem by wrapping snd_pcm_readi() as follows:
/*
** Read interleaved stream in non-blocking mode
*/
template <typename SampleType>
snd_pcm_sframes_t snd_pcm_readi_nb(snd_pcm_t* pcm, SampleType* buffer, snd_pcm_uframes_t size, unsigned samplerate)
{
    const snd_pcm_sframes_t avail = ::snd_pcm_avail(pcm);
    if (avail < 0) {
        return avail;
    }
    if ((snd_pcm_uframes_t)avail < size) {
        snd_pcm_uframes_t remain = size - avail;
        unsigned long msec = (remain * 1000) / samplerate;
        static const unsigned long SLEEP_THRESHOLD_MS = 1;
        if (msec > SLEEP_THRESHOLD_MS) {
            msec -= SLEEP_THRESHOLD_MS;
            // exercise for the reader: sleep for msec
        }
    }
    return ::snd_pcm_readi(pcm, buffer, size);
}
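A hypothetical call site, using the handle, period size, and sample rate from the question:
int32_t buffer[1024];                 // one period of signed 32-bit samples
snd_pcm_sframes_t got = snd_pcm_readi_nb(handle, buffer, 1024, 48000);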
This works quite well for me. My audio process now 'only' takes 19% CPU time.
And it matters not if the PCM interface was opened using SND_PCM_NONBLOCK or 0.
Going to perform callgrind analysis to see if more CPU cycles can be saved elsewhere in the code.
My goal is to write some code to record the current call stack for all CPUs at some interval. Essentially I would like to do the same as perf record but using perf_event_open myself.
According to the manpage it seems I need to use the PERF_SAMPLE_CALLCHAIN sample type and read the results with mmap. That said, the manpage is incredibly terse, and some sample code would go a long way right now.
Can someone point me in the right direction?
The best way to learn about this would be to read the Linux kernel source code and see how you can emulate perf record -g yourself.
As you correctly identified, recording of perf events starts with the perf_event_open system call, so that is where we can start:
definition of perf_event_open
If you look at the parameters of the system call, you will see that the first one is of type struct perf_event_attr *. This parameter takes the attributes for the system call, and it is what you need to modify to record callchains. Sample code could look like this (remember you can tweak other parameters and members of struct perf_event_attr the way you want):
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int buf_size_shift = 8;

static unsigned perf_mmap_size(int buf_size_shift)
{
    /* Ring buffer: 2^buf_size_shift data pages plus one header page. */
    return ((1U << buf_size_shift) + 1) * sysconf(_SC_PAGESIZE);
}

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

int main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.sample_type = PERF_SAMPLE_CALLCHAIN; /* this is what allows you to obtain callchains */
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.sample_period = 1000;
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    /* associate a buffer with the file */
    struct perf_event_mmap_page *mpage;
    mpage = mmap(NULL, perf_mmap_size(buf_size_shift),
                 PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, 0);
    if (mpage == MAP_FAILED) {
        close(fd);
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    printf("Measuring instruction count for this printf\n");
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\n", count);
    close(fd);
}
Note: A nice and easy way to understand the handling of all of these perf events can be seen below -
PMU-TOOLS by Andi Kleen
If you start reading the source code for the system call, you will see that a function perf_event_alloc is being called. This function, among other things, will setup the buffer for obtaining callchains using perf record.
The function get_callchain_buffers is responsible for setting up callchain buffers.
perf_event_open works via a sampling/counting mechanism: when the performance monitoring counter corresponding to the event you are profiling overflows, the kernel collects all the relevant event information and stores it in a ring buffer. This ring buffer can be prepared and accessed via mmap(2).
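For reading the samples back, here is a rough sketch of draining that ring buffer (mpage and buf_size_shift come from the code above; per perf_event_open(2), the header page is followed by 2^buf_size_shift data pages, and records can wrap around the end of the buffer, which this sketch does not handle):
#include <linux/perf_event.h>
#include <stdint.h>
#include <unistd.h>

static void drain_ring(struct perf_event_mmap_page *mpage, int buf_size_shift)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t data_size = (uint64_t)page_size << buf_size_shift;
    char *data = (char *)mpage + page_size;      /* data area follows the header page */

    uint64_t head = mpage->data_head;
    __sync_synchronize();                        /* barrier after reading data_head, per the man page */
    uint64_t tail = mpage->data_tail;

    while (tail < head) {
        struct perf_event_header *eh =
            (struct perf_event_header *)(data + (tail % data_size));
        if (eh->type == PERF_RECORD_SAMPLE) {
            /* For PERF_SAMPLE_CALLCHAIN the sample body is:
             * u64 nr; followed by u64 ips[nr]; (see perf_event_open(2)) */
        }
        tail += eh->size;
    }
    mpage->data_tail = tail;                     /* tell the kernel the data was consumed */
}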
Edit #1:
The process of mmaping the ring buffers starts from the first function invoked when you call perf record: __cmd_record, which calls record__open, which then calls record__mmap, followed by record__mmap_evlist, which calls perf_evlist__mmap_ex, followed by perf_evlist__mmap_per_cpu, finally ending up in perf_evlist__mmap_per_evsel, which does most of the heavy lifting as far as doing an mmap for each event is concerned.
Edit #2:
Yes, you are correct. When you set the sample period to, say, 1000, the kernel will record a sample for every 1000th occurrence of the event (which by default is cycles). The perf counter is preset so that it overflows after 1000 events; the overflow raises an interrupt, which leads to the sample being recorded into the buffer.
I've implemented a char device for my kernel module and implemented a read function for it. The read function calls copy_to_user to return data to the caller. I originally implemented the read function in a blocking manner (with wait_event_interruptible), but the problem reproduces even when I implement read in a non-blocking manner. My code is running on a MIPS processor.
The user space program opens the char device and reads into a buffer allocated on the stack.
What I've found is that occasionally copy_to_user will fail to copy any bytes. Moreover, even if I replace copy_to_user with a call to memcpy (only for the purposes of checking... I know this isn't the right thing to do), and print out the destination buffer immediately afterwards, I see that memcpy has failed to copy any bytes.
I'm not really sure how to further debug this - how can I determine why memory is not being copied? Is it possible that the process context is wrong?
EDIT: Here's some pseudo-code outlining what the code currently looks like:
User mode (runs repeatedly):
char buf[BUF_LEN];
FILE *f = fopen(char_device_file, "rb");
fread(buf, 1, BUF_LEN, f);
fclose(f);
Kernel mode:
char_device = create_char_device(char_device_name,
                                 NULL,
                                 read_func,
                                 NULL,
                                 NULL);
int read_func(char *output_buffer, int output_buffer_length, loff_t *offset)
{
    int rc;
    if (*offset == 0)
    {
        spin_lock_irqsave(&lock, flags);
        while (get_available_bytes_to_read() == 0)
        {
            spin_unlock_irqrestore(&lock, flags);
            if (wait_event_interruptible(self->wait_queue, get_available_bytes_to_read() != 0))
            {
                // Got a signal; retry the read
                return -ERESTARTSYS;
            }
            spin_lock_irqsave(&lock, flags);
        }
        rc = copy_to_user(output_buffer, internal_buffer, bytes_to_copy);
        spin_unlock_irqrestore(&lock, flags);
    }
    else rc = 0;
    return rc;
}
It took quite a bit of debugging, but in the end Tsyvarev's hint (the comment about not calling copy_to_user with a spinlock taken) pointed at the cause.
Our process had a background thread which occasionally launched a new process (fork + exec). When we disabled this thread, everything worked well. The best theory we have is that the fork made all of our memory pages copy-on-write, so when we tried to copy to them, the kernel had to do some work which could not be done with the spinlock taken. Hopefully it at least makes some sense (although I'd have guessed that this would apply only to the child process, and the parent's process pages would simply remain writable, but who knows...).
We rewrote our code to be lockless and the problem disappeared.
Now we just need to verify that our lockless code is indeed safe on different architectures. Easy as pie.
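For reference, the standard alternative to going fully lockless is to stage the data into a temporary kernel buffer under the spinlock and only call copy_to_user after dropping it. A sketch using the names from the pseudo-code above (the tmp size and the min() usage are illustrative):
char tmp[BUF_LEN];
size_t n;

spin_lock_irqsave(&lock, flags);
n = min((size_t)bytes_to_copy, sizeof(tmp));
memcpy(tmp, internal_buffer, n);          /* copy while holding the lock... */
spin_unlock_irqrestore(&lock, flags);

if (copy_to_user(output_buffer, tmp, n))  /* ...copy_to_user without it */
    return -EFAULT;
return n;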
I am familiar with two approaches, but both of them have their limitations.
The first one is to use the instruction RDTSC. However, the problem is that it doesn't count the number of cycles of my program in isolation and is therefore sensitive to noise due to concurrent processes.
The second option is to use the clock library function. I thought this approach was reliable, since I expected it to count the number of cycles for my program only (which is what I intend to achieve). However, it turns out that in my case it measures the elapsed time and then multiplies it by CLOCKS_PER_SEC. This is not only unreliable but also wrong, since CLOCKS_PER_SEC is set to 1,000,000, which does not correspond to the actual frequency of my processor.
Given the limitation of the proposed approaches, is there a better and more reliable alternative to produce consistent results?
A lot here depends on how large an amount of time you're trying to measure.
RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.
Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following (a sketch of the serialized read follows the list):
Set the code to only run on one specific core.
Set the code to execute at maximum priority so nothing preempts it.
Use CPUID liberally to ensure serialization where needed.
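A sketch of the serialized-read idea (GCC/Clang inline asm on x86-64 assumed; CPUID serves as the serializing fence before RDTSC, a common pattern from Intel's benchmarking guidance):
#include <stdint.h>

static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xor %%eax, %%eax\n\t"
                      "cpuid\n\t"            /* serialize: retire earlier instructions first */
                      "rdtsc"                /* then read the time-stamp counter */
                      : "=a" (lo), "=d" (hi)
                      :
                      : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}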
If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to assure that the code in question takes (at least) the better part of a second or so. clock isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.
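For those longer time scales, a plain clock() sketch is all you need:
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t start = clock();
    /* code under test here; let it run for a sizable fraction of a second */
    clock_t end = clock();
    printf("CPU time: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}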
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This system call has explicit controls for:
process PID selection
whether to consider kernel/hypervisor instructions or not
and it will therefore count the cycles properly even when multiple processes are running concurrently.
See this answer for more details: How to get the CPU cycle count in x86_64 from C++?
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("%lld\n", count);
    close(fd);
}
RDTSC is the most accurate way of counting program execution cycles. If you are looking to measure execution performance over time scales where it matters if your thread has been preempted, then you would probably be better served with a profiler (VTune, for instance).
clock()/CLOCKS_PER_SEC is a pretty bad (low-resolution, higher-overhead) way of getting time compared to RDTSC, which has almost no overhead.
If you have a specific issue with RDTSC, I may be able to assist.
re: Comments
Intel Performance Counter Monitor: This is mainly for measuring metrics outside of the processor, such as Memory bandwidth, power usage, PCIe utilization. It does also happen to measure CPU frequency, but it typically is not useful for processor bound application performance.
RDTSC portability: RDTSC is an Intel CPU instruction supported by all modern Intel CPUs. On modern CPUs it is based on the uncore frequency of your CPU and is fairly consistent across CPU cores, although it is not appropriate if your application is frequently being migrated between cores (and especially between sockets). If that is the case you really want to look at a profiler.
Out-of-order execution: Yes, things get executed out of order, so this can affect performance slightly, but it still takes time to execute instructions, and RDTSC is the best way of measuring that time. It excels in the normal use case of executing non-I/O-bound instructions on the same core, and this is really how it is meant to be used. If you have a more complicated use case you really should be using a different tool, but that doesn't negate that rdtsc() can be very useful in analyzing program execution.