Emulate `perf record -g` with `perf_event_open` - c

My goal is to write some code to record the current call stack for all CPUs at some interval. Essentially I would like to do the same as perf record but using perf_event_open myself.
According to the manpage it seems I need to use the PERF_SAMPLE_CALLCHAIN sample type and read the results with mmap. That said, the manpage is incredibly terse, and some sample code would go a long way right now.
Can someone point me in the right direction?

The best way to learn about this would be to read the Linux kernel source code and see how you can emulate perf record -g yourself.
As you correctly identified, recording of perf events would start with the system call perf_event_open. So that is where we can start,
definition of perf_event_open
If you observe the parameters of the system call, you will see that the first parameter is a struct perf_event_attr * type. This is the parameter that takes in the attributes for the system call. This is what you need to modify to record callchains. A sample code could be like this (remember you can tweak other parameters and members of the struct perf_event_attr the way you want) :
int buf_size_shift = 8;
static unsigned perf_mmap_size(int buf_size_shift)
{
return ((1U << buf_size_shift) + 1) * sysconf(_SC_PAGESIZE);
}
int main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.sample_type = PERF_SAMPLE_CALLCHAIN; /* this is what allows you to obtain callchains */
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 1;
pe.exclude_kernel = 1;
pe.sample_period = 1000;
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
/* associate a buffer with the file */
struct perf_event_mmap_page *mpage;
mpage = mmap(NULL, perf_mmap_size(buf_size_shift),
PROT_READ|PROT_WRITE, MAP_SHARED,
fd, 0);
if (mpage == (struct perf_event_mmap_page *)-1L) {
close(fd);
return -1;
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
printf("Measuring instruction count for this printf\n");
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("Used %lld instructions\n", count);
close(fd);
}
Note: A nice and easy way to understand the handling of all of these perf events can be seen below -
PMU-TOOLS by Andi Kleen
If you start reading the source code for the system call, you will see that a function perf_event_alloc is being called. This function, among other things, will setup the buffer for obtaining callchains using perf record.
The function get_callchain_buffers is responsible for setting up callchain buffers.
perf_event_open works via a sampling/counting mechanism where if the performance monitoring counter corresponding to the event you are profiling overflows, then all the event relevant information will be collected and stored into a ring-buffer by the kernel. This ring-buffer can be prepared and accessed via mmap(2).
Edit #1:
The flowchart describing the use of mmap when doing perf record is shown via the below image.
The process of mmaping ring buffers would start from the first function when you call perf record - which is __cmd_record, this calls record__open, which then calls record__mmap, followed by a call to record__mmap_evlist, which then calls perf_evlist__mmap_ex, this is followed by perf_evlist__mmap_per_cpu and finally ending up in perf_evlist__mmap_per_evsel which is doing most of the heavy-lifting as far as doing an mmap for each event is concerned.
Edit #2:
Yes you are correct. When you set the sample period to be, say, a 1000, this means for every 1000th occurrence of the event(which by default is cycles), the kernel will record a sample of this event into this buffer. This means the perf counters will be set to 1000, so that it overflows at 0 and you get an interrupt and eventual recording of the samples.

Related

Is it legal to read a file descriptor into NULL?

Recently, I've been fixing the timestep for the sake of a library that I am writing. Following some research, suppose that I ended up with this prototype, precise and easy to combine with the generic event system of my library:
#include <stdio.h>
#include <unistd.h>
#include <sys/timerfd.h>
#include <poll.h>
struct pollfd fds[1];
struct itimerspec its;
int main(void) {
fds[0] = (struct pollfd) {timerfd_create(CLOCK_MONOTONIC, 0), POLLIN, 0}; //long live clarity
its.it_interval = (struct timespec) {0, 16666667};
its.it_value = (struct timespec) {0, 16666667};
timerfd_settime(fds[0].fd, 0, &its, NULL);
while(1) {
poll(fds, 1, -1);
if(fds[0].revents == POLLIN) {
long long buffer;
read(fds[0].fd, &buffer, 8);
printf("ROFL\n");
} else {
printf("BOOM\n");
break;
}
}
close(fds[0].fd);
return 0;
}
However, it severely hurt me that I've had to pollute my CPU caches with a whole precious 8 bytes of data in order to make the timer's file descriptor reusable. Because of that, I've tried to replace the read() call with lseek(), as follows:
lseek(fds[0].fd, 0, SEEK_END);
Unfortunately, both that and even lseek(fds[0].fd, 8, SEEK_CUR); gave me ESPIPE errors and would not work. But then, I found out that the following actually did its job, despite of giving EFAULTs:
read(fds[0].fd, NULL, 8);
Is it legal, defined behavior to offset the file descriptor like this? If it is not (as the EFAULTs suggested to me, strongly enough to refrain from using that piece of genius), does there exist a function that would discard the read data, without ever writing it down, or otherwise offset my timer's file descriptor?
The POSIX specification of read(2) does not specify the consequences of passing a null pointer as the buffer argument. No specific error code is given, nor does it say whether any data will be read from the descriptor.
The Linux man page has this error, though:
EFAULT buf is outside your accessible address space.
It doesn't say that it will read the 8 bytes and discard them when this happens, though.
So I don't think you can depend on this working as you desire.

Is it possible to display progress of Asynchronous IO

I am trying to implement file copy program with POSIX Asynchronous IO APIs in linux.
I tried this:
main() {
char data[200];
int fd = open("data.txt", O_RDONLY); // text file on the disk
struct aiocb aio;
aio.aio_fildes = fd;
aio.aio_buf = data;
aio.aio_nbytes = sizeof(data);
aio.aio_offset = 0;
memset(&aio, 0, sizeof(struct aiocb));
aio_read(arg->aio_p);
int counter = 0;
while (aio_error(arg->aio_p) == EINPROGRESS) {
printf("counter: %d\n", counter++);
}
int ret = aio_return(&aio);
printf("ret value %d \n",ret);
return 0;
}
But counter giving different results every time when I run
Is it possible to display progress of aio_read and aio_write functions?
You have different results because each different execution has its own context of execution that may differe from others (Do you always follow the exact same path to go from your house to the bank? Even if yes, elapsed time to reach the bank always exactly the same? Same task at the end - you're in the bank, different executions). What your program tries to measure is not I/O completion but the number of time it tests for completion of some async-I/O.
And no, there is no concept of percentage of completion of a given async-I/O.

Unable to write the complete script onto a device on the serial port

The script file has over 6000 bytes which is copied into a buffer.The contents of the buffer are then written to the device connected to the serial port.However the write function only returns 4608 bytes whereas the buffer contains 6117 bytes.I'm unable to understand why this happens.
{
FILE *ptr;
long numbytes;
int i;
ptr=fopen("compass_script(1).4th","r");//Opening the script file
if(ptr==NULL)
return 1;
fseek(ptr,0,SEEK_END);
numbytes = ftell(ptr);//Number of bytes in the script
printf("number of bytes in the calibration script %ld\n",numbytes);
//Number of bytes in the script is 6117.
fseek(ptr,0,SEEK_SET);
char writebuffer[numbytes];//Creating a buffer to copy the file
if(writebuffer == NULL)
return 1;
int s=fread(writebuffer,sizeof(char),numbytes,ptr);
//Transferring contents into the buffer
perror("fread");
fclose(ptr);
fd = open("/dev/ttyUSB3",O_RDWR | O_NOCTTY | O_NONBLOCK);
//Opening serial port
speed_t baud=B115200;
struct termios serialset;//Setting a baud rate for communication
tcgetattr(fd,&serialset);
cfsetispeed(&serialset,baud);
cfsetospeed(&serialset,baud);
tcsetattr(fd,TCSANOW,&serialset);
long bytesw=0;
tcflush(fd,TCIFLUSH);
printf("\nnumbytes %ld",numbytes);
bytesw=write(fd,writebuffer,numbytes);
//Writing the script into the device connected to the serial port
printf("bytes written%ld\n",bytesw);//Only 4608 bytes are written
close (fd);
return 0;
}
Well, that's the specification. When you write to a file, your process normally is blocked until the whole data is written. And this means your process will run again only when all the data has been written to the disk buffers. This is not true for devices, as the device driver is the responsible of determining how much data is to be written in one pass. This means that, depending on the device driver, you'll get all data driven, only part of it, or even none at all. That simply depends on the device, and how the driver implements its control.
On the floor, device drivers normally have a limited amount of memory to fill buffers and are capable of a limited amount of data to be accepted. There are two policies here, the driver can block the process until more buffer space is available to process it, or it can return with a partial write only.
It's your program resposibility to accept a partial read and continue writing the rest of the buffer, or to pass back the problem to the client module and return only a partial write again. This approach is the most flexible one, and is the one implemented everywhere. Now you have a reason for your partial write, but the ball is on your roof, you have to decide what to do next.
Also, be careful, as you use long for the ftell() function call return value and int for the fwrite() function call... Although your amount of data is not huge and it's not probable that this values cannot be converted to long and int respectively, the return type of both calls is size_t and ssize_t resp. (like the speed_t type you use for the baudrate values) long can be 32bit and size_t a 64bit type.
The best thing you can do is to ensure the whole buffer is written by some code snippet like the next one:
char *p = buffer;
while (numbytes > 0) {
ssize_t n = write(fd, p, numbytes);
if (n < 0) {
perror("write");
/* driver signals some error */
return 1;
}
/* writing 0 bytes is weird, but possible, consider putting
* some code here to cope for that possibility. */
/* n >= 0 */
/* update pointer and numbytes */
p += n;
numbytes -= n;
}
/* if we get here, we have written all numbytes */

Measure page faults from a c program

I am comparing a few system calls where I read/write from/to memory. Is there any API defined to measure page faults (pages in/out) in C ?
I found this library libperfstat.a but it is for AIX, I couldn't find anything for linux.
Edit:
I am aware of time & perf-stat commands in linux, just exploring if there is anything available for me to use inside the C program.
If you are running on Linux, you can use the perf_event_open system call (used by perf stat). It's a little bit tricky to get the right parameters, look at the man page http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html and see the code below.
There is no lib C wrapper so you have to call it as following:
static long perf_event_open(struct perf_event_attr *hw_event,
pid_t pid,
int cpu,
int group_fd,
unsigned long flags) {
int ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
and then to count page faults:
struct perf_event_attr pe_attr_page_faults;
memset(&pe_attr_page_faults, 0, sizeof(pe_attr_page_faults));
pe_attr_page_faults.size = sizeof(pe_attr_page_faults);
pe_attr_page_faults.type = PERF_TYPE_SOFTWARE;
pe_attr_page_faults.config = PERF_COUNT_SW_PAGE_FAULTS;
pe_attr_page_faults.disabled = 1;
pe_attr_page_faults.exclude_kernel = 1;
int page_faults_fd = perf_event_open(&pe_attr_page_faults, 0, CPU, -1, 0);
if (page_faults_fd == -1) {
printf("perf_event_open failed for page faults: %s\n", strerror(errno));
return -1;
}
// Start counting
ioctl(page_faults_fd, PERF_EVENT_IOC_RESET, 0);
ioctl(page_faults_fd, PERF_EVENT_IOC_ENABLE, 0);
// Your code to be profiled here
.....
// Stop counting and read value
ioctl(page_faults_fd, PERF_EVENT_IOC_DISABLE, 0);
uint64_t page_faults_count;
read(page_faults_fd, &page_faults_count, sizeof(page_faults_count));
There is getrusage function (SVr4, 4.3BSD. POSIX.1-2001; but not all fields are defined in standard). In linux there are several broken fields, but man getrusage lists several interesting fields:
long ru_minflt; /* page reclaims (soft page faults) */
long ru_majflt; /* page faults (hard page faults) */
long ru_inblock; /* block input operations */
long ru_oublock; /* block output operations */
The rusage is also reported in wait4 (only usable in external program). This one is used by /usr/bin/time program (it prints minor/major pagefault counts).
It is not an API as such, however I have had a lot of success by rolling my own and reading /proc/myPID/stat from within my C program which includes page fault statistics for my process, this allows me to monitor counts in real time as my program runs and store these however I like.
Remember that doing so can cause page faults in itself so there will be some inaccuracy but you will get a general idea.
See here for details of the format of the file: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Memory_allocation.html

Flush kernel's TCP buffer for `MSG_MORE`-flagged packets

send()'s man page reveals the MSG_MORE flag which is asserted to act like TCP_CORK. I have a wrapper function around send():
int SocketConnection_Write(SocketConnection *this, void *buf, int len) {
errno = 0;
int sent = send(this->fd, buf, len, MSG_NOSIGNAL);
if (errno == EPIPE || errno == ENOTCONN) {
throw(exc, &SocketConnection_NotConnectedException);
} else if (errno == ECONNRESET) {
throw(exc, &SocketConnection_ConnectionResetException);
} else if (sent != len) {
throw(exc, &SocketConnection_LengthMismatchException);
}
return sent;
}
Assuming I want to use the kernel buffer, I could go with TCP_CORK, enable whenever it is necessary and then disable it to flush the buffer. But on the other hand, thereby the need for an additional system call arises. Thus, the usage of MSG_MORE seems more appropriate to me. I'd simply change the above send() line to:
int sent = send(this->fd, buf, len, MSG_NOSIGNAL | MSG_MORE);
According to lwm.net, packets will be flushed automatically if they are large enough:
If an application sets that option on
a socket, the kernel will not send out
short packets. Instead, it will wait
until enough data has shown up to fill
a maximum-size packet, then send it.
When TCP_CORK is turned off, any
remaining data will go out on the
wire.
But this section only refers to TCP_CORK. Now, what is the proper way to flush MSG_MORE packets?
I can only think of two possibilities:
Call send() with an empty buffer and without MSG_MORE being set
Re-apply the TCP_CORK option as described on this page
Unfortunately the whole topic is very poorly documented and I couldn't find much on the Internet.
I am also wondering how to check that everything works as expected? Obviously running the server through strace is not an option. So the simplest way would be to use netcat and then look at its strace output? Or will the kernel handle traffic transmitted over a loopback interface differently?
I have taken a look at the kernel source and both assumptions seem to be true. The following code are extracts from net/ipv4/tcp.c (2.6.33.1).
static inline void tcp_push(struct sock *sk, int flags, int mss_now,
int nonagle)
{
struct tcp_sock *tp = tcp_sk(sk);
if (tcp_send_head(sk)) {
struct sk_buff *skb = tcp_write_queue_tail(sk);
if (!(flags & MSG_MORE) || forced_push(tp))
tcp_mark_push(tp, skb);
tcp_mark_urg(tp, flags, skb);
__tcp_push_pending_frames(sk, mss_now,
(flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
}
}
Hence, if the flag is not set, the pending frames will definitely be flushed. But this is be only the case when the buffer is not empty:
static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
size_t psize, int flags)
{
(...)
ssize_t copied;
(...)
copied = 0;
while (psize > 0) {
(...)
if (forced_push(tp)) {
tcp_mark_push(tp, skb);
__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
} else if (skb == tcp_send_head(sk))
tcp_push_one(sk, mss_now);
continue;
wait_for_sndbuf:
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
if (copied)
tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
out:
if (copied)
tcp_push(sk, flags, mss_now, tp->nonagle);
return copied;
do_error:
if (copied)
goto out;
out_err:
return sk_stream_error(sk, flags, err);
}
The while loop's body will never be executed because psize is not greater 0. Then, in the out section, there is another chance, tcp_push() gets called but because copied still has its default value, it will fail as well.
So sending a packet with the length 0 will never result in a flush.
The next theory was to re-apply TCP_CORK. Let's take a look at the code first:
static int do_tcp_setsockopt(struct sock *sk, int level,
int optname, char __user *optval, unsigned int optlen)
{
(...)
switch (optname) {
(...)
case TCP_NODELAY:
if (val) {
/* TCP_NODELAY is weaker than TCP_CORK, so that
* this option on corked socket is remembered, but
* it is not activated until cork is cleared.
*
* However, when TCP_NODELAY is set we make
* an explicit push, which overrides even TCP_CORK
* for currently queued segments.
*/
tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
tcp_push_pending_frames(sk);
} else {
tp->nonagle &= ~TCP_NAGLE_OFF;
}
break;
case TCP_CORK:
/* When set indicates to always queue non-full frames.
* Later the user clears this option and we transmit
* any pending partial frames in the queue. This is
* meant to be used alongside sendfile() to get properly
* filled frames when the user (for example) must write
* out headers with a write() call first and then use
* sendfile to send out the data parts.
*
* TCP_CORK can be set together with TCP_NODELAY and it is
* stronger than TCP_NODELAY.
*/
if (val) {
tp->nonagle |= TCP_NAGLE_CORK;
} else {
tp->nonagle &= ~TCP_NAGLE_CORK;
if (tp->nonagle&TCP_NAGLE_OFF)
tp->nonagle |= TCP_NAGLE_PUSH;
tcp_push_pending_frames(sk);
}
break;
(...)
As you can see, there are two ways to flush. You can either set TCP_NODELAY to 1 or TCP_CORK to 0. Luckily, both won't check whether the flag is already set. Thus, my initial plan to re-apply the TCP_CORK flag can be optimized to just disable it, even if it's currently not set.
I hope this helps someone with similar issues.
That's a lot of research... all I can offer is this empirical post note:
Sending a bunch of packet with MSG_MORE set, followed by a packet without MSG_MORE, the whole lot goes out. It works a treat for something like this:
for (i=0; i<mg_live.length; i++) {
// [...]
if ((n = pth_send(sock, query, len, MSG_MORE | MSG_NOSIGNAL)) < len) {
printf("error writing to socket (sent %i bytes of %i)\n", n, len);
exit(1);
}
}
}
pth_send(sock, "END\n", 4, MSG_NOSIGNAL);
That is, when you're sending out all the packets at once, and have a clearly defined end... AND you are only using one socket.
If you tried writing to another socket in the middle of the above loop, you may find that Linux releases the previously held packets. At least that appears to be the trouble I'm having right now. But it might be an easy solution for you.

Resources