perf_event_open tracepoint with bpf for a specific cpu - c

I want to write a C program that triggers execution of a bpf program when a syscall
is executed on a specific CPU by any process/thread
So the idea is to do a perf_event_open(pattr, -1, {MY_CPU_NUM}, -1, 0) followed
by ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);. My BPF program increments a counter in a map, that I am reading.
The specific system call I am using in my example is sys_exit_unlinkat, and I am testing the program by command taskset --cpu-list {ANY_CPU_OTHER_THAN_MY_CPU_NUMBER} rm -rf {DIRECTORY}}.
I expect that if I command to remove directory from a different core than where I placed my perf event, I should not see my counter increment. However, I see my counter increment irrespective of the cpu argument I provide in perf_event_open.
I dont understand why!
I tried, seeing what does perf record -C XX do, and it shows up bunch of perf_event_open along with one perf_event_open with PERF_TYPE_TRACEPOINT with similar arguments as mine, and it works correctly that it shows output only when rm -rf is executed on the MY_CPU_NUM.
Code Snippet:
pattr.type = PERF_TYPE_TRACEPOINT;
pattr.size = sizeof(pattr);
pattr.config =721; //unlinkat // 723; // rmdir
pattr.sample_period = 1;
pattr.wakeup_events = 1;
pattr.disabled = 1;
pattr.exclude_guest = 1;
pattr.sample_type = PERF_SAMPLE_RAW;
efd = perf_event_open(&pattr, -1, 0, -1, 0); // cpu number is zero
if(efd < 0) {
printf("error in efd opening, %s\n", strerror(errno));
exit(1);
}
ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
if (ret < 0) {
printf("PERF_EVENT_IOC_SET_BPF error: %s\n", strerror(errno));
exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
if (ret < 0) {
printf("PERF_EVENT_IOC_ENABLE error: %s\n", strerror(errno));
exit(-1);
}
output of uname -a
Linux zephyr 5.4.0-110-generic in my machine.
EDIT-1:
Okay, I tried some noob debugging by putting the kernel into gdb
and trying to figure out the issue.
So, in the syscall_exit path perf_syscall_exit(kernel/events/trace_syscalls.c) is called, which then looks if there is some perf event associated with the current cpu.
code snippet:
static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
{
...
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
return;
sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
return;
head = this_cpu_ptr(sys_data->exit_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
if (!valid_prog_array && hlist_empty(head)) // <--- WATCH
return;
...
Now, in the above code, see where I commented WATCH. So what it checks I think is, that if the program is invalid and the event list is empty, return. So, imagine if the program is valid yet the event list is empty, then irrespective whether cpu has an event attached or not, this check will not pass and we will go ahead exeucting the BPF program.
So, I checked by installing perf_event without attaching bpf program and I saw that the check passed and we did not go ahead when the rm -rf {DIRECTORY} was executed from a different cpu. And when I executed from the core 0(where event was attached), the check failed and the program proceeded ahead.
So does that mean, that in the kernel, we cannot attach BPF program to an event that is tied to a specific CPU? Is this a kernel bug? or design necessity?

Related

pcap_set_rfmon return 0 as success but the interface is not set to monitor mode

I'm trying to write a small program which set my network interface to monitor mode using C, the function pcap_set_rfmon returns 0 as success but the interface is still in mange mode. I'm sure my network card supports Monitor mode because i have checked using ng-airmon and iwconfig wlp3s0 mode monitor the wlp3s0 is my network interface's name.
Here's my code:
#include <pcap.h>
main()
{
char error_buffer[PCAP_ERRBUF_SIZE];
pcap_t *handle = pcap_create("wlp3s0", error_buffer);
int result = pcap_set_rfmon(handle, 1);
if (result != 0)
{
printf("failed to set pcap rfmon");
}
}
Since the code output nothing and just returns 0, i don't know what has gone wrong and where to look at, can you guys tell me what i should check or something is missing
To quote the documentation for pcap_set_rfmon():
pcap_set_rfmon() sets whether monitor mode should be set on a capture handle when the handle is activated. ...
I've emphasized part of that - "when the handle is activated". All pcap_set_rfmon() does is set a flag in the pcap_t to indicate that, when the program calls pcap_activate(), the adapter would be put in monitor mode (if pcap_activate() succeeds).
You aren't calling pcap_activate(), so nothing happens.
You will also have to keep the pcap_t open - even a program that does
#include <pcap.h>
main()
{
char error_buffer[PCAP_ERRBUF_SIZE];
pcap_t *handle;
int result;
handle = pcap_create("wlp3s0", error_buffer);
if (handle == NULL)
{
printf("failed to create a handle: %s\n",
error_buffer);
return 2;
}
result = pcap_set_rfmon(handle, 1);
if (result != 0)
{
printf("failed to set pcap rfmon: %s (%s)\n",
pcap_statustostr(result),
pcap_geterr(handle));
return 2;
}
result = pcap_activate(handle);
{
printf("failed to activate handle: %s (%s)\n",
pcap_statustostr(result),
pcap_geterr(handle));
return 2;
}
}
will just let the adapter revert to managed mode when it exits. You will need to add something such as
for (;;)
pause();
at the end of main(), so the program doesn't exit unless you interrupt or terminate it.
(Note: I added more error checking and reporting to the program. This Is A Good Thing, as it means that, if something doesn't work, the program will give a detailed error report, helping you - or whoever you ask for help - try to fix the problem, rather than just silently failing or, if pcap_create() fails, crashing.)

Faking an input device for testing purpose

What I want to do
I'm writing a daemon which listen to the input devices for keys presses and send signals via D-Bus. The main goal is to manage audio volume and screen backlight level by requesting changes or informing about changes.
I use libevdev to handle the input device events.
I wrote a function for opening an input device located at a specified path:
Device device_open(const char *path);
That function works well, but while I'm writing unit tests for it, I wanted to create file fixtures with different properties (existence of the file, read access, etc.) to check the error handling of my function and memory management (as I store data in a structure).
What I have already done
But testing it with a real input device (located at /dev/input/event*) needs root access rights. Setting read access for everyone on /dev/input/event* files works but seems risky to me. Executing my tests as root is worse !
Creating a device using mknod works but needs to be done as root.
I also tried to use character special files (because input devices are one of those) allowing read for everyone (like /dev/random, /dev/zero, /dev/null and even the terminal device i'm currently using: /dev/tty2).
But those devices does not handles ioctl requests needed by libevdev: EVIOCGBIT is the first request returning an error "Inappropriate ioctl for device".
What I'm looking for
I want to be able to create device files as a regular user (the user executing the unit tests). Then, by setting access rights I should be able to test my function behavior for different kinds of file (read only, no read allowed, bad device type, etc.).
If it appears to be impossible, I will certainly refactor my function using private helpers. But how to do it. Any examples ?
Thanks.
Edit: I tried to express better my needs.
Create a group for users who are allowed to access the device, and an udev rule to set the ownership of that input event device to that group.
I use teensy (system) group:
sudo groupadd -r teensy
and add each user into it using e.g.
sudo usermod -a -g teensy my-user-name
or whatever graphical user interface I have available.
By managing which users and service daemons belong to the teensy group, you can easily manage the access to the devices.
For my Teensy microcontrollers (that have native USB, and I use for HID testing), I have the following /lib/udev/rules.d/49-teensy.rules:
ATTRS{idVendor}=="16c0", ATTRS{idProduct}=="04[789B]?", ENV{ID_MM_DEVICE_IGNORE}="1"
ATTRS{idVendor}=="16c0", ATTRS{idProduct}=="04[789A]?", ENV{MTP_NO_PROBE}="1"
SUBSYSTEMS=="usb", ATTRS{idVendor}=="16c0", ATTRS{idProduct}=="04[789ABCD]?", GROUP:="teensy", MODE:="0660"
KERNEL=="ttyACM*", ATTRS{idVendor}=="16c0", ATTRS{idProduct}=="04[789B]?", GROUP:="teensy", MODE:="0660"
You only need the third line (SUBSYSTEMS=="usb", one) for HID devices, though. Make sure the idVendor and idProduct match your USB HID device. You can use lsusb to list the currently connected USB devices vendor and product numbers. The matching uses glob patterns, just like file names.
After adding the above, don't forget running sudo udevadm control --reload-rules && sudo udevadm trigger to reload the rules. Next time you plug in your USB HID device, all members of your group (teensy in the above) can access it directly.
Note that by default in most distributions, udev also creates persistent symlinks in /dev/input/by-id/ using the USB device type and serial. In my case, one of my Teensy LC's (serial 4298820) with a combined keyboard-mouse-joystic device provides /dev/input/by-id/usb-Teensyduino_Keyboard_Mouse_Joystick_4298820-event-kbd for the keyboard event device, /dev/input/by-id/usb-Teensyduino_Keyboard_Mouse_Joystick_4298820-if01-event-mouse for the mouse event device, and /dev/input/by-id/usb-Teensyduino_Keyboard_Mouse_Joystick_4298820-if03-event-joystick and /dev/input/by-id/usb-Teensyduino_Keyboard_Mouse_Joystick_4298820-if04-event-joystick for the two joystick interfaces.
(By "persistent", I do not mean these symlinks always exist; I mean that whenever that particular device is plugged in, the symlink of exactly that name exists, and points to the actual Linux input event character device.)
The Linux uinput device can be used to implement a virtual input event device using a simple privileged daemon.
The process to create a new virtual USB input event device goes as follows.
Open /dev/uinput for writing (or reading and writing):
fd = open("/dev/uinput", O_RDWR);
if (fd == -1) {
fprintf(stderr, "Cannot open /dev/uinput: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
The above requires superuser privileges. However, immediately after opening the device, you can drop all privileges, and have your daemon/service run as a dedicated user instead.
Use the UI_SET_EVBIT ioctl for each event type allowed.
You will want to allow at least EV_SYN; and EV_KEY for keyboards and mouse buttons, and EV_REL for mouse movement, and so on.
if (ioctl(fd, UI_SET_EVBIT, EV_SYN) == -1 ||
ioctl(fd, UI_SET_EVBIT, EV_KEY) == -1 ||
ioctl(fd, UI_SET_EVBIT, EV_REL) == -1) {
fprintf(stderr, "Uinput event types not allowed: %s.\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}
I personally use a static constant array with the codes, for easier management.
Use the UI_SET_KEYBIT ioctl for each key code the device may emit, and UI_SET_RELBIT ioctl for each relative movement code (mouse code). For example, to allow space, left mouse button, horizontal and vertical mouse movement, and mouse wheel:
if (ioctl(fd, UI_SET_KEYBIT, KEY_SPACE) == -1 ||
ioctl(fd, UI_SET_KEYBIT, BTN_LEFT) == -1 ||
ioctl(fd, UI_SET_RELBIT, REL_X) == -1 ||
ioctl(fd, UI_SET_RELBIT, REL_Y) == -1 ||
ioctl(fd, UI_SET_RELBIT, REL_WHEEL) == -1) {
fprintf(stderr, "Uinput event types not allowed: %s.\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}
Again, static const arrays (one for UI_SET_KEYBIT and one for UI_SET_RELBIT codes) is much easier to maintain.
Define a struct uinput_user_dev, and write it to the device.
If you have name containing the device name string, vendor and product with the USB vendor and product ID numbers, version with a version number (0 is fine), use
struct uinput_user_dev dev;
memset(&dev, 0, sizeof dev);
strncpy(dev.name, name, UINPUT_MAX_NAME_SIZE);
dev.id.bustype = BUS_USB;
dev.id.vendor = vendor;
dev.id.product = product;
dev.id.version = version;
if (write(fd, &dev, sizeof dev) != sizeof dev) {
fprintf(stderr, "Cannot write an uinput device description: %s.\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}
Later kernels have an ioctl to do the same thing (apparently being involved in systemd development causes this kind of drain bamage);
struct uinput_setup dev;
memset(&dev, 0, sizeof dev);
strncpy(dev.name, name, UINPUT_MAX_NAME_SIZE);
dev.id.bustype = BUS_USB;
dev.id.vendor = vendor;
dev.id.product = product;
dev.id.version = version;
if (ioctl(fd, UI_DEV_SETUP, &dev) == -1) {
fprintf(stderr, "Cannot write an uinput device description: %s.\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}
The idea seems to be that instead of using the former, you can try the latter first, and if it fails, do the former instead. You know, because a single interface might some day not be enough. (That's what the documentation and commit say, anyway.)
I might sound a bit cranky, here, but that's just because I do subscribe to both the Unix philosophy and the KISS principle (or minimalist approach), and see such warts completely unnecessary. And too often coming from the same loosely related group of developers. Ahem. No personal insult intended; I just think they are doing poor job.
Create the virtual device, by issuing an UI_DEV_CREATE ioctl:
if (ioctl(fd, UI_DEV_CREATE) == -1) {
fprintf(stderr, "Cannot create the virtual uinput device: %s.\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}
At this point, the kernel will construct the device, provide the corresponding event to the udev daemon, and the udev daemon will construct the device node and symlink(s) according to its configuration. All this will take a bit of time -- a fraction of a second in the real world, but enough that trying to emit events immediately might cause some of them to be lost.
Emit the input events (struct input_event) by writing to the uinput device.
You can write one or more struct input_events at a time, and should never see short writes (unless you try to write a partial event structure). Partial event structures are completely ignored. (See drivers/input/misc/uinput.c:uinput_write() uinput_inject_events() for how the kernel handles such writes.)
Many actions consists of more than one struct input_event. For example, you might move the mouse diagonally (emitting both { .type == EV_REL, .code == REL_X, .value = xdelta } and { .type == EV_REL, .code == REL_Y, .value = ydelta } for that single movement). The synchronization events ({ .type == EV_SYN, .code == 0, .value == 0 }) are used as a sentinel or separator, denoting the end of related events.
Because of this, you'll need to append an { .type == EV_SYN, .code == 0, .value == 0 } input event after each individual action (mouse movement, key press, key release, and so on). Think of it as the equivalent of a newline, for line-buffered input.
For example, the following code moves the mouse diagonally down right by a single pixel.
struct input_event event[3];
memset(event, 0, sizeof event);
event[0].type = EV_REL;
event[0].code = REL_X;
event[0].value = +1; /* Right */
event[1].type = EV_REL;
event[1].code = REL_Y;
event[1].value = +1; /* Down */
event[2].type = EV_SYN;
event[2].code = 0;
event[2].value = 0;
if (write(fd, event, sizeof event) != sizeof event)
fprintf(stderr, "Failed to inject mouse movement event.\n");
The failure case is not fatal; it only means the events were not injected (although I don't see how that could happen in current kernels; better be defensive, just in case). You can simply retry the same again, or ignore the failure (but letting the user know, so they can investigate, if it ever happens). So log it or output a warning, but no need for it to cause the daemon/service to exit.
Destroy the device:
ioctl(fd, UI_DEV_DESTROY);
close(fd);
The device does get automatically destroyed when the last duplicate of the original opened descriptor gets closed, but I recommend doing it explicitly as above.
Putting steps 1-5 in a function, you get something like
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/uinput.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
static const unsigned int allow_event_type[] = {
EV_KEY,
EV_SYN,
EV_REL,
};
#define ALLOWED_EVENT_TYPES (sizeof allow_event_type / sizeof allow_event_type[0])
static const unsigned int allow_key_code[] = {
KEY_SPACE,
BTN_LEFT,
BTN_MIDDLE,
BTN_RIGHT,
};
#define ALLOWED_KEY_CODES (sizeof allow_key_code / sizeof allow_key_code[0])
static const unsigned int allow_rel_code[] = {
REL_X,
REL_Y,
REL_WHEEL,
};
#define ALLOWED_REL_CODES (sizeof allow_rel_code / sizeof allow_rel_code[0])
static int uinput_open(const char *name, const unsigned int vendor, const unsigned int product, const unsigned int version)
{
struct uinput_user_dev dev;
int fd;
size_t i;
if (!name || strlen(name) < 1 || strlen(name) >= UINPUT_MAX_NAME_SIZE) {
errno = EINVAL;
return -1;
}
fd = open("/dev/uinput", O_RDWR);
if (fd == -1)
return -1;
memset(&dev, 0, sizeof dev);
strncpy(dev.name, name, UINPUT_MAX_NAME_SIZE);
dev.id.bustype = BUS_USB;
dev.id.vendor = vendor;
dev.id.product = product;
dev.id.version = version;
do {
for (i = 0; i < ALLOWED_EVENT_TYPES; i++)
if (ioctl(fd, UI_SET_EVBIT, allow_event_type[i]) == -1)
break;
if (i < ALLOWED_EVENT_TYPES)
break;
for (i = 0; i < ALLOWED_KEY_CODES; i++)
if (ioctl(fd, UI_SET_KEYBIT, allow_key_code[i]) == -1)
break;
if (i < ALLOWED_KEY_CODES)
break;
for (i = 0; i < ALLOWED_REL_CODES; i++)
if (ioctl(fd, UI_SET_RELBIT, allow_rel_code[i]) == -1)
break;
if (i < ALLOWED_REL_CODES)
break;
if (write(fd, &dev, sizeof dev) != sizeof dev)
break;
if (ioctl(fd, UI_DEV_CREATE) == -1)
break;
/* Success. */
return fd;
} while (0);
/* FAILED: */
{
const int saved_errno = errno;
close(fd);
errno = saved_errno;
return -1;
}
}
static void uinput_close(const int fd)
{
ioctl(fd, UI_DEV_DESTROY);
close(fd);
}
which seem to work fine, and requires no libraries (other than the standard C library).
It is important to realize that the Linux input subsystem, including uinput and struct input_event, are binary interfaces to the Linux kernel, and therefore will be kept backwards compatible (except for pressing technical reasons, like security issues or serious conflicts with other parts of the kernel). (The desire to wrap everything under the freedesktop.org or systemd umbrella is not one.)

When does libcurl's curl_multi_perform start the transfer?

I am trying to understand the functionality of curl_multi_perform so that I can use it in my project. I found a sample code at https://gist.github.com/clemensg/4960504.
Below is the code where my doubt is:
for (i = 0; i < CNT; ++i) {
init(cm, i); // this is setting options in curl easy handles.
}
// I thought this statement will start transfer.
//A-- curl_multi_perform(cm, &still_running);
sleep(5); // put this to check when transfer starts.
do {
int numfds=0;
int res = curl_multi_wait(cm, NULL, 0, MAX_WAIT_MSECS, &numfds);
if(res != CURLM_OK) {
fprintf(stderr, "error: curl_multi_wait() returned %d\n", res);
return EXIT_FAILURE;
}
/*
if(!numfds) {
fprintf(stderr, "error: curl_multi_wait() numfds=%d\n", numfds);
return EXIT_FAILURE;
}
*/
//B-- curl_multi_perform(cm, &still_running);
} while(still_running);
My understanding is that when curl_multi_perform is called the transfer starts but in the above code the curl_multi_perform at label A does not start the transfer. I checked in the wireshark logs. I see the first log output when the control moves past the sleep() statement.
I even tried below code:
for (i = 0; i < CNT; ++i) {
init(cm, i); // this is setting options in curl easy handles.
curl_multi_perform(cm, &still_running);
sleep(5);
}
but, the result was the same. I didn't see any log in the wireshark while the control was in this loop, but, once I started seeing the logs in wireshark they were at 5 seconds interval.
Apart from these doubts, the other doubts I have are:
Why there are two curl_multi_perform at label A & B?
Can I call curl_multi_perform multiple times while adding handles
in between the calls?
Help appreciated.
Thanks
curl_multi_perform works in a non-blocking way. It'll do as much as it can without blocking and then it returns, expecting to get called again when needed. The first call is thus most likely to start resolving the name used in the URL, and then perhaps the second or third call starts the actual transfer or something. It is designed so that an application shouldn't have to care about which exact function call number that does it.
Then you keep calling it until all the transfers are completed.
I've tried to explain this concept in a chapter in the "everything curl" book: Driving with the "multi" interface

Inject shared library into a process

I just started to learn injection techniques in Linux and want to write a simple program to inject a shared library into a running process. (the library will simply print a string.) However, after a couple of hours research, I couldn't find any complete example. Well, I did figure out I probably need to use ptrace() to pause the process and inject the contents, but not sure how to load the library into the memory space of target process and relocation stuff in C code. Does anyone know any good resources or working examples for shared library injection? (Of course, I know there might be some existing libraries like hotpatch I can use to make injection much easier but that's not what I want)
And if anyone can write some pseudo code or give me an example, I will appreciate it. Thanks.
PS: I am not asking about LD_PRELOAD trick.
The "LD_PRELOAD trick" André Puel mentioned in a comment to the original question, is no trick, really. It is the standard method of adding functionality -- or more commonly, interposing existing functionality -- in a dynamically-linked process. It is standard functionality provided by ld.so, the Linux dynamic linker.
The Linux dynamic linker is controlled by environment variables (and configuration files); LD_PRELOAD is simply an environment variable that provides a list of dynamic libraries that should be linked against each process. (You could also add the library to /etc/ld.so.preload, in which case it is automatically loaded for every binary, regardless of the LD_PRELOAD environment variable.)
Here's an example, example.c:
#include <unistd.h>
#include <errno.h>
static void init(void) __attribute__((constructor));
static void wrerr(const char *p)
{
const char *q;
int saved_errno;
if (!p)
return;
q = p;
while (*q)
q++;
if (q == p)
return;
saved_errno = errno;
while (p < q) {
ssize_t n = write(STDERR_FILENO, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != (ssize_t)-1 || errno != EINTR)
break;
}
errno = saved_errno;
}
static void init(void)
{
wrerr("I am loaded and running.\n");
}
Compile it to libexample.so using
gcc -Wall -O2 -fPIC -shared example.c -ldl -Wl,-soname,libexample.so -o libexample.so
If you then run any (dynamically linked) binary with the full path to libexample.so listed in LD_PREALOD environment variable, the binary will output "I am loaded and running" to standard output before its normal output. For example,
LD_PRELOAD=$PWD/libexample.so date
will output something like
I am loaded and running.
Mon Jun 23 21:30:00 UTC 2014
Note that the init() function in the example library is automatically executed, because it is marked __attribute__((constructor)); that attribute means the function will be executed prior to main().
My example library may seem funny to you -- no printf() et cetera, wrerr() messing with errno --, but there are very good reasons I wrote it like this.
First, errno is a thread-local variable. If you run some code, initially saving the original errno value, and restoring that value just before returning, the interrupted thread will not see any change in errno. (And because it is thread-local, nobody else will see any change either, unless you try something silly like &errno.) Code that is supposed to run without the rest of the process noticing random effects, better make sure it keeps errno unchanged in this manner!
The wrerr() function itself is a simple function that writes a string safely to standard error. It is async-signal-safe (meaning you can use it in signal handlers, unlike printf() et al.), and other than errno which is kept unchanged, it does not affect the state of the rest of the process in any way. Simply put, it is a safe way to output strings to standard error. It is also simple enough for everbody to understand.
Second, not all processes use standard C I/O. For example, programs compiled in Fortran do not. So, if you try to use standard C I/O, it might work, it might not, or it might even confuse the heck out of the target binary. Using the wrerr() function avoids all that: it will just write the string to standard error, without confusing the rest of the process, no matter what programming language it was written in -- well, as long as that language's runtime does not move or close the standard error file descriptor (STDERR_FILENO == 2).
To load that library dynamically in a running process, you'll need to first attach ptrace to it, then stop it before next entry to a syscall (PTRACE_SYSEMU), to make sure you're somewhere you can safely do the dlopen call.
Check /proc/PID/maps to verify you are within the process' own code, not in shared library code. You can do PTRACE_SYSCALL or PTRACE_SYSEMU to continue to next candidate stopping point. Also, remember to wait() for the child to actually stop after attaching to it, and that you attach to all threads.
While stopped, use PTRACE_GETREGS to get the register state, and PTRACE_PEEKTEXT to copy enough code, so you can replace it with PTRACE_POKETEXT to a position-independent sequence that calls dlopen("/path/to/libexample.so", RTLD_NOW), RTLD_NOW being an integer constant defined for your architecture in /usr/include/.../dlfcn.h, typically 2. Since the pathname is constant string, you can save it (temporarily) over the code; the function call takes a pointer to it, after all.
Have that position-independent sequence you used to rewrite some of the existing code end with a syscall, so that you can run the inserted using PTRACE_SYSCALL (in a loop, until it ends up at that inserted syscall) without having to single-step it. Then you use PTRACE_POKETEXT to revert the code to its original state, and finally PTRACE_SETREGS to revert the program state to what its initial state was.
Consider this trivial program, compiled as say target:
#include <stdio.h>
int main(void)
{
int c;
while (EOF != (c = getc(stdin)))
putc(c, stdout);
return 0;
}
Let's say we're already running that (pid $(ps -o pid= -C target)), and we wish to inject code that prints "Hello, world!" to standard error.
On x86-64, kernel syscalls are done using the syscall instruction (0F 05 in binary; it's a two-byte instruction). So, to execute any syscall you want on behalf of a target process, you need to replace two bytes. (On x86-64 PTRACE_POKETEXT actually transfers a 64-bit word, preferably aligned on a 64-bit boundary.)
Consider the following program, compiled to say agent:
#define _GNU_SOURCE
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
struct user_regs_struct oldregs, regs;
unsigned long pid, addr, save[2];
siginfo_t info;
char dummy;
if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s PID ADDRESS\n", argv[0]);
fprintf(stderr, "\n");
return 1;
}
if (sscanf(argv[1], " %lu %c", &pid, &dummy) != 1 || pid < 1UL) {
fprintf(stderr, "%s: Invalid process ID.\n", argv[1]);
return 1;
}
if (sscanf(argv[2], " %lx %c", &addr, &dummy) != 1) {
fprintf(stderr, "%s: Invalid address.\n", argv[2]);
return 1;
}
if (addr & 7) {
fprintf(stderr, "%s: Address is not a multiple of 8.\n", argv[2]);
return 1;
}
/* Attach to the target process. */
if (ptrace(PTRACE_ATTACH, (pid_t)pid, NULL, NULL)) {
fprintf(stderr, "Cannot attach to process %lu: %s.\n", pid, strerror(errno));
return 1;
}
/* Wait for attaching to complete. */
waitid(P_PID, (pid_t)pid, &info, WSTOPPED);
/* Get target process (main thread) register state. */
if (ptrace(PTRACE_GETREGS, (pid_t)pid, NULL, &oldregs)) {
fprintf(stderr, "Cannot get register state from process %lu: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Save the 16 bytes at the specified address in the target process. */
save[0] = ptrace(PTRACE_PEEKTEXT, (pid_t)pid, (void *)(addr + 0UL), NULL);
save[1] = ptrace(PTRACE_PEEKTEXT, (pid_t)pid, (void *)(addr + 8UL), NULL);
/* Replace the 16 bytes with 'syscall' (0F 05), followed by the message string. */
if (ptrace(PTRACE_POKETEXT, (pid_t)pid, (void *)(addr + 0UL), (void *)0x2c6f6c6c6548050fULL) ||
ptrace(PTRACE_POKETEXT, (pid_t)pid, (void *)(addr + 8UL), (void *)0x0a21646c726f7720ULL)) {
fprintf(stderr, "Cannot modify process %lu code: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Modify process registers, to execute the just inserted code. */
regs = oldregs;
regs.rip = addr;
regs.rax = SYS_write;
regs.rdi = STDERR_FILENO;
regs.rsi = addr + 2UL;
regs.rdx = 14; /* 14 bytes of message, no '\0' at end needed. */
if (ptrace(PTRACE_SETREGS, (pid_t)pid, NULL, &regs)) {
fprintf(stderr, "Cannot set register state from process %lu: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Do the syscall. */
if (ptrace(PTRACE_SINGLESTEP, (pid_t)pid, NULL, NULL)) {
fprintf(stderr, "Cannot execute injected code to process %lu: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Wait for the client to execute the syscall, and stop. */
waitid(P_PID, (pid_t)pid, &info, WSTOPPED);
/* Revert the 16 bytes we modified. */
if (ptrace(PTRACE_POKETEXT, (pid_t)pid, (void *)(addr + 0UL), (void *)save[0]) ||
ptrace(PTRACE_POKETEXT, (pid_t)pid, (void *)(addr + 8UL), (void *)save[1])) {
fprintf(stderr, "Cannot revert process %lu code modifications: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Revert the registers, too, to the old state. */
if (ptrace(PTRACE_SETREGS, (pid_t)pid, NULL, &oldregs)) {
fprintf(stderr, "Cannot reset register state from process %lu: %s.\n", pid, strerror(errno));
ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL);
return 1;
}
/* Detach. */
if (ptrace(PTRACE_DETACH, (pid_t)pid, NULL, NULL)) {
fprintf(stderr, "Cannot detach from process %lu: %s.\n", pid, strerror(errno));
return 1;
}
fprintf(stderr, "Done.\n");
return 0;
}
It takes two parameters: the pid of the target process, and the address to use to replace with the injected executable code.
The two magic constants, 0x2c6f6c6c6548050fULL and 0x0a21646c726f7720ULL, are simply the native representation on x86-64 for the 16 bytes
0F 05 "Hello, world!\n"
with no string-terminating NUL byte. Note that the string is 14 characters long, and starts two bytes after the original address.
On my machine, running cat /proc/$(ps -o pid= -C target)/maps -- which shows the complete address mapping for the target -- shows that target's code is located at 0x400000 .. 0x401000. objdump -d ./target shows that there is no code after 0x4006ef or so. Therefore, addresses 0x400700 to 0x401000 are reserved for executable code, but do not contain any. The address 0x400700 -- on my machine; may very well differ on yours! -- is therefore a very good address for injecting code into target while it is running.
Running ./agent $(ps -o pid= -C target) 0x400700 injects the necessary syscall code and string to the target binary at 0x400700, executes the injected code, and replaces the injected code with original code. Essentially, it accomplishes the desired task: for target to output "Hello, world!" to standard error.
Note that Ubuntu and some other Linux distributions nowadays allow a process to ptrace only their child processes running as the same user. Since target is not a child of agent, you either need to have superuser privileges (run sudo ./agent $(ps -o pid= -C target) 0x400700), or modify target so that it explicitly allows the ptracing (for example, by adding prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY); near the start of the program). See man ptrace and man prctl for details.
Like I explained already above, for longer or more complicated code, use ptrace to cause the target to first execute mmap(NULL, page_aligned_length, PROT_READ | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0), which allocates executable memory for new code. So, on x86-64, you only need to locate one 64-bit word you can replace safely, and then you can PTRACE_POKETEXT the new code for the target to execute. While my example uses the write() syscall, it is a really small change to have it use mmap() or mmap2() syscall instead.
(On x86-64 in Linux, the syscall number is in rax, and parameters in rdi, rsi, rdx, r10, r8, and r9, reading from left to right, respectively; and return value is also in rax.)
Parsing /proc/PID/maps is very useful -- see /proc/PID/maps under man 5 proc. It provides all the pertinent information on the target process address space. To find out whether there are useful unused code areas, parse objdump -wh /proc/$(ps -o pid= -C target)/exe output; it examines the actual binary of the target process directly. (In fact, you could easily find how much unused code there is at the end of the code mapping, and use that automatically.)
Further questions?

How to find if the underlying Linux Kernel supports Copy on Write?

For example, I am working on an ancient kernel and want to know whether it really implements Copy on Write. Is there a way ( preferably programattically in C ) to find out?
No, there isn't a reliable programmatic way to find that out from within a userland process.
The idea behind COW is that it should be fully transparent to the user code. Your code touches the individual pages, a page fault is invoked, the kernel copies the corresponding page and your process is resumed as if nothing had happened.
I casually stumbled upon this rather old question, and I see that other people already pointed out that it does not make much sense to "detect CoW" since Linux already implies CoW.
However I find this question pretty interesting, and while technically one should not be able to detect this kind of kernel mechanism which should be completely transparent to userspace processes, there actually are architecture specific ways (i.e. side-channels) that can be exploited to determine whether Copy on Write happens or not.
On x86 processors that support Restricted Transactional Memory, you can leverage the fact that memory transactions are aborted when an exception such as a page fault occurs. Given a valid address, this information can be used to detect if a page is resident in memory or not (similarly to the use of minicore(2)), or even to detect Copy on Write.
Here's a working example. Note: check that your processor supports RTM by looking at /proc/cpuinfo for the rtm flag, and compile using GCC without optimizations and with the -mrtm flag.
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <immintrin.h>
/* Use x86 transactional memory to detect a page fault when trying to write
* at the specified address, assuming it's a valid address.
*/
static int page_dirty(void *page) {
unsigned char *p = page;
if (_xbegin() == _XBEGIN_STARTED) {
*p = 0;
_xend();
/* Transaction successfully ended => no context switch happened to
* copy page into virtual memory of the process => page was dirty.
*/
return 1;
} else {
/* Transaction aborted => page fault happened and context was switched
* to copy page into virtual memory of the process => page wasn't dirty.
*/
return 0;
}
/* Should not happen! */
return -1;
}
int main(void) {
unsigned char *addr;
addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap failed");
return 1;
}
// Write to trigger initial page fault and actually reserve memory
*addr = 123;
fprintf(stderr, "Initial state : %d\n", page_dirty(addr));
fputs("----- fork -----\n", stderr);
if (fork()) {
fprintf(stderr, "Parent before : %d\n", page_dirty(addr));
// Read (should NOT trigger Copy on Write)
*addr;
fprintf(stderr, "Parent after R: %d\n", page_dirty(addr));
// Write (should trigger Copy on Write)
*addr = 123;
fprintf(stderr, "Parent after W: %d\n", page_dirty(addr));
} else {
fprintf(stderr, "Child before : %d\n", page_dirty(addr));
// Read (should NOT trigger Copy on Write)
*addr;
fprintf(stderr, "Child after R : %d\n", page_dirty(addr));
// Write (should trigger Copy on Write)
*addr = 123;
fprintf(stderr, "Child after W : %d\n", page_dirty(addr));
}
return 0;
}
Output on my machine:
Initial state : 1
----- fork -----
Parent before : 0
Parent after R: 0
Parent after W: 1
Child before : 0
Child after R : 0
Child after W : 1
As you can see, writing to pages marked as CoW (in this case after fork), causes the transaction to fail because a page fault exception is triggered and causes a transaction abort. The changes are reverted by hardware before the transaction is aborted. After writing to the page, trying to do the same thing again results in the transaction correctly terminating and the function returning 1.
Of course, this should not really be used seriously, but merely be taken as a fun and interesting exercise. Since RTM transactions are aborted for any kind of exception and also for context switch, false negatives are possible (for example if the process is preempted by the kernel right in the middle of the transaction). Keeping the transaction code really short (in the above case just a branch and an assignment *p = 0) is essential. Multiple tests could also be made to avoid false negatives.

Resources