System call stat() converted to stat64() without any cpp option - c

I am running 32-bit SUSE Linux with kernel level 3.0.76.
I can see a stat() call in my code translate to stat64() in the strace output, without me specifying any CPP options like _LARGEFILE64_SOURCE or _FILE_OFFSET_BITS=64.
#include<stdio.h>
#include<sys/stat.h>
int main ( int argc, char * argv[] )
{
char * path = "nofile";
struct stat b;
if (stat(path, &b) != 0) {
}
}
I compiled this file with gcc with no compiler options/flags.
On running the program, relevant strace output is:
munmap(0xb770a000, 200704) = 0
stat64("nofile", 0xbfb17834) = -1 ENOENT (No such file or directory)
exit_group(-1)
Can anyone please tell me how the stat() was converted to stat64() ?
Thanks in advance!

The answer seems to be found in the stat man page
Over time, increases in the size of the stat structure have led to
three successive versions of stat(): sys_stat() (slot __NR_oldstat),
sys_newstat() (slot __NR_stat), and sys_stat64() (new in kernel 2.4;
slot __NR_stat64). The glibc stat() wrapper function hides these
details from applications, invoking the most recent version of the
system call provided by the kernel, and repacking the returned
information if required for old binaries. Similar remarks apply for
fstat() and lstat().
Basically, glibc always call stat64.
If you add a printf("%zu\n", sizeof b); , the struct stat sizes are likely different depending on whether you use _FILE_OFFSET_BITS=64 or not, and glibc converts the struct stat between the kernel and your code.

I can see a stat() call in my code translate to stat64() in the strace output
You need to analyze the way syscall stat is wrapped on your system http://sourceware.org/glibc/wiki/SyscallWrappers (one way that you can see this code is to use gdb: execute this command under gdb: catch syscall stat and when you program will stop on this breakpoint get backtrace and analyze the function that called stat).
From http://linux.die.net/man/2/stat64
Over time, increases in the size of the stat structure have led to
three successive versions of stat(): sys_stat() (slot __NR_oldstat),
sys_newstat() (slot __NR_stat), and sys_stat64() (new in kernel 2.4;
slot __NR_stat64). The glibc stat() wrapper function hides these
details from applications, invoking the most recent version of the
system call provided by the kernel, and repacking the returned
information if required for old binaries.
So it is glibc_wrapper that decides based on sizeof(struct stat) that it should call why stat64.

Related

How do the err/warn functions get the program name?

From the err/warn manpage:
The err() and warn() family of functions display a formatted error
message on the standard error output. In all cases, the last component
of the program name, a colon character, and a space are output. If the
fmt argument is not NULL, the printf(3)-like formatted error message is
output.
If I make this call: warn("message"); it will output something like this:
a.out: message: (output of strerror here)
How do the warn/err functions find the name of the program (in this case, a.out) without seemingly having any access to argv at all? Does it have anything to do with the fact that they are BSD extensions?
The err/warn functions prepend the basename of the program name. According to the answers to this SO post, there are a few ways to get the program name without access to argv.
One way is to call readlink on /proc/self/exe, then call basename on that. A simple program that demonstrates this:
#include <libgen.h>
#include <linux/limits.h>
#include <stdio.h>
#include <unistd.h>
char *
progname(void)
{
char path[PATH_MAX];
if (readlink("/proc/self/exe", path, PATH_MAX) == -1)
return NULL;
/* not sure if a basename-returned string should be
* modified, maybe don't use this in production */
return basename(path);
}
int
main(void)
{
printf("%s: this is my fancy warn message!\n", progname());
return 0;
}
You can also use the nonstandard variables __progname, which may not work depending on your compiler, and program_invocation_short_name, which is a GNU extension defined in errno.h.
In pure standard C, there's no way to get the program name passed as argv[0] without getting it, directly or indirectly, from main. You can pass it as an argument to functions, or save it in a global variable.
But system functions also have the option of using system-specific methods. On open-source operating system, you can download the source code and see how it's done. For Unix-like systems, that's libc.
For example, on FreeBSD:
The warn and err functions call the internal system function _getprogname().
_getprogname() reads the global variable __progname.
__progname is set in handle_argv which is called from _start(). This code is not in libc, but in CSU, which is a separate library containing program startup code.
_start() is the program's entry point. It's defined in the architecture-specific crt1*.c. It's also the function that calls main, and it passes the same argv to both handle_argv() and main().
_start is the first C function called in the program. It's called from assembly code that reads the argv pointer from the stack.
The program arguments are copied into the program's address space by the kernel as part of the implementation of the execve system call.
Note that there are several concepts of “program name” and they aren't always equivalent. See Finding current executable's path without /proc/self/exe for a discussion of the topic.
How do the warn/err functions find the name of the program (in this case, a.out) without seemingly having any access to argv at all? Does it have anything to do with the fact that they are BSD extensions?
Such things can easily be figured out using the strace utility, which records all system calls.
I wrote the highly complex program test.c:
#include <err.h>
int main() { warn("foo"); }
and gcc -o test -static test.c; strace ./test yields (the -static to avoid the noise from trying to load a lot of libraries):
execve("./test", ["./test"], 0x7fffcbb7fd60 /* 101 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd388d6540) = -1 EINVAL (Invalid argument)
brk(NULL) = 0x201e000
brk(0x201edc0) = 0x201edc0
arch_prctl(ARCH_SET_FS, 0x201e3c0) = 0
set_tid_address(0x201e690) = 55889
set_robust_list(0x201e6a0, 24) = 0
uname({sysname="Linux", nodename="workhorse", ...}) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
readlink("/proc/self/exe", "/tmp/test", 4096) = 9
getrandom("\x43\xff\x90\x4b\xa8\x82\x38\xdd", 8, GRND_NONBLOCK) = 8
brk(0x203fdc0) = 0x203fdc0
brk(0x2040000) = 0x2040000
mprotect(0x4b6000, 16384, PROT_READ) = 0
write(2, "test: ", 6) = 6
write(2, "foo", 3) = 3
write(2, ": Success\n", 10) = 10
exit_group(0) = ?
+++ exited with 0 +++
And there you have it: you can just readlink /proc/self/exe to know what you're called.

poll() returns EINVAL for more than 256 descriptors on macOS

Here is the example code that crashes:
#include <stdio.h>
#include <poll.h>
#include <stdlib.h>
#include <limits.h>
#define POLL_SIZE 1024
int main(int argc, const char * argv[]) {
printf("%d\n", OPEN_MAX);
struct pollfd *poll_ = calloc(POLL_SIZE, sizeof(struct pollfd));
if (poll(poll_, POLL_SIZE, -1) < 0)
if (errno == EINVAL)
perror("poll error");
return 0;
}
If you set POLL_SIZE to 256 or less, the code works just fine. What's interesting is that if you run this code in Xcode, it executes normally, but if you run the binary yourself you get a crash.
The output is like this:
10240
poll error: Invalid argument
According to poll(2):
[EINVAL] The nfds argument is greater than OPEN_MAX or the
timeout argument is less than -1.
As you can see, POLL_SIZE is a lot smaller than the limit, and the timeout is exactly -1, yet it crashed.
My clang version that I use for manual building:
Configured with: prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/c++/4.2.1
Apple LLVM version 8.1.0 (clang-802.0.41)
Target: x86_64-apple-darwin16.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
On Unix systems, processes have limits on resources. See e.g. getrlimit. You might change them (using setrlimit) and your sysadmin can also change them (e.g. configure these limits at startup or login time). There is a limit RLIMIT_NOFILE related to file descriptors. Read also about the ulimit bash builtin. See also sysconf with _SC_OPEN_MAX.
The poll system call is given a not-too-big array, and it is bad taste (possible, but inefficient) to repeat some file descriptor in it. So in practice you'll often use it with a quite small array mentioning different (but valid) file descriptors. The second argument to poll is the number of useful entries (practically, all different), not the allocated size of the array.
You may deal with many file descriptors. Read about the C10K problem.
BTW, your code is not crashing. poll is failing as documented (but not crashing).
You should read some POSIX programming book. Advanced Linux Programming is freely available, and most of it is on POSIX (not Linux specific) things.

Determine `OSTYPE` during runtime in C program

In a C program, I need to find the OSTYPE during runtime, on the basis of which I will do some operations.
Here is the code
#include <stdlib.h>
#include <string.h>
int main () {
const char * ostype = getenv("OSTYPE");
if (strcasecmp(ostype, /* insert os name */) == 0) ...
return 0;
}
But getenv returns NULL (and there is segmentation fault). When I do a echo $OSTYPE in the terminal it prints darwin15 . But when I do env | grep OSTYPE nothing gets printed, which means it is not in the list of environment variables. To make it work on my local machine I can go to the .bash_profile and export the OSTYPE manually but that doesn't solve the problem if I want to run a generated executable on a new machine.
Why is OSTYPE available while running terminal, but apparently not there in the list of environment variables. How to get around this ?
For the crash, you should check if the return was NULL or not before using it in strcmp or any function. From man 3 getenv:
The getenv() function returns a pointer to the value in the
environment, or NULL if there is no match.
If you're at POSIX (most Unix's and somehow all Linux's), I agree with Paul's comment on uname.
But actually you can check for OSTYPE at compile time with precompiler (with #ifdef's), here's a similar question on so: Determine OS during runtime
Edit: uname
Good point Jonathan. man 2 uname on my linux tells how to use (and begin POSIX, macos has the same header, too):
SYNOPSIS
#include <sys/utsname.h>
int uname(struct utsname *buf);
DESCRIPTION
uname() returns system information in the structure pointed to by buf. The utsname struct is
defined in :
struct utsname {
char sysname[]; /* Operating system name (e.g., "Linux") */
char nodename[]; /* Name within "some implementation-defined
network" */
char release[]; /* Operating system release (e.g., "2.6.28") */
char version[]; /* Operating system version */
char machine[]; /* Hardware identifier */
#ifdef _GNU_SOURCE
char domainname[]; /* NIS or YP domain name */
#endif
};

What are _nocancel() system calls in linux, and is there a way to use LD_PRELOAD to intercept them

Please let me know what are the _nocancel() system calls (e.g. __pwrite_nocancel(), and whether there is a way to create an LD_PRELOAD library to intercept those calls. Here's a bit of the background:
I'm investigating functionality of an Oracle database, and would like to add a small shim layer using LD_PRELOAD to capture information about the call in user space. I'm aware of other ways of capturing this information using system tap, but using LD_PRELOAD is a hard requirement from the customer.
strace shows that this particular process is calling pwrite() repeatedly; similarly, the pstack stack trace shows that __pwrite_nocancel() is being called as the last entry on the stack. I tried reproducing my own __libc_pwrite() function, and declaring
extern ssize_t pwrite(int fd, const void *buf, size_t numBytes, off_t offset)__attribute__((weak, alias ( "__libc_pwrite")));
but when I link the library and run nm -a |grep pwrite, I get this:
000000000006c190 T __libc_pwrite
000000000006c190 W pwrite
in contrast, nm -a /lib64/libpthread.so.0 |grep pwrite gives the following:
000000000000eaf0 t __libc_pwrite
000000000000eaf0 t __libc_pwrite64
000000000000eaf0 W pwrite
000000000000eaf0 t __pwrite
000000000000eaf0 W pwrite64
000000000000eaf0 W __pwrite64
0000000000000000 a pwrite64.c
000000000000eaf9 t __pwrite_nocancel
I've noticed that the _nocancel version is only 9 bytes ahead of the __pwrite, but looking at the source code, I'm unsure as to where it's being created:
/* Copyright (C) 1997, 1998, 2000, 2002, 2003, 2004, 2006
Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Ulrich Drepper <drepper#cygnus.com>, 1997.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <endian.h>
#include <sysdep-cancel.h>
#include <sys/syscall.h>
#include <bp-checks.h>
#include <kernel-features.h>
#ifdef __NR_pwrite64 /* Newer kernels renamed but it's the same. */
# ifdef __NR_pwrite
# error "__NR_pwrite and __NR_pwrite64 both defined???"
# endif
# define __NR_pwrite __NR_pwrite64
#endif
#if defined __NR_pwrite || __ASSUME_PWRITE_SYSCALL > 0
# if __ASSUME_PWRITE_SYSCALL == 0
static ssize_t __emulate_pwrite (int fd, const void *buf, size_t count,
off_t offset) internal_function;
# endif
ssize_t
__libc_pwrite (fd, buf, count, offset)
int fd;
const void *buf;
size_t count;
off_t offset;
{
ssize_t result;
if (SINGLE_THREAD_P)
{
/* First try the syscall. */
result = INLINE_SYSCALL (pwrite, 6, fd, CHECK_N (buf, count), count, 0,
__LONG_LONG_PAIR (offset >> 31, offset));
# if __ASSUME_PWRITE_SYSCALL == 0
if (result == -1 && errno == ENOSYS)
/* No system call available. Use the emulation. */
result = __emulate_pwrite (fd, buf, count, offset);
# endif
return result;
}
int oldtype = LIBC_CANCEL_ASYNC ();
/* First try the syscall. */
result = INLINE_SYSCALL (pwrite, 6, fd, CHECK_N (buf, count), count, 0,
__LONG_LONG_PAIR (offset >> 31, offset));
# if __ASSUME_PWRITE_SYSCALL == 0
if (result == -1 && errno == ENOSYS)
/* No system call available. Use the emulation. */
result = __emulate_pwrite (fd, buf, count, offset);
# endif
LIBC_CANCEL_RESET (oldtype);
return result;
}
strong_alias (__libc_pwrite, __pwrite)
weak_alias (__libc_pwrite, pwrite)
# define __libc_pwrite(fd, buf, count, offset) \
static internal_function __emulate_pwrite (fd, buf, count, offset)
#endif
#if __ASSUME_PWRITE_SYSCALL == 0
# include <sysdeps/posix/pwrite.c>
#endif
Any help is appreciated.
The pwrite_nocancel() et cetera are not syscalls in Linux. They are functions internal to the C library, and tightly coupled with pthreads and thread cancellation.
The _nocancel() versions behave exactly the same as the original functions, except that these versions are not thread cancellation points.
Most I/O functions are cancellation points. That is, if the thread cancellation type is deferred and cancellation state is enabled, and another thread in the process has requested the thread to be cancelled, the thread will cancel (exit) when entering a cancellation point. See man 3 pthread_cancel, man 3 pthread_setcancelstate, and man 3 pthread_setcanceltype for further details.
Unfortunately, the pwrite_nocancel() and other _nocancel() functions are internal (local) to the pthreads library, and as such, are quite difficult to interpose; they're not dynamic symbols, so the dynamic linker cannot override them. At this point, I suspect, but am not sure, that the way to interpose them would involve rewriting the beginning of the library code with a direct jump to your own code.
If they were exported (global) functions, they could be interposed just like any other library function (these are provided by the pthread library, libpthread), using the normal approach. (Of my own answers here, you might find this, this, and this informative. Otherwise, just search for LD_PRELOAD example.)
Given that you want to interpose via LD_PRELOAD, you need to look at the same symbols rtld is using. Therefore, instead of nm -a, use objdump -T. If the symbol you want is not listed there, you cannot interpose it. All exported glibc symbols have a "version" string that is compared by rtld and allows having multiple versions of the same symbol (this is why even nm -D is not useful -- it doesn't show the versions). To interpose something, you should have the same version string. This can be done using a version script (see info ld), the .symver asm directive or a macro that invokes that directive. Note that symbols with a "private" version may change at any new revision of the libc, and that other symbols may be superseded by new ones at any new version (the symbols still work, but applications compiled against the new version don't use them anymore).

Get Linux system information in C

I have to check Linux system information. I can execute system commands in C, but doing so I create a new process for every one, which is pretty expensive. I was wondering if there is a way to obtain system information without being forced to execute a shell command. I've been looking around for a while and I found nothing. Actually, I'm not even sure if it's more convenient to execute commands via Bash calling them from my C program or find a way to accomplish the tasks using only C.
Linux exposes a lot of information under /proc. You can read the data from there. For example, fopen the file at /proc/cpuinfo and read its contents.
A presumably less known (and more complicated) way to do that, is that you can also use the api interface to sysctl. To use it under Linux, you need to #include <unistd.h>, #include <linux/sysctl.h>. A code example of that is available in the man page:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/sysctl.h>
int _sysctl(struct __sysctl_args *args );
#define OSNAMESZ 100
int
main(void)
{
struct __sysctl_args args;
char osname[OSNAMESZ];
size_t osnamelth;
int name[] = { CTL_KERN, KERN_OSTYPE };
memset(&args, 0, sizeof(struct __sysctl_args));
args.name = name;
args.nlen = sizeof(name)/sizeof(name[0]);
args.oldval = osname;
args.oldlenp = &osnamelth;
osnamelth = sizeof(osname);
if (syscall(SYS__sysctl, &args) == -1) {
perror("_sysctl");
exit(EXIT_FAILURE);
}
printf("This machine is running %*s\n", osnamelth, osname);
exit(EXIT_SUCCESS);
}
However, the man page linked also notes:
Glibc does not provide a wrapper for this system call; call it using
syscall(2). Or rather... don't call it: use of this system call has
long been discouraged, and it is so unloved that it is likely to
disappear in a future kernel version. Since Linux 2.6.24, uses of this
system call result in warnings in the kernel log. Remove it from your
programs now; use the /proc/sys interface instead.
This system call is available only if the kernel was configured with
the CONFIG_SYSCTL_SYSCALL option.
Please keep in mind that anything you can do with sysctl(), you can also just read() from /proc/sys. Also note that I do understand that the usefulness of that syscall is questionable, I just put it here for reference.
You can also use the sys/utsname.h header file to get the kernel version, hostname, operating system, machine hardware name, etc. More about sys/utsname.h is here. This is an example of getting the current kernel release.
#include <stdio.h> // I/O
#include <sys/utsname.h>
int main(int argc, char const *argv[])
{
struct utsname buff;
printf("Kernel Release = %s\n", buff.release); // kernel release
return 0;
}
This is the same as using the uname command. You can also use the -a option which stands for all information.
uname -r # -r stands for kernel release

Resources