Accessing "file" structure from user space

With reference to the Linux kernel, I would like to access the "file" structure information, such as the current file offset, from a user-space C program. How do I do it?
Thanks in advance

Is "in reference to linux kernel" relevant, or misleading information? That is, are you asking about the kernel-level open file description and its status, or the C library-level FILE * used in stdio? Either way, you cannot poke at the internals yourself. There are accessor functions you can use: ftello(f) for stdio, or lseek(fd, 0, SEEK_CUR) for file descriptors.
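The two accessor calls mentioned above can be wrapped as follows. This is only a minimal sketch, assuming a POSIX system; the helper names stream_offset and fd_offset are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Current offset of a stdio stream, or -1 on error.
   ftello() accounts for data still sitting in the stdio buffer. */
off_t stream_offset(FILE *f)
{
    assert(f != NULL);
    return ftello(f);
}

/* Current offset of a file descriptor, or -1 on error.
   Seeking 0 bytes from SEEK_CUR moves nothing and returns the position. */
off_t fd_offset(int fd)
{
    return lseek(fd, 0, SEEK_CUR);
}
```

Note that the two values can differ for the same file while stdio has unflushed or read-ahead data buffered; after fflush() on an output stream they should agree.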

You cannot access kernel structures in userspace.

Related

How does fread in C actually work?

I understand that fread() has the following function definition:
size_t fread(void *buffer, size_t size, size_t qty, FILE *inptr);
I also understand that inptr is a file pointer that is returned when a FILE pointer is opened using the fopen() function. My question is does inptr store the memory address of every single character/letter of the file in its memory? If that is the case, do the memory addresses from the inptr get copied to *buffer (pointer to buffer array)?
There is one more thing that I am confused about. For each time fread() is called, size * qty bytes of memory is being copied/transferred. Is it the content of the file pointed to by inptr itself or is the memory address of the content of the file that is being copied/transferred?
Would appreciate if someone can help me clear the confusion. Thank you :)

FILE is implemented by your system (in practice, by the C library on top of the operating system). The functions operating on FILE are implemented by your system too. You can't know the details from the interface alone; to find out, you need to browse the sources of your C library and operating system.
inptr may be a pointer to memory allocated by your system. Or it may be a number that your system uses to find its data. Either way, it's a handle that your system uses to find the FILE-specific data, and your system decides what is in that data. For caching purposes, maybe all the letters are cached in some buffer. Maybe not.
As for the fread call: fread reads data from the underlying entity behind the inptr handle. inptr is interpreted by your system to access the underlying memory or structure or device or hard drive or printer or keyboard or mouse or anything. It reads up to qty*size bytes of data. Those data are placed in the buffer. No pointers are placed there. The bytes that are read from the device are placed in the memory pointed to by buffer.

Your questions are a bit confusing (which is probably why you're asking them) so I'll do my best to answer.
FILE *inptr is a handle to the open file. You do not directly read it, it is just used to tell related functions what to operate on. You can kinda think of it like a human reading a file name in a folder, where the file name is used to identify the file, but the contents are accessed in another way.
As for the data, it is read from the file which is opened with fopen() and subsequently provided a file handle. The data does not directly correlate to the FILE pointer, and typically you should not be messing with the FILE pointer directly (don't try to read/write from it directly).
I tried to not get too technical as to the operation, as it seems you are new to C, but just kind of think of the FILE * as the computer's way of "naming" the file internally for its own usage, and the data buffer is merely the content.

You can think of fread as being implemented something like this:
size_t fread(char *ptr, size_t size, size_t nitems, FILE *fp)
{
    size_t i;
    for(i = 0; i < size * nitems; i++) {
        int c = getc(fp);
        if(c == EOF) break;
        *ptr++ = c;
    }
}
(I've left out the return value because in my simplified illustration there isn't a good way to show it.)
In other words, fread reads a bunch of characters as if by repeatedly calling getc(). So obviously this raises the question of how getc works.
What you have to know is that FILE * points to a structure which, one way or another, contains a buffer of some (not necessarily all) of the file's characters read into memory. So, in pseudocode, getc() looks like this:
int getc(FILE *fp)
{
    if(fp->buffer is empty) {
        fill fp->buffer by reading more characters from underlying file;
        if(that resulted in end-of-file)
            return EOF;
    }
    return(next character from fp->buffer);
}
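To make the pseudocode concrete, here is a rough, compilable sketch of the same idea on a POSIX system. It is not how any real C library implements FILE; struct toy_file, toy_getc and toy_fread are hypothetical names, and the 16-byte buffer is deliberately tiny.

```c
#include <assert.h>
#include <stdio.h>      /* only for EOF */
#include <unistd.h>     /* read() */

/* Hypothetical, much-simplified stand-in for FILE: a descriptor plus a buffer. */
struct toy_file {
    int fd;                 /* underlying file descriptor */
    unsigned char buf[16];  /* tiny on purpose, so refills happen often */
    size_t pos;             /* index of the next unread byte in buf */
    size_t len;             /* number of valid bytes in buf */
};

/* Return the next byte, refilling the buffer with read() when it is empty. */
int toy_getc(struct toy_file *tf)
{
    assert(tf != NULL);
    if (tf->pos == tf->len) {
        ssize_t n = read(tf->fd, tf->buf, sizeof tf->buf);
        if (n <= 0)
            return EOF;     /* end of file or read error */
        tf->pos = 0;
        tf->len = (size_t)n;
    }
    return tf->buf[tf->pos++];
}

/* fread-like loop built on toy_getc; returns the number of complete items. */
size_t toy_fread(void *ptr, size_t size, size_t nitems, struct toy_file *tf)
{
    unsigned char *out = ptr;
    size_t i, total = size * nitems;
    for (i = 0; i < total; i++) {
        int c = toy_getc(tf);
        if (c == EOF)
            break;
        out[i] = (unsigned char)c;
    }
    return size ? i / size : 0;
}
```

The bytes of the file end up in the caller's array; the only addresses involved are the caller's buffer and the stream's internal buffer.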

The answer to the question,
"how does fread() work?"
is basically
"it asks your operating system to read the file for you."
More or less the sole purpose of an operating system kernel is to perform actions like this on your behalf. The kernel hosts the device drivers for the disks and file systems, and is able to fetch data for your program no matter what the file is stored on (e.g. a FAT32 formatted HDD, a network share, etc).
The way in which fread() asks your operating system to fetch data from a file varies slightly between OS and CPU. Back in the good old days of MS-DOS, the fread() function would load up various parameters (calculated from the parameters your program gave to fread()) into CPU registers, and then raise an interrupt. The interrupt handler, which was actually part of MS-DOS, would then go and fetch the requested data, and place it in a given place in memory. The registers to be loaded and the interrupt to raise were all specified by the MS-DOS manuals. The parameters you pass to fread() are abstractions of those needed by the system call.
This is what's known as making a system call. Every operating system has a system calling interface. Libraries like glibc on Linux provide handy functions like fread() (which is part of the standard C library), and make the system call for you (which is not standardised between operating systems).
Note that this means that glibc is not a fundamental part of the operating system. It's just a library of routines that implements the C standard library around the system calls that Linux provides. This means you can use an alternative C library. For example, Android does not use glibc, even though it has a Linux kernel.
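As a rough illustration of the layer underneath fread() on a POSIX system, the sketch below reads a file using only open(), read() and close(), which a C library typically implements as thin wrappers around the corresponding system calls. The function name read_head is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to cap bytes from the start of path using only POSIX system-call
   wrappers, with no stdio buffering in between.
   Returns the number of bytes read, or -1 on error. */
ssize_t read_head(const char *path, void *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap);   /* the kernel copies file bytes into buf */
    close(fd);
    return n;
}
```

A call such as read_head("data.txt", buf, sizeof buf) asks the kernel for the bytes directly; fread() adds buffering and the size/count arithmetic on top of calls like this.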
Similarly on Windows. All software on Windows (C, C++, the .NET runtime, etc.) is written to use the Win32 API, which is provided by system DLLs such as kernel32.dll. The difference on Windows is that the NT kernel system call interface is not officially published, so programs are not expected to call it directly.
This leads to some interesting things.
WINE on Linux recreates the Win32 API DLLs, not the NT kernel system call interface.
Windows Subsystem for Linux on Windows 10 does recreate the Linux system calling interface (which is possible because it is public knowledge).
Solaris, QNX and FreeBSD pull the same trick.
Even more oddly, it looks like MS have written an NT kernel system interface shim for Linux (i.e., the thing that WINE hasn't done) to allow MS SQL Server to run on Linux. This is in effect a Linux Subsystem for Windows. They've not given this away.

Is it possible to fake a file stream, such as stdin, in C?

I am working on an embedded system with no filesystem and I need to execute programs that take input data from files specified via command-line arguments or directly from stdin.
I know it is possible to bake-in the file data with the binary using the method from this answer: C/C++ with GCC: Statically add resource files to executable/library but currently I would need to rewrite all the programs to access the data in a new way.
Is it possible to bake-in a text file, for example, and access it using a fake file pointer to stdin when running the program?

If your system is an OS-less bare-metal system, then your C library will have "retargeting" stubs or hooks that you need to implement to hook the library into the platform. This will typically include low-level I/O functions such as open(), read(), write(), seek() etc. You can implement these as you wish to implement the basic stdin, stdout, stderr streams (in POSIX and most other implementations they will have fixed file descriptors 0, 1 and 2 respectively, and do not need to be explicitly opened), file I/O and in this case for managing an arbitrary memory block.
open() for example will be passed a file or device name (the string may be interpreted any way you wish), and will return a file descriptor. You might perhaps recognise "cfgdata:" as a device name to access your "memory file", and you would return a unique descriptor that is then passed into read(). You use the descriptor to reference data for managing the stream; probably little more than an index that is incremented by the number of characters read. The same index may be set directly by the seek() implementation.
Once you have implemented these functions, the higher level stdio functions or even C++ iostreams will work normally for the devices or filesystems you have supported in your low level implementation.
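The shape of such stubs might look roughly like the sketch below. Everything here is an assumption for illustration: the real stub names and signatures depend on your C library (check its porting documentation), and mem_open, mem_read, mem_seek, cfg_data and CFG_FD are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical baked-in data; on a real target it might come from the linker. */
static const char cfg_data[] = "key=value\n";

/* Hypothetical per-descriptor state: just a read position into cfg_data. */
static size_t cfg_pos;

#define CFG_FD 3   /* assumed descriptor number, chosen to avoid 0, 1 and 2 */

/* open()-style stub: recognise the made-up device name "cfgdata:". */
int mem_open(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "cfgdata:") != 0)
        return -1;
    cfg_pos = 0;
    return CFG_FD;
}

/* read()-style stub: copy out up to len bytes and advance the position. */
long mem_read(int fd, void *buf, size_t len)
{
    if (fd != CFG_FD)
        return -1;
    size_t size = sizeof cfg_data - 1;   /* exclude the terminating NUL */
    size_t left = size - cfg_pos;
    if (len > left)
        len = left;
    memcpy(buf, cfg_data + cfg_pos, len);
    cfg_pos += len;
    return (long)len;                    /* 0 signals end of "file" */
}

/* seek()-style stub: set the position directly (absolute offsets only). */
long mem_seek(int fd, size_t offset)
{
    if (fd != CFG_FD || offset > sizeof cfg_data - 1)
        return -1;
    cfg_pos = offset;
    return (long)offset;
}
```

In a real port these bodies would sit behind whatever hook names the library expects, after which fopen("cfgdata:", "r") and the rest of stdio would read the embedded data.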

As commented, you could use the POSIX fmemopen function. You'll need a libc providing it, e.g. musl-libc or possibly glibc. BTW, for benchmarking purposes you might install some tiny Linux-like OS on your hardware, e.g. uClinux.

Passing memory address as a file in c

If I have some sort of library that has a function like so void foo(FILE* fp), and I have a char* array of data, is there any way for me to pass a pointer to the array or something similar to the foo function?
What I'm doing now is using system() to write to a temporary file, and I don't really like doing that.

No, the stdio library has no facility for defining your own buffering scheme.
Given POSIX and virtual memory, you can use mmap to open a memory region with a backing file. The OS may opt not to write the file to disk. Details vary by OS, but this can approach the ideal solution.
Or, as Barmar suggests, use pipe to send the data to the main thread from an auxiliary thread within your process.
As for the simple solution… why use system() when you can use mkstemp() (safer than the obsolete mktemp()) with fdopen() and fwrite()?

I think you want fmemopen. Assuming you're running on Linux (or something else posixy; if you need Windows support, will have to check), it should be available.
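A minimal sketch of the fmemopen approach, assuming a POSIX.1-2008 libc. The foo() here is only a stand-in for the library function in the question (it counts lines and returns an int, which the real one may not), and foo_on_buffer is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a library function that only accepts a FILE *.
   Here it simply counts the newline characters it reads. */
int foo(FILE *fp)
{
    assert(fp != NULL);
    int lines = 0, c;
    while ((c = getc(fp)) != EOF)
        if (c == '\n')
            lines++;
    return lines;
}

/* Wrap an in-memory buffer in a read-only stream and hand it to foo().
   Returns foo's result, or -1 if the stream could not be created. */
int foo_on_buffer(char *data, size_t len)
{
    FILE *fp = fmemopen(data, len, "r");
    if (fp == NULL)
        return -1;
    int result = foo(fp);
    fclose(fp);           /* closes the stream; the buffer itself is untouched */
    return result;
}
```

With char data[] = "one\ntwo\n", calling foo_on_buffer(data, strlen(data)) lets the library read the array as if it were a file, with no temporary file involved.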

Writing and reading to linux /proc/... filesystem without lseek()

In this source code http://man7.org/tlpi/code/online/dist/sysinfo/procfs_pidmax.c.html the file /proc/sys/kernel/pid_max is first simply read (using the read syscall) and then simply written (using the write syscall).
Why is it not necessary to lseek to the beginning before writing? I thought the file offset is shared between reads and writes (that's what the author of the associated book says).

This is because /proc is not a real file system, so writes to pid_max are handled in a way that does not need any seek. I don't even know whether seeks are supported here.
Just to give you a feeling for how different /proc files are, here is a reference to a pretty old but illustrative kernel bug specifically related to pid_max: https://bugzilla.kernel.org/show_bug.cgi?id=13090
This link should explain even more details: T H E /proc F I L E S Y S T E M
And finally, the developerWorks article "Access the Linux kernel using the /proc filesystem" gives a step-by-step illustration of kernel module code that uses the /proc FS API. This looks like exactly what you need.

I've looked at the kernel source: files under /proc/sys/ are controlled by the sysctl table, and the read/write callbacks for each entry do support a file offset. The pid_max entry, however, operates on a single int value, so the offset is not actually used in those callbacks.
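For contrast, on a regular file the offset really is shared between reads and writes, so a write after a read lands where the read stopped. The sketch below shows that behaviour; read_then_write is a hypothetical helper, and the 64-byte scratch buffer is an arbitrary limit for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path, read up to n bytes, then write msg WITHOUT seeking in between.
   Returns the file offset after the write, or -1 on error. */
off_t read_then_write(const char *path, size_t n,
                      const char *msg, size_t msg_len)
{
    char scratch[64];
    assert(path != NULL && msg != NULL && n <= sizeof scratch);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    off_t end = -1;
    if (read(fd, scratch, n) >= 0 &&
        write(fd, msg, msg_len) == (ssize_t)msg_len)
        end = lseek(fd, 0, SEEK_CUR);   /* report where the offset ended up */

    close(fd);
    return end;
}
```

On a regular file containing "abcdef", reading 3 bytes and then writing "XY" leaves "abcXYf"; overwriting from the start would need an lseek(fd, 0, SEEK_SET) first.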

Alternatives to using stat() to get file type?

Are there any alternatives to stat (which is found on most Unix systems) which can determine the file type? The manpage says that a call to stat is expensive, and I need to call it quite often in my app.

The alternative is fstat() if you already have the file open (so you have a file descriptor for it). Or lstat() if you want to find out about symbolic links rather than the file the symlink points to.
I think the man page is exaggerating the cost; it is not much worse than any other system call that has to resolve the name of the file into an inode. It is more costly than getpid(); it is less costly than open().

The "file type" that stat() gives you is whether the file is a regular file or something like a device file or directory, among other things like its size and inode number. If that's what you need to know, then you must use stat().
If what you actually need to know is the type of the file's contents -- e.g. text file, JPEG image, MP3 audio -- then you have two options. You can guess based on the filename extension (if it ends in ".mp3", the file probably contains MP3 audio), or you can use libmagic, which actually opens the file and reads some of its contents to figure out what it is. The libmagic approach is more expensive (if you're trying to avoid stat(), you probably want to avoid open() too), but less prone to error (in case that ".mp3" file is actually a JPEG image, for example).

Under Linux with some filesystems the file type (regular, char device, block device, directory, pipe, symlink, ...) is stored in the linux_dirent struct, which is the structure the kernel uses to hand directory entries to applications via the getdents system call. If the only thing in the stat structure you needed was the file type and you needed to get that for all or many entries of a directory, you could use getdents directly (rather than readdir) and attempt to get the file type out of that, only using stat if you found an invalid file type in linux_dirent. Depending on your application's filesystem usage pattern this could be faster than using stat if you are using Linux, but stat should be fast in many cases.
Stat's speed has mostly to do with locating the data that is being asked for on disk. If you are traversing a directory recursively stat-ing all of the files then each stat should end up being fairly quick overall because most of the work getting the data stat needs ends up cached before you ask the kernel for it by a previous call to stat. If on the other hand you stat the same number of files randomly distributed around the system then the kernel will likely have to read from disk several directories for each file you are going to call stat on.
fstat should always be very fast since the kernel should already have the data you're asking for in RAM, as it needs to access it for the file to be in the open state, and the kernel won't have to go through the trouble of traversing the path of the filename to see if each component is in RAM or on disk and possibly reading in a directory from disk (but likely not having to), only to discover that it has the data that you are asking for in RAM.
That being said, calling stat on an open file should be faster than calling it on an unopened file.
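A sketch of the "use the directory entry's type, fall back to stat" idea follows. Note the swap: instead of calling getdents directly as suggested above, it uses readdir(), whose struct dirent exposes the same information as d_type on glibc and the BSDs (this field is an extension, not required by POSIX). The function name count_subdirs is hypothetical.

```c
#define _DEFAULT_SOURCE 1   /* exposes the DT_* constants on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Count the subdirectories of dir_path, using d_type when the filesystem
   fills it in and calling lstat() only for DT_UNKNOWN entries.
   Returns -1 if the directory cannot be opened. */
int count_subdirs(const char *dir_path)
{
    assert(dir_path != NULL);
    DIR *d = opendir(dir_path);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        const char *n = e->d_name;
        if (n[0] == '.' && (n[1] == '\0' || (n[1] == '.' && n[2] == '\0')))
            continue;                       /* skip "." and ".." */

        int is_dir = (e->d_type == DT_DIR);
        if (e->d_type == DT_UNKNOWN) {      /* type not stored: ask lstat() */
            char path[4096];
            struct stat st;
            int len = snprintf(path, sizeof path, "%s/%s", dir_path, n);
            if (len > 0 && (size_t)len < sizeof path && lstat(path, &st) == 0)
                is_dir = S_ISDIR(st.st_mode);
        }
        count += is_dir;
    }
    closedir(d);
    return count;
}
```

On filesystems that record the type in the directory entry this makes no stat call at all; whether that is measurably faster depends on the usage pattern described above.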

Are you aware of the "magic" file on *nix systems? By querying a file from the command line with something like file myfile.ext you can get the real file type.
This is done by reading the contents of the file rather than looking at its extension, and is widely used on *nix (Linux, Unix, ...) systems.

If your application is expected to run on Linux systems, why don't you try inotify(7)? It is definitely faster than stat-ing many files.
