Writing and reading the Linux /proc/... filesystem without lseek() - C

In this source code http://man7.org/tlpi/code/online/dist/sysinfo/procfs_pidmax.c.html the file /proc/sys/kernel/pid_max is first simply read (using the read syscall) and then simply written (using the write syscall).
Why is it not necessary to lseek to the beginning before writing? I thought the file-offset pointer is the same for reads and writes (that's what the author of the associated book says).

This is because /proc is not a real file system, so writes to pid_max are handled in a way that doesn't require any seek. I'm not even sure whether seeks are supported here.
Just to give you a feeling of how different /proc files are, here is a reference to a pretty old but illustrative kernel bug specifically related to pid_max: https://bugzilla.kernel.org/show_bug.cgi?id=13090
This link should give you even more details: THE /proc FILESYSTEM
And finally, the developerWorks article "Access the Linux kernel using the /proc filesystem" gives a step-by-step illustration of kernel module code that uses the /proc FS API. This looks like exactly what you need.

I've looked at the kernel source: files under /proc/sys/ are under sysctl table control, and the read/write callbacks for each entry support a file offset. The pid_max entry operates on a single int value, so the offset is not actually used in those callbacks.
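A minimal sketch of the pattern in question (not the TLPI code verbatim, error handling omitted): read the current pid_max, then write a new value through the same descriptor without seeking back.

/* Sketch only: read pid_max, then optionally overwrite it via the same fd. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char buf[64];
    int fd = open("/proc/sys/kernel/pid_max", O_RDWR);
    if (fd == -1)
        return 1;

    ssize_t n = read(fd, buf, sizeof(buf) - 1);   /* read current value */
    if (n > 0) {
        buf[n] = '\0';
        printf("Current pid_max: %s", buf);
    }

    if (argc > 1) {
        /* No lseek(fd, 0, SEEK_SET) here: for this sysctl entry the write
         * handler ignores the file offset, so the write still works. */
        write(fd, argv[1], strlen(argv[1]));
    }

    close(fd);
    return 0;
}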

Related

What is the most efficient way to copy many files programmatically?

Once upon a time long ago, we had a bash script that worked out a list of files that needed to be copied based on some criteria (basically like a filtered version of cp -rf).
This was too slow and was replaced by a C++ program.
What the C++ program does is essentially:
foreach file
read entire file into buffer
write entire file
The program uses the POSIX calls open(), read() and write() to avoid the buffering and other overheads of iostream and fopen, fread & fwrite.
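That baseline might look roughly like the sketch below (chunked for brevity, whereas the description above reads each file whole); copy_one and the 64 KiB buffer size are illustrative, and error handling is minimal.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical baseline: copy one file with plain POSIX open()/read()/write(). */
static int copy_one(const char *src, const char *dst)
{
    char buf[1 << 16];                    /* 64 KiB transfer buffer */
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    ssize_t n;

    if (in == -1 || out == -1)
        return -1;

    while ((n = read(in, buf, sizeof(buf))) > 0) {
        /* A robust version would loop until all n bytes are written. */
        if (write(out, buf, n) != n)
            return -1;
    }

    close(in);
    close(out);
    return n == 0 ? 0 : -1;               /* n == 0 means clean EOF */
}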
Is it possible to improve on this?
Notes:
I am assuming these are not sparse files
I am assuming GNU/Linux
I am not assuming a particular filesystem is available
I am not assuming prior knowledge of whether the source and destination are on the same disk.
I am not assuming prior knowledge of the kind of disk: SSD, HDD, maybe even NFS or sshfs.
We can assume the source files are on the same disk as each other.
We can assume the destination files will also be on the same disk as each other.
We cannot assume whether the source and destinations are on the same disk or not.
I think the answer is yes but it is quite nuanced.
Copying speed is of course limited by disk IO not CPU.
But how can we be sure to optimise our use of disk IO?
Maybe the disk has the equivalent of multiple read or write heads available? (perhaps an SSD?)
In which case performing multiple copies in parallel will help.
Can we determine and exploit this somehow?
This is surely well-trodden territory, so rather than re-invent the wheel straight away (though that is always fun) it would be nice to hear what others have tried or would recommend.
Otherwise I will try various things and answer my own question sometime in the distant future.
This is what my evolving answer looks like so far...
If the source and destination are different physical disks then
we can at least read and write at the same time with something like:
writer thread
read from write queue
write file
reader thread
foreach file
read file
queue write on writer thread
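A much simplified sketch of that reader/writer split is below, using a single hand-off slot instead of a real queue; the helper names (writer, copy_all), the slurp-whole-file approach and the minimal error handling are all illustrative, not a tuned implementation.

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static struct {
    char       *data;        /* contents of one source file */
    size_t      len;
    const char *dest;        /* destination path */
    int         full, done;
} slot;

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *writer(void *arg)           /* writer thread: drain the slot */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (!slot.full && !slot.done)
            pthread_cond_wait(&c, &m);
        if (!slot.full) {                /* reader finished, nothing left */
            pthread_mutex_unlock(&m);
            return NULL;
        }
        char *data = slot.data;
        size_t len = slot.len;
        const char *dest = slot.dest;
        slot.full = 0;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);

        int fd = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd != -1) {
            write(fd, data, len);        /* real code: loop on short writes */
            close(fd);
        }
        free(data);
    }
}

static void copy_all(const char *src[], const char *dst[], int nfiles)
{
    pthread_t w;
    pthread_create(&w, NULL, writer, NULL);

    for (int i = 0; i < nfiles; i++) {   /* reader: slurp each source file */
        struct stat st;
        int fd = open(src[i], O_RDONLY);
        if (fd == -1 || fstat(fd, &st) == -1)
            continue;
        char *data = malloc(st.st_size);
        read(fd, data, st.st_size);      /* real code: check the result */
        close(fd);

        pthread_mutex_lock(&m);
        while (slot.full)
            pthread_cond_wait(&c, &m);
        slot.data = data;
        slot.len  = st.st_size;
        slot.dest = dst[i];
        slot.full = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }

    pthread_mutex_lock(&m);
    slot.done = 1;                       /* tell the writer we are finished */
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    pthread_join(w, NULL);
}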
If the source and destination are on the same physical disk and we happen to be on a filesystem
with copy on write semantics (like xfs or btrfs) we can potentially avoid actually copying the file at all.
This is apparently called "reflinking".
The cp command supports this using --reflink=auto.
See also:
https://www.reddit.com/r/btrfs/comments/721rxp/eli5_how_does_copyonwrite_and_deduplication_work/
https://unix.stackexchange.com/questions/80351/why-is-cp-reflink-auto-not-the-default-behaviour
From this question
and https://github.com/coreutils/coreutils/blob/master/src/copy.c
it looks as if this is done using an ioctl as in:
ioctl (dest_fd, FICLONE, src_fd);
So a quick win is probably:
try FICLONE on first file.
If it succeeds then:
foreach file
srcFD = open(src);
destFD = open(dest);
ioctl(destFD,FICLONE,srcFD);
else
do it the other way - perhaps in parallel
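A rough sketch of that quick win, assuming FICLONE from <linux/fs.h> (Linux 4.5+ headers; older systems spell it BTRFS_IOC_CLONE); clone_file is a hypothetical helper. If the first call fails (typically EXDEV or EOPNOTSUPP), the caller would switch to a byte-copy strategy for the rest.

#include <fcntl.h>
#include <linux/fs.h>        /* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Try to reflink src into dst; returns 0 on success, -1 if cloning is
 * not possible (cross-device copy, non-CoW filesystem, open failure). */
static int clone_file(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int rc = -1;

    if (in != -1 && out != -1)
        rc = ioctl(out, FICLONE, in);    /* share extents, no data copy */

    if (in != -1) close(in);
    if (out != -1) close(out);
    return rc;
}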
In terms of low-level system APIs we have:
copy_file_range
ioctl FICLONE
sendfile
I am not clear on when to choose one over the other, except that copy_file_range is not safe to use with some filesystems, notably procfs.
This answer gives some advice and suggests sendfile() is intended for sockets but in fact this is only true for kernels before 2.6.33.
https://www.reddit.com/r/kernel/comments/4b5czd/what_is_the_difference_between_splice_sendfile/
copy_file_range() is useful for copying one file to another (within the same filesystem) without actually copying anything until either file is modified (copy-on-write or COW).
splice() only works if one of the file descriptors refers to a pipe. So you can use it for e.g. socket-to-pipe or pipe-to-file without copying the data into userspace. But you can't do file-to-file copies with it.
sendfile() only works if the source file descriptor refers to something that can be mmap()ed (i.e. mostly normal files) and, before 2.6.33, the destination must be a socket.
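For completeness, a hedged example of a file-to-file copy with sendfile() on kernels at or after 2.6.33; sendfile_copy is a hypothetical helper and error handling is abbreviated.

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int sendfile_copy(const char *src, const char *dst)
{
    struct stat st;
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (in == -1 || out == -1 || fstat(in, &st) == -1)
        return -1;

    off_t off = 0;
    while (off < st.st_size) {
        /* sendfile() advances 'off' by however many bytes it moved */
        ssize_t n = sendfile(out, in, &off, st.st_size - off);
        if (n <= 0)
            return -1;
    }
    close(in);
    close(out);
    return 0;
}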
There is also a suggestion in a comment that reading multiple files then writing multiple files will result in better performance.
This could use some explanation.
My guess is that it tries to exploit the heuristic that the source files and destination files will be close together on the disk.
I think the parallel reader and writer thread version could perhaps do the same.
The problem with such a design is it cannot exploit any performance gain from the low level system copy APIs.
The general answer is: Measure before trying another strategy.
For HDD this is probably your answer: https://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files
Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.
generate a list of files to copy and store it
copy files in parallel using OpenMP
#pragma omp parallel for
for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
{
    copyFile(*iter);
}
copy each file using copy_file_range()
falling back to using splice() with a pipe() when compiling for old platforms not supporting copy_file_range().
Reflinking (as supported by copy_file_range()), which avoids copying at all when the source and destination are on the same filesystem, is a massive win.
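As a concrete shape for that copy path, here is a sketch of copy_file_range() with a splice()-through-a-pipe fallback. It assumes glibc 2.27+ declares copy_file_range(), uses a runtime fallback rather than the compile-time one described above, and copy_fd is a hypothetical helper with minimal error handling.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy 'size' bytes from fd 'in' to fd 'out'. Tries copy_file_range()
 * first; if the kernel or filesystem rejects it, shuttles the remaining
 * bytes through a pipe with splice(). */
static int copy_fd(int in, int out, off_t size)
{
    while (size > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, size, 0);
        if (n > 0) {
            size -= n;
            continue;
        }
        if (n == 0)
            return -1;                       /* unexpected EOF */
        if (errno != ENOSYS && errno != EXDEV && errno != EOPNOTSUPP)
            return -1;
        break;                               /* fall back to splice() */
    }

    if (size > 0) {
        int p[2];
        if (pipe(p) == -1)
            return -1;
        while (size > 0) {
            ssize_t n = splice(in, NULL, p[1], NULL, 1 << 20, 0);
            /* real code should also loop on short splices to 'out' */
            if (n <= 0 || splice(p[0], NULL, out, NULL, n, 0) != n)
                break;
            size -= n;
        }
        close(p[0]);
        close(p[1]);
    }
    return size == 0 ? 0 : -1;
}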

simulating memfd_create on Linux 2.6

I'm backporting a piece of code that uses a virtual memory trick involving a file descriptor that gets passed to mmap, but doesn't have a mount point. Having the physical file would be an unnecessary overhead in this application. The original code uses memfd_create which is great.
Since Linux 2.6 doesn't have memfd_create or the O_TMPFILE flag for open, I'm currently creating a file with mkstemp and then unlinking it without closing it first. This works, but it doesn't please me at all.
Is there a better way to get a file descriptor for mmap purposes without ever touching the file system in 2.6?
Before somebody says "XY problem," what I really need is two different virtual memory addresses to the same data in memory. This is implemented by mmap'ing the same anonymous file to two different addresses. Any other "Y" to my "X" also welcome.
Thanks
I considered two approaches:
Creating my temporary file under /dev/shm/ rather than /tmp/
Using shm_open to get a file descriptor.
Although irrelevant to the specific problem at hand, /dev/shm/ is not guaranteed to exist on all distributions, so #2 felt more correct to me.
In order to not have to worry about unique names of the shared memory objects, I just generate UUIDs.
I think I'm happy with this.
Shout out to #NominalAnimal.
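For reference, a minimal sketch of the shm_open() approach: one POSIX shared memory object, unlinked immediately, mapped twice so the same bytes appear at two different virtual addresses. The object name is a placeholder for the generated UUID; link with -lrt on older glibc.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = shm_open("/my-uuid-here", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return 1;
    shm_unlink("/my-uuid-here");     /* name gone, fd (and mappings) live on */

    if (ftruncate(fd, len) == -1)
        return 1;

    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return 1;

    strcpy(a, "hello");              /* write through one mapping ...   */
    return b[0] == 'h' ? 0 : 1;      /* ... and the other one sees it   */
}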

Information about file on Linux system?

I want to know if a particular file is in use by a process, i.e. if the file is open in read-only mode by that process.
I thought about searching through the /proc/[pid]/fd directories, but that wastes a lot of time, and I don't think it's an elegant approach.
Is there any way, using some Linux API, to determine whether file X is open by any process? Or maybe some data structure like /proc, but for files?
Not that I know of. The lsof and fuser tools do precisely what you suggest: they wander through /proc/*/fd.
Note that it is possible for open files to not have a name, if the file was deleted after being opened, and it is possible for a file to be open without the process holding a file descriptor (through mmap), and even the combination of both (this would be a process-private swap file that is automatically cleaned up on process exit).
Determining if a process is using a file is easy. The inverse less so. The reason is that the kernel does not keep track of the inverse directly. The information that IS kept is:
A file knows how many links refer to it (inode table)
A process knows what files it has open (file descriptor table)
This is why lsof's /proc walking is necessary. The file descriptors in use by a particular process are kept in /proc/$PID (among other things), and so lsof can use this (and other things) to spit out all of the pid <-> fd <-> inode relationships.
This is a nice article on lsof. As with any Linux util, you can always check out its source code for all of the details :)
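For illustration, a rough sketch of the /proc walk that lsof and fuser perform; file_is_open is a hypothetical helper, and it ignores mmap-only users, deleted files and permission errors.

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if some process has 'path' open via a file descriptor. */
static int file_is_open(const char *path)
{
    struct stat target, st;
    if (stat(path, &target) == -1)
        return 0;

    DIR *proc = opendir("/proc");
    struct dirent *p;
    while (proc && (p = readdir(proc))) {
        if (!isdigit((unsigned char)p->d_name[0]))
            continue;                         /* not a PID directory */

        char fddir[64];
        snprintf(fddir, sizeof(fddir), "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);            /* may fail without privileges */
        struct dirent *f;
        while (fds && (f = readdir(fds))) {
            char link[128];
            snprintf(link, sizeof(link), "%s/%s", fddir, f->d_name);
            /* stat() follows the fd symlink, giving the open file's inode */
            if (stat(link, &st) == 0 &&
                st.st_dev == target.st_dev && st.st_ino == target.st_ino) {
                closedir(fds);
                closedir(proc);
                return 1;                     /* found an open descriptor */
            }
        }
        if (fds)
            closedir(fds);
    }
    if (proc)
        closedir(proc);
    return 0;
}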
lsof might be the tool you're searching for.
EDIT: I didn't realize you are specifically searching for something to integrate into your application, so my answer appears a little simplistic. But anyway, I think that this question is pretty much related to yours.

How do I open a directory at kernel level using the file descriptor for that directory?

I'm working on a project where I must open a directory and read the files/directories inside at kernel level. I'm basically trying to find out how ls is implemented at kernel level.
Right now I've figured out how to get a file descriptor for a directory using sys_open() and the O_DIRECTORY flag, but I don't know how to read the fd that I receive. If anyone has any tips or other suggestions I'd appreciate it. (Keep in mind this has to be done at kernel level).
Edit: To make a long story short: for a school project I am implementing file/directory attributes. Where I'm storing the attributes is a hidden folder at the same level as the file with a given attribute. (So a file in Desktop/MyFolder has an attributes folder called Desktop/MyFolder/.filename_attr.) Trust me, I don't care to mess around in the kernel for funsies, but the reason I need to read a dir at kernel level is that it's part of the project specs.
To add to caf's answer mentioning vfs_readdir(): reading and writing files from within the kernel is considered unsafe (except for /proc, which acts as an interface to internal data structures in the kernel).
The reasons are well described in this linuxjournal article, although they also provide a hack to access files. I don't think their method could be easily modified to work for directories. A more correct approach is accessing the kernel's filesystem inode entries, which is what vfs_readdir does.
Inodes are filesystem objects such as regular files, directories, FIFOs and other beasts. They live either on the disc (for block device filesystems) or in the memory (for pseudo filesystems).
Notice that vfs_readdir() expects a file * parameter. To obtain a file structure pointer from a user space file descriptor, you should utilize the kernel's file descriptor table.
The kernel.org files documentation says the following on doing so safely:
To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
An example :
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
    // Handling of the file structures is special.
    // Since the look-up of the fd (fget() / fget_light())
    // are lock-free, it is possible that look-up may race with
    // the last put() operation on the file structure.
    // This is avoided using atomic_long_inc_not_zero() on ->f_count
    if (atomic_long_inc_not_zero(&file->f_count))
        *fput_needed = 1;
    else
        /* Didn't get the reference, someone's freed */
        file = NULL;
}
rcu_read_unlock();
....
return file;
atomic_long_inc_not_zero() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail fget() / fget_light().
Finally, take a look at filldir_t, the second parameter type.
You probably want vfs_readdir() from fs/readdir.c.
In general though kernel code does not read directories, user code does.
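For illustration only, a hedged kernel-side sketch using the old 2.6-era API mentioned in both answers. The filldir_t signature shown here varied across 2.6 releases, and recent kernels replace vfs_readdir() with iterate_dir(), so treat this as a shape rather than copy-paste code; my_filldir and list_dir are hypothetical names.

#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* Called once per directory entry by vfs_readdir(). */
static int my_filldir(void *buf, const char *name, int namelen,
                      loff_t offset, u64 ino, unsigned int d_type)
{
    printk(KERN_INFO "entry: %.*s\n", namelen, name);
    return 0;                             /* 0 = keep iterating */
}

static void list_dir(const char *path)
{
    struct file *dir = filp_open(path, O_RDONLY | O_DIRECTORY, 0);

    if (IS_ERR(dir))
        return;
    vfs_readdir(dir, my_filldir, NULL);   /* invokes my_filldir per entry */
    filp_close(dir, NULL);
}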

Direct access to hard disk with no FS from C program on Linux

I want to access the whole hard disk directly from a C program. There's no FS on it, and there's never gonna be one.
I just want to open /dev/sda (for example) and do I/O at the block/sector level of the disk.
I'm planning to write some programs for learning C programming in the Linux environment (I know C language, Python, Perl and Java) but lack confidence with the Linux environment.
For my learning purposes I'm thinking about playing with kyoto-cabinet and saving the value corresponding to the computed hash directly into a "block/sector" of the hard disk, recording the pair: "hash, block/sector reference" into a kyoto-cabinet hash database file.
I don't know if this is feasible using standard C I/O functions, or whether I'd have to write a "device driver" or something like that...
As mentioned elsewhere, under *NIX systems block devices like /dev/sda can be accessed as plain files. Note that if a file system is mounted from the device, opening it for writing will fail.
If you want to play with block devices, I would advise first using the loop device, which presents a plain file as a block device. For example:
dd if=/dev/zero of=./loop_file_10MB bs=1024 count=10K
losetup /dev/loop0 $PWD/loop_file_10MB
After that, /dev/loop0 would behave as if it was a block device, but all information written would be stored in the file.
Since device files for drives (e.g. /dev/sda) are block devices, you can open, seek and use them almost like a normal file.
Yes, as others have noted, you can simply open the block device.
However, it's a really good idea to do I/O (writes, anyway) on block boundaries and in whole blocks. You can use something like pread() and pwrite() to do this I/O, or mmap some or all of the device.
There are a bunch of ioctls which can be used; see "man sd" for some more info. They don't seem to all be documented in the same place.
In linux/fs.h, BLKROSET and a bunch of other ioctls are defined; you have to look around to find out how to use them. You can do useful things like find out how big the device is, and what the block size is.
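A small hedged example of those ioctls, using /dev/loop0 purely as an example device (it needs appropriate permissions): query the device size and logical sector size, then read one aligned sector with pread().

#include <fcntl.h>
#include <linux/fs.h>        /* BLKGETSIZE64, BLKSSZGET */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    uint64_t bytes = 0;
    int sector = 0;
    char buf[4096];

    int fd = open("/dev/loop0", O_RDONLY);
    if (fd == -1)
        return 1;

    ioctl(fd, BLKGETSIZE64, &bytes);   /* device size in bytes */
    ioctl(fd, BLKSSZGET, &sector);     /* logical sector size */
    printf("%llu bytes, %d-byte sectors\n", (unsigned long long)bytes, sector);

    /* Read the first sector; writes should likewise be whole, aligned blocks. */
    if (sector > 0 && sector <= (int)sizeof(buf))
        pread(fd, buf, sector, 0);

    close(fd);
    return 0;
}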
The source code of the util-linux-ng package is your friend, it contains examples.
