How is getcwd implemented in the kernel (or library)? - c

One process could do
chdir("/to/some/where");
while another shell runs
mv /to/some/where /now/different/path/
Then the first process does
print getcwd();
#prints /now/different/path/
How is getcwd implemented (at the lowest level, e.g. at the level of the kernel, inodes ...)?
I know how a common (inode-based) filesystem works, e.g. what a directory contains (the entry names and the corresponding inode numbers).
EDIT
Probably the question was too vague, so I am trying to refine it. One possible scenario (from what I know):
1. the kernel knows the inode of the CWD for the given process (and its threads), e.g. inode number 1000
2. reads the inode (gets the blocks it needs to read)
3. reads the corresponding blocks (e.g. opens the directory)
4. reads the directory entries (entry names and inode numbers)
5. gets the inode number of the .. parent directory (for example 900) and the inode number of . (the current directory)
6. reads the content of the parent directory, where it gets
   - the name of the previous directory (for inode 1000)
   - the inode number of the parent directory
7. continues from 5 until the root inode is reached.
That means getcwd for
/some/very/very/very/deep/directory/level
takes more raw I/O operations (more directory entries to read) than for the short
/tmp
where the whole getcwd is done with two reads?
Is this correct, or is it done in a totally different way?

First, you are asking in the wrong place. This question is more about the operating system, so unix.stackexchange is the better place.
Anyway, your proposed solution is true for some ancient UNIX implementations (for example BSD 2.8) or the like. Pathname resolution could be done there as you described.
However, many problems arise. A few of them:
too complicated pathname resolution, as you said (and yes, deeper directories need more I/O)
it depends on the premise that only ONE ROOT directory exists. This has not been true since BSD 4.2, which introduced the per-process root directory. That is what makes the chroot system call possible: it sets the root to any directory without showing the real path to the process. (One of the coolest FreeBSD features, jails, depends on this.) (Ancient Linux versions also had only one root; the VFS, the virtual filesystem layer, was introduced only in 0.96c.)
permission problems, e.g. what happens when
#shell1
$ mkdir -p /tmp/some
$ cd /tmp/some
#shell2
$ su
# mkdir -p /tmp/my
# chmod 700 /tmp/my
# mv /tmp/some /tmp/my/
The /tmp/my directory isn't readable by the first process, so it can't determine the path. How should it then work with the files? In shell1 again:
$ pwd
/tmp/some #the original
$ echo $PWD
/tmp/some
$ /bin/pwd
pwd: .: Permission denied
But you can still do, for example,
$ touch bob #works
i.e. the system allows you to work in the "current" directory without letting you know where you are (in both scenarios, i.e. with chroot and in the second one) ;)
That means that every process stores the current working directory in its table:
device number (e.g. hdd1 or hdd2)
inode number on the device
and
the kernel maintains other global table(s) (in Linux called dentries, for directory entries), where it keeps the "inode" -> "path" mapping for every process and every open file descriptor, plus inode caches (in Linux maintained by the kernel itself; in BSD a job for the vnode driver) and the like.
E.g. when some process asks for the pathname of inode X, the kernel searches the dentry table. If the entry is found, it returns immediately; if not, it calls the lookup procedure, which does the pathname resolution.
When, for example, a rename occurs, the kernel searches the dentry table and, if it finds the entry, changes it as needed.
All of the above is extremely simplified and, as you can see yourself, highly OS dependent. The common base is defined by POSIX, but for what happens behind it (i.e. the implementation) you really need to read the kernel sources and/or google for:
linux dentry
linux vfs
freebsd vnode
pathname resolution
and such.
PS: for the nitpickers :) As I said, everything is oversimplified, so if you want to correct it or add more details, edit the answer. I converted it to a "community wiki answer".

In current POSIX kernels like Linux (or *BSD-s) the current working directory (as a kernel inode) is part of the process state. So the in-kernel process descriptor (probably some struct task_struct on Linux) contains or refers to that cwd. Then getcwd is "simply" a syscall querying that.
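From user space, that query is just a call to getcwd(3). A minimal sketch of a caller that does not assume any fixed path length follows; the helper name cwd_alloc and the starting size are made up for this illustration, and the only behaviour relied on is that getcwd() fails with ERANGE when the buffer is too small.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return the current working directory in a
 * malloc'ed string (caller frees), or NULL on error. The buffer is
 * doubled for as long as getcwd() reports ERANGE. */
char *cwd_alloc(void)
{
    size_t size = 256;                 /* arbitrary starting size */

    for (;;) {
        char *buf = malloc(size);
        if (buf == NULL)
            return NULL;
        if (getcwd(buf, size) != NULL)
            return buf;

        int err = errno;               /* keep errno across free() */
        free(buf);
        if (err != ERANGE) {           /* e.g. EACCES or ENOENT */
            errno = err;
            return NULL;
        }
        size *= 2;
    }
}
```

Typical use would be `char *p = cwd_alloc(); if (p) { puts(p); free(p); }` (with `<stdio.h>` included).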
The kernel inodes (for opened file descriptors, including working directories) are related to filesystems and are not the same as disk inodes.
Of course, the devil is in the details!

Key point: chdir() only affects the current process and any child processes launched after that - it is not a global state.
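A small sketch that illustrates this point: a forked child changes its directory and the parent's working directory stays what it was. The function name is hypothetical, and fixed PATH_MAX buffers are used only to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if the parent's cwd is unchanged after
 * a child process called chdir("/"), 0 if it changed, -1 on error. */
int parent_cwd_survives_child_chdir(void)
{
    char before[PATH_MAX], after[PATH_MAX];

    if (getcwd(before, sizeof before) == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                      /* child: change only its own cwd */
        _exit(chdir("/") == 0 ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) < 0)  /* parent: wait for the child */
        return -1;
    if (getcwd(after, sizeof after) == NULL)
        return -1;

    return strcmp(before, after) == 0;
}
```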

Related

Special file in UNIX file system when mounting?

I am reading The UNIX Time-Sharing System by D. M. Ritchie and K. Thompson, where they briefly introduce the UNIX OS. In the file system section, when they talk about "mount", they write the following 2 paragraphs. I have a few questions about the bold and italic content in the paragraphs.
Paragraph 1: When an I/O request is made to a file whose i-node
indicates that it is special, the last 12 device address words are
immaterial, and the first specifies an internal device name, which is
interpreted as a pair of numbers representing, respectively, a
device type and subdevice number. The device type indicates which system routine will deal with I/O on that device; the
subdevice number selects, for example, a disk drive attached to a
particular controller or one of several similar terminal interfaces.
Paragraph 2: In this environment, the implementation of the mount
system call (Section 3.4) is quite straightforward. mount maintains a
system table whose argument is the i-number and device name of the ordinary file specified during the mount, and whose
corresponding value is the device name of the indicated special file. This table is searched for each i-number/device pair that turns
up while a path name is being scanned during an open or create; if a
match is found, the i-number is replaced by the i-number of the
root directory and the device name is replaced by the table value.
From the first paragraph, I know that the device name is something stored in a special file's i-node. However, why does the second paragraph say the ordinary file also has one?
What is the system table that mount maintains? Is para. 2 indicating that the system table is part of the internal file system, and that the mount process builds a table whose entries are special files pointing to files on the mounted external device?
Let's start with 2 observations:
mount /dev/sda2 /mnt/sda2 takes a special file /dev/sda2, and a non-special file (ordinary file) /mnt/sda2.
file name lookup has to know how to cross into other mounted filesystems.
Let's assume that /dev/sda1 is device 100, and mounted on /. Let's assume that /dev/sda2 is device 200, and mounted on /mnt/sda2. What happens when you look up /mnt/sda2/x? That file is stored as /x on /dev/sda2. Here's what happens:
Assume that the inode number of the root inode is 1 on every filesystem.
The OS looks up mnt in inode 1 of device 100, and finds e.g. inode 5.
The OS checks its global system table of mounts to see if (device 100, inode 5) maps to something - it doesn't.
The OS looks up sda2 in inode 5 of device 100, and finds e.g. inode 17.
The OS checks its global system table of mounts to see if (device 100, inode 17) maps to something.
Because /dev/sda2 is mounted onto /mnt/sda2, the table returns /dev/sda2, which is device 200 - we're crossing into another mounted filesystem.
The OS looks up x in inode 1 of device 200, and finds e.g. inode 11.
Lookup returns (device 200, inode 11).
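The table check in this walkthrough can be sketched as a toy model. This is not kernel code: the struct, the function name cross_mount and the hard-coded table are invented for illustration, and the device and inode numbers are the made-up ones from the steps above.

```c
#include <stddef.h>

#define ROOT_INO 1   /* assumption from the walkthrough: root is inode 1 */

/* One row of the toy mount table: mount point -> mounted device. */
struct mount_entry {
    int mp_dev;       /* device that holds the mount point */
    int mp_ino;       /* i-number of the mount point */
    int mounted_dev;  /* device whose root directory is grafted there */
};

static const struct mount_entry mount_table[] = {
    { 100, 17, 200 },  /* /dev/sda2 (device 200) mounted on /mnt/sda2 */
};

/* If (*dev, *ino) is a mount point, replace the pair with the root
 * directory of the mounted filesystem and return 1; otherwise leave
 * the pair untouched and return 0. */
int cross_mount(int *dev, int *ino)
{
    size_t n = sizeof mount_table / sizeof mount_table[0];

    for (size_t i = 0; i < n; i++) {
        if (mount_table[i].mp_dev == *dev && mount_table[i].mp_ino == *ino) {
            *dev = mount_table[i].mounted_dev;
            *ino = ROOT_INO;
            return 1;
        }
    }
    return 0;
}
```

A path lookup would call cross_mount() after resolving each component: (100, 5) passes through unchanged, while (100, 17) becomes (200, 1), and the search for x continues there.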
So to answer your questions:
The ordinary file is the mount point. "Ordinary" means "not a special file". Note that it is possible to mount a regular file, not just a directory. Docker uses this capability.
The system table's entries are mount points which point to mounted devices.

Get path from file descriptor when path is longer than PATH_MAX

I receive filesystem events from fanotify. Sometimes I want to get an absolute path to a file that's being accessed.
Usually, it's not a problem - fanotify_event_metadata contains a file descriptor fd, so I can call readlink on /proc/self/fd/<fd> and get my path.
However, if a path exceeds PATH_MAX readlink can no longer be used - it fails with ENAMETOOLONG. I'm wondering if there's a way to get a file path in this case.
Obviously, I can fstat the descriptor I get from a fanotify and traverse the entire filesystem looking for files with identical device ID and inode number. But this approach is not feasible for me performance-wise (even if I optimize it to ignore paths shorter than PATH_MAX).
I've tried getting a parent directory by reopening fd with O_PATH and calling openat(fd, "..", ...). Obviously, that failed because fd doesn't refer to a directory. I've also tried examining contents of a buffer after a failed readlink call (hoping it contains partial path). That didn't work either.
So far I've managed to get long paths for files inside the working directory of a process that opened them (fanotify events contain a pid of a target process, so I can read /proc/<pid>/cwd and get the path to the root from there). But that is a partial solution.
Is there a way to get an absolute path from a file descriptor without traversing the whole filesystem? Preferably the one that will work with kernel 2.6.32/glibc 2.11.
Update: For the curious. I've figured out why calling readlink("/proc/self/fd/<fd>", ... with a buffer large enough to store the entire path doesn't work.
Look at the implementation of do_proc_readlink. Notice that it doesn't use the provided buffer directly. Instead, it allocates a single page and uses it as a temporary buffer when it calls d_path. In other words, no matter how large the buffer is, d_path will always be limited to the size of a page, which is 4096 bytes on amd64. Same as PATH_MAX! The -ENAMETOOLONG itself is returned by prepend when it runs out of the mentioned page.
readlink can be used with a link target that's longer than PATH_MAX. There are two restrictions: the name of the link itself must be shorter than PATH_MAX (check, "/proc/self/fd/<fd>" is about 20 characters) and the provided output buffer must be large enough. You might want to call lstat first to figure out how big the output buffer should be, or just call readlink repeatedly with growing buffers.
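A minimal sketch of the "growing buffers" variant, with a hypothetical helper name. It relies only on readlink(2) returning the number of bytes written and truncating silently when the buffer is too small. Note that, as the question's update describes, the /proc implementation may still fail with ENAMETOOLONG regardless of the buffer size, so this helps for ordinary symlinks rather than for that specific case.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read a symlink target of any length into a
 * malloc'ed, NUL-terminated string (caller frees). NULL on error. */
char *readlink_alloc(const char *path)
{
    size_t size = 64;                  /* arbitrary starting size */

    for (;;) {
        char *buf = malloc(size);
        if (buf == NULL)
            return NULL;

        ssize_t n = readlink(path, buf, size);
        if (n < 0) {
            int err = errno;           /* keep errno across free() */
            free(buf);
            errno = err;
            return NULL;
        }
        if ((size_t)n < size) {        /* it fit, with room for the NUL */
            buf[n] = '\0';
            return buf;
        }
        free(buf);                     /* possibly truncated: retry bigger */
        size *= 2;
    }
}
```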
The PATH_MAX limit is born from the fact that Unix (or Linux, from now on) needs to bound the size of parameters passed to the kernel. There's no limit on how deep a file hierarchy can grow, and it is always possible to access all files, regardless of how deep they are in the filesystem hierarchy. What is actually limited is the length of the string you can pass to or receive from the kernel to represent a file name. This means you cannot create a symlink whose target is longer than this length (because you have to pass the target path), but you can easily have paths far longer than this limit.
When you pass a filename to the kernel, you do so to name a file (or device, or socket, or FIFO, or whatever), to open it, etc. Your filename first goes to a routine that converts the path into an inode (which is what the kernel actually manages). That routine begins scanning from one of two possible points in the filesystem hierarchy: the inode reference of the root inode or the inode reference of the process's current working directory. Which inode is used as the starting point depends on the presence of a leading / character at the beginning of the path. From this point, up to PATH_MAX characters will be processed each time, but that can lead us deep enough that we cannot get back to the root in one step only...
Suppose you use the path to change your current directory, and do a chdir A/B/C/D/E/.../Z. Once there, you create new directories and do the same thing, chdir AA/AB/AC/AD/AE/.../AZ, then chdir BA/BB/BC/BD/... and so on. There's nothing in the system that forbids you from getting that deep in the filesystem (you can try it yourself; I have done and tested it before). You can grow a path that is far longer than PATH_MAX. But this only means that you cannot get there directly from the filesystem root. You can go there in steps, as many as the system allows, depending on where you fix your root directory (by means of the chroot(2) syscall) or your current directory (by means of the chdir(2) syscall).
Probably you have noticed (or not) that there's traditionally no system call to get your current working directory path from root... There are several reasons for this:
The root inode and the current working inode are two local-to-process concepts. Two processes in the same system can have different working directories, and also different root directories, up to the point that they share nothing in common and there is no way to reach one's directory from the other's.
An inode path can be ambiguous. Well, this is not true for a directory, as two hard links are not allowed to point to the same directory inode (this was possible in older Unices, where directories had to be created with the mknod(2) system call; if you have access to some HP-UX v6 or old Unix SysV R4 you can create directories with a ... entry pointing to the grandparent of a directory or similar things, just by being root and knowing how to use the mknod(2) syscall). The idea is that when two links point to the same inode, and one (or both) of them leads to the root, which one is the right path from the root inode to the current dir?
The current inode and the root can be separated by a path too long to fit in the PATH_MAX limit.
There can be several different filesystems (and filesystem types) involved in getting to the root. So this is not something that can be obtained knowing only the data stored on the disks; you must know the mount table.
For these reasons, there's no direct support in the kernel to know the root path to a file. And there's no other way to get the path (and this is what the pwd(1) command does) than to follow the .. entry to the parent directory, search there for a link that leads to the inode number of the current dir, and repeat this until the parent inode is the same as the last inode visited. Only then are you in the root directory (your root directory, which in general differs from other processes' root directories).
Just try this exercise:
i=0
while [ "$i" -lt 10000 ]
do
mkdir dir-$i
cd dir-$i
i=$(expr "$i" + 1)
done
and see how far you can go from the root directory in your hierarchy.
NOTE 1
Another reason why it is impossible to get the path to a file from an open descriptor is that you have access only to the inode. The path you used to open(2) it may bear no relationship to the actual root path: you may have used symlinks or a path relative to the working directory, or changed the root dir between the open call and the time you want the path; the path may not even exist any more, as you may have unlink(2)ed it. The inode information has no reference to the path to the inode, as there can be multiple (even millions of) paths to a file. In the inode you have only a ref count, which is the number of paths that actually end at that inode.

Copying a directory using sockets

I'm writing a program in C that sends files across the network using sockets. This works fine for files - they are read into a buffer and then written onto the socket. They are picked up at the other end by reversing this process.
However, how can this apply to directories? I also want to copy directories, keeping the permissions the same (so I don't think mkdir will work). At the moment when I try to run this on a directory, it says the size is -1. How is a directory represented?
To be clear, for example, if I want my program to copy /tmp across the network, it will do this:
/tmp/1.txt - OK
/tmp/2.txt - OK
/tmp/dir/ - Skip
/tmp/dir/3.txt - Can't write to path
There are several possibilities. It would fit fairly well with what you have already to tar the directory to transfer, send the resulting archive across the network, and untar on the other side.
Alternatively, you can walk the directory tree recursively. For each directory you need transfer only the name and whichever attributes you want to preserve, but then you must list the directory contents (probably via readdir()) and transfer each member.
By the way, don't neglect to think about how you're going to handle links, both symbolic ones and hard ones. And if you want your program to be really robust then consider also what to do with special files such as device files and FIFOs.
I guess it is homework, otherwise why not use FTP, scp, rsync, unison etc.
To test if a file path is a plain file, a device, a directory, etc etc... use
stat(2)
To read a directory, use opendir(3) then loop on readdir(3) (then of course closedir). You don't need to know how a directory is represented.
You probably should be interested in nftw(3) to recursively traverse a file tree.
To make one directory, use mkdir(2)
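The opendir/readdir/closedir loop can be sketched as a recursive walk that reports what a sender would need per entry (type, permission bits, path). The function name walk_tree, the output format and the fixed 4096-byte path buffer are choices made for this illustration only; a real transfer tool would send the data over the socket instead of printing it, and could use nftw(3) instead of hand-written recursion.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: visit every entry below `path`, printing type,
 * permission bits and name. Returns the number of entries visited,
 * or -1 if `path` itself cannot be opened as a directory. */
long walk_tree(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        char child[4096];
        int len = snprintf(child, sizeof child, "%s/%s", path, ent->d_name);
        if (len < 0 || (size_t)len >= sizeof child)
            continue;                  /* too long for this sketch: skip */

        struct stat st;
        if (lstat(child, &st) != 0)    /* lstat: do not follow symlinks */
            continue;

        printf("%c %04o %s\n",
               S_ISDIR(st.st_mode) ? 'd' : S_ISLNK(st.st_mode) ? 'l' : '-',
               (unsigned)(st.st_mode & 07777), child);
        count++;

        if (S_ISDIR(st.st_mode)) {     /* descend into subdirectories */
            long sub = walk_tree(child);
            if (sub > 0)
                count += sub;
        }
    }
    closedir(dir);
    return count;
}
```

On the receiving side, each directory line would translate into a mkdir(2) with the transmitted mode (possibly followed by chmod(2), since mkdir applies the umask) before any of its members are written.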
You should read Advanced Linux Programming
BTW, this answer contains useful information too...

Print all files on a filesystem using system call

I am working in the kernel and I am trying to make a system call that takes a partition as input (i.e. /dev/sda1) and then prints every file on the filesystem using printk().
I enter a partition (i.e. /dev/sda1) and I put a printk() inside this system call to print.
First, I tried to do this with a process, because if I am right each process is represented by a task_struct and I tried to access the files with the files_struct. But the problem is that I only have the file descriptors of the opened files and not all the files.
So, what I want to do is that I pass the name of the partition and I printk() the names of all the files.
For example:
I enter the path /dev/sda1 as an argument and let's suppose I have the file a.txt and b.txt inside this partition , so the system call should print a.txt and b.txt.
The signature will be like this:
asmlinkage long sys_acall(char *partition_name);
There are a few things that need to be discussed.
The partition_name parameter of your syscall should have the __user tag.
If you want to, strictly speaking, read files from a partition, you will have to implement filesystem recognition (is that partition ext3, reiserfs, ntfs, ...?) and then implement the driver for that kind of filesystem. As Christ pointed out, partitions don't contain files, filesystems do. Another option is to use the drivers already implemented for the filesystem on that partition. This option is just horrible.
If you want to read files from a filesystem your work gets easier, you can use the VFS interface to access it, but you will need that filesystem to be mounted (you can do it on-the-fly though).
My final opinion, I would change "implement a system call that prints every file in a partition" for "implement a system call that prints every file in a directory". The signature for that system call would be:
asmlinkage long sys_crazyness(__user const char *dir);
We don't care if the directory passed is the root of a filesystem or just a folder in any depth-level of a filesystem.
If you can change your problem to this one it would be much easier ;)

What can I do if getcwd() and getenv("PWD") don't match?

I have a build system tool that is using getcwd() to get the current working directory. That's great, except that sometimes people have spaces in their paths, which isn't supported by the build system. You'd think that you could just make a symbolic link:
ln -s "Directory With Spaces" DirectoryWithoutSpaces
And then be happy. But unfortunately for me, getcwd() resolves all the symbolic links. I tried to use getenv("PWD"), but it is not pointing at the same path as I get back from getcwd(). I blame make -C for not updating the environment variable, I think. Right now, getcwd() gives me back a path like this:
/Users/carl/Directory With Spaces/Some/Other/Directories
And getenv("PWD") gives me:
/Users/carl/DirectoryWithoutSpaces
So - is there any function like getcwd() that doesn't resolve the symbolic links?
Edit:
I changed
make -C Some/Other/Directories
to
cd Some/Other/Directories ; make
And then getenv("PWD") works. If there's no other solution, I can use that.
According to the Advanced Programming in the UNIX Environment bible by Stevens, p.112:
Since the kernel must maintain knowledge of the current working directory, we should be able to fetch its current value. Unfortunately, all the kernel maintains for each process is the i-node number and device identification for the current working directory. The kernel does not maintain the full pathname of the directory.
Sorry, looks like you do need to work around this in another way.
There is no way for getcwd() to determine the path you followed via symbolic links. The basic implementation of getcwd() stats the current directory '.', and then opens the parent directory '..' and scans the entries until it finds the directory name with the same inode number as '.' has. It then repeats the process upwards until it finds the root directory, at which point it has the full path. At no point does it ever traverse a symbolic link. So the goal of having getcwd() calculate the path followed via symlinks is impossible, whether it is implemented as a system call or as a library function.
The best resolution is to ensure that the build system handles path names containing spaces. That means quoting pathnames passed through the shell. C programs don't care about the spaces in the name; it is only when a program like the shell interprets the strings that you run into problems. (Compilers implemented as shell scripts that run pre-processors often have problems with pathnames that contain spaces - speaking from experience.)
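If the symlinked spelling is still wanted where it is available, one possible workaround is to trust $PWD only when it demonstrably names the current directory and to fall back to getcwd() otherwise. A minimal sketch, with a hypothetical function name:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return $PWD if it is an absolute path that
 * refers to the same directory as "." (same device and inode);
 * otherwise fall back to getcwd(buf, size). */
const char *logical_or_physical_cwd(char *buf, size_t size)
{
    const char *pwd = getenv("PWD");
    struct stat a, b;

    if (pwd != NULL && pwd[0] == '/' &&
        stat(pwd, &a) == 0 && stat(".", &b) == 0 &&
        a.st_dev == b.st_dev && a.st_ino == b.st_ino)
        return pwd;                /* logical path, symlinks preserved */

    return getcwd(buf, size);      /* physical path, symlinks resolved */
}
```

With a stale $PWD, as in the make -C case from the question, the device/inode comparison fails and the function returns the physical path, so this does not remove the need to handle spaces in path names.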

Resources