What corner cases must we consider when parsing $PATH on Linux?

I'm working on a C application that has to walk $PATH to find full pathnames for binaries, and the only allowed dependency is glibc (i.e. no calling external programs like which). In the normal case, this just entails splitting getenv("PATH") by colons and checking each directory one by one, but I want to be sure I cover all of the possible corner cases. What gotchas should I look out for? In particular, are relative paths, paths starting with ~ meant to be expanded to $HOME, or paths containing the : char allowed?

One thing that once surprised me is that the empty string in PATH means the current directory. Two adjacent colons or a colon at the end or beginning of PATH means the current directory is included. This is documented in man bash for instance.
It also is in the POSIX specification.
So
PATH=:/bin
PATH=/bin:
PATH=/bin::/usr/bin
All of these mean that the current directory is in PATH.
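To make the empty-entry rule concrete, here is a minimal sketch of a splitter that reports empty entries as ".". The helper name next_path_entry and the convention of signalling an oversized entry with an empty string are assumptions made for this illustration, not anything from glibc. Note that strtok() would not work here, because it silently skips empty fields.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies the PATH entry starting at `cursor` into
 * `out` (which must hold at least 2 bytes). An empty entry is reported
 * as ".". An entry that does not fit is reported as "" so the caller
 * can skip it. Returns the start of the next entry, or NULL after the
 * last one.
 */
static const char *next_path_entry(const char *cursor,
                                   char *out, size_t out_size)
{
    /* strcspn stops at the next ':' or at the terminating NUL. */
    size_t len = strcspn(cursor, ":");

    if (len == 0) {
        memcpy(out, ".", 2);          /* empty entry: current directory */
    } else if (len < out_size) {
        memcpy(out, cursor, len);
        out[len] = '\0';
    } else {
        out[0] = '\0';                /* too long for this sketch */
    }
    return cursor[len] == ':' ? cursor + len + 1 : NULL;
}
```

A caller would start with the cursor at the beginning of the PATH value and keep calling until NULL comes back, handling the entry in out after each call (including the last one).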

I'm not sure this is a problem with Linux in general, but make sure that your code works if PATH has some funky (like, UTF-8) encoding to deal with directories with fancy letters. I suspect this might depend on the filesystem encoding.
I remember working on a bug report from a Russian user who had non-ASCII letters in his user name (and hence in his home directory name, which appeared in PATH).

This is minor, but I'll add it since it hasn't already been mentioned. $PATH can include both absolute and relative paths. If you're crawling the path list by chdir(2)ing into each directory, you need to keep track of the original working directory (getcwd(3)) and chdir(2) back to it at each iteration of the crawl.
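A rough sketch of one iteration of such a crawl, assuming the current directory fits in PATH_MAX bytes. The helper name executable_in is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if `name` is executable inside `dir`,
 * 0 if it is not, and -1 if a directory could not be entered or restored.
 */
static int executable_in(const char *dir, const char *name)
{
    char saved[PATH_MAX];   /* assumes the cwd fits in PATH_MAX bytes */
    int found;

    /* Remember where we started so relative entries keep working. */
    if (getcwd(saved, sizeof saved) == NULL)
        return -1;
    if (chdir(dir) != 0)
        return -1;

    found = (access(name, X_OK) == 0);

    /* Go back before the next PATH entry is examined. */
    if (chdir(saved) != 0)
        return -1;
    return found;
}
```

Building "dir/name" as a single string and testing that instead avoids changing directories at all, which is usually the simpler option.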

The existing answers cover most of it, but it's worth covering the parts of the question that haven't been answered yet:
$ and ~ are not special in the value of $PATH.
If $PATH is not set at all, execvp() will use a default value.
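If you want your own lookup to behave sensibly when PATH is unset, one possible stand-in is confstr(_CS_PATH), which POSIX defines as a PATH value that finds the standard utilities. Whether that is the same default a particular libc's execvp() uses is an assumption here, so check your libc's documentation. The helper name effective_search_path is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns a malloc'd copy of $PATH, or of the
 * confstr(_CS_PATH) value when PATH is not set at all.
 * Returns NULL on failure. The caller frees the result.
 */
static char *effective_search_path(void)
{
    const char *env = getenv("PATH");
    size_t n;
    char *buf;

    if (env != NULL) {
        /* PATH is set (possibly to ""): use it as-is. */
        n = strlen(env) + 1;
        buf = malloc(n);
        if (buf != NULL)
            memcpy(buf, env, n);
        return buf;
    }

    /* First call asks for the required size, including the NUL. */
    n = confstr(_CS_PATH, NULL, 0);
    if (n == 0)
        return NULL;
    buf = malloc(n);
    if (buf == NULL)
        return NULL;
    confstr(_CS_PATH, buf, n);
    return buf;
}
```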

Related

Where does fopen() search for File to read?

The question is self-descriptive. I just want to know the search range of fopen() in :
a) Windows
b) Unix-like systems like MacOS & Linux
That is, when it is asked to open a file for reading, reading and writing, or just writing, using a relative path, e.g. "File.txt". I need an answer addressing both text and binary files (if they differ at all in this regard).
Does it scan only the current directory, or does it scan particular folders?
(Since scanning the full disk would be painstakingly slow, right?)
Edit:
Why the downvotes? Because y'all simply don't know?
fopen() doesn't scan at all
It just opens the file you tell it to open.
The path is either absolute, or relative to the current directory.
The behaviour is pretty much the same across platforms.
Of course in Windows paths look a bit different (drive letters, backslashes instead of slashes).
One relevant difference I can think of:
If the path starts with a drive letter and a colon, it will look at another drive.
If there is no backslash after the drive letter and colon, then the location will be relative to that drive's current working directory (as Windows remembers a current directory per drive letter).
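For the Unix-like case, the rule "absolute, or relative to the current directory" can be sketched as below. The helper name fopen_target is invented for the illustration, it assumes the current directory fits in PATH_MAX, and it does not model Windows drive letters at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes into `out` the absolute path that
 * fopen(name, ...) would refer to on a POSIX system.
 * Returns 0 on success, -1 if the cwd is unavailable or `out` is too small.
 */
static int fopen_target(const char *name, char *out, size_t out_size)
{
    char cwd[PATH_MAX];
    int n;

    if (name[0] == '/') {
        /* Absolute path: used exactly as given. */
        n = snprintf(out, out_size, "%s", name);
    } else {
        /* Relative path: resolved against the current directory. */
        if (getcwd(cwd, sizeof cwd) == NULL)
            return -1;
        n = snprintf(out, out_size, "%s%s%s",
                     cwd, strcmp(cwd, "/") == 0 ? "" : "/", name);
    }
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Text versus binary mode makes no difference to where the file is looked up. The mode only affects how the contents are translated on some platforms.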

Check if file is *child of folder

I have a directory name and a subpath, e.g. "./files" and "/example.txt". While the directory can be arbitrarily placed in the filesystem, I need to make sure that directory+subpath ("./files/example.txt") actually is inside the given directory. So this example would be valid, while subpath "/../example.txt" would be invalid because it is neither a child of the directory, nor a grandchild, etc. Soft-links leading outside of the directory are allowed.
How should I perform this test in C?
My first guess was to call realpath(directory_subpath) and compare the start of the result with realpath(directory), but after reading about the problems with PATH_MAX I'm a bit unsure about that, and this is also likely to cause problems with soft-links.
My second idea is to simply check whether the subpath starts with /../ and, if it does, treat it as invalid. If /../ exists anywhere else in the subpath, the directory name before it is removed (working left to right, repeating until the path turns out to be invalid or the end of the path name is reached).
The subpath might be given with malicious intent, so I want to be really, really sure about this. Is my second approach safe? Is there a different, better way?
The second approach is safe if you check for /.. (without trailing slash).
I would just forbid .. in the subpath: the cases when .. is really necessary and is not malicious are rather rare.
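A minimal sketch of the "just forbid .." check, assuming '/' is the only separator. It looks for a path component that is exactly "..", so it catches both /../ in the middle and /.. at the end, while leaving names such as a..b alone. The function name has_dotdot_component is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if any '/'-separated component of
 * `subpath` is exactly "..", otherwise 0.
 */
static int has_dotdot_component(const char *subpath)
{
    const char *p = subpath;

    while (*p != '\0') {
        size_t len;

        while (*p == '/')          /* skip one or more separators */
            p++;
        len = strcspn(p, "/");     /* length of the next component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return 1;
        p += len;
    }
    return 0;
}
```

The caller would reject the request whenever this returns 1. It deliberately says nothing about soft-links, which the question allows to lead outside the directory.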

How can I convert path containing wildcard to corresponding file entries in C program?

I'm trying to implement the ls command with wildcard, *.
I have just learned that most shells expand an ls argument containing * into the matching entries before running ls.
For example, the directory foo consists of a.file, b.file, and the directory bar.
The directory bar contains c.file, d.file, and e.file.
Assume that the current directory is foo.
The argument */* is then converted to the following entries:
"bar/c.file", "bar/d.file", "bar/e.file"
How can a program perform this? I don't know where to start, and there are many possible cases:
*/../*, ../../*, */*/*, etc.
Any advice would be awesome. Thank you.
You can of course use glob() to do a lot of this work.
Such patterns are called globs, for some reason I won't dig up now. :)
POSIX provides glob(3) for programmatic wildcard path expansion.
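A small sketch of glob(3) in use, with no flags set. The wrapper name print_matches is an invention for this example, and the return convention (count, or -1 on error) is just one way to do it.

```c
#define _POSIX_C_SOURCE 200809L
#include <glob.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: prints every path matching `pattern`, one per
 * line. Returns the number of matches, or -1 if glob() failed for a
 * reason other than "no match".
 */
static int print_matches(const char *pattern)
{
    glob_t results;
    int rc = glob(pattern, 0, NULL, &results);
    int count = 0;

    if (rc == 0) {
        for (size_t i = 0; i < results.gl_pathc; i++)
            puts(results.gl_pathv[i]);
        count = (int)results.gl_pathc;
    } else if (rc != GLOB_NOMATCH) {
        count = -1;                /* GLOB_NOSPACE or GLOB_ABORTED */
    }

    globfree(&results);
    return count;
}
```

Calling print_matches("*/*") from inside foo would be the programmatic counterpart of what the shell does for ls */*.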

Spaces in Directory Names

Is putting a space in a directory name still a big deal? I've been doing some reading, but all the articles are from the early 2000s. Is it a problem now?
For those who don't get what I mean: public_html/space directory/index.html
If this is still an issue, why shouldn't I use spaces when naming files and directories?
Spaces in URLs are still special characters that need to be escaped or encoded (either a + or %20).
Well, I am still crossing fingers when executing external processes (from ant or Java's ProcessBuilder for example). If you just pass this dir to the external process within the command - it may break apart in two arguments which is clearly not what you want.
Some quoting and minding of the spaces is still required in some use cases.

What can I do if getcwd() and getenv("PWD") don't match?

I have a build system tool that is using getcwd() to get the current working directory. That's great, except that sometimes people have spaces in their paths, which isn't supported by the build system. You'd think that you could just make a symbolic link:
ln -s "Directory With Spaces" DirectoryWithoutSpaces
And then be happy. But unfortunately for me, getcwd() resolves all the symbolic links. I tried to use getenv("PWD"), but it is not pointing at the same path as I get back from getcwd(). I blame make -C for not updating the environment variable, I think. Right now, getcwd() gives me back a path like this:
/Users/carl/Directory With Spaces/Some/Other/Directories
And getenv("PWD") gives me:
/Users/carl/DirectoryWithoutSpaces
So - is there any function like getcwd() that doesn't resolve the symbolic links?
Edit:
I changed
make -C Some/Other/Directories
to
cd Some/Other/Directories ; make
And then getenv("PWD") works. If there's no other solution, I can use that.
According to the Advanced Programming in the UNIX Environment bible by Stevens, p.112:
Since the kernel must maintain knowledge of the current working directory, we should be able to fetch its current value. Unfortunately, all the kernel maintains for each process is the i-node number and device identification for the current working directory. The kernel does not maintain the full pathname of the directory.
Sorry, looks like you do need to work around this in another way.
There is no way for getcwd() to determine the path you followed via symbolic links. The basic implementation of getcwd() stats the current directory '.', and then opens the parent directory '..' and scans the entries until it finds the directory name with the same inode number as '.' has. It then repeats the process upwards until it finds the root directory, at which point it has the full path. At no point does it ever traverse a symbolic link. So the goal of having getcwd() calculate the path followed via symlinks is impossible, whether it is implemented as a system call or as a library function.
The best resolution is to ensure that the build system handles path names containing spaces. That means quoting pathnames passed through the shell. C programs don't care about the spaces in the name; it is only when a program like the shell interprets the strings that you run into problems. (Compilers implemented as shell scripts that run pre-processors often have problems with pathnames that contain spaces - speaking from experience.)
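If you still want to prefer the symlinked name when it is available, one common trick builds on the same inode idea: trust $PWD only when it is absolute and refers to the same device and inode as ".", and fall back to getcwd() otherwise. This is a sketch under those assumptions, and the helper name logical_cwd is hypothetical. It cannot recover the symlinked path when PWD is stale, as in the make -C case. It only tells you when PWD can be trusted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: copies $PWD into `out` if it is an absolute path
 * naming the same directory as ".", otherwise falls back to getcwd().
 * Returns 0 on success, -1 on failure.
 */
static int logical_cwd(char *out, size_t out_size)
{
    const char *pwd = getenv("PWD");
    struct stat pwd_st, dot_st;

    if (pwd != NULL && pwd[0] == '/' && strlen(pwd) < out_size
        && stat(pwd, &pwd_st) == 0 && stat(".", &dot_st) == 0
        && pwd_st.st_dev == dot_st.st_dev
        && pwd_st.st_ino == dot_st.st_ino) {
        strcpy(out, pwd);          /* PWD is current: keep the symlinks */
        return 0;
    }

    /* PWD is missing, relative, or stale: use the physical path. */
    return getcwd(out, out_size) != NULL ? 0 : -1;
}
```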

Resources