Globbing in C, how to exclude files


I've read http://linux.die.net/man/3/glob and it seems that glob will do disk access, even though I don't want it to.
Is there a C glob function to compare a string with a glob pattern and tell me if it matches? i.e. no disk access.
If not, how can I use glob to exclude files when recursively (depth first) traversing a filesystem?
while ((entry = readdir(dp))) {
    // need to continue to next iteration of loop, here, if entry->d_name matches glob pattern
    // do stuff and recurse
}

Use the fnmatch function. It compares a filename/path against a given pattern.
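As a rough sketch of how that could look in the readdir loop from the question: the helper name is_excluded is made up for illustration, and the choice to always skip "." and ".." is an assumption about what the traversal wants. fnmatch only compares strings, so no disk access is involved.

```c
#include <assert.h>
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` should be skipped, 0 otherwise.
   `pattern` is a shell-style glob such as "*.o". Pure string matching. */
int is_excluded(const char *name, const char *pattern)
{
    assert(name != NULL && pattern != NULL);

    /* Assumption: a recursive walk never wants to descend into these. */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 1;

    /* fnmatch returns 0 on a match and FNM_NOMATCH otherwise. */
    return fnmatch(pattern, name, 0) == 0;
}
```

Inside the loop this would be used as: if (is_excluded(entry->d_name, "*.o")) continue; With flags set to 0, a leading dot in the name is matched by * as well; pass FNM_PERIOD if hidden files should require an explicit dot in the pattern.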

Related

How to print the file tree of only the found files in a recursive function?

I have a recursive function that searches a path for a given file name. What I am trying to do is to print the files that match, along with their parent directories.
So for a file tree like this:
mydir
mysubdir
mysubsubdir
file1
file2
file1
mysubdir2
file2
I want to print this when I search for file1:
mydir
mysubdir
mysubdir
file1
file1
I am able to see the path of each found file, so I thought of constructing a new tree from those paths and then printing that tree, but it seems to me that there must be a much simpler way.

Your function needs the path from the root to the directory it is currently processing. For example, pass it via a const char ** argument and append to it each time you descend into a directory (use a linked list if you don't like realloc, or ensure a sufficiently large size up front). When there is a match, you can print the path starting from the root (see below, though).
To get the shortcut behavior for mydir/file1, you also need the path of the previous match. This could be another const char ** argument. The print rule is then refined: indent by as many levels as the previous and current match have path elements in common, then print the remaining, unique part of the current match's path. This implies a depth-first search in which subdirectories are visited in sorted order.
Instead of printing as you go along, you can also record each match in a const char *** or in a tree, as suggested by @wildplasser. Then loop or walk through the result using the same refined print algorithm (if you use a tree, you only need to know the level, not the prefix path). If you don't do a depth-first search, you can sort the array in this approach. And if you use a tree to store the result, walking it depth first gives you the same order.
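The refined print rule can be sketched as below. For brevity this assumes each match is held as a single '/'-separated relative path string rather than a const char ** array, and it writes into a buffer instead of printing directly; common_components and format_match are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of leading '/'-separated components shared by a and b.
   Assumes relative paths without leading or doubled slashes. */
size_t common_components(const char *a, const char *b)
{
    size_t count = 0;
    for (;;) {
        size_t la = strcspn(a, "/");
        size_t lb = strcspn(b, "/");
        if (la == 0 || la != lb || strncmp(a, b, la) != 0)
            return count;
        count++;
        if (a[la] == '\0' || b[lb] == '\0')
            return count;
        a += la + 1;
        b += lb + 1;
    }
}

/* Writes the part of `cur` not shared with `prev` into buf, one component
   per line, indented two spaces per level. Pass "" as prev for the first
   match. Output is silently truncated if buf is too small. */
void format_match(char *buf, size_t size, const char *prev, const char *cur)
{
    size_t depth = common_components(prev, cur);
    size_t off = 0;
    const char *p = cur;

    assert(size > 0);
    buf[0] = '\0';

    /* Skip the components already printed for the previous match. */
    for (size_t i = 0; i < depth; i++) {
        p += strcspn(p, "/");
        if (*p == '/')
            p++;
    }
    while (*p != '\0') {
        size_t len = strcspn(p, "/");
        int n = snprintf(buf + off, size - off, "%*s%.*s\n",
                         (int)(2 * depth), "", (int)len, p);
        if (n < 0 || (size_t)n >= size - off)
            break;                      /* buffer full */
        off += (size_t)n;
        depth++;
        p += len;
        if (*p == '/')
            p++;
    }
}
```

Calling format_match for each match in depth-first order, with the previous match as prev, yields the pruned tree. A match that shares its parent directories with the previous one only contributes its own indented line.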

What is the safest way to check that a file resides within a base directory?

What is the safest and most secure way in Go to validate on any platform that a given file path lies within a base path?
The paths are initially provided as strings and use "/" as separators, but they are user-supplied and I need to assume plenty of malicious inputs. Which kind of path normalization should I perform to ensure that e.g. sequences like ".." are evaluated, so I can safely check against the base path? What exceptions are there to watch out for on various file systems and platforms? Which Go libraries are supposed to be safe in that respect?
The results will be fed to external functions like os.Create and sqlite3.Open and any failure to recognize that the base path is left would be a security violation.

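I believe you could use filepath.Rel for this: check that the returned path is not ".." and does not start with ".." followed by a separator (a plain prefix test for ".." would also reject legitimate names such as "..data").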
Rel returns a relative path that is lexically equivalent to targpath
when joined to basepath with an intervening separator. That is,
Join(basepath, Rel(basepath, targpath)) is equivalent to targpath
itself. On success, the returned path will always be relative to
basepath, even if basepath and targpath share no elements. An error is
returned if targpath can't be made relative to basepath or if knowing
the current working directory would be necessary to compute it. Rel
calls Clean on the result.
filepath.Rel also calls filepath.Clean on its input paths, resolving any .s and ..s.
Clean returns the shortest path name equivalent to path by purely
lexical processing. It applies the following rules iteratively until
no further processing can be done:
Replace multiple Separator elements with a single one.
Eliminate each . path name element (the current directory).
Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path, assuming Separator is '/'.
You could also use filepath.Clean directly and check for prefix when it's done. Here are some sample outputs for filepath.Clean:
ps := []string{
	"/a/../b",
	"/a/b/./c/../../d",
	"/b",
}
for _, p := range ps {
	fmt.Println(p, filepath.Clean(p))
}
Prints:
/a/../b /b
/a/b/./c/../../d /a/d
/b /b
That said, path manipulation shouldn't be the only security mechanism you deploy. If you truly worry about exploits, use defense in depth by sandboxing, creating a virtual file system / containers, etc.

How are dirent entries ordered?

I am at a loss as to how dirent entries are ordered. For example, if I had the code
DIR* dir = opendir("/some/directory");
struct dirent* entry;
while ((entry = readdir(dir)))
printf("%s\n", entry->d_name);
This may output something like the following:
abcdef
example3
..
.
123456789
example2
example1
As you can see, this output is not alphabetically ordered. So, I was wondering how exactly dirent entries are ordered? What causes some entries to have a higher precedence than others?

They are not ordered alphabetically; they are retrieved in the order that the filesystem maintains them.
The directory "file" simply contains a list of filenames and inode numbers. For some filesystem types, the filesystem prefers to not split a filename/inode value across blocks. As the filesystem adds or removes files from the list, it may find space in one of the blocks. Other schemes (such as frequently-used filenames being earlier in the list) are possible.
A list sorted by filename depends upon the way things are sorted: it can be locale-dependent. (The filesystem does not know or care about your locale-settings). So that decision is left to applications rather than the filesystem itself.
For additional comments, see
What determines the order directory entries are returned by getdents?
What is the “directory order” of files in a directory (used by ls -U)?
In Bash, are wildcard expansions guaranteed to be in order?

They are not ordered in any relevant way. It's up to the implementation to retrieve and return directory entries in whatever order is most convenient.
Advanced Programming in the UNIX Environment, 3rd ed., goes a little further and even says that the order is usually not alphabetical (Chapter 4, Section 4.22):
Note that the ordering of entries within the directory is
implementation dependent and is usually not alphabetical.
If you're wondering, the output of ls is sorted because ls sorts it.
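If an application wants sorted output the way ls produces it, one option is to let scandir do the sorting with alphasort, which compares names using strcoll and therefore follows the current locale. The following is a minimal sketch; sorted_entries, free_entries and the skip_dots filter are hypothetical names, and dropping "." and ".." is an assumption about what the caller wants.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Filter for scandir: keep everything except "." and "..". */
static int skip_dots(const struct dirent *e)
{
    return strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0;
}

/* Stores a malloc'd, alphasort-ordered array of entries in *list.
   Returns the number of entries, or -1 if the directory cannot be read. */
int sorted_entries(const char *path, struct dirent ***list)
{
    assert(path != NULL && list != NULL);
    return scandir(path, list, skip_dots, alphasort);
}

/* Releases an array returned through sorted_entries. */
void free_entries(struct dirent **list, int n)
{
    for (int i = 0; i < n; i++)
        free(list[i]);
    free(list);
}
```

After a successful call, list[i]->d_name holds the names in sorted order. Because the comparison is locale-dependent, the order can differ between a program that called setlocale and one that did not.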

Emacs lisp: Concise way to get `directory-files` without "." and ".."?

The function directory-files returns the . and .. entries as well. While it is true in a sense that only this way does the function return all existing entries, I have yet to see a use for including these. On the other hand, every time I use directory-files I also write something like
(unless (string-match-p "^\\.\\.?$" ...
or for better efficiency
(unless (or (string= "." entry)
(string= ".." entry))
..)
Particularly in interactive use (M-:) the extra code is undesirable.
Is there some predefined function that returns only actual subentries of a directory efficiently?

You can do this as part of the original function call.
(directory-files DIRECTORY &optional FULL MATCH NOSORT)
If MATCH is non-nil, mention only file names that match the regexp MATCH.
so:
(directory-files (expand-file-name "~/") nil "^\\([^.]\\|\\.[^.]\\|\\.\\..\\)")
or:
(defun my-directory-files (directory &optional full nosort)
"Like `directory-files' with MATCH hard-coded to exclude \".\" and \"..\"."
(directory-files directory full "^\\([^.]\\|\\.[^.]\\|\\.\\..\\)" nosort))
although something more akin to your own approach might make for a more efficient wrapper, really.
(defun my-directory-files (directory &optional full match nosort)
"Like `directory-files', but excluding \".\" and \"..\"."
(delete "." (delete ".." (directory-files directory full match nosort))))
although that's processing the list twice, and we know there's only one instance of each of the names we wish to exclude (and there's a fair chance they'll appear first), so something more like this might be a good solution if you're expecting to deal with large directories on a frequent basis:
(defun my-directory-files (directory &optional full match nosort)
  "Like `directory-files', but excluding \".\" and \"..\"."
  (let* ((files (cons nil (directory-files directory full match nosort)))
         (parent files)
         (current (cdr files))
         (exclude (list "." ".."))
         (file nil))
    (while (and current exclude)
      (setq file (car current))
      (if (not (member file exclude))
          (setq parent current)
        (setcdr parent (cdr current))
        (setq exclude (delete file exclude)))
      (setq current (cdr current)))
    (cdr files)))

If you use f.el, a convenient file and directory manipulation library, the function f-entries is all you need.
However, if you don't want to use this library for some reason and you are OK with a non-portable *nix solution, you can use the ls command.
(defun my-directory-files (d)
(let* ((path (file-name-as-directory (expand-file-name d)))
(command (concat "ls -A1d " path "*")))
(split-string (shell-command-to-string command) "\n" t)))
The code above suffices, but read on for an explanation.
Get rid of dots
According to man ls:
-A, --almost-all
do not list implied . and ..
Since split-string splits a string on whitespace by default, we can parse the ls output:
(split-string (shell-command-to-string "ls -A"))
Spaces in filenames
The problem is that some filenames may contain spaces. split-string by default splits by regex in variable split-string-default-separators, which is "[ \f\t\n\r\v]+".
-1 list one file per line
-1 makes ls delimit files by newline, so "\n" can be passed as the sole separator. You can wrap this in a function and use it with an arbitrary directory.
(split-string (shell-command-to-string "ls -A1") "\n")
Recursion
But what if you want to recursively dive into subdirectories, returning files for future use? If you just change directory and issue ls, you'll get filenames without paths, so Emacs wouldn't know where these files are located. One solution is to make ls always return absolute paths. According to man ls:
-d, --directory
list directory entries instead of contents, and do not dereference symbolic links
If you pass an absolute path to a directory with a wildcard and the -d option, you'll get a list of absolute paths of its immediate files and subdirectories, according to How can I list files with their absolute path in linux?. For an explanation of the path construction, see In Elisp, how to get path string with slash properly inserted?.
(let ((path (file-name-as-directory (expand-file-name d))))
(split-string (shell-command-to-string (concat "ls -A1d " path "*")) "\n"))
Omit null string
Unix commands add a trailing newline to their output, so that the prompt starts on a new line. Otherwise instead of:
user@host$ ls
somefile.txt
user@host$
there would be:
user@host$ ls
somefile.txtuser@host$
When you pass custom separators to split-string, it treats the text after this final newline as an element of its own. In general, this makes it possible to parse CSV files correctly, where an empty line may be valid data. But with ls we end up with an empty string, which should be omitted by passing t as the third parameter to split-string.

How about just using remove-if?
(remove-if (lambda (x) (member x '("." "..")))
(directory-files path))

How to determine if a path is inside a directory? (POSIX)

In C, using POSIX calls, how can I determine if a path is inside a target directory?
For example, a web server has its root directory in /srv, this is getcwd() for the daemon.
When parsing a request for /index.html, it returns the contents of /srv/index.html.
How can I filter out requests for paths outside of /srv?
/../etc/passwd,
/valid/../../etc/passwd,
etc.
Splitting the path at / and rejecting any array containing .. will break valid accesses /srv/valid/../index.html.
Is there a canonical way to do this with system calls? Or do I need to manually walk the path and count directory depth?

There's always realpath:
The realpath() function shall derive, from the pathname pointed to by *file_name*, an absolute pathname that resolves to the same directory entry, whose resolution does not involve '.' , '..' , or symbolic links.
Then compare what realpath gives you with your desired root directory and see if they match up.
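A minimal sketch of that comparison, assuming a POSIX.1-2008 system where realpath accepts NULL as its second argument and allocates the result. The names has_dir_prefix and path_is_inside are hypothetical. Note that realpath fails for paths that do not exist, which this sketch reports as -1.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if `path` equals `root` or lies below it, judged purely by
   string comparison. Both must already be canonical absolute paths. */
int has_dir_prefix(const char *root, const char *path)
{
    size_t n = strlen(root);
    assert(n > 0);
    if (strncmp(root, path, n) != 0)
        return 0;
    /* "/srv" must not match "/srvx": require a component boundary. */
    return root[n - 1] == '/' || path[n] == '\0' || path[n] == '/';
}

/* Returns 1 if inside, 0 if outside, -1 if either path cannot be resolved. */
int path_is_inside(const char *root, const char *path)
{
    char *real_root = realpath(root, NULL);
    char *real_path = realpath(path, NULL);
    int result = -1;

    if (real_root != NULL && real_path != NULL)
        result = has_dir_prefix(real_root, real_path);

    free(real_root);   /* free(NULL) is a no-op */
    free(real_path);
    return result;
}
```

The boundary check in has_dir_prefix matters: a bare strncmp would accept /srvx/secret as being under /srv. The result also only describes the file system at the moment of the call; a symbolic link swapped in afterwards is not covered.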
You could also clean up the filename by hand by expanding the double-dots before you prepend the "/srv". Split the incoming path on slashes and walk through it piece by piece. If you get a "." then remove it and move on; if you get a "..", then remove it and the previous component (taking care not to go past the first entry in your list); if you get anything else, just move on to the next component. Then paste what's left back together with slashes between the components and prepend your "/srv/". So if someone gives you "/valid/../../etc/passwd", you'll end up with "/srv/etc/passwd" and "/where/is/../pancakes/house" will end up as "/srv/where/pancakes/house".
That way you can't get outside "/srv" (except through symbolic links of course) and an incoming "/../.." will be the same as "/" (just like in a normal file system). But you'd still want to use realpath if you're worried about symbolic links under "/srv".
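The component-by-component cleanup described above could be sketched as follows. map_request and its parameters are hypothetical names, root is assumed to be given without a trailing slash, and the walk builds the result directly in the output buffer instead of in a separate list.

```c
#include <assert.h>
#include <string.h>

/* Lexically resolves "." and ".." in `request` and writes root + cleaned
   path into out. Returns 0 on success, -1 if the result does not fit.
   Assumes `root` has no trailing slash. Symbolic links are not examined. */
int map_request(const char *root, const char *request, char *out, size_t size)
{
    assert(root != NULL && request != NULL && out != NULL);

    size_t base = strlen(root);
    if (base + 1 > size)
        return -1;
    memcpy(out, root, base + 1);
    size_t len = base;                  /* current length of out */

    const char *p = request;
    while (*p != '\0') {
        size_t n = strcspn(p, "/");     /* length of this component */
        if (n == 0 || (n == 1 && p[0] == '.')) {
            /* empty component or ".": nothing to do */
        } else if (n == 2 && p[0] == '.' && p[1] == '.') {
            /* "..": drop the previous component, never the root itself */
            while (len > base && out[len - 1] != '/')
                len--;
            if (len > base)
                len--;                  /* drop the separating '/' too */
            out[len] = '\0';
        } else {
            if (len + 1 + n + 1 > size)
                return -1;
            out[len++] = '/';
            memcpy(out + len, p, n);
            len += n;
            out[len] = '\0';
        }
        p += n;
        if (*p == '/')
            p++;
    }
    if (len == base) {                  /* request resolved to the root */
        if (len + 2 > size)
            return -1;
        out[len++] = '/';
        out[len] = '\0';
    }
    return 0;
}
```

With root "/srv", the request "/valid/../../etc/passwd" comes out as "/srv/etc/passwd", and "/../.." comes out as "/srv/", matching the behavior described above.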
Working with the path name component by component would also allow you to break the connection between the layout you present to the outside world and the actual file system layout; there's no need for "/this/that/other/thing" to map to an actual "/srv/this/that/other/thing" file anywhere, the path could just be a key in some sort of database or some sort of namespace path to a function call.

To determine if a file F is within a directory D, first stat D to determine its device number and inode number (members st_dev and st_ino of struct stat).
Then stat F to determine if it is a directory. If not, call dirname to determine the name of the directory containing it. Set G to the name of this directory. If F was already a directory, set G=F.
Now, F is within D if and only if G is within D. Next we have a loop.
while (1) {
    if (samefile(d_statinfo.st_dev, d_statinfo.st_ino, G)) {
        return 1; // F was within D
    } else if (0 == strcmp("/", G)) {
        return 0; // F was not within D.
    }
    G = dirname(G);
}
The samefile function is simple:
int samefile(dev_t ddev, ino_t dino, const char *path) {
    struct stat st;
    if (0 == stat(path, &st)) {
        return ddev == st.st_dev && dino == st.st_ino;
    } else {
        throw ...; // or return error value (but also change the caller to detect it)
    }
}
This will work on POSIX filesystems. But many filesystems are not POSIX. Problems to look out for include:
Filesystems where the device/inode are not unique. Some FUSE filesystems are examples of this; they sometimes make up inode numbers when the underlying filesystems don't have them. They shouldn't re-use inode numbers, but some FUSE filesystems have bugs.
Broken NFS implementations. On some systems all NFS filesystems have the same device number. If they pass through the inode number as it exists on the server, this could cause a problem (though I've never seen it happen in practice).
Linux bind mount points. If /a is a bind mount of /b, then /a/1 correctly appears to be inside /a, but with the implementation above, /b/1 also appears to be inside /a. I think that's probably the correct answer. However, if this is not the result you prefer, this is easily fixed by changing the return 1 case to call strcmp() to compare the path names too. However, for this to work you will need to start by calling realpath on both F and D. The realpath call can be quite expensive (since it may need to hit the disk a number of times).
The special path //foo/bar. POSIX allows path names beginning with // to be special in a way which is somewhat not well defined. Actually I forget the precise level of guarantee about semantics that POSIX provides. I think that POSIX allows //foo/bar and //baz/ugh to refer to the same file. The device/inode check should still do the right thing for you but you may find it does not (i.e. you may find that //foo/bar and //baz/ugh can refer to the same file but have different device/inode numbers).
This answer assumes that we start with an absolute path for both F and D. If this is not guaranteed you may need to do some conversion using realpath() and getcwd(). This will be a problem if the name of the current directory is longer than PATH_MAX (which can certainly happen).

You should simply process .. yourself and remove the previous path component when it's found, so that there are no occurrences of .. in the final string you use for opening files.
