Efficiently Traverse Directory Tree with opendir(), readdir() and closedir() - c

The C routines opendir(), readdir() and closedir() provide a way for me to traverse a directory structure. However, the dirent structure returned by readdir() does not seem to provide a useful way for me to obtain the DIR pointers that I would need to recurse into the directory's subdirectories.
Of course, they give me the name of the files, so I could either append that name to the directory path and stat() and opendir() them, or I could change the current working directory of the process via chdir() and roll it back via chdir("..").
The problem with the first approach is that if the directory path is long enough, the cost of passing a string containing it to opendir() will outweigh the cost of opening the directory itself. More theoretically, the complexity could grow beyond linear time (in the total character count of the (relative) filenames in the directory tree).
The second approach has a problem too. Since each process has a single current working directory, all but one thread will have to block in a multithreaded application. Also, I don't know whether the current working directory is merely a convenience (i.e., the relative path is appended to it prior to a filesystem query). If it is, this approach will be inefficient too.
I am open to alternatives to these functions. So how can one traverse a UNIX directory tree efficiently (in linear time in the total character count of the files under it)?

Have you tried ftw(), a.k.a. File Tree Walk?
Snippet from man 3 ftw:
int ftw(const char *dir, int (*fn)(const char *file, const struct stat *sb, int flag), int nopenfd);
ftw() walks through the directory tree starting from the indicated directory dir. For each found entry in the tree, it calls fn() with the full pathname of the entry, a pointer to the stat(2) structure for the entry and an int flag.

You seem to be missing one basic point: directory traversal involves reading data from the disk. Even when/if that data is in the cache, you end up going through a fair amount of code to get it from the cache into your process. Paths are also generally pretty short -- any more than a couple hundred bytes is pretty unusual. Together these mean that you can pretty reasonably build up strings for all the paths you need without any real problem. The time spent building the strings is still pretty minor compared to the time to read data from the disk. That means you can normally ignore the time spent on string manipulation, and work exclusively at optimizing disk usage.
My own experience has been that for most directory traversal a breadth-first search is usually preferable -- as you're traversing the current directory, put the full paths to all sub-directories in something like a priority queue. When you're finished traversing the current directory, pull the first item from the queue and traverse it, continuing until the queue is empty. This generally improves cache locality, so it reduces the amount of time spent reading the disk. Depending on the system (disk speed vs. CPU speed, total memory available, etc.) it's nearly always at least as fast as a depth-first traversal, and can easily be up to twice as fast (or so).
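The breadth-first scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: it uses a plain FIFO linked list rather than a priority queue, it identifies directories with lstat() so symbolic links are never entered, and the names make_node, enqueue, dequeue and bfs_count_dirs are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* One pending directory in a singly linked FIFO queue. */
struct node {
    struct node *next;
    char path[];                /* full path of the directory */
};

struct queue { struct node *head, *tail; };

/* Builds a node holding "dir/name" (or just dir when name is empty). */
static struct node *make_node(const char *dir, const char *name)
{
    size_t len = strlen(dir) + strlen(name) + 2;    /* '/' and '\0' */
    struct node *n = malloc(sizeof *n + len);

    if (n != NULL) {
        snprintf(n->path, len, "%s%s%s", dir, name[0] ? "/" : "", name);
        n->next = NULL;
    }
    return n;
}

static void enqueue(struct queue *q, struct node *n)
{
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
}

static struct node *dequeue(struct queue *q)
{
    struct node *n = q->head;

    if (n != NULL) {
        q->head = n->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return n;
}

/* Walks root breadth-first. Returns the number of directories that could be
   opened, or -1 if the root node cannot be allocated. */
long bfs_count_dirs(const char *root)
{
    struct queue q = { NULL, NULL };
    struct node *cur;
    long visited = 0;

    assert(root != NULL);
    if ((cur = make_node(root, "")) == NULL)
        return -1;
    enqueue(&q, cur);

    while ((cur = dequeue(&q)) != NULL) {
        DIR *d = opendir(cur->path);
        struct dirent *de;

        if (d != NULL) {
            visited++;
            while ((de = readdir(d)) != NULL) {
                struct node *child;
                struct stat st;

                if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                    continue;
                child = make_node(cur->path, de->d_name);
                if (child == NULL)
                    continue;   /* out of memory: skip this entry */
                /* Subdirectories are queued and read only after the current
                   directory is finished; everything else is dropped. */
                if (lstat(child->path, &st) == 0 && S_ISDIR(st.st_mode))
                    enqueue(&q, child);
                else
                    free(child);
            }
            closedir(d);
        }
        free(cur);
    }
    return visited;
}
```

Each directory is opened, read to the end and closed before the next one is touched, so only one DIR is open at a time; the price is that the queue holds a full path string for every pending subdirectory.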

The way to use opendir/readdir/closedir is to make the function recursive! Have a look at the snippet here on Dreamincode.net.
Hope this helps.
EDIT Thanks R.Sahu, the linky has expired, however, found it via wayback archive and took the liberty to add it to gist. Please remember, to check the license accordingly and attribute the original author for the source! :)

Probably overkill for your application, but here's a library designed to traverse a directory tree with hundreds of millions of files.
https://github.com/hpc/libcircle

Instead of opendir(), you can use a combination of openat(), dirfd() and fdopendir() and construct a recursive function to walk a directory tree:
#define _DEFAULT_SOURCE   /* for d_type and DT_DIR on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>

void
dir_recurse (DIR *parent, int level)
{
    struct dirent *ent;
    DIR *child;
    int fd;

    while ((ent = readdir(parent)) != NULL) {
        if ((strcmp(ent->d_name, ".") == 0) ||
            (strcmp(ent->d_name, "..") == 0)) {
            continue;
        }
        /* d_type can be DT_UNKNOWN on some filesystems; such entries are
           printed as plain files here. */
        if (ent->d_type == DT_DIR) {
            printf("%*s%s/\n", level, "", ent->d_name);
            /* Open the child relative to the parent's descriptor. */
            fd = openat(dirfd(parent), ent->d_name, O_RDONLY | O_DIRECTORY);
            if (fd == -1) {
                perror("openat");
                continue;
            }
            child = fdopendir(fd);
            if (child == NULL) {
                perror("fdopendir");
                close(fd);
                continue;
            }
            dir_recurse(child, level + 1);
            closedir(child);    /* also closes fd */
        } else {
            printf("%*s%s\n", level, "", ent->d_name);
        }
    }
}

int
main (void)
{
    DIR *root;

    root = opendir(".");
    if (root == NULL) {
        perror("opendir");
        return 1;
    }
    dir_recurse(root, 0);
    closedir(root);
    return 0;
}
Here readdir() is still used to get the next directory entry. If the next entry is a directory, then we find the parent directory fd with dirfd() and pass this, along with the child directory name to openat(). The resulting fd refers to the child directory. This is passed to fdopendir() which returns a DIR * pointer for the child directory, which can then be passed to our dir_recurse() where it again will be valid for use with readdir() calls.
This program recurses over the whole directory tree rooted at the current directory (.). Entries are printed, indented by 1 space per directory level. Directories are printed with a trailing /.

Related

Is there a programmatic way to tell if a symlink points to a file or directory in C?

I am iterating directories recursively with readdir.
struct dirent is documented here, and notes that d_type can have different values including DT_LNK and DT_DIR and DT_REG.
However, when d_type is DT_LNK, there is no hint about whether the link is to a file or directory.
Is there a way to check whether the target of a symlink is a file or directory?
The dirent structure might tell you if the entry is a directory or a symbolic link, but some file systems do not provide this information in d_type and set it to DT_UNKNOWN.
In any case, the simplest way to tell if a symbolic link points to a directory or not, is via a stat system call that will resolve the symbolic link and if successful, will populate the stat structure, allowing for S_ISDIR to give you the information:
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
int is_directory(const char *path) {
    struct stat st;
    return !stat(path, &st) && S_ISDIR(st.st_mode);
}
Note these caveats:
You must construct the path from the directory name and the entry name to pass to is_directory().
Recursing on directories linked to by symbolic links may cause infinite recursion, as the symbolic link may point to a parent directory or to the directory in which it is located.
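To tell apart a real directory from a symbolic link that merely points at one, lstat() (which does not follow links) can be combined with stat() (which does). The following is a small sketch; the function name is_symlink_to_directory is hypothetical, and it expects the full path, built as described in the first caveat.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if path is itself a symbolic link whose target is a directory,
   and 0 otherwise (including on error). */
int is_symlink_to_directory(const char *path)
{
    struct stat link_st, target_st;

    assert(path != NULL);
    /* lstat() describes the link itself, not what it points to. */
    if (lstat(path, &link_st) != 0 || !S_ISLNK(link_st.st_mode))
        return 0;
    /* stat() follows the link; it fails for dangling or looping links. */
    if (stat(path, &target_st) != 0)
        return 0;
    return S_ISDIR(target_st.st_mode) ? 1 : 0;
}
```

A traversal that skips every entry for which this returns 1 will not recurse through symbolic links, which sidesteps the infinite-recursion caveat at the cost of not visiting linked directories.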

stat alternative for long file paths

I'm writing a program that iterates through a directory tree depth first (similar to the GNU find program) by recursively constructing paths to each file in the tree and stores the relative paths of encountered files. It also collects some statistics about these files. For this purpose I'm using the stat function.
I've noticed that this fails for very deep directory hierarchies, i.e. long file paths, in accordance with stat's documentation.
Now my question is: what alternative approach could I use here that is guaranteed to work for paths of any length? (I don't need working code, just a rough outline would be sufficient).
As you are traversing, open each directory you traverse.
You can then get information about a file in that directory using fstatat. The fstatat function takes an additional parameter, dirfd. If you pass a handle to an open directory in that parameter, the path is interpreted as relative to that directory.
int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags);
The basic usage is:
int dirfd = open("directory path", O_RDONLY);
struct stat st;
int r = fstatat(dirfd, "relative file path", &st, 0);
You can, of course, also use openat instead of open, as you recurse. And the special value AT_FDCWD can be passed as dirfd to refer to the current working directory.
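Putting fstatat and openat together, a traversal that never builds a long path might look like the sketch below. It is only an outline under some assumptions: the names count_entries_fd and count_entries_at are invented for the example, the traversal just counts entries, and symbolic links are deliberately not followed (AT_SYMLINK_NOFOLLOW, O_NOFOLLOW).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Counts the entries below the directory open on fd. Takes ownership of fd:
   it is closed before returning. */
static long count_entries_fd(int fd)
{
    DIR *d = fdopendir(fd);
    struct dirent *de;
    long count = 0;

    if (d == NULL) {
        close(fd);
        return 0;
    }
    while ((de = readdir(d)) != NULL) {
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        /* Only the entry name is passed; it is resolved relative to the
           open directory, so the path length never grows with depth. */
        if (fstatat(dirfd(d), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;
        count++;
        if (S_ISDIR(st.st_mode)) {
            int child = openat(dirfd(d), de->d_name,
                               O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
            if (child != -1)
                count += count_entries_fd(child);
        }
    }
    closedir(d);    /* also closes fd */
    return count;
}

/* Returns the number of entries below root, or -1 if root cannot be opened. */
long count_entries_at(const char *root)
{
    int fd;

    assert(root != NULL);
    fd = open(root, O_RDONLY | O_DIRECTORY);
    return fd == -1 ? -1 : count_entries_fd(fd);
}
```

One descriptor stays open per level of recursion, so an extremely deep tree can run into the process's open-file limit; that is the trade-off for not depending on the path length.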
Caveats
It is easy to get into symlink loops and recurse forever. It is not uncommon to find symlink loops in practice. On my system, /usr/bin/X11 is a symlink to /usr/bin.
Alternatives
There are easier ways to traverse file hierarchies. Use ftw or fts instead, if you can.

Ext2 - how is a file created

What does the process of creating a file in the ext2 file system look like?
I am trying to make a simple syscall which takes a path and creates given file - like touch.
For example, the code:
int main(void)
{
syscall(MY_SYSCALL_NUMBER, "/tmp/file");
}
Should create a file called "file" in /tmp.
Now how should the syscall itself work?
My work so far (I omitted error checking for readability here):
asmlinkage long sys_ccp(const char __user *arg)
{
    struct path path;
    struct inode *new_inode;
    struct qstring qname;
    //omitted copy from user for simplicity
    qname.name = arg;
    qname.len = length(arg);
    kern_path(src, LOOKUP_FOLLOW, &path);
    new_inode = ext2_new_inode(path.dentry->d_parent->d_inode, S_IFREG, &qname);
}
This seems to work (I can see in logs that an inode is allocated), however, when I call ls on the directory I can't see the file there.
My idea was to add the new inode to struct dentry of directory, so I added this code:
struct dentry *new_dentry;
new_dentry = d_alloc(path.dentry->d_parent, &qname);
d_instantiate(new_dentry, new_inode);
However, this still doesn't seem to work (I can't see the file using ls).
How do I implement this syscall correctly? What am I missing?
EDIT:
Regarding R..'s answer: the purpose of this syscall is to play around with ext2 and learn about its design, so we can assume that the path is always valid, the filesystem is indeed ext2, and so on.
You're completely mixing up the abstraction layers involved. If something like your code could even work at all (not sure if it can), it would blow up badly and crash the kernel or lead to runaway wrong code execution if someone happened to make this syscall on a path that didn't actually correspond to an ext2 filesystem.
In the kernel's fs abstraction, the fact that the underlying filesystem is ext2 (or whatever it is) is irrelevant to the task of making a file on it. Rather all of this has to go through fs-type-agnostic layers which in turn end up using the fs-type-specific backends for the fs mounted at the path.

stat() giving wrong directory size in c

I need to find the size of a file or a directory, whichever is given on the command line, using stat(). It works fine for files (both relative and absolute paths), but when I give a directory, it always returns the size as 512 or 1024.
If I print the files in the directory it goes as follows :
Name : .
Name : ..
Name : new
Name : new.c
but only the new and new.c files are actually in there. For this, the size is returned as 512 even if I place more files in the directory.
Here's my code fragment:
    if (stat(request.data, &st) >= 0) {
        request.msgType = (short)0xfe21;
        /* st_size is an off_t; cast it to match the %ld conversion */
        printf("\n Size : %ld\n", (long)st.st_size);
        snprintf(reply.data, sizeof reply.data, "%ld", (long)st.st_size);
        reply.dataLen = strlen(reply.data);
    }
    else {
        perror("\n Stat()");
    }
}
Where did I go wrong???
Here is my request/reply structure:
struct message{
unsigned short msgType;
unsigned int offset;
unsigned int serverDelay;
unsigned int dataLen;
char data[100];
};
struct message request,reply;
I compile it with gcc on a Unix system.
stat() on a directory doesn't return the sum of the file sizes in it. The size field instead represents how much space is taken by the directory's own list of entries, and it varies depending on a few factors. If you want to know how much space is taken by all files below a specific directory, then you have to recurse down the tree, adding up the space taken by all files. This is how tools like du work.
Yes. opendir() + loop on readdir()/stat() will give you the file/directory sizes which you can sum to get a total. If you have sub-directories you will also have to loop on those and the files within them.
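The opendir() + readdir() + stat loop described above might be sketched like this. Some assumptions: the function name tree_size is hypothetical, lstat() is used instead of stat() so that symbolic links are not followed, and the result is the sum of st_size values (apparent size), which is not necessarily what du reports by default since du counts disk blocks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns the sum of st_size over all regular files below path, or -1 if
   path cannot be opened as a directory. */
long long tree_size(const char *path)
{
    DIR *dir;
    struct dirent *de;
    long long total = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        char child[PATH_MAX];
        struct stat st;
        int n;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        /* Build "path/name"; entries whose path does not fit are skipped. */
        n = snprintf(child, sizeof child, "%s/%s", path, de->d_name);
        if (n < 0 || (size_t)n >= sizeof child)
            continue;
        if (lstat(child, &st) != 0)
            continue;
        if (S_ISREG(st.st_mode)) {
            total += st.st_size;
        } else if (S_ISDIR(st.st_mode)) {
            long long sub = tree_size(child);   /* recurse into subdirectory */
            if (sub > 0)
                total += sub;
        }
    }
    closedir(dir);
    return total;
}
```

A caller would check S_ISDIR on the command-line argument first and use st_size directly for a plain file, falling back to tree_size() only for directories.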
To use du you could use the system() function. This only returns a result code to the calling program so you could save the results to a file and then read the file. The code would be something like,
system("du -sb dirname > du_res_file");
Then you can read the file du_res_file (assuming it has been created successfully) to get your answer. This would give the size of the directory + sub-directories + files in one go.
I'm sorry, I missed it the first time: stat only gives the size of files, not directories:
These functions return information about a file. No permissions are required on the file itself, but, in the case of stat() and lstat(), execute (search) permission is required on all of the directories in path that lead to the file.
The st_size field gives the size of the file (if it is a regular file or a symbolic link) in bytes. The size of a symbolic link is the length of the pathname it contains, without a terminating null byte.
Look at the man page for fstat/stat.
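The remark about symbolic links can be seen in practice when sizing a buffer for readlink(). Below is a small sketch of that idea; the function name link_target is hypothetical, and the PATH_MAX fallback is an assumption made to cope with filesystems that report a size of 0 for links.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns a malloc()ed copy of the symlink's target, or NULL if path is not
   a symbolic link or an error occurs. The caller frees the result. */
char *link_target(const char *path)
{
    struct stat st;
    size_t size;
    ssize_t len;
    char *buf;

    assert(path != NULL);
    if (lstat(path, &st) != 0 || !S_ISLNK(st.st_mode))
        return NULL;
    /* st_size is the target's length without the terminating null byte;
       fall back to PATH_MAX where the filesystem reports 0. */
    size = st.st_size > 0 ? (size_t)st.st_size + 1 : PATH_MAX;
    buf = malloc(size);
    if (buf == NULL)
        return NULL;
    len = readlink(path, buf, size);
    if (len < 0 || (size_t)len >= size) {   /* error, or the link changed */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';    /* readlink() does not null-terminate */
    return buf;
}
```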

In C, how do I get to a specified directory?

I have to do a program where I need to index the files in a specified directory. I've gotten the indexing part down, but what I'm having trouble with is how to navigate to the directory.
For example, say when I start the program, it will ask "What directory would you like to index," And then the input would be "usr/Documents/CS/Assignment4," how do I get to the "Assignment4" directory? I know recursion is needed, but I'm really confused as to how directories work in C. Say my source file is in "usr/Documents/SourceCode," then what should I do to get to Assignment4?
I know I sound like I want all the answers, but I'm completely lost as to how directories work, and the book I have sucks. So even if all you have is a link to a good tutorial on this, that would be fantastic.
I'm running Linux, Ubuntu to be exact. GCC is the compiler.
The C programming language doesn't have a notion of a file system. This is instead an operating system specific question.
Based on the style of directory in your question though it sounds like you're on a unix / linux style system. If that's the case then you're looking for the opendir function
http://linux.die.net/man/3/opendir
Recursively traversing a directory in C goes something like this:
Use opendir and readdir to list the directory entries. I probably shouldn't be doing this, but I'm posting a full code sample (sans error handling) because there are a bunch of little things you have to do to ensure you're using the API correctly:
DIR *dir;
struct dirent *de;
const char *name;

dir = opendir(dirpath);
if (dir == NULL) {
    /* handle error and return: dir must not be used below */
}
for (;;) {
    errno = 0;   /* readdir() returns NULL both at end of directory and on error */
    de = readdir(dir);
    if (de == NULL) {
        if (errno != 0) {
            /* handle error */
        }
        /* error or no more entries left: stop reading either way */
        break;
    }
    /* name of file (prefix it with dirpath to get a usable file path) */
    name = de->d_name;
    /* ignore . and .. */
    if (name[0] == '.' && (name[1] == '\0' || (name[1] == '.' && name[2] == '\0')))
        continue;
    /* do something with the file */
}
if (closedir(dir) != 0) {
    /* handle error */
}
When working with each file, be sure to prepend the dirpath to it (along with a slash, if needed). You could also use chdir to descend and ascend, but it introduces complications in practice (e.g. you can't traverse two directories simultaneously), so I personally recommend keeping your current working directory stationary and using string manipulation to concatenate paths.
To find out if a path is a directory or not (and hence whether you should opendir() it), I recommend using lstat() rather than stat(), as the latter follows symbolic links, meaning your directory traversal could get caught in a loop and you'll end up with something like this ctags output.
Of course, since directory structure is recursive in nature, recursion plays a natural role in the traversal process: make a recursive call when a child path is a directory.
The name of the directory is only a string.
So opendir("filename"); will make it possible to read the directory "filename".
However, you should perhaps start thinking in terms of filenames and paths.
"usr/Documents/SourceCode" + "/../CS/Assignment4" is the same as "usr/Documents/CS/Assignment4"; however, I assume you are missing the leading "/".
Well, I don't get how you can be lost about how directories work. A directory is no different from a "folder" in Windows or in Mac OS X. Bottom line: a hard disk has a filesystem, and a filesystem consists only of folders/directories that "contain" files (and special files like named sockets etc., which should not interest you right now).
Hope this helped at least a bit.
Angelo
