Delete files while reading directory with readdir() - c

My code is something like this:
DIR *pDir = opendir("/path/to/my/dir");
struct dirent *pFile = NULL;
while ((pFile = readdir(pDir)) != NULL) {
    // Check if it is a .zip file
    if (strstr(pFile->d_name, ".zip")) {
        // It is a .zip file: delete it and the matching log file
        char zipname[200];
        snprintf(zipname, sizeof(zipname), "/path/to/my/dir/%s", pFile->d_name);
        unlink(zipname);
        // Strip off ".zip" and append ".log"
        char logname[200];
        snprintf(logname, sizeof(logname), "/path/to/my/dir/%.*s.log",
                 (int)(strlen(pFile->d_name) - 4), pFile->d_name);
        unlink(logname);
    }
}
closedir(pDir);
(this code is untested and purely an example)
The point is: Is it allowed to delete a file in a directory while looping through the directory with readdir()?
Or will readdir() still find the deleted .log file?

Quote from POSIX readdir:
If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.
So, my guess is ... it depends.
It depends on the OS, on the time of day, on the relative order of the files added/deleted, ...
And, as a further point, between the time the readdir() function returns and you try to unlink() the file, some other process could have deleted that file and your unlink() fails.
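One way to cope with that race is to treat "already gone" as success. The following is only a minimal sketch under that assumption; the helper name remove_if_present is hypothetical and not part of any library.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: remove a file, treating "already gone" as success.
 * Returns 0 if the file does not exist afterwards, -1 on any other error
 * (errno is left as set by unlink()). */
int remove_if_present(const char *path)
{
    if (unlink(path) == 0)
        return 0;       /* we removed it */
    if (errno == ENOENT)
        return 0;       /* another process removed it first */
    return -1;          /* permission problem, path is a directory, ... */
}
```

Whether ENOENT should really count as success depends on the program; a caller that needs to know who deleted the file would have to check errno itself.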
Edit
I tested with this program:
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
int main(void) {
struct dirent *de;
DIR *dd;
/* create files `one.zip` and `one.log` before entering the readdir() loop */
printf("creating `one.log` and `one.zip`\n");
system("touch one.log"); /* assume it worked */
system("touch one.zip"); /* assume it worked */
dd = opendir("."); /* assume it worked */
while ((de = readdir(dd)) != NULL) {
printf("found %s\n", de->d_name);
if (strstr(de->d_name, ".zip")) {
char logname[1200];
size_t i;
if (*de->d_name == 'o') {
/* create `two.zip` and `two.log` when the program finds `one.zip` */
printf("creating `two.zip` and `two.log`\n");
system("touch two.zip"); /* assume it worked */
system("touch two.log"); /* assume it worked */
}
printf("unlinking %s\n", de->d_name);
if (unlink(de->d_name)) perror("unlink");
strcpy(logname, de->d_name);
i = strlen(logname);
logname[i-3] = 'l';
logname[i-2] = 'o';
logname[i-1] = 'g';
printf("unlinking %s\n", logname);
if (unlink(logname)) perror("unlink");
}
}
closedir(dd); /* assume it worked */
return 0;
}
On my computer, readdir() finds deleted files and does not find files created between opendir() and readdir(). But it may be different on another computer; it may be different on my computer if I compile with different options; it may be different if I upgrade the kernel; ...
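If you would rather not depend on the unspecified behaviour at all, one option is to finish reading the directory first and delete afterwards. This is a sketch under that assumption, not the code from the question; the names has_zip_suffix and delete_zips_with_logs are hypothetical.

```c
#define _GNU_SOURCE   /* for strdup() under strict -std= modes */
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: true if name ends in ".zip" and has a non-empty stem. */
static int has_zip_suffix(const char *name)
{
    size_t n = strlen(name);
    return n > 4 && strcmp(name + n - 4, ".zip") == 0;
}

/* Returns the number of .zip files removed, or -1 on error. */
int delete_zips_with_logs(const char *dirpath)
{
    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    char **names = NULL;
    size_t count = 0, cap = 0;
    int failed = 0;
    struct dirent *de;

    /* Pass 1: only read the directory and remember the matching names. */
    while ((de = readdir(dir)) != NULL) {
        if (!has_zip_suffix(de->d_name))
            continue;
        if (count == cap) {
            size_t newcap = cap ? cap * 2 : 16;
            char **tmp = realloc(names, newcap * sizeof *tmp);
            if (tmp == NULL) { failed = 1; break; }
            names = tmp;
            cap = newcap;
        }
        if ((names[count] = strdup(de->d_name)) == NULL) { failed = 1; break; }
        count++;
    }
    closedir(dir);

    /* Pass 2: the stream is closed, so deleting cannot affect the scan. */
    int removed = 0;
    for (size_t i = 0; i < count; i++) {
        char path[PATH_MAX];
        int stem = (int)strlen(names[i]) - 4;   /* length without ".zip" */
        if (!failed) {
            snprintf(path, sizeof path, "%s/%s", dirpath, names[i]);
            if (unlink(path) == 0)
                removed++;
            snprintf(path, sizeof path, "%s/%.*s.log", dirpath, stem, names[i]);
            if (unlink(path) != 0 && errno != ENOENT)
                perror(path);
        }
        free(names[i]);
    }
    free(names);
    return failed ? -1 : removed;
}
```

The cost is memory proportional to the number of matching names; in exchange, the result no longer depends on how a given system handles removals during a scan.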

I'm testing my new Linux reference book, The Linux Programming Interface by Michael Kerrisk, and it says the following:
SUSv3 explicitly notes that it is unspecified whether readdir() will return a filename that has been added to or removed from the directory since the last call to opendir() or rewinddir(). All filenames that have been neither added nor removed since the last such call are guaranteed to be returned.
I think that what is unspecified is what happens to dirents not yet scanned. Once an entry has been returned, it is 100% guaranteed that it will not be returned again, whether or not you unlink the current dirent.
Also note the guarantee provided by the second sentence. Since you are leaving the other files alone and only unlinking the current entry for the zip file, SUSv3 guarantees that all the other files will be returned. What happens to the log file is unspecified: it may or may not be returned by readdir(), but in your case that shouldn't be harmful.
The reason I explored this question is to find an efficient way to close file descriptors in a child process before exec().
The suggested way in APUE from Stevens is to do the following:
int max_open = sysconf(_SC_OPEN_MAX);
for (int i = 0; i < max_open; ++i)
close(i);
but I am thinking of using code similar to what is found in the OP to scan the /dev/fd/ directory to know exactly which fds I need to close. (Special note to myself: skip over the dirfd contained in the DIR handle.)
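A rough sketch of that /dev/fd scan, assuming the system provides /dev/fd (Linux and most BSDs do; it is not required by POSIX). The function name close_fds_from is hypothetical.

```c
#define _GNU_SOURCE   /* for dirfd() under strict -std= modes */
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open descriptor >= lowfd, except the one used for the scan.
 * Returns 0 on success, -1 if /dev/fd could not be opened. */
int close_fds_from(int lowfd)
{
    DIR *dir = opendir("/dev/fd");
    if (dir == NULL)
        return -1;

    int scan_fd = dirfd(dir);   /* the descriptor inside the DIR handle */
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(de->d_name, &end, 10);
        if (end == de->d_name || *end != '\0')
            continue;           /* not a number: "." or ".." */
        if (fd >= lowfd && fd != scan_fd)
            close((int)fd);
    }
    closedir(dir);              /* this also closes scan_fd */
    return 0;
}
```

Closing descriptors during the scan removes entries from the directory being read, which is exactly the unspecified case discussed above. It is harmless here: an entry that is no longer returned is already closed, and close() on a stale number merely fails with EBADF.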

I found the following page, which describes a solution to this problem.
https://support.apple.com/kb/TA21420

Related

using threading and mutex locks to search directories

I am new to threading, and I believe I understand the concept. Locks are a necessary tool for threading, but I find them confusing and cannot seem to use them correctly.
The idea here is to search through directories to find CSV files. (More work will be done on the CSVs, but that is not relevant here.) I have an algorithm to search through directories that works fine without threading. (Keep in mind that searching directories is a natural fit for recursion: searching one directory turns up another directory, which you then want to search in turn.) Since I need to start a thread each time a new directory is found, I have the same algorithm set up twice: once in main, where it finds directories, and once in a function that main calls (through threading) to search each directory found.
Again, without threading I have zero problems, but with threading the arguments I send in to the function are overwritten. This happens even if I lock the entire function. Clearly I am not using locks and threading correctly, but where I'm going wrong eludes me.
I have test directories to verify whether it is working: 3 directories in the "." directory, with subdirectories beyond that. It finds the first three directories (in main) fine, but when it passes those to the threaded function, it searches three times, usually searching the same directory more than once. In other words, the path name seems to be overwritten. I'll post code so you can see what I'm doing. Thank you in advance.
Links to complete code: sorter.h https://pastebin.com/0vQZbrmh sorter.c https://pastebin.com/9wd8aa74 dirWorker.c https://pastebin.com/Jd4i1ecr
In sorter.h
#define MAXTHREAD 255
extern pthread_mutex_t lock;
typedef
struct _dir_proc
{
char* path; //the path to the new found directory
char* colName; //related to the other work that must be done
} dir_proc;
In sorter.c
#include <pthread.h>
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include "sorter.h"
pthread_mutex_t lock;
int main(int argc, char* argv[])
{
    int err = 0;
    int count = 0; //number of threads created so far
    pthread_t threads[MAXTHREAD];
    DIR *dirPointer;
    char* searchedDirectory = ".";
    struct dirent *directEntry;
    dir_proc *dir_proc_args = malloc(sizeof(struct _dir_proc));
    assert(dir_proc_args != NULL);
    dir_proc_args->path = (char*) malloc(256 * sizeof(char));
    assert(dir_proc_args->path != NULL);
    dir_proc_args->colName = (char*) malloc(256 * sizeof(char));
    assert(dir_proc_args->colName != NULL);
    pthread_mutex_init(&lock, NULL);
    //dir_proc_args->colName is saved here
    if(!(dirPointer = opendir(searchedDirectory)))
    {
        fprintf(stderr, "opening of directory has failed");
        exit(1);
    }
    while((directEntry = readdir(dirPointer)) != NULL)
    {
        //do stuff here to ensure it is a directory
        //ensure that the dir we are looking at is not current or parent dir
        //copy path of found directory to dir_proc_args->path
        err = pthread_create(&threads[count++], NULL, &CSVFinder, (void*)dir_proc_args);
        if(err != 0)
            printf("can't create thread");
    }
    int i;
    for(i=0; i < count; ++i)
    {
        pthread_join(threads[i], NULL);
    }
    pthread_mutex_destroy(&lock);
    return 0;
}
in CSVFinder function
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "sorter.h"
#include <dirent.h>
void *CSVFinder(void *args)
{
pthread_mutex_lock(&lock); //I have locked the entire function to see I can get it to work. this makes no sense to actually do
DIR *dirPointer;
struct dirent *directEntry;
dir_proc *funcArgs = (struct _dir_proc*)args;
char path[255];
strncpy(path, funcArgs->path, sizeof(path));
if(!(dirPointer = opendir(funcArgs->path)))
{
fprintf(stderr, "opening of directory has failed");
exit(1);
}
while((directEntry = readdir(dirPointer)) != NULL)
{
if(directEntry->d_type == DT_DIR) //if we are looking at a directory
{
//make sure the dir we are looking at is not current or parent dir
snprintf(funcArgs->path, (sizeof(path) + sizeof(directEntry->d_name)), "%s/%s", path, directEntry->d_name);
//I would like to be able to do a recursive call here
//to search for more directories but one thing at a time
}
}
closedir(dirPointer);
pthread_mutex_unlock(&lock);
return(NULL);
}
I hope I have not left out any relevant code. I tried to keep the code to a minimum while not leaving anything necessary out.
It's not clear to me why you want to create a thread to simply traverse a directory structure. However, I will point out a few issues I see.
One minor issue is that in the CSVFinder function you call readder, not readdir.
But one glaring issue to me is that you do not initialize dirPointer in main or in the CSVFinder() function. I would expect to see a call like
dirPointer = opendir("/");
in the main() function before the while loop.
Then I would expect to see CSVFinder() initialize its dirPointer with a call to opendir(path) where path is a name to a subdirectory found in the main loop.
For a good reference to how to traverse a directory structure go here...
https://www.lemoda.net/c/recursive-directory/
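Separately from the points above, the overwritten path described in the question is consistent with main passing the same dir_proc_args pointer to every pthread_create() call, so all threads read and write one shared buffer. A minimal sketch of the usual alternative, assuming each thread gets its own argument slot; the names worker_arg, worker and run_workers are hypothetical, and the worker body is only a stand-in for the real directory search.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread argument: each thread owns exactly one of these. */
struct worker_arg {
    char path[256];
    size_t seen_len;    /* filled in by the worker, read after the join */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    /* Stand-in for the directory search: touches only its own argument. */
    arg->seen_len = strlen(arg->path);
    return NULL;
}

/* Start one thread per path (paths are assumed shorter than 256 bytes).
 * Returns how many threads saw their own path, or -1 on failure. */
int run_workers(const char *const paths[], size_t n)
{
    pthread_t *tids = calloc(n ? n : 1, sizeof *tids);
    struct worker_arg *args = calloc(n ? n : 1, sizeof *args);
    size_t started = 0;
    int ok = 0;

    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        /* Copy into slot i; later iterations never write to it again. */
        strncpy(args[i].path, paths[i], sizeof args[i].path - 1);
        if (pthread_create(&tids[started], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (size_t i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (args[i].seen_len == strlen(paths[i]))
            ok++;
    }
    free(tids);
    free(args);
    return started == n ? ok : -1;
}
```

No mutex is needed in this sketch because no two threads share mutable data; a lock would only become necessary for something genuinely shared, such as a common result list. Compile with -pthread.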

How to list first level directories only in C?

In a terminal I can call ls -d */. Now I want a C program to do that for me, like this:
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>
int main( void )
{
int status;
char *args[] = { "/bin/ls", "-l", NULL };
if ( fork() == 0 )
execv( args[0], args );
else
wait( &status );
return 0;
}
This will ls -l everything. However, when I try:
char *args[] = { "/bin/ls", "-d", "*/", NULL };
I get a runtime error:
ls: */: No such file or directory
The lowest-level way to do this is with the same Linux system calls ls uses.
So look at the output of strace -efile,getdents ls:
execve("/bin/ls", ["ls"], [/* 72 vars */]) = 0
...
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 23 entries */, 32768) = 840
getdents(3, /* 0 entries */, 32768) = 0
...
getdents is a Linux-specific system call. The man page says that it's used under the hood by libc's readdir(3) POSIX API function.
The lowest-level portable way (portable to POSIX systems), is to use the libc functions to open a directory and read the entries. POSIX doesn't specify the exact system call interface, unlike for non-directory files.
These functions:
DIR *opendir(const char *name);
struct dirent *readdir(DIR *dirp);
can be used like this:
// print all directories, and symlinks to directories, in the CWD.
// like sh -c 'ls -1UF -d */' (single-column output, no sorting, append a / to dir names)
// tested and works on Linux, with / without working d_type
#define _GNU_SOURCE // includes _BSD_SOURCE for DT_UNKNOWN etc.
#include <dirent.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
DIR *dirhandle = opendir("."); // POSIX doesn't require this to be a plain file descriptor. Linux uses open(".", O_DIRECTORY); to implement this
//^Todo: error check
struct dirent *de;
while(de = readdir(dirhandle)) { // NULL means end of directory
_Bool is_dir;
#ifdef _DIRENT_HAVE_D_TYPE
if (de->d_type != DT_UNKNOWN && de->d_type != DT_LNK) {
// don't have to stat if we have d_type info, unless it's a symlink (since we stat, not lstat)
is_dir = (de->d_type == DT_DIR);
} else
#endif
{ // the only method if d_type isn't available,
// otherwise this is a fallback for FSes where the kernel leaves it DT_UNKNOWN.
struct stat stbuf;
// stat follows symlinks, lstat doesn't.
stat(de->d_name, &stbuf); // TODO: error check
is_dir = S_ISDIR(stbuf.st_mode);
}
if (is_dir) {
printf("%s/\n", de->d_name);
}
}
}
There's also a fully compilable example of reading directory entries and printing file info in the Linux stat(3posix) man page. (not the Linux stat(2) man page; it has a different example).
The man page for readdir(3) says the Linux declaration of struct dirent is:
struct dirent {
ino_t d_ino; /* inode number */
off_t d_off; /* not an offset; see NOTES */
unsigned short d_reclen; /* length of this record */
unsigned char d_type; /* type of file; not supported
by all filesystem types */
char d_name[256]; /* filename */
};
d_type is either DT_UNKNOWN, in which case you need to stat the entry to learn whether it is itself a directory, or it is DT_DIR or some other specific type, in which case you can be sure it is or isn't a directory without having to stat it.
Some filesystems, like EXT4 I think, and very recent XFS (with the new metadata version), keep type info in the directory, so it can be returned without having to load the inode from disk. This is a huge speedup for find -name: it doesn't have to stat anything to recurse through subdirs. But for filesystems that don't do this, d_type will always be DT_UNKNOWN, because filling it in would require reading all the inodes (which might not even be loaded from disk).
Sometimes you're just matching on filenames, and don't need type info, so it would be bad if the kernel spent a lot of extra CPU time (or especially I/O time) filling in d_type when it's not cheap. d_type is just a performance shortcut; you always need a fallback (except maybe when writing for an embedded system where you know what FS you're using and that it always fills in d_type, and that you have some way to detect the breakage when someone in the future tries to use this code on another FS type.)
Unfortunately, all solutions based on shell expansion are limited by the maximum command line length. Which varies (run true | xargs --show-limits to find out); on my system, it is about two megabytes. Yes, many will argue that it suffices -- as did Bill Gates on 640 kilobytes, once.
(When running certain parallel simulations on non-shared filesystems, I do occasionally have tens of thousands of files in the same directory, during the collection phase. Yes, I could do that differently, but that happens to be the easiest and most robust way to collect the data. Very few POSIX utilities are actually silly enough to assume "X is sufficient for everybody".)
Fortunately, there are several solutions. One is to use find instead:
system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d");
You can also format the output as you wish, not depending on locale:
system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\n'");
If you want to sort the output, use \0 as the separator (since filenames are allowed to contain newlines), and -z for sort to use \0 as the separator, too. tr will convert them to newlines for you. (Note the doubled backslashes: inside a C string literal, a bare \0 would end the command string early.)
system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\0' | sort -z | tr -s '\\0' '\\n'");
If you want the names in an array, use the glob() function instead.
Finally, as I like to harp every now and then, one can use the POSIX nftw() function to implement this internally:
#define _GNU_SOURCE
#include <stdio.h>
#include <ftw.h>
#define NUM_FDS 17
int myfunc(const char *path,
const struct stat *fileinfo,
int typeflag,
struct FTW *ftwinfo)
{
const char *file = path + ftwinfo->base;
const int depth = ftwinfo->level;
/* We are only interested in first-level directories.
Note that depth==0 is the directory itself specified as a parameter.
*/
if (depth != 1 || (typeflag != FTW_D && typeflag != FTW_DNR))
return 0;
/* Don't list names starting with a . */
if (file[0] != '.')
printf("%s/\n", path);
/* Do not recurse. */
return FTW_SKIP_SUBTREE;
}
and the nftw() call to use the above is obviously something like
if (nftw(".", myfunc, NUM_FDS, FTW_ACTIONRETVAL)) {
/* An error occurred. */
}
The only "issue" in using nftw() is to choose a good number of file descriptors the function may use (NUM_FDS). POSIX says a process must always be able to have at least 20 open file descriptors. If we subtract the standard ones (input, output, and error), that leaves 17. The above is unlikely to use more than 3, though.
You can find the actual limit using sysconf(_SC_OPEN_MAX), and subtracting the number of descriptors your process may use at the same time. In current Linux systems, it is typically limited to 1024 per process.
The good thing is, as long as that number is at least 4 or 5 or so, it only affects the performance: it just determines how deep nftw() can go in the directory tree structure, before it has to use workarounds.
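As a small illustration of that calculation, here is a sketch that derives the nftw() descriptor budget from sysconf(_SC_OPEN_MAX). The helper name nftw_fd_budget, the floor of 4 and the cap of 1024 are assumptions chosen for the example, not values required by POSIX.

```c
#include <unistd.h>

/* Hypothetical helper: how many descriptors to let nftw() use, keeping
 * `reserved` descriptors back for the rest of the program. */
int nftw_fd_budget(int reserved)
{
    long max_open = sysconf(_SC_OPEN_MAX);
    if (max_open < 0)
        return 17;          /* limit unknown: fall back to 20 - 3 */

    long budget = max_open - reserved;
    if (budget < 4)
        budget = 4;         /* assumed floor; fewer only costs performance */
    if (budget > 1024)
        budget = 1024;      /* assumed cap; deeper trees are rare */
    return (int)budget;
}
```

The result would be passed as the third argument of nftw() in place of NUM_FDS, for example nftw(".", myfunc, nftw_fd_budget(3), FTW_ACTIONRETVAL).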
If you want to create a test directory with lots of subdirectories, use something like the following Bash:
mkdir lots-of-subdirs
cd lots-of-subdirs
for ((i=0; i<100000; i++)); do mkdir directory-$i-has-a-long-name-since-command-line-length-is-limited ; done
On my system, running
ls -d */
in that directory yields a bash: /bin/ls: Argument list too long error, while the find command and the nftw()-based program both run just fine.
You also cannot remove the directories using rmdir directory-*/ for the same reason. Use
find . -name 'directory-*' -type d -print0 | xargs -r0 rmdir
instead. Or just remove the entire directory and subdirectories,
cd ..
rm -rf lots-of-subdirs
Just call system. Globs on Unixes are expanded by the shell. system will give you a shell.
You can avoid the whole fork-exec thing by doing the glob(3) yourself:
int ec;
glob_t gbuf;
if(0==(ec=glob("*/", 0, NULL, &gbuf))){
char **p = gbuf.gl_pathv;
if(p){
while(*p)
printf("%s\n", *p++);
}
}else{
/*handle glob error*/
}
You could pass the results to a spawned ls, but there's hardly a point in doing that.
(If you do want to do fork and exec, you should start with a template that does proper error checking -- each of those calls may fail.)
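For reference, a minimal sketch of such a template, with each call checked. The function name run_and_wait and the exit code 127 for a failed exec (borrowed from shell convention) are choices made for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program argv[0] with arguments argv. Returns its exit status,
 * or -1 if it could not be started or did not exit normally. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv);
        perror("execv");    /* only reached if execv() failed */
        _exit(127);         /* _exit: do not flush the parent's stdio buffers */
    }

    int status;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);
    } while (w == -1 && errno == EINTR);   /* retry if interrupted by a signal */
    if (w == -1) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

To get the glob expanded, the shell has to be the program that is executed, for example with char *args[] = { "/bin/sh", "-c", "ls -d */", NULL }; followed by run_and_wait(args).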
If you are looking for a simple way to get a list of folders into your program, I'd rather suggest the spawnless way, not calling an external program, and use the standard POSIX opendir/readdir functions.
It's almost as short as your program, but has several additional advantages:
you get to pick folders and files at will by checking the d_type
you can elect to early discard system entries and (semi)hidden entries by testing the first character of the name for a .
you can immediately print out the result, or store it in memory for later use
you can do additional operations on the list in memory, such as sorting and removing other entries that don't need to be included.
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
int main( void )
{
    DIR *dirp;
    struct dirent *dp;
    dirp = opendir(".");
    if (dirp == NULL)
    {
        perror("opendir");
        return 1;
    }
    while ((dp = readdir(dirp)) != NULL)
    {
        /* d_type is an enumeration, not a bit mask: compare with == */
        if (dp->d_type == DT_DIR)
        {
            /* exclude common system entries and (semi)hidden names */
            if (dp->d_name[0] != '.')
                printf ("%s\n", dp->d_name);
        }
    }
    closedir(dirp);
    return 0;
}
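The "store it in memory" and "sorting" points from the list above can be sketched with scandir(), which is a different technique from the plain readdir() loop: it reads the whole directory, applies a filter, and returns a sorted array. The names visible_dir and list_dirs_sorted are hypothetical, and the filter assumes the filesystem fills in d_type.

```c
#define _GNU_SOURCE   /* for DT_DIR and scandir() under strict -std= modes */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Filter for scandir(): keep visible directories only.
 * Assumes d_type is filled in; DT_UNKNOWN entries are skipped. */
static int visible_dir(const struct dirent *de)
{
    return de->d_name[0] != '.' && de->d_type == DT_DIR;
}

/* Print the first-level directories of `path` in sorted order.
 * Returns the number printed, or -1 on error. */
int list_dirs_sorted(const char *path)
{
    struct dirent **list;
    int n = scandir(path, &list, visible_dir, alphasort);
    if (n < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        printf("%s/\n", list[i]->d_name);
        free(list[i]);      /* each entry is allocated separately */
    }
    free(list);
    return n;
}
```

On filesystems that report DT_UNKNOWN, the filter would need a stat() fallback before it could be relied on.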
Another less low-level approach, with system():
#include <stdlib.h>
int main(void)
{
system("/bin/ls -d */");
return 0;
}
Notice with system(), you don't need to fork(). However, I recall that we should avoid using system() when possible!
As Nominal Animal said, this will fail when the number of subdirectories is too big! See his answer for more...

How to determine files and directories in parent/other directories

I found the answer to another question here to be very helpful.
There seems to be a limitation of the sys/stat.h library as when I tried to look in other directories everything was seen as a directory.
I was wondering if anyone knew of another system function or why it sees anything outside the current working directory as only a directory.
I appreciate any help anyone has to offer as this is perplexing me and various searches have turned up no help.
The code I made to test this is:
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>
int main(void) {
int status;
struct stat st_buf;
struct dirent *dirInfo;
DIR *selDir;
selDir = opendir("../");
// ^ or wherever you want to look
while ((dirInfo = readdir(selDir))) {
status = stat (dirInfo->d_name, &st_buf);
if (S_ISREG (st_buf.st_mode)) {
printf ("%s is a regular file.\n", dirInfo->d_name);
}
if (S_ISDIR (st_buf.st_mode)) {
printf ("%s is a directory.\n", dirInfo->d_name);
}
}
return 0;
}
You need to check the status of the stat call; it is failing.
The trouble is that you're looking for a file the_file in the current directory when it is actually only found in ../the_file. The readdir() function gives you the name relative to the other directory, but stat() resolves it relative to the current directory.
To make it work, you'd have to do the equivalent of:
char fullname[1024];
snprintf(fullname, sizeof(fullname), "%s/%s", "..", dirInfo->d_name);
if (stat(fullname, &st_buf) == 0)
...report on success...
else
...report on failure...
If you printed out status (the return value of stat), you would notice there's an error (file not found).
This is because stat takes the path to the file, but you're just providing the file name.
You then call S_ISREG on garbage values.
So, suppose you have a file ../test.txt.
You call stat on test.txt, which is looked up as ./test.txt and doesn't exist there, but you still print out the results from S_ISREG.

listing the files in a directory and delete them in C/C++

In a "C" code I would like to list all the files in a directory and delete the oldest one. How do I do that?
Can I use popen for that, or are there other solutions?
Thanks,
From the tag, I assume that you want to do this in a POSIX compliant system. In this case a code snippet for listing files in a folder would look like this:
#include <dirent.h>
#include <sys/types.h>
#include <stdio.h>
DIR* dp;
struct dirent* ep;
char* path = "/home/mydir";
dp = opendir(path);
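if (dp != NULL)
{
    printf("Dir content:\n");
    while ((ep = readdir(dp)) != NULL)
    {
        printf("%s\n", ep->d_name);
    }
    closedir(dp); /* only close a directory that was actually opened */
}
To check a file's modification time, use stat (man 2 stat); note that st_ctime is the status-change time, not the creation time. To remove a file, just use the function remove(const char* path).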
On Linux (and indeed, any POSIX system), you read a directory by calling opendir() / readdir() / closedir(). You can then call stat() on each directory entry to determine if it's a file, and what its access / modification / status-change times are.
If your definition of "oldest" depends on the creation time of the file, then you're on shaky ground - traditionally UNIX didn't record the creation time. On Linux, some recent filesystems do provide it through the extended attribute file.crtime (which you access using getxattr() from sys/xattr.h), but you'll have to handle the common case where that attribute doesn't exist.
You can scan the directory using readdir and opendir
or, if you want to traverse (recursively) a file hierarchy fts or nftw. Don't forget to ignore the entries for the current directory "." and the parent ".." one. You probably want to use the stat syscall too.

Trying to read in the files of a directory and write it to a list

I'm trying to create 5000 junk files, write them to a file and delete them. But this code only is writing a portion of the files to the file. ls -l | grep ^- | wc -l says I have 1598 files remaining in the directory that is supposed to be emptied with unlink();. If I remove close(fd) I get a seg fault if I do any more than 1000 files. Any suggestions?
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
main (int argv, char *args[]){
if(argv<3){
printf("Please run with proper command line arguements.\n");
return;
}
int numFiles = atoi(args[1]);
char *fileName = args[2];
char *fileList[numFiles];
int x, ret,fd;
char buff[50];
for(x=0;x<numFiles;x++){
ret = sprintf(buff,"./stuff/%s-%d.junk",fileName, x);
fd = creat(buff);
close(fd);
}
DIR *odir = opendir("./stuff");
struct dirent *rdir = NULL;
FILE *fp;
fp = fopen("./files.list", "w");
x=0;
while(rdir = readdir(odir)){
char* name = rdir->d_name;
ret = sprintf(buff,"./stuff/%s-%d.junk",fileName, x);
if(strcmp(name,"..")!=0){
if(strcmp(name,".")!=0){
fprintf(fp,"%s %d\n",name,x);
x++;
}
}
unlink(buff);
}
close(fp);
closedir(odir);
}
Thanks!
Note: Use of creat(), opendir(), readdir() and unlink() was required for the assignment. And as for error checking, you're right of course, but I'm under time constraints and the TA really, really doesn't care... But thank you all!
Here you're using fopen:
FILE *fp;
fp = fopen("./files.list", "w");
But then you're using close instead of fclose to close it:
close(fp);
I'm not at all sure this is what's causing the problem you're seeing, but it's definitely wrong anyway. You probably just want unlink(rdir->d_name) instead of unlink(buff). You embedded the number into the file name when you created it -- you don't need to do it again when you're reading in the name of the file you created.
You're removing things from the directory while calling readdir; I think that's supposed to work OK, but you might want to consider avoiding it.
More to the point: as you iterate over the directory with readdir you're potentially removing different files from the ones readdir is listing. (Because what you pass to unlink is buff which you've filled in from the steadily-incrementing x rather than from anything returned by readdir.) So, here's a toy example to show why that's problematic. Suppose the directory contains files 1,2,3,4 and readdir lists them in the order 4,3,2,1.
readdir tells you about file 4. You delete file 1.
readdir tells you about file 3. You delete file 2.
readdir would have told you about file 2, but it's gone so it doesn't.
readdir would have told you about file 1, but it's gone so it doesn't.
You end up with files 3 and 4 still in the directory.
