How does ls sort filenames? - c

I'm trying to write a function that mimics the output of the ls command in Unix. I originally used scandir with alphasort. This did print the files in the directory, and it did sort them, but the sorted list does not match the order that ls gives.
For example, suppose I have a directory that contains file.c, FILE.c, and ls.c.
ls displays them in the order: file.c FILE.c ls.c
But when I sort it using alphasort/scandir, it sorts them as: FILE.c file.c ls.c
How does ls sort the files in the directory such that it gives such a differently ordered result?

To emulate default ls -1 behaviour, make your program locale-aware by calling
setlocale(LC_ALL, "");
near the beginning of your main(), and use
count = scandir(dir, &array, my_filter, alphasort);
where my_filter() is a function that returns 0 for names that begin with a dot ., and 1 for all others. alphasort() is a POSIX function that uses the locale collation order, same order as strcoll().
The basic implementation is something along the lines of
#define _POSIX_C_SOURCE 200809L
#define _ATFILE_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <locale.h>
#include <string.h>
#include <dirent.h>
#include <stdio.h>
#include <errno.h>
static void my_print(const char *name, const struct stat *info)
{
/* TODO: Better output; use info too, for 'ls -l' -style output? */
printf("%s\n", name);
}
static int my_filter(const struct dirent *ent)
{
/* Skip entries that begin with '.' */
if (ent->d_name[0] == '.')
return 0;
/* Include all others */
return 1;
}
static int my_ls(const char *dir)
{
struct dirent **list = NULL;
struct stat info;
DIR *dirhandle;
int size, i, fd;
size = scandir(dir, &list, my_filter, alphasort);
if (size == -1) {
const int cause = errno;
/* Is dir not a directory, but a single entry perhaps? */
if (cause == ENOTDIR && lstat(dir, &info) == 0) {
my_print(dir, &info);
return 0;
}
/* Print out the original error and fail. */
fprintf(stderr, "%s: %s.\n", dir, strerror(cause));
return -1;
}
/* We need the directory handle for fstatat(). */
dirhandle = opendir(dir);
if (!dirhandle) {
/* Print a warning, but continue. */
fprintf(stderr, "%s: %s\n", dir, strerror(errno));
fd = AT_FDCWD;
} else {
fd = dirfd(dirhandle);
}
for (i = 0; i < size; i++) {
struct dirent *ent = list[i];
/* Try to get information on ent. If fails, clear the structure. */
if (fstatat(fd, ent->d_name, &info, AT_SYMLINK_NOFOLLOW) == -1) {
/* Print a warning about it. */
fprintf(stderr, "%s: %s.\n", ent->d_name, strerror(errno));
memset(&info, 0, sizeof info);
}
/* Describe 'ent'. */
my_print(ent->d_name, &info);
}
/* Release the directory handle. */
if (dirhandle)
closedir(dirhandle);
/* Discard list. */
for (i = 0; i < size; i++)
free(list[i]);
free(list);
return 0;
}
int main(int argc, char *argv[])
{
int arg;
setlocale(LC_ALL, "");
if (argc > 1) {
for (arg = 1; arg < argc; arg++) {
if (my_ls(argv[arg])) {
return EXIT_FAILURE;
}
}
} else {
if (my_ls(".")) {
return EXIT_FAILURE;
}
}
return EXIT_SUCCESS;
}
Note that I deliberately made this more complex than strictly needed for your purposes, because I did not want you to just copy and paste the code. It will be easier for you to compile, run, and investigate this program, then port the needed changes -- possibly just the one setlocale(LC_ALL, ""); line! -- to your own program, than to try to explain to your teacher/lecturer/TA why the code looks like it was copied verbatim from somewhere else.
The above code works even for files specified on the command line (the cause == ENOTDIR part). It also uses a single function, my_print(const char *name, const struct stat *info) to print each directory entry; and to do that, it does call stat for each entry.
Instead of constructing a path to the directory entry and calling lstat(), my_ls() opens a directory handle, and uses fstatat(descriptor, name, struct stat *, AT_SYMLINK_NOFOLLOW) to gather the information in basically the same manner as lstat() would, but name being a relative path starting at the directory specified by descriptor (dirfd(handle), if handle is an open DIR *).
It is true that calling one of the stat functions for each directory entry is "slow" (especially if you only produce /bin/ls -1 style output, which does not need that information). However, the output of ls is intended for human consumption, and is very often piped through more or less to let the human view it at leisure. This is why I personally do not think the "extra" stat() call (even when not really needed) is a problem here. Most human users I know of tend to use ls -l or (my favourite) ls -laF --color=auto anyway. (auto meaning ANSI colors are used only if standard output is a terminal; i.e. when isatty(fileno(stdout)) == 1.)
In other words, now that you have the ls -1 order, I would suggest you modify the output to be similar to ls -l (dash ell, not dash one). You only need to modify my_print() for that.

In alphanumeric (dictionary) order.
That changes with language, of course. Try:
$ LANG=C ls -1
FILE.c
file.c
ls.c
And:
$ LANG=en_US.utf8 ls -1
file.c
FILE.c
ls.c
That is related to the "collating order". Not a simple issue by any measure.
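To see the collating order at work from C rather than from the shell, a comparator built on strcoll() can be handed to qsort(). This is only a sketch: sort_names() and compare_collated() are names made up for the illustration. The result depends on whatever locale the caller has selected with setlocale().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: order two strings by the current locale's
 * collation rules (LC_COLLATE). In the "C" locale this is byte order. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *left = a;
    const char *const *right = b;

    return strcoll(*left, *right);
}

/* Sort an array of names in place using the locale collation order. */
void sort_names(const char **names, size_t count)
{
    assert(names != NULL || count == 0);
    qsort(names, count, sizeof names[0], compare_collated);
}
```

With setlocale(LC_ALL, "C") the three names from the question sort as FILE.c, file.c, ls.c. With setlocale(LC_ALL, "") under an en_US.utf8 environment they should come out in the order ls prints there, provided that locale is installed.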

Related

Access inode table to list all filenames

I would like to know the most efficient way to list the filenames on a POSIX system. I could do any of:
$ ls -R
Or:
$ find /
Or:
$ du /
Or 100 other variations (links abound on StackOverflow/ServerFault about various ways to do this). However, this is way too slow on the filesystem I am on, cifs. For example, I have currently been running ls -R for about two days (it takes about 50 hours to complete; there are tons of files and directories on the system, several petabytes' worth).
So I am wondering if this can be done at a lower level, hopefully in C, to list out the filenames from the inode database (example here). I don't need a recursive lookup of the entire path, only the top-level name | filename, and I would build out everything else manually. Instead of taking ~50 hours to do an ls with billions of recursive lookups (yes, it does get cached after successive runs, but most of it is not cached on the first run), can the inode database itself be dumped?
As an example, perhaps something like:
#filename,inode
myfile.mov,1234
myotherfile.csv,92033
But the main point here, and the reason I asked this question, is speed, not finding a command that does the above (such as $ ls -iR).
Here is a way to directly use getdents recursively. I will update timings of this shortly to compare it to ls and the other standard unix utils:
#define _GNU_SOURCE
#include <dirent.h>
#include <string.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
struct linux_dirent {
unsigned long d_ino;
off_t d_off;
unsigned short d_reclen;
char d_name[];
};
void print_files(char* dir, FILE* out)
{
// open the directory
int fd = open(dir, O_RDONLY | O_DIRECTORY);
if (fd == -1) handle_error("open");
// grab a buffer to read the directory entries into
#define BUF_SIZE (1024*1024*1)
char* buffer = malloc(sizeof *buffer * BUF_SIZE);
if (buffer == NULL) handle_error("malloc");
// one getdents call may not return every entry, so repeat until it returns 0
for (;;) {
long num_read = syscall(SYS_getdents, fd, buffer, BUF_SIZE);
if (num_read == -1) handle_error("getdents");
if (num_read == 0) break;
for (long buffer_position = 0; buffer_position < num_read;) {
struct linux_dirent *d = (struct linux_dirent *) (buffer + buffer_position);
char d_type = *(buffer + buffer_position + d->d_reclen - 1);
// advance buffer position now, so that "continue" cannot skip it
buffer_position += d->d_reclen;
// skip . and .. (and, as written, every other name starting with a dot)
if (d->d_name[0] == '.')
continue;
// path = dir + '/' + name; skip entries whose path would not fit
char path[4096];
int len = snprintf(path, sizeof path, "%s/%s", dir, d->d_name);
if (len < 0 || (size_t) len >= sizeof path)
continue;
// recursive call, as necessary
if (d_type == DT_DIR)
print_files(path, out);
else if (d_type == DT_REG)
fprintf(out, "%s\n", path);
}
}
close(fd);
free(buffer);
}
int main(int argc, char *argv[])
{
char *dir = argc > 1 ? argv[1] : ".";
FILE *out = fopen("c-log.txt", "w");
if (out == NULL) handle_error("fopen");
fprintf(out, "-------------[ START ]---------------------\n");
print_files(dir, out);
fclose(out);
return 0;
}

matching the ls command output

I'm trying to code the ls command. I have the following function that prints each file name:
int ft_list(const char *filename)
{
DIR *dirp;
struct dirent *dir;
if (!(dirp = opendir(filename)))
return (-1);
while ((dir = readdir(dirp)))
{
if (dir->d_name[0] != '.')
ft_putendl(dir->d_name);
}
closedir(dirp);
return (0);
}
The ls command prints the files organized into columns to fit the screen width. I have read about it and I think it uses the ioctl standard library function, but I can't find any details. How exactly can I do this?
In order to arrange the files in columns, you need to figure out the current width of the terminal window. On many unix-like systems (including Linux and OS X), you can indeed use ioctl to get that information, using the TIOCGWINSZ selector.
This is precisely what ls does (on systems which support the ioctl request), once it has determined that standard output is a terminal (unless single-column format is forced with the -1 flag). If it cannot figure out the terminal width, it uses 80.
Here's a quick example of how to get the information. (On Linux systems, you can probably find the details by typing man tty_ioctl.)
For simplicity, the following code assumes that stdout is file descriptor 1. In retrospect, STDOUT_FILENO would have been better. If you wanted to check an arbitrary open file, you would need to use fileno to get the fd number for the FILE*.
/* This must come before any include, in order to see the
* declarations of Posix functions which are not in standard C
*/
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
/* If stdout is a terminal and it is possible to find out how many
* columns its window has, return that number. Otherwise, return -1
*/
int window_get_columns(void) {
struct winsize sizes;
int cols = -1;
if (isatty(1)) {
/* Only try this if stdout is a terminal */
int status = ioctl(1, TIOCGWINSZ, &sizes);
if (status == 0) {
cols = sizes.ws_col;
}
}
return cols;
}
/* Example usage */
/* Print a line consisting of 'len' copies of the character 'ch' */
void print_row(int len, int ch) {
for (int i = 0; i < len; ++i) putchar(ch);
putchar('\n');
}
int main(int argc, char* argv[]) {
/* Print the first argument centred in the terminal window,
* if standard output is a terminal
*/
if (argc <= 1) return 1; /* No argument, nothing to do */
int width = window_get_columns();
/* If we can't figure out the width of the screen, just use the
* width of the string
*/
int arglen = strlen(argv[1]);
if (width < 0) width = arglen;
int indent = (width - arglen) / 2;
print_row(width - 1, '-');
printf("%*s\n", indent + arglen, argv[1]);
print_row(width - 1, '-');
return 0;
}
Since writing the above sample, I tracked down the source of the GNU implementation of ls; its (somewhat more careful) invocation of ioctl can be seen here.
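Once the width is known, the remaining step is arithmetic. The following is a deliberately simplified sketch, not what GNU ls actually does (it sizes each column separately): every cell is as wide as the longest name plus two spaces, and names run down each column. layout_rows() and print_columns() are names invented for this example, and strlen() is used as the display width, which is only right for single-byte, single-column characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Given the number of names, the longest name and the terminal width,
 * store the number of columns in *columns and return the number of rows. */
size_t layout_rows(size_t count, size_t longest, size_t width, size_t *columns)
{
    size_t cell = longest + 2;      /* name plus two spaces of padding */
    size_t cols = width / cell;

    assert(columns != NULL);
    if (cols == 0)
        cols = 1;                   /* a name wider than the terminal */
    *columns = cols;
    return (count + cols - 1) / cols;
}

/* Print names in equal-width columns, filling each column top to bottom. */
void print_columns(const char *const *names, size_t count, size_t width)
{
    size_t longest = 0, columns, rows, r, c, i;

    for (i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        if (len > longest)
            longest = len;
    }
    rows = layout_rows(count, longest, width, &columns);
    for (r = 0; r < rows; r++) {
        for (c = 0; c < columns; c++) {
            i = c * rows + r;       /* column-major index */
            if (i < count)
                printf("%-*s", (int) (longest + 2), names[i]);
        }
        putchar('\n');
    }
}
```

In ft_list() this would mean collecting the names into an array first, then calling print_columns() with the value from window_get_columns(), or 80 when that returns -1.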

How to get device name on which a file is located from its path in c?

Let's say I have a file in Linux with this path:
/path/to/file/test.mp3
I want to know the path to its device. For example I want to get something like:
/dev/sdb1
How do I do this with the C programming language?
I know the terminal command to do it, but I need C functions that will do the job.
EDIT:
I have read this question before asking mine. It doesn't concretely mention code in C; it's more related to bash than to the C language.
Thanks.
You need to use stat on the file path, and get the device ID st_dev and match that to a device in /proc/partitions
Read this for how to interpret st_dev: https://web.archive.org/web/20171013194110/http://www.makelinux.net:80/ldd3/chp-3-sect-2
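A rough sketch of the matching step, assuming the usual /proc/partitions layout of "major minor #blocks name" per line and the glibc major()/minor() macros from <sys/sysmacros.h>. The function name find_partition_name() is made up for the example. It only finds filesystems that sit directly on a block device listed in that table.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Scan a /proc/partitions-style table for the given device ID.
 * On success, store the partition name (e.g. "sda1") in name and return 0.
 * Return -1 if no line matches. */
int find_partition_name(FILE *table, dev_t device, char name[64])
{
    char line[256];
    unsigned int part_major, part_minor;

    assert(table != NULL && name != NULL);
    while (fgets(line, sizeof line, table) != NULL) {
        /* The header line and the blank line fail to parse and are skipped. */
        if (sscanf(line, "%u %u %*s %63s", &part_major, &part_minor, name) != 3)
            continue;
        if (part_major == major(device) && part_minor == minor(device))
            return 0;
    }
    name[0] = '\0';
    return -1;
}
```

A caller would stat() the file, open /proc/partitions with fopen(), pass st_dev as the device argument, and prepend "/dev/" to the returned name.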
I just needed that inside a program I am writing...
So instead of running "df" and parsing the output, I wrote it from scratch.
Feel free to contribute!
To answer the question:
You first find the device ID (st_dev) using stat(), then iterate over and parse /proc/self/mountinfo to find the matching major:minor pair and get the device name.
/*
Get physical device from file or directory name.
By Zibri <zibri AT zibri DOT org>
https://github.com/Zibri/get_device
*/
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/sysmacros.h>
#include <fcntl.h>
#include <stdlib.h>
#include <libgen.h>
int get_device(char *name)
{
struct stat fs;
if (stat(name, &fs) < 0) {
fprintf(stderr, "%s: No such file or directory\n", name);
return -1;
}
FILE *f;
char sline[256];
char minmaj[128];
/* mountinfo lists the device as " major:minor "; major() and minor() decode st_dev */
snprintf(minmaj, sizeof minmaj, " %u:%u ", major(fs.st_dev), minor(fs.st_dev));
f = fopen("/proc/self/mountinfo", "r");
if (f == NULL) {
fprintf(stderr, "Failed to open /proc/self/mountinfo\n");
exit(-1);
}
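int found = 0;
while (fgets(sline, sizeof sline, f)) {
/* Fields after the " - " separator are: filesystem type, mount source, options */
char *sep = strstr(sline, " - ");
if (sep == NULL)
continue;
*sep = '\0';
if (strstr(sline, minmaj)) {
char *token = strtok(sep + 3, " ");
token = strtok(NULL, " \n");
if (token) {
printf("%s\n", token);
found = 1;
}
break;
}
}
fclose(f);
return found ? 0 : -1;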
}
int main(int argc, char **argv)
{
if (argc != 2) {
fprintf(stderr, "Usage:\n%s FILE OR DIRECTORY...\n", basename(argv[0]));
return -1;
}
get_device(argv[1]);
return 0;
}
The output is just the device name.
Example:
$ gcc -O3 getdevice.c -o gd -Wall
$ ./gd .
/dev/sda4
$ ./gd /mnt/C
/dev/sda3
$ ./gd /mnt/D
/dev/sdb1
$
Use this command to print the partition path:
df -P <pathname> | awk 'END{print $1}'

How not to open a file twice in linux?

I have a linked list in which each entry holds an fd and the string I used to open that file. I want to open and add files to this list only if the file is not already open, because I open and parse these files and do not want to do it twice. My idea was to compare the filename with every name in the list, but my program does this many times, and one file in Linux can have multiple names (soft/hard links). I think it should not be so complicated, because it's easy for the OS to check whether I have already used an inode or not, right?
I already tried to open the same file with and without flock, but I always get a new fd.
When you successfully open a file, use fstat on the file. Check to see if the st_ino and st_dev of the struct stat filled in by fstat have already been recorded in your linked list. If so, close the file descriptor and move on to the next file. Otherwise add the file descriptor, the file name, and the st_ino and st_dev values to the list.
You can instead use stat to check before opening the file, but using fstat after will be slightly faster if the usual case is that the file hasn't already been opened.
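The bookkeeping described above can be sketched without any library, under the assumption that the number of open files is small enough for a linear scan. struct seen_files and seen_files_check() are names invented for this illustration; a growable array stands in for the linked list from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* One (device, inode) pair identifies a file regardless of its name. */
struct file_id {
    dev_t dev;
    ino_t ino;
};

struct seen_files {
    struct file_id *ids;
    size_t count;
    size_t capacity;
};

/* Return 1 if (dev, ino) was recorded earlier, 0 if it has just been
 * recorded for the first time, or -1 if memory could not be allocated. */
int seen_files_check(struct seen_files *seen, dev_t dev, ino_t ino)
{
    size_t i;

    assert(seen != NULL);
    for (i = 0; i < seen->count; i++)
        if (seen->ids[i].dev == dev && seen->ids[i].ino == ino)
            return 1;

    if (seen->count == seen->capacity) {
        size_t capacity = seen->capacity ? 2 * seen->capacity : 16;
        struct file_id *ids = realloc(seen->ids, capacity * sizeof *ids);
        if (ids == NULL)
            return -1;
        seen->ids = ids;
        seen->capacity = capacity;
    }
    seen->ids[seen->count].dev = dev;
    seen->ids[seen->count].ino = ino;
    seen->count++;
    return 0;
}
```

After a successful open(), the caller would run fstat(fd, &st) and pass st.st_dev and st.st_ino to seen_files_check(). A return value of 1 means the file is a duplicate, so the caller should close(fd) and skip it.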
In situations like this, it's often useful to consider your data structures. Change to a data structure which does not allow duplicates, such as a hash table.
Maintain a set of which data you've seen before. I've used a hash table for this set. As per @RossRidge's answer, use the inode and device as the key. This allows duplicates to be discovered in O(1) time.
Here is an example implementation.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>
static int get_fd(GHashTable *fds, const char *filename, int mode) {
int fd;
struct stat stat;
int keysize = 34;
char key[keysize]; /* Two 64 bit numbers as hex, a separator, and the terminating NUL */
/* Resolve any symlinks */
char *real_filename = realpath(filename, NULL);
if( real_filename == NULL ) {
printf("%s could not be resolved.\n", filename);
return -1;
}
/* Open and stat */
fd = open( real_filename, mode );
if( fd < 0 ) {
printf("Could not open %s: %s.\n", real_filename, strerror(errno));
free(real_filename);
return -1;
}
if( fstat(fd, &stat) != 0 ) {
printf("Could not stat %s: %s.\n", real_filename, strerror(errno));
free(real_filename);
return -1;
}
/* realpath() allocated this buffer; it is no longer needed */
free(real_filename);
/* Make a key for tracking which data we've processed.
This uses both the inode and the device it's on.
It could be done more efficiently as a bit field.
*/
snprintf(key, keysize, "%lx|%lx", (unsigned long)stat.st_ino, (unsigned long)stat.st_dev);
/* See if we've already processed that */
if( g_hash_table_contains(fds, key) ) {
/* The duplicate descriptor is left open here; close it if it is not needed */
return 0;
}
else {
/* Note that we've processed it. Store a heap copy, because key
itself lives on this function's stack. */
g_hash_table_add(fds, g_strdup(key));
return fd;
}
}
int main(int argc, char** argv) {
int mode = O_RDONLY;
int fd;
GHashTable *fds = g_hash_table_new(&g_str_hash, &g_str_equal);
for(int i = 1; i < argc; i++) {
char *filename = argv[i];
fd = get_fd(fds, filename, mode);
if( fd == 0 ) {
printf("%s has already been processed.\n", filename);
}
else if( fd < 0 ) {
printf("%s could not be processed.\n", filename);
}
else {
printf("%s: %d\n", filename, fd);
}
}
}
And here's a sample result.
$ touch one two three
$ ln one one_link
$ ln -s two two_sym
$ ./test one* two* three*
one: 3
one_link has already been processed.
two: 5
two_sym has already been processed.
three: 7
As long as you don't close the successfully and intentionally opened files, you can use nonblocking flock to prevent another lock on the same file:
#include <unistd.h>
#include <sys/file.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <assert.h>
int openAndLock(const char* fn){
int fd = -1;
if(((fd = open(fn, O_RDONLY)) >= 0) && (flock(fd, LOCK_EX|LOCK_NB) == 0)){
fprintf(stderr, "Successfully opened and locked %s\n", fn);
return fd;
}else{
fprintf(stderr, "Failed to open or lock %s\n", fn);
if(fd >= 0) close(fd);
return -1;
}
}
int main(int argc, char** argv){
for(int i=1; i<argc; i++){
openAndLock(argv[i]);
}
return 0;
}
Example:
$ touch foo
$ ln foo bar
$ ./a.out foo foo
Successfully opened and locked foo
Failed to open or lock foo
$ ./a.out foo bar
Successfully opened and locked foo
Failed to open or lock bar

Infinite recursion while listing directories in linux

I am trying to write a program, part of which lists all directories (especially starting from /), but I have a problem with /proc/self, which is infinitely recursive (I get /proc/self/task/4300/fd/3/proc/self/task/4300/fd/3/proc/self/task/4300/fd/3/proc/... and so on). What is a nice way to deal with it?
EDIT: Program is written in C language and I use opendir(), readdir()
You can use the S_ISLNK macro to test the st_mode field returned by a call to lstat. If the file is a symbolic link, do not try to follow it.
[user@machine:~]:./list | grep link
/proc/mounts is a symbolic link
/proc/self is a symbolic link
Example code
#include <stdio.h> // For perror
#include <stdlib.h>
#include <sys/types.h> // For stat, opendir, readdir
#include <sys/stat.h> // For stat
#include <unistd.h> // For stat
#include <dirent.h> // For opendir, readdir
#include <limits.h> // For PATH_MAX
const char *prefix = "/proc";
int main(void)
{
DIR *dir;
struct dirent *entry;
int result;
struct stat status;
char path[PATH_MAX];
dir = opendir(prefix);
if (!dir)
{
perror("opendir");
exit(1);
}
entry = readdir(dir);
while (entry)
{
result = snprintf(path, sizeof(path), "%s", prefix);
snprintf(&path[result], sizeof(path) - result, "/%s", entry->d_name);
printf("%s", path);
result = lstat(path, &status);
if (-1 == result)
{
printf("\n");
perror("lstat");
exit(2);
}
if (S_ISLNK(status.st_mode))
{
printf("%s", " is a symbolic link");
}
printf("\n");
entry = readdir(dir);
}
closedir(dir);
return(0);
}
From path_resolution(7):
Length limit
There is a maximum length for pathnames. If the pathname (or some intermediate pathname obtained while resolving symbolic links) is too long, an ENAMETOOLONG error is returned ("File name too long").
I think you should employ similar behaviour: check for too long pathnames.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/param.h>
#include <string.h>
/* Short & sweet recursive directory scan, finds regular files only.
Good starting point, should work on Linux OS.
Pass the root path, and returns number of dirs and number of files
found.
*/
char *tree_scan( const char *path, int *ndirs, int *nfiles){
DIR *dir;
struct dirent *entry;
char spath[MAXPATHLEN] = "";
if( !(dir = opendir( path))){ perror("opendir"); exit(1);}
for( entry = readdir( dir); entry; entry = readdir( dir)){
snprintf( spath, sizeof spath, "%s/%s", path, entry->d_name);
if( entry->d_type == DT_REG){ (*nfiles)++; printf( "%s\n", spath);}
if( entry->d_type == DT_DIR &&
(strcmp( ".", entry->d_name)) &&
(strcmp( "..", entry->d_name))){
(*ndirs)++; tree_scan( spath, ndirs, nfiles);
}
}
closedir( dir);
return(0);
}
/* Call it like so */
int main(void){
int i = 0, l = 0;
tree_scan( "/path", &i, &l);
printf("Scanned %d directories, %d files.\n", i, l);
return 0;
}
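The pathname-length check suggested above could look roughly like this. join_path() is a hypothetical helper, and setting errno to ENAMETOOLONG simply mirrors what the kernel reports. A scanner would stop descending when it fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Build "dir/name" in out. Return 0 on success, or -1 with errno set to
 * ENAMETOOLONG if the result does not fit in size bytes. */
int join_path(char *out, size_t size, const char *dir, const char *name)
{
    int length;

    assert(out != NULL && dir != NULL && name != NULL);
    length = snprintf(out, size, "%s/%s", dir, name);
    if (length < 0 || (size_t) length >= size) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return 0;
}
```

With a buffer of PATH_MAX or MAXPATHLEN bytes, a loop such as /proc/self/task/.../fd/3/proc/self/... fails this check after a bounded number of levels instead of recursing forever.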
I don't have a *nix terminal handy, but you could always take a look at the source for ls.c and see how it's done.
The source, as part of GNU coreutils, can be found here.
I created a ls clone a few years ago in school, and I think I got around it by watching the pathname size as ulidtko mentioned.
