Parsing stdin for files provided by ls - c

TL;DR: Is the output of ls standardised so that there is a perfect way to parse it into an array of file names?
I have to write a program that processes some files, the program specification states this:
Your program should read a list of files from the standard entry
An example is given of how the program will be used:
ls /usr/include/std*.h | ./distribuer 3
Where distribuer is the name of my program.
From my tests, I see that ls separates the file names with tabs when called with this sort of wildcard argument. Is this behaviour standard? Or might ls sometimes use plain spaces or even newlines when called with similar wildcard arguments?
Finally, while this might be an edge case, I am also worried that, since Unix allows tabs and spaces in filenames, it could actually be impossible to reliably parse the output of ls. Is that correct?

Is the output of ls standardised so that there is a perfect way to parse it into an array of file names?
The output of ls is certainly standardised, by the Posix standard. In the section STDOUT, the standardised formats are described:
The default format shall be to list one entry per line to standard output; the exceptions are to terminals or when one of the -C, -m, or -x options is specified.
As well as a cautionary note about an important context in which the output is not standardised:
If the output is to a terminal, the format is implementation-defined.
(There is quite a lot of specification of how the format changes with different command-line parameters, which I'm not quoting because it is not immediately relevant here.)
So the standardised format, applicable if stdout is not directed to a terminal and if no command-line options are provided (or if the -1 option is provided, even if stdout is a terminal) is to print one entry per line.
Unfortunately, that does not provide a "perfect way" to parse the output, because it is legal for filenames to include newline characters, and a filename which includes a newline character will obviously span more than one line. If all you have is the ls output, there is no 100% reliable way to tell whether a newline (other than the last one) indicates the end of a filename or is a newline character in the filename.
For the purposes of your assignment, the simple strategy would be just to ignore that imperfection (or, better, document it and then ignore it), which is the same strategy that many Unix utilities use. Files whose names include newlines are extremely rare in the wild, and people who create files with newlines in their names probably deserve the problems they will cause themselves. However, you will find a lot of people here (including me, sometimes) suggesting that scripts should work correctly with all legal filenames. So the rest of this answer discusses some of the possible responses to this pedantry. Note that none of them are "perfect".
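Under that simplifying assumption, reading the list in C is just a matter of reading lines. A minimal sketch (the function name read_names is my own; distribuer would call it on stdin):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one file name per line from `in` into a dynamically grown array,
   deliberately ignoring the embedded-newline problem discussed above. */
static char **read_names(FILE *in, size_t *count)
{
    size_t cap = 16, n = 0;
    char **names = malloc(cap * sizeof *names);
    char *line = NULL;
    size_t len = 0;
    ssize_t got;

    while ((got = getline(&line, &len, in)) != -1) {
        if (got > 0 && line[got - 1] == '\n')
            line[got - 1] = '\0';            /* strip the trailing newline */
        if (n == cap)
            names = realloc(names, (cap *= 2) * sizeof *names);
        names[n++] = strdup(line);
    }
    free(line);
    *count = n;
    return names;
}
```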
One imperfect solution is to try to figure out whether a given newline is embedded or not. If you know the list was produced by ls without any sorting options, you might be able to guess correctly in most cases by using the fact that ls presents files sorted by the current locale's collation rules. So if a line is out of sequence (either less than the preceding line or greater than the following one) then it is appropriate to guess that it is a continuation of the filename. That won't always work, and I don't know any utility which tries it, but it might be worth mentioning.
If you were running ls yourself, you could take advantage of the -q option, which causes non-printing characters (including tabs and newlines) to be replaced with ? in the output. That forces the filename to be printed on a single line, but has the disadvantage that you no longer know what the filename was before the substitution, since there are a variety of characters which could be replaced with a question mark (including a question mark itself). You might be able to query the filesystem to find the real name of the file, but there are a lot of corner cases I'm not going to go into since the premise of this paragraph is not applicable to the actual problem.
The most common solution is to allow the user to tell your utility that filenames are separated with a NUL character rather than a newline. This is 100% reliable because filenames cannot contain NUL characters -- in fact, that's the only character they cannot contain. Unfortunately, ls does not provide an option to produce output in this format, but the user could use the find utility to generate the same listing as ls and then use the non-standard but widely-implemented -print0 option to write out the filenames with NUL terminators. (If only Posix standard options to find are available, you can still produce the output by using -exec with an appropriate command to output the name.)
Many utilities which accept lists of filenames on standard input have (non-standard) options to specify a delimiter character, or to specify that the delimiter is NUL instead of newline. See, for example, xargs -0, sort -z (Gnu or BSD) or read -d (bash). So this is probably a reasonable enhancement if you're interested in coding it.
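If you do implement NUL-delimited input (as produced by find -print0), getdelim(3) makes reading it straightforward; a sketch under that assumption, with a function name of my own invention:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read NUL-terminated file names. This handles every legal file name,
   including names containing newlines, since NUL cannot appear in a name. */
static char **read_names0(FILE *in, size_t *count)
{
    size_t cap = 16, n = 0;
    char **names = malloc(cap * sizeof *names);
    char *name = NULL;
    size_t len = 0;

    while (getdelim(&name, &len, '\0', in) != -1) {
        if (n == cap)
            names = realloc(names, (cap *= 2) * sizeof *names);
        /* the NUL delimiter kept by getdelim doubles as the terminator */
        names[n++] = strdup(name);
    }
    free(name);
    *count = n;
    return names;
}
```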
It's worth noting that most standard shell utilities do not provide an option to take a list of filenames through standard input. Most utilities prefer to receive filenames as command-line arguments. This works well because when the shell expands "globs" (like *) specified on a command-line, it does not rerun word-splitting on the output; each filename becomes a single argument. That means that
./distribute *
is almost perfect as a way of passing a list of filenames to a utility. But it is still not quite perfect, because there is a limit to the total size of the command-line arguments you can provide in a single command line. So if the directory has a really large number of files, the expansion of * might exceed that limit, causing the utility execution to fail. find also passes filenames through to -exec as single arguments without word-splitting, and the use of {} + as an -exec command terminator will split the filenames into sets which are small enough that they will not exceed the command-line limit. That's safer than ./distribute *, but it does mean that the utility may be called several times, once for each set. (And it's also a bit annoying getting the find predicates to give you exactly what you want.)
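That limit (the combined size of arguments and environment that execve(2) will accept) can be queried at runtime with sysconf(3); a tiny sketch:

```c
#include <unistd.h>

/* Query the kernel's limit on combined argument + environment size;
   glob expansions larger than this make execve(2) fail with E2BIG. */
static long query_arg_max(void)
{
    return sysconf(_SC_ARG_MAX);
}
```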


How to find the owner and group name from uid and gid using system calls listed in man 2 pages?

I have an assignment in which I have to simulate the ls -l Unix command in C. I have figured out everything except finding the owner and the group of a particular file. I have the uid and gid from the stat structure via the stat() system call, but I am not able to map them to the actual names of the owner and group respectively. I am supposed to use only those system calls which are listed in the man 2 pages. I have tried searching for an answer, but everywhere it says to use the getpwnam() call, which I can't, as it is not listed in the man 2 pages.
Yes, getpwnam(3) and getgrnam(3) are the ways you'd do this in settings not restricted to system calls only.
But even reading the contents of /etc/passwd using only system calls poses a challenge or two. Regular friends like fopen(3), fread(3) and fgets(3) aren't available, nor are string helpers like strsep(3). I guess you're stuck with read(2). You can read one character at a time using read(2), and do a crude parse of contents that way.
Or you could set up a character array that's big enough to read all of /etc/passwd into memory, and walk over the file contents that way. You'll have info stat(2) returned to tell how big /etc/passwd is, so could either fail if it's too big for your assumptions, or implement a buffering strategy yourself.
Or you could look into sbrk(2), and be sure you've enough memory to land all of /etc/passwd in memory from one read(2).
However you read the contents, you'll have some conversion to do of text -- strings of digits -- to C numeric types, harder when atoi(3) isn't available.
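As a sketch of the hand-rolled parsing and digit conversion this implies, here is a hypothetical helper (name and exact field handling are my own) that scans a passwd-format buffer already read(2) into memory; the uid field is assumed to be numeric:

```c
#include <stddef.h>
#include <string.h>

/* Scan a passwd-format buffer (name:passwd:uid:gid:...) and copy the login
   name for `uid` into `out`. Returns 0 on success, -1 if not found.
   Digit conversion is done by hand since atoi(3) is off-limits. */
static int uid_to_name(const char *buf, size_t len, long uid,
                       char *out, size_t outlen)
{
    size_t i = 0;
    while (i < len) {
        size_t start = i;                        /* start of the name field */
        size_t name_end = start;
        while (name_end < len && buf[name_end] != ':')
            name_end++;
        size_t j = name_end + 1;                 /* skip the password field */
        while (j < len && buf[j] != ':')
            j++;
        j++;
        long val = 0;                            /* hand-rolled digit parse */
        while (j < len && buf[j] >= '0' && buf[j] <= '9')
            val = val * 10 + (buf[j++] - '0');
        if (j < len && buf[j] == ':' && val == uid) {
            size_t k = 0;
            while (k < outlen - 1 && start + k < name_end) {
                out[k] = buf[start + k];
                k++;
            }
            out[k] = '\0';
            return 0;
        }
        while (i < len && buf[i] != '\n')        /* advance to the next line */
            i++;
        i++;
    }
    return -1;
}
```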
Also fun: converting time_t (time(2) output; also some of the fields of struct stat) to a nice string:
1409264099 -> Thu Aug 28 17:14:59 2014
Are these really the rules of the assignment...? C is crude enough without the presence of a conventional libc...
You can search the /etc/passwd file for what you require. You can find out the structure of the /etc/passwd file with man 5 passwd. There is no system call that will do this for you.
Only system calls... oh dear!
Well, I would mmap /etc/passwd and /etc/group and search the memory regions using basic C operations. Both files are line oriented, so lines are separated by \n, and within each line the entries are separated by colons.
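A minimal sketch of that mmap approach (map_file is a hypothetical helper; error handling is abbreviated and the file is assumed non-empty):

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole (non-empty) file read-only using only system calls:
   open(2), fstat(2), mmap(2). The caller then scans the returned
   region for '\n'-separated lines and ':'-separated fields by hand. */
static char *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping survives the close */
    if (data == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return data;
}
```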

Validation for user input on file system

I have written a bunch of web apps and know how to protect against mysql injections and such. I am writing a log storage system for a project in C and I was advised to make sure that it was hack free in the sense that the user could not supply bad data like foo\b\b\b and try to hack into the OS with some rm -rf /* kind of crud. I looked online and found a similar question here: how to check for the "backspace" character in C
This is at least what I thought of, but I know there are probably other things I need to protect against. Can someone who has a bit more experience help me list out the things I need to validate when I am saving files onto a server using user input as part of the hierarchical file naming system?
Example file: /home/webapp/data/{User input}/{Machine-ID}/{hostname}/{tag} where all of these fields could be "faked" when submitted to our log storing system.
Instead of checking for bad characters, turn the problem on its head and specify the good characters. E.g. require {User Input} be a single directory name made of [[:alnum:]_] characters; {Machine-ID} must be made of [[:xdigit:]] to your liking, etc. That gets rid of all the injection stuff quickly.
If you're only ever using these inputs as file names inside your program, and you're storing them on a native Linux filesystem, then the critical things to watch for are:
Absolutely proscribe any file name starting with ../, containing /../, or ending with /.., since such file names could allow the user to reach files outside the directory tree that you're working in.
Be wary of any file name containing / as these allow the user to name subdirectories, possibly with unintended consequences.
Other things that could cause trouble include:
Non-ASCII characters that may have a different meaning if used in a different locale.
Some ASCII punctuation characters may have a special meaning in parts of your processing system or may be invalid in some filesystems.
Some parts of your system may be case-sensitive with other parts being case-insensitive. Consider normalizing the case.
If applicable, restrict each field to something that isn't going to cause any trouble. For example:
A machine ID should probably consist of only ASCII lower letters and digits (or only ASCII uppercase letters and digits).
A hostname should consist of only ASCII lowercase letters and digits, plus - but not in an initial position (use Punycode for non-ASCII host names). If these are fully qualified host names, as opposed to host names in a network, then . is also valid, but not in initial position.
No field should be empty or contain a / or start with a . (an initial . could be . or .. — see above — and would be a dot file that ls doesn't show by default and isn't included in the pattern * in shells, so they're best avoided).
While control characters such as backspace aren't directly harmful, they can be indirectly harmful in that if you're investigating an issue on the command line, they can cause you to make mistakes. Do not allow them.
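Putting the whitelist advice into code, a sketch for one path component (the exact character set is an example choice; adjust it per field):

```c
#include <ctype.h>
#include <string.h>

/* Whitelist check for one path component: ASCII letters, digits and
   underscore only, non-empty. By construction this excludes '/', '.',
   control characters and anything locale-dependent. */
static int valid_component(const char *s)
{
    if (*s == '\0')
        return 0;                       /* empty names are never valid */
    for (; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;
    return 1;
}
```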

CommandLineToArgvW equivalent on Linux

I'm looking for an equivalent function to Windows' CommandLineToArgvW.
I have a string, which I want to break up exactly the way bash does it, with all the corner cases - i.e. taking into account single and double quotes, backslashes, etc., so splitting a b "c'd" "\"e\"f" "g\\" h 'i"j' would result into:
a
b
c'd
"e"f
g\
h
i"j
Since such a function already exist and is used by the OS/bash, I'm assuming there's a way to call it, or at least get its source code, so I don't need to reinvent the wheel.
Edit
To answer why I need it, it has nothing to do with spawning child processes. I want to make a program that searches text, watching for multiple regular expressions to be true in whatever order. But all the regular expressions would be input in the same text field, so I need to break them up.
GNU/Linux is made of free software and bash is free software, so you can get the source code and improve it (and you should publish your patches under the GPL).
But there is no common library doing that, because it is the role of the shell to expand the command line to arguments to the execve(2) syscall (which then go to the main of the invoked program).
(this was different in MS-DOS, where the called program had to expand its command line)
The function wordexp(3) is close to what you may want.
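For instance, wordexp splits and unquotes much like the shell does; WRDE_NOCMD disables command substitution, which is sensible for user-supplied input. A small sketch:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <wordexp.h>

/* Split `cmdline` shell-style using wordexp(3) and print one word per
   line. Note that wordexp also performs tilde and variable expansion,
   which may or may not be what you want. */
static void print_words(const char *cmdline)
{
    wordexp_t we;
    if (wordexp(cmdline, &we, WRDE_NOCMD) != 0)
        return;
    for (size_t i = 0; i < we.we_wordc; i++)
        printf("%s\n", we.we_wordv[i]);
    wordfree(&we);
}
```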
You may want to study the source code of simpler shells, e.g. download sash-3.7.tar.gz
If you want it to expand a string exactly the way Bash does, you will need to run Bash. Remember, Bash does parameter expansion, command substitution, and the like. If it really needs to act exactly like Bash, just call Bash itself.
char cmd[4096];
/* Note: glibc's popen() does not support "r+", so embed the command in the
   popen string instead, and read the expansion back one word per line. */
snprintf(cmd, sizeof cmd, "bash -c 'printf \"%%s\\n\" %s'", your_string);
FILE *f = popen(cmd, "r");
while (fgets(buffer, sizeof(buffer), f) != NULL) {
    /* each line is one expanded word */
}
pclose(f);
Note that real code would need to handle errors and possibly allocating a bigger buffer if your original is not large enough.
Given your updated requirements, it sounds like you don't want to parse it exactly like Bash does. Instead, you just want to parse space-separated strings with quoting and escaping. I would recommend simply implementing this yourself; I do not know of any off the shelf library that will parse strings exactly the way that you specify. You don't have to write it entirely by hand; you can use a lexical scanner generator like flex or Ragel for this purpose.
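A hand-rolled tokenizer along those lines might look like the following sketch. It is deliberately simpler than bash: inside double quotes a backslash escapes any character, and no expansions are performed; the function name split_words is my own. It does reproduce the question's example splitting:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Split `s` into words on unquoted blanks, honoring single quotes
   (everything literal), double quotes, and backslash escapes.
   Returns a NULL-terminated array of freshly allocated words. */
static char **split_words(const char *s, size_t *count)
{
    size_t cap = 8, n = 0;
    char **words = malloc(cap * sizeof *words);
    char *buf = malloc(strlen(s) + 1);   /* scratch space for one word */

    while (*s) {
        while (*s == ' ' || *s == '\t')
            s++;
        if (!*s)
            break;
        size_t len = 0;
        while (*s && *s != ' ' && *s != '\t') {
            if (*s == '\'') {                    /* single quotes: literal */
                s++;
                while (*s && *s != '\'')
                    buf[len++] = *s++;
                if (*s) s++;
            } else if (*s == '"') {              /* double quotes: \ escapes */
                s++;
                while (*s && *s != '"') {
                    if (*s == '\\' && s[1])
                        s++;
                    buf[len++] = *s++;
                }
                if (*s) s++;
            } else {                             /* unquoted: \ escapes */
                if (*s == '\\' && s[1])
                    s++;
                buf[len++] = *s++;
            }
        }
        buf[len] = '\0';
        if (n + 1 >= cap)
            words = realloc(words, (cap *= 2) * sizeof *words);
        words[n++] = strdup(buf);
    }
    free(buf);
    words[n] = NULL;
    if (count)
        *count = n;
    return words;
}
```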

How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for the go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether every character is alphanumeric.
So basically the first thing to do is check the file's encoding. If it's pure ASCII you have an easy task: just read the whole file into a char array (I'm assuming you are doing this in C/C++ or similar) and check every char in that array with the functions isalpha and isdigit. Of course you have to allow for special exceptions like the tabulator '\t', the space ' ' and the newline ('\n' on Linux, '\r''\n' on Windows).
For a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that for UTF-16 or wider encodings a simple char array is simply too small, but if you are doing it in, for example, C# you don't have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
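A sketch of such a heuristic, combining the suggestions above (the sample size it expects, around 1 KB, and the 95% printable threshold are arbitrary choices):

```c
#include <stddef.h>

/* Inspect a sample of a file (e.g. its first 1 KB) and guess whether it
   is text: no NUL bytes, and almost all bytes printable ASCII or common
   whitespace. Not 100% accurate, by design. */
static int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t printable = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0)
            return 0;              /* a NUL byte almost certainly means binary */
        if ((buf[i] >= 0x20 && buf[i] < 0x7f) ||
            buf[i] == '\n' || buf[i] == '\r' || buf[i] == '\t')
            printable++;
    }
    return len == 0 || printable * 100 / len >= 95;
}
```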

What are all the illegal characters in the XFS filesystem?

Could someone provide (or point me to a list) of all the illegal characters in the XFS filesystem? I'm writing an app that needs to sanitize filenames.
EDIT:
Okay, so POSIX filesystems should allow all characters except the NUL character, forward slash, and the '.' and '..' filenames are reserved. All other exceptions are application-level. Thanks!
POSIX filesystems (including XFS) allow every character in file names, with the exception of NUL (0x00) and forward-slash (/; 0x2f).
NUL marks the end of a C-string; so it is not allowed in file names.
/ is the directory separator, so it is not allowed.
File names starting with a dot (.; 0x2e) are considered hidden files. This is a userland, not kernel or filesystem convention.
There may be conventions you're following — for example, UTF-8 file names — in which case, there are many, many more restrictions including which normalization form to use.
Now, you probably want to disallow other things too; file names with all kinds of weird characters are no fun to deal with. I strongly suggest the whitelist approach.
Also, when handling file names, beware of the .. entry in every directory. You don't want to traverse it and allow an arbitrary path.
Source: Single Unix Spec v. 3, §3.169, "the characters composing the name may be selected from the set of all character values excluding the slash character and the null byte."
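A check for a single path component following that rule could look like this sketch (with "." and ".." also rejected, per the caution above; the function name is my own):

```c
#include <string.h>

/* A POSIX file name (one path component) is legal iff it is non-empty and
   contains no '/' and no NUL; "." and ".." are legal names but refer to
   reserved directory entries, so a sanitizer should reject them too.
   NUL cannot occur in a C string, so only '/' needs an explicit check. */
static int legal_component(const char *name)
{
    if (name[0] == '\0')
        return 0;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 0;
    return strchr(name, '/') == NULL;
}
```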
According to Wikipedia, any character except NUL is legal in an XFS filesystem file name. Of course, POSIX typically doesn't allow the forward slash '/' in a filename. Other than this, anything should be good, including international characters.
