implementing globbing in a shell prototype - c

I'm implementing a linux shell for my weekend assignment and I am having some problems implementing wilcard matching as a feature in shell. As we all know, shells are a complete language by themselves, e.g. bash, ksh, etc. I don't need to implement the complete features like control structures, jobs etc. But how to implement the *?
A quick analysis gives you the following result:
echo *
lists all the files in the current directory. Is this the only logical manifestation of the shell? I mean, not considering the language-specific features of bash, is this what a shell does, internally? Replace a * with all the files in the current directory matching the pattern?
Also I have heard about Perl Compatible Regular Expression , but it seems to complex to use a third party library.
Any suggestions, links, etc.? I will try to look at the source code as well, for bash.

This is called "globbing" and the function performing this is named the same: glob(3)

Yes, that's what shell does. It will replace '*' characters by all files and folder names in cwd. It is in fact very basic regular expressions supporting only '?' and '*' and matching with file and folder names in cwd.
Remark that backslashed \* and '*' enclosed between simple or double quotes ' or " are not replaced (backslash and quotes are removed before passing to the command executed).

If you want more control than glob gives, the standard function fnmatch performs just glob matching.
Note that shells also performs word expansion (e.g. "~" → "/home/user"), which should be done before glob expansion, if you're doing filename matching manually. (Or use wordexp.)

Related

How to safely pass an arbitrary text as parameter to a program in a shell script?

I'm writing a GUI application for character recognition that uses Tesseract. I want to allow the user to specify a custom shell command to be executed with /bin/sh -c when the text is ready.
The problem is the recognized text can contain literally anything, for example && rm -rf some_dir.
My first thought was to make it like in many other programs, where
the user can type the command in a text entry, and then special strings (like in printf()) in the command are replaced by the appropriate data (in my case, it might be %t). Then the whole string is passed to execvp(). For example, here is a screenshot from qBittorrent:
The problem is that even if I properly escape the text before replacing %t, nothing prevents the user to add extra quotes around the specifier:
echo '%t' >> history.txt
So the full command to be executed is:
echo ''&& rm -rf some_dir'' >> history.txt
Obviously, that's a bad idea.
The second option is only let the user to choose an executable (with a file selection dialog), so I can manually put the text from Tesseract as argv[1] for execvp(). The idea is that the executable can be a script where users can put anything they want and access the text with "$1". That way, the command injection is not possible (I think). Here's an example script a user can create:
#!/bin/sh
echo "$1" >> history.txt
It there any pitfalls with this approach? Or maybe there's a better way to safely pass an arbitrary text as parameter to a program in shell script?
In-Band: Escaping Arbitrary Data In An Unquoted Context
Don't do this. See the "Out-Of-Band" section below.
To make an arbitrarily C string (containing no NULs) evaluate to itself when used in an unquoted context in a strictly POSIX-compliant shell, you can use the following steps:
Prepend a ' (moving from the required initial unquoted context to a single-quoted context).
Replace each literal ' within the data with the string '"'"'. These characters work as follows:
' closes the initial single-quoted context.
" enters a double-quoted context.
' is, in a double-quoted context, literal.
" closes the double-quoted context.
' re-enters single-quoted context.
Append a ' (returning to the required initial single-quoted context).
This works correctly in a POSIX-compliant shell because the only character that is not literal inside of a single-quoted context is '; even backslashes are parsed as literal in that context.
However, this only works correctly when sigils are used only in an unquoted context (thus putting onus on your users to get things right), and when a shell is strictly POSIX-compliant. Also, in a worst-case scenario, you can have the string generated by this transform be up to 5x longer than the original; one thus needs to be cautious around how the memory used for the transform is allocated.
(One might ask why '"'"' is advised instead of '\''; this is because backslashes change their meaning used inside legacy backtick command substitution syntax, so the longer form is more robust).
Out-Of-Band: Environment Variables, Or Command-Line Arguments
Data should only be passed out-of-band from code, such that it's never run through the parser at all. When invoking a shell, there are two straightforward ways to do this (other than using files): Environment variables, and command-line arguments.
In both of the below mechanisms, only the user_provided_shell_script need be trusted (though this also requires that it be trusted not to introduce new or additional vulnerabilities; invoking eval or any moral equivalent thereto voids all guarantees, but that's the user's problem, not yours).
Using Environment Variables
Excluding error handling (if setenv() returns a nonzero result, this should be treated as an error, and perror() or similar should be used to report to the user), this will look like:
setenv("torrent_name", torrent_name_str, 1);
setenv("torrent_category", torrent_category_str, 1);
setenv("save_path", path_str, 1);
# shell script should use "$torrent_name", etc
system(user_provided_shell_script);
A few notes:
While values can be arbitrary C strings, it's important that the variable names be restricted -- either hardcoded constants as above, or prefixed with a constant (lowercase 7-bit ASCII) string and tested to contain only characters which are permissible shell variable names. (A lower-case prefix is advised because POSIX-compliant shells use only all-caps names for variables that modify their own behavior; see the POSIX spec on environment variables, particularly the note that "The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities").
Environment space is a limited resource; on modern Linux, the maximum combined storage for both environment variables and command-line arguments is typically on the scale of 128kb; thus, setting large environment variables will cause execve()-family calls with large command lines to fail. Validating that length is within reasonable domain-specific limits is wise.
Using Command-Line Arguments:
This version requires an explicit API, such that the user configuring the trigger command knows which value will be passed in $1, which will be passed in $2, etc.
/* You'll need to do the usual fork() before this, and the usual waitpid() after
* if you want to let it complete before proceeding.
* Lots of Q&A entries on the site already showing the context.
*/
execl("/bin/sh", "-c", user_provided_shell_script,
"sh", /* this is $0 in the script */
torrent_name_str, /* this is $1 in the script */
torrent_category_str, /* this is $2 in the script */
path_str, /* this is $3 in the script */
NUL);
Any time you're runnng commands with even the possibility of user input making its way into them you must escape for the shell context.
There's no built-in function in C to do this, so you're on your own, but the basic idea is to render user parameters as either properly escaped strings or as separate arguments to some kind of execution function (e.g. exec family).

Using exec on each file in a bash script

I'm trying to write a basic find command for a assignment (without using find). Right now I have an array of files I want to exec something on. The syntax would look like this:
-exec /bin/mv {} ~/.TRASH
And I have an array called current that holds all of the files. My array only holds /bin/mv, {}, and ~/.TRASH (since I shift the -exec out) and are in an array called arguments.
I need it so that every file gets passed into {} and exec is called on it.
I'm thinking I should use sed to replace the contents of {} like this (within a for loop):
for i in "${current[#]}"; do
sed "s#$i#{}"
#exec stuff?
done
How do I exec the other arguments though?
You can something like this:
cmd='-exec /bin/mv {} ~/.TRASH'
current=(test1.txt test2.txt)
for f in "${current[#]}"; do
eval $(sed "s/{}/$f/;s/-exec //" <<< "$cmd")
done
Be very careful with eval command though as it can do nasty things if input comes from untrusted sources.
Here is an attempt to avoid eval (thanks to #gniourf_gniourf for his comments):
current=( test1.txt test2.txt )
arguments=( "/bin/mv" "{}" ~/.TRASH )
for f in "${current[#]}"; do
"${arguments[#]/\{\}/$f}"
done
Your are lucky that your design is not too bad, that your arguments are in an array.
But you certainly don't want to use eval.
So, if I understand correctly, you have an array of files:
current=( [0]='/path/to/file'1 [1]='/path/to/file2' ... )
and an array of arguments:
arguments=( [0]='/bin/mv' [1]='{}' [2]='/home/alex/.TRASH' )
Note that you don't have the tilde here, since Bash already expanded it.
To perform what you want:
for i in "${current[#]}"; do
( "${arguments[#]//'{}'/"$i"}" )
done
Observe the quotes.
This will replace all the occurrences of {} in the fields of arguments by the expansion of $i, i.e., by the filename1, and execute this expansion. Note that each field of the array will be expanded to one argument (thanks to the quotes), so that all this is really safe regarding spaces, glob characters, etc. This is really the safest and most correct way to proceed. Every solution using eval is potentially dangerous and broken (unless some special quotings is used, e.g., with printf '%q', but this would make the method uselessly awkward). By the way, using sed is also broken in at least two ways.
Note that I enclosed the expansion in a subshell, so that it's impossible for the user to interfere with your script. Without this, and depending on how your full script is written, it's very easy to make your script break by (maliciously) changing some variables stuff or cd-ing somewhere else. Running your argument in a subshell, or in a separate process (e.g., separate instance of bash or sh—but this would add extra overhead) is really mandatory for obvious security reasons!
Note that with your script, user has a direct access to all the Bash builtins (this is a huge pro), compared to some more standard find versions2!
1 Note that POSIX clearly specifies that this behavior is implementation-defined:
If a utility_name or argument string contains the two characters "{}", but not just the two characters "{}", it is implementation-defined whether find replaces those two characters or uses the string without change.
In our case, we chose to replace all occurrences of {} with the filename. This is the same behavior as, e.g., GNU find. From man find:
The string {} is replaced by the current file name being processed everywhere it occurs in the arguments to the command, not just in arguments where it is alone, as in some versions of find.
2 POSIX also specifies that calling builtins is not defined:
If the utility_name names any of the special built-in utilities (see Special Built-In Utilities), the results are undefined.
In your case, it's well defined!
I think that trying to implement (in pure Bash) a find command is a wonderful exercise that should teach you a lot… especially if you get relevant feedback. I'd be happy to review your code!

How can I convert path containing wildcard to corresponding file entries in C program?

I'm trying to implement the ls command with wildcard, *.
I have just learned the fact that most shells convert ls-argument containing * to the corresponding entries when performing ls command.
For example, The directory foo consist of a.file, b.file, and directory bar.
Then, the directory bar has c.file, d.file, and e.file.
and assume that current directory is the directory foo.
the argument */* is converted is to the following entries.
"bar/c.file", "bar/d.file", "bar/e.file"
How can program perform this? I don't know where to start from. And
there are many possible cases.
*/../*, ../../*, */*/*, etc.
Any advice would be awesome. Thank you..
You can of couse use glob() to do a lot of this work.
Such patterns are called globs, for some reason I won't dig up now. :)
POSIX provides glob(3) for programmatic wildcard path expansion.

How do I find out if a file name has any extension in Unix?

I need to find out if a file or directory name contains any extension in Unix for a Bourne shell scripting.
The logic will be:
If there is a file extension
Remove the extension
And use the file name without the extension
This is my first question in SO so will be great to hear from someone.
The concept of an extension isn't as strictly well-defined as in traditional / toy DOS 8+3 filenames. If you want to find file names containing a dot where the dot is not the first character, try this.
case $filename in
[!.]*.*) filename=${filename%.*};;
esac
This will trim the extension (as per the above definition, starting from the last dot if there are several) from $filename if there is one, otherwise no nothing.
If you will not be processing files whose names might start with a dot, the case is superfluous, as the assignment will also not touch the value if there isn't a dot; but with this belt-and-suspenders example, you can easily pick the approach you prefer, in case you need to extend it, one way or another.
To also handle files where there is a dot, as long as it's not the first character (but it's okay if the first character is also a dot), try the pattern ?*.*.
The case expression in pattern ) commands ;; esac syntax may look weird or scary, but it's quite versatile, and well worth learning.
I would use a shell agnostic solution. Runing the name through:
cut -d . -f 1
will give you everything up to the first dot ('-d .' sets the delimeter and '-f 1' selects the first field). You can play with the params (try '--complement' to reverse selection) and get pretty much anything you want.

cmd- comma to separate parameters Compared to space?

I have some questions on the subject, of commas compared to space, in delimiting parameters.
They are questions that C programmers familiar with the cmd prompt, may be able to throw some light on..
I know that when doing
c:\>program a b c
there are 4 parameters [0]=program [1]=a [2]=b [3]=c
According to hh ntcmds.chm concepts..
Shell overview
; and , are used to separate parameters
; or , command1 parameter1;parameter2 Use to separate command parameters.
I see dir a,b gives the same result as dir a b
but
c:\>program a,b,c
gives parameters [0]=program [1]=a,b,c
So do some? or all? windows commands use ; and , ? and is that interpretation within the code of each command, or done by the shell like with space?
And if it is in the code of each command.. how would I know which do it?
I notice that documentation of explorer.exe mentions the comma,e.g. you can do
explorer /e,.
but DIR /? does not mention it, but can use it. And a typical c program doesn't take , as a delimiter at all.. So is it the case that the shell doesn't use comma to delimit, it uses space. And windows commands that do, do so 'cos they are (all?) written to delimit the parameters the shell has given them further when commas are used?
There are two differences here between Unix and Windows:
Internal commands such as DIR are built into the shell; their command line syntax doesn't have to follow the same rules as for regular programs
On Windows, programs are responsible for parsing their own command lines. The shell parses redirects and pipes, then passes the rest of the command line to the program in one string
Windows C programs built using Visual Studio use the command line parser in the Microsoft C runtime, which is similar to a typical Unix shell parser and obeys spaces and quotation marks.
I've never seen a C program that uses , or ; as a command line separator. I was aware of the special case for explorer /e,., but I'd never seen the dir a,b example until just now.
Batch files use a comma or semicolon as an alternative argument separator.
Test batch file:
#echo %1/%2/%3
Test run:
> test.cmd 1,2,3
1/2/3
> test.cmd 1;2 3
1/2/3
And, as you note, dir uses it, copy as well – those are both shell built-ins and probably run through a similar parser like batch files as well (it isn't exactly the same, since you can do things like cd.. or dir/s which aren't possible for anything else). I guess (note: speculation) this is some sort of backwards compatibility that goes back into the DOS or even CP/M days. Nowadays you probably should just use spaces. And as Tim notes, the C runtime dictates certain things about arguments and how they are supposed to be parsed. Many other languages/frameworks follow that convention but not necessarily all. PowerShell for example has completely different argument handling and this can sometimes be a surprise when interacting with native programs from within it (that being said, PowerShell cmdlets and functions are no programs executable elsewhere, but batch files likewise).

Resources