How does grep work? - c

I am trying to understand how grep works.
When I say grep "hello" *.*, does grep get 2 arguments — (1) string to be searched i.e. "hello" and (2) path *.*? Or does the shell convert *.* into something that grep can understand?
Where can I get the source code of grep? I came across this GNU grep link. One of the README files says it's different from Unix grep. How so?
I want to look at source of FreeBSD version of grep and also Linux version of it (if they are different).

The power of grep is the magic of automata theory. GREP is an abbreviation for Global Regular Expression Print. And it works by constructing an automaton (a very simple "virtual machine": not Turing Complete); it then "executes" the automaton against the input stream.
The automaton is a graph or network of nodes or states. The transition between states is determined by the input character under scrutiny. Special automatons like + and * work by having transitions that loop back to themselves. Character classes like [a-z] are represented by a fan: one start node with branches for each character out to the "spokes"; and usually the spokes have a special "epsilon transition" to a single final state so it can be linked up with the next automaton to be built from the regular expression (the search string). The epsilon transitions allow a change of state without moving forward in the string being searched.
Edit: It appears I didn't read the question very closely.
When you type a command-line, it is first pre-processed by the shell. The shell performs alias substitutions and filename globbing. After substituting aliases (they're like macros), the shell chops up the command-line into a list of arguments (space-delimited). This argument list is passed to the main() function of the executable command program as an integer count (often called argc) and a pointer to a NULL-terminated ((void *)0) array of nul-terminated ('\0') char arrays.
Individual commands make use of their arguments however they wish. Many Unix programs will print a friendly help message if given the -h argument (since it begins with a minus sign, it's called an option), although this is not universal: grep, for one, uses -h to suppress file-name prefixes in its output. GNU software will also accept the "long-form" option --help.
Since there are a great many differences between different versions of Unix programs, the most reliable way to discover the exact syntax that a program requires is to ask the program itself. If that doesn't tell you what you need (or it's too cryptic to understand), you should next check the local manpage (man grep). And for GNU software you can often get even more info from info grep.

The shell does the globbing (conversion from * form to filenames). You can see this with a simple C program:
#include <stdio.h>
int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++) {
        printf("%s\n", argv[i]);
    }
    return 0;
}
And then run it like this:
./print_args *
You'll see it prints out what matched, not * literally. If you invoke it like this:
./print_args '*'
You'll see it gets a literal *.

The shell expands the '*.*' into a list of file names and passes the expanded list of file names to the program such as grep. The grep program itself does not do expansion of file names.
So, in answer to your question: grep does not get 2 arguments; the shell converts '*.*' into something grep can understand.
GNU grep is different from Unix grep in supporting extra options, such as -w and -B and -A.
It looks to me like FreeBSD uses the GNU version of grep:
http://svnweb.freebsd.org/base/stable/8/gnu/usr.bin/grep/

How grep sees the wildcard argument depends on your shell. (Standard) Bourne shell has a switch (-f) to disable file name globbing (see man pages).
You may activate this switch in a script with
set -f

Related

Trying to get an asterisk * as input to main from command line

I'm trying to send input from the command line to my main function. The input is then sent to the functions checkNum etc.
int main(int argc, char *argv[])
{
    int x = checkNum(argv[1]);
    int y = checkNum(argv[3]);
    int o = checkOP(argv[2]);
    …
}
It is supposed to be a calculator so for example in the command line when I write:
program.exe 4 + 2
and it will give me the answer 6 (code for this is not included).
The problem is when I want to multiply and I type for example
program.exe 3 * 4
It seems like it creates a pointer (or something, not quite sure) instead of giving me the char pointer to the char '*'.
The question is can I get the input '*' to behave the same way as when I type '+'?
Edit: Writing "*" in the command line works. Is there a way where I only need to type *?
The code is running on Windows, which seems to be part of the problem.
As @JohnBollinger wrote in the comments, you should use
/path/to/program 3 '*' 4
the way it's written at the moment.
But some explanation is clearly required. This is because the shell will parse the command line before passing it to your program. * will expand to any file in the directory (UNIX) or something similar (Windows), space separated. This is not what you need. You cannot fix it within your program as it will be too late. (On UNIX you can ensure you are in an empty directory but that probably doesn't help).
Another way around this is to quote the entire argument (and rewrite your program appropriately), i.e.
/path/to/program '3 * 4'
in which case you would need to use strtok_r or strsep to step through the (single) argument passed, separating it on the space(s).
How the shell handles the command-line arguments is outside the scope and control of your program. There is nothing you can put in the program to tell the shell to avoid performing any of its normal command-handling behavior.
I suggest, however, that instead of relying on the shell for word splitting, you make your program expect the whole expression as a single argument, and for it to parse the expression. That will not relieve you of the need for quotes, but it will make the resulting commands look more natural:
program.exe 3+4
program.exe "3 + 4"
program.exe "4*5"
That will also help if you expand your program to handle more complex expressions, such as those containing parentheses (which are also significant to the shell).
You can turn off the shell globbing if you don't want to use single quote (') or double quote (").
Do
# set -o noglob
or
# set -f
(both are equivalent).
to turn off the shell globbing. Now, the shell won't expand any globs, including *.

Using exec on each file in a bash script

I'm trying to write a basic find command for an assignment (without using find). Right now I have an array of files I want to exec something on. The syntax would look like this:
-exec /bin/mv {} ~/.TRASH
And I have an array called current that holds all of the files. The command itself (/bin/mv, {}, and ~/.TRASH, since I shift the -exec out) is held in a second array called arguments.
I need it so that every file gets passed into {} and exec is called on it.
I'm thinking I should use sed to replace the contents of {} like this (within a for loop):
for i in "${current[@]}"; do
    sed "s#$i#{}"
    #exec stuff?
done
How do I exec the other arguments though?
You can do something like this:
cmd='-exec /bin/mv {} ~/.TRASH'
current=(test1.txt test2.txt)
for f in "${current[@]}"; do
    eval $(sed "s/{}/$f/;s/-exec //" <<< "$cmd")
done
Be very careful with eval command though as it can do nasty things if input comes from untrusted sources.
Here is an attempt to avoid eval (thanks to @gniourf_gniourf for his comments):
current=( test1.txt test2.txt )
arguments=( "/bin/mv" "{}" ~/.TRASH )
for f in "${current[@]}"; do
    "${arguments[@]/\{\}/$f}"
done
You are lucky that your design is not too bad, in that your arguments are in an array.
But you certainly don't want to use eval.
So, if I understand correctly, you have an array of files:
current=( [0]='/path/to/file1' [1]='/path/to/file2' ... )
and an array of arguments:
arguments=( [0]='/bin/mv' [1]='{}' [2]='/home/alex/.TRASH' )
Note that you don't have the tilde here, since Bash already expanded it.
To perform what you want:
for i in "${current[@]}"; do
    ( "${arguments[@]//'{}'/"$i"}" )
done
Observe the quotes.
This will replace all the occurrences of {} in the fields of arguments by the expansion of $i, i.e., by the filename [1], and execute this expansion. Note that each field of the array will be expanded to one argument (thanks to the quotes), so that all this is really safe regarding spaces, glob characters, etc. This is really the safest and most correct way to proceed. Every solution using eval is potentially dangerous and broken (unless some special quoting is used, e.g., with printf '%q', but this would make the method uselessly awkward). By the way, using sed is also broken in at least two ways.
Note that I enclosed the expansion in a subshell, so that it's impossible for the user to interfere with your script. Without this, and depending on how your full script is written, it's very easy to make your script break by (maliciously) changing some variables stuff or cd-ing somewhere else. Running your argument in a subshell, or in a separate process (e.g., separate instance of bash or sh—but this would add extra overhead) is really mandatory for obvious security reasons!
Note that with your script, the user has direct access to all the Bash builtins (this is a huge pro), compared to some more standard find versions [2]!
[1] Note that POSIX clearly specifies that this behavior is implementation-defined:
If a utility_name or argument string contains the two characters "{}", but not just the two characters "{}", it is implementation-defined whether find replaces those two characters or uses the string without change.
In our case, we chose to replace all occurrences of {} with the filename. This is the same behavior as, e.g., GNU find. From man find:
The string {} is replaced by the current file name being processed everywhere it occurs in the arguments to the command, not just in arguments where it is alone, as in some versions of find.
[2] POSIX also specifies that calling builtins is not defined:
If the utility_name names any of the special built-in utilities (see Special Built-In Utilities), the results are undefined.
In your case, it's well defined!
I think that trying to implement (in pure Bash) a find command is a wonderful exercise that should teach you a lot… especially if you get relevant feedback. I'd be happy to review your code!

How to execvp ls *.txt in C

I'm having issues execvping the *.txt wildcard, and reading this thread - exec() any command in C - indicates that it's difficult because of "globbing" issues. Is there any easy way to get around this?
Here's what I'm trying to do:
char * array[] = {"ls", "*.txt", (char *) NULL };
execvp("ls", array);
You could use the system command:
system("ls *.txt");
to let the shell do the globbing for you.
In order to answer this question you have to understand what is going on when you type ls *.txt in your terminal (emulator). When the ls *.txt command is typed, it is interpreted by the shell. The shell then lists the directory and matches the file names in it against the *.txt pattern. Only after all of that is done does the shell prepare the file names as arguments and spawn a new process, passing those file names as the argv array to an execvp call.
In order to assemble something like that yourself, look at the following Q/A:
How to list files in a directory in a C program?
Use fnmatch() to match file name with a shell-like wildcard pattern.
Prepare argument list from matched file names and use vfork() and one of the exec(3) family of functions to run another program.
Alternatively, you can use the system() function as @manu-fatto has suggested. But that function does a slightly different thing: it actually runs the shell, which evaluates the ls *.txt statement and in turn performs steps similar to the ones I have described above. It is likely to be less efficient and it may introduce security holes (see the manual page for more details; the security risks are stated under the NOTES section, with a suggestion not to use the function in certain cases).
Hope it helps. Good Luck!

Command line arguments with datafiles

If I want to pass a program data files, how can I distinguish the fact that they are data files, not just strings of the file names? Basically I want to file redirect, but use command line arguments so I can be sure the input is correct.
I have been using:
./theapp < datafile1 < datafile2 arg1 arg2 arg3 > outputfile
but I am wondering whether it is possible for it to look like this:
./theapp datafile1 datafile2 arg1 arg2 arg3 > outputfile
Allowing the use of command line arguments.
It's a little hard to combine two files into standard input like that. Better would be:
cat datafile1 datafile2 | ./theapp arg1 arg2 arg3 >outputfile
With bash (at least), the second input redirection overrides the first, it does not augment it. You can see that with the two commands:
cat <realfile.txt </dev/null # no output.
cat </dev/null <realfile.txt # outputs realfile.txt.
When you use redirection, your application never even sees >outputfile (for example). It is evaluated by the shell which opens it up and connects it to the standard output of the process you're trying to run. All your program will generally see will be:
./theapp arg1 arg2 arg3
Same with standard input, it's taken care of by the shell.
The only possible problem with that first command above is that it combines the two files into one stream so that your program doesn't know where the first ends and second begins (unless it can somehow deduce this from the content of the files).
If you want to process multiple files and know which they are, there's a time-honoured tradition of doing something like:
./theapp arg1 arg2 arg3 @datafile1 @datafile2 >outputfile
and then having your application open and process the files itself. This is more work than letting the shell do it though.
From the perspective of your program, all command line arguments are strings, and you have to decide whether they represent file names or not yourself. There are only two bytes that cannot appear in a file name on Unix: 0x00 and 0x2F (NUL and /). [I really mean bytes. Except for HFS+, Unix file systems are completely oblivious to character encoding, although sensible people use UTF-8, of course.]
Shell redirections don't appear in argv at all.
There is a convention, though: treat each element of argv (except argv[0] of course) that does not begin with a dash as the name of a file to process, in the order that they appear. You do NOT have to do any unquoting operations; just pass them to fopen (or open) as is. If the string "-" appears as an element of argv, process standard input at that point until exhausted, then continue looping over argv. And if the string "--" appears in argv, treat everything after that point as a file name, whether or not it begins with a dash. (Including subsequent appearances of "-" or "--").
There may be a handy library module or even a language primitive to deal with this stuff for you, depending on what language you're using. For instance, in Perl, you just write
for (<>) {
... do stuff with $_ ...
}
and you get everything I said in the "There is a convention..." paragraph for free. (But you said C, so, um, you gotta do most of it yourself. I'm not aware of an argument-processing library for plain C that's worth the space it takes on disk. :-( )

Writing a portable command line wrapper in C

I'm writing a perl module called perl5i. Its aim is to fix a swath of common Perl problems in one module (using lots of other modules).
To invoke it on the command line for one liners you'd write: perl -Mperl5i -e 'say "Hello"' I think that's too wordy so I'd like to supply a perl5i wrapper so you can write perl5i -e 'say "Hello"'. I'd also like people to be able to write scripts with #!/usr/bin/perl5i so it must be a compiled C program.
I figured all I had to do was push "-Mperl5i" onto the front of the argument list and call perl. And that's what I tried.
#include <unistd.h>
#include <stdlib.h>
/*
 * Meant to mimic the shell command
 *     exec perl -Mperl5i "$@"
 *
 * This is a C program so it works in a #! line.
 */
int main (int argc, char* argv[]) {
    int i;
    /* This value is set by a program which generates this C file */
    const char* perl_cmd = "/usr/local/perl/5.10.0/bin/perl";
    /* Room for the argc original arguments, "-Mperl5i" and the final NULL */
    char* perl_args[argc+2];
    perl_args[0] = argv[0];
    perl_args[1] = "-Mperl5i";
    /* i runs up to argc inclusive so the terminating NULL is copied too */
    for( i = 1; i <= argc; i++ ) {
        perl_args[i+1] = argv[i];
    }
    return execv( perl_cmd, perl_args );
}
Windows complicates this approach. Apparently programs in Windows are not passed an array of arguments, they are passed all the arguments as a single string and then do their own parsing! Thus something like perl5i -e "say 'Hello'" becomes perl -Mperl5i -e say 'Hello' and Windows can't deal with the lack of quoting.
So, how can I handle this? Wrap everything in quotes and escapes on Windows? Is there a library to handle this for me? Is there a better approach? Could I just not generate a C program on Windows and write it as a perl wrapper as it doesn't support #! anyway?
UPDATE: To be more clear, this is shipped software, so solutions that require using a certain shell or tweaking the shell configuration (for example, alias perl5i='perl -Mperl5i') aren't satisfactory.
For Windows, use a batch file.
perl5i.bat
@echo off
perl -Mperl5i %*
%* is all the command line parameters minus %0.
On Unixy systems, a similar shell script will suffice.
Update:
I think this will work, but I'm no shell wizard and I don't have an *nix system handy to test.
perl5i
#!bash
perl -Mperl5i $@
Update Again:
DUH! Now I understood your #! comment correctly. My shell script will work from the CLI but not in a #! line, since #!foo requires that foo is a binary file.
Disregard previous update.
It seems like Windows complicates everything.
I think your best bet there is to use a batch file.
You could use a file association, associate .p5i with perl -Mperl5i %*. Of course this means mucking about in the registry, which is best avoided IMO. Better to include instructions on how to manually add the association in your docs.
Yet another update
You might want to look at how parl does it.
I can't reproduce the behaviour you describe:
/* main.c */
#include <stdio.h>
int main(int argc, char *argv[]) {
    int i;
    for (i = 0; i < argc; i++) {
        printf("%s\n", argv[i]);
    }
    return 0;
}
C:\> ShellCmd.exe a b c
ShellCmd.exe
a
b
c
That's with Visual Studio 2005.
Windows is always the odd case. Personally, I wouldn't try to code for the Windows environment exception. Some alternatives are using "bat wrappers" or ftype/assoc Registry hacks for a file extension.
Windows ignores the shebang line when running from a DOS command shell, but ironically uses it when CGI-ing Perl in Apache for Windows. I got tired of coding #!c:/perl/bin/perl.exe directly in my web programs because of portability issues when moving to a *nix environment. Instead I created a c:\usr\bin directory on my workstation and copied the perl.exe binary from its default location, typically c:\perl\bin for AS Perl and c:\strawberry\perl\bin for Strawberry Perl. So in web development mode on Windows my programs wouldn't break when migrated to a Linux/UNIX webhost, and I could use a standard issue shebang line "#!/usr/bin/perl -w" without having to go SED crazy prior to deployment. :)
In the DOS command shell environment I just either set my PATH explicitly or create a ftype pointing to the actual perl.exe binary with embedded switch -Mperl5i. The shebang line is ignored.
ftype p5i=c:\strawberry\perl\bin\perl.exe -Mperl5i %1 %*
assoc .pl=p5i
Then from the DOS command line you can just call "program.pl" by itself instead of "perl -Mperl5i program.pl"
So the "say" statement worked in 5.10 without any additional coaxing just by entering the name of the Perl program itself, and it would accept a variable number of command line arguments as well.
Use CommandLineToArgvW to build your argv, or just pass your command line directly to CreateProcess.
Of course, this requires a separate Windows-specific solution, but you said you're okay with that. It is relatively simple, and coding key pieces specifically for the target system often helps integration (from the users' POV) significantly. YMMV.
If you want to run the same program both with and without a console, you should read Raymond Chen on the topic.
On Windows, at the system level, the command-line is passed to the launched program as a single UTF-16 string, so any quotes entered in the shell are passed as is. So the double quotes from your example are not removed. This is quite different from the POSIX world where the shell does the job of parsing and the launched program receives an array of strings.
I've described here the behavior at the system level. However, between your C (or your Perl) program there is usually the C standard library that is parsing the system command line string to give it to main() or wmain() as argv[]. This is done inside your process, but you can still access the original command line string with GetCommandLineW() if you really want to control how the parsing is done, or get the string in its full UTF-16 encoding.
To learn more about the Windows command-line parsing quirks, read the following:
http://www.autohotkey.net/~deleyd/parameters/parameters.htm#WIN
http://blogs.msdn.com/b/oldnewthing/archive/2006/05/15/597984.aspx
You may also be interested in the code of the wrapper I wrote for Padre on Win32: this is a GUI program (which means that it will not open a console if launched from the Start menu) called padre.exe that embeds perl to launch the padre Perl script. It also does a small trick: it changes argv[0] to point to perl.exe so that $^X will be something usable to launch external perl scripts.
The execv you are using in your example code is just an emulation in the C library of the POSIX-like behavior. In particular it will not add quotes around your arguments so that the launched perl works as expected. You have to do that yourself.
Note that because the client is responsible for parsing, each client can do it the way it wants. Many let the libc do it, but not all. So generic command-line generation rules on Windows cannot exist: the rules depend on the program launched.
You may still be interested in a "best effort" implementation such as Win32::ShellQuote.
If you were able to use C++ then perhaps Boost.Program_options would help:
http://www.boost.org/doc/libs/1_39_0/doc/html/program_options.html

Resources