CommandLineToArgvW equivalent on Linux - c

I'm looking for an equivalent function to Windows' CommandLineToArgvW.
I have a string which I want to break up exactly the way bash does, with all the corner cases - i.e. taking into account single and double quotes, backslashes, etc. Splitting a b "c'd" "\"e\"f" "g\\" h 'i"j' would result in:
a
b
c'd
"e"f
g\
h
i"j
Since such a function already exists and is used by the OS/bash, I'm assuming there's a way to call it, or at least get its source code, so I don't need to reinvent the wheel.
Edit
To answer why I need it, it has nothing to do with spawning child processes. I want to make a program that searches text, watching for multiple regular expressions to be true in whatever order. But all the regular expressions would be input in the same text field, so I need to break them up.

GNU/Linux is made of free software, and bash is free software, so you can get the source code and improve it (and you should publish your improvements under the GPL).
But there is no common library doing that, because it is the role of the shell to expand the command line into the arguments passed to the execve(2) syscall (which then go to the main of the invoked program).
(This was different in MS-DOS, where the called program had to expand its own command line.)
The function wordexp(3) is close to what you may want.
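As a minimal sketch of wordexp(3) on the string from the question (note that wordexp also performs tilde, variable, and command expansion, so it is close to, but not exactly, plain word-splitting with quote removal):
#include <wordexp.h>
#include <stdio.h>

int main(void) {
    wordexp_t we;
    /* WRDE_NOCMD refuses command substitution, for safety */
    if (wordexp("a b \"c'd\" \"\\\"e\\\"f\" \"g\\\\\" h 'i\"j'", &we, WRDE_NOCMD) == 0) {
        for (size_t i = 0; i < we.we_wordc; i++)
            puts(we.we_wordv[i]);   /* prints the seven words from the question */
        wordfree(&we);
    }
    return 0;
}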
You may want to study the source code of simpler shells, e.g. download sash-3.7.tar.gz

If you want it to expand a string exactly the way Bash does, you will need to run Bash. Remember, Bash does parameter expansion, command substitution, and the like. If it really needs to act exactly like Bash, just call Bash itself.
/* popen(3) only supports mode "r" or "w" portably, so build the whole
   command up front; note this naive quoting breaks if your_string
   itself contains a single quote. */
char cmd[1024];
snprintf(cmd, sizeof(cmd), "bash -c 'echo %s'", your_string);
FILE *f = popen(cmd, "r");
fgets(buffer, sizeof(buffer), f);
pclose(f);
Note that real code would need to handle errors, and possibly allocate a bigger buffer if the original is not large enough.
Given your updated requirements, it sounds like you don't want to parse it exactly like Bash does. Instead, you just want to parse space-separated strings with quoting and escaping. I would recommend simply implementing this yourself; I do not know of any off-the-shelf library that will parse strings exactly the way you specify. You don't have to write it entirely by hand; you can use a lexical scanner generator like flex or Ragel for this purpose.
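If you do implement it by hand, a minimal sketch of such a splitter, handling only the quoting and escaping rules from the question (no expansions, no diagnostics for unterminated quotes), could look like this:
#include <ctype.h>

/* Split `line` in place into at most `max` words; returns the count.
 * The output words alias the input buffer, which must be writable. */
static int split_words(char *line, char **argv, int max) {
    int argc = 0;
    char *out = line;
    while (*line) {
        while (isspace((unsigned char)*line))
            line++;
        if (!*line || argc == max)
            break;
        argv[argc++] = out;
        while (*line && !isspace((unsigned char)*line)) {
            if (*line == '\'') {                   /* verbatim until the next ' */
                line++;
                while (*line && *line != '\'')
                    *out++ = *line++;
                if (*line) line++;
            } else if (*line == '"') {             /* honor \" and \\ inside */
                line++;
                while (*line && *line != '"') {
                    if (*line == '\\' && (line[1] == '"' || line[1] == '\\'))
                        line++;
                    *out++ = *line++;
                }
                if (*line) line++;
            } else if (*line == '\\' && line[1]) { /* unquoted escape */
                line++;
                *out++ = *line++;
            } else {
                *out++ = *line++;
            }
        }
        if (*line) line++;
        *out++ = '\0';
    }
    return argc;
}
Called on a writable copy of the example string, split_words(buf, words, 64) yields the seven words listed above.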

Related

Shell in C - how to read and execute user input?

My assignment is to write a very simple shell in C. I don't have many sources to work from and I have just started learning C; so far I have used only scanf() and printf() and coded simple functions. I only know it is supposed to be written with fork() and the different exec() functions. I spent a lot of time analyzing other shells, but I don't understand the structure of the programs and some of the functions, like:
parsing, why do we need parsing? If I know I won't use arguments in commands, I would like only to compare input with my own functions like help() or date() and execute them.
reading user input. Using the fgets() and then strcmp()?
executing: how does it work? How does execvp() know whether the user input is a command (function) in my main or a program in my program folder?
First, let me say this sounds like a daunting task for a person just learning C, unless it's very carefully restricted. It does not make sense for you to need to analyze other shells. I'd advise you talk to a teaching assistant about the scope of the assignment, and what you're expected to do given your experience.
To answer your questions, though:
why do we need parsing?
Parsing is (in this context) taking a sequence of characters and producing a data structure which your program can work on. Now, this could be a rather simple structure if there are not supposed to be any arguments, no multiple commands per line, etc. However, you at least have to make sure that the user has indeed not used arguments, has not written down an invalid command line, has closed their open parentheses, and so on.
If I know I won't use arguments in commands
The program is not written for the perfect user whose behavior you can predict. You must accommodate any user, who may insert just about anything; you'll need to notice this case and report an error.
reading user input. Using the fgets() and then strcmp()?
Remember that fgets() may not read the entire line, if the line is longer than your buffer length minus 1. But perhaps you are guaranteed a line-length limit? Or are you allowed to fail on overly long lines?
Also, it might be the case that the user is allowed to use extra white-space as part of the line, in which case strcmp() might not give you what you want.
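For example, a common fgets pattern (the buffer size here is an arbitrary choice):
#include <stdio.h>
#include <string.h>

char line[256];
if (fgets(line, sizeof line, stdin) != NULL) {
    size_t n = strlen(line);
    if (n > 0 && line[n - 1] == '\n')
        line[n - 1] = '\0';    /* got a whole line; strip the newline */
    else {
        /* the line was longer than the buffer (or input ended without
           a newline); decide whether to read more or report an error */
    }
}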
executing: how does it work? How does execvp() know whether the user input is a command (function) in my main or a program in my program folder?
Have a look at the man page for execvp() (and friends). Basically, what happens when you call execvp() is that the binary at the specified location gets run, and its command line is the argument vector you pass as the second argument of execvp(). Suppose you run
char *args[] = { "/path/to/foo", "bar", NULL };
execvp("/path/to/foo", args);
so the program at /path/to/foo is run. By convention, its argv[0] is the path it was invoked by. Its argc will be 2 and its argv[1] will be "bar". Its working directory (and user and group ID) will be the current directory, and user and group ID, of the process which called execvp() - so not necessarily the directory containing /path/to/foo.
Continuing the previous example, you could do:
chdir("/path/to");
execvp("foo", "foo", "bar");
at which time foo would run with argv[0] being foo.
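For completeness, since the assignment mentions fork(): a sketch of the fork/exec/wait skeleton a simple shell loop is built around (the command "foo" and argument "bar" are placeholders):
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t pid = fork();
if (pid == 0) {                      /* child: replace itself with the command */
    char *args[] = { "foo", "bar", NULL };
    execvp(args[0], args);
    perror("execvp");                /* reached only if exec failed */
    _exit(127);
} else if (pid > 0) {
    int status;
    waitpid(pid, &status, 0);        /* parent: wait for the child to finish */
} else {
    perror("fork");
}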

Parsing stdin for files provided by ls

TL;DR: Is the output of ls standardised so that there is a perfect way to parse it into an array of file names?
I have to write a program that processes some files; the program specification states this:
Your program should read a list of files from the standard input
And an example is given of how the program will be used:
ls /usr/include/std*.h | ./distribuer 3
Where distribuer is the name of my program.
From my tests, I see that ls separates the file names with tabs when called with this sort of wildcard argument. Is this behaviour standard? Or might ls sometimes use plain space characters or even newlines when called with similar wildcard arguments?
Finally, while this might be an edge case, I am also worried that since Unix allows tabs and spaces in filenames, it could actually be impossible to reliably parse the output of ls. Is that correct?
Is the output of ls standardised so that there is a perfect way to parse it into an array of file names?
The output of ls is certainly standardised, by the Posix standard. In the section STDOUT, the standardised formats are described:
The default format shall be to list one entry per line to standard output; the exceptions are to terminals or when one of the -C, -m, or -x options is specified.
As well as a cautionary note about an important context in which the output is not standardised:
If the output is to a terminal, the format is implementation-defined.
(There is quite a lot of specification of how the format changes with different command-line parameters, which I'm not quoting because it is not immediately relevant here.)
So the standardised format, applicable if stdout is not directed to a terminal and if no command-line options are provided (or if the -1 option is provided, even if stdout is a terminal) is to print one entry per line.
Unfortunately, that does not provide a "perfect way" to parse the output, because it is legal for filenames to include newline characters, and a filename which includes a newline character will obviously span more than one line. If all you have is the ls output, there is no 100% reliable way to tell whether a newline (other than the last one) indicates the end of a filename or is a newline character in the filename.
For the purposes of your assignment, the simple strategy would be just to ignore that imperfection (or, better, document it and then ignore it), which is the same strategy that many Unix utilities use. Files whose names include newlines are extremely rare in the wild, and people who create files with newlines in their names probably deserve the problems they will cause themselves. However, you will find a lot of people here (including me, sometimes) suggesting that scripts should work correctly with all legal filenames. So the rest of this answer discusses some of the possible responses to this pedantry. Note that none of them are "perfect".
One imperfect solution is to try to figure out whether a given newline is embedded or not. If you know the list was produced by ls without any sorting options, you might be able to guess correctly in most cases by using the fact that ls presents files sorted by the current locale's collation rules. So if a line is out of sequence (either less than the preceding line or greater than the following one) then it is appropriate to guess that it is a continuation of the filename. That won't always work, and I don't know any utility which tries it, but it might be worth mentioning.
If you were running ls yourself, you could take advantage of the -q option, which causes non-printing characters (including tabs and newlines) to be replaced with ? in the output. That forces the filename to be printed on a single line, but has the disadvantage that you no longer know what the filename was before the substitution, since there are a variety of characters which could be replaced with a question mark (including a question mark itself). You might be able to query the filesystem to find the real name of the file, but there are a lot of corner cases I'm not going to go into since the premise of this paragraph is not applicable to the actual problem.
The most common solution is to allow the user to tell your utility that filenames are separated with a NUL character rather than a newline. This is 100% reliable because filenames cannot contain NUL characters -- in fact, that's the only character they cannot contain. Unfortunately, ls does not provide an option to produce output in this format, but the user could use the find utility to generate the same listing as ls and then use the non-standard but widely-implemented -print0 option to write out the filenames with NUL terminators. (If only Posix standard options to find are available, you can still produce the output by using -exec with an appropriate command to output the name.)
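With the example from the question, that might look like this (both -print0 and -maxdepth are widespread extensions rather than POSIX):
find /usr/include -maxdepth 1 -name 'std*.h' -print0 | ./distribuer 3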
Many utilities which accept lists of filenames on standard input have (non-standard) options to specify a delimiter character, or to specify that the delimiter is NUL instead of newline. See, for example, xargs -0, sort -z (Gnu or BSD) or read -d (bash). So this is probably a reasonable enhancement if you're interested in coding it.
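On the receiving side, POSIX getdelim(3) makes reading a NUL-delimited list straightforward; a minimal sketch:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *name = NULL;
    size_t cap = 0;
    /* each entry is terminated by '\0' instead of '\n' */
    while (getdelim(&name, &cap, '\0', stdin) != -1)
        printf("file: %s\n", name);
    free(name);
    return 0;
}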
It's worth noting that most standard shell utilities do not provide an option to take a list of filenames through standard input. Most utilities prefer to receive filenames as command-line arguments. This works well because when the shell expands "globs" (like *) specified on a command-line, it does not rerun word-splitting on the output; each filename becomes a single argument. That means that
./distribute *
is almost perfect as a way of passing a list of filenames to a utility. But it is still not quite perfect, because there is a limit to the number of command-line arguments you can provide in a single command line. So if the directory has a really large number of files, the expansion of * might exceed that limit, causing the utility execution to fail. find also passes filenames through to -exec as single arguments without word-splitting, and using {} + as the -exec command terminator splits the filenames into sets small enough that they will not exceed the command-line limit. That's safer than ./distribute *, but it does mean that the utility may be called several times, once for each set. (And it's also a bit annoying getting the find predicates to give you exactly what you want.)
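For instance (note that, unlike the glob, find also descends into subdirectories):
find . -exec ./distribute {} +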

Parsing shell commands in c: string cutting with respect to its contents

I'm currently creating a Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls - works.
The thing is, even before I start making tokens, I split the whole command string into separate words. It's based on an array of separators, and it works surprisingly well. Except that I'm struggling with adding additional functionality to it, like escape sequences or quotes. I can't really live without them, since even people using basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator, or double quote found between the pair, pass the contents as one string, and don't include the quotation marks in the resulting word,
" " - same as above, but ignore single quotes instead,
\\ - escape this into a single backslash,
\(space) - escape this into a space, and do not parse the resulting space as a separator,
\", \' - analogously to the above,
many other things that I haven't yet figured out I need,
and every single one of them seems like an exception in its own right. Each of them has to work at all the possible positions in a command, may or may not be included in the result, and influences the rest of the parsing. It makes my code look like a big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (lexing and parsing) using a nontraditional structure for the lexer ("I split whole command string into separate words..."). OK, then you will have non-traditional troubles with getting the lexer "right".
That doesn't mean this way is doomed to failure, and without seeing specific instances of your problem (you list a set of constructs you want to handle, but don't say why they are hard to process), it is hard to provide any specific advice. It also doesn't mean this way will lead to success; splitting the line first may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that it provides a proven approach to complex lexing problems, based generally on the idea that one can use regular expressions to describe the shape of individual lexemes. Thus you get one regexp per lexeme type; that may be an ocean of them, but each one is pretty easy to specify by itself.
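For instance, the lexeme shapes for the constructs listed in the question might be sketched as one regular expression each, in flex-style notation (a simplification, not a complete shell lexer):
WORD       ([^ \t'"\\]|\\.)+
SQ_STRING  '[^']*'
DQ_STRING  \"(\\.|[^"\\])*\"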
I've done ~40 languages using strong lexers and parsers (using one of the tools in that list). I assure you the standard approach is empirically pretty effective. The kinds of surprises it produces are well understood and manageable. A nonstandard approach always carries the risk that it will surprise you in a bad way.
Last remark: people have been adding crazy stuff to Unix shell languages for 40 years. Expect the job to be at least moderately hard, and don't expect the result to be pretty like Wirth's original Pascal.

What are some best practices for file I/O in C?

I'm writing a fairly basic program for personal use but I really want to make sure I use good practices, especially if I decide to make it more robust later on or something.
For all intents and purposes, the program accepts some input files as arguments, opens them using fopen(), reads from the files, does stuff with that information, and then saves the output as a few different files in a subfolder, e.g. if the program is in ~/program then the output files are saved in ~/program/csv/.
I just output directly to the files - for example, output = fopen("./csv/output.csv", "w");, print to it with fprintf(output, "%f,%f", data1, data2); in a loop, and then close with fclose(output); - and I just feel like that is bad practice.
Should I be saving it in a temp directory while it's being written to and then moving it when it's finished? Should I be using more advanced file I/O libraries? Am I just completely overthinking this?
Best practices in my eyes:
Check every call to fopen, printf, puts, fprintf, fclose, etc. for errors (see the sketch after this list)
use getchar if you must, fread if you can
use putchar if you must, fwrite if you can
avoid arbitrary limits on input line length (might require malloc/realloc)
know when you need to open output files in binary mode
use Standard C, forget conio.h :-)
newlines belong at the end of a line, not at the beginning of some text, i.e. it is printf("hello, world\n"); and not "\nHello, world" like those misled by the Mighty William H. often write to cope with the silliness of their command shell. Outputting newlines first breaks line-buffered I/O.
if you need more than 7-bit ASCII, choose Unicode (the most common encoding is UTF-8, which is ASCII-compatible). It's the last encoding you'll ever need to learn. Stay away from codepages and ISO-8859-*.
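A minimal sketch of the first point above, using the OP's own path and checking every stdio call (the two data values are placeholders):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *output = fopen("./csv/output.csv", "w");
    if (output == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    if (fprintf(output, "%f,%f\n", 1.0, 2.0) < 0) {
        perror("fprintf");
        fclose(output);
        return EXIT_FAILURE;
    }
    if (fclose(output) == EOF) {   /* write errors can surface as late as fclose */
        perror("fclose");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}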
Am I just completely overthinking this?
You are. If the task is simple, don't build a complicated solution on purpose just because it feels "more professional". While you're a beginner, focus on code readability; it will make life easier for you and for others.
It's fine. I/O is fully buffered by default with stdio file functions, so you won't be writing to the file with every single call of fprintf. In fact, in many cases, nothing will be written to it until you call fclose.
It's good practice to check the return of fopen, to close your files when finished, etc. Let the OS and the compiler do their job in making the rest efficient, for simple programs like this.
If no other program is checking for the presence of ~/program/csv/output.csv for further processing, then what you're doing is just fine.
Otherwise you can consider writing to a FILE * obtained by a call to tmpfile() (declared in stdio.h) or some similar library call, and when finished copy the file to the final destination. You could also put down a lock file output.csv.lck and remove it when you're done, but that depends on being able to modify the other program's behaviour.
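One common way to get the same effect without copying (a variant, not the only option) is to write to a temporary name in the same directory and then rename(2) it over the final name once complete, since rename is atomic within a filesystem:
#include <stdio.h>

/* write everything under a temporary name first ... */
FILE *out = fopen("./csv/output.csv.tmp", "w");
/* ... fprintf loop and checked fclose as above ... */
/* ... then publish the finished file in one step */
rename("./csv/output.csv.tmp", "./csv/output.csv");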
You can make your own cat, cp, mv programs for practice.

What is the best way to interface with a program that uses filenames on the command line for input and output?

I need to interface with some executables that expect to be passed filenames for input or output:
./helper in.txt out.txt
What is the standard (and preferably cross-platform) way of doing this?
I could create lots of temporary files in the /tmp directory, but I am concerned that creating tons of files might cause some issues. Also, I want to be able to install my program and not have to worry about permissions later.
I could also just be Unix-specific and try to go for a solution using pipes, etc. But then I don't think I would be able to find a solution with nice, unnamed pipes.
My alternative to this would be piping input to stdin (all the executables I need also accept it this way) and getting the results from stdout. However, the outputs they give to stdout are all different, and I would need to write lots of adapters by hand to make this uniform (the outputs through files all obey the same standard). I don't like how this would lock my program into a couple of formats, though.
There isn't necessarily a right or wrong answer. Reading/writing stdin/stdout is probably cleaner and doesn't use disk space. However, using temporary files is fine too, as long as you do it safely. Specifically, see the mktemp and mkstemp manual pages for functions that let you create temporary files for short-term usage. Just clean them up afterward (unlink) and it's fine to use and manipulate temp files.
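For example, a sketch of the mkstemp pattern (the template path and the helper invocation are placeholders):
#include <stdlib.h>
#include <unistd.h>

char tmpl[] = "/tmp/helper-in.XXXXXX";
int fd = mkstemp(tmpl);        /* creates and opens a unique file, mode 0600 */
if (fd != -1) {
    /* write the input to fd, run ./helper with tmpl as its argument,
       then read the output file back ... */
    close(fd);
    unlink(tmpl);              /* clean up afterward */
}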
