C - clarifying delimiters in strtok

I'm trying to break up a shell command that contains both pipes (|) and the OR symbols (||) represented as characters in an array with strtok, except, well the OR command could also be two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok? As far as I know, the delimiters are usually one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?

Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the &, |, and || are tokens of their own. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and, destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- there's nothing between them that strtok can safely overwrite and/or ignore -- but that's a requirement for how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
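A minimal sketch of such a greedy tokenizer, assuming only the operators from the question (|, ||, &, &&, ;) and ignoring quoting entirely. The names tok_kind, token and next_token are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; a real shell has many more. */
enum tok_kind { TOK_END, TOK_WORD, TOK_PIPE, TOK_OR, TOK_AMP, TOK_AND, TOK_SEMI };

struct token {
    enum tok_kind kind;
    const char *start;  /* points into the input; nothing is overwritten */
    size_t len;
};

/* Return the token that begins at *pos and advance *pos past it. */
struct token next_token(const char **pos)
{
    assert(pos != NULL && *pos != NULL);

    const char *p = *pos;
    struct token t;

    /* Unquoted white space only separates tokens, so skip it. */
    while (isspace((unsigned char)*p))
        p++;

    t.start = p;
    t.len = 1;
    switch (*p) {
    case '\0':
        t.kind = TOK_END;
        t.len = 0;
        break;
    case '|':
        /* Greedy: take the two-character operator when it is present. */
        if (p[1] == '|') { t.kind = TOK_OR; t.len = 2; }
        else             { t.kind = TOK_PIPE; }
        break;
    case '&':
        if (p[1] == '&') { t.kind = TOK_AND; t.len = 2; }
        else             { t.kind = TOK_AMP; }
        break;
    case ';':
        t.kind = TOK_SEMI;
        break;
    default:
        /* A word runs until white space or an operator character. */
        t.kind = TOK_WORD;
        t.len = strcspn(p, " \t\n\v\f\r|&;");
        break;
    }
    *pos = p + t.len;
    return t;
}
```

Calling next_token repeatedly on "a|b" yields a word, a pipe and another word, and the input string is left untouched.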

For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semi-colon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
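As a rough illustration of how the two functions combine, here is a sketch that measures a word without writing to the string, so the character that ended the word can still be inspected. The helper name scan_word and the two character sets are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip leading blanks in s, then measure the word that follows.
   Returns a pointer to the word and stores its length in *len.
   The character that ended the word is word[*len]; it is '\0'
   when the word ran to the end of the string. */
const char *scan_word(const char *s, size_t *len)
{
    static const char blanks[] = " \t\n";
    static const char stops[]  = " \t\n|;&";

    assert(s != NULL && len != NULL);
    s += strspn(s, blanks);    /* strspn: length of the run made of blanks */
    *len = strcspn(s, stops);  /* strcspn: length of the run free of stops */
    return s;
}
```

For the input ls|wc, the first call reports a two-character word, and the character just past it is still the | that strtok() would have overwritten.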
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.

strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising a S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in a S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the approved answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in memory data structures.
As well as this, the approved answer on the aforementioned SO question is written to be restricted to specifically my problem - ideally, I'd have myself a general purpose function that would allow me to tokenise a string, except when a substring is between two of character x. This isn't a huge deal, it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn function for that (cppreference has an example).
Iterate over the string and look for the delimiters (space and quotes). strcspn returns the number of characters before the first delimiter, so the character at that offset tells you which delimiter was found (or '\0' if none was). If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you then hit the quotes again, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string would look like
This "is an" Example.
The output would be
This
is an
Example.
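The delimiter-switching idea can be sketched roughly as follows, returning the tokens in a NULL-terminated char ** as the question asks. This is only one possible reading of the approach: the names split_quoted and free_tokens are invented here, the code uses C99 declarations, the quote characters themselves are dropped from the tokens, and an unterminated quote simply runs to the end of the input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Free an array returned by split_quoted(). */
void free_tokens(char **tokens)
{
    if (tokens == NULL)
        return;
    for (size_t i = 0; tokens[i] != NULL; i++)
        free(tokens[i]);
    free(tokens);
}

/* Split input on any character in seps, except between two quote
   characters. Returns a NULL-terminated array, or NULL on failure. */
char **split_quoted(const char *input, const char *seps, char quote)
{
    assert(input != NULL && seps != NULL && quote != '\0');

    size_t n = strlen(input), nseps = strlen(seps);
    size_t count = 0, used = 0;
    int in_quote = 0, have_token = 0, ok = 0;
    const char inside[2] = { quote, '\0' };
    /* Every token needs at least one character plus a separator,
       so n / 2 + 1 tokens and the NULL terminator always fit. */
    char **tokens = calloc(n / 2 + 2, sizeof *tokens);
    char *buf = malloc(n + 1);          /* current token, quotes removed */
    char *outside = malloc(nseps + 2);  /* seps plus the quote character */

    if (tokens != NULL && buf != NULL && outside != NULL) {
        const char *p = input;
        memcpy(outside, seps, nseps);
        outside[nseps] = quote;
        outside[nseps + 1] = '\0';
        ok = 1;
        for (;;) {
            /* Inside quotes only the quote character ends the run. */
            size_t run = strcspn(p, in_quote ? inside : outside);
            memcpy(buf + used, p, run);
            used += run;
            if (run > 0)
                have_token = 1;
            p += run;
            if (*p == quote) {          /* switch delimiter sets */
                in_quote = !in_quote;
                have_token = 1;         /* "" counts as an empty token */
                p++;
                continue;
            }
            if (have_token) {           /* separator or end: emit token */
                buf[used] = '\0';
                tokens[count] = malloc(used + 1);
                if (tokens[count] == NULL) { ok = 0; break; }
                memcpy(tokens[count++], buf, used + 1);
                used = 0;
                have_token = 0;
            }
            if (*p == '\0')
                break;
            p++;                        /* step over the separator */
        }
    }
    free(buf);
    free(outside);
    if (!ok) {
        free_tokens(tokens);
        tokens = NULL;
    }
    return tokens;
}
```

Calling split_quoted("This \"is an\" Example.", " ", '"') gives the three tokens from the second example above; the caller releases them with free_tokens().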

Check if a string has only whitespace characters in C

I am implementing a shell in C11, and I want to check if the input has the correct syntax before doing a system call to execute the command. One of the possible inputs that I want to guard against is a string made up of only white-space characters. What is an efficient way to check if a string contains only white spaces, tabs or any other white-space characters?
The solution must be in C11, preferably using standard libraries. The string is read from the command line using readline() from readline.h and saved in a char array (char[]). So far, the only solution that I've thought of is to loop over the array and check each individual char with isspace(). Is there a more efficient way?
So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace().
That sounds about right!
Is there a more efficient way?
Not really. You need to check each character if you want to be sure only space is present. There could be some trick involving bitmasks to detect non-space characters in a faster way (like strlen() does to find a NUL terminator), but I would definitely not advise it.
You could make use of strspn() or strcspn() checking the returned value, but that would surely be slower since those functions are meant to work on arbitrary accept/reject strings and need to build lookup tables first, while isspace() is optimized for its purpose using a pre-built lookup table, and will most probably also get inlined by the compiler using proper optimization flags. Other than this, vectorization of the code seems like the only way to speed things up further. Compile with -O3 -march=native -ftree-vectorize (see also this post) and run some benchmarks.
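For reference, a sketch of both variants discussed here: the plain isspace() loop and the strspn() alternative. The function names are invented for this example, and the strspn() version assumes the six white-space characters of the "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* True if s is empty or contains only white-space characters. */
bool is_blank(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* The cast matters: passing a negative char value to
           isspace() is undefined behaviour. */
        if (!isspace((unsigned char)*s))
            return false;
    }
    return true;
}

/* Same test with strspn(): the string is blank when the leading run
   of white space reaches the terminator. */
bool is_blank_strspn(const char *s)
{
    assert(s != NULL);
    return s[strspn(s, " \t\n\v\f\r")] == '\0';
}
```

Both return true for an empty string, so a shell that treats "nothing typed" differently from "only spaces typed" would need a separate check for that.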
"loop over the array, and check each individual char with isspace()" --> Yes go with that.
The time to do that is trivial compared to readline().
I'm going to provide an alternative solution to your problem: use strtok. It splits a string into substrings based on a specific set of ignored delimiters. With an empty string, you'd just get no tokens at all.
If you need more complicated matching than that for your shell (e.g. to handle quoted arguments), you're best off writing a small tokenizer/lexer. The strtok method is basically to just look for any of the delimiters you've specified, temporarily replace them with \0, returning the substring up to that point, putting the old character back, and repeating until it reaches the end of the string.
Edit:
As the busybee points out in the comment below, strtok does not put back the character that it replaces with \0. The above paragraph was worded poorly, but my intent was to explain how to implement your own simple tokenizer/lexer if you needed to, not to explain exactly how strtok works down to the smallest detail.
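A small sketch of the strtok idea, assuming space, tab and newline as the delimiters. The helper name count_tokens is made up for this example; a count of zero means the input held nothing but white space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the white-space-separated tokens in buf.
   strtok() writes into buf, so buf must be a modifiable array,
   never a string literal. */
size_t count_tokens(char *buf)
{
    size_t count = 0;

    assert(buf != NULL);
    for (char *tok = strtok(buf, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n"))
        count++;
    return count;
}
```

After the call the buffer still contains the '\0' characters that strtok wrote over the delimiters, so work on a copy if the original line is needed later.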

Convert escape sequences from user input into their real representation

I'm trying to write an interpreter for LOLCODE that reads escaped strings from a file in the form:
VISIBLE "HAI \" WORLD!"
For which I wish to show an output of:
HAI " WORLD!
I have tried to dynamically generate a format string for printf in order to do this, but it seems that the escaping is done at the stage of declaration of a string literal.
In essence, what I am looking for is exactly the opposite of this question:
Convert characters in a c string to their escape sequences
Is there any way to go about this?
It's a pretty standard scanning exercise. Depending on how close you intend to be to the LOLCODE specification (which I can't seem to reach right now, so this is from memory), you've got a few ways to go.
Write a lexer by hand
It's not as hard as it sounds. You just want to analyze your input one character at a time, while maintaining a bit of context information. In your case, the important context consists of two flags:
one to remember you're currently lexing a string. It'll be set when reading " and cleared when reading ".
one to remember the previous character was an escape. It'll be set when reading \ and cleared when reading the character after that, no matter what it is.
Then the general algorithm looks like this (pseudocode):
loop on: c ← read next character
    if not inString
        if c is '"' then clear buf; set inString
        else [out of scope here]
    else if inEscape then append c to buf; clear inEscape
    else if c is '"' then clear inString; return buf as result
    else if c is '\' then set inEscape
    else append c to buf
You might want to refine the inEscape case should you want to implement \r, \n and the like.
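In C, the two-flag scanner might look something like the sketch below. The function name read_string and its buffer-based interface are assumptions made for this example, and the character after a backslash is kept literally, so \n and friends are not translated.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the first double-quoted string in src into dst, dropping the
   quotes and resolving backslash escapes. Returns the number of
   characters stored (not counting the terminator), or -1 if there is
   no complete string or it does not fit in cap bytes. */
int read_string(const char *src, char *dst, size_t cap)
{
    int in_string = 0, in_escape = 0;
    size_t len = 0;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;

    for (; *src != '\0'; src++) {
        char c = *src;

        if (!in_string) {
            if (c == '"')
                in_string = 1;  /* opening quote: start collecting */
            continue;           /* anything else is out of scope here */
        }
        if (in_escape) {
            in_escape = 0;      /* keep c literally, whatever it is */
        } else if (c == '"') {
            dst[len] = '\0';    /* closing quote: the string is done */
            return (int)len;
        } else if (c == '\\') {
            in_escape = 1;      /* remember the escape, store nothing */
            continue;
        }
        if (len + 1 >= cap)
            return -1;          /* leave room for the terminator */
        dst[len++] = c;
    }
    return -1;                  /* input ended inside or before a string */
}
```

Fed the line VISIBLE "HAI \" WORLD!", it stores HAI " WORLD! in dst.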
Use a lexer generator
The traditional tools here are lex and flex.
Get inspiration
You're not the first one to write a LOLCODE interpreter. There's nothing wrong with peeking at how the others did it. For example, here's the string parsing code from lci.

Use scanf with Regular Expressions

I've been trying to use regular expressions on scanf, in order to read a string of maximum n characters and discard anything else until the New Line Character. Any spaces should be treated as regular characters, thus included in the string to be read.
I've studied a Wikipedia article about Regular Expressions, yet I can't get scanf to work properly. Here is some code I've tried:
scanf("[ ]*%ns[ ]*[\n]", string);
[ ] is supposed to go for the actual space character, * is supposed to mean one or more, n is the number of characters to read and string is a pointer allocated with malloc.
I have tried several different combinations; however I tend to get only the first word of a sentence read (stops at space character). Furthermore, * seems to discard a character instead of meaning "zero or more"...
Could anybody explain in detail how regular expressions are interpreted by scanf? What is more, is it efficient to use getc repetitively instead?
Thanks in Advance :D
The short answer: scanf does not handle regular expressions literally speaking.
If you want to use regular expressions in C, you could use the regex POSIX library. See the following question for a basic example on this library usage : Regular expressions in C: examples?
Now if you want to do it the scanf way you could try something like
scanf("%*[ ]%ns%*[ ]\n",str);
Replace the n in %ns by the maximal number of characters to read from input stream.
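The %*[ ] part reads a run of spaces and discards it (the * suppresses assignment). To skip at most a fixed number of characters, put a width after the *, as in %*3[ ]. You could add other characters between the brackets to skip more than just spaces. Note that a scanset must match at least one character, so %*[ ] fails, and scanf stops, when the input does not start with a space.
The above scanf would not read a whole sentence anyway: %s skips leading whitespace and stops at the next whitespace character, so it reads only a single word.
I would definitely go with a fgets call, then trim the surrounding whitespace with something like the following: How do I trim leading/trailing whitespace in a standard way?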
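A possible shape for the fgets route, covering the "read at most n characters and discard the rest of the line" part of the question. The function name read_line_truncated is invented for this sketch, and trimming of surrounding white space is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line from fp into buf, keeping at most cap - 1 characters
   and discarding whatever does not fit, up to and including the
   newline. Returns 0 on success, -1 if nothing could be read. */
int read_line_truncated(FILE *fp, char *buf, size_t cap)
{
    assert(fp != NULL && buf != NULL && cap > 1);

    if (fgets(buf, (int)cap, fp) == NULL)
        return -1;

    size_t len = strcspn(buf, "\n");
    if (buf[len] == '\n') {
        buf[len] = '\0';    /* the whole line fit: drop the newline */
    } else {
        int c;              /* the line was longer: discard the rest */
        while ((c = getc(fp)) != EOF && c != '\n')
            ;
    }
    return 0;
}
```

Spaces are kept as ordinary characters, and the next call starts cleanly at the following line.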
is it efficient to use getc repetitively instead?
Depends somewhat on the application, but YES, repeated getc() is efficient.
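Unless I read the question wrong, %[^\n] will save everything until a newline is encountered. Note that there is no trailing s, and that quote characters inside the brackets would be treated as characters to exclude.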
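A scanset conversion of this kind can also be bounded with a width, which covers the "maximum n characters" part of the question. The sketch below writes the scanset as [^\n] and uses sscanf on a string so that it is easy to try out; the function name first_line and the fixed limit of 15 characters are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Copy up to 15 characters of the first line of text into out.
   Returns 1 if something was stored, 0 if the line was empty. */
int first_line(const char *text, char out[16])
{
    assert(text != NULL && out != NULL);
    /* %15[^\n] stores at most 15 characters that are not a newline
       and then appends the terminator, so out needs 16 bytes. */
    return sscanf(text, "%15[^\n]", out) == 1;
}
```

Spaces are stored like any other character. With scanf on a stream, the unread remainder of the line would still have to be consumed separately.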

How do I parse a token from a string in C?

How do I parse tokens from an input string?
For example:
char *aString = "Hello world";
I want the output to be:
"Hello" "world"
You are going to want to use strtok - here is a good example.
Take a look at strtok, part of the standard library.
strtok is the easy answer, but what you really need is a lexer that does it properly. Consider the following:
are there one or two spaces between "hello" and "world"?
could that in fact be any amount of whitespace?
could that include vertical whitespace (\n, \f, \v) or just horizontal (space, \t, \r)?
could that include any UNICODE whitespace characters?
if there were punctuation between the words, ("hello, world"), would the punctuation be a separate token, part of "hello,", or ignored?
As you can see, writing a proper lexer is not straightforward, and strtok is not a proper lexer.
Other solutions could be a single character state machine that does precisely what you need, or regex-based solution that makes locating words versus gaps more generalized. There are many ways.
And of course, all of this depends on what your actual requirements are, and I don't know them, so start with strtok. But it's good to be aware of the various limitations.
For re-entrant versions you can use either strtok_s (Visual Studio) or strtok_r (Unix).
Keep in mind that strtok is very hard to get right, because:
It modifies the input
It replaces each delimiter with a null terminator
It merges adjacent delimiters, and of course,
It is not thread safe.
You can read about this alternative.
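To make the caveats listed above concrete, here is a small sketch built on plain strtok. The helper name split_in_place is hypothetical. Note that the input must be a modifiable array: calling strtok on a pointer to a string literal, as in the question's declaration, is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split buf in place on the characters in delims, storing up to max
   token pointers in out. Returns the number of tokens stored.
   The pointers refer to buf itself, which strtok() has modified. */
size_t split_in_place(char *buf, const char *delims, char **out, size_t max)
{
    size_t n = 0;

    assert(buf != NULL && delims != NULL && out != NULL);
    for (char *tok = strtok(buf, delims); tok != NULL && n < max;
         tok = strtok(NULL, delims))
        out[n++] = tok;
    return n;
}
```

With char s[] = "Hello world" and " " as the delimiter this yields Hello and world. With "a,,b" and "," it yields only two tokens, because the empty field between the adjacent commas is silently merged away.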

Resources