How do I parse a token from a string in C?

How do I parse tokens from an input string?
For example:
char *aString = "Hello world";
I want the output to be:
"Hello" "world"

You are going to want to use strtok - here is a good example.
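As a minimal sketch of that suggestion (the helper name split_words and the fixed-size output array are assumptions for illustration, not part of the original answer), strtok can be wrapped like this:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits buf in place on spaces, tabs and newlines.
 * Stores up to max token pointers in out and returns how many were stored.
 * buf must be writable, because strtok overwrites delimiters with '\0'. */
size_t split_words(char *buf, char **out, size_t max)
{
    size_t n = 0;
    for (char *tok = strtok(buf, " \t\n");
         tok != NULL && n < max;
         tok = strtok(NULL, " \t\n")) {
        out[n++] = tok;
    }
    return n;
}
```

Note that the string from the question would have to be copied into a writable array first (for example char text[] = "Hello world";), since modifying a string literal is undefined behavior.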

Take a look at strtok, part of the standard library.

strtok is the easy answer, but what you really need is a lexer that does it properly. Consider the following:
are there one or two spaces between "hello" and "world"?
could that in fact be any amount of whitespace?
could that include vertical whitespace (\n, \f, \v) or just horizontal (space, \t, \r)?
could that include any Unicode whitespace characters?
if there were punctuation between the words, ("hello, world"), would the punctuation be a separate token, part of "hello,", or ignored?
As you can see, writing a proper lexer is not straightforward, and strtok is not a proper lexer.
Other solutions could be a single-character state machine that does precisely what you need, or a regex-based solution that makes locating words versus gaps more generalized. There are many ways.
And of course, all of this depends on what your actual requirements are, and I don't know them, so start with strtok. But it's good to be aware of the various limitations.
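One possible shape for the single-character state machine mentioned above is sketched here. It assumes "whitespace" means whatever isspace() accepts in the current locale, treats punctuation as part of the word, and does not handle Unicode; the name next_word is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the next word at or after
 * *cursor and stores its length in *len, or returns NULL when no words
 * remain. The input is never modified; *cursor is advanced past the word. */
const char *next_word(const char **cursor, size_t *len)
{
    const char *p = *cursor;

    /* State 1: skip any run of whitespace. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }

    /* State 2: consume characters until whitespace or end of string. */
    const char *start = p;
    while (*p != '\0' && !isspace((unsigned char)*p))
        p++;

    *len = (size_t)(p - start);
    *cursor = p;
    return start;
}
```

Because the words are returned as pointer-plus-length pairs, this also works on string literals, which strtok cannot tokenize.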

For re-entrant versions, you can use either strtok_s (Visual Studio) or strtok_r (Unix).
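A small sketch of why the re-entrant variant matters, assuming a POSIX system where strtok_r is available: two tokenizations can be nested because each keeps its own state pointer. The function name count_words_per_line is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the words on every line of buf.
 * buf is modified in place. The outer loop splits on newlines and the
 * inner loop splits each line on spaces and tabs; plain strtok could not
 * do this, because the inner call would overwrite the outer call's state. */
size_t count_words_per_line(char *buf)
{
    size_t total = 0;
    char *line_state = NULL;
    char *word_state = NULL;

    for (char *line = strtok_r(buf, "\n", &line_state);
         line != NULL;
         line = strtok_r(NULL, "\n", &line_state)) {
        for (char *word = strtok_r(line, " \t", &word_state);
             word != NULL;
             word = strtok_r(NULL, " \t", &word_state)) {
            total++;
        }
    }
    return total;
}
```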

Keep in mind that strtok is very hard to get right, because it:
modifies the input,
replaces each delimiter with a null terminator,
merges adjacent delimiters, and of course,
is not thread safe.
You can read about this alternative.
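To make the points above concrete, here is a sketch of a splitter built on strcspn that avoids the first three problems: it leaves the input untouched and reports empty fields instead of merging adjacent delimiters. It is not necessarily the alternative linked above; next_field is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the start of the next field and stores its
 * length in *len. Adjacent delimiters produce empty fields (length 0).
 * Returns NULL once every field has been returned. The input is not
 * modified, and all state lives in *cursor, so there is no hidden
 * static buffer. */
const char *next_field(const char **cursor, const char *delims, size_t *len)
{
    const char *start = *cursor;
    if (start == NULL)
        return NULL;

    *len = strcspn(start, delims);
    /* Step over the delimiter, or mark the input as exhausted. */
    *cursor = (start[*len] != '\0') ? start + *len + 1 : NULL;
    return start;
}
```

With the input "a,,b" and the delimiter set ",", this yields three fields ("a", an empty field, and "b"), whereas strtok would yield only "a" and "b".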

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising a S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in a S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the approved answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in memory data structures.
As well as this, the approved answer on the aforementioned SO question is written to be restricted to specifically my problem - ideally, I'd have myself a general purpose function that would allow me to tokenise a string, except when a substring is between two of character x. This isn't a huge deal, it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn (or with example: cppreference - strcspn) function for that.
Iterate over the string and look for the delimiters (space and quotes). strcspn returns the number of characters before the first delimiter, so the character at that offset tells you which delimiter was found (or that the end of the string was reached). If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you hit the closing quote, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string looked like
This "is an" Example.
The output would be
This
is an
Example.
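The approach can be sketched roughly as follows, using only standard C. This is a simplified variant under stated assumptions: a quote only opens a quoted token at the start of a token, there are no escape sequences, the space is the only separator, and the tokens go into a caller-supplied array rather than an allocated char**. The name split_quoted is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits s in place on spaces, except inside double
 * quotes. Stores up to max token pointers in out and returns the count.
 * The quotes themselves are not part of the returned tokens. */
size_t split_quoted(char *s, char **out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        s += strspn(s, " ");            /* skip separators between tokens */
        if (*s == '\0')
            break;

        const char *delims = " ";       /* a plain token ends at a space */
        if (*s == '"') {
            delims = "\"";              /* a quoted token ends at the next quote */
            s++;                        /* drop the opening quote */
        }

        out[n++] = s;
        s += strcspn(s, delims);        /* find the end of this token */
        if (*s == '\0')
            break;                      /* end of input (or unterminated quote) */
        *s++ = '\0';                    /* terminate the token, move on */
    }
    return n;
}
```

Making the quote character and the separator set parameters would turn this into the general-purpose function the question asks for.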

Check if a string has only whitespace characters in C

I am implementing a shell in C11, and I want to check if the input has the correct syntax before doing a system call to execute the command. One of the possible inputs that I want to guard against is a string made up of only white-space characters. What is an efficient way to check if a string contains only white spaces, tabs or any other white-space characters?
The solution must be in C11, and preferably use standard libraries. The string is read from the command line using readline() from readline.h, and it is saved in a char array (char[]). So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace(). Is there a more efficient way?
So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace().
That sounds about right!
Is there a more efficient way?
Not really. You need to check each character if you want to be sure only space is present. There could be some trick involving bitmasks to detect non-space characters in a faster way (like strlen() does to find a NUL terminator), but I would definitely not advise it.
You could make use of strspn() or strcspn() checking the returned value, but that would surely be slower since those functions are meant to work on arbitrary accept/reject strings and need to build lookup tables first, while isspace() is optimized for its purpose using a pre-built lookup table, and will most probably also get inlined by the compiler using proper optimization flags. Other than this, vectorization of the code seems like the only way to speed things up further. Compile with -O3 -march=native -ftree-vectorize (see also this post) and run some benchmarks.
"loop over the array, and check each individual char with isspace()" --> Yes go with that.
The time to do that is trivial compared to readline().
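For reference, the isspace() loop discussed above could look like the sketch below, with the strspn() variant next to it for comparison. Both function names are made up for this example, and both treat an empty string as "only whitespace"; the strspn() version hard-codes the six whitespace characters of the "C" locale.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if s contains nothing but whitespace
 * (as defined by isspace in the current locale), or is empty. */
bool is_all_whitespace(const char *s)
{
    for (; *s != '\0'; s++) {
        /* The cast avoids undefined behavior for negative char values. */
        if (!isspace((unsigned char)*s))
            return false;
    }
    return true;
}

/* Same check using strspn: the whitespace prefix must cover the string. */
bool is_all_whitespace_strspn(const char *s)
{
    return s[strspn(s, " \t\n\v\f\r")] == '\0';
}
```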
I'm going to provide an alternative solution to your problem: use strtok. It splits a string into substrings based on a specific set of ignored delimiters. With an empty string, you'd just get no tokens at all.
If you need more complicated matching than that for your shell (e.g. to handle quoted arguments), you're best off writing a small tokenizer/lexer. The strtok method is basically to just look for any of the delimiters you've specified, temporarily replace them with \0, returning the substring up to that point, putting the old character back, and repeating until it reaches the end of the string.
Edit:
As the busybee points out in the comment below, strtok does not put back the character that it replaces with \0. The above paragraph was worded poorly, but my intent was to explain how to implement your own simple tokenizer/lexer if you needed to, not to explain exactly how strtok works down to the smallest detail.

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
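Wrapped in a function, that one-liner might look like the sketch below (the name strip_newline is an assumption). Note that it cuts the string at the first \r or \n, so it is meant for single-line buffers such as those filled by fgets.

```c
#include <string.h>

/* Hypothetical helper: truncates buffer at the first '\r' or '\n'.
 * If neither is present, strcspn returns the string length, so the
 * assignment just rewrites the existing terminator. */
void strip_newline(char *buffer)
{
    buffer[strcspn(buffer, "\r\n")] = '\0';
}
```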
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-western countries (for instance Russia, Japan), further character encodings are also common. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
#include <string>

template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet< std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    // Drop characters from the end for as long as they are newlines.
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

C - clarifying delimiters in strtok

I'm trying to break up a shell command that contains both pipes (|) and the OR symbols (||) represented as characters in an array with strtok, except, well the OR command could also be two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok? I know that usually the delimiters are one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?
Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the &, |, and || are tokens of their own. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and, destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- there's nothing between them that strtok can safely overwrite and/or ignore -- but that's a requirement for how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
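A rough sketch of such a greedy tokenizer for the operators in the question (|, ||, &&, ;, plus a lone &) is shown below. It is only an illustration under simplifying assumptions: no quoting, no escapes, no redirections, and the name next_shell_token is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the start of the next token at or after
 * *cursor and stores its length in *len, or returns NULL at end of input.
 * Operators are tokens of their own; a doubled '|' or '&' is matched
 * greedily as "||" or "&&". The input is not modified. */
const char *next_shell_token(const char **cursor, size_t *len)
{
    const char *p = *cursor;

    while (isspace((unsigned char)*p))   /* unquoted whitespace is skipped */
        p++;
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }

    const char *start = p;
    if (strchr("|&;", *p) != NULL) {
        /* Operator: look one character ahead to prefer the longer match. */
        if (*p != ';' && p[1] == *p)
            p += 2;
        else
            p += 1;
    } else {
        /* Word: extend until whitespace, an operator, or end of input. */
        while (*p != '\0' && !isspace((unsigned char)*p)
               && strchr("|&;", *p) == NULL)
            p++;
    }

    *len = (size_t)(p - start);
    *cursor = p;
    return start;
}
```

For the input ls|wc || echo, successive calls yield ls, |, wc, || and echo, so no operator is lost.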
For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semi-colon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.
strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.

printing delimiters in c language

I am using strtok(), but I also want to print the corresponding delimiters, as we do using StringTokenizer in Java. Is there any function which provides this functionality (printing delimiters)?
Based on OP's comments, tokenization is not what is actually desired. You want to use strstr(), not strtok(). That will tell you if the string is present, and then you can use strcpy() and strcat() as appropriate.
Please note, the "n" versions of these methods, i.e. strncpy and strncat, are safer -- less likely to crash due to buffer overrun.
i want to print corresponding delimiters also as we do using StringTokenizer in Java
Java's StringTokenizer doesn't return delimiters by default (it only does so when constructed with returnDelims set to true).
In any case, there is no such function in C. You'll have to write one (using strchr, etc.)
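One way to write such a function is sketched here with strcspn rather than strchr; the name next_token_with_delim and its interface are assumptions for illustration. Like strtok it modifies the string in place, but it reports which delimiter ended each token and does not merge adjacent delimiters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the next token starting at *cursor and
 * stores the delimiter that ended it in *delim_out ('\0' if the token
 * ran to the end of the string). Returns NULL when the input is
 * exhausted. The string is modified: each delimiter becomes '\0'. */
char *next_token_with_delim(char **cursor, const char *delims, char *delim_out)
{
    char *start = *cursor;
    if (start == NULL)
        return NULL;

    char *end = start + strcspn(start, delims);
    *delim_out = *end;                  /* remember what we are about to erase */
    if (*end != '\0') {
        *end = '\0';
        *cursor = end + 1;
    } else {
        *cursor = NULL;                 /* nothing left after this token */
    }
    return start;
}
```

A caller can then print each token followed by the character stored in delim_out.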
How about using glib?
It seems
http://library.gnome.org/devel/glib/2.26/glib-String-Utility-Functions.html#g-strsplit
is exactly what you're looking for.

Resources