Check if a string has only whitespace characters in C

I am implementing a shell in C11, and I want to check if the input has the correct syntax before doing a system call to execute the command. One of the possible inputs that I want to guard against is a string made up of only white-space characters. What is an efficient way to check if a string contains only white spaces, tabs or any other white-space characters?
The solution must be in C11, preferably using only the standard library. The string is read from the command line using readline() from readline.h, and it is saved in a char array (char[]). So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace(). Is there a more efficient way?

So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace().
That sounds about right!
Is there a more efficient way?
Not really. You need to check each character if you want to be sure only space is present. There could be some trick involving bitmasks to detect non-space characters in a faster way (like strlen() does to find a NUL terminator), but I would definitely not advise it.
You could make use of strspn() or strcspn() checking the returned value, but that would surely be slower since those functions are meant to work on arbitrary accept/reject strings and need to build lookup tables first, while isspace() is optimized for its purpose using a pre-built lookup table, and will most probably also get inlined by the compiler using proper optimization flags. Other than this, vectorization of the code seems like the only way to speed things up further. Compile with -O3 -march=native -ftree-vectorize (see also this post) and run some benchmarks.
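As a minimal sketch of the two options discussed in this answer (the function names is_blank and is_blank_strspn are made up for illustration), the isspace() loop and the strspn() variant could look like this. The strspn() version hard-codes the six white-space characters of the "C" locale, which is an assumption; isspace() follows the current locale.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if s is empty or consists solely of characters for which
 * isspace() is true in the current locale. (Hypothetical helper name.) */
bool is_blank(const char *s)
{
    while (*s != '\0') {
        /* Cast to unsigned char: passing a negative char value to
         * isspace() is undefined behaviour. */
        if (!isspace((unsigned char)*s))
            return false;
        s++;
    }
    return true;
}

/* Same check using strspn(): skip the leading run of white space and see
 * whether that run reaches the terminating '\0'. The accept set is the
 * "C" locale white space, written out by hand. */
bool is_blank_strspn(const char *s)
{
    return s[strspn(s, " \t\n\v\f\r")] == '\0';
}
```

Both functions treat the empty string as blank, which is probably what a shell wants before deciding whether there is anything to execute.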

"loop over the array, and check each individual char with isspace()" --> Yes go with that.
The time to do that is trivial compared to readline().

I'm going to provide an alternative solution to your problem: use strtok. It splits a string into substrings based on a specific set of ignored delimiters. With an empty string, you'd just get no tokens at all.
If you need more complicated matching than that for your shell (e.g. to handle quoted arguments), you're best off writing a small tokenizer/lexer. The strtok method is basically to look for any of the delimiters you've specified, temporarily replace them with \0, return the substring up to that point, put the old character back, and repeat until it reaches the end of the string.
Edit:
As the busybee points out in the comment below, strtok does not put back the character that it replaces with \0. The above paragraph was worded poorly, but my intent was to explain how to implement your own simple tokenizer/lexer if you needed to, not to explain exactly how strtok works down to the smallest detail.
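A rough sketch of the strtok idea, assuming the line sits in a modifiable buffer (count_tokens is a hypothetical helper, not part of any library): a line containing only white space yields zero tokens.

```c
#include <string.h>

/* Counts the white-space-separated tokens in line.
 * NOTE: strtok() overwrites delimiters in line with '\0' and does not
 * restore them, so line must be a modifiable buffer, never a string
 * literal, and its original contents are lost. */
size_t count_tokens(char *line)
{
    static const char delims[] = " \t\n\v\f\r";
    size_t n = 0;

    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims))
        n++;
    return n;
}
```

If the shell still needs the original line afterwards, it would have to tokenize a copy instead.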

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising a S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in a S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the approved answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in memory data structures.
As well as this, the approved answer on the aforementioned SO question is written to be restricted to specifically my problem - ideally, I'd have myself a general-purpose function that would allow me to tokenise a string, except when a substring is between two of character x. This isn't a huge deal, it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn (or with example: cppreference - strcspn) function for that.
Iterate over the string and look for the delimiters (space and quotes). strcspn returns the number of characters before the first delimiter, so the character at that offset tells you which delimiter was found (or the terminating '\0' if there was none). If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you then hit the closing quote, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string would look like
This "is an" Example.
The output would be
This
is an
Example.
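The delimiter-switching idea described in this answer might be sketched as follows. The names split_quoted and free_tokens are invented for this example, only the space character separates tokens, an unterminated quote simply runs to the end of the string, and escaped quotes are not handled; all of these are simplifying assumptions. The tokens come back as a NULL-terminated char** as the question asked.

```c
#include <stdlib.h>
#include <string.h>

/* Frees a NULL-terminated token array produced by split_quoted(). */
void free_tokens(char **tokens)
{
    if (tokens == NULL)
        return;
    for (size_t i = 0; tokens[i] != NULL; i++)
        free(tokens[i]);
    free(tokens);
}

/* Splits s on spaces, except inside double quotes. Returns a
 * NULL-terminated array of heap-allocated tokens (NULL on allocation
 * failure) and stores the token count in *count if count is not NULL. */
char **split_quoted(const char *s, size_t *count)
{
    size_t cap = 8, n = 0;
    char **tokens = malloc(cap * sizeof *tokens);
    if (tokens == NULL)
        return NULL;

    for (;;) {
        s += strspn(s, " ");                 /* skip separating spaces */
        if (*s == '\0')
            break;

        const char *start;
        size_t len;
        if (*s == '"') {                     /* delimiter set: quote only */
            start = s + 1;
            len = strcspn(start, "\"");
            s = start + len;
            if (*s == '"')
                s++;                         /* step over the closing quote */
        } else {                             /* delimiter set: space or quote */
            start = s;
            len = strcspn(s, " \"");
            s += len;
        }

        if (n + 1 == cap) {                  /* keep a slot for the final NULL */
            char **tmp = realloc(tokens, 2 * cap * sizeof *tokens);
            if (tmp == NULL) {
                tokens[n] = NULL;
                free_tokens(tokens);
                return NULL;
            }
            tokens = tmp;
            cap *= 2;
        }
        char *tok = malloc(len + 1);
        if (tok == NULL) {
            tokens[n] = NULL;
            free_tokens(tokens);
            return NULL;
        }
        memcpy(tok, start, len);
        tok[len] = '\0';
        tokens[n++] = tok;
    }
    tokens[n] = NULL;
    if (count != NULL)
        *count = n;
    return tokens;
}
```

Because the input is never modified, the same source string can be tokenized again or reported in error messages. Making the quote character a parameter would be a small change to the two strcspn() delimiter strings.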

C - clarifying delimiters in strtok

I'm trying to break up a shell command that contains both pipes (|) and the OR symbols (||) represented as characters in an array with strtok, except, well the OR command could also be two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok? As far as I know, the delimiters are usually one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?
Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the &, |, and || are tokens of their own. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- so there's nothing between them that strtok can safely overwrite and/or ignore, yet overwriting is exactly how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
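A greedy tokenizer along these lines could be sketched as below. The TokenKind names, the Token struct and next_token are all hypothetical, and the sketch deliberately ignores quoting, escapes and redirections; it only shows the "longest match first" rule for | versus || and & versus &&.

```c
#include <ctype.h>
#include <string.h>

typedef enum {
    TOK_END, TOK_WORD, TOK_PIPE, TOK_OR, TOK_AND, TOK_AMP, TOK_SEMI
} TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the original string, not a copy */
    size_t len;
} Token;

/* Returns the next token starting at *cursor and advances *cursor past it.
 * The input string is never modified. */
Token next_token(const char **cursor)
{
    const char *p = *cursor;

    while (isspace((unsigned char)*p))       /* unquoted white space is skipped */
        p++;

    Token t = { TOK_END, p, 0 };
    if (*p == '\0') {
        *cursor = p;
        return t;
    }

    /* Greedy: try the two-character operators before the one-character ones. */
    if (p[0] == '|' && p[1] == '|')      { t.kind = TOK_OR;   t.len = 2; }
    else if (p[0] == '&' && p[1] == '&') { t.kind = TOK_AND;  t.len = 2; }
    else if (p[0] == '|')                { t.kind = TOK_PIPE; t.len = 1; }
    else if (p[0] == '&')                { t.kind = TOK_AMP;  t.len = 1; }
    else if (p[0] == ';')                { t.kind = TOK_SEMI; t.len = 1; }
    else {
        /* A word runs until white space or an operator character. */
        t.kind = TOK_WORD;
        t.len = strcspn(p, " \t\n\v\f\r|&;");
    }

    *cursor = p + t.len;
    return t;
}
```

A caller would initialise a const char * cursor to the start of the command line and call next_token(&cursor) until it returns TOK_END. Because the operators are returned as tokens rather than overwritten, ls|wc yields a word, a pipe and another word.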
For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semi-colon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.
strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.

How can I parse text input and convert strings to integers?

I have an input file containing the following data.
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from the file and work on it, but I don't know how to retrieve the data. I want to store 1 (of 1Apple) as an integer and then Apple as a string.
I thought of reading the whole 1Apple as a string and then doing something with the stoi function.
Or I could read the whole thing character by character, and if the ASCII value of a character lies between 48 and 57, combine it into an integer and save the rest as a string? Which one should I do? Also, how do I check the ASCII value of a char? (Should I convert the char to int and then compare, or is there a built-in function?)
Consider the fscanf() function, but only if your input format is never going to change. Otherwise you should probably use fgets() and perform your own checks to separate the number from the string, as you suggested.
There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.
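As a sketch of the fgets-plus-strtol route, assuming each line has the form "<record number> <count><name> <count><name> ..." as in the question's sample: parse_item and total_quantity are illustrative names, and the 32-byte name and 256-byte line limits are arbitrary assumptions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses one "<number><name>" item such as "10Kiwi" starting at *p
 * (leading white space is skipped by strtol). On success stores the
 * results, advances *p past the item and returns true. */
bool parse_item(const char **p, long *qty, char *name, size_t name_size)
{
    char *end;

    errno = 0;
    long value = strtol(*p, &end, 10);
    if (end == *p || errno == ERANGE)
        return false;                        /* no digits, or out of range */

    size_t len = strcspn(end, " \t\r\n");    /* name runs to next white space */
    if (len == 0 || len >= name_size)
        return false;                        /* missing or over-long name */

    memcpy(name, end, len);
    name[len] = '\0';
    *qty = value;
    *p = end + len;
    return true;
}

/* Reads the stream line by line with fgets() and adds up the quantities
 * of the item called wanted. Malformed lines are skipped. */
long total_quantity(FILE *fp, const char *wanted)
{
    char line[256];
    long total = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *end;
        (void)strtol(line, &end, 10);        /* leading record number */
        if (end == line)
            continue;

        const char *p = end;
        long qty;
        char name[32];
        while (parse_item(&p, &qty, name, sizeof name))
            if (strcmp(name, wanted) == 0)
                total += qty;
    }
    return total;
}
```

The end pointer that strtol() hands back is what makes this approach convenient: it says exactly where the number stopped, so the name starts there, and a conversion that consumed no characters is easy to detect.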

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formatted number followed by a ; wasn't entirely clear in your question -- what to do if there are a million digits before you see the ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non-$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
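For comparison, a hand-written state machine for the same three patterns might look like the sketch below. It is not the circular-buffer design recommended earlier in this answer: instead it keeps the few bytes of a possible escape in a small state struct, which is a different way of coping with an escape that straddles two reads. The Decoder type and function names are invented, the 1 to 3 digit limit is taken from the regular expression in the answer, and rejecting codes above 255 is an extra assumption.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_DIGITS  3                  /* assumption: codes have 1 to 3 digits */
#define MAX_PENDING (1 + MAX_DIGITS)   /* '$' plus the digits */

typedef struct {
    char pending[MAX_PENDING + 1];     /* candidate escape seen so far */
    size_t len;                        /* 0 means "not inside an escape" */
} Decoder;

/* Copies the buffered candidate to out unchanged; returns bytes written. */
static size_t flush_pending(Decoder *d, char *out)
{
    size_t n = d->len;
    for (size_t i = 0; i < n; i++)
        out[i] = d->pending[i];
    d->len = 0;
    return n;
}

/* Decodes one chunk. out must have room for n + MAX_PENDING bytes.
 * State left in d carries an unfinished escape over to the next call. */
size_t decoder_feed(Decoder *d, const char *in, size_t n, char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        char c = in[i];

        if (d->len > 0 && d->len < MAX_PENDING && c >= '0' && c <= '9') {
            d->pending[d->len++] = c;            /* one more digit */
            continue;
        }
        if (d->len > 1 && c == ';') {            /* "$" digits ";" complete */
            d->pending[d->len] = '\0';
            int code = atoi(d->pending + 1);
            if (code <= 255) {
                out[o++] = (char)(unsigned char)code;
                d->len = 0;
                continue;
            }
        }
        /* Not a valid escape: pass the buffered text through untouched. */
        o += flush_pending(d, out + o);
        if (c == '$')
            d->pending[d->len++] = c;            /* start a new candidate */
        else
            out[o++] = c;
    }
    return o;
}

/* Call once at end of input; out must have room for MAX_PENDING bytes. */
size_t decoder_finish(Decoder *d, char *out)
{
    return flush_pending(d, out);
}
```

A zero-initialised Decoder is the start state. Each 1000-byte read would be passed to decoder_feed(), and decoder_finish() would be called after the last read so that a trailing "$12" is not lost.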
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

How do I parse a token from a string in C?

How do I parse tokens from an input string?
For example:
char *aString = "Hello world";
I want the output to be:
"Hello" "world"
You are going to want to use strtok - here is a good example.
Take a look at strtok, part of the standard library.
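A minimal strtok sketch for the "Hello world" case follows; split_words is a made-up helper name. One caveat that matters here: strtok writes into the string it scans, so it must not be given a pointer to a string literal such as the aString in the question. The text has to live in a modifiable char array first.

```c
#include <string.h>

/* Splits line in place on spaces, tabs and newlines. Stores up to max
 * pointers to the tokens in out and returns how many were stored.
 * The pointers refer to memory inside line, which strtok() modifies. */
size_t split_words(char *line, char *out[], size_t max)
{
    static const char delims[] = " \t\n";
    size_t n = 0;

    for (char *tok = strtok(line, delims); tok != NULL && n < max;
         tok = strtok(NULL, delims))
        out[n++] = tok;
    return n;
}
```

Called with a buffer declared as char text[] = "Hello world"; this stores pointers to "Hello" and "world" and returns 2.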
strtok is the easy answer, but what you really need is a lexer that does it properly. Consider the following:
are there one or two spaces between "hello" and "world"?
could that in fact be any amount of whitespace?
could that include vertical whitespace (\n, \f, \v) or just horizontal (space, \t, \r)?
could that include any Unicode whitespace characters?
if there were punctuation between the words, ("hello, world"), would the punctuation be a separate token, part of "hello,", or ignored?
As you can see, writing a proper lexer is not straightforward, and strtok is not a proper lexer.
Other solutions could be a single character state machine that does precisely what you need, or regex-based solution that makes locating words versus gaps more generalized. There are many ways.
And of course, all of this depends on what your actual requirements are, and I don't know them, so start with strtok. But it's good to be aware of the various limitations.
For re-entrant versions you can use either strtok_s (Visual Studio) or strtok_r (POSIX/Unix).
Keep in mind that strtok is very hard to use correctly, because:
It modifies the input
It replaces the delimiter with a null terminator
It merges adjacent delimiters, and of course,
It is not thread safe.
You can read about this alternative.
