I am using strtok(), but I also want to print the corresponding delimiters, as we do using StringTokenizer in Java. Is there any function that provides this functionality (printing delimiters)?
Based on OP's comments, tokenization is not what is actually desired. You want to use strstr(), not strtok(). That will tell you if the string is present, and then you can use strcpy() and strcat() as appropriate.
Please note that the "n" versions of these functions, i.e. strncpy and strncat, are safer: they are less likely to overrun a buffer, although strncpy does not null-terminate the destination when the source fills it completely.
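I want to print corresponding delimiters also as we do using StringTokenizer in Java
By default, Java's StringTokenizer doesn't return delimiters; it returns them as tokens only when it is constructed with the returnDelims flag set to true.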
In any case, there is no such function in C. You'll have to write one (using strchr, etc.)
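A minimal sketch of such a hand-written function, under stated assumptions: the struct piece type and the next_piece() function are hypothetical names made up for this illustration, and the sketch uses strspn()/strcspn() alongside strchr(). It walks the string without modifying it and reports tokens and runs of delimiters alternately, so the caller can print both.

```c
#include <stddef.h>
#include <string.h>

/* One piece of the input: either a token or a run of delimiters.
   This struct and next_piece() are illustrative names, not a standard API. */
struct piece {
    const char *start;  /* points into the original string */
    size_t len;         /* number of characters in the piece */
    int is_delim;       /* 1 if the piece consists of delimiters */
};

/* Stores the piece that begins at *cursor and advances *cursor past it.
   Returns 1 if a piece was stored, 0 at the end of the string. */
int next_piece(const char **cursor, const char *delims, struct piece *out)
{
    const char *s = *cursor;
    if (*s == '\0')
        return 0;
    out->start = s;
    out->is_delim = (strchr(delims, *s) != NULL);
    out->len = out->is_delim ? strspn(s, delims) : strcspn(s, delims);
    *cursor = s + out->len;
    return 1;
}
```

Each piece can be printed with printf("%.*s", (int)p.len, p.start), since the pieces are not null-terminated.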
How about using glib?
It seems g_strsplit() is exactly what you're looking for:
http://library.gnome.org/devel/glib/2.26/glib-String-Utility-Functions.html#g-strsplit
I am implementing a shell in C11, and I want to check if the input has the correct syntax before doing a system call to execute the command. One of the possible inputs that I want to guard against is a string made up of only white-space characters. What is an efficient way to check if a string contains only white spaces, tabs or any other white-space characters?
The solution must be in C11, preferably using standard libraries. The string is read from the command line using readline() from readline.h, and it is saved in a char array (char[]). So far, the only solution that I've thought of is to loop over the array and check each individual char with isspace(). Is there a more efficient way?
So far, the only solution that I've thought of is to loop over the array, and check each individual char with isspace().
That sounds about right!
Is there a more efficient way?
Not really. You need to check each character if you want to be sure only space is present. There could be some trick involving bitmasks to detect non-space characters in a faster way (like strlen() does to find a NUL terminator), but I would definitely not advise it.
You could make use of strspn() or strcspn() checking the returned value, but that would surely be slower since those functions are meant to work on arbitrary accept/reject strings and need to build lookup tables first, while isspace() is optimized for its purpose using a pre-built lookup table, and will most probably also get inlined by the compiler using proper optimization flags. Other than this, vectorization of the code seems like the only way to speed things up further. Compile with -O3 -march=native -ftree-vectorize (see also this post) and run some benchmarks.
"loop over the array, and check each individual char with isspace()" --> Yes go with that.
The time to do that is trivial compared to readline().
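For reference, the loop the answers above recommend can be sketched as follows. The function name is_blank() is a hypothetical choice for this example; the cast to unsigned char is there because passing a negative char value to isspace() is undefined behavior.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if s is empty or consists only of white-space characters. */
bool is_blank(const char *s)
{
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: a negative char passed to isspace() is undefined. */
        if (!isspace((unsigned char)*s))
            return false;
    }
    return true;
}
```

The loop stops at the first non-space character, so typical commands are rejected as "not blank" after only a character or two.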
I'm going to provide an alternative solution to your problem: use strtok. It splits a string into substrings based on a specific set of ignored delimiters. With an empty string, you'd just get no tokens at all.
If you need more complicated matching than that for your shell (e.g. to handle quoted arguments), you're best off writing a small tokenizer/lexer. The strtok method is basically to just look for any of the delimiters you've specified, temporarily replace them with \0, returning the substring up to that point, putting the old character back, and repeating until it reaches the end of the string.
Edit:
As the busybee points out in the comment below, strtok does not put back the character that it replaces with \0. The above paragraph was worded poorly, but my intent was to explain how to implement your own simple tokenizer/lexer if you needed to, not to explain exactly how strtok works down to the smallest detail.
How does one check a string that has been read in for a substring in C?
If I have the following
char name[21];
fgets(name, 21, stdin);
How do I check the string for a series of substrings?
How does one check for a substring before a character? For example, how would one check for a substring before an = sign?
Be wary of strtok(); it is not re-entrant. Amongst other things, it means that if you need to call it in one function, and then call another function, and if that other function also uses strtok(), your first function is messed up. It also writes NUL ('\0') bytes over the separators, so it modifies the input string as it goes. If you are looking for more than one terminator character, you can't tell which one was found. Further, if you write a library function for others to use, yet your function uses strtok(), you must document the fact so that callers of your function are not bemused by the failures of their own code that uses strtok() after calling your function. In other words, it is poisonous; if your function calls strtok(), it makes your function unreusable, in general; similarly, your code that uses strtok() cannot call other people's functions that also use it.
If you still like the idea of the functionality - some people do (but I almost invariably avoid it) - then look for strtok_r() on your system. It is re-entrant; it takes an extra parameter which means that other functions can use strtok_r() (or strtok()) without affecting your function.
There are a variety of alternatives that might be appropriate. The obvious ones to consider are strchr(), strrchr(), strpbrk(), strspn(), strcspn(): none of these modify the strings they analyze. All are part of Standard C (as is strtok()), so they are essentially available everywhere. Looking for the material before a single character suggests that you should use strchr().
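As a rough sketch of the strchr() approach for the "substring before an = sign" case: the helper names prefix_len() and key_matches() are made up for this example. Neither function modifies the input string.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the text before the first occurrence of sep in s,
   or (size_t)-1 if sep does not occur. The input is not modified. */
size_t prefix_len(const char *s, char sep)
{
    const char *p = strchr(s, sep);
    return p ? (size_t)(p - s) : (size_t)-1;
}

/* Returns 1 if the text before the first sep in s is exactly key. */
int key_matches(const char *s, char sep, const char *key)
{
    size_t n = prefix_len(s, sep);
    return n != (size_t)-1 && n == strlen(key) && memcmp(s, key, n) == 0;
}
```

With the name buffer from the question, key_matches(name, '=', "user") would test whether the input starts with "user=".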
Use strtok() to split the string into tokens.
char *pch;
/* strtok() overwrites the '=' in name with '\0' and returns the text before it */
pch = strtok(name, "=");
if (pch != NULL)
{
    printf("Substring: %s\n", pch);
}
You can keep calling strtok() to find more strings after the =.
You can use strtok but it's not reentrant and it destroys the original string. Other (perhaps safer) functions to look into would be strchr, strstr, strspn, and perhaps the mem* variations. In general, I avoid strn* variants because, while they do "bounds checking," they still rely on the nul terminator. They can fail on a valid string that just happens to be longer than you expected to deal with, and they won't actually prevent a buffer overrun unless you know the buffer size. Better (IMHO) to ignore the terminator and know exactly how much data you're working with every time, the way the mem* functions work.
What is the actual difference between memchr() and strchr(), besides the extra parameter? When do you use one or the other? And would replacing strchr() with memchr() give better performance when parsing a big file (theoretically speaking)?
strchr stops when it hits a null character but memchr does not; this is why the former does not need a length parameter but the latter does.
Functionally there is no difference in that they both scan an array for a provided value. The memchr version just takes an extra parameter because it needs to know the length of the buffer being searched. The strchr version can avoid this because the null terminator marks the end of the string.
Differences can pop up if you use strchr on a char* that stores binary data, as it potentially won't see the full length of the data. This is true of pretty much any char* with binary data and a str* function. For non-binary data though they are virtually the same function.
You can actually code up strchr in terms of memchr fairly easily (named my_strchr here so that it does not clash with the standard function):
#include <string.h>

char *my_strchr(const char *pStr, int value) {
    return memchr(pStr, value, strlen(pStr) + 1);
}
The +1 is necessary here because strchr can be used to find the null terminator in the string. This is definitely not an optimal implementation because it walks the memory twice. But it does serve to demonstrate how close the two are in functionality.
strchr expects that the first parameter is null-terminated, and hence doesn't require a length parameter.
memchr works similarly but doesn't expect the memory block to be null-terminated, so the block may contain \0 bytes and you can search for a \0 byte like any other value.
No real difference, just that strchr() assumes it is looking through a null-terminated string (so that determines the size).
memchr() simply looks for the given value up to the size passed in.
In practical terms, there's not much difference. Also, implementations are free to make one function faster than the other.
The real difference comes from context. If you're dealing with strings, then use strchr(). If you have a finite-size, non-terminated buffer, then use memchr(). If you want to search a finite-size subset of a string, then use memchr().
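To illustrate the finite-size case, here is a small sketch that uses memchr() on a buffer that need not be terminated. The function count_byte() is a hypothetical helper written for this example.

```c
#include <stddef.h>
#include <string.h>

/* Counts occurrences of byte c in the first len bytes of buf.
   The buffer need not be null-terminated and may contain '\0' bytes. */
size_t count_byte(const void *buf, size_t len, int c)
{
    const unsigned char *p = buf;
    const unsigned char *end = p + len;
    size_t count = 0;
    while (p < end && (p = memchr(p, c, (size_t)(end - p))) != NULL) {
        count++;
        p++;  /* resume just past the match */
    }
    return count;
}
```

Passing a length shorter than the string, as in count_byte("banana", 3, 'a'), restricts the search to a subset of the string, which strchr() cannot do.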
While learning C I regularly come across resources which recommend that some functions (e.g. gets()) are never to be used, because they are either difficult or impossible to use safely.
If the C standard library contains a number of these "never-use" functions, it would seem necessary to learn a list of them, what makes them unsafe, and what to do instead.
So far, I've learned that functions which:
Cannot be prevented from overwriting memory
Are not guaranteed to null-terminate a string
Maintain internal state between calls
are commonly regarded as being unsafe to use. Is there a list of functions which exhibit these behaviours? Are there other types of functions which are impossible to use safely?
In the old days, most of the string functions had no bounds checking. Of course they couldn't just delete the old functions, or modify their signatures to include an upper bound, that would break compatibility. Now, for almost every one of those functions, there is an alternative "n" version. For example:
strcpy -> strncpy
strlen -> strnlen
strcmp -> strncmp
strcat -> strncat
strdup -> strndup
sprintf -> snprintf
wcscpy -> wcsncpy
wcslen -> wcsnlen
And more.
See also https://github.com/leafsr/gcc-poison which is a project to create a header file that causes gcc to report an error if you use an unsafe function.
Yes, fgets(..., ..., stdin) is a good alternative to gets(), because it takes a size parameter (gets() has in fact been removed from the C standard entirely in C11). Note that fgets() is not exactly a drop-in replacement for gets(), because the former will include the terminating \n character if there was room in the buffer for a complete line to be read.
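One common way to deal with the retained \n is sketched below. The names strip_newline() and read_line() are hypothetical helpers for this example, not standard functions.

```c
#include <stdio.h>
#include <string.h>

/* Removes the first '\n' in s, if any, by truncating the string there. */
void strip_newline(char *s)
{
    s[strcspn(s, "\n")] = '\0';
}

/* Reads one line from fp into buf (capacity size) and drops the trailing
   '\n' if fgets() stored one. Returns buf, or NULL on end-of-file or error. */
char *read_line(char *buf, int size, FILE *fp)
{
    if (fgets(buf, size, fp) == NULL)
        return NULL;
    strip_newline(buf);
    return buf;
}
```

If the line was longer than the buffer, no \n is stored, strip_newline() leaves the text unchanged, and the rest of the line is returned by the next call.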
scanf() is considered problematic in some cases, rather than straight-out "bad", because if the input doesn't conform to the expected format it can be impossible to recover sensibly (it doesn't let you rewind the input and try again). If you can just give up on badly formatted input, it's useable. A "better" alternative here is to use an input function like fgets() or fgetc() to read chunks of input, then scan it with sscanf() or parse it with string handling functions like strchr() and strtol(). Also see below for a specific problem with the "%s" conversion specifier in scanf().
It's not a standard C function, but the BSD and POSIX function mktemp() is generally impossible to use safely, because there is always a TOCTTOU race condition between testing for the existence of the file and subsequently creating it. mkstemp() or tmpfile() are good replacements.
strncpy() is a slightly tricky function, because it doesn't null-terminate the destination if there was no room for it. Despite the apparently generic name, this function was designed for creating a specific style of string that differs from ordinary C strings - strings stored in a known fixed width field where the null terminator is not required if the string fills the field exactly (original UNIX directory entries were of this style). If you don't have such a situation, you probably should avoid this function.
atoi() can be a bad choice in some situations, because you can't tell when there was an error doing the conversion (e.g., if the number exceeded the range of an int). Use strtol() if this matters to you.
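A sketch of the strtol() alternative with the error checks that atoi() cannot provide. The wrapper name parse_int() is hypothetical, and treating trailing characters as an error is an assumption of this example rather than something strtol() requires.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Converts s to an int. Returns 1 on success and stores the value in *out;
   returns 0 if s has no digits, has trailing characters, or is out of range. */
int parse_int(const char *s, int *out)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return 0;               /* nothing parsed, or trailing junk */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;               /* does not fit in an int */
    *out = (int)v;
    return 1;
}
```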
strcpy(), strcat() and sprintf() suffer from a similar problem to gets() - they don't allow you to specify the size of the destination buffer. It's still possible, at least in theory, to use them safely - but you are much better off using strncat() and snprintf() instead (you could use strncpy(), but see above). Do note that whereas the n for snprintf() is the size of the destination buffer, the n for strncat() is the maximum number of characters to append and does not include the null terminator. Another alternative, if you have already calculated the relevant string and buffer sizes, is memmove() or memcpy().
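The different meanings of n can be seen in a short sketch. Both wrappers, append_checked() and format_pair(), are hypothetical names for this example, and both assume that dst already holds a valid string (for appending) and that dst_size is the true capacity of the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Appends src to the string in dst, whose total capacity is dst_size bytes.
   Returns 1 if all of src fit, 0 if the result was truncated. */
int append_checked(char *dst, size_t dst_size, const char *src)
{
    size_t used = strlen(dst);
    if (used >= dst_size)
        return 0;                        /* dst is not a valid string of this size */
    size_t room = dst_size - used - 1;   /* characters that still fit, excluding '\0' */
    strncat(dst, src, room);             /* n counts appended characters only */
    return strlen(src) <= room;
}

/* Formats "key=value" into dst. Returns 1 on success, 0 on error or truncation. */
int format_pair(char *dst, size_t dst_size, const char *key, const char *value)
{
    int n = snprintf(dst, dst_size, "%s=%s", key, value);  /* n here is the buffer size */
    return n >= 0 && (size_t)n < dst_size;
}
```

In both cases the result stays null-terminated, and the return value tells the caller whether anything was cut off.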
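On the same theme, if you use the scanf() family of functions, don't use a plain "%s" - specify the maximum field width, e.g. "%200s". The width does not count the null terminator, so the destination must hold at least 201 bytes in that case.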
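A small sketch of a width-limited "%s", using sscanf() on a line that has already been read. The names MAX_WORD_LEN and first_word() are hypothetical; note that the width in the format string is a literal and has to be kept in sync with the buffer size by hand.

```c
#include <stdio.h>

#define MAX_WORD_LEN 15  /* maximum characters per word, excluding '\0' */

/* Reads the first white-space-delimited word of line into word, which must
   have room for MAX_WORD_LEN + 1 bytes. Returns 1 if a word was read. */
int first_word(const char *line, char word[MAX_WORD_LEN + 1])
{
    /* The field width is a literal; keep "%15s" in sync with MAX_WORD_LEN. */
    return sscanf(line, "%15s", word) == 1;
}
```

A word longer than the width is cut off after 15 characters instead of overrunning the buffer.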
strtok() is generally considered to be evil because it stores state information between calls. Don't try running THAT in a multithreaded environment!
Strictly speaking, there is one really dangerous function. It is gets() because its input is not under the control of the programmer. All other functions mentioned here are safe in and of themselves. "Good" and "bad" boils down to defensive programming, namely preconditions, postconditions and boilerplate code.
Let's take strcpy() for example. It has some preconditions that the programmer must fulfill before calling the function. Both pointers must be valid and non-NULL, the source must be a zero-terminated string, and the destination must provide enough space with a final string length inside the range of size_t. Additionally, the strings are not allowed to overlap.
That is quite a lot of preconditions, and none of them is checked by strcpy(). The programmer must be sure they are fulfilled, or he must explicitly test them with additional boilerplate code before calling strcpy():
n = DST_BUFFER_SIZE;
/* strcpy() overwrites dst, so only src (plus its terminator) has to fit */
if ((dst != NULL) && (src != NULL) && (strlen(src) + 1 <= n))
{
    strcpy(dst, src);
}
This already silently assumes non-overlapping, zero-terminated strings.
strncpy() does include some of these checks, but it adds another postcondition the programmer must take care of after calling the function, because the result may not be zero-terminated.
strncpy(dst, src, n);
if (n > 0)
{
    dst[n-1] = '\0';
}
Why are these functions considered "bad"? Because every call needs additional boilerplate code to be really on the safe side in case the programmer's assumptions about validity are wrong, and programmers tend to forget this code.
Or even argue against it. Take the printf() family. These functions return a status that indicates error or success. Who checks if the output to stdout or stderr succeeded? With the argument that you can't do anything at all when the standard channels are not working. Well, what about rescuing the user data and terminating the program with an error-indicating exit code? Instead of the possible alternative of crash and burn later with corrupted user data.
In a time- and money-limited environment it is always a question of how many safety nets you really want and what the resulting worst-case scenario is. If it is a buffer overflow, as in the case of the str-functions, then it makes sense to forbid them and probably provide wrapper functions with the safety nets already within.
One final question about this: What makes you sure that your "good" alternatives are really good?
Any function that does not take a maximum length parameter and instead relies on an end-of-string marker being present (such as many 'string' handling functions).
Any method that maintains state between calls.
sprintf is bad because it does not check the size of the destination; use snprintf instead.
gmtime, localtime -- use gmtime_r, localtime_r
To add something about strncpy most people here forgot to mention. strncpy can result in performance problems as it clears the buffer to the length given.
char buff[1000];
strncpy(buff, "1", sizeof buff);
will copy 1 char and overwrite 999 bytes with 0
Another reason why I prefer strlcpy (I know strlcpy is a BSDism but it is so easy to implement that there's no excuse to not use it).
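A minimal sketch of such an implementation, following the usual BSD semantics: the result is always null-terminated when size is greater than 0, nothing is zero-filled, and the return value is strlen(src). It is named my_strlcpy here (a hypothetical name) so that it does not clash with a strlcpy the system may already provide.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst (capacity size), always null-terminating when size > 0.
   Returns strlen(src); a return value >= size means the copy was truncated. */
size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);
    if (size > 0) {
        size_t n = (len < size) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

Unlike strncpy(buff, "1", sizeof buff), a call such as my_strlcpy(buff, "1", sizeof buff) writes only two bytes.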
View page 7 (PDF page 9) SAFECode Dev Practices
Edit: From the page -
strcpy family
strncpy family
strcat family
scanf family
sprintf family
gets family
strcpy - again!
Most people agree that strcpy is dangerous, but strncpy is only rarely a useful replacement. It is usually important that you know when you've needed to truncate a string in any case, and for this reason you usually need to examine the length of the source string anyway. If this is the case, usually memcpy is the better replacement as you know exactly how many characters you want copied.
e.g. truncation is error:
n = strlen( src );
if( n >= buflen )
    return ERROR;
memcpy( dst, src, n + 1 );
truncation allowed, but number of characters must be returned so caller knows:
n = strlen( src );
if( n >= buflen )
    n = buflen - 1;
memcpy( dst, src, n );
dst[n] = '\0';
return n;
How do I parse tokens from an input string?
For example:
char *aString = "Hello world";
I want the output to be:
"Hello" "world"
You are going to want to use strtok - here is a good example.
Take a look at strtok, part of the standard library.
strtok is the easy answer, but what you really need is a lexer that does it properly. Consider the following:
are there one or two spaces between "hello" and "world"?
could that in fact be any amount of whitespace?
could that include vertical whitespace (\n, \f, \v) or just horizontal (space, \t, \r)?
could that include any Unicode whitespace characters?
if there were punctuation between the words, ("hello, world"), would the punctuation be a separate token, part of "hello,", or ignored?
As you can see, writing a proper lexer is not straightforward, and strtok is not a proper lexer.
Other solutions could be a single character state machine that does precisely what you need, or regex-based solution that makes locating words versus gaps more generalized. There are many ways.
And of course, all of this depends on what your actual requirements are, and I don't know them, so start with strtok. But it's good to be aware of the various limitations.
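As one possible shape for the single-character state machine mentioned above, here is a sketch that treats any isspace() character as a gap. The struct span type and split_words() function are hypothetical names for this example; it records where each word starts and how long it is, without modifying the input.

```c
#include <ctype.h>
#include <stddef.h>

struct span { const char *start; size_t len; };

/* Scans s one character at a time with a two-state machine (in a gap or in a
   word) and records up to max words in out without modifying s.
   Returns the number of words recorded. */
size_t split_words(const char *s, struct span *out, size_t max)
{
    enum { IN_GAP, IN_WORD } state = IN_GAP;
    size_t count = 0;
    for (; *s != '\0'; s++) {
        int space = isspace((unsigned char)*s);
        if (state == IN_GAP && !space) {
            if (count == max)
                break;               /* no room for another word */
            out[count].start = s;    /* a new word begins here */
            out[count].len = 1;
            count++;
            state = IN_WORD;
        } else if (state == IN_WORD && !space) {
            out[count - 1].len++;    /* extend the current word */
        } else {
            state = IN_GAP;          /* white space ends any current word */
        }
    }
    return count;
}
```

Handling punctuation or quoting would mean adding more states, which is where this starts turning into a real lexer.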
For re-entrant versions you can use either strtok_s for Visual Studio or strtok_r for Unix.
Keep in mind that strtok is very hard to get right, because:
It modifies the input
The delimiter is replaced by a null terminator
Merges adjacent delimiters, and of course,
Is not thread safe.
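The first three points can be observed with a short sketch. The helper count_tokens() is a hypothetical name for this example; it must be given a writable buffer, not a string literal.

```c
#include <stddef.h>
#include <string.h>

/* Counts the tokens strtok() finds in buf. buf is modified: each delimiter
   that ends a token is overwritten with '\0'. */
size_t count_tokens(char *buf, const char *delims)
{
    size_t n = 0;
    for (char *tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Called on a buffer holding "a,,b" with "," as the delimiter, this reports two tokens rather than three, because the empty field between the two commas is skipped, and afterwards the buffer reads as just "a".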
You can read about this alternative.