Reading String token by token in C - c

I'm trying to build an LL(1) Recursive Descent Parser in C using a specific grammar given to me. I have an idea how to do this recursively in general... my issue, however, is stopping me from really being able to start my implementation. I'm not too familiar with C, so I'm sure this is why I'm having an issue. Basically, I need to be able to read a String such as "(1+2)*3" token by token. So for instance, in the case of the String above, I need to first read the "(", then further down the recursive process I'd call something like nextToken() which would give me the "1".
That being said, ultimately I would probably only need to read the very first token of the String each time that I call nextToken(), because after I grab the value I'd alter the initial string to be the same as it previously was, minus the most recently read token. So for example, I start with "(1+2)*3", then I call nextToken() on the String which means that I get the "(" and then the initial String is now "1+2)*3".
My issue is I don't know how to do this in C.

That's what a "lexer" does, typically before a parser. I guess the best you can do is try LEX (probably flex, as in Flex & Bison). (It's true that what a lexer does can also be done solely in the parser, but it's probably much messier.)
A less preferable way would be to categorize all the possibilities and write regular expressions to match some valid prefix (which is what LEX does under the hood).

In C, a "string" is just a region of memory containing characters, which is terminated by the first NUL (0) character. That being the case, all you need for a string is a pointer to the first character. (That means that the length of the string needs to be computed, so try to avoid doing that more often than is necessary.)
There are standard library functions which can do things like compare strings and copy strings, but it is important to remember that memory management of strings is your responsibility.
While this may seem primitive, error-prone, and complicated to those used to languages in which strings are actual datatypes, it is how it is. If you're planning on doing string manipulation in C, you need to get used to it.
Nonetheless, string manipulation in C can be both efficient and trouble-free, as long as you follow the rules. For example, if you want to refer to the substring of s starting at the 3rd character, you can just use pointer arithmetic: s + 2. If you want to (temporarily) create a substring at a given point in a string, you can drop a 0 into the string at the end of the substring, and then later restore the character that was there. (In fact, that's what the standard library function strtok does, and it's how a lexical scanner built with (f)lex works.) Note that this strategy requires that the character array be mutable, so you won't be able to apply it to string literals. (String arrays are fine, though, since they are mutable.)
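To make the "drop a 0 in, then restore it" idea concrete, here is a minimal sketch. The function name with_temporary_nul and its parameters are made up for illustration; it assumes the buffer is a writable array, that start + len does not exceed the length of the string, and that out has room for len + 1 bytes.

```c
#include <string.h>

/* Copies the substring s[start .. start+len) into out by temporarily
 * NUL-terminating it in place. The buffer s must be writable (an array,
 * not a string literal), start + len must not exceed strlen(s), and
 * out must have room for len + 1 bytes. */
void with_temporary_nul(char *s, size_t start, size_t len, char *out) {
    char *sub = s + start;       /* pointer arithmetic: no copy needed */
    char saved = sub[len];       /* remember the character we overwrite */
    sub[len] = '\0';             /* sub is now a proper C string */
    strcpy(out, sub);            /* use it like any other string */
    sub[len] = saved;            /* restore the original text */
}
```

Called with a buffer declared as char buf[] = "(1+2)*3", a start of 1 and a length of 3, this copies "1+2" into out and leaves buf unchanged afterwards.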
It's quite possible that your best bet for building a lexical scanner would be to use flex. The scanner which flex builds will do a lot of things for you, including input buffering, and flex lets you specify regular expressions instead of hand coding them.
But if you want to do it by hand, it is not that hard, particularly if the entire input is in memory so that buffering is not necessary. (If no token spans a line, you could also read the input a line at a time, but that's not as efficient as reading fixed-length blocks, which is what the flex scanner will do.)
Here, for example, is a simple scanner which handles arithmetic operators, integers, and identifiers. It does not use the "overwrite with NUL" strategy, so it can be used with string literals. For identifiers, it creates a newly-allocated string, so the caller needs to free the identifier when it is no longer needed. (No garbage collection. C'est la vie.) The token is "returned" through a reference argument; the actual return value of the function is a pointer to the remainder of the source string. Quite a lot of error checking has been omitted.
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
/* The type of a single-character operator is the character itself, so
 * other token types need to start at 256. We use 0 to indicate
 * the end-of-input token type.
 */
enum TokenType { NUMBER = 256, ID };
typedef struct Token {
    enum TokenType token_type;
    union { /* Anonymous unions are a C11 feature. */
        long number; /* Only valid if type is NUMBER */
        char* id; /* Only valid if type is ID */
    };
} Token;
/* You would normally call this like this:
 *   do {
 *     s = next_token(s, &token);
 *     // Do something with token
 *   } while (token.token_type);
 */
const char* next_token(const char* input, Token* out) {
    /* Skip whitespace. The <ctype.h> functions require a value in the
     * range of unsigned char, so cast before calling them.
     */
    while (isspace((unsigned char)*input)) ++input;
    if (isdigit((unsigned char)*input)) {
        char* lim;
        out->number = strtol(input, &lim, 10);
        out->token_type = NUMBER;
        return lim;
    } else if (isalpha((unsigned char)*input)) {
        const char* lim = input + 1;
        /* Find the end of the id */
        while (isalnum((unsigned char)*lim)) ++lim;
        /* Allocate enough memory to copy the id. We need one extra byte
         * for the NUL
         */
        size_t len = lim - input;
        out->id = malloc(len + 1);
        memcpy(out->id, input, len);
        out->id[len] = 0; /* NUL-terminate the string */
        out->token_type = ID;
        return lim;
    } else {
        out->token_type = (unsigned char)*input;
        /* If we hit the end of the input string, we don't advance the
         * input pointer, to avoid reading random memory.
         */
        return *input ? input + 1 : input;
    }
}
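For the recursive descent side of the question, the same "return a pointer to the remainder" idea can be folded into a small parser state. The following is only a sketch: the grammar is a hypothetical one for arithmetic (the asker's actual grammar is not shown), and the names Parser, peek and evaluate are invented here. It does its own minimal scanning instead of calling next_token, to stay short and self-contained.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical grammar (the asker's actual grammar is not shown):
 *   expr   -> term   { ('+' | '-') term }
 *   term   -> factor { ('*' | '/') factor }
 *   factor -> NUMBER | '(' expr ')'
 */
typedef struct Parser {
    const char *pos;   /* unread remainder of the input */
    int ok;            /* cleared on the first syntax error */
} Parser;

static long parse_expr(Parser *p);

/* Skip whitespace and return the next character without consuming it. */
static int peek(Parser *p) {
    while (isspace((unsigned char)*p->pos)) ++p->pos;
    return (unsigned char)*p->pos;
}

static long parse_factor(Parser *p) {
    int c = peek(p);
    if (isdigit(c)) {
        char *end;
        long value = strtol(p->pos, &end, 10);
        p->pos = end;                 /* "remove" the token just read */
        return value;
    }
    if (c == '(') {
        ++p->pos;                     /* consume '(' */
        long value = parse_expr(p);
        if (peek(p) == ')') ++p->pos; /* consume ')' */
        else p->ok = 0;
        return value;
    }
    p->ok = 0;                        /* unexpected token */
    return 0;
}

static long parse_term(Parser *p) {
    long value = parse_factor(p);
    for (;;) {
        int c = peek(p);
        if (c != '*' && c != '/') return value;
        ++p->pos;                     /* consume the operator */
        long rhs = parse_factor(p);
        if (c == '*') value *= rhs;
        else if (rhs != 0) value /= rhs;
        else p->ok = 0;               /* division by zero */
    }
}

static long parse_expr(Parser *p) {
    long value = parse_term(p);
    for (;;) {
        int c = peek(p);
        if (c != '+' && c != '-') return value;
        ++p->pos;                     /* consume the operator */
        long rhs = parse_term(p);
        value = (c == '+') ? value + rhs : value - rhs;
    }
}

/* Returns 1 and stores the value if the whole string is a valid expression. */
int evaluate(const char *text, long *result) {
    Parser p = { text, 1 };
    *result = parse_expr(&p);
    return p.ok && peek(&p) == '\0';
}
```

The input string is never modified; advancing p->pos plays the role of "the initial string minus the most recently read token". Overflow checking is omitted, as in the scanner above.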

Related

Is there a stack overflow danger with strtok here?

Below is some code which I ran through a static analyzer. It came back saying there is a stack overflow vulnerability in the function that uses strtok below, as described here:
https://cwe.mitre.org/data/definitions/121.html
If you trace the execution, the variables used by strtok ultimately derive their data from the user_input variable in somefunction coming in from the wild. But I figured I prevented problems by first checking the length of user_input as well as by explicitly using strncpy with a bound any time I copied pieces of user_input.
somefunction(user_input) {
if (strlen(user_input) != 23) {
if (user_input != NULL)
free(user_input);
exit(1);
}
Mystruct* mystruct = malloc(sizeof(Mystruct));
mystruct->foo = malloc(3 * sizeof(char));
memset(mystruct->foo, '\0', 3);
strncpy(mystruct->foo,&(user_input[0]),2);
mystruct->bar = malloc(19 * sizeof(char));
memset(mystruct->bar, '\0', 19);
/* Remove spaces from user's input. strtok is not guaranteed to
* not modify the source string so we copy it first.
*/
char *input = malloc(22 * sizeof(char));
strncpy(input,&(user_input[2]),21);
remove_spaces(input,mystruct->bar);
}
void remove_spaces(char *input, char *output) {
const char space[2] = " ";
char *token;
token = strtok(input, space);
while( token != NULL ) {
// the error is indicated on this line
strncat(output, token, strlen(token));
token = strtok(NULL, space);
}
}
I presumed that I didn't have to malloc token per this comment, and elsewhere. Is there something else I'm missing?
strncpy does not increase the safety of your code; indeed, it may well make the code less safe by introducing the possibility of an unterminated output string. But the issue being flagged by the static analyser involves neither strncpy nor strtok; it's with strncat.
Although they are frequently touted as increasing code safety, that was never the purpose of strncpy, strncat nor strncmp. The strn* alternatives to str* functions are intended for use in a context in which string data is not null-terminated. Such a context exists, although it is rare in student code: fixed-length string fields in fixed-size database records. If a field in a database record always contains 20 characters (CHAR(20) in SQL terms), there's no need to force a trailing 0-byte, which could have been used to allow 21-character names (or whatever the field is). It's a waste of space, and the only reason that those unnecessary bytes might be examined by the database code is to check database integrity. (Not that the extra byte really helps maintain integrity, either. But it must be checked for correctness.)
If you were writing code which used or created fixed-length unterminated string fields, you would certainly need a set of string functions which accept a length argument. But the string library already had those functions: memcpy and memcmp. The strn versions were added to ease the interface when both null-terminated and fixed-length strings are being used in the same application; for example, if a null-terminated string is read from user input and needs to be copied into a fixed-length database field. In that context, the interface of strncpy makes sense: the database field must be completely cleared of old data, but the input string might be too short to guarantee that. So you can't use strcpy even if you check that it won't overflow (because it doesn't necessarily erase old data) and you can't use memcpy (because the bytes following the end of the input string are indeterminate). Hence an interface like strncpy, which guarantees that the destination will be filled but doesn't guarantee that it will be null-terminated.
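As a small illustration of that intended use, here is a sketch of a fixed-width field being filled from a null-terminated string. FIELD_WIDTH, store_field and field_to_string are hypothetical names chosen for this example, not part of any library.

```c
#include <string.h>

#define FIELD_WIDTH 8   /* hypothetical fixed-width record field */

/* Store a NUL-terminated name into a fixed-width field that is not
 * necessarily terminated. strncpy copies at most FIELD_WIDTH bytes and
 * pads the remainder with zero bytes, erasing whatever was there before. */
void store_field(char field[FIELD_WIDTH], const char *name) {
    strncpy(field, name, FIELD_WIDTH);
}

/* Turn a field back into an ordinary C string. The output buffer needs
 * one byte more than the field, because the field may be completely full. */
void field_to_string(const char field[FIELD_WIDTH], char out[FIELD_WIDTH + 1]) {
    memcpy(out, field, FIELD_WIDTH);
    out[FIELD_WIDTH] = '\0';
}
```

Storing "abc" leaves three characters followed by five zero bytes; storing "abcdefghij" leaves exactly "abcdefgh" with no terminator, which is correct for a database field and dangerous for anything that expects a C string.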
strncmp and strnlen do have some applications which don't necessarily have to do with fixed-length string records, but they are not safety-related either. strncmp is handy if you want to know whether a given string is a prefix of another string (although a startswith function would have more directly addressed this use case) and strnlen lets you answer the question "Are there at least four characters in this string?" without having to worry about how many cycles would be wasted if the string continued for another four million characters. But that doesn't justify using them in other, more normal, contexts.
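Both of those uses fit in a line each. In the sketch below, starts_with and has_at_least are hypothetical helper names; the second one uses memchr as a stand-in for strnlen, on the assumption that you want to stay within ISO C (strnlen originates in POSIX).

```c
#include <string.h>

/* Returns 1 if s begins with prefix. strncmp stops at the first
 * difference or after strlen(prefix) characters, whichever comes first. */
int starts_with(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Returns 1 if s contains at least n characters before its NUL.
 * Only the first n bytes are examined, however long s is. */
int has_at_least(const char *s, size_t n) {
    return memchr(s, '\0', n) == NULL;
}
```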
OK, that was a bit of a detour. Let's get back to strncat, whose prototype is
char *strncat(char *dest, const char *src, size_t n);
where n is the maximum number of characters to copy. As the man page notes, you (and not the standard library) are responsible for ensuring that the destination has n+1 bytes available for the copy. The library function cannot take responsibility, because it cannot know how much space is available, and it hasn't asked you to specify that.
In my opinion, that makes strncat completely useless. In order to know how much space is available in the destination, you need to know where the concatenation's copy will start. But if you knew that, why on earth would you ask the standard library to scan over the destination looking for the concatenation point? In any case, you are not verifying how much space is available; you simply call:
strncat(output, token, strlen(token));
That does exactly the same thing as strcat(output, token) except that it scans token twice (once to count the bytes and a second time to copy them) and during the copy it does a redundant check to ensure that the count has not been exceeded while copying.
A "safe" version of strncat would require you to specify the length of the destination, but since there is no such function in the standard C library and also no consensus as to what the prototype for such a function would be, you need to guarantee safety yourself by tracking the amount of space used in output by each concatenation. As an extra benefit, if you do that, you can then make the computational complexity of a sequence of concatenations linear in the number of bytes copied, which one might intuitively expect, as opposed to quadratic, as implemented by strcat and strncat.
So a safe and efficient procedure might look like this:
void remove_spaces(char *output, size_t outmax,
                   char *input) {
    if (outmax == 0) return;
    char *token = strtok(input, " ");
    char *outlimit = output + outmax;
    while( token ) {
        size_t tokelen = strlen(token);
        /* Always leave room for the terminating NUL */
        if (tokelen >= (size_t)(outlimit - output))
            tokelen = outlimit - output - 1;
        memcpy(output, token, tokelen);
        output += tokelen;
        token = strtok(NULL, " ");
    }
    *output = 0;
}
The CWE warning does not mention strtok at all, so the question in the title itself is a red herring. strtok is one of the few parts of your code which is not problematic, although (as you note) it does force an otherwise unnecessary copy of the input string, in case that string is in read-only memory. (As noted above, strncpy does not guarantee that the copy is null-terminated, so it is not safe here. strdup, which needs to be paired with free, is the safest way to copy a string. Fortunately, it will finally be part of the C standard instead of just being available almost everywhere.)
That might be a good enough reason to avoid strtok. If so, it's easy to get rid of:
void remove_spaces(char *output, size_t outmax,
                   /* This version doesn't modify input */
                   const char *input) {
    if (outmax == 0) return;
    char *outlimit = output + outmax;
    /* Skip any run of spaces; stop at the end of the input */
    while ( *(input += strspn(input, " ")) ) {
        /* Length of the next run of non-space characters */
        size_t tokelen = strcspn(input, " ");
        size_t room = (size_t)(outlimit - output);
        size_t cpylen = (tokelen < room)
                      ? tokelen
                      : room - 1;
        memcpy(output, input, cpylen);
        output += cpylen;
        input += tokelen;
    }
    *output = 0;
}
A better interface would manage to indicate whether the output was truncated, and perhaps give an indication of how many bytes were necessary to accommodate the operation. See snprintf for an example.
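One possible shape for such an interface, modelled loosely on the snprintf convention, is sketched below. The name remove_spaces_n and the exact return convention are assumptions made for this example.

```c
#include <string.h>

/* Copies input to output with the spaces removed, writing at most
 * outmax bytes including the terminating NUL. Returns the number of
 * characters the complete result needs (excluding the NUL), so the
 * output was truncated exactly when the return value is >= outmax. */
size_t remove_spaces_n(char *output, size_t outmax, const char *input) {
    size_t needed = 0;   /* length of the complete result so far */
    while (*(input += strspn(input, " "))) {
        size_t tokelen = strcspn(input, " ");
        if (needed + 1 < outmax) {      /* room for at least one more char */
            size_t room = outmax - 1 - needed;
            memcpy(output + needed, input, tokelen < room ? tokelen : room);
        }
        needed += tokelen;
        input += tokelen;
    }
    if (outmax > 0)
        output[needed < outmax ? needed : outmax - 1] = '\0';
    return needed;
}
```

As with snprintf, calling it with outmax equal to 0 writes nothing and just reports the size required, so the caller can allocate needed + 1 bytes and call it again.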

How can I use sscanf to analyze string data?

How do I split a string into two strings (array name, index number) only if the string is matching the following string structure: "ArrayName[index]".
The array name can be 31 characters at most and the index 3 at most.
I found the following example, which is supposed to work with "Matrix[index1][index2]". I really couldn't understand how it works well enough to pick out the part I need to get my strings.
sscanf(inputString, "%32[^[]%*[[]%3[^]]%*[^[]%*[[]%3[^]]", matrixName, index1,index2) == 3
This try over here wasn't a success, what am I missing?
sscanf(inputString, "%32[^[]%*[[]%3[^]]", arrayName, index) == 2
How do I split a string into two strings (array name, index number) only if the string is matching the following string structure: "ArrayName[index]".
With sscanf, you don't. Not if you mean that you can rely on nothing being modified in the event that the input does not match the pattern. This is because sscanf, like the rest of the scanf family, processes its input and format linearly, without backtracking, and by design it fills input fields as they are successfully matched. Thus, if you scan with a format that assigns multiple fields or has trailing literal characters then it is possible for results to be stored for some fields despite a matching failure occurring.
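A short sketch of that behaviour, using a hypothetical wrapper called scan_indexed around a format with a trailing literal ]:

```c
#include <stdio.h>

/* Returns the sscanf result for the "Name[idx]" format and leaves
 * whatever fields were matched in name and idx. */
int scan_indexed(const char *str, char name[32], char idx[4]) {
    return sscanf(str, "%31[^[][%3[^]]]", name, idx);
}
```

For the input "abc[12" this returns 2 with name set to "abc" and idx set to "12", even though the closing bracket is missing. For the input "abc" it returns 1: name has already been overwritten with "abc" by the time the match fails, while idx keeps its old contents.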
But if that's ok with you then @gsamaras's answer provides a nearly-correct approach to parsing and validating a string according to your specified format, using sscanf. That answer also presents a nice explanation of the meaning of the format string. The problem with it is that it provides no way to distinguish between the input fully matching the format and the input failing to match at the final ], or including additional characters after.
Here is a variation on that code that accounts for those tail-end issues, too:
char array_name[32] = {0}, idx[4] = {0}, c = 0;
int n;
if (sscanf(str, "%31[^[][%3[^]]%c%n", array_name, idx, &c, &n) >= 3
        && c == ']' && str[n] == '\0')
    printf("arrayName = %s\nindex = %s\n", array_name, idx);
else
    printf("Not in the expected format \"ArrayName[idx]\"\n");
The difference in the format is the replacement of the literal terminating ] with a %c directive, which matches any one character, and the addition of a %n directive, which causes the number of characters of input read so far to be stored, without itself consuming any input.
With that, if the return value is at least 3 then we know that the whole format was matched (a %n never produces a matching failure, but docs are unclear and behavior is inconsistent on whether it contributes to the returned field count). In that event, we examine variable c to determine whether there was a closing ] where we expected to find one, and we use the character count recorded in n to verify that all characters of the string were parsed (so that str[n] refers to a string terminator).
You may at this point be wondering at how complicated and cryptic that all is. And you would be right to do so. Parsing structured input is a complicated and tricky proposition, for one thing, but also the scanf family functions are pretty difficult to use. You would be better off with a regex matcher for cases like yours, or maybe with a machine-generated lexical analyzer (see lex), possibly augmented by machine-generated parser (see yacc). Even a hand-written parser that works through the input string with string functions and character comparisons might be an improvement. It's still complicated any way around, but those tools can at least make it less cryptic.
Note: the above assumes that the index can be any string of up to three characters. If you meant that it must be numeric, perhaps specifically a decimal number, perhaps specifically non-negative, then the format can be adjusted to serve that purpose.
A naive example to get you started:
#include <stdio.h>
#include <string.h>
int main(void)
{
char str[] = "myArray[123]";
char array_name[32] = {0}, idx[4] = {0};
if(sscanf(str, "%31[^[][%3[^]]]", array_name, idx) == 2)
printf("arrayName = %s\nindex = %s\n", array_name, idx);
else
printf("Not in the expected format \"ArrayName[idx]\"\n");
return 0;
}
Output:
arrayName = myArray
index = 123
which will find easy not-in-the-expected format cases, such as "ArrayNameidx]" and "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP[idx]", but not "ArrayName[idx".
The essence of sscanf() is to tell it where to stop, otherwise %s would read until the next whitespace.
This negated scanset %[^[] means read until you find an opening bracket.
This negated scanset %[^]] means read until you find a closing bracket.
Note: I used 31 and 3 as the width specifiers respectively, since we want to reserve the last slot for the NUL terminator: the name of the array is assumed to be 31 characters at the most, and the index 3 at the most. The size of the array for each token is the max allowed length, plus one.
How can I use sscanf to analyze string data?
Use "%n" to detect a completed scan.
array name can be 31 characters at most and the index 3 at most.
For illustration, let us assume the index is limited to a numeric value [0 - 999].
Use string literal concatenation to present the format more clearly.
char array_name[32]; // array name can be 31 characters
#define NAME_FMT "%31[^[]"
char idx[4]; // index can be 3 digits
#define IDX_FMT "%3[0-9]"
int n = 0; // be sure to initialize
sscanf(str, NAME_FMT "[" IDX_FMT "]" "%n", array_name, idx, &n);
// Did scan complete (is `n` non-zero) with no extra text?
if (n && str[n] == '\0') {
    printf("arrayName = %s\nindex = %d\n", array_name, atoi(idx));
} else {
    printf("Not in the expected format \"ArrayName[idx]\"\n");
}

Dynamic Structures And Storing Data without stdlib.h

I have tried using Google, but not really sure how to phrase my search to get relevant results. The programming language is C. I was given a (homework) assignment which requires reading a text file and outputting the unique words in the text file. The restriction is that the only allowable import is <stdio.h>. So, is there a way to use dynamic structures without using <stdlib.h>? Would it be necessary to define those dynamic structures on my own? If this has already been addressed on Stack Overflow, then please point me to the question.
Clarification was provided today that the allowable imports now include <stdlib.h> as well as (though not necessary or desirable) the use of <string.h>, which in turn makes this problem easier (and I am tempted to say trivial).
It is telling that you couldn't find anything with Google. Assignments with completely arbitrary restrictions are idiotic. The assignment tells something profound about the quality of the course and the instructor. There is more to be learnt from an assignment that requires the use of realloc and other standard library functions.
You don't need a data structure, only a large enough 2-dimensional char array. Either you must know at compile time how long the words can be and how many of them there can be at most, or you need to read the file once, allocate a two-dimensional variable-length array on the stack (and possibly blow the stack), reset the file pointer and read the file again into that array...
Then you read the words into it using fgets, loop over the words using 2 nested for loops and comparing the first and second strings together (of course you'd skip if both outer and inner loop are at the same index) - if you don't find a match in the inner loop, you'll print the word.
Doing the assignment this way doesn't teach anything useful about programming, but the only standard library routine you need to replicate yourself is strcmp, and at least you'll save your energy for something useful instead.
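A rough sketch of the fixed-array approach follows, using nothing but <stdio.h>. The limits MAX_WORDS and MAX_LEN and the names same_word and collect_unique are assumptions made for this example. It also differs slightly from the description above: instead of comparing every pair afterwards, it checks each word against the ones already stored, so every distinct word is kept exactly once.

```c
#include <stdio.h>

#define MAX_WORDS 1024   /* assumed upper bound on distinct words */
#define MAX_LEN   32     /* assumed maximum word length, including NUL */

static char words[MAX_WORDS][MAX_LEN];

/* Hand-written replacement for strcmp(a, b) == 0. */
static int same_word(const char *a, const char *b) {
    while (*a && *a == *b) { ++a; ++b; }
    return *a == *b;
}

/* Reads whitespace-separated words from in and stores each distinct
 * word once. Returns the number of distinct words stored. */
size_t collect_unique(FILE *in) {
    size_t count = 0;
    char word[MAX_LEN];
    while (count < MAX_WORDS && fscanf(in, "%31s", word) == 1) {
        size_t i = 0;
        while (i < count && !same_word(words[i], word)) ++i;
        if (i == count) {                 /* not seen before: keep it */
            size_t k = 0;
            while ((words[count][k] = word[k]) != '\0') ++k;
            ++count;
        }
    }
    return count;
}
```

Words longer than 31 characters are split by the %31s width limit, and punctuation is treated as part of a word; both would need handling in a real solution.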
It is not possible to code dynamic data structures in C using only stdio.h. That may be one of the reasons your teacher restricted you to using just stdio.h: they didn't want you going down the rabbit hole of trying to make a linked list or something in which to store unique words.
However, if you think about it, you don't need a dynamic data structure. Here's something to try: (1) make a copy of your source file. (2) declare a results text file to store your results. (3) Copy the first word in your source file to the results file. Then run through your source file and delete every copy of that word. Now there can't be any duplicates of that word. Then move on to the next word and copy and delete.
When you're done, your source file should be empty (thus the reason for the backup) and your results file should have one copy of every unique word from the original source file.
The benefit of this approach is that it doesn't require you to know (or guess) the size of the initial source file.
Agreed on the points above on "exercises with arbitrary constraints" mostly being used to illustrate a lecturer's favorite pet peeve.
However, if you are allowed to be naive you could do what others have said and assume a maximum size for your array of unique strings and use a simple buffer. I wrote a little stub illustrating what I was thinking. However, it is shared with the disclaimer that I am not a "real programmer", with all the bad habits and knowledge-gaps that follows...
I have obviously also ignored the topics of reading the file and filtering unique words.
#include <stdio.h> // scanf, printf, etc.
#include <string.h> // strcpy, strlen (only for convenience here)
#define NUM_STRINGS 1024 // maximum number of strings
#define MAX_STRING_SIZE 32 // maximum length of a string (in fixed buffer)
char fixed_buff[NUM_STRINGS][MAX_STRING_SIZE];
char * buff[NUM_STRINGS]; // <-- Will only work for string literals OR
// if the strings that populates the buffer
// are stored in a separate location and the
// buffer refers to the permanent location.
/**
* Fixed length of buffer (NUM_STRINGS) and max item length (MAX_STRING_SIZE)
*/
void example_1(char strings[][MAX_STRING_SIZE] )
{
// Note: terminates when first item in the current string is '\0'
// (or when the buffer is exhausted); this may be a bad idea(?)
for(size_t i = 0; i < NUM_STRINGS && *strings[i] != '\0'; i++)
    printf("strings[%zu] : %s (length %zu)\n", i, strings[i], strlen(strings[i]));
}
/**
* Fixed length of buffer (NUM_STRINGS), but arbitrary item length
*/
void example_2(char * strings[])
{
// Note: Terminating on reaching a NULL pointer (or the end of the buffer)
// as the number of strings is "unknown".
for(size_t i = 0; i < NUM_STRINGS && strings[i] != NULL; i++)
    printf("strings[%zu] : %s (length %zu)\n", i, strings[i], strlen(strings[i]));
}
int main(int argc, char* argv[])
{
// Populate buffers
strncpy(fixed_buff[0], "foo", MAX_STRING_SIZE - 1);
strncpy(fixed_buff[1], "bar", MAX_STRING_SIZE - 1);
buff[0] = "mon";
buff[1] = "ami";
// Run examples
example_1(fixed_buff);
example_2(buff);
return 0;
}

How to find tokens from a c file?

I am trying to generate tokens from a C source file. I have split the C file into an array line and stored the words of the entire file in an array words.
The problem is with the strtok() function, which is splitting the line on whitespace characters. Because of this, I am not getting certain delimiters like parentheses and brackets because there is no whitespace between them and other tokens.
How do I determine which one is an identifier and which one is an operator?
Code so far:
int main()
{
/* ... */
char line[300][200];
char delim[]=" \n\t";
char *words[1000];
char *token;
while (fgets(&line[i][0], 100, fp1) != NULL)
{
token = strtok(&line[i][0], delim);
while (token != NULL)
{
words[j++] = token;
token = strtok(NULL, delim);
}
i++;
}
for(i = 0; i < 50; i++)
{
printf("%s\n", words[i]);
}
return 0;
}
This is a tricky question, something that needs probably more depth than a StackOverflow answer. I'll try, nonetheless.
Tokenizing the input is the first part of the compilation process. The objective is to simplify the task of the parser, which is going to build an abstract syntax tree from the contents of the file. How do we simplify this? We recognize the tokens that have a special meaning, as well as identifiers, operators, and so on. C is indeed a tricky, complex language, so let's simplify the language to tokenize: we'll start with a typical calculator.
An input example would be:
( 4 +5)* 2
When syntax is free-form, you can add or skip spaces, so as you have already found out, splitting by space is not an option.
The tokenized output for the example above would be: LPAR, LIT, OP, LIT, RPAR, OP, LIT. The meaning goes as follows:
LPAR: Left parenthesis
RPAR: Right parenthesis
LIT: Literal (a number)
OP: Operator (say: +, -, * and /).
The complete output would therefore be:
{ LPAR, LIT(4), OP('+'), LIT(5), RPAR, OP('*'), LIT(2) }
Your lexer basically has to advance in the input string, char by char, using a state machine. For example, when you read a number, you enter in the "input literal" state, in which only other numbers and '.' are allowed.
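A minimal sketch of such a lexer for the calculator language is shown below. The type and function names (TokKind, Tok, tokenize) are invented for this example, and error handling is reduced to a single TOK_ERROR token.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_END, TOK_LPAR, TOK_RPAR, TOK_LIT, TOK_OP, TOK_ERROR } TokKind;

typedef struct {
    TokKind kind;
    double  value;   /* only meaningful for TOK_LIT */
    char    op;      /* only meaningful for TOK_OP */
} Tok;

/* Splits text into at most max tokens; returns how many were stored,
 * including the final TOK_END (or TOK_ERROR) token. */
size_t tokenize(const char *text, Tok *out, size_t max) {
    size_t n = 0;
    while (n < max) {
        /* "between tokens" state: spaces are simply skipped */
        while (isspace((unsigned char)*text)) ++text;
        Tok t = { TOK_ERROR, 0.0, 0 };
        unsigned char c = (unsigned char)*text;
        if (c == '\0') {
            t.kind = TOK_END;
        } else if (isdigit(c) || c == '.') {
            /* "input literal" state: only digits and one '.' are allowed */
            double value = 0.0, scale = 1.0;
            int digits = 0, seen_dot = 0;
            for (;; ++text) {
                if (isdigit((unsigned char)*text)) {
                    value = value * 10.0 + (*text - '0');
                    if (seen_dot) scale *= 10.0;
                    ++digits;
                } else if (*text == '.' && !seen_dot) {
                    seen_dot = 1;
                } else {
                    break;               /* leave the literal state */
                }
            }
            if (digits > 0) { t.kind = TOK_LIT; t.value = value / scale; }
        } else if (c == '(') {
            t.kind = TOK_LPAR; ++text;
        } else if (c == ')') {
            t.kind = TOK_RPAR; ++text;
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            t.kind = TOK_OP; t.op = (char)c; ++text;
        }
        out[n++] = t;
        if (t.kind == TOK_END || t.kind == TOK_ERROR) break;
    }
    return n;
}
```

For the input "( 4 +5)* 2" this stores LPAR, LIT(4), OP('+'), LIT(5), RPAR, OP('*'), LIT(2) and a final end marker.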
Now the parser has an easier task. If you feed it with the previous tokens, it does not have to skip spaces, or distinguish between a negative number and a minus operator, it can just advance in a list or array. It can behave following the type of the token, and some of them have associated data, as you can see.
This is only an introduction to the introduction, anyway. Information about the whole compilation process could fill a book. And there are actually many books devoted to this topic, such as the famous "Dragon book" by Aho, Sethi & Ullman. A more recent one is the "Tiger book".
Finally, lexers are quite similar to one another, so it is possible to find generic lexers out there. You can even find a C grammar for that kind of tool.
Hope this (somehow) helps.

Correct way of initializing a non known value String in C

Say I want to create a String that will hold some values based on another string. Basically, I want to be able to compress one string, like this: aaabb -> a3b2 - But my question is:
In Java you could do something like this:
String mystr = "";
String original = "aaabb";
char last = original.charAt(0);
for (int i = 1; i < original.length(); i++) {
// Some code not relevant
mystr += last + "" + count; // Here is my doubt.
}
As you can see, we have initialized an empty string and we can modify it (mystr += last + "" + count;). How can you do that in C?
Unfortunately, in C you cannot have it as easy as in Java: string memory needs dynamic allocation.
There are three common choices here:
1. Allocate as much as you could possibly need, then trim to size once you are done - This is very common, but it is also risky due to a possibility of buffer overrun when you miscalculate the max
2. Run your algorithm twice - the first time counting the length, and the second time filling in the data - This may be the most efficient one if the timing is dominated by memory allocation: this approach requires you to allocate only once, and you allocate the precise amount of memory.
3. Allocate as you go - start with a short string, then use realloc when you need more memory.
I would recommend using the second approach. In your case, you would run through the source string once to compute the compressed length (in your case, that's 5: four characters for the payload "a3b2", and one for the null terminator). With this information in hand, you allocate five bytes, then use the allocated buffer for the output, which is guaranteed to fit.
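Here is a sketch of that two-pass approach for the run-length example. The names rle_pass and rle_compress are made up for illustration; the caller is responsible for freeing the result.

```c
#include <stdio.h>
#include <stdlib.h>

/* Walks the runs of src. If out is NULL it only measures; otherwise it
 * also writes the compressed text into out, which holds cap bytes.
 * Returns the length of the compressed text without the NUL. */
static size_t rle_pass(const char *src, char *out, size_t cap) {
    size_t len = 0;
    while (*src) {
        size_t run = 1;
        while (src[run] == src[0]) ++run;
        /* With a NULL buffer and size 0, snprintf only reports the length */
        int n = snprintf(out ? out + len : NULL, out ? cap - len : 0,
                         "%c%zu", src[0], run);
        len += (size_t)n;
        src += run;
    }
    return len;
}

/* Returns a newly allocated compressed copy of src, or NULL on failure. */
char *rle_compress(const char *src) {
    size_t len = rle_pass(src, NULL, 0);   /* pass 1: measure */
    char *out = malloc(len + 1);
    if (!out) return NULL;
    out[0] = '\0';                         /* covers the empty input */
    rle_pass(src, out, len + 1);           /* pass 2: fill */
    return out;
}
```

For "aaabb" the first pass returns 4, five bytes are allocated, and the second pass writes "a3b2".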
In C (not C++) you can do something like this:
char mystr[1024];
char * str = "abcdef";
char c = str[1]; // will get 'b'
int int_num = 100;
sprintf(mystr, "%s%c%d", str, c, int_num);
This will create a string in 'mystr':
"abcdefb100"
You can then concatenate more data to this string using strcat()
strcat(mystr, "xyz"); // now it is "abcdefb100xyz"
Please note that mystr has been declared to be 1024 bytes long and this is all the space you can use in it. If you know how long your string will be you can use malloc() in C to allocate the space and then use it.
C++ has much more robust ways of dealing with strings, if you want to use it.
You can use string concatenation method strcat:
http://www.cplusplus.com/reference/cstring/strcat/
You define your string as following:
char mystr[1024] = ""; // Start with an empty string; assuming the maximum string you will need is 1024 including the terminating zero
To convert the character last into a string to be able to concatenate it, you use the following syntax:
char lastString[2];
lastString[0] = last; // Set the current character from the for loop
lastString[1] = '\0'; // Set the null terminator
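To convert the count into a string you can use the itoa function where your compiler provides it (it is not part of standard C), as follows:
char countString[32];
itoa (count, countString, 10); // Convert count to decimal ascii string
/* Portable alternative: snprintf(countString, sizeof countString, "%d", count); */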
Then you can use strcat as following:
strcat(mystr, lastString);
strcat(mystr, countString);
Another solution is to use STL String class or MFC CString if you are using Visual C++.

Resources