I'm pretty new to C and was wondering if I could get some help! I've been working on this bug for 15+ hours.
So, this program is a tokenizer.
Basically, the program is supposed to take a string, or "token stream," and break it up into "tokens." A "token" is a string containing a word, a hexadecimal integer, an octal integer, a decimal integer, a floating-point number, or a symbol.
The code I'm posting is only the part where things go wrong; the rest of my program is what creates the token.
The gist of how the below code works is this: It takes a "token stream", and then finds the next token from that stream. Once that is completed, it will create a substring of the "token stream" minus the new token, and return that as the new "token stream."
Essentially, when the string "0x4356/*abdc 0777 */[]87656879jlhg kl(/j jlkh 'no thank you' /" is passed through, the program will do everything properly except when "jlhg kl(/j jlkh 'no thank you' /" passes. Once that passes through my program, a "jlhg" token is created BUT then it is added to the end of the token stream again. So, the new token stream to be broken down becomes " kl(/j jlkh 'no thank you' / jlhg" where jlhg is added on at the end, where it wasn't there before. It does this same weird thing once more, right afterwards, but with "kl" instead.
It only does this under extremely weird conditions, so I'm not sure of the cause. I put print statements throughout my program and things flow normally, except that seemingly out of nowhere the program will just add those at the end. This is why I feel like it might be a memory problem, but I have absolutely no clue where to go from here.
Any help would be GREATLY appreciated!!!!
EDIT: If you pass the string "array[xyz ] += pi 3.14159e-10 A12B" output should be:
word "array"
left brace "["
word "xyz"
right brace "]"
plusequals "+="
word "pi"
float "3.14159e-10"
word "A12B"
My TokenizerT is this:
struct TokenizerT_
{
char *tokenType;
char *token;
};
typedef struct TokenizerT_ TokenizerT;
Relevant code:
/*
* TKNewStream takes two TokenizerT objects.
* It will locate the index of the end of the last token,
* and create a substring with the new string to be tokenized.
* #tokenStream: old token stream
* #newToken: new token created from old token stream
*
*/
char *TKGetNextStream(char *tokenStream, char *newToken)
{
int i,
index = 0,
count = 0;
char last = newToken[strlen(newToken)-1];
for(i = 0; i < strlen(newToken); i++)
{
if(newToken[i] == last)
{
count++;
}
}
for(i = 0; i < strlen(tokenStream); i++)
{
if(tokenStream[i] == last && count == 1)
{
index = i + 1;
break;
}
else if(tokenStream[i] == last)
{
count--;
}
}
char *ret = malloc(sizeof(char)*(strlen(tokenStream) - index));
for(i = 0; i < strlen(tokenStream) - index; i++)
{
ret[i] = tokenStream[i+index];
}
return ret;
}
/*
* This is my main
*/
int main(int argc, char **argv)
{
char *string = "0x4356/*abdc 0777 */[]87656879jlhg kl(/j jlkh 'no thank you' /";
TokenizerT *newToken = malloc(sizeof(struct TokenizerT_)),
*tokenStream = malloc(sizeof(struct TokenizerT_));
tokenStream->token = string;
while(newToken != NULL)
{
newToken = TKCreate(TKGetNextToken(tokenStream));
if(newToken != NULL)
{
tokenStream->token = TKGetNextStream(tokenStream->token,
newToken->token);
printf("%s \"%s\"\n",
newToken->tokenType,
newToken->token);
}
}
TKDestroy(newToken);
return 0;
}
The string created in ret isn't null-terminated: malloc reserves exactly as many bytes as there are characters to copy, and the loop never writes a terminating zero. So all the functions dealing with strings will assume it goes on until the next random zero byte that happens to be found after the allocated memory.
To fix this, allocate one more byte of space for ret and set that byte to zero, or use an existing function like strdup() to copy the string:
ret = strdup(tokenStream + index);
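The first fix (one extra byte plus an explicit terminator) can be sketched as follows. The helper name copy_tail is hypothetical and not part of the original program; the sketch assumes index is never larger than strlen(stream).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of stream[index..],
 * including the terminating '\0'. The caller must free() the result.
 * Assumes index <= strlen(stream). */
char *copy_tail(const char *stream, size_t index)
{
    assert(stream != NULL);
    size_t len = strlen(stream);
    assert(index <= len);

    size_t n = len - index;      /* characters to copy, excluding '\0' */
    char *ret = malloc(n + 1);   /* +1 leaves room for the terminator */
    if (ret == NULL)
        return NULL;

    memcpy(ret, stream + index, n);
    ret[n] = '\0';               /* the byte the original loop never wrote */
    return ret;
}
```

In TKGetNextStream this would replace the malloc and the final copy loop, e.g. return copy_tail(tokenStream, index);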
I am attempting to write a very basic lexer in C and have the following code, which is supposed to do something like the following:
Input: "12 142 123"
Output:
NUMBER -- 12
NUMBER -- 14
NUMBER -- 123
However, I am having an issue where if I do not include an initial printf("") statement before looping over the input, then I will get an output like this:
Output:
NUMBER --
NUMBER -- 14
NUMBER -- 123
where the first number is simply blank. I am really confused as to why this is happening and would really appreciate some help with this!
I have the following code (with a number of irrelevant functions omitted)
#define MAX_LEN 400
char* input;
char* ptr;
char curr_type;
char curr;
enum token_type {
END,
NUMBER,
UNEXPECTED
};
typedef struct {
enum token_type type;
char* str;
} Token;
void print_tok(Token t) {
printf("%s -- %s\n", token_types[t.type], t.str);
}
char get(void) {
return *ptr++;
}
char peek(void) {
return *ptr;
}
Token number(void) {
char arr[MAX_LEN];
arr[0] = peek();
get();
int i = 1;
while (is_digit(peek())) {
arr[i] = get();
++i;
}
arr[++i] = '\0';
Token ret = {NUMBER, (char*)arr};
return ret;
}
Token unexpected(void) {
// omitted
}
Token next(void) {
while (is_space(peek())) get();
char c = peek();
switch (peek()) {
case '0':
// omitted
case '9':
return number();
default:
return unexpected();
}
}
int main(int argc, char **argv) {
printf(""); // works fine with this line
input = argv[1];
ptr = input;
Token tokens[MAX_LEN];
Token t;
int i = 0;
do {
t = next();
print_tok(t);
tokens[i++] = t;
} while (t.type != END && t.type != UNEXPECTED);
return 0;
}
In number, arr is a local variable. The local variable is destroyed when its function ends and its content is then unpredictable. Nonetheless, your program then prints its value by using a pointer in the Token struct.
The value that is printed is unpredictable. The extra printf("") statement may cause the compiler to rearrange the code in a way that causes the variable to not get overwritten, or something like that. You cannot rely on it.
You have several other options to allocate memory per token:
Change str in token so it's an array of chars instead of a pointer. Then each token has its own space to store the string.
Allocate the string with malloc. Then it stays allocated until you free it.
Create the array in main so it's valid for both next and print_tok. You'd have to give next a pointer to the array, so it knows where it should store the string. This would only store one token's string at a time.
Basically any other way of creating an array other than making it a local variable in number.
Make the pointer point to where the token is in the original string. Add another variable in Token which stores how long the token is.
I think the first option is easiest and the last option uses the least memory, but I included some other options for completeness.
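As a rough sketch of the first option, the struct below stores the characters inside the Token itself, so returning a Token by value copies the string with it. The function name number_at and its cursor parameter are assumptions made for this example (the original code uses the global ptr with get() and peek()).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_LEN 400

enum token_type { END, NUMBER, UNEXPECTED };

/* The string lives inside the struct, so it stays valid after the
 * function that filled it in has returned. */
typedef struct {
    enum token_type type;
    char str[MAX_LEN];
} Token;

/* Hypothetical variant of number(): reads digits starting at *cursor
 * and advances *cursor past them. */
Token number_at(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    Token t;
    t.type = NUMBER;
    size_t i = 0;
    while (isdigit((unsigned char)**cursor) && i < MAX_LEN - 1) {
        t.str[i++] = **cursor;
        (*cursor)++;
    }
    t.str[i] = '\0';   /* terminate at i, right after the last digit */
    return t;
}
```

The trade-off is size: every Token now occupies at least MAX_LEN bytes, which is why the last option (a pointer into the input plus a length) uses less memory.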
EDIT: So it looks like the problem is that the string that getNum is supposed to convert to a float is not actually a string containing all the characters of the token. Instead it contains the character immediately following the token, which is usually not numeric, so atof converts it to 0. I'm not sure why this behavior is occurring.
I'm working on a scanner + parser that evaluates arithmetic expressions. I am trying to implement a method that gets a token (stored as a string) which is a number and turns it into a float, but it always returns 0 no matter what the token is.
I was given the code for a get_character function, which I am not sure is correct. I'm having a little trouble parsing what's going on with it though, so I'm not sure:
int get_character(location_t *loc)
{
int rtn;
if (loc->column >= loc->line->length) {
return 0;
}
rtn = loc->line->data[loc->column++];
if (loc->column >= loc->line->length && loc->line->next) {
loc->line = loc->line->next;
loc->column = 0;
}
return rtn;
}
I used it in my getNum() function assuming it was correct. It is as follows:
static float getNum(){
char* tokenstr;
tokenstr = malloc(tok.length * sizeof(char));
int j;
for(j = 0; j < tok.length; j++){
tokenstr[j] = get_character(&loc);
}
match(T_LITERAL); /*match checks if the given token class is the same as the token
class of the token currently being parsed. It then moves the
parser to the next token.*/
printf("%f\n", atof(tokenstr));
return atof(tokenstr);
}
Below is some additional information that is required to understand what is going on in the above functions. These are details about some struct files which organize the input data.
In order to store and find tokens, three types of structs are used. A line_t struct, a location_t struct, and a token_t struct. The code for these are posted, but to summarize:
Lines contain an array of characters (the input from that line of the
input file), an int for the length of the line, an int that is the
line number as a form of identification, and a pointer to the next
line of input that was read into memory.
Locations contain a pointer to a specific line, and an int that
specifies a specific "column" of the line.
Tokens contain an int for the length of the token, a location describing where the token begins, and token class describing what kind of token it is for the parser.
Code for these structs:
typedef struct line {
char * data;
int line_num;
int length; /* number of non-NUL characters == index of trailing NUL */
struct line * next;
} line_t;
typedef struct {
line_t *line;
int column;
} location_t;
typedef struct {
token_class tc;
location_t location;
int length; /* length of token in characters (may span lines) */
} token_t;
It appears that the default behavior intended is to extract a character and then advance to the next prior to returning.
Yet if the line length is exceeded (or the column value isn't initialized to less than the line length), the function will not advance.
Try this:
if (loc->column >= loc->line->length) {
loc->line = loc->line->next;
loc->column = 0;
return 0;
}
And make sure that the column location is properly initialized.
Personally, I would change the function to this:
int get_character(location_t *loc)
{
int rtn = 0;
if (loc->column < loc->line->length) {
rtn = loc->line->data[loc->column++];
}
if (loc->column >= loc->line->length && loc->line->next) {
loc->line = loc->line->next;
loc->column = 0;
}
return rtn;
}
I'd also use unsigned values for the column and length, just to avoid the possibility of negative array indices.
I see a number of potential problems with this code:
char* tokenstr;
tokenstr = malloc(tok.length * sizeof(char));
int j;
for(j = 0; j < tok.length; j++){
tokenstr[j] = get_character(&loc);
}
match(T_LITERAL); /*match checks if the given token class is the same as the token
class of the token currently being parsed. It then moves the
parser to the next token.*/
printf("%f\n", atof(tokenstr));
return atof(tokenstr);
You allocate space for a new token string tokenstr and copy the characters into it, but you don't null-terminate it afterwards, and the allocation has no room for the string terminator \0 anyway. There is also a memory leak at the end, because tokenstr is never freed. I might consider a change to something like:
char* tokenstr;
float floatVal;
/* Make sure we have enough space including \0 to terminate string */
tokenstr = malloc((tok.length + 1) * sizeof(char));
int j;
for(j = 0; j < tok.length; j++){
tokenstr[j] = get_character(&loc);
}
/* add the end of string \0 character */
tokenstr[j] = '\0';
match(T_LITERAL); /*match checks if the given token class is the same as the token
class of the token currently being parsed. It then moves the
parser to the next token.*/
floatVal = atof(tokenstr);
/* Free up the `malloc`ed tokenstr as it is no longer needed */
free(tokenstr);
printf("%f\n", floatVal);
return floatVal;
I am currently trying to iterate over a string to find the first white space.
I want to copy all of the characters before that white space into a different string.
Here I am copying more of my code: lineArray is global, and is filled in by a different function which I didn't copy.
char *lineArray[16];
int startProcesses(int background) {
int i = 0;
int var = 0;
int pid;
int status;
int len;
char copyProcessName[255];
while(*(lineArray+i) != NULL) {
len = strlen(lineArray[i]);
for (var = 0; var < len; ++var) {
if(lineArray[i][var] != ' ') {
copyProcessName[var] = lineArray[i][var];
} else {
break;
}
}
I know this is not finished and I am missing '\0', but before getting to that I noticed while debugging that after the first time the program executes the copyProcessName[var] = lineArray[i][var]; assignment, the whole string in lineArray[i] is destroyed: instead of containing, for example, ls -l, it is replaced with ll - l.
I will mention a few more things:
lineArray is a global variable. I did try using strcpy, but it caused the same destruction, which is why I chose to implement the copy myself. The last thing is that I am using Ubuntu.
Does anyone have an idea why it is doing that?
Thanks!
I am currently trying to iterate over a string to find the first white space.
Cool.
char s[] = "line with spaces";
char *p = strchr(s, ' '); // pointer to the first WS, if you need it
ptrdiff_t n = p - s; // or its position within the string, if that's what you're looking for
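Note that strchr returns NULL when the string contains no space, so p - s is only meaningful after checking p. A minimal sketch of the whole "copy everything before the first space" step, using strcspn so the no-space case needs no special handling, might look like this. The helper name copy_first_word and the dst_size parameter are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the part of line before the first space
 * into dst (capacity dst_size) and terminates it. Returns the number of
 * characters copied; the word is truncated if dst is too small. */
size_t copy_first_word(char *dst, size_t dst_size, const char *line)
{
    assert(dst != NULL && line != NULL && dst_size > 0);
    size_t n = strcspn(line, " ");   /* length of the run before ' ' or '\0' */
    if (n >= dst_size)
        n = dst_size - 1;            /* leave room for the terminator */
    memcpy(dst, line, n);
    dst[n] = '\0';
    return n;
}
```

With the question's variables this would be called as copy_first_word(copyProcessName, sizeof copyProcessName, lineArray[i]);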
I'm trying to make a quick function that gets a word/argument in a string by its number:
char* arg(char* S, int Num) {
char* Return = "";
int Spaces = 0;
int i = 0;
for (i; i<strlen(S); i++) {
if (S[i] == ' ') {
Spaces++;
}
else if (Spaces == Num) {
//Want to append S[i] to Return here.
}
else if (Spaces > Num) {
return Return;
}
}
printf("%s-\n", Return);
return Return;
}
I can't find a way to put the characters into Return. I have found lots of posts that suggest strcat() or tricks with pointers, but every one segfaults. I've also seen people saying that malloc() should be used, but I'm not sure of how I'd used it in a loop like this.
I will not claim to understand what it is that you're trying to do, but your code has two problems:
You're assigning a read-only string to Return; that string will be in your
binary's data section, which is read-only, and if you try to modify it you will get a segfault.
Your for loop is O(n^2), because strlen() is O(n) and the loop condition calls it on every iteration.
There are several different ways of solving the "how to return a string" problem. You can, for example:
Use malloc() / calloc() to allocate a new string, as has been suggested
Use asprintf(), which is similar but gives you formatting if you need
Pass an output string (and its maximum size) as a parameter to the function
The first two require the calling function to free() the returned value. The third allows the caller to decide how to allocate the string (stack or heap), but requires some sort of contract about the minimum size needed for the output string.
In your code, Return points to a string literal. The literal itself outlives the function, so returning the pointer is fine, but writing through it is undefined behavior. It might appear to work, but you should never rely on it.
Typically in C, you'd want to pass the "return" string as an argument instead, so that you don't have to free it all the time. Both require a local variable on the caller's side, but malloc'ing it will require an additional call to free the allocated memory and is also more expensive than simply passing a pointer to a local variable.
As for appending to the string, just use array notation (keep track of the current char/index) and don't forget to add a null character at the end.
Example:
int arg(char* ptr, char* S, int Num) {
int i, Spaces = 0, cur = 0;
for (i=0; i<strlen(S); i++) {
if (S[i] == ' ') {
Spaces++;
}
else if (Spaces == Num) {
ptr[cur++] = S[i]; // append char
}
else if (Spaces > Num) {
ptr[cur] = '\0'; // insert null char
return 0; // returns 0 on success
}
}
ptr[cur] = '\0'; // insert null char
return (cur > 0 ? 0 : -1); // returns 0 on success, -1 on error
}
Then invoke it like so:
char myArg[50];
if (arg(myArg, "this is an example", 3) == 0) {
printf("arg is %s\n", myArg);
} else {
// arg not found
}
Just make sure you don't overflow ptr (e.g.: by passing its size and adding a check in the function).
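One way to add that check is sketched below. The name arg_bounded and the out_size parameter are assumptions for this example; the loop also tests s[i] directly instead of calling strlen on every iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a bounds-checked variant: out_size is the capacity of out.
 * Returns 0 on success, -1 if the word is missing or does not fit. */
int arg_bounded(char *out, size_t out_size, const char *s, int num)
{
    assert(out != NULL && s != NULL);
    if (out_size == 0)
        return -1;

    int spaces = 0;
    size_t cur = 0;
    /* Stop at the end of the string or once we are past the wanted word. */
    for (size_t i = 0; s[i] != '\0' && spaces <= num; i++) {
        if (s[i] == ' ') {
            spaces++;
        } else if (spaces == num) {
            if (cur + 1 >= out_size) {   /* keep one byte for '\0' */
                out[0] = '\0';
                return -1;
            }
            out[cur++] = s[i];           /* append char */
        }
    }
    out[cur] = '\0';
    return cur > 0 ? 0 : -1;
}
```

A caller would pass sizeof myArg as the second argument, so the function can refuse a word that would not fit rather than write past the end of the buffer.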
There are a number of ways you could improve your code, but let's just start by making it meet the standard. ;-)
P.S.: Don't malloc unless you need to. And in this case you don't.
char * Return; //by the way horrible name for a variable.
Return = malloc(<some size>);
......
......
*(Return + index) = *(S+i);
You can't assign anything to a string literal such as "".
You may want to use your loop to determine the offset of the start of the word you're looking for. Then find its length by continuing through the string until you encounter the end or another space. Then you can malloc an array of chars with size equal to the length of the word + 1 (for the null terminator). Finally, copy the substring into this new buffer and return it.
Also, as mentioned above, you may want to remove the strlen call from the loop condition. Some compilers can hoist it out of the loop, but otherwise it is a linear operation performed on every iteration, making the loop O(n**2).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *arg(const char *S, unsigned int Num) {
char *Return = "";
const char *top, *p;
unsigned int Spaces = 0;
int i = 0;
Return=(char*)malloc(sizeof(char));
*Return = '\0';
if(S == NULL || *S=='\0') return Return;
p=top=S;
while(Spaces != Num){
if(NULL!=(p=strchr(top, ' '))){
++Spaces;
top=++p;
} else {
break;
}
}
if(Spaces < Num) return Return;
if(NULL!=(p=strchr(top, ' '))){
int len = p - top;
Return=(char*)realloc(Return, sizeof(char)*(len+1));
strncpy(Return, top, len);
Return[len]='\0';
} else {
free(Return);
Return=strdup(top);
}
//printf("%s-\n", Return);
return Return;
}
int main(){
char *word;
word=arg("make a quick function", 2);//quick
printf("\"%s\"\n", word);
free(word);
return 0;
}
I came across the below code while googling, and it works great. (Credit to Chaitanya Bhatt @ Performancecompetence.com)
The below function searches for the last occurrence of the passed delimiter and saves the remaining part of the input string to the returned output string.
void strLastOccr(char inputStr[100], char* outputStr, char *delim)
{
char *temp, *temp2;
int i = 0;
temp = "";
while (temp!=NULL)
{
if(i==0)
{
temp2 = temp;
temp = (char *)strtok(inputStr,delim);
i++;
}
if(i>0)
{
temp2 = temp;
temp = (char *)strtok(NULL,delim);
}
lr_save_string(temp2,outputStr);
}
}
Basically trying to add two new options to pass in.
Occurrence No: Instead of defaulting to the last occurrence, allow the caller to specify which occurrence to stop at, and save the remainder of the string.
Part of the string to save: (Left, Right) At the moment the function saves the right side once the delimiter is found. The additional option is intended to let the user specify whether the left or the right side of the delimiter is saved.
void strOccr(char inputStr[100], char* outputStr, char *delim, int *occrNo, char *stringSide)
So the question is what are the modifications I need to the above function?
Also is it actually possible to do?
UPDATE
After I kept at it I was able to work out a solution.
As I can't answer my own question for another 6 hours, points will be awarded to who can provide an improved function. Specifically I don't like the code under the comment "// Removes the delim at the end of the string."
void lr_custom_string_delim_save (char inputStr[500], char* outputStr, char *delim, int occrNo, int stringSide)
{
char *temp, *temp2;
char temp3[500] = {0};
int i = 0;
int i2;
int iOccrNo = 1;
temp = "";
while (temp!=NULL) {
if(i==0) {
temp2 = temp;
temp = (char *)strtok(inputStr,delim);
i++;
}
if(i>0) {
temp2 = temp;
temp = (char *)strtok(NULL,delim);
if (stringSide==0) {
if (iOccrNo > occrNo) {
strcat(temp3, temp2);
// Ensure an extra delim is not added at the end of the string.
if (temp!=NULL) {
// Adds the delim back into the string that is removed by strtok.
strcat(temp3, delim);
}
}
}
if (stringSide==1) {
if (iOccrNo <= occrNo) {
strcat(temp3, temp2);
strcat(temp3, delim);
}
}
// Increase the occurrence counter.
iOccrNo++;
}
}
// Removes the delim at the end of the string.
if (stringSide==1) {
for( i2 = strlen (temp3) - 1; i2 >= 0
&& strchr ( delim, temp3[i2] ) != NULL; i2-- )
// replace the string terminator:
temp3[i2] = '\0';
}
// Saves the new string to new param.
lr_save_string(temp3,outputStr);
}
You really only need to make a few modifications. As you begin walking the string with strtok() you can store two variables, char *current, *previous.
As you hit each new token, move 'current' to 'previous' and store the new 'current.' At the end of the string parse look at the value of 'previous' to get the second from last element.
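The current/previous idea can be sketched in plain C as follows, leaving out the LoadRunner-specific calls. The function name second_to_last_token is hypothetical; note that strtok modifies the input buffer in place, so the input must be a writable array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tokenizes input in place (strtok overwrites the
 * delimiters) and returns the second-to-last token, or NULL if there are
 * fewer than two tokens. The result points into input. */
char *second_to_last_token(char *input, const char *delim)
{
    assert(input != NULL && delim != NULL);
    char *previous = NULL;
    char *current = NULL;
    for (char *tok = strtok(input, delim); tok != NULL;
         tok = strtok(NULL, delim)) {
        previous = current;   /* shift the window by one token */
        current = tok;
    }
    return previous;
}
```

After the loop, current holds the last token and previous the one before it, which is the value the paragraph above describes reading at the end of the parse.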
Other options, keep a counter and build a pseudo array using the LoadRunner variable handling mechanism, lr_save_string(token_value,"LR_variable_name_"). You'll need to build your variable name string first of course. When you fall out of the parse action your count variable will likely hold the total number of token elements parsed out of the string and then you can use the (counter-1) index value to build your string.
char foo[100]="";
...
sprintf(foo, "{LR_variable_name_%d}",counter-1);
lr_message("My second to last element is %s",lr_eval_string(foo));
There are likely other options as well, but these are the two that jump to mind. Also, I recommend a book to you that I recommend to all that want to brush up on their C (including my brother and my uncle), "C for Dummies." There are lots of great options here on the string processing front that you can leverage in LoadRunner.