Reading a line, tokenizing and assigning to struct in C

line is read with fgets inside a while loop with counter n; d is an array of structs, each with two char arrays, p and q. In short, I want to read a line and separate it into two strings: one up to the first space, and the other with the rest of the line. I clean up afterwards (the \n from the file becomes '\0'). The code works, but is there a more idiomatic way to do this? What errors am I running into "unknowingly"?
size_t spc = strcspn(line," ");
strncpy(d[n].p, line, spc);
d[n].p[spc+1]='\0';
size_t l = strlen(line)-spc;
strncpy(d[n].q, line+spc+1, l);
char* nl = strchr(d[n].q, '\n');
if(nl){
*nl='\0';
}
n++;
EDIT: q may contain spaces.
Thanks.

This can be done with pure pointer arithmetic only. Assuming line contains the current line:
char *p = line;
char *part1, *part2;
while (*p && *p != ' ') {
p++;
}
if (*p == ' ') {
*p++ = '\0';
part1 = strdup(line);
part2 = strdup(p);
if (!part1 || !part2) {
/* insufficient memory */
}
} else {
/* line doesn't contain a space */
}
Basically you scan the string till the first occurrence of a space, then replace the space with a null character to indicate the end of the first part (strdup needs to know where to stop), and advance the pointer by one to get the rest of the string.
To make the code look even cleaner but with the overhead of calling a function, you could use strchr() instead of the while loop:
char *p = strchr(line, ' ');
char *part1, *part2;
if (p) {
*p++ = '\0';
part1 = strdup(line);
part2 = strdup(p);
}

I would write very nearly the code you have. Some tweaks:
You're not getting anything out of strncpy here, use memcpy.
You're not getting anything out of strcspn either, use strchr.
Avoid scanning parts of the string twice.
So:
char *spc = strchr(line, ' ');
memcpy(d[n].p, line, spc - line);
d[n].p[spc - line] = '\0';
spc++;
char *end = strchr(spc, '\n');
if (end)
{
memcpy(d[n].q, spc, end - spc);
d[n].q[end - spc] = '\0';
}
else
strcpy(d[n].q, spc);
n++;
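None of the snippets so far check that a space was actually found or that the pieces fit into p and q. A minimal sketch of the same strchr/memcpy approach with those checks added might look like this; struct pair, its field sizes, and the name split_at_space are assumptions made up for the illustration, not taken from the question.

```c
#include <string.h>

/* Hypothetical record type; the field sizes are assumptions for this sketch. */
struct pair {
    char p[16];
    char q[64];
};

/*
 * Split line at its first space: the text before the space goes to d->p,
 * the text after it (up to a newline or the end of the string) to d->q.
 * Returns 0 on success, -1 if there is no space or a part does not fit.
 */
int split_at_space(const char *line, struct pair *d)
{
    const char *spc = strchr(line, ' ');
    if (spc == NULL)
        return -1;

    size_t plen = (size_t)(spc - line);
    const char *rest = spc + 1;
    size_t qlen = strcspn(rest, "\n");   /* length up to '\n' or '\0' */

    /* Each part needs one extra byte for its terminating '\0'. */
    if (plen >= sizeof d->p || qlen >= sizeof d->q)
        return -1;

    memcpy(d->p, line, plen);
    d->p[plen] = '\0';
    memcpy(d->q, rest, qlen);
    d->q[qlen] = '\0';
    return 0;
}
```

Inside the reading loop this would be called as split_at_space(line, &d[n]), incrementing n only when it returns 0.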

You could always use:
sscanf(line, "%s %s", d[n].p, d[n].q);
This assumes that the text you want to put into p and q does not contain spaces, and that p and q are guaranteed to be large enough to hold the tokens, including the terminating '\0'.
The scanf function is dangerous, but very useful when used correctly.

scanf("%s %[^\n]", d[n].p, d[n].q);
The %[...] directive is like %s, but instead of matching non-whitespace, it matches the characters within the brackets – or all characters except those in the brackets, if ^ is leading.
You should check the return value to see if q was actually input; this has somewhat different behavior than your code if "rest of line" is actually empty. (Or if the line starts with whitespace.)
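As a hedged illustration of the return-value check, here is a sketch that applies the same format to the line already read by fgets (so it uses sscanf rather than scanf) and adds field widths so neither array can overflow. The struct entry type, its sizes, and the function name split_line are assumptions for the example.

```c
#include <stdio.h>

/* Hypothetical record type; the field sizes are assumptions for this sketch. */
struct entry {
    char p[32];
    char q[128];
};

/*
 * Split line into a first word (e->p) and the rest of the line (e->q).
 * Returns the sscanf result: 2 if both parts were stored, 1 if only the
 * first word was present, and 0 or EOF if the line contained no word.
 */
int split_line(const char *line, struct entry *e)
{
    e->p[0] = '\0';
    e->q[0] = '\0';
    /* The widths 31 and 127 leave room for the terminating '\0'. */
    return sscanf(line, "%31s %127[^\n]", e->p, e->q);
}
```

A caller can treat a result of 1 as "empty rest of line", since q is left as an empty string in that case.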

Related

Cannot assign chars from one string to another

So I have a function that takes a string, and strips out special format characters, and assigns it to another string for later processing.
A sample call would be:
act_new("$t does $d");
It should strip out the $t and the $d and leave the second string as " does ", but it's not assigning anything. I am getting back into programming after quite a few years of inactivity, and this is someone else's code (a MUD codebase, Rom), but I feel like I am missing something fundamental about pointer assignments. Any tips?
(This is truncated code, the rest has no operations on str or point until much later)
void act_new(const char *format)
{
const char *str;
char *point;
str = format;
while ( *str != '\0' ) {
if ( *str != '$' ) {
*point++ = *str++;
continue;
}
}
}
You need to increment str every time through the loop, not only when you assign to point. Otherwise you end up in an infinite loop when the character doesn't match the if condition.
You also want to skip the character after $, so you have to increment str twice when you encounter $.
The code is simpler if you use a for loop and array indexing rather than pointer arithmetic.
size_t len = strlen(format);
for (size_t i = 0; i < len; i++) {
if (format[i] == '$') {
i++; // extra increment to skip character after $
} else {
*point++ = format[i];
}
}
There are a few problems with your code, as pointed out in the comments:
point is not initialized (garbage pointer value)
continue doesn't do anything
infinite loop if a $ is encountered
When writing the function, one must also keep in mind to skip an extra character if a $ is encountered if and only if it's not the last character in the string (except for the '\0').
Since you know how many times you need to loop, a for loop is better suited and, as a bonus, you don't have to explicitly check if the character after a $ is '\0' when skipping an extra character in format (renamed src below). Also, don't forget to terminate the destination string.
This code will take care of those things for you:
void act_new(const char *src)
{
const size_t length = strlen(src);
char * const dst = (char*)malloc(sizeof(char)*(length+1));
if(dst == NULL)
// Error handling left out
return;
char *point = dst;
for(size_t i = 0; i < length; ++i)
{
if(src[i] == '$')
{
++i;
continue;
}
*point++ = src[i];
}
*point = '\0'; //Terminate string properly
printf("%s\n", dst);
free(dst);
}

Splitting user input into strings of specific length

I'm writing a C program that parses user input into a char, and two strings of set length. The user input is stored into a buffer using fgets, and then parsed with sscanf. The trouble is, the three fields have a maximum length. If a string exceeds this length, the remaining characters before the next whitespace should be consumed/discarded.
#include <stdio.h>
#define IN_BUF_SIZE 256
int main(void) {
char inputStr[IN_BUF_SIZE];
char command;
char firstname[6];
char surname[6];
fgets(inputStr, IN_BUF_SIZE, stdin);
sscanf(inputStr, "%c %5s %5s", &command, firstname, surname);
printf("%c %s %s\n", command, firstname, surname);
}
So, with an input of
a bbbbbbbb cc
the desired output would be
a bbbbb cc
but instead the output is
a bbbbb bbb
Using a format specifier "%c%*s %5s%*s %5s%*s" runs into the opposite problem, where each substring needs to exceed the set length to get to the desired outcome.
Is there way to achieve this by using format specifiers, or is the only way saving the substrings in buffers of their own before cutting them down to the desired length?
In addition to the other answers, never forget when facing string parsing problems, you always have the option of simply walking a pointer down the string to accomplish any type parsing you require. When you read your string into buffer (my buf below), you have an array of characters you are free to analyze manually (either with array indexes, e.g. buffer[i] or by assigning a pointer to the beginning, e.g. char *p = buffer;) With your string, you have the following in buffer with p pointing to the first character in buffer:
--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0| contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3  4  index
 |
 p
To test the character pointed to by p, you simply dereference the pointer, e.g. *p. So to test whether you have an initial character between a-z followed by a space at the beginning of buffer, you simply need do:
/* validate first char is 'a-z' and followed by ' ' */
if (*p && 'a' <= *p && *p <= 'z' && *(p + 1) == ' ') {
cmd = *p;
p += 2; /* advance pointer to next char following ' ' */
}
note: you are testing *p first (which is shorthand for *p != 0 or the equivalent *p != '\0') to validate that the string is not empty (i.e. the first char isn't the nul-byte) before proceeding with further tests. You would also include an else { /* handle error */ } in the event any one of the tests failed (meaning you have no command followed by a space).
When you are done, you are left with p pointing to the third character in buffer, e.g.:
--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0| contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3  4  index
     |
     p
Now your job is simple: advance by no more than 5 characters (or until the next space is encountered), assigning the characters to firstname, and then nul-terminate following the last character:
/* read up to NMLIM chars into fname */
for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
fname[n++] = *p;
fname[n] = 0; /* nul terminate */
note: since fgets reads and includes the trailing '\n' in buffer, you should also test for the newline.
When you exit the loop, p is pointing at index 7 in the buffer (the sixth 'b'), as follows:
--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0| contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3  4  index
               |
               p
You now simply read forward until you encounter the next space and then advance past the space, e.g.:
/* discard remaining chars up to next ' ' */
while (*p && *p != ' ') p++;
if (*p) p++; /* advance past the ' ', but never past the nul-byte */
note: if you exited the firstname loop already pointing at a space, the while loop body does not execute; only the final advance does.
Finally, all you do is repeat the same loop for surname that you did for firstname. Putting all the pieces of the puzzle together, you could do something similar to the following:
#include <stdio.h>
enum { NMLIM = 5, BUFSIZE = 256 };
int main (void) {
char buf[BUFSIZE] = "";
while (fgets (buf, BUFSIZE, stdin)) {
char *p = buf, cmd, /* walking pointer & command char */
fname[NMLIM+1] = "",
sname[NMLIM+1] = "";
size_t n = 0;
/* validate first char is 'a-z' and followed by ' ' */
if (*p && 'a' <= *p && *p <= 'z' && *(p + 1) == ' ') {
cmd = *p;
p += 2; /* advance pointer to next char following ' ' */
}
else { /* handle error */
fprintf (stderr, "error: no single command followed by space.\n");
return 1;
}
/* read up to NMLIM chars into fname */
for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
fname[n++] = *p;
fname[n] = 0; /* nul terminate */
/* discard remaining chars up to next ' ' */
while (*p && *p != ' ') p++;
if (*p) p++; /* advance past the ' ', but never past the nul-byte */
/* read up to NMLIM chars into sname */
for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
sname[n++] = *p;
sname[n] = 0; /* nul terminate */
printf ("input : %soutput : %c %s %s\n",
buf, cmd, fname, sname);
}
return 0;
}
Example Use/Output
$ echo "a bbbbbbbb cc" | ./bin/walkptr
input : a bbbbbbbb cc
output : a bbbbb cc
Look things over and let me know if you have any questions. No matter how elaborate the string or what you need from it, you can always get what you need by simply walking a pointer (or a pair of pointers) down the length of the string.
One way to split the input buffer as OP desires is to use multiple calls to sscanf(), and to use the %n conversion specifier to keep track of the number of characters read. In the code below, the input string is scanned in three stages.
First, the pointer strPos is assigned to point to the first character of inputStr. Then the input string is scanned with " %c%n%*[^ ]%n". This format string skips over any initial whitespace that a user might enter before the first character, and stores the first character in command. The %n directive tells sscanf() to store the number of characters read so far in the variable n; then the %*[^ ] directive tells sscanf() to read and ignore any characters until a space is encountered. This effectively skips over any remaining characters that were entered after the initial command character. The %n directive appears again, and overwrites the previous value with the number of characters read until this point. The reason for using %n twice is that, if the user enters a character followed by a space (as expected), the %*[^ ] directive will find no matches, and sscanf() will return without ever reaching the final %n directive.
The pointer strPos is moved to the beginning of the remaining string by adding n to it, and sscanf() is called a second time, this time with "%5s%n%*[^ ]%n". Here, up to 5 characters are read into the character array firstname[], the number of characters read is saved by the %n directive, any remaining non-whitespace characters are read and ignored, and finally, if the scan made it this far, the number of characters read is saved again.
strPos is increased by n again, and the final scan only needs "%5s" to complete the task.
Note that the return value of fgets() is checked to be sure that it was successful. The call to fgets() was changed slightly to:
fgets(inputStr, sizeof inputStr, stdin)
The sizeof operator is used here instead of IN_BUF_SIZE. This way, if the declaration of inputStr is changed later, this line of code will still be correct. Note that the sizeof operator works here because inputStr is an array, and arrays do not decay to pointers in sizeof expressions. But, if inputStr were passed into a function, sizeof could not be used in this way inside the function, because arrays decay to pointers in most expressions, including function calls. Some, such as @DavidC.Rankin, prefer constants as OP has used. If this seems confusing, I would suggest sticking with the constant IN_BUF_SIZE.
Also note that the return values for each of the calls to sscanf() are checked to be certain that the input matches expectations. For example, if the user enters a command and a first name, but no surname, the program will print an error message and exit. It is worth pointing out that, if the user enters, say, a command character and first name only, after the second sscanf() the match may have failed on \n, and strPos is then incremented to point to the \0 and so is still in bounds. But this relies on the newline being in the string. With no newline, the match might fail on \0, and then strPos would be incremented out of bounds before the next call to sscanf(). Fortunately, fgets() retains the newline, unless the input line is larger than the specified size of the buffer. Then there is no \n, only the \0 terminator. A more robust program would check the input string for \n, and add one if needed. It would not hurt to increase the size of IN_BUF_SIZE.
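The newline check mentioned above could be sketched as a small helper. This is only one possible approach, and the name read_line and its return convention are assumptions for the example: it reports whether the line fit in the buffer and throws away the unread remainder of an over-long line, rather than appending a \n.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read one line into buf (capacity size). If the line did not fit,
 * discard the rest of it so the next read starts on a fresh line.
 * Returns 1 if the whole line was read, 0 if characters had to be
 * discarded, and -1 on end of file or read error.
 */
int read_line(FILE *fp, char *buf, size_t size)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return -1;

    if (strchr(buf, '\n') != NULL)
        return 1;               /* complete line, newline included */

    /* No newline in buf: the line was too long, or the input ended
       without one. Swallow whatever is left of this line. */
    int c, truncated = 0;
    while ((c = fgetc(fp)) != EOF && c != '\n')
        truncated = 1;
    return truncated ? 0 : 1;
}
```

The program above could call read_line(stdin, inputStr, sizeof inputStr) in place of fgets() and reject the input when the result is not 1.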
#include <stdio.h>
#include <stdlib.h>
#define IN_BUF_SIZE 256
int main(void)
{
char inputStr[IN_BUF_SIZE];
char command;
char firstname[6];
char surname[6];
char *strPos = inputStr; // next scan location
int n = 0; // holds number of characters read
if (fgets(inputStr, sizeof inputStr, stdin) == NULL) {
fprintf(stderr, "Error in fgets()\n");
exit(EXIT_FAILURE);
}
if (sscanf(strPos, " %c%n%*[^ ]%n", &command, &n, &n) < 1) {
fprintf(stderr, "Input formatting error: command\n");
exit(EXIT_FAILURE);
}
strPos += n;
if (sscanf(strPos, "%5s%n%*[^ ]%n", firstname, &n, &n) < 1) {
fprintf(stderr, "Input formatting error: firstname\n");
exit(EXIT_FAILURE);
}
strPos += n;
if (sscanf(strPos, "%5s", surname) < 1) {
fprintf(stderr, "Input formatting error: surname\n");
exit(EXIT_FAILURE);
}
printf("%c %s %s\n", command, firstname, surname);
}
Sample interaction:
a Zaphod Beeblebrox
a Zapho Beebl
The scanf() family of functions has a reputation for being subtle and error-prone; the format strings used above may seem a little bit tricky. By writing a function to skip to the next word in the input string, the calls to sscanf() can be simplified. In the code below, skipToNext() takes a pointer to a string as input; if the first character of the string is a \0 terminator, the pointer is returned unchanged. All initial non-whitespace characters are skipped over, then any whitespace characters are skipped, up to the next non-whitespace character (which may be a \0). A pointer is returned to this non-whitespace character.
The resulting program is a little bit longer than the previous program, but it may be easier to understand, and it certainly has simpler format strings. This program does differ from the first in that it no longer accepts leading whitespace in the string. If the user enters whitespace before the command character, this is considered erroneous input.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#define IN_BUF_SIZE 256
char * skipToNext(char *);
int main(void)
{
char inputStr[IN_BUF_SIZE];
char command;
char firstname[6];
char surname[6];
char *strPos = inputStr; // next scan location
if (fgets(inputStr, sizeof inputStr, stdin) == NULL) {
fprintf(stderr, "Error in fgets()\n");
exit(EXIT_FAILURE);
}
if (sscanf(strPos, "%c", &command) != 1 || isspace(command)) {
fprintf(stderr, "Input formatting error: command\n");
exit(EXIT_FAILURE);
}
strPos = skipToNext(strPos);
if (sscanf(strPos, "%5s", firstname) != 1) {
fprintf(stderr, "Input formatting error: firstname\n");
exit(EXIT_FAILURE);
}
strPos = skipToNext(strPos);
if (sscanf(strPos, "%5s", surname) != 1) {
fprintf(stderr, "Input formatting error: surname\n");
exit(EXIT_FAILURE);
}
printf("%c %s %s\n", command, firstname, surname);
}
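char * skipToNext(char *c)
{
/* skip the rest of the current word, stopping at the terminator */
while (*c != '\0' && !isspace((unsigned char)*c)) {
++c;
}
/* skip the whitespace that follows, up to the next word or the terminator */
while (isspace((unsigned char)*c)) {
++c;
}
return c;
}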

Ignoring spaces in a string unless it's in quotes

char *args[32];
char **next = args;
char *temp = NULL;
char *quotes = NULL;
temp = strtok(line, " \n&");
while (temp != NULL) {
if (strncmp(temp, "\"", 1) == 0) {
//int i = strlen(temp);
printf("first if");
quotes = strtok(temp, "\"");
} else if (strncmp(temp, "\"", 1) != 0) {
*next++ = temp;
temp = strtok(NULL, " \n&");
}
}
I'm having trouble understanding how to keep spaces when part of the string is surrounded by quotes. For example, if I want execvp() to execute this: diff "space name.txt" sample.txt, it should save diff at args[0], space name.txt at args[1] and sample.txt at args[2].
I'm not really sure on how to implement this, I've tried a few different ways of logic with if statements, but I'm not quite there. At the moment I am trying to do something simple like: ls "folder", however, it gets stuck in the while loop of printing out my printf() statement.
I know this isn't worded as a question - it's more explaining what I'm trying to achieve and where I'm up to so far, but I'm having trouble and would really appreciate some hints of how the logic should be.
Instead of using strtok process the string char by char. If you see a ", set a flag. If flag is already set - unset it instead. If you see a space - check the flag and either switch to next arg, or add space to current. Any other char - add to current. Zero byte - done processing.
With some extra effort you'll be able to handle even stuff like diff "file \"one\"" file\ two (you should get diff, file "one" and file two as results)
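The flag-based approach described above can be sketched roughly as follows. This is a minimal illustration rather than a full shell-style parser: the function name split_args and its parameters are assumptions, it rewrites line in place, and it deliberately leaves out the backslash handling mentioned as extra effort.

```c
#include <stddef.h>

/*
 * Split line in place into at most max_args - 1 arguments, keeping spaces
 * that appear between double quotes. Stores pointers into line in args,
 * terminates the list with NULL, and returns the number of arguments.
 * Backslash escapes are NOT handled in this sketch.
 */
size_t split_args(char *line, char **args, size_t max_args)
{
    size_t argc = 0;
    char *src = line;
    char *dst = line;   /* write position; never ahead of src */
    int in_quotes = 0;  /* the flag: set while inside "..." */
    int in_arg = 0;     /* set while an argument is being built */

    if (max_args == 0)
        return 0;

    for (; *src != '\0'; src++) {
        char ch = *src;

        /* An unquoted space or newline ends the current argument. */
        if (!in_quotes && (ch == ' ' || ch == '\n')) {
            if (in_arg) {
                *dst++ = '\0';
                in_arg = 0;
            }
            continue;
        }

        if (!in_arg) {
            if (argc + 1 >= max_args)
                break;              /* one slot is reserved for NULL */
            args[argc++] = dst;
            in_arg = 1;
        }

        if (ch == '"')
            in_quotes = !in_quotes; /* toggle the flag, drop the quote */
        else
            *dst++ = ch;
    }

    if (in_arg)
        *dst = '\0';
    args[argc] = NULL;
    return argc;
}
```

Because the list is NULL-terminated, args can be handed to execvp(args[0], args) once the count is nonzero. An unterminated quote simply runs to the end of the line here; a real program would probably want to report it.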
I'm confused even to understand what you try to do. Are you trying to tokenize the input string into space separated tokens?
Just separate the input string on spaces and when you encounter a double quote char you need a second inner loop which handles quoted strings.
There is more to quoted strings than to search for the closing quote. You need to handle backslashes, for example backslashed escaped quotes and also backslash escaped backslashes.
Just consider the following:
diff "space name \" with quotes.txt\\" foo
Which refers to a (trashy) filename space name " with quotes.txt\. Use this as a test case, then you know when you are done with the basics. Note that shell command line splitting is a lot more crazy than that.
Here is my idea:
1. Make two pointers A and B, initially pointing at the first char of the string.
2. Iterate through the string with pointer A, copying every char into an array as long as it's not a space.
3. Once you have reached a ", take the pointer B starting from the position A+1 and go forward until you reach the next ", copying everything including spaces.
4. Now repeat from step 2, starting from the char B+1.
5. Repeat as long as you haven't reached \0.
Note: You'll have to consider what to do if there are nested quotes though.
You can also use a flag (an int that is either 1 or 0) and a single pointer to denote whether you're inside a quote or not, following two separate rules based on the flag.
Write three functions. All of these should return the number of bytes they process. Firstly, the one that handles quoted arguments.
size_t handle_quoted_argument(char *str, char **destination) {
assert(*str == '\"');
/* discard the opening quote */
*destination = str + 1;
/* find the closing quote (or a '\0' indicating the end of the string) */
size_t length = strcspn(str + 1, "\"") + 1;
assert(str[length] == '\"'); /* NOTE: You really should handle mismatching quotes properly, here */
/* discard the closing quote */
str[length] = '\0';
return length + 1;
}
... then a function to handle the unquoted arguments:
size_t handle_unquoted_argument(char *str, char **destination) {
size_t length = strcspn(str, " \n");
char c = str[length];
*destination = str;
str[length] = '\0';
return c == ' ' ? length + 1 : length;
}
... then a function to handle (possibly repetitive) whitespace:
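size_t handle_whitespace(char *str) {
int whitespace_count = 0;
/* This will count consecutive whitespace characters, eg. tabs, newlines, spaces... */
/* Keep the call outside assert(), which compiles to nothing when NDEBUG is defined */
int matched = sscanf(str, " %n", &whitespace_count);
assert(matched == 0);
(void)matched;
return whitespace_count;
}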
Combining these three should be simple:
size_t n = 0, argv = 0;
while (line[n] != '\0') {
n += handle_whitespace(line + n);
n += line[n] == '\"' ? handle_quoted_argument(line + n, args + argv++)
: handle_unquoted_argument(line + n, args + argv++);
}
By breaking this up into four separate algorithms, can you see how much simpler this task becomes?
So here is where I read in the line:
while((qtemp = fgets(line, size, stdin)) != NULL ) {
if (strcmp(line, "exit\n") == 0) {
exit(EXIT_SUCCESS);
}
spaceorquotes(qtemp);
}
Then I go to this: (I haven't added my initializers, you get the idea though)
length = strlen(qtemp);
for(i = 0; i < length; i++) {
position = strcspn(qtemp, " \"\n");
while (strncmp(qtemp, " ", 1) == 0) {
memmove(qtemp, qtemp+1, length-1);
position = strcspn(qtemp, " \"\n");
} /*this while loop is for handling multiple spaces*/
if (strncmp(qtemp, "\"", 1) == 0) { /*this is for handling quotes */
memmove(qtemp, qtemp+1, length-1);
position = strcspn(qtemp, "\"");
stemp = calloc(position + 1, sizeof(char)); /* zeroed, with room for '\0', so strncat has a valid string to append to */
strncat(stemp, qtemp, position);
args[i] = stemp;
} else { /*otherwise handle it as a (single) space*/
stemp = calloc(position + 1, sizeof(char)); /* zeroed, with room for '\0', so strncat has a valid string to append to */
strncat(stemp, qtemp, position);
args[i] = stemp;
}
//printf("args: %s\n", args[i]);
length = strlen(qtemp);
memmove(qtemp, qtemp+position+1, length-position);
}
args[i-1] = NULL; /*the last position seemed to be a space, so I overwrote it with a null to terminate */
if (execvp(args[0], args) == -1) {
perror("execvp");
exit(EXIT_FAILURE);
}
I found that using strcspn helped, as modifiable lvalue suggested.

How does C know the end of my string?

I have a program in which I wanted to remove the spaces from a string. I wanted to find an elegant way to do so, and I found the following code in a forum (I've changed it a little to make it more readable):
char* line_remove_spaces (char* line)
{
char *non_spaced = line;
int i;
int j = 0;
for (i = 0; i <= strlen(line); i++)
{
if ( line[i] != ' ' )
{
non_spaced[j] = line[i];
j++;
}
}
return non_spaced;
}
As you can see, the function takes a string and, using the same allocated memory space, selects only the non-spaced characters. It works!
Anyway, according to Wikipedia, a string in C is a "Null-terminated string". I always thought this way and everything was good. But the problem is: we put no "null-character" in the end of the non_spaced string. And somehow the compiler knows that it ends at the last character changed by the "non_spaced" string. How does it know?
This does not happen by magic. You have in your code:
for (i = 0; i <= strlen(line); i++)
^^
The loop index i runs till strlen(line) and at this index there is a nul character in the character array and this gets copied as well. As a result your end result has nul character at the desired index.
If you had
for (i = 0; i < strlen(line); i++)
^^
then you had to put the nul character manually as:
for (i = 0; i < strlen(line); i++)
{
if ( line[i] != ' ' )
{
non_spaced[j] = line[i];
j++;
}
}
// put nul character
non_spaced[j] = 0;
Others have answered your question already, but here is a faster, and perhaps clearer version of the same code:
void line_remove_spaces (char* line)
{
char* non_spaced = line;
while(*line != '\0')
{
if(*line != ' ')
{
*non_spaced = *line;
non_spaced++;
}
line++;
}
*non_spaced = '\0';
}
The loop uses <= strlen so you will copy the null terminator as well (which is at i == strlen(line)).
You could try it. Debug it while it is processing a string containing only one space: " ". Watch carefully what happens to the index i.
How do you know that it "knows"? The most likely scenario is that you're simply having luck with your undefined behavior, and that there is a '\0'-character after the valid bytes of line end.
It's also highly likely that you're not seeing spaces at the end, which might be printed before hitting the stray "lucky '\0'".
A few other points:
There's no need to write this using indexing.
It's not very efficient to call strlen() on each loop iteration.
You might want to use isspace() to remove more whitespace characters.
Here's how I would write it, using isspace() and pointers:
char * remove_spaces(char *str)
{
char *ret = str, *put = str;
for(; *str != '\0'; str++)
{
if(!isspace((unsigned char) *str))
*put++ = *str;
}
*put = '\0';
return ret;
}
Note that this does terminate the space-less version of the string, so the returned pointer is guaranteed to point at a valid string.
The string parameter of your function is null-terminated, right?
And in the loop, the null character of the original string get also copied into the non spaced returned string. So the non spaced string is actually also null-terminated!
For your compiler, the null character is just another binary data that doesn't get any special treatment, but it's used by string APIs as a handy character to easily detect end of strings.
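With <= strlen(line), the loop's last iteration has i == strlen(line), which is the index of the '\0' terminator (strlen itself does not count it), so the terminator is copied too and your program works. You can step through it in a debugger to confirm.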

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because, whatever I try, I can't move the pointer to the right place. I tried strtok but I can't find how to set delimiters:
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
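enum { WORD = 1, DELIM = 2, BUFLEN = 1024 };
int c, state = DELIM, ptr = 0; /* start in DELIM so leading delimiters print nothing */
char buffer[BUFLEN];
const char *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (c != '\0' && strchr(digits, c)) {
if (WORD == state) {
if (ptr < BUFLEN - 1) buffer[ptr++] = c; /* drop digits that would overflow buffer */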
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
In case of having to deal with a more complex input file format you can use lexical analysers generators such as flex. They can do the job of defining state transitions and other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
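Both points can be seen in a short sketch: the loop is driven by the fscanf result instead of feof, and when the scanset fails to match, one character is consumed so the stream actually advances. The function name print_words and the 63-character limit are assumptions for the example, and it matches letters only, like the %[A-z] attempt.

```c
#include <stdio.h>

/*
 * Print every run of ASCII letters in fp on its own line and return how
 * many were printed. Runs longer than 63 letters are split in pieces.
 */
size_t print_words(FILE *fp)
{
    char word[64];
    size_t count = 0;
    int rc;

    /* Loop on the conversion result, not on feof(). */
    while ((rc = fscanf(fp, "%63[A-Za-z]", word)) != EOF) {
        if (rc == 1) {
            puts(word);
            count++;
        } else if (fgetc(fp) == EOF) {
            /* rc == 0 means the next character is not a letter; the
               fgetc call consumes it so the next fscanf can make
               progress. EOF here means nothing was left to skip. */
            break;
        }
    }
    return count;
}
```

Without the fgetc call, a matching failure would leave the offending character in the stream and every later call would fail on it again, which is the "reads the first word over and over" symptom.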
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c))
{
word[i++] = c;
}
word[i] = 0;
return i > 0; /* report the last word even when EOF follows it directly */
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char * p = strtok(line, ","); // assuming a , delimiter
while(p)
{
printf("%s\n", p); // print only after checking that p is not NULL
p = strtok(NULL, ",");
}

Resources