C - Read non-alphabetic chars as word boundary - c

I'm trying to parse in a text file, and add each distinct word into a hashtable, with the words as keys, and their frequencies as values. The problem is proving to be the reading part: the file is a very large file of "normal" text, in that it has punctuation and special characters. I want to treat all non-alphabetical chars read in as word-boundaries. I have something basic going with this:
char buffer[128];
while(fscanf(fp, "%127[A-Za-z]%*c", buffer) == 1) {
printf("%s\n", buffer);
memset(buffer, 0, 128);
}
However, that chokes whenever it actually hits a non-alphabetical char preceded by whitespace (e.g., "the,cat was (brown)" would be read in as "the cat was"). I know what the issue is with that code, but I'm not sure how to get around it. Would I be better off just reading in an entire line and doing the parsing manually? I'm trying scanf because I felt that this was a pretty good candidate for the mini-regex thing that you can do with the format string.

Suggest use of isalpha(), fgetc() and a simple state-machine.
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int AdamRead(FILE *inf, char *dest, size_t n) {
int ch;
do {
ch = fgetc(inf);
if (ch == EOF) return EOF;
} while (!isalpha(ch));
assert(n > 1);
n--; // save room for \0
while (n-- > 0) {
*dest++ = ch;
ch = fgetc(inf);
if (!isalpha(ch)) break;
}
ungetc(ch, inf); // Put back the lookahead character so the next call (or any other code parsing `inf`) sees it.
*dest = '\0';
return 1;
}
char buffer[128];
while(AdamRead(fp, buffer, sizeof buffer) == 1) {
printf("%s\n", buffer);
}
Note: If you want to go the "%127[A-Za-z]%*[^A-Za-z]" route, code may need to start with a one-time fscanf(fp, "%*[^A-Za-z]"); to deal with leading non-letters.
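A minimal sketch of that scanf route, wrapped in a helper so the skip over non-letters happens before every word rather than once up front. The name next_word_scanf and the fixed 128-byte buffer are assumptions made for illustration, not part of the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: stores the next run of letters in buf.
   Returns 1 if a word was read, 0 at end of input. */
int next_word_scanf(FILE *fp, char buf[128]) {
    assert(fp != NULL && buf != NULL);
    /* Skip a run of non-letters. A return of 0 is normal here (nothing is
       assigned); EOF means the stream was already exhausted. */
    if (fscanf(fp, "%*[^A-Za-z]") == EOF)
        return 0;
    /* Read at most 127 letters, leaving room for the terminating '\0'. */
    return fscanf(fp, "%127[A-Za-z]", buf) == 1;
}
```

Called in a loop such as while (next_word_scanf(fp, buffer)), this should yield the, cat, was and brown for the sample line in the question. Words longer than 127 letters are split into pieces.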

There's another way apart from the one mentioned in the comment, though I don't know if it's better. You can read lines from the file using fgets and then tokenize each line using the POSIX function strtok_r. The r means the function is reentrant, which makes it thread-safe. However, you must know the maximum length a line can have in the file.
#include <stdio.h>
#include <string.h>
#define MAX_LEN 100
// in main
char line[MAX_LEN];
char *token;
const char *delim = "!@#$%^&*"; // special characters to split on; extend as needed
char *saveptr; // for strtok_r
FILE *fp = fopen("myfile.txt", "r");
while(fgets(line, MAX_LEN, fp) != NULL) {
char *str = line; // line is an array and cannot be set to NULL, so walk it through a pointer
for(; ; str = NULL) {
token = strtok_r(str, delim, &saveptr);
if(token == NULL)
break;
else {
// token is a string.
// process it
}
}
}
fclose(fp);
strtok_r modifies its first argument line, so you should keep a copy of it if it is needed for other purposes.
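Since the question treats every non-letter as a word boundary, the delimiter string does not have to be typed out by hand. Here is a rough sketch, assuming the default "C" locale and single-byte text. The names build_nonalpha_delims and count_alpha_tokens are made up for this example, and plain strtok is used only to keep the sketch within standard C (strtok_r accepts the same delimiter string).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Fills delim with every byte value from 1 to 255 that is not a letter. */
void build_nonalpha_delims(char delim[256]) {
    assert(delim != NULL);
    size_t n = 0;
    for (int c = 1; c < 256; c++)
        if (!isalpha(c))
            delim[n++] = (char)c;
    delim[n] = '\0'; /* at most 255 delimiters, so index 255 is the last one used */
}

/* Counts the letter-only tokens in line; line is modified by strtok. */
size_t count_alpha_tokens(char *line) {
    assert(line != NULL);
    char delim[256];
    build_nonalpha_delims(delim);
    size_t n = 0;
    for (char *tok = strtok(line, delim); tok != NULL; tok = strtok(NULL, delim))
        n++; /* a real program would insert tok into the hash table here */
    return n;
}
```

The delimiter string only needs to be built once per program run; it is rebuilt on each call here just to keep the sketch short.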

Related

How to read multiple lines of string from stdin in C?

I am a novice in C programming. Suppose I want to read multiple lines of string from stdin. How can I keep reading until a line only containing EOL?
example of input
1+2\n
1+2+3\n
1+2+3+4\n
\n (stop at this line)
It seems that when I hit enter(EOL) directly, scanf won't execute until something other than just EOL has been entered. How can I solve that problem?
I'll be really grateful if someone can help me with this. Thank you.
If you want to learn C, you should avoid scanf. The only use cases where scanf actually makes sense are in problems for which C is the wrong language. Time spent learning the foibles of scanf is not well spent, and it doesn't really teach you much about C. For something like this, just read one character at a time and stop when you see two consecutive newlines. Something like:
#include <stdio.h>
int
main(void)
{
char buf[1024];
int c;
char *s = buf;
while( (c = fgetc(stdin)) != EOF && s < buf + sizeof buf - 1 ){
if( c == '\n' && s > buf && s[-1] == '\n' ){
ungetc(c, stdin);
break;
}
*s++ = c;
}
*s = '\0';
printf("string entered: %s", buf);
return 0;
}
to read multiple lines of string from stdin. How can I keep reading until a line only containing EOL?
Keep track of whether you are at the beginning of a line. If a '\n' is read at the beginning, stop.
getchar() approach:
bool beginning = true;
int ch;
while ((ch = getchar()) != EOF) {
if (beginning) {
if (ch == '\n') break;
}
// Do whatever you want with `ch`
beginning = ch == '\n';
}
fgets() approach - needs more code to handle lines longer than N
#define N 1024
char buf[N+1];
while (fgets(buf, sizeof buf, stdin) && buf[0] != '\n') {
; // Do something with buf
}
If you need to read one character at a time then you can with either getchar or fgetc depending upon whether or not you're reading from stdin or some other stream.
But you said you were reading strings, so I'm assuming fgets is more appropriate.
There are primarily two considerations:
maximum line length
whether or not to handle Windows versus non-Windows line endings
Even as a beginner you should know that you can defend against #2, although I won't go into the details here. I will at least say that if you compile on one platform and read stdin redirected from a file created on another platform, you might have to write that defense yourself.
#include <stdio.h>
#include <string.h>
#include <errno.h>
int main (int argc, char *argv[]) {
char buf[32]; // relatively small buf makes testing easier
int lineContinuation = 0;
// If no characters are read, then fgets returns NULL.
while (fgets(buf, sizeof(buf), stdin) != NULL) {
int l = strlen(buf); // No newline in buf if line len + newline exceeds sizeof(buf)
if (buf[l-1] == '\n') {
if (l == 1 && !lineContinuation) {
break; // errno should indicate no error.
}
printf("send line ending (len=%d) to the parser\n", l);
lineContinuation = 0;
} else {
lineContinuation = 1;
printf("send line part (len=%d) to the parser\n", l);
}
}
printf("check errno (%d) if you must handle unexpected end of input use cases\n", errno);
}
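For the second consideration, one possible defense is to strip the line ending before testing for an empty line, so that "\n" and "\r\n" are treated alike. A small sketch follows. chomp is a hypothetical helper name, and it assumes the buffer holds one line as returned by fgets, so the first '\r' or '\n' marks the line ending.

```c
#include <assert.h>
#include <string.h>

/* Removes a trailing "\n" or "\r\n" in place and returns the new length.
   A return value of 0 means the line held nothing but its line ending. */
size_t chomp(char *line) {
    assert(line != NULL);
    size_t len = strcspn(line, "\r\n"); /* index of the first '\r' or '\n', or of '\0' */
    line[len] = '\0';
    return len;
}
```

With this, the stop condition becomes chomp(buf) == 0 regardless of which platform produced the input.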

Splitting Strings from file and putting them into array causes program crash

I am trying to read a file line by line and split it into words. Those words should be saved into an array. However, the program only gets the first line of the text file and when it tries to read the new line, the program crashes.
FILE *inputfile = fopen("file.txt", "r");
char buf [1024];
int i=0;
char fileName [25];
char words [100][100];
char *token;
while(fgets(buf,sizeof(buf),inputfile)!=NULL){
token = strtok(buf, " ");
strcpy(words[0], token);
printf("%s\n", words[0]);
while (token != NULL) {
token = strtok(NULL, " ");
strcpy(words[i],token);
printf("%s\n",words[i]);
i++;
}
}
After the good answer from xing, I decided to write a full, simple program for your task and explain my solution. The program reads the file given as its argument line by line and appends each line to a buffer.
Code:
#include <assert.h>
#include <errno.h>
#define _WITH_GETLINE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define assert_msg(x) for ( ; !(x) ; assert(x) )
int
main(int argc, char **argv)
{
FILE *file;
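char *buf, *token;
size_t length, size;
ssize_t read; // getline(3) returns ssize_t, -1 on EOF or error
assert(argc == 2);
file = fopen(argv[1], "r");
assert_msg(file != NULL) {
fprintf(stderr, "Error occurred: %s\n", strerror(errno));
}
token = NULL;
buf = calloc(1, 1); // start from an empty string so strncat(3) finds a terminator
assert(buf != NULL);
length = size = 0;
while ((read = getline(&token, &length, file)) != -1) {
if (token[read - 1] == '\n')
token[read - 1] = ' ';
size += read;
buf = realloc(buf, size + 1); // + 1 for the terminating \0
assert(buf != NULL);
(void)strncat(buf, token, read);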
}
printf("%s\n", buf);
fclose(file);
free(buf);
free(token);
return (EXIT_SUCCESS);
}
For file file.txt:
that is a
text
which I
would like to
read
from file.
I got a result:
$ ./program file.txt
that is a text which I would like to read from file.
A few things worth saying about this solution:
Instead of fgets(3) I used getline(3), because it reports the length of the line it read (the read variable) and allocates memory for the line automatically (token). Remember to free(3) that memory. Some systems do not declare getline(3) by default, to avoid clashing with existing programs; on FreeBSD, for example, defining _WITH_GETLINE before including <stdio.h> makes the function available.
buf holds only as much space as the text needs. After each line is read, realloc(3) extends buf by the required amount, which makes this a somewhat more general solution. Remember to free objects allocated on the heap.
I also used strncat(3), which appends at most read characters (the length of token) to buf. This is still not ideal, because the code should also check for truncation, but it is generally safer than plain strcat(3), which has no length limit and so can allow a buffer overflow that lets malicious input change a running program's behavior. Both strcat(3) and strncat(3) add the terminating \0.
getline(3) returns the line including its newline character, so I replace the newline with a space (the goal here is to build a sentence from the words in the file). I should also remove the final space, but I did not want to complicate the source code.
As an optional extra, I defined my own macro assert_msg(x), which runs a block of code (here, printing an error message) before assert(3) aborts. It is only a convenience, but thanks to it we see the error message when the file cannot be opened.
The problem is getting the next token in the inner while loop and passing the result to strcpy without any check for a NULL result.
while(fgets(buf,sizeof(buf),inputfile)!=NULL){
token = strtok(buf, " ");
strcpy(words[0], token);
printf("%s\n", words[0]);
while (token != NULL) {//not at the end of the line. yet!
token = strtok(NULL, " ");//get next token. but token == NULL at end of line
//passing NULL to strcpy is a problem
strcpy(words[i],token);
printf("%s\n",words[i]);
i++;
}
}
By incorporating the check into the while condition, passing NULL as the second argument to strcpy is avoided.
while ( ( token = strtok ( NULL, " ")) != NULL) {//get next token != NULL
//if token == NULL the while block is not executed
strcpy(words[i],token);
printf("%s\n",words[i]);
i++;
}
Sanitize your loops, and don't repeat yourself:
#include <stdio.h>
#include <string.h>
int main(void)
{
FILE *inputfile = fopen("file.txt", "r");
char buf [1024];
int i=0;
char fileName [25];
char words [100][100];
char *token;
for(i=0; fgets(buf,sizeof(buf),inputfile); ) {
for(token = strtok(buf, " "); token != NULL; token = strtok(NULL, " ")){
strcpy(words[i++], token);
}
}
return 0;
}
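The loop above still trusts the input to fit into words[100][100], and it leaves the trailing newline attached to the last token of each line. A possible bounded variant is sketched below. split_words and WORD_LEN are names invented for this example, and the whitespace delimiter set is an assumption about the input.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WORD_LEN 100 /* matches the question's char words[100][100] */

/* Splits line in place and appends its tokens to words, starting at index
   count. Stops once max_words entries are filled. Returns the new count. */
size_t split_words(char *line, char words[][WORD_LEN], size_t count, size_t max_words) {
    assert(line != NULL && words != NULL);
    for (char *tok = strtok(line, " \t\r\n");
         tok != NULL && count < max_words;
         tok = strtok(NULL, " \t\r\n")) {
        /* snprintf truncates tokens longer than WORD_LEN - 1 instead of overflowing */
        snprintf(words[count++], WORD_LEN, "%s", tok);
    }
    return count;
}
```

In the fgets loop this would be called as i = split_words(buf, words, i, 100); so that i keeps counting across lines.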

Jumping to next line with fscanf()

I have two .csv files and I need to read each file in full, but field by field. CSV files hold data separated by commas, so I can't use fgets.
I need to read all the data but I don't know how to jump to the next line.
Here is what I've done so far:
int main()
{
FILE *arq_file;
arq_file = fopen("file.csv", "r");
if(arq_file == NULL){
printf("Not possible to read the file.");
exit(0);
}
while( !feof(arq_file) ){
fscanf(arq_file, "%i %lf", &myStruct[i+1].Field1, &myStruct[i+1].Field2);
}
fclose(arq_file);
return 0;
}
It gets into an infinite loop because it never moves on to the next line.
How could I reach the line below the one I just read?
Update: File 01 Example
1,Alan,123,
2,Alan Harper,321
3,Jose Rendeks,32132
4,Maria da graça,822282
5,Charlie Harper,9999999999
File 02 Example
1,320,123
2,444,321
3,250,123,321
3,3,250,373,451
2,126,621
1,120,320
2,453,1230
3,12345,0432,1830
I think an example is better than hints. This one combines fgets() and strtok(); other functions such as strchr() would work too, but this way is easier, and I only want to point you in the right direction.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int
main(void)
{
FILE *file;
char buffer[256];
char *pointer;
size_t line;
file = fopen("data.dat", "r");
if (file == NULL)
{
perror("fopen()");
return -1;
}
line = 0;
while ((pointer = fgets(buffer, sizeof(buffer), file)) != NULL)
{
size_t field;
char *token;
field = 0;
while ((token = strtok(pointer, ",")) != NULL)
{
printf("line %zu, field %zu -> %s\n", line, field, token);
field += 1;
pointer = NULL;
}
line += 1;
}
return 0;
}
I think it's very clear how the code works and I hope you can understand.
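One thing to keep in mind with strtok() is that it treats a run of delimiters as a single delimiter, so empty fields (as in "1,,3" or a line ending in a comma) are silently skipped. If empty fields matter, a splitter along the following lines might be used instead. split_csv_fields is a hypothetical name, and the sketch does not handle quoted fields.

```c
#include <assert.h>
#include <string.h>

/* Splits line in place at each ',' and stores a pointer to every field in
   out, including empty ones. Returns the number of fields stored, which is
   never more than max_fields. */
size_t split_csv_fields(char *line, char **out, size_t max_fields) {
    assert(line != NULL && out != NULL);
    size_t n = 0;
    line[strcspn(line, "\r\n")] = '\0'; /* drop the line ending left by fgets */
    char *p = line;
    while (n < max_fields) {
        out[n++] = p;
        char *comma = strchr(p, ',');
        if (comma == NULL)
            break;          /* last field on the line */
        *comma = '\0';      /* terminate the current field */
        p = comma + 1;      /* next field starts after the comma */
    }
    return n;
}
```

For the first sample file, the line "1,Alan,123," would come back as four fields, the last one empty.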
If the same code has to handle both data files, then you're stuck with reading the fields into a string, and subsequently converting the string into a number.
It is not clear from your description whether you need to do something special at the end of line or not — but because only one of the data lines ends with a comma, you do have to allow for fields to be separated by a comma or a newline.
Frankly, you'd probably do OK with using getchar() or equivalent; it is simple.
char buffer[4096];
char *bufend = buffer + sizeof(buffer) - 1;
char *curfld = buffer;
int c;
while ((c = getc(arq_file)) != EOF)
{
if (curfld == bufend)
…process overlong field…
else if (c == ',' || c == '\n')
{
*curfld = '\0';
process(buffer);
curfld = buffer;
}
else
*curfld++ = c;
}
if (c == EOF && curfld != buffer)
{
*curfld = '\0';
process(buffer);
}
However, if you want to go with higher level functions, then you do want to use fgets() to read lines (unless you need to worry about deviant line endings, such as DOS vs Unix vs old-style Mac (CR-only) line endings), or use POSIX getline() to read arbitrarily long lines. Then split the lines using strtok_r() or equivalent.
char *buffer = 0;
size_t buflen = 0;
while (getline(&buffer, &buflen, arq_file) != -1)
{
char *posn = buffer;
char *epos;
char *token;
while ((token = strtok_r(posn, ",\n", &epos)) != 0)
{
process(token);
posn = 0;
}
/* Do anything special for end of line */
}
free(buffer);
If you think you must use scanf(), then you need to use something like:
char buffer[4096];
char c;
while (fscanf(arq_file, "%4095[^,\n]%c", buffer, &c) == 2)
process(buffer);
The %4095[^,\n] scan set reads up to 4095 characters that are neither comma nor newline into buffer, and then reads the next character (which must, therefore, either be comma or newline — or conceivably EOF, but that causes problems) into c. If the last character in the file is neither comma nor newline, then you will skip the last field.
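The behaviour described above can be tried out on an in-memory string with sscanf, which takes the same format. This is only a sketch: count_fields_sscanf is a made-up helper, and %n is added so the position in the string can be advanced, which fscanf does on its own for a stream.

```c
#include <assert.h>
#include <stdio.h>

/* Counts the fields that the "%4095[^,\n]%c" pattern manages to read from s. */
int count_fields_sscanf(const char *s) {
    assert(s != NULL);
    char field[4096];
    char sep;
    int used = 0, count = 0;
    /* The loop continues only while both the field and its separator are read. */
    while (sscanf(s, "%4095[^,\n]%c%n", field, &sep, &used) == 2) {
        count++;
        s += used; /* %n stored how many characters this call consumed */
    }
    return count;
}
```

A line that ends in a newline gives all of its fields, a final field with no separator after it is not counted, and an empty field stops the loop because the scan set must match at least one character.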

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set delimiters
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter but I have literally 100s of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
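static const int WORD = 1, DELIM = 2, BUFLEN = 1024;
int state = DELIM, ptr = 0; // start in DELIM so a leading delimiter prints nothing
int c; // int, not char, so EOF can be told apart from data
char buffer[BUFLEN], *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (strchr(digits, c)) {
if (WORD == state) {
if (ptr < BUFLEN - 1) buffer[ptr++] = c; // leave room for '\0'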
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use a lexical analyser generator such as flex. It can define the state transitions and the other parts of the lexer for you.
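For the question as asked (words made of letters and digits), the same two-state idea might look like the sketch below. It works on a string rather than on stdin to keep it short, and count_alnum_words and the state names are invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum lex_state { IN_DELIM, IN_WORD };

/* Counts maximal runs of letters and digits in text. */
size_t count_alnum_words(const char *text) {
    assert(text != NULL);
    enum lex_state state = IN_DELIM;
    size_t words = 0;
    for (; *text != '\0'; text++) {
        if (isalnum((unsigned char)*text)) {
            if (state == IN_DELIM)
                words++;        /* transition delimiter -> word starts a new word */
            state = IN_WORD;
        } else {
            state = IN_DELIM;   /* anything else is a delimiter */
        }
    }
    return words;
}
```

A real lexer would copy the characters into a buffer on each step instead of only counting the transitions.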
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
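The first point can be sketched as a loop that tests the return value of fscanf instead of calling feof. count_tokens is a hypothetical helper, and the 255-character token limit is an arbitrary choice for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Counts whitespace-separated tokens in file. The loop ends as soon as
   fscanf fails to convert a token, which covers both EOF and read errors. */
size_t count_tokens(FILE *file) {
    assert(file != NULL);
    char token[256];
    size_t n = 0;
    while (fscanf(file, "%255s", token) == 1)
        n++;
    return n;
}
```

Because the read itself is the loop condition, the body never runs with stale data after the last token.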
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
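while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c)) // check for room first so no character is read and then dropped
{
word[i++] = c;
}
word[i] = 0;
return i > 0; // nonzero if a word was read, even when it is the last thing before EOF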
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char * p = strtok(line, ","); // assuming a , delimiter
while(p)
{
printf("%s\n", p); // print before fetching the next token, so NULL is never printed
p = strtok(NULL, ",");
}

Parsing text in C

I have a file like this:
...
words 13
more words 21
even more words 4
...
(General format is a string of non-digits, then a space, then any number of digits and a newline)
and I'd like to parse every line, putting the words into one field of the structure, and the number into the other. Right now I am using an ugly hack of reading the line while the chars are not numbers, then reading the rest. I believe there's a clearer way.
Edit: You can use pNum-buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++.
int len = pNum-buf;
strncpy(newBuf, buf, len); // copy the len characters that precede the space
newBuf[len] = '\0';
You could read the entire line into a buffer and then use:
char *pNum;
if (pNum = strrchr(buf, ' ')) {
pNum++;
}
to get a pointer to the number field.
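Put together, the strrchr idea might look like the following sketch. split_at_last_space is a made-up name, the text part is terminated in place rather than copied to a second buffer, and the number is assumed to be decimal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Splits a line such as "more words 21" at its last space. On success the
   text part is terminated in place, the number is stored in *value and 1 is
   returned. Returns 0 if there is no space or no valid number after it. */
int split_at_last_space(char *line, long *value) {
    assert(line != NULL && value != NULL);
    line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending, if any */
    char *sp = strrchr(line, ' ');
    if (sp == NULL)
        return 0;
    char *end;
    long v = strtol(sp + 1, &end, 10);
    if (end == sp + 1 || *end != '\0')    /* no digits, or trailing junk */
        return 0;
    *sp = '\0';                           /* line now holds only the words */
    *value = v;
    return 1;
}
```

After a successful call, line can be copied into the structure's text field and value into its number field.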
fscanf(file, "%s %d", word, &value);
This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.
Edit
Ooops, I forgot that you had spaces between the words.
In that case, I'd do the following. (Note that it truncates the original text in 'line')
// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while(*p != '\0')
{
if (*p == ' ')
lastSpace = p;
p++;
}
if (lastSpace == NULL)
return("parse error");
// Replace the last space in the line with a NUL
*lastSpace = '\0';
// Advance past the NUL to the first character of the number field
lastSpace++;
char *word = line;
int number = atoi(lastSpace);
You can solve this using stdlib functions, but the above is likely to be more efficient as you're only searching for the characters you are interested in.
Given the description, I think I'd use a variant of this (now tested) C99 code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
struct word_number
{
char word[128];
long number;
};
int read_word_number(FILE *fp, struct word_number *wnp)
{
char buffer[140];
if (fgets(buffer, sizeof(buffer), fp) == 0)
return EOF;
size_t len = strlen(buffer);
if (buffer[len-1] != '\n') // Error if line too long to fit
return EOF;
buffer[--len] = '\0';
char *num = &buffer[len-1];
while (num > buffer && !isspace((unsigned char)*num))
num--;
if (num == buffer) // No space in input data
return EOF;
char *end;
wnp->number = strtol(num+1, &end, 0);
if (end == num+1 || *end != '\0') // Missing or invalid number as last word on line
return EOF;
*num = '\0';
if ((size_t)(num - buffer) >= sizeof(wnp->word)) // Non-number part too long
return EOF;
memcpy(wnp->word, buffer, num - buffer + 1); // + 1 copies the terminating '\0' as well
return(0);
}
int main(void)
{
struct word_number wn;
while (read_word_number(stdin, &wn) != EOF)
printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
return(0);
}
You could improve the error reporting by returning different values for different problems.
You could make it work with dynamically allocated memory for the word portion of the lines.
You could make it work with longer lines than I allow.
You could scan backwards over digits instead of non-spaces - but this allows the user to write "abc 0x123" and the hex value is handled correctly.
You might prefer to ensure there are no digits in the word part; this code does not care.
You could try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string - just look at the first character of the token).
Assuming that the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf() on it with a format that skips the non-digits (for example "%*[^0-9]%d") to get the number, and then work out how many characters the number occupies at the end of the string.
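One way to apply sscanf to this format is to let a scan set capture the non-digit part and %ld capture the number in the same call. This is a sketch under the stated format (non-digits, a space, then digits). parse_words_then_number is a hypothetical name and the 128-byte word buffer is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses "some words 21" into word and *number. Returns 1 on success and
   0 if the line does not have both a non-digit part and a number. */
int parse_words_then_number(const char *line, char word[128], long *number) {
    assert(line != NULL && word != NULL && number != NULL);
    /* %127[^0-9\n] reads everything up to the first digit or newline */
    if (sscanf(line, "%127[^0-9\n]%ld", word, number) != 2)
        return 0;
    size_t len = strlen(word);
    while (len > 0 && word[len - 1] == ' ')
        word[--len] = '\0';   /* drop the separating space(s) */
    return 1;
}
```

Note that this relies on the word part containing no digits at all, which is what the question's format description states.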
Depending on how complex your strings become you may want to use the PCRE library. At least that way you can compile a perl'ish regular expression to split your lines. It may be overkill though.
Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:
#include <stdlib.h>
#include <stdio.h>
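#include <string.h>
/* Illustrative limits, not from the original answer; adjust them to your data */
#define MAX_LINE_LENGTH 256
#define MAX_STR_SIZE 64
#define MAX_NUM_STRINGS 16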
/**
* Read the next line from the file, splitting the tokens into
* multiple strings and a single integer. Assumes input lines
* never exceed MAX_LINE_LENGTH and each individual string never
* exceeds MAX_STR_SIZE. Otherwise things get a little more
* interesting. Also assumes that the integer is the last
* thing on each line.
*/
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
char buffer[MAX_LINE_LENGTH];
int rval = 1;
if (fgets(buffer, sizeof buffer, in))
{
char *token = strtok(buffer, " ");
*numStrings = 0;
while (token)
{
char *chk;
*value = (int) strtol(token, &chk, 10);
if (*chk != 0 && *chk != '\n')
{
strcpy(strs[(*numStrings)++], token);
}
token = strtok(NULL, " ");
}
}
else
{
/**
* fgets() hit either EOF or error; either way return 0
*/
rval = 0;
}
return rval;
}
/**
* sample main
*/
int main(void)
{
FILE *input;
char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
int numStrings;
int value;
input = fopen("datafile.txt", "r");
if (input)
{
while (getNextLine(input, strings, &numStrings, &value))
{
/**
* Do something with strings and value here
*/
}
fclose(input);
}
return 0;
}
