Reading data from a file, only alpha characters - c

I'm working on a program for school right now in c and I'm having trouble reading text from a file. I've only ever worked in Java before so I'm not completely familiar with c yet and this has got me thoroughly stumped even though I'm sure it's pretty simple.
Here's an example of how the text can be formatted in the file we have to read:
boo22$Book5555bOoKiNg#bOo#TeX123tEXT(JOHN)
I have to take in each word and store it in a data structure, and a word is only alpha characters, so no numbers or special characters. I already have the data structure working properly so I just need to get each word into a char array and then add it to my structure. It has to keep reading each char until it gets to a non-alpha char value. I've tried looking into the different ways to scan in from a file and I'm not sure what would be best for my scenario.
Here's the code I have right now for my input:
char str[MAX_WORD_SIZE];
char c;
int index = 0;
while (fscanf(dictionaryInputFile, "%c", c) != EOF) //while not at end of file
{
if (isalpha(c)) //if current character is a letter
{
tolower(c); //ignores case in word
str[index] = c; //add char to string
index++;
}
else if (str[0] != '\0') //If a word
{
str[index] = '\0'; //Make sure no left over characters in String
dictionaryRoot = insertNode(str, dictionaryRoot); //insert word to dictionary
index = 0; //reset index
str[index] = '\0'; //Set first character to null since word has been added
}
}
My thinking was that if it doesn't hit that first if statement then I have to check if str is a word or not, that's why it checks if the 0 index of str is null or not. I'm guessing the else if statement I have is not right though, but I can't figure out a way to end the current word I'm building and then reset str to null when it's added to my data structure. Right now when I run this I get a segmentation fault if I pass the txt file as an argument.
I'd just like to know if I'm on the right track and if not maybe some help on how I should be reading this data.
This is my first time posting here so I hope I included everything you'll need to help me, if not just let me know and I'd be happy to add more information.

Biggest problem: Incorrect use of fscanf(). #BLUEPIXY
// while (fscanf(dictionaryInputFile, "%c", c) != EOF)
while (fscanf(dictionaryInputFile, "%c", &c) != EOF)
No protection against overflow.
// str[index] = c; //add char to string
if (index >= MAX_WORD_SIZE - 1) Handle_TooManySomehow();
Not sure why testing against '\0' when '\0' is also a non-alpha.
Pedantically, isalpha() is problematic when a negative char value is passed, which can happen on platforms where char is signed. Better to pass the unsigned char value, is...((unsigned char) c), once the code knows it is not EOF. Alternatively, save the input using int ch = fgetc(stream) and use is...(ch).
Minor: Better to use size_t for array indexes than int, but be careful as size_t is unsigned. size_t is important should the array become large, unlike in this case.
Also, when EOF received, any data in str is ignored, even if it contained a word. #BLUEPIXY.
For the most part, OP is on the right track.
What follows is an untested sample approach that illustrates how to avoid overflowing the buffer.
Test for a full buffer, then read in a char if needed. If a non-alpha is found, add the word to the dictionary if a non-zero length word was accumulated.
char str[MAX_WORD_SIZE];
int ch;
size_t index = 0;
for (;;) {
if ((index >= sizeof str - 1) ||
((ch = fgetc(dictionaryInputFile)) == EOF) ||
(!isalpha(ch))) {
if (index > 0) {
str[index] = '\0';
dictionaryRoot = insertNode(str, dictionaryRoot);
index = 0;
}
if (ch == EOF) break;
}
else {
str[index++] = tolower(ch);
}
}
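For readers who want something they can compile on its own, here is a minimal sketch of the same idea written as a pull-style helper. The name next_alpha_word and the value of MAX_WORD_SIZE are assumptions made for illustration, and unlike the loop above this variant drops the extra letters of an over-long word instead of splitting it.

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_WORD_SIZE 64 /* assumed limit; the question does not give a value */

/* Hypothetical helper: reads the next run of letters from fp into word,
 * lowercased. Returns 1 if a word was stored, 0 if end of file came first. */
int next_alpha_word(FILE *fp, char word[MAX_WORD_SIZE])
{
    size_t index = 0;
    int ch; /* int, so EOF can be told apart from real characters */

    while ((ch = fgetc(fp)) != EOF) {
        if (isalpha(ch)) {
            /* keep one slot free for the terminator; extra letters are dropped */
            if (index < MAX_WORD_SIZE - 1)
                word[index++] = (char)tolower(ch);
        } else if (index > 0) {
            break; /* a non-letter ends the current word */
        }
        /* non-letters before a word are simply skipped */
    }
    word[index] = '\0';
    return index > 0;
}
```

A caller would loop with while (next_alpha_word(dictionaryInputFile, str)) and pass str to insertNode inside the loop body.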

Related

how to stop my program from skipping characters before saving them

I am making a simple program to read from a file character by character, puts them into tmp and then puts tmp in input[i]. However, the program saves a character in tmp and then saves the next character in input[i]. How do I make it not skip that first character?
I've tried to read into input[i] right away but then I wasn't able to check for EOF flag.
FILE * file = fopen("input.txt", "r");
char tmp;
char input[5];
tmp= getc(file);
input[0]= tmp;
int i=0;
while((tmp != ' ') && (tmp != '\n') && (tmp != EOF)){
tmp= getc(file);
input[i]=tmp;
length++;
i++;
}
printf("%s",input);
It's supposed to print "ADD $02", but instead it prints "DD 02".
You are doing things in the wrong order. The way your code is structured, reading and storing the first char is moved out of the loop. Inside the loop, the next char is then read and stored at the same index 0, so the first char is overwritten. If you keep this structure, start with i = 1.
Perhaps you want to read the first character anyway, but I guess you want to read everything up to the first space, which might be the first character. Then do this:
#include <stdio.h>
int main(void)
{
char input[80];
int i = 0;
int c = getchar();
while (c != ' ' && c != '\n' && c != EOF) {
if (i + 1 < sizeof(input)) { // store char if there is room
input[i++] = c;
}
c = getchar();
}
input[i] = '\0'; // null-terminate input
puts(input);
return 0;
}
Things to note:
The first character is read before the loop. The loop condition and the code that stores the char then use that char. Just before the end of the loop body, the next char is read, which will then be processed in the next iteration.
Your code doesn't guard against writing past the end of the char buffer input. This is dangerous, especially since your buffer is tiny.
When you construct a string char by char, you should null-terminate it by placing an explicit '\0' at the end. You have to make sure that there is space for that terminator. Nearly all system functions like puts or printf("%s", ...) expect the string to be null-terminated.
Make the result of getchar an int, so that you can distinguish between all valid character codes and the special value EOF.
The code above is useful if the first and subsequent calls to get the next item are different, for example when tokenizing a string with strtok. Here, you can also choose another approach:
while (1) { // "infinite loop"
int c = getchar(); // read a char first thing in a loop
if (c == ' ' || c == '\n' || c == EOF) break;
// explicit break when done
if (i + 1 < sizeof(input)) {
input[i++] = c;
}
}
This approach has the logic of processing the chars in the loop body only, but you must wrap it in an infinite loop and then use the explicit break.

How to reuse strings in C without old values showing up in the memory?

I'm sure this is extremely simple and probably gets asked a lot, but this is driving me absolutely crazy and I can't even figure out how to properly word my question to search for an answer.
Basically, I'm reading a txt file (in C) and identifying how many times a word appears.
I grab an entire line from the txt file using getline(),
copy every character to a string until I reach a space,
and send the new string to another function that parses out invalid characters.
The problem I'm running into is each time it goes through the loop, it keeps the old characters in the string and just replaces them. I'm trying to set this up so after a word is passed to parseWord, that temporary string named newWord is reset (and empty). Likewise, cleanWord in the parseWord function is doing the same thing.
I'm sure there is an easy solution to this, but I just don't understand how to do it and it's becoming extremely frustrating. Any help would be very appreciated.
void readFiles(FILE *file1, List *theList, int fileNum) {
int i, lineIndex;
char *newLine;
size_t lineLength = 0;
while(lineLength=getline(&newLine, &lineLength, file1)>0){
lineIndex = 0;
i = 0;
char *newWord;//saves individual words
while(newLine[lineIndex] != '\0'){ //move to new space
if(newLine[lineIndex] == ' '){
//insert(&theList, parseWord(i, newWord), fileNum);
parseWord(i, newWord);
i = 0;
}else{
newWord[i] = newLine[lineIndex];
i++;
}
lineIndex++;
}
}
}
char *parseWord(int theLen, char *theWord){
char cleanWord[theLen]; //the word without other stuff
char *finalWord;
int i, j;
for(j = i = 0; i < theLen; i++) {
char tmp = theWord[i];
if (tmp >= 'A' && tmp <= 'Z') {
cleanWord[j] = tolower((unsigned char) theWord[i]);
j++;
} else if ((tmp >= 'a' && tmp <= 'z') || tmp == 39 || tmp == 45) {
cleanWord[j] = theWord[i];
j++;
}
}
return strcpy(finalWord, cleanWord);
}
For example: the first line being read is: The Red Badge of Courage
When the word gets passed into the second function, for theWord I get:
The
Red
Badge
ofdge (this should be of)
A lot of good answers in the comments all of which helped me solve my problem.
I started adding a null terminating character to the end of each string copy which helped eliminate unnecessary characters.
I still don't quite understand the proper method for reusing strings but for all intents and purposes, my problem was solved. Thanks everyone.
Note: Part of my problem was I was trying to reset each string by using
cleanWord = '\0';
which was causing a segfault the next time it tried to assign a char to it.
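As a sketch of what "reusing" a string usually means in C: the buffer itself is never emptied, the terminator is simply written at the new end, and everything after it is ignored. The function name clean_word and its parameters are made up for this illustration; it assumes the caller owns a real char array (not an uninitialised pointer) and passes its size.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the letters, apostrophes and hyphens found in
 * src[0..len) into dst, lowercasing letters, and always terminates dst.
 * Returns the length of the cleaned word. */
size_t clean_word(char *dst, size_t dst_size, const char *src, size_t len)
{
    size_t j = 0;

    if (dst_size == 0)
        return 0; /* no room even for the terminator */

    for (size_t i = 0; i < len && j + 1 < dst_size; i++) {
        unsigned char ch = (unsigned char)src[i];
        if (isalpha(ch) || ch == '\'' || ch == '-')
            dst[j++] = (char)tolower(ch);
    }
    dst[j] = '\0'; /* the terminator is what hides the previous contents */
    return j;
}
```

Calling it twice on the same buffer, first with "Badge" and then with "of", leaves "of" in the buffer, because the second call writes its own terminator after two characters. To reset a buffer explicitly, write buf[0] = '\0' rather than assigning to the pointer itself.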

Same Array in Different Procedures

I'm really new to C, and currently I'm trying to read in from a file which contains a list of names, and import that into an array. The current array is of type char[][] since it will have more information than just the name, but essentially I want team[0][0] to be the first name i read in, team[1][0] to be the second, etc. I'm pretty sure the actual importing of the names is correct, but I'm having problems storing these arrays.
FILE *teamfile;
teamfile = fopen(file, "r");
char line[MAXLENGTH+1];
int i = 0;
while( fgets(line, sizeof line, teamfile) != NULL )
{
trim_line(line);
strcpy(&team[i][NAME],line);
i++;
}
fclose(teamfile);
Which is called from the main function as teams = teamlist(argv[1], team);
But when I try to refer to the array from elsewhere in my program eg printf(&team[0][0]) it outputs what seems to be all names in one block...
What am I doing wrong?
edit:
static void trim_line(char line[])
{
int i = 0;
// LOOP UNTIL WE REACH THE END OF line
while(line[i] != '\0')
{
// CHECK FOR CARRIAGE-RETURN OR NEWLINE
if( line[i] == '\r' || line[i] == '\n' )
{
line[i] = '\0'; // overwrite with nul-byte
break; // leave the loop early
}
i = i+1; // iterate through character array
}
}
thanks for the help so far! :D
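If team is declared as char team[NUM_OF_TEAMS][LENGTH_OF_NAME],
then the copy should always be strcpy(team[i], line); since team[i] is the i-th row and decays to a pointer to its first char.
Hint: it is a char array, not a "string object" in C. A single element team[i][j] holds one char, so each name needs a whole row.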
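A minimal sketch of that layout, assuming the array sizes below (the question does not state them) and using a made-up helper name, set_team_name. It uses snprintf instead of strcpy so that an over-long name is truncated rather than spilling into the next row.

```c
#include <stdio.h>

#define NUM_OF_TEAMS 4     /* assumed sizes, for illustration only */
#define LENGTH_OF_NAME 32

/* Hypothetical helper: copies name into row i of team. Each row is an
 * independent, null-terminated string of at most LENGTH_OF_NAME - 1 chars. */
void set_team_name(char team[][LENGTH_OF_NAME], size_t i, const char *name)
{
    snprintf(team[i], LENGTH_OF_NAME, "%s", name);
}
```

After set_team_name(team, 0, "Lions") and set_team_name(team, 1, "Tigers"), printf("%s\n", team[0]) prints only the first name, because row 0 has its own terminator.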

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n",word,len);
line+=len;
This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set delimiters
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number on a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
enum { WORD = 1, DELIM = 2, BUFLEN = 1024 };
int state = WORD, ptr = 0;
int c; // int rather than char so EOF can be told apart from real characters
char buffer[BUFLEN], *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (strchr(digits, c)) {
if (WORD == state) {
buffer[ptr++] = c;
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use a lexical analyser generator such as flex. It can do the job of defining state transitions and other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c))
{
word[i++] = c;
}
word[i] = 0;
return i > 0; // true if a word was stored, even when it ends at EOF
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
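The same skip-then-read pattern also works on a line that is already in memory, which is roughly what the sscanf attempt in the question was aiming for. The sketch below is an illustration under assumptions: the helper name next_word_in_line is made up, it treats only letters as word characters like the fscanf version above, and it relies on range syntax such as A-Z inside a scanset, which is implementation-defined although common C libraries support it.

```c
#include <stdio.h>

/* Hypothetical helper: extracts the next run of letters starting at *cursor
 * into word (at most 199 chars) and advances *cursor past it.
 * Returns 1 on success, 0 when no word remains. */
int next_word_in_line(const char **cursor, char word[200])
{
    int skipped = 0, used = 0;

    /* Skip leading non-letters. If there are none, the scanset fails to
     * match, %n is never reached, and skipped stays 0. */
    sscanf(*cursor, "%*[^A-Za-z]%n", &skipped);
    *cursor += skipped;

    /* Read the word itself; %n records how many chars were consumed. */
    if (sscanf(*cursor, "%199[A-Za-z]%n", word, &used) != 1)
        return 0;
    *cursor += used;
    return 1;
}
```

The key point is that %n takes the address of an int and that the pointer is advanced by that count after every call, so the scan never restarts from the same position.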
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char * p = strtok(line, ","); // assuming a , delimiter
while(p)
{
printf("%s\n", p); // print only while p is a valid token
p = strtok(NULL, ",");
}

Strcat throws segmentation fault on simple getch-like password input

I am using Linux and there is a custom function which returns the ASCII code of the current key as an int, sort of like getch(). While getting used to it and working out how to store the password I ran into an issue. My code is as follows:
int main() {
int c;
char pass[20] = "";
printf("Enter password: ");
while(c != (int)'\n') {
c = mygetch();
strcat(pass, (char)c);
printf("*");
}
printf("\nPass: %s\n", pass);
return 0;
}
Unfortunately I get the warning from GCC:
pass.c:26: warning: passing argument 2 of ‘strcat’ makes pointer from integer without a cast
/usr/include/string.h:136: note: expected ‘const char * __restrict__’ but argument is of type ‘char’
I tried using pointers instead of a char array for pass, but the second I type a letter it segfaults. The function works on its own but not in the loop, at least not like getch() would on a Windows system.
What can you see is wrong with my example? I am enjoying learning this.
EDIT: Thanks to the answers I came up with the following silly code:
int c;
int i = 0;
char pass[PASS_SIZE] = "";
printf("Enter password: ");
while(c != LINEFEED && strlen(pass) != (PASS_SIZE - 1)) {
c = mygetch();
if(c == BACKSPACE) {
//ensure cannot backspace past prompt
if(i != 0) {
//simulate backspace by replacing with space
printf("\b \b");
//get rid of last character
pass[i-1] = 0; i--;
}
} else {
//passed a character
pass[i] = (char)c; i++;
printf("*");
}
}
pass[i] = '\0';
printf("\nPass: %s\n", pass);
The problem is that strcat expects a char * as its second argument (it concatenates two strings). You don't have two strings, you have one string and one char.
If you want to add c to the end of pass, just keep an int i that stores the current size of pass and then do something like
pass[i] = (char) c.
Make sure to null-terminate pass when you are done (by setting the last position to 0).
A single character is not the same as a string containing a single character.
In other words, 'a' and "a" are very different things.
A string, in C, is a null-terminated array of chars. Your "pass" is an array of 20 chars - a block of memory containing space for 20 chars.
The function mygetch() returns a char.
What you need to do is to insert c into one of the spaces.
Instead of "strcat(pass, c)", you want to do "pass[i] = c", where i starts at zero, and increments by one for every time you call mygetch().
Then you need to do a pass[i] = '\0', when the loop is done, with i equal to the number of times you called mygetch(), to add the null terminator.
Your other problem is that you haven't set a value for c the first time you check to see if it's '\n'. You want to call mygetch() before you do the comparison:
int i = 0;
for (;;)
{
c = mygetch(); // read exactly one key per iteration
if (c == '\n')
break;
pass[i++] = c;
}
pass[i] = '\0';
Over and above the correctly diagnosed issue with strcat() taking two strings -- why did you ignore the compiler warnings, or if there were no warnings, why don't you have warnings turned on? As I was saying, over and above that problem, you also need to consider what happens if you get EOF, and you also need to worry about the initial value of 'c' (which could accidentally be '\n' though it probably isn't).
That leads to code like this:
int c;
char pass[20] = "";
char *end = pass + sizeof(pass) - 1;
char *dst = pass;
while ((c = getchar()) != EOF && c != '\n' && dst < end)
*dst++ = c;
*dst = '\0'; // Ensure null termination
I switched from 'mygetch()' to 'getchar()' - primarily because what I say applies to that and might not apply to your 'mygetch()' function; we don't have a specification of what that function does on EOF.
Alternatively, if you must use strcat(), you still need to keep a track on the length of the string, but you can do:
char c[2] = "";
char pass[20] = "";
char *end = pass + sizeof(pass) - 1;
char *dst = pass;
while (c[0] != '\n' && dst < end)
{
c[0] = mygetch();
strcat(dst, c);
dst++;
}
Not as elegant as all that: using strcat() in this context is overkill. You could, I suppose, do simple counting and repeatedly use strcat(pass, c), but that has quadratic behaviour as strcat() has to skip over 0, 1, 2, 3, ... characters on the subsequent iterations. By contrast, the solution where dst points to the NUL at the end of the string means that strcat() doesn't have to skip anything. With a fixed size addition of 1 character, though, you're probably better off with the first loop.
