Parsing contents of a text file in C (deleting parts, storing others)

I have a basic .txt file that may contain an unknown number of entries, each exactly in this format, and I need to extract the part after the '=' separator. For example:
variable1=Hello
variable2=How
variable3=Are
variable4=You?
I need to extract "Hello" "How" "Are" and "You?" separately and store them into an array(removing/ignoring the variable name) and being able to call each word individually. I'm doing this in C and here is what I currently have.
#include <stdio.h>
#include <string.h>
int main()
{
char*result;
char copy[256];
FILE * filePtr;
filePtr = fopen("testfile.txt", "r+");
strcpy(copy, "testfile.txt");
while(fgets(copy, 256, filePtr)!= NULL)
{
result = strchr(copy, '=');
result = strtok(NULL, "=");
printf("%s",result);
if(result != 0)
{
*result = 0;
}
result = strtok(copy, "=");
}
return 0;
}
My current output is
(null)How
Are
You?

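You do not need strtok; strchr is enough.
There is no need to copy the filename into the copy buffer.
It is probably not necessary to open the file in update mode "r+" either; "r" is enough for reading.
The (null) on your first line most likely comes from calling strtok(NULL, "=") before strtok has been given a string to scan.
Here is a corrected version: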
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char *words[20];
int n = 0;
char *result;
char copy[256];
FILE *filePtr;
filePtr = fopen("testfile.txt", "r");
while (fgets(copy, 256, filePtr) != NULL) {
copy[strcspn(copy, "\n")] = '\0'; /* strip the \n if present */
result = strchr(copy, '=');
if (result != NULL) {
words[n++] = strdup(result + 1);
printf("%s ", result + 1);
}
}
printf("\n");
fclose(filePtr);
return 0;
}
Note the one-liner that strips the trailing \n left at the end of copy by fgets(): copy[strcspn(copy, "\n")] = '\0';. It works even if fgets() did not see a \n before the end of the buffer or before the end of file. strcspn returns the number of characters at the start of copy that are not in the second argument, so it returns the length of the line without the \n.
The words are collected into an array words of pointers to strings. Each word is copied by strdup into memory allocated with malloc. strdup was not part of Standard C before C23, but it is part of POSIX and probably present in your environment, possibly written as _strdup.
Note also that you should also test for failure to open the file, failure to allocate memory in strdup, and also handle more than 20 strings...
If there is a fixed set of words and you just want to strip the initial parts, you can use a simpler hardcoded approach:
#include <stdio.h>
int main(void) {
char word1[20], word2[20], word3[20], word4[20];
FILE *filePtr;
filePtr = fopen("testfile.txt", "r");
if (fscanf(filePtr,
"%*[^=]=%19[^\n]%*[^=]=%19[^\n]%*[^=]=%19[^\n]%*[^=]=%19[^\n]",
word1, word2, word3, word4) == 4) {
printf("%s %s %s %s\n", word1, word2, word3, word4);
// perform whatever task with the arrays
} else {
printf("parse failed\n");
}
fclose(filePtr);
return 0;
}

Related

Two words in a string from text file

I'm trying to get two words in a string and I don't know how I can do it. I tried but if in a text file I have 'name Penny Marie' it gives me :name Penny. How can I get Penny Marie in s1? Thank you
#include <stdio.h>
#include <stdlib.h>
int main()
{
printf("Hello world!\n");
char s[50];
char s1[20];
FILE* fp = fopen("file.txt", "rt");
if (fp == NULL)
return 0;
fscanf(fp,"%s %s",s,s1);
{
printf("%s\n",s);
printf("%s",s1);
}
fclose(fp);
return 0;
}
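Change the fscanf format so that the second conversion keeps reading until the newline (no s is needed after the closing bracket, and the widths keep both reads inside their buffers):
fscanf(fp,"%49s %19[^\n]",s,s1);
You should use fgets.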
Or you can try to do this :
fscanf(fp,"%s %s %s", s0, s, s1);
{
printf("%s\n",s);
printf("%s",s1);
}
and declare s0 as a char array (for example char s0[50]) so that it can receive the leading word "name".
The other answers adjust your fscanf call for your stated need, although fscanf() is not generally the best way to do what you are asking. Your question is specifically about getting two words, Penny and Marie, from a line in a file that contains: name Penny Marie. But, as asked in the comments, what if the file contains more than one line that needs to be parsed, or the names have a variable number of parts? Generally, the following functions and techniques are more suitable and more commonly used to read a file and parse its content into strings:
fopen() and its arguments.
fgets()
strtok() (or strtok_r())
How to determine count of lines in a file (useful for creating an array of strings)
How to read lines of file into array of strings.
Deploying these techniques and functions can be adapted in many ways to parse content from files. To illustrate, a small example using these techniques is implemented below that will handle your stated needs, including multiple lines per file and variable numbers of names in each line.
Given File: names.txt in local directory:
name Penny Marie
name Jerry Smith
name Anthony James
name William Begoin
name Billy Jay Smith
name Jill Garner
name Cyndi Elm
name Bill Jones
name Ella Fitz Bella Jay
name Jerry
The following reads the file to determine its number of lines and its longest line, creates an array of strings, then populates each string in the array with the names in the file, regardless of the number of parts in each name.
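#include <stdio.h>
#include <string.h>
int count_of_lines(char *filename, int *longest); // defined below main
int main(void)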
{
// get count of lines in file:
int longest=0, i;
int count = count_of_lines(".\\names.txt", &longest);
// create array of strings with information from above
char names[count][longest+2]; // +2 - newline and terminating null character
char temp[longest+2];
char *tok;
FILE *fp = fopen(".\\names.txt", "r");
if(fp)
{
for(i=0;i<count;i++)
{
if(fgets(temp, longest+2, fp))// read next line
{
tok = strtok(temp, " \n"); // throw away "name" and space
if(tok)
{
tok = strtok(NULL, " \n");//capture first name of line.
if(tok)
{
strcpy(names[i], tok); // write first name element to string.
tok = strtok(NULL, " \n");
while(tok) // continue until all name elements in line are read
{ //concatenate remaining name elements
strcat(names[i], " ");// add space between name elements
strcat(names[i], tok);// next name element
tok = strtok(NULL, " \n");
}
}
}
}
}
fclose(fp); // close the file once all lines are read
}
return 0;
}
// returns count, and passes back longest
int count_of_lines(char *filename, int *longest)
{
int count = 0;
int len=0, lenKeep=0;
int c;
FILE *fp = fopen(filename, "r");
if(fp)
{
c = getc(fp);
while(c != EOF)
{
if(c != '\n')
{
len++;
}
else
{
lenKeep = (len < lenKeep) ? lenKeep : len;
len = 0;
count++;
}
c = getc(fp);
}
fclose(fp);
*longest = lenKeep;
}
return count;
}
Change your fscanf line to fscanf(fp, "%s %s %s", s, s1, s2).
Then you can printf your s1 and s2 variables to get "Penny" and "Marie".
Try the function fgets
fp = fopen("file.txt" , "r");
if(fp == NULL) {
perror("Error opening file");
return(-1);
}
if( fgets (str, 60, fp)!=NULL ) {
/* writing content to stdout */
puts(str);
}
fclose(fp);
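The code above prints the first line of the file, up to a maximum of 59 characters, because fgets stores at most one character fewer than the size it is given and then adds the terminating null character. If str is declared as an array, you can pass sizeof str instead of the hard-coded 60 so that the limit always matches the buffer.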

strstr() causing a segmentation fault error

The objective here is to take a whole text file that I dump into a buffer and then use the strcasestr() function to find the pointer of the word I am looking for within my buffer. It constantly gives me the segmentation fault error. At first, I thought it may be size so I tried with smaller sizes but it doesn't work either. The function only works with strings I create inside the actual code (ex : char * bob = "bob"; char * bobsentence = "bob is cool"; strstr(bobsentence, bob);). Which leads me to believe it has something to do with the fgets(). Any help is appreciated, really stuck on this one.
#define _GNU_SOURCE //to use strcasestr
#include <unistd.h>
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
void textEdit(char *path, char *word){
printf("%s\n", path);
FILE *textFile;
//FILE *locationFile;
//FILE *tempFile;
char counter[1024];
int count = 0;
textFile = fopen(path, "r+");
//locationFile = fopen(path, "r+");
//opens file to read and write and opens temp file to write
if( textFile == NULL){ //|| tempFile == NULL || locationFile == NULL) ) {
printf ("\nerror\n");
return;
}
// SECTION : ALLOCATES MEMORY NEEDED FOR COPY TEXT IN ARRAY
// finds number of lines to estimate total size of array needed for buffer
while((fgets(counter, sizeof(counter), textFile)) != NULL){
count++;
}
fclose(textFile);
FILE *tempFile = fopen(path, "r+");
count *= 1024;
printf("%d %zu\n",count, sizeof(char));
char *buffer = malloc(count); //1024 is the max number of characters per line in a traditional txt
if(buffer == NULL){ //error with malloc
return;
}
// SECTION : DUMPS TEXT INTO ARRAY
if(fgets(buffer, count, tempFile) == NULL){
printf("error");
} //dumps all text into array
printf("%s\n", buffer);
char * searchedWord;
while((searchedWord = strcasestr(buffer, word)) != NULL){
}
fclose(tempFile);
//fclose(locationFile);
free(buffer);
}
It looks like you forgot to initialize the count variable to 0:
int count = 0;
Without that, you increment a variable that can hold any value, even a negative one.
Also, note that your use of strcasestr doesn't look correct. The function returns a pointer to the first occurrence that matches. It doesn't remember matches it has already found, so if a match exists, your loop will find the same one forever. Instead it should look like:
char *pos = buffer;
while((pos = strcasestr(pos, word)) != NULL){
searchedWord = pos;
/* do something with searchedWord but remember that it belongs to
allocated buffer and can't be used after free() */
pos++;
}

Word palindrome in C

My task is to find word palindromes in a text file and to NOT print them into the results file. The results file should only contain all the spaces and words that are NOT palindromes. I've been working on this program for two solid weeks, but as I am a total newbie in C, I simply can't imagine how to do this correctly. Also, I have to work in a Linux environment, so I can't use functions like strrev() which would make my life a lot easier at this point...
Anyways, data file contains a lot of words in a lot of lines separated by quite a few spaces.
Here is the program that is working, but doesn't work with any spaces, because I don't know how to check them at the needed place.
#include <stdio.h>
#include <string.h>
const int CMAX = 1000;
const int Dydis = 256;
FILE *dataFile;
FILE *resFile;
void palindrome(char *linex);
int main(){
char duom[CMAX], res[CMAX], linex[Dydis];
printf("What's the name of data file? \n");
scanf("%s", duom);
dataFile=fopen(duom, "r");
if (dataFile==NULL){
printf ("Error opening data file \n");
return 0;
};
printf("What's the name of results file? \n");
scanf ("%s", res);
resFile=fopen(res, "w");
if (resFile==NULL){
printf ("Error opening results file \n");
return 0;
};
while (fgets(linex, sizeof(linex), dataFile)) {
palindrome(linex);
}
printf ("all done!");
fclose(dataFile);
fclose(resFile);
}
void palindrome(char *linex){
int i, wordlenght, j;
j = 0;
char *wordie;
const char space[2] = " ";
wordie = strtok(linex, space);
while ( wordie != NULL ) {
wordlenght = strlen(wordie);
if (wordie[j] == wordie[wordlenght-1]) {
for (i = 0; i < strlen(wordie); i++) {
if (wordie[i] == wordie[wordlenght-1]) {
if (i == strlen(wordie)-1) {
fprintf(resFile,"");
}
wordlenght--;
}
else {
fprintf(resFile,"%s", wordie);
break;
}
}
}
else {
fprintf(resFile,"%s", wordie);
}
wordie = strtok(NULL, space);
}
}
EDIT:
The code below works as follows:
the input file is read char by char
if the char read isn't alphabetic, it is written to the output file
else, the whole word is read with fscanf
if the word is not a palindrome, it is written to the output file
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include <stdlib.h>
int is_pal(char* word) {
size_t len = strlen(word);
char* begin = word;
char* end = word + len - 1;
if (len == 1) {
return 1;
}
while (begin <= end) {
if (*begin != *end) {
return 0;
}
begin++;
end--;
}
return 1;
}
int main(void)
{
FILE* fin = fopen("pals.txt", "r");
if (fin == NULL) {
perror("fopen");
exit(1);
}
FILE* fout = fopen("out_pals.txt", "w");
if (fout == NULL) {
perror("fopen");
exit(1);
}
int ret;
char word[100];
while ((ret = fgetc(fin)) != EOF) {
if (!isalpha(ret)) {
fprintf(fout, "%c", ret);
}
else {
ungetc(ret, fin);
fscanf(fin, "%s", word);
if (!is_pal(word)) {
fprintf(fout, "%s", word);
}
}
}
fclose(fin);
fclose(fout);
return 0;
}
I've created file with following content:
cancer kajak anna sam truck
test1 abc abdcgf groove void
xyz annabelle ponton belowoleb thing
cooc ringnir
The output file :
cancer sam truck
test1 abc abdcgf groove void
xyz annabelle ponton thing
(line with two spaces)
As you can see, the number of spaces between words is the same as in the input file.
I've assumed that a single word has at most 99 characters. Longer words would overflow the fixed-size buffer, because fscanf with a bare %s does not limit how much it reads; a field width such as %99s prevents that.
Hints:
strtok() returns a pointer to the start of each token and overwrites the delimiter that follows it with a null character, so each token is a proper string inside the original buffer. It does not copy the token anywhere, and the delimiters (the spaces you need to keep) are lost.
strlen() counts characters from the char* it gets up to the next null character. On a strtok() token that is the word length; on a pointer into an untokenized sentence it is the length from the start of the word to the end of the sentence, so there you need your own logic to find the end of each word.
Breaking palindrome() into a function that loops over the words in a line and a function that returns whether or not a single word is a palindrome may help.
Your for loop checks each pair of letters twice. i only needs to scan over half of the word length.
You only need a single if within palindrome(). The others are redundant.

Get the length of each line in file with C and write in output file

I am a biology student and I am trying to learn perl, python and C and also use the scripts in my work. So, I have a file as follows:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this, that is the name of each sequence and the count of characters in each line and printing the total number of sequences in the end of the file.
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
I could make the perl and python scripts work, this is the python script as an example:
#!/usr/bin/python
import sys
my_file = open(sys.argv[1]) #open the file
my_output = open(sys.argv[2], "w") #open output file
total_sequence_counts = 0
for line in my_file:
    if line.startswith(">"):
        sequence_name = line.rstrip('\n').replace(">","")
        total_sequence_counts += 1
        continue
    dna_length = len(line.rstrip('\n'))
    my_output.write(sequence_name + " " + str(dna_length) + '\n')
my_output.write("Total number of sequences = " + str(total_sequence_counts) + '\n')
Now, I want to write the same script in C, this is what I have achieved so far:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
input = FILE *fopen(const char *filename, "r");
output = FILE *fopen(const char *filename, "w");
double total_sequence_counts = 0;
char sequence_name[];
char line [4095]; // set a temporary line length
char buffer = (char *) malloc (sizeof(line) +1); // allocate some memory
while (fgets(line, sizeof(line), filename) != NULL) { // read until new line character is not found in line
buffer = realloc(*buffer, strlen(line) + strlen(buffer) + 1); // realloc buffer to adjust buffer size
if (buffer == NULL) { // print error message if memory allocation fails
printf("\n Memory error");
return 0;
}
if (line[0] == ">") {
sequence_name = strcpy(sequence_name, &line[1]);
total_sequence_counts += 1
}
else {
double length = strlen(line);
fprintf(output, "%s \t %ld", sequence_name, length);
}
fprintf(output, "%s \t %ld", "Total number of sequences = ", total_sequence_counts);
}
int fclose(FILE *input); // when you are done working with a file, you should close it using this function.
return 0;
int fclose(FILE *output);
return 0;
}
But this code, of course is full of mistakes, my problem is that despite studying a lot, I still can't properly understand and use the memory allocation and pointers so I know I especially have mistakes in that part. It would be great if you could comment on my code and see how it can turn into a script that actually work. By the way, in my actual data, the length of each line is not defined so I need to use malloc and realloc for that purpose.
For a simple program like this, where you look at short lines one at a time, you shouldn't worry about dynamic memory allocation. It is probably good enough to use local buffers of a reasonable size.
Another thing is that C isn't particularly suited for quick-and-dirty string processing. For example, there isn't a strstrip function in the standard library. You usually end up implementing such behaviour yourself.
An example implementation looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define MAXLEN 80 /* Maximum line length, including null terminator */
int main(int argc, char *argv[])
{
FILE *in;
FILE *out;
char line[MAXLEN]; /* Current line buffer */
char ref[MAXLEN] = ""; /* Sequence reference buffer */
int nseq = 0; /* Sequence counter */
if (argc != 3) {
fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
exit(1);
}
in = fopen(argv[1], "r");
if (in == NULL) {
fprintf(stderr, "Couldn't open %s.\n", argv[1]);
exit(1);
}
out = fopen(argv[2], "w");
if (out == NULL) {
fprintf(stderr, "Couldn't open %s for writing.\n", argv[2]);
exit(1);
}
while (fgets(line, sizeof(line), in)) {
int len = strlen(line);
/* Strip whitespace from end */
while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
line[len] = '\0';
if (line[0] == '>') {
/* First char is '>': copy from second char in line */
strcpy(ref, line + 1);
} else {
/* Other lines are sequences */
fprintf(out, "%s: %d\n", ref, len);
nseq++;
}
}
fprintf(out, "Total number of sequences. %d\n", nseq);
fclose(in);
fclose(out);
return 0;
}
A lot of code is about enforcing arguments and opening and closing files. (You could cut out a lot of code if you used stdin and stdout with file redirections.)
The core is the big while loop. Things to note:
fgets returns NULL on error or when the end of file is reached.
The first lines determine the length of the line and then remove white-space from the end.
It is not enough to decrement length, at the end the stripped string must be terminated with the null character '\0'
When you check the first character in the line, you should check against a char, not a string. In C, single and double quotes are not interchangeable. ">" is a string literal of two characters, '>' and the terminating '\0'.
When dealing with countable entities like chars in a string, use integer types, not floating-point numbers. (I've used (signed) int here, but because there can't be a negative number of chars in a line, it might have been better to have used an unsigned type.)
The notation line + 1 is equivalent to &line[1].
The code I've shown doesn't check that there is always one reference per sequence. I'll leave this as an exercise to the reader.
For a beginner, this can be quite a lot to keep track of. For small text-processing tasks like yours, Python and Perl are definitely better suited.
Edit: The solution above won't work for long sequences; it is restricted to MAXLEN characters. But you don't need dynamic allocation if you only need the length, not the contents of the sequences.
Here's an updated version that doesn't read lines, but reads characters instead. In '>' context, it stores the reference. Otherwise it just keeps a count:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h> /* for isspace() */
#define MAXLEN 80 /* Maximum line length, including null terminator */
int main(int argc, char *argv[])
{
FILE *in;
FILE *out;
int nseq = 0; /* Sequence counter */
char ref[MAXLEN]; /* Reference name */
in = fopen(argv[1], "r");
out = fopen(argv[2], "w");
/* Snip: Argument and file checking as above */
while (1) {
int c = getc(in);
if (c == EOF) break;
if (c == '>') {
int n = 0;
c = fgetc(in);
while (c != EOF && c != '\n') {
if (n < sizeof(ref) - 1) ref[n++] = c;
c = fgetc(in);
}
ref[n] = '\0';
} else {
int len = 0;
int n = 0;
while (c != EOF && c != '\n') {
n++;
if (!isspace(c)) len = n;
c = fgetc(in);
}
fprintf(out, "%s: %d\n", ref, len);
nseq++;
}
}
fprintf(out, "Total number of sequences. %d\n", nseq);
fclose(in);
fclose(out);
return 0;
}
Notes:
fgetc reads a single byte from a file and returns this byte or EOF when the file has ended. In this implementation, that's the only reading function used.
Storing a reference string is implemented via fgetc here too. You could probably use fgets after skipping the initial angle bracket, too.
The counting just reads bytes without storing them. n is the total count, len is the count up to the last non-space. (Your lines probably consist only of ACGT without any trailing space, so you could skip the test for space and use n instead of len.)
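Another version, using POSIX getline() so that lines of any length can be read:
#define _POSIX_C_SOURCE 200809L /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]){
    if(argc != 3){
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        return 1;
    }
    FILE *my_file = fopen(argv[1], "r");
    FILE *my_output = fopen(argv[2], "w");
    if(my_file == NULL || my_output == NULL){
        perror("fopen");
        return 1;
    }
    int total_sequence_counts = 0;
    char *sequence_name = NULL;
    char *line = NULL; /* getline() allocates and grows this buffer */
    size_t size = 0;
    while(-1 != getline(&line, &size, my_file)){
        line[strcspn(line, "\n")] = '\0'; /* strip the newline, if any */
        if(line[0] == '>'){
            free(sequence_name); /* free(NULL) is a no-op */
            sequence_name = strdup(line + 1);
            total_sequence_counts += 1;
            continue;
        }
        fprintf(my_output, "%s %zu\n", sequence_name ? sequence_name : "", strlen(line));
    }
    fprintf(my_output, "Total number of sequences = %d\n", total_sequence_counts);
    fclose(my_file);
    fclose(my_output);
    free(sequence_name);
    free(line);
    return 0;
}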

Read comma-separated "quoted" strings from a file

I am new to C programming but have a bit of Java knowledge, so I want to write a program that reads strings stored in a file, possibly several names separated by comma, such as "boy","girl","car" etc. In Java I would use something like, string str[]=str1.split(" ");.
So I came up with several codes each time but none seems to work, here is my most recent code:
fscanf(fp,"%[^\n]",c);
But this essentially prints the whole line till a new line is found. I have also tried using
fscanf(fp,"%[^,]",c);
And if I use gets() it only gets the first string and ignores all others from the first comma.
This didn't give any reasonable output, it rather gave some minute(encoded) characters.
Please can anyone help me with how to pick out string values separated by comma and in quotes
You can use the strtok() function (declared in string.h) for this task. Store the file data in a sufficiently large string and apply:
str = strtok(full_file_string,",");
while(NULL != str)
{
/* print str here, or save it in a 2 dimensional array of characters */
str=strtok(NULL,","); /* move on to the next word */
}
For further reference, see the man page of strtok.
Hope this might help you :)
fscanf doesn't work with regular expressions, but rather with placeholders. So you need to specify the placeholder for what you want to read, and then fscanf will get the next element that matches your pattern. To get what you want one would use something like:
char word[enough_space];
.
.
.
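while(fscanf(fp, " \"%[^\"]\"%*[,]", word) == 1)
{
//Do something with your word.
}
Here you are reading the text between two quotes. The literal \" characters in the format must match the quotes in the input, %[^\"] stores everything up to the closing quote (a plain %s would stop only at whitespace and swallow the quotes and commas as well), and %*[,] skips the following comma without storing it. On successive calls fscanf moves on to the next match. The loop ends when fscanf no longer returns 1, which includes the EOF it returns once the whole file is consumed.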
The example below extracts the substrings. The format of your file should be something like:
"boy","girl","car",
Notice that the string in the file must end with ','
int read_file_with_string_tokens() {
char * tocken;
char astring[127];
int current = 0;
int limit;
char *filebuffer = NULL;
FILE *file = fopen("your/file/path/and/name", "r");
if (file != NULL) {
fseek(file, 0L, SEEK_END);
int f_size = ftell(file);
fseek(file, 0L, SEEK_SET);
filebuffer = (char*) malloc(f_size + 2);
if (filebuffer == NULL) {
fclose(file);
free(filebuffer);
return -1;
}
memset(filebuffer, 0, f_size + 2);
if (fgets(filebuffer, f_size + 1, file) == NULL) {
fclose(file);
free(filebuffer);
return -1;
}
fclose(file);
memset(astring, 0, 127);
char *result = NULL;
tocken = strchr(filebuffer, ',');
while (tocken != NULL) {
limit = tocken - filebuffer + 1;
strncpy(astring, &filebuffer[current], limit - current - 1);
printf("%s" , astring);
current = limit;
tocken = strchr(&filebuffer[limit], ',');
memset(astring, 0, 127);
}
free(filebuffer);
}
return 0;
}
#include <stdio.h>
int main(void){
char line[128];
char word[32];
FILE *in, *out;
int line_length;
in = fopen("in.txt", "r");
out = fopen("out.txt", "w");
while(1==fscanf(in, "%[^\n]%n\n", line, &line_length)){//read one line
int pos, len;
for(pos=0;pos < line_length-1 && 1==sscanf(line + pos, "%[^,]%*[,]%n", word, &len);pos+=len){
fprintf(out, "%s\n", word);
}
}
fclose(out);
fclose(in);
return 0;
}
/* output result out.txt
"boy"
"girl"
"car"
...
*/

Resources