strcat() causes segmentation fault only after program is finished?

Here's a program that summarizes text. Up to this point, I'm counting the number of occurrences of each word in the text. But, I'm getting a segmentation fault in strcat.
Program received signal SIGSEGV, Segmentation fault.
0x75985629 in strcat () from C:\WINDOWS\SysWOW64\msvcrt.dll
However, while stepping through the code, the program runs the strcat() function as expected. I don't receive the error until line 75, when the program ends.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#define MAXTEXT 1000
#define MAXLINE 200
#define MAXWORDS 200
#define MAXWORD 32
char *strtolower(char *d, const char *s, size_t len);
/* summarizer: summarizes text */
int main(int argc, char *argv[]) {
/* argument check */
char *prog = argv[0];
if (argc < 2) {
fprintf(stderr, "%s: missing arguments, expected 1", prog);
exit(1);
}
/* attempt to open file */
FILE *fp;
if (!(fp = fopen(argv[1], "r"))) {
fprintf(stderr, "%s: Couldn't open file %s", prog, argv[1]);
exit(2);
}
/* read file line by line */
char line[MAXLINE], text[MAXTEXT];
while ((fgets(line, MAXTEXT, fp))) {
strncat(text, line, MAXLINE);
}
/* separate into words and count occurrences */
struct wordcount {
char *word;
int count;
};
struct wordcount words[MAXWORDS];
char *token, *delim = " \t\n.,!?";
char word[MAXWORD], textcpy[strlen(text)+1]; /*len of text and \0 */
int i, j, is_unique_word = 1;
strcpy(textcpy, text);
token = strtok(textcpy, delim);
for (i = 0; i < MAXWORDS && token; i++) {
strtolower(word, token, strlen(token));
token = strtok(NULL, delim);
/* check if word exists */
for (j = 0; words[j].word && j < MAXWORDS; j++) {
if (!strcmp(word, words[j].word)) {
is_unique_word = 0;
words[j].count++;
}
}
/* add to word list of unique */
if (is_unique_word) {
strcpy(words[i].word, word);
words[i].count++;
}
is_unique_word = 1;
}
return 0;
}
/* strtolower: copy str s to dest d, returns pointer of d */
char *strtolower(char *d, const char *s, size_t len) {
int i;
for (i = 0; i < len; i++) {
d[i] = tolower(*(s+i));
}
return d;
}

The problem is in the loop: while ((fgets(line, MAXTEXT, fp))) strncat(text, line, MAXLINE);. It is incorrect for multiple reasons:
text is uninitialized, so concatenating a string to it has undefined behavior. Undefined behavior may indeed cause a crash after the end of the function, for example if the return address was overwritten.
fgets() is passed MAXTEXT as the size of line, which only holds MAXLINE bytes, so a long line overflows line itself. With the correct size, the string read would have at most MAXLINE-1 bytes and there would be no reason to use strncat() with a length of MAXLINE.
you do not check if there is enough space at the end of text to concatenate the contents of line. strncat(dest, src, n) copies at most n bytes from src to the end of dest and always sets a null terminator. It is not a safe version of strcat(). If the line does not fit at the end of text, you have undefined behavior, and again you can observe a crash after the end of the function, for example if the return address was overwritten.
You could just try and read the whole file with fread:
/* read the file into the text array */
char text[MAXTEXT];
size_t text_len = fread(text, 1, sizeof(text) - 1, fp);
text[text_len] = '\0';
If text_len == sizeof(text) - 1, the file is potentially too large for the text array and the while loop would have caused a buffer overflow.
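If the file still has to be accumulated line by line, the remaining space can be checked before every append. The following is only a minimal sketch of that idea, not the asker's code: append_bounded is a hypothetical helper name, and it assumes dst already holds a null-terminated string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: append src to dst only if the result, including the
   terminating '\0', fits in cap bytes. Returns 0 on success, -1 otherwise.
   Assumes dst already holds a null-terminated string. */
int append_bounded(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t used = strlen(dst);
    size_t need = strlen(src);
    if (used >= cap || need >= cap - used)
        return -1;                      /* would overflow: dst is left untouched */
    memcpy(dst + used, src, need + 1);  /* copies the '\0' as well */
    return 0;
}
```

In the reading loop this could be used as text[0] = '\0'; followed by while (fgets(line, sizeof line, fp)) if (append_bounded(text, sizeof text, line) != 0) break;, so an oversized file stops the loop instead of corrupting the stack.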

There is at least one problem because you create line with MAXLINE size (200), then you fgets() up to MAXTEXT (1000) chars into it.

The destination string passed to strncat must be null-terminated, so text has to be initialized before the first call. line must also be read with its own size rather than MAXTEXT, and the length passed to strncat should be the space still free in text, leaving room for the '\0' that strncat appends. Otherwise the buffer can still overflow.
char line[MAXLINE], text[MAXTEXT] = {'\0'};
while (fgets(line, sizeof line, fp))
{
    /* append at most the free space left in text, minus the terminator */
    strncat(text, line, sizeof text - strlen(text) - 1);
}

Related

Novice C question: Working with a variable-length array of variable-length strings?

I probably got an easy one for the C programmers out there!
I am trying to create a simple C function that will execute a system command in and write the process output to a string buffer out (which should be initialized as an array of strings of length n). The output needs to be formatted in the following way:
Each line written to stdout should be initialized as a string. Each of these strings has variable length. The output should be an array consisting of each string. There is no way to know how many strings will be written, so this array is also technically of variable length (but for my purposes, I just create a fixed-length array outside the function and pass its length as an argument, rather than going for an array that I would have to manually allocate memory for).
Here is what I have right now:
#define MAX_LINE_LENGTH 512
int exec(const char* in, const char** out, const size_t n)
{
char buffer[MAX_LINE_LENGTH];
FILE *file;
const char terminator = '\0';
if ((file = popen(in, "r")) == NULL) {
return 1;
}
for (char** head = out; (size_t)head < (size_t)out + n && fgets(buffer, MAX_LINE_LENGTH, file) != NULL; head += strlen(buffer)) {
*head = strcat(buffer, &terminator);
}
if (pclose(file)) {
return 2;
}
return 0;
}
and I call it with
#define N 128
int main(void)
{
const char* buffer[N];
const char cmd[] = "<some system command resulting in multi-line output>";
const int code = exec(cmd, buffer, N);
exit(code);
}
I believe the error the above code results in is a seg fault, but I'm not experienced enough to figure out why or how to fix.
I'm almost positive it is with my logic here:
for (char** head = out; (size_t)head < (size_t)out + n && fgets(buffer, MAX_LINE_LENGTH, file) != NULL; head += strlen(buffer)) {
*head = strcat(buffer, &terminator);
}
What I thought this does is:
Get a mutable reference to out (i.e. the head pointer)
Save the current stdout line to buffer (via fgets)
Append a null terminator to buffer (because I don't think fgets does this?)
Overwrite the data at head pointer with the value from step 3
Move head pointer strlen(buffer) bytes over (i.e. the number of chars in buffer)
Continue until fgets returns NULL or head pointer has been moved beyond the bounds of out array
Where am I wrong? Any help appreciated, thanks!
EDIT #1
According to Barmar's suggestions, I edited my code:
#include <stdio.h>
#include <stdlib.h>
#define MAX_LINE_LENGTH 512
int exec(const char* in, const char** out, const size_t n)
{
char buffer[MAX_LINE_LENGTH];
FILE *file;
if ((file = popen(in, "r")) == NULL) return 1;
for (size_t i = 0; i < n && fgets(buffer, MAX_LINE_LENGTH, file) != NULL; i += 1) out[i] = buffer;
if (pclose(file)) return 2;
return 0;
}
#define N 128
int main(void)
{
const char* buffer[N];
const char cmd[] = "<system command to run>";
const int code = exec(cmd, buffer, N);
for (int i = 0; i < N; i += 1) printf("%s", buffer[i]);
exit(code);
}
While there were plenty of redundancies with what I wrote that are now fixed, this still causes a segmentation fault at runtime.
Focusing on the edited code, this assignment
out[i] = buffer;
has problems.
In this expression, buffer is implicitly converted to a pointer-to-its-first-element (&buffer[0], see: decay). No additional memory is allocated, and no string copying is done.
buffer is rewritten every iteration. After the loop, each valid element of out will point to the same memory location, which will contain the last line read.
buffer is an array local to the exec function. Its lifetime ends when the function returns, so the array in main contains dangling pointers. Utilizing these values is Undefined Behaviour.
Additionally,
for (int i = 0; i < N; i += 1)
always loops to the maximum storable number of lines, when it is possible that fewer lines than this were read.
A rigid solution uses an array of arrays to store the lines read. Here is a cursory example (see: this answer for additional information on using multidimensional arrays as function arguments).
#include <stdio.h>
#include <stdlib.h>
#define MAX_LINES 128
#define MAX_LINE_LENGTH 512
int exec(const char *cmd, char lines[MAX_LINES][MAX_LINE_LENGTH], size_t *lc)
{
FILE *stream = popen(cmd, "r");
*lc = 0;
if (!stream)
return 1;
while (*lc < MAX_LINES) {
if (!fgets(lines[*lc], MAX_LINE_LENGTH, stream))
break;
(*lc)++;
}
return pclose(stream) ? 2 : 0;
}
int main(void)
{
char lines[MAX_LINES][MAX_LINE_LENGTH];
size_t n;
int code = exec("ls -al", lines, &n);
for (size_t i = 0; i < n; i++)
printf("%s", lines[i]);
return code;
}
Using dynamic memory is another option. Here is a basic example using strdup(3), lacking robust error handling.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char **exec(const char *cmd, size_t *length)
{
FILE *stream = popen(cmd, "r");
if (!stream)
return NULL;
char **lines = NULL;
char buffer[4096];
*length = 0;
while (fgets(buffer, sizeof buffer, stream)) {
char **reline = realloc(lines, sizeof *lines * (*length + 1));
if (!reline)
break;
lines = reline;
if (!(lines[*length] = strdup(buffer)))
break;
(*length)++;
}
pclose(stream);
return lines;
}
int main(void)
{
size_t n = 0;
char **lines = exec("ls -al", &n);
for (size_t i = 0; i < n; i++) {
printf("%s", lines[i]);
free(lines[i]);
}
free(lines);
}
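The strdup example above calls realloc once per line and leaves cleanup to the caller. As a rough sketch of one way to tighten that, the list below grows its capacity by doubling and has a matching free function. The names line_list, line_list_push and line_list_free are hypothetical and not part of the answer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: a growable list of heap-allocated line copies. */
struct line_list {
    char **items;   /* array of owned strings */
    size_t count;   /* slots in use */
    size_t cap;     /* slots allocated */
};

/* Append a copy of line. Returns 0 on success, -1 on allocation failure,
   in which case the stored lines are left unchanged and still valid. */
int line_list_push(struct line_list *list, const char *line)
{
    assert(list != NULL && line != NULL);
    if (list->count == list->cap) {
        size_t cap = list->cap ? list->cap * 2 : 8;
        char **grown = realloc(list->items, cap * sizeof *grown);
        if (!grown)
            return -1;
        list->items = grown;
        list->cap = cap;
    }
    size_t len = strlen(line) + 1;
    char *copy = malloc(len);
    if (!copy)
        return -1;
    memcpy(copy, line, len);
    list->items[list->count++] = copy;
    return 0;
}

/* Release every stored line and reset the list to empty. */
void line_list_free(struct line_list *list)
{
    for (size_t i = 0; i < list->count; i++)
        free(list->items[i]);
    free(list->items);
    list->items = NULL;
    list->count = list->cap = 0;
}
```

A list starts as struct line_list lines = {0}; and each fgets result is passed to line_list_push. Doubling keeps the number of realloc calls logarithmic in the number of lines.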

Reading words separately from file

I'm trying to make a program that scans a file containing words line by line and removes words that are spelled the same if you read them backwards (palindromes)
This is the program.c file:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "header.h"
int main(int argc, char **argv)
{
if(argc != 3)
{
printf("Wrong parameters");
return 0;
}
FILE *data;
FILE *result;
char *StringFromFile = (char*)malloc(255);
char *word = (char*)malloc(255);
const char *dat = argv[1];
const char *res = argv[2];
data = fopen(dat, "r");
result =fopen(res, "w");
while(fgets(StringFromFile, 255, data))
{
function1(StringFromFile, word);
fputs(StringFromFile, result);
}
free(StringFromFile);
free (word);
fclose(data);
fclose(result);
return 0;
}
This is the header.h file:
#ifndef HEADER_H_INCLUDED
#define HEADER_H_INCLUDED
void function1(char *StringFromFile, char *word);
void moving(char *StringFromFile, int *index, int StringLength, int WordLength);
#endif
This is the function file:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "header.h"
void function1(char *StringFromFile, char *word)
{
int StringLength = strlen(StringFromFile);
int WordLength;
int i;
int p;
int k;
int t;
int m;
int match;
for(i = 0; i < StringLength; i++)
{ k=0;
t=0;
m=i;
if (StringFromFile[i] != ' ')
{ while (StringFromFile[i] != ' ')
{
word[k]=StringFromFile[i];
k=k+1;
i=i+1;
}
//printf("%s\n", word);
WordLength = strlen(word)-1;
p = WordLength-1;
match=0;
while (t <= p)
{
if (word[t] == word[p])
{
match=match+1;
}
t=t+1;
p=p-1;
}
if ((match*2) >= (WordLength))
{
moving(StringFromFile, &m, StringLength, WordLength);
}
}
}
}
void moving(char *StringFromFile, int *index, int StringLength, int WordLength)
{ int i;
int q=WordLength-1;
for(i = *index; i < StringLength; i++)
{
StringFromFile[i-1] = StringFromFile[i+q];
}
*(index) = *(index)-1;
}
It doesn't read each word correctly, though.
This is the data file:
abcba rttt plllp
aaaaaaaaaaaa
ababa
abbbba
kede
These are the separate words the program reads:
abcba
rttta
plllp
aaaaaaaaaaaa
ababa
abbbba
kede
This is the result file:
abcba rtttp
kede
It works fine if there is only one word in a single line, but it messes up when there are multiple words. Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "header.h"
# define MAX 255
int Find_Number_Words_in_Line( char str[MAX] )
{
char *ptr;
int count = 0;
int j;
/* advance character pointer ptr until end of str[MAX] */
/* everytime you see the space character, increase count */
/* might not always work, you'll need to handle multiple space characters before/between/after words */
ptr = str;
for ( j = 0; j < MAX; j++ )
{
if ( *ptr == ' ' )
count++;
else if (( *ptr == '\0' ) || ( *ptr == '\n' ))
break;
ptr++;
}
return count;
}
void Extract_Word_From_Line_Based_on_Position( char line[MAX], char word[MAX], const int position )
{
char *ptr;
/* move pointer down line[], counting past the number of spaces specified by position */
/* then copy the next word from line[] into word[] */
}
int Is_Palindrome ( char str[MAX] )
{
/* check if str[] is a palindrome, if so return 1, else return 0 */
}
int main(int argc, char **argv)
{
FILE *data_file;
FILE *result_file;
char *line_from_data_file = (char*)malloc(MAX);
char *word = (char*)malloc(MAX);
const char *dat = argv[1];
const char *res = argv[2];
int j, n;
if (argc != 3)
{
printf("Wrong parameters");
return 0;
}
data_file = fopen(dat, "r");
result_file = fopen(res, "w");
fgets( line_from_data_file, MAX, data_file );
while ( ! feof( data_file ) )
{
/*
fgets returns everything up to newline character from data_file,
function1 in original context would only run once for each line read
from data_file, so you would only get the first word
function1( line_from_data_file, word );
fputs( word, result_file );
fgets( line_from_data_file, MAX, data_file );
instead try below, you will need to write the code for these new functions
don't be afraid to name functions in basic English for what they are meant to do
make your code more easily readable
*/
n = Find_Number_Words_in_Line( line_from_data_file );
for ( j = 0; j < n; j++ )
{
Extract_Word_From_Line_Based_on_Position( line_from_data_file, word, j ); /* j, not n: extract the j-th word */
if ( Is_Palindrome( word ) )
fputs( word, result_file ); /* this will put one palindrome per line in result file */
}
fgets( line_from_data_file, MAX, data_file );
}
free( line_from_data_file );
free( word );
fclose( data_file );
fclose( result_file );
return 0;
}
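The comment in Find_Number_Words_in_Line warns that counting spaces breaks down with leading, trailing or repeated spaces. One possible way around that, sketched here under the assumption that a word is any run of non-whitespace characters, is to count the transitions from whitespace into a word. count_words is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical replacement for the space-counting approach: count every
   transition from whitespace (or start of string) into a word. */
size_t count_words(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    int in_word = 0;
    for (; *s; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;            /* left the current word, if any */
        } else if (!in_word) {
            in_word = 1;            /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

Unlike the space count, this returns 3 for the three-word first line of the data file, and 0 for a blank line.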
To follow up from the comments, you may be overthinking the problem a bit. To check whether each word in each line of a file is a palindrome, you have a 2 part problem. (1) reading each line (fgets is fine), and (2) breaking each line into individual words (tokens) so that you can test whether each token is a palindrome.
When reading each line with fgets, a simple while loop conditioned on the return of fgets will do. e.g., with a buffer buf of sufficient size (MAXC chars), and FILE * stream fp open for reading, you can do:
while (fgets (buf, MAXC, fp)) { /* read each line */
... /* process line */
}
(You can check that the line read into buf ends in a newline, or is shorter than MAXC - 1 chars, to ensure you read the complete line. If not, any unread chars will be placed in buf on the next loop iteration. This check, and how you want to handle it, is left for you.)
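For what it is worth, the completeness check mentioned here could look something like the sketch below. line_is_complete is a hypothetical helper, and it assumes buf was just filled by fgets(buf, size, fp).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf, as filled by fgets(buf, size, fp),
   holds a complete line, and 0 if the line was cut off because the buffer
   filled up. A final line without a newline counts as complete as long as
   it did not fill the buffer. */
int line_is_complete(const char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return 1;               /* newline seen: whole line is in buf */
    return len + 1 < size;      /* no newline: complete only if buf was not full */
}
```

Inside the while (fgets (buf, MAXC, fp)) loop, a call such as line_is_complete (buf, MAXC) returning 0 means the remainder of the line will arrive on the next iteration, so a word may be split across the two reads.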
Once you have your line read, you can either use a simple pair of pointers (start and end pointers) to work your way through buf, or you can use strtok and let it return a pointer to the beginning of each word in the line based on the set of delimiters you pass to it. For example, to split a line into words, you probably want to use delimiters like " \t\n.,:;!?" to ensure you get words alone and not words with punctuation (e.g. in the line "sit here.", you want "sit" and "here", not "here.")
Using strtok is straightforward. On the first call, you pass the name of the buffer holding the string to be tokenized and a pointer to the string containing the delimiters (e.g. strtok (buf, delims) above), then for each subsequent call (until the end of the line is reached) you pass NULL in place of the buffer (e.g. strtok (NULL, delims)). You can either call it once and then loop until NULL is returned, or you can do it all using a single for loop, given that for allows setting an initial condition as part of the statement. E.g., using separate calls:
char *delims = " \t\n.,:;"; /* delimiters */
char *p = strtok (buf, delims); /* first call to strtok */
while (p) { /* loop until no tokens remain */
    ... /* check for palindrome */
    p = strtok (NULL, delims); /* all subsequent calls */
}
Or you can simply make the initial call and all subsequent calls in a for loop:
/* same thing in a single 'for' statement */
for (p = strtok (buf, delims); p; p = strtok (NULL, delims)) {
... /* check for palindrome */
}
Now you are at the point where you need to check for palindromes. That is a fairly easy process. Find the length of the token, then either using string indexes, or simply using a pointer to the first and last character, work from the ends to the middle of each token making sure the characters match. On the first mismatch, you know the token is not a palindrome. I find a start and end pointer just as easy as manipulating string indexes, e.g. with the token in s:
char *ispalindrome (char *s) /* function to check palindrome */
{
char *p = s, /* start pointer */
*ep = s + strlen (s) - 1; /* end pointer */
for ( ; p < ep; p++, ep--) /* work from end to middle */
if (*p != *ep) /* if chars !=, not palindrome */
return NULL;
return s;
}
If you put all the pieces together, you can do something like the following:
#include <stdio.h>
#include <string.h>
enum { MAXC = 256 }; /* max chars for line buffer */
char *ispalindrome (char *s);
int main (int argc, char **argv) {
char buf[MAXC] = "", /* line buffer */
*delims = " \t\n.,:;"; /* delimiters */
unsigned ndx = 0; /* line index */
FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
if (!fp) { /* validate file open for reading */
fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
return 1;
}
while (fgets (buf, MAXC, fp)) { /* read each line */
char *p = buf; /* pointer to pass to strtok */
printf ("\n line[%2u]: %s\n tokens:\n", ndx++, buf);
for (p = strtok (buf, delims); p; p = strtok (NULL, delims))
if (ispalindrome (p))
printf (" %-16s - palindrome\n", p);
else
printf (" %-16s - not palindrome\n", p);
}
if (fp != stdin) fclose (fp);
return 0;
}
char *ispalindrome (char *s) /* function to check palindrome */
{
char *p = s, *ep = s + strlen (s) - 1; /* ptr & end-ptr */
for ( ; p < ep; p++, ep--) /* work from end to middle */
if (*p != *ep) /* if chars !=, not palindrome */
return NULL;
return s;
}
Example Input
$ cat dat/palins.txt
abcba rttt plllp
aaaaaaaaaaaa
ababa
abbbba
kede
Example Use/Output
$ ./bin/palindrome <dat/palins.txt
line[ 0]: abcba rttt plllp
tokens:
abcba - palindrome
rttt - not palindrome
plllp - palindrome
line[ 1]: aaaaaaaaaaaa
tokens:
aaaaaaaaaaaa - palindrome
line[ 2]: ababa
tokens:
ababa - palindrome
line[ 3]: abbbba
tokens:
abbbba - palindrome
line[ 4]: kede
tokens:
kede - not palindrome
Look things over and think about what is taking place. As mentioned above, whether fgets read a complete line on each call should be validated; that is left to you (with this input file, of course, it will). If you have any questions, let me know and I'll be happy to help further.
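The program above only reports palindromes, while the original question asks for them to be removed from each line. Under the assumption that words are separated by spaces, tabs and newlines, and that kept words may be rejoined with single spaces, one way to do that with the same strtok loop is sketched below. The names is_palindrome and drop_palindromes are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if s reads the same forwards and backwards. */
static int is_palindrome(const char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i + 1 < n - i; i++)
        if (s[i] != s[n - 1 - i])
            return 0;
    return 1;
}

/* Hypothetical helper: copy the words of line into out, skipping palindromes.
   Kept words are joined by single spaces. line is modified by strtok.
   Returns the number of words kept, or -1 if out (outsz bytes) is too small. */
int drop_palindromes(char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL && outsz > 0);
    const char *delims = " \t\n";
    size_t used = 0;
    int kept = 0;
    out[0] = '\0';
    for (char *w = strtok(line, delims); w; w = strtok(NULL, delims)) {
        if (is_palindrome(w))
            continue;                       /* drop this word */
        size_t len = strlen(w);
        size_t sep = kept ? 1 : 0;          /* space before all but the first */
        if (used + sep + len + 1 > outsz)
            return -1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, w, len + 1);     /* includes the '\0' */
        used += len;
        kept++;
    }
    return kept;
}
```

Note that single-character words count as palindromes in this sketch. The caller would write out followed by a newline to the result file, perhaps only when the return value is greater than zero.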

Multiple Command-Line Arguments - Replace Words

I've a program which takes any number of words from the command-line arguments and replaces them with the word 'CENSORED'. I finally have the program working for the first argument passed in, and I am having trouble getting the program to censor all arguments, outputted in just a single string. The program rather functions individually on a given argument and does not take them all into account. How would I modify this?
How does one use/manipulate multiple command-line arguments collectively ?
My code follows.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *replace_str(char *str, char *orig, char *rep, int j, int argc)
{
static char buffer[4096];
char *p;
for ( j = 1; j <= argc; j++ )
{
if(!(p = strstr(str, orig))) // Check if 'orig' is not in 'str'
{
if ( j == argc ) { return str; } // return str once final argument is reached
else { continue; } // restart loop with next argument
}
strncpy(buffer, str, p-str); // Copy characters from 'str' start to 'orig' str
buffer[p-str] = '\0';
if ( j == argc ) { return buffer; }
else { continue; }
}
sprintf(buffer+(p-str), "%s%s", rep, p+strlen(orig));
}
int main( int argc, char* argv[] ) //argv: list of arguments; array of char pointers //argc: # of arguments.
{
long unsigned int c, i = 0, j = 1;
char str[4096];
while ( (c = getchar()) != EOF )
{
str[i] = c; // save input string to variable 'str'
i++;
}
puts(replace_str( str, argv[j], "CENSORED", j, argc ) );
return 0;
}
i.e.
$ cat Hello.txt
Hello, I am me.
$ ./replace Hello me < Hello.txt
CENSORED, I am CENSORED.
There are two issues: first, you are not guaranteeing a null-terminated str, and second, you are not iterating over the words on the command line to censor each one. Try the following in main after your getchar() loop:
/* null-terminate str */
str[i] = 0;
/* you must check each command line word (i.e. argv[j]) */
for (j = 1; j < argc; j++)
{
puts(replace_str( str, argv[j], "CENSORED", j, argc ) );
}
Note: that will place each of the CENSORED words on a separate line. As noted in the comments, move puts (or preferably printf) outside the loop to keep on a single line.
Edit
I apologize. You have more issues than stated above. Attempting to check the fix, it became apparent that you would continue to have difficulty parsing the words depending on the order the bad words were entered on the command line.
While it is possible to do the pointer arithmetic to copy/expand/contract the original string regardless of the order the words appear on the command line, it is far easier to simply separate the words provided into an array, and then compare each of the bad words against each word in the original string.
This can be accomplished relatively easily with strtok or strsep. I put together a quick example showing this approach. (note: make a copy of the string before passing to strtok, as it will alter the original). I believe this is what you were attempting to do, but you were stumbling on not having the ability to compare each word (thus your use of strstr to test for a match).
Look over the example and let me know if you have further questions. Note: I replaced your hardcoded 4096 with a SMAX define and provided a word max WMAX for words entered on the command line. Also always initialize your strings/buffers. It will enable you to always be able to easily find the last char in the buffer and ensure the buffer is always null-terminated.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define SMAX 4096
#define WMAX 50
char *replace_str (char *str, char **bad, char *rep)
{
static char buffer[SMAX] = {0};
char *p = buffer;
char *wp = NULL;
unsigned i = 0;
unsigned char censored = 0;
char *str2 = strdup (str); /* make copy of string for strtok */
char *savp = str2; /* and save start address to free */
if (!(wp = strtok (str2, " "))) /* get first word in string or bail */
{
if (savp) free (savp);
return str;
}
while (bad[i]) /* test against each bad word */
{
if (strcmp (wp, bad[i++]) == 0) /* if matched, copy rep to buffer */
{
memcpy (buffer, rep, strlen (rep));
censored = 1;
}
}
if (!censored) /* if no match, copy original word */
memcpy (buffer, wp, strlen (wp));
while ((wp = strtok (NULL, " "))) /* repeat for each word in str */
{
i = 0;
censored = 0;
memcpy (strchr (buffer, 0), " ", 1);
p = strchr (buffer, 0); /* (get address of null-term char) */
while (bad[i])
{
if (strcmp (wp, bad[i++]) == 0)
{
memcpy (p, rep, strlen (rep));
censored = 1;
}
}
if (!censored)
memcpy (p, wp, strlen (wp));
}
if (savp) free (savp); /* free copy of strtok string */
return buffer;
}
int main ( int argc, char** argv)
{
unsigned int i = 0;
char str[SMAX] = {0};
char *badwords[WMAX] = {0}; /* array to hold command line words */
for (i = 1; i < (unsigned)argc && i < WMAX; i++) /* save command line in array, keeping a NULL sentinel */
badwords[i-1] = strdup (argv[i]);
i = 0; /* print out the censored words */
printf ("\nCensor words:");
while (badwords[i])
printf (" %s", badwords[i++]);
printf ("\n\n");
printf ("Enter string: "); /* prompt to enter string to censor */
if (fgets (str, SMAX-1, stdin) == NULL)
{
fprintf (stderr, "error: failed to read str from stdin\n");
return 1;
}
str[strcspn (str, "\n")] = 0; /* strip linefeed from input str, if present */
/* print out censored string */
printf ("\ncensored str: %s\n\n", replace_str (str, badwords, "CENSORED"));
i = 0; /* free all allocated memory */
while (badwords[i])
free (badwords[i++]);
return 0;
}
use/output
./bin/censorw bad realbad
Censor words: bad realbad
Enter string: It is not nice to say bad or realbad words.
censored str: It is not nice to say CENSORED or CENSORED words.
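Splitting on spaces means a word followed by punctuation, such as "me." in the question's sample input, is not matched. As an alternative sketch, the input can be scanned position by position and compared against each bad word with strncmp, which is the "pointer arithmetic" route mentioned in the edit. The function name censor and its parameters are hypothetical. Matches are plain substrings, so this is a different trade-off rather than a strict improvement.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy in to out, replacing every occurrence of any
   string in bad[0..nbad-1] with rep. Matches are plain substrings, so "me"
   would also match inside "some". Returns 0 on success, or -1 if out
   (outsz bytes) is too small, in which case out holds the truncated result. */
int censor(const char *in, char *const *bad, size_t nbad,
           const char *rep, char *out, size_t outsz)
{
    assert(in && bad && rep && out && outsz > 0);
    size_t replen = strlen(rep);
    size_t used = 0;
    while (*in) {
        size_t matched = 0;                 /* length of the bad word found here */
        for (size_t i = 0; i < nbad && !matched; i++) {
            size_t len = strlen(bad[i]);
            if (len > 0 && strncmp(in, bad[i], len) == 0)
                matched = len;
        }
        size_t add = matched ? replen : 1;
        if (used + add + 1 > outsz) {
            out[used] = '\0';
            return -1;
        }
        if (matched) {
            memcpy(out + used, rep, replen);
            in += matched;                  /* skip over the censored word */
        } else {
            out[used] = *in++;              /* ordinary character: copy it */
        }
        used += add;
    }
    out[used] = '\0';
    return 0;
}
```

From main this could be called as censor(str, argv + 1, (size_t)(argc - 1), "CENSORED", out, sizeof out), which handles all command-line words in a single pass and prints one line.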

Read text file, save all digits into character string

I am trying to read a text file containing the string "a3rm5t?7!z*&gzt9v" and put all the numeric characters into a character string to later convert into an integer.
I am currently trying to do this by using sscanf on the buffer after reading the file, and then using sprintf to save all characters found using %u in a character string called str.
However, the integer that is returning when I call printf on str is different each time I run the program. What am I doing right and what am I doing wrong?
This code works when the text file contains a string like "23dog" and returns 23 but not when the string is something like 23dog2.
EDIT: I now realize that i should be putting the numeric characters in a character ARRAY rather than just one string.
int main(int argc, const char **argv)
{
int in;
char buffer[128];
char *str;
FILE *input;
in = open(argv[1], O_RDONLY);
read(in, buffer, 128);
unsigned x;
sscanf(buffer, "%u", &x);
sprintf(str,"%u\n", x);
printf("%s\n",str);
close (in);
exit(0);
}
If you simply want to filter out any non-digits from your input, you need not use scanf, sprintf and the like. Simply loop over the buffer and copy the characters that are digits.
The following program only works for a single line of input read from standard input and only if it is less than 512 characters long but it should give you the correct idea.
#include <stdio.h>
#define BUFFER_SIZE 512
int
main()
{
char buffer[BUFFER_SIZE]; /* Here we read into. */
char digits[BUFFER_SIZE]; /* Here we insert the digits. */
char * pos;
size_t i = 0;
/* Read one line of input (max BUFFER_SIZE - 1 characters). */
if (!fgets(buffer, BUFFER_SIZE, stdin))
{
perror("fgets");
return 1;
}
/* Loop over the (NUL terminated) buffer. */
for (pos = buffer; *pos; ++pos)
{
if (*pos >= '0' && *pos <= '9')
{
/* It's a digit: copy it over. */
digits[i++] = *pos;
}
}
digits[i] = '\0'; /* NUL terminate the string. */
printf("%s\n", digits);
return 0;
}
A good approach to any problem like this is to read the entire line into a buffer and then assign a pointer to the buffer. You can then use the pointer to step through the buffer reading each character and acting on it appropriately. The following is one example of this approach. getline is used to read the line from the file (it has the advantage of allocating space for buffer and returning the number of characters read). You then allocate space for the character string based on the size of buffer as returned by getline. Remember, when done, you are responsible for freeing the memory allocated by getline.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (int argc, const char **argv)
{
char *buffer = NULL; /* forces getline to allocate required space */
ssize_t read = 0; /* number of characters read by getline */
size_t n = 0; /* buffer size, updated by getline (0 with a NULL buffer = allocate) */
char *str = NULL; /* string to hold digits read from file */
char *p = NULL; /* ptr to use with buffer (could use buffer) */
int idx = 0; /* index for adding digits to str */
int number = 0; /* int to hold number parsed from file */
FILE *input;
/* validate input */
if (argc < 2) { printf ("Error: insufficient input. Usage: %s filename\n", argv[0]); return 1; }
/* open and validate file */
input = fopen(argv[1], "r");
if (!input) { printf ("Error: failed to open file '%s'\n", argv[1]); return 1; }
/* read line from file with getline */
if ((read = getline (&buffer, &n, input)) != -1)
{
str = malloc ((size_t)read + 1); /* allocate memory for str (+1 for the null terminator) */
p = buffer; /* set pointer to buffer */
while (*p) /* read each char in buffer */
{
if (*p > 0x2f && *p < 0x3a) /* if char is digit 0-9 */
{
str[idx] = *p; /* copy to str at idx */
idx++; /* increment idx */
}
p++; /* increment pointer */
}
str[idx] = 0; /* null-terminate str */
number = atoi (str); /* convert str to int */
printf ("\n string : %s number : %d\n\n", buffer, number);
} else {
printf ("Error: nothing read from file '%s'\n", argv[1]);
return 1;
}
if (input) fclose (input); /* close input file stream */
if (buffer) free (buffer); /* free memory allocated by getline */
if (str) free (str); /* free memory allocated to str */
return 0;
}
datafile:
$ cat dat/fwintstr.dat
a3rm5t?7!z*&gzt9v
output:
$ ./bin/prsint dat/fwintstr.dat
string : a3rm5t?7!z*&gzt9v
number : 3579
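atoi gives no indication of failure, for instance when the line contains no digits at all or more digits than an int can hold. A sketch of the same filter-then-convert idea using strtol, which does report overflow, might look like the following. digits_to_long is a hypothetical name, and the 32-byte scratch buffer is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: collect the decimal digits found in s and convert them
   to a long. Returns 0 and stores the value in *out on success, -1 if s holds
   no digits, more digits than this sketch supports, or a value that does not
   fit in a long. */
int digits_to_long(const char *s, long *out)
{
    assert(s != NULL && out != NULL);
    char digits[32];                /* assumed scratch size: up to 31 digits */
    size_t n = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            if (n + 1 >= sizeof digits)
                return -1;
            digits[n++] = *s;
        }
    }
    if (n == 0)
        return -1;                  /* nothing to convert */
    digits[n] = '\0';
    errno = 0;
    long value = strtol(digits, NULL, 10);
    if (errno == ERANGE)
        return -1;                  /* overflow reported by strtol */
    *out = value;
    return 0;
}
```

With the sample data, digits_to_long("a3rm5t?7!z*&gzt9v", &v) stores 3579 in v.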

Using realloc to expand buffer while reading from file crashes

I am writing some code that needs to read fasta files, so part of my code (included below) is a fasta parser. As a single sequence can span multiple lines in the fasta format, I need to concatenate multiple successive lines read from the file into a single string. I do this by realloc'ing the string buffer after reading every line, to be the current length of the sequence plus the length of the line read in. I do some other stuff, like stripping white space etc. All goes well for the first sequence, but fasta files can contain multiple sequences. So similarly, I have a dynamic array of structs with two strings (title and actual sequence), both being "char *". Again, as I encounter a new title (introduced by a line beginning with '>') I increment the number of sequences and realloc the sequence list buffer. The realloc segfaults on allocating space for the second sequence with
*** glibc detected *** ./stackoverflow: malloc(): memory corruption: 0x09fd9210 ***
Aborted
For the life of me I can't see why. I've run it through gdb and everything seems to be working (i.e. everything is initialised, the values seems sane)... Here's the code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include <math.h>
#include <errno.h>
//a structure to keep a record of sequences read in from file, and their titles
typedef struct {
char *title;
char *sequence;
} sequence_rec;
//string convenience functions
//checks whether a string consists entirely of white space
int empty(const char *s) {
int i;
i = 0;
while (s[i] != 0) {
if (!isspace(s[i])) return 0;
i++;
}
return 1;
}
//substr allocates and returns a new string which is a substring of s from i to
//j exclusive, where i < j; If i or j are negative they refer to distance from
//the end of the s
char *substr(const char *s, int i, int j) {
char *ret;
if (i < 0) i = strlen(s)-i;
if (j < 0) j = strlen(s)-j;
ret = malloc(j-i+1);
strncpy(ret,s,j-i);
return ret;
}
//strips white space from either end of the string
void strip(char **s) {
int i, j, len;
char *tmp = *s;
len = strlen(*s);
i = 0;
while ((isspace(*(*s+i)))&&(i < len)) {
i++;
}
j = strlen(*s)-1;
while ((isspace(*(*s+j)))&&(j > 0)) {
j--;
}
*s = strndup(*s+i, j-i);
free(tmp);
}
int main(int argc, char**argv) {
sequence_rec *sequences = NULL;
FILE *f = NULL;
char *line = NULL;
size_t linelen;
int rcount;
int numsequences = 0;
f = fopen(argv[1], "r");
if (f == NULL) {
fprintf(stderr, "Error opening %s: %s\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
rcount = getline(&line, &linelen, f);
while (rcount != -1) {
while (empty(line)) rcount = getline(&line, &linelen, f);
if (line[0] != '>') {
fprintf(stderr,"Sequence input not in valid fasta format\n");
return EXIT_FAILURE;
}
numsequences++;
sequences = realloc(sequences,sizeof(sequence_rec)*numsequences);
sequences[numsequences-1].title = strdup(line+1); strip(&sequences[numsequences-1].title);
rcount = getline(&line, &linelen, f);
sequences[numsequences-1].sequence = malloc(1); sequences[numsequences-1].sequence[0] = 0;
while ((!empty(line))&&(line[0] != '>')) {
strip(&line);
sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
strcat(sequences[numsequences-1].sequence,line);
rcount = getline(&line, &linelen, f);
}
}
return EXIT_SUCCESS;
}
You should use strings that look something like this:
struct string {
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
    char *ptr;
};
This prevents strncpy bugs like the one it seems you hit, and allows you to do strcat and friends faster because the length is already known.
You should also use a doubling array for each string. This prevents too many allocations and memcpys. Something like this:
int sstrcat(struct string *a, const struct string *b)
{
    size_t len = a->len + b->len;
    if (a->cap < len) {
        size_t cap = a->cap ? a->cap : 16;
        while (cap < len) {
            cap *= 2;                     /* double until the result fits */
        }
        char *tmp = realloc(a->ptr, cap); /* keep a->ptr valid on failure */
        if (tmp == NULL) {
            return ENOMEM;
        }
        a->ptr = tmp;
        a->cap = cap;
    }
    memcpy(&a->ptr[a->len], b->ptr, b->len);
    a->len = len;
    return 0;
}
I now see you are doing bioinformatics, which means you probably need more performance than I thought. You should use strings like this instead:
struct string {
    size_t len;
    char ptr[];   /* C99 flexible array member; char ptr[0] is the older GNU spelling */
};
This way, when you allocate a string object, you call malloc(sizeof(struct string) + len) and avoid a second call to malloc. It's a little more work but it should help measurably, in terms of speed and also memory fragmentation.
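As a rough illustration of the single-allocation layout, here is a sketch using a C99 flexible array member. The names fstring and fstring_new are hypothetical, and the extra byte for a '\0' is a convenience of this sketch rather than something the answer requires.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: header and characters share one allocation. */
struct fstring {
    size_t len;     /* number of characters, excluding the '\0' */
    char data[];    /* C99 flexible array member */
};

/* Allocate an fstring holding a copy of the first len bytes of src.
   Returns NULL on allocation failure; release with a single free(). */
struct fstring *fstring_new(const char *src, size_t len)
{
    assert(src != NULL);
    struct fstring *s = malloc(sizeof *s + len + 1);
    if (!s)
        return NULL;
    memcpy(s->data, src, len);
    s->data[len] = '\0';
    s->len = len;
    return s;
}
```

The trade-off is that growing such a string means reallocating the whole object, so every pointer to it has to be updated.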
Finally, if this isn't actually the source of error, it looks like you have some corruption. Valgrind should help you detect it if gdb fails.
One potential issue is here:
strncpy(ret,s,j-i);
return ret;
ret might not get a null terminator. See man strncpy:
char *strncpy(char *dest, const char *src, size_t n);
...
The strncpy() function is similar, except that at most n bytes of src
are copied. Warning: If there is no null byte among the first n bytes
of src, the string placed in dest will not be null terminated.
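A version of substr that always terminates its result could be sketched as follows. This is only an illustration: substr_copy is a hypothetical name, it assumes the intent was to copy starting at offset start, and it leaves out the negative-index convention of the original.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of s[start..end),
   with end clamped to the length of s. Returns NULL if start lies past end
   or if allocation fails. The caller frees the result. */
char *substr_copy(const char *s, size_t start, size_t end)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (end > len)
        end = len;
    if (start > end)
        return NULL;
    size_t n = end - start;
    char *ret = malloc(n + 1);
    if (!ret)
        return NULL;
    memcpy(ret, s + start, n);
    ret[n] = '\0';              /* explicit terminator, unlike strncpy */
    return ret;
}
```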
There's also a bug here:
j = strlen(*s)-1;
while ((isspace(*(*s+j)))&&(j > 0)) {
What if strlen(*s) is 0? You'll end up reading (*s)[-1].
You also don't check in strip() that the string doesn't consist entirely of spaces. If it does, you'll end up with j < i.
edit: Just noticed that your substr() function doesn't actually get called.
I think the memory corruption problem might be the result of how you're handling the data used in your getline() calls. Basically, line is reallocated via strndup() in the calls to strip(), so the buffer size being tracked in linelen by getline() will no longer be accurate. getline() may overrun the buffer.
while ((!empty(line))&&(line[0] != '>')) {
strip(&line); // <-- assigns a `strndup()` allocation to `line`
sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
strcat(sequences[numsequences-1].sequence,line);
rcount = getline(&line, &linelen, f); // <-- the buffer `line` points to might be
// smaller than `linelen` bytes
}
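One way to sidestep this, sketched here as an assumption rather than a tested fix for the asker's program, is to trim the line in place so that the pointer and capacity tracked by getline() never change. strip_inplace is a hypothetical replacement for strip().

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical in-place variant of strip(): trims leading and trailing
   whitespace without allocating, so the buffer handed out by getline()
   stays the same allocation. Returns s. */
char *strip_inplace(char *s)
{
    assert(s != NULL);
    size_t start = 0;
    while (s[start] && isspace((unsigned char)s[start]))
        start++;
    size_t end = strlen(s);
    while (end > start && isspace((unsigned char)s[end - 1]))
        end--;
    memmove(s, s + start, end - start);   /* regions may overlap */
    s[end - start] = '\0';
    return s;
}
```

With this, the inner loop would call strip_inplace(line) instead of strip(&line), while the title copy made with strdup could be trimmed the same way. Empty and all-whitespace strings are handled without reading before the start of the buffer.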

Resources