Subtlety in strstr?

I have a file of binary data with various character strings sprinkled throughout. I am trying to write a C program to find the first occurrence of user-specified strings in the file. (I know this can be done with bash but I need a C program for other reasons.) The code as it stands is:
#include <stdio.h>
#include <string.h>
#define CHUNK_SIZE 512
int main(int argc, char **argv) {
char *fname = argv[1];
char *tag = argv[2];
FILE *infile;
char *chunk;
char *taglcn = NULL;
long lcn_in_file = 0;
int back_step;
fpos_t pos;
// allocate chunk
chunk = (char*)malloc((CHUNK_SIZE + 1) * sizeof(char));
// find back_step
back_step = strlen(tag) - 1;
// open file
infile = fopen(fname, "r");
// loop
while (taglcn == NULL) {
// read chunk
memset(chunk, 0, (CHUNK_SIZE + 1) * sizeof(char));
fread(chunk, sizeof(char), CHUNK_SIZE, infile);
printf("Read %c\n", chunk[0]);
// look for tag
taglcn = strstr(chunk, tag);
if (taglcn != NULL) {
// if you find tag, add to location the offset in bytes from beginning of chunk
lcn_in_file += (long)(taglcn - chunk);
printf("HEY I FOUND IT!\n");
} else {
// if you don't find tag, add chunk size minus back_step to location and ...
lcn_in_file += ((CHUNK_SIZE - back_step) * sizeof(char));
// back file pointer up by back_step for next read
fseek(infile, -back_step, SEEK_CUR);
fgetpos(infile, &pos);
printf("%ld\n", pos);
printf("%s\n\n\n", chunk);
}
}
printf("%ld\n", lcn_in_file);
fclose(infile);
free(chunk);
}
If you're wondering, back_step is put in to take care of the unlikely eventuality that the string in question is split by a chunk boundary.
The file I am trying to examine is about 1Gb in size. The problem is that for some reason I can find any string within the first 9000 or so bytes, but beyond that, strstr is somehow not detecting any string. That is, if I look for a string located beyond 9000 or so bytes into the file, strstr does not detect it. The code reads through the entire file and never finds the search string.
I have tried varying CHUNK_SIZE from 128 to 50000, with no change in results. I have tried varying back_step as well. I have even put in diagnostic code to print out chunk character by character when strstr fails to find the string, and sure enough, the string is exactly where it is supposed to be. The diagnostic output of pos is always correct.
Can anyone tell me where I am going wrong? Is strstr the wrong tool to use here?

Since you say your file is binary, strstr() will stop scanning at the first null byte in the file.
If you wish to look for patterns in binary data, then the memmem() function is appropriate, if it is available. It is available on Linux and some other platforms (BSD, macOS, …), but it is not part of standard C and was only added to POSIX in the 2024 edition, so older systems may not provide it. It bears roughly the same relation to strstr() that memcpy() bears to strcpy().
Note that your code should detect the number of bytes read by fread() and only search on that.
char *tag = …; // Identify the data to be searched for
size_t taglen = …; // Identify the data's length (maybe strlen(tag))
size_t nbytes; // fread() returns size_t
while ((nbytes = fread(chunk, 1, (CHUNK_SIZE + 1), infile)) > 0)
{
…
taglcn = memmem(chunk, nbytes, tag, taglen);
if (taglcn != NULL)
…found it…
…
}
It isn't really clear why you have the +1 on the chunk size. The fread() function doesn't add null bytes at the end of the data or anything like that. I've left that aspect unchanged, but would probably not use it in my own code.
It is good that you take care of identifying a tag that spans the boundaries between two chunks.
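To see the effect of an embedded null byte in isolation, here is a minimal sketch. The name find_bytes is hypothetical: it is a stand-in for memmem() written with standard C calls only, not the library function itself, and the test data is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical portable stand-in for memmem(): returns a pointer to the
   first occurrence of needle (nlen bytes) in hay (hlen bytes), or NULL. */
const void *find_bytes(const void *hay, size_t hlen,
                       const void *needle, size_t nlen)
{
    const unsigned char *h = hay;
    if (nlen == 0)
        return hay;
    if (nlen > hlen)
        return NULL;
    for (size_t i = 0; i + nlen <= hlen; i++) {
        if (memcmp(h + i, needle, nlen) == 0)
            return h + i;
    }
    return NULL;
}

/* Returns 1 if strstr() misses a tag that a length-aware search finds. */
int strstr_misses_after_nul(void)
{
    /* "binary" data: 'a', 'b', a null byte, 'c', then the tag */
    static const char data[] = { 'a', 'b', 0, 'c', 'T', 'A', 'G', 0 };
    const char *by_str = strstr(data, "TAG");   /* stops at data[2] */
    const void *by_len = find_bytes(data, sizeof data - 1, "TAG", 3);
    return by_str == NULL && by_len == data + 4;
}
```

strstr() only ever sees the string "ab", while the length-aware search covers all seven data bytes.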

The most likely reason for strstr to fail in your code is the presence of null bytes in the file. Furthermore, you should open the file in binary mode for the file offsets to be meaningful.
To scan for a sequence of bytes in a block, use the memmem() function. If it is not available on your system, here is a simple implementation:
#include <string.h>
void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2) {
const unsigned char *p1 = haystack;
const unsigned char *p2 = needle;
if (n2 == 0)
return (void*)p1;
if (n2 > n1)
return NULL;
const unsigned char *p3 = p1 + n1 - n2 + 1;
for (const unsigned char *p = p1; (p = memchr(p, *p2, p3 - p)) != NULL; p++) {
if (!memcmp(p, p2, n2))
return (void*)p;
}
return NULL;
}
You would modify your program this way:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2);
#define CHUNK_SIZE 65536
int main(int argc, char **argv) {
if (argc < 3) {
fprintf(stderr, "missing parameters\n");
exit(1);
}
// open file
char *fname = argv[1];
FILE *infile = fopen(fname, "rb");
if (infile == NULL) {
fprintf(stderr, "cannot open file %s: %s\n", fname, strerror(errno));
exit(1);
}
char *tag = argv[2];
size_t tag_len = strlen(tag);
size_t overlap_len = 0;
long long pos = 0;
char *chunk = malloc(CHUNK_SIZE + tag_len - 1);
if (chunk == NULL) {
fprintf(stderr, "cannot allocate memory\n");
exit(1);
}
// loop
for (;;) {
// read chunk
size_t chunk_len = overlap_len + fread(chunk + overlap_len, 1,
CHUNK_SIZE, infile);
if (chunk_len < tag_len) {
// end of file or very short file
break;
}
// look for tag
char *tag_location = memmem(chunk, chunk_len, tag, tag_len);
if (tag_location != NULL) {
// if you find tag, add to location the offset in bytes from beginning of chunk
printf("string found at %lld\n", pos + (tag_location - chunk));
break;
} else {
// if you don't find tag, carry the last tag_len - 1 bytes over to the next read
overlap_len = tag_len - 1;
memmove(chunk, chunk + chunk_len - overlap_len, overlap_len);
pos += chunk_len - overlap_len;
}
}
fclose(infile);
free(chunk);
return 0;
}
Note that the file is read in chunks of CHUNK_SIZE bytes, which is optimal if CHUNK_SIZE is a multiple of the file system block size.
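The overlap handling can also be packaged as a function so that it is easy to test with a small chunk size. This is a sketch under assumptions, not a replacement for the program above: find_in_stream and its chunk_size parameter are hypothetical names, and it uses a plain memcmp() scan instead of memmem() so that it builds with standard C only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the byte offset of the first occurrence of tag (tag_len bytes)
   in fp, or -1 if it is absent or an allocation fails. chunk_size is the
   number of bytes requested per fread() and must be at least 1. */
long long find_in_stream(FILE *fp, const void *tag, size_t tag_len,
                         size_t chunk_size)
{
    if (tag_len == 0 || chunk_size == 0)
        return -1;
    size_t keep = tag_len - 1;          /* bytes carried between chunks */
    unsigned char *buf = malloc(chunk_size + keep);
    if (buf == NULL)
        return -1;
    long long base = 0;                 /* file offset of buf[0] */
    size_t have = 0;                    /* valid bytes currently in buf */
    long long result = -1;
    for (;;) {
        size_t got = fread(buf + have, 1, chunk_size, fp);
        have += got;
        /* byte-wise scan of the valid part of the buffer */
        for (size_t i = 0; i + tag_len <= have; i++) {
            if (memcmp(buf + i, tag, tag_len) == 0) {
                result = base + (long long)i;
                break;
            }
        }
        if (result >= 0 || got == 0)
            break;
        if (have > keep) {
            /* keep the last tag_len - 1 bytes: a match may straddle chunks */
            memmove(buf, buf + have - keep, keep);
            base += (long long)(have - keep);
            have = keep;
        }
    }
    free(buf);
    return result;
}
```

Calling it with a deliberately tiny chunk_size (for example 4) forces matches to straddle chunk boundaries, which is a quick way to exercise the overlap logic.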

For some really simple code, you can use mmap() and memcmp().
Error checking and proper header files are left as an exercise for the reader (there is at least one bug - another exercise for the reader to find):
int main( int argc, char **argv )
{
// put usable names on command-line args
char *fname = argv[ 1 ];
char *tag = argv[ 2 ];
// mmap the entire file
int fd = open( fname, O_RDONLY );
struct stat sb;
fstat( fd, &sb );
char *contents = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
close( fd );
size_t tag_len = strlen( tag );
size_t bytes_to_check = 1UL + sb.st_size - tag_len;
for ( size_t ii = 0; ii < bytes_to_check; ii++ )
{
if ( !memcmp( contents + ii, tag, tag_len ) )
{
// match found
// (probably want to check if contents[ ii + tag_len ]
// is a `\0' char to get actual string matches)
}
}
munmap( contents, sb.st_size );
return( 0 );
}
That likely won't be anywhere near the fastest way (in general, mmap() is not going to be anywhere near a performance winner, especially in this use case of simply streaming through a file from beginning to end), but it's simple.
(Note that mmap() also has problems if the file size changes while it's being read. If the file grows, you won't see the additional data. If the file is shortened, you'll get SIGBUS when you try to read the removed data.)

A binary data file is going to contain '\0' bytes acting as string ends. The more that are in there, the shorter the area strstr is going to search will be. Note strstr will consider its work done once it hits a 0 byte.
You can scan the memory in intervals like
while (strlen (chunk) < CHUNKSIZE)
chunk += strlen (chunk) + 1;
i.e. restart after a null byte in the chunk as long as you are still within the chunk.
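A fuller sketch of that idea follows. It assumes the buffer holds len data bytes followed by one extra '\0' byte (the question's memset over CHUNK_SIZE + 1 bytes provides that), and search_segments is a hypothetical name. Note that this approach cannot find a tag that itself contains a null byte.

```c
#include <stddef.h>
#include <string.h>

/* Runs strstr() on every null-terminated segment of buf[0..len).
   Assumes buf[len] == '\0' so that the last segment is terminated too. */
char *search_segments(char *buf, size_t len, const char *tag)
{
    char *p = buf;
    char *end = buf + len;
    while (p < end) {
        char *hit = strstr(p, tag);
        if (hit != NULL)
            return hit;
        p += strlen(p) + 1;   /* skip this segment and its null byte */
    }
    return NULL;
}
```

Each pass searches one segment and then steps over the null byte that ended it, so the whole chunk is covered even when it contains many null bytes.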

Related

Segmentation fault: 11 while trying to print array with 50k items

I'm trying to print an array with 50k items into a file, but it only works if I use a small number of items, e.g. 5k.
void fputsArray(int *arr, int size, char *filename)
{
char *string = (char*)calloc( (8*size+1), sizeof(char) );
for(int i = 0; i < size; i++)
sprintf( &string[ strlen(string) ], "%d\n", arr[i] );
FILE *output;
char fullFilename[50] = "./";
output = fopen(strcat(fullFilename, filename), "w");
fputs(string, output);
fclose(output);
free(string);
}
size is 50000, set with #define.
This code works. But if I remove the multiplication by 8 from the allocation size, which I expected to be enough, it crashes with Segmentation fault: 11.
Why do I have to allocate 8 times more memory than I need?

Assuming the input to your function is all correct:
void fputsArray(int *arr, int size, char *filename)
Sizes should be given as size_t.
{
char *string = (char*)calloc( (8*size+1), sizeof(char) );
The clearing of the memory (calloc) is unnecessary, malloc and setting string[0] = '\0' would suffice. sizeof( char ) is always 1 by definition. And you should not cast the result of an allocation.
Actually, the whole construct is unnecessary, but that's for later.
for(int i = 0; i < size; i++)
sprintf( &string[ strlen(string) ], "%d\n", arr[i] );
Not actually that bad, aside from string + strlen( string ) being simpler and that there should always be { } surrounding the statement. Still unnecessarily complex.
FILE *output;
char fullFilename[50] = "./";
output = fopen(strcat(fullFilename, filename), "w");
A filename is always relative to the current working directory, so the "./" is unnecessary. You should however have checked the filename length before strcating it into a static buffer like that.
fputs(string, output);
Ah, but you have not checked if the fopen actually succeeded!
fclose(output);
free(string);
}
All in all, I've seen worse. Whether your numbers actually fit your buffer is guesswork, though, and most importantly the whole memory shenanigans are unnecessary.
Consider:
void printArray( int const * arr, size_t size, char const * filename )
{
FILE * output = fopen( filename, "w" );
if ( output != NULL )
{
for ( size_t i = 0; i < size; ++i )
{
fprintf( output, "%d\n", arr[i] );
}
fclose( output );
}
else
{
perror( "File open failed" );
}
}
I think this is much better than trying to figure out where your memory guesswork went wrong.
Edit: On second thought, I would have that function take a FILE * argument instead of a filename, which would give you the flexibility of printing to an already-opened stream (like stdout) as well, and also let you do the error handling of the fopen in a place higher up that might have additional capabilities to give useful information.
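A minimal sketch of that variant, assuming the caller opens and closes the stream. The name print_array and the return convention are illustrative choices, not part of the original function.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes one value per line to an already-open stream.
   Returns 0 on success, -1 if any write fails. */
int print_array(FILE *out, const int *arr, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        if (fprintf(out, "%d\n", arr[i]) < 0)
            return -1;
    }
    return 0;
}
```

With this signature, print_array(stdout, arr, size) prints to the terminal, and the same function writes to a file once the caller has checked the result of fopen().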

size is 50000, set with #define. This code works. But if I remove the multiplication by 8 from the allocation size, which I expected to be enough, it crashes with Segmentation fault: 11. Why do I have to allocate 8 times more memory than I need?
You are asking about this size estimate:
char *string = (char*)calloc( (8*size+1), sizeof(char) );
But the array in use is an int[], and you write one value per line, as in
sprintf( &string[ strlen(string) ], "%d\n", arr[i] );
This seems unnecessarily complicated. As for the size, assume all values are INT_MIN, defined (in limits.h) as
#define INT_MIN (-2147483647 - 1)
for a 4-byte integer. That is 11 characters: 10 digits plus one for the sign. This covers any int value. Add 1 for the '\n', giving at most 12 bytes per line.
But...
Why use calloc() at all?
Why not just use an array of size * 12 bytes that would fit every possible value?
Why declare a new char* to hold the values in text form instead of just using fprintf() directly?
Why return void instead of something like -1 on error, or the number of items written to disk on success?
Back to the program
If you really want to write the array to disk in a single call to fputs(), holding the whole giant string in memory, note that sprintf() returns the number of bytes written, so you can add that value to a running offset into the output string instead of calling strlen() each time.
If you want to use memory allocation you can do it in blocks. Consider that if all values are between 0 and 999, each of the 50,000 lines takes no more than 4 bytes. But if all values are equal to INT_MIN, each line takes the maximum of 12 bytes.
So you can use the return value of sprintf() to advance the offset into the string, and call realloc() when needed, allocating in blocks of, say, a few kilobytes. (If you really want to do that, write back and I can post an example.)
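The block-wise idea could look roughly like this. It is only a sketch: format_ints is a hypothetical name, the initial size of 4096 bytes is arbitrary, and it uses snprintf() rather than sprintf() so that a value that does not fit is detected instead of overrunning the buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Formats n ints, one per line, into a heap buffer that grows on demand.
   Returns the null-terminated string (the caller frees it) and stores its
   length in *out_len, or returns NULL if an allocation fails. */
char *format_ints(const int *arr, size_t n, size_t *out_len)
{
    size_t cap = 4096;   /* initial block size, chosen arbitrarily */
    size_t used = 0;     /* offset of the next free byte */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        for (;;) {
            /* snprintf() reports how many chars the value needs */
            int need = snprintf(buf + used, cap - used, "%d\n", arr[i]);
            if (need < 0) {
                free(buf);
                return NULL;
            }
            if ((size_t)need < cap - used) {
                used += (size_t)need;   /* it fit, advance the offset */
                break;
            }
            /* it did not fit: double the buffer and format again */
            cap *= 2;
            char *tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    *out_len = used;
    return buf;
}
```

The returned string can then be written with a single fputs() call, and no worst-case size estimate is needed up front.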
C Example
The code below writes the file the way you tried to, and returns the total bytes written. It depends on the values of the array, anyway. The maximum is what I said, 12 bytes per line...
int fputsArray( unsigned size, int* array , const char* filename)
{
static char string[12 * MY_SIZE_ + 1] = {0}; // +1 for the final '\0' written by sprintf()
unsigned ix = 0; // pointer to the next char to use in string
FILE* output = fopen( filename, "w");
if ( output == NULL ) return -1;
// file is open
for(int i = 0; i < size; i+= 1)
{
unsigned used = sprintf( (string + ix), "%d\n", array[i] );
ix += used;
}
fputs(string, output);
fclose(output);
return ix;
}
Using fprintf()
This code writes the same file, using fprintf() and is way simpler...
int fputsArray_b( unsigned size, int* array , const char* filename)
{
unsigned ix = 0; // bytes written
FILE* output = fopen( filename, "w");
if ( output == NULL ) return -1;
// file is open
for(int i = 0; i < size; i+= 1)
ix += fprintf( output, "%d\n", array[i]);
fclose(output);
return ix;
}
A complete test with the 2 functions
#define MY_SIZE_ 50000
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
int fputsArray(const unsigned,int*,const char*);
int fputsArray_b(const unsigned,int*,const char*);
int main(void)
{
int value[MY_SIZE_];
srand(210726); // seed for today :)
value[0] = INT_MIN; // just to test: this is the longest value
for ( int i=1; i<MY_SIZE_; i+=1 ) value[i] = rand();
int used = fputsArray( MY_SIZE_, value, "test.txt");
printf("%d bytes written to disk\n", used );
used = fputsArray_b( MY_SIZE_, value, "test_b.txt");
printf("%d bytes written to disk using the alternate function\n", used );
return 0;
}
int fputsArray( unsigned size, int* array , const char* filename)
{
static char string[12 * MY_SIZE_ + 1] = {0}; // +1 for the final '\0' written by sprintf()
unsigned ix = 0; // pointer to the next char to use in string
FILE* output = fopen( filename, "w");
if ( output == NULL ) return -1;
// file is open
for(int i = 0; i < size; i+= 1)
{
unsigned used = sprintf( (string + ix), "%d\n", array[i] );
ix += used;
}
fputs(string, output);
fclose(output);
return ix;
}
int fputsArray_b( unsigned size, int* array , const char* filename)
{
unsigned ix = 0; // bytes written
FILE* output = fopen( filename, "w");
if ( output == NULL ) return -1;
// file is open
for(int i = 0; i < size; i+= 1)
ix += fprintf( output, "%d\n", array[i]);
fclose(output);
return ix;
}
The program writes 2 identical files...

Reading the words of a file into a dynamic 2D array

I am trying to read a file and store every word into a dynamically allocated 2D array. The size of the input file is unknown.
I am totally lost and don't know how I could "fix/finish" the program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char filename[25];
printf("Input the filename");
scanf("%s", filename);
fileConverter(filename);
}
int fileConverter(char filename[25]) {
//int maxLines = 50000;
//int maxWordSize = 128;
//char words[maxLines][maxWordSize];
//char **words;
char **arr = (char**) calloc(num_elements, sizeof(char*));
for ( i = 0; i < num_elements; i++ ) {
arr[i] = (char*) calloc(num_elements_sub, sizeof(char));
}
FILE *file = NULL;
int amountOfWords = 0;
file = fopen(filename, "r");
if(file == NULL) {
exit(0);
}
while(fgets(words[amountOfWords], 10000, file)) {
words[amountOfWords][strlen(words[amountOfWords]) - 1] = "\0";
amountOfWords++;
}
for(int i = 0; i < amountOfWords; i++) {
printf("a[%d] = ", i);
printf("%s\n", words[i]);
}
printf("The file contains %d words and the same amount of lines.\n", amountOfWords);
return amountOfWords;

The main challenges for this kind of problem are
reallocating the array of strings as the program reads new words, and
handling words that are larger than the buffer used by fgets.
The general approach for this kind of parsing problem is to design a state machine. The state machine here has two states:
1. The current character is whitespace. Action: Continue reading whitespace until we reach the end of the buffer, or until we land on a non-whitespace character, in which case we switch to state 2.
2. The current character is non-whitespace (i.e. a word). Action: Continue reading non-whitespace until we reach the end of the buffer, or until we land on a whitespace character, in which case we copy the word we just read to the array of strings and switch to state 1.
Particularly difficult is the case in which we are in state 2 and reach the end of the buffer. This means that this word spans multiple buffers. To accommodate for this, we deviate slightly from a direct state machine implementation. State 2 is slightly different, depending on if we are reading a new word or continuing one that was started in a previous buffer.
We now keep track of wordSize. If we start reading from the start of a buffer, but wordSize is not 0, then we know we are continuing a previous word and we know what size it was for the realloc we need.
Below is one possible implementation. All the work is done in the wordArrayRead function. Walking through it from the top of the function:
First we declare the variables that we need across lineBuffer reads: an index for the word itself and the length of the word we are currently reading, followed by the declaration of the buffer itself. The outside loop repeatedly reads using fgets until we have exhausted the input.
We start reading at index 0 and stop at the null-terminator. The first if-statement checks if we should be in state 2: either the current character is the start of a word or we were already reading a word.
State 2
The index wordStartIdx stays at the first character of the word (segment) and we walk the wordEndIdx to the end of the word (segment) or to the end of the buffer.
We then check if we need to increase the size of the array of strings. Here we increase it to 2 times + 1 the previous size to avoid frequent reallocations.
We set a boolean value indicating whether we have reached the end of a word. If we have, we need to allocate for and write the null-terminator at the end of the string.
If wordLength == 0 it means we are reading a new word and have to allocate memory for it for the first time. If wordLength != 0, we have to reallocate to append to an existing word.
We copy the word (segment) currently in the lineBuffer to the array of strings.
Now, we do some bookkeeping. If we reached the end of a word, we write the null-terminator, increment the index to point to the next word location and reset wordLength. If this wasn't the case, we only increment the wordLength with the length of the segment we just read. Finally, we update wordStartIdx, which still points to the start of the word, to point to the end of the word, so we can continue iterating over the buffer.
State 1
Having finished the State 2 processing, we go into State 1, which has only two lines. It simply advances the index until we land on non-whitespace. Note that the null-terminator of the lineBuffer ('\0') does not count as whitespace, so this loop will not continue past the end of the buffer.
After all input has been processed, we shrink the array of strings to the actual size of its data. This "corrects" the allocation policy of increasing the size by 2n+1 each time it wasn't large enough.
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// BUFFER_SIZE must be >1U
#define BUFFER_SIZE 1024U
struct WordArray
{
char **words;
size_t numberOfWords;
};
static struct WordArray wordArrayConstruct(void);
static void wordArrayResize(struct WordArray *wordArray, size_t const newSize);
static void wordArrayDestruct(struct WordArray *wordArray);
static void wordArrayRead(FILE *restrict stream, struct WordArray *wordArray);
static char *reallocStringWrapper(char *restrict str, size_t const newSize);
static void wordArrayPrint(struct WordArray const *wordArray);
int main(void)
{
struct WordArray wordArray = wordArrayConstruct();
wordArrayRead(stdin, &wordArray);
wordArrayPrint(&wordArray);
wordArrayDestruct(&wordArray);
}
static void wordArrayRead(FILE *restrict stream, struct WordArray *wordArray)
{
size_t wordArrayIdx = 0U;
size_t wordLength = 0U;
char lineBuffer[BUFFER_SIZE];
while (fgets(lineBuffer, sizeof lineBuffer, stream) != NULL)
{
size_t wordStartIdx = 0U;
while (lineBuffer[wordStartIdx] != '\0')
{
if (!isspace(lineBuffer[wordStartIdx]) || wordLength != 0U)
{
size_t wordEndIdx = wordStartIdx;
while (!isspace(lineBuffer[wordEndIdx]) && wordEndIdx != BUFFER_SIZE - 1U)
++wordEndIdx;
if (wordArrayIdx >= wordArray->numberOfWords)
wordArrayResize(wordArray, wordArray->numberOfWords * 2U + 1U);
size_t wordSegmentLength = wordEndIdx - wordStartIdx;
size_t foundWordEnd = wordEndIdx != BUFFER_SIZE - 1U; // 0 or 1 bool
// Allocate for a new word, or reallocate for an existing word
// If a word end was found, add 1 to the size for the '\0' character
char *dest = wordLength == 0U ? NULL : wordArray->words[wordArrayIdx];
size_t allocSize = wordLength + wordSegmentLength + foundWordEnd;
wordArray->words[wordArrayIdx] = reallocStringWrapper(dest, allocSize);
memcpy(&(wordArray->words[wordArrayIdx][wordLength]),
&lineBuffer[wordStartIdx], wordSegmentLength);
if (foundWordEnd)
{
wordArray->words[wordArrayIdx][wordLength + wordSegmentLength] = '\0';
++wordArrayIdx;
wordLength = 0U;
}
else
{
wordLength += wordSegmentLength;
}
wordStartIdx = wordEndIdx;
}
while (isspace(lineBuffer[wordStartIdx]))
++wordStartIdx;
}
}
// All done. Shrink the words array to the size of the actual data
if (wordArray->numberOfWords != 0U)
wordArrayResize(wordArray, wordArrayIdx);
}
static struct WordArray wordArrayConstruct(void)
{
return (struct WordArray) {.words = NULL, .numberOfWords = 0U};
}
static void wordArrayResize(struct WordArray *wordArray, size_t const newSize)
{
assert(newSize > 0U);
char **tmp = (char**) realloc(wordArray->words, newSize * sizeof *wordArray->words);
if (tmp == NULL)
{
wordArrayDestruct(wordArray);
fprintf(stderr, "WordArray allocation error\n");
exit(EXIT_FAILURE);
}
wordArray->words = tmp;
wordArray->numberOfWords = newSize;
}
static void wordArrayDestruct(struct WordArray *wordArray)
{
for (size_t wordStartIdx = 0U; wordStartIdx < wordArray->numberOfWords; ++wordStartIdx)
{
free(wordArray->words[wordStartIdx]);
wordArray->words[wordStartIdx] = NULL;
}
free(wordArray->words);
}
static char *reallocStringWrapper(char *restrict str, size_t const newSize)
{
char *tmp = (char*) realloc(str, newSize);
if (tmp == NULL)
{
free(str);
fprintf(stderr, "Realloc string allocation error\n");
exit(EXIT_FAILURE);
}
return tmp;
}
static void wordArrayPrint(struct WordArray const *wordArray)
{
for (size_t wordStartIdx = 0U; wordStartIdx < wordArray->numberOfWords; ++wordStartIdx)
printf("%zu: %s\n", wordStartIdx, wordArray->words[wordStartIdx]);
}
Note: This program reads input from stdin, as Unix/Linux utilities typically do. Use input redirection to read from a file, or pass a different FILE * stream to the wordArrayRead function.

To allocate a dynamic 2D array you need:
void allocChar2Darray(size_t rows, size_t columns, char (**array)[columns])
{
*array = malloc(rows * sizeof(**array));
}

C read big file into char* array too slow

I'd like to read a big file while the first character of a line isn't " ".
But the code I have written is very slow. How can I speed up the routine?
Is there a better solution instead of getline?
void readString(const char *fn)
{
FILE *fp;
char *vString;
struct stat fdstat;
int stat_res;
stat_res = stat(fn, &fdstat);
fp = fopen(fn, "r+b");
if (fp && !stat_res)
{
vString = (char *)calloc(fdstat.st_size + 1, sizeof(char));
int dataEnd = 1;
size_t len = 0;
int emptyLine = 1;
char **linePtr = malloc(sizeof(char*));
*linePtr = NULL;
while(dataEnd)
{
// Check every line
getline(linePtr, &len, fp);
// When data ends, the line begins with space (" ")
if(*linePtr[0] == 0x20)
emptyLine = 0;
// If line begins with space, stop writing
if(emptyLine)
strcat(vString, *linePtr);
else
dataEnd = 0;
}
strcat(vString, "\0");
free(linePtr);
linePtr = NULL;
}
}
int main(int argc, char **argv){
readString(argv[1]);
return EXIT_SUCCESS;
}

How can I speed up the routine?
The most suspicious aspect of your program performance-wise is the strcat(). On each call, it needs to scan the whole destination string from the beginning to find the place to append the source string. As a result, if your file's lines have length bounded by a constant (even a large one), then your approach's performance scales with the square of the file length.
The asymptotic complexity analysis doesn't necessarily tell the whole story, though. The I/O part of your code scales linearly with file length, and since I/O is much more expensive than in-memory data manipulation, that will dominate your performance for small enough files. If you're in that regime then you're probably not going to do much better than you already do. In that event, though, you might still do a bit better by reading the whole file at once via fread(), and then scanning it for end-of-data via strstr():
size_t nread = fread(vString, 1, fdstat.st_size, fp);
// Handle nread != fdstat.st_size ...
// terminate the buffer as a string
vString[nread] = '\0';
// truncate the string after the end-of-data:
char *eod = strstr(vString, "\n ");
if (eod) {
// terminator found - truncate the string after the newline
eod[1] = '\0';
} // else no terminator found
That scales linearly, so it addresses your asymptotic complexity problem, too, but if the data of interest will often be much shorter than the file, then it will leave you in those cases doing a lot more costly I/O than you need to do. In that event, one alternative would be to read in chunks, as @laissez_faire suggested. Another would be to tweak your original algorithm to track the end of vString so as to use strcpy() instead of strcat() to append each new line. The key part of that version would look something like this:
char *linePtr = NULL;
ssize_t nread = 0; // getline() returns ssize_t, -1 on EOF or error
size_t len = 0;
*vString = '\0'; // In case the first line is end-of-data
for (char *end = vString; ; end += nread) {
// Check every line
nread = getline(&linePtr, &len, fp);
if (nread < 0) {
// handle eof or error ...
break;
}
// When data ends, the line begins with space (" ")
if (*linePtr == ' ') {
break;
}
strcpy(end, linePtr);
}
free(linePtr);
Additionally, note that
You do not need to initially zero-fill the memory allocated for vString, as you're just going to overwrite those zeroes with the data of real interest (and then ignore the rest of the buffer).
You should not cast the return value of malloc()-family functions, including calloc().

Have you tried to read the file using fread and read a bigger chunk of data in each step and then parse the data after reading it? Something like:
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
char *readString(const char *fn)
{
FILE *fp;
char *vString = NULL; // stays NULL if the file cannot be opened
struct stat fdstat;
int stat_res;
stat_res = stat(fn, &fdstat);
fp = fopen(fn, "r+b");
if (fp && !stat_res) {
vString = (char *) calloc(fdstat.st_size + 1, sizeof(char));
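int newline = 1; // true when the next byte read starts a new line
int done = 0; // set once the end-of-data marker has been seen
int index = 0;
char *buffer = (char *) malloc(4096); // allocated once and reused
while (buffer && !done && index < fdstat.st_size) {
int len =
fdstat.st_size - index >
4096 ? 4096 : fdstat.st_size - index;
int read_len = fread(buffer, 1, len, fp);
int i;
if (read_len <= 0)
break; // read error, or the file is shorter than stat() reported
if (newline && buffer[0] == ' ')
break; // the chunk starts with a line that begins with a space
for (i = 0; i + 1 < read_len; ++i) {
if (buffer[i] == '\n' && buffer[i + 1] == ' ') {
read_len = i + 1; // keep everything up to and including the newline
done = 1;
break;
}
}
memcpy(vString + index, buffer, read_len);
index += read_len;
newline = buffer[read_len - 1] == '\n';
}
free(buffer);
}
if (fp)
fclose(fp);
return vString;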
}
int main(int argc, char **argv)
{
char *str = readString(argv[1]);
printf("%s", str);
free(str);
return EXIT_SUCCESS;
}

Using realloc to expand buffer while reading from file crashes

I am writing some code that needs to read fasta files, so part of my code (included below) is a fasta parser. As a single sequence can span multiple lines in the fasta format, I need to concatenate multiple successive lines read from the file into a single string. I do this by realloc'ing the string buffer after reading every line, to be the current length of the sequence plus the length of the line read in. I do some other stuff, like stripping white space etc. All goes well for the first sequence, but fasta files can contain multiple sequences. So similarly, I have a dynamic array of structs with two strings (title, and actual sequence), each a "char *". Again, as I encounter a new title (introduced by a line beginning with '>') I increment the number of sequences, and realloc the sequence list buffer. The realloc crashes when allocating space for the second sequence with
*** glibc detected *** ./stackoverflow: malloc(): memory corruption: 0x09fd9210 ***
Aborted
For the life of me I can't see why. I've run it through gdb and everything seems to be working (i.e. everything is initialised, the values seems sane)... Here's the code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include <math.h>
#include <errno.h>
//a structure to keep a record of sequences read in from file, and their titles
typedef struct {
char *title;
char *sequence;
} sequence_rec;
//string convenience functions
//checks whether a string consists entirely of white space
int empty(const char *s) {
int i;
i = 0;
while (s[i] != 0) {
if (!isspace(s[i])) return 0;
i++;
}
return 1;
}
//substr allocates and returns a new string which is a substring of s from i to
//j exclusive, where i < j; If i or j are negative they refer to distance from
//the end of the s
char *substr(const char *s, int i, int j) {
char *ret;
if (i < 0) i = strlen(s)-i;
if (j < 0) j = strlen(s)-j;
ret = malloc(j-i+1);
strncpy(ret,s,j-i);
return ret;
}
//strips white space from either end of the string
void strip(char **s) {
int i, j, len;
char *tmp = *s;
len = strlen(*s);
i = 0;
while ((isspace(*(*s+i)))&&(i < len)) {
i++;
}
j = strlen(*s)-1;
while ((isspace(*(*s+j)))&&(j > 0)) {
j--;
}
*s = strndup(*s+i, j-i);
free(tmp);
}
int main(int argc, char**argv) {
sequence_rec *sequences = NULL;
FILE *f = NULL;
char *line = NULL;
size_t linelen;
int rcount;
int numsequences = 0;
f = fopen(argv[1], "r");
if (f == NULL) {
fprintf(stderr, "Error opening %s: %s\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
rcount = getline(&line, &linelen, f);
while (rcount != -1) {
while (empty(line)) rcount = getline(&line, &linelen, f);
if (line[0] != '>') {
fprintf(stderr,"Sequence input not in valid fasta format\n");
return EXIT_FAILURE;
}
numsequences++;
sequences = realloc(sequences,sizeof(sequence_rec)*numsequences);
sequences[numsequences-1].title = strdup(line+1); strip(&sequences[numsequences-1].title);
rcount = getline(&line, &linelen, f);
sequences[numsequences-1].sequence = malloc(1); sequences[numsequences-1].sequence[0] = 0;
while ((!empty(line))&&(line[0] != '>')) {
strip(&line);
sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
strcat(sequences[numsequences-1].sequence,line);
rcount = getline(&line, &linelen, f);
}
}
return EXIT_SUCCESS;
}

You should use strings that look something like this:
struct string {
int len;
char *ptr;
};
This prevents strncpy bugs like what it seems you saw, and allows you to do strcat and friends faster.
You should also use a doubling array for each string. This prevents too many allocations and memcpys. Something like this:
int sstrcat(struct string *a, struct string *b)
{
int len = a->len + b->len;
int alen = a->len;
if (a->len < len) {
while (a->len < len) {
a->len *= 2;
}
a->ptr = realloc(a->ptr, a->len);
if (a->ptr == NULL) {
return ENOMEM;
}
}
memcpy(&a->ptr[alen], b->ptr, b->len);
return 0;
}
I now see you are doing bioinformatics, which means you probably need more performance than I thought. You should use strings like this instead:
struct string {
int len;
char ptr[]; /* C99 flexible array member; ptr[0] is a GNU extension */
};
This way, when you allocate a string object, you call malloc(sizeof(struct string) + len) and avoid a second call to malloc. It's a little more work but it should help measurably, in terms of speed and also memory fragmentation.
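As a rough illustration of the single-allocation idea, here is a minimal sketch. The names `fstring` and `fstring_new` are made up for this example, and the extra terminating byte is an assumption added so the result still works with the ordinary str* functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical name: a length-prefixed string stored in one allocation. */
struct fstring {
    size_t len;   /* bytes in ptr, not counting the terminator */
    char ptr[];   /* C99 flexible array member */
};

/* Allocate header and payload together; returns NULL on failure. */
struct fstring *fstring_new(const char *src, size_t len)
{
    struct fstring *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->ptr, src, len);
    s->ptr[len] = '\0';   /* assumption: keep it usable with str* functions */
    return s;
}
```

A single free(s) releases both the header and the characters.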
Finally, if this isn't actually the source of error, it looks like you have some corruption. Valgrind should help you detect it if gdb fails.

One potential issue is here:
strncpy(ret,s,j-i);
return ret;
ret might not get a null terminator. See man strncpy:
char *strncpy(char *dest, const char *src, size_t n);
...
The strncpy() function is similar, except that at most n bytes of src
are copied. Warning: If there is no null byte among the first n bytes
of src, the string placed in dest will not be null terminated.
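One way to avoid the missing terminator is to copy with memcpy and write the null byte explicitly. This is only a sketch: `copy_range` is a hypothetical helper, not the original substr(), and it assumes the caller wants the half-open range s[i..j).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy s[i..j) into a fresh buffer that is always
 * null terminated. Returns NULL for an invalid range or failed allocation. */
char *copy_range(const char *s, size_t i, size_t j)
{
    size_t n;
    char *ret;
    if (j < i || j > strlen(s))
        return NULL;
    n = j - i;
    ret = malloc(n + 1);
    if (ret == NULL)
        return NULL;
    memcpy(ret, s + i, n);
    ret[n] = '\0';   /* the step strncpy does not guarantee */
    return ret;
}
```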
There's also a bug here:
j = strlen(*s)-1;
while ((isspace(*(*s+j)))&&(j > 0)) {
What if strlen(*s) is 0? You'll end up reading (*s)[-1].
You also don't check in strip() that the string doesn't consist entirely of spaces. If it does, you'll end up with j < i.
edit: Just noticed that your substr() function doesn't actually get called.
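For the two strip() problems noted above (the empty string and the all-whitespace string), a trim that works on unsigned offsets and never lets the end cross the start sidesteps both. This is a sketch with a made-up name, `trim_in_place`; it modifies the buffer instead of allocating a new one, which is a design choice of this example rather than something the original code does.

```c
#include <ctype.h>
#include <string.h>

/* Trim leading and trailing whitespace in place and return s.
 * Safe for "" and for strings made entirely of whitespace. */
char *trim_in_place(char *s)
{
    size_t start = 0;
    size_t end = strlen(s);   /* one past the last character */
    while (start < end && isspace((unsigned char)s[start]))
        start++;
    while (end > start && isspace((unsigned char)s[end - 1]))
        end--;
    memmove(s, s + start, end - start);
    s[end - start] = '\0';
    return s;
}
```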

I think the memory corruption problem might be the result of how you're handling the data used in your getline() calls. Basically, line is reallocated via strndup() in the calls to strip(), so the buffer size being tracked in linelen by getline() will no longer be accurate. getline() may overrun the buffer.
while ((!empty(line))&&(line[0] != '>')) {
strip(&line); // <-- assigns a `strndup()` allocation to `line`
sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
strcat(sequences[numsequences-1].sequence,line);
rcount = getline(&line, &linelen, f); // <-- the buffer `line` points to might be
// smaller than `linelen` bytes
}
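A possible way around this is to never reassign the pointer that getline() manages, and to trim inside its buffer instead. The sketch below assumes a POSIX system; `count_nonblank_lines` is a hypothetical function that only demonstrates the ownership pattern, not the FASTA parsing itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: count the lines that are non-empty after trimming.
 * `line` and `cap` are only ever changed by getline(), so `cap` always
 * matches the real size of the buffer. */
long count_nonblank_lines(FILE *f)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t n;
    long count = 0;
    while ((n = getline(&line, &cap, f)) != -1) {
        size_t start = 0, end = (size_t)n;
        while (start < end && isspace((unsigned char)line[start]))
            start++;
        while (end > start && isspace((unsigned char)line[end - 1]))
            end--;
        /* trim inside getline's own buffer; the pointer stays the same */
        memmove(line, line + start, end - start);
        line[end - start] = '\0';
        if (line[0] != '\0')
            count++;
    }
    free(line);   /* one free for the single buffer getline allocated */
    return count;
}
```

If a trimmed copy is needed elsewhere, strdup() the trimmed line into a separate variable and leave `line` alone.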

C: searching for a string in a file

If I have:
const char *mystr = "cheesecakes";
FILE *myfile = fopen("path/to/file.exe","r");
I need to write a function to determine whether myfile contains any occurrences of mystr. Could anyone help me? Thanks!
UPDATE: So it turns out the platform I need to deploy to doesn't have memstr. Does anyone know of a free implementation I can use in my code?

If you can't fit the whole file into memory, and you have access to the GNU memmem() extension, then:
Read as much as you can into a buffer;
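Search the buffer with memmem(buffer, len, mystr, strlen(mystr)) (pass just the string length: adding 1 would require the terminating null byte to be present in the file as well);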
Discard all but the last strlen(mystr) characters of the buffer, and move those to the start;
Repeat until end of file reached.
If you don't have memmem, then you can implement it in plain C using memchr and memcmp, like so:
/*
* The memmem() function finds the start of the first occurrence of the
* substring 'needle' of length 'nlen' in the memory area 'haystack' of
* length 'hlen'.
*
* The return value is a pointer to the beginning of the sub-string, or
* NULL if the substring is not found.
*/
void *memmem(const void *haystack, size_t hlen, const void *needle, size_t nlen)
{
    int needle_first;
    /* use a byte pointer: arithmetic on void * is not standard C */
    const unsigned char *p = haystack;
    size_t plen = hlen;
    if (!nlen)
        return NULL;
    needle_first = *(const unsigned char *)needle;
    while (plen >= nlen && (p = memchr(p, needle_first, plen - nlen + 1)))
    {
        if (!memcmp(p, needle, nlen))
            return (void *)p;
        p++;
        plen = hlen - (size_t)(p - (const unsigned char *)haystack);
    }
    return NULL;
}
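The read, search and carry-over steps described above can be sketched roughly as follows. The names `find_bytes` and `find_in_stream` are invented for this example; `find_bytes` is a plain stand-in so the block does not depend on memmem being available. It carries over nlen - 1 bytes, which is the minimum needed to catch a match that straddles two reads.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable stand-in for memmem(). */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hlen,
                                       const unsigned char *needle, size_t nlen)
{
    size_t i;
    if (nlen == 0 || hlen < nlen)
        return NULL;
    for (i = 0; i + nlen <= hlen; i++)
        if (hay[i] == needle[0] && memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
    return NULL;
}

/* Returns the byte offset of the first occurrence of needle in fp,
 * or -1 if it is not found (or on allocation failure). */
long find_in_stream(FILE *fp, const void *needle, size_t nlen, size_t chunk)
{
    unsigned char *buf;
    size_t have = 0;    /* bytes carried over from the previous read */
    size_t got;
    long base = 0;      /* file offset of buf[0] */
    long result = -1;

    if (nlen == 0 || chunk == 0)
        return -1;
    buf = malloc(nlen - 1 + chunk);
    if (buf == NULL)
        return -1;
    while ((got = fread(buf + have, 1, chunk, fp)) > 0) {
        size_t total = have + got;
        size_t keep;
        const unsigned char *hit = find_bytes(buf, total, needle, nlen);
        if (hit != NULL) {
            result = base + (long)(hit - buf);
            break;
        }
        /* keep the tail so a match split across two reads is still seen */
        keep = total < nlen - 1 ? total : nlen - 1;
        memmove(buf, buf + total - keep, keep);
        base += (long)(total - keep);
        have = keep;
    }
    free(buf);
    return result;
}
```

Open the file in binary mode ("rb") before calling this, so that no newline translation shifts the offsets.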

Because standard C has no memmem or memstr to find a string in a binary array (others suggested reading it into memory and using strstr; that doesn't work, because strstr stops at the first zero byte), you have to read it byte by byte with fgetc and write a small state machine to match it while reading.
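One common way to build such a state machine is the Knuth-Morris-Pratt failure table, which says how far to fall back after a mismatch without re-reading any input. The following is a sketch under that assumption, not necessarily what the answer had in mind; `kmp_find_in_stream` is a made-up name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the byte offset of the first occurrence of pat in fp,
 * or -1 if it is not found (or pat is empty, or allocation fails). */
long kmp_find_in_stream(FILE *fp, const char *pat)
{
    size_t m = strlen(pat), k = 0, i;
    size_t *fail;
    long offset = 0, result = -1;
    int c;   /* int so that EOF is distinguishable from a data byte */

    if (m == 0)
        return -1;
    fail = malloc(m * sizeof *fail);
    if (fail == NULL)
        return -1;
    /* fail[i]: length of the longest proper prefix of pat[0..i]
     * that is also a suffix of pat[0..i] */
    fail[0] = 0;
    for (i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }
    k = 0;   /* k is the state: how many pattern bytes currently match */
    while ((c = fgetc(fp)) != EOF) {
        offset++;
        while (k > 0 && (unsigned char)pat[k] != c)
            k = fail[k - 1];
        if ((unsigned char)pat[k] == c)
            k++;
        if (k == m) {
            result = offset - (long)m;
            break;
        }
    }
    free(fail);
    return result;
}
```

Each input byte is read exactly once, so this works on pipes as well as on seekable files.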

See here:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
for a Boyer-Moore implementation in C99. This is a very common string searching algorithm and runs in O(n).
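For a feel of how this family of algorithms skips ahead, here is a sketch of the simpler Boyer-Moore-Horspool variant, which keeps only the bad-character table. It is not the full Boyer-Moore from the linked article, its worst case is O(n*m) rather than linear, and `horspool_find` is a made-up name.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search over raw bytes.
 * Returns a pointer to the first match, or NULL if there is none. */
const unsigned char *horspool_find(const unsigned char *hay, size_t hlen,
                                   const unsigned char *needle, size_t nlen)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos = 0;

    if (nlen == 0 || hlen < nlen)
        return NULL;
    /* default: a byte not in the needle lets us skip a whole needle length */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = nlen;
    /* bytes in the needle (except its last position) allow a shorter skip */
    for (i = 0; i + 1 < nlen; i++)
        shift[needle[i]] = nlen - 1 - i;

    while (pos + nlen <= hlen) {
        if (memcmp(hay + pos, needle, nlen) == 0)
            return hay + pos;
        /* skip based on the byte aligned with the needle's last position */
        pos += shift[hay[pos + nlen - 1]];
    }
    return NULL;
}
```

Because it takes a length rather than relying on a terminator, it can be run on a buffer filled with fread from a binary file.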

Here is a slapped together version of it. It has no error checking and probably has overflow bugs. But I think it finds the desired string and accounts for the backtracking necessary for partial substring matches. I doubt there are more than 15 bugs left.
Edit: There was at least one in the first answer. I woke up in the middle of the night and realized the backtracking check was wrong. It didn't find '12123' in '1212123'. It might still be wrong, but at least it finds that one now.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( int argc, char* argv[] )
{
    FILE *fp;
    char *find, *hist;
    int len, pos=0, i;
    int c;  // int, not char, so EOF can be told apart from a data byte
    fp = fopen( argv[1], "rb" );  // binary mode: no newline translation
    find = argv[2];
    len = (int)strlen( find );
    hist = malloc( len );
    memset( hist, 0, len );
    // test fgetc's result directly; looping on feof() handles EOF as a byte
    while (( c = fgetc( fp )) != EOF ) {
        // update history buffer first, so it includes the current byte
        // (this is inefficient - better as circular buffer)
        memmove( hist, hist + 1, len - 1 );
        hist[len-1] = (char)c;
        if ( find[pos] == (char)c ) {
            pos++;
        }
        else {
            // mismatch: back up to the longest prefix of find (no longer
            // than the old pos) that matches the end of the history buffer
            for ( i = pos; i > 0; i-- )
                if ( 0 == memcmp( hist+len-i, find, i ))
                    break;
            pos = i;
        }
        if ( pos == len ) {
            printf( "Found it\n" );
            return 1;
        }
    }
    printf( "Not found\n" );
    return 0;
}

Here's a function that will search for a string in a buffer.
Limitations: it doesn't handle wide characters (in the case of internationalization). You'll have to write your own code to read the file into memory. It won't find the pattern if the pattern is split between 2 read buffers.
/*****************************************************
const char *buffer pointer to your read buffer (the larger, the better).
size_t bufsize the size of your buffer
const char *pattern pattern you are looking for.
Returns an index into the buffer if pattern is found.
-1 if pattern is not found.
Sample:
pos = findPattern (buffer, BUF_SIZE, "cheesecakes");
*****************************************************/
int findPattern (const char *buffer, size_t bufSize, const char *pattern)
{
    size_t i,j;
    size_t patternLen;
    // minor optimization. Determine patternLen so we don't
    // bother searching buffer if fewer than patternLen bytes remain.
    patternLen = strlen (pattern);
    // "i + patternLen <= bufSize" also covers a match that ends exactly at
    // the end of the buffer, and cannot underflow when bufSize < patternLen.
    for (i=0; i+patternLen<=bufSize; ++i)
    {
        for (j=0; j<patternLen; ++j)
        {
            if (buffer[i+j] != pattern[j])
            {
                break;
            }
        }
        if (j == patternLen)
        {
            return (int)i;
        }
    }
    return -1;
}

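#include <stdio.h>
#include <string.h>

/* Returns 1 if match occurs in fp, 0 otherwise. */
int contains(FILE *fp, const char *match)
{
    size_t depth = 0, len = strlen(match);
    int ch;  /* int so EOF can be told apart from a data byte */
    if (len == 0)
        return 1;
    while ((ch = fgetc(fp)) != EOF)
    {
        if ((unsigned char)match[depth] == ch)
        {
            /* advance first, then test, so the last character counts */
            if (++depth == len)
                return 1;
        }
        else
            /* restart; the current byte may itself begin a new match */
            depth = ((unsigned char)match[0] == ch) ? 1 : 0;
    }
    return 0;
}
roughly. Note that this restarts after a mismatch rather than backtracking properly, so it can still miss a match that overlaps a failed partial match (for example '12123' inside '1212123').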
