Split a file into N smaller files in C

I'm making a program to split a file into N smaller parts
of (almost) equal sizes. So here's my code:
FILE * fp = fopen(file, "r");
long aux;
long cursor = 0;
long blockSize = 1024000; // suppose each smaller file will have ~1 MB
long bytesLimit = blockSize;

for (i = 0; i < n; i++) {
    FILE * fp_aux = fopen(outputs[i], "w"); // outputs is an array of temporary file names
    while (cursor < bytesLimit) { // here occurs the infinite loop
        fscanf(fp, "%lu\n", &aux);
        fprintf(fp_aux, "%lu\n", aux);
        cursor = ftell(fp);
    }
    fclose(fp_aux);
    bytesLimit = bytesLimit + blockSize;
}
//here add some more logic to get the remaining content left in the main file
The code works if I want to split the file into two or three parts, but when I try to split it into 10 parts, fscanf gets stuck reading the same number over and over, and the loop never terminates.
My input file has the format "%lu\n" like below:
1231231
4341342
4564565
...

If splitting a file is the focus, then simplify your method. Since your post indicates you are working with a text file, the assumption is that it contains words with punctuation, numbers, linefeeds, etc. Content like that can be parsed into lines using fgets()/fputs(). This lets you read lines from one large file, track the accumulated size as you go, and write the lines out to several smaller files...
Some simple steps:
1) Determine the size of the file to be split.
2) Set the desired small-file size.
3) Open the large file.
4) Use fgets()/fputs() in a loop, opening and closing output files to split the contents, using the accumulated size as the split point.
5) Clean up (fclose() files, etc.).
Here is an example that illustrates these steps. It splits a large text file by size, regardless of text content. (I used a 130K text file and split it into segments of about 5K.)
#include <stdio.h>
#include <string.h>

#define SEGMENT 5000 // approximate target size of each small file

long file_size(char *name); // function defined below

int main(void)
{
    int segments = 0, i, accum;
    FILE *fp1, *fp2;
    char filename[260] = {"c:\\play\\smallFileName_"}; // base name for small files
    char largeFileName[] = {"c:\\play\\largeFileName.txt"}; // change to your path
    char smallFileName[260];
    char line[1080];

    long sizeFile = file_size(largeFileName);
    segments = sizeFile / SEGMENT + 1; // round up so the tail of the file is not lost

    fp1 = fopen(largeFileName, "r");
    if (fp1)
    {
        for (i = 0; i < segments; i++)
        {
            accum = 0;
            sprintf(smallFileName, "%s%d.txt", filename, i);
            fp2 = fopen(smallFileName, "w");
            if (fp2)
            {
                // test accum *before* reading, so no line is read and
                // then dropped at a segment boundary
                while (accum <= SEGMENT && fgets(line, sizeof line, fp1))
                {
                    accum += strlen(line); // track size of growing file
                    fputs(line, fp2);
                }
                fclose(fp2);
            }
        }
        fclose(fp1);
    }
    return 0;
}

long file_size(char *name)
{
    FILE *fp = fopen(name, "rb"); // must be opened in binary to count bytes
    long size = -1;
    if (fp)
    {
        fseek(fp, 0, SEEK_END);
        size = ftell(fp);
        fclose(fp);
    }
    return size;
}

If the file contains bad data that doesn't match the long unsigned int format, fscanf fails to convert it and the file position of fp doesn't advance. The program then attempts the very same read from the very same position, over and over.
To prevent this, check the return value of fscanf and verify it has the expected value (here, 1, the number of items converted).
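As a minimal sketch (using the question's variables), the inner loop could bail out when the conversion fails:
if (fscanf(fp, "%lu\n", &aux) != 1)
    break; /* bad data or end of file: stop instead of spinning */
fprintf(fp_aux, "%lu\n", aux);
cursor = ftell(fp);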

If you want to split a file into several parts with a specified maximum file size of each part, why do you use fscanf(..), ftell(..) and fprintf(..)?
This is not the fastest way to achieve your goal...
I recommend doing it in this way:
Open input file
As long as there is input data (!feof(..))
Open output file (if not already open)
Read block of input data (fread)
Write block of data to output file (fwrite)
track number of bytes written and close output file if maximum file size is reached
Go back to step 2.
Clean up
Done this way, the split files will not exceed the specified maximum file size. Additionally, you avoid slow formatted I/O functions like fprintf.
A possible implementation would look like this:
#include <stdio.h>

/*
** splitFile
** Splits an existing input file into multiple output files with a specified
** maximum file size.
**
** Return Value:
** Number of created result files, 0 in case of bad input data, or a negative
** value in case of an error during file splitting.
*/
int splitFile(char *fileIn, size_t maxSize)
{
    int result = 0;
    FILE *fIn;
    FILE *fOut = NULL;
    char nameOut[1024];
    char buffer[1024 * 16];
    size_t size = 0;
    size_t read;
    size_t written;

    if ((fileIn != NULL) && (maxSize > 0))
    {
        fIn = fopen(fileIn, "rb");
        if (fIn != NULL)
        {
            result = 1; /* we have at least one part */
            while (!feof(fIn))
            {
                /* calculate size of data to be read from the input file
                   so the current part does not exceed maxSize */
                read = sizeof(buffer);
                if ((size + read) > maxSize)
                {
                    read = maxSize - size;
                }

                /* read data from input file */
                read = fread(buffer, 1, read, fIn);
                if (read == 0)
                {
                    if (!feof(fIn))
                    {
                        result *= -1;   /* a real read error */
                    }
                    else if (fOut == NULL)
                    {
                        result--;       /* clean EOF right on a part boundary */
                    }
                    break;
                }

                /* initialize the (next) output file only once there is data
                   for it, so no empty part is created when the input size is
                   an exact multiple of maxSize */
                if (fOut == NULL)
                {
                    sprintf(nameOut, "%s.%03d", fileIn, result);
                    fOut = fopen(nameOut, "wb");
                    if (fOut == NULL)
                    {
                        result *= -1;
                        break;
                    }
                }

                /* write data to output file */
                written = fwrite(buffer, 1, read, fOut);
                if (written != read)
                {
                    result *= -1;
                    break;
                }

                /* update size counter of current output file */
                size += written;
                if (size >= maxSize) /* next split? */
                {
                    fclose(fOut);
                    fOut = NULL;
                    size = 0;
                    result++;
                }
            }

            /* clean up */
            if (fOut != NULL)
            {
                fclose(fOut);
            }
            fclose(fIn);
        }
    }
    return result;
}
The above code split a test file with a size of 126803945 bytes into 121 1MB parts in about 500ms.
Note that the size of buffer (here: 16 KB) affects the speed at which a file is split. The bigger the buffer, the faster a huge file is split. If you want to use really large buffers (>1 MB or so), you have to allocate (and free) the buffer on each call (or use a static buffer if you do not need reentrant code).
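For instance, a per-call allocation might look like this (a sketch only; splitFileBuffered and bufSize are illustrative names, not part of the answer above):
#include <stdio.h>
#include <stdlib.h>

/* sketch: allocate the chunk buffer per call so very large chunk
   sizes do not exhaust the stack */
int splitFileBuffered(char *fileIn, size_t maxSize, size_t bufSize)
{
    char *buffer = malloc(bufSize);
    int result = 0;
    if (buffer != NULL)
    {
        /* ... same read/write loop as splitFile() above, with
           sizeof(buffer) replaced by bufSize ... */
        free(buffer);
    }
    return result;
}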

Related

input.wav will copy to output.wav but will not multiply the input file by the factor

This program is supposed to take the input file and copy it to the output file and then go by 2 bit samples and change the volume of the input file and save the updated version in the output file. The input file copies into output.wav but won't change the volume of it. I know I'm fairly on track but can't figure out why it won't work correctly. This also passes check50 somehow but when I compile and run it myself it doesn't do what it's supposed to do.
// Modifies the volume of an audio file

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Number of bytes in .wav header
const int HEADER_SIZE = 44;

int main(int argc, char *argv[])
{
    // Check command-line arguments
    if (argc != 4)
    {
        printf("Usage: ./volume input.wav output.wav factor\n");
        return 1;
    }

    // Open files and determine scaling factor
    FILE *input = fopen(argv[1], "r");
    if (input == NULL)
    {
        printf("Could not open file.\n");
        return 1;
    }
    FILE *output = fopen(argv[2], "w");
    if (output == NULL)
    {
        printf("Could not open file.\n");
        return 1;
    }
    float factor = atof(argv[3]);

    // TODO: Copy header from input file to output file
    uint8_t header[HEADER_SIZE];
    fread(header, HEADER_SIZE, 1, input);
    fwrite(header, HEADER_SIZE, 1, output);

    // TODO: Read samples from input file and write updated data to output file
    int16_t buffer;
    while (fread(&buffer, sizeof(int16_t), 1, input))
    {
        buffer = buffer * factor;
        fwrite(&buffer, sizeof(int16_t), 1, output);
    }

    // Close files
    fclose(input);
    fclose(output);
}
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html tells you that a wave file is a sequence of chunks. Each chunk starts with a 4-byte chunk id (ckID) followed by a 4-byte (little-endian) chunk size (ckSize). There are different types of chunks, so I suggest you read ckID and ckSize, then figure out if you want to process that chunk. For scaling you only want to process the samples of the chunk with ckID = "data", and this assumes your file is wFormatTag = WAVE_FORMAT_PCM, i.e. PCM encoded.
You didn't tell us much about your input file, so I grabbed the first file from https://samplelib.com/sample-wav.html. It consists of 3 chunks: "RIFF", "fmt " and "data". The data samples in this file start at byte 44. This file has wBitsPerSample = 16, or 2 bytes per sample. You said "2 bit samples" but I assume you meant bytes. When I run your program over this file with factor = 2, the first sample is 0xf872, which is -1934, and the corresponding output file sample is 0xf0e4 = -3868. When I visually compare the input and output files in a wave editor (in my case mhwaveedit), it looks scaled.
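A minimal sketch of that chunk walk (my illustration, not code from this answer; it assumes the stream is positioned just past the 12-byte "RIFF....WAVE" header and that ckSize fits in a long):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* walk the chunks of a RIFF/WAVE stream until the "data" chunk is found;
   returns 0 when positioned at the first sample, -1 otherwise */
int seek_data_chunk(FILE *f)
{
    uint8_t hdr[8];
    while (fread(hdr, 1, 8, f) == 8)
    {
        /* assemble the little-endian ckSize */
        uint32_t cksize = hdr[4] | (hdr[5] << 8)
                        | ((uint32_t)hdr[6] << 16) | ((uint32_t)hdr[7] << 24);
        if (memcmp(hdr, "data", 4) == 0)
            return 0; /* scale samples from here */
        /* skip this chunk (chunks are word-aligned: pad byte if odd size) */
        fseek(f, (long)(cksize + (cksize & 1)), SEEK_CUR);
    }
    return -1;
}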
As I mentioned, buffer * factor may underflow or overflow, and in audio you usually clip rather than wrap. Here is a way to accomplish this:
printf("buffer = %d", buffer);
if (factor * buffer > INT16_MAX)
    buffer = INT16_MAX;
else if (factor * buffer < INT16_MIN)
    buffer = INT16_MIN;
else
    buffer *= factor;
printf(" => %d\n", buffer);
And when I now run your program it clips:
$ ./clipping input.wav output.wav 2|egrep -- '-32768|32767' | head -1
buffer = 16661 => 32767
where that sample previously would overflow:
$ ./scaling input.wav output.wav 2|grep 16661 | head -1
buffer = 16661 => -32214

Read/write binary into a string in C

So, I looked around the internet and a couple of questions here, and I couldn't find anything that could fix my problem. I have an assignment for C programming: write a program that lets the user enter words into a string, add more words, write all the words in the string to a text file, delete all words in the string, and, on exit, save the words to a binary file which is loaded when the program starts up again. I've gotten everything to work except the part concerning the binary file.
I made two functions, one that loads the bin file when the program starts, and one that saves the bin file when it ends. I don't know in which one (or if in both) the problem starts. Basically I know it's not working right because I get garbage in my text file if I save to it after the program loads the bin file into the string. I know for sure that the text-file saver works properly.
Thank you to anyone who takes the time to help me out, it's been an all-day process! lol
Here are the two snippets of my functions. Everything else in my code seems to work, so I don't want to bloat this post with the entire program, but if need be I'll put it up.
SIZE is a constant of 10000, to meet the program spec of 1000 words. But I couldn't get this to run even asking for only 10 elements, or 1, just to clear that up.
void loadBin(FILE *myBin, char *stringAll) {
    myBin = fopen("myBin.bin", "rb");
    if (myBin == NULL) {
        saveBin(&myBin, stringAll);
    } //if no bin file exists yet
    fread(stringAll, sizeof(char), SIZE + 1, myBin);
    fclose(myBin);
}
void saveBin(FILE *myBin, char *stringAll) {
    int stringLength = 0;
    myBin = fopen("myBin.bin", "wb");
    if (myBin == NULL) {
        printf("Problem writing file!\n");
        exit(-1);
    }
    stringLength = strlen(stringAll);
    fwrite(&stringAll, sizeof(char), (stringLength + 1), myBin);
    fclose(myBin);
}
You are leaving bad values in your myBin FILE*, and passing the & (address) of a pointer.
Pass the filename, and you can (re)use the functions for other purposes, other files, et al.
char* filename = "myBin.bin";
Pass the filename, buffer pointer, and max size to read. You should also consider using stat()/fstat() to discover the file size (see the sketch after these functions).
size_t loadBin(char *fn, char *stringAll, size_t size)
{
    //since you never use the caller's FILE*, keep this FILE* local
    FILE *myBin = NULL;
    if (NULL == (myBin = fopen(fn, "rb"))) {
        //create the missing file; there is nothing to read yet
        saveBin(fn, stringAll, 0);
        return 0;
    } //if no bin file exists yet
    size_t howmany = fread(stringAll, sizeof(char), size, myBin);
    if (howmany < size) printf("read fewer\n");
    fclose(myBin);
    return howmany;
}
Pass the file name, buffer pointer, and size to save
size_t saveBin(char *fn, char *stringAll, size_t size)
{
    //again, why carry around a FILE* pointer only used locally?
    FILE *myBin = NULL;
    if (NULL == (myBin = fopen(fn, "wb"))) {
        printf("Problem writing file!\n");
        exit(-1);
    }
    //binary data may have embedded '\0' bytes, so strlen() cannot be used:
    //stringLength = strlen(stringAll);
    size_t howmany = fwrite(stringAll, sizeof(char), size, myBin);
    if (howmany < size) printf("short write\n");
    fclose(myBin);
    return howmany;
}
Call these; you are not guaranteed to write & read the same sizes...
size_t buffer_size = SIZE;
char buffer[SIZE]; //fill this with interesting bytes
saveBin(filename, buffer, buffer_size);
size_t readcount = loadBin(filename, buffer, buffer_size);
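For the stat()/fstat() suggestion above, a sketch (POSIX assumed; binFileSize is a hypothetical helper name) could look like:
#include <sys/stat.h>

/* query the file size up front so the caller knows how much to read */
long binFileSize(const char *fn)
{
    struct stat st;
    if (stat(fn, &st) != 0)
        return -1;            /* missing or inaccessible file */
    return (long)st.st_size;  /* size in bytes */
}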

Copy a file with buffers of different sizes for read and write

I have been doing some practice problems for job interviews and I came across a function that I can't wrap my mind around. The idea is to create a function that takes the names of two files, plus the allowed buffer size for reading from file1 and the allowed buffer size for writing to file2. If the buffer sizes are the same, I know how to work through the question, but I am having problems figuring out how to move data between the buffers when the sizes are different. Part of the constraints is that we must always fill the write buffer before writing it to the file; if file1's size is not a multiple of the write buffer size, we pad the last buffer with zeros.
// input: name of two files made for copy, and their limited buffer sizes
// output: number of bytes copied
int fileCopy(char* file1, char* file2, int bufferSize1, int bufferSize2) {
    int bytesTransfered = 0;
    int bytesMoved = 0;
    char *buffer1, *buffer2;
    FILE *fp1, *fp2;

    fp1 = fopen(file1, "r");
    if (fp1 == NULL) {
        printf("Not able to open this file");
        return -1;
    }
    fp2 = fopen(file2, "w");
    if (fp2 == NULL) {
        printf("Not able to open this file");
        fclose(fp1);
        return -1;
    }
    buffer1 = (char*) malloc(sizeof(char) * bufferSize1);
    if (buffer1 == NULL) {
        printf("Memory error");
        return -1;
    }
    buffer2 = (char*) malloc(sizeof(char) * bufferSize2);
    if (buffer2 == NULL) {
        printf("Memory error");
        return -1;
    }
    bytesMoved = fread(buffer1, sizeof(buffer1), 1, fp1);
    //TODO: Fill buffer2 with the maximum amount, whether buffer1 <= buffer2 or buffer1 > buffer2
    //How do I iterate through file1 while ensuring buffer2 is always filled before writing?
    bytesTransfered += fwrite(buffer2, sizeof(buffer2), 1, fp2);
    fclose(fp1);
    fclose(fp2);
    return bytesTransfered;
}
How should I write the while loop for the buffer transfers before the fwrites?
I am having problems figuring how to move data between the buffers when the sizes are different
Lay out a plan. For "some practice problems for job interviews", a good plan and the ability to justify it are important. Coding, although important, is secondary.
given valid: 2 FILE *, 2 buffers and their sizes
while write active && read active
    while write buffer not full && reading active
        if read buffer empty
            read
            update read active
        append min(read buffer length, write buffer available space) of read to write buffer
    if write buffer not empty
        pad write buffer
        write
        update write active
return file status
Now code it. A more robust solution would use a struct to group the FILE*, buffer, size, offset, length, and active variables (a sketch of that struct follows the code below).
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define min(a, b) ((a) < (b) ? (a) : (b))

// Return true on problem
static bool rw(FILE *in_s, void *in_buf, size_t in_sz, FILE *out_s,
        void *out_buf, size_t out_sz) {
    size_t in_offset = 0;
    size_t in_length = 0;
    bool in_active = true;
    size_t out_length = 0;
    bool out_active = true;

    while (in_active && out_active) {
        // While room for more data
        while (out_length < out_sz && in_active) {
            if (in_length == 0) {
                in_offset = 0;
                // element size 1 so the return value is a byte count
                in_length = fread(in_buf, 1, in_sz, in_s);
                in_active = in_length > 0;
            }
            // Append a portion of `in` to `out`
            size_t chunk = min(in_length, out_sz - out_length);
            memcpy((char *) out_buf + out_length, (char *) in_buf + in_offset, chunk);
            out_length += chunk;
            in_length -= chunk;
            in_offset += chunk;
        }
        if (out_length > 0) {
            // Padding only occurs, maybe, on the last write
            memset((char *) out_buf + out_length, 0, out_sz - out_length);
            out_active = fwrite(out_buf, 1, out_sz, out_s) == out_sz;
            out_length = 0;
        }
    }
    return ferror(in_s) || ferror(out_s);
}
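The struct grouping mentioned above might look like this (a sketch; the field names are illustrative):
/* one side (read or write) of the copy, grouped as suggested */
typedef struct {
    FILE  *stream;
    char  *buf;
    size_t size;    /* capacity of buf */
    size_t offset;  /* next unconsumed byte (read side) */
    size_t length;  /* bytes currently held */
    bool   active;  /* stream still usable */
} endpoint;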
Other notes:
Casting malloc() results is not needed. (via @Gerhardh)
// buffer1 = (char*) malloc (sizeof(char)*bufferSize1);
buffer1 = malloc(sizeof *buffer1 * bufferSize1);
Use stderr for error messages. (via @Jonathan Leffler)
Open the files in binary mode.
size_t is more robust for array/buffer sizes than int.
Consider sizeof buffer1 vs. sizeof (buffer1): parentheses are not needed when the operand of sizeof is an object.
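Applied to the question's open/error path, the stderr and binary-mode notes might look like this (a sketch using the question's variables):
fp1 = fopen(file1, "rb");                             /* binary mode */
if (fp1 == NULL) {
    fprintf(stderr, "Not able to open %s\n", file1);  /* errors go to stderr */
    return -1;
}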
int offset = 0;
while (bytesMoved > 0) {
    for (i = 0; i < bytesMoved && i < bufferSize2; i++)
        buffer2[i] = buffer1[offset + i];
    bytesTransfered += fwrite(buffer2, 1, i, fp2); /* count bytes, not blocks */
    bytesMoved -= i;
    offset += i;
}
If bufferSize1 is smaller than the file size, you need an outer loop around this to refill buffer1.
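A complete double-buffered loop along those lines might look like this (a sketch only, reusing the question's variables inside fileCopy(); it writes only full buffer2 blocks, zero-pads the last one per the stated constraints, and needs <string.h>):
size_t fill = 0;                       /* bytes currently staged in buffer2 */
size_t moved;
while ((moved = fread(buffer1, 1, bufferSize1, fp1)) > 0) {
    size_t off = 0;
    while (off < moved) {
        size_t chunk = moved - off;
        if (chunk > bufferSize2 - fill)
            chunk = bufferSize2 - fill;
        memcpy(buffer2 + fill, buffer1 + off, chunk);
        fill += chunk;
        off += chunk;
        if (fill == (size_t)bufferSize2) {  /* write only full buffers */
            bytesTransfered += fwrite(buffer2, 1, bufferSize2, fp2);
            fill = 0;
        }
    }
}
if (fill > 0) {                        /* zero-pad the final partial block */
    memset(buffer2 + fill, 0, bufferSize2 - fill);
    bytesTransfered += fwrite(buffer2, 1, bufferSize2, fp2);
}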
As the comments to your question have indicated, this is not the best way to transfer data from one file to another. However, your case has certain restrictions, which this solution accounts for.
(1) Since you are using a buffer, you do not need to read and write one char at a time; instead, make as few calls to those functions as possible.
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
From the man page for fread: nmemb can be bufferSize1.
(2) You will need to check the return from fread() (i.e. bytesMoved) and compare it with both bufferSize1 and bufferSize2. If (a) bytesMoved equals bufferSize1, or (b) bufferSize2 is less than bufferSize1 or less than the return from fread(), then you know there is still data that needs to be read (or written). Therefore you should begin the next transfer of data, and when it completes, return to the step you left off on.
Note: the file position of the streams passed to fread() and fwrite() picks up where it left off, in the event that the data is larger than the buffer sizes.
PseudoCode:
/* in while() loop continue reading from file 1 until nothing is left to read */
while (bytesMoved = fread(buffer1, sizeof(buffer1), bufferSize1, fp1))
{
    /* transfer from buffer1 to buffer2 */
    for (i = 0; i < bytesMoved && i < bufferSize2; i++)
        buffer2[i] = buffer1[i];
    buffer2[i] = '\0';
    iterations = 1; /* just in case buffer2 is super tiny and cannot store everything from buffer1 */
    /* in while() loop continue writing to file 2 until nothing is left to write;
       to upgrade, use strlen(buffer2) instead of bufferSize2 */
    while (bytesTransfered = fwrite(buffer2, sizeof(buffer2), bufferSize2, fp2))
    {
        /* reset buffer2 & write again from buffer1 to buffer2 */
        for (i = bufferSize2 * iterations, j = 0; i < bytesMoved && j < bufferSize2; i++, j++)
            buffer2[j] = buffer1[i];
        buffer2[j] = '\0';
        iterations++;
    }
    /* mem-reset buffer1 to prepare for the next data transfer */
}

How to parse files that cannot fit entirely in RAM

I have created a framework to parse text files of reasonable size that can fit in RAM, and for now things are going well. I have no complaints; however, what if I encounter a situation where I have to deal with large files, say, greater than 8 GB (which is the size of mine)?
What would be an efficient approach for dealing with such large files?
My framework:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <time.h>

int Parse(const char *filename, const char *outputfile);

int main(void)
{
    clock_t t1 = clock();
    /* ............................................................ */
    Parse("file.txt", NULL);
    /* ............................................................ */
    clock_t t2 = clock();
    fprintf(stderr, "time elapsed: %.4f\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    fprintf(stderr, "Press any key to continue . . . ");
    getchar();
    return 0;
}

long GetFileSize(FILE *fp)
{
    long f_size;
    fseek(fp, 0L, SEEK_END);
    f_size = ftell(fp);
    fseek(fp, 0L, SEEK_SET);
    return f_size;
}

char *dump_file_to_array(FILE *fp, size_t f_size)
{
    char *buf = (char *)calloc(f_size + 1, 1);
    if (buf) {
        size_t n = 0;
        while (fgets(buf + n, INT_MAX, fp)) {
            n += strlen(buf + n);
        }
    }
    return buf;
}

int Parse(const char *filename, const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    /* store file in dynamic memory and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error: memory allocation failed.\n", stderr);
        return 2;
    }

    /* state machine variables */
    // ........

    /* array index variables */
    size_t x = 0;
    size_t y = 0;

    /* main loop */
    while (buf[x]) {
        switch (buf[x]) {
        /* ... */
        }
        x++;
    }

    /* NUL-terminate array at y */
    buf[y] = '\0';

    /* write buffer to file and clean up */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) :
                     perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}
Pattern deletion function based on the framework:
int delete_pattern_in_file(const char *filename,
        const char *pattern, const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    /* copy file contents to buffer and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error - memory allocation failed", stderr);
        return 2;
    }

    /* delete first match */
    size_t n = 0, pattern_len = strlen(pattern);
    char *tmp, *ptr = strstr(buf, pattern);
    if (!ptr) {
        fputs("No match found.\n", stderr);
        free(buf);
        return -1;
    }
    else {
        n = ptr - buf;
        ptr += pattern_len;
        tmp = ptr;
    }

    /* delete the rest */
    while (ptr = strstr(ptr, pattern)) {
        while (tmp < ptr) {
            buf[n++] = *tmp++;
        }
        ptr += pattern_len;
        tmp = ptr;
    }

    /* copy the rest of the buffer */
    strcpy(buf + n, tmp);

    /* open file for writing and print the processed buffer to it */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) :
                     perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}
If you wish to stick with your current design, an option might be to mmap() the file instead of reading it into a memory buffer.
You could change the function dump_file_to_array to the following (Linux-specific):
#include <sys/mman.h>

char *dump_file_to_array(FILE *fp, size_t f_size) {
    char *buf = mmap(NULL, f_size, PROT_READ, MAP_SHARED, fileno(fp), 0);
    if (buf == MAP_FAILED)
        return NULL;
    return buf;
}
Now you can read over the file; the memory manager will automatically take care of holding only the relevant portions of the file in memory.
For Windows, similar mechanisms exist.
Chances are you are parsing the file line by line. So read in a large block (4K or 16K) and parse all the complete lines in it. Copy the small remainder to the beginning of the buffer and fill the rest of the buffer with the next read. Rinse and repeat.
For JSON or XML you will need an event-based parser that can accept multiple blocks of input.
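That block-and-remainder loop might look like this (a sketch; process_line() is a hypothetical callback standing in for your parser, and lines are assumed to be shorter than the buffer):
#include <stdio.h>
#include <string.h>

static void process_line(const char *line)
{
    (void)line; /* ... parse one line ... */
}

static int parse_by_blocks(FILE *fp)
{
    char buf[16384];
    size_t have = 0; /* bytes carried over from the previous block */
    size_t got;

    while ((got = fread(buf + have, 1, sizeof buf - have - 1, fp)) > 0) {
        have += got;
        buf[have] = '\0';
        char *start = buf;
        char *nl;
        /* hand every complete line in the block to the parser */
        while ((nl = strchr(start, '\n')) != NULL) {
            *nl = '\0';
            process_line(start);
            start = nl + 1;
        }
        /* copy the small remainder to the beginning of the buffer */
        have = (size_t)(buf + have - start);
        memmove(buf, start, have);
    }
    /* note: a final line without a trailing '\n' is left in buf[0..have) */
    return ferror(fp) ? -1 : 0;
}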
There are multiple issues with your approach.
The concepts of maximum and available memory are not so evident: technically, you are not limited by the RAM size, but by the quantity of memory your environment will let you allocate and use for your program. This depends on various factors:
What ABI you compile for: the maximum memory size accessible to your program is limited to less than 4 GB if you compile for 32-bit code, even if your system has more RAM than that.
What quota the system is configured to let your program use. This may be less than available memory.
What strategy the system uses when more memory is requested than is physically available: most modern systems use virtual memory and share physical memory between processes and system tasks (such as the disk cache) using very advanced algorithms that cannot be described in a few lines. It is possible on some systems for your program to allocate and use more memory than is physically installed on the motherboard, swapping memory pages to disk as more memory is accessed, at a huge cost in lag time.
There are further issues in your code:
The type long might be too small to hold the size of the file: on Windows systems, long is 32-bit even on 64-bit versions where memory can be allocated in chunks larger than 2 GB. You must use a different API to request the file size from the system (see the sketch after this list).
You read the file with a series of calls to fgets(). This is inefficient: a single call to fread() would suffice. Furthermore, if the file contains embedded null bytes ('\0' characters), chunks of the file will be missing in memory. On the other hand, you could not deal with embedded null bytes anyway if you use string functions such as strstr() and strcpy() to handle your deletion task.
The condition in while (ptr = strstr(ptr, pattern)) is an assignment. While not strictly incorrect, it is poor style: it confuses readers of your code and prevents life-saving warnings from the compiler where such assignment-conditions are coding errors. You might think that could never happen, but anyone can make a typo, and a missing = in a test is difficult to spot and has dire consequences.
Your shorthand use of the ternary operator in place of if statements is quite confusing too: outputfile ? fp = fopen(outputfile, "w") : fp = fopen(filename, "w");
Rewriting the input file in place is risky too: if anything goes wrong, the input file will be lost.
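A 64-bit-safe size query might look like this (a sketch assuming a POSIX system, where fseeko()/ftello() operate on off_t; on Windows, _fseeki64()/_ftelli64() play the same role):
#include <stdio.h>
#include <sys/types.h>

/* returns the file size in bytes, or -1 on error; may require
   _FILE_OFFSET_BITS=64 on 32-bit builds for a 64-bit off_t */
off_t file_size64(FILE *fp)
{
    off_t size = -1;
    if (fseeko(fp, 0, SEEK_END) == 0)
        size = ftello(fp);
    fseeko(fp, 0, SEEK_SET); /* restore position to the start */
    return size;
}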
Note that you can implement the filtering on the fly, without a buffer, albeit inefficiently:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: delpat PATTERN < inputfile > outputfile\n");
        return 1;
    }
    unsigned char *pattern = (unsigned char *)argv[1];
    size_t i, j, n = strlen(argv[1]);
    size_t skip[n + 1];
    int c;

    skip[0] = 0;
    for (i = j = 1; i < n; i++) {
        while (memcmp(pattern, pattern + j, i - j)) {
            j++;
        }
        skip[i] = j;
    }
    i = 0;
    while ((c = getchar()) != EOF) {
        for (;;) {
            if (i < n && c == pattern[i]) {
                if (++i == n) {
                    i = 0; /* match found, consumed */
                }
                break;
            }
            if (i == 0) {
                putchar(c);
                break;
            }
            for (j = 0; j < skip[i]; j++) {
                putchar(pattern[j]);
            }
            i -= skip[i];
        }
    }
    for (j = 0; j < i; j++) {
        putchar(pattern[j]);
    }
    return 0;
}
First of all, I wouldn't suggest holding such big files in RAM; use streams instead, because buffering is usually done by the library as well as by the kernel.
If you are accessing the file sequentially, which seems to be the case, then you probably know that all modern systems implement read-ahead algorithms, so just reading the whole file ahead of time into RAM may in most cases just waste time.
You didn't specify the use cases you have to cover, so I'm going to assume that using streams like std::ifstream and doing the parsing on the fly will suit your needs. As a side note, make sure your operations on files that are expected to be large are done in separate threads.
An alternative solution: if you're on a Linux system and you have a decent amount of swap space, just open the whole bad boy up. It will consume your RAM and also consume hard-drive space (swap). Thus you can have the entire thing open at once, just not all of it will be in RAM.
Pros
If an unexpected shutdown occurred, the memory in the swap space would be recoverable.
RAM is expensive, HDDs are cheap, so the application would put less strain on your expensive equipment.
Viruses could not harm your computer because there would be no room in RAM for them to run.
You'll be taking full advantage of the Linux operating system by using the swap space. Normally the swap space module is not used and all it does is clog up precious RAM.
The additional energy needed to utilize the entirety of the RAM can warm the immediate area. Useful during winter time.
You can add "Complex and Special Memory Allocation Engineering" to your resume.
Cons
None
Consider treating the file as an external array of lines.
Code can use an array of line indexes. This index array can be kept in memory at a fraction of the size of the large file. Access to any line is accomplished quickly via this lookup, a seek with fsetpos(), and an fread()/fgets(). As lines are edited, the new lines can be saved, in any order, in a temporary text file. Saving the file reads both the original file and the temp file in sequence to form and write the new file.
typedef struct {
    int attributes;       // not_yet_read, line_offset/length_determined,
                          // line_changed/in_other_file, deleted, etc.
    fpos_t line_offset;   // use with fgetpos()/fsetpos()
    unsigned line_length; // optional field, as code could re-compute it as needed
} line_index;

size_t line_count;
// read some lines
line_index *index = malloc(sizeof *index * line_count);
// read more lines
index = realloc(index, sizeof *index * line_count);
// edit lines, saving changes to an appended temporary file
// ...
// Save file: weave the contents of the source file and the temp file into the new output file.
Additionally, with enormous files, the line_index[] array itself can be kept on disk too. Access to it is easily computed. In the extreme, only 1 line of the file needs to be in memory at any time.
You mentioned a state machine. Every finite state automaton can be optimized to have minimal (or no) lookahead.
Is it possible to do this in Lex? It will generate an output C file which you can compile.
If you don't want to use Lex, you can always do the following:
1. Read n chars into a (ring?) buffer, where n is the size of the pattern.
2. Try to match the buffer against the pattern.
3. If it matches, go to 1.
4. Print buffer[0], read one char, go to 2.
Also, for very long patterns and degenerate inputs, strstr can be slow. In that case you might want to look into more advanced string matching algorithms.
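A direct rendering of those four steps might look like this (a sketch of mine, not the answerer's code; it uses memmove on a flat buffer instead of a true ring index for brevity, and is O(n*m) in the worst case):
#include <stdio.h>
#include <string.h>

/* copy `in` to `out`, deleting every occurrence of `pattern` */
void delete_pattern_stream(FILE *in, FILE *out, const char *pattern)
{
    char window[256]; /* assumes the pattern is shorter than 256 bytes */
    size_t n = strlen(pattern);
    size_t fill = 0;
    int c;

    if (n == 0 || n > sizeof window)
        return;
    while ((c = fgetc(in)) != EOF) {
        window[fill++] = (char)c;                 /* step 1: fill the buffer */
        if (fill == n) {
            if (memcmp(window, pattern, n) == 0) {
                fill = 0;                         /* steps 2-3: match, drop it */
            } else {
                fputc(window[0], out);            /* step 4: emit oldest byte */
                memmove(window, window + 1, --fill);
            }
        }
    }
    fwrite(window, 1, fill, out);                 /* flush the partial tail */
}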
mmap() is a pretty good way of working with files of large sizes.
It provides you with lots of flexibility, but you need to be cautious with page size. Here is a good article which talks about more specifics.

Reading content from multiple files in C

Example:
Three files
hi.txt
Inside of txt: "May we be"
again.txt
Inside of txt: "The ones who once"
final.txt
Inside of txt: "knew C"
And then, another file called "order"
order.txt
Inside of txt:
"hi.txt;6"
"again.txt;7"
"final.txt;3"
What I want: read the first file name, open it, list the content, wait 6 seconds, read the second name, open it, list the content, wait 7 seconds, read the third name, open it, list the content, wait 3 seconds.
If I do it without opening the content (dropping the second while loop in my code) and just list the names, it works; yet for some reason it doesn't work when it comes to the content.
orderFile = fopen("order.txt", "r");
while (fscanf(orderFile, "%49[^;];%d", fileName, &seconds) == 2)
{
    contentFile = fopen(fileName, "r");
    while (fscanf(contentFile, "%[^\t]", textContent) == 1)
    {
        printf("%s\n", textContent);
    }
    sleep(seconds);
    fclose(contentFile);
}
fclose(orderFile);
Output:
May we be
(Waits 7 seconds)
Program closes with "RUN SUCCESSFUL"
EDIT:
It works now. As you guys said, this was the problem:
Old:
while(fscanf(orderFile,"%49[^;];%d",fileName,&seconds) == 2)
New:
while(fscanf(orderFile," %49[^;];%d",fileName,&seconds) == 2)
I'm having a "hard" time completely understanding it. What does the space do? Does it skip enters? Spaces? What exactly is it?
Don't use fscanf for that
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* for sleep() */

int main(void)
{
    FILE *orderFile = fopen("order.txt", "r");
    if (orderFile != NULL)
    {
        int seconds;
        char line[128];
        /*
         * fgets reads sizeof line characters or until '\n' is encountered,
         * so this reads one line if it has fewer than sizeof line characters
         */
        while (fgets(line, sizeof line, orderFile) != NULL)
        {
            /*
             * size_t is usually unsigned long int, and is a type used
             * by some standard functions.
             */
            size_t fileSize;
            char *fileContent;
            FILE *contentFile;
            char fileName[50];
            /* parse the read line with sscanf, extracting fileName and seconds */
            if (sscanf(line, "%49[^;];%d", fileName, &seconds) != 2)
                continue;
            /* try opening the file */
            contentFile = fopen(fileName, "r");
            if (contentFile == NULL)
                continue;
            /* seek to the end of the file */
            fseek(contentFile, 0, SEEK_END);
            /*
             * get the current position in the stream;
             * it's the file size, since we are at the end of it
             */
            fileSize = ftell(contentFile);
            /* seek back to the beginning of the stream */
            rewind(contentFile);
            /*
             * request space in memory to store the file's content;
             * if the file turns out to be too large, this call will
             * fail, and you will need a different approach,
             * like reading smaller portions of the file in a loop
             */
            fileContent = malloc(1 + fileSize);
            /* check if the system gave us space */
            if (fileContent != NULL)
            {
                size_t readSize;
                /* read the whole content from the file */
                readSize = fread(fileContent, 1, fileSize, contentFile);
                /* add a null terminator to the string */
                fileContent[readSize] = '\0';
                /* show the contents */
                printf("%s\n", fileContent);
                /* release the memory back to the system */
                free(fileContent);
            }
            sleep(seconds);
            fclose(contentFile);
        }
        fclose(orderFile);
    }
    return 0;
}
Everything is briefly explained in the code; read the manuals if you need more information.
