Improving IO performance for merging two files in C

I wrote a function which merges two large files (file1, file2) into a new file (outputFile).
Each file is in a line-based format, with entries separated by a \0 byte. Both files have the same number of NUL bytes.
One example file with two entries could look like this: A\nB\n\0C\nZ\nB\n\0
Input:
file1: A\nB\0C\nZ\nB\n\0
file2: BBA\nAB\0T\nASDF\nQ\n\0
Output:
outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0
FILE * outputFile = fopen(...);
setvbuf( outputFile, NULL, _IOFBF, 1024*1024*1024 );
FILE * file1 = fopen(...);
FILE * file2 = fopen(...);
int c1, c2;
while( (c1 = fgetc(file1)) != EOF ) {
    if( c1 == '\0' ) {
        while( (c2 = fgetc(file2)) != EOF && c2 != '\0' ) {
            fwrite( &c2, sizeof(char), 1, outputFile );
        }
        char nullByte = '\0';
        fwrite( &nullByte, sizeof(char), 1, outputFile );
    } else {
        fwrite( &c1, sizeof(char), 1, outputFile );
    }
}
Is there a way to improve the IO performance of this function? I increased the buffer size of outputFile to 1 GB using setvbuf. Would it help to use posix_fadvise on file1 and file2?

You're doing IO character-by-character. That is going to be needlessly and painfully S-L-O-W, even with buffered streams.
Take advantage of the fact that your data is stored in your files as NUL-terminated strings.
Assuming you're alternating NUL-terminated strings from each file, and running on a POSIX platform so you can simply mmap() the input files:
/* _GNU_SOURCE is needed for O_DIRECT on Linux */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <malloc.h>     /* memalign() */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );
    data.bytes = sb.st_size;
    /* assumes we have a NUL byte after the file data
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    munmap( ( void * ) data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';
    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );
    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );
    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );
        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the strings, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );
    unmapFile( data1 );
    unmapFile( data2 );
}
I've put in no error checking at all; the includes at the top should cover the headers you need.
Note also that the file data is assumed to NOT be an exact multiple of the system page size, thus ensuring that there's a NUL byte mapped after the file contents. If the size of the file is an exact multiple of the page size, you'll have to mmap() an additional page after the file contents to ensure that there's a NUL byte to terminate the last string.
Or you can rely on there being a NUL byte as the last byte of the file's contents. If that ever turns out to not be true, you'll likely get either a SEGV or corrupted data.
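If you do need to handle the exact-multiple case, here is a minimal sketch of one way to guarantee a trailing NUL (assuming Linux-style MAP_ANONYMOUS and MAP_FIXED, and reusing fd and sb from mapFile() above): reserve a zero-filled region one page larger than the file, then map the file over the front of it.
size_t pagesz = ( size_t ) sysconf( _SC_PAGESIZE );
/* one page more than the file needs - the tail stays zero-filled */
size_t maplen = ( ( ( size_t ) sb.st_size / pagesz ) + 1 ) * pagesz;
char *region = mmap( NULL, maplen, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
/* overlay the file contents on the front of the reserved region */
mmap( region, sb.st_size, PROT_READ,
      MAP_PRIVATE | MAP_FIXED, fd, 0 );
/* ... use region as before, then munmap( region, maplen ); */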

You are using two function calls per character (one for input, one for output). Function calls are slow (they pollute the instruction pipeline).
fgetc() and fputc() have getc() and putc() counterparts, which are (or can be) implemented as macros, enabling the compiler to inline the entire loop, except for the reading/writing of buffers, twice per 512, 1024 or 4096 characters processed. (Those will invoke system calls, but those are inevitable anyway.)
Using read()/write() instead of buffered I/O will probably not be worth the effort; the extra bookkeeping will make your loop fatter. (BTW: using fwrite() to write one character is certainly wasteful, and the same goes for write().)
Maybe a larger output buffer could help, but I wouldn't count on it.
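For illustration, a minimal sketch of the question's merge loop rewritten with getc()/putc() (the three streams are assumed to be opened exactly as in the question):
int c1, c2;
while ( ( c1 = getc( file1 ) ) != EOF )
{
    if ( c1 == '\0' )
    {
        /* copy one entry from file2, then write the separator */
        while ( ( c2 = getc( file2 ) ) != EOF && c2 != '\0' )
            putc( c2, outputFile );
        putc( '\0', outputFile );
    }
    else
    {
        putc( c1, outputFile );
    }
}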

If you can use threads, make one for file1 and another for file2.
Make outputFile as big as you need, then have thread1 write file1 into outputFile, while thread2 seeks to offset length(file1) + 1 in outputFile and writes file2 there.
Edit:
This is not a correct answer for this case, but to prevent confusion I'll leave it here.
More discussion I found about it: improve performance in file IO in C

Related

ext4: Splitting and Concatenating File in situ

Say I have a large file on a large disk, and this file fills the disk almost entirely, e.g. a 10TB disk with an almost-10TB file and say 3GB free. Also, I do not have any other storage available.
I would like to split that file into N pieces, but splitting in half is fine for the simple case. As the desired solution is probably FS-specific, I'm on an ext4 filesystem.
I am aware of https://www.gnu.org/software/coreutils/manual/coreutils.html#split-invocation
Obviously, I do not have enough free space on the device to create the splits by copying.
Would it be possible to split file A (~10TB) into two files B and C in a way that these (B and C) would simply be new "references" to the original data of file A?
I.e. B having the same start (A_start = B_start) but a smaller length, and C starting at B_start + B_length with C_length = A_length - B_length.
File A might or might not exist in the FS after the operation.
Also, I'd be fine if there was some constraint/restriction, like this only being possible at some sector/block boundary (i.e. only at a 4096-byte raster).
The same question applies to the inverse situation:
Having two files of almost 5TB each on a 10TB hard disk: concatenating these into a resulting file of nearly 10TB size by merely adjusting the "inode references".
Sorry if the nomenclature is not that precise; I hope it's clear what I'm trying to achieve.
First, there is currently no guaranteed portable way to do what you want - any solution is going to be platform-specific, because to do what you want requires that your underlying filesystem support sparse files.
Code like this will work to split a file in half if the underlying filesystem creates sparse files (error checking left out for clarity):
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

// 1MB chunks (use a power of two)
#define CHUNKSIZE ( 1024L * 1024L )

int main( int argc, char **argv )
{
    int origFD = open( argv[ 1 ], O_RDWR );
    int newFD = open( argv[ 2 ], O_WRONLY | O_CREAT | O_TRUNC, 0644 );

    // get the size of the input file
    struct stat sb;
    fstat( origFD, &sb );

    // get a CHUNKSIZE-aligned offset near the middle of the file
    off_t startOffset = ( sb.st_size / 2L ) & ~( CHUNKSIZE - 1L );

    // get the largest CHUNKSIZE-aligned offset in the file
    off_t readOffset = sb.st_size & ~( CHUNKSIZE - 1L );

    // might have to malloc() if it doesn't fit on the stack
    char ioBuffer[ CHUNKSIZE ];

    while ( readOffset >= startOffset )
    {
        // write the data to the end of the file - the underlying
        // filesystem had better create a sparse file or this can
        // fill up the disk on the first pwrite() call
        ssize_t bytesRead = pread(
            origFD, ioBuffer, CHUNKSIZE, readOffset );
        ssize_t bytesWritten = pwrite(
            newFD, ioBuffer, bytesRead, readOffset - startOffset );

        // cut the end off the input file - this had better free up
        // disk space
        ftruncate( origFD, readOffset );

        readOffset -= CHUNKSIZE;
    }

    close( origFD );
    close( newFD );
    return( 0 );
}
There are other approaches, too. On a Solaris system, you can use fcntl() with the F_FREESP command, and on a Linux system that supports FALLOC_FL_PUNCH_HOLE you can use the fallocate() function to remove arbitrary blocks from the file after you've copied the data to another file. On such systems you wouldn't be limited to only cutting the end off the original file with ftruncate().
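For example, a minimal sketch of hole punching with fallocate() (Linux-only; FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE, and the filesystem must support it - ext4 does; fd and len are assumed to come from a copy loop like the one above):
#define _GNU_SOURCE
#include <fcntl.h>   /* fallocate(), FALLOC_FL_* on glibc */
#include <stdio.h>

/* deallocate the first len bytes of fd after they've been copied
   elsewhere; the file keeps its apparent size, but the punched
   blocks are returned to the filesystem */
if ( fallocate( fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                0, len ) == -1 )
{
    perror( "fallocate" );
}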

Overwrite the Contents of a File in C

I am writing code that forks multiple processes. They all share a file called "character". What I want to do is have every process read the only character in the file and then erase it by writing its own character in its place, so the other processes can do the same. The file is the only way the processes can communicate with each other. How can I erase the only character in the file and put a new one in its place? I was advised to use freopen() (which closes the file and reopens it, erasing its previous contents), but I am not sure it is the best way to achieve this.
You should not have to reopen the file. That gains you nothing. If you're worried about each process buffering input or output, disable buffering if you want to use FILE *-based stdio functions.
But if I'm reading your question correctly (you want each process to replace the one character in the file when a specific value is held in the file, and that value changes for each process), this will do what you want, using POSIX open(), pread(), and pwrite(). (You're already using POSIX fork(), so using low-level POSIX IO makes things a lot simpler - note that pread() and pwrite() eliminate the need for seeking.)
I'll say this is what I think you're trying to do:
// header files and complete error checking are omitted for clarity
int fd = open( filename, O_RDWR );

// fork() here?

// loop until we read the char we want from the file
for ( ;; )
{
    char data;
    ssize_t result = pread( fd, &data, sizeof( data ), 0 );

    // pread failed
    if ( result != sizeof( data ) )
    {
        break;
    }

    // if data read matches this process's value, replace the value
    // (replace 'a' with 'b', 'c', 'z' or '*' - whatever value you
    // want the current process to wait for)
    if ( data == 'a' )
    {
        data = 'b';
        result = pwrite( fd, &data, sizeof( data ), 0 );
        break;
    }
}
close( fd );
For any decent number of processes, that's going to put a lot of stress on your filesystem.
If you really want to start with fopen() and use that family of calls, this might work depending on your implementation:
FILE *fp = fopen( filename, "rb+" );

// disable buffering
setbuf( fp, NULL );

// fork() here???

// loop until the desired char value is read from the file
for ( ;; )
{
    char data;

    // with fread(), we need to fseek()
    fseek( fp, 0, SEEK_SET );
    int result = fread( &data, 1, 1, fp );
    if ( result != 1 )
    {
        break;
    }
    if ( data == 'a' )
    {
        data = 'b';
        fseek( fp, 0, SEEK_SET );
        fwrite( &data, 1, 1, fp );
        break;
    }
}
fclose( fp );
Again, that assumes I'm reading your question properly. Note that the POSIX rules John Bollinger mentioned in his comments regarding multiple handles don't apply - because the streams are explicitly not buffered.

Write to the same file with different processes in order of occurence

I am working on a UNIX-based operating system (Lubuntu 14.10). I have several processes that need to print a message to the same file and to the standard output.
When I print my message to the screen, it works the way I want, in the order of occurrence. E.g.:
Process1_message1
Process2_message1
Process3_message1
Process1_message2
Process2_message2
Process3_message2
...
However, when I check the output file it is like below:
Process1_message1
Process1_message2
Process2_message1
Process2_message2
Process3_message1
Process3_message2
...
I use fprintf(FILE *ptr, char *str) to write the message to the file.
Note: I opened the file with following format in the main process:
fptr=fopen("output.txt", "a");
where fptr is a global FILE *.
Any help will be appreciated. Thank you!
fprintf() isn't going to work. It's prone to being translated into multiple calls to write() to actually write out the data, exactly like you posted: you call fprintf() once, and under the covers it makes multiple calls to write() to actually put the data into the file.
You need to use open( filename, O_WRONLY | O_CREAT | O_APPEND, 0600 ) and write the data something like this, in order to ensure you only call write() once, which is guaranteed to be atomic:
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

ssize_t myprintf( int fd, const char *fmt, ... )
{
    char buffer[ 1024 ];
    ssize_t bytesWritten;
    va_list argp;
    va_list argp2;

    va_start( argp, fmt );
    va_copy( argp2, argp );   // vsnprintf() consumes argp - keep a copy

    int bytes = vsnprintf( buffer, sizeof( buffer ), fmt, argp );
    if ( bytes < 0 )
    {
        // formatting error
        bytesWritten = -1;
    }
    else if ( ( size_t ) bytes < sizeof( buffer ) )
    {
        bytesWritten = write( fd, buffer, bytes );
    }
    // buffer was too small, get a bigger one
    else
    {
        char *bufptr = malloc( bytes + 1 );
        bytes = vsnprintf( bufptr, bytes + 1, fmt, argp2 );
        bytesWritten = write( fd, bufptr, bytes );
        free( bufptr );
    }

    va_end( argp2 );
    va_end( argp );
    return( bytesWritten );
}
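A usage sketch (procNum and msgNum are hypothetical stand-ins for whatever each process tracks):
/* O_APPEND makes the kernel position every write() at the current
   end of file atomically, so concurrent writers can't interleave
   within a single write() call */
int fd = open( "output.txt", O_WRONLY | O_CREAT | O_APPEND, 0600 );
myprintf( fd, "Process%d_message%d\n", procNum, msgNum );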
Most likely, your problem is that the file output is fully buffered, so the output from each process doesn't appear until the standard I/O buffer for the stream (in that process) is full.
You can probably work around it sufficiently by setting line buffering:
FILE *fptr = fopen("output.txt", "a");
if (fptr != 0)
{
setvbuf(fptr, 0, _IOLBF, BUFSIZ);
…code using fptr — including your fork() calls…
fclose(fptr);
}
Every time a process writes a line to the buffer, it will be flushed. You might run into problems if your output lines are longer than BUFSIZ; then you might want to increase the size passed to setvbuf() to the largest line length you need written atomically.
If that still isn't good enough, or if you need to be able to write groups of lines at one time, you'll have to go to a solution using file descriptors as in Andrew Henle's answer. You might want to look at the O_SYNC and O_DSYNC options to open().
Flushing of stdio buffers is different when you are writing to a terminal (isatty(fileno(fptr)) ---see isatty(3)--- returns true) than when you output to a file. For a file, stdio output only does a write(2) system call when the buffer fills up, and this makes each process's messages appear together (as each buffer is flushed on exit, all of a process's output sits in one buffer). On ttys, output is flushed when the buffer fills up or when a \n char is output to the buffer (as a compromise between buffering and not buffering).
You can force a buffer flush with fflush(fptr); after fprintf(fptr, ...); or even do fflush(NULL); (which flushes all output buffers in one call).
But be careful: the write calls are what control the atomicity (not the fprintf calls), so if you have to write several pages of output in one fprintf call, be ready to accept interleaved output.
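A minimal sketch of that (again with hypothetical procNum/msgNum variables):
/* flush each message as soon as it is formatted, so it reaches the
   file in order of occurrence rather than at buffer-full or exit */
fprintf( fptr, "Process%d_message%d\n", procNum, msgNum );
fflush( fptr );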

Creating my own archive tool in C [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I was just assigned a project to create an archiving tool for Unix. So after creating the program I would do something like
"./bar -c test_archive.bar file.1"
It would create a test_archive.bar with file.1 inside of it. Then I could do some command where I list the files inside, etc. But I'm having trouble understanding the concept of making a test_archive.bar. I realize in essence it's just a file, but if you were to, say, open a .tgz with "vi file.tgz", it would give a list of the directories/files inside.
So, are there any good ways to go about creating an archive/directory in which I can extrapolate some files within and list their names, etc.?
Note: I have looked at tar.c and all the files included in that, but every file is so abstracted it's very hard to follow.
Note: I know how to read the command-line flags, etc.
Using an old (but still valid) tar format is actually pretty easy to do. Wikipedia has a nice explanation of the format here. All you need to do is this:
For each file:
Fill out and emit a header to the tar file
Emit the file contents
Pad the file size to a multiple of 512 bytes
The most basic valid header for a tar file is: (Copied from Wikipedia, basically)
100 bytes: File name
8 bytes: File mode
8 bytes: Owner's numeric ID
8 bytes: Group's numeric ID
12 bytes: File's size
12 bytes: Timestamp of last modified time
8 bytes: Checksum
1 byte: File type
100 bytes: Name of linked file
The file type can be 0 (a normal file), 1 (a hard link) or 2 (a symlink). The name of linked file is the name of the file that a link points at. If I recall correctly, if you have a hard link or symbolic link, the file content should be empty.
To quote Wikipedia:
"Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used."
"The checksum is calculated by taking the sum of the unsigned byte values of the header record with the eight checksum bytes taken to be ascii spaces (decimal value 32). It is stored as a six digit octal number with leading zeroes followed by a NUL and then a space."
Here's a simple tarball generator. Creating an extractor, dealing with automatic file feeding, etc, is left as an exercise for the reader.
#include <stdio.h>
#include <string.h>
#include <time.h>

struct tar_header
{
    char name[100];
    char mode[8];
    char owner[8];
    char group[8];
    char size[12];
    char modified[12];
    char checksum[8];
    char type[1];
    char link[100];
    char padding[255];
};

void fexpand( FILE* f, size_t amount, int value )
{
    while( amount-- )
    {
        fputc( value, f );
    }
}

void tar_add( FILE* tar_file, const char* file, const char* internal_name )
{
    //Get current position; round to a multiple of 512 if we aren't there already
    size_t index = ftell( tar_file );
    size_t offset = index % 512;
    if( offset != 0 )
    {
        fexpand( tar_file, 512 - offset, 0 );
    }

    //Store the index for the header to return to later
    index = ftell( tar_file );

    //Write some space for our header
    fexpand( tar_file, sizeof(struct tar_header), 0 );

    //Open the input file
    FILE* input = fopen( file, "rb" );
    if( input == NULL )
    {
        fprintf( stderr, "Failed to open %s for reading\n", file );
        return;
    }

    //Copy the file content to the tar file
    for( ;; )
    {
        char buffer[2000];
        size_t read = fread( buffer, 1, sizeof(buffer), input );
        if( read == 0 )
        {
            break;
        }
        fwrite( buffer, 1, read, tar_file );
    }

    //Get the end to calculate the size of the file
    size_t end = ftell( tar_file );

    //Round the file size to a multiple of 512 bytes
    offset = end % 512;
    if( offset != 0 )
    {
        fexpand( tar_file, 512 - offset, 0 );
    }

    //Fill out a new tar header
    struct tar_header header;
    memset( &header, 0, sizeof( struct tar_header ) );
    snprintf( header.name, 100, "%s", internal_name );
    snprintf( header.mode, 8, "%06o ", 0777 ); //You should probably query the input file for this info
    snprintf( header.owner, 8, "%06o ", 0 );   //^
    snprintf( header.group, 8, "%06o ", 0 );   //^
    snprintf( header.size, 12, "%011o", (unsigned)( end - 512 - index ) );
    snprintf( header.modified, 12, "%011o", (unsigned)time( NULL ) ); //Again, get this from the filesystem
    memset( header.checksum, ' ', 8 );
    header.type[0] = '0';

    //Calculate the checksum
    size_t checksum = 0;
    size_t i;
    const unsigned char* bytes = ( const unsigned char* )&header;
    for( i = 0; i < sizeof( struct tar_header ); ++i )
    {
        checksum += bytes[i];
    }
    snprintf( header.checksum, 8, "%06o ", (unsigned)checksum );

    //Save the new end to return to after writing the header
    end = ftell( tar_file );

    //Write the header
    fseek( tar_file, index, SEEK_SET );
    fwrite( bytes, 1, sizeof( struct tar_header ), tar_file );

    //Return to the end
    fseek( tar_file, end, SEEK_SET );
    fclose( input );
}

int main( int argc, char* argv[] )
{
    if( argc > 1 )
    {
        FILE* tar = fopen( argv[1], "wb" );
        if( !tar )
        {
            fprintf( stderr, "Failed to open %s for writing\n", argv[1] );
            return 1;
        }
        int i;
        for( i = 2; i < argc; ++i )
        {
            tar_add( tar, argv[i], argv[i] );
        }
        //Pad out the end of the tar file
        fexpand( tar, 1024, 0 );
        fclose( tar );
        return 0;
    }
    fprintf( stderr, "Please specify some file names!\n" );
    return 0;
}
So, are there any good ways to go about creating a archive/directory
in which i can extrapolate some files within and list their names
etc..
There are basically two approaches:
Copy file contents one after another, each prefixed with a "header" block containing information about the file's name, size and (optionally) other attributes. Tar is an example of this.
Copy file contents one after another and put somewhere (at the beginning or at the end) an "index" which contains the list of file names with their sizes and (optionally) other attributes. When you look at the file sizes, you can compute where the individual files begin/end.
Most real-world archivers use a combination of these, and add other features such as checksums, compression and encryption.
Example
Suppose we have two files, named hello.txt containing Hello, World! (13 bytes) and bar.txt containing foobar (6 bytes).
With the first method, the archive would look like this:
[hello.txt,13][Hello, World!][bar.txt,6][foobar]
^- fixed size ^- 13 bytes    ^- fixed size ^- 6 bytes
The length of the header blocks would have to be either constant, or you would have to encode their length somewhere.
With the second:
[Hello, World!foobar][hello.txt,13,bar.txt,6]
^- 13+6 bytes
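A minimal index entry for the second scheme might look like this (the layout is an illustrative assumption, not any standard format):
/* one fixed-size record per archived file, stored at the end of the
   archive; reading the index and summing the sizes tells you where
   each file's content begins */
struct index_entry
{
    char          name[100];   /* file name, NUL-padded */
    unsigned long size;        /* length of the file's content in bytes */
};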

Any Idea Why My C Code Can't Read from /proc?

I have been able to write a program that can read any text file... except the ones found in /proc. Any file that I try to read from /proc shows up empty.
But whenever I type
cat /proc/cpuinfo
on terminal, I am presented with my CPU info.
I can also see the file when I open it with a text editor, such as gedit or leafpad.
So it seems that /proc files are indeed text files, but my C program is having a hard time reading them.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char* readFileString( char* loc )
{
    char *fileDat;
    FILE * pFile;
    long lsize;

    pFile = fopen( loc, "r" );

    // Grab the file size.
    fseek( pFile, 0L, SEEK_END );
    lsize = ftell( pFile );
    fseek( pFile, 0L, SEEK_SET );

    fileDat = calloc( lsize + 1, sizeof(char) );
    fread( fileDat, 1, lsize, pFile );
    fclose( pFile );
    return fileDat;
}

int main( void )
{
    char *cpuInfo;
    cpuInfo = readFileString( "/proc/cpuinfo" );
    printf( "%s\n", cpuInfo );
    free( cpuInfo );
    return 0;
}
Any idea why?
The files in /proc report a size of 0 bytes because they are generated on the fly by the kernel.
See here for more information on proc filesystem:
http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
Most /proc/ textual files are intended to be read sequentially by a classical loop like
FILE *f = fopen( "/proc/cpuinfo", "r" );
size_t sz = 0;
char *lin = 0;
do {
    ssize_t lsz = getline( &lin, &sz, f );
    if ( lsz < 0 )
        break;
    handle_line_of_size( lin, lsz );
} while ( !feof( f ) );
fclose( f );
Seeking doesn't work on them, a bit like with pipes.
If you want to know the size of a file, stat(2) is the way to go. But for what you're doing, either allocate a very large buffer (RAM is cheap and this is a one-shot program) that you fread() into after you fopen() the file, or learn about realloc(3) and use it in your file-reading loop. As ouah said, the files in /proc are special.
For general-purpose use, and especially for strings, calloc() is a waste of CPU cycles: setting the 0th char of the returned allocation to '\0' is sufficient to make it an empty string, regardless of the data following that first byte.
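A sketch of such a realloc(3)-based reader (error checking kept minimal), which works for /proc files because it never trusts the reported size:
#include <stdio.h>
#include <stdlib.h>

char *readWholeFile( const char *path )
{
    FILE *f = fopen( path, "r" );
    if ( f == NULL )
        return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc( cap );

    for ( ;; )
    {
        /* grow the buffer whenever less than 4 kB is free */
        if ( cap - len < 4096 )
            buf = realloc( buf, cap *= 2 );

        size_t n = fread( buf + len, 1, cap - len - 1, f );
        if ( n == 0 )
            break;
        len += n;
    }

    buf[ len ] = '\0';   /* NUL-terminate whatever was read */
    fclose( f );
    return buf;
}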
