fwrite() - effect of size and count on performance - c

There seems to be a lot of confusion regarding the purpose of the two arguments 'size' and 'count' in fwrite(). I am trying to figure out which will be faster -
fwrite(source, 1, 50000, destination);
or
fwrite(source, 50000, 1, destination);
This is an important decision in my code as this command will be executed millions of times.
Now, I could just jump to testing and use the one which gives better results, but the problem is that the code is intended for MANY platforms.
So,
How can I get a definitive answer to which is better across platforms?
Will implementation logic of fwrite() vary from platform to platform?
I realize there are similar questions (What is the rationale for fread/fwrite taking size and count as arguments?, Performance of fwrite and write size) but do understand that this is a different question regarding the same issue. The answers in similar questions do not suffice in this case.

The performance should not depend on which form you use, because any implementation of fwrite will multiply size and count to determine how much I/O to do.
This is exemplified by FreeBSD's libc implementation of fwrite.c, which in its entirety reads (include directives elided):
/*
 * Write `count' objects (each size `size') from memory to the given file.
 * Return the number of whole objects written.
 */
size_t
fwrite(buf, size, count, fp)
    const void * __restrict buf;
    size_t size, count;
    FILE * __restrict fp;
{
    size_t n;
    struct __suio uio;
    struct __siov iov;

    /*
     * ANSI and SUSv2 require a return value of 0 if size or count are 0.
     */
    if ((count == 0) || (size == 0))
        return (0);

    /*
     * Check for integer overflow. As an optimization, first check that
     * at least one of {count, size} is at least 2^16, since if both
     * values are less than that, their product can't possibly overflow
     * (size_t is always at least 32 bits on FreeBSD).
     */
    if (((count | size) > 0xFFFF) &&
        (count > SIZE_MAX / size)) {
        errno = EINVAL;
        fp->_flags |= __SERR;
        return (0);
    }

    n = count * size;
    iov.iov_base = (void *)buf;
    uio.uio_resid = iov.iov_len = n;
    uio.uio_iov = &iov;
    uio.uio_iovcnt = 1;

    FLOCKFILE(fp);
    ORIENT(fp, -1);
    /*
     * The usual case is success (__sfvwrite returns 0);
     * skip the divide if this happens, since divides are
     * generally slow and since this occurs whenever size==0.
     */
    if (__sfvwrite(fp, &uio) != 0)
        count = (n - uio.uio_resid) / size;
    FUNLOCKFILE(fp);
    return (count);
}

The purpose of the two arguments becomes clearer if you consider the return value, which is the count of objects successfully written/read to/from the stream:
fwrite(src, 1, 50000, dst); // will return 50000
fwrite(src, 50000, 1, dst); // will return 1
The speed might be implementation dependent, although I don't expect any considerable difference.
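To make that concrete, here is a minimal sketch (mine, not from the answers above) of how the return value would typically be checked in each style: with an element size of 1, a short write tells you exactly how many bytes went out, whereas with a single 50000-byte element you only learn whether the whole object was written.
#include <stdio.h>

/* Illustrative only: `src` is assumed to point at 50000 valid bytes and
 * `dst` to an open stream. Writes the buffer twice, once per style, just
 * to show what the return value tells you in each case. */
int demo_both_styles(const unsigned char *src, FILE *dst)
{
    /* Style 1: 50000 elements of size 1. A partial write reports the
     * exact number of bytes that were written. */
    size_t written = fwrite(src, 1, 50000, dst);
    if (written < 50000)
        return -1;                 /* `written` bytes made it out */

    /* Style 2: 1 element of size 50000. The return value is 0 or 1,
     * so a partial write is indistinguishable from no write at all. */
    if (fwrite(src, 50000, 1, dst) != 1)
        return -1;

    return 0;
}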

I'd like to point you to my question, which ended up exposing an interesting performance difference between calling fwrite once and calling fwrite multiple times to write a file "in chunks".
My problem was that there's a bug in Microsoft's implementation of fwrite so files larger than 4GB cannot be written in one call (it hangs at fwrite). So I had to work around this by writing the file in chunks, calling fwrite in a loop until the data was completely written. I found that this latter method always returns faster than the single fwrite call.
I'm on Windows 7 x64 with 32 GB of RAM, which makes write caching pretty aggressive.
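For reference, a chunked-write workaround of the kind described above might look like the sketch below; the 64 MB chunk size and the function name are my own choices, not taken from the linked question.
#include <stdio.h>

/* Write `total` bytes from `buf` to `fp` in fixed-size chunks.
 * Returns 0 on success, -1 on a short write. */
int write_in_chunks(FILE *fp, const unsigned char *buf, size_t total)
{
    const size_t chunk = 64u * 1024u * 1024u;   /* 64 MB per call (arbitrary) */
    size_t done = 0;

    while (done < total) {
        size_t n = total - done;
        if (n > chunk)
            n = chunk;
        if (fwrite(buf + done, 1, n, fp) != n)
            return -1;             /* short write: report failure */
        done += n;
    }
    return 0;
}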

Related

Possible "Integer Overflow or Wraparound" flaw in this C code

Below I have extracted a piece of code from a project over which I ran a static analysis tool looking for security flaws. It flagged this code as being susceptible to an integer overflow/wraparound flaw as explained here:
https://cwe.mitre.org/data/definitions/190.html
Here is the relevant code:
#define STRMAX 16

void bar() {
    // ... mystruct malloc'd elsewhere
    mystruct->string = malloc(STRMAX * sizeof(char));
    memset(mystruct->string, '\0', STRMAX);
    strncpy(mystruct->string, "H3C19H1E4XAA9MQ", STRMAX); // 15 character string
    mystruct->baz = malloc(3 * sizeof(char*));
    memset(mystruct->baz, '\0', 3);
    if (strlen(mystruct->string) > 0) {
        strncpy(mystruct->baz, &(mystruct->string[0]), 2);
    }
    mystruct->quux = malloc(19 * sizeof(char));
    memset(mystruct->quux, '\0', 19);
    for (int i = 0; i < 19; i++) {
        mystruct->quux[i] = 'a';
    }
    foo(mystruct);
}

void foo(Mystruct *mystruct) {
    // this was flagged
    size_t input_len = strlen(mystruct->baz) + strlen(mystruct->quux);
    // this one too but it flows from the first I think.
    char *input = malloc((input_len + 1) * sizeof(char));
    memset(input, '\0', input_len + 1);
    strncpy(input, mystruct->baz, strlen(mystruct->baz));
    strncat(input, mystruct->quux, strlen(mystruct->quux));
    // ...
}
So in other words, there are three members of a struct that are explicitly bounded in one function and then used to create a variable in another one based on the size of the struct's members.
The analyzer flagged the first line of foo in particular. My question is, did it correctly flag this? If so, what would be a simple way to mitigate this? I'm on a platform that doesn't have a BSD-style reallocate function by default.
PS: I realize that strncpy(foo, bar, strlen(bar)) is somewhat frivolous in memory terms.
size_t input_len = strlen(mystruct->baz) + strlen(mystruct->quux); can theoretically wrap around, so the analyser is right: if strlen(mystruct->quux) is larger than SIZE_MAX - strlen(mystruct->baz), the addition wraps around.
If input_len == SIZE_MAX, then input_len + 1 can wrap around as well.
When you add two string lengths, the result can exceed the maximum value of the size_t type. For instance, if SIZE_MAX were 100 (I know this is not realistic, it's just for example purposes), and your strings were 70 and 50 characters long, the total would wrap around to 20, which is not the correct result.
This is rarely a real concern, since the actual maximum is usually very large (the C standard allows it to be as low as 65535, but you're only likely to run into this on microcontrollers), and in practice strings are not even close to the limit, so most programmers would just ignore this problem. Protecting against this overflow gets complicated, although it's possible if you really need to.
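If you do decide to guard against it, a minimal sketch of the check could look like this (the helper name and the choice to report failure via the return value are mine, not from the question):
#include <stdint.h>
#include <string.h>

/* Compute strlen(a) + strlen(b) + 1 (for the terminator) while refusing to
 * wrap around. Returns 0 and sets *out on success, -1 on overflow. */
static int checked_concat_len(const char *a, const char *b, size_t *out)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    if (lb > SIZE_MAX - la)        /* la + lb would wrap */
        return -1;
    if (la + lb > SIZE_MAX - 1)    /* adding the '\0' would wrap */
        return -1;

    *out = la + lb + 1;
    return 0;
}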

Copy part of array to itself?

Basically I'm building a bit reader. Slightly before the buffer has been exhausted, I'd like to copy whatever is left of the array to the beginning of the same array, then zero everything after the copy and fill the rest with data from the input file.
I'm only trying to use the standard library for portability reasons.
Also, I was profiling my bit reader earlier and Instruments says it was taking about 28 milliseconds to do all of this; is it supposed to take that long?
Code removed
I recommend using memmove for the copy. It has a signature and functionality identical to memcpy, except that it is safe for copying between overlapping regions (which is what you're describing).
For the zero-fill, memset is usually adequate. In the rare case where the value you're zeroing isn't represented by all-zero bytes (null pointers on some exotic platforms, for example), you'll need to roll your own loop using assignment, depending upon the type.
For this reason you might want to hide the memmove and memset operations behind abstraction, for example:
#include <string.h>

void copy_int(int *destination, int *source, size_t size) {
    memmove(destination, source, size * sizeof *source);
}

void zero_int(int *seq, size_t size) {
    memset(seq, 0, size * sizeof *seq);
}

int main(void) {
    int array[] = { 0, 1, 2, 3, 4, 5 };
    size_t index = 2,
           size = sizeof array / sizeof *array - index;
    copy_int(array, array + index, size);
    zero_int(array + size, index);
}
Should either memmove or memset become unsuitable for usecases in the future, it'll then be simple to drop in your own copy/zero loops.
As for your strange profiler results, I suppose it might be possible that you're using some archaic (or grossly underclocked) implementation, or trying to copy huge arrays... Otherwise, 28 milliseconds does seem quite absurd. Nonetheless, your profiler should also have shown that this memmove and memset isn't a significant bottleneck in a program that performs actual I/O work, right? The I/O must surely be the bottleneck, right?
If the memmove+memset is indeed a bottleneck, you could try implementing a circular array to avoid the copies. For example, the following code attempts to find needle in the figurative haystack that is input_file...
Otherwise, if the I/O is a bottleneck, there are tweaks that can be applied to reduce that. For example, the following code uses setvbuf to suggest that the underlying implementation attempt to use an underlying buffer to read chunks of the file, despite the code using fgetc to read one character at a time.
#include <stdio.h>
#include <string.h>

void find_match(FILE *input_file, char const *needle, size_t needle_size) {
    /* setvbuf must come before any other operation on the stream. */
    setvbuf(input_file, NULL, _IOFBF, BUFSIZ);
    char input_array[needle_size];
    size_t sz = fread(input_array, 1, needle_size, input_file);
    if (sz != needle_size) {
        // No matches possible
        return;
    }
    unsigned long long pos = 0;
    for (;;) {
        size_t cursor = pos % needle_size;
        int tail_compare = memcmp(input_array, needle + needle_size - cursor, cursor),
            head_compare = memcmp(input_array + cursor, needle, needle_size - cursor);
        if (head_compare == 0 && tail_compare == 0) {
            printf("Match found at offset %llu\n", pos);
        }
        int c = fgetc(input_file);
        if (c == EOF) {
            break;
        }
        input_array[cursor] = (char)c;
        pos++;
    }
}
Notice how there's no memmove (or zeroing, FWIW) necessary here? We simply operate as though the start of the window is at cursor and the end is at cursor - 1, and we wrap by modulo needle_size to ensure there's no overflow/underflow. Then, after each insertion we simply increment pos, which advances the cursor...

Move a pointer location around to write a recursively allocated buffer

Apologies if the title doesn't make sense, I've been staring at my monitor for 15 minutes trying to come up with one.
I'm using a library function from a C API (in 64-bit Xubuntu 14.04) to move a set number of int16_t values into a buffer and repeat it a set number of times, described here in (sort of) pseudo-code:
int16_t *buffer = calloc(total_values_to_receive, 2 * sizeof(samples[0]));
while (!done) {
    receive_into_buffer(buffer, num_values_to_receive_per_pass);
    fwrite(buffer, 2 * sizeof(samples[0]), num_values_to_receive_per_pass, file);
    values_received += num_values_to_receive_per_pass;
    if (values_received == total_values_to_receive) {
        done = true;
    }
}
Basically, it receives a set number of values on each pass and writes those values to a file; note that the same file is appended to on each pass. E.g. if total_values_to_receive = 100 and num_values_to_receive_per_pass = 10, there would be 10 passes in total.
What I would like to do, mainly to increase speed, is to have the write happen once, after all passes have been completed. The library function prototype contains void* samples, size_t num_samples, which, you guessed it, refer to the buffer where the samples are to be stored and the number of samples to store.
I'm not completely confident with pointers, but is there a way to write into the buffer on one pass, and then move the pointer forward by num_values_to_receive_per_pass so that the next time the library function is called, it appends to that buffer (so to speak)? Then the pointer could be moved back to the start of the buffer and fwrite could be called once to write the total number of values to the file.
Does that make sense? Any tips on how to actually implement it?
Thanks in advance.
Assuming the second argument of receive_into_buffer is in units of buffer elements, not bytes, the following should work:
int16_t *buffer = calloc(total_values_to_receive, 2 * sizeof(buffer[0]));
int16_t *temp = buffer;
while (!done) {
    receive_into_buffer(temp, num_values_to_receive_per_pass);
    temp += 2 * num_values_to_receive_per_pass;
    values_received += num_values_to_receive_per_pass;
    if (values_received == total_values_to_receive) {
        done = true;
    }
}
fwrite(buffer, 2 * sizeof(buffer[0]), total_values_to_receive, file);
Style: a plain for() loop avoids the indicator variable:
for (values_received = 0; values_received < total_values_to_receive; values_received += num_values_to_receive_per_pass) {
    receive_into_buffer(buffer, num_values_to_receive_per_pass);
    fwrite(buffer, 2 * sizeof(samples[0]), num_values_to_receive_per_pass, file);
}

Very Slow Data Processing

Consider the following code that loads a dataset of records into a buffer and creates a Record object for each record. A record consists of one or more columns, and this information is discovered at run-time. However, in this particular example, I have set the number of columns to 3.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned int uint;

typedef struct
{
    uint *data;
} Record;

Record *createNewRecord(short num_cols);

int main(int argc, char *argv[])
{
    time_t start_time, end_time;
    int num_cols = 3;
    char *relation;
    FILE *stream;
    int offset;
    char *filename = "file.txt";

    stream = fopen(filename, "r");
    fseek(stream, 0, SEEK_END);
    long fsize = ftell(stream);
    fseek(stream, 0, SEEK_SET);

    if (!(relation = (char *) malloc(sizeof(char) * (fsize + 1))))
        printf("Could not allocate buffer");
    fread(relation, sizeof(char), fsize, stream);
    relation[fsize] = '\0';
    fclose(stream);

    char *start_ptr = relation;
    char *end_ptr = (relation + fsize);

    while (start_ptr < end_ptr)
    {
        Record *new_record = createNewRecord(num_cols);
        for (short i = 0; i < num_cols; i++)
        {
            sscanf(start_ptr, " %u %n",
                   &(new_record->data[i]), &offset);
            start_ptr += offset;
        }
    }
    return 0;
}

Record *createNewRecord(short num_cols)
{
    Record *r;
    if (!(r = (Record *) malloc(sizeof(Record))) ||
        !(r->data = (uint *) malloc(sizeof(uint) * num_cols)))
    {
        printf("Failed to create a new record\n");
    }
    return r;
}
This code is highly inefficient. My dataset contains around 31 million records (~1 GB) and this code processes only ~200 records per minute. The reason I load the dataset into a buffer is that I'll later have multiple threads process the records in this buffer, and hence I want to avoid file accesses. Moreover, I have 48 GB of RAM, so the dataset in memory should not be a problem. Any ideas on how to speed things up?
SOLUTION: the sscanf function was actually extremely slow and inefficient. When I switched to strtoul, the job finishes in less than a minute. Malloc-ing ~3 million structs of type Record took only a few seconds.
I'm confident that lurking non-numeric data exists in the file.
int offset;
...
sscanf(start_ptr, " %u %n", &(new_record->data[i]), &offset);
start_ptr += offset;
Notice that if the file begins with non-numeric input, offset is never set, and if it happened to hold the value 0, start_ptr += offset; would never advance.
If non-numeric data exists later in the file, like "3x", offset will get the value of 1 and then never be updated, so the while loop proceeds very slowly, one character at a time.
Best to check results of fread(), ftell() and sscanf() for unexpected return values and act accordingly.
Further: long fsize may be too small a type for the size. Look to using fgetpos() and fsetpos().
Note: to save processing time, consider using strtoul(), as it is certainly faster than sscanf(" %u %n"). Again, check for errant results.
BTW: if code needs to use sscanf(), use sscanf("%u%n"); it is a tad faster and, for your code, has the same functionality.
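Building on the strtoul() suggestion, a minimal sketch of what the parsing loop could look like (the helper name and the error handling are illustrative, not taken from the question):
#include <stdlib.h>

/* Parse `num_cols` unsigned integers from *cursor into out[], advancing
 * *cursor past what was consumed. Returns 0 on success, -1 if a field is
 * not numeric. strtoul() itself skips leading whitespace. */
static int parse_record(char **cursor, unsigned int *out, int num_cols)
{
    for (int i = 0; i < num_cols; i++) {
        char *end;
        unsigned long value = strtoul(*cursor, &end, 10);
        if (end == *cursor)        /* no digits: bad or exhausted input */
            return -1;
        out[i] = (unsigned int)value;
        *cursor = end;
    }
    return 0;
}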
I'm not an optimization professional, but I think some tips should help.
First of all, I suggest you use filename and num_cols as macros, because they tend to be faster as literals and I don't see you changing their values in the code.
Second, using a struct to store only one member is generally not recommended, but if you want to use it with functions you should only pass pointers. Since you're calling malloc once for the struct and again for its only member, you're doing twice the allocations you need, and I suppose that is one reason it is slow. This might not be the case with some compilers, however. Practically, using a struct with only one member is pointless; if you want to make clear that the integers you get are specifically a record, you can typedef it.
You should also make end_ptr and fsize const for some optimization.
Now, as for functionality, have a look at memory-mapped I/O.
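Memory-mapped I/O is not standard C, so this part is platform-dependent; on a POSIX system, a read-only mapping of the whole file could be set up roughly like this sketch (the function name is mine, and error handling is minimal):
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only and return a pointer to its bytes, or NULL
 * on failure. On success, *len_out holds the file size in bytes. */
static const char *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping remains valid after close */
    if (data == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return data;
}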

How to make this sorting program in C much faster for the large input sets

This sort code fails for very large input files because it takes too long to finish.
rewind(ptr);
j = 0;
while ((fread(&temp, sizeof(temp), 1, ptr) == 1) && (j != lines - 1))  // read object by object
{
    i = j + 1;
    while (fread(&temp1, sizeof(temp), 1, ptr) == 1)  // read next object, to compare previous object with next object
    {
        if (temp.key > temp1.key)                     // compare key value of object
        {
            temp2 = temp;       // if you don't want to change records and just want to change keys use three statements temp2.key = temp.key;
            temp = temp1;
            temp1 = temp2;
            fseek(ptr, j * sizeof(temp), 0);          // move stream to overwrite
            fwrite(&temp, sizeof(temp), 1, ptr);      // you can avoid above swap by changing &temp to &temp1
            fseek(ptr, i * sizeof(temp), 0);          // move stream to overwrite
            fwrite(&temp1, sizeof(temp), 1, ptr);     // you can avoid above swap by changing &temp1 to &temp
        }
        i++;
    }
    j++;
    fseek(ptr, j * sizeof(temp), 0);
}
Any idea on how to make this C code much faster? Also would using qsort() (predefined in C) be much faster and how should be applied to the above code?
You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).
Your code here is basically a disk-based bubble sort, with O(N^2) complexity, and poor time performance because it is manipulating file buffers and disk. A bubble sort is a bad choice at the best of times — simple, yes, but slow.
The basic ways to speed up sorting programs are:
If possible, read all the data into memory, sort in memory, and write the result out.
If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file. Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.
You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code out-performs the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).
You could look at the external sort program described in Software Tools, though it is a bit esoteric in that it is written in 'RatFor' or Rational Fortran. The design, though, is readily transferrable to other languages.
Yes, by all means, use qsort(). Use it either as SpiderPig suggests, by reading the whole file into memory, or as the in-memory sort for runs that do fit into memory when preparing for a merge sort. Don't worry about the worst-case performance. A decent implementation will take the median of (first, last, middle) to get fast sorting for the already-sorted and reverse-order pathological cases, plus better average performance in the random case.
This all-in-memory example shows you how to use qsort:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct record_tag
{
    int key;
    char data[12];
} record_type, *record_ptr;

const record_type *record_cptr;

void create_file(const char *filename, int n)
{
    record_type buf;
    int i;
    FILE *fptr = fopen(filename, "wb");

    for (i = 0; i < n; ++i)
    {
        buf.key = rand();
        snprintf(buf.data, sizeof buf.data, "%d", buf.key);
        fwrite(&buf, sizeof buf, 1, fptr);
    }
    fclose(fptr);
}

/* Key comparison function used by qsort(): */
int compare_records(const void *x, const void *y)
{
    const record_ptr a = (const record_ptr)x;
    const record_ptr b = (const record_ptr)y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Read an input file of (record_type) records, sort by key field, and write to the output file */
void sort_file(const char *ifname, const char *ofname)
{
    const size_t MAXREC = 10000;
    int n;
    FILE *ifile, *ofile;
    record_ptr buffer;

    ifile = fopen(ifname, "rb");
    buffer = (record_ptr) malloc(MAXREC * sizeof *buffer);
    n = fread(buffer, sizeof *buffer, MAXREC, ifile);
    fclose(ifile);

    qsort(buffer, n, sizeof *buffer, compare_records);

    ofile = fopen(ofname, "wb");
    fwrite(buffer, sizeof *buffer, n, ofile);
    fclose(ofile);
}

void show_file(const char *fname)
{
    record_type buf;
    int n = 0;
    FILE *fptr = fopen(fname, "rb");

    while (1 == fread(&buf, sizeof buf, 1, fptr))
    {
        printf("%9d : %-12s\n", buf.key, buf.data);
        ++n;
    }
    printf("%d records read", n);
}

int main(void)
{
    srand(time(NULL));
    create_file("test.dat", 99);
    sort_file("test.dat", "test.out");
    show_file("test.out");
    return 0;
}
Notice the compare_records function. The qsort() function needs a comparison function that accepts void pointers, so those pointers must be cast to the correct type. Then the pattern:
(left > right) - (left < right)
...will return 1 if the left argument is greater, 0 if they are equal, or -1 if the right argument is greater.
This could be improved. First, there is absolutely no error checking. That's not sensible in production code. Second, you could examine the input file to get the file size instead of guessing that it's less than some MAXxxx value. One way to do that is to use ftell. (Follow the link for a file size example.) Then use that value to allocate a single buffer, just big enough to qsort the data.
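A sketch of that sizing step, reusing record_type from the example above (the helper name is mine; error checking is still omitted):
#include <stdio.h>

/* Count the record_type records in an open, seekable binary stream using
 * fseek/ftell, instead of guessing an upper bound like MAXREC. */
size_t count_records(FILE *fp)
{
    fseek(fp, 0, SEEK_END);
    long fsize = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    return (size_t)fsize / sizeof(record_type);
}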
If there is not enough room (if the malloc returns NULL) then you can fall back on sorting chunks (with qsort, as in the snippet) that do fit into memory, writing them to separate temporary files, and then merging them into a single output file. That's more complicated, and rarely done since there are sort/merge utility programs designed specifically for sorting large files.
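For completeness, a bare-bones sketch of the merge step for two already-sorted temporary files of record_type records (the function name is mine, and error checking is omitted as in the example above):
/* Merge two files of record_type records, each already sorted by key,
 * into one sorted output file. */
void merge_files(const char *in1, const char *in2, const char *out)
{
    FILE *a = fopen(in1, "rb");
    FILE *b = fopen(in2, "rb");
    FILE *o = fopen(out, "wb");
    record_type ra, rb;
    int have_a = (fread(&ra, sizeof ra, 1, a) == 1);
    int have_b = (fread(&rb, sizeof rb, 1, b) == 1);

    while (have_a && have_b)
    {
        if (ra.key <= rb.key)
        {
            fwrite(&ra, sizeof ra, 1, o);
            have_a = (fread(&ra, sizeof ra, 1, a) == 1);
        }
        else
        {
            fwrite(&rb, sizeof rb, 1, o);
            have_b = (fread(&rb, sizeof rb, 1, b) == 1);
        }
    }
    while (have_a)                 /* drain whichever input remains */
    {
        fwrite(&ra, sizeof ra, 1, o);
        have_a = (fread(&ra, sizeof ra, 1, a) == 1);
    }
    while (have_b)
    {
        fwrite(&rb, sizeof rb, 1, o);
        have_b = (fread(&rb, sizeof rb, 1, b) == 1);
    }
    fclose(a);
    fclose(b);
    fclose(o);
}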
