I'm trying to implement external sorting in C.
I have to read N integers (N is fixed, depending on main memory) from a file initially so that I can apply quicksort to them and then continue with the merging process.
I can think of these two ways:
read N integers one by one from the file, put them in an array, and then sort them;
read a bulk of data into a big char array and then read integers from it using sscanf.
The first method is clearly slow and the second uses a lot of extra memory (but we have limited main memory).
Is there a better way?
Don't try to be cleverer than your OS; it probably supports some clever memory-management functions that will make your life easier and your code faster.
Assuming you are using a POSIX compliant operating system, then you can use mmap(2).
Map your file into memory with mmap
Sort it
Sync it
This way the OS will handle swapping data out when room is tight and swapping it back in when you need it.
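For illustration, a minimal sketch of this approach for a file that holds raw binary ints (not text) might look like the following; the file name "data.bin" and the bare-bones error handling are assumptions for the sketch, not part of the question:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int fd = open("data.bin", O_RDWR);   /* hypothetical file of raw binary ints */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    size_t n = st.st_size / sizeof(int);
    int *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    qsort(p, n, sizeof(int), cmp_int);   /* sort the mapped region in place */
    msync(p, st.st_size, MS_SYNC);       /* sync the sorted data back to the file */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}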
Since stdio file operations are buffered, you won't really need to worry about the first option, especially if the file isn't huge. Remember you're not operating directly on the file, but on a buffered representation of that file in memory.
For example, if you scan in one number at a time, the system will read in a much larger section from the file (on my system it's 4096 bytes, or the entire file if it's shorter).
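As a rough sketch of option 1 under that assumption (read_run and the run size N are made-up names for illustration, not from the question), reading one integer at a time stays cheap because fscanf pulls from stdio's buffer rather than from the disk on every call:

#include <stdio.h>

#define N 100000   /* run size chosen to fit in main memory (assumption) */

/* Read up to n integers from fp into buf; returns how many were actually read. */
size_t read_run(FILE *fp, int *buf, size_t n)
{
    size_t count = 0;
    while (count < n && fscanf(fp, "%d", &buf[count]) == 1)
        count++;
    return count;
}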
You can use the function below to read ints from the file one by one and continue sorting and merging on the go.
The function takes the file name and the index of the integer as arguments and returns the int at that position in the file.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the index-th int stored in the file, or a negative value on error / end of file. */
int read_int(const char *file_name, int index)
{
    ssize_t err;
    int num = 0;
    int fd = open(file_name, O_RDONLY);

    if (fd < 0)
    {
        printf("error opening file\n");
        return fd;
    }
    err = pread(fd, &num, sizeof(int), (off_t)index * sizeof(int));
    close(fd);
    if (err < (ssize_t)sizeof(int))
    {
        printf("End of file reached\n");
        return -1;
    }
    return num;
}
Sorting at the same time you read is the best way, and saving your data into a linked list instead of an array is more efficient for the insertion step.
You can use fscanf() to read the file integer by integer, and sort at the moment you read each one. I mean: when you read an integer from the file, put it into the array at the right place, so that the array is already sorted when you finish reading.
The following example reads the file integer by integer and inserts each value in sorted order as it is read. The integers are saved into an array, not a linked list.
void sort_insert(int x, int *array, int len)
{
    int i = 0, j;

    for (i = 0; i < (len - 1); i++)
    {
        if (x > array[i])
            continue;
        for (j = (len - 1); j > i; j--)
            array[j] = array[j - 1];
        break;
    }
    array[i] = x;
}
#include <stdio.h>

int main(void)
{
    int x, i;
    int len = 0;
    int array[50];
    FILE *fp = fopen("myfile.txt", "r");

    if (fp == NULL)
        return 1;
    while (len < 50 && fscanf(fp, " %d", &x) > 0)
    {
        len++;
        sort_insert(x, array, len);
    }
    fclose(fp);
    for (i = 0; i < len; i++)
    {
        printf("array[%d] = %d\n", i, array[i]);
    }
    return 0;
}
I'm making a program that reads two sets of data (float) from two different .txt files, and then it transfers these data to two different arrays, which will be used in further calculations. However, when I try to use dynamic allocation more than once, something goes wrong and the data seem not to be stored in the array.
The following simplified program seems to be working fine:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    float *VarA;
    int n = 0;
    int *counter;
    int i;
    FILE *input1;

    input1 = fopen("C:\\Users\\...test.txt","r");
    VarA = (float*)calloc(20001, sizeof(float));
    for(i = 0; i < 20001; i++)
    {
        fscanf(input1,"%f",&VarA[i]);
        printf("%f\n",VarA[i]);
    }
    free(VarA);
    fclose(input1);
    return 0;
}
it successfully shows the data stored in the array VarA. However, if I introduce a new array to count the number of lines in the file (which is necessary for my further calculations), I just get the value 0.000000 from every array element:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    float *VarA;
    int n = 0;
    int *counter;
    int i;
    FILE *input1;

    input1 = fopen("C:\\Users\\...test.txt","r");
    counter = (int*)calloc(100000, sizeof(int));
    while(fscanf(input1,"%f",&counter[n]) != EOF)
    {
        n++;
    }
    free(counter);
    printf("n = %i\n", n);
    VarA = (float*)calloc(n, sizeof(float));
    for(i = 0; i < n; i++)
    {
        fscanf(input1,"%f",&VarA[i]);
        printf("%f\n",VarA[i]);
    }
    free(VarA);
    fclose(input1);
    return 0;
}
I know that I can avoid using another array to count the number of lines. The point is that every time I use another array, for any purpose, I get the same result. For instance, if I don't use an array to count the number of lines, but I make another one to store my other set of data, one of these arrays just won't present the data after the reading. I tried to modify my program several times in order to find the source of such behavior, but without success.
(At least) two major problems: first,
counter = (int*)calloc(100000, sizeof(int));
while(fscanf(input1,"%f",&counter[n]) != EOF) {
    n++;
}
free(counter);
is basically saying "Grab me a chunk of memory, fill it with data as I read the file, then throw it away without ever using it." Probably not what you intended. Then,
VarA = (float*)calloc(n, sizeof(float));
for (i = 0; i < n; i++) {
    fscanf(input1,"%f",&VarA[i]);
    printf("%f\n",VarA[i]);
}
free(VarA);
which says, "Grab a big chunk of memory, then read data from after the end of the file I just read everything from, put it there, then throw it away."
If you want to read the data from the same file again, you'll have to close it and reopen it (or "seek" to the start). And if you want to do anything with it, you'll have to do it before free()ing the memory you loaded it into.
counter = (int*)calloc(100000, sizeof(int));
//        ^--- `int*`                 ^--- `int`
//                       v--- `int` pointer
while(fscanf(input1,"%f",&counter[n]) != EOF)
//                   ^--- `float` designator
Do you see any discrepancies here? Your code allocates ints, then passes a pointer to those ints to fscanf telling it they're floats (using the %f designator). According to the C standard draft n1570, section 7.21.6.2p10 this constitutes undefined behaviour:
If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
My suggestion would be to use the * assignment suppression modifier here, for example:
while (fscanf(input1, "%*f") != EOF) n++;
or, alternatively
while (fscanf(input1, "%f", &(float){0}) == 1) n++;
Note also how I've changed the check from EOF to 1. You can find more information about the return values of fscanf here (which you really should read before using any scanf-related function... and stop guessing, because guessing in C can be harmful).
Additionally, you need to rewind your file once it reaches EOF, otherwise every call to fscanf following this loop will return EOF:
rewind(input1);
P.S.: Don't cast malloc in C. This goes for calloc and realloc, too. Many of the functions quoted here have Open Group manual pages of their own; I'll leave finding (and reading) them as an exercise.
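Putting those pieces together, a corrected version of the two passes might look roughly like this (a sketch only, assuming the same whitespace-separated float layout as test.txt; error handling kept minimal):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *input1 = fopen("C:\\Users\\...test.txt", "r");
    if (input1 == NULL)
        return 1;

    /* First pass: count the values without storing them anywhere. */
    int n = 0;
    while (fscanf(input1, "%*f") != EOF)
        n++;

    /* Second pass: go back to the start and actually read them. */
    rewind(input1);
    float *VarA = calloc(n, sizeof *VarA);   /* no cast needed in C */
    if (VarA == NULL) {
        fclose(input1);
        return 1;
    }
    for (int i = 0; i < n; i++) {
        if (fscanf(input1, "%f", &VarA[i]) != 1)
            break;
        printf("%f\n", VarA[i]);
    }

    free(VarA);
    fclose(input1);
    return 0;
}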
I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.
There are two types of files in use. The first type consists of a stream of 16-bit integers, which must be scaled after I/O to convert them to physically meaningful floating point values. I read the file in chunks, and within each chunk I read one 16-bit code at a time, perform the required scaling, and then store the result in an array. Code is below:
int64_t read_current_chimera(FILE *input, double *current,
                             int64_t position, int64_t length, chimera *daqsetup)
{
    int64_t test;
    uint16_t iv;
    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(&iv, sizeof(uint16_t), 1, input);
        if (test == 1)
        {
            read++;
            current[i] = chimera_gain(iv, daqsetup);
        }
        else
        {
            perror("End of file reached");
            break;
        }
    }
    return read;
}
The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.
The second file type contains 64-bit doubles, but it contains two columns, of which I only need the first. To do this I fread pairs of doubles and discard the second one. The double must also be endian-swapped before use. The code I use to do this is below:
int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
    int64_t test;
    double iv[2];
    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(iv, sizeof(double), 2, input);
        if (test == 2)
        {
            read++;
            swapByteOrder((int64_t *)&iv[0]);
            current[i] = iv[0];
        }
        else
        {
            perror("End of file reached: ");
            break;
        }
    }
    return read;
}
Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?
First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, you have a lot of overhead from the sheer number of fread calls. As the files are large, there will be a big benefit to increasing the amount of data you read per I/O call.
Convince yourself of this by putting together two small programs that read the stream.
1) read it as you do in the example above, two doubles at a time.
2) read it the same way, but make it 10,000 doubles at a time.
Time both runs a few times, and odds are you will observe that #2 runs much faster.
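As a sketch of what that looks like for the 16-bit reader (read_current_chimera_block and the 4096-element block size are made up for illustration; the chimera_gain call and surrounding types follow the code in the question):

#define BLOCK 4096   /* arbitrary block size, an assumption for this sketch */

int64_t read_current_chimera_block(FILE *input, double *current,
                                   int64_t position, int64_t length, chimera *daqsetup)
{
    uint16_t block[BLOCK];
    int64_t done = 0;

    if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
        return 0;

    while (done < length)
    {
        size_t want = (size_t)((length - done < BLOCK) ? (length - done) : BLOCK);
        size_t got = fread(block, sizeof(uint16_t), want, input);   /* one fread per block */

        for (size_t i = 0; i < got; i++)
            current[done + (int64_t)i] = chimera_gain(block[i], daqsetup);
        done += (int64_t)got;

        if (got < want)
            break;   /* short read: EOF or error */
    }
    return done;
}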
Best of luck.
WHAT THE CODE DOES: I read a binary file and sort it. I use a frequency array in order to do so.
UPDATE: it does do the loop, but it doesn't write the numbers correctly...
That is the code. I want to write to the file after reading from it. I will overwrite what is already written, and that is okay. The problem is I have no error on compiling, but at run time it seems I have an infinite loop.
The file is binary. I read it byte by byte, and that's also the way I want to write it.
while(fread(readChar, sizeof(readChar)/2, 1, inFile)){
    bit = atoi(readChar);
    array[bit] = array[bit] + 1;
}
fseek(inFile, 0, SEEK_SET);
for( i = 0; i < 256; i++)
    while(array[i] > 0){
        writeChar[0] = array[i]; //do I correctly convert int to char?
        fwrite(writeChar, sizeof(readChar)/2, 1, inFile);
        array[i] = array[i] - 1;
    }
The inFile file declaration is:
FILE* inFile = fopen (readFile, "rb+");
It reads from the file, but does not write!
Undefined behavior:
fread() is used to read a binary representation of data. atoi() takes a textual representation of data: a string (a pointer to an array of char that is terminated with a '\0').
Unless the data read into readChar has one of its bytes set to 0, calling atoi() may access data outside readChar.
fread(readChar, sizeof(readChar)/2, 1, inFile);
bit = atoi(readChar);
The code is not reading data "bit by bit". As @Jens comments: "The smallest unit is a byte," and that is at least 8 bits.
The only possible reason for an infinite loop that I see is that your array is not initialized.
After declaration with:
int array[256];
the elements can have any integer value, including very large ones.
So there are no truly infinite loops, but some loops can have a very large number of iterations.
You should initialize your array with zeros:
int array[256]={0};
I don't know the count of elements in your array or whether this is the way you declare it, but if you declare your array as shown, then ={0} will initialize all members with 0. You can also use a loop:
for(int i=0; i < COUNT_OF_ELEMENTS;i++) array[i] = 0;
EDIT: I forgot to mention that your code is only able to sort files that contain nothing but digit characters.
For that, you have also to change the conversion while writing:
char writeChar[2] = {0};

for (int i = 0; i < 256; i++)
    while (array[i] > 0) {
        _itoa(i, writeChar, 10);
        fwrite(writeChar, sizeof(char), 1, inFile);
        array[i] = array[i] - 1;
    }
File content before:
12345735280735612385478504873457835489
File content after:
00112223333334444455555556777778888889
Is that what you want?
This sort code fails for very large input files because it takes too long to finish.
rewind(ptr);
j = 0;
while ((fread(&temp, sizeof(temp), 1, ptr) == 1) && (j != lines - 1)) //read object by object
{
    i = j + 1;
    while (fread(&temp1, sizeof(temp), 1, ptr) == 1) //read next object, to compare previous object with next object
    {
        if (temp.key > temp1.key) //compare key value of object
        {
            temp2 = temp;   //if you don't want to change records and just want to change keys use three statements temp2.key = temp.key;
            temp = temp1;
            temp1 = temp2;
            fseek(ptr, j * sizeof(temp), 0);      //move stream to overwrite
            fwrite(&temp, sizeof(temp), 1, ptr);  //you can avoid above swap by changing &temp to &temp1
            fseek(ptr, i * sizeof(temp), 0);      //move stream to overwrite
            fwrite(&temp1, sizeof(temp), 1, ptr); //you can avoid above swap by changing &temp1 to &temp
        }
        i++;
    }
    j++;
    fseek(ptr, j * sizeof(temp), 0);
}
Any idea on how to make this C code much faster? Also, would using qsort() (predefined in C) be much faster, and how should it be applied to the above code?
You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).
Your code here is basically a disk-based bubble sort, with O(N²) complexity, and poor time performance because it is manipulating file buffers and disk. A bubble sort is a bad choice at the best of times — simple, yes, but slow.
The basic ways to speed up sorting programs are:
If possible, read all the data into memory, sort in memory, and write the result out.
If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file. Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.
You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code out-performs the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).
You could look at the external sort program described in Software Tools, though it is a bit esoteric in that it is written in 'RatFor' or Rational Fortran. The design, though, is readily transferrable to other languages.
Yes, by all means, use qsort(). Use it either as SpiderPig suggests by reading the whole file into memory, or as the in-memory sort for runs that do fit into memory preparing for a merge sort. Don't worry about the worst-case performance. A decent implementation will take the median of (first, last, middle) to get fast sorting for the already-sorted and reverse-order pathological cases, plus better average performance in the random case.
This all-in-memory example shows you how to use qsort:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct record_tag
{
    int key;
    char data[12];
} record_type, *record_ptr;

const record_type *record_cptr;

void create_file(const char *filename, int n)
{
    record_type buf;
    int i;
    FILE *fptr = fopen(filename, "wb");

    for (i = 0; i < n; ++i)
    {
        buf.key = rand();
        snprintf(buf.data, sizeof buf.data, "%d", buf.key);
        fwrite(&buf, sizeof buf, 1, fptr);
    }
    fclose(fptr);
}

/* Key comparison function used by qsort(): */
int compare_records(const void *x, const void *y)
{
    const record_ptr a = (const record_ptr)x;
    const record_ptr b = (const record_ptr)y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Read an input file of (record_type) records, sort by key field, and write to the output file */
void sort_file(const char *ifname, const char *ofname)
{
    const size_t MAXREC = 10000;
    int n;
    FILE *ifile, *ofile;
    record_ptr buffer;

    ifile = fopen(ifname, "rb");
    buffer = (record_ptr) malloc(MAXREC * sizeof *buffer);
    n = fread(buffer, sizeof *buffer, MAXREC, ifile);
    fclose(ifile);

    qsort(buffer, n, sizeof *buffer, compare_records);

    ofile = fopen(ofname, "wb");
    fwrite(buffer, sizeof *buffer, n, ofile);
    fclose(ofile);
}

void show_file(const char *fname)
{
    record_type buf;
    int n = 0;
    FILE *fptr = fopen(fname, "rb");

    while (1 == fread(&buf, sizeof buf, 1, fptr))
    {
        printf("%9d : %-12s\n", buf.key, buf.data);
        ++n;
    }
    printf("%d records read", n);
}

int main(void)
{
    srand(time(NULL));
    create_file("test.dat", 99);
    sort_file("test.dat", "test.out");
    show_file("test.out");
    return 0;
}
Notice the compare_records function. The qsort() function needs a function that accepts void pointers, so those pointers must be cast to the correct type. Then the pattern:
(left > right) - (left < right)
...will return 1 if the left argument is greater, 0 if they are equal or -1 if the right argument is greater.
This could be improved. First, there is absolutely no error checking. That's not sensible in production code. Second, you could examine the input file to get the file size instead of guessing that it's less than some MAXxxx value. One way to do that is to use ftell. (Follow the link for a file size example.) Then, use that value to allocate a single buffer, just big enough to qsort the data.
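For instance, a small helper along these lines (a sketch; file_size is a made-up name, and it assumes the file size is an exact multiple of sizeof(record_type)) could replace the MAXREC guess:

/* Return the size of an open binary file in bytes and leave the stream at the start. */
long file_size(FILE *fp)
{
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    return size;
}

/* inside sort_file():
   long nrec = file_size(ifile) / (long)sizeof(record_type);
   record_ptr buffer = malloc(nrec * sizeof *buffer);            */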
If there is not enough room (if the malloc returns NULL) then you can fall back on sorting chunks (with qsort, as in the snippet) that do fit into memory, writing them to separate temporary files, and then merging them into a single output file. That's more complicated, and rarely done since there are sort/merge utility programs designed specifically for sorting large files.
I have a simulation program written in C and I need to create random numbers and write them to a txt file. The program only stops
- when a random number already generated is generated again or
- 1 billion random number are generated (no repetition)
My problem is that I could not search for an already generated long int random number in the txt file!
Text file format is:
9875
764
19827
2332
...
Any help is appreciated..
FILE *out;

int checkNumber(long int num){
    char line[512];
    long int number;
    int result = 0;

    if((out = fopen("out.txt","r")) == NULL){
        result = 1;
    }
    char buf[10];
    itoa(num, buf, 10);
    while(fgets(line, 512, out) != NULL)
    {
        if((strstr(line, buf)) != NULL){
            result = 0;
        }
    }
    if(out) {
        fclose(out);
    }
    return result;
}
int main(){
    int seed;
    long int nRNs = 0;
    long int numberGenerated;

    out = fopen("out.txt","w");
    nRNs = 0;
    seed = 12345;
    srand(seed);
    fprintf(out,"%d\n",numberGenerated);
    while( nRNs != 1000000000 )
    {
        numberGenerated = rand();
        nRNs++;
        if(checkNumber(numberGenerated) == 0){
            fclose(out); break; system("pause");
        }
        else{
            fprintf(out,"%d\n",numberGenerated);
        }
    }
    fclose(out);
}
If the text file only contains randomly generated numbers separated by spaces, then you need the strtok() function (google its usage) and can throw the values into a binary tree structure as mentioned by @jacekmigacz. But in any circumstance, you will have to search the whole file at least once. Then ftell() the value to get the location you've searched up to in the file. When another number is generated you can use fseek() to get to the latest number. Remember to get the data line by line with fgets().
Take care of the memory requirements and use malloc() judiciously
Try a tree (data structure).
Searching linearly through the text file every time is gonna take forever with so many numbers. You could hold every number generated so far sorted in a data structure so that you can do a binary search for a duplicate. This is going to need a lot of RAM though. For 1 billion integers that's already 4GB on a system with 32-bit integers, and you'll need several more for the data structure overhead. My estimate is around 16GB in the worst case scenario (where you actually get to 1 billion unique integers.)
If you don't have a memory monster machine, you should instead write the data structure to a binary file and do the binary search there. Though that's still gonna be quite slow.
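A rough sketch of the in-memory version (names like seen_or_insert and cmp_long are made up for illustration): keep the numbers seen so far in a sorted array and use bsearch() for the duplicate check; insertion still costs a shift, which is what a balanced tree would avoid.

#include <stdlib.h>
#include <string.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if num was already seen; otherwise inserts it (keeping the array sorted) and returns 0.
   The caller must ensure the array has room for one more element. */
int seen_or_insert(long *sorted, size_t *count, long num)
{
    if (bsearch(&num, sorted, *count, sizeof *sorted, cmp_long) != NULL)
        return 1;

    size_t i = 0;
    while (i < *count && sorted[i] < num)
        i++;
    memmove(&sorted[i + 1], &sorted[i], (*count - i) * sizeof *sorted);
    sorted[i] = num;
    (*count)++;
    return 0;
}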
This may work, or you can approach it like this (slow, but it will work):
int new_rand = rand();
static int counter = 0;
FILE *fptr = fopen("txt", "a+");
int i, j = 0;
int c;
char buf[16];

/* scan the existing numbers (space-separated text) for a duplicate */
while ((c = getc(fptr)) != EOF)
{
    buf[j++] = (char)c;
    if (c == ' ')
    {
        buf[--j] = '\0';
        i = atoi(buf);
        if (i == new_rand)
            return;        /* already present, stop */
        j = 0;
    }
}
/* not found: append the new number in the same text format */
if (counter < 1000000)
{
    fprintf(fptr, "%d ", new_rand);
    counter++;
}
Don't open and scan your file to checkNumber(). You'll be waiting forever.
Instead, keep your generated numbers in memory using a bit set data structure and refer to that.
Your bit set will need to be large enough to indicate every 32-bit integer, so it'll consume 2^32 / 8 bytes (or 512MiB) of memory. This may seem like a lot, but it's much smaller than storing 1,000,000,000 32-bit integers directly (4GB). Also, both checking and updating will be done in constant time.
Edit: The wikipedia link doesn't do much to explain how to code one, so here's a rough sample: (There're faster ways of writing this, e.g.: using bit shifts instead of division, but this should be easier to understand.)
int checkNumberOrUpdate(char *bitSet, long int num){
    unsigned char b = 1u << (num % 8);
    long int w = num / 8;

    if (bitSet[w] & b) {      /* bit already set: number seen before */
        return 1;
    }
    bitSet[w] |= b;           /* mark the number as seen */
    return 0;
}
Note, bitSet needs to be calloc()d to the right size from your main function.
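For example, the calling side might look roughly like this (a sketch only; it assumes rand() values fit in 32 bits and reuses the out stream from the question):

/* 2^32 bits / 8 = 512 MiB, zero-initialized so every number starts out "unseen" */
char *bitSet = calloc(1UL << 29, 1);
if (bitSet == NULL)
    return 1;

for (long int nRNs = 0; nRNs < 1000000000L; nRNs++) {
    long int numberGenerated = rand();
    if (checkNumberOrUpdate(bitSet, numberGenerated))
        break;   /* duplicate seen before: stop */
    fprintf(out, "%ld\n", numberGenerated);
}
free(bitSet);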