how to sort a lot of data in c? - c

at the moment i am trying to write a unreal amount of data out to files,
basically i generate a new struct of data and write it out to file untill the file becomes 1gb big and this occurs for 6 files of 1gb each, the structs are small. 8 bytes long with two 2 variables id and amount
when i generate my data, the structs are created and written to file in the order of amount.
but i need the data to sorted by id.
remember there is 6gb's of data , how could i sort these structs by there id value and then written to file?
or should i write to file first, and then sort each individual file ,and how would i bring all this data together into one file?
i am kind of stuck , because i would like to hold it in an array , but obviously this amount of data is too big.
i need a good way to sort alot of data? (6gb)

I haven't found a question with a really basic answer on this, so here goes.
If you're on a 64 bit machine, by the way, you should seriously consider writing all the data into a file, memory mapping the file, and just use whatever array sort you like. Quicksort is pretty cache-friendly: it won't thrash badly. The assignment is probably designed to stop you doing this, but might be a bit out of date ;-)
Failing that, you need some kind of external sort. There are other ways to do it, but I think merge sort is probably the simplest. Before you start merging:
work out how much data you can fit into memory (or, again, mmap it). If you're on a PC then 1GB seems like a fair assumption, but it may be a few times more or less.
load this much data (so one of your 6 files, in the example)
quicksort it (since you tagged "quicksort", I guess you know how to do that), or any other sort of your choice.
write it back to disk (if you didn't mmap).
This leaves you with 6 1GB files, each of which individually is sorted. At this point you can either work up gradually, or go for the whole lot in one go. With 6 chunks, going for the whole lot is fine, in what is called a "6-way merge":
open a file for writing
open your 6 files for reading, and read a few million records out of each
examine the 6 records at the start of each of the 6 buffers. One of theses 6 must be the smallest of all. Write this to the output, and move forward one step through that buffer.
as you reach the end of each buffer, refill it from the correct file.
There's some optimization you can do regarding how you work out which of your 6 possibilities is the smallest, but the big performance difference will be to make sure you use large enough read and write buffers.
Obviously there's nothing special about the merge being 6-way. If you'd rather stick to a 2-way merge, which is easier to code, then of course you can. It will take 5 2-way merges to merge 6 files.

I would recommend this tool, it is a light weight database that runs in memory and takes up very little memory. It will hold your information and you can query it to retrieve your information.
http://www.sqlite.org/features.html

I suggest you don't.
If you are to hold such amount of data, why not using a dedicated database format that can have lots of different indexes and a powerful request engine.
But if you still want to use your old fashioned fixed-endian struct, then i would suggest breaking your data into smaller files, sort each one, and merge them. A good merge algorithm runs in nlog(q). Be also sure to pick the right algorithm for your files.

The easiest way (in development time) to do this is to write out the data to separate files according to their ID. You don't have to have a 1 to 1 match between the number of files and the number of IDs (in case there are a lot of IDs), but if you choose a prefix of the ID (so if the key for one particular record is 987 it might go in the 9 file while the record with key 456 would go in the 4 file) you won't have to worry about locating all of the keys across all of the files because sorting each file by itself would result and then looking at the files in their order (by their names) would give you sorted results.
If that is not possible or easy the you need to do an external sort of some type. Since the data is still spread across several files this is a bit of a pain. The easiest thing (by development time) is to first sort each individual file independently and then merge them together into a new set of files sorted by ID. Look up merge sort if you don't know what I'm talking about. At this step you are pretty much starting in the middle of merge sort.
As far as sorting the contents of a file which is too large to fit into RAM you can either use merge sort directly on the file or use replacement selection sort to sort the file in place. This involves making several passes over the file while using some RAM (the more the better) to hold a priority queue (a binary heap) and a set of records that are not possibly of any use in this run (their keys suggest that they should be earlier in the file than the current run position, so you're just holding on to them until the next run).
Searching for replacement selection sort or tournament sort will yield better explanations.

First, sort each file individually. Either load the whole thing into memory, or (better) mmap it, and use the qsort function.
Then, write your own merge sort that takes N FILE * inputs (i.e. N=6 in your case) and outputs to N new files, switching to the next one whenever one fills up.

Check out external sort. Find any of the external mergesort libraries out there and modify them to suit your need.

Well - since the actual assignment is to keep encoded data and later just compare it with decoded-data, I would also say - use a database and just create an hash index on the ID column.
But regarding sort of such hugh number, another very important thing is to do it in parallel. There are many ways to do it. Steve Jessop mentioned a sort-merge approach, it is really easy to sort the first 6 chunks in parallel, the only question is how much cpu cores andd memory you have on your machine. (It is rare to find a computer with only 1 core today and also not so rare to have 4GB memory).

Maybe you could use mmap and use it as a huge array which you could sort with qsort. I'm not sure what the implications would be. Would it grow to much in memory?

Related

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block
header line telling you how many data elements are in the next
block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it in a txt file, press ctrl+end substract header to the rows total and there it is. Although you may waste time opening it if it's very large.

Better to store data in RAM, text file, or database

I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the overhead requirement. Also, any other ideas would be welcome and I have had others myself (i.e. caching, using a server that has everything loaded into RAM etc.). I might benchmark a database, but I thought I would post here to see what other had to say.
Thanks!
UPDATE
I used Tyler's suggestion. Although in my case I did not think a BTree was necessary. I just hashed the words and their offset. I then could look up a word and read in its vector at runtime. I cached the words as they occurred in text so at most each vector is read in only once, however this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RamdomAccessFile class and made use of the readLine(), getFilePointer(), and seek() functions.
Thanks to all who contributed to this thread.
UPDATE 2
For more performance improvement check out buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C-libraries to solve this problem using B-trees. In the old days there was a famous library called "B-trieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory, just store the B-tree (or suffix tree) with an offset to the data in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forwards to the offset and read the vector off disk.
You could use a simple text based index file just mapping the words to indices, and another file just containing the raw vector data for each word. Initially you just read the index to a hashmap that maps each word to the datafile index and keep it in memory. If you need the data for a word, you calculate the offset in the data file (2000 * 32 * index) and read it as needed. You probably want to cache this data in RAM (if you are in java perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoidy database setup / deployment complexity.

loading multiple files of different lengths into one large array in openmp

I have 4 files (file1,file2,file3,file4) of different lengths (n1,n2,n3,n4) which each contain the following type of data:
x1,y1,z1
x2,y2,z2
...
xn,yn,zn
What is the quickest way to load these into memory - can it be done simultaneously to create one large array (i.e. totarray(1:n1+n2+n3+n4,1:3)) from the 4 smaller arrays? If this can't be done in openmp - what would be the fastest way to do this? At the moment, I simply loop over each filename and added it to the bottom of a temporary array which is filled with the new data in each iteration. There are millions of entries in each file and I want to speed this read in up. Thanks
Unless each file is on a different medium, the fastest way of doing this is probably to read the files one at a time, which is what is sounds like you're doing. In this case, OpenMP will not help you, and might make things worse, as the threads would be competing for a single, slow disk. This assumes that you are I/O bound, though.
You do not specify what format your file is in, though. If it is in binary format, then you can't do much better unless you want to start with compression. If it is in text format, though, you are probably CPU bound due to all the text parsing involved, and can probably get huge speedups simply by moving to a binary format. This will be much more efficient than OpenMP parallelization would be.
HDF is a good binary format you might consider, but you could also go with something as simple as fortran "unformatted" files.

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read text (source) file in memory and be able to perform random access into (say for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffer that will hold the character data of the file. I would also make an array, whose index are line numbers and content are addresses corresponding to the begin of the line. This array would be reallocated on need.
Subsidiary question: does anyone have an idea the average size of a source file ? I was surprised not to find this on google.
To clarify:
The file I'm concerned about are source files, so their size should be manageable, they should not be modified and the lines have variables length (tough hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conlusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only, just get the file size, read it in memory with fread(), then you have to maintain a dynamic array which maps the line numbers (index) to pointer to the first character in the line. Someone below suggested to build this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
solution: If lines are about same size, make all lines equally long by appending needed number of metacharacters to each line. Then you can simply calculate the fseek() position from line number, making your search O(1).
If lines are sorted, then you can perform binary search, making your search O(log(nõLines)).
If neither, you can store the indexes of line begginings. But then, you have a problem if you modify file a lot, because if you insert let's say X characters somewhere, you have to calculate which line it is, and then add this X to the all next lines. Similar with with deletion. Yu essentially get O(nõLines). And code gets ugly.
If you want to store whole file in memory, just create aray of lines *char[]. You then get line by first dereference and character by second dereference.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for malloc and delete, and make it so the structs can either delete themselves automatically or manually. Using the above combinations won't get you O(1) read in (which isn't possible without the files have a static format), but it will get you good time.
If you know the file static length (at least individual line wise), IE no bigger than 256 chars per line, then all you need is the DynamicListOfArries - write directly to the array (preset to 256), create a new one, repeat. Downside is it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes, like any text file, it depends on the number of caracters it contains

dealing with a large flat data files with a very big record length

I have a large data file that is created from a shell script. The next script processes it by sorting and reading several times. That takes more than 14 hours; it is not viable.
I want to replace this long running script with a program, probably in JAVA, C, or COBOL, that can run on Windows or on Sun Solaris. I have to read a group of records every time, sort and process and write to the output sort file and at the same time insert into db2/sql tables.
If you insert them into a database anyway it might be much simpler to not do the sorting yourself, but just receive the data ordered from the database once you've inserted it all.
Something that might speed up your sorting is alter your data producing script to place the data into different files based on all or the prefix of the key you will be used to sort the entries.
Then when you actually sort the entries you can limit your sort to only work on the smaller files, which will (pretty much) turn your sort time from O( f(N) ) to O( f(n0) + f(n1) + ... ), which for any f() more complex than f(x)=x should be smaller (faster).
This will also open up the possibility of sorting your files concurrently because the disk IO wait time for one sorting thread would be a great time for another thread to actually sort the records that it has loaded.
You will need to find a happy balance between too many files and too bit files. 256 files is a good starting point.
Another thing you might want to investigate is your sorting algorithm. Merge sort is good for secondary storage sorting. Replacement selection sort is also a good algorithm to use for secondary storage sorting.
http://www.cs.auckland.ac.nz/software/AlgAnim/niemann/s_ext.htm
Doing your file IO in large chunks (file system block sized aligned chunks are best) will also help in most cases.
If you do need to use a relational database anyway you might as well just go ahead and put everything in there to start with, though. RDBMSes typically have very good algorithms to handle all of this tricky stuff.

Resources