Loading multiple files of different lengths into one large array in OpenMP

I have 4 files (file1,file2,file3,file4) of different lengths (n1,n2,n3,n4) which each contain the following type of data:
x1,y1,z1
x2,y2,z2
...
xn,yn,zn
What is the quickest way to load these into memory? Can the files be read simultaneously to create one large array (i.e. totarray(1:n1+n2+n3+n4,1:3)) from the 4 smaller arrays? If this can't be done in OpenMP, what would be the fastest way to do it? At the moment I simply loop over the filenames and append each file's data to the bottom of a temporary array, which grows in each iteration. There are millions of entries in each file and I want to speed this read up. Thanks

Unless each file is on a different medium, the fastest way of doing this is probably to read the files one at a time, which is what it sounds like you're doing. In this case, OpenMP will not help you, and might make things worse, as the threads would be competing for a single, slow disk. This assumes that you are I/O bound, though.
You do not specify what format your file is in, though. If it is in binary format, then you can't do much better unless you want to start with compression. If it is in text format, though, you are probably CPU bound due to all the text parsing involved, and can probably get huge speedups simply by moving to a binary format. This will be much more efficient than OpenMP parallelization would be.
HDF is a good binary format you might consider, but you could also go with something as simple as Fortran "unformatted" files.
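To make the binary suggestion concrete, here is a minimal sketch in C rather than Fortran (that switch is my assumption, as is the file layout: each file holds raw native-endian doubles, three per point, with no header). The name load_points is hypothetical. It sizes every file first, allocates the big array once, and then reads each file straight into its slice, so nothing is appended or copied:

```c
#include <stdio.h>
#include <stdlib.h>

/* Size of an open binary file in bytes, or -1 on error.
 * fseek/ftell use long, which is 64-bit on typical 64-bit Linux. */
static long file_size(FILE *fp)
{
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    long size = ftell(fp);
    rewind(fp);
    return size;
}

/* Load nfiles raw binary files of (x, y, z) doubles into one contiguous
 * array of 3 * (*total_points) doubles. Returns NULL on any error;
 * the caller frees the result. */
double *load_points(const char *paths[], size_t nfiles, size_t *total_points)
{
    const size_t rec = 3 * sizeof(double);   /* bytes per point */
    size_t *counts = calloc(nfiles ? nfiles : 1, sizeof *counts);
    double *all = NULL;
    size_t total = 0, offset = 0, i;

    if (!counts)
        return NULL;

    /* Pass 1: find out how many points each file holds. */
    for (i = 0; i < nfiles; i++) {
        FILE *fp = fopen(paths[i], "rb");
        long size = fp ? file_size(fp) : -1;
        if (fp)
            fclose(fp);
        if (size < 0 || (size_t)size % rec != 0)
            goto fail;
        counts[i] = (size_t)size / rec;
        total += counts[i];
    }

    /* One allocation for everything. */
    all = malloc(total ? total * rec : rec);
    if (!all)
        goto fail;

    /* Pass 2: read each file directly into its slice of the big array. */
    for (i = 0; i < nfiles; i++) {
        FILE *fp = fopen(paths[i], "rb");
        size_t got = fp ? fread(all + 3 * offset, rec, counts[i], fp) : 0;
        if (fp)
            fclose(fp);
        if (got != counts[i])
            goto fail;
        offset += counts[i];
    }

    free(counts);
    *total_points = total;
    return all;

fail:
    free(counts);
    free(all);
    return NULL;
}
```

Because every slice has a known offset after the first pass, the second loop is also the place where threads could be tried if the files really do sit on separate disks.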

Related

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?

I'll bite, though the question is teetering close to off-topicality. Your options are:
1. Read the file once to get the array size, allocate, read again.
2. Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
3. Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
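A sketch of what option 3 can look like, written in C rather than Fortran and with a made-up layout (an 8-byte element count followed by that many native-endian doubles). write_block and read_block are hypothetical names, not part of any library. The reader needs one pass and one allocation because the header tells it the size up front:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write a count header followed by n doubles. Returns 0 on success. */
int write_block(const char *path, const double *data, uint64_t n)
{
    FILE *fp = fopen(path, "wb");
    if (!fp)
        return -1;
    int ok = fwrite(&n, sizeof n, 1, fp) == 1 &&
             fwrite(data, sizeof *data, (size_t)n, fp) == (size_t)n;
    /* fclose is evaluated first, so the file is always closed. */
    return (fclose(fp) == 0 && ok) ? 0 : -1;
}

/* Read the header, allocate exactly once, then read the data.
 * Returns NULL on error; the caller frees the result.
 * The count is trusted as-is; a real reader should sanity-check it. */
double *read_block(const char *path, uint64_t *n_out)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return NULL;

    uint64_t n = 0;
    double *data = NULL;
    if (fread(&n, sizeof n, 1, fp) == 1) {
        data = malloc(n ? (size_t)n * sizeof *data : sizeof *data);
        if (data && fread(data, sizeof *data, (size_t)n, fp) != (size_t)n) {
            free(data);          /* file shorter than its header claims */
            data = NULL;
        }
    }
    fclose(fp);
    if (data)
        *n_out = n;
    return data;
}
```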
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.

I would add to the previous answer:
If your data has a regular structure and can be opened as a text file, open it, press Ctrl+End to jump to the last line, and subtract the number of header rows from the total row count. Opening the file may itself be slow if it is very large, though.

find and replace data on gzip content efficiently

The inputs to my C program (running on Linux) are:
char *in_str, char *find_str, char *replacing_str
in_str is compressed (gzip) data.
The program needs to find find_str within the uncompressed input data, replace it with replacing_str, and then recompress the data.
The trivial way to do so is to use one of the many available gzip compress/uncompress libraries to uncompress the data, manipulate the uncompressed data, and then recompress the output. However, I need to make it as efficient as possible (it is a real-time program).
I wonder whether it is more efficient to use an on-the-fly library (e.g. zlibc) or simply to do the operation as described above.
It may be important to mention that:
the find_str and replacing_str strings are a small portion of the data
their lengths are not equal
find_str is expected to appear about 4 or 5 times
the uncompressed data length is ~2K - 6K bytes
Is anyone familiar with an efficient way to implement this?
Thanks

You are going to have to decompress no matter what, in order to search for the strings. (You might be able to get away with doing that only once and building an index. However that might be much larger than the uncompressed data, so you might as well just store it uncompressed instead.)
You can avoid recompressing all of it by preparing the gzip file ahead of time to be compressed in smaller historyless units using, for example, the Z_FULL_FLUSH option of zlib. This will reduce compression slightly depending on how often you do it, but will speed up building the output greatly if only one of many blocks needs to be recompressed.

How to export data from C to MATLAB (on different machines)

I am generating long double float data in a C program on a Linux cluster. I need to export the data to Matlab, which is not installed on the cluster.
What is the best way? My advisor says to export using printf statements. I assume he means sending the data to a comma separated file (and fprintf). But it seems to me like that could be slow and use too much memory and we may lose a lot of precision.
I've found this web page for reading and writing .MAT files, but I don't really understand the page, or the example, which I copied to my cluster but cannot compile (because it's missing libraries which, obviously, come from MATLAB).
What is the best, or easiest, or fastest way to export data from Linux/C to Windows/MATLAB? How do I get started with that method? Be advised when you answer that I am pretty new to C and will likely need help figuring out how to obtain, install, configure, and link any libraries. But once that's done, I think I'm pretty good at learning to use them.

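Why do you think you would lose precision? As long as you print enough significant digits (17 are enough to round-trip an IEEE double, which is what MATLAB uses by default), a text file loses nothing. The only drawback with CSVs is that ASCII files require much more storage than binary files, but in this day and age where you get terabytes of storage for the price of a good haircut, that hardly seems like a problem.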
It will only be noticeably slower if you're writing gigabytes upon gigabytes, but normally calculations take so much longer that the difference between ASCII and binary is completely negligible (and if the calculations don't take so long: why do you need a cluster then?)
In any case, I'd go for ASCII -- how to write and read some binary blob needs to be documented in two places, it's easier to create bugs in both the writing end and the reading end, it's harder to solve them since no human can read the file, etc. Also, MAT file formats may change in the next Matlab release (as they have in the past).
With ASCII, you have none of these problems, the only drawback I can think of is that you have to write a small cluster-specific file reader in Matlab (which is still a lot less work than working out all the bugs and maintaining a MAT file writer).
Anyway, there's tons of tools available in Matlab for ASCII: textread, dlmread, importdata, to name a few. On the C-end, indeed just use fprintf (documentation here).
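On the C end, the one thing to get right is the number of digits, since the default precision of %g is only 6. A minimal sketch, assuming a C11 compiler for LDBL_DECIMAL_DIG (with a generous fallback otherwise); write_csv_row is a hypothetical helper name. Note that MATLAB will still round each value to double on import, so the extra long double digits only guarantee that nothing is lost before that point:

```c
#include <float.h>
#include <stdio.h>

/* Significant digits needed to round-trip a long double through text. */
#ifdef LDBL_DECIMAL_DIG
#define LD_DIGITS LDBL_DECIMAL_DIG
#else
#define LD_DIGITS 36   /* assumption: enough even for a 128-bit long double */
#endif

/* Write n values as one comma-separated line. Returns 0 on success. */
int write_csv_row(FILE *fp, const long double *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* "%.*Lg" takes the precision from the argument list. */
        if (fprintf(fp, "%s%.*Lg", i ? "," : "", LD_DIGITS, v[i]) < 0)
            return -1;
    }
    return fputc('\n', fp) == EOF ? -1 : 0;
}
```

Calling write_csv_row once per record produces a file that dlmread or importdata can load directly.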

I once had this problem as well (well, kind of...) and used a simple binary format to do the job.
If your data format is static, that means if it will never change, you can restrict yourself to exactly what you need and hard-code the exact format in your loading program. If you want to stay flexible to add and remove columns, however, you should define a kind of header to add information about the data format and evaluate that on reading.
The trick for simple importing of data is the following:
Let the MATLAB program know how long your data records are and how they are composed.
Read the data with
rest = fread(fid, 'uchar=>uint8', 'b').';
in order to have a row vector of uint8s.
Reshape the data with
rest = reshape(rest, recordlength, []).';
in order to get your data in recordlength columns and as many rows as you need.
For each data column, combine the relevant uint8 rows into a "sub-matrix", using a combination of reshape, typecast, swapbytes to group your data appropriately and convert them into the wanted format.
The most important thing here is the typecast() function, which accepts the "byte-wise" data as 1st and the wanted data type as 2nd parameter. There is a wide range of accepted data types, such as intXX, uintXX (with XX one of 8, 16, 32 and 64) as well as single and double.
For example, typecast(uint8([1, 1]), 'uint16') gives you 257, while typecast(uint8([0, 0, 96, 64]), 'single') gives you 3.5 (on a little-endian machine).
Once you do so, you can improve the reading speed - compared with a text file - by factor 20 or so. (At least, this was the case in the application I wrote this for: there were about 10 different measure values every 10 ms, one measurement could be of several minutes or even hours and I wanted to read in such a file as fast as possible. So I recoded the stuff from text to binary and gained about factor 20, or maybe 15 - don't know exactly. But it was a lot...)

I would save the workspace as a .MAT file, as you said. Then you have whatever values are contained in all the present variables saved as a workspace at that moment. However, if you are reading arrays (your data) that are gigabytes long, then you probably read them chunk by chunk (due to RAM restrictions maybe?) and saving the workspace in that case might not help you.
I would never printf anything for transporting. In my work (on long time asymptotics, so I have huge outputs), I save everything as binary files using fwrite. Converting to text is slow and expensive, as far as I know.
I hope this helps a little bit!
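For the fwrite route, a minimal sketch of the C side. It assumes both machines are little-endian (typical x86) and converts each long double to an 8-byte double first, since MATLAB has no extended-precision numeric type to read the raw long double bytes into. write_doubles is a hypothetical name:

```c
#include <stdio.h>
#include <stdlib.h>

/* Write n long doubles to path as raw 8-byte doubles, native byte order.
 * Returns 0 on success.
 *
 * Reading side in MATLAB (byte order 'l' = little-endian is an assumption):
 *   fid = fopen('data.bin', 'r', 'l');
 *   x = fread(fid, Inf, 'double');
 *   fclose(fid);
 */
int write_doubles(const char *path, const long double *v, size_t n)
{
    double *buf = malloc((n ? n : 1) * sizeof *buf);
    if (!buf)
        return -1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)v[i];          /* rounds to double precision */

    FILE *fp = fopen(path, "wb");
    int ok = fp && fwrite(buf, sizeof *buf, n, fp) == n;
    if (fp && fclose(fp) != 0)
        ok = 0;
    free(buf);
    return ok ? 0 : -1;
}
```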

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.

If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons: encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.

Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
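The binary search described above can be sketched as follows. The Record layout and the name find_record are assumptions for illustration; the point is that the offset of record number mid is simply mid * sizeof(Record), so each probe is one fseek and one small fread:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-length record; the file must be sorted by id. */
typedef struct {
    uint32_t id;
    uint32_t amount;
} Record;

/* Binary-search an open file of nrecords Records for the given id.
 * Returns 1 and fills *out if found, 0 if absent, -1 on I/O error. */
int find_record(FILE *fp, long nrecords, uint32_t id, Record *out)
{
    long lo = 0, hi = nrecords;          /* search the half-open range [lo, hi) */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        Record r;
        /* Fixed record length makes the byte offset a simple product. */
        if (fseek(fp, mid * (long)sizeof r, SEEK_SET) != 0 ||
            fread(&r, sizeof r, 1, fp) != 1)
            return -1;
        if (r.id == id) {
            *out = r;
            return 1;
        }
        if (r.id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

Only about log2(nrecords) records are ever read, no matter how large the file is.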

I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible but compact way.

how to sort a lot of data in c?

At the moment I am trying to write an unreal amount of data out to files.
Basically I generate a new struct of data and write it out to a file until the file reaches 1 GB, and this happens for 6 files of 1 GB each. The structs are small: 8 bytes long, with two variables, id and amount.
When I generate my data, the structs are created and written to file in order of amount.
But I need the data to be sorted by id.
Remember there are 6 GB of data. How could I sort these structs by their id value and then write them to file?
Or should I write to file first and then sort each individual file, and how would I bring all this data together into one file?
I am kind of stuck, because I would like to hold it all in an array, but obviously this amount of data is too big.
I need a good way to sort a lot of data (6 GB).

I haven't found a question with a really basic answer on this, so here goes.
If you're on a 64 bit machine, by the way, you should seriously consider writing all the data into a file, memory mapping the file, and just use whatever array sort you like. Quicksort is pretty cache-friendly: it won't thrash badly. The assignment is probably designed to stop you doing this, but might be a bit out of date ;-)
Failing that, you need some kind of external sort. There are other ways to do it, but I think merge sort is probably the simplest. Before you start merging:
work out how much data you can fit into memory (or, again, mmap it). If you're on a PC then 1GB seems like a fair assumption, but it may be a few times more or less.
load this much data (so one of your 6 files, in the example)
quicksort it (since you tagged "quicksort", I guess you know how to do that), or any other sort of your choice.
write it back to disk (if you didn't mmap).
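The load, sort and write-back steps above can be sketched like this. The record layout is an assumption (the question only says 8 bytes with an id and an amount, so two 32-bit unsigned fields are used here), and sort_chunk is a hypothetical name:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed layout: 8 bytes, two 32-bit fields. */
typedef struct {
    uint32_t id;
    uint32_t amount;
} Record;

/* qsort comparator on id; avoids subtraction so it cannot overflow. */
static int cmp_id(const void *a, const void *b)
{
    uint32_t x = ((const Record *)a)->id;
    uint32_t y = ((const Record *)b)->id;
    return (x > y) - (x < y);
}

/* Read up to max_records from in_path, sort them by id and write them to
 * out_path. Returns the number of records sorted, or -1 on error. */
long sort_chunk(const char *in_path, const char *out_path, size_t max_records)
{
    Record *buf = malloc((max_records ? max_records : 1) * sizeof *buf);
    FILE *in = fopen(in_path, "rb");
    long result = -1;

    if (buf && in) {
        size_t n = fread(buf, sizeof *buf, max_records, in);
        qsort(buf, n, sizeof *buf, cmp_id);

        FILE *out = fopen(out_path, "wb");
        if (out) {
            int ok = fwrite(buf, sizeof *buf, n, out) == n;
            if (fclose(out) == 0 && ok)
                result = (long)n;
        }
    }
    if (in)
        fclose(in);
    free(buf);
    return result;
}
```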
This leaves you with 6 1GB files, each of which individually is sorted. At this point you can either work up gradually, or go for the whole lot in one go. With 6 chunks, going for the whole lot is fine, in what is called a "6-way merge":
open a file for writing
open your 6 files for reading, and read a few million records out of each
examine the 6 records at the start of each of the 6 buffers. One of these 6 must be the smallest of all. Write this to the output, and move forward one step through that buffer.
as you reach the end of each buffer, refill it from the correct file.
There's some optimization you can do regarding how you work out which of your 6 possibilities is the smallest, but the big performance difference will be to make sure you use large enough read and write buffers.
Obviously there's nothing special about the merge being 6-way. If you'd rather stick to a 2-way merge, which is easier to code, then of course you can. It will take 5 2-way merges to merge 6 files.
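A minimal sketch of the k-way merge described above, with the same assumed 8-byte record layout. merge_sorted and MAX_WAYS are hypothetical names. It keeps one "head" record per input and picks the smallest with a linear scan, which is fine for 6 inputs; buffering is left to stdio here rather than managed by hand:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: 8 bytes, two 32-bit fields. */
typedef struct {
    uint32_t id;
    uint32_t amount;
} Record;

#define MAX_WAYS 16

/* Merge k input files, each already sorted by id, into out.
 * Returns 0 on success, -1 on bad arguments or write failure. */
int merge_sorted(FILE *in[], int k, FILE *out)
{
    Record head[MAX_WAYS];   /* next unconsumed record of each input */
    int live[MAX_WAYS];      /* 1 while that input still has data    */

    if (k < 1 || k > MAX_WAYS)
        return -1;

    for (int i = 0; i < k; i++)
        live[i] = fread(&head[i], sizeof head[i], 1, in[i]) == 1;

    for (;;) {
        /* Find the live input whose head has the smallest id. */
        int best = -1;
        for (int i = 0; i < k; i++)
            if (live[i] && (best < 0 || head[i].id < head[best].id))
                best = i;
        if (best < 0)
            break;                       /* every input is exhausted */

        if (fwrite(&head[best], sizeof head[best], 1, out) != 1)
            return -1;
        /* Advance only the input we just consumed from. */
        live[best] = fread(&head[best], sizeof head[best], 1, in[best]) == 1;
    }
    return 0;
}
```

To get the large read and write buffers recommended above, setvbuf can be called on each stream right after fopen, before any other I/O on it. For many more inputs, the linear scan could be replaced by a min-heap.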

I would recommend SQLite. It is a lightweight database engine that runs inside your program and takes up very little memory. It will hold your information and you can query it to retrieve it.
http://www.sqlite.org/features.html

I suggest you don't.
If you are to hold such an amount of data, why not use a dedicated database format that can have lots of different indexes and a powerful query engine?
But if you still want to use your old-fashioned fixed-endian struct, then I would suggest breaking your data into smaller files, sorting each one, and merging them. A good merge algorithm runs in O(n log q) for n records in q files. Also be sure to pick the right algorithm for your files.

The easiest way (in development time) to do this is to write out the data to separate files according to their ID. You don't have to have a 1 to 1 match between the number of files and the number of IDs (in case there are a lot of IDs). If you choose a prefix of the ID (so the record with key 987 might go in the 9 file while the record with key 456 would go in the 4 file), you won't have to worry about locating keys across all of the files: sorting each file by itself and then reading the files in order (by their names) gives you sorted results. This only works if the prefix preserves the ordering, e.g. when all IDs have the same number of digits; otherwise 10 would land in the 1 file, ahead of 9.
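A sketch of that distribution step, with two assumptions of mine: the id is a 32-bit unsigned integer, and the "prefix" is its top few bits rather than a leading decimal digit, which keeps the buckets in numeric order regardless of how many digits an id has. distribute is a hypothetical name:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: 8 bytes, two 32-bit fields. */
typedef struct {
    uint32_t id;
    uint32_t amount;
} Record;

/* Copy every record from in to buckets[b], where b is the top `bits`
 * bits of the id. The caller supplies 2^bits open output files.
 * Returns 0 on success, -1 on bad arguments or I/O error. */
int distribute(FILE *in, FILE *buckets[], unsigned bits)
{
    Record r;

    if (bits < 1 || bits > 16)
        return -1;

    while (fread(&r, sizeof r, 1, in) == 1) {
        unsigned b = r.id >> (32 - bits);    /* bucket index, 0 .. 2^bits - 1 */
        if (fwrite(&r, sizeof r, 1, buckets[b]) != 1)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Every id in bucket b is smaller than every id in bucket b + 1, so sorting each bucket on its own and concatenating them in index order yields the full sorted sequence. The buckets are only evenly sized if the ids are spread evenly over the 32-bit range.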
If that is not possible or easy then you need to do an external sort of some type. Since the data is still spread across several files this is a bit of a pain. The easiest thing (by development time) is to first sort each individual file independently and then merge them together into a new set of files sorted by ID. Look up merge sort if you don't know what I'm talking about. At this step you are pretty much starting in the middle of merge sort.
As far as sorting the contents of a file which is too large to fit into RAM you can either use merge sort directly on the file or use replacement selection sort to sort the file in place. This involves making several passes over the file while using some RAM (the more the better) to hold a priority queue (a binary heap) and a set of records that are not possibly of any use in this run (their keys suggest that they should be earlier in the file than the current run position, so you're just holding on to them until the next run).
Searching for replacement selection sort or tournament sort will yield better explanations.

First, sort each file individually. Either load the whole thing into memory, or (better) mmap it, and use the qsort function.
Then, write your own merge sort that takes N FILE * inputs (i.e. N=6 in your case) and outputs to N new files, switching to the next one whenever one fills up.

Check out external sort. Find any of the external mergesort libraries out there and modify them to suit your need.

Well - since the actual assignment is to keep encoded data and later just compare it with decoded data, I would also say: use a database and just create a hash index on the ID column.
But regarding sorting such a huge number of records, another very important thing is to do it in parallel. There are many ways to do it. Steve Jessop mentioned a sort-merge approach; it is really easy to sort the first 6 chunks in parallel, the only question is how many CPU cores and how much memory you have on your machine. (It is rare to find a computer with only 1 core today, and not so rare to have 4GB of memory.)

Maybe you could use mmap and use it as a huge array which you could sort with qsort. I'm not sure what the implications would be. Would it grow too much in memory?
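A minimal sketch of that idea for POSIX systems, assuming a 64-bit address space and the 8-byte record layout from the question (two 32-bit fields is my guess). sort_file_in_place is a hypothetical name. With MAP_SHARED the pages are backed by the file itself, so the kernel can write them back and evict them under memory pressure instead of needing the whole file in RAM at once; how well that performs depends on the access pattern of the qsort implementation, so it is worth measuring:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed layout: 8 bytes, two 32-bit fields. */
typedef struct {
    uint32_t id;
    uint32_t amount;
} Record;

/* qsort comparator on id; avoids subtraction so it cannot overflow. */
static int cmp_id(const void *a, const void *b)
{
    uint32_t x = ((const Record *)a)->id;
    uint32_t y = ((const Record *)b)->id;
    return (x > y) - (x < y);
}

/* Map a file of Records into memory and sort it in place by id.
 * Returns 0 on success, -1 on error. */
int sort_file_in_place(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size % (off_t)sizeof(Record) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {               /* mmap rejects zero-length maps */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    qsort(p, len / sizeof(Record), sizeof(Record), cmp_id);

    int rc = msync(p, len, MS_SYNC);     /* flush sorted pages to the file */
    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```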
