How to export data from C to MATLAB (on different machines)

I am generating long double float data in a C program on a Linux cluster. I need to export the data to Matlab, which is not installed on the cluster.
What is the best way? My advisor says to export using printf statements. I assume he means sending the data to a comma-separated file (i.e., with fprintf). But it seems to me that could be slow, use too much memory, and lose a lot of precision.
I've found this web page for reading and writing .MAT files, but I don't really understand the page or the example, which I copied to my cluster but cannot compile (because it's missing libraries which, obviously, come from MATLAB).
What is the best, or easiest, or fastest way to export data from Linux/C to Windows/MATLAB? How do I get started with that method? Be advised when you answer that I am pretty new to C and will likely need help figuring out how to obtain, install, configure, and link any libraries. But once that's done, I think I'm pretty good at learning to use them.

Why do you think you would lose precision? The only drawback with CSVs is that ASCII files require much more storage than binary files, but in this day and age where you get terabytes of storage for the price of a good haircut, that hardly seems like a problem.
It will only be noticeably slower if you're writing gigabytes upon gigabytes, but normally calculations take so much longer that the difference between ASCII and binary is completely negligible (and if the calculations don't take so long: why do you need a cluster then?)
In any case, I'd go for ASCII -- the layout of a binary blob needs to be documented in two places, it's easier to create bugs at both the writing and the reading end, and they're harder to track down since no human can read the file, etc. Also, the MAT file format may change in the next Matlab release (as it has in the past).
With ASCII, you have none of these problems, the only drawback I can think of is that you have to write a small cluster-specific file reader in Matlab (which is still a lot less work than working out all the bugs and maintaining a MAT file writer).
Anyway, there are tons of tools available in Matlab for reading ASCII: textread, dlmread, importdata, to name a few. On the C end, indeed just use fprintf.
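For what it's worth, here is a minimal sketch of that fprintf approach, assuming the data sits in a flat long double array with three values per row (the three-column layout and the write_csv name are just assumptions for illustration). Printing with LDBL_DIG or more significant digits means the text round-trips without losing anything; note that MATLAB stores numbers as 64-bit doubles anyway, so some long double precision is lost on the MATLAB side no matter which file format you pick.

#include <stdio.h>
#include <float.h>

/* Write an n-by-3 table of long doubles as CSV.  LDBL_DIG + 3
 * significant digits are enough for the text to round-trip;
 * MATLAB can read the result back with dlmread or importdata. */
int write_csv(const char *path, const long double *data, size_t nrows)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    for (size_t i = 0; i < nrows; ++i)
        fprintf(f, "%.*Le,%.*Le,%.*Le\n",
                LDBL_DIG + 3, data[3 * i],
                LDBL_DIG + 3, data[3 * i + 1],
                LDBL_DIG + 3, data[3 * i + 2]);
    return fclose(f);
}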

I once had this problem as well (well, kind of...) and used a simple binary format to do the job.
If your data format is static, that is, if it will never change, you can restrict yourself to exactly what you need and hard-code the exact format in your loading program. If you want to stay flexible enough to add and remove columns, however, you should define a kind of header that carries information about the data format and evaluate that on reading.
The trick for simple importing of data is the following:
Let the MATLAB program know how long your data records are and how they are composed.
Read the data with
rest = fread(fid, 'uchar=>uint8', 'b').';
in order to have a row vector of uint8s.
Reshape the data with
rest = reshape(rest, recordlength, []).';
in order to get your data in recordlength columns and as many rows as you need.
For each field in the record, combine the relevant uint8 columns into a "sub-matrix", using a combination of reshape, typecast, swapbytes to group the bytes appropriately and convert them into the wanted type.
The most important thing here is the typecast() function, which accepts the "byte-wise" data as 1st and the wanted data type as 2nd parameter. There is a wide range of accepted data types, such as intXX and uintXX (with XX one of 8, 16, 32 and 64) as well as single and double.
For example, typecast(uint8([1, 1]), 'uint16') gives you 257, while typecast(uint8([0, 0, 96, 64]), 'single') gives you 3.5 (note that the input must itself be a uint8 vector, not a double vector).
Once you do so, you can improve the reading speed - compared with a text file - by a factor of 20 or so. (At least, this was the case in the application I wrote this for: there were about 10 different measured values every 10 ms, one measurement could run for several minutes or even hours, and I wanted to read in such a file as fast as possible. So I recoded the stuff from text to binary and gained about a factor of 20, or maybe 15 - I don't know exactly, but it was a lot...)
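On the C side, the matching writer can be a plain fwrite of fixed-length records. The sketch below is only an illustration under an assumed record layout (a uint32 id followed by two doubles, i.e. recordlength = 20 bytes for the reshape step above). Also note that C's long double is not portable (on x86 Linux it is an 80-bit type padded to 12 or 16 bytes), so converting to double before writing is usually the safer choice, and x86 writes little-endian, so choose the machine-format argument of fread (and any swapbytes calls) to match.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical record layout: uint32 id + two doubles = 20 bytes.
 * The fields are written one by one so struct padding never enters
 * the file; the MATLAB reader then uses recordlength = 20. */
static int write_record(FILE *f, uint32_t id, double x, double y)
{
    if (fwrite(&id, sizeof id, 1, f) != 1) return -1;
    if (fwrite(&x, sizeof x, 1, f) != 1) return -1;
    if (fwrite(&y, sizeof y, 1, f) != 1) return -1;
    return 0;
}

int main(void)
{
    FILE *f = fopen("records.bin", "wb");
    if (!f)
        return 1;
    for (uint32_t i = 0; i < 1000; ++i)
        write_record(f, i, i * 0.5, i * 0.25);   /* dummy data */
    return fclose(f);
}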

I would save the workspace as a .MAT file, as you said. Then you have whatever values are contained in all the present variables saved as a workspace at that moment. However, if you are reading arrays (your data) that are gigabytes long, then you probably read them chunk by chunk (due to RAM restrictions maybe?) and saving the workspace in that case might not help you.
I would never printf anything for transporting. In my work (on long time asymptotics, so I have huge outputs), I save everything as binary files using fwrite. Converting to text is slow and expensive, as far as I know.
I hope this helps a little bit!

Related

Efficient Output Format for Huge Data Sets?

I have written a program that writes output to a file. The output is in 6-column, n-row format, and all values are double-precision float. It is very common in my code for n to become extremely large (1e20 or so) and, hence, the output data file also becomes extremely large.
I am currently storing everything in *.csv format, which obviously results in huge data files. Is there any more efficient way to store such values? Any new format of file or any new method that would decrease the file size significantly?
For clarification:
The data does not need to be human readable, binary would do just fine.
I will further process the data in the file to get some important parameters from the runs, probably the distance of travel, time of exit at a particular point etc.
The code is actually an astrophysical simulation of moving particles, and with about 1e10 particles over a million time steps each, the output gets quite large.
When designing a file format you have to consider various things, like:
a) Is there a chance that the file could have been corrupted, or maliciously tampered with (or is there any kind of confidentiality requirement)? The answer to this is almost always "yes". To guard against these things you need to consider some kind of checksum and/or encryption. You may also need to consider whether partial recovery is desirable (e.g. is it beneficial to split the file into multiple blocks/sections where each block has its own checksum/encryption, so that if 4 bytes in one block/section are corrupted you can still recover the majority of the data).
b) Is there a portability concern? For example, if you store raw double values in the file will it create a problem on other computers that have a different binary format for "double"?
c) For each type of value; what is the range that actually needs to be represented and what are the precision requirements? Typically software uses "larger and more precise" than necessary (often because it's faster to select the next largest type the CPU supports); but for file formats this causes an unnecessary increase in file sizes. For a simple example; maybe you could convert a (64-bit) double into a 32-bit fixed point format and halve the space used while still achieving the range and precision that's actually needed.
d) Are there "clever" ways to reduce the range and precision required for some of the values? For a simple example; maybe you have a "starting value" and an "ending value" which both do need 64 bits; but you could convert them into a "starting value" and a "difference" (so that "ending value" can be calculated as "starting value + difference"), where the "difference" values have less range and only need 32 bits to store.
e) Is any kind of indexing beneficial? For a simple example; if the file might contain 1 million entries and you only want to find one, then you might be able to use the index to find the offset for the entry you want and only load that one entry (and avoid loading all 1 million entries).
f) What other meta-data might you want? This can be things like a "magic signature" (so that software can check if a file is supposed to comply with the file format and the user didn't give your program the wrong type of file), or a "file format version number" (so that the program can do "auto-update to new file format" or at least detect when the file uses an obsolete/deprecated file format that is no longer supported). It can also include information to identify things like who the author was, where the data came from, when the data was obtained, which program created/prepared the file, etc. Sometimes there are also optional data and flags to say if the optional data is/isn't included in the file. You might also want things like "number of entries" and "offset in file for each different area", etc. (A minimal header along these lines is sketched after this list.)
g) What kind of allowances do you need to make for extensibility (and backward compatibility, and forward compatibility)? Typically people leave things like (e.g.) "reserved for future use" fields in headers so that they can add/change/extend the file format in future without breaking everything. Sometimes this is even more specific about what software should do when it sees values in reserved fields that it doesn't support - e.g. "reserved for future use, should be zero, if non-zero software should ignore this value" vs. "reserved for future use, should be zero, if non-zero (due to future use) software should generate an error and not use the file"
h) Are any kind of compression techniques useful? For a simple example, if you have "6-columns, N-rows" with an index, and sometimes the data for 2 or more rows happens to be the same; then maybe you can only store one copy of the data for those rows and then use the index to figure out which row uses which data (a bit like "row[n] = unique_row_data[ index[n] ]").
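To make points (f) and (g) concrete, here is a minimal sketch of what such a header could look like in C. The magic value, field names and sizes are all hypothetical rather than any standard format, and writing the struct as-is assumes the reader agrees on endianness and struct layout.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical header for a "6 doubles per row" data file.  The magic
 * number identifies the format, the version allows the record layout to
 * change later, and the reserved field is written as zero so it can be
 * given a meaning in a future version without breaking old readers. */
struct file_header {
    uint32_t magic;      /* e.g. 0x53494D31 ("SIM1") */
    uint32_t version;    /* bump when the record layout changes */
    uint64_t row_count;  /* number of 6-double rows that follow */
    uint64_t reserved;   /* must be zero for now */
};

int write_header(FILE *f, uint64_t rows)
{
    struct file_header h = { 0x53494D31u, 1u, rows, 0u };
    return fwrite(&h, sizeof h, 1, f) == 1 ? 0 : -1;
}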

Better to store data in RAM, text file, or database

I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the overhead requirement. Also, any other ideas would be welcome and I have had others myself (i.e. caching, using a server that has everything loaded into RAM etc.). I might benchmark a database, but I thought I would post here to see what others had to say.
Thanks!
UPDATE
I used Tyler's suggestion, although in my case I did not think a B-tree was necessary. I just hashed the words to their offsets. I could then look up a word and read in its vector at runtime. I cached the words as they occurred in the text, so each vector is read in at most once; this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RandomAccessFile class and made use of the readLine(), getFilePointer(), and seek() functions.
Thanks to all who contributed to this thread.
UPDATE 2
For more performance improvement check out buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C-libraries to solve this problem using B-trees. In the old days there was a famous library called "B-trieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory, just store the B-tree (or suffix tree) with an offset to the data in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forwards to the offset and read the vector off disk.
You could use a simple text-based index file just mapping the words to indices, and another file just containing the raw vector data for each word. Initially you just read the index into a hashmap that maps each word to its index in the data file and keep it in memory. If you need the data for a word, you calculate its offset in the data file (index * 2000 * 4 bytes, for 2000 32-bit floats per word) and read it as needed. You probably want to cache this data in RAM (if you are in Java, perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoids database setup / deployment complexity.
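The thread is about Java, but the seek-by-offset idea is language-independent; here is a rough sketch in C, under the assumption that the data file holds nothing but raw 32-bit floats, 2000 per word, so a word's vector starts at index * 2000 * 4 bytes (the read_vector name and the fixed layout are assumptions for illustration).

#include <stdio.h>

#define VEC_LEN 2000   /* floats per word vector, as in the question */

/* Read the vector stored at position `index` in the data file.  The
 * byte offset is simply index * VEC_LEN * sizeof(float) because every
 * record has the same fixed length.  Returns 0 on success. */
int read_vector(FILE *data, long index, float out[VEC_LEN])
{
    long offset = index * (long)(VEC_LEN * sizeof(float));
    if (fseek(data, offset, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof(float), VEC_LEN, data) == VEC_LEN ? 0 : -1;
}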

loading multiple files of different lengths into one large array in openmp

I have 4 files (file1,file2,file3,file4) of different lengths (n1,n2,n3,n4) which each contain the following type of data:
x1,y1,z1
x2,y2,z2
...
xn,yn,zn
What is the quickest way to load these into memory - can it be done simultaneously to create one large array (i.e. totarray(1:n1+n2+n3+n4,1:3)) from the 4 smaller arrays? If this can't be done in OpenMP - what would be the fastest way to do this? At the moment, I simply loop over each filename and append its contents to the bottom of a temporary array, which is filled with the new data in each iteration. There are millions of entries in each file and I want to speed this read-in up. Thanks
Unless each file is on a different medium, the fastest way of doing this is probably to read the files one at a time, which is what it sounds like you're doing. In this case, OpenMP will not help you, and might make things worse, as the threads would be competing for a single, slow disk. This assumes that you are I/O bound, though.
You do not specify what format your file is in, though. If it is in binary format, then you can't do much better unless you want to start with compression. If it is in text format, though, you are probably CPU bound due to all the text parsing involved, and can probably get huge speedups simply by moving to a binary format. This will be much more efficient than OpenMP parallelization would be.
HDF is a good binary format you might consider, but you could also go with something as simple as fortran "unformatted" files.
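If you do convert the files to a raw binary format, the sequential read described above stays simple; below is a rough C sketch (the append_file name and the "raw doubles, three per record" layout are assumptions for illustration, not the asker's actual format).

#include <stdio.h>
#include <stdlib.h>

/* Append one binary file of x,y,z double triples to a growing array.
 * Assumes the text files have already been converted to raw doubles,
 * three per record; buf/rows/cap describe the array built so far. */
static size_t append_file(const char *path, double **buf, size_t *rows, size_t *cap)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    double triple[3];
    size_t added = 0;
    while (fread(triple, sizeof(double), 3, f) == 3) {
        if (*rows == *cap) {                      /* grow the array */
            size_t newcap = *cap ? *cap * 2 : 1024;
            double *tmp = realloc(*buf, newcap * 3 * sizeof(double));
            if (!tmp)
                break;
            *buf = tmp;
            *cap = newcap;
        }
        (*buf)[3 * *rows + 0] = triple[0];
        (*buf)[3 * *rows + 1] = triple[1];
        (*buf)[3 * *rows + 2] = triple[2];
        ++*rows;
        ++added;
    }
    fclose(f);
    return added;
}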

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons, encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size) which sometimes are prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
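As a sketch of that binary-search idea, assuming a hypothetical file of fixed 20-byte records sorted by an 8-byte key stored at the start of each record (the record layout and the find_record name are made up for illustration):

#include <stdio.h>
#include <string.h>

#define REC_SIZE 20          /* hypothetical fixed record size in bytes */
#define KEY_SIZE 8           /* key stored at the start of each record  */

/* Binary search over a file of fixed-size records sorted by their key,
 * without loading the file into memory: the i-th record always starts
 * at byte i * REC_SIZE, so we can seek straight to the midpoint. */
long find_record(FILE *f, const char key[KEY_SIZE], long nrecords)
{
    long lo = 0, hi = nrecords - 1;
    char buf[KEY_SIZE];
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        fseek(f, mid * REC_SIZE, SEEK_SET);
        if (fread(buf, 1, KEY_SIZE, f) != KEY_SIZE)
            return -1;
        int cmp = memcmp(buf, key, KEY_SIZE);
        if (cmp == 0) return mid;        /* found: return record index */
        if (cmp < 0)  lo = mid + 1;
        else          hi = mid - 1;
    }
    return -1;                           /* not found */
}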
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

Reuse of characters in compiled .exe file

Once, long ago and out of curiosity, I tried hex-editing the executable file of the game "Dangerous Dave".
I looked around the file for any strings I could find and made some random edits to see if it would actually change the text displayed within the game.
I was surprised by the result, which I have since recreated using a hex editor and DOSBox.
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be an attempt to make the executable smaller, but the code needed to reuse those two characters would probably require more space than the 2 bytes saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.
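As a toy illustration of the general LZ idea (not the actual scheme the game used), a decoder can store the final "RO" as a back-reference to text it has already produced, which is why patching the first "RO" also changes the last two characters:

#include <stdio.h>

/* Toy LZ-style back-reference: "ROMERO" can be stored as the literals
 * "ROME" plus a (distance=4, length=2) copy, so the final "RO" is not
 * stored at all -- patching the first "RO" therefore also changes the
 * copy at the end. */
int main(void)
{
    char out[16] = "ROME";              /* literal part as stored in the file */
    size_t len = 4;
    size_t distance = 4, copy_len = 2;  /* the back-reference */
    for (size_t i = 0; i < copy_len; ++i, ++len)
        out[len] = out[len - distance]; /* copy from already-decoded text */
    out[len] = '\0';
    printf("%s\n", out);                /* prints ROMERO */
    return 0;
}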
