In Matlab, I need to generate and save over 8 million lines of data to a file.
I have been using fprintf, but as the file gets larger everything seems to slow down. I am not sure but I suspect fprintf is loading the whole file into memory.
Right now I have been iterating like this: calculate values, write to file, repeat.
Would it be better to say first calculate 20,000 values at a time, then write them all to file at once?
I understand data store is a good way to analyze the data, can it also be used to help write it to the file? If so, how?
Related
Suppose I have some program that at each iteration i outputs some data, a,b,c.
What is the best way to write this data to a file, with regards to both speed and precision? What are the Fortran standard methods/ best practices for writing this data to file?
My current considerations are:
Writing all data to an array and then writing this array to file.
Writing line-by-line to unformatted binary file.
Writing line-by-line to text file.
I understand that writing to an array then to file will be faster, but what about the precision in each case? What about the situation where I do not know the value of i and so cannot initially define an array size?
Is it possible to create an array in one programme and then use it in other programmes? The array I am looking to create is very large and its creation will take a while so I don't want to make it anew every time I run the main programme but instead just use it after creating it once in the other programme. Because of its size I'm not sure if printing it to file and then reading it back in would not also be quite inefficient?
It is an integer array of dimensions 1:300 000 and 100.
Long comment:
There are many formats in which you can save data: Fortran unformatted sequential, Fortran unformatted direct, Fortran unformatted stream, NetCDF, HDF5, VTK, ... Really difficult to answer this with any definite answer. We really don't know how time consuming it is to compute it, so we cannot judge whether saving would be more time consuming or not.
Definitely you should be looking for unformatted or binary formats.
Edit: your array is actually not that big. The saving and reading will ne quick. Just use an unformatted file form.
I have 4 files (file1,file2,file3,file4) of different lengths (n1,n2,n3,n4) which each contain the following type of data:
x1,y1,z1
x2,y2,z2
...
xn,yn,zn
What is the quickest way to load these into memory - can it be done simultaneously to create one large array (i.e. totarray(1:n1+n2+n3+n4,1:3)) from the 4 smaller arrays? If this can't be done in openmp - what would be the fastest way to do this? At the moment, I simply loop over each filename and added it to the bottom of a temporary array which is filled with the new data in each iteration. There are millions of entries in each file and I want to speed this read in up. Thanks
Unless each file is on a different medium, the fastest way of doing this is probably to read the files one at a time, which is what is sounds like you're doing. In this case, OpenMP will not help you, and might make things worse, as the threads would be competing for a single, slow disk. This assumes that you are I/O bound, though.
You do not specify what format your file is in, though. If it is in binary format, then you can't do much better unless you want to start with compression. If it is in text format, though, you are probably CPU bound due to all the text parsing involved, and can probably get huge speedups simply by moving to a binary format. This will be much more efficient than OpenMP parallelization would be.
HDF is a good binary format you might consider, but you could also go with something as simple as fortran "unformatted" files.
I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Not being familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Unless your file is exceedingly big (> 2GB) then loading the file in memory prior searching it is the way to go. In case you cannot load the file in memory, you could hold the offset of each line in an int[] or (if the file contains too many lines...) create another binary file and write the offset of each lines as integers...
Having everything in memory is by far preferable, though.
You cannot binary search lines of a text-file without knowing the length of each line in advance, so you'll most likely want to read each line into memory at first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do linear search directly on the file. There's no point in getting O(log n) at the cost of a O(n) setup cost if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
This is a great question!
The challenge of binary search is that the benefits of binary search come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, that the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not a linked list - in the linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory, and could be prohibitively expansive. Consequently, another alternative might be not to store the actual stings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large stings can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow for direct computation of the start position of each line.
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times fir each string by minimizing the number of disk reads that you need to do.
Hope this helps!
I don't see how you can do compare directly from the file. You will have to have a buffer to store data read from disk and use that buffer. So it doesn't make sense, it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map this file directly into memory and work with it as with character array. Operating system will make work with file (like seeking, reading, writing) transparent to you, and you will just work with it like with a buffer in memory. Note that mmap is limited to 4 GB on 32-bit systems. But if that file is bigger, you probably need to ask the question - why on earth someone has this big file not in an indexed database.
I have a general IO question. I was trying to replace a single line in an ascii encoded file. After searching around quite a bit I found that it is not possible to do that. According to what I read if a single line needs to be replaced in a file, the whole file needs to be rewritten. I read that this is the same for all OS's. After reading that I thought ok, no choice, I'll just rewrite the whole file.\n
What got me wondering about this again is I've been working with a program that uses a ".dat" and ".idx" file for it's database. The program is constantly reading and writing to the db. So my question is, it obviously needs to write only small portions at a time (the db is about 200mb in size) so theres no way it could be efficient to write the whole file each time. So my question is what kind of solution would a program like this have for such a problem. Would it write to memory and then every now and then rewrite the whole database. Would it be writing temp files and then merging them to the DB at some point? Or is it possible for a single (or several) lines in the db to be written without the whole file be written?
Any info on this would be greatly appreciated!
Thx
nt
The general comment 'you have to rewrite the whole file' applies when the line you are replacing is of length L1 and the line you are adding is of length L2 and L1 ≠ L2. The trouble is that if L1 is bigger than L2, then you have to move the data in the rest of the file down the file to avoid leaving a gap with garbage where the end of the line was (and you must chop off the tail of the file - shortening it, to avoid leaving garbage at the end). Conversely, if L1 is smaller than L2, you have to move the lines after line up the file to avoid having the new line overwrite the start of the next line.
In the case of the .dat and .idx files, though, you will find that indeed, you are correct: the software is not rewriting the whole file each time. There's a moderate chance that the files represent a C-ISAM file, or one of the related systems (D-ISAM, T-ISAM, etc). In original (Informix) C-ISAM, the .dat file contains fixed length records, so it is possible to write over any old record with a new record because L1 = L2, always. The .idx file is more complex, but it is split into pages (possibly as small as 512 bytes per page), and when an edit is needed, the whole page is rewritten. Since the pages are all the same size, L1 = L2 again - and it is safe to do the rewrite of just the section of the index file that changes.
When a C-ISAM file contains variable length data, the fixed portion of the record is stored in the .dat file, and the variable length portion of the data is stored in pages within the .idx file. This arrangement has just one merit - it leaves the records in the .dat file at a fixed size.
This is not true ntmp. You can indeed write in the middle of a file. How you do it depends on the system and programming language you use. What you are looking for might be seeking operations in IO.
Well you will not exactly have to rewrite the whole file. Only the rest of the file where you start inserting, since that part will needed to be moved behind what you are inserting.
There are several ways you can solve this, one would for example be to reserve space in the file (making the file larger). That way you would only have to move data when the placeholder areas have been filled out.
Write a bit more and we might be able to help you out.