write one big file or multiple small files - c

I'm wondering what is better in terms of performance: writing to one big text file (around 10GB or more), or using a subfolder system with 3 levels of 256 folders each, with the text files at the last level. Example:
1/
  1/
    1   <- text files at this level
    2
    ...
    256
  2/
  ...
  256/
2/
...
256/
It will be heavily accessed (opened, some text appended, then closed), so I don't know which is better: opening and closing file pointers thousands of times a second, or seeking around inside one big file thousands of times.
I'm on a Core i7 with 6GB of DDR3 RAM and a disk with a 60MB/s write speed, under ext4.

You ask a fairly generic question, so the generic answer would be to go with the big file, access it and let the filesystem and its caches worry about optimizing access. Chances are they came up with a more advanced algorithm than you just did (no offence).

To make a decision, you need to know answers to many questions, including:
How are you going to determine which of the many files to access for the information you are after?
When you need to append, will it be to the logical last file, or to the end of whatever file the information should have been found in?
How are you going to know where to look in any given file (large or small) for where the information is?
Your 256³ files (16 million or so if you use all of them) will require a fair amount of directory storage.
You actually don't mention anything about reading the file - which is odd.
If you're really doing write-only access to the file or files, then a single file always opened with O_APPEND (or "a") will probably be best. If you are updating (as well as appending) information, then you get into locking issues (concurrent access; who wins).
So, you have not included enough information in the question for anyone to give a definitive answer. If the comments you've added contain enough extra information, you should edit it into the question (edit the question; add the comment material).
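For the write-mostly pattern described above, a minimal sketch of the single-file O_APPEND approach could look like this (the path and record text are placeholders, not anything from the question):

/* append one record to a shared log file; O_APPEND positions every write
 * at the current end of file, so concurrent appenders don't need to seek */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int append_record(const char *path, const char *line)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;
    ssize_t n = write(fd, line, strlen(line));
    close(fd);
    return (n == (ssize_t)strlen(line)) ? 0 : -1;
}

Opening and closing around every append is still thousands of syscalls per second, so if that turns out to matter you can keep the descriptor open and only call write().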

Related

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice: first to determine the size of the problem (so I can allocate the arrays), and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example, a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project saves so much wasted time and effort down the line. You don't have to jump on HDF5 or a similar heavyweight file design method; just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc.), use HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
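To make option 3 concrete, here is a hedged sketch (in C for illustration only, since the idea is language-neutral) of a hypothetical block format in which each block starts with an element count, so a reader can allocate once per block without a second pass:

/* write a block as <count><data...>, then read it back by allocating from the header
 * (error checking trimmed for brevity; "block.dat" is just an example name) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double data[4] = {1.0, 2.0, 3.0, 4.0};
    size_t n = 4;

    FILE *out = fopen("block.dat", "wb");
    fwrite(&n, sizeof n, 1, out);              /* header: how many elements follow */
    fwrite(data, sizeof data[0], n, out);
    fclose(out);

    FILE *in = fopen("block.dat", "rb");
    size_t m = 0;
    fread(&m, sizeof m, 1, in);                /* read header, allocate exactly once */
    double *buf = malloc(m * sizeof *buf);
    fread(buf, sizeof *buf, m, in);
    fclose(in);

    printf("read %zu elements, last = %g\n", m, buf[m - 1]);
    free(buf);
    return 0;
}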
I would add to the previous answer:
If your data has a regular structure and it's possible to open it in a text editor: press Ctrl+End, subtract the header lines from the total row count, and there it is. Although you may waste time opening it if it's very large.

how much time for opening a file? [closed]

In my program I'm using file.open(path_to_file);.
On the server side I have a directory that contains plenty of files, and I'm afraid the program will take longer to run as the directory grows bigger, because of the file.open();
// code:
#include <fstream>

std::ofstream file;
file.open("/mnt/srv/links/154"); // 154 is the link id; /mnt/srv/links contains plenty of files
// ... write to file ...
file.close();
Question: can the time to execute file.open() vary according to the number of files in the directory?
I'm using debian, and I believe my filesystem is ext3.
I'm going to try to answer this - however, it is rather difficult, as it would depend on, for example:
What filesystem is used - in some filesystems, a directory consists of an unsorted list of files, in which case the time to find a particular file is O(n) - so with 900,000 files, it would be a long list to search. On the other hand, others use a hash algorithm or a sorted list, allowing O(1) and O(log2(n)) lookups respectively - of course, each component of the path has to be looked up individually. With 900k files, O(n) is 900,000 times slower than O(1), and O(log2(n)) for 900k is just under 20, so roughly 45,000 times "faster". However, with 900k files, even a binary search may take some doing, because if each directory entry is around 100 bytes [1], we're talking about 85MB of directory data. So there will be several sectors to read in, even if we only touch 19 or 20 different places.
The location of the file itself - a file located on my own hard disk will be much quicker to get to than a file on my Austin, TX colleague's file server when I'm in England.
The load of any file server and comms links involved - naturally, if I'm the only one using a decent NFS or Samba server, it's going to be much quicker than using a file server that is serving a cluster of 2000 machines that are all busy requesting files.
The amount of memory and overall memory usage on the system with the file, and/or the amount of memory available in the local machine. Most modern OS's will have a file-cache locally, and if you are using a server, also a file-cache on the server. More memory -> more space to cache things -> quicker access. Particularly, it may well cache the directory structure and content.
The overall performance of your local machine. Although nearly all of the above factors are important, the simple effort of searching files may well be enough to make some difference with a huge number of files - especially if the search is linear.
[1] A directory entry will have, at least:
A date/time for access, creation and update. With 64-bit timestamps, that's 24 bytes.
Filesize - at least 64-bits, so 8 bytes
Some sort of reference to where the file is - another 8 bytes at least.
A filename - variable length, but one can assume an average of 20 bytes.
Access control bits, at least 6 bytes.
That comes to 66 bytes. But I feel that 100 bytes is probably more typical.
Yes, it can. That depends entirely on the filesystem, not on the language. The times for opening/reading/writing/closing files are all dominated by the times of the corresponding syscalls. C++ should add relatively little overhead, even though you can get surprises from your C++ implementation.
There are a lot of variables which might affect the answer to this, but the general answer is that the number of files will influence the time taken to open a file.
The biggest variable is the filesystem used. Modern filesystems use directory index structures such as B-Trees, to allow searching for known files to be a relatively fast operation. On the other hand, listing all the files in the directory or searching for subsets using wildcards can take much longer.
Other factors include:
Whether symlinks need to be traversed to identify the file
Whether the file is local or mounted over a network
Caching
In my experience, using a modern filesystem, an individual file can be located in a directory containing hundreds of thousands of files in less than a second.
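If you want a number for your own setup rather than a general argument, a rough micro-benchmark is easy; the sketch below (C rather than C++, but ofstream::open ends up in the same open() syscall) times a single open of the path from the question:

/* time one open()/close() pair; run it against directories of different sizes */
#define _POSIX_C_SOURCE 199309L
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int fd = open("/mnt/srv/links/154", O_RDONLY);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (fd != -1)
        close(fd);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("open() %s, took %.1f microseconds\n", fd != -1 ? "succeeded" : "failed", us);
    return 0;
}

Run it a few times: the first open is dominated by reading directory blocks from disk, later ones by the in-memory directory cache, which is exactly the difference the answers above describe.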

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500KB-1MB) and have my application open it, loop through it to find the record, and close it, OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
Alternatively, you can make the single large file perform just as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames - if you've used an index file, then you can just create a second index for this situation.
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
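As a concrete illustration of the directories-as-index idea, a hypothetical helper like the one below maps a key such as FOOBAR onto data/F/FO/FOOBAR (the function name and buffer size are mine, not from the answer):

#include <stdio.h>

/* build "data/<first char>/<first two chars>/<key>" so no directory gets huge */
static void record_path(const char *key, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "data/%.1s/%.2s/%s", key, key, key);
}

int main(void)
{
    char path[256];
    record_path("FOOBAR", path, sizeof path);
    printf("%s\n", path);    /* prints: data/F/FO/FOOBAR */
    return 0;
}

Because the path is computed directly from the key, lookup is a single open() with no directory scanning, which is the O(1) naming-convention point made elsewhere in this thread.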
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given your data is 1MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd point out that having one single big file means your application is doing the management of the lines, while having multiple small files relies on the system and the filesystem to manage the data. The latter can be quite slow though, because it involves system calls for all your operations.
Opening and closing files in C takes a noticeable amount of time. For example, if you have 500 files of 2KB each and you process them all, that adds 1000 extra operations to your application (500 opens and 500 closes), while having a single 1MB file would save you those 1000 operations. (That is purely my personal opinion.)
Generally it's better to have multiple small files. Keeps memory usage low and performance is much better when searching through it.
But it depends on the amount of operations you'll need, because filesystem calls are much more expensive when compared to memory storage for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to a linear file search; use a naming convention to pinpoint the file in O(1) time instead.)
The general trade-off is that one big file can be more difficult to update, but lots of little files are fiddly. My suggestion is that if you use multiple files and end up with a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put them in one directory it would be a nightmare, but having a directory per user id makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
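For reference, and assuming the database file name and table are just placeholders, embedding SQLite from C is roughly this small (build with cc demo.c -lsqlite3):

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db = NULL;
    if (sqlite3_open("records.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    char *err = NULL;
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records(key TEXT PRIMARY KEY, body TEXT);"
        "INSERT OR REPLACE INTO records VALUES('FOOBAR', 'some record text');",
        NULL, NULL, &err);
    if (err) {
        fprintf(stderr, "exec failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}

The whole database lives in the single records.db file, so you keep the "plain files on disk, no server" property while getting indexing and atomic updates for free.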

Is it possible to delete both ends of a large file without copying?

I would like to know if it is possible, using Windows and c++, to take a large video file (several gigabytes in length) and delete the first and last few hundred megabytes of it “in-place”.
The traditional approach of copying the useful data to a new file often takes upwards of 20 minutes of seemingly needless copying.
Is there anything clever that can be done low-level with the disk to make this happen?
Sure, it's possible in theory. But if your filesystem is NTFS, be prepared to spend a few months learning about all the data structures that you'll need to update. (All of which are officially undocumented BTW.)
Also, you'll need to either
Somehow unmount the volume and make your changes then; or
Learn how to write a kernel filesystem driver, buy a license from MS, develop the driver and use it to make changes to a live filesystem.
It's a bit easier if your filesystem is something simpler like FAT32. But either way: in short, it might be possible, but even if it is it'll take years out of your life. My advice: don't bother.
Instead, look at other ways you could solve the problem: e.g. by using an avisynth script to serve just the frames from the region you are interested in.
Are you hoping to just fiddle around with sector addresses in the directory entry? It's virtually inconceivable that plan would work.
First of all, it would require that the amount of data you wish to delete be an exact multiple of the sector size. That's not very likely, considering that there is probably some header data at the very start that must remain there.
Even if it meets those requirements, it would take a low-level modification, which Windows tries very hard to prevent you from doing.
Maybe your file format allows you to 'skip' bytes, so that you could simply write over (e.g. with memory mapping) the relevant parts. This would of course still waste the disk space.
Yes, you can do this, on NTFS.
The end you remove by truncating the file - with the Win32 API that's SetEndOfFile (seek to the new length, then call it).
The beginning, or any other large consecutive region of the file, you overwrite with zeros. You then mark the file "sparse", which allows the file system to reclaim those clusters.
Note that this won't actually change the offset of the data relative to the beginning of the file, it only prevents the filesystem from wasting space storing unneeded data.
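A hedged Win32 sketch of that approach (the file name, the kept length, and the region to drop are illustrative, and error handling is trimmed): truncate the tail with SetEndOfFile, mark the file sparse, then deallocate the leading region with FSCTL_SET_ZERO_DATA so its clusters are returned to the filesystem:

#include <windows.h>
#include <winioctl.h>

int main(void)
{
    HANDLE h = CreateFileA("big_video.bin", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* 1. chop the unwanted data off the end */
    LARGE_INTEGER newEnd;
    newEnd.QuadPart = 4LL * 1024 * 1024 * 1024;        /* keep the first 4 GB (example) */
    SetFilePointerEx(h, newEnd, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    /* 2. mark the file sparse so zeroed ranges stop occupying clusters */
    DWORD bytes = 0;
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

    /* 3. deallocate the first few hundred MB; it now reads back as zeros */
    FILE_ZERO_DATA_INFORMATION zero;
    zero.FileOffset.QuadPart = 0;
    zero.BeyondFinalZero.QuadPart = 300LL * 1024 * 1024;
    DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &zero, sizeof zero,
                    NULL, 0, &bytes, NULL);

    CloseHandle(h);
    return 0;
}

As the answer notes, the remaining data keeps its original offsets; any player reading the file still has to cope with a leading run of zeros.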
Even if low level filesystem operations were easy, editing a video file is not simply a matter of deleting unwanted megabytes. You still do have to consider concepts such as compression, frames, audio and video muxing, media file containers, and many others...
Your best solution is to simply accept your idle twenty minutes.

How can you concatenate two huge files with very little spare disk space? [closed]

Suppose that you have two huge files (several GB) that you want to concatenate together, but that you have very little spare disk space (let's say a couple hundred MB). That is, given file1 and file2, you want to end up with a single file which is the result of concatenating file1 and file2 together byte-for-byte, and delete the original files.
You can't do the obvious cat file2 >> file1; rm file2, since in between the two operations, you'd run out of disk space.
Solutions on any and all platforms with free or non-free tools are welcome; this is a hypothetical problem I thought up while I was downloading a Linux ISO the other day, and the download got interrupted partway through due to a wireless hiccup.
time spent figuring out clever solution involving disk-sector shuffling and file-chain manipulation: 2-4 hours
time spent acquiring/writing software to do in-place copy and truncate: 2-20 hours
times median $50/hr programmer rate: $400-$1200
cost of 1TB USB drive: $100-$200
ability to understand the phrase "opportunity cost": priceless
I think the difficulty is determining how the space can be recovered from the original files.
I think the following might work:
1. Allocate a sparse file of the combined size.
2. Copy 100Mb from the end of the second file to the end of the new file.
3. Truncate 100Mb off the end of the second file.
4. Loop 2 & 3 till you finish the second file (with step 2 adjusted to the correct place in the destination file).
5. Do 2, 3 & 4 again, but with the first file.
This all relies on sparse file support, and file truncation freeing space immediately.
If you actually wanted to do this then you should investigate the dd command, which can do the copying step.
Someone in another answer gave a neat solution that doesn't require sparse files, but does copy file2 twice:
Copy 100Mb chunks from the end of file 2 to a new file 3, ending up in reverse order. Truncating file 2 as you go.
Copy 100Mb chunks from the end of file 3 into file 1, ending up with the chunks in their original order, at the end of file 1. Truncating file 3 as you go.
Here's a slight improvement over my first answer.
If you have 100MB free, copy the last 100MB from the second file and create a third file. Truncate the second file so it is now 100MB smaller. Repeat this process until the second file has been completely decomposed into individual 100MB chunks.
Now each of those 100MB files can be appended to the first file, one at a time.
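A rough C sketch of that chunk-and-truncate scheme (file1, file2 and the 100MB chunk size come from the discussion; the chunk file names are mine; it is not crash-safe and most error handling is omitted):

/* build with: cc -D_FILE_OFFSET_BITS=64 concat.c */
#include <stdio.h>
#include <sys/stat.h>    /* stat */
#include <unistd.h>      /* truncate, unlink */

#define CHUNK (100LL * 1024 * 1024)

/* copy len bytes starting at offset off in src to the current position in dst */
static void copy_range(FILE *src, off_t off, off_t len, FILE *dst)
{
    static char buf[1 << 20];
    fseeko(src, off, SEEK_SET);
    while (len > 0) {
        size_t want = len < (off_t)sizeof buf ? (size_t)len : sizeof buf;
        size_t got = fread(buf, 1, want, src);
        if (got == 0)
            break;
        fwrite(buf, 1, got, dst);
        len -= (off_t)got;
    }
}

int main(void)
{
    struct stat st;
    if (stat("file2", &st) != 0)
        return 1;
    off_t remaining = st.st_size;
    int nchunks = 0;
    char name[32];

    /* 1. peel pieces off the END of file2 into chunk files, truncating as we go,
       so no more than about one chunk of free space is ever needed */
    while (remaining > 0) {
        off_t piece = remaining < CHUNK ? remaining : CHUNK;
        snprintf(name, sizeof name, "chunk.%04d", nchunks++);
        FILE *src = fopen("file2", "rb");
        FILE *dst = fopen(name, "wb");
        copy_range(src, remaining - piece, piece, dst);
        fclose(dst);
        fclose(src);
        remaining -= piece;
        truncate("file2", remaining);     /* give the just-copied space back */
    }
    unlink("file2");

    /* 2. append the chunks to file1 in their original order, deleting each one
       as soon as it has been appended so the free space keeps cycling */
    FILE *out = fopen("file1", "ab");
    while (nchunks-- > 0) {
        snprintf(name, sizeof name, "chunk.%04d", nchunks);
        FILE *in = fopen(name, "rb");
        fseeko(in, 0, SEEK_END);
        off_t len = ftello(in);
        copy_range(in, 0, len, out);
        fclose(in);
        unlink(name);
    }
    fclose(out);
    return 0;
}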
With those constraints I expect you'd need to tamper with the file system; directly edit the file size and allocation blocks.
In other words, forget about shuffling any blocks of file content around, just edit the information about those files.
If the file is highly compressible (e.g. logs):
gzip file1
gzip file2
zcat file1.gz file2.gz | gzip > file3.gz
rm file1.gz
rm file2.gz
gunzip file3.gz
At the risk of sounding flippant, have you considered the option of just getting a bigger disk? It would probably be quicker...
Not very efficient, but I think it can be done.
Open the first file in append mode, and copy blocks from the second file to it until the disk is almost full. For the remainder of the second file, copy blocks from the point where you stopped back to the beginning of the file via random access I/O. Truncate the file after you've copied the last block. Repeat until finished.
Obviously, the economic answer is buy more storage assuming that's a possible answer. It might not be, though--embedded system with no way to attach more storage, or even no access to the equipment itself--say, space probe in flight.
The previously presented answer based on the sparse file system is good (other than the destructive nature of it if something goes wrong!) if you have a sparse file system. What if you don't, though?
Starting from the end of file 2 copy blocks to the start of the target file reversing them as you go. After each block you truncate the source file to the uncopied length. Repeat for file #1.
At this point the target file contains all the data backwards, the source files are gone.
Read a block from the start and from the end of the target file, reverse them and write them to the spot the other came from. Work your way inwards, flipping blocks.
When you are done the target file is the concatenation of the source files. No sparse file system needed, no messing with the file system needed. This can be carried out at zero bytes free as the data can be held in memory.
ok, for theoretical entertainment, and only if you promise not to waste your time actually doing it:
files are stored on disk in pieces
the pieces are linked in a chain
So you can concatenate the files by:
linking the last piece of the first file to the first piece of the last file
altering the directory entry for the first file to change the last piece and file size
removing the directory entry for the last file
cleaning up the first file's end-of-file marker, if any
note that if the last segment of the first file is only partially filled, you will have to copy data "up" the segments of the last file to avoid having garbage in the middle of the file [thanks #Wedge!]
This would be optimally efficient: minimal alterations, minimal copying, no spare disk space required.
now go buy a usb drive ;-)
Two thoughts:
If you have enough physical RAM, you could actually read the second file entirely into memory, delete it, then write it in append mode to the first file. Of course if you lose power after deleting but before completing the write, you've lost part of the second file for good.
Temporarily reduce disk space used by OS functionality (e.g. virtual memory, "recycle bin" or similar). Probably only of use on Windows.
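The first idea above, sketched in C (file1 and file2 are the names used in the question; this assumes file2 genuinely fits in RAM, uses plain long offsets, and accepts the stated risk of losing file2 if the append is interrupted):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* slurp file2 entirely into memory */
    FILE *in = fopen("file2", "rb");
    if (!in)
        return 1;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);              /* assumes 64-bit long for multi-GB files */
    rewind(in);
    char *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, in) != (size_t)size)
        return 1;
    fclose(in);

    /* delete it to free the disk space, then append the buffer to file1 */
    remove("file2");                    /* point of no return */
    FILE *out = fopen("file1", "ab");
    if (!out)
        return 1;
    fwrite(buf, 1, (size_t)size, out);
    fclose(out);
    free(buf);
    return 0;
}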
I doubt this is a direct answer to the question. You can consider this as an alternative way to solve the problem.
I think it is possible to treat the second file as part 2 of the first file. In zip applications we often see a huge file split into multiple parts; if you open the first part, the application automatically takes the other parts into account in further processing.
We can simulate the same thing here. As #edg pointed out, tinkering with the file system would be one way.
you could do this:
head --bytes=1024 file2 >> file1 && tail --bytes=+1025 file2 > file2.tmp && mv file2.tmp file2
(Two things to watch: tail needs +1025, not +1024, or you duplicate a byte; and redirecting straight back with >file2 would truncate file2 before tail reads it, so you have to go through a temporary file, which itself needs free space, or shift the file in place with something like dd conv=notrunc followed by a truncate.)
You can increase 1024 according to how much extra disk space you have, then just repeat this until all the bytes have been moved.
This is probably the fastest way to do it (in terms of development time)
You may be able to gain space by compressing the entire file system. I believe NTFS supports this, and I am sure there are flavors of *nix file systems that would support it. It would also have the benefit of after copying the files you would still have more disk space left over than when you started.
OK, changing the problem a little bit. Chances are there's other stuff on the disk that you don't need, but you don't know what it is or where it is. If you could find it, you could delete it, and then maybe you'd have enough extra space.
To find these "tumors", whether a few big ones, or lots of little ones, I use a little sampling program. Starting from the top of a directory (or the root) it makes two passes. In pass 1, it walks the directory tree, adding up the sizes of all the files to get a total of N bytes. In pass 2, it again walks the directory tree, pretending it is reading every file. Every time it passes N/20 bytes, it prints out the directory path and name of the file it is "reading". So the end result is 20 deep samples of path names uniformly spread over all the bytes under the directory.
Then just look at that list for stuff that shows up a lot that you don't need, and go blow it away.
(It's the space-equivalent of the sampling method I use for performance optimization.)
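A sketch of that two-pass sampling walk (the starting directory and the fixed 20 samples are the only assumptions beyond what's described above):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>

static long long total, seen, next_mark, step;

/* pass 1: add up the size of every regular file under the tree */
static int sum_cb(const char *path, const struct stat *sb, int type, struct FTW *f)
{
    (void)path; (void)f;
    if (type == FTW_F)
        total += sb->st_size;
    return 0;
}

/* pass 2: "read" the same bytes again, naming the file we are in every total/20 bytes */
static int sample_cb(const char *path, const struct stat *sb, int type, struct FTW *f)
{
    (void)f;
    if (type != FTW_F)
        return 0;
    seen += sb->st_size;
    while (seen >= next_mark) {
        printf("%s\n", path);
        next_mark += step;
    }
    return 0;
}

int main(void)
{
    const char *root = "/home";         /* example starting point */
    nftw(root, sum_cb, 16, FTW_PHYS);
    step = total / 20 + 1;
    next_mark = step;
    nftw(root, sample_cb, 16, FTW_PHYS);
    return 0;
}

Whatever shows up several times in those 20 lines is, by construction, where most of the bytes live.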
"fiemap"
http://www.mjmwired.net/kernel/Documentation/filesystems/fiemap.txt
