Let's assume that on a hard drive I have some very large data file of a sequence of characters:
ABRDZ....
My question is as follows: if the head is positioned at the beginning of the file, and I need 5 characters at every 1000-position interval, would it be better to do a seek (since I know where to look), or simply to use a large buffer that reads sequentially and then do the job in memory?
Naively, I'd have answered that reading 'A' and then seeking to read 'V' is faster than reading the whole file up to, say, position 200 (the position of 'V'). OK, this is just an example, since the smallest I/O is 512 bytes.
Edit: my naive self-answer is partly justified by the following case: given a 100 GB file, I need the first and the last characters; here I would obviously do a seek... right?
Maybe there is a trade-off between how "long" the seek is versus how much data there is to retrieve?
Can someone clarify this to me?
[UPDATE]
Generally, from your original numbers of 5 out of every 1000 (I'll assume that the 5 bytes are part of the 1000, making your step size 1000), if your step size is less than 2x your block size, then my original answer is a pretty good explanation. It gets a bit more tricky once you get past 2x your HD block size, because at that point you would easily be wasting read time, when you could be speeding things up by seeking past unused (or, for that matter, unnecessary) HD blocks.
[ORIGINAL]
Well, this is an extremely interesting question, with what I believe to be an equally interesting answer (also somewhat complex). I think this actually comes down to a couple of other questions, like how big the block size is on your drive (or the drive your software is going to run on). If your block size is 4 KB, then the (true) minimum your hard drive will fetch for you at a time is 4096 bytes. In your case, if you truly need 5 chars every 1000 and did this with disk I/O alone, you would essentially be re-reading the same block about 4 times, with 3 seeks in between (REALLY NOT EFFICIENT).
My personal belief is that (if you wanted to be drive-efficient) your code could determine the block size of the drive you are using, and then use that number to decide how many bytes at a time to bring into RAM. This way you wouldn't need a HUGE RAM buffer, but at the same time you wouldn't really have to SEEK, nor would you be wasting (or performing) any extra reads.
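As a rough illustration, here is a C sketch of that idea: read block-sized chunks sequentially and pick the samples out in memory. BLOCK_SIZE, STEP, SAMPLE_LEN and the file name are assumptions standing in for the drive's block size and the 5-characters-every-1000 pattern from the question.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed drive/filesystem block size */
#define STEP       1000   /* one sample every 1000 bytes (from the question) */
#define SAMPLE_LEN 5      /* each sample is 5 characters */

int main(void)
{
    FILE *f = fopen("data.bin", "rb");           /* hypothetical input file */
    if (!f) { perror("fopen"); return 1; }

    /* Keep SAMPLE_LEN-1 bytes of carry so a sample may straddle two reads. */
    char buf[SAMPLE_LEN - 1 + BLOCK_SIZE];
    size_t carry = 0;                            /* bytes kept from the previous read */
    long long base = 0;                          /* file offset of buf[0] */
    long long next = 0;                          /* file offset of the next sample */

    for (;;) {
        size_t n = fread(buf + carry, 1, BLOCK_SIZE, f);
        if (n == 0)
            break;
        long long end = base + (long long)(carry + n);

        /* Extract every sample that lies wholly inside the buffered window. */
        while (next >= base && next + SAMPLE_LEN <= end) {
            char sample[SAMPLE_LEN + 1];
            memcpy(sample, buf + (size_t)(next - base), SAMPLE_LEN);
            sample[SAMPLE_LEN] = '\0';
            /* ... do the in-memory work with `sample` here ... */
            next += STEP;
        }

        /* Slide the window: keep the last few bytes in case a sample straddles. */
        carry = SAMPLE_LEN - 1;
        if ((size_t)(end - base) < carry)
            carry = (size_t)(end - base);
        memmove(buf, buf + (size_t)(end - base) - carry, carry);
        base = end - (long long)carry;
    }

    fclose(f);
    return 0;
}

The whole file is read exactly once, in block-sized sequential reads, and no seek is ever issued.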
IS THIS THE MOST EFFICIENT?
I don't think it is the most efficient, but it may be good enough for the performance you need, who knows. I do think that even if the read head is where you want it to be, performing algorithmic work in the middle of each block read, rather than reading the whole file all at once, will lose you time waiting for the next rotation of the drive platters. Whereas if you were to read it all at once, the drive should be able to perform one sequential read of all parts of the file. It's not quite that simple though: if your file really spans more than one block, then on a rotational drive you may suffer IF your drive has not been defragmented, as it may have to perform random seeks just to get to the next block.
Sorry for the long-winded answer, but as per usual, there is no simple answer in your case.
I do think that overall performance would PROBABLY be better if you simply read the whole file at once. There is no way to assure this, as each system is going to have inherently different parameters of their drive setup, etc...
My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
1. Read the file once to get the array size, allocate, read again.
2. Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
3. Always, always, work with files which contain metadata to tell an interested program how much data there is; for example, a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project saves so much wasted time and effort down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc), use HDF5 or something similar.
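For illustration, here is the header idea sketched in C (the file format and function names are made up); in Fortran it amounts to the same thing: one extra write of the element count in front of the data, and one read of that count before the allocate.

#include <stdio.h>
#include <stdlib.h>

/* Write a block: a count line, then that many values (hypothetical format). */
static void write_block(FILE *f, const double *data, size_t n)
{
    fprintf(f, "%zu\n", n);                 /* the metadata header line */
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%.17g\n", data[i]);
}

/* Read a block back: the count tells the reader exactly how much to allocate. */
static double *read_block(FILE *f, size_t *n_out)
{
    size_t n;
    if (fscanf(f, "%zu", &n) != 1)
        return NULL;
    double *data = malloc(n * sizeof *data);
    if (!data)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        if (fscanf(f, "%lf", &data[i]) != 1) {
            free(data);
            return NULL;
        }
    }
    *n_out = n;
    return data;
}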
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it as a text file, press Ctrl+End, subtract the header lines from the total row count, and there it is. Although you may waste time opening it if it's very large.
I am new to Golang.
Should I always avoid appending slices?
I need to load a linebreak-separated data file in memory.
With performance in mind, should I count lines, then load all the data in a predefined length array, or can I just append lines to a slice?
You should stop thinking about performance and start measuring what the actual bottleneck of your application is.
Any advice on a question like "Should I do/avoid X because of performance?" is useless in 50% of cases and counterproductive in 25%.
There are a few really general pieces of advice, like "do not needlessly generate garbage", but your question cannot be answered in general, as it depends a lot on the size of your file:
Your file is ~3 terabytes? Most probably you will have to read it line by line anyway...
Your file has just a bunch (~50) of lines: counting lines first is probably more work than reallocating a []string slice 4 times (or 0 times if you make([]string, 0, 100) it initially). A string header is just 2 words.
Your file has an unknown but large (>10k) number of lines: maybe counting first is worth it. "Maybe" in the sense that you should measure on real data.
Your file is known to be big (>500k lines): definitely count first, but you might start hitting the problem from the first bullet point.
You see: general advice for performance is bad advice, so I won't give any.
This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out the corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory, and so I have been trying to use mmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
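(For concreteness, a search along those lines might look roughly like the sketch below. HASH_LEN is 40 as described; LINE_LEN and the exact record layout are assumptions.)

#include <string.h>
#include <stddef.h>

#define HASH_LEN 40
#define LINE_LEN 52   /* assumed: 40-char hash + identifier + newline; adjust to the real record size */

/* codes: the mmap'd file; nrecords: statbuf.st_size / LINE_LEN.
   Returns a pointer to the matching record, or NULL if it is absent. */
static const char *find_record(const char *codes, size_t nrecords,
                               const char target[HASH_LEN])
{
    size_t lo = 0, hi = nrecords;              /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const char *rec = codes + mid * (size_t)LINE_LEN;
        int cmp = memcmp(target, rec, HASH_LEN);
        if (cmp == 0)
            return rec;                        /* the identifier follows the hash on this line */
        else if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return NULL;
}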
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not tried the various improvements before because I was wondering whether I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitably run into problems. So I squashed the hash code down to the shortest leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory and ran the incoming hash codes through it in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash codes and another with the integer IDs. Since the rows stay in the same order, the two files will line up, so once a hash is found its row number gives you the ID. You can also try an approach that puts every nth hash into another file and then stores the index.
For example, put every 1000th hash key into a new file along with its index, and load that into memory. Then binary search that instead. It will tell you the range of 1000 entries in the big file that needs to be scanned further. Yes, that will do it fine! But you can probably narrow things much further than that: keeping every 20th record, say, still only divides the file size down by about 20, give or take, if I'm thinking straight.
In other words after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while (input codes remain) {
    input_code <- next input code
    while (input_code > max_code) {
        code_block <- mmap the next N entries of the file
        max_code <- code_block[N - 1]
    }
    binary search for input_code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
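A rough one-level sketch of that index, under the same fixed-record assumptions as before (HASH_LEN, LINE_LEN and K are made-up numbers, and pread is used for the final segment instead of mmap so the offsets don't have to be page-aligned):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

#define HASH_LEN 40
#define LINE_LEN 52          /* assumed fixed record size */
#define K        4096        /* keep every K-th record in memory (made-up value) */

struct index_entry {
    char  hash[HASH_LEN];    /* hash of record i*K */
    off_t offset;            /* byte offset of that record in the big file */
};

/* Sample every K-th record of the (sorted) big file to build the sparse in-memory table. */
static struct index_entry *build_index(int fd, size_t nrecords, size_t *nentries)
{
    *nentries = (nrecords + K - 1) / K;
    struct index_entry *tab = malloc(*nentries * sizeof *tab);
    if (!tab)
        return NULL;
    char rec[LINE_LEN];
    for (size_t i = 0; i < *nentries; i++) {
        off_t off = (off_t)(i * K) * LINE_LEN;
        if (pread(fd, rec, LINE_LEN, off) != LINE_LEN) { free(tab); return NULL; }
        memcpy(tab[i].hash, rec, HASH_LEN);
        tab[i].offset = off;
    }
    return tab;
}

/* Find the last table entry whose hash is <= target (the table is sorted). */
static size_t index_lookup(const struct index_entry *tab, size_t n,
                           const char target[HASH_LEN])
{
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (memcmp(tab[mid].hash, target, HASH_LEN) <= 0)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

Once index_lookup returns h, a single pread of K * LINE_LEN bytes starting at tab[h].offset pulls in the only segment that can contain the target, and the final binary search runs over that buffer.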
Lastly, if you have extra money available, you can always buy more than 23 GB of RAM and make this a memory-bound problem again (I just looked at Dell's website - you can pick up a new low-end workstation with 32 GB of RAM for just under 1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
/* fd and line_len are assumed to be set up elsewhere (the open()'d file and the fixed record length). */
void read_hash(int line, char *hashbuf) {
    lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);       /* jump to record `line` */
    read(fd, hashbuf, 40);                                    /* the 40-character hash */
}

int read_int(int line) {
    lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);  /* skip past the hash */
    int ret;
    read(fd, &ret, sizeof(int));
    return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
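A driver for those helpers might look something like this sketch, where nlines (the record count, i.e. the file size divided by line_len) is assumed to be known:

#include <string.h>

/* Binary search the sorted hash file via the read_hash/read_int helpers above.
   Returns the identifier, or -1 if the hash is not present. */
int lookup(const char *target, long nlines)
{
    char hashbuf[40];
    long lo = 0, hi = nlines - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        read_hash((int)mid, hashbuf);
        int cmp = memcmp(target, hashbuf, 40);
        if (cmp == 0)
            return read_int((int)mid);      /* found: fetch its identifier */
        else if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}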
We don't know the back story, so it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32 GB of RAM doesn't seem so expensive compared to two days of work from a person who makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here's a low-risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8 GB or 6 chunks of 4 GB (depending on the machines you have around; each chunk needs to fit comfortably in memory). On each machine run the same software over its chunk in memory, and put an RPC stub around each. Then write an RPC caller that asks each of your 3 or 6 workers for the integer associated with a given hash code.
As I loop through lines in file A, I am parsing the line and putting each string (char*) into a char**.
At the end of a line, I then run a procedure that consists of opening file B, using fgets, fseek and fgetc to grab characters from that file. I then close file B.
I repeat reopening and reclosing file B for each line.
What I would like to know is:
Is there a significant performance hit from using malloc and free, such that I should use something static like myArray[NUM_STRINGS][MAX_STRING_WIDTH] instead of a dynamic char** myArray?
Is there significant performance overhead from opening and closing file B (conceptually, many thousands of times)? If my file A is sorted, is there a way for me to use fseek to move "backwards" in file B, to reset where I was previously located in file B?
EDIT Turns out that a two-fold approach greatly reduced the runtime:
My file B is actually one of twenty-four files. Instead of opening up the same file B1 a thousand times, and then B2 a thousand times, etc. I open up file B1 once, close it, B2 once, close it, etc. This reduces many thousands of fopen and fclose operations to roughly 24.
I used rewind() to reset the file pointer.
This yielded a roughly 60-fold speed improvement, which is more than sufficient. Thanks for pointing me to rewind().
If your dynamic array grows over time, there is a copy cost on some reallocs. If you use the "always double" heuristic, the total copying cost is amortized to O(n) over n appends (constant per append), so it is not horrible. If you know the size ahead of time, a stack-allocated array will still be faster.
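A minimal C sketch of that always-double heuristic (the struct and function names are made up):

#include <stdlib.h>
#include <string.h>

struct strvec {
    char   **items;
    size_t   len;
    size_t   cap;
};

/* Append a copy of `s`, doubling the capacity whenever it runs out. */
static int strvec_push(struct strvec *v, const char *s)
{
    if (v->len == v->cap) {
        size_t newcap = v->cap ? v->cap * 2 : 16;        /* double (or start small) */
        char **p = realloc(v->items, newcap * sizeof *p);
        if (!p)
            return -1;
        v->items = p;
        v->cap = newcap;
    }
    v->items[v->len] = strdup(s);                        /* caller frees these later */
    if (!v->items[v->len])
        return -1;
    v->len++;
    return 0;
}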
For the second question read about rewind. It has got to be faster than opening and closing all the time, and lets you do less resource management.
What I would like to know is:
does your code work correctly?
is it running fast enough for your purpose?
If the answer to both of these is "yes", don't change anything.
Opening and closing has a variable overhead, depending on whether other programs are competing for that resource.
Measure the file size first and then use that to calculate the array size in advance, so you can do one big heap allocation.
You won't get a multi-dimensional array right off, but a bit of pointer arithmetic and you are there.
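For example, something like this sketch (the file name and row width are made up):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define MAX_STRING_WIDTH 128                 /* assumed fixed row width */

int main(void)
{
    struct stat st;
    if (stat("fileA.txt", &st) != 0) {       /* hypothetical input file */
        perror("stat");
        return 1;
    }

    /* With fixed-width rows, the row count falls straight out of the file size. */
    size_t nrows = (size_t)st.st_size / MAX_STRING_WIDTH;

    char *block = malloc(nrows * MAX_STRING_WIDTH);      /* one big heap allocation */
    if (!block) {
        perror("malloc");
        return 1;
    }

    /* "Row i" is simply block + i * MAX_STRING_WIDTH -- no char** needed. */
    for (size_t i = 0; i < nrows; i++) {
        char *row = block + i * MAX_STRING_WIDTH;
        /* ... read or parse row i into `row` here ... */
        (void)row;
    }

    free(block);
    return 0;
}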
Can you not cache positional information in the other file and then, rather than opening and closing it, use previous seek indexes as an offset? Depends on the exact logic really.
If your files are large, disk I/O will be far more expensive than memory management. Worrying about malloc/free performance before profiling indicates that it is a bottleneck is premature optimization.
It is possible that the overhead from frequent open/close is significant in your program, but again the actual I/O is likely to be more expensive, unless the files are small, in which case the loss of buffers between close and open can potentially cause extra disk I/O. And yes, you can use ftell() to get the current position in a file, and then fseek() with SEEK_SET to get back to it.
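In code that looks something like the fragment below, where fb is assumed to be the already-open FILE * for file B:

#include <stdio.h>

/* Save the current position in file B, do some other work, then come back to it. */
static void with_saved_position(FILE *fb, void (*work)(void))
{
    long saved = ftell(fb);              /* remember where we are            */
    if (saved == -1L)
        return;                          /* (error handling kept minimal)    */

    work();                              /* read ahead, parse, whatever      */

    fseek(fb, saved, SEEK_SET);          /* jump straight back; no fclose/fopen needed */
}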
There is always a performance hit with using dynamic memory. Using a static buffer will provide a speed boost.
There is also going to be a performance hit with reopening a file. You can use fseek(fp, pos, SEEK_SET) to set the file pointer to any position in the file, or fseek(fp, offset, SEEK_CUR) to do a relative move.
Significant performance hit is relative, and you will have to determine what that means for yourself.
I think it's better to allocate the actual space you need, and the overhead will probably not be significant. This avoids both wasting space and stack overflows.

Yes. Though the I/O is cached, you're making unnecessary syscalls (open and close). Use fseek, probably with SEEK_CUR or SEEK_SET.
In both cases, there is some performance hit, but the significance will depend on the size of the files and the context your program runs in.
If you actually know the max number of strings and max width, this will be a lot faster (but you may waste a lot of memory if you use less than the "max"). The happy medium is to do what a lot of dynamic array implementations in C++ do: whenever you have to realloc myArray, alloc twice as much space as you need, and only realloc again once you've run out of space. This needs only O(log n) reallocations in total, with the copying cost amortized across the appends.
This may be a big performance hit. I strongly recommend using fseek, though the details will depend on your algorithm.
I often find the performance overhead of malloc to be outweighed by the direct control over memory that it and the other low-level C memory routines give you. But unless those allocations are going to stay around long enough for the allocation cost to be amortized over the work you actually do on that memory, it may be more beneficial to stick with the static array. In the end, it's up to you.
For example, let's say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste of time to copy the entire file into an array and then run binary search... I've effectively made it a linear-time algorithm, because I'll have to spend O(n) time copying the darn file before I can run my search.
Is there a faster way to do this? Is there maybe something like lseek which works with lines instead of bytes?
If there isn't, am I better off just doing a linear search instead (assuming I'm only running the search once for the entire duration of my program) ?
You cannot seek by line. It's pretty obvious once you think about it.
But you can do a sort-of binary search on a text file.
What you do is:
Stat the file to get the length or seek to the end and get the position.
Memory map the file.
(This is best, I think, but you can use lseek and read if you must.)
Seek to the middle of the file, minus your average line length. Just guess.
Scan forward for a newline, unless you are at position 0.
Read your line and compare.
Repeat for 1/4th or 3/4ths, 1/8th, 1/16th, etc.
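Roughly, in C (mmap flavour; error handling is minimal, the "minus your average line length" refinement is skipped, and the comparison assumes you want an exact whole-line match):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Binary search a sorted, newline-separated text file for `key`.
   Returns the byte offset of the matching line, or -1 if it is not found. */
static off_t search_sorted_text(const char *path, const char *key)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }           /* get the length      */
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                                   /* mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    off_t lo = 0, hi = st.st_size, found = -1;                   /* byte range in play  */
    while (lo < hi) {
        off_t mid = lo + (hi - lo) / 2;                          /* jump to the middle  */
        off_t start = mid;
        while (start > lo && p[start - 1] != '\n')               /* back up to line start */
            start--;
        off_t end = start;
        while (end < st.st_size && p[end] != '\n')               /* find the line end   */
            end++;

        size_t linelen = (size_t)(end - start);
        int cmp = strncmp(key, p + start, linelen);              /* read and compare    */
        if (cmp == 0 && strlen(key) == linelen) {
            found = start;                                       /* exact match         */
            break;
        } else if (cmp < 0 || (cmp == 0 && strlen(key) < linelen)) {
            hi = start;                                          /* keep the left half  */
        } else {
            lo = end + 1;                                        /* keep the right half */
        }
    }
    munmap(p, (size_t)st.st_size);
    return found;
}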
A disk-based binary search needs to be, at least initially, "block-aware", i.e. aware of the fact that whether you read a single byte or a whole bunch, the I/O cost is the same. The other thing it needs to be aware of is the relatively higher cost of a seek operation compared to a sequential read operation.
Several ways it can use this awareness of the characteristics of disk I/O:
Towards the end of the search, favor linear searching (scanning) rather than seeking.
In the beginning, check both the first and last element in the block; this may help extrapolate a better guess for the next split.
Cache a tree (or even a short flat list) of some of the items found in various places in the file (a bit like the intermediate nodes in a formal B-tree structure).
Declare and use an appropriate buffer size
If the file is small, like under a few hundred kilobytes, it's almost certainly faster to read (or virtually memory map) the entire file into memory. This is because the overhead of doing several i/o operations to seek and transfer is much worse than just reading the whole file, which is what most programs do and most operating systems assume is done.
Unless all the lines are the same length, or have a very predictable length, there's no easy way to seek to line #n. But, to perform a binary search, I'd work with byte offsets in the binary search and read, say 100 bytes (if the words are all less than 100 characters long) before and after the offset—a total of 200 bytes. Then scan for the newline before and after the middle of it to extract the word.
Yes, you can lseek, but it would help if the size of each word/number per line were fixed. If that is not the case, which is more likely, then you have to lseek by a byte offset derived from the file size and scan to the nearest word beginning, to still achieve something close to the typical O(log n) time complexity of binary search.
There wouldn't be a line-based "lseek" function, because the file commands do not have the concept of a "line". That concept exists in a different layer of abstraction than the raw file commands.
As to whether it's faster or not, the answer will depend upon a number of factors, including the size of the file, the disk drive speed, and the amount of RAM available. If it isn't a large file, my guess is it would be faster to load the entire file into memory.
If it is a large file, I would use the binary search algorithm to narrow it down to a smaller range (say, a couple of megabytes), then load up that entire block.
As mentioned above, since the file is a text file, predicting the byte at which a given line begins within the file can't be done reliably. The ersatz binary search idea is a pretty good one. But it really won't save you a ton unless the file is huge, given how fast sequential I/O is nowadays and how slow random I/O is.
As you mention, if you are going to read it in, you might as well linearly search it as you go. So do so, use a modified Boyer-Moore search as you read it in and you'll do pretty well.
There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.
If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.
The second thing I would try after linear search would be to mmap the file and scan through it for newlines. This does take linear time, but strchr can be very fast. It helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
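A sketch of that line-demarcation pass (using memchr with an explicit length rather than strchr, so it can never run past the end of the mapping; names are made up):

#include <stdlib.h>
#include <string.h>

/* Build an array of line-start offsets for an mmap'd buffer `p` of `len` bytes.
   Returns the offsets (caller frees) and stores the line count in *nlines. */
static size_t *index_lines(const char *p, size_t len, size_t *nlines)
{
    size_t cap = 1024, n = 0;
    size_t *off = malloc(cap * sizeof *off);
    if (!off)
        return NULL;

    size_t pos = 0;
    while (pos < len) {
        if (n == cap) {                              /* grow the offset array */
            cap *= 2;
            size_t *tmp = realloc(off, cap * sizeof *off);
            if (!tmp) { free(off); return NULL; }
            off = tmp;
        }
        off[n++] = pos;                              /* a line starts here */
        const char *nl = memchr(p + pos, '\n', len - pos);
        if (!nl)
            break;                                   /* final line had no newline */
        pos = (size_t)(nl - p) + 1;
    }
    *nlines = n;
    return off;
}

With off[] in hand, the binary search compares the key against p + off[mid] just as it would against an in-memory array of strings.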
Another option you should consider is Boyer-Moore string search. This is a sub-linear time search and depending on the size of the search pattern, it may be faster than the logarithmic binary search. Boyer-Moore is especially good with long search strings.
Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.
I feel comfortable making only one prediction: it is almost certainly worth avoiding reading in one line at a time with something like readline() or fgets(), because this strategy invariably involves calling malloc() to hold the contents of the line. The cost of calling malloc() on every line is likely to swamp any cost of search or comparison.