I am new to Golang.
Should I always avoid appending slices?
I need to load a linebreak-separated data file in memory.
With performance in mind, should I count lines, then load all the data in a predefined length array, or can I just append lines to a slice?
You should stop thinking about performance and start measuring what the actual bottleneck of your application is.
Any advice to a question like "Should I do/avoid X because of performance?" is useless in 50% of the cases and counterproductive in 25%.
There are a few really general pieces of advice like "do not needlessly generate garbage", but your question cannot be answered, as this depends a lot on the size of your file:
Your file is ~3 terabytes? Most probably you will have to read it line by line anyway...
Your file has just a bunch (~50) of lines: Probably counting lines first is more work than reallocating a []string slice 4 times (or 0 times if you make([]string, 0, 100) it initially). A string is just 2 words (pointer and length).
Your file has an unknown but large number of lines (>10k): Maybe counting first is worth it. "Maybe" in the sense that you should measure on real data.
Your file is known to be big (>500k lines): Definitely count first, but you might start hitting the problem from the first bullet point.
You see: general advice for performance is bad advice, so I won't give any.
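For concreteness, here is a minimal Go sketch of the two approaches the bullets above compare (the file name and the initial capacity are made up); which one wins on your data is exactly the kind of thing you have to measure rather than guess:

package main

import (
    "bufio"
    "fmt"
    "os"
)

// loadAppend grows the slice as it goes; append reallocates the backing
// array occasionally, roughly doubling it each time.
func loadAppend(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var lines []string
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        lines = append(lines, sc.Text())
    }
    return lines, sc.Err()
}

// loadPresized counts lines first, then reads the file again into a slice
// allocated once with the exact capacity (two passes over the file).
func loadPresized(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    n := 0
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        n++
    }
    if err := sc.Err(); err != nil {
        return nil, err
    }
    if _, err := f.Seek(0, 0); err != nil {
        return nil, err
    }

    lines := make([]string, 0, n)
    sc = bufio.NewScanner(f)
    for sc.Scan() {
        lines = append(lines, sc.Text())
    }
    return lines, sc.Err()
}

func main() {
    lines, err := loadAppend("data.txt") // hypothetical input file
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(len(lines), "lines")
    _ = loadPresized // benchmark both variants on your real data
}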
Related
I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once and use realloc after each line read? Thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
I also presume that you don't want to change the lines and then write them back, as in that case you might be better off using mmap.
Reading a file twice is always bad; even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
This might also be improved upon on modern CPU architectures: instead of scanning the buffer twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the buffer once. This will cause a realloc at the end to trim it to the right size, and potentially a couple more along the way to make the array larger if it wasn't large enough at the start.
Why is this faster? Because you have a lot of ifs per line that can take a lot of time, so it's better to only have to do this once; the cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it; your system settings, buffer sizes, etc. will influence things too.
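The steps above are described in C terms (start pointers into the buffer, \0 terminators). Since this thread started from Go, here is a rough Go analogue of the same single-pass idea, with the file name invented:

package main

import (
    "bytes"
    "fmt"
    "os"
)

func main() {
    // Read the entire file in one go (the file name is made up).
    data, err := os.ReadFile("data.txt")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // First scan: count line feeds so the slice can be allocated once.
    n := bytes.Count(data, []byte{'\n'}) + 1
    lines := make([][]byte, 0, n)

    // Second scan: record where each line starts; the sub-slices point
    // into the original buffer, much like the start pointers in C.
    for len(data) > 0 {
        i := bytes.IndexByte(data, '\n')
        if i < 0 {
            lines = append(lines, data)
            break
        }
        lines = append(lines, data[:i])
        data = data[i+1:]
    }
    fmt.Println("lines:", len(lines))
}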
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
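In Go, where this thread started, benchmarking is built into the toolchain. A minimal sketch, assuming a sample input file under testdata/ and using a trivial readLines loader as the stand-in for whichever implementation you want to time:

package lines

import (
    "bufio"
    "os"
    "testing"
)

// readLines is a stand-in for the implementation being measured.
func readLines(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var lines []string
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        lines = append(lines, sc.Text())
    }
    return lines, sc.Err()
}

func BenchmarkReadLines(b *testing.B) {
    for i := 0; i < b.N; i++ {
        if _, err := readLines("testdata/sample.txt"); err != nil {
            b.Fatal(err)
        }
    }
}

Run it with go test -bench=. -benchmem and compare the numbers for each candidate implementation on realistic data.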
My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project saves so much wasted time and effort down the line. You don't have to jump on HDF5 or a similar heavyweight file design method; just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc.) go for HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
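The question is about Fortran, but option 3 is language-neutral. Here is a rough sketch of the idea in Go (the text format, file name, and function names are invented for illustration): the writer puts the record count on the first line, so the reader can allocate exactly once and never reallocate.

package main

import (
    "bufio"
    "fmt"
    "os"
)

// writeWithHeader writes the number of values on the first line,
// then one value per line (an invented, minimal format).
func writeWithHeader(path string, values []float64) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    w := bufio.NewWriter(f)
    fmt.Fprintln(w, len(values))
    for _, v := range values {
        fmt.Fprintln(w, v)
    }
    return w.Flush()
}

// readWithHeader reads the count first, allocates once, then reads
// exactly that many values: no second pass, no reallocation.
func readWithHeader(path string) ([]float64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    r := bufio.NewReader(f)
    var n int
    if _, err := fmt.Fscan(r, &n); err != nil {
        return nil, err
    }
    values := make([]float64, n)
    for i := range values {
        if _, err := fmt.Fscan(r, &values[i]); err != nil {
            return nil, err
        }
    }
    return values, nil
}

func main() {
    if err := writeWithHeader("dump.txt", []float64{1.5, 2.5, 3.5}); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    vals, err := readWithHeader("dump.txt")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(vals)
}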
I would add to the previous answer:
If your data has a regular structure and it's possible to open it as a txt file, press Ctrl+End, subtract the header lines from the total row count, and there it is. Although you may waste time opening it if it's very large.
Let's assume that on a hard drive I have some very large data file of a sequence of characters:
ABRDZ....
My question is as follows: if the head is positioned at the beginning of the file, and I need 5 characters every 1000 positions, would it be better to do a seek (since I know where to look) or simply have a large buffer that just reads sequentially, and then do the job in memory?
Naively I'd have answered that reading 'A' and then seeking to read 'V' is faster than reading all the file until, say, position 200 (the position of 'V'). Ok, this is just an example, since the smallest I/O is 512 bytes.
Edit: my previous self-naive answer is partly justified by the following case: given a 100 GB file I need the first and the last characters; here I obviously would do a seek... right?
Maybe there is a trade-off between how "long" the seek is vs. how much data to retrieve?
Can someone clarify this to me?
[UPDATE]
Generally, from your original numbers of 5 out of every 1000 (I'll assume that the 5 bytes are part of the 1000, thus making your step count 1000), if your step count is less than 2x your block size, then my original answer is a pretty good explanation. It does get a bit more tricky once you get past 2x your HD block size, because at that point you would easily be wasting read time, when you could be speeding up by seeking past unused (or for that matter unnecessary) HD blocks.
[ORIGINAL]
Well, this is an extremely interesting question, with what I believe to be an equally interesting answer (also somewhat complex). I think that this actually comes down to a couple of other questions, like how big the block size is on your drive (or the drive your software is going to run on). If your block size is 4 KB, then the (true) minimum your hard drive will get for you at a time is 4096 bytes. In your case, if you truly need 5 chars every 1000 and you did this with ALL disk IO, then you would essentially be re-reading the same block 4 times, doing 3 seeks in between (REALLY NOT EFFICIENT).
My personal belief is that you could (if you wanted to be drive efficient) in your code, try to understand what the block size of the drive that you are using is, then use that size number to know how many bytes at a time you should bring into RAM. This way you wouldn't have to have a HUGE RAM buffer, but at the same time not really have to SEEK, nor would you be wasting (or performing) any extra reads.
IS THIS THE MOST EFFICIENT?
I don't think it is the most efficient, but it may be good enough for the performance you need, who knows. I do think that even if the read head is where you want it to be, if you perform algorithmic work in the middle of each block read, rather than reading the whole file all at once, you will lose time waiting for the next rotation of the drive platters. Whereas, if you were to read it all at once, the drive should be able to perform a sequential read of all parts of the file. It's not as simple as that though: if your file is truly more than 1 block, then on a rotational drive you may suffer IF your drive has not been defragmented, as it may have to perform random seeks just to get to the next block.
Sorry for the long-winded answer, but as per usual, there is no simple answer in your case.
I do think that overall performance would PROBABLY be better if you simply read the whole file at once. There is no way to be sure of this, as each system is going to have inherently different parameters of its drive setup, etc...
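To make the two strategies concrete, here is a Go sketch of both (the 5-bytes-every-1000 figures come from the question; the file name and buffer size are assumptions): one version does a positioned read per chunk, the other streams through a large buffered reader and discards the gaps. Time them on your own drive before trusting either.

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

const (
    step  = 1000 // distance between the chunks we care about (assumed)
    chunk = 5    // bytes needed at each position
)

// viaSeek issues one positioned read per chunk (ReadAt seeks under the hood).
func viaSeek(f *os.File, size int64) ([][]byte, error) {
    var out [][]byte
    buf := make([]byte, chunk)
    for off := int64(0); off+chunk <= size; off += step {
        if _, err := f.ReadAt(buf, off); err != nil {
            return nil, err
        }
        out = append(out, append([]byte(nil), buf...))
    }
    return out, nil
}

// viaSequential reads the file front to back through a large buffer,
// keeping 5 bytes and discarding the 995 bytes in between.
func viaSequential(f *os.File, size int64) ([][]byte, error) {
    if _, err := f.Seek(0, io.SeekStart); err != nil {
        return nil, err
    }
    r := bufio.NewReaderSize(f, 1<<20)
    var out [][]byte
    buf := make([]byte, chunk)
    for off := int64(0); off+chunk <= size; off += step {
        if _, err := io.ReadFull(r, buf); err != nil {
            return nil, err
        }
        out = append(out, append([]byte(nil), buf...))
        if off+step+chunk <= size { // more chunks to come: skip the gap
            if _, err := r.Discard(step - chunk); err != nil {
                return nil, err
            }
        }
    }
    return out, nil
}

func main() {
    f, err := os.Open("big.dat") // hypothetical file
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    st, err := f.Stat()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    a, err1 := viaSeek(f, st.Size())
    b, err2 := viaSequential(f, st.Size())
    if err1 != nil || err2 != nil {
        fmt.Fprintln(os.Stderr, err1, err2)
        os.Exit(1)
    }
    fmt.Println(len(a), len(b)) // same chunks either way; time both on your disk
}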
I need to write a program where, during run time, a set of integers of arbitrary size will be taken as input. They will be separated by whitespace. At the end, a new line is given, showing the end of input. How do I save them into an array of integers so that I can display them later? I think it is a little difficult because the number of values that will be entered is not known during compilation.
Sounds like homework.
Correct me if I am wrong and I will give you more than hints.
You can either declare an array of a really large size that could not possibly be filled by the user input, then use scanf or something like that to grab the integers until you hit '\n', or you can grab one integer at a time, allocating memory as you go, using a combination of malloc and memcpy calls. The first option should never be done in a real-world program, and I am certainly not advocating such practices even though your textbook probably tells you to do it this way.
There is an example just like this in K&R.
This is a typical problem you will have in C. The solution is usually one of two options.
Use a really large array that is large enough to hold the input. Sometimes this is a poor option when the data could be really large. An example of when it would be a bad idea is when you are saving a video frame or a large text file to the array. This also opens you up to a buffer overrun attack in older versions of Windows. However, this is sometimes a good quick hack solution for smaller (homework) programs where you can count on the user (i.e. your professor who is not trying to break your program) to not input 1000's of characters. Usually this is considered bad practice, please consider my 2nd option for the security reason I mentioned before.
Use dynamic arrays (i.e. malloc). This is probably what your professor wants you to do as this sounds like a typical problem to use when a student is first learning pointers and arrays. This is a great approach, just remember to call free on your memory when you are finished. The tricky part here is that you still have to know the size of the array you want ahead of time (not at compile time though of course).
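The two options above are described in C terms (scanf, malloc); purely to illustrate the same grow-as-you-go idea, here is a sketch in Go (the language used at the top of this page), where append handles the reallocation for you:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // Read one line of whitespace-separated integers from stdin.
    reader := bufio.NewReader(os.Stdin)
    line, err := reader.ReadString('\n')
    if err != nil && line == "" {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Grow the slice as values arrive; no need to know the count up front.
    var nums []int
    for _, field := range strings.Fields(line) {
        n, err := strconv.Atoi(field)
        if err != nil {
            fmt.Fprintln(os.Stderr, "bad integer:", field)
            os.Exit(1)
        }
        nums = append(nums, n)
    }

    fmt.Println(nums)
}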
At the moment I am trying to write an unreal amount of data out to files.
Basically I generate a new struct of data and write it out to file until the file becomes 1 GB big, and this occurs for 6 files of 1 GB each. The structs are small: 8 bytes long with two variables, id and amount.
When I generate my data, the structs are created and written to file in order of amount.
But I need the data to be sorted by id.
Remember there is 6 GB of data; how could I sort these structs by their id value and then write them to file?
Or should I write to file first, and then sort each individual file? And how would I bring all this data together into one file?
I am kind of stuck, because I would like to hold it in an array, but obviously this amount of data is too big.
I need a good way to sort a lot of data (6 GB).
I haven't found a question with a really basic answer on this, so here goes.
If you're on a 64 bit machine, by the way, you should seriously consider writing all the data into a file, memory mapping the file, and just use whatever array sort you like. Quicksort is pretty cache-friendly: it won't thrash badly. The assignment is probably designed to stop you doing this, but might be a bit out of date ;-)
Failing that, you need some kind of external sort. There are other ways to do it, but I think merge sort is probably the simplest. Before you start merging:
work out how much data you can fit into memory (or, again, mmap it). If you're on a PC then 1GB seems like a fair assumption, but it may be a few times more or less.
load this much data (so one of your 6 files, in the example)
quicksort it (since you tagged "quicksort", I guess you know how to do that), or any other sort of your choice.
write it back to disk (if you didn't mmap).
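Those first steps might look roughly like this in Go (used here only for consistency with the rest of the page); the uint32/little-endian record layout and the chunk file names are assumptions rather than anything the question specifies:

package main

import (
    "encoding/binary"
    "fmt"
    "os"
    "sort"
)

// A record as described in the question: 8 bytes, id and amount.
// The uint32/little-endian layout here is an assumption.
type record struct {
    ID     uint32
    Amount uint32
}

// sortChunk loads one chunk file that fits in memory, sorts it by ID,
// and writes it back in place.
func sortChunk(path string) error {
    data, err := os.ReadFile(path)
    if err != nil {
        return err
    }

    recs := make([]record, len(data)/8)
    for i := range recs {
        recs[i].ID = binary.LittleEndian.Uint32(data[i*8:])
        recs[i].Amount = binary.LittleEndian.Uint32(data[i*8+4:])
    }

    sort.Slice(recs, func(i, j int) bool { return recs[i].ID < recs[j].ID })

    for i, r := range recs {
        binary.LittleEndian.PutUint32(data[i*8:], r.ID)
        binary.LittleEndian.PutUint32(data[i*8+4:], r.Amount)
    }
    return os.WriteFile(path, data, 0644)
}

func main() {
    for i := 0; i < 6; i++ {
        if err := sortChunk(fmt.Sprintf("chunk%d.dat", i)); err != nil { // hypothetical names
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }
}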
This leaves you with 6 1GB files, each of which individually is sorted. At this point you can either work up gradually, or go for the whole lot in one go. With 6 chunks, going for the whole lot is fine, in what is called a "6-way merge":
open a file for writing
open your 6 files for reading, and read a few million records out of each
examine the 6 records at the start of each of the 6 buffers. One of these 6 must be the smallest of all. Write it to the output, and move forward one step through that buffer.
as you reach the end of each buffer, refill it from the correct file.
There's some optimization you can do regarding how you work out which of your 6 possibilities is the smallest, but the big performance difference will be to make sure you use large enough read and write buffers.
Obviously there's nothing special about the merge being 6-way. If you'd rather stick to a 2-way merge, which is easier to code, then of course you can. It will take 5 2-way merges to merge 6 files.
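Here is a sketch of such an N-way merge in Go (again, the record layout and file names are assumed); with only 6 inputs, a linear scan for the smallest pending record is fine, so no heap is needed:

package main

import (
    "bufio"
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

// mergeChunks does an N-way merge of chunk files that are each already
// sorted by the 4-byte little-endian id at the start of every 8-byte record.
func mergeChunks(out string, chunks []string) error {
    type source struct {
        r    *bufio.Reader
        next [8]byte
        ok   bool
    }

    srcs := make([]*source, 0, len(chunks))
    for _, path := range chunks {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        s := &source{r: bufio.NewReaderSize(f, 1<<20)}
        if _, err := io.ReadFull(s.r, s.next[:]); err == nil {
            s.ok = true
        }
        srcs = append(srcs, s)
    }

    of, err := os.Create(out)
    if err != nil {
        return err
    }
    defer of.Close()
    w := bufio.NewWriterSize(of, 1<<20)

    for {
        // Pick the source whose pending record has the smallest id.
        best := -1
        for i, s := range srcs {
            if !s.ok {
                continue
            }
            if best < 0 || binary.LittleEndian.Uint32(s.next[:4]) <
                binary.LittleEndian.Uint32(srcs[best].next[:4]) {
                best = i
            }
        }
        if best < 0 {
            break // all sources exhausted
        }
        if _, err := w.Write(srcs[best].next[:]); err != nil {
            return err
        }
        // Refill that source; mark it exhausted on EOF.
        if _, err := io.ReadFull(srcs[best].r, srcs[best].next[:]); err != nil {
            srcs[best].ok = false
        }
    }
    return w.Flush()
}

func main() {
    chunks := []string{"chunk0.dat", "chunk1.dat", "chunk2.dat",
        "chunk3.dat", "chunk4.dat", "chunk5.dat"} // hypothetical names
    if err := mergeChunks("sorted.dat", chunks); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}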
I would recommend this tool; it is a lightweight database that can run in memory and has a very small footprint. It will hold your information and you can query it to retrieve your information.
http://www.sqlite.org/features.html
I suggest you don't.
If you are going to hold such an amount of data, why not use a dedicated database format that can have lots of different indexes and a powerful query engine?
But if you still want to use your old-fashioned fixed-endian struct, then I would suggest breaking your data into smaller files, sorting each one, and merging them. A good merge algorithm runs in n·log(q). Also be sure to pick the right algorithm for your files.
The easiest way (in development time) to do this is to write out the data to separate files according to their ID. You don't have to have a 1-to-1 match between the number of files and the number of IDs (in case there are a lot of IDs), but if you choose a prefix of the ID (so if the key for one particular record is 987 it might go in the 9 file, while the record with key 456 would go in the 4 file) you won't have to worry about locating all of the keys across all of the files, because sorting each file by itself and then reading the files in order (by their names) would give you sorted results.
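A rough sketch of that bucketing idea in Go (the 256-way split on the top byte of the id and the 8-byte record layout are my assumptions): each record is appended to a bucket file chosen by its id prefix, and the bucket file names sort in id order. Each bucket still has to be sorted on its own afterwards, as described above.

package main

import (
    "bufio"
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

// partition streams the unsorted input once and appends each 8-byte
// record to one of 256 bucket files chosen from the top byte of its id,
// so the bucket files are already in id-prefix order by name.
func partition(input string) error {
    in, err := os.Open(input)
    if err != nil {
        return err
    }
    defer in.Close()
    r := bufio.NewReaderSize(in, 1<<20)

    buckets := make([]*bufio.Writer, 256)
    files := make([]*os.File, 256)
    defer func() {
        for i, w := range buckets {
            if w != nil {
                w.Flush() // sketch: a real version should check this error
                files[i].Close()
            }
        }
    }()

    var rec [8]byte
    for {
        if _, err := io.ReadFull(r, rec[:]); err != nil {
            if err == io.EOF {
                return nil
            }
            return err
        }
        id := binary.LittleEndian.Uint32(rec[:4])
        b := id >> 24 // top byte of the id picks the bucket
        if buckets[b] == nil {
            f, err := os.Create(fmt.Sprintf("bucket_%02x.dat", b))
            if err != nil {
                return err
            }
            files[b] = f
            buckets[b] = bufio.NewWriter(f)
        }
        if _, err := buckets[b].Write(rec[:]); err != nil {
            return err
        }
    }
}

func main() {
    if err := partition("data.dat"); err != nil { // hypothetical input file
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}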
If that is not possible or easy, then you need to do an external sort of some type. Since the data is spread across several files this is a bit of a pain. The easiest thing (by development time) is to first sort each individual file independently and then merge them together into a new set of files sorted by ID. Look up merge sort if you don't know what I'm talking about. At this step you are pretty much starting in the middle of merge sort.
As far as sorting the contents of a file which is too large to fit into RAM you can either use merge sort directly on the file or use replacement selection sort to sort the file in place. This involves making several passes over the file while using some RAM (the more the better) to hold a priority queue (a binary heap) and a set of records that are not possibly of any use in this run (their keys suggest that they should be earlier in the file than the current run position, so you're just holding on to them until the next run).
Searching for replacement selection sort or tournament sort will yield better explanations.
First, sort each file individually. Either load the whole thing into memory, or (better) mmap it, and use the qsort function.
Then, write your own merge sort that takes N FILE * inputs (i.e. N=6 in your case) and outputs to N new files, switching to the next one whenever one fills up.
Check out external sort. Find any of the external mergesort libraries out there and modify them to suit your need.
Well, since the actual assignment is to keep encoded data and later just compare it with decoded data, I would also say: use a database and just create a hash index on the ID column.
But regarding sorting such a huge number of records, another very important thing is to do it in parallel. There are many ways to do it. Steve Jessop mentioned a sort-merge approach; it is really easy to sort the first 6 chunks in parallel, the only question is how many CPU cores and how much memory you have on your machine. (It is rare to find a computer with only 1 core today, and also not so rare to have 4 GB of memory.)
Maybe you could use mmap and treat the file as a huge array which you could sort with qsort. I'm not sure what the implications would be. Would it grow too much in memory?