I need to insert data in a binary file in a way that every insertion is a block of 100 positions, for example. Since I know where it starts, I could easily use fseek/seekg to access elements 0 * 100, 1 * 100, 2 * 100,... n * 100. So how could I guarantee that every insertion ends in a position k * 100?
Or, alternatively, is it possible that I can add registers to an file in a way that I can iterate over them as if they were a array?
File systems are not generally written to allow arbitrary insertions. At a minimum you will need to read the file contents after the point at which you wish to write, write your insert then write back the saved data. More typically you create a new temporary file, write the preceding contents, new contents then following contents, and only after the new file is completely written delete the old file and rename the new to have the original file's name. This is done so that the file is certain to always have valid contents even if something happens to your program or even the computer itself in the middle of the operation.
Some file systems do have the ability to write very specific inserts, but only at sizes much larger than 100 bytes. When this ability is present it will only work in certain powers of 2 sizes and offsets.
The loop variable i is not the value you want to write. You want to write the value i instead.
for(int i=0;i<10;i++)
{
fwrite(&i,sizeof(i),1,fp);
}
should be
for(int i=0;i<10;i++)
{
fwrite(&i,sizeof(int),1,fp);
}
In your code, you are writing the value of the loop variable, which is always 0 (only one byte written). You probably want to write the value i instead.
Related
I have a large file (~10GB) with variable length lines, and I would like to programmatically go to different line numbers. Is there an efficient way to do so?
Yes: build an index. For example, just once you can create a text file on the side which contains the byte offset of various line numbers, like this:
line,offset
0,0
10000,48272
20000,93726
Etc. Then when you want to go to line 13043, just jump to offset 48272 and skip another 3043 newlines. Simple and efficient.
Another approach would be to make your line lengths constant. This would work well if they already have similar lengths so you don't waste too much space. You can pad them out with \0 characters or spaces or whatever, then index the file like a big matrix (line N is at N*LEN bytes).
Finally, you could simply write the line numbers at the beginning of the lines themselves. Then just binary-search within the file, skip to a newline, and inspect the next line number to know whether to look backward or forward (and even guess by how much).
There is no efficient way to do so. You need to scan the entire file once to memorize when are the end of line markers.
Pragmatically, you need a large loop on e.g. getline(3)
You could memoize e.g. the offset of every 100 line, perhaps in a big array or some indexed file using GDBM or some Sqlite database.
My feeling is that you should not have such a huge text file in the first place at all (having a huge text file accessed randomly is the symptom of something wrong). It is not an efficient way to store such data, if you need to access it randomly. You could for example predigest it to fill some database, etc... Probably you should not put such a large piece of data in a text file, but directly in a database or whatever.
Not directly with fseek since it's only capable of moving the position by a bytes amount.
If the efficiency requirement comes from the fact that you must do it many times back and forth a simple solution could be to scan the whole file once and compute all the lines length, store them in a map or array and then use the values to move exactly where you want.
I am working with Mac OSX, programming in C and using bash in terminal.
I am currently trying to make a lookup table for the gamma function. Calling gsl_sf_gamma I have been told is pretty expensive and a lookup table would be far faster. I did not wish to lose too much accuracy so I wanted to have a fairly large lookup table. Initializing a huge array would not be ideal since it then defeats the purpose.
My thoughts where to make a large text file with the values pre evaluated for the gamma function in the range of interest. A major problem with this is that I don't know how to call a specific line within a text file using C.
Thanks for any insight and help you guys can offer.
Warning: I know very little about strings and txt files, so I might just not know a simply function that does this already.
Gamma is basically factorial except in continuous form. You want to perform a lookup rather than a computation for the gamma function. You want to use a text file to represent these results. Each line of the file represents the input value multiplied by 1000. I guess for a high enough input value, the file scan could outperform doing the compute.
However, I think you will at minimum want to compute an index into your file. The file can still be arranged as a text file, but you have another step that scans the file, and notes the byte offset for each result line. These offsets get recorded into a binary file, which will serve as your index.
When you run your program, in the beginning, you load the index file into an array, which the index of the array is the floor of the gamma input multiplied by 1000, and the array value at that index is the offset that is recorded in the index file. When you want to compute gamma for a particular number, you multiply the input by 1000, and truncate the result to obtain your array index. You consult this array for the offset, and the next array value for to compute the length of the input. Then, your gamma text file is opened as a binary file. You seek to the offset, and read the length number of bytes to get your digits. You will need to read the next entry too to perform your interpolation.
Yes, calculating gamma is slow (I think GSL uses the Lancosz formula, which sums a series). If the number of values for which you need to calculate it is limited (say, you're only doing integers), then certainly a lookup table might help. But if the table is too big for memory, it won't help--it will be even slower than the calculation.
If the table will fit into memory, there's nothing wrong with storing it in a file until you need it and then loading the whole thing into memory at once.
This question has been answered in multiple times.
People end saying that the only way is to copy the bytes of the old file to a new file, then insert the new bytes and finish copying the remaining bytes from the old file. Then delete the old file and rename the new.
But my question is more related on what technique are applying the best text/binary editors. If I open a very large binary file (480MB) and I add 4 bytes at the beginning with HxD and notepad++ it seems that the programs are copying whole the file because I can see a 100% activity on the HDD and the data speed being written (3-4 seconds writting at ~100MB/s). May I suppose that the only way is to copy the data to a temp file?
The only way to add/remove bytes to/from a file somewhere else than at the end is to rewrite anything after those bytes. You can do that in-place but using a temp file is usually easier and safer since the original file is not touched until the new one has been written.
To insert N bytes in-place you need to read at least N bytes from the position where you want to insert it. Then you write the new bytes, overwriting N bytes. But you just read them so you can now write them back. However, that will again overwrite N bytes so you need to read those first. As you can see it involves a lot of seeking - storing all the remaining data in memory would be much easier and faster for a file that is not too large. For a large file one usually uses a temporary files to avoid this hassle.
i have two files of size 20GB. i have to remove common passwords from either of one file.
i sorted the second file by calling sort command of UNIX. after this i splitted the sorted file into many files so that file could fit in RAM memory using split command. After splitting into n files i just used an structure array of size n to store first password of each splitted file and its corresponding file name.
then i applied binary search technique in that structure array for each key of first file to to first password stored in structure to get index of the corresponding file. and then i applied b search to that indexed splitted file.
i assumed 20 character as a max length of passwords
this program is not yet efficient.
Please help to make it efficient, if possible....
Please give me some advise to sort that 20GB file efficiently .....
in 64 bit stream with 8gb RAM and i3 quard processor.....
i just tested my program with two file of size 10MB. it took about 2.66 hours without using any optimization option. ....according to my program it will take about 7-8 hours to check each passwords of 20GB after splitting , sorting and binary searching.....
can i improve its time complexity? i mean can i make it to run more "faster"???
Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).
The idea behind external sorting is selecting and sorting equidistant samples from the file. Then partitioning the file at sampling points, sorting the partitions and merging the results.
example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0,1,2,5,4} , {60, 100, 400}
Also, I don't think splitting the file is a good idea because you need to write 20GB to disk to split them. You might as well create the structure on the fly by seeking within the file.
For a previous SE question, "What algorithm to use to delete duplicates?" I described an algorithm for a probably-similar problem except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.
Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.
Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, likely to be about 4 times as bad as O(k n) if k=5. (Algorithm constants still can change the situation either way, but with a fast hash function the method has good constants.)
Process:
Let Q = B (ie rename or copy B to Q)
Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T. Go to step 2.
Fill work buffer W by reading from file A.
For each line L in W, hash L into H, such that H[hash[L]] indexes line L.
Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
Close and delete Q, rename T to Q, open new temp file T.
If EOF(A), rename Q to B and quit, else go to step 2.
Note that after each pass (ie at start of step 6) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound you can read and write several times faster with big buffers (eg 8MB) than line-by-line.
The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
Here is a simple performance example: Suppose A, B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size leaving enough room for hash table H to have say .6G 4-byte entries. Each pass (steps 2-5) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A+5*B), 100GB written (5*B).
Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).
Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan to filter out words. Something like the following pseudocode:
sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
if (wordA < wordB)
write wordA to output
read wordA from A
else if (wordA > wordB)
read wordB from B
else
/* match found, don't output wordA */
read wordA from A
}
while (A not EOF) /* output remaining words */
{
write wordA to output
read wordA from A
}
Like so:
Concatenate the two files.
Use sort to sort the total result.
Use uniq to remove duplicates from the sorted total.
If c++ it's an option for you, the ready to use STXXL should be able to handle your dataset.
Anyway, if you use external sort in c, as suggested by another answer, I think you should sort both files and then scan both sequentially. The scan should be fast, and the sort can be done in parallel.
I'm working on a project in which I need to read text (source) file in memory and be able to perform random access into (say for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffer that will hold the character data of the file. I would also make an array, whose index are line numbers and content are addresses corresponding to the begin of the line. This array would be reallocated on need.
Subsidiary question: does anyone have an idea the average size of a source file ? I was surprised not to find this on google.
To clarify:
The file I'm concerned about are source files, so their size should be manageable, they should not be modified and the lines have variables length (tough hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conlusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only, just get the file size, read it in memory with fread(), then you have to maintain a dynamic array which maps the line numbers (index) to pointer to the first character in the line. Someone below suggested to build this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
solution: If lines are about same size, make all lines equally long by appending needed number of metacharacters to each line. Then you can simply calculate the fseek() position from line number, making your search O(1).
If lines are sorted, then you can perform binary search, making your search O(log(nõLines)).
If neither, you can store the indexes of line begginings. But then, you have a problem if you modify file a lot, because if you insert let's say X characters somewhere, you have to calculate which line it is, and then add this X to the all next lines. Similar with with deletion. Yu essentially get O(nõLines). And code gets ugly.
If you want to store whole file in memory, just create aray of lines *char[]. You then get line by first dereference and character by second dereference.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for malloc and delete, and make it so the structs can either delete themselves automatically or manually. Using the above combinations won't get you O(1) read in (which isn't possible without the files have a static format), but it will get you good time.
If you know the file static length (at least individual line wise), IE no bigger than 256 chars per line, then all you need is the DynamicListOfArries - write directly to the array (preset to 256), create a new one, repeat. Downside is it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes, like any text file, it depends on the number of caracters it contains