Reading a specific line in a txt file with C

I am working with Mac OSX, programming in C and using bash in terminal.
I am currently trying to make a lookup table for the gamma function. I have been told that calling gsl_sf_gamma is pretty expensive and that a lookup table would be far faster. I did not wish to lose too much accuracy, so I wanted a fairly large lookup table. Initializing a huge array would not be ideal, since that defeats the purpose.
My thought was to make a large text file with the values of the gamma function pre-evaluated over the range of interest. A major problem with this is that I don't know how to read a specific line within a text file using C.
Thanks for any insight and help you guys can offer.
Warning: I know very little about strings and txt files, so I might just not know about a simple function that already does this.

Gamma is basically factorial, except in continuous form. You want to perform a lookup rather than a computation for the gamma function, and you want to use a text file to hold these results, one result per line, with the line number corresponding to the input value multiplied by 1000. I guess for a high enough input value, the file lookup could outperform doing the computation.
However, I think you will at minimum want to compute an index into your file. The file can still be arranged as a text file, but you have another step that scans the file, and notes the byte offset for each result line. These offsets get recorded into a binary file, which will serve as your index.
When your program starts, you load the index file into an array, where the array index is the floor of the gamma input multiplied by 1000, and the array value at that index is the offset recorded in the index file. When you want to compute gamma for a particular number, you multiply the input by 1000 and truncate the result to obtain your array index. You consult this array for the offset, and the next array value to compute the length of the entry. Then your gamma text file is opened as a binary file: you seek to the offset and read that many bytes to get your digits. You will need to read the next entry too to perform your interpolation.
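As a rough sketch of that lookup step in C (assuming the index file has already been written as raw long byte offsets, with one extra trailing offset so every entry's length can be computed; the file names and function names below are just placeholders):

#include <stdio.h>
#include <stdlib.h>

/* Placeholder names: gamma.txt holds one pre-evaluated value per line,
   gamma.idx holds raw long byte offsets, one per line, plus one extra
   trailing offset (the text file's size). */
#define GAMMA_TXT "gamma.txt"
#define GAMMA_IDX "gamma.idx"

static long  *offsets;      /* byte offset of each line in gamma.txt */
static size_t n_offsets;
static FILE  *gamma_fp;

/* Load the binary index into memory once, at program start. */
int gamma_table_open(void)
{
    FILE *idx = fopen(GAMMA_IDX, "rb");
    if (!idx) return -1;
    fseek(idx, 0, SEEK_END);
    n_offsets = (size_t)ftell(idx) / sizeof(long);
    rewind(idx);
    offsets = malloc(n_offsets * sizeof(long));
    if (!offsets || fread(offsets, sizeof(long), n_offsets, idx) != n_offsets) {
        fclose(idx);
        return -1;
    }
    fclose(idx);
    gamma_fp = fopen(GAMMA_TXT, "rb");
    return gamma_fp ? 0 : -1;
}

/* Fetch the pre-evaluated value on line i (i = floor(x * 1000)).
   No range checking here; a real version would verify i + 2 <= n_offsets. */
static double gamma_table_entry(size_t i)
{
    char line[64];
    long len = offsets[i + 1] - offsets[i];   /* next offset gives this line's length */
    if (len >= (long)sizeof(line)) len = sizeof(line) - 1;
    fseek(gamma_fp, offsets[i], SEEK_SET);
    len = (long)fread(line, 1, (size_t)len, gamma_fp);
    line[len] = '\0';
    return strtod(line, NULL);
}

/* Look up gamma(x) with linear interpolation between neighbouring lines. */
double gamma_lookup(double x)
{
    double scaled = x * 1000.0;
    size_t i = (size_t)scaled;               /* truncate to get the array index */
    double frac = scaled - (double)i;
    double lo = gamma_table_entry(i);
    double hi = gamma_table_entry(i + 1);
    return lo + frac * (hi - lo);
}

gamma_table_open() would be called once at startup; each gamma_lookup(x) then costs two short seeks and reads plus an interpolation.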

Yes, calculating gamma is slow (I think GSL uses the Lanczos formula, which sums a series). If the number of values for which you need to calculate it is limited (say, you're only doing integers), then certainly a lookup table might help. But if the table is too big for memory, it won't help; it will be even slower than the calculation.
If the table will fit into memory, there's nothing wrong with storing it in a file until you need it and then loading the whole thing into memory at once.
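For instance, if the table is dumped once as raw doubles, loading it back is a single fread; a minimal sketch, where the file layout and name are assumptions:

#include <stdio.h>
#include <stdlib.h>

/* Assumed layout: raw doubles for gamma(x) at x = 1.000, 1.001, 1.002, ... */
double *load_gamma_table(const char *path, size_t *count)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    rewind(f);
    *count = (size_t)bytes / sizeof(double);
    double *table = malloc(*count * sizeof(double));
    if (table && fread(table, sizeof(double), *count, f) != *count) {
        free(table);
        table = NULL;
    }
    fclose(f);
    return table;
}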

Related

Read specific line from CSV efficiently, C

I have a large csv file in the following format:-
ID,Hash
abc,123
def,456
ghij,7890
I want to efficiently read the line corresponding to a given ID and make changes to the corresponding hash. I am allowed to store some information in an initial pass, but the changes need to be dynamic. What can I do?
I don't want to iterate over all lines while making changes. No assumptions can be made about the size of any entry in general; it may also change. The file has no particular order.
This seems difficult, but please give me some code by which I can access some part of the file in constant time. I think I can figure out a heuristic. It would be best if the position could be iterated in both directions from a given point.
Michael Walz asks "Does the hash have a fixed size? The answer depends heavily on this. If the size is fixed, then it's easy to update the hash directly in the file, otherwise the file must be read and rewritten".
More generally, if the records in the file have a fixed length, then you can seek to the record and replace it. If not, you have to rewrite (spool) the file. Assuming fixed length, you can speed up a possible search if the file has some order (is sorted), as then you can use binary search to find the record quickly, in O(log N).
See Klas Lindbach's solution of a basic binary search at Fastest array lookup algorithm in C for embedded system?. The same idea holds for a file (an array of records, but on disk).
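As a sketch of the fixed-length case, an in-place update is just a seek plus a write. The record layout below is invented for illustration; the file would have to be written with these exact widths and opened with fopen(path, "r+b"):

#include <stdio.h>
#include <string.h>

/* Assumed fixed layout: 8-byte ID, comma, 15-byte hash, newline = 25 bytes. */
#define ID_LEN   8
#define HASH_LEN 15
#define REC_LEN  (ID_LEN + 1 + HASH_LEN + 1)

/* Overwrite the hash field of record number rec (0-based) in place. */
int update_hash(FILE *f, long rec, const char *new_hash)
{
    char field[HASH_LEN];
    size_t n = strlen(new_hash);
    if (n > HASH_LEN) n = HASH_LEN;
    memset(field, ' ', sizeof(field));                 /* pad with spaces */
    memcpy(field, new_hash, n);
    if (fseek(f, rec * REC_LEN + ID_LEN + 1, SEEK_SET) != 0)
        return -1;
    return fwrite(field, 1, HASH_LEN, f) == HASH_LEN ? 0 : -1;
}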

Fseek to a line number (with lines of variable length)

I have a large file (~10GB) with variable length lines, and I would like to programmatically go to different line numbers. Is there an efficient way to do so?
Yes: build an index. For example, just once you can create a text file on the side which contains the byte offset of various line numbers, like this:
line,offset
0,0
10000,48272
20000,93726
Etc. Then when you want to go to line 13043, just jump to offset 48272 and skip another 3043 newlines. Simple and efficient.
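A rough sketch of that jump in C, assuming the line,offset index has already been parsed into two parallel arrays (the names here are made up):

#include <stdio.h>

/* Seek to the indexed line at or below `target`, then skip the remaining newlines.
   A linear scan of the index is shown; a binary search would also work. */
int goto_line(FILE *f, long target,
              const long *idx_line, const long *idx_offset, size_t idx_count)
{
    /* Find the last index entry whose line number is <= target. */
    size_t i = 0;
    while (i + 1 < idx_count && idx_line[i + 1] <= target)
        i++;
    if (fseek(f, idx_offset[i], SEEK_SET) != 0)
        return -1;
    long line = idx_line[i];
    int c;
    while (line < target && (c = fgetc(f)) != EOF)
        if (c == '\n')
            line++;
    return line == target ? 0 : -1;
}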
Another approach would be to make your line lengths constant. This would work well if they already have similar lengths so you don't waste too much space. You can pad them out with \0 characters or spaces or whatever, then index the file like a big matrix (line N is at N*LEN bytes).
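With fixed-length lines the jump is pure arithmetic; a minimal sketch, assuming a padded length of 64 bytes including the trailing newline:

#include <stdio.h>

/* Hypothetical fixed line length, including the trailing '\n'. */
#define LINE_LEN 64

/* Read padded line n (0-based) from f into buf (at least LINE_LEN + 1 bytes). */
int read_fixed_line(FILE *f, long n, char *buf)
{
    if (fseek(f, n * LINE_LEN, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, LINE_LEN, f) != LINE_LEN)
        return -1;
    buf[LINE_LEN] = '\0';
    return 0;
}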
Finally, you could simply write the line numbers at the beginning of the lines themselves. Then just binary-search within the file, skip to a newline, and inspect the next line number to know whether to look backward or forward (and even guess by how much).
There is no efficient way to do so in general. You need to scan the entire file once to record where the end-of-line markers are.
Pragmatically, you need a large loop around e.g. getline(3).
You could record e.g. the offset of every 100th line, perhaps in a big array, or in some indexed file using GDBM or an SQLite database.
My feeling is that you should not have such a huge text file in the first place (needing random access to a huge text file is the symptom of something wrong). A text file is not an efficient way to store data that you need to access randomly. You could, for example, predigest it to fill some database instead.
Not directly with fseek, since it can only move the position by a byte count.
If the efficiency requirement comes from having to jump back and forth many times, a simple solution is to scan the whole file once, compute each line's length (or starting offset), store the values in an array or map, and then use them to seek exactly where you want.
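A sketch of that single scan using getline(3) (POSIX, so ssize_t and getline are assumed to be available, as they are on Mac OS X and Linux); afterwards fseek(f, off[n], SEEK_SET) jumps straight to line n:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* One linear pass: record the byte offset of the start of every line.
   Returns a malloc'd array and stores the line count in *n.
   (On realloc failure the old block leaks; fine for a sketch.) */
long *build_line_offsets(FILE *f, size_t *n)
{
    size_t cap = 1024, count = 0;
    long *off = malloc(cap * sizeof(long));
    char *buf = NULL;
    size_t buflen = 0;
    long pos = 0;

    while (off) {
        off[count++] = pos;
        ssize_t got = getline(&buf, &buflen, f);
        if (got < 0) { count--; break; }   /* EOF: drop the bogus last offset */
        pos += got;
        if (count == cap)
            off = realloc(off, (cap *= 2) * sizeof(long));
    }
    free(buf);
    *n = count;
    return off;
}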

How to search common passwords from two given files of size 20GB?

I have two files of 20GB each. I have to remove the passwords common to both from one of the files.
I sorted the second file with the UNIX sort command. After this I split the sorted file into many pieces small enough to fit in RAM, using the split command. After splitting into n files I used a structure array of size n to store the first password of each split file and its corresponding file name.
Then, for each key of the first file, I applied binary search on that structure array (against the first password stored in each structure) to get the index of the corresponding split file, and then applied binary search within that split file.
I assumed 20 characters as the maximum length of a password.
This program is not yet efficient.
Please help me make it efficient, if possible.
Please give me some advice on how to sort that 20GB file efficiently
on a 64-bit system with 8GB RAM and an i3 quad-core processor.
I just tested my program with two files of 10MB each. It took about 2.66 hours without using any optimization option. According to my program it will take about 7-8 hours to check each password of the 20GB files after splitting, sorting and binary searching.
Can I improve its time complexity? I mean, can I make it run faster?
Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).
The idea behind external sorting is to select and sort equidistant samples from the file, then partition the file at the sampling points, sort the partitions and merge the results.
example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0,1,2,5,4} , {60, 100, 400}
Also, I don't think splitting the file is a good idea because you need to write 20GB to disk to split them. You might as well create the structure on the fly by seeking within the file.
For a previous SE question, "What algorithm to use to delete duplicates?" I described an algorithm for a probably-similar problem except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.
Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.
Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, likely to be about 4 times as bad as O(k n) if k=5. (Algorithm constants still can change the situation either way, but with a fast hash function the method has good constants.)
Process:
1. Let Q = B (i.e. rename or copy B to Q). Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T. Go to step 2.
2. Fill work buffer W by reading from file A.
3. For each line L in W, hash L into H, such that H[hash(L)] indexes line L.
4. Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
5. Close and delete Q, rename T to Q, open a new temp file T.
6. If EOF(A), rename Q to B and quit; else go to step 2.
Note that after each pass (ie at start of step 6) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound you can read and write several times faster with big buffers (eg 8MB) than line-by-line.
The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
Here is a simple performance example: Suppose A, B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size leaving enough room for hash table H to have say .6G 4-byte entries. Each pass (steps 2-5) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A+5*B), 100GB written (5*B).
Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).
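To make steps 2-5 concrete, here is a much-simplified sketch of one pass in C: it loads a fixed number of lines of A into a chained hash table, then streams Q, copying only the lines not found in the table to T. The bucket count, chunk size and 256-byte line limit are arbitrary placeholders, and the outer driver loop (renaming T to Q between passes) is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes: in reality you would size these from available RAM. */
#define N_BUCKETS   (1u << 20)
#define CHUNK_LINES 1000000u       /* lines of A held in memory per pass */

struct node { char *line; struct node *next; };

static unsigned long hash_str(const char *s)           /* djb2 string hash */
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void set_add(struct node **tab, const char *line)
{
    unsigned long b = hash_str(line) % N_BUCKETS;
    struct node *n = malloc(sizeof *n);
    n->line = strdup(line);
    n->next = tab[b];
    tab[b] = n;
}

static int set_has(struct node **tab, const char *line)
{
    for (struct node *n = tab[hash_str(line) % N_BUCKETS]; n; n = n->next)
        if (strcmp(n->line, line) == 0) return 1;
    return 0;
}

/* One pass (steps 2-5): read the next CHUNK_LINES lines of A into the hash
   table, then copy every line of Q that is not in the table to T.
   Returns the number of A lines consumed (0 means EOF on A). */
size_t dedup_pass(FILE *a, FILE *q, FILE *t)
{
    struct node **tab = calloc(N_BUCKETS, sizeof *tab);
    char buf[256];
    size_t loaded = 0;

    while (loaded < CHUNK_LINES && fgets(buf, sizeof buf, a)) {
        buf[strcspn(buf, "\n")] = '\0';
        set_add(tab, buf);
        loaded++;
    }
    while (fgets(buf, sizeof buf, q)) {
        char line[256];
        strcpy(line, buf);
        line[strcspn(line, "\n")] = '\0';
        if (!set_has(tab, line))
            fputs(buf, t);
    }
    /* Free the table between passes (omitted: free each node and its line). */
    free(tab);
    return loaded;
}

A real driver would repeat this, closing and deleting Q, renaming T to Q and opening a fresh T between passes, until dedup_pass returns 0.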
Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan to filter out words. Something like the following pseudocode:
sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
    {
        write wordA to output
        read wordA from A
    }
    else if (wordA > wordB)
        read wordB from B
    else
    {
        /* match found, don't output wordA */
        read wordA from A
    }
}
while (A not EOF)    /* output remaining words */
{
    write wordA to output
    read wordA from A
}
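A rough C rendering of that pseudocode, assuming both files are already sorted with one password per line (error handling trimmed):

#include <stdio.h>
#include <string.h>

#define MAXLINE 256

/* Write every line of sorted file A that does not appear in sorted file B to out. */
void filter_common(FILE *a, FILE *b, FILE *out)
{
    char wordA[MAXLINE], wordB[MAXLINE];
    int haveA = fgets(wordA, sizeof wordA, a) != NULL;
    int haveB = fgets(wordB, sizeof wordB, b) != NULL;

    while (haveA && haveB) {
        int cmp = strcmp(wordA, wordB);
        if (cmp < 0) {
            fputs(wordA, out);
            haveA = fgets(wordA, sizeof wordA, a) != NULL;
        } else if (cmp > 0) {
            haveB = fgets(wordB, sizeof wordB, b) != NULL;
        } else {                       /* match found: drop wordA */
            haveA = fgets(wordA, sizeof wordA, a) != NULL;
        }
    }
    while (haveA) {                    /* output remaining words of A */
        fputs(wordA, out);
        haveA = fgets(wordA, sizeof wordA, a) != NULL;
    }
}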
Like so:
Concatenate the two files.
Use sort to sort the total result.
Use uniq to remove duplicates from the sorted total.
If C++ is an option for you, the ready-to-use STXXL should be able to handle your dataset.
Anyway, if you use an external sort in C, as suggested by another answer, I think you should sort both files and then scan both sequentially. The scan should be fast, and the sort can be done in parallel.

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that will hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses of the beginnings of the lines. This array would be reallocated on demand.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, just get the file size, read it into memory with fread(), then maintain a dynamic array which maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
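A minimal sketch of those two passes (the struct and function names are invented; the individual lines are not NUL-terminated, they simply end at the next '\n'):

#include <stdio.h>
#include <stdlib.h>

struct textbuf {
    char   *data;     /* whole file, NUL-terminated               */
    char  **lines;    /* lines[i] points at the start of line i   */
    size_t  n_lines;
};

int textbuf_load(struct textbuf *tb, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    tb->data = malloc((size_t)size + 1);
    fread(tb->data, 1, (size_t)size, f);
    tb->data[size] = '\0';
    fclose(f);

    /* Second pass: every character following a '\n' starts a new line. */
    size_t cap = 128;
    tb->lines = malloc(cap * sizeof(char *));
    tb->n_lines = 0;
    tb->lines[tb->n_lines++] = tb->data;
    for (long i = 0; i < size; i++) {
        if (tb->data[i] == '\n' && i + 1 < size) {
            if (tb->n_lines == cap)
                tb->lines = realloc(tb->lines, (cap *= 2) * sizeof(char *));
            tb->lines[tb->n_lines++] = &tb->data[i + 1];
        }
    }
    return 0;
}

/* Address of line `line` (0-based), column `col` (0-based); no bounds checking. */
char *textbuf_at(struct textbuf *tb, size_t line, size_t col)
{
    return tb->lines[line] + col;
}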
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1).
If lines are sorted, then you can perform binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to work out which line that is and then add X to the offsets of all following lines. Similarly with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get the line with the first dereference and the character with the second dereference.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
Build suitable helper functions for malloc and delete, and make it so the structs can free themselves either automatically or manually. Using the above combination won't get you O(1) read-in (which isn't possible unless the file has a static format), but it will get you good times.
If you know the file's static length (at least line-wise), i.e. no bigger than 256 chars per line, then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.

Creating a binary search of an alphabetically ordered .txt file in C

I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Since I'm not familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague; I'm not too sure how to explain it better. Thanks for your time.
Unless your file is exceedingly big (> 2GB), loading the file into memory prior to searching it is the way to go. In case you cannot load the file into memory, you could hold the offset of each line in an int[] or (if the file contains too many lines...) create another binary file and write the offset of each line into it as integers...
Having everything in memory is by far preferable, though.
You cannot binary search lines of a text-file without knowing the length of each line in advance, so you'll most likely want to read each line into memory at first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do linear search directly on the file. There's no point in getting O(log n) at the cost of a O(n) setup cost if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
This is a great question!
The challenge of binary search is that its benefits come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not on a linked list - in the linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory, and could be prohibitively expensive. Consequently, another alternative might be not to store the actual strings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large strings can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow for direct computation of the start position of each line.
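For the in-memory variant, the search itself is a plain binary search over the array of strings; a minimal sketch, assuming the tickers have already been read into a sorted array:

#include <string.h>

/* Binary search a sorted array of strings; returns the index or -1 if absent. */
long find_ticker(char **tickers, long n, const char *key)
{
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, tickers[mid]);
        if (cmp == 0) return mid;
        if (cmp < 0)  hi = mid - 1;
        else          lo = mid + 1;
    }
    return -1;
}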
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times for each string by minimizing the number of disk reads that you need to do.
Hope this helps!
I don't see how you can do the compare directly from the file. You will have to have a buffer to store data read from disk and use that buffer. So it doesn't make sense; it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map this file directly into memory and working with it as a character array. The operating system will make working with the file (seeking, reading, writing) transparent to you, and you can just treat it like a buffer in memory. Note that mmap is limited to 4 GB on 32-bit systems. But if the file is bigger than that, you probably need to ask the question: why on earth is such a big file not in an indexed database?
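A minimal mmap sketch for Unix (error handling trimmed); afterwards the file contents can be treated as a read-only character array of *len bytes:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only; *len receives its size in bytes. */
const char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    const char *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}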
