Read specific line from CSV efficiently in C

I have a large CSV file in the following format:
ID,Hash
abc,123
def,456
ghij,7890
I want to efficiently read the line corresponding to a given ID and make changes to the corresponding hash. I am allowed to store some information in an initial pass, but the changes need to be dynamic. What can I do?
I don't want to iterate over all lines while making changes. No assumptions can be made about the size of any entry in general, and it may also change. The file has no order.
This seems difficult, but please provide me some code by which I can access some part of the file in constant time. I think I can figure out a heuristic. It would be best if the address can be iterated in both directions from a given point.

Michael Walz asks "Does the hash have a fixed size? The answer depends heavily on this. If the size is fixed, then it's easy to update the hash directly in the file, otherwise the file must be read and rewritten".
More generally, if the records in the file have a fixed length, then you can seek to the record and replace it. If not, you have to spool the file. Assuming fixed length, you can speed up a possible search process if the file has e.g. an order (is sorted), as then you can use binary search to quickly (O(log N)) find the record.
See Klas Lindbach's solution of a basic binary search at Fastest array lookup algorithm in C for embedded system?. The same idea holds for a file (an array of records, but on disk).
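A minimal sketch of that fixed-length approach, assuming the records are sorted by ID and padded to a constant width (the field widths and padding convention below are made up for illustration):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical fixed-width record: "ID,HASH\n", each field padded. */
    #define ID_LEN   8
    #define HASH_LEN 8
    #define REC_LEN  (ID_LEN + 1 + HASH_LEN + 1)

    /* Binary search a file of fixed-length records sorted by ID.
       `id` must be padded to ID_LEN the same way as in the file.
       Returns the byte offset of the matching record, or -1 if absent. */
    long find_record(FILE *f, const char *id)
    {
        fseek(f, 0, SEEK_END);
        long nrec = ftell(f) / REC_LEN;
        long lo = 0, hi = nrec - 1;

        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            char rec[REC_LEN];

            fseek(f, mid * REC_LEN, SEEK_SET);
            fread(rec, 1, REC_LEN, f);

            int cmp = memcmp(id, rec, ID_LEN);
            if (cmp == 0)
                return mid * REC_LEN;
            if (cmp < 0)
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return -1;
    }

With the offset in hand you can fseek() back and overwrite just the hash field in place, since the record length never changes.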

Related

Reading specific line in txt file with C

I am working with Mac OSX, programming in C and using bash in terminal.
I am currently trying to make a lookup table for the gamma function. I have been told that calling gsl_sf_gamma is pretty expensive and that a lookup table would be far faster. I did not wish to lose too much accuracy, so I wanted to have a fairly large lookup table. Initializing a huge array would not be ideal, since it then defeats the purpose.
My thought was to make a large text file with the values pre-evaluated for the gamma function in the range of interest. A major problem with this is that I don't know how to read a specific line within a text file using C.
Thanks for any insight and help you guys can offer.
Warning: I know very little about strings and txt files, so I might just not know a simple function that does this already.
Gamma is basically the factorial, except in continuous form. You want to perform a lookup rather than a computation for the gamma function, and you want to use a text file to hold these results, where each line corresponds to an input value multiplied by 1000. I guess for a high enough input value, the file scan could outperform doing the computation.
However, I think you will at minimum want to compute an index into your file. The file can still be arranged as a text file, but you have another step that scans the file, and notes the byte offset for each result line. These offsets get recorded into a binary file, which will serve as your index.
When you run your program, at the beginning you load the index file into an array, where the index into the array is the floor of the gamma input multiplied by 1000, and the value at that index is the offset recorded in the index file. When you want to compute gamma for a particular number, you multiply the input by 1000 and truncate the result to obtain your array index. You consult this array for the offset, and the next array value to compute the length of the entry. Then your gamma text file is opened as a binary file; you seek to the offset and read that many bytes to get your digits. You will need to read the next entry too to perform your interpolation.
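A rough sketch of that two-pass scheme. The 1000x scaling and the interpolation come from the description above; the file handling, buffer sizes and omission of error handling are illustrative, and instead of storing the length of each entry this version simply reads up to the newline with fgets():

    #include <stdio.h>
    #include <stdlib.h>

    /* Pass 1: scan the text table once and record the byte offset of
       each line in a binary index file. */
    void build_index(const char *txt_path, const char *idx_path)
    {
        FILE *txt = fopen(txt_path, "rb");
        FILE *idx = fopen(idx_path, "wb");
        char line[128];
        long off = 0;

        while (fgets(line, sizeof line, txt)) {
            fwrite(&off, sizeof off, 1, idx);
            off = ftell(txt);
        }
        fclose(txt);
        fclose(idx);
    }

    /* Lookup: line i of the table holds gamma(i / 1000.0). `offsets` is
       the index file loaded into an array at startup, `nlines` its size. */
    double lookup_gamma(FILE *txt, const long *offsets, long nlines, double x)
    {
        long i = (long)(x * 1000.0);           /* truncate to a line number */
        if (i < 0 || i + 1 >= nlines)
            return -1.0;                       /* outside the table */

        char line[128];
        fseek(txt, offsets[i], SEEK_SET);
        fgets(line, sizeof line, txt);
        double lo = atof(line);

        fseek(txt, offsets[i + 1], SEEK_SET);  /* next entry, for interpolation */
        fgets(line, sizeof line, txt);
        double hi = atof(line);

        double frac = x * 1000.0 - (double)i;
        return lo + frac * (hi - lo);          /* linear interpolation */
    }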
Yes, calculating gamma is slow (I think GSL uses the Lanczos formula, which sums a series). If the number of values for which you need to calculate it is limited (say, you're only doing integers), then certainly a lookup table might help. But if the table is too big for memory, it won't help--it will be even slower than the calculation.
If the table will fit into memory, there's nothing wrong with storing it in a file until you need it and then loading the whole thing into memory at once.

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that will hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses corresponding to the beginnings of the lines. This array would be reallocated on need.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable; they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, just get the file size, read it into memory with fread(), and then maintain a dynamic array which maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
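A bare-bones version of those two steps, reading with fread() rather than mmap and growing the pointer array as needed (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the whole file into one buffer and build an index of line starts.
       Returns the buffer; the line-pointer array and its length come back
       through the out-parameters. The caller frees both. */
    char *load_file(const char *path, char ***lines_out, size_t *nlines_out)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buf = malloc(size + 1);
        fread(buf, 1, size, f);
        buf[size] = '\0';
        fclose(f);

        size_t cap = 128, n = 0;
        char **lines = malloc(cap * sizeof *lines);
        lines[n++] = buf;                        /* first line starts at buf */
        for (long i = 0; i < size; i++) {
            if (buf[i] == '\n' && i + 1 < size) {
                if (n == cap)
                    lines = realloc(lines, (cap *= 2) * sizeof *lines);
                lines[n++] = &buf[i + 1];        /* one past each '\n' */
            }
        }
        *lines_out = lines;
        *nlines_out = n;
        return buf;
    }

Line k (zero-based) is then lines[k], and column c within it is lines[k] + c, which gives the constant-time access asked about.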
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1); see the sketch after this answer.
If lines are sorted, then you can perform binary search, making your search O(log(nLines)).
If neither, you can store the indexes of line beginnings. But then you have a problem if you modify the file a lot, because if you insert, let's say, X characters somewhere, you have to calculate which line it is and then add X to the indexes of all the following lines. Similarly with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *[]. You then get the line with the first dereference and a character within it with the second dereference.
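For the first (padded, fixed-length lines) suggestion, the seek position is plain arithmetic; a tiny sketch with a made-up line width:

    #include <stdio.h>

    #define LINE_LEN 64   /* hypothetical fixed width, including the '\n' */

    /* Fetch line `lineno` (zero-based) in O(1) by computing its offset. */
    int read_line(FILE *f, long lineno, char buf[LINE_LEN + 1])
    {
        if (fseek(f, lineno * (long)LINE_LEN, SEEK_SET) != 0)
            return -1;
        return fgets(buf, LINE_LEN + 1, f) ? 0 : -1;
    }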
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for allocation and deletion, and make it so the structs can be deleted either automatically or manually, then using the above combination won't get you O(1) read-in (which isn't possible unless the file has a static format), but it will get you good time.
If you know the file's static length (at least line-wise), i.e. no bigger than 256 chars per line, then all you need is the DynamicListOfArrays - write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.

Creating a binary search of an alphabetically ordered .txt file in C

I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Since I'm not familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Unless your file is exceedingly big (> 2 GB), loading the file into memory prior to searching it is the way to go. In case you cannot load the file into memory, you could hold the offset of each line in an int[] or (if the file contains too many lines...) create another binary file and write the offset of each line to it as an integer...
Having everything in memory is by far preferable, though.
You cannot binary search lines of a text-file without knowing the length of each line in advance, so you'll most likely want to read each line into memory at first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do a linear search directly on the file. There's no point in getting O(log n) lookups at the cost of an O(n) setup if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
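A minimal mmap() sketch of that bulk-read-and-walk approach on Unix (here just counting lines; error handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* Map the whole file; the kernel pages it in on demand. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        /* Walk the mapping with plain pointer arithmetic. */
        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (data[i] == '\n')
                lines++;
        printf("%ld lines\n", lines);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }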
This is a great question!
The challenge of binary search is that its benefits come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not on a linked list - in the linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory, all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory and could be prohibitively expensive. Consequently, another alternative might be not to store the actual strings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large strings can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow for direct computation of the start position of each line.
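A sketch of that offsets variant: the index is an in-memory array of byte offsets into the sorted file, and each probe seeks to one offset and compares a single line (the function names and the fixed line-buffer size are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define MAX_LINE 64

    /* Compare `target` against the line starting at byte `off` in the file. */
    static int cmp_at(FILE *f, long off, const char *target)
    {
        char line[MAX_LINE];
        fseek(f, off, SEEK_SET);
        fgets(line, sizeof line, f);
        line[strcspn(line, "\r\n")] = '\0';     /* strip the line ending */
        return strcmp(target, line);
    }

    /* Binary search over offsets[0..n-1], the start offsets of the sorted
       lines. Returns the index of the matching line, or -1 if absent. */
    long search(FILE *f, const long *offsets, long n, const char *target)
    {
        long lo = 0, hi = n - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            int c = cmp_at(f, offsets[mid], target);
            if (c == 0)
                return mid;
            if (c < 0)
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return -1;
    }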
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times for each string by minimizing the number of disk reads that you need to do.
Hope this helps!
I don't see how you can do the compare directly from the file. You will have to have a buffer to store data read from disk and use that buffer. So it doesn't make sense; it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map this file directly into memory and working with it as with a character array. The operating system will make work with the file (like seeking, reading, writing) transparent to you, and you will just work with it like with a buffer in memory. Note that mmap is limited to 4 GB on 32-bit systems. But if that file is bigger, you probably need to ask the question - why on earth is such a big file not in an indexed database?

how to sort a lot of data in c?

At the moment I am trying to write an unreal amount of data out to files.
Basically I generate a new struct of data and write it out to a file until the file becomes 1 GB big, and this occurs for 6 files of 1 GB each. The structs are small: 8 bytes long, with two variables, id and amount.
When I generate my data, the structs are created and written to file in the order of amount.
But I need the data to be sorted by id.
Remember there is 6 GB of data; how could I sort these structs by their id value and then write them to file?
Or should I write to file first and then sort each individual file, and how would I bring all this data together into one file?
I am kind of stuck, because I would like to hold it in an array, but obviously this amount of data is too big.
I need a good way to sort a lot of data (6 GB).
I haven't found a question with a really basic answer on this, so here goes.
If you're on a 64-bit machine, by the way, you should seriously consider writing all the data into a file, memory mapping the file, and just using whatever array sort you like. Quicksort is pretty cache-friendly: it won't thrash badly. The assignment is probably designed to stop you doing this, but might be a bit out of date ;-)
Failing that, you need some kind of external sort. There are other ways to do it, but I think merge sort is probably the simplest. Before you start merging:
work out how much data you can fit into memory (or, again, mmap it). If you're on a PC then 1GB seems like a fair assumption, but it may be a few times more or less.
load this much data (so one of your 6 files, in the example)
quicksort it (since you tagged "quicksort", I guess you know how to do that), or any other sort of your choice.
write it back to disk (if you didn't mmap).
This leaves you with 6 1GB files, each of which individually is sorted. At this point you can either work up gradually, or go for the whole lot in one go. With 6 chunks, going for the whole lot is fine, in what is called a "6-way merge":
open a file for writing
open your 6 files for reading, and read a few million records out of each
examine the 6 records at the start of each of the 6 buffers. One of these 6 must be the smallest of all. Write this to the output, and move forward one step through that buffer.
as you reach the end of each buffer, refill it from the correct file.
There's some optimization you can do regarding how you work out which of your 6 possibilities is the smallest, but the big performance difference will be to make sure you use large enough read and write buffers.
Obviously there's nothing special about the merge being 6-way. If you'd rather stick to a 2-way merge, which is easier to code, then of course you can. It will take 5 2-way merges to merge 6 files.
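A stripped-down 2-way merge of two already-sorted files of those 8-byte structs (the id/amount fields are taken from the question; this version reads one record at a time for clarity, whereas a real one would use the large read/write buffers mentioned above):

    #include <stdint.h>
    #include <stdio.h>

    struct rec { uint32_t id; uint32_t amount; };   /* 8 bytes, as described */

    /* Merge two files sorted by id into one sorted output file. */
    void merge2(const char *a_path, const char *b_path, const char *out_path)
    {
        FILE *a = fopen(a_path, "rb"), *b = fopen(b_path, "rb");
        FILE *out = fopen(out_path, "wb");
        struct rec ra, rb;
        int have_a = fread(&ra, sizeof ra, 1, a) == 1;
        int have_b = fread(&rb, sizeof rb, 1, b) == 1;

        while (have_a || have_b) {
            if (have_a && (!have_b || ra.id <= rb.id)) {
                fwrite(&ra, sizeof ra, 1, out);
                have_a = fread(&ra, sizeof ra, 1, a) == 1;
            } else {
                fwrite(&rb, sizeof rb, 1, out);
                have_b = fread(&rb, sizeof rb, 1, b) == 1;
            }
        }
        fclose(a); fclose(b); fclose(out);
    }

Applied pairwise, this gives the 5 two-way merges mentioned above.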
I would recommend this tool; it is a lightweight database that can run in memory and takes up very little space. It will hold your information, and you can query it to retrieve your information.
http://www.sqlite.org/features.html
I suggest you don't.
If you are going to hold such an amount of data, why not use a dedicated database format that can have lots of different indexes and a powerful query engine?
But if you still want to use your old-fashioned fixed-endian struct, then I would suggest breaking your data into smaller files, sorting each one, and merging them. A good merge algorithm runs in O(n log(q)), where q is the number of files. Also be sure to pick the right algorithm for your files.
The easiest way (in development time) to do this is to write out the data to separate files according to their ID. You don't have to have a 1-to-1 match between the number of files and the number of IDs (in case there are a lot of IDs), but if you choose a prefix of the ID (so if the key for one particular record is 987 it might go in the 9 file, while the record with key 456 would go in the 4 file), you won't have to worry about locating all of the keys across all of the files, because sorting each file by itself and then looking at the files in their order (by their names) would give you sorted results.
If that is not possible or easy, then you need to do an external sort of some type. Since the data is still spread across several files, this is a bit of a pain. The easiest thing (by development time) is to first sort each individual file independently and then merge them together into a new set of files sorted by ID. Look up merge sort if you don't know what I'm talking about. At this step you are pretty much starting in the middle of merge sort.
As far as sorting the contents of a file which is too large to fit into RAM, you can either use merge sort directly on the file or use replacement selection sort to sort the file in place. This involves making several passes over the file while using some RAM (the more the better) to hold a priority queue (a binary heap) and a set of records that cannot be of any use in this run (their keys suggest that they should be earlier in the file than the current run position, so you just hold on to them until the next run).
Searching for replacement selection sort or tournament sort will yield better explanations.
First, sort each file individually. Either load the whole thing into memory, or (better) mmap it, and use the qsort function.
Then, write your own merge sort that takes N FILE * inputs (i.e. N=6 in your case) and outputs to N new files, switching to the next one whenever one fills up.
Check out external sort. Find any of the external mergesort libraries out there and modify them to suit your need.
Well - since the actual assignment is to keep encoded data and later just compare it with decoded data, I would also say - use a database and just create a hash index on the ID column.
But regarding sorting such a huge amount of data, another very important thing is to do it in parallel. There are many ways to do it. Steve Jessop mentioned a sort-merge approach; it is really easy to sort the first 6 chunks in parallel, the only question being how many CPU cores and how much memory you have on your machine. (It is rare to find a computer with only 1 core today, and also not so rare to have 4 GB of memory.)
Maybe you could use mmap and use it as a huge array which you could sort with qsort. I'm not sure what the implications would be. Would it grow too much in memory?
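A sketch of that mmap-and-qsort idea on a 64-bit Unix system, assuming the same 8-byte id/amount struct; the mapping is MAP_SHARED and writable so the sorted order ends up back in the file. The OS pages the mapping in and out, so it does not all have to fit in RAM at once, though heavy random access can make it slow:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct rec { uint32_t id; uint32_t amount; };

    static int by_id(const void *a, const void *b)
    {
        uint32_t x = ((const struct rec *)a)->id;
        uint32_t y = ((const struct rec *)b)->id;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDWR);
        struct stat st;
        fstat(fd, &st);

        struct rec *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        qsort(recs, st.st_size / sizeof *recs, sizeof *recs, by_id);

        munmap(recs, st.st_size);   /* dirty pages are flushed to the file */
        close(fd);
        return 0;
    }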

Determining string uniqueness in a large file

In C, I want to process a file that contains 10^8 16-digit alphanumeric strings and determine if each one is unique in the file. How can I do that?
As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.
If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.
Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.
I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.
Do a bucket sort (hash function) into multiple files, one file for each bucket. Then process each bucket's file to determine if all strings are unique within the bucket.
You'll need to sort the file.
Just load it into a single memory block, run qsort from the C runtime library on the memory block, and then finally run sequentially over all strings to check for two consecutive strings that are the same.
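A sketch of that single-block approach, assuming one 16-character string per line and a trailing newline on every line, so each record is exactly 17 bytes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define STR_LEN 16
    #define REC_LEN (STR_LEN + 1)                /* 16 chars + '\n' */

    static int cmp(const void *a, const void *b)
    {
        return memcmp(a, b, STR_LEN);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        FILE *f = fopen(argv[1], "rb");
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buf = malloc(size);
        fread(buf, 1, size, f);
        fclose(f);

        long n = size / REC_LEN;
        qsort(buf, n, REC_LEN, cmp);             /* sort records in place */

        /* Any duplicate is now adjacent to its twin. */
        for (long i = 1; i < n; i++)
            if (memcmp(buf + (i - 1) * REC_LEN, buf + i * REC_LEN, STR_LEN) == 0)
                printf("duplicate: %.16s\n", buf + i * REC_LEN);

        free(buf);
        return 0;
    }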
Take a library with set/map functions and check each string for membership as you insert it.
